InferOS

A self-hosted, production-grade LLM inference engine — built for concurrency, engineered to get the most out of every GPU. Runs anywhere from a laptop to a GPU fleet, fully offline, for everyone from solo builders to global enterprises.

Waitlist opens soon
Drop-in OpenAI API
Runs fully offline
CPU · Single GPU · Multi-GPU

The first wave of launch users gets one month of Pro, free, in exchange for feedback as we shape the product.

Talk to us

Peak throughput

1,558 tok/s

Llama 3.2 1B (4-bit) · 1× NVIDIA T4 · 32 concurrent

Single-user speed

139.8 tok/s

Llama 3.2 1B (4-bit) · 1× NVIDIA T4 · 1 request

Reliability

0 errors

across 1,408 served requests

Coverage

1,000+ models

10+ architecture families tested, plus your fine-tunes

01 / What it is

Your own AI, on your own terms.

InferOS is a self-hosted inference engine that serves popular open models faster, on less hardware, than teams expect — so you get more concurrent users per GPU and predictable latency on the cards you already run, instead of renting scarce top-tier accelerators.

It speaks the OpenAI API, so your existing apps and SDKs point at it unchanged — chat, coding assistants, agents, RAG, batch jobs, anything — and it ships the multi-tenant isolation, encryption, and audit trail that regulated teams need but general-purpose tools leave to you. Nothing leaves your infrastructure.

Drop-in · OpenAI-compatible

curl http://your-host:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-3b",
       "messages": [{"role": "user",
                     "content": "Hello!"}]}'

Change one line — the base URL. The rest is the API you already use.
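
The same call can be sketched in Python with only the standard library. The host, port, and model name are copied from the curl example above; the helper name `build_chat_request` is ours, for illustration. Nothing is sent until the request is passed to `urllib.request.urlopen`.

```python
import json
import urllib.request

# Assumption: the same placeholder host and port as the curl example above.
BASE_URL = "http://your-host:8080/v1"


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(BASE_URL, "llama-3.2-3b", "Hello!")

# To send it against a running server:
#   with urllib.request.urlopen(request) as response:
#       print(json.load(response))
```

Only `BASE_URL` is specific to the deployment; the path and the request body follow the OpenAI chat format.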

02 / Performance

Built for concurrency and production.

On a single NVIDIA T4, throughput scales with load instead of flatlining: past 1,500 tokens per second on Llama 3.2 1B, and several hundred on larger 3B-class models, on one card. With continuous batching, throughput peaks at 1,558 tokens per second at 32 simultaneous requests and still holds around 1,465 at 64. Full test conditions below.

GPU: 1× NVIDIA T4 (16 GB)
Precision: 4-bit (Q4_K_M)
Output: 128 tokens
Temperature: 0.7
Environment: Closed / isolated

Llama 3.2 1B · tokens/sec by concurrent requests (C)

C=1: 140 tok/s
C=4: 377 tok/s
C=8: 671 tok/s
C=16: 1,107 tok/s
C=32: 1,558 tok/s
C=64: 1,465 tok/s

At 16 concurrent requests · tokens/sec by model

Llama 3.2 · 1B: 1,107 tok/s
Qwen 2.5 · 3B: 456 tok/s
Llama 3.2 · 3B: 454 tok/s
Phi-3.5 · mini: 329 tok/s

Production-shaped traffic

860 tok/s

Staggered, real-world arrivals at a 1.64 s median latency — same Llama 3.2 1B on the same T4.

Repetitive & extractive text

up to 2.5× faster

Versus standard decoding on the same model — with byte-identical output. The speed-up is lossless.

Where the throughput pays off

Agentic & tool use

Agents fan out: planning steps, tool calls, retries, several agents at once. InferOS scales with that burst of parallel calls instead of choking on it, and structured tool-call output runs up to about 1.7× faster, losslessly. More agents per card.

RAG & retrieval apps

Retrieval-augmented apps need embeddings and generation side by side. InferOS serves both from one OpenAI-compatible engine, so it drops into your existing RAG stack — and long, shared contexts stay efficient as the calls pile up.
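
As a rough illustration of the retrieval half of a RAG stack, the sketch below builds an OpenAI-format embeddings request body and ranks documents by cosine similarity. The function names are ours, and the three-dimensional vectors are made up to stand in for real embeddings; in practice the vectors would come back from the engine's embeddings API.

```python
import math


def embeddings_payload(model: str, texts: list[str]) -> dict:
    """Request body for an OpenAI-format embeddings call."""
    return {"model": model, "input": texts}


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 2) -> list[int]:
    """Indices of the k documents most similar to the query, best first."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy vectors standing in for real embeddings (illustrative values only).
docs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query = [0.9, 0.1, 0.0]

best = top_k(query, docs, k=2)
```

The text of the top-ranked documents would then be placed into the chat prompt for the generation step.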

Multi-tenant SaaS

Every extra user is another concurrent request. Throughput rises with load, so a single GPU serves far more simultaneous users before you have to add hardware.

High-volume batch jobs

Classify, extract, summarise, or embed across millions of records. High aggregate throughput turns an overnight job into a coffee break.

All performance figures were measured in a controlled, closed environment on a single NVIDIA T4 (16 GB), 4-bit weights, 128-token generations at temperature 0.7 — the current generation of our published results. The concurrency curve uses Llama 3.2 1B, each cell taken at the best batching configuration for that concurrency; the by-model figures are each model's throughput at 16 concurrent requests. Throughput varies with hardware, model, quantisation, and load; per-stream rates fall as concurrency rises, the same physics every engine faces on a given card. These are our own measurements, not a benchmark of any third-party product.
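
To make the per-stream point concrete, here is a small sketch that divides the aggregate Llama 3.2 1B figures from the concurrency curve above by the number of concurrent requests. It assumes throughput is shared evenly between streams, which is a simplification; the function names are ours.

```python
# Aggregate tokens/sec by concurrency for Llama 3.2 1B, copied from the figures above.
AGGREGATE_TOK_S = {1: 140, 4: 377, 8: 671, 16: 1107, 32: 1558, 64: 1465}


def per_stream_rate(aggregate_tok_s: float, concurrency: int) -> float:
    """Average tokens/sec per request, assuming an even split across streams."""
    return aggregate_tok_s / concurrency


def speedup_over_single(aggregate_tok_s: float, single_tok_s: float) -> float:
    """How many times the single-request throughput the card delivers in total."""
    return aggregate_tok_s / single_tok_s


single = AGGREGATE_TOK_S[1]
for concurrency, total in AGGREGATE_TOK_S.items():
    stream = per_stream_rate(total, concurrency)
    gain = speedup_over_single(total, single)
    print(f"C={concurrency:>2}  total={total:>5} tok/s  per-stream={stream:6.1f} tok/s  gain={gain:4.1f}x")

# Concurrency level with the highest aggregate throughput.
peak_concurrency = max(AGGREGATE_TOK_S, key=AGGREGATE_TOK_S.get)
```

At the peak of 32 concurrent requests the card delivers about 11 times its single-request throughput, while each stream averages roughly 49 tok/s.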

03 / Runs anywhere you do

From a laptop to a GPU fleet.

The same engine scales with you — start on a CPU for development, serve production on one card, and grow to many when you need to. No rewrite in between.

Develop on a laptop

A CPU-only mode lets engineers build and test against the real engine with no GPU at all. Spin it up anywhere.

Ship on one modest GPU

Production-grade serving on a single mainstream GPU — the kind most teams already own, not scarce top-tier silicon.

Scale across many GPUs

Grow from one card to a multi-GPU server as demand rises, without changing a line of how your apps talk to it.

Run it your way

Deploy on-prem, in your private cloud, or fully air-gapped — as a single container or scaled across a cluster. It fits your stack, not the other way around.

04 / Safety & sovereignty

We value safety and sovereignty.

Self-hosted means more than convenience: it means control. Your models, your data, your infrastructure, with isolation and compliance posture built in rather than bolted on.

Switch off the internet

InferOS runs entirely on your own hardware and never phones home. Disconnect it from the network completely and it keeps serving.

Your data never leaves

Prompts and responses stay inside your walls — nothing is sent to us or any third party. Sovereignty by default, not by add-on.

True multi-tenancy

Every tenant gets its own keys, rate limits, usage metrics, and separate audit logs — enforced end-to-end, never shared by accident.

Compliance-ready by design

Engineered and tested to support the controls regulated teams depend on — encryption in transit and at rest, audit trails, and configurable retention. We don’t yet hold formal certifications, and we won’t claim badges we haven’t earned.

05 / Observability

Full visibility — never a prompt.

See the metrics that matter — latency, throughput, usage, and health, per tenant — while the contents of every request stay private. Visibility, not surveillance.

Time-to-first-token and per-token latency at p50 / p95 / p99
Goodput — the share of requests that actually meet your latency targets
Per-tenant usage, cache efficiency, and live GPU health
A metrics endpoint with a ready-to-run dashboard out of the box
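
For readers new to these metrics, the sketch below computes nearest-rank percentiles and goodput from a list of latency samples. It is a standalone illustration, not a description of how InferOS computes them; the sample values and the 0.30 s target are made up.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def goodput(latencies_s: list[float], target_s: float) -> float:
    """Share of requests whose latency met the target."""
    met = sum(1 for latency in latencies_s if latency <= target_s)
    return met / len(latencies_s)


# Made-up time-to-first-token samples in seconds, for illustration only.
ttft_s = [0.21, 0.25, 0.19, 0.40, 0.23, 0.22, 0.95, 0.24, 0.20, 0.26]

p50 = percentile(ttft_s, 50)
p95 = percentile(ttft_s, 95)
p99 = percentile(ttft_s, 99)
share_within_target = goodput(ttft_s, target_s=0.30)
```

With these samples the median looks healthy, but the tail percentiles and the goodput figure expose the two slow requests that an average would hide.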

06 / Model coverage

Thousands of open models. Plus yours.

Llama 3.1 / 3.2
Qwen 2.5 / Qwen 3
Mistral
Phi-3.5
Gemma 2 / 3
Starcoder2
Command-R
Your own fine-tunes

Thousands of open models run today, across 10+ architecture families we've tested. Bring your own fine-tuned model — if it's a supported architecture, it just works. No lock-in to any one vendor's models, and more architectures are in active validation.

07 / Community

Build it with us.

A community space is launching alongside InferOS — a place to compare notes, raise issues, request models, and help shape the roadmap. The earliest users have the loudest voice.

Talk to us →

08 / FAQ

Questions, answered.

Does my data ever leave my own infrastructure?

No. InferOS is fully self-hosted: prompts and responses never leave your servers, nothing is sent to us or any third party, and it can run completely offline, even air-gapped.

Do I need a GPU to try InferOS?

No. A CPU-only mode runs on an ordinary laptop or server for development and testing, so you can build against the real engine before you ever touch a GPU.

Which models does it support?

Thousands of open models across 10+ architecture families we have tested (Llama, Qwen, Mistral, Phi, Gemma, Starcoder2, Command-R and more), plus your own fine-tuned models.

Is InferOS OpenAI-compatible?

Yes. It exposes the OpenAI chat, completions, and embeddings APIs, so existing apps and SDKs point at it by changing a single line, with no rewrite. Use it for chat, coding assistants, agents, RAG, or batch jobs.

Is InferOS certified for compliance?

InferOS is compliance-ready by design: encryption in transit and at rest, per-tenant audit trails, and configurable retention. We do not yet hold formal certifications and will not claim badges we have not earned.

How do I get access?

The waitlist opens soon. The first launch users get one month of Pro, free, in exchange for their feedback as we shape the product.

Launch

Waitlist opens soon

Be first in line.

The waitlist opens soon. The first launch users get one month of Pro, free, in exchange for their feedback as we shape InferOS. Check back shortly.