InferOS
A self-hosted, production-grade LLM inference engine — built for concurrency, engineered to get the most out of every GPU. Runs anywhere from a laptop to a GPU fleet, fully offline, for everyone from solo builders to global enterprises.
The first wave of launch users get one month of Pro, free — in exchange for their feedback as we shape the product.
Llama 3.2 1B (4-bit) · 1× NVIDIA T4 · 32 concurrent
Llama 3.2 1B (4-bit) · 1× NVIDIA T4 · 1 request
across 1,408 served requests
10+ architecture families tested, plus your fine-tunes
Your own AI, on your own terms.
InferOS is a self-hosted inference engine that serves popular open models faster, on less hardware, than teams expect — so you get more concurrent users per GPU and predictable latency on the cards you already run, instead of renting scarce top-tier accelerators.
It speaks the OpenAI API, so your existing apps and SDKs point at it unchanged — chat, coding assistants, agents, RAG, batch jobs, anything — and it ships the multi-tenant isolation, encryption, and audit trail that regulated teams need but general-purpose tools leave to you. Nothing leaves your infrastructure.
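For teams calling the endpoint from Python rather than the command line, the curl request shown next can be sketched with the standard library alone. This is a minimal illustration, not an official client: the host, port, and model name are the placeholders from the curl example, and the helper name `build_chat_request` is made up for this sketch. It builds the request without sending it.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completions request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder host and model name; swap in your own deployment's values.
request = build_chat_request("http://your-host:8080", "llama-3.2-3b", "Hello!")

# To actually send it (requires a reachable server):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

Only the base URL is specific to the deployment; the path and payload follow the OpenAI chat completions shape.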
curl http://your-host:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.2-3b",
       "messages": [{"role": "user",
                     "content": "Hello!"}]}'
Change one line: the base URL. The rest is the API you already use.
Built for concurrency and production.
On a single NVIDIA T4, throughput scales with load instead of flatlining: past 1,500 tokens per second on Llama 3.2 1B, and several hundred on larger 3B-class models, on one card. Continuous batching holds it near that peak under heavier load: even at 64 simultaneous requests it still serves around 1,465 tokens per second. Full test conditions below.
Staggered, real-world arrivals at a 1.64 s median latency — same Llama 3.2 1B on the same T4.
Versus standard decoding on the same model — with byte-identical output. The speed-up is lossless.
Agentic & tool use
Agents fan out — planning steps, tool calls, retries, several agents at once. InferOS scales with that burst of parallel calls instead of choking on it, and structured tool-call output runs up to ~1.7× faster, losslessly. More agents per card.
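On the client side, that fan-out pattern might look like the sketch below: several prompts issued in parallel, with results collected in input order. Everything here is illustrative. `fan_out` and `fake_send` are made-up names, and `fake_send` stands in for a real HTTP call to the chat completions endpoint so the sketch runs without a server.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fan_out(prompts: List[str], send: Callable[[str], str], max_workers: int = 8) -> List[str]:
    """Call send(prompt) for every prompt in parallel; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(send, prompts))


def fake_send(prompt: str) -> str:
    # Stand-in for a real request to /v1/chat/completions.
    return "echo: " + prompt


results = fan_out(["plan", "search", "summarise"], fake_send)
```

Threads suit this case because each call spends its time waiting on the network; the server decides how the concurrent requests are batched.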
RAG & retrieval apps
Retrieval-augmented apps need embeddings and generation side by side. InferOS serves both from one OpenAI-compatible engine, so it drops into your existing RAG stack — and long, shared contexts stay efficient as the calls pile up.
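A minimal sketch of what "both from one engine" could mean for a client: the embeddings call and the generation call share one base URL and differ only in path and payload. The payload shapes follow the OpenAI embeddings and chat completions APIs. The host is the placeholder used elsewhere on this page, `your-embedding-model` is a hypothetical model name, and the helper names are made up. The requests are built but not sent.

```python
import json
import urllib.request
from typing import List

BASE_URL = "http://your-host:8080"  # placeholder host; use your own deployment


def post_json(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request against the shared base URL."""
    return urllib.request.Request(
        url=BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embeddings_request(texts: List[str], model: str = "your-embedding-model") -> urllib.request.Request:
    # The model name is a hypothetical placeholder.
    return post_json("/v1/embeddings", {"model": model, "input": texts})


def grounded_chat_request(question: str, passages: List[str],
                          model: str = "llama-3.2-3b") -> urllib.request.Request:
    # Retrieved passages go into the system message as context.
    context = "\n\n".join(passages)
    messages = [
        {"role": "system", "content": "Answer using only this context:\n" + context},
        {"role": "user", "content": question},
    ]
    return post_json("/v1/chat/completions", {"model": model, "messages": messages})


emb = embeddings_request(["first document", "second document"])
chat = grounded_chat_request("What does the first document say?", ["first document"])
```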
Multi-tenant SaaS
Every extra user is another concurrent request. Throughput rises with load, so a single GPU serves far more simultaneous users before you have to add hardware.
High-volume batch jobs
Classify, extract, summarise, or embed across millions of records. High aggregate throughput turns an overnight job into a coffee break.
All performance figures were measured in a controlled, closed environment on a single NVIDIA T4 (16 GB), 4-bit weights, 128-token generations at temperature 0.7 — the current generation of our published results. The concurrency curve uses Llama 3.2 1B, each cell taken at the best batching configuration for that concurrency; the by-model figures are each model's throughput at 16 concurrent requests. Throughput varies with hardware, model, quantisation, and load; per-stream rates fall as concurrency rises, the same physics every engine faces on a given card. These are our own measurements, not a benchmark of any third-party product.
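The trade-off between aggregate and per-stream throughput can be made concrete with simple arithmetic. The sketch below divides the approximate aggregate figure quoted above (around 1,465 tokens per second at 64 simultaneous requests) evenly across requests. Even sharing is an assumption made for illustration; real per-request rates vary with prompt length and scheduling.

```python
def per_stream_rate(aggregate_tokens_per_s: float, concurrent_requests: int) -> float:
    """Average tokens per second each request sees, assuming an even split."""
    if concurrent_requests < 1:
        raise ValueError("concurrent_requests must be at least 1")
    return aggregate_tokens_per_s / concurrent_requests


# Approximate aggregate figure at 64 simultaneous requests, as quoted above.
rate_at_64 = per_stream_rate(1465, 64)
print(f"{rate_at_64:.1f} tokens/s per request")  # prints "22.9 tokens/s per request"
```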
From a laptop to a GPU fleet.
The same engine scales with you — start on a CPU for development, serve production on one card, and grow to many when you need to. No rewrite in between.
Develop on a laptop
A CPU-only mode lets engineers build and test against the real engine with no GPU at all. Spin it up anywhere.
Ship on one modest GPU
Production-grade serving on a single mainstream GPU — the kind most teams already own, not scarce top-tier silicon.
Scale across many GPUs
Grow from one card to a multi-GPU server as demand rises, without changing a line of how your apps talk to it.
Run it your way
Deploy on-prem, in your private cloud, or fully air-gapped — as a single container or scaled across a cluster. It fits your stack, not the other way around.
We value safety and sovereignty.
Self-hosted means more than convenience: it means control. Your models, your data, your infrastructure, with isolation and compliance posture built in rather than bolted on.
Switch off the internet
InferOS runs entirely on your own hardware and never phones home. Disconnect it from the network completely and it keeps serving.
Your data never leaves
Prompts and responses stay inside your walls — nothing is sent to us or any third party. Sovereignty by default, not by add-on.
True multi-tenancy
Every tenant gets its own keys, rate limits, usage metrics, and separate audit logs — enforced end-to-end, never shared by accident.
Compliance-ready by design
Engineered and tested to support the controls regulated teams depend on — encryption in transit and at rest, audit trails, and configurable retention. We don’t yet hold formal certifications, and we won’t claim badges we haven’t earned.
Full visibility — never a prompt.
See the metrics that matter — latency, throughput, usage, and health, per tenant — while the contents of every request stay private. Visibility, not surveillance.
- Time-to-first-token and per-token latency at p50 / p95 / p99
- Goodput — the share of requests that actually meet your latency targets
- Per-tenant usage, cache efficiency, and live GPU health
- A metrics endpoint with a ready-to-run dashboard out of the box
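To make the latency percentiles and goodput concrete, here is one way they could be computed from raw latency samples. This is a sketch of the definitions, not InferOS's implementation: the percentile uses the nearest-rank method, goodput is taken as the fraction of requests at or under a latency target, and the sample values are made up.

```python
import math
from typing import List


def percentile(samples: List[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def goodput(samples: List[float], target_s: float) -> float:
    """Fraction of requests whose latency is at or under the target."""
    if not samples:
        raise ValueError("no samples")
    return sum(1 for s in samples if s <= target_s) / len(samples)


# Made-up time-to-first-token samples in seconds, for illustration only.
ttft = [0.2, 0.4, 0.5, 0.9, 1.2, 1.6, 2.0, 2.4, 3.1, 4.0]

p50, p95, p99 = (percentile(ttft, p) for p in (50, 95, 99))
share_on_target = goodput(ttft, target_s=2.0)
```

With so few samples p95 and p99 both land on the slowest request, which is why tail percentiles only become informative over many requests.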
Thousands of open models. Plus yours.
Thousands of open models run today, across 10+ architecture families we've tested. Bring your own fine-tuned model — if it's a supported architecture, it just works. No lock-in to any one vendor's models, and more architectures are in active validation.
Build it with us.
A community space is launching alongside InferOS — a place to compare notes, raise issues, request models, and help shape the roadmap. The earliest users have the loudest voice.
Questions, answered.
- Does my data ever leave my own infrastructure?
- No. InferOS is fully self-hosted — prompts and responses never leave your servers, nothing is sent to us or any third party, and it can run completely offline, even air-gapped.
- Do I need a GPU to try InferOS?
- No. A CPU-only mode runs on an ordinary laptop or server for development and testing, so you can build against the real engine before you ever touch a GPU.
- Which models does it support?
- Thousands of open models across 10+ architecture families we have tested (Llama, Qwen, Mistral, Phi, Gemma, StarCoder2, Command-R and more), plus your own fine-tuned models.
- Is InferOS OpenAI-compatible?
- Yes. It exposes the OpenAI chat, completions, and embeddings APIs, so existing apps and SDKs point at it by changing a single line — no rewrite. Use it for chat, coding assistants, agents, RAG, or batch jobs.
- Is InferOS certified for compliance?
- InferOS is compliance-ready by design — encryption in transit and at rest, per-tenant audit trails, and configurable retention. We do not yet hold formal certifications and will not claim badges we have not earned.
- How do I get access?
- The waitlist opens soon. The first launch users get one month of Pro, free, in exchange for their feedback as we shape the product.
Be first in line.
The waitlist opens soon, and it's for the first launch users — who get one month of Pro, free, in exchange for their feedback as we shape InferOS. Check back shortly.