Capability / 05

Inference fleet

GPU, multi-region.

GPU clusters across multiple regions, with autoscaling, request routing, KV-cache reuse, and fallback policies, so customers don't see a model outage even when a provider has one.


§ 01

Why inference is the new compute bottleneck

Training is bursty; inference is permanent. The economics of an AI product are dominated by per-request cost and per-request latency, and both are set by decisions the inference fleet makes a thousand times per second. We build the fleet so those decisions are explicit, attributable, and tunable.

§ 02

Inference primitives we operate

Multi-provider routing

Anthropic, OpenAI, Google, Bedrock, self-hosted — one interface, one budget.

KV-cache reuse

Prompt prefix caching, session caching, embedding caching — across requests.

Speculative decoding

Enabled where it pays off, for 1.5–3× faster generation with no quality loss.

Autoscaling

Per-region, per-model autoscaling with explicit cold-start budgets.

Fallback policies

Provider outage, region outage, model outage — each with a defined fallback path.

Cost attribution

Token-level accounting tagged to customer, feature, and team.
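
To make two of the primitives above concrete, here is a minimal sketch of an ordered fallback path combined with token-level cost attribution. It is an illustration under stated assumptions, not the fleet's actual interface: `Route`, `Ledger`, `ProviderError`, and `complete_with_fallback` are hypothetical names, the provider callables are stand-ins for real clients, and the prices are made up.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable


class ProviderError(Exception):
    """Raised by a provider callable when it cannot serve the request."""


@dataclass
class Completion:
    text: str
    input_tokens: int
    output_tokens: int


@dataclass
class Route:
    name: str                          # hypothetical label, e.g. "primary"
    call: Callable[[str], Completion]  # stand-in for a real provider client
    usd_per_1k_tokens: float           # illustrative price, not a real rate


@dataclass
class Ledger:
    """Accumulates tokens and cost per (customer, feature, team) tag."""
    tokens: dict = field(default_factory=lambda: defaultdict(int))
    cost_usd: dict = field(default_factory=lambda: defaultdict(float))

    def record(self, tag, route, completion):
        used = completion.input_tokens + completion.output_tokens
        self.tokens[tag] += used
        self.cost_usd[tag] += used / 1000 * route.usd_per_1k_tokens


def complete_with_fallback(prompt, routes, ledger, tag):
    """Try each route in order; bill the tag for the first route that succeeds."""
    errors = []
    for route in routes:
        try:
            completion = route.call(prompt)
        except ProviderError as exc:
            errors.append(f"{route.name}: {exc}")
            continue  # fall through to the next route in the policy
        ledger.record(tag, route, completion)
        return route.name, completion
    raise ProviderError("all routes failed: " + "; ".join(errors))


# Simulated providers: the first is down, the second answers.
def down(prompt):
    raise ProviderError("simulated outage")


def up(prompt):
    return Completion(text="ok", input_tokens=len(prompt.split()), output_tokens=1)


routes = [Route("primary", down, 0.5), Route("secondary", up, 1.0)]
ledger = Ledger()
tag = ("acme", "search", "platform")  # customer, feature, team (made-up values)
served_by, result = complete_with_fallback("hello inference fleet", routes, ledger, tag)
```

In this sketch the caller never sees the simulated outage: the request is served by the second route, and the ledger charges the tokens to the tag at that route's price. A real policy would likely distinguish provider, region, and model failures rather than treating every error the same way.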

§ 03

What the fleet delivers

99.9%

Steady-state availability

Multi-region

By default

Tunable

Cost vs. latency vs. quality knob
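
One way to picture the cost vs. latency vs. quality knob is as a set of weights applied when choosing among candidate models. The sketch below is only an illustration of that idea: `Candidate`, `pick`, the weights, and the normalized scores are all hypothetical and do not describe how the fleet actually scores routes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    cost: float     # normalized to [0, 1], lower is better (assumed scale)
    latency: float  # normalized to [0, 1], lower is better (assumed scale)
    quality: float  # normalized to [0, 1], higher is better (assumed scale)


def pick(candidates, w_cost, w_latency, w_quality):
    """Return the candidate with the highest weighted score."""
    def score(c):
        return w_quality * c.quality - w_cost * c.cost - w_latency * c.latency
    return max(candidates, key=score)


candidates = [
    Candidate("small-model", cost=0.1, latency=0.2, quality=0.6),
    Candidate("large-model", cost=0.8, latency=0.7, quality=0.95),
]

# Equal weights favor the cheaper, faster option.
balanced = pick(candidates, w_cost=1.0, w_latency=1.0, w_quality=1.0)

# Discounting cost and latency shifts the choice toward quality.
quality_first = pick(candidates, w_cost=0.1, w_latency=0.1, w_quality=1.0)
```

With these made-up numbers, equal weights select the small model and quality-heavy weights select the large one, which is the trade-off the knob is meant to expose.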

Licensed and ready to run.

View pricing→