OpenAI-compatible API with the fastest open-source models. Drop-in replacement — just change the base URL and API key. Metered billing, usage analytics, rate limiting.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-api.workers.dev/v1",
    api_key="inf-your-api-key-here",
)

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Why Inference
OpenAI-compatible endpoints powered by the fastest open-source models. Per-token metering, real-time analytics, and built-in rate limiting. No vendor lock-in.
Sub-100ms time-to-first-token. Up to 1200 tokens/second on our fastest models. Hardware-optimized inference.
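You can check time-to-first-token yourself by streaming a completion and timing the first chunk. A minimal sketch using the placeholder endpoint and key from this page:

import time
from openai import OpenAI

client = OpenAI(
    base_url="https://your-api.workers.dev/v1",
    api_key="inf-your-api-key-here",
)

start = time.perf_counter()
stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
# Block until the first streamed chunk arrives, then measure elapsed time.
first_chunk = next(iter(stream))
ttft_ms = (time.perf_counter() - start) * 1000
print(f"time to first token: {ttft_ms:.0f} ms")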
Drop-in replacement for any OpenAI SDK — Python, Node, Go, Rust. Just change the base URL and API key.
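With the official Python SDK, the switch can even be configuration-only: OpenAI() reads OPENAI_BASE_URL and OPENAI_API_KEY from the environment when constructed with no arguments, so existing code needs no changes. A sketch:

import os
from openai import OpenAI

# Set these in your shell or deployment config instead of hardcoding them.
os.environ["OPENAI_BASE_URL"] = "https://your-api.workers.dev/v1"
os.environ["OPENAI_API_KEY"] = "inf-your-api-key-here"

client = OpenAI()  # picks up the base URL and key from the environment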
Real-time dashboards showing token usage, latency, costs, and per-key breakdowns. Export data anytime.
Pay only for what you use. Per-token pricing with credit-based plans. No surprise bills.
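Because billing is per token, the cost of any call can be computed from the usage block that every chat completion returns. A sketch; the per-million-token rates below are made-up placeholders, not this service's actual prices:

from openai import OpenAI

client = OpenAI(
    base_url="https://your-api.workers.dev/v1",
    api_key="inf-your-api-key-here",
)

# Hypothetical example rates in dollars per million tokens; substitute real pricing.
PRICE_PER_M_INPUT = 0.59
PRICE_PER_M_OUTPUT = 0.79

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello!"}],
)
usage = response.usage
cost = (usage.prompt_tokens * PRICE_PER_M_INPUT
        + usage.completion_tokens * PRICE_PER_M_OUTPUT) / 1_000_000
print(f"{usage.prompt_tokens} in / {usage.completion_tokens} out -> ${cost:.6f}")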
Built-in per-key rate limiting. Protect your budget with configurable RPM limits per API key.
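When a key exceeds its RPM limit, OpenAI-compatible APIs conventionally respond with HTTP 429, which the Python SDK surfaces as RateLimitError (an assumption about this service's behavior, but the standard convention). A minimal retry-with-backoff sketch:

import time
from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://your-api.workers.dev/v1",
    api_key="inf-your-api-key-here",
)

def chat_with_retry(messages, retries=5):
    # Exponential backoff on 429s from per-key rate limiting.
    for attempt in range(retries):
        try:
            return client.chat.completions.create(
                model="llama-3.3-70b-versatile",
                messages=messages,
            )
        except RateLimitError:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("rate limited after all retries")

reply = chat_with_retry([{"role": "user", "content": "Hello!"}])
print(reply.choices[0].message.content)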
Requests routed through Cloudflare edge network. Low latency worldwide. 99.9% uptime.
Models
Multiple open-source models through a single API. All optimized for speed.
Versatile reasoning
Ultra-fast tasks
Great for code
Compact & efficient
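The samples on this page use llama-3.3-70b-versatile; assuming the service implements the standard /v1/models listing endpoint (not confirmed here, though typical for OpenAI-compatible APIs), the available model IDs can be enumerated with the same client:

from openai import OpenAI

client = OpenAI(
    base_url="https://your-api.workers.dev/v1",
    api_key="inf-your-api-key-here",
)

# Assumes the standard GET /v1/models endpoint is implemented.
for model in client.models.list():
    print(model.id)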
Pricing
Start free. Scale when you need to. No hidden fees.
$0/mo - Try it out
$19.99/mo - For building apps
$49.99/mo - For production
$199.99/mo - For scale
Quick Start
Three steps to your first API call.
1. Sign up for free and grab your API key from the dashboard. No credit card required.
2. pip install openai, or use any OpenAI-compatible SDK in your language of choice.
3. Point the SDK at our base URL, pass your API key, and call chat completions.
curl https://your-api.workers.dev/v1/chat/completions \
  -H "Authorization: Bearer inf-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.3-70b-versatile",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'