OpenAI-compatible API with the fastest open-source models. Drop-in replacement — just change the base URL and API key. Metered billing, usage analytics, rate limiting.
from openai import OpenAI

client = OpenAI(
    base_url="https://your-api.workers.dev/v1",
    api_key="inf-your-api-key-here",
)

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

Why Inference
OpenAI-compatible endpoints powered by the fastest open-source models. Per-token metering, real-time analytics, and built-in rate limiting. No vendor lock-in.
Sub-100ms time-to-first-token. Up to 1200 tokens/second on our fastest models. Hardware-optimized inference.
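You can check time-to-first-token yourself by streaming a completion and timing the first chunk. A minimal sketch using the placeholder endpoint and key from this page:

import time
from openai import OpenAI

client = OpenAI(
    base_url="https://your-api.workers.dev/v1",
    api_key="inf-your-api-key-here",
)

start = time.perf_counter()
stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
# Block until the first streamed chunk arrives, then measure elapsed time.
first_chunk = next(iter(stream))
ttft_ms = (time.perf_counter() - start) * 1000
print(f"time to first token: {ttft_ms:.0f} ms")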
Drop-in replacement for any OpenAI SDK — Python, Node, Go, Rust. Just change the base URL and API key.
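With the official Python SDK, the switch can even be configuration-only: OpenAI() reads OPENAI_BASE_URL and OPENAI_API_KEY from the environment when constructed with no arguments, so existing code needs no changes. A sketch:

import os
from openai import OpenAI

# Set these in your shell or deployment config instead of hardcoding them.
os.environ["OPENAI_BASE_URL"] = "https://your-api.workers.dev/v1"
os.environ["OPENAI_API_KEY"] = "inf-your-api-key-here"

client = OpenAI()  # picks up the base URL and key from the environment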
Real-time dashboards showing token usage, latency, costs, and per-key breakdowns. Export data anytime.
Pay only for what you use. Per-token pricing with credit-based plans. No surprise bills.
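Because billing is per token, the cost of any call can be computed from the usage block that every chat completion returns. A sketch; the per-million-token rates below are made-up placeholders, not this service's actual prices:

from openai import OpenAI

client = OpenAI(
    base_url="https://your-api.workers.dev/v1",
    api_key="inf-your-api-key-here",
)

# Hypothetical example rates in dollars per million tokens; substitute real pricing.
PRICE_PER_M_INPUT = 0.59
PRICE_PER_M_OUTPUT = 0.79

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Hello!"}],
)
usage = response.usage
cost = (usage.prompt_tokens * PRICE_PER_M_INPUT
        + usage.completion_tokens * PRICE_PER_M_OUTPUT) / 1_000_000
print(f"{usage.prompt_tokens} in / {usage.completion_tokens} out -> ${cost:.6f}")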
Built-in per-key rate limiting. Protect your budget with configurable RPM limits per API key.
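When a key exceeds its RPM limit, OpenAI-compatible APIs conventionally respond with HTTP 429, which the Python SDK surfaces as RateLimitError (an assumption about this service's behavior, but the standard convention). A minimal retry-with-backoff sketch:

import time
from openai import OpenAI, RateLimitError

client = OpenAI(
    base_url="https://your-api.workers.dev/v1",
    api_key="inf-your-api-key-here",
)

def chat_with_retry(messages, retries=5):
    # Exponential backoff on 429s from per-key rate limiting.
    for attempt in range(retries):
        try:
            return client.chat.completions.create(
                model="llama-3.3-70b-versatile",
                messages=messages,
            )
        except RateLimitError:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("rate limited after all retries")

reply = chat_with_retry([{"role": "user", "content": "Hello!"}])
print(reply.choices[0].message.content)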
Requests routed through Cloudflare edge network. Low latency worldwide. 99.9% uptime.
Models
Multiple open-source models through a single API. All optimized for speed.
Versatile reasoning
Ultra-fast tasks
Great for code
Compact & efficient
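The samples on this page use llama-3.3-70b-versatile; assuming the service implements the standard /v1/models listing endpoint (not confirmed here, though typical for OpenAI-compatible APIs), the available model IDs can be enumerated with the same client:

from openai import OpenAI

client = OpenAI(
    base_url="https://your-api.workers.dev/v1",
    api_key="inf-your-api-key-here",
)

# Assumes the standard GET /v1/models endpoint is implemented.
for model in client.models.list():
    print(model.id)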
Pricing
Start free. Scale when you need to. No hidden fees.
$0/mo - Try it out
$19.99/mo - For building apps
$49.99/mo - For production
$199.99/mo - For scale
Quick Start
Three steps to your first API call.
1. Sign up for free and grab your API key from the dashboard. No credit card required.
2. pip install openai, or use any OpenAI-compatible SDK in your language of choice.
3. Point the SDK at our base URL, pass your API key, and call chat completions.
curl https://your-api.workers.dev/v1/chat/completions \
  -H "Authorization: Bearer inf-your-key" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.3-70b-versatile",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'