QuotaKit SDK
Usage tracking and quota enforcement for AI API calls. Wrap any provider call with quotakit.track() and QuotaKit handles cost attribution, quota checks, and async log ingestion — all without proxying your traffic.
Self-reported usage
QuotaKit relies on the usage you report through the SDK or API. For accurate analytics and dependable enforcement, make sure your token counts and success/charge signals are correct.
Quickstart
Install the SDK, initialize with your API key, and wrap your first provider call.
pip install quotakit

import quotakit
import openai

quotakit.init(api_key="aisc_...")
client = openai.OpenAI(api_key="sk-openai")

with quotakit.track("app/prod", service="openai", model="gpt-4o") as t:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Summarize this article"}],
    )
    t.result(
        input_tokens=response.usage.prompt_tokens,
        output_tokens=response.usage.completion_tokens,
    )

That's it for the happy path. The call is quota-checked before it executes, and usage is logged asynchronously once .result() is called.
SDK Flow
Understanding the lifecycle of a tracked call helps when handling edge cases.
- init() sets your API key and starts a background sync thread that keeps node-state (quota policies and current spend) cached locally. The sync adapts to your quota proximity: every 2 minutes for open-mode paths, down to every 10 seconds as you approach a block or strict limit.
- track(path, service, model) opens a context manager. On enter, the SDK checks the local cache for any block-mode quota that would be exceeded; no network call is needed.
- If a quota would be exceeded in block mode, QuotaExceeded is raised before the provider call ever happens. The prevented attempt is reported as a quota event.
- If allowed, your provider call runs inside the with block. You then call t.result() to report token usage and outcome.
- On context exit, the entry is queued and batched to /v1/log/batch asynchronously so your main thread is never blocked. The SDK adapts to your throughput: during high-volume periods it fills batches to ~2,000 entries before sending; during quiet periods it flushes within seconds of the last entry arriving.
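The adaptive sync cadence from the first step can be pictured with a small sketch. The docs state only the two endpoints (2 minutes for open-mode paths, 10 seconds near a block or strict limit, reached at 90% usage or more). The linear ramp in between and the name sync_interval_seconds are assumptions made for illustration, not the SDK's actual schedule.

```python
def sync_interval_seconds(mode: str, usage_ratio: float) -> float:
    """Hypothetical model of the adaptive node-state sync interval."""
    slow, fast = 120.0, 10.0   # bounds stated in the docs
    near_limit = 0.90          # docs: at >= 90% usage the interval is 10 s

    if mode == "open":
        return slow            # open-mode paths sync every 2 minutes
    if usage_ratio >= near_limit:
        return fast
    # Assumption: ramp linearly from 120 s at 0% usage to 10 s at 90% usage.
    ratio = max(usage_ratio, 0.0)
    return slow - (slow - fast) * (ratio / near_limit)


for mode, ratio in [("open", 0.99), ("block", 0.45), ("strict", 0.95)]:
    print(mode, ratio, round(sync_interval_seconds(mode, ratio), 1))
```

Whatever the real curve looks like, the practical point is the same: the closer a path is to an enforcing limit, the fresher the cached spend used for the local check.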
SDK Signatures
Full surface of the SDK.
quotakit.init(api_key: str)
# Tracking
quotakit.track(path: str, service: str, model: str) -> context manager
# Service configuration
quotakit.create_service(...)
quotakit.update_service(...)
quotakit.list_services()
quotakit.delete_service(service, model)
# Quota management
quotakit.create_quota(...)
quotakit.update_quota(...)
quotakit.list_quotas(...)
quotakit.delete_quota(...)

Service and quota management methods are thin wrappers over the REST endpoints documented in the API Reference below.
.result() Patterns
t.result(input_tokens, output_tokens, success, charged) normalizes usage across any provider. The shape of the response object varies — extract tokens however the provider exposes them.
# OpenAI chat-completions shape (Anthropic exposes usage.input_tokens / usage.output_tokens instead)
t.result(
    input_tokens=response.usage.prompt_tokens,
    output_tokens=response.usage.completion_tokens,
)

# Custom / non-standard provider
with quotakit.track("app/scraper", service="scraper_api", model="standard") as t:
    resp = scraper.fetch(...)
    usage = resp["meta"]["usage"]
    t.result(
        input_tokens=int(usage["in"]),
        output_tokens=int(usage["out"]),
    )

If tokens are omitted, QuotaKit falls back to the per-request estimate configured for that service+model. Cost is computed using the pricing in /api/sdk/services. See status semantics for how success/failure maps to cost handling.
Failed Charged vs Uncharged
Not all failures are the same. Some provider errors still consume tokens and incur cost; others (timeouts, connection drops) don't. Use the charged flag on .result() to explicitly control whether a failed call is billed. By default, charged mirrors success — so a failed call is uncharged unless you say otherwise.
# Failed + uncharged (default: charged mirrors success)
try:
    with quotakit.track("app/assistant", service="openai", model="gpt-4o") as t:
        response = provider_call()
        t.result(success=False)  # charged=False by default
except SomeError:
    pass

# No .result() at all also logs as failed + uncharged
try:
    with quotakit.track("app/assistant", service="openai", model="gpt-4o"):
        provider_timeout_call()  # raises before .result() is called
except TimeoutError:
    pass

# Failed + charged: provider processed the request and billed you anyway
with quotakit.track("app/assistant", service="openai", model="gpt-4o") as t:
    response = provider_call()  # e.g. returned 403 but still billed
    t.result(
        success=False,
        charged=True,
        input_tokens=response.usage.prompt_tokens,
        output_tokens=response.usage.completion_tokens,
    )

Pass charged=True when the provider billed you despite the failure (e.g. a 403 that still consumed tokens). In analytics, failed+charged entries count against quota spend; failed+uncharged entries do not. The corresponding ingest payload rules are documented at /v1/log/batch.
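The default rule above ("charged mirrors success") is easy to get wrong when reading call sites, so here is a minimal sketch of it as a truth table. The helper name resolve_charged is hypothetical and is not part of the SDK; it only restates the documented behaviour.

```python
from typing import Optional


def resolve_charged(success: bool, charged: Optional[bool] = None) -> bool:
    """Documented default: charged mirrors success unless set explicitly."""
    return success if charged is None else charged


# Enumerate the combinations discussed in this section.
for success, charged in [(True, None), (False, None), (False, True)]:
    effective = resolve_charged(success, charged)
    print(f"success={success!s:5} charged={charged!s:5} -> counts against quota: {effective}")
```

Only the third row (failed but explicitly charged) differs from what the success flag alone would suggest, which is why it needs the explicit charged=True.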
Quota Modes (Open / Block / Strict)
Quotas attach to hierarchy nodes and cascade to children. Every policy has a mode that determines how enforcement behaves. The default is open so production traffic is never blocked unless you opt in.
- Open (default) - always allows the call. Overages are logged and flagged, but no exception is thrown.
- Block - predicts locally and raises QuotaExceeded before the provider call.
- Strict - reservation-based enforcement across pods. Blocks if a reservation cannot be acquired.
from quotakit import QuotaExceeded

try:
    with quotakit.track("app/team/feature", service="openai", model="gpt-4o") as t:
        response = client.chat.completions.create(...)
        t.result(input_tokens=..., output_tokens=...)
except QuotaExceeded as e:
    print(e.path)           # "app/team/feature"
    print(e.service)        # "openai"
    print(e.current_spend)  # current period spend in USD
    print(e.limit)          # configured limit

An ancestor node policy (e.g. on app) can block calls at any child path (e.g. app/team/feature). Manage policies via the Quota Editor (Projects tab) or /api/sdk/quotas. Node state is synced via /api/sdk/node-state and updated opportunistically in /v1/log/batch responses.
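The ancestor cascade can be sketched as a walk from the root of the path down to the node being tracked. The policy dictionary shape and both function names below are hypothetical, and the check is deliberately simplified to current_spend + estimated_cost; the SDK's real cache layout and projection may differ.

```python
from typing import Dict, List, Optional


def ancestors(path: str) -> List[str]:
    """Return every node on the path, root first: 'a/b/c' -> ['a', 'a/b', 'a/b/c']."""
    parts = path.split("/")
    return ["/".join(parts[: i + 1]) for i in range(len(parts))]


def first_blocking_node(
    path: str, policies: Dict[str, dict], estimated_cost: float
) -> Optional[str]:
    """Return the first block-mode node whose limit would be exceeded, else None."""
    for node in ancestors(path):
        policy = policies.get(node)
        if policy is None or policy["mode"] != "block":
            continue  # no policy here, or a mode this sketch does not model
        if policy["current_spend"] + estimated_cost > policy["limit"]:
            return node
    return None


# A block-mode policy on "app" stops a call three levels down.
policies = {"app": {"mode": "block", "limit": 100.0, "current_spend": 99.99}}
print(first_blocking_node("app/team/feature", policies, estimated_cost=0.05))
```

Checking root first means the broadest policy is reported when several ancestors would block, which is one reasonable choice for error messages but not something the docs specify.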
Enforcement logic (simplified)
if mode == "open":
    allow()
elif mode == "block":
    projected = current_spend + pending_requests * avg_cost + estimated_cost
    if projected > limit:
        raise QuotaExceeded
    allow()
elif mode == "strict":
    if not reservation.acquire(estimated_cost):
        raise QuotaExceeded
    allow()

Block is enforced by the SDK before the provider call: the SDK checks local quota state (refreshed via background sync and flush-response updates) and raises QuotaExceeded if the estimated cost would exceed any block-mode limit. The ingest server records usage but does not reject entries that exceed block-mode quotas; enforcement is client-side.
However, block mode has no cross-server coordination. Two batches arriving simultaneously on different servers will each see the same pre-insert snapshot and may both approve entries that together exceed the limit. The potential overshoot is bounded by the total cost of all concurrent batches running at the moment the limit is crossed.
Strict uses reservations so every server shares one authoritative spend ledger. Availability is computed as:
available = limit - logged_spend - sum(all_active_reservations)

Reservations are sized to recent burn rate and renewed automatically. If a server crashes, its reservation expires within a few minutes and the quota is freed. Strict mode adds a network round-trip before the provider call and fails closed if the reservation service is unreachable.
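A worked instance of the availability formula may help. The function names and the rule that a request is granted only when it fits entirely within the available amount are assumptions for illustration; the docs give the formula but not the grant rule.

```python
from typing import Iterable


def available_budget(
    limit: float, logged_spend: float, active_reservations: Iterable[float]
) -> float:
    """available = limit - logged_spend - sum(all_active_reservations)"""
    return limit - logged_spend - sum(active_reservations)


def can_reserve(
    limit: float, logged_spend: float, active_reservations: Iterable[float], requested: float
) -> bool:
    """Assumed grant rule: the whole requested amount must fit in what is available."""
    return requested <= available_budget(limit, logged_spend, active_reservations)


# 100 USD limit, 60 USD already logged, two servers holding 10 and 15 USD.
reservations = [10.0, 15.0]
print(available_budget(100.0, 60.0, reservations))      # remaining headroom
print(can_reserve(100.0, 60.0, reservations, 20.0))     # larger than the headroom
print(can_reserve(100.0, 60.0, reservations, 15.0))     # fits exactly
```

Because outstanding reservations are subtracted up front, two servers cannot both be granted the same remaining headroom, which is the property block mode lacks.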
Tradeoffs by mode
- Open: zero latency, maximum availability, highest risk of overage (no blocking).
- Block: zero extra round-trips, cannot overshoot within a single batch, small race window if multiple servers submit batches simultaneously.
- Strict: strongest correctness across multiple servers, additional latency per call, may block if the reservation service is unavailable.
Worst-case overage estimate (block — concurrent servers only)
overage_usd <= C_max * (in_flight_total + R_total * T_sync)

- C_max: max cost per request (USD or credits).
- in_flight_total: concurrent requests already in flight across servers.
- R_total: aggregate request rate (req/sec) across servers.
- T_sync: SDK sync interval (10–120 s, adaptive based on quota proximity). At ≥ 90% usage, T_sync drops to 10 s.
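Plugging numbers into the bound shows how much the sync interval matters. The input values below are made up for illustration; only the formula comes from the text.

```python
def worst_case_overage(
    c_max: float, in_flight_total: int, r_total: float, t_sync: float
) -> float:
    """overage <= C_max * (in_flight_total + R_total * T_sync)"""
    return c_max * (in_flight_total + r_total * t_sync)


# Hypothetical workload: 0.05 USD max per request, 20 requests in flight,
# 5 requests/second in aggregate across all servers.
near_limit = worst_case_overage(0.05, 20, 5.0, t_sync=10.0)    # interval near the limit
far_from_limit = worst_case_overage(0.05, 20, 5.0, t_sync=120.0)

print(f"T_sync = 10 s:  up to {near_limit:.2f} USD over")
print(f"T_sync = 120 s: up to {far_from_limit:.2f} USD over")
```

With these inputs the bound is 0.05 × (20 + 50) = 3.50 USD at the 10-second interval and 0.05 × (20 + 600) = 31.00 USD at the 120-second interval, so the drop to 10 s near the limit does most of the work in keeping overshoot small.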
Practical guidance
- Use Open for experiments and low-risk paths. It is the default.
- Use Block for most production paths. Single-server deployments get hard enforcement; multi-server deployments get a small concurrent-batch window.
- Use Strict for hard caps where any overshoot is unacceptable — especially high-concurrency multi-server deployments.
- Set quotas 5-10% below your real hard limit when using open or block modes with multiple servers.
- Strict mode limit: each account may hold at most 100 open reservations at one time. If this cap is reached, track() raises QuotaExceeded with reason="reservation_limit_reached". Reservations are released automatically when POST /api/sdk/release-batch is called or when they expire.
Where to set it: Dashboard → Projects → select a node → Quota Editor → Mode (Open, Block, Strict). In the API, mode can be open, block, or strict.
Quota Events
When a call is prevented, QuotaKit records a quota event. These events appear in your analytics dashboard and count as prevented attempts without mixing them into request logs or spend totals.
{
  "event_id": "uuid",
  "path": "app/team/feature",
  "service": "openai",
  "model": "gpt-4o",
  "enforcement_mode": "block",
  "limit_type": "usd",
  "reason": "monthly spend limit exceeded"
}

Async + Streaming
v1 does not auto-wrap async or streaming clients. Instrument them manually — the pattern is the same: open track(), run your call, call .result() when you have final usage.
# Async
import asyncio
import quotakit
from openai import AsyncOpenAI

quotakit.init(api_key="aisc_...")
client = AsyncOpenAI()

async def run():
    with quotakit.track("app/async", service="openai", model="gpt-4o") as t:
        resp = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": "Hello"}],
        )
        t.result(
            input_tokens=resp.usage.prompt_tokens,
            output_tokens=resp.usage.completion_tokens,
        )

asyncio.run(run())

# Streaming: finalize once the stream ends.
# Here client is a synchronous openai.OpenAI() instance, as in the Quickstart.
with quotakit.track("app/stream", service="openai", model="gpt-4o") as t:
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Stream this"}],
        stream=True,
        stream_options={"include_usage": True},  # OpenAI only sends usage on the final chunk when asked
    )
    final_usage = None
    for chunk in stream:
        if getattr(chunk, "usage", None):
            final_usage = chunk.usage
    if final_usage is not None:
        t.result(
            input_tokens=final_usage.prompt_tokens,
            output_tokens=final_usage.completion_tokens,
        )

Authentication
All SDK and ingest endpoints authenticate with your QuotaKit API key as a Bearer token.
curl https://ingest.quotakit.io/api/sdk/services \
  -H "Authorization: Bearer aisc_..."

Server-side only: never call this API from a browser
QuotaKit API keys must be kept secret. Embedding a key in browser JavaScript exposes it to anyone who opens devtools — they could log arbitrary usage against your account or read your quota configuration. The ingest API has no CORS support and is not designed for browser clients. Call it only from your backend: a Node.js server, a Python service, or a serverless function.
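On a backend, one common way to keep the key out of source code is to read it from the environment and attach it as a Bearer token. The sketch below uses only the Python standard library and builds the request without sending it. The environment variable name QUOTAKIT_API_KEY and both helper names are assumptions; the host and header format are taken from the curl example above.

```python
import os
import urllib.request

BASE_URL = "https://ingest.quotakit.io"  # host from the curl example


def api_key_from_env(var: str = "QUOTAKIT_API_KEY") -> str:
    """Read the key from the environment (variable name is hypothetical)."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"{var} is not set")
    return key


def build_request(path: str, api_key: str) -> urllib.request.Request:
    """Build, but do not send, an authenticated GET request."""
    return urllib.request.Request(
        BASE_URL + path,
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


req = build_request("/api/sdk/services", "aisc_example")
print(req.get_method(), req.full_url)
```

In a real service you would pass api_key_from_env() instead of a literal and send the request with urllib.request.urlopen or your HTTP client of choice.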
Monthly spend cap
Paid plans with metered overage support a customer-set dollar ceiling. Set it from the Billing tab. When your projected ingest overage cost reaches the cap, new SDK calls receive a 402 spend_cap_reached response and batches are dropped until the next billing period (or until you raise/remove the cap). Rejections are served from a Redis fast-path so hitting the cap costs near-zero per request.
Rate limits
All limits are per API key, per 1-minute sliding window. Requests that exceed the limit receive a 429 rate_limit_exceeded response. The counter resets automatically — no backoff is required beyond a brief retry.
| Endpoint | Limit | Notes |
|---|---|---|
| GET /api/sdk/services | 120 req / min | Service pricing reads. Low traffic — called once at SDK init. |
| GET /api/sdk/quotas | 240 req / min | Quota policy reads. |
| GET /api/sdk/node-state | 240 req / min | SDK background sync. Adaptive: syncs every 2 min for open-mode paths, down to 10 s as paths approach block/strict limits. |
| POST /api/sdk/reserve-batch | 1200 req / min | Strict-mode reservation acquire/renew. One request per active (path, service, model) per TTL window. |
| POST /api/sdk/release-batch | 1200 req / min | Strict-mode reservation release. Fire-and-forget; called on quota exhaustion or process exit. |
| /v1/log/batch | 300 req / min | Usage log ingest. The SDK batches entries adaptively — filling to ~2,000 entries per call during high throughput, or flushing sooner when traffic is light. |
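If you call these endpoints directly rather than through the SDK, a small client-side guard can keep you under the limits in the table and avoid 429 responses in the first place. This is a sketch of a generic sliding-window counter; it is an approximation of the documented "per 1-minute sliding window" rule, not the server's actual algorithm, and the class name is hypothetical.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most `limit` calls within any `window`-second span."""

    def __init__(
        self, limit: int, window: float = 60.0, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self.limit = limit
        self.window = window
        self.clock = clock
        self._stamps: Deque[float] = deque()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Drop timestamps that have slid out of the window.
        while self._stamps and now - self._stamps[0] >= self.window:
            self._stamps.popleft()
        if len(self._stamps) >= self.limit:
            return False  # caller should wait briefly and retry
        self._stamps.append(now)
        return True


# 240 req / min matches the /api/sdk/quotas row in the table above.
quota_reads = SlidingWindowLimiter(limit=240)
print(quota_reads.try_acquire())
```

The injectable clock exists only to make the limiter easy to test; production code can rely on the time.monotonic default.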
/api/sdk/services
GET /api/sdk/services
POST /api/sdk/services
PUT /api/sdk/services
DELETE /api/sdk/services?service=<name>&model=<name>

Defines how QuotaKit prices a service+model combination. Used to compute USD cost from token counts.

{
  "service": "scraper_api",
  "model": "standard",
  "currency_type": "credits",
  "price_per_request": 3,
  "price_per_input_unit": 0,
  "input_unit_size": 1000000,
  "price_per_output_unit": 0,
  "output_unit_size": 1000000
}

- 409 on POST with a duplicate service+model.
- 404 on PUT/DELETE when the service+model doesn't exist.
- 400 for invalid currency_type or non-numeric price fields.
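To make the pricing fields concrete, here is one plausible way a cost could be derived from them. The formula (a flat per-request price plus per-unit prices scaled by the unit sizes) is an assumption inferred from the field names, not a documented rule, and the token-priced entry uses made-up numbers rather than any provider's real prices.

```python
def compute_cost(pricing: dict, input_tokens: int = 0, output_tokens: int = 0) -> float:
    """Assumed formula: per-request price + tokens / unit_size * price_per_unit."""
    return (
        pricing["price_per_request"]
        + pricing["price_per_input_unit"] * input_tokens / pricing["input_unit_size"]
        + pricing["price_per_output_unit"] * output_tokens / pricing["output_unit_size"]
    )


# The scraper_api example above: flat 3 credits per request, tokens are free.
scraper = {
    "price_per_request": 3,
    "price_per_input_unit": 0, "input_unit_size": 1_000_000,
    "price_per_output_unit": 0, "output_unit_size": 1_000_000,
}

# Hypothetical token-priced model (illustrative prices per million tokens).
token_priced = {
    "price_per_request": 0,
    "price_per_input_unit": 2.5, "input_unit_size": 1_000_000,
    "price_per_output_unit": 10.0, "output_unit_size": 1_000_000,
}

print(compute_cost(scraper, input_tokens=500, output_tokens=500))
print(f"{compute_cost(token_priced, input_tokens=100, output_tokens=60):.5f}")
```

Under this reading, a per-request service sets both unit prices to 0 and a token-priced service sets price_per_request to 0, with the result expressed in whatever currency_type the service declares.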
/api/sdk/quotas
GET /api/sdk/quotas?node_path=app&service=openai&model=gpt-4o
POST /api/sdk/quotas
PUT /api/sdk/quotas
DELETE /api/sdk/quotas?node_path=app&service=openai&model=gpt-4o

{
  "node_path": "app/api",
  "service": "openai",
  "model": "gpt-4o",
  "limit_dollars": 100,
  "window_type": "monthly",
  "mode": "strict"
}

- mode: open (default), block, or strict.
- limit_credits requires service and model.
- 409 on POST duplicate scope; 404 on PUT/DELETE missing scope.
Strict mode uses reservations for cross-pod coordination and may allow small overshoot within a reservation window.
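The payload rules listed for this endpoint can be checked locally before sending a request. The sketch below covers only the two rules stated above (the allowed mode values and the limit_credits dependency); the function name is hypothetical and the server may validate more than this.

```python
from typing import List

VALID_MODES = {"open", "block", "strict"}


def validate_quota_payload(payload: dict) -> List[str]:
    """Return a list of problems; an empty list means the documented rules pass."""
    errors = []
    mode = payload.get("mode", "open")  # open is the documented default
    if mode not in VALID_MODES:
        errors.append(f"mode must be one of {sorted(VALID_MODES)}, got {mode!r}")
    if "limit_credits" in payload and not (payload.get("service") and payload.get("model")):
        errors.append("limit_credits requires service and model")
    return errors


ok = {"node_path": "app/api", "service": "openai", "model": "gpt-4o",
      "limit_dollars": 100, "window_type": "monthly", "mode": "strict"}
bad = {"node_path": "app", "limit_credits": 500, "mode": "hard"}

print(validate_quota_payload(ok))
print(validate_quota_payload(bad))
```

Returning a list rather than raising on the first problem lets a caller report every issue in one pass.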
/api/sdk/node-state
GET /api/sdk/node-state?path=app/api

Used by the SDK's background sync thread. Returns the effective policy state and current spend totals for a node path, which the SDK caches locally for fast quota checks without a per-call network round-trip.
/v1/log/batch
Ingest endpoint for batched usage logs. The SDK calls this asynchronously — you don't call it directly in normal usage, but the schema is useful when building integrations or debugging.
{
  "entries": [
    {
      "path": "app/api",
      "service": "openai",
      "model": "gpt-4o",
      "input_tokens": 100,
      "output_tokens": 60,
      "usd": 0.001,
      "status": "success",
      "request_id": "uuid"
    },
    {
      "path": "app/api",
      "service": "openai",
      "model": "gpt-4o",
      "status": "failed",
      "usd": 0.0007,
      "request_id": "uuid2"
    }
  ]
}

- entries must be non-empty.
- Each entry requires path and service.
- Paths are validated for format and max depth.
- A failed entry is charged if usd or credits is non-null and positive; uncharged otherwise. The SDK sets this automatically based on the charged flag.
- quota_state: array of path + policy objects with approximate current_spend, returned for up to 5 paths in the batch. Values reflect a pre-insert snapshot and may be off by one batch's worth of spend. The SDK uses this to update its local quota state without a separate sync call; for an authoritative figure use the dashboard or a fresh node-state call.
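When building an integration that posts to this endpoint directly, the request rules above can be mirrored in a pre-flight check. This sketch implements only the rules stated in the list (non-empty entries, required path and service, and the charged-failure rule); path format and depth validation are left out because the docs do not specify them, and both function names are hypothetical.

```python
def validate_batch(batch: dict) -> None:
    """Raise ValueError if the batch breaks one of the documented rules."""
    entries = batch.get("entries")
    if not entries:
        raise ValueError("entries must be non-empty")
    for i, entry in enumerate(entries):
        for field in ("path", "service"):
            if not entry.get(field):
                raise ValueError(f"entries[{i}] is missing {field!r}")


def is_charged_failure(entry: dict) -> bool:
    """A failed entry is charged if usd or credits is non-null and positive."""
    if entry.get("status") != "failed":
        return False
    return any((entry.get(key) or 0) > 0 for key in ("usd", "credits"))


batch = {
    "entries": [
        {"path": "app/api", "service": "openai", "status": "failed", "usd": 0.0007},
        {"path": "app/api", "service": "openai", "status": "failed"},
    ]
}
validate_batch(batch)
print([is_charged_failure(e) for e in batch["entries"]])
```

The first entry carries a positive usd value and so counts as failed and charged; the second has no cost fields and is failed and uncharged.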
Status semantics
| Status | Meaning | Cost handling |
|---|---|---|
| success | Provider call completed successfully. | Cost computed from tokens or per-request estimate. |
| failed | Call failed after starting (transport, provider, or app error). | Uncharged by default. Pass charged=True to .result() to log cost. |

Prevented calls (blocked by quota enforcement before the provider call was made) are not recorded as request logs. They appear as quota events instead, with zero charge, so they never inflate your spend totals.
Error matrix
| Status | Meaning |
|---|---|
| 401 | Missing or invalid Authorization header |
| 402 | quota_exceeded (strict-mode quota denied) or spend_cap_reached (monthly spend cap reached) |
| 404 | service or quota not found (update / delete) |
| 409 | duplicate service or quota scope (create) |
| 429 | rate_limit_exceeded (API key rate limit) or ingest_api_call_limit_reached (plan call limit) |
| 500 | internal server error |