QuotaKit SDK

Usage tracking and quota enforcement for AI API calls. Wrap any provider call with quotakit.track() and QuotaKit handles cost attribution, quota checks, and async log ingestion — all without proxying your traffic.

Self-reported usage

QuotaKit relies on the usage you report through the SDK or API. For accurate analytics and dependable enforcement, make sure your token counts and success/charge signals are correct.

Quickstart

Install the SDK, initialize with your API key, and wrap your first provider call.

bash

pip install quotakit

python

import quotakit
import openai

quotakit.init(api_key="aisc_...")
client = openai.OpenAI(api_key="sk-openai")

with quotakit.track("app/prod", service="openai", model="gpt-4o") as t:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Summarize this article"}],
    )
    t.result(
        input_tokens=response.usage.prompt_tokens,
        output_tokens=response.usage.completion_tokens,
    )

That's it for the happy path. The call is quota-checked before it executes, and usage is logged asynchronously once .result() is called.

SDK Flow

Understanding the lifecycle of a tracked call helps when handling edge cases.

init() sets your API key and starts a background sync thread that keeps node-state (quota policies and current spend) cached locally. The sync adapts to your quota proximity — syncing every 2 minutes for open-mode paths, down to every 10 seconds as you approach a block or strict limit.
track(path, service, model) opens a context manager. On enter, the SDK checks the local cache for any block-mode quota that would be exceeded — no network call needed.
If a quota would be exceeded in block mode, QuotaExceeded is raised before the provider call ever happens. The prevented attempt is reported as a quota event.
If allowed, your provider call runs inside the with block. You then call t.result() to report token usage and outcome.
On context exit, the entry is queued and batched to /v1/log/batch asynchronously so your main thread is never blocked. The SDK adapts to your throughput: during high-volume periods it fills batches to ~2,000 entries before sending; during quiet periods it flushes within seconds of the last entry arriving.
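
The adaptive batching in the last step can be pictured as a buffer that flushes either when it is full or when traffic goes quiet. This is a minimal sketch, not QuotaKit's implementation: AdaptiveBatcher, max_batch, idle_seconds, and the injected send and clock callables are hypothetical names, and the 2-second idle threshold is an assumed stand-in for "within seconds".

```python
import time
from typing import Callable, List


class AdaptiveBatcher:
    """Hypothetical sketch: flush when full, or when no entry has arrived for a while."""

    def __init__(self, send: Callable[[List[dict]], None],
                 max_batch: int = 2000, idle_seconds: float = 2.0,
                 clock: Callable[[], float] = time.monotonic):
        self._send = send
        self._max_batch = max_batch
        self._idle_seconds = idle_seconds
        self._clock = clock
        self._buffer: List[dict] = []
        self._last_add = clock()

    def add(self, entry: dict) -> None:
        """Queue an entry; a full buffer is sent immediately (high-throughput case)."""
        self._buffer.append(entry)
        self._last_add = self._clock()
        if len(self._buffer) >= self._max_batch:
            self.flush()

    def tick(self) -> None:
        """Call periodically, e.g. from a background thread (quiet-period case)."""
        if self._buffer and self._clock() - self._last_add >= self._idle_seconds:
            self.flush()

    def flush(self) -> None:
        """Hand the current buffer to send() and start a new one."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send(batch)
```

In a real client, send would post the batch to /v1/log/batch from a worker thread so the caller of add() never waits on the network.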

SDK Signatures

Full surface of the SDK.

python

quotakit.init(api_key: str)

# Tracking
quotakit.track(path: str, service: str, model: str) -> context manager

# Service configuration
quotakit.create_service(...)
quotakit.update_service(...)
quotakit.list_services()
quotakit.delete_service(service, model)

# Quota management
quotakit.create_quota(...)
quotakit.update_quota(...)
quotakit.list_quotas(...)
quotakit.delete_quota(...)

Service and quota management methods are thin wrappers over the REST endpoints documented in the API Reference below.

.result() Patterns

t.result(input_tokens, output_tokens, success, charged) normalizes usage across any provider. The shape of the response object varies — extract tokens however the provider exposes them.

python

# OpenAI chat-completions shape (Anthropic's SDK exposes usage.input_tokens / usage.output_tokens instead)
t.result(
    input_tokens=response.usage.prompt_tokens,
    output_tokens=response.usage.completion_tokens,
)

# Custom / non-standard provider
with quotakit.track("app/scraper", service="scraper_api", model="standard") as t:
    resp = scraper.fetch(...)
    usage = resp["meta"]["usage"]
    t.result(
        input_tokens=int(usage["in"]),
        output_tokens=int(usage["out"]),
    )

If tokens are omitted, QuotaKit falls back to the per-request estimate configured for that service+model. Cost is computed using the pricing in /api/sdk/services. See status semantics for how success/failure maps to cost handling.

Failed Charged vs Uncharged

Not all failures are the same. Some provider errors still consume tokens and incur cost; others (timeouts, connection drops) don't. Use the charged flag on .result() to explicitly control whether a failed call is billed. By default, charged mirrors success — so a failed call is uncharged unless you say otherwise.

python

# Failed + uncharged (default — charged mirrors success)
try:
    with quotakit.track("app/assistant", service="openai", model="gpt-4o") as t:
        response = provider_call()
        t.result(success=False)   # charged=False by default
except SomeError:   # placeholder: substitute your provider's exception type
    pass

# No .result() at all also logs as failed + uncharged
try:
    with quotakit.track("app/assistant", service="openai", model="gpt-4o"):
        provider_timeout_call()   # raises before .result() is called
except TimeoutError:
    pass

# Failed + charged — provider processed the request and billed you anyway
with quotakit.track("app/assistant", service="openai", model="gpt-4o") as t:
    response = provider_call()   # e.g. returned 403 but still billed
    t.result(
        success=False,
        charged=True,
        input_tokens=response.usage.prompt_tokens,
        output_tokens=response.usage.completion_tokens,
    )

Pass charged=True when the provider billed you despite the failure (e.g. a 403 that still consumed tokens). In analytics, failed+charged entries count against quota spend; failed+uncharged entries do not. The corresponding ingest payload rules are documented at /v1/log/batch.
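
The defaulting rule above ("charged mirrors success") fits in a few lines. This is only an illustration of the rule as described: resolve_charged is a hypothetical helper, not an SDK function, and treating success as True when omitted is an assumption based on the quickstart.

```python
from typing import Optional


def resolve_charged(success: bool = True, charged: Optional[bool] = None) -> bool:
    """Hypothetical helper: an explicit charged flag wins; otherwise it mirrors success."""
    return success if charged is None else charged


# The three cases covered in this section:
cases = {
    "success": resolve_charged(success=True),                          # charged
    "failed, default": resolve_charged(success=False),                 # uncharged
    "failed, billed anyway": resolve_charged(success=False, charged=True),  # charged
}
```

Only entries that resolve to charged count against quota spend.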

Quota Modes (Open / Block / Strict)

Quotas attach to hierarchy nodes and cascade to children. Every policy has a mode that determines how enforcement behaves. The default is open so production traffic is never blocked unless you opt in.

Open (default) - always allows the call. Overages are logged and flagged, but no exception is thrown.
Block - predicts locally and raises QuotaExceeded before the provider call.
Strict - reservation-based enforcement across pods. Blocks if a reservation cannot be acquired.

python

from quotakit import QuotaExceeded

try:
    with quotakit.track("app/team/feature", service="openai", model="gpt-4o") as t:
        response = client.chat.completions.create(...)
        t.result(input_tokens=..., output_tokens=...)
except QuotaExceeded as e:
    print(e.path)           # "app/team/feature"
    print(e.service)        # "openai"
    print(e.current_spend)  # current period spend in USD
    print(e.limit)          # configured limit

An ancestor node policy (e.g. on app) can block calls at any child path (e.g. app/team/feature). Manage policies via the Quota Editor (Projects tab) or /api/sdk/quotas. Node state is synced via /api/sdk/node-state and updated opportunistically from /v1/log/batch responses.
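
To make the cascade concrete, the sketch below expands a path into itself plus its ancestors and collects any policies found along the way. ancestor_paths, policies_for, and the in-memory policies dict are hypothetical; how QuotaKit resolves inheritance internally is not documented here.

```python
from typing import Dict, List


def ancestor_paths(path: str) -> List[str]:
    """Return the path and every ancestor, root first: 'a/b/c' -> ['a', 'a/b', 'a/b/c']."""
    parts = [p for p in path.split("/") if p]
    return ["/".join(parts[: i + 1]) for i in range(len(parts))]


def policies_for(path: str, policies: Dict[str, dict]) -> List[dict]:
    """Collect every policy set on the path itself or on one of its ancestors."""
    return [policies[node] for node in ancestor_paths(path) if node in policies]


# A single policy on the root node applies to a deeply nested child.
policies = {"app": {"mode": "block", "limit_dollars": 100}}
inherited = policies_for("app/team/feature", policies)
```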

Enforcement logic (simplified)

text

if mode == "open":
    allow()
elif mode == "block":
    projected = current_spend + pending_requests * avg_cost + estimated_cost
    if projected > limit:
        raise QuotaExceeded
    allow()
elif mode == "strict":
    if not reservation.acquire(estimated_cost):
        raise QuotaExceeded
    allow()

Block is enforced by the SDK before the provider call: the SDK checks local quota state (refreshed via background sync and flush-response updates) and raises QuotaExceeded if the estimated cost would exceed any block-mode limit. The ingest server records usage but does not reject entries that exceed block-mode quotas — enforcement is client-side.

However, block mode has no cross-server coordination. Two batches arriving simultaneously on different servers will each see the same pre-insert snapshot and may both approve entries that together exceed the limit. The potential overshoot is bounded by the total cost of all concurrent batches running at the moment the limit is crossed.
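
The race described above can be reproduced with plain arithmetic. In this sketch two servers check the same stale spend figure and both allow a call. The numbers and the block_allows helper are invented for illustration, and the check is reduced to spend plus estimated cost.

```python
def block_allows(current_spend: float, estimated_cost: float, limit: float) -> bool:
    """Simplified block-mode check against a locally cached spend figure."""
    return current_spend + estimated_cost <= limit


limit = 100.00
snapshot = 99.50   # both servers last synced this value
cost_a = 0.40      # server A's pending call
cost_b = 0.40      # server B's pending call

# Each server decides independently from the same snapshot.
a_ok = block_allows(snapshot, cost_a, limit)
b_ok = block_allows(snapshot, cost_b, limit)

# Both calls were individually within the limit, but together they exceed it.
actual_spend = snapshot + (cost_a if a_ok else 0.0) + (cost_b if b_ok else 0.0)
overshoot = max(0.0, actual_spend - limit)
```

Each call fits on its own, so both are approved, and the combined spend ends about 0.30 over the limit. That is the bounded overshoot that strict mode is designed to prevent.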

Strict uses reservations so every server shares one authoritative spend ledger. Availability is computed as:

text

available = limit - logged_spend - sum(all_active_reservations)

Reservations are sized to recent burn rate and renewed automatically. If a server crashes, its reservation expires within a few minutes and the quota is freed. Strict mode adds a network round-trip before the provider call and fails closed if the reservation service is unreachable.
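
The availability formula can be exercised with a small in-memory ledger. This is a sketch under stated assumptions: available_budget and try_reserve are hypothetical names, reservations are keyed by a made-up pod id, and expiry is modeled by simply deleting the entry.

```python
from typing import Dict


def available_budget(limit: float, logged_spend: float,
                     active_reservations: Dict[str, float]) -> float:
    """available = limit - logged_spend - sum(all_active_reservations)"""
    return limit - logged_spend - sum(active_reservations.values())


def try_reserve(limit: float, logged_spend: float,
                active_reservations: Dict[str, float],
                holder: str, amount: float) -> bool:
    """Grant the reservation only if it fits in what is still available."""
    if amount > available_budget(limit, logged_spend, active_reservations):
        return False
    active_reservations[holder] = active_reservations.get(holder, 0.0) + amount
    return True


reservations = {"pod-a": 5.0}
granted = try_reserve(100.0, 80.0, reservations, "pod-b", 10.0)   # fits: 15 available
denied = try_reserve(100.0, 80.0, reservations, "pod-c", 6.0)     # only 5 left

# If pod-a crashes, its reservation eventually expires and the budget is freed.
del reservations["pod-a"]
freed = available_budget(100.0, 80.0, reservations)
```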

Tradeoffs by mode

Open: zero latency, maximum availability, highest risk of overage (no blocking).
Block: zero extra round-trips, cannot overshoot within a single batch, small race window if multiple servers submit batches simultaneously.
Strict: strongest correctness across multiple servers, additional latency per call, may block if the reservation service is unavailable.

Worst-case overage estimate (block — concurrent servers only)

text

overage_usd <= C_max * (in_flight_total + R_total * T_sync)

C_max: max cost per request (USD or credits).
in_flight_total: concurrent requests already in flight across servers.
R_total: aggregate request rate (req/sec) across servers.
T_sync: SDK sync interval, adaptive from 10 to 120 s based on quota proximity; at ≥ 90% usage it drops to 10 s.
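
A worked example helps size the bound. The function below is a direct transcription of the inequality above; the traffic figures (0.05 USD per request, 20 requests in flight, 5 req/sec) are invented for illustration.

```python
def worst_case_overage(c_max: float, in_flight_total: int,
                       r_total: float, t_sync: float) -> float:
    """Upper bound on overage: C_max * (in_flight_total + R_total * T_sync)."""
    return c_max * (in_flight_total + r_total * t_sync)


# Near the limit the sync interval is 10 s; far from it, up to 120 s.
near_limit = worst_case_overage(c_max=0.05, in_flight_total=20, r_total=5, t_sync=10)
far_from_limit = worst_case_overage(c_max=0.05, in_flight_total=20, r_total=5, t_sync=120)
```

With these inputs the bound is about 3.50 USD at a 10 s sync interval and about 31.00 USD at 120 s, which shows why the shorter interval near the limit matters.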

Practical guidance

Use Open for experiments and low-risk paths. It is the default.
Use Block for most production paths. Single-server deployments get hard enforcement; multi-server deployments get a small concurrent-batch window.
Use Strict for hard caps where any overshoot is unacceptable — especially high-concurrency multi-server deployments.
Set quotas 5–10% below your real hard limit when using open or block modes with multiple servers.
Strict mode limit: each account may hold at most 100 open reservations at one time. If this cap is reached, track() raises QuotaExceeded with reason="reservation_limit_reached". Reservations are released automatically when POST /api/sdk/release-batch is called or when they expire.

Where to set it: Dashboard → Projects → select a node → Quota Editor → Mode (Open, Block, Strict). In the API, mode can be open, block, or strict.

Quota Events

When a call is prevented, QuotaKit records a quota event. These events appear in your analytics dashboard and count as prevented attempts without mixing them into request logs or spend totals.

json

{
  "event_id": "uuid",
  "path": "app/team/feature",
  "service": "openai",
  "model": "gpt-4o",
  "enforcement_mode": "block",
  "limit_type": "usd",
  "reason": "monthly spend limit exceeded"
}

Async + Streaming

v1 does not auto-wrap async or streaming clients. Instrument them manually — the pattern is the same: open track(), run your call, call .result() when you have final usage.

python

# Async
import asyncio, quotakit
from openai import AsyncOpenAI

quotakit.init(api_key="aisc_...")
client = AsyncOpenAI()

async def run():
    with quotakit.track("app/async", service="openai", model="gpt-4o") as t:
        resp = await client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": "Hello"}],
        )
        t.result(
            input_tokens=resp.usage.prompt_tokens,
            output_tokens=resp.usage.completion_tokens,
        )

asyncio.run(run())

python

# Streaming — finalize once the stream ends
with quotakit.track("app/stream", service="openai", model="gpt-4o") as t:
    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Stream this"}],
        stream=True,
        stream_options={"include_usage": True},  # without this, OpenAI streams carry no usage chunk
    )

    final_usage = None
    for chunk in stream:
        if getattr(chunk, "usage", None):
            final_usage = chunk.usage

    if final_usage is not None:
        t.result(
            input_tokens=final_usage.prompt_tokens,
            output_tokens=final_usage.completion_tokens,
        )

API Reference

Authentication

All SDK and ingest endpoints authenticate with your QuotaKit API key as a Bearer token.

bash

curl https://ingest.quotakit.io/api/sdk/services \
  -H "Authorization: Bearer aisc_..."

Server-side only — never call this API from a browser

QuotaKit API keys must be kept secret. Embedding a key in browser JavaScript exposes it to anyone who opens devtools — they could log arbitrary usage against your account or read your quota configuration. The ingest API has no CORS support and is not designed for browser clients. Call it only from your backend: a Node.js server, a Python service, or a serverless function.
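
From a Python backend, the same authenticated request can be built with the standard library. This is a minimal sketch: build_request is a hypothetical helper, the key is the placeholder from the curl example, and the request is only constructed here, not sent.

```python
import urllib.request

BASE_URL = "https://ingest.quotakit.io"


def build_request(path: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request carrying the API key as a Bearer token."""
    return urllib.request.Request(
        BASE_URL + path,
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


req = build_request("/api/sdk/services", "aisc_...")
# To send it from server-side code: urllib.request.urlopen(req, timeout=10)
```

Loading the key from server-side configuration rather than hard-coding it keeps it out of source control.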

Monthly spend cap

Paid plans with metered overage support a customer-set dollar ceiling. Set it from the Billing tab. When your projected ingest overage cost reaches the cap, new SDK calls receive a 402 spend_cap_reached response and batches are dropped until the next billing period (or until you raise/remove the cap). Rejections are served from a Redis fast-path so hitting the cap costs near-zero per request.

Rate limits

All limits are per API key, per 1-minute sliding window. Requests that exceed the limit receive a 429 rate_limit_exceeded response. The counter resets automatically — no backoff is required beyond a brief retry.

Endpoint	Limit	Notes
GET /api/sdk/services	120 req / min	Service pricing reads. Low traffic — called once at SDK init.
GET /api/sdk/quotas	240 req / min	Quota policy reads.
GET /api/sdk/node-state	240 req / min	SDK background sync. Adaptive: syncs every 2 min for open-mode paths, down to 10 s as paths approach block/strict limits.
POST /api/sdk/reserve-batch	1200 req / min	Strict-mode reservation acquire/renew. One request per active (path, service, model) per TTL window.
POST /api/sdk/release-batch	1200 req / min	Strict-mode reservation release. Fire-and-forget; called on quota exhaustion or process exit.
/v1/log/batch	300 req / min	Usage log ingest. The SDK batches entries adaptively — filling to ~2,000 entries per call during high throughput, or flushing sooner when traffic is light.
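
If you call these endpoints directly rather than through the SDK, a client-side limiter can keep you under the per-minute limits. The sketch below is one way to implement a sliding window. SlidingWindowLimiter is a hypothetical class, and the server's own accounting may differ in detail.

```python
import time
from collections import deque
from typing import Callable, Deque


class SlidingWindowLimiter:
    """Allow at most `limit` calls in any `window_seconds` span (client-side sketch)."""

    def __init__(self, limit: int, window_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        self._stamps: Deque[float] = deque()

    def try_acquire(self) -> bool:
        now = self._clock()
        # Drop timestamps that have slid out of the window.
        while self._stamps and now - self._stamps[0] >= self._window:
            self._stamps.popleft()
        if len(self._stamps) >= self._limit:
            return False
        self._stamps.append(now)
        return True


# e.g. stay under the 120 req/min limit on GET /api/sdk/services
services_limiter = SlidingWindowLimiter(limit=120)
```

When try_acquire() returns False, wait briefly and try again instead of sending a request that would receive a 429.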

/api/sdk/services

bash

GET    /api/sdk/services
POST   /api/sdk/services
PUT    /api/sdk/services
DELETE /api/sdk/services?service=<name>&model=<name>

Defines how QuotaKit prices a service+model combination. Used to compute cost (in USD or credits, per currency_type) from token counts and the per-request price.

json

{
  "service": "scraper_api",
  "model": "standard",
  "currency_type": "credits",
  "price_per_request": 3,
  "price_per_input_unit": 0,
  "input_unit_size": 1000000,
  "price_per_output_unit": 0,
  "output_unit_size": 1000000
}

409 on POST with a duplicate service+model.
404 on PUT/DELETE when the service+model doesn't exist.
400 for invalid currency_type or non-numeric price fields.
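
The pricing fields suggest how a cost could be derived from a service definition. The formula below is an assumption inferred from the field names (a flat per-request price plus a price per unit of tokens), not QuotaKit's documented calculation. compute_cost and the token prices in the second example are invented for illustration.

```python
def compute_cost(pricing: dict, input_tokens: int = 0, output_tokens: int = 0) -> float:
    """Assumed formula: per-request price plus token price scaled by unit size."""
    return (
        pricing.get("price_per_request", 0)
        + pricing.get("price_per_input_unit", 0) * input_tokens / pricing.get("input_unit_size", 1)
        + pricing.get("price_per_output_unit", 0) * output_tokens / pricing.get("output_unit_size", 1)
    )


# The scraper_api definition above: flat 3 credits per request, tokens are free.
scraper_pricing = {
    "price_per_request": 3,
    "price_per_input_unit": 0, "input_unit_size": 1000000,
    "price_per_output_unit": 0, "output_unit_size": 1000000,
}
scraper_cost = compute_cost(scraper_pricing, input_tokens=500, output_tokens=200)

# Illustrative token pricing (made-up rates per 1M tokens).
token_pricing = {
    "price_per_request": 0,
    "price_per_input_unit": 2.5, "input_unit_size": 1000000,
    "price_per_output_unit": 10.0, "output_unit_size": 1000000,
}
token_cost = compute_cost(token_pricing, input_tokens=100, output_tokens=60)
```

Under this assumed formula the scraper call costs 3 credits regardless of token counts, and the token-priced call costs about 0.00085.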

/api/sdk/quotas

bash

GET    /api/sdk/quotas?node_path=app&service=openai&model=gpt-4o
POST   /api/sdk/quotas
PUT    /api/sdk/quotas
DELETE /api/sdk/quotas?node_path=app&service=openai&model=gpt-4o

json

{
  "node_path": "app/api",
  "service": "openai",
  "model": "gpt-4o",
  "limit_dollars": 100,
  "window_type": "monthly",
  "mode": "strict"
}

mode: open (default), block, or strict.
limit_credits requires service and model.
409 on POST duplicate scope; 404 on PUT/DELETE missing scope.

Strict mode uses reservations for cross-pod coordination and may allow small overshoot within a reservation window.

/api/sdk/node-state

bash

GET /api/sdk/node-state?path=app/api

Used by the SDK's background sync thread. Returns the effective policy state and current spend totals for a node path, which the SDK caches locally for fast quota checks without a per-call network round-trip.

/v1/log/batch

Ingest endpoint for batched usage logs. The SDK calls this asynchronously — you don't call it directly in normal usage, but the schema is useful when building integrations or debugging.

json

{
  "entries": [
    {
      "path": "app/api",
      "service": "openai",
      "model": "gpt-4o",
      "input_tokens": 100,
      "output_tokens": 60,
      "usd": 0.001,
      "status": "success",
      "request_id": "uuid"
    },
    {
      "path": "app/api",
      "service": "openai",
      "model": "gpt-4o",
      "status": "failed",
      "usd": 0.0007,
      "request_id": "uuid2"
    }
  ]
}

entries must be non-empty.
Each entry requires path and service.
Paths are validated for format and max depth.
A failed entry is charged if usd or credits is non-null and positive; uncharged otherwise. The SDK sets this automatically based on the charged flag.
quota_state (response field): an array of path + policy objects with approximate current_spend, returned for up to 5 paths in the batch. Values reflect a pre-insert snapshot and may be off by one batch's worth of spend. The SDK uses this to update its local quota state without a separate sync call; for an authoritative figure use the dashboard or a fresh node-state call.
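
When building an integration, the request rules above can be checked before sending. This is a minimal client-side sketch: validate_batch and failed_entry_is_charged are hypothetical helpers, and path format and depth validation are left to the server.

```python
from typing import List


def validate_batch(payload: dict) -> List[str]:
    """Return a list of problems; an empty list means the basic rules pass."""
    entries = payload.get("entries")
    if not entries:
        return ["entries must be non-empty"]
    errors = []
    for i, entry in enumerate(entries):
        for field in ("path", "service"):
            if not entry.get(field):
                errors.append(f"entries[{i}] missing {field}")
    return errors


def failed_entry_is_charged(entry: dict) -> bool:
    """A failed entry is charged if usd or credits is non-null and positive."""
    amounts = (entry.get("usd"), entry.get("credits"))
    return any(a is not None and a > 0 for a in amounts)


payload = {
    "entries": [
        {"path": "app/api", "service": "openai", "status": "failed", "usd": 0.0007},
        {"path": "app/api", "service": "openai", "status": "failed"},
    ]
}
problems = validate_batch(payload)
charged_flags = [failed_entry_is_charged(e) for e in payload["entries"]]
```

Here the first failed entry is charged because it carries a positive usd amount, and the second is uncharged.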

Status semantics

Status	Meaning	Cost handling
success	Provider call completed successfully.	Cost computed from tokens or per-request estimate.
failed	Call failed after starting (transport, provider, or app error).	Uncharged by default. Pass charged=True to .result() to log cost.

Prevented calls (blocked by quota enforcement before the provider call was made) are not recorded as request logs. They appear as quota events instead, with zero charge, so they never inflate your spend totals.

Error matrix

Status	Meaning
401	Missing or invalid Authorization header
402	quota_exceeded (strict-mode quota denied) or spend_cap_reached (monthly spend cap reached)
404	service or quota not found (update / delete)
409	duplicate service or quota scope (create)
429	rate_limit_exceeded (API key rate limit) or ingest_api_call_limit_reached (plan call limit)
500	internal server error