AI inference
Concept
Apps that use AI features call Greenlight’s inference gateway under /ai/*, not a model vendor directly. The platform routes the call to the configured provider, applies the org’s model aliases, enforces budget controls, and writes every prompt and response to the audit log.
This is the same pattern as the data broker, applied to model APIs: the app never holds a model vendor’s API key, and IT controls which models are available and how they’re billed.
Why a gateway
Model APIs have the same problems as any other upstream credential: the keys end up in code, the bills become opaque, the audit trail lives in whatever the vendor’s console gives you, and switching vendors means changing app code. A gateway puts all four problems behind one surface: the app doesn’t see the key, IT sees one bill, the audit trail is in the same audit log as everything else, and the org can swap providers without app changes.
The gateway also makes it possible to enforce things vendor APIs don’t: per-app rate limits, per-model budgets, prompt-pattern policies, and a kill switch for runaway loops.
Model aliases
Apps don’t call models by vendor name. They call models by alias — a logical name the organization defines and binds to a specific provider/model combination.
| Alias | Bound to (example) |
|---|---|
| summarize-fast | Claude Haiku in dev, GPT-4o-mini in prod |
| summarize-deep | Claude Sonnet 4.6 |
| embedding-default | OpenAI text-embedding-3-small |
| classifier | Anthropic Haiku 4.5 |
The mapping lives in the dashboard. Changing the binding doesn’t change a line of app code. Switching from OpenAI to Anthropic for the summarize-fast alias is a row update; apps consuming it pick up the new model on their next request.
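To make the alias idea concrete, here is a minimal sketch of how an alias table could be resolved to a provider/model pair. It is not Greenlight's implementation: `aliasBindings`, `resolveAlias`, the per-environment layout, and the `claude-haiku` model string are all hypothetical stand-ins for whatever the dashboard stores.

```javascript
// Hypothetical alias table. In Greenlight the real mapping lives in the
// dashboard; the shape and the model strings here are illustrative only.
const aliasBindings = new Map([
  ['summarize-fast', {
    dev: { provider: 'anthropic', model: 'claude-haiku' }, // placeholder id
    prod: { provider: 'openai', model: 'gpt-4o-mini' },
  }],
  ['embedding-default', {
    default: { provider: 'openai', model: 'text-embedding-3-small' },
  }],
]);

// Resolve an alias to a concrete binding for one environment, falling back
// to a "default" binding when no per-environment row exists.
function resolveAlias(bindings, alias, environment) {
  const entry = bindings.get(alias);
  if (!entry) {
    throw new Error(`Unknown model alias: ${alias}`);
  }
  const binding = entry[environment] ?? entry.default;
  if (!binding) {
    throw new Error(`Alias ${alias} has no binding for ${environment}`);
  }
  return binding;
}
```

In this sketch, rebinding `summarize-fast` is a change to the table only. Callers keep passing the same alias string and see the new binding on the next lookup.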
Providers
Greenlight’s inference gateway routes to whichever providers IT enables for the organization. Common providers:
- Azure AI Foundry (the default target for Azure installs)
- OpenAI direct
- Anthropic direct
- Amazon Bedrock
- On-prem LiteLLM or vLLM endpoints
Each provider’s credential lives in Key Vault. The gateway swaps the credential in at request time, just like the data broker does for HTTP integrations.
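The credential swap can be pictured roughly as follows. This is a sketch under assumptions, not the gateway's actual code: `buildUpstreamHeaders` is a hypothetical name, and `secrets` is a plain `Map` standing in for a Key Vault lookup. The header names are the ones the OpenAI and Anthropic public APIs expect; the other providers are omitted for brevity.

```javascript
// Hypothetical sketch of request-time credential injection in a gateway.
// `secrets` stands in for a Key Vault lookup; here it is just a Map.
function buildUpstreamHeaders(provider, secrets) {
  const apiKey = secrets.get(provider);
  if (!apiKey) {
    throw new Error(`No credential configured for provider: ${provider}`);
  }
  const headers = { 'Content-Type': 'application/json' };
  switch (provider) {
    case 'openai':
      // OpenAI's API authenticates with a bearer token.
      headers['Authorization'] = `Bearer ${apiKey}`;
      break;
    case 'anthropic':
      // Anthropic's API uses an x-api-key header plus a version header.
      headers['x-api-key'] = apiKey;
      headers['anthropic-version'] = '2023-06-01';
      break;
    default:
      throw new Error(`Unsupported provider in this sketch: ${provider}`);
  }
  return headers;
}
```

The point of the shape is that the key exists only on the gateway side of the call. The app's request never carries it, so rotating a credential does not touch app code or app configuration.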
What an app sends
The wire format is OpenAI-compatible. Apps that already speak the OpenAI API change one base URL and one header:

    const response = await fetch(
      `${process.env.GREENLIGHT_PROXY_URL}/ai/v1/chat/completions`,
      {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          model: 'summarize-fast',
          messages: [{ role: 'user', content: 'Summarize this in two sentences.' }],
        }),
      }
    );

No API key. No vendor SDK. The model identifier is the alias.
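Because the wire format is OpenAI-compatible, the response can be read the way an OpenAI chat completion is read. The helpers below are a small sketch of wrapping the call and extracting the reply text. `buildChatRequest`, `extractText`, and `summarize` are hypothetical names, and the sketch assumes the gateway returns the standard `choices[0].message.content` shape unchanged.

```javascript
// Build the request for an alias. Nothing vendor-specific appears here.
function buildChatRequest(baseUrl, alias, userPrompt) {
  return {
    url: `${baseUrl}/ai/v1/chat/completions`,
    init: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: alias,
        messages: [{ role: 'user', content: userPrompt }],
      }),
    },
  };
}

// Pull the assistant text out of an OpenAI-style chat completion body.
function extractText(completion) {
  const content = completion?.choices?.[0]?.message?.content;
  if (typeof content !== 'string') {
    throw new Error('Unexpected chat completion shape');
  }
  return content;
}

// Hypothetical wrapper tying the two together. `fetchImpl` is injectable
// so the function can be exercised without a network.
async function summarize(text, fetchImpl = fetch) {
  const { url, init } = buildChatRequest(
    process.env.GREENLIGHT_PROXY_URL,
    'summarize-fast',
    `Summarize this in two sentences.\n\n${text}`
  );
  const response = await fetchImpl(url, init);
  if (!response.ok) {
    throw new Error(`Gateway returned HTTP ${response.status}`);
  }
  return extractText(await response.json());
}
```

Keeping the alias as a plain argument means the same helper serves `summarize-fast` and `summarize-deep` without knowing which vendor sits behind either.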
Budgets and rate limits
Per-org and per-app budgets are configured in the dashboard. The gateway tracks spend in real time and rejects calls that would exceed the cap with a structured error the app can handle. Per-model rate limits prevent a single misbehaving loop from exhausting the org’s monthly quota.
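An app can decide what to do with a rejected call by inspecting the structured error. The exact error format is not specified here, so the sketch below invents one for illustration: the `error.code` field and the values `budget_exceeded` and `rate_limited` are assumptions, as is the function name `classifyGatewayError`.

```javascript
// ASSUMPTION: the body shape { error: { code } } and the code values are
// placeholders; check the gateway's real error format before relying on them.
function classifyGatewayError(status, body) {
  const code = body?.error?.code;
  if (code === 'budget_exceeded') {
    // Retrying cannot succeed until the cap resets or is raised.
    return { retry: false, reason: 'budget' };
  }
  if (status === 429 || code === 'rate_limited') {
    // Rate limits are transient; back off and try again.
    return { retry: true, reason: 'rate_limit' };
  }
  if (status >= 500) {
    return { retry: true, reason: 'upstream' };
  }
  return { retry: false, reason: 'other' };
}
```

The distinction that matters is between transient rejections, which are worth retrying with backoff, and budget rejections, where a retry loop would only add rejected calls.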
A runaway-loop kill switch is available as a policy option: it automatically suspends an app that issues more than a configurable number of calls within a short window.
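One common way to implement this kind of policy is a sliding-window counter per app. The sketch below shows the idea only. How Greenlight detects loops is not described here, and `createLoopDetector`, `threshold`, and `windowMs` are hypothetical names for the configurable values.

```javascript
// Sliding-window call counter with suspension. `threshold` and `windowMs`
// stand in for the configurable policy values; the names are hypothetical.
function createLoopDetector({ threshold, windowMs }) {
  const callsByApp = new Map(); // appId -> timestamps (ms) inside the window
  const suspended = new Set();

  return {
    // Record one call at time `now`; returns true if the call is allowed.
    record(appId, now = Date.now()) {
      if (suspended.has(appId)) {
        return false;
      }
      const recent = (callsByApp.get(appId) ?? []).filter(
        (timestamp) => now - timestamp < windowMs
      );
      recent.push(now);
      callsByApp.set(appId, recent);
      if (recent.length > threshold) {
        suspended.add(appId); // stays suspended until resumed explicitly
        return false;
      }
      return true;
    },
    isSuspended(appId) {
      return suspended.has(appId);
    },
    // Manual re-enable, for example after an operator has reviewed the app.
    resume(appId) {
      suspended.delete(appId);
      callsByApp.delete(appId);
    },
  };
}
```

Unlike a rate limit, which rejects individual calls and recovers by itself, the suspension in this sketch is sticky: once tripped, the app stays blocked until `resume` is called.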