
AI inference

Concept

Apps that use AI features call Greenlight’s inference gateway under /ai/*, not a model vendor directly. The platform routes the call to the configured provider, applies the org’s model aliases, enforces budget controls, and writes every prompt and response to the audit log.

This is the same pattern as the data broker, applied to model APIs: the app never holds a model vendor’s API key, and IT controls which models are available and how they’re billed.
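The order of operations described above can be sketched as a single request handler. This is a minimal illustration, not Greenlight’s implementation: `handleInference`, the `deps` object, the error `type` strings, and the status codes are all hypothetical, and the provider call is synchronous here only to keep the sketch short (a real gateway would await an HTTP request).

```javascript
// Hypothetical sketch of the /ai/* gateway pipeline. None of these names are
// Greenlight APIs; they only show the order: alias, budget, route, audit.
function handleInference(request, deps) {
  const { aliases, budget, providers, auditLog } = deps;

  // 1. Resolve the alias the app sent into a concrete provider/model pair.
  const binding = aliases[request.body.model];
  if (!binding) {
    return { status: 400, body: { error: { type: 'unknown_alias' } } };
  }

  // 2. Enforce budget controls before any upstream call is made.
  if (!budget.allow(request.appId)) {
    return { status: 429, body: { error: { type: 'budget_exceeded' } } };
  }

  // 3. Route to the configured provider, swapping the alias for the real model.
  const callProvider = providers[binding.provider];
  if (!callProvider) {
    return { status: 502, body: { error: { type: 'provider_unavailable' } } };
  }
  const upstreamBody = { ...request.body, model: binding.model };
  const response = callProvider(upstreamBody);

  // 4. Write the prompt and the response to the audit log.
  auditLog.push({
    appId: request.appId,
    alias: request.body.model,
    provider: binding.provider,
    model: binding.model,
    prompt: request.body.messages,
    response,
  });

  return { status: 200, body: response };
}
```

Because the app only ever names the alias, every decision in steps 1 through 3 stays on the gateway side.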

Why a gateway

Model APIs have the same problems as any other upstream credential: the keys end up in code, the bills become opaque, the audit trail lives in whatever the vendor’s console gives you, and switching vendors means changing app code. A gateway puts all four problems behind one surface: the app doesn’t see the key, IT sees one bill, the audit trail is in the same audit log as everything else, and the org can swap providers without app changes.

The gateway also makes it possible to enforce things vendor APIs don’t: per-app rate limits, per-model budgets, prompt-pattern policies, and a kill switch for runaway loops.
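To make “prompt-pattern policies” concrete, here is a minimal sketch of one way such a check could work: each rule is a regular expression tested against every message before the call leaves the org. The rule format, the rule names, and `findPolicyViolation` are hypothetical; Greenlight’s actual policy syntax may differ.

```javascript
// Hypothetical prompt-pattern rules. Each pattern is tested against the text
// of every message; the first match blocks the request.
const promptPolicies = [
  // Roughly 13 to 16 digits, optionally separated by spaces or dashes.
  { name: 'no-card-numbers', pattern: /\b(?:\d[ -]?){13,16}\b/ },
  // PEM-encoded private keys pasted into a prompt.
  { name: 'no-private-keys', pattern: /-----BEGIN [A-Z ]*PRIVATE KEY-----/ },
];

// Returns { rule, role } for the first violation found, or null if allowed.
function findPolicyViolation(messages, policies) {
  for (const message of messages) {
    // Skip non-text content (for example, structured multimodal parts).
    if (typeof message.content !== 'string') continue;
    for (const rule of policies) {
      if (rule.pattern.test(message.content)) {
        return { rule: rule.name, role: message.role };
      }
    }
  }
  return null;
}
```

The patterns deliberately omit the `g` flag, so `RegExp.prototype.test` keeps no `lastIndex` state between calls.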

Model aliases

Apps don’t call models by vendor name. They call models by alias — a logical name the organization defines and binds to a specific provider/model combination.

Alias               Bound to (example)
summarize-fast      Claude Haiku in dev, GPT-4o-mini in prod
summarize-deep      Claude Sonnet 4.6
embedding-default   OpenAI text-embedding-3-small
classifier          Anthropic Claude Haiku 4.5

The mapping lives in the dashboard. Changing the binding doesn’t change a line of app code. Switching from OpenAI to Anthropic for the summarize-fast alias is a row update; apps consuming it pick up the new model on their next request.
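Alias resolution, including an environment-specific binding like `summarize-fast` in the table above, can be sketched as a lookup. The storage shape, `aliasBindings`, and `resolveAlias` are hypothetical, and the model identifier strings are illustrative rather than the exact IDs Greenlight stores.

```javascript
// Hypothetical alias table as the dashboard might store it. A binding is
// either one provider/model pair or a per-environment map of such pairs.
const aliasBindings = {
  'summarize-fast': {
    dev: { provider: 'anthropic', model: 'claude-haiku-4-5' },
    prod: { provider: 'openai', model: 'gpt-4o-mini' },
  },
  'embedding-default': { provider: 'openai', model: 'text-embedding-3-small' },
};

// Looks up the provider/model pair an alias points to in a given environment.
function resolveAlias(bindings, alias, environment) {
  const entry = bindings[alias];
  if (entry === undefined) {
    throw new Error(`Unknown model alias: ${alias}`);
  }
  // Plain bindings carry a provider field directly; per-environment bindings
  // are keyed by environment name.
  const binding = 'provider' in entry ? entry : entry[environment];
  if (binding === undefined) {
    throw new Error(`Alias ${alias} has no binding for ${environment}`);
  }
  return binding;
}
```

In this shape, rebinding `summarize-fast` in prod is a single assignment to `aliasBindings['summarize-fast'].prod`; the next `resolveAlias` call returns the new pair with no change to the calling app.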

Providers

Greenlight’s inference gateway routes to whichever providers IT enables for the organization. Common providers:

  • Azure AI Foundry (the default target for Azure installs)
  • OpenAI direct
  • Anthropic direct
  • Amazon Bedrock
  • On-prem LiteLLM or vLLM endpoints

Each provider’s credential lives in Key Vault. The gateway swaps the credential in at request time, just like the data broker does for HTTP integrations.
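The request-time credential swap can be sketched as building the upstream headers from an allowlist and then injecting the provider’s key. The header schemes shown are the documented ones for OpenAI (a bearer token in `Authorization`) and Anthropic’s native API (`x-api-key`); everything else, including `buildUpstreamHeaders`, the allowlist, and the `getSecret` callback standing in for a Key Vault lookup, is a hypothetical illustration.

```javascript
// Only these inbound headers are forwarded; anything credential-like the app
// sends (Authorization, x-api-key, ...) is dropped by omission.
const FORWARDED_HEADERS = new Set(['content-type', 'accept']);

// How each provider expects its API key to be presented.
const authSchemes = {
  openai: (key) => ({ Authorization: `Bearer ${key}` }),
  anthropic: (key) => ({ 'x-api-key': key }),
};

// getSecret is a hypothetical stand-in for a Key Vault read at request time.
function buildUpstreamHeaders(inboundHeaders, provider, getSecret) {
  const scheme = authSchemes[provider];
  if (!scheme) {
    throw new Error(`No auth scheme configured for provider ${provider}`);
  }
  const headers = {};
  for (const [name, value] of Object.entries(inboundHeaders)) {
    if (FORWARDED_HEADERS.has(name.toLowerCase())) {
      headers[name] = value;
    }
  }
  // The provider credential is added last, so the app can never override it.
  return { ...headers, ...scheme(getSecret(provider)) };
}
```

An allowlist is used rather than a denylist so that a credential header the gateway doesn’t know about is never forwarded by accident.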

What an app sends

The wire format is OpenAI-compatible. Apps that already speak the OpenAI API change the base URL and drop the API key header:

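const response = await fetch(
  `${process.env.GREENLIGHT_PROXY_URL}/ai/v1/chat/completions`,
  {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'summarize-fast', // the alias, not a vendor model name
      messages: [{ role: 'user', content: 'Summarize this in two sentences.' }],
    }),
  }
);

// The response body follows the OpenAI chat completions shape.
if (!response.ok) {
  throw new Error(`Inference call failed with status ${response.status}`);
}
const data = await response.json();
console.log(data.choices[0].message.content);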

No API key. No vendor SDK. The model identifier is the alias.

Budgets and rate limits

Per-org and per-app budgets are configured in the dashboard. The gateway tracks spend in real time and rejects calls that would exceed the cap with a structured error the app can handle. Per-model rate limits prevent a single misbehaving loop from exhausting the org’s monthly quota.
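The “would exceed the cap” check can be sketched as a guard that compares recorded spend plus the estimated cost of the next call against both the org cap and the app cap. `BudgetGuard`, its field names, and the error object shape are hypothetical; amounts are integer cents to avoid floating-point drift, and a real gateway would persist spend rather than keep it in memory.

```javascript
// Hypothetical in-memory budget guard with one org cap and optional per-app caps.
class BudgetGuard {
  constructor({ orgCapCents, appCapsCents }) {
    this.orgCapCents = orgCapCents;
    this.appCapsCents = appCapsCents; // e.g. { 'app-1': 5000 }
    this.orgSpentCents = 0;
    this.appSpentCents = new Map();
  }

  // Returns null if a call with this estimated cost fits under both caps,
  // otherwise a structured error the app can branch on.
  check(appId, estimatedCents) {
    const appSpent = this.appSpentCents.get(appId) ?? 0;
    const appCap = this.appCapsCents[appId] ?? Infinity; // no cap configured
    if (this.orgSpentCents + estimatedCents > this.orgCapCents) {
      return { error: { type: 'budget_exceeded', scope: 'org', capCents: this.orgCapCents, spentCents: this.orgSpentCents } };
    }
    if (appSpent + estimatedCents > appCap) {
      return { error: { type: 'budget_exceeded', scope: 'app', capCents: appCap, spentCents: appSpent } };
    }
    return null;
  }

  // Records the actual cost once the provider has responded.
  record(appId, actualCents) {
    this.orgSpentCents += actualCents;
    this.appSpentCents.set(appId, (this.appSpentCents.get(appId) ?? 0) + actualCents);
  }
}
```

Checking before the call and recording after it keeps the cap from being overshot by a request that was admitted on stale numbers, as long as the cost estimate is not too low.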

A runaway-loop kill switch is available as a policy option: it automatically suspends an app that issues more than a configurable number of calls within a short window.
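A kill switch of this kind can be sketched as a sliding-window counter per app that trips once the threshold is exceeded and stays tripped until someone resumes the app. `RunawayLoopKillSwitch`, `maxCalls`, and `windowMs` are hypothetical names; timestamps are passed in explicitly so the behavior is deterministic.

```javascript
// Hypothetical sliding-window kill switch: suspend an app once it issues more
// than maxCalls requests within windowMs milliseconds.
class RunawayLoopKillSwitch {
  constructor({ maxCalls, windowMs }) {
    this.maxCalls = maxCalls;
    this.windowMs = windowMs;
    this.calls = new Map(); // appId -> timestamps of calls still in the window
    this.suspended = new Set();
  }

  // Returns true if the call may proceed, false if the app is suspended.
  recordCall(appId, nowMs) {
    if (this.suspended.has(appId)) return false;
    // Keep only calls that fall inside the current window, then add this one.
    const recent = (this.calls.get(appId) ?? []).filter((t) => nowMs - t < this.windowMs);
    recent.push(nowMs);
    this.calls.set(appId, recent);
    if (recent.length > this.maxCalls) {
      // Suspension persists across windows until resume() is called.
      this.suspended.add(appId);
      return false;
    }
    return true;
  }

  // Clears the suspension and the call history, e.g. after IT reviews the app.
  resume(appId) {
    this.suspended.delete(appId);
    this.calls.delete(appId);
  }
}
```

Unlike an ordinary rate limit, which only rejects the excess calls, this trips once and keeps the app blocked, which is what stops a runaway loop from resuming as soon as the window slides.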
