
Inference

Definition

Inference is the act of running a trained AI model on new input data to produce a prediction or generated output -- as distinct from training, which is the computationally expensive process of building the model. In production AI systems, inference cost and latency are often the primary engineering constraints. API providers typically bill per token rather than per call: GPT-4-family list prices have ranged from roughly $0.01 to $0.06 per 1,000 tokens, so the cost of a single request scales with the length of its prompt and response.

Training a large language model costs millions of dollars and happens once (or rarely). Inference happens every time a user sends a message, every time your automation pipeline processes a document, and every time your agent takes a step. At scale, inference cost dominates the AI budget.

Inference optimization levers

  • Model selection -- smaller models (Claude Haiku, GPT-4o mini) cost roughly 10-50x less per token than frontier models
  • Prompt caching -- Anthropic and OpenAI cache static prompt prefixes, reducing cost on repeated system prompts
  • Batching -- group requests to maximize GPU utilization for non-latency-sensitive workloads
  • Quantization -- use 4-bit or 8-bit weights for self-hosted models to cut memory and cost
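
To make the quantization lever concrete, here is a minimal sketch of the memory arithmetic behind it. It counts weight storage only (parameters times bits per weight) and ignores activations, KV cache, and quantization metadata, so real footprints will be somewhat higher. The function name and the 7-billion-parameter model size are illustrative assumptions, not figures from any specific model.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed for model weights alone, in GB (10^9 bytes)."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1e9


# Hypothetical 7-billion-parameter model at three common precisions.
HYPOTHETICAL_PARAMS = 7e9

for bits in (16, 8, 4):
    gb = weight_memory_gb(HYPOTHETICAL_PARAMS, bits)
    print(f"{bits:>2}-bit weights: {gb:.1f} GB")
```

Halving the bits per weight halves the weight memory, which is often what decides whether a self-hosted model fits on a single GPU.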

Inference in system design

For high-volume AI features (millions of calls per month), inference cost is a product pricing input, not just an ops detail. Model the expected call volume and token counts before committing to an architecture.
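
A back-of-the-envelope model of call volume and token counts might look like the sketch below. All names (Workload, Pricing, cost_per_call, monthly_cost) are hypothetical, and the prices and volumes are placeholders rather than any provider's actual rates; substitute current list prices and your own traffic estimates.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    calls_per_month: int
    input_tokens_per_call: int
    output_tokens_per_call: int


@dataclass
class Pricing:
    # Placeholder rates in USD per million tokens, NOT real provider prices.
    usd_per_million_input: float
    usd_per_million_output: float


def cost_per_call(w: Workload, p: Pricing) -> float:
    """Cost of one request: input and output tokens are billed separately."""
    input_cost = w.input_tokens_per_call * p.usd_per_million_input
    output_cost = w.output_tokens_per_call * p.usd_per_million_output
    return (input_cost + output_cost) / 1_000_000


def monthly_cost(w: Workload, p: Pricing) -> float:
    """Total monthly inference spend for the workload."""
    return w.calls_per_month * cost_per_call(w, p)


# Illustrative numbers only: 2M calls/month, 1,500 input + 300 output tokens each.
workload = Workload(calls_per_month=2_000_000,
                    input_tokens_per_call=1_500,
                    output_tokens_per_call=300)
pricing = Pricing(usd_per_million_input=1.00, usd_per_million_output=4.00)

print(f"Per call:  ${cost_per_call(workload, pricing):.4f}")
print(f"Per month: ${monthly_cost(workload, pricing):,.2f}")
```

Running the same workload against two or three candidate pricing tiers shows quickly whether a feature's margin survives on a frontier model or needs a smaller one.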
