Inference
Definition
Inference is the act of running a trained AI model on new input data to produce a prediction or generated output -- as distinct from training, which is the computationally expensive process of building the model. In production AI systems, inference cost and latency are often the primary engineering constraints: a single GPT-4 call costs on the order of $0.01-$0.06, depending on how many input and output tokens it uses.
Training a large language model costs millions of dollars and happens once (or rarely). Inference happens every time a user sends a message, every time your automation pipeline processes a document, and every time your agent takes a step. At scale, inference cost dominates the AI budget.
Inference optimization levers
- Model selection -- smaller models (Claude Haiku, GPT-4o mini) can cost roughly 10-50x less per token than frontier models
- Prompt caching -- Anthropic and OpenAI cache static prompt prefixes, reducing cost and latency when the same system prompt is sent repeatedly
- Batching -- group requests to maximize GPU utilization for non-latency-sensitive workloads
- Quantization -- use 4-bit or 8-bit weights for self-hosted models to cut memory and cost
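Two of the levers above, batching and quantization, come down to simple arithmetic. The following is a minimal sketch, not a production implementation: the helper names batched and weight_memory_gb are hypothetical, the 7-billion-parameter model is an assumed example, and the memory figure covers the weights only (it ignores activations, the KV cache, and runtime overhead).

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive batches of at most batch_size items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the model weights alone, in GB (1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # Batching: group ten queued documents into batches of up to four.
    docs = [f"doc-{i}" for i in range(10)]
    for batch in batched(docs, 4):
        print(f"batch of {len(batch)} starting at {batch[0]}")

    # Quantization: weight memory for an assumed 7B-parameter model.
    for bits in (16, 8, 4):
        print(f"{bits}-bit weights: {weight_memory_gb(7e9, bits):.1f} GB")
```

Under these assumptions, moving from 16-bit to 4-bit weights cuts weight memory from about 14 GB to about 3.5 GB, which is often the difference between needing a larger GPU and fitting on a smaller one.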
Inference in system design
For high-volume AI features (millions of calls per month), inference cost is a product pricing input, not just an ops detail. Model the expected call volume and token counts before committing to an architecture.
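A back-of-the-envelope cost model for this kind of estimate might look like the sketch below. Everything in it is an assumption for illustration: the Workload class, the function names, the call volume, the token counts, and especially the per-million-token prices, which are placeholders and not any vendor's actual pricing.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Expected monthly usage of one AI feature (hypothetical structure)."""
    calls_per_month: int
    input_tokens_per_call: int
    output_tokens_per_call: int


def monthly_cost_usd(w: Workload, input_price_per_mtok: float,
                     output_price_per_mtok: float) -> float:
    """Estimate monthly spend from volume, token counts, and per-million-token prices."""
    input_mtok = w.calls_per_month * w.input_tokens_per_call / 1e6
    output_mtok = w.calls_per_month * w.output_tokens_per_call / 1e6
    return input_mtok * input_price_per_mtok + output_mtok * output_price_per_mtok


def cost_per_call_usd(w: Workload, input_price_per_mtok: float,
                      output_price_per_mtok: float) -> float:
    """Average cost of a single call under the same assumptions."""
    return monthly_cost_usd(w, input_price_per_mtok, output_price_per_mtok) / w.calls_per_month


if __name__ == "__main__":
    # Assumed workload: 2M calls/month, 1,500 input and 300 output tokens per call.
    feature = Workload(calls_per_month=2_000_000,
                       input_tokens_per_call=1_500,
                       output_tokens_per_call=300)

    # Placeholder prices in USD per million tokens (NOT real vendor pricing).
    large = monthly_cost_usd(feature, 3.00, 15.00)
    small = monthly_cost_usd(feature, 0.25, 1.25)

    print(f"larger model:  ${large:,.0f}/month")
    print(f"smaller model: ${small:,.0f}/month")
    print(f"per call (larger model): ${cost_per_call_usd(feature, 3.00, 15.00):.4f}")
```

Running the same workload through two price points makes the model-selection trade-off concrete, and the per-call figure can be fed directly into per-seat or per-request product pricing.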
Related terms
LLM (Large Language Model)
A large language model (LLM) is a deep-learning model trained on billions of text tokens to predict and generate human-readable language. LLMs such as GPT-4, Claude, and Gemini power chatbots, document summarization, code generation, and AI workflow automation -- and serve as the reasoning engine inside RAG systems and AI agents.
AI Agent
An AI agent is an LLM-powered system that autonomously plans, selects tools, executes multi-step tasks, and loops until a goal is achieved -- without requiring step-by-step human instruction. AI agents extend a language model's capability from answering questions to taking actions: writing code, querying APIs, browsing the web, and updating databases.
Fine-Tuning
Fine-tuning is the process of further training a pre-trained large language model on a curated dataset of domain-specific examples to adjust its tone, format, or reasoning patterns. Because the instructions and examples that would otherwise be repeated in every prompt are learned into the weights, a fine-tuned model can match a specialized style with 10-100x fewer prompt tokens at inference time, reducing API cost and latency for high-volume production workloads.
Prompt Engineering
Prompt engineering is the practice of designing, testing, and iterating on the instructions given to a large language model to reliably produce accurate, consistent, and useful outputs. On some tasks, well-engineered prompts can increase LLM accuracy by as much as 20-50% compared to naive instructions, often eliminating the need for more expensive fine-tuning.