Model Context (Context Window)
Definition
Model context -- also called the context window -- is the maximum amount of text, measured in tokens, that a large language model can process in a single inference call, counting both the prompt and the generated output. GPT-4o supports 128,000 tokens; Claude 3.5 Sonnet supports 200,000 tokens. Longer context windows enable whole-document analysis, multi-turn conversation history, and large-codebase reasoning without chunking.
Every LLM has a context window limit. Text beyond that limit is not visible to the model: it has to be truncated, summarized, or left out before the call. Early LLMs had windows of 2,048 to 4,096 tokens (4,096 tokens is roughly 3,000 words). Current frontier models support 128K-1M tokens -- enough for full-book analysis, reasoning over an entire codebase, and multi-hour conversation histories without losing context.
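As a rough illustration of the limit, the sketch below estimates a token count from a word count, checks whether a prompt fits, and trims a conversation history to the most recent messages that fit. The 0.75 words-per-token ratio is derived from the "4,096 tokens is roughly 3,000 words" figure above and is only a heuristic; real counts depend on the model's tokenizer. The function names and the reserved output budget are assumptions made for illustration.

```python
import math

# Heuristic only: about 0.75 words per token (3,000 words ~ 4,096 tokens).
# A real tokenizer will give different counts, especially for code or non-English text.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Estimate the token count of `text` from its whitespace-separated word count."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


def fits_in_context(text: str, context_window: int, reserved_for_output: int = 0) -> bool:
    """Return True if the estimated prompt size leaves room for the reserved output tokens."""
    return estimate_tokens(text) + reserved_for_output <= context_window


def trim_history(messages: list[str], context_window: int) -> list[str]:
    """Keep the most recent messages whose combined estimate fits the window.

    Hypothetical policy: walk backwards from the newest message and stop at the
    first one that would overflow the budget.
    """
    kept: list[str] = []
    used = 0
    for message in reversed(messages):
        cost = estimate_tokens(message)
        if used + cost > context_window:
            break
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

In practice the estimate would be replaced by the tokenizer that matches the target model, since the heuristic can be off by a wide margin.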
Context window and RAG
RAG is commonly used to work around small context windows: retrieve only the relevant chunks rather than passing all documents. As context windows grow, the optimal balance shifts -- with a long enough window, you may be able to pass entire documents directly rather than chunking and retrieving. But larger context also means higher inference cost: at the same per-token rate, a 100K-token prompt costs about 100x as much as a 1K-token prompt.
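The trade-off can be sketched in a few lines: a cost function showing that prompt cost scales linearly with token count, and a greedy picker that keeps only the highest-scoring chunks that fit a token budget. The price is a placeholder rather than any vendor's actual rate, and the function names, the tuple layout, and the relevance scores are hypothetical; a real system would get scores from a retriever and counts from a tokenizer.

```python
from decimal import Decimal

# Placeholder price for illustration only; not any vendor's actual rate.
PRICE_PER_1K_INPUT_TOKENS = Decimal("0.01")


def prompt_cost(tokens: int, price_per_1k: Decimal = PRICE_PER_1K_INPUT_TOKENS) -> Decimal:
    """Cost of sending `tokens` input tokens at a flat per-1K-token price."""
    return Decimal(tokens) / Decimal(1000) * price_per_1k


def select_chunks(scored_chunks: list[tuple[float, str, int]], token_budget: int) -> list[str]:
    """Greedily pick the highest-scoring chunks whose token counts fit the budget.

    Each item is (relevance_score, chunk_text, token_count). Chunks that do not
    fit in the remaining budget are skipped, and lower-scoring ones are still tried.
    """
    chosen: list[str] = []
    remaining = token_budget
    for _score, text, tokens in sorted(scored_chunks, key=lambda c: c[0], reverse=True):
        if tokens <= remaining:
            chosen.append(text)
            remaining -= tokens
    return chosen


small = prompt_cost(1_000)    # 1K-token prompt
large = prompt_cost(100_000)  # 100K-token prompt
ratio = large / small         # same ratio whatever flat price is chosen
```

The ratio depends only on the token counts, which is why retrieving a few relevant chunks can stay attractive on cost even when the whole corpus would fit in the window.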
Typical context window needs by task
- Short-form QA and classification: 1K-8K tokens is usually sufficient
- Document summarization: 32K-128K tokens covers most business documents
- Full codebase reasoning or legal document analysis: often 200K+ tokens
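These guideline ranges could be encoded as a small lookup, for example to check a model's window before routing a request to it. The task labels, the tuple layout, and the function name are hypothetical; the ranges are taken from the list above and are rules of thumb rather than hard limits.

```python
from typing import Optional

# Guideline ranges from the list above, in tokens: (low, high), with None for open-ended.
TYPICAL_CONTEXT_NEEDS: dict[str, tuple[int, Optional[int]]] = {
    "short_form_qa": (1_000, 8_000),
    "classification": (1_000, 8_000),
    "document_summarization": (32_000, 128_000),
    "codebase_reasoning": (200_000, None),
    "legal_document_analysis": (200_000, None),
}


def window_is_sufficient(task: str, model_context_window: int) -> bool:
    """Check a model's window against the guideline for `task`.

    Conservative choice: compare against the upper end of the range. For
    open-ended ranges (200K+), the lower bound is treated as the minimum.
    """
    low, high = TYPICAL_CONTEXT_NEEDS[task]
    required = high if high is not None else low
    return model_context_window >= required
```

Comparing against the upper end of each range is a deliberate simplification; a workload with consistently short documents could relax it to the lower bound.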
Related terms
RAG (Retrieval-Augmented Generation)
Retrieval-augmented generation (RAG) is an AI architecture that supplements a large language model's static training knowledge with real-time retrieval from a private or external knowledge base. RAG reduces hallucinations by grounding LLM responses in verified source documents, making it the standard pattern for enterprise AI assistants built on proprietary data.
LLM (Large Language Model)
A large language model (LLM) is a deep-learning model trained on billions to trillions of text tokens to predict and generate human-readable language. LLMs such as GPT-4, Claude, and Gemini power chatbots, document summarization, code generation, and AI workflow automation -- and serve as the reasoning engine inside RAG systems and AI agents.
Prompt Engineering
Prompt engineering is the practice of designing, testing, and iterating on the instructions given to a large language model to reliably produce accurate, consistent, and useful outputs. Well-engineered prompts can noticeably improve LLM task accuracy over naive instructions (gains of 20-50% are sometimes cited, though results vary by task and model), often eliminating the need for more expensive fine-tuning.
Inference
Inference is the act of running a trained AI model on new input data to produce a prediction or generated output -- as distinct from training, which is the computationally expensive process of building the model. In production AI systems, inference cost and latency are the primary engineering constraints: GPT-4-class models have been priced at roughly $0.01-$0.06 per 1,000 tokens, so the cost of a single inference depends on its token count.