What Are Context Windows and Token Budgets?
Learn what context windows are, how images and text share a fixed token budget, and how to manage token usage in Datature Vi.
A context window is the fixed number of tokens a vision-language model (VLM) can process in a single request. Everything competes for space in this window: visual tokens from the image, your system prompt, the user prompt, and the model's generated response. When the window fills up, the model truncates its output. This page explains how tokens are allocated across these components and how to manage your token budget in Datature Vi.
What is a context window?
Every VLM has a maximum number of tokens it can handle at once. Think of it as a fixed-size workspace. The system prompt, image data, user prompt, and model output all sit on this workspace. When the workspace fills up, the model stops generating.
Context window sizes vary by architecture.
A larger context window gives the model more room to process high-resolution images alongside detailed prompts and long responses. But it does not change the model's accuracy, only how much input and output it can handle per request.
How the token budget is shared
The context window is split across four components. They are processed in this order:
System prompt tokens
Your system prompt is converted to tokens first. A 150-word system prompt uses roughly 200 tokens. Longer prompts consume more of the budget.
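The 150-word ≈ 200-token rule of thumb above can be turned into a rough pre-flight estimate. A minimal sketch (the exact count depends on the model's tokenizer, so treat this as an approximation only):

```python
def estimate_text_tokens(text: str) -> int:
    """Rough estimate: ~4 tokens per 3 words, per the 150-word ≈ 200-token rule of thumb."""
    words = len(text.split())
    return round(words * 4 / 3)

# A 150-word system prompt lands near the ~200-token figure above.
prompt = " ".join(["word"] * 150)
print(estimate_text_tokens(prompt))  # → 200
```

For production use, count tokens with the model's own tokenizer rather than a word-count heuristic.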
Visual tokens (from the image)
The image is split into patches, and each patch becomes one or more tokens. A standard 448x448 image produces roughly 1,024 visual tokens. High-resolution images produce far more. See How images become tokens below.
User prompt tokens
The text you send alongside the image. For phrase grounding, this is the description to locate. For VQA, this is the question. A typical user prompt is 10-50 tokens.
Generated response tokens
Whatever is left in the window is available for the model's output. If the first three components consume most of the budget, the model has less room to write a detailed response.
Example: On a model with a 32,768-token window:
- System prompt: ~200 tokens
- Image (1344x1344): ~9,216 tokens
- User prompt: ~30 tokens
- Remaining for response: ~23,322 tokens
That's plenty. But stack a high-resolution image with a long system prompt, and the available response tokens shrink fast.
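The arithmetic behind a budget like this is simple subtraction. A small sketch, using the 448x448 figure from earlier on this page (the component counts are illustrative, not exact):

```python
CONTEXT_WINDOW = 32_768  # example window size from the budget above

def remaining_for_response(system_tokens: int, visual_tokens: int,
                           user_tokens: int,
                           window: int = CONTEXT_WINDOW) -> int:
    """Tokens left for the model's output after the three input components."""
    used = system_tokens + visual_tokens + user_tokens
    return max(0, window - used)

# 448x448 image (~1,024 visual tokens), ~200-token system prompt, ~30-token user prompt
print(remaining_for_response(200, 1_024, 30))  # → 31514
```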
How images become tokens
The visual encoder splits each image into a grid of fixed-size patches (typically 14x14 or 16x16 pixels). Each patch becomes one visual token after passing through the encoder and cross-modal bridge.
Higher resolution preserves more visual detail. A small scratch on a large image that spans 5x5 pixels might vanish at 448x448 but remain visible at higher resolutions. The tradeoff is token cost: 4x the resolution means roughly 4x the visual tokens.
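Under the patch-grid description above, the visual token count is simply the number of patches. A sketch assuming one token per 14x14 patch and no token merging (real encoders may merge neighboring patches, so actual counts can be lower):

```python
import math

def visual_tokens(width: int, height: int, patch: int = 14) -> int:
    """One token per patch; partial patches at the image edges still count."""
    return math.ceil(width / patch) * math.ceil(height / patch)

print(visual_tokens(448, 448))  # → 1024  (a 32x32 patch grid)
print(visual_tokens(896, 896))  # → 4096  (2x the resolution per side ≈ 4x the tokens)
```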
What is dynamic resolution?
Some architectures (Qwen3-VL, Qwen3.5) use dynamic resolution instead of resizing every image to a fixed size. The model adapts the number of image tiles based on the input's original dimensions and aspect ratio.
A tall document image gets more vertical tiles. A square product photo gets a balanced grid. This preserves the original aspect ratio and allocates tokens where the image has the most detail.
Dynamic resolution happens automatically. You don't need to configure it. But it means different images in the same dataset may consume different numbers of tokens. A 4000x3000 drone photo will cost more tokens than a 640x480 thumbnail.
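A toy illustration of how a dynamic-resolution scheme might allocate tiles. The tile size, cap, and rounding here are assumptions for illustration, not Qwen3-VL's actual algorithm:

```python
def tile_grid(width: int, height: int, tile: int = 448,
              max_tiles: int = 12) -> tuple[int, int]:
    """Pick a (cols, rows) grid proportional to the image's aspect ratio, capped at max_tiles."""
    cols = max(1, round(width / tile))
    rows = max(1, round(height / tile))
    while cols * rows > max_tiles:  # shrink the longer side first to stay under the cap
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    return cols, rows

print(tile_grid(896, 2688))   # tall document: more vertical tiles than horizontal
print(tile_grid(4000, 3000))  # large drone photo: grid shrunk to fit the tile cap
print(tile_grid(640, 480))    # thumbnail: a single tile
```

This mirrors the behavior described above: tall images get more vertical tiles, small images cost few tokens, and very large images are bounded by a cap rather than growing without limit.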
Datature Vi shows a token monitor in the workflow canvas. Use it to see how much of the context window your images consume alongside your system prompt and user prompts. This is especially useful for high-resolution datasets (medical scans, satellite imagery, dense documents).
Managing your token budget
When working with large images or long prompts, follow these guidelines:
- Downscale images when fine detail is not required; visual token cost scales with resolution.
- Keep system prompts concise so more of the window is left for the response.
- Check the token monitor in the workflow canvas before running high-resolution datasets.
- Leave enough headroom for the generated response, or the model will truncate its output.
