What Are Context Windows and Token Budgets?
Learn what context windows are, how images and text share a fixed token budget, and how to manage token usage in Datature Vi.
A context window is the fixed number of tokens a vision-language model (VLM) can process in a single request. Everything competes for space in this window: visual tokens from the image, your system prompt, the user prompt, and the model's generated response. When the window fills up, the model truncates its output. This page explains how tokens are allocated across these components and how to manage your token budget in Datature Vi.
What is a context window?
Every VLM has a maximum number of tokens it can handle at once. Think of it as a fixed-size workspace. The system prompt, image data, user prompt, and model output all sit on this workspace. When the workspace fills up, the model stops generating.
Context window sizes vary by architecture.
A larger context window gives the model more room to process high-resolution images alongside detailed prompts and long responses. But it does not change the model's accuracy, only how much input and output it can handle per request.
How the token budget is shared
The context window is split across four components. They are processed in this order:
System prompt tokens
Your system prompt is converted to tokens first. A 150-word system prompt uses roughly 200 tokens. Longer prompts consume more of the budget.
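The 150-word ≈ 200-token rule of thumb above can be turned into a rough pre-flight estimate. A minimal sketch (the exact count depends on the model's tokenizer, so treat this as an approximation only):

```python
def estimate_text_tokens(text: str) -> int:
    """Rough estimate: ~4 tokens per 3 words, per the 150-word ≈ 200-token rule of thumb."""
    words = len(text.split())
    return round(words * 4 / 3)

# A 150-word system prompt lands near the ~200-token figure above.
prompt = " ".join(["word"] * 150)
print(estimate_text_tokens(prompt))  # → 200
```

For production use, count tokens with the model's own tokenizer rather than a word-count heuristic.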
Visual tokens (from the image)
The image is split into patches, and each patch becomes one or more tokens. A standard 448x448 image produces roughly 1,024 visual tokens. High-resolution images produce far more. See How images become tokens below.
User prompt tokens
The text you send alongside the image. For phrase grounding, this is the description to locate. For VQA, this is the question. A typical user prompt is 10-50 tokens.
Generated response tokens
Whatever is left in the window is available for the model's output. If the first three components consume most of the budget, the model has less room to write a detailed response.
Example: On a model with a 32,768-token window:
- System prompt: ~200 tokens
- Image (1344x1344): ~9,216 tokens
- User prompt: ~30 tokens
- Remaining for response: ~23,322 tokens
That's plenty. But stack a high-resolution image with a long system prompt, and the available response tokens shrink fast.
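The arithmetic behind a budget like this is simple subtraction. A small sketch, using the 448x448 figure from earlier on this page (the component counts are illustrative, not exact):

```python
CONTEXT_WINDOW = 32_768  # example window size from the budget above

def remaining_for_response(system_tokens: int, visual_tokens: int,
                           user_tokens: int,
                           window: int = CONTEXT_WINDOW) -> int:
    """Tokens left for the model's output after the three input components."""
    used = system_tokens + visual_tokens + user_tokens
    return max(0, window - used)

# 448x448 image (~1,024 visual tokens), ~200-token system prompt, ~30-token user prompt
print(remaining_for_response(200, 1_024, 30))  # → 31514
```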
How images become tokens
The visual encoder splits each image into a grid of fixed-size patches (typically 14x14 or 16x16 pixels). Each patch becomes one visual token after passing through the encoder and cross-modal bridge.
Higher resolution preserves more visual detail. A small scratch on a large image that spans 5x5 pixels might vanish at 448x448 but remain visible at higher resolutions. The tradeoff is token cost: 4x the resolution means roughly 4x the visual tokens.
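Under the patch-grid description above, the visual token count is simply the number of patches. A sketch assuming one token per 14x14 patch and no token merging (real encoders may merge neighboring patches, so actual counts can be lower):

```python
import math

def visual_tokens(width: int, height: int, patch: int = 14) -> int:
    """One token per patch; partial patches at the image edges still count."""
    return math.ceil(width / patch) * math.ceil(height / patch)

print(visual_tokens(448, 448))  # → 1024  (a 32x32 patch grid)
print(visual_tokens(896, 896))  # → 4096  (2x the resolution per side ≈ 4x the tokens)
```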
What is dynamic resolution?
Some architectures (Qwen3-VL, Qwen3.5) use dynamic resolution instead of resizing every image to a fixed size. The model adapts the number of image tiles based on the input's original dimensions and aspect ratio.
A tall document image gets more vertical tiles. A square product photo gets a balanced grid. This preserves the original aspect ratio and allocates tokens where the image has the most detail.
Dynamic resolution happens automatically. You don't need to configure it. But it means different images in the same dataset may consume different numbers of tokens. A 4000x3000 drone photo will cost more tokens than a 640x480 thumbnail.
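A toy illustration of how a dynamic-resolution scheme might allocate tiles. The tile size, cap, and rounding here are assumptions for illustration, not Qwen3-VL's actual algorithm:

```python
def tile_grid(width: int, height: int, tile: int = 448,
              max_tiles: int = 12) -> tuple[int, int]:
    """Pick a (cols, rows) grid proportional to the image's aspect ratio, capped at max_tiles."""
    cols = max(1, round(width / tile))
    rows = max(1, round(height / tile))
    while cols * rows > max_tiles:  # shrink the longer side first to stay under the cap
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    return cols, rows

print(tile_grid(896, 2688))   # tall document: more vertical tiles than horizontal
print(tile_grid(4000, 3000))  # large drone photo: grid shrunk to fit the tile cap
print(tile_grid(640, 480))    # thumbnail: a single tile
```

This mirrors the behavior described above: tall images get more vertical tiles, small images cost few tokens, and very large images are bounded by a cap rather than growing without limit.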
Datature Vi shows a token monitor in the workflow canvas. Use it to see how much of the context window your images consume alongside your system prompt and user prompts. This is especially useful for high-resolution datasets (medical scans, satellite imagery, dense documents).
Managing your token budget
When working with large images or long prompts, follow these guidelines:
- Downscale images when fine detail is not required; visual token cost scales with resolution.
- Keep system prompts concise so more of the window is left for the response.
- Check the token monitor in the workflow canvas before running high-resolution datasets.
- Leave enough headroom for the generated response, or the model will truncate its output.
