How Does Inference Work?
Learn how vision-language models generate responses at inference time. Understand temperature, top-p, top-k, repetition penalty, and why the same prompt can produce different outputs.
Inference is when your trained model processes a new image and generates a response. During training, the model learned patterns from your data. During inference, it applies those patterns to images it has never seen. Datature Vi handles inference through the Vi SDK, where you load a trained model, pass in an image and a prompt, and receive a response.
This page explains what happens inside the model during inference and how generation settings control the output.
What happens during inference
When you call your model with an image and a prompt, three things happen in sequence. These are the same three components described in What Are VLMs, now working on your specific image.
The visual encoder reads the image
The model breaks your image into patches (small tiles), converts each patch into a numerical representation, and passes these through a projection layer. The result is a set of visual tokens that the language model can process alongside text.
The language model receives your prompt
Your system prompt, user prompt, and the visual tokens are combined into a single sequence. The language model reads this full sequence to understand both the image and what you are asking.
The model generates a response token by token
The language model produces one token at a time. At each step, it calculates a probability distribution over its entire vocabulary, picks a token, appends it to the sequence, and repeats. Generation stops when the model produces a special end-of-sequence token or hits the maximum token limit.
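To make the first step concrete, here is a minimal sketch of splitting an image into patches. It is an illustration only: the function name `split_into_patches`, the 4x4 toy image, and the patch size of 2 are assumptions made for this example, and a real visual encoder goes on to embed each patch with learned weights and pass it through the projection layer.

```python
def split_into_patches(image, patch_size):
    """Split a 2D grid (list of pixel rows) into non-overlapping square patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    # Walk the image top-to-bottom, left-to-right, one tile at a time.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches


# Toy 4x4 grayscale "image" with pixel values 0..15 (hypothetical data).
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, patch_size=2)
print(len(patches))   # number of tiles
print(patches[0])     # top-left tile
```

Each patch would then become one visual token, so a larger image or a smaller patch size yields a longer sequence for the language model to read.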
Token-by-token generation
The core of inference is a loop. At each step, the model looks at everything so far (the image tokens, prompt tokens, and any tokens it has already generated) and predicts which token comes next.
This prediction is a probability distribution: the model assigns a probability to every token in its vocabulary. The token "defect" might get 0.35, "crack" might get 0.22, "scratch" might get 0.15, and the thousands of remaining tokens share the rest. Generation settings control how the model picks from this distribution.
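The loop described above can be sketched with a toy stand-in for the model. Everything here is hypothetical: `toy_next_token_probs` is a hard-coded lookup table rather than a real network, and the token names and probabilities are invented for illustration. The sketch always picks the most likely token (greedy decoding) and stops on an end-of-sequence token or when it reaches `max_new_tokens`.

```python
EOS = "<eos>"  # hypothetical end-of-sequence marker


def toy_next_token_probs(sequence):
    """Stand-in for a model: maps the last token to a next-token distribution."""
    table = {
        "<prompt>": {"defect": 0.35, "crack": 0.22, "scratch": 0.15, EOS: 0.28},
        "defect": {"detected": 0.6, EOS: 0.4},
        "detected": {EOS: 0.9, "defect": 0.1},
    }
    return table.get(sequence[-1], {EOS: 1.0})


def generate_greedy(prompt_tokens, max_new_tokens):
    """Generate tokens one at a time, always taking the most likely one."""
    sequence = list(prompt_tokens)
    generated = []
    for _ in range(max_new_tokens):
        probs = toy_next_token_probs(sequence)
        token = max(probs, key=probs.get)  # greedy choice
        if token == EOS:
            break  # the model signalled that the response is complete
        sequence.append(token)  # the new token becomes part of the context
        generated.append(token)
    return generated


print(generate_greedy(["<prompt>"], max_new_tokens=10))
print(generate_greedy(["<prompt>"], max_new_tokens=1))  # cut off by the limit
```

The second call shows the token limit ending generation before the end-of-sequence token appears.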
Generation settings explained
Generation settings let you control how the model selects tokens during inference. The same trained model can produce different outputs depending on these settings, without retraining.
Temperature
Temperature controls how much randomness the model uses when picking tokens.
- Low temperature (0.0-0.3): The model almost always picks the highest-probability token. Outputs are consistent and predictable. Two identical requests produce identical or near-identical answers.
- Medium temperature (0.4-0.7): The model sometimes picks less likely tokens, adding variety while staying coherent.
- High temperature (0.8+): The model picks from a wider range of tokens. Outputs become more varied and creative, but less predictable.
At temperature 0.0, the model always picks the single most likely token (this is called greedy decoding). The same image and prompt then produce the same output, apart from small numerical differences that can arise across hardware.
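A common way to apply temperature is to divide the model's raw scores (logits) by the temperature before converting them to probabilities with a softmax. The sketch below assumes that formulation; the function name and the example logits are invented for illustration. Because dividing by zero is undefined, a temperature of 0.0 is handled in practice as a plain argmax rather than through this formula.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.2)  # low temperature
flat = softmax_with_temperature(logits, 1.5)   # high temperature
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

At the low temperature almost all probability lands on the top token; at the high temperature the three candidates end up much closer together, which is why sampling becomes more varied.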
Top-p (nucleus sampling)
Top-p limits which tokens the model considers at each step. Instead of looking at all possible tokens, the model sorts them by probability and only considers the smallest group whose probabilities add up to at least the top-p threshold.
With top_p=0.9, the model only considers the most likely tokens that together account for at least 90% of the total probability. Rare tokens that fall outside this group are excluded. Lower values make outputs more focused; higher values allow more variety.
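The filter can be sketched in a few lines. This is an illustrative version, not the Vi SDK's implementation: `top_p_filter` and the example distribution are assumptions, and the surviving probabilities are renormalized so they sum to 1 before sampling.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose total reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break  # the nucleus is complete
    total = sum(p for _, p in kept)
    # Renormalize so the kept tokens form a valid distribution.
    return {token: p / total for token, p in kept}


# Hypothetical next-token distribution.
probs = {"defect": 0.5, "crack": 0.3, "scratch": 0.15, "banana": 0.05}
nucleus = top_p_filter(probs, top_p=0.9)
print(sorted(nucleus))
```

With these numbers the first three tokens reach 0.95, which crosses the 0.9 threshold, so the unlikely "banana" token is dropped.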
Top-k
Top-k is a simpler filter. The model only considers the k most likely tokens at each step, regardless of their probabilities.
With top_k=10, the model picks from the 10 most likely next tokens. With top_k=50, it picks from the top 50. Top-k and top-p work together: a token must pass both filters to be considered.
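A minimal sketch of the top-k filter, under the same assumptions as before (hypothetical function name and example distribution, with renormalization after filtering):

```python
def top_k_filter(probs, top_k):
    """Keep only the top_k most likely tokens and renormalize."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    return {token: p / total for token, p in ranked}


# Hypothetical next-token distribution.
probs = {"defect": 0.5, "crack": 0.3, "scratch": 0.15, "banana": 0.05}
shortlist = top_k_filter(probs, top_k=2)
print(shortlist)
```

Unlike top-p, the number of surviving tokens is fixed: with top_k=2 exactly two candidates remain, however the probability is spread across them.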
Repetition penalty
Repetition penalty reduces the probability of tokens that have already appeared in the output. A value of 1.0 means no penalty. Values above 1.0 make the model less likely to repeat words or phrases.
If your model produces outputs like "The defect is a crack. The crack is located. The crack appears to be..." then increasing the repetition penalty to 1.1-1.3 helps break the pattern.
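One common formulation of repetition penalty works on the raw scores (logits) of tokens that have already been generated: positive scores are divided by the penalty and negative scores are multiplied by it, so a repeated token always becomes less likely. The sketch below assumes that formulation; whether the Vi SDK applies the penalty in exactly this way is not stated here, and the function name and scores are invented for illustration.

```python
def apply_repetition_penalty(logits, generated_tokens, penalty):
    """Lower the scores of tokens that already appear in the output."""
    adjusted = dict(logits)  # leave the caller's scores untouched
    for token in set(generated_tokens):
        if token in adjusted:
            score = adjusted[token]
            # Divide positive scores, multiply negative ones:
            # both moves push the score down when penalty > 1.0.
            adjusted[token] = score / penalty if score > 0 else score * penalty
    return adjusted


# Hypothetical scores for three candidate tokens.
logits = {"crack": 2.0, "located": 1.0, "the": -1.0}
adjusted = apply_repetition_penalty(logits, ["crack", "the"], penalty=1.25)
print(adjusted)
```

A penalty of 1.0 leaves every score unchanged, which matches the "no penalty" behavior described above.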
Max new tokens
Max new tokens sets the maximum number of tokens the model can generate in one response. Once the model hits this limit, generation stops even if the response is incomplete.
Set this based on your expected output length. Short yes/no answers need 10-50 tokens. Detailed descriptions need 256-512. Structured JSON reports may need 1024 or more.
Choosing settings for your task
Different tasks benefit from different settings. Use this table as a starting point.
Temperature, top-p, and top-k only work when do_sample is set to true in the Vi SDK. Without it, the model uses greedy decoding regardless of your temperature setting. See Configure Generation for SDK usage.
Streaming vs batch inference
The Vi SDK supports two ways to receive inference results.
Single-image inference sends one image and returns the complete response only after the model has finished generating it. This is the default mode and works well for most applications.
Streaming returns tokens as they are generated, without waiting for the full response. Your application receives partial results in real time. This is useful for interactive applications where users benefit from seeing the answer appear progressively, or for long outputs where you want to start processing early.
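The difference between the two modes can be illustrated with a Python generator. This is a conceptual sketch, not the Vi SDK's API: `stream_tokens`, `complete_response`, and the token list are hypothetical names and data used only to show how a consumer sees partial text in one mode and a finished string in the other.

```python
def stream_tokens(token_source):
    """Yield each token as soon as it is available (streaming)."""
    for token in token_source:
        yield token


def complete_response(token_source):
    """Collect every token first, then return the full text (non-streaming)."""
    return "".join(stream_tokens(token_source))


# Hypothetical tokens a model might emit.
tokens = ["The", " weld", " shows", " a", " crack", "."]

# Streaming consumer: the text grows token by token.
partials = []
text_so_far = ""
for token in stream_tokens(tokens):
    text_so_far += token
    partials.append(text_so_far)

print(partials[0])                 # first partial result
print(complete_response(tokens))   # the full answer, returned at once
```

Both paths end with the same text; streaming only changes when the application gets to see it.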
What is hallucination?
Hallucination is when a VLM generates information that isn't present in the image. The model might describe objects that don't exist, invent counts, or assign labels based on patterns from its pre-training data rather than what's visible.
This happens because VLMs are trained on millions of image-text pairs. When the model encounters an ambiguous or unfamiliar image, it fills gaps with statistically likely content from that training. A model fine-tuned on factory images might "see" a common defect type even when the image shows no defect at all.
Three factors reduce hallucination:
- Fine-tuning on clean data. The more accurate your annotations, the less the model relies on general pre-training patterns. See the annotation guide for quality standards.
- Hallucination guards in the system prompt. Instructions like "Only describe what is directly visible" constrain the model's output. See system prompt guardrails for examples.
- Lower temperature at inference. Setting temperature to 0.0-0.1 forces the model to pick the most probable tokens, reducing the chance of generating fabricated content.
Hallucination is not a bug you can fix once. It's a tendency that you manage through data quality, prompt design, and inference settings.
How system prompts affect inference
The system prompt you set during training carries over to inference. It tells the model its role and expected output format. Changing the system prompt at inference time can alter behavior, but the model performs best with the same prompt it was trained on.
If you trained with a system prompt like "You are a quality inspector. Report defects as JSON with fields: type, severity, location," the model will follow that format at inference. If you change the prompt to something unrelated, the model may produce unpredictable results because it was not trained for that instruction.
Related resources
Configure Generation
SDK reference for all generation parameters.
Run Inference
Single-image, batch, and streaming inference with the Vi SDK.
How Does VLM Training Work?
What happens during training, before inference.
What Are VLMs?
How the visual encoder, language model, and cross-modal bridge work together.
Chain-of-Thought Reasoning
Step-by-step reasoning at inference time.
Model Settings
Training mode, hyperparameters, and evaluation settings reference.
