How Does Inference Work?

Inference is when your trained model processes a new image and generates a response. During training, the model learned patterns from your data. During inference, it applies those patterns to images it has never seen. Datature Vi handles inference through the Vi SDK, where you load a trained model, pass in an image and a prompt, and receive a response.

This page explains what happens inside the model during inference and how generation settings control the output.

What happens during inference

When you call your model with an image and a prompt, three things happen in sequence. These are the same three components described in What Are VLMs, now working on your specific image.

Token-by-token generation

The core of inference is a loop. At each step, the model looks at everything so far (the image tokens, prompt tokens, and any tokens it has already generated) and predicts which token comes next.

This prediction is a probability distribution: the model assigns a probability to every token in its vocabulary. The token "defect" might get 0.35, "crack" might get 0.22, "scratch" might get 0.15, and thousands of other tokens get smaller probabilities. Generation settings control how the model picks from this distribution.
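The loop can be sketched in a few lines of plain Python. This is an illustration only: `toy_next_token_probs` is a hypothetical stand-in for the model, the `<eos>` stop token is an assumption, and none of these names come from the Vi SDK.

```python
import random

def toy_next_token_probs(context):
    """Hypothetical stand-in for a model: returns a next-token distribution."""
    if context and context[-1] == "defect":
        return {"<eos>": 0.9, "found": 0.1}
    return {"defect": 0.35, "crack": 0.22, "scratch": 0.15, "<eos>": 0.28}

def generate(prompt_tokens, max_new_tokens=5, seed=0):
    """Generate tokens one at a time until <eos> or the token limit."""
    rng = random.Random(seed)
    context = list(prompt_tokens)  # everything the "model" has seen so far
    output = []
    for _ in range(max_new_tokens):
        probs = toy_next_token_probs(context)
        tokens = list(probs)
        weights = [probs[t] for t in tokens]
        # Sample one token in proportion to its probability.
        token = rng.choices(tokens, weights=weights, k=1)[0]
        if token == "<eos>":
            break
        output.append(token)
        context.append(token)  # the new token becomes part of the context
    return output

print(generate(["<image>", "describe", "the", "part"]))
```

Each generated token is appended to the context before the next prediction, which is why earlier choices influence everything that follows.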

Generation settings explained

Generation settings let you control how the model selects tokens during inference. The same trained model can produce different outputs depending on these settings, without retraining.

Temperature

Temperature controls how much randomness the model uses when picking tokens.

Low temperature (0.0-0.3): The model almost always picks the highest-probability token. Outputs are consistent and predictable. Two identical requests produce identical or near-identical answers.
Medium temperature (0.4-0.7): The model sometimes picks less likely tokens, adding variety while staying coherent.
High temperature (0.8+): The model picks from a wider range of tokens. Outputs become more varied and creative, but less predictable.

At temperature 0.0, the model always picks the single most likely token (this is called greedy decoding), so token selection involves no randomness.
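A minimal sketch of how temperature reshapes the distribution, assuming the usual approach of dividing raw scores (logits) by the temperature before a softmax. The function name and the example scores are illustrative, not part of the Vi SDK.

```python
import math

def apply_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        # Treat 0 as greedy decoding: all probability on the top-scoring token.
        best = max(logits, key=logits.get)
        return {tok: (1.0 if tok == best else 0.0) for tok in logits}
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(s - peak) for tok, s in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

logits = {"defect": 2.0, "crack": 1.5, "scratch": 1.1}
for t in (0.0, 0.2, 0.7, 1.5):
    probs = apply_temperature(logits, t)
    print(t, {tok: round(p, 3) for tok, p in probs.items()})
```

As the temperature drops, more of the probability concentrates on the top token; as it rises, the distribution flattens and less likely tokens get picked more often.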

Top-p (nucleus sampling)

Top-p limits which tokens the model considers at each step. Instead of looking at all possible tokens, the model sorts them by probability and only considers the smallest group whose probabilities add up to the top-p threshold.

With top_p=0.9, the model keeps the most likely tokens, in order, until their combined probability reaches 90%, and samples only from that group. Rare tokens that fall outside it are excluded. Lower values make outputs more focused; higher values allow more variety.
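The nucleus filter can be sketched as follows. This is a simplified illustration on a toy distribution; `top_p_filter` and the example probabilities are assumptions for the example, not the SDK's implementation.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of top tokens whose probabilities reach top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = {}
    cumulative = 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break  # the nucleus is complete; drop everything after it
    total = sum(kept.values())
    # Renormalize so the surviving probabilities sum to 1 again.
    return {token: p / total for token, p in kept.items()}

probs = {"defect": 0.5, "crack": 0.3, "scratch": 0.15, "dent": 0.04, "glare": 0.01}
print(top_p_filter(probs, 0.9))
```

In this toy distribution the three most likely tokens already cover more than 90% of the probability, so "dent" and "glare" are never sampled.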

Top-k

Top-k is a simpler filter. The model only considers the k most likely tokens at each step, regardless of their probabilities.

With top_k=10, the model picks from the 10 most likely next tokens. With top_k=50, it picks from the top 50. Top-k and top-p work together: a token must pass both filters to be considered.
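A rough sketch of a token having to pass both filters. It models the combination as a simple intersection on the original distribution; real implementations often apply the filters one after the other and renormalize in between, so treat this as a simplification. `candidate_tokens` is a hypothetical helper.

```python
def candidate_tokens(probs, top_k, top_p):
    """Return tokens that survive both the top-k and the top-p filter."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)

    # Top-k: the k most likely tokens, whatever their probabilities are.
    in_top_k = {tok for tok, _ in ranked[:top_k]}

    # Top-p: the smallest leading group whose probabilities reach top_p.
    in_top_p, cumulative = set(), 0.0
    for tok, p in ranked:
        in_top_p.add(tok)
        cumulative += p
        if cumulative >= top_p:
            break

    return [tok for tok, _ in ranked if tok in in_top_k and tok in in_top_p]

probs = {"defect": 0.5, "crack": 0.3, "scratch": 0.15, "dent": 0.04, "glare": 0.01}
print(candidate_tokens(probs, top_k=2, top_p=0.9))
print(candidate_tokens(probs, top_k=10, top_p=0.9))
```

Whichever filter is stricter for a given distribution ends up deciding the candidate set.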

Repetition penalty

Repetition penalty reduces the probability of tokens that have already appeared in the output. A value of 1.0 means no penalty. Values above 1.0 make the model less likely to repeat words or phrases.

If your model produces outputs like "The defect is a crack. The crack is located. The crack appears to be..." then increasing the repetition penalty to 1.1-1.3 helps break the pattern.
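One widely used formulation of the penalty works on raw scores (logits): positive scores of already-generated tokens are divided by the penalty and negative scores are multiplied by it. The sketch below assumes that formulation; this page does not state which formula Datature Vi applies, and the function name is illustrative.

```python
def apply_repetition_penalty(logits, generated_tokens, penalty):
    """Lower the scores of tokens that already appear in the output."""
    adjusted = dict(logits)
    for token in set(generated_tokens):
        if token in adjusted:
            score = adjusted[token]
            # Dividing a positive score or multiplying a negative one
            # both push the token's score down when penalty > 1.0.
            adjusted[token] = score / penalty if score > 0 else score * penalty
    return adjusted

logits = {"crack": 2.0, "dent": 1.0, "the": -1.0}
print(apply_repetition_penalty(logits, ["the", "crack", "the", "crack"], 1.25))
```

With a penalty of 1.0 the scores are unchanged, which matches the "no penalty" behavior described above.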

Max new tokens

Max new tokens sets the maximum number of tokens the model can generate in one response. Once the model hits this limit, generation stops even if the response is incomplete.

Set this based on your expected output length. Short yes/no answers need 10-50 tokens. Detailed descriptions need 256-512. Structured JSON reports may need 1024 or more.

Choosing settings for your task

Different tasks benefit from different settings. Use this table as a starting point.

Task | Temperature | Top-p | Max tokens | Why

Enable sampling for temperature to take effect

Temperature, top-p, and top-k only work when do_sample is set to true in the Vi SDK. Without it, the model uses greedy decoding regardless of your temperature setting. See Configure Generation for SDK usage.
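The effect of the sampling switch can be illustrated with a small stand-alone function. `pick_token` is a hypothetical helper that mimics the behavior described here; it is not the Vi SDK's API, so refer to Configure Generation for the actual parameters.

```python
import math
import random

def pick_token(logits, do_sample=False, temperature=1.0, rng=None):
    """Choose the next token greedily unless sampling is enabled."""
    if not do_sample or temperature <= 0:
        # Greedy decoding: the temperature value is never consulted.
        return max(logits, key=logits.get)
    rng = rng or random.Random()
    tokens = list(logits)
    peak = max(logits.values())
    # Temperature-scaled softmax weights (unnormalized is fine for choices()).
    weights = [math.exp((logits[t] - peak) / temperature) for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

logits = {"defect": 2.0, "crack": 1.5, "scratch": 1.1}
print(pick_token(logits, do_sample=False, temperature=1.5))
print(pick_token(logits, do_sample=True, temperature=1.5, rng=random.Random(7)))
```

With sampling disabled, every temperature value gives the same greedy choice, which is why changing temperature alone can appear to do nothing.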

Streaming vs batch inference

The Vi SDK supports two ways to receive inference results.

Single-image inference sends one image and receives a complete response. The model generates the full answer before returning it. This is the default mode and works well for most applications.

Streaming returns tokens as they are generated, without waiting for the full response. Your application receives partial results in real time. This is useful for interactive applications where users benefit from seeing the answer appear progressively, or for long outputs where you want to start processing early.
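The difference between the two modes can be sketched with Python generators. `fake_token_source` is a hypothetical stand-in for a model emitting tokens, and the function names are illustrative rather than taken from the Vi SDK.

```python
from typing import Iterator

def fake_token_source() -> Iterator[str]:
    """Hypothetical stand-in for a model emitting tokens one at a time."""
    yield from ["The", " weld", " shows", " a", " crack", "."]

def complete_response(tokens: Iterator[str]) -> str:
    """Non-streaming: wait for every token, then return the full text."""
    return "".join(tokens)

def streamed_response(tokens: Iterator[str]) -> Iterator[str]:
    """Streaming: yield the text so far each time a new token arrives."""
    text = ""
    for token in tokens:
        text += token
        yield text

print(complete_response(fake_token_source()))
for partial in streamed_response(fake_token_source()):
    print(partial)
```

Both paths end with the same final text; streaming only changes when the caller gets to see it.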

What is hallucination?

Hallucination is when a VLM generates information that isn't present in the image. The model might describe objects that don't exist, invent counts, or assign labels based on patterns from its pre-training data rather than what's visible.

This happens because VLMs are trained on millions of image-text pairs. When the model encounters an ambiguous or unfamiliar image, it fills gaps with statistically likely content from that training. A model fine-tuned on factory images might "see" a common defect type even when the image shows no defect at all.

Three factors reduce hallucination:

Fine-tuning on clean data. The more accurate your annotations, the less the model relies on general pre-training patterns. See the annotation guide for quality standards.
Hallucination guards in the system prompt. Instructions like "Only describe what is directly visible" constrain the model's output. See system prompt guardrails for examples.
Lower temperature at inference. Setting temperature to 0.0-0.1 makes the model pick the most probable tokens almost every time, which reduces the chance of sampling an unlikely, fabricated detail.

Hallucination is not a bug you can fix once. It's a tendency that you manage through data quality, prompt design, and inference settings.

How system prompts affect inference

The system prompt you set during training carries over to inference. It tells the model its role and expected output format. Changing the system prompt at inference time can alter behavior, but the model performs best with the same prompt it was trained on.

If you trained with a system prompt like "You are a quality inspector. Report defects as JSON with fields: type, severity, location," the model will follow that format at inference. If you change the prompt to something unrelated, the model may produce unpredictable results because it was not trained for that instruction.
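When the system prompt asks for JSON, it can help to check each response before using it. The sketch below assumes the model returns a single JSON object with the three fields from the example prompt; `parse_defect_report` is a hypothetical helper, not part of the Vi SDK.

```python
import json

REQUIRED_FIELDS = ("type", "severity", "location")

def parse_defect_report(response_text):
    """Return the parsed report, or None if it is not the expected JSON."""
    try:
        report = json.loads(response_text)
    except json.JSONDecodeError:
        # Malformed output, for example a response cut off by the token limit.
        return None
    if not isinstance(report, dict):
        return None
    if any(field not in report for field in REQUIRED_FIELDS):
        return None
    return report

good = '{"type": "crack", "severity": "high", "location": "upper left"}'
cut_off = '{"type": "crack", "severity": "hi'
print(parse_defect_report(good))
print(parse_defect_report(cut_off))
```

A response that fails this check is a signal to look at the prompt, the max new tokens limit, or the sampling settings before trusting the output.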
