Configure Generation

Pass a generation_config dictionary to model(...) to control how Datature Vi generates responses during inference. You can tune output length, randomness, sampling behavior, and repetition.

Before You Start

Learn how to run inference →

Generation parameters control how the model writes its response, token by token. These are the same settings available during training evaluation, but here you can change them at inference time without retraining.

| Use case | Temperature | Top-p | Max tokens |
| --- | --- | --- | --- |
| Quality inspection (consistent) | 0.1 | 0.9 | 256 |
| Detailed description | 0.5 | 0.95 | 512 |
| Creative or exploratory | 0.8 | 1.0 | 1024 |
| Deterministic output | 0.0 | 1.0 | 256 |

Basic usage

from vi.inference import ViModel

model = ViModel(run_id="your-run-id")

result, error = model(
    source="image.jpg",
    user_prompt="Describe this image",
    generation_config={
        "max_new_tokens": 256,
        "temperature": 0.7,
        "top_p": 0.9,
        "do_sample": True
    }
)

Parameters

generation_config: Parameters

| Name | Type | Description | Required | Default |
| --- | --- | --- | --- | --- |
| max_new_tokens | integer | Maximum number of tokens to generate. Range: 1 to 4096 (model-dependent). | Optional | 1024 |
| temperature | number | Controls randomness. Lower values produce more deterministic outputs; higher values produce more varied outputs. Range: 0.0 to 2.0. | Optional | 1.0 |
| top_p | number | Nucleus sampling threshold. The model considers only tokens whose cumulative probability reaches top_p. Range: 0.0 to 1.0. | Optional | 1.0 |
| top_k | integer | Top-k sampling. The model considers only the top k most likely tokens at each step. Range: 1 to 100. | Optional | 50 |
| do_sample | boolean | Whether to use sampling (true) or greedy decoding (false). Greedy decoding is deterministic; sampling introduces variation. | Optional | false |
| repetition_penalty | number | Penalty applied to repeated tokens. Values above 1.0 reduce repetition. Range: 1.0 to 2.0. | Optional | 1.05 |
| seed | integer | Random seed for reproducibility. Set to -1 for random generation. | Optional | 0 |

Video frame sampling (fps)

Video inputs use the same generation_config fields as images (temperature, max_new_tokens, and so on). The frame sampling rate for video is controlled separately: pass fps as a keyword argument on model(...), not inside the generation_config dictionary. It is optional and defaults to 4.0, the number of frames sampled per second of source video during preprocessing; it is ignored when the input is not a video.
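
For example, a video call might look like the following (the file path and prompt are illustrative):

# Sample 2 frames per second from the source video before inference.
# fps is a keyword argument on model(...), not a generation_config field.
result, error = model(
    source="video.mp4",
    user_prompt="Describe what happens in this clip",
    fps=2.0,
    generation_config={
        "max_new_tokens": 512,
        "temperature": 0.7,
        "do_sample": True
    }
)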

Run inference on video files, URLs, and batch lists →

Chain-of-thought (cot)

Chain-of-thought decoding is enabled on the model(...) call, not inside generation_config. Pass cot=True together with your source and prompt. Behavior and token budget notes are covered in Run Inference.
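
A minimal sketch of a CoT call, following the same pattern as the basic usage example above (the prompt is illustrative):

# cot is passed on the call itself, not inside generation_config
result, error = model(
    source="image.jpg",
    user_prompt="How many defects are visible, and where are they?",
    cot=True,
    generation_config={
        "temperature": 0.3,
        "do_sample": True
    }
)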

Parameter details

max_new_tokens

Controls how long the response can be. For short captions, 50-100 tokens is plenty. For detailed descriptions or long CoT traces, you may need 512 or more (with cot=True and a dict generation_config that omits max_new_tokens, the SDK raises the cap automatically; see Run Inference).

# Short response
generation_config={"max_new_tokens": 50}

# Medium response
generation_config={"max_new_tokens": 256}

# Long response
generation_config={"max_new_tokens": 1024}

temperature

Lower temperatures make the model stick more consistently to its highest-probability token choices. At 0.0, the output is fully deterministic. At 1.5 or above, outputs become more varied and sometimes unpredictable.

# Deterministic: best for factual Q&A
generation_config={"temperature": 0.0}

# Balanced
generation_config={"temperature": 0.7}

# Creative: more varied outputs
generation_config={"temperature": 1.5}

top_p and top_k

These two parameters work together to limit which tokens the model considers at each step:

  • top_p (nucleus sampling) restricts candidates to the smallest set of tokens whose cumulative probability reaches top_p
  • top_k restricts candidates to the k most likely tokens

# Focused output
generation_config={"top_p": 0.8, "top_k": 10}

# Balanced output
generation_config={"top_p": 0.95, "top_k": 50}

# Consider all tokens
generation_config={"top_p": 1.0, "top_k": 100}

do_sample

Set do_sample=True when using temperature, top_p, or top_k. Sampling parameters only take effect when sampling is enabled. Without it, the model uses greedy decoding regardless of other settings.

# Sampling (varied outputs)
generation_config={"do_sample": True, "temperature": 0.7}

# Greedy decoding (fully deterministic)
generation_config={"do_sample": False}

repetition_penalty

A value of 1.0 applies no penalty. Increase it if the model repeats phrases in long outputs.

# No penalty
generation_config={"repetition_penalty": 1.0}

# Moderate: good for long descriptions
generation_config={"repetition_penalty": 1.2}

# Strong: aggressive anti-repetition
generation_config={"repetition_penalty": 1.5}

seed

Use a fixed seed to get the same output for the same input. Set to -1 for random behavior.

generation_config={"seed": 42}

Common configurations

# Short, consistent caption
result, error = model(
    source="image.jpg",
    user_prompt="Provide a brief caption",
    generation_config={
        "max_new_tokens": 50,
        "temperature": 0.3,
        "do_sample": True
    }
)

# Detailed description
result, error = model(
    source="image.jpg",
    user_prompt="Provide a detailed description",
    generation_config={
        "max_new_tokens": 512,
        "temperature": 0.7,
        "top_p": 0.95,
        "do_sample": True
    }
)

# Deterministic output
result, error = model(
    source="image.jpg",
    user_prompt="What objects are visible?",
    generation_config={
        "temperature": 0.0,
        "do_sample": False,
        "seed": 42
    }
)

# Creative or exploratory description
result, error = model(
    source="image.jpg",
    user_prompt="Create an artistic description",
    generation_config={
        "max_new_tokens": 300,
        "temperature": 0.9,
        "top_p": 0.95,
        "do_sample": True,
        "repetition_penalty": 1.2
    }
)

Next steps

Run Inference

Images, video, batch lists and folders, streaming, and error handling.

Prediction Schemas

Full reference for VQA, phrase grounding, and generic response types returned by the SDK.

How Does VLM Training Work?

Learn the concepts behind fine-tuning, LoRA, and how VLMs learn from your data.