Glossary
Key terms and definitions for vision-language models, VLMOps, and computer vision concepts in Datature Vi.
This glossary defines terms and concepts in Datature Vi that you will encounter while working with vision-language models. Only terms you directly interact with or need to understand are included.
A
Adapter
Small trainable matrices inserted into a frozen model during LoRA training. Only adapter weights are updated, reducing training cost by 95-99%. After training, adapters merge into the base model with no inference overhead.
Annotation
Labels that teach a model what to look for in images. In Datature Vi, annotations pair images with text through phrase-box links, Q&A pairs, or freeform text.
Attention Mechanism
A technique that helps models focus on the most relevant parts of their input. When you ask about "the red car on the left," attention helps the model zero in on that region of the image.
B
Backbone Network
The core visual processing component of a model that extracts features from images. In VLMs, the backbone is typically a pre-trained vision model like SigLIP or ViT.
Batch size
The number of training examples processed together in one forward and backward pass during model training. A configurable training setting in Datature Vi that affects training speed and memory usage.
BFloat16 / Float16 / Float32
Numerical precision formats for training. BFloat16 (recommended) balances speed and stability. Float16 is faster but can cause numerical issues. Float32 is slowest but most precise.
C
Chain-of-thought reasoning (CoT)
A technique that enables vision-language models to solve complex visual tasks by breaking them down into explicit, step-by-step reasoning processes, improving accuracy and interpretability.
Checkpoint
A saved snapshot of a model's weights and training state at a specific point during training. Checkpoints let you resume training or deploy models from different training stages.
Checkpoint key
A unique identifier for a specific model checkpoint in Datature Vi. Used in the Vi SDK to load a particular checkpoint for inference or export. Also called model key when referencing the final trained model.
COCO (Common Objects in Context)
A widely-used annotation format for object detection that uses JSON files containing image metadata, bounding boxes, and category information. Supported for uploading annotations to Datature Vi.
Compute Credits
The currency for GPU time in Datature Vi. Training runs and deployed endpoints consume compute credits based on GPU type and duration.
Context window
The fixed number of tokens a VLM can process in a single request. Images, system prompts, user prompts, and the generated response all share this budget. Larger windows support higher-resolution images or longer outputs.
Convergence
When a model's loss stops decreasing and stabilizes. A converged model has learned as much as it can from the current data and settings.
Cross-entropy loss
The loss function used during VLM training. At each token position, cross-entropy measures the gap between the model's predicted probability distribution and the correct answer. Lower values mean the model's predictions are closer to the ground truth.
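The definition above can be sketched in a few lines of Python. The 4-token vocabulary and probability values are illustrative, not real model output:

```python
import math

def cross_entropy(predicted_probs, target_index):
    """Cross-entropy loss at one token position: -log p(correct token)."""
    return -math.log(predicted_probs[target_index])

# Model's distribution over a tiny hypothetical 4-token vocabulary.
probs = [0.1, 0.7, 0.15, 0.05]

loss_confident = cross_entropy(probs, 1)  # correct token got p=0.70 -> low loss
loss_wrong = cross_entropy(probs, 3)      # correct token got p=0.05 -> high loss
print(loss_confident, loss_wrong)
```

The closer the model's probability mass sits on the ground-truth token, the lower the loss, which is exactly what training minimizes.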
Cross-Modal Fusion
The process of combining information from different types of input (images and text) into a shared representation. This is how VLMs connect what they see with what you ask.
CUDA
NVIDIA's programming platform for GPU computing. Required for running VLMs on NVIDIA GPUs. The Vi SDK detects CUDA automatically.
D
Data Rows
The currency for storage and annotation in Datature Vi. Uploading one image or one video frame costs 5 Data Rows; creating one annotation costs 1 Data Row.
Data sovereignty
The requirement that data is stored and processed in a specific geographic region or jurisdiction. Datature Vi's storage region setting is permanent per dataset and determines where your images reside.
Dataset split ratio
The proportion of data used for training vs validation. Datature Vi defaults to 80/20. The validation portion is held out from training and used to check for overfitting at regular intervals.
DeepSpeed ZeRO-3
A distributed training strategy that partitions model weights, gradients, and optimizer states across multiple GPUs. Datature Vi uses DeepSpeed ZeRO-3 automatically for full SFT runs that require model parallelism.
Dense captioning
A computer vision task that generates detailed natural language descriptions for multiple regions within an image, identifying and describing individual objects and their relationships.
Deployment
The process of making a trained model available for inference. In Datature Vi, you download trained model weights via the Vi SDK and run them locally or deploy through NVIDIA NIM containers for production use.
Docker / Container
A technology for packaging software into isolated, portable units. NVIDIA NIM uses Docker containers to serve VLMs.
DPO (Direct Preference Optimization)
A post-training alignment method that trains a model to prefer expert-chosen outputs over rejected ones. DPO maps preference data directly to gradient updates without a separate reward model, making it simpler and more stable than RLHF.
Dynamic resolution
A technique where the visual encoder adapts the number of image tiles based on the input image's size and aspect ratio, rather than resizing every image to a fixed square. Preserves detail in high-resolution images but consumes more tokens from the context window.
E
Embeddings
Numerical representations of text or images that capture meaning. Similar items have similar embeddings. VLMs use embeddings internally to connect visual and text features.
Encoder / Decoder
An encoder processes input (image or text) into a compact representation. A decoder generates output (text) from that representation. VLMs use a visual encoder and a text decoder.
Epoch
One complete pass through the entire training dataset during model training. A configurable training setting in Datature Vi.
F
Fine-tuning
The process of taking a pre-trained model and continuing training on a specific dataset to adapt it to a particular task or domain. This is what happens when you train a model in Datature Vi.
Flash Attention
An optimized algorithm for the attention computation in transformer models. It runs 2-3x faster on compatible NVIDIA GPUs (Ampere and newer). The Vi SDK enables it automatically when supported.
Freeform Text
A dataset type that allows custom annotation schemas for specialized computer vision applications. Unlike predefined formats like phrase grounding or visual question answering, freeform text datasets give full flexibility for research and specialized use cases.
FSDP (Fully Sharded Data Parallelism)
A distributed training technique that shards model parameters, gradients, and optimizer states across GPUs. Datature Vi selects FSDP or DeepSpeed ZeRO-3 automatically based on model size and GPU count during full SFT runs.
Full SFT (Full Supervised Fine-Tuning)
A training mode that updates every parameter in the model, giving maximum flexibility for domain adaptation. Requires more GPU memory and training time than LoRA but achieves the highest possible accuracy ceiling.
G
GPU (Graphics Processing Unit)
Specialized hardware for training and running deep learning models. You select GPU resources when configuring training runs in Datature Vi.
Gradient / Gradient Accumulation
A gradient measures how much each parameter should change to reduce loss. Gradient accumulation processes multiple small batches before updating, simulating a larger batch size without extra GPU memory.
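A minimal sketch of gradient accumulation, using a toy one-parameter model instead of a real VLM (all values are illustrative):

```python
def grad(w, batch):
    """Gradient of mean squared error for the toy model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

w = 0.0
lr = 0.1
accumulation_steps = 4
# Four micro-batches of one example each; the data follows y = 2x.
micro_batches = [[(1.0, 2.0)], [(2.0, 4.0)], [(3.0, 6.0)], [(4.0, 8.0)]]

accumulated = 0.0
for step, batch in enumerate(micro_batches, start=1):
    accumulated += grad(w, batch)       # sum gradients, no update yet
    if step % accumulation_steps == 0:  # one update per 4 micro-batches,
        w -= lr * (accumulated / accumulation_steps)  # averaging the gradients
        accumulated = 0.0
print(w)
```

The single update uses the average gradient over all four micro-batches, matching what one batch of four examples would produce, without ever holding four examples in memory at once.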
Greedy decoding
An inference strategy where the model always picks the single most probable token at each step. Produces deterministic output with no randomness. Set temperature to 0.0 or do_sample to false to use greedy decoding.
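A minimal illustration of greedy decoding over a hypothetical next-token distribution:

```python
def greedy_pick(token_probs):
    """Greedy decoding: always choose the single highest-probability token."""
    return max(token_probs, key=token_probs.get)

# Hypothetical next-token distribution at one generation step.
token_probs = {"cat": 0.55, "dog": 0.30, "bird": 0.15}
print(greedy_pick(token_probs))  # always "cat": no randomness involved
```

Because there is no sampling, running the same prompt repeatedly yields identical output every time.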
Ground Truth
The correct answer for a training example. Your annotations are the ground truth that the model tries to match during training.
Grounded phrase
In phrase grounding, a text phrase that has been linked to one or more bounding boxes in an image, establishing the connection between language and visual regions.
GRPO (Group Relative Policy Optimization)
A reinforcement learning training method used in some VLM architectures (like Cosmos-Reason2) to improve reasoning quality. GRPO trains the model by comparing groups of generated responses and reinforcing the better ones.
H
Hallucination
When a VLM generates information not present in the image. The model may describe objects that don't exist, invent counts, or assign labels based on pre-training patterns rather than visible content. Reduced through fine-tuning, system prompt guardrails, and low temperature settings.
Harmonic Mean
A type of average that penalizes extreme imbalances. F1 score uses the harmonic mean of precision and recall, so a model with 95% precision but 10% recall gets a low F1, not a high average.
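The precision/recall example above can be checked directly:

```python
def f1(precision, recall):
    """F1 score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# The imbalanced example from the definition: 95% precision, 10% recall.
score = f1(0.95, 0.10)
print(round(score, 3))  # far below the arithmetic mean of 0.525
```

The harmonic mean drags the score toward the weaker of the two values, so a model cannot hide poor recall behind high precision.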
Hyperparameter
A configuration setting that controls the training process, such as learning rate, batch size, and number of epochs. You configure these when creating training workflows.
I
Inference
Running a trained model on new images to get predictions. In Datature Vi, you run inference using the Vi SDK or NVIDIA NIM containers.
IntelliScribe
Datature Vi's AI-assisted annotation feature that automatically generates text to speed up phrase grounding and freeform image annotation. For phrase grounding, it generates captions and links phrases to bounding boxes. For freeform image annotation, it generates text content matching image context.
J
JSONL (JSON Lines)
A text format where each line is a valid JSON object. Datature Vi's native annotation format uses JSONL for storing image metadata and labels.
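Parsing JSONL needs nothing beyond the standard library, since each line is an independent JSON object. The field names below are illustrative, not Datature Vi's actual schema:

```python
import json

# Two hypothetical records, one JSON object per line.
jsonl_text = "\n".join([
    '{"image": "img_001.jpg", "caption": "a red car"}',
    '{"image": "img_002.jpg", "caption": "two dogs"}',
])

# Parse line by line, skipping any blank lines.
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
print(records[1]["caption"])
```

Because each line stands alone, JSONL files can be streamed or appended to without re-parsing the whole file, unlike a single large JSON array.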
K
KL Divergence (Kullback-Leibler Divergence)
A measure of how much one probability distribution differs from another. In DPO, KL divergence tracks how far the aligned model has drifted from its reference checkpoint. Controlled by the beta hyperparameter.
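A minimal sketch of the computation over two small discrete distributions (the probability values are illustrative):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q): expected extra log-loss from using Q in place of P."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # e.g. the aligned model's next-token distribution
q = [0.6, 0.3, 0.1]  # e.g. the reference checkpoint's distribution
print(round(kl_divergence(p, q), 4))

identical = kl_divergence(p, p)  # zero divergence: no drift at all
```

KL divergence is zero only when the two distributions match exactly, and grows as the aligned model drifts further from its reference.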
L
Learning rate
A hyperparameter that controls how much model weights are updated during training. A configurable training setting in Datature Vi.
Loss / Loss Function
A number measuring how far the model's predictions are from the correct annotations. Lower loss means better predictions. Training aims to minimize this number.
M
Model architecture
The underlying design of a VLM: how it processes images, encodes text, and generates predictions. Datature Vi supports seven architectures (Qwen3.5, Qwen3-VL, Qwen2.5-VL, NVILA-Lite, Cosmos-Reason1, Cosmos-Reason2, InternVL3.5) with different sizes, strengths, and hardware requirements.
Multi-token prediction (MTP)
An inference technique where the model predicts multiple tokens at once instead of one at a time. Used in Qwen3.5 to speed up generation while maintaining quality. Also enables speculative decoding for faster inference.
Multimodal
Involving more than one type of data. VLMs are multimodal because they process both images and text.
N
NF4 / FP4 Quantization
Methods for compressing model weights from 16-bit to 4-bit precision, reducing memory by roughly 4x. NF4 (Normalized Float 4) preserves more accuracy for transformer models.
NGC (NVIDIA GPU Cloud)
NVIDIA's registry for GPU-optimized container images. You need an NGC API key to pull NIM containers.
NIM (NVIDIA Inference Microservice)
Pre-built Docker containers for serving VLMs in production. NIM provides GPU-accelerated inference with an OpenAI-compatible API.
Normalization
The process of scaling coordinate values to a standard range, typically 0 to 1. Normalized coordinates in bounding boxes are relative to image dimensions, making them resolution-independent. Important when uploading annotations.
Normalized Coordinates
Bounding box positions expressed as values between 0 and 1 (or 0 and 1024 for Vi SDK output), relative to image dimensions. This makes annotations independent of image resolution.
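A minimal sketch of the conversion (the box and image dimensions are illustrative):

```python
def normalize_box(x_min, y_min, x_max, y_max, width, height, scale=1.0):
    """Convert pixel coordinates to resolution-independent values.

    scale=1.0 gives the usual 0-1 range; scale=1024 matches the
    0-1024 convention mentioned above for Vi SDK output.
    """
    return (x_min / width * scale, y_min / height * scale,
            x_max / width * scale, y_max / height * scale)

# A box in the top-left region of a 1920x1080 image.
box = normalize_box(192, 108, 292, 308, width=1920, height=1080)
print(box)  # same box, expressed independently of resolution
```

Dividing by the image dimensions means the same normalized box describes the same region whether the image is later viewed at full size or as a thumbnail.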
NVLink
A high-bandwidth interconnect between NVIDIA GPUs that enables fast gradient synchronization during multi-GPU training. H100 GPUs in Datature Vi use NVLink automatically for distributed training runs.
O
OCR (Optical Character Recognition)
Technology that extracts and recognizes text from images. Used in document analysis and freeform text tasks.
Optimizer
The algorithm that updates model parameters based on gradients. Datature Vi uses a default optimizer tuned for VLM training. You rarely need to change it.
Overfitting
When a model memorizes training data instead of learning general patterns. Signs: training loss drops but validation loss rises. Fix with more diverse data or fewer epochs.
P
Parameters (Model)
The learned values inside a neural network. A "7B parameter" model has 7 billion numbers tuned during pre-training. More parameters generally means more capability but also more compute.
Pascal VOC (Visual Object Classes)
An annotation format that uses XML files to store bounding box information for object detection tasks. Supported for uploading annotations to Datature Vi.
PEFT (Parameter-Efficient Fine-Tuning)
A family of techniques that update only a small subset of model parameters during fine-tuning. LoRA is the most common PEFT method. In Datature Vi, LoRA mode is the recommended PEFT approach for most tasks.
Phrase grounding
A computer vision task that localizes objects in images based on natural language descriptions. Also called visual grounding or referring expression comprehension. One of the dataset types in Datature Vi.
Pre-trained model
A model that has been trained on large datasets and can be fine-tuned for specific tasks. Datature Vi uses pre-trained VLMs as the starting point for training.
Precision (Statistical)
Of all predictions the model made, how many were correct. High precision means few false positives.
Preference Alignment
The process of training a model to produce outputs that match expert preferences. In Datature Vi, preference alignment uses DPO after supervised fine-tuning to reduce hallucinations, improve grounding accuracy, and align response style with domain experts.
Q
QLoRA (Quantized LoRA)
A training method that combines LoRA with NF4 quantization. Base model weights are stored in 4-bit precision while adapter matrices train in BF16. This gives roughly 4x memory savings over standard LoRA, letting you fine-tune a 7B model on a single T4 GPU.
R
Recall (Statistical)
Of all real objects or correct answers, how many did the model find. High recall means few missed detections.
Referring expression
A natural language phrase that describes a specific object or region in an image, used in phrase grounding tasks. For example: "the red car on the left."
Repetition penalty
A generation parameter that reduces the probability of tokens already in the output. Values above 1.0 discourage the model from repeating words or phrases. Default is 1.05 in Datature Vi.
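A sketch of one common formulation (divide positive logits by the penalty, multiply negative ones); the values are illustrative and Datature Vi's exact implementation may differ:

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.05):
    """Make tokens that already appear in the output less likely."""
    out = dict(logits)
    for token_id in generated_ids:
        if out[token_id] > 0:
            out[token_id] /= penalty  # shrink positive logits
        else:
            out[token_id] *= penalty  # push negative logits further down
    return out

# Hypothetical logits for three candidate tokens; "the" was already generated.
logits = {"the": 2.0, "car": 1.5, "red": 0.5}
penalized = apply_repetition_penalty(logits, generated_ids=["the"])
print(penalized["the"])  # lowered from 2.0, so "the" is less likely to repeat
```

Tokens not yet generated keep their original logits, so the penalty only discourages repetition rather than changing the overall vocabulary.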
Reward Gap
In DPO, the average log-probability difference between chosen and rejected responses. A wider reward gap indicates the model has developed a stronger preference signal, assigning higher probability to expert-preferred outputs.
RLHF (Reinforcement Learning from Human Feedback)
A multi-stage alignment method that trains a separate reward model from human preferences, then uses reinforcement learning (PPO) to optimize the policy. More complex and less stable than DPO, requiring three models and 15+ hyperparameters.
Run
A single training execution in Datature Vi. Each workflow can have multiple runs with different configurations or data.
S
SafeTensors
A secure file format for storing model weights. Prevents arbitrary code execution during model loading, unlike older serialization formats. Datature Vi exports trained models in SafeTensors format.
Secret key
An API authentication credential for your Datature Vi organization. Required to initialize the Vi SDK client and make API calls. Keep secret keys confidential and never commit them to version control.
Semantic Similarity
How close two pieces of text are in meaning, not just word overlap. BERTScore measures this. "Car" and "automobile" have high semantic similarity.
SFT (Supervised Fine-Tuning)
The standard method for fine-tuning a VLM on labeled data. The model learns by comparing its predictions against your annotations and adjusting weights to reduce the difference. In Datature Vi, both LoRA and full fine-tuning modes use SFT.
Shuffle (data shuffling)
Randomizing the order of training examples before each epoch. Prevents the model from learning patterns based on data order rather than content. Enabled by default in Datature Vi.
Speculative decoding
An inference optimization where a smaller draft model predicts multiple tokens ahead, and the main model verifies them in parallel. Speeds up generation without changing output quality. Enabled by multi-token prediction in architectures like Qwen3.5.
Streaming inference
An inference mode where the model returns tokens as they are generated, rather than waiting for the complete response. Useful for interactive applications or long outputs where users benefit from seeing progress.
System prompt
Natural language instructions that define a VLM's task and behavior. Covers role, focus area, output format, and hallucination guards. The same system prompt is used during both training and inference. Changing it after training degrades performance.
T
Temperature (Generation)
Controls randomness in model output. Low (0.1-0.3): predictable, repeatable answers. High (0.7-1.0): more varied responses. 0.0: identical output every time.
Token
A word piece that models use to process text. "Understanding" might become two tokens: "under" and "standing." One token is roughly 0.75 English words.
Top-p / Top-k (Sampling)
Parameters that limit which words the model considers when generating text. Top-p=0.9 means the model picks from the most likely words adding up to 90% probability. Top-k=50 means only the 50 most likely words are considered.
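Both filters can be sketched over a hypothetical next-token distribution:

```python
def top_k_filter(token_probs, k):
    """Keep only the k most probable tokens."""
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])

def top_p_filter(token_probs, p):
    """Keep the smallest set of tokens whose probabilities sum to >= p."""
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        total += prob
        if total >= p:
            break
    return kept

probs = {"car": 0.5, "truck": 0.3, "bus": 0.15, "bike": 0.05}
print(top_k_filter(probs, 2))    # the 2 most likely tokens
print(top_p_filter(probs, 0.9))  # car + truck + bus reach 0.95 >= 0.9
```

Top-k keeps a fixed number of candidates regardless of how probability is spread, while top-p adapts: a confident distribution passes few tokens through, a flat one passes many.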
Traffic Splitting
A deployment technique that routes a percentage of inference requests to different model versions simultaneously. Used for A/B testing model performance in production before fully promoting a new version.
Training project
An organizational unit in Datature Vi that groups related workflows and runs. A project connects a dataset to one or more model training configurations, letting you compare approaches for the same task.
Training Step
One batch of data processed by the model. If you have 1000 images and a batch size of 8, one epoch takes 125 steps.
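The arithmetic from the example above, written out (any dataset that doesn't divide evenly rounds up to a final partial batch):

```python
import math

dataset_size = 1000
batch_size = 8

# Each step consumes one batch; a partial final batch still counts as a step.
steps_per_epoch = math.ceil(dataset_size / batch_size)
print(steps_per_epoch)  # 125 steps per epoch, as in the definition above
```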
Transformer
The neural network architecture behind modern VLMs. Transformers use attention mechanisms to process sequences of data in parallel, making them fast and effective for both language and vision tasks.
True Positive / False Positive / True Negative / False Negative
Categories for prediction outcomes. True positive: correct prediction. False positive: model predicted something that isn't there. False negative: model missed something real.
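These counts are what precision and recall (defined above) are computed from. A minimal sketch with hypothetical detection results:

```python
def precision_recall(tp, fp, fn):
    """Precision: correct predictions / all predictions made.
    Recall: correct predictions / all real objects present."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Hypothetical run: 80 correct boxes, 20 spurious boxes, 40 missed objects.
precision, recall = precision_recall(tp=80, fp=20, fn=40)
print(precision, recall)
```

Here false positives pull down precision while false negatives pull down recall, which is why the two metrics are reported together.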
U
Underfitting
When a model fails to learn patterns from the data. Signs: both training and validation loss remain high. Fix by training longer, using a larger model, or increasing the learning rate.
V
Validation / Validation Set
A held-out portion of your data (20% by default in Datature Vi) used to test the model during training. The model never trains on validation data, so it reveals overfitting.
Vi SDK
Datature Vi's Software Development Kit for programmatic access to datasets, models, training, and deployment features.
Vision-language model (VLM)
A neural network architecture that combines computer vision and natural language processing to understand relationships between images and text. VLMs can perform tasks like visual question answering, phrase grounding, and custom vision tasks. This is the type of model Datature Vi trains.
Visual Encoder
The part of a VLM that processes images. It breaks images into patches and converts them into numerical representations that the language model can understand.
Visual question answering (VQA)
A task where models answer natural language questions about images. One of the dataset types in Datature Vi.
VLMOps
Operations and practices for managing the lifecycle of vision-language models. Datature Vi is a VLMOps platform.
VRAM
Video RAM, the dedicated memory on a GPU. Larger models and batch sizes need more VRAM. A 7B parameter model typically needs 16-32 GB of VRAM for training.
W
Workflow
A reusable training configuration in Datature Vi that defines dataset splits, model architecture, and training settings. Workflows can have multiple runs.
Y
YOLO (You Only Look Once)
A family of real-time object detection models and an annotation format that uses normalized coordinates in text files. Supported for uploading annotations to Datature Vi.
Z
Zero-shot learning
The ability of a model to perform tasks on categories or domains it has never been explicitly trained on, using knowledge from pre-training. VLMs can perform zero-shot phrase grounding without task-specific training.