Glossary
Key terms and definitions for vision-language models, MLOps, and computer vision concepts.
This glossary covers the user-facing terms and concepts in Datature Vi; only terms that users directly interact with or need to understand are included.
B
Batch size
The number of training examples processed together in one forward/backward pass during model training. A configurable training setting in Datature Vi that affects training speed and memory usage.
C
Chain-of-thought reasoning
A technique that enables vision-language models to solve complex visual tasks by breaking them down into explicit, step-by-step reasoning processes, improving accuracy and interpretability.
Checkpoint
A saved snapshot of a model's weights and training state at a specific point during training. Checkpoints allow you to resume training or deploy models from different training stages.
COCO format
A widely-used annotation format for object detection that uses JSON files containing image metadata, bounding boxes, and category information. Supported for uploading annotations to Datature Vi.
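A minimal, illustrative COCO-style file written with Python's standard library; the file name, IDs, and values here are made up for demonstration:

```python
import json

# Minimal COCO-style annotation structure (illustrative values).
coco = {
    "images": [{"id": 1, "file_name": "street.jpg", "width": 1920, "height": 1080}],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 3,
         "bbox": [120, 340, 440, 380]}  # [x, y, width, height] in pixels
    ],
    "categories": [{"id": 3, "name": "car"}],
}

with open("annotations.json", "w") as f:
    json.dump(coco, f, indent=2)
```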
D
Dense captioning
A computer vision task that generates detailed natural language descriptions for multiple regions within an image, identifying and describing individual objects and their relationships.
E
Epoch
One complete pass through the entire training dataset during model training. A configurable training setting in Datature Vi.
F
Fine-tuning
The process of taking a pre-trained model and continuing training on a specific dataset to adapt it to a particular task or domain. This is what happens when you train a model in Datature Vi.
Freeform dataset
🚧 Coming soon: A dataset type that allows custom annotation schemas for specialized computer vision applications. Unlike predefined formats like phrase grounding or VQA, freeform datasets provide maximum flexibility for research and specialized use cases.
G
GPU (Graphics Processing Unit)
Specialized hardware for training and running deep learning models. You can select GPU resources when configuring training runs in Datature Vi.
Grounded phrase
In phrase grounding, a text phrase that has been linked to one or more bounding boxes in an image, establishing the connection between language and visual regions.
H
Hyperparameter
A configuration setting that controls the training process, such as learning rate, batch size, and number of epochs. You configure these when creating training workflows.
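As a rough sketch, such settings are often grouped into a single configuration; the keys below are generic illustrations, not Datature Vi's actual setting names:

```python
# Generic hyperparameter configuration (illustrative names and values).
hyperparameters = {
    "learning_rate": 1e-4,  # how much weights change per update
    "batch_size": 16,       # examples per forward/backward pass
    "epochs": 10,           # full passes over the training dataset
}
```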
I
Datature Vi's AI-assisted annotation feature that automatically generates captions and highlights phrases to speed up the phrase grounding annotation process.
J
JSONL (JSON Lines)
A text format where each line is a valid JSON object. Datature Vi's native annotation format uses JSONL for storing image metadata and labels.
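A small sketch of reading and writing JSONL with Python's standard library; the record fields shown are hypothetical and may differ from Datature Vi's actual schema:

```python
import json

# Hypothetical records; Datature Vi's real JSONL schema may differ.
records = [
    {"image": "street.jpg", "caption": "a red car parked on the left"},
    {"image": "park.jpg", "caption": "two dogs playing on the grass"},
]

# Write: one JSON object per line.
with open("labels.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read: parse each line independently.
with open("labels.jsonl") as f:
    labels = [json.loads(line) for line in f]
```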
L
Learning rate
A hyperparameter that controls how much model weights are updated during training. A configurable training setting in Datature Vi.
N
Normalization
The process of scaling coordinate values to a standard range, typically [0, 1]. Normalized coordinates in bounding boxes are relative to image dimensions, making them resolution-independent. Important when uploading annotations.
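For example, normalizing a pixel-space bounding box is a simple division by the image dimensions (a minimal sketch):

```python
def normalize_bbox(xmin, ymin, xmax, ymax, img_w, img_h):
    """Scale pixel coordinates to [0, 1] relative to image size."""
    return (xmin / img_w, ymin / img_h, xmax / img_w, ymax / img_h)

# The same normalized box describes the same region at any resolution.
print(normalize_bbox(120, 340, 560, 720, 1920, 1080))
# ≈ (0.0625, 0.3148, 0.2917, 0.6667)
```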
O
OCR (Optical Character Recognition)
Technology that extracts and recognizes text from images. Used in document analysis and free-text tasks.
P
Pascal VOC format
An annotation format that uses XML files to store bounding box information for object detection tasks. Supported for uploading annotations to Datature Vi.
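A minimal sketch of the format, parsed with Python's standard library; the file name and coordinates are made up:

```python
import xml.etree.ElementTree as ET

# Minimal Pascal VOC-style annotation (illustrative values).
voc_xml = """<annotation>
  <filename>street.jpg</filename>
  <size><width>1920</width><height>1080</height><depth>3</depth></size>
  <object>
    <name>car</name>
    <bndbox><xmin>120</xmin><ymin>340</ymin><xmax>560</xmax><ymax>720</ymax></bndbox>
  </object>
</annotation>"""

root = ET.fromstring(voc_xml)
for obj in root.iter("object"):
    box = obj.find("bndbox")
    corners = [int(box.findtext(k)) for k in ("xmin", "ymin", "xmax", "ymax")]
    print(obj.findtext("name"), corners)  # car [120, 340, 560, 720]
```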
Phrase grounding
A computer vision task that localizes objects in images based on natural language descriptions. Also called visual grounding or referring expression comprehension. One of the dataset types in Datature Vi.
Learn about phrase grounding → | Annotate for phrase grounding → | Create dataset →
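Conceptually, a phrase grounding annotation pairs caption phrases with image regions. The structure below is purely illustrative, not Datature Vi's actual annotation schema:

```python
# Illustrative phrase grounding record (not Datature Vi's real schema).
grounded_caption = {
    "image": "street.jpg",
    "caption": "the red car on the left, next to a fire hydrant",
    "phrases": [
        # bbox as normalized [xmin, ymin, xmax, ymax]
        {"text": "the red car on the left", "bbox": [0.05, 0.30, 0.35, 0.70]},
        {"text": "a fire hydrant", "bbox": [0.38, 0.55, 0.45, 0.78]},
    ],
}
```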
Pre-trained model
A model that has been trained on large datasets and can be fine-tuned for specific tasks. Datature Vi uses pre-trained VLMs as the starting point for training.
R
Referring expression
A natural language phrase that describes a specific object or region in an image, used in phrase grounding tasks (e.g., "the red car on the left").
Run
A single training execution in Datature Vi. Each workflow can have multiple runs with different configurations or data.
V
Vi SDK
Datature Vi's Software Development Kit for programmatic access to datasets, models, training, and deployment features.
Vi SDK docs → | Get started → | Install SDK → | API reference →
Vision-language model (VLM)
A neural network architecture that combines computer vision and natural language processing to understand relationships between images and text. VLMs can perform tasks like VQA, phrase grounding, and custom vision tasks. This is the type of model Datature Vi trains.
Visual question answering (VQA)
A task where models answer natural language questions about images. One of the dataset types in Datature Vi.
VLMOps
Operations and practices for managing the lifecycle of vision-language models. Datature Vi is a VLMOps platform.
W
Workflow
A reusable training configuration in Datature Vi that defines dataset splits, model architecture, and training settings. Workflows can have multiple runs.
Y
YOLO
A family of real-time object detection models and an annotation format that uses normalized coordinates in text files. Supported for uploading annotations to Datature Vi.
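Each YOLO label line stores a class ID followed by a normalized center point and box size; a minimal conversion sketch:

```python
def to_yolo_line(class_id, xmin, ymin, xmax, ymax, img_w, img_h):
    """Convert a pixel-space box to a YOLO label line:
    'class x_center y_center width height', all normalized to [0, 1]."""
    x_center = (xmin + xmax) / 2 / img_w
    y_center = (ymin + ymax) / 2 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

print(to_yolo_line(2, 120, 340, 560, 720, 1920, 1080))
# 2 0.177083 0.490741 0.229167 0.351852
```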
Z
Zero-shot learning
The ability of a model to perform tasks on categories or domains it has never been explicitly trained on, using knowledge from pre-training. VLMs can perform zero-shot phrase grounding without task-specific training.
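As an illustration outside Datature Vi, the open-source OWL-ViT model can ground a free-text phrase in an image without task-specific training, via the Hugging Face transformers library. This is a sketch based on its documented usage; the checkpoint name, image path, and threshold are example choices:

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

# Example checkpoint and a placeholder local image path.
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("street.jpg")
texts = [["the red car on the left"]]  # phrase to ground, zero-shot

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to boxes in pixel coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.3, target_sizes=target_sizes
)
for score, box in zip(results[0]["scores"], results[0]["boxes"]):
    print(f"score={score:.2f}, box={box.tolist()}")
```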
Quick Navigation
Now that you're familiar with VLM terminology, explore these key areas:
Complete quickstart guide to train your first VLM
Deep dive into phrase grounding and VQA
Set up your first dataset
Label images for training
Fine-tune VLMs on your data
Programmatic access with Python
Need help?
We're here to support your VLMOps journey. Reach out through any of our support channels.
