Glossary

Key terms and definitions for vision language models, MLOps, and computer vision concepts.

This glossary defines user-facing terms and concepts in Datature Vi. Only terms that users directly interact with or need to understand are included.


B

Batch Size

The number of training examples processed together in one forward/backward pass during model training. A configurable training setting in Datature Vi that affects training speed and memory usage.

Configure batch size →
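As a rough illustration (the numbers below are hypothetical, not Datature Vi defaults), batch size determines how many optimization steps make up one pass over the data:

```python
import math

def steps_per_epoch(dataset_size: int, batch_size: int) -> int:
    # Each training step processes one batch; the final batch may be smaller.
    return math.ceil(dataset_size / batch_size)

# e.g. 10,000 training images with a batch size of 32
print(steps_per_epoch(10_000, 32))  # → 313
```

Larger batch sizes mean fewer steps per epoch but higher memory usage per step.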


C

Chain-of-Thought Reasoning (CoT)

A technique that enables vision-language models to solve complex visual tasks by breaking them down into explicit, step-by-step reasoning processes, improving accuracy and interpretability.

Learn about CoT reasoning →

Checkpoint

A saved snapshot of a model's weights and training state at a specific point during training. Checkpoints allow you to resume training or deploy models from different training stages.

Configure checkpoints →

COCO (Common Objects in Context)

A widely used annotation format for object detection that uses JSON files containing image metadata, bounding boxes, and category information. Supported for uploading annotations to Datature Vi.

Upload COCO annotations →
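A minimal sketch of the COCO JSON layout (the top-level keys and bbox convention follow the COCO specification; the file names and values here are made up):

```python
import json

# Minimal COCO-style annotation file: images, annotations, categories.
coco = {
    "images": [{"id": 1, "file_name": "cat.jpg", "width": 640, "height": 480}],
    "annotations": [{
        "id": 1,
        "image_id": 1,
        "category_id": 1,
        "bbox": [100.0, 50.0, 200.0, 150.0],  # [x, y, width, height] in pixels
        "area": 200.0 * 150.0,
        "iscrowd": 0,
    }],
    "categories": [{"id": 1, "name": "cat"}],
}

print(json.dumps(coco, indent=2))
```

Note that COCO bounding boxes are stored in absolute pixel coordinates as `[x, y, width, height]`, not as corner pairs.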


D

Dense Captioning

A computer vision task that generates detailed natural language descriptions for multiple regions within an image, identifying and describing individual objects and their relationships.


E

Epoch

One complete pass through the entire training dataset during model training. A configurable training setting in Datature Vi.

Configure epochs →


F

Fine-tuning

The process of taking a pre-trained model and continuing training on a specific dataset to adapt it to a particular task or domain. This is what happens when you train a model in Datature Vi.

Train a model → | Training modes →

Freeform

🚧 Coming soon — A dataset type that allows custom annotation schemas for specialized computer vision applications. Unlike predefined formats like phrase grounding or VQA, freeform datasets provide maximum flexibility for research and specialized use cases.

Learn about freeform → | Create dataset →


G

GPU (Graphics Processing Unit)

Specialized hardware for training and running deep learning models. You can select GPU resources when configuring training runs in Datature Vi.

GPU options → | Configure hardware →

Grounded Phrase

In phrase grounding, a text phrase that has been linked to one or more bounding boxes in an image, establishing the connection between language and visual regions.


H

Hyperparameter

A configuration setting that controls the training process, such as learning rate, batch size, and number of epochs. You configure these when creating training workflows.


I

IntelliScribe

Datature Vi's AI-assisted annotation feature that automatically generates captions and highlights phrases to speed up the phrase grounding annotation process.

Use AI-assisted tools → | Annotate for phrase grounding →


J

JSONL (JSON Lines)

A text format where each line is a valid JSON object. Datature Vi's native annotation format uses JSONL for storing image metadata and labels.

Vi JSONL format → | Download annotations →
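For illustration, JSON Lines is simply one JSON object per line. The field names below are hypothetical placeholders, not the exact Vi schema:

```python
import json

records = [
    {"image": "img_001.jpg", "caption": "a dog on a beach"},
    {"image": "img_002.jpg", "caption": "two cars at a junction"},
]

# Serialize: one compact JSON object per line.
jsonl = "\n".join(json.dumps(r) for r in records)
print(jsonl)

# Parse it back line by line.
parsed = [json.loads(line) for line in jsonl.splitlines()]
assert parsed == records
```

Because each line is independent, JSONL files can be streamed or appended to without re-parsing the whole file.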


L

Learning Rate

A hyperparameter that controls how much model weights are updated during training. A configurable training setting in Datature Vi.

Configure learning rate →
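Conceptually (a plain gradient-descent sketch, not the specific optimizer Datature Vi uses), the learning rate scales the size of each weight update:

```python
def sgd_step(weight: float, gradient: float, learning_rate: float) -> float:
    # Move the weight a small step against the gradient direction.
    return weight - learning_rate * gradient

w = 1.0
w = sgd_step(w, gradient=0.5, learning_rate=0.01)
print(w)  # → 0.995
```

Too large a learning rate can make training diverge; too small a rate makes it slow to converge.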


N

Normalization

The process of scaling coordinate values to a standard range, typically [0, 1]. Normalized bounding-box coordinates are expressed relative to the image dimensions, making them resolution-independent. This matters when uploading annotations, since some formats expect pixel coordinates and others expect normalized ones.
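For example, a pixel-space bounding box can be normalized by dividing each value by the corresponding image dimension (a generic sketch; check your target format for its exact box convention):

```python
def normalize_bbox(x, y, w, h, img_w, img_h):
    # Scale pixel coordinates into the [0, 1] range relative to image size.
    return (x / img_w, y / img_h, w / img_w, h / img_h)

# A 320x240 box at (160, 120) in a 640x480 image:
print(normalize_bbox(160, 120, 320, 240, 640, 480))  # → (0.25, 0.25, 0.5, 0.5)
```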


O

OCR (Optical Character Recognition)

Technology that extracts and recognizes text from images. Used in document analysis and free-text tasks.


P

Pascal VOC (Visual Object Classes)

An annotation format that uses XML files to store bounding box information for object detection tasks. Supported for uploading annotations to Datature Vi.

Upload Pascal VOC →
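A sketch of the Pascal VOC XML layout, built with Python's standard library (element names follow the common VOC convention; the file name and box values are made up):

```python
import xml.etree.ElementTree as ET

# Minimal Pascal VOC-style annotation: one object with a bounding box.
root = ET.Element("annotation")
ET.SubElement(root, "filename").text = "cat.jpg"
size = ET.SubElement(root, "size")
for tag, value in [("width", "640"), ("height", "480"), ("depth", "3")]:
    ET.SubElement(size, tag).text = value
obj = ET.SubElement(root, "object")
ET.SubElement(obj, "name").text = "cat"
bbox = ET.SubElement(obj, "bndbox")
# VOC stores absolute pixel corners: xmin/ymin (top-left), xmax/ymax (bottom-right).
for tag, value in [("xmin", "100"), ("ymin", "50"), ("xmax", "300"), ("ymax", "200")]:
    ET.SubElement(bbox, tag).text = value

print(ET.tostring(root, encoding="unicode"))
```

Unlike YOLO, VOC boxes are corner pairs in absolute pixels, one XML file per image.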

Phrase Grounding

A computer vision task that localizes objects in images based on natural language descriptions. Also called visual grounding or referring expression comprehension. One of the dataset types in Datature Vi.

Learn about phrase grounding → | Annotate for phrase grounding → | Create dataset →

Pre-trained Model

A model that has been trained on large datasets and can be fine-tuned for specific tasks. Datature Vi uses pre-trained VLMs as the starting point for training.


R

Referring Expression

A natural language phrase that describes a specific object or region in an image, used in phrase grounding tasks (e.g., "the red car on the left").

Run

A single training execution in Datature Vi. Each workflow can have multiple runs with different configurations or data.

Start a run → | Monitor runs → | Manage runs →


V

Vi SDK

Datature Vi's Software Development Kit for programmatic access to datasets, models, training, and deployment features.

Vi SDK docs → | Get started → | Install SDK → | API reference →

Vision Language Model (VLM)

A neural network architecture that combines computer vision and natural language processing to understand relationships between images and text. VLMs can perform tasks like VQA, phrase grounding, and custom vision tasks. This is the type of model Datature Vi trains.

Train a VLM → | Model architectures → | Core concepts →

Visual Grounding

See Phrase Grounding.

Learn about phrase grounding →

Visual Question Answering (VQA)

A task where models answer natural language questions about images. One of the dataset types in Datature Vi.

Learn about VQA → | Annotate for VQA → | Create dataset →

VLMOps

Operations and practices for managing the lifecycle of vision language models. Datature Vi is a VLMOps platform.


W

Workflow

A reusable training configuration in Datature Vi that defines dataset splits, model architecture, and training settings. Workflows can have multiple runs.

Create workflow → | Manage workflows → | Training guide →


Y

YOLO (You Only Look Once)

A family of real-time object detection models and an annotation format that uses normalized coordinates in text files. Supported for uploading annotations to Datature Vi.

Upload YOLO annotations →
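Each line in a YOLO annotation file encodes one object as a class index followed by a normalized center-point box. A small parsing sketch:

```python
def parse_yolo_line(line: str):
    # YOLO format: "<class_id> <x_center> <y_center> <width> <height>",
    # with all four coordinates normalized to the [0, 1] range.
    parts = line.split()
    return int(parts[0]), [float(v) for v in parts[1:]]

class_id, box = parse_yolo_line("0 0.5 0.5 0.25 0.4")
print(class_id, box)  # → 0 [0.5, 0.5, 0.25, 0.4]
```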


Z

Zero-shot Learning

The ability of a model to perform tasks on categories or domains it has never been explicitly trained on, using knowledge from pre-training. VLMs can perform zero-shot phrase grounding without task-specific training.


Quick Navigation

Now that you're familiar with VLM terminology, explore these key areas: