What Are Annotations and How Do I Create Good Ones?

Learn what annotations are, why quality matters for VLM training, and how to create effective annotations for phrase grounding, VQA, and freeform text in Datature Vi.

Annotations are the labels that teach your VLM what to look for and how to respond. Each annotation pairs an image with text: a descriptive phrase linked to a bounding box (phrase grounding), a question-answer pair (VQA), or custom text (freeform). Datature Vi uses your annotations as training signal, so annotation quality directly determines model quality. This page covers the three annotation types, what separates good annotations from bad ones, how to add chain-of-thought reasoning to your training data, and which formats you can import.


Annotation types in Datature Vi

Datature Vi supports three annotation types, each designed for a different task. Choose the one that matches what you want the model to do.

For step-by-step annotation instructions, see Annotate for Phrase Grounding, Annotate for VQA, or Annotate for Freeform Text.


Quality vs quantity

Fifty accurate, specific annotations often outperform 500 inconsistent ones. The model learns patterns from your training data, and noisy labels teach noisy patterns. A single annotation that says "box" gives the model almost nothing to work with. An annotation that says "the small red box on the top shelf" teaches the model about color, size, position, and the relationship between the object and its surroundings.

Consistency matters just as much as specificity. If you label the same type of defect as "scratch" in half your images and "surface damage" in the other half, the model has to learn two representations for the same concept. Pick one term and use it everywhere.
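One way to enforce a single term is to normalize phrases before upload. The sketch below is a minimal illustration, not a Datature Vi feature: the synonym map and the example phrases are hypothetical, and it lowercases phrases as a side effect.

```python
# Hypothetical synonym map: each variant on the left is rewritten to the
# single canonical term on the right. The terms are illustrative only.
CANONICAL_TERMS = {
    "surface damage": "scratch",
    "scuff": "scratch",
}


def normalize_phrase(phrase: str) -> str:
    """Lowercase a phrase and replace known variants with the canonical term."""
    normalized = phrase.strip().lower()
    for variant, canonical in CANONICAL_TERMS.items():
        normalized = normalized.replace(variant, canonical)
    return normalized


# Example phrases written by different annotators for the same defect type.
phrases = [
    "Scratch on the left panel",
    "surface damage on the left panel",
    "scuff near the handle",
]
normalized_phrases = [normalize_phrase(p) for p in phrases]
```

After normalization, all three phrases use "scratch", so the model sees one label for one concept.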

The table below shows recommended annotation volumes for each task type. Start with the minimum to validate your approach, then scale up once you confirm the model is learning the right patterns.

Annotation volume guidelines

| Task type | Minimum | Recommended | Production |
| --- | --- | --- | --- |
| Phrase Grounding | 3-5 pairs/image, 20+ images | 5-10 pairs/image, 100+ images | 500+ total pairs |
| VQA | 50 Q&A pairs, 20+ images | 200+ pairs, 50+ images | 500+ pairs, mixed question types |
| Freeform Text | 20+ images | 100+ images | 500+ images |

Annotation quality checklist

Before starting a training run, review a random sample of 20-30 annotations against this checklist. Catching problems before training saves hours of wasted compute.

| Check | What to look for | Why it matters |
| --- | --- | --- |
| Consistent terminology | Same object described the same way across all images | Inconsistent labels split the model's attention and reduce accuracy |
| Tight bounding boxes (phrase grounding) | Boxes fit closely around objects with minimal background | Loose boxes teach the model to include irrelevant pixels, lowering IoU |
| Complete coverage | All relevant objects in each image are annotated | Missing annotations teach the model to ignore valid objects, hurting recall |
| Specific descriptions | Phrases include attributes: color, position, size, context | Vague phrases like "the thing" give the model weak training signal |
| Varied question types (VQA) | Mix of yes/no, counting, descriptive, and comparison questions | Repetitive question patterns cause the model to memorize phrasing instead of learning visual understanding |
| Informative answers (VQA) | Answers contain enough detail to be useful | One-word answers like "yes" limit what the model can learn |
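Drawing the review sample programmatically keeps it unbiased. The following is a minimal sketch using only the Python standard library; the annotation records are hypothetical placeholders, and the default sample size of 25 is simply a value inside the 20-30 range suggested above.

```python
import random


def sample_for_review(annotations, sample_size=25, seed=0):
    """Return a reproducible random sample of annotations for manual review."""
    rng = random.Random(seed)  # fixed seed so reviewers see the same batch
    k = min(sample_size, len(annotations))
    return rng.sample(annotations, k)


# Placeholder records; real annotations would come from your exported dataset.
annotations = [
    {"image": f"img_{i:03d}.jpg", "phrase": "placeholder phrase"}
    for i in range(200)
]
review_batch = sample_for_review(annotations)
```

Changing the seed between review rounds gives each round a different sample.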

Common annotation mistakes

These patterns appear often and degrade model performance. Each one is fixable by editing your annotations before retraining.

Some annotators draw boxes that touch the object edges. Others leave padding around the object. The model learns an average, resulting in boxes that are neither tight nor consistently padded.

Fix: Establish a rule before annotation starts. A common standard: the box should touch the outermost pixels of the object on all four sides, with no more than 2-3 pixels of padding.
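To see why a padding limit matters, it can help to measure how much padding alone reduces IoU against a tight box. This is an illustrative sketch with made-up box coordinates in (x_min, y_min, x_max, y_max) pixel form; the helper names are not part of any Datature Vi API.

```python
def iou(box_a, box_b):
    """Intersection over union for two (x_min, y_min, x_max, y_max) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def pad(box, pixels):
    """Grow a box by the same number of pixels on all four sides."""
    x_min, y_min, x_max, y_max = box
    return (x_min - pixels, y_min - pixels, x_max + pixels, y_max + pixels)


tight = (100, 100, 200, 200)                     # a 100 x 100 pixel object
iou_small_padding = iou(tight, pad(tight, 3))    # within the 2-3 pixel rule
iou_large_padding = iou(tight, pad(tight, 20))   # a loosely drawn box
```

For this 100 x 100 object, 3 pixels of padding keeps IoU near 0.89, while 20 pixels drops it to about 0.51. Smaller objects lose IoU even faster for the same padding.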

Phrases like "object," "item," or "thing" give the model no distinguishing information. The model cannot learn to differentiate between objects if they all share the same description.

Fix: Use specific, descriptive phrases. Instead of "object," write "the rusted bolt in the upper-left corner." Include attributes that distinguish this object from others in the image.

If every image has the question "What do you see?" the model learns to recognize the question pattern rather than the visual content. It may produce similar answers regardless of the image.

Fix: Vary question phrasing across images. Ask "Describe the main subject," "What is happening in this image?," and "List all visible objects" to create diverse training signal.
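A quick count can reveal whether one question dominates a VQA dataset. The sketch below is a hedged example: the 20% threshold is an arbitrary assumption chosen for illustration, and the question list is invented.

```python
from collections import Counter


def repeated_questions(questions, max_share=0.2):
    """Return questions whose share of the dataset exceeds max_share.

    Questions are compared case-insensitively after trimming whitespace.
    """
    counts = Counter(q.strip().lower() for q in questions)
    total = len(questions)
    return {q: n for q, n in counts.items() if n / total > max_share}


# Invented example: one question accounts for 7 of 10 entries.
questions = ["What do you see?"] * 6 + [
    "what do you see?",
    "Describe the main subject.",
    "What is happening in this image?",
    "List all visible objects.",
]
flagged = repeated_questions(questions)
```

Here `flagged` contains only "what do you see?", which is the question to rephrase across images.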

Answers like "yes" or "no" without visual grounding let the model guess based on language patterns. If 80% of yes/no questions have the answer "yes," the model learns to always say "yes."

Fix: Add context to answers. Instead of "Yes," write "Yes, there is a crack running along the bottom edge of the tile." This forces the model to connect its answer to visual evidence.
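Checking the balance of yes/no answers before training can expose the kind of skew described above. This is a minimal sketch over an invented list of answers; it only looks at the first word of each answer and ignores answers that do not start with "yes" or "no".

```python
from collections import Counter


def leading_answer_share(answers):
    """Return the most common leading yes/no word and its share.

    Only answers whose first word is "yes" or "no" are counted.
    """
    leading = []
    for answer in answers:
        words = answer.strip().lower().replace(",", " ").replace(".", " ").split()
        if words and words[0] in ("yes", "no"):
            leading.append(words[0])
    if not leading:
        return None, 0.0
    word, count = Counter(leading).most_common(1)[0]
    return word, count / len(leading)


# Invented answers; the last one is not a yes/no answer and is skipped.
answers = [
    "Yes, there is a crack running along the bottom edge of the tile.",
    "Yes, the weld seam is discolored near the left corner.",
    "No, the surface is uniform with no visible cracks.",
    "Yes.",
    "Two scratches are visible near the top edge.",
]
word, share = leading_answer_share(answers)
```

In this example three of the four yes/no answers start with "yes", a share of 0.75. A share that high suggests adding more "no" examples.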

Images with no annotations confuse training. The model sees an image, receives no ground truth, and gets conflicting gradient signals.

Fix: Remove unannotated images from your dataset before training, or annotate them. Every image in the training set should have at least one annotation.
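If your dataset is available as a list of records, separating unannotated images is a short filter. The record layout below (a dict with an "annotations" list) is a hypothetical structure for illustration, not the Vi export schema.

```python
def split_by_annotation(images):
    """Split image records into (annotated, unannotated) lists.

    An image counts as annotated if its "annotations" list is non-empty.
    """
    annotated = [img for img in images if img.get("annotations")]
    unannotated = [img for img in images if not img.get("annotations")]
    return annotated, unannotated


# Hypothetical records: one annotated image, one empty list, one missing key.
images = [
    {"file": "tile_001.jpg", "annotations": [{"phrase": "crack on the bottom edge"}]},
    {"file": "tile_002.jpg", "annotations": []},
    {"file": "tile_003.jpg"},
]
annotated, unannotated = split_by_annotation(images)
```

The `unannotated` list is the set to either label or leave out of the training run.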

If your annotations contain errors, your evaluation metrics will be misleading. A model that correctly identifies a defect will score poorly if the annotation says the defect is not there. You end up "fixing" a model that was already right.

Fix: Audit annotations independently of model output. Have a second person review a random sample. Treat annotation review as a distinct quality step, not something you do only after training fails.


How to enable chain-of-thought reasoning in annotations

For complex reasoning tasks, you can include step-by-step reasoning in your VQA annotations. Prepend the reasoning, wrapped in <datature_think> tags, to the annotation answer text. During training, Datature Vi converts these to the model's native <think> tags, teaching the model to reason step by step before producing a final answer.

When writing a VQA annotation with chain-of-thought reasoning, wrap the reasoning portion in <datature_think> tags and place the final answer after the closing tag:

Question: "How many defects are visible on this surface?"

Answer:

<datature_think>Let me examine the surface systematically.
Top section: I see 2 scratches near the edge.
Middle section: There is 1 dent and 1 discoloration.
Bottom section: No defects visible.
Total: 4 defects.</datature_think>There are 4 defects visible: 2 scratches, 1 dent, and 1 discoloration.

The text inside <datature_think>...</datature_think> becomes the model's internal reasoning. The text after the closing tag becomes the answer the model presents to the user.
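When generating many chain-of-thought answers from scripts, a small helper keeps the tag layout consistent. The sketch below follows the layout described above (reasoning inside the tags, final answer after the closing tag); the function names are hypothetical, and the well-formedness rules are an assumption based on this page rather than a validator provided by Datature Vi.

```python
THINK_OPEN = "<datature_think>"
THINK_CLOSE = "</datature_think>"


def build_cot_answer(reasoning_steps, final_answer):
    """Join reasoning steps inside the think tags, then append the final answer."""
    reasoning = "\n".join(step.strip() for step in reasoning_steps)
    return f"{THINK_OPEN}{reasoning}{THINK_CLOSE}{final_answer.strip()}"


def is_well_formed(answer):
    """Check for one think block at the start and a non-empty final answer."""
    if answer.count(THINK_OPEN) != 1 or answer.count(THINK_CLOSE) != 1:
        return False
    if not answer.startswith(THINK_OPEN):
        return False
    final = answer.split(THINK_CLOSE, 1)[1]
    return bool(final.strip())


cot_answer = build_cot_answer(
    [
        "Let me examine the surface systematically.",
        "Top section: I see 2 scratches near the edge.",
        "Middle section: There is 1 dent and 1 discoloration.",
        "Bottom section: No defects visible.",
        "Total: 4 defects.",
    ],
    "There are 4 defects visible: 2 scratches, 1 dent, and 1 discoloration.",
)
```

Running `is_well_formed` over every answer before upload catches missing closing tags and empty final answers early.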

At inference time, the model outputs <think>...</think> and <answer>...</answer> tags. The Vi SDK parses these into separate thinking and answer fields on the response object, so you can access the reasoning and the final answer independently.
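The SDK does this parsing for you. Purely to illustrate the split between reasoning and answer, the sketch below separates the two tag blocks from a raw output string with regular expressions. It is not the SDK's parser, and the raw string is an invented example.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def split_thinking_and_answer(raw_output):
    """Return (thinking, answer) extracted from a tagged output string.

    thinking is None when no <think> block is present. If there is no
    <answer> block, the text outside the <think> block is used instead.
    """
    think_match = THINK_RE.search(raw_output)
    answer_match = ANSWER_RE.search(raw_output)
    thinking = think_match.group(1).strip() if think_match else None
    if answer_match:
        answer = answer_match.group(1).strip()
    else:
        answer = THINK_RE.sub("", raw_output).strip()
    return thinking, answer


# Invented raw output in the tag layout described above.
raw = (
    "<think>Top: 2 scratches. Middle: 1 dent, 1 discoloration.</think>"
    "<answer>There are 4 defects.</answer>"
)
thinking, answer = split_thinking_and_answer(raw)
```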

When running inference, pass cot=True to the model(...) call to request chain-of-thought decoding (alongside any CoT behavior learned from <datature_think> training data). You can then access the reasoning and the answer separately:

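from vi.inference import ViModel

# Load the model produced by a training run
model = ViModel(run_id="your-run-id")

# cot=True requests chain-of-thought decoding for this call
result, error = model(
    source="image.jpg",
    user_prompt="Count all defects",
    cot=True,
    stream=False,
)

if error is None:
    # Print the reasoning only when the response includes it
    if result.thinking:
        print("Reasoning:", result.thinking)
    print("Answer:", result.result.answer)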

The result.thinking field contains the model's step-by-step reasoning when the output includes <think> tags. The example checks it before printing because it may be empty when they are absent. The result.result.answer field contains the final answer text for VQA. See prediction schemas for the full response structure.

For more on how chain-of-thought reasoning works and when to use it, see Chain-of-Thought Reasoning. For generation parameter tuning during inference, see Configure Generation. For cot, stream, and related call options, see Run Inference.


Annotation formats for upload

If you have existing annotations from another tool or platform, Datature Vi accepts several standard formats. The format you choose depends on where your annotations came from.

| Format | Best for | Notes |
| --- | --- | --- |
| Vi JSONL | Starting fresh with Datature Vi | Recommended default, supports all task types |
| COCO JSON | Migrating from LabelImg, CVAT, Roboflow | Most widely supported format across tools |
| Pascal VOC XML | Migrating from older annotation tools | One XML file per image |
| YOLO TXT | Migrating from YOLO training workflows | One TXT file per image, plus a classes file |
| CSV | Simple tabular annotation data | Easy to generate from spreadsheets |

Automatic Coordinate Conversion

Datature Vi converts coordinate formats automatically during upload. COCO uses absolute pixel coordinates, YOLO uses normalized center coordinates, and Pascal VOC uses absolute corner coordinates. You do not need to convert between these yourself.

Different formats measure bounding box positions differently. Some use pixel coordinates (the box starts at pixel 100, 200), others use normalized coordinates (the box starts at 10% from the left, 20% from the top). YOLO measures from the center of the box, while COCO and Pascal VOC measure from the top-left corner.

Datature Vi handles all of these conversions for you during upload. As long as your annotation files follow their format's specification, the bounding boxes will appear correctly in the annotator.
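You do not need to run these conversions yourself, but seeing them side by side can help when debugging a file that imports with misplaced boxes. The sketch below converts each format's usual box layout to pixel corners. The function names are hypothetical, and the layouts follow the common conventions for each format rather than anything specific to Datature Vi.

```python
def coco_to_corners(box):
    """COCO [x_min, y_min, width, height] in pixels -> (x_min, y_min, x_max, y_max)."""
    x_min, y_min, width, height = box
    return (x_min, y_min, x_min + width, y_min + height)


def yolo_to_corners(box, image_width, image_height):
    """YOLO normalized [x_center, y_center, width, height] -> pixel corners."""
    x_center, y_center, width, height = box
    return (
        (x_center - width / 2) * image_width,
        (y_center - height / 2) * image_height,
        (x_center + width / 2) * image_width,
        (y_center + height / 2) * image_height,
    )


def voc_to_corners(box):
    """Pascal VOC (xmin, ymin, xmax, ymax) is already in pixel corner form."""
    return tuple(box)


# The same box on a 1000 x 800 image, expressed in each format.
from_coco = coco_to_corners([100, 200, 200, 400])
from_yolo = yolo_to_corners([0.2, 0.5, 0.2, 0.5], image_width=1000, image_height=800)
from_voc = voc_to_corners([100, 200, 300, 600])
```

All three calls describe the box with corners (100, 200) and (300, 600). The YOLO result is floating point, so compare it with a small tolerance.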

For the full format specification with examples, see Upload Annotations.


AI-assisted annotation with IntelliScribe

Datature Vi includes IntelliScribe, an AI-assisted annotation tool that speeds up phrase grounding and freeform image annotation. IntelliScribe auto-generates captions and links phrases to bounding boxes. Press C to generate a caption for the current image, then press P to auto-link phrases to your drawn bounding boxes.

IntelliScribe works best on clear images with common objects. For specialized domains, treat the generated caption as a starting draft and edit it with domain-specific terminology before running phrase linking.

Learn more about AI-assisted annotation tools


Frequently asked questions

Do I need to annotate every object in an image?

No. Annotate the objects relevant to your task. If you are training a model to find defects on a product surface, you do not need to annotate the conveyor belt, the background, or the lighting fixtures. Consistent coverage of the objects you care about matters more than total coverage of everything visible.

Can I mix annotation types in one dataset?

No. Each dataset in Datature Vi has one task type: phrase grounding, VQA, or freeform text. If you need multiple task types, create separate datasets for each one and train separate models. You can combine the models at the application level during inference.

What is the <datature_think> tag?

The <datature_think> tag is a special tag you add to VQA annotation answers to enable chain-of-thought reasoning. You wrap the reasoning steps in <datature_think>...</datature_think> and place the final answer after the closing tag. During training, Datature Vi converts these to the model's native <think> tags, so the trained model learns to reason step by step before answering. See the chain-of-thought section above for examples.


Related resources

Annotate for Phrase Grounding

Step-by-step guide to creating phrase-box annotation pairs.

Annotate for VQA

Create question-answer training pairs for visual question answering.

Upload Annotations

Import existing annotations in COCO, YOLO, Pascal VOC, CSV, or Vi JSONL formats.