Upload Annotations

Import existing annotations into a Datature Vi dataset. Supports COCO, YOLO, Pascal VOC, CSV, and Vi JSONL formats for phrase grounding and VQA tasks.

Annotations define what your model learns to recognize: bounding boxes linked to text phrases for phrase grounding, or question-answer pairs for visual question answering (VQA). If you already have annotated data from another tool or platform, you can import it directly into Datature Vi.

Before You Start
  • A dataset with uploaded images in Datature Vi. Upload images first.
  • Annotation files in a supported format.
  • Annotation files whose filename references match the uploaded image filenames exactly (case-sensitive, including file extensions); a pre-upload check is sketched below.
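
A filename mismatch is the most common cause of a failed import (see Troubleshooting below), so it can help to compare the two filename sets before uploading. A minimal sketch, assuming a COCO-style `annotations.json` and a local `images/` directory; both paths are placeholders:

```python
import json
from pathlib import Path

# Placeholder paths; point these at your own data.
images_dir = Path("images")
coco = json.loads(Path("annotations.json").read_text())

# Matching is exact, including file extension and case.
uploaded = {p.name for p in images_dir.iterdir() if p.is_file()}
referenced = {img["file_name"] for img in coco["images"]}

missing = referenced - uploaded
if missing:
    print(f"{len(missing)} annotation entries reference images that were not uploaded:")
    for name in sorted(missing):
        print(f"  {name}")
```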

Open the Annotations tab

In the Dataset Explorer, click the Annotations tab to access the upload area.

You should see a successfully finished job on the annotation upload page; this confirms your annotations were imported.

Supported annotation formats

Supported formats depend on your dataset type.

Phrase grounding datasets

| Format | File type | Notes |
| --- | --- | --- |
| Vi JSONL | `.jsonl` | Datature Vi native format. Supports both phrase grounding and VQA. |
| COCO | `.json` | Common Objects in Context |
| Pascal VOC | `.xml` | One XML file per image |
| YOLO Darknet | `.txt` + `classes.txt` | Normalized coordinates |
| YOLO Keras/PyTorch | `.txt` + class config | Same coordinate format as Darknet |
| CSV Four Corner | `.csv` | Columns: `filename, xmin, ymin, xmax, ymax, class` |
| CSV Width Height | `.csv` | Columns: `filename, x, y, width, height, class` |

VQA datasets

Only Vi JSONL is supported for visual question answering datasets.

| Format | File type | Notes |
| --- | --- | --- |
| Vi JSONL | `.jsonl` | Datature Vi native format |

Format specifications

Vi JSONL (phrase grounding)

Each line in the `.jsonl` file is one complete JSON object representing one image's annotations. A record contains an `id` (integer), an `asset` object with `type`, `filename`, `width`, and `height`, and a `contents` object with `type` set to `"PhraseGrounding"`, a `caption` string, and a `groundedPhrases` array. Each grounded phrase entry includes `phrase`, `startCharIndex`, `endCharIndex`, and `bounds` (a nested array of normalized bounding box coordinates).

Key fields:

  • `asset.filename`: must match the uploaded image filename exactly
  • `contents.type`: must be `"PhraseGrounding"`
  • `contents.caption`: descriptive text for the image
  • `groundedPhrases[].bounds`: normalized bounding box `[xmin, ymin, xmax, ymax]` (values 0 to 1)
  • `groundedPhrases[].startCharIndex` / `endCharIndex`: zero-based character positions in the caption
{"id": 0, "asset": {"type": "Image", "filename": "image1.jpg", "width": 1920, "height": 1080}, "contents": {"type": "PhraseGrounding", "caption": "A person standing next to a red car.", "groundedPhrases": [{"phrase": "person", "startCharIndex": 2, "endCharIndex": 8, "bounds": [[0.1, 0.2, 0.4, 0.9]]}, {"phrase": "red car", "startCharIndex": 28, "endCharIndex": 35, "bounds": [[0.5, 0.4, 0.9, 0.8]]}]}}
{"id": 1, "asset": {"type": "Image", "filename": "image2.jpg", "width": 1280, "height": 720}, "contents": {"type": "PhraseGrounding", "caption": "A brown dog sleeping on the couch.", "groundedPhrases": [{"phrase": "brown dog", "startCharIndex": 2, "endCharIndex": 11, "bounds": [[0.2, 0.3, 0.7, 0.8]]}]}}

The first record, pretty-printed for readability:

```json
{
  "id": 0,
  "asset": {
    "type": "Image",
    "filename": "image1.jpg",
    "width": 1920,
    "height": 1080
  },
  "contents": {
    "type": "PhraseGrounding",
    "caption": "A person standing next to a red car.",
    "groundedPhrases": [
      {
        "phrase": "person",
        "startCharIndex": 2,
        "endCharIndex": 8,
        "bounds": [[0.1, 0.2, 0.4, 0.9]]
      },
      {
        "phrase": "red car",
        "startCharIndex": 28,
        "endCharIndex": 35,
        "bounds": [[0.5, 0.4, 0.9, 0.8]]
      }
    ]
  }
}
```
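
If you generate Vi JSONL programmatically, the character indices and the nested `bounds` array are easy to get wrong. A minimal sketch that builds records matching the spec above (the `make_record` helper is illustrative, not part of any Datature library):

```python
import json

def make_record(rec_id, filename, width, height, caption, phrases_with_boxes):
    """Build one phrase grounding record matching the spec above.

    phrases_with_boxes: list of (phrase, [xmin, ymin, xmax, ymax]) pairs,
    with coordinates already normalized to the 0-1 range.
    """
    grounded = []
    for phrase, box in phrases_with_boxes:
        start = caption.find(phrase)  # zero-based index into the caption
        if start == -1:
            raise ValueError(f"phrase {phrase!r} not found in caption")
        grounded.append({
            "phrase": phrase,
            "startCharIndex": start,
            # End index follows the example above: start + phrase length
            "endCharIndex": start + len(phrase),
            "bounds": [box],  # nested array: one box for this phrase
        })
    return {
        "id": rec_id,
        "asset": {"type": "Image", "filename": filename,
                  "width": width, "height": height},
        "contents": {"type": "PhraseGrounding", "caption": caption,
                     "groundedPhrases": grounded},
    }

with open("annotations.jsonl", "w") as f:
    record = make_record(
        0, "image1.jpg", 1920, 1080,
        "A person standing next to a red car.",
        [("person", [0.1, 0.2, 0.4, 0.9]),
         ("red car", [0.5, 0.4, 0.9, 0.8])],
    )
    f.write(json.dumps(record) + "\n")  # one JSON object per line
```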

Vi JSONL (VQA)

Each line is one image's question-answer pairs. A record contains an `id` (integer), an `asset` object with `type`, `filename`, `width`, and `height`, and a `contents` object with `type` set to `"Vqa"` and an `interactions` array. Each interaction entry includes `question`, `answer`, and `order` (an integer for sequencing).

Key fields:

  • `asset.filename`: must match the uploaded image filename exactly
  • `contents.type`: must be `"Vqa"`
  • `interactions[].question`: the question about the image
  • `interactions[].answer`: the answer
  • `interactions[].order`: sequence index (useful when one answer depends on a previous question)
{"id": 0, "asset": {"type": "Image", "filename": "image1.jpg", "width": 1920, "height": 1080}, "contents": {"type": "Vqa", "interactions": [{"question": "What color is the car?", "answer": "red", "order": 1}, {"question": "Where is the person standing?", "answer": "next to the car", "order": 2}]}}
{"id": 1, "asset": {"type": "Image", "filename": "image2.jpg", "width": 1280, "height": 720}, "contents": {"type": "Vqa", "interactions": [{"question": "What is the dog doing?", "answer": "sleeping", "order": 1}, {"question": "Where is the dog?", "answer": "on the couch", "order": 2}]}}

The first record, pretty-printed for readability:

```json
{
  "id": 0,
  "asset": {
    "type": "Image",
    "filename": "image1.jpg",
    "width": 1920,
    "height": 1080
  },
  "contents": {
    "type": "Vqa",
    "interactions": [
      {
        "question": "What color is the car?",
        "answer": "red",
        "order": 1
      },
      {
        "question": "Where is the person standing?",
        "answer": "next to the car",
        "order": 2
      }
    ]
  }
}
```
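
A similar sketch for building VQA records; the `qa_data` input shape is purely illustrative:

```python
import json

# Illustrative input shape: filename -> (width, height, [(question, answer), ...])
qa_data = {
    "image1.jpg": (1920, 1080, [
        ("What color is the car?", "red"),
        ("Where is the person standing?", "next to the car"),
    ]),
}

with open("vqa_annotations.jsonl", "w") as f:
    for rec_id, (filename, (width, height, pairs)) in enumerate(qa_data.items()):
        record = {
            "id": rec_id,
            "asset": {"type": "Image", "filename": filename,
                      "width": width, "height": height},
            "contents": {
                "type": "Vqa",
                # "order" starts at 1, as in the example above
                "interactions": [
                    {"question": q, "answer": a, "order": i + 1}
                    for i, (q, a) in enumerate(pairs)
                ],
            },
        }
        f.write(json.dumps(record) + "\n")
```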

COCO

A single `.json` file containing three top-level arrays:

| Array | Fields | Description |
| --- | --- | --- |
| **images** | `id`, `file_name`, `width`, `height` | One entry per image. `file_name` must match the uploaded filename. |
| **annotations** | `id`, `image_id`, `category_id`, `bbox`, `area`, `iscrowd` | One entry per bounding box. `bbox` uses `[x, y, width, height]` where (x, y) is the top-left corner. |
| **categories** | `id`, `name`, `supercategory` | One entry per class label. |

See the official COCO format spec for full details.

```json
{
  "images": [
    {
      "id": 1,
      "file_name": "image1.jpg",
      "width": 640,
      "height": 480
    }
  ],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_id": 1,
      "bbox": [100, 150, 200, 250],
      "area": 50000,
      "iscrowd": 0
    }
  ],
  "categories": [
    {
      "id": 1,
      "name": "person",
      "supercategory": "object"
    }
  ]
}
```
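
Note that COCO's `bbox` uses a top-left corner plus width and height in pixels, while Vi JSONL bounds use normalized corners. If you need to move between the two conventions, a sketch (the function is a hypothetical helper):

```python
def coco_bbox_to_corners(bbox, img_w, img_h):
    """Convert a COCO bbox [x, y, width, height] (pixels, top-left origin)
    to normalized corner coordinates [xmin, ymin, xmax, ymax]."""
    x, y, w, h = bbox
    return [x / img_w, y / img_h, (x + w) / img_w, (y + h) / img_h]

# The bbox from the example above: a 200x250 box with top-left at (100, 150)
print(coco_bbox_to_corners([100, 150, 200, 250], 640, 480))
# [0.15625, 0.3125, 0.46875, 0.8333...]
```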

Pascal VOC

One `.xml` file per image, with bounding box coordinates in absolute pixel values. The XML structure uses the following elements:

| Element | Description |
| --- | --- |
| **filename** | The image filename (must match the uploaded asset) |
| **size** | Contains **width**, **height**, and **depth** (number of color channels) |
| **object** | One per annotation. Contains **name** (class label) and **bndbox** with **xmin**, **ymin**, **xmax**, **ymax** in pixel coordinates |

Upload all XML files together in a single upload session.

```xml
<annotation>
  <folder>images</folder>
  <filename>image1.jpg</filename>
  <size>
    <width>640</width>
    <height>480</height>
    <depth>3</depth>
  </size>
  <object>
    <name>person</name>
    <bndbox>
      <xmin>100</xmin>
      <ymin>150</ymin>
      <xmax>300</xmax>
      <ymax>400</ymax>
    </bndbox>
  </object>
  <object>
    <name>car</name>
    <bndbox>
      <xmin>350</xmin>
      <ymin>200</ymin>
      <xmax>550</xmax>
      <ymax>450</ymax>
    </bndbox>
  </object>
</annotation>
```
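
To inspect VOC files before uploading, they parse readily with Python's standard library. A minimal sketch (`parse_voc` and the path are illustrative):

```python
import xml.etree.ElementTree as ET

def parse_voc(path):
    """Return (filename, [(class, xmin, ymin, xmax, ymax), ...]) from a
    Pascal VOC XML file. Coordinates stay in absolute pixels."""
    root = ET.parse(path).getroot()
    boxes = []
    for obj in root.findall("object"):
        bb = obj.find("bndbox")
        boxes.append((
            obj.findtext("name"),
            int(bb.findtext("xmin")), int(bb.findtext("ymin")),
            int(bb.findtext("xmax")), int(bb.findtext("ymax")),
        ))
    return root.findtext("filename"), boxes

filename, boxes = parse_voc("image1.xml")  # placeholder path
print(filename, boxes)
```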

YOLO

One `.txt` file per image, plus a `classes.txt` file listing class names in order. Each annotation line contains five space-separated values: `class_id`, `x_center`, `y_center`, `width`, and `height`. All coordinate values are normalized (0 to 1) by image dimensions. Upload the annotation `.txt` files and `classes.txt` together.

Example annotation file:

```
0 0.5 0.5 0.3 0.4
1 0.7 0.3 0.2 0.25
```

Example `classes.txt`:

```
person
car
truck
```
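
YOLO stores box centers rather than corners. A sketch of the conversion to corner format (the helper is hypothetical):

```python
def yolo_to_corners(line, classes):
    """Convert one YOLO line (class_id x_center y_center width height,
    all coordinates normalized) to (class_name, [xmin, ymin, xmax, ymax])."""
    class_id, xc, yc, w, h = line.split()
    xc, yc, w, h = (float(v) for v in (xc, yc, w, h))
    return classes[int(class_id)], [xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2]

classes = ["person", "car", "truck"]  # order taken from classes.txt
print(yolo_to_corners("0 0.5 0.5 0.3 0.4", classes))
# ('person', [0.35, 0.3, 0.65, 0.7]) up to float rounding
```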

CSV

Two CSV layouts are supported:

  • CSV Four Corner: columns `filename, xmin, ymin, xmax, ymax, class`
  • CSV Width Height: columns `filename, x, y, width, height, class`

Both require an exact header row. Each row is one bounding box. Coordinates can be normalized (0 to 1) or in pixels; the system detects which automatically.

Four Corner example:

```csv
filename,xmin,ymin,xmax,ymax,class
image1.jpg,100,150,300,400,person
image1.jpg,350,200,550,450,car
image2.jpg,50,75,250,350,truck
```

Width Height example:

```csv
filename,x,y,width,height,class
image1.jpg,100,150,200,250,person
image1.jpg,350,200,200,250,car
image2.jpg,50,75,200,275,truck
```
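
If your tooling emits one layout and you need the other, the conversion is a simple coordinate shift. A sketch converting Width Height rows to Four Corner (file names are placeholders):

```python
import csv

# Both layouts are accepted on upload, so convert only if your
# downstream tooling expects corner coordinates.
with open("width_height.csv", newline="") as src, \
        open("four_corner.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(
        dst, fieldnames=["filename", "xmin", "ymin", "xmax", "ymax", "class"])
    writer.writeheader()
    for row in reader:
        x, y = float(row["x"]), float(row["y"])
        w, h = float(row["width"]), float(row["height"])
        writer.writerow({
            "filename": row["filename"], "class": row["class"],
            "xmin": x, "ymin": y, "xmax": x + w, "ymax": y + h,
        })
```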

Troubleshooting

Annotations don't appear after upload

The most common cause is a filename mismatch. Check that filename values in your annotation files exactly match the filenames of the uploaded images, including file extension and case. Also verify that you selected the correct format during upload.

Upload fails with validation errors

Possible causes: bounding box coordinates outside image boundaries, invalid normalized values (must be 0 to 1), or missing class IDs in YOLO class files. Validate coordinate ranges and check that your class file has an entry for every ID used.

Files fail to parse

Compare your file structure to the examples above. Validate JSON files with a JSON linter and XML files with an XML validator before uploading.
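
For `.jsonl` files, a standard JSON linter will not help, because each line must parse as a standalone JSON object. A minimal per-line check (the bounds validation assumes a phrase grounding file):

```python
import json

# Each line of a .jsonl file must parse as a standalone JSON object.
with open("annotations.jsonl") as f:
    for line_no, line in enumerate(f, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            print(f"line {line_no}: invalid JSON ({err})")
            continue
        # Bounds check for phrase grounding records (illustrative only)
        for gp in record.get("contents", {}).get("groundedPhrases", []):
            for box in gp.get("bounds", []):
                if not all(0 <= v <= 1 for v in box):
                    print(f"line {line_no}: bounds out of range: {box}")
```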

Do this with the Vi SDK

```python
import vi

# Authenticate with your organization credentials
client = vi.Client(
    secret_key="your-secret-key",
    organization_id="your-organization-id"
)

# Upload the annotation file and block until the import job finishes
result = client.annotations.upload(
    dataset_id="your-dataset-id",
    paths="annotations.jsonl",
    wait_until_done=True
)
print(f"Imported: {result.total_annotations}")
```

For more details, see the full SDK reference.

Next steps

Train A Model

Fine-tune a vision-language model with your annotated dataset.

Annotate Data

Create or edit annotations in the visual annotator or with AI-assisted tools.

Dataset Overview

Check annotation statistics and class distributions.