Upload Annotations

Import existing annotations into a Datature Vi dataset. Supports COCO, YOLO, Pascal VOC, CSV, and Vi JSONL formats for phrase grounding and VQA tasks.

Annotations define what your model learns to recognize: bounding boxes linked to text phrases for phrase grounding, or question-answer pairs for visual question answering (VQA). If you already have annotated data from another tool or platform, you can import it directly into Datature Vi.

Before You Start
  • A dataset with uploaded images in Datature Vi. Upload images first.
  • Annotation files in a supported format.
  • Annotation files whose filename references match the uploaded image filenames exactly (case-sensitive, including file extensions); a pre-upload check is sketched below.
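
A filename mismatch is the most common cause of a failed import (see Troubleshooting below), so it can help to compare the two filename sets before uploading. A minimal sketch, assuming a COCO-style `annotations.json` and a local `images/` directory; both paths are placeholders:

```python
import json
from pathlib import Path

# Placeholder paths; point these at your own data.
images_dir = Path("images")
coco = json.loads(Path("annotations.json").read_text())

# Matching is exact, including file extension and case.
uploaded = {p.name for p in images_dir.iterdir() if p.is_file()}
referenced = {img["file_name"] for img in coco["images"]}

missing = referenced - uploaded
if missing:
    print(f"{len(missing)} annotation entries reference images that were not uploaded:")
    for name in sorted(missing):
        print(f"  {name}")
```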

Open the Annotations tab

In the Dataset Explorer, click the Annotations tab to access the upload area.

You should see a successfully finished job on the annotation upload page; this confirms your annotations were imported.

Supported annotation formats

Supported formats depend on your dataset type.

Phrase grounding datasets

| Format | File type | Notes |
| --- | --- | --- |
| Vi JSONL | `.jsonl` | Datature Vi native format. Supports both phrase grounding and VQA. |
| COCO | `.json` | Common Objects in Context |
| Pascal VOC | `.xml` | One XML file per image |
| YOLO Darknet | `.txt` + `classes.txt` | Normalized coordinates |
| YOLO Keras/PyTorch | `.txt` + class config | Same coordinate format as Darknet |
| CSV Four Corner | `.csv` | Columns: `filename, xmin, ymin, xmax, ymax, class` |
| CSV Width Height | `.csv` | Columns: `filename, x, y, width, height, class` |

VQA datasets

Only Vi JSONL is supported for visual question answering datasets.

| Format | File type | Notes |
| --- | --- | --- |
| Vi JSONL | `.jsonl` | Datature Vi native format |

Format specifications

Vi JSONL (phrase grounding)

Each line in the `.jsonl` file is one complete JSON object representing one image's annotations. A record contains an `id` (integer), an `asset` object with `type`, `filename`, `width`, and `height`, and a `contents` object with `type` set to `"PhraseGrounding"`, a `caption` string, and a `groundedPhrases` array. Each grounded phrase entry includes `phrase`, `startCharIndex`, `endCharIndex`, and `bounds` (a nested array of normalized bounding box coordinates).

Key fields:

  • `asset.filename`: must match the uploaded image filename exactly
  • `contents.type`: must be `"PhraseGrounding"`
  • `contents.caption`: descriptive text for the image
  • `groundedPhrases[].bounds`: normalized bounding box `[xmin, ymin, xmax, ymax]` (values 0 to 1)
  • `groundedPhrases[].startCharIndex` / `endCharIndex`: zero-based character positions in the caption
{"id": 0, "asset": {"type": "Image", "filename": "image1.jpg", "width": 1920, "height": 1080}, "contents": {"type": "PhraseGrounding", "caption": "A person standing next to a red car.", "groundedPhrases": [{"phrase": "person", "startCharIndex": 2, "endCharIndex": 8, "bounds": [[0.1, 0.2, 0.4, 0.9]]}, {"phrase": "red car", "startCharIndex": 28, "endCharIndex": 35, "bounds": [[0.5, 0.4, 0.9, 0.8]]}]}}
{"id": 1, "asset": {"type": "Image", "filename": "image2.jpg", "width": 1280, "height": 720}, "contents": {"type": "PhraseGrounding", "caption": "A brown dog sleeping on the couch.", "groundedPhrases": [{"phrase": "brown dog", "startCharIndex": 2, "endCharIndex": 11, "bounds": [[0.2, 0.3, 0.7, 0.8]]}]}}

The first record, pretty-printed for readability:

```json
{
  "id": 0,
  "asset": {
    "type": "Image",
    "filename": "image1.jpg",
    "width": 1920,
    "height": 1080
  },
  "contents": {
    "type": "PhraseGrounding",
    "caption": "A person standing next to a red car.",
    "groundedPhrases": [
      {
        "phrase": "person",
        "startCharIndex": 2,
        "endCharIndex": 8,
        "bounds": [[0.1, 0.2, 0.4, 0.9]]
      },
      {
        "phrase": "red car",
        "startCharIndex": 28,
        "endCharIndex": 35,
        "bounds": [[0.5, 0.4, 0.9, 0.8]]
      }
    ]
  }
}
```
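
If you generate Vi JSONL programmatically, the character indices and the nested `bounds` array are easy to get wrong. A minimal sketch that builds records matching the spec above (the `make_record` helper is illustrative, not part of any Datature library):

```python
import json

def make_record(rec_id, filename, width, height, caption, phrases_with_boxes):
    """Build one phrase grounding record matching the spec above.

    phrases_with_boxes: list of (phrase, [xmin, ymin, xmax, ymax]) pairs,
    with coordinates already normalized to the 0-1 range.
    """
    grounded = []
    for phrase, box in phrases_with_boxes:
        start = caption.find(phrase)  # zero-based index into the caption
        if start == -1:
            raise ValueError(f"phrase {phrase!r} not found in caption")
        grounded.append({
            "phrase": phrase,
            "startCharIndex": start,
            # End index follows the example above: start + phrase length
            "endCharIndex": start + len(phrase),
            "bounds": [box],  # nested array: one box for this phrase
        })
    return {
        "id": rec_id,
        "asset": {"type": "Image", "filename": filename,
                  "width": width, "height": height},
        "contents": {"type": "PhraseGrounding", "caption": caption,
                     "groundedPhrases": grounded},
    }

with open("annotations.jsonl", "w") as f:
    record = make_record(
        0, "image1.jpg", 1920, 1080,
        "A person standing next to a red car.",
        [("person", [0.1, 0.2, 0.4, 0.9]),
         ("red car", [0.5, 0.4, 0.9, 0.8])],
    )
    f.write(json.dumps(record) + "\n")  # one JSON object per line
```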

Vi JSONL (VQA)

Each line is one image's question-answer pairs. A record contains an `id` (integer), an `asset` object with `type`, `filename`, `width`, and `height`, and a `contents` object with `type` set to `"Vqa"` and an `interactions` array. Each interaction entry includes `question`, `answer`, and `order` (an integer for sequencing).

Key fields:

  • `asset.filename`: must match the uploaded image filename exactly
  • `contents.type`: must be `"Vqa"`
  • `interactions[].question`: the question about the image
  • `interactions[].answer`: the answer
  • `interactions[].order`: sequence index (useful when one answer depends on a previous question)
{"id": 0, "asset": {"type": "Image", "filename": "image1.jpg", "width": 1920, "height": 1080}, "contents": {"type": "Vqa", "interactions": [{"question": "What color is the car?", "answer": "red", "order": 1}, {"question": "Where is the person standing?", "answer": "next to the car", "order": 2}]}}
{"id": 1, "asset": {"type": "Image", "filename": "image2.jpg", "width": 1280, "height": 720}, "contents": {"type": "Vqa", "interactions": [{"question": "What is the dog doing?", "answer": "sleeping", "order": 1}, {"question": "Where is the dog?", "answer": "on the couch", "order": 2}]}}

The first record, pretty-printed for readability:

```json
{
  "id": 0,
  "asset": {
    "type": "Image",
    "filename": "image1.jpg",
    "width": 1920,
    "height": 1080
  },
  "contents": {
    "type": "Vqa",
    "interactions": [
      {
        "question": "What color is the car?",
        "answer": "red",
        "order": 1
      },
      {
        "question": "Where is the person standing?",
        "answer": "next to the car",
        "order": 2
      }
    ]
  }
}
```
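
A similar sketch for building VQA records; the `qa_data` input shape is purely illustrative:

```python
import json

# Illustrative input shape: filename -> (width, height, [(question, answer), ...])
qa_data = {
    "image1.jpg": (1920, 1080, [
        ("What color is the car?", "red"),
        ("Where is the person standing?", "next to the car"),
    ]),
}

with open("vqa_annotations.jsonl", "w") as f:
    for rec_id, (filename, (width, height, pairs)) in enumerate(qa_data.items()):
        record = {
            "id": rec_id,
            "asset": {"type": "Image", "filename": filename,
                      "width": width, "height": height},
            "contents": {
                "type": "Vqa",
                # "order" starts at 1, as in the example above
                "interactions": [
                    {"question": q, "answer": a, "order": i + 1}
                    for i, (q, a) in enumerate(pairs)
                ],
            },
        }
        f.write(json.dumps(record) + "\n")
```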

COCO

A single `.json` file containing three top-level arrays:

| Array | Fields | Description |
| --- | --- | --- |
| **images** | `id`, `file_name`, `width`, `height` | One entry per image. `file_name` must match the uploaded filename. |
| **annotations** | `id`, `image_id`, `category_id`, `bbox`, `area`, `iscrowd` | One entry per bounding box. `bbox` uses `[x, y, width, height]` where (x, y) is the top-left corner. |
| **categories** | `id`, `name`, `supercategory` | One entry per class label. |

See the official COCO format spec for full details.

```json
{
  "images": [
    {
      "id": 1,
      "file_name": "image1.jpg",
      "width": 640,
      "height": 480
    }
  ],
  "annotations": [
    {
      "id": 1,
      "image_id": 1,
      "category_id": 1,
      "bbox": [100, 150, 200, 250],
      "area": 50000,
      "iscrowd": 0
    }
  ],
  "categories": [
    {
      "id": 1,
      "name": "person",
      "supercategory": "object"
    }
  ]
}
```
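
Note that COCO's `bbox` uses a top-left corner plus width and height in pixels, while Vi JSONL bounds use normalized corners. If you need to move between the two conventions, a sketch (the function is a hypothetical helper):

```python
def coco_bbox_to_corners(bbox, img_w, img_h):
    """Convert a COCO bbox [x, y, width, height] (pixels, top-left origin)
    to normalized corner coordinates [xmin, ymin, xmax, ymax]."""
    x, y, w, h = bbox
    return [x / img_w, y / img_h, (x + w) / img_w, (y + h) / img_h]

# The bbox from the example above: a 200x250 box with top-left at (100, 150)
print(coco_bbox_to_corners([100, 150, 200, 250], 640, 480))
# [0.15625, 0.3125, 0.46875, 0.8333...]
```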

Pascal VOC

One `.xml` file per image, with bounding box coordinates in absolute pixel values. The XML structure uses the following elements:

| Element | Description |
| --- | --- |
| **filename** | The image filename (must match the uploaded asset) |
| **size** | Contains **width**, **height**, and **depth** (number of color channels) |
| **object** | One per annotation. Contains **name** (class label) and **bndbox** with **xmin**, **ymin**, **xmax**, **ymax** in pixel coordinates |

Upload all XML files together in a single upload session.

```xml
<annotation>
  <folder>images</folder>
  <filename>image1.jpg</filename>
  <size>
    <width>640</width>
    <height>480</height>
    <depth>3</depth>
  </size>
  <object>
    <name>person</name>
    <bndbox>
      <xmin>100</xmin>
      <ymin>150</ymin>
      <xmax>300</xmax>
      <ymax>400</ymax>
    </bndbox>
  </object>
  <object>
    <name>car</name>
    <bndbox>
      <xmin>350</xmin>
      <ymin>200</ymin>
      <xmax>550</xmax>
      <ymax>450</ymax>
    </bndbox>
  </object>
</annotation>
```
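
To inspect VOC files before uploading, they parse readily with Python's standard library. A minimal sketch (`parse_voc` and the path are illustrative):

```python
import xml.etree.ElementTree as ET

def parse_voc(path):
    """Return (filename, [(class, xmin, ymin, xmax, ymax), ...]) from a
    Pascal VOC XML file. Coordinates stay in absolute pixels."""
    root = ET.parse(path).getroot()
    boxes = []
    for obj in root.findall("object"):
        bb = obj.find("bndbox")
        boxes.append((
            obj.findtext("name"),
            int(bb.findtext("xmin")), int(bb.findtext("ymin")),
            int(bb.findtext("xmax")), int(bb.findtext("ymax")),
        ))
    return root.findtext("filename"), boxes

filename, boxes = parse_voc("image1.xml")  # placeholder path
print(filename, boxes)
```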

YOLO

One `.txt` file per image, plus a `classes.txt` file listing class names in order. Each annotation line contains five space-separated values: `class_id`, `x_center`, `y_center`, `width`, and `height`. All coordinate values are normalized (0 to 1) by image dimensions. Upload the annotation `.txt` files and `classes.txt` together.

Example annotation file:

```
0 0.5 0.5 0.3 0.4
1 0.7 0.3 0.2 0.25
```

Example `classes.txt`:

```
person
car
truck
```
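
YOLO stores box centers rather than corners. A sketch of the conversion to corner format (the helper is hypothetical):

```python
def yolo_to_corners(line, classes):
    """Convert one YOLO line (class_id x_center y_center width height,
    all coordinates normalized) to (class_name, [xmin, ymin, xmax, ymax])."""
    class_id, xc, yc, w, h = line.split()
    xc, yc, w, h = (float(v) for v in (xc, yc, w, h))
    return classes[int(class_id)], [xc - w / 2, yc - h / 2, xc + w / 2, yc + h / 2]

classes = ["person", "car", "truck"]  # order taken from classes.txt
print(yolo_to_corners("0 0.5 0.5 0.3 0.4", classes))
# ('person', [0.35, 0.3, 0.65, 0.7]) up to float rounding
```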

CSV

Two CSV layouts are supported:

  • CSV Four Corner: columns `filename, xmin, ymin, xmax, ymax, class`
  • CSV Width Height: columns `filename, x, y, width, height, class`

Both require an exact header row. Each row is one bounding box. Coordinates can be normalized (0 to 1) or in pixels; the system detects which automatically.

Four Corner example:

```csv
filename,xmin,ymin,xmax,ymax,class
image1.jpg,100,150,300,400,person
image1.jpg,350,200,550,450,car
image2.jpg,50,75,250,350,truck
```

Width Height example:

```csv
filename,x,y,width,height,class
image1.jpg,100,150,200,250,person
image1.jpg,350,200,200,250,car
image2.jpg,50,75,200,275,truck
```
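
If your tooling emits one layout and you need the other, the conversion is a simple coordinate shift. A sketch converting Width Height rows to Four Corner (file names are placeholders):

```python
import csv

# Both layouts are accepted on upload, so convert only if your
# downstream tooling expects corner coordinates.
with open("width_height.csv", newline="") as src, \
        open("four_corner.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(
        dst, fieldnames=["filename", "xmin", "ymin", "xmax", "ymax", "class"])
    writer.writeheader()
    for row in reader:
        x, y = float(row["x"]), float(row["y"])
        w, h = float(row["width"]), float(row["height"])
        writer.writerow({
            "filename": row["filename"], "class": row["class"],
            "xmin": x, "ymin": y, "xmax": x + w, "ymax": y + h,
        })
```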

Troubleshooting

Annotations don't appear after upload

The most common cause is a filename mismatch. Check that filename values in your annotation files exactly match the filenames of the uploaded images, including file extension and case. Also verify that you selected the correct format during upload.

Upload fails with validation errors

Possible causes: bounding box coordinates outside image boundaries, invalid normalized values (must be 0 to 1), or missing class IDs in YOLO class files. Validate coordinate ranges and check that your class file has an entry for every ID used.

Files fail to parse

Compare your file structure to the examples above. Validate JSON files with a JSON linter and XML files with an XML validator before uploading.
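
For `.jsonl` files, a standard JSON linter will not help, because each line must parse as a standalone JSON object. A minimal per-line check (the bounds validation assumes a phrase grounding file):

```python
import json

# Each line of a .jsonl file must parse as a standalone JSON object.
with open("annotations.jsonl") as f:
    for line_no, line in enumerate(f, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            print(f"line {line_no}: invalid JSON ({err})")
            continue
        # Bounds check for phrase grounding records (illustrative only)
        for gp in record.get("contents", {}).get("groundedPhrases", []):
            for box in gp.get("bounds", []):
                if not all(0 <= v <= 1 for v in box):
                    print(f"line {line_no}: bounds out of range: {box}")
```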

Do this with the Vi SDK

```python
import vi

# Authenticate with your organization credentials
client = vi.Client(
    secret_key="your-secret-key",
    organization_id="your-organization-id"
)

# Upload the annotation file and block until the import job finishes
result = client.annotations.upload(
    dataset_id="your-dataset-id",
    paths="annotations.jsonl",
    wait_until_done=True
)
print(f"Imported: {result.total_annotations}")
```

For more details, see the full SDK reference.

Next steps

Train A Model

Fine-tune a vision-language model with your annotated dataset.

Annotate Data

Create or edit annotations in the visual annotator or with AI-assisted tools.

Dataset Overview

Check annotation statistics and class distributions.