Prepare Your Dataset

Set up your dataset with images and annotations to train your VLM.

📍 Quickstart: Step 1 of 3

This is the first step in the quickstart guide. After preparing your dataset, you'll train a VLM and deploy it.

Before you can train a VLM, you need a dataset with images and annotations. Follow these three focused steps to get your data ready for training.

⏱️ Time to complete: ~15 minutes

📚 What you'll learn: Dataset creation, image uploading, and annotation basics

📋 Prerequisites

You'll need a Datature Vi account. Sign up for free if you haven't already.


Three steps to prepare your dataset

Follow these steps in order to set up your data for training:


Quick overview

What you'll do

  1. Create a dataset — Choose a type: Phrase Grounding, Visual Question Answering, or Freeform (coming soon), then configure storage
  2. Upload images — Use drag-and-drop or the SDK to add your images
  3. Add annotations — Import existing labels or create them manually (see the SDK sketch after this list)
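
If you prefer to script all three steps, the sketch below shows the general shape of an SDK workflow. It is illustrative only: the package name, client class, and every method and parameter are assumptions, not the documented Datature Vi SDK, so check the SDK reference for the actual calls.

```python
from pathlib import Path

import vi_sdk  # hypothetical package name, stands in for the real SDK

# Authenticate with your project's secret key (hypothetical client class).
client = vi_sdk.Client(secret_key="YOUR_SECRET_KEY")

# Step 1: create a Phrase Grounding dataset (hypothetical method and params).
dataset = client.datasets.create(name="defect-demo", type="phrase_grounding")

# Step 2: upload every .jpg in a local folder.
for image_path in Path("images").glob("*.jpg"):
    client.assets.upload(dataset_id=dataset.id, file=str(image_path))

# Step 3: import existing annotations from a file export.
client.annotations.import_file(dataset_id=dataset.id, file="annotations.json")
```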

What you'll need

  • Your images in supported formats (.jpg, .png, etc.)
  • Annotations (optional; you can create them in Datature Vi)
  • About 15 minutes
💡 Need more detail?

This quickstart covers the essentials. For comprehensive guides, see the full documentation for each step.

Tips for success


Dataset types explained

Not sure which dataset type to choose? Here's a quick guide:

Phrase Grounding

Best for:

  • Object detection
  • Defect detection
  • Product identification
  • Counting objects

Output: Bounding boxes around objects with labels
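
To make that output concrete, here is a minimal phrase-grounding record written as a plain Python dict. The field names are illustrative assumptions, not Datature Vi's export schema; the point is that each annotation ties a text phrase and label to a bounding box on one image.

```python
# Illustrative record only -- field names are assumptions, not the Vi schema.
annotation = {
    "image": "board_001.jpg",
    "phrase": "scratch on the top-left corner",  # grounded text phrase
    "label": "scratch",                          # class label
    "bbox": [120, 45, 310, 180],                 # [x_min, y_min, x_max, y_max] in pixels
}
```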


Visual Question Answering

Best for:

  • Image understanding
  • Content analysis
  • Quality inspection
  • Flexible Q&A about images

Output: Natural language answers to questions
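
For comparison, a minimal VQA record pairs one image with a question and the free-text answer the model should learn to produce. Again, the field names are illustrative assumptions, not a documented schema.

```python
# Illustrative record only -- field names are assumptions, not the Vi schema.
qa_pair = {
    "image": "board_001.jpg",
    "question": "Is there any visible damage on this circuit board?",
    "answer": "Yes, there is a scratch near the top-left corner.",
}
```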


Freeform

🚧 Coming soon

Best for:

  • Custom annotation schemas
  • Specialized use cases
  • Research projects
  • Novel vision tasks

Output: Custom annotation formats


What's next?

Once you've completed all three steps, your dataset will be ready for training.

Create a training workflow and start fine-tuning your vision-language model on your prepared dataset.