Configure Your Dataset
Set up dataset splitting, shuffling, and filtering for optimal training performance.
Dataset configuration determines how your training data is split and organized for model training. Proper dataset configuration ensures your VLM learns effectively from your data while accurately measuring performance on unseen examples.
Looking for a quick start? For a streamlined workflow setup without detailed dataset configuration, see the Quickstart guides under Additional resources below.
Prerequisites
Before configuring your dataset, ensure you have:
- An existing training project
- A dataset with annotations
- Basic understanding of train-test data splitting
Understanding dataset configuration
When you create a workflow, you configure how your dataset is used during training. This includes:
- Dataset source selection — Which dataset to use for training
- Train-test split ratio — How to divide data between training and evaluation
- Shuffle settings — Whether to randomize data order
- Data filtering — Selecting specific subsets or classes (advanced)
These settings directly impact:
- Model learning — How effectively your VLM learns from examples
- Performance evaluation — How accurately you measure model quality
- Training efficiency — How quickly training converges
- Generalization — How well the model performs on new data
Dataset configuration options
Access dataset configuration by clicking the Dataset node in the workflow canvas.
Dataset source
Select which dataset to use for training from your available datasets.
Options:
- Any dataset in your organization with annotations
- Datasets must contain images and annotations compatible with your system prompt task type
Considerations:
- Choose datasets appropriate for your task (phrase grounding or VQA)
- Ensure sufficient annotations (minimum 20 images recommended, 100+ for production)
- Verify annotation quality and consistency before training
Learn how to create datasets →
Train-test split ratio
The train-test split ratio determines how your dataset is divided between training and evaluation.
What is train-test split?
During training, your dataset is divided into two subsets:
- Training set — Data used to train the model (learn patterns)
- Test set — Data held back to evaluate model performance (measure quality)
The split ratio represents the proportion allocated to the test set. For example:
| Split Ratio | Training Data | Test Data | Example (100 images) |
|---|---|---|---|
| 0.1 | 90% | 10% | 90 train, 10 test |
| 0.2 | 80% | 20% | 80 train, 20 test |
| 0.3 | 70% | 30% | 70 train, 30 test |
How the ratio works
If you set the split ratio to 0.2 (20%):
- 80% of your data will be used for training
- 20% of your data will be held back for evaluation
The test set is never seen during training, ensuring unbiased performance measurement.
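To make the arithmetic concrete, here is a minimal sketch in plain Python (not the Vi SDK) that maps a dataset size and split ratio to train and test counts:

```python
# Minimal sketch (plain Python, not the Vi SDK): how a split ratio maps to counts.
def split_counts(total_images: int, split_ratio: float) -> tuple[int, int]:
    """Return (train_count, test_count) for a given test split ratio."""
    test_count = round(total_images * split_ratio)
    return total_images - test_count, test_count

print(split_counts(100, 0.2))  # (80, 20)
print(split_counts(50, 0.1))   # (45, 5)
```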
Choosing the right split ratio
Recommended split ratios:
Small datasets (20-200 images)
Recommended: 0.1-0.2 (10-20% test)
For small datasets, maximize training data while maintaining enough test examples for evaluation.
Example with 50 images:
- Split ratio 0.2 = 40 training, 10 test
- Split ratio 0.1 = 45 training, 5 test
Best practice: Use at least 10 test images when possible. If your dataset is smaller than 50 images, consider collecting more data before training.
Medium datasets (200-1000 images)
Recommended: 0.2 (20% test)
The standard 80/20 split provides good balance for most use cases.
Example with 500 images:
- Split ratio 0.2 = 400 training, 100 test
- Sufficient training data for learning
- Adequate test set for reliable evaluation
Best practice: The 0.2 ratio is the default and works well for most production applications.
Large datasets (1000+ images)
Recommended: 0.15-0.3 (15-30% test)
Large datasets offer flexibility in split ratios.
Example with 2000 images:
- Split ratio 0.2 = 1600 training, 400 test
- Split ratio 0.3 = 1400 training, 600 test
Best practice: Larger test sets (20-30%) provide more reliable performance metrics. You have enough training data even with higher test ratios.
Setting the split ratio
- Click the Dataset node in your workflow
- Locate the Train-Test Split Ratio field
- Enter a value between 0.0 and 1.0:
- 0.1 = 10% test, 90% training
- 0.2 = 20% test, 80% training (recommended default)
- 0.3 = 30% test, 70% training
Important: Test data is never used for training. Data in the test set is completely held out during training. The model never sees these examples during learning, ensuring evaluation metrics accurately reflect performance on new, unseen data.
Shuffle
Shuffle randomizes the order of your data before splitting it into training and test sets. This is a critical setting that impacts model performance and evaluation reliability.
What does shuffle do?
When enabled:
- Data is randomly reordered before train-test splitting
- Each training run uses a different random split
- Prevents biases from data collection order
- Improves model generalization
When disabled:
- Data maintains its original order
- Train-test split is deterministic (same every time)
- May introduce biases if data has sequential patterns
- Useful for debugging or reproducibility
Recommendation: Enable shuffle. Shuffle should be enabled (Yes) for almost all training scenarios. It's a best practice that prevents overfitting to data collection patterns and ensures more robust model evaluation.
Why shuffle is important
Prevents sequential biases:
If your dataset contains sequential patterns (e.g., collected over time, organized by category), shuffle prevents the model from learning these artificial patterns.
Example without shuffle:
Original data order:
- Images 1-50: Outdoor scenes (daytime)
- Images 51-100: Indoor scenes (nighttime)
With split ratio 0.2 (no shuffle):
- Training: Images 1-80 (mostly outdoor + some indoor)
- Test: Images 81-100 (all indoor nighttime)
Problem: Test set doesn't represent full data distribution
Example with shuffle:
After shuffling:
- Images mixed randomly
With split ratio 0.2 (shuffled):
- Training: Random mix of outdoor/indoor
- Test: Random mix of outdoor/indoor
Benefit: Test set accurately represents overall distribution
Improves generalization:
Random data ordering helps the model learn robust features rather than memorizing sequence-dependent patterns.
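The effect is easy to demonstrate. The sketch below, in plain Python rather than platform code, mimics the outdoor/indoor example above: without shuffling, the test slice is all indoor images; with shuffling, it is a representative mix.

```python
import random

# Illustrative sketch (plain Python, not platform code): shuffling before the
# split keeps the test set representative of the whole dataset.
images = [f"outdoor_{i}" for i in range(50)] + [f"indoor_{i}" for i in range(50)]

def train_test_split(items, split_ratio, shuffle=True, seed=None):
    items = list(items)  # copy so the caller's order is untouched
    if shuffle:
        random.Random(seed).shuffle(items)
    train_count = len(items) - round(len(items) * split_ratio)
    return items[:train_count], items[train_count:]  # (train, test)

_, test_plain = train_test_split(images, 0.2, shuffle=False)
_, test_mixed = train_test_split(images, 0.2, shuffle=True, seed=42)

print(sum("indoor" in x for x in test_plain))  # 20 -- the tail is all indoor
print(sum("indoor" in x for x in test_mixed))  # close to 10 -- a mixed test set
```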
Setting shuffle
- Click the Dataset node in your workflow
- Locate the Shuffle toggle
- Set to Yes (recommended) or No
When to enable shuffle (Yes):
- Standard training workflows (recommended for 99% of cases)
- Production models requiring robust generalization
- Datasets with any sequential organization
- When test set should represent full data distribution
When to disable shuffle (No):
- Debugging specific training issues
- Reproducing exact results from previous runs
- Research experiments requiring deterministic splits
- Temporal data where chronological order matters
Shuffle vs. reproducibility: Even with shuffle enabled, you can achieve reproducibility by setting a random seed in advanced training settings. This allows random shuffling while maintaining consistent splits across runs for comparison.
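The mechanism is simple to sketch in plain Python: a fixed seed reproduces the same shuffle order, and therefore the same split, on every run.

```python
import random

# Sketch: the same seed yields the same shuffle, hence the same split.
run_1 = list(range(10)); random.Random(7).shuffle(run_1)
run_2 = list(range(10)); random.Random(7).shuffle(run_2)
assert run_1 == run_2  # identical order across runs
```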
Best practices
Match test set to real-world distribution
Your test set should reflect the data distribution your model will encounter in production.
Good practices:
- Enable shuffle to ensure representative sampling
- Include all classes/scenarios in proper proportions
- Avoid artificially balanced test sets if production data is imbalanced
- Test on edge cases and challenging examples
Example: If production data is 80% good parts and 20% defective parts, your test set should maintain this ratio rather than artificially balancing to 50/50.
Maintain sufficient test set size
Small test sets produce unreliable performance metrics.
Minimum recommendations:
- Absolute minimum: 10 test images
- Reasonable minimum: 20-30 test images
- Recommended: 50+ test images
- Ideal: 100+ test images
Why this matters: With only 5 test images, a single misprediction changes accuracy by 20%. With 100 test images, a single error changes accuracy by only 1%.
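The same arithmetic as a quick sketch:

```python
# One misprediction moves accuracy by 1/N, so small test sets give noisy metrics.
for n in (5, 10, 100):
    print(f"{n} test images: one error shifts accuracy by {100 / n:.0f}%")
# 5 -> 20%, 10 -> 10%, 100 -> 1%
```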
Consider dataset size when splitting
Adjust your split ratio based on total dataset size:
Small datasets (< 100 images):
- Use smaller test ratios (0.1-0.15) to maximize training data
- Consider data augmentation to expand effective dataset size
- Collect more data if possible before training
Medium datasets (100-1000 images):
- Standard 0.2 split ratio works well
- Balanced between training data and evaluation reliability
Large datasets (> 1000 images):
- Can afford larger test ratios (0.2-0.3)
- More reliable performance metrics
- Consider creating separate validation sets for hyperparameter tuning
Always enable shuffle
Unless you have a specific reason not to, enable shuffle:
Benefits:
- Prevents temporal biases
- Ensures representative test sets
- Improves generalization
- Reduces overfitting to collection order
Rare exceptions:
- Time-series data requiring chronological order
- Debugging reproducibility issues (use random seed instead)
- Specific research requirements
Validate data quality before configuration
Before configuring dataset splits, ensure data quality:
Check for:
- Annotation accuracy and consistency
- Missing or incomplete annotations
- Class imbalance issues
- Duplicate images
- Corrupted or low-quality images
Actions:
- Review dataset insights for distribution analysis
- Fix annotation errors before training
- Remove or fix problematic images
- Consider data filtering for large datasets
Data filtering (Advanced)
For advanced use cases, you can filter which data is used from your selected dataset.
Filtering options
By classes:
- Select specific annotation classes to include
- Useful for training on subset of categories
- Reduces training time for focused tasks
By tags:
- Filter images by custom tags
- Organize training by scenarios or conditions
- Enable targeted model development
By quality scores:
- Filter based on annotation confidence
- Include only high-quality examples
- Improve training data cleanliness
When to use filtering
Training class-specific models
Train models focused on specific object types:
Example: From a dataset with 10 vehicle types, filter to train a model that only detects cars and trucks:
- Select classes: "car", "truck"
- Exclude: motorcycles, buses, bicycles, etc.
- Result: Faster training, specialized model
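As an illustration, a local pre-filtering step along these lines might look like the following sketch. The annotation structure and the class_name field are assumptions for the example, not the platform's schema.

```python
# Hypothetical sketch: pre-filtering a local annotation list to two classes.
# The "class_name" field is an assumed schema, not the platform's.
KEEP_CLASSES = {"car", "truck"}

annotations = [
    {"image": "img_001.jpg", "class_name": "car"},
    {"image": "img_002.jpg", "class_name": "motorcycle"},
    {"image": "img_003.jpg", "class_name": "truck"},
]

filtered = [a for a in annotations if a["class_name"] in KEEP_CLASSES]
print(len(filtered))  # 2 -- only the car and truck annotations remain
```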
Iterative model development
Start with simple cases and gradually add complexity:
Example:
- Iteration 1: Train on clear, well-lit images only
- Iteration 2: Add challenging lighting conditions
- Iteration 3: Include edge cases and occlusions
- Result: Controlled difficulty progression
Addressing class imbalance
Balance training data when some classes are overrepresented:
Example: Dataset has 1000 "good" parts and 50 "defective" parts:
- Filter "good" parts to ~200 examples
- Keep all "defective" parts
- Result: More balanced training data
Note: Consider collecting more defect examples as the better solution.
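A rough sketch of this downsampling in plain Python (not platform code), keeping every defective example and sampling the overrepresented class:

```python
import random

# Illustrative downsampling sketch: keep every rare-class image and
# subsample the overrepresented class to roughly 200 examples.
random.seed(0)
good = [f"good_{i}.jpg" for i in range(1000)]
defective = [f"defect_{i}.jpg" for i in range(50)]

balanced = random.sample(good, 200) + defective
print(len(balanced))  # 250 images at a 4:1 ratio instead of 20:1
```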
Common questions
Can I change the dataset split after training?
No, the split is fixed when you start a training run. To use a different split ratio or different data, you need to:
- Create a new workflow with updated dataset configuration
- Start a new training run
- Compare results across runs to determine optimal settings
This is why it's important to choose appropriate settings before training.
What happens to images in the test set?
Test set images are:
- Never used for training — The model doesn't learn from them
- Used for evaluation — They measure model performance during and after training
- A source of unbiased metrics — Results reflect performance on truly unseen data
After training, you can evaluate your model on the test set to see performance metrics, confusion matrices, and prediction examples.
Should I use the same split ratio for all my experiments?
For comparison purposes, yes:
When comparing different model architectures or system prompts, use the same split ratio to ensure fair comparison.
For optimization, experiment:
Try different split ratios if you're uncertain what works best for your dataset size:
- Create workflows with 0.1, 0.2, and 0.3 splits
- Train models with each configuration
- Compare training performance and test metrics
- Choose the ratio that provides best balance
My dataset is small. Should I skip the test set?
No, always maintain a test set, even for small datasets. Here's why:
Without a test set:
- No way to measure actual model performance
- Risk of severe overfitting without knowing
- Can't trust training metrics alone
- Deployment failures on real data
Better approaches:
- Use a smaller test ratio (0.1) to maximize training data
- Collect more data before training if possible
- Use cross-validation techniques (sketched below)
- Consider data augmentation to expand effective dataset size
Minimum recommendation: At least 10 test images, even if that means only 40-50 training images.
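For the cross-validation option, here is a minimal k-fold sketch in plain Python; it assumes nothing about the platform and simply shows how every image serves in both training and evaluation across folds, averaging out split noise:

```python
import random

# Minimal k-fold sketch (plain Python): each image appears in exactly one
# test fold and in the training set of every other fold.
def k_folds(items, k=5, seed=42):
    items = list(items)
    random.Random(seed).shuffle(items)
    for i in range(k):
        test = items[i::k]                      # every k-th item as a test fold
        train = [x for x in items if x not in test]
        yield train, test

for fold, (train, test) in enumerate(k_folds(range(50))):
    print(f"fold {fold}: {len(train)} train, {len(test)} test")  # 40 / 10
```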
What if my dataset has very few examples of certain classes?
Class imbalance requires special attention:
Options:
1. Collect more examples (best solution)
   - Gather additional data for underrepresented classes
   - Ensures model learns all classes effectively
2. Use data filtering
   - Reduce overrepresented classes
   - Balance training distribution artificially
3. Adjust class weights (advanced)
   - Increase importance of rare classes during training
   - Available in advanced training settings
4. Accept imbalance (if it matches production)
   - If real-world data is imbalanced, train on realistic distribution
   - Adjust inference thresholds instead
What to avoid:
- Don't oversample by duplicating images (causes overfitting)
- Don't artificially balance test set if production data is imbalanced
Does shuffle change my training results every time?
Yes, slightly:
Each time you train with shuffle enabled, the random split is different, leading to minor variations in:
- Which specific images are in train vs. test sets
- Training metrics and final performance
- Model behavior on edge cases
This is normal and expected. Variations are typically small (1-3% accuracy difference).
To maintain consistency:
- Set a random seed in advanced settings
- Creates reproducible shuffles
- Same random split across runs
- Useful for controlled experiments
Can I use multiple datasets in one workflow?
Currently, workflows use a single dataset source. To train on multiple datasets:
Option 1: Merge datasets (recommended; see the sketch after these options)
- Create a new dataset
- Upload or copy images from multiple sources
- Use the merged dataset in your workflow
Option 2: Train iteratively
- Train initial model on Dataset A
- Fine-tune on Dataset B using the first model as base
- Useful for domain adaptation scenarios
Option 3: Dataset versioning
- Create dataset versions combining different sources
- Train on each version
- Compare performance
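As a sketch of Option 1, the following merges two locally exported annotation files before re-uploading. The file names and JSON layout are assumptions for illustration, not the platform's export format.

```python
import json
from pathlib import Path

# Hypothetical merge sketch: combine two locally exported annotation files
# into one before re-uploading as a new dataset.
merged = []
for src in ("dataset_a.json", "dataset_b.json"):
    merged.extend(json.loads(Path(src).read_text()))

Path("merged_dataset.json").write_text(json.dumps(merged, indent=2))
print(f"merged {len(merged)} annotation records")
```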
Example configurations
Standard production model
Scenario: Training a quality control model with 500 annotated images
Configuration:
- Dataset source: "PCB Defect Detection v3"
- Train-test split ratio: 0.2 (80% train, 20% test)
- Shuffle: Yes
Reasoning:
- 400 training images sufficient for learning
- 100 test images provide reliable metrics
- Shuffle ensures representative evaluation
- Standard setup for most production use cases
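Expressed as data, this configuration might look like the hypothetical sketch below; the key names are illustrative, not the platform's actual schema.

```python
# Hypothetical representation of the settings above; key names are
# illustrative assumptions, not the platform's schema.
workflow_config = {
    "dataset_source": "PCB Defect Detection v3",
    "train_test_split_ratio": 0.2,  # 400 train / 100 test at 500 images
    "shuffle": True,
}
```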
Small dataset experiment
Scenario: Initial prototype with 60 annotated images
Configuration:
- Dataset source: "Retail Products Pilot"
- Train-test split ratio: 0.15 (85% train, 15% test)
- Shuffle: Yes
Reasoning:
- 51 training images (maximize learning data)
- 9 test images (minimum for basic evaluation)
- Plan to collect more data based on results
- Lower split ratio compensates for small dataset
Next steps:
- Evaluate initial results
- Collect 200+ more images for production model
- Increase split ratio to 0.2 with larger dataset
Large-scale production model
Scenario: Comprehensive model with 5000 annotated images
Configuration:
- Dataset source: "Warehouse Inventory Full"
- Train-test split ratio: 0.25 (75% train, 25% test)
- Shuffle: Yes
Reasoning:
- 3750 training images (more than sufficient)
- 1250 test images (highly reliable metrics)
- Larger test set provides confidence in performance
- Can afford higher test ratio with large dataset
Debugging reproducibility
Scenario: Investigating training instability, need consistent splits
Configuration:
- Dataset source: "Debug Dataset v1"
- Train-test split ratio: 0.2 (80% train, 20% test)
- Shuffle: Yes
- Advanced: Random seed = 42 (set in training settings)
Reasoning:
- Standard split ratio
- Shuffle enabled for best practices
- Random seed ensures same split every run
- Allows controlled comparison across debugging sessions
Next steps
After configuring your dataset:
Continue workflow configuration
- Configure your system prompt — Define VLM task instructions
- Configure your model — Select architecture and parameters
- Configure training settings — Fine-tune advanced options
Start training
- Create a workflow — Complete workflow setup
- Manage runs — Launch training and monitor progress
- Evaluate a model — Assess performance on test set
Improve your dataset
- View dataset insights — Analyze data distribution
- Upload more data — Expand your dataset
- Annotate data — Improve annotation quality
Additional resources
Dataset management
- Create a dataset — Set up new datasets
- Upload data — Add images and annotations
- View dataset insights — Analyze distribution
- Manage datasets — Organize and maintain data
Training guides
- Create a training project — Set up training projects
- Create a workflow — Define training configurations
- Configure training settings — Advanced parameters
Concept guides
- Phrase grounding — Understanding visual grounding tasks
- Visual question answering — Understanding VQA
- Glossary — VLMOps terminology reference
Quickstart
- Quickstart: Prepare your dataset — Fast-track data preparation
- Quickstart: Train a model — Fast-track training guide
Related resources
- Create a workflow — Complete workflow configuration guide
- Configure your model — Select model architecture and settings
- Configure your system prompt — Define VLM behavior
- Train a model — Complete training workflow overview
- Create a dataset — Set up datasets for training
- Upload data — Add images and annotations
- Annotate data — Create phrase grounding and VQA annotations
- View dataset insights — Analyze dataset statistics
- Evaluate a model — Assess model performance on validation data
- Quickstart — End-to-end training tutorial
- Vi SDK — Python SDK for dataset management
- Resource usage — Understanding Data Rows consumption
Need help?
We're here to support your VLMOps journey.