How Do I Evaluate My Model?

Learn how to evaluate VLM performance in Datature Vi. Understand IoU, F1, BLEU, BERTScore, and what good scores look like for phrase grounding and VQA.

Evaluation tells you whether your trained VLM actually learned the right patterns from your data. Datature Vi computes metrics automatically during training and shows them on the run dashboard. For phrase grounding, the key metrics are IoU and F1. For VQA, the key metrics are BLEU and BERTScore. This page explains each metric in plain language and shows you what "good" looks like.


Why evaluate?

Training loss tells you the model is learning. Evaluation metrics tell you if it's learning the RIGHT things on data it hasn't seen before. Datature Vi uses a validation set (images held out from training) to compute these metrics at regular intervals throughout the run.

A model with low training loss but poor evaluation metrics has memorized your training data rather than learning general patterns. Evaluation catches this before you deploy.


Phrase grounding metrics

These metrics apply when your task is phrase grounding. They measure how well the model draws bounding boxes around objects described in text. Datature Vi computes them on validation images at each evaluation checkpoint.

IoU (Intersection over Union)
Measures how much predicted and correct boxes overlap

IoU measures overlap between two rectangles: the predicted box and the correct box. It divides the intersection (the area where both boxes overlap) by the union (the total area covered by both boxes combined, counting the overlap only once).

IoU = Intersection Area / Union Area

An IoU of 1.0 means the boxes match perfectly. An IoU of 0.0 means they do not overlap at all. Higher values mean tighter, more accurate boxes. Most evaluation pipelines use an IoU threshold of 0.50 to decide whether a prediction counts as correct. Predictions below the threshold are treated as misses.

Suppose the predicted box covers 100 pixels and the ground-truth box covers 120 pixels. If 80 pixels overlap, the union is 100 + 120 - 80 = 140. IoU = 80 / 140 = 0.57.
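The worked example above can be sketched in a few lines of Python (a minimal illustration, not Datature Vi's implementation), using (x1, y1, x2, y2) box coordinates:

```python
def iou(box_a, box_b):
    """Compute Intersection over Union for two boxes in (x1, y1, x2, y2) format."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # 0 when the boxes do not overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter  # count the overlap only once
    return inter / union if union > 0 else 0.0

# The example above: a 100-pixel predicted box, a 120-pixel ground-truth box, 80 pixels shared
print(round(iou((0, 0, 10, 10), (2, 0, 14, 10)), 2))  # → 0.57
```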

F1 Score
Balances precision and recall into a single number

F1 balances two things: how many of the model's predictions were correct (precision) and how many real objects the model found (recall). A high F1 means the model finds most objects without drawing too many wrong boxes. A low F1 means the model is either missing objects, hallucinating boxes, or both.

Precision measures how many of the model's predicted boxes matched a real object. High precision means few false alarms. Recall measures how many real objects the model found. High recall means few missed objects.

A true positive is a predicted box that overlaps a real annotated box above the IoU threshold. A false positive is a predicted box with no matching annotation. A false negative is an annotated object that no predicted box matches.
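From these counts, precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is their harmonic mean, 2 x precision x recall / (precision + recall). A minimal sketch, not Datature Vi's implementation:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 8 matched boxes, 2 spurious predictions, 4 missed objects
p, r, f = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(p, 2), round(r, 2), round(f, 2))  # → 0.8 0.67 0.73
```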

With several boxes per image, Datature Vi assigns predictions to annotations using Hungarian matching on pairwise IoU, then applies the IoU threshold. See Training metrics for the full bounding-box metric definitions.
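For illustration only, the assignment step can be sketched as a brute-force search over pairings. Real pipelines use the Hungarian algorithm (e.g. scipy.optimize.linear_sum_assignment); this toy version is only practical for a handful of boxes:

```python
from itertools import permutations

def match_boxes(iou_matrix, threshold=0.5):
    """Find the pairing of predictions to annotations that maximizes total IoU,
    then keep only pairs above the IoU threshold. iou_matrix[i][j] is the IoU of
    prediction i with annotation j. Brute force: fine for small box counts only."""
    n_pred, n_gt = len(iou_matrix), len(iou_matrix[0])
    k = min(n_pred, n_gt)
    best_pairs, best_total = [], -1.0
    for preds in permutations(range(n_pred), k):
        for gts in permutations(range(n_gt), k):
            pairs = list(zip(preds, gts))
            total = sum(iou_matrix[i][j] for i, j in pairs)
            if total > best_total:
                best_total, best_pairs = total, pairs
    return [(i, j) for i, j in best_pairs if iou_matrix[i][j] >= threshold]

# Two predictions, two annotations: prediction 0 best matches annotation 1
ious = [[0.1, 0.8],
        [0.7, 0.2]]
print(match_boxes(ious))  # → [(0, 1), (1, 0)]
```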


VQA metrics

These metrics apply when your task is visual question answering. They measure how well the model's generated text matches your reference answers. Datature Vi tracks them on the run dashboard alongside loss curves.

BERTScore
Measures meaning similarity, not word overlap

BERTScore uses a language model to compare meaning rather than exact words. "Car" and "automobile" score high because they mean the same thing. A high BERTScore means the model's answer captures the correct meaning, even if the wording differs from your annotation. A low BERTScore means the model's answer is semantically wrong or irrelevant.

This makes BERTScore more reliable than BLEU for evaluating VQA answers in Datature Vi, where correct answers can be phrased in many ways.

BLEU
Word-for-word overlap between model output and annotation

BLEU compares exact word and phrase overlap between the model's output and your annotation. A high BLEU means the model produces text with wording close to the reference answer. A low BLEU means the phrasing differs, but this does not always indicate a wrong answer.

For example, "The car is red" vs "The red automobile" scores poorly on BLEU even though they mean the same thing. Use BERTScore alongside BLEU for a fuller picture: BERTScore catches meaning, BLEU catches phrasing.
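To see why the phrasing penalty happens, here is the simplest ingredient of BLEU, clipped unigram precision, in a few lines of Python. Full BLEU also multiplies in 2- to 4-gram precisions and a brevity penalty; this is an illustration, not Datature Vi's implementation:

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision: the fraction of candidate words that also
    appear in the reference, with repeated words counted at most as often
    as they occur in the reference."""
    cand = candidate.lower().split()
    ref = Counter(reference.lower().split())
    matches = sum(min(count, ref[word]) for word, count in Counter(cand).items())
    return matches / len(cand) if cand else 0.0

# Same meaning, different phrasing: only "the" and "red" match exactly
print(round(unigram_precision("the red automobile", "the car is red"), 2))  # → 0.67
```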

METEOR and ROUGE
Secondary overlap metrics with partial paraphrase handling

METEOR measures text similarity with synonym matching, word stemming, and word order awareness. It handles paraphrasing better than BLEU but not as well as BERTScore. Scores above roughly 0.40 generally indicate good quality.

ROUGE measures n-gram overlap with a focus on recall. It tells you how much of the reference answer's content appears in the model's output. ROUGE-1 checks individual words, ROUGE-2 checks two-word sequences, and ROUGE-L finds the longest matching subsequence. Scores above roughly 0.30 indicate reasonable coverage.

Both metrics appear in the Training Metrics dashboard. They are useful as secondary signals alongside BERTScore and BLEU.
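As an illustration (not Datature Vi's implementation), ROUGE-1 recall reduces to a word-overlap count against the reference:

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """ROUGE-1 recall: the fraction of reference words that appear in the
    candidate, with repeats counted at most as often as in the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[word], count) for word, count in ref.items())
    return overlap / sum(ref.values()) if ref else 0.0

# Only "red" and "car" from the 5-word reference appear in the candidate
print(round(rouge1_recall("a small red car", "the red car is parked"), 2))  # → 0.4
```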


Freeform text evaluation

Freeform text tasks produce custom output formats: JSON, YAML, structured reports, or domain-specific text. Automated metrics like BERTScore and BLEU apply to freeform text the same way they do to VQA, but they only measure text similarity. For structured outputs, you also need to verify format compliance and field accuracy.

Automated metrics

Datature Vi computes BERTScore and BLEU on freeform text validation samples during training. These metrics tell you whether the model's output conveys the same meaning and wording as your annotations. Use the same interpretation guidelines from the VQA metrics section above.

Manual evaluation for structured output

When your freeform text model produces structured data (JSON, YAML, or custom schemas), automated metrics miss important failure modes. A response with correct meaning but broken JSON syntax will score well on BERTScore while being useless to your application.

Check these dimensions manually on 20-30 inference outputs:

Format compliance: does every response parse as valid JSON/YAML/CSV? Red flags are syntax errors, missing brackets, and trailing commas.

Schema adherence: does the output contain all required fields with correct types? Red flags are missing fields, wrong data types, and extra unexpected fields.

Field accuracy: are the values in each field correct given the image? Red flags are hallucinated values, swapped fields, and wrong units.

Consistency: does the same type of image produce the same structure? Red flags are formats that vary between similar images and fields that appear inconsistently.

If your model outputs JSON, validate format compliance programmatically after inference:

import json

def validate_response(response_text, required_fields):
    """Check if model output is valid JSON with required fields."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        return {"valid": False, "error": "Invalid JSON syntax"}

    missing = [f for f in required_fields if f not in data]
    if missing:
        return {"valid": False, "error": f"Missing fields: {missing}"}

    return {"valid": True, "data": data}

# Example with hypothetical field names; model_output holds the raw response string
result = validate_response(model_output, ["defect_found", "defect_type", "severity"])

Run this on your validation set to measure what percentage of outputs are structurally valid. Aim for 95%+ format compliance before deploying.
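A sketch of that measurement over a batch of outputs (the responses below are hypothetical; only JSON syntax is checked, not schema or field accuracy):

```python
import json

def format_compliance(outputs):
    """Fraction of model outputs that parse as valid JSON."""
    valid = 0
    for text in outputs:
        try:
            json.loads(text)
            valid += 1
        except json.JSONDecodeError:
            pass
    return valid / len(outputs) if outputs else 0.0

# Hypothetical batch: three valid responses, one broken by a trailing comma
batch = [
    '{"defect_found": true, "defect_type": "dent", "severity": "low"}',
    '{"defect_found": false, "defect_type": null, "severity": null}',
    '{"defect_found": true, "defect_type": "scratch", "severity": "high",}',
    '{"defect_found": true, "defect_type": "rust", "severity": "medium"}',
]
print(f"{format_compliance(batch):.0%}")  # → 75%
```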


Reading metric combinations

Individual metrics tell you one dimension of model performance. Combining them reveals specific failure patterns and points you toward the right fix.

Phrase grounding combinations

High F1, high IoU: the model finds objects accurately and draws tight boxes. Production-ready for most tasks.

High F1, low IoU: the model finds the right objects but draws loose or offset boxes. Tighten your annotation bounding boxes so they fit objects closely with minimal background.

Low F1, high IoU: when the model draws a box it is accurate, but it misses many objects. Add more training examples, especially for underrepresented object types.

High precision, low recall: the model is cautious, with few false alarms but many missed objects. Add more positive examples; the model needs to see more instances of what to find.

Low precision, high recall: the model finds most objects but also draws many incorrect boxes. Add negative examples (images without target objects); the model needs to learn what NOT to find.

VQA combinations

High BERTScore, high BLEU: the model gives correct answers using wording similar to your annotations. Strong performance across both meaning and phrasing.

High BERTScore, low BLEU: the model gives correct answers in different words than your annotations. Usually fine; the model understands the task but phrases answers differently. Check a few outputs manually to confirm.

Low BERTScore, high BLEU: rare. The model repeats annotation phrasing without understanding the meaning. Check for annotation patterns the model could exploit (e.g. all answers starting with the same phrase).

Low BERTScore, low BLEU: the model gives wrong or irrelevant answers. Review annotations for quality and consistency, and try a larger model or more training data.

What does good look like?

There is no universal threshold for any of these metrics. What counts as "good" depends on your task, your data, and how the model's output will be used. A medical imaging task may need very high IoU, while a rough object count task can tolerate lower overlap. A VQA model answering free-text questions will naturally score lower on BLEU than one producing short, formulaic answers.

Instead of chasing a specific number, focus on trends: are your metrics improving across training runs? When they plateau, use the metric combinations section above to diagnose what's holding the model back. Compare runs against each other rather than against a fixed target.


How to improve a weak model

If your metrics are stagnating or declining, work through these steps in order. Datature Vi re-evaluates automatically on your next training run, so you can measure progress after each change.

1. Add more diverse annotations

Quality and variety matter more than quantity. Include different lighting conditions, angles, backgrounds, and edge cases. Fifty well-annotated images covering many scenarios outperform 500 images that all look the same.

2. Review and fix inconsistent annotations

Inconsistent labels confuse the model. If some annotators draw tight boxes and others draw loose boxes, the model learns a blurred average. Audit your annotations for uniformity.

3. Adjust the system prompt

Make the prompt more specific about your domain. Instead of "find objects," try "find all dented cans on the shelf." A precise prompt gives the model a clearer learning signal. See Configure Your System Prompt.

4. Try a different model or training mode

Switch architectures if you've hit a ceiling. A 7B model may succeed where a 2B model plateaus. Or switch from LoRA to full fine-tuning for more capacity. See Model Architectures.

For visual, per-image analysis of where the model fails, use Advanced Evaluation.


Frequently asked questions

Why is my training loss low but my evaluation metrics poor?

This is overfitting. The model memorized your training data instead of learning patterns that generalize. The fix: add more diverse data, reduce the number of epochs, or both. Check the validation loss curve in Training Metrics. If validation loss started rising while training loss kept falling, that confirms the diagnosis.

Should I trust BLEU or BERTScore more for VQA?

BERTScore. It captures meaning rather than exact word matches. A model that answers "The vehicle is damaged" when the reference says "The car has damage" will score poorly on BLEU but well on BERTScore. For most VQA tasks, BERTScore is the better indicator of real-world usefulness.

Can I compare loss values across different model architectures?

No. Loss values are not comparable across architectures. A loss of 1.0 on Qwen2.5-VL 7B does not mean the same thing as 1.0 on NVILA-Lite 2B. F1, IoU, and BERTScore are comparable because they measure prediction quality on the same scale regardless of architecture. Use these metrics when comparing runs across different models.


Further reading



Training Metrics

Read loss curves, F1, IoU, BLEU, and other evaluation charts on the run dashboard.

Advanced Evaluation

Compare predictions against ground truth visually across checkpoints.

Quickstart

Train and deploy your first VLM in 30 minutes.