How Do I Evaluate My Model?

Learn how to evaluate VLM performance in Datature Vi. Understand IoU, F1, BLEU, BERTScore, and what good scores look like for phrase grounding and VQA.

Evaluation tells you whether your trained VLM actually learned the right patterns from your data. Datature Vi computes metrics automatically during training and shows them on the run dashboard. For phrase grounding, the key metrics are IoU and F1. For VQA, the key metrics are BLEU and BERTScore. This page explains each metric in plain language and shows you what "good" looks like.


Why evaluate?

Training loss tells you the model is learning. Evaluation metrics tell you if it's learning the RIGHT things on data it hasn't seen before. Datature Vi uses a validation set (images held out from training) to compute these metrics at regular intervals throughout the run.

A model with low training loss but poor evaluation metrics has memorized your training data rather than learning general patterns. Evaluation catches this before you deploy.


Phrase grounding metrics

These metrics apply when your task is phrase grounding. They measure how well the model draws bounding boxes around objects described in text. Datature Vi computes them on validation images at each evaluation checkpoint.

IoU (Intersection over Union)
Measures how much predicted and correct boxes overlap

IoU measures overlap between two rectangles: the predicted box and the correct box. It divides the intersection (the area where both boxes overlap) by the union (the total area covered by both boxes combined, counting the overlap only once).

IoU = Intersection Area / Union Area

An IoU of 1.0 means the boxes match perfectly. An IoU of 0.0 means they do not overlap at all. Higher values mean tighter, more accurate boxes. Most evaluation pipelines use an IoU threshold of 0.50 to decide whether a prediction counts as correct. Predictions below the threshold are treated as misses.

Suppose the predicted box covers 100 pixels and the ground-truth box covers 120 pixels. If 80 pixels overlap, the union is 100 + 120 - 80 = 140. IoU = 80 / 140 = 0.57.
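The worked example above can be sketched in a few lines of Python (a minimal illustration, not Datature Vi's implementation), using (x1, y1, x2, y2) box coordinates:

```python
def iou(box_a, box_b):
    """Compute Intersection over Union for two boxes in (x1, y1, x2, y2) format."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # 0 when the boxes do not overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter  # count the overlap only once
    return inter / union if union > 0 else 0.0

# The example above: a 100-pixel predicted box, a 120-pixel ground-truth box, 80 pixels shared
print(round(iou((0, 0, 10, 10), (2, 0, 14, 10)), 2))  # → 0.57
```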

F1 Score
Balances precision and recall into a single number

F1 balances two things: how many of the model's predictions were correct (precision) and how many real objects the model found (recall). A high F1 means the model finds most objects without drawing too many wrong boxes. A low F1 means the model is either missing objects, hallucinating boxes, or both.

Precision measures how many of the model's predicted boxes matched a real object. High precision means few false alarms. Recall measures how many real objects the model found. High recall means few missed objects.

A true positive is a predicted box that overlaps a real annotated box above the IoU threshold. A false positive is a predicted box with no matching annotation. A false negative is an annotated object that no predicted box matches.
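From these counts, precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is their harmonic mean, 2 x precision x recall / (precision + recall). A minimal sketch, not Datature Vi's implementation:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 8 matched boxes, 2 spurious predictions, 4 missed objects
p, r, f = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(p, 2), round(r, 2), round(f, 2))  # → 0.8 0.67 0.73
```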

With several boxes per image, Datature Vi assigns predictions to annotations using Hungarian matching on pairwise IoU, then applies the IoU threshold. See Training metrics for the full bounding-box metric definitions.
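For illustration only, the assignment step can be sketched as a brute-force search over pairings. Real pipelines use the Hungarian algorithm (e.g. scipy.optimize.linear_sum_assignment); this toy version is only practical for a handful of boxes:

```python
from itertools import permutations

def match_boxes(iou_matrix, threshold=0.5):
    """Find the pairing of predictions to annotations that maximizes total IoU,
    then keep only pairs above the IoU threshold. iou_matrix[i][j] is the IoU of
    prediction i with annotation j. Brute force: fine for small box counts only."""
    n_pred, n_gt = len(iou_matrix), len(iou_matrix[0])
    k = min(n_pred, n_gt)
    best_pairs, best_total = [], -1.0
    for preds in permutations(range(n_pred), k):
        for gts in permutations(range(n_gt), k):
            pairs = list(zip(preds, gts))
            total = sum(iou_matrix[i][j] for i, j in pairs)
            if total > best_total:
                best_total, best_pairs = total, pairs
    return [(i, j) for i, j in best_pairs if iou_matrix[i][j] >= threshold]

# Two predictions, two annotations: prediction 0 best matches annotation 1
ious = [[0.1, 0.8],
        [0.7, 0.2]]
print(match_boxes(ious))  # → [(0, 1), (1, 0)]
```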


VQA metrics

These metrics apply when your task is visual question answering. They measure how well the model's generated text matches your reference answers. Datature Vi tracks them on the run dashboard alongside loss curves.

BERTScore
Measures meaning similarity, not word overlap

BERTScore uses a language model to compare meaning rather than exact words. "Car" and "automobile" score high because they mean the same thing. A high BERTScore means the model's answer captures the correct meaning, even if the wording differs from your annotation. A low BERTScore means the model's answer is semantically wrong or irrelevant.

This makes BERTScore more reliable than BLEU for evaluating VQA answers in Datature Vi, where correct answers can be phrased in many ways.

BLEU
Word-for-word overlap between model output and annotation

BLEU compares exact word and phrase overlap between the model's output and your annotation. A high BLEU means the model produces text with wording close to the reference answer. A low BLEU means the phrasing differs, but this does not always indicate a wrong answer.

For example, "The car is red" vs "The red automobile" scores poorly on BLEU even though they mean the same thing. Use BERTScore alongside BLEU for a fuller picture: BERTScore catches meaning, BLEU catches phrasing.
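To see why the phrasing penalty happens, here is the simplest ingredient of BLEU, clipped unigram precision, in a few lines of Python. Full BLEU also multiplies in 2- to 4-gram precisions and a brevity penalty; this is an illustration, not Datature Vi's implementation:

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision: the fraction of candidate words that also
    appear in the reference, with repeated words counted at most as often
    as they occur in the reference."""
    cand = candidate.lower().split()
    ref = Counter(reference.lower().split())
    matches = sum(min(count, ref[word]) for word, count in Counter(cand).items())
    return matches / len(cand) if cand else 0.0

# Same meaning, different phrasing: only "the" and "red" match exactly
print(round(unigram_precision("the red automobile", "the car is red"), 2))  # → 0.67
```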

METEOR and ROUGE
Secondary overlap metrics with partial paraphrase handling

METEOR measures text similarity with synonym matching, word stemming, and word order awareness. It handles paraphrasing better than BLEU but not as well as BERTScore. Scores above roughly 0.40 generally indicate good quality.

ROUGE measures n-gram overlap with a focus on recall. It tells you how much of the reference answer's content appears in the model's output. ROUGE-1 checks individual words, ROUGE-2 checks two-word sequences, and ROUGE-L finds the longest matching subsequence. Scores above roughly 0.30 indicate reasonable coverage.

Both metrics appear in the Training Metrics dashboard. They are useful as secondary signals alongside BERTScore and BLEU.
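As an illustration (not Datature Vi's implementation), ROUGE-1 recall reduces to a word-overlap count against the reference:

```python
from collections import Counter

def rouge1_recall(candidate, reference):
    """ROUGE-1 recall: the fraction of reference words that appear in the
    candidate, with repeats counted at most as often as in the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[word], count) for word, count in ref.items())
    return overlap / sum(ref.values()) if ref else 0.0

# Only "red" and "car" from the 5-word reference appear in the candidate
print(round(rouge1_recall("a small red car", "the red car is parked"), 2))  # → 0.4
```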


Freeform text evaluation

Freeform text tasks produce custom output formats: JSON, YAML, structured reports, or domain-specific text. Automated metrics like BERTScore and BLEU apply to freeform text the same way they do to VQA, but they only measure text similarity. For structured outputs, you also need to verify format compliance and field accuracy.

Automated metrics

Datature Vi computes BERTScore and BLEU on freeform text validation samples during training. These metrics tell you whether the model's output conveys the same meaning and wording as your annotations. Use the same interpretation guidelines from the VQA metrics section above.

Manual evaluation for structured output

When your freeform text model produces structured data (JSON, YAML, or custom schemas), automated metrics miss important failure modes. A response with correct meaning but broken JSON syntax will score well on BERTScore while being useless to your application.

Check these dimensions manually on 20-30 inference outputs:

Format compliance: does every response parse as valid JSON/YAML/CSV? Red flags are syntax errors, missing brackets, and trailing commas.

Schema adherence: does the output contain all required fields with correct types? Red flags are missing fields, wrong data types, and extra unexpected fields.

Field accuracy: are the values in each field correct given the image? Red flags are hallucinated values, swapped fields, and wrong units.

Consistency: does the same type of image produce the same structure? Red flags are formats that vary between similar images and fields that appear inconsistently.

If your model outputs JSON, validate format compliance programmatically after inference:

import json

def validate_response(response_text, required_fields):
    """Check if model output is valid JSON with required fields."""
    try:
        data = json.loads(response_text)
    except json.JSONDecodeError:
        return {"valid": False, "error": "Invalid JSON syntax"}

    missing = [f for f in required_fields if f not in data]
    if missing:
        return {"valid": False, "error": f"Missing fields: {missing}"}

    return {"valid": True, "data": data}

# Example with hypothetical field names; model_output holds the raw response string
result = validate_response(model_output, ["defect_found", "defect_type", "severity"])

Run this on your validation set to measure what percentage of outputs are structurally valid. Aim for 95%+ format compliance before deploying.
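A sketch of that measurement over a batch of outputs (the responses below are hypothetical; only JSON syntax is checked, not schema or field accuracy):

```python
import json

def format_compliance(outputs):
    """Fraction of model outputs that parse as valid JSON."""
    valid = 0
    for text in outputs:
        try:
            json.loads(text)
            valid += 1
        except json.JSONDecodeError:
            pass
    return valid / len(outputs) if outputs else 0.0

# Hypothetical batch: three valid responses, one broken by a trailing comma
batch = [
    '{"defect_found": true, "defect_type": "dent", "severity": "low"}',
    '{"defect_found": false, "defect_type": null, "severity": null}',
    '{"defect_found": true, "defect_type": "scratch", "severity": "high",}',
    '{"defect_found": true, "defect_type": "rust", "severity": "medium"}',
]
print(f"{format_compliance(batch):.0%}")  # → 75%
```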


Reading metric combinations

Individual metrics tell you one dimension of model performance. Combining them reveals specific failure patterns and points you toward the right fix.

Phrase grounding combinations

High F1, high IoU: the model finds objects accurately and draws tight boxes. Production-ready for most tasks.

High F1, low IoU: the model finds the right objects but draws loose or offset boxes. Tighten your annotation bounding boxes so they fit objects closely with minimal background.

Low F1, high IoU: when the model draws a box it is accurate, but it misses many objects. Add more training examples, especially for underrepresented object types.

High precision, low recall: the model is cautious, with few false alarms but many missed objects. Add more positive examples; the model needs to see more instances of what to find.

Low precision, high recall: the model finds most objects but also draws many incorrect boxes. Add negative examples (images without target objects); the model needs to learn what NOT to find.

VQA combinations

High BERTScore, high BLEU: the model gives correct answers using wording similar to your annotations. Strong performance across both meaning and phrasing.

High BERTScore, low BLEU: the model gives correct answers in different words than your annotations. Usually fine; the model understands the task but phrases answers differently. Check a few outputs manually to confirm.

Low BERTScore, high BLEU: rare. The model repeats annotation phrasing without understanding the meaning. Check for annotation patterns the model could exploit (e.g. all answers starting with the same phrase).

Low BERTScore, low BLEU: the model gives wrong or irrelevant answers. Review annotations for quality and consistency, and try a larger model or more training data.

What does good look like?

There is no universal threshold for any of these metrics. What counts as "good" depends on your task, your data, and how the model's output will be used. A medical imaging task may need very high IoU, while a rough object count task can tolerate lower overlap. A VQA model answering free-text questions will naturally score lower on BLEU than one producing short, formulaic answers.

Instead of chasing a specific number, focus on trends: are your metrics improving across training runs? When they plateau, use the metric combinations section above to diagnose what's holding the model back. Compare runs against each other rather than against a fixed target.


How to improve a weak model

If your metrics are stagnating or declining, work through these steps in order. Datature Vi re-evaluates automatically on your next training run, so you can measure progress after each change.

1. Add more diverse annotations

Quality and variety matter more than quantity. Include different lighting conditions, angles, backgrounds, and edge cases. Fifty well-annotated images covering many scenarios outperform 500 images that all look the same.

2. Review and fix inconsistent annotations

Inconsistent labels confuse the model. If some annotators draw tight boxes and others draw loose boxes, the model learns a blurred average. Audit your annotations for uniformity.

3. Adjust the system prompt

Make the prompt more specific about your domain. Instead of "find objects," try "find all dented cans on the shelf." A precise prompt gives the model a clearer learning signal. See Configure Your System Prompt.

4. Try a different model or training mode

Switch architectures if you've hit a ceiling. A 7B model may succeed where a 2B model plateaus. Or switch from LoRA to full fine-tuning for more capacity. See Model Architectures.

For visual, per-image analysis of where the model fails, use Advanced Evaluation.


Frequently asked questions

Why is my training loss low but my evaluation metrics poor?

This is overfitting. The model memorized your training data instead of learning patterns that generalize. The fix: add more diverse data, reduce the number of epochs, or both. Check the validation loss curve in Training Metrics. If validation loss started rising while training loss kept falling, that confirms the diagnosis.

Should I trust BLEU or BERTScore more for VQA?

BERTScore. It captures meaning rather than exact word matches. A model that answers "The vehicle is damaged" when the reference says "The car has damage" will score poorly on BLEU but well on BERTScore. For most VQA tasks, BERTScore is the better indicator of real-world usefulness.

Can I compare loss values across different model architectures?

No. Loss values are not comparable across architectures. A loss of 1.0 on Qwen2.5-VL 7B does not mean the same thing as 1.0 on NVILA-Lite 2B. F1, IoU, and BERTScore are comparable because they measure prediction quality on the same scale regardless of architecture. Use these metrics when comparing runs across different models.


Further reading



Training Metrics

Read loss curves, F1, IoU, BLEU, and other evaluation charts on the run dashboard.

Advanced Evaluation

Compare predictions against ground truth visually across checkpoints.

Quickstart

Train and deploy your first VLM in 30 minutes.