Monitor a Run

Track real-time training progress, read live loss charts, and catch errors before they waste compute credits.

Monitoring a run in Datature Vi lets you watch training progress in real time, read loss charts as they update, and spot configuration errors before they consume your compute credits. You can also review completed runs to understand what happened during training.

Before You Start

  • An active training run (status: Running, Queued, or Starting), or a completed run to review.

1. Open your training project

Go to Training in the sidebar and click the project containing the run you want to review.

You should see: the Metrics tab with training and validation loss curves decreasing over epochs. If loss is falling steadily and metrics update after each epoch, training is progressing as expected.

Training stages

Every run moves through three stages before active training begins. Each stage can take a different amount of time depending on your dataset size and GPU availability.

1. Preprocessing dataset (1-5 minutes)

Datature Vi validates your dataset, checks annotation formats, applies the train-validation split, and prepares the data for the selected model architecture. If this stage fails, the issue is usually a dataset configuration problem: missing annotations, incompatible formats, or an empty split.
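The same checks the preprocessing stage performs can be run locally before you launch, so a bad configuration never reaches the GPU queue. The sketch below is illustrative only: the field names (`images`, `annotations`, `train_split`) are assumptions for this example, not the actual Datature Vi dataset schema.

```python
def preflight_check(dataset: dict) -> list[str]:
    """Return a list of problems that would likely fail the preprocessing stage.

    Field names here are hypothetical, chosen to mirror the failure modes the
    docs describe: missing annotations, unannotated images, and an empty split.
    """
    problems = []
    images = dataset.get("images", [])
    annotations = dataset.get("annotations", [])
    if not images:
        problems.append("dataset contains no images")
    if not annotations:
        problems.append("missing annotations")
    # Images with no annotation at all are a common preprocessing failure.
    unannotated = {img["id"] for img in images} - {a["image_id"] for a in annotations}
    if unannotated:
        problems.append(f"{len(unannotated)} image(s) have no annotations")
    # A split of 0.0 or 1.0 leaves the train or validation side empty.
    split = dataset.get("train_split", 0.0)
    if not 0.0 < split < 1.0:
        problems.append("train-validation split leaves one side empty")
    return problems
```

An empty return value means none of the documented preprocessing failure modes apply to the dataset.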

2. Spinning up instance (2-5 minutes)

The platform allocates GPU hardware matching your selection. Wait times depend on GPU availability. If this stage takes longer than 10 minutes, the selected GPU type may be temporarily unavailable. Contact support if it persists.

3. Pending first metrics (5-30 minutes)

The model has loaded and training has started, but no evaluation checkpoint has been reached yet. The time depends on your dataset size, batch size, and evaluation interval. Once the first evaluation completes, loss charts and metrics appear on the Metrics tab.
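You can estimate how long this stage should last from the factors named above. A back-of-envelope sketch, assuming evaluation runs once every N epochs (your run's actual evaluation interval is whatever you set in its training configuration):

```python
import math

def steps_until_first_eval(train_images: int, batch_size: int,
                           eval_every_n_epochs: int = 1) -> int:
    """Rough count of optimizer steps before the first evaluation checkpoint.

    Assumes evaluation fires every `eval_every_n_epochs` epochs; this is an
    illustrative estimate, not a platform-defined formula.
    """
    steps_per_epoch = math.ceil(train_images / batch_size)
    return steps_per_epoch * eval_every_n_epochs

# e.g. 4,000 training images at batch size 8, evaluating every epoch:
# 500 steps must complete before the first metrics appear.
```

If the estimated step count is large relative to your GPU's throughput, a long "Pending first metrics" stage is expected rather than a sign of trouble.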

Evaluate your model

Once metrics are available, the cold start view is replaced by three evaluation tabs.

See Evaluate a Model for a walkthrough of all three tabs.

Handle training errors

When a run fails or hits a problem like an out-of-memory (OOM) error, the run status badge at the top of the page updates to reflect the error state (for example, Out of Memory or Failed). Click the status badge to expand an error card that shows the message "The training failed with the following error(s)", along with timestamps and a description of what went wrong.

The Training Progress tracker below the error card shows which cold start stage the run reached before it failed. Use this to narrow down the cause. A failure at the dataset preprocessing stage points to a data or annotation issue, while a failure after the instance is ready usually means a model configuration or resource problem.

Common errors and fixes:

  • Out of memory: The training process tried to allocate more CPU or GPU memory than is available. Select a larger GPU type, increase the GPU count, or reduce the batch size.
  • Dataset preprocessing failure: Annotation format issues or an empty train/validation split. Check your dataset configuration and re-upload if needed.
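The "reduce the batch size" fix for OOM errors can be automated in a launch script: retry with a halved batch size until the run fits. A minimal sketch, where `train_fn` is a placeholder for whatever launches your run and `MemoryError` stands in for the platform's out-of-memory failure:

```python
def run_with_backoff(train_fn, batch_size: int, min_batch_size: int = 1):
    """Retry a training launch, halving the batch size after each OOM.

    `train_fn` is hypothetical: any callable that takes a batch size and
    raises MemoryError when the configuration does not fit on the GPU.
    """
    while batch_size >= min_batch_size:
        try:
            return train_fn(batch_size)
        except MemoryError:
            # Halve and retry; below min_batch_size we give up instead of looping.
            batch_size //= 2
    raise RuntimeError("ran out of memory even at the minimum batch size")
```

Halving is a reasonable default because memory use scales roughly linearly with batch size; if even the minimum fails, a larger GPU type is the remaining option.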

When to kill a run

Not every run is worth finishing. Kill a run early if:

  • Loss has stopped decreasing for many consecutive epochs
  • You see NaN values in the loss output
  • Out-of-memory errors repeat across multiple steps
  • You realize the configuration is wrong (wrong dataset, wrong model size)

Killing a run frees your GPU immediately and stops compute credit consumption. See Kill a run for the steps.

Run status reference

Status         Meaning                          What to do
Queued         Waiting for GPU resources        Wait; kill if no longer needed
Starting       Allocating hardware              Wait up to 10 minutes; contact support if longer
Running        Actively training                Monitor metrics; kill if errors appear
Completed      Training finished successfully   Evaluate the model
Failed         Stopped due to an error          Check logs; fix configuration; start a new run
Killed         Manually stopped                 Delete the run or start a fresh one
Out of Memory  GPU ran out of memory            Reduce batch size or select a larger GPU type
Out of Quota   Compute Credits exhausted        Refill credits; run resumes from last checkpoint
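When scripting against run statuses, it helps to group the table above by what each status implies. The grouping below is illustrative: the strings follow the table, not a guaranteed SDK enum.

```python
# Statuses from the reference table, grouped by what they imply for automation.
# These string values mirror the table above and are assumptions, not a
# documented SDK enum.
ACTIVE = {"Queued", "Starting", "Running"}
RECOVERABLE = {"Out of Quota"}  # resumes from the last checkpoint after a refill
TERMINAL = {"Completed", "Failed", "Killed", "Out of Memory"}

def needs_attention(status: str) -> bool:
    """True when the run has stopped (or stalled) and requires a user decision."""
    return status in TERMINAL or status in RECOVERABLE
```

"Out of Quota" sits in its own bucket because, per the table, the run resumes from its last checkpoint once credits are refilled, unlike the truly terminal states.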

Do this with the Vi SDK

import vi

# Authenticate with your organization's credentials
client = vi.Client(
    secret_key="your-secret-key",
    organization_id="your-organization-id"
)

# Fetch the run and print its most recent status condition
run = client.runs.get("your-run-id")
if run.status.conditions:
    latest = run.status.conditions[-1]
    print(f"Status: {latest.condition.value}")
    print(f"Message: {latest.message}")

For more details, see the full SDK reference.
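The snippet above takes a single reading; for unattended monitoring you can poll until the run stops. This sketch uses only the `client.runs.get()` call and `run.status.conditions` shape shown above; the terminal status strings and poll interval are assumptions, not documented SDK values.

```python
import time

# Assumed terminal condition values; check the SDK reference for the real enum.
TERMINAL = {"Completed", "Failed", "Killed", "Out of Memory"}

def wait_for_run(client, run_id: str, poll_seconds: float = 60):
    """Poll a run until its latest status condition is terminal.

    Uses only the calls demonstrated in the snippet above; returns the
    final condition object so the caller can read .condition.value and .message.
    """
    while True:
        run = client.runs.get(run_id)
        if run.status.conditions:
            latest = run.status.conditions[-1]
            if latest.condition.value in TERMINAL:
                return latest
        time.sleep(poll_seconds)
```

A 60-second interval keeps API traffic low while still catching a failure within a minute of the status badge updating.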

Next steps

Kill A Run

Stop a training session that has errors or an incorrect configuration.

Delete A Run

Remove completed, failed, or killed runs from your project.

Evaluate Your Model

Analyze performance metrics after a run completes successfully.