Training Logs

Review detailed training logs, debug errors, and troubleshoot failed runs.

Before You Start
  • A training run in any state (logs update in real time, so you can monitor a run while it is still training)
  • Access to the training project that contains the run

New runs go through a cold start period (dataset preprocessing, instance startup, and pending first metrics) before log output begins.

The Logs tab in Datature Vi displays the complete training output for a run. It records step-level loss values, epoch markers, evaluation checkpoints, and error messages. Use it to verify that training is progressing normally or to diagnose why a run failed.

What the Logs tab shows

In Datature Vi, open any training run and click the Logs tab. The tab displays the complete training output in chronological order.

1. Open your training project

Go to Training in the sidebar and click the project containing the run you want to review.


After a successful run, the Logs tab ends with a saver event at the final step and epoch. No error messages appear.

Understanding log output

The log displays entries in chronological order. Each entry is prefixed with a timestamp and a label that identifies its type. Here is an example of a typical log sequence:

[1/16/2026, 4:18:37 PM] epoch: Epoch 0
[1/16/2026, 4:18:43 PM] trainingStep: Step 0, Loss: 1.894945740699768
[1/16/2026, 4:21:08 PM] evaluationExtension: {"extension":"evaluation_preview","step":0,"epoch":0}
[1/16/2026, 4:21:08 PM] evaluationStep: Step 0, Loss: 1.8899050951004028
[1/16/2026, 4:22:20 PM] saver: Step 0, Epoch 0, Status: saved

Each entry belongs to one of five categories:

Epoch markers (epoch) indicate when the model begins a new pass through the training data. If your dataset has 1,000 images and your batch size is 8, one epoch takes 125 steps.
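The steps-per-epoch arithmetic can be checked with a one-liner. A minimal sketch, assuming a partial final batch still counts as a step:

```python
import math

def steps_per_epoch(num_images: int, batch_size: int) -> int:
    """Training steps in one full pass over the dataset."""
    return math.ceil(num_images / batch_size)

print(steps_per_epoch(1000, 8))  # 125
```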

Training steps (trainingStep) are the most frequent entries. Each line shows a step number and a loss value. The loss value tells you how far the model's predictions are from the ground truth at that step. It should decrease over time.

Evaluation extensions (evaluationExtension) log a JSON object when an evaluation checkpoint begins. The object includes the step number, epoch number, and the extension type (such as evaluation_preview).

Evaluation steps (evaluationStep) appear at the interval you configured in training settings. Each entry records the validation loss at that checkpoint. These are the values that appear on the Metrics tab.

Saver events (saver) confirm that a model checkpoint was written to storage. Each entry includes the step number, epoch number, and a status of saved. If a run is killed or fails after a saver event, the checkpoint up to that point is still available.
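Because every entry follows the same shape, the log is easy to parse programmatically. A minimal sketch, assuming the exact `[timestamp] label: body` layout shown in the sample above:

```python
import re

# Matches entries like "[1/16/2026, 4:18:43 PM] trainingStep: Step 0, Loss: 1.89"
LOG_PATTERN = re.compile(r"^\[(?P<timestamp>[^\]]+)\] (?P<label>\w+): (?P<body>.*)$")

# The five entry categories described above
KNOWN_LABELS = {"epoch", "trainingStep", "evaluationExtension", "evaluationStep", "saver"}

def parse_entry(line: str) -> dict:
    """Split a log line into timestamp, label, and body."""
    match = LOG_PATTERN.match(line)
    if not match or match["label"] not in KNOWN_LABELS:
        raise ValueError(f"Unrecognized log entry: {line!r}")
    return match.groupdict()

entry = parse_entry("[1/16/2026, 4:18:43 PM] trainingStep: Step 0, Loss: 1.894945740699768")
print(entry["label"])  # trainingStep
```

Filtering on the `label` group lets you, for example, pull out only the `trainingStep` loss values for plotting.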

Troubleshoot using the logs

When a run fails or behaves unexpectedly, the Logs tab gives you the information to diagnose what went wrong.

Look at the end of the log output for these patterns:

  • CUDA out of memory: GPU memory is exhausted. Reduce batch size or switch to a larger GPU.
  • RuntimeError: Expected tensor for argument: Data format issue. Check your dataset annotations.
  • ValueError: Invalid annotation format: Annotation data does not match the expected schema.

If the log is empty, the run likely failed before training started. Check the run's status indicator for Out of Memory or look in the training progress section for the stage that failed (Dataset Ready, Instance Ready, or Training Running).

Compare trainingStep timestamps to check how long each step takes:

  • 1–2 seconds per step: Normal for most configurations
  • 5–10 seconds per step: Slow; investigate GPU utilization or data loading
  • More than 10 seconds per step: Likely a misconfiguration or resource constraint
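The timing check above can be scripted from the timestamps. A sketch, assuming the `M/D/YYYY, H:MM:SS AM/PM` timestamp format shown in the sample log (the thresholds mirror the guidance above; values in the gaps between the listed ranges are binned conservatively):

```python
from datetime import datetime

TS_FORMAT = "%m/%d/%Y, %I:%M:%S %p"  # timestamp style used in the sample log

def seconds_per_step(timestamps: list[str]) -> list[float]:
    """Elapsed seconds between consecutive trainingStep entries."""
    parsed = [datetime.strptime(ts, TS_FORMAT) for ts in timestamps]
    return [(b - a).total_seconds() for a, b in zip(parsed, parsed[1:])]

def classify(avg_seconds: float) -> str:
    if avg_seconds <= 2:
        return "normal"
    if avg_seconds <= 10:
        return "slow: check GPU utilization or data loading"
    return "likely misconfiguration or resource constraint"

deltas = seconds_per_step(["1/16/2026, 4:18:43 PM", "1/16/2026, 4:18:45 PM"])
print(classify(sum(deltas) / len(deltas)))  # normal
```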

Frequent checkpoint saves also slow training. Adjust checkpoint frequency if checkpoints are happening too often.

Check the log for these patterns:

  • Unusually high initial loss (above 10.0): May indicate data scaling issues
  • Loss: nan: Training collapsed; reduce the learning rate
  • Loss unchanged across many steps: Learning rate is too low or there are data issues
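These three loss patterns can be flagged automatically. A sketch, assuming losses are collected in order from the `trainingStep` entries (the 10.0 threshold and 100-step plateau window come from the guidance above):

```python
import math

def diagnose_loss(losses: list[float], plateau_steps: int = 100) -> str:
    """Flag the loss patterns worth investigating."""
    if any(math.isnan(loss) for loss in losses):
        return "nan loss: training collapsed, reduce the learning rate"
    if losses[0] > 10.0:
        return "high initial loss: check for data scaling issues"
    recent = losses[-plateau_steps:]
    if len(recent) >= plateau_steps and max(recent) - min(recent) < 1e-6:
        return "loss plateau: learning rate too low or data issues"
    return "no obvious loss problems"

print(diagnose_loss([3.0, 2.5, 2.0]))  # no obvious loss problems
```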

Diagnostic steps:

1. Verify the initial loss is between 2.0 and 6.0.
2. Confirm loss decreases within the first 100 steps.
3. Check for sudden spikes or nan values.
4. Review hyperparameter settings.

Step 1: Check the run status

On the Runs page, note the status:

  • Out of Memory: Follow the OOM steps below.
  • Failed: Continue to step 2.
  • Killed: You or a team member stopped the run manually.

Step 2: View error details

Hover over Additional Errors on the run detail page to see the error JSON. The error object contains four fields: condition (the execution event, such as "LatticeExecutionFinished"), status (the outcome, such as "FailedReach"), reason (the root cause, such as "OutOfGpuMemory"), and lastTransitionTime (a Unix timestamp). Start with the reason field when diagnosing a failure.
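For reference, here is how the four fields fit together. The field values below are taken from the examples in the text; the lastTransitionTime value is an illustrative placeholder:

```python
import json

# Example error object with the four fields described above
error_json = """{
  "condition": "LatticeExecutionFinished",
  "status": "FailedReach",
  "reason": "OutOfGpuMemory",
  "lastTransitionTime": 1768580300
}"""

error = json.loads(error_json)
print(f"Root cause: {error['reason']}")  # Root cause: OutOfGpuMemory
```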

Step 3: Check which training stage failed

  • Dataset Ready: Dataset or annotation errors
  • Instance Ready: GPU or hardware issues
  • Training Running: Training configuration or runtime errors

Step 4: Read the logs

Open the Logs tab and scroll to the end. Match the error to a fix:


| Error | Fix |
| --- | --- |
| `CUDA out of memory` | Reduce batch size by 50% |
| Out of Quota | Refill compute credits; training resumes automatically |
| Dataset errors | Validate dataset; remove corrupted files |
| Configuration errors | Check system prompt syntax |
| Connection timeout | Retry the run (usually resolves automatically) |
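The error-to-fix mapping can be applied to the log tail with a simple substring lookup. A sketch; the substring keys come from the error patterns in this guide, and exact wording in real logs may differ:

```python
# Maps an error substring from the log tail to the suggested fix
# (mirrors the troubleshooting table above)
ERROR_FIXES = {
    "CUDA out of memory": "Reduce batch size by 50%",
    "Out of Quota": "Refill compute credits; training resumes automatically",
    "Invalid annotation format": "Validate dataset; remove corrupted files",
    "Connection timeout": "Retry the run (usually resolves automatically)",
}

def suggest_fix(log_tail: str) -> str:
    """Return the suggested fix for the first known error pattern found."""
    for pattern, fix in ERROR_FIXES.items():
        if pattern in log_tail:
            return fix
    return "No known pattern: copy the full error and contact support"

print(suggest_fix("RuntimeError: CUDA out of memory"))  # Reduce batch size by 50%
```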

Step 5: Retry with adjustments

  1. Create a new run with the adjusted configuration.
  2. Monitor the first few minutes to verify it passes the previous failure point.

If you're still stuck, copy the full error message from the Logs tab and contact support with your run details (model, GPU, batch size).

Do this with the Vi SDK

import vi

# Authenticate with your organization credentials
client = vi.Client(
    secret_key="your-secret-key",
    organization_id="your-organization-id"
)

# Fetch the run and inspect its most recent status condition
run = client.runs.get("your-run-id")
if run.status.conditions:
    latest = run.status.conditions[-1]
    print(f"Status: {latest.condition.value}")
    print(f"Message: {latest.message}")

For more details, see the full SDK reference.

Next steps

View Metrics

Analyze loss curves and evaluation metrics from your completed run.

Inspect Predictions

Compare ground truth annotations with model predictions across checkpoints.

Monitor A Run

Understand run statuses and track training progress in real time.