Logs

Review detailed training logs, debug errors, and troubleshoot failed runs.

The Logs tab displays complete training output including system messages, training steps, loss values, and evaluation results. Use logs to debug issues, track progress, and troubleshoot errors.

📋 Access logs

Open any training run and click the Logs tab to view detailed training output and error messages.


Log content

Training logs include:

Log Type | Example | Purpose
Training steps | trainingStep: Step 1169, Loss: 0.84435061686215401 | Monitor progress and loss per step
Epochs | epoch: Epoch 98 | Track training rounds through dataset
Evaluation | evaluationExtension: {"extension":"evaluation_preview","step":1200,"epoch":100} | Checkpoint evaluation triggers
Saver events | saver: Step 1200, Epoch 100, Status: saved | Model checkpoint saves
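
These line formats are stable enough to parse for offline analysis once you download the logs (see Download logs below). A minimal sketch in Python, assuming a downloaded log saved as training.log (the file name is a placeholder):

import re

# Matches lines like: "trainingStep: Step 1169, Loss: 0.84435061686215401"
STEP_PATTERN = re.compile(r"trainingStep: Step (\d+), Loss: ([\d.]+|nan)")

def parse_losses(log_path):
    """Return a list of (step, loss) tuples from a downloaded log file."""
    losses = []
    with open(log_path) as f:
        for line in f:
            match = STEP_PATTERN.search(line)
            if match:
                losses.append((int(match.group(1)), float(match.group(2))))
    return losses

# Example: print the last few recorded losses
losses = parse_losses("training.log")
print(losses[-5:])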

Use logs for troubleshooting

Training won't start or crashes

Check logs for:

  • Out of memory errors — GPU memory exhausted; reduce batch size or use larger GPU
  • Dataset errors — Missing or corrupted files; verify dataset integrity
  • Configuration errors — Invalid hyperparameter combinations; check model settings

Common error patterns:

CUDA out of memory
RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long
ValueError: Invalid annotation format
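
If you download the log, you can scan for these signatures programmatically instead of scrolling. A minimal sketch; the hint strings are paraphrased from the guidance above and the file name is a placeholder:

# Map known error signatures (from the patterns above) to likely causes.
ERROR_HINTS = {
    "CUDA out of memory": "GPU memory exhausted; reduce batch size or use a larger GPU",
    "Expected tensor for argument": "Data format issue; check dataset annotations",
    "Invalid annotation format": "Annotation problem; validate the dataset",
}

def find_errors(log_path):
    """Print any line that matches a known error signature, with a hint."""
    with open(log_path) as f:
        for number, line in enumerate(f, start=1):
            for signature, hint in ERROR_HINTS.items():
                if signature in line:
                    print(f"line {number}: {line.strip()}\n  -> {hint}")

find_errors("training.log")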

Training is very slow

Check logs for:

  • Long step duration — Compare trainingStep timestamps; slow steps indicate bottlenecks
  • Frequent checkpointing — Checkpoints slow training; adjust checkpoint frequency
  • Data loading delays — Large images or slow storage; consider preprocessing or faster storage

Typical step times:

  • 1-2 seconds/step — Normal for most configurations
  • 5-10 seconds/step — Slow; investigate GPU utilization or data loading
  • >10 seconds/step — Very slow; likely misconfiguration or resource constraints
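
To compare trainingStep timestamps without eyeballing them, you can compute per-step durations from a downloaded log. A minimal sketch, assuming each line is prefixed with an ISO-8601 timestamp (the actual prefix format in your logs may differ):

import re
from datetime import datetime

# Assumed line format: "2025-01-15T10:32:07.123Z trainingStep: Step 1169, Loss: 0.8443"
LINE_PATTERN = re.compile(r"^(\S+) .*trainingStep: Step (\d+)")

def average_step_seconds(log_path):
    """Estimate the mean seconds per training step from timestamped log lines."""
    timestamps = []
    with open(log_path) as f:
        for line in f:
            match = LINE_PATTERN.match(line)
            if match:
                stamp = match.group(1).replace("Z", "+00:00")
                timestamps.append(datetime.fromisoformat(stamp))
    deltas = [(b - a).total_seconds() for a, b in zip(timestamps, timestamps[1:])]
    return sum(deltas) / len(deltas) if deltas else None

seconds = average_step_seconds("training.log")
if seconds is not None and seconds > 10:
    print(f"{seconds:.1f} s/step is very slow; check GPU utilization and data loading")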

Loss isn't decreasing

Check logs for:

  • Initial loss value — Unusually high (>10.0) may indicate data scaling issues
  • NaN loss — Loss: nan means training collapsed; reduce learning rate
  • Consistent loss — No change across many steps; learning rate too low or data issues

Diagnostic steps:

  1. Verify loss starts at reasonable value (2.0-6.0)
  2. Confirm loss decreases within first 100 steps
  3. Check for sudden spikes or NaN values
  4. Review hyperparameter settings
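
The first three diagnostic steps can be checked automatically against parsed (step, loss) pairs, for example from the parsing sketch earlier on this page. A minimal sketch using the thresholds quoted above:

import math

def diagnose_losses(losses):
    """Run the first three diagnostic checks on a list of (step, loss) pairs."""
    first_loss = losses[0][1]
    if not 2.0 <= first_loss <= 6.0:
        print(f"Initial loss {first_loss:.2f} is outside the typical 2.0-6.0 range")

    early = [loss for step, loss in losses if step <= 100]
    if early and min(early) >= first_loss:
        print("Loss did not decrease within the first 100 steps")

    if any(math.isnan(loss) for _, loss in losses):
        print("NaN loss detected; training collapsed, reduce the learning rate")

# Illustrative values only
diagnose_losses([(1, 4.2), (50, 3.1), (100, 2.8), (150, float("nan"))])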

💡 Download logs

To save logs for external analysis or sharing with support:

  1. Click the three dots (•••) in the top-right corner of the Logs tab
  2. Select Download logs
  3. Logs are saved as a .txt file with a timestamp

Compare multiple runs

Systematic comparison helps identify which configuration changes improve performance.

Comparison workflow

Compare metrics across runs

Create a comparison table:

Run | Model | Learning Rate | Final Loss | Bbox F1 | BLEU
Run 1 | Qwen2.5-VL 2B | 3e-4 | 0.82 | 0.78 | 0.45
Run 2 | Qwen2.5-VL 7B | 3e-4 | 0.64 | 0.85 | 0.52
Run 3 | Qwen2.5-VL 7B | 1e-4 | 0.58 | 0.88 | 0.56

Analysis:

  • Run 1 → Run 2: Larger model improves all metrics
  • Run 2 → Run 3: Lower learning rate further improves convergence

Conclusion: Qwen2.5-VL 7B with learning rate 1e-4 is the optimal configuration.
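
If you'd rather keep the comparison in code than in a spreadsheet, a few lines are enough to record runs and rank them by a validation metric. The values below are the same as in the table above:

runs = [
    {"run": "Run 1", "model": "Qwen2.5-VL 2B", "lr": 3e-4, "final_loss": 0.82, "bbox_f1": 0.78, "bleu": 0.45},
    {"run": "Run 2", "model": "Qwen2.5-VL 7B", "lr": 3e-4, "final_loss": 0.64, "bbox_f1": 0.85, "bleu": 0.52},
    {"run": "Run 3", "model": "Qwen2.5-VL 7B", "lr": 1e-4, "final_loss": 0.58, "bbox_f1": 0.88, "bleu": 0.56},
]

# Rank runs by validation Bbox F1 (highest first)
for r in sorted(runs, key=lambda r: r["bbox_f1"], reverse=True):
    print(f"{r['run']}: {r['model']} @ lr={r['lr']} -> F1 {r['bbox_f1']}, BLEU {r['bleu']}")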

Compare visual predictions

For each configuration:

  1. Open Advanced Evaluation for the run
  2. Navigate to the same evaluation specimen across runs
  3. Compare predictions side-by-side (manually or via screenshots)

What to compare:

  • Bounding box quality — Tightness, completeness, false positives
  • Text generation — Accuracy, completeness, formatting
  • Consistency — Performance across different image types
  • Edge case handling — Behavior on difficult examples

Document findings: Note which configuration handles specific scenarios better.

Systematic experimentation

Best practice: Change one variable at a time

Bad approach:

  • Run 1: Model A, learning rate 3e-4, batch size 8, epochs 3
  • Run 2: Model B, learning rate 1e-4, batch size 16, epochs 5

Result: impossible to determine which change caused the improvement

Good approach:

  • Run 1 (baseline): Model A, learning rate 3e-4, batch size 8, epochs 3
  • Run 2: Model B, learning rate 3e-4, batch size 8, epochs 3 (only model changed)
  • Run 3: Model B, learning rate 1e-4, batch size 8, epochs 3 (only learning rate changed)
  • Run 4: Model B, learning rate 1e-4, batch size 16, epochs 3 (only batch size changed)

Result: clear cause-and-effect relationships
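
One way to enforce the one-variable-at-a-time rule is to derive every run from the previous one, overriding a single field at a time. A minimal sketch (the configuration keys are illustrative, not the platform's API):

baseline = {"model": "Model A", "learning_rate": 3e-4, "batch_size": 8, "epochs": 3}

# Each step changes exactly one field relative to the previous run.
overrides = [
    {"model": "Model B"},       # Run 2: only the model changes
    {"learning_rate": 1e-4},    # Run 3: only the learning rate changes
    {"batch_size": 16},         # Run 4: only the batch size changes
]

config = dict(baseline)
print("Run 1 (baseline):", config)
for i, change in enumerate(overrides, start=2):
    assert len(change) == 1, "change one variable at a time"
    config = {**config, **change}
    print(f"Run {i}:", config)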


Understanding overfitting

Overfitting occurs when a model memorizes training data instead of learning general patterns. The model performs well on training data but poorly on new, unseen images.

Signs of overfitting

Loss value indicators

Training loss vs. validation loss:

Healthy training:

Both training and validation values decrease together. Validation stays close to training loss throughout the entire training process, indicating the model is learning generalizable patterns rather than memorizing.

⚠️ Problematic training (overfitting):

Key indicator: Training loss continues decreasing while validation loss plateaus or starts increasing. The gap between the two values widens over time.

What's happening:

  • Training loss keeps improving but validation loss stops improving or gets worse
  • Model is memorizing training data instead of learning generalizable patterns
  • Occurs with extended training, overly complex models, or insufficient data

Solutions: Stop training earlier (use early stopping), reduce model complexity, add regularization, or increase dataset size.
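
If you export per-checkpoint training and validation losses, the widening gap is easy to flag programmatically. A minimal sketch with an illustrative threshold; tune it to your own runs:

def overfitting_gap(train_losses, val_losses, threshold=0.2):
    """Return checkpoints where validation loss exceeds training loss by more than `threshold`."""
    flagged = []
    for step, (train, val) in enumerate(zip(train_losses, val_losses), start=1):
        if val - train > threshold:
            flagged.append((step, train, val))
    return flagged

# Training loss keeps falling while validation loss turns back up: classic overfitting.
train = [2.0, 1.2, 0.8, 0.5, 0.3]
val = [2.1, 1.3, 1.0, 1.1, 1.3]
print(overfitting_gap(train, val))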

Metric degradation

Check evaluation metrics over checkpoints:

⚠️ Overfitting signs:

  • Metrics improve on training set but degrade on validation set
  • Metrics peak at early checkpoint, then decline
  • Large gap between train and validation performance

Example:

Checkpoint | Train F1 | Validation F1 | Status
Step 500 | 0.75 | 0.72 | ✅ Healthy
Step 1000 | 0.85 | 0.80 | ✅ Healthy
Step 1500 | 0.92 | 0.78 | ⚠️ Overfitting
Step 2000 | 0.97 | 0.74 | ❌ Severe overfitting

Visual prediction issues

In Advanced Evaluation:

⚠️ Overfitting behaviors:

  • Model performs perfectly on training images but poorly on validation images
  • Predictions are overly specific to training examples (e.g., only detects objects in exact poses seen during training)
  • Model fails on slight variations of training examples
  • Performance degrades on images with different lighting, angles, or contexts than training data

Prevent and fix overfitting

Add more training data

Most effective solution:

  • Increase dataset size (aim for 2× current size)
  • Add diverse examples covering different:
    • Lighting conditions
    • Object orientations and poses
    • Backgrounds and contexts
    • Image quality and resolutions

Learn about dataset preparation →

Train for fewer epochs

Early stopping:

  1. Monitor validation metrics during training
  2. Identify checkpoint where validation performance peaks
  3. Stop training at or shortly after peak
  4. Use that checkpoint for deployment
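
Steps 2 and 3 reduce to picking the checkpoint with the best validation metric. A minimal sketch using the validation F1 values from the metric-degradation example earlier on this page:

# (checkpoint step, validation F1) pairs, e.g. collected from evaluation logs
checkpoints = [(500, 0.72), (1000, 0.80), (1500, 0.78), (2000, 0.74)]

best_step, best_f1 = max(checkpoints, key=lambda c: c[1])
print(f"Deploy the checkpoint from step {best_step} (validation F1 {best_f1})")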

Automatic early stopping:

Configure checkpoint frequency to save more checkpoints for granular stopping.

Adjust hyperparameters

Reduce model capacity:

  • Use smaller model architecture (e.g., 2B instead of 7B)
  • Lower learning rate
  • Reduce training epochs

Increase regularization:

  • Add dropout (if supported by architecture)
  • Use stronger weight decay

Configure model settings →

Improve data quality

Quality over quantity:

  • Fix annotation errors and inconsistencies
  • Remove duplicate or near-duplicate images
  • Balance class distribution
  • Add hard negatives (challenging examples)
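
Exact duplicates can be caught locally before upload by hashing file contents; near-duplicates need a perceptual hash (for example the third-party imagehash library), which is beyond this sketch. The folder path and .jpg extension below are assumptions:

import hashlib
from pathlib import Path

def find_exact_duplicates(folder):
    """Group byte-identical image files by their content hash."""
    seen = {}
    duplicates = []
    for path in sorted(Path(folder).glob("*.jpg")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if digest in seen:
            duplicates.append((path, seen[digest]))
        else:
            seen[digest] = path
    return duplicates

for dup, original in find_exact_duplicates("dataset/images"):
    print(f"{dup} duplicates {original}")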

Dataset management guide →



Troubleshooting

My training run failed. How do I troubleshoot?

Step-by-step troubleshooting:

1. Check the run status

Look at the status indicator on the Runs page.

2. View error details

Hover over "Additional Errors" to see error JSON:

{
  "condition": "LatticeExecutionFinished",
  "status": "FailedReach",
  "reason": "OutOfGpuMemory",
  "lastTransitionTime": "1764320567481"
}

Key field: reason tells you the root cause.
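
If you triage many runs, the reason field can be mapped straight to the quick fixes listed later in this section. A minimal sketch; only OutOfGpuMemory appears in the example above, so the other reason strings are assumptions that may not match your environment:

import json

# Map failure reasons to the quick fixes listed in this guide.
# Only "OutOfGpuMemory" is taken from the example above; other keys are assumed.
QUICK_FIXES = {
    "OutOfGpuMemory": "Reduce batch size by 50% or use a larger GPU",
    "OutOfQuota": "Refill Compute Credits; training resumes automatically",
    "DatasetError": "Validate the dataset and remove corrupted files",
}

error = json.loads("""
{
  "condition": "LatticeExecutionFinished",
  "status": "FailedReach",
  "reason": "OutOfGpuMemory",
  "lastTransitionTime": "1764320567481"
}
""")

print(QUICK_FIXES.get(error["reason"], "Unknown reason; check the Logs tab"))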

3. Check training progress

Look at which stage failed:

  • Dataset Ready → Dataset or annotation errors
  • Instance Ready → GPU or hardware issues
  • Training Running → Training configuration or runtime errors

4. Review logs

Open the Logs tab and scroll to the end:

Look for these error patterns:

RuntimeError: CUDA out of memory
→ GPU memory exhausted; reduce batch size or use larger GPU

ValueError: Expected tensor for argument
→ Data format issue; check dataset annotations

FileNotFoundError: [Errno 2] No such file
→ Missing dataset files; verify uploads

RuntimeError: CUDA error: device-side assert triggered
→ Invalid operation; check hyperparameters

5. Common fixes by error type

Error | Quick Fix
Out of Memory | Reduce batch size by 50%
Out of Quota | Refill Compute Credits; training resumes automatically
Dataset errors | Validate dataset; remove corrupted files
Configuration errors | Check system prompt syntax
Connection timeout | Retry training (usually resolves automatically)

6. Retry with adjustments

  1. Create new run with fixed configuration
  2. Monitor closely during first few minutes
  3. Verify run progresses past previous failure point

Still stuck?

  • Copy complete error message from Logs tab
  • Note your configuration (model, GPU, batch size)
  • Contact support with run details

Related resources