Monitor a Run
Track real-time training progress, watch for errors, and review metrics as your model trains.
Monitor your active training runs in real-time to track progress, identify issues early, and ensure training proceeds as expected. Real-time monitoring helps you catch configuration errors before consuming excessive Compute Credits.
Prerequisites
Before monitoring a run, you need:
- An active training run in progress (Running, Queued, or Starting status)
- Alternatively, a completed run to review historical progress
Access run monitoring
Open the monitoring view to track your training session:
1. Navigate to the Training section from the sidebar
2. Click on your training project to open it
3. Click the Runs tab to view all training runs
4. Click on a running or completed run to open its monitoring page
The monitoring page displays real-time progress, metrics, and logs as training proceeds.
Stay on page for real-time updates
The monitoring page auto-refreshes to show current progress. You can safely navigate away—training continues in the background—but you'll need to return to see updated progress.
Review run configuration
View the workflow configuration used for any training run to understand its settings:
1. While viewing a run's monitoring page, click the three-dot menu (⋮) in the top right
2. Select View Workflow from the dropdown menu
3. The workflow configuration page opens, showing:
- Model architecture and parameters
- Dataset configuration and splits
- System prompt used for training
- Training settings including epochs and batch size
Why this is useful:
- Verify configuration — Ensure the run used correct settings
- Reproduce results — See exact settings used for successful runs
- Compare runs — Understand differences between run configurations
- Debug issues — Identify configuration problems causing failures
- Document experiments — Record settings for successful training
Quick access to workflow details
This is especially helpful when reviewing completed runs or investigating why different runs produced different results. You can instantly see all configuration details without searching through workflow lists.
Understand run statuses
Every training run has a status indicating its current state. Understanding these statuses helps you identify issues and take appropriate action.
Finished
Meaning: Training completed successfully through all configured epochs.
What you can do:
- View metrics to assess model performance
- Compare predictions against ground truth
- Download the trained model for deployment
- Start a new run to iterate on configuration
Appearance: Green status indicator
Running
Meaning: Training is currently in progress.
What you can do:
- Monitor training progress in real-time (see sections below)
- View partial metrics as they become available
- Kill the run if needed
- Wait for completion before accessing full evaluation results
Note: Metrics and evaluation results update periodically as training progresses.
Queued
Meaning: Training is waiting for GPU resources to become available.
Common reasons:
- All GPUs currently in use by other runs
- Waiting for previous run to release resources
- System scheduling priorities
What you can do:
- Wait for GPUs to become available (usually minutes)
- Training will start automatically when resources are ready
- Kill queued run if no longer needed
- Check resource usage for organization-wide GPU allocation
Typical wait times:
- 1-5 minutes: Normal during peak usage
- 5-15 minutes: High demand; consider scheduling runs during off-peak hours
- >15 minutes: Unusual; contact support if queue time exceeds 30 minutes
Starting
Meaning: Training is allocating resources and preparing the environment.
What's happening:
- GPU resources being allocated
- Training environment being set up
- Base model weights being loaded
- Infrastructure initialization in progress
Typical duration: 2-5 minutes
What to watch for: If stuck in "Starting" for more than 10 minutes, there may be resource allocation issues.
Out of Memory
Meaning: Training failed because the GPU ran out of memory (CUDA OOM error).
Common causes:
- Batch size too large for available GPU memory
- Model architecture too large for selected GPU
- Image resolution too high
- Insufficient GPU memory for model + data combination
How to fix:
1. Reduce batch size in model settings
   - Try cutting the current batch size in half
   - Example: 16 → 8, or 8 → 4
2. Select a larger GPU in hardware configuration
   - Upgrade from T4 (16 GB) → A10G (24 GB)
   - Or A10G (24 GB) → A100 (40 GB or 80 GB)
3. Use a smaller model architecture
   - Switch from 7B → 2B parameters
   - Compare model sizes
4. Reduce image resolution (if configurable)
   - Smaller images require less GPU memory
   - Check dataset configuration
Troubleshooting checklist:
- Check Logs tab for exact OOM error message
- Note current batch size and GPU type
- Calculate the GPU memory requirement (model size + batch size × image size); a rough estimation sketch follows this checklist
- Adjust one parameter and retry
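The memory estimate in the checklist above is easiest to reason about with concrete numbers. The sketch below is a rough back-of-the-envelope heuristic (the multipliers are illustrative assumptions, not platform figures) for sanity-checking whether a model size, batch size, and GPU pairing is plausible:

```python
# Rough GPU memory estimate for a fine-tuning run. The multipliers are
# illustrative rules of thumb, not exact figures for any specific VLM.

def estimate_gpu_memory_gb(params_billions: float,
                           batch_size: int,
                           image_megapixels: float,
                           bytes_per_param: int = 2) -> float:
    """Very rough upper-bound estimate of required GPU memory in GB."""
    # Model weights, e.g. 7B parameters at fp16/bf16 is roughly 14 GB.
    weights_gb = params_billions * bytes_per_param
    # Gradients and optimizer state often add ~2x the weight size for full
    # fine-tuning; parameter-efficient methods need considerably less.
    training_overhead_gb = weights_gb * 2
    # Activations grow with batch size and image resolution; 0.5 GB per
    # image-megapixel is a deliberately generous placeholder.
    activations_gb = batch_size * image_megapixels * 0.5
    return weights_gb + training_overhead_gb + activations_gb

# Example: 7B model, batch size 8, ~1 MP images.
print(f"~{estimate_gpu_memory_gb(7, 8, 1.0):.0f} GB needed")  # ~46 GB
```

If the estimate comes out well above your GPU's memory (T4 16 GB, A10G 24 GB, A100 40 or 80 GB), change one parameter (batch size, model size, or GPU) and retry.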
Failed
Meaning: Training encountered an error and stopped unexpectedly.
Common causes:
- Dataset errors (corrupted images, missing annotations)
- Configuration errors (invalid hyperparameters)
- Infrastructure issues (network problems, hardware failures)
How to diagnose:
- Check Training Progress section for error details
- Look for "Additional Errors" indicator with error information
- Review Logs tab for complete error traceback
- Identify error type and follow specific troubleshooting steps
Common error patterns:
- CUDA errors: GPU driver or hardware issue — Retry training; contact support if persists
- Invalid annotation errors: Corrupted or malformed annotations — Validate dataset; fix or remove invalid data
- File not found errors: Missing dataset files — Re-upload missing files; verify dataset integrity
- Connection timeouts: Network or storage issue — Retry training; issue is usually temporary
Next steps:
- Review error details in Training Progress section
- Check Logs tab for full error message
- Apply fix based on error type
- Create new run with corrected configuration
- Contact support if error persists after troubleshooting
Killed
Meaning: Training was manually stopped by user before completion.
When this happens:
- User clicked Kill run to stop training
- Training stopped immediately
- Partial progress is saved for review
What you can do:
- View partial results up to the point training was killed
- Review metrics available for completed checkpoints
- Create new run to restart with different settings
- Delete the killed run to clean up project
Note: Killed runs still consume Compute Credits for time used before stopping.
Out of Quota
Meaning: Training stopped because Compute Credits were depleted.
What happened:
- Your organization ran out of Compute Credits during training
- Training paused automatically at the last checkpoint
- No progress is lost
How to resume:
- Refill Compute Credits for your organization
- Training resumes automatically from the last saved checkpoint once credits are available
- No need to reconfigure or restart the run
What you can do:
- View partial results up to the point credits ran out
- Check resource usage to see credit consumption
- Add more Compute Credits to resume training
- Training will continue from where it stopped (no wasted progress)
Note: Runs with "Out of Quota" status don't consume additional credits until resumed.
View error details
When a run shows "Additional Errors," you can view detailed error information:
- Look for the red error indicator next to the run status
- Hover over "Additional Errors" to see error details in a tooltip
- Error information includes:
- condition — Error type (e.g., "LatticeExecutionFinished")
- status — Status code (e.g., "FailedReach")
- reason — Primary error cause (e.g., "OutOfGpuMemory")
- lastTransitionTime — When the error occurred (Unix timestamp in milliseconds)
Example error details:
```json
{
  "condition": "LatticeExecutionFinished",
  "status": "FailedReach",
  "reason": "OutOfGpuMemory",
  "lastTransitionTime": 1764320567481
}
```
This indicates the run failed due to GPU memory exhaustion.
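Because lastTransitionTime is reported in milliseconds, converting it can help when correlating an error with entries in the Logs tab. A minimal Python snippet:

```python
from datetime import datetime, timezone

# lastTransitionTime is a Unix timestamp in milliseconds, so divide by
# 1000 before converting it to a UTC datetime.
last_transition_ms = 1764320567481
occurred_at = datetime.fromtimestamp(last_transition_ms / 1000, tz=timezone.utc)
print(occurred_at.isoformat())  # the UTC time the error occurred
```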
Training progress stages
Training runs progress through distinct preparation stages before active model training begins. Monitor these stages to ensure training starts correctly.
Stage 1: Preprocessing dataset
The platform prepares your dataset for training:
- Loads annotations from your configured dataset
- Applies dataset splits according to train/validation/test ratios
- Validates data format to ensure compatibility with the model
- Preprocesses images as needed for model architecture
Typical duration: 1-5 minutes depending on dataset size
What to watch for:
- Progress should advance within a few minutes
- Extended delays may indicate dataset access issues
Stage 2: Spinning up instance
The platform allocates GPU resources and prepares the training environment:
- Allocates GPU hardware based on model architecture requirements
- Sets up training environment with required libraries and dependencies
- Loads base model weights from the selected VLM architecture
- Initializes training infrastructure for distributed training if needed
Typical duration: 2-5 minutes
What to watch for:
- Status should move to "Instance Ready" within several minutes
- If stuck here, GPU resources may be fully allocated (run will queue)
Stage 3: Pending first metrics
Training begins and processes the first epoch:
- Starts model training on your annotated data
- Processes first training epoch through all training images
- Generates initial metrics including loss and validation scores
- Updates progress indicators with training statistics
Typical duration: Varies by dataset size and model architecture (5-30 minutes for first epoch)
What to watch for:
- First metrics should appear after completing one epoch
- Training time estimate updates as first epoch progresses
- Logs show detailed training progress
Active training
Once first metrics appear, training proceeds through all configured epochs:
- Loss charts update in real-time as training progresses
- Evaluation metrics refresh after each validation cycle
- Progress percentage increases toward completion
- Estimated time remaining updates based on actual training speed
Training is progressing normally
When you see metrics updating regularly and loss decreasing over time, training is proceeding as expected. You can safely navigate away and return later to check progress. After completion, evaluate your model for detailed performance analysis.
Progress stage indicators
The Training Progress section visually shows your run's journey through different stages:
| Stage | Icon | Meaning | What happens |
|---|---|---|---|
| Dataset Ready | 📊 | Dataset validation complete | Files verified, annotations loaded |
| Instance Ready | ⚙️ | GPU resources allocated | Hardware provisioned, environment set up |
| Training Running | 📈 | Model training in progress | Actively training through epochs |
| Training Failed | ❌ | Error occurred during training | See status for error type |
Status colors:
- Green (✓) — Stage completed successfully
- Red (×) — Error occurred at this stage
- Gray — Stage not yet reached
- Animated — Currently in progress
Quick troubleshooting tip
If training fails at "Instance Ready," the issue is usually GPU-related (Out of Memory). If it fails at "Dataset Ready," check your dataset for errors.
View training metrics
The Metrics tab displays real-time training progress and model performance data as training proceeds. For comprehensive analysis of completed runs, see Evaluate a Model.
Loss charts
Track how well the model is learning:
- Total Loss — Overall training loss decreasing over time
- Training Loss — Loss on training dataset (orange line)
- Validation Loss — Loss on validation dataset (blue line) helps detect overfitting
Healthy training pattern
Good indicators:
- Loss decreases steadily over epochs
- Training and validation loss remain close (not diverging significantly)
- Loss curve smooths out as training progresses
- Both curves trending downward together
Underfitting or problematic training
Critical issues:
- Loss not decreasing after initial epochs
- Loss staying flat across many epochs
- Model not learning from the data
- May indicate learning rate too low or configuration problems
Evaluation metrics
Review task-specific performance indicators:
For object detection:
- Bounding Box F1 Score — Balance of precision and recall (see the worked example below)
- BLEU — Text generation quality (for VQA tasks)
- BERTScore Recall — Semantic similarity to ground truth
For other tasks:
- Metrics vary based on your task type and model architecture
- All metrics update after each validation cycle
Learn more about training metrics →
Learn how to evaluate completed models →
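To make "balance of precision and recall" concrete: F1 is the harmonic mean of the two, so a weakness in either metric pulls the score down more than a simple average would. A small, generic illustration (not platform-specific code):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A detector that finds most objects (high recall) but produces many false
# positives (low precision) is penalised more than a simple average suggests:
print(f1_score(0.90, 0.50))  # ~0.64
print(f1_score(0.70, 0.70))  # 0.70
```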
Understanding metric trends
Good indicators:
- Metrics improve consistently over epochs
- Validation metrics track training metrics closely
- Metrics stabilize near end of training
Warning indicators:
- Metrics not improving or degrading
- Large gap between training and validation metrics
- Erratic metric fluctuations without clear trend
Once training completes, use advanced evaluation tools to analyze model performance in depth.
Identify training errors
Watch for error indicators that signal training issues requiring intervention.
Error notifications
When training encounters errors:
- Status changes to "Failed" in red
- Error count displayed next to status (e.g., "1 Additional Errors")
- Training progress indicator shows where failure occurred
- Error messages appear in training logs
Common error types
Out of memory (OOM) errors
Symptoms:
- Training fails during early epochs
- Error messages mention "CUDA out of memory" or similar
Causes:
- Model architecture too large for allocated GPU
- Batch size too high
- Images in dataset too large
Solutions:
- Reduce batch size in training settings
- Select smaller model variant (e.g., 3B instead of 7B parameters)
- Resize images in your dataset to lower resolution
- Contact support if issue persists
Dataset errors
Symptoms:
- Training fails in "Preprocessing Dataset" stage
- Error messages mention data loading or format issues
Causes:
- Dataset not properly configured
- Missing or corrupted annotations
- Incompatible annotation format for model type
- Empty dataset splits
Solutions:
- Verify dataset has annotations in correct format
- Check dataset splits have sufficient data (minimum 10 images per split)
- Review dataset selection in workflow configuration
- Validate annotations are properly formatted
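If a run repeatedly fails during dataset preprocessing, it can save Compute Credits to sanity-check the data locally before re-uploading. The sketch below is a hypothetical pre-check that assumes a COCO-style annotations.json next to an images/ folder; adapt the paths and keys to your actual dataset layout:

```python
import json
from pathlib import Path

# Hypothetical local pre-check: assumes a COCO-style annotations.json and an
# images/ folder. Adjust paths and keys to match your actual dataset layout.
dataset_dir = Path("my_dataset")
coco = json.loads((dataset_dir / "annotations.json").read_text())

annotated_ids = {a["image_id"] for a in coco.get("annotations", [])}
missing_files, unannotated = [], []
for img in coco.get("images", []):
    if not (dataset_dir / "images" / img["file_name"]).exists():
        missing_files.append(img["file_name"])
    if img["id"] not in annotated_ids:
        unannotated.append(img["file_name"])

print(f"{len(missing_files)} missing files, {len(unannotated)} images without annotations")
# Also confirm each split keeps enough data: with a 70/20/10 split, the 10%
# test split of N images should still contain at least ~10 images.
```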
Configuration errors
Symptoms:
- Training fails immediately or during instance setup
- Error messages mention invalid parameters or settings
Causes:
- Invalid training parameters
- Incompatible system prompt for model
- Mismatched model and task type
Solutions:
- Review all workflow settings for typos or invalid values
- Check system prompt follows model-specific format requirements
- Verify model architecture supports your task type
- Start from default settings and adjust incrementally
Infrastructure errors
Symptoms:
- Training fails unexpectedly during active training
- Error messages mention system or hardware issues
Causes:
- Temporary infrastructure issues
- GPU hardware failures
- Network connectivity problems
Solutions:
- Kill the failed run
- Start a new run with same configuration (often succeeds)
- Contact support if failures persist
Review detailed logs
Access comprehensive training logs for detailed debugging:
1. Open the run monitoring page
2. Click the Logs tab next to Metrics
3. Review timestamped log entries showing:
   - Initialization and setup steps
   - Training progress per epoch
   - Loss values and metric calculations
   - Warning and error messages
   - Resource usage information
Learn more about training logs →
Monitor resource usage
Track Compute Credits and GPU time as training progresses:
Displayed information:
- Training Time — Elapsed time since run started
- GPU Type — Allocated hardware (e.g., "1 x Nvidia A10G")
- Estimated remaining time — Predicted time to completion
Credit consumption:
- Credits consumed based on actual GPU time used
- Training time updates in real-time as run progresses
- Longer training consumes more credits
Kill runs early to save credits
If you notice configuration errors or training not progressing correctly, kill the run immediately to avoid consuming additional Compute Credits on failed training.
When to take action
Monitor your runs and take appropriate action based on what you observe:
Let training continue when:
- ✅ Loss decreasing steadily over epochs
- ✅ Metrics improving consistently
- ✅ Training and validation metrics tracking closely
- ✅ No error messages in logs
- ✅ Progress advancing through epochs normally
Kill training immediately when:
- ❌ Configuration clearly wrong (wrong dataset, model, or settings)
- ❌ Out of memory errors appearing repeatedly
- ❌ Loss not decreasing after several epochs
- ❌ Metrics show NaN or infinite values
- ❌ Training stuck at same progress point for extended time
- ❌ Want to adjust settings and restart fresh
Review logs when:
- ⚠️ Training slower than expected
- ⚠️ Warnings appearing in progress indicators
- ⚠️ Metrics behaving unexpectedly
- ⚠️ Want to understand training behavior in detail
Best practices for monitoring
Monitor closely during initial stages
First 10-15 minutes are critical for catching configuration issues:
- Stay on monitoring page during preprocessing and instance setup
- Watch for error messages as stages progress
- Verify first metrics appear after first epoch
- Check loss curves start with reasonable values
- Confirm training time estimate aligns with expectations
Early detection prevents wasting Compute Credits on misconfigured training.
Check in periodically during long training
For training runs lasting several hours:
- Check progress every 1-2 hours during initial phases
- Review loss curves for expected decreasing trend
- Monitor metrics for consistent improvement
- Watch for warnings or anomalies in logs
- Verify estimated completion time remains reasonable
Long-running training can develop issues mid-session; periodic checks help catch problems early.
Watch resource consumption
Keep track of Compute Credit usage:
- Note training time estimates at run start
- Compare actual vs. estimated time as training progresses
- Kill early if training significantly slower than expected
- Monitor GPU utilization in logs to ensure efficient resource use
Understanding credit consumption helps optimize future training runs.
Document successful configurations
When training succeeds:
- Note all workflow settings used for successful run
- Record training time and resource requirements
- Evaluate model performance using comprehensive metrics
- Save final metrics for comparison with future runs
- Document any issues encountered and solutions
Documentation helps replicate successful training and troubleshoot future issues.
Use logs for troubleshooting
When training behavior seems unusual:
- Open detailed logs to see full training output
- Search for warning messages that explain behavior
- Check resource utilization for bottlenecks
- Compare with successful runs if available
- Copy relevant log sections when contacting support
Logs provide detailed context for understanding training issues.
Common questions
Do I need to stay on the monitoring page?
No. Training continues in the background even if you navigate away or close the browser.
However, monitoring is recommended during:
- First 10-15 minutes to catch configuration errors early
- First few epochs to verify training progresses correctly
- Any time you're testing new configurations or model architectures
You can safely:
- Navigate to other pages within Datature Vi
- Close your browser and return later
- Work on other tasks while training runs
Returning to check progress:
- Click the Runs tab in your training project
- Click the active run to see current status
- Metrics and progress update automatically when you view the page
How often do metrics update?
Metric update frequency depends on training configuration:
- Loss charts: Update after each logged training step (typically every few batches)
- Evaluation metrics: Update after each validation cycle (typically after each epoch)
- Training progress: Updates continuously throughout training
Typical update intervals:
- Small datasets (< 1000 images): Every few minutes
- Large datasets (10,000+ images): Every 10-30 minutes
- Validation metrics: Once per epoch completion
Refresh the page if metrics appear stale or stop updating; the page auto-refreshes, but a manual refresh ensures you see the latest data.
What does 'Training Time' represent?
Training Time shows elapsed GPU time since the run started actively training:
Includes:
- Active model training time
- Validation and metric computation
- Checkpointing and logging overhead
Excludes:
- Queue time waiting for GPU resources
- Dataset preprocessing time
- Instance setup time
This is the time used to calculate Compute Credit consumption.
Estimated remaining time predicts total training duration based on current progress rate.
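As a rough way to reason about the remaining-time estimate and credit consumption, you can extrapolate from elapsed GPU time and training progress. The numbers below, including the credit rate, are placeholders rather than platform values:

```python
# Back-of-the-envelope only: the actual credit rate per GPU type depends on
# your plan, so CREDITS_PER_GPU_HOUR is a hypothetical placeholder.
CREDITS_PER_GPU_HOUR = 2.0   # assumption, not a real rate

elapsed_hours = 1.5          # the "Training Time" shown on the run page
progress = 0.30              # fraction of configured epochs completed

estimated_total_hours = elapsed_hours / progress
remaining_hours = estimated_total_hours - elapsed_hours
estimated_total_credits = estimated_total_hours * CREDITS_PER_GPU_HOUR

print(f"~{remaining_hours:.1f} h remaining, ~{estimated_total_credits:.0f} credits in total")
# 1.5 h at 30% progress -> ~5 h total, ~3.5 h remaining.
```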
Can I download the model while training is in progress?
No. Models can only be downloaded after training completes successfully.
During active training:
- Model checkpoints are saved automatically
- You can monitor metrics to assess progress
- You can review logs for detailed information
- You cannot download partial or in-progress models
After successful completion:
- Run status changes to "Finished"
- Trained model becomes available for download
- Evaluate model performance with detailed metrics
- You can download model files for deployment
If you kill training:
- Run status changes to "Killed"
- No trained model is available (training incomplete)
- You must start a new run to obtain a trained model
What if training is much slower than estimated?
Several factors can cause training to run slower than initial estimates:
Common causes:
- Large dataset: More images than expected increase epoch time
- Complex model: Larger VLM architectures train slower
- High resolution: Large images increase processing time
- Shared resources: Other concurrent training may impact speed
- Validation overhead: Extensive validation adds time between epochs
What to do:
- Check logs for resource utilization information
- Review dataset size and image resolutions
- Consider smaller model for faster training iterations
- Reduce batch size if memory constrained
- Let training complete if progressing correctly (just slower)
- Contact support if dramatically slower than expected
For future runs:
- Use smaller dataset for initial experiments
- Test configurations with shorter epoch counts
- Scale up gradually after validating setup
Should I worry if loss jumps around between epochs?
Some loss fluctuation is normal, but patterns matter:
Normal fluctuations:
- Small variations between epochs (±10-20% of current value)
- Occasional small increases followed by continued decrease
- Smoothing out over time as training progresses
Concerning patterns:
- Large spikes or drops (2x or more)
- Consistently increasing loss over multiple epochs
- Erratic, unpredictable fluctuations throughout training
- Loss becoming NaN (not a number) or infinite
What to do:
For normal fluctuations: Continue training and monitor overall trend
For concerning patterns:
- Check logs for error messages or warnings
- Review learning rate settings in training configuration
- Consider reducing learning rate if loss unstable
- Verify dataset quality and annotation consistency
- Kill and restart with adjusted settings if needed
General principle: Focus on overall trend across multiple epochs rather than epoch-to-epoch changes.
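If you copy per-epoch loss values out of the logs, a quick check like the sketch below can separate normal jitter from the concerning patterns above. The thresholds mirror the rules of thumb in this answer and are not exact cutoffs:

```python
import math

def flag_loss_issues(epoch_losses: list[float]) -> list[str]:
    """Flag concerning patterns in a per-epoch loss history."""
    issues = []
    for i, loss in enumerate(epoch_losses):
        if math.isnan(loss) or math.isinf(loss):
            issues.append(f"epoch {i}: loss is NaN/inf, stop and fix the configuration")
        elif i > 0 and loss > 2 * epoch_losses[i - 1]:
            issues.append(f"epoch {i}: loss more than doubled "
                          f"({epoch_losses[i - 1]:.2f} -> {loss:.2f})")
    # The overall trend matters more than single-epoch jitter.
    if len(epoch_losses) >= 4 and epoch_losses[-1] >= epoch_losses[0]:
        issues.append("loss has not decreased over the run so far")
    return issues

print(flag_loss_issues([2.1, 1.6, 1.7, 1.3, 1.1]))      # [] -- normal jitter
print(flag_loss_issues([2.1, 1.9, 4.5, float("nan")]))  # two issues flagged
```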
How do I know if my model is overfitting?
Overfitting occurs when the model memorizes training data instead of learning general patterns. Watch for these signs:
Key indicators:
- Training loss continues decreasing while validation loss increases or plateaus
- Large gap between training and validation metrics (e.g., training F1 = 0.95, validation F1 = 0.65)
- Validation metrics stop improving or degrade after initial improvement
Visual patterns in loss charts:
- Training and validation loss start together
- Both decrease initially
- Validation loss levels off or increases while training loss continues down
- Gap widens over epochs
Prevention strategies:
- Use appropriate dataset splits (typical: 70% train, 20% validation, 10% test)
- Ensure sufficient training data (more data reduces overfitting)
- Use regularization techniques (configured in training settings)
- Stop training earlier (before validation metrics degrade)
- Add more diverse training examples
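Given per-epoch training and validation losses (read from the loss charts or logs), a simple heuristic like the sketch below can flag the divergence pattern described above. Treat it as a first-pass signal and confirm the trend persists for more than one epoch before acting:

```python
def first_overfitting_epoch(train_losses: list[float],
                            val_losses: list[float]) -> int | None:
    """Return the first epoch where validation loss rises while training loss
    keeps falling (a common overfitting signal), or None if not observed."""
    for i in range(1, min(len(train_losses), len(val_losses))):
        train_still_improving = train_losses[i] < train_losses[i - 1]
        val_getting_worse = val_losses[i] > val_losses[i - 1]
        if train_still_improving and val_getting_worse:
            return i
    return None

train = [1.80, 1.20, 0.90, 0.70, 0.55, 0.45]
val   = [1.90, 1.30, 1.00, 0.95, 1.05, 1.20]
print(first_overfitting_epoch(train, val))  # 4 -- the gap starts widening here
```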
How do I recover from Out of Memory errors?
Immediate fix:
- Note your current configuration (model size, batch size, GPU type)
- Reduce batch size by 50% as first attempt
- Kill the failed run if it's still active
- Create new run with reduced batch size
If still failing:
- Upgrade to larger GPU (T4 → A10G → A100)
- Switch to smaller model architecture
- Reduce image resolution in dataset
Prevention:
Check GPU memory requirements before starting training to select appropriate hardware for your model size. Refer to your model's documentation for recommended GPU specifications.
Can I resume a Killed run?
No, killed runs cannot be resumed. When you kill a run, it stops permanently.
However:
- Saved checkpoints are preserved for review
- You can view partial results and metrics
- Create a new run with the same configuration to continue training
- Progress from checkpoints can inform your next run
Best practice: Only kill runs when certain you want to stop permanently. Use monitoring to track progress before deciding to kill.
What's the difference between Failed and Killed statuses?
| Status | Cause | Can Resume? | Action Required |
|---|---|---|---|
| Failed | System error or configuration issue | No | Fix error and create new run |
| Killed | User manually cancelled | No | Create new run if needed |
| Out of Quota | Credits depleted | Yes, automatically | Refill credits |
Failed runs indicate a problem that needs fixing. Killed runs were intentionally stopped by you. Out of Quota runs can resume automatically once you add more Compute Credits.
Why is my run queued for so long?
Common reasons:
- High demand: Many runs using GPUs simultaneously
- Large GPU request: Requesting 4× A100s takes longer than 1× T4
- Organization limits: Check if you've hit concurrent run limits
- Resource constraints: Available GPU capacity currently allocated
Actions:
- Wait 5-10 minutes (most queues resolve quickly)
- Check resource usage for organization GPU allocation
- Consider using smaller GPU configurations for faster queue times
- Schedule runs during off-peak hours if possible
- Kill the queued run if no longer needed
Contact support if queued >30 minutes without explanation.
Next steps
- Stop training runs that have errors or incorrect configuration
- Review detailed logs for debugging and troubleshooting
- Analyze completed training metrics and model performance
- Clean up completed, failed, or killed runs
Related resources
- Manage Runs — Complete run management overview
- Kill a Run — Stop active training runs
- Delete a Run — Clean up completed or killed runs
- Train a Model — Start new training runs
- Training Metrics — Understand performance indicators
- Training Logs — Detailed log analysis
- Evaluate a Model — Comprehensive model evaluation
- Resource Usage — Monitor Compute Credits
- Configure Training Settings — Optimize training parameters
Need help?
We're here to support your VLMOps journey. Reach out to the support team if you need assistance.