Manage Runs

Monitor training progress, control active runs, and manage completed training sessions in your projects.

What are training runs?

A run is a single training session where a vision-language model learns from your annotated dataset. Each time you start training, you create a new run that tracks progress, settings, and results.

Key characteristics:

  • Individual sessions — Each run trains one model configuration
  • Complete tracking — Monitors progress, metrics, and resource usage
  • Configurable — Defined by workflow settings including model architecture, dataset splits, and training parameters
  • Persistent history — Logs and metrics remain accessible after completion
  • Lifecycle management — Can be monitored, killed, or deleted based on status

Run lifecycle

Training runs progress through distinct stages:

Active run states

  • Queued — Waiting for GPU resources to become available
  • Starting — Allocating compute resources and preparing the environment
  • Running — Actively training the model on your dataset

Completed run states

  • Completed — Training finished successfully with the trained model available
  • Failed — Training stopped due to errors or configuration issues
  • Killed — Training manually stopped before completion

💡 Run status determines available actions

  • Active runs (Queued, Starting, Running) can be killed but not deleted
  • Completed runs (Completed, Failed, Killed) can be deleted but not killed
  • All runs can be monitored while active and reviewed after completion

Run management tasks

Manage runs effectively throughout their lifecycle:

Monitor training progress

Track your active training runs in real-time to ensure they're progressing correctly and identify issues early.

Learn how to monitor runs →

When to monitor:

  • During initial training setup validation
  • When training on new datasets or model architectures
  • To track estimated completion times
  • To watch for errors or anomalies

What you can see:

  • Progress through preparation and training stages
  • Loss curves and metrics as epochs complete
  • Estimated completion time
  • Errors or warnings in the training logs

Kill active runs

Stop training runs that are currently in progress when you need to free resources or abandon a session.

Learn how to kill runs →

When to kill:

  • The training configuration was incorrect
  • You want to stop early to preserve Compute Credits
  • You need the GPU resources for higher-priority training
  • You notice errors or unexpected behavior
  • You're testing workflow configurations

Important notes:

  • Killing is permanent—runs cannot be resumed
  • Compute Credits for elapsed training time are still consumed
  • Current training state is saved for review
  • Must kill runs before you can delete them

Delete completed runs

Remove completed, failed, or killed training runs to keep your project organized and focused.

Learn how to delete runs →

When to delete:

  • Failed runs after fixing configuration issues
  • Test runs used for workflow validation
  • Superseded runs after successful retraining
  • Accidental duplicate runs
  • Old experimental runs no longer needed

Important notes:

  • Deletion is permanent and cannot be undone
  • Models from successful runs should be downloaded first
  • Training metrics and logs are permanently removed
  • Runs in progress must be killed before deletion

Access your training runs

Navigate to runs within your training project:

  1. Navigate to the Training section from the sidebar

  2. Click on your training project to open it

  3. Click the Runs tab in the project navigation

  4. View all training runs with their current status

The Runs page displays:

  • Run name — Workflow name and session identifier
  • Status — Current run state (Running, Completed, Failed, Killed, Queued)
  • Start time — When training began
  • Training time — Elapsed or total duration
  • Model architecture — Base VLM used for training
  • GPU allocation — Compute resources assigned
📘 Runs are organized by workflow

Training runs are grouped under their parent workflows. Each workflow can have multiple runs representing different training attempts or experiments with the same base configuration.
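If you keep your own notes or scripts about runs, the columns of the Runs page map naturally onto a small record. The sketch below is purely illustrative; the platform does not define or expose this structure.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class RunSummary:
    """One row of the Runs page (illustrative only; not a platform API)."""
    run_name: str             # workflow name plus session identifier
    status: str               # Queued, Starting, Running, Completed, Failed, or Killed
    start_time: datetime      # when training began
    training_time: timedelta  # elapsed or total duration
    model_architecture: str   # base VLM used for training
    gpu_allocation: str       # compute resources assigned
```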


Understanding run status indicators

Run status helps you understand what actions are available:

| Status | Color | Meaning | Available Actions |
|---|---|---|---|
| Running | Blue | Training is actively progressing | Monitor, Kill |
| Queued | Gray | Waiting for GPU resources | Monitor, Kill |
| Starting | Blue | Allocating resources and preparing | Monitor, Kill |
| Completed | Green | Training finished successfully | View metrics, Download model, Delete |
| Failed | Red | Training stopped due to errors | View logs, Delete |
| Killed | Gray | Manually stopped before completion | View progress, Delete |
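These rules reduce to a single distinction: whether the run is still active. The snippet below is only an illustrative sketch (the platform does not expose such a helper); it encodes the same status-to-action mapping shown in the table.

```python
from enum import Enum

class RunStatus(Enum):
    QUEUED = "Queued"
    STARTING = "Starting"
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"
    KILLED = "Killed"

# Active runs can be killed but not deleted; completed, failed, and
# killed runs can be deleted but not killed (same rules as the table).
ACTIVE_STATES = {RunStatus.QUEUED, RunStatus.STARTING, RunStatus.RUNNING}

def allowed_actions(status: RunStatus) -> list[str]:
    if status in ACTIVE_STATES:
        return ["monitor", "kill"]
    return ["view results", "delete"]

print(allowed_actions(RunStatus.RUNNING))    # ['monitor', 'kill']
print(allowed_actions(RunStatus.COMPLETED))  # ['view results', 'delete']
```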

Best practices for run management

Monitor new configurations carefully

When training with new model architectures, datasets, or system prompts:

  1. Stay on the monitoring page during initial setup (first 5-10 minutes)
  2. Watch training progress through preparation stages
  3. Check metrics appear after first epoch completes
  4. Review loss curves for expected patterns
  5. Verify estimated completion time aligns with expectations

This helps catch configuration errors early and prevents wasting Compute Credits on failing runs.

Kill runs early if issues detected

If you notice problems during monitoring:

  • Configuration errors — Wrong dataset split, incorrect model size
  • Unexpected metrics — Loss not decreasing, NaN values appearing
  • Resource issues — Out of memory errors, extremely slow progress
  • Wrong workflow — Accidentally started wrong training configuration

Kill immediately to preserve Compute Credits rather than letting a failing run continue to completion.
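As a rough illustration of the "unexpected metrics" signals above, a simple loss-curve sanity check might look like the sketch below. This is a hypothetical helper, not something the platform provides; you would still make the kill decision yourself from the monitoring page.

```python
import math

def training_looks_unhealthy(recent_losses: list[float], window: int = 5) -> bool:
    """Illustrative heuristic only: flag a run for closer review when the loss
    contains NaN values or has not decreased over the last `window` epochs."""
    if any(math.isnan(loss) for loss in recent_losses):
        return True
    if len(recent_losses) > window:
        return recent_losses[-1] >= recent_losses[-window - 1]
    return False

# Loss has plateaued over the last five epochs -> worth investigating
print(training_looks_unhealthy([2.1, 1.4, 1.10, 1.12, 1.11, 1.13, 1.12, 1.14]))  # True
```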

Download models before cleanup

Before deleting runs:

  1. Identify successful runs with acceptable metrics
  2. Download trained models you want to keep
  3. Export evaluation results if needed for documentation
  4. Save training configurations for reproducibility
  5. Then delete runs to keep project organized

Remember: Deletion is permanent—models cannot be recovered after run deletion.

Clean up regularly

Maintain organized training projects:

  • Weekly: Delete obvious test runs and failed experiments
  • After milestones: Clean up superseded runs after successful training
  • Before major work: Archive or delete old runs to reduce clutter
  • Keep history: Preserve successful runs that produced deployed models

Regular cleanup makes it easier to find important runs and understand project history.

Use descriptive workflow names

Workflow names appear in run listings, so use clear, descriptive names:

Good examples:

  • PCB-Detection-Qwen-Small
  • Defect-Classification-NVLM-Baseline
  • Quality-VQA-InternVL-Production

Poor examples:

  • Workflow 1
  • Test
  • New Training

Descriptive names make it easier to identify runs in listings and understand project history.

Check logs for failed runs

When runs fail, always:

  1. View detailed logs to understand failure reason
  2. Check error messages in the training progress section
  3. Review configuration settings that might cause issues
  4. Fix underlying problems before starting new runs
  5. Delete failed runs after documenting issues

Understanding failures prevents repeating configuration mistakes.


Common questions

How many runs can I have active at once?

The number of simultaneous active runs depends on your organization plan and available Compute Credits.

Typical limits:

  • GPU availability — Limited by total GPU resources in your plan
  • Concurrent training — Usually 1-5 active runs depending on plan tier
  • Queued runs — No limit; runs wait for GPU resources to become available

Runs queue automatically when GPU resources are unavailable. They start when resources free up from completed runs.

View your resource usage →

Can I pause and resume a training run?

No. Training runs cannot be paused and resumed. Your options are:

  • Let training complete — Continue to successful completion
  • Kill the run — Stop permanently (cannot resume)
  • Start a new run — Create fresh training session with same or modified settings

Workaround for long training:

If you need to stop and continue later:

  1. Configure workflows with shorter epoch counts
  2. Train incrementally across multiple runs
  3. Use checkpointing features (if available for your model architecture)
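A minimal sketch of the incremental idea, assuming you control the epoch count configured for each run (the numbers are made up for illustration):

```python
# Illustrative only: splitting a 30-epoch budget into three shorter runs.
# Whether a later run can continue from an earlier one depends on
# checkpointing support for your model architecture (see item 3 above).
total_epochs = 30
sessions = 3
epochs_per_run = total_epochs // sessions
print([epochs_per_run] * sessions)  # [10, 10, 10]
```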

What happens to Compute Credits if I kill a run?

Compute Credits are consumed for actual GPU time used, regardless of whether training completes:

  • Credits consumed: Time from run start until killed
  • Credits saved: Remaining training time that would have been used
  • No refunds: Credits for elapsed time cannot be recovered

Example:

  • Training estimated: 60 minutes (10 credits)
  • Killed after: 15 minutes
  • Credits consumed: ~2.5 credits
  • Credits saved: ~7.5 credits
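Worked out explicitly, this is just proportional billing on elapsed time (the arithmetic from the example above, not a billing API):

```python
estimated_minutes = 60
estimated_credits = 10
elapsed_minutes = 15  # point at which the run was killed

credits_per_minute = estimated_credits / estimated_minutes   # ~0.167 credits/min
credits_consumed = elapsed_minutes * credits_per_minute      # 2.5
credits_saved = estimated_credits - credits_consumed         # 7.5
print(f"consumed ~{credits_consumed:.1f} credits, saved ~{credits_saved:.1f} credits")
```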

Best practice: Monitor runs closely during initial stages to catch issues early and minimize wasted credits.

Learn about resource usage →

Why can't I delete a run in progress?

This safeguard prevents accidental data loss and ensures training completes cleanly:

  • Prevents data corruption — Ensures training logs and metrics are properly saved
  • Avoids wasted resources — Forces intentional decision to stop training
  • Maintains data integrity — Training must reach a stopped state before removal

To delete an active run:

  1. Kill the run first (stops training immediately)
  2. Wait for status to change to "Killed"
  3. Delete the run after it's stopped

The two-step process ensures you intentionally stop training before permanent deletion.
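If you automate this flow with your own tooling, the ordering is the important part. The sketch below is hypothetical: the `client` object and its methods are placeholders, since no API is documented here.

```python
import time

def kill_then_delete(client, run_id: str, poll_seconds: int = 10) -> None:
    """Hypothetical helper: `client` and its methods are placeholders.
    The point is the ordering: kill, wait for 'Killed', then delete."""
    client.kill_run(run_id)                           # 1. stop training immediately
    while client.get_run_status(run_id) != "Killed":  # 2. wait for the status change
        time.sleep(poll_seconds)
    client.delete_run(run_id)                         # 3. delete only after it has stopped
```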

Can I restart a failed or killed run?

No. Training runs cannot be restarted. Instead:

For failed runs:

  1. Review logs to identify failure cause
  2. Fix configuration issues (dataset, model settings, prompts)
  3. Start a new run from the corrected workflow

For killed runs:

  1. Decide if you need to continue training
  2. If yes, start a new run with same or modified settings
  3. Delete the old run to clean up

Training state is not preserved between runs. Each run starts fresh from your configured workflow settings.

How long are run logs and metrics stored?

Training logs and metrics are stored permanently until you delete the run:

  • Active runs: Logs and metrics update in real-time
  • Completed runs: All data remains accessible indefinitely
  • Failed runs: Error logs and partial metrics remain available
  • Killed runs: Progress up to kill point is preserved

Storage includes:

  • Training and error logs
  • Progress and evaluation metrics
  • Run configuration and timing details

After deletion: All run data is permanently removed and cannot be recovered.

Best practice: Download important models and export metrics before deleting runs.

Can I rename a training run?

Training runs inherit their names from their parent workflow, and individual runs cannot be renamed.

To improve run identification:

  1. Rename the workflow — Updates future run names
  2. Use descriptive workflow names — Helps identify runs in listings
  3. Add run notes — Document run purpose or configuration changes (if available)

Run identification:

  • Runs display workflow name plus session ID
  • Started time helps distinguish multiple runs from same workflow
  • Status and metrics provide additional context
