Manage Runs

Monitor training progress, control active runs, and manage completed training sessions in your projects.

What are training runs?

A run is a single training session where a vision-language model learns from your annotated dataset. Each time you start training, you create a new run that tracks progress, settings, and results.

Key characteristics:

  • Individual sessions — Each run trains one model configuration
  • Complete tracking — Monitors progress, metrics, and resource usage
  • Configurable — Defined by workflow settings including model architecture, dataset splits, and training parameters
  • Persistent history — Logs and metrics remain accessible after completion
  • Lifecycle management — Can be monitored, killed, or deleted based on status

Run lifecycle

Training runs progress through distinct stages:

Active run states

  • Queued — Waiting for GPU resources to become available
  • Starting — Allocating compute resources and preparing the environment
  • Running — Actively training the model on your dataset

Completed run states

  • Completed — Training finished successfully with the trained model available
  • Failed — Training stopped due to errors or configuration issues
  • Killed — Training manually stopped before completion

💡 Run status determines available actions

  • Active runs (Queued, Starting, Running) can be killed but not deleted
  • Completed runs (Completed, Failed, Killed) can be deleted but not killed
  • All runs can be monitored while active and reviewed after completion

Run management tasks

Manage runs effectively throughout their lifecycle:

Monitor training progress

Track your active training runs in real-time to ensure they're progressing correctly and identify issues early.

Learn how to monitor runs →

When to monitor:

  • During initial training setup validation
  • When training on new datasets or model architectures
  • To track estimated completion times
  • To watch for errors or anomalies

What you can see:

  • Progress through preparation and training stages
  • Loss curves and metrics as epochs complete
  • Estimated completion time
  • Errors or warnings in the training logs

Kill active runs

Stop training runs that are currently in progress when you need to free resources or abandon a session.

Learn how to kill runs →

When to kill:

  • The training configuration was incorrect
  • You want to stop early to preserve Compute Credits
  • You need the GPU resources for higher-priority training
  • You notice errors or unexpected behavior
  • You're testing workflow configurations

Important notes:

  • Killing is permanent—runs cannot be resumed
  • Compute Credits for elapsed training time are still consumed
  • Current training state is saved for review
  • Must kill runs before you can delete them

Delete completed runs

Remove completed, failed, or killed training runs to keep your project organized and focused.

Learn how to delete runs →

When to delete:

  • Failed runs after fixing configuration issues
  • Test runs used for workflow validation
  • Superseded runs after successful retraining
  • Accidental duplicate runs
  • Old experimental runs no longer needed

Important notes:

  • Deletion is permanent and cannot be undone
  • Models from successful runs should be downloaded first
  • Training metrics and logs are permanently removed
  • Runs in progress must be killed before deletion

Access your training runs

Navigate to runs within your training project:

  1. Navigate to the Training section from the sidebar

  2. Click on your training project to open it

  3. Click the Runs tab in the project navigation

  4. View all training runs with their current status

The Runs page displays:

  • Run name — Workflow name and session identifier
  • Status — Current run state (Running, Completed, Failed, Killed, Queued)
  • Start time — When training began
  • Training time — Elapsed or total duration
  • Model architecture — Base VLM used for training
  • GPU allocation — Compute resources assigned
📘 Runs are organized by workflow

Training runs are grouped under their parent workflows. Each workflow can have multiple runs representing different training attempts or experiments with the same base configuration.
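If you keep your own notes or scripts about runs, the columns of the Runs page map naturally onto a small record. The sketch below is purely illustrative; the platform does not define or expose this structure.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class RunSummary:
    """One row of the Runs page (illustrative only; not a platform API)."""
    run_name: str             # workflow name plus session identifier
    status: str               # Queued, Starting, Running, Completed, Failed, or Killed
    start_time: datetime      # when training began
    training_time: timedelta  # elapsed or total duration
    model_architecture: str   # base VLM used for training
    gpu_allocation: str       # compute resources assigned
```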


Understanding run status indicators

Run status helps you understand what actions are available:

| Status | Color | Meaning | Available Actions |
|---|---|---|---|
| Running | Blue | Training is actively progressing | Monitor, Kill |
| Queued | Gray | Waiting for GPU resources | Monitor, Kill |
| Starting | Blue | Allocating resources and preparing | Monitor, Kill |
| Completed | Green | Training finished successfully | View metrics, Download model, Delete |
| Failed | Red | Training stopped due to errors | View logs, Delete |
| Killed | Gray | Manually stopped before completion | View progress, Delete |
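These rules reduce to a single distinction: whether the run is still active. The snippet below is only an illustrative sketch (the platform does not expose such a helper); it encodes the same status-to-action mapping shown in the table.

```python
from enum import Enum

class RunStatus(Enum):
    QUEUED = "Queued"
    STARTING = "Starting"
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"
    KILLED = "Killed"

# Active runs can be killed but not deleted; completed, failed, and
# killed runs can be deleted but not killed (same rules as the table).
ACTIVE_STATES = {RunStatus.QUEUED, RunStatus.STARTING, RunStatus.RUNNING}

def allowed_actions(status: RunStatus) -> list[str]:
    if status in ACTIVE_STATES:
        return ["monitor", "kill"]
    return ["view results", "delete"]

print(allowed_actions(RunStatus.RUNNING))    # ['monitor', 'kill']
print(allowed_actions(RunStatus.COMPLETED))  # ['view results', 'delete']
```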

Best practices for run management

Monitor new configurations carefully

When training with new model architectures, datasets, or system prompts:

  1. Stay on the monitoring page during initial setup (first 5-10 minutes)
  2. Watch training progress through preparation stages
  3. Check metrics appear after first epoch completes
  4. Review loss curves for expected patterns
  5. Verify estimated completion time aligns with expectations

This helps catch configuration errors early and prevents wasting Compute Credits on failing runs.

Kill runs early if issues detected

If you notice problems during monitoring:

  • Configuration errors — Wrong dataset split, incorrect model size
  • Unexpected metrics — Loss not decreasing, NaN values appearing
  • Resource issues — Out of memory errors, extremely slow progress
  • Wrong workflow — Accidentally started wrong training configuration

Kill immediately to preserve Compute Credits rather than letting a failing run continue to completion.
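As a rough illustration of the "unexpected metrics" signals above, a simple loss-curve sanity check might look like the sketch below. This is a hypothetical helper, not something the platform provides; you would still make the kill decision yourself from the monitoring page.

```python
import math

def training_looks_unhealthy(recent_losses: list[float], window: int = 5) -> bool:
    """Illustrative heuristic only: flag a run for closer review when the loss
    contains NaN values or has not decreased over the last `window` epochs."""
    if any(math.isnan(loss) for loss in recent_losses):
        return True
    if len(recent_losses) > window:
        return recent_losses[-1] >= recent_losses[-window - 1]
    return False

# Loss has plateaued over the last five epochs -> worth investigating
print(training_looks_unhealthy([2.1, 1.4, 1.10, 1.12, 1.11, 1.13, 1.12, 1.14]))  # True
```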

Download models before cleanup

Before deleting runs:

  1. Identify successful runs with acceptable metrics
  2. Download trained models you want to keep
  3. Export evaluation results if needed for documentation
  4. Save training configurations for reproducibility
  5. Then delete runs to keep project organized

Remember: Deletion is permanent—models cannot be recovered after run deletion.

Clean up regularly

Maintain organized training projects:

  • Weekly: Delete obvious test runs and failed experiments
  • After milestones: Clean up superseded runs after successful training
  • Before major work: Archive or delete old runs to reduce clutter
  • Keep history: Preserve successful runs that produced deployed models

Regular cleanup makes it easier to find important runs and understand project history.

Use descriptive workflow names

Workflow names appear in run listings, so use clear, descriptive names:

Good examples:

  • PCB-Detection-Qwen-Small
  • Defect-Classification-NVLM-Baseline
  • Quality-VQA-InternVL-Production

Poor examples:

  • Workflow 1
  • Test
  • New Training

Descriptive names make it easier to identify runs in listings and understand project history.

Check logs for failed runs

When runs fail, always:

  1. View detailed logs to understand failure reason
  2. Check error messages in the training progress section
  3. Review configuration settings that might cause issues
  4. Fix underlying problems before starting new runs
  5. Delete failed runs after documenting issues

Understanding failures prevents repeating configuration mistakes.


Common questions

How many runs can I have active at once?

The number of simultaneous active runs depends on your organization plan and available Compute Credits.

Typical limits:

  • GPU availability — Limited by total GPU resources in your plan
  • Concurrent training — Usually 1-5 active runs depending on plan tier
  • Queued runs — No limit; runs wait for GPU resources to become available

Runs queue automatically when GPU resources are unavailable. They start when resources free up from completed runs.

View your resource usage →

Can I pause and resume a training run?

No. Training runs cannot be paused and resumed. Your options are:

  • Let training complete — Continue to successful completion
  • Kill the run — Stop permanently (cannot resume)
  • Start a new run — Create fresh training session with same or modified settings

Workaround for long training:

If you need to stop and continue later:

  1. Configure workflows with shorter epoch counts
  2. Train incrementally across multiple runs
  3. Use checkpointing features (if available for your model architecture)
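A minimal sketch of the incremental idea, assuming you control the epoch count configured for each run (the numbers are made up for illustration):

```python
# Illustrative only: splitting a 30-epoch budget into three shorter runs.
# Whether a later run can continue from an earlier one depends on
# checkpointing support for your model architecture (see item 3 above).
total_epochs = 30
sessions = 3
epochs_per_run = total_epochs // sessions
print([epochs_per_run] * sessions)  # [10, 10, 10]
```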

What happens to Compute Credits if I kill a run?

Compute Credits are consumed for actual GPU time used, regardless of whether training completes:

  • Credits consumed: Time from run start until killed
  • Credits saved: Remaining training time that would have been used
  • No refunds: Credits for elapsed time cannot be recovered

Example:

  • Training estimated: 60 minutes (10 credits)
  • Killed after: 15 minutes
  • Credits consumed: ~2.5 credits
  • Credits saved: ~7.5 credits
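Worked out explicitly, this is just proportional billing on elapsed time (the arithmetic from the example above, not a billing API):

```python
estimated_minutes = 60
estimated_credits = 10
elapsed_minutes = 15  # point at which the run was killed

credits_per_minute = estimated_credits / estimated_minutes   # ~0.167 credits/min
credits_consumed = elapsed_minutes * credits_per_minute      # 2.5
credits_saved = estimated_credits - credits_consumed         # 7.5
print(f"consumed ~{credits_consumed:.1f} credits, saved ~{credits_saved:.1f} credits")
```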

Best practice: Monitor runs closely during initial stages to catch issues early and minimize wasted credits.

Learn about resource usage →

Why can't I delete a run in progress?

This safeguard prevents accidental data loss and ensures training completes cleanly:

  • Prevents data corruption — Ensures training logs and metrics are properly saved
  • Avoids wasted resources — Forces intentional decision to stop training
  • Maintains data integrity — Training must reach a stopped state before removal

To delete an active run:

  1. Kill the run first (stops training immediately)
  2. Wait for status to change to "Killed"
  3. Delete the run after it's stopped

The two-step process ensures you intentionally stop training before permanent deletion.
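If you automate this flow with your own tooling, the ordering is the important part. The sketch below is hypothetical: the `client` object and its methods are placeholders, since no API is documented here.

```python
import time

def kill_then_delete(client, run_id: str, poll_seconds: int = 10) -> None:
    """Hypothetical helper: `client` and its methods are placeholders.
    The point is the ordering: kill, wait for 'Killed', then delete."""
    client.kill_run(run_id)                           # 1. stop training immediately
    while client.get_run_status(run_id) != "Killed":  # 2. wait for the status change
        time.sleep(poll_seconds)
    client.delete_run(run_id)                         # 3. delete only after it has stopped
```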

Can I restart a failed or killed run?

No. Training runs cannot be restarted. Instead:

For failed runs:

  1. Review logs to identify failure cause
  2. Fix configuration issues (dataset, model settings, prompts)
  3. Start a new run from the corrected workflow

For killed runs:

  1. Decide if you need to continue training
  2. If yes, start a new run with same or modified settings
  3. Delete the old run to clean up

Training state is not preserved between runs. Each run starts fresh from your configured workflow settings.

How long are run logs and metrics stored?

Training logs and metrics are stored permanently until you delete the run:

  • Active runs: Logs and metrics update in real-time
  • Completed runs: All data remains accessible indefinitely
  • Failed runs: Error logs and partial metrics remain available
  • Killed runs: Progress up to kill point is preserved

Storage includes:

  • Training and error logs
  • Progress and evaluation metrics
  • Run configuration and timing details

After deletion: All run data is permanently removed and cannot be recovered.

Best practice: Download important models and export metrics before deleting runs.

Can I rename a training run?

Training runs inherit their names from their parent workflow, and individual runs cannot be renamed.

To improve run identification:

  1. Rename the workflow — Updates future run names
  2. Use descriptive workflow names — Helps identify runs in listings
  3. Add run notes — Document run purpose or configuration changes (if available)

Run identification:

  • Runs display workflow name plus session ID
  • Started time helps distinguish multiple runs from same workflow
  • Status and metrics provide additional context
