Kill a Run

Stop active training runs before completion to free resources or abandon incorrect configurations.

Kill active training runs to stop them immediately when you need to free GPU resources, fix configuration errors, or abandon a training session. Killing is permanent—training cannot be resumed from where it stopped.

⚠️

Killing is permanent

  • Cannot resume — Killed runs cannot be restarted from their current progress
  • Credits consumed — Compute Credits for elapsed training time are not refunded
  • State saved — Training progress up to kill point is preserved for review
  • Must kill before delete — Active runs must be killed before you can delete them

When to kill a training run

Consider killing active training runs in these situations:

Configuration errors

  • Wrong dataset selected — Training on incorrect data
  • Incorrect model architecture — Need different VLM size or type
  • Invalid training parameters — Batch size, learning rate, or other settings wrong
  • Wrong system prompt — Prompt doesn't match intended task

Best practice: Kill immediately to avoid wasting Compute Credits on misconfigured training.

Training issues

  • Out of memory errors — GPU cannot handle current configuration
  • Loss not decreasing — Model not learning after several epochs
  • Unexpected behavior — Metrics showing anomalous values (NaN, infinite)
  • Training stuck — Progress not advancing for extended period

Best practice: Monitor runs closely during early stages to catch issues quickly.

Resource management

  • Need GPU for higher priority work — Free resources for important training
  • Testing workflow configurations — Quick experiments to validate settings
  • Wrong region — Want to train in a different geographic region
  • Cost optimization — Preserve credits when training isn't critical

Workflow changes

  • Better dataset available — Want to train on improved or expanded data
  • Model update needed — Newer base model released
  • Adjusted training strategy — Realized a better approach mid-training

💡

Kill early to save credits

The sooner you kill a misconfigured run, the more Compute Credits you preserve. Monitoring runs during the first 10-15 minutes helps catch issues before significant credit consumption.


Kill an active run

Stop a training run that is currently running, queued, or starting:

Access run actions

  1. Navigate to the Training section from the sidebar

  2. Click on your training project to open it

  3. Click the Runs tab to view all training runs

  4. Locate the active run you want to stop (status: Running, Queued, or Starting)

  5. Click the three-dot menu (⋮) next to the run

  6. Select Kill Run from the dropdown menu

Confirm kill action

A confirmation dialog appears warning: "This action will permanently kill [Run Name]. Are you sure you want to kill this training?"

  1. Review the warning carefully—killed runs cannot be resumed

  2. Click the red Confirm button to kill the run immediately

  3. The run status changes to "Killed" and training stops

Training stopped successfully

Your run is now killed and GPU resources are freed. You can review its logs and metrics, start a new run with a corrected configuration, or delete the killed run.


What happens when you kill a run

Killing a training run triggers the following sequence:

Immediate effects

Training stops:

  • GPU processing halts immediately
  • No further training epochs execute
  • Model training progress freezes at current point

Resources freed:

  • GPU allocation released
  • Queued runs can start if waiting for resources
  • System capacity becomes available

Status updates:

  • Run status changes from "Running" to "Killed"
  • Timestamps record when run was killed
  • Progress indicators show final state before kill

Data preservation

Saved for review:

  • Training progress up to kill point remains visible
  • Metrics and loss charts show progress before stopping
  • Logs contain full training output until kill
  • Configuration settings from workflow remain accessible

Not available:

  • Trained model — Incomplete training produces no usable model
  • Final metrics — Evaluation incomplete
  • Continued training — Cannot resume from killed state

Credit consumption

Compute Credits charged:

  • All GPU time from run start until killed
  • Validation cycles that completed before kill
  • Overhead time for setup and processing

Credits not charged:

  • Remaining estimated training time
  • Queue time waiting for resources
  • Time after kill action

💰

Example credit calculation

Scenario:

  • Estimated total training: 60 minutes
  • Killed after: 12 minutes of active training
  • Estimated credits: 10 credits total

Credits consumed: ~2 credits (12 minutes of 60 minutes)

Credits saved: ~8 credits (remaining 48 minutes not executed)
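
The arithmetic above is a simple proportion. The sketch below is illustrative only and assumes credits scale linearly with active GPU minutes; actual charges also include the setup overhead noted earlier.

  # Illustrative only: assumes credits scale linearly with active GPU minutes.
  estimated_total_minutes = 60
  estimated_total_credits = 10
  minutes_before_kill = 12

  credits_consumed = estimated_total_credits * (minutes_before_kill / estimated_total_minutes)
  credits_saved = estimated_total_credits - credits_consumed

  print(f"Consumed: ~{credits_consumed:.0f} credits")  # ~2 credits
  print(f"Saved:    ~{credits_saved:.0f} credits")     # ~8 credits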

View your resource usage →


After killing a run

Once you've killed a training run, choose your next action:

Review what went wrong

Understand why training needed to be killed:

  1. Check logs for error messages or warnings
  2. Review metrics to see if training was progressing
  3. Examine configuration in workflow settings
  4. Identify root cause of issues requiring kill

Fix and restart training

Correct configuration issues and start fresh:

  1. Update workflow settings to fix identified problems

  2. Start a new run with corrected configuration

  3. Monitor the new run carefully during initial stages

Clean up killed run

Remove the killed run after extracting useful information:

  1. Document lessons learned from failure

  2. Save any useful metrics or configuration details

  3. Delete the killed run to keep project organized

📝

Killed runs must be deleted manually

Unlike some platforms, Datature Vi does not automatically delete killed runs. They remain in your project until you explicitly delete them.

This preserves training history for debugging and learning purposes.


Cannot kill completed runs

You can only kill runs that are actively executing or waiting to execute. Completed runs cannot be killed.

Which runs can be killed

Killable statuses:

  • Running — Actively training on GPU
  • Queued — Waiting for GPU resources to become available
  • Starting — Allocating resources and preparing environment

Which runs cannot be killed

Non-killable statuses:

  • Completed — Training finished successfully
  • Failed — Training stopped due to errors
  • Killed — Already killed (obviously)

For completed runs, review their results or delete them if you no longer need them.
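
As a minimal sketch, the snippet below encodes the status lists above as a simple lookup. The can_kill helper is hypothetical, written for illustration, and is not part of any Datature Vi SDK.

  # Statuses as listed above. Runs in "Completed", "Failed", or "Killed"
  # states are final and cannot be killed.
  KILLABLE_STATUSES = {"Running", "Queued", "Starting"}

  def can_kill(status: str) -> bool:
      """Return True if a run in this status can still be killed."""
      return status in KILLABLE_STATUSES

  print(can_kill("Queued"))     # True
  print(can_kill("Completed"))  # False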


Best practices

Monitor initial stages before killing

Before deciding to kill a run, verify it's actually problematic:

  1. Monitor progress for at least 5-10 minutes
  2. Check whether the run is advancing through preprocessing and setup stages
  3. Review logs for actual errors vs. normal warnings
  4. Wait for first metrics to appear if training just started

Don't kill prematurely:

  • Initial stages can take several minutes
  • Some warnings are normal and don't indicate failure
  • First epoch often slower than subsequent epochs

Do kill immediately if:

  • Clear configuration errors visible
  • Out of memory errors appearing
  • Training stuck at same point for 15+ minutes
  • Wrong workflow started accidentally

Document why runs were killed

Keep notes on killed runs to avoid repeating mistakes:

Record:

  • What went wrong (configuration error, resource issue, etc.)
  • Why you killed it instead of letting it complete
  • What you changed before restarting
  • Lessons learned for future training

Benefits:

  • Avoid repeating same configuration errors
  • Track patterns in training issues
  • Share knowledge with team members
  • Improve workflow configurations over time

Fix root causes before restarting

Don't immediately restart after killing—fix underlying issues first:

Configuration problems:

  1. Update workflow settings with corrections
  2. Test configuration with shorter training if possible
  3. Verify dataset is correct and accessible
  4. Review model compatibility with your task

Resource problems:

  1. Reduce batch size if out of memory
  2. Select smaller model if architecture too large
  3. Resize dataset images if their resolution is too high
  4. Check credit availability before long training

Validation:

  • Test with minimal epochs (e.g., 5 epochs) to validate fixes (see the sketch after this list)
  • Scale up gradually after confirming setup works
  • Monitor closely during initial restart
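
A minimal sketch of that validation step, assuming a generic key-value training configuration; the field names are hypothetical placeholders, not Datature Vi's actual workflow schema.

  # Hypothetical field names, adapt them to your actual workflow settings.
  full_config = {
      "epochs": 50,
      "batch_size": 16,
      "learning_rate": 1e-4,
  }

  smoke_test_config = {
      **full_config,
      "epochs": 5,       # just enough to confirm the loss starts decreasing
      "batch_size": 4,   # lower memory pressure while validating the setup
  }

  print(smoke_test_config)
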
Use kill for quick experiments

Killing is useful for rapid workflow testing:

Strategy:

  1. Start training with test configuration
  2. Monitor the first few minutes to verify the setup is correct
  3. Kill after validation (a full training run isn't needed)
  4. Iterate quickly on configurations

Good for:

  • Testing new dataset preparations
  • Validating system prompt formats
  • Checking model architecture compatibility
  • Confirming training parameters don't cause errors

Not good for:

  • Actual model training (obviously)
  • Generating metrics for evaluation
  • Producing deployable models

Consider credit costs before killing

Make informed decisions about whether to kill or continue:

Questions to ask:

  1. How much training time has elapsed?

    • Early (<10 min): Minimal credits lost if you kill now
    • Mid-training (30+ min): More credits already invested
    • Late (>80% complete): Consider letting it finish
  2. Is training clearly failing?

    • Yes: Kill immediately regardless of elapsed time
    • Maybe: Review logs and metrics before deciding
    • No: Let it complete if it's making good progress
  3. Can you fix the issue mid-training?

    • No: Kill and restart with fixes
    • Yes, but only by restarting: Kill, fix the configuration, then start a new run
    • The issue will resolve on its own: Continue training

General principle: Kill early when you're certain something is wrong, but let the run complete if training is progressing acceptably, even if it isn't optimal.
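
One way to picture this framework is as a rough heuristic, sketched below. The thresholds simply mirror the questions above; they are illustrative, not platform guidance.

  # Rough heuristic only; thresholds mirror the questions above.
  def kill_or_continue(elapsed_fraction: float, clearly_failing: bool,
                       issue_resolves_itself: bool) -> str:
      # Q2: is training clearly failing?
      if clearly_failing:
          return "kill now"                      # kill immediately regardless of time
      # Q3: will the issue resolve itself mid-training?
      if issue_resolves_itself:
          return "continue"                      # no restart needed
      # Q1: how much training time has elapsed?
      if elapsed_fraction > 0.8:
          return "consider letting it finish"    # most credits already invested
      return "review logs and metrics, then decide"

  print(kill_or_continue(elapsed_fraction=12 / 60,
                         clearly_failing=False,
                         issue_resolves_itself=False))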

Learn about credit management →


Common questions

Can I resume a killed run?

No. Killed runs cannot be resumed or restarted from their stopped point.

Your options after killing:

  1. Start a new run — Begins fresh training from scratch
  2. Use same workflow — Reuses configuration from killed run
  3. Modify workflow — Update settings before starting new run

Training always starts from beginning:

  • No checkpoint resumption from killed runs
  • Each run is an independent training session
  • Progress from killed run cannot be continued

Workaround for long training:

  • Configure workflows with shorter epoch counts
  • Train incrementally with multiple runs
  • Save intermediate results after each run

Will I get a refund for Compute Credits?

No. Compute Credits are consumed based on actual GPU time used, regardless of whether training completes.

What you're charged for:

  • GPU time from run start until kill
  • Validation and metric computation before kill
  • Resource allocation and setup overhead

What you're not charged for:

  • Remaining estimated training time not executed
  • Queue time waiting for GPU resources
  • Time after kill action completes

Why no refunds:

  • GPU resources were allocated and used
  • Training computations were actually executed
  • Infrastructure costs incurred for running time

Credit savings:

  • Killing early saves remaining training time credits
  • Early detection of issues minimizes wasted credits
  • Quick kills (<5 min) minimize lost credits

What if I killed the wrong run by accident?

Unfortunately, kills cannot be undone. Once you confirm the kill action:

  • Training stops immediately and permanently
  • Run cannot be resumed or restarted
  • GPU resources are released

If you killed the wrong run:

  1. Start a new run immediately with same configuration
  2. Use same workflow to replicate settings exactly
  3. Monitor closely to ensure correct run this time
  4. Delete killed run after starting replacement

Prevention tips:

  • Double-check run name in kill confirmation dialog
  • Verify run status before killing (ensure it's the problematic one)
  • Review run start time to confirm identity
  • Read confirmation message carefully before clicking Confirm

In the future:

  • Use descriptive workflow names for easy identification
  • Review run details before accessing kill option
  • Take a moment to verify before confirming kill

Can I kill multiple runs at once?

Currently, runs must be killed one at a time through the interface. There is no bulk kill functionality.

To kill multiple runs:

  1. Navigate to the Runs tab
  2. For each run you want to stop:
    • Click the three-dot menu (⋮)
    • Select Kill Run
    • Confirm the action
  3. Repeat for each active run

Use cases for killing multiple runs:

  • Wrong workflow configurations applied to multiple runs
  • Need to free all GPU resources immediately
  • Testing scenarios where all test runs should be stopped

Efficiency tip: Work from newest to oldest so the most recent (and likely highest priority) runs are stopped first.

How long does killing take?

Killing is nearly instantaneous in most cases:

Typical sequence:

  1. You click Confirm in kill dialog
  2. Platform sends stop signal to training process (< 1 second)
  3. GPU processing halts immediately
  4. Status updates to "Killed" (< 5 seconds)
  5. Resources freed for other uses

Possible delays:

  • Training in middle of saving checkpoint: Wait for save to complete (< 30 seconds)
  • Validation cycle in progress: Allow validation to finish (< 1 minute)
  • Infrastructure communication delays: Rare, usually resolve within 1-2 minutes

If kill seems stuck:

  • Refresh the page to see updated status
  • Check the Logs tab for kill confirmation messages
  • Wait 2-3 minutes for infrastructure to process
  • Contact support if run still shows "Running" after 5 minutes

Resource availability:

  • GPU resources usually available for new runs within 1-2 minutes after kill
  • Queued runs start automatically once resources free up

Will killing affect other concurrent runs?

No. Killing one run does not affect other active training runs:

Each run is isolated:

  • Runs use dedicated GPU allocations
  • Training processes are independent
  • Killing one run doesn't impact others

Benefits:

  • Kill problematic runs without disrupting successful training
  • Free resources for specific high-priority runs
  • Test configurations independently

Resource reallocation:

  • GPU freed from killed run becomes available
  • Queued runs may start using freed resources
  • Other active runs continue unaffected

Project organization:

  • Runs in different workflows are completely independent
  • Runs in the same workflow don't depend on each other
  • Project structure is unaffected by kills

Should I kill or wait for a failing run to complete?

Kill immediately if:

  • ❌ Configuration is clearly wrong (can't produce useful results)
  • ❌ Out of memory errors occurring (won't complete anyway)
  • ❌ Loss showing NaN or infinite values (training broken)
  • ❌ You need the GPU resources for something else urgent

Let complete if:

  • ✅ Training progressing slowly but correctly
  • ✅ Metrics improving even if suboptimal
  • ✅ Near completion (>80% done) with acceptable progress
  • ✅ Want to analyze results before deciding next steps

Consider carefully if:

  • ⚠️ Mid-training with significant credits invested
  • ⚠️ Training isn't optimal but producing some results
  • ⚠️ Unsure if issue is temporary or permanent
  • ⚠️ Want to see what metrics look like at completion

Decision framework:

  1. Review logs for error severity
  2. Check metrics for improvement trends
  3. Calculate credits consumed vs. remaining
  4. Assess if results would be usable even if suboptimal

General principle: Kill early and decisively when you're certain of failure; let the run complete if there's value in seeing the final results.

