How Do I Choose the Right GPU?

Estimate VRAM requirements for your model, compare GPU tiers, and decide when to scale to multiple GPUs for VLM training in Datature Vi.

Choosing the right GPU depends on three things: your model size, your training method, and your budget. A 7B model with QLoRA fits on a single T4. The same model with full fine-tuning needs an A100. This page gives you the formulas, reference tables, and decision guides to pick the right hardware before you launch a training run.


How much VRAM does my model need?

VRAM consumption during training comes from four sources: model weights, optimizer states, gradients, and activation memory. The total depends on your model size, training method, and precision format.

VRAM estimation formula

For LoRA with FP16: Model parameters x 2 bytes (FP16 weights) + small adapter overhead. Optimizer states and gradients apply only to the adapter parameters (0.1-1% of total).

For QLoRA (LoRA + NF4): Model parameters x 0.5 bytes (NF4 weights) + adapter overhead in BF16. The base weights take roughly a quarter of the memory they would under FP16 LoRA.

For Full SFT with FP16: Model parameters x 2 bytes (weights) + parameters x 2 bytes (gradients) + parameters x 8 bytes (AdamW optimizer states), or roughly 12 bytes per parameter before memory savings. With gradient checkpointing, expect about 2-3x the weight-only size in practice.

Actual usage varies by 10-15% depending on batch size, sequence length, and whether Flash Attention 2 is enabled. These estimates include gradient checkpointing.
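The formulas above can be folded into a small back-of-envelope helper. This is an illustrative sketch, not Vi's internal estimator; the function name, method labels, and the 2x multipliers for QLoRA and full SFT are assumptions chosen to match the reference table below.

```python
def estimate_train_vram_gb(params_b: float, method: str) -> float:
    """Back-of-envelope training VRAM in GB, per the formulas above.

    params_b: parameter count in billions (e.g. 7 for a 7B model).
    method: "qlora", "lora_fp16", or "full_sft_fp16".
    Illustrative only; real usage varies 10-15% with batch size,
    sequence length, and attention implementation.
    """
    if method == "qlora":
        weights = params_b * 0.5   # NF4: 0.5 bytes/param
        return weights * 2         # BF16 adapters + activations roughly double it
    if method == "lora_fp16":
        weights = params_b * 2.0   # FP16: 2 bytes/param
        return weights * 1.05      # adapter overhead is small (0.1-1% of params)
    if method == "full_sft_fp16":
        weights = params_b * 2.0
        return weights * 2         # ~2-3x weight-only size after checkpointing
    raise ValueError(f"unknown method: {method}")
```

For example, `estimate_train_vram_gb(7, "qlora")` gives 7.0 GB and `estimate_train_vram_gb(32, "full_sft_fp16")` gives 128.0 GB, in line with the reference table.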

Quick reference table

Estimated VRAM by model and training method

| Configuration | Estimated VRAM | Recommended GPU |
| --- | --- | --- |
| Qwen 7B, LoRA, NF4 (QLoRA) | ~7 GB | T4 (16 GB) |
| Qwen 7B, LoRA, FP16 | ~12 GB | T4 or L4 |
| Qwen 7B, Full SFT, FP16 | ~28 GB | A100 (80 GB) |
| InternVL 8B, LoRA, FP16 | ~14 GB | T4 or L4 |
| Qwen 32B, LoRA, FP16 | ~48 GB | A100 (80 GB) |
| Qwen 32B, Full SFT, FP16 | ~128 GB | Multi-GPU (H100) |
| Qwen 72B, LoRA, NF4 | ~36 GB | A100 (80 GB) |
| InternVL 38B, Full SFT, FP16 | ~142 GB | Multi-GPU (H100) |

Pre-launch validation

Datature Vi checks your configuration against the selected GPU's VRAM limit before launching a training run. If the estimated memory exceeds available VRAM, you will see a warning with a suggestion to switch GPUs or enable quantization.
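The logic of that check can be sketched as follows. This is a hypothetical illustration, not Vi's actual validation code; the dictionary, function name, and the 25% headroom default are assumptions (the FAQ below suggests keeping 20-30% headroom).

```python
# VRAM per tier, from the GPU table below
GPU_VRAM_GB = {"T4": 16, "L4": 24, "A10": 24, "A100": 80, "H100": 80}

def prelaunch_check(estimated_gb: float, gpu: str, headroom: float = 0.25) -> str:
    """Flag configurations whose estimate, plus headroom for batch data
    and activations, exceeds the selected GPU's VRAM."""
    needed = estimated_gb * (1 + headroom)
    limit = GPU_VRAM_GB[gpu]
    if needed > limit:
        return (f"Warning: ~{needed:.0f} GB needed, {limit} GB available on {gpu}. "
                "Select a larger GPU or enable NF4 quantization.")
    return "OK"
```

For instance, a ~7 GB QLoRA estimate passes on a T4, while a ~28 GB full-SFT estimate triggers the warning.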


GPU tiers available in Datature Vi

| GPU | VRAM | CUDA Cores | Architecture | Max per run | Best for |
| --- | --- | --- | --- | --- | --- |
| T4 | 16 GB | 2,560 | Turing | 4 | Inference, LoRA on small models (up to 4B) |
| L4 | 24 GB | 7,424 | Ada Lovelace | 8 | LoRA on 7B models |
| A10 | 24 GB | 9,216 | Ampere | 8 | General purpose training |
| A100 | 80 GB | 6,912 | Ampere | 32 | Production SFT, large LoRA runs |
| H100 | 80 GB | 16,896 | Hopper | 64 | Large-scale training with NVLink |

Each GPU tier has a compute credit multiplier that affects cost. See Resource Usage for multiplier values and pricing details.


When to scale to multiple GPUs

Single vs multi-GPU decision guide

| Scenario | Recommendation | Why |
| --- | --- | --- |
| LoRA on models up to 7B | Single GPU (T4, L4, or A10) | Model fits in VRAM with room for batch data |
| LoRA on 32B+ models with NF4 | Single A100 | NF4 shrinks weights enough for one GPU |
| Full SFT on 7B models | Multi-GPU (2-4x A100) | Weights, gradients, and optimizer states exceed single-GPU VRAM |
| Full SFT on 32B+ models | Multi-GPU (8-32x A100 or H100) | Model parallelism required; NVLink recommended for gradient sync |
| Faster wall-clock training time | Multi-GPU | Data parallelism splits batches across GPUs for faster epochs |

How multi-GPU training works in Datature Vi

Datature Vi handles multi-GPU orchestration automatically. You select the GPU type and count; the platform configures the rest.

  • Automatic parallelism strategy: Vi selects the right combination of data parallelism and model parallelism based on your model size and GPU count.
  • NVLink interconnect: H100 GPUs use NVLink for high-bandwidth gradient synchronization. The platform configures topology automatically.
  • Per-epoch checkpointing: Checkpoints are saved periodically so you can resume if a run is interrupted.
  • VRAM-aware scheduling: The platform estimates memory before launch and warns you if your configuration exceeds available VRAM.
  • Async execution: Training runs execute in the background. You receive a notification when the run completes.

Adding GPUs reduces wall-clock time, but the speedup is not perfectly linear. Communication overhead between GPUs consumes a fraction of each step. For LoRA training, the overhead is small because only adapter gradients need synchronization. For full SFT, gradient communication is heavier and the overhead grows with GPU count.

A practical guideline: doubling your GPU count typically gives 1.6-1.8x speedup for LoRA and 1.4-1.6x for full SFT. Beyond 8 GPUs for LoRA or 32 GPUs for full SFT, the marginal benefit shrinks further.
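Compounding that per-doubling guideline gives a rough planning estimate. The function below is an illustrative sketch of that arithmetic, not a measured scaling model; it assumes the GPU count is a power of two.

```python
import math

def estimated_speedup(gpu_count: int, per_doubling: float) -> float:
    """Compound the rule-of-thumb speedup per GPU-count doubling.

    per_doubling: ~1.6-1.8 for LoRA, ~1.4-1.6 for full SFT (see above).
    Purely illustrative; assumes gpu_count is a power of two.
    """
    return per_doubling ** math.log2(gpu_count)
```

For example, 4 GPUs at 1.7x per doubling gives about a 2.9x speedup, not 4x, which is why the marginal benefit shrinks at higher GPU counts.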


Memory optimization with NF4 quantization

NF4 (4-bit NormalFloat) quantization stores model weights in 4-bit precision, cutting memory by roughly 4x compared to FP16. Combined with LoRA (a pairing known as QLoRA), it lets you train large models on smaller GPUs.

| Format | Memory vs FP16 | Quality impact | Best for |
| --- | --- | --- | --- |
| FP16 | 1x (baseline) | Full precision | Production runs, full SFT |
| INT8 | ~2x savings | Minimal loss (under 0.1%) | Inference, large LoRA runs |
| NF4 | ~4x savings | Slight loss (under 0.5%) | Large model LoRA, prototyping |

For a 7B model, NF4 reduces weight memory from ~14 GB to ~3.5 GB. With LoRA adapters in BF16, total training memory drops to roughly 7 GB, fitting comfortably on a T4 (16 GB).
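The arithmetic behind those figures, using the byte-per-parameter formulas from earlier (GB here means 1e9 bytes; the 2x factor for adapters and activations is a rough assumption):

```python
params = 7e9                           # 7B-parameter model
fp16_weights_gb = params * 2 / 1e9     # FP16 weights: 2 bytes/param -> 14.0 GB
nf4_weights_gb = params * 0.5 / 1e9    # NF4 weights: 0.5 bytes/param -> 3.5 GB
total_train_gb = nf4_weights_gb * 2    # BF16 adapters + activations -> ~7.0 GB
print(fp16_weights_gb, nf4_weights_gb, total_train_gb)  # 14.0 3.5 7.0
```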

For more on how quantization works and when it affects quality, see LoRA and Quantization.


Cost management tips

GPU time is the primary cost driver for VLM training. A few choices have outsized impact on your compute credit consumption.

1. Right-size your GPU

Use the VRAM estimation table above to pick the smallest GPU that fits your model and training method. Running QLoRA on a T4 costs 1x credits per minute; the same run on an A100 costs 6x.
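In concrete terms, using the 1x and 6x multipliers above (the linear minutes-times-multiplier cost formula is an assumption; see Resource Usage for actual pricing):

```python
def run_credits(minutes: float, multiplier: float) -> float:
    """Hypothetical credit cost: run duration times the tier multiplier."""
    return minutes * multiplier

# A 2-hour QLoRA run that fits on either GPU:
t4_cost = run_credits(120, 1)    # 120 credits on a T4
a100_cost = run_credits(120, 6)  # 720 credits on an A100
```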

2. Start with QLoRA

QLoRA is the default in Datature Vi for a reason. It uses the least memory, trains the fastest, and produces results within 1-3% of full SFT for most tasks. Switch to full SFT only after confirming QLoRA results fall short.

3. Watch for early stopping signals

Monitor your validation loss curve during training. If validation loss stops improving for several epochs, the remaining epochs burn credits without improving the model. Kill the run early and save the best checkpoint.

4. Test with small batches first

Run a short experiment (5-10 epochs on 50-100 images) on a cheap GPU before committing to a large production run. This validates your configuration and catches issues before they consume significant credits.

For detailed cost calculations and credit multipliers per GPU tier, see Resource Usage.


Frequently asked questions

Will my model fit on my chosen GPU?

Check the quick reference table above for your model size and training method. If the estimated VRAM is below your GPU's capacity, it fits. Datature Vi also validates this before launch and warns you if the estimate exceeds available memory.

What happens if a run exceeds available VRAM?

The training run fails with an out-of-memory (OOM) error. To fix it: reduce batch size, enable NF4 quantization, switch from full SFT to LoRA, or select a GPU with more VRAM. See Training Logs for diagnosing OOM errors.

Is a bigger GPU always better?

Not necessarily. Bigger GPUs cost more credits per minute. A QLoRA run on a T4 at 1x cost often produces results comparable to the same run on an A100 at 6x cost. Use the smallest GPU that fits your configuration comfortably (with 20-30% headroom for batch data and activations).

Can I use different GPUs for different training runs?

Yes. Each training run selects its own GPU configuration. You can experiment on a T4, then run your production training on an A100. The model checkpoint is portable across GPU types.

Does multi-GPU training change my model's results?

No. Multi-GPU training produces the same model as single-GPU training (given the same hyperparameters and data). The only difference is speed: multiple GPUs process batches in parallel, reducing wall-clock time.


Related resources