How Do I Choose the Right GPU?

Estimate VRAM requirements for your model, compare GPU tiers, and decide when to scale to multiple GPUs for VLM training in Datature Vi.

Choosing the right GPU depends on three things: your model size, your training method, and your budget. A 7B model with QLoRA fits on a single T4. The same model with full fine-tuning needs an A100. This page gives you the formulas, reference tables, and decision guides to pick the right hardware before you launch a training run.


How much VRAM does my model need?

VRAM consumption during training comes from four sources: model weights, optimizer states, gradients, and activation memory. The total depends on your model size, training method, and precision format.

VRAM estimation formula

For LoRA with FP16: Model parameters x 2 bytes (FP16 weights) + small adapter overhead. Optimizer states and gradients apply only to the adapter parameters (0.1-1% of total).

For QLoRA (LoRA + NF4): Model parameters x 0.5 bytes (NF4 weights) + adapter overhead in BF16. The base weights take roughly a quarter of the memory they would under FP16 LoRA.

For Full SFT with FP16: Model parameters x 2 bytes (weights) + parameters x 2 bytes (gradients) + parameters x 8 bytes (AdamW optimizer states), or roughly 12 bytes per parameter before memory savings. With gradient checkpointing, expect about 2-3x the weight-only size in practice.

Actual usage varies by 10-15% depending on batch size, sequence length, and whether Flash Attention 2 is enabled. These estimates include gradient checkpointing.
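The formulas above can be folded into a small back-of-envelope helper. This is an illustrative sketch, not Vi's internal estimator; the function name, method labels, and the 2x multipliers for QLoRA and full SFT are assumptions chosen to match the reference table below.

```python
def estimate_train_vram_gb(params_b: float, method: str) -> float:
    """Back-of-envelope training VRAM in GB, per the formulas above.

    params_b: parameter count in billions (e.g. 7 for a 7B model).
    method: "qlora", "lora_fp16", or "full_sft_fp16".
    Illustrative only; real usage varies 10-15% with batch size,
    sequence length, and attention implementation.
    """
    if method == "qlora":
        weights = params_b * 0.5   # NF4: 0.5 bytes/param
        return weights * 2         # BF16 adapters + activations roughly double it
    if method == "lora_fp16":
        weights = params_b * 2.0   # FP16: 2 bytes/param
        return weights * 1.05      # adapter overhead is small (0.1-1% of params)
    if method == "full_sft_fp16":
        weights = params_b * 2.0
        return weights * 2         # ~2-3x weight-only size after checkpointing
    raise ValueError(f"unknown method: {method}")
```

For example, `estimate_train_vram_gb(7, "qlora")` gives 7.0 GB and `estimate_train_vram_gb(32, "full_sft_fp16")` gives 128.0 GB, in line with the reference table.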

Quick reference table

Estimated VRAM by model and training method

| Configuration | Estimated VRAM | Recommended GPU |
| --- | --- | --- |
| Qwen 7B, LoRA, NF4 (QLoRA) | ~7 GB | T4 (16 GB) |
| Qwen 7B, LoRA, FP16 | ~12 GB | T4 or L4 |
| Qwen 7B, Full SFT, FP16 | ~28 GB | A100 (80 GB) |
| InternVL 8B, LoRA, FP16 | ~14 GB | T4 or L4 |
| Qwen 32B, LoRA, FP16 | ~48 GB | A100 (80 GB) |
| Qwen 32B, Full SFT, FP16 | ~128 GB | Multi-GPU (H100) |
| Qwen 72B, LoRA, NF4 | ~36 GB | A100 (80 GB) |
| InternVL 38B, Full SFT, FP16 | ~142 GB | Multi-GPU (H100) |

Pre-launch validation

Datature Vi checks your configuration against the selected GPU's VRAM limit before launching a training run. If the estimated memory exceeds available VRAM, you will see a warning with a suggestion to switch GPUs or enable quantization.
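The logic of that check can be sketched as follows. This is a hypothetical illustration, not Vi's actual validation code; the dictionary, function name, and the 25% headroom default are assumptions (the FAQ below suggests keeping 20-30% headroom).

```python
# VRAM per tier, from the GPU table below
GPU_VRAM_GB = {"T4": 16, "L4": 24, "A10": 24, "A100": 80, "H100": 80}

def prelaunch_check(estimated_gb: float, gpu: str, headroom: float = 0.25) -> str:
    """Flag configurations whose estimate, plus headroom for batch data
    and activations, exceeds the selected GPU's VRAM."""
    needed = estimated_gb * (1 + headroom)
    limit = GPU_VRAM_GB[gpu]
    if needed > limit:
        return (f"Warning: ~{needed:.0f} GB needed, {limit} GB available on {gpu}. "
                "Select a larger GPU or enable NF4 quantization.")
    return "OK"
```

For instance, a ~7 GB QLoRA estimate passes on a T4, while a ~28 GB full-SFT estimate triggers the warning.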


GPU tiers available in Datature Vi

| GPU | VRAM | CUDA Cores | Architecture | Max per run | Best for |
| --- | --- | --- | --- | --- | --- |
| T4 | 16 GB | 2,560 | Turing | 4 | Inference, LoRA on small models (up to 4B) |
| L4 | 24 GB | 7,424 | Ada Lovelace | 8 | LoRA on 7B models |
| A10 | 24 GB | 9,216 | Ampere | 8 | General purpose training |
| A100 | 80 GB | 6,912 | Ampere | 32 | Production SFT, large LoRA runs |
| H100 | 80 GB | 16,896 | Hopper | 64 | Large-scale training with NVLink |

Each GPU tier has a compute credit multiplier that affects cost. See Resource Usage for multiplier values and pricing details.


When to scale to multiple GPUs

Single vs multi-GPU decision guide

| Scenario | Recommendation | Why |
| --- | --- | --- |
| LoRA on models up to 7B | Single GPU (T4, L4, or A10) | Model fits in VRAM with room for batch data |
| LoRA on 32B+ models with NF4 | Single A100 | NF4 shrinks weights enough for one GPU |
| Full SFT on 7B models | Multi-GPU (2-4x A100) | Weights, gradients, and optimizer states exceed single-GPU VRAM |
| Full SFT on 32B+ models | Multi-GPU (8-32x A100 or H100) | Model parallelism required; NVLink recommended for gradient sync |
| Faster wall-clock training time | Multi-GPU | Data parallelism splits batches across GPUs for faster epochs |

How multi-GPU training works in Datature Vi

Datature Vi handles multi-GPU orchestration automatically. You select the GPU type and count; the platform configures the rest.

  • Automatic parallelism strategy: Vi selects the right combination of data parallelism and model parallelism based on your model size and GPU count.
  • NVLink interconnect: H100 GPUs use NVLink for high-bandwidth gradient synchronization. The platform configures topology automatically.
  • Per-epoch checkpointing: Checkpoints are saved periodically so you can resume if a run is interrupted.
  • VRAM-aware scheduling: The platform estimates memory before launch and warns you if your configuration exceeds available VRAM.
  • Async execution: Training runs execute in the background. You receive a notification when the run completes.

Adding GPUs reduces wall-clock time, but the speedup is not perfectly linear. Communication overhead between GPUs consumes a fraction of each step. For LoRA training, the overhead is small because only adapter gradients need synchronization. For full SFT, gradient communication is heavier and the overhead grows with GPU count.

A practical guideline: doubling your GPU count typically gives 1.6-1.8x speedup for LoRA and 1.4-1.6x for full SFT. Beyond 8 GPUs for LoRA or 32 GPUs for full SFT, the marginal benefit shrinks further.
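Compounding that per-doubling guideline gives a rough planning estimate. The function below is an illustrative sketch of that arithmetic, not a measured scaling model; it assumes the GPU count is a power of two.

```python
import math

def estimated_speedup(gpu_count: int, per_doubling: float) -> float:
    """Compound the rule-of-thumb speedup per GPU-count doubling.

    per_doubling: ~1.6-1.8 for LoRA, ~1.4-1.6 for full SFT (see above).
    Purely illustrative; assumes gpu_count is a power of two.
    """
    return per_doubling ** math.log2(gpu_count)
```

For example, 4 GPUs at 1.7x per doubling gives about a 2.9x speedup, not 4x, which is why the marginal benefit shrinks at higher GPU counts.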


Memory optimization with NF4 quantization

NF4 (4-bit NormalFloat) quantization stores model weights in 4-bit precision, cutting memory by roughly 4x compared to FP16. Combined with LoRA (a pairing known as QLoRA), it lets you train large models on smaller GPUs.

| Format | Memory vs FP16 | Quality impact | Best for |
| --- | --- | --- | --- |
| FP16 | 1x (baseline) | Full precision | Production runs, full SFT |
| INT8 | ~2x savings | Minimal loss (under 0.1%) | Inference, large LoRA runs |
| NF4 | ~4x savings | Slight loss (under 0.5%) | Large model LoRA, prototyping |

For a 7B model, NF4 reduces weight memory from ~14 GB to ~3.5 GB. With LoRA adapters in BF16, total training memory drops to roughly 7 GB, fitting comfortably on a T4 (16 GB).
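The arithmetic behind those figures, using the byte-per-parameter formulas from earlier (GB here means 1e9 bytes; the 2x factor for adapters and activations is a rough assumption):

```python
params = 7e9                           # 7B-parameter model
fp16_weights_gb = params * 2 / 1e9     # FP16 weights: 2 bytes/param -> 14.0 GB
nf4_weights_gb = params * 0.5 / 1e9    # NF4 weights: 0.5 bytes/param -> 3.5 GB
total_train_gb = nf4_weights_gb * 2    # BF16 adapters + activations -> ~7.0 GB
print(fp16_weights_gb, nf4_weights_gb, total_train_gb)  # 14.0 3.5 7.0
```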

For more on how quantization works and when it affects quality, see LoRA and Quantization.


Cost management tips

GPU time is the primary cost driver for VLM training. A few choices have outsized impact on your compute credit consumption.

1. Right-size your GPU

Use the VRAM estimation table above to pick the smallest GPU that fits your model and training method. Running QLoRA on a T4 costs 1x credits per minute; the same run on an A100 costs 6x.
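In concrete terms, using the 1x and 6x multipliers above (the linear minutes-times-multiplier cost formula is an assumption; see Resource Usage for actual pricing):

```python
def run_credits(minutes: float, multiplier: float) -> float:
    """Hypothetical credit cost: run duration times the tier multiplier."""
    return minutes * multiplier

# A 2-hour QLoRA run that fits on either GPU:
t4_cost = run_credits(120, 1)    # 120 credits on a T4
a100_cost = run_credits(120, 6)  # 720 credits on an A100
```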

2. Start with QLoRA

QLoRA is the default in Datature Vi for a reason. It uses the least memory, trains the fastest, and produces results within 1-3% of full SFT for most tasks. Switch to full SFT only after confirming QLoRA results fall short.

3. Watch for early stopping signals

Monitor your validation loss curve during training. If validation loss stops improving for several epochs, the remaining epochs burn credits without improving the model. Kill the run early and save the best checkpoint.

4. Test with small batches first

Run a short experiment (5-10 epochs on 50-100 images) on a cheap GPU before committing to a large production run. This validates your configuration and catches issues before they consume significant credits.

For detailed cost calculations and credit multipliers per GPU tier, see Resource Usage.


Frequently asked questions

Will my model fit on my chosen GPU?

Check the quick reference table above for your model size and training method. If the estimated VRAM is below your GPU's capacity, it fits. Datature Vi also validates this before launch and warns you if the estimate exceeds available memory.

What happens if a run exceeds available VRAM?

The training run fails with an out-of-memory (OOM) error. To fix it: reduce batch size, enable NF4 quantization, switch from full SFT to LoRA, or select a GPU with more VRAM. See Training Logs for diagnosing OOM errors.

Is a bigger GPU always better?

Not necessarily. Bigger GPUs cost more credits per minute. A QLoRA run on a T4 at 1x cost often produces results comparable to the same run on an A100 at 6x cost. Use the smallest GPU that fits your configuration comfortably (with 20-30% headroom for batch data and activations).

Can I use different GPUs for different training runs?

Yes. Each training run selects its own GPU configuration. You can experiment on a T4, then run your production training on an A100. The model checkpoint is portable across GPU types.

Does multi-GPU training change my model's results?

No. Multi-GPU training produces the same model as single-GPU training (given the same hyperparameters and data). The only difference is speed: multiple GPUs process batches in parallel, reducing wall-clock time.


Related resources