NIM Troubleshooting

This page covers the most common errors you'll encounter with NVIDIA NIM deployment and inference in Datature Vi, along with their causes and fixes. For an overview of the NIM integration, see NVIDIA NIM Deployment.

Deployment issues

Invalid API key

Symptom: InvalidConfigError: API key must start with 'nvapi-'

Cause: The NGC API key format is wrong.

Fix: pass a key that carries the nvapi- prefix:

from vi.deployment.nim import NIMConfig

# Correct
config = NIMConfig(nvidia_api_key="nvapi-...")

# Wrong: missing nvapi- prefix
# config = NIMConfig(nvidia_api_key="your-key")

Verify the key stored in your environment:

echo $NGC_API_KEY
# Should print: nvapi-...
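To avoid hardcoding the key, you can read it from the environment and validate the prefix up front — a minimal sketch (`load_ngc_key` is an illustrative helper, not part of the SDK; `NGC_API_KEY` matches the variable used in the shell check above):

```python
import os

def load_ngc_key() -> str:
    """Read the NGC API key from the environment and validate its prefix."""
    key = os.environ.get("NGC_API_KEY", "")
    if not key.startswith("nvapi-"):
        raise ValueError("NGC_API_KEY must start with 'nvapi-'")
    return key
```

The result can then be passed as `NIMConfig(nvidia_api_key=load_ngc_key())`, so a malformed key fails before any Docker work starts.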

Container already exists

Symptom: ContainerExistsError: Container 'cosmos-reason2-2b' already exists

Cause: A container with the same name is already running.

Fix (reuse the existing container):

from vi.deployment.nim import NIMConfig, NIMDeployer

config = NIMConfig(
    nvidia_api_key="nvapi-...",
    use_existing_container=True
)

deployer = NIMDeployer(config)
result = deployer.deploy()  # returns immediately

Fix (auto-remove and redeploy):

config = NIMConfig(
    nvidia_api_key="nvapi-...",
    auto_kill_existing_container=True
)

deployer = NIMDeployer(config)
result = deployer.deploy()

Fix (remove manually):

docker stop cosmos-reason2-2b
docker rm cosmos-reason2-2b

Or stop it from Python:

from vi.deployment.nim import NIMDeployer

NIMDeployer.stop("cosmos-reason2-2b")
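Before deploying, you can check for a name clash yourself and pick the matching config flag — a sketch that shells out to the Docker CLI (`pick_strategy` is an illustrative helper, not part of the SDK):

```python
import subprocess

def existing_container_names() -> set:
    """Names of all containers (running or stopped) known to the local Docker daemon."""
    out = subprocess.run(
        ["docker", "ps", "-a", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return set(out.stdout.split())

def pick_strategy(name: str, names: set) -> str:
    """Choose a deployment approach based on whether the name is taken."""
    return "reuse_or_remove" if name in names else "fresh_deploy"
```

If `pick_strategy("cosmos-reason2-2b", existing_container_names())` reports a clash, set `use_existing_container=True` or `auto_kill_existing_container=True` as shown above.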

Image pull fails

Symptom: APIError: Failed to pull image

Cause 1: Authentication failure (unauthorized: authentication required). Verify your NGC key and test the registry login:

echo $NGC_API_KEY

# Test Docker login
docker login nvcr.io
# Username: $oauthtoken
# Password: <your-ngc-api-key>

Cause 2: Network issues:

ping nvcr.io
docker info

# Try pulling manually
docker pull nvcr.io/nim/nvidia/cosmos-reason2-2b:latest

Cause 3: Not enough disk space (NIM images are 10–20 GB). Check free space, then point the image cache at a larger disk if needed:

df -h

config = NIMConfig(
    nvidia_api_key="nvapi-...",
    local_cache_dir="/mnt/large_disk/nim"
)
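You can also gate the deployment on free space programmatically — a stdlib sketch (the 25 GB threshold is an assumption derived from the 10–20 GB image sizes above; tune it for your image):

```python
import shutil

REQUIRED_GB = 25  # NIM images run 10–20 GB; leave some headroom

def has_space(path: str, required_gb: float = REQUIRED_GB) -> bool:
    """True if the filesystem at `path` has at least `required_gb` GB free."""
    return shutil.disk_usage(path).free / 1e9 >= required_gb
```

Check the Docker data root (commonly /var/lib/docker) or your `local_cache_dir` before calling `deployer.deploy()`.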

Model incompatibility

Symptom: ModelIncompatibilityError: Model incompatible with container

Cause: The custom model architecture is not supported by the chosen NIM image.

Fix: Check the NVIDIA NIM support matrix. If your model is incompatible, deploy with base model weights instead:

# Remove run_id to fall back to base model weights
config = NIMConfig(nvidia_api_key="nvapi-...")

LoRA adapters are not supported by NVIDIA NIM. Only full model weights are used at inference time.


Service not ready

Symptom: Deployment completes but predictions fail, or the health check times out.

Diagnose first:

docker logs cosmos-reason2-2b
docker exec cosmos-reason2-2b ps aux
docker port cosmos-reason2-2b

Fix 1: Wait longer. NIM services can take 5–10 minutes to initialize on first run. The deployer waits up to 10 minutes by default. If your hardware is slower, add extra wait time:

import time

deployer = NIMDeployer(config)
result = deployer.deploy()
time.sleep(300)  # Extra 5 minutes if needed
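Instead of a fixed sleep, you can poll the service's readiness endpoint — a sketch assuming the container exposes /v1/health/ready on the configured port (a common NIM convention; adjust the path if your image differs):

```python
import time
import urllib.error
import urllib.request

def wait_until_ready(url: str, timeout_s: float = 600.0, poll_s: float = 5.0) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # service not up yet; keep polling
        time.sleep(poll_s)
    return False
```

For example, `wait_until_ready("http://localhost:8000/v1/health/ready")` returns as soon as the service is up rather than waiting a fixed five minutes.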

Fix 2: Check GPU availability:

nvidia-smi
docker exec cosmos-reason2-2b nvidia-smi

GPU issues

No GPU detected

Symptom: could not select device driver "" with capabilities: [[gpu]]

Diagnose:

nvidia-smi
docker run --rm --gpus all ubuntu nvidia-smi

Fix 1: Install NVIDIA Container Toolkit:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
    sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Fix 2: Verify the Docker daemon config registers the nvidia runtime:

cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

If the nvidia runtime entry is missing, regenerate the config with the Container Toolkit helper and restart Docker:

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

Out of GPU memory

Symptom: RuntimeError: CUDA out of memory

Fix 1: Switch to a smaller model:

config = NIMConfig(
    nvidia_api_key="nvapi-...",
    image_name="cosmos-reason2-2b"  # smaller than 8B
)

Fix 2: Reduce context length:

config = NIMConfig(
    nvidia_api_key="nvapi-...",
    max_model_len=4096  # default is 8192
)

Fix 3: Limit token generation:

from vi.deployment.nim import NIMSamplingParams

params = NIMSamplingParams(max_tokens=512)

Fix 4: Close other applications using the GPU, then redeploy. List the processes holding GPU memory first:

nvidia-smi
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv

Inference issues

Connection refused

Symptom: Connection refused when calling the predictor.

Cause: The container is not running, or the predictor's port does not match the container's.

docker ps | grep cosmos-reason
docker port cosmos-reason2-2b
# Expected: 8000/tcp -> 0.0.0.0:8000

Then make sure the predictor targets the same port:

config = NIMConfig(nvidia_api_key="nvapi-...", port=8000)

predictor = NIMPredictor(config=config)  # uses port from config

# Or set the port explicitly
predictor = NIMPredictor(model_name="cosmos-reason2-2b", port=8000)
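A plain TCP probe confirms whether anything is listening on the expected port before you construct the predictor — a minimal stdlib sketch:

```python
import socket

def port_open(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False
```

If `port_open("localhost", 8000)` is False, the container is down or mapped to a different host port; recheck `docker port` output before changing predictor settings.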

Slow inference

Cause 1: First request is slow. Model loading and JIT compilation run on the first call (30–60 seconds). Subsequent calls are faster. Warm up the model before timing:

# Run once before benchmarking
_ = predictor(source="image.jpg", stream=False)
# Now measure
result = predictor(source="image.jpg", stream=False)

Cause 2: Large images. Resize before inference:

from PIL import Image

img = Image.open("large_image.jpg")
img.thumbnail((1024, 1024))  # caps the longer edge; preserves aspect ratio, unlike resize()
img.save("resized.jpg")

result = predictor(source="resized.jpg", stream=False)

Cause 3: max_tokens is too high. Reduce it:

from vi.deployment.nim import NIMSamplingParams

params = NIMSamplingParams(
    max_tokens=512,
    temperature=0.2
)

Video inference fails

Cause 1: Wrong model. Video input requires a Cosmos-Reason2 model:

# Correct
config = NIMConfig(nvidia_api_key="nvapi-...", image_name="cosmos-reason2-2b")

# Wrong: Cosmos-Reason1 does not support video
# config = NIMConfig(nvidia_api_key="nvapi-...", image_name="cosmos-reason1-7b")

Cause 2: File extension not recognized. The predictor detects video by extension:

# Recognized: .mp4, .avi, .mov, .mkv, .webm, .flv, .wmv, .mpeg, .3gp

import shutil

# If the file really is a video, copy it under a recognized extension
shutil.copy("video.dat", "video.mp4")
result = predictor(source="video.mp4", stream=False)
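To check locally which pipeline a path will hit, you can mirror the extension check — a minimal sketch (the extension set comes from the list above; the predictor's actual detection logic may differ):

```python
from pathlib import Path

# Extensions treated as video, per the list above
VIDEO_EXTS = {".mp4", ".avi", ".mov", ".mkv", ".webm", ".flv", ".wmv", ".mpeg", ".3gp"}

def routes_to_video(path: str) -> bool:
    """True if `path` would be routed to the video pipeline by extension."""
    return Path(path).suffix.lower() in VIDEO_EXTS
```

Run it on your input path before calling the predictor to rule out an extension mismatch.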

Cause 3: Too many frames causing OOM. Reduce frame sampling rate:

from vi.deployment.nim import NIMSamplingParams

params = NIMSamplingParams(
    media_io_kwargs={"fps": 1.0},
    mm_processor_kwargs={"shortest_edge": 168}
)

Empty or invalid response

Cause 1: Task type mismatch. Auto-detection failed; specify the task type explicitly:

predictor = NIMPredictor(
    model_name="cosmos-reason2-2b",
    task_type="vqa",   # or "phrase-grounding", "freeform-text"
    port=8000
)

Cause 2: Temperature too high. High temperature can produce incoherent output:

from vi.deployment.nim import NIMSamplingParams

params = NIMSamplingParams(temperature=0.2, top_p=0.9)

Cause 3: Output truncated. max_tokens is too low:

params = NIMSamplingParams(
    max_tokens=2048,
    min_tokens=100
)

Authentication issues

Vi credentials not found

Symptom: InvalidConfigError: No secret_key provided

Fix (set environment variables):

export DATATURE_VI_SECRET_KEY="your-secret-key"
export DATATURE_VI_ORGANIZATION_ID="your-org-id"

Or pass the credentials explicitly in code:

config = NIMConfig(
    nvidia_api_key="nvapi-...",
    secret_key="your-secret-key",
    organization_id="your-org-id",
    run_id="your-run-id"
)
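To fail fast when the environment variables are missing, you can validate them up front — a sketch (`load_vi_credentials` is an illustrative helper, not part of the SDK):

```python
import os

def load_vi_credentials() -> dict:
    """Read Datature Vi credentials from the environment, failing fast if any is missing."""
    required = ("DATATURE_VI_SECRET_KEY", "DATATURE_VI_ORGANIZATION_ID")
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {
        "secret_key": os.environ["DATATURE_VI_SECRET_KEY"],
        "organization_id": os.environ["DATATURE_VI_ORGANIZATION_ID"],
    }
```

The dict unpacks directly into the config: `NIMConfig(nvidia_api_key="nvapi-...", **load_vi_credentials())`.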

Model download fails

Symptom: Failed to download model weights from Datature Vi.

Diagnose by testing credentials directly:

from vi.api.client import ViClient

client = ViClient(
    secret_key="your-secret-key",
    organization_id="your-org-id"
)

result = client.get_model(run_id="your-run-id", save_path="./test")

Common causes: invalid credentials, run ID not found, training not complete, or a network issue.


Docker issues

Docker daemon not running

Symptom: docker.errors.DockerException: Error while fetching server API version

sudo systemctl start docker
sudo systemctl enable docker
sudo systemctl status docker

Permission denied

Symptom: Permission denied while trying to connect to Docker daemon

sudo usermod -aG docker $USER
newgrp docker
docker ps  # verify

Port already in use

Symptom: Bind for 0.0.0.0:8000 failed: port is already allocated

lsof -i :8000  # find the process holding the port

Or deploy on a different port:

config = NIMConfig(
    nvidia_api_key="nvapi-...",
    port=8080
)
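If you would rather not guess at an unused port, you can ask the OS for one and pass it to the config — a minimal sketch:

```python
import socket

def find_free_port() -> int:
    """Bind to port 0 so the OS assigns an unused TCP port, and return it."""
    with socket.socket() as s:
        s.bind(("", 0))
        return s.getsockname()[1]
```

For example: `NIMConfig(nvidia_api_key="nvapi-...", port=find_free_port())`. Note the port is only reserved while the socket is bound, so a race with another process is possible but unlikely.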

General tips

Enable debug logging

import logging

logging.basicConfig(level=logging.DEBUG)

deployer = NIMDeployer(config)
result = deployer.deploy()

Check container logs

docker logs cosmos-reason2-2b
docker logs -f cosmos-reason2-2b          # follow in real-time
docker logs --tail 100 cosmos-reason2-2b  # last 100 lines

Monitor resource usage

nvidia-smi                           # GPU usage
docker stats cosmos-reason2-2b       # container stats
docker system df                     # disk usage

Free up resources

# Stop all NIM containers (the ancestor filter needs an exact image name, so filter by name instead)
docker stop $(docker ps -q --filter "name=cosmos-reason")

# Remove stopped containers
docker container prune

# Remove unused images (frees significant space)
docker image prune

Getting help

If none of the above fixes your issue, collect logs before reporting:

docker logs cosmos-reason2-2b > nim.log
nvidia-smi > gpu-info.txt
docker info > docker-info.txt
docker version >> docker-info.txt

Then check:

NVIDIA NIM User Guide

Official getting started guide for NIM vision-language models.

NIM Support Matrix

Check which models and configurations are officially supported.

Vi SDK GitHub Issues

Report a bug or search for existing solutions. Include your logs and system info.


Related resources

Deploy A Container

Container setup, image selection, and lifecycle management.

Run Inference

NIMPredictor usage, streaming, sampling, and video processing.

Configuration Reference

All NIMConfig and NIMSamplingParams parameters.