NIM Troubleshooting

Common issues and solutions for NVIDIA NIM deployment and inference.

Deployment issues

Invalid API key

Problem: InvalidConfigError: API key must start with 'nvapi-'

Cause: NGC API key has incorrect format.

Solution:

# ✅ Correct format
config = NIMConfig(nvidia_api_key="nvapi-...")

# ❌ Wrong format
config = NIMConfig(nvidia_api_key="your-key")  # Missing nvapi- prefix

Verify your key:

echo $NGC_API_KEY
# Should output: nvapi-...
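To catch a malformed key in code before deployment, a minimal sanity check (the helper below is ours, not part of the SDK):

```python
def looks_like_ngc_key(key: str) -> bool:
    """Heuristic format check for an NGC API key: must have the
    nvapi- prefix and a non-empty remainder."""
    return key.startswith("nvapi-") and len(key) > len("nvapi-")
```

Call it before constructing `NIMConfig` to surface the error early instead of at deploy time.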

Container already exists

Problem: ContainerExistsError: Container 'cosmos-reason2-2b' already exists

Cause: A container with the same name is already running.

Solution 1 — Reuse existing container:

config = NIMConfig(
    nvidia_api_key="nvapi-...",
    use_existing_container=True  # Reuse existing
)

deployer = NIMDeployer(config)
result = deployer.deploy()  # Instant

Solution 2 — Auto-remove existing:

config = NIMConfig(
    nvidia_api_key="nvapi-...",
    auto_kill_existing_container=True  # Remove existing
)

deployer = NIMDeployer(config)
result = deployer.deploy()

Solution 3 — Manual removal:

# Stop and remove container
docker stop cosmos-reason2-2b
docker rm cosmos-reason2-2b

Or using Vi SDK:

from vi.deployment.nim import NIMDeployer

NIMDeployer.stop("cosmos-reason2-2b")

Image pull fails

Problem: APIError: Failed to pull image

Causes & Solutions:

1. Authentication failure

Error: unauthorized: authentication required

Solution:

# Verify NGC API key
echo $NGC_API_KEY

# Test Docker login
docker login nvcr.io
# Username: $oauthtoken
# Password: <your-ngc-api-key>

2. Network issues

Solution:

# Test connectivity
ping nvcr.io

# Check Docker daemon
docker info

# Try manual pull
docker pull nvcr.io/nim/nvidia/cosmos-reason2-2b:latest

3. Insufficient disk space

Check disk space:

df -h

# NIM images are large (10-20 GB)

Solution: Free up disk space or use different cache directory:

config = NIMConfig(
    nvidia_api_key="nvapi-...",
    local_cache_dir="/mnt/large_disk/nim"
)
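You can also check free space from Python before pulling; `free_gib` and the 25 GiB threshold below are our assumptions, not SDK API:

```python
import shutil

def free_gib(path: str = ".") -> float:
    """Free disk space on the filesystem containing `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30

# NIM images can exceed 20 GiB; warn below an assumed 25 GiB margin
if free_gib(".") < 25:
    print("Warning: low disk space for NIM images")
```

Point it at your Docker data root (commonly /var/lib/docker) or the cache directory you configured.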

Model incompatibility

Problem: ModelIncompatibilityError: Model incompatible with container

Cause: Custom model architecture not supported by the NIM container.

Solution:

Check NVIDIA NIM compatibility matrix:

# Ensure model architecture matches container
# Example: Qwen-based models work with cosmos-reason1/2

# If incompatible, use base model instead
config = NIMConfig(
    nvidia_api_key="nvapi-...",
    # Don't specify run_id - uses base model
)
📘 Note: LoRA adapters are not supported by NVIDIA NIM. Only full model weights are used.


Service not ready

Problem: Deployment completes but service isn't responding.

Symptoms:

  • Container running but predictions fail
  • Health check timeout

Diagnosis:

# Check container logs
docker logs cosmos-reason2-2b

# Check if process is running
docker exec cosmos-reason2-2b ps aux

# Check port binding
docker port cosmos-reason2-2b

Solution 1 — Wait longer:

NIM services can take 5-10 minutes to initialize on first run:

# The deployer waits up to 10 minutes by default
deployer = NIMDeployer(config)
result = deployer.deploy()  # Will wait

Solution 2 — Check GPU:

# Verify GPU is accessible
nvidia-smi

# Check GPU in container
docker exec cosmos-reason2-2b nvidia-smi

Solution 3 — Increase timeout:

Modify timeout in code (advanced):

# containers/utils.py uses DEFAULT_SERVICE_TIMEOUT = 600s
# To increase, wait manually after deployment
import time

result = deployer.deploy()
time.sleep(300)  # Additional 5 minutes
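Instead of a fixed sleep, you can poll until the service answers. NIM containers expose a `/v1/health/ready` endpoint; the helper functions below are our sketch, not SDK API:

```python
import time
import urllib.request

def wait_until_ready(probe, timeout: float = 600, interval: float = 10) -> bool:
    """Poll `probe` (a zero-arg callable returning True once the service
    answers) until it succeeds or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False

def nim_probe(port: int = 8000) -> bool:
    """True if the NIM health endpoint responds with HTTP 200."""
    try:
        with urllib.request.urlopen(
            f"http://localhost:{port}/v1/health/ready", timeout=5
        ) as resp:
            return resp.status == 200
    except OSError:
        return False
```

Usage: `wait_until_ready(nim_probe, timeout=900)` after `deployer.deploy()` returns.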

GPU issues

No GPU detected

Problem: Container fails to start with GPU error.

Error: could not select device driver "" with capabilities: [[gpu]]

Diagnosis:

# Check NVIDIA drivers
nvidia-smi

# Check Docker GPU support
docker run --rm --gpus all ubuntu nvidia-smi

Solution 1 — Install NVIDIA Container Toolkit:

# Ubuntu/Debian
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
    sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Solution 2 — Verify Docker daemon configuration:

# Check daemon.json
cat /etc/docker/daemon.json

# Should contain:
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Out of GPU memory

Problem: Container starts but OOM during inference.

Error: RuntimeError: CUDA out of memory

Solution 1 — Use smaller model:

# Switch to smaller model
config = NIMConfig(
    nvidia_api_key="nvapi-...",
    image_name="cosmos-reason2-2b"  # Smaller than 8B
)

Solution 2 — Reduce context length:

config = NIMConfig(
    nvidia_api_key="nvapi-...",
    max_model_len=4096  # Reduce from 8192
)

Solution 3 — Limit token generation:

params = NIMSamplingParams(
    max_tokens=512  # Reduce from 1024
)

Solution 4 — Close other GPU applications:

# Check GPU memory usage
nvidia-smi

# Kill other processes using GPU if needed

Inference issues

Connection refused

Problem: Connection refused when calling predictor.

Cause: Service not running or wrong port.

Solution:

# Check if container is running
docker ps | grep cosmos-reason

# Check port binding
docker port cosmos-reason2-2b

# Should show: 8000/tcp -> 0.0.0.0:8000

Ensure correct port:

# Match port in config and predictor
config = NIMConfig(
    nvidia_api_key="nvapi-...",
    port=8000
)

predictor = NIMPredictor(
    config=config  # Uses same port
)

# Or specify explicitly
predictor = NIMPredictor(
    model_name="cosmos-reason2-2b",
    port=8000  # Match container port
)
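To rule out connectivity from code before digging into the container, a quick TCP reachability check (the helper is ours):

```python
import socket

def port_open(host: str = "localhost", port: int = 8000,
              timeout: float = 2.0) -> bool:
    """True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns False while the container is running, the port binding (not the model) is the problem.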

Slow inference

Problem: Inference takes very long.

Causes & Solutions:

1. First request is slow

Cause: Model loading and compilation.

Solution: First request will be slower (30-60s). Subsequent requests are faster.

# Warm up the model
result = predictor(source="image.jpg", stream=False)
# Now subsequent calls will be faster
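To confirm the warm-up effect, time successive calls with a small helper (ours, not SDK API):

```python
import time

def timed(fn):
    """Run `fn` and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start
```

For example, `_, cold = timed(lambda: predictor(source="image.jpg", stream=False))` followed by a second timed call should show a large drop in latency.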

2. Large images

Solution: Resize images before inference:

from PIL import Image

# Downscale in place, preserving aspect ratio
# (resize((1024, 1024)) would distort non-square images)
img = Image.open("large_image.jpg")
img.thumbnail((1024, 1024))
img.save("resized.jpg")

result = predictor(source="resized.jpg", stream=False)

3. Long max_tokens

Solution: Reduce max_tokens:

params = NIMSamplingParams(
    max_tokens=512,  # Reduce from 2048
    temperature=0.2  # Lower temperature gives more focused output
)

Video inference fails

Problem: Video inference throws error or returns bad results.

Causes & Solutions:

1. Wrong model

Cause: Video only supported on Cosmos-Reason2 models.

Solution:

# ✅ Correct - Cosmos-Reason2
config = NIMConfig(
    nvidia_api_key="nvapi-...",
    image_name="cosmos-reason2-2b"  # or cosmos-reason2-8b
)

# ❌ Wrong - Cosmos-Reason1 doesn't support video
config = NIMConfig(
    nvidia_api_key="nvapi-...",
    image_name="cosmos-reason1-7b"
)

2. Video file not recognized

Symptom: File treated as image instead of video.

Solution: Ensure proper file extension:

# Supported video extensions
# .mp4, .avi, .mov, .mkv, .webm, .flv, .wmv, .mpeg, .3gp

# Copy to a recognized extension if needed
import shutil
shutil.copy("video.dat", "video.mp4")

result = predictor(source="video.mp4", stream=False)
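To guard against this in code, check the extension before calling the predictor (the helper and constant below are ours; the extension list is the one above):

```python
from pathlib import Path

# Supported video extensions, per the list above
VIDEO_EXTS = {".mp4", ".avi", ".mov", ".mkv", ".webm",
              ".flv", ".wmv", ".mpeg", ".3gp"}

def is_video_file(path: str) -> bool:
    """True if the file extension marks the input as video."""
    return Path(path).suffix.lower() in VIDEO_EXTS
```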

3. Too many frames

Cause: High FPS or long video causing OOM.

Solution: Reduce frame sampling:

params = NIMSamplingParams(
    media_io_kwargs={"fps": 1.0},  # Reduce from 2.0+
    mm_processor_kwargs={
        "shortest_edge": 168  # Reduce resolution
    }
)
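As a back-of-envelope check, the number of sampled frames scales with duration × fps, so halving fps roughly halves the frame memory. A tiny estimator (ours, for illustration):

```python
def sampled_frames(duration_s: float, fps: float) -> int:
    """Rough count of frames extracted at a given sampling rate."""
    return max(1, int(duration_s * fps))
```

A 60 s clip at 2.0 fps yields ~120 frames; dropping to 1.0 fps yields ~60.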

Empty or invalid response

Problem: Predictor returns empty or malformed response.

Causes & Solutions:

1. Task type mismatch

Solution: Specify correct task type:

# If auto-detection fails, specify explicitly
predictor = NIMPredictor(
    model_name="cosmos-reason2-2b",
    task_type="vqa",  # or "phrase-grounding", "freeform"
    port=8000
)

2. Temperature too high

Cause: High temperature causes gibberish.

Solution:

params = NIMSamplingParams(
    temperature=0.2,  # Lower temperature
    top_p=0.9
)

3. Insufficient max_tokens

Cause: Output truncated before completion.

Solution:

params = NIMSamplingParams(
    max_tokens=2048,  # Increase limit
    min_tokens=100    # Ensure minimum length
)

Authentication issues

Vi credentials not found

Problem: InvalidConfigError: No secret_key provided

Cause: Vi credentials not set for custom weights.

Solution:

# Set environment variables
export DATATURE_VI_SECRET_KEY="your-secret-key"
export DATATURE_VI_ORGANIZATION_ID="your-org-id"

Or provide in config:

config = NIMConfig(
    nvidia_api_key="nvapi-...",
    secret_key="your-secret-key",
    organization_id="your-org-id",
    run_id="your-run-id"
)
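If you rely on the environment variables, a small helper (ours, not part of the SDK) can fail loudly before any deployment starts:

```python
import os

def load_vi_credentials() -> dict:
    """Read Vi credentials from the environment, raising if any are absent."""
    creds = {}
    for key, var in [("secret_key", "DATATURE_VI_SECRET_KEY"),
                     ("organization_id", "DATATURE_VI_ORGANIZATION_ID")]:
        value = os.environ.get(var)
        if not value:
            raise RuntimeError(f"environment variable {var} is not set")
        creds[key] = value
    return creds
```

The returned dict can be splatted straight into the config: `NIMConfig(nvidia_api_key="nvapi-...", **load_vi_credentials())`.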

Model download fails

Problem: Failed to download model from Datature Vi.

Diagnosis:

# Test credentials manually
from vi.api.client import ViClient

client = ViClient(
    secret_key="your-secret-key",
    organization_id="your-org-id"
)

# Try to get model
result = client.get_model(
    run_id="your-run-id",
    save_path="./test"
)

Common causes:

  1. Invalid credentials
  2. Run ID not found
  3. Model training not completed
  4. Network issues

Docker issues

Docker daemon not running

Problem: docker.errors.DockerException: Error while fetching server API version

Solution:

# Start Docker daemon
sudo systemctl start docker

# Enable on boot
sudo systemctl enable docker

# Check status
sudo systemctl status docker

Permission denied

Problem: Permission denied while trying to connect to Docker daemon

Solution:

# Add user to docker group
sudo usermod -aG docker $USER

# Log out and back in, or run:
newgrp docker

# Test
docker ps

Port already in use

Problem: docker: Error response from daemon: driver failed programming external connectivity: Bind for 0.0.0.0:8000 failed: port is already allocated

Solution:

# Find process using port
lsof -i :8000

# Kill process or use different port
config = NIMConfig(
    nvidia_api_key="nvapi-...",
    port=8080  # Use different port
)
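To pick a free port programmatically instead of hard-coding one, a small scanner (ours; note the returned port is not reserved, so a race with another process is possible):

```python
import socket

def find_free_port(start: int = 8000, end: int = 8100) -> int:
    """Return the first TCP port in [start, end) that binds on localhost."""
    for port in range(start, min(end, 65536)):
        with socket.socket() as s:
            try:
                s.bind(("127.0.0.1", port))
            except OSError:
                continue  # port occupied, try the next one
            return port
    raise RuntimeError(f"no free port in {start}-{end}")
```

Pass the result as `port=` to both `NIMConfig` and `NIMPredictor`.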

General tips

Enable debug logging

Get more detailed error information:

import logging

# Enable debug logging
logging.basicConfig(level=logging.DEBUG)

# Now run deployment
deployer = NIMDeployer(config)
result = deployer.deploy()

Check container logs

Always check logs when troubleshooting:

# View logs
docker logs cosmos-reason2-2b

# Follow logs in real-time
docker logs -f cosmos-reason2-2b

# Last 100 lines
docker logs --tail 100 cosmos-reason2-2b

Monitor resources

Track resource usage:

# GPU usage
nvidia-smi

# Container stats
docker stats cosmos-reason2-2b

# Disk usage
docker system df

Clean up resources

Free up resources:

# Stop all running Cosmos-Reason NIM containers
# (the ancestor filter does not accept wildcards, so filter by name)
docker stop $(docker ps -q --filter "name=cosmos-reason")

# Remove stopped containers
docker container prune

# Remove unused images
docker image prune

# Remove all unused data
docker system prune

Getting help

If you're still experiencing issues:

  1. Check container logs:

    docker logs cosmos-reason2-2b > nim.log

  2. Collect system information:

    nvidia-smi > gpu-info.txt
    docker info > docker-info.txt
    docker version >> docker-info.txt

  3. Check the NVIDIA NIM documentation for known issues.

  4. Report the issue, attaching the logs and system information collected above.
