NIM Troubleshooting
Common issues and solutions for NVIDIA NIM deployment and inference.
Deployment issues
Invalid API key
Problem: InvalidConfigError: API key must start with 'nvapi-'
Cause: NGC API key has incorrect format.
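A key with the wrong prefix fails before deployment even starts, so it is worth checking up front. A minimal pre-check (the `looks_like_ngc_key` helper is our own illustration, not an SDK function):

```python
def looks_like_ngc_key(key: str) -> bool:
    """Heuristic format check mirroring the required 'nvapi-' prefix."""
    return isinstance(key, str) and key.startswith("nvapi-") and len(key) > len("nvapi-")

# Quick sanity check before building the config
assert looks_like_ngc_key("nvapi-abc123")
assert not looks_like_ngc_key("your-key")  # missing prefix
```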
Solution:
```python
# ✅ Correct format
config = NIMConfig(nvidia_api_key="nvapi-...")

# ❌ Wrong format
config = NIMConfig(nvidia_api_key="your-key")  # Missing nvapi- prefix
```
Verify your key:
```bash
echo $NGC_API_KEY
# Should output: nvapi-...
```
Container already exists
Problem: ContainerExistsError: Container 'cosmos-reason2-2b' already exists
Cause: A container with the same name is already running.
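To confirm which container is holding the name before choosing a solution, you can query the Docker CLI. A sketch using `subprocess` (the helper is illustrative; it returns `None` when the `docker` binary is unavailable):

```python
import shutil
import subprocess

def container_exists(name: str):
    """True/False if Docker lists a container with this exact name;
    None when the docker CLI is not installed."""
    if shutil.which("docker") is None:
        return None
    proc = subprocess.run(
        ["docker", "ps", "-a", "--filter", f"name=^{name}$", "--format", "{{.Names}}"],
        capture_output=True, text=True,
    )
    return name in proc.stdout.split()
```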
Solution 1 — Reuse existing container:
```python
config = NIMConfig(
    nvidia_api_key="nvapi-...",
    use_existing_container=True  # Reuse existing
)
deployer = NIMDeployer(config)
result = deployer.deploy()  # Instant
```
Solution 2 — Auto-remove existing:
```python
config = NIMConfig(
    nvidia_api_key="nvapi-...",
    auto_kill_existing_container=True  # Remove existing
)
deployer = NIMDeployer(config)
result = deployer.deploy()
```
Solution 3 — Manual removal:
```bash
# Stop and remove container
docker stop cosmos-reason2-2b
docker rm cosmos-reason2-2b
```
Or using the Vi SDK:
```python
from vi.deployment.nim import NIMDeployer

NIMDeployer.stop("cosmos-reason2-2b")
```
Image pull fails
Problem: APIError: Failed to pull image
Causes & Solutions:
1. Authentication failure
Error: unauthorized: authentication required
Solution:
```bash
# Verify NGC API key
echo $NGC_API_KEY

# Test Docker login
docker login nvcr.io
# Username: $oauthtoken
# Password: <your-ngc-api-key>
```
2. Network issues
Solution:
```bash
# Test connectivity
ping nvcr.io

# Check Docker daemon
docker info

# Try manual pull
docker pull nvcr.io/nim/nvidia/cosmos-reason2-2b:latest
```
3. Insufficient disk space
Check disk space:
```bash
df -h
# NIM images are large (10-20 GB)
```
Solution: Free up disk space or use a different cache directory:
```python
config = NIMConfig(
    nvidia_api_key="nvapi-...",
    local_cache_dir="/mnt/large_disk/nim"
)
```
Model incompatibility
Problem: ModelIncompatibilityError: Model incompatible with container
Cause: Custom model architecture not supported by the NIM container.
Solution:
Check the NVIDIA NIM compatibility matrix:
```python
# Ensure model architecture matches container
# Example: Qwen-based models work with cosmos-reason1/2
# If incompatible, use base model instead
config = NIMConfig(
    nvidia_api_key="nvapi-...",
    # Don't specify run_id - uses base model
)
```
Note: LoRA adapters are not supported by NVIDIA NIM. Only full model weights are used.
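The compatibility check can be scripted against the model's Hugging Face-style `config.json`. The supported set below is an assumption for illustration; consult NVIDIA's compatibility matrix for the authoritative list:

```python
import json
from pathlib import Path

# Illustrative set of model types accepted by the Cosmos-Reason containers
SUPPORTED_MODEL_TYPES = {"qwen2_vl", "qwen2_5_vl"}

def is_nim_compatible(model_dir: str) -> bool:
    """Read config.json from a model directory and check its model_type."""
    cfg = json.loads((Path(model_dir) / "config.json").read_text())
    return cfg.get("model_type") in SUPPORTED_MODEL_TYPES
```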
Service not ready
Problem: Deployment completes but service isn't responding.
Symptoms:
- Container running but predictions fail
- Health check timeout
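One quick check is to poll the readiness endpoint directly (a sketch assuming the container exposes the standard NIM `/v1/health/ready` route on the mapped port):

```python
import urllib.error
import urllib.request

def service_ready(base_url: str = "http://localhost:8000", timeout: float = 3.0) -> bool:
    """Return True once the NIM readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/health/ready", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```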
Diagnosis:
```bash
# Check container logs
docker logs cosmos-reason2-2b

# Check if process is running
docker exec cosmos-reason2-2b ps aux

# Check port binding
docker port cosmos-reason2-2b
```
Solution 1 — Wait longer:
NIM services can take 5-10 minutes to initialize on first run:
```python
# The deployer waits up to 10 minutes by default
deployer = NIMDeployer(config)
result = deployer.deploy()  # Will wait
```
Solution 2 — Check GPU:
```bash
# Verify GPU is accessible
nvidia-smi

# Check GPU in container
docker exec cosmos-reason2-2b nvidia-smi
```
Solution 3 — Increase timeout:
Modify the timeout in code (advanced):
```python
# containers/utils.py uses DEFAULT_SERVICE_TIMEOUT = 600s
# To increase, wait manually after deployment
import time

result = deployer.deploy()
time.sleep(300)  # Additional 5 minutes
```
GPU issues
No GPU detected
Problem: Container fails to start with GPU error.
Error: could not select device driver "" with capabilities: [[gpu]]
Diagnosis:
```bash
# Check NVIDIA drivers
nvidia-smi

# Check Docker GPU support
docker run --rm --gpus all ubuntu nvidia-smi
```
Solution 1 — Install the NVIDIA Container Toolkit:
```bash
# Ubuntu/Debian
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
```
Solution 2 — Verify Docker daemon configuration:
```bash
# Check daemon.json
cat /etc/docker/daemon.json
```
It should contain:
```json
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```
Out of GPU memory
Problem: Container starts but OOM during inference.
Error: RuntimeError: CUDA out of memory
Solution 1 — Use a smaller model:
```python
# Switch to smaller model
config = NIMConfig(
    nvidia_api_key="nvapi-...",
    image_name="cosmos-reason2-2b"  # Smaller than 8B
)
```
Solution 2 — Reduce context length:
```python
config = NIMConfig(
    nvidia_api_key="nvapi-...",
    max_model_len=4096  # Reduce from 8192
)
```
Solution 3 — Limit token generation:
```python
params = NIMSamplingParams(
    max_tokens=512  # Reduce from 1024
)
```
Solution 4 — Close other GPU applications:
```bash
# Check GPU memory usage
nvidia-smi
# Kill other processes using the GPU if needed
```
Inference issues
Connection refused
Problem: Connection refused when calling predictor.
Cause: Service not running or wrong port.
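You can distinguish "service down" from "wrong port" with a plain TCP probe (illustrative helper, not part of the SDK):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# port_open("localhost", 8000) should be True while the container is up
```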
Solution:
```bash
# Check if container is running
docker ps | grep cosmos-reason

# Check port binding
docker port cosmos-reason2-2b
# Should show: 8000/tcp -> 0.0.0.0:8000
```
Ensure the correct port:
```python
# Match port in config and predictor
config = NIMConfig(
    nvidia_api_key="nvapi-...",
    port=8000
)
predictor = NIMPredictor(
    config=config  # Uses same port
)

# Or specify explicitly
predictor = NIMPredictor(
    model_name="cosmos-reason2-2b",
    port=8000  # Match container port
)
```
Slow inference
Problem: Inference takes very long.
Causes & Solutions:
1. First request is slow
Cause: Model loading and compilation.
Solution: The first request will be slower (30-60 s). Subsequent requests are faster.
```python
# Warm up the model
result = predictor(source="image.jpg", stream=False)
# Now subsequent calls will be faster
```
2. Large images
Solution: Resize images before inference:
```python
from PIL import Image

# Resize large images
img = Image.open("large_image.jpg")
img = img.resize((1024, 1024))
img.save("resized.jpg")

result = predictor(source="resized.jpg", stream=False)
```
3. Long max_tokens
Solution: Reduce max_tokens:
```python
params = NIMSamplingParams(
    max_tokens=512,  # Reduce from 2048
    temperature=0.2  # Lower temperature is faster
)
```
Video inference fails
Problem: Video inference throws error or returns bad results.
Causes & Solutions:
1. Wrong model
Cause: Video only supported on Cosmos-Reason2 models.
Solution:
```python
# ✅ Correct - Cosmos-Reason2
config = NIMConfig(
    nvidia_api_key="nvapi-...",
    image_name="cosmos-reason2-2b"  # or cosmos-reason2-8b
)

# ❌ Wrong - Cosmos-Reason1 doesn't support video
config = NIMConfig(
    nvidia_api_key="nvapi-...",
    image_name="cosmos-reason1-7b"
)
```
2. Video file not recognized
Error: File treated as image instead of video.
Solution: Ensure a proper file extension:
```python
# Supported video extensions:
# .mp4, .avi, .mov, .mkv, .webm, .flv, .wmv, .mpeg, .3gp

# Rename if needed
import shutil
shutil.copy("video.dat", "video.mp4")

result = predictor(source="video.mp4", stream=False)
```
3. Too many frames
Cause: High FPS or long video causing OOM.
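As a back-of-envelope estimate, the number of sampled frames grows with duration × fps, so halving the fps roughly halves the vision workload (actual sampling behavior is container-dependent; this is only a rule of thumb):

```python
def estimated_frames(duration_s: float, fps: float) -> int:
    """Rough count of frames sampled from a clip at a given sampling fps."""
    return int(duration_s * fps)

# A 60 s clip: fps=2.0 samples ~120 frames, fps=1.0 only ~60
```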
Solution: Reduce frame sampling:
```python
params = NIMSamplingParams(
    media_io_kwargs={"fps": 1.0},  # Reduce from 2.0+
    mm_processor_kwargs={
        "shortest_edge": 168  # Reduce resolution
    }
)
```
Empty or invalid response
Problem: Predictor returns empty or malformed response.
Causes & Solutions:
1. Task type mismatch
Solution: Specify the correct task type:
```python
# If auto-detection fails, specify explicitly
predictor = NIMPredictor(
    model_name="cosmos-reason2-2b",
    task_type="vqa",  # or "phrase-grounding", "freeform"
    port=8000
)
```
2. Temperature too high
Cause: High temperature causes gibberish.
Solution:
```python
params = NIMSamplingParams(
    temperature=0.2,  # Lower temperature
    top_p=0.9
)
```
3. Insufficient max_tokens
Cause: Output truncated before completion.
Solution:
```python
params = NIMSamplingParams(
    max_tokens=2048,  # Increase limit
    min_tokens=100    # Ensure minimum length
)
```
Authentication issues
Vi credentials not found
Problem: InvalidConfigError: No secret_key provided
Cause: Vi credentials not set for custom weights.
Solution:
```bash
# Set environment variables
export DATATURE_VI_SECRET_KEY="your-secret-key"
export DATATURE_VI_ORGANIZATION_ID="your-org-id"
```
Or provide them in the config:
```python
config = NIMConfig(
    nvidia_api_key="nvapi-...",
    secret_key="your-secret-key",
    organization_id="your-org-id",
    run_id="your-run-id"
)
```
Model download fails
Problem: Failed to download model from Datature Vi.
Diagnosis:
```python
# Test credentials manually
from vi.api.client import ViClient

client = ViClient(
    secret_key="your-secret-key",
    organization_id="your-org-id"
)

# Try to get the model
result = client.get_model(
    run_id="your-run-id",
    save_path="./test"
)
```
Common causes:
- Invalid credentials
- Run ID not found
- Model training not completed
- Network issues
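The first cause can be screened for before making any API call. A sketch (the preflight helper is ours; it only inspects the environment variables named above):

```python
import os

def preflight_vi_credentials() -> list:
    """List missing Vi credential environment variables, if any."""
    problems = []
    for var in ("DATATURE_VI_SECRET_KEY", "DATATURE_VI_ORGANIZATION_ID"):
        if not os.environ.get(var):
            problems.append(f"{var} is not set")
    return problems
```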
Docker issues
Docker daemon not running
Problem: docker.errors.DockerException: Error while fetching server API version
Solution:
```bash
# Start Docker daemon
sudo systemctl start docker

# Enable on boot
sudo systemctl enable docker

# Check status
sudo systemctl status docker
```
Permission denied
Problem: Permission denied while trying to connect to Docker daemon
Solution:
```bash
# Add user to docker group
sudo usermod -aG docker $USER

# Log out and back in, or run:
newgrp docker

# Test
docker ps
```
Port already in use
Problem: docker: Error response from daemon: driver failed programming external connectivity: Bind for 0.0.0.0:8000 failed: port is already allocated
Solution:
```bash
# Find process using port
lsof -i :8000
```
Kill the process, or use a different port:
```python
config = NIMConfig(
    nvidia_api_key="nvapi-...",
    port=8080  # Use different port
)
```
General tips
Enable debug logging
Get more detailed error information:
```python
import logging

# Enable debug logging
logging.basicConfig(level=logging.DEBUG)

# Now run deployment
deployer = NIMDeployer(config)
result = deployer.deploy()
```
Check container logs
Always check logs when troubleshooting:
```bash
# View logs
docker logs cosmos-reason2-2b

# Follow logs in real-time
docker logs -f cosmos-reason2-2b

# Last 100 lines
docker logs --tail 100 cosmos-reason2-2b
```
Monitor resources
Track resource usage:
```bash
# GPU usage
nvidia-smi

# Container stats
docker stats cosmos-reason2-2b

# Disk usage
docker system df
```
Clean up resources
Free up resources:
```bash
# Stop all NIM containers
docker stop $(docker ps -q --filter ancestor=nvcr.io/nim/nvidia/*)

# Remove stopped containers
docker container prune

# Remove unused images
docker image prune

# Remove all unused data
docker system prune
```
Getting help
If you're still experiencing issues:
1. Check container logs:
```bash
docker logs cosmos-reason2-2b > nim.log
```
2. Collect system information:
```bash
nvidia-smi > gpu-info.txt
docker info > docker-info.txt
docker version >> docker-info.txt
```
3. Check the NVIDIA documentation.
4. Report the issue:
- Vi SDK GitHub Issues
- Include logs and system information
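The log and system-information steps above can be bundled into one script (a sketch; the command list and filenames are arbitrary, and failures are recorded in the output rather than raised):

```python
import subprocess

def collect_diagnostics(outfile: str, commands=None) -> None:
    """Run each diagnostic command and append its output (or error) to one file."""
    commands = commands or [
        ["docker", "logs", "cosmos-reason2-2b"],
        ["nvidia-smi"],
        ["docker", "info"],
        ["docker", "version"],
    ]
    with open(outfile, "w") as f:
        for cmd in commands:
            f.write("$ " + " ".join(cmd) + "\n")
            try:
                proc = subprocess.run(cmd, capture_output=True, text=True, timeout=60)
                f.write(proc.stdout + proc.stderr + "\n")
            except (OSError, subprocess.TimeoutExpired) as exc:
                f.write(f"(failed: {exc})\n")
```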
See also
- NIM Overview — Introduction to NVIDIA NIM deployment
- Deploy container — Deployment guide
- Run inference — Inference guide
- Configuration reference — Complete configuration options
Need help?
We're here to support your VLMOps journey. Reach out through our support channels.