NIM Troubleshooting
This page covers the most common errors you'll encounter with NVIDIA NIM deployment and inference in Datature Vi, along with their causes and fixes. For an overview of the NIM integration, see NVIDIA NIM Deployment.
Deployment issues
Invalid API key
Symptom: InvalidConfigError: API key must start with 'nvapi-'
Cause: The NGC API key format is wrong.
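If you want to fail fast, a plain-Python sanity check catches this before the deployer does (sketch only; `looks_like_ngc_key` is a local helper for this page, not part of the vi SDK):

```python
def looks_like_ngc_key(key: str) -> bool:
    # NGC personal API keys begin with the literal prefix "nvapi-"
    return (
        isinstance(key, str)
        and key.startswith("nvapi-")
        and len(key) > len("nvapi-")
    )
```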
Fix:
from vi.deployment.nim import NIMConfig
# Correct
config = NIMConfig(nvidia_api_key="nvapi-...")
# Wrong: missing nvapi- prefix
# config = NIMConfig(nvidia_api_key="your-key")
Verify the key set in your environment:
echo $NGC_API_KEY
# Should print: nvapi-...
Container already exists
Symptom: ContainerExistsError: Container 'cosmos-reason2-2b' already exists
Cause: A container with the same name is already running.
Fix (reuse the existing container):
config = NIMConfig(
nvidia_api_key="nvapi-...",
use_existing_container=True
)
deployer = NIMDeployer(config)
result = deployer.deploy() # returns immediately
Fix (auto-remove and redeploy):
config = NIMConfig(
nvidia_api_key="nvapi-...",
auto_kill_existing_container=True
)
deployer = NIMDeployer(config)
result = deployer.deploy()
Fix (remove manually):
docker stop cosmos-reason2-2b
docker rm cosmos-reason2-2b
Or stop it from Python:
from vi.deployment.nim import NIMDeployer
NIMDeployer.stop("cosmos-reason2-2b")
Image pull fails
Symptom: APIError: Failed to pull image
Cause 1: Authentication failure (unauthorized: authentication required):
echo $NGC_API_KEY
# Test Docker login
docker login nvcr.io
# Username: $oauthtoken
# Password: <your-ngc-api-key>
Cause 2: Network issues:
ping nvcr.io
docker info
# Try pulling manually
docker pull nvcr.io/nim/nvidia/cosmos-reason2-2b:latest
Cause 3: Not enough disk space (NIM images are 10–20 GB):
df -h
Point the image cache at a larger disk:
config = NIMConfig(
nvidia_api_key="nvapi-...",
local_cache_dir="/mnt/large_disk/nim"
)
Model incompatibility
Symptom: ModelIncompatibilityError: Model incompatible with container
Cause: The custom model architecture is not supported by the chosen NIM image.
Fix: Check the NVIDIA NIM support matrix. If your model is incompatible, deploy with base model weights instead:
# Remove run_id to fall back to base model weights
config = NIMConfig(nvidia_api_key="nvapi-...")
LoRA adapters are not supported by NVIDIA NIM. Only full model weights are used at inference time.
Service not ready
Symptom: Deployment completes but predictions fail, or the health check times out.
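You can also poll the service's readiness endpoint yourself before sending predictions (stdlib-only sketch; NIM containers expose a health route, commonly /v1/health/ready — adjust the URL and port to your deployment):

```python
import time
from urllib.error import URLError
from urllib.request import urlopen

def wait_until_ready(url, timeout_s=600.0, interval_s=5.0, probe=None):
    """Poll until the service answers HTTP 200 on `url`, or until timeout."""
    if probe is None:
        def probe():
            try:
                with urlopen(url, timeout=5) as resp:
                    return resp.status == 200
            except URLError:
                return False
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False

# e.g. wait_until_ready("http://localhost:8000/v1/health/ready")
```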
Diagnose first:
docker logs cosmos-reason2-2b
docker exec cosmos-reason2-2b ps aux
docker port cosmos-reason2-2b
Fix 1: Wait longer. NIM services can take 5–10 minutes to initialize on first run. The deployer waits up to 10 minutes by default. If your hardware is slower, add extra wait time:
import time
deployer = NIMDeployer(config)
result = deployer.deploy()
time.sleep(300) # Extra 5 minutes if needed
Fix 2: Check GPU availability:
nvidia-smi
docker exec cosmos-reason2-2b nvidia-smi
GPU issues
No GPU detected
Symptom: could not select device driver "" with capabilities: [[gpu]]
Diagnose:
nvidia-smi
docker run --rm --gpus all ubuntu nvidia-smi
Fix 1: Install NVIDIA Container Toolkit:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
Fix 2: Verify Docker daemon config:
cat /etc/docker/daemon.json
It should include the NVIDIA runtime entry:
{
"runtimes": {
"nvidia": {
"path": "nvidia-container-runtime",
"runtimeArgs": []
}
}
}
Out of GPU memory
Symptom: RuntimeError: CUDA out of memory
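A back-of-the-envelope check helps choose a model size: weights alone need roughly parameters × bytes per parameter (2 for fp16/bf16), and the KV cache and activations add substantial headroom on top. A rough sketch:

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Approximate GPU memory for model weights only, in GB.
    Excludes KV cache and activations, which add significant overhead."""
    return n_params * bytes_per_param / 1e9

# A 2B-parameter model in fp16: ~4 GB of weights; an 8B model: ~16 GB.
```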
Fix 1: Switch to a smaller model:
config = NIMConfig(
nvidia_api_key="nvapi-...",
image_name="cosmos-reason2-2b" # smaller than 8B
)
Fix 2: Reduce context length:
config = NIMConfig(
nvidia_api_key="nvapi-...",
max_model_len=4096 # default is 8192
)
Fix 3: Limit token generation:
from vi.deployment.nim import NIMSamplingParams
params = NIMSamplingParams(max_tokens=512)
Fix 4: Close other applications using the GPU, then redeploy:
nvidia-smi
Inference issues
Connection refused
Symptom: Connection refused when calling the predictor.
Cause: The container is not running, or the port does not match.
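To confirm from Python whether anything is listening on the expected port (stdlib-only sketch):

```python
import socket

def port_open(host, port, timeout_s=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False

# e.g. port_open("localhost", 8000) should be True once the container is up
```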
docker ps | grep cosmos-reason
docker port cosmos-reason2-2b
# Expected: 8000/tcp -> 0.0.0.0:8000
Make sure the predictor uses the same port as the deployment:
config = NIMConfig(nvidia_api_key="nvapi-...", port=8000)
predictor = NIMPredictor(config=config) # uses port from config
# Or set explicitly
predictor = NIMPredictor(model_name="cosmos-reason2-2b", port=8000)
Slow inference
Cause 1: First request is slow. Model loading and JIT compilation run on the first call (30–60 seconds). Subsequent calls are faster. Warm up the model before timing:
# Run once before benchmarking
_ = predictor(source="image.jpg", stream=False)
# Now measure
result = predictor(source="image.jpg", stream=False)
Cause 2: Large images. Resize before inference:
from PIL import Image
img = Image.open("large_image.jpg")
img = img.resize((1024, 1024))
img.save("resized.jpg")
result = predictor(source="resized.jpg", stream=False)
Cause 3: max_tokens is too high. Reduce it:
from vi.deployment.nim import NIMSamplingParams
params = NIMSamplingParams(
max_tokens=512,
temperature=0.2
)
Video inference fails
Cause 1: Wrong model. Video input requires a Cosmos-Reason2 model:
# Correct
config = NIMConfig(nvidia_api_key="nvapi-...", image_name="cosmos-reason2-2b")
# Wrong: Cosmos-Reason1 does not support video
# config = NIMConfig(nvidia_api_key="nvapi-...", image_name="cosmos-reason1-7b")
Cause 2: File extension not recognized. The predictor detects video by extension:
# Recognized extensions: .mp4, .avi, .mov, .mkv, .webm, .flv, .wmv, .mpeg, .3gp
import shutil
# Copy under a recognized extension (the file must actually contain video data)
shutil.copy("video.dat", "video.mp4")
result = predictor(source="video.mp4", stream=False)
Cause 3: Too many frames causing OOM. Reduce the frame sampling rate:
from vi.deployment.nim import NIMSamplingParams
params = NIMSamplingParams(
media_io_kwargs={"fps": 1.0},
mm_processor_kwargs={"shortest_edge": 168}
)
Empty or invalid response
Cause 1: Task type mismatch. Auto-detection failed; specify the task type explicitly:
predictor = NIMPredictor(
model_name="cosmos-reason2-2b",
task_type="vqa", # or "phrase-grounding", "freeform-text"
port=8000
)
Cause 2: Temperature too high. High temperature can produce incoherent output:
from vi.deployment.nim import NIMSamplingParams
params = NIMSamplingParams(temperature=0.2, top_p=0.9)
Cause 3: Output truncated. max_tokens is too low:
params = NIMSamplingParams(
max_tokens=2048,
min_tokens=100
)
Authentication issues
Vi credentials not found
Symptom: InvalidConfigError: No secret_key provided
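To check which credentials are actually visible to your process (stdlib sketch; `missing_vi_credentials` is a local helper, not part of the vi SDK):

```python
import os

def missing_vi_credentials(env=None):
    """Return names of required Vi environment variables that are unset or empty."""
    env = os.environ if env is None else env
    required = ("DATATURE_VI_SECRET_KEY", "DATATURE_VI_ORGANIZATION_ID")
    return [name for name in required if not env.get(name)]
```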
Fix (set environment variables):
export DATATURE_VI_SECRET_KEY="your-secret-key"
export DATATURE_VI_ORGANIZATION_ID="your-org-id"
Fix (pass credentials explicitly):
config = NIMConfig(
nvidia_api_key="nvapi-...",
secret_key="your-secret-key",
organization_id="your-org-id",
run_id="your-run-id"
)
Model download fails
Symptom: Failed to download model weights from Datature Vi.
Diagnose by testing credentials directly:
from vi.api.client import ViClient
client = ViClient(
secret_key="your-secret-key",
organization_id="your-org-id"
)
result = client.get_model(run_id="your-run-id", save_path="./test")
Common causes: invalid credentials, run ID not found, training not complete, or a network issue.
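For intermittent network failures, a simple retry with exponential backoff is often enough (generic sketch; wrap the `client.get_model` call from above):

```python
import time

def with_retries(fn, attempts=3, base_delay_s=2.0, sleep=time.sleep):
    """Call fn(), retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the real error
            sleep(base_delay_s * 2 ** attempt)

# result = with_retries(lambda: client.get_model(run_id="your-run-id", save_path="./test"))
```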
Docker issues
Docker daemon not running
Symptom: docker.errors.DockerException: Error while fetching server API version
sudo systemctl start docker
sudo systemctl enable docker
sudo systemctl status docker
Permission denied
Symptom: Permission denied while trying to connect to Docker daemon
sudo usermod -aG docker $USER
newgrp docker
docker ps # verify
Port already in use
Symptom: Bind for 0.0.0.0:8000 failed: port is already allocated
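Alternatively, let the OS pick an unused port and pass the result as `port=` to `NIMConfig` (stdlib sketch):

```python
import socket

def find_free_port():
    """Ask the OS for a TCP port that is currently unused."""
    with socket.socket() as s:
        s.bind(("", 0))
        return s.getsockname()[1]
```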
lsof -i :8000
Use a different port:
config = NIMConfig(
nvidia_api_key="nvapi-...",
port=8080
)
General tips
Enable debug logging
import logging
logging.basicConfig(level=logging.DEBUG)
deployer = NIMDeployer(config)
result = deployer.deploy()
Check container logs
docker logs cosmos-reason2-2b
docker logs -f cosmos-reason2-2b # follow in real-time
docker logs --tail 100 cosmos-reason2-2b # last 100 lines
Monitor resource usage
nvidia-smi # GPU usage
docker stats cosmos-reason2-2b # container stats
docker system df # disk usage
Free up resources
# Stop running NIM containers (the name filter matches substrings;
# the ancestor filter does not support wildcards)
docker stop $(docker ps -q --filter "name=cosmos-reason")
# Remove stopped containers
docker container prune
# Remove unused images (frees significant space)
docker image prune
Getting help
If none of the above fixes your issue, collect logs before reporting:
docker logs cosmos-reason2-2b > nim.log
nvidia-smi > gpu-info.txt
docker info > docker-info.txt
docker version >> docker-info.txt
Then check:
Related resources