How Do I Deploy My Trained Model?

Learn what VLM deployment means and compare three paths in Datature Vi: local inference with the Vi SDK, NVIDIA NIM containers, and self-hosted servers.

Deployment is the step where your trained model starts processing real images outside of Datature Vi. You download the model weights, load them into an inference environment, and send images with prompts to get predictions. Datature Vi supports three deployment paths: local inference with the Vi SDK, NVIDIA NIM containers for production, and self-hosted servers using frameworks like vLLM. This page explains when to use each and what you need to get started.


What does deployment mean for a VLM?

Training teaches your model to recognize patterns. Deployment puts that model to work. During deployment, you:

  1. Download the trained model weights from Datature Vi
  2. Load those weights into an inference runtime (Vi SDK, NIM container, or your own server)
  3. Send new images with prompts and receive predictions

The model runs on a GPU; quantized versions can run on GPUs with less memory. It uses the same system prompt and generation settings you configured during training. Changing the system prompt at inference time can degrade the model's behavior; see system prompt consistency for details.
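One practical way to honor that consistency rule is to keep the training-time system prompt and generation settings in a single place and build every request from it. The sketch below uses an OpenAI-style message layout; the prompt text and setting values are placeholders, not your project's actual configuration:

```python
# Reuse the exact system prompt and generation settings from training.
# The values below are placeholders -- copy yours from your Vi project.
SYSTEM_PROMPT = "You are an inspection assistant. Answer in JSON."

GENERATION_SETTINGS = {
    "temperature": 0.2,   # low temperature for deterministic, structured output
    "top_p": 0.9,
    "max_tokens": 256,
}

def build_request(user_prompt: str, image_ref: str) -> dict:
    """Assemble a chat request that always carries the training-time
    system prompt and generation settings."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": user_prompt},
                    {"type": "image_url", "image_url": {"url": image_ref}},
                ],
            },
        ],
        **GENERATION_SETTINGS,
    }

req = build_request("Is there visible damage?", "data:image/jpeg;base64,...")
```

Because every request goes through `build_request`, the system prompt cannot silently drift away from what the model saw during training.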


Three deployment paths

| Path | Best for | GPU required | Setup effort |
| --- | --- | --- | --- |
| Vi SDK (local) | Prototyping, testing, small-scale use | Yes (local GPU) | Low |
| NVIDIA NIM containers | Production workloads, team access, scaling | Yes (server GPU) | Medium |
| Self-hosted (vLLM, TGI) | Full control, custom infrastructure | Yes (server GPU) | High |

Vi SDK (local inference)

The Vi SDK downloads your model weights and runs inference on your local machine. You write a few lines of Python, pass in an image and a prompt, and get results back. This is the fastest way to test a trained model.

When to use it: During development, for testing on sample images, or for applications where a single machine handles all requests. Good for prototyping before committing to a production setup.

Limitations: Tied to one machine. No built-in load balancing, scaling, or API endpoint. You manage the GPU yourself.

NVIDIA NIM containers

NIM (NVIDIA Inference Microservice) packages your model into a Docker container with GPU acceleration and an OpenAI-compatible API endpoint. You deploy the container on any server with an NVIDIA GPU and send requests over HTTP.

When to use it: For production applications where multiple users or services need to call the model. NIM handles batching, GPU memory management, and provides a standard API that your backend code can call.

Limitations: Requires Docker and NVIDIA GPU drivers on the host. Container images are large (several GB).
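Since NIM exposes an OpenAI-compatible endpoint, calling it from backend code is a plain HTTP POST. The sketch below assumes a container listening on `localhost:8000` and a placeholder model name; adjust the endpoint, model ID, and image handling to match your deployment:

```python
import base64
import json
import urllib.request

# Assumed values for a locally running NIM container -- adjust to the
# host, port, and model name your deployment actually uses.
ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL_ID = "my-finetuned-vlm"  # placeholder

def encode_image(path: str) -> str:
    """Read an image file and return an OpenAI-style data URI."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:image/jpeg;base64,{b64}"

def build_payload(prompt: str, image_uri: str) -> dict:
    """Build an OpenAI-compatible chat request with one image and one prompt."""
    return {
        "model": MODEL_ID,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_uri}},
            ],
        }],
        "max_tokens": 128,
    }

def ask(prompt: str, image_path: str) -> str:
    """POST one request to the NIM endpoint and return the model's answer."""
    body = json.dumps(build_payload(prompt, encode_image(image_path))).encode()
    req = urllib.request.Request(
        ENDPOINT, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The same payload shape works with any OpenAI-compatible client library, so you can swap `urllib` for your stack's HTTP client without changing the request structure.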

Self-hosted (vLLM, TGI)

For teams that want full control, you can load the downloaded model weights into inference frameworks like vLLM or Hugging Face Text Generation Inference (TGI). These frameworks provide optimized serving with features like continuous batching and speculative decoding.

When to use it: When you have specific infrastructure requirements, need custom optimization, or are already running vLLM/TGI for other models.

Limitations: You manage everything: GPU allocation, model loading, API routing, scaling, and monitoring. Requires ML infrastructure experience.
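As a rough illustration, vLLM can serve downloaded weights behind its own OpenAI-compatible endpoint with a single launch command. The path, port, and flags below are placeholders; check vLLM's documentation for the options your model needs:

```shell
# Launch an OpenAI-compatible server from downloaded weights with vLLM.
# Path, port, and context length are illustrative values.
vllm serve /path/to/downloaded-weights \
    --port 8000 \
    --max-model-len 8192
```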


Which path should I choose?

Step 1: Start with the Vi SDK

Every deployment starts here. Download your model, run inference on a few test images, and verify the outputs match your expectations. This validates your model before you invest in production infrastructure.

Step 2: Move to NIM for production

When you need to serve predictions to an application, other team members, or customer-facing features, package the model into a NIM container. The OpenAI-compatible API makes integration with backend services straightforward.

Step 3: Self-host for custom needs

If NIM doesn't meet your infrastructure requirements (custom GPU clusters, specific latency SLAs, multi-model serving), use vLLM or TGI with the downloaded weights.


What you need before deploying

Before downloading and deploying your model, check that:

| Requirement | Details |
| --- | --- |
| A completed training run | The model must have finished training with acceptable evaluation metrics. See How Do I Evaluate My Model? |
| A GPU for inference | All deployment paths require a GPU. Smaller models (2-4B parameters) run on GPUs with 8+ GB VRAM. Larger models (7B+) need 16-24+ GB. |
| Your system prompt | Copy the exact system prompt from training into your inference code. Mismatches degrade performance. |
| Generation settings | Choose temperature, top-p, and max tokens for your task. See How Does Inference Work? for guidance. |
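A quick back-of-envelope check helps when picking a GPU: weight memory is roughly parameter count times bytes per parameter (about 2 bytes in FP16, about 0.5 bytes with 4-bit quantization), before KV cache and vision-encoder overhead. This is a rough sketch, not a sizing guarantee:

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Rough memory needed for model weights alone (excludes KV cache,
    activations, and vision encoder overhead)."""
    return params_billion * 1e9 * bytes_per_param / 1024**3

fp16_7b = weight_memory_gb(7, 2.0)   # ~13 GB of weights -> needs a 16+ GB GPU
int4_7b = weight_memory_gb(7, 0.5)   # ~3.3 GB with 4-bit quantization
fp16_3b = weight_memory_gb(3, 2.0)   # ~5.6 GB -> can fit an 8 GB GPU
```

These estimates line up with the table above: a 7B model in FP16 wants 16-24+ GB once runtime overhead is included, while quantized or smaller models fit 8 GB cards.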

Datature Vi does not add a separate "release approval" button after training. Export and inference access follow project roles and secret keys. Many teams assign an accountable approver in their own runbook; see Roles and RACI checklist for a template.


Frequently asked questions

Does inference require a GPU?

Yes. VLMs require GPU acceleration for practical inference speeds. The Vi SDK detects available CUDA GPUs automatically. For GPUs with limited memory, the SDK supports 8-bit and 4-bit quantization to reduce memory requirements.

If you don't have a local GPU, deploy via NIM containers on a cloud server with GPU access (AWS, GCP, Azure, or NVIDIA DGX Cloud).

Can I use the same downloaded model for all three deployment paths?

Yes. The downloaded weights are standard SafeTensors files. You can load them in the Vi SDK for testing, package them into a NIM container for production, and keep a copy for a self-hosted backup. The model files are the same across all paths.

How do I serve multiple models?

Each model gets its own deployment. With NIM, you run one container per model and route requests to the right container based on the task. Self-hosted solutions like vLLM support serving multiple models from one process, but each model still needs its own GPU memory allocation.

What latency should I expect?

Latency depends on model size, GPU hardware, image resolution, and output length. A 7B model on an A10G GPU typically produces a short answer (50 tokens) in 1-3 seconds. Larger models or longer outputs take more time. Streaming inference delivers partial results sooner. See streaming vs batch inference for details.
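You can sanity-check latency expectations with simple arithmetic: a fixed prefill cost (image and prompt processing) plus output tokens divided by decode throughput. The prefill time and tokens-per-second figures below are assumed ballpark values, not measured benchmarks:

```python
def rough_latency_s(output_tokens: int, tokens_per_second: float,
                    prefill_s: float = 0.5) -> float:
    """Back-of-envelope latency: assumed fixed prefill time plus decode time.
    Real numbers vary with hardware, resolution, and batching."""
    return prefill_s + output_tokens / tokens_per_second

# Assuming roughly 25-50 tokens/s decode for a 7B model on an A10G-class GPU:
fast = rough_latency_s(50, 50)   # ~1.5 s for a 50-token answer
slow = rough_latency_s(50, 25)   # ~2.5 s for the same answer
```

Both estimates fall inside the 1-3 second range quoted above, which is why short structured outputs feel near-interactive while long free-text answers do not.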


Related resources

Download a Model

Step-by-step guide to exporting trained model weights.

Vi SDK Inference

Run local inference with the Vi SDK in Python.

NVIDIA NIM Deployment

Package your model into a production-ready container.