How Do I Deploy My Trained Model?

Learn what VLM deployment means and compare three paths in Datature Vi: local inference with the Vi SDK, NVIDIA NIM containers, and self-hosted servers.

Deployment is the step where your trained model starts processing real images outside of Datature Vi. You download the model weights, load them into an inference environment, and send images with prompts to get predictions. Datature Vi supports three deployment paths: local inference with the Vi SDK, NVIDIA NIM containers for production, and self-hosted servers using frameworks like vLLM. This page explains when to use each and what you need to get started.


What does deployment mean for a VLM?

Training teaches your model to recognize patterns. Deployment puts that model to work. During deployment, you:

  1. Download the trained model weights from Datature Vi
  2. Load those weights into an inference runtime (Vi SDK, NIM container, or your own server)
  3. Send new images with prompts and receive predictions

The model runs on a GPU; quantized versions can run on GPUs with less memory. It uses the same system prompt and generation settings you configured during training. Changing the system prompt at inference time can degrade the model's behavior; see system prompt consistency for details.
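One practical way to honor that consistency rule is to keep the training-time system prompt and generation settings in a single place and build every request from it. The sketch below uses an OpenAI-style message layout; the prompt text and setting values are placeholders, not your project's actual configuration:

```python
# Reuse the exact system prompt and generation settings from training.
# The values below are placeholders -- copy yours from your Vi project.
SYSTEM_PROMPT = "You are an inspection assistant. Answer in JSON."

GENERATION_SETTINGS = {
    "temperature": 0.2,   # low temperature for deterministic, structured output
    "top_p": 0.9,
    "max_tokens": 256,
}

def build_request(user_prompt: str, image_ref: str) -> dict:
    """Assemble a chat request that always carries the training-time
    system prompt and generation settings."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": user_prompt},
                    {"type": "image_url", "image_url": {"url": image_ref}},
                ],
            },
        ],
        **GENERATION_SETTINGS,
    }

req = build_request("Is there visible damage?", "data:image/jpeg;base64,...")
```

Because every request goes through `build_request`, the system prompt cannot silently drift away from what the model saw during training.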


Three deployment paths

| Path | Best for | GPU required | Setup effort |
| --- | --- | --- | --- |
| Vi SDK (local) | Prototyping, testing, small-scale use | Yes (local GPU) | Low |
| NVIDIA NIM containers | Production workloads, team access, scaling | Yes (server GPU) | Medium |
| Self-hosted (vLLM, TGI) | Full control, custom infrastructure | Yes (server GPU) | High |

Vi SDK (local inference)

The Vi SDK downloads your model weights and runs inference on your local machine. You write a few lines of Python, pass in an image and a prompt, and get results back. This is the fastest way to test a trained model.

When to use it: During development, for testing on sample images, or for applications where a single machine handles all requests. Good for prototyping before committing to a production setup.

Limitations: Tied to one machine. No built-in load balancing, scaling, or API endpoint. You manage the GPU yourself.

NVIDIA NIM containers

NIM (NVIDIA Inference Microservice) packages your model into a Docker container with GPU acceleration and an OpenAI-compatible API endpoint. You deploy the container on any server with an NVIDIA GPU and send requests over HTTP.

When to use it: For production applications where multiple users or services need to call the model. NIM handles batching, GPU memory management, and provides a standard API that your backend code can call.

Limitations: Requires Docker and NVIDIA GPU drivers on the host. Container images are large (several GB).
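Since NIM exposes an OpenAI-compatible endpoint, calling it from backend code is a plain HTTP POST. The sketch below assumes a container listening on `localhost:8000` and a placeholder model name; adjust the endpoint, model ID, and image handling to match your deployment:

```python
import base64
import json
import urllib.request

# Assumed values for a locally running NIM container -- adjust to the
# host, port, and model name your deployment actually uses.
ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL_ID = "my-finetuned-vlm"  # placeholder

def encode_image(path: str) -> str:
    """Read an image file and return an OpenAI-style data URI."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:image/jpeg;base64,{b64}"

def build_payload(prompt: str, image_uri: str) -> dict:
    """Build an OpenAI-compatible chat request with one image and one prompt."""
    return {
        "model": MODEL_ID,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_uri}},
            ],
        }],
        "max_tokens": 128,
    }

def ask(prompt: str, image_path: str) -> str:
    """POST one request to the NIM endpoint and return the model's answer."""
    body = json.dumps(build_payload(prompt, encode_image(image_path))).encode()
    req = urllib.request.Request(
        ENDPOINT, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The same payload shape works with any OpenAI-compatible client library, so you can swap `urllib` for your stack's HTTP client without changing the request structure.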

Self-hosted (vLLM, TGI)

For teams that want full control, you can load the downloaded model weights into inference frameworks like vLLM or Hugging Face Text Generation Inference (TGI). These frameworks provide optimized serving with features like continuous batching and speculative decoding.

When to use it: When you have specific infrastructure requirements, need custom optimization, or are already running vLLM/TGI for other models.

Limitations: You manage everything: GPU allocation, model loading, API routing, scaling, and monitoring. Requires ML infrastructure experience.
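As a rough illustration, vLLM can serve downloaded weights behind its own OpenAI-compatible endpoint with a single launch command. The path, port, and flags below are placeholders; check vLLM's documentation for the options your model needs:

```shell
# Launch an OpenAI-compatible server from downloaded weights with vLLM.
# Path, port, and context length are illustrative values.
vllm serve /path/to/downloaded-weights \
    --port 8000 \
    --max-model-len 8192
```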


Which path should I choose?

Step 1: Start with the Vi SDK

Every deployment starts here. Download your model, run inference on a few test images, and verify the outputs match your expectations. This validates your model before you invest in production infrastructure.

Step 2: Move to NIM for production

When you need to serve predictions to an application, other team members, or customer-facing features, package the model into a NIM container. The OpenAI-compatible API makes integration with backend services straightforward.

Step 3: Self-host for custom needs

If NIM doesn't meet your infrastructure requirements (custom GPU clusters, specific latency SLAs, multi-model serving), use vLLM or TGI with the downloaded weights.


What you need before deploying

Before downloading and deploying your model, check that:

| Requirement | Details |
| --- | --- |
| A completed training run | The model must have finished training with acceptable evaluation metrics. See How Do I Evaluate My Model? |
| A GPU for inference | All deployment paths require a GPU. Smaller models (2-4B parameters) run on GPUs with 8+ GB VRAM. Larger models (7B+) need 16-24+ GB. |
| Your system prompt | Copy the exact system prompt from training into your inference code. Mismatches degrade performance. |
| Generation settings | Choose temperature, top-p, and max tokens for your task. See How Does Inference Work? for guidance. |
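A quick back-of-envelope check helps when picking a GPU: weight memory is roughly parameter count times bytes per parameter (about 2 bytes in FP16, about 0.5 bytes with 4-bit quantization), before KV cache and vision-encoder overhead. This is a rough sketch, not a sizing guarantee:

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Rough memory needed for model weights alone (excludes KV cache,
    activations, and vision encoder overhead)."""
    return params_billion * 1e9 * bytes_per_param / 1024**3

fp16_7b = weight_memory_gb(7, 2.0)   # ~13 GB of weights -> needs a 16+ GB GPU
int4_7b = weight_memory_gb(7, 0.5)   # ~3.3 GB with 4-bit quantization
fp16_3b = weight_memory_gb(3, 2.0)   # ~5.6 GB -> can fit an 8 GB GPU
```

These estimates line up with the table above: a 7B model in FP16 wants 16-24+ GB once runtime overhead is included, while quantized or smaller models fit 8 GB cards.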

Datature Vi does not add a separate "release approval" button after training. Export and inference access follow project roles and secret keys. Many teams assign an accountable approver in their own runbook; see Roles and RACI checklist for a template.


Frequently asked questions

Does inference require a GPU?

Yes. VLMs require GPU acceleration for practical inference speeds. The Vi SDK detects available CUDA GPUs automatically. For GPUs with limited memory, the SDK supports 8-bit and 4-bit quantization to reduce memory requirements.

If you don't have a local GPU, deploy via NIM containers on a cloud server with GPU access (AWS, GCP, Azure, or NVIDIA DGX Cloud).

Can I use the same downloaded model for all three deployment paths?

Yes. The downloaded weights are standard SafeTensors files. You can load them in the Vi SDK for testing, package them into a NIM container for production, and keep a copy for a self-hosted backup. The model files are the same across all paths.

How do I serve multiple models?

Each model gets its own deployment. With NIM, you run one container per model and route requests to the right container based on the task. Self-hosted solutions like vLLM support serving multiple models from one process, but each model still needs its own GPU memory allocation.

What latency should I expect?

Latency depends on model size, GPU hardware, image resolution, and output length. A 7B model on an A10G GPU typically produces a short answer (50 tokens) in 1-3 seconds. Larger models or longer outputs take more time. Streaming inference delivers partial results sooner. See streaming vs batch inference for details.
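You can sanity-check latency expectations with simple arithmetic: a fixed prefill cost (image and prompt processing) plus output tokens divided by decode throughput. The prefill time and tokens-per-second figures below are assumed ballpark values, not measured benchmarks:

```python
def rough_latency_s(output_tokens: int, tokens_per_second: float,
                    prefill_s: float = 0.5) -> float:
    """Back-of-envelope latency: assumed fixed prefill time plus decode time.
    Real numbers vary with hardware, resolution, and batching."""
    return prefill_s + output_tokens / tokens_per_second

# Assuming roughly 25-50 tokens/s decode for a 7B model on an A10G-class GPU:
fast = rough_latency_s(50, 50)   # ~1.5 s for a 50-token answer
slow = rough_latency_s(50, 25)   # ~2.5 s for the same answer
```

Both estimates fall inside the 1-3 second range quoted above, which is why short structured outputs feel near-interactive while long free-text answers do not.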


Related resources

Download a Model

Step-by-step guide to exporting trained model weights.

Vi SDK Inference

Run local inference with the Vi SDK in Python.

NVIDIA NIM Deployment

Package your model into a production-ready container.