What Are System Prompts and How Do I Write One?

Learn what system prompts are, how they shape VLM behavior during training and inference, and how to write effective prompts for your domain.

A system prompt is a set of natural language instructions that tells your vision-language model (VLM) how to behave. It defines what the model looks for in images, how it formats responses, and what domain knowledge it applies. Datature Vi uses the same system prompt during both training and inference. This page explains what system prompts do, how to structure them, and what mistakes to avoid.


What does a system prompt do?

Think of a system prompt as a job description for your model. Before your model sees a single image, the system prompt tells it:

  • What role to play: "You are a quality control inspector for printed circuit boards."
  • What to focus on: "Identify solder bridges, missing components, and cold joints."
  • How to respond: "Return a JSON object with defect_found, defect_type, and severity fields."
  • What to avoid: "Do not speculate about components hidden under other parts."

Without a system prompt, the model falls back on its general training. It might describe an image in broad terms when you needed a specific defect report. The system prompt narrows the model's attention to your task.
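Put together, the four instruction types above fit in a few sentences. A minimal sketch in Python (the PCB wording reuses the bullet examples; it is illustrative, not a Datature Vi default):

```python
# A complete system prompt assembled from the four instruction types:
# role, focus, response format, and constraints. The PCB wording is
# an illustrative example, not a Datature Vi default.
SYSTEM_PROMPT = (
    "You are a quality control inspector for printed circuit boards. "
    "Identify solder bridges, missing components, and cold joints. "
    "Return a JSON object with defect_found, defect_type, and severity fields. "
    "Do not speculate about components hidden under other parts."
)

# Four sentences cover all four elements and stay far below the
# ~200-word guideline discussed later on this page.
print(len(SYSTEM_PROMPT.split()))
```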

New to Datature Vi?

System prompts are one of the first things you configure when setting up a training workflow. Follow the quickstart to see how prompts fit into the full pipeline.


The four elements of a system prompt

Every effective system prompt covers four areas. You don't need long paragraphs for each; a few precise sentences per element are enough.

1. Define the role

Tell the model what kind of expert it is. This sets the tone and vocabulary for all responses.

Weak: "You analyze images." Strong: "You are a radiologist assistant that describes findings in chest X-rays using standard medical terminology."

2. Specify what to look for

Name the specific objects, conditions, or patterns the model should identify. Generic instructions produce generic results.

Weak: "Find problems in the image." Strong: "Identify cracks, chips, and discoloration on ceramic tiles. Ignore surface dust and reflections."

3. Set the output format

Tell the model exactly how to structure its response. This is especially important for structured data extraction and programmatic workflows where downstream code parses the output.

Weak: "Describe what you see." Strong: "Answer in one sentence starting with Yes or No, followed by the defect type and location."
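When the prompt requests structured output, downstream code should still tolerate responses that drift from the format. A hedged sketch (the field names come from the PCB example earlier on this page; `parse_defect_report` is a hypothetical helper, not a Datature Vi API):

```python
import json

def parse_defect_report(response: str) -> dict:
    """Parse a response the prompt asked to format as JSON with
    defect_found, defect_type, and severity fields. Anything that
    does not match is flagged for review instead of crashing."""
    try:
        report = json.loads(response)
    except json.JSONDecodeError:
        return {"needs_review": True, "raw": response}
    if {"defect_found", "defect_type", "severity"} - report.keys():
        report["needs_review"] = True  # missing expected fields
    return report

report = parse_defect_report(
    '{"defect_found": true, "defect_type": "solder bridge", "severity": "high"}'
)
```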

4. Add hallucination guards

Instruct the model to only report what it can see. Without guardrails, VLMs may fill gaps with plausible-sounding information that isn't grounded in the image.

Example: "Only describe what is directly visible in the image. Do not speculate about areas outside the frame or occluded by other objects. If you cannot determine the answer, say 'Unable to determine from this image.'"

See What are hallucination guards? below for more detail.


Why the training prompt and inference prompt must match

Datature Vi uses your system prompt as part of the training data. The model learns to follow the specific instructions, formatting, and vocabulary in that prompt. If you change the prompt at inference time, the model receives instructions it never trained on.

The effect is similar to training a translator on French-to-English and then asking them to translate Spanish. They might produce something, but the quality drops.

| Scenario | What happens | Severity |
| --- | --- | --- |
| Same prompt for training and inference | Model performs as expected | No issue |
| Minor wording changes (synonyms, reordering) | Small performance drop, may not be noticeable | Low |
| Different output format requested | Model may ignore the new format or mix formats | Medium |
| Different task or domain | Model produces unreliable or irrelevant output | High |

If you need to change the system prompt after training, retrain the model with the updated prompt. Small adjustments to wording may work, but changes to the task, format, or domain require a new training run.
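One way to keep the two prompts in sync is to store a single copy and assert equality before running inference. A sketch under that assumption (the constant names and helper are illustrative, not a Datature Vi API):

```python
# Keep one copy of the prompt and reuse it in both phases so the
# training and inference instructions cannot silently drift apart.
TRAINING_PROMPT = (
    "You are a quality control inspector for printed circuit boards. "
    "Identify solder bridges, missing components, and cold joints."
)

def check_prompts_match(training: str, inference: str) -> None:
    """Fail fast instead of silently degrading model quality."""
    if training.strip() != inference.strip():
        raise ValueError(
            "Inference prompt differs from the training prompt; "
            "revert the change or retrain with the new prompt."
        )

inference_prompt = TRAINING_PROMPT  # reuse the same string, never retype it
check_prompts_match(TRAINING_PROMPT, inference_prompt)
```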


What are hallucination guards?

Hallucination occurs when a VLM generates information that isn't present in the image. The model might describe objects that don't exist, invent counts, or assign labels based on patterns from its pre-training data rather than what's actually visible.

This happens because VLMs are trained on large datasets of image-text pairs. When the model encounters ambiguity, it fills gaps with statistically likely content. A model trained on factory images might "see" a common defect type even when the image shows no defect at all.

Hallucination guards are instructions in the system prompt that constrain the model:

Manufacturing inspection: "Only report defects you can see in the image. If an area is obscured or out of focus, state that you cannot inspect it. Do not assume defect types based on location alone."

Medical imaging: "Describe visible findings only. Do not provide diagnoses or prognoses. State 'No visible abnormality' when the image appears normal rather than speculating."

Retail shelf analysis: "Count only products that are fully visible. If a product is partially hidden behind another, do not include it in the count. Report 'partially visible' items separately."

Document processing: "Extract only text that is legible in the image. For fields that are illegible or missing, return null rather than guessing the content."

Hallucination guards don't eliminate hallucination entirely, but they reduce it. Combining guards with high-quality annotations and fine-tuning on your specific data gives the strongest results.
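Guards also make outputs easier to post-process: fields the model honestly marks as unreadable can be routed differently from fields it read with confidence. A sketch for the document-processing guard above (`split_extraction` and the field names are hypothetical):

```python
import json

def split_extraction(response: str) -> tuple:
    """Separate confidently read fields from fields the model returned
    as null (illegible or missing), following the guard instruction
    'return null rather than guessing the content'."""
    data = json.loads(response)
    readable = {k: v for k, v in data.items() if v is not None}
    unreadable = [k for k, v in data.items() if v is None]
    return readable, unreadable

# A field the model could not read comes back as null, not a guess.
fields, unknown = split_extraction('{"invoice_no": "A-1042", "total": null}')
```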


Common mistakes

| Mistake | Why it hurts | Fix |
| --- | --- | --- |
| Prompt is too vague ("analyze this image") | Model has no specific direction and produces generic descriptions | Name the exact task, objects, and output format |
| Prompt is too long (500+ words) | Consumes context window tokens that could be used for image processing and responses | Keep prompts under 200 words; move reference material to annotations instead |
| Different prompt at inference vs. training | Model receives unfamiliar instructions and quality degrades | Copy the exact training prompt to your inference code |
| No hallucination guards | Model may fabricate objects or details not in the image | Add explicit constraints: "only describe what is visible" |
| Generic output format ("describe what you see") | Inconsistent response structure across images | Define the exact format: JSON fields, sentence structure, or categories |
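These mistakes can be caught with a few rough checks before training. A sketch (the keyword heuristics are illustrative, not an exhaustive validator):

```python
def lint_prompt(prompt: str) -> list:
    """Heuristic checks mirroring the common mistakes above.
    Keyword matching is a rough sketch, not a real validator."""
    warnings = []
    words = prompt.split()
    if len(words) > 200:
        warnings.append("too long: move reference material to annotations")
    if len(words) < 15:
        warnings.append("possibly too vague: name the task, objects, and format")
    lowered = prompt.lower()
    if "only" not in lowered and "do not" not in lowered:
        warnings.append("no hallucination guard detected")
    return warnings

print(lint_prompt("Analyze this image."))
```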

Frequently asked questions

Can I change the system prompt after training?

Minor wording changes (reordering sentences, adding a synonym) usually have no visible effect. But changing the task, output format, or domain requires retraining. The model learned to follow the specific instructions in the training prompt, and significant changes break that alignment.

If you're unsure whether a change is safe, run inference on a few test images with both prompts and compare the outputs.
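Such an A/B check can be scripted around whatever inference client you use. A sketch where `run_inference(image, prompt)` is a placeholder for your own call, not a Datature Vi API:

```python
def compare_prompts(run_inference, images, prompt_a, prompt_b):
    """Run the same test images under both prompts and pair up the
    outputs for side-by-side review. run_inference(image, prompt) -> str
    is a placeholder for your own inference call."""
    return [
        (image, run_inference(image, prompt_a), run_inference(image, prompt_b))
        for image in images
    ]
```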

How long should a system prompt be?

Between 50 and 200 words for most tasks. Shorter prompts waste less of the context window and give the model more room for image processing and response generation. If your prompt exceeds 200 words, consider whether some of that information belongs in annotations instead.

Do different dataset types use different default prompts?

Yes. Datature Vi provides different default prompts for each dataset type. Phrase grounding prompts focus on localization ("find the object described by the text"). VQA prompts focus on answering questions ("answer the question based on the image"). Freeform text prompts are fully custom.

You can modify these defaults to add domain context, but keep the core task instruction intact.

How do I write my first system prompt?

Start with the defaults. Train a first model, run inference on a few test images, and look at where the outputs fall short. Then refine the prompt to address those specific gaps. Prompt writing is iterative, not one-shot.

For examples by industry, see Configure Your System Prompt.


Related resources

Configure Your System Prompt

Step-by-step guide to setting up prompts in the workflow canvas.

Annotation Guide

How to create effective training data for your VLM.

Chain-of-Thought Reasoning

Add step-by-step reasoning to model outputs.