Structured Data Extraction
Use freeform text datasets and system prompt design to get consistent, machine-readable output from your vision-language model in JSON, YAML, code, or any custom format.
Structured data extraction is the practice of getting consistent, machine-readable output from your vision-language model instead of free-text answers. You define the exact format you want, and the model returns it every time. Because this technique builds on the freeform text dataset type, you can train models to produce any text-based format: JSON, YAML, XML, CSV, code, or a custom schema you design yourself.
This guide assumes you understand the basics of freeform text datasets and system prompts. If you're just getting started, follow the quickstart first.
What is structured extraction?
When you ask a VQA model "Is there a defect?", it might say "Yes, I can see a scratch on the left side." That's useful for a human but hard to process programmatically.
Structured extraction teaches your model to respond in a predictable format instead. The format depends on your use case:
JSON:

```json
{
  "defect_found": true,
  "defect_type": "scratch",
  "location": "left side",
  "severity": "low"
}
```

YAML:

```yaml
defect_found: true
defect_type: scratch
location: left side
severity: low
```

Python dict:

```python
{
    "defect_found": True,
    "defect_type": "scratch",
    "location": "left side",
    "severity": "low",
}
```

Custom format:

```
defect_found: YES
defect_type: scratch
location: left side
severity: low
```

Each of these can be parsed directly into your database, workflow, or application with minimal post-processing.
When to use structured extraction
Structured extraction is the right choice when:
- You need to store results in a database with specific columns
- You're building an automated pipeline that processes many images
- You need consistent field names across all predictions
- Your use case involves forms, documents, or checklists
- You want your model to generate code from visual input
How it works
Structured extraction uses the freeform text dataset type with two key elements:
- Annotations written in your target format: your training examples show the model what to output
- A system prompt that specifies the schema: the model is instructed to follow this format at inference
The freeform text dataset type places no restrictions on annotation content. Any text-based format works as long as you annotate consistently.
Step 1: Pick your output format
Choose a format based on how your downstream system will consume the output.
JSON is the most common choice because it handles nested data well and has broad tooling support. But if your pipeline already works with YAML config files, CSV spreadsheets, or code templates, train your model to produce that format directly.
Step 2: Design your schema
Start with the minimum fields your application needs. Keep field names short and unambiguous. Use consistent value types.
Example schemas
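For instance, the defect-inspection schema used throughout this guide keeps four short fields with a fixed severity vocabulary:

```json
{
  "defect_found": true,
  "defect_type": "scratch",
  "location": "top edge",
  "severity": "low"
}
```

For nested data, such as the equipment inspection example later in this guide (values here are illustrative placeholders; the field names match the example system prompt below):

```yaml
equipment_id: PUMP-0042
status: needs_maintenance
issues:
  - type: corrosion
    location: inlet flange
    severity: medium
next_inspection: 2025-09-01
```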
Step 3: Write your annotations
Create a freeform text dataset and annotate each image with output in your chosen format. Annotations should contain only the structured output with no surrounding explanation.
Good annotations:
```json
{"defect_found": true, "defect_type": "scratch", "location": "top edge", "severity": "low"}
```

```yaml
defect_found: true
defect_type: scratch
location: top edge
severity: low
```

Avoid:

```
The image shows a scratch. Here is the result: {"defect_found": true, ...}
```

The model learns from your examples, so consistency matters more than quantity. Aim for:
- At least 50-100 annotated images to start
- The same format and field names across all annotations
- Examples that cover all expected output variations (defect found, no defect found, different severities)
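Consistency across annotations can be checked programmatically before training. A minimal sketch, assuming your JSON annotations are stored one file per image in a local directory (the directory layout and helper name are illustrative, not part of the Vi SDK):

```python
import json
from pathlib import Path

def check_annotations(annotation_dir: str) -> list[str]:
    """Report annotation files that fail to parse or whose keys differ from the first file."""
    problems = []
    expected_keys = None
    for path in sorted(Path(annotation_dir).glob("*.json")):
        try:
            data = json.loads(path.read_text())
        except json.JSONDecodeError:
            problems.append(f"{path.name}: invalid JSON")
            continue
        if expected_keys is None:
            # First file defines the reference schema
            expected_keys = set(data)
        elif set(data) != expected_keys:
            problems.append(f"{path.name}: keys {sorted(data)} != {sorted(expected_keys)}")
    return problems
```

Running this before you upload a dataset catches drift such as a missing field or a misspelled key while it is still cheap to fix.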
Step 4: Write your system prompt
The system prompt tells the model what task to perform and what format to use. For structured extraction, your system prompt should:
- Describe the task
- Specify the exact output format and schema
- Instruct the model to output only the structured data
Example system prompts:
```
You are a quality inspection assistant. Examine the product image and return a JSON object with the following fields:
- defect_found (boolean): true if any defect is visible, false otherwise
- defect_type (string or null): brief description of the defect, null if none
- location (string or null): where the defect appears, null if none
- severity (string): one of "none", "low", "medium", or "high"
Respond with only the JSON object, no additional text.
```

```
You are an equipment inspection assistant. Examine the image of the equipment and return a YAML document with the following fields:
- equipment_id (string): the ID visible on the equipment label
- status (string): one of "operational", "needs_maintenance", or "out_of_service"
- issues (list): each issue has type, location, and severity
- next_inspection (date): recommended next inspection in YYYY-MM-DD format
Respond with only the YAML output, no additional text or code fences.
```

```
You are a UI-to-code assistant. Given a screenshot of a web UI component, generate the HTML and inline CSS that reproduces the layout. Use semantic HTML elements. Use placeholder values for dynamic content like images and text. Output only the HTML code, no explanation.
```

The system prompt you use during training must match the one you use at inference. If they differ, the model may produce inconsistent or malformed output. See Configure Your System Prompt.
Step 5: Parse results with the Vi SDK
A model trained on freeform text returns a `GenericResponse`. Access the raw output via `result.result` and parse it according to the format you trained on.
JSON output
```python
import json

from vi.inference import ViModel

model = ViModel(
    run_id="your-run-id",
    secret_key=".your-secret-key.",
    organization_id="your-organization-id",
)

result, error = model(
    source="product_image.jpg",
    user_prompt="Inspect this product for defects.",
    generation_config={
        "temperature": 0.0,  # deterministic output for consistent structure
        "do_sample": False,
    },
)

if error is None:
    raw = result.result  # string output from the model
    data = json.loads(raw)
    print(f"Defect found: {data['defect_found']}")
    if data['defect_found']:
        print(f"Type: {data['defect_type']}")
        print(f"Location: {data['location']}")
        print(f"Severity: {data['severity']}")
else:
    print(f"Inference error: {error}")
```

YAML output
```python
import yaml

from vi.inference import ViModel

model = ViModel(
    run_id="your-run-id",
    secret_key=".your-secret-key.",
    organization_id="your-organization-id",
)

result, error = model(
    source="equipment_photo.jpg",
    user_prompt="Inspect this equipment and report its status.",
    generation_config={"temperature": 0.0, "do_sample": False},
)

if error is None:
    data = yaml.safe_load(result.result)
    print(f"Equipment: {data['equipment_id']}")
    print(f"Status: {data['status']}")
    for issue in data.get('issues', []):
        print(f"  Issue: {issue['type']} at {issue['location']}")
else:
    print(f"Inference error: {error}")
```

Handle malformed output
The model occasionally produces output that isn't valid for the expected format, especially during early training or when the input image is ambiguous. Add a fallback:
```python
import json

from vi.inference import ViModel

model = ViModel(
    run_id="your-run-id",
    secret_key=".your-secret-key.",
    organization_id="your-organization-id",
)

result, error = model(
    source="product_image.jpg",
    user_prompt="Inspect this product for defects.",
    generation_config={"temperature": 0.0, "do_sample": False},
)

if error is not None:
    print(f"Inference error: {error}")
else:
    try:
        data = json.loads(result.result)
        # process structured data
        print(data)
    except json.JSONDecodeError:
        # log raw output for debugging
        print(f"Could not parse output. Raw: {result.result}")
```

Batch processing
For processing many images at once:
```python
import json

from vi.inference import ViModel

model = ViModel(
    run_id="your-run-id",
    secret_key=".your-secret-key.",
    organization_id="your-organization-id",
)

image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]

results = model(
    source=image_paths,
    user_prompt="Inspect this product for defects.",
    generation_config={"temperature": 0.0, "do_sample": False},
)

records = []
for image_path, (result, error) in zip(image_paths, results):
    if error is not None:
        print(f"{image_path}: error - {error}")
        continue
    try:
        data = json.loads(result.result)
        data["image"] = image_path
        records.append(data)
    except json.JSONDecodeError:
        print(f"{image_path}: could not parse output - {result.result}")

# records is now a list of dicts ready to write to a database or CSV
print(f"Processed {len(records)} images successfully")
```

Tips for reliable structured output
- Use `temperature: 0.0`: deterministic decoding produces more consistent output structure
- Keep schemas simple: fewer fields, clearer field names, fixed value vocabularies
- Be consistent in annotations: if you use `"low"` in some and `"Low"` in others, the model will too
- Provide diverse examples: include both positive and negative cases (defect found / no defect found)
- Test with held-out images: before deploying, run inference on images not used in training to verify output format
- Match your format to the consumer: use JSON for APIs, CSV for spreadsheet workflows, code for developer tools
- Avoid mixing formats: train each model on one output format. If you need both JSON and CSV, train separate models or use the system prompt to select the format at inference.
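For the fixed-vocabulary tip, a lightweight post-parse check can reject out-of-schema values before records reach your database. A minimal sketch using the defect schema from the example system prompt above (the helper function itself is illustrative, not part of the Vi SDK):

```python
# Allowed values taken from the example system prompt's severity field
ALLOWED_SEVERITIES = {"none", "low", "medium", "high"}

def validate_record(data: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the record is valid."""
    errors = []
    if not isinstance(data.get("defect_found"), bool):
        errors.append("defect_found must be a boolean")
    if data.get("severity") not in ALLOWED_SEVERITIES:
        errors.append(
            f"severity {data.get('severity')!r} not in {sorted(ALLOWED_SEVERITIES)}"
        )
    return errors
```

Records that fail validation can be routed to the same fallback path as unparseable output, so a single review queue catches both malformed and out-of-vocabulary predictions.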
Related resources
Freeform Text
Dataset type reference for custom annotation schemas.
Configure Your System Prompt
How to write and configure the system prompt that guides your model's output format.
Configure Generation
Temperature, sampling, and other parameters that affect output consistency.
Prediction Schemas
Full reference for Vi SDK response types including GenericResponse.
Manufacturing Inspection
End-to-end example using structured extraction for quality control.
Document Processing
Invoice and form extraction use case using structured output.