Structured Data Extraction
Use freeform text datasets and system prompt design to get consistent, machine-readable output from your vision-language model in JSON, YAML, code, or any custom format.
Structured data extraction is the practice of getting consistent, machine-readable output from your vision-language model instead of free-text answers. You define the exact format you want, and the model returns it every time. Because this technique builds on the freeform text dataset type, you can train models to produce any text-based format: JSON, YAML, XML, CSV, code, or a custom schema you design yourself.
This guide assumes you understand the basics of freeform text datasets and system prompts. If you're just getting started, follow the quickstart first.
What is structured extraction?
When you ask a VQA model "Is there a defect?", it might say "Yes, I can see a scratch on the left side." That's useful for a human but hard to process programmatically.
Structured extraction teaches your model to respond in a predictable format instead. The format depends on your use case:
JSON:

```json
{
  "defect_found": true,
  "defect_type": "scratch",
  "location": "left side",
  "severity": "low"
}
```

YAML:

```yaml
defect_found: true
defect_type: scratch
location: left side
severity: low
```

Python dict:

```python
{
    "defect_found": True,
    "defect_type": "scratch",
    "location": "left side",
    "severity": "low",
}
```

Custom format:

```
defect_found: YES
defect_type: scratch
location: left side
severity: low
```

Each of these can be parsed directly into your database, workflow, or application with minimal post-processing.
When to use structured extraction
Structured extraction is the right choice when:
- You need to store results in a database with specific columns
- You're building an automated pipeline that processes many images
- You need consistent field names across all predictions
- Your use case involves forms, documents, or checklists
- You want your model to generate code from visual input
How it works
Structured extraction uses the freeform text dataset type with two key elements:
- Annotations written in your target format: your training examples show the model what to output
- A system prompt that specifies the schema: the model is instructed to follow this format at inference
The freeform text dataset type places no restrictions on annotation content. Any text-based format works as long as you annotate consistently.
Step 1: Pick your output format
Choose a format based on how your downstream system will consume the output.
JSON is the most common choice because it handles nested data well and has broad tooling support. But if your pipeline already works with YAML config files, CSV spreadsheets, or code templates, train your model to produce that format directly.
Step 2: Design your schema
Start with the minimum fields your application needs. Keep field names short and unambiguous. Use consistent value types.
Example schemas
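For instance, the defect-inspection schema used throughout this guide keeps four short fields with a fixed severity vocabulary:

```json
{
  "defect_found": true,
  "defect_type": "scratch",
  "location": "top edge",
  "severity": "low"
}
```

For nested data, such as the equipment inspection example later in this guide (values here are illustrative placeholders; the field names match the example system prompt below):

```yaml
equipment_id: PUMP-0042
status: needs_maintenance
issues:
  - type: corrosion
    location: inlet flange
    severity: medium
next_inspection: 2025-09-01
```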
Step 3: Write your annotations
Create a freeform text dataset and annotate each image with output in your chosen format. Annotations should contain only the structured output with no surrounding explanation.
Good annotations:
```json
{"defect_found": true, "defect_type": "scratch", "location": "top edge", "severity": "low"}
```

```yaml
defect_found: true
defect_type: scratch
location: top edge
severity: low
```

Avoid:

```
The image shows a scratch. Here is the result: {"defect_found": true, ...}
```

The model learns from your examples, so consistency matters more than quantity. Aim for:
- At least 50-100 annotated images to start
- The same format and field names across all annotations
- Examples that cover all expected output variations (defect found, no defect found, different severities)
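Consistency across annotations can be checked programmatically before training. A minimal sketch, assuming your JSON annotations are stored one file per image in a local directory (the directory layout and helper name are illustrative, not part of the Vi SDK):

```python
import json
from pathlib import Path

def check_annotations(annotation_dir: str) -> list[str]:
    """Report annotation files that fail to parse or whose keys differ from the first file."""
    problems = []
    expected_keys = None
    for path in sorted(Path(annotation_dir).glob("*.json")):
        try:
            data = json.loads(path.read_text())
        except json.JSONDecodeError:
            problems.append(f"{path.name}: invalid JSON")
            continue
        if expected_keys is None:
            # First file defines the reference schema
            expected_keys = set(data)
        elif set(data) != expected_keys:
            problems.append(f"{path.name}: keys {sorted(data)} != {sorted(expected_keys)}")
    return problems
```

Running this before you upload a dataset catches drift such as a missing field or a misspelled key while it is still cheap to fix.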
Step 4: Write your system prompt
The system prompt tells the model what task to perform and what format to use. For structured extraction, your system prompt should:
- Describe the task
- Specify the exact output format and schema
- Instruct the model to output only the structured data
Example system prompts:
```
You are a quality inspection assistant. Examine the product image and return a JSON object with the following fields:
- defect_found (boolean): true if any defect is visible, false otherwise
- defect_type (string or null): brief description of the defect, null if none
- location (string or null): where the defect appears, null if none
- severity (string): one of "none", "low", "medium", or "high"
Respond with only the JSON object, no additional text.
```

```
You are an equipment inspection assistant. Examine the image of the equipment and return a YAML document with the following fields:
- equipment_id (string): the ID visible on the equipment label
- status (string): one of "operational", "needs_maintenance", or "out_of_service"
- issues (list): each issue has type, location, and severity
- next_inspection (date): recommended next inspection in YYYY-MM-DD format
Respond with only the YAML output, no additional text or code fences.
```

```
You are a UI-to-code assistant. Given a screenshot of a web UI component, generate the HTML and inline CSS that reproduces the layout. Use semantic HTML elements. Use placeholder values for dynamic content like images and text. Output only the HTML code, no explanation.
```

The system prompt you use during training must match the one you use at inference. If they differ, the model may produce inconsistent or malformed output. See Configure Your System Prompt.
Step 5: Parse results with the Vi SDK
A model trained on freeform text returns a `GenericResponse`. Access the raw output via `result.result` and parse it according to the format you trained on.
JSON output
```python
import json

from vi.inference import ViModel

model = ViModel(
    run_id="your-run-id",
    secret_key=".your-secret-key.",
    organization_id="your-organization-id",
)

result, error = model(
    source="product_image.jpg",
    user_prompt="Inspect this product for defects.",
    generation_config={
        "temperature": 0.0,  # deterministic output for consistent structure
        "do_sample": False,
    },
)

if error is None:
    raw = result.result  # string output from the model
    data = json.loads(raw)
    print(f"Defect found: {data['defect_found']}")
    if data['defect_found']:
        print(f"Type: {data['defect_type']}")
        print(f"Location: {data['location']}")
        print(f"Severity: {data['severity']}")
else:
    print(f"Inference error: {error}")
```

YAML output
```python
import yaml

from vi.inference import ViModel

model = ViModel(
    run_id="your-run-id",
    secret_key=".your-secret-key.",
    organization_id="your-organization-id",
)

result, error = model(
    source="equipment_photo.jpg",
    user_prompt="Inspect this equipment and report its status.",
    generation_config={"temperature": 0.0, "do_sample": False},
)

if error is None:
    data = yaml.safe_load(result.result)
    print(f"Equipment: {data['equipment_id']}")
    print(f"Status: {data['status']}")
    for issue in data.get('issues', []):
        print(f"  Issue: {issue['type']} at {issue['location']}")
else:
    print(f"Inference error: {error}")
```

Handle malformed output
The model occasionally produces output that isn't valid for the expected format, especially during early training or when the input image is ambiguous. Add a fallback:
```python
import json

from vi.inference import ViModel

model = ViModel(
    run_id="your-run-id",
    secret_key=".your-secret-key.",
    organization_id="your-organization-id",
)

result, error = model(
    source="product_image.jpg",
    user_prompt="Inspect this product for defects.",
    generation_config={"temperature": 0.0, "do_sample": False},
)

if error is not None:
    print(f"Inference error: {error}")
else:
    try:
        data = json.loads(result.result)
        # process structured data
        print(data)
    except json.JSONDecodeError:
        # log raw output for debugging
        print(f"Could not parse output. Raw: {result.result}")
```

Batch processing
For processing many images at once:
```python
import json

from vi.inference import ViModel

model = ViModel(
    run_id="your-run-id",
    secret_key=".your-secret-key.",
    organization_id="your-organization-id",
)

image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]

results = model(
    source=image_paths,
    user_prompt="Inspect this product for defects.",
    generation_config={"temperature": 0.0, "do_sample": False},
)

records = []
for image_path, (result, error) in zip(image_paths, results):
    if error is not None:
        print(f"{image_path}: error - {error}")
        continue
    try:
        data = json.loads(result.result)
        data["image"] = image_path
        records.append(data)
    except json.JSONDecodeError:
        print(f"{image_path}: could not parse output - {result.result}")

# records is now a list of dicts ready to write to a database or CSV
print(f"Processed {len(records)} images successfully")
```

Tips for reliable structured output
- Use `temperature: 0.0`: deterministic decoding produces more consistent output structure
- Keep schemas simple: fewer fields, clearer field names, fixed value vocabularies
- Be consistent in annotations: if you use `"low"` in some and `"Low"` in others, the model will too
- Provide diverse examples: include both positive and negative cases (defect found / no defect found)
- Test with held-out images: before deploying, run inference on images not used in training to verify output format
- Match your format to the consumer: use JSON for APIs, CSV for spreadsheet workflows, code for developer tools
- Avoid mixing formats: train each model on one output format. If you need both JSON and CSV, train separate models or use the system prompt to select the format at inference.
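For the fixed-vocabulary tip, a lightweight post-parse check can reject out-of-schema values before records reach your database. A minimal sketch using the defect schema from the example system prompt above (the helper function itself is illustrative, not part of the Vi SDK):

```python
# Allowed values taken from the example system prompt's severity field
ALLOWED_SEVERITIES = {"none", "low", "medium", "high"}

def validate_record(data: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the record is valid."""
    errors = []
    if not isinstance(data.get("defect_found"), bool):
        errors.append("defect_found must be a boolean")
    if data.get("severity") not in ALLOWED_SEVERITIES:
        errors.append(
            f"severity {data.get('severity')!r} not in {sorted(ALLOWED_SEVERITIES)}"
        )
    return errors
```

Records that fail validation can be routed to the same fallback path as unparseable output, so a single review queue catches both malformed and out-of-vocabulary predictions.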
Related resources
Freeform Text
Dataset type reference for custom annotation schemas.
Configure Your System Prompt
How to write and configure the system prompt that guides your model's output format.
Configure Generation
Temperature, sampling, and other parameters that affect output consistency.
Prediction Schemas
Full reference for Vi SDK response types including GenericResponse.
Manufacturing Inspection
End-to-end example using structured extraction for quality control.
Document Processing
Invoice and form extraction use case using structured output.