Structured Data Extraction

Use freeform text datasets and system prompt design to get consistent, machine-readable output from your vision-language model in JSON, YAML, code, or any custom format.

Structured data extraction is the practice of getting consistent, machine-readable output from your vision-language model instead of free-text answers. You define the exact format you want, and the model learns to return it consistently. Because this technique builds on the freeform text dataset type, you can train models to produce any text-based format: JSON, YAML, XML, CSV, code, or a custom schema you design yourself.

New to Datature Vi?

This guide assumes you understand the basics of freeform text datasets and system prompts. If you're just getting started, follow the quickstart first.

By the end of this guide

Configure Vi to return structured output from any image in JSON, YAML, code, or any custom format you define.


What is structured extraction?

When you ask a VQA model "Is there a defect?", it might say "Yes, I can see a scratch on the left side." That's useful for a human but hard to process programmatically.

Structured extraction teaches your model to respond in a predictable format instead. The format depends on your use case:

JSON:

{
  "defect_found": true,
  "defect_type": "scratch",
  "location": "left side",
  "severity": "low"
}

YAML:

defect_found: true
defect_type: scratch
location: left side
severity: low

Python dict:

{
    "defect_found": True,
    "defect_type": "scratch",
    "location": "left side",
    "severity": "low"
}

Key-value:

defect_found: YES
defect_type: scratch
location: left side
severity: low

Each of these can be parsed directly into your database, workflow, or application with minimal post-processing.
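For instance, the JSON variant above parses directly with the standard library. A minimal sketch (the raw string stands in for model output):

```python
import json

# Raw model output in the JSON format shown above
raw = '{"defect_found": true, "defect_type": "scratch", "location": "left side", "severity": "low"}'
data = json.loads(raw)

if data["defect_found"]:
    print(f"{data['severity']}-severity {data['defect_type']} on the {data['location']}")
    # low-severity scratch on the left side
```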


When to use structured extraction

Structured extraction is the right choice when:

  • You need to store results in a database with specific columns
  • You're building an automated pipeline that processes many images
  • You need consistent field names across all predictions
  • Your use case involves forms, documents, or checklists
  • You want your model to generate code from visual input

Common applications:

  • Manufacturing inspection (JSON): defect_found, defect_type, location, severity
  • Invoice / receipt parsing (JSON): vendor, date, total, line_items
  • Property damage assessment (YAML): damage_type, estimated_severity, affected_area
  • Medical imaging report (JSON): finding, location, severity, recommendation
  • Logistics / shipping (key-value): package_condition, label_readable, damage_visible
  • UI screenshot to code (HTML/CSS): generated markup matching the visual layout
  • Chart data extraction (CSV): rows extracted from bar charts or tables in images

How it works

Structured extraction uses the freeform text dataset type with two key elements:

  1. Annotations written in your target format: your training examples show the model what to output
  2. A system prompt that specifies the schema: the model is instructed to follow this format at inference

The freeform text dataset type places no restrictions on annotation content. Any text-based format works as long as you annotate consistently.


Step 1: Pick your output format

Choose a format based on how your downstream system will consume the output.

  • JSON: APIs, databases, most programmatic use; parse with json.loads() in Python or JSON.parse() in JS
  • YAML: config files, human-readable structured data; parse with yaml.safe_load() in Python
  • CSV / TSV: tabular data, spreadsheet import; parse with csv.reader() or split by delimiter
  • XML: legacy systems, SOAP APIs; parse with xml.etree.ElementTree in Python
  • Key-value: simple flat records, log-style output; split by line, then by delimiter
  • Code: UI-to-code, chart-to-SQL, diagram-to-code; execute or transpile directly
  • Custom: domain-specific reporting templates; write a custom parser matching your template

JSON is the most common choice because it handles nested data well and has broad tooling support. But if your pipeline already works with YAML config files, CSV spreadsheets, or code templates, train your model to produce that format directly.
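Whichever format you pick, the parser stays small. The key-value format, for instance, needs only line splitting. A minimal sketch (the raw string stands in for model output):

```python
# Key-value output: one "field: value" pair per line
raw = """defect_found: true
defect_type: scratch
location: top edge
severity: low"""

record = {}
for line in raw.splitlines():
    key, _, value = line.partition(": ")
    record[key] = value  # every value arrives as a string

print(record["defect_type"])  # scratch
```

Note that unlike JSON, this gives you strings for every value; convert booleans and numbers yourself if your downstream code needs them typed.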


Step 2: Design your schema

Start with the minimum fields your application needs. Keep field names short and unambiguous. Use consistent value types.

Example schemas

Defect inspection (JSON):

{
  "defect_found": true,
  "defect_type": "surface scratch",
  "location": "upper left corner",
  "severity": "low"
}

Field guide:

  • defect_found: true or false
  • defect_type: short description, or null if no defect
  • location: use consistent spatial terms ("upper left", "center", "bottom edge")
  • severity: use a fixed vocabulary: "low", "medium", "high", or "none"
Invoice parsing (JSON):

{
  "vendor": "Acme Supply Co.",
  "date": "2024-03-15",
  "total": "132.50",
  "currency": "USD",
  "line_items": [
    {"description": "Widget A", "quantity": 3, "unit_price": "25.00"},
    {"description": "Widget B", "quantity": 5, "unit_price": "11.50"}
  ]
}

Tips: Use ISO date format (YYYY-MM-DD). Store monetary values as strings to avoid floating-point issues. Keep line_items as a list even when there's only one item.
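The string-money tip pays off at parse time: convert with decimal.Decimal instead of float so amounts stay exact. A quick sketch:

```python
from decimal import Decimal

# Parse money from the string fields rather than float to avoid rounding drift
unit_price = Decimal("11.50")
line_total = unit_price * 5
print(line_total)  # 57.50

# Decimal arithmetic stays exact where binary floats would drift
print(Decimal("0.10") + Decimal("0.20"))  # 0.30
```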

Damage assessment (JSON):

{
  "damage_visible": true,
  "damage_type": "dent",
  "affected_area": "front bumper",
  "estimated_severity": "moderate",
  "repair_recommended": true
}

Equipment inspection (YAML):

equipment_id: PUMP-042
status: needs_maintenance
issues:
  - type: corrosion
    location: inlet valve
    severity: moderate
  - type: vibration
    location: bearing housing
    severity: low
next_inspection: 2024-06-01

Tips: YAML's indentation-based nesting is readable without brackets. Use consistent indentation (2 spaces) in all annotations. Avoid YAML features like anchors or multiline strings to keep output predictable.

Chart data extraction (CSV):

product,quantity,unit_price,total
Widget A,3,25.00,75.00
Widget B,5,11.50,57.50
Gasket C,10,1.00,10.00

Tips: Always include a header row so columns are self-documenting. Use a consistent delimiter. Avoid commas inside field values, or wrap those values in quotes.
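Because the header row is always present, csv.DictReader can map columns to field names with no extra configuration. A minimal sketch (the raw string stands in for model output):

```python
import csv
import io

raw = """product,quantity,unit_price,total
Widget A,3,25.00,75.00
Widget B,5,11.50,57.50"""

# DictReader reads the header row and yields one dict per data row
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["product"])  # Widget A
print(rows[1]["total"])    # 57.50
```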

UI-to-code (HTML):

<div class="card">
  <img src="product.jpg" alt="Product thumbnail" />
  <h3>Widget Pro</h3>
  <p class="price">$25.00</p>
  <button class="btn-primary">Add to Cart</button>
</div>

Tips: Stick to a single code style across all annotations. Include only the relevant fragment, not a full HTML document. Use placeholder values for dynamic content like image URLs.

Custom report template:

INSPECTION REPORT
Date: 2024-03-15
Inspector: Auto-Vi
Component: Bearing Assembly #4

[PASS] Surface finish within tolerance
[FAIL] Corrosion detected on inner race
[PASS] Dimensional check within spec
[N/A]  Lubrication (sealed unit)

Overall: FAIL
Action: Replace before next production cycle

Tips: Design a template that covers all possible outcomes. Use fixed labels like [PASS], [FAIL], [N/A] so results can be parsed with simple string matching. Keep section ordering consistent across all annotations.
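The fixed [PASS]/[FAIL]/[N/A] labels make the template parseable with plain string operations, no grammar required. A minimal sketch over the checklist lines above:

```python
# Checklist lines from the report template above
report = """[PASS] Surface finish within tolerance
[FAIL] Corrosion detected on inner race
[PASS] Dimensional check within spec
[N/A]  Lubrication (sealed unit)"""

checks = []
for line in report.splitlines():
    if line.startswith("["):
        # Split "[LABEL] description" into its two parts
        label, _, description = line.partition("] ")
        checks.append((label.strip("[ "), description.strip()))

failed = [desc for status, desc in checks if status == "FAIL"]
print(failed)  # ['Corrosion detected on inner race']
```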


Step 3: Write your annotations

Create a freeform text dataset and annotate each image with output in your chosen format. Annotations should contain only the structured output with no surrounding explanation.

Good annotations:

{"defect_found": true, "defect_type": "scratch", "location": "top edge", "severity": "low"}

defect_found: true
defect_type: scratch
location: top edge
severity: low

Avoid:

The image shows a scratch. Here is the result: {"defect_found": true, ...}

The model learns from your examples, so consistency matters more than quantity. Aim for:

  • At least 50-100 annotated images to start
  • The same format and field names across all annotations
  • Examples that cover all expected output variations (defect found, no defect found, different severities)
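Consistency is easy to check mechanically before you train. A minimal sketch for JSON annotations, assuming you have them as a list of strings (the annotations list and expected_keys set here are illustrative):

```python
import json

# Illustrative annotation strings; in practice, load these from your dataset
annotations = [
    '{"defect_found": true, "defect_type": "scratch", "location": "top edge", "severity": "low"}',
    '{"defect_found": false, "defect_type": null, "location": null, "severity": "none"}',
]

expected_keys = {"defect_found", "defect_type", "location", "severity"}
for i, text in enumerate(annotations):
    try:
        keys = set(json.loads(text))
    except json.JSONDecodeError:
        print(f"annotation {i}: invalid JSON")
        continue
    if keys != expected_keys:
        print(f"annotation {i}: unexpected keys {keys ^ expected_keys}")
```

Run a check like this whenever you add annotations; a single inconsistent example silently teaches the model the wrong schema.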

Step 4: Write your system prompt

The system prompt tells the model what task to perform and what format to use. For structured extraction, your system prompt should:

  1. Describe the task
  2. Specify the exact output format and schema
  3. Instruct the model to output only the structured data

Example system prompts:

You are a quality inspection assistant. Examine the product image and return a JSON object with the following fields:
- defect_found (boolean): true if any defect is visible, false otherwise
- defect_type (string or null): brief description of the defect, null if none
- location (string or null): where the defect appears, null if none
- severity (string): one of "none", "low", "medium", or "high"

Respond with only the JSON object, no additional text.

You are an equipment inspection assistant. Examine the image of the equipment and return a YAML document with the following fields:
- equipment_id (string): the ID visible on the equipment label
- status (string): one of "operational", "needs_maintenance", or "out_of_service"
- issues (list): each issue has type, location, and severity
- next_inspection (date): recommended next inspection in YYYY-MM-DD format

Respond with only the YAML output, no additional text or code fences.

You are a UI-to-code assistant. Given a screenshot of a web UI component, generate the HTML and inline CSS that reproduces the layout. Use semantic HTML elements. Use placeholder values for dynamic content like images and text. Output only the HTML code, no explanation.

Use the same system prompt at training and inference

The system prompt you use during training must match the one you use at inference. If they differ, the model may produce inconsistent or malformed output. See Configure Your System Prompt.


Step 5: Parse results with the Vi SDK

A model trained on freeform text returns a GenericResponse. Access the raw output via result.result and parse it according to the format you trained on.

JSON output

import json
from vi.inference import ViModel

model = ViModel(
    run_id="your-run-id",
    secret_key=".your-secret-key.",
    organization_id="your-organization-id",
)

result, error = model(
    source="product_image.jpg",
    user_prompt="Inspect this product for defects.",
    generation_config={
        "temperature": 0.0,   # deterministic output for consistent structure
        "do_sample": False
    }
)

if error is None:
    raw = result.result  # string output from the model
    data = json.loads(raw)

    print(f"Defect found: {data['defect_found']}")
    if data['defect_found']:
        print(f"Type: {data['defect_type']}")
        print(f"Location: {data['location']}")
        print(f"Severity: {data['severity']}")
else:
    print(f"Inference error: {error}")

YAML output

import yaml
from vi.inference import ViModel

model = ViModel(
    run_id="your-run-id",
    secret_key=".your-secret-key.",
    organization_id="your-organization-id",
)

result, error = model(
    source="equipment_photo.jpg",
    user_prompt="Inspect this equipment and report its status.",
    generation_config={"temperature": 0.0, "do_sample": False}
)

if error is None:
    data = yaml.safe_load(result.result)
    print(f"Equipment: {data['equipment_id']}")
    print(f"Status: {data['status']}")
    for issue in data.get('issues', []):
        print(f"  Issue: {issue['type']} at {issue['location']}")
else:
    print(f"Inference error: {error}")

Handle malformed output

The model occasionally produces output that isn't valid for the expected format, especially during early training or when the input image is ambiguous. Add a fallback:

import json
from vi.inference import ViModel

model = ViModel(
    run_id="your-run-id",
    secret_key=".your-secret-key.",
    organization_id="your-organization-id",
)

result, error = model(
    source="product_image.jpg",
    user_prompt="Inspect this product for defects.",
    generation_config={"temperature": 0.0, "do_sample": False}
)

if error is not None:
    print(f"Inference error: {error}")
else:
    try:
        data = json.loads(result.result)
        # process structured data
        print(data)
    except json.JSONDecodeError:
        # log raw output for debugging
        print(f"Could not parse output. Raw: {result.result}")
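Beyond catching JSONDecodeError, you can check that the parsed object actually matches your schema before trusting it. A minimal sketch with a hand-rolled validate helper (the REQUIRED map mirrors the defect schema from Step 2; a library such as jsonschema would also work):

```python
import json

# Expected fields and their allowed types (illustrative, matching the defect schema)
REQUIRED = {
    "defect_found": bool,
    "defect_type": (str, type(None)),
    "location": (str, type(None)),
    "severity": str,
}

def validate(raw: str):
    """Parse raw model output and check it against the schema.

    Returns (data, None) on success or (None, error_message) on failure.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc}"
    for field, types in REQUIRED.items():
        if field not in data:
            return None, f"missing field: {field}"
        if not isinstance(data[field], types):
            return None, f"wrong type for {field}"
    return data, None

data, err = validate('{"defect_found": true, "defect_type": "scratch", "location": "top", "severity": "low"}')
print(err)  # None
```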

Batch processing

For processing many images at once:

import json
from vi.inference import ViModel

model = ViModel(
    run_id="your-run-id",
    secret_key=".your-secret-key.",
    organization_id="your-organization-id",
)

image_paths = ["img1.jpg", "img2.jpg", "img3.jpg"]

results = model(
    source=image_paths,
    user_prompt="Inspect this product for defects.",
    generation_config={"temperature": 0.0, "do_sample": False}
)

records = []
for image_path, (result, error) in zip(image_paths, results):
    if error is not None:
        print(f"{image_path}: error - {error}")
        continue
    try:
        data = json.loads(result.result)
        data["image"] = image_path
        records.append(data)
    except json.JSONDecodeError:
        print(f"{image_path}: could not parse output - {result.result}")

# records is now a list of dicts ready to write to a database or CSV
print(f"Processed {len(records)} images successfully")
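Once the loop finishes, the records list can be flushed to CSV in a few lines with csv.DictWriter. A minimal sketch, assuming every record shares the same keys (the sample records and filename are illustrative):

```python
import csv

# Illustrative records, shaped like the batch-processing output above
records = [
    {"image": "img1.jpg", "defect_found": True, "defect_type": "scratch", "severity": "low"},
    {"image": "img2.jpg", "defect_found": False, "defect_type": None, "severity": "none"},
]

with open("inspection_results.csv", "w", newline="") as f:
    # Take the column order from the first record's keys
    writer = csv.DictWriter(f, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
```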

Tips for reliable structured output

  • Use temperature: 0.0: deterministic decoding produces more consistent output structure
  • Keep schemas simple: fewer fields, clearer field names, fixed value vocabularies
  • Be consistent in annotations: if you use "low" in some and "Low" in others, the model will too
  • Provide diverse examples: include both positive and negative cases (defect found / no defect found)
  • Test with held-out images: before deploying, run inference on images not used in training to verify output format
  • Match your format to the consumer: use JSON for APIs, CSV for spreadsheet workflows, code for developer tools
  • Avoid mixing formats: train each model on one output format. If you need both JSON and CSV, train separate models or use the system prompt to select the format at inference.

Related resources