Chain-of-Thought Reasoning

Learn how chain-of-thought reasoning improves vision-language model accuracy by breaking complex visual tasks into step-by-step logical processes.

Chain-of-thought (CoT) reasoning is a technique that enables vision-language models (VLMs) to solve complex visual tasks by breaking them down into explicit, step-by-step reasoning processes. Datature Vi supports chain-of-thought reasoning through system prompt configuration and training data with annotated reasoning steps. Instead of generating a direct answer, the model articulates its reasoning path, leading to more accurate and interpretable results.

This is similar to showing your work in a math problem. Rather than jumping straight to an answer, the model explains its reasoning: "I see three red boxes on the left shelf, two blue boxes on the right shelf, so there are five boxes total."

New to Datature Vi?

Datature Vi is a platform for training custom VLMs. Learn what it does or follow the quickstart.

Best for
  • Complex questions requiring counting, comparison, or spatial reasoning
  • Multi-step logic where accuracy matters more than speed
  • Scenarios requiring auditable, explainable results
  • Quality inspection, inventory management, and complex visual analysis
Not for
  • Simple, single-step questions like "What color is the car?" (direct reasoning is faster and equally accurate)
  • Latency-sensitive applications where extra token generation is prohibitive
  • Binary yes/no questions with clear answers
By the end of this guide

Understand how chain-of-thought reasoning improves accuracy on complex visual tasks.


How chain-of-thought reasoning works

1

Question decomposition

When the model receives a complex visual question, it identifies the logical steps needed to answer it.

Example question: "Are there more red items or blue items in this image?"

Reasoning steps identified:

  1. Locate and count all red items
  2. Locate and count all blue items
  3. Compare the counts
  4. Provide the answer

This decomposition helps the model approach the problem systematically rather than guessing at a direct answer.

2

Step-by-step execution

The model executes each reasoning step sequentially, generating intermediate observations:

  • "I can see red items: one red mug on the left, one red book on the shelf, and one red box on the table. That's 3 red items."
  • "I can see blue items: two blue pens on the desk and one blue folder. That's 3 blue items."
  • "Comparing: 3 red items and 3 blue items."

Each step builds on previous observations, creating a logical chain.

3

Answer synthesis

After completing all reasoning steps, the model synthesizes a final answer based on the complete reasoning chain:

"There are an equal number of red and blue items (3 of each)."

This approach reduces errors because the model verifies each step before proceeding.
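With CoT enabled at inference time (see "How to enable CoT in Datature Vi" below), the raw model output wraps this chain in tags before the SDK parses it. For this example, the raw response might look like the following (illustrative only; the exact formatting of the tagged output may differ):

<think>Red items: one mug, one book, and one box, so 3. Blue items: two pens and one folder, so 3. The counts are equal.</think><answer>There are an equal number of red and blue items (3 of each).</answer>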

Accuracy Improvement

Chain-of-thought reasoning can improve accuracy on complex VQA tasks by reducing the chance of the model skipping intermediate steps, while also providing transparent explanations of how the model reached its conclusions.


Common use cases

Quality inspection

Systematically check products for multiple defect types. The model reasons through surface scratches, edge chips, and discoloration before reaching a pass/fail verdict.

Inventory management

Count and categorize items in complex warehouse scenes. CoT breaks down counting by shelf, section, or category for accurate totals.

Medical imaging

Multi-step diagnostic reasoning that can be validated by medical professionals. The model examines regions, describes findings, and states recommendations.

Robotics

Auditable decision-making in complex environments. The model identifies obstacles, assesses options, and explains its chosen path.

E-commerce

Answer complex product questions that require comparing features or analyzing multiple visible aspects of an item.

Autonomous vehicles

Understand complex traffic scenarios by explicitly considering traffic lights, pedestrians, and road conditions before deciding.


Chain-of-thought vs. direct reasoning

Question type | Direct reasoning | CoT reasoning | When to use CoT
Simple attributes ("What color is this?") | Works well | Unnecessary overhead | No
Counting ("How many items?") | Often inaccurate | Much better | Yes
Comparison ("Which has more?") | Unreliable | Much more accurate | Yes
Multi-step logic ("Are safety requirements met?") | Poor | Reliable | Yes
Spatial reasoning ("What's between X and Y?") | Sometimes works | More reliable | Yes
Yes/No (simple, "Is there a dog?") | Works well | Unnecessary | No
Yes/No (complex, "Is this suitable for outdoor use?") | Inconsistent | More reliable | Yes

Tips for better chain-of-thought results

Tip | Good example | Poor example
Prompt for step-by-step thinking | "First count the red items, then count blue, then compare" | "How many red and blue items?"
Structure multi-step questions | "Identify all safety equipment, check each person, then confirm compliance" | "Is this safe?"
Keep reasoning to 2-4 steps | "3 workers. Worker 1: helmet + vest. Worker 2: helmet only. Worker 3: none. Not compliant." | "I notice lighting... natural light... daytime... workers visible..."

Include reasoning examples in training data. When creating VQA annotations, include step-by-step answers using <datature_think> tags. This teaches the model to reason before answering. See the Annotation Guide for the format.

Validate intermediate steps, not just the final answer. Check that observations are visually accurate, that each step follows logically, and that the conclusion matches the reasoning. This distinguishes perception errors (seeing wrong) from logic errors. See Configure Your System Prompt for prompt design guidelines.
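As a minimal sketch of an automated logic check, assuming a counting task where the reasoning lists per-step counts and ends with a stated total (check_count_consistency is a hypothetical helper, not part of the Vi SDK):

import re

def check_count_consistency(thinking: str, expected_total: int) -> bool:
    # Pull every standalone integer out of the reasoning text, e.g.
    # "Shelf A: 3 boxes. Shelf B: 2 boxes. Total: 5 boxes." -> [3, 2, 5]
    numbers = [int(n) for n in re.findall(r"\b\d+\b", thinking)]
    if len(numbers) < 2:
        return False
    # Heuristic: the last number is the stated total, the rest are step counts
    *steps, total = numbers
    return sum(steps) == total == expected_total

thinking = "Shelf A: 3 boxes. Shelf B: 2 boxes. Total: 5 boxes."
print(check_count_consistency(thinking, expected_total=5))  # True

A check like this catches logic errors (steps that do not add up); perception errors (miscounted items) still require visual review against the image.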


How to enable CoT in Datature Vi

Chain-of-thought reasoning can be enabled through your annotation format and training data, and at inference time by passing cot=True when calling a ViModel instance (see Run Inference).

Annotation-side

To train a model with CoT reasoning, include reasoning steps in your annotations using <datature_think> tags. During training, Datature Vi converts these to the model's native <think> tags.

VQA annotation with reasoning:

Question: "Are there more than 5 defects visible?"

Answer: "<datature_think>Let me count the defects systematically. Top section: 2 scratches. Middle section: 1 dent and 1 discoloration. Bottom section: 3 scratches. Total: 7 defects.</datature_think>Yes, there are 7 defects visible: 2 scratches in the top section, 1 dent and 1 discoloration in the middle, and 3 scratches in the bottom."

The <datature_think> tag wraps the reasoning steps. The text after the closing tag is the final answer. This teaches the model to reason before answering.
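If you generate annotations programmatically, a small helper can assemble answers in this format (a sketch; build_cot_answer is a hypothetical function, not part of the Vi SDK):

def build_cot_answer(reasoning: str, final_answer: str) -> str:
    # Wrap the reasoning in <datature_think> tags; the final answer
    # follows immediately after the closing tag
    return f"<datature_think>{reasoning}</datature_think>{final_answer}"

annotation_answer = build_cot_answer(
    reasoning="Top section: 2 scratches. Middle section: 1 dent and "
              "1 discoloration. Bottom section: 3 scratches. Total: 7 defects.",
    final_answer="Yes, there are 7 defects visible.",
)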

For more annotation format details, see the Annotation Guide.

Inference-side

Pass cot=True when calling a ViModel instance (the same keyword the underlying predictor accepts). This enables CoT decoding at inference time: the model outputs <think> and <answer> tags, and the Vi SDK parses them into the thinking field and the task-specific result fields.

from vi.inference import ViModel

# Load the trained model from your Datature Vi run
model = ViModel(
    run_id="your-run-id",
    secret_key=".your-secret-key.",
    organization_id="your-organization-id",
)
# cot=True enables chain-of-thought decoding for this call
result, error = model(
    source="image.jpg",
    user_prompt="Count all defects",
    cot=True,
    stream=False,
)

if error is None:
    if result.thinking:
        print("Reasoning:", result.thinking)  # text parsed from <think> tags
    print("Answer:", result.result)  # structured final output

The thinking field holds text from <think> tags when present. The structured result field holds the final parsed output for your task type.

For token and sampling settings used together with CoT, see Configure Generation. For the full response shape, see Prediction Schemas. For call-level options (stream, cot, fps, and more), see Run Inference.


Frequently asked questions

Does chain-of-thought reasoning increase inference time?

Yes. Chain-of-thought reasoning takes longer because the model generates more tokens (the reasoning steps plus the answer).

Typical impact:

  • Simple questions: 2–3x longer than direct answers
  • Complex questions: Often only 1.5x longer (reasoning would happen internally anyway)

Use direct reasoning for simple questions and reserve CoT for complex questions where accuracy justifies the extra time.
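To quantify the trade-off on your own workload, you can time the same question with and without CoT. A sketch using the ViModel call from the inference section above (this assumes cot=False is the default direct-answer mode):

import time

from vi.inference import ViModel

model = ViModel(
    run_id="your-run-id",
    secret_key=".your-secret-key.",
    organization_id="your-organization-id",
)

for use_cot in (False, True):
    start = time.perf_counter()
    result, error = model(
        source="image.jpg",
        user_prompt="Are there more red items or blue items?",
        cot=use_cot,  # toggle chain-of-thought decoding
        stream=False,
    )
    elapsed = time.perf_counter() - start
    print(f"cot={use_cot}: {elapsed:.2f}s")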

Learn about model performance →

Can chain-of-thought reasoning be combined with phrase grounding?

Yes. You can use chain-of-thought reasoning to make decisions about phrase grounding queries.

Example workflow:

1

Use CoT to analyze

"Let me check systematically. The defect region shows discoloration and surface irregularity on the upper right portion..."

2

Use phrase grounding to localize

"the defective area on the upper right"

3

Get a precise bounding box for the identified defect
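A rough sketch of this workflow, assuming two trained runs (a VQA model with CoT and a phrase grounding model) and that the grounding model accepts the phrase as user_prompt; the exact call options and result fields depend on your task's prediction schema:

from vi.inference import ViModel

# Placeholder run IDs and credentials for two hypothetical runs
vqa_model = ViModel(
    run_id="your-vqa-run-id",
    secret_key=".your-secret-key.",
    organization_id="your-organization-id",
)
grounding_model = ViModel(
    run_id="your-grounding-run-id",
    secret_key=".your-secret-key.",
    organization_id="your-organization-id",
)

# Step 1: reason about the image to decide what to localize
analysis, error = vqa_model(
    source="image.jpg",
    user_prompt="Describe any defects and where they are located",
    cot=True,
    stream=False,
)

# Step 2: pass the phrase the analysis surfaced to the grounding model
if error is None:
    grounded, error = grounding_model(
        source="image.jpg",
        user_prompt="the defective area on the upper right",
        stream=False,
    )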

Learn about phrase grounding →

How should I evaluate chain-of-thought outputs?

Evaluate both the reasoning process and the final answer; a minimal evaluation sketch follows the metrics below.

Reasoning quality metrics:

  • Logical consistency: Do steps follow logically?
  • Visual accuracy: Are observations correct?
  • Completeness: Are all necessary steps included?

Answer quality metrics:

  • Correctness: Is the final answer right?
  • Attribution: Does the answer follow from the reasoning?

Learn about evaluation →

Does chain-of-thought require special training data?

Not necessarily. Modern VLMs can perform CoT reasoning through prompting alone, but training with reasoning examples improves quality.

Option 1: Prompt-based CoT (no special training). Use system prompts that request reasoning. Works with any trained VQA model. Good for getting started.

Option 2: Training with reasoning examples (better results). Include reasoning steps in training annotations using <datature_think> tags; the model learns to reason naturally and produces more consistent output. See the Annotation Guide for the format.

Start with prompt-based CoT, then add reasoning examples to your training data if you need better results.
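For option 1, a system prompt along these lines can elicit reasoning without retraining (an illustrative prompt, not a prescribed template; see Configure Your System Prompt for where to set it):

# Illustrative system prompt for prompt-based CoT; adapt to your task
SYSTEM_PROMPT = (
    "Before answering, reason step by step: identify the relevant objects, "
    "count or compare them as needed, and verify each observation against "
    "the image. Then give your final answer in one sentence."
)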

What if the reasoning is wrong but the answer is correct?

This can happen when the model arrives at the correct answer through incorrect reasoning, a phenomenon called "right for the wrong reasons."

Example:

  • Question: "How many red items?"
  • Wrong reasoning: "I see 2 red boxes and 1 red book. Wait, that book might be orange. Let me say 2."
  • Answer: "2 red items" (correct, but reasoning showed confusion)

To address this: carefully review reasoning steps during model evaluation, and include diverse examples in training to strengthen visual understanding.

For a deep dive into CoT for VLMs, see the Introduction to Chain-of-Thought for Vision-Language Models blog post and the CoT glossary entry.


Related resources

Configure System Prompts

Set up prompts that encourage step-by-step reasoning in your VLM.

Visual Question Answering

Learn about the VQA dataset type that chain-of-thought reasoning builds on.

Annotate For VQA

Create question-answer training pairs with reasoning steps.