Robotics and Physical AI
Use Datature Vi to ground language instructions in visual scenes for robot manipulation, pick-and-place tasks, and Vision-Language-Action workflows.
Robots work best when they can understand plain-language instructions: "pick up the red mug on the left shelf" or "move the blue package to the staging area." The challenge is connecting those words to what the robot's camera actually sees. Which object is the "red mug"? Where exactly is "the left shelf"?
Datature Vi trains models that link language to images. You show the model examples of objects in your workspace, label them with natural descriptions, and the model learns to find those objects in new camera frames. The robot's control system then uses that location information to plan its movements.
This applies to warehouse pick-and-place, manufacturing assembly, bin picking, and any task where a robot needs to understand what it is looking at before acting on it.
For an interactive overview of this application, visit the robotic pick and place use case on vi.datature.com.
Common applications
Object grounding for manipulation
The most direct use case: given a language instruction, locate the target object in the image.
Task type: Phrase Grounding
Phrase grounding returns bounding box coordinates for objects described in natural language, in exactly the format a robot arm's motion planner needs.
At inference:
```python
from vi.inference import ViModel

model = ViModel(
    run_id="your-run-id",
    secret_key="your-secret-key",
    organization_id="your-organization-id",
)

result, error = model(
    source="workspace_camera.jpg",
    user_prompt="Pick up the red cylindrical container on the second shelf.",
)

if error is None:
    for grounded in result.result.groundings:
        print(f"Object: {grounded.phrase}")
        for bbox in grounded.grounding:
            x_min, y_min, x_max, y_max = bbox
            # Divide by 1024 to normalize the box center to [0, 1]
            center_x = (x_min + x_max) / 2 / 1024
            center_y = (y_min + y_max) / 2 / 1024
            print(f"Normalized center: ({center_x:.3f}, {center_y:.3f})")
```

Vi returns bounding boxes in normalized coordinates in the range [0, 1024]. Scale by your image dimensions to get pixel coordinates. See Prediction Schemas for conversion code.
Scene understanding and QA
Use VQA to answer questions about the current state of the workspace, feeding answers into a planning or control system.
Task type: VQA
Structured scene state
For tighter integration with planning systems, use structured data extraction to return scene state as JSON:
```json
{
  "path_clear": false,
  "obstacle_type": "cardboard box",
  "obstacle_location": "center path",
  "recommended_action": "remove obstacle before proceeding"
}
```

Vision-Language-Action (VLA) workflows
In VLA systems, Vi acts as the perception and language grounding layer. The typical integration pattern:
```
Camera image → Vi VLM → Scene understanding / object location
                              ↓
                 Task planner or policy network
                              ↓
                    Robot action execution
```

Vi's role: Translate natural language instructions and camera images into structured outputs (bounding boxes, scene descriptions, state JSON) that your action model or planner consumes.
What Vi does not do: Vi does not directly output robot joint angles or action primitives. It provides the perception and language grounding; your downstream system handles action selection.
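To illustrate the handoff, here is one way a downstream planner might consume the structured scene state shown earlier. The `SceneState` dataclass and the gating logic are our own sketch, not a Vi SDK type:

```python
import json
from dataclasses import dataclass
from typing import Optional

@dataclass
class SceneState:
    path_clear: bool
    obstacle_type: Optional[str]
    obstacle_location: Optional[str]
    recommended_action: str

def parse_scene_state(raw: str) -> SceneState:
    """Parse the model's JSON output into a typed object for the planner."""
    data = json.loads(raw)
    return SceneState(
        path_clear=bool(data["path_clear"]),
        obstacle_type=data.get("obstacle_type"),
        obstacle_location=data.get("obstacle_location"),
        recommended_action=data["recommended_action"],
    )

raw = """{"path_clear": false, "obstacle_type": "cardboard box",
"obstacle_location": "center path",
"recommended_action": "remove obstacle before proceeding"}"""

state = parse_scene_state(raw)
if not state.path_clear:
    print(f"Halting: {state.recommended_action}")
```

Parsing into a typed object keeps malformed or incomplete model output from silently reaching the robot: a missing required key raises immediately rather than being acted on.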
Multi-step task reasoning
For tasks that require checking multiple conditions before acting, use chain-of-thought reasoning:
Example: pre-grasp scene check

System prompt:

```
You are a robot vision assistant. Before confirming a grasp action, check:
1. Is the target object visible and unobstructed?
2. Is there sufficient clearance around the object for the gripper?
3. Is the object orientation suitable for grasping?
Reason through each step, then give a final recommendation.
```

The model's <think> output (accessible via result.thinking) shows the reasoning chain; the final answer gives a clear go/no-go.
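Turning the final answer into a machine-checkable decision is easiest if the system prompt also pins down the output convention, e.g. "end with a line reading GO or NO-GO". A small sketch under that assumption (the convention and helper are ours, not part of Vi):

```python
def extract_decision(answer: str) -> bool:
    """Return True for a go recommendation, False otherwise.
    Assumes the system prompt asks the model to end its final answer
    with a line containing 'GO' or 'NO-GO'. Defaults to False (safe)
    when neither token is present."""
    final_line = answer.strip().splitlines()[-1].upper()
    # Check NO-GO first, since the string "GO" is a substring of "NO-GO"
    return "NO-GO" not in final_line and "GO" in final_line
```

Defaulting to False means an unparseable answer halts the robot instead of triggering a grasp.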
Training tips for robotics
Match your camera setup exactly: train on images from the same camera, mounting position, and lighting conditions used in deployment. Perspective and lighting changes degrade localization accuracy significantly.
Capture object variation: the same object appears differently based on orientation, occlusion, and distance. Include multiple views of each target object in your training data.
Use precise spatial language: annotations should use consistent spatial terms that match how your system issues commands: "left", "right", "near edge", "top shelf". Avoid ambiguous phrasing.
Test at inference speed: for real-time applications, measure end-to-end latency (camera capture → model inference → output) on your deployed hardware before integrating into a control loop.
Start with static scenes: train and validate in static environments before moving to dynamic or cluttered workspaces. Add complexity incrementally.
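The latency tip above can be sketched as a simple timing harness. `capture` and `infer` stand in for your own camera-read and model-call functions; the harness itself is illustrative, not part of the Vi SDK. Report the worst case as well as the mean, since a control loop must tolerate the slowest frame, not the average one:

```python
import time

def measure_latency(capture, infer, n_trials=20):
    """Time capture -> inference end to end over n_trials runs and
    return mean, approximate p95, and worst-case latency in seconds."""
    samples = []
    for _ in range(n_trials):
        start = time.perf_counter()
        frame = capture()          # your camera read
        infer(frame)               # your Vi model call
        samples.append(time.perf_counter() - start)
    samples.sort()
    return {
        "mean_s": sum(samples) / len(samples),
        "p95_s": samples[int(0.95 * (len(samples) - 1))],
        "worst_s": samples[-1],
    }
```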
Next steps
Phrase Grounding
Full reference for object localization with natural language, bounding box format, and annotation best practices.
Chain-of-Thought Reasoning
Add multi-step reasoning for complex scene assessment and task planning support.
Structured Data Extraction
Return scene state as JSON for direct integration with planning and control systems.