Concepts
Understand the core vision-language model concepts that power Datature Vi.
Vision-language models (VLMs) combine computer vision and natural language understanding to enable flexible, intuitive interactions with images. Instead of detecting pre-defined object categories, VLMs understand natural language descriptions and questions about visual content.
Datature Vi supports core VLM capabilities that form the foundation of modern vision AI applications.
Core VLM capabilities
Dataset types
Choose the right dataset type for your vision AI application:
- Phrase Grounding — Localize objects in images using natural language descriptions. Find "the red car on the left" or "person wearing blue jacket" without pre-defined categories.
- Visual Question Answering (VQA) — Answer questions about images in natural language. Ask "What color is the car?" or "Is there a defect?" and get conversational answers.
- Freeform — 🚧 Coming soon. Define custom annotation schemas for specialized use cases and research projects.
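To make these dataset types concrete, here is a minimal sketch of what each output looks like. It is plain Python for illustration only; the field names are hypothetical and do not reflect Datature Vi's actual annotation format.

```python
# Hypothetical records, for illustration only (not Datature Vi's schema).

# Phrase Grounding: a free-text phrase paired with the box(es) it refers to.
grounding_result = {
    "phrase": "the red car on the left",
    "boxes": [
        {"x_min": 34, "y_min": 120, "x_max": 310, "y_max": 298, "score": 0.91},
    ],
}

# Visual Question Answering: a question paired with a natural-language answer.
vqa_result = {
    "question": "Is there a defect on the surface?",
    "answer": "Yes, there is a small scratch near the top-left corner.",
}
```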
Advanced capabilities
Beyond the core dataset types, Datature Vi supports Chain-of-Thought Reasoning: the model works through intermediate reasoning steps before committing to an answer, which improves accuracy on complex visual tasks (see Learn more below).
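As a generic illustration of the idea (the prompt wording below is our own, not a Datature-specific template), compare a direct question with a chain-of-thought version:

```python
# Generic illustration of chain-of-thought prompting for a visual task.
# The wording is illustrative; it is not a Datature Vi prompt template.
direct_prompt = "How many defective items are on the shelf?"

cot_prompt = (
    "How many defective items are on the shelf? "
    "Think step by step: first list every item you can see, "
    "then check each item for visible damage, "
    "and only then state the final count."
)
```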
Which capability should I use?
Choose based on what you need to accomplish:
| Need | Use | Output |
|---|---|---|
| Locate objects described in text | Phrase Grounding | Bounding boxes with locations |
| Get information about images | Visual Question Answering | Text answers to your questions |
| Flexible object detection without fixed categories | Phrase Grounding | Spatial locations |
| Conversational interaction with images | Visual Question Answering | Natural language responses |
| Custom annotation schemas for specialized needs | Freeform (coming soon) | User-defined formats |
Learn more:
- Phrase Grounding vs. other vision tasks →
- VQA vs. other vision tasks →
- Freeform for custom use cases →
Common use cases
Phrase Grounding applications
Perfect for scenarios requiring object localization with flexible descriptions:
- Robotics — "Pick up the red mug on the left"
- Image editing — Select objects using natural language
- Autonomous vehicles — Identify "the pedestrian crossing from the right"
- Warehouse automation — Find "the damaged box on the top shelf"
Explore detailed use cases and examples →
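To experiment with the underlying idea outside of Datature Vi, open-vocabulary detection is a closely related technique with open-source implementations. Below is a minimal sketch using the OWL-ViT model through the Hugging Face Transformers pipeline; the model choice and image file name are assumptions for the example, and this is not the Vi SDK.

```python
# Sketch: open-vocabulary object localization with OWL-ViT.
# Illustrates the phrase grounding idea generically; not the Vi SDK.
from transformers import pipeline
from PIL import Image

detector = pipeline(
    "zero-shot-object-detection",
    model="google/owlvit-base-patch32",
)

image = Image.open("warehouse.jpg")  # any test image of your own
results = detector(image, candidate_labels=["damaged box", "red mug"])

for r in results:
    # Each result carries a label, a confidence score, and a pixel-space box.
    print(r["label"], round(r["score"], 3), r["box"])
```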
Visual Question Answering applications
Ideal for extracting information through conversational queries:
- Quality inspection — "Is there a defect on the surface?"
- Accessibility — Describe images for visually impaired users
- Content moderation — "Does this image contain inappropriate content?"
- Inventory management — "How many items are on the shelf?"
Explore detailed use cases and examples →
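Similarly, you can try the VQA concept with an open-source model before building on Datature Vi. The sketch below uses the ViLT VQA pipeline from Hugging Face Transformers; again, the model choice and image file name are assumptions, and this is not the Vi SDK.

```python
# Sketch: visual question answering with ViLT; illustrative, not the Vi SDK.
from transformers import pipeline
from PIL import Image

vqa = pipeline(
    "visual-question-answering",
    model="dandelin/vilt-b32-finetuned-vqa",
)

image = Image.open("shelf.jpg")  # any test image of your own
answers = vqa(image=image, question="How many items are on the shelf?")

# The pipeline returns candidate answers ranked by confidence.
print(answers[0]["answer"], answers[0]["score"])
```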
Freeform applications
🚧 Coming soon — Perfect for specialized scenarios requiring custom annotation formats:
- Research projects — Novel computer vision tasks and experimental approaches
- Medical imaging — Custom diagnostic annotations and measurements
- Scientific imaging — Domain-specific labels and specialized metadata
- Hybrid requirements — Combining multiple annotation types for complex use cases
Explore freeform concepts and use cases →
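Because Freeform has not shipped yet, any concrete schema shown here is speculative. Purely as a hypothetical illustration (every field name below is invented for this example), a custom schema might combine a region, a measurement, and free-text notes in a single record:

```python
# Hypothetical freeform annotation, invented for illustration only.
# A medical-imaging record mixing a region, a measurement, and free text.
custom_annotation = {
    "schema": "lesion-report/v1",  # user-defined schema identifier
    "region": {"x_min": 102, "y_min": 88, "x_max": 164, "y_max": 141},
    "measurements": {"diameter_mm": 7.4},
    "notes": "Well-circumscribed lesion; recommend follow-up imaging.",
}
```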
Getting started
Ready to build with VLMs? Start here:
- Create a dataset — Set up datasets for Phrase Grounding, VQA, or Freeform tasks
- Quickstart — Complete VLM workflow from data to deployment
- Train a model — Fine-tune VLMs on your specific use case
Learn more
- Phrase Grounding explained → — Deep dive into visual grounding, how it works, and best practices
- Visual Question Answering explained → — Complete guide to VQA, architectures, and optimization tips
- Chain-of-Thought Reasoning explained → — Learn how step-by-step reasoning improves accuracy for complex visual tasks
- Freeform explained → — Understanding custom annotation schemas and specialized use cases (coming soon)
Related resources
- Phrase grounding — Deep dive into object localization with natural language
- Visual question answering — Understand VQA capabilities and use cases
- Chain-of-Thought reasoning — Learn step-by-step reasoning for complex visual tasks
- Freeform — Custom annotation schemas for specialized use cases (coming soon)
- Quickstart — End-to-end VLM training workflow
- Train a model — Complete training guide
- Annotate data — Create phrase grounding and VQA annotations
- Create a dataset — Set up datasets for training
- Glossary — Common VLM terminology and definitions
- Vi SDK — Python SDK for programmatic access
- Run inference — Use trained models for predictions
- Configure your model — Select model architecture
- Evaluate a model — Assess model performance
- Contact us — Get help from the Datature team
Need help?
We're here to support your VLMOps journey. Reach out through the Contact us page to get help from the Datature team.
