Model Type: Multimodal, Vision-Language

Use Cases
- Areas: Research, Commercial applications
- Applications: Visual question answering, Image comprehension
- Primary Use Cases: Multi-round text-image conversations
- Limitations: Supports text-image conversations but not image-to-video; may hallucinate content not present in images; image input resolution is limited to 448×448 (see the preprocessing sketch after this list).
- Considerations: Evaluate potential risks before adopting the model.
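
The 448×448 limit above mostly matters at preprocessing time. The sketch below letterboxes an arbitrary image to 448×448 with Pillow before it is handed to whatever Yi-VL inference stack is in use; the function name `letterbox_to_448` and the pad-and-center strategy are illustrative assumptions, not Yi-VL's documented pipeline.

```python
from PIL import Image

TARGET_SIZE = 448  # input resolution stated on this card


def letterbox_to_448(path: str, fill=(127, 127, 127)) -> Image.Image:
    """Scale the longer side to 448 px and pad the rest, preserving aspect
    ratio. This is one common convention; Yi-VL's own preprocessing may
    crop or resize differently."""
    img = Image.open(path).convert("RGB")
    scale = TARGET_SIZE / max(img.size)
    resized = img.resize(
        (round(img.width * scale), round(img.height * scale)),
        Image.Resampling.BICUBIC,
    )
    canvas = Image.new("RGB", (TARGET_SIZE, TARGET_SIZE), fill)
    canvas.paste(
        resized,
        ((TARGET_SIZE - resized.width) // 2, (TARGET_SIZE - resized.height) // 2),
    )
    return canvas
```

Center-cropping instead of padding is an equally common choice; the point is only that inputs larger or smaller than 448×448 are normalized before they reach the model.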

Additional Notes: Supports image understanding at 448×448 resolution; ongoing limitations may affect certain applications.

Supported Languages: English (proficient), Chinese (proficient)

Training Details
- Data Sources: LAION-400M, CLLaVA, Flickr, VQAv2, RefCOCO, Visual7w, GQA, VizWiz VQA, TextCaps, OCR-VQA, Visual Genome, LAION GPT4V
- Data Volume:
- Methodology: Three-stage training that aligns image and text representations between the ViT and the Yi LLM, using datasets such as LAION-400M and Visual Genome (see the staged-training sketch after this list).
- Training Time: 10 days for Yi-VL-34B, 3 days for Yi-VL-6B
- Hardware Used: 128 NVIDIA A800 (80G) GPUs
- Model Architecture: Vision Transformer (ViT) initialized from CLIP ViT-H/14, a projection module, and a Yi LLM (see the architecture sketch after this list).
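
The Methodology entry names a three-stage schedule but not which of the three blocks trains in each stage. Below is a minimal PyTorch sketch of how such a staged schedule is typically wired up via parameter freezing; the specific freeze/unfreeze choices per stage, and the helper names `set_trainable`, `configure_stage`, and `build_optimizer`, are assumptions for illustration rather than Yi-VL's published recipe.

```python
import torch
from torch import nn


def set_trainable(module: nn.Module, flag: bool) -> None:
    """Freeze or unfreeze every parameter of a sub-module."""
    for p in module.parameters():
        p.requires_grad = flag


def configure_stage(vit: nn.Module, projection: nn.Module,
                    llm: nn.Module, stage: int) -> None:
    """Illustrative three-stage schedule (assumed): early stages fit the
    vision encoder and projection against a frozen LLM, the final stage
    fine-tunes everything end to end."""
    if stage in (1, 2):
        set_trainable(vit, True)
        set_trainable(projection, True)
        set_trainable(llm, False)
    elif stage == 3:
        set_trainable(vit, True)
        set_trainable(projection, True)
        set_trainable(llm, True)
    else:
        raise ValueError(f"unknown stage: {stage}")


def build_optimizer(modules: list[nn.Module], lr: float) -> torch.optim.Optimizer:
    """Only parameters left trainable by configure_stage receive gradients."""
    params = [p for m in modules for p in m.parameters() if p.requires_grad]
    return torch.optim.AdamW(params, lr=lr)
```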
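
The Model Architecture entry lists three blocks: a ViT initialized from CLIP ViT-H/14, a projection module, and a Yi LLM. The sketch below shows one conventional way those blocks compose in a forward pass, with projected visual tokens prepended to the text embeddings; the two-layer MLP projection, the concatenation order, and the class names are assumptions made for illustration, not details taken from this card.

```python
import torch
from torch import nn


class Projection(nn.Module):
    """Maps ViT patch features into the LLM embedding space.
    The two-layer MLP here is an assumed design, not Yi-VL's documented one."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        return self.net(patch_features)


class VisionLanguageSketch(nn.Module):
    """ViT -> projection -> LLM, with visual tokens prepended to text tokens."""
    def __init__(self, vit: nn.Module, projection: Projection,
                 text_embedding: nn.Embedding, llm_decoder: nn.Module):
        super().__init__()
        self.vit = vit
        self.projection = projection
        self.text_embedding = text_embedding
        self.llm_decoder = llm_decoder

    def forward(self, pixel_values: torch.Tensor,
                input_ids: torch.Tensor) -> torch.Tensor:
        patch_features = self.vit(pixel_values)          # (B, num_patches, vision_dim)
        visual_tokens = self.projection(patch_features)  # (B, num_patches, llm_dim)
        text_tokens = self.text_embedding(input_ids)     # (B, seq_len, llm_dim)
        inputs_embeds = torch.cat([visual_tokens, text_tokens], dim=1)
        return self.llm_decoder(inputs_embeds)           # logits over the vocabulary
```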

Input Output
- Input Format: Text prompts with accompanying images in a multi-round conversation
- Accepted Modalities: Text, Image
- Output Format: Text