Model Type | text generation, multimodal |
|
Use Cases |
Areas: | research, commercial applications |
|
Applications: | question answering, dialog, content creation, mobile device operation, robot operation |
|
Primary Use Cases: | video-based question answering, multimodal analytics, language understanding in various languages |
|
Limitations: | no audio support, updated until June 2023, limited recognition of individuals and IP, weak spatial reasoning skills |
|
|
Additional Notes | The model supports local files, base64, and URLs for input images. Limitations in spatial reasoning and complex instruction handling are noted. |
|
Supported Languages | en (high), zh (high), fr (medium), de (medium), es (medium), ja (medium), ko (medium), ar (medium), vi (medium) |
|
Training Details |
Data Sources: | MathVista, DocVQA, RealWorldQA, MTVQA |
|
Methodology: | Naive Dynamic Resolution, Multimodal Rotary Position Embedding |
|
Model Architecture: | Multimodal architecture supporting images, video processing |
|
|
Input Output |
Input Format: | |
Accepted Modalities: | |
Output Format: | |
Performance Tips: | Use flash_attention_2 for better acceleration and memory saving. |
|
|