Model Type | text generation, multimodal |
|
Use Cases |
Areas: | commercial applications, research |
|
Applications: | visual question answering, dialog systems, content creation, device integration |
|
Primary Use Cases: | Visual QA, Dialog, Content Creation |
|
Limitations: | Lack of Audio Support, Data timeliness until June 2023, Recognition of specific individuals or IPs, Limited handling of complex instructions, Insufficient counting accuracy, Weak spatial reasoning skills |
|
|
Additional Notes | Available in different quantizations for broad hardware compatibility. |
|
Supported Languages | en (>=0.8), zh (>=0.8), fr (>=0.8), es (>=0.8), de (>=0.8), ru (>=0.8), ja (>=0.8), ko (>=0.8), ar (>=0.8), vi (>=0.8) |
|
Training Details |
Data Sources: | MathVista, DocVQA, RealWorldQA, MTVQA |
|
Methodology: | Instruction-tuning, GPTQ quantization |
|
Model Architecture: | Naive Dynamic Resolution, Multimodal Rotary Position Embedding (M-ROPE) |
|
|
Input Output |
Input Format: | Message-based input with role specification |
|
Accepted Modalities: | |
Output Format: | Textual descriptions or responses |
|
Performance Tips: | Use flash_attention_2 for acceleration in multi-image and video scenarios |
|
|
Release Notes |
Version: | |
Notes: | Quantized relay in 2B format, instruction-tuned. |
|
|
|