Model Type | multimodal, text-generation |
|
Use Cases |
Areas: | research, commercial applications |
|
Limitations: | No audio support, Knowledge limited to data through June 2023, Limited ability to recognize specific individuals and intellectual property (IP), Weak at following complex instructions, Difficulty with object counting and spatial reasoning |
|
|
Additional Notes | Supports understanding of videos up to 20 minutes long and of multilingual text within images. |
|
Supported Languages | Primary: English, Chinese. Additional: most European languages, Japanese, Korean, Arabic, Vietnamese. Multilingual support applies to understanding text in images. |
|
Training Details |
Data Volume: | Training data extends through June 2023 |
|
Model Architecture: | Naive Dynamic Resolution and Multimodal Rotary Position Embedding (M-RoPE) |
|
|
Input Output |
Input Format: | |
Accepted Modalities: | Text, images, video (no audio) |
Output Format: | Text |
Performance Tips: | Set the minimum and maximum pixel counts for visual inputs to balance speed and memory usage |
|
|
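The performance tip about min/max pixels can be illustrated with a minimal sketch of how a pixel budget might translate into a resized input. This is not the model's actual preprocessing code: the function name `fit_to_pixel_budget`, the side multiple of 28, and the example budgets are all assumptions chosen for illustration and are not stated in this card.

```python
import math


def fit_to_pixel_budget(height, width, min_pixels, max_pixels, factor=28):
    """Return (new_height, new_width) with each side a multiple of `factor`.

    Hypothetical helper: for typical aspect ratios the resulting area falls
    between `min_pixels` and `max_pixels` while roughly preserving the
    original aspect ratio. `factor=28` is an assumed patch-grid size.
    """
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    if min_pixels > max_pixels:
        raise ValueError("min_pixels must not exceed max_pixels")

    # Snap each side to the nearest multiple of `factor` (at least one unit).
    new_h = max(factor, round(height / factor) * factor)
    new_w = max(factor, round(width / factor) * factor)

    if new_h * new_w > max_pixels:
        # Too large: shrink both sides by the same scale, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        new_h = max(factor, math.floor(height / scale / factor) * factor)
        new_w = max(factor, math.floor(width / scale / factor) * factor)
    elif new_h * new_w < min_pixels:
        # Too small: enlarge both sides by the same scale, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        new_h = math.ceil(height * scale / factor) * factor
        new_w = math.ceil(width * scale / factor) * factor

    return new_h, new_w


# Example budget (assumed values): between 256 and 1280 units of 28x28 pixels.
budget = {"min_pixels": 256 * 28 * 28, "max_pixels": 1280 * 28 * 28}
print(fit_to_pixel_budget(1080, 1920, **budget))  # large frame is shrunk
print(fit_to_pixel_budget(100, 100, **budget))    # small image is enlarged
```

A lower `max_pixels` reduces the number of visual tokens and therefore memory use and latency, at the cost of fine detail; a higher `min_pixels` keeps small images from being processed at too low a resolution.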