Model Type | text generation, multimodal |
|
Use Cases |
Areas: | research, commercial applications, agent integration |
|
Applications: | visual understanding, video-based QA, mobile and robotic integrations |
|
Primary Use Cases: | visual question answering, dialog, content creation, multilingual support |
|
Limitations: | Lack of audio support, Updates up to June 2023, Limited individual/IP recognition, Limited complex instruction handling, Low counting accuracy, Weak spatial reasoning |
|
|
Additional Notes | Reports quantized model performance across various tasks, highlighting strengths in multimodal integration and weaknesses, such as limitations in audio and complex reasoning. |
|
Supported Languages | English (native), Chinese (native), European languages (high), Japanese (high), Korean (high), Arabic (high), Vietnamese (high) |
|
Training Details |
Data Volume: | |
Model Architecture: | Naive Dynamic Resolution, Multimodal Rotary Position Embedding (M-ROPE) |
|
|
Input Output |
Input Format: | Images, Videos (local files, base64, URLs) |
|
Accepted Modalities: | |
Output Format: | |
Performance Tips: | Enabling flash_attention_2 recommended for better acceleration and memory saving. |
|
|
Release Notes |
Version: | Qwen2-VL-2B-Instruct-GPTQ-Int4 |
|
Date: | |
Notes: | Quantized model version with multi-language support and enhanced image and video processing capabilities. |
|
|
|