Model Type | End-to-end multimodal grounding model
Use Cases |
Areas: | Research applications, multimodal grounding tasks
Applications: | Comprehensive grounding tasks across image, audio, and video inputs
Primary Use Cases: | Understanding and grounding multimodal inputs
Additional Notes | Model available on Hugging Face
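The note above says the model is on Hugging Face but does not give a repository ID, so the following is a minimal download sketch; the repo ID `org-name/multimodal-grounding-model` is a placeholder, not the model's actual identifier.

```python
# Minimal sketch of fetching a released checkpoint from the Hugging Face
# Hub. The repo ID is a placeholder, since the card does not name one.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="org-name/multimodal-grounding-model")
print(local_dir)  # local path containing the downloaded model files
```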
Supported Languages | Not specified
Training Details |
Data Sources: | LLaVA, COCO, GQA, OCR-VQA, TextVQA, Visual Genome, Flickr30K-Entities, Valley, DiDeMo, ActivityNet Captions, Charades-STA, VGGSS, WavCaps, Clotho
Data Volume: | Large and diverse multimodal dataset
Methodology: | End-to-end training of a multimodal grounding model that integrates spatial and temporal information (sketched below)
Hardware Used: | GPUs (specifics not mentioned)
Model Architecture: | End-to-end multimodal grounding
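The methodology row says spatial and temporal information are integrated end to end. One common way text-generating grounders do this is by serializing bounding boxes and time spans into the target text; the `<box>`/`<when>` tag format below is an illustrative assumption, not this model's documented scheme.

```python
# Illustrative only: one common scheme for folding spatial and temporal
# grounding targets into plain-text training labels. The <box>/<when>
# tag format is an assumption, not this model's documented serialization.

def serialize_box(box, width, height):
    """Normalize a pixel-space box (x1, y1, x2, y2) into [0, 1] text tags."""
    x1, y1, x2, y2 = box
    return f"<box>{x1 / width:.2f},{y1 / height:.2f},{x2 / width:.2f},{y2 / height:.2f}</box>"

def serialize_span(start_s, end_s):
    """Render a temporal span (in seconds) as text tags."""
    return f"<when>{start_s:.1f}-{end_s:.1f}</when>"

# Targets a grounding caption might be trained against:
print("the dog " + serialize_box((40, 60, 200, 220), 640, 480))  # spatial (image)
print("the door slams " + serialize_span(3.0, 4.5))              # temporal (video/audio)
```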
Safety Evaluation |
Methodologies: | Not specified
Findings: | Effective in grounding tasks across various modalities
Input Output |
Input Format: | Multimodal inputs including images, audio, and video
Accepted Modalities: | Images, audio, video
Output Format: | Grounded outputs conditioned on the multimodal inputs (see the inference sketch below)
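To make the input/output contract concrete, here is a hypothetical inference call. The repo ID, the `AutoProcessor`/`AutoModelForVision2Seq` classes, and the keyword arguments are assumptions about a generic Hugging Face pipeline (image-only for brevity, though the card also lists audio and video), not this model's documented API.

```python
# Hypothetical inference sketch; the repo ID, classes, and keyword
# arguments are assumptions about a generic Hugging Face pipeline,
# not this model's documented API. Image-only for brevity.
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

REPO_ID = "org-name/multimodal-grounding-model"  # placeholder repo ID

processor = AutoProcessor.from_pretrained(REPO_ID)
model = AutoModelForVision2Seq.from_pretrained(REPO_ID)

image = Image.open("street.jpg")
prompt = "Where is the red car? Ground your answer."

inputs = processor(images=image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```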