| Field | Details |
|---|---|
| Model Type | |
| Use Cases | |
| Areas: | research on large multimodal models and chatbots |
| Applications: | multimodal table understanding |
| Primary Use Cases: | research in computer vision, NLP, ML, and AI |
| Limitations: | Limited to one table image as input; the low input resolution may limit capacity. |
| Additional Notes | Evaluated on 17 held-in and 7 held-out tabular benchmarks, plus 2 non-tabular benchmarks: TextVQA and LLaVA-Bench-in-the-Wild. |
| Supported Languages | |
| Training Details | |
| Data Sources: | SpursgoZmy/MMTab, liuhaotian/LLaVA-Instruct-150K, liuhaotian/LLaVA-Pretrain |
| Data Volume: | Approximately 1.5 million instances |
| Methodology: | Two-stage training: pre-training with image-caption and table recognition data, followed by instruction tuning with multimodal instruction-following data. |
| Model Architecture: | Follows LLaVA-v1.5, with CLIP-ViT-L-336px as the visual encoder, Vicuna-v1.5-13B as the base LLM, and a two-layer MLP as the vision-language connector (see the sketch below this table). |
| Input Output | |
| Input Format: | Single table image at 336×336 resolution (see the preprocessing example below). |
| Accepted Modalities: | |
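
To make the Model Architecture row concrete, here is a minimal sketch of an LLaVA-v1.5-style stack: a CLIP-ViT-L-336px visual encoder, a two-layer MLP connector, and a Vicuna-v1.5-13B base LLM. The Hugging Face checkpoint names (`openai/clip-vit-large-patch14-336`, `lmsys/vicuna-13b-v1.5`), the class name, and the forward wiring are illustrative assumptions, not the model's released implementation.

```python
# Minimal structural sketch of the LLaVA-v1.5-style stack described in the table.
# Checkpoint names and the forward wiring are illustrative assumptions, not the
# released implementation of this model.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPVisionModel


class LlavaStyleSketch(nn.Module):
    def __init__(self,
                 vision_name="openai/clip-vit-large-patch14-336",  # assumed CLIP-ViT-L-336px checkpoint
                 llm_name="lmsys/vicuna-13b-v1.5"):                # assumed Vicuna-v1.5-13B checkpoint
        super().__init__()
        self.vision_tower = CLIPVisionModel.from_pretrained(vision_name)
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name, torch_dtype=torch.float16)
        vis_dim = self.vision_tower.config.hidden_size   # 1024 for ViT-L
        llm_dim = self.llm.config.hidden_size            # 5120 for Vicuna-13B
        # Two-layer MLP vision-language connector, as in LLaVA-v1.5.
        self.connector = nn.Sequential(
            nn.Linear(vis_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, pixel_values, input_ids):
        # Patch features from the visual encoder (drop the CLS token).
        patches = self.vision_tower(pixel_values).last_hidden_state[:, 1:, :]
        image_embeds = self.connector(patches)            # project into the LLM embedding space
        text_embeds = self.llm.get_input_embeddings()(input_ids)
        # Prepend the projected image tokens to the text tokens and run the LLM.
        inputs_embeds = torch.cat([image_embeds.to(text_embeds.dtype), text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```

Under the two-stage recipe in the Methodology row, stage 1 would train only `self.connector` while the encoder and LLM stay frozen, and stage 2 would additionally unfreeze the LLM for instruction tuning.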
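
The 336×336 input format corresponds to the standard CLIP-ViT-L-336px preprocessing. A short example, assuming the `openai/clip-vit-large-patch14-336` image processor and a placeholder image path:

```python
# Preprocess a single table image to the 336x336 resolution expected by the encoder.
# The checkpoint name and the "table.png" path are assumptions for illustration.
from PIL import Image
from transformers import CLIPImageProcessor

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
image = Image.open("table.png").convert("RGB")   # placeholder path to a table image
pixel_values = processor(images=image, return_tensors="pt").pixel_values
print(pixel_values.shape)  # torch.Size([1, 3, 336, 336])
```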