Model Type | Transformer, Visual Question Answering |
Use Cases |
Areas: | Visual Question Answering, Image and Video Captioning, Image Classification |
Applications: | Research, Commercial applications |
Primary Use Cases: | Visual question answering on the TextVQA dataset |
Additional Notes | The checkpoint described here is 'GIT-base', a smaller variant of the GIT model fine-tuned specifically for TextVQA. |
Supported Languages | |
Training Details |
Data Sources: | COCO, Conceptual Captions (CC3M), SBU, Visual Genome (VG), Conceptual Captions (CC12M), ALT200M, and an additional 0.6B image-text pairs |
Data Volume: | 10 million image-text pairs for the GIT-base variant |
Methodology: | |
Model Architecture: | Transformer decoder conditioned on CLIP image tokens and text tokens. |
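The "Model Architecture" entry says the decoder is conditioned on both image tokens and text tokens. One way to picture that conditioning is through the attention mask. The sketch below assumes the seq2seq-style mask described for GIT, where image tokens attend to each other bidirectionally and each text token attends to all image tokens plus the text tokens up to and including itself. The function name `git_attention_mask` and the token counts are hypothetical and chosen only for illustration; this is not the checkpoint's actual implementation.

```python
from typing import List


def git_attention_mask(num_image_tokens: int, num_text_tokens: int) -> List[List[bool]]:
    """Build an illustrative attention mask; mask[i][j] is True when
    position i may attend to position j.

    Positions 0..num_image_tokens-1 are image tokens; the rest are text tokens.
    """
    total = num_image_tokens + num_text_tokens
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < num_image_tokens:
                # Every position (image or text) may attend to every image token.
                mask[i][j] = True
            elif i >= num_image_tokens:
                # A text position attends only to text at or before itself (causal).
                mask[i][j] = j <= i
            # Otherwise: an image position looking at a text position stays False.
    return mask


if __name__ == "__main__":
    # Toy sizes (assumed): 2 image tokens followed by 3 text tokens.
    for row in git_attention_mask(2, 3):
        print(" ".join("1" if allowed else "0" for allowed in row))
```

With these toy sizes, the first two rows (image tokens) see only the two image columns, while the three text rows see both image columns and a lower-triangular pattern over the text columns. For visual question answering, the question would occupy the leading text positions and the answer would be generated after them.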