Model Type | Decoder-only, Multi-modal, Transformer |
|
Use Cases |
Areas: | |
Applications: | Computer control, Digital agents, Research on multi-modal models |
|
Limitations: | Faces and people in general may not be generated properly, The model was not trained to produce factual or true representations of people or events |
|
|
Additional Notes | The model is optimized for digital agents and responds well to few-shot prompting and fine-tuning. |
|
Training Details |
Model Architecture: | Vanilla decoder-only transformer with no separate image encoder; image patches are linearly projected directly into the transformer. |
|
|
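To make the Model Architecture row concrete, here is a minimal pure-Python sketch of projecting image patches linearly into a token sequence. It is an illustration only, not the model's actual code: the function names `patchify` and `linear_project`, the patch size, the embedding width, and the random weights are all assumptions chosen for the example.

```python
import random


def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened patches (raster order)."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten one patch: rows, then pixels, then channels.
            flat = [
                channel
                for row in image[top:top + patch]
                for pixel in row[left:left + patch]
                for channel in pixel
            ]
            patches.append(flat)
    return patches


def linear_project(patches, weight, bias):
    """Map each flattened patch to an embedding: one dot product per output unit.

    weight[j] holds the weights for output unit j (length = patch dimension).
    """
    return [
        [sum(x * w for x, w in zip(p, col)) + b for col, b in zip(weight, bias)]
        for p in patches
    ]


# Hypothetical sizes: a 4 x 6 RGB image, 2 x 2 patches, 8-dimensional embeddings.
rng = random.Random(0)
H, W, C, P, D = 4, 6, 3, 2, 8
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
weight = [[rng.gauss(0, 0.02) for _ in range(P * P * C)] for _ in range(D)]
bias = [0.0] * D

tokens = linear_project(patchify(image, P), weight, bias)
print(len(tokens), len(tokens[0]))  # 6 8
```

Each resulting vector plays the role of one input position for the decoder, alongside text token embeddings. No convolutional or separately trained image encoder is involved in this sketch.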
Input Output |
Input Format: | Text and images (image patches are linearly projected) |
|
Accepted Modalities: | |
Output Format: | |
Performance Tips: | End each question with a newline character (`\n`) for best performance. |
|
|
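The performance tip about ending questions with `\n` can be applied with a small helper before the prompt is sent to the model. This is a sketch only; `format_prompt` is a hypothetical name and is not part of any library.

```python
def format_prompt(question: str) -> str:
    """Return the question with exactly one trailing newline."""
    # Strip any newlines already present so the prompt never ends in several.
    return question.rstrip("\n") + "\n"


prompt = format_prompt("What color is the button in the top-right corner?")
print(repr(prompt))
```

Normalizing to a single newline keeps prompts consistent whether or not the caller already added one.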