Model Type | multimodal model, text generation, vision understanding |
|
Use Cases |
Areas: | broad commercial, research use, English |
|
Applications: | general purpose AI systems with visual and text input, Memory/compute constrained environments, Latency bound scenarios, General image understanding, Optical character recognition, Chart and table understanding, Multi-image comparison, Multi-image or video clip summarization |
|
Primary Use Cases: | research on language and multimodal models, building block for generative AI powered features |
|
Limitations: | models are not specifically designed or evaluated for all downstream purposes |
|
Considerations: | Developers should consider common limitations of language models and ensure use complies with applicable laws. |
|
|
Additional Notes | Developers are responsible for mitigating biases and ensuring accuracy and safety in their use cases. |
|
Supported Languages | multilingual (primarily focused on English text) |
|
Training Details |
Data Sources: | synthetic data, filtered publicly available websites, high-quality educational data, image-text interleave data, synthetic 'textbook-like' data for teaching math, coding, reasoning, etc., created multi-image and video data |
|
Data Volume: | |
Methodology: | supervised fine-tuning and direct preference optimization |
|
Context Length: | |
Training Time: | |
Hardware Used: | |
Model Architecture: | includes image encoder, connector, projector, and Phi-3 Mini language model |
|
|
Safety Evaluation |
Methodologies: | red teaming, adversarial conversation simulations, safety evaluation benchmark datasets |
|
Risk Categories: | production of undesirable outputs across multiple risk categories |
|
Ethical Considerations: | Leveraged human-labeled and synthetic datasets focusing on safety categories |
|
|
Responsible Ai Considerations |
Fairness: | Limitations may still be present due to differing levels of representation of different groups or societal biases. |
|
Transparency: | Developers should inform end-users that they are interacting with an AI system. |
|
Accountability: | Developers are responsible for ensuring compliance with relevant laws and regulations. |
|
Mitigation Strategies: | Apply responsible AI best practices, use safety classifiers or custom solutions, ensure transparency and accurate information. |
|
|
Input Output |
Input Format: | Best suited for prompts using the chat format. |
|
Accepted Modalities: | |
Output Format: | Generated text in response to input. |
|
Performance Tips: | Set num_crops=4 for multi-frame and num_crops=16 for single-frame for best performance. |
|
|
Release Notes |
Date: | |
Notes: | Model enables multi-frame image understanding, improved single image benchmark performance, supports wider range of applications. |
|
|
|