Model Type: multimodal, text generation, vision
|
Use Cases
Areas:
Applications: general image understanding, OCR, chart and table understanding
Primary Use Cases: memory/compute-constrained environments, latency-bound scenarios
Limitations: not evaluated for all downstream purposes; potential quality degradation in non-English use cases
Considerations: developers should evaluate and mitigate potential biases and misuse
|
|
Supported Languages
Languages Supported: multilingual
Proficiency Level: English-focused; other languages may perform worse
|
Training Details
Data Sources: publicly available documents, high-quality educational data, selected high-quality interleaved image-text data, synthetic data
Data Volume: 500B vision and text tokens
Methodology: supervised fine-tuning and direct preference optimization (see the objective sketched after this section)
Context Length:
Training Time:
Hardware Used:
Model Architecture: image encoder, connector, projector, and Phi-3 Mini language model (see the composition sketch after this section)
|
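The methodology above names direct preference optimization but gives no training objective. For reference only, the standard DPO loss is shown below, where y_w and y_l are the preferred and rejected responses, π_ref is the reference policy (for example, the post-SFT model), and β controls how far the trained policy may drift from the reference; whether this exact formulation and configuration were used is an assumption, not something this card states.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\,\pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```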
|
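The architecture row only names the components (image encoder, connector, projector, Phi-3 Mini language model). The sketch below illustrates one plausible way such components compose; the class name, dimensions, and the simple prefix-token fusion are assumptions for illustration, not the actual implementation.

```python
# Illustrative sketch only: names, dimensions, and the fusion strategy are
# assumptions; this model card does not describe the real implementation.
import torch
import torch.nn as nn

class VisionLanguageModel(nn.Module):
    def __init__(self, image_encoder: nn.Module, language_model: nn.Module,
                 vision_dim: int, text_dim: int):
        super().__init__()
        self.image_encoder = image_encoder      # e.g., a ViT producing patch features
        self.projector = nn.Sequential(         # connector/projector mapping vision
            nn.Linear(vision_dim, text_dim),    # features into the LM embedding space
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )
        self.language_model = language_model    # Phi-3 Mini backbone in this card

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor):
        # Encode the image into patch features, project them to token embeddings,
        # and prepend them to the text embeddings before running the language model
        # (assumes the LM accepts inputs_embeds, as Hugging Face causal LMs do).
        patch_features = self.image_encoder(pixel_values)
        image_tokens = self.projector(patch_features)
        inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds)
```
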
Safety Evaluation
Ethical Considerations: potential for producing inappropriate, unreliable, or biased content due to limitations in the training data
|
|
Responsible AI Considerations
Fairness: potential biases due to uneven representation of groups in the training data
Transparency: developers are responsible for evaluating model output for fairness and accuracy
Accountability: developers should ensure applications built on the model adhere to relevant laws and regulations
Mitigation Strategies: safety post-training; adherence to the laws and regulations applicable to each use case
|
|
Input / Output
Input Format: a single image together with text prompts in chat format (see the usage sketch below)
Accepted Modalities: image, text
Output Format: text generated in response to the input
|
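Since the card specifies a single image plus chat-format prompts in and generated text out, a minimal usage sketch follows, assuming a Hugging Face transformers checkpoint. The model ID, the image placeholder token, the chat template, and the file path are assumptions or placeholders, not confirmed by this card.

```python
# Minimal usage sketch. Assumptions (not confirmed by this model card): the
# checkpoint name, the <|image_1|> placeholder, and the chat template format.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

# One image plus a chat-format text prompt, as described above.
image = Image.open("chart.png")  # placeholder path
messages = [{"role": "user", "content": "<|image_1|>\nSummarize this chart."}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, images=[image], return_tensors="pt").to(model.device)

# The model returns text generated in response to the image and prompt.
output_ids = model.generate(**inputs, max_new_tokens=256)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```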
|