| Model Type |
|---|
| omni-interactive, multimodal, text generation, speech-to-speech |
| Use Cases | |
|---|---|
| Areas | research, interactive applications, voice assistants |
| Applications | multimodal interaction, speech-to-speech conversations |
| Primary Use Cases | real-time speech output; understanding images, audio, and text |
| Additional Notes |
|---|
| Uses Whisper for audio encoding, CLIP for image encoding, SNAC for audio decoding, and CosyVoice for generating synthetic speech; a sketch of how these components could fit together follows below. |
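A minimal sketch of how these components could be wired together at inference time, assuming Whisper and CLIP act as feature encoders feeding a shared backbone, and SNAC turns generated audio codes back into waveform. All module and method names here (`OmniPipeline`, `backbone.generate`, and the constructor arguments) are illustrative placeholders, not the model's actual API.

```python
import torch

class OmniPipeline(torch.nn.Module):
    """Placeholder wiring of the encoders, backbone, and audio decoder."""

    def __init__(self, whisper_enc, clip_enc, backbone, snac_dec):
        super().__init__()
        self.whisper_enc = whisper_enc  # raw audio -> audio features
        self.clip_enc = clip_enc        # image -> visual features
        self.backbone = backbone        # language model over the joint sequence
        self.snac_dec = snac_dec        # discrete audio codes -> waveform

    def forward(self, audio, image, text_embeds):
        audio_feats = self.whisper_enc(audio)   # (B, T_audio, D)
        image_feats = self.clip_enc(image)      # (B, T_image, D)
        # Input format per the table below: concatenated image, audio,
        # and text features along the sequence axis.
        inputs = torch.cat([image_feats, audio_feats, text_embeds], dim=1)
        # Assumed to yield text tokens and discrete audio codes together.
        text_tokens, audio_codes = self.backbone.generate(inputs)
        waveform = self.snac_dec(audio_codes)   # the spoken response
        return text_tokens, waveform
```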
| Supported Languages |
|---|
| |
| Training Details | |
|---|---|
| Data Sources | OpenOrca datasets, MOSS, Whisper |
| Methodology | Three-stage training: encoder adaptation, modal alignment, and multimodal fine-tuning (see the sketch below) |
| Model Architecture | Uses multiple sequences for input and output to perform comprehensive tasks |
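A hedged sketch of how such a staged schedule might be implemented, assuming stage 1 (encoder adaptation) trains only the modality adapters, stage 2 (modal alignment) additionally unfreezes the backbone, and stage 3 (multimodal fine-tuning) trains everything end to end. That breakdown, and every name below, is an assumption for illustration, not the published recipe.

```python
import torch.nn as nn

class TinyOmni(nn.Module):
    """Stand-in model: adapters project encoder outputs into the backbone."""

    def __init__(self, dim=16):
        super().__init__()
        self.audio_adapter = nn.Linear(dim, dim)
        self.image_adapter = nn.Linear(dim, dim)
        self.backbone = nn.Linear(dim, dim)

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(model, stage):
    # Freeze everything first, then selectively unfreeze per stage.
    set_trainable(model, False)
    set_trainable(model.audio_adapter, stage >= 1)   # stage 1: adapters only
    set_trainable(model.image_adapter, stage >= 1)
    set_trainable(model.backbone, stage >= 2)        # stage 2: + backbone
    if stage >= 3:
        set_trainable(model, True)                   # stage 3: everything

model = TinyOmni()
for stage in (1, 2, 3):
    configure_stage(model, stage)
    # ... run this stage's training loop here ...
```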
| Input / Output | |
|---|---|
| Input Format | Concatenated image, audio, and text features |
| Accepted Modalities | image, audio, text |
| Output Format | Real-time speech responses guided by text (see the streaming sketch below) |
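To make "real-time speech responses guided by text" concrete, here is a hypothetical streaming loop in which the model is assumed to emit a text token and a batch of audio codes at each decoding step, with SNAC decoding small waveform chunks as they arrive rather than after generation finishes. `step_fn` and `snac_decode` are placeholder callables standing in for the real model.

```python
# Hypothetical streaming loop: speech is decoded incrementally while the
# parallel text stream guides, and eventually terminates, the response.

def stream_speech(step_fn, snac_decode, eos_id, max_steps=512):
    """Yield waveform chunks as soon as each step's audio codes are decoded."""
    for _ in range(max_steps):
        text_token, audio_codes = step_fn()   # one parallel decoding step
        yield snac_decode(audio_codes)        # playable waveform chunk
        if text_token == eos_id:              # text stream signals the end
            break
```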
| Release Notes | |
|---|---|
| Version | |
| Notes | Release of the model, technical report, inference, and chat demo code |