Model Type | |
Use Cases |
Areas: | Czech language processing |
|
Applications: | academic research, commercial applications |
|
Primary Use Cases: | text generation for Czech language content |
|
|
Additional Notes | Vocabulary swap was utilized to enhance transfer of knowledge from English to Czech. |
|
Supported Languages | |
Training Details |
Data Sources: | BUT-FIT/BUT-LCC, BUT-FIT/adult_content_classifier_dataset |
|
Data Volume: | 272 billion training tokens, with ~67 billion tokens specific to Czech |
|
Methodology: | Vocabulary swap method aligning English tokens to Czech equivalents |
|
Context Length: | |
Hardware Used: | |
Model Architecture: | Continuation of MPT7b with adaptations for Czech language |
|
|
Input Output |
Input Format: | |
Accepted Modalities: | |
Output Format: | |
|
Release Notes |
Version: | |
Date: | |
Notes: | Release of training checkpoints |
|
Version: | |
Date: | |
Notes: | Release of manually annotated dataset for adult content classification |
|
|
|