| Model Type | |
|---|---|

| Use Cases | |
|---|---|
| Areas | Basque language technology and research |
| Primary Use Cases | Pre-trained LLMs that can be applied directly to specific tasks or fine-tuned further for particular use cases (see the usage sketch below). |
| Limitations | Not fine-tuned to follow instructions or to work as a chat assistant. |
| Considerations | Use with Basque data; performance in other languages is not guaranteed. |

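
As a minimal usage sketch, the snippet below loads one of these checkpoints with the Hugging Face `transformers` library for plain text completion or as a starting point for further fine-tuning. The repository id `HiTZ/latxa-7b-v1.1` is an assumption inferred from the HiTZ organization named in the Data Sources row below, not something stated in this card; substitute the checkpoint you actually intend to use.

```python
# Minimal usage sketch. The checkpoint id is an assumption (inferred from the
# HiTZ organization and corpus naming); replace it with the repository you use.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HiTZ/latxa-7b-v1.1"  # assumed 7B checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# These are base (pre-trained) models, not instruction-tuned chat assistants,
# so prompt them as text-completion models rather than with chat turns.
prompt = "Euskara Europako hizkuntza zaharrenetako bat da."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
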
| Additional Notes | Models range from 7 to 70 billion parameters; the evaluation suite and datasets are publicly available under open licenses. |
|---|---|

| Supported Languages | Basque (eu) |
|---|---|

| Training Details | |
|---|---|
| Data Sources | HiTZ/latxa-corpus-v1.1, EleutherAI/pile (see the corpus-inspection sketch below) |
| Data Volume | |
| Methodology | High-quality data sources were prioritized, with deduplication and filtering applied (see the deduplication sketch below); training used the GPT-NeoX library on HPC infrastructure. |
| Context Length | |
| Training Time | 10,000 steps over 20B total tokens, around 4 epochs (see the token-budget arithmetic below) |
| Hardware Used | CINECA Leonardo HPC cluster: 3,456 nodes, each with 4× custom NVIDIA A100 64GB GPUs |
| Model Architecture | Follows Meta's LLaMA architecture, further pre-trained on a Basque corpus |

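
The Data Sources row above lists the pre-training corpora. As a minimal sketch, the snippet below streams a few documents from the Basque corpus with the Hugging Face `datasets` library; the split name and the `text` field are assumptions, so check the dataset card before relying on them. The EleutherAI/pile entry is listed as well, but its current hosting and licensing should be verified separately, so it is omitted here.

```python
# Minimal sketch: peek at a few documents from the listed Basque corpus.
# Split name ("train") and field name ("text") are assumptions.
from datasets import load_dataset

corpus = load_dataset("HiTZ/latxa-corpus-v1.1", split="train", streaming=True)
for i, doc in enumerate(corpus):
    print(doc["text"][:200])
    if i == 2:  # show just the first three documents
        break
```
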
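
The Methodology row mentions deduplication and filtering of the training data, but the exact pipeline is not described in this card. The following is only a minimal sketch of the idea, using exact-hash deduplication and a crude length filter as stand-ins for whatever the actual preprocessing does.

```python
# Illustrative only: exact-hash deduplication plus a simple length filter.
# The real preprocessing pipeline is not specified in this card.
import hashlib

def dedup_and_filter(docs, min_chars=200):
    """Yield documents that are long enough and not exact duplicates."""
    seen = set()
    for text in docs:
        if len(text) < min_chars:
            continue  # filter: drop very short documents
        digest = hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # dedup: drop exact repeats
        seen.add(digest)
        yield text

sample = ["Kaixo mundua!", "Kaixo mundua!", "A" * 300]
print(list(dedup_and_filter(sample, min_chars=5)))
```
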
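
The Training Time row gives 10,000 optimizer steps over 20B tokens in total. The token-budget arithmetic below spells out what those figures imply per step; the 4,096-token sequence length is an assumption (the card leaves Context Length blank), so the per-step sequence count is illustrative only.

```python
# Back-of-the-envelope training budget implied by the figures above.
# The sequence length is an assumption; the card does not state it.
total_tokens = 20e9          # 20B tokens in total (from the card)
steps = 10_000               # 10k optimizer steps (from the card)
epochs = 4                   # roughly 4 passes over the corpus (from the card)
assumed_seq_len = 4096       # assumption, not stated in the card

tokens_per_step = total_tokens / steps              # 2.0M tokens per step
seqs_per_step = tokens_per_step / assumed_seq_len   # ~488 sequences per global batch
corpus_tokens = total_tokens / epochs               # ~5B unique corpus tokens

print(f"tokens per step:     {tokens_per_step:,.0f}")
print(f"sequences per step:  {seqs_per_step:,.0f} (assuming {assumed_seq_len}-token sequences)")
print(f"implied corpus size: {corpus_tokens:,.0f} tokens (given ~{epochs} epochs)")
```
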

| Responsible AI Considerations | |
|---|---|
| Fairness | Trained on carefully selected and processed data to minimize disturbing or harmful content. |
| Mitigation Strategies | A thorough deduplication and filtering process was applied to the training data. |

| Input / Output | |
|---|---|
| Accepted Modalities | Text |
| Output Format | Text |

| Release Notes | |
|---|---|