Model Type | transformer-based, decoder-only, language model, instruction-tuned |
|
Use Cases |
Areas: | Research, Commercial applications |
|
Primary Use Cases: | Language generation, General-purpose assistance |
|
Limitations: | Not intended for malicious uses, Current version not aligned to avoid sensitive topics |
|
Considerations: | Users should be aware of and mitigate model limitations. |
|
|
Additional Notes | Users are responsible for mitigating risks and ensuring compliance with regulations. |
|
Supported Languages | bg (high proficiency), ca (high proficiency), code (included), cs (high proficiency), cy (high proficiency), da (high proficiency), de (high proficiency), el (high proficiency), en (high proficiency), es (high proficiency), et (high proficiency), eu (high proficiency), fi (high proficiency), fr (high proficiency), ga (high proficiency), gl (high proficiency), hr (high proficiency), hu (high proficiency), it (high proficiency), lt (high proficiency), lv (high proficiency), mt (high proficiency), nl (high proficiency), nn (high proficiency), no (high proficiency), oc (high proficiency), pl (high proficiency), pt (high proficiency), ro (high proficiency), ru (high proficiency), sh (high proficiency), sk (high proficiency), sl (high proficiency), sr (high proficiency), sv (high proficiency), uk (high proficiency) |
|
Training Details |
Data Sources: | Colossal OSCAR, Starcoder, Spanish Crawling, Parlamint corpus, Wikimedia dumps, OpenSubtitles v2016, MaCoCu web corpus, EurLEX-Resources, MC4-Legal, CURLICAT Corpus, CATalog, SYN v9, Welsh-GOV, DaNewsroom, The Danish Parliament Corpus 2009 - 2017, v1, Danish GigaWord, DK-CLARIN Reference Corpus of General Danish, Open Legal Data - German court decisions and laws, DeWaC, Greek Web Corpus, Greek Legal Code, BIGPATENT, peS2o, PG-19, proof-pile, Auxiliary Mathematics Problems and Solutions (AMPS), Pile of Law, RedPajama-Data T1, The Pile, Spanish Legal Domain Corpora, HPLT Datasets v1 - Spanish, Biomedical, Scientific, Estonian National Corpus 2021, Estonian Reference Corpus, EusCrawl, Yle Finnish News Archive, CaBeRnet, French Public Domain Newspapers, French Public Domain Books, The Gaois bilingual corpus of English-Irish legislation, Irish Universal Dependencies, CorpusNÓS, hrWaC 2.1, ITWaC, Korpus Malti, SoNaR Corpus NC 1.2, Norwegian Colossal Corpus, Occitan Corpus, Polish Parliamentary Corpus, NKJP-PodkorpusMilionowy-1.2, Brazilian Portuguese Web as Corpus, ParlamentoPT, MARCELL Romanian legislative subcorpus v2, Korpus slovenských právnych predpisov, od-justice 2.0, Corpus of academic Slovene KAS, slWaC web corpus, SrpKorSubset, The Swedish Culturomics Gigaword Corpus, Corpus of laws and legal acts of Ukraine |
|
Data Volume: | |
Methodology: | pre-training with highly curated data, instruction-tuning for general-purpose assistance |
|
Context Length: | |
Hardware Used: | MareNostrum 5, a pre-exascale EuroHPC supercomputer, Nvidia Hopper GPUs, Intel Sapphire Rapids |
|
Model Architecture: | Transformer-based decoder-only |
|
|
Safety Evaluation |
Methodologies: | LLM-judge for multilingual evaluation, BBQ dataset testing for societal biases, ARC Multiple Choice Question dataset for positional effects |
|
Findings: | High performance in disambiguated settings, Presence of societal biases in ambiguous settings, Weak primacy effects for cognitive bias |
|
Risk Categories: | Undesired societal and cognitive biases, Potential generation of harmful content |
|
Ethical Considerations: | Bias and safety improvements planned for further alignment |
|
|
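The positional-effects check listed under Methodologies can be illustrated with a minimal sketch. It assumes a hypothetical `choose(question, options)` callable that wraps the model and returns the index of the option it picks; neither that interface nor the helper names below come from the model card, and this is not the evaluation code the authors used. The idea is to place the correct answer at each position in turn and compare accuracy at the first position with the average over the remaining ones.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: takes a question and its options, returns the chosen index.
Chooser = Callable[[str, Sequence[str]], int]
Item = Tuple[str, List[str], int]  # (question, options, index of the correct option)


def accuracy_by_position(items: List[Item], choose: Chooser) -> List[float]:
    """Accuracy when the correct option is moved to each position in turn.

    Assumes every item has the same number of options.
    """
    n_positions = len(items[0][1])
    hits = [0] * n_positions
    for question, options, correct in items:
        answer = options[correct]
        distractors = [o for i, o in enumerate(options) if i != correct]
        for pos in range(n_positions):
            # Rebuild the option list with the correct answer at index `pos`.
            arranged = distractors[:pos] + [answer] + distractors[pos:]
            if choose(question, arranged) == pos:
                hits[pos] += 1
    return [h / len(items) for h in hits]


def primacy_gap(acc: Sequence[float]) -> float:
    """First-position accuracy minus mean accuracy over the other positions."""
    rest = acc[1:]
    return acc[0] - sum(rest) / len(rest)


# Stub "model" that always picks the first option, i.e. maximal primacy bias.
def always_first(question: str, options: Sequence[str]) -> int:
    return 0


demo_items: List[Item] = [
    ("q1", ["a", "b", "c", "d"], 2),
    ("q2", ["w", "x", "y", "z"], 0),
]
demo_acc = accuracy_by_position(demo_items, always_first)
print(demo_acc, primacy_gap(demo_acc))
```

A gap near zero would indicate little sensitivity to answer position, while a clearly positive gap would indicate a primacy effect; the card reports the observed effect as weak.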
Responsible AI Considerations |
Fairness: | Societal biases were detected; further alignment and fairness work is needed. |
|
Transparency: | Full model details, training scripts, and data sources are disclosed. |
|
Accountability: | Developers using the model are responsible for mitigating risks. |
|
Mitigation Strategies: | Further RLHF tuning and ethical audits are planned. |
|
|
Release Notes |
Version: | |
Date: | |
Notes: | Multilingual transformer-based model trained on 7.8 trillion tokens of data, then instruction-tuned. |
|
|
|