| Training Details | |
| --- | --- |
| Data Sources | https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2, https://huggingface.co/datasets/tiiuae/falcon-refinedweb, https://huggingface.co/datasets/allenai/c4, https://huggingface.co/datasets/EleutherAI/pile, https://data.baai.ac.cn/details/WuDaoCorporaText, https://huggingface.co/datasets/CASIA-LM/ChineseWebText |
| Data Volume | |
| Methodology | EfficientScale; Mixture of Experts (MoE) training; Scale-Up and Scale-Out strategies (see the Scale-Up/Scale-Out sketch below) |
| Context Length | 4096 |
| Model Architecture | Mixture of Experts (MoE); QKV Bias: yes; Layers: 40; Hidden Dim: 5120; Intermediate Dim: 20480; KV Groups: 8 (see the configuration sketch below) |
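For reference, the architecture hyperparameters from the table can be collected into a single configuration object. The sketch below is a minimal Python illustration using only the values stated above; the attention-head count, expert count, and top-k routing value are *not* given in the table and are marked as assumptions, and the reading of "KV Groups: 8" as eight key-value heads under grouped-query attention is likewise an interpretation.

```python
from dataclasses import dataclass


@dataclass
class MoEModelConfig:
    # Values stated in the table above.
    num_layers: int = 40
    hidden_size: int = 5120
    intermediate_size: int = 20480
    num_kv_groups: int = 8           # "KV Groups: 8", read here as 8 KV heads (grouped-query attention)
    qkv_bias: bool = True
    max_position_embeddings: int = 4096
    # Not stated in the table -- illustrative assumptions only.
    num_attention_heads: int = 40    # assumed: 5120 hidden / 128 per-head dim
    num_experts: int = 8             # assumed expert count
    num_experts_per_token: int = 2   # assumed top-k routing

    @property
    def head_dim(self) -> int:
        return self.hidden_size // self.num_attention_heads

    @property
    def query_heads_per_kv_group(self) -> int:
        # Under grouped-query attention, each KV head is shared by this many query heads.
        return self.num_attention_heads // self.num_kv_groups


if __name__ == "__main__":
    cfg = MoEModelConfig()
    print(f"head_dim={cfg.head_dim}, query heads per KV group={cfg.query_heads_per_kv_group}")
```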
|
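The table only names the EfficientScale methodology, so the following is a minimal sketch of what its two phases could look like in PyTorch, assuming a conventional dense-to-MoE upcycling scheme: Scale-Up initializes a larger dense layer from a smaller trained one, and Scale-Out replicates a trained dense FFN into several experts. The function names, the block-copy initialization, the small source dimensions, and the expert count of 8 are illustrative assumptions, not the model's documented procedure.

```python
import copy

import torch
import torch.nn as nn


def scale_up(small: nn.Linear, new_in: int, new_out: int) -> nn.Linear:
    """Scale-Up (illustrative): initialize a larger linear layer from a smaller
    trained one by copying the trained weights into a corner of the new weight
    matrix; the remaining entries keep their fresh random initialization."""
    big = nn.Linear(new_in, new_out, bias=small.bias is not None)
    with torch.no_grad():
        big.weight[: small.out_features, : small.in_features] = small.weight
        if small.bias is not None:
            big.bias[: small.out_features] = small.bias
    return big


def scale_out(dense_ffn: nn.Module, num_experts: int) -> nn.ModuleList:
    """Scale-Out (illustrative): upcycle a trained dense FFN into an MoE layer by
    cloning it into identical experts; a router and continued training would then
    specialize them."""
    return nn.ModuleList(copy.deepcopy(dense_ffn) for _ in range(num_experts))


# Toy usage with the table's dimensions (5120 hidden, 20480 intermediate);
# the small source sizes and the expert count of 8 are made up for the example.
small_proj = nn.Linear(1024, 4096)
big_proj = scale_up(small_proj, new_in=5120, new_out=20480)
dense_ffn = nn.Sequential(nn.Linear(5120, 20480), nn.SiLU(), nn.Linear(20480, 5120))
experts = scale_out(dense_ffn, num_experts=8)
print(big_proj.weight.shape, len(experts))
```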
|