Model Type | |
Additional Notes | The Instruction Pre-Training framework scalably augments massive raw corpora with instruction-response pairs to pre-train language models. |
|
Training Details |
Data Sources: | tiiuae/falcon-refinedweb, instruction-pretrain/ft-instruction-synthesizer-collection, instruction-pretrain/general-instruction-augmented-corpora |
|
Data Volume: | |
Methodology: | Instruction Pre-Training: supervised multitask pre-training on raw corpora augmented with synthesized instruction-response pairs. |
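To make the Methodology row concrete, here is a minimal sketch of how a raw document and its synthesized instruction-response pairs can be packed into one sequence and trained on with the ordinary next-token objective. The "Question:/Answer:" template, helper name, and sample data are illustrative assumptions, not the framework's actual format or code.

```python
# Illustrative sketch only: the template, helper name, and sample data are
# assumptions, not the framework's actual format or code.
from transformers import AutoTokenizer


def build_instruction_augmented_text(raw_text, qa_pairs):
    """Concatenate a raw document with its synthesized instruction-response pairs."""
    parts = [raw_text]
    for pair in qa_pairs:
        parts.append(f"Question: {pair['instruction']}\nAnswer: {pair['response']}")
    return "\n\n".join(parts)


raw_text = "The aurora borealis appears when charged solar particles collide with Earth's atmosphere."
qa_pairs = [
    {
        "instruction": "What causes the aurora borealis?",
        "response": "Charged particles from the sun colliding with Earth's atmosphere.",
    }
]
augmented = build_instruction_augmented_text(raw_text, qa_pairs)

# The augmented text is trained with the same causal LM (next-token) objective
# used for plain raw text; only the data changes, not the training objective.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
batch = tokenizer(augmented, truncation=True, max_length=512, return_tensors="pt")
print(batch["input_ids"].shape)
```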
|
|
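The data sources listed above are public Hugging Face dataset repositories. The sketch below streams a few records from tiiuae/falcon-refinedweb; the two instruction-pretrain repositories load the same way, though they may require a configuration or data_files argument, so check their dataset cards first.

```python
# Minimal sketch: stream a few records from one of the listed data sources.
# tiiuae/falcon-refinedweb is used here; the instruction-pretrain repositories
# load similarly but may need a config or data_files argument (see their cards).
from datasets import load_dataset

ds = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

for i, record in enumerate(ds):
    # RefinedWeb stores the page text in the "content" field.
    print(record["content"][:200])
    if i >= 2:
        break
```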
Release Notes |
Date: | |
Notes: | Our paper has been accepted to the EMNLP 2024 main conference. |
|
Date: | |
Notes: | Updated FAQ on continual pre-training from Llama3. |
|
Date: | |
Notes: | Updated guidelines on evaluating any Hugging Face model on domain-specific tasks. |
|
Date: | |
Notes: | Updated pre-training suggestions in the Advanced Usage section of instruction-synthesizer. |
|
Date: | |
Notes: | Scaled up pre-training from 100B to 250B tokens, with the number of synthesized instruction-response pairs reaching 500M. |
|
Date: | |
Notes: | Released the paper, code, and resources. |
|
|
|