| Field | Details |
| --- | --- |
| Model Type | |
| Use Cases | |
| Limitations | The model may produce problematic outputs, as it has not been aligned to generate safe completions during the RLHF phase. The size and composition of the corpus used to train the base Llama 2 models are unknown, though it likely included a mix of web data and technical sources. |
| Additional Notes | The model was trained using a JAX DPO trainer on EasyLM. The larger RM benefited from a smaller KL penalty during training (see the sketch below). |
| Supported Languages | |
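For context on the KL-penalty note: in a standard RLHF PPO setup, the per-token reward combines the reward model's score with a KL penalty against a frozen reference policy, so a smaller penalty coefficient lets the policy move further from the reference. The sketch below is a generic, hypothetical illustration of that penalty; the function and argument names are not from the actual training code:

```python
import jax.numpy as jnp

def penalized_rewards(rm_scores, policy_logprobs, ref_logprobs, kl_coef=0.05):
    """Reward-model score plus per-token KL penalty (illustrative only)."""
    # Per-token KL estimate between the policy and the frozen reference.
    kl = jnp.asarray(policy_logprobs) - jnp.asarray(ref_logprobs)  # [batch, seq]
    rewards = -kl_coef * kl  # KL penalty applied at every token
    # Add the scalar reward-model score at each sequence's final token.
    return rewards.at[:, -1].add(jnp.asarray(rm_scores))  # rm_scores: [batch]
```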
**Training Details**

| Field | Details |
| --- | --- |
| Data Sources | https://huggingface.co/datasets/allenai/tulu-2.5-preference-data, https://huggingface.co/datasets/allenai/tulu-v2-sft-mixture |
| Methodology | Trained using DPO and PPO, starting from the Tulu 2 suite. This model was trained on the UltraFeedback dataset using PPO, with a 70B reward model that was itself trained on UltraFeedback; only the prompts from the ultrafeedback_mean_aspects split were used (a minimal DPO sketch follows this table). |
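Since the suite's DPO models were trained with a JAX trainer on EasyLM, a minimal sketch of the DPO objective in JAX may make the methodology concrete. This is an illustrative reimplementation under assumed conventions (per-sequence summed log-probabilities and a fixed beta), not the actual EasyLM trainer code:

```python
import jax
import jax.numpy as jnp

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss over a batch of preference pairs."""
    # Implicit rewards: beta-scaled log-ratio of policy to frozen reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected implicit rewards.
    losses = -jax.nn.log_sigmoid(chosen_rewards - rejected_rewards)
    return jnp.mean(losses)
```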
**Input / Output**

Input format:

```
<|user|>
Your message here!
<|assistant|>
```

Performance tips: be sure to include a newline after `<|assistant|>`, as it affects generation quality.
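As a convenience, here is a small, hypothetical Python helper (the function name is illustrative, not from the model's tooling) that assembles a prompt in the expected format with the trailing newline included:

```python
def format_prompt(user_message: str) -> str:
    # The trailing "\n" after <|assistant|> is deliberate: per the tip
    # above, omitting it degrades generation quality.
    return f"<|user|>\n{user_message}\n<|assistant|>\n"

print(format_prompt("Your message here!"))
```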