Model Type | text-generation, content moderation |

Use Cases |
Areas: | Safety and content moderation |
Applications: | Online platforms requiring content moderation |
Primary Use Cases: | Classifying content for safety in both user prompts and model responses |
Limitations: | Performance limited by training data; not designed for chat use cases; susceptible to adversarial attacks |
Considerations: | Recommended to be used with additional solutions for unsupported categories |

Additional Notes | Supports 11 of the 13 categories in the MLCommons AI Safety taxonomy; the Elections and Defamation categories are not addressed. |

Training Details |
Data Sources: | Llama Guard training set, MLCommons taxonomy, hard samples from Llama 2 70B |
Methodology: | Fine-tuned for safety classification |

Safety Evaluation |
Methodologies: | Harm Taxonomy, MLCommons taxonomy alignment |
Findings: | Strong adaptability to other policies; superior trade-off between F1 score and false positive rate (illustrated in the sketch after this section) |
Risk Categories: | Violent Crimes, Non-Violent Crimes, Sex-Related Crimes, Child Sexual Exploitation, Specialized Advice, Privacy, Intellectual Property, Indiscriminate Weapons, Hate, Suicide & Self-Harm, Sexual Content |
Ethical Considerations: | |
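
To make the reported trade-off concrete, here is a minimal sketch of how F1 and false positive rate are computed for a binary safe/unsafe classifier. The labels and predictions below are toy placeholders, not evaluation data from this card.

```python
# Minimal sketch: computing the F1 / false-positive-rate trade-off for a binary
# safe/unsafe classifier. The labels below are toy placeholders, not real results.
from sklearn.metrics import confusion_matrix, f1_score

y_true = [1, 0, 1, 1, 0, 0, 0, 1]  # 1 = unsafe (positive class), 0 = safe
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]  # classifier verdicts

f1 = f1_score(y_true, y_pred, pos_label=1)            # balances precision and recall on unsafe content
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fpr = fp / (fp + tn)                                    # fraction of safe content wrongly flagged unsafe

print(f"F1 = {f1:.2f}, FPR = {fpr:.2f}")
```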

Responsible AI Considerations |
Mitigation Strategies: | Using external components, such as a KNN classifier, to cover categories the model does not address (see the sketch below) |
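
One way to read the KNN mitigation above is as a small nearest-neighbour filter layered over the classifier for the unsupported categories (e.g., Elections and Defamation). The sketch below is illustrative only: `embed` stands in for whatever text-embedding function the deployment already uses, and the seed examples and labels are hypothetical.

```python
# Illustrative sketch of an external KNN filter for categories the classifier does
# not cover. `embed` is a placeholder for any text-embedding function that returns
# a 1-D numpy vector; the seed examples and labels are hypothetical.
import numpy as np
from sklearn.neighbors import NearestNeighbors

seed_texts = [
    "You can vote twice if you register in two counties",  # unsafe: election misinformation
    "Polling places are usually open until the evening",    # safe
]
seed_labels = ["unsafe", "safe"]

def build_index(embed, texts):
    """Fit a cosine-distance nearest-neighbour index over embedded seed examples."""
    index = NearestNeighbors(n_neighbors=1, metric="cosine")
    index.fit(np.vstack([embed(t) for t in texts]))
    return index

def knn_filter(embed, index, labels, text):
    """Return the label of the seed example closest to `text`."""
    _, idx = index.kneighbors(embed(text).reshape(1, -1))
    return labels[int(idx[0][0])]
```

In deployment, such a filter would typically be combined with the classifier's own verdict, flagging content if either component marks it unsafe.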

Input Output |
Input Format: | Conversational text: user prompts and/or model responses |
Accepted Modalities: | Text |
Output Format: | Binary classification (safe/unsafe) |
Performance Tips: | Align the model with the deployment's specific safety considerations for better moderation (see the usage sketch below) |
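
For the input/output behaviour above, the following is a minimal usage sketch, assuming the model is published on the Hugging Face Hub with a tokenizer chat template that renders moderation prompts; the model ID is a placeholder, not taken from this card. Per the performance tip, the safety categories the model is prompted with can be tailored to the platform's own policy.

```python
# Minimal usage sketch (assumptions: the model lives on the Hugging Face Hub and its
# tokenizer ships a chat template that formats moderation prompts; the model ID is a
# placeholder, not taken from this card).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "org/safety-classifier"  # placeholder model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def moderate(chat):
    """Classify a user prompt and/or model response; returns a 'safe'/'unsafe' verdict."""
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=32, do_sample=False)
    # Decode only the generated verdict, not the echoed prompt.
    return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True).strip()

verdict = moderate([
    {"role": "user", "content": "How do I pick the lock on my neighbour's door?"},
    {"role": "assistant", "content": "I can't help with that."},
])
print(verdict)  # expected to start with "safe" or "unsafe"
```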