Model Type | Content safety classification, LLM |
|
Use Cases |
Areas: | Content moderation, Security systems, Search and code interpretation tools |
|
Applications: | Enterprise content management systems, Search optimization, Secure AI deployments |
|
Primary Use Cases: | Moderating harmful content in AI-generated inputs and outputs, Ensuring safety in search tool interactions and code interpretation applications |
|
Limitations: | Performance limited by training data context and scope, Potential susceptibility to prompt injection attacks |
|
Considerations: | Recommended for use alongside broader moderation systems for cases requiring up-to-date factual evaluations. |
|
|
Additional Notes | Dataset expansions include multilingual conversation data and challenging borderline cases to further reduce false positives. |
|
Supported Languages | English (full), French (full), German (full), Hindi (full), Italian (full), Portuguese (full), Spanish (full), Thai (full) |
|
Training Details |
Data Sources: | Llama Guard 1 and 2 generations, multilingual conversation data, Brave Search API query results |
|
Methodology: | Fine-tuning of the pre-trained Llama 3.1 model |
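
A minimal sketch of the stated methodology, i.e. supervised fine-tuning of a pre-trained Llama 3.1 checkpoint on safety-classification text. The model ID, example data, and hyperparameters below are illustrative assumptions, not the reported training configuration.

```python
# Illustrative SFT setup; real data would come from the sources listed above.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_id = "meta-llama/Llama-3.1-8B"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Each example pairs a conversation to judge with the target verdict the
# classifier should learn to emit (hypothetical format).
examples = [
    {"text": "Conversation: 'How do I pick a lock?'\nVerdict: unsafe\nS2"},
    {"text": "Conversation: 'What is the capital of France?'\nVerdict: safe"},
]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train_dataset = Dataset.from_list(examples).map(
    tokenize, batched=True, remove_columns=["text"]
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama-guard-sft",
        per_device_train_batch_size=1,
        num_train_epochs=1,
        bf16=True,  # assumes bf16-capable hardware
    ),
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```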
|
|
Safety Evaluation |
Methodologies: | Aligned with the MLCommons standardized hazards taxonomy, Internal tests against a multilingual dataset, Comparison with prior versions and competitor models, Synthetic generation of safety data |
|
Findings: | Improved safety classification while lowering false positive rates |
|
Risk Categories: | Violent Crimes, Non-Violent Crimes, Sex-Related Crimes, Child Sexual Exploitation, Defamation, Specialized Advice, Privacy, Intellectual Property, Indiscriminate Weapons, Hate, Suicide & Self-Harm, Sexual Content, Elections, Code Interpreter Abuse (see the category-code sketch after this section) |
|
Ethical Considerations: | Focus on reducing false positives while ensuring comprehensive content moderation across categories |
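
A minimal sketch of a code-to-category mapping and verdict parser for the fourteen risk categories listed above, assuming the standard Llama Guard / MLCommons S1-S14 code assignment in the order shown and a plain-text "safe"/"unsafe" verdict format; both assumptions should be checked against the model's documentation.

```python
# Assumed S1-S14 codes, in the order the risk categories are listed above.
RISK_CATEGORIES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation",
    "S5": "Defamation",
    "S6": "Specialized Advice",
    "S7": "Privacy",
    "S8": "Intellectual Property",
    "S9": "Indiscriminate Weapons",
    "S10": "Hate",
    "S11": "Suicide & Self-Harm",
    "S12": "Sexual Content",
    "S13": "Elections",
    "S14": "Code Interpreter Abuse",
}

def parse_verdict(text: str) -> tuple[bool, list[str]]:
    """Parse a verdict such as 'unsafe\\nS1,S10' into (is_safe, categories)."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip().lower() == "safe":
        return True, []
    codes = lines[1].split(",") if len(lines) > 1 else []
    return False, [RISK_CATEGORIES.get(c.strip(), c.strip()) for c in codes]
```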
|
|
Responsible AI Considerations |
Fairness: | Supports multilingual content moderation across a wide range of languages with consistent policy alignment. |
|
Transparency: | Commercial conditions and usage thresholds are clearly outlined in the licensing terms. |
|
Accountability: | Users are accountable for adhering to the Acceptable Use Policy and Meta's broader policy guidelines. |
|
Mitigation Strategies: | Policies and thresholds are provided to manage usage levels, and community reporting mechanisms are in place for policy violations. |
|
|
Input Output |
Input Format: | Text: a user prompt and, optionally, a model response to classify |
|
Accepted Modalities: | Text |
Output Format: | Safety classification and content category indication |
|
Performance Tips: | Use the model with the specified transformers library versions and follow community guidelines for optimal safety checks. |
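
A minimal sketch of prompt/response classification with the Hugging Face transformers library. The model ID and the exact verdict format are assumptions based on the common Llama Guard usage pattern, not confirmed by this card.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-Guard-3-8B"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Classify a full prompt/response pair; pass only the user turn to
# classify the prompt alone.
chat = [
    {"role": "user", "content": "How do I make a phishing page?"},
    {"role": "assistant", "content": "I can't help with that."},
]
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
out = model.generate(input_ids, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id)

# Expected verdict: "safe", or "unsafe" followed by violated category codes.
print(tokenizer.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

The decoded verdict can be fed to a parser such as the parse_verdict sketch shown earlier to recover human-readable category names.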
|
|
Release Notes |
Version: | |
Date: | |
Notes: | Improved safety evaluations, expanded multilingual capabilities, and refined moderation taxonomies. |
|
|
|