Model Type | Text generation, code generation |
|
Use Cases |
Primary Use Cases: | Code generation, code explanation, code fixing, generating unit tests, generating documentation, addressing technical debt, vulnerability detection, code translation (a minimal inference sketch follows this section) |
|
Limitations: | Generated code is not guaranteed to work as intended, the model can produce problematic outputs, and smaller models carry a risk of verbatim copying of training data due to memorization |
|
Considerations: | Caution is urged against complete reliance on generated outputs, given the potential for problematic outputs and hallucination (see the validation sketch below) |
|
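All of the primary use cases above reduce to prompted generation with a causal language model. Below is a minimal inference sketch using the Hugging Face `transformers` API; the checkpoint ID is a placeholder, since this card does not pin one.

```python
# Minimal generation sketch. The checkpoint ID below is a placeholder,
# not one confirmed by this model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-code-model"  # placeholder checkpoint ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# A code-completion style prompt; the same pattern covers explanation,
# fixing, test generation, and translation by changing the prompt text.
prompt = 'def fibonacci(n: int) -> int:\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```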
|
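Because generated code is not guaranteed to work as intended (see Limitations above), one illustrative guard, not prescribed by this card, is to accept a candidate only after it passes tests in a separate process:

```python
# Illustrative test-gated acceptance of generated code; a production
# setup would add real OS-level sandboxing on top of this.
import pathlib
import subprocess
import sys
import tempfile

generated = "def add(a: int, b: int) -> int:\n    return a + b\n"
tests = "from candidate import add\nassert add(2, 3) == 5\n"

with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / "candidate.py").write_text(generated)
    (root / "test_candidate.py").write_text(tests)
    # Run the tests in a subprocess with a timeout, isolated in a
    # temporary directory, before trusting the generated code.
    result = subprocess.run(
        [sys.executable, "test_candidate.py"],
        cwd=root, capture_output=True, text=True, timeout=10,
    )

print("accepted" if result.returncode == 0 else f"rejected:\n{result.stderr}")
```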
Supported Languages | 116 programming languages, with comprehensive understanding across them |
|
Training Details |
Data Sources: | Publicly available datasets (e.g., GitHub Code Clean, StarCoder data), additional public code repositories and issues from GitHub |
|
Data Volume: | Phase 1: 3 trillion tokens, Phase 2: 1 trillion tokens |
|
Methodology: | Two-phase training strategy (a token-budget sketch follows this section) |
|
Hardware Used: | IBM's Vela and Blue Vela supercomputing clusters, NVIDIA A100 and H100 GPUs |
|
Model Architecture: | Decoder-only architecture designed for code generation tasks |
|
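For a rough sense of scale, the sketch below converts the two phase token volumes above into optimizer step counts; the tokens-per-step figure is a hypothetical global batch size, not a number from this card.

```python
# Back-of-envelope step counts for the two training phases.
# Token budgets come from the "Data Volume" row; the tokens-per-step
# value is a hypothetical global batch size.
PHASES = {"phase_1": 3 * 10**12, "phase_2": 1 * 10**12}
TOKENS_PER_STEP = 4 * 10**6  # hypothetical: e.g. ~1024 sequences x 4096 tokens

for name, budget in PHASES.items():
    print(f"{name}: {budget:,} tokens = {budget // TOKENS_PER_STEP:,} steps")
# phase_1: 3,000,000,000,000 tokens = 750,000 steps
# phase_2: 1,000,000,000,000 tokens = 250,000 steps
```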
|
Responsible AI Considerations |
Mitigation Strategies: | HAP (hate, abuse, and profanity) content filtering, PII redaction, malware scanning (see the pipeline sketch below) |
|
|
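The mitigation strategies above amount to a filtering pipeline over data. A minimal sketch of chaining such stages follows; all three filter implementations are hypothetical stand-ins, not the actual HAP, PII, or malware tooling.

```python
# Hypothetical stand-ins for the three mitigation stages; the real
# HAP classifier, PII redactor, and malware scanner are not public
# details of this card.
import re
from typing import Optional

def redact_pii(text: str) -> str:
    # Toy example: mask email addresses only.
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "<EMAIL>", text)

def passes_hap_filter(text: str) -> bool:
    return True  # stand-in for a trained HAP classifier

def passes_malware_scan(text: str) -> bool:
    return True  # stand-in for a malware scanner

def clean_sample(text: str) -> Optional[str]:
    """Return the redacted sample, or None if any filter rejects it."""
    if not passes_hap_filter(text) or not passes_malware_scan(text):
        return None
    return redact_pii(text)

print(clean_sample("# contact: dev@example.com\nprint('hello')"))
```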