Model Type: text generation, code generation

Use Cases

Areas: enterprise, software engineering productivity

Applications: code generation, code explanation, code fixing, generating unit tests, generating documentation, addressing technical debt issues, vulnerability detection, code translation

Limitations: has not undergone any safety alignment; potentially susceptible to hallucination in generation scenarios

Considerations: not fully reliable for crucial decisions; ethical and responsible usage is recommended
Additional Notes: Trained on extensive code datasets across 116 languages.

Supported Languages: Python (high), C (high), C++ (high), Go (high), Java (high), JavaScript (high), TypeScript (high)
Training Details

Data Sources: codeparrot/github-code-clean, bigcode/starcoderdata, open-web-math/open-web-math, math-ai/StackMathQA, bigcode/humanevalpack, repoqa, lcc, repobench
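Most of the sources listed above are published as Hugging Face datasets and can be inspected directly. A minimal sketch, assuming the `datasets` library is installed; depending on your `datasets` version, script-based datasets may additionally require `trust_remote_code=True` or an authenticated login:

```python
# Peek at one of the listed pretraining sources without downloading it all.
from datasets import load_dataset

# Stream the dataset so only a few records are fetched.
ds = load_dataset("codeparrot/github-code-clean", split="train", streaming=True)

for i, example in enumerate(ds):
    # Typical fields include the file contents, repository name, path,
    # and language tag (the exact schema may vary by dataset version).
    print({k: str(v)[:60] for k, v in example.items()})
    if i == 2:
        break
```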
Data Volume:

Methodology: Continual pretraining with repository-level file packing and per-language length upsampling
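The card does not detail either technique, but repository-level file packing is commonly implemented by concatenating all files from one repository into a single training document (so the model sees cross-file context), and per-language length upsampling by re-weighting languages toward a target token share. A rough sketch under those assumptions; the function names and separator token here are illustrative, not taken from the source:

```python
import random
from typing import Dict, List

FILE_SEP = "<file_sep>"  # illustrative separator token, not from the card

def pack_repository(files: Dict[str, str]) -> str:
    """Concatenate every file of one repository into a single document.

    Packing at the repository level (instead of shuffling individual
    files) preserves cross-file context such as imports and call sites.
    """
    ordered = sorted(files.items())  # deterministic; real pipelines may order by dependency
    return FILE_SEP.join(f"# {path}\n{text}" for path, text in ordered)

def upsample_by_language(
    docs_by_lang: Dict[str, List[str]],
    target_share: Dict[str, float],
    rng: random.Random,
) -> List[str]:
    """Per-language length upsampling (sketch): duplicate documents of
    under-represented languages until each language's character count
    roughly matches its target share of the original corpus."""
    total = sum(len(d) for docs in docs_by_lang.values() for d in docs)
    out: List[str] = []
    for lang, docs in docs_by_lang.items():
        have = sum(len(d) for d in docs)
        want = target_share.get(lang, 0.0) * total
        out.extend(docs)
        while docs and have < want:  # top up with random repeats
            d = rng.choice(docs)
            out.append(d)
            have += len(d)
    return out
```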
Context Length: 128K tokens (extended from 2K; see Release Notes)

Hardware Used: NVIDIA A100 GPUs, NVIDIA H100 GPUs

Responsible AI Considerations

Mitigation Strategies: The model has not undergone any safety alignment; handling the risks of problematic outputs and malicious use remains an active area of research.
Input/Output

Input Format: text input (e.g., code snippets)

Accepted Modalities: text
Output Format: text output (e.g., generated code)
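Because the model is text-in/text-out, inference follows the standard causal-language-model pattern. A hedged sketch using Hugging Face `transformers`; the checkpoint ID below is a placeholder, since this card does not name one:

```python
# Sketch: text-in/text-out code generation with a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-code-model"  # placeholder; not from this card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# A code-completion style prompt, matching the card's example input.
prompt = 'def fibonacci(n: int) -> int:\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)

print(tokenizer.decode(output[0], skip_special_tokens=True))
```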
Release Notes

Date:
Notes: Extended context length from 2K to 128K with continual pretraining.