Model Type | generative code model, text generation, code synthesis |
|
Use Cases |
Areas: | |
Primary Use Cases: | Code synthesis and code editing (generated as edit sequences/'diffs') |
Limitations: | Potential for misuse in generating vulnerable/malicious code |
|
Considerations: | Model-generated code should not be executed without appropriate precautions, such as sandboxing. |
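
For illustration only, the sketch below shows one simple precaution: running generated Python in a separate subprocess with a timeout and an isolated interpreter. This is a first line of defense, not a complete sandbox; the helper name `run_generated_code_sandboxed` is illustrative and not part of the model's tooling.

```python
import subprocess
import sys
import tempfile


def run_generated_code_sandboxed(code: str, timeout_s: float = 5.0) -> subprocess.CompletedProcess:
    """Run model-generated Python in a separate subprocess with a timeout.

    A first line of defense only; for genuinely untrusted code, prefer a
    container or VM with no network access and a restricted filesystem.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    # -I runs the interpreter in isolated mode (ignores env vars and user site-packages).
    return subprocess.run(
        [sys.executable, "-I", path],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
```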
|
|
Additional Notes | Pretrained on a mixture of open-source web text and Python code. |
|
Training Details |
Data Sources: | bigcode/the-stack, HuggingFaceFW/fineweb, Magicoder, StarCoder2 OSS-Instruct |
|
Data Volume: | |
Methodology: | Pretrained on a mixture of open-source web text and Python code; instruction-tuned on synthetic edit sequence data generated with the LintSeq algorithm. |
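
For intuition only, the sketch below shows one way linter-guided edit sequences could be constructed: delete lines backward while a check keeps the program well-formed, then reverse the trajectory into insertion-only diffs. This is a simplified illustration of the idea, not the authors' LintSeq implementation; `lints_cleanly` here is a cheap stand-in (a syntax check) for a real linter call.

```python
import difflib
import random


def lints_cleanly(lines: list[str]) -> bool:
    """Cheap stand-in for a real linter: accept the candidate if it still
    parses as Python. A real pipeline would run an actual linter here."""
    try:
        compile("\n".join(lines), "<candidate>", "exec")
        return True
    except SyntaxError:
        return False


def sample_edit_sequence(program: str, rng: random.Random) -> list[str]:
    """Backward-sample line deletions that keep the program lint-clean,
    then reverse the trajectory into insertion-only unified diffs."""
    lines = program.splitlines()
    states = [lines[:]]
    while lines:
        candidates = list(range(len(lines)))
        rng.shuffle(candidates)
        for i in candidates:
            trial = lines[:i] + lines[i + 1:]
            if not trial or lints_cleanly(trial):
                lines = trial
                break
        else:
            lines = []  # no single deletion stays lint-clean: drop the rest in one step
        states.append(lines[:])
    states.reverse()  # order the states from empty program to full program
    return [
        "\n".join(difflib.unified_diff(prev, nxt, lineterm=""))
        for prev, nxt in zip(states, states[1:])
    ]
```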
|
Training Time: | Pretraining took about two days for the 150M model and about six days for the 400M model; instruction tuning took several hours. |
|
Hardware Used: | A single H100 node (four GPUs) for pretraining; a single H100 GPU for instruction tuning |
|
Model Architecture: | Autoregressive transformer language models following the GPT-2 architecture, with transformer modifications adopted from OLMo. |
|
|
Safety Evaluation |
Risk Categories: | Potential misuse for generating vulnerable or malicious code |
|
Ethical Considerations: | Model-generated code should be handled with appropriate precautions, e.g., reviewed and sandboxed before execution. |
|
|
Input Output |
Input Format: | |
Output Format: | Text and code. Instruction-tuned models generate code as sequences of edits ('diffs'). |
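
As a rough illustration of consuming diff-style output, the helper below applies an insertion-only unified diff to a source string. The exact diff format emitted by the instruction-tuned models may differ, and `apply_insertion_diff` is a hypothetical helper, not code shipped with the models.

```python
def apply_insertion_diff(source: str, diff: str) -> str:
    """Apply an insertion-only unified diff (no '-' lines) to `source`.

    Minimal illustration of consuming diff-style model output; assumes hunk
    headers of the form '@@ -start,count +start,count @@' and no deletions.
    """
    lines = source.splitlines()
    result: list[str] = []
    cursor = 0  # index into `lines`
    for raw in diff.splitlines():
        if raw.startswith(("---", "+++")):
            continue  # file headers
        if raw.startswith("@@"):
            # 1-based start position of the hunk in the old file
            old_start = int(raw.split()[1].lstrip("-").split(",")[0])
            target = max(old_start - 1, 0)
            result.extend(lines[cursor:target])  # copy untouched lines up to the hunk
            cursor = target
        elif raw.startswith("+"):
            result.append(raw[1:])  # inserted line
        elif raw.startswith(" "):
            result.append(lines[cursor])  # context line
            cursor += 1
    result.extend(lines[cursor:])  # copy the remainder of the old file
    return "\n".join(result)
```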
|
|