Kudos to the Qwen Coder Models: Open Weights, Self-Hosted on Your Hardware
12/11/2024 13:28:49

Qwen keeps generating enthusiasm in the AI community. Practitioners highlight the model family's ability to deliver high performance on modest hardware while keeping the weights fully accessible.
Benchmark source: Aider LLM Leaderboards.
The Qwen2.5-Coder family's accessibility is a key talking point. Users emphasize that these models can be self-hosted on consumer hardware: a 32GB RAM Mac or a 24GB GPU is sufficient for the larger versions (a rough memory estimate follows the list):
- 14B model runs with Q6_K quantization and a 32K context on 24GB GPUs (12-16GB VRAM is the practical minimum)
- 32B model operates with 32K context at ~4.5 bits per weight on 24GB cards
- Running on CPU/RAM is possible but slower (1-3 tokens/s vs 20-30+ tokens/s on GPU)
- 32B version achieves 37-40 tokens/s with Q4_K_M quantization on an RTX 3090
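These figures line up with a back-of-the-envelope estimate: quantized weights take roughly parameter count times bits per weight, and the KV cache grows linearly with context length. A minimal sizing sketch, assuming the published Qwen2.5-32B configuration (64 layers, 8 KV heads, head dim 128; treat these as approximations):

```python
GIB = 1024**3

def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the quantized weights alone."""
    return params_billion * 1e9 * bits_per_weight / 8 / GIB

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context: int, bytes_per_elem: float = 2.0) -> float:
    """K and V tensors across all layers at a given context length."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / GIB

# Qwen2.5-32B uses GQA: 64 layers, 8 KV heads, head_dim 128 (assumed from
# the published model config).
weights = weight_gib(32, 4.5)                   # ~16.8 GiB at ~4.5 bpw
kv_fp16 = kv_cache_gib(64, 8, 128, 32768)       # ~8.0 GiB at fp16
kv_q8 = kv_cache_gib(64, 8, 128, 32768, 1.0)    # ~4.0 GiB with a Q8 cache
print(f"weights {weights:.1f} GiB, KV fp16 {kv_fp16:.1f}, KV Q8 {kv_q8:.1f}")
```

At fp16 the 32K cache alone would push the total past 24GB once weights are loaded, which is why the quantized-cache setups listed below are popular.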
Technical implementation varies among users; popular approaches include the following (a client-side usage sketch follows the list):
- tabbyAPI with Q6 context cache
- kobold.cpp with IQ4-M quantization and Q8_0/Q5_1 cache
- croco.cpp fork for automatic Q8/Q5_1 attention building

Some users report struggles with setting up custom flash attention in ollama and advise against that approach.
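Whichever backend users pick, tabbyAPI and kobold.cpp (and its forks) advertise OpenAI-compatible HTTP endpoints, so the client side looks the same regardless. A minimal sketch, assuming a local server on port 5000 and a model registered as qwen2.5-coder-32b-instruct (both are placeholders for whatever your server reports):

```python
from openai import OpenAI

# Point the stock OpenAI client at the local server; self-hosted backends
# generally accept any API key string.
client = OpenAI(base_url="http://localhost:5000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="qwen2.5-coder-32b-instruct",  # whatever name the server registers
    messages=[{
        "role": "user",
        "content": "Write a Python function that reverses a linked list.",
    }],
    temperature=0.2,
)
print(resp.choices[0].message.content)
```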
Performance metrics are impressive:
- The 14B version surpasses the Qwen2.5 72B chat model on the aider leaderboard
- The 32B coder version is considered state-of-the-art among open-source code models
- Trained on 5.5 trillion tokens with extensive data cleaning and balanced mixing
- Supports Fill-in-the-Middle (FIM) completion (format sketched below)
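FIM completion fills in code between a given prefix and suffix, which is what editor integrations use for inline suggestions. The Qwen2.5-Coder model card documents a special-token format for this; a minimal sketch of building such a prompt (send it to a raw completion endpoint, not the chat endpoint):

```python
# FIM prompt format from the Qwen2.5-Coder model card:
#   <|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>
prefix = (
    "def clamp(value, low, high):\n"
    '    """Clamp value to the inclusive range [low, high]."""\n'
)
suffix = "    return value\n"

prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"
# The model generates the missing middle (here, the min/max bounds checks);
# the editor splices that output between prefix and suffix.
```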
The Qwen2.5-Coder family spans multiple sizes (0.5B, 1.5B, 3B, 7B, 14B, and 32B) with various quantization options (Q4, Q6, Q8), all released under the Apache License except the 3B version. Built on the Qwen2.5 architecture, the Coder models retain general capabilities while excelling at code generation. Users report success across multiple applications: code generation, role-playing/chat, and educational assistance, with additional use cases in document summarization and brainstorming.