Google's Gemma 2 2B: Early User Tests Show Efficient Performance on Mobile Devices
02/08/2024 10:56:26

Google's Gemma 2 2B model is demonstrating strong performance on a range of mobile devices, according to recent user feedback. Here's what early testers are reporting:
On a Motorola g84 smartphone, both Q4- and Q8-quantized versions of the model sustain more than 4 tokens per second of output while using minimal memory in the Layla frontend, with an initial load time of 15-20 seconds for a simple creative writing task. An ARM-optimized build by ThomasBaruzier pushes this further, reaching 5.5-6.1 tokens per second and loading in under ten seconds.
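A back-of-the-envelope calculation shows why memory use stays "minimal." The figures below are assumptions rather than measurements from the tests above: Gemma 2 2B has roughly 2.6 billion parameters, and llama.cpp-style Q4 and Q8 quantizations land near ~4.8 and ~8.5 effective bits per weight.

```swift
import Foundation

/// Rough estimate of a quantized model's weight footprint in GiB.
/// Both inputs are assumptions, not measured values.
func weightFootprintGiB(paramCount: Double, bitsPerWeight: Double) -> Double {
    (paramCount * bitsPerWeight / 8.0) / 1_073_741_824.0
}

let params = 2.6e9 // assumed parameter count for Gemma 2 2B
print(String(format: "Q4 ≈ %.2f GiB", weightFootprintGiB(paramCount: params, bitsPerWeight: 4.8)))
print(String(format: "Q8 ≈ %.2f GiB", weightFootprintGiB(paramCount: params, bitsPerWeight: 8.5)))
// Prints roughly 1.45 GiB for Q4 and 2.57 GiB for Q8 — both well within
// the RAM budget of a 6-8 GB phone, leaving room for the OS and KV cache.
```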
The tester on the Motorola device noted that the model responds well to temperature adjustments and shows a diverse vocabulary. It handles 8-16k-token contexts on phones with 6-8 GB of RAM, with a slight slowdown at larger context sizes. While the model occasionally breaks stories into chapters and shows some logical inconsistencies, these issues appear less often than in other small models.
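Those 8-16k context figures also square with a rough KV-cache estimate. The sketch below assumes Gemma 2 2B's published configuration (26 layers, 4 KV heads, head dimension 256) and a dense fp16 cache; Gemma 2 actually interleaves sliding-window attention, so real usage should come in somewhat lower.

```swift
import Foundation

/// Approximate fp16 KV-cache size in GiB for a given context length.
/// The default layer/head values are assumed from Gemma 2 2B's config;
/// verify against the actual model before relying on these numbers.
func kvCacheGiB(contextTokens: Int, layers: Int = 26,
                kvHeads: Int = 4, headDim: Int = 256,
                bytesPerValue: Int = 2) -> Double {
    // Keys and values are both cached, hence the leading factor of 2.
    let bytes = 2 * layers * kvHeads * headDim * bytesPerValue * contextTokens
    return Double(bytes) / 1_073_741_824.0
}

print(String(format: "8k context  ≈ %.2f GiB", kvCacheGiB(contextTokens: 8_192)))
print(String(format: "16k context ≈ %.2f GiB", kvCacheGiB(contextTokens: 16_384)))
// Roughly 0.81 and 1.63 GiB — which is why 16k is tight but workable
// on an 8 GB phone once the quantized weights are loaded.
```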
On an iPhone 15 Pro, another user ran a quantized Gemma 2 2B efficiently using MLX Swift. They noted that the model scores comparably to GPT-3.5 Turbo and Mixtral 8x7B on LMSys.org benchmarks, which is noteworthy for a model that fits on a smartphone. The code and documentation for this implementation are available on GitHub for anyone who wants to replicate or build on the work.
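For anyone curious about the iPhone route, here is a minimal sketch of the call flow using the MLXLLM package from the ml-explore/mlx-swift-examples repo. The identifiers and the model id below are assumptions based on that repo, and its API has changed between releases, so treat this as the shape of the code rather than something to copy-paste; check the repo's current README.

```swift
// Sketch only: API names (LLMModelFactory, ModelConfiguration, UserInput,
// GenerateParameters, generate) are assumed from mlx-swift-examples and
// may differ in your installed version. Run inside an async context.
import MLXLLM
import MLXLMCommon

let configuration = ModelConfiguration(id: "mlx-community/gemma-2-2b-it-4bit") // assumed model id
let container = try await LLMModelFactory.shared.loadContainer(configuration: configuration)

let output = try await container.perform { context in
    // Tokenize and format the prompt for the model's chat template.
    let input = try await context.processor.prepare(
        input: UserInput(prompt: "Summarize why small on-device models matter."))
    // Stream tokens; returning .more keeps generation going, .stop ends it.
    let result = try MLXLMCommon.generate(
        input: input,
        parameters: GenerateParameters(temperature: 0.7),
        context: context
    ) { _ in .more }
    return result.output
}
print(output)
```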
Looks like Gemma 2 2B is giving smartphones a brain boost! Just don't be surprised if your budget phone starts finishing your sentences or asks for a raise 😁. Remember, with great AI comes great responsibility... and possibly a very confused autocorrect. 🤖📱