Tried gemma4 on llama.cpp

Apr 04, 2026

After my failed attempts with ollama and LM Studio (see my earlier posts "Gibberish results running gemma4 on my integrated GPU" and "Tried gemma4 on LM Studio"), this time I tried llama.cpp as the inference engine. To my surprise, it ran just fine on my integrated GPU. I mean, slow, but fine, without errors. I ran the E2B version with:

./llama-cli -hf ggml-org/gemma-4-E2B-it-GGUF

You can control how many layers are offloaded to the GPU with the -ngl flag. For example, setting it to 10:

./llama-cli -hf ggml-org/gemma-4-E2B-it-GGUF -ngl 10 -v

You can see "offloaded 10/36 layers to GPU" in the logs. You can also offload all of them:

./llama-cli -hf ggml-org/gemma-4-E2B-it-GGUF -ngl all -v

and you'll see "offloaded 36/36 layers to GPU" in the logs. GPU usage can be checked with the intel_gpu_top program.
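
intel_gpu_top comes with the intel-gpu-tools package (the apt package name below is for Debian/Ubuntu; other distros may differ), and it usually needs root. Run it in a second terminal while the model is generating:

sudo apt install intel-gpu-tools
sudo intel_gpu_top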

I also uploaded an image of a kitten to the model, and it identified it just fine.
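
If you want to try the image part yourself, llama.cpp ships a multimodal frontend, llama-mtmd-cli, which takes an image via the --image flag. A minimal sketch, assuming the same GGUF repo works with it; kitten.jpg and the prompt are just placeholders:

./llama-mtmd-cli -hf ggml-org/gemma-4-E2B-it-GGUF --image kitten.jpg -p "What animal is this?"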

#ollama #gemma4 #llama.cpp #lm-studio