Gemma 4 E4B benchmark: 53 tests on a MacBook Pro M4 (24GB)
Written by Joseph Nordqvist
10 min read
Google released Gemma 4 today, a family of open models under Apache 2.0 that they claim deliver "frontier-class reasoning" on consumer hardware. The lineup includes four sizes: E2B, E4B, a 26B Mixture of Experts, and a 31B Dense model. The 31B currently ranks #3 among open models on the Arena AI leaderboard.
The pitch is that these models are sized to run on your hardware, from phones to laptops to workstations. I wanted to test that claim directly: what can Gemma 4 actually do on a MacBook Pro M4 with 24GB of unified memory?
I ran over 50 practical tests covering vision, reasoning, code generation, UI recreation, essay writing, mathematics, creative writing, translation, and more. No synthetic benchmarks, just tasks people actually do.
Setup
Hardware: MacBook Pro, M4 Pro chip, 24GB unified memory. Software: Ollama 0.20.0-rc1 (the stable 0.19.0 does not support Gemma 4 — you need the pre-release from GitHub releases).

Two models were tested:
Gemma 4 E4B (9.6GB download) — the "edge" model with an effective 4 billion parameter footprint. Fits comfortably in 24GB with room to spare. This is the one you will actually use.
Gemma 4 26B MoE (17GB download) — the Mixture of Experts model ranked #6 on Arena AI. Activates only 3.8B of its 25.2B parameters during inference. Technically loads on 24GB but causes heavy memory swapping.
The 31B Dense model (20GB) was not tested. At that size on 24GB RAM, the system would be swapping constantly.

The benchmark
Nine tests, each targeting a different capability Google claims for Gemma 4. Every test was run once (no cherry-picking), all locally, all free. The first six cover vision, reasoning, structured output, and long context. The last three are pure code generation: I gave the model a single prompt and asked it to build a complete bakery landing page, a SaaS analytics dashboard, and a todo app with localStorage persistence.
UI Recreation Challenge: 14 real websites
Beyond the initial tests, I pushed the model further: take a screenshot of a real website, feed it to Gemma 4 E4B, and see what it produces. 14 sites covering pricing pages, hero sections, dashboards, docs, e-commerce, and auth UIs. Same model, same hardware, zero API calls.
Each recreation is the model's unedited first attempt — no retries, no cherry-picking.
Writing, Math, and Everything Else: 30 Tests
The final round pushes Gemma 4 E4B into territory where most people actually use language models: writing essays, solving math problems, following complex instructions, translating text, and extracting data. 30 tests across 8 categories, each run once at temperature 0.
Open-ended outputs (essays, creative writing, summaries) were evaluated by Claude Opus 4.6 against a 6-dimension rubric — the same LLM-as-judge approach used in research benchmarks like MT-Bench. The judge prompt, scoring rubric, and full reasoning are published for transparency. Objective tasks (math, logic, data extraction) were verified programmatically.
What worked
Vision is real, with caveats. The E4B model correctly identified all 6 countries in a line chart and got the rank ordering and trend directions right. However, when the sidebar labels were cropped out, its absolute value estimates from the line positions were rough — consistently overestimating the lower lines by 3-8 percentage points. It also produced a recognizable HTML recreation of Stripe's pricing page from a screenshot in 75 seconds. Vision works, but don't trust it for precise numerical extraction from charts without labels.
Math and logic are flawless. All 12 objective text tests — four math problems, four logic puzzles, and four factual/data extraction tasks — scored correct on the first attempt. The model solved quadratic equations with LaTeX formatting, untangled constraint satisfaction puzzles, and computed quarter-over-quarter growth rates from CSV data without errors.
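Checks like the quarter-over-quarter task are straightforward to verify programmatically. A minimal sketch of the calculation (the revenue figures and function name are illustrative, not the benchmark's actual data):

```typescript
// Quarter-over-quarter growth: (current - previous) / previous.
// The revenue values below are illustrative, not the benchmark's data.
function qoqGrowth(revenues: number[]): number[] {
  return revenues.slice(1).map((rev, i) => (rev - revenues[i]) / revenues[i]);
}

const quarterly = [100_000, 110_000, 99_000, 118_800];
const growth = qoqGrowth(quarterly).map((g) => `${(g * 100).toFixed(1)}%`);
// growth: ["10.0%", "-10.0%", "20.0%"]
```

A verifier only needs to compare these values against the model's reported percentages.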
Instruction following is precise. All four structural constraint tests passed: exact paragraph counts, word limits, inclusion/exclusion rules, and format requirements. The model reliably produces output that matches specific formatting instructions.
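Structural constraints like these can be checked mechanically. A sketch of the kind of programmatic verifier one might use (the constraint values and sample text are illustrative):

```typescript
// Verify structural constraints on model output: exact paragraph count
// and a word limit. The limits below are illustrative examples.
function countParagraphs(text: string): number {
  return text.split(/\n\s*\n/).filter((p) => p.trim().length > 0).length;
}

function countWords(text: string): number {
  return text.trim().split(/\s+/).filter(Boolean).length;
}

function meetsConstraints(text: string, paragraphs: number, maxWords: number): boolean {
  return countParagraphs(text) === paragraphs && countWords(text) <= maxWords;
}

const output = "First paragraph here.\n\nSecond paragraph, a bit longer.";
console.log(meetsConstraints(output, 2, 50)); // true
```
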
Reasoning holds up. Three math and logic problems of increasing difficulty, all solved correctly. The snail-in-a-well problem, which requires recognizing a common reasoning trap, was handled with a clear explanation of why the naive answer fails. Step-by-step work was shown throughout.
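The snail puzzle's trap generalizes: each full day nets (climb − slip) of progress, but the count ends early on the day a single climb reaches the rim. A sketch with assumed parameters (a 10 ft well, climbing 3 ft and slipping 2 ft per day — the article doesn't state the exact numbers used):

```typescript
// Snail-in-a-well: the naive answer divides depth by net daily progress,
// but the snail escapes mid-day once one climb reaches the rim.
// Parameters are assumed for illustration.
function snailDays(depth: number, climb: number, slip: number): number {
  // Full climb-and-slip days needed before the final climb suffices.
  const fullDays = Math.ceil((depth - climb) / (climb - slip));
  return fullDays + 1;
}

console.log(snailDays(10, 3, 2)); // 8, not the naive 10
```
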
Structured output is reliable. Valid JSON on the first attempt for a multi-function-call scenario. Correct IATA codes, ISO dates, proper schema adherence. No hallucinated parameters. This matters for anyone building local agent workflows without API access.
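That kind of schema adherence is what makes local tool use workable: the harness only needs to confirm the call names a known tool, supplies every required parameter, and invents nothing extra. A sketch of such a check (the tool name and parameter set are hypothetical, not the benchmark's actual schema):

```typescript
// Validate a model-emitted function call: correct tool name, all required
// parameters present, no hallucinated parameters. Schema is hypothetical.
interface FunctionCall {
  name: string;
  arguments: Record<string, unknown>;
}

const schema = { name: "book_flight", required: ["origin", "destination", "date"] };

function validateCall(raw: string): { ok: boolean; errors: string[] } {
  const errors: string[] = [];
  let call: FunctionCall;
  try {
    call = JSON.parse(raw);
  } catch {
    return { ok: false, errors: ["invalid JSON"] };
  }
  if (call.name !== schema.name) errors.push(`unexpected tool: ${call.name}`);
  for (const key of Object.keys(call.arguments ?? {})) {
    if (!schema.required.includes(key)) errors.push(`hallucinated parameter: ${key}`);
  }
  for (const key of schema.required) {
    if (!(key in (call.arguments ?? {}))) errors.push(`missing parameter: ${key}`);
  }
  return { ok: errors.length === 0, errors };
}

const sample = `{"name":"book_flight","arguments":{"origin":"SFO","destination":"JFK","date":"2026-03-14"}}`;
console.log(validateCall(sample).ok); // true
```
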
Speed is interactive. At 48 tokens per second on short-context tasks, the model responds at a comfortable reading pace. Generation time ranged from 15 seconds (function calling) to 75 seconds (screenshot-to-code), which is fast enough for development workflows.
What didn't
Long context is slow. Ingesting a 56,656-token prompt of source code took over 5 minutes, with generation slowing to 20 tok/s. The model's answers were correct — it identified specific class names, function signatures, and architectural patterns in Hono's codebase — but the speed makes it impractical for interactive code review. Useful for batch analysis; not for real-time pair programming.
Code generation is impressive but not flawless. The model produced polished, complete HTML/CSS pages from single-sentence prompts: a bakery landing page with a warm color palette and a dark-themed SaaS dashboard with KPI cards, both with production-quality layouts. But when JavaScript logic was required (a todo app with filtering and localStorage), it hit a wall: the UI rendered fine, but the event binding had a runtime error (checkbox.onchange called as a function instead of assigned as a handler). CSS and layout: excellent. Complex interactive JS from a 4B model: not quite there yet.
Translation drops accented characters. Both translation tests (English to Spanish, English to French) showed truncated accented characters — "límite" appeared as "l", "lumière" as "lumi". The translations were structurally correct and preserved technical terminology, but the character encoding issues make the output unusable without manual cleanup. This is likely a tokenizer or quantization artifact in the E4B model.
Screenshot-to-code is approximate. The model captured the structure and content of Stripe's pricing page correctly (nav bar, three-column layout, pricing text, CTAs) but the visual fidelity is rough. The gradient background was simplified, spacing was approximate, and it hallucinated a third pricing column ("Volume Discounts") that wasn't in the original. Structurally correct, visually approximate.
The 26B MoE on 24GB: a reality check
The 26B Mixture of Experts model downloaded as 17GB. On a 24GB machine, it loads but triggers heavy memory swapping. I tested it on the reasoning problem and it produced a correct answer, but at 11.6 tok/s (roughly a quarter of the E4B's speed), and model loading alone consumed 3 minutes of the 4-minute total response time.
The full 17GB of expert weights must reside in memory even though only 3.8B parameters activate per token. On a 24GB machine, that leaves roughly 7GB for macOS, Ollama, and the KV cache, which is enough for short prompts but tight enough to cause real memory pressure. The result: 11.6 tok/s versus E4B's 48, likely a combination of memory swapping and poor cache locality across the large weight footprint.
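The arithmetic behind that squeeze is simple. A back-of-envelope sketch (the macOS-plus-Ollama overhead figure is a rough assumption, not a measurement):

```typescript
// Back-of-envelope memory budget for the 26B MoE on a 24GB machine.
// The OS/runtime overhead figure is an assumption, not a measurement.
const totalRamGB = 24;
const moeWeightsGB = 17;   // all expert weights must stay resident
const osAndOllamaGB = 4;   // assumed macOS + Ollama overhead

const headroomGB = totalRamGB - moeWeightsGB;        // 7 GB for everything else
const kvCacheBudgetGB = headroomGB - osAndOllamaGB;  // ~3 GB left for the KV cache

console.log({ headroomGB, kvCacheBudgetGB }); // { headroomGB: 7, kvCacheBudgetGB: 3 }
```

Under these assumptions the KV cache gets only a few gigabytes, which is why short prompts fit but anything longer forces swapping.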

On 32GB or more, the 26B MoE would likely run at 25-35 tok/s with significantly higher quality output, especially on complex reasoning. On 24GB, the E4B is the practical choice.
What this means for developers
A year ago, running a model locally on a laptop meant accepting severe quality tradeoffs. Gemma 4 E4B changes the calculus:
Offline code assistance is now viable. A complete landing page or dashboard in 80 seconds, a TypeScript rate limiter in 30 seconds, no API key or internet connection required.
Vision tasks work at interactive speed. Chart reading, screenshot analysis, and document understanding at 48 tok/s.
Local agent workflows are practical. Reliable structured JSON output means tool-use chains can run entirely on-device.
The cost is $0. Apache 2.0 license, data never leaves your machine, no rate limits.

Methodology

All tests were run on a MacBook Pro M4 Pro with 24GB unified memory, using Ollama 0.20.0-rc1 and Gemma 4 E4B (gemma4:e4b, 9.6GB, Q4 quantization), with the context window set to 128K tokens. Each test was run once, with no retries and no cherry-picking. Generation speed was measured from Ollama's reported eval_duration.
Vision inputs: a Stripe pricing page screenshot (1440×900) and an Our World in Data renewable energy chart (6 countries, 2000-2024).
Long context input: 56,656 tokens from the Hono framework source (core routing, middleware, and context modules).
Code generation inputs: single-sentence prompts for a bakery landing page, a SaaS dashboard, and a todo app.
UI recreation: 14 real website screenshots (1280×800 viewport at 2x resolution) captured with Playwright, each fed to the model with the prompt "Recreate this UI as a single self-contained HTML file with inline CSS. Match the layout, colors, and typography as closely as possible." The model's HTML output was rendered in the same viewport and screenshotted for comparison.
Total benchmark time: approximately 8 minutes for the 8 short-context tests, 5 minutes for the long-context test, about 30 minutes for the 14 UI recreation tests, and about 10 minutes for the 30 text tests.
Harness: a custom TypeScript script using Ollama's HTTP API.
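The speed numbers fall out of two fields Ollama reports with each completed response: eval_count (tokens generated) and eval_duration (generation time in nanoseconds). A sketch of the calculation (the response values below are illustrative, chosen to match the E4B's short-context speed):

```typescript
// Tokens/second from Ollama's reported timing fields: eval_count is the
// number of generated tokens, eval_duration is in nanoseconds.
interface OllamaTimings {
  eval_count: number;
  eval_duration: number; // nanoseconds
}

function tokensPerSecond(t: OllamaTimings): number {
  return t.eval_count / (t.eval_duration / 1e9);
}

// Illustrative values, roughly matching the E4B's short-context speed.
const response: OllamaTimings = { eval_count: 960, eval_duration: 20_000_000_000 };
console.log(tokensPerSecond(response)); // 48
```
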
Text benchmark methodology: 30 tests across 8 categories (Mathematical Reasoning, Logical Reasoning, Essay & Argumentative, Creative Writing, Summarization, Instruction Following, Translation, Factual Knowledge). All prompts are original — not sourced from GSM8K, MATH, or any public benchmark dataset. Temperature was set to 0 for deterministic output, and each test was run once.
Scoring: 12 objective tests (math, logic, factual, translation comprehension) were verified programmatically against known answers. 4 hybrid tests (instruction following) were checked programmatically for structural constraints and judged for quality. 14 open-ended tests (essays, creative writing, summaries, translations) were evaluated by Claude Opus 4.6 (Anthropic) as LLM-as-judge against a 6-dimension rubric: Coherence, Accuracy, Relevance, Style, Instruction Following, and Depth, each scored 1-10.
Judging: the judge prompt includes an anti-verbosity instruction and requires chain-of-thought reasoning before each score. Scores are reported per-dimension and never averaged. Full judge reasoning is published alongside the results.
Known limitation: LLM judges have documented biases (verbosity preference, sentiment bias), which are mitigated but not eliminated.
Editorial Transparency
This article was produced with the assistance of AI tools as part of our editorial workflow. All analysis, conclusions, and editorial decisions were made by human editors.