I cloned my voice with Mistral's Voxtral TTS in under a minute, then tested the quantized local model

Written by Joseph Nordqvist · March 27, 2026 at 10:00 PM UTC

6 min read
  1. Voice cloned from a 43-second MacBook recording using Mistral's API — no training required
  2. Full-precision API model handles most text well but stumbles on currency notation
  3. Local 6-bit quantized version (3.5 GB) runs at 1.33x real-time on M4 Pro
  4. Kokoro 82M is 50x smaller and 13x faster but lacks multilingual support and voice cloning
  5. All samples are playable in-article for direct A/B comparison

I recorded 43 seconds of audio on a MacBook, sent it to Mistral’s API, and got back natural-sounding speech clones in minutes. No model training. No fine-tuning. No GPU cluster. Just a voice recording, an API key, and a short Python script.

This is Voxtral 4B TTS, Mistral’s first text-to-speech model. Mistral says it can clone voices from as little as a few seconds of audio (I tested with 43 seconds) and it generates speech in nine languages. The weights are open. The API is live. And a community-quantized version runs entirely on a laptop.

I tested all three tiers: the full-precision cloud API, a 6-bit quantized local version, and Kokoro 82M as a lightweight benchmark. The goal was to find out what works, what breaks, and whether you actually need a 4-billion parameter model to sound human.

The Voice Clone

The process was almost anticlimactic. I recorded myself reading a short paragraph on my MacBook: 43 seconds of casual speech, captured in the built-in Voice Memos app. I converted the M4A file to WAV, base64-encoded it, and sent it to Mistral’s API alongside a text prompt. The response came back as an MP3.
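The conversion step is easy to script. Here is a minimal sketch that builds an ffmpeg command for the M4A-to-WAV step (ffmpeg is assumed to be installed; the helper name and the mono 24 kHz 16-bit target are my choices for illustration, not documented Mistral requirements):

```python
import subprocess
from pathlib import Path

def ffmpeg_convert_cmd(src: Path, dst: Path, sample_rate: int = 24000) -> list[str]:
    """Build an ffmpeg command that converts a recording to mono 16-bit WAV."""
    return [
        "ffmpeg", "-y",           # overwrite the output file if it exists
        "-i", str(src),           # input: e.g. the Voice Memos M4A file
        "-ac", "1",               # downmix to a single channel
        "-ar", str(sample_rate),  # resample to the target rate
        "-sample_fmt", "s16",     # 16-bit PCM samples
        str(dst),
    ]

# To actually run it:
# subprocess.run(ffmpeg_convert_cmd(Path("memo.m4a"), Path("voice_sample.wav")), check=True)
```

Building the argument list in one place keeps the conversion reproducible and easy to tweak if the API turns out to prefer a different sample rate.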

Below is the original recording, followed by eight AI-generated samples using my voice. The model had never heard this voice before. There was no training step, no voice profile to configure. Every sample was generated from scratch using only the reference clip.

Interactive visualization: VoxtralVoiceClone

How I Did It

The entire workflow fits in a single short script. The Mistral SDK handles the API call; you supply the reference audio as base64 and the text you want spoken.

import base64
import os
from pathlib import Path

from mistralai import Mistral

# Read the key from the environment rather than hard-coding it.
client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# The reference clip goes up as base64-encoded bytes.
ref_audio = base64.b64encode(Path("voice_sample.wav").read_bytes()).decode()

response = client.audio.speech.complete(
    model="voxtral-mini-tts-2603",
    input="Your text goes here.",
    ref_audio=ref_audio,
    response_format="mp3",
)

# The audio comes back base64-encoded; decode it before writing.
Path("output.mp3").write_bytes(base64.b64decode(response.audio_data))

API response times ranged from 3.8 to 8.9 seconds depending on text length. The short utterance (“Markets are closed today”) took 4.5 seconds end-to-end; the news paragraph took 6.4 seconds. Output formats include MP3, WAV, FLAC, Opus, and raw PCM for streaming applications.
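Those end-to-end numbers are straightforward to reproduce with a wall-clock wrapper around the request. A quick sketch (the lambda below is a stand-in for the actual API call, not Mistral's SDK):

```python
import time
from typing import Callable

def timed(request: Callable[[], bytes]) -> tuple[bytes, float]:
    """Run a zero-argument request and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = request()
    return result, time.perf_counter() - start

# Stand-in for the real client.audio.speech.complete(...) call:
audio, elapsed = timed(lambda: b"mp3-bytes")
```

`time.perf_counter` is the right clock here because it is monotonic and high-resolution, so it captures network latency plus generation time without being skewed by system clock adjustments.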

What the API Gets Right

The voice clone is recognizable. My speaking rhythm, pitch, and cadence carry through across all eight samples. The model adapts tone to match content. The excitement sample speeds up naturally, the somber passage slows down and drops in volume, and the breaking news delivery has urgency without sounding robotic.

Technical jargon is mostly handled well, though not perfectly. “Parameter” gets split into “para meter” and the delivery stiffens through dense terminology. Abbreviations like NASA, PhD, and S&P 500 are handled correctly.

Can You Run It Locally?

Yes. The community has already quantized Voxtral to run on Apple Silicon via the MLX framework. I tested the 6-bit quantized version (~3.5 GB) on the same M4 Pro MacBook.

The local version uses preset voices rather than voice cloning: five English voices (casual male/female, neutral male/female, cheerful female) plus male and female presets for French, Spanish, German, Italian, and Portuguese. No reference audio needed.

Key findings from the local benchmark:

  • 1.33x real-time factor. The model generates audio faster than it plays back, consistent across all voices, languages, and content types.

  • Six languages work natively with no quality drop compared to English.

  • Long-form content scales linearly. A 790-character paragraph generates 40 seconds of audio at the same speed as shorter passages.

  • Model loads in ~2 seconds from a cold start on unified memory.
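For reference, the real-time factor is just generated audio duration divided by wall-clock generation time. The 30-second figure below is implied by the 40-second paragraph at 1.33x; it is an illustration, not a separate measurement:

```python
def real_time_factor(audio_seconds: float, generation_seconds: float) -> float:
    """RTF > 1.0 means audio is produced faster than it plays back."""
    return audio_seconds / generation_seconds

# ~40 s of audio in ~30 s of generation:
rtf = real_time_factor(40.0, 30.0)  # ≈ 1.33
```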

This is the 6-bit quantized model, roughly half the size of the original weights. Subtle quality differences likely exist between this and the full-precision version, but for local, private, offline use, the trade-off is compelling.

Where It Stumbles

Currency notation is the clearest weak spot I found across both the API and local versions. In the punctuation stress test, “$400 million” is read as “four hundred dollar million” and “$1.2 billion” becomes “one twenty billion” instead of “one point two billion.” This is a text normalization issue. The model isn’t expanding the dollar sign and decimal correctly before generating speech.

The fact that the same issue appears in both the full-precision API and the 6-bit quantized local version confirms it’s a model-level problem, not a quantization artifact. For production use with financial content, you’d need to pre-process dollar amounts into written-out form. Listen to the “Punctuation Stress” samples in the benchmark below to hear this firsthand.
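One way to do that pre-processing is a regex pass that expands dollar amounts before the text reaches the model. This is a minimal sketch covering the patterns that tripped Voxtral; the function names are mine, and the word tables only handle 0–999 plus decimals:

```python
import re

# Minimal word tables for the magnitudes seen in the test prompts.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def int_words(n: int) -> str:
    """Spell out 0-999 in words (enough for amounts like $400 million)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + (" " + ONES[rest] if rest else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + int_words(rest) if rest else "")

def number_words(num: str) -> str:
    """Spell a decimal like '1.2' as 'one point two'."""
    if "." in num:
        whole, frac = num.split(".", 1)
        return int_words(int(whole)) + " point " + " ".join(ONES[int(d)] for d in frac)
    return int_words(int(num))

def normalize_dollars(text: str) -> str:
    """Expand '$400 million' -> 'four hundred million dollars' before TTS."""
    pattern = re.compile(r"\$(\d+(?:\.\d+)?)(?:\s+(million|billion|trillion))?")
    def repl(m: re.Match) -> str:
        words = number_words(m.group(1))
        scale = (" " + m.group(2)) if m.group(2) else ""
        return words + scale + " dollars"
    return pattern.sub(repl, text)

print(normalize_dollars("Revenue hit $400 million."))  # -> Revenue hit four hundred million dollars.
```

A production pipeline would want a full text-normalization library rather than this hand-rolled table, but even a narrow pass like this would have fixed both failures heard in the samples.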

The Tiny Challenger: Kokoro 82M

To put Voxtral’s size in context, I ran the same test prompts through Kokoro 82M, a model with 50 times fewer parameters that fits in 330 MB of disk space.

The speed difference is staggering. Kokoro generates audio at 17x real-time compared to Voxtral’s 1.33x, roughly 13 times faster. It loads in 0.8 seconds. A 790-character paragraph generates in under 2 seconds.
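The headline ratios fall straight out of the raw numbers (parameter counts taken as 4B vs 82M from the model names):

```python
param_ratio = 4_000_000_000 / 82_000_000  # ~48.8, i.e. roughly 50x fewer parameters
speed_ratio = 17 / 1.33                   # ~12.8, i.e. roughly 13x faster generation
```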

The quality trade-off is audible but not as dramatic as the parameter gap suggests. Kokoro produces clear, natural-sounding speech with good prosody. It lacks Voxtral’s multilingual support, but for English-only applications where speed matters, it’s remarkably competitive.

Use the “vs Kokoro 82M” tab in the benchmark below to listen to side-by-side pairs and decide for yourself.

Full Benchmark

The interactive benchmark below contains all samples from the local Voxtral (6-bit quantized) and Kokoro (82M bf16) tests. Use the tabs to explore voice comparisons, emotional range, stress tests, multilingual output, and raw performance data.

Interactive visualization: VoxtralBenchmark

Three Tiers of TTS

What emerges from this testing is a clear three-tier landscape for open text-to-speech in 2026:

|               | Voxtral API      | Voxtral Local (6-bit) | Kokoro 82M        |
|---------------|------------------|-----------------------|-------------------|
| Size          | Cloud            | 3.5 GB                | 330 MB            |
| Speed         | 3–9s per request | 1.33x real-time       | 17x real-time     |
| Voice cloning | Yes (zero-shot)  | No (presets only)     | No (presets only) |
| Languages     | 9                | 6                     | English-focused   |
| Privacy       | Cloud            | Fully local           | Fully local       |
| Cost          | API pricing      | Free                  | Free              |

For developers building voice interfaces, the choice depends on what matters most. Need voice cloning and maximum quality? Use the API. Need privacy and offline capability with good quality? Run the quantized model locally. Need speed above all else and English is sufficient? Kokoro at 82M parameters is hard to beat.

The fact that all three options exist, and that two of them run entirely on a laptop with no internet connection, represents a genuine shift in what’s accessible to individual developers and small teams.

Methodology

All local tests were run on an Apple M4 Pro with 24 GB unified memory, macOS Darwin 25.2.0, Python 3.14.

  • Voxtral API: voxtral-mini-tts-2603 via Mistral’s REST API. Voice cloned from a 43-second reference recording. Output: MP3.

  • Voxtral Local: mlx-community/Voxtral-4B-TTS-2603-mlx-6bit (6-bit quantized, ~3.5 GB) via mlx-audio library. Output: 24 kHz WAV, float32.

  • Kokoro: mlx-community/Kokoro-82M-bf16 (~330 MB, bf16) via mlx-audio library. Output: 24 kHz WAV, float32.

All local generation was single-threaded with no batching. Times include the full generate() call including tokenization and audio codec decoding. A 4-bit quantized Voxtral version (~2.5 GB) is also available but was not tested.

Written by Joseph Nordqvist

Founder & Editor-in-Chief at AI News Home

Editorial Transparency

This article was produced with the assistance of AI tools as part of our editorial workflow. All analysis, conclusions, and editorial decisions were made by human editors. Read our Editorial Guidelines
