I cloned my voice with Mistral's Voxtral TTS in under a minute, then tested the quantized local model
Written by Joseph Nordqvist on March 27, 2026 at 10:00 PM UTC
6 min read

1. Voice cloned from a 43-second MacBook recording using Mistral's API — no training required
2. Full-precision API model handles most text well but stumbles on currency notation
3. Local 6-bit quantized version (3.5 GB) runs at 1.33x real-time on M4 Pro
4. Kokoro 82M is 50x smaller and 13x faster but lacks multilingual support and voice cloning
5. All samples are playable in-article for direct A/B comparison

I recorded 43 seconds of audio on a MacBook, sent it to Mistral’s API, and got back natural-sounding speech clones in minutes. No model training. No fine-tuning. No GPU cluster. Just a voice recording, an API key, and a short Python script.
This is Voxtral 4B TTS, Mistral’s first text-to-speech model. Mistral says it can clone voices from as little as a few seconds of audio (I tested with 43 seconds) and it generates speech in nine languages. The weights are open. The API is live. And a community-quantized version runs entirely on a laptop.
I tested all three tiers: the full-precision cloud API, a 6-bit quantized local version, and Kokoro 82M as a lightweight benchmark. The goal was to find out what works, what breaks, and whether you actually need a 4-billion parameter model to sound human.
The Voice Clone
The process was almost anticlimactic. I recorded myself reading a short paragraph on my MacBook. 43 seconds of casual speech, captured in the built-in Voice Memos app. The M4A file was converted to WAV, base64-encoded, and sent to Mistral’s API alongside a text prompt. The response came back as an MP3.
Below is the original recording, followed by eight AI-generated samples using my voice. The model had never heard this voice before. There was no training step, no voice profile to configure. Every sample was generated from scratch using only the reference clip.
How I Did It
The entire workflow fits in a single short script. The Mistral SDK handles the API call; you supply the reference audio as base64 and the text you want spoken.
```python
import base64
from pathlib import Path

from mistralai import Mistral

client = Mistral(api_key="MISTRAL_API_KEY")

# The reference clip is sent as base64; the clone is built from this audio alone.
ref_audio = base64.b64encode(Path("voice_sample.wav").read_bytes()).decode()

response = client.audio.speech.complete(
    model="voxtral-mini-tts-2603",
    input="Your text goes here.",
    ref_audio=ref_audio,
    response_format="mp3",
)

Path("output.mp3").write_bytes(base64.b64decode(response.audio_data))
```

API response times ranged from 3.8 to 8.9 seconds depending on text length. The short utterance ("Markets are closed today") took 4.5 seconds end-to-end; the news paragraph took 6.4 seconds. Output formats include MP3, WAV, FLAC, Opus, and raw PCM for streaming applications.
What the API Gets Right
The voice clone is recognizable. My speaking rhythm, pitch, and cadence carry through across all eight samples. The model adapts tone to match content. The excitement sample speeds up naturally, the somber passage slows down and drops in volume, and the breaking news delivery has urgency without sounding robotic.
Technical jargon mostly comes through cleanly, though not perfectly. "Parameter" gets split into "para meter," and the delivery stiffens through dense terminology. Abbreviations like NASA, PhD, and S&P 500 are handled correctly.
Can You Run It Locally?
Yes. The community has already quantized Voxtral to run on Apple Silicon via the MLX framework. I tested the 6-bit quantized version (~3.5 GB) on the same M4 Pro MacBook.
The local version uses preset voices rather than voice cloning: five English voices (casual male/female, neutral male/female, cheerful female) plus male and female presets for French, Spanish, German, Italian, and Portuguese. No reference audio needed.
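The preset lineup described above can be captured in a small lookup table. A sketch of one way to organize it — note that the voice identifiers below are illustrative placeholders, not the actual names the local tooling uses:

```python
# Illustrative catalog of the local model's preset voices.
# Names are placeholders; the real identifiers come from the local model's docs.
PRESET_VOICES = {
    "en": ["casual_male", "casual_female", "neutral_male",
           "neutral_female", "cheerful_female"],
    "fr": ["male", "female"],
    "es": ["male", "female"],
    "de": ["male", "female"],
    "it": ["male", "female"],
    "pt": ["male", "female"],
}

def pick_voice(language: str, preference: str = "neutral_female") -> str:
    """Return the preferred preset for a language, or the first available one."""
    voices = PRESET_VOICES.get(language)
    if voices is None:
        raise ValueError(f"No preset voices for language: {language}")
    return preference if preference in voices else voices[0]
```

This mirrors the article's preset inventory: five English voices plus male/female pairs for the five other supported languages, fifteen presets in total.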
Key findings from the local benchmark:
- 1.33x real-time factor. The model generates audio faster than it plays back, consistent across all voices, languages, and content types.
- Six languages work natively with no quality drop compared to English.
- Long-form content scales linearly. A 790-character paragraph generates 40 seconds of audio at the same speed as shorter passages.
- Model loads in ~2 seconds from a cold start on unified memory.
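The real-time factor above is just audio duration divided by wall-clock generation time, so the figures can be sanity-checked with a few lines of arithmetic:

```python
def real_time_factor(audio_seconds: float, generation_seconds: float) -> float:
    """RTF > 1 means audio is produced faster than it plays back."""
    return audio_seconds / generation_seconds

# The 40-second paragraph at the measured 1.33x RTF implies ~30 s of generation.
gen_time = 40 / 1.33
print(round(gen_time, 1))  # 30.1
```

At 1.33x you wait about three-quarters of the clip's length for it to generate, which is why long-form content at this rate still feels workable for offline batch use.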
This is the 6-bit quantized model, roughly half the size of the original weights. Subtle quality differences likely exist between this and the full-precision version, but for local, private, offline use, the trade-off is compelling.
Where It Stumbles
Currency notation is the clearest weak spot I found across both the API and local versions. In the punctuation stress test, “$400 million” is read as “four hundred dollar million” and “$1.2 billion” becomes “one twenty billion” instead of “one point two billion.” This is a text normalization issue. The model isn’t expanding the dollar sign and decimal correctly before generating speech.
The fact that the same issue appears in both the full-precision API and the 6-bit quantized local version confirms it’s a model-level problem, not a quantization artifact. For production use with financial content, you’d need to pre-process dollar amounts into written-out form. Listen to the “Punctuation Stress” samples in the benchmark below to hear this firsthand.
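One practical workaround is to expand dollar amounts into words before the text ever reaches the model. A minimal sketch of such a pre-processor — the helper names and the limited number vocabulary are illustrative, covering only the magnitudes in the article's examples:

```python
import re

ONES = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
        "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def _number_words(num: str) -> str:
    """Spell out the simple cases used here: small integers, round hundreds, decimals."""
    if "." in num:
        whole, frac = num.split(".")
        return _number_words(whole) + " point " + " ".join(ONES[d] for d in frac)
    n = int(num)
    if n < 10:
        return ONES[str(n)]
    if n % 100 == 0 and n < 1000:
        return ONES[str(n // 100)] + " hundred"
    # Fallback: read digit by digit rather than guess at full number naming.
    return " ".join(ONES[d] for d in num)

def expand_dollars(text: str) -> str:
    """Rewrite '$400 million' -> 'four hundred million dollars' before TTS."""
    pattern = re.compile(r"\$(\d+(?:\.\d+)?)\s*(million|billion|trillion)?")

    def repl(match: re.Match) -> str:
        words = _number_words(match.group(1))
        scale = f" {match.group(2)}" if match.group(2) else ""
        return f"{words}{scale} dollars"

    return pattern.sub(repl, text)

print(expand_dollars("$400 million"))  # four hundred million dollars
print(expand_dollars("$1.2 billion"))  # one point two billion dollars
```

A production normalizer would need full number naming, currencies beyond the dollar, and ranges, but even this small pass fixes both failure cases described above.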
The Tiny Challenger: Kokoro 82M
To put Voxtral’s size in context, I ran the same test prompts through Kokoro 82M, a model with 50 times fewer parameters that fits in 330 MB of disk space.
The speed difference is staggering. Kokoro generates audio at 17x real-time compared to Voxtral’s 1.33x, roughly 13 times faster. It loads in 0.8 seconds. A 790-character paragraph generates in under 2 seconds.
The quality trade-off is audible but not as dramatic as the parameter gap suggests. Kokoro produces clear, natural-sounding speech with good prosody. It lacks Voxtral’s multilingual support, but for English-only applications where speed matters, it’s remarkably competitive.
Use the “vs Kokoro 82M” tab in the benchmark below to listen to side-by-side pairs and decide for yourself.
Full Benchmark
The interactive benchmark below contains all samples from the local Voxtral (6-bit quantized) and Kokoro (82M bf16) tests. Use the tabs to explore voice comparisons, emotional range, stress tests, multilingual output, and raw performance data.
Three Tiers of TTS
What emerges from this testing is a clear three-tier landscape for open text-to-speech in 2026:
| | Voxtral API | Voxtral Local (6-bit) | Kokoro 82M |
|---|---|---|---|
| Size | Cloud | 3.5 GB | 330 MB |
| Speed | 3–9 s per request | 1.33x real-time | 17x real-time |
| Voice cloning | Yes (zero-shot) | No (presets only) | No (presets only) |
| Languages | 9 | 6 | English-focused |
| Privacy | Cloud | Fully local | Fully local |
| Cost | API pricing | Free | Free |
For developers building voice interfaces, the choice depends on what matters most. Need voice cloning and maximum quality? Use the API. Need privacy and offline capability with good quality? Run the quantized model locally. Need speed above all else and English is sufficient? Kokoro at 82M parameters is hard to beat.
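That decision logic can be written down as a tiny helper. This is a reading aid that mirrors the table and the paragraph above, not a library; the criteria priorities are my own ordering:

```python
def pick_tts_tier(needs_cloning: bool, needs_privacy: bool,
                  speed_critical: bool) -> str:
    """Map the article's decision criteria onto the three tiers."""
    if needs_cloning:
        # Zero-shot cloning is only available through the cloud API,
        # so it wins even when privacy is also desired.
        return "Voxtral API"
    if speed_critical:
        # 17x real-time, fully local, but English-focused.
        return "Kokoro 82M"
    if needs_privacy:
        # Fully offline with six languages at 1.33x real-time.
        return "Voxtral Local (6-bit)"
    # Default to maximum quality when nothing else constrains the choice.
    return "Voxtral API"
```

Note the English-only caveat on the Kokoro branch: if you need multilingual output and speed, the local Voxtral tier is the closer fit.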
The fact that all three options exist, and that two of them run entirely on a laptop with no internet connection, represents a genuine shift in what’s accessible to individual developers and small teams.
Methodology
All local tests were run on an Apple M4 Pro with 24 GB of unified memory, running macOS (Darwin 25.2.0) and Python 3.14.
- Voxtral API: `voxtral-mini-tts-2603` via Mistral's REST API. Voice cloned from a 43-second reference recording. Output: MP3.
- Voxtral Local: `mlx-community/Voxtral-4B-TTS-2603-mlx-6bit` (6-bit quantized, ~3.5 GB) via the `mlx-audio` library. Output: 24 kHz WAV, float32.
- Kokoro: `mlx-community/Kokoro-82M-bf16` (~330 MB, bf16) via the `mlx-audio` library. Output: 24 kHz WAV, float32.
All local generation was single-threaded with no batching. Times include the full generate() call including tokenization and audio codec decoding. A 4-bit quantized Voxtral version (~2.5 GB) is also available but was not tested.
Editorial Transparency
This article was produced with the assistance of AI tools as part of our editorial workflow. All analysis, conclusions, and editorial decisions were made by human editors. Read our Editorial Guidelines.