I cloned my voice with Mistral's Voxtral TTS in under a minute, then tested the quantized local model

Written by Joseph Nordqvist · March 27, 2026 at 10:00 PM UTC

6 min read
  1. Voice cloned from a 43-second MacBook recording using Mistral's API — no training required
  2. Full-precision API model handles most text well but stumbles on currency notation
  3. Local 6-bit quantized version (3.5 GB) runs at 1.33x real-time on M4 Pro
  4. Kokoro 82M is 50x smaller and 13x faster but lacks multilingual support and voice cloning
  5. All samples are playable in-article for direct A/B comparison

I recorded 43 seconds of audio on a MacBook, sent it to Mistral’s API, and got back natural-sounding speech clones in minutes. No model training. No fine-tuning. No GPU cluster. Just a voice recording, an API key, and a short Python script.

This is Voxtral 4B TTS, Mistral’s first text-to-speech model. Mistral says it can clone voices from as little as a few seconds of audio (I tested with 43 seconds) and it generates speech in nine languages. The weights are open. The API is live. And a community-quantized version runs entirely on a laptop.

I tested all three tiers: the full-precision cloud API, a 6-bit quantized local version, and Kokoro 82M as a lightweight benchmark. The goal was to find out what works, what breaks, and whether you actually need a 4-billion parameter model to sound human.

The Voice Clone

The process was almost anticlimactic. I recorded myself reading a short paragraph on my MacBook: 43 seconds of casual speech, captured in the built-in Voice Memos app. I converted the M4A file to WAV, base64-encoded it, and sent it to Mistral’s API alongside a text prompt. The response came back as an MP3.
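The conversion step is easy to script. Here is a minimal sketch that builds an ffmpeg command for the M4A-to-WAV step (ffmpeg is assumed to be installed; the helper name and the mono 24 kHz 16-bit target are my choices for illustration, not documented Mistral requirements):

```python
import subprocess
from pathlib import Path

def ffmpeg_convert_cmd(src: Path, dst: Path, sample_rate: int = 24000) -> list[str]:
    """Build an ffmpeg command that converts a recording to mono 16-bit WAV."""
    return [
        "ffmpeg", "-y",           # overwrite the output file if it exists
        "-i", str(src),           # input: e.g. the Voice Memos M4A file
        "-ac", "1",               # downmix to a single channel
        "-ar", str(sample_rate),  # resample to the target rate
        "-sample_fmt", "s16",     # 16-bit PCM samples
        str(dst),
    ]

# To actually run it:
# subprocess.run(ffmpeg_convert_cmd(Path("memo.m4a"), Path("voice_sample.wav")), check=True)
```

Building the argument list in one place keeps the conversion reproducible and easy to tweak if the API turns out to prefer a different sample rate.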

Below is the original recording, followed by eight AI-generated samples using my voice. The model had never heard this voice before. There was no training step, no voice profile to configure. Every sample was generated from scratch using only the reference clip.

Interactive visualization: VoxtralVoiceClone

How I Did It

The entire workflow fits in a single short script. The Mistral SDK handles the API call; you supply the reference audio as base64 and the text you want spoken.

import base64
import os
from pathlib import Path

from mistralai import Mistral

# Read the key from the environment rather than hard-coding it.
client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# The reference clip goes up as base64-encoded bytes.
ref_audio = base64.b64encode(Path("voice_sample.wav").read_bytes()).decode()

response = client.audio.speech.complete(
    model="voxtral-mini-tts-2603",
    input="Your text goes here.",
    ref_audio=ref_audio,
    response_format="mp3",
)

# The audio comes back base64-encoded; decode it before writing.
Path("output.mp3").write_bytes(base64.b64decode(response.audio_data))

API response times ranged from 3.8 to 8.9 seconds depending on text length. The short utterance (“Markets are closed today”) took 4.5 seconds end-to-end; the news paragraph took 6.4 seconds. Output formats include MP3, WAV, FLAC, Opus, and raw PCM for streaming applications.
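Those end-to-end numbers are straightforward to reproduce with a wall-clock wrapper around the request. A quick sketch (the lambda below is a stand-in for the actual API call, not Mistral's SDK):

```python
import time
from typing import Callable

def timed(request: Callable[[], bytes]) -> tuple[bytes, float]:
    """Run a zero-argument request and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = request()
    return result, time.perf_counter() - start

# Stand-in for the real client.audio.speech.complete(...) call:
audio, elapsed = timed(lambda: b"mp3-bytes")
```

`time.perf_counter` is the right clock here because it is monotonic and high-resolution, so it captures network latency plus generation time without being skewed by system clock adjustments.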

What the API Gets Right

The voice clone is recognizable. My speaking rhythm, pitch, and cadence carry through across all eight samples. The model adapts tone to match content. The excitement sample speeds up naturally, the somber passage slows down and drops in volume, and the breaking news delivery has urgency without sounding robotic.

Technical jargon is mostly handled well, though not perfectly. “Parameter” gets split into “para meter” and the delivery stiffens through dense terminology. Abbreviations like NASA, PhD, and S&P 500 are handled correctly.

Can You Run It Locally?

Yes. The community has already quantized Voxtral to run on Apple Silicon via the MLX framework. I tested the 6-bit quantized version (~3.5 GB) on the same M4 Pro MacBook.

The local version uses preset voices rather than voice cloning: five English voices (casual male/female, neutral male/female, cheerful female) plus male and female presets for French, Spanish, German, Italian, and Portuguese. No reference audio needed.

Key findings from the local benchmark:

  • 1.33x real-time factor. The model generates audio faster than it plays back, consistent across all voices, languages, and content types.

  • Six languages work natively with no quality drop compared to English.

  • Long-form content scales linearly. A 790-character paragraph generates 40 seconds of audio at the same speed as shorter passages.

  • Model loads in ~2 seconds from a cold start on unified memory.
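For reference, the real-time factor is just generated audio duration divided by wall-clock generation time. The 30-second figure below is implied by the 40-second paragraph at 1.33x; it is an illustration, not a separate measurement:

```python
def real_time_factor(audio_seconds: float, generation_seconds: float) -> float:
    """RTF > 1.0 means audio is produced faster than it plays back."""
    return audio_seconds / generation_seconds

# ~40 s of audio in ~30 s of generation:
rtf = real_time_factor(40.0, 30.0)  # ≈ 1.33
```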

This is the 6-bit quantized model, roughly half the size of the original weights. Subtle quality differences likely exist between this and the full-precision version, but for local, private, offline use, the trade-off is compelling.

Where It Stumbles

Currency notation is the clearest weak spot I found across both the API and local versions. In the punctuation stress test, “$400 million” is read as “four hundred dollar million” and “$1.2 billion” becomes “one twenty billion” instead of “one point two billion.” This is a text normalization issue. The model isn’t expanding the dollar sign and decimal correctly before generating speech.

The fact that the same issue appears in both the full-precision API and the 6-bit quantized local version confirms it’s a model-level problem, not a quantization artifact. For production use with financial content, you’d need to pre-process dollar amounts into written-out form. Listen to the “Punctuation Stress” samples in the benchmark below to hear this firsthand.
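One way to do that pre-processing is a regex pass that expands dollar amounts before the text reaches the model. This is a minimal sketch covering the patterns that tripped Voxtral; the function names are mine, and the word tables only handle 0–999 plus decimals:

```python
import re

# Minimal word tables for the magnitudes seen in the test prompts.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def int_words(n: int) -> str:
    """Spell out 0-999 in words (enough for amounts like $400 million)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + (" " + ONES[rest] if rest else "")
    hundreds, rest = divmod(n, 100)
    return ONES[hundreds] + " hundred" + (" " + int_words(rest) if rest else "")

def number_words(num: str) -> str:
    """Spell a decimal like '1.2' as 'one point two'."""
    if "." in num:
        whole, frac = num.split(".", 1)
        return int_words(int(whole)) + " point " + " ".join(ONES[int(d)] for d in frac)
    return int_words(int(num))

def normalize_dollars(text: str) -> str:
    """Expand '$400 million' -> 'four hundred million dollars' before TTS."""
    pattern = re.compile(r"\$(\d+(?:\.\d+)?)(?:\s+(million|billion|trillion))?")
    def repl(m: re.Match) -> str:
        words = number_words(m.group(1))
        scale = (" " + m.group(2)) if m.group(2) else ""
        return words + scale + " dollars"
    return pattern.sub(repl, text)

print(normalize_dollars("Revenue hit $400 million."))  # -> Revenue hit four hundred million dollars.
```

A production pipeline would want a full text-normalization library rather than this hand-rolled table, but even a narrow pass like this would have fixed both failures heard in the samples.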

The Tiny Challenger: Kokoro 82M

To put Voxtral’s size in context, I ran the same test prompts through Kokoro 82M, a model with 50 times fewer parameters that fits in 330 MB of disk space.

The speed difference is staggering. Kokoro generates audio at 17x real-time compared to Voxtral’s 1.33x, roughly 13 times faster. It loads in 0.8 seconds. A 790-character paragraph generates in under 2 seconds.
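The headline ratios fall straight out of the raw numbers (parameter counts taken as 4B vs 82M from the model names):

```python
param_ratio = 4_000_000_000 / 82_000_000  # ~48.8, i.e. roughly 50x fewer parameters
speed_ratio = 17 / 1.33                   # ~12.8, i.e. roughly 13x faster generation
```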

The quality trade-off is audible but not as dramatic as the parameter gap suggests. Kokoro produces clear, natural-sounding speech with good prosody. It lacks Voxtral’s multilingual support, but for English-only applications where speed matters, it’s remarkably competitive.

Use the “vs Kokoro 82M” tab in the benchmark below to listen to side-by-side pairs and decide for yourself.

Full Benchmark

The interactive benchmark below contains all samples from the local Voxtral (6-bit quantized) and Kokoro (82M bf16) tests. Use the tabs to explore voice comparisons, emotional range, stress tests, multilingual output, and raw performance data.

Interactive visualization: VoxtralBenchmark

Three Tiers of TTS

What emerges from this testing is a clear three-tier landscape for open text-to-speech in 2026:

|               | Voxtral API      | Voxtral Local (6-bit) | Kokoro 82M        |
|---------------|------------------|-----------------------|-------------------|
| Size          | Cloud            | 3.5 GB                | 330 MB            |
| Speed         | 3–9s per request | 1.33x real-time       | 17x real-time     |
| Voice cloning | Yes (zero-shot)  | No (presets only)     | No (presets only) |
| Languages     | 9                | 6                     | English-focused   |
| Privacy       | Cloud            | Fully local           | Fully local       |
| Cost          | API pricing      | Free                  | Free              |

For developers building voice interfaces, the choice depends on what matters most. Need voice cloning and maximum quality? Use the API. Need privacy and offline capability with good quality? Run the quantized model locally. Need speed above all else and English is sufficient? Kokoro at 82M parameters is hard to beat.

The fact that all three options exist, and that two of them run entirely on a laptop with no internet connection, represents a genuine shift in what’s accessible to individual developers and small teams.

Methodology

All local tests were run on an Apple M4 Pro with 24 GB unified memory, macOS Darwin 25.2.0, Python 3.14.

  • Voxtral API: voxtral-mini-tts-2603 via Mistral’s REST API. Voice cloned from a 43-second reference recording. Output: MP3.

  • Voxtral Local: mlx-community/Voxtral-4B-TTS-2603-mlx-6bit (6-bit quantized, ~3.5 GB) via mlx-audio library. Output: 24 kHz WAV, float32.

  • Kokoro: mlx-community/Kokoro-82M-bf16 (~330 MB, bf16) via mlx-audio library. Output: 24 kHz WAV, float32.

All local generation was single-threaded with no batching. Times include the full generate() call including tokenization and audio codec decoding. A 4-bit quantized Voxtral version (~2.5 GB) is also available but was not tested.

Written by Joseph Nordqvist

Founder & Editor-in-Chief at AI News Home

Editorial Transparency

This article was produced with the assistance of AI tools as part of our editorial workflow. All analysis, conclusions, and editorial decisions were made by human editors. Read our Editorial Guidelines
