JP TTS Engine Comparison

23 voices across 6 engines · Rate 1-10, review aggregates, pipeline tracks moves · M4 Pro 64GB

最近、新しい本を読み始めました。物語は静かな町から始まり、主人公はゆっくりと自分の過去と向き合っていきます。
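Test passage rendered by every engine. ~10s of natural narrative JP. English gloss: "Recently I started reading a new book. The story begins in a quiet town, and the protagonist slowly comes to face their own past."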

Click 1-10 under each card to rate. Ratings persist locally (localStorage). Sort/filter at top, but listening order stays shuffled until you opt in.

Aggregated from your ratings. Updates live.

Per-engine: what it is, how it works, license, Apple Silicon viability, the invocation that produced its sample.

Fish Audio S2 Pro
⚠ Fish Audio Research License (NC) · MLX 8-bit (mlx-audio)
Fish Audio · Dual-AR + RL · 10M+ hours · 80+ langs · 44.1kHz · NEW this push
How it works
Dual-Autoregressive (Dual-AR) architecture with RL alignment for natural prosody and emotional richness. Sub-word level prosody/emotion tags ([whisper], [excited], [angry]). Native multi-speaker and multi-turn dialogue. Trained on 10M+ hours, 80+ languages.
Strong
The "Fish sound" that started this whole hunt. Strong JP. Apple-Silicon-native via mlx-audio (port from fishaudio/s2-pro, 906 downloads, last updated Mar 2026). Zero-shot clone works.
Weak
The license is the original problem: it is non-commercial (NC). A personal blind test is fine; shipping Manaoke commercially is not. The legal route to commercial Fish output is the $11/mo Fish Audio Plus hosted tier.
Verdict
This is the ceiling reference. If something else in this A/B matches or beats it, you have a permissive answer to the Fish-quality question. If only Fish wins, the $132/yr subscription is the path of least resistance.
python fish_tts.py "..." --ref asset/voice_matsukaze.wav --ref-text "..."
Supertonic 3
MIT code · OpenRAIL weights · ONNX native (no diffusion)
Supertone Inc (HYBE, Korea) · 99M params · 31 langs · 44.1kHz · released 2026-04-29
How it works
4-ONNX pipeline: duration_predictor → text_encoder → vector_estimator → vocoder. 99M params total — tiny vs 500M-2B competitors. No diffusion = no Apple Silicon MLX-diffusion gap. ONNX Runtime is mature on every platform.
Strong
Tiny + fast = real-time-plus on Apple Silicon. 6 pre-trained voice styles (F1-3 / M1-3). Inline expression tags. 44.1kHz studio-grade. 9,294 stars in 6 months. The Korean developer treats JP as a top-tier language.
Weak / unknown
JP quality vs JP-native engines unknown. 99M is small — may underperform on pitch-accent micro-prosody. OpenRAIL has use-based restrictions (no illegal content, no impersonation without consent).
python supertonic_tts.py "..." --lang ja --voice F1 # F1-F3 / M1-M3
VoxCPM2
Apache-2.0 · MLX (2nd-class)
OpenBMB · 2B params · tokenizer-free DiT · 48kHz
How it works
Tokenizer-free diffusion-autoregressive: LocEnc → TSLM → RALM → LocDiT. AudioVAE V2 reconstructs 48kHz. Trained on 2M+ hours multilingual. Five modes including Voice Design (instruct), clone, ultimate.
Strong
Multilingual coverage incl. JP. Apache-2.0. Voice Design via natural-language prompt. 48kHz output.
Weak on Apple Silicon
Diffusion-on-MLX undercooked — maintainer redirects performance-hungry users to nanovllm-voxcpm (CUDA fork). Clone mode produces "fluent-foreigner accent" (HF discussion #14).
JP accent fix (maintainer-confirmed)
Switch clone → Voice Design (instruct), lower CFG 2.0→1.5, chunk to ≤2 sentences. From VoxCPM issue #222.
python voxcpm_tts.py "..." --mode default --instruct "落ち着いた大人の朗読、自然な日本語" --cfg-value 1.5
Irodori-TTS v3
MIT · MLX native
Aratako (Chihiro Arata) · 500M params · RF-DiT + DACVAE · 48kHz · JP-only
How it works
Flow-matching DiT with cross-attention text/caption conditioning. Three condition signals: text, caption (Voice Design), ref_audio (clone). Speaker Inversion (PR #18, Apr 2026) addresses cross-sentence drift.
Strong
Built JP-first: JP creators specifically praise it for reducing 「ペラペラ外国人っぽさ」 (the "fluent foreigner" sound). MIT. 40+ emoji emotion markers (😭/🤧/😄). Small + fast on Apple Silicon.
Weak
Kanji weak — pre-convert to hiragana. Long sentences (>30s) trigger skipping — chunk. Narrow dynamic range (issue #9, unanswered 3+ weeks). Cross-sentence voice drift (Wooly-Fluffy #136, may be fixed by Speaker Inversion).
python irodori_tts.py "..." --caption "落ち着いた大人の朗読、自然な日本語"
Qwen3-TTS
Apache-2.0 · MLX native
Alibaba · 24kHz · 18 langs incl JP
How it works
Multilingual TTS with built-in voice catalog (Ono_Anna for JP) plus zero-shot clone. Natural-language instruct prompt for emotion/style.
Verdict
"Decent, not Fish-tier." Reliable Apache baseline. Use it when a known-working, commercially clean engine is enough.
python qwen_tts.py "..." --voice Ono_Anna --instruct "穏やかな大人の朗読"
Google Chirp 3 HD ja-JP
Cloud (you own output) · N/A (cloud)
$30/M chars · 1M chars/mo free · 24kHz
How it works
Google's transformer-based neural TTS. JP voices follow ja-JP-Chirp3-HD-{Name}. Voices tested: Charon, Kore, Aoede, Zephyr, Achernar.
Strong
Polished out of the box. JP-native reviewer rates 8/10. JP punctuation controls pause naturally. You own output.
Joe-specific cost math
Manaoke catalog 100 songs × 500 chars ≈ 50K chars ≈ $1.50. Free tier covers most realistic use. DBZ episode ~60K chars ≈ $1.80/ep.
python google_chirp_tts.py "..." --voice ja-JP-Chirp3-HD-Charon
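
The cost math on this card can be reproduced in a few lines. A minimal sketch, assuming the $30 per 1M characters rate and the 1M chars/month free tier quoted above. The function name and the way the free tier is subtracted are illustrative assumptions, not Google's billing logic.

```python
PRICE_PER_MILLION_USD = 30.0        # from the card: $30/M chars
FREE_CHARS_PER_MONTH = 1_000_000    # from the card: 1M chars/mo free


def chirp_cost_usd(chars: int, apply_free_tier: bool = False) -> float:
    """Estimate the Chirp 3 HD cost of synthesizing `chars` characters in one month.

    Assumption: the free tier is simply subtracted from the month's total.
    """
    billable = max(0, chars - FREE_CHARS_PER_MONTH) if apply_free_tier else chars
    return round(billable * PRICE_PER_MILLION_USD / 1_000_000, 2)


catalog_chars = 100 * 500  # 100 songs x ~500 chars each
print(chirp_cost_usd(catalog_chars))                        # list price for the catalog
print(chirp_cost_usd(catalog_chars, apply_free_tier=True))  # same catalog inside the free tier
print(chirp_cost_usd(60_000))                               # one ~60K-char episode at list price
```

At list price the catalog and a single episode come out at the figures quoted above; with the free tier applied, either one alone costs nothing.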

Recommended next research moves

  1. 5 broad research agents (Twitter, YouTube, Reddit, Academic, Alt-routes) · done
  2. 4 GitHub deep-dive agents (Irodori, VoxCPM2, AivisSpeech, SBV2) · done
  3. Gap-fill agent (demos, HF, Discord, CN, awesome-lists) · done
  4. VoxCPM2 Voice Design + CFG 1.5 + chunked variants · done
  5. Irodori v3 via mlx-audio (just merged 2026-05-18) · done
  6. Supertonic 3 (Korean, 99M ONNX, all 6 voices) · done
  7. Fish Audio S2 Pro via mlx-audio (default + Matsukaze clone) · done
  8. Google Chirp 3 HD ja-JP, 5 voices · done
  9. Test Irodori v3 + Speaker Inversion knobs (speaker_kv_scale), the first public test of the post-PR-#18 configuration · next
  10. VOICEVOX (the JP-standard baseline, robotic but a useful floor) · next
  11. One month of Fish Audio Plus ($11) to compare hosted vs MLX-port quality · next
  12. Step-Audio-EditX (Apache-2.0, JA-capable): skipped this round (China vendor) but architecturally interesting · future
  13. If Supertonic 3 wins the A/B, look at training a custom voice via their Voice Builder · future

Quality levers by engine

VoxCPM2 — fix the accent

From HF discussion #14 and VoxCPM issue #222 (maintainer-confirmed):

  1. Voice Design mode (--mode default --instruct "...") not clone
  2. CFG 1.5 instead of default 2.0 (--cfg-value 1.5)
  3. Chunk to ≤2 sentences per call
  4. Avoid self-conditioning in Voice Design mode (produced 192s of hallucinated audio in our test)
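
The chunking and flag settings above can be sketched as follows. This is a rough illustration, not the pipeline's actual code: it splits on Japanese sentence-ending punctuation, groups at most two sentences per call, and builds argument lists for the `voxcpm_tts.py` wrapper using only the flags shown in its invocation. How the wrapper names its output files is unknown, so the sketch stops at building the commands.

```python
import re

# Zero-width split point right after Japanese sentence-ending punctuation,
# so each sentence keeps its own terminator.
_SENTENCE_END = re.compile(r"(?<=[。！？])")


def chunk_sentences(text: str, max_sentences: int = 2) -> list[str]:
    """Group `text` into chunks of at most `max_sentences` sentences."""
    sentences = [s.strip() for s in _SENTENCE_END.split(text) if s.strip()]
    return [
        "".join(sentences[i:i + max_sentences])
        for i in range(0, len(sentences), max_sentences)
    ]


def voxcpm_commands(text: str, instruct: str, cfg_value: float = 1.5) -> list[list[str]]:
    """Build one Voice Design invocation per chunk (argument lists only, nothing is run)."""
    return [
        ["python", "voxcpm_tts.py", chunk,
         "--mode", "default", "--instruct", instruct,
         "--cfg-value", str(cfg_value)]
        for chunk in chunk_sentences(text)
    ]


for cmd in voxcpm_commands("一。二。三！四？五。", "落ち着いた大人の朗読、自然な日本語"):
    print(cmd)
```

Each list can be passed to `subprocess.run` once the wrapper's output handling is known.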

Irodori v3 — play to strengths

  • Pre-convert unusual kanji to hiragana
  • Chunk anything >30s
  • Tune speaker_kv_scale for ref-adherence vs naturalness
  • Emoji emotion markers: 😭/🤧/😄
  • Accept narrow dynamic range — production-mix afterward
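
One way to apply the ">30s" rule before synthesis is to estimate duration from character count and pack whole sentences under a time budget. A minimal sketch, assuming roughly 5 characters per second (the 54-character test passage runs about 10s). The rate is a crude assumption, since kanji-heavy text reads longer per character, and a single sentence longer than the budget is passed through unsplit.

```python
import re

CHARS_PER_SECOND = 5.0  # assumption: 54-char test passage is roughly 10 s of audio
MAX_SECONDS = 30.0      # Irodori starts skipping beyond roughly this length


def estimate_seconds(text: str) -> float:
    """Very rough duration estimate from character count."""
    return len(text) / CHARS_PER_SECOND


def pack_by_duration(text: str, max_seconds: float = MAX_SECONDS) -> list[str]:
    """Greedily pack whole sentences into chunks estimated at <= max_seconds."""
    sentences = [s for s in re.split(r"(?<=[。！？])", text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and estimate_seconds(current + sentence) > max_seconds:
            chunks.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then go to `irodori_tts.py` as a separate call with the same `--caption`.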

Supertonic 3 — what to try

  • F1 Mina is the recommended default voice
  • Quality steps 5-12 (default 8); higher = better but slower (still real-time-plus)
  • Speed 0.7-2.0
  • Inline tags: <laugh>, <breath>, <sigh>, plus 7 more
  • For unknown-language text, pass lang="na"
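
The ranges above are easy to get wrong when sweeping settings, so a small validator can help. A sketch only: the field names `steps` and `speed` are hypothetical (the wrapper invocation shows only `--lang` and `--voice`), and the default speed of 1.0 is an assumption.

```python
from dataclasses import dataclass

VOICES = ("F1", "F2", "F3", "M1", "M2", "M3")  # the six pre-trained styles


@dataclass(frozen=True)
class SupertonicSettings:
    voice: str = "F1"    # recommended default voice
    steps: int = 8       # quality steps, allowed range 5-12
    speed: float = 1.0   # allowed range 0.7-2.0; the 1.0 default is assumed
    lang: str = "ja"     # "na" for unknown-language text

    def __post_init__(self) -> None:
        # Reject values outside the ranges listed above.
        if self.voice not in VOICES:
            raise ValueError(f"voice must be one of {VOICES}, got {self.voice!r}")
        if not 5 <= self.steps <= 12:
            raise ValueError(f"steps must be in 5-12, got {self.steps}")
        if not 0.7 <= self.speed <= 2.0:
            raise ValueError(f"speed must be in 0.7-2.0, got {self.speed}")


# Example sweep: every voice at the highest quality setting.
sweep = [SupertonicSettings(voice=v, steps=12) for v in VOICES]
print(len(sweep))
```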

Fish S2 Pro — research-only

  • License caps it to personal use only — no Manaoke commercial
  • Use the in-A/B Fish renders as your quality ceiling
  • If something here ties or beats it, ship that instead
  • If only Fish wins on your ear: $11/mo Plus seat is the legal commercial path
  • Sub-word emotion tags: [whisper], [excited], [angry], etc.

Google Chirp — commercial polish

  • Charon = warm narrator · Kore = neutral · Aoede = energetic
  • 1M chars/mo free — zero $ for most realistic use
  • JP punctuation (。 、 ・) controls pause length — no SSML needed
  • You own output

Optimal config per use case

Manaoke karaoke

Default: Google Chirp 3 HD Charon / Kore. You own the output; the whole catalog (~50K chars) costs about $1.50 at list price and fits inside the free tier.

Voice variety: VoxCPM2 Voice Design with character-instructs for non-narrator lines.

Avoid: AivisSpeech / SBV2 (transitive AGPL).

DBZ Narrator (personal)

Default: top-rated MIT/Apache engine from your blind A/B.

Personal viewing = AGPL is acceptable if needed, but Irodori (MIT) and Supertonic (MIT code, OpenRAIL weights) are cleaner defaults.

Mom's record narration

Default: Google Chirp 3 HD Charon — calm, polished, no fiddling. Personal use, no license worry either way.

If offline matters: top-rated local engine from A/B.

Silo / screenplay reads (offline)

Air-gapped Silo machine: no cloud. Local-only mandate.

Bet on top-rated local Apache/MIT engine. Chunk per-character with distinct captions/instructs.
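
The per-use-case rules above reduce to two filters (commercial use, offline-only) followed by "pick the top-rated survivor". A hedged sketch of that selection: the license and runtime flags are taken from the engine cards, treating OpenRAIL weights as commercially usable is an assumption (subject to its use-based restrictions), and the ratings in the usage example are placeholders, not A/B results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Engine:
    name: str
    license: str
    commercial_ok: bool
    runs_offline: bool


ENGINES = [
    Engine("Fish Audio S2 Pro", "Fish Audio Research License (NC)", False, True),
    Engine("Supertonic 3", "MIT code / OpenRAIL weights", True, True),  # assumption: commercial OK
    Engine("VoxCPM2", "Apache-2.0", True, True),
    Engine("Irodori-TTS v3", "MIT", True, True),
    Engine("Qwen3-TTS", "Apache-2.0", True, True),
    Engine("Google Chirp 3 HD", "Cloud (you own output)", True, False),
]


def eligible(ratings: dict[str, float], *, commercial: bool, offline: bool) -> list[str]:
    """Return engine names allowed for the use case, best-rated first."""
    pool = [
        e for e in ENGINES
        if (e.commercial_ok or not commercial) and (e.runs_offline or not offline)
    ]
    ranked = sorted(pool, key=lambda e: ratings.get(e.name, 0.0), reverse=True)
    return [e.name for e in ranked]


# Placeholder ratings for illustration only.
demo = {"Fish Audio S2 Pro": 9, "Google Chirp 3 HD": 8, "Irodori-TTS v3": 7}
print(eligible(demo, commercial=True, offline=False))  # Manaoke-style constraints
print(eligible(demo, commercial=False, offline=True))  # air-gapped personal use
```

Swapping in the real aggregated ratings turns this into the "top-rated engine from your blind A/B" rule used throughout this section.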

Architectural reality on Apple Silicon (2026)

Things that aren't changing soon.