vibe-voice

Build voice AI applications using Microsoft's VibeVoice, an open-source frontier family of models for voice synthesis, recognition, and real-time conversation. Use when: building voice assistants, adding TTS/STT to applications, creating real-time voice chat, or cloning voices.

#voice #tts #stt #speech #real-time
terminal-skills · v1.0.0
Works with: claude-code · openai-codex · gemini-cli · cursor
Trust Score: 93/100

  • Quality: 93/100 (best practices: 5 PASS · 1 WEAK)
  • Security: Passed (no known issues; content review + injection scan)
  • Impact: 4.13× (23% → 95% agent success, averaged across 2 eval scenarios)

Scored 5/13/2026 · skill v1.0.0
$
✓ Installed vibe-voice v1.0.0

Getting Started

  1. Install the skill using the command above
  2. Open your AI coding agent (Claude Code, Codex, Gemini CLI, or Cursor)
  3. Reference the skill in your prompt
  4. The AI will use the skill's capabilities automatically

Example Prompts

  • "Analyze the sales data in revenue.csv and identify trends"
  • "Create a visualization comparing Q1 vs Q2 performance metrics"

Documentation

Open-source frontier voice AI from Microsoft: a family of models for text-to-speech (TTS), automatic speech recognition (ASR), and real-time streaming synthesis.

GitHub: microsoft/VibeVoice

Overview

VibeVoice is Microsoft's open-source voice AI platform offering three model sizes: ASR (7B) for transcription with speaker diarization, TTS (1.5B) for multi-speaker long-form synthesis, and Realtime (0.5B) for low-latency streaming. It supports 50+ languages and runs on consumer GPUs.

Instructions

Installation

bash
pip install vibevoice
# Or clone for development
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice
pip install -e .
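
To verify the install, a quick import check (it uses the same class names as the examples below; adjust if the package layout differs):

bash
python -c "from vibevoice import VibeVoiceTTS; print('ok')"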

Hardware Requirements

  • ASR (7B): ~16GB VRAM (GPU recommended)
  • TTS (1.5B): ~6GB VRAM
  • Realtime (0.5B): ~2GB VRAM (runs on consumer GPUs)
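
Before loading a checkpoint, it can help to confirm it fits in free VRAM. A minimal sketch, assuming the models run on PyTorch with CUDA (the thresholds mirror the list above; fits_on_gpu is an illustrative helper, not part of VibeVoice):

python
import torch

# Approximate VRAM needs from the list above (GB); illustrative only
VRAM_GB = {
    "microsoft/VibeVoice-ASR": 16,           # ASR, 7B
    "microsoft/VibeVoice-1.5B": 6,           # TTS, 1.5B
    "microsoft/VibeVoice-Realtime-0.5B": 2,  # Realtime, 0.5B
}

def fits_on_gpu(model_id: str) -> bool:
    """Rough check that a checkpoint's working set fits in free VRAM."""
    if not torch.cuda.is_available():
        return False
    free_bytes, _total = torch.cuda.mem_get_info()
    return free_bytes / 1024**3 >= VRAM_GB[model_id]

print(fits_on_gpu("microsoft/VibeVoice-1.5B"))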

Text-to-Speech (TTS)

python
from vibevoice import VibeVoiceTTS

model = VibeVoiceTTS.from_pretrained("microsoft/VibeVoice-1.5B")

# Single speaker
audio = model.synthesize(
    text="Hello, welcome to the future of voice AI.",
    speaker="default"
)
audio.save("output.wav")

Multi-Speaker Conversation

Generate podcast-style audio with up to 4 distinct speakers:

python
conversation = [
    {"speaker": "host", "text": "Welcome to the show! Today we're discussing AI."},
    {"speaker": "guest1", "text": "Thanks for having me. I'm excited to dive in."},
    {"speaker": "host", "text": "Let's start with the biggest trends you're seeing."},
    {"speaker": "guest2", "text": "I think voice AI is the most underrated development."},
]

audio = model.synthesize_conversation(conversation)
audio.save("podcast.wav")  # Up to 90 minutes in a single pass

Real-Time Streaming TTS

python
from vibevoice import VibeVoiceRealtime

model = VibeVoiceRealtime.from_pretrained("microsoft/VibeVoice-Realtime-0.5B")

# play_audio is a placeholder for your playback routine (a sketch follows)
for audio_chunk in model.stream("This is being generated in real time."):
    play_audio(audio_chunk)
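
One way to wire up play_audio, assuming chunks arrive as mono float32 NumPy arrays at a known sample rate (both assumptions; check the model's actual output format), is a blocking sounddevice stream:

python
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 24_000  # assumption: confirm the model's actual output rate

stream = sd.OutputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32")
stream.start()

def play_audio(chunk: np.ndarray) -> None:
    """Write one synthesized chunk to the default audio device (blocking)."""
    stream.write(chunk)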

Automatic Speech Recognition (ASR)

python
from vibevoice import VibeVoiceASR

model = VibeVoiceASR.from_pretrained("microsoft/VibeVoice-ASR")

# Basic transcription
result = model.transcribe("meeting_recording.wav")
print(result.text)

# Rich transcription with diarization
result = model.transcribe("meeting.wav", diarize=True, timestamps=True)
for segment in result.segments:
    print(f"[{segment.start:.1f}s - {segment.end:.1f}s] "
          f"Speaker {segment.speaker}: {segment.text}")

Custom Hotwords

python
result = model.transcribe(
    "medical_consultation.wav",
    hotwords=["Lisinopril", "Metformin", "HbA1c", "systolic"]
)
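
Hotwords bias decoding toward the listed terms, which helps with names and jargon the model rarely sees in training data.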

Examples

Example 1: Build a Voice Assistant

python
from vibevoice import VibeVoiceASR, VibeVoiceRealtime

asr = VibeVoiceASR.from_pretrained("microsoft/VibeVoice-ASR")
tts = VibeVoiceRealtime.from_pretrained("microsoft/VibeVoice-Realtime-0.5B")

# Listen → Transcribe → Respond → Speak
user_text = asr.transcribe("user_input.wav").text
response = generate_response(user_text)  # Your LLM call
for chunk in tts.stream(response):
    play_audio(chunk)
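
generate_response is left to you. A minimal placeholder, assuming an OpenAI-compatible chat client (swap in whatever LLM you actually use; the model name here is illustrative):

python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_response(user_text: str) -> str:
    """Single-turn reply; replace with your own model and prompt scaffold."""
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": user_text}],
    )
    return reply.choices[0].message.content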

Example 2: Meeting Transcription with Speaker Labels

python
result = asr.transcribe("team_standup.wav", diarize=True, timestamps=True)
for segment in result.segments:
    print(f"[{segment.start:.1f}s] Speaker {segment.speaker}: {segment.text}")

# Output:
# [0.0s] Speaker 1: Let's review the Q3 numbers.
# [3.5s] Speaker 2: Revenue is up 15% from last quarter.
# [8.4s] Speaker 1: That's great. What about customer acquisition?
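
To turn diarized segments into subtitles, a small SRT writer is enough; this sketch relies only on the segment fields shown above (start, end, speaker, text):

python
def to_srt(segments) -> str:
    """Render diarized segments as an SRT subtitle string."""
    def ts(seconds: float) -> str:
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks += [str(i), f"{ts(seg.start)} --> {ts(seg.end)}",
                   f"Speaker {seg.speaker}: {seg.text}", ""]
    return "\n".join(blocks)

with open("team_standup.srt", "w") as f:
    f.write(to_srt(result.segments))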

Guidelines

  • Use the Realtime (0.5B) model for voice assistants where latency matters
  • The ASR model handles up to 60 minutes of audio in a single pass
  • TTS supports up to 90 minutes of multi-speaker audio generation
  • Use custom hotwords for domain-specific terms (medical, legal, technical)
  • For production, consider vLLM for faster ASR inference
  • Multilingual voices are experimental — test quality before deploying
  • 7.5 Hz frame rate enables efficient long-sequence processing
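
The last point is easy to sanity-check: at 7.5 frames per second, even the longest supported outputs stay in a modest sequence-length range (back-of-envelope arithmetic only):

python
FRAME_RATE_HZ = 7.5

for label, minutes in [("ASR max (60 min)", 60), ("TTS max (90 min)", 90)]:
    frames = minutes * 60 * FRAME_RATE_HZ
    print(f"{label}: {frames:,.0f} frames")

# ASR max (60 min): 27,000 frames
# TTS max (90 min): 40,500 frames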

Information

  • Version: 1.0.0
  • Author: terminal-skills
  • Category: Data & AI
  • License: MIT