Speech Generation API

Beta Access Required: The Speech API requires whitelisted access.

To request access, email sales@demeterics.com with:

Subject: "Feature Access Request"

Feature name: "Text-to-Speech (TTS)"

For multi-speaker podcast generation, also request: "TTS Multi-Speaker"

The Demeterics Speech API provides a unified Text-to-Speech (TTS) interface across multiple providers. Convert text to natural-sounding audio through a single API, with usage and costs tracked automatically and generated audio stored for analysis.

Overview

Base URL: https://api.demeterics.com/tts/v1

Features:

Unified API: Single endpoint for OpenAI, ElevenLabs, Google Cloud TTS, Murf.ai, Groq Orpheus, and Google Gemini
Multi-Speaker: Generate podcasts and dialogues with up to 2 speakers (Gemini)
Auto-tracking: Every request logged to BigQuery with full observability
Audio Storage: Generated audio stored in GCS with 15-minute signed URLs
BYOK Support: Use your own provider API keys with dual-key authentication
Cost Control: Automatic credit billing with a 15% service fee on managed keys or a 10% fee with BYOK

Authentication

Managed Keys (Default)

Use only your Demeterics API key:

curl -X POST https://api.demeterics.com/tts/v1/generate \
  -H "Authorization: Bearer dmt_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{...}'

Bring Your Own Key (BYOK)

Use the dual-key format to provide your own provider API key:

curl -X POST https://api.demeterics.com/tts/v1/generate \
  -H "Authorization: Bearer dmt_your_api_key;sk-your_openai_key" \
  -H "Content-Type: application/json" \
  -d '{...}'

The format is: [demeterics_api_key];[provider_api_key]
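The two header shapes differ only in the token after `Bearer`. A minimal Python sketch of building either one; the helper name `build_auth_header` is hypothetical and not part of any Demeterics SDK.

```python
from typing import Dict, Optional


def build_auth_header(demeterics_key: str, provider_key: Optional[str] = None) -> Dict[str, str]:
    """Build the Authorization header for managed keys or BYOK (dual-key) mode.

    Hypothetical helper: managed mode sends only the Demeterics key; BYOK mode
    joins the Demeterics key and the provider key with ';'.
    """
    if ";" in demeterics_key:
        # ';' separates the two keys, so it cannot appear inside the first one.
        raise ValueError("the Demeterics key must not contain ';'")
    token = f"{demeterics_key};{provider_key}" if provider_key else demeterics_key
    return {"Authorization": f"Bearer {token}"}


print(build_auth_header("dmt_your_api_key"))
# {'Authorization': 'Bearer dmt_your_api_key'}
print(build_auth_header("dmt_your_api_key", "sk-your_openai_key"))
# {'Authorization': 'Bearer dmt_your_api_key;sk-your_openai_key'}
```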

BYOK Benefits:

10% service fee instead of 15%
Use your own rate limits and quotas
Provider costs billed directly to your provider account

Endpoints

Generate Speech

POST /tts/v1/generate

Convert text to speech audio.

Request Body:

Field	Type	Required	Description
`provider`	string	Yes	Target provider: `openai`, `elevenlabs`, `google`, `murf`, `groq`, `gemini`
`model`	string	No	TTS model (provider-specific)
`voice`	string	No	Voice identifier (single speaker)
`input`	string	Yes	Text to convert (maximum length varies by provider; see Providers)
`format`	string	No	Output format: `mp3`, `wav`, `opus`, `flac`
`speed`	float	No	Playback speed: 0.25-4.0 (default: 1.0)
`language`	string	No	Language code (ISO 639-1)
`speakers`	array	No	Multi-speaker config (Gemini only, max 2)
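The constraints in the table can be checked client-side before a request is sent. The sketch below assumes only what the table states (the six provider values, the 0.25-4.0 speed range, and Gemini-only `speakers` with at most 2 entries); `build_generate_request` and its `fmt` parameter name are hypothetical, and per-provider input limits are not checked here.

```python
from typing import Any, Dict, List, Optional

# Provider values accepted by the `provider` field, per the table above.
PROVIDERS = {"openai", "elevenlabs", "google", "murf", "groq", "gemini"}


def build_generate_request(
    provider: str,
    text: str,
    *,
    model: Optional[str] = None,
    voice: Optional[str] = None,
    fmt: Optional[str] = None,
    speed: Optional[float] = None,
    language: Optional[str] = None,
    speakers: Optional[List[Dict[str, str]]] = None,
) -> Dict[str, Any]:
    """Assemble a POST /tts/v1/generate body, checking the documented constraints."""
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    if not text:
        raise ValueError("input text is required")
    if speed is not None and not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    if speakers is not None:
        if provider != "gemini":
            raise ValueError("speakers is only supported with provider 'gemini'")
        if len(speakers) > 2:
            raise ValueError("at most 2 speakers are supported")

    body: Dict[str, Any] = {"provider": provider, "input": text}
    optional = {
        "model": model,
        "voice": voice,
        "format": fmt,
        "speed": speed,
        "language": language,
        "speakers": speakers,
    }
    # Omit unset optional fields so the API's defaults apply.
    body.update({key: value for key, value in optional.items() if value is not None})
    return body


print(build_generate_request("openai", "Hello, welcome to Demeterics!",
                             model="tts-1", voice="alloy", fmt="mp3"))
```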

Example Request:

curl -X POST https://api.demeterics.com/tts/v1/generate \
  -H "Authorization: Bearer dmt_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "openai",
    "model": "tts-1",
    "voice": "alloy",
    "input": "Hello, welcome to Demeterics!",
    "format": "mp3"
  }'

Response:

{
  "id": "01JARV4HZ6XPQMWVCS9N1GKEFD",
  "provider": "openai",
  "model": "tts-1",
  "voice": "alloy",
  "audio_url": "https://storage.googleapis.com/demeterics-data/tts/...",
  "duration_seconds": 2.3,
  "cost_usd": 0.0005,
  "usage": {
    "input_chars": 29
  },
  "metadata": {
    "format": "mp3",
    "sample_rate": 24000,
    "channels": 1,
    "generation_ms": 450
  }
}
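Because `audio_url` is a signed GCS URL that stays valid for about 15 minutes, it is usually worth downloading the file right after generation. A standard-library sketch follows; `save_audio` is a hypothetical helper, and the injectable `opener` parameter exists only so the download can be stubbed in tests.

```python
import urllib.request
from pathlib import Path
from typing import Any, Callable, Dict


def save_audio(
    response_json: Dict[str, Any],
    dest: str,
    opener: Callable[..., Any] = urllib.request.urlopen,
) -> int:
    """Download the audio behind audio_url to dest and return the byte count.

    The signed URL expires roughly 15 minutes after generation, so call this
    promptly; a request against an expired URL will be rejected.
    """
    url = response_json["audio_url"]
    with opener(url, timeout=30) as resp:
        data = resp.read()
    Path(dest).write_bytes(data)
    return len(data)
```

For example, `save_audio(response.json(), "welcome.mp3")` right after the generate call stores a local copy that outlives the signed URL.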

List Voices

GET /tts/v1/voices?provider={provider}

List available voices for a provider.

Query Parameters:

Parameter	Type	Required	Description
`provider`	string	Yes	Provider: `openai`, `elevenlabs`, `google`, `murf`

Example Request:

curl -X GET "https://api.demeterics.com/tts/v1/voices?provider=openai" \
  -H "Authorization: Bearer dmt_your_api_key"

Response:

{
  "voices": [
    {
      "id": "alloy",
      "name": "Alloy",
      "description": "Neutral and balanced",
      "gender": "neutral"
    },
    {
      "id": "echo",
      "name": "Echo",
      "description": "Clear and articulate",
      "gender": "male"
    }
  ]
}
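A sketch of calling this endpoint with the Python standard library and narrowing the result by the `gender` field shown in the response. `list_voices` and `pick_voice_ids` are hypothetical helper names, and the sketch assumes every voice object carries an `id` (other fields are read with `.get`).

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, List, Optional

BASE_URL = "https://api.demeterics.com/tts/v1"


def list_voices(provider: str, api_key: str) -> List[Dict[str, Any]]:
    """Call GET /tts/v1/voices for one provider and return its "voices" array."""
    query = urllib.parse.urlencode({"provider": provider})
    request = urllib.request.Request(
        f"{BASE_URL}/voices?{query}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)["voices"]


def pick_voice_ids(voices: List[Dict[str, Any]], gender: Optional[str] = None) -> List[str]:
    """Return voice IDs, optionally keeping only those whose gender field matches."""
    return [v["id"] for v in voices if gender is None or v.get("gender") == gender]


sample = [
    {"id": "alloy", "name": "Alloy", "gender": "neutral"},
    {"id": "echo", "name": "Echo", "gender": "male"},
]
print(pick_voice_ids(sample, gender="male"))  # ['echo']
```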

Providers

OpenAI

Models:

gpt-4o-mini-tts - Latest model with better steerability (~85% cheaper than ElevenLabs)
tts-1 - Fast and efficient (legacy)
tts-1-hd - Higher quality (legacy)

Voices:

alloy - Neutral and balanced
ash - Warm and conversational
ballad - Soft and melodic
coral - Friendly and approachable
echo - Clear and articulate
fable - Expressive and dynamic
onyx - Deep and authoritative
nova - Friendly and warm
sage - Calm and measured
shimmer - Bright and optimistic
verse - Dynamic and engaging

Supported Formats: mp3, opus, aac, flac, wav, pcm

Max Characters: 4,096

ElevenLabs

Models:

eleven_v3 - Most expressive model — human-like speech with high emotional range, 70+ languages, supports vocal directions (recommended for high-quality content)
eleven_multilingual_v2 - Premium quality, 29 languages, 10K char limit
eleven_turbo_v2_5 - High quality + speed (~250-300ms), 32 languages, 40K char limit
eleven_turbo_v2 - Fast, English only
eleven_flash_v2_5 - Ultra-fast (~75ms), 32 languages, 50% lower cost — great for drafts and real-time
eleven_monolingual_v1 - Deprecated February 28, 2026; migrate to eleven_v3

Vocal Directions (eleven_v3):

ElevenLabs v3 supports inline audio tags to direct the performance style:

[cheerful] Welcome to our channel!
[whisper] But here's a secret...
[dramatic] Everything is about to change.
[sarcastic] Oh sure, that went exactly as planned.

Available directions include: [cheerful], [whisper], [dramatic], [sarcastic], [excited], [friendly], [warm], [professionally], [authoritatively], [breathy], and more.

Voices: Over 100 pre-made voices plus custom voice cloning

Supported Formats: mp3, pcm, ulaw

Max Characters: 5,000 (eleven_v3), 10,000 (multilingual_v2), 40,000 (turbo/flash v2.5)

Google Cloud TTS

Models:

standard - Basic quality
neural2 - Neural network based
wavenet - High quality WaveNet
journey - Conversational style
studio - Professional quality

Voices: 220+ voices across 40+ languages

Supported Formats: mp3, wav, ogg

Max Characters: 5,000

Murf.ai

Models:

GEN2 - Latest generation, highest quality ($0.03/1000 chars)
FALCON - Fast streaming model ($0.01/1000 chars) ← Recommended for Voice-to-Voice

Voices: 120+ voices across 20+ languages including:

en-US-natalie - Natalie (US English, female) — clear, professional
en-US-samantha - Samantha (US English, female) — warm, conversational
en-US-terrell - Terrell (US English, male) — deep, authoritative
en-US-wayne - Wayne (US English, male) — friendly, casual
en-UK-hazel - Hazel (UK English, female) — British accent
en-UK-ruby - Ruby (UK English, female) — British, professional
en-UK-maisie - Maisie (UK English, female) — British, youthful
en-AU-lincoln - Lincoln (Australian, male) — Australian accent

Supported Formats: mp3, wav, flac, ogg, pcm, alaw, ulaw

Max Characters: 10,000

Features:

Voice styles (conversational, newscast, etc.)
Speed and pitch control
Multi-language support with native locales
Streaming support via Murf's /v1/speech/stream endpoint (used internally by Demeterics, not exposed directly)

Murf Falcon Streaming (Widget Integration)

The FALCON model supports real-time audio streaming, used internally by the AI Chat Widget's Voice-to-Voice feature.

Note: Murf Falcon streaming is not exposed as a standalone Demeterics API endpoint. It's used automatically when Voice-to-Voice is enabled on your AI Chat Widget. For direct TTS generation, use POST /tts/v1/generate with provider: "murf" and model: "FALCON".

How Voice-to-Voice Works:

When Voice-to-Voice is enabled, the widget uses a two-phase streaming architecture:

Phase 1: POST /api/widget/voice
- Uploads user audio recording
- Returns: transcript, response text, and stream_token
- Text is displayed in the widget immediately
Phase 2: GET /api/widget/voice/stream?token={stream_token}
- Server-Sent Events (SSE) stream audio chunks
- Web Audio API plays chunks as they arrive
- ~130ms time-to-first-audio (TTFA)
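The Phase 2 stream uses Server-Sent Events framing: `event:` and `data:` lines, with a blank line ending each event. These widget endpoints are internal and the payload inside each event (for example, how audio chunks are encoded) is not documented here, so the sketch below only illustrates generic SSE parsing as described in the HTML specification. `parse_sse` is a hypothetical helper, and it ignores the `id:` and `retry:` fields.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_type, data) pairs from Server-Sent Events text lines."""
    event_type, data_lines = "message", []  # type: str, List[str]
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the buffered event (if it has data).
            if data_lines:
                yield event_type, "\n".join(data_lines)
            event_type, data_lines = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # one leading space after the colon is dropped
            if field == "event":
                event_type = value
            elif field == "data":
                data_lines.append(value)
    # Per the spec, an event left unterminated at end of stream is discarded.
```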

Additional streaming endpoints (internal use):

GET /api/widget/voice/stream/mp3 — MP3 format stream
GET /api/widget/voice/stream/raw — Raw audio stream
WS /api/widget/voice/ws — WebSocket streaming
WS /api/widget/voice/live — Full-duplex WebSocket

Performance:

~130ms time-to-first-audio
WAV format at 24kHz mono
Optimized for low-latency conversational AI

Cost: $0.01 per 1,000 characters (billed when stream is consumed)

Google Gemini TTS

Beta Access: Gemini TTS with multi-speaker support is available only to whitelisted users. To request access, follow the instructions at the top of this page and ask for "TTS Multi-Speaker".

Models:

gemini-2.5-flash-preview-tts - Fast, cost-effective (default)
gemini-2.5-pro-preview-tts - Higher quality

Voices (30 prebuilt voices):

Puck - Upbeat
Kore - Firm
Charon - Informative
Zephyr - Bright
Fenrir - Excitable
Leda - Youthful
Aoede - Breezy
Sulafat - Warm
Achird - Friendly
And 21 more...

Supported Formats: wav

Max Characters: 8,000

Features:

Multi-speaker support: Up to 2 speakers with different voices
30 prebuilt voice options
Ideal for podcasts, dialogues, and conversational content

Multi-Speaker Mode (Podcasts & Dialogues)

Generate conversational audio with up to 2 distinct speakers, each with their own voice. Perfect for:

Podcasts with host and guest
Dialogues between characters
Interview-style content
Educational back-and-forth explanations

Request Body (Multi-Speaker):

Field	Type	Required	Description
`provider`	string	Yes	Must be `gemini`
`model`	string	No	`gemini-2.5-flash-preview-tts` (default)
`input`	string	Yes	Dialogue with speaker labels
`speakers`	array	Yes	Speaker-to-voice mapping (max 2)
`format`	string	No	Output format (default: `wav`)

Speaker Configuration:

Each speaker object has:

Field	Type	Required	Description
`name`	string	Yes	Speaker label; must exactly match the label used in `input` (e.g., `Host`)
`voice`	string	Yes	Voice ID (e.g., `Puck`, `Kore`)
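Because each `name` must match a label in `input`, it can help to generate both from one list of turns. A minimal sketch, assuming the request shape shown in this section; `build_dialogue_request` is a hypothetical helper.

```python
from typing import Any, Dict, Sequence, Tuple


def build_dialogue_request(
    turns: Sequence[Tuple[str, str]],
    voices: Dict[str, str],
    model: str = "gemini-2.5-flash-preview-tts",
) -> Dict[str, Any]:
    """Build a multi-speaker Gemini request from (speaker, line) pairs.

    Each turn becomes a "Speaker: line" row, and every label used in the
    dialogue gets an entry in `speakers`, so labels and names always match.
    """
    # Unique speaker labels in order of first appearance.
    labels = list(dict.fromkeys(speaker for speaker, _ in turns))
    if len(labels) > 2:
        raise ValueError(f"Gemini supports at most 2 speakers, got {labels}")
    missing = [label for label in labels if label not in voices]
    if missing:
        raise ValueError(f"no voice assigned for: {missing}")
    return {
        "provider": "gemini",
        "model": model,
        "input": "\n".join(f"{speaker}: {line}" for speaker, line in turns),
        "speakers": [{"name": label, "voice": voices[label]} for label in labels],
        "format": "wav",
    }


body = build_dialogue_request(
    [("Host", "Welcome to the AI Insights podcast!"), ("Guest", "Thanks for having me!")],
    {"Host": "Puck", "Guest": "Kore"},
)
```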

Example: Podcast Generation

curl -X POST https://api.demeterics.com/tts/v1/generate \
  -H "Authorization: Bearer dmt_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "gemini",
    "model": "gemini-2.5-flash-preview-tts",
    "input": "Host: Welcome to the AI Insights podcast! Today we explore the future of voice AI.\nGuest: Thanks for having me! Voice technology is transforming how we interact with machines.",
    "speakers": [
      {"name": "Host", "voice": "Puck"},
      {"name": "Guest", "voice": "Kore"}
    ],
    "format": "wav"
  }'

Response:

{
  "id": "tts_01JARV4HZ6XPQMWVCS9N1GKEFD",
  "provider": "gemini",
  "model": "gemini-2.5-flash-preview-tts",
  "audio_url": "https://storage.googleapis.com/demeterics-data/tts/...",
  "duration_seconds": 8.5,
  "cost_usd": 0.00125,
  "usage": {
    "input_chars": 156
  }
}

Python Example:

import requests

response = requests.post(
    "https://api.demeterics.com/tts/v1/generate",
    headers={"Authorization": "Bearer dmt_your_api_key"},
    json={
        "provider": "gemini",
        "input": """Host: What's the biggest challenge in AI today?
Guest: I'd say it's making AI accessible to everyone, not just tech companies.""",
        "speakers": [
            {"name": "Host", "voice": "Puck"},
            {"name": "Guest", "voice": "Kore"}
        ]
    },
    timeout=120,  # multi-speaker generation can take a while
)
response.raise_for_status()  # surface 4xx/5xx errors instead of a KeyError below

audio_url = response.json()["audio_url"]
print(f"Podcast audio: {audio_url}")

Best Practices for Multi-Speaker:

Consistent labels: Use the same label for each speaker throughout, matching its `name` in `speakers` (e.g., always Host:, not Host: in one line and Announcer: in another)
Clear formatting: Start each line with Speaker: followed by their dialogue
Voice pairing: Choose voices with distinct characteristics (e.g., upbeat + firm)
Keep turns short: Shorter dialogue turns sound more natural
Max 2 speakers: Gemini currently supports up to 2 distinct speakers
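Hand-written scripts can be checked against these practices before sending. The sketch below flags lines without a `Speaker:` label, labels that do not appear in `speakers`, and configurations with more than 2 speakers. `check_dialogue` is hypothetical, and it treats everything before a line's first colon as the label, which is an assumption about how labels are parsed.

```python
import re
from typing import Dict, List

# "Label:" at the start of a line; the label is everything before the first colon.
LABEL = re.compile(r"^\s*([^:]+?)\s*:")


def check_dialogue(text: str, speakers: List[Dict[str, str]]) -> List[str]:
    """Return human-readable problems; an empty list means the script looks valid."""
    problems: List[str] = []
    names = {speaker["name"] for speaker in speakers}
    if len(speakers) > 2:
        problems.append(f"{len(speakers)} speakers configured; Gemini supports at most 2")
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are ignored
        match = LABEL.match(line)
        if match is None:
            problems.append(f"line {number}: no 'Speaker:' label")
        elif match.group(1) not in names:
            problems.append(f"line {number}: label {match.group(1)!r} is not in speakers")
    return problems
```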

Groq Orpheus (Canopy Labs)

Migration Notice: PlayAI TTS models (playai-tts, playai-tts-arabic) are deprecated and will be decommissioned on December 31, 2025. Please migrate to canopylabs/orpheus-v1-english.

Models:

canopylabs/orpheus-v1-english - Expressive English TTS with vocal direction support

Voices (8 voices):

tara - Female, conversational (default)
leah - Female, professional
jess - Female, friendly
leo - Male, conversational
dan - Male, professional
mia - Female, warm
zac - Male, casual
zoe - Female, clear

Supported Formats: wav only

Max Characters: 200 per request
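At 200 characters per request, longer Groq scripts have to be split across several calls, and the same applies to any provider's limit. A sketch that packs whole sentences into chunks up to a given limit, using the per-request limits listed on this page; `chunk_text` and `MAX_CHARS` are hypothetical names, and a sentence longer than the limit is cut at the character limit (possibly mid-word).

```python
import re
from typing import List

# Per-request character limits as listed on this page (ElevenLabs depends on the model).
MAX_CHARS = {"openai": 4096, "google": 5000, "murf": 10000, "groq": 200, "gemini": 8000}


def chunk_text(text: str, limit: int) -> List[str]:
    """Split text into chunks of at most `limit` chars, preferring sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that is longer than the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own `input`, for example `for part in chunk_text(script, MAX_CHARS["groq"])`, and the resulting WAV files concatenated in order.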

Features:

Vocal Directions: Control speech style with bracketed commands:
- Conversational: [cheerful], [friendly], [casual], [warm]
- Professional: [professionally], [authoritatively], [formally]
- Expressive: [whisper], [excited], [dramatic], [deadpan], [sarcastic]
- Vocal qualities: [gravelly whisper], [rapid babbling], [singsong], [breathy]
Fast generation via Groq infrastructure
Direction density: more directions produce a more expressive delivery; few or no directions produce a natural, casual one
56% cheaper than PlayAI ($22/1M chars vs $50/1M chars)

Pricing

Managed Keys

Character-based pricing, including the 15% service fee:

Provider	Model	Cost per 1M chars (USD)
OpenAI	gpt-4o-mini-tts	$0.69
OpenAI	tts-1	$17.25
OpenAI	tts-1-hd	$34.50
ElevenLabs	eleven_v3	$345.00
ElevenLabs	eleven_multilingual_v2	$345.00
ElevenLabs	eleven_turbo_v2_5	$172.50
ElevenLabs	eleven_flash_v2_5	$172.50
Google	wavenet	$18.40
Google	neural2	$18.40
Google	standard	$4.60
Murf	GEN2	$27.60
Murf	FALCON	$23.00
Groq	canopylabs/orpheus-v1-english	$25.30
Gemini	gemini-2.5-flash-preview-tts	$11.50
Gemini	gemini-2.5-pro-preview-tts	$57.50

BYOK

Demeterics charges a 10% service fee on top of provider costs. The provider costs themselves are billed directly to your provider account.
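The two fee models can be compared with a little arithmetic. A sketch using a subset of the managed prices above; it assumes the 10% BYOK fee is computed on the provider's list price (managed price divided by 1.15), which this page does not state explicitly, and `estimate_cost` is a hypothetical helper.

```python
from typing import Dict

# Managed-key prices in USD per 1M characters, copied from the table above
# (a subset; these already include the 15% managed service fee).
MANAGED_PRICE_PER_M = {
    ("openai", "gpt-4o-mini-tts"): 0.69,
    ("openai", "tts-1"): 17.25,
    ("elevenlabs", "eleven_v3"): 345.00,
    ("google", "standard"): 4.60,
    ("gemini", "gemini-2.5-flash-preview-tts"): 11.50,
}
MANAGED_FEE = 0.15
BYOK_FEE = 0.10


def estimate_cost(provider: str, model: str, chars: int, byok: bool = False) -> Dict[str, float]:
    """Estimate the cost of `chars` characters under managed keys or BYOK."""
    managed_total = MANAGED_PRICE_PER_M[(provider, model)] * chars / 1_000_000
    provider_cost = managed_total / (1 + MANAGED_FEE)  # strip the 15% fee back out
    if byok:
        fee = provider_cost * BYOK_FEE
        # The provider bills provider_cost to you directly; Demeterics bills only the fee.
        return {"provider_cost": provider_cost, "demeterics_charge": fee}
    return {"provider_cost": provider_cost, "demeterics_charge": managed_total}
```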

Error Handling

Error Response Format:

{
  "error": {
    "type": "invalid_request",
    "message": "Input text exceeds maximum length",
    "code": "text_too_long"
  }
}

Common Error Codes:

Code	HTTP Status	Description
`invalid_provider`	400	Unknown provider specified
`invalid_voice`	400	Voice not available for provider
`text_too_long`	400	Input exceeds provider limit
`insufficient_credits`	402	Not enough credits
`provider_error`	502	Provider API failed
`rate_limited`	429	Too many requests
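A sketch that pulls `code` and `message` out of the error format above and marks the transient cases (`rate_limited`/429 and `provider_error`/502) as retryable. `parse_error` is hypothetical; when the body is not in the documented format it falls back to `http_<status>` and the raw body.

```python
import json
from typing import Tuple

RETRYABLE_CODES = {"rate_limited", "provider_error"}
RETRYABLE_STATUS = {429, 502}


def parse_error(status: int, body: str) -> Tuple[str, str, bool]:
    """Return (code, message, retryable) for an error response."""
    try:
        payload = json.loads(body)
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        payload = {}
    err = payload.get("error") if isinstance(payload, dict) else None
    if not isinstance(err, dict):
        err = {}
    code = err.get("code", f"http_{status}")
    message = err.get("message", body)
    retryable = code in RETRYABLE_CODES or status in RETRYABLE_STATUS
    return code, message, retryable
```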

Data Tracking

Every speech generation is automatically tracked in BigQuery with:

Transaction ID (ULID)
User and API key identifiers
Provider, model, and voice used
Input character count and text hash (privacy-safe)
Audio duration and format
GCS storage path
Cost breakdown (provider cost, service fee, total)
Latency metrics
Error information (if failed)

Query your speech generations:

SELECT
  transaction_id,
  provider,
  model,
  tts.voice,
  tts.input_chars,
  tts.duration_sec,
  total_cost
FROM `demeterics.demeterics.interactions`
WHERE interaction_type = 'tts'
  AND user_id = @user_id
  AND timing.question_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY timing.question_time DESC
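The query uses a named parameter, `@user_id`, which has to be bound when the job runs. A sketch with the `google-cloud-bigquery` client library, assuming you have BigQuery access to the `demeterics.demeterics.interactions` table and that `user_id` is a STRING column; `run_tts_query` is a hypothetical helper that takes the SQL above as its `sql` argument.

```python
from typing import Any, List


def run_tts_query(sql: str, user_id: str) -> List[Any]:
    """Run a parameterized BigQuery query, binding @user_id, and return its rows."""
    # Imported lazily so this module loads even without google-cloud-bigquery.
    from google.cloud import bigquery

    client = bigquery.Client()  # uses Application Default Credentials
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("user_id", "STRING", user_id)]
    )
    # result() waits for the job to finish and iterates over the result rows.
    return list(client.query(sql, job_config=job_config).result())
```

Rows support key access, so `row["provider"]` and `row["total_cost"]` read the selected columns.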

SDK Support

Python

import requests

response = requests.post(
    "https://api.demeterics.com/tts/v1/generate",
    headers={"Authorization": "Bearer dmt_your_api_key"},
    json={
        "provider": "openai",
        "voice": "alloy",
        "input": "Hello, world!",
        "format": "mp3"
    },
    timeout=60,
)
response.raise_for_status()  # raise on 4xx/5xx error responses

audio_url = response.json()["audio_url"]

Node.js

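// Run inside an async function or an ES module (top-level await)
const response = await fetch("https://api.demeterics.com/tts/v1/generate", {
  method: "POST",
  headers: {
    "Authorization": "Bearer dmt_your_api_key",
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    provider: "openai",
    voice: "alloy",
    input: "Hello, world!",
    format: "mp3"
  })
});

// fetch does not reject on HTTP error statuses, so check explicitly
if (!response.ok) {
  throw new Error(`TTS request failed: ${response.status} ${await response.text()}`);
}

const { audio_url } = await response.json();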

Best Practices

Choose the right provider: OpenAI for speed, ElevenLabs eleven_v3 for highest quality (YouTube, podcasts), ElevenLabs eleven_flash_v2_5 for real-time, Google for language coverage
Cache audio: Store frequently-used audio locally to reduce API calls
Use appropriate formats: MP3 for web, WAV for editing, Opus for streaming
Monitor costs: Track usage in your Demeterics dashboard
Handle errors gracefully: Implement retry logic with exponential backoff
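A standard-library sketch of that retry loop, retrying only the transient errors from the error table (429 `rate_limited` and 502 `provider_error`) with exponential backoff and full jitter. `generate_with_retry` and `post_json` are hypothetical names, the delay settings are illustrative defaults, and the `post`/`sleep` parameters exist only so the loop can be tested without network access.

```python
import json
import random
import time
import urllib.error
import urllib.request
from typing import Any, Callable, Dict

GENERATE_URL = "https://api.demeterics.com/tts/v1/generate"
# rate_limited (429) and provider_error (502) are the transient codes in the error table.
RETRYABLE_STATUS = {429, 502}


def post_json(url: str, body: Dict[str, Any], api_key: str) -> Dict[str, Any]:
    """POST a JSON body and decode the JSON response; raises HTTPError on 4xx/5xx."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.load(resp)


def generate_with_retry(
    body: Dict[str, Any],
    api_key: str,
    *,
    max_retries: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    post: Callable[[str, Dict[str, Any], str], Dict[str, Any]] = post_json,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call /generate, retrying 429/502 with exponential backoff and full jitter."""
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    for attempt in range(max_retries + 1):
        try:
            return post(GENERATE_URL, body, api_key)
        except urllib.error.HTTPError as exc:
            if exc.code not in RETRYABLE_STATUS or attempt == max_retries:
                raise
            # Ceiling doubles each attempt (0.5s, 1s, 2s, ...) up to max_delay;
            # sleeping a random fraction of it spreads out retries from many clients.
            sleep(random.uniform(0, min(max_delay, base_delay * 2 ** attempt)))
    raise AssertionError("unreachable")
```

Non-retryable errors such as `invalid_voice` (400) or `insufficient_credits` (402) are re-raised immediately, since repeating the same request would fail the same way.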
