Speech Generation API

Beta Access Required: The Speech API requires whitelisted access.

To request access, email sales@demeterics.com with:

  • Subject: "Feature Access Request"
  • Feature name: "Text-to-Speech (TTS)"

For multi-speaker podcast generation, also request: "TTS Multi-Speaker"

The Demeterics Speech API provides a unified Text-to-Speech (TTS) interface across multiple providers. Convert text to natural-sounding audio through a single API that automatically tracks usage and costs and stores the generated audio for analysis.

Overview

Base URL: https://api.demeterics.com/tts/v1

Features:

  • Unified API: Single endpoint for OpenAI, ElevenLabs, Google Cloud TTS, Murf.ai, Groq Orpheus, and Google Gemini
  • Multi-Speaker: Generate podcasts and dialogues with up to 2 speakers (Gemini)
  • Auto-tracking: Every request logged to BigQuery with full observability
  • Audio Storage: Generated audio stored in GCS with 15-minute signed URLs
  • BYOK Support: Use your own provider API keys with dual-key authentication
  • Cost Control: Automatic credit billing with 15% managed or 10% BYOK fee

Authentication

Managed Keys (Default)

Use only your Demeterics API key:

curl -X POST https://api.demeterics.com/tts/v1/generate \
  -H "Authorization: Bearer dmt_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{...}'

Bring Your Own Key (BYOK)

Use the dual-key format to provide your own provider API key:

curl -X POST https://api.demeterics.com/tts/v1/generate \
  -H "Authorization: Bearer dmt_your_api_key;sk-your_openai_key" \
  -H "Content-Type: application/json" \
  -d '{...}'

The format is: [demeterics_api_key];[provider_api_key]

BYOK Benefits:

  • 10% service fee instead of 15%
  • Use your own rate limits and quotas
  • Provider costs billed directly to your account
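
The dual-key format can be assembled with a small helper. This is a minimal sketch: the function name build_auth_header is hypothetical and not part of any Demeterics SDK, and it only concatenates the two keys in the [demeterics_api_key];[provider_api_key] form described above.

```python
from typing import Optional


def build_auth_header(demeterics_key: str, provider_key: Optional[str] = None) -> dict:
    """Return the Authorization header for managed or BYOK requests.

    Hypothetical helper: it only formats the header value documented above.
    """
    if not demeterics_key:
        raise ValueError("a Demeterics API key is required")
    token = demeterics_key
    if provider_key:
        # BYOK dual-key format: [demeterics_api_key];[provider_api_key]
        token = f"{demeterics_key};{provider_key}"
    return {"Authorization": f"Bearer {token}"}
```

Passing only the Demeterics key yields the managed-key header; passing both keys yields the BYOK header.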

Endpoints

Generate Speech

POST /tts/v1/generate

Convert text to speech audio.

Request Body:

Field    | Type   | Required | Description
provider | string | Yes      | Target provider: openai, elevenlabs, google, murf, groq, gemini
model    | string | No       | TTS model (provider-specific)
voice    | string | No       | Voice identifier (single speaker)
input    | string | Yes      | Text to convert (max varies by provider)
format   | string | No       | Output format: mp3, wav, opus, flac
speed    | float  | No       | Playback speed: 0.25-4.0 (default: 1.0)
language | string | No       | Language code (ISO 639-1)
speakers | array  | No       | Multi-speaker config (Gemini only, max 2)
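
The field rules above can be checked client-side before a request is sent. The sketch below is illustrative only: build_generate_payload is a hypothetical helper, and the checks mirror just what the field list states (known provider names, a non-empty input, and a speed between 0.25 and 4.0). The server may apply additional validation.

```python
PROVIDERS = {"openai", "elevenlabs", "google", "murf", "groq", "gemini"}


def build_generate_payload(provider, text, *, model=None, voice=None,
                           audio_format=None, speed=None, language=None):
    """Build a JSON-serializable body for POST /tts/v1/generate (hypothetical helper)."""
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    if not text:
        raise ValueError("input text is required")
    if speed is not None and not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")

    payload = {"provider": provider, "input": text}
    optional = {
        "model": model,
        "voice": voice,
        "format": audio_format,
        "speed": speed,
        "language": language,
    }
    # Send only the optional fields the caller actually set.
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload
```

Omitting unset optional fields leaves the defaults to the API rather than hard-coding them in the client.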

Example Request:

curl -X POST https://api.demeterics.com/tts/v1/generate \
  -H "Authorization: Bearer dmt_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "openai",
    "model": "tts-1",
    "voice": "alloy",
    "input": "Hello, welcome to Demeterics!",
    "format": "mp3"
  }'

Response:

{
  "id": "01JARV4HZ6XPQMWVCS9N1GKEFD",
  "provider": "openai",
  "model": "tts-1",
  "voice": "alloy",
  "audio_url": "https://storage.googleapis.com/demeterics-data/tts/...",
  "duration_seconds": 2.3,
  "cost_usd": 0.00023,
  "usage": {
    "input_chars": 31
  },
  "metadata": {
    "format": "mp3",
    "sample_rate": 24000,
    "channels": 1,
    "generation_ms": 450
  }
}

List Voices

GET /tts/v1/voices?provider={provider}

List available voices for a provider.

Query Parameters:

Parameter | Type   | Required | Description
provider  | string | Yes      | Provider: openai, elevenlabs, google, murf

Example Request:

curl -X GET "https://api.demeterics.com/tts/v1/voices?provider=openai" \
  -H "Authorization: Bearer dmt_your_api_key"

Response:

{
  "voices": [
    {
      "id": "alloy",
      "name": "Alloy",
      "description": "Neutral and balanced",
      "gender": "neutral"
    },
    {
      "id": "echo",
      "name": "Echo",
      "description": "Clear and articulate",
      "gender": "male"
    }
  ]
}
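
Once the voices response is parsed, picking a voice is a matter of filtering the list. A small sketch, assuming the response shape shown above; the helper name voice_ids is hypothetical, and the sample data simply repeats the example response.

```python
def voice_ids(response_body: dict, gender=None) -> list:
    """Return voice IDs from a /voices response, optionally filtered by gender."""
    voices = response_body.get("voices", [])
    return [v["id"] for v in voices if gender is None or v.get("gender") == gender]


# Sample copied from the example response above.
sample = {
    "voices": [
        {"id": "alloy", "name": "Alloy", "description": "Neutral and balanced", "gender": "neutral"},
        {"id": "echo", "name": "Echo", "description": "Clear and articulate", "gender": "male"},
    ]
}

print(voice_ids(sample))           # all voice IDs in the sample
print(voice_ids(sample, "male"))   # only voices tagged "male"
```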

Providers

OpenAI

Models:

  • gpt-4o-mini-tts - Latest model with better steerability (~85% cheaper than ElevenLabs)
  • tts-1 - Fast and efficient (legacy)
  • tts-1-hd - Higher quality (legacy)

Voices:

  • alloy - Neutral and balanced
  • ash - Warm and conversational
  • ballad - Soft and melodic
  • coral - Friendly and approachable
  • echo - Clear and articulate
  • fable - Expressive and dynamic
  • onyx - Deep and authoritative
  • nova - Friendly and warm
  • sage - Calm and measured
  • shimmer - Bright and optimistic
  • verse - Dynamic and engaging

Supported Formats: mp3, opus, aac, flac, wav, pcm

Max Characters: 4,096

ElevenLabs

Models:

  • eleven_multilingual_v2 - Best quality, 29 languages
  • eleven_turbo_v2_5 - Fast, English-optimized
  • eleven_turbo_v2 - Previous fast model
  • eleven_monolingual_v1 - English only

Voices: Over 100 pre-made voices plus custom voice cloning

Supported Formats: mp3, pcm, ulaw

Max Characters: 5,000

Google Cloud TTS

Models:

  • standard - Basic quality
  • neural2 - Neural network based
  • wavenet - High quality WaveNet
  • journey - Conversational style
  • studio - Professional quality

Voices: 220+ voices across 40+ languages

Supported Formats: mp3, wav, ogg

Max Characters: 5,000

Murf.ai

Models:

  • GEN2 - Latest generation, highest quality ($0.03/1000 chars)
  • FALCON - Fast streaming model ($0.01/1000 chars), recommended for Voice-to-Voice

Voices: 120+ voices across 20+ languages including:

  • en-US-natalie - Natalie (US English, female) — clear, professional
  • en-US-samantha - Samantha (US English, female) — warm, conversational
  • en-US-terrell - Terrell (US English, male) — deep, authoritative
  • en-US-wayne - Wayne (US English, male) — friendly, casual
  • en-UK-hazel - Hazel (UK English, female) — British accent
  • en-UK-ruby - Ruby (UK English, female) — British, professional
  • en-UK-maisie - Maisie (UK English, female) — British, youthful
  • en-AU-lincoln - Lincoln (Australian, male) — Australian accent

Supported Formats: mp3, wav, flac, ogg, pcm, alaw, ulaw

Max Characters: 10,000

Features:

  • Voice styles (conversational, newscast, etc.)
  • Speed and pitch control
  • Multi-language support with native locales
  • Streaming support via /v1/speech/stream endpoint

Murf Falcon Streaming

The FALCON model supports real-time audio streaming, ideal for conversational AI applications. This is used by the AI Chat Widget's Voice-to-Voice feature.

Streaming Endpoint: POST https://api.murf.ai/v1/speech/stream

Request Body:

{
  "text": "Hello, how can I help you today?",
  "voiceId": "en-US-natalie",
  "model": "FALCON",
  "format": "WAV",
  "sampleRate": 24000,
  "channelType": "MONO",
  "multiNativeLocale": "en-US"
}

Response: Raw WAV audio bytes (not JSON) — streamed as they're generated

Performance:

  • ~130ms time-to-first-audio (TTFA)
  • Optimized for low-latency applications
  • WAV format at 24kHz mono

AI Chat Widget Integration:

When Voice-to-Voice is enabled, the widget uses a two-phase approach:

  1. Phase 1: POST /api/widget/voice returns text immediately + stream_token
  2. Phase 2: GET /api/widget/voice/stream?token=X streams Falcon audio via SSE

This architecture displays the AI's response text immediately while audio streams in the background, providing a responsive user experience.

Cost: $0.01 per 1,000 characters (billed when stream is consumed)

Google Gemini TTS

Beta Access: Gemini TTS with multi-speaker support is available to whitelisted users. Contact support to request access.

Models:

  • gemini-2.5-flash-preview-tts - Fast, cost-effective (default)
  • gemini-2.5-pro-preview-tts - Higher quality

Voices (30 prebuilt voices):

  • Puck - Upbeat
  • Kore - Firm
  • Charon - Informative
  • Zephyr - Bright
  • Fenrir - Excitable
  • Leda - Youthful
  • Aoede - Breezy
  • Sulafat - Warm
  • Achird - Friendly
  • And 21 more...

Supported Formats: wav

Max Characters: 8,000

Features:

  • Multi-speaker support: Up to 2 speakers with different voices
  • 30 prebuilt voice options
  • Ideal for podcasts, dialogues, and conversational content

Multi-Speaker Mode (Podcasts & Dialogues)

Generate conversational audio with up to 2 distinct speakers, each with their own voice. Perfect for:

  • Podcasts with host and guest
  • Dialogues between characters
  • Interview-style content
  • Educational back-and-forth explanations

Request Body (Multi-Speaker):

Field    | Type   | Required | Description
provider | string | Yes      | Must be gemini
model    | string | No       | gemini-2.5-flash-preview-tts (default)
input    | string | Yes      | Dialogue with speaker labels
speakers | array  | Yes      | Speaker-to-voice mapping (max 2)
format   | string | No       | Output format (default: wav)

Speaker Configuration:

Each speaker object has:

Field | Type   | Required | Description
name  | string | Yes      | Speaker label (must match input text)
voice | string | Yes      | Voice ID (e.g., Puck, Kore)

Example: Podcast Generation

curl -X POST https://api.demeterics.com/tts/v1/generate \
  -H "Authorization: Bearer dmt_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "gemini",
    "model": "gemini-2.5-flash-preview-tts",
    "input": "Host: Welcome to the AI Insights podcast! Today we explore the future of voice AI.\nGuest: Thanks for having me! Voice technology is transforming how we interact with machines.",
    "speakers": [
      {"name": "Host", "voice": "Puck"},
      {"name": "Guest", "voice": "Kore"}
    ],
    "format": "wav"
  }'

Response:

{
  "id": "tts_01JARV4HZ6XPQMWVCS9N1GKEFD",
  "provider": "gemini",
  "model": "gemini-2.5-flash-preview-tts",
  "audio_url": "https://storage.googleapis.com/demeterics-data/tts/...",
  "duration_seconds": 8.5,
  "cost_usd": 0.00125,
  "usage": {
    "input_chars": 156
  }
}

Python Example:

import requests

# Multi-speaker request: every label used in "input" must match a name in "speakers".
response = requests.post(
    "https://api.demeterics.com/tts/v1/generate",
    headers={"Authorization": "Bearer dmt_your_api_key"},
    json={
        "provider": "gemini",
        "input": """Host: What's the biggest challenge in AI today?
Guest: I'd say it's making AI accessible to everyone, not just tech companies.""",
        "speakers": [
            {"name": "Host", "voice": "Puck"},
            {"name": "Guest", "voice": "Kore"}
        ]
    },
    timeout=60,  # avoid hanging indefinitely if generation stalls
)
response.raise_for_status()  # surface 4xx/5xx errors before reading the body

audio_url = response.json()["audio_url"]
print(f"Podcast audio: {audio_url}")

Best Practices for Multi-Speaker:

  1. Consistent labels: Use the same speaker names throughout (e.g., Host: not Announcer:)
  2. Clear formatting: Start each line with Speaker: followed by their dialogue
  3. Voice pairing: Choose voices with distinct characteristics (e.g., upbeat + firm)
  4. Keep turns short: Shorter dialogue turns sound more natural
  5. Max 2 speakers: Gemini currently supports up to 2 distinct speakers
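
The label-consistency and two-speaker rules above can be checked before sending a request. This is a sketch under assumptions: check_dialogue is a hypothetical helper, and it treats any run of letters, digits, underscores, or spaces at the start of a line followed by a colon as a speaker label, which may not match how the API parses dialogue.

```python
import re

# Assumed label syntax: "Name:" at the start of a line.
LABEL_RE = re.compile(r"^\s*([A-Za-z0-9_ ]+):", re.MULTILINE)


def check_dialogue(text: str, speakers: list) -> tuple:
    """Return (labels used but not configured, configured names never used)."""
    if not 1 <= len(speakers) <= 2:
        raise ValueError("multi-speaker mode supports 1 or 2 speakers")
    configured = {speaker["name"] for speaker in speakers}
    used = {label.strip() for label in LABEL_RE.findall(text)}
    return sorted(used - configured), sorted(configured - used)


speakers = [{"name": "Host", "voice": "Puck"}, {"name": "Guest", "voice": "Kore"}]
unknown, unused = check_dialogue("Host: Hi there.\nAnnouncer: Goodbye.", speakers)
print(unknown, unused)  # labels to fix before calling the API
```

Both lists being empty means every label in the dialogue maps to a configured voice and every configured speaker actually speaks.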

Groq Orpheus (Canopy Labs)

Migration Notice: PlayAI TTS models (playai-tts, playai-tts-arabic) are deprecated and will be decommissioned on December 31, 2025. Please migrate to canopylabs/orpheus-v1-english.

Models:

  • canopylabs/orpheus-v1-english - Expressive English TTS with vocal direction support

Voices (8 voices):

  • tara - Female, conversational (default)
  • leah - Female, professional
  • jess - Female, friendly
  • leo - Male, conversational
  • dan - Male, professional
  • mia - Female, warm
  • zac - Male, casual
  • zoe - Female, clear

Supported Formats: wav only

Max Characters: 200 per request

Features:

  • Vocal Directions: Control speech style with bracketed commands:
    • Conversational: [cheerful], [friendly], [casual], [warm]
    • Professional: [professionally], [authoritatively], [formally]
    • Expressive: [whisper], [excited], [dramatic], [deadpan], [sarcastic]
    • Vocal qualities: [gravelly whisper], [rapid babbling], [singsong], [breathy]
  • Fast generation via Groq infrastructure
  • More directions produce more expressive speech; few or no directions produce a natural, casual delivery
  • 56% cheaper than PlayAI ($22/1M chars vs $50/1M chars)
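
Because Orpheus accepts at most 200 characters per request, longer text has to be split into several requests. The sketch below shows one possible approach, a greedy split on whitespace; split_text is a hypothetical helper, not part of the API. It does not treat bracketed vocal directions specially, so a multi-word direction such as [gravelly whisper] could be split across chunks.

```python
def split_text(text: str, max_chars: int = 200) -> list:
    """Greedily pack whole words into chunks of at most max_chars characters."""
    chunks, current = [], ""
    for word in text.split():
        if len(word) > max_chars:
            raise ValueError("a single word exceeds max_chars")
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The current chunk is full; start a new one with this word.
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own generate request, with the resulting WAV files concatenated client-side. The same helper works for other providers by passing their character limit as max_chars.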

Pricing

Managed Keys

Character-based pricing with 15% service fee:

Provider   | Model                         | Cost per 1M chars
OpenAI     | gpt-4o-mini-tts               | $0.69
OpenAI     | tts-1                         | $17.25
OpenAI     | tts-1-hd                      | $34.50
ElevenLabs | eleven_multilingual_v2        | $345.00
ElevenLabs | eleven_turbo_v2_5             | $86.25
Google     | wavenet                       | $18.40
Google     | neural2                       | $18.40
Google     | standard                      | $4.60
Murf       | GEN2                          | $27.60
Murf       | FALCON                        | $23.00
Groq       | canopylabs/orpheus-v1-english | $22.00
Gemini     | gemini-2.5-flash-preview-tts  | $11.50
Gemini     | gemini-2.5-pro-preview-tts    | $57.50
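
A rough per-request estimate follows directly from the table: characters divided by one million, times the listed rate. The sketch below assumes the listed rates are the full managed price per character with no minimum charge, and it copies only a few rows; estimate_cost_usd is a hypothetical helper, and actual billing may differ.

```python
# Managed rates in USD per 1M characters, copied from the table above (subset).
MANAGED_USD_PER_MILLION_CHARS = {
    ("openai", "gpt-4o-mini-tts"): 0.69,
    ("openai", "tts-1"): 17.25,
    ("openai", "tts-1-hd"): 34.50,
    ("elevenlabs", "eleven_multilingual_v2"): 345.00,
    ("gemini", "gemini-2.5-flash-preview-tts"): 11.50,
    ("gemini", "gemini-2.5-pro-preview-tts"): 57.50,
}


def estimate_cost_usd(provider: str, model: str, input_chars: int) -> float:
    """Estimate the managed-key cost of a request from its character count."""
    rate = MANAGED_USD_PER_MILLION_CHARS[(provider, model)]
    return input_chars / 1_000_000 * rate


# Example: a full 8,000-character Gemini request.
print(round(estimate_cost_usd("gemini", "gemini-2.5-flash-preview-tts", 8000), 4))
```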

BYOK

10% service fee on top of provider costs. Provider costs billed directly to your account.

Error Handling

Error Response Format:

{
  "error": {
    "type": "invalid_request",
    "message": "Input text exceeds maximum length",
    "code": "text_too_long"
  }
}

Common Error Codes:

Code                 | HTTP Status | Description
invalid_provider     | 400         | Unknown provider specified
invalid_voice        | 400         | Voice not available for provider
text_too_long        | 400         | Input exceeds provider limit
insufficient_credits | 402         | Not enough credits
provider_error       | 502         | Provider API failed
rate_limited         | 429         | Too many requests
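
On the client, the error envelope can be turned into an exception that carries its type, code, and message. A minimal sketch, assuming error bodies follow the format shown above; SpeechAPIError and parse_error are hypothetical names, and bodies that are not valid JSON fall back to "unknown".

```python
import json


class SpeechAPIError(Exception):
    """Hypothetical exception carrying the fields of the error envelope."""

    def __init__(self, status, error_type, code, message):
        super().__init__(f"HTTP {status} [{code}]: {message}")
        self.status = status
        self.error_type = error_type
        self.code = code
        self.message = message


def parse_error(status: int, body: str) -> SpeechAPIError:
    """Build a SpeechAPIError from an HTTP status and a raw response body."""
    try:
        parsed = json.loads(body)
    except ValueError:
        parsed = None  # body was not JSON (e.g. a proxy error page)
    err = parsed.get("error") if isinstance(parsed, dict) else None
    if not isinstance(err, dict):
        err = {}
    return SpeechAPIError(
        status,
        err.get("type", "unknown"),
        err.get("code", "unknown"),
        err.get("message", body[:200]),
    )
```

Callers can then branch on the code attribute, for example shortening the input on text_too_long.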

Data Tracking

Every speech generation is automatically tracked in BigQuery with:

  • Transaction ID (ULID)
  • User and API key identifiers
  • Provider, model, and voice used
  • Input character count and text hash (privacy-safe)
  • Audio duration and format
  • GCS storage path
  • Cost breakdown (provider cost, service fee, total)
  • Latency metrics
  • Error information (if failed)

Query your speech generations:

SELECT
  transaction_id,
  provider,
  model,
  tts.voice,
  tts.input_chars,
  tts.duration_sec,
  total_cost
FROM `demeterics.demeterics.interactions`
WHERE interaction_type = 'tts'
  AND user_id = @user_id
  AND timing.question_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY timing.question_time DESC

SDK Support

Python

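import requests

response = requests.post(
    "https://api.demeterics.com/tts/v1/generate",
    headers={"Authorization": "Bearer dmt_your_api_key"},
    json={
        "provider": "openai",
        "voice": "alloy",
        "input": "Hello, world!",
        "format": "mp3"
    },
    timeout=60,  # avoid hanging indefinitely if generation stalls
)
response.raise_for_status()  # surface 4xx/5xx errors before reading the body

audio_url = response.json()["audio_url"]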

Node.js

const response = await fetch("https://api.demeterics.com/tts/v1/generate", {
  method: "POST",
  headers: {
    "Authorization": "Bearer dmt_your_api_key",
    "Content-Type": "application/json"
  },
  body: JSON.stringify({
    provider: "openai",
    voice: "alloy",
    input: "Hello, world!",
    format: "mp3"
  })
});

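// Fail fast on 4xx/5xx responses before reading the body.
if (!response.ok) {
  throw new Error(`TTS request failed: HTTP ${response.status}`);
}

const { audio_url } = await response.json();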

Best Practices

  1. Choose the right provider: OpenAI for speed, ElevenLabs for quality, Google for language coverage
  2. Cache audio: Store frequently-used audio locally to reduce API calls
  3. Use appropriate formats: MP3 for web, WAV for editing, Opus for streaming
  4. Monitor costs: Track usage in your Demeterics dashboard
  5. Handle errors gracefully: Implement retry logic with exponential backoff
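
Retry logic with exponential backoff, as suggested in the last item, could look like the sketch below. It is not a prescribed implementation. Which statuses to retry is an assumption: it retries only 429 (rate_limited) and 502 (provider_error) from the error table. The send callable is a hypothetical stand-in for whatever function performs the HTTP request and returns (status, body).

```python
import random
import time

# Assumption: only these statuses are worth retrying.
RETRYABLE_STATUSES = {429, 502}


def call_with_backoff(send, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    The delay doubles on each retry, with up to 50% random jitter added.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status not in RETRYABLE_STATUSES or attempt == max_attempts - 1:
            return status, body
        delay = base_delay * (2 ** attempt)
        sleep(delay + random.uniform(0, delay / 2))
```

Injecting sleep keeps the helper testable without real waiting. The jitter spreads out retries from clients that were rate limited at the same moment.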