Voice API

Beta Access Required: The Voice APIs require whitelisted access.

To request access, email sales@demeterics.com with:

  • Subject: "Feature Access Request"
  • Feature name(s) needed:
    • "Speech-to-Text (STT)" - For transcription API
    • "Voice Conversation" - For STT→LLM→TTS pipeline
    • "Realtime API" - For WebSocket realtime connections

The Demeterics Voice API provides a complete voice AI platform with Speech-to-Text (STT), Voice Conversation pipelines (STT→LLM→TTS), and OpenAI Realtime WebSocket proxy. All endpoints automatically track usage, costs, and store interactions in BigQuery.

Overview

Base URLs:

  • STT: https://api.demeterics.com/audio/v1
  • Voice Conversation: https://api.demeterics.com/voice/v1
  • Realtime WebSocket: wss://api.demeterics.com/realtime/v1

Features:

  • Multi-provider STT: Groq Whisper and OpenAI transcription models
  • Voice Conversation: Complete STT→LLM→TTS pipeline in one call
  • Realtime WebSocket: Proxy to OpenAI Realtime API with automatic billing
  • Auto-tracking: Every request logged to BigQuery with full observability
  • Session Linking: Voice conversations create linked BigQuery rows via meta.session
  • BYOK Support: Use your own provider API keys with dual-key authentication
  • Universal Tagging: Add /// KEY value metadata to all interactions

Authentication

Managed Keys (Default)

Use only your Demeterics API key:

curl -X POST https://api.demeterics.com/audio/v1/transcriptions \
  -H "Authorization: Bearer dmt_your_api_key" \
  -F file=@audio.mp3 \
  -F model=whisper-large-v3-turbo

Bring Your Own Key (BYOK)

Use the dual-key format to provide your own provider API key:

curl -X POST https://api.demeterics.com/audio/v1/transcriptions \
  -H "Authorization: Bearer dmt_your_api_key;sk-your_openai_key" \
  -F file=@audio.mp3 \
  -F model=gpt-4o-transcribe

The format is: [demeterics_api_key];[provider_api_key]

BYOK Benefits:

  • 10% service fee instead of 15%
  • Use your own rate limits and quotas
  • Provider costs billed directly to your account

Speech-to-Text API

Transcribe Audio

POST /audio/v1/transcriptions

Convert audio to text using Groq Whisper or OpenAI transcription models.

Request (multipart/form-data):

| Field | Type | Required | Description |
|---|---|---|---|
| file | file | Yes | Audio file (mp3, wav, m4a, webm, ogg, flac) |
| model | string | Yes | STT model (see providers below) |
| language | string | No | ISO 639-1 language code (e.g., "en", "es") |
| prompt | string | No | Context to guide transcription |
| response_format | string | No | json, text, srt, vtt, verbose_json |
| temperature | float | No | 0.0-1.0; lower is more deterministic |
| timestamp_granularities | string | No | word, segment, or both |
| tags | string | No | Newline-separated KEY value metadata |

Example Request:

curl -X POST https://api.demeterics.com/audio/v1/transcriptions \
  -H "Authorization: Bearer dmt_your_api_key" \
  -F file=@meeting.mp3 \
  -F model=whisper-large-v3-turbo \
  -F language=en \
  -F 'tags=APP voicebot
FLOW customer_support
SESSION conv_12345'

Response:

{
  "id": "stt_01JARV4HZ6XPQMWVCS9N1GKEFD",
  "text": "Hello, I'd like to check on my order status.",
  "language": "en",
  "duration": 3.5,
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 3.5,
      "text": "Hello, I'd like to check on my order status."
    }
  ],
  "words": [
    {"word": "Hello", "start": 0.0, "end": 0.5},
    {"word": "I'd", "start": 0.6, "end": 0.8}
  ],
  "cost": {
    "provider_cost": 0.0004,
    "service_fee": 0.00006,
    "total_cost": 0.00046
  }
}
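The same request can be made from Python with the third-party requests library; a minimal sketch (the transcribe and format_tags helpers are illustrative, not part of an official SDK):

```python
def format_tags(**tags: str) -> str:
    """Join KEY value pairs into the newline-separated tag format."""
    return "\n".join(f"{key.upper()} {value}" for key, value in tags.items())

def transcribe(path: str, api_key: str, model: str = "whisper-large-v3-turbo", **fields) -> dict:
    """POST an audio file to /audio/v1/transcriptions and return the parsed JSON."""
    import requests  # third-party: pip install requests
    with open(path, "rb") as f:
        resp = requests.post(
            "https://api.demeterics.com/audio/v1/transcriptions",
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": f},
            data={"model": model, **fields},
            timeout=120,
        )
    resp.raise_for_status()
    return resp.json()

# Example (not executed here):
# result = transcribe("meeting.mp3", "dmt_your_api_key", language="en",
#                     tags=format_tags(APP="voicebot", FLOW="customer_support"))
# print(result["text"])
```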

List STT Models

GET /audio/v1/models

List available speech-to-text models.

Response:

{
  "models": [
    {
      "id": "whisper-large-v3-turbo",
      "provider": "groq",
      "description": "Fast, cost-effective transcription",
      "languages": ["en", "es", "fr", "de", "..."],
      "pricing": {
        "unit": "hour",
        "cost_per_unit": 0.04
      }
    },
    {
      "id": "gpt-4o-transcribe",
      "provider": "openai",
      "description": "High-accuracy transcription",
      "languages": ["en", "es", "fr", "de", "..."],
      "pricing": {
        "unit": "minute",
        "cost_per_unit": 0.006
      }
    }
  ]
}

STT Providers

Groq (Whisper)

Models:

  • whisper-large-v3 - Highest accuracy, $0.111/hour
  • whisper-large-v3-turbo - Fast and cost-effective, $0.04/hour

Supported Formats: mp3, mp4, mpeg, mpga, m4a, wav, webm, ogg, flac

Max File Size: 25 MB

Features:

  • Multi-language support (57+ languages)
  • Word-level timestamps
  • Segment-level timestamps

OpenAI (GPT-4o Transcribe)

Models:

  • gpt-4o-transcribe - High accuracy, $0.006/minute
  • gpt-4o-mini-transcribe - Cost-effective, $0.003/minute
  • gpt-4o-transcribe-diarize - Speaker diarization, $0.006/minute (beta)

Supported Formats: mp3, mp4, mpeg, mpga, m4a, wav, webm

Max File Size: 25 MB

Features:

  • High accuracy for complex audio
  • Speaker diarization (with diarize model)
  • Context-aware with prompt field

Voice Conversation API

The Voice Conversation API provides a complete STT→LLM→TTS pipeline in a single request. Perfect for voice assistants, phone bots, and conversational AI.

Process Voice Turn

POST /voice/v1/conversation

Process a voice conversation turn: transcribe audio, generate LLM response, and synthesize speech.

Request (multipart/form-data):

| Field | Type | Required | Description |
|---|---|---|---|
| file | file | Yes | Audio input (mp3, wav, m4a, webm, ogg) |
| stt_model | string | Yes | STT model (e.g., whisper-large-v3-turbo) |
| stt_language | string | No | ISO 639-1 language code |
| llm_model | string | Yes | LLM model (e.g., llama-3.3-70b-versatile) |
| llm_provider | string | No | LLM provider (inferred from model if omitted) |
| system_prompt | string | No | System instructions for the LLM |
| max_tokens | int | No | Max response tokens (default: 1024) |
| temperature | float | No | LLM temperature (default: 0.7) |
| conversation_id | string | No | For multi-turn conversations |
| tts_model | string | No | TTS model (e.g., canopylabs/orpheus-v1-english) |
| tts_provider | string | No | TTS provider (e.g., groq) |
| tts_voice | string | No | Voice ID (e.g., tara) |
| tts_speed | float | No | Playback speed (0.25-4.0) |
| tags | string | No | Newline-separated KEY value metadata |

Example Request:

curl -X POST https://api.demeterics.com/voice/v1/conversation \
  -H "Authorization: Bearer dmt_your_api_key" \
  -F file=@question.mp3 \
  -F stt_model=whisper-large-v3-turbo \
  -F llm_model=llama-3.3-70b-versatile \
  -F system_prompt="You are a helpful customer service agent." \
  -F tts_model=canopylabs/orpheus-v1-english \
  -F tts_voice=tara \
  -F conversation_id=conv_12345 \
  -F 'tags=APP customer_service
FLOW order_inquiry'

Response:

{
  "id": "conv_01JARV4HZ6XPQMWVCS9N1GKEFD",
  "transcript": "What's the status of my order?",
  "language": "en",
  "response": "I'd be happy to help you check your order status. Could you please provide your order number?",
  "audio_url": "https://storage.googleapis.com/demeterics-data/voice/...",
  "timing": {
    "stt_latency_ms": 150,
    "llm_latency_ms": 450,
    "tts_latency_ms": 200,
    "total_latency_ms": 800
  },
  "cost": {
    "stt_cost_usd": 0.0004,
    "llm_cost_usd": 0.002,
    "tts_cost_usd": 0.005,
    "service_fee": 0.001,
    "total_cost_usd": 0.0084
  }
}

Stream Voice Turn (SSE)

POST /voice/v1/conversation/stream

Stream the same STT→LLM→TTS pipeline over Server-Sent Events for live playback while the response is still being generated.

Request (multipart/form-data): same fields as POST /voice/v1/conversation, plus:

| Field | Type | Required | Description |
|---|---|---|---|
| tts_format | string | No | Output format (mp3, wav, ogg, flac, pcm); mp3 is recommended for streaming |

Event Types:

  • start - Conversation ID is available
  • transcript - STT output (text + latency)
  • response - LLM output (text + latency)
  • audio - Base64 audio chunk (if TTS enabled)
  • audio_complete - Summary for streamed audio
  • done - Final timing + cost summary (also includes audio_url if uploaded)
  • error - Error event with step and message

Audio Event Payload:

{
  "chunk": "BASE64_MP3_BYTES",
  "chunk_index": 0,
  "is_last": false,
  "format": "mp3",
  "elapsed_ms": 312
}

audio_complete Payload:

{
  "total_chunks": 185,
  "total_bytes": 229522,
  "elapsed_ms": 9656
}
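Streamed audio can be reassembled client-side by decoding each base64 chunk in chunk_index order. A minimal Python sketch:

```python
import base64

def collect_audio(events: list[dict]) -> bytes:
    """Reassemble streamed `audio` events (base64 chunks, ordered by chunk_index) into raw bytes."""
    chunks = sorted(events, key=lambda e: e["chunk_index"])
    return b"".join(base64.b64decode(e["chunk"]) for e in chunks)

# Write the reassembled bytes to a playable file:
# with open("reply.mp3", "wb") as f:
#     f.write(collect_audio(audio_events))
```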

done Payload (summary):

{
  "conversation_id": "conv_01KDEXYHB89BK410RYT7CHRMG1",
  "transcript": "Count a number from 1 to 20.",
  "response": "Here we go: 1, 2, 3, ...",
  "audio_url": "https://storage.googleapis.com/demeterics-data/voice/...",
  "timing": {
    "stt_latency_ms": 150,
    "llm_latency_ms": 450,
    "tts_latency_ms": 200,
    "total_latency_ms": 800
  },
  "cost": {
    "stt_cost_usd": 0.0004,
    "llm_cost_usd": 0.002,
    "tts_cost_usd": 0.005,
    "service_fee": 0.001,
    "total_cost_usd": 0.0084
  }
}

Minimal JS Example (SSE over fetch):

const resp = await fetch('https://api.demeterics.com/voice/v1/conversation/stream', {
  method: 'POST',
  headers: { Authorization: `Bearer ${apiKey}` },
  body: formData
});

const reader = resp.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop() || '';

  let eventName = 'message';
  for (const line of lines) {
    if (line.startsWith('event:')) { eventName = line.slice(6).trim(); continue; }
    if (!line.startsWith('data:')) continue;
    const data = JSON.parse(line.slice(5).trim());
    // Dispatch on eventName: start, transcript, response, audio, audio_complete, done, error
  }
}

Session Tracking

Voice conversations create three linked BigQuery rows per turn:

  1. STT row: interaction_type = 'stt' with transcription details
  2. LLM row: interaction_type = 'llm' with prompt/response
  3. TTS row: interaction_type = 'tts' with audio details

All rows share the same meta.session value (the conversation_id), enabling:

  • End-to-end latency analysis
  • Cost breakdown by pipeline step
  • Conversation flow tracking

Query Example:

SELECT
  transaction_id,
  voice.pipeline_step,
  total_cost,
  timing.latency_ms
FROM `demeterics.demeterics.interactions`
WHERE meta.session = 'conv_12345'
ORDER BY timing.question_time

Realtime WebSocket API

The Realtime API provides a WebSocket proxy to OpenAI's Realtime API with automatic billing, tag injection, and per-turn tracking.

Connect

wss://api.demeterics.com/realtime/v1?model={model}

Establish a WebSocket connection to the OpenAI Realtime API.

Query Parameters:

| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model name (e.g., gpt-4o-realtime-preview) |
| tags | string | No | URL-encoded, newline-separated KEY value metadata |

Connection Example (JavaScript, Node.js): browsers cannot attach custom headers to a WebSocket handshake, so this example uses the ws package.

import WebSocket from 'ws';

const ws = new WebSocket(
  'wss://api.demeterics.com/realtime/v1?model=gpt-4o-realtime-preview',
  {
    headers: {
      'Authorization': 'Bearer dmt_your_api_key'
    }
  }
);

ws.onopen = () => {
  // Send session.update to configure the session
  ws.send(JSON.stringify({
    type: 'session.update',
    session: {
      modalities: ['text', 'audio'],
      instructions: 'You are a helpful assistant.',
      voice: 'alloy',
      input_audio_format: 'pcm16',
      output_audio_format: 'pcm16',
      turn_detection: {
        type: 'server_vad',
        threshold: 0.5,
        prefix_padding_ms: 300,
        silence_duration_ms: 500
      }
    }
  }));
};

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);

  if (data.type === 'response.audio.delta') {
    // Handle audio chunk
    const audioData = atob(data.delta);
    playAudio(audioData);
  }

  if (data.type === 'response.done') {
    // Response complete
    console.log('Turn complete:', data.response);
  }
};

Connection Example (Python):

import asyncio
import websockets
import json

async def realtime_chat():
    uri = "wss://api.demeterics.com/realtime/v1?model=gpt-4o-realtime-preview"
    headers = {"Authorization": "Bearer dmt_your_api_key"}

    # Note: websockets >= 14 renamed extra_headers to additional_headers
    async with websockets.connect(uri, extra_headers=headers) as ws:
        # Configure session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "instructions": "You are a helpful assistant.",
                "voice": "alloy"
            }
        }))

        # Send text input
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "message",
                "role": "user",
                "content": [{"type": "input_text", "text": "Hello!"}]
            }
        }))

        await ws.send(json.dumps({"type": "response.create"}))

        # Receive responses
        async for message in ws:
            data = json.loads(message)
            if data["type"] == "response.text.delta":
                print(data["delta"], end="")
            if data["type"] == "response.done":
                break

asyncio.run(realtime_chat())

Tag Injection

Tags provided via the tags query parameter are automatically injected into your session.update instructions. For example:

tags=APP%20voicebot%0AFLOW%20support

Gets injected as:

/// APP voicebot
/// FLOW support

[Your instructions here]
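The encoded query value can be produced with standard URL escaping; a minimal Python sketch:

```python
from urllib.parse import quote

def encode_tags(tags: dict[str, str]) -> str:
    """Percent-encode newline-separated KEY value tags for the `tags` query parameter."""
    raw = "\n".join(f"{key} {value}" for key, value in tags.items())
    return quote(raw, safe="")

# encode_tags({"APP": "voicebot", "FLOW": "support"})
# → "APP%20voicebot%0AFLOW%20support"
```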

Per-Turn Billing

Each response.done event triggers:

  1. Cost calculation based on token usage
  2. Credit deduction from your account
  3. BigQuery row insertion with full metrics

Realtime Pricing (per 1M tokens):

| Token Type | Input Cost | Output Cost |
|---|---|---|
| Text | $5.00 | $20.00 |
| Audio | $100.00 | $200.00 |

Cached tokens receive a 50% discount on input costs.
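Per-turn cost can be estimated client-side from the table above. A minimal Python sketch — note that the usage field names here are illustrative assumptions, not the exact response.done schema:

```python
RATES_PER_M = {  # USD per 1M tokens, from the pricing table above
    ("text", "input"): 5.00, ("text", "output"): 20.00,
    ("audio", "input"): 100.00, ("audio", "output"): 200.00,
}

def turn_cost(usage: dict) -> float:
    """Estimate provider cost for one turn; cached input tokens get a 50% discount."""
    cost = 0.0
    for kind in ("text", "audio"):
        cost += usage.get(f"{kind}_input_tokens", 0) / 1e6 * RATES_PER_M[(kind, "input")]
        cost += usage.get(f"{kind}_output_tokens", 0) / 1e6 * RATES_PER_M[(kind, "output")]
        # Cached tokens are a subset of input tokens billed at half the input rate
        cost -= usage.get(f"{kind}_cached_tokens", 0) / 1e6 * RATES_PER_M[(kind, "input")] * 0.5
    return round(cost, 6)
```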

WebSocket Events

Client Events (send):

  • session.update - Configure session settings
  • conversation.item.create - Add message to conversation
  • input_audio_buffer.append - Stream audio input
  • input_audio_buffer.commit - Commit audio buffer
  • response.create - Request a response
  • response.cancel - Cancel ongoing response

Server Events (receive):

  • session.created - Session established
  • session.updated - Session configuration updated
  • response.created - Response started
  • response.text.delta - Text chunk
  • response.audio.delta - Audio chunk (base64)
  • response.audio_transcript.delta - Audio transcript chunk
  • response.done - Response complete with usage stats
  • error - Error occurred
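Audio input is streamed by sending base64-encoded chunks via input_audio_buffer.append, then committing the buffer and requesting a response. A minimal Python sketch (the 32 KB chunk size is an arbitrary choice):

```python
import base64
import json

def audio_append_events(pcm16_bytes: bytes, chunk_size: int = 32_000):
    """Yield input_audio_buffer events for raw PCM16 audio, chunked and base64-encoded,
    followed by a commit and a response.create."""
    for i in range(0, len(pcm16_bytes), chunk_size):
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm16_bytes[i:i + chunk_size]).decode("ascii"),
        })
    yield json.dumps({"type": "input_audio_buffer.commit"})
    yield json.dumps({"type": "response.create"})

# Inside the Python websockets example above:
# for event in audio_append_events(pcm_bytes):
#     await ws.send(event)
```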

Universal Tagging

All Voice APIs support the /// KEY value tag format for metadata injection.

Tag Format

Tags are newline-separated KEY value pairs:

APP voicebot
FLOW customer_support
SESSION conv_12345
USER user_789
VARIANT ab_test_v2

Adding Tags

STT/Conversation (form field):

-F 'tags=APP voicebot
FLOW support'

Realtime WebSocket (query parameter):

?tags=APP%20voicebot%0AFLOW%20support

Supported Tag Keys

| Key | Description | Example |
|---|---|---|
| APP | Application name | voicebot |
| FLOW | Business flow | customer_support |
| PRODUCT | Product line | enterprise |
| COMPANY | Client/customer | acme_corp |
| USER | User identifier | user_12345 |
| SESSION | Conversation ID | conv_789 |
| VARIANT | A/B test variant | new_prompt_v2 |
| VERSION | Prompt version | 1.2.3 |
| ENVIRONMENT | Environment | production |
| PROJECT | Project ID | proj_456 |
| COHORT | Cohort ID | video_123 |

Pricing Summary

STT Pricing (Managed Keys)

| Provider | Model | Unit | Cost |
|---|---|---|---|
| Groq | whisper-large-v3 | Hour | $0.111 |
| Groq | whisper-large-v3-turbo | Hour | $0.04 |
| OpenAI | gpt-4o-transcribe | Minute | $0.006 |
| OpenAI | gpt-4o-mini-transcribe | Minute | $0.003 |

TTS Pricing (Managed Keys)

| Provider | Model | Unit | Cost |
|---|---|---|---|
| Groq | canopylabs/orpheus-v1-english | 1M chars | $22.00 |
| OpenAI | gpt-4o-mini-tts | 1M chars | $0.60 |
| OpenAI | tts-1 | 1M chars | $15.00 |
| OpenAI | tts-1-hd | 1M chars | $30.00 |

Note: PlayAI TTS models (playai-tts, playai-tts-arabic) are deprecated and will be decommissioned on December 31, 2025. Please migrate to canopylabs/orpheus-v1-english.

Realtime Pricing (per 1M tokens)

| Token Type | Input | Output |
|---|---|---|
| Text | $5.00 | $20.00 |
| Audio | $100.00 | $200.00 |

Service Fees

  • Managed Keys: 15% markup on provider costs
  • BYOK: 10% service fee on tracked usage

Error Handling

Error Response Format:

{
  "error": {
    "type": "invalid_request",
    "message": "Audio file format not supported",
    "code": "unsupported_format"
  }
}

Common Error Codes:

| Code | HTTP Status | Description |
|---|---|---|
| unsupported_format | 400 | Audio format not supported |
| file_too_large | 400 | File exceeds size limit |
| invalid_model | 400 | Unknown model specified |
| insufficient_credits | 402 | Not enough credits |
| not_whitelisted | 403 | User not in beta whitelist |
| rate_limited | 429 | Too many requests |
| provider_error | 502 | Provider API failed |
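The retryable codes (rate_limited, provider_error) pair naturally with exponential backoff. A minimal Python sketch (send_request is any callable returning a (status, body) tuple):

```python
import random
import time

RETRYABLE = {429, 502}  # rate_limited, provider_error

def with_backoff(send_request, max_attempts: int = 5, base_delay: float = 1.0):
    """Call send_request() until it returns a non-retryable status, sleeping
    base_delay * 2^attempt seconds (plus jitter) between retryable failures."""
    for attempt in range(max_attempts):
        status, body = send_request()
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts - 1:
            time.sleep(base_delay * (2 ** attempt) + random.random() * base_delay)
    return status, body
```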

BigQuery Integration

STT Tracking Schema

SELECT
  transaction_id,
  provider,
  model,
  stt.audio_duration_sec,
  stt.transcript_chars,
  stt.language,
  stt.diarization,
  stt.speaker_count,
  total_cost
FROM `demeterics.demeterics.interactions`
WHERE interaction_type = 'stt'
  AND user_id = @user_id
  AND timing.question_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY timing.question_time DESC

Voice Conversation Tracking

-- Get full conversation pipeline
SELECT
  transaction_id,
  voice.pipeline_step,
  voice.conversation_id,
  CASE voice.pipeline_step
    WHEN 'stt' THEN stt.audio_duration_sec
    WHEN 'llm' THEN tokens.total
    WHEN 'tts' THEN tts.duration_sec
  END as metric,
  total_cost,
  timing.latency_ms
FROM `demeterics.demeterics.interactions`
WHERE meta.session = @conversation_id
ORDER BY timing.question_time

Realtime Session Tracking

SELECT
  realtime.session_id,
  realtime.turn_number,
  realtime.audio_in_minutes,
  realtime.audio_out_minutes,
  realtime.text_input_tokens,
  realtime.text_output_tokens,
  total_cost
FROM `demeterics.demeterics.interactions`
WHERE interaction_type = 'realtime'
  AND user_id = @user_id
ORDER BY timing.question_time DESC

Best Practices

  1. Choose the right STT model:

    • Groq Whisper for cost-effective batch processing
    • OpenAI for high-accuracy or diarization needs
  2. Optimize voice conversations:

    • Use conversation_id for multi-turn tracking
    • Keep system prompts concise to reduce LLM latency
    • Choose fast TTS models for real-time applications
  3. Realtime API tips:

    • Use server VAD for natural conversation flow
    • Handle response.done events for billing reconciliation
    • Implement reconnection logic for long sessions
  4. Monitor costs:

    • Track usage via BigQuery queries
    • Set up alerts for unusual spending patterns
    • Use BYOK for high-volume applications
  5. Handle errors gracefully:

    • Implement retry logic with exponential backoff
    • Fall back to text when audio fails
    • Log error codes for debugging