Voice API
Beta Access Required: The Voice APIs require whitelisted access.
To request access, email sales@demeterics.com with:
- Subject: "Feature Access Request"
- Feature name(s) needed:
  - "Speech-to-Text (STT)" - For the transcription API
  - "Voice Conversation" - For the STT→LLM→TTS pipeline
  - "Realtime API" - For WebSocket realtime connections
The Demeterics Voice API provides a complete voice AI platform with Speech-to-Text (STT), Voice Conversation pipelines (STT→LLM→TTS), and an OpenAI Realtime WebSocket proxy. All endpoints automatically track usage and costs and store every interaction in BigQuery.
Overview
Base URLs:
- STT: https://api.demeterics.com/audio/v1
- Voice Conversation: https://api.demeterics.com/voice/v1
- Realtime WebSocket: wss://api.demeterics.com/realtime/v1
Features:
- Multi-provider STT: Groq Whisper and OpenAI transcription models
- Voice Conversation: Complete STT→LLM→TTS pipeline in one call
- Realtime WebSocket: Proxy to OpenAI Realtime API with automatic billing
- Auto-tracking: Every request logged to BigQuery with full observability
- Session Linking: Voice conversations create linked BQ rows via meta.session
- BYOK Support: Use your own provider API keys with dual-key authentication
- Universal Tagging: Add /// KEY value metadata to all interactions
Authentication
Managed Keys (Default)
Use only your Demeterics API key:
curl -X POST https://api.demeterics.com/audio/v1/transcriptions \
-H "Authorization: Bearer dmt_your_api_key" \
-F file=@audio.mp3 \
-F model=whisper-large-v3-turbo
Bring Your Own Key (BYOK)
Use the dual-key format to provide your own provider API key:
curl -X POST https://api.demeterics.com/audio/v1/transcriptions \
-H "Authorization: Bearer dmt_your_api_key;sk-your_openai_key" \
-F file=@audio.mp3 \
-F model=gpt-4o-transcribe
The format is: [demeterics_api_key];[provider_api_key]
BYOK Benefits:
- 10% service fee instead of 15%
- Use your own rate limits and quotas
- Provider costs billed directly to your account
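The dual-key header can be assembled programmatically. A minimal Python sketch (the key values are placeholders and `auth_header` is an illustrative helper, not part of an official SDK):

```python
DEMETERICS_KEY = "dmt_your_api_key"  # placeholder


def auth_header(provider_key=None):
    """Build the Authorization header: managed by default, BYOK if a provider key is given."""
    token = DEMETERICS_KEY if provider_key is None else f"{DEMETERICS_KEY};{provider_key}"
    return {"Authorization": f"Bearer {token}"}


auth_header()                      # managed: Bearer dmt_your_api_key
auth_header("sk-your_openai_key")  # BYOK:    Bearer dmt_your_api_key;sk-your_openai_key
```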
Speech-to-Text API
Transcribe Audio
POST /audio/v1/transcriptions
Convert audio to text using Groq Whisper or OpenAI transcription models.
Request (multipart/form-data):
| Field | Type | Required | Description |
|---|---|---|---|
| file | file | Yes | Audio file (mp3, wav, m4a, webm, ogg, flac) |
| model | string | Yes | STT model (see providers below) |
| language | string | No | ISO 639-1 language code (e.g., "en", "es") |
| prompt | string | No | Context to guide transcription |
| response_format | string | No | json, text, srt, vtt, verbose_json |
| temperature | float | No | 0.0-1.0; lower is more deterministic |
| timestamp_granularities | string | No | word, segment, or both |
| tags | string | No | Newline-separated KEY value metadata |
Example Request:
curl -X POST https://api.demeterics.com/audio/v1/transcriptions \
-H "Authorization: Bearer dmt_your_api_key" \
-F file=@meeting.mp3 \
-F model=whisper-large-v3-turbo \
-F language=en \
-F 'tags=APP voicebot
FLOW customer_support
SESSION conv_12345'
Response:
{
  "id": "stt_01JARV4HZ6XPQMWVCS9N1GKEFD",
  "text": "Hello, I'd like to check on my order status.",
  "language": "en",
  "duration": 3.5,
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 3.5,
      "text": "Hello, I'd like to check on my order status."
    }
  ],
  "words": [
    {"word": "Hello", "start": 0.0, "end": 0.5},
    {"word": "I'd", "start": 0.6, "end": 0.8}
  ],
  "cost": {
    "provider_cost": 0.0004,
    "service_fee": 0.00006,
    "total_cost": 0.00046
  }
}
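The same request can be made from Python. A sketch using the third-party requests library (`build_form` and `transcribe` are illustrative helper names, not an official SDK):

```python
import requests  # third-party: pip install requests

API_URL = "https://api.demeterics.com/audio/v1/transcriptions"


def build_form(model, language=None, tags=None):
    """Assemble the non-file multipart fields, skipping unset options."""
    form = {"model": model}
    if language:
        form["language"] = language
    if tags:
        form["tags"] = tags
    return form


def transcribe(path, api_key, **options):
    """POST an audio file for transcription and return the parsed JSON."""
    with open(path, "rb") as f:
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": f},
            data=build_form(**options),
        )
    resp.raise_for_status()
    return resp.json()


# result = transcribe("meeting.mp3", "dmt_your_api_key",
#                     model="whisper-large-v3-turbo", language="en",
#                     tags="APP voicebot\nSESSION conv_12345")
# print(result["text"], result["cost"]["total_cost"])
```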
List STT Models
GET /audio/v1/models
List available speech-to-text models.
Response:
{
  "models": [
    {
      "id": "whisper-large-v3-turbo",
      "provider": "groq",
      "description": "Fast, cost-effective transcription",
      "languages": ["en", "es", "fr", "de", "..."],
      "pricing": {
        "unit": "hour",
        "cost_per_unit": 0.04
      }
    },
    {
      "id": "gpt-4o-transcribe",
      "provider": "openai",
      "description": "High-accuracy transcription",
      "languages": ["en", "es", "fr", "de", "..."],
      "pricing": {
        "unit": "minute",
        "cost_per_unit": 0.006
      }
    }
  ]
}
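One practical use of this endpoint is comparing models whose pricing uses different units. A small sketch over the response shape above (`cheapest_model` is an illustrative helper):

```python
def cheapest_model(models_response):
    """Return the model id with the lowest cost per hour of audio."""
    def hourly_cost(m):
        pricing = m["pricing"]
        # Normalize per-minute rates to per-hour for an apples-to-apples comparison
        return pricing["cost_per_unit"] * (60 if pricing["unit"] == "minute" else 1)
    return min(models_response["models"], key=hourly_cost)["id"]


sample = {"models": [
    {"id": "whisper-large-v3-turbo", "pricing": {"unit": "hour", "cost_per_unit": 0.04}},
    {"id": "gpt-4o-transcribe", "pricing": {"unit": "minute", "cost_per_unit": 0.006}},
]}
cheapest_model(sample)  # $0.04/hour vs $0.006 * 60 = $0.36/hour
```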
STT Providers
Groq (Whisper)
Models:
- whisper-large-v3 - Highest accuracy, $0.111/hour
- whisper-large-v3-turbo - Fast and cost-effective, $0.04/hour
Supported Formats: mp3, mp4, mpeg, mpga, m4a, wav, webm, ogg, flac
Max File Size: 25 MB
Features:
- Multi-language support (57+ languages)
- Word-level timestamps
- Segment-level timestamps
OpenAI (GPT-4o Transcribe)
Models:
- gpt-4o-transcribe - High accuracy, $0.006/minute
- gpt-4o-mini-transcribe - Cost-effective, $0.003/minute
- gpt-4o-transcribe-diarize - Speaker diarization, $0.006/minute (beta)
Supported Formats: mp3, mp4, mpeg, mpga, m4a, wav, webm
Max File Size: 25 MB
Features:
- High accuracy for complex audio
- Speaker diarization (with diarize model)
- Context-aware with prompt field
Voice Conversation API
The Voice Conversation API provides a complete STT→LLM→TTS pipeline in a single request. Perfect for voice assistants, phone bots, and conversational AI.
Process Voice Turn
POST /voice/v1/conversation
Process a voice conversation turn: transcribe audio, generate LLM response, and synthesize speech.
Request (multipart/form-data):
| Field | Type | Required | Description |
|---|---|---|---|
| file | file | Yes | Audio input (mp3, wav, m4a, webm, ogg) |
| stt_model | string | Yes | STT model (e.g., whisper-large-v3-turbo) |
| stt_language | string | No | ISO 639-1 language code |
| llm_model | string | Yes | LLM model (e.g., llama-3.3-70b-versatile) |
| llm_provider | string | No | LLM provider (inferred from model if omitted) |
| system_prompt | string | No | System instructions for the LLM |
| max_tokens | int | No | Max response tokens (default: 1024) |
| temperature | float | No | LLM temperature (default: 0.7) |
| conversation_id | string | No | For multi-turn conversations |
| tts_model | string | No | TTS model (e.g., canopylabs/orpheus-v1-english) |
| tts_provider | string | No | TTS provider (e.g., groq) |
| tts_voice | string | No | Voice ID (e.g., tara) |
| tts_speed | float | No | Playback speed (0.25-4.0) |
| tags | string | No | Newline-separated KEY value metadata |
Example Request:
curl -X POST https://api.demeterics.com/voice/v1/conversation \
-H "Authorization: Bearer dmt_your_api_key" \
-F file=@question.mp3 \
-F stt_model=whisper-large-v3-turbo \
-F llm_model=llama-3.3-70b-versatile \
-F system_prompt="You are a helpful customer service agent." \
-F tts_model=canopylabs/orpheus-v1-english \
-F tts_voice=tara \
-F conversation_id=conv_12345 \
-F 'tags=APP customer_service
FLOW order_inquiry'
Response:
{
  "id": "conv_01JARV4HZ6XPQMWVCS9N1GKEFD",
  "transcript": "What's the status of my order?",
  "language": "en",
  "response": "I'd be happy to help you check your order status. Could you please provide your order number?",
  "audio_url": "https://storage.googleapis.com/demeterics-data/voice/...",
  "timing": {
    "stt_latency_ms": 150,
    "llm_latency_ms": 450,
    "tts_latency_ms": 200,
    "total_latency_ms": 800
  },
  "cost": {
    "stt_cost_usd": 0.0004,
    "llm_cost_usd": 0.002,
    "tts_cost_usd": 0.005,
    "service_fee": 0.001,
    "total_cost_usd": 0.0084
  }
}
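The same turn can be driven from Python. A sketch using the third-party requests library (`turn_fields` and `voice_turn` are illustrative helper names):

```python
import requests  # third-party: pip install requests

CONVERSATION_URL = "https://api.demeterics.com/voice/v1/conversation"


def turn_fields(stt_model, llm_model, tts_model=None, tts_voice=None,
                system_prompt=None, conversation_id=None):
    """Assemble the multipart fields, dropping options that were not set."""
    fields = {
        "stt_model": stt_model,
        "llm_model": llm_model,
        "tts_model": tts_model,
        "tts_voice": tts_voice,
        "system_prompt": system_prompt,
        "conversation_id": conversation_id,
    }
    return {k: v for k, v in fields.items() if v is not None}


def voice_turn(path, api_key, **options):
    """Run one STT→LLM→TTS turn and return the parsed response body."""
    with open(path, "rb") as f:
        resp = requests.post(
            CONVERSATION_URL,
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": f},
            data=turn_fields(**options),
        )
    resp.raise_for_status()
    return resp.json()


# body = voice_turn("question.mp3", "dmt_your_api_key",
#                   stt_model="whisper-large-v3-turbo",
#                   llm_model="llama-3.3-70b-versatile",
#                   tts_model="canopylabs/orpheus-v1-english",
#                   tts_voice="tara", conversation_id="conv_12345")
# print(body["transcript"], "->", body["response"])
```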
Stream Voice Turn (SSE)
POST /voice/v1/conversation/stream
Stream the same STT→LLM→TTS pipeline over Server-Sent Events for live playback while the response is still being generated.
Request (multipart/form-data): same fields as POST /voice/v1/conversation, plus:
| Field | Type | Required | Description |
|---|---|---|---|
| tts_format | string | No | Output format (mp3, wav, ogg, flac, pcm); mp3 recommended for streaming |
Event Types:
- start - Conversation ID is available
- transcript - STT output (text + latency)
- response - LLM output (text + latency)
- audio - Base64 audio chunk (if TTS enabled)
- audio_complete - Summary for streamed audio
- done - Final timing + cost summary (also includes audio_url if uploaded)
- error - Error event with step and message
Audio Event Payload:
{
  "chunk": "BASE64_MP3_BYTES",
  "chunk_index": 0,
  "is_last": false,
  "format": "mp3",
  "elapsed_ms": 312
}
audio_complete Payload:
{
  "total_chunks": 185,
  "total_bytes": 229522,
  "elapsed_ms": 9656
}
done Payload (summary):
{
  "conversation_id": "conv_01KDEXYHB89BK410RYT7CHRMG1",
  "transcript": "Count a number from 1 to 20.",
  "response": "Here we go: 1, 2, 3, ...",
  "audio_url": "https://storage.googleapis.com/demeterics-data/voice/...",
  "timing": {
    "stt_latency_ms": 150,
    "llm_latency_ms": 450,
    "tts_latency_ms": 200,
    "total_latency_ms": 800
  },
  "cost": {
    "stt_cost_usd": 0.0004,
    "llm_cost_usd": 0.002,
    "tts_cost_usd": 0.005,
    "service_fee": 0.001,
    "total_cost_usd": 0.0084
  }
}
Minimal JS Example (SSE over fetch):
const resp = await fetch('https://api.demeterics.com/voice/v1/conversation/stream', {
  method: 'POST',
  headers: { Authorization: `Bearer ${apiKey}` },
  body: formData
});

const reader = resp.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop() || '';
  for (const line of lines) {
    if (line.startsWith('event:')) continue;
    if (!line.startsWith('data:')) continue;
    const data = JSON.parse(line.slice(5).trim());
    // Handle transcript/response/audio/done as needed
  }
}
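On the receiving side, the base64 `audio` events can be reassembled into playable bytes. A Python sketch (`assemble_audio` is an illustrative helper; the sample events are synthetic):

```python
import base64


def assemble_audio(audio_events):
    """Decode and concatenate `audio` event chunks in chunk_index order."""
    ordered = sorted(audio_events, key=lambda e: e["chunk_index"])
    return b"".join(base64.b64decode(e["chunk"]) for e in ordered)


# Synthetic events standing in for parsed SSE `audio` payloads
events = [
    {"chunk": base64.b64encode(b"ID3").decode(), "chunk_index": 0, "is_last": False},
    {"chunk": base64.b64encode(b"rest").decode(), "chunk_index": 1, "is_last": True},
]
audio_bytes = assemble_audio(events)
# open("reply.mp3", "wb").write(audio_bytes)
```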
Session Tracking
Voice conversations create three linked BigQuery rows per turn:
- STT row: interaction_type = 'stt' with transcription details
- LLM row: interaction_type = 'llm' with prompt/response
- TTS row: interaction_type = 'tts' with audio details
All rows share the same meta.session value (the conversation_id), enabling:
- End-to-end latency analysis
- Cost breakdown by pipeline step
- Conversation flow tracking
Query Example:
SELECT
transaction_id,
voice.pipeline_step,
total_cost,
timing.latency_ms
FROM `demeterics.demeterics.interactions`
WHERE meta.session = 'conv_12345'
ORDER BY timing.question_time
Realtime WebSocket API
The Realtime API provides a WebSocket proxy to OpenAI's Realtime API with automatic billing, tag injection, and per-turn tracking.
Connect
wss://api.demeterics.com/realtime/v1?model={model}
Establish a WebSocket connection to the OpenAI Realtime API.
Query Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model name (e.g., gpt-4o-realtime-preview) |
| tags | string | No | URL-encoded newline-separated KEY value metadata |
Connection Example (JavaScript - Node.js with the ws package; the browser WebSocket API cannot send custom headers, so browser clients need a different auth mechanism):
const ws = new WebSocket(
  'wss://api.demeterics.com/realtime/v1?model=gpt-4o-realtime-preview',
  {
    headers: {
      'Authorization': 'Bearer dmt_your_api_key'
    }
  }
);

ws.onopen = () => {
  // Send session.update to configure the session
  ws.send(JSON.stringify({
    type: 'session.update',
    session: {
      modalities: ['text', 'audio'],
      instructions: 'You are a helpful assistant.',
      voice: 'alloy',
      input_audio_format: 'pcm16',
      output_audio_format: 'pcm16',
      turn_detection: {
        type: 'server_vad',
        threshold: 0.5,
        prefix_padding_ms: 300,
        silence_duration_ms: 500
      }
    }
  }));
};

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === 'response.audio.delta') {
    // Handle audio chunk
    const audioData = atob(data.delta);
    playAudio(audioData);
  }
  if (data.type === 'response.done') {
    // Response complete
    console.log('Turn complete:', data.response);
  }
};
Connection Example (Python):
import asyncio
import websockets
import json

async def realtime_chat():
    uri = "wss://api.demeterics.com/realtime/v1?model=gpt-4o-realtime-preview"
    headers = {"Authorization": "Bearer dmt_your_api_key"}
    # Note: on websockets >= 14 the parameter is named additional_headers
    async with websockets.connect(uri, extra_headers=headers) as ws:
        # Configure session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "instructions": "You are a helpful assistant.",
                "voice": "alloy"
            }
        }))
        # Send text input
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "message",
                "role": "user",
                "content": [{"type": "input_text", "text": "Hello!"}]
            }
        }))
        await ws.send(json.dumps({"type": "response.create"}))
        # Receive responses
        async for message in ws:
            data = json.loads(message)
            if data["type"] == "response.text.delta":
                print(data["delta"], end="")
            if data["type"] == "response.done":
                break

asyncio.run(realtime_chat())
Tag Injection
Tags provided via the tags query parameter are automatically injected into your session.update instructions. For example:
tags=APP%20voicebot%0AFLOW%20support
Gets injected as:
/// APP voicebot
/// FLOW support
[Your instructions here]
Per-Turn Billing
Each response.done event triggers:
- Cost calculation based on token usage
- Credit deduction from your account
- BigQuery row insertion with full metrics
Realtime Pricing (per 1M tokens):
| Token Type | Input Cost | Output Cost |
|---|---|---|
| Text | $5.00 | $20.00 |
| Audio | $100.00 | $200.00 |
Cached tokens receive a 50% discount on input costs.
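The table above makes per-turn costs easy to estimate. A worked Python sketch (token counts are illustrative, and only cached text input is modeled here; cached audio input would be discounted the same way):

```python
# (input, output) USD per 1M tokens, from the Realtime pricing table
RATES = {"text": (5.00, 20.00), "audio": (100.00, 200.00)}


def turn_cost(text_in, text_out, audio_in, audio_out, cached_text_in=0):
    """Estimate one turn's cost; cached input tokens get a 50% discount."""
    total = (
        (text_in - cached_text_in) * RATES["text"][0]
        + cached_text_in * RATES["text"][0] * 0.5   # 50% discount on cached input
        + text_out * RATES["text"][1]
        + audio_in * RATES["audio"][0]
        + audio_out * RATES["audio"][1]
    )
    return round(total / 1_000_000, 6)


# 1,000 text in (400 cached), 300 text out, 2,000 audio in, 1,500 audio out:
# 600*5 + 400*2.5 + 300*20 + 2000*100 + 1500*200 = 510,000 per-1M units -> $0.51
turn_cost(1000, 300, 2000, 1500, cached_text_in=400)
```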
WebSocket Events
Client Events (send):
- session.update - Configure session settings
- conversation.item.create - Add message to conversation
- input_audio_buffer.append - Stream audio input
- input_audio_buffer.commit - Commit audio buffer
- response.create - Request a response
- response.cancel - Cancel ongoing response
Server Events (receive):
- session.created - Session established
- session.updated - Session configuration updated
- response.created - Response started
- response.text.delta - Text chunk
- response.audio.delta - Audio chunk (base64)
- response.audio_transcript.delta - Audio transcript chunk
- response.done - Response complete with usage stats
- error - Error occurred
Universal Tagging
All Voice APIs support the /// KEY value tag format for metadata injection.
Tag Format
Tags are newline-separated KEY value pairs:
APP voicebot
FLOW customer_support
SESSION conv_12345
USER user_789
VARIANT ab_test_v2
Adding Tags
STT/Conversation (form field):
-F 'tags=APP voicebot
FLOW support'
Realtime WebSocket (query parameter):
?tags=APP%20voicebot%0AFLOW%20support
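For the WebSocket query parameter, the newline-separated pairs must be URL-encoded. A small Python sketch (`encode_tags` is an illustrative helper):

```python
from urllib.parse import quote


def encode_tags(tags: dict) -> str:
    """Join KEY value pairs with newlines, then URL-encode for the query string."""
    joined = "\n".join(f"{key} {value}" for key, value in tags.items())
    return quote(joined)


encode_tags({"APP": "voicebot", "FLOW": "support"})
# -> 'APP%20voicebot%0AFLOW%20support'
```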
Supported Tag Keys
| Key | Description | Example |
|---|---|---|
| APP | Application name | voicebot |
| FLOW | Business flow | customer_support |
| PRODUCT | Product line | enterprise |
| COMPANY | Client/customer | acme_corp |
| USER | User identifier | user_12345 |
| SESSION | Conversation ID | conv_789 |
| VARIANT | A/B test variant | new_prompt_v2 |
| VERSION | Prompt version | 1.2.3 |
| ENVIRONMENT | Environment | production |
| PROJECT | Project ID | proj_456 |
| COHORT | Cohort ID | video_123 |
Pricing Summary
STT Pricing (Managed Keys)
| Provider | Model | Unit | Cost |
|---|---|---|---|
| Groq | whisper-large-v3 | Hour | $0.111 |
| Groq | whisper-large-v3-turbo | Hour | $0.04 |
| OpenAI | gpt-4o-transcribe | Minute | $0.006 |
| OpenAI | gpt-4o-mini-transcribe | Minute | $0.003 |
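These rates can be turned into a quick cost estimator by normalizing the per-hour and per-minute units to seconds and adding the 15% managed-key markup described under Service Fees. A Python sketch (`stt_cost` is an illustrative helper):

```python
# USD rate per unit, with the unit's duration in seconds (from the table above)
STT_RATES = {
    "whisper-large-v3": (0.111, 3600),
    "whisper-large-v3-turbo": (0.04, 3600),
    "gpt-4o-transcribe": (0.006, 60),
    "gpt-4o-mini-transcribe": (0.003, 60),
}


def stt_cost(model, duration_sec, fee=0.15):
    """Estimate managed-key cost: provider rate plus the service fee markup."""
    rate, unit_sec = STT_RATES[model]
    provider = duration_sec / unit_sec * rate
    return round(provider * (1 + fee), 6)


stt_cost("whisper-large-v3-turbo", 1800)  # 30 min: $0.02 provider + 15% = $0.023
stt_cost("gpt-4o-transcribe", 1800)       # 30 min: $0.18 provider + 15% = $0.207
```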
TTS Pricing (Managed Keys)
| Provider | Model | Unit | Cost |
|---|---|---|---|
| Groq | canopylabs/orpheus-v1-english | 1M chars | $22.00 |
| OpenAI | gpt-4o-mini-tts | 1M chars | $0.60 |
| OpenAI | tts-1 | 1M chars | $15.00 |
| OpenAI | tts-1-hd | 1M chars | $30.00 |
Note: PlayAI TTS models (playai-tts, playai-tts-arabic) are deprecated and will be decommissioned on December 31, 2025. Please migrate to canopylabs/orpheus-v1-english.
Realtime Pricing (per 1M tokens)
| Token Type | Input | Output |
|---|---|---|
| Text | $5.00 | $20.00 |
| Audio | $100.00 | $200.00 |
Service Fees
- Managed Keys: 15% markup on provider costs
- BYOK: 10% service fee on tracked usage
Error Handling
Error Response Format:
{
  "error": {
    "type": "invalid_request",
    "message": "Audio file format not supported",
    "code": "unsupported_format"
  }
}
Common Error Codes:
| Code | HTTP Status | Description |
|---|---|---|
| unsupported_format | 400 | Audio format not supported |
| file_too_large | 400 | File exceeds size limit |
| invalid_model | 400 | Unknown model specified |
| insufficient_credits | 402 | Not enough credits |
| not_whitelisted | 403 | User not in beta whitelist |
| rate_limited | 429 | Too many requests |
| provider_error | 502 | Provider API failed |
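Of these codes, 429 and 502 are usually transient and worth retrying with backoff, while 4xx client errors are not. A sketch using the third-party requests library (`post_with_retry` and `backoff_seconds` are illustrative names):

```python
import random
import time

import requests  # third-party: pip install requests

RETRYABLE_STATUSES = {429, 502}  # rate_limited, provider_error


def backoff_seconds(attempt, cap=30):
    """Exponential backoff: 1s, 2s, 4s, ... capped at `cap` seconds."""
    return min(2 ** attempt, cap)


def post_with_retry(url, max_attempts=5, **kwargs):
    """POST with retries on transient errors; raise immediately on other failures."""
    for attempt in range(max_attempts):
        resp = requests.post(url, **kwargs)
        if resp.status_code not in RETRYABLE_STATUSES:
            resp.raise_for_status()  # non-retryable errors (400/402/403) raise here
            return resp.json()
        # Jitter keeps concurrent clients from retrying in lockstep
        time.sleep(backoff_seconds(attempt) + random.random())
    raise RuntimeError(f"gave up after {max_attempts} attempts")
```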
BigQuery Integration
STT Tracking Schema
SELECT
transaction_id,
provider,
model,
stt.audio_duration_sec,
stt.transcript_chars,
stt.language,
stt.diarization,
stt.speaker_count,
total_cost
FROM `demeterics.demeterics.interactions`
WHERE interaction_type = 'stt'
AND user_id = @user_id
AND timing.question_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY timing.question_time DESC
Voice Conversation Tracking
-- Get full conversation pipeline
SELECT
transaction_id,
voice.pipeline_step,
voice.conversation_id,
CASE voice.pipeline_step
WHEN 'stt' THEN stt.audio_duration_sec
WHEN 'llm' THEN tokens.total
WHEN 'tts' THEN tts.duration_sec
END as metric,
total_cost,
timing.latency_ms
FROM `demeterics.demeterics.interactions`
WHERE meta.session = @conversation_id
ORDER BY timing.question_time
Realtime Session Tracking
SELECT
realtime.session_id,
realtime.turn_number,
realtime.audio_in_minutes,
realtime.audio_out_minutes,
realtime.text_input_tokens,
realtime.text_output_tokens,
total_cost
FROM `demeterics.demeterics.interactions`
WHERE interaction_type = 'realtime'
AND user_id = @user_id
ORDER BY timing.question_time DESC
Best Practices
1. Choose the right STT model:
   - Groq Whisper for cost-effective batch processing
   - OpenAI for high-accuracy or diarization needs
2. Optimize voice conversations:
   - Use conversation_id for multi-turn tracking
   - Keep system prompts concise to reduce LLM latency
   - Choose fast TTS models for real-time applications
3. Realtime API tips:
   - Use server VAD for natural conversation flow
   - Handle response.done events for billing reconciliation
   - Implement reconnection logic for long sessions
4. Monitor costs:
   - Track usage via BigQuery queries
   - Set up alerts for unusual spending patterns
   - Use BYOK for high-volume applications
5. Handle errors gracefully:
   - Implement retry logic with exponential backoff
   - Fall back to text when audio fails
   - Log error codes for debugging