Voice API
Beta Access Required: The Voice APIs require whitelisted access.
To request access, email sales@demeterics.com with:
- Subject: "Feature Access Request"
- Feature name(s) needed:
  - "Speech-to-Text (STT)" - For the transcription API
  - "Voice Conversation" - For the STT→LLM→TTS pipeline
  - "Realtime API" - For WebSocket realtime connections
The Demeterics Voice API provides a complete voice AI platform with Speech-to-Text (STT), Voice Conversation pipelines (STT→LLM→TTS), and an OpenAI Realtime WebSocket proxy. All endpoints automatically track usage and costs and store every interaction in BigQuery.
Overview
Base URLs:
- STT: https://api.demeterics.com/audio/v1
- Voice Conversation: https://api.demeterics.com/voice/v1
- Realtime WebSocket: wss://api.demeterics.com/realtime/v1
Features:
- Multi-provider STT: Groq Whisper and OpenAI transcription models
- Voice Conversation: Complete STT→LLM→TTS pipeline in one call
- Realtime WebSocket: Proxy to OpenAI Realtime API with automatic billing
- Auto-tracking: Every request logged to BigQuery with full observability
- Session Linking: Voice conversations create linked BQ rows via meta.session
- BYOK Support: Use your own provider API keys with dual-key authentication
- Universal Tagging: Add /// KEY value metadata to all interactions
Authentication
Managed Keys (Default)
Use only your Demeterics API key:
curl -X POST https://api.demeterics.com/audio/v1/transcriptions \
-H "Authorization: Bearer dmt_your_api_key" \
-F file=@audio.mp3 \
-F model=whisper-large-v3-turbo
Bring Your Own Key (BYOK)
Use the dual-key format to provide your own provider API key:
curl -X POST https://api.demeterics.com/audio/v1/transcriptions \
-H "Authorization: Bearer dmt_your_api_key;sk-your_openai_key" \
-F file=@audio.mp3 \
-F model=gpt-4o-transcribe
The format is: [demeterics_api_key];[provider_api_key]
BYOK Benefits:
- 10% service fee instead of 15%
- Use your own rate limits and quotas
- Provider costs billed directly to your account
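The dual-key header can be assembled programmatically. A minimal Python sketch (the key values are placeholders and `auth_header` is an illustrative helper, not part of an official SDK):

```python
DEMETERICS_KEY = "dmt_your_api_key"  # placeholder


def auth_header(provider_key=None):
    """Build the Authorization header: managed by default, BYOK if a provider key is given."""
    token = DEMETERICS_KEY if provider_key is None else f"{DEMETERICS_KEY};{provider_key}"
    return {"Authorization": f"Bearer {token}"}


auth_header()                      # managed: Bearer dmt_your_api_key
auth_header("sk-your_openai_key")  # BYOK:    Bearer dmt_your_api_key;sk-your_openai_key
```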
Speech-to-Text API
Transcribe Audio
POST /audio/v1/transcriptions
Convert audio to text using Groq Whisper or OpenAI transcription models.
Request (multipart/form-data):
| Field | Type | Required | Description |
|---|---|---|---|
| file | file | Yes | Audio file (mp3, wav, m4a, webm, ogg, flac) |
| model | string | Yes | STT model (see providers below) |
| language | string | No | ISO 639-1 language code (e.g., "en", "es") |
| prompt | string | No | Context to guide transcription |
| response_format | string | No | json, text, srt, vtt, verbose_json |
| temperature | float | No | 0.0-1.0; lower is more deterministic |
| timestamp_granularities | string | No | word, segment, or both |
| tags | string | No | Newline-separated KEY value metadata |
Example Request:
curl -X POST https://api.demeterics.com/audio/v1/transcriptions \
-H "Authorization: Bearer dmt_your_api_key" \
-F file=@meeting.mp3 \
-F model=whisper-large-v3-turbo \
-F language=en \
-F 'tags=APP voicebot
FLOW customer_support
SESSION conv_12345'
Response:
{
  "id": "stt_01JARV4HZ6XPQMWVCS9N1GKEFD",
  "text": "Hello, I'd like to check on my order status.",
  "language": "en",
  "duration": 3.5,
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 3.5,
      "text": "Hello, I'd like to check on my order status."
    }
  ],
  "words": [
    {"word": "Hello", "start": 0.0, "end": 0.5},
    {"word": "I'd", "start": 0.6, "end": 0.8}
  ],
  "cost": {
    "provider_cost": 0.0004,
    "service_fee": 0.00006,
    "total_cost": 0.00046
  }
}
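The same request can be made from Python. A sketch using the third-party requests library (`build_form` and `transcribe` are illustrative helper names, not an official SDK):

```python
import requests  # third-party: pip install requests

API_URL = "https://api.demeterics.com/audio/v1/transcriptions"


def build_form(model, language=None, tags=None):
    """Assemble the non-file multipart fields, skipping unset options."""
    form = {"model": model}
    if language:
        form["language"] = language
    if tags:
        form["tags"] = tags
    return form


def transcribe(path, api_key, **options):
    """POST an audio file for transcription and return the parsed JSON."""
    with open(path, "rb") as f:
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": f},
            data=build_form(**options),
        )
    resp.raise_for_status()
    return resp.json()


# result = transcribe("meeting.mp3", "dmt_your_api_key",
#                     model="whisper-large-v3-turbo", language="en",
#                     tags="APP voicebot\nSESSION conv_12345")
# print(result["text"], result["cost"]["total_cost"])
```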
List STT Models
GET /audio/v1/models
List available speech-to-text models.
Response:
{
  "models": [
    {
      "id": "whisper-large-v3-turbo",
      "provider": "groq",
      "description": "Fast, cost-effective transcription",
      "languages": ["en", "es", "fr", "de", "..."],
      "pricing": {
        "unit": "hour",
        "cost_per_unit": 0.04
      }
    },
    {
      "id": "gpt-4o-transcribe",
      "provider": "openai",
      "description": "High-accuracy transcription",
      "languages": ["en", "es", "fr", "de", "..."],
      "pricing": {
        "unit": "minute",
        "cost_per_unit": 0.006
      }
    }
  ]
}
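One practical use of this endpoint is comparing models whose pricing uses different units. A small sketch over the response shape above (`cheapest_model` is an illustrative helper):

```python
def cheapest_model(models_response):
    """Return the model id with the lowest cost per hour of audio."""
    def hourly_cost(m):
        pricing = m["pricing"]
        # Normalize per-minute rates to per-hour for an apples-to-apples comparison
        return pricing["cost_per_unit"] * (60 if pricing["unit"] == "minute" else 1)
    return min(models_response["models"], key=hourly_cost)["id"]


sample = {"models": [
    {"id": "whisper-large-v3-turbo", "pricing": {"unit": "hour", "cost_per_unit": 0.04}},
    {"id": "gpt-4o-transcribe", "pricing": {"unit": "minute", "cost_per_unit": 0.006}},
]}
cheapest_model(sample)  # $0.04/hour vs $0.006 * 60 = $0.36/hour
```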
STT Providers
Groq (Whisper)
Models:
- whisper-large-v3 - Highest accuracy, $0.111/hour
- whisper-large-v3-turbo - Fast and cost-effective, $0.04/hour
Supported Formats: mp3, mp4, mpeg, mpga, m4a, wav, webm, ogg, flac
Max File Size: 25 MB
Features:
- Multi-language support (57+ languages)
- Word-level timestamps
- Segment-level timestamps
OpenAI (GPT-4o Transcribe)
Models:
- gpt-4o-transcribe - High accuracy, $0.006/minute
- gpt-4o-mini-transcribe - Cost-effective, $0.003/minute
- gpt-4o-transcribe-diarize - Speaker diarization, $0.006/minute (beta)
Supported Formats: mp3, mp4, mpeg, mpga, m4a, wav, webm
Max File Size: 25 MB
Features:
- High accuracy for complex audio
- Speaker diarization (with diarize model)
- Context-aware with prompt field
Voice Conversation API
The Voice Conversation API provides a complete STT→LLM→TTS pipeline in a single request. Perfect for voice assistants, phone bots, and conversational AI.
Process Voice Turn
POST /voice/v1/conversation
Process a voice conversation turn: transcribe audio, generate LLM response, and synthesize speech.
Request (multipart/form-data):
| Field | Type | Required | Description |
|---|---|---|---|
| file | file | Yes | Audio input (mp3, wav, m4a, webm, ogg) |
| stt_model | string | Yes | STT model (e.g., whisper-large-v3-turbo) |
| stt_language | string | No | ISO 639-1 language code |
| llm_model | string | Yes | LLM model (e.g., llama-3.3-70b-versatile) |
| llm_provider | string | No | LLM provider (inferred from model if omitted) |
| system_prompt | string | No | System instructions for the LLM |
| max_tokens | int | No | Max response tokens (default: 1024) |
| temperature | float | No | LLM temperature (default: 0.7) |
| conversation_id | string | No | For multi-turn conversations |
| tts_model | string | No | TTS model (e.g., canopylabs/orpheus-v1-english) |
| tts_provider | string | No | TTS provider (e.g., groq) |
| tts_voice | string | No | Voice ID (e.g., tara) |
| tts_speed | float | No | Playback speed (0.25-4.0) |
| tags | string | No | Newline-separated KEY value metadata |
Example Request:
curl -X POST https://api.demeterics.com/voice/v1/conversation \
-H "Authorization: Bearer dmt_your_api_key" \
-F file=@question.mp3 \
-F stt_model=whisper-large-v3-turbo \
-F llm_model=llama-3.3-70b-versatile \
-F system_prompt="You are a helpful customer service agent." \
-F tts_model=canopylabs/orpheus-v1-english \
-F tts_voice=tara \
-F conversation_id=conv_12345 \
-F 'tags=APP customer_service
FLOW order_inquiry'
Response:
{
  "id": "conv_01JARV4HZ6XPQMWVCS9N1GKEFD",
  "transcript": "What's the status of my order?",
  "language": "en",
  "response": "I'd be happy to help you check your order status. Could you please provide your order number?",
  "audio_url": "https://storage.googleapis.com/demeterics-data/voice/...",
  "timing": {
    "stt_latency_ms": 150,
    "llm_latency_ms": 450,
    "tts_latency_ms": 200,
    "total_latency_ms": 800
  },
  "cost": {
    "stt_cost_usd": 0.0004,
    "llm_cost_usd": 0.002,
    "tts_cost_usd": 0.005,
    "service_fee": 0.001,
    "total_cost_usd": 0.0084
  }
}
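The same turn can be driven from Python. A sketch using the third-party requests library (`turn_fields` and `voice_turn` are illustrative helper names):

```python
import requests  # third-party: pip install requests

CONVERSATION_URL = "https://api.demeterics.com/voice/v1/conversation"


def turn_fields(stt_model, llm_model, tts_model=None, tts_voice=None,
                system_prompt=None, conversation_id=None):
    """Assemble the multipart fields, dropping options that were not set."""
    fields = {
        "stt_model": stt_model,
        "llm_model": llm_model,
        "tts_model": tts_model,
        "tts_voice": tts_voice,
        "system_prompt": system_prompt,
        "conversation_id": conversation_id,
    }
    return {k: v for k, v in fields.items() if v is not None}


def voice_turn(path, api_key, **options):
    """Run one STT→LLM→TTS turn and return the parsed response body."""
    with open(path, "rb") as f:
        resp = requests.post(
            CONVERSATION_URL,
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": f},
            data=turn_fields(**options),
        )
    resp.raise_for_status()
    return resp.json()


# body = voice_turn("question.mp3", "dmt_your_api_key",
#                   stt_model="whisper-large-v3-turbo",
#                   llm_model="llama-3.3-70b-versatile",
#                   tts_model="canopylabs/orpheus-v1-english",
#                   tts_voice="tara", conversation_id="conv_12345")
# print(body["transcript"], "->", body["response"])
```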
Stream Voice Turn (SSE)
POST /voice/v1/conversation/stream
Stream the same STT→LLM→TTS pipeline over Server-Sent Events for live playback while the response is still being generated.
Request (multipart/form-data): same fields as POST /voice/v1/conversation, plus:
| Field | Type | Required | Description |
|---|---|---|---|
| tts_format | string | No | Output format (mp3, wav, ogg, flac, pcm); mp3 recommended for streaming |
Event Types:
- start - Conversation ID is available
- transcript - STT output (text + latency)
- response - LLM output (text + latency)
- audio - Base64 audio chunk (if TTS enabled)
- audio_complete - Summary for streamed audio
- done - Final timing + cost summary (also includes audio_url if uploaded)
- error - Error event with step and message
Audio Event Payload:
{
  "chunk": "BASE64_MP3_BYTES",
  "chunk_index": 0,
  "is_last": false,
  "format": "mp3",
  "elapsed_ms": 312
}
audio_complete Payload:
{
  "total_chunks": 185,
  "total_bytes": 229522,
  "elapsed_ms": 9656
}
done Payload (summary):
{
  "conversation_id": "conv_01KDEXYHB89BK410RYT7CHRMG1",
  "transcript": "Count a number from 1 to 20.",
  "response": "Here we go: 1, 2, 3, ...",
  "audio_url": "https://storage.googleapis.com/demeterics-data/voice/...",
  "timing": {
    "stt_latency_ms": 150,
    "llm_latency_ms": 450,
    "tts_latency_ms": 200,
    "total_latency_ms": 800
  },
  "cost": {
    "stt_cost_usd": 0.0004,
    "llm_cost_usd": 0.002,
    "tts_cost_usd": 0.005,
    "service_fee": 0.001,
    "total_cost_usd": 0.0084
  }
}
Minimal JS Example (SSE over fetch):
const resp = await fetch('https://api.demeterics.com/voice/v1/conversation/stream', {
  method: 'POST',
  headers: { Authorization: `Bearer ${apiKey}` },
  body: formData
});

const reader = resp.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split('\n');
  buffer = lines.pop() || '';
  for (const line of lines) {
    if (line.startsWith('event:')) continue;
    if (!line.startsWith('data:')) continue;
    const data = JSON.parse(line.slice(5).trim());
    // Handle transcript/response/audio/done as needed
  }
}
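On the receiving side, the base64 `audio` events can be reassembled into playable bytes. A Python sketch (`assemble_audio` is an illustrative helper; the sample events are synthetic):

```python
import base64


def assemble_audio(audio_events):
    """Decode and concatenate `audio` event chunks in chunk_index order."""
    ordered = sorted(audio_events, key=lambda e: e["chunk_index"])
    return b"".join(base64.b64decode(e["chunk"]) for e in ordered)


# Synthetic events standing in for parsed SSE `audio` payloads
events = [
    {"chunk": base64.b64encode(b"ID3").decode(), "chunk_index": 0, "is_last": False},
    {"chunk": base64.b64encode(b"rest").decode(), "chunk_index": 1, "is_last": True},
]
audio_bytes = assemble_audio(events)
# open("reply.mp3", "wb").write(audio_bytes)
```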
Session Tracking
Voice conversations create three linked BigQuery rows per turn:
- STT row: interaction_type = 'stt' with transcription details
- LLM row: interaction_type = 'llm' with prompt/response
- TTS row: interaction_type = 'tts' with audio details
All rows share the same meta.session value (the conversation_id), enabling:
- End-to-end latency analysis
- Cost breakdown by pipeline step
- Conversation flow tracking
Query Example:
SELECT
transaction_id,
voice.pipeline_step,
total_cost,
timing.latency_ms
FROM `demeterics.demeterics.interactions`
WHERE meta.session = 'conv_12345'
ORDER BY timing.question_time
Realtime WebSocket API
The Realtime API provides a WebSocket proxy to OpenAI's Realtime API with automatic billing, tag injection, and per-turn tracking.
Connect
wss://api.demeterics.com/realtime/v1?model={model}
Establish a WebSocket connection to the OpenAI Realtime API.
Query Parameters:
| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | Model name (e.g., gpt-4o-realtime-preview) |
| tags | string | No | URL-encoded newline-separated KEY value metadata |
Connection Example (JavaScript - Node.js with the ws package; the browser WebSocket API cannot send custom headers, so browser clients need a different auth mechanism):
const ws = new WebSocket(
  'wss://api.demeterics.com/realtime/v1?model=gpt-4o-realtime-preview',
  {
    headers: {
      'Authorization': 'Bearer dmt_your_api_key'
    }
  }
);

ws.onopen = () => {
  // Send session.update to configure the session
  ws.send(JSON.stringify({
    type: 'session.update',
    session: {
      modalities: ['text', 'audio'],
      instructions: 'You are a helpful assistant.',
      voice: 'alloy',
      input_audio_format: 'pcm16',
      output_audio_format: 'pcm16',
      turn_detection: {
        type: 'server_vad',
        threshold: 0.5,
        prefix_padding_ms: 300,
        silence_duration_ms: 500
      }
    }
  }));
};

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === 'response.audio.delta') {
    // Handle audio chunk
    const audioData = atob(data.delta);
    playAudio(audioData);
  }
  if (data.type === 'response.done') {
    // Response complete
    console.log('Turn complete:', data.response);
  }
};
Connection Example (Python):
import asyncio
import websockets
import json

async def realtime_chat():
    uri = "wss://api.demeterics.com/realtime/v1?model=gpt-4o-realtime-preview"
    headers = {"Authorization": "Bearer dmt_your_api_key"}
    # Note: on websockets >= 14 the parameter is named additional_headers
    async with websockets.connect(uri, extra_headers=headers) as ws:
        # Configure session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "instructions": "You are a helpful assistant.",
                "voice": "alloy"
            }
        }))
        # Send text input
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "message",
                "role": "user",
                "content": [{"type": "input_text", "text": "Hello!"}]
            }
        }))
        await ws.send(json.dumps({"type": "response.create"}))
        # Receive responses
        async for message in ws:
            data = json.loads(message)
            if data["type"] == "response.text.delta":
                print(data["delta"], end="")
            if data["type"] == "response.done":
                break

asyncio.run(realtime_chat())
Tag Injection
Tags provided via the tags query parameter are automatically injected into your session.update instructions. For example:
tags=APP%20voicebot%0AFLOW%20support
Gets injected as:
/// APP voicebot
/// FLOW support
[Your instructions here]
Per-Turn Billing
Each response.done event triggers:
- Cost calculation based on token usage
- Credit deduction from your account
- BigQuery row insertion with full metrics
Realtime Pricing (per 1M tokens):
| Token Type | Input Cost | Output Cost |
|---|---|---|
| Text | $5.00 | $20.00 |
| Audio | $100.00 | $200.00 |
Cached tokens receive a 50% discount on input costs.
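The table above makes per-turn costs easy to estimate. A worked Python sketch (token counts are illustrative, and only cached text input is modeled here; cached audio input would be discounted the same way):

```python
# (input, output) USD per 1M tokens, from the Realtime pricing table
RATES = {"text": (5.00, 20.00), "audio": (100.00, 200.00)}


def turn_cost(text_in, text_out, audio_in, audio_out, cached_text_in=0):
    """Estimate one turn's cost; cached input tokens get a 50% discount."""
    total = (
        (text_in - cached_text_in) * RATES["text"][0]
        + cached_text_in * RATES["text"][0] * 0.5   # 50% discount on cached input
        + text_out * RATES["text"][1]
        + audio_in * RATES["audio"][0]
        + audio_out * RATES["audio"][1]
    )
    return round(total / 1_000_000, 6)


# 1,000 text in (400 cached), 300 text out, 2,000 audio in, 1,500 audio out:
# 600*5 + 400*2.5 + 300*20 + 2000*100 + 1500*200 = 510,000 per-1M units -> $0.51
turn_cost(1000, 300, 2000, 1500, cached_text_in=400)
```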
WebSocket Events
Client Events (send):
- session.update - Configure session settings
- conversation.item.create - Add message to conversation
- input_audio_buffer.append - Stream audio input
- input_audio_buffer.commit - Commit audio buffer
- response.create - Request a response
- response.cancel - Cancel ongoing response
Server Events (receive):
- session.created - Session established
- session.updated - Session configuration updated
- response.created - Response started
- response.text.delta - Text chunk
- response.audio.delta - Audio chunk (base64)
- response.audio_transcript.delta - Audio transcript chunk
- response.done - Response complete with usage stats
- error - Error occurred
Universal Tagging
All Voice APIs support the /// KEY value tag format for metadata injection.
Tag Format
Tags are newline-separated KEY value pairs:
APP voicebot
FLOW customer_support
SESSION conv_12345
USER user_789
VARIANT ab_test_v2
Adding Tags
STT/Conversation (form field):
-F 'tags=APP voicebot
FLOW support'
Realtime WebSocket (query parameter):
?tags=APP%20voicebot%0AFLOW%20support
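For the WebSocket query parameter, the newline-separated pairs must be URL-encoded. A small Python sketch (`encode_tags` is an illustrative helper):

```python
from urllib.parse import quote


def encode_tags(tags: dict) -> str:
    """Join KEY value pairs with newlines, then URL-encode for the query string."""
    joined = "\n".join(f"{key} {value}" for key, value in tags.items())
    return quote(joined)


encode_tags({"APP": "voicebot", "FLOW": "support"})
# -> 'APP%20voicebot%0AFLOW%20support'
```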
Supported Tag Keys
| Key | Description | Example |
|---|---|---|
| APP | Application name | voicebot |
| FLOW | Business flow | customer_support |
| PRODUCT | Product line | enterprise |
| COMPANY | Client/customer | acme_corp |
| USER | User identifier | user_12345 |
| SESSION | Conversation ID | conv_789 |
| VARIANT | A/B test variant | new_prompt_v2 |
| VERSION | Prompt version | 1.2.3 |
| ENVIRONMENT | Environment | production |
| PROJECT | Project ID | proj_456 |
| COHORT | Cohort ID | video_123 |
Pricing Summary
STT Pricing (Managed Keys)
| Provider | Model | Unit | Cost |
|---|---|---|---|
| Groq | whisper-large-v3 | Hour | $0.111 |
| Groq | whisper-large-v3-turbo | Hour | $0.04 |
| OpenAI | gpt-4o-transcribe | Minute | $0.006 |
| OpenAI | gpt-4o-mini-transcribe | Minute | $0.003 |
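These rates can be turned into a quick cost estimator by normalizing the per-hour and per-minute units to seconds and adding the 15% managed-key markup described under Service Fees. A Python sketch (`stt_cost` is an illustrative helper):

```python
# USD rate per unit, with the unit's duration in seconds (from the table above)
STT_RATES = {
    "whisper-large-v3": (0.111, 3600),
    "whisper-large-v3-turbo": (0.04, 3600),
    "gpt-4o-transcribe": (0.006, 60),
    "gpt-4o-mini-transcribe": (0.003, 60),
}


def stt_cost(model, duration_sec, fee=0.15):
    """Estimate managed-key cost: provider rate plus the service fee markup."""
    rate, unit_sec = STT_RATES[model]
    provider = duration_sec / unit_sec * rate
    return round(provider * (1 + fee), 6)


stt_cost("whisper-large-v3-turbo", 1800)  # 30 min: $0.02 provider + 15% = $0.023
stt_cost("gpt-4o-transcribe", 1800)       # 30 min: $0.18 provider + 15% = $0.207
```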
TTS Pricing (Managed Keys)
| Provider | Model | Unit | Cost |
|---|---|---|---|
| Groq | canopylabs/orpheus-v1-english | 1M chars | $22.00 |
| OpenAI | gpt-4o-mini-tts | 1M chars | $0.60 |
| OpenAI | tts-1 | 1M chars | $15.00 |
| OpenAI | tts-1-hd | 1M chars | $30.00 |
Note: PlayAI TTS models (playai-tts, playai-tts-arabic) are deprecated and will be decommissioned on December 31, 2025. Please migrate to canopylabs/orpheus-v1-english.
Realtime Pricing (per 1M tokens)
| Token Type | Input | Output |
|---|---|---|
| Text | $5.00 | $20.00 |
| Audio | $100.00 | $200.00 |
Service Fees
- Managed Keys: 15% markup on provider costs
- BYOK: 10% service fee on tracked usage
Error Handling
Error Response Format:
{
  "error": {
    "type": "invalid_request",
    "message": "Audio file format not supported",
    "code": "unsupported_format"
  }
}
Common Error Codes:
| Code | HTTP Status | Description |
|---|---|---|
| unsupported_format | 400 | Audio format not supported |
| file_too_large | 400 | File exceeds size limit |
| invalid_model | 400 | Unknown model specified |
| insufficient_credits | 402 | Not enough credits |
| not_whitelisted | 403 | User not in beta whitelist |
| rate_limited | 429 | Too many requests |
| provider_error | 502 | Provider API failed |
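Of these codes, 429 and 502 are usually transient and worth retrying with backoff, while 4xx client errors are not. A sketch using the third-party requests library (`post_with_retry` and `backoff_seconds` are illustrative names):

```python
import random
import time

import requests  # third-party: pip install requests

RETRYABLE_STATUSES = {429, 502}  # rate_limited, provider_error


def backoff_seconds(attempt, cap=30):
    """Exponential backoff: 1s, 2s, 4s, ... capped at `cap` seconds."""
    return min(2 ** attempt, cap)


def post_with_retry(url, max_attempts=5, **kwargs):
    """POST with retries on transient errors; raise immediately on other failures."""
    for attempt in range(max_attempts):
        resp = requests.post(url, **kwargs)
        if resp.status_code not in RETRYABLE_STATUSES:
            resp.raise_for_status()  # non-retryable errors (400/402/403) raise here
            return resp.json()
        # Jitter keeps concurrent clients from retrying in lockstep
        time.sleep(backoff_seconds(attempt) + random.random())
    raise RuntimeError(f"gave up after {max_attempts} attempts")
```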
BigQuery Integration
STT Tracking Schema
SELECT
transaction_id,
provider,
model,
stt.audio_duration_sec,
stt.transcript_chars,
stt.language,
stt.diarization,
stt.speaker_count,
total_cost
FROM `demeterics.demeterics.interactions`
WHERE interaction_type = 'stt'
AND user_id = @user_id
AND timing.question_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY timing.question_time DESC
Voice Conversation Tracking
-- Get full conversation pipeline
SELECT
transaction_id,
voice.pipeline_step,
voice.conversation_id,
CASE voice.pipeline_step
WHEN 'stt' THEN stt.audio_duration_sec
WHEN 'llm' THEN tokens.total
WHEN 'tts' THEN tts.duration_sec
END as metric,
total_cost,
timing.latency_ms
FROM `demeterics.demeterics.interactions`
WHERE meta.session = @conversation_id
ORDER BY timing.question_time
Realtime Session Tracking
SELECT
realtime.session_id,
realtime.turn_number,
realtime.audio_in_minutes,
realtime.audio_out_minutes,
realtime.text_input_tokens,
realtime.text_output_tokens,
total_cost
FROM `demeterics.demeterics.interactions`
WHERE interaction_type = 'realtime'
AND user_id = @user_id
ORDER BY timing.question_time DESC
Best Practices
1. Choose the right STT model:
   - Groq Whisper for cost-effective batch processing
   - OpenAI for high-accuracy or diarization needs
2. Optimize voice conversations:
   - Use conversation_id for multi-turn tracking
   - Keep system prompts concise to reduce LLM latency
   - Choose fast TTS models for real-time applications
3. Realtime API tips:
   - Use server VAD for natural conversation flow
   - Handle response.done events for billing reconciliation
   - Implement reconnection logic for long sessions
4. Monitor costs:
   - Track usage via BigQuery queries
   - Set up alerts for unusual spending patterns
   - Use BYOK for high-volume applications
5. Handle errors gracefully:
   - Implement retry logic with exponential backoff
   - Fall back to text when audio fails
   - Log error codes for debugging