
Batch STT

The Batch STT API is an HTTP REST API that converts audio files to text.

Supported formats

tip

Batch STT supports mp4, m4a, mp3, amr, flac, wav.

Authentication token

Obtain a token via the Authentication guide before using Batch STT.


API list

| Method | URL | Description |
| --- | --- | --- |
| POST | /v1/transcribe | Submit a job |
| GET | /v1/transcribe/{TRANSCRIBE_ID} | Get job result |

1) [POST] /v1/transcribe

Submits a transcription job for a stored audio file.

HTTP request

POST https://openapi.vito.ai/v1/transcribe

Request headers

Authorization: Bearer {YOUR_JWT_TOKEN}
  • scheme: bearer
  • bearerFormat: JWT

Request body

content-type: multipart/form-data

| Field | Type | Required |
| --- | --- | --- |
| config | RequestConfig | required |
| file | Binary | required |

RequestConfig

| Name | Desc | Type | Required | Value | Default |
| --- | --- | --- | --- | --- | --- |
| model_name | Recognition model | string | optional | sommers, whisper | sommers |
| language | Recognition language, whisper-only | string | optional | ko, detect, multi | ko |
| language_candidates | Language detection candidates | array | optional | - | ["ko","ja","zh","en"] |
| use_diarization | Speaker diarization | boolean | optional | - | false |
| diarization.spk_count | Number of speakers, effective when use_diarization is true | integer | optional | >= 0 | 0 (auto) |
| use_itn | English/Number/Unit normalization | boolean | optional | - | true |
| use_disfluency_filter | Disfluency filter | boolean | optional | - | true |
| use_profanity_filter | Profanity filter | boolean | optional | - | false |
| use_paragraph_splitter | Paragraph splitter | boolean | optional | - | true |
| paragraph_splitter.max | Max characters per paragraph, effective when use_paragraph_splitter is true | integer | optional | >= 1 | 50 |
| domain | Domain | string | optional | GENERAL, CALL | GENERAL |
| use_word_timestamp | Word-level timestamps | boolean | optional | - | false |
| keywords | Keyword boosting | array | optional | - | - |

caution
  • POST concurrency: the number of jobs that can be in flight at once is governed by the Rate limit policy. Use the GET API to check whether a job has finished.
  • Maximum file size: 2GB. Maximum duration: 4 hours.
  • Jobs are processed in order. With long files or during busy periods, a job may wait 30 minutes or more before it starts.
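
The format and size limits can be checked locally before uploading, which avoids a round trip that would end in H0010 or H0005. The following is a minimal sketch, not part of the API: the function name `preflight_problems` is hypothetical, and reading "2GB" as 2 * 1024**3 bytes is an assumption. The 4-hour duration limit is not checked because that would require decoding the audio.

```python
import os
from typing import List

# Extensions taken from the "Supported formats" list.
SUPPORTED_EXTENSIONS = {".mp4", ".m4a", ".mp3", ".amr", ".flac", ".wav"}
# Assumption: "2GB" means 2 * 1024**3 bytes; the exact unit is not documented here.
MAX_FILE_BYTES = 2 * 1024**3


def preflight_problems(path: str) -> List[str]:
    """Return human-readable problems; an empty list means the file looks acceptable."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if not os.path.isfile(path):
        problems.append("file does not exist")
    elif os.path.getsize(path) > MAX_FILE_BYTES:
        problems.append("file is larger than 2GB")
    return problems
```

A caller would typically run this right before the POST request and skip the upload when the list is non-empty.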

Sample code 1

transcribe.sh
curl -X "POST" \
"https://openapi.vito.ai/v1/transcribe" \
-H "accept: application/json" \
-H "Authorization: Bearer ${YOUR_JWT_TOKEN}" \
-H "Content-Type: multipart/form-data" \
-F "file=@sample.wav" \
-F 'config={}'

Sample code 2

transcribe.sh
curl -X "POST" \
"https://openapi.vito.ai/v1/transcribe" \
-H "accept: application/json" \
-H "Authorization: Bearer ${YOUR_JWT_TOKEN}" \
-H "Content-Type: multipart/form-data" \
-F "file=@sample.wav" \
-F 'config={
"use_diarization": true,
"diarization": {
"spk_count": 2
},
"use_itn": false,
"use_disfluency_filter": false,
"use_profanity_filter": false,
"use_paragraph_splitter": true,
"paragraph_splitter": {
"max": 50
}
}'
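
When the config is built in code rather than typed into a shell command, it can help to check it against the documented value ranges before serializing it into the `config` form field. The sketch below covers only the constraints listed in the RequestConfig table; the helper names `config_problems` and `config_form_field` are hypothetical, and the server remains the authority on what is valid.

```python
import json
from typing import Any, Dict, List

ALLOWED_MODELS = {"sommers", "whisper"}
ALLOWED_DOMAINS = {"GENERAL", "CALL"}


def config_problems(config: Dict[str, Any]) -> List[str]:
    """Check a RequestConfig dict against the documented values and ranges."""
    problems = []
    if config.get("model_name", "sommers") not in ALLOWED_MODELS:
        problems.append("model_name must be sommers or whisper")
    if config.get("domain", "GENERAL") not in ALLOWED_DOMAINS:
        problems.append("domain must be GENERAL or CALL")
    spk_count = config.get("diarization", {}).get("spk_count", 0)
    if not isinstance(spk_count, int) or spk_count < 0:
        problems.append("diarization.spk_count must be an integer >= 0")
    para_max = config.get("paragraph_splitter", {}).get("max", 50)
    if not isinstance(para_max, int) or para_max < 1:
        problems.append("paragraph_splitter.max must be an integer >= 1")
    return problems


def config_form_field(config: Dict[str, Any]) -> str:
    """Serialize the config for the multipart "config" field, refusing invalid input."""
    problems = config_problems(config)
    if problems:
        raise ValueError("; ".join(problems))
    return json.dumps(config)
```

An empty dict passes the check, which matches Sample code 1, where `config={}` relies entirely on the defaults.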

Response body

On success:

{
"id": "{TRANSCRIBE_ID}"
}

Error codes

| HTTP Status | Code | Notes |
| --- | --- | --- |
| 400 | H0001 | Invalid parameter |
| 400 | H0010 | Unsupported file type |
| 401 | H0002 | Invalid token |
| 413 | H0005 | File size exceeded |
| 413 | H0006 | File length exceeded |
| 429 | A0001 | Usage exceeded |
| 429 | A0002 | Concurrency exceeded |
| 500 | E500 | Server error |

Example failure:

{
"code": "H0001",
"msg": "unexpected end of JSON input"
}
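
A client can turn the `code` and `msg` fields of a failed POST response into an exception that records whether a retry is worthwhile. This is a sketch under stated assumptions: `SubmitError` and `submit_error_from_response` are hypothetical names, and treating A0002 (concurrency exceeded) and E500 (server error) as retryable is a guess about sensible client behavior, not documented policy.

```python
from typing import Any, Dict

# Assumption: these codes describe transient conditions worth retrying later.
RETRYABLE_CODES = {"A0002", "E500"}


class SubmitError(Exception):
    """Error built from the JSON body of a failed POST /v1/transcribe response."""

    def __init__(self, status: int, code: str, msg: str) -> None:
        super().__init__(f"HTTP {status} {code}: {msg}")
        self.status = status
        self.code = code
        self.msg = msg
        self.retryable = code in RETRYABLE_CODES


def submit_error_from_response(status: int, body: Dict[str, Any]) -> SubmitError:
    """Build a SubmitError from the HTTP status and the parsed error body."""
    return SubmitError(status, str(body.get("code", "")), str(body.get("msg", "")))
```

With the example failure above, the resulting message reads `HTTP 400 H0001: unexpected end of JSON input` and `retryable` is false, since resubmitting the same malformed config would fail again.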

2) [GET] /v1/transcribe/{TRANSCRIBE_ID}

Fetches the transcription result for the TRANSCRIBE_ID returned by the POST API.

HTTP request

GET https://openapi.vito.ai/v1/transcribe/{TRANSCRIBE_ID}

Request headers

Authorization: Bearer {YOUR_JWT_TOKEN}
  • scheme: bearer
  • bearerFormat: JWT

Sample code

get_transcript.sh
curl -X "GET" \
"https://openapi.vito.ai/v1/transcribe/${TRANSCRIBE_ID}" \
-H "accept: application/json" \
-H "Authorization: Bearer ${YOUR_JWT_TOKEN}"

Response body

On success (selected fields):

| Name | Desc | Type | Value |
| --- | --- | --- | --- |
| id | transcribe id | string | - |
| status | status of the job | string | transcribing, completed, failed |
| results.utterances | utterance list | array | - |
| results.utterances.start_at | utterance start time (ms) | integer | - |
| results.utterances.duration | utterance duration (ms) | integer | - |
| results.utterances.msg | utterance text | string | - |
| results.utterances.spk | speaker/channel id | integer | - |
| results.utterances.lang | language value or detected language for detect/multi | string | ISO 639-1 |

tip

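Because long files take time to process, Batch STT results are retrieved by polling. While status is transcribing, call the GET API about every 5 seconds until status becomes completed or failed. Polling at shorter intervals may result in HTTP 429.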
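
The polling advice can be sketched as a loop that waits about 5 seconds between calls and slows down when it receives HTTP 429. The code takes a `fetch` callable returning `(http_status, json_body)` so that it is independent of any HTTP library. The name `poll_until_done`, the doubling backoff capped at 60 seconds, and the expectation that a successful GET returns HTTP 200 are all assumptions made for illustration.

```python
import time
from typing import Any, Callable, Dict, Tuple


def poll_until_done(
    fetch: Callable[[], Tuple[int, Dict[str, Any]]],
    interval_sec: float = 5.0,
    timeout_sec: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, Any]:
    """Call fetch() until the job status is completed or failed, then return the body."""
    deadline = clock() + timeout_sec
    wait = interval_sec
    while True:
        status_code, body = fetch()
        if status_code == 429:
            # Rate limited: double the wait, capped at 60 seconds (assumed policy).
            wait = min(wait * 2, 60.0)
        elif status_code != 200:
            raise RuntimeError(f"unexpected HTTP {status_code}: {body}")
        elif body.get("status") in ("completed", "failed"):
            return body
        else:
            # Still transcribing: go back to the normal interval.
            wait = interval_sec
        if clock() + wait > deadline:
            raise TimeoutError("timed out waiting for transcription result")
        sleep(wait)
```

In real use, `fetch` would wrap the GET request shown above and return the response status code together with the parsed JSON body.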

status: transcribing

{
"id": "{TRANSCRIBE_ID}",
"status": "transcribing"
}

status: completed

{
"id": "{TRANSCRIBE_ID}",
"status": "completed",
"results": {
"utterances": [
{
"start_at": 4737,
"duration": 2360,
"msg": "Hello.",
"spk": 0,
"lang": "en"
}
]
}
}
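
Since `start_at` and `duration` are in milliseconds, a completed result can be turned into readable transcript lines with a small amount of arithmetic. The following sketch uses only the fields shown in the response above; the function names and the output layout are illustrative choices.

```python
from typing import Any, Dict, List


def format_timestamp(ms: int) -> str:
    """Format a millisecond offset as MM:SS.mmm."""
    minutes, rem = divmod(ms, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{minutes:02d}:{seconds:02d}.{millis:03d}"


def transcript_lines(result: Dict[str, Any]) -> List[str]:
    """Render each utterance as "[start - end] spkN: text"."""
    lines = []
    for utt in result.get("results", {}).get("utterances", []):
        start = format_timestamp(utt["start_at"])
        end = format_timestamp(utt["start_at"] + utt["duration"])
        lines.append(f"[{start} - {end}] spk{utt['spk']}: {utt['msg']}")
    return lines
```

For the completed example above, this yields the single line `[00:04.737 - 00:07.097] spk0: Hello.`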

status: failed

{
"id": "{TRANSCRIBE_ID}",
"status": "failed",
"error": {
"code": "{ERROR_CODE}",
"message": "{MESSAGE}"
}
}

Example:

{
"id": "ZbOOQftrS1ywK_T3ikuveA",
"status": "failed",
"error": {
"code": "E500",
"message": "internal server error"
}
}
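
Note that a failed job reports its reason under `error.message`, whereas the HTTP-level error bodies use a top-level `msg`. A small helper can make the three job states explicit for callers. This is a sketch with hypothetical names (`TranscriptionFailed`, `utterances_or_none`), not part of the API.

```python
from typing import Any, Dict, List, Optional


class TranscriptionFailed(Exception):
    """Raised when a job's status is "failed"."""


def utterances_or_none(result: Dict[str, Any]) -> Optional[List[Dict[str, Any]]]:
    """Return utterances when completed, None while transcribing, raise when failed."""
    status = result.get("status")
    if status == "completed":
        return result.get("results", {}).get("utterances", [])
    if status == "failed":
        error = result.get("error", {})
        raise TranscriptionFailed(f"{error.get('code')}: {error.get('message')}")
    return None
```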

Error codes

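| HTTP Status | Code | Notes |
| --- | --- | --- |
| 400 | H0001 | Invalid parameter |
| 401 | H0002 | Invalid token |
| 403 | H0003 | Forbidden |
| 404 | H0004 | Not found |
| 410 | H0007 | Result expired |
| 429 | A0003 | Rate limited |
| 500 | E500 | Server error |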

Example failure:

{
"code": "H0004",
"msg": "not found"
}

Unified example

The example script below authenticates, submits a file, and polls until the job finishes. Select a configuration with the PRESET environment variable (default: sommers_basic) and the audio file with AUDIO_PATH (default: sample.wav).

transcribe.py
import json
import os
import time
from typing import Any, Dict, Optional

import requests


class RTZROpenAPIClient:
    """Minimal client for RTZR OpenAPI (auth + STT file).

    - Fetches JWT via /v1/authenticate using client_id/client_secret
    - Submits a file transcription job via /v1/transcribe
    - Polls /v1/transcribe/{id} every few seconds until completed/failed
    """

    def __init__(
        self,
        client_id: Optional[str] = None,
        client_secret: Optional[str] = None,
        base_url: str = "https://openapi.vito.ai",
    ) -> None:
        self.base_url = base_url.rstrip("/")
        self.client_id = client_id or os.getenv("RTZR_CLIENT_ID")
        self.client_secret = client_secret or os.getenv("RTZR_CLIENT_SECRET")
        if not self.client_id or not self.client_secret:
            raise ValueError(
                "Missing credentials. Set RTZR_CLIENT_ID and RTZR_CLIENT_SECRET "
                "environment variables, or pass client_id/client_secret to RTZROpenAPIClient."
            )
        self._sess = requests.Session()
        self._token: Optional[Dict[str, Any]] = None

    @property
    def token(self) -> str:
        # Renew if missing or expiring within 30 minutes
        if self._token is None or self._token.get("expire_at", 0) < time.time() + 1800:
            resp = self._sess.post(
                f"{self.base_url}/v1/authenticate",
                data={"client_id": self.client_id, "client_secret": self.client_secret},
            )
            resp.raise_for_status()
            self._token = resp.json()
        access = self._token.get("access_token")
        if not access:
            raise RuntimeError("authenticate: 'access_token' not found in response")
        return access

    def _auth_headers(self) -> Dict[str, str]:
        return {"Authorization": f"Bearer {self.token}"}

    def transcribe_file(self, file_path: str, config: Dict[str, Any]) -> Dict[str, Any]:
        url = f"{self.base_url}/v1/transcribe"
        # Keep the file open for the duration of the upload
        with open(file_path, "rb") as f:
            files = {"file": (os.path.basename(file_path), f)}
            data = {"config": json.dumps(config)}
            resp = self._sess.post(url, headers=self._auth_headers(), files=files, data=data)
        resp.raise_for_status()
        return resp.json()

    def get_transcription(self, transcribe_id: str) -> Dict[str, Any]:
        url = f"{self.base_url}/v1/transcribe/{transcribe_id}"
        resp = self._sess.get(url, headers=self._auth_headers())
        resp.raise_for_status()
        return resp.json()

    def wait_for_result(
        self,
        transcribe_id: str,
        poll_interval_sec: int = 5,
        timeout_sec: int = 3600,
    ) -> Dict[str, Any]:
        deadline = time.time() + timeout_sec
        while True:
            if time.time() > deadline:
                raise TimeoutError("Timed out waiting for transcription result")
            result = self.get_transcription(transcribe_id)
            status = result.get("status")
            if status in ("completed", "failed"):
                return result
            time.sleep(poll_interval_sec)


# Preset configurations
PRESETS: Dict[str, Dict[str, Any]] = {
    "sommers_basic": {  # 1) sommers without diarization
        "model_name": "sommers",
        "use_diarization": False,
        "domain": "GENERAL",
    },
    "sommers_call_diarization": {  # 2) sommers + diarization + CALL, spk_count=2
        "model_name": "sommers",
        "domain": "CALL",
        "use_diarization": True,
        "diarization": {"spk_count": 2},
    },
    "whisper_en_diarization": {  # 3) whisper + diarization, language=en
        "model_name": "whisper",
        "language": "en",
        "use_diarization": True,
    },
    # Additional commonly requested options
    "paragraph_split_80": {"use_paragraph_splitter": True, "paragraph_splitter": {"max": 80}},
    "keywords_example": {"keywords": ["stt", "returnzero", "api"]},
    "with_word_timestamps": {"use_word_timestamp": True},
    "disfluency_on": {"use_disfluency_filter": True},
    "profanity_on": {"use_profanity_filter": True},
    "whisper_detect_multi": {
        "model_name": "whisper",
        "language": "multi",
        "language_candidates": ["ko", "en", "ja"],
    },
}


def main() -> None:
    audio_path = os.getenv("AUDIO_PATH", "sample.wav")
    preset_name = os.getenv("PRESET", "sommers_basic")

    if preset_name not in PRESETS:
        raise ValueError(f"Unknown PRESET '{preset_name}'. Available: {sorted(PRESETS.keys())}")

    config = PRESETS[preset_name]

    client = RTZROpenAPIClient()

    # Submit the job, then poll until it is completed or failed
    submit = client.transcribe_file(audio_path, config)
    transcribe_id = submit.get("id")
    if not transcribe_id:
        raise RuntimeError("transcribe: 'id' not found in response")
    result = client.wait_for_result(transcribe_id, poll_interval_sec=5)
    print(json.dumps(result, ensure_ascii=False, indent=2))


if __name__ == "__main__":
    main()

Presets defined in the script:

  • sommers_basic: model_name=sommers, use_diarization=false, domain=GENERAL
  • sommers_call_diarization: model_name=sommers, domain=CALL, use_diarization=true, diarization.spk_count=2
  • whisper_en_diarization: model_name=whisper, language=en, use_diarization=true
  • paragraph_split_80: use_paragraph_splitter=true, paragraph_splitter.max=80
  • keywords_example: keywords=["stt","returnzero","api"]
  • with_word_timestamps: use_word_timestamp=true
  • disfluency_on: use_disfluency_filter=true
  • profanity_on: use_profanity_filter=true
  • whisper_detect_multi: model_name=whisper, language=multi, language_candidates=["ko","en","ja"]
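
The script applies exactly one preset per run. If several presets should be combined, for example a model preset plus paragraph splitting, one option is to accept a comma-separated list and merge the dictionaries in order. The sketch below is an illustrative extension rather than part of the original script: `merge_configs` and `combine_presets` are hypothetical helpers, and "later presets override earlier ones" is an assumed rule.

```python
import copy
from typing import Any, Dict


def merge_configs(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a deep merge of two config dicts; values in override win."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def combine_presets(presets: Dict[str, Dict[str, Any]], spec: str) -> Dict[str, Any]:
    """Merge the presets named in a comma-separated spec, left to right."""
    config: Dict[str, Any] = {}
    for name in (part.strip() for part in spec.split(",")):
        if not name:
            continue
        if name not in presets:
            raise ValueError(f"Unknown preset '{name}'. Available: {sorted(presets)}")
        config = merge_configs(config, presets[name])
    return config
```

With the PRESETS table from the script, `combine_presets(PRESETS, "sommers_basic,paragraph_split_80")` would produce a single config containing the keys of both presets.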