Streaming STT
This guide explains how to implement streaming speech-to-text (STT). Two transport protocols are supported: gRPC and WebSocket. For protocol-specific details, see Streaming STT - gRPC and Streaming STT - WebSocket. For file-based transcription, see Batch STT.
Caution: Streaming STT is limited by the number of concurrent channels. See Rate limit.
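A client that opens several streams at once can stay within the concurrent-channel limit by gating each session behind a semaphore. The following is a minimal sketch using Python's asyncio. `MAX_CONCURRENT_STREAMS` is a hypothetical value that you would set from your plan's limit, and `transcribe_stream` is a hypothetical placeholder for a real gRPC or WebSocket session.

```python
import asyncio

# Hypothetical cap: set this to the concurrent-channel limit of your plan (see Rate limit).
MAX_CONCURRENT_STREAMS = 2


async def transcribe_stream(stream_id: int, limiter: asyncio.Semaphore, stats: dict) -> str:
    """Placeholder for one streaming session; a real client would open a gRPC or WebSocket stream here."""
    async with limiter:  # waits here while MAX_CONCURRENT_STREAMS sessions are already active
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        try:
            await asyncio.sleep(0.01)  # stands in for sending audio and reading results
            return f"stream-{stream_id} done"
        finally:
            stats["active"] -= 1


async def run_all(n_streams: int) -> tuple[list[str], int]:
    """Start n_streams sessions and report the peak number that ran at the same time."""
    limiter = asyncio.Semaphore(MAX_CONCURRENT_STREAMS)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(transcribe_stream(i, limiter, stats) for i in range(n_streams))
    )
    return results, stats["peak"]


if __name__ == "__main__":
    results, peak = asyncio.run(run_all(5))
    print(results)
    print("peak concurrent streams:", peak)
```

Sessions beyond the cap wait at `async with limiter` until a slot frees up, so the client never opens more streams than the configured limit.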
Supported encodings
The following encodings are supported: LINEAR16, FLAC, MULAW, ALAW, AMR, AMR_WB, OGG_OPUS, and OPUS.
- LINEAR16, MULAW, ALAW, AMR, AMR_WB: send raw audio frames.
- OGG_OPUS: send Opus frames wrapped in an Ogg container.
- OPUS (raw Opus frames): supported for gRPC only; contact us before using it.
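For the raw-frame encodings, the client splits its audio into chunks before sending them. Below is a minimal sketch for LINEAR16 (16-bit PCM, 2 bytes per sample) that assumes mono audio by default. The 100 ms frame duration is an illustrative choice, not a documented requirement, and `linear16_frames` is a hypothetical helper.

```python
from typing import Iterator

BYTES_PER_SAMPLE = 2  # LINEAR16 is 16-bit signed PCM


def linear16_frames(
    pcm: bytes, sample_rate: int, frame_ms: int = 100, channels: int = 1
) -> Iterator[bytes]:
    """Split raw LINEAR16 audio into fixed-duration frames for streaming.

    frame_ms is an assumed chunk size, not a documented requirement.
    The last frame may be shorter than frame_ms.
    """
    if not 8000 <= sample_rate <= 48000:
        raise ValueError("sample_rate must be between 8000 and 48000 Hz")
    # samples per frame * bytes per sample * channels
    frame_bytes = sample_rate * frame_ms // 1000 * BYTES_PER_SAMPLE * channels
    for start in range(0, len(pcm), frame_bytes):
        yield pcm[start:start + frame_bytes]
```

At 16000 Hz, a 100 ms mono frame is 1600 samples, or 3200 bytes, so one second of audio yields ten frames. Each chunk would then be sent over the chosen transport.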
Common DecoderConfig/Parameters
Name | Type (gRPC / WebSocket) | Description | Required | Default |
---|---|---|---|---|
sample_rate | int | Audio sample rate in Hz (8000 to 48000) | Yes | - |
encoding | AudioEncoding / string | Audio encoding; see Supported encodings | Yes | - |
model_name | string | Model to use: sommers_ko (Korean), sommers_ja (Japanese), or whisper (multilingual) | No | sommers_ko |
domain | string | Recording environment; see Domain | No | CALL |
use_itn | bool | Apply inverse text normalization; see ITN | No | true |
use_disfluency_filter | bool | Filter out disfluencies; see Disfluency | No | false |
use_profanity_filter | bool | Filter profanity; see Profanity | No | false |
use_punctuation | bool | Add punctuation to the transcript | No | false |
keywords | string[] / string | Words to boost or suppress; see Keyword boosting | No | - |
language | string | Recognition language; required when model_name is whisper (see supported list) | For whisper only | ko |
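Several rules in this table can be checked on the client before a stream is opened. The sketch below uses a hypothetical `StreamingConfig` dataclass, not an official SDK type. Its `transport` field is a helper for the gRPC-only OPUS rule rather than an API parameter, and the allowed model and domain values are only those listed on this page.

```python
from dataclasses import dataclass, field

# Values listed on this page; the service may accept others.
SUPPORTED_ENCODINGS = {"LINEAR16", "FLAC", "MULAW", "ALAW", "AMR", "AMR_WB", "OGG_OPUS", "OPUS"}
MODEL_NAMES = {"sommers_ko", "sommers_ja", "whisper"}
DOMAINS = {"CALL", "MEETING"}


@dataclass
class StreamingConfig:
    """Hypothetical client-side mirror of the common parameters (not an official SDK type)."""

    sample_rate: int
    encoding: str
    model_name: str = "sommers_ko"
    domain: str = "CALL"
    use_itn: bool = True
    use_disfluency_filter: bool = False
    use_profanity_filter: bool = False
    use_punctuation: bool = False
    keywords: list[str] = field(default_factory=list)
    language: str = "ko"
    transport: str = "grpc"  # helper field ("grpc" or "websocket"), not an API parameter

    def validate(self) -> None:
        """Raise ValueError if one of the documented constraints is violated."""
        if not 8000 <= self.sample_rate <= 48000:
            raise ValueError("sample_rate must be between 8000 and 48000 Hz")
        if self.encoding not in SUPPORTED_ENCODINGS:
            raise ValueError(f"unsupported encoding: {self.encoding}")
        if self.encoding == "OPUS" and self.transport != "grpc":
            raise ValueError("raw OPUS is supported over gRPC only")
        if self.model_name not in MODEL_NAMES:
            raise ValueError(f"unknown model_name: {self.model_name}")
        if self.domain not in DOMAINS:
            raise ValueError(f"unknown domain: {self.domain}")
        if self.keywords and self.model_name != "sommers_ko":
            raise ValueError("keyword boosting is only available for sommers_ko")


if __name__ == "__main__":
    cfg = StreamingConfig(sample_rate=16000, encoding="LINEAR16", keywords=["리턴제로:3.5"])
    cfg.validate()  # no exception: every documented constraint holds
```

Validating locally catches mistakes such as an out-of-range sample rate before any audio is sent.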
Keyword boosting
Increase or decrease the likelihood that specific words are recognized.
Caution: This feature is only available for the sommers_ko model.
Format and usage
- gRPC: string[]
- WebSocket: comma-separated string
Each keyword takes one of two forms:
- "word": uses the default score of 2.0
- "word:score": uses the given score
Caution:
- Words must be written phonetically in Korean (Hangul), e.g. 에스티티 for "STT".
- Scores must be between -5.0 and 5.0.
- Positive scores boost a word; negative scores suppress it.
- At most 100 keywords, each at most 20 characters long.
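The keyword rules above can be validated and serialized for each transport with a small helper. In this sketch, `parse_keyword` and `format_keywords` are hypothetical names, and the Korean-phonetics rule is approximated (an assumption) as "Hangul syllables only", which also rejects spaces and digits.

```python
import re
from typing import Union

DEFAULT_SCORE = 2.0
MAX_KEYWORDS = 100
MAX_CHARS = 20
# Assumption: "written in Korean phonetics" is approximated as "Hangul syllables only".
HANGUL_ONLY = re.compile(r"[\uac00-\ud7a3]+")


def parse_keyword(entry: str) -> tuple[str, float]:
    """Split "word" or "word:score" into (word, score) and apply the documented limits."""
    word, sep, score_text = entry.partition(":")
    score = float(score_text) if sep else DEFAULT_SCORE
    if not HANGUL_ONLY.fullmatch(word):
        raise ValueError(f"keyword must be written in Hangul: {word!r}")
    if len(word) > MAX_CHARS:
        raise ValueError(f"keyword longer than {MAX_CHARS} characters: {word!r}")
    if not -5.0 <= score <= 5.0:
        raise ValueError(f"score out of range [-5.0, 5.0]: {score}")
    return word, score


def format_keywords(entries: list[str], transport: str) -> Union[list[str], str]:
    """Validate entries, then return a list (gRPC) or a comma-separated string (WebSocket)."""
    if len(entries) > MAX_KEYWORDS:
        raise ValueError(f"at most {MAX_KEYWORDS} keywords are allowed")
    for entry in entries:
        parse_keyword(entry)  # raises ValueError on the first invalid entry
    if transport == "grpc":
        return list(entries)
    if transport == "websocket":
        return ",".join(entries)
    raise ValueError(f"unknown transport: {transport}")


if __name__ == "__main__":
    kws = ["부스팅", "리턴제로:3.5", "에스티티:-1"]
    print(format_keywords(kws, "grpc"))
    print(format_keywords(kws, "websocket"))
```

Checking keywords on the client surfaces an out-of-range score or an English spelling before the stream starts.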
Example
// gRPC
["부스팅", "리턴제로:3.5", "에스티티:-1"]
// WebSocket
"부스팅,리턴제로:3.5,에스티티:-1"
Domain
- CALL (default): optimized for close-talk microphones and phone calls.
- MEETING: optimized for distant microphones, such as in meeting rooms.
Choose the domain that matches the environment in which your audio is recorded.