Streaming STT

This guide explains how to implement streaming speech-to-text. Two transports are supported, gRPC and WebSocket; see Streaming STT - gRPC and Streaming STT - WebSocket for the details of each. For file-based conversion, see Batch STT.

caution

Streaming STT is limited by the number of concurrent channels. See Rate limit.

Supported encodings

LINEAR16, FLAC, MULAW, ALAW, AMR, AMR_WB, OGG_OPUS, OPUS.

  • LINEAR16, MULAW, ALAW, AMR, AMR_WB: send raw audio frames
  • OGG_OPUS: send OPUS frames in an OGG container
  • OPUS (raw) is supported for gRPC only; contact us to use it

Common DecoderConfig/Parameters

| Name | Type (gRPC / WebSocket) | Description | Required | Default |
|---|---|---|---|---|
| sample_rate | int | 8000 ~ 48000 Hz | Yes | - |
| encoding | AudioEncoding / string | See supported encodings | Yes | - |
| model_name | string | sommers_ko (Korean), sommers_ja (Japanese), whisper (multilingual) | No | sommers_ko |
| domain | string | See Domain | No | CALL |
| use_itn | bool | See ITN | No | true |
| use_disfluency_filter | bool | See Disfluency | No | false |
| use_profanity_filter | bool | See Profanity | No | false |
| use_punctuation | bool | Use punctuation | No | false |
| keywords | string[] / string | See Keyword boosting | No | - |
| language | string | Required for whisper; see supported list | No | ko |
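
The constraints listed above can be checked on the client before a stream is opened. The following is a minimal sketch, not part of any official SDK: `DecoderParams` and `validate` are hypothetical names, and treating an unset `language` as an error for `whisper` is one interpretation of "Required for whisper".

```python
from dataclasses import dataclass
from typing import List, Optional

# Values taken from the parameter list and the "Supported encodings" section.
SUPPORTED_ENCODINGS = {
    "LINEAR16", "FLAC", "MULAW", "ALAW", "AMR", "AMR_WB", "OGG_OPUS", "OPUS",
}
DOMAINS = {"CALL", "MEETING"}


@dataclass
class DecoderParams:
    """Hypothetical local container; not a class provided by the service."""
    sample_rate: int
    encoding: str
    model_name: str = "sommers_ko"
    domain: str = "CALL"
    use_itn: bool = True
    use_disfluency_filter: bool = False
    use_profanity_filter: bool = False
    use_punctuation: bool = False
    keywords: Optional[List[str]] = None
    language: Optional[str] = None  # None = not set; the documented default is "ko"


def validate(params: DecoderParams, transport: str = "websocket") -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if not 8000 <= params.sample_rate <= 48000:
        problems.append("sample_rate must be within 8000 ~ 48000 Hz")
    if params.encoding not in SUPPORTED_ENCODINGS:
        problems.append(f"unsupported encoding: {params.encoding}")
    elif params.encoding == "OPUS" and transport != "grpc":
        problems.append("raw OPUS is supported for gRPC only")
    if params.domain not in DOMAINS:
        problems.append(f"unknown domain: {params.domain}")
    # Assumption: "Required for whisper" means the caller must set it explicitly.
    if params.model_name == "whisper" and params.language is None:
        problems.append("language must be set when model_name is whisper")
    if params.keywords and params.model_name != "sommers_ko":
        problems.append("keyword boosting is only available for sommers_ko")
    return problems
```

For example, `validate(DecoderParams(sample_rate=16000, encoding="LINEAR16"))` returns an empty list, while a 4000 Hz raw OPUS request over WebSocket with `model_name="whisper"` and no language reports three problems.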

Keyword boosting

Boost or suppress recognition for specific words.

caution

This feature is available only for the sommers_ko model.

Format and usage

  • gRPC: string[]
  • WebSocket: comma-separated string

Each keyword can be:

  • "word" (default score 2.0)
  • "word:score"

caution

  • Words must be written in Korean phonetics, that is, spelled as they are pronounced
  • Score range: -5.0 to 5.0
  • Positive scores boost a word; negative scores suppress it
  • Max 100 words, ≤ 20 chars each

Example

// gRPC
["부스팅", "리턴제로:3.5", "에스티티:-1"]

// WebSocket
"부스팅,리턴제로:3.5,에스티티:-1"

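The two formats and the limits above can be produced from a single keyword list. This is a minimal sketch under stated assumptions: `format_keywords` is a hypothetical helper, the 20-character limit is counted in Unicode code points, and the Korean-phonetics requirement is not checked.

```python
from typing import List, Sequence, Tuple, Union

# A keyword is either a bare word or a (word, score) pair.
Keyword = Union[str, Tuple[str, float]]


def format_keywords(keywords: Sequence[Keyword], transport: str) -> Union[List[str], str]:
    """Build the `keywords` value: string[] for gRPC, comma-separated string for WebSocket."""
    if len(keywords) > 100:
        raise ValueError("at most 100 keywords are allowed")
    entries = []
    for kw in keywords:
        word, score = (kw, None) if isinstance(kw, str) else kw
        # Assumption: length is measured in code points, not bytes.
        if not word or len(word) > 20:
            raise ValueError(f"keyword must be 1 to 20 characters: {word!r}")
        if "," in word or ":" in word:
            raise ValueError(f"keyword must not contain ',' or ':': {word!r}")
        if score is None:
            entries.append(word)  # the service applies the default score 2.0
        else:
            if not -5.0 <= score <= 5.0:
                raise ValueError(f"score out of range -5.0 to 5.0: {score}")
            entries.append(f"{word}:{score:g}")
    return entries if transport == "grpc" else ",".join(entries)
```

Called with `["부스팅", ("리턴제로", 3.5), ("에스티티", -1)]`, the helper yields the gRPC list shown above for `transport="grpc"` and the comma-separated string for `transport="websocket"`.
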
Domain

  • CALL (default): optimized for close-talk microphones and phone calls
  • MEETING: optimized for distant-mic environments like meeting rooms

Choose based on your input environment.