REST with Client auth

Convert audio to text with client-side JWT authentication. This endpoint is identical to the standard /api endpoint but uses a client token (JWT) instead of an org-level API key.

Request Body

audio

string

required

Base64-encoded, 16kHz WAV audio. Maximum size is 25MB or 6 minutes of audio.
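The following Python sketch prepares the audio field using only the standard library. It assumes the input is an uncompressed PCM WAV file. It checks the 16kHz sample rate and the 6-minute limit, then Base64-encodes the bytes. This page does not say whether the 25MB cap applies to the raw file or to the encoded string, so the sketch takes the stricter reading: 25,000,000 bytes of Base64 text. The helper names `encode_wav_audio` and `silent_wav` are hypothetical.

```python
import base64
import io
import wave

# Limits from the description above. Assumption: "25MB" is read as 25,000,000
# bytes and applied to the Base64 string, which is the stricter interpretation.
MAX_ENCODED_BYTES = 25_000_000
MAX_SECONDS = 6 * 60
REQUIRED_RATE_HZ = 16_000


def encode_wav_audio(wav_bytes: bytes) -> str:
    """Validate an uncompressed WAV payload and return it Base64-encoded."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    if rate != REQUIRED_RATE_HZ:
        raise ValueError(f"expected 16kHz audio, got {rate} Hz")
    if duration > MAX_SECONDS:
        raise ValueError(f"audio is {duration:.1f}s long; the limit is {MAX_SECONDS}s")
    encoded = base64.b64encode(wav_bytes).decode("ascii")
    if len(encoded) > MAX_ENCODED_BYTES:
        raise ValueError("encoded audio exceeds the 25MB limit")
    return encoded


def silent_wav(seconds: float, rate: int = REQUIRED_RATE_HZ) -> bytes:
    """Generate mono 16-bit silence, handy for exercising the encoder."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 16-bit samples
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


audio_field = encode_wav_audio(silent_wav(1.0))
```

For 16-bit mono audio, the duration limit is reached before the size limit. Six minutes is 16,000 × 2 × 360 = 11,520,000 bytes of samples, which comes to about 15.4MB once Base64 adds its 4/3 overhead.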

language

array

Optional list of ISO 639-1 language codes that the user is expected to speak. Setting the list to a single code forces the transcription into that language. If the field is omitted, Flow attempts autodetection across its full list of languages, which is less accurate.
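A helper like this Python sketch could handle the three cases described above: one code, several codes, or no field. The regular expression only checks that each code is two lowercase letters. It does not confirm that the code is actually assigned in ISO 639-1. Treating an empty list like an omitted field is an assumption. `language_field` is a hypothetical name.

```python
from __future__ import annotations

import re

# Shape check only: two lowercase letters. It does not verify that the code
# is actually assigned in ISO 639-1.
_TWO_LETTER = re.compile(r"[a-z]{2}")


def language_field(codes: list[str] | None) -> dict:
    """Return the request-body fragment for the optional `language` field."""
    if not codes:
        # Omitting the field asks Flow to auto-detect across all languages (less accurate).
        return {}
    for code in codes:
        if not _TWO_LETTER.fullmatch(code):
            raise ValueError(f"not a two-letter ISO 639-1 code: {code!r}")
    # One code forces that language; several codes list the languages the user may speak.
    return {"language": list(codes)}


body = {"audio": "<base64 audio>", **language_field(["en", "es"])}
```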

context

object

Optional contextual information about the circumstances surrounding the user's dictation. Flow can use this information to make its output more accurate, for example by getting names right or resolving speech ambiguities. All properties are optional and fall back to default values if not provided.


app

object

Information about the application the user is dictating into, used to customize the writing style.


name

string

The name of the application.

type

string

The type of the application. Currently supports the following values:

email: Email clients, and anywhere else the user is dictating an email.
ai: Applications where the user is conversing with an AI chatbot or agent rather than a human.
other: Any application that does not fit the two categories above.
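This short Python sketch builds `context.app` and rejects any value outside the three supported types. The helper name `app_context` is hypothetical. Deciding whether a given application counts as `email`, `ai`, or `other` is left to the caller.

```python
from __future__ import annotations

# The values listed above for `context.app.type`.
APP_TYPES = ("email", "ai", "other")


def app_context(app_type: str, name: str | None = None) -> dict:
    """Build the optional `context.app` object, rejecting unsupported types."""
    if app_type not in APP_TYPES:
        raise ValueError(f"app type must be one of {APP_TYPES}, got {app_type!r}")
    app = {"type": app_type}
    if name is not None:
        app["name"] = name
    return app


context = {"app": app_context("email", name="Mail")}
```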

dictionary_context

array

default:"[]"

List of uncommon names or words relevant to the context that might be mentioned by the user.

user_identifier

string

User ID in the app the person is dictating in, such as their email address in an email client.

user_first_name

string

First name of the speaker, used to make sure Flow spells the speaker’s name correctly.

user_last_name

string

Last name of the speaker.
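Because every context property is optional, one approach is to include only the values the client actually has. In this Python sketch, unset fields are omitted rather than sent as null, and repeated dictionary words are dropped. Omitting rather than nulling is an assumption about how the server treats missing versus null values. `user_context` is a hypothetical helper.

```python
from __future__ import annotations


def user_context(
    dictionary: list[str] | None = None,
    user_identifier: str | None = None,
    first_name: str | None = None,
    last_name: str | None = None,
) -> dict:
    """Collect the optional user-related `context` fields, skipping unset ones."""
    fields = {
        # dict.fromkeys drops repeated words while keeping their first-seen order.
        "dictionary_context": list(dict.fromkeys(dictionary)) if dictionary else None,
        "user_identifier": user_identifier,
        "user_first_name": first_name,
        "user_last_name": last_name,
    }
    # Omit anything not provided so the server falls back to its defaults.
    return {key: value for key, value in fields.items() if value is not None}


ctx = user_context(dictionary=["Kubernetes", "PostgreSQL", "Kubernetes"], first_name="Siobhan")
```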

textbox_contents

object

Text under and surrounding the cursor in the active textbox. Flow uses this information to decide casing, spacing and punctuation.


before_text

string

The text immediately before the cursor.

selected_text

string

The text the user has highlighted.

after_text

string

The text immediately after the cursor.
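Most text widgets expose the cursor as a selection start offset and end offset. That means `textbox_contents` can be derived by slicing the field's value. This Python sketch sends the three parts as strings, which matches the request example further down. The function name `split_textbox` is hypothetical.

```python
from __future__ import annotations


def split_textbox(text: str, sel_start: int, sel_end: int | None = None) -> dict:
    """Split a textbox value into before/selected/after around the cursor."""
    if sel_end is None:
        sel_end = sel_start  # no selection, only a caret
    if not 0 <= sel_start <= sel_end <= len(text):
        raise ValueError("selection range is outside the text")
    return {
        "before_text": text[:sel_start],
        "selected_text": text[sel_start:sel_end],
        "after_text": text[sel_end:],
    }


# Caret after "Hi " with "team" highlighted.
example = split_textbox("Hi team, thanks", 3, 7)
```

The offsets here are Python code-point indices. Browser `selectionStart`/`selectionEnd` values count UTF-16 code units, so convert them before slicing text that contains characters such as emoji.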

screenshot

string

Screenshot of the screen or the app the user is dictating in, for when the user is referencing something on the screen.

content_text

string

Plain-text content of the current page in the app the user is dictating in. This is a more efficient way to provide context than a screenshot.

content_html

string

HTML content of the app the user is dictating in (a more feature-rich alternative to content_text).

conversation

object

Chatbot-style history of the conversation the user is dictating in. This typically applies to messaging and AI apps.


string

required

Identifier of the conversation, for example the name of the group chat in a messaging app.

participants

array

List of names of the other people in the conversation, used to spell uncommon names correctly.

messages

array

List of recent messages in the conversation, in chronological order. Used to identify the relevant context and the names of things mentioned that are not conversation participants.


role

string

required

Type of the message sender. One of: user, human, assistant

content

string

required

Content of the message.
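The API expects recent messages in chronological order, so a client will usually sort and trim its local history before sending it. This Python sketch builds only the `messages` array. The `timestamp` field and the default cap of 20 messages are illustrative assumptions. `ChatMessage` and `conversation_messages` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass

# Roles accepted by the API, per the `role` description above.
VALID_ROLES = {"user", "human", "assistant"}


@dataclass
class ChatMessage:
    timestamp: float  # hypothetical local field, used only for ordering; not sent
    role: str
    content: str


def conversation_messages(history: list[ChatMessage], limit: int = 20) -> list[dict]:
    """Return the most recent `limit` messages, oldest first."""
    for msg in history:
        if msg.role not in VALID_ROLES:
            raise ValueError(f"unsupported role {msg.role!r}")
    ordered = sorted(history, key=lambda m: m.timestamp)
    # Guard limit <= 0 explicitly: ordered[-0:] would return the whole list.
    recent = ordered[-limit:] if limit > 0 else []
    return [{"role": m.role, "content": m.content} for m in recent]


history = [
    ChatMessage(2.0, "human", "Lunch tomorrow?"),
    ChatMessage(1.0, "user", "Hey Priya"),
    ChatMessage(3.0, "user", "Sounds good"),
]
messages = conversation_messages(history, limit=2)
```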

properties

object

deprecated


language

string

default:"en"

Two-letter ISO 639-1 language code (e.g. ‘en’, ‘es’).

app_type

string

default:"other"

Flow adjusts formatting depending on whether the user is prompting an AI, writing an email, or doing something else. Options: ‘ai’, ‘email’, ‘other’.

dictionary

array

default:"[]"

List of dictionary words to help with transcription accuracy

after_text

string

default:true

The text immediately after the cursor. Flow uses it to decide spacing / punctuation.

before_text

string

default:true

The text immediately before the cursor. Flow uses it to decide spacing / punctuation.

selected_text

string

default:true

The text the user has highlighted. Flow uses it to decide spacing / punctuation.
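The deprecated `properties` fields appear to correspond to newer fields with similar names. This correspondence is inferred from the descriptions and is not stated in this section. Assuming it holds, a Python sketch for migrating an old request could look like this. The single `language` code is wrapped in a one-element list, which by the `language` description above forces that language. Drop it instead if you want autodetection. `migrate_properties` is a hypothetical helper.

```python
def migrate_properties(body: dict) -> dict:
    """Rewrite a request using deprecated `properties` into `language`/`context` form.

    Assumed mapping: language -> language[0], app_type -> context.app.type,
    dictionary -> context.dictionary_context, and the three text fields ->
    context.textbox_contents. Fields already present in `context` are kept.
    """
    body = dict(body)  # shallow copy; the caller's dict is not modified
    props = body.pop("properties", None) or {}
    context = dict(body.get("context", {}))
    if "language" in props and "language" not in body:
        body["language"] = [props["language"]]
    if "app_type" in props:
        context.setdefault("app", {"type": props["app_type"]})
    if "dictionary" in props:
        context.setdefault("dictionary_context", props["dictionary"])
    textbox = {
        key: props[key]
        for key in ("before_text", "selected_text", "after_text")
        if key in props
    }
    if textbox:
        context.setdefault("textbox_contents", textbox)
    if context:
        body["context"] = context
    return body


old = {
    "audio": "UklGRiQA",
    "properties": {"language": "en", "app_type": "email",
                   "dictionary": ["Kubernetes"], "before_text": "Hi "},
}
new = migrate_properties(old)
```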

Example response:

{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Testing testing 1, 2, 3",
  "detected_language": "en",
  "total_time": 432,
  "generated_tokens": 9
}
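The fields of a successful response can be read into a typed record. This Python sketch parses the example response above. `TranscriptionResult` and `parse_transcription` are hypothetical names. This page does not state the units of `total_time`, so the sketch keeps it as a plain number.

```python
import json
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class TranscriptionResult:
    id: str
    text: str
    detected_language: str
    total_time: float  # units not stated on this page
    generated_tokens: int


def parse_transcription(raw: str) -> TranscriptionResult:
    """Decode a successful response body; raises KeyError/ValueError on bad input."""
    data = json.loads(raw)
    return TranscriptionResult(
        id=str(uuid.UUID(data["id"])),  # the schema types `id` as string<uuid>
        text=data["text"],
        detected_language=data["detected_language"],
        total_time=float(data["total_time"]),
        generated_tokens=int(data["generated_tokens"]),
    )


sample = """{
  "id": "550e8400-e29b-41d4-a716-446655440000",
  "text": "Testing testing 1, 2, 3",
  "detected_language": "en",
  "total_time": 432,
  "generated_tokens": 9
}"""
result = parse_transcription(sample)
```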

Example request body:

{
  "audio": "UklGRiQA....",
  "language": ["en"],
  "context": {
    "app": {
      "type": "email"
    },
    "dictionary_context": [],
    "textbox_contents": {
      "before_text": "",
      "selected_text": "",
      "after_text": ""
    }
  }
}

For a full list of available fields, see the "Request Schema" page.

Authorizations

Authorization

string

header

required

Client-side token (format: Bearer <JWT>) for calling client endpoints like /client_api.
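Here is a minimal sketch of sending the request from Python with `urllib.request`. Only the /client_api path and the `Bearer <JWT>` header format come from this page. The base URL is a hypothetical placeholder. The POST method, JSON content type, and 30-second timeout are assumptions. How the client obtains its JWT is not covered here.

```python
import json
import urllib.request

# Hypothetical placeholder: substitute the real Flow API host.
BASE_URL = "https://api.example.com"


def build_client_request(jwt: str, body: dict) -> urllib.request.Request:
    """Prepare a POST to /client_api authenticated with a client JWT (method assumed)."""
    return urllib.request.Request(
        url=f"{BASE_URL}/client_api",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {jwt}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def transcribe(jwt: str, body: dict, timeout: float = 30.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(build_client_request(jwt, body), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = build_client_request("header.payload.signature", {"audio": "UklGRiQA", "language": ["en"]})
```

The request is built separately from the network call so the headers and body can be inspected or logged without contacting the server.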

Body

application/json

audio

string

required

Base64-encoded audio data (up to 25MB / 6 minutes)

properties

object

required

Additional config for transcription

Response

Successful transcription

string<uuid>

text

string

detected_language

string

total_time

number

generated_tokens

integer
