Speech to Text

AI Server can transcribe audio files to text using its Speech-to-Text provider, which is powered by the Whisper model running on your own ComfyUI Agent.

Using Speech to Text Endpoints

These endpoints follow the same conventions as other AI Server endpoints, where you can provide:

  • RefId - provide a unique identifier to track requests
  • Tag - categorize like requests under a common group

In addition, Queue requests can provide:

  • ReplyTo - URL to send a POST request to when the request is complete
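For example, these options can be combined on a single queued request. A minimal sketch, assuming `RefId`, `Tag` and `ReplyTo` are settable properties on the request DTO (verify against your AI Server version's DTOs) and that `https://example.org/callback` is a placeholder for your own callback URL:

```csharp
var response = client.PostFilesWithRequest(new QueueSpeechToText
    {
        RefId = "audio-123",                      // unique identifier to track this request
        Tag = "podcasts",                         // group related requests together
        ReplyTo = "https://example.org/callback"  // POSTed the result when complete
    },
    [new UploadFile("test_audio.wav", File.OpenRead("files/test_audio.wav"), "audio")]
);
```

`ReplyTo` only applies to Queue requests; synchronous requests return their results directly in the response.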

Speech to Text

The Speech to Text endpoint converts audio input into text. It provides two types of output:

  1. Text with timestamps: JSON format with start and end timestamps for each segment.
  2. Plain text: The full transcription without timestamps.

These outputs are returned in the TextOutputs array, where the JSON will need to be parsed to extract the text and timestamps.

var response = client.PostFilesWithRequest(new SpeechToText(),
    [new UploadFile("test_audio.wav", File.OpenRead("files/test_audio.wav"), "audio")]
);

// Two texts are returned
// The first is the timestamped text json with `start` and `end` timestamps
var textWithTimestamps = response.TextOutputs[0].Text;
// The second is the plain text
var textOnly = response.TextOutputs[1].Text;
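The timestamped output can then be parsed with `System.Text.Json`. A minimal sketch, assuming the JSON is an array of segment objects with `start`, `end` and `text` fields (check the actual output of your server, as the exact shape may differ):

```csharp
using System;
using System.Text.Json;

// Parse the timestamped transcription returned in TextOutputs[0]
var root = JsonDocument.Parse(textWithTimestamps).RootElement;
foreach (var segment in root.EnumerateArray())
{
    // Assumed segment fields: `start`/`end` in seconds, `text` for the transcript
    var start = segment.GetProperty("start").GetDouble();
    var end = segment.GetProperty("end").GetDouble();
    var text = segment.GetProperty("text").GetString();
    Console.WriteLine($"[{start:0.00}s - {end:0.00}s] {text}");
}
```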

Queue Speech to Text

For longer audio files or when you want to process the request asynchronously, you can use the Queue Speech to Text endpoint.

var response = client.PostFilesWithRequest(new QueueSpeechToText(),
    [new UploadFile("test_audio.wav", File.OpenRead("files/test_audio.wav"), "audio")]
);
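If no ReplyTo callback is provided, the queued job can be polled for completion. A sketch of one way to do this, assuming the queued response exposes a `RefId` and that a `GetTextGenerationStatus` status DTO with `JobState` and `Results` properties is available (verify the DTO names against your AI Server version):

```csharp
// Poll until the background job leaves the Queued/Started states
// (assumed DTO and property names -- check your server's DTOs)
var status = client.Get(new GetTextGenerationStatus { RefId = response.RefId });
while (status.JobState is BackgroundJobState.Queued or BackgroundJobState.Started)
{
    Thread.Sleep(1000);
    status = client.Get(new GetTextGenerationStatus { RefId = response.RefId });
}

// On completion the transcription outputs should be available as results
var transcript = status.Results?[0].Text;
```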