Speech to Text

AI Server can transcribe audio files to text using the Speech-to-Text provider. This is powered by the Whisper model and is hosted on your own ComfyUI Agent.

Speech to Text UI

AI Server's Speech to Text UI lets you transcribe audio files from its active Comfy UI Agents:

https://localhost:5006/SpeechToText

Using Speech to Text Endpoints

These endpoints are used in a similar way to other AI Server endpoints where you can provide:

  • RefId - provide a unique identifier to track requests
  • Tag - categorize like requests under a common group

In addition Queue requests can provide:

  • ReplyTo - URL to send a POST request to when the request is complete

Speech to Text

The Speech to Text endpoint converts audio input into text. It provides two types of output:

  1. Text with timestamps: JSON format with start and end timestamps for each segment.
  2. Plain text: The full transcription without timestamps.

These outputs are returned in the TextOutputs array, where the JSON will need to be parsed to extract the text and timestamps.

using var fsAudio = File.OpenRead("files/test_audio.wav");
var response = client.PostFileWithRequest(new SpeechToText(),
    new UploadFile("test_audio.wav", fsAudio, "audio"));

// Two texts are returned
// The first is the timestamped text json with `start` and `end` timestamps
var textWithTimestamps = response.Results[0].Text;
// The second is the plain text
var textOnly = response.Results[1].Text;

Queue Speech to Text

For longer audio files or when you want to process the request asynchronously, you can use the Queue Speech to Text endpoint.

using var fsAudio = File.OpenRead("files/test_audio.wav");
var response = client.PostFileWithRequest(new QueueSpeechToText(),
    new UploadFile("test_audio.wav", fsAudio, "audio"));

// Poll for Job Completion Status
GetTextGenerationStatusResponse status = new();
while (status.JobState is BackgroundJobState.Started or BackgroundJobState.Queued)
{
    status = client.Get(new GetTextGenerationStatus { RefId = response.RefId });
    Thread.Sleep(1000);
}

var answer = status.Results[0].Text;