AI Server can transcribe audio files to text using the Speech-to-Text provider. This is powered by the Whisper model and is hosted on your own ComfyUI Agent.
Using Speech to Text Endpoints
These endpoints are used in a similar way to other AI Server endpoints where you can provide:

- `RefId` - provide a unique identifier to track requests
- `Tag` - categorize like requests under a common group

In addition, Queue requests can provide:

- `ReplyTo` - URL to send a POST request to when the request is complete
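For example, a queued transcription request could set these options together. This is a sketch only; the `RefId`, `Tag` and `ReplyTo` values below are placeholders to substitute with your own:

```csharp
var response = client.PostFilesWithRequest(new QueueSpeechToText {
        RefId = "audio-123",     // placeholder: your own tracking identifier
        Tag = "podcasts",        // placeholder: common group for like requests
        ReplyTo = "https://example.org/api/OnTranscribed" // placeholder callback URL
    },
    [new UploadFile("test_audio.wav", File.OpenRead("files/test_audio.wav"), "audio")]
);
```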
Speech to Text
The Speech to Text endpoint converts audio input into text. It provides two types of output:

- Text with timestamps: JSON format with `start` and `end` timestamps for each segment.
- Plain text: the full transcription without timestamps.

These outputs are returned in the `TextOutputs` array, where the JSON will need to be parsed to extract the text and timestamps.
var response = client.PostFilesWithRequest(new SpeechToText(),
    [new UploadFile("test_audio.wav", File.OpenRead("files/test_audio.wav"), "audio")]
);
// Two texts are returned
// The first is the timestamped text json with `start` and `end` timestamps
var textWithTimestamps = response.TextOutputs[0].Text;
// The second is the plain text
var textOnly = response.TextOutputs[1].Text;
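The timestamped JSON can then be deserialized into segments. A minimal sketch using `System.Text.Json`, assuming each segment is an object with `text`, `start` and `end` fields (verify the exact shape against your server's actual output):

```csharp
using System.Text.Json;

var segments = JsonSerializer.Deserialize<Segment[]>(textWithTimestamps);
foreach (var segment in segments ?? [])
    Console.WriteLine($"[{segment.start:0.00}-{segment.end:0.00}] {segment.text}");

// Assumed segment shape; field names may differ in practice
record Segment(string text, double start, double end);
```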
Queue Speech to Text
For longer audio files or when you want to process the request asynchronously, you can use the Queue Speech to Text endpoint.
var response = client.PostFilesWithRequest(new QueueSpeechToText(),
    [new UploadFile("test_audio.wav", File.OpenRead("files/test_audio.wav"), "audio")]
);