AI Server APIs

AI Server provides a set of APIs for interacting with AI models and services, as well as some common image, video and audio processing tasks.

Every API provides two modes of operation: synchronous and asynchronous.

  • Synchronous: The request is processed immediately and the response is returned once processing completes, including a URL to download the result. Synchronous requests have a 60 second timeout.
  • Asynchronous: The request is queued and processed in the background. The response returns immediately with a URL that can be used to download the result once it's ready.
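
As a minimal sketch of the two modes, the example below calls the Text to Image endpoints listed under Generative AI Endpoints further down. It assumes Node 18+ (for the global fetch), an example server address, and bearer-token authentication; the request and response field names (e.g. prompt) are illustrative assumptions rather than the server's actual DTOs, which the generated typed clients define.

const baseUrl = "https://localhost:5005";              // assumed AI Server address
const headers = {
    "Content-Type": "application/json",
    Authorization: `Bearer ${process.env.AI_SERVER_API_KEY}`, // assumed auth scheme
};

// Synchronous: waits for generation to complete (60 second timeout)
const syncRes = await fetch(`${baseUrl}/api/TextToImage`, {
    method: "POST",
    headers,
    body: JSON.stringify({ prompt: "a watercolor fox" }),  // `prompt` is an assumed field name
});
console.log(await syncRes.json());  // expected to include URL(s) to the generated image(s)

// Asynchronous: queues the job and returns immediately
const queueRes = await fetch(`${baseUrl}/api/QueueTextToImage`, {
    method: "POST",
    headers,
    body: JSON.stringify({ prompt: "a watercolor fox" }),
});
console.log(await queueRes.json()); // expected to include a URL to download the result once ready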

Image Generation APIs

AI Server has built-in ComfyUI workflows for performing image generation tasks using AI models like SDXL and Flux.

The following tasks are available for image generation:

  • Text to Image - Generate an image based on provided text prompts.
  • Image to Image - Generate a new image based on an input image and provided prompts.
  • Image with Mask - Generate a new image based on an input image, a mask, and provided prompts (applied only to the masked area).
  • Image Upscale - Upscale an input image to a higher resolution (currently 2x only).
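
For example, the Image Upscale task above can be called with a multipart upload. This sketch assumes Node 18+ globals (fetch, FormData, Blob), an example server address, and that the endpoint accepts the input image as a multipart form field named image; the actual DTO may differ.

import { readFile } from "node:fs/promises";

const baseUrl = "https://localhost:5005"; // assumed AI Server address

// Upload an image for 2x upscaling; the `image` form field name is an assumption.
const form = new FormData();
form.append("image", new Blob([await readFile("photo.png")]), "photo.png");

const res = await fetch(`${baseUrl}/api/ImageUpscale`, { method: "POST", body: form });
console.log(await res.json()); // expected to include a URL to the upscaled image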

Speech APIs

AI Server provides endpoints for speech-related tasks, including Speech-to-Text and Text-to-Speech conversions. These endpoints utilize AI models to process audio and text data.

The following tasks are available for speech processing:

  • Speech to Text - Transcribe audio into text.
  • Text to Speech - Generate audio from provided text.
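
A sketch of a synchronous Text to Speech call follows, assuming Node 18+, an example server address, and a text request field; the response handling assumes the result exposes a download URL for the generated audio, which may differ from the actual DTO.

import { writeFile } from "node:fs/promises";

const baseUrl = "https://localhost:5005"; // assumed AI Server address

const res = await fetch(`${baseUrl}/api/TextToSpeech`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: "Welcome to AI Server" }), // `text` is an assumed field name
});
const result = await res.json();
console.log(result); // per the modes of operation above, expect a URL to the generated audio

// Assumption: the result exposes a download URL, e.g. `result.url`.
if (result.url) {
    const audio = await fetch(result.url);
    await writeFile("speech.mp3", Buffer.from(await audio.arrayBuffer()));
}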

AI Server API Endpoints

AI Server has endpoints for AI tasks as well as media processing tasks.

Generative AI Endpoints

  • Chat: Interact with LLMs to generate text.
    • Sync: /api/OpenAiChatCompletion
    • Async: /api/QueueOpenAiChatCompletion
  • Image: Generate images from text.
    • Sync: /api/TextToImage
    • Async: /api/QueueTextToImage
  • Image to Image: Generate images from images.
    • Sync: /api/ImageToImage
    • Async: /api/QueueImageToImage
  • Image With Mask: Generate images from images with a mask.
    • Sync: /api/ImageWithMask
    • Async: /api/QueueImageWithMask
  • Image Upscale: Upscale images.
    • Sync: /api/ImageUpscale
    • Async: /api/QueueImageUpscale
  • Image To Text: Generate text from images.
    • Sync: /api/ImageToText
    • Async: /api/QueueImageToText
  • Speech to Text: Transcribe audio to text.
    • Sync: /api/SpeechToText
    • Async: /api/QueueSpeechToText
  • Text To Speech: Generate audio from text.
    • Sync: /api/TextToSpeech
    • Async: /api/QueueTextToSpeech
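
For instance, the synchronous Chat endpoint above accepts an OpenAI-style chat completion request (see the INFO note below). This sketch assumes an example server address and model name:

const baseUrl = "https://localhost:5005"; // assumed AI Server address

const res = await fetch(`${baseUrl}/api/OpenAiChatCompletion`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
        model: "llama3:8b",                      // assumed model name; use one configured on your server
        messages: [
            { role: "system", content: "You are a helpful assistant." },
            { role: "user",   content: "What is the capital of France?" },
        ],
    }),
});
const completion = await res.json();
console.log(completion.choices?.[0]?.message?.content); // OpenAI-style response shape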

INFO

The Chat API is also available as an OpenAI-compatible endpoint at /v1/chat/completions with matching DTOs. Not every OpenAI client library will work with this endpoint, but the structure of the request and response is the same as OpenAI's.
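
As a sketch of using an existing OpenAI client against this endpoint (noting the caveat above that not all clients will work), the official openai npm package can be pointed at AI Server by overriding its base URL. The server address, API key handling, and model name below are assumptions:

import OpenAI from "openai";

const client = new OpenAI({
    baseURL: "https://localhost:5005/v1",          // assumed AI Server address
    apiKey: process.env.AI_SERVER_API_KEY ?? "unused", // assumed auth; substitute whatever your server expects
});

const completion = await client.chat.completions.create({
    model: "llama3:8b",                            // assumed model name
    messages: [{ role: "user", content: "Summarize what AI Server does in one sentence." }],
});
console.log(completion.choices[0].message.content);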

Media Processing Endpoints

Media endpoints are used for processing images and videos. Videos are processed remotely by the ComfyUI Agent, while image processing is done by the AI Server itself.

Video Processing Endpoints

Image Processing Endpoints

Architecture

The AI Server is designed to be a lightweight router for AI services, providing a common interface for accessing them via APIs with typed client support in many languages. As such, heavy processing tasks are offloaded to other services, including self-hosted ones like the ComfyUI Agent.

graph TD
    A[API Client] -->|API Request| B[AI Server]
    B -->|API Request| C[Replicate API]
    B -->|API Request| I[OpenRouter API]
    B -->|API Request| D[OpenAI API]
    B -->|Video Processing| E[ComfyUI Agent]
    E -->|Video Processing| F[FFmpeg]
    E -->|AI Processing| G[PiperTTS]
    E -->|AI Processing| J[Whisper]
    E -->|AI Processing| K[SDXL]
    E -->|AI Processing| L[Flux.1.Schnell]
    B -->|Image Processing| H[AI Server]
    B -->|AI Processing| E[ComfyUI Agent]