ReqLLM image support plans

Image Generation Support - Architecture and Implementation Plan

Overview

This document outlines the architecture and implementation plan for adding image generation support to ReqLLM, following the library's established patterns for operation-based dispatch and provider abstraction.

Goals

  1. Add first-class image generation support with generate_image/3 and generate_image!/3 functions
  2. Support OpenAI (DALL-E) and Google (Gemini) as initial providers
  3. Design a flexible API that can accommodate future providers
  4. Maintain consistency with existing ReqLLM patterns and conventions

API Design

High-Level Public API

Following the Vercel AI SDK-inspired pattern used for generate_text/3 and generate_object/4:

# Simple image generation
{:ok, response} = ReqLLM.generate_image("openai:dall-e-3", "A sunset over mountains")

# With options
{:ok, response} = ReqLLM.generate_image(
  "openai:dall-e-3",
  "A sunset over mountains",
  size: "1792x1024",
  quality: :hd,
  style: :natural
)

# Bang version for simple use cases
image = ReqLLM.generate_image!("google:gemini-2.0-flash-preview-image-generation", "A cat in space")

# Access response data
response.images          # List of generated images
response.revised_prompt  # Provider's revised prompt (if applicable)
response.usage           # Cost/usage metadata

Module Structure

lib/req_llm/
├── image_generation.ex         # High-level API module (like Generation, Embedding)
├── image_generation/
│   └── response.ex             # ImageGenerationResponse struct

Data Structures

ImageGenerationResponse

A new response struct specifically for image generation results:

defmodule ReqLLM.ImageGenerationResponse do
  @moduledoc """
  Response struct for image generation operations.

  Contains generated images with metadata, usage information,
  and provider-specific details.
  """

  use TypedStruct

  alias __MODULE__.Image

  typedstruct enforce: true do
    # Core fields
    field(:id, String.t())
    field(:model, String.t())

    # Generated images (list to support n > 1)
    field(:images, [Image.t()])

    # Provider may revise the prompt (DALL-E 3 does this)
    field(:revised_prompt, String.t() | nil, default: nil)

    # Metadata
    field(:usage, map() | nil)
    field(:provider_meta, map(), default: %{})

    # Error handling
    field(:error, Exception.t() | nil, default: nil)
  end

  @doc "Extract first image data (convenience for n=1 case)"
  def image(response), do: List.first(response.images)

  @doc "Extract first image as binary data"
  def data(response), do: image(response) && image(response).data

  @doc "Extract first image URL (if url format was requested)"
  def url(response), do: image(response) && image(response).url
end

Image Struct

Individual image representation:

defmodule ReqLLM.ImageGenerationResponse.Image do
  @moduledoc """
  Represents a single generated image.

  Images can be returned as URLs (temporary, provider-hosted) or
  as base64-encoded binary data, depending on the response_format option.
  """

  use TypedStruct

  typedstruct do
    # Image data (mutually exclusive with url)
    field(:data, binary() | nil)
    field(:media_type, String.t() | nil)  # e.g., "image/png"

    # URL (mutually exclusive with data)
    field(:url, String.t() | nil)

    # Provider's revised prompt for this specific image
    field(:revised_prompt, String.t() | nil)

    # Index for batch generation (n > 1)
    field(:index, non_neg_integer(), default: 0)
  end

  @doc "Check if image is base64 data"
  def base64?(image), do: image.data != nil

  @doc "Check if image is URL"
  def url?(image), do: image.url != nil

  @doc "Convert to ContentPart for use in multi-modal prompts"
  def to_content_part(%__MODULE__{data: data, media_type: media_type}) when data != nil do
    ReqLLM.Message.ContentPart.image(data, media_type)
  end

  def to_content_part(%__MODULE__{url: url}) when url != nil do
    ReqLLM.Message.ContentPart.image_url(url)
  end
end

Options Schema

Universal Image Generation Options

@image_generation_schema NimbleOptions.new!(
  # Number of images to generate
  n: [
    type: :pos_integer,
    default: 1,
    doc: "Number of images to generate (1-10, provider dependent)"
  ],

  # Image dimensions
  size: [
    type: :string,
    doc: "Image size (e.g., '1024x1024', '1792x1024'). Provider-specific."
  ],

  # Response format
  response_format: [
    type: {:in, [:url, :b64_json]},
    default: :b64_json,
    doc: "Format for returned images: :url (temporary URL) or :b64_json (base64 data)"
  ],

  # User identifier
  user: [
    type: :string,
    doc: "User identifier for tracking and abuse detection"
  ],

  # Provider-specific options pass-through
  provider_options: [
    type: {:or, [:map, {:list, :any}]},
    doc: "Provider-specific options",
    default: []
  ],

  # HTTP options
  req_http_options: [
    type: {:or, [:map, {:list, :any}]},
    doc: "Req-specific HTTP options",
    default: []
  ],

  # Testing
  fixture: [
    type: {:or, [:string, {:tuple, [:atom, :string]}]},
    doc: "HTTP fixture for testing"
  ]
)

OpenAI-Specific Options

@openai_image_schema [
  # DALL-E 3 quality
  quality: [
    type: {:in, [:standard, :hd, "standard", "hd"]},
    default: :standard,
    doc: "Image quality (DALL-E 3 only): :standard or :hd"
  ],

  # DALL-E 3 style
  style: [
    type: {:in, [:vivid, :natural, "vivid", "natural"]},
    default: :vivid,
    doc: "Image style (DALL-E 3 only): :vivid or :natural"
  ]
]

Google-Specific Options

@google_image_schema [
  # Aspect ratio
  aspect_ratio: [
    type: {:in, ["1:1", "16:9", "9:16", "4:3", "3:4"]},
    doc: "Image aspect ratio"
  ],

  # Safety settings
  google_safety_settings: [
    type: {:list, :map},
    doc: "Safety filter settings"
  ]
]

Provider API Details

OpenAI DALL-E API

Endpoint: POST https://api.openai.com/v1/images/generations

Request Parameters:

  • model - "dall-e-2" or "dall-e-3"
  • prompt - Text description (required)
  • n - Number of images (1-10 for DALL-E 2, only 1 for DALL-E 3)
  • size - Image dimensions
    • DALL-E 2: "256x256", "512x512", "1024x1024"
    • DALL-E 3: "1024x1024", "1792x1024", "1024x1792"
  • quality - "standard" or "hd" (DALL-E 3 only)
  • style - "vivid" or "natural" (DALL-E 3 only)
  • response_format - "url" or "b64_json"
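
Example request body (illustrative values, per the parameters above):

{
  "model": "dall-e-3",
  "prompt": "A sunset over mountains",
  "n": 1,
  "size": "1024x1024",
  "quality": "hd",
  "style": "natural",
  "response_format": "b64_json"
}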

Response Format:

{
  "created": 1234567890,
  "data": [
    {
      "b64_json": "...",
      "revised_prompt": "A detailed sunset..."
    }
  ]
}

Google Gemini Image API

Endpoint: POST https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent

Models:

  • gemini-2.0-flash-preview-image-generation - Fast, efficient model
  • Future: gemini-3-pro-image-preview - Advanced model with reasoning

Request Format:

{
  "contents": [
    {
      "role": "user",
      "parts": [{"text": "Generate an image of..."}]
    }
  ],
  "generationConfig": {
    "responseModalities": ["TEXT", "IMAGE"],
    "imageConfig": {
      "aspectRatio": "16:9"
    }
  }
}

Response Format:

{
  "candidates": [
    {
      "content": {
        "parts": [
          {"text": "Here's the image..."},
          {
            "inlineData": {
              "mimeType": "image/png",
              "data": "base64..."
            }
          }
        ]
      }
    }
  ],
  "usageMetadata": {...}
}

Provider Implementation

Operation Type

Add :image_generation to the operation type:

# In ReqLLM.Provider
@type operation :: :chat | :object | :embedding | :image_generation | atom()

OpenAI Implementation

# In ReqLLM.Providers.OpenAI

@impl ReqLLM.Provider
def prepare_request(:image_generation, model_spec, prompt, opts) do
  with {:ok, model} <- ReqLLM.model(model_spec),
       {:ok, processed_opts} <- process_image_options(__MODULE__, model, opts) do

    http_opts = Keyword.get(processed_opts, :req_http_options, [])

    request =
      Req.new(
        [
          url: "/images/generations",
          method: :post,
          receive_timeout: 120_000  # Image generation can take longer
        ] ++ http_opts
      )
      |> Req.Request.register_options(image_option_keys())
      |> Req.Request.merge_options(
        Keyword.take(processed_opts, image_option_keys()) ++
          [
            model: model.id,
            prompt: prompt,
            operation: :image_generation,
            base_url: Keyword.get(processed_opts, :base_url, base_url())
          ]
      )
      |> attach_image(model, processed_opts)

    {:ok, request}
  end
end

defp encode_image_body(request) do
  body = %{
    "model" => request.options[:model],
    "prompt" => request.options[:prompt],
    "n" => request.options[:n] || 1,
    "size" => request.options[:size] || "1024x1024",
    "response_format" => to_string(request.options[:response_format] || :b64_json)
  }
  |> maybe_put("quality", request.options[:quality])
  |> maybe_put("style", request.options[:style])
  |> maybe_put("user", request.options[:user])

  request
  |> Req.Request.put_header("content-type", "application/json")
  |> Map.put(:body, Jason.encode!(body))
end
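
The encoders above rely on a small maybe_put utility to skip nil values; if the provider module does not already define one, a minimal sketch:

# Hypothetical helper: only put the key when a value is present.
defp maybe_put(map, _key, nil), do: map
defp maybe_put(map, key, value), do: Map.put(map, key, value)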

defp decode_image_response({req, %Req.Response{status: 200, body: body} = resp}) do
  parsed = ensure_parsed_body(body)

  images =
    parsed["data"]
    |> Enum.with_index()
    |> Enum.map(fn {image_data, index} ->
      %ReqLLM.ImageGenerationResponse.Image{
        data: image_data["b64_json"] && Base.decode64!(image_data["b64_json"]),
        url: image_data["url"],
        media_type: "image/png",
        revised_prompt: image_data["revised_prompt"],
        index: index
      }
    end)

  response = %ReqLLM.ImageGenerationResponse{
    id: parsed["created"] |> to_string(),
    model: req.options[:model],
    images: images,
    revised_prompt: List.first(images) && List.first(images).revised_prompt,
    usage: nil,  # OpenAI doesn't return usage for image generation
    provider_meta: %{}
  }

  {req, %{resp | body: response}}
end

Google Implementation

# In ReqLLM.Providers.Google

@impl ReqLLM.Provider
def prepare_request(:image_generation, model_spec, prompt, opts) do
  with {:ok, model} <- ReqLLM.model(model_spec),
       {:ok, processed_opts} <- process_image_options(__MODULE__, model, opts) do

    http_opts = Keyword.get(processed_opts, :req_http_options, [])

    request =
      Req.new(
        [
          url: "/models/#{model.id}:generateContent",
          method: :post,
          receive_timeout: 120_000
        ] ++ http_opts
      )
      |> Req.Request.register_options(image_option_keys())
      |> Req.Request.merge_options(
        Keyword.take(processed_opts, image_option_keys()) ++
          [
            model: model.id,
            prompt: prompt,
            operation: :image_generation,
            base_url: effective_base_url(processed_opts)
          ]
      )
      |> attach_image(model, processed_opts)

    {:ok, request}
  end
end

defp encode_image_body(%Req.Request{options: %{operation: :image_generation}} = request) do
  prompt = request.options[:prompt]

  generation_config =
    %{
      responseModalities: ["TEXT", "IMAGE"]
    }
    |> maybe_put_image_config(request.options)

  body = %{
    contents: [
      %{
        role: "user",
        parts: [%{text: prompt}]
      }
    ],
    generationConfig: generation_config
  }
  |> maybe_put(:safetySettings, request.options[:google_safety_settings])

  request
  |> Req.Request.put_header("content-type", "application/json")
  |> Map.put(:body, Jason.encode!(body))
end

defp maybe_put_image_config(config, opts) do
  image_config =
    %{}
    |> maybe_put(:aspectRatio, opts[:aspect_ratio])

  if map_size(image_config) > 0 do
    Map.put(config, :imageConfig, image_config)
  else
    config
  end
end

defp decode_image_response(
       {%Req.Request{options: %{operation: :image_generation}} = req,
        %Req.Response{status: 200, body: body} = resp}
     ) do
  parsed = ensure_parsed_body(body)

  images =
    case parsed do
      %{"candidates" => [%{"content" => %{"parts" => parts}} | _]} ->
        parts
        |> Enum.with_index()
        |> Enum.filter(fn {part, _} -> Map.has_key?(part, "inlineData") end)
        |> Enum.map(fn {part, index} ->
          inline_data = part["inlineData"]
          %ReqLLM.ImageGenerationResponse.Image{
            data: Base.decode64!(inline_data["data"]),
            media_type: inline_data["mimeType"],
            index: index
          }
        end)
      _ ->
        []
    end

  text_content =
    case parsed do
      %{"candidates" => [%{"content" => %{"parts" => parts}} | _]} ->
        parts
        |> Enum.filter(&Map.has_key?(&1, "text"))
        |> Enum.map_join("", & &1["text"])
      _ ->
        nil
    end

  usage = extract_usage_from_google_response(parsed)

  response = %ReqLLM.ImageGenerationResponse{
    id: "google-#{System.unique_integer([:positive])}",
    model: req.options[:model],
    images: images,
    revised_prompt: text_content,
    usage: usage,
    provider_meta: %{}
  }

  {req, %{resp | body: response}}
end

ImageGeneration Module

The high-level API module following the pattern of ReqLLM.Embedding:

defmodule ReqLLM.ImageGeneration do
  @moduledoc """
  Image generation functionality for ReqLLM.

  Provides text-to-image generation capabilities with support for:
  - Single and batch image generation
  - Multiple output formats (URL or base64)
  - Provider-specific options (quality, style, aspect ratio)

  ## Supported Providers

  - OpenAI (DALL-E 2, DALL-E 3)
  - Google (Gemini 2.0 Flash Image)

  ## Examples

      # Simple generation
      {:ok, response} = ReqLLM.ImageGeneration.generate("openai:dall-e-3", "A sunset")
      image_data = ReqLLM.ImageGenerationResponse.data(response)

      # With options
      {:ok, response} = ReqLLM.ImageGeneration.generate(
        "openai:dall-e-3",
        "A professional portrait",
        size: "1024x1024",
        quality: :hd,
        style: :natural
      )
  """

  alias ReqLLM.ImageGenerationResponse

  @base_schema NimbleOptions.new!(
    n: [type: :pos_integer, default: 1],
    size: [type: :string],
    response_format: [type: {:in, [:url, :b64_json]}, default: :b64_json],
    user: [type: :string],
    provider_options: [type: {:or, [:map, {:list, :any}]}, default: []],
    req_http_options: [type: {:or, [:map, {:list, :any}]}, default: []],
    fixture: [type: {:or, [:string, {:tuple, [:atom, :string]}]}]
  )

  @doc "Returns the base image generation options schema."
  @spec schema :: NimbleOptions.t()
  def schema, do: @base_schema

  @doc """
  Returns list of model specs that support image generation.
  """
  @spec supported_models() :: [String.t()]
  def supported_models do
    # Initially hardcoded, later integrate with LLMDB capabilities
    [
      "openai:dall-e-2",
      "openai:dall-e-3",
      "google:gemini-2.0-flash-preview-image-generation"
    ]
  end

  @doc """
  Validates that a model supports image generation operations.
  """
  @spec validate_model(String.t() | {atom(), keyword()} | struct()) ::
          {:ok, LLMDB.Model.t()} | {:error, term()}
  def validate_model(model_spec) do
    with {:ok, model} <- ReqLLM.model(model_spec) do
      model_string = LLMDB.Model.spec(model)

      if model_string in supported_models() do
        {:ok, model}
      else
        {:error,
         ReqLLM.Error.Invalid.Parameter.exception(
           parameter: "model: #{model_string} does not support image generation"
         )}
      end
    end
  end

  @doc """
  Generates images from a text prompt.

  ## Parameters

    * `model_spec` - Model specification (e.g., "openai:dall-e-3")
    * `prompt` - Text description of the image to generate
    * `opts` - Generation options

  ## Options

    * `:n` - Number of images to generate (default: 1)
    * `:size` - Image dimensions (provider-specific)
    * `:response_format` - :url or :b64_json (default: :b64_json)
    * `:quality` - :standard or :hd (OpenAI DALL-E 3 only)
    * `:style` - :vivid or :natural (OpenAI DALL-E 3 only)
    * `:aspect_ratio` - "1:1", "16:9", etc. (Google only)
    * `:provider_options` - Provider-specific options

  ## Examples

      {:ok, response} = ReqLLM.ImageGeneration.generate(
        "openai:dall-e-3",
        "A serene mountain landscape at sunset"
      )

      # Get the image data
      image = ReqLLM.ImageGenerationResponse.image(response)
      File.write!("landscape.png", image.data)
  """
  @spec generate(
          String.t() | {atom(), keyword()} | struct(),
          String.t(),
          keyword()
        ) :: {:ok, ImageGenerationResponse.t()} | {:error, term()}
  def generate(model_spec, prompt, opts \\ [])

  def generate(model_spec, prompt, opts) when is_binary(prompt) do
    with {:ok, model} <- validate_model(model_spec),
         :ok <- validate_prompt(prompt),
         {:ok, provider_module} <- ReqLLM.provider(model.provider),
         {:ok, request} <- provider_module.prepare_request(:image_generation, model, prompt, opts),
         {:ok, %Req.Response{status: status, body: response}} when status in 200..299 <-
           Req.request(request) do
      {:ok, response}
    else
      {:ok, %Req.Response{status: status, body: body}} ->
        {:error,
         ReqLLM.Error.API.Request.exception(
           reason: "HTTP #{status}: Request failed",
           status: status,
           response_body: body
         )}

      {:error, error} ->
        {:error, error}
    end
  end

  @doc """
  Generates images from a text prompt, raising on error.
  """
  @spec generate!(
          String.t() | {atom(), keyword()} | struct(),
          String.t(),
          keyword()
        ) :: ImageGenerationResponse.t()
  def generate!(model_spec, prompt, opts \\ []) do
    case generate(model_spec, prompt, opts) do
      {:ok, response} -> response
      {:error, error} -> raise error
    end
  end

  defp validate_prompt("") do
    {:error, ReqLLM.Error.Invalid.Parameter.exception(parameter: "prompt: cannot be empty")}
  end

  defp validate_prompt(prompt) when is_binary(prompt), do: :ok
end

Integration with Main API

Add to the main ReqLLM module:

# In lib/req_llm.ex

alias ReqLLM.ImageGeneration

@doc """
Generates images from a text prompt using an AI model.

Returns a canonical ImageGenerationResponse which includes generated images,
usage data, and metadata.

## Parameters

  * `model_spec` - Model specification (e.g., "openai:dall-e-3")
  * `prompt` - Text description of the image to generate
  * `opts` - Additional options (keyword list)

## Options

  * `:n` - Number of images to generate (default: 1)
  * `:size` - Image dimensions (e.g., "1024x1024", "1792x1024")
  * `:response_format` - :url or :b64_json (default: :b64_json)
  * `:quality` - :standard or :hd (OpenAI DALL-E 3 only)
  * `:style` - :vivid or :natural (OpenAI DALL-E 3 only)
  * `:aspect_ratio` - "1:1", "16:9", etc. (Google only)
  * `:provider_options` - Provider-specific options

## Examples

    {:ok, response} = ReqLLM.generate_image("openai:dall-e-3", "A sunset over mountains")

    # Access first image
    image = ReqLLM.ImageGenerationResponse.image(response)
    File.write!("sunset.png", image.data)

"""
defdelegate generate_image(model_spec, prompt, opts \\ []), to: ImageGeneration, as: :generate

@doc """
Generates images from a text prompt, returning the response directly.
Raises on error.
"""
defdelegate generate_image!(model_spec, prompt, opts \\ []), to: ImageGeneration, as: :generate!

Error Handling

Error Types

# Content policy violation
%ReqLLM.Error.API.Response{
  reason: "Content policy violation",
  status: 400,
  response_body: %{"error" => %{"code" => "content_policy_violation"}}
}

# Invalid model for image generation
%ReqLLM.Error.Invalid.Parameter{
  parameter: "model: openai:gpt-4 does not support image generation"
}

# Provider-specific errors
%ReqLLM.Error.API.Response{
  reason: "Rate limit exceeded",
  status: 429,
  response_body: %{}
}

Testing Strategy

Unit Tests

defmodule ReqLLM.ImageGenerationTest do
  use ExUnit.Case

  describe "generate/3" do
    test "validates empty prompt" do
      assert {:error, %ReqLLM.Error.Invalid.Parameter{}} =
        ReqLLM.ImageGeneration.generate("openai:dall-e-3", "")
    end

    test "validates unsupported model" do
      assert {:error, %ReqLLM.Error.Invalid.Parameter{}} =
        ReqLLM.ImageGeneration.generate("openai:gpt-4", "A cat")
    end
  end

  describe "supported_models/0" do
    test "returns image generation capable models" do
      models = ReqLLM.ImageGeneration.supported_models()
      assert "openai:dall-e-3" in models
      refute "openai:gpt-4" in models
    end
  end
end

Integration Tests with Fixtures

defmodule ReqLLM.ImageGeneration.OpenAITest do
  use ExUnit.Case

  @moduletag :integration

  describe "OpenAI DALL-E 3" do
    test "generates image with default options" do
      {:ok, response} = ReqLLM.generate_image(
        "openai:dall-e-3",
        "A simple red circle on white background",
        fixture: "openai_dalle3_simple"
      )

      assert %ReqLLM.ImageGenerationResponse{} = response
      assert length(response.images) == 1
      assert response.images |> hd() |> Map.get(:data) |> is_binary()
    end
  end
end

defmodule ReqLLM.ImageGeneration.GoogleTest do
  use ExUnit.Case

  @moduletag :integration

  describe "Google Gemini Image" do
    test "generates image with default options" do
      {:ok, response} = ReqLLM.generate_image(
        "google:gemini-2.0-flash-preview-image-generation",
        "A simple blue square",
        fixture: "google_gemini_image_simple"
      )

      assert %ReqLLM.ImageGenerationResponse{} = response
      assert length(response.images) >= 1
    end
  end
end

Implementation Phases

Phase 1: Core Infrastructure

  • Create ReqLLM.ImageGenerationResponse struct
  • Create ReqLLM.ImageGenerationResponse.Image struct
  • Create ReqLLM.ImageGeneration module with schema
  • Add :image_generation operation type to Provider behavior
  • Add model validation for image generation capability

Phase 2: OpenAI Provider

  • Implement prepare_request(:image_generation, ...) in OpenAI provider
  • Implement encode_image_body/1 for DALL-E API format
  • Implement decode_image_response/1 for DALL-E response parsing
  • Add OpenAI-specific options (quality, style)
  • Add unit tests
  • Add integration tests with fixtures

Phase 3: Google Provider

  • Implement prepare_request(:image_generation, ...) in Google provider
  • Modify encode_body/1 to handle image generation operation
  • Modify decode_response/1 to handle image responses
  • Add Google-specific options (aspect_ratio)
  • Add unit tests
  • Add integration tests with fixtures

Phase 4: Main API Integration

  • Add generate_image/3 and generate_image!/3 to ReqLLM module
  • Add documentation with examples
  • Update README with image generation section

Phase 5: LLMDB Integration

  • Add image generation models to LLMDB (if not already present)
  • Add image_generation capability flag
  • Add model constraints (sizes, aspect ratios, etc.)

Phase 6: Documentation & Polish

  • Complete module documentation
  • Add usage examples
  • Add error handling guide

Future Considerations

Image Editing (Future Phase)

# Potential future API for image editing (DALL-E 2 only)
ReqLLM.edit_image("openai:dall-e-2",
  image: image_binary,
  mask: mask_binary,
  prompt: "Add a hat"
)

Image Variations (Future Phase)

# Potential future API for image variations (DALL-E 2 only)
ReqLLM.vary_image("openai:dall-e-2",
  image: original_image,
  n: 3
)

Multi-Image Composition (Google)

Google's Gemini 3 Pro supports using multiple reference images:

# Potential future API
ReqLLM.generate_image("google:gemini-3-pro-image",
  prompt: "Combine these styles",
  reference_images: [image1, image2, image3]
)

Summary

This architecture provides:

  1. Consistency: Follows existing ReqLLM patterns for operations, providers, and responses
  2. Flexibility: Provider-specific options while maintaining a unified API
  3. Extensibility: Easy to add new providers and future features (editing, variations)
  4. Type Safety: Clear struct definitions with TypedStruct
  5. Testability: Fixture support and clear separation of concerns

The implementation follows the established patterns from ReqLLM.Embedding and ReqLLM.Generation, making it familiar to existing users and maintainers.

Image Generation Support (OpenAI + Google) — Architecture & Implementation Plan

Goals

  • Add first-class image generation to ReqLLM with initial support for:
    • OpenAI Images API
    • Google Gemini image generation
  • Keep the API and provider integration flexible enough to add more providers and image-related operations later (edit, variation, upscale).
  • Reuse the existing ReqLLM architecture patterns:
    • operation-driven provider entry points (prepare_request/4)
    • canonical responses (ReqLLM.Response)
    • provider option translation (translate_options/3)
    • fixture-based coverage testing

Non-goals (v1)

  • Progressive/streaming image generation.
  • Image editing/variation endpoints (design leaves room, but initial implementation is “generate images from prompt”).
  • Helper utilities that write images to disk.

High-level Design

Image generation becomes a new operation in the existing provider + operation architecture:

  • Public API delegates to a new core module (parallel to ReqLLM.Generation and ReqLLM.Embedding).
  • Providers implement prepare_request(:image, ...) and response decoding.
  • Non-streaming provider responses are converted into canonical output by reusing the StreamChunk → ResponseBuilder convergence pattern:
    • Providers decode API responses into a list of canonical “chunks” (including image chunks).
    • A ResponseBuilder assembles a standard ReqLLM.Response with ReqLLM.Message.ContentPart entries for images.

This keeps the output consistent and ensures downstream workflows (context merging, multi-turn calls, tool loops) work without special cases.

Public API

Add new entry points in ReqLLM:

  • ReqLLM.generate_image/3
    • generate_image(model_spec, prompt_or_messages, opts \\ [])
    • Returns {:ok, ReqLLM.Response.t()} | {:error, term()}
  • ReqLLM.generate_image!/3
    • Raises on error; returns the ReqLLM.Response directly (image content is reached via the helper accessors below).

Add response convenience:

  • ReqLLM.Response.images/1 → returns a list of image content parts from response.message.content
    • [%ReqLLM.Message.ContentPart{type: :image, data: ..., media_type: ...}, ...] and/or :image_url
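
A minimal sketch of the accessor, assuming the response/message shapes above:

# Sketch: collect image parts (binary or URL) from the assistant message.
def images(%ReqLLM.Response{message: %ReqLLM.Message{content: parts}}) when is_list(parts) do
  Enum.filter(parts, &(&1.type in [:image, :image_url]))
end

def images(_response), do: []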

Input normalization

Follow the existing ergonomics used by text generation:

  • Accept either:
    • a string prompt
    • a message list / context-compatible structure
  • Normalize into ReqLLM.Context using ReqLLM.Context.normalize/2 (same pattern as streaming).

Canonical Output

Return a normal ReqLLM.Response where the assistant message contains image content parts:

  • %ReqLLM.Message.ContentPart{type: :image, data: binary(), media_type: "image/png" | ...}
  • %ReqLLM.Message.ContentPart{type: :image_url, url: String.t()}

The response is then merged into the conversation context via ReqLLM.Context.merge_response/2 the same way as text generation.

Canonical Image Options (Provider-neutral)

Add an image generation options schema (new schema entry point similar to generation and embedding schemas). Suggested canonical options:

  • :n (positive integer, default 1) — number of images requested
  • :size — either "1024x1024"-style string or {width, height}
  • :aspect_ratio — e.g. "1:1", "16:9", "9:16" (for providers that prefer aspect ratio)
  • :output_format (:png | :jpeg | :webp, default :png)
  • :response_format (:binary | :url | :b64_json, default :binary)
    • :binary means “return ContentPart.image if provider can return bytes”
    • :url means “return ContentPart.image_url
    • :b64_json means “return base64 payload in provider_meta and/or convert to binary based on provider mapping”
  • :seed (integer | nil)
  • :quality (provider-mapped; e.g. :standard | :hd)
  • :style (provider-mapped; e.g. :vivid | :natural)
  • :negative_prompt (string | nil) — included for cross-provider compatibility
  • :user (string | nil) — OpenAI-style auditing/user tracking
  • Standard ReqLLM pass-throughs:
    • :provider_options (map/keyword)
    • :req_http_options (map/keyword)
    • :fixture (for test harness)

Providers can implement translate_options(:image, ...) to rename/drop/validate per-provider constraints without leaking provider specifics into the public API.
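
For example, the Google provider could drop the OpenAI-only options rather than sending unknown fields (a sketch; the exact return contract should follow the existing translate_options/3 behaviour):

# Sketch only: Gemini image generation has no :quality/:style knobs.
def translate_options(:image, _model, opts) do
  opts
  |> Keyword.delete(:quality)
  |> Keyword.delete(:style)
end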

New Operation: :image

Add a new operation atom (:image) to the provider contract. Providers should support:

  • prepare_request(:image, model, input, opts) — build a Req request for image generation
  • decode_response/1 — parse the provider response

Recommended core implementation module:

  • ReqLLM.Images (parallel to ReqLLM.Generation and ReqLLM.Embedding)
    • schema/0 for options
    • generate/3 (used by ReqLLM.generate_image/3)
    • optional: validate_model/1 and supported_models/0

Core flow (mirrors text generation):

  1. Resolve model: ReqLLM.model(model_spec)
  2. Resolve provider: ReqLLM.provider(model.provider)
  3. Normalize prompt/messages to ReqLLM.Context
  4. Process/validate options: ReqLLM.Provider.Options.process!(provider, :image, model, opts ++ [context: context])
  5. Build request: provider.prepare_request(:image, model, context_or_prompt, processed_opts)
  6. Execute: Req.request/1
  7. Decode: provider decodes into canonical chunks and builds a canonical ReqLLM.Response
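
Put together, the core of ReqLLM.Images.generate/3 could look roughly like this (a sketch; the normalize/process return shapes are assumptions based on the steps above):

def generate(model_spec, prompt_or_messages, opts \\ []) do
  with {:ok, model} <- ReqLLM.model(model_spec),
       {:ok, provider} <- ReqLLM.provider(model.provider),
       {:ok, context} <- ReqLLM.Context.normalize(prompt_or_messages, opts),
       processed = ReqLLM.Provider.Options.process!(provider, :image, model, opts ++ [context: context]),
       {:ok, request} <- provider.prepare_request(:image, model, context, processed),
       {:ok, %Req.Response{body: %ReqLLM.Response{} = response}} <- Req.request(request) do
    {:ok, response}
  end
end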

Unification: Extend StreamChunks + ResponseBuilder

ReqLLM already has an architectural convergence point for streaming and non-streaming: ReqLLM.Provider.ResponseBuilder. Image generation should reuse this approach.

Extend ReqLLM.StreamChunk

Add chunk types to represent image outputs:

  • :image — contains binary image bytes and mime type
  • :image_url — contains a URL string

Two viable representation strategies:

  • Explicit fields on StreamChunk
    • Add data, url, media_type fields and update constructors.
  • Metadata carrier
    • Keep struct fields unchanged and store image payload in chunk.metadata.

The explicit-fields approach is typically clearer for downstream assembly and typing.
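
Under the explicit-fields strategy, the new constructors might look like this (a sketch; field names assumed):

# Sketch: constructors for the new image chunk types.
def image(data, media_type, metadata \\ %{}) when is_binary(data) do
  %__MODULE__{type: :image, data: data, media_type: media_type, metadata: metadata}
end

def image_url(url, metadata \\ %{}) when is_binary(url) do
  %__MODULE__{type: :image_url, url: url, metadata: metadata}
end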

Extend ReqLLM.Provider.Defaults.ResponseBuilder

Update chunk accumulation and message assembly logic to include images:

  • Accumulate image chunks and url chunks alongside text/thinking/tool chunks.
  • Build ReqLLM.Message.ContentPart list including :image and :image_url parts.
  • Preserve stable ordering:
    • either preserve provider order if events are emitted in order
    • or group as [images..., text..., thinking..., tool_calls...] if providers don’t provide stable mixed modality ordering

This ensures image generation yields a standard ReqLLM.Response and composes with Context.merge_response/2.

Provider Implementations

OpenAI (Images API)

Request

  • Endpoint: POST /v1/images/generations
  • Headers:
    • Authorization: Bearer <api_key>
    • Content-Type: application/json
  • Body mapping:
    • model ← model.id
    • prompt ← extracted prompt string (v1: treat image generation as prompt-based, not full multi-turn context)
    • n, size, quality, style, user
    • response_format:
      • :url"url"
      • :binary and :b64_json"b64_json" (decode to bytes for :binary)

Response decoding

OpenAI image responses include data entries with either:

  • {"b64_json": "..."} — base64 image bytes
  • {"url": "..."} — URL to the image

Decode into StreamChunks:

  • b64_json:image chunk(s) with media_type derived from :output_format
  • url:image_url chunk(s)

Usage metadata is typically absent for images; store provider extras in provider_meta["openai"] and set usage: nil.
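
Schematically, using the constructors sketched above (field names come from the OpenAI response):

# Sketch: map one OpenAI "data" entry to a canonical chunk.
defp image_entry_to_chunk(%{"b64_json" => b64}, media_type) do
  ReqLLM.StreamChunk.image(Base.decode64!(b64), media_type)
end

defp image_entry_to_chunk(%{"url" => url}, _media_type) do
  ReqLLM.StreamChunk.image_url(url)
end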

Future-proofing

Reserve internal representation for:

  • edits (image + mask + prompt)
  • variations (image + n)

These can be added later as separate operations (:image_edit, :image_variation) or as additional options gated by provider support.

Google (Gemini image generation via generateContent)

Request

  • Endpoint:
    • POST https://generativelanguage.googleapis.com/v1beta/models/<model>:generateContent
  • Headers:
    • x-goog-api-key: <api_key>
    • Content-Type: application/json
  • Body mapping:
    • contents: [%{parts: [%{text: prompt}]}]
    • Include generationConfig.responseModalities containing "IMAGE"
      • optionally include "TEXT" if you want a caption alongside images
    • Align output mime with :output_format if supported

Response decoding

Responses contain image bytes in candidates[0].content.parts[*].inlineData (base64) with mime type.

Decode parts:

  • text → emit :content chunks if present
  • inlineData / inline_data:
    • base64 decode to bytes
    • emit :image chunk with media_type from the response

Store any Google-specific extras in provider_meta["google"].
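
A sketch of the per-part decoding (constructor names as above; handles both key casings):

# Sketch: turn Gemini content parts into canonical chunks.
defp part_to_chunk(%{"text" => text}), do: ReqLLM.StreamChunk.text(text)

defp part_to_chunk(%{"inlineData" => %{"data" => b64, "mimeType" => mime}}) do
  ReqLLM.StreamChunk.image(Base.decode64!(b64), mime)
end

defp part_to_chunk(%{"inline_data" => %{"data" => b64, "mime_type" => mime}}) do
  ReqLLM.StreamChunk.image(Base.decode64!(b64), mime)
end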

Model Capability Gating

Add image capability checks similar to embeddings:

  • Preferred: registry capability flag such as model.capabilities.images == true
  • Fallback: validate by attempting provider.prepare_request(:image, ...) and treating “operation not supported” as not-image-capable

Expose convenience:

  • ReqLLM.Images.supported_models/0 — list models with images capability
  • ReqLLM.Images.validate_model/1 — validate model supports image operations and return %LLMDB.Model{}
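
A sketch of the registry-based check (the exact capability key is an assumption):

def validate_model(model_spec) do
  with {:ok, model} <- ReqLLM.model(model_spec) do
    if model.capabilities && model.capabilities.images do
      {:ok, model}
    else
      {:error, {:unsupported_operation, :image}}
    end
  end
end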

Testing Strategy

Follow the project’s fixture-based approach.

Provider/unit tests (no live calls)

  • Encode tests:
    • OpenAI: correct endpoint, headers, and body mapping for :image
    • Google: correct endpoint and response modalities configuration
  • Decode tests:
    • OpenAI: b64_json and url response parsing
    • Google: inlineData parsing and base64 decode
  • ResponseBuilder tests:
    • ensure image chunks become ContentPart.image / image_url in response messages

Coverage tests (fixture-based)

  • Add coverage tests under test/coverage for:
    • OpenAI image generation
    • Google image generation
  • Use minimal payload sizes to keep fixtures small:
    • n: 1
    • smallest supported size
  • Assertions:
    • length(images) == 1
    • media_type matches
    • image bytes non-empty
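
For example (fixture name hypothetical):

test "generates a single image from a fixture" do
  {:ok, response} =
    ReqLLM.generate_image("openai:dall-e-3", "A red square", n: 1, fixture: "openai_image_basic")

  [image] = ReqLLM.Response.images(response)
  assert image.media_type =~ "image/"
  assert byte_size(image.data) > 0
end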

Extensibility Checklist (Adding Future Providers)

To add a new provider:

  • Implement prepare_request(:image, model, input, opts)
  • Implement decode logic that emits canonical image chunks (:image and/or :image_url)
  • Optionally implement translate_options(:image, ...)
  • Add capability gating via registry or prepare_request validation

No change should be required in the public API if the provider conforms to the canonical chunk + response assembly model.

Step-by-step Implementation Plan

  1. Add ReqLLM.generate_image/3 and ReqLLM.generate_image!/3 delegating to a new core module.
  2. Add ReqLLM.Images:
    • schema and option validation
    • input normalization via ReqLLM.Context.normalize/2
    • request execution and error handling mirroring ReqLLM.Generation
  3. Extend canonical structures:
    • extend ReqLLM.StreamChunk with image chunk types
    • extend ReqLLM.Provider.Defaults.ResponseBuilder to assemble image content parts
    • add ReqLLM.Response.images/1
  4. Implement provider support:
    • OpenAI: prepare_request(:image, ...) + decode to image chunks
    • Google: prepare_request(:image, ...) + decode inlineData to image chunks
  5. Add capability gating (ReqLLM.Images.validate_model/1, optional supported_models/0).
  6. Add tests:
    • provider encode/decode tests
    • coverage fixture tests
  7. Update docs (README/guides) with a basic example and how to write bytes to disk.

Architecture & Implementation Plan: Image Generation Support

This document outlines the architecture for adding image generation capabilities to req_llm, supporting OpenAI (DALL-E) and Google (Imagen/Gemini) initially.

1. New Core Data Structures

We will introduce a dedicated response struct for image generation to avoid overloading the text-centric ReqLLM.Response.

lib/req_llm/image_response.ex

A new struct to standardize image outputs across providers.

defmodule ReqLLM.ImageResponse do
  @moduledoc """
  Standardized response for image generation requests.
  """
  defstruct [
    :data,           # List of image data maps
    :created,        # Timestamp
    :usage,          # Usage/cost metadata (if available)
    :provider_meta,  # Raw provider response metadata
    :model           # The model used
  ]

  @type t :: %__MODULE__{ 
    data: [map()],
    created: integer() | nil,
    usage: map() | nil,
    provider_meta: map(),
    model: String.t() | nil
  }

  # Data items will follow this shape:
  # %{
  #   url: String.t() | nil,
  #   b64_json: String.t() | nil,
  #   revised_prompt: String.t() | nil,
  #   mime_type: String.t() | nil
  # }
end

2. Public API

We will add new entry points to ReqLLM and ReqLLM.Generation.

lib/req_llm.ex

@doc """
Generates images using the specified model.

## Options
  * `:n` - Number of images to generate (default: 1)
  * `:size` - Image size (e.g., "1024x1024")
  * `:quality` - Image quality ("standard" or "hd")
  * `:response_format` - "url" or "b64_json"
  * `:user` - User identifier
"""
defdelegate generate_image(model, prompt, opts \\ []), to: ReqLLM.Generation
defdelegate generate_image!(model, prompt, opts \\ []), to: ReqLLM.Generation

lib/req_llm/generation.ex

def generate_image(model_spec, prompt, opts \\ []) do
  with {:ok, model} <- ReqLLM.model(model_spec),
       {:ok, provider_module} <- ReqLLM.provider(model.provider),
       # New operation type: :image
       {:ok, request} <- provider_module.prepare_request(:image, model, prompt, opts),
       {:ok, response} <- Req.request(request) do
    # decoding is handled by provider's decode_response via Req steps
    # verify we got an ImageResponse
    case response.body do
       %ReqLLM.ImageResponse{} = image_resp -> {:ok, image_resp}
       _ -> {:error, "Unexpected response format"}
    end
  end
end

3. Provider Implementation

OpenAI Provider (lib/req_llm/providers/openai.ex)

prepare_request(:image, model, prompt, opts)

  • Endpoint: /v1/images/generations
  • Method: POST
  • Body:
    {
      "prompt": "...",
      "model": "dall-e-3",
      "n": 1,
      "size": "1024x1024",
      "quality": "standard",
      "response_format": "url",
      "style": "vivid"
    }
  • Translation: Map standard opts (:n, :size, :quality) to OpenAI fields.

decode_response({req, resp})

  • Detect operation via req.options[:operation] == :image.
  • Parse standard OpenAI image response:
    {
      "created": 123,
      "data": [{ "url": "...", "revised_prompt": "..." }]
    }
  • Return %ReqLLM.ImageResponse{}.

Google Provider (lib/req_llm/providers/google.ex)

prepare_request(:image, model, prompt, opts)

  • Endpoint: /models/#{model.id}:predict (standard for Imagen 3)
  • Method: POST
  • Body:
    {
      "instances": [{ "prompt": "..." }],
      "parameters": {
        "sampleCount": 1,
        "aspectRatio": "1:1", # Need to map "1024x1024" -> "1:1"
        "outputOptions": { "mimeType": "image/jpeg" }
      }
    }
  • Note: Google uses aspect ratios (1:1, 16:9, 9:16) rather than pixel dimensions for generation parameters. We will need a helper to infer aspect ratio from the :size string (e.g., "1024x1024" -> "1:1", "1024x1792" -> "9:16").
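
One possible sketch of that helper (the supported ratio list is an assumption):

# Candidate Imagen aspect ratios paired with their numeric width/height value.
@imagen_ratios [{"1:1", 1.0}, {"3:4", 0.75}, {"4:3", 4 / 3}, {"9:16", 9 / 16}, {"16:9", 16 / 9}]

# Sketch: pick the supported ratio closest to the requested pixel dimensions.
defp size_to_aspect_ratio(size) when is_binary(size) do
  [w, h] = size |> String.split("x") |> Enum.map(&String.to_integer/1)
  ratio = w / h

  {name, _value} = Enum.min_by(@imagen_ratios, fn {_name, value} -> abs(value - ratio) end)
  name
end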

decode_response({req, resp})

  • Detect operation via req.options[:operation] == :image.
  • Parse Google/Imagen response:
    {
      "predictions": [
        {
          "bytesBase64Encoded": "...",
          "mimeType": "image/jpeg"
        }
      ]
    }
  • Return %ReqLLM.ImageResponse{}. Note that Google typically returns base64 data, so the b64_json field will be populated.

4. Implementation Steps

  1. Create ReqLLM.ImageResponse struct.
  2. Update ReqLLM and ReqLLM.Generation to add generate_image/3.
  3. Update ReqLLM.Providers.OpenAI:
    • Implement prepare_request(:image, ...)
    • Update decode_response logic.
  4. Update ReqLLM.Providers.Google:
    • Implement prepare_request(:image, ...)
    • Implement aspect ratio mapping helper.
    • Update decode_response logic.
  5. Tests:
    • Add fixtures/cassettes for OpenAI (DALL-E 3) and Google (Imagen 3).
    • Verify request payloads and response decoding.

5. Usage Example

# OpenAI
{:ok, resp} = ReqLLM.generate_image("openai:dall-e-3", "A cyberpunk cat", size: "1024x1024")
IO.inspect(resp.data) 
#=> [%{url: "https://...", revised_prompt: "..."}]

# Google
{:ok, resp} = ReqLLM.generate_image("google:imagen-3.0-generate-001", "A cyberpunk cat", size: "1024x1024")
IO.inspect(resp.data)
#=> [%{b64_json: "...", mime_type: "image/jpeg"}]

Image Generation Implementation Plan Comparison

This document compares three proposed architectures for adding image generation support to ReqLLM.

Rank & Summary

| Rank | Plan        | Core Philosophy                                                         | Best For                            |
| ---- | ----------- | ----------------------------------------------------------------------- | ----------------------------------- |
| 1    | Codex Plan  | Unification: integrates images into standard ReqLLM.Response messages.  | Multimodal workflows & consistency. |
| 2    | Claude Plan | Specialization: dedicated structs and modules for image generation.     | Type safety & explicit APIs.        |
| 3    | Gemini Plan | Simplicity: lightweight implementation with a custom response struct.   | Rapid prototyping.                  |

Pros & Cons

1. Codex Plan (codex-images-plan.md)

  • Pros:
    • Architectural Unity: Treats images as standard ReqLLM.Message.ContentPart items within a ReqLLM.Response. This is critical for multimodal workflows (e.g., generating an image and immediately adding it to conversation history).
    • Future-Proof: Aligns with how models like GPT-4o and Gemini 2.0 handle mixed-modality outputs natively.
    • Reuse: Leverages existing Context, ResponseBuilder, and StreamChunk machinery, reducing code duplication.
  • Cons:
    • Complexity: Requires modifying core "streaming/response building" logic to handle image chunks.
    • Ergonomics: Accessing an image requires traversing response.message.content, though the plan suggests adding helper functions like ReqLLM.Response.images/1.

2. Claude Plan (claude-images-plan.md)

  • Pros:
    • Spec & Safety: Excellent use of TypedStruct and NimbleOptions for robust validation.
    • Clarity: A dedicated ImageGenerationResponse is easier for users who only want images and don't care about chat history.
    • Detail: Provides the most complete implementation details for OpenAI and Google (Gemini 2.0).
  • Cons:
    • Divergence: Creates a parallel API structure separate from the main ReqLLM.Response, making it harder to merge results back into a ReqLLM.Context.

3. Gemini Plan (gemini-images-plan.md)

  • Pros:
    • Simplicity: Very straightforward and direct implementation.
  • Cons:
    • Google API: Uses the legacy Vertex AI-style predict endpoint rather than the unified generateContent API.
    • Isolation: Returns a bespoke struct, missing the unification benefits of the Codex plan.

Recommendation: The Hybrid Codex Approach

We should implement the Codex Plan, as it treats images as first-class citizens in the library's multimodal future. However, we should incorporate strengths from the Claude plan:

  1. Unified Response: Follow the Codex approach: images should be returned as ContentParts in a standard ReqLLM.Response.
  2. Robust Options: Borrow Claude's detailed NimbleOptions definitions for standardizing size, quality, and style.
  3. Gemini 2.0 Integration: Use Claude's generateContent implementation for Google to ensure compatibility with the latest Flash models.
  4. Helper Accessors: Implement ReqLLM.Response.images(response) (from Codex) to ensure the ergonomics are as good as a dedicated struct.

Original Task

Your task is to create a detailed architecture and plan for adding image generation support to the library.

Here is the original issue that highlights one possible approach: https://github.com/agentjido/req_llm/issues/14
Here is the image generation API from OpenAI: https://platform.openai.com/docs/api-reference/images
And here is the one from Google: https://ai.google.dev/gemini-api/docs/image-generation

Initially, we just need to support image generation for these two providers, although the API and infrastructure should be flexible enough to allow adding more providers in the future.

Do not edit any files.