
# LLM Provider Abstraction

Unified LLM interface in `aiservice/llm.py` — all LLM calls go through this module.

## Model Definition (`LLM` dataclass)

```python
@pydantic_dataclass
class LLM:
    name: str                 # deployment name (e.g., "gpt-5-mini")
    max_tokens: int           # max context window
    model_type: Literal["openai", "anthropic", "google"]
    input_cost: float         # USD per 1M tokens
    cached_input_cost: float  # USD per 1M cached tokens
    output_cost: float        # USD per 1M tokens
```

Concrete models: OpenAI_GPT_4_1, OpenAI_GPT_5_Mini, Anthropic_Claude_Sonnet_4_5, Anthropic_Claude_Haiku_4_5.
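For orientation, a concrete model definition looks roughly like the sketch below. The context window and per-million-token prices are illustrative placeholders, not the values in `aiservice/llm.py`.

```python
# Illustrative only -- the real limits and prices are defined in aiservice/llm.py.
OpenAI_GPT_5_Mini = LLM(
    name="gpt-5-mini",       # Azure deployment name
    max_tokens=128_000,      # context window (placeholder value)
    model_type="openai",
    input_cost=0.25,         # USD per 1M input tokens (placeholder)
    cached_input_cost=0.03,  # USD per 1M cached input tokens (placeholder)
    output_cost=2.00,        # USD per 1M output tokens (placeholder)
)
```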

## Client Setup

- `_create_openai_client()` — returns `AsyncAzureOpenAI` (reads `AZURE_OPENAI_*` env vars)
- `_create_anthropic_client()` — returns `AsyncAnthropicFoundry` (reads `ANTHROPIC_FOUNDRY_API_KEY` + `ANTHROPIC_FOUNDRY_BASE_URL`)
- `get_llm_client(model_type)` — creates a fresh client per request to avoid event loop issues with the Django dev server
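A minimal sketch of the factory pattern follows. The Azure environment variable names and constructor arguments are assumptions, and the Anthropic factory is only referenced, since its Foundry client setup is covered by the bullet above.

```python
# Sketch of the per-request client factory; constructor details are assumptions.
import os

from openai import AsyncAzureOpenAI


def _create_openai_client() -> AsyncAzureOpenAI:
    # Reads the AZURE_OPENAI_* settings from the environment.
    return AsyncAzureOpenAI(
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    )


def get_llm_client(model_type: str):
    # A fresh client per request avoids reusing a client bound to a stale
    # event loop under the Django dev server's autoreloader.
    if model_type == "openai":
        return _create_openai_client()
    if model_type == "anthropic":
        return _create_anthropic_client()  # Foundry client factory described above
    raise ValueError(f"Unsupported model_type: {model_type}")
```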

## Calling LLMs

```python
from aiservice.llm import call_llm, LLM, LLMResponse

response: LLMResponse = await call_llm(
    llm=model,                    # LLM instance
    messages=messages,            # OpenAI-format messages
    call_type="optimization",     # tracking label
    trace_id=trace_id,            # request identifier
    max_tokens=16384,             # max output tokens
    user_id=user_id,              # optional tracking
)
# response.content: str
# response.usage: LLMUsage(input_tokens, output_tokens)
# response.raw_response: ChatCompletion | AnthropicMessage
```

## Provider Handling

- OpenAI (Azure): uses `client.chat.completions.create()`. GPT-5-mini uses `max_completion_tokens`; older models use `max_tokens`
- Anthropic (Foundry): extracts the system prompt from the messages list and passes it separately via the `system=` kwarg, then concatenates the text blocks from the response
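As a rough illustration of those two branches: the SDK parameter names below are real, but the helper functions and their structure are assumptions, not lifted from `aiservice/llm.py`.

```python
# Illustrative provider branches; helper names and structure are assumptions.
async def _call_openai(client, llm: LLM, messages: list[dict], max_tokens: int):
    # GPT-5-mini only accepts max_completion_tokens; older deployments use max_tokens.
    token_kwarg = (
        {"max_completion_tokens": max_tokens}
        if llm.name == "gpt-5-mini"
        else {"max_tokens": max_tokens}
    )
    return await client.chat.completions.create(
        model=llm.name, messages=messages, **token_kwarg
    )


async def _call_anthropic(client, llm: LLM, messages: list[dict], max_tokens: int):
    # Anthropic takes the system prompt as a separate kwarg, not as a chat message.
    system = "\n".join(m["content"] for m in messages if m["role"] == "system")
    chat_messages = [m for m in messages if m["role"] != "system"]
    message = await client.messages.create(
        model=llm.name, system=system, messages=chat_messages, max_tokens=max_tokens
    )
    # The response body is a list of content blocks; keep only the text ones.
    content = "".join(block.text for block in message.content if block.type == "text")
    return message, content
```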

## Observability

- Every call is recorded to the database via `record_llm_call()` in a `finally` block
- Recorded fields include: `trace_id`, `call_type`, `model`, `messages`, `result`, `error`, `cost`, `latency`
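Because the recording happens in a `finally` block, failed calls are captured as well. A sketch of the pattern, with `record_llm_call`'s exact signature assumed from the fields listed above:

```python
# Sketch of the call/record flow; record_llm_call's signature is assumed here.
import time


async def call_llm(llm, messages, call_type, trace_id, max_tokens, user_id=None):
    start = time.monotonic()
    response = None
    error = None
    try:
        client = get_llm_client(llm.model_type)
        response = await _dispatch(client, llm, messages, max_tokens)  # hypothetical helper
        return response
    except Exception as exc:
        error = str(exc)
        raise
    finally:
        # Runs on success and on failure, so every attempt lands in the database.
        record_llm_call(
            trace_id=trace_id,
            call_type=call_type,
            model=llm.name,
            messages=messages,
            result=response.content if response else None,
            error=error,
            cost=calculate_llm_cost(response, llm) if response else 0.0,
            latency=time.monotonic() - start,
        )
```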

## Cost Calculation

`calculate_llm_cost(response, llm)` accounts for cached vs non-cached input tokens:

- Anthropic: `cache_read_input_tokens` and `cache_creation_input_tokens` are additive to `input_tokens`
- OpenAI: `cached_tokens` is a subset of `prompt_tokens`
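A sketch of that arithmetic, assuming the usage field names exposed by each SDK and treating cache-creation tokens at the regular input rate (a simplification):

```python
# Illustrative cost math; usage field names follow the SDKs, structure is assumed.
def calculate_llm_cost(response: LLMResponse, llm: LLM) -> float:
    usage = response.raw_response.usage
    if llm.model_type == "anthropic":
        # Cache reads and cache writes are reported on top of input_tokens.
        cached_input = usage.cache_read_input_tokens or 0
        regular_input = usage.input_tokens + (usage.cache_creation_input_tokens or 0)
        output = usage.output_tokens
    else:
        # OpenAI reports cached_tokens as a subset of prompt_tokens.
        cached_input = usage.prompt_tokens_details.cached_tokens
        regular_input = usage.prompt_tokens - cached_input
        output = usage.completion_tokens
    return (
        regular_input * llm.input_cost
        + cached_input * llm.cached_input_cost
        + output * llm.output_cost
    ) / 1_000_000
```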

## Response Types

- `LLMResponse` — wraps `content: str`, `usage: LLMUsage`, `raw_response`
- `LLMUsage` — `input_tokens: int`, `output_tokens: int`
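For reference, the two wrappers are roughly shaped like this (the decorator choice and field annotations are assumptions):

```python
# Approximate shape of the response wrappers described above.
from dataclasses import dataclass
from typing import Any


@dataclass
class LLMUsage:
    input_tokens: int
    output_tokens: int


@dataclass
class LLMResponse:
    content: str        # concatenated text output
    usage: LLMUsage     # token counts used for cost and observability
    raw_response: Any   # ChatCompletion or Anthropic Message, provider-dependent
```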