
# LLM Provider Abstraction

Unified LLM interface in `aiservice/llm.py` — all LLM calls go through this module.

## Model Definition (`LLM` dataclass)

```python
@pydantic_dataclass
class LLM:
    name: str                 # deployment name (e.g., "gpt-5-mini")
    max_tokens: int           # max context window
    model_type: Literal["openai", "anthropic", "google"]
    input_cost: float         # USD per 1M tokens
    cached_input_cost: float  # USD per 1M cached tokens
    output_cost: float        # USD per 1M tokens
```

Concrete models: OpenAI_GPT_4_1, OpenAI_GPT_5_Mini, Anthropic_Claude_Sonnet_4_5, Anthropic_Claude_Haiku_4_5.
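For orientation, a concrete model definition looks roughly like the sketch below. The context window and per-million-token prices are illustrative placeholders, not the values in `aiservice/llm.py`.

```python
# Illustrative only -- the real limits and prices are defined in aiservice/llm.py.
OpenAI_GPT_5_Mini = LLM(
    name="gpt-5-mini",       # Azure deployment name
    max_tokens=128_000,      # context window (placeholder value)
    model_type="openai",
    input_cost=0.25,         # USD per 1M input tokens (placeholder)
    cached_input_cost=0.03,  # USD per 1M cached input tokens (placeholder)
    output_cost=2.00,        # USD per 1M output tokens (placeholder)
)
```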

## Client Setup

- `_create_openai_client()` — returns `AsyncAzureOpenAI` (reads `AZURE_OPENAI_*` env vars)
- `_create_anthropic_client()` — returns `AsyncAnthropicFoundry` (reads `ANTHROPIC_FOUNDRY_API_KEY` + `ANTHROPIC_FOUNDRY_BASE_URL`)
- `get_llm_client(model_type)` — creates a fresh client per request to avoid event loop issues with the Django dev server
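A minimal sketch of the factory pattern follows. The Azure environment variable names and constructor arguments are assumptions, and the Anthropic factory is only referenced, since its Foundry client setup is covered by the bullet above.

```python
# Sketch of the per-request client factory; constructor details are assumptions.
import os

from openai import AsyncAzureOpenAI


def _create_openai_client() -> AsyncAzureOpenAI:
    # Reads the AZURE_OPENAI_* settings from the environment.
    return AsyncAzureOpenAI(
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    )


def get_llm_client(model_type: str):
    # A fresh client per request avoids reusing a client bound to a stale
    # event loop under the Django dev server's autoreloader.
    if model_type == "openai":
        return _create_openai_client()
    if model_type == "anthropic":
        return _create_anthropic_client()  # Foundry client factory described above
    raise ValueError(f"Unsupported model_type: {model_type}")
```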

## Calling LLMs

```python
from aiservice.llm import call_llm, LLM, LLMResponse

response: LLMResponse = await call_llm(
    llm=model,                    # LLM instance
    messages=messages,            # OpenAI-format messages
    call_type="optimization",     # tracking label
    trace_id=trace_id,            # request identifier
    max_tokens=16384,             # max output tokens
    user_id=user_id,              # optional tracking
)
# response.content: str
# response.usage: LLMUsage(input_tokens, output_tokens)
# response.raw_response: ChatCompletion | AnthropicMessage
```

## Provider Handling

- OpenAI (Azure): uses `client.chat.completions.create()`. GPT-5-mini uses `max_completion_tokens`; older models use `max_tokens`
- Anthropic (Foundry): extracts the system prompt from the messages list and passes it separately via the `system=` kwarg, then concatenates the text blocks from the response
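As a rough illustration of those two branches: the SDK parameter names below are real, but the helper functions and their structure are assumptions, not lifted from `aiservice/llm.py`.

```python
# Illustrative provider branches; helper names and structure are assumptions.
async def _call_openai(client, llm: LLM, messages: list[dict], max_tokens: int):
    # GPT-5-mini only accepts max_completion_tokens; older deployments use max_tokens.
    token_kwarg = (
        {"max_completion_tokens": max_tokens}
        if llm.name == "gpt-5-mini"
        else {"max_tokens": max_tokens}
    )
    return await client.chat.completions.create(
        model=llm.name, messages=messages, **token_kwarg
    )


async def _call_anthropic(client, llm: LLM, messages: list[dict], max_tokens: int):
    # Anthropic takes the system prompt as a separate kwarg, not as a chat message.
    system = "\n".join(m["content"] for m in messages if m["role"] == "system")
    chat_messages = [m for m in messages if m["role"] != "system"]
    message = await client.messages.create(
        model=llm.name, system=system, messages=chat_messages, max_tokens=max_tokens
    )
    # The response body is a list of content blocks; keep only the text ones.
    content = "".join(block.text for block in message.content if block.type == "text")
    return message, content
```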

## Observability

- Every call is recorded to the database via `record_llm_call()` in a `finally` block
- Recorded fields include: `trace_id`, `call_type`, `model`, `messages`, `result`, `error`, `cost`, `latency`
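Because the recording happens in a `finally` block, failed calls are captured as well. A sketch of the pattern, with `record_llm_call`'s exact signature assumed from the fields listed above:

```python
# Sketch of the call/record flow; record_llm_call's signature is assumed here.
import time


async def call_llm(llm, messages, call_type, trace_id, max_tokens, user_id=None):
    start = time.monotonic()
    response = None
    error = None
    try:
        client = get_llm_client(llm.model_type)
        response = await _dispatch(client, llm, messages, max_tokens)  # hypothetical helper
        return response
    except Exception as exc:
        error = str(exc)
        raise
    finally:
        # Runs on success and on failure, so every attempt lands in the database.
        record_llm_call(
            trace_id=trace_id,
            call_type=call_type,
            model=llm.name,
            messages=messages,
            result=response.content if response else None,
            error=error,
            cost=calculate_llm_cost(response, llm) if response else 0.0,
            latency=time.monotonic() - start,
        )
```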

## Cost Calculation

`calculate_llm_cost(response, llm)` accounts for cached vs non-cached input tokens:

- Anthropic: `cache_read_input_tokens` and `cache_creation_input_tokens` are additive to `input_tokens`
- OpenAI: `cached_tokens` is a subset of `prompt_tokens`
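A sketch of that arithmetic, assuming the usage field names exposed by each SDK and treating cache-creation tokens at the regular input rate (a simplification):

```python
# Illustrative cost math; usage field names follow the SDKs, structure is assumed.
def calculate_llm_cost(response: LLMResponse, llm: LLM) -> float:
    usage = response.raw_response.usage
    if llm.model_type == "anthropic":
        # Cache reads and cache writes are reported on top of input_tokens.
        cached_input = usage.cache_read_input_tokens or 0
        regular_input = usage.input_tokens + (usage.cache_creation_input_tokens or 0)
        output = usage.output_tokens
    else:
        # OpenAI reports cached_tokens as a subset of prompt_tokens.
        cached_input = usage.prompt_tokens_details.cached_tokens
        regular_input = usage.prompt_tokens - cached_input
        output = usage.completion_tokens
    return (
        regular_input * llm.input_cost
        + cached_input * llm.cached_input_cost
        + output * llm.output_cost
    ) / 1_000_000
```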

## Response Types

- `LLMResponse` — wraps `content: str`, `usage: LLMUsage`, `raw_response`
- `LLMUsage` — `input_tokens: int`, `output_tokens: int`
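For reference, the two wrappers are roughly shaped like this (the decorator choice and field annotations are assumptions):

```python
# Approximate shape of the response wrappers described above.
from dataclasses import dataclass
from typing import Any


@dataclass
class LLMUsage:
    input_tokens: int
    output_tokens: int


@dataclass
class LLMResponse:
    content: str        # concatenated text output
    usage: LLMUsage     # token counts used for cost and observability
    raw_response: Any   # ChatCompletion or Anthropic Message, provider-dependent
```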