# LLM Provider Abstraction

Unified LLM interface in `aiservice/llm.py` — all LLM calls go through this module.
## Model Definition (`LLM` dataclass)

```python
from typing import Literal

from pydantic.dataclasses import dataclass as pydantic_dataclass


@pydantic_dataclass
class LLM:
    name: str                 # deployment name (e.g., "gpt-5-mini")
    max_tokens: int           # max context window
    model_type: Literal["openai", "anthropic", "google"]
    input_cost: float         # USD per 1M tokens
    cached_input_cost: float  # USD per 1M cached tokens
    output_cost: float        # USD per 1M tokens
```
Concrete models: `OpenAI_GPT_4_1`, `OpenAI_GPT_5_Mini`, `Anthropic_Claude_Sonnet_4_5`, `Anthropic_Claude_Haiku_4_5`.
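For illustration, one concrete model might be defined like this; the context-window size and prices below are placeholders, not the values in `aiservice/llm.py`:

```python
# Hypothetical instantiation; all numbers are placeholders.
OpenAI_GPT_5_Mini = LLM(
    name="gpt-5-mini",        # Azure deployment name
    max_tokens=272_000,       # context window (placeholder)
    model_type="openai",
    input_cost=0.25,          # USD per 1M input tokens (placeholder)
    cached_input_cost=0.025,  # USD per 1M cached input tokens (placeholder)
    output_cost=2.00,         # USD per 1M output tokens (placeholder)
)
```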
## Client Setup

- `_create_openai_client()` — returns `AsyncAzureOpenAI` (reads `AZURE_OPENAI_*` env vars)
- `_create_anthropic_client()` — returns `AsyncAnthropicFoundry` (reads `ANTHROPIC_FOUNDRY_API_KEY` + `ANTHROPIC_FOUNDRY_BASE_URL`)
- `get_llm_client(model_type)` — creates a fresh client per request to avoid event-loop issues with the Django dev server (sketched below)
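A minimal sketch of the per-request client factory. The exact `AZURE_OPENAI_*` variable names and the `AsyncAnthropicFoundry` constructor kwargs are assumptions, not copied from the module:

```python
import os

from anthropic import AsyncAnthropicFoundry
from openai import AsyncAzureOpenAI


def _create_openai_client() -> AsyncAzureOpenAI:
    return AsyncAzureOpenAI(
        api_key=os.environ["AZURE_OPENAI_API_KEY"],        # assumed var name
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    )


def _create_anthropic_client() -> AsyncAnthropicFoundry:
    return AsyncAnthropicFoundry(
        api_key=os.environ["ANTHROPIC_FOUNDRY_API_KEY"],
        base_url=os.environ["ANTHROPIC_FOUNDRY_BASE_URL"],
    )


def get_llm_client(model_type: str):
    # A fresh client per request avoids "event loop is closed" errors when
    # the Django dev server tears down loops between requests.
    if model_type == "openai":
        return _create_openai_client()
    if model_type == "anthropic":
        return _create_anthropic_client()
    raise ValueError(f"Unsupported model_type: {model_type}")
```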
## Calling LLMs

```python
from aiservice.llm import LLM, LLMResponse, call_llm

response: LLMResponse = await call_llm(
    llm=model,                 # LLM instance
    messages=messages,         # OpenAI-format messages
    call_type="optimization",  # tracking label
    trace_id=trace_id,         # request identifier
    max_tokens=16384,          # max output tokens
    user_id=user_id,           # optional tracking
)
# response.content: str
# response.usage: LLMUsage(input_tokens, output_tokens)
# response.raw_response: ChatCompletion | AnthropicMessage
```
## Provider Handling

- OpenAI (Azure): uses `client.chat.completions.create()`. GPT-5-mini uses `max_completion_tokens`; older models use `max_tokens`
- Anthropic (Foundry): extracts the system prompt from the messages list and passes it separately via the `system=` kwarg; concatenates text blocks from the response (both branches are sketched below)
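An illustrative sketch of the two branches. The helper names are hypothetical, and the `gpt-5` prefix check is a guess at how the module distinguishes model generations:

```python
async def _call_openai(client, llm, messages, max_output_tokens):
    # GPT-5-family deployments take max_completion_tokens; older models
    # still use max_tokens.
    token_kwarg = (
        {"max_completion_tokens": max_output_tokens}
        if llm.name.startswith("gpt-5")          # hypothetical detection
        else {"max_tokens": max_output_tokens}
    )
    return await client.chat.completions.create(
        model=llm.name, messages=messages, **token_kwarg
    )


async def _call_anthropic(client, llm, messages, max_output_tokens):
    # Anthropic takes the system prompt as a separate kwarg rather than a
    # role="system" message, so split it out of the OpenAI-format list.
    system = "\n".join(m["content"] for m in messages if m["role"] == "system")
    chat = [m for m in messages if m["role"] != "system"]
    response = await client.messages.create(
        model=llm.name, system=system, messages=chat, max_tokens=max_output_tokens
    )
    # The response body is a list of content blocks; keep only the text ones.
    content = "".join(b.text for b in response.content if b.type == "text")
    return response, content
```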
## Observability

- Every call is recorded to the database via `record_llm_call()` in the `finally` block (pattern sketched below)
- Recorded fields: trace_id, call_type, model, messages, result, error, cost, latency
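A sketch of the record-in-`finally` pattern; `record_llm_call`'s exact signature is assumed from the field list above, and `_dispatch` is a hypothetical helper:

```python
import time


async def call_llm(llm, messages, call_type, trace_id, max_tokens, user_id=None):
    start = time.monotonic()
    response, error = None, None
    try:
        response = await _dispatch(llm, messages, max_tokens)  # hypothetical helper
        return response
    except Exception as exc:
        error = str(exc)
        raise
    finally:
        # Runs on success and on failure, so failed calls are recorded too.
        record_llm_call(
            trace_id=trace_id,
            call_type=call_type,
            model=llm.name,
            messages=messages,
            result=response.content if response else None,
            error=error,
            cost=calculate_llm_cost(response, llm) if response else 0.0,
            latency=time.monotonic() - start,
        )
```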
## Cost Calculation

`calculate_llm_cost(response, llm)` accounts for cached vs. non-cached input tokens, as the sketch after this list illustrates:

- Anthropic: `cache_read_input_tokens` and `cache_creation_input_tokens` are additive to `input_tokens`
- OpenAI: `cached_tokens` is a subset of `prompt_tokens`
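A minimal sketch of the two accounting schemes. The usage field names follow the providers' public APIs, reading cache details off `raw_response.usage` is an assumption, and cache-write tokens are billed at the plain input rate here as a simplification:

```python
def calculate_llm_cost(response: "LLMResponse", llm: LLM) -> float:
    usage = response.raw_response.usage  # provider-native usage object (assumed)
    if llm.model_type == "anthropic":
        # Additive scheme: cache reads/writes come on top of input_tokens.
        cached = getattr(usage, "cache_read_input_tokens", 0) or 0
        writes = getattr(usage, "cache_creation_input_tokens", 0) or 0
        fresh = usage.input_tokens + writes  # writes billed as plain input (simplification)
        output = usage.output_tokens
    else:
        # Subset scheme: prompt_tokens already includes the cached portion.
        details = getattr(usage, "prompt_tokens_details", None)
        cached = getattr(details, "cached_tokens", 0) if details else 0
        fresh = usage.prompt_tokens - cached
        output = usage.completion_tokens
    return (
        fresh * llm.input_cost
        + cached * llm.cached_input_cost
        + output * llm.output_cost
    ) / 1_000_000  # costs are per 1M tokens
```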
## Response Types

- `LLMResponse` — wraps `content: str`, `usage: LLMUsage`, `raw_response`
- `LLMUsage` — `input_tokens: int`, `output_tokens: int`
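The wrappers plausibly reduce to something like the following dataclasses; the field names come from the descriptions above, but the exact definitions in `aiservice/llm.py` may differ:

```python
from dataclasses import dataclass
from typing import Union

from anthropic.types import Message as AnthropicMessage
from openai.types.chat import ChatCompletion


@dataclass
class LLMUsage:
    input_tokens: int
    output_tokens: int


@dataclass
class LLMResponse:
    content: str
    usage: LLMUsage
    raw_response: Union[ChatCompletion, AnthropicMessage]
```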