codeflash-internal

Kevin Turcios df90110fe8 fix: prevent log_features from 500ing optimization endpoints (#2518)	2026-04-02 06:51:20 -05:00
aiservice	fix: prevent log_features from 500ing optimization endpoints (#2518)	2026-04-02 06:51:20 -05:00
.dockerignore	local setup (#1898)	2025-11-17 12:35:09 -08:00

fix: prevent log_features from 500ing optimization endpoints (#2518)

## Summary

- **`thread_sensitive=False`** on `sync_to_async` so concurrent
`log_features` calls get their own threads instead of serializing
through one (was `True`, causing a bottleneck)
- **Raised DB pool `max_size` from 10 to 100** — prod Postgres allows
859 connections, giving plenty of headroom
- **Added `safe_log_features` wrapper** that catches errors and reports
them to Sentry instead of propagating them; used at all 9 TaskGroup and
bare-await call sites so a logging failure can't crash an otherwise
successful optimization endpoint
- **Kept `transaction.atomic` + `select_for_update`** for correctness
(Django doesn't support async transactions yet, and removing these
causes lost-update races on dict-merge fields)

## Root cause

`log_features` uses `@sync_to_async` + `@transaction.atomic` because
Django lacks async transaction support. The previous fix for pool
exhaustion changed `thread_sensitive=False` to `True`, which serialized
all calls through a single thread — fixing pool exhaustion but creating
a throughput bottleneck that caused 500s under load. Additionally, 6
call sites used `asyncio.TaskGroup` where any `log_features` exception
would propagate and crash the entire endpoint.
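
The serialization effect described above can be illustrated with an analogy built only on the standard library. This is not asgiref's implementation: a one-worker `ThreadPoolExecutor` stands in for the single shared thread used with `thread_sensitive=True`, a four-worker pool stands in for `thread_sensitive=False`, and `ConcurrencyProbe` and `run_calls` are hypothetical names.

```python
import asyncio
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ConcurrencyProbe:
    """Records the highest number of calls running at the same time."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0

    def blocking_call(self) -> None:
        # Stand-in for a synchronous DB write inside log_features.
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        time.sleep(0.05)
        with self._lock:
            self._active -= 1


async def run_calls(workers: int, calls: int = 4) -> int:
    """Run blocking calls on a pool of `workers` threads; return peak overlap."""
    probe = ConcurrencyProbe()
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        await asyncio.gather(
            *(loop.run_in_executor(pool, probe.blocking_call) for _ in range(calls))
        )
    return probe.peak


# One worker: calls queue behind each other, so the peak overlap is 1.
serialized_peak = asyncio.run(run_calls(workers=1))
# Several workers: calls overlap, so each one may hold a DB connection at once.
concurrent_peak = asyncio.run(run_calls(workers=4))
print(serialized_peak, concurrent_peak)
```

The same trade-off shows up in both directions: one thread caps connection use but queues every caller, while many threads remove the queue and need a pool large enough to serve them.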

## Test plan

- [x] `tests/log_features/test_log_features_concurrency.py`: verifies
that `log_features` uses `thread_sensitive=False` and that
`safe_log_features` is async
- [x] `ruff check` passes on all changed files
- [ ] Deploy to staging and verify no 500s under concurrent optimization
requests
