## Summary

- **`thread_sensitive=False`** on `sync_to_async`, so concurrent `log_features` calls run in their own threads instead of serializing through one (it was `True`, which caused a bottleneck)
- **Raised DB pool `max_size` from 10 to 100.** Prod Postgres allows 859 connections, which leaves plenty of headroom
- **Added `safe_log_features` wrapper** that catches errors and reports them to Sentry instead of propagating them. It is used at all 9 TaskGroup and bare-await call sites, so a logging failure can't crash an otherwise successful optimization endpoint
- **Kept `transaction.atomic` + `select_for_update`** for correctness (Django doesn't support async transactions yet, and removing these causes lost-update races on dict-merge fields)

## Root cause

`log_features` uses `@sync_to_async` + `@transaction.atomic` because Django lacks async transaction support. The previous fix for pool exhaustion changed `thread_sensitive=False` to `True`, which serialized all calls through a single thread. That fixed pool exhaustion but created a throughput bottleneck that caused 500s under load.

Additionally, 6 call sites used `asyncio.TaskGroup`, where any `log_features` exception would propagate and crash the entire endpoint.

## Test plan

- [x] `tests/log_features/test_log_features_concurrency.py`: verifies `thread_sensitive=False` and that `safe_log_features` is async
- [x] `ruff check` passes on all changed files
- [ ] Deploy to staging and verify no 500s under concurrent optimization requests