Dash app at .codeflash/standups/ for weekly eng meetings. Pulls live PR data across 4 org repos, renders markdown standup notes, integrates CI audit report with corrected billing numbers from real GitHub API data. Deployed to Plotly Cloud.
Kevin — 2026-04-23
Done (yesterday)
- CI audit across codeflash org: disabled Actions on 200+ forks, built interactive Dash cost report showing ~$12K/yr savings
- Scaffolded codeflash-api FastAPI rewrite (all 9 endpoints), unified sync/async code paths, set up Postgres integration tests with testcontainers, 78% coverage
- Project governance: enabled GitHub Discussions, created parent issue #2079 for 100% test coverage, filed coverage issue on typeagent-py
- Merged 5 PRs on codeflash: PR template + linked-issue policy (#2093), coverage CI with 60% floor (#2094), CODEOWNERS (#2095), 2 dependabot bumps
- Triaged ~28 open PRs, posted contributor follow-ups, closed stale PRs and reopened as issues
- Reimplemented JS function tracer from stale PR #1377 onto current main as PR #2105
- Bumped stale action versions across 3 repos (setup-uv, prek-action, gh-action-pypi-publish)
- Fixed 3-month-old VSCode extension build failure from out-of-sync package-lock.json (PR #2617)
- Extended shared ci-python-uv.yml with test-secret-env input, migrated aiservice-test job to use it (PR #2620)
- Initialized tessl (vendored mode) across codeflash, codeflash-internal, and codeflash-agent with weekly tile update workflows
- Created codeflash-ci-bot GitHub App for PR creation (enterprise GITHUB_TOKEN workaround)
- Added org-level ruleset requiring PR branches to be up to date with main
- Ran vulture dead-code audit (428/432 false positives), deleted 4 verified dead items, added vulture as dev dep
- Cleaned up 19 stale local branches and removed associated worktrees
Done (today)
- Opened PRs #244 (tests for resolve_azure_model_name) and #245 (max_retries bump) on typeagent-py
- Fixed all ruff lint errors across codeflash-agent repo (PR #38, merged)
- Built this weekly eng meeting dashboard with auto-refresh, CI audit integration, and live note rendering
Blocked
- Bedrock OIDC regression blocking unpin of claude-code-action
- ~33 missing tessl tiles (internal + agent), waiting on tessl to publish
Asks
- Hesham: Resume work on the universal detector
- Saurabh: Sync on PyCon US booth, 1-1 on roadmap
- Sarthak: Update me on current work — I have no visibility right now
- Aseem: Sync on launched Claude Code plugin
- Ali: Clean up existing PRs, update me on current work
- Mohammed: Update me on JS optimization progress, get #2020 and #1993 ready to merge
- Everyone: If you have pending PRs or issues, get them ready to close
Next
- Continue working on the codeflash-agent repo; get the backend ready to support the agent
- Continue auditing codeflash / codeflash-internal
- Continue applying vertical optimization; keep learning GitHub org permissions, Azure org permissions, etc., so that when prospective clients are ready to work with us we know exactly how to apply it
In Summary
Everything in "Done (yesterday)" was accomplished in a single 12-hour workday. I've developed a very effective strategy for increasing productivity while still maintaining quality. I'm still refining this approach since I've been focused on other things, but once I have it decently prepared I will onboard the entire team so we can all learn and be effective.
Strategy: Vertical Optimization
What is vertical optimization? It's our approach to optimizing a client's entire technical stack, not just code in isolation. In order to apply vertical optimizations, we have to optimize not just the code, but the organization, the company, and the humans — E2E. We work from both the ground up (code → tooling → infra) and the top down (team → infra → tooling → code) simultaneously. This requires applying multiple disciplines — not just technical engineering, but also psychology, organizational design, and process engineering:
- Psychology: understanding how developers adopt new practices. You don't push — you contribute quality work and let people see the value. The typeagent example shows this: Guido resisted test coverage, but bmerkle saw value in it and started doing it independently. Influence without authority.
- Organizational design: shaping how teams operate — CODEOWNERS, PR templates, linked-issue policies, org-level rulesets, permission structures. The scaffolding that makes quality the default.
- Process engineering: optimizing the processes themselves — CI/CD pipelines, automated gates, workflow consolidation, build times. Eliminating friction so the team spends time on real work.
- Social engineering: building trust and relationships with external teams to open doors. Contributing upstream PRs that ask nothing in return, using trusted intermediaries (like PSF connections for pip), and making it easy for maintainers to say yes — include tests, docs, and benchmarks so merging is frictionless. The key is doing a lot of upfront work — solving real problems, shipping real improvements — so that when we come to the table, people are already willing to work with us because we've already delivered value. Once trust is established, deeper collaboration follows naturally.
- Security engineering: we currently have a terrible security posture and I want us to improve that — both in our personal lives and in our work lives. This means org-level secrets management, dependency vulnerability remediation (24 known Dependabot alerts), proper token scoping, 2FA enforcement, and building good security hygiene as a team habit.
- Technical engineering: the code-level work — profiling, benchmarking, lazy imports, batching, N+1 fixes. The foundation everything else builds on.
There are 4 layers:
- Layer 1 — Code: function-level perf (lazy imports, batching, N+1 fixes, micro-opts)
- Layer 2 — Developer Tooling: CI/CD pipelines, GitHub Actions, build times, test infrastructure
- Layer 3 — Infrastructure: Docker images, Kubernetes memory, database queries, server architecture
- Layer 4 — Engineering Team: workflow alignment, environment standardization, developer productivity
DISCIPLINES
───────────
┌──────────────────────────────────────┐
│ Layer 4: Engineering Team │ ← psychology, org design
├──────────────────────────────────────┤
│ Layer 3: Infrastructure │ ← security, process eng
├──────────────────────────────────────┤
│ Layer 2: Developer Tooling │ ← process eng, security
├──────────────────────────────────────┤
│ Layer 1: Code │ ← technical eng
└──────────────────────────────────────┘
↕ social engineering spans all layers ↕
The goal — Unstructured case study: GH Actions jobs 301→33 (89% reduction), Docker image -2.5GB, infra cost ~90% reduction on a ~$1M/yr bill, DB queries 10s→100ms.
Strategy in Action: typeagent-py
I'm already applying this strategy not just on our own org and repos, but externally on typeagent-py with Guido's team. Yesterday Guido merged my two perf PRs — batch SQLite inserts #230 (1.14x-1.16x indexing speedup) and batch metadata N+1 fix #232 (2.1x-2.6x query speedup). I suggested incremental test coverage requirements like we do internally. Guido pushed back, but his teammate bmerkle saw the value and independently kicked off a branch with additional function coverage tests. Today I also opened #244 (tests for resolve_azure_model_name) and #245 (max_retries bump). This is a real-time demo of the strategy in action — contribute quality work, demonstrate best practices, and let team members adopt them organically.
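For reference, the mechanics behind those two PRs are the standard batching pattern. A minimal illustrative sketch (not the actual typeagent-py code; the table and function names are made up): collapse per-row INSERT/commit round trips into a single executemany inside one transaction.

```python
import sqlite3

def index_rows_naive(conn: sqlite3.Connection, rows: list[tuple[str, str]]) -> None:
    # One INSERT and one commit per row: N round trips and N transaction fsyncs.
    for key, payload in rows:
        conn.execute("INSERT INTO items (key, payload) VALUES (?, ?)", (key, payload))
        conn.commit()

def index_rows_batched(conn: sqlite3.Connection, rows: list[tuple[str, str]]) -> None:
    # One executemany inside a single transaction: the batch-insert pattern.
    with conn:
        conn.executemany("INSERT INTO items (key, payload) VALUES (?, ?)", rows)

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (key TEXT, payload TEXT)")
    index_rows_batched(conn, [(f"k{i}", "payload") for i in range(10_000)])
```

The N+1 metadata fix in #232 is the same idea one level up: fetch all metadata in a single query keyed by id instead of issuing one query per item.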
Follow-ups
- source:https://notes.granola.ai/t/f47117c5-4b11-4a0a-8bd8-a7428d42af56-008umkv4
- Migrate weekly sprint planning back to Linear — Saurabh: "I think we should return back to use linear [...] we can return back to linear weekly and we can keep it as is"
- Clarify meeting cadence: Mondays = formal weekly eng meeting, dailies = informal — Kevin: "on Mondays, we have the engineering cycle planning [...] for stand up, it will be informal"
- Publish the dashboard for the entire team to review async — Kevin: "I will publish this site for everybody to review"
- Collect standups from the team — demo ran long and nobody else presented — Saurabh: "I think we're super over time, so I think we're just gonna skip the stand up"
- Saurabh: sync with Sarthak on current work — Saurabh: "Sarthak, can I sync with you? I think I wanna sync on the work"
- Everyone: come prepared with asks/blockers for next Monday's meeting — Kevin: "I want you guys to come up with these things for me as well, so that I'm able to support you guys"
- Keep dailies informal — Saurabh: "the process should be to enable the work rather than becoming the work itself [...] stand ups are supposed to be super informal"
- Focus on delivery accountability — Saurabh: "what we're missing is the delivery accountability [...] we have to build up in a way that if we have talked about something, deliver rather than saying the same word in every scrum call"
- Demo Linear + Notion setup for weekly coordination — Kevin: "I want us to be using linear, notion more often so that we can keep track of what we're doing and we can coordinate between everybody"
Open Questions
- source:https://notes.granola.ai/t/f47117c5-4b11-4a0a-8bd8-a7428d42af56-008umkv4
- Sarthak: What's the concrete end goal of vertical optimization? — "what's the end goal is to apply something where outcome is very relevant"
- answer:Measurable, dollarized reduction in total engineering cost-of-operation — meaning we quantify every improvement in dollars saved or hours reclaimed. The Unstructured case study is the reference target: CI went from 301 jobs to 33 (89% reduction), Docker images shrank by 2.5GB, infrastructure spend dropped from ~$1M/yr to ~$100K/yr, and database queries went from 10s to 100ms. The deliverable at each layer is merged PRs with before/after benchmark numbers — not a report or a slide deck. If we can't measure it and merge it, it doesn't count. This is what we want to replicate for every client: walk in, audit all four layers, ship concrete optimizations, and hand them a dollarized summary of what changed.
- Sarthak: Has any startup actually applied this approach and seen results? — "if any startup guy do have back before or if there's any kind of social backing he has done before, for the startup and that have boosted the [revenue]"
- answer:What we're doing is, to the best of our research, unprecedented as a packaged service. No company currently sells integrated four-layer optimization (code, tooling, infra, team) with merged PRs and before/after benchmarks as the deliverable. The closest analogues are internal platform teams at large companies: Spotify built Backstage to unify developer tooling across hundreds of services, and Google's Engineering Productivity group (EngProd) optimizes builds, tests, and infrastructure at massive scale — but both are strictly internal efforts, not products or services they sell. On the consulting side, every existing player is either single-layer deep or multi-layer but advisory-only: CloudZero and Vantage do infrastructure cost dashboards (no code changes). Dagger and Earthly optimize CI (don't touch code perf or infra costs). Chainguard and Slim.AI optimize Docker supply chains (single layer). Thoughtworks and Slalom do broad DevOps transformation consulting that can span multiple layers, but their deliverable is recommendations and organizational change — not merged PRs with benchmark numbers. Even Brendan Gregg's performance consulting at Netflix is observability/profiling advisory at the code layer only. The "merged PRs as deliverable" constraint is what eliminates every existing player. The market gap is real: companies either hire specialized engineers internally to do this (like Unstructured did) or they don't do it at all. We're packaging that capability as a service.
- Sarthak: What specific processes are being introduced? — "I would like to understand the process part there, like, what are the key process you are trying to include here"
- answer:Here's what's already shipped: (1) CODEOWNERS files — every directory has a designated reviewer so PRs don't sit unreviewed. (2) PR templates with a linked-issue policy — every PR must reference an issue, so we can trace why changes were made. (3) Org-level rulesets requiring branches to be up-to-date with main before merging — prevents broken builds from stale branches. (4) Coverage CI with a 60% floor — PRs that drop test coverage below 60% are blocked from merging. (5) This weekly eng meeting dashboard with auto-refresh and live notes. (6) Tessl for dependency tracking across all repos — automated weekly tile updates so we know what's outdated. (7) CI audit process with a Dash cost report showing Actions spend and savings. (8) Shared reusable GitHub workflows (ci-python-uv.yml) so teams don't copy-paste CI configs. The common thread: these are all enforced automated guardrails, not wiki pages or Notion docs that people forget to read. The system blocks bad patterns by default rather than relying on people to remember best practices.
- Saurabh: Are we actually paying for GitHub Actions overages? — "are currently not really paying for GitHub Actions because our GitHub Actions usage is under the whatever quota we get"
- answer:Saurabh was right to push back. After pulling the real GitHub billing export (gh api /orgs/codeflash-ai/settings/billing/usage), here's what the data shows: our Enterprise plan fully discounts all Actions usage — netAmount is $0.00 every month. We are not paying cash overages. The original $12K figure used $0.008/min (the listed overage rate), but the actual billing rate is $0.006/min, and the discount covers 100% of it. The real before/after numbers from the billing API: Feb 2026 (peak before audit) was 198K min, Apr 2026 (after audit) is 93K min — a ~105K min/mo reduction. Even at the actual $0.006/min rate, the theoretical gross savings would be ~$7,600/yr, not $12K. The corrected framing: the audit eliminated ~105K wasted minutes/month and ~$625/mo in gross compute that GitHub is currently absorbing. If our plan changes or usage grows past the discount threshold, we'd start paying immediately — the audit prevents that. The real value was operational: 200+ fork Actions disabled, 13 ghost workflows eliminated, non-code PR compute cost from ~$1.85 to ~$0.001, and 22 workflow files consolidated to 7.
- Saurabh: Are we still doing the written async standup format? — "So we are not gonna do the writing thing that you were talking about earlier?"
- answer:Not daily — I didn't have enough time to polish this up for a daily cadence, and honestly it's better this way. Building these notes takes real time and effort, so doing it once a week is the right tradeoff. Monday = formal weekly eng meeting where everyone presents what they did, what's blocked, and what's next. This dashboard is the centerpiece — live notes, CI data, follow-up tracking, all in one place. Dailies = informal verbal standups, no written prep required. The investment pays off because once a week we get a full picture of what everyone's working on, what's stuck, and what needs to happen next — and it compounds over time as we build a searchable history. Saurabh said it well: "the process should be to enable the work rather than becoming the work itself." We'll keep iterating on the tooling to make standups a breeze — the goal is that preparing for Monday's meeting takes minutes, not hours.
Dollar Impact: What Vertical Optimization Saves Customers
Estimated annual savings for a mid-size engineering org (50-200 engineers, $5M-$20M cloud spend):
Layer 1 — Code: $200K-$800K/yr
Function-level performance fixes reduce compute requirements. Faster code means fewer server instances needed to serve the same load. Industry data: 10-20% of cloud compute is consumed by inefficient application code (Google SRE, Netflix performance engineering). A 100x query speedup on a top-10 endpoint typically retires 30-50% of backend instances serving that path.
Unstructured PRs in this layer:
- #1481: Reduce attribute lookups in elements_intersect_vertically
- #1464: Replace lazyproperty with functools.cached_property
- #4266: Fix O(N^2) text extraction (re-scanned full character list on every patch operation — quadratic on long documents)
- Result: End-to-end latency 50.8s to 44.3s (12.9% faster), text extraction from O(N^2) to O(N)
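To make the #4266 class of fix concrete, here's an illustrative sketch of the quadratic-rescan vs single-pass pattern (hypothetical function names and patch representation, not the actual Unstructured code):

```python
def apply_patches_quadratic(chars: list[str], patches: dict[int, str]) -> list[str]:
    # Rebuilds (re-scans) the full character list once per patch: O(N * P),
    # which degrades toward O(N^2) on long documents with many patches.
    out = list(chars)
    for pos, replacement in patches.items():
        out = [replacement if i == pos else c for i, c in enumerate(out)]
    return out

def apply_patches_linear(chars: list[str], patches: dict[int, str]) -> list[str]:
    # Single pass: look each position up in the patch dict, O(N + P).
    return [patches.get(i, c) for i, c in enumerate(chars)]
```

Same output either way, but only one walk over the characters regardless of how many patches arrive.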
Layer 2 — Developer Tooling: $500K-$2M/yr
The big number here isn't CI minutes — it's developer wait time. Developers spend 25-40% of their time waiting on builds and CI (Gradle 2023 survey, Spotify eng blog). At $250K fully-loaded cost, 50 engineers losing 30% = $3.75M in wait time. Direct CI spend savings are smaller ($50K-$200K) but the productivity recovery is massive.
Codeflash org CI audit PRs in this layer:
- codeflash #2025: Replace wildcard path triggers on 12 E2E workflows (every README edit triggered ~2hrs of E2E tests)
- codeflash #2044: Consolidate 17 workflows into single ci.yaml with gate job (22 workflow files down to 7)
- codeflash #2050: npm cache, Maven consolidation, remove duplicate workflow
- codeflash-internal #2588: Fix deploy path filter (workflow edits were triggering production deploys)
- Disabled Actions on 200+ forks (burning ~960 noise runs/month)
- Result: ~$1.85 compute per non-code PR down to ~$0.001, 105K min/mo reduction ($7.6K/yr gross, currently discounted by Enterprise plan), 13 ghost workflows eliminated
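The $7.6K/yr figure above comes straight from the corrected billing numbers (see the Actions-billing answer in Open Questions). A quick back-of-envelope in Python using the rounded minute counts quoted in this doc (the exact billing export gives a slightly lower monthly figure, about $625/mo):

```python
RATE_PER_MIN = 0.006        # actual billed Actions rate, USD/min (not the $0.008/min overage list rate)
MINUTES_BEFORE = 198_000    # Feb 2026, peak month before the audit
MINUTES_AFTER = 93_000      # Apr 2026, after the audit

monthly_delta = MINUTES_BEFORE - MINUTES_AFTER    # ~105K min/mo eliminated
monthly_gross = monthly_delta * RATE_PER_MIN      # ~$630/mo at these rounded figures
annual_gross = monthly_gross * 12                 # ~$7.6K/yr gross

print(f"{monthly_delta:,} min/mo -> ${monthly_gross:,.0f}/mo -> ${annual_gross:,.0f}/yr")
```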
Layer 3 — Infrastructure: $1.5M-$6M/yr
This is where the biggest dollar amounts are. Flexera 2024 State of Cloud: orgs waste 28-32% of cloud spend on average. HashiCorp 2023: 94% of enterprises overprovision. For a $10M cloud bill, 30% waste = $3M recoverable. Smaller Docker images cut deploy times 40-60%, reducing incident MTTR.
Unstructured PRs in this layer:
- #1502: Cap OCR workers to available CPUs — eliminated 4 idle workers each loading ~500MB of duplicate ONNX models
- #1448: Free page image before table extraction
- #1441: Resize-first numpy preprocessing for YOLOX
- #1503: Render PDF pages as BMP instead of PNG (eliminated unnecessary PNG compression for in-memory data)
- Result: K8s pod memory 32GB to 4GB (87.5% reduction), pods per node 5 to 46 (9.2x density), monthly compute $10K to $1.1K — $107K/yr saved on a single service
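The #1502 fix is worth spelling out because it's so cheap relative to the savings. A hedged sketch of the pattern (not the actual Unstructured code; names are illustrative): cap the worker pool at the CPUs actually available so idle workers never load their own ~500MB model copy.

```python
import os
from concurrent.futures import ProcessPoolExecutor

def make_ocr_pool(requested_workers: int) -> ProcessPoolExecutor:
    # Never spawn more workers than there are CPUs to run them; each extra
    # worker would sit idle while still holding a full model copy in memory.
    available = os.cpu_count() or 1
    workers = max(1, min(requested_workers, available))
    return ProcessPoolExecutor(max_workers=workers)
```

One caveat: in Kubernetes, os.cpu_count() reports host CPUs rather than the pod's cgroup quota, so a container-aware limit is the stricter cap.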
Layer 4 — Engineering Team: $400K-$1.5M/yr
Stripe's 2018 Developer Coefficient report: developers lose 17.3 hours/week to bad tooling, maintenance, and tech debt — 42% of their time. DORA research: elite teams ship 973x more frequently with 6,570x faster lead time. Standardizing environments recovers 5-10% of engineering capacity.
Codeflash org PRs in this layer:
- codeflash #2093: PR template + linked-issue policy
- codeflash #2094: Coverage CI with 60% floor
- codeflash #2095: CODEOWNERS
- Org-level ruleset requiring branches to be up-to-date with main
- Tessl dependency tracking across all repos
- This weekly eng meeting dashboard
- Result: Every PR now requires an issue link, a reviewer, and passing coverage — automated guardrails instead of wiki pages
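Of the guardrails above, the coverage floor (#2094) is the easiest to reproduce locally. A minimal sketch assuming pytest + pytest-cov (the actual CI wiring in #2094 may differ, and the package path here is an assumption):

```python
import sys
import pytest

# Fail the run if total coverage drops below the 60% floor enforced in CI.
sys.exit(pytest.main(["--cov=codeflash", "--cov-fail-under=60", "-q"]))
```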
Total Potential Savings
| Scenario | 50 eng / $5M cloud | 200 eng / $20M cloud |
|---|---|---|
| Layer 1 — Code | $200K | $800K |
| Layer 2 — Dev Tooling | $500K | $2M |
| Layer 3 — Infrastructure | $1.5M | $6M |
| Layer 4 — Eng Team | $400K | $1.5M |
| Total | $2.6M/yr | $10.3M/yr |
Conservative midpoint for a typical engagement: $4M-$6M/yr in combined savings. Infrastructure dominates in raw dollars (cloud waste is real and measurable), but developer tooling has the highest ROI per dollar invested because it compounds — every minute saved on CI is saved for every engineer on every commit.
Competitive Landscape: CodSpeed and Why We Win
What CodSpeed Does
CodSpeed (codspeed.io) is a continuous benchmarking platform. They instrument your code, run benchmarks in CI, detect performance regressions on every PR, and show differential flamegraphs. Supports Rust, C++, Go, Python, and Node.js. Pricing: free for OSS, $15/user/mo for Pro, enterprise custom. Notable customers: Pydantic, Astral (ruff/uv), LangChain, Prisma, Vercel, ByteDance.
Where CodSpeed Stops
CodSpeed is a Layer 1 tool — and only the detection half of Layer 1 at that. They tell you "this PR made function X 15% slower." They don't:
- Fix the regression (no optimization, just alerting)
- Touch CI/CD costs (Layer 2)
- Optimize infrastructure (Layer 3)
- Improve engineering processes (Layer 4)
- Ship merged PRs with before/after benchmarks
Their deliverable is a dashboard and PR checks. Ours is merged code.
How We're Different
| | CodSpeed | Codeflash Vertical Optimization |
|---|---|---|
| Scope | Layer 1 detection only | All 4 layers, detection + remediation |
| Deliverable | Dashboard + PR checks | Merged PRs with before/after benchmarks |
| Model | SaaS product ($15/user/mo) | Service engagement with dollarized ROI |
| Depth | "This function got slower" | Root cause analysis + fix + benchmark proof |
| Coverage | Code perf regressions | Code, CI/CD, Docker, K8s, team processes |
| Outcome | Alerts | Shipped optimizations with dollar savings |
How We Plan to Beat Them
- CodSpeed is a complement, not a competitor. We can recommend CodSpeed (or our own benchmarking tooling) as part of Layer 2 guardrails — continuous benchmarking becomes one of the automated gates we install, like coverage CI floors. We play at a different altitude.
- We ship code, they ship alerts. CodSpeed tells you there's a problem. We find the problem, fix it, benchmark the fix, and merge it. The Unstructured engagement is proof: 24 PRs merged, $107K/yr saved, 87.5% memory reduction — CodSpeed couldn't have done any of that.
- We optimize the full stack. CodSpeed's entire value is Layer 1 regression detection. Our Unstructured engagement found that the biggest savings weren't in code at all — they were in infrastructure (32GB pods → 4GB, $107K/yr) and CI/CD (GH Actions jobs cut from 301 to 33). A CodSpeed subscription wouldn't have found any of those.
- Our moat is depth + breadth. Anyone can build a benchmarking SaaS. Nobody else combines code profiling, CI auditing, infrastructure right-sizing, and team process optimization into a single engagement with merged PRs as the deliverable. The market gap we identified is real: every existing player is either single-layer deep (CodSpeed, Vantage, Dagger) or multi-layer but advisory-only (Thoughtworks, Slalom). We're the first to do both.
- We use AI to scale. Our optimization workflow is AI-augmented — the same tooling we use internally (Claude Code, codeflash plugin, automated benchmarking) lets us move at a pace that traditional consulting can't match. The Unstructured engagement was 24 PRs across 5 repos in 7 weeks. A traditional consulting firm would quote 3-6 months for the same scope.
What's Next for the Team
I'll schedule onboarding sessions once the approach is more refined, but in the meantime — start thinking about what vertical optimization looks like for your own work. Where are the bottlenecks in the code you touch? In the CI pipelines? In the way we collaborate? The goal is for everyone on the team to be thinking this way, not just me.