# SHA1 -> Faster Hash Proposal (Content Addressed Store)
Status: Deferred — needs discussion with maintainers before implementation.
## Opportunity
SHA1 is used as the content-addressing hash in `content_addressed_store.py:98`. Benchmarks on an Azure Standard_D2s_v5 VM:
| Blob Size | SHA1 | xxh64 | xxh64 Speedup | blake2b (est) |
|---|---|---|---|---|
| 1KB | 0.001ms | 0.0004ms | 2.5x | ~1.5x |
| 100KB | 0.060ms | 0.008ms | 7.5x | ~3x |
| 1MB | 0.596ms | 0.073ms | 8.2x | ~4x |
| 10MB | 5.979ms | 0.736ms | 8.1x | ~4x |
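These numbers can be sanity-checked with a small `timeit` harness. A minimal sketch follows; it is not the original benchmark code, and it assumes the third-party `xxhash` package is installed:

```python
# Minimal sketch for reproducing the numbers above; not the original
# harness. Assumes the third-party `xxhash` package is installed.
import hashlib
import os
import timeit

import xxhash

def bench(name, hash_fn, blob, number=200):
    secs = timeit.timeit(lambda: hash_fn(blob), number=number)
    print(f"{len(blob) // 1024:>6d} KiB  {name:8s} {secs / number * 1e3:8.4f} ms")

for size in (1024, 100 * 1024, 1024 * 1024, 10 * 1024 * 1024):
    blob = os.urandom(size)
    bench("sha1", lambda b: hashlib.sha1(b).hexdigest(), blob)
    bench("xxh64", lambda b: xxhash.xxh64(b).hexdigest(), blob)
    bench("blake2b", lambda b: hashlib.blake2b(b).hexdigest(), blob)
```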
## Why it's not a simple drop-in
The SHA1 hex digest is the storage key: it determines where artifacts live on disk and in S3 (`<prefix>/<sha[:2]>/<sha>`). It is persisted in metadata databases and used across 14 locations in the codebase.
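To make the coupling concrete, here is an illustrative reduction of the key scheme; `blob_path` is a hypothetical helper, not the actual store code:

```python
# Illustrative reduction of the addressing scheme described above;
# blob_path is a hypothetical helper, not the actual store code.
import hashlib

def blob_path(prefix: str, blob: bytes) -> str:
    # The hex digest *is* the storage key, so swapping the hash
    # algorithm relocates every existing artifact.
    sha = hashlib.sha1(blob).hexdigest()
    return f"{prefix}/{sha[:2]}/{sha}"

print(blob_path("cas", b"hello"))
# cas/aa/aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d
```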
## All SHA1 usage locations

| File | Line(s) | Purpose | Persisted Where? | Breaking? |
|---|---|---|---|---|
| `content_addressed_store.py` | 98 | Artifact content-address key | S3/filesystem paths | Yes |
| `filecache.py` | 96-100 | Log/metadata cache tokens | Local cache filenames | Local only |
| `includefile.py` | 417 | Include file hash | Metadata DB | Yes |
| `metadata.py` | 588 | Artifact metadata field | Metadata DB | Yes |
| `argo_workflows.py` | 781 | Event name suffix | Argo event names | No (regenerable) |
| `argo_workflows_cli.py` | 550, 605, 611 | Workflow name truncation | Argo workflow names | No (regenerable) |
| `step_functions_cli.py` | 299, 307 | StateMachine name suffix | AWS resource names | No (regenerable) |
| `airflow_cli.py` | 454 | DAG naming | Airflow DAG names | No (regenerable) |
| `event_bridge_client.py` | 82 | Rule name truncation | AWS resource names | No (regenerable) |
| `s3op.py` | 726 | S3 download cache filename | Local cache filenames | Local only |
| test files | Multiple | Test verification | Not persisted | No |
## Key concerns

- Dedup boundary: the same content saved before and after the change gets different keys, so there is no cross-version dedup
- Collision safety: xxh64 (64-bit) has a birthday bound of ~2^32 items, too small for content addressing; xxh128 or blake2b must be used instead (a quick numeric check follows this list)
- New dependency: `xxhash` adds to `install_requires`; blake2b is stdlib (Python 3.6+) but slower
- Migration: needs a versioned hash algorithm in metadata and a dual-compute transition period
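The collision-safety point can be checked with the standard birthday approximation, p(collision) ≈ 1 - exp(-k²/2^(n+1)) for k random blobs and an n-bit hash. A quick stdlib computation:

```python
# Quick birthday-bound check for the collision-safety concern above.
# Approximation: p(collision) ~= 1 - exp(-k^2 / 2^(n+1)) for k blobs, n-bit hash.
import math

def collision_prob(num_blobs: int, hash_bits: int) -> float:
    # -expm1(-x) == 1 - exp(-x), but stays accurate for tiny x
    return -math.expm1(-(num_blobs**2) / 2 ** (hash_bits + 1))

print(collision_prob(2**32, 64))   # ~0.39: far too risky for a CAS key
print(collision_prob(2**32, 128))  # ~2.7e-20: effectively zero
```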
## Proposed approach (for future PR)

- Add a `hash_version` field to CAS metadata (sketched after this list)
- Use blake2b (stdlib, no new dependency) or xxh128 for new writes
- Keep SHA1 reader support indefinitely for backward compatibility
- `load_blobs` is already key-based (no rehash), so old artifacts remain loadable
- Open a discussion issue first to align with maintainers on migration strategy
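A minimal sketch of the dual-version read/write path this implies; the `save_blob`/`verify_blob` helpers and the metadata shape are assumptions for illustration, not the actual store API:

```python
# Hypothetical sketch of the versioned write/read path; helper and field
# names are illustrative, not the actual content_addressed_store.py API.
import hashlib

HASH_ALGOS = {
    "sha1": lambda b: hashlib.sha1(b).hexdigest(),  # legacy artifacts
    # digest_size=20 keeps new keys the same length as SHA1 keys
    "blake2b": lambda b: hashlib.blake2b(b, digest_size=20).hexdigest(),
}

WRITE_VERSION = "blake2b"  # algorithm used for all new writes

def save_blob(blob: bytes) -> dict:
    key = HASH_ALGOS[WRITE_VERSION](blob)
    # ... write blob to <prefix>/<key[:2]>/<key> ...
    return {"key": key, "hash_version": WRITE_VERSION}

def verify_blob(blob: bytes, meta: dict) -> bool:
    # Readers dispatch on the persisted hash_version; metadata written
    # before the change has no such field and defaults to SHA1.
    algo = HASH_ALGOS[meta.get("hash_version", "sha1")]
    return algo(blob) == meta["key"]
```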
## Recommendation
Open a GitHub issue proposing the change and linking benchmark data. Let maintainers weigh in on hash choice and migration strategy before implementing.