codeflash-agent/.codeflash/netflix/metaflow/data/sha1-proposal.md
Kevin Turcios 3b59d97647 squash
2026-04-13 14:12:17 -05:00


# SHA1 -> Faster Hash Proposal (Content-Addressed Store)

**Status:** Deferred, pending discussion with maintainers before implementation.

## Opportunity

SHA1 is used as the content-addressing hash in `content_addressed_store.py:98`. Benchmarks on an Azure Standard_D2s_v5 instance:

| Blob Size | SHA1 | xxh64 | xxh64 Speedup | blake2b (est.) |
| --- | --- | --- | --- | --- |
| 1 KB | 0.001 ms | 0.0004 ms | 2.5x | ~1.5x |
| 100 KB | 0.060 ms | 0.008 ms | 7.5x | ~3x |
| 1 MB | 0.596 ms | 0.073 ms | 8.2x | ~4x |
| 10 MB | 5.979 ms | 0.736 ms | 8.1x | ~4x |
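A minimal sketch for reproducing these numbers locally (absolute timings depend on hardware and Python build; `xxhash` is a third-party package, so it is only measured if installed):

```python
import hashlib
import timeit

blob = b"x" * (1024 * 1024)  # 1 MB payload, matching the table's 1MB row

# Average per-call time in milliseconds over 100 runs.
sha1_ms = timeit.timeit(lambda: hashlib.sha1(blob).hexdigest(), number=100) / 100 * 1000
blake2b_ms = timeit.timeit(lambda: hashlib.blake2b(blob).hexdigest(), number=100) / 100 * 1000

print(f"sha1:    {sha1_ms:.3f} ms")
print(f"blake2b: {blake2b_ms:.3f} ms")

try:
    import xxhash  # optional third-party dependency

    xxh64_ms = timeit.timeit(lambda: xxhash.xxh64(blob).hexdigest(), number=100) / 100 * 1000
    print(f"xxh64:   {xxh64_ms:.3f} ms")
except ImportError:
    print("xxhash not installed; skipping xxh64")
```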

## Why it's not a simple drop-in

The SHA1 hex digest is the storage key: it determines where artifacts live on disk/S3 (`<prefix>/<sha[:2]>/<sha>`). It is persisted in metadata databases and used across 14 locations in the codebase.
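To make the coupling concrete, here is a sketch of the path layout described above (`blob_path` is an illustrative helper, not Metaflow's actual API); swapping the hash function changes every key and therefore every path:

```python
import hashlib


def blob_path(prefix: str, blob: bytes) -> str:
    # The SHA1 hex digest IS the storage key: the first two hex chars
    # shard the namespace, and the full digest names the object.
    sha = hashlib.sha1(blob).hexdigest()
    return f"{prefix}/{sha[:2]}/{sha}"


print(blob_path("metaflow/data", b"hello"))
# -> metaflow/data/aa/aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d
```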

## All SHA1 usage locations

| File | Line | Purpose | Persisted where? | Breaking? |
| --- | --- | --- | --- | --- |
| `content_addressed_store.py` | 98 | Artifact content-address key | S3/filesystem paths | Yes |
| `filecache.py` | 96-100 | Log/metadata cache tokens | Local cache filenames | Local only |
| `includefile.py` | 417 | Include file hash | Metadata DB | Yes |
| `metadata.py` | 588 | Artifact metadata field | Metadata DB | Yes |
| `argo_workflows.py` | 781 | Event name suffix | Argo event names | No (regenerable) |
| `argo_workflows_cli.py` | 550, 605, 611 | Workflow name truncation | Argo workflow names | No (regenerable) |
| `step_functions_cli.py` | 299, 307 | StateMachine name suffix | AWS resource names | No (regenerable) |
| `airflow_cli.py` | 454 | DAG naming | Airflow DAG names | No (regenerable) |
| `event_bridge_client.py` | 82 | Rule name truncation | AWS resource names | No (regenerable) |
| `s3op.py` | 726 | S3 download cache filename | Local cache filenames | Local only |
| test files | Multiple | Test verification | No | No |

## Key concerns

1. **Dedup boundary:** the same content saved before and after the change gets different keys, so there is no cross-version deduplication.
2. **Collision safety:** xxh64's 64-bit digest has a birthday bound of ~2^32 blobs, too small for content addressing; xxh128 or blake2b is required.
3. **New dependency:** xxhash adds an entry to `install_requires`; blake2b is in the stdlib (Python 3.6+) but slower than xxhash.
4. **Migration:** needs a versioned hash algorithm in metadata and a dual-compute transition period.
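The collision-safety numbers in concern 2 follow from the birthday bound, and the stdlib escape hatch in concern 3 can even be truncated to SHA1's key length if desired; a quick sketch:

```python
import hashlib
import math


def birthday_bound(bits: int) -> float:
    # Roughly sqrt(2^bits) random hashes before a collision becomes likely.
    return math.sqrt(2 ** bits)


print(f"xxh64  (64-bit):  ~2^{math.log2(birthday_bound(64)):.0f} blobs")   # ~2^32
print(f"xxh128 (128-bit): ~2^{math.log2(birthday_bound(128)):.0f} blobs")  # ~2^64

# blake2b ships in hashlib and supports truncated digests; digest_size=20
# matches SHA1's 20-byte (40 hex char) key length, keeping path widths stable.
key = hashlib.blake2b(b"payload", digest_size=20).hexdigest()
print(len(key))  # 40
```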

## Proposed approach (for a future PR)

1. Add a `hash_version` field to CAS metadata.
2. Use blake2b (stdlib, no new dependency) or xxh128 for new writes.
3. Keep SHA1 reader support indefinitely for backward compatibility.
4. `load_blobs` is already key-based (no rehash), so old artifacts remain loadable.
5. Open a discussion issue first to align with maintainers on migration strategy.
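Steps 1-4 above could be sketched roughly as follows (all names here are hypothetical, not Metaflow's actual API): the algorithm is keyed by `hash_version`, new writes use the new version, and reads keep resolving old keys.

```python
import hashlib

# Hypothetical version registry: v1 is the current SHA1 scheme, v2 switches
# new writes to blake2b. Old keys stay resolvable because load_blobs is
# key-based and never rehashes content.
HASH_ALGORITHMS = {
    1: lambda blob: hashlib.sha1(blob).hexdigest(),
    2: lambda blob: hashlib.blake2b(blob, digest_size=20).hexdigest(),
}

CURRENT_HASH_VERSION = 2


def compute_key(blob: bytes, hash_version: int = CURRENT_HASH_VERSION) -> dict:
    key = HASH_ALGORITHMS[hash_version](blob)
    # hash_version travels with the CAS metadata so that readers can pick
    # the right algorithm when they do need to verify content.
    return {"key": key, "hash_version": hash_version}


new = compute_key(b"artifact bytes")                    # blake2b key for new writes
old = compute_key(b"artifact bytes", hash_version=1)    # legacy SHA1 key, still valid
```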

## Recommendation

Open a GitHub issue proposing the change and linking the benchmark data above. Let maintainers weigh in on the hash choice and migration strategy before implementing.