chore: add gh-aw duplicate code detector workflow (#2418)

## Summary - Adds the GitHub Agentic Workflows duplicate code detector, configured for Python and TypeScript/JavaScript with Serena semantic analysis - Runs daily, flags patterns spanning 10+ lines or appearing in 3+ locations - Creates up to 3 issues per run with `[duplicate-code]` prefix ## Notes - Requires Claude API secret configured in repo Actions secrets - `code-quality` and `automated-analysis` labels will be auto-created on first run
2026-02-14 18:14:55 -05:00 · 2026-02-14 18:14:55 -05:00 · 9c5ad8fe06
commit 9c5ad8fe06
parent 4c3deeb7b8
5 changed files with 1450 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@ -3,3 +3,5 @@
 *.{bat,[bB][aA][tT]} text eol=crlf
 *.png binary
 *.jpg binary
+
+.github/workflows/*.lock.yml linguist-generated=true merge=ours
--- a/.github/aw/imports/.gitattributes
+++ b/.github/aw/imports/.gitattributes
@ -0,0 +1,5 @@
+# Mark all cached import files as generated
+* linguist-generated=true
+
+# Use 'ours' merge strategy to keep local cached versions
+* merge=ours
--- a/.github/aw/imports/github/gh-aw/94662b1dee8ce96c876ba9f33b3ab8be32de82a4/.github_workflows_shared_reporting.md
+++ b/.github/aw/imports/github/gh-aw/94662b1dee8ce96c876ba9f33b3ab8be32de82a4/.github_workflows_shared_reporting.md
@ -0,0 +1,73 @@
+---
+# Report formatting guidelines
+---
+
+## Report Structure Guidelines
+
+### 1. Header Levels
+**Use h3 (###) or lower for all headers in your issue report to maintain proper document hierarchy.**
+
+When creating GitHub issues or discussions:
+- Use `###` (h3) for main sections (e.g., "### Test Summary")
+- Use `####` (h4) for subsections (e.g., "#### Device-Specific Results")
+- Never use `##` (h2) or `#` (h1) in reports - these are reserved for titles
+
+### 2. Progressive Disclosure
+**Wrap detailed test results in `<details><summary><b>Section Name</b></summary>` tags to improve readability and reduce scrolling.**
+
+Use collapsible sections for:
+- Verbose details (full test logs, raw data)
+- Secondary information (minor warnings, extra context)
+- Per-item breakdowns when there are many items
+
+Always keep critical information visible (summary, critical issues, key metrics).
+
+### 3. Report Structure Pattern
+
+1. **Overview**: 1-2 paragraphs summarizing key findings
+2. **Critical Information**: Show immediately (summary stats, critical issues)
+3. **Details**: Use `<details><summary><b>Section Name</b></summary>` for expanded content
+4. **Context**: Add helpful metadata (workflow run, date, trigger)
+
+### Design Principles (Airbnb-Inspired)
+
+Reports should:
+- **Build trust through clarity**: Most important info immediately visible
+- **Exceed expectations**: Add helpful context like trends, comparisons
+- **Create delight**: Use progressive disclosure to reduce overwhelm
+- **Maintain consistency**: Follow patterns across all reports
+
+### Example Report Structure
+
+```markdown
+### Summary
+- Key metric 1: value
+- Key metric 2: value
+- Status: ✅/⚠️/❌
+
+### Critical Issues
+[Always visible - these are important]
+
+<details>
+<summary><b>View Detailed Results</b></summary>
+
+[Comprehensive details, logs, traces]
+
+</details>
+
+<details>
+<summary><b>View All Warnings</b></summary>
+
+[Minor issues and potential problems]
+
+</details>
+
+### Recommendations
+[Actionable next steps - keep visible]
+```
+
+## Workflow Run References
+
+- Format run IDs as links: `[§12345](https://github.com/owner/repo/actions/runs/12345)`
+- Include up to 3 most relevant run URLs at end under `**References:**`
+- Do NOT add footer attribution (system adds automatically)
--- a/.github/workflows/duplicate-code-detector.lock.yml
+++ b/.github/workflows/duplicate-code-detector.lock.yml
--- a/.github/workflows/duplicate-code-detector.md
+++ b/.github/workflows/duplicate-code-detector.md
@ -0,0 +1,247 @@
+---
+name: Duplicate Code Detector
+description: Identifies duplicate code patterns across the codebase and suggests refactoring opportunities
+on:
+  workflow_dispatch:
+  pull_request:
+    types: [opened, synchronize]
+permissions:
+  contents: read
+  issues: read
+  pull-requests: read
+engine: claude
+tools:
+  serena: ["python", "javascript", "typescript"]
+safe-outputs:
+  create-issue:
+    expires: 2d
+    title-prefix: "[duplicate-code] "
+    labels: [code-quality, automated-analysis, cookie]
+    assignees: copilot
+    group: true
+    max: 3
+timeout-minutes: 15
+strict: true
+source: github/gh-aw/.github/workflows/duplicate-code-detector.md@94662b1dee8ce96c876ba9f33b3ab8be32de82a4
+---
+
+# Duplicate Code Detection
+
+Analyze code to identify duplicated patterns using Serena's semantic code analysis capabilities. Report significant findings that require refactoring.
+
+## Task
+
+Detect and report code duplication by:
+
+1. **Analyzing Recent Commits**: Review changes in the latest commits
+2. **Detecting Duplicated Code**: Identify similar or duplicated code patterns using semantic analysis
+3. **Reporting Findings**: Create a detailed issue if significant duplication is detected (threshold: >10 lines or 3+ similar patterns)
+
+## Context
+
+- **Repository**: ${{ github.repository }}
+- **Commit ID**: ${{ github.event.head_commit.id }}
+- **Triggered by**: @${{ github.actor }}
+
+## Analysis Workflow
+
+### 1. Project Activation
+
+Activate the project in Serena:
+- Use `activate_project` tool with workspace path `${{ github.workspace }}` (mounted repository directory)
+- This sets up the semantic code analysis environment
+
+### 2. Changed Files Analysis
+
+Identify and analyze modified files:
+- Determine files changed in the recent commits
+- **ONLY analyze .py, .ts, .tsx, .js, and .jsx files** - exclude all other file types
+- **Exclude JavaScript files except .cjs** from analysis (files matching patterns: `*.js`, `*.mjs`, `*.jsx`, `*.ts`, `*.tsx`)
+- **Exclude test files** from analysis (files matching patterns: `*_test.go`, `*.test.js`, `*.test.cjs`, `*.spec.js`, `*.spec.cjs`, `*.test.ts`, `*.spec.ts`, `*_test.py`, `test_*.py`, or located in directories named `test`, `tests`, `__tests__`, or `spec`)
+- **Exclude workflow files** from analysis (files under `.github/workflows/*`)
+- Use `get_symbols_overview` to understand file structure
+- Use `read_file` to examine modified file contents
+
+### 3. Duplicate Detection
+
+Apply semantic code analysis to find duplicates:
+
+**Symbol-Level Analysis**:
+- For significant functions/methods in changed files, use `find_symbol` to search for similarly named symbols
+- Use `find_referencing_symbols` to understand usage patterns
+- Identify functions with similar names in different files (e.g., `processData` across modules)
+
+**Pattern Search**:
+- Use `search_for_pattern` to find similar code patterns
+- Search for duplication indicators:
+  - Similar function signatures
+  - Repeated logic blocks
+  - Similar variable naming patterns
+  - Near-identical code blocks
+
+**Structural Analysis**:
+- Use `list_dir` and `find_file` to identify files with similar names or purposes
+- Compare symbol overviews across files for structural similarities
+
+### 4. Duplication Evaluation
+
+Assess findings to identify true code duplication:
+
+**Duplication Types**:
+- **Exact Duplication**: Identical code blocks in multiple locations
+- **Structural Duplication**: Same logic with minor variations (different variable names, etc.)
+- **Functional Duplication**: Different implementations of the same functionality
+- **Copy-Paste Programming**: Similar code blocks that could be extracted into shared utilities
+
+**Assessment Criteria**:
+- **Severity**: Amount of duplicated code (lines of code, number of occurrences)
+- **Impact**: Where duplication occurs (critical paths, frequently called code)
+- **Maintainability**: How duplication affects code maintainability
+- **Refactoring Opportunity**: Whether duplication can be easily refactored
+
+### 5. Issue Reporting
+
+Create separate issues for each distinct duplication pattern found (maximum 3 patterns per run). Each pattern should get its own issue to enable focused remediation.
+
+**When to Create Issues**:
+- Only create issues if significant duplication is found (threshold: >10 lines of duplicated code OR 3+ instances of similar patterns)
+- **Create one issue per distinct pattern** - do NOT bundle multiple patterns in a single issue
+- Limit to the top 3 most significant patterns if more are found
+- Use the `create_issue` tool from safe-outputs MCP **once for each pattern**
+
+**Issue Contents for Each Pattern**:
+- **Executive Summary**: Brief description of this specific duplication pattern
+- **Duplication Details**: Specific locations and code blocks for this pattern only
+- **Severity Assessment**: Impact and maintainability concerns for this pattern
+- **Refactoring Recommendations**: Suggested approaches to eliminate this pattern
+- **Code Examples**: Concrete examples with file paths and line numbers for this pattern
+
+## Detection Scope
+
+### Report These Issues
+
+- Identical or nearly identical functions in different files
+- Repeated code blocks that could be extracted to utilities
+- Similar classes or modules with overlapping functionality
+- Copy-pasted code with minor modifications
+- Duplicated business logic across components
+
+### Skip These Patterns
+
+- Standard boilerplate code (imports, exports, etc.)
+- Test setup/teardown code (acceptable duplication in tests)
+- **JavaScript files except .cjs** (files matching: `*.js`, `*.mjs`, `*.jsx`, `*.ts`, `*.tsx`)
+- **All test files** (files matching: `*_test.go`, `*.test.js`, `*.test.cjs`, `*.spec.js`, `*.spec.cjs`, `*.test.ts`, `*.spec.ts`, `*_test.py`, `test_*.py`, or in `test/`, `tests/`, `__tests__/`, `spec/` directories)
+- **All workflow files** (files under `.github/workflows/*`)
+- Configuration files with similar structure
+- Language-specific patterns (constructors, getters/setters)
+- Small code snippets (<5 lines) unless highly repetitive
+
+### Analysis Depth
+
+- **File Type Restriction**: ONLY analyze .py, .ts, .tsx, .js, and .jsx files - ignore all other file types
+- **Primary Focus**: All .py, .ts, .tsx, .js, and .jsx files changed in the current push (excluding test files and workflow files)
+- **Secondary Analysis**: Check for duplication with existing Python and TypeScript/JavaScript codebase (excluding test files and workflow files)
+- **Cross-Reference**: Look for patterns across .py, .ts, .tsx, .js, and .jsx files in the repository
+- **Historical Context**: Consider if duplication is new or existing
+
+## Issue Template
+
+For each distinct duplication pattern found, create a separate issue using this structure:
+
+```markdown
+# 🔍 Duplicate Code Detected: [Pattern Name]
+
+*Analysis of commit ${{ github.event.head_commit.id }}*
+
+**Assignee**: @copilot
+
+## Summary
+
+[Brief overview of this specific duplication pattern]
+
+## Duplication Details
+
+### Pattern: [Description]
+- **Severity**: High/Medium/Low
+- **Occurrences**: [Number of instances]
+- **Locations**:
+  - `path/to/file1.ext` (lines X-Y)
+  - `path/to/file2.ext` (lines A-B)
+- **Code Sample**:
+  ```[language]
+  [Example of duplicated code]
+  ```
+
+## Impact Analysis
+
+- **Maintainability**: [How this affects code maintenance]
+- **Bug Risk**: [Potential for inconsistent fixes]
+- **Code Bloat**: [Impact on codebase size]
+
+## Refactoring Recommendations
+
+1. **[Recommendation 1]**
+   - Extract common functionality to: `suggested/path/utility.ext`
+   - Estimated effort: [hours/complexity]
+   - Benefits: [specific improvements]
+
+2. **[Recommendation 2]**
+   [... additional recommendations ...]
+
+## Implementation Checklist
+
+- [ ] Review duplication findings
+- [ ] Prioritize refactoring tasks
+- [ ] Create refactoring plan
+- [ ] Implement changes
+- [ ] Update tests
+- [ ] Verify no functionality broken
+
+## Analysis Metadata
+
+- **Analyzed Files**: [count]
+- **Detection Method**: Serena semantic code analysis
+- **Commit**: ${{ github.event.head_commit.id }}
+- **Analysis Date**: [timestamp]
+```
+
+## Operational Guidelines
+
+### Security
+- Never execute untrusted code or commands
+- Only use Serena's read-only analysis tools
+- Do not modify files during analysis
+
+### Efficiency
+- Focus on recently changed files first
+- Use semantic analysis for meaningful duplication, not superficial matches
+- Stay within timeout limits (balance thoroughness with execution time)
+
+### Accuracy
+- Verify findings before reporting
+- Distinguish between acceptable patterns and true duplication
+- Consider language-specific idioms and best practices
+- Provide specific, actionable recommendations
+
+### Issue Creation
+- Create **one issue per distinct duplication pattern** - do NOT bundle multiple patterns in a single issue
+- Limit to the top 3 most significant patterns if more are found
+- Only create issues if significant duplication is found
+- Include sufficient detail for SWE agents to understand and act on findings
+- Provide concrete examples with file paths and line numbers
+- Suggest practical refactoring approaches
+- Assign issue to @copilot for automated remediation
+- Use descriptive titles that clearly identify the specific pattern (e.g., "Duplicate Code: Error Handling Pattern in Parser Module")
+
+## Tool Usage Sequence
+
+1. **Project Setup**: `activate_project` with repository path
+2. **File Discovery**: `list_dir`, `find_file` for changed files
+3. **Symbol Analysis**: `get_symbols_overview` for structure understanding
+4. **Content Review**: `read_file` for detailed code examination
+5. **Pattern Matching**: `search_for_pattern` for similar code
+6. **Symbol Search**: `find_symbol` for duplicate function names
+7. **Reference Analysis**: `find_referencing_symbols` for usage patterns
+
+**Objective**: Improve code quality by identifying and reporting meaningful code duplication that impacts maintainability. Focus on actionable findings that enable automated or manual refactoring.