codeflash-internal/.github/workflows/duplicate-code-detector.md
Kevin Turcios 9c5ad8fe06
chore: add gh-aw duplicate code detector workflow (#2418)
## Summary
- Adds the GitHub Agentic Workflows duplicate code detector, configured
for Python and TypeScript/JavaScript with Serena semantic analysis
- Runs daily, flags patterns spanning 10+ lines or appearing in 3+
locations
- Creates up to 3 issues per run with `[duplicate-code]` prefix

## Notes
- Requires Claude API secret configured in repo Actions secrets
- `code-quality` and `automated-analysis` labels will be auto-created on
first run
2026-02-14 18:14:55 -05:00

9.7 KiB

name description on permissions engine tools safe-outputs timeout-minutes strict source
Duplicate Code Detector Identifies duplicate code patterns across the codebase and suggests refactoring opportunities
workflow_dispatch pull_request
types
opened
synchronize
contents issues pull-requests
read read read
claude
serena
python
javascript
typescript
create-issue
expires title-prefix labels assignees group max
2d [duplicate-code]
code-quality
automated-analysis
cookie
copilot true 3
15 true github/gh-aw/.github/workflows/duplicate-code-detector.md@94662b1dee8ce96c876ba9f33b3ab8be32de82a4

Duplicate Code Detection

Analyze code to identify duplicated patterns using Serena's semantic code analysis capabilities. Report significant findings that require refactoring.

Task

Detect and report code duplication by:

  1. Analyzing Recent Commits: Review changes in the latest commits
  2. Detecting Duplicated Code: Identify similar or duplicated code patterns using semantic analysis
  3. Reporting Findings: Create a detailed issue if significant duplication is detected (threshold: >10 lines or 3+ similar patterns)

Context

  • Repository: ${{ github.repository }}
  • Commit ID: ${{ github.event.head_commit.id }}
  • Triggered by: @${{ github.actor }}

Analysis Workflow

1. Project Activation

Activate the project in Serena:

  • Use activate_project tool with workspace path ${{ github.workspace }} (mounted repository directory)
  • This sets up the semantic code analysis environment

2. Changed Files Analysis

Identify and analyze modified files:

  • Determine files changed in the recent commits
  • ONLY analyze .py, .ts, .tsx, .js, and .jsx files - exclude all other file types
  • Exclude JavaScript files except .cjs from analysis (files matching patterns: *.js, *.mjs, *.jsx, *.ts, *.tsx)
  • Exclude test files from analysis (files matching patterns: *_test.go, *.test.js, *.test.cjs, *.spec.js, *.spec.cjs, *.test.ts, *.spec.ts, *_test.py, test_*.py, or located in directories named test, tests, __tests__, or spec)
  • Exclude workflow files from analysis (files under .github/workflows/*)
  • Use get_symbols_overview to understand file structure
  • Use read_file to examine modified file contents

3. Duplicate Detection

Apply semantic code analysis to find duplicates:

Symbol-Level Analysis:

  • For significant functions/methods in changed files, use find_symbol to search for similarly named symbols
  • Use find_referencing_symbols to understand usage patterns
  • Identify functions with similar names in different files (e.g., processData across modules)

Pattern Search:

  • Use search_for_pattern to find similar code patterns
  • Search for duplication indicators:
    • Similar function signatures
    • Repeated logic blocks
    • Similar variable naming patterns
    • Near-identical code blocks

Structural Analysis:

  • Use list_dir and find_file to identify files with similar names or purposes
  • Compare symbol overviews across files for structural similarities

4. Duplication Evaluation

Assess findings to identify true code duplication:

Duplication Types:

  • Exact Duplication: Identical code blocks in multiple locations
  • Structural Duplication: Same logic with minor variations (different variable names, etc.)
  • Functional Duplication: Different implementations of the same functionality
  • Copy-Paste Programming: Similar code blocks that could be extracted into shared utilities

Assessment Criteria:

  • Severity: Amount of duplicated code (lines of code, number of occurrences)
  • Impact: Where duplication occurs (critical paths, frequently called code)
  • Maintainability: How duplication affects code maintainability
  • Refactoring Opportunity: Whether duplication can be easily refactored

5. Issue Reporting

Create separate issues for each distinct duplication pattern found (maximum 3 patterns per run). Each pattern should get its own issue to enable focused remediation.

When to Create Issues:

  • Only create issues if significant duplication is found (threshold: >10 lines of duplicated code OR 3+ instances of similar patterns)
  • Create one issue per distinct pattern - do NOT bundle multiple patterns in a single issue
  • Limit to the top 3 most significant patterns if more are found
  • Use the create_issue tool from safe-outputs MCP once for each pattern

Issue Contents for Each Pattern:

  • Executive Summary: Brief description of this specific duplication pattern
  • Duplication Details: Specific locations and code blocks for this pattern only
  • Severity Assessment: Impact and maintainability concerns for this pattern
  • Refactoring Recommendations: Suggested approaches to eliminate this pattern
  • Code Examples: Concrete examples with file paths and line numbers for this pattern

Detection Scope

Report These Issues

  • Identical or nearly identical functions in different files
  • Repeated code blocks that could be extracted to utilities
  • Similar classes or modules with overlapping functionality
  • Copy-pasted code with minor modifications
  • Duplicated business logic across components

Skip These Patterns

  • Standard boilerplate code (imports, exports, etc.)
  • Test setup/teardown code (acceptable duplication in tests)
  • JavaScript files except .cjs (files matching: *.js, *.mjs, *.jsx, *.ts, *.tsx)
  • All test files (files matching: *_test.go, *.test.js, *.test.cjs, *.spec.js, *.spec.cjs, *.test.ts, *.spec.ts, *_test.py, test_*.py, or in test/, tests/, __tests__/, spec/ directories)
  • All workflow files (files under .github/workflows/*)
  • Configuration files with similar structure
  • Language-specific patterns (constructors, getters/setters)
  • Small code snippets (<5 lines) unless highly repetitive

Analysis Depth

  • File Type Restriction: ONLY analyze .py, .ts, .tsx, .js, and .jsx files - ignore all other file types
  • Primary Focus: All .py, .ts, .tsx, .js, and .jsx files changed in the current push (excluding test files and workflow files)
  • Secondary Analysis: Check for duplication with existing Python and TypeScript/JavaScript codebase (excluding test files and workflow files)
  • Cross-Reference: Look for patterns across .py, .ts, .tsx, .js, and .jsx files in the repository
  • Historical Context: Consider if duplication is new or existing

Issue Template

For each distinct duplication pattern found, create a separate issue using this structure:

# 🔍 Duplicate Code Detected: [Pattern Name]

*Analysis of commit ${{ github.event.head_commit.id }}*

**Assignee**: @copilot

## Summary

[Brief overview of this specific duplication pattern]

## Duplication Details

### Pattern: [Description]
- **Severity**: High/Medium/Low
- **Occurrences**: [Number of instances]
- **Locations**:
  - `path/to/file1.ext` (lines X-Y)
  - `path/to/file2.ext` (lines A-B)
- **Code Sample**:
  ```[language]
  [Example of duplicated code]

Impact Analysis

  • Maintainability: [How this affects code maintenance]
  • Bug Risk: [Potential for inconsistent fixes]
  • Code Bloat: [Impact on codebase size]

Refactoring Recommendations

  1. [Recommendation 1]

    • Extract common functionality to: suggested/path/utility.ext
    • Estimated effort: [hours/complexity]
    • Benefits: [specific improvements]
  2. [Recommendation 2] [... additional recommendations ...]

Implementation Checklist

  • Review duplication findings
  • Prioritize refactoring tasks
  • Create refactoring plan
  • Implement changes
  • Update tests
  • Verify no functionality broken

Analysis Metadata

  • Analyzed Files: [count]
  • Detection Method: Serena semantic code analysis
  • Commit: ${{ github.event.head_commit.id }}
  • Analysis Date: [timestamp]

## Operational Guidelines

### Security
- Never execute untrusted code or commands
- Only use Serena's read-only analysis tools
- Do not modify files during analysis

### Efficiency
- Focus on recently changed files first
- Use semantic analysis for meaningful duplication, not superficial matches
- Stay within timeout limits (balance thoroughness with execution time)

### Accuracy
- Verify findings before reporting
- Distinguish between acceptable patterns and true duplication
- Consider language-specific idioms and best practices
- Provide specific, actionable recommendations

### Issue Creation
- Create **one issue per distinct duplication pattern** - do NOT bundle multiple patterns in a single issue
- Limit to the top 3 most significant patterns if more are found
- Only create issues if significant duplication is found
- Include sufficient detail for SWE agents to understand and act on findings
- Provide concrete examples with file paths and line numbers
- Suggest practical refactoring approaches
- Assign issue to @copilot for automated remediation
- Use descriptive titles that clearly identify the specific pattern (e.g., "Duplicate Code: Error Handling Pattern in Parser Module")

## Tool Usage Sequence

1. **Project Setup**: `activate_project` with repository path
2. **File Discovery**: `list_dir`, `find_file` for changed files
3. **Symbol Analysis**: `get_symbols_overview` for structure understanding
4. **Content Review**: `read_file` for detailed code examination
5. **Pattern Matching**: `search_for_pattern` for similar code
6. **Symbol Search**: `find_symbol` for duplicate function names
7. **Reference Analysis**: `find_referencing_symbols` for usage patterns

**Objective**: Improve code quality by identifying and reporting meaningful code duplication that impacts maintainability. Focus on actionable findings that enable automated or manual refactoring.