codeflash-agent/plugin/languages/java/references/database/guide.md
mashraf-222 270cb56cee
Feat/java language support (#12)
2026-04-14 18:49:41 -05:00


Database Query Optimization for Java

This guide covers JPA/Hibernate performance patterns, connection pooling, query optimization, caching, batch operations, and EXPLAIN plan verification. For general database verification tiers (EXPLAIN comparison, result diffing, generated integration tests), see the shared database reference patterns.

JPA/Hibernate N+1 Problem

The N+1 problem is the most common JPA performance issue. Loading a parent entity and then accessing its lazy-loaded children triggers N additional queries -- one per parent row.

Detection

// BAD: N+1 -- one query per order to load its items
List<Order> orders = entityManager.createQuery(
    "SELECT o FROM Order o WHERE o.status = :status", Order.class)
    .setParameter("status", "ACTIVE")
    .getResultList();

for (Order order : orders) {
    order.getItems().size();  // triggers SELECT * FROM order_item WHERE order_id = ?
    // One query per order -- if 100 orders, 101 queries total
}

Detection signals:

  • Hibernate SQL logging (spring.jpa.show-sql=true or hibernate.show_sql=true) shows repeated queries with different parameter values
  • hibernate.generate_statistics=true shows high prepareStatement count
  • P6Spy or datasource-proxy logs show query count per request
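The first signal, repeated queries that differ only in parameter values, can be checked mechanically once the executed statements have been captured. The following is a minimal sketch, assuming the SQL strings were already collected into a list (for example from a logging proxy); the class name RepeatedQueryDetector, its methods, and the threshold are hypothetical and not part of Hibernate, P6Spy, or datasource-proxy.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RepeatedQueryDetector {

    // Replace literal values with '?' so statements that differ only in
    // their parameter values collapse to the same template.
    static String normalize(String sql) {
        return sql.trim()
                .replaceAll("'[^']*'", "?")      // quoted string literals
                .replaceAll("\\b\\d+\\b", "?")   // standalone numeric literals
                .replaceAll("\\s+", " ")         // collapse whitespace
                .toLowerCase();
    }

    // Returns each statement template that occurs at least `threshold` times.
    static Map<String, Integer> suspects(List<String> statements, int threshold) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String sql : statements) {
            counts.merge(normalize(sql), 1, Integer::sum);
        }
        counts.values().removeIf(count -> count < threshold);
        return counts;
    }

    public static void main(String[] args) {
        // Illustrative input: one parent query followed by one child query per order.
        List<String> captured = List.of(
                "SELECT * FROM orders WHERE status = 'ACTIVE'",
                "SELECT * FROM order_item WHERE order_id = 1",
                "SELECT * FROM order_item WHERE order_id = 2",
                "SELECT * FROM order_item WHERE order_id = 3");
        suspects(captured, 3).forEach((sql, n) -> System.out.println(n + "x " + sql));
    }
}
```

A template that repeats roughly once per parent row within a single request is a strong hint of N+1; the threshold is a judgment call and depends on the workload.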

Fix 1: JOIN FETCH (JPQL)

// GOOD: single query with JOIN
List<Order> orders = entityManager.createQuery(
    "SELECT DISTINCT o FROM Order o JOIN FETCH o.items WHERE o.status = :status", Order.class)
    .setParameter("status", "ACTIVE")
    .getResultList();

// All items already loaded -- no additional queries
for (Order order : orders) {
    order.getItems().size();  // no query -- already fetched
}

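Warning: JOIN FETCH on more than one collection produces a cartesian product: every row of the first collection is repeated for every row of the second. Hibernate throws MultipleBagFetchException when a single query fetches two List (bag) collections. Fetch one collection with JOIN FETCH and load the second with @Fetch(FetchMode.SUBSELECT) or @BatchSize.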
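To make the cartesian product concrete, the arithmetic can be sketched as below. This is an illustrative model only: the second collection ("payments") and the class name JoinFetchRows are assumptions introduced for the example, and the counts describe an inner JOIN FETCH of both collections in one query.

```java
import java.util.Arrays;

final class JoinFetchRows {

    // Rows returned when ONE query inner-joins both collections:
    // a parent with a items and b payments appears in a * b result rows.
    static long singleQueryRows(int[] itemsPerOrder, int[] paymentsPerOrder) {
        long rows = 0;
        for (int i = 0; i < itemsPerOrder.length; i++) {
            rows += (long) itemsPerOrder[i] * paymentsPerOrder[i];
        }
        return rows;
    }

    // Rows returned in total when each collection is loaded by its own query.
    static long separateQueryRows(int[] itemsPerOrder, int[] paymentsPerOrder) {
        long rows = 0;
        for (int i = 0; i < itemsPerOrder.length; i++) {
            rows += itemsPerOrder[i] + paymentsPerOrder[i];
        }
        return rows;
    }

    public static void main(String[] args) {
        int[] items = new int[100];
        int[] payments = new int[100];
        Arrays.fill(items, 10);    // 100 orders with 10 items each
        Arrays.fill(payments, 5);  // and 5 payments each
        System.out.println(singleQueryRows(items, payments));   // 100 * 10 * 5
        System.out.println(separateQueryRows(items, payments)); // 100 * (10 + 5)
    }
}
```

The gap grows multiplicatively with collection size, which is why a second collection is usually better loaded separately.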

Fix 2: @EntityGraph

@Entity
public class Order {
    @OneToMany(mappedBy = "order", fetch = FetchType.LAZY)
    private List<OrderItem> items;
}

// Spring Data JPA: declare the graph on a repository query method
@EntityGraph(attributePaths = {"items", "items.product"})
List<Order> findByStatus(String status);

// Or programmatically with the JPA API
EntityGraph<Order> graph = entityManager.createEntityGraph(Order.class);
// addSubgraph registers "items" as an attribute node and returns its subgraph
Subgraph<OrderItem> itemGraph = graph.addSubgraph("items");
itemGraph.addAttributeNodes("product");

List<Order> orders = entityManager.createQuery("SELECT o FROM Order o", Order.class)
    // On Jakarta Persistence 3+ (Spring Boot 3 / Hibernate 6) the key is "jakarta.persistence.fetchgraph"
    .setHint("javax.persistence.fetchgraph", graph)
    .getResultList();

Fix 3: @BatchSize

@Entity
public class Order {
    @OneToMany(mappedBy = "order", fetch = FetchType.LAZY)
    @BatchSize(size = 25)  // loads items in batches of 25 orders
    private List<OrderItem> items;
}

Instead of N queries, Hibernate issues ceil(N/25) queries using WHERE order_id IN (?, ?, ..., ?).
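The query counts follow directly from the formula. A small sketch of the arithmetic, assuming every batch is filled to the configured size (the class and method names are illustrative):

```java
final class LazyLoadQueryCount {

    // One query for the parents plus one per parent for its collection.
    static int plainLazy(int parents) {
        return 1 + parents;
    }

    // One query for the parents plus ceil(parents / batchSize) IN-list queries.
    static int withBatchSize(int parents, int batchSize) {
        return 1 + (parents + batchSize - 1) / batchSize;
    }

    public static void main(String[] args) {
        System.out.println(plainLazy(100));          // parent query + 100 child queries
        System.out.println(withBatchSize(100, 25));  // parent query + 4 batched queries
    }
}
```

For the 100-order example earlier in this section, that is 101 queries without batching and 5 with a batch size of 25.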

Global batch size

# application.properties (Spring Boot)
spring.jpa.properties.hibernate.default_batch_fetch_size=25

This applies batch fetching to ALL lazy associations globally -- often the single highest-impact Hibernate tuning parameter.

Fix selection guide

| Scenario | Best fix | Why |
|---|---|---|
| Always need children with parent | JOIN FETCH | Single query, minimal overhead |
| Sometimes need children | @EntityGraph on specific queries | Selective eager loading per use case |
| Multiple collections on entity | @BatchSize or default_batch_fetch_size | Avoids cartesian product from multiple JOINs |
| Large result sets | @BatchSize | JOIN FETCH with pagination is problematic (Hibernate warns about applying in-memory pagination) |

Connection Pooling

HikariCP Configuration

HikariCP is the default connection pool in Spring Boot 2 and later, and is widely used in other Java applications.

# Essential settings
# Keep comments on their own lines: in a .properties file a trailing "# ..." becomes part of the value
# Minimum idle connections kept open (default: same as maximum-pool-size)
spring.datasource.hikari.minimum-idle=5
# Maximum connections in the pool (default: 10)
spring.datasource.hikari.maximum-pool-size=10
# Milliseconds to wait for a connection (default: 30s)
spring.datasource.hikari.connection-timeout=30000
# Maximum connection age before recycling (default: 30min)
spring.datasource.hikari.max-lifetime=1800000
# Maximum idle time before eviction (default: 10min)
spring.datasource.hikari.idle-timeout=600000
# Milliseconds before a warning is logged for an unreturned connection (default: 0, disabled)
spring.datasource.hikari.leak-detection-threshold=60000

Common misconfigurations

| Misconfiguration | Symptom | Fix |
|---|---|---|
| maximum-pool-size too small | SQLTransientConnectionException: Connection is not available, request timed out after 30000ms | Increase pool size. Rule of thumb: pool_size = (core_count * 2) + effective_spindle_count. For SSDs, start at ~10. |
| maximum-pool-size too large | Database overwhelmed with connections, context switching overhead | PostgreSQL: keep total connections (across all app instances) under max_connections. Each connection is a separate backend process that holds database memory even when idle. |
| connection-timeout too short | Spurious timeouts during traffic spikes | Increase to 30-60s. If timeouts persist, the pool is too small. |
| max-lifetime too high | Connections go stale, database restarts cause errors | Set it shorter than any connection time limit imposed by the database or infrastructure (for example MySQL wait_timeout, or a firewall or load balancer idle timeout). |
| minimum-idle = maximum-pool-size | Pool never shrinks during idle periods | Set minimum-idle lower to release connections during off-peak. |
| No leak-detection-threshold | Connection leaks go undetected until pool exhaustion | Set to 60000 (60s). Logs a warning with stack trace when a connection isn't returned within the threshold. |

Connection pool sizing formula

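The PostgreSQL wiki suggests: pool_size = (core_count * 2) + effective_spindle_count, where core_count is the number of cores on the database server. For most modern servers with SSDs (counting one effective spindle):

  • 4-core machine: 9, so about 10 connections
  • 8-core machine: 17, so about 20 connections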
  • More is NOT always better -- beyond the optimal point, context switching and lock contention reduce throughput

Multiple app instances: If you have 4 app instances each with a pool of 10, the database sees 40 connections. Size accordingly.
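The sizing formula and the multi-instance budget can be sketched as simple arithmetic. This is a hedged illustration: the class and method names are made up for the example, and the reservedConnections parameter (connections held back for admin tools, migrations, and monitoring) is an assumption that does not appear in the formula itself.

```java
final class PoolSizing {

    // pool_size = (core_count * 2) + effective_spindle_count
    static int suggestedPoolSize(int dbCoreCount, int effectiveSpindleCount) {
        return dbCoreCount * 2 + effectiveSpindleCount;
    }

    // Connections the database sees when every instance fills its pool.
    static int totalConnections(int appInstances, int poolSizePerInstance) {
        return appInstances * poolSizePerInstance;
    }

    // Largest per-instance pool that keeps the total under the database limit.
    static int maxPoolSizePerInstance(int dbMaxConnections, int reservedConnections, int appInstances) {
        return (dbMaxConnections - reservedConnections) / appInstances;
    }

    public static void main(String[] args) {
        System.out.println(suggestedPoolSize(4, 1));              // 4 cores, 1 spindle
        System.out.println(totalConnections(4, 10));              // 4 instances, pool of 10
        System.out.println(maxPoolSizePerInstance(100, 10, 4));   // limit 100, 10 reserved
    }
}
```

The formula gives a starting point for the whole database, not per application instance; load testing should confirm the final value.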

Query Optimization

JPQL vs Criteria API vs Native SQL

| Approach | Type-safe | Readable | Performance | Use when |
|---|---|---|---|---|
| JPQL | No (string) | High | Good | Simple queries, most use cases |
| Criteria API | Yes | Low (verbose) | Same as JPQL (same query plan) | Dynamic queries with optional filters |
| Native SQL | No | Medium | Best (full DB feature access) | Complex aggregations, CTEs, window functions, DB-specific features |

Pagination: OFFSET vs Keyset

// OFFSET pagination: simple but slow for deep pages
// Page 1000 = database reads and discards 999 * pageSize rows
List<Order> page = entityManager.createQuery(
    "SELECT o FROM Order o ORDER BY o.createdAt DESC, o.id DESC", Order.class)
    .setFirstResult(999 * 20)  // skip 19,980 rows
    .setMaxResults(20)
    .getResultList();

// KEYSET pagination: cost stays flat regardless of page depth, given an index on (created_at, id)
// Pass the last seen (createdAt, id) from the previous page; id breaks ties between equal
// timestamps so rows sharing a timestamp are not skipped (lastSeenId is tracked by the caller)
List<Order> nextPage = entityManager.createQuery(
    "SELECT o FROM Order o " +
    "WHERE o.createdAt < :cursorCreatedAt " +
    "   OR (o.createdAt = :cursorCreatedAt AND o.id < :cursorId) " +
    "ORDER BY o.createdAt DESC, o.id DESC", Order.class)
    .setParameter("cursorCreatedAt", lastSeenCreatedAt)
    .setParameter("cursorId", lastSeenId)
    .setMaxResults(20)
    .getResultList();

Rule: Use OFFSET for shallow pages (< 100 pages) or admin UIs. Use keyset pagination for any user-facing infinite scroll, API pagination, or deep result sets.
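The cursor logic behind keyset pagination can be exercised without a database. The sketch below is an in-memory model, not JPA code: the Row type and the KeysetPaging class are hypothetical, and the id tiebreaker is an addition that keeps rows with identical timestamps from being skipped between pages.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

final class KeysetPaging {

    static final class Row {
        final long id;
        final long createdAt;
        Row(long id, long createdAt) { this.id = id; this.createdAt = createdAt; }
    }

    // Models ORDER BY created_at DESC, id DESC.
    static final Comparator<Row> NEWEST_FIRST =
            Comparator.comparingLong((Row r) -> r.createdAt)
                    .thenComparingLong(r -> r.id)
                    .reversed();

    // Models WHERE (created_at, id) sorts strictly after the cursor, LIMIT size.
    // A null cursor requests the first page.
    static List<Row> nextPage(List<Row> table, Row cursor, int size) {
        return table.stream()
                .sorted(NEWEST_FIRST)
                .filter(r -> cursor == null || NEWEST_FIRST.compare(r, cursor) > 0)
                .limit(size)
                .collect(Collectors.toList());
    }

    // Walks every page, feeding the last row of each page back in as the cursor.
    static List<Long> allIds(List<Row> table, int size) {
        List<Long> ids = new ArrayList<>();
        Row cursor = null;
        while (true) {
            List<Row> page = nextPage(table, cursor, size);
            if (page.isEmpty()) {
                return ids;
            }
            for (Row r : page) {
                ids.add(r.id);
            }
            cursor = page.get(page.size() - 1);
        }
    }

    public static void main(String[] args) {
        // Three rows share the same timestamp; a timestamp-only cursor would skip one.
        List<Row> table = List.of(new Row(1, 100), new Row(2, 100), new Row(3, 100),
                new Row(4, 90), new Row(5, 80));
        System.out.println(allIds(table, 2));
    }
}
```

In a real database the filter is served by an index on the cursor columns, which is what keeps the cost independent of page depth.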

Projection: DTO vs Entity

// FULL ENTITY: loads all columns, managed by persistence context
List<Order> orders = entityManager.createQuery(
    "SELECT o FROM Order o WHERE o.status = :status", Order.class)
    .setParameter("status", "ACTIVE")
    .getResultList();
// Each Order is tracked for dirty checking, occupies identity map memory

// DTO PROJECTION: loads only needed columns, not managed
List<OrderSummary> summaries = entityManager.createQuery(
    "SELECT new com.example.OrderSummary(o.id, o.total, o.createdAt) " +
    "FROM Order o WHERE o.status = :status", OrderSummary.class)
    .setParameter("status", "ACTIVE")
    .getResultList();
// Lightweight: no dirty checking, no identity map, less memory

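// TUPLE PROJECTION (Criteria API)
CriteriaBuilder cb = entityManager.getCriteriaBuilder();
CriteriaQuery<Tuple> q = cb.createTupleQuery();
Root<Order> root = q.from(Order.class);
q.multiselect(root.get("id"), root.get("total"));
// Each Tuple exposes the selected values by position: get(0) is id, get(1) is total
List<Tuple> rows = entityManager.createQuery(q).getResultList();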

Rule: Use DTO projections for read-only queries (reports, lists, API responses). Use entity loading only when you need to modify the entity or traverse lazy relationships.

Second-Level Cache

Configuration (Ehcache / Caffeine)

# Enable second-level cache
spring.jpa.properties.hibernate.cache.use_second_level_cache=true
spring.jpa.properties.hibernate.cache.region.factory_class=org.hibernate.cache.jcache.JCacheRegionFactory
spring.jpa.properties.hibernate.javax.cache.provider=org.ehcache.jsr107.EhcacheCachingProvider

# Enable query cache (caches JPQL/HQL query results)
spring.jpa.properties.hibernate.cache.use_query_cache=true

Entity cache

@Entity
@Cacheable
@org.hibernate.annotations.Cache(usage = CacheConcurrencyStrategy.READ_WRITE)
public class Product {
    // Cached after first load. Subsequent findById() returns from cache.
}

Query cache

List<Product> products = entityManager.createQuery(
    "SELECT p FROM Product p WHERE p.category = :cat", Product.class)
    .setParameter("cat", "electronics")
    .setHint("org.hibernate.cacheable", true)  // enable query cache for this query
    .getResultList();

When to cache vs when to avoid

| Cache type | Use when | Avoid when |
|---|---|---|
| Entity cache | Read-heavy entities updated rarely (products, configuration, reference data) | Frequently updated entities (orders, events, logs) |
| Query cache | Same query with same parameters runs repeatedly | Queries with frequently-changing underlying data (cache invalidated on any table change) |
| Collection cache | @Cache on @OneToMany -- collection accessed repeatedly and rarely modified | Large collections or frequently-modified collections |

Warning: The query cache is invalidated when ANY entity in the queried table changes. For tables with frequent writes, the cache hit rate drops to near zero and the cache management overhead makes things slower.
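The effect of table-level invalidation on hit rate can be seen in a toy model. This is a deliberately simplified sketch and not Hibernate's implementation: the class TableLevelQueryCache, its logical clock, and its counters are all hypothetical, and it only captures the rule that a cached result is stale once its table has been written to after the result was cached.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

final class TableLevelQueryCache {

    private static final class Entry {
        final Object result;
        final long cachedAt;
        Entry(Object result, long cachedAt) { this.result = result; this.cachedAt = cachedAt; }
    }

    private final Map<String, Entry> results = new HashMap<>();
    private final Map<String, Long> lastWrite = new HashMap<>();
    private long clock = 0;  // logical time, advanced on every write and cache fill
    int hits = 0;
    int misses = 0;

    // Any write to the table makes every result cached before it stale.
    void recordWrite(String table) {
        lastWrite.put(table, ++clock);
    }

    Object query(String table, String queryKey, Supplier<Object> loader) {
        Entry cached = results.get(queryKey);
        long written = lastWrite.getOrDefault(table, 0L);
        if (cached != null && cached.cachedAt > written) {
            hits++;
            return cached.result;
        }
        misses++;
        Object fresh = loader.get();
        results.put(queryKey, new Entry(fresh, ++clock));
        return fresh;
    }

    public static void main(String[] args) {
        TableLevelQueryCache cache = new TableLevelQueryCache();
        // One write between every read: each read misses.
        for (int i = 0; i < 5; i++) {
            cache.query("product", "category=electronics", () -> "rows");
            cache.recordWrite("product");
        }
        System.out.println("hits=" + cache.hits + " misses=" + cache.misses);
    }
}
```

With a write between every pair of reads the model never records a hit, which mirrors the near-zero hit rate described above for write-heavy tables.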

Batch Operations

Hibernate batch inserts

# Enable JDBC batching
spring.jpa.properties.hibernate.jdbc.batch_size=50
spring.jpa.properties.hibernate.order_inserts=true
spring.jpa.properties.hibernate.order_updates=true

// Batch insert with periodic flush/clear (flush interval matches jdbc.batch_size)
for (int i = 0; i < 10_000; i++) {
    entityManager.persist(new OrderItem(/* ... */));
    if ((i + 1) % 50 == 0) {    // after every 50th entity, not after the first one
        entityManager.flush();  // execute batched INSERTs
        entityManager.clear();  // detach all entities (free memory)
    }
}

JDBC batch inserts (bypass Hibernate)

For maximum insert throughput, bypass Hibernate entirely:

@Autowired
JdbcTemplate jdbcTemplate;

jdbcTemplate.batchUpdate(
    "INSERT INTO order_item (order_id, product_id, quantity) VALUES (?, ?, ?)",
    items, 1000,  // batch size
    (ps, item) -> {
        ps.setLong(1, item.getOrderId());
        ps.setLong(2, item.getProductId());
        ps.setInt(3, item.getQuantity());
    }
);
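Without Spring, the same pattern can be written against plain JDBC. The following is a sketch under assumptions: the Item holder class and the JdbcBatchInsert class are hypothetical, the table and columns are taken from the example above, and the caller is assumed to manage the connection and transaction.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

final class JdbcBatchInsert {

    // Hypothetical value holder for one row of order_item.
    static final class Item {
        final long orderId;
        final long productId;
        final int quantity;
        Item(long orderId, long productId, int quantity) {
            this.orderId = orderId;
            this.productId = productId;
            this.quantity = quantity;
        }
    }

    // Number of executeBatch() calls needed for `rows` rows: ceil(rows / batchSize).
    static int roundTrips(int rows, int batchSize) {
        return (rows + batchSize - 1) / batchSize;
    }

    // Queues rows with addBatch() and sends them every `batchSize` rows.
    // Returns the number of batches executed.
    static int insertAll(Connection conn, List<Item> items, int batchSize) throws SQLException {
        String sql = "INSERT INTO order_item (order_id, product_id, quantity) VALUES (?, ?, ?)";
        int batches = 0;
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int pending = 0;
            for (Item item : items) {
                ps.setLong(1, item.orderId);
                ps.setLong(2, item.productId);
                ps.setInt(3, item.quantity);
                ps.addBatch();
                if (++pending == batchSize) {
                    ps.executeBatch();
                    pending = 0;
                    batches++;
                }
            }
            if (pending > 0) {  // send the final partial batch
                ps.executeBatch();
                batches++;
            }
        }
        return batches;
    }

    public static void main(String[] args) {
        System.out.println(roundTrips(10_000, 1000));
    }
}
```

Running the whole insert inside one transaction (auto-commit off) avoids a commit per batch; whether batches are sent as a single network round trip depends on the JDBC driver.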

Statement ordering

When batch-inserting entities with multiple types, Hibernate may interleave INSERT statements for different tables, breaking JDBC batching. Enable statement ordering:

# Group INSERTs by table
hibernate.order_inserts=true
# Group UPDATEs by table
hibernate.order_updates=true

EXPLAIN Plan Verification

Running EXPLAIN from Java

// Spring JdbcTemplate
String plan = jdbcTemplate.queryForObject(
    "EXPLAIN (FORMAT JSON) SELECT * FROM orders WHERE status = ? AND created_at > ?",
    String.class, "ACTIVE", cutoffDate);

// EntityManager native query
Query q = entityManager.createNativeQuery(
    "EXPLAIN (FORMAT TEXT) SELECT * FROM orders WHERE status = ?1 AND created_at > ?2");
q.setParameter(1, "ACTIVE");
q.setParameter(2, cutoffDate);
// A single-column native result comes back as one scalar per row, not Object[]
List<?> plan = q.getResultList();
for (Object line : plan) {
    System.out.println(line);
}

What to check

| Check | What to look for | Problem if wrong |
|---|---|---|
| Scan type | Index Scan or Index Only Scan on filtered columns | Seq Scan on large table = missing index |
| Estimated rows | Should match actual row count (use EXPLAIN ANALYZE in dev) | Stale statistics = wrong query plan. Run ANALYZE table_name. |
| Join type | Nested Loop for small result sets, Hash Join for large | Nested Loop on large joins = O(n*m) |
| Sort | Index Scan providing order, or Sort node | Sort on large result set = disk sort possible |
| Bitmap Heap Scan | Filter efficiency -- Rows Removed by Filter should be low | High Rows Removed = index returns too many false matches |
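The scan-type check lends itself to automation in an integration test. Below is a minimal sketch that scans EXPLAIN (FORMAT TEXT) output for sequential scans; the class name PlanCheck is hypothetical, the sample plan lines in main are illustrative rather than captured from a real database, and the pattern assumes PostgreSQL's text plan format.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PlanCheck {

    // Matches "Seq Scan on <table>" (also inside "Parallel Seq Scan on <table>").
    private static final Pattern SEQ_SCAN = Pattern.compile("Seq Scan on (\\w+)");

    // Returns the tables that the plan reads with a sequential scan.
    static List<String> seqScannedTables(List<String> planLines) {
        List<String> tables = new ArrayList<>();
        for (String line : planLines) {
            Matcher m = SEQ_SCAN.matcher(line);
            if (m.find()) {
                tables.add(m.group(1));
            }
        }
        return tables;
    }

    public static void main(String[] args) {
        // Illustrative plan text only.
        List<String> plan = List.of(
                "Seq Scan on orders  (cost=0.00..431.00 rows=100 width=64)",
                "  Filter: (status = 'ACTIVE'::text)");
        System.out.println(seqScannedTables(plan));
    }
}
```

A sequential scan is not always wrong: on small tables the planner may correctly prefer it, so a check like this is best limited to tables known to be large.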

Common missing indexes

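// If you frequently filter by these patterns, verify indexes exist.
// @Index entries go inside @Table(indexes = { ... }) and only take effect
// when the schema is generated from the entity mappings: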

// 1. Status + timestamp (range query on created_at with status filter)
@Index(columnList = "status, created_at")

// 2. Foreign keys (JPA does NOT auto-create FK indexes -- unlike Django)
@Index(columnList = "customer_id")

// 3. Composite for multi-column WHERE
@Index(columnList = "tenant_id, status, created_at")

// 4. Partial index (PostgreSQL) for common filter
// Create via native DDL or Flyway migration:
// CREATE INDEX idx_orders_active ON orders (created_at) WHERE status = 'ACTIVE'

Pitfalls

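  • N+1 is the default: JPA loads collection associations (@OneToMany, @ManyToMany) lazily by default, so every access to an unloaded collection triggers a query. @ManyToOne and @OneToOne default to EAGER and can cause extra per-row queries of their own when entities are loaded through JPQL. Use default_batch_fetch_size as a global safety net.
  • JOIN FETCH + pagination = in-memory pagination: When a query JOIN FETCHes a collection, each parent spans several result rows, so Hibernate cannot translate firstResult/maxResults into SQL LIMIT/OFFSET. It loads ALL rows and paginates in memory, logging: HHH90003004: firstResult/maxResults specified with collection fetch; applying in memory (HHH000104 on Hibernate 5). Use @BatchSize or DTO projection for paginated queries with eager loading.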
  • Entity loading for read-only queries: Loading full entities for display/API responses wastes memory (identity map tracking, dirty checking) and may trigger lazy-loading cascades. Use DTO projections.
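  • HikariCP pool exhaustion during long transactions: A transaction holds a connection for its entire duration. Long transactions (batch processing, report generation) exhaust the pool. Move long operations to a separate data source with its own pool, and keep transactions short by committing in chunks. Streaming with ScrollableResults limits memory use but still holds the connection while the scroll is open.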
  • Query cache with frequent writes: The query cache is table-level -- ANY update to the table invalidates ALL cached queries for that table. For write-heavy tables, the query cache has near-zero hit rate and adds overhead.
  • Missing JDBC batching: Without hibernate.jdbc.batch_size, each persist() generates a separate INSERT statement. For bulk inserts, this is orders of magnitude slower than batched inserts.
  • Identity generation disables batching: @GeneratedValue(strategy = GenerationType.IDENTITY) forces Hibernate to execute INSERT immediately (to get the generated ID), defeating JDBC batching. Use SEQUENCE strategy instead.