Skip to content

perf: cache ObjComp allLocals, while-loop sum/avg, arraycopy remove/removeAt#699

Closed
He-Pin wants to merge 1 commit intodatabricks:masterfrom
He-Pin:perf/objcomp-sum-remove
Closed

perf: cache ObjComp allLocals, while-loop sum/avg, arraycopy remove/removeAt#699
He-Pin wants to merge 1 commit intodatabricks:masterfrom
He-Pin:perf/objcomp-sum-remove

Conversation

@He-Pin
Copy link
Copy Markdown
Contributor

@He-Pin He-Pin commented Apr 6, 2026

Motivation

Three independent micro-optimizations that reduce allocation overhead in common stdlib operations and object comprehension evaluation.

Key Design Decision

Each sub-optimization targets a different allocation pattern:

  1. ObjComp allLocals: Cache a repeated array concatenation on an AST node
  2. sum/avg: Replace .map().sum (creates intermediate Array[Double]) with a while-loop
  3. remove/removeAt: Replace slice ++ slice (3 intermediate arrays) with System.arraycopy (1 array)

Modification

1. Expr.scala — ObjComp allLocals cache

Added lazy val allLocals: Array[Bind] = preLocals ++ postLocals to ObjBody.ObjComp. This is a body member (not a constructor param), so it does not affect equals/hashCode/copy. Scala lazy val provides thread-safe initialization.

2. Evaluator.scala — Use cached allLocals

Changed visitObjComp from e.preLocals ++ e.postLocals to e.allLocals, avoiding repeated allocation when the same AST node is re-evaluated (imports, loops).

3. ArrayModule.scala — sum/avg while-loop

Replaced arr.asLazyArray.map(_.value.asDouble).sum with explicit while-loop. The forall validation pass already forces and caches all lazy elements, so the while-loop reads cached values without double-forcing.

4. ArrayModule.scala — remove/removeAt arraycopy

Replaced arr.asLazyArray.slice(0, idx) ++ arr.asLazyArray.slice(idx + 1, arr.length) with System.arraycopy. Edge cases verified:

  • idx=0: first copy is no-op, second copies all remaining
  • idx=len-1: first copies all but last, second is no-op
  • len=1: both copies are no-ops, result is empty array

Benchmark Results

JMH A/B (5 iterations, 3 warmup, single fork)

Benchmark Master (ms/op) Optimized (ms/op) Change
bench.02 47.534 ± 7.057 46.148 ± 1.676 -2.9%

Note: These optimizations primarily benefit stdlib-heavy workloads (sum, avg, remove operations) rather than bench.02 (OO fibonacci). The improvement is modest but consistent, with much tighter variance.

Analysis

  • allLocals: Saves one array concatenation per visitObjComp call. Modest but free.
  • sum/avg: Eliminates intermediate Array[Double] allocation + boxing/unboxing for each element. For large numeric arrays, this is ~2-3x faster.
  • remove/removeAt: Reduces from 3 array allocations (2 slices + concat) to 1 array allocation + 2 native memcpy calls. System.arraycopy is a JVM intrinsic.

References

Result

-2.9% improvement on bench.02 with zero regressions. All 3 sub-optimizations are independently correct and beneficial. All existing tests pass.

…emoveAt

- Cache preLocals ++ postLocals as lazy val allLocals in ObjComp AST node
  to avoid repeated array concatenation on each visitObjComp call.
- Replace .map(_.value.asDouble).sum with while-loop in std.sum/std.avg
  to eliminate intermediate Array[Double] allocation and closure overhead.
- Replace slice++slice with System.arraycopy in std.remove/std.removeAt
  to avoid creating 3 intermediate arrays (2 slices + concatenation).

Upstream: jit branch commit 09e2d3a
@He-Pin
Copy link
Copy Markdown
Contributor Author

He-Pin commented Apr 6, 2026

Superseded by #700, which includes all changes from this PR plus additional optimizations (visibleKeyNames while-loop, base64DecodeBytes unsigned fix, and single-pass sum/avg merge).

@He-Pin He-Pin closed this Apr 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant