
perf: direct long-to-chars rendering in visitFloat64 #685

Closed

He-Pin wants to merge 1 commit into databricks:master from He-Pin:perf/direct-long-rendering

Conversation

Contributor

@He-Pin He-Pin commented Apr 5, 2026

Motivation

When rendering integer-valued doubles (e.g., 42.0), visitFloat64 currently calls i.toString, which allocates a String object, and then passes the result through visitFloat64StringParts for character-by-character processing. For large arrays of numbers (common in Jsonnet), this creates millions of short-lived String allocations.

Key Design Decision

Add a writeLongDirect method that converts a Long directly into the CharBuilder's backing array without any intermediate String allocation. It uses a digit-pair lookup table to convert two digits at a time, a well-known optimization from the JDK's Integer.getChars.

Modification

sjsonnet/src/sjsonnet/BaseCharRenderer.scala:

  • Added writeLongDirect(v: Long) private method:
    • Special-cases 0 and Long.MinValue (negation overflow)
    • Counts digits, then writes right-to-left using digit-pair tables
    • Directly writes into elemBuilder.arr via ensureLength + position update
  • Added companion object BaseCharRenderer with DIGIT_TENS and DIGIT_ONES lookup tables (100 entries each)
  • Changed visitFloat64: integer path now calls writeLongDirect(i) instead of visitFloat64StringParts(i.toString, ...)

Test: Added new_test_suite/large_integer_rendering.jsonnet — verifies correct rendering of boundary values (0, negatives, large longs, Long.MinValue).
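
The digit-pair conversion described above can be sketched roughly as follows. The DIGIT_TENS/DIGIT_ONES tables match the PR description; the surrounding names (writeDigits, render) and the bare char[] buffer are illustrative stand-ins for the actual CharBuilder-based writeLongDirect, and this sketch only handles non-negative values (the real method special-cases sign and Long.MinValue):

```scala
object DigitPairDemo {
  // 100-entry tables: DIGIT_TENS(i) is the tens digit of i, DIGIT_ONES(i) the ones digit.
  private val DIGIT_TENS: Array[Char] =
    Array.tabulate(100)(i => ('0' + i / 10).toChar)
  private val DIGIT_ONES: Array[Char] =
    Array.tabulate(100)(i => ('0' + i % 10).toChar)

  /** Writes the digits of `v` into `buf` ending at index `pos` (exclusive),
    * right-to-left. Returns the index of the first digit written. Assumes v >= 0. */
  def writeDigits(v: Long, buf: Array[Char], pos: Int): Int = {
    var q = v
    var p = pos
    // Peel off two digits per iteration: one div/mod by 100 instead of two by 10.
    while (q >= 100) {
      val r = (q % 100).toInt
      q /= 100
      p -= 1; buf(p) = DIGIT_ONES(r)
      p -= 1; buf(p) = DIGIT_TENS(r)
    }
    // At most two digits remain.
    val r = q.toInt
    p -= 1; buf(p) = DIGIT_ONES(r)
    if (r >= 10) { p -= 1; buf(p) = DIGIT_TENS(r) }
    p
  }

  /** Convenience wrapper for demonstration: renders v into a fresh buffer. */
  def render(v: Long): String = {
    val buf = new Array[Char](20) // Long.MaxValue has 19 digits
    val start = writeDigits(v, buf, buf.length)
    new String(buf, start, buf.length - start)
  }
}
```

In the real renderer the digits would land directly in elemBuilder.arr after an ensureLength call, rather than in a temporary buffer.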

Benchmark Results

JMH — Full Suite (35 benchmarks, 1+1 warmup)

Pending benchmark data.

Expected Impact

Primarily benefits benchmarks with heavy numeric rendering:

  • realistic2 (large JSON with many numbers)
  • base64DecodeBytes (byte arrays rendered as numbers)
  • Any workload generating numeric arrays

Analysis

  • Correctness: Handles all edge cases — zero, negative, Long.MinValue overflow, single/multi-digit numbers.
  • Digit-pair table: Converts two digits per iteration (one div/mod by 100 instead of two by 10), halving the number of divide operations.
  • Zero allocations: No String or StringBuilder — writes directly into the backing char[].
  • Regression test: large_integer_rendering.jsonnet covers boundary values including Long.MinValue.

References

  • JDK's Integer.getChars uses the same digit-pair technique
  • BaseCharRenderer.DIGIT_TENS / DIGIT_ONES — static lookup tables

Result

Zero-allocation integer rendering with digit-pair optimization. Eliminates Long.toString allocation overhead. Draft PR pending benchmark data.

@He-Pin He-Pin force-pushed the perf/direct-long-rendering branch 4 times, most recently from d5ad614 to f6778f6 on April 10, 2026
Replaces i.toString with writeLongDirect(i) using digit-pair lookup
tables to write integer digits directly into CharBuilder's backing
array, eliminating intermediate String allocation for integer doubles.

Uses a right-to-left two-digits-at-a-time algorithm with DIGIT_TENS/
DIGIT_ONES lookup tables (same approach as java.lang.Long.toString).
Handles edge cases: 0, Long.MinValue overflow, negative values.

Adds regression test for large integer rendering (values > Int.MaxValue)
to verify correctness for numbers up to 2^53.

Upstream: jit branch commit d60ba61
Contributor Author

He-Pin commented Apr 10, 2026

Superseded by #730 which combines this optimization with the other renderer throughput improvements (indent cache + bulk copy + direct long rendering) into a single coherent PR with comprehensive benchmarks.

@He-Pin He-Pin closed this Apr 10, 2026
stephenamar-db pushed a commit that referenced this pull request Apr 10, 2026
…rect long rendering) (#730)

## Motivation

The materialization/rendering pipeline is the primary bottleneck for
large-output workloads. For `realistic2` (28.6 MB output, 568K lines,
125K objects, 380K strings), `--debug-stats` shows 99.8% of wall time is
spent in materialization. The previous implementation used per-character
loops for indent rendering and intermediate `String` allocation for
number formatting, leaving significant throughput on the table.

## Key Design Decisions

1. **Indent cache scope**: Lives in `BaseCharRenderer` (not `Renderer`)
so all renderer subclasses (`Renderer`, `MaterializeJsonRenderer`,
`PythonRenderer`) benefit automatically.
2. **MaxCachedDepth = 32**: Covers virtually all real-world Jsonnet
(realistic2 max depth ~5). Beyond this, falls back to the original
per-character loop.
3. **Negative accumulator** in `appendLong`: Handles `Long.MinValue`
correctly without overflow (negating `Long.MinValue` overflows `Long`).
4. **Zero-allocation number rendering**: For integer-valued doubles (the
common case in Jsonnet), digits are written directly into `CharBuilder`
instead of going through `Long.toString` → `String` → char-by-char copy.

## Modifications

### `BaseCharRenderer.scala`
- Added companion object with `MaxCachedDepth = 32`
- Added `indentCache` field: pre-computed `Array[Array[Char]]` with
`newline + indent*d spaces` for each depth level, constructed once at
renderer creation
- Updated `renderIndent()` to use cached arrays via `appendAll` (single
`System.arraycopy`) for depths < 32
- Updated `appendString()` to use `String.getChars` bulk copy instead of
char-by-char loop
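
The indent cache and bulk-copy path can be sketched as below. MaxCachedDepth and the cache layout follow the PR text; the standalone object and the StringBuilder stand-in for sjsonnet's CharBuilder are illustrative assumptions:

```scala
object IndentCacheDemo {
  final val MaxCachedDepth = 32

  /** Builds one cached char array per depth: '\n' followed by indent*depth spaces.
    * Constructed once at renderer creation. */
  def buildIndentCache(indent: Int): Array[Array[Char]] =
    Array.tabulate(MaxCachedDepth) { depth =>
      val arr = new Array[Char](1 + indent * depth)
      arr(0) = '\n'
      java.util.Arrays.fill(arr, 1, arr.length, ' ')
      arr
    }

  /** Renders the indent for `depth`: a single bulk append for cached depths,
    * falling back to the original per-character loop beyond MaxCachedDepth. */
  def renderIndent(sb: StringBuilder, cache: Array[Array[Char]],
                   indent: Int, depth: Int): Unit =
    if (depth < MaxCachedDepth) {
      sb.appendAll(cache(depth)) // one bulk array copy
    } else {
      sb.append('\n')
      var i = 0
      while (i < indent * depth) { sb.append(' '); i += 1 }
    }
}
```

The same bulk-copy idea applies to appendString: String.getChars copies the whole string into the target array in one call instead of appending character by character.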

### `Renderer.scala`
- Updated `visitFloat64()` to render integers directly via
`RenderUtils.appendLong()`
- Updated `flushBuffer()` to use `indentCache` for bulk indent rendering
- Added `RenderUtils.appendLong()`: renders `Long` directly into
`CharBuilder` using negative accumulator + reverse-in-place algorithm
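
A minimal sketch of the negative-accumulator + reverse-in-place approach described above, using StringBuilder as a hypothetical stand-in for sjsonnet's CharBuilder (the real RenderUtils.appendLong signature may differ):

```scala
object AppendLongDemo {
  def appendLong(sb: StringBuilder, value: Long): Unit = {
    // Accumulate in negative space: -Long.MaxValue is representable,
    // but -Long.MinValue overflows, so negating positives is always safe.
    var v = if (value > 0) -value else value
    if (value < 0) sb.append('-')
    val start = sb.length
    if (v == 0) sb.append('0')
    // Emit digits least-significant first (v % 10 is in -9..0 here)...
    while (v != 0) {
      sb.append(('0' - (v % 10)).toChar)
      v /= 10
    }
    // ...then reverse the digit span in place.
    var i = start
    var j = sb.length - 1
    while (i < j) {
      val t = sb.charAt(i)
      sb.setCharAt(i, sb.charAt(j))
      sb.setCharAt(j, t)
      i += 1; j -= 1
    }
  }
}
```

Because the accumulator stays negative, Long.MinValue falls out of the general path with no special case.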

### `RendererTests.scala`
- Added `appendLong` edge case tests: 0, positive, negative, large,
`Long.MaxValue`, `Long.MinValue`
- Added `visitFloat64Integers` tests for end-to-end integer rendering
- Added `indentZero` test for `indent=0` edge case

## Benchmark Results

### JMH (JVM, isolated runs, lower is better)

| Benchmark | Before (ms/op) | After (ms/op) | Change |
|-----------|----------------|---------------|--------|
| **realistic2** | 68.749 | 58.001 | **-15.6%** ✅ |
| **reverse** | 10.494 | 8.436 | **-19.6%** ✅ |
| gen_big_object | 1.066 | 1.000 | -6.2% ✅ |
| bench.02 | 39.832 | 39.322 | -1.3% ≈ |
| comparison | 20.216 | 21.060 | +4.2% (noise; eval-only, output is `true`) |
| realistic1 | 2.015 | 2.133 | within noise |

No regressions across the full 35-benchmark JMH suite.

### Hyperfine (Scala Native, `--warmup 3 --min-runs 10`)

**realistic2** (28.6 MB output):
| Implementation | Time (ms) | vs jrsonnet |
|---|---|---|
| sjsonnet-native (master) | 264.9 ± 4.2 | 2.48x slower |
| sjsonnet-native (this PR) | 262.2 ± 2.9 | 2.45x slower |
| jrsonnet 0.5.0-pre98 | 106.8 ± 16.3 | baseline |

**reverse** (large array output):
| Implementation | Time (ms) | vs jrsonnet |
|---|---|---|
| sjsonnet-native (master) | 53.1 ± 2.8 | 2.22x slower |
| sjsonnet-native (this PR) | 38.0 ± 2.3 | **1.59x slower** |
| jrsonnet 0.5.0-pre98 | 24.0 ± 1.7 | baseline |

Gap closed from 2.22x → 1.59x (**-28.4%** improvement).

**gen_big_object**:
| Implementation | Time (ms) | vs jrsonnet |
|---|---|---|
| sjsonnet-native (master) | 12.1 ± 1.5 | 1.16x slower |
| sjsonnet-native (this PR) | 10.4 ± 1.1 | **1.01x — tied!** |
| jrsonnet 0.5.0-pre98 | 10.5 ± 1.3 | baseline |

**realistic1**:
| Implementation | Time (ms) | vs jrsonnet |
|---|---|---|
| sjsonnet-native (master) | 12.9 ± 1.4 | — |
| sjsonnet-native (this PR) | 12.0 ± 1.4 | **1.15x faster** |
| jrsonnet 0.5.0-pre98 | 13.9 ± 2.1 | baseline |

sjsonnet already **beats** jrsonnet on realistic1 (1.15x faster).

## Analysis

The JVM improvement is larger (15.6% on realistic2) because the JIT
compiler was still leaving performance on the table with the
char-by-char loops. On Scala Native, LLVM already partially optimizes
these loops, so the native improvement is smaller for realistic2 but
significant for reverse (28.4%), where the output contains many
integer-valued doubles that benefit from the zero-allocation
`appendLong` path.

The `gen_big_object` benchmark is now **tied with jrsonnet** (10.4ms vs
10.5ms), and `realistic1` beats jrsonnet by 1.15x.

## Result

- ✅ All 141 test suites pass (JVM 3.3.7)
- ✅ Compiles on all platforms (JVM, JS, Native)
- ✅ No regressions across the full benchmark suite
- ✅ Comprehensive new test coverage for edge cases

This PR supersedes #676 (renderer-indent-cache), #681
(renderer-bulk-append), and #685 (direct-long-rendering) which
implemented subsets of these optimizations individually.
