
perf: direct long-to-chars rendering in visitFloat64 #685

Closed

He-Pin wants to merge 1 commit into databricks:master from He-Pin:perf/direct-long-rendering

Conversation

Contributor

@He-Pin He-Pin commented Apr 5, 2026

Motivation

When rendering integer-valued doubles (e.g., 42.0), visitFloat64 currently calls i.toString, which allocates a String object, and then passes the result through visitFloat64StringParts for character-by-character processing. For large arrays of numbers (common in Jsonnet), this creates millions of short-lived String allocations.

Key Design Decision

Add a writeLongDirect method that converts a Long directly into the CharBuilder's backing array without any intermediate String allocation. It uses a digit-pair lookup table to convert two digits at a time, a well-known optimization from the JDK's Integer.getChars.

Modification

sjsonnet/src/sjsonnet/BaseCharRenderer.scala:

  • Added writeLongDirect(v: Long) private method:
    • Special-cases 0 and Long.MinValue (negation overflow)
    • Counts digits, then writes right-to-left using digit-pair tables
    • Directly writes into elemBuilder.arr via ensureLength + position update
  • Added companion object BaseCharRenderer with DIGIT_TENS and DIGIT_ONES lookup tables (100 entries each)
  • Changed visitFloat64: integer path now calls writeLongDirect(i) instead of visitFloat64StringParts(i.toString, ...)

Test: Added new_test_suite/large_integer_rendering.jsonnet — verifies correct rendering of boundary values (0, negatives, large longs, Long.MinValue).
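
The digit-pair conversion described above can be sketched roughly as follows. The DIGIT_TENS/DIGIT_ONES tables match the PR description; the surrounding names (writeDigits, render) and the bare char[] buffer are illustrative stand-ins for the actual CharBuilder-based writeLongDirect, and this sketch only handles non-negative values (the real method special-cases sign and Long.MinValue):

```scala
object DigitPairDemo {
  // 100-entry tables: DIGIT_TENS(i) is the tens digit of i, DIGIT_ONES(i) the ones digit.
  private val DIGIT_TENS: Array[Char] =
    Array.tabulate(100)(i => ('0' + i / 10).toChar)
  private val DIGIT_ONES: Array[Char] =
    Array.tabulate(100)(i => ('0' + i % 10).toChar)

  /** Writes the digits of `v` into `buf` ending at index `pos` (exclusive),
    * right-to-left. Returns the index of the first digit written. Assumes v >= 0. */
  def writeDigits(v: Long, buf: Array[Char], pos: Int): Int = {
    var q = v
    var p = pos
    // Peel off two digits per iteration: one div/mod by 100 instead of two by 10.
    while (q >= 100) {
      val r = (q % 100).toInt
      q /= 100
      p -= 1; buf(p) = DIGIT_ONES(r)
      p -= 1; buf(p) = DIGIT_TENS(r)
    }
    // At most two digits remain.
    val r = q.toInt
    p -= 1; buf(p) = DIGIT_ONES(r)
    if (r >= 10) { p -= 1; buf(p) = DIGIT_TENS(r) }
    p
  }

  /** Convenience wrapper for demonstration: renders v into a fresh buffer. */
  def render(v: Long): String = {
    val buf = new Array[Char](20) // Long.MaxValue has 19 digits
    val start = writeDigits(v, buf, buf.length)
    new String(buf, start, buf.length - start)
  }
}
```

In the real renderer the digits would land directly in elemBuilder.arr after an ensureLength call, rather than in a temporary buffer.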

Benchmark Results

JMH — Full Suite (35 benchmarks, 1+1 warmup)

Pending benchmark data.

Expected Impact

Primarily benefits benchmarks with heavy numeric rendering:

  • realistic2 (large JSON with many numbers)
  • base64DecodeBytes (byte arrays rendered as numbers)
  • Any workload generating numeric arrays

Analysis

  • Correctness: Handles all edge cases — zero, negative, Long.MinValue overflow, single/multi-digit numbers.
  • Digit-pair table: Converts two digits per iteration (one div/mod by 100 instead of two by 10), halving the number of divide operations.
  • Zero allocations: No String or StringBuilder — writes directly into the backing char[].
  • Regression test: large_integer_rendering.jsonnet covers boundary values including Long.MinValue.

References

  • JDK's Integer.getChars uses the same digit-pair technique
  • BaseCharRenderer.DIGIT_TENS / DIGIT_ONES — static lookup tables

Result

Zero-allocation integer rendering with digit-pair optimization. Eliminates Long.toString allocation overhead. Draft PR pending benchmark data.

@He-Pin He-Pin force-pushed the perf/direct-long-rendering branch 4 times, most recently from d5ad614 to f6778f6 on April 10, 2026
Replaces i.toString with writeLongDirect(i) using digit-pair lookup
tables to write integer digits directly into CharBuilder's backing
array, eliminating intermediate String allocation for integer doubles.

Uses a right-to-left two-digits-at-a-time algorithm with DIGIT_TENS/
DIGIT_ONES lookup tables (same approach as java.lang.Long.toString).
Handles edge cases: 0, Long.MinValue overflow, negative values.

Adds regression test for large integer rendering (values > Int.MaxValue)
to verify correctness for numbers up to 2^53.

Upstream: jit branch commit d60ba61
Contributor Author

He-Pin commented Apr 10, 2026

Superseded by #730 which combines this optimization with the other renderer throughput improvements (indent cache + bulk copy + direct long rendering) into a single coherent PR with comprehensive benchmarks.

@He-Pin He-Pin closed this Apr 10, 2026
stephenamar-db pushed a commit that referenced this pull request Apr 10, 2026
…rect long rendering) (#730)

## Motivation

The materialization/rendering pipeline is the primary bottleneck for
large-output workloads. For `realistic2` (28.6 MB output, 568K lines,
125K objects, 380K strings), `--debug-stats` shows 99.8% of wall time is
spent in materialization. The previous implementation used per-character
loops for indent rendering and intermediate `String` allocation for
number formatting, leaving significant throughput on the table.

## Key Design Decisions

1. **Indent cache scope**: Lives in `BaseCharRenderer` (not `Renderer`)
so all renderer subclasses (`Renderer`, `MaterializeJsonRenderer`,
`PythonRenderer`) benefit automatically.
2. **MaxCachedDepth = 32**: Covers virtually all real-world Jsonnet
(realistic2 max depth ~5). Beyond this, falls back to the original
per-character loop.
3. **Negative accumulator** in `appendLong`: Handles `Long.MinValue`
correctly without overflow (negating `Long.MinValue` overflows `Long`).
4. **Zero-allocation number rendering**: For integer-valued doubles (the
common case in Jsonnet), digits are written directly into `CharBuilder`
instead of going through `Long.toString` → `String` → char-by-char copy.

## Modifications

### `BaseCharRenderer.scala`
- Added companion object with `MaxCachedDepth = 32`
- Added `indentCache` field: pre-computed `Array[Array[Char]]` with
`newline + indent*d spaces` for each depth level, constructed once at
renderer creation
- Updated `renderIndent()` to use cached arrays via `appendAll` (single
`System.arraycopy`) for depths < 32
- Updated `appendString()` to use `String.getChars` bulk copy instead of
char-by-char loop
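
The indent cache and bulk-copy path can be sketched as below. MaxCachedDepth and the cache layout follow the PR text; the standalone object and the StringBuilder stand-in for sjsonnet's CharBuilder are illustrative assumptions:

```scala
object IndentCacheDemo {
  final val MaxCachedDepth = 32

  /** Builds one cached char array per depth: '\n' followed by indent*depth spaces.
    * Constructed once at renderer creation. */
  def buildIndentCache(indent: Int): Array[Array[Char]] =
    Array.tabulate(MaxCachedDepth) { depth =>
      val arr = new Array[Char](1 + indent * depth)
      arr(0) = '\n'
      java.util.Arrays.fill(arr, 1, arr.length, ' ')
      arr
    }

  /** Renders the indent for `depth`: a single bulk append for cached depths,
    * falling back to the original per-character loop beyond MaxCachedDepth. */
  def renderIndent(sb: StringBuilder, cache: Array[Array[Char]],
                   indent: Int, depth: Int): Unit =
    if (depth < MaxCachedDepth) {
      sb.appendAll(cache(depth)) // one bulk array copy
    } else {
      sb.append('\n')
      var i = 0
      while (i < indent * depth) { sb.append(' '); i += 1 }
    }
}
```

The same bulk-copy idea applies to appendString: String.getChars copies the whole string into the target array in one call instead of appending character by character.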

### `Renderer.scala`
- Updated `visitFloat64()` to render integers directly via
`RenderUtils.appendLong()`
- Updated `flushBuffer()` to use `indentCache` for bulk indent rendering
- Added `RenderUtils.appendLong()`: renders `Long` directly into
`CharBuilder` using negative accumulator + reverse-in-place algorithm
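
A minimal sketch of the negative-accumulator + reverse-in-place approach described above, using StringBuilder as a hypothetical stand-in for sjsonnet's CharBuilder (the real RenderUtils.appendLong signature may differ):

```scala
object AppendLongDemo {
  def appendLong(sb: StringBuilder, value: Long): Unit = {
    // Accumulate in negative space: -Long.MaxValue is representable,
    // but -Long.MinValue overflows, so negating positives is always safe.
    var v = if (value > 0) -value else value
    if (value < 0) sb.append('-')
    val start = sb.length
    if (v == 0) sb.append('0')
    // Emit digits least-significant first (v % 10 is in -9..0 here)...
    while (v != 0) {
      sb.append(('0' - (v % 10)).toChar)
      v /= 10
    }
    // ...then reverse the digit span in place.
    var i = start
    var j = sb.length - 1
    while (i < j) {
      val t = sb.charAt(i)
      sb.setCharAt(i, sb.charAt(j))
      sb.setCharAt(j, t)
      i += 1; j -= 1
    }
  }
}
```

Because the accumulator stays negative, Long.MinValue falls out of the general path with no special case.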

### `RendererTests.scala`
- Added `appendLong` edge case tests: 0, positive, negative, large,
`Long.MaxValue`, `Long.MinValue`
- Added `visitFloat64Integers` tests for end-to-end integer rendering
- Added `indentZero` test for `indent=0` edge case

## Benchmark Results

### JMH (JVM, isolated runs, lower is better)

| Benchmark | Before (ms/op) | After (ms/op) | Change |
|-----------|----------------|---------------|--------|
| **realistic2** | 68.749 | 58.001 | **-15.6%** ✅ |
| **reverse** | 10.494 | 8.436 | **-19.6%** ✅ |
| gen_big_object | 1.066 | 1.000 | -6.2% ✅ |
| bench.02 | 39.832 | 39.322 | -1.3% ≈ |
| comparison | 20.216 | 21.060 | +4.2% (noise; eval-only, output is `true`) |
| realistic1 | 2.015 | 2.133 | within noise |

No regressions across the full 35-benchmark JMH suite.

### Hyperfine (Scala Native, `--warmup 3 --min-runs 10`)

**realistic2** (28.6 MB output):
| Implementation | Time (ms) | vs jrsonnet |
|---|---|---|
| sjsonnet-native (master) | 264.9 ± 4.2 | 2.48x slower |
| sjsonnet-native (this PR) | 262.2 ± 2.9 | 2.45x slower |
| jrsonnet 0.5.0-pre98 | 106.8 ± 16.3 | baseline |

**reverse** (large array output):
| Implementation | Time (ms) | vs jrsonnet |
|---|---|---|
| sjsonnet-native (master) | 53.1 ± 2.8 | 2.22x slower |
| sjsonnet-native (this PR) | 38.0 ± 2.3 | **1.59x slower** |
| jrsonnet 0.5.0-pre98 | 24.0 ± 1.7 | baseline |

Gap closed from 2.22x → 1.59x (**-28.4%** improvement).

**gen_big_object**:
| Implementation | Time (ms) | vs jrsonnet |
|---|---|---|
| sjsonnet-native (master) | 12.1 ± 1.5 | 1.16x slower |
| sjsonnet-native (this PR) | 10.4 ± 1.1 | **1.01x — tied!** |
| jrsonnet 0.5.0-pre98 | 10.5 ± 1.3 | baseline |

**realistic1**:
| Implementation | Time (ms) | vs jrsonnet |
|---|---|---|
| sjsonnet-native (master) | 12.9 ± 1.4 | — |
| sjsonnet-native (this PR) | 12.0 ± 1.4 | **1.15x faster** |
| jrsonnet 0.5.0-pre98 | 13.9 ± 2.1 | baseline |

sjsonnet already **beats** jrsonnet on realistic1 (1.15x faster).

## Analysis

The JVM improvement is larger (15.6% on realistic2) because the JIT
compiler was still leaving performance on the table with the
char-by-char loops. On Scala Native, LLVM already partially optimizes
these loops, so the native improvement is smaller for realistic2 but
significant for reverse (28.4%), where the output contains many
integer-valued doubles that benefit from the zero-allocation
`appendLong` path.

The `gen_big_object` benchmark is now **tied with jrsonnet** (10.4ms vs
10.5ms), and `realistic1` beats jrsonnet by 1.15x.

## Result

- ✅ All 141 test suites pass (JVM 3.3.7)
- ✅ Compiles on all platforms (JVM, JS, Native)
- ✅ No regressions across the full benchmark suite
- ✅ Comprehensive new test coverage for edge cases

This PR supersedes #676 (renderer-indent-cache), #681
(renderer-bulk-append), and #685 (direct-long-rendering) which
implemented subsets of these optimizations individually.
