perf: direct long-to-chars rendering in visitFloat64 #685
Closed
He-Pin wants to merge 1 commit into databricks:master
d5ad614 to f6778f6
Replaces i.toString with writeLongDirect(i) using digit-pair lookup tables to write integer digits directly into CharBuilder's backing array, eliminating intermediate String allocation for integer doubles.

Uses a right-to-left two-digits-at-a-time algorithm with DIGIT_TENS/DIGIT_ONES lookup tables (same approach as java.lang.Long.toString). Handles edge cases: 0, Long.MinValue overflow, negative values.

Adds regression test for large integer rendering (values > Int.MaxValue) to verify correctness for numbers up to 2^53.

Upstream: jit branch commit d60ba61
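The right-to-left, two-digits-at-a-time idea in the commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's actual `writeLongDirect`: the object name, the method signature, the plain `Array[Char]` target, and the way `Long.MinValue` is special-cased are all assumptions made for the example.

```scala
object DigitPairSketch {
  // For n in 0..99: the tens-digit and ones-digit characters of n.
  val DigitTens: Array[Char] = Array.tabulate(100)(n => ('0' + n / 10).toChar)
  val DigitOnes: Array[Char] = Array.tabulate(100)(n => ('0' + n % 10).toChar)

  /** Writes the decimal form of `v` into `buf` starting at `pos` and returns
    * the position just past the last char. The caller must leave room for 20 chars.
    */
  def writeLong(v: Long, buf: Array[Char], pos: Int): Int =
    if (v == 0L) {
      buf(pos) = '0'
      pos + 1
    } else if (v == Long.MinValue) {
      // -Long.MinValue overflows, so this sketch copies the literal instead (assumption).
      val s = "-9223372036854775808"
      s.getChars(0, s.length, buf, pos)
      pos + s.length
    } else {
      var start = pos
      var n = v
      if (n < 0) { buf(start) = '-'; start += 1; n = -n }
      // Count digits first so the rightmost digit's slot is known.
      var digits = 0
      var t = n
      while (t > 0) { digits += 1; t /= 10 }
      val end = start + digits
      var i = end
      // Peel off two digits per division, filling the buffer right to left.
      while (n >= 100) {
        val q = n / 100
        val pair = (n - q * 100).toInt
        n = q
        i -= 1; buf(i) = DigitOnes(pair)
        i -= 1; buf(i) = DigitTens(pair)
      }
      // n is now in 1..99: one or two leading digits remain.
      val lead = n.toInt
      i -= 1; buf(i) = DigitOnes(lead)
      if (lead >= 10) { i -= 1; buf(i) = DigitTens(lead) }
      end
    }

  /** Convenience wrapper for experiments; allocates, unlike the in-place path. */
  def render(v: Long): String = {
    val buf = new Array[Char](20)
    new String(buf, 0, writeLong(v, buf, 0))
  }
}
```

Each loop iteration performs one division by 100 and two table lookups, so the number of divisions is roughly half that of a digit-at-a-time loop.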
f6778f6 to f948303
Superseded by #730 which combines this optimization with the other renderer throughput improvements (indent cache + bulk copy + direct long rendering) into a single coherent PR with comprehensive benchmarks.
stephenamar-db pushed a commit that referenced this pull request Apr 10, 2026
…rect long rendering) (#730)

## Motivation

The materialization/rendering pipeline is the primary bottleneck for large-output workloads. For `realistic2` (28.6 MB output, 568K lines, 125K objects, 380K strings), `--debug-stats` shows 99.8% of wall time is spent in materialization. The previous implementation used per-character loops for indent rendering and intermediate `String` allocation for number formatting, leaving significant throughput on the table.

## Key Design Decisions

1. **Indent cache scope**: Lives in `BaseCharRenderer` (not `Renderer`) so all renderer subclasses (`Renderer`, `MaterializeJsonRenderer`, `PythonRenderer`) benefit automatically.
2. **MaxCachedDepth = 32**: Covers virtually all real-world Jsonnet (realistic2 max depth ~5). Beyond this, falls back to the original per-character loop.
3. **Negative accumulator** in `appendLong`: Handles `Long.MinValue` correctly without overflow (negating `Long.MinValue` overflows `Long`).
4. **Zero-allocation number rendering**: For integer-valued doubles (the common case in Jsonnet), digits are written directly into `CharBuilder` instead of going through `Long.toString` → `String` → char-by-char copy.
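The negative-accumulator decision can be illustrated with a small sketch. It assumes a plain `Array[Char]` in place of `CharBuilder`, and the object and method names are hypothetical; it is not the actual `RenderUtils.appendLong`. The magnitude is kept as a non-positive number, since every `Long` has a representable non-positive counterpart, digits are emitted least-significant first, and the digit run is then reversed in place.

```scala
object NegativeAccumulatorSketch {
  /** Writes the decimal form of `v` into `buf` at `pos`; returns the new end position.
    * The caller must leave room for 20 chars.
    */
  def writeLong(v: Long, buf: Array[Char], pos: Int): Int = {
    var i = pos
    if (v < 0) { buf(i) = '-'; i += 1 }
    // Work with a non-positive value so Long.MinValue never has to be negated.
    var n = if (v > 0) -v else v
    val first = i
    var more = true
    while (more) {
      val q = n / 10                       // JVM division truncates toward zero
      buf(i) = ('0' - (n - q * 10)).toChar // n - q * 10 is in -9..0
      i += 1
      n = q
      more = n != 0
    }
    // Digits were written least-significant first: reverse them in place.
    var lo = first
    var hi = i - 1
    while (lo < hi) {
      val tmp = buf(lo); buf(lo) = buf(hi); buf(hi) = tmp
      lo += 1; hi -= 1
    }
    i
  }

  /** Convenience wrapper for experiments; allocates, unlike the in-place path. */
  def render(v: Long): String = {
    val buf = new Array[Char](20)
    new String(buf, 0, writeLong(v, buf, 0))
  }
}
```

Because the digits are reversed afterwards, no digit count is needed up front, and zero needs no special case since the loop body always runs once.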
## Modifications

### `BaseCharRenderer.scala`

- Added companion object with `MaxCachedDepth = 32`
- Added `indentCache` field: pre-computed `Array[Array[Char]]` with `newline + indent*d spaces` for each depth level, constructed once at renderer creation
- Updated `renderIndent()` to use cached arrays via `appendAll` (single `System.arraycopy`) for depths < 32
- Updated `appendString()` to use `String.getChars` bulk copy instead of char-by-char loop

### `Renderer.scala`

- Updated `visitFloat64()` to render integers directly via `RenderUtils.appendLong()`
- Updated `flushBuffer()` to use `indentCache` for bulk indent rendering
- Added `RenderUtils.appendLong()`: renders `Long` directly into `CharBuilder` using negative accumulator + reverse-in-place algorithm

### `RendererTests.scala`

- Added `appendLong` edge case tests: 0, positive, negative, large, `Long.MaxValue`, `Long.MinValue`
- Added `visitFloat64Integers` tests for end-to-end integer rendering
- Added `indentZero` test for `indent=0` edge case

## Benchmark Results

### JMH (JVM, isolated runs, lower is better)

| Benchmark | Before (ms/op) | After (ms/op) | Change |
|-----------|----------------|---------------|--------|
| **realistic2** | 68.749 | 58.001 | **-15.6%** ✅ |
| **reverse** | 10.494 | 8.436 | **-19.6%** ✅ |
| gen_big_object | 1.066 | 1.000 | -6.2% ✅ |
| bench.02 | 39.832 | 39.322 | -1.3% ≈ |
| comparison | 20.216 | 21.060 | +4.2% (noise: eval-only, output is `true`) |
| realistic1 | 2.015 | 2.133 | within noise |

No regressions across the full 35-benchmark JMH suite.
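For readers checking the Change column, the percentages follow from the usual relative-change formula. The helper below is a hypothetical illustration (its name and placement are not part of the PR):

```scala
object BenchDelta {
  /** Relative change from `before` to `after` in percent; negative means faster. */
  def percentChange(before: Double, after: Double): Double =
    (after - before) / before * 100.0

  def main(args: Array[String]): Unit = {
    // Rows taken from the JMH table above.
    val rows = Seq(
      ("realistic2", 68.749, 58.001),
      ("reverse", 10.494, 8.436),
      ("gen_big_object", 1.066, 1.000)
    )
    rows.foreach { case (name, before, after) =>
      println(f"$name%-16s ${percentChange(before, after)}%+.1f%%")
    }
  }
}
```

For example, `realistic2` gives (58.001 - 68.749) / 68.749, which is about -15.6%.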
### Hyperfine (Scala Native, `--warmup 3 --min-runs 10`)

**realistic2** (28.6 MB output):

| Implementation | Time (ms) | vs jrsonnet |
|---|---|---|
| sjsonnet-native (master) | 264.9 ± 4.2 | 2.48x slower |
| sjsonnet-native (this PR) | 262.2 ± 2.9 | 2.45x slower |
| jrsonnet 0.5.0-pre98 | 106.8 ± 16.3 | baseline |

**reverse** (large array output):

| Implementation | Time (ms) | vs jrsonnet |
|---|---|---|
| sjsonnet-native (master) | 53.1 ± 2.8 | 2.22x slower |
| sjsonnet-native (this PR) | 38.0 ± 2.3 | **1.59x slower** |
| jrsonnet 0.5.0-pre98 | 24.0 ± 1.7 | baseline |

Gap closed from 2.22x → 1.59x (**-28.4%** improvement).

**gen_big_object**:

| Implementation | Time (ms) | vs jrsonnet |
|---|---|---|
| sjsonnet-native (master) | 12.1 ± 1.5 | 1.16x slower |
| sjsonnet-native (this PR) | 10.4 ± 1.1 | **1.01x, tied!** |
| jrsonnet 0.5.0-pre98 | 10.5 ± 1.3 | baseline |

**realistic1**:

| Implementation | Time (ms) | vs jrsonnet |
|---|---|---|
| sjsonnet-native (master) | 12.9 ± 1.4 | - |
| sjsonnet-native (this PR) | 12.0 ± 1.4 | **1.15x faster** |
| jrsonnet 0.5.0-pre98 | 13.9 ± 2.1 | baseline |

sjsonnet already **beats** jrsonnet on realistic1 (1.15x faster).

## Analysis

The JVM improvement is larger (15.6% on realistic2) because the JIT compiler was still leaving performance on the table with the char-by-char loops. On Scala Native, LLVM already partially optimizes these loops, so the native improvement is smaller for realistic2 but significant for reverse (28.4%), where the output contains many integer-valued doubles that benefit from the zero-allocation `appendLong` path. The `gen_big_object` benchmark is now **tied with jrsonnet** (10.4ms vs 10.5ms), and `realistic1` beats jrsonnet by 1.15x.
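The char-by-char loops mentioned in the analysis are what the indent cache replaces. A minimal sketch of that idea follows, assuming a plain `Array[Char]` output buffer instead of `CharBuilder`; the object name, the method names, and the signatures are hypothetical, and only `MaxCachedDepth = 32`, the `newline + indent*d spaces` layout, and the `System.arraycopy` bulk copy come from the description above.

```scala
object IndentCacheSketch {
  val MaxCachedDepth = 32

  /** cache(d) holds '\n' followed by indent * d spaces, for d in 0 until MaxCachedDepth. */
  def buildCache(indent: Int): Array[Array[Char]] =
    Array.tabulate(MaxCachedDepth) { d =>
      val a = new Array[Char](1 + indent * d)
      a(0) = '\n'
      java.util.Arrays.fill(a, 1, a.length, ' ')
      a
    }

  /** Writes newline + indentation for `depth` into `buf` at `pos`; returns the new position.
    * The caller must have reserved 1 + indent * depth chars.
    */
  def renderIndent(cache: Array[Array[Char]], indent: Int, depth: Int,
                   buf: Array[Char], pos: Int): Int =
    if (depth < MaxCachedDepth) {
      // Fast path: one bulk copy of the precomputed array.
      val src = cache(depth)
      System.arraycopy(src, 0, buf, pos, src.length)
      pos + src.length
    } else {
      // Fallback for very deep nesting: the per-character loop.
      var i = pos
      buf(i) = '\n'; i += 1
      var n = indent * depth
      while (n > 0) { buf(i) = ' '; i += 1; n -= 1 }
      i
    }
}
```

The cache is built once per renderer, so the per-call cost on the fast path is a single array copy regardless of depth.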
## Result

- ✅ All 141 test suites pass (JVM 3.3.7)
- ✅ Compiles on all platforms (JVM, JS, Native)
- ✅ No regressions across the full benchmark suite
- ✅ Comprehensive new test coverage for edge cases

This PR supersedes #676 (renderer-indent-cache), #681 (renderer-bulk-append), and #685 (direct-long-rendering) which implemented subsets of these optimizations individually.
## Motivation

When rendering integer-valued doubles (e.g., `42.0`), `visitFloat64` currently calls `i.toString`, which allocates a `String` object, then passes it through `visitFloat64StringParts` for character-by-character processing. For large arrays of numbers (common in Jsonnet), this creates millions of short-lived `String` allocations.

## Key Design Decision
Add a `writeLongDirect` method that converts a `Long` directly into the `CharBuilder`'s backing array without any intermediate `String` allocation. Uses a digit-pair lookup table for two-digits-at-a-time conversion (a well-known optimization from JDK's `Integer.getChars`).

## Modification
`sjsonnet/src/sjsonnet/BaseCharRenderer.scala`:

- `writeLongDirect(v: Long)` private method:
  - Handles `0` and `Long.MinValue` (negation overflow)
  - Writes directly into `elemBuilder.arr` via `ensureLength` + position update
- `BaseCharRenderer` with `DIGIT_TENS` and `DIGIT_ONES` lookup tables (100 entries each)
- `visitFloat64`: integer path now calls `writeLongDirect(i)` instead of `visitFloat64StringParts(i.toString, ...)`

**Test**: Added `new_test_suite/large_integer_rendering.jsonnet`, which verifies correct rendering of boundary values (0, negatives, large longs, Long.MinValue).

## Benchmark Results
### JMH: Full Suite (35 benchmarks, 1+1 warmup)

Pending benchmark data.
### Expected Impact

Primarily benefits benchmarks with heavy numeric rendering:

- `realistic2` (large JSON with many numbers)
- `base64DecodeBytes` (byte arrays rendered as numbers)

## Analysis
- No intermediate `String` or `StringBuilder`: writes directly into the backing `char[]`.
- `large_integer_rendering.jsonnet` covers boundary values including `Long.MinValue`.

## References

- JDK's `Integer.getChars` uses the same digit-pair technique
- `BaseCharRenderer.DIGIT_TENS` / `DIGIT_ONES`: static lookup tables

## Result
Zero-allocation integer rendering with digit-pair optimization. Eliminates `Long.toString` allocation overhead. Draft PR pending benchmark data.
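The "`ensureLength` + position update" pattern from the Modification notes, together with the integer-path check in `visitFloat64`, might look roughly like the sketch below. `MiniCharBuilder` is a hypothetical stand-in for the real `CharBuilder`, `appendDouble` is an invented name, and the digit writer is passed in as a function so that any direct long-to-chars routine can be plugged in; the guard against `Long.MaxValue` is an assumption of this sketch, not something the PR states.

```scala
object DirectWriteSketch {
  /** Hypothetical stand-in for a CharBuilder: a growable char array plus a write position. */
  final class MiniCharBuilder(initialCapacity: Int) {
    var arr: Array[Char] = new Array[Char](math.max(initialCapacity, 1))
    var length: Int = 0

    /** Grows `arr` by doubling until `extra` more chars fit after `length`. */
    def ensureLength(extra: Int): Unit = {
      val needed = length + extra
      if (needed > arr.length) {
        var cap = arr.length
        while (cap < needed) cap *= 2
        arr = java.util.Arrays.copyOf(arr, cap)
      }
    }

    def result: String = new String(arr, 0, length)
  }

  // The longest decimal Long, "-9223372036854775808", has 20 chars.
  val MaxLongChars = 20

  /** Appends `d` to `out`. `writeLong(v, buf, pos)` must write v's digits into buf
    * starting at pos and return the new end position.
    */
  def appendDouble(out: MiniCharBuilder, d: Double,
                   writeLong: (Long, Array[Char], Int) => Int): Unit = {
    val i = d.toLong
    // toLong saturates for huge doubles, so Long.MaxValue is excluded here (assumption).
    if (i.toDouble == d && i != Long.MaxValue) {
      out.ensureLength(MaxLongChars)                 // reserve the worst case up front
      out.length = writeLong(i, out.arr, out.length) // write in place, then bump the position
    } else {
      // Non-integer values keep the String-based path.
      val s = d.toString
      out.ensureLength(s.length)
      s.getChars(0, s.length, out.arr, out.length)
      out.length += s.length
    }
  }
}
```

Reserving the worst-case length once means the digit writer itself never has to check capacity.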