A handful of optimizations for the DRC collector #12974

fitzgen wants to merge 13 commits into bytecodealliance:main
Conversation
alexcrichton
left a comment
I need to spend more time looking at "Combine `dec_ref`, trace, and dealloc into single-pass loop", but this is one thing I noticed. The later commits seem fine, though.
This is another case, though, where in-wasm GC allocation, GC mark/sweep, etc. would, I suspect, remove a huge amount of the overhead, since the host has to dance around "the heap could be corrupt at any time", which loses a lot of perf I believe. I realize that's a big undertaking, but we may want to discuss more seriously in a meeting at some point whether it's table stakes or not for shipping GC.
Happy to discuss at a meeting, I'll add an item, but I find it super surprising that we would even entertain the idea of blocking enabling the GC proposal by default on self-hosting the free list (or, even worse from a time-to-shipping perspective, self-hosting the whole collector runtime).
Ideally we would just use a `SecondaryMap<VMSharedTypeIndex, TraceInfo>` here but allocating `O(num engine types)` space inside a store that uses only a couple types seems not great. So instead, we just have a fixed size cache that is probably big enough for most things in practice.
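The trade-off above can be sketched as a small direct-mapped cache keyed by the type index: constant space per store, with collisions simply evicting the older entry. This is an illustrative sketch only; the names (`TraceInfoCache`, `TraceInfo`, `num_gc_fields`) and the modulo-indexing scheme are assumptions, not Wasmtime's actual implementation.

```rust
// Hypothetical sketch of a fixed-size, direct-mapped cache standing in
// for a full `SecondaryMap<VMSharedTypeIndex, TraceInfo>`. All names
// here are illustrative.

const CACHE_SIZE: usize = 16;

#[derive(Clone, Copy, PartialEq, Debug)]
struct TraceInfo {
    // e.g. how many GC-managed fields an object of this type has.
    num_gc_fields: u32,
}

struct TraceInfoCache {
    // Each slot holds (type_index, info) for the last type mapped here.
    slots: [Option<(u32, TraceInfo)>; CACHE_SIZE],
}

impl TraceInfoCache {
    fn new() -> Self {
        TraceInfoCache { slots: [None; CACHE_SIZE] }
    }

    fn get(&self, ty: u32) -> Option<TraceInfo> {
        match self.slots[ty as usize % CACHE_SIZE] {
            // Hit only if the slot actually belongs to this type index.
            Some((cached_ty, info)) if cached_ty == ty => Some(info),
            _ => None,
        }
    }

    fn insert(&mut self, ty: u32, info: TraceInfo) {
        // Direct-mapped: a colliding type index evicts the old entry.
        self.slots[ty as usize % CACHE_SIZE] = Some((ty, info));
    }
}

fn main() {
    let mut cache = TraceInfoCache::new();
    cache.insert(3, TraceInfo { num_gc_fields: 2 });
    assert_eq!(cache.get(3), Some(TraceInfo { num_gc_fields: 2 }));
    assert_eq!(cache.get(19), None); // 19 % 16 == 3, but slot belongs to 3
    cache.insert(19, TraceInfo { num_gc_fields: 5 });
    assert_eq!(cache.get(3), None); // evicted by the colliding insert
}
```

The upside is `O(1)` space regardless of how many types the engine has registered; the cost is that a store cycling through more than `CACHE_SIZE` colliding types will thrash the cache.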
Inline `dec_ref`, `trace_gc_ref`, and `dealloc` into `dec_ref_and_maybe_dealloc`'s main loop so that we read the `VMDrcHeader` once per object to get the `ref_count`, type index, and `object_size`, avoiding three separate GC heap accesses and bounds checks per freed object. For struct tracing, read `gc_ref` fields directly from the heap slice at known offsets instead of going through `gc_object_data` → `object_range` → `object_size`, which would re-read the `object_size` from the header.

301,333,979,721 -> 291,038,676,119 instructions (~3.4% improvement)
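The single-pass shape described in that commit can be sketched as follows. This is a deliberately simplified model, not Wasmtime's code: `Header`, `process`, and the flat `Vec<Header>` heap are all hypothetical, and the point is only that each dead object's header is read once, yielding the refcount, type, and size together.

```rust
// Illustrative sketch: one header read per object feeds the refcount
// decrement, the type lookup, and the size needed for deallocation,
// instead of three separate helpers each re-reading the header.

struct Header {
    ref_count: u64,
    ty: u32,
    size: u32,
}

fn process(
    heap: &mut [Header],
    worklist: &mut Vec<usize>,
    freed: &mut Vec<(u32, u32)>,
) {
    while let Some(idx) = worklist.pop() {
        // Single bounds-checked header access per object.
        let h = &mut heap[idx];
        h.ref_count -= 1;
        if h.ref_count == 0 {
            // `ty` and `size` were loaded with the same access; no
            // further heap reads are needed to reclaim the object.
            freed.push((h.ty, h.size));
        }
    }
}

fn main() {
    let mut heap = vec![
        Header { ref_count: 1, ty: 7, size: 16 },
        Header { ref_count: 2, ty: 8, size: 24 },
    ];
    let mut worklist = vec![0, 1];
    let mut freed = Vec::new();
    process(&mut heap, &mut worklist, &mut freed);
    // Only object 0 dropped to a zero refcount and was reclaimed.
    assert_eq!(freed, vec![(7, 16)]);
}
```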
…exists

When the GC store is already initialized and the allocation succeeds, avoid the async machinery entirely. This avoids the overhead of taking and restoring fiber async state pointers on every allocation.

291,038,676,119 -> 230,503,364,489 instructions (~20.8% improvement)
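The fast-path pattern in that commit can be sketched as "try a cheap synchronous allocation first, and only fall back to the expensive path when it fails". Everything below (`Heap`, `try_alloc`, `alloc`, the grow-then-retry fallback) is an illustrative assumption; in the real system the slow path would involve suspending on a fiber, running a GC, and so on.

```rust
// Hypothetical sketch: the common case never touches the slow path's
// machinery at all.

#[derive(Debug)]
enum AllocError {
    OutOfMemory,
}

struct Heap {
    free: usize,
}

impl Heap {
    // Cheap, infallible-machinery synchronous attempt.
    fn try_alloc(&mut self, size: usize) -> Result<usize, AllocError> {
        if size <= self.free {
            self.free -= size;
            Ok(size)
        } else {
            Err(AllocError::OutOfMemory)
        }
    }
}

fn alloc(heap: &mut Heap, size: usize) -> usize {
    // Fast path: no fiber/async state is saved or restored.
    if let Ok(n) = heap.try_alloc(size) {
        return n;
    }
    // Slow path: stands in for "yield to async, collect garbage, grow
    // the heap". Here we just model it as growing and retrying.
    heap.free += size;
    heap.try_alloc(size).unwrap()
}

fn main() {
    let mut heap = Heap { free: 8 };
    assert_eq!(alloc(&mut heap, 4), 4);  // fast path
    assert_eq!(alloc(&mut heap, 16), 16); // slow path: grow, then retry
}
```

The design point is that the slow path's cost is paid only when the fast path actually fails, which, per the numbers above, is rare enough to be worth the extra branch.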
Avoids converting `ModuleInternedTypeIndex` to `VMSharedTypeIndex` in host code, which requires lookups in the instance's module's `TypeCollection`. We already have helpers to do this conversion inline in JIT code.

230,503,364,489 -> 216,937,168,529 instructions (~5.9% improvement)
Moves the `externref` host data cleanup inside the `ty.is_none()` branch of `dec_ref_and_maybe_dealloc`, since only `externref`s have host data. Additionally, the type check is somewhat expensive, since it involves additional bounds-checked reads from the GC heap.
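The shape of that change is moving work into the only branch where it can matter, so typed objects skip it entirely. The sketch below is hypothetical: `dealloc`, the `Option<u32>` type index, and the `host_data` table are illustrative stand-ins, not Wasmtime's representation.

```rust
// Illustrative sketch: host-data cleanup runs only on the untyped
// (externref-like) path, so typed structs/arrays pay nothing for it.

fn dealloc(ty: Option<u32>, host_data: &mut [Option<String>], idx: usize) {
    match ty {
        Some(_ty) => {
            // Struct/array object: no host data exists, no cleanup.
        }
        None => {
            // externref: drop its associated host data. In the real
            // system this branch is where the expensive checks live.
            host_data[idx] = None;
        }
    }
}

fn main() {
    let mut host_data = vec![Some("payload".to_string()), None];
    dealloc(None, &mut host_data, 0);     // externref path: cleanup runs
    dealloc(Some(42), &mut host_data, 1); // typed path: cleanup skipped
    assert!(host_data[0].is_none());
}
```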
This reverts commit 41dcbd931170c0e510b5baf9e0cafa19a83c0ddd.
I pulled this out of the queue manually due to the failure at https://github.com/bytecodealliance/wasmtime/actions/runs/24208166925/job/70669519553
Depends on #12969
See each commit message for details.
More coming soon after this.