-
-
Notifications
You must be signed in to change notification settings - Fork 14.3k
Optimize Vec push by preventing address escapes #150950
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Optimize Vec push by preventing address escapes #150950
Conversation
This comment has been minimized.
This comment has been minimized.
132f837 to
b32c1a0
Compare
This comment has been minimized.
This comment has been minimized.
b32c1a0 to
67c1768
Compare
Please try to only use that attribute where it is demonstrated to be better than #[inline].
Isn't this a penalty for small custom allocators that can be inlined? |
4050066 to
52ccbc8
Compare
This comment has been minimized.
This comment has been minimized.
52ccbc8 to
0313271
Compare
This comment has been minimized.
This comment has been minimized.
0313271 to
8d85a31
Compare
This comment has been minimized.
This comment has been minimized.
8d85a31 to
f02ca6c
Compare
This comment has been minimized.
This comment has been minimized.
f02ca6c to
8597392
Compare
This comment has been minimized.
This comment has been minimized.
These are on xxx(&mut self) functions that forward to functions that take by self and return self. If not always inlined the &mut self escapes a reference, producing code that is drastically worse. The inner loop in the benchmarks with this PR is bf330: mov %r13,(%rdx,%r13,8) ; vec[len] = len before: bf600: mov -0x38(%rbp),%rax ; LOAD ptr from stack
I suspect there is a small penalty due to indirection (although the compiler seem to generate call reg in the direct case too). But there are also positive side effects due to code dedup. These are fallback paths so from that perspective a tiny regression isn't the worst. The point of this PR is that the existence of fallback path should not influence the compilers ability to optimize the fast path and keep that clean and tight. The &dyn Allocator change can be changed to &Allocator at the cost of monomorphizing grow function. |
8597392 to
ac22726
Compare
|
@bors try @rust-timer queue |
|
Awaiting bors try build completion. @rustbot label: +S-waiting-on-perf |
|
⌛ Trying commit ac22726 with merge b00b54d… To cancel the try build, run the command Workflow: https://github.com/rust-lang/rust/actions/runs/20890110771 |
Optimize Vec push by preventing address escapes
This comment has been minimized.
This comment has been minimized.
|
Looks like the build isn't working yet |
|
Try build cancelled. Cancelled workflows: |
| b.iter(|| { | ||
| let mut v = Vec::new(); | ||
| for i in 0..n { | ||
| v.push(i); | ||
| } | ||
| black_box(v.as_slice()); | ||
| v | ||
| }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should move the constructor outside of the loop so we're not benchmarking that cost (more relevant for with_capacity). It would also be a good idea to black_box(v).push(i), in which case you don't need to do the as_slice bit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
black_box(v).push(i) would prevent the compiler optimizations we are trying to enable, by explicitly escaping the reference to v.
Moving the constructor out of the lambda also have a similar effect, due to criterion iter black_boxing the returned v.
library/alloc/src/vec/mod.rs
Outdated
| #[cfg(not(no_global_oom_handling))] | ||
| #[rustc_const_unstable(feature = "const_heap", issue = "79597")] | ||
| #[rustfmt::skip] // FIXME(fee1-dead): temporary measure before rustfmt is bumped | ||
| #[rustfmt::skip] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Accidental change?
library/alloc/src/raw_vec/mod.rs
Outdated
| /// # Safety | ||
| /// | ||
| /// This function deallocates the owned allocation, but does not update `ptr` or `cap` to | ||
| /// prevent double-free or use-after-free. Essentially, do not do anything with the caller | ||
| /// after this function returns. | ||
| /// Ideally this function would take `self` by move, but it cannot because it exists to be | ||
| /// called from a `Drop` impl. | ||
| unsafe fn deallocate(&mut self, elem_layout: Layout) { | ||
| /// This function deallocates the owned allocation. | ||
| #[inline] | ||
| unsafe fn deallocate(self, elem_layout: Layout, alloc: &dyn Allocator) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deleted safety comments?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The safety comments are mainly because it taked &mut self. The comments even mention it should take self, but couldn't because earlier it wasn't Copy. Now it takes self so the safety comments don't make sense.
library/alloc/src/raw_vec/mod.rs
Outdated
| unsafe { | ||
| // Make it more obvious that a subsequent Vec::reserve(capacity) will not allocate. | ||
| hint::assert_unchecked(!inner.needs_to_grow(0, capacity, T::LAYOUT)); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're trying to get better about having SAFETY comments in std, please make sure to cover any new unsafe blocks.
| pub(crate) fn grow_one(&mut self) { | ||
| // SAFETY: All calls on self.inner pass T::LAYOUT as the elem_layout | ||
| unsafe { self.inner.grow_one(T::LAYOUT) } | ||
| // Copy allocator to a temporary to prevent &self from escaping | ||
| // through the &dyn Allocator parameter, allowing LLVM to keep | ||
| // the Vec fields in registers. | ||
| let alloc = unsafe { ptr::read(&self.alloc) }; | ||
| self.inner = self.inner.grow_one(&alloc, T::LAYOUT); | ||
| unsafe { ptr::write(&mut self.alloc, alloc) }; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I'm missing something but I don't understand this at all. How does self "escape" through &dyn Allocator? How does saving+restoring make a difference? What happens when alloc has interior mutability and you clobber it? What happens when grow_one panics on OOM and there are now two instances of A, which may impl drop, pointing to the same memory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this perhaps just a trick to get around the borrow checker? If so, then perhaps something like this would work:
let inner = self.inner;
inner.grow_one(self.alloc, T::LAYOUT);|
Reminder, once the PR becomes ready for a review, use |
|
FWIW the compile time benchmarks will not say anything about the runtime perf of this change, since such large refactorings of Vec are pretty much guaranteed to have some compile time impact on crates that use vec (so every crate). |
| { | ||
| handle_error(err); | ||
| } | ||
| self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see any changes that could warrant this function changing to no longer being unsafe, it even still has the same safety comments...
library/alloctests/benches/vec.rs
Outdated
| // ============================================================================ | ||
| // PUSH BENCHMARKS - The focus of your optimization work | ||
| // ============================================================================ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove these AI comments.
library/alloctests/benches/vec.rs
Outdated
| do_bench_push_preallocated(b, 10000); | ||
| } | ||
|
|
||
| // ============================================================================ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
When you say "always inlined" are you referring to the attribute or the optimization? |
|
Could you define the “escaping” term that you keep using? In Rust I’m only aware of that referring to lifetimes, which isn’t relevant for codegen. |
In compiler analysis the most important step is mem2reg, ie. lower stack variables to register. The crucial part of this step is that there is no reference to the stack variable. So local function variables whose stack address does not escape the compiler analysis (for example by passing it to some other function) can easily be moved to SSA variables. This allows subsequent codegen to keep those variables in register (because there is no need to sync the register to stack). So currently the reference to self (len, cap, ptr) is passed to grow and thus all subsequent codegen is poluted by unnecessary register <-> stack syncing (see the asm i posted). This PR removes the reference to stack variables from the outline function call. in lieu of passing and returning cap, ptr by value (ie. register) |
I refer to the optimization, it's crucial that functions taking &mut self, are always inlined because then compiler optimization will see that the reference can eliminated. |
I'm not sure if I understand you. The main point of this PR is runtime performance of code. I hope the benchmarks I'm running are the benchmarks for measuring the runtime perf of the Vec implementation. There might be a compile time benefit. Because the fallback grow function is only compiled once as part of the standard lib and not, like the current state, in each crate that uses vec. |
I do not think this is sufficient justification for |
ac22726 to
6b13ac1
Compare
|
^ to reiterate that, the rule of thumb now is that any use of In general here, it would be helpful if you could put a mini version of the before and after code on godbolt so we can get the bigger picture of what’s actually happening at the different levels. |
This comment has been minimized.
This comment has been minimized.
https://godbolt.org/z/nrnP4T83e shows a rather minimal version, you can see the difference in codegen the test functions vec_push vs rf_push |
This change makes RawVecInner non-generic over the allocator, allowing it to be Copy. The allocator is moved to RawVec itself. Key optimizations: - RawVecInner is now Copy (no allocator field) - grow_one uses ptr::read/ptr::write to copy allocator to a temporary, preventing &self from escaping through &dyn Allocator parameter - Drop::drop similarly copies to temporaries before deallocating - deallocate takes self by value instead of &mut self - All these functions are #[inline(always)] This allows LLVM to keep Vec fields (cap, ptr, len) in registers during push loops instead of storing/loading from memory every iteration. Benchmark results (push with pre-allocated capacity): - 100 elements: 1.74x faster - 1000 elements: 1.87x faster - 10000 elements: 2.41x faster Secondary benefit: grow_one_impl and other growth functions use &dyn Allocator, so they are compiled once in libstd rather than monomorphized per allocator type. Preserves const compatibility with the const_heap feature by using generics for the const allocation path while using &dyn Allocator for runtime paths. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
6b13ac1 to
b136c1a
Compare
|
The job Click to see the possible cause of the failure (guessed by this bot) |
This change makes RawVecInner non-generic over the allocator, allowing it to be Copy. The allocator is moved to RawVec itself. Key optimizations:
This allows LLVM to keep Vec fields (cap, ptr, len) in registers during push loops instead of storing/loading from memory every iteration.
Benchmark results (push with pre-allocated capacity):
Secondary benefit: grow_one_impl and other growth functions use &dyn Allocator, so they are compiled once in libstd rather than monomorphized per allocator type.
Preserves const compatibility with the const_heap feature by using generics for the const allocation path while using &dyn Allocator for runtime paths.