@gerben-stavenga

This change makes RawVecInner non-generic over the allocator, allowing it to be Copy. The allocator is moved to RawVec itself. Key optimizations:

  • RawVecInner is now Copy (no allocator field)
  • grow_one uses ptr::read/ptr::write to copy allocator to a temporary, preventing &self from escaping through &dyn Allocator parameter
  • Drop::drop similarly copies to temporaries before deallocating
  • deallocate takes self by value instead of &mut self
  • All these functions are #[inline(always)]

This allows LLVM to keep Vec fields (cap, ptr, len) in registers during push loops instead of storing/loading from memory every iteration.

Benchmark results (push with pre-allocated capacity):

  • 100 elements: 1.74x faster
  • 1000 elements: 1.87x faster
  • 10000 elements: 2.41x faster

Secondary benefit: grow_one_impl and other growth functions use &dyn Allocator, so they are compiled once in libstd rather than monomorphized per allocator type.

Preserves const compatibility with the const_heap feature by using generics for the const allocation path while using &dyn Allocator for runtime paths.

@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Jan 11, 2026
@saethlin
Member

> • All these functions are #[inline(always)]

Please try to only use that attribute where it is demonstrated to be better than #[inline].

> Secondary benefit: grow_one_impl and other growth functions use &dyn Allocator, so they are compiled once in libstd rather than monomorphized per allocator type.

Isn't this a penalty for small custom allocators that can be inlined?

@gerben-stavenga gerben-stavenga force-pushed the vec-push-optimization branch 2 times, most recently from 4050066 to 52ccbc8 Compare January 11, 2026 03:47
@gerben-stavenga
Author

gerben-stavenga commented Jan 11, 2026

> > • All these functions are #[inline(always)]
>
> Please try to only use that attribute where it is demonstrated to be better than #[inline].

These attributes are on xxx(&mut self) functions that forward to functions taking self by value and returning Self. If the wrapper is not inlined, the &mut self reference escapes into the call, and the generated code is drastically worse. With this PR, the inner loop in the benchmarks is:

bf330: mov %r13,(%rdx,%r13,8) ; vec[len] = len
bf334: inc %r13 ; len++
bf337: cmp %r13,%r15 ; compare with target
bf33a: je bf370 ; done if equal
bf33c: cmp %rax,%r13 ; compare with capacity
bf33f: jne bf330 ; loop back

before:

bf600: mov -0x38(%rbp),%rax ; LOAD ptr from stack
bf604: mov %r15,(%rax,%r15,8) ; vec[len] = len
bf608: inc %r15 ; len++
bf60b: mov %r15,-0x30(%rbp) ; STORE len to stack
bf60f: cmp %r15,%r14 ; compare with target
bf612: je bf630 ; done if equal
bf614: cmp -0x40(%rbp),%r15 ; LOAD capacity from stack
bf618: jne bf600 ; loop back

> > Secondary benefit: grow_one_impl and other growth functions use &dyn Allocator, so they are compiled once in libstd rather than monomorphized per allocator type.
>
> Isn't this a penalty for small custom allocators that can be inlined?

I suspect there is a small penalty due to the indirection (although the compiler seems to generate call reg in the direct case too). But there are also positive side effects from code deduplication. These are fallback paths, so from that perspective a tiny regression isn't the worst. The point of this PR is that the existence of a fallback path should not affect the compiler's ability to optimize the fast path and keep it clean and tight.

The &dyn Allocator parameter can be changed to a generic &A (with A: Allocator) at the cost of monomorphizing the grow functions.
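
The trade-off between the two signatures can be sketched outside of libstd. This is a minimal illustration and not the PR's code: the `Alloc` trait, `Doubling`, `grow_dyn`, and `grow_generic` are hypothetical names standing in for the unstable `Allocator` trait and the growth functions.

```rust
/// Hypothetical stand-in for the unstable `Allocator` trait.
trait Alloc {
    /// Returns the next capacity for a buffer that currently holds `old_cap` slots.
    fn next_capacity(&self, old_cap: usize) -> usize;
}

/// Hypothetical small allocator whose method is cheap enough to inline.
struct Doubling;

impl Alloc for Doubling {
    fn next_capacity(&self, old_cap: usize) -> usize {
        if old_cap == 0 { 4 } else { old_cap * 2 }
    }
}

/// Dynamic dispatch: a single compiled copy serves every allocator type,
/// but each call goes through the vtable.
#[inline(never)]
fn grow_dyn(alloc: &dyn Alloc, old_cap: usize) -> usize {
    alloc.next_capacity(old_cap)
}

/// Static dispatch: one copy is generated per allocator type `A`,
/// which lets the optimizer inline `next_capacity`.
fn grow_generic<A: Alloc>(alloc: &A, old_cap: usize) -> usize {
    alloc.next_capacity(old_cap)
}
```

Both functions return the same result; they differ only in how many copies get compiled and whether the allocator call can be inlined, which is the penalty raised in the question above.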

@gerben-stavenga gerben-stavenga marked this pull request as ready for review January 11, 2026 05:06
@rustbot rustbot added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jan 11, 2026
@rustbot rustbot removed the S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. label Jan 11, 2026
@rustbot
Collaborator

rustbot commented Jan 11, 2026

r? @tgross35

rustbot has assigned @tgross35.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@tgross35
Contributor

@bors try @rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rust-bors
Contributor

rust-bors bot commented Jan 11, 2026

⌛ Trying commit ac22726 with merge b00b54d

To cancel the try build, run the command @bors try cancel.

Workflow: https://github.com/rust-lang/rust/actions/runs/20890110771

rust-bors bot added a commit that referenced this pull request Jan 11, 2026
Optimize Vec push by preventing address escapes
@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label Jan 11, 2026
@tgross35
Contributor

Looks like the build isn't working yet
@bors try cancel

@rust-bors
Contributor

rust-bors bot commented Jan 11, 2026

Try build cancelled. Cancelled workflows:

Comment on lines 11 to 15
b.iter(|| {
    let mut v = Vec::new();
    for i in 0..n {
        v.push(i);
    }
    black_box(v.as_slice());
    v
});
Contributor

You should move the constructor outside of the loop so we're not benchmarking that cost (more relevant for with_capacity). It would also be a good idea to black_box(v).push(i), in which case you don't need to do the as_slice bit.

Author

black_box(v).push(i) would prevent the compiler optimizations we are trying to enable, by explicitly escaping the reference to v.

Moving the constructor out of the lambda also has a similar effect, because criterion's iter black_boxes the returned v.
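
The difference between the two benchmark shapes under discussion can be sketched as follows. This is an illustrative sketch only: `push_n` and `push_n_escaping` are hypothetical helper names, and `Vec::with_capacity` is used here to match the "pre-allocated capacity" case from the PR description rather than the `Vec::new()` in the snippet above.

```rust
use std::hint::black_box;

/// Shape used in the PR's benchmark: the vector stays a plain local during the
/// loop, and only the finished slice is shown to `black_box`.
fn push_n(n: usize) -> Vec<usize> {
    let mut v = Vec::with_capacity(n);
    for i in 0..n {
        v.push(i);
    }
    black_box(v.as_slice());
    v
}

/// Shape suggested in review: `black_box(&mut v)` hands the address of `v` to an
/// opaque function on every iteration, so the optimizer has to assume the fields
/// of `v` may be read or written behind its back.
fn push_n_escaping(n: usize) -> Vec<usize> {
    let mut v = Vec::with_capacity(n);
    for i in 0..n {
        black_box(&mut v).push(i);
    }
    v
}
```

Both functions build the same vector; they are meant for comparing generated code, not for drawing conclusions about which benchmark design is right.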

#[cfg(not(no_global_oom_handling))]
#[rustc_const_unstable(feature = "const_heap", issue = "79597")]
#[rustfmt::skip] // FIXME(fee1-dead): temporary measure before rustfmt is bumped
#[rustfmt::skip]
Contributor

Accidental change?

Comment on lines 810 to 824
/// # Safety
///
/// This function deallocates the owned allocation, but does not update `ptr` or `cap` to
/// prevent double-free or use-after-free. Essentially, do not do anything with the caller
/// after this function returns.
/// Ideally this function would take `self` by move, but it cannot because it exists to be
/// called from a `Drop` impl.
unsafe fn deallocate(&mut self, elem_layout: Layout) {
/// This function deallocates the owned allocation.
#[inline]
unsafe fn deallocate(self, elem_layout: Layout, alloc: &dyn Allocator) {
Contributor

Deleted safety comments?

Author

The safety comments exist mainly because it took &mut self. The comments even mention that it should take self, but it couldn't because the type wasn't Copy earlier. Now that it takes self, the safety comments no longer make sense.

Comment on lines 165 to 168
unsafe {
    // Make it more obvious that a subsequent Vec::reserve(capacity) will not allocate.
    hint::assert_unchecked(!inner.needs_to_grow(0, capacity, T::LAYOUT));
}
Contributor

We're trying to get better about having SAFETY comments in std, please make sure to cover any new unsafe blocks.
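
As an illustration of the requested convention, here is a minimal sketch of an `unsafe` block with a `// SAFETY:` comment. It uses only the public `Vec` API; `with_capacity_hinted` is a hypothetical function, not the code under review, and it assumes a toolchain where `std::hint::assert_unchecked` is available.

```rust
use std::hint;

/// Hypothetical helper: allocates a vector and tells the optimizer that the
/// requested capacity is available.
fn with_capacity_hinted(capacity: usize) -> Vec<u64> {
    let v: Vec<u64> = Vec::with_capacity(capacity);
    // SAFETY: `Vec::with_capacity` documents that the returned vector has room
    // for at least `capacity` elements, so the asserted condition always holds.
    unsafe { hint::assert_unchecked(v.capacity() >= capacity) };
    v
}
```

The comment names the invariant being relied on and where it comes from, which is the part a reviewer needs in order to check the block.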

Comment on lines 181 to 208
 pub(crate) fn grow_one(&mut self) {
-    // SAFETY: All calls on self.inner pass T::LAYOUT as the elem_layout
-    unsafe { self.inner.grow_one(T::LAYOUT) }
+    // Copy allocator to a temporary to prevent &self from escaping
+    // through the &dyn Allocator parameter, allowing LLVM to keep
+    // the Vec fields in registers.
+    let alloc = unsafe { ptr::read(&self.alloc) };
+    self.inner = self.inner.grow_one(&alloc, T::LAYOUT);
+    unsafe { ptr::write(&mut self.alloc, alloc) };
 }
Contributor

@tgross35 tgross35 Jan 11, 2026

Maybe I'm missing something but I don't understand this at all. How does self "escape" through &dyn Allocator? How does saving+restoring make a difference? What happens when alloc has interior mutability and you clobber it? What happens when grow_one panics on OOM and there are now two instances of A, which may impl drop, pointing to the same memory?
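
The panic concern can be made concrete with a small sketch. It is not the PR's code: `CountingAlloc`, `Holder`, and `failing_grow` are hypothetical stand-ins, and wrapping the bitwise copy in `ManuallyDrop` is one possible mitigation assumed here for illustration, not something proposed in the thread.

```rust
use std::cell::Cell;
use std::mem::ManuallyDrop;
use std::panic::{self, AssertUnwindSafe};
use std::ptr;

/// Hypothetical allocator stand-in that counts how often it is dropped.
struct CountingAlloc<'a>(&'a Cell<usize>);

impl Drop for CountingAlloc<'_> {
    fn drop(&mut self) {
        self.0.set(self.0.get() + 1);
    }
}

struct Holder<'a> {
    alloc: CountingAlloc<'a>,
}

/// Simulates a growth path that panics, for example on an allocation failure.
fn failing_grow(_alloc: &CountingAlloc<'_>) {
    panic!("simulated allocation failure");
}

impl Holder<'_> {
    fn grow_one(&mut self) {
        // SAFETY: `self.alloc` is valid for reads. The bitwise copy is wrapped in
        // `ManuallyDrop`, so it is never dropped (not even while unwinding) and
        // `self.alloc` remains the only copy whose destructor runs.
        let local = ManuallyDrop::new(unsafe { ptr::read(&self.alloc) });
        failing_grow(&local);
    }
}

/// Returns how many times the allocator was dropped after a panicking grow.
fn drops_after_panicking_grow() -> usize {
    let drops = Cell::new(0);
    let mut holder = Holder { alloc: CountingAlloc(&drops) };
    let result = panic::catch_unwind(AssertUnwindSafe(|| holder.grow_one()));
    assert!(result.is_err());
    drop(holder);
    drops.get()
}
```

With a plain `ptr::read` into an ordinary local, unwinding would drop the copy as well as the original. The sketch does nothing about the interior-mutability question: state that an allocator mutates through the copy is still separate from the original unless it is written back.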

Member

Is this perhaps just a trick to get around the borrow checker? If so, then perhaps something like this would work:

let inner = self.inner;
inner.grow_one(self.alloc, T::LAYOUT);

@rustbot rustbot added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jan 11, 2026
@rustbot
Collaborator

rustbot commented Jan 11, 2026

Reminder, once the PR becomes ready for a review, use @rustbot ready.

@Noratrieb
Member

FWIW the compile time benchmarks will not say anything about the runtime perf of this change, since such large refactorings of Vec are pretty much guaranteed to have some compile time impact on crates that use vec (so every crate).
So the compile time impact of this change and the runtime impact on the vecs in the compiler will be hard to untangle.

{
handle_error(err);
}
self
Member

I don't see any changes that could warrant this function changing to no longer being unsafe, it even still has the same safety comments...

Comment on lines 6 to 8
// ============================================================================
// PUSH BENCHMARKS - The focus of your optimization work
// ============================================================================
Member

Please remove these AI comments.

do_bench_push_preallocated(b, 10000);
}

// ============================================================================
Member

same here

@saethlin
Member

> If not always inlined the &mut self escapes a reference, producing code that is drastically worse.

When you say "always inlined" are you referring to the attribute or the optimization?

@tgross35
Contributor

Could you define the “escaping” term that you keep using? In Rust I’m only aware of that referring to lifetimes, which isn’t relevant for codegen.

@gerben-stavenga
Author

> Could you define the “escaping” term that you keep using? In Rust I’m only aware of that referring to lifetimes, which isn’t relevant for codegen.

In compiler analysis the most important step is mem2reg, i.e. promoting stack variables to registers. The crucial requirement for this step is that there is no reference to the stack variable. Local variables whose stack address does not escape the compiler's analysis (for example, by being passed to some other function) can easily be turned into SSA values. This allows subsequent codegen to keep those variables in registers, because there is no need to sync the register with the stack.

Currently the reference to self (len, cap, ptr) is passed to grow, and thus all subsequent codegen is polluted by unnecessary register <-> stack syncing (see the asm I posted). This PR removes the reference to stack variables from the outlined function call; instead, cap and ptr are passed and returned by value (i.e. in registers).
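
The two call shapes described here can be sketched in isolation. This is a simplified model under stated assumptions, not the PR's implementation: `State`, `grow_by_ref`, `grow_by_value`, and the `fill_*` functions are hypothetical, no memory is allocated, and whether a given LLVM version actually keeps the fields in registers has to be checked in the generated assembly.

```rust
/// Hypothetical stand-in for the (cap, len) part of a vector; `Copy`, so it can
/// be passed to and returned from the slow path by value.
#[derive(Clone, Copy, Debug, PartialEq)]
struct State {
    cap: usize,
    len: usize,
}

/// Outlined slow path that receives the address of the caller's state.
/// The caller's local must live in memory across this call.
#[inline(never)]
fn grow_by_ref(s: &mut State) {
    s.cap = if s.cap == 0 { 4 } else { s.cap * 2 };
}

/// Outlined slow path that takes and returns the state by value.
/// No address of the caller's local is taken.
#[inline(never)]
fn grow_by_value(s: State) -> State {
    State { cap: if s.cap == 0 { 4 } else { s.cap * 2 }, ..s }
}

fn fill_by_ref(n: usize) -> State {
    let mut s = State { cap: 0, len: 0 };
    for _ in 0..n {
        if s.len == s.cap {
            grow_by_ref(&mut s); // address of `s` escapes here
        }
        s.len += 1;
    }
    s
}

fn fill_by_value(n: usize) -> State {
    let mut s = State { cap: 0, len: 0 };
    for _ in 0..n {
        if s.len == s.cap {
            s = grow_by_value(s); // only values cross the call boundary
        }
        s.len += 1;
    }
    s
}
```

Both loops compute the same result; comparing their optimized assembly is one way to look at the register <-> stack traffic described above on a small example.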

@gerben-stavenga
Author

> > If not always inlined the &mut self escapes a reference, producing code that is drastically worse.
>
> When you say "always inlined" are you referring to the attribute or the optimization?

I am referring to the optimization. It's crucial that functions taking &mut self are always inlined, because then the optimizer can see that the reference can be eliminated.

@gerben-stavenga
Author

> FWIW the compile time benchmarks will not say anything about the runtime perf of this change, since such large refactorings of Vec are pretty much guaranteed to have some compile time impact on crates that use vec (so every crate). So the compile time impact of this change and the runtime impact on the vecs in the compiler will be hard to untangle.

I'm not sure if I understand you. The main point of this PR is runtime performance of code. I hope the benchmarks I'm running are the benchmarks for measuring the runtime perf of the Vec implementation.

There might be a compile-time benefit, because the fallback grow function is compiled only once as part of the standard library and not, as is currently the case, in each crate that uses Vec.

@saethlin
Member

saethlin commented Jan 11, 2026

> I refer to the optimization, it's crucial that functions taking &mut self, are always inlined because then compiler optimization will see that the reference can eliminated.

I do not think this is sufficient justification for inline(always). We have so many functions which would cause similar or worse optimization degradation if they weren't inlined in optimized builds. This just isn't worth the cost of the degradation to debug build times that is caused by inline(always). If #[inline] suffices, always is all cost no benefit.

@tgross35
Contributor

^ to reiterate that, the rule of thumb now is that any use of #[inline(always)] needs to be backed up by benchmarks and codegen showing it makes a meaningful difference over #[inline]. #[inline(always)] hurts unoptimized builds and size-optimized binaries so we need to be very cautious with its use.

In general here, it would be helpful if you could put a mini version of the before and after code on godbolt so we can get the bigger picture of what’s actually happening at the different levels.

@gerben-stavenga
Author

> ^ to reiterate that, the rule of thumb now is that any use of #[inline(always)] needs to be backed up by benchmarks and codegen showing it makes a meaningful difference over #[inline]. #[inline(always)] hurts unoptimized builds and size-optimized binaries so we need to be very cautious with its use.
>
> In general here, it would be helpful if you could put a mini version of the before and after code on godbolt so we can get the bigger picture of what’s actually happening at the different levels.

https://godbolt.org/z/nrnP4T83e

shows a rather minimal version. You can see the difference in codegen between the test functions vec_push and rf_push.

This change makes RawVecInner non-generic over the allocator, allowing it
to be Copy. The allocator is moved to RawVec itself. Key optimizations:

- RawVecInner is now Copy (no allocator field)
- grow_one uses ptr::read/ptr::write to copy allocator to a temporary,
  preventing &self from escaping through &dyn Allocator parameter
- Drop::drop similarly copies to temporaries before deallocating
- deallocate takes self by value instead of &mut self
- All these functions are #[inline(always)]

This allows LLVM to keep Vec fields (cap, ptr, len) in registers during
push loops instead of storing/loading from memory every iteration.

Benchmark results (push with pre-allocated capacity):
- 100 elements:   1.74x faster
- 1000 elements:  1.87x faster
- 10000 elements: 2.41x faster

Secondary benefit: grow_one_impl and other growth functions use &dyn Allocator,
so they are compiled once in libstd rather than monomorphized per allocator type.

Preserves const compatibility with the const_heap feature by using generics
for the const allocation path while using &dyn Allocator for runtime paths.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@rust-log-analyzer
Collaborator

The job aarch64-gnu-llvm-20-1 failed! Check out the build log: (web) (plain enhanced) (plain)

Click to see the possible cause of the failure (guessed by this bot)
   Compiling alloc v0.0.0 (/checkout/library/alloc)
[RUSTC-TIMING] rustc_std_workspace_core test:false 0.047
[RUSTC-TIMING] core test:false 32.544
   Compiling memchr v2.7.6
error[E0277]: the trait bound `CaptureLocally<'_, A>: core::alloc::Allocator` is not satisfied
   --> library/alloc/src/raw_vec/mod.rs:207:42
    |
207 |         self.inner = self.inner.grow_one(&local_alloc, T::LAYOUT);
    |                                          ^^^^^^^^^^^^ unsatisfied trait bound
    |
help: the trait `core::alloc::Allocator` is not implemented for `CaptureLocally<'_, A>`
   --> library/alloc/src/raw_vec/mod.rs:171:1
    |
171 | struct CaptureLocally<'a, T> {
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    = help: the following other types implement trait `core::alloc::Allocator`:
              &A
              &mut A
              Arc<T, A>
              Box<T, A>
              Rc<T, A>
              alloc::Global
    = note: required for the cast from `&CaptureLocally<'_, A>` to `&dyn core::alloc::Allocator`

[RUSTC-TIMING] libc test:false 1.965
   Compiling unwind v0.0.0 (/checkout/library/unwind)
[RUSTC-TIMING] unwind test:false 0.065
   Compiling adler2 v2.0.1

Labels

S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. S-waiting-on-perf Status: Waiting on a perf run to be completed. T-libs Relevant to the library team, which will review and decide on the PR/issue.

8 participants