KAFKA-19012 Fix rare producer message corruption, don't reuse buffers on the client in certain error cases (#21065) #21288
Client versions 2.8.0 and later are affected by a change that exposes a latent bug in how BufferPool is used. (BufferPool is a client-side class that allocates memory as ByteBuffers; for performance it reuses them, with callers doing manual memory management by freeing each buffer when they are done with it.) The bug is that a pooled ByteBuffer can be freed while it is still in use by the network sending thread. This early free can happen when batches expire or brokers disconnect from clients.

The bug has existed for more than a decade (since Kafka 0.x, it seems) but never manifested: prior to 2.8.0, the pooled ByteBuffer containing record data (your publishes) was copied into a freshly allocated ByteBuffer before any potential reuse, and that fresh ByteBuffer was what got written over the network to the broker. With a change included in 2.8.0, the pooled ByteBuffer remains as-is inside a MemoryRecords instance, and that pooled ByteBuffer, which in some cases can be reused and overwritten with other data, is written over the network.

Two contributing factors make the corruption worse. First, the checksum for Kafka records covers only the key/value/headers and not the topic, so there is no protection there. Second, also new in the commit that exposed the bug, the produce request header (which includes the topic and partition of a group of message batches) is serialized in a buffer separate from the messages themselves, and only the latter is placed in the pooled ByteBuffer. As a result, messages can be misrouted to a random recently used topic, rather than merely duplicated on their intended topic.
The key change is in Sender.sendProducerData: we cannot allow the pooled ByteBuffer of an expired in-flight batch to be reused until the request completes. For these batches we skip deallocating the buffer in the normal failBatch call, deferring it until completeBatch (or a different path of failBatch) runs.
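The deferral pattern can be sketched as follows (class and method names here are hypothetical and simplified, not the actual Sender/ProducerBatch code): the buffer is only released once both "batch failed" and "request complete" hold, whichever event comes second.

```java
import java.nio.ByteBuffer;

// Illustrative sketch of the deferral pattern, assuming a hypothetical
// InFlightBatch holding one pooled buffer. A null return means "not yet
// safe to free"; a non-null return is what would go back to the pool.
class InFlightBatch {
    final ByteBuffer pooledBuffer;
    private boolean failed = false;          // batch failed (e.g. expired)
    private boolean requestComplete = false; // network request finished

    InFlightBatch(ByteBuffer pooledBuffer) { this.pooledBuffer = pooledBuffer; }

    // Batch expires while its request is still in flight: mark it failed,
    // but release the buffer only if no request still references it.
    synchronized ByteBuffer failBatch() {
        failed = true;
        return requestComplete ? pooledBuffer : null; // null = deferred
    }

    // Request finishes: if the batch already failed, the deferred
    // deallocation happens now, when reuse is finally safe.
    // (On the normal success path, failed is false and the buffer is
    // freed by the batch's own completion handling instead.)
    synchronized ByteBuffer completeBatch() {
        requestComplete = true;
        return failed ? pooledBuffer : null;
    }
}
```

Either ordering is safe: fail-then-complete defers the free to completeBatch, while complete-then-fail frees immediately in failBatch.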
Automated tests cover this change, and manual testing reproduced the issue from KAFKA-19012 and verified that the fix is sufficient to stop it.
Reviewers: Justine Olshan <jolshan@confluent.io>, Jun Rao <junrao@gmail.com>, Chia-Ping Tsai <chia7712@gmail.com>