
[PATCH] block: update queue limits atomically

Message ID ee66a4f2-ecc4-68d2-4594-a0bcba7ffe9c@redhat.com (mailing list archive)
State New
Series [PATCH] block: update queue limits atomically

Checks

Context Check Description
shin/vmtest-linus-master-VM_Test-2 success Logs for run-tests-on-kernel
shin/vmtest-linus-master-PR fail PR summary
shin/vmtest-linus-master-VM_Test-1 success Logs for build-kernel
shin/vmtest-linus-master-VM_Test-0 success Logs for build-kernel

Commit Message

Mikulas Patocka March 18, 2025, 2:26 p.m. UTC
The block limits may be read while they are being modified. The statement
"q->limits = *lim" is not really atomic. The compiler may turn it into
memcpy (clang does).

On x86-64, the kernel uses the "rep movsb" instruction for memcpy - it is
optimized on modern CPUs, but it is not atomic, it may be interrupted at
any byte boundary - and if it is interrupted, the readers may read
garbage.

On sparc64, there's an instruction that zeroes a cache line without
reading it from memory. The kernel memcpy implementation uses it (see
b3a04ed507bf) to avoid loading the destination buffer from memory. The
problem is that if we copy a block of data to q->limits and someone reads
it at the same time, the reader may read zeros.

This commit changes it to use WRITE_ONCE, so that individual words are
updated atomically.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org

---
 block/blk-settings.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)
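
As an editorial illustration of the torn-read hazard described in the commit
message, here is a minimal user-space sketch. It is hypothetical: "struct
limits" merely stands in for struct queue_limits, the data race between the
two threads is intentional, and whether tears are actually observed depends
on how the compiler lowers the struct assignment to memcpy.

#include <pthread.h>
#include <stdio.h>

/* Stand-in for struct queue_limits: large enough that the compiler emits a
 * real memcpy for the struct assignment.  The writer always fills every
 * word with the same value, so a snapshot whose words differ must be a
 * torn copy. */
struct limits { unsigned long v[16]; };

static struct limits lim;

static void *writer(void *arg)
{
	struct limits a, b;
	int i;

	for (i = 0; i < 16; i++) {
		a.v[i] = 2;
		b.v[i] = 3;
	}
	for (;;) {
		lim = a;	/* plain struct assignment -> memcpy */
		lim = b;
	}
	return NULL;
}

int main(void)
{
	pthread_t t;
	long n;
	int i;

	pthread_create(&t, NULL, writer, NULL);
	for (n = 0; n < 100000000; n++) {
		struct limits snap = lim;	/* lockless reader */

		for (i = 1; i < 16; i++)
			if (snap.v[i] != snap.v[0])
				printf("torn read at iteration %ld\n", n);
	}
	return 0;
}

Build with "gcc -O2 -pthread"; the snapshot check plays the role of a bio
submission path reading q->limits without any lock.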

Comments

Ming Lei March 18, 2025, 2:56 p.m. UTC | #1
On Tue, Mar 18, 2025 at 03:26:10PM +0100, Mikulas Patocka wrote:
> The block limits may be read while they are being modified. The statement

It is not supposed to be so in the IO path; that is why the queue is usually
kept down or frozen when updating limits.

For other cases, the limit lock can be held to synchronize the read/write.

Or do you have cases not covered by either queue freeze or the limit lock?

> "q->limits = *lim" is not really atomic. The compiler may turn it into
> memcpy (clang does).
> 
> On x86-64, the kernel uses the "rep movsb" instruction for memcpy - it is
> optimized on modern CPUs, but it is not atomic, it may be interrupted at
> any byte boundary - and if it is interrupted, the readers may read
> garbage.
> 
> On sparc64, there's an instruction that zeroes a cache line without
> reading it from memory. The kernel memcpy implementation uses it (see
> b3a04ed507bf) to avoid loading the destination buffer from memory. The
> problem is that if we copy a block of data to q->limits and someone reads
> it at the same time, the reader may read zeros.
> 
> This commit changes it to use WRITE_ONCE, so that individual words are
> updated atomically.

It isn't necessary for this particular problem, and it is also fragile to
provide atomic word updates in this low-level way: for example, what if
sizeof(struct queue_limits) isn't 8-byte aligned?

> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Cc: stable@vger.kernel.org

stable often requires a bug description.



Thanks,
Ming
Mikulas Patocka March 18, 2025, 3:31 p.m. UTC | #2
On Tue, 18 Mar 2025, Ming Lei wrote:

> On Tue, Mar 18, 2025 at 03:26:10PM +0100, Mikulas Patocka wrote:
> > The block limits may be read while they are being modified. The statement
> 
> It is not supposed to be so in the IO path; that is why the queue is usually
> kept down or frozen when updating limits.

The limits are read at some points when constructing a bio - for example 
bio_integrity_add_page, bvec_try_merge_hw_page, bio_integrity_map_user.

> For other cases, the limit lock can be held to synchronize the read/write.
> 
> Or do you have cases not covered by either queue freeze or the limit lock?

For example, device mapper reads the limits of the underlying devices 
without holding any lock (dm_set_device_limits, __process_abnormal_io, 
__max_io_len). It also writes the limits in the I/O path - 
disable_discard, disable_write_zeroes - you couldn't easily lock it here 
because it happens in the interrupt context.

I'm not sure how many other kernel subsystems do it and whether they could 
all be converted to locking.

> > "q->limits = *lim" is not really atomic. The compiler may turn it into
> > memcpy (clang does).
> > 
> > On x86-64, the kernel uses the "rep movsb" instruction for memcpy - it is
> > optimized on modern CPUs, but it is not atomic, it may be interrupted at
> > any byte boundary - and if it is interrupted, the readers may read
> > garbage.
> > 
> > On sparc64, there's an instruction that zeroes a cache line without
> > reading it from memory. The kernel memcpy implementation uses it (see
> > b3a04ed507bf) to avoid loading the destination buffer from memory. The
> > problem is that if we copy a block of data to q->limits and someone reads
> > it at the same time, the reader may read zeros.
> > 
> > This commit changes it to use WRITE_ONCE, so that individual words are
> > updated atomically.
> 
> It isn't necessary for this particular problem, and it is also fragile to
> provide atomic word updates in this low-level way: for example, what if
> sizeof(struct queue_limits) isn't 8-byte aligned?

struct queue_limits contains two "unsigned long" fields, so it must be 
aligned on an "unsigned long" boundary.

In order to make it future-proof, we could use __alignof__(struct 
queue_limits) to determine the size of the update step.
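
A minimal sketch of that hardened copy (editorial, not from the posted patch
below; the BUILD_BUG_ON makes the size assumption explicit, so future growth
of the structure cannot silently skip the tail bytes):

	size_t i;

	/* Fail the build if the structure ever stops being a whole
	 * number of words. */
	BUILD_BUG_ON(sizeof(struct queue_limits) % sizeof(unsigned long));

	for (i = 0; i < sizeof(struct queue_limits) / sizeof(unsigned long); i++)
		WRITE_ONCE(((unsigned long *)&q->limits)[i],
			   ((const unsigned long *)lim)[i]);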

> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> > Cc: stable@vger.kernel.org
> 
> stable often requires a bug description.

The bug is that you may read mangled numbers when reading the 
queue_limits.

Mikulas
Bart Van Assche March 18, 2025, 3:58 p.m. UTC | #3
On 3/18/25 7:26 AM, Mikulas Patocka wrote:
> The block limits may be read while they are being modified. The statement
> "q->limits = *lim" is not really atomic. The compiler may turn it into
> memcpy (clang does).

Which code reads block limits while these are being updated? This should
be mentioned in the patch description.

Bart.
Mikulas Patocka March 18, 2025, 4:13 p.m. UTC | #4
On Tue, 18 Mar 2025, Bart Van Assche wrote:

> On 3/18/25 7:26 AM, Mikulas Patocka wrote:
> > The block limits may be read while they are being modified. The statement
> > "q->limits = *lim" is not really atomic. The compiler may turn it into
> > memcpy (clang does).
> 
> Which code reads block limits while these are being updated?

See my reply to Ming - 
https://lore.kernel.org/dm-devel/14dd4360-c846-43e3-86bc-b1e7448e5896@acm.org/T/#m7e4e49fed1cbcb56954b880e54a5155c4089c0e0

> This should be mentioned in the patch description.
> 
> Bart.

Yes, I can add it there.

Mikulas
Ming Lei March 19, 2025, 1:22 a.m. UTC | #5
On Tue, Mar 18, 2025 at 04:31:35PM +0100, Mikulas Patocka wrote:
> 
> 
> On Tue, 18 Mar 2025, Ming Lei wrote:
> 
> > On Tue, Mar 18, 2025 at 03:26:10PM +0100, Mikulas Patocka wrote:
> > > The block limits may be read while they are being modified. The statement
> > 
> > It is not supposed to be so in the IO path; that is why the queue is usually
> > kept down or frozen when updating limits.
> 
> The limits are read at some points when constructing a bio - for example 
> bio_integrity_add_page, bvec_try_merge_hw_page, bio_integrity_map_user.

For the request-based code path, there isn't such an issue because the queue
usage counter is grabbed.

It should be a device-mapper-specific issue, because the above interfaces
may not be called from dm_submit_bio().

One fix is to make sure that the queue usage counter is grabbed in dm's
bio/clone submission code path.
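
A rough sketch of that idea (the offload point and names are hypothetical;
q->q_usage_counter is the queue usage counter mentioned above):

	/* Pin the queue before handing the clone to a worker, so the
	 * limits cannot be rewritten while the worker still consumes them. */
	percpu_ref_get(&q->q_usage_counter);
	queue_work(md->wq, &io->work);		/* hypothetical dm offload */

	/* ...and in the worker, once the bio has been submitted: */
	percpu_ref_put(&q->q_usage_counter);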

> 
> > For other cases, the limit lock can be held to synchronize the read/write.
> > 
> > Or do you have cases not covered by either queue freeze or the limit lock?
> 
> For example, device mapper reads the limits of the underlying devices 
> without holding any lock (dm_set_device_limits,

dm_set_device_limits() needs to be fixed by holding the limit lock.
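
For illustration, one hedged shape of such a fix; the helper is hypothetical,
and q->limits_lock is the mutex taken by queue_limits_start_update() /
queue_limits_commit_update():

	static struct queue_limits dm_snapshot_limits(struct block_device *bdev)
	{
		struct request_queue *q = bdev_get_queue(bdev);
		struct queue_limits lim;

		mutex_lock(&q->limits_lock);
		lim = q->limits;	/* no committer can run while the lock is held */
		mutex_unlock(&q->limits_lock);
		return lim;
	}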


> __process_abnormal_io, 
> __max_io_len).

Those two are called with the queue usage counter grabbed, so they should be fine.


> It also writes the limits in the I/O path - 
> disable_discard, disable_write_zeroes - you couldn't easily lock it here 
> because it happens in the interrupt context.

IMO it is a bad implementation; why does device mapper have to clear
it in bio->end_io() or the request's blk_mq_ops->complete()?

> 
> I'm not sure how many other kernel subsystems do it and whether they could 
> all be converted to locking.

Most request-based drivers should have been converted to the new API.

I guess only device mapper / raid / other bio-based drivers have this
kind of risk.

> 
> > > "q->limits = *lim" is not really atomic. The compiler may turn it into
> > > memcpy (clang does).
> > > 
> > > On x86-64, the kernel uses the "rep movsb" instruction for memcpy - it is
> > > optimized on modern CPUs, but it is not atomic, it may be interrupted at
> > > any byte boundary - and if it is interrupted, the readers may read
> > > garbage.
> > > 
> > > On sparc64, there's an instruction that zeroes a cache line without
> > > reading it from memory. The kernel memcpy implementation uses it (see
> > > b3a04ed507bf) to avoid loading the destination buffer from memory. The
> > > problem is that if we copy a block of data to q->limits and someone reads
> > > it at the same time, the reader may read zeros.
> > > 
> > > This commit changes it to use WRITE_ONCE, so that individual words are
> > > updated atomically.
> > 
> > It isn't necessary for this particular problem, and it is also fragile to
> > provide atomic word updates in this low-level way: for example, what if
> > sizeof(struct queue_limits) isn't 8-byte aligned?
> 
> struct queue_limits contains two "unsigned long" fields, so it must be 
> aligned on an "unsigned long" boundary.
> 
> In order to make it future-proof, we could use __alignof__(struct 
> queue_limits) to determine the size of the update step.

Yeah, it looks fine, but I feel it is still fragile, and I am not sure it is
an accepted solution.



Thanks,
Ming
Jens Axboe March 19, 2025, 1:58 a.m. UTC | #6
On 3/18/25 7:22 PM, Ming Lei wrote:
> On Tue, Mar 18, 2025 at 04:31:35PM +0100, Mikulas Patocka wrote:
>>
>>
>> On Tue, 18 Mar 2025, Ming Lei wrote:
>>
>>> On Tue, Mar 18, 2025 at 03:26:10PM +0100, Mikulas Patocka wrote:
>>>> The block limits may be read while they are being modified. The statement
>>>
>>> It is not supposed to be so in the IO path; that is why the queue is usually
>>> kept down or frozen when updating limits.
>>
>> The limits are read at some points when constructing a bio - for example 
>> bio_integrity_add_page, bvec_try_merge_hw_page, bio_integrity_map_user.
> 
> For the request-based code path, there isn't such an issue because the queue
> usage counter is grabbed.
> 
> It should be a device-mapper-specific issue, because the above interfaces
> may not be called from dm_submit_bio().
> 
> One fix is to make sure that the queue usage counter is grabbed in dm's
> bio/clone submission code path.
> 
>>
>>> For other cases, the limit lock can be held to synchronize the read/write.
>>>
>>> Or do you have cases not covered by either queue freeze or the limit lock?
>>
>> For example, device mapper reads the limits of the underlying devices 
>> without holding any lock (dm_set_device_limits,
> 
> dm_set_device_limits() needs to be fixed by holding the limit lock.
> 
> 
>> __process_abnormal_io, 
>> __max_io_len).
> 
> Those two are called with the queue usage counter grabbed, so they should be fine.
> 
> 
>> It also writes the limits in the I/O path - 
>> disable_discard, disable_write_zeroes - you couldn't easily lock it here 
>> because it happens in the interrupt context.
> 
> IMO it is a bad implementation; why does device mapper have to clear
> it in bio->end_io() or the request's blk_mq_ops->complete()?
> 
>>
>> I'm not sure how many other kernel subsystems do it and whether they could 
>> all be converted to locking.
> 
> Most request-based drivers should have been converted to the new API.
> 
> I guess only device mapper / raid / other bio-based drivers have this
> kind of risk.
> 
>>
>>>> "q->limits = *lim" is not really atomic. The compiler may turn it into
>>>> memcpy (clang does).
>>>>
>>>> On x86-64, the kernel uses the "rep movsb" instruction for memcpy - it is
>>>> optimized on modern CPUs, but it is not atomic, it may be interrupted at
>>>> any byte boundary - and if it is interrupted, the readers may read
>>>> garbage.
>>>>
>>>> On sparc64, there's an instruction that zeroes a cache line without
>>>> reading it from memory. The kernel memcpy implementation uses it (see
>>>> b3a04ed507bf) to avoid loading the destination buffer from memory. The
>>>> problem is that if we copy a block of data to q->limits and someone reads
>>>> it at the same time, the reader may read zeros.
>>>>
>>>> This commit changes it to use WRITE_ONCE, so that individual words are
>>>> updated atomically.
>>>
>>> It isn't necessary for this particular problem, and it is also fragile to
>>> provide atomic word updates in this low-level way: for example, what if
>>> sizeof(struct queue_limits) isn't 8-byte aligned?
>>
>> struct queue_limits contains two "unsigned long" fields, so it must be 
>> aligned on an "unsigned long" boundary.
>>
>> In order to make it future-proof, we could use __alignof__(struct 
>> queue_limits) to determine the size of the update step.
> 
> Yeah, it looks fine, but I feel it is still fragile, and I am not sure it is
> an accepted solution.

Agree - it'd be much better to have the bio drivers provide the same
guarantees that we get on the request side, rather than play games with
this and pretend that concurrent update and usage is fine.
Mikulas Patocka March 19, 2025, 9:18 p.m. UTC | #7
On Tue, 18 Mar 2025, Jens Axboe wrote:

> > Yeah, it looks fine, but I feel it is still fragile, and I am not sure it is
> > an accepted solution.
> 
> Agree - it'd be much better to have the bio drivers provide the same
> guarantees that we get on the request side, rather than play games with
> this and pretend that concurrent update and usage is fine.
> 
> -- 
> Jens Axboe

And what mechanism should they use to read the queue limits?
* locking? (would degrade performance)
* percpu-rwsem? (no overhead for readers, writers wait for the RCU 
  synchronization)
* RCU?
* anything else?

Mikulas
Ming Lei March 20, 2025, 2:22 a.m. UTC | #8
On Wed, Mar 19, 2025 at 10:18:39PM +0100, Mikulas Patocka wrote:
> 
> 
> On Tue, 18 Mar 2025, Jens Axboe wrote:
> 
> > > Yeah, it looks fine, but I feel it is still fragile, and I am not sure it is
> > > an accepted solution.
> > 
> > Agree - it'd be much better to have the bio drivers provide the same
> > guarantees that we get on the request side, rather than play games with
> > this and pretend that concurrent update and usage is fine.
> > 
> > -- 
> > Jens Axboe
> 
> And what mechanism should they use to read the queue limits?
> * locking? (would degrade performance)
> * percpu-rwsem? (no overhead for readers, writers wait for the RCU 
>   synchronization)
> * RCU?
> * anything else?

1) the queue usage counter covers the fast IO code path

- in __submit_bio(), the queue usage counter is grabbed when calling
  ->submit_bio()

- the only trouble should be from dm-crypt or thin-provisioning, which offload
bio submission to another context; there you can hold the usage counter via
percpu_ref_get(&q->q_usage_counter) until the bio submission or queue
limit consumption is done

2) slow path: dm_set_device_limits

which is done before the DM disk is brought up, so it should be fine to hold the limit lock.

3) changing queue limits from bio->end_io() or request completion handler

- this usage needs a fix
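
One possible shape for 3), sketched as an assumption rather than an agreed
design: defer the update from completion context to a workqueue and use the
regular limit-update API there (the work item and its field are hypothetical):

	static void dm_disable_discard_fn(struct work_struct *work)
	{
		struct mapped_device *md =
			container_of(work, struct mapped_device, disable_work);
		struct queue_limits lim;

		/* process context: safe to take limits_lock */
		lim = queue_limits_start_update(md->queue);
		lim.max_hw_discard_sectors = 0;
		queue_limits_commit_update(md->queue, &lim);
	}

	/* from the bio/request completion handler (may be interrupt context): */
	schedule_work(&md->disable_work);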



thanks,
Ming
Christoph Hellwig March 20, 2025, 5:25 a.m. UTC | #9
On Tue, Mar 18, 2025 at 03:26:10PM +0100, Mikulas Patocka wrote:
> The block limits may be read while they are being modified. The statement
> "q->limits = *lim" is not really atomic. The compiler may turn it into
> memcpy (clang does).

And that is intentional.

> This commit changes it to use WRITE_ONCE, so that individual words are
> updated atomically.

You fail to explain why the intended non-atomic semantics are a
problem.

Note: it usually helps to Cc the author of the commit you suspect is
broken if you want a quick resolution.
Christoph Hellwig March 20, 2025, 5:26 a.m. UTC | #10
On Tue, Mar 18, 2025 at 07:58:09PM -0600, Jens Axboe wrote:
> Agree - it'd be much better to have the bio drivers provide the same
> guarantees that we get on the request side, rather than play games with
> this and pretend that concurrent update and usage is fine.

Exactly.  That is long overdue.
Jens Axboe March 20, 2025, 12:58 p.m. UTC | #11
On 3/19/25 8:22 PM, Ming Lei wrote:
> On Wed, Mar 19, 2025 at 10:18:39PM +0100, Mikulas Patocka wrote:
>>
>>
>> On Tue, 18 Mar 2025, Jens Axboe wrote:
>>
>>>> Yeah, it looks fine, but I feel it is still fragile, and I am not sure it is
>>>> an accepted solution.
>>>
>>> Agree - it'd be much better to have the bio drivers provide the same
>>> guarantees that we get on the request side, rather than play games with
>>> this and pretend that concurrent update and usage is fine.
>>>
>>> -- 
>>> Jens Axboe
>>
>> And what mechanism should they use to read the queue limits?
>> * locking? (would degrade performance)
>> * percpu-rwsem? (no overhead for readers, writers wait for the RCU 
>>   synchronization)
>> * RCU?
>> * anything else?
> 
> 1) the queue usage counter covers the fast IO code path
> 
> - in __submit_bio(), the queue usage counter is grabbed when calling
>   ->submit_bio()
> 
> - the only trouble should be from dm-crypt or thin-provisioning, which offload
> bio submission to another context; there you can hold the usage counter via
> percpu_ref_get(&q->q_usage_counter) until the bio submission or queue
> limit consumption is done

Indeed - this is an entirely solved problem already, it's just that the
bio bypassing gunk thinks it can get away with just bypassing all of
that. The mechanisms very much exist and are used by the request path,
which is why this problem doesn't exist there.

> 2) slow path: dm_set_device_limits
> 
> which is done before the DM disk is brought up, so it should be fine to hold the limit lock.
> 
> 3) changing queue limits from bio->end_io() or request completion handler
> 
> - this usage needs a fix

All looks reasonable.

Patch

Index: linux-2.6/block/blk-settings.c
===================================================================
--- linux-2.6.orig/block/blk-settings.c
+++ linux-2.6/block/blk-settings.c
@@ -433,6 +433,7 @@  int queue_limits_commit_update(struct re
 		struct queue_limits *lim)
 {
 	int error;
+	size_t i;
 
 	error = blk_validate_limits(lim);
 	if (error)
@@ -446,7 +447,14 @@  int queue_limits_commit_update(struct re
 	}
 #endif
 
-	q->limits = *lim;
+	/*
+	 * Note that direct assignment like "q->limits = *lim" is not atomic
+	 * (the compiler can generate things like "rep movsb" for it),
+	 * so we use WRITE_ONCE.
+	 */
+	for (i = 0; i < sizeof(struct queue_limits) / sizeof(long); i++)
+		WRITE_ONCE(*((long *)&q->limits + i), *((long *)lim + i));
+
 	if (q->disk)
 		blk_apply_bdi_limits(q->disk->bdi, lim);
 out_unlock: