
[2/2] blk-mq: ensure a q_usage_counter reference is held when splitting bios

Message ID 20240111135705.2155518-3-hch@lst.de (mailing list archive)
State New, archived
Series [1/2] blk-mq: rename blk_mq_can_use_cached_rq

Commit Message

Christoph Hellwig Jan. 11, 2024, 1:57 p.m. UTC
q_usage_counter is the only thing preventing the limits from changing
under us in __bio_split_to_limits, but blk_mq_submit_bio doesn't hold it.

Change __submit_bio to always acquire a q_usage_counter reference before
branching out into the bio vs request based helpers, and let blk_mq_submit_bio
tell it whether it consumed the reference by handing it off to the request.

Fixes: 9d497e2941c3 ("block: don't protect submit_bio_checks by q_usage_counter")
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-core.c | 14 ++++++++-----
 block/blk-mq.c   | 52 +++++++++++++++++++++---------------------------
 block/blk-mq.h   |  2 +-
 3 files changed, 33 insertions(+), 35 deletions(-)

Comments

Jens Axboe Jan. 11, 2024, 4:12 p.m. UTC | #1
On 1/11/24 6:57 AM, Christoph Hellwig wrote:
> q_usage_counter is the only thing preventing us from the limits changing
> under us in __bio_split_to_limits, but blk_mq_submit_bio doesn't hold it.
> 
> Change __submit_bio to always acquire the q_usage_counter counter before
> branching out into bio vs request based helper, and let blk_mq_submit_bio
> tell it if it consumed the reference by handing it off to the request.

This causes hangs for me on shutdown/reset:

[   56.146054] sd 6:0:0:0: [sdb] Synchronizing SCSI cache
[   56.147739] sd 6:0:0:0: [sdb] Stopping disk
[   56.148976] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[   56.150803] sd 0:0:0:0: [sda] Stopping disk
[   75.549201] INFO: task systemd-shutdow:1 blocked for more than 15 seconds.
[   75.549636]       Not tainted 6.7.0-rc5-00173-g34d71db9cce2 #4540
[   75.549977] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[   75.550401] task:systemd-shutdow state:D stack:0     pid:1     tgid:1     ppid:0      flags:0x00000004
[   75.550900] Call trace:
[   75.551042]  __switch_to+0x114/0x150
[   75.551253]  __schedule+0x510/0x10d4
[   75.551451]  schedule+0x7c/0x1ac
[   75.551635]  schedule_timeout+0xe8/0x1a4
[   75.551857]  blk_mq_freeze_queue_wait_timeout+0xf4/0x18c
[   75.552157]  nvme_wait_freeze_timeout+0x68/0xa4
[   75.552503]  nvme_dev_disable+0x35c/0x374
[   75.552734]  nvme_shutdown+0x34/0x40
[   75.552956]  pci_device_shutdown+0x48/0x54
[   75.553184]  device_shutdown+0x1c4/0x314
[   75.553403]  kernel_power_off+0x40/0x88

which seems to indicate that a reference is being leaked. Haven't poked
any further at it, I'll drop these two for now.
Christoph Hellwig Jan. 11, 2024, 4:14 p.m. UTC | #2
On Thu, Jan 11, 2024 at 09:12:23AM -0700, Jens Axboe wrote:
> On 1/11/24 6:57 AM, Christoph Hellwig wrote:
> > q_usage_counter is the only thing preventing us from the limits changing
> > under us in __bio_split_to_limits, but blk_mq_submit_bio doesn't hold it.
> > 
> > Change __submit_bio to always acquire the q_usage_counter counter before
> > branching out into bio vs request based helper, and let blk_mq_submit_bio
> > tell it if it consumed the reference by handing it off to the request.
> 
> This causes hangs for me on shutdown/reset:

> which seems to indicate that a reference is being leaked. Haven't poked
> any further at it, I'll drop these two for now.

Can you send me your .config?
Jens Axboe Jan. 11, 2024, 4:17 p.m. UTC | #3
On 1/11/24 9:14 AM, Christoph Hellwig wrote:
> On Thu, Jan 11, 2024 at 09:12:23AM -0700, Jens Axboe wrote:
>> On 1/11/24 6:57 AM, Christoph Hellwig wrote:
>>> q_usage_counter is the only thing preventing us from the limits changing
>>> under us in __bio_split_to_limits, but blk_mq_submit_bio doesn't hold it.
>>>
>>> Change __submit_bio to always acquire the q_usage_counter counter before
>>> branching out into bio vs request based helper, and let blk_mq_submit_bio
>>> tell it if it consumed the reference by handing it off to the request.
>>
>> This causes hangs for me on shutdown/reset:
> 
>> which seems to indicate that a reference is being leaked. Haven't poked
>> any further at it, I'll drop these two for now.
> 
> Can you send me your .config?

Don't think it's .config related, hit it on my test box and then in my
vm as well. It's likely due to batched allocations, would be my guess.
I can start/halt the vm fine with just a boot, but if I do:

$ ~/git/fio/t/io_uring -p1 -d128 -b512 -s32 -c32 -F1 -B1 -R0 -X1 -P1 /dev/nvme0n1

for just a brief moment, nvme0 will hang on shutdown.
Christoph Hellwig Jan. 11, 2024, 4:18 p.m. UTC | #4
On Thu, Jan 11, 2024 at 09:17:41AM -0700, Jens Axboe wrote:
> Don't think it's .config related, hit it on my test box and then in my
> vm as well. It's likely due to batched allocations, would be my guess.
> I can start/halt the vm fine with just a boot, but if I do:
> 
> $ ~/git/fio/t/io_uring -p1 -d128 -b512 -s32 -c32 -F1 -B1 -R0 -X1 -P1 /dev/nvme0n1
> 
> for just a brief moment, nvme0 will hang on shutdown.

Thanks, I'll look into it.
Christoph Hellwig Jan. 11, 2024, 5:10 p.m. UTC | #5
On Thu, Jan 11, 2024 at 09:17:41AM -0700, Jens Axboe wrote:
> On 1/11/24 9:14 AM, Christoph Hellwig wrote:
> > On Thu, Jan 11, 2024 at 09:12:23AM -0700, Jens Axboe wrote:
> >> On 1/11/24 6:57 AM, Christoph Hellwig wrote:
> >>> q_usage_counter is the only thing preventing us from the limits changing
> >>> under us in __bio_split_to_limits, but blk_mq_submit_bio doesn't hold it.
> >>>
> >>> Change __submit_bio to always acquire the q_usage_counter counter before
> >>> branching out into bio vs request based helper, and let blk_mq_submit_bio
> >>> tell it if it consumed the reference by handing it off to the request.
> >>
> >> This causes hangs for me on shutdown/reset:
> > 
> >> which seems to indicate that a reference is being leaked. Haven't poked
> >> any further at it, I'll drop these two for now.
> > 
> > Can you send me your .config?
> 
> Don't think it's .config related, hit it on my test box and then in my
> vm as well. It's likely due to batched allocations, would be my guess.
> I can start/halt the vm fine with just a boot, but if I do:

Yupp, cached rqs it was.  The incremental patch below fixes it.
Can you add some cached request testing to blktests so that this gets
covered by default?

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 421db29535ba50..39eb4b99320c11 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2967,8 +2967,14 @@ bool blk_mq_submit_bio(struct bio *bio)
 		if (rq && rq->q == q) {
 			if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
 				return false;
-			if (blk_mq_use_cached_rq(rq, plug, bio))
+			if (blk_mq_use_cached_rq(rq, plug, bio)) {
+				/*
+				 * We're using the reference already owned by
+				 * rq from here on.
+				 */
+				blk_queue_exit(q);
 				goto has_rq;
+			}
 		}
 	}
Jens Axboe Jan. 11, 2024, 5:18 p.m. UTC | #6
On 1/11/24 10:10 AM, Christoph Hellwig wrote:
> On Thu, Jan 11, 2024 at 09:17:41AM -0700, Jens Axboe wrote:
>> On 1/11/24 9:14 AM, Christoph Hellwig wrote:
>>> On Thu, Jan 11, 2024 at 09:12:23AM -0700, Jens Axboe wrote:
>>>> On 1/11/24 6:57 AM, Christoph Hellwig wrote:
>>>>> q_usage_counter is the only thing preventing us from the limits changing
>>>>> under us in __bio_split_to_limits, but blk_mq_submit_bio doesn't hold it.
>>>>>
>>>>> Change __submit_bio to always acquire the q_usage_counter counter before
>>>>> branching out into bio vs request based helper, and let blk_mq_submit_bio
>>>>> tell it if it consumed the reference by handing it off to the request.
>>>>
>>>> This causes hangs for me on shutdown/reset:
>>>
>>>> which seems to indicate that a reference is being leaked. Haven't poked
>>>> any further at it, I'll drop these two for now.
>>>
>>> Can you send me your .config?
>>
>> Don't think it's .config related, hit it on my test box and then in my
>> vm as well. It's likely due to batched allocations, would be my guess.
>> I can start/halt the vm fine with just a boot, but if I do:
> 
> Yupp, cached rqs it was.  The incremental patch below fixes it.
> Can you add some cached request testing to blktests so that this gets
> covered by default?
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 421db29535ba50..39eb4b99320c11 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2967,8 +2967,14 @@ bool blk_mq_submit_bio(struct bio *bio)
>  		if (rq && rq->q == q) {
>  			if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
>  				return false;
> -			if (blk_mq_use_cached_rq(rq, plug, bio))
> +			if (blk_mq_use_cached_rq(rq, plug, bio)) {
> +				/*
> +				 * We're using the reference already owned by
> +				 * rq from here on.
> +				 */
> +				blk_queue_exit(q);
>  				goto has_rq;
> +			}
>  		}
>  	}

This also highlights a potential inefficiency in the patch, as now we're
grabbing+dropping references when we don't need to. May not be a big
deal, but it's one of the things that cached requests got rid of. Though
I'm not quite sure how to refactor to get rid of that, as we'd need to
shuffle the splitting and request get for that.

Could you take another look at the series with that in mind?
Christoph Hellwig Jan. 11, 2024, 5:24 p.m. UTC | #7
On Thu, Jan 11, 2024 at 10:18:31AM -0700, Jens Axboe wrote:
> This also highlights a potential inefficiency in the patch, as now we're
> grabbing+dropping references when we don't need to. May not be a big
> deal, but it's one of the things that cached requests got rid of. Though
> I'm not quite sure how to refactor to get rid of that, as we'd need to
> shuffle the splitting and request get for that.
> 
> Could you take another look at the series with that in mind?

I thought about it, but it gets pretty ugly quickly.  bio_queue_enter
needs to move back into blk_mq_submit_bio, and then we'd skip it
initially if bio_may_exceed_limits is false, and then we later need
to add it back.  (We'll probably also need to special-case
blk_queue_bounce, as that setting could change too.  I wish we could
finally kill that.)
Jens Axboe Jan. 11, 2024, 8:06 p.m. UTC | #8
On 1/11/24 10:24 AM, Christoph Hellwig wrote:
> On Thu, Jan 11, 2024 at 10:18:31AM -0700, Jens Axboe wrote:
>> This also highlights a potential inefficiency in the patch, as now we're
>> grabbing+dropping references when we don't need to. May not be a big
>> deal, but it's one of the things that cached requests got rid of. Though
>> I'm not quite sure how to refactor to get rid of that, as we'd need to
>> shuffle the splitting and request get for that.
>>
>> Could you take another look at the series with that in mind?
> 
> I thought about it, but it gets pretty ugly quickly.  bio_queue_enter
> needs to move back into blk_mq_submit_bio, and then we'd skip it
> initially if bio_may_exceed_limits is false, and then we later need
> to add it back.  (we'll probably also need to special case
> blk_queue_bounce as that setting could change to.  I wish we could
> finally kill that)

Something like this? Not super pretty with the duplication, but...
Should suffice for a fix, and then we can refactor it on top of that.
ioprio is inherited when cloning, so we don't need to do that post the
split.

For the bounce side, how would these settings change at runtime? They should
be set at init time and then never change. And agreed, it would be so nice to
kill that code...

diff --git a/block/blk-mq.c b/block/blk-mq.c
index aa9a05fdd023..01134e845d8d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2945,12 +2945,6 @@ void blk_mq_submit_bio(struct bio *bio)
 	blk_status_t ret;
 
 	bio = blk_queue_bounce(bio, q);
-	if (bio_may_exceed_limits(bio, &q->limits)) {
-		bio = __bio_split_to_limits(bio, &q->limits, &nr_segs);
-		if (!bio)
-			return;
-	}
-
 	bio_set_ioprio(bio);
 
 	if (plug) {
@@ -2959,6 +2953,11 @@ void blk_mq_submit_bio(struct bio *bio)
 			rq = NULL;
 	}
 	if (rq) {
+		if (unlikely(bio_may_exceed_limits(bio, &q->limits))) {
+			bio = __bio_split_to_limits(bio, &q->limits, &nr_segs);
+			if (!bio)
+				return;
+		}
 		if (!bio_integrity_prep(bio))
 			return;
 		if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
@@ -2969,6 +2968,11 @@ void blk_mq_submit_bio(struct bio *bio)
 	} else {
 		if (unlikely(bio_queue_enter(bio)))
 			return;
+		if (unlikely(bio_may_exceed_limits(bio, &q->limits))) {
+			bio = __bio_split_to_limits(bio, &q->limits, &nr_segs);
+			if (!bio)
+				goto fail;
+		}
 		if (!bio_integrity_prep(bio))
 			goto fail;
 	}
Ming Lei Jan. 11, 2024, 10:22 p.m. UTC | #9
On Thu, Jan 11, 2024 at 02:57:05PM +0100, Christoph Hellwig wrote:
> q_usage_counter is the only thing preventing us from the limits changing
> under us in __bio_split_to_limits, but blk_mq_submit_bio doesn't hold it.

Can you share why a limits change matters wrt. the split? If the queue is
live, both the new and the old limits are supposed to work; for example, we
don't freeze the queue when changing `max_sectors_kb` via sysfs.

> 
> Change __submit_bio to always acquire the q_usage_counter counter before
> branching out into bio vs request based helper, and let blk_mq_submit_bio
> tell it if it consumed the reference by handing it off to the request.
> 
> Fixes: 9d497e2941c3 ("block: don't protect submit_bio_checks by q_usage_counter")

The above tag looks wrong; the logic change is actually from commit

900e08075202 ("block: move queue enter logic into blk_mq_submit_bio()")

And commit 9d497e2941c3 doesn't move q_usage_counter out of
blk_mq_submit_bio() or the bio split path.

Thanks, 
Ming
Christoph Hellwig Jan. 12, 2024, 5:44 a.m. UTC | #10
On Thu, Jan 11, 2024 at 01:06:43PM -0700, Jens Axboe wrote:
> Something like this? Not super pretty with the duplication, but...
> Should suffice for a fix, and then we can refactor it on top of that.
> ioprio is inherited when cloning, so we don't need to do that post the
> split.

Yes, this could work.  It'll get worse with anything we need to do under
q_usage_counter, though.  I mean, it is a percpu counter, which
should be really light-weight compared to all the other stuff you do.
I'd really love to see numbers that show it matters.

> For the bounce side, how would these settings change at runtime?

Well, we don't really prevent any setting from changing at runtime.
But yes, neither mmc nor the few scsi drivers using it seem to do
any runtime re-configuration.

> Should
> be set at init time and then never change. And agree would be so nice to
> kill that code...

I wish we could get some more folks from the mmc maintainers to do
proper scatterlist (or bio/request) kmap helpers.  The scsi drivers
could easily piggyback on that or just be disabled for HIGHMEM configs.
Christoph Hellwig Jan. 12, 2024, 5:46 a.m. UTC | #11
On Fri, Jan 12, 2024 at 06:22:59AM +0800, Ming Lei wrote:
> On Thu, Jan 11, 2024 at 02:57:05PM +0100, Christoph Hellwig wrote:
> > q_usage_counter is the only thing preventing us from the limits changing
> > under us in __bio_split_to_limits, but blk_mq_submit_bio doesn't hold it.
> 
> Can you share why the limits change matters wrt. split? If queue is
> live, both new and old one are supposed to work, such as, we don't
> freeze queue when changing `max_sectors_kb` via sysfs.

Hardware reconfigurations for example, which can happen in nvme for
just about every limit, in SCSI for limits controlled by the ULD,
and in quite a few places for disabling write zeroes or discard.
Jens Axboe Jan. 12, 2024, 2:22 p.m. UTC | #12
On 1/11/24 10:44 PM, Christoph Hellwig wrote:
> On Thu, Jan 11, 2024 at 01:06:43PM -0700, Jens Axboe wrote:
>> Something like this? Not super pretty with the duplication, but...
>> Should suffice for a fix, and then we can refactor it on top of that.
>> ioprio is inherited when cloning, so we don't need to do that post the
>> split.
> 
> Yes, this could work.  It'll get worse with anything we need to do under
> q_usage_counter counter, though.  I mean, it is a perpcu_counter, which
> should be really light-weight compared to all the other stuff you do.
> I'd really love to see numbers that show it matters.

Yep it is pretty cheap, but it's not free. Here's a test where we just
grab a ref and drop it, which should arguably be cheaper than doing a
ref at the top and dropping it at the bottom due to temporal locality:

     5.01%     +0.86%  [kernel.vmlinux]  [k] blk_mq_submit_bio
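
A minimal sketch of the kind of hack described above, assuming it was simply
a paired get/put of q_usage_counter at the top of blk_mq_submit_bio(); the
actual test diff was not posted:

/*
 * Hypothetical measurement hack: take and immediately drop an extra
 * q_usage_counter reference in the submission path, to gauge the cost of
 * the added get/put pair.  Illustrative only, not the patch that was tested.
 */
void blk_mq_submit_bio(struct bio *bio)
{
	struct request_queue *q = bdev_get_queue(bio->bi_bdev);

	percpu_ref_get(&q->q_usage_counter);	/* grab a ref ... */
	percpu_ref_put(&q->q_usage_counter);	/* ... and drop it right away */

	/* ... rest of blk_mq_submit_bio() unchanged ... */
}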

>> Should
>> be set at init time and then never change. And agree would be so nice to
>> kill that code...
> 
> I wish we could see some more folks from the mmc maintainers to do
> proper scatterlist (or bio/request) kmap helpers.  The scsi drivers
> could easily piggy back on that or just be disabled for HIGHMEM configs.

Maybe we just nudge them that way by making them depend on !HIGHMEM...
Christoph Hellwig Jan. 12, 2024, 2:25 p.m. UTC | #13
On Fri, Jan 12, 2024 at 07:22:18AM -0700, Jens Axboe wrote:
> Yep it is pretty cheap, but it's not free. Here's a test where we just
> grab a ref and drop it, which should arguably be cheaper than doing a
> ref at the top and dropping it at the bottom due to temporal locality:
> 
>      5.01%     +0.86%  [kernel.vmlinux]  [k] blk_mq_submit_bio

Ok.  Do you want to send out your version formally?
Jens Axboe Jan. 12, 2024, 4:10 p.m. UTC | #14
On 1/12/24 7:25 AM, Christoph Hellwig wrote:
> On Fri, Jan 12, 2024 at 07:22:18AM -0700, Jens Axboe wrote:
>> Yep it is pretty cheap, but it's not free. Here's a test where we just
>> grab a ref and drop it, which should arguably be cheaper than doing a
>> ref at the top and dropping it at the bottom due to temporal locality:
>>
>>      5.01%     +0.86%  [kernel.vmlinux]  [k] blk_mq_submit_bio
> 
> Ok.  Do you want to send out your version formally?

Sure, can do. I'll pick your first patch here as that one makes
sense separately.
Ulf Hansson Jan. 15, 2024, 11:20 a.m. UTC | #15
+ Arnd, Linus

On Fri, 12 Jan 2024 at 15:22, Jens Axboe <axboe@kernel.dk> wrote:
>
> On 1/11/24 10:44 PM, Christoph Hellwig wrote:
> > On Thu, Jan 11, 2024 at 01:06:43PM -0700, Jens Axboe wrote:
> >> Something like this? Not super pretty with the duplication, but...
> >> Should suffice for a fix, and then we can refactor it on top of that.
> >> ioprio is inherited when cloning, so we don't need to do that post the
> >> split.
> >
> > Yes, this could work.  It'll get worse with anything we need to do under
> > q_usage_counter counter, though.  I mean, it is a perpcu_counter, which
> > should be really light-weight compared to all the other stuff you do.
> > I'd really love to see numbers that show it matters.
>
> Yep it is pretty cheap, but it's not free. Here's a test where we just
> grab a ref and drop it, which should arguably be cheaper than doing a
> ref at the top and dropping it at the bottom due to temporal locality:
>
>      5.01%     +0.86%  [kernel.vmlinux]  [k] blk_mq_submit_bio
>
> >> Should
> >> be set at init time and then never change. And agree would be so nice to
> >> kill that code...
> >
> > I wish we could see some more folks from the mmc maintainers to do
> > proper scatterlist (or bio/request) kmap helpers.  The scsi drivers
> > could easily piggy back on that or just be disabled for HIGHMEM configs.
>
> Maybe we just nudge them that way by making them depend on !HIGHMEM...
>
> --
> Jens Axboe
>

Not sure exactly what problem you are trying to solve here, but I am
certainly happy to help, if I can.

Can you perhaps point me to a couple of drivers that need to be converted?

Kind regards
Uffe
Christoph Hellwig Jan. 22, 2024, 7:34 a.m. UTC | #16
On Mon, Jan 15, 2024 at 12:20:50PM +0100, Ulf Hansson wrote:
> Not sure exactly what problem you are trying to solve here, but I am
> certainly happy to help, if I can.
> 
> Can you perhaps point me to a couple of drivers that need to be converted?

Sure.

mmc_alloc_disk sets BLK_BOUNCE_HIGH for any mmc host that doesn't have a
DMA mask set, which is a bit odd as all proper devices should have a
valid DMA mask.  I suspect platform devices might sometimes not have
one, which historically was the wild west.

A better indicator might be the use of page_address in the I/O path,
which usually comes in the form of using the sg_virt() helper.
For drivers/mmc/ that seems to be: davinci_mmc, moxart-mmc, mvsdio,
mxcmmc, omap, sdhci-esdhc-mcf and sh_mmcif.

tmio_mmc_core on the other hand seems to have code properly kmapping
a scatterlist, which might be worth lifting into a common helper to
help the above drivers.
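
For illustration, a minimal sketch of what such a shared helper could look
like, built on kmap_local_page(); the helper names are made up here and this
is not an existing API:

/*
 * Hypothetical common helpers: map a single scatterlist entry for PIO
 * access so that highmem pages work too.  kunmap_local() accepts any
 * address inside the mapped page, so the offset does not need to be
 * subtracted again.  Names are illustrative only.
 */
static inline void *mmc_sg_map_local(struct scatterlist *sg)
{
	return kmap_local_page(sg_page(sg)) + sg->offset;
}

static inline void mmc_sg_unmap_local(void *addr)
{
	kunmap_local(addr);
}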

> 
> Kind regards
> Uffe
---end quoted text---
Arnd Bergmann Jan. 22, 2024, 9:26 a.m. UTC | #17
On Mon, Jan 22, 2024, at 08:34, Christoph Hellwig wrote:
> On Mon, Jan 15, 2024 at 12:20:50PM +0100, Ulf Hansson wrote:
>> Not sure exactly what problem you are trying to solve here, but I am
>> certainly happy to help, if I can.
>> 
>> Can you perhaps point me to a couple of drivers that need to be converted?
>
> Sure.
>
> mmc_alloc_disk sets BLK_BOUNCE_HIGH for any mmc host that doesn't have a
> DMA mask set, which is a bit odd as all proper devices should have a
> valid DMA mask.  I suspect platform devices might sometimes not have
> one, which historically was the wild west.

I found five drivers that have a legacy platform device
definition without a DMA mask:

arch/m68k/coldfire/device.c: "sdhci-esdhc-mcf"
arch/arm/mach-omap1/devices.c: "mmci-omap" (slave DMA)
arch/sh/boards/board-sh7757lcr.c: "sh_mmcif" (slave DMA)
arch/sh/boards/mach-ecovec24/setup.c: "sh_mmcif" (slave DMA)
arch/sh/boards/mach-*/setup.c: "sh_mobile_sdhi" (slave DMA)
drivers/misc/cb710/core.c: "cb710-mmc" (pio-only)

None of these embedded platforms actually have highmem,
though the omap1 machine may run a kernel that has highmem
support enabled.

Most of the others only support DT based probing after we
removed a lot of old board files a year ago, so they will
always have a 32-bit mask set at probe time.

The slave DMA case is interesting, 

> A better indicator might be the use of page_address in the I/O path,
> which usually comes in the form of using the sg_virt() helper.
> For drivers/mmc/ that seems to be: davinci_mmc, moxart-mmc, mvsdio,
> mxcmmc, omap, sdhci-esdhc-mcf and sh_mmcif.

Out of these, I think only the mvsdio one is actually used
on boards with highmem: davinci, moxart, mxc (imx3) and omap2
are old enough to never have had more than 256MB or so of RAM,
and mcf (m68k) and sh can't even be built with CONFIG_HIGHMEM.

       Arnd
Christoph Hellwig Jan. 22, 2024, 1:39 p.m. UTC | #18
On Mon, Jan 22, 2024 at 10:26:30AM +0100, Arnd Bergmann wrote:
> > A better indicator might be the use of page_address in the I/O path,
> > which usually comes in the form of using the sg_virt() helper.
> > For drivers/mmc/ that seems to be: davinci_mmc, moxart-mmc, mvsdio,
> > mxcmmc, omap, sdhci-esdhc-mcf and sh_mmcif.
> 
> Out of these, I think only the mvsdio one is actually use
> on boards with highmem: davinci, moxart, mcx (imx3) and omap2
> are old enough to never have had more than 256MB or so of RAM,
> and mcf (m68k) and sh can't even be built with CONFIG_HIGHMEM.

It would be good to fix the one or two that could use highmem and add a
depends on !HIGHMEM for the others, and then kill the bounce setup in
mmc.  It turns out that in addition to the one legacy ISA SCSI driver and
the two parport SCSI drivers, usb-storage actually also sets this flag,
which might be a road blocker, but at this point I'm getting ready
to just pull the plug as long as it doesn't entirely break embedded
platforms using mmc.
Arnd Bergmann Jan. 22, 2024, 2:57 p.m. UTC | #19
On Mon, Jan 22, 2024, at 14:39, Christoph Hellwig wrote:
> On Mon, Jan 22, 2024 at 10:26:30AM +0100, Arnd Bergmann wrote:
>> > A better indicator might be the use of page_address in the I/O path,
>> > which usually comes in the form of using the sg_virt() helper.
>> > For drivers/mmc/ that seems to be: davinci_mmc, moxart-mmc, mvsdio,
>> > mxcmmc, omap, sdhci-esdhc-mcf and sh_mmcif.
>> 
>> Out of these, I think only the mvsdio one is actually use
>> on boards with highmem: davinci, moxart, mxc (imx3) and omap2
>> are old enough to never have had more than 256MB or so of RAM,
>> and mcf (m68k) and sh can't even be built with CONFIG_HIGHMEM.
>
> It would be good to fix the one or two that could use highmem and add a
> depends on !HIGHMEM for the others

I would prefer a runtime check here, as one might still have a
multiplatform kernel where one machine can use highmem and
another machine can use one of these drivers, e.g. in
imx_v6_v7_defconfig.

> and then kill the bounce setup in
> mmc.  It turn out in addition to the one legacy ISA SCSI driver and the
> two parport SCSI driver usb-storage actually also sets this flag,
> which might be a road blocker, but at this point I'm getting ready
> to just pull the plug if it doesn't break using embedded platforms
> using mmc entirely.

Agreed. I've added the maintainers for the Marvell Armada
platform to Cc; maybe one of them can look into this, see
the thread at [1]. The problem in mvsdio.c
is the sg_virt(data->sg) call in mvsd_setup_data(), which will
stop working when the MMC layer stops using bounce buffers
for highmem pages.

     Arnd

[1] https://lore.kernel.org/all/20240122073423.GA25859@lst.de/
Christoph Hellwig Jan. 23, 2024, 9:11 a.m. UTC | #20
On Mon, Jan 22, 2024 at 03:57:16PM +0100, Arnd Bergmann wrote:
> > It would be good to fix the one or two that could use highmem and add a
> > depends on !HIGHMEM for the others
> 
> I would prefer a runtime check here, as one might still have a
> multiplatform kernel where one machine can use highmem and
> another machine can use one of these drivers, e.g. in
> imx_v6_v7_defconfig.

Well, if someone can come up with a good runtime check, that's fine with me.
Arnd Bergmann Jan. 24, 2024, 11:59 a.m. UTC | #21
On Tue, Jan 23, 2024, at 10:11, Christoph Hellwig wrote:
> On Mon, Jan 22, 2024 at 03:57:16PM +0100, Arnd Bergmann wrote:
>> > It would be good to fix the one or two that could use highmem and add a
>> > depends on !HIGHMEM for the others
>> 
>> I would prefer a runtime check here, as one might still have a
>> multiplatform kernel where one machine can use highmem and
>> another machine can use one of these drivers, e.g. in
>> imx_v6_v7_defconfig.
>
> Well, if someone can come up with a good runtime check that's fine with me.

I assumed there was a generic way already, but it looks like
there is not, and adding one across nine architectures would be
nontrivial.

Let's use your initial suggestion then and use a Kconfig
dependency. I still don't like how this may impact users that
currently enable highmem and use one of these drivers, but
on a more positive note this might help us kill off HIGHMEM
in the future if users instead choose to go with
CONFIG_VMSPLIT_3G_OPT or similar.

The Marvell driver still needs some other solution of course.

    Arnd
Linus Walleij Jan. 24, 2024, 12:33 p.m. UTC | #22
On Wed, Jan 24, 2024 at 12:59 PM Arnd Bergmann <arnd@arndb.de> wrote:

> Let's use your initial suggestion then and use a Kconfig
> dependency.

I'm looking into this.

> I still don't like how this may impact users that
> currently enable highmem and use one of these drivers,

I'm taking it on a machine-by-machine basis.

For example, it turns out all the OMAP2 machines that use mmci-omap are
the Nokia n800, n810 and the H4 reference design.

So I am seeing if these can be excluded from the "most omap2plus
systems" list.

Yours,
Linus Walleij
Arnd Bergmann Jan. 24, 2024, 12:45 p.m. UTC | #23
On Mon, Jan 22, 2024, at 15:57, Arnd Bergmann wrote:
> On Mon, Jan 22, 2024, at 14:39, Christoph Hellwig wrote:
>> On Mon, Jan 22, 2024 at 10:26:30AM +0100, Arnd Bergmann wrote:
>
>> and then kill the bounce setup in
>> mmc.  It turn out in addition to the one legacy ISA SCSI driver and the
>> two parport SCSI driver usb-storage actually also sets this flag,
>> which might be a road blocker, but at this point I'm getting ready
>> to just pull the plug if it doesn't break using embedded platforms
>> using mmc entirely.
>
> Agreed. I've added the maintainers for the Marvell Armada
> platform to Cc, maybe one of them can look into this, see
> the thread at [1] for this. The problem in the mvsdio.c
> is the sg_virt(data->sg) in mvsd_setup_data(), which will
> stop working when the MMC layer stops using bounce buffers
> for highmem pages.

I talked to Linus Walleij about this driver, as he's already
looking at highmem related issues in Arm kernels. Here is
some updated information about what I think is going on:

- The driver is used on kirkwood, dove, armadaxp/370, and
  armada375, but not on armada380 and later, as those use
  sdhci instead.

- Most of the boards with those chips have the sdio controller
  disabled. In particular for the later chips it is only
  enabled in the reference design but not in products like the
  Armada XP based NAS boxes.

- Products that actually use it and have at least 1GB of RAM
  are the Globalscale Mirabox/D3Plug and  Solidrun Cubox.

- All of the machines using this driver have a valid DMA mask
  set, so the BLK_BOUNCE_HIGH hack is never actually active
  here.

- The reason I think the driver still works is that the
  PIO mode using the broken sg_virt() call is only used
  for unaligned data, and all data sent to the controller is
  either fully aligned (mmc block data) or in lowmem (sdio
  data structures filled in the kernel).

- The hack for non-cacheline-aligned DMA in commit 3c583f70a8e2
  ("mmc: mvsdio: Work around broken TX DMA") specifically
  mentions small data structures. I wonder if this is
  a bug with dma_mapping stack variables or similar, rather than
  an actual hardware bug. The only sdio_writesb() I see in the
  mwifiex driver mentioned here is for network data buffers,
  which I think don't ever share cache lines though.

      Arnd
Arnd Bergmann Jan. 24, 2024, 12:54 p.m. UTC | #24
On Wed, Jan 24, 2024, at 13:33, Linus Walleij wrote:
> On Wed, Jan 24, 2024 at 12:59 PM Arnd Bergmann <arnd@arndb.de> wrote:
>
>> Let's use your initial suggestion then and use a Kconfig
>> dependency.
>
> I'm looking into this.
>
>> I still don't like how this may impact users that
>> currently enable highmem and use one of these drivers,
>
> I'm taking it on a machine-by-machine basis.
>
> For example it
> turns out all OMAP2 that use mmci-omap are Nokia n800, n810
> and the H4 reference design.
>
> So I am seeing if these can be excluded from the "most omap2plus
> systems" list.

Unfortunately excluding Nokia n8x0 would turn the omap2plus
defconfig into an omap3plus_defconfig effectively.

This is also something that has come up repeatedly in the
past though: omap24xx (and imx31 to a lesser degree) support
in multiplatform kernels is really annoying because it
requires a lot of special cases to also work on SMP enabled
machines, and it completely breaks on ARMv8-aarch32 machines.

     Arnd
Linus Walleij Jan. 24, 2024, 1:16 p.m. UTC | #25
On Wed, Jan 24, 2024 at 1:55 PM Arnd Bergmann <arnd@arndb.de> wrote:

> > So I am seeing if these can be excluded from the "most omap2plus
> > systems" list.
>
> Unfortunately excluding Nokia n8x0 would turn the omap2plus
> defconfig into an omap3plus_defconfig effectively.

I did it like this:

@@ -135,7 +135,7 @@ config ARCH_OMAP2PLUS_TYPICAL
        bool "Typical OMAP configuration"
        default y
        select AEABI
-       select HIGHMEM
+       select HIGHMEM if !SOC_OMAP2420

Effectively disabling HIGHMEM when using omap2plus_defconfig.

If we want all systems supported, we just apply this at the expense
of highmem for OMAP 2430, OMAP3 and OMAP4 and the

We can then either

- Disable SOC_OMAP2420 in omap2plus_defconfig (I made a
  patch for this) turning it
  into an omap3plus_defconfig as you say

or

- Actually add a new defconfig named omap3plus_defconfig
  with highmem enabled but SOC_OMAP2420 disabled.

I don't know which option is the lesser evil ... it's a bit hairy.

(A third option would be to reexamine runtime restriction options...)

Yours,
Linus Walleij
Linus Walleij Jan. 24, 2024, 1:49 p.m. UTC | #26
On Mon, Jan 22, 2024 at 10:26 AM Arnd Bergmann <arnd@arndb.de> wrote:

> I found five drivers that have a legacy platform device
> definition without a DMA mask:
>
> arch/m68k/coldfire/device.c: "sdhci-esdhc-mcf"
> arch/arm/mach-omap1/devices.c: "mmci-omap" (slave DMA)
> arch/sh/boards/board-sh7757lcr.c: "sh_mmcif" (slave DMA)
> arch/sh/boards/mach-ecovec24/setup.c: sh_mmcif" (slave DMA)
> arch/sh/boards/mach-*/setup.c: "sh_mobile_sdhi" (slave DMA)
> drivers/misc/cb710/core.c: "cb710-mmc" (pio-only)
>
> None of these embedded platforms actually have highmem,
> though the omap1 machine may run a kernel that has highmem
> support enabled.
>
> Most of the others only support DT based probing after we
> removed a lot of old board files a year ago, so they will
> always have a 32-bit mask set at probe time.

For sh_mmcif I just added dma_mask and coherent_dma_mask
as DMA_BIT_MASK(32) in the boardfile, and I'm considering doing it
for pretty much all of them: if they
- run without HIGHMEM enabled, and
- with highmem bounce buffers around to PKMAP (right?) when
  BLK_BOUNCE_HIGH is set

then that kind of indicates that they are
probably 32-bit DMA capable, pretty much as the device trees
assume in most cases.

This avoids doing Kconfig trickery, makes it runtime handled,
and we can delete BLK_BOUNCE_HIGH, as that branch is
never taken, and just refuse to probe if dma_mask == 0.

Yours,
Linus Walleij
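
A minimal sketch of the probe-time check being suggested, assuming it would
live in the mmc core where BLK_BOUNCE_HIGH is currently set up; illustrative
only, not an actual patch:

	/*
	 * Hypothetical replacement for the BLK_BOUNCE_HIGH fallback: instead
	 * of bouncing, refuse to bind to a host device without a DMA mask.
	 */
	if (!mmc_dev(host)->dma_mask || !*mmc_dev(host)->dma_mask) {
		dev_err(mmc_dev(host), "no DMA mask set, not probing\n");
		return -ENODEV;
	}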
Arnd Bergmann Jan. 24, 2024, 2:14 p.m. UTC | #27
On Wed, Jan 24, 2024, at 14:16, Linus Walleij wrote:
> On Wed, Jan 24, 2024 at 1:55 PM Arnd Bergmann <arnd@arndb.de> wrote:
>
>> > So I am seeing if these can be excluded from the "most omap2plus
>> > systems" list.
>>
>> Unfortunately excluding Nokia n8x0 would turn the omap2plus
>> defconfig into an omap3plus_defconfig effectively.
>
> I did like this:
>
> @@ -135,7 +135,7 @@ config ARCH_OMAP2PLUS_TYPICAL
>         bool "Typical OMAP configuration"
>         default y
>         select AEABI
> -       select HIGHMEM
> +       select HIGHMEM if !SOC_OMAP2420
>
> Effectively disabling HIGHMEM when using omap2plus_defconfig.
>
> If we want all systems supported, we just apply this at the expense
> of highmem for OMAP 2430, OMAP3 and OMAP4 and the

As far as I can tell, none of the above actually have more than
1GB of RAM, as OMAP4/AM4 maxes out at a single 8Gbit LPDDR2
RAM. For those machines, using CONFIG_VMSPLIT_3G_OPT is likely
going to be much better than CONFIG_HIGHMEM anyway.

Unfortunately, this does not work for OMAP5/AM5/DRA7, which
can have 2GB or possibly 4GB (as used in the Pyra) of DDR3,
so we'd still lose.

> We can then either
>
> - Disable SOC_OMAP2420 in omap2plus_defconfig (I made a
>   patch for this) turning it
>   into an omap3plus_defconfig as you say
>
> or
>
> - Actually add a new defconfig named omap3plus_defconfig
>   with highmem enabled but SOC_OMAP2420 disabled.
>
> I don't know which option is the lesser evil ... it's a bit hairy.
>
> (A third option would be to reexamine runtime restriction options...)

Or actually using kmap_local_page() in mmc_omap_xfer_data(),
as Christoph suggested.

      Arnd
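
A rough sketch of that conversion pattern; the function, FIFO accessors and
loop structure below are placeholders, not the actual mmc_omap code:

/*
 * Illustrative only: a PIO transfer that maps the scatterlist entry with
 * kmap_local_page() instead of relying on sg_virt(), so it also works when
 * the payload sits in a highmem page.
 */
static void pio_xfer_sg(void __iomem *fifo, struct scatterlist *sg, bool write)
{
	u8 *buf = kmap_local_page(sg_page(sg)) + sg->offset;

	if (write)
		iowrite8_rep(fifo, buf, sg->length);
	else
		ioread8_rep(fifo, buf, sg->length);

	kunmap_local(buf);
}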
Arnd Bergmann Jan. 24, 2024, 4:35 p.m. UTC | #28
On Wed, Jan 24, 2024, at 14:49, Linus Walleij wrote:
> On Mon, Jan 22, 2024 at 10:26 AM Arnd Bergmann <arnd@arndb.de> wrote:
>
>> I found five drivers that have a legacy platform device
>> definition without a DMA mask:
>>
>> arch/m68k/coldfire/device.c: "sdhci-esdhc-mcf"
>> arch/arm/mach-omap1/devices.c: "mmci-omap" (slave DMA)
>> arch/sh/boards/board-sh7757lcr.c: "sh_mmcif" (slave DMA)
>> arch/sh/boards/mach-ecovec24/setup.c: sh_mmcif" (slave DMA)
>> arch/sh/boards/mach-*/setup.c: "sh_mobile_sdhi" (slave DMA)
>> drivers/misc/cb710/core.c: "cb710-mmc" (pio-only)
>>
>> None of these embedded platforms actually have highmem,
>> though the omap1 machine may run a kernel that has highmem
>> support enabled.
>>
>> Most of the others only support DT based probing after we
>> removed a lot of old board files a year ago, so they will
>> always have a 32-bit mask set at probe time.
>
> For sh_mmcif I just added dma_mask and coherent_dma_mask
> as DMA_BIT_MASK(32) in the boardfile

I think technically it's wrong to set the DMA mask
for the sh_mmcif device. The mask is set correctly
for the dmaengine driver, which is the one used in

        ret = dma_map_sg(chan->device->dev, sg, data->sg_len,
                         DMA_FROM_DEVICE);

Since SH never has highmem, I don't think there is
anything that needs to be done for these.

Also for cb710, there is no DMA, and highmem gets
handled correctly through using sg_miter_next() etc.

> and I consider doing it
> for pretty much all of them: If they
> - Run without HIGHMEM enabled and
> - With highmem are bouncing buffers around to PKMAP (right?) when
>   BLK_BOUNCE_HIGH is set

> That kind of indicates that they are
> probably 32bit DMA capable, pretty much as the device trees
> assumes in most cases.

> This avoids doing Kconfig trickery, make it runtime handled
> and we can delete BLK_BOUNCE_HIGH as that branch is
> never taken and just refuse to probe if dma_mask == 0.

We can probably treat this as two different issues now:

a) drivers that don't set a DMA mask get the block layer
   bounce buffers, and as far as I can tell none of these
   actually need the bounce buffers, so we can already
   remove the BLK_BOUNCE_HIGH setting without causing
   a change in behavior. cb710 is likely to see a
   small performance improvement when used on highmem
   systems without BLK_BOUNCE_HIGH, but it's a very slow
   and old device, so nobody will notice.

b) drivers that use sg_virt() instead of kmap_local_page()
   or sg_miter are currently broken if they run on
   systems using highmem, as there is no bounce buffer
   when the DMA mask is set. Some of these may have
   used BLK_BOUNCE_HIGH in the past, before the conversion
   to DT probing.

It's the second point that requires the Kconfig dependency.

      Arnd

Patch

diff --git a/block/blk-core.c b/block/blk-core.c
index 9520ccab305007..885ba6bb58556f 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -592,17 +592,21 @@  static inline blk_status_t blk_check_zone_append(struct request_queue *q,
 
 static void __submit_bio(struct bio *bio)
 {
+	struct gendisk *disk = bio->bi_bdev->bd_disk;
+
 	if (unlikely(!blk_crypto_bio_prep(&bio)))
 		return;
+	if (unlikely(bio_queue_enter(bio)))
+		return;
 
 	if (!bio->bi_bdev->bd_has_submit_bio) {
-		blk_mq_submit_bio(bio);
-	} else if (likely(bio_queue_enter(bio) == 0)) {
-		struct gendisk *disk = bio->bi_bdev->bd_disk;
-
+		if (blk_mq_submit_bio(bio))
+			return;
+	} else {
 		disk->fops->submit_bio(bio);
-		blk_queue_exit(disk->queue);
 	}
+
+	blk_queue_exit(disk->queue);
 }
 
 /*
diff --git a/block/blk-mq.c b/block/blk-mq.c
index a6731cacd0132c..421db29535ba50 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2936,14 +2936,17 @@  static void bio_set_ioprio(struct bio *bio)
  *
  * It will not queue the request if there is an error with the bio, or at the
  * request creation.
+ *
+ * Returns %true if the q_usage_counter usage is consumed, or %false if it
+ * isn't.
  */
-void blk_mq_submit_bio(struct bio *bio)
+bool blk_mq_submit_bio(struct bio *bio)
 {
 	struct request_queue *q = bdev_get_queue(bio->bi_bdev);
 	struct blk_plug *plug = blk_mq_plug(bio);
 	const int is_sync = op_is_sync(bio->bi_opf);
 	struct blk_mq_hw_ctx *hctx;
-	struct request *rq = NULL;
+	struct request *rq;
 	unsigned int nr_segs = 1;
 	blk_status_t ret;
 
@@ -2951,39 +2954,28 @@  void blk_mq_submit_bio(struct bio *bio)
 	if (bio_may_exceed_limits(bio, &q->limits)) {
 		bio = __bio_split_to_limits(bio, &q->limits, &nr_segs);
 		if (!bio)
-			return;
+			return false;
 	}
 
 	bio_set_ioprio(bio);
 
+	if (!bio_integrity_prep(bio))
+		return false;
+
 	if (plug) {
 		rq = rq_list_peek(&plug->cached_rq);
-		if (rq && rq->q != q)
-			rq = NULL;
-	}
-	if (rq) {
-		if (!bio_integrity_prep(bio))
-			return;
-		if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
-			return;
-		if (blk_mq_use_cached_rq(rq, plug, bio))
-			goto done;
-		percpu_ref_get(&q->q_usage_counter);
-	} else {
-		if (unlikely(bio_queue_enter(bio)))
-			return;
-		if (!bio_integrity_prep(bio))
-			goto fail;
+		if (rq && rq->q == q) {
+			if (blk_mq_attempt_bio_merge(q, bio, nr_segs))
+				return false;
+			if (blk_mq_use_cached_rq(rq, plug, bio))
+				goto has_rq;
+		}
 	}
 
 	rq = blk_mq_get_new_requests(q, plug, bio, nr_segs);
-	if (unlikely(!rq)) {
-fail:
-		blk_queue_exit(q);
-		return;
-	}
-
-done:
+	if (unlikely(!rq))
+		return false;
+has_rq:
 	trace_block_getrq(bio);
 
 	rq_qos_track(q, rq, bio);
@@ -2995,15 +2987,15 @@  void blk_mq_submit_bio(struct bio *bio)
 		bio->bi_status = ret;
 		bio_endio(bio);
 		blk_mq_free_request(rq);
-		return;
+		return true;
 	}
 
 	if (op_is_flush(bio->bi_opf) && blk_insert_flush(rq))
-		return;
+		return true;
 
 	if (plug) {
 		blk_add_rq_to_plug(plug, rq);
-		return;
+		return true;
 	}
 
 	hctx = rq->mq_hctx;
@@ -3014,6 +3006,8 @@  void blk_mq_submit_bio(struct bio *bio)
 	} else {
 		blk_mq_run_dispatch_ops(q, blk_mq_try_issue_directly(hctx, rq));
 	}
+
+	return true;
 }
 
 #ifdef CONFIG_BLK_MQ_STACKING
diff --git a/block/blk-mq.h b/block/blk-mq.h
index f75a9ecfebde1b..d45f222f117748 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -39,7 +39,7 @@  enum {
 typedef unsigned int __bitwise blk_insert_t;
 #define BLK_MQ_INSERT_AT_HEAD		((__force blk_insert_t)0x01)
 
-void blk_mq_submit_bio(struct bio *bio);
+bool blk_mq_submit_bio(struct bio *bio);
 int blk_mq_poll(struct request_queue *q, blk_qc_t cookie, struct io_comp_batch *iob,
 		unsigned int flags);
 void blk_mq_exit_queue(struct request_queue *q);