
block: make queue limits workable in case of 64K PAGE_SIZE

Message ID 20250102015620.500754-1-ming.lei@redhat.com (mailing list archive)
State New
Series block: make queue limits workable in case of 64K PAGE_SIZE

Commit Message

Ming Lei Jan. 2, 2025, 1:56 a.m. UTC
PAGE_SIZE is used in some block device queue limits. This is fragile
and wrong:

- queue limits are read from hardware, and are often read-only
hardware properties

- PAGE_SIZE is a config option which can be changed at build time.

In a Red Hat lab, it has been found that the max segment size of some
MMC cards is less than 64K, so this kind of card can't work when the
kernel is built with a 64K PAGE_SIZE.

Fix this issue by using SZ_4K in the related queue limit code.
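
For illustration, a minimal sketch of how such a hardware limit trips the
current validation on a 64K-page kernel (the 32K value below is hypothetical,
not taken from a real controller):

	struct queue_limits lim = {
		.max_segment_size = SZ_32K,	/* hypothetical hardware limit */
	};

	/*
	 * With a 64K PAGE_SIZE, blk_validate_limits() currently hits
	 *
	 *	if (WARN_ON_ONCE(lim->max_segment_size < PAGE_SIZE))
	 *		return -EINVAL;
	 *
	 * so the device is rejected, even though the same limit passes
	 * on a 4K-page build.
	 */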

Cc: Yi Zhang <yi.zhang@redhat.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: John Garry <john.g.garry@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-settings.c | 6 +++---
 block/blk.h          | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

Comments

Bart Van Assche Jan. 2, 2025, 2:30 a.m. UTC | #1
On 1/1/25 5:56 PM, Ming Lei wrote:
> In a Red Hat lab, it has been found that the max segment size of some MMC
> cards is less than 64K, so this kind of card can't work when the kernel is
> built with a 64K PAGE_SIZE.

That means that that MMC card is incompatible with the 64K page size.

Additionally, this patch looks wrong to me. There are good reasons why 
the block layer requires that the DMA segment size is at least as large
as the page size.

You may want to take a look at this rejected patch series:
Bart Van Assche, "PATCH v6 0/8] Support limits below the page size",
June 2023 
(https://lore.kernel.org/linux-block/20230612203314.17820-1-bvanassche@acm.org/).

Bart.
Ming Lei Jan. 2, 2025, 2:49 a.m. UTC | #2
On Wed, Jan 01, 2025 at 06:30:30PM -0800, Bart Van Assche wrote:
> On 1/1/25 5:56 PM, Ming Lei wrote:
> > In a Red Hat lab, it has been found that the max segment size of some MMC
> > cards is less than 64K, so this kind of card can't work when the kernel is
> > built with a 64K PAGE_SIZE.
> 
> That means that that MMC card is incompatible with the 64K page size.

Is there any such Linux kernel compatibility standard? If so, please
share it here.

> 
> Additionally, this patch looks wrong to me. There are good reasons why the
> block layer requires that the DMA segment size is at least as large
> as the page size.

Do you think a 512-byte sector can't be DMAed?

> 
> You may want to take a look at this rejected patch series:
> Bart Van Assche, "PATCH v6 0/8] Support limits below the page size",
> June 2023 (https://lore.kernel.org/linux-block/20230612203314.17820-1-bvanassche@acm.org/).

'502 Bad Gateway' is returned for the above link.

Thanks,
Ming
Bart Van Assche Jan. 2, 2025, 3:46 a.m. UTC | #3
On 1/1/25 6:49 PM, Ming Lei wrote:
> On Wed, Jan 01, 2025 at 06:30:30PM -0800, Bart Van Assche wrote:
>> Additionally, this patch looks wrong to me. There are good reasons why the
>> block layer requires that the DMA segment size is at least as large
>> as the page size.
> 
> Do you think a 512-byte sector can't be DMAed?

A clarification: I was referring to the maximum DMA segment size. The
512 byte number in your email refers to the actual DMA transfer size. My
statement does not apply to the actual DMA transfer size.

>> You may want to take a look at this rejected patch series:
>> Bart Van Assche, "PATCH v6 0/8] Support limits below the page size",
>> June 2023 (https://lore.kernel.org/linux-block/20230612203314.17820-1-bvanassche@acm.org/).
> 
> '502 Bad Gateway' is returned for the above link.

That link works fine here on multiple devices (laptop, workstation,
smartphone) so please check your setup.

Thanks,

Bart.
Ming Lei Jan. 3, 2025, 2:01 a.m. UTC | #4
On Wed, Jan 01, 2025 at 07:46:41PM -0800, Bart Van Assche wrote:
> On 1/1/25 6:49 PM, Ming Lei wrote:
> > On Wed, Jan 01, 2025 at 06:30:30PM -0800, Bart Van Assche wrote:
> > > Additionally, this patch looks wrong to me. There are good reasons why the
> > > block layer requires that the DMA segment size is at least as large
> > > as the page size.
> > 
> > Do you think a 512-byte sector can't be DMAed?
> 
> A clarification: I was referring to the maximum DMA segment size. The
> 512 byte number in your email refers to the actual DMA transfer size. My
> statement does not apply to the actual DMA transfer size.

But why does DMA segment size have to be >= PAGE_SIZE(4KB, 64KB)?

I don't think the constraint comes from hardware, because DMA
controllers are capable of doing DMA on buffers with smaller alignment
& size.

IMO, the limit actually comes from the block layer, which is why I
posted this patch.

> 
> > > You may want to take a look at this rejected patch series:
> > > Bart Van Assche, "PATCH v6 0/8] Support limits below the page size",
> > > June 2023 (https://lore.kernel.org/linux-block/20230612203314.17820-1-bvanassche@acm.org/).
> > 
> > '502 Bad Gateway' is returned for the above link.
> 
> That link works fine here on multiple devices (laptop, workstation,
> smartphone) so please check your setup.

It is reachable now.

From the link, you have storage controllers with a DMA segment size of
less than 4K, which may never be supported by the Linux kernel.

I am looking at hardware which works fine on a kernel with a 4K page size,
but becomes unusable when the kernel is built with a 64K page size, which
confuses end users.


Thanks,
Ming
Bart Van Assche Jan. 3, 2025, 10:12 p.m. UTC | #5
On 1/2/25 6:01 PM, Ming Lei wrote:
> But why does DMA segment size have to be >= PAGE_SIZE(4KB, 64KB)?

 From the description of patch 5/8 of my patch series: "If the segment
size is smaller than the page size there may be multiple segments per
bvec even if a bvec only contains a single page." The current block
layer implementation is based on the assumption that a single page
fits in a single DMA segment. Please take a look at patch 5/8 of my
patch series.
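
A quick sketch of the arithmetic behind that assumption (the helper name
and the 16K limit are illustrative only, not code from the series):

	static unsigned int segs_per_bvec(unsigned int bv_len,
					  unsigned int max_seg_size)
	{
		return DIV_ROUND_UP(bv_len, max_seg_size);
	}

	/* A bvec covering one 64K page with a 16K max_segment_size needs
	 * segs_per_bvec(SZ_64K, SZ_16K) == 4 DMA segments, not 1. */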

> From the link, you have storage controllers with a DMA segment size of
> less than 4K, which may never be supported by the Linux kernel.

As mentioned in the cover letter of that patch series, I came up with
that patch series to support a DMA controller with max_segment_size of
4 KiB on a system with a PAGE_SIZE of 16 KiB.

Thanks,

Bart.
Luis Chamberlain Jan. 4, 2025, 1:47 a.m. UTC | #6
On Fri, Jan 03, 2025 at 02:12:36PM -0800, Bart Van Assche wrote:
> On 1/2/25 6:01 PM, Ming Lei wrote:
> > But why does DMA segment size have to be >= PAGE_SIZE(4KB, 64KB)?
> 
> From the description of patch 5/8 of my patch series: "If the segment
> size is smaller than the page size there may be multiple segments per
> bvec even if a bvec only contains a single page." The current block
> layer implementation is based on the assumption that a single page
> fits in a single DMA segment. Please take a look at patch 5/8 of my
> patch series.

While that addresses a segment smaller than the page size, this still leaves
open the question of whether a DMA segment can be larger than the page size.
Given our adoption of folios, I don't see why we would constrain ourselves
to a page size. I already have PoC code which enables this.

  Luis
Bart Van Assche Jan. 4, 2025, 2:15 a.m. UTC | #7
On 1/3/25 5:47 PM, Luis Chamberlain wrote:
> While that addresses a segment smaller than the page size, this still leaves
> open the question of whether a DMA segment can be larger than the page size.
Hmm ... aren't max_segment_size values that are larger than the page
size supported since day one of the Linux kernel? Or are you perhaps
referring to Ming's multi-page bvec patch series that was merged about
six years ago?

Thanks,

Bart.
Luis Chamberlain Jan. 4, 2025, 4:04 a.m. UTC | #8
On Fri, Jan 03, 2025 at 06:15:55PM -0800, Bart Van Assche wrote:
> On 1/3/25 5:47 PM, Luis Chamberlain wrote:
> > While that addresses a segment smaller than the page size, this still leaves
> > open the question of whether a DMA segment can be larger than the page size.
> Hmm ... aren't max_segment_size values that are larger than the page
> size supported since day one of the Linux kernel? Or are you perhaps
> referring to Ming's multi-page bvec patch series that was merged about
> six years ago?

Try aiming high: a single 2 MiB IO on x86_64 on NVMe is currently not
possible. At the NVMe maximum of 128 DMA segments, with 4 KiB per DMA
segment, we have a 512 KiB IO limit. Should multi-page bvecs enable
lifting this?
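
Spelling out that arithmetic (same numbers as above):

	/*
	 * 128 segments * 4 KiB/segment = 512 KiB per IO
	 * 2 MiB / (4 KiB/segment)      = 512 segments needed,
	 *                                4x the 128-segment maximum
	 */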

  Luis
Bart Van Assche Jan. 4, 2025, 10:30 p.m. UTC | #9
On 1/3/25 8:04 PM, Luis Chamberlain wrote:
> Try aiming high: a single 2 MiB IO on x86_64 on NVMe is currently not
> possible. At the NVMe maximum of 128 DMA segments, with 4 KiB per DMA
> segment, we have a 512 KiB IO limit. Should multi-page bvecs enable
> lifting this?

4 KiB per DMA segment for NVMe? I think that the DMA segment size limit 
for PRP and SGL modes is much larger than 4 KiB. See also the
description of the CC.MPS parameter and PRP Lists in the NVMe base
specification. From a system with an NVMe controller:

$ cat /sys/block/nvme0n1/queue/max_segment_size
4294967295

Thanks,

Bart.
Keith Busch Jan. 5, 2025, 12:59 a.m. UTC | #10
On Fri, Jan 03, 2025 at 08:04:32PM -0800, Luis Chamberlain wrote:
> On Fri, Jan 03, 2025 at 06:15:55PM -0800, Bart Van Assche wrote:
> > On 1/3/25 5:47 PM, Luis Chamberlain wrote:
> > > While that addresses a segment smaller than the page size, this still leaves
> > > open the question of whether a DMA segment can be larger than the page size.
> > Hmm ... aren't max_segment_size values that are larger than the page
> > size supported since day one of the Linux kernel? Or are you perhaps
> > referring to Ming's multi-page bvec patch series that was merged about
> > six years ago?
> 
> Try aiming high: a single 2 MiB IO on x86_64 on NVMe is currently not
> possible. At the NVMe maximum of 128 DMA segments, with 4 KiB per DMA
> segment, we have a 512 KiB IO limit. Should multi-page bvecs enable
> lifting this?

You need huge pages to guarantee you can reach those transfer sizes in a
single command. Otherwise you need to get lucky with larger contiguous
folios, which can certainly happen but it's just not as deterministic.

There are many NVMe devices that absolutely require 4 KiB as the largest
segment, but the driver says it can handle much larger segments anyway.
It's trivial for the driver to split a large bio vec into a bunch of
smaller device-specific descriptors.
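
A rough sketch of that splitting idea (prp_list, dma_addr and nents are
placeholders, not the actual nvme driver code; assumes dma_addr is 4 KiB
aligned):

	/* Emit one 4 KiB PRP-style descriptor per device-sized chunk of
	 * a large, physically contiguous bvec. */
	for (offset = 0; offset < bv_len; offset += SZ_4K)
		prp_list[nents++] = cpu_to_le64(dma_addr + offset);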
Ming Lei Jan. 6, 2025, 1:20 a.m. UTC | #11
On Fri, Jan 03, 2025 at 02:12:36PM -0800, Bart Van Assche wrote:
> On 1/2/25 6:01 PM, Ming Lei wrote:
> > But why does DMA segment size have to be >= PAGE_SIZE(4KB, 64KB)?
> 
> From the description of patch 5/8 of my patch series: "If the segment
> size is smaller than the page size there may be multiple segments per
> bvec even if a bvec only contains a single page." The current block
> layer implementation is based on the assumption that a single page
> fits in a single DMA segment. Please take a look at patch 5/8 of my
> patch series.

OK, I guess you agree it is a block layer constraint now, which needs
to be relaxed for both big logical block sizes and >4K PAGE_SIZE.

Yes, your patch 5/8 is still needed.

> 
> > From the link, you have storage controllers with a DMA segment size of
> > less than 4K, which may never be supported by the Linux kernel.
> 
> As mentioned in the cover letter of that patch series, I came up with
> that patch series to support a DMA controller with max_segment_size of
> 4 KiB on a system with a PAGE_SIZE of 16 KiB.

Probably the concept of subpage should be avoided, because PAGE_SIZE is
a config option, and I don't see a strong reason to couple fixed &
read-only hardware properties with the configurable kernel page size.


Thanks,
Ming

Patch

diff --git a/block/blk-settings.c b/block/blk-settings.c
index 8f09e33f41f6..3b6918c134b1 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -300,7 +300,7 @@  int blk_validate_limits(struct queue_limits *lim)
 	max_hw_sectors = min_not_zero(lim->max_hw_sectors,
 				lim->max_dev_sectors);
 	if (lim->max_user_sectors) {
-		if (lim->max_user_sectors < PAGE_SIZE / SECTOR_SIZE)
+		if (lim->max_user_sectors < SZ_4K / SECTOR_SIZE)
 			return -EINVAL;
 		lim->max_sectors = min(max_hw_sectors, lim->max_user_sectors);
 	} else if (lim->io_opt > (BLK_DEF_MAX_SECTORS_CAP << SECTOR_SHIFT)) {
@@ -338,7 +338,7 @@  int blk_validate_limits(struct queue_limits *lim)
 	 */
 	if (!lim->seg_boundary_mask)
 		lim->seg_boundary_mask = BLK_SEG_BOUNDARY_MASK;
-	if (WARN_ON_ONCE(lim->seg_boundary_mask < PAGE_SIZE - 1))
+	if (WARN_ON_ONCE(lim->seg_boundary_mask < SZ_4K - 1))
 		return -EINVAL;
 
 	/*
@@ -359,7 +359,7 @@  int blk_validate_limits(struct queue_limits *lim)
 		 */
 		if (!lim->max_segment_size)
 			lim->max_segment_size = BLK_MAX_SEGMENT_SIZE;
-		if (WARN_ON_ONCE(lim->max_segment_size < PAGE_SIZE))
+		if (WARN_ON_ONCE(lim->max_segment_size < SZ_4K))
 			return -EINVAL;
 	}
 
diff --git a/block/blk.h b/block/blk.h
index 2c26abf505b8..570d66f1e338 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -357,7 +357,7 @@  static inline bool bio_may_need_split(struct bio *bio,
 		const struct queue_limits *lim)
 {
 	return lim->chunk_sectors || bio->bi_vcnt != 1 ||
-		bio->bi_io_vec->bv_len + bio->bi_io_vec->bv_offset > PAGE_SIZE;
+		bio->bi_io_vec->bv_len + bio->bi_io_vec->bv_offset > SZ_4K;
 }
 
 /**