Message ID | 20250102015620.500754-1-ming.lei@redhat.com (mailing list archive)
---|---
State | New
Series | block: make queue limits workable in case of 64K PAGE_SIZE
On 1/1/25 5:56 PM, Ming Lei wrote:
> In RH lab, it has been found that max segment size of some mmc card is
> less than 64K, then this kind of card can't work in case of 64K PAGE_SIZE.

That means that MMC card is incompatible with the 64K page size.

Additionally, this patch looks wrong to me. There are good reasons why the
block layer requires that the DMA segment size is at least as large as the
page size.

You may want to take a look at this rejected patch series: Bart Van Assche,
"[PATCH v6 0/8] Support limits below the page size", June 2023
(https://lore.kernel.org/linux-block/20230612203314.17820-1-bvanassche@acm.org/).

Bart.
On Wed, Jan 01, 2025 at 06:30:30PM -0800, Bart Van Assche wrote:
> On 1/1/25 5:56 PM, Ming Lei wrote:
> > In RH lab, it has been found that max segment size of some mmc card is
> > less than 64K, then this kind of card can't work in case of 64K PAGE_SIZE.
>
> That means that MMC card is incompatible with the 64K page size.

Is there any such linux kernel compatibility standard? If yes, please share
it here.

> Additionally, this patch looks wrong to me. There are good reasons why the
> block layer requires that the DMA segment size is at least as large
> as the page size.

Do you think a 512-byte sector can't be DMAed?

> You may want to take a look at this rejected patch series:
> Bart Van Assche, "[PATCH v6 0/8] Support limits below the page size",
> June 2023 (https://lore.kernel.org/linux-block/20230612203314.17820-1-bvanassche@acm.org/).

'502 Bad Gateway' is returned for the above link.

Thanks,
Ming
On 1/1/25 6:49 PM, Ming Lei wrote:
> On Wed, Jan 01, 2025 at 06:30:30PM -0800, Bart Van Assche wrote:
>> Additionally, this patch looks wrong to me. There are good reasons why the
>> block layer requires that the DMA segment size is at least as large
>> as the page size.
>
> Do you think a 512-byte sector can't be DMAed?

A clarification: I was referring to the maximum DMA segment size. The
512 byte number in your email refers to the actual DMA transfer size. My
statement does not apply to the actual DMA transfer size.

>> You may want to take a look at this rejected patch series:
>> Bart Van Assche, "[PATCH v6 0/8] Support limits below the page size",
>> June 2023 (https://lore.kernel.org/linux-block/20230612203314.17820-1-bvanassche@acm.org/).
>
> '502 Bad Gateway' is returned for the above link.

That link works fine here on multiple devices (laptop, workstation,
smartphone), so please check your setup.

Thanks,
Bart.
On Wed, Jan 01, 2025 at 07:46:41PM -0800, Bart Van Assche wrote:
> On 1/1/25 6:49 PM, Ming Lei wrote:
> > On Wed, Jan 01, 2025 at 06:30:30PM -0800, Bart Van Assche wrote:
> > > Additionally, this patch looks wrong to me. There are good reasons why the
> > > block layer requires that the DMA segment size is at least as large
> > > as the page size.
> >
> > Do you think a 512-byte sector can't be DMAed?
>
> A clarification: I was referring to the maximum DMA segment size. The
> 512 byte number in your email refers to the actual DMA transfer size. My
> statement does not apply to the actual DMA transfer size.

But why does the DMA segment size have to be >= PAGE_SIZE (4KB, 64KB)?

I don't think the constraint comes from hardware, because DMA controllers
are capable of doing DMA on buffers with smaller alignment & size. IMO,
the limit actually comes from the block layer, which is why I posted this
patch.

> > > You may want to take a look at this rejected patch series:
> > > Bart Van Assche, "[PATCH v6 0/8] Support limits below the page size",
> > > June 2023 (https://lore.kernel.org/linux-block/20230612203314.17820-1-bvanassche@acm.org/).
> >
> > '502 Bad Gateway' is returned for the above link.
>
> That link works fine here on multiple devices (laptop, workstation,
> smartphone), so please check your setup.

It is reachable now. From the link, you have storage controllers with a DMA
segment size of less than 4K, which may never be supported by the linux
kernel. I am looking at hardware which works fine on a kernel with a 4K
page size, but stops working when the kernel is built with a 64K page size,
which confuses final kernel users.

Thanks,
Ming
On 1/2/25 6:01 PM, Ming Lei wrote:
> But why does the DMA segment size have to be >= PAGE_SIZE (4KB, 64KB)?

From the description of patch 5/8 of my patch series: "If the segment size
is smaller than the page size there may be multiple segments per bvec even
if a bvec only contains a single page." The current block layer
implementation is based on the assumption that a single page fits in a
single DMA segment. Please take a look at patch 5/8 of my patch series.

> From the link, you have storage controllers with a DMA segment size of
> less than 4K, which may never be supported by the linux kernel.

As mentioned in the cover letter of that patch series, I came up with that
patch series to support a DMA controller with a max_segment_size of 4 KiB
on a system with a PAGE_SIZE of 16 KiB.

Thanks,
Bart.
On Fri, Jan 03, 2025 at 02:12:36PM -0800, Bart Van Assche wrote:
> On 1/2/25 6:01 PM, Ming Lei wrote:
> > But why does the DMA segment size have to be >= PAGE_SIZE (4KB, 64KB)?
>
> From the description of patch 5/8 of my patch series: "If the segment
> size is smaller than the page size there may be multiple segments per
> bvec even if a bvec only contains a single page." The current block
> layer implementation is based on the assumption that a single page
> fits in a single DMA segment. Please take a look at patch 5/8 of my
> patch series.

While that addresses a segment smaller than the page size, this still
leaves open the question of whether a DMA segment can be larger than the
page size. Given our adoption of folios, I don't see why we would
constrain ourselves to the page size. I already have PoC code which
enables this.

Luis
On 1/3/25 5:47 PM, Luis Chamberlain wrote:
> While that addresses a segment smaller than the page size, this still
> leaves open the question of whether a DMA segment can be larger than
> the page size,

Hmm ... aren't max_segment_size values that are larger than the page size
supported since day one of the Linux kernel? Or are you perhaps referring
to Ming's multi-page bvec patch series that was merged about six years
ago?

Thanks,
Bart.
On Fri, Jan 03, 2025 at 06:15:55PM -0800, Bart Van Assche wrote:
> On 1/3/25 5:47 PM, Luis Chamberlain wrote:
> > While that addresses a segment smaller than the page size, this still
> > leaves open the question of whether a DMA segment can be larger than
> > the page size,
> Hmm ... aren't max_segment_size values that are larger than the page
> size supported since day one of the Linux kernel? Or are you perhaps
> referring to Ming's multi-page bvec patch series that was merged about
> six years ago?

Try aiming for a single 2 MiB IO on x86_64 on NVMe; that is currently not
possible. At the NVMe maximum of 128 DMA segments, with 4 KiB per DMA
segment, we get a 512 KiB IO limit. Should multi-page bvecs enable lifting
this?

Luis
On 1/3/25 8:04 PM, Luis Chamberlain wrote:
> Try aiming for a single 2 MiB IO on x86_64 on NVMe; that is currently
> not possible. At the NVMe maximum of 128 DMA segments, with 4 KiB per
> DMA segment, we get a 512 KiB IO limit. Should multi-page bvecs enable
> lifting this?

4 KiB per DMA segment for NVMe? I think that the DMA segment size limit
for PRP and SGL modes is much larger than 4 KiB. See also the description
of the CC.MPS parameter and PRP Lists in the NVMe base specification. From
a system with an NVMe controller:

$ cat /sys/block/nvme0n1/queue/max_segment_size
4294967295

Thanks,
Bart.
On Fri, Jan 03, 2025 at 08:04:32PM -0800, Luis Chamberlain wrote:
> On Fri, Jan 03, 2025 at 06:15:55PM -0800, Bart Van Assche wrote:
> > On 1/3/25 5:47 PM, Luis Chamberlain wrote:
> > > While that addresses a segment smaller than the page size, this
> > > still leaves open the question of whether a DMA segment can be
> > > larger than the page size,
> > Hmm ... aren't max_segment_size values that are larger than the page
> > size supported since day one of the Linux kernel? Or are you perhaps
> > referring to Ming's multi-page bvec patch series that was merged about
> > six years ago?
>
> Try aiming for a single 2 MiB IO on x86_64 on NVMe; that is currently
> not possible. At the NVMe maximum of 128 DMA segments, with 4 KiB per
> DMA segment, we get a 512 KiB IO limit. Should multi-page bvecs enable
> lifting this?

You need huge pages to guarantee you can reach those transfer sizes in a
single command. Otherwise you need to get lucky with larger contiguous
folios, which can certainly happen, but it's just not as deterministic.

There are many nvme devices that absolutely require 4 KiB as the largest
segment, but the driver says it can handle much larger segments anyway.
It's trivial for the driver to split a large bio vec into a bunch of
smaller device-specific descriptors.
On Fri, Jan 03, 2025 at 02:12:36PM -0800, Bart Van Assche wrote:
> On 1/2/25 6:01 PM, Ming Lei wrote:
> > But why does the DMA segment size have to be >= PAGE_SIZE (4KB, 64KB)?
>
> From the description of patch 5/8 of my patch series: "If the segment
> size is smaller than the page size there may be multiple segments per
> bvec even if a bvec only contains a single page." The current block
> layer implementation is based on the assumption that a single page
> fits in a single DMA segment. Please take a look at patch 5/8 of my
> patch series.

OK, I guess you agree it is a block layer constraint now, one which needs
to be relaxed for both big logical block sizes and >4k PAGE_SIZE. Yes,
your patch 5/8 is still needed.

> > From the link, you have storage controllers with a DMA segment size of
> > less than 4K, which may never be supported by the linux kernel.
>
> As mentioned in the cover letter of that patch series, I came up with
> that patch series to support a DMA controller with a max_segment_size of
> 4 KiB on a system with a PAGE_SIZE of 16 KiB.

Probably the concept of subpage needs to be avoided, because PAGE_SIZE is
a config option, and I don't see a strong reason to couple fixed &
readable hardware properties with a configurable kernel page size.

Thanks,
Ming
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 8f09e33f41f6..3b6918c134b1 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -300,7 +300,7 @@ int blk_validate_limits(struct queue_limits *lim)
 	max_hw_sectors = min_not_zero(lim->max_hw_sectors,
 			lim->max_dev_sectors);
 	if (lim->max_user_sectors) {
-		if (lim->max_user_sectors < PAGE_SIZE / SECTOR_SIZE)
+		if (lim->max_user_sectors < SZ_4K / SECTOR_SIZE)
 			return -EINVAL;
 		lim->max_sectors = min(max_hw_sectors, lim->max_user_sectors);
 	} else if (lim->io_opt > (BLK_DEF_MAX_SECTORS_CAP << SECTOR_SHIFT)) {
@@ -338,7 +338,7 @@ int blk_validate_limits(struct queue_limits *lim)
 	 */
 	if (!lim->seg_boundary_mask)
 		lim->seg_boundary_mask = BLK_SEG_BOUNDARY_MASK;
-	if (WARN_ON_ONCE(lim->seg_boundary_mask < PAGE_SIZE - 1))
+	if (WARN_ON_ONCE(lim->seg_boundary_mask < SZ_4K - 1))
 		return -EINVAL;

 	/*
@@ -359,7 +359,7 @@ int blk_validate_limits(struct queue_limits *lim)
 		 */
 		if (!lim->max_segment_size)
 			lim->max_segment_size = BLK_MAX_SEGMENT_SIZE;
-		if (WARN_ON_ONCE(lim->max_segment_size < PAGE_SIZE))
+		if (WARN_ON_ONCE(lim->max_segment_size < SZ_4K))
 			return -EINVAL;
 	}

diff --git a/block/blk.h b/block/blk.h
index 2c26abf505b8..570d66f1e338 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -357,7 +357,7 @@ static inline bool bio_may_need_split(struct bio *bio,
 		const struct queue_limits *lim)
 {
 	return lim->chunk_sectors || bio->bi_vcnt != 1 ||
-		bio->bi_io_vec->bv_len + bio->bi_io_vec->bv_offset > PAGE_SIZE;
+		bio->bi_io_vec->bv_len + bio->bi_io_vec->bv_offset > SZ_4K;
 }

 /**
PAGE_SIZE is applied in some block device queue limits. This is fragile
and wrong:

- queue limits are read from hardware, and are often read-only hardware
  properties
- PAGE_SIZE is a config option which can be changed at build time

In the RH lab, it has been found that the max segment size of some mmc
card is less than 64K, so this kind of card can't work with a 64K
PAGE_SIZE.

Fix this issue by using SZ_4K in the related code for dealing with queue
limits.

Cc: Yi Zhang <yi.zhang@redhat.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: John Garry <john.g.garry@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-settings.c | 6 +++---
 block/blk.h          | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)