
[V2] block: make segment size limit workable for > 4K PAGE_SIZE

Message ID 20250210090319.1519778-1-ming.lei@redhat.com (mailing list archive)
State New

Commit Message

Ming Lei Feb. 10, 2025, 9:03 a.m. UTC
PAGE_SIZE is applied in some block device queue limits, this way is
very fragile and is wrong:

- queue limits are read from hardware, which is often one readonly
hardware property

- PAGE_SIZE is one config option which can be changed during build time.

In RH lab, it has been found that max segment size of some mmc card is
less than 64K, then this kind of card can't work in case of 64K PAGE_SIZE.

Fix this issue by using BLK_MIN_SEGMENT_SIZE in related code for dealing
with queue limits and checking if bio needn't split. Define BLK_MIN_SEGMENT_SIZE
as 4K(minimized PAGE_SIZE).

Cc: Yi Zhang <yi.zhang@redhat.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: John Garry <john.g.garry@oracle.com>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/linux-block/20250102015620.500754-1-ming.lei@redhat.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
V2:
	- cover bio_split_rw_at()
	- add BLK_MIN_SEGMENT_SIZE

 block/blk-merge.c      | 2 +-
 block/blk-settings.c   | 6 +++---
 block/blk.h            | 2 +-
 include/linux/blkdev.h | 1 +
 4 files changed, 6 insertions(+), 5 deletions(-)

Comments

Hannes Reinecke Feb. 10, 2025, 12:14 p.m. UTC | #1
On 2/10/25 10:03, Ming Lei wrote:
> PAGE_SIZE is applied in some block device queue limits, this way is
> very fragile and is wrong:
> 
> - queue limits are read from hardware, which is often one readonly
> hardware property
> 
> - PAGE_SIZE is one config option which can be changed during build time.
> 
> In RH lab, it has been found that max segment size of some mmc card is
> less than 64K, then this kind of card can't work in case of 64K PAGE_SIZE.
> 
So why isn't this reflected in the blk_min_segment settings?
Or, rather, why isn't setting blk_min_segment enough?

> Fix this issue by using BLK_MIN_SEGMENT_SIZE in related code for dealing
> with queue limits and checking if bio needn't split. Define BLK_MIN_SEGMENT_SIZE
> as 4K(minimized PAGE_SIZE).
> 
But why 4k then? That is a value like anything else, and what is the 
rationale to use that instead of the more natural sector size?

Cheers,

Hannes
Ming Lei Feb. 10, 2025, 1:26 p.m. UTC | #2
On Mon, Feb 10, 2025 at 01:14:00PM +0100, Hannes Reinecke wrote:
> On 2/10/25 10:03, Ming Lei wrote:
> > PAGE_SIZE is applied in some block device queue limits, this way is
> > very fragile and is wrong:
> > 
> > - queue limits are read from hardware, which is often one readonly
> > hardware property
> > 
> > - PAGE_SIZE is one config option which can be changed during build time.
> > 
> > In RH lab, it has been found that max segment size of some mmc card is
> > less than 64K, then this kind of card can't work in case of 64K PAGE_SIZE.
> > 
> So why isn't this reflected in the blk_min_segment settings?
> Or, rather, why isn't setting blk_min_segment not enough?

There is no min_segment_size setting; the block layer takes PAGE_SIZE
as the actual min_segment_size.

> 
> > Fix this issue by using BLK_MIN_SEGMENT_SIZE in related code for dealing
> > with queue limits and checking if bio needn't split. Define BLK_MIN_SEGMENT_SIZE
> > as 4K(minimized PAGE_SIZE).
> > 
> But why 4k then? That is a value like anything else, and what is the
> rationale to use that instead of the more natural sector size?

The comment explains it already: 4K = min(PAGE_SIZE).


Thanks,
Ming
Luis Chamberlain Feb. 10, 2025, 8:17 p.m. UTC | #3
On Mon, Feb 10, 2025 at 05:03:19PM +0800, Ming Lei wrote:
> PAGE_SIZE is applied in some block device queue limits, this way is
> very fragile and is wrong:
> 
> - queue limits are read from hardware, which is often one readonly
> hardware property
> 
> - PAGE_SIZE is one config option which can be changed during build time.

This is true.

> In RH lab, it has been found that max segment size of some mmc card is
> less than 64K, then this kind of card can't work in case of 64K PAGE_SIZE.

This is true, but check the note on block/blk-merge.c blk_bvec_map_sg().
It would seem that this is a limitation of MMC/SD and that this should
ideally be fixed.

> Fix this issue by using BLK_MIN_SEGMENT_SIZE in related code for dealing
> with queue limits and checking if bio needn't split. Define BLK_MIN_SEGMENT_SIZE
> as 4K(minimized PAGE_SIZE).

But indeed if the block driver isn't yet fixed, then sure, we have to
deal with the issue, I am not convinced that the logic below addresses
this in a generic way, rather it seems to conflate the areas where we
do need the generic block layer min defined, and when we have a block
min segment limit.

> Cc: Yi Zhang <yi.zhang@redhat.com>
> Cc: Luis Chamberlain <mcgrof@kernel.org>
> Cc: John Garry <john.g.garry@oracle.com>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Keith Busch <kbusch@kernel.org>
> Link: https://lore.kernel.org/linux-block/20250102015620.500754-1-ming.lei@redhat.com/
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> V2:
> 	- cover bio_split_rw_at()
> 	- add BLK_MIN_SEGMENT_SIZE
> 
>  block/blk-merge.c      | 2 +-
>  block/blk-settings.c   | 6 +++---
>  block/blk.h            | 2 +-
>  include/linux/blkdev.h | 1 +
>  4 files changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 15cd231d560c..b55c52a42303 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -329,7 +329,7 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
>  
>  		if (nsegs < lim->max_segments &&
>  		    bytes + bv.bv_len <= max_bytes &&
> -		    bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
> +		    bv.bv_offset + bv.bv_len <= BLK_MIN_SEGMENT_SIZE) {
>  			nsegs++;
>  			bytes += bv.bv_len;

I'll note that the 64k BLK_MAX_SEGMENT_SIZE is an old "odd historic" default
value, ie, not a documented hard limit but some odd old thing which
blk_validate_limits() encourages block drivers to override, so a soft
max.

That said, if we validate this soft max and you also validate the min,
shouldn't the value above be lim->max_segment_size instead, provided
that we also address the comment in blk_bvec_map_sg()?

More forward looking -- are you using BLK_MIN_SEGMENT_SIZE here due to
the same mmc/sd limitations? Can we overcome the mmc/sd limitations by
using BLK_MIN_SEGMENT_SIZE only on block drivers which have the
scatterlist limitation?

The rest of the places in your patch seem sensible for BLK_MIN_SEGMENT_SIZE,
although I need to think more about bio_may_need_split() with larger
segments in mind.

  Luis
Ming Lei Feb. 11, 2025, 2:10 a.m. UTC | #4
On Mon, Feb 10, 2025 at 12:17:07PM -0800, Luis Chamberlain wrote:
> On Mon, Feb 10, 2025 at 05:03:19PM +0800, Ming Lei wrote:
> > PAGE_SIZE is applied in some block device queue limits, this way is
> > very fragile and is wrong:
> > 
> > - queue limits are read from hardware, which is often one readonly
> > hardware property
> > 
> > - PAGE_SIZE is one config option which can be changed during build time.
> 
> This is true.
> 
> > In RH lab, it has been found that max segment size of some mmc card is
> > less than 64K, then this kind of card can't work in case of 64K PAGE_SIZE.
> 
> This is true, but check the note on block/blk-merge.c blk_bvec_map_sg().
> It would seem that this is a limitation of MMC/SD and that this should
> ideally be fixed.

The mmc card works just fine with a 4K page size; there isn't any
limitation in mmc/sd from the storage viewpoint. The failure is just
because this card's max segment size is < 64KB with a 64K page size.

> 
> > Fix this issue by using BLK_MIN_SEGMENT_SIZE in related code for dealing
> > with queue limits and checking if bio needn't split. Define BLK_MIN_SEGMENT_SIZE
> > as 4K(minimized PAGE_SIZE).
> 
> But indeed if the block driver isn't yet fixed, then sure, we have to
> deal with the issue, I am not convinced that the logic below addresses
> this in a generic way, rather it seems to conflate the areas where we
> do need the generic block layer min defined, and when we have a block
> min segment limit.
> 
> > Cc: Yi Zhang <yi.zhang@redhat.com>
> > Cc: Luis Chamberlain <mcgrof@kernel.org>
> > Cc: John Garry <john.g.garry@oracle.com>
> > Cc: Bart Van Assche <bvanassche@acm.org>
> > Cc: Keith Busch <kbusch@kernel.org>
> > Link: https://lore.kernel.org/linux-block/20250102015620.500754-1-ming.lei@redhat.com/
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> > V2:
> > 	- cover bio_split_rw_at()
> > 	- add BLK_MIN_SEGMENT_SIZE
> > 
> >  block/blk-merge.c      | 2 +-
> >  block/blk-settings.c   | 6 +++---
> >  block/blk.h            | 2 +-
> >  include/linux/blkdev.h | 1 +
> >  4 files changed, 6 insertions(+), 5 deletions(-)
> > 
> > diff --git a/block/blk-merge.c b/block/blk-merge.c
> > index 15cd231d560c..b55c52a42303 100644
> > --- a/block/blk-merge.c
> > +++ b/block/blk-merge.c
> > @@ -329,7 +329,7 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
> >  
> >  		if (nsegs < lim->max_segments &&
> >  		    bytes + bv.bv_len <= max_bytes &&
> > -		    bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
> > +		    bv.bv_offset + bv.bv_len <= BLK_MIN_SEGMENT_SIZE) {
> >  			nsegs++;
> >  			bytes += bv.bv_len;
> 
> I'll note that the 64k BLK_MAX_SEGMENT_SIZE is an old "odd historic" default
> value, ie, not a documented hard limit but some odd old thing which
> blk_validate_limits() encourages block drivers to override, so a soft
> max.

BLK_MAX_SEGMENT_SIZE is default or fallback max segment size if the hardware
doesn't provide this limit, so nothing odd here because block layer has
to use something reasonable here.

> 
> That said, if we validate this soft max and if you also validate the min

There isn't soft max segment size.

> shouldn't value in the above instead be lim->max_segment_size instead,

min segment size is page_size and it is soft, and has been applied
for long time. This patch just fixes it as 4k(min(page_size)).

> provided that we also address the comment in blk_bvec_map_sg()?

The comment in blk_bvec_map_sg() has been removed, and blk_bvec_map_sg
has been re-written in commit b7175e24d6ac ("block: add a dma mapping
iterator") by following segment limits only.

> 
> More forward looking -- are you using BLK_MIN_SEGMENT_SIZE here due to
> the same mmc/sd limitations ? Can we overcome the mmc/sd limitations by
> only using this BLK_MIN_SEGMENT_SIZE only on block drivers which have the
> scatterlists limitation?

Please see my comment above, the mmc card doesn't have any limitation,
it is just that its max segment size is < 64K, which is absolutely
allowed from storage viewpoint.


Thanks, 
Ming
Daniel Gomez Feb. 13, 2025, 7:34 a.m. UTC | #5
On Tue, Feb 11, 2025 at 10:10:36AM +0100, Ming Lei wrote:
> On Mon, Feb 10, 2025 at 12:17:07PM -0800, Luis Chamberlain wrote:
> > On Mon, Feb 10, 2025 at 05:03:19PM +0800, Ming Lei wrote:
> > > PAGE_SIZE is applied in some block device queue limits, this way is
> > > very fragile and is wrong:
> > > 
> > > - queue limits are read from hardware, which is often one readonly
> > > hardware property
> > > 
> > > - PAGE_SIZE is one config option which can be changed during build time.
> > 
> > This is true.
> > 
> > > In RH lab, it has been found that max segment size of some mmc card is
> > > less than 64K, then this kind of card can't work in case of 64K PAGE_SIZE.
> > 
> > This is true, but check the note on block/blk-merge.c blk_bvec_map_sg().
> > It would seem that this is a limitation of MMC/SD and that this should
> > ideally be fixed.
> 
> The mmc card works just fine in case of 4K page size, there isn't any
> limitation for the mmc/ssd from storage viewpoint, the failure is just
> because this card's max segment size is < 64KB in case of 64K page size.
> 
> > 
> > > Fix this issue by using BLK_MIN_SEGMENT_SIZE in related code for dealing
> > > with queue limits and checking if bio needn't split. Define BLK_MIN_SEGMENT_SIZE
> > > as 4K(minimized PAGE_SIZE).
> > 
> > But indeed if the block driver isn't yet fixed, then sure, we have to
> > deal with the issue, I am not convinced that the logic below addresses
> > this in a generic way, rather it seems to conflate the areas where we
> > do need the generic block layer min defined, and when we have a block
> > min segment limit.
> > 
> > > Cc: Yi Zhang <yi.zhang@redhat.com>
> > > Cc: Luis Chamberlain <mcgrof@kernel.org>
> > > Cc: John Garry <john.g.garry@oracle.com>
> > > Cc: Bart Van Assche <bvanassche@acm.org>
> > > Cc: Keith Busch <kbusch@kernel.org>
> > > Link: https://lore.kernel.org/linux-block/20250102015620.500754-1-ming.lei@redhat.com/
> > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > ---
> > > V2:
> > > 	- cover bio_split_rw_at()
> > > 	- add BLK_MIN_SEGMENT_SIZE
> > > 
> > >  block/blk-merge.c      | 2 +-
> > >  block/blk-settings.c   | 6 +++---
> > >  block/blk.h            | 2 +-
> > >  include/linux/blkdev.h | 1 +
> > >  4 files changed, 6 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/block/blk-merge.c b/block/blk-merge.c
> > > index 15cd231d560c..b55c52a42303 100644
> > > --- a/block/blk-merge.c
> > > +++ b/block/blk-merge.c
> > > @@ -329,7 +329,7 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
> > >  
> > >  		if (nsegs < lim->max_segments &&
> > >  		    bytes + bv.bv_len <= max_bytes &&
> > > -		    bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
> > > +		    bv.bv_offset + bv.bv_len <= BLK_MIN_SEGMENT_SIZE) {
> > >  			nsegs++;
> > >  			bytes += bv.bv_len;
> > 
> > I'll note that the 64k BLK_MAX_SEGMENT_SIZE is an old "odd historic" default
> > value, ie, not a documented hard limit but some odd old thing which
> > blk_validate_limits() encourages block drivers to override, so a soft
> > max.
> 
> BLK_MAX_SEGMENT_SIZE is default or fallback max segment size if the hardware
> doesn't provide this limit, so nothing odd here because block layer has
> to use something reasonable here.
> 
> > 
> > That said, if we validate this soft max and if you also validate the min
> 
> There isn't soft max segment size.
> 
> > shouldn't value in the above instead be lim->max_segment_size instead,
> 
> min segment size is page_size and it is soft, and has been applied
> for long time. This patch just fixes it as 4k(min(page_size)).
> 
> > > provided that we also address the comment in blk_bvec_map_sg()?
> 
> The comment in blk_bvec_map_sg() has been removed, and blk_bvec_map_sg
> has been re-written in commit b7175e24d6ac ("block: add a dma mapping
> iterator") by following segment limits only.

Would it be possible for the driver to split the minimum segment size, PAGE_SIZE
(64k in your case), into smaller chunks that your hardware supports? For
example, NVMe supports 512-byte I/Os while maintaining the minimum segment
boundary at 4k.

> 
> > 
> > More forward looking -- are you using BLK_MIN_SEGMENT_SIZE here due to
> > the same mmc/sd limitations ? Can we overcome the mmc/sd limitations by
> > only using this BLK_MIN_SEGMENT_SIZE only on block drivers which have the
> > scatterlists limitation?
> 
> Please see my comment above, the mmc card doesn't have any limitation,
> it is just that its max segment size is < 64K, which is absolutely
> allowed from storage viewpoint.
> 
> 
> Thanks, 
> Ming
>
Ming Lei Feb. 13, 2025, 8:02 a.m. UTC | #6
On Thu, Feb 13, 2025 at 08:34:28AM +0100, Daniel Gomez wrote:
> On Tue, Feb 11, 2025 at 10:10:36AM +0100, Ming Lei wrote:
> > On Mon, Feb 10, 2025 at 12:17:07PM -0800, Luis Chamberlain wrote:
> > > On Mon, Feb 10, 2025 at 05:03:19PM +0800, Ming Lei wrote:
> > > > PAGE_SIZE is applied in some block device queue limits, this way is
> > > > very fragile and is wrong:
> > > > 
> > > > - queue limits are read from hardware, which is often one readonly
> > > > hardware property
> > > > 
> > > > - PAGE_SIZE is one config option which can be changed during build time.
> > > 
> > > This is true.
> > > 
> > > > In RH lab, it has been found that max segment size of some mmc card is
> > > > less than 64K, then this kind of card can't work in case of 64K PAGE_SIZE.
> > > 
> > > This is true, but check the note on block/blk-merge.c blk_bvec_map_sg().
> > > It would seem that this is a limitation of MMC/SD and that this should
> > > ideally be fixed.
> > 
> > The mmc card works just fine in case of 4K page size, there isn't any
> > limitation for the mmc/ssd from storage viewpoint, the failure is just
> > because this card's max segment size is < 64KB in case of 64K page size.
> > 
> > > 
> > > > Fix this issue by using BLK_MIN_SEGMENT_SIZE in related code for dealing
> > > > with queue limits and checking if bio needn't split. Define BLK_MIN_SEGMENT_SIZE
> > > > as 4K(minimized PAGE_SIZE).
> > > 
> > > But indeed if the block driver isn't yet fixed, then sure, we have to
> > > deal with the issue, I am not convinced that the logic below addresses
> > > this in a generic way, rather it seems to conflate the areas where we
> > > do need the generic block layer min defined, and when we have a block
> > > min segment limit.
> > > 
> > > > Cc: Yi Zhang <yi.zhang@redhat.com>
> > > > Cc: Luis Chamberlain <mcgrof@kernel.org>
> > > > Cc: John Garry <john.g.garry@oracle.com>
> > > > Cc: Bart Van Assche <bvanassche@acm.org>
> > > > Cc: Keith Busch <kbusch@kernel.org>
> > > > Link: https://lore.kernel.org/linux-block/20250102015620.500754-1-ming.lei@redhat.com/
> > > > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > > > ---
> > > > V2:
> > > > 	- cover bio_split_rw_at()
> > > > 	- add BLK_MIN_SEGMENT_SIZE
> > > > 
> > > >  block/blk-merge.c      | 2 +-
> > > >  block/blk-settings.c   | 6 +++---
> > > >  block/blk.h            | 2 +-
> > > >  include/linux/blkdev.h | 1 +
> > > >  4 files changed, 6 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/block/blk-merge.c b/block/blk-merge.c
> > > > index 15cd231d560c..b55c52a42303 100644
> > > > --- a/block/blk-merge.c
> > > > +++ b/block/blk-merge.c
> > > > @@ -329,7 +329,7 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
> > > >  
> > > >  		if (nsegs < lim->max_segments &&
> > > >  		    bytes + bv.bv_len <= max_bytes &&
> > > > -		    bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
> > > > +		    bv.bv_offset + bv.bv_len <= BLK_MIN_SEGMENT_SIZE) {
> > > >  			nsegs++;
> > > >  			bytes += bv.bv_len;
> > > 
> > > I'll note that the 64k BLK_MAX_SEGMENT_SIZE is an old "odd historic" default
> > > value, ie, not a documented hard limit but some odd old thing which
> > > blk_validate_limits() encourages block drivers to override, so a soft
> > > max.
> > 
> > BLK_MAX_SEGMENT_SIZE is default or fallback max segment size if the hardware
> > doesn't provide this limit, so nothing odd here because block layer has
> > to use something reasonable here.
> > 
> > > 
> > > That said, if we validate this soft max and if you also validate the min
> > 
> > There isn't soft max segment size.
> > 
> > > shouldn't value in the above instead be lim->max_segment_size instead,
> > 
> > min segment size is page_size and it is soft, and has been applied
> > for long time. This patch just fixes it as 4k(min(page_size)).
> > 
> > > > provided that we also address the comment in blk_bvec_map_sg()?
> > 
> > The comment in blk_bvec_map_sg() has been removed, and blk_bvec_map_sg
> > has been re-written in commit b7175e24d6ac ("block: add a dma mapping
> > iterator") by following segment limits only.
> 
> Would it be possible for the driver to split the minimum segment size, PAGE_SIZE
> (64k in your case), into smaller chunks that your hardware supports? For
> example, NVMe supports 512-byte I/Os while maintaining the minimum segment
> boundary at 4k.

The problem[1] is that this kind of mmc card fails to be recognized as
block disk. Block layer io split code can handle this case actually.

Just because max segment size of the card is < 64K when PAGE_SIZE is
configured as 64K, the issue is in block layer limit validation code.

For mmc card, it isn't strange to see small max_segment_size.
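
To make the failure concrete, here is a minimal user-space sketch of the
max_segment_size check in blk_validate_limits(). It is not kernel code: the
model_* names and the struct are hypothetical stand-ins, and the 16K card
limit in the usage note is an assumed value (the thread only says "< 64K").
The floor is passed in as a parameter so the same function can model
PAGE_SIZE (before the patch) and BLK_MIN_SEGMENT_SIZE (after).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel constants discussed in the thread. */
#define MODEL_BLK_MIN_SEGMENT_SIZE 4096u   /* floor proposed by the patch */
#define MODEL_BLK_MAX_SEGMENT_SIZE 65536u  /* fallback if the driver sets none */

/* Hypothetical, reduced version of struct queue_limits. */
struct model_limits {
	unsigned int max_segment_size;
};

/*
 * Model of the max_segment_size validation.
 * min_seg is the smallest segment size the block layer accepts:
 * PAGE_SIZE before the patch, 4096 after it.
 */
static int model_validate_segment_size(struct model_limits *lim,
				       unsigned int min_seg)
{
	assert(lim != NULL);

	/* No limit from the driver: fall back to the default. */
	if (!lim->max_segment_size)
		lim->max_segment_size = MODEL_BLK_MAX_SEGMENT_SIZE;

	/* A limit below the floor is rejected outright. */
	if (lim->max_segment_size < min_seg)
		return -EINVAL;
	return 0;
}
```

With an assumed card limit of 16384 bytes, a floor of 65536 (64K pages)
returns -EINVAL, which is the -22 seen in the probe failure, while a floor
of 4096 accepts the same limit.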


[1] dmesg

[    5.461130] WARNING: CPU: 2 PID: 397 at block/blk-settings.c:339 blk_validate_limits+0x364/0x3c0
[    5.461152] Modules linked in: mmc_block(+) rpmb_core crct10dif_ce ghash_ce sha2_ce dw_mmc_bluefield sha256_arm64 dw_mmc_pltfm sha1_ce dw_mmc mmc_core nfit i2c_mlxbf sbsa_gwdt gpio_mlxbf2 libnvdimm mlxbf_tmfifo dm_mirror dm_region_hash dm_log dm_mod
[    5.492042] CPU: 2 UID: 0 PID: 397 Comm: (udev-worker) Not tainted 6.12.0-39.el10.aarch64+64k #1
[    5.492050] Hardware name: https://www.mellanox.com BlueField SoC/BlueField SoC, BIOS BlueField:3.5.1-1-g4078432 Jan 28 2021
         Starting systemd-vconsole-setup.service - Virtual Console Setup...
[    5.492054] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    5.492058] pc : blk_validate_limits+0x364/0x3c0
[    5.492075] lr : blk_set_default_limits+0x20/0x40
[    5.492079] sp : ffff80008688f2d0
[    5.539494] x29: ffff80008688f2d0 x28: ffff000082acb600 x27: ffff80007bef02a8
[    5.546622] x26: ffff80007bef0000 x25: ffff80008688f58e x24: ffff80008688f450
[    5.553752] x23: ffff80008301b000 x22: 00000000ffffffff x21: ffff800082c39950
[    5.553759] x20: 0000000000000000 x19: ffff0000930169e0 x18: 0000000000000014
[    5.553765] x17: 00000000767472b1 x16: 0000000005a697e6 x15: 0000000002f42ca4
[    5.585117] x11: 00000000de7f0111 x10: 000000005285b53a x9 : ffff800080752908
[    5.595019] x8 : 0000000000000001 x7 : 0000000000000000 x6 : 0000000000000200
[    5.605003] x5 : 0000000000000000 x4 : 000000000000ffff x3 : 0000000000004000
[    5.612556] x2 : 0000000000000200 x1 : 0000000000001000 x0 : ffff80008688f450
[    5.619684] Call trace:
[    5.622121]  blk_validate_limits+0x364/0x3c0
[    5.626391]  blk_set_default_limits+0x20/0x40
[    5.630737]  blk_alloc_queue+0x84/0x240
[    5.634562]  blk_mq_alloc_queue+0x80/0x118
[    5.638648]  __blk_mq_alloc_disk+0x28/0x198
[    5.642820]  mmc_alloc_disk+0xe0/0x260 [mmc_block]
...
[    5.751521] mmcblk mmc0:0001: probe with driver mmcblk failed with error -22


Thanks,
Ming
Christoph Hellwig Feb. 13, 2025, 8:30 a.m. UTC | #7
On Thu, Feb 13, 2025 at 04:02:46PM +0800, Ming Lei wrote:
> The problem[1] is that this kind of mmc card fails to be recognized as
> block disk. Block layer io split code can handle this case actually.

When we still had bio_add_pc_page it would break the assumption that
you could always add a full page.  With that gone and everything going
through the split machinery we might be fine now, but backporting it
to 6.13 and earlier will cause breakage.
Christoph Hellwig Feb. 13, 2025, 8:33 a.m. UTC | #8
On Mon, Feb 10, 2025 at 05:03:19PM +0800, Ming Lei wrote:
> PAGE_SIZE is applied in some block device queue limits, this way is
> very fragile and is wrong:

Can you rephrase this?  what is "some block device queue limits"?

>  	}
>  
> diff --git a/block/blk.h b/block/blk.h
> index 90fa5f28ccab..cbfa8a3d4e42 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -359,7 +359,7 @@ static inline bool bio_may_need_split(struct bio *bio,
>  		const struct queue_limits *lim)
>  {
>  	return lim->chunk_sectors || bio->bi_vcnt != 1 ||
> -		bio->bi_io_vec->bv_len + bio->bi_io_vec->bv_offset > PAGE_SIZE;
> +		bio->bi_io_vec->bv_len + bio->bi_io_vec->bv_offset > BLK_MIN_SEGMENT_SIZE;

please avoid the overly long line here.  And maybe split up the
condition to actually be readable?  I.e.

	if (lim->chunk_sectors)
		return true;
	if (bio->bi_vcnt != 1)
		return true;
	if (bio->bi_io_vec->bv_len + bio->bi_io_vec->bv_offset >
	    BLK_MIN_SEGMENT_SIZE)
		return true;
	return false;
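
For anyone who wants to poke at the split-up form outside the kernel, a
rough user-space sketch follows. The model_* structs are hypothetical,
reduced stand-ins for struct bio, struct bio_vec and struct queue_limits,
keeping only the fields the check reads; the 4096 value is the one proposed
in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_BLK_MIN_SEGMENT_SIZE 4096u /* value proposed by the patch */

/* Hypothetical reduced stand-ins for the kernel structures. */
struct model_bvec {
	unsigned int bv_len;
	unsigned int bv_offset;
};

struct model_bio {
	unsigned short bi_vcnt;
	const struct model_bvec *bi_io_vec;
};

struct model_limits {
	unsigned int chunk_sectors;
};

/* One early return per condition, mirroring the suggested layout. */
static bool model_bio_may_need_split(const struct model_bio *bio,
				     const struct model_limits *lim)
{
	assert(bio != NULL && lim != NULL);

	if (lim->chunk_sectors)
		return true;
	if (bio->bi_vcnt != 1)
		return true;
	if (bio->bi_io_vec->bv_len + bio->bi_io_vec->bv_offset >
	    MODEL_BLK_MIN_SEGMENT_SIZE)
		return true;
	return false;
}
```

A single bvec that ends within the first 4096 bytes of its page skips the
split path; anything else is handed to the full split logic.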

> +	BLK_MIN_SEGMENT_SIZE	= 4096, /* min(PAGE_SIZE) */

That's a very sparse and cryptic comment.  Please write down an
actual explanation.
John Garry Feb. 13, 2025, 8:45 a.m. UTC | #9
On 10/02/2025 09:03, Ming Lei wrote:
> PAGE_SIZE is applied in some block device queue limits, this way is
> very fragile and is wrong:
> 
> - queue limits are read from hardware, which is often one readonly
> hardware property
> 
> - PAGE_SIZE is one config option which can be changed during build time.
> 
> In RH lab, it has been found that max segment size of some mmc card is
> less than 64K, then this kind of card can't work in case of 64K PAGE_SIZE.
> 
> Fix this issue by using BLK_MIN_SEGMENT_SIZE in related code for dealing
> with queue limits and checking if bio needn't split. Define BLK_MIN_SEGMENT_SIZE
> as 4K(minimized PAGE_SIZE).

Please note that blk_queue_max_guaranteed_bio() for atomic writes 
assumes that we can fit a PAGE_SIZE in a segment. I suppose that if 
max_segment_size < PAGE_SIZE is supported, then the calculation there 
needs to change.

> 
> Cc: Yi Zhang <yi.zhang@redhat.com>
> Cc: Luis Chamberlain <mcgrof@kernel.org>
> Cc: John Garry <john.g.garry@oracle.com>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Keith Busch <kbusch@kernel.org>
> Link: https://lore.kernel.org/linux-block/20250102015620.500754-1-ming.lei@redhat.com/
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> V2:
> 	- cover bio_split_rw_at()
> 	- add BLK_MIN_SEGMENT_SIZE
> 
>   block/blk-merge.c      | 2 +-
>   block/blk-settings.c   | 6 +++---
>   block/blk.h            | 2 +-
>   include/linux/blkdev.h | 1 +
>   4 files changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 15cd231d560c..b55c52a42303 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -329,7 +329,7 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
>   
>   		if (nsegs < lim->max_segments &&
>   		    bytes + bv.bv_len <= max_bytes &&
> -		    bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
> +		    bv.bv_offset + bv.bv_len <= BLK_MIN_SEGMENT_SIZE) {
>   			nsegs++;
>   			bytes += bv.bv_len;
>   		} else {
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index c44dadc35e1e..539a64ad7989 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -303,7 +303,7 @@ int blk_validate_limits(struct queue_limits *lim)
>   	max_hw_sectors = min_not_zero(lim->max_hw_sectors,
>   				lim->max_dev_sectors);
>   	if (lim->max_user_sectors) {
> -		if (lim->max_user_sectors < PAGE_SIZE / SECTOR_SIZE)
> +		if (lim->max_user_sectors < BLK_MIN_SEGMENT_SIZE / SECTOR_SIZE)
>   			return -EINVAL;
>   		lim->max_sectors = min(max_hw_sectors, lim->max_user_sectors);
>   	} else if (lim->io_opt > (BLK_DEF_MAX_SECTORS_CAP << SECTOR_SHIFT)) {
> @@ -341,7 +341,7 @@ int blk_validate_limits(struct queue_limits *lim)
>   	 */
>   	if (!lim->seg_boundary_mask)
>   		lim->seg_boundary_mask = BLK_SEG_BOUNDARY_MASK;
> -	if (WARN_ON_ONCE(lim->seg_boundary_mask < PAGE_SIZE - 1))
> +	if (WARN_ON_ONCE(lim->seg_boundary_mask < BLK_MIN_SEGMENT_SIZE - 1))
>   		return -EINVAL;
>   
>   	/*
> @@ -362,7 +362,7 @@ int blk_validate_limits(struct queue_limits *lim)
>   		 */
>   		if (!lim->max_segment_size)
>   			lim->max_segment_size = BLK_MAX_SEGMENT_SIZE;
> -		if (WARN_ON_ONCE(lim->max_segment_size < PAGE_SIZE))
> +		if (WARN_ON_ONCE(lim->max_segment_size < BLK_MIN_SEGMENT_SIZE))
>   			return -EINVAL;
>   	}
>   
> diff --git a/block/blk.h b/block/blk.h
> index 90fa5f28ccab..cbfa8a3d4e42 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -359,7 +359,7 @@ static inline bool bio_may_need_split(struct bio *bio,
>   		const struct queue_limits *lim)
>   {
>   	return lim->chunk_sectors || bio->bi_vcnt != 1 ||
> -		bio->bi_io_vec->bv_len + bio->bi_io_vec->bv_offset > PAGE_SIZE;
> +		bio->bi_io_vec->bv_len + bio->bi_io_vec->bv_offset > BLK_MIN_SEGMENT_SIZE;
>   }
>   
>   /**
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 248416ecd01c..32188af4051e 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1163,6 +1163,7 @@ static inline bool bdev_is_partition(struct block_device *bdev)
>   enum blk_default_limits {
>   	BLK_MAX_SEGMENTS	= 128,
>   	BLK_SAFE_MAX_SECTORS	= 255,
> +	BLK_MIN_SEGMENT_SIZE	= 4096, /* min(PAGE_SIZE) */
>   	BLK_MAX_SEGMENT_SIZE	= 65536,
>   	BLK_SEG_BOUNDARY_MASK	= 0xFFFFFFFFUL,
>   };
Ming Lei Feb. 13, 2025, 8:51 a.m. UTC | #10
On Thu, Feb 13, 2025 at 12:30:29AM -0800, Christoph Hellwig wrote:
> On Thu, Feb 13, 2025 at 04:02:46PM +0800, Ming Lei wrote:
> > The problem[1] is that this kind of mmc card fails to be recognized as
> > block disk. Block layer io split code can handle this case actually.
> 
> When we still had bio_add_pc_page it would break the assumption that
> you could always add a full page.  With that gone and everything going
> through the split machinery we might be fine now, but backporting it
> to 6.13 and earlier will cause breakage.
 
Yes, I will mention in the commit log that backporting depends on the
following two commits:

02ee5d69e3ba block: remove blk_rq_bio_prep
6aeb4f836480 block: remove bio_add_pc_page

Thanks, 
Ming
Ming Lei Feb. 13, 2025, 9:58 a.m. UTC | #11
On Thu, Feb 13, 2025 at 08:45:18AM +0000, John Garry wrote:
> On 10/02/2025 09:03, Ming Lei wrote:
> > PAGE_SIZE is applied in some block device queue limits, this way is
> > very fragile and is wrong:
> > 
> > - queue limits are read from hardware, which is often one readonly
> > hardware property
> > 
> > - PAGE_SIZE is one config option which can be changed during build time.
> > 
> > In RH lab, it has been found that max segment size of some mmc card is
> > less than 64K, then this kind of card can't work in case of 64K PAGE_SIZE.
> > 
> > Fix this issue by using BLK_MIN_SEGMENT_SIZE in related code for dealing
> > with queue limits and checking if bio needn't split. Define BLK_MIN_SEGMENT_SIZE
> > as 4K(minimized PAGE_SIZE).
> 
> Please note that blk_queue_max_guaranteed_bio() for atomic writes assumes
> that we can fit a PAGE_SIZE in a segment. I suppose that if the
> max_segment_size < PAGE_SIZE is supported, then the calculation there needs
> to change.

This isn't related to blk_queue_max_guaranteed_bio(), which calculates the
max allowed ubuf bytes that can fit in a bio, so PAGE_SIZE has to be used there.

BLK_MIN_SEGMENT_SIZE is just a hint used to check quickly whether one
bvec can fit in a single segment; otherwise the normal split code
path is taken.
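
As a rough illustration of the hint, the sketch below counts segments for a
list of bvecs in user space. It is a simplified model, not the kernel's
split code: the model_* names are hypothetical, and it ignores max_segments,
max_bytes and the segment boundary mask. The fast path is safe because limit
validation already guarantees max_segment_size >= 4096, so a bvec that ends
within the first 4096 bytes of its page always fits in one segment.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_BLK_MIN_SEGMENT_SIZE 4096u /* value proposed by the patch */

/* Hypothetical reduced stand-in for struct bio_vec. */
struct model_bvec {
	unsigned int bv_len;
	unsigned int bv_offset;
};

/*
 * Count the segments needed for n bvecs under a given max_segment_size.
 * Simplification: only the segment size limit is modelled.
 */
static unsigned int model_count_segments(const struct model_bvec *bv, size_t n,
					 unsigned int max_segment_size)
{
	unsigned int nsegs = 0;
	size_t i;

	/* Limit validation would already have rejected smaller values. */
	assert(max_segment_size >= MODEL_BLK_MIN_SEGMENT_SIZE);

	for (i = 0; i < n; i++) {
		if (bv[i].bv_offset + bv[i].bv_len <= MODEL_BLK_MIN_SEGMENT_SIZE) {
			/* Fast path: guaranteed to fit in a single segment. */
			nsegs++;
		} else {
			/* Slow path: ceil(bv_len / max_segment_size) segments. */
			nsegs += (bv[i].bv_len + max_segment_size - 1) /
				 max_segment_size;
		}
	}
	return nsegs;
}
```

With an assumed max_segment_size of 16384, a 4096-byte bvec takes the fast
path and counts as one segment, while a full 64K-page bvec goes through the
slow path and is counted as four.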


Thanks,
Ming
John Garry Feb. 13, 2025, 10:23 a.m. UTC | #12
On 13/02/2025 09:58, Ming Lei wrote:
> On Thu, Feb 13, 2025 at 08:45:18AM +0000, John Garry wrote:
>> On 10/02/2025 09:03, Ming Lei wrote:
>>> PAGE_SIZE is applied in some block device queue limits, this way is
>>> very fragile and is wrong:
>>>
>>> - queue limits are read from hardware, which is often one readonly
>>> hardware property
>>>
>>> - PAGE_SIZE is one config option which can be changed during build time.
>>>
>>> In RH lab, it has been found that max segment size of some mmc card is
>>> less than 64K, then this kind of card can't work in case of 64K PAGE_SIZE.
>>>
>>> Fix this issue by using BLK_MIN_SEGMENT_SIZE in related code for dealing
>>> with queue limits and checking if bio needn't split. Define BLK_MIN_SEGMENT_SIZE
>>> as 4K(minimized PAGE_SIZE).
>> Please note that blk_queue_max_guaranteed_bio() for atomic writes assumes
>> that we can fit a PAGE_SIZE in a segment. I suppose that if the
>> max_segment_size < PAGE_SIZE is supported, then the calculation there needs
>> to change.
> It isn't related with blk_queue_max_guaranteed_bio() which calculates the max
> allowed ubuf bytes which can fit in a bio, so PAGE_SIZE has to be used here.
> 
> BLK_MIN_SEGMENT_SIZE is just one hint which can be used to check if one
> bvec can fit in single segment quickly, otherwise the normal split code
> path is run into.

So consider we have PAGE_SIZE > 4k and max_segment_size=4k, if an iovec 
has PAGE_SIZE then a bvec can also have PAGE_SIZE but then we need to 
split into multiple segments, right?

Thanks,
John
Ming Lei Feb. 13, 2025, 10:35 a.m. UTC | #13
On Thu, Feb 13, 2025 at 10:23:02AM +0000, John Garry wrote:
> On 13/02/2025 09:58, Ming Lei wrote:
> > On Thu, Feb 13, 2025 at 08:45:18AM +0000, John Garry wrote:
> > > On 10/02/2025 09:03, Ming Lei wrote:
> > > > PAGE_SIZE is applied in some block device queue limits, this way is
> > > > very fragile and is wrong:
> > > > 
> > > > - queue limits are read from hardware, which is often one readonly
> > > > hardware property
> > > > 
> > > > - PAGE_SIZE is one config option which can be changed during build time.
> > > > 
> > > > In RH lab, it has been found that max segment size of some mmc card is
> > > > less than 64K, then this kind of card can't work in case of 64K PAGE_SIZE.
> > > > 
> > > > Fix this issue by using BLK_MIN_SEGMENT_SIZE in related code for dealing
> > > > with queue limits and checking if bio needn't split. Define BLK_MIN_SEGMENT_SIZE
> > > > as 4K(minimized PAGE_SIZE).
> > > Please note that blk_queue_max_guaranteed_bio() for atomic writes assumes
> > > that we can fit a PAGE_SIZE in a segment. I suppose that if the
> > > max_segment_size < PAGE_SIZE is supported, then the calculation there needs
> > > to change.
> > It isn't related with blk_queue_max_guaranteed_bio() which calculates the max
> > allowed ubuf bytes which can fit in a bio, so PAGE_SIZE has to be used here.
> > 
> > BLK_MIN_SEGMENT_SIZE is just one hint which can be used to check if one
> > bvec can fit in single segment quickly, otherwise the normal split code
> > path is run into.
> 
> So consider we have PAGE_SIZE > 4k and max_segment_size=4k, if an iovec has
> PAGE_SIZE then a bvec can also have PAGE_SIZE but then we need to split into
> multiple segments, right?

Yes, the hardware limit needs to be respected.

It looks like this is a problem for atomic write applications with a 64K page
size, and that case can't work with or without this patchset.
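
To put a number on the split discussed above, here is a minimal sketch that
counts how many hardware segments one bvec needs when only max_segment_size
is considered. The function name is hypothetical, and the model deliberately
ignores seg_boundary_mask and max_segments, which the real split code also
honours.

```c
#include <assert.h>

/*
 * Number of segments needed to cover bv_len bytes when no segment may be
 * larger than max_segment_size (rounded-up division).
 */
static unsigned int model_segments_for_bvec(unsigned int bv_len,
					    unsigned int max_segment_size)
{
	assert(max_segment_size > 0);
	return (bv_len + max_segment_size - 1) / max_segment_size;
}
```

With a 64K bvec and max_segment_size=4k this gives 16 segments; with
max_segment_size=64k the same bvec fits in one.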


Thanks, 
Ming
John Garry Feb. 13, 2025, 11:12 a.m. UTC | #14
On 13/02/2025 10:35, Ming Lei wrote:
>>> BLK_MIN_SEGMENT_SIZE is just one hint which can be used to check if one
>>> bvec can fit in single segment quickly, otherwise the normal split code
>>> path is run into.
>> So consider we have PAGE_SIZE > 4k and max_segment_size=4k, if an iovec has
>> PAGE_SIZE then a bvec can also have PAGE_SIZE but then we need to split into
>> multiple segments, right?
> Yes, hardware limit needs to be respected.
> 
> Looks one write atomic application trouble in case of 64K page size,
> and it can't work w/wo this patchset.

I think that we need to take max_segment_size into account in 
blk_queue_max_guaranteed_bio(), like:

static unsigned int blk_queue_max_guaranteed_bio(struct queue_limits *lim)
{
	unsigned int max_segments = min(BIO_MAX_VECS, lim->max_segments);
	unsigned int length;

	length = min(max_segments, 2) * lim->logical_block_size;
	if (max_segments > 2)
		length += (max_segments - 2) * min(PAGE_SIZE, lim->max_segment_size);

	return length;
}

Note that blk_queue_max_guaranteed_bio() is only really relevant to dio,
so it assumes that the iov_iter follows the bdev dio rules.
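
For concrete numbers, the proposal above can be modelled in userspace as
below. This is only a sketch: page_size is passed as a parameter because
PAGE_SIZE is a build-time constant in the kernel, MODEL_BIO_MAX_VECS assumes
the kernel's BIO_MAX_VECS value of 256, and the struct and helper names are
made up for the example.

```c
/* Assumed to match the kernel's BIO_MAX_VECS. */
#define MODEL_BIO_MAX_VECS 256u

/* Hypothetical subset of struct queue_limits used by the calculation. */
struct model_limits {
	unsigned int max_segments;
	unsigned int max_segment_size;
	unsigned int logical_block_size;
};

static unsigned int min_u(unsigned int a, unsigned int b)
{
	return a < b ? a : b;
}

/*
 * Same arithmetic as the proposed blk_queue_max_guaranteed_bio(): the first
 * and last segments are only guaranteed to hold one logical block each, and
 * every middle segment holds min(page_size, max_segment_size) bytes.
 */
static unsigned int model_max_guaranteed_bio(const struct model_limits *lim,
					     unsigned int page_size)
{
	unsigned int max_segments = min_u(MODEL_BIO_MAX_VECS, lim->max_segments);
	unsigned int length;

	length = min_u(max_segments, 2) * lim->logical_block_size;
	if (max_segments > 2)
		length += (max_segments - 2) *
			  min_u(page_size, lim->max_segment_size);

	return length;
}
```

For max_segments=128, logical_block_size=512 and a 64K page size, the result
is 517120 bytes with max_segment_size=4k and 8258560 bytes with
max_segment_size=64k.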

Thanks,
John
Ming Lei Feb. 13, 2025, 11:33 a.m. UTC | #15
On Thu, Feb 13, 2025 at 11:12:07AM +0000, John Garry wrote:
> On 13/02/2025 10:35, Ming Lei wrote:
> > > > BLK_MIN_SEGMENT_SIZE is just one hint which can be used to check if one
> > > > bvec can fit in single segment quickly, otherwise the normal split code
> > > > path is run into.
> > > So consider we have PAGE_SIZE > 4k and max_segment_size=4k, if an iovec has
> > > PAGE_SIZE then a bvec can also have PAGE_SIZE but then we need to split into
> > > multiple segments, right?
> > Yes, hardware limit needs to be respected.
> > 
> > Looks one write atomic application trouble in case of 64K page size,
> > and it can't work w/wo this patchset.
> 
> I think that we need to take max_segment_size into account in
> blk_queue_max_guaranteed_bio(), like:
> 
> static unsigned int blk_queue_max_guaranteed_bio(struct queue_limits *lim)
> {
> 	unsigned int max_segments = min(BIO_MAX_VECS, lim->max_segments);
> 	unsigned int length;
> 
> 	length = min(max_segments, 2) * lim->logical_block_size;
> 	if (max_segments > 2)
> 		length += (max_segments - 2) * min(PAGE_SIZE, lim->max_segment_size);
> 
> 	return length;
> }
> 
> Note that blk_queue_max_guaranteed_bio() is only really relevant to dio, so
> assumes that the iov_iter follows the bdev dio rules

It can't work because ITER_UBUF from pwritev2(iovcnt=1) is virtually-contiguous,
and the middle segment size has to be PAGE_SIZE.


Thanks,
Ming
John Garry Feb. 13, 2025, 11:41 a.m. UTC | #16
On 13/02/2025 11:33, Ming Lei wrote:
>> I think that we need to take max_segment_size into account in
>> blk_queue_max_guaranteed_bio(), like:
>>
>> static unsigned int blk_queue_max_guaranteed_bio(struct queue_limits *lim)
>> {
>> 	unsigned int max_segments = min(BIO_MAX_VECS, lim->max_segments);
>> 	unsigned int length;
>>
>> 	length = min(max_segments, 2) * lim->logical_block_size;
>> 	if (max_segments > 2)
>> 		length += (max_segments - 2) * min(PAGE_SIZE, lim->max_segment_size);
>>
>> 	return length;
>> }
>>
>> Note that blk_queue_max_guaranteed_bio() is only really relevant to dio, so
>> assumes that the iov_iter follows the bdev dio rules
> It can't work because ITER_UBUF from pwritev2(iovcnt=1) is virtually-contiguous,

Right

> and the middle segment size has to be PAGE_SIZE.

I would have thought that those PAGE_SIZE-sized middle iovecs should be 
split into multiple segments, no?

Thanks,
John
Daniel Gomez Feb. 13, 2025, 2:18 p.m. UTC | #17
On Thu, Feb 13, 2025 at 04:51:55PM +0100, Ming Lei wrote:
> On Thu, Feb 13, 2025 at 12:30:29AM -0800, Christoph Hellwig wrote:
> > On Thu, Feb 13, 2025 at 04:02:46PM +0800, Ming Lei wrote:
> > > The problem[1] is that this kind of mmc card fails to be recognized as
> > > block disk. Block layer io split code can handle this case actually.
> > 
> > When we still had bio_add_pc_page it would break the assumption that
> > you could always add a full page.  With that gone and everything going
> > through the split machinery we might be fine now, but backporting it
> > to 6.13 and earlier will cause breakage.
>  
> Yes, I will mention the following two are depended for backporting in
> commit log:
> 
> 02ee5d69e3ba block: remove blk_rq_bio_prep
> 6aeb4f836480 block: remove bio_add_pc_page

Ming, is that sufficient for your use case? Or do you still need to remove the
assumption that the "minimum" segment size is PAGE_SIZE?

> 
> Thanks, 
> Ming
>
Ming Lei Feb. 14, 2025, 1:37 a.m. UTC | #18
On Thu, Feb 13, 2025 at 03:18:43PM +0100, Daniel Gomez wrote:
> On Thu, Feb 13, 2025 at 04:51:55PM +0100, Ming Lei wrote:
> > On Thu, Feb 13, 2025 at 12:30:29AM -0800, Christoph Hellwig wrote:
> > > On Thu, Feb 13, 2025 at 04:02:46PM +0800, Ming Lei wrote:
> > > > The problem[1] is that this kind of mmc card fails to be recognized as
> > > > block disk. Block layer io split code can handle this case actually.
> > > 
> > > When we still had bio_add_pc_page it would break the assumption that
> > > you could always add a full page.  With that gone and everything going
> > > through the split machinery we might be fine now, but backporting it
> > > to 6.13 and earlier will cause breakage.
> >  
> > Yes, I will mention the following two are depended for backporting in
> > commit log:
> > 
> > 02ee5d69e3ba block: remove blk_rq_bio_prep
> > 6aeb4f836480 block: remove bio_add_pc_page
> 
> Ming, is that sufficient for your use case? Or do you still need to remove the
> assumption that the "minimum" segment size is not PAGE_SIZE?

If you want any block device with a max_segment_size below 64K to work on a
kernel with a 64K PAGE_SIZE, you need this patch and the following three
dependencies:

commit 6aeb4f836480 ("block: remove bio_add_pc_page")
commit 02ee5d69e3ba ("block: remove blk_rq_bio_prep")
commit b7175e24d6ac ("block: add a dma mapping iterator")


Thanks,
Ming
Daniel Gomez Feb. 14, 2025, 9:38 a.m. UTC | #19
On Mon, Feb 10, 2025 at 05:03:19PM +0100, Ming Lei wrote:
> PAGE_SIZE is applied in some block device queue limits, this way is
> very fragile and is wrong:
> 
> - queue limits are read from hardware, which is often one readonly
> hardware property
> 
> - PAGE_SIZE is one config option which can be changed during build time.
> 
> In RH lab, it has been found that max segment size of some mmc card is
> less than 64K, then this kind of card can't work in case of 64K PAGE_SIZE.
> 
> Fix this issue by using BLK_MIN_SEGMENT_SIZE in related code for dealing
> with queue limits and checking if bio needn't split. Define BLK_MIN_SEGMENT_SIZE
> as 4K(minimized PAGE_SIZE).
> 
> Cc: Yi Zhang <yi.zhang@redhat.com>
> Cc: Luis Chamberlain <mcgrof@kernel.org>
> Cc: John Garry <john.g.garry@oracle.com>
> Cc: Bart Van Assche <bvanassche@acm.org>
> Cc: Keith Busch <kbusch@kernel.org>
> Link: https://lore.kernel.org/linux-block/20250102015620.500754-1-ming.lei@redhat.com/
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
> V2:
> 	- cover bio_split_rw_at()
> 	- add BLK_MIN_SEGMENT_SIZE
> 
>  block/blk-merge.c      | 2 +-
>  block/blk-settings.c   | 6 +++---
>  block/blk.h            | 2 +-
>  include/linux/blkdev.h | 1 +
>  4 files changed, 6 insertions(+), 5 deletions(-)
> 
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 15cd231d560c..b55c52a42303 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -329,7 +329,7 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
>  
>  		if (nsegs < lim->max_segments &&
>  		    bytes + bv.bv_len <= max_bytes &&
> -		    bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
> +		    bv.bv_offset + bv.bv_len <= BLK_MIN_SEGMENT_SIZE) {
>  			nsegs++;
>  			bytes += bv.bv_len;
>  		} else {
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index c44dadc35e1e..539a64ad7989 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -303,7 +303,7 @@ int blk_validate_limits(struct queue_limits *lim)
>  	max_hw_sectors = min_not_zero(lim->max_hw_sectors,
>  				lim->max_dev_sectors);
>  	if (lim->max_user_sectors) {
> -		if (lim->max_user_sectors < PAGE_SIZE / SECTOR_SIZE)
> +		if (lim->max_user_sectors < BLK_MIN_SEGMENT_SIZE / SECTOR_SIZE)
>  			return -EINVAL;
>  		lim->max_sectors = min(max_hw_sectors, lim->max_user_sectors);
>  	} else if (lim->io_opt > (BLK_DEF_MAX_SECTORS_CAP << SECTOR_SHIFT)) {
> @@ -341,7 +341,7 @@ int blk_validate_limits(struct queue_limits *lim)
>  	 */
>  	if (!lim->seg_boundary_mask)
>  		lim->seg_boundary_mask = BLK_SEG_BOUNDARY_MASK;
> -	if (WARN_ON_ONCE(lim->seg_boundary_mask < PAGE_SIZE - 1))
> +	if (WARN_ON_ONCE(lim->seg_boundary_mask < BLK_MIN_SEGMENT_SIZE - 1))
>  		return -EINVAL;
>  
>  	/*
> @@ -362,7 +362,7 @@ int blk_validate_limits(struct queue_limits *lim)
>  		 */
>  		if (!lim->max_segment_size)
>  			lim->max_segment_size = BLK_MAX_SEGMENT_SIZE;
> -		if (WARN_ON_ONCE(lim->max_segment_size < PAGE_SIZE))
> +		if (WARN_ON_ONCE(lim->max_segment_size < BLK_MIN_SEGMENT_SIZE))
>  			return -EINVAL;
>  	}
>  
> diff --git a/block/blk.h b/block/blk.h
> index 90fa5f28ccab..cbfa8a3d4e42 100644
> --- a/block/blk.h
> +++ b/block/blk.h
> @@ -359,7 +359,7 @@ static inline bool bio_may_need_split(struct bio *bio,
>  		const struct queue_limits *lim)
>  {
>  	return lim->chunk_sectors || bio->bi_vcnt != 1 ||
> -		bio->bi_io_vec->bv_len + bio->bi_io_vec->bv_offset > PAGE_SIZE;
> +		bio->bi_io_vec->bv_len + bio->bi_io_vec->bv_offset > BLK_MIN_SEGMENT_SIZE;
>  }
>  
>  /**
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 248416ecd01c..32188af4051e 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1163,6 +1163,7 @@ static inline bool bdev_is_partition(struct block_device *bdev)
>  enum blk_default_limits {
>  	BLK_MAX_SEGMENTS	= 128,
>  	BLK_SAFE_MAX_SECTORS	= 255,
> +	BLK_MIN_SEGMENT_SIZE	= 4096, /* min(PAGE_SIZE) */

I think it would be useful to expose this value in queue_limits and sysfs
(and remove it from here). We can default it to PAGE_SIZE (as it has always
been) and allow the block driver to override it when it initializes the
limits. That makes it visible that we are no longer in the 'PAGE_SIZE to
max_segment_size' world but in the 'min_segment_size to max_segment_size'
one. Unless there's a reason not to grow the queue_limits struct?


>  	BLK_MAX_SEGMENT_SIZE	= 65536,
>  	BLK_SEG_BOUNDARY_MASK	= 0xFFFFFFFFUL,
>  };
> -- 
> 2.47.1
>
Ming Lei Feb. 14, 2025, 11:19 a.m. UTC | #20
On Fri, Feb 14, 2025 at 10:38:36AM +0100, Daniel Gomez wrote:
> On Mon, Feb 10, 2025 at 05:03:19PM +0100, Ming Lei wrote:
> > PAGE_SIZE is applied in some block device queue limits, this way is
> > very fragile and is wrong:
> > 
> > - queue limits are read from hardware, which is often one readonly
> > hardware property
> > 
> > - PAGE_SIZE is one config option which can be changed during build time.
> > 
> > In RH lab, it has been found that max segment size of some mmc card is
> > less than 64K, then this kind of card can't work in case of 64K PAGE_SIZE.
> > 
> > Fix this issue by using BLK_MIN_SEGMENT_SIZE in related code for dealing
> > with queue limits and checking if bio needn't split. Define BLK_MIN_SEGMENT_SIZE
> > as 4K(minimized PAGE_SIZE).
> > 
> > Cc: Yi Zhang <yi.zhang@redhat.com>
> > Cc: Luis Chamberlain <mcgrof@kernel.org>
> > Cc: John Garry <john.g.garry@oracle.com>
> > Cc: Bart Van Assche <bvanassche@acm.org>
> > Cc: Keith Busch <kbusch@kernel.org>
> > Link: https://lore.kernel.org/linux-block/20250102015620.500754-1-ming.lei@redhat.com/
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> > V2:
> > 	- cover bio_split_rw_at()
> > 	- add BLK_MIN_SEGMENT_SIZE
> > 
> >  block/blk-merge.c      | 2 +-
> >  block/blk-settings.c   | 6 +++---
> >  block/blk.h            | 2 +-
> >  include/linux/blkdev.h | 1 +
> >  4 files changed, 6 insertions(+), 5 deletions(-)
> > 
> > diff --git a/block/blk-merge.c b/block/blk-merge.c
> > index 15cd231d560c..b55c52a42303 100644
> > --- a/block/blk-merge.c
> > +++ b/block/blk-merge.c
> > @@ -329,7 +329,7 @@ int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
> >  
> >  		if (nsegs < lim->max_segments &&
> >  		    bytes + bv.bv_len <= max_bytes &&
> > -		    bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
> > +		    bv.bv_offset + bv.bv_len <= BLK_MIN_SEGMENT_SIZE) {
> >  			nsegs++;
> >  			bytes += bv.bv_len;
> >  		} else {
> > diff --git a/block/blk-settings.c b/block/blk-settings.c
> > index c44dadc35e1e..539a64ad7989 100644
> > --- a/block/blk-settings.c
> > +++ b/block/blk-settings.c
> > @@ -303,7 +303,7 @@ int blk_validate_limits(struct queue_limits *lim)
> >  	max_hw_sectors = min_not_zero(lim->max_hw_sectors,
> >  				lim->max_dev_sectors);
> >  	if (lim->max_user_sectors) {
> > -		if (lim->max_user_sectors < PAGE_SIZE / SECTOR_SIZE)
> > +		if (lim->max_user_sectors < BLK_MIN_SEGMENT_SIZE / SECTOR_SIZE)
> >  			return -EINVAL;
> >  		lim->max_sectors = min(max_hw_sectors, lim->max_user_sectors);
> >  	} else if (lim->io_opt > (BLK_DEF_MAX_SECTORS_CAP << SECTOR_SHIFT)) {
> > @@ -341,7 +341,7 @@ int blk_validate_limits(struct queue_limits *lim)
> >  	 */
> >  	if (!lim->seg_boundary_mask)
> >  		lim->seg_boundary_mask = BLK_SEG_BOUNDARY_MASK;
> > -	if (WARN_ON_ONCE(lim->seg_boundary_mask < PAGE_SIZE - 1))
> > +	if (WARN_ON_ONCE(lim->seg_boundary_mask < BLK_MIN_SEGMENT_SIZE - 1))
> >  		return -EINVAL;
> >  
> >  	/*
> > @@ -362,7 +362,7 @@ int blk_validate_limits(struct queue_limits *lim)
> >  		 */
> >  		if (!lim->max_segment_size)
> >  			lim->max_segment_size = BLK_MAX_SEGMENT_SIZE;
> > -		if (WARN_ON_ONCE(lim->max_segment_size < PAGE_SIZE))
> > +		if (WARN_ON_ONCE(lim->max_segment_size < BLK_MIN_SEGMENT_SIZE))
> >  			return -EINVAL;
> >  	}
> >  
> > diff --git a/block/blk.h b/block/blk.h
> > index 90fa5f28ccab..cbfa8a3d4e42 100644
> > --- a/block/blk.h
> > +++ b/block/blk.h
> > @@ -359,7 +359,7 @@ static inline bool bio_may_need_split(struct bio *bio,
> >  		const struct queue_limits *lim)
> >  {
> >  	return lim->chunk_sectors || bio->bi_vcnt != 1 ||
> > -		bio->bi_io_vec->bv_len + bio->bi_io_vec->bv_offset > PAGE_SIZE;
> > +		bio->bi_io_vec->bv_len + bio->bi_io_vec->bv_offset > BLK_MIN_SEGMENT_SIZE;
> >  }
> >  
> >  /**
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index 248416ecd01c..32188af4051e 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -1163,6 +1163,7 @@ static inline bool bdev_is_partition(struct block_device *bdev)
> >  enum blk_default_limits {
> >  	BLK_MAX_SEGMENTS	= 128,
> >  	BLK_SAFE_MAX_SECTORS	= 255,
> > +	BLK_MIN_SEGMENT_SIZE	= 4096, /* min(PAGE_SIZE) */
> 
> I think it would be useful to expose this value to the queue_limits and

Can you share what it is useful for?

> sysfs (and remove it from here). We can default it to PAGE_SIZE (as it has
> always been) and allow to overwrite it when the block driver initializes the

Which device driver needs to initialize it?

> limits. This allows to see we are not anymore in the range of PAGE_SIZE -
> max_segment_size 'world' but min_segment_size - max_segment_size one. Unless
> there's a reason to not increase queue_limits data struct?

Unless you can point to real hardware that needs this, I don't think a
min_segment_size limit is useful.


Thanks,
Ming
Daniel Gomez Feb. 14, 2025, 12:28 p.m. UTC | #21
On Fri, Feb 14, 2025 at 07:19:45PM +0100, Ming Lei wrote:
> On Fri, Feb 14, 2025 at 10:38:36AM +0100, Daniel Gomez wrote:
> > On Mon, Feb 10, 2025 at 05:03:19PM +0100, Ming Lei wrote:
> > >  /**
> > > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > > index 248416ecd01c..32188af4051e 100644
> > > --- a/include/linux/blkdev.h
> > > +++ b/include/linux/blkdev.h
> > > @@ -1163,6 +1163,7 @@ static inline bool bdev_is_partition(struct block_device *bdev)
> > >  enum blk_default_limits {
> > >  	BLK_MAX_SEGMENTS	= 128,
> > >  	BLK_SAFE_MAX_SECTORS	= 255,
> > > +	BLK_MIN_SEGMENT_SIZE	= 4096, /* min(PAGE_SIZE) */
> > 
> > I think it would be useful to expose this value to the queue_limits and
> 
> Can you share it is useful for what?

I meant for your use case.

> 
> > sysfs (and remove it from here). We can default it to PAGE_SIZE (as it has
> > always been) and allow to overwrite it when the block driver initializes the
> 
> Which device driver needs to initialize it?

I mean, it would be yours. Keeping the default minimum segment size at
PAGE_SIZE, rather than changing it to 4k, would keep the current behaviour.
Adding the minimum segment limit would then allow your driver to override it
for your use case.
Ming Lei Feb. 14, 2025, 12:51 p.m. UTC | #22
On Fri, Feb 14, 2025 at 01:28:41PM +0100, Daniel Gomez wrote:
> On Fri, Feb 14, 2025 at 07:19:45PM +0100, Ming Lei wrote:
> > On Fri, Feb 14, 2025 at 10:38:36AM +0100, Daniel Gomez wrote:
> > > On Mon, Feb 10, 2025 at 05:03:19PM +0100, Ming Lei wrote:
> > > >  /**
> > > > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > > > index 248416ecd01c..32188af4051e 100644
> > > > --- a/include/linux/blkdev.h
> > > > +++ b/include/linux/blkdev.h
> > > > @@ -1163,6 +1163,7 @@ static inline bool bdev_is_partition(struct block_device *bdev)
> > > >  enum blk_default_limits {
> > > >  	BLK_MAX_SEGMENTS	= 128,
> > > >  	BLK_SAFE_MAX_SECTORS	= 255,
> > > > +	BLK_MIN_SEGMENT_SIZE	= 4096, /* min(PAGE_SIZE) */
> > > 
> > > I think it would be useful to expose this value to the queue_limits and
> > 
> > Can you share it is useful for what?
> 
> I meant for your use case.

No, it isn't a single case; there are many such devices with a
max_segment_size below 64K. Please see Bart's earlier post:

https://lore.kernel.org/linux-block/20230612203314.17820-1-bvanassche@acm.org/

> 
> > 
> > > sysfs (and remove it from here). We can default it to PAGE_SIZE (as it has
> > > always been) and allow to overwrite it when the block driver initializes the
> > 
> > Which device driver needs to initialize it?
> 
> I mean, it would be yours. Keeping the default minimum segment size to PAGE_SIZE
> rather than changing it to 4k, would keep the current behaviour. Then, adding
> the minimum segment limit would allow your driver to overwrite it for your use
> case.

But these devices don't export a min_segment_size, so why do you want to fake
this limit?

It is fragile to take the variable PAGE_SIZE as a soft min_segment_size, and
it is actually wrong to bind it to the fixed hardware max_segment_size.

To be honest, I don't see any benefit in your approach; it just makes things
complicated.
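
To show why tying the lower bound to PAGE_SIZE is fragile, here is a minimal
userspace sketch of the max_segment_size check from the blk_validate_limits()
hunk. The function name and the lower_bound parameter are assumptions for the
example (the kernel compares against a compile-time constant), and the
virt_boundary_mask handling of the real function is left out.

```c
#include <errno.h>

/* Values from enum blk_default_limits as changed by the patch. */
#define BLK_MIN_SEGMENT_SIZE 4096u
#define BLK_MAX_SEGMENT_SIZE 65536u

/*
 * Applies the default when the driver left max_segment_size at zero, then
 * rejects limits below lower_bound. Before the patch the bound was PAGE_SIZE;
 * after it, the bound is BLK_MIN_SEGMENT_SIZE.
 */
static int model_validate_max_segment_size(unsigned int *max_segment_size,
					   unsigned int lower_bound)
{
	if (!*max_segment_size)
		*max_segment_size = BLK_MAX_SEGMENT_SIZE;
	if (*max_segment_size < lower_bound)
		return -EINVAL;
	return 0;
}
```

A device reporting max_segment_size=32k is rejected when the bound is a 64K
PAGE_SIZE, but accepted when the bound is BLK_MIN_SEGMENT_SIZE, although the
hardware limit is the same in both cases.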

Thanks,
Ming

Patch

diff --git a/block/blk-merge.c b/block/blk-merge.c
index 15cd231d560c..b55c52a42303 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -329,7 +329,7 @@  int bio_split_rw_at(struct bio *bio, const struct queue_limits *lim,
 
 		if (nsegs < lim->max_segments &&
 		    bytes + bv.bv_len <= max_bytes &&
-		    bv.bv_offset + bv.bv_len <= PAGE_SIZE) {
+		    bv.bv_offset + bv.bv_len <= BLK_MIN_SEGMENT_SIZE) {
 			nsegs++;
 			bytes += bv.bv_len;
 		} else {
diff --git a/block/blk-settings.c b/block/blk-settings.c
index c44dadc35e1e..539a64ad7989 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -303,7 +303,7 @@  int blk_validate_limits(struct queue_limits *lim)
 	max_hw_sectors = min_not_zero(lim->max_hw_sectors,
 				lim->max_dev_sectors);
 	if (lim->max_user_sectors) {
-		if (lim->max_user_sectors < PAGE_SIZE / SECTOR_SIZE)
+		if (lim->max_user_sectors < BLK_MIN_SEGMENT_SIZE / SECTOR_SIZE)
 			return -EINVAL;
 		lim->max_sectors = min(max_hw_sectors, lim->max_user_sectors);
 	} else if (lim->io_opt > (BLK_DEF_MAX_SECTORS_CAP << SECTOR_SHIFT)) {
@@ -341,7 +341,7 @@  int blk_validate_limits(struct queue_limits *lim)
 	 */
 	if (!lim->seg_boundary_mask)
 		lim->seg_boundary_mask = BLK_SEG_BOUNDARY_MASK;
-	if (WARN_ON_ONCE(lim->seg_boundary_mask < PAGE_SIZE - 1))
+	if (WARN_ON_ONCE(lim->seg_boundary_mask < BLK_MIN_SEGMENT_SIZE - 1))
 		return -EINVAL;
 
 	/*
@@ -362,7 +362,7 @@  int blk_validate_limits(struct queue_limits *lim)
 		 */
 		if (!lim->max_segment_size)
 			lim->max_segment_size = BLK_MAX_SEGMENT_SIZE;
-		if (WARN_ON_ONCE(lim->max_segment_size < PAGE_SIZE))
+		if (WARN_ON_ONCE(lim->max_segment_size < BLK_MIN_SEGMENT_SIZE))
 			return -EINVAL;
 	}
 
diff --git a/block/blk.h b/block/blk.h
index 90fa5f28ccab..cbfa8a3d4e42 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -359,7 +359,7 @@  static inline bool bio_may_need_split(struct bio *bio,
 		const struct queue_limits *lim)
 {
 	return lim->chunk_sectors || bio->bi_vcnt != 1 ||
-		bio->bi_io_vec->bv_len + bio->bi_io_vec->bv_offset > PAGE_SIZE;
+		bio->bi_io_vec->bv_len + bio->bi_io_vec->bv_offset > BLK_MIN_SEGMENT_SIZE;
 }
 
 /**
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 248416ecd01c..32188af4051e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1163,6 +1163,7 @@  static inline bool bdev_is_partition(struct block_device *bdev)
 enum blk_default_limits {
 	BLK_MAX_SEGMENTS	= 128,
 	BLK_SAFE_MAX_SECTORS	= 255,
+	BLK_MIN_SEGMENT_SIZE	= 4096, /* min(PAGE_SIZE) */
 	BLK_MAX_SEGMENT_SIZE	= 65536,
 	BLK_SEG_BOUNDARY_MASK	= 0xFFFFFFFFUL,
 };