diff mbox series

[1/3] block: try one write zeroes request before going further

Message ID 20201206055332.3144-1-tom.ty89@gmail.com (mailing list archive)
State New, archived
Headers show
Series [1/3] block: try one write zeroes request before going further | expand

Commit Message

Tom Yan Dec. 6, 2020, 5:53 a.m. UTC
At least the SCSI disk driver is "benevolent" when it try to decide
whether the device actually supports write zeroes, i.e. unless the
device explicity report otherwise, it assumes it does at first.

Therefore before we pile up bios that would fail at the end, we try
the command/request once, as not doing so could trigger quite a
disaster in at least certain case. For example, the host controller
can be messed up entirely when one does `blkdiscard -z` a UAS drive.

Signed-off-by: Tom Yan <tom.ty89@gmail.com>
---
 block/blk-lib.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

Comments

Hannes Reinecke Dec. 6, 2020, 11:25 a.m. UTC | #1
On 12/6/20 6:53 AM, Tom Yan wrote:
> At least the SCSI disk driver is "benevolent" when it try to decide
> whether the device actually supports write zeroes, i.e. unless the
> device explicity report otherwise, it assumes it does at first.
> 
> Therefore before we pile up bios that would fail at the end, we try
> the command/request once, as not doing so could trigger quite a
> disaster in at least certain case. For example, the host controller
> can be messed up entirely when one does `blkdiscard -z` a UAS drive.
> 
> Signed-off-by: Tom Yan <tom.ty89@gmail.com>
> ---
>   block/blk-lib.c | 14 +++++++++++++-
>   1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index e90614fd8d6a..c1e9388a8fb8 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -250,6 +250,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
>   	struct bio *bio = *biop;
>   	unsigned int max_write_zeroes_sectors;
>   	struct request_queue *q = bdev_get_queue(bdev);
> +	int i = 0;
>   
>   	if (!q)
>   		return -ENXIO;
> @@ -264,7 +265,17 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
>   		return -EOPNOTSUPP;
>   
>   	while (nr_sects) {
> -		bio = blk_next_bio(bio, 0, gfp_mask);
> +		if (i != 1) {
> +			bio = blk_next_bio(bio, 0, gfp_mask);
> +		} else {
> +			submit_bio_wait(bio);
> +			bio_put(bio);
> +
> +			if (bdev_write_zeroes_sectors(bdev) == 0)
> +				return -EOPNOTSUPP;
> +			else
> +				bio = bio_alloc(gfp_mask, 0);
> +		}
>   		bio->bi_iter.bi_sector = sector;
>   		bio_set_dev(bio, bdev);
>   		bio->bi_opf = REQ_OP_WRITE_ZEROES;
> @@ -280,6 +291,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
>   			nr_sects = 0;
>   		}
>   		cond_resched();
> +		i++;
>   	}
>   
>   	*biop = bio;
> 
We do want to keep the chain of bios intact such that end_io processing 
will recurse back to the original end_io callback.
As such we need to call bio_chain on the first bio, submit that 
(possibly with submit_bio_wait()), and then decide whether we can / 
should continue.
With your patch we'll lose the information that indeed other bios might 
be linked to the original one.

Cheers,

Hannes
Tom Yan Dec. 6, 2020, 1:25 p.m. UTC | #2
I think you misunderstood it. The goal of this patch is to split the
current situation into two chains (or one unchained bio + a series of
chained bio). The first one is an attempt/trial which makes sure that
the latter large bio chain can actually be handled (as per the
"command capability" of the device).

P.S. I think I missed the fact that it requires my blk_next_bio()
patch to work properly. (It still seems like a typo bug to me.)

On Sun, 6 Dec 2020 at 19:25, Hannes Reinecke <hare@suse.de> wrote:
>
> On 12/6/20 6:53 AM, Tom Yan wrote:
> > At least the SCSI disk driver is "benevolent" when it try to decide
> > whether the device actually supports write zeroes, i.e. unless the
> > device explicity report otherwise, it assumes it does at first.
> >
> > Therefore before we pile up bios that would fail at the end, we try
> > the command/request once, as not doing so could trigger quite a
> > disaster in at least certain case. For example, the host controller
> > can be messed up entirely when one does `blkdiscard -z` a UAS drive.
> >
> > Signed-off-by: Tom Yan <tom.ty89@gmail.com>
> > ---
> >   block/blk-lib.c | 14 +++++++++++++-
> >   1 file changed, 13 insertions(+), 1 deletion(-)
> >
> > diff --git a/block/blk-lib.c b/block/blk-lib.c
> > index e90614fd8d6a..c1e9388a8fb8 100644
> > --- a/block/blk-lib.c
> > +++ b/block/blk-lib.c
> > @@ -250,6 +250,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
> >       struct bio *bio = *biop;
> >       unsigned int max_write_zeroes_sectors;
> >       struct request_queue *q = bdev_get_queue(bdev);
> > +     int i = 0;
> >
> >       if (!q)
> >               return -ENXIO;
> > @@ -264,7 +265,17 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
> >               return -EOPNOTSUPP;
> >
> >       while (nr_sects) {
> > -             bio = blk_next_bio(bio, 0, gfp_mask);
> > +             if (i != 1) {
> > +                     bio = blk_next_bio(bio, 0, gfp_mask);
> > +             } else {
> > +                     submit_bio_wait(bio);
> > +                     bio_put(bio);
> > +
> > +                     if (bdev_write_zeroes_sectors(bdev) == 0)
> > +                             return -EOPNOTSUPP;
> > +                     else
> > +                             bio = bio_alloc(gfp_mask, 0);
> > +             }
> >               bio->bi_iter.bi_sector = sector;
> >               bio_set_dev(bio, bdev);
> >               bio->bi_opf = REQ_OP_WRITE_ZEROES;
> > @@ -280,6 +291,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
> >                       nr_sects = 0;
> >               }
> >               cond_resched();
> > +             i++;
> >       }
> >
> >       *biop = bio;
> >
> We do want to keep the chain of bios intact such that end_io processing
> will recurse back to the original end_io callback.
> As such we need to call bio_chain on the first bio, submit that
> (possibly with submit_bio_wait()), and then decide whether we can /
> should continue.
> With your patch we'll lose the information that indeed other bios might
> be linked to the original one.
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke                Kernel Storage Architect
> hare@suse.de                              +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
Hannes Reinecke Dec. 6, 2020, 1:56 p.m. UTC | #3
On 12/6/20 2:25 PM, Tom Yan wrote:
> I think you misunderstood it. The goal of this patch is to split the
> current situation into two chains (or one unchained bio + a series of
> chained bio). The first one is an attempt/trial which makes sure that
> the latter large bio chain can actually be handled (as per the
> "command capability" of the device).
> 
Oh, I think I do get what you're trying to do. And, in fact, I don't 
argue with what you're trying to achieve.

What I would like to see, though, is keep the current bio_chain logic 
intact (irrespective of your previous patch, which should actually be 
part of this series), and just lift the first check out of the loop:

@@ -262,9 +262,14 @@ static int __blkdev_issue_write_zeroes(struct 
block_device *bdev,

         if (max_write_zeroes_sectors == 0)
                 return -EOPNOTSUPP;
-
+       new = bio_alloc(gfp_mask, 0);
+       bio_chain(bio, new);
+       if (submit_bio_wait(bio) == BLK_STS_NOTSUPP) {
+               bio_put(new);
+               return -ENOPNOTSUPP;
+       }
+       bio = new;
         while (nr_sects) {
-               bio = blk_next_bio(bio, 0, gfp_mask);
                 bio->bi_iter.bi_sector = sector;
                 bio_set_dev(bio, bdev);
                 bio->bi_opf = REQ_OP_WRITE_ZEROES;
@@ -279,6 +284,7 @@ static int __blkdev_issue_write_zeroes(struct 
block_device *bdev,
                         bio->bi_iter.bi_size = nr_sects << 9;
                         nr_sects = 0;
                 }
+               bio = blk_next_bio(bio, 0, gfp_mask);
                 cond_resched();
         }

(The error checking from submit_bio_wait() could be improved :-)

Cheers,

Hannes
Tom Yan Dec. 6, 2020, 2:07 p.m. UTC | #4
Yes it does have "dependency" to the blk_next_bio() patch. I just
somehow missed that.

The problem is, I don't think I'm trying to change the logic of
bio_chain(), or even that of blk_next_bio(). It really just looks like
a careless mistake, that the arguments were typed in the wrong order.

Adding those who signed off the original commit (block: remove struct
bio_batch / 9082e87b) here too to the CC list.


On Sun, 6 Dec 2020 at 21:56, Hannes Reinecke <hare@suse.de> wrote:
>
> On 12/6/20 2:25 PM, Tom Yan wrote:
> > I think you misunderstood it. The goal of this patch is to split the
> > current situation into two chains (or one unchained bio + a series of
> > chained bio). The first one is an attempt/trial which makes sure that
> > the latter large bio chain can actually be handled (as per the
> > "command capability" of the device).
> >
> Oh, I think I do get what you're trying to do. And, in fact, I don't
> argue with what you're trying to achieve.
>
> What I would like to see, though, is keep the current bio_chain logic
> intact (irrespective of your previous patch, which should actually be
> part of this series), and just lift the first check out of the loop:
>
> @@ -262,9 +262,14 @@ static int __blkdev_issue_write_zeroes(struct
> block_device *bdev,
>
>          if (max_write_zeroes_sectors == 0)
>                  return -EOPNOTSUPP;
> -
> +       new = bio_alloc(gfp_mask, 0);
> +       bio_chain(bio, new);
> +       if (submit_bio_wait(bio) == BLK_STS_NOTSUPP) {
> +               bio_put(new);
> +               return -ENOPNOTSUPP;
> +       }
> +       bio = new;
>          while (nr_sects) {
> -               bio = blk_next_bio(bio, 0, gfp_mask);
>                  bio->bi_iter.bi_sector = sector;
>                  bio_set_dev(bio, bdev);
>                  bio->bi_opf = REQ_OP_WRITE_ZEROES;
> @@ -279,6 +284,7 @@ static int __blkdev_issue_write_zeroes(struct
> block_device *bdev,
>                          bio->bi_iter.bi_size = nr_sects << 9;
>                          nr_sects = 0;
>                  }
> +               bio = blk_next_bio(bio, 0, gfp_mask);
>                  cond_resched();
>          }
>
> (The error checking from submit_bio_wait() could be improved :-)
>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke                Kernel Storage Architect
> hare@suse.de                              +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
Tom Yan Dec. 6, 2020, 2:28 p.m. UTC | #5
Btw, while this series relies on the blk_next_bio() patch to work, it
was not the reason that I sent the latter. It was just because the way
it calls bio_chain() doesn't look right to any of the functions that
make use of it (or in other words, the apparent logic of itself).
That's actually why I didn't have it in the same series.

On Sun, 6 Dec 2020 at 22:07, Tom Yan <tom.ty89@gmail.com> wrote:
>
> Yes it does have "dependency" to the blk_next_bio() patch. I just
> somehow missed that.
>
> The problem is, I don't think I'm trying to change the logic of
> bio_chain(), or even that of blk_next_bio(). It really just looks like
> a careless mistake, that the arguments were typed in the wrong order.
>
> Adding those who signed off the original commit (block: remove struct
> bio_batch / 9082e87b) here too to the CC list.
>
>
> On Sun, 6 Dec 2020 at 21:56, Hannes Reinecke <hare@suse.de> wrote:
> >
> > On 12/6/20 2:25 PM, Tom Yan wrote:
> > > I think you misunderstood it. The goal of this patch is to split the
> > > current situation into two chains (or one unchained bio + a series of
> > > chained bio). The first one is an attempt/trial which makes sure that
> > > the latter large bio chain can actually be handled (as per the
> > > "command capability" of the device).
> > >
> > Oh, I think I do get what you're trying to do. And, in fact, I don't
> > argue with what you're trying to achieve.
> >
> > What I would like to see, though, is keep the current bio_chain logic
> > intact (irrespective of your previous patch, which should actually be
> > part of this series), and just lift the first check out of the loop:
> >
> > @@ -262,9 +262,14 @@ static int __blkdev_issue_write_zeroes(struct
> > block_device *bdev,
> >
> >          if (max_write_zeroes_sectors == 0)
> >                  return -EOPNOTSUPP;
> > -
> > +       new = bio_alloc(gfp_mask, 0);
> > +       bio_chain(bio, new);
> > +       if (submit_bio_wait(bio) == BLK_STS_NOTSUPP) {
> > +               bio_put(new);
> > +               return -ENOPNOTSUPP;
> > +       }
> > +       bio = new;
> >          while (nr_sects) {
> > -               bio = blk_next_bio(bio, 0, gfp_mask);
> >                  bio->bi_iter.bi_sector = sector;
> >                  bio_set_dev(bio, bdev);
> >                  bio->bi_opf = REQ_OP_WRITE_ZEROES;
> > @@ -279,6 +284,7 @@ static int __blkdev_issue_write_zeroes(struct
> > block_device *bdev,
> >                          bio->bi_iter.bi_size = nr_sects << 9;
> >                          nr_sects = 0;
> >                  }
> > +               bio = blk_next_bio(bio, 0, gfp_mask);
> >                  cond_resched();
> >          }
> >
> > (The error checking from submit_bio_wait() could be improved :-)
> >
> > Cheers,
> >
> > Hannes
> > --
> > Dr. Hannes Reinecke                Kernel Storage Architect
> > hare@suse.de                              +49 911 74053 688
> > SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
> > HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
Christoph Hellwig Dec. 7, 2020, 1:36 p.m. UTC | #6
On Sun, Dec 06, 2020 at 01:53:30PM +0800, Tom Yan wrote:
> At least the SCSI disk driver is "benevolent" when it try to decide
> whether the device actually supports write zeroes, i.e. unless the
> device explicity report otherwise, it assumes it does at first.
> 
> Therefore before we pile up bios that would fail at the end, we try
> the command/request once, as not doing so could trigger quite a
> disaster in at least certain case. For example, the host controller
> can be messed up entirely when one does `blkdiscard -z` a UAS drive.
> 
> Signed-off-by: Tom Yan <tom.ty89@gmail.com>
> ---
>  block/blk-lib.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/block/blk-lib.c b/block/blk-lib.c
> index e90614fd8d6a..c1e9388a8fb8 100644
> --- a/block/blk-lib.c
> +++ b/block/blk-lib.c
> @@ -250,6 +250,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
>  	struct bio *bio = *biop;
>  	unsigned int max_write_zeroes_sectors;
>  	struct request_queue *q = bdev_get_queue(bdev);
> +	int i = 0;
>  
>  	if (!q)
>  		return -ENXIO;
> @@ -264,7 +265,17 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
>  		return -EOPNOTSUPP;
>  
>  	while (nr_sects) {
> -		bio = blk_next_bio(bio, 0, gfp_mask);
> +		if (i != 1) {
> +			bio = blk_next_bio(bio, 0, gfp_mask);
> +		} else {
> +			submit_bio_wait(bio);
> +			bio_put(bio);
> +
> +			if (bdev_write_zeroes_sectors(bdev) == 0)
> +				return -EOPNOTSUPP;
> +			else

This means you now massively slow down say nvme operations by adding
a wait.  If at all we need a maybe supports write zeroes flag and
only do that if the driver hasn't decided yet if write zeroes is
actually supported.
Tom Yan Dec. 8, 2020, 12:48 p.m. UTC | #7
You mean like submit_bio_wait() is a blocking operation?

On Mon, 7 Dec 2020 at 21:36, Christoph Hellwig <hch@infradead.org> wrote:
>
> On Sun, Dec 06, 2020 at 01:53:30PM +0800, Tom Yan wrote:
> > At least the SCSI disk driver is "benevolent" when it try to decide
> > whether the device actually supports write zeroes, i.e. unless the
> > device explicity report otherwise, it assumes it does at first.
> >
> > Therefore before we pile up bios that would fail at the end, we try
> > the command/request once, as not doing so could trigger quite a
> > disaster in at least certain case. For example, the host controller
> > can be messed up entirely when one does `blkdiscard -z` a UAS drive.
> >
> > Signed-off-by: Tom Yan <tom.ty89@gmail.com>
> > ---
> >  block/blk-lib.c | 14 +++++++++++++-
> >  1 file changed, 13 insertions(+), 1 deletion(-)
> >
> > diff --git a/block/blk-lib.c b/block/blk-lib.c
> > index e90614fd8d6a..c1e9388a8fb8 100644
> > --- a/block/blk-lib.c
> > +++ b/block/blk-lib.c
> > @@ -250,6 +250,7 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
> >       struct bio *bio = *biop;
> >       unsigned int max_write_zeroes_sectors;
> >       struct request_queue *q = bdev_get_queue(bdev);
> > +     int i = 0;
> >
> >       if (!q)
> >               return -ENXIO;
> > @@ -264,7 +265,17 @@ static int __blkdev_issue_write_zeroes(struct block_device *bdev,
> >               return -EOPNOTSUPP;
> >
> >       while (nr_sects) {
> > -             bio = blk_next_bio(bio, 0, gfp_mask);
> > +             if (i != 1) {
> > +                     bio = blk_next_bio(bio, 0, gfp_mask);
> > +             } else {
> > +                     submit_bio_wait(bio);
> > +                     bio_put(bio);
> > +
> > +                     if (bdev_write_zeroes_sectors(bdev) == 0)
> > +                             return -EOPNOTSUPP;
> > +                     else
>
> This means you now massively slow down say nvme operations by adding
> a wait.  If at all we need a maybe supports write zeroes flag and
> only do that if the driver hasn't decided yet if write zeroes is
> actually supported.
Ewan Milne Dec. 8, 2020, 10:46 p.m. UTC | #8
On Sun, 2020-12-06 at 13:53 +0800, Tom Yan wrote:
> At least the SCSI disk driver is "benevolent" when it try to decide
> whether the device actually supports write zeroes, i.e. unless the
> device explicity report otherwise, it assumes it does at first.
> 
> Therefore before we pile up bios that would fail at the end, we try
> the command/request once, as not doing so could trigger quite a
> disaster in at least certain case. For example, the host controller
> can be messed up entirely when one does `blkdiscard -z` a UAS drive.
> 

It's not as simple as that.  There are some SCSI devices that support
WRITE ZEROES, but do not return the MAXIMUM WRITE SAME LENGTH in the
block device limits VPD page.  So, some commands might work, and others
might not.  In particular, a commonly-used hypervisor does this.

The sd driver disables the use of write same if certain errors are
returned (INVALID COMMAND w/ INVALID COMMAND OPCODE or INVALID FIELD IN
CDB), but if you do a blkdiscard -z of an entire drive a whole lot of
bios/requests are already queued by the time you get that.

Higher level code checks to see if write zeroes is supported, and
won't queue the requests once it is turned off, but that doesn't
happen until a command fails.  We also check in command setup, see
my earlier patch which deals with spurious blk_update_request errors
if the disablement of write same gets detected there...  I explicitly
did not try to fix that by "testing" with one bio for the reason
Christoph mentions.

-Ewan
Christoph Hellwig Dec. 9, 2020, 5:51 p.m. UTC | #9
On Tue, Dec 08, 2020 at 08:48:15PM +0800, Tom Yan wrote:
> You mean like submit_bio_wait() is a blocking operation?

Yes, it blocks to wait for the I/O to complete.
diff mbox series

Patch

diff --git a/block/blk-lib.c b/block/blk-lib.c
index e90614fd8d6a..c1e9388a8fb8 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -250,6 +250,7 @@  static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 	struct bio *bio = *biop;
 	unsigned int max_write_zeroes_sectors;
 	struct request_queue *q = bdev_get_queue(bdev);
+	int i = 0;
 
 	if (!q)
 		return -ENXIO;
@@ -264,7 +265,17 @@  static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 		return -EOPNOTSUPP;
 
 	while (nr_sects) {
-		bio = blk_next_bio(bio, 0, gfp_mask);
+		if (i != 1) {
+			bio = blk_next_bio(bio, 0, gfp_mask);
+		} else {
+			submit_bio_wait(bio);
+			bio_put(bio);
+
+			if (bdev_write_zeroes_sectors(bdev) == 0)
+				return -EOPNOTSUPP;
+			else
+				bio = bio_alloc(gfp_mask, 0);
+		}
 		bio->bi_iter.bi_sector = sector;
 		bio_set_dev(bio, bdev);
 		bio->bi_opf = REQ_OP_WRITE_ZEROES;
@@ -280,6 +291,7 @@  static int __blkdev_issue_write_zeroes(struct block_device *bdev,
 			nr_sects = 0;
 		}
 		cond_resched();
+		i++;
 	}
 
 	*biop = bio;