diff mbox series

[2/2] loop: Better discard support for block devices

Message ID 20181030230624.61834-3-evgreen@chromium.org (mailing list archive)
State New, archived
Headers show
Series loop: Better discard for block devices | expand

Commit Message

Evan Green Oct. 30, 2018, 11:06 p.m. UTC
If the backing device for a loop device is a block device,
then mirror the discard properties of the underlying block
device into the loop device. While in there, differentiate
between REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES, which are
different for block devices, but which the loop device had
just been lumping together.

Signed-off-by: Evan Green <evgreen@chromium.org>
---

 drivers/block/loop.c | 61 +++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 41 insertions(+), 20 deletions(-)

Comments

Evan Green Nov. 26, 2018, 6:53 p.m. UTC | #1
On Tue, Oct 30, 2018 at 4:06 PM Evan Green <evgreen@chromium.org> wrote:
>
> If the backing device for a loop device is a block device,
> then mirror the discard properties of the underlying block
> device into the loop device. While in there, differentiate
> between REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES, which are
> different for block devices, but which the loop device had
> just been lumping together.
>
> Signed-off-by: Evan Green <evgreen@chromium.org>

Any thoughts on this patch? This fixes issues for us when using a loop
device backed by a block device, where we see many logs like:

[  372.767286] print_req_error: I/O error, dev loop5, sector 88125696

-Evan
Ming Lei Nov. 27, 2018, 2:55 a.m. UTC | #2
On Tue, Nov 27, 2018 at 2:55 AM Evan Green <evgreen@chromium.org> wrote:
>
> On Tue, Oct 30, 2018 at 4:06 PM Evan Green <evgreen@chromium.org> wrote:
> >
> > If the backing device for a loop device is a block device,

This shouldn't be a very common use case wrt. loop.

> > then mirror the discard properties of the underlying block
> > device into the loop device. While in there, differentiate
> > between REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES, which are
> > different for block devices, but which the loop device had
> > just been lumping together.
> >
> > Signed-off-by: Evan Green <evgreen@chromium.org>
>
> Any thoughts on this patch? This fixes issues for us when using a loop
> device backed by a block device, where we see many logs like:
>
> [  372.767286] print_req_error: I/O error, dev loop5, sector 88125696

Seems not see any explanation about this IO error and the fix in your patch.
Could you describe it a bit more?


thanks,
Ming Lei
Evan Green Nov. 27, 2018, 11:34 p.m. UTC | #3
Hi Ming,

On Mon, Nov 26, 2018 at 6:55 PM Ming Lei <tom.leiming@gmail.com> wrote:
>
> On Tue, Nov 27, 2018 at 2:55 AM Evan Green <evgreen@chromium.org> wrote:
> >
> > On Tue, Oct 30, 2018 at 4:06 PM Evan Green <evgreen@chromium.org> wrote:
> > >
> > > If the backing device for a loop device is a block device,
>
> This shouldn't be a very common use case wrt. loop.

Yeah, I'm starting to gather that. Or maybe I'm just the first one to
mention it on the kernel lists ;) We've used this in our Chrome OS
installer, I believe for many years. Gwendal piped in with a few
reasons we do it this way on the cover letter, but in general I think
it allows us to have a unified set of functions to install to a file,
disk, or prepare an image that may have a different block size than
those on the running system.

>
> > > then mirror the discard properties of the underlying block
> > > device into the loop device. While in there, differentiate
> > > between REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES, which are
> > > different for block devices, but which the loop device had
> > > just been lumping together.
> > >
> > > Signed-off-by: Evan Green <evgreen@chromium.org>
> >
> > Any thoughts on this patch? This fixes issues for us when using a loop
> > device backed by a block device, where we see many logs like:
> >
> > [  372.767286] print_req_error: I/O error, dev loop5, sector 88125696
>
> Seems not see any explanation about this IO error and the fix in your patch.
> Could you describe it a bit more?

Sure, I probably should have included more context with the series.

The loop device always reports that it supports discard, by setting up
the max_discard_sectors and max_write_zeroes_sectors in the blk queue.
When the loop device gets a discard or write-zeroes request, it turns
around and calls fallocate on the underlying device with the
PUNCH_HOLE flag. This makes sense when you're backed by a file and
hoping to just deallocate the space, but may fail when you're backed
by a block device that doesn't support discard, or doesn't write
zeroes to discarded sectors. Weirdly, lo_discard already had some code
for preserving EOPNOTSUPP, but then later the error is smashed into
EIO. Patch 1 pipes out EOPNOTSUPP properly, so it doesn't get squashed
into EIO.

Patch 2 reflects the discard characteristics of the underlying device
into the loop device. That way, if you're backed by a file or a block
device that does support discard, everything works great, and user
mode can even see and use the correct discard and write zero
granularities. If you're backed by a block device that does not
support discard, this is exposed to user mode, which then usually
avoids calling fallocate, and doesn't feel betrayed that their
requests are unexpectedly failing.

-Evan
Ming Lei Nov. 28, 2018, 1:26 a.m. UTC | #4
On Tue, Oct 30, 2018 at 04:06:24PM -0700, Evan Green wrote:
> If the backing device for a loop device is a block device,
> then mirror the discard properties of the underlying block
> device into the loop device. While in there, differentiate
> between REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES, which are
> different for block devices, but which the loop device had
> just been lumping together.
> 
> Signed-off-by: Evan Green <evgreen@chromium.org>
> ---
> 
>  drivers/block/loop.c | 61 +++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 41 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 28990fc94841a..176e65101c4ef 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -417,19 +417,14 @@ static int lo_read_transfer(struct loop_device *lo, struct request *rq,
>  	return ret;
>  }
>  
> -static int lo_discard(struct loop_device *lo, struct request *rq, loff_t pos)
> +static int lo_discard(struct loop_device *lo, struct request *rq,
> +		int mode, loff_t pos)
>  {
> -	/*
> -	 * We use punch hole to reclaim the free space used by the
> -	 * image a.k.a. discard. However we do not support discard if
> -	 * encryption is enabled, because it may give an attacker
> -	 * useful information.
> -	 */
>  	struct file *file = lo->lo_backing_file;
> -	int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
> +	struct request_queue *q = lo->lo_queue;
>  	int ret;
>  
> -	if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
> +	if (!blk_queue_discard(q)) {
>  		ret = -EOPNOTSUPP;
>  		goto out;
>  	}
> @@ -603,8 +598,13 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
>  	case REQ_OP_FLUSH:
>  		return lo_req_flush(lo, rq);
>  	case REQ_OP_DISCARD:
> +		return lo_discard(lo, rq,
> +			FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, pos);
> +
>  	case REQ_OP_WRITE_ZEROES:
> -		return lo_discard(lo, rq, pos);
> +		return lo_discard(lo, rq,
> +			FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE, pos);
> +
>  	case REQ_OP_WRITE:
>  		if (lo->transfer)
>  			return lo_write_transfer(lo, rq, pos);
> @@ -859,6 +859,25 @@ static void loop_config_discard(struct loop_device *lo)
>  	struct file *file = lo->lo_backing_file;
>  	struct inode *inode = file->f_mapping->host;
>  	struct request_queue *q = lo->lo_queue;
> +	struct request_queue *backingq;
> +
> +	/*
> +	 * If the backing device is a block device, mirror its discard
> +	 * capabilities.
> +	 */
> +	if (S_ISBLK(inode->i_mode)) {
> +		backingq = bdev_get_queue(inode->i_bdev);
> +		blk_queue_max_discard_sectors(q,
> +			backingq->limits.max_discard_sectors);
> +
> +		blk_queue_max_write_zeroes_sectors(q,
> +			backingq->limits.max_write_zeroes_sectors);
> +
> +		q->limits.discard_granularity =
> +			backingq->limits.discard_granularity;
> +
> +		q->limits.discard_alignment =
> +			backingq->limits.discard_alignment;

I think it isn't necessary to mirror backing queue's discard/write_zeros
capabilities, given either fs of the underlying queue can deal with well.

>  
>  	/*
>  	 * We use punch hole to reclaim the free space used by the
> @@ -866,22 +885,24 @@ static void loop_config_discard(struct loop_device *lo)
>  	 * encryption is enabled, because it may give an attacker
>  	 * useful information.
>  	 */
> -	if ((!file->f_op->fallocate) ||
> -	    lo->lo_encrypt_key_size) {
> +	} else if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
>  		q->limits.discard_granularity = 0;
>  		q->limits.discard_alignment = 0;
>  		blk_queue_max_discard_sectors(q, 0);
>  		blk_queue_max_write_zeroes_sectors(q, 0);
> -		blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
> -		return;
> -	}
>  
> -	q->limits.discard_granularity = inode->i_sb->s_blocksize;
> -	q->limits.discard_alignment = 0;
> +	} else {
> +		q->limits.discard_granularity = inode->i_sb->s_blocksize;
> +		q->limits.discard_alignment = 0;
> +
> +		blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> +		blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
> +	}
>  
> -	blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> -	blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
> -	blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
> +	if (q->limits.max_discard_sectors || q->limits.max_write_zeroes_sectors)
> +		blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
> +	else
> +		blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
>  }

Looks it should work just by mirroring backing queue's discard
capability to loop queue in case that the loop is backed by
block device, doesn't it? Meantime the unified discard limits &
write_zeros limits can be kept.

thanks,
Ming
Ming Lei Nov. 28, 2018, 1:28 a.m. UTC | #5
On Tue, Nov 27, 2018 at 03:34:04PM -0800, Evan Green wrote:
> Hi Ming,
> 
> On Mon, Nov 26, 2018 at 6:55 PM Ming Lei <tom.leiming@gmail.com> wrote:
> >
> > On Tue, Nov 27, 2018 at 2:55 AM Evan Green <evgreen@chromium.org> wrote:
> > >
> > > On Tue, Oct 30, 2018 at 4:06 PM Evan Green <evgreen@chromium.org> wrote:
> > > >
> > > > If the backing device for a loop device is a block device,
> >
> > This shouldn't be a very common use case wrt. loop.
> 
> Yeah, I'm starting to gather that. Or maybe I'm just the first one to
> mention it on the kernel lists ;) We've used this in our Chrome OS
> installer, I believe for many years. Gwendal piped in with a few
> reasons we do it this way on the cover letter, but in general I think
> it allows us to have a unified set of functions to install to a file,
> disk, or prepare an image that may have a different block size than
> those on the running system.

OK, got it, it makes sense.

> 
> >
> > > > then mirror the discard properties of the underlying block
> > > > device into the loop device. While in there, differentiate
> > > > between REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES, which are
> > > > different for block devices, but which the loop device had
> > > > just been lumping together.
> > > >
> > > > Signed-off-by: Evan Green <evgreen@chromium.org>
> > >
> > > Any thoughts on this patch? This fixes issues for us when using a loop
> > > device backed by a block device, where we see many logs like:
> > >
> > > [  372.767286] print_req_error: I/O error, dev loop5, sector 88125696
> >
> > Seems not see any explanation about this IO error and the fix in your patch.
> > Could you describe it a bit more?
> 
> Sure, I probably should have included more context with the series.
> 
> The loop device always reports that it supports discard, by setting up
> the max_discard_sectors and max_write_zeroes_sectors in the blk queue.
> When the loop device gets a discard or write-zeroes request, it turns
> around and calls fallocate on the underlying device with the
> PUNCH_HOLE flag. This makes sense when you're backed by a file and
> hoping to just deallocate the space, but may fail when you're backed
> by a block device that doesn't support discard, or doesn't write
> zeroes to discarded sectors. Weirdly, lo_discard already had some code
> for preserving EOPNOTSUPP, but then later the error is smashed into
> EIO. Patch 1 pipes out EOPNOTSUPP properly, so it doesn't get squashed
> into EIO.
> 
> Patch 2 reflects the discard characteristics of the underlying device
> into the loop device. That way, if you're backed by a file or a block
> device that does support discard, everything works great, and user
> mode can even see and use the correct discard and write zero
> granularities. If you're backed by a block device that does not
> support discard, this is exposed to user mode, which then usually
> avoids calling fallocate, and doesn't feel betrayed that their
> requests are unexpectedly failing.

Thanks for your detailed explanation, and I think we need to fix it.

Thanks,
Ming
Evan Green Dec. 4, 2018, 10:19 p.m. UTC | #6
Hi Ming,

On Tue, Nov 27, 2018 at 5:26 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Tue, Oct 30, 2018 at 04:06:24PM -0700, Evan Green wrote:
> > If the backing device for a loop device is a block device,
> > then mirror the discard properties of the underlying block
> > device into the loop device. While in there, differentiate
> > between REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES, which are
> > different for block devices, but which the loop device had
> > just been lumping together.
> >
> > Signed-off-by: Evan Green <evgreen@chromium.org>
> > ---
> >
> >  drivers/block/loop.c | 61 +++++++++++++++++++++++++++++++++++-----------------
> >  1 file changed, 41 insertions(+), 20 deletions(-)
> >
> > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > index 28990fc94841a..176e65101c4ef 100644
> > --- a/drivers/block/loop.c
> > +++ b/drivers/block/loop.c
> > @@ -417,19 +417,14 @@ static int lo_read_transfer(struct loop_device *lo, struct request *rq,
> >       return ret;
> >  }
> >
> > -static int lo_discard(struct loop_device *lo, struct request *rq, loff_t pos)
> > +static int lo_discard(struct loop_device *lo, struct request *rq,
> > +             int mode, loff_t pos)
> >  {
> > -     /*
> > -      * We use punch hole to reclaim the free space used by the
> > -      * image a.k.a. discard. However we do not support discard if
> > -      * encryption is enabled, because it may give an attacker
> > -      * useful information.
> > -      */
> >       struct file *file = lo->lo_backing_file;
> > -     int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
> > +     struct request_queue *q = lo->lo_queue;
> >       int ret;
> >
> > -     if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
> > +     if (!blk_queue_discard(q)) {
> >               ret = -EOPNOTSUPP;
> >               goto out;
> >       }
> > @@ -603,8 +598,13 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
> >       case REQ_OP_FLUSH:
> >               return lo_req_flush(lo, rq);
> >       case REQ_OP_DISCARD:
> > +             return lo_discard(lo, rq,
> > +                     FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, pos);
> > +
> >       case REQ_OP_WRITE_ZEROES:
> > -             return lo_discard(lo, rq, pos);
> > +             return lo_discard(lo, rq,
> > +                     FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE, pos);
> > +
> >       case REQ_OP_WRITE:
> >               if (lo->transfer)
> >                       return lo_write_transfer(lo, rq, pos);
> > @@ -859,6 +859,25 @@ static void loop_config_discard(struct loop_device *lo)
> >       struct file *file = lo->lo_backing_file;
> >       struct inode *inode = file->f_mapping->host;
> >       struct request_queue *q = lo->lo_queue;
> > +     struct request_queue *backingq;
> > +
> > +     /*
> > +      * If the backing device is a block device, mirror its discard
> > +      * capabilities.
> > +      */
> > +     if (S_ISBLK(inode->i_mode)) {
> > +             backingq = bdev_get_queue(inode->i_bdev);
> > +             blk_queue_max_discard_sectors(q,
> > +                     backingq->limits.max_discard_sectors);
> > +
> > +             blk_queue_max_write_zeroes_sectors(q,
> > +                     backingq->limits.max_write_zeroes_sectors);
> > +
> > +             q->limits.discard_granularity =
> > +                     backingq->limits.discard_granularity;
> > +
> > +             q->limits.discard_alignment =
> > +                     backingq->limits.discard_alignment;
>
> I think it isn't necessary to mirror backing queue's discard/write_zeros
> capabilities, given either fs of the underlying queue can deal with well.
>
> >
> >       /*
> >        * We use punch hole to reclaim the free space used by the
> > @@ -866,22 +885,24 @@ static void loop_config_discard(struct loop_device *lo)
> >        * encryption is enabled, because it may give an attacker
> >        * useful information.
> >        */
> > -     if ((!file->f_op->fallocate) ||
> > -         lo->lo_encrypt_key_size) {
> > +     } else if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
> >               q->limits.discard_granularity = 0;
> >               q->limits.discard_alignment = 0;
> >               blk_queue_max_discard_sectors(q, 0);
> >               blk_queue_max_write_zeroes_sectors(q, 0);
> > -             blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
> > -             return;
> > -     }
> >
> > -     q->limits.discard_granularity = inode->i_sb->s_blocksize;
> > -     q->limits.discard_alignment = 0;
> > +     } else {
> > +             q->limits.discard_granularity = inode->i_sb->s_blocksize;
> > +             q->limits.discard_alignment = 0;
> > +
> > +             blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> > +             blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
> > +     }
> >
> > -     blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> > -     blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
> > -     blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
> > +     if (q->limits.max_discard_sectors || q->limits.max_write_zeroes_sectors)
> > +             blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
> > +     else
> > +             blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
> >  }
>
> Looks it should work just by mirroring backing queue's discard
> capability to loop queue in case that the loop is backed by
> block device, doesn't it? Meantime the unified discard limits &
> write_zeros limits can be kept.

I tested this out, and you're right that I could just flip the
QUEUE_FLAG_DISCARD based on whether its a block device, and leave
everything else alone, to completely disable discard support for loop
devices backed by block devices. This seems to work for programs like
mkfs.ext4, but still leaves problems for coreutils cp.

But is that really the right call? With this change, we're not only
able to use loop devices in this way, but we're able to use the
discard and zero functionality of the underlying block device by
simply passing it through. To me that seemed better than disabling all
discard support for block devices, which would severely slow us down
on some devices.
-Evan
Ming Lei Dec. 5, 2018, 1:10 a.m. UTC | #7
On Tue, Dec 04, 2018 at 02:19:46PM -0800, Evan Green wrote:
> Hi Ming,
> 
> On Tue, Nov 27, 2018 at 5:26 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Tue, Oct 30, 2018 at 04:06:24PM -0700, Evan Green wrote:
> > > If the backing device for a loop device is a block device,
> > > then mirror the discard properties of the underlying block
> > > device into the loop device. While in there, differentiate
> > > between REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES, which are
> > > different for block devices, but which the loop device had
> > > just been lumping together.
> > >
> > > Signed-off-by: Evan Green <evgreen@chromium.org>
> > > ---
> > >
> > >  drivers/block/loop.c | 61 +++++++++++++++++++++++++++++++++++-----------------
> > >  1 file changed, 41 insertions(+), 20 deletions(-)
> > >
> > > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > > index 28990fc94841a..176e65101c4ef 100644
> > > --- a/drivers/block/loop.c
> > > +++ b/drivers/block/loop.c
> > > @@ -417,19 +417,14 @@ static int lo_read_transfer(struct loop_device *lo, struct request *rq,
> > >       return ret;
> > >  }
> > >
> > > -static int lo_discard(struct loop_device *lo, struct request *rq, loff_t pos)
> > > +static int lo_discard(struct loop_device *lo, struct request *rq,
> > > +             int mode, loff_t pos)
> > >  {
> > > -     /*
> > > -      * We use punch hole to reclaim the free space used by the
> > > -      * image a.k.a. discard. However we do not support discard if
> > > -      * encryption is enabled, because it may give an attacker
> > > -      * useful information.
> > > -      */
> > >       struct file *file = lo->lo_backing_file;
> > > -     int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
> > > +     struct request_queue *q = lo->lo_queue;
> > >       int ret;
> > >
> > > -     if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
> > > +     if (!blk_queue_discard(q)) {
> > >               ret = -EOPNOTSUPP;
> > >               goto out;
> > >       }
> > > @@ -603,8 +598,13 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
> > >       case REQ_OP_FLUSH:
> > >               return lo_req_flush(lo, rq);
> > >       case REQ_OP_DISCARD:
> > > +             return lo_discard(lo, rq,
> > > +                     FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, pos);
> > > +
> > >       case REQ_OP_WRITE_ZEROES:
> > > -             return lo_discard(lo, rq, pos);
> > > +             return lo_discard(lo, rq,
> > > +                     FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE, pos);
> > > +
> > >       case REQ_OP_WRITE:
> > >               if (lo->transfer)
> > >                       return lo_write_transfer(lo, rq, pos);
> > > @@ -859,6 +859,25 @@ static void loop_config_discard(struct loop_device *lo)
> > >       struct file *file = lo->lo_backing_file;
> > >       struct inode *inode = file->f_mapping->host;
> > >       struct request_queue *q = lo->lo_queue;
> > > +     struct request_queue *backingq;
> > > +
> > > +     /*
> > > +      * If the backing device is a block device, mirror its discard
> > > +      * capabilities.
> > > +      */
> > > +     if (S_ISBLK(inode->i_mode)) {
> > > +             backingq = bdev_get_queue(inode->i_bdev);
> > > +             blk_queue_max_discard_sectors(q,
> > > +                     backingq->limits.max_discard_sectors);
> > > +
> > > +             blk_queue_max_write_zeroes_sectors(q,
> > > +                     backingq->limits.max_write_zeroes_sectors);
> > > +
> > > +             q->limits.discard_granularity =
> > > +                     backingq->limits.discard_granularity;
> > > +
> > > +             q->limits.discard_alignment =
> > > +                     backingq->limits.discard_alignment;
> >
> > I think it isn't necessary to mirror backing queue's discard/write_zeros
> > capabilities, given either fs of the underlying queue can deal with well.
> >
> > >
> > >       /*
> > >        * We use punch hole to reclaim the free space used by the
> > > @@ -866,22 +885,24 @@ static void loop_config_discard(struct loop_device *lo)
> > >        * encryption is enabled, because it may give an attacker
> > >        * useful information.
> > >        */
> > > -     if ((!file->f_op->fallocate) ||
> > > -         lo->lo_encrypt_key_size) {
> > > +     } else if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
> > >               q->limits.discard_granularity = 0;
> > >               q->limits.discard_alignment = 0;
> > >               blk_queue_max_discard_sectors(q, 0);
> > >               blk_queue_max_write_zeroes_sectors(q, 0);
> > > -             blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
> > > -             return;
> > > -     }
> > >
> > > -     q->limits.discard_granularity = inode->i_sb->s_blocksize;
> > > -     q->limits.discard_alignment = 0;
> > > +     } else {
> > > +             q->limits.discard_granularity = inode->i_sb->s_blocksize;
> > > +             q->limits.discard_alignment = 0;
> > > +
> > > +             blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> > > +             blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
> > > +     }
> > >
> > > -     blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> > > -     blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
> > > -     blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
> > > +     if (q->limits.max_discard_sectors || q->limits.max_write_zeroes_sectors)
> > > +             blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
> > > +     else
> > > +             blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
> > >  }
> >
> > Looks it should work just by mirroring backing queue's discard
> > capability to loop queue in case that the loop is backed by
> > block device, doesn't it? Meantime the unified discard limits &
> > write_zeros limits can be kept.
> 
> I tested this out, and you're right that I could just flip the
> QUEUE_FLAG_DISCARD based on whether its a block device, and leave

What I meant actually is to do the following discard config:

	bool discard;
	if (S_ISBLK(inode->i_mode)) {
		struct request_queue *backingq = bdev_get_queue(inode->i_bdev);
		discard = blk_queue_discard(backingq);
	} else if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size)
		discard = false;
	else
		discard = true;

	if (discard) {
		blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
		blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
		blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
	} else {
		blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
	}

> everything else alone, to completely disable discard support for loop
> devices backed by block devices. This seems to work for programs like
> mkfs.ext4, but still leaves problems for coreutils cp.
> 
> But is that really the right call? With this change, we're not only
> able to use loop devices in this way, but we're able to use the
> discard and zero functionality of the underlying block device by
> simply passing it through. To me that seemed better than disabling all
> discard support for block devices, which would severely slow us down
> on some devices.

I guess the above approach can do the same job with yours, but simpler.

thanks,
Ming
Evan Green Dec. 5, 2018, 7:35 p.m. UTC | #8
Hi Ming,

On Tue, Dec 4, 2018 at 5:11 PM Ming Lei <ming.lei@redhat.com> wrote:
>
> On Tue, Dec 04, 2018 at 02:19:46PM -0800, Evan Green wrote:
> > Hi Ming,
> >
> > On Tue, Nov 27, 2018 at 5:26 PM Ming Lei <ming.lei@redhat.com> wrote:
> > >
> > > On Tue, Oct 30, 2018 at 04:06:24PM -0700, Evan Green wrote:
> > > > If the backing device for a loop device is a block device,
> > > > then mirror the discard properties of the underlying block
> > > > device into the loop device. While in there, differentiate
> > > > between REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES, which are
> > > > different for block devices, but which the loop device had
> > > > just been lumping together.
> > > >
> > > > Signed-off-by: Evan Green <evgreen@chromium.org>
> > > > ---
> > > >
> > > >  drivers/block/loop.c | 61 +++++++++++++++++++++++++++++++++++-----------------
> > > >  1 file changed, 41 insertions(+), 20 deletions(-)
> > > >
> > > > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > > > index 28990fc94841a..176e65101c4ef 100644
> > > > --- a/drivers/block/loop.c
> > > > +++ b/drivers/block/loop.c
> > > > @@ -417,19 +417,14 @@ static int lo_read_transfer(struct loop_device *lo, struct request *rq,
> > > >       return ret;
> > > >  }
> > > >
> > > > -static int lo_discard(struct loop_device *lo, struct request *rq, loff_t pos)
> > > > +static int lo_discard(struct loop_device *lo, struct request *rq,
> > > > +             int mode, loff_t pos)
> > > >  {
> > > > -     /*
> > > > -      * We use punch hole to reclaim the free space used by the
> > > > -      * image a.k.a. discard. However we do not support discard if
> > > > -      * encryption is enabled, because it may give an attacker
> > > > -      * useful information.
> > > > -      */
> > > >       struct file *file = lo->lo_backing_file;
> > > > -     int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
> > > > +     struct request_queue *q = lo->lo_queue;
> > > >       int ret;
> > > >
> > > > -     if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
> > > > +     if (!blk_queue_discard(q)) {
> > > >               ret = -EOPNOTSUPP;
> > > >               goto out;
> > > >       }
> > > > @@ -603,8 +598,13 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
> > > >       case REQ_OP_FLUSH:
> > > >               return lo_req_flush(lo, rq);
> > > >       case REQ_OP_DISCARD:
> > > > +             return lo_discard(lo, rq,
> > > > +                     FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, pos);
> > > > +
> > > >       case REQ_OP_WRITE_ZEROES:
> > > > -             return lo_discard(lo, rq, pos);
> > > > +             return lo_discard(lo, rq,
> > > > +                     FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE, pos);
> > > > +
> > > >       case REQ_OP_WRITE:
> > > >               if (lo->transfer)
> > > >                       return lo_write_transfer(lo, rq, pos);
> > > > @@ -859,6 +859,25 @@ static void loop_config_discard(struct loop_device *lo)
> > > >       struct file *file = lo->lo_backing_file;
> > > >       struct inode *inode = file->f_mapping->host;
> > > >       struct request_queue *q = lo->lo_queue;
> > > > +     struct request_queue *backingq;
> > > > +
> > > > +     /*
> > > > +      * If the backing device is a block device, mirror its discard
> > > > +      * capabilities.
> > > > +      */
> > > > +     if (S_ISBLK(inode->i_mode)) {
> > > > +             backingq = bdev_get_queue(inode->i_bdev);
> > > > +             blk_queue_max_discard_sectors(q,
> > > > +                     backingq->limits.max_discard_sectors);
> > > > +
> > > > +             blk_queue_max_write_zeroes_sectors(q,
> > > > +                     backingq->limits.max_write_zeroes_sectors);
> > > > +
> > > > +             q->limits.discard_granularity =
> > > > +                     backingq->limits.discard_granularity;
> > > > +
> > > > +             q->limits.discard_alignment =
> > > > +                     backingq->limits.discard_alignment;
> > >
> > > I think it isn't necessary to mirror backing queue's discard/write_zeros
> > > capabilities, given either fs of the underlying queue can deal with well.
> > >
> > > >
> > > >       /*
> > > >        * We use punch hole to reclaim the free space used by the
> > > > @@ -866,22 +885,24 @@ static void loop_config_discard(struct loop_device *lo)
> > > >        * encryption is enabled, because it may give an attacker
> > > >        * useful information.
> > > >        */
> > > > -     if ((!file->f_op->fallocate) ||
> > > > -         lo->lo_encrypt_key_size) {
> > > > +     } else if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
> > > >               q->limits.discard_granularity = 0;
> > > >               q->limits.discard_alignment = 0;
> > > >               blk_queue_max_discard_sectors(q, 0);
> > > >               blk_queue_max_write_zeroes_sectors(q, 0);
> > > > -             blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
> > > > -             return;
> > > > -     }
> > > >
> > > > -     q->limits.discard_granularity = inode->i_sb->s_blocksize;
> > > > -     q->limits.discard_alignment = 0;
> > > > +     } else {
> > > > +             q->limits.discard_granularity = inode->i_sb->s_blocksize;
> > > > +             q->limits.discard_alignment = 0;
> > > > +
> > > > +             blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> > > > +             blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
> > > > +     }
> > > >
> > > > -     blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> > > > -     blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
> > > > -     blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
> > > > +     if (q->limits.max_discard_sectors || q->limits.max_write_zeroes_sectors)
> > > > +             blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
> > > > +     else
> > > > +             blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
> > > >  }
> > >
> > > Looks it should work just by mirroring backing queue's discard
> > > capability to loop queue in case that the loop is backed by
> > > block device, doesn't it? Meantime the unified discard limits &
> > > write_zeros limits can be kept.
> >
> > I tested this out, and you're right that I could just flip the
> > QUEUE_FLAG_DISCARD based on whether its a block device, and leave
>
> What I meant actually is to do the following discard config:
>
>         bool discard;
>         if (S_ISBLK(inode->i_mode)) {
>                 struct request_queue *backingq = bdev_get_queue(inode->i_bdev);
>                 discard = blk_queue_discard(backingq);
>         } else if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size)
>                 discard = false;
>         else
>                 discard = true;
>
>         if (discard) {
>                 blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
>                 blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
>                 blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
>         } else {
>                 blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
>         }

Ah, I see. But I think it's useful to reflect max_discard_sectors,
max_write_zeroes_sectors, discard_granularity, and discard_alignment
from the block device to the loop device. With the exception of
discard_alignment, these parameters are visible via sysfs, so usermode
can actually use these to make more intelligent use of fallocate.
Without this part of it, I still see issues with GNU cp.

I'm not totally sure about discard_alignment, that seems to be useful
in cases of merging blk requests. So I can stop mirroring that one if
it's harmful or not helpful. But unless it's a nak, I'd really love to
keep most of the mirroring. In which case the bool doesn't do a whole
lot of simplifying.

>
> > everything else alone, to completely disable discard support for loop
> > devices backed by block devices. This seems to work for programs like
> > mkfs.ext4, but still leaves problems for coreutils cp.
> >
> > But is that really the right call? With this change, we're not only
> > able to use loop devices in this way, but we're able to use the
> > discard and zero functionality of the underlying block device by
> > simply passing it through. To me that seemed better than disabling all
> > discard support for block devices, which would severely slow us down
> > on some devices.
>
> I guess the above approach can do the same job with yours, but simpler.
>
> thanks,
> Ming
Ming Lei Dec. 6, 2018, 12:22 a.m. UTC | #9
On Wed, Dec 05, 2018 at 11:35:57AM -0800, Evan Green wrote:
> Hi Ming,
> 
> On Tue, Dec 4, 2018 at 5:11 PM Ming Lei <ming.lei@redhat.com> wrote:
> >
> > On Tue, Dec 04, 2018 at 02:19:46PM -0800, Evan Green wrote:
> > > Hi Ming,
> > >
> > > On Tue, Nov 27, 2018 at 5:26 PM Ming Lei <ming.lei@redhat.com> wrote:
> > > >
> > > > On Tue, Oct 30, 2018 at 04:06:24PM -0700, Evan Green wrote:
> > > > > If the backing device for a loop device is a block device,
> > > > > then mirror the discard properties of the underlying block
> > > > > device into the loop device. While in there, differentiate
> > > > > between REQ_OP_DISCARD and REQ_OP_WRITE_ZEROES, which are
> > > > > different for block devices, but which the loop device had
> > > > > just been lumping together.
> > > > >
> > > > > Signed-off-by: Evan Green <evgreen@chromium.org>
> > > > > ---
> > > > >
> > > > >  drivers/block/loop.c | 61 +++++++++++++++++++++++++++++++++++-----------------
> > > > >  1 file changed, 41 insertions(+), 20 deletions(-)
> > > > >
> > > > > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > > > > index 28990fc94841a..176e65101c4ef 100644
> > > > > --- a/drivers/block/loop.c
> > > > > +++ b/drivers/block/loop.c
> > > > > @@ -417,19 +417,14 @@ static int lo_read_transfer(struct loop_device *lo, struct request *rq,
> > > > >       return ret;
> > > > >  }
> > > > >
> > > > > -static int lo_discard(struct loop_device *lo, struct request *rq, loff_t pos)
> > > > > +static int lo_discard(struct loop_device *lo, struct request *rq,
> > > > > +             int mode, loff_t pos)
> > > > >  {
> > > > > -     /*
> > > > > -      * We use punch hole to reclaim the free space used by the
> > > > > -      * image a.k.a. discard. However we do not support discard if
> > > > > -      * encryption is enabled, because it may give an attacker
> > > > > -      * useful information.
> > > > > -      */
> > > > >       struct file *file = lo->lo_backing_file;
> > > > > -     int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
> > > > > +     struct request_queue *q = lo->lo_queue;
> > > > >       int ret;
> > > > >
> > > > > -     if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
> > > > > +     if (!blk_queue_discard(q)) {
> > > > >               ret = -EOPNOTSUPP;
> > > > >               goto out;
> > > > >       }
> > > > > @@ -603,8 +598,13 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
> > > > >       case REQ_OP_FLUSH:
> > > > >               return lo_req_flush(lo, rq);
> > > > >       case REQ_OP_DISCARD:
> > > > > +             return lo_discard(lo, rq,
> > > > > +                     FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, pos);
> > > > > +
> > > > >       case REQ_OP_WRITE_ZEROES:
> > > > > -             return lo_discard(lo, rq, pos);
> > > > > +             return lo_discard(lo, rq,
> > > > > +                     FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE, pos);
> > > > > +
> > > > >       case REQ_OP_WRITE:
> > > > >               if (lo->transfer)
> > > > >                       return lo_write_transfer(lo, rq, pos);
> > > > > @@ -859,6 +859,25 @@ static void loop_config_discard(struct loop_device *lo)
> > > > >       struct file *file = lo->lo_backing_file;
> > > > >       struct inode *inode = file->f_mapping->host;
> > > > >       struct request_queue *q = lo->lo_queue;
> > > > > +     struct request_queue *backingq;
> > > > > +
> > > > > +     /*
> > > > > +      * If the backing device is a block device, mirror its discard
> > > > > +      * capabilities.
> > > > > +      */
> > > > > +     if (S_ISBLK(inode->i_mode)) {
> > > > > +             backingq = bdev_get_queue(inode->i_bdev);
> > > > > +             blk_queue_max_discard_sectors(q,
> > > > > +                     backingq->limits.max_discard_sectors);
> > > > > +
> > > > > +             blk_queue_max_write_zeroes_sectors(q,
> > > > > +                     backingq->limits.max_write_zeroes_sectors);
> > > > > +
> > > > > +             q->limits.discard_granularity =
> > > > > +                     backingq->limits.discard_granularity;
> > > > > +
> > > > > +             q->limits.discard_alignment =
> > > > > +                     backingq->limits.discard_alignment;
> > > >
> > > > I think it isn't necessary to mirror backing queue's discard/write_zeros
> > > > capabilities, given either fs of the underlying queue can deal with well.
> > > >
> > > > >
> > > > >       /*
> > > > >        * We use punch hole to reclaim the free space used by the
> > > > > @@ -866,22 +885,24 @@ static void loop_config_discard(struct loop_device *lo)
> > > > >        * encryption is enabled, because it may give an attacker
> > > > >        * useful information.
> > > > >        */
> > > > > -     if ((!file->f_op->fallocate) ||
> > > > > -         lo->lo_encrypt_key_size) {
> > > > > +     } else if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
> > > > >               q->limits.discard_granularity = 0;
> > > > >               q->limits.discard_alignment = 0;
> > > > >               blk_queue_max_discard_sectors(q, 0);
> > > > >               blk_queue_max_write_zeroes_sectors(q, 0);
> > > > > -             blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
> > > > > -             return;
> > > > > -     }
> > > > >
> > > > > -     q->limits.discard_granularity = inode->i_sb->s_blocksize;
> > > > > -     q->limits.discard_alignment = 0;
> > > > > +     } else {
> > > > > +             q->limits.discard_granularity = inode->i_sb->s_blocksize;
> > > > > +             q->limits.discard_alignment = 0;
> > > > > +
> > > > > +             blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> > > > > +             blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
> > > > > +     }
> > > > >
> > > > > -     blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> > > > > -     blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
> > > > > -     blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
> > > > > +     if (q->limits.max_discard_sectors || q->limits.max_write_zeroes_sectors)
> > > > > +             blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
> > > > > +     else
> > > > > +             blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
> > > > >  }
> > > >
> > > > Looks it should work just by mirroring backing queue's discard
> > > > capability to loop queue in case that the loop is backed by
> > > > block device, doesn't it? Meantime the unified discard limits &
> > > > write_zeros limits can be kept.
> > >
> > > I tested this out, and you're right that I could just flip the
> > > QUEUE_FLAG_DISCARD based on whether its a block device, and leave
> >
> > What I meant actually is to do the following discard config:
> >
> >         bool discard;
> >         if (S_ISBLK(inode->i_mode)) {
> >                 struct request_queue *backingq = bdev_get_queue(inode->i_bdev);
> >                 discard = blk_queue_discard(backingq);
> >         } else if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size)
> >                 discard = false;
> >         else
> >                 discard = true;
> >
> >         if (discard) {
> >                 blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
> >                 blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
> >                 blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
> >         } else {
> >                 blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
> >         }
> 
> Ah, I see. But I think it's useful to reflect max_discard_sectors,
> max_write_zeroes_sectors, discard_granularity, and discard_alignment
> from the block device to the loop device. With the exception of
> discard_alignment, these parameters are visible via sysfs, so usermode
> can actually use these to make more intelligent use of fallocate.

Could you share us what the intelligent use of fallocate is?

The block layer code of blk_bio_discard_split() can deal with all the
magic limits.

The unified discard limits is simpler from implement view of loop, or
from userspace view.

> Without this part of it, I still see issues with GNU cp.

Could you investigate the cause of the GNU cp issue?

Thanks,
Ming
Martin K. Petersen Dec. 6, 2018, 3:15 a.m. UTC | #10
Evan,

> Ah, I see. But I think it's useful to reflect max_discard_sectors,
> max_write_zeroes_sectors, discard_granularity, and discard_alignment
> from the block device to the loop device. With the exception of
> discard_alignment, these parameters are visible via sysfs,

discard_alignment is visible in sysfs, just not in the queue directory
since alignment can be different on a per-partition basis. So there is
one discard_alignment at the root of each /sys/block/foo directory and
one for each partition subdirectory. This mirrors the alignment_offset
variable which indicates a partitions alignment wrt. the underlying
physical block size.

That said, there are not many devices that actually report a non-zero
discard alignment so it's not as useful as the device manufacturers
(that were looking for an implementation shortcut) envisioned.

> I'm not totally sure about discard_alignment, that seems to be useful
> in cases of merging blk requests. So I can stop mirroring that one if
> it's harmful or not helpful. But unless it's a nak, I'd really love to
> keep most of the mirroring. In which case the bool doesn't do a whole
> lot of simplifying.

I think it's fine to export these. The block device topology was
explicitly designed to be stackable like this.
Evan Green Dec. 10, 2018, 5:31 p.m. UTC | #11
On Wed, Dec 5, 2018 at 7:15 PM Martin K. Petersen
<martin.petersen@oracle.com> wrote:
>
>
> Evan,
>
> > Ah, I see. But I think it's useful to reflect max_discard_sectors,
> > max_write_zeroes_sectors, discard_granularity, and discard_alignment
> > from the block device to the loop device. With the exception of
> > discard_alignment, these parameters are visible via sysfs,
>
> discard_alignment is visible in sysfs, just not in the queue directory
> since alignment can be different on a per-partition basis. So there is
> one discard_alignment at the root of each /sys/block/foo directory and
> one for each partition subdirectory. This mirrors the alignment_offset
> variable which indicates a partitions alignment wrt. the underlying
> physical block size.
>
> That said, there are not many devices that actually report a non-zero
> discard alignment so it's not as useful as the device manufacturers
> (that were looking for an implementation shortcut) envisioned.

Ah ok, thanks.

>
> > I'm not totally sure about discard_alignment, that seems to be useful
> > in cases of merging blk requests. So I can stop mirroring that one if
> > it's harmful or not helpful. But unless it's a nak, I'd really love to
> > keep most of the mirroring. In which case the bool doesn't do a whole
> > lot of simplifying.
>
> I think it's fine to export these. The block device topology was
> explicitly designed to be stackable like this.

Yeah, it seemed to fall in pretty naturally, which is why I was hoping
it might not be so controversial. Thanks Martin.
-Evan
Evan Green Dec. 18, 2018, 11:48 p.m. UTC | #12
On Mon, Dec 10, 2018 at 9:31 AM Evan Green <evgreen@chromium.org> wrote:
>
> On Wed, Dec 5, 2018 at 7:15 PM Martin K. Petersen
> <martin.petersen@oracle.com> wrote:
> >
> >
> > Evan,
> >
> > > Ah, I see. But I think it's useful to reflect max_discard_sectors,
> > > max_write_zeroes_sectors, discard_granularity, and discard_alignment
> > > from the block device to the loop device. With the exception of
> > > discard_alignment, these parameters are visible via sysfs,
> >
> > discard_alignment is visible in sysfs, just not in the queue directory
> > since alignment can be different on a per-partition basis. So there is
> > one discard_alignment at the root of each /sys/block/foo directory and
> > one for each partition subdirectory. This mirrors the alignment_offset
> > variable which indicates a partitions alignment wrt. the underlying
> > physical block size.
> >
> > That said, there are not many devices that actually report a non-zero
> > discard alignment so it's not as useful as the device manufacturers
> > (that were looking for an implementation shortcut) envisioned.
>
> Ah ok, thanks.
>
> >
> > > I'm not totally sure about discard_alignment, that seems to be useful
> > > in cases of merging blk requests. So I can stop mirroring that one if
> > > it's harmful or not helpful. But unless it's a nak, I'd really love to
> > > keep most of the mirroring. In which case the bool doesn't do a whole
> > > lot of simplifying.
> >
> > I think it's fine to export these. The block device topology was
> > explicitly designed to be stackable like this.
>
> Yeah, it seemed to fall in pretty naturally, which is why I was hoping
> it might not be so controversial. Thanks Martin.

Any other thoughts about this series?

Ming, I did go back and experiment a little. The pseudocode you had
proposed earlier would work, and solve the error prints I was seeing.
But I still would like to keep the reflection of the block device
properties, since that allows discard to be used on block devices that
do support it.
-Evan
diff mbox series

Patch

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 28990fc94841a..176e65101c4ef 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -417,19 +417,14 @@  static int lo_read_transfer(struct loop_device *lo, struct request *rq,
 	return ret;
 }
 
-static int lo_discard(struct loop_device *lo, struct request *rq, loff_t pos)
+static int lo_discard(struct loop_device *lo, struct request *rq,
+		int mode, loff_t pos)
 {
-	/*
-	 * We use punch hole to reclaim the free space used by the
-	 * image a.k.a. discard. However we do not support discard if
-	 * encryption is enabled, because it may give an attacker
-	 * useful information.
-	 */
 	struct file *file = lo->lo_backing_file;
-	int mode = FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE;
+	struct request_queue *q = lo->lo_queue;
 	int ret;
 
-	if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
+	if (!blk_queue_discard(q)) {
 		ret = -EOPNOTSUPP;
 		goto out;
 	}
@@ -603,8 +598,13 @@  static int do_req_filebacked(struct loop_device *lo, struct request *rq)
 	case REQ_OP_FLUSH:
 		return lo_req_flush(lo, rq);
 	case REQ_OP_DISCARD:
+		return lo_discard(lo, rq,
+			FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, pos);
+
 	case REQ_OP_WRITE_ZEROES:
-		return lo_discard(lo, rq, pos);
+		return lo_discard(lo, rq,
+			FALLOC_FL_ZERO_RANGE | FALLOC_FL_KEEP_SIZE, pos);
+
 	case REQ_OP_WRITE:
 		if (lo->transfer)
 			return lo_write_transfer(lo, rq, pos);
@@ -859,6 +859,25 @@  static void loop_config_discard(struct loop_device *lo)
 	struct file *file = lo->lo_backing_file;
 	struct inode *inode = file->f_mapping->host;
 	struct request_queue *q = lo->lo_queue;
+	struct request_queue *backingq;
+
+	/*
+	 * If the backing device is a block device, mirror its discard
+	 * capabilities.
+	 */
+	if (S_ISBLK(inode->i_mode)) {
+		backingq = bdev_get_queue(inode->i_bdev);
+		blk_queue_max_discard_sectors(q,
+			backingq->limits.max_discard_sectors);
+
+		blk_queue_max_write_zeroes_sectors(q,
+			backingq->limits.max_write_zeroes_sectors);
+
+		q->limits.discard_granularity =
+			backingq->limits.discard_granularity;
+
+		q->limits.discard_alignment =
+			backingq->limits.discard_alignment;
 
 	/*
 	 * We use punch hole to reclaim the free space used by the
@@ -866,22 +885,24 @@  static void loop_config_discard(struct loop_device *lo)
 	 * encryption is enabled, because it may give an attacker
 	 * useful information.
 	 */
-	if ((!file->f_op->fallocate) ||
-	    lo->lo_encrypt_key_size) {
+	} else if ((!file->f_op->fallocate) || lo->lo_encrypt_key_size) {
 		q->limits.discard_granularity = 0;
 		q->limits.discard_alignment = 0;
 		blk_queue_max_discard_sectors(q, 0);
 		blk_queue_max_write_zeroes_sectors(q, 0);
-		blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
-		return;
-	}
 
-	q->limits.discard_granularity = inode->i_sb->s_blocksize;
-	q->limits.discard_alignment = 0;
+	} else {
+		q->limits.discard_granularity = inode->i_sb->s_blocksize;
+		q->limits.discard_alignment = 0;
+
+		blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
+		blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
+	}
 
-	blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
-	blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
-	blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
+	if (q->limits.max_discard_sectors || q->limits.max_write_zeroes_sectors)
+		blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
+	else
+		blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
 }
 
 static void loop_unprepare_queue(struct loop_device *lo)