block: loop: set discard granularity and alignment for block device backed loop
diff mbox series

Message ID 20200804041508.1870113-1-ming.lei@redhat.com
State New
Headers show
Series
  • block: loop: set discard granularity and alignment for block device backed loop
Related show

Commit Message

Ming Lei Aug. 4, 2020, 4:15 a.m. UTC
In case of block device backend, if the backend supports discard, the
loop device will set queue flag of QUEUE_FLAG_DISCARD.

However, limits.discard_granularity isn't setup, and this way is wrong,
see the following description in Documentation/ABI/testing/sysfs-block:

	A discard_granularity of 0 means that the device does not support
	discard functionality.

Especially 9b15d109a6b2 ("block: improve discard bio alignment in
__blkdev_issue_discard()") starts to take q->limits.discard_granularity
for computing max discard sectors. And zero discard granularity causes
kernel oops[1].

Fix the issue by set up discard granularity and alignment.

[1] kernel oops when use block disk as loop backend:

[   33.405947] ------------[ cut here ]------------
[   33.405948] kernel BUG at block/blk-mq.c:563!
[   33.407504] invalid opcode: 0000 [#1] SMP PTI
[   33.408245] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.8.0-rc2_update_rq+ #17
[   33.409466] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx2.fedor4
[   33.411546] RIP: 0010:blk_mq_end_request+0x1c/0x2a
[   33.412354] Code: 48 89 df 5b 5d 41 5c 41 5d e9 2d fe ff ff 0f 1f 44 00 00 55 48 89 fd 53 40 0f b6 de 8b 57 5
[   33.415724] RSP: 0018:ffffc900019ccf48 EFLAGS: 00010202
[   33.416688] RAX: 0000000000000001 RBX: 0000000000000000 RCX: ffff88826f2600c0
[   33.417990] RDX: 0000000000200042 RSI: 0000000000000001 RDI: ffff888270c13100
[   33.419286] RBP: ffff888270c13100 R08: 0000000000000001 R09: ffffffff81345144
[   33.420584] R10: ffff88826f260600 R11: 0000000000000000 R12: ffff888270c13100
[   33.421881] R13: ffffffff820050c0 R14: 0000000000000004 R15: 0000000000000004
[   33.423053] FS:  0000000000000000(0000) GS:ffff888277d80000(0000) knlGS:0000000000000000
[   33.424360] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   33.425416] CR2: 00005581d7903f08 CR3: 00000001e9a16002 CR4: 0000000000760ee0
[   33.426580] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   33.427706] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   33.428910] PKRU: 55555554
[   33.429412] Call Trace:
[   33.429858]  <IRQ>
[   33.430189]  blk_done_softirq+0x84/0xa4
[   33.430884]  __do_softirq+0x11a/0x278
[   33.431567]  asm_call_on_stack+0x12/0x20
[   33.432283]  </IRQ>
[   33.432673]  do_softirq_own_stack+0x31/0x40
[   33.433448]  __irq_exit_rcu+0x46/0xae
[   33.434132]  sysvec_call_function_single+0x7d/0x8b
[   33.435021]  asm_sysvec_call_function_single+0x12/0x20
[   33.435965] RIP: 0010:native_safe_halt+0x7/0x8
[   33.436795] Code: 25 c0 7b 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b 00 a8 08 74 0b 65 81 25 db 38 95 7e b
[   33.440179] RSP: 0018:ffffc9000191fee0 EFLAGS: 00000246
[   33.441143] RAX: ffffffff816c4118 RBX: 0000000000000000 RCX: ffff888276631a00
[   33.442442] RDX: 000000000000ddfe RSI: 0000000000000003 RDI: 0000000000000001
[   33.443742] RBP: 0000000000000000 R08: 00000000f461e6ac R09: 0000000000000004
[   33.445046] R10: 8080808080808080 R11: fefefefefefefeff R12: 0000000000000000
[   33.446352] R13: 0000000000000003 R14: ffffffffffffffff R15: 0000000000000000
[   33.447450]  ? ldsem_down_write+0x1f5/0x1f5
[   33.448158]  default_idle+0x1b/0x2c
[   33.448809]  do_idle+0xf9/0x248
[   33.449392]  cpu_startup_entry+0x1d/0x1f
[   33.450118]  start_secondary+0x168/0x185
[   33.450843]  secondary_startup_64+0xa4/0xb0
[   33.451612] Modules linked in: scsi_debug iTCO_wdt iTCO_vendor_support nvme nvme_core i2c_i801 lpc_ich virtis
[   33.454292] Dumping ftrace buffer:
[   33.454951]    (ftrace buffer empty)
[   33.455626] ---[ end trace beece202d663d38f ]---

Fixes: 9b15d109a6b2 ("block: improve discard bio alignment in __blkdev_issue_discard()")
Cc: Coly Li <colyli@suse.de>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Xiao Ni <xni@redhat.com>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 drivers/block/loop.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Coly Li Aug. 4, 2020, 4:29 a.m. UTC | #1
On 2020/8/4 12:15, Ming Lei wrote:
> In case of block device backend, if the backend supports discard, the
> loop device will set queue flag of QUEUE_FLAG_DISCARD.
> 
> However, limits.discard_granularity isn't setup, and this way is wrong,
> see the following description in Documentation/ABI/testing/sysfs-block:
> 
> 	A discard_granularity of 0 means that the device does not support
> 	discard functionality.
> 
> Especially 9b15d109a6b2 ("block: improve discard bio alignment in
> __blkdev_issue_discard()") starts to take q->limits.discard_granularity
> for computing max discard sectors. And zero discard granularity causes
> kernel oops[1].
> 
> Fix the issue by set up discard granularity and alignment.
> 
> [1] kernel oops when use block disk as loop backend:
> 
> [   33.405947] ------------[ cut here ]------------
> [   33.405948] kernel BUG at block/blk-mq.c:563!
> [   33.407504] invalid opcode: 0000 [#1] SMP PTI
> [   33.408245] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.8.0-rc2_update_rq+ #17
> [   33.409466] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx2.fedor4
> [   33.411546] RIP: 0010:blk_mq_end_request+0x1c/0x2a
> [   33.412354] Code: 48 89 df 5b 5d 41 5c 41 5d e9 2d fe ff ff 0f 1f 44 00 00 55 48 89 fd 53 40 0f b6 de 8b 57 5
> [   33.415724] RSP: 0018:ffffc900019ccf48 EFLAGS: 00010202
> [   33.416688] RAX: 0000000000000001 RBX: 0000000000000000 RCX: ffff88826f2600c0
> [   33.417990] RDX: 0000000000200042 RSI: 0000000000000001 RDI: ffff888270c13100
> [   33.419286] RBP: ffff888270c13100 R08: 0000000000000001 R09: ffffffff81345144
> [   33.420584] R10: ffff88826f260600 R11: 0000000000000000 R12: ffff888270c13100
> [   33.421881] R13: ffffffff820050c0 R14: 0000000000000004 R15: 0000000000000004
> [   33.423053] FS:  0000000000000000(0000) GS:ffff888277d80000(0000) knlGS:0000000000000000
> [   33.424360] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   33.425416] CR2: 00005581d7903f08 CR3: 00000001e9a16002 CR4: 0000000000760ee0
> [   33.426580] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [   33.427706] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [   33.428910] PKRU: 55555554
> [   33.429412] Call Trace:
> [   33.429858]  <IRQ>
> [   33.430189]  blk_done_softirq+0x84/0xa4
> [   33.430884]  __do_softirq+0x11a/0x278
> [   33.431567]  asm_call_on_stack+0x12/0x20
> [   33.432283]  </IRQ>
> [   33.432673]  do_softirq_own_stack+0x31/0x40
> [   33.433448]  __irq_exit_rcu+0x46/0xae
> [   33.434132]  sysvec_call_function_single+0x7d/0x8b
> [   33.435021]  asm_sysvec_call_function_single+0x12/0x20
> [   33.435965] RIP: 0010:native_safe_halt+0x7/0x8
> [   33.436795] Code: 25 c0 7b 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b 00 a8 08 74 0b 65 81 25 db 38 95 7e b
> [   33.440179] RSP: 0018:ffffc9000191fee0 EFLAGS: 00000246
> [   33.441143] RAX: ffffffff816c4118 RBX: 0000000000000000 RCX: ffff888276631a00
> [   33.442442] RDX: 000000000000ddfe RSI: 0000000000000003 RDI: 0000000000000001
> [   33.443742] RBP: 0000000000000000 R08: 00000000f461e6ac R09: 0000000000000004
> [   33.445046] R10: 8080808080808080 R11: fefefefefefefeff R12: 0000000000000000
> [   33.446352] R13: 0000000000000003 R14: ffffffffffffffff R15: 0000000000000000
> [   33.447450]  ? ldsem_down_write+0x1f5/0x1f5
> [   33.448158]  default_idle+0x1b/0x2c
> [   33.448809]  do_idle+0xf9/0x248
> [   33.449392]  cpu_startup_entry+0x1d/0x1f
> [   33.450118]  start_secondary+0x168/0x185
> [   33.450843]  secondary_startup_64+0xa4/0xb0
> [   33.451612] Modules linked in: scsi_debug iTCO_wdt iTCO_vendor_support nvme nvme_core i2c_i801 lpc_ich virtis
> [   33.454292] Dumping ftrace buffer:
> [   33.454951]    (ftrace buffer empty)
> [   33.455626] ---[ end trace beece202d663d38f ]---
> 
> Fixes: 9b15d109a6b2 ("block: improve discard bio alignment in __blkdev_issue_discard()")
> Cc: Coly Li <colyli@suse.de>
> Cc: Hannes Reinecke <hare@suse.com>
> Cc: Xiao Ni <xni@redhat.com>
> Cc: Martin K. Petersen <martin.petersen@oracle.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  drivers/block/loop.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index d18160146226..6102370bee35 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -890,6 +890,11 @@ static void loop_config_discard(struct loop_device *lo)
>  		struct request_queue *backingq;
>  
>  		backingq = bdev_get_queue(inode->i_bdev);
> +
> +		q->limits.discard_granularity =
> +			queue_physical_block_size(backingq);
> +		q->limits.discard_alignment = 0;
> +
>  		blk_queue_max_discard_sectors(q,
>  			backingq->limits.max_write_zeroes_sectors);
>  
> 

Hi Ming,

I did similar change, it can avoid the panic or 0 length discard bio.
But yesterday I realize the discard request to loop device should not go
into __blkdev_issue_discard(). As commit c52abf563049 ("loop: Better
discard support for block devices") mentioned it should go into
blkdev_issue_zeroout(), this is why in loop_config_discard() the
max_discard_sectors is set to backingq->limits.max_write_zeroes_sectors.

Now I am looking at the problem why discard request on loop device
doesn't go into blkdev_issue_zeroout().

With the above change, the discard is very slow on loop device with
backing device. In my testing, mkfs.xfs on /dev/loop0 does not complete
in 20 minutes, each discard request is only 4097 sectors.

Coly Li
Ming Lei Aug. 4, 2020, 6:01 a.m. UTC | #2
On Tue, Aug 04, 2020 at 12:29:07PM +0800, Coly Li wrote:
> On 2020/8/4 12:15, Ming Lei wrote:
> > In case of block device backend, if the backend supports discard, the
> > loop device will set queue flag of QUEUE_FLAG_DISCARD.
> > 
> > However, limits.discard_granularity isn't setup, and this way is wrong,
> > see the following description in Documentation/ABI/testing/sysfs-block:
> > 
> > 	A discard_granularity of 0 means that the device does not support
> > 	discard functionality.
> > 
> > Especially 9b15d109a6b2 ("block: improve discard bio alignment in
> > __blkdev_issue_discard()") starts to take q->limits.discard_granularity
> > for computing max discard sectors. And zero discard granularity causes
> > kernel oops[1].
> > 
> > Fix the issue by set up discard granularity and alignment.
> > 
> > [1] kernel oops when use block disk as loop backend:
> > 
> > [   33.405947] ------------[ cut here ]------------
> > [   33.405948] kernel BUG at block/blk-mq.c:563!
> > [   33.407504] invalid opcode: 0000 [#1] SMP PTI
> > [   33.408245] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.8.0-rc2_update_rq+ #17
> > [   33.409466] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx2.fedor4
> > [   33.411546] RIP: 0010:blk_mq_end_request+0x1c/0x2a
> > [   33.412354] Code: 48 89 df 5b 5d 41 5c 41 5d e9 2d fe ff ff 0f 1f 44 00 00 55 48 89 fd 53 40 0f b6 de 8b 57 5
> > [   33.415724] RSP: 0018:ffffc900019ccf48 EFLAGS: 00010202
> > [   33.416688] RAX: 0000000000000001 RBX: 0000000000000000 RCX: ffff88826f2600c0
> > [   33.417990] RDX: 0000000000200042 RSI: 0000000000000001 RDI: ffff888270c13100
> > [   33.419286] RBP: ffff888270c13100 R08: 0000000000000001 R09: ffffffff81345144
> > [   33.420584] R10: ffff88826f260600 R11: 0000000000000000 R12: ffff888270c13100
> > [   33.421881] R13: ffffffff820050c0 R14: 0000000000000004 R15: 0000000000000004
> > [   33.423053] FS:  0000000000000000(0000) GS:ffff888277d80000(0000) knlGS:0000000000000000
> > [   33.424360] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   33.425416] CR2: 00005581d7903f08 CR3: 00000001e9a16002 CR4: 0000000000760ee0
> > [   33.426580] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [   33.427706] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [   33.428910] PKRU: 55555554
> > [   33.429412] Call Trace:
> > [   33.429858]  <IRQ>
> > [   33.430189]  blk_done_softirq+0x84/0xa4
> > [   33.430884]  __do_softirq+0x11a/0x278
> > [   33.431567]  asm_call_on_stack+0x12/0x20
> > [   33.432283]  </IRQ>
> > [   33.432673]  do_softirq_own_stack+0x31/0x40
> > [   33.433448]  __irq_exit_rcu+0x46/0xae
> > [   33.434132]  sysvec_call_function_single+0x7d/0x8b
> > [   33.435021]  asm_sysvec_call_function_single+0x12/0x20
> > [   33.435965] RIP: 0010:native_safe_halt+0x7/0x8
> > [   33.436795] Code: 25 c0 7b 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b 00 a8 08 74 0b 65 81 25 db 38 95 7e b
> > [   33.440179] RSP: 0018:ffffc9000191fee0 EFLAGS: 00000246
> > [   33.441143] RAX: ffffffff816c4118 RBX: 0000000000000000 RCX: ffff888276631a00
> > [   33.442442] RDX: 000000000000ddfe RSI: 0000000000000003 RDI: 0000000000000001
> > [   33.443742] RBP: 0000000000000000 R08: 00000000f461e6ac R09: 0000000000000004
> > [   33.445046] R10: 8080808080808080 R11: fefefefefefefeff R12: 0000000000000000
> > [   33.446352] R13: 0000000000000003 R14: ffffffffffffffff R15: 0000000000000000
> > [   33.447450]  ? ldsem_down_write+0x1f5/0x1f5
> > [   33.448158]  default_idle+0x1b/0x2c
> > [   33.448809]  do_idle+0xf9/0x248
> > [   33.449392]  cpu_startup_entry+0x1d/0x1f
> > [   33.450118]  start_secondary+0x168/0x185
> > [   33.450843]  secondary_startup_64+0xa4/0xb0
> > [   33.451612] Modules linked in: scsi_debug iTCO_wdt iTCO_vendor_support nvme nvme_core i2c_i801 lpc_ich virtis
> > [   33.454292] Dumping ftrace buffer:
> > [   33.454951]    (ftrace buffer empty)
> > [   33.455626] ---[ end trace beece202d663d38f ]---
> > 
> > Fixes: 9b15d109a6b2 ("block: improve discard bio alignment in __blkdev_issue_discard()")
> > Cc: Coly Li <colyli@suse.de>
> > Cc: Hannes Reinecke <hare@suse.com>
> > Cc: Xiao Ni <xni@redhat.com>
> > Cc: Martin K. Petersen <martin.petersen@oracle.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  drivers/block/loop.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> > index d18160146226..6102370bee35 100644
> > --- a/drivers/block/loop.c
> > +++ b/drivers/block/loop.c
> > @@ -890,6 +890,11 @@ static void loop_config_discard(struct loop_device *lo)
> >  		struct request_queue *backingq;
> >  
> >  		backingq = bdev_get_queue(inode->i_bdev);
> > +
> > +		q->limits.discard_granularity =
> > +			queue_physical_block_size(backingq);
> > +		q->limits.discard_alignment = 0;
> > +
> >  		blk_queue_max_discard_sectors(q,
> >  			backingq->limits.max_write_zeroes_sectors);
> >  
> > 
> 
> Hi Ming,
> 
> I did similar change, it can avoid the panic or 0 length discard bio.
> But yesterday I realize the discard request to loop device should not go
> into __blkdev_issue_discard(). As commit c52abf563049 ("loop: Better
> discard support for block devices") mentioned it should go into
> blkdev_issue_zeroout(), this is why in loop_config_discard() the
> max_discard_sectors is set to backingq->limits.max_write_zeroes_sectors.

That commit meant REQ_OP_DISCARD on a loop device is translated into
blkdev_issue_zeroout(), because REQ_OP_DISCARD is handled as
file->f_op->fallocate(FALLOC_FL_PUNCH_HOLE), which will cause
"subsequent reads from this range will return zeroes".

> 
> Now I am looking at the problem why discard request on loop device
> doesn't go into blkdev_issue_zeroout().

No, that is correct behavior, since loop can support discard or zeroout.

If QUEUE_FLAG_DISCARD is set, either discard_granularity or max discard
sectors shouldn't be zero.

This patch shouldn't set discard_granularity if limits.max_write_zeroes_sectors
is zero, will fix it in V2.

> 
> With the above change, the discard is very slow on loop device with
> backing device. In my testing, mkfs.xfs on /dev/loop0 does not complete
> in 20 minutes, each discard request is only 4097 sectors.

I'd suggest you to check the discard related queue limits, and see why
each discard request just sends 4097 sectors.

Or we need to mirror underlying queue's discard_granularity too? Can you
try the following patch?

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index d18160146226..661c0814d63c 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -878,6 +878,7 @@ static void loop_config_discard(struct loop_device *lo)
 	struct file *file = lo->lo_backing_file;
 	struct inode *inode = file->f_mapping->host;
 	struct request_queue *q = lo->lo_queue;
+	u32 granularity, max_discard_sectors;
 
 	/*
 	 * If the backing device is a block device, mirror its zeroing
@@ -890,11 +891,10 @@ static void loop_config_discard(struct loop_device *lo)
 		struct request_queue *backingq;
 
 		backingq = bdev_get_queue(inode->i_bdev);
-		blk_queue_max_discard_sectors(q,
-			backingq->limits.max_write_zeroes_sectors);
 
-		blk_queue_max_write_zeroes_sectors(q,
-			backingq->limits.max_write_zeroes_sectors);
+		max_discard_sectors = backingq->limits.max_write_zeroes_sectors;
+		granularity = backingq->limits.discard_granularity ?:
+			queue_physical_block_size(backingq);
 
 	/*
 	 * We use punch hole to reclaim the free space used by the
@@ -903,23 +903,26 @@ static void loop_config_discard(struct loop_device *lo)
 	 * useful information.
 	 */
 	} else if (!file->f_op->fallocate || lo->lo_encrypt_key_size) {
-		q->limits.discard_granularity = 0;
-		q->limits.discard_alignment = 0;
-		blk_queue_max_discard_sectors(q, 0);
-		blk_queue_max_write_zeroes_sectors(q, 0);
+		max_discard_sectors = 0;
+		granularity = 0;
 
 	} else {
-		q->limits.discard_granularity = inode->i_sb->s_blocksize;
-		q->limits.discard_alignment = 0;
-
-		blk_queue_max_discard_sectors(q, UINT_MAX >> 9);
-		blk_queue_max_write_zeroes_sectors(q, UINT_MAX >> 9);
+		max_discard_sectors = UINT_MAX >> 9;
+		granularity = inode->i_sb->s_blocksize;
 	}
 
-	if (q->limits.max_write_zeroes_sectors)
+	if (max_discard_sectors) {
+		q->limits.discard_granularity = granularity;
+		blk_queue_max_discard_sectors(q, max_discard_sectors);
+		blk_queue_max_write_zeroes_sectors(q, max_discard_sectors);
 		blk_queue_flag_set(QUEUE_FLAG_DISCARD, q);
-	else
+	} else {
+		q->limits.discard_granularity = 0;
+		blk_queue_max_discard_sectors(q, 0);
+		blk_queue_max_write_zeroes_sectors(q, 0);
 		blk_queue_flag_clear(QUEUE_FLAG_DISCARD, q);
+	}
+	q->limits.discard_alignment = 0;
 }
 
 static void loop_unprepare_queue(struct loop_device *lo)

Thanks,
Ming
Coly Li Aug. 4, 2020, 7:50 a.m. UTC | #3
On 2020/8/4 14:01, Ming Lei wrote:
> On Tue, Aug 04, 2020 at 12:29:07PM +0800, Coly Li wrote:
>> On 2020/8/4 12:15, Ming Lei wrote:
>>> In case of block device backend, if the backend supports discard, the
>>> loop device will set queue flag of QUEUE_FLAG_DISCARD.
>>>
>>> However, limits.discard_granularity isn't setup, and this way is wrong,
>>> see the following description in Documentation/ABI/testing/sysfs-block:
>>>
>>> 	A discard_granularity of 0 means that the device does not support
>>> 	discard functionality.
>>>
>>> Especially 9b15d109a6b2 ("block: improve discard bio alignment in
>>> __blkdev_issue_discard()") starts to take q->limits.discard_granularity
>>> for computing max discard sectors. And zero discard granularity causes
>>> kernel oops[1].
>>>
>>> Fix the issue by set up discard granularity and alignment.
>>>
>>> [1] kernel oops when use block disk as loop backend:
>>>
>>> [   33.405947] ------------[ cut here ]------------
>>> [   33.405948] kernel BUG at block/blk-mq.c:563!
>>> [   33.407504] invalid opcode: 0000 [#1] SMP PTI
>>> [   33.408245] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.8.0-rc2_update_rq+ #17
>>> [   33.409466] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx2.fedor4
>>> [   33.411546] RIP: 0010:blk_mq_end_request+0x1c/0x2a
>>> [   33.412354] Code: 48 89 df 5b 5d 41 5c 41 5d e9 2d fe ff ff 0f 1f 44 00 00 55 48 89 fd 53 40 0f b6 de 8b 57 5
>>> [   33.415724] RSP: 0018:ffffc900019ccf48 EFLAGS: 00010202
>>> [   33.416688] RAX: 0000000000000001 RBX: 0000000000000000 RCX: ffff88826f2600c0
>>> [   33.417990] RDX: 0000000000200042 RSI: 0000000000000001 RDI: ffff888270c13100
>>> [   33.419286] RBP: ffff888270c13100 R08: 0000000000000001 R09: ffffffff81345144
>>> [   33.420584] R10: ffff88826f260600 R11: 0000000000000000 R12: ffff888270c13100
>>> [   33.421881] R13: ffffffff820050c0 R14: 0000000000000004 R15: 0000000000000004
>>> [   33.423053] FS:  0000000000000000(0000) GS:ffff888277d80000(0000) knlGS:0000000000000000
>>> [   33.424360] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [   33.425416] CR2: 00005581d7903f08 CR3: 00000001e9a16002 CR4: 0000000000760ee0
>>> [   33.426580] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [   33.427706] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [   33.428910] PKRU: 55555554
>>> [   33.429412] Call Trace:
>>> [   33.429858]  <IRQ>
>>> [   33.430189]  blk_done_softirq+0x84/0xa4
>>> [   33.430884]  __do_softirq+0x11a/0x278
>>> [   33.431567]  asm_call_on_stack+0x12/0x20
>>> [   33.432283]  </IRQ>
>>> [   33.432673]  do_softirq_own_stack+0x31/0x40
>>> [   33.433448]  __irq_exit_rcu+0x46/0xae
>>> [   33.434132]  sysvec_call_function_single+0x7d/0x8b
>>> [   33.435021]  asm_sysvec_call_function_single+0x12/0x20
>>> [   33.435965] RIP: 0010:native_safe_halt+0x7/0x8
>>> [   33.436795] Code: 25 c0 7b 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b 00 a8 08 74 0b 65 81 25 db 38 95 7e b
>>> [   33.440179] RSP: 0018:ffffc9000191fee0 EFLAGS: 00000246
>>> [   33.441143] RAX: ffffffff816c4118 RBX: 0000000000000000 RCX: ffff888276631a00
>>> [   33.442442] RDX: 000000000000ddfe RSI: 0000000000000003 RDI: 0000000000000001
>>> [   33.443742] RBP: 0000000000000000 R08: 00000000f461e6ac R09: 0000000000000004
>>> [   33.445046] R10: 8080808080808080 R11: fefefefefefefeff R12: 0000000000000000
>>> [   33.446352] R13: 0000000000000003 R14: ffffffffffffffff R15: 0000000000000000
>>> [   33.447450]  ? ldsem_down_write+0x1f5/0x1f5
>>> [   33.448158]  default_idle+0x1b/0x2c
>>> [   33.448809]  do_idle+0xf9/0x248
>>> [   33.449392]  cpu_startup_entry+0x1d/0x1f
>>> [   33.450118]  start_secondary+0x168/0x185
>>> [   33.450843]  secondary_startup_64+0xa4/0xb0
>>> [   33.451612] Modules linked in: scsi_debug iTCO_wdt iTCO_vendor_support nvme nvme_core i2c_i801 lpc_ich virtis
>>> [   33.454292] Dumping ftrace buffer:
>>> [   33.454951]    (ftrace buffer empty)
>>> [   33.455626] ---[ end trace beece202d663d38f ]---
>>>
>>> Fixes: 9b15d109a6b2 ("block: improve discard bio alignment in __blkdev_issue_discard()")
>>> Cc: Coly Li <colyli@suse.de>
>>> Cc: Hannes Reinecke <hare@suse.com>
>>> Cc: Xiao Ni <xni@redhat.com>
>>> Cc: Martin K. Petersen <martin.petersen@oracle.com>
>>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>>> ---
>>>  drivers/block/loop.c | 5 +++++
>>>  1 file changed, 5 insertions(+)
>>>
>>> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
>>> index d18160146226..6102370bee35 100644
>>> --- a/drivers/block/loop.c
>>> +++ b/drivers/block/loop.c
>>> @@ -890,6 +890,11 @@ static void loop_config_discard(struct loop_device *lo)
>>>  		struct request_queue *backingq;
>>>  
>>>  		backingq = bdev_get_queue(inode->i_bdev);
>>> +
>>> +		q->limits.discard_granularity =
>>> +			queue_physical_block_size(backingq);
>>> +		q->limits.discard_alignment = 0;
>>> +
>>>  		blk_queue_max_discard_sectors(q,
>>>  			backingq->limits.max_write_zeroes_sectors);
>>>  
>>>
>>
>> Hi Ming,
>>
>> I did similar change, it can avoid the panic or 0 length discard bio.
>> But yesterday I realize the discard request to loop device should not go
>> into __blkdev_issue_discard(). As commit c52abf563049 ("loop: Better
>> discard support for block devices") mentioned it should go into
>> blkdev_issue_zeroout(), this is why in loop_config_discard() the
>> max_discard_sectors is set to backingq->limits.max_write_zeroes_sectors.
> 
> That commit meant REQ_OP_DISCARD on a loop device is translated into
> blkdev_issue_zeroout(), because REQ_OP_DISCARD is handled as
> file->f_op->fallocate(FALLOC_FL_PUNCH_HOLE), which will cause
> "subsequent reads from this range will return zeroes".
> 
>>
>> Now I am looking at the problem why discard request on loop device
>> doesn't go into blkdev_issue_zeroout().
> 
> No, that is correct behavior, since loop can support discard or zeroout.
> 
> If QUEUE_FLAG_DISCARD is set, either discard_granularity or max discard
> sectors shouldn't be zero.
> 
> This patch shouldn't set discard_granularity if limits.max_write_zeroes_sectors
> is zero, will fix it in V2.
> 
>>
>> With the above change, the discard is very slow on loop device with
>> backing device. In my testing, mkfs.xfs on /dev/loop0 does not complete
>> in 20 minutes, each discard request is only 4097 sectors.
> 
> I'd suggest you to check the discard related queue limits, and see why
> each discard request just sends 4097 sectors.
> 
> Or we need to mirror underlying queue's discard_granularity too? Can you
> try the following patch?

Can I know the exact command line to reproduce the panic ?

I try to use blkdiscard, or dbench -D /mount_point_to_loop0, and see all
the discard requests go into blkdev_fallocate()=>blkdev_issue_zeroout().
No request goes into __blkdev_issue_discard().

The test kernel is at commit e4cbce4d1317 on Aug 3.

Thanks.

Coly Li
Ming Lei Aug. 4, 2020, 8:14 a.m. UTC | #4
On Tue, Aug 04, 2020 at 03:50:32PM +0800, Coly Li wrote:
> On 2020/8/4 14:01, Ming Lei wrote:
> > On Tue, Aug 04, 2020 at 12:29:07PM +0800, Coly Li wrote:
> >> On 2020/8/4 12:15, Ming Lei wrote:
> >>> In case of block device backend, if the backend supports discard, the
> >>> loop device will set queue flag of QUEUE_FLAG_DISCARD.
> >>>
> >>> However, limits.discard_granularity isn't setup, and this way is wrong,
> >>> see the following description in Documentation/ABI/testing/sysfs-block:
> >>>
> >>> 	A discard_granularity of 0 means that the device does not support
> >>> 	discard functionality.
> >>>
> >>> Especially 9b15d109a6b2 ("block: improve discard bio alignment in
> >>> __blkdev_issue_discard()") starts to take q->limits.discard_granularity
> >>> for computing max discard sectors. And zero discard granularity causes
> >>> kernel oops[1].
> >>>
> >>> Fix the issue by set up discard granularity and alignment.
> >>>
> >>> [1] kernel oops when use block disk as loop backend:
> >>>
> >>> [   33.405947] ------------[ cut here ]------------
> >>> [   33.405948] kernel BUG at block/blk-mq.c:563!
> >>> [   33.407504] invalid opcode: 0000 [#1] SMP PTI
> >>> [   33.408245] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 5.8.0-rc2_update_rq+ #17
> >>> [   33.409466] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180724_192412-buildhw-07.phx2.fedor4
> >>> [   33.411546] RIP: 0010:blk_mq_end_request+0x1c/0x2a
> >>> [   33.412354] Code: 48 89 df 5b 5d 41 5c 41 5d e9 2d fe ff ff 0f 1f 44 00 00 55 48 89 fd 53 40 0f b6 de 8b 57 5
> >>> [   33.415724] RSP: 0018:ffffc900019ccf48 EFLAGS: 00010202
> >>> [   33.416688] RAX: 0000000000000001 RBX: 0000000000000000 RCX: ffff88826f2600c0
> >>> [   33.417990] RDX: 0000000000200042 RSI: 0000000000000001 RDI: ffff888270c13100
> >>> [   33.419286] RBP: ffff888270c13100 R08: 0000000000000001 R09: ffffffff81345144
> >>> [   33.420584] R10: ffff88826f260600 R11: 0000000000000000 R12: ffff888270c13100
> >>> [   33.421881] R13: ffffffff820050c0 R14: 0000000000000004 R15: 0000000000000004
> >>> [   33.423053] FS:  0000000000000000(0000) GS:ffff888277d80000(0000) knlGS:0000000000000000
> >>> [   33.424360] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>> [   33.425416] CR2: 00005581d7903f08 CR3: 00000001e9a16002 CR4: 0000000000760ee0
> >>> [   33.426580] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >>> [   33.427706] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >>> [   33.428910] PKRU: 55555554
> >>> [   33.429412] Call Trace:
> >>> [   33.429858]  <IRQ>
> >>> [   33.430189]  blk_done_softirq+0x84/0xa4
> >>> [   33.430884]  __do_softirq+0x11a/0x278
> >>> [   33.431567]  asm_call_on_stack+0x12/0x20
> >>> [   33.432283]  </IRQ>
> >>> [   33.432673]  do_softirq_own_stack+0x31/0x40
> >>> [   33.433448]  __irq_exit_rcu+0x46/0xae
> >>> [   33.434132]  sysvec_call_function_single+0x7d/0x8b
> >>> [   33.435021]  asm_sysvec_call_function_single+0x12/0x20
> >>> [   33.435965] RIP: 0010:native_safe_halt+0x7/0x8
> >>> [   33.436795] Code: 25 c0 7b 01 00 f0 80 60 02 df f0 83 44 24 fc 00 48 8b 00 a8 08 74 0b 65 81 25 db 38 95 7e b
> >>> [   33.440179] RSP: 0018:ffffc9000191fee0 EFLAGS: 00000246
> >>> [   33.441143] RAX: ffffffff816c4118 RBX: 0000000000000000 RCX: ffff888276631a00
> >>> [   33.442442] RDX: 000000000000ddfe RSI: 0000000000000003 RDI: 0000000000000001
> >>> [   33.443742] RBP: 0000000000000000 R08: 00000000f461e6ac R09: 0000000000000004
> >>> [   33.445046] R10: 8080808080808080 R11: fefefefefefefeff R12: 0000000000000000
> >>> [   33.446352] R13: 0000000000000003 R14: ffffffffffffffff R15: 0000000000000000
> >>> [   33.447450]  ? ldsem_down_write+0x1f5/0x1f5
> >>> [   33.448158]  default_idle+0x1b/0x2c
> >>> [   33.448809]  do_idle+0xf9/0x248
> >>> [   33.449392]  cpu_startup_entry+0x1d/0x1f
> >>> [   33.450118]  start_secondary+0x168/0x185
> >>> [   33.450843]  secondary_startup_64+0xa4/0xb0
> >>> [   33.451612] Modules linked in: scsi_debug iTCO_wdt iTCO_vendor_support nvme nvme_core i2c_i801 lpc_ich virtis
> >>> [   33.454292] Dumping ftrace buffer:
> >>> [   33.454951]    (ftrace buffer empty)
> >>> [   33.455626] ---[ end trace beece202d663d38f ]---
> >>>
> >>> Fixes: 9b15d109a6b2 ("block: improve discard bio alignment in __blkdev_issue_discard()")
> >>> Cc: Coly Li <colyli@suse.de>
> >>> Cc: Hannes Reinecke <hare@suse.com>
> >>> Cc: Xiao Ni <xni@redhat.com>
> >>> Cc: Martin K. Petersen <martin.petersen@oracle.com>
> >>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> >>> ---
> >>>  drivers/block/loop.c | 5 +++++
> >>>  1 file changed, 5 insertions(+)
> >>>
> >>> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> >>> index d18160146226..6102370bee35 100644
> >>> --- a/drivers/block/loop.c
> >>> +++ b/drivers/block/loop.c
> >>> @@ -890,6 +890,11 @@ static void loop_config_discard(struct loop_device *lo)
> >>>  		struct request_queue *backingq;
> >>>  
> >>>  		backingq = bdev_get_queue(inode->i_bdev);
> >>> +
> >>> +		q->limits.discard_granularity =
> >>> +			queue_physical_block_size(backingq);
> >>> +		q->limits.discard_alignment = 0;
> >>> +
> >>>  		blk_queue_max_discard_sectors(q,
> >>>  			backingq->limits.max_write_zeroes_sectors);
> >>>  
> >>>
> >>
> >> Hi Ming,
> >>
> >> I did similar change, it can avoid the panic or 0 length discard bio.
> >> But yesterday I realize the discard request to loop device should not go
> >> into __blkdev_issue_discard(). As commit c52abf563049 ("loop: Better
> >> discard support for block devices") mentioned it should go into
> >> blkdev_issue_zeroout(), this is why in loop_config_discard() the
> >> max_discard_sectors is set to backingq->limits.max_write_zeroes_sectors.
> > 
> > That commit meant REQ_OP_DISCARD on a loop device is translated into
> > blkdev_issue_zeroout(), because REQ_OP_DISCARD is handled as
> > file->f_op->fallocate(FALLOC_FL_PUNCH_HOLE), which will cause
> > "subsequent reads from this range will return zeroes".
> > 
> >>
> >> Now I am looking at the problem why discard request on loop device
> >> doesn't go into blkdev_issue_zeroout().
> > 
> > No, that is correct behavior, since loop can support discard or zeroout.
> > 
> > If QUEUE_FLAG_DISCARD is set, either discard_granularity or max discard
> > sectors shouldn't be zero.
> > 
> > This patch shouldn't set discard_granularity if limits.max_write_zeroes_sectors
> > is zero, will fix it in V2.
> > 
> >>
> >> With the above change, the discard is very slow on loop device with
> >> backing device. In my testing, mkfs.xfs on /dev/loop0 does not complete
> >> in 20 minutes, each discard request is only 4097 sectors.
> > 
> > I'd suggest you to check the discard related queue limits, and see why
> > each discard request just sends 4097 sectors.
> > 
> > Or we need to mirror underlying queue's discard_granularity too? Can you
> > try the following patch?
> 
> Can I know the exact command line to reproduce the panic ?

modprobe scsi_debug delay=0 dev_size_mb=2048 max_queue=1

losetup -f /dev/sdc --direct-io=on   #suppose /dev/sdc is the scsi_debug LUN

mkfs.ext4 /dev/loop0				#suppose loop0 is the loop backed by /dev/sdc

mount /dev/loop0 /mnt

cd /mnt
dbench -t 20 -s 64


> 
> I try to use blkdiscard, or dbench -D /mount_point_to_loop0, and see all
> the discard requests go into blkdev_fallocate()=>blkdev_issue_zeroout().
> No request goes into __blkdev_issue_discard().

The issue isn't related with backend queue, and it is triggered on
loop's request.


Thanks,
Ming
Coly Li Aug. 4, 2020, 2:30 p.m. UTC | #5
On 2020/8/4 16:14, Ming Lei wrote:
> On Tue, Aug 04, 2020 at 03:50:32PM +0800, Coly Li wrote:
>> On 2020/8/4 14:01, Ming Lei wrote:
>>> On Tue, Aug 04, 2020 at 12:29:07PM +0800, Coly Li wrote:
>>>> On 2020/8/4 12:15, Ming Lei wrote:
>>>>> In case of block device backend, if the backend supports discard, the
>>>>> loop device will set queue flag of QUEUE_FLAG_DISCARD.
>>>>>
>>>>> However, limits.discard_granularity isn't setup, and this way is wrong,
>>>>> see the following description in Documentation/ABI/testing/sysfs-block:
>>>>>
>>>>> 	A discard_granularity of 0 means that the device does not support
>>>>> 	discard functionality.
>>>>>
>>>>> Especially 9b15d109a6b2 ("block: improve discard bio alignment in
>>>>> __blkdev_issue_discard()") starts to take q->limits.discard_granularity
>>>>> for computing max discard sectors. And zero discard granularity causes
>>>>> kernel oops[1].
>>>>>
>>>>> Fix the issue by set up discard granularity and alignment.
>>>>>

[snipped]

>>>>
>>>> Hi Ming,
>>>>
>>>> I did similar change, it can avoid the panic or 0 length discard bio.
>>>> But yesterday I realize the discard request to loop device should not go
>>>> into __blkdev_issue_discard(). As commit c52abf563049 ("loop: Better
>>>> discard support for block devices") mentioned it should go into
>>>> blkdev_issue_zeroout(), this is why in loop_config_discard() the
>>>> max_discard_sectors is set to backingq->limits.max_write_zeroes_sectors.
>>>
>>> That commit meant REQ_OP_DISCARD on a loop device is translated into
>>> blkdev_issue_zeroout(), because REQ_OP_DISCARD is handled as
>>> file->f_op->fallocate(FALLOC_FL_PUNCH_HOLE), which will cause
>>> "subsequent reads from this range will return zeroes".
>>>
>>>>
>>>> Now I am looking at the problem why discard request on loop device
>>>> doesn't go into blkdev_issue_zeroout().
>>>
>>> No, that is correct behavior, since loop can support discard or zeroout.
>>>
>>> If QUEUE_FLAG_DISCARD is set, either discard_granularity or max discard
>>> sectors shouldn't be zero.
>>>
>>> This patch shouldn't set discard_granularity if limits.max_write_zeroes_sectors
>>> is zero, will fix it in V2.
>>>
>>>>
>>>> With the above change, the discard is very slow on loop device with
>>>> backing device. In my testing, mkfs.xfs on /dev/loop0 does not complete
>>>> in 20 minutes, each discard request is only 4097 sectors.
>>>
>>> I'd suggest you to check the discard related queue limits, and see why
>>> each discard request just sends 4097 sectors.
>>>
>>> Or we need to mirror underlying queue's discard_granularity too? Can you
>>> try the following patch?
>>
>> Can I know the exact command line to reproduce the panic ?
> 
> modprobe scsi_debug delay=0 dev_size_mb=2048 max_queue=1
> 
> losetup -f /dev/sdc --direct-io=on   #suppose /dev/sdc is the scsi_debug LUN
> 
> mkfs.ext4 /dev/loop0				#suppose loop0 is the loop backed by /dev/sdc
> 
> mount /dev/loop0 /mnt
> 
> cd /mnt
> dbench -t 20 -s 64
> 

Thanks for the information. With the above command lines, I also find a
special case that I can see a 0 byte request triggers a similar BUG()
panic --- when the discard LBA is 0, and loop device driver
queue->limits.discard_granularity is 0.

> 
>>
>> I try to use blkdiscard, or dbench -D /mount_point_to_loop0, and see all
>> the discard requests go into blkdev_fallocate()=>blkdev_issue_zeroout().
>> No request goes into __blkdev_issue_discard().
> 
> The issue isn't related with backend queue, and it is triggered on
> loop's request.

Yes it is clear to me now. Beside the loop device driver,
__blkdev_issue_discard() should also be fixed to tolerate the buggy
queue->limits.discard_granularity, in case some other driver does
similar mistake, or a unlucky downstream kernel maintainer missing your
loop device driver fix.

Yes, please post the v2 patch, now I can help to review this patch.

Thanks for your hint :-)

Coly Li

Patch
diff mbox series

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index d18160146226..6102370bee35 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -890,6 +890,11 @@  static void loop_config_discard(struct loop_device *lo)
 		struct request_queue *backingq;
 
 		backingq = bdev_get_queue(inode->i_bdev);
+
+		q->limits.discard_granularity =
+			queue_physical_block_size(backingq);
+		q->limits.discard_alignment = 0;
+
 		blk_queue_max_discard_sectors(q,
 			backingq->limits.max_write_zeroes_sectors);