diff mbox

INFO: task hung in generic_file_write_iter

Message ID 9b9fcdda-c347-53ee-fdbb-8a7d11cf430e@I-love.SAKURA.ne.jp (mailing list archive)
State New, archived
Headers show

Commit Message

Tetsuo Handa July 20, 2018, 10:36 a.m. UTC
On 2018/07/18 19:28, Tetsuo Handa wrote:
> There are many reports which are stalling inside __getblk_gfp().

Currently 18 reports out of 65 "INFO: task hung in " reports.

  INFO: task hung in aead_recvmsg
  INFO: task hung in inode_sleep_on_writeback
  INFO: task hung in __writeback_inodes_sb_nr
  INFO: task hung in __blkdev_get (2)
  INFO: task hung in lookup_slow
  INFO: task hung in iterate_supers
  INFO: task hung in flush_work
  INFO: task hung in vfs_setxattr
  INFO: task hung in lock_mount
  INFO: task hung in __get_super
  INFO: task hung in do_unlinkat
  INFO: task hung in fat_fallocate
  INFO: task hung in generic_file_write_iter
  INFO: task hung in d_alloc_parallel
  INFO: task hung in __fdget_pos (2)
  INFO: task hung in path_openat
  INFO: task hung in do_truncate
  INFO: task hung in filename_create

> And there is horrible comment for __getblk_gfp():
> 
>   /*
>    * __getblk_gfp() will locate (and, if necessary, create) the buffer_head
>    * which corresponds to the passed block_device, block and size. The
>    * returned buffer has its reference count incremented.
>    *
>    * __getblk_gfp() will lock up the machine if grow_dev_page's
>    * try_to_free_buffers() attempt is failing.  FIXME, perhaps?
>    */
> 
> This report is stalling after mount() completed and process used remap_file_pages().
> I think that we might need to use debug printk(). But I don't know what to examine.
> 

Andrew, can you pick up this debug printk() patch?
I guess we can get the result within one week.

From 8f55e00b21fefffbc6abd9085ac503c52a302464 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Fri, 20 Jul 2018 19:29:06 +0900
Subject: [PATCH] fs/buffer.c: add debug print for __getblk_gfp() stall problem

Among syzbot's unresolved hung task reports, 18 out of 65 reports contain
__getblk_gfp() line in the backtrace. Since there is a comment block that
says that __getblk_gfp() will lock up the machine if try_to_free_buffers()
attempt from grow_dev_page() is failing, let's start from checking whether
syzbot is hitting that case. This change will be removed after the bug is
fixed.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Dmitry Vyukov <dvyukov@google.com>
---
 fs/buffer.c           | 50 ++++++++++++++++++++++++++++++++++++++++++++++++--
 include/linux/sched.h |  7 +++++++
 lib/Kconfig.debug     |  6 ++++++
 3 files changed, 61 insertions(+), 2 deletions(-)

Comments

Andrew Morton July 20, 2018, 8:06 p.m. UTC | #1
On Fri, 20 Jul 2018 19:36:23 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:

> > 
> > This report is stalling after mount() completed and process used remap_file_pages().
> > I think that we might need to use debug printk(). But I don't know what to examine.
> > 
> 
> Andrew, can you pick up this debug printk() patch?
> I guess we can get the result within one week.

Sure, let's toss it in -next for a while.

> >From 8f55e00b21fefffbc6abd9085ac503c52a302464 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Fri, 20 Jul 2018 19:29:06 +0900
> Subject: [PATCH] fs/buffer.c: add debug print for __getblk_gfp() stall problem
> 
> Among syzbot's unresolved hung task reports, 18 out of 65 reports contain
> __getblk_gfp() line in the backtrace. Since there is a comment block that
> says that __getblk_gfp() will lock up the machine if try_to_free_buffers()
> attempt from grow_dev_page() is failing, let's start from checking whether
> syzbot is hitting that case. This change will be removed after the bug is
> fixed.

I'm not sure that grow_dev_page() is hanging.  It has often been
suspected, but always is proven innocent.  Lets see.

>
> ...
>
> @@ -978,6 +988,9 @@ static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size)
>  	spin_unlock(&inode->i_mapping->private_lock);
>  done:
>  	ret = (block < end_block) ? 1 : -ENXIO;
> +#ifdef CONFIG_DEBUG_AID_FOR_SYZBOT
> +	current->getblk_executed |= 0x08;
> +#endif
>  failed:
>  	unlock_page(page);
>  	put_page(page);
> @@ -1033,6 +1046,12 @@ static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size)

Something is wrong with your diff(1).  That's grow_dev_page(), not
blkdev_max_block().

>  		return NULL;
>  	}
>  
> +#ifdef CONFIG_DEBUG_AID_FOR_SYZBOT
> +	current->getblk_stamp = jiffies;

AFACIT getblk_stamp didn't need to be in the task_struct - it could be
a local.  Doesn't matter much.

> +	current->getblk_executed = 0;
> +	current->getblk_bh_count = 0;
> +	current->getblk_bh_state = 0;
> +#endif
>  	for (;;) {
>  		struct buffer_head *bh;
>  		int ret;
> @@ -1044,6 +1063,18 @@ static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size)
>  		ret = grow_buffers(bdev, block, size, gfp);
>  		if (ret < 0)
>  			return NULL;
> +
> +#ifdef CONFIG_DEBUG_AID_FOR_SYZBOT
> +		if (!time_after(jiffies, current->getblk_stamp + 3 * HZ))
> +			continue;
> +		printk(KERN_ERR "%s(%u): getblk(): executed=%x bh_count=%d bh_state=%lx\n",
> +		       current->comm, current->pid, current->getblk_executed,
> +		       current->getblk_bh_count, current->getblk_bh_state);
> +		current->getblk_executed = 0;
> +		current->getblk_bh_count = 0;
> +		current->getblk_bh_state = 0;
> +		current->getblk_stamp = jiffies;
> +#endif
>  	}
>  }
>  
> @@ -3216,6 +3247,11 @@ int sync_dirty_buffer(struct buffer_head *bh)
>   */
>  static inline int buffer_busy(struct buffer_head *bh)
>  {
> +#ifdef CONFIG_DEBUG_AID_FOR_SYZBOT
> +	current->getblk_executed |= 0x80;
> +	current->getblk_bh_count = atomic_read(&bh->b_count);
> +	current->getblk_bh_state = bh->b_state;
> +#endif

Some explanation of your design wouldn't have hurt.  What does
getblk_executed do, why were these particular fields chosen?

>  	return atomic_read(&bh->b_count) |
>  		(bh->b_state & ((1 << BH_Dirty) | (1 << BH_Lock)));
>  }
>
> ...
>
Tetsuo Handa July 30, 2018, 3:07 p.m. UTC | #2
On 2018/07/21 5:06, Andrew Morton wrote:
> On Fri, 20 Jul 2018 19:36:23 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> 
>>>
>>> This report is stalling after mount() completed and process used remap_file_pages().
>>> I think that we might need to use debug printk(). But I don't know what to examine.
>>>
>>
>> Andrew, can you pick up this debug printk() patch?
>> I guess we can get the result within one week.
> 
> Sure, let's toss it in -next for a while.
> 
>> >From 8f55e00b21fefffbc6abd9085ac503c52a302464 Mon Sep 17 00:00:00 2001
>> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>> Date: Fri, 20 Jul 2018 19:29:06 +0900
>> Subject: [PATCH] fs/buffer.c: add debug print for __getblk_gfp() stall problem
>>
>> Among syzbot's unresolved hung task reports, 18 out of 65 reports contain
>> __getblk_gfp() line in the backtrace. Since there is a comment block that
>> says that __getblk_gfp() will lock up the machine if try_to_free_buffers()
>> attempt from grow_dev_page() is failing, let's start from checking whether
>> syzbot is hitting that case. This change will be removed after the bug is
>> fixed.
> 
> I'm not sure that grow_dev_page() is hanging.  It has often been
> suspected, but always is proven innocent.  Lets see.

syzbot reproduced this problem ( https://syzkaller.appspot.com/text?tag=CrashLog&x=11f2fc44400000 ) .
It says that grow_dev_page() is returning 1 but __find_get_block() is failing forever. Any idea?
Jan Kara Aug. 6, 2018, 10:09 a.m. UTC | #3
On Tue 31-07-18 00:07:22, Tetsuo Handa wrote:
> On 2018/07/21 5:06, Andrew Morton wrote:
> > On Fri, 20 Jul 2018 19:36:23 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> > 
> >>>
> >>> This report is stalling after mount() completed and process used remap_file_pages().
> >>> I think that we might need to use debug printk(). But I don't know what to examine.
> >>>
> >>
> >> Andrew, can you pick up this debug printk() patch?
> >> I guess we can get the result within one week.
> > 
> > Sure, let's toss it in -next for a while.
> > 
> >> >From 8f55e00b21fefffbc6abd9085ac503c52a302464 Mon Sep 17 00:00:00 2001
> >> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> >> Date: Fri, 20 Jul 2018 19:29:06 +0900
> >> Subject: [PATCH] fs/buffer.c: add debug print for __getblk_gfp() stall problem
> >>
> >> Among syzbot's unresolved hung task reports, 18 out of 65 reports contain
> >> __getblk_gfp() line in the backtrace. Since there is a comment block that
> >> says that __getblk_gfp() will lock up the machine if try_to_free_buffers()
> >> attempt from grow_dev_page() is failing, let's start from checking whether
> >> syzbot is hitting that case. This change will be removed after the bug is
> >> fixed.
> > 
> > I'm not sure that grow_dev_page() is hanging.  It has often been
> > suspected, but always is proven innocent.  Lets see.
> 
> syzbot reproduced this problem ( https://syzkaller.appspot.com/text?tag=CrashLog&x=11f2fc44400000 ) .
> It says that grow_dev_page() is returning 1 but __find_get_block() is failing forever. Any idea?

Looks like some kind of a race where device block size gets changed while
getblk() runs (and creates buffers for underlying page). I don't have time
to nail it down at this moment can have a look into it later unless someone
beats me to it.

								Honza
Tetsuo Handa Aug. 6, 2018, 11:56 a.m. UTC | #4
On 2018/08/06 19:09, Jan Kara wrote:
>> syzbot reproduced this problem ( https://syzkaller.appspot.com/text?tag=CrashLog&x=11f2fc44400000 ) .
>> It says that grow_dev_page() is returning 1 but __find_get_block() is failing forever. Any idea?

So far syzbot reproduced 7 times, and all reports are saying that
grow_dev_page() is returning 1 but __find_get_block() is failing forever.

  INFO: task hung in __get_super
  https://syzkaller.appspot.com/text?tag=CrashLog&x=17347272400000

  INFO: task hung in __blkdev_get (2)
  https://syzkaller.appspot.com/text?tag=CrashLog&x=144347fc400000
  https://syzkaller.appspot.com/text?tag=CrashLog&x=12382bfc400000

  INFO: task hung in __writeback_inodes_sb_nr
  https://syzkaller.appspot.com/text?tag=CrashLog&x=11f2fc44400000
  https://syzkaller.appspot.com/text?tag=CrashLog&x=13c2449c400000

  INFO: task hung in filename_create
  https://syzkaller.appspot.com/text?tag=CrashLog&x=174a672c400000

  INFO: task hung in lookup_slow
  https://syzkaller.appspot.com/text?tag=CrashLog&x=16113bdc400000

> 
> Looks like some kind of a race where device block size gets changed while
> getblk() runs (and creates buffers for underlying page). I don't have time
> to nail it down at this moment can have a look into it later unless someone
> beats me to it.
> 

It seems that loop device is relevant to this problem.

144347fc400000:

[  557.484258] syz-executor0(7883): getblk(): executed=9 bh_count=0 bh_state=0
[  560.491589] syz-executor0(7883): getblk(): executed=9 bh_count=0 bh_state=0
[  563.500432] syz-executor0(7883): getblk(): executed=9 bh_count=0 bh_state=0
[  566.508502] syz-executor0(7883): getblk(): executed=9 bh_count=0 bh_state=0
[  569.516546] syz-executor0(7883): getblk(): executed=9 bh_count=0 bh_state=0
[  572.524800] syz-executor0(7883): getblk(): executed=9 bh_count=0 bh_state=0
[  575.532499] syz-executor0(7883): getblk(): executed=9 bh_count=0 bh_state=0
[  575.803688] INFO: task syz-executor7:7893 blocked for more than 140 seconds.
[  575.810957]       Not tainted 4.18.0-rc7-next-20180801+ #29
[  575.816685] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  575.824663] syz-executor7   D24464  7893   4272 0x00000004
[  575.830298] Call Trace:
[  575.832911]  __schedule+0x87c/0x1ec0
[  575.879299]  schedule+0xfb/0x450
[  575.912448]  rwsem_down_read_failed+0x362/0x610
[  575.951922]  call_rwsem_down_read_failed+0x18/0x30
[  575.956843]  down_read+0xc3/0x1d0
[  575.990222]  __get_super.part.12+0x20f/0x2e0
[  575.994654]  get_super+0x2d/0x50
[  575.998016]  fsync_bdev+0x17/0xc0
[  576.001469]  invalidate_partition+0x35/0x60
[  576.005800]  drop_partitions.isra.13+0xe8/0x200
[  576.023366]  rescan_partitions+0x75/0x910
[  576.037534]  __blkdev_reread_part+0x1ad/0x230
[  576.042023]  blkdev_reread_part+0x26/0x40
[  576.046173]  loop_reread_partitions+0x163/0x190
[  576.055795]  loop_set_status+0xb95/0x1010
[  576.059938]  loop_set_status64+0xaa/0x100
[  576.078629]  lo_ioctl+0x90e/0x1d70
[  576.095290]  blkdev_ioctl+0x9cd/0x2030
[  576.146746]  block_ioctl+0xee/0x130
[  576.154704]  do_vfs_ioctl+0x1de/0x1720
[  576.183069]  ksys_ioctl+0xa9/0xd0
[  576.186519]  __x64_sys_ioctl+0x73/0xb0
[  576.190423]  do_syscall_64+0x1b9/0x820



It seems that there was loop_reread_partitions() failure before this message starts.

12382bfc400000:

[  205.467816] loop_reread_partitions: partition scan of loop0 (<E1><D3><F8>w<E5><EA><E4>S<E5>]}d<D0>^MI^A<BA><ED>!<F7><DE><92><A8>f<9B>8<CC><D6>W<DB><F5><AE>F5Eice^W<B5>^O<80>Z<E2>^H%<8D><BA>}
[  205.467816] <BE><8D> <8F>^OESC<CC><88>\<8B><A9>) failed (rc=1)

[  205.905418] loop_reread_partitions: partition scan of loop0 (<E1><D3><F8>w<E5><EA><E4>S<E5>]}d<D0>^MI^A<BA><ED>!<F7><DE><92><A8>f<9B>8<CC><D6>W<DB><F5><AE>F5Eice^W<B5>^O<80>Z<E2>^H%<8D><BA>}
[  205.905418] <BE><8D> <8F>^OESC<CC><88>\<8B><A9>) failed (rc=-16)

[  206.113592] loop_reread_partitions: partition scan of loop0 (<E1><D3><F8>w<E5><EA><E4>S<E5>]}d<D0>^MI^A<BA><ED>!<F7><DE><92><A8>f<9B>8<CC><D6>W<DB><F5><AE>F5Eice^W<B5>^O<80>Z<E2>^H%<8D><BA>}
[  206.113592] <BE><8D> <8F>^OESC<CC><88>\<8B><A9>) failed (rc=-16)

[  206.767225] loop_reread_partitions: partition scan of loop0 (<E1><D3><F8>w<E5><EA><E4>S<E5>]}d<D0>^MI^A<BA><ED>!<F7><DE><92><A8>f<9B>8<CC><D6>W<DB><F5><AE>F5Eice^W<B5>^O<80>Z<E2>^H%<8D><BA>}
[  206.767225] <BE><8D> <8F>^OESC<CC><88>\<8B><A9>) failed (rc=-16)

[  206.990060] loop_reread_partitions: partition scan of loop0 (<E1><D3><F8>w<E5><EA><E4>S<E5>]}d<D0>^MI^A<BA><ED>!<F7><DE><92><A8>f<9B>8<CC><D6>W<DB><F5><AE>F5Eice^W<B5>^O<80>Z<E2>^H%<8D><BA>}
[  206.990060] <BE><8D> <8F>^OESC<CC><88>\<8B><A9>) failed (rc=1)

[  208.630329] loop_reread_partitions: partition scan of loop0 (<E1><D3><F8>w<E5><EA><E4>S<E5>]}d<D0>^MI^A<BA><ED>!<F7><DE><92><A8>f<9B>8<CC><D6>W<DB><F5><AE>F5Eice^W<B5>^O<80>Z<E2>^H%<8D><BA>}
[  208.630329] <BE><8D> <8F>^OESC<CC><88>\<8B><A9>) failed (rc=-16)

[  228.101929] loop_reread_partitions: partition scan of loop0 (<E1><D3><F8>w<E5><EA><E4>S<E5>]}d<D0>^MI^A<BA><ED>!<F7><DE><92><A8>f<9B>8<CC><D6>W<DB><F5><AE>F5Eice^W<B5>^O<80>Z<E2>^H%<8D><BA>}
[  228.101929] <BE><8D> <8F>^OESC<CC><88>\<8B><A9>) failed (rc=-16)

[  228.605840] loop_reread_partitions: partition scan of loop0 (<E1><D3><F8>w<E5><EA><E4>S<E5>]}d<D0>^MI^A<BA><ED>!<F7><DE><92><A8>f<9B>8<CC><D6>W<DB><F5><AE>F5Eice^W<B5>^O<80>Z<E2>^H%<8D><BA>}
[  228.605840] <BE><8D> <8F>^OESC<CC><88>\<8B><A9>) failed (rc=1)

[  229.400629] loop_reread_partitions: partition scan of loop0 (<E1><D3><F8>w<E5><EA><E4>S<E5>]}d<D0>^MI^A<BA><ED>!<F7><DE><92><A8>f<9B>8<CC><D6>W<DB><F5><AE>F5Eice^W<B5>^O<80>Z<E2>^H%<8D><BA>}
[  229.400629] <BE><8D> <8F>^OESC<CC><88>\<8B><A9>) failed (rc=1)

[  229.622379] loop_reread_partitions: partition scan of loop0 (<E1><D3><F8>w<E5><EA><E4>S<E5>]}d<D0>^MI^A<BA><ED>!<F7><DE><92><A8>f<9B>8<CC><D6>W<DB><F5><AE>F5Eice^W<B5>^O<80>Z<E2>^H%<8D><BA>}
[  229.622379] <BE><8D> <8F>^OESC<CC><88>\<8B><A9>) failed (rc=1)

[  233.406342] syz-executor0(18214): getblk(): executed=9 bh_count=0 bh_state=0
[  236.414328] syz-executor0(18214): getblk(): executed=9 bh_count=0 bh_state=0
[  239.422327] syz-executor0(18214): getblk(): executed=9 bh_count=0 bh_state=0
[  242.430332] syz-executor0(18214): getblk(): executed=9 bh_count=0 bh_state=0
(...snipped...)
[  416.894327] syz-executor0(18214): getblk(): executed=9 bh_count=0 bh_state=0
[  419.902327] syz-executor0(18214): getblk(): executed=9 bh_count=0 bh_state=0
[  422.910325] syz-executor0(18214): getblk(): executed=9 bh_count=0 bh_state=0
[  425.918329] syz-executor0(18214): getblk(): executed=9 bh_count=0 bh_state=0

[  431.005871] NMI backtrace for cpu 0
[  431.005876] CPU: 0 PID: 18214 Comm: syz-executor0 Not tainted 4.18.0-rc6-next-20180725+ #18
[  431.005881] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[  431.005884] RIP: 0010:write_comp_data+0x14/0x70
[  431.005893] Code: 4c 89 ef e8 1e 93 3e 00 e9 76 fc ff ff e8 b4 9c ca ff 90 90 90 90 55 65 4c 8b 04 25 40 ee 01 00 65 8b 05 2f 46 85 7e 48 89 e5 <a9> 00 01 1f 00 75 51 41 8b 80 98 12 00 00 83 f8 03 75 45 49 8b 80
[  431.005896] RSP: 0018:ffff880192487130 EFLAGS: 00000046
[  431.005903] RAX: 0000000080000000 RBX: ffff880168426b28 RCX: ffffffff81d704dd
[  431.005906] RDX: 000000000003085f RSI: 000000000000000d RDI: 0000000000000006
[  431.005910] RBP: ffff880192487130 R08: ffff8801d9696200 R09: fffff94000af6a66
[  431.005914] R10: fffff94000af6a66 R11: ffffea00057b5337 R12: 0000000000000006
[  431.005918] R13: dffffc0000000000 R14: 0000000000000006 R15: 000000000003085f
[  431.005923] FS:  00007f6c291f1700(0000) GS:ffff8801db000000(0000) knlGS:0000000000000000
[  431.005926] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  431.005930] CR2: ffffffffff600400 CR3: 00000001c5ff4000 CR4: 00000000001406f0
[  431.005935] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  431.005938] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  431.005941] Call Trace:
[  431.005944]  __sanitizer_cov_trace_cmp8+0x18/0x20
[  431.005946]  __find_get_block+0x1bd/0xe60
[  431.005969]  __getblk_gfp+0x3dc/0xe00
[  431.005996]  __bread_gfp+0x2d/0x310
[  431.005999]  fat__get_entry+0x59c/0xa30
[  431.006017]  fat_get_short_entry+0x13c/0x2c0
[  431.006020]  fat_subdirs+0x13c/0x290
[  431.006043]  fat_fill_super+0x29f6/0x4430
[  431.006068]  vfat_fill_super+0x31/0x40
[  431.006070]  mount_bdev+0x314/0x3e0
[  431.006075]  vfat_mount+0x3c/0x50
[  431.006080]  legacy_get_tree+0x131/0x460
[  431.006083]  vfs_get_tree+0x1cb/0x5c0
[  431.006088]  do_mount+0x6f2/0x1e20
[  431.006113]  ksys_mount+0x12d/0x140
[  431.006116]  __x64_sys_mount+0xbe/0x150
[  431.006118]  do_syscall_64+0x1b9/0x820
[  431.006143]  entry_SYSCALL_64_after_hwframe+0x49/0xbe



There were a lot of directory bread failure messages before this message starts.

17347272400000:

[  431.497237] FAT-fs (loop3): Directory bread(block 2574) failed
[  431.505561] FAT-fs (loop3): Directory bread(block 2575) failed
[  431.516048] FAT-fs (loop3): Directory bread(block 2576) failed
[  431.525740] FAT-fs (loop3): Directory bread(block 2577) failed
[  431.535899] FAT-fs (loop3): Directory bread(block 2578) failed
[  431.636001] FAT-fs (loop5): bogus number of reserved sectors
[  431.641910] FAT-fs (loop5): Can't find a valid FAT filesystem
[  434.145080] syz-executor0(7883): getblk(): executed=9 bh_count=0 bh_state=0
[  437.152496] syz-executor0(7883): getblk(): executed=9 bh_count=0 bh_state=0
[  440.161598] syz-executor0(7883): getblk(): executed=9 bh_count=0 bh_state=0
[  443.169498] syz-executor0(7883): getblk(): executed=9 bh_count=0 bh_state=0
(...snipped...)
[  563.500432] syz-executor0(7883): getblk(): executed=9 bh_count=0 bh_state=0
[  566.508502] syz-executor0(7883): getblk(): executed=9 bh_count=0 bh_state=0
[  569.516546] syz-executor0(7883): getblk(): executed=9 bh_count=0 bh_state=0
[  572.524800] syz-executor0(7883): getblk(): executed=9 bh_count=0 bh_state=0
[  575.532499] syz-executor0(7883): getblk(): executed=9 bh_count=0 bh_state=0
[  575.803688] INFO: task syz-executor7:7893 blocked for more than 140 seconds.
[  575.810957]       Not tainted 4.18.0-rc7-next-20180801+ #29
[  575.816685] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  575.824663] syz-executor7   D24464  7893   4272 0x00000004
[  575.830298] Call Trace:
[  575.832911]  __schedule+0x87c/0x1ec0
[  575.879299]  schedule+0xfb/0x450
[  575.912448]  rwsem_down_read_failed+0x362/0x610
[  575.951922]  call_rwsem_down_read_failed+0x18/0x30
[  575.956843]  down_read+0xc3/0x1d0
[  575.990222]  __get_super.part.12+0x20f/0x2e0
[  575.994654]  get_super+0x2d/0x50
[  575.998016]  fsync_bdev+0x17/0xc0
[  576.001469]  invalidate_partition+0x35/0x60
[  576.005800]  drop_partitions.isra.13+0xe8/0x200
[  576.023366]  rescan_partitions+0x75/0x910
[  576.037534]  __blkdev_reread_part+0x1ad/0x230
[  576.042023]  blkdev_reread_part+0x26/0x40
[  576.046173]  loop_reread_partitions+0x163/0x190
[  576.055795]  loop_set_status+0xb95/0x1010
[  576.059938]  loop_set_status64+0xaa/0x100
[  576.078629]  lo_ioctl+0x90e/0x1d70
[  576.095290]  blkdev_ioctl+0x9cd/0x2030
[  576.146746]  block_ioctl+0xee/0x130
[  576.154704]  do_vfs_ioctl+0x1de/0x1720
[  576.183069]  ksys_ioctl+0xa9/0xd0
[  576.186519]  __x64_sys_ioctl+0x73/0xb0
[  576.190423]  do_syscall_64+0x1b9/0x820
[  576.246994]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
Tetsuo Handa Aug. 20, 2018, 2:12 p.m. UTC | #5
On 2018/08/06 20:56, Tetsuo Handa wrote:
> On 2018/08/06 19:09, Jan Kara wrote:
>> Looks like some kind of a race where device block size gets changed while
>> getblk() runs (and creates buffers for underlying page). I don't have time
>> to nail it down at this moment can have a look into it later unless someone
>> beats me to it.
>>
> 
> It seems that loop device is relevant to this problem.

Speak of loop device, I'm waiting for Jens for three months
http://lkml.kernel.org/r/1527297408-4428-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
but Jens is too busy to come up with an alternative. Since that is a big patch, I wish we can
start testing that patch before Jan starts writing a patch for fixing getblk() race problem.
Tetsuo Handa Dec. 28, 2018, 1:34 p.m. UTC | #6
On 2018/08/06 19:09, Jan Kara wrote:
> On Tue 31-07-18 00:07:22, Tetsuo Handa wrote:
>> On 2018/07/21 5:06, Andrew Morton wrote:
>>> On Fri, 20 Jul 2018 19:36:23 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
>>>
>>>>>
>>>>> This report is stalling after mount() completed and process used remap_file_pages().
>>>>> I think that we might need to use debug printk(). But I don't know what to examine.
>>>>>
>>>>
>>>> Andrew, can you pick up this debug printk() patch?
>>>> I guess we can get the result within one week.
>>>
>>> Sure, let's toss it in -next for a while.
>>>
>>>> >From 8f55e00b21fefffbc6abd9085ac503c52a302464 Mon Sep 17 00:00:00 2001
>>>> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>>>> Date: Fri, 20 Jul 2018 19:29:06 +0900
>>>> Subject: [PATCH] fs/buffer.c: add debug print for __getblk_gfp() stall problem
>>>>
>>>> Among syzbot's unresolved hung task reports, 18 out of 65 reports contain
>>>> __getblk_gfp() line in the backtrace. Since there is a comment block that
>>>> says that __getblk_gfp() will lock up the machine if try_to_free_buffers()
>>>> attempt from grow_dev_page() is failing, let's start from checking whether
>>>> syzbot is hitting that case. This change will be removed after the bug is
>>>> fixed.
>>>
>>> I'm not sure that grow_dev_page() is hanging.  It has often been
>>> suspected, but always is proven innocent.  Lets see.
>>
>> syzbot reproduced this problem ( https://syzkaller.appspot.com/text?tag=CrashLog&x=11f2fc44400000 ) .
>> It says that grow_dev_page() is returning 1 but __find_get_block() is failing forever. Any idea?
> 
> Looks like some kind of a race where device block size gets changed while
> getblk() runs (and creates buffers for underlying page). I don't have time
> to nail it down at this moment can have a look into it later unless someone
> beats me to it.

I feel that the frequency of hitting this problem was decreased
by merging loop module's ioctl() serialization patches. But this
problem is still there, and syzbot got a new line in
https://syzkaller.appspot.com/text?tag=CrashLog&x=177f889f400000 .

  [  615.881781] __loop_clr_fd: partition scan of loop5 failed (rc=-22)
  [  619.059920] syz-executor4(2193): getblk(): executed=cd bh_count=0 bh_state=29
  [  622.069808] syz-executor4(2193): getblk(): executed=9 bh_count=0 bh_state=0
  [  625.080013] syz-executor4(2193): getblk(): executed=9 bh_count=0 bh_state=0
  [  628.089900] syz-executor4(2193): getblk(): executed=9 bh_count=0 bh_state=0

I guess that loop module is somehow related to this problem.
Jan Kara Jan. 2, 2019, 2:40 p.m. UTC | #7
On Fri 28-12-18 22:34:13, Tetsuo Handa wrote:
> On 2018/08/06 19:09, Jan Kara wrote:
> > On Tue 31-07-18 00:07:22, Tetsuo Handa wrote:
> >> On 2018/07/21 5:06, Andrew Morton wrote:
> >>> On Fri, 20 Jul 2018 19:36:23 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> >>>
> >>>>>
> >>>>> This report is stalling after mount() completed and process used remap_file_pages().
> >>>>> I think that we might need to use debug printk(). But I don't know what to examine.
> >>>>>
> >>>>
> >>>> Andrew, can you pick up this debug printk() patch?
> >>>> I guess we can get the result within one week.
> >>>
> >>> Sure, let's toss it in -next for a while.
> >>>
> >>>> >From 8f55e00b21fefffbc6abd9085ac503c52a302464 Mon Sep 17 00:00:00 2001
> >>>> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> >>>> Date: Fri, 20 Jul 2018 19:29:06 +0900
> >>>> Subject: [PATCH] fs/buffer.c: add debug print for __getblk_gfp() stall problem
> >>>>
> >>>> Among syzbot's unresolved hung task reports, 18 out of 65 reports contain
> >>>> __getblk_gfp() line in the backtrace. Since there is a comment block that
> >>>> says that __getblk_gfp() will lock up the machine if try_to_free_buffers()
> >>>> attempt from grow_dev_page() is failing, let's start from checking whether
> >>>> syzbot is hitting that case. This change will be removed after the bug is
> >>>> fixed.
> >>>
> >>> I'm not sure that grow_dev_page() is hanging.  It has often been
> >>> suspected, but always is proven innocent.  Lets see.
> >>
> >> syzbot reproduced this problem ( https://syzkaller.appspot.com/text?tag=CrashLog&x=11f2fc44400000 ) .
> >> It says that grow_dev_page() is returning 1 but __find_get_block() is failing forever. Any idea?
> > 
> > Looks like some kind of a race where device block size gets changed while
> > getblk() runs (and creates buffers for underlying page). I don't have time
> > to nail it down at this moment can have a look into it later unless someone
> > beats me to it.
> 
> I feel that the frequency of hitting this problem was decreased
> by merging loop module's ioctl() serialization patches. But this
> problem is still there, and syzbot got a new line in
> https://syzkaller.appspot.com/text?tag=CrashLog&x=177f889f400000 .
> 
>   [  615.881781] __loop_clr_fd: partition scan of loop5 failed (rc=-22)
>   [  619.059920] syz-executor4(2193): getblk(): executed=cd bh_count=0 bh_state=29
>   [  622.069808] syz-executor4(2193): getblk(): executed=9 bh_count=0 bh_state=0
>   [  625.080013] syz-executor4(2193): getblk(): executed=9 bh_count=0 bh_state=0
>   [  628.089900] syz-executor4(2193): getblk(): executed=9 bh_count=0 bh_state=0
> 
> I guess that loop module is somehow related to this problem.

I had a look into this and the only good explanation for this I have is
that sb->s_blocksize is different from (1 << sb->s_bdev->bd_inode->i_blkbits).
If that would happen, we'd get exactly the behavior syzkaller observes
because grow_buffers() would populate different page than
__find_get_block() then looks up.

However I don't see how that's possible since the filesystem has the block
device open exclusively and blkdev_bszset() makes sure we also have
exclusive access to the block device before changing the block device size.
So changing block device block size after filesystem gets access to the
device should be impossible. 

Anyway, could you perhaps add to your debug patch a dump of 'size' passed
to __getblk_slow() and bdev->bd_inode->i_blkbits? That should tell us
whether my theory is right or not. Thanks!

								Honza
Dmitry Vyukov Jan. 2, 2019, 2:46 p.m. UTC | #8
On Wed, Jan 2, 2019 at 3:40 PM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 28-12-18 22:34:13, Tetsuo Handa wrote:
> > On 2018/08/06 19:09, Jan Kara wrote:
> > > On Tue 31-07-18 00:07:22, Tetsuo Handa wrote:
> > >> On 2018/07/21 5:06, Andrew Morton wrote:
> > >>> On Fri, 20 Jul 2018 19:36:23 +0900 Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> wrote:
> > >>>
> > >>>>>
> > >>>>> This report is stalling after mount() completed and process used remap_file_pages().
> > >>>>> I think that we might need to use debug printk(). But I don't know what to examine.
> > >>>>>
> > >>>>
> > >>>> Andrew, can you pick up this debug printk() patch?
> > >>>> I guess we can get the result within one week.
> > >>>
> > >>> Sure, let's toss it in -next for a while.
> > >>>
> > >>>> >From 8f55e00b21fefffbc6abd9085ac503c52a302464 Mon Sep 17 00:00:00 2001
> > >>>> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> > >>>> Date: Fri, 20 Jul 2018 19:29:06 +0900
> > >>>> Subject: [PATCH] fs/buffer.c: add debug print for __getblk_gfp() stall problem
> > >>>>
> > >>>> Among syzbot's unresolved hung task reports, 18 out of 65 reports contain
> > >>>> __getblk_gfp() line in the backtrace. Since there is a comment block that
> > >>>> says that __getblk_gfp() will lock up the machine if try_to_free_buffers()
> > >>>> attempt from grow_dev_page() is failing, let's start from checking whether
> > >>>> syzbot is hitting that case. This change will be removed after the bug is
> > >>>> fixed.
> > >>>
> > >>> I'm not sure that grow_dev_page() is hanging.  It has often been
> > >>> suspected, but always is proven innocent.  Lets see.
> > >>
> > >> syzbot reproduced this problem ( https://syzkaller.appspot.com/text?tag=CrashLog&x=11f2fc44400000 ) .
> > >> It says that grow_dev_page() is returning 1 but __find_get_block() is failing forever. Any idea?
> > >
> > > Looks like some kind of a race where device block size gets changed while
> > > getblk() runs (and creates buffers for underlying page). I don't have time
> > > to nail it down at this moment can have a look into it later unless someone
> > > beats me to it.
> >
> > I feel that the frequency of hitting this problem was decreased
> > by merging loop module's ioctl() serialization patches. But this
> > problem is still there, and syzbot got a new line in
> > https://syzkaller.appspot.com/text?tag=CrashLog&x=177f889f400000 .
> >
> >   [  615.881781] __loop_clr_fd: partition scan of loop5 failed (rc=-22)
> >   [  619.059920] syz-executor4(2193): getblk(): executed=cd bh_count=0 bh_state=29
> >   [  622.069808] syz-executor4(2193): getblk(): executed=9 bh_count=0 bh_state=0
> >   [  625.080013] syz-executor4(2193): getblk(): executed=9 bh_count=0 bh_state=0
> >   [  628.089900] syz-executor4(2193): getblk(): executed=9 bh_count=0 bh_state=0
> >
> > I guess that loop module is somehow related to this problem.
>
> I had a look into this and the only good explanation for this I have is
> that sb->s_blocksize is different from (1 << sb->s_bdev->bd_inode->i_blkbits).
> If that would happen, we'd get exactly the behavior syzkaller observes
> because grow_buffers() would populate different page than
> __find_get_block() then looks up.
>
> However I don't see how that's possible since the filesystem has the block
> device open exclusively and blkdev_bszset() makes sure we also have
> exclusive access to the block device before changing the block device size.
> So changing block device block size after filesystem gets access to the
> device should be impossible.

If this is that critical and impossible to fire, maybe it makes sense
to add a corresponding debug check to some code paths?
syzkaller will immediately catch any violations if they happen.


> Anyway, could you perhaps add to your debug patch a dump of 'size' passed
> to __getblk_slow() and bdev->bd_inode->i_blkbits? That should tell us
> whether my theory is right or not. Thanks!
Tetsuo Handa Jan. 2, 2019, 4:07 p.m. UTC | #9
On 2019/01/02 23:40, Jan Kara wrote:
> I had a look into this and the only good explanation for this I have is
> that sb->s_blocksize is different from (1 << sb->s_bdev->bd_inode->i_blkbits).
> If that would happen, we'd get exactly the behavior syzkaller observes
> because grow_buffers() would populate different page than
> __find_get_block() then looks up.
> 
> However I don't see how that's possible since the filesystem has the block
> device open exclusively and blkdev_bszset() makes sure we also have
> exclusive access to the block device before changing the block device size.
> So changing block device block size after filesystem gets access to the
> device should be impossible. 
> 
> Anyway, could you perhaps add to your debug patch a dump of 'size' passed
> to __getblk_slow() and bdev->bd_inode->i_blkbits? That should tell us
> whether my theory is right or not. Thanks!
> 

OK. Andrew, will you add (or fold into) this change?

From e6f334380ad2c87457bfc2a4058316c47f75824a Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Thu, 3 Jan 2019 01:03:35 +0900
Subject: [PATCH] fs/buffer.c: dump more info for __getblk_gfp() stall problem

We need to dump more variables on top of
"fs/buffer.c: add debug print for __getblk_gfp() stall problem".

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jan Kara <jack@suse.cz>
---
 fs/buffer.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 580fda0..a50acac 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1066,9 +1066,14 @@ static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size)
 #ifdef CONFIG_DEBUG_AID_FOR_SYZBOT
 		if (!time_after(jiffies, current->getblk_stamp + 3 * HZ))
 			continue;
-		printk(KERN_ERR "%s(%u): getblk(): executed=%x bh_count=%d bh_state=%lx\n",
+		printk(KERN_ERR "%s(%u): getblk(): executed=%x bh_count=%d bh_state=%lx "
+		       "bdev_super_blocksize=%lu size=%u "
+		       "bdev_super_blocksize_bits=%u bdev_inode_blkbits=%u\n",
 		       current->comm, current->pid, current->getblk_executed,
-		       current->getblk_bh_count, current->getblk_bh_state);
+		       current->getblk_bh_count, current->getblk_bh_state,
+		       bdev->bd_super->s_blocksize, size,
+		       bdev->bd_super->s_blocksize_bits,
+		       bdev->bd_inode->i_blkbits);
 		current->getblk_executed = 0;
 		current->getblk_bh_count = 0;
 		current->getblk_bh_state = 0;
Jan Kara Jan. 2, 2019, 5:26 p.m. UTC | #10
On Thu 03-01-19 01:07:25, Tetsuo Handa wrote:
> On 2019/01/02 23:40, Jan Kara wrote:
> > I had a look into this and the only good explanation for this I have is
> > that sb->s_blocksize is different from (1 << sb->s_bdev->bd_inode->i_blkbits).
> > If that would happen, we'd get exactly the behavior syzkaller observes
> > because grow_buffers() would populate different page than
> > __find_get_block() then looks up.
> > 
> > However I don't see how that's possible since the filesystem has the block
> > device open exclusively and blkdev_bszset() makes sure we also have
> > exclusive access to the block device before changing the block device size.
> > So changing block device block size after filesystem gets access to the
> > device should be impossible. 
> > 
> > Anyway, could you perhaps add to your debug patch a dump of 'size' passed
> > to __getblk_slow() and bdev->bd_inode->i_blkbits? That should tell us
> > whether my theory is right or not. Thanks!
> > 
> 
> OK. Andrew, will you add (or fold into) this change?
> 
> From e6f334380ad2c87457bfc2a4058316c47f75824a Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Date: Thu, 3 Jan 2019 01:03:35 +0900
> Subject: [PATCH] fs/buffer.c: dump more info for __getblk_gfp() stall problem
> 
> We need to dump more variables on top of
> "fs/buffer.c: add debug print for __getblk_gfp() stall problem".
> 
> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
> Cc: Jan Kara <jack@suse.cz>
> ---
>  fs/buffer.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 580fda0..a50acac 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1066,9 +1066,14 @@ static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size)
>  #ifdef CONFIG_DEBUG_AID_FOR_SYZBOT
>  		if (!time_after(jiffies, current->getblk_stamp + 3 * HZ))
>  			continue;
> -		printk(KERN_ERR "%s(%u): getblk(): executed=%x bh_count=%d bh_state=%lx\n",
> +		printk(KERN_ERR "%s(%u): getblk(): executed=%x bh_count=%d bh_state=%lx "
> +		       "bdev_super_blocksize=%lu size=%u "
> +		       "bdev_super_blocksize_bits=%u bdev_inode_blkbits=%u\n",
>  		       current->comm, current->pid, current->getblk_executed,
> -		       current->getblk_bh_count, current->getblk_bh_state);
> +		       current->getblk_bh_count, current->getblk_bh_state,
> +		       bdev->bd_super->s_blocksize, size,
> +		       bdev->bd_super->s_blocksize_bits,
> +		       bdev->bd_inode->i_blkbits);

Well, bd_super may be NULL if there's no filesystem mounted so it would be
safer to check for this rather than blindly dereferencing it... Otherwise
the change looks good to me.

								Honza
Tetsuo Handa Jan. 3, 2019, 12:46 a.m. UTC | #11
On 2019/01/03 2:26, Jan Kara wrote:
> On Thu 03-01-19 01:07:25, Tetsuo Handa wrote:
>> On 2019/01/02 23:40, Jan Kara wrote:
>>> I had a look into this and the only good explanation for this I have is
>>> that sb->s_blocksize is different from (1 << sb->s_bdev->bd_inode->i_blkbits).
>>> If that would happen, we'd get exactly the behavior syzkaller observes
>>> because grow_buffers() would populate different page than
>>> __find_get_block() then looks up.
>>>
>>> However I don't see how that's possible since the filesystem has the block
>>> device open exclusively and blkdev_bszset() makes sure we also have
>>> exclusive access to the block device before changing the block device size.
>>> So changing block device block size after filesystem gets access to the
>>> device should be impossible. 
>>>
>>> Anyway, could you perhaps add to your debug patch a dump of 'size' passed
>>> to __getblk_slow() and bdev->bd_inode->i_blkbits? That should tell us
>>> whether my theory is right or not. Thanks!
>>>
>>
>> OK. Andrew, will you add (or fold into) this change?
>>
>> From e6f334380ad2c87457bfc2a4058316c47f75824a Mon Sep 17 00:00:00 2001
>> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>> Date: Thu, 3 Jan 2019 01:03:35 +0900
>> Subject: [PATCH] fs/buffer.c: dump more info for __getblk_gfp() stall problem
>>
>> We need to dump more variables on top of
>> "fs/buffer.c: add debug print for __getblk_gfp() stall problem".
>>
>> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>> Cc: Jan Kara <jack@suse.cz>
>> ---
>>  fs/buffer.c | 9 +++++++--
>>  1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/buffer.c b/fs/buffer.c
>> index 580fda0..a50acac 100644
>> --- a/fs/buffer.c
>> +++ b/fs/buffer.c
>> @@ -1066,9 +1066,14 @@ static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size)
>>  #ifdef CONFIG_DEBUG_AID_FOR_SYZBOT
>>  		if (!time_after(jiffies, current->getblk_stamp + 3 * HZ))
>>  			continue;
>> -		printk(KERN_ERR "%s(%u): getblk(): executed=%x bh_count=%d bh_state=%lx\n",
>> +		printk(KERN_ERR "%s(%u): getblk(): executed=%x bh_count=%d bh_state=%lx "
>> +		       "bdev_super_blocksize=%lu size=%u "
>> +		       "bdev_super_blocksize_bits=%u bdev_inode_blkbits=%u\n",
>>  		       current->comm, current->pid, current->getblk_executed,
>> -		       current->getblk_bh_count, current->getblk_bh_state);
>> +		       current->getblk_bh_count, current->getblk_bh_state,
>> +		       bdev->bd_super->s_blocksize, size,
>> +		       bdev->bd_super->s_blocksize_bits,
>> +		       bdev->bd_inode->i_blkbits);
> 
> Well, bd_super may be NULL if there's no filesystem mounted so it would be
> safer to check for this rather than blindly dereferencing it... Otherwise
> the change looks good to me.

I see. Let's be cautious here.

From 317a0d0002b3d2cadae606055ad50f2926ca62d2 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Date: Thu, 3 Jan 2019 09:42:02 +0900
Subject: [PATCH v2] fs/buffer.c: dump more info for __getblk_gfp() stall problem

We need to dump more variables on top of
"fs/buffer.c: add debug print for __getblk_gfp() stall problem".

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Jan Kara <jack@suse.cz>
---
 fs/buffer.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 580fda0..784de3d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1066,9 +1066,15 @@ static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size)
 #ifdef CONFIG_DEBUG_AID_FOR_SYZBOT
 		if (!time_after(jiffies, current->getblk_stamp + 3 * HZ))
 			continue;
-		printk(KERN_ERR "%s(%u): getblk(): executed=%x bh_count=%d bh_state=%lx\n",
+		printk(KERN_ERR "%s(%u): getblk(): executed=%x bh_count=%d bh_state=%lx bdev_super_blocksize=%ld size=%u bdev_super_blocksize_bits=%d bdev_inode_blkbits=%d\n",
 		       current->comm, current->pid, current->getblk_executed,
-		       current->getblk_bh_count, current->getblk_bh_state);
+		       current->getblk_bh_count, current->getblk_bh_state,
+		       IS_ERR_OR_NULL(bdev->bd_super) ? -1L :
+		       bdev->bd_super->s_blocksize, size,
+		       IS_ERR_OR_NULL(bdev->bd_super) ? -1 :
+		       bdev->bd_super->s_blocksize_bits,
+		       IS_ERR_OR_NULL(bdev->bd_inode) ? -1 :
+		       bdev->bd_inode->i_blkbits);
 		current->getblk_executed = 0;
 		current->getblk_bh_count = 0;
 		current->getblk_bh_state = 0;
Tetsuo Handa Jan. 8, 2019, 10:04 a.m. UTC | #12
On 2019/01/03 2:26, Jan Kara wrote:
> On Thu 03-01-19 01:07:25, Tetsuo Handa wrote:
>> On 2019/01/02 23:40, Jan Kara wrote:
>>> I had a look into this and the only good explanation for this I have is
>>> that sb->s_blocksize is different from (1 << sb->s_bdev->bd_inode->i_blkbits).
>>> If that would happen, we'd get exactly the behavior syzkaller observes
>>> because grow_buffers() would populate different page than
>>> __find_get_block() then looks up.
>>>
>>> However I don't see how that's possible since the filesystem has the block
>>> device open exclusively and blkdev_bszset() makes sure we also have
>>> exclusive access to the block device before changing the block device size.
>>> So changing block device block size after filesystem gets access to the
>>> device should be impossible. 
>>>
>>> Anyway, could you perhaps add to your debug patch a dump of 'size' passed
>>> to __getblk_slow() and bdev->bd_inode->i_blkbits? That should tell us
>>> whether my theory is right or not. Thanks!
>>>

Got two reports. 'size' is 512 while bdev->bd_inode->i_blkbits is 12.

https://syzkaller.appspot.com/text?tag=CrashLog&x=1237c3ab400000

[  385.723941][  T439] kworker/u4:3(439): getblk(): executed=9 bh_count=0 bh_state=0 bdev_super_blocksize=512 size=512 bdev_super_blocksize_bits=9 bdev_inode_blkbits=12
(...snipped...)
[  568.159544][  T439] kworker/u4:3(439): getblk(): executed=9 bh_count=0 bh_state=0 bdev_super_blocksize=512 size=512 bdev_super_blocksize_bits=9 bdev_inode_blkbits=12

https://syzkaller.appspot.com/text?tag=CrashLog&x=143383d7400000

[ 1355.681513][ T6893] syz-executor0(6893): getblk(): executed=9 bh_count=0 bh_state=0 bdev_super_blocksize=512 size=512 bdev_super_blocksize_bits=9 bdev_inode_blkbits=12
[ 1358.274585][T15649] kworker/u4:17(15649): getblk(): executed=9 bh_count=0 bh_state=0 bdev_super_blocksize=512 size=512 bdev_super_blocksize_bits=9 bdev_inode_blkbits=12
(...snipped...)
[ 1455.341572][ T6893] syz-executor0(6893): getblk(): executed=9 bh_count=0 bh_state=0 bdev_super_blocksize=512 size=512 bdev_super_blocksize_bits=9 bdev_inode_blkbits=12
[ 1455.541457][T15649] kworker/u4:17(15649): getblk(): executed=9 bh_count=0 bh_state=0 bdev_super_blocksize=512 size=512 bdev_super_blocksize_bits=9 bdev_inode_blkbits=12

>>
>> OK. Andrew, will you add (or fold into) this change?
>>
>> From e6f334380ad2c87457bfc2a4058316c47f75824a Mon Sep 17 00:00:00 2001
>> From: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>> Date: Thu, 3 Jan 2019 01:03:35 +0900
>> Subject: [PATCH] fs/buffer.c: dump more info for __getblk_gfp() stall problem
>>
>> We need to dump more variables on top of
>> "fs/buffer.c: add debug print for __getblk_gfp() stall problem".
>>
>> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
>> Cc: Jan Kara <jack@suse.cz>
>> ---
>>  fs/buffer.c | 9 +++++++--
>>  1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/buffer.c b/fs/buffer.c
>> index 580fda0..a50acac 100644
>> --- a/fs/buffer.c
>> +++ b/fs/buffer.c
>> @@ -1066,9 +1066,14 @@ static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size)
>>  #ifdef CONFIG_DEBUG_AID_FOR_SYZBOT
>>  		if (!time_after(jiffies, current->getblk_stamp + 3 * HZ))
>>  			continue;
>> -		printk(KERN_ERR "%s(%u): getblk(): executed=%x bh_count=%d bh_state=%lx\n",
>> +		printk(KERN_ERR "%s(%u): getblk(): executed=%x bh_count=%d bh_state=%lx "
>> +		       "bdev_super_blocksize=%lu size=%u "
>> +		       "bdev_super_blocksize_bits=%u bdev_inode_blkbits=%u\n",
>>  		       current->comm, current->pid, current->getblk_executed,
>> -		       current->getblk_bh_count, current->getblk_bh_state);
>> +		       current->getblk_bh_count, current->getblk_bh_state,
>> +		       bdev->bd_super->s_blocksize, size,
>> +		       bdev->bd_super->s_blocksize_bits,
>> +		       bdev->bd_inode->i_blkbits);
> 
> Well, bd_super may be NULL if there's no filesystem mounted so it would be
> safer to check for this rather than blindly dereferencing it... Otherwise
> the change looks good to me.
> 
> 								Honza
>
Jan Kara Jan. 8, 2019, 11:24 a.m. UTC | #13
On Tue 08-01-19 19:04:06, Tetsuo Handa wrote:
> On 2019/01/03 2:26, Jan Kara wrote:
> > On Thu 03-01-19 01:07:25, Tetsuo Handa wrote:
> >> On 2019/01/02 23:40, Jan Kara wrote:
> >>> I had a look into this and the only good explanation for this I have is
> >>> that sb->s_blocksize is different from (1 << sb->s_bdev->bd_inode->i_blkbits).
> >>> If that would happen, we'd get exactly the behavior syzkaller observes
> >>> because grow_buffers() would populate different page than
> >>> __find_get_block() then looks up.
> >>>
> >>> However I don't see how that's possible since the filesystem has the block
> >>> device open exclusively and blkdev_bszset() makes sure we also have
> >>> exclusive access to the block device before changing the block device size.
> >>> So changing block device block size after filesystem gets access to the
> >>> device should be impossible. 
> >>>
> >>> Anyway, could you perhaps add to your debug patch a dump of 'size' passed
> >>> to __getblk_slow() and bdev->bd_inode->i_blkbits? That should tell us
> >>> whether my theory is right or not. Thanks!
> >>>
> 
> Got two reports. 'size' is 512 while bdev->bd_inode->i_blkbits is 12.
> 
> https://syzkaller.appspot.com/text?tag=CrashLog&x=1237c3ab400000
> 
> [  385.723941][  T439] kworker/u4:3(439): getblk(): executed=9 bh_count=0 bh_state=0 bdev_super_blocksize=512 size=512 bdev_super_blocksize_bits=9 bdev_inode_blkbits=12
> (...snipped...)
> [  568.159544][  T439] kworker/u4:3(439): getblk(): executed=9 bh_count=0 bh_state=0 bdev_super_blocksize=512 size=512 bdev_super_blocksize_bits=9 bdev_inode_blkbits=12

Right, so indeed the block size in the superblock and in the block device
gets out of sync which explains why we endlessly loop in the buffer cache
code. The superblock uses blocksize of 512 while the block device thinks
the set block size is 4096.

And after staring into the code for some time, I finally have a trivial
reproducer:

truncate -s 1G /tmp/image
losetup /dev/loop0 /tmp/image
mkfs.ext4 -b 1024 /dev/loop0
mount -t ext4 /dev/loop0 /mnt
losetup -c /dev/loop0
l /mnt
<hangs>

And the problem is that LOOP_SET_CAPACITY ioctl ends up reseting block
device block size to 4096 by calling bd_set_size(). I have to think how to
best fix this...

Thanks for your help with debugging this!

								Honza
Dmitry Vyukov Jan. 8, 2019, 11:49 a.m. UTC | #14
On Tue, Jan 8, 2019 at 12:24 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 08-01-19 19:04:06, Tetsuo Handa wrote:
> > On 2019/01/03 2:26, Jan Kara wrote:
> > > On Thu 03-01-19 01:07:25, Tetsuo Handa wrote:
> > >> On 2019/01/02 23:40, Jan Kara wrote:
> > >>> I had a look into this and the only good explanation for this I have is
> > >>> that sb->s_blocksize is different from (1 << sb->s_bdev->bd_inode->i_blkbits).
> > >>> If that would happen, we'd get exactly the behavior syzkaller observes
> > >>> because grow_buffers() would populate different page than
> > >>> __find_get_block() then looks up.
> > >>>
> > >>> However I don't see how that's possible since the filesystem has the block
> > >>> device open exclusively and blkdev_bszset() makes sure we also have
> > >>> exclusive access to the block device before changing the block device size.
> > >>> So changing block device block size after filesystem gets access to the
> > >>> device should be impossible.
> > >>>
> > >>> Anyway, could you perhaps add to your debug patch a dump of 'size' passed
> > >>> to __getblk_slow() and bdev->bd_inode->i_blkbits? That should tell us
> > >>> whether my theory is right or not. Thanks!
> > >>>
> >
> > Got two reports. 'size' is 512 while bdev->bd_inode->i_blkbits is 12.
> >
> > https://syzkaller.appspot.com/text?tag=CrashLog&x=1237c3ab400000
> >
> > [  385.723941][  T439] kworker/u4:3(439): getblk(): executed=9 bh_count=0 bh_state=0 bdev_super_blocksize=512 size=512 bdev_super_blocksize_bits=9 bdev_inode_blkbits=12
> > (...snipped...)
> > [  568.159544][  T439] kworker/u4:3(439): getblk(): executed=9 bh_count=0 bh_state=0 bdev_super_blocksize=512 size=512 bdev_super_blocksize_bits=9 bdev_inode_blkbits=12
>
> Right, so indeed the block size in the superblock and in the block device
> gets out of sync which explains why we endlessly loop in the buffer cache
> code. The superblock uses blocksize of 512 while the block device thinks
> the set block size is 4096.
>
> And after staring into the code for some time, I finally have a trivial
> reproducer:
>
> truncate -s 1G /tmp/image
> losetup /dev/loop0 /tmp/image
> mkfs.ext4 -b 1024 /dev/loop0
> mount -t ext4 /dev/loop0 /mnt
> losetup -c /dev/loop0
> l /mnt
> <hangs>
>
> And the problem is that LOOP_SET_CAPACITY ioctl ends up reseting block
> device block size to 4096 by calling bd_set_size(). I have to think how to
> best fix this...
>
> Thanks for your help with debugging this!

Wow! I am very excited.
We have 587 open "task hung" reports, I suspect this explains lots of them.
What would be some pattern that we can use to best-effort distinguish
most manifestations? Skimming through few reports I see "inode_lock",
"get_super", "blkdev_put" as common indicators. Anything else?
Jan Kara Jan. 9, 2019, 1:30 p.m. UTC | #15
On Tue 08-01-19 12:49:08, Dmitry Vyukov wrote:
> On Tue, Jan 8, 2019 at 12:24 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Tue 08-01-19 19:04:06, Tetsuo Handa wrote:
> > > On 2019/01/03 2:26, Jan Kara wrote:
> > > > On Thu 03-01-19 01:07:25, Tetsuo Handa wrote:
> > > >> On 2019/01/02 23:40, Jan Kara wrote:
> > > >>> I had a look into this and the only good explanation for this I have is
> > > >>> that sb->s_blocksize is different from (1 << sb->s_bdev->bd_inode->i_blkbits).
> > > >>> If that would happen, we'd get exactly the behavior syzkaller observes
> > > >>> because grow_buffers() would populate different page than
> > > >>> __find_get_block() then looks up.
> > > >>>
> > > >>> However I don't see how that's possible since the filesystem has the block
> > > >>> device open exclusively and blkdev_bszset() makes sure we also have
> > > >>> exclusive access to the block device before changing the block device size.
> > > >>> So changing block device block size after filesystem gets access to the
> > > >>> device should be impossible.
> > > >>>
> > > >>> Anyway, could you perhaps add to your debug patch a dump of 'size' passed
> > > >>> to __getblk_slow() and bdev->bd_inode->i_blkbits? That should tell us
> > > >>> whether my theory is right or not. Thanks!
> > > >>>
> > >
> > > Got two reports. 'size' is 512 while bdev->bd_inode->i_blkbits is 12.
> > >
> > > https://syzkaller.appspot.com/text?tag=CrashLog&x=1237c3ab400000
> > >
> > > [  385.723941][  T439] kworker/u4:3(439): getblk(): executed=9 bh_count=0 bh_state=0 bdev_super_blocksize=512 size=512 bdev_super_blocksize_bits=9 bdev_inode_blkbits=12
> > > (...snipped...)
> > > [  568.159544][  T439] kworker/u4:3(439): getblk(): executed=9 bh_count=0 bh_state=0 bdev_super_blocksize=512 size=512 bdev_super_blocksize_bits=9 bdev_inode_blkbits=12
> >
> > Right, so indeed the block size in the superblock and in the block device
> > gets out of sync which explains why we endlessly loop in the buffer cache
> > code. The superblock uses blocksize of 512 while the block device thinks
> > the set block size is 4096.
> >
> > And after staring into the code for some time, I finally have a trivial
> > reproducer:
> >
> > truncate -s 1G /tmp/image
> > losetup /dev/loop0 /tmp/image
> > mkfs.ext4 -b 1024 /dev/loop0
> > mount -t ext4 /dev/loop0 /mnt
> > losetup -c /dev/loop0
> > l /mnt
> > <hangs>
> >
> > And the problem is that LOOP_SET_CAPACITY ioctl ends up reseting block
> > device block size to 4096 by calling bd_set_size(). I have to think how to
> > best fix this...
> >
> > Thanks for your help with debugging this!
> 
> Wow! I am very excited.
> We have 587 open "task hung" reports, I suspect this explains lots of them.
> What would be some pattern that we can use to best-effort distinguish
> most manifestations? Skimming through few reports I see "inode_lock",
> "get_super", "blkdev_put" as common indicators. Anything else?

Well, there will be always looping task with __getblk_gfp() on its stack
(which should be visible in the stacktrace generated by the stall
detector). Then there can be lots of other processes getting blocked due to
locks and other resources held by this task...

								Honza
Dmitry Vyukov Jan. 14, 2019, 3:11 p.m. UTC | #16
On Wed, Jan 9, 2019 at 2:30 PM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 08-01-19 12:49:08, Dmitry Vyukov wrote:
> > On Tue, Jan 8, 2019 at 12:24 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Tue 08-01-19 19:04:06, Tetsuo Handa wrote:
> > > > On 2019/01/03 2:26, Jan Kara wrote:
> > > > > On Thu 03-01-19 01:07:25, Tetsuo Handa wrote:
> > > > >> On 2019/01/02 23:40, Jan Kara wrote:
> > > > >>> I had a look into this and the only good explanation for this I have is
> > > > >>> that sb->s_blocksize is different from (1 << sb->s_bdev->bd_inode->i_blkbits).
> > > > >>> If that would happen, we'd get exactly the behavior syzkaller observes
> > > > >>> because grow_buffers() would populate different page than
> > > > >>> __find_get_block() then looks up.
> > > > >>>
> > > > >>> However I don't see how that's possible since the filesystem has the block
> > > > >>> device open exclusively and blkdev_bszset() makes sure we also have
> > > > >>> exclusive access to the block device before changing the block device size.
> > > > >>> So changing block device block size after filesystem gets access to the
> > > > >>> device should be impossible.
> > > > >>>
> > > > >>> Anyway, could you perhaps add to your debug patch a dump of 'size' passed
> > > > >>> to __getblk_slow() and bdev->bd_inode->i_blkbits? That should tell us
> > > > >>> whether my theory is right or not. Thanks!
> > > > >>>
> > > >
> > > > Got two reports. 'size' is 512 while bdev->bd_inode->i_blkbits is 12.
> > > >
> > > > https://syzkaller.appspot.com/text?tag=CrashLog&x=1237c3ab400000
> > > >
> > > > [  385.723941][  T439] kworker/u4:3(439): getblk(): executed=9 bh_count=0 bh_state=0 bdev_super_blocksize=512 size=512 bdev_super_blocksize_bits=9 bdev_inode_blkbits=12
> > > > (...snipped...)
> > > > [  568.159544][  T439] kworker/u4:3(439): getblk(): executed=9 bh_count=0 bh_state=0 bdev_super_blocksize=512 size=512 bdev_super_blocksize_bits=9 bdev_inode_blkbits=12
> > >
> > > Right, so indeed the block size in the superblock and in the block device
> > > gets out of sync which explains why we endlessly loop in the buffer cache
> > > code. The superblock uses blocksize of 512 while the block device thinks
> > > the set block size is 4096.
> > >
> > > And after staring into the code for some time, I finally have a trivial
> > > reproducer:
> > >
> > > truncate -s 1G /tmp/image
> > > losetup /dev/loop0 /tmp/image
> > > mkfs.ext4 -b 1024 /dev/loop0
> > > mount -t ext4 /dev/loop0 /mnt
> > > losetup -c /dev/loop0
> > > l /mnt
> > > <hangs>
> > >
> > > And the problem is that LOOP_SET_CAPACITY ioctl ends up reseting block
> > > device block size to 4096 by calling bd_set_size(). I have to think how to
> > > best fix this...
> > >
> > > Thanks for your help with debugging this!
> >
> > Wow! I am very excited.
> > We have 587 open "task hung" reports, I suspect this explains lots of them.
> > What would be some pattern that we can use to best-effort distinguish
> > most manifestations? Skimming through few reports I see "inode_lock",
> > "get_super", "blkdev_put" as common indicators. Anything else?
>
> Well, there will be always looping task with __getblk_gfp() on its stack
> (which should be visible in the stacktrace generated by the stall
> detector). Then there can be lots of other processes getting blocked due to
> locks and other resources held by this task...


Once we have a fix, I plan to do a sweep over existing open "task
hung" reports and dup lots of them onto this one. Probably preferring
to over-sweep rather then to under-sweep because there are too many of
them and lots does not seem to be actionable otherwise.
Tetsuo, do you have comments before I start?
Dmitry Vyukov Jan. 14, 2019, 3:13 p.m. UTC | #17
On Mon, Jan 14, 2019 at 4:11 PM Dmitry Vyukov <dvyukov@google.com> wrote:
>
> On Wed, Jan 9, 2019 at 2:30 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Tue 08-01-19 12:49:08, Dmitry Vyukov wrote:
> > > On Tue, Jan 8, 2019 at 12:24 PM Jan Kara <jack@suse.cz> wrote:
> > > >
> > > > On Tue 08-01-19 19:04:06, Tetsuo Handa wrote:
> > > > > On 2019/01/03 2:26, Jan Kara wrote:
> > > > > > On Thu 03-01-19 01:07:25, Tetsuo Handa wrote:
> > > > > >> On 2019/01/02 23:40, Jan Kara wrote:
> > > > > >>> I had a look into this and the only good explanation for this I have is
> > > > > >>> that sb->s_blocksize is different from (1 << sb->s_bdev->bd_inode->i_blkbits).
> > > > > >>> If that would happen, we'd get exactly the behavior syzkaller observes
> > > > > >>> because grow_buffers() would populate different page than
> > > > > >>> __find_get_block() then looks up.
> > > > > >>>
> > > > > >>> However I don't see how that's possible since the filesystem has the block
> > > > > >>> device open exclusively and blkdev_bszset() makes sure we also have
> > > > > >>> exclusive access to the block device before changing the block device size.
> > > > > >>> So changing block device block size after filesystem gets access to the
> > > > > >>> device should be impossible.
> > > > > >>>
> > > > > >>> Anyway, could you perhaps add to your debug patch a dump of 'size' passed
> > > > > >>> to __getblk_slow() and bdev->bd_inode->i_blkbits? That should tell us
> > > > > >>> whether my theory is right or not. Thanks!
> > > > > >>>
> > > > >
> > > > > Got two reports. 'size' is 512 while bdev->bd_inode->i_blkbits is 12.
> > > > >
> > > > > https://syzkaller.appspot.com/text?tag=CrashLog&x=1237c3ab400000
> > > > >
> > > > > [  385.723941][  T439] kworker/u4:3(439): getblk(): executed=9 bh_count=0 bh_state=0 bdev_super_blocksize=512 size=512 bdev_super_blocksize_bits=9 bdev_inode_blkbits=12
> > > > > (...snipped...)
> > > > > [  568.159544][  T439] kworker/u4:3(439): getblk(): executed=9 bh_count=0 bh_state=0 bdev_super_blocksize=512 size=512 bdev_super_blocksize_bits=9 bdev_inode_blkbits=12
> > > >
> > > > Right, so indeed the block size in the superblock and in the block device
> > > > gets out of sync which explains why we endlessly loop in the buffer cache
> > > > code. The superblock uses blocksize of 512 while the block device thinks
> > > > the set block size is 4096.
> > > >
> > > > And after staring into the code for some time, I finally have a trivial
> > > > reproducer:
> > > >
> > > > truncate -s 1G /tmp/image
> > > > losetup /dev/loop0 /tmp/image
> > > > mkfs.ext4 -b 1024 /dev/loop0
> > > > mount -t ext4 /dev/loop0 /mnt
> > > > losetup -c /dev/loop0
> > > > l /mnt
> > > > <hangs>
> > > >
> > > > And the problem is that LOOP_SET_CAPACITY ioctl ends up reseting block
> > > > device block size to 4096 by calling bd_set_size(). I have to think how to
> > > > best fix this...
> > > >
> > > > Thanks for your help with debugging this!
> > >
> > > Wow! I am very excited.
> > > We have 587 open "task hung" reports, I suspect this explains lots of them.
> > > What would be some pattern that we can use to best-effort distinguish
> > > most manifestations? Skimming through few reports I see "inode_lock",
> > > "get_super", "blkdev_put" as common indicators. Anything else?
> >
> > Well, there will be always looping task with __getblk_gfp() on its stack
> > (which should be visible in the stacktrace generated by the stall
> > detector). Then there can be lots of other processes getting blocked due to
> > locks and other resources held by this task...
>
>
> Once we have a fix, I plan to do a sweep over existing open "task
> hung" reports and dup lots of them onto this one. Probably preferring
> to over-sweep rather then to under-sweep because there are too many of
> them and lots does not seem to be actionable otherwise.
> Tetsuo, do you have comments before I start?

Also, is it possible to add some kind of WARNING for this condition?
Taking into account how much effort it too to debug, looks like a
useful check. Or did I ask this already...
Jan Kara Jan. 15, 2019, 9:29 a.m. UTC | #18
On Mon 14-01-19 16:13:08, Dmitry Vyukov wrote:
> On Mon, Jan 14, 2019 at 4:11 PM Dmitry Vyukov <dvyukov@google.com> wrote:
> >
> > On Wed, Jan 9, 2019 at 2:30 PM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Tue 08-01-19 12:49:08, Dmitry Vyukov wrote:
> > > > On Tue, Jan 8, 2019 at 12:24 PM Jan Kara <jack@suse.cz> wrote:
> > > > >
> > > > > On Tue 08-01-19 19:04:06, Tetsuo Handa wrote:
> > > > > > On 2019/01/03 2:26, Jan Kara wrote:
> > > > > > > On Thu 03-01-19 01:07:25, Tetsuo Handa wrote:
> > > > > > >> On 2019/01/02 23:40, Jan Kara wrote:
> > > > > > >>> I had a look into this and the only good explanation for this I have is
> > > > > > >>> that sb->s_blocksize is different from (1 << sb->s_bdev->bd_inode->i_blkbits).
> > > > > > >>> If that would happen, we'd get exactly the behavior syzkaller observes
> > > > > > >>> because grow_buffers() would populate different page than
> > > > > > >>> __find_get_block() then looks up.
> > > > > > >>>
> > > > > > >>> However I don't see how that's possible since the filesystem has the block
> > > > > > >>> device open exclusively and blkdev_bszset() makes sure we also have
> > > > > > >>> exclusive access to the block device before changing the block device size.
> > > > > > >>> So changing block device block size after filesystem gets access to the
> > > > > > >>> device should be impossible.
> > > > > > >>>
> > > > > > >>> Anyway, could you perhaps add to your debug patch a dump of 'size' passed
> > > > > > >>> to __getblk_slow() and bdev->bd_inode->i_blkbits? That should tell us
> > > > > > >>> whether my theory is right or not. Thanks!
> > > > > > >>>
> > > > > >
> > > > > > Got two reports. 'size' is 512 while bdev->bd_inode->i_blkbits is 12.
> > > > > >
> > > > > > https://syzkaller.appspot.com/text?tag=CrashLog&x=1237c3ab400000
> > > > > >
> > > > > > [  385.723941][  T439] kworker/u4:3(439): getblk(): executed=9 bh_count=0 bh_state=0 bdev_super_blocksize=512 size=512 bdev_super_blocksize_bits=9 bdev_inode_blkbits=12
> > > > > > (...snipped...)
> > > > > > [  568.159544][  T439] kworker/u4:3(439): getblk(): executed=9 bh_count=0 bh_state=0 bdev_super_blocksize=512 size=512 bdev_super_blocksize_bits=9 bdev_inode_blkbits=12
> > > > >
> > > > > Right, so indeed the block size in the superblock and in the block device
> > > > > gets out of sync which explains why we endlessly loop in the buffer cache
> > > > > code. The superblock uses blocksize of 512 while the block device thinks
> > > > > the set block size is 4096.
> > > > >
> > > > > And after staring into the code for some time, I finally have a trivial
> > > > > reproducer:
> > > > >
> > > > > truncate -s 1G /tmp/image
> > > > > losetup /dev/loop0 /tmp/image
> > > > > mkfs.ext4 -b 1024 /dev/loop0
> > > > > mount -t ext4 /dev/loop0 /mnt
> > > > > losetup -c /dev/loop0
> > > > > l /mnt
> > > > > <hangs>
> > > > >
> > > > > And the problem is that LOOP_SET_CAPACITY ioctl ends up reseting block
> > > > > device block size to 4096 by calling bd_set_size(). I have to think how to
> > > > > best fix this...
> > > > >
> > > > > Thanks for your help with debugging this!
> > > >
> > > > Wow! I am very excited.
> > > > We have 587 open "task hung" reports, I suspect this explains lots of them.
> > > > What would be some pattern that we can use to best-effort distinguish
> > > > most manifestations? Skimming through few reports I see "inode_lock",
> > > > "get_super", "blkdev_put" as common indicators. Anything else?
> > >
> > > Well, there will be always looping task with __getblk_gfp() on its stack
> > > (which should be visible in the stacktrace generated by the stall
> > > detector). Then there can be lots of other processes getting blocked due to
> > > locks and other resources held by this task...
> >
> >
> > Once we have a fix, I plan to do a sweep over existing open "task

I have submitted the fix yesterday to linux-block ([PATCH 0/2] blkdev: Fix
livelock when loop device updates capacity).

> > hung" reports and dup lots of them onto this one. Probably preferring
> > to over-sweep rather then to under-sweep because there are too many of
> > them and lots does not seem to be actionable otherwise.
> > Tetsuo, do you have comments before I start?
> 
> Also, is it possible to add some kind of WARNING for this condition?
> Taking into account how much effort it too to debug, looks like a
> useful check. Or did I ask this already...

There are two things we could do:

1) Warn if we loop in __getblk_slow() more than couple of times (looping
once is normal, looping twice can happen easily due to races).

2) Warn & bail if block size passed to __getblk_slow() does not match the
block device block size.

I'll write the patch.

								Honza
diff mbox

Patch

diff --git a/fs/buffer.c b/fs/buffer.c
index be31e28..ebf78ab 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -955,10 +955,20 @@  static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size)
 			end_block = init_page_buffers(page, bdev,
 						(sector_t)index << sizebits,
 						size);
+#ifdef CONFIG_DEBUG_AID_FOR_SYZBOT
+			current->getblk_executed |= 0x01;
+#endif
 			goto done;
 		}
-		if (!try_to_free_buffers(page))
+		if (!try_to_free_buffers(page)) {
+#ifdef CONFIG_DEBUG_AID_FOR_SYZBOT
+			current->getblk_executed |= 0x02;
+#endif
 			goto failed;
+		}
+#ifdef CONFIG_DEBUG_AID_FOR_SYZBOT
+		current->getblk_executed |= 0x04;
+#endif
 	}
 
 	/*
@@ -978,6 +988,9 @@  static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size)
 	spin_unlock(&inode->i_mapping->private_lock);
 done:
 	ret = (block < end_block) ? 1 : -ENXIO;
+#ifdef CONFIG_DEBUG_AID_FOR_SYZBOT
+	current->getblk_executed |= 0x08;
+#endif
 failed:
 	unlock_page(page);
 	put_page(page);
@@ -1033,6 +1046,12 @@  static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size)
 		return NULL;
 	}
 
+#ifdef CONFIG_DEBUG_AID_FOR_SYZBOT
+	current->getblk_stamp = jiffies;
+	current->getblk_executed = 0;
+	current->getblk_bh_count = 0;
+	current->getblk_bh_state = 0;
+#endif
 	for (;;) {
 		struct buffer_head *bh;
 		int ret;
@@ -1044,6 +1063,18 @@  static sector_t blkdev_max_block(struct block_device *bdev, unsigned int size)
 		ret = grow_buffers(bdev, block, size, gfp);
 		if (ret < 0)
 			return NULL;
+
+#ifdef CONFIG_DEBUG_AID_FOR_SYZBOT
+		if (!time_after(jiffies, current->getblk_stamp + 3 * HZ))
+			continue;
+		printk(KERN_ERR "%s(%u): getblk(): executed=%x bh_count=%d bh_state=%lx\n",
+		       current->comm, current->pid, current->getblk_executed,
+		       current->getblk_bh_count, current->getblk_bh_state);
+		current->getblk_executed = 0;
+		current->getblk_bh_count = 0;
+		current->getblk_bh_state = 0;
+		current->getblk_stamp = jiffies;
+#endif
 	}
 }
 
@@ -3216,6 +3247,11 @@  int sync_dirty_buffer(struct buffer_head *bh)
  */
 static inline int buffer_busy(struct buffer_head *bh)
 {
+#ifdef CONFIG_DEBUG_AID_FOR_SYZBOT
+	current->getblk_executed |= 0x80;
+	current->getblk_bh_count = atomic_read(&bh->b_count);
+	current->getblk_bh_state = bh->b_state;
+#endif
 	return atomic_read(&bh->b_count) |
 		(bh->b_state & ((1 << BH_Dirty) | (1 << BH_Lock)));
 }
@@ -3254,11 +3290,18 @@  int try_to_free_buffers(struct page *page)
 	int ret = 0;
 
 	BUG_ON(!PageLocked(page));
-	if (PageWriteback(page))
+	if (PageWriteback(page)) {
+#ifdef CONFIG_DEBUG_AID_FOR_SYZBOT
+		current->getblk_executed |= 0x10;
+#endif
 		return 0;
+	}
 
 	if (mapping == NULL) {		/* can this still happen? */
 		ret = drop_buffers(page, &buffers_to_free);
+#ifdef CONFIG_DEBUG_AID_FOR_SYZBOT
+		current->getblk_executed |= 0x20;
+#endif
 		goto out;
 	}
 
@@ -3282,6 +3325,9 @@  int try_to_free_buffers(struct page *page)
 	if (ret)
 		cancel_dirty_page(page);
 	spin_unlock(&mapping->private_lock);
+#ifdef CONFIG_DEBUG_AID_FOR_SYZBOT
+	current->getblk_executed |= 0x40;
+#endif
 out:
 	if (buffers_to_free) {
 		struct buffer_head *bh = buffers_to_free;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c4e0aba..95b143e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1207,6 +1207,13 @@  struct task_struct {
 	unsigned long			prev_lowest_stack;
 #endif
 
+#ifdef CONFIG_DEBUG_AID_FOR_SYZBOT
+	unsigned long			getblk_stamp;
+	unsigned int			getblk_executed;
+	unsigned int			getblk_bh_count;
+	unsigned long			getblk_bh_state;
+#endif
+
 	/*
 	 * New fields for task_struct should be added above here, so that
 	 * they are included in the randomized portion of task_struct.
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index c731ff9..0747ce7 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2058,3 +2058,9 @@  config IO_STRICT_DEVMEM
 	  if the driver using a given range cannot be disabled.
 
 	  If in doubt, say Y.
+
+config DEBUG_AID_FOR_SYZBOT
+       bool "Additional debug code for syzbot"
+       default n
+       help
+         This option is intended for testing by syzbot.