Message ID | 03b2f64b3acc918b67c2fef6d4bfce70ab12ce3b.1632765815.git.josef@toxicpanda.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Series | Miscellaneous error handling patches | expand |
On Mon, Sep 27, 2021 at 7:07 PM Josef Bacik <josef@toxicpanda.com> wrote: > > Error injection stressing uncovered a busy loop in our data reclaim > loop. There are two cases here, one where we loop creating block groups > until space_info->full is set, or in the main loop we will skip erroring > out any tickets if space_info->full == 0. Unfortunately if we aborted > the transaction then we will never allocate chunks or reclaim any space > and thus never get ->full, and you'll see stack traces like this > > watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/u4:4:139] > CPU: 0 PID: 139 Comm: kworker/u4:4 Tainted: G W 5.13.0-rc1+ #328 > Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 > Workqueue: events_unbound btrfs_async_reclaim_data_space > RIP: 0010:btrfs_join_transaction+0x12/0x20 > RSP: 0018:ffffb2b780b77de0 EFLAGS: 00000246 > RAX: ffffb2b781863d58 RBX: 0000000000000000 RCX: 0000000000000000 > RDX: 0000000000000801 RSI: ffff987952b57400 RDI: ffff987940aa3000 > RBP: ffff987954d55000 R08: 0000000000000001 R09: ffff98795539e8f0 > R10: 000000000000000f R11: 000000000000000f R12: ffffffffffffffff > R13: ffff987952b574c8 R14: ffff987952b57400 R15: 0000000000000008 > FS: 0000000000000000(0000) GS:ffff9879bbc00000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007f0703da4000 CR3: 0000000113398004 CR4: 0000000000370ef0 > Call Trace: > flush_space+0x4a8/0x660 > btrfs_async_reclaim_data_space+0x55/0x130 > process_one_work+0x1e9/0x380 > worker_thread+0x53/0x3e0 > ? process_one_work+0x380/0x380 > kthread+0x118/0x140 > ? __kthread_bind_mask+0x60/0x60 > ret_from_fork+0x1f/0x30 > > Fix this by checking to see if we have a btrfs fs error in either of the > reclaim loops, and if so fail the tickets and bail. In addition to > this, fix maybe_fail_all_tickets() to not try to grant tickets if we've > aborted, simply fail everything. > > Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Looks good, thanks. > --- > fs/btrfs/space-info.c | 27 +++++++++++++++++++++++---- > 1 file changed, 23 insertions(+), 4 deletions(-) > > diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c > index aa5be0b24987..63423f9729c5 100644 > --- a/fs/btrfs/space-info.c > +++ b/fs/btrfs/space-info.c > @@ -885,6 +885,7 @@ static bool maybe_fail_all_tickets(struct btrfs_fs_info *fs_info, > { > struct reserve_ticket *ticket; > u64 tickets_id = space_info->tickets_id; > + const bool aborted = btrfs_has_fs_error(fs_info); > > trace_btrfs_fail_all_tickets(fs_info, space_info); > > @@ -898,16 +899,19 @@ static bool maybe_fail_all_tickets(struct btrfs_fs_info *fs_info, > ticket = list_first_entry(&space_info->tickets, > struct reserve_ticket, list); > > - if (ticket->steal && > + if (!aborted && ticket->steal && > steal_from_global_rsv(fs_info, space_info, ticket)) > return true; > > - if (btrfs_test_opt(fs_info, ENOSPC_DEBUG)) > + if (!aborted && btrfs_test_opt(fs_info, ENOSPC_DEBUG)) > btrfs_info(fs_info, "failing ticket with %llu bytes", > ticket->bytes); > > remove_ticket(space_info, ticket); > - ticket->error = -ENOSPC; > + if (aborted) > + ticket->error = -EIO; > + else > + ticket->error = -ENOSPC; > wake_up(&ticket->wait); > > /* > @@ -916,7 +920,8 @@ static bool maybe_fail_all_tickets(struct btrfs_fs_info *fs_info, > * here to see if we can make progress with the next ticket in > * the list. > */ > - btrfs_try_granting_tickets(fs_info, space_info); > + if (!aborted) > + btrfs_try_granting_tickets(fs_info, space_info); > } > return (tickets_id != space_info->tickets_id); > } > @@ -1172,6 +1177,10 @@ static void btrfs_async_reclaim_data_space(struct work_struct *work) > spin_unlock(&space_info->lock); > return; > } > + > + /* Something happened, fail everything and bail. */ > + if (btrfs_has_fs_error(fs_info)) > + goto aborted_fs; > last_tickets_id = space_info->tickets_id; > spin_unlock(&space_info->lock); > } > @@ -1202,9 +1211,19 @@ static void btrfs_async_reclaim_data_space(struct work_struct *work) > } else { > flush_state = 0; > } > + > + /* Something happened, fail everything and bail. */ > + if (btrfs_has_fs_error(fs_info)) > + goto aborted_fs; > + > } > spin_unlock(&space_info->lock); > } > + return; > +aborted_fs: > + maybe_fail_all_tickets(fs_info, space_info); > + space_info->flush = 0; > + spin_unlock(&space_info->lock); > } > > void btrfs_init_async_reclaim_work(struct btrfs_fs_info *fs_info) > -- > 2.26.3 >
Hi Josef, I love your patch! Yet something to improve: [auto build test ERROR on v5.15-rc3] [cannot apply to kdave/for-next next-20210922] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch] url: https://github.com/0day-ci/linux/commits/Josef-Bacik/Miscellaneous-error-handling-patches/20210929-185151 base: 5816b3e6577eaa676ceb00a848f0fd65fe2adc29 config: nios2-allyesconfig (attached as .config) compiler: nios2-linux-gcc (GCC) 11.2.0 reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # https://github.com/0day-ci/linux/commit/4d626ae95cb373b954751bcdadacf6b0f92f3a6c git remote add linux-review https://github.com/0day-ci/linux git fetch --no-tags linux-review Josef-Bacik/Miscellaneous-error-handling-patches/20210929-185151 git checkout 4d626ae95cb373b954751bcdadacf6b0f92f3a6c # save the attached .config to linux build tree COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-11.2.0 make.cross ARCH=nios2 If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot <lkp@intel.com> All errors (new ones prefixed by >>): fs/btrfs/space-info.c: In function 'maybe_fail_all_tickets': >> fs/btrfs/space-info.c:888:30: error: implicit declaration of function 'btrfs_has_fs_error'; did you mean 'btrfs_handle_fs_error'? [-Werror=implicit-function-declaration] 888 | const bool aborted = btrfs_has_fs_error(fs_info); | ^~~~~~~~~~~~~~~~~~ | btrfs_handle_fs_error cc1: all warnings being treated as errors vim +888 fs/btrfs/space-info.c 867 868 /* 869 * maybe_fail_all_tickets - we've exhausted our flushing, start failing tickets 870 * @fs_info - fs_info for this fs 871 * @space_info - the space info we were flushing 872 * 873 * We call this when we've exhausted our flushing ability and haven't made 874 * progress in satisfying tickets. The reservation code handles tickets in 875 * order, so if there is a large ticket first and then smaller ones we could 876 * very well satisfy the smaller tickets. This will attempt to wake up any 877 * tickets in the list to catch this case. 878 * 879 * This function returns true if it was able to make progress by clearing out 880 * other tickets, or if it stumbles across a ticket that was smaller than the 881 * first ticket. 882 */ 883 static bool maybe_fail_all_tickets(struct btrfs_fs_info *fs_info, 884 struct btrfs_space_info *space_info) 885 { 886 struct reserve_ticket *ticket; 887 u64 tickets_id = space_info->tickets_id; > 888 const bool aborted = btrfs_has_fs_error(fs_info); 889 890 trace_btrfs_fail_all_tickets(fs_info, space_info); 891 892 if (btrfs_test_opt(fs_info, ENOSPC_DEBUG)) { 893 btrfs_info(fs_info, "cannot satisfy tickets, dumping space info"); 894 __btrfs_dump_space_info(fs_info, space_info); 895 } 896 897 while (!list_empty(&space_info->tickets) && 898 tickets_id == space_info->tickets_id) { 899 ticket = list_first_entry(&space_info->tickets, 900 struct reserve_ticket, list); 901 902 if (!aborted && ticket->steal && 903 steal_from_global_rsv(fs_info, space_info, ticket)) 904 return true; 905 906 if (!aborted && btrfs_test_opt(fs_info, ENOSPC_DEBUG)) 907 btrfs_info(fs_info, "failing ticket with %llu bytes", 908 ticket->bytes); 909 910 remove_ticket(space_info, ticket); 911 if (aborted) 912 ticket->error = -EIO; 913 else 914 ticket->error = -ENOSPC; 915 wake_up(&ticket->wait); 916 917 /* 918 * We're just throwing tickets away, so more flushing may not 919 * trip over btrfs_try_granting_tickets, so we need to call it 920 * here to see if we can make progress with the next ticket in 921 * the list. 922 */ 923 if (!aborted) 924 btrfs_try_granting_tickets(fs_info, space_info); 925 } 926 return (tickets_id != space_info->tickets_id); 927 } 928 --- 0-DAY CI Kernel Test Service, Intel Corporation https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org
diff --git a/fs/btrfs/space-info.c b/fs/btrfs/space-info.c index aa5be0b24987..63423f9729c5 100644 --- a/fs/btrfs/space-info.c +++ b/fs/btrfs/space-info.c @@ -885,6 +885,7 @@ static bool maybe_fail_all_tickets(struct btrfs_fs_info *fs_info, { struct reserve_ticket *ticket; u64 tickets_id = space_info->tickets_id; + const bool aborted = btrfs_has_fs_error(fs_info); trace_btrfs_fail_all_tickets(fs_info, space_info); @@ -898,16 +899,19 @@ static bool maybe_fail_all_tickets(struct btrfs_fs_info *fs_info, ticket = list_first_entry(&space_info->tickets, struct reserve_ticket, list); - if (ticket->steal && + if (!aborted && ticket->steal && steal_from_global_rsv(fs_info, space_info, ticket)) return true; - if (btrfs_test_opt(fs_info, ENOSPC_DEBUG)) + if (!aborted && btrfs_test_opt(fs_info, ENOSPC_DEBUG)) btrfs_info(fs_info, "failing ticket with %llu bytes", ticket->bytes); remove_ticket(space_info, ticket); - ticket->error = -ENOSPC; + if (aborted) + ticket->error = -EIO; + else + ticket->error = -ENOSPC; wake_up(&ticket->wait); /* @@ -916,7 +920,8 @@ static bool maybe_fail_all_tickets(struct btrfs_fs_info *fs_info, * here to see if we can make progress with the next ticket in * the list. */ - btrfs_try_granting_tickets(fs_info, space_info); + if (!aborted) + btrfs_try_granting_tickets(fs_info, space_info); } return (tickets_id != space_info->tickets_id); } @@ -1172,6 +1177,10 @@ static void btrfs_async_reclaim_data_space(struct work_struct *work) spin_unlock(&space_info->lock); return; } + + /* Something happened, fail everything and bail. */ + if (btrfs_has_fs_error(fs_info)) + goto aborted_fs; last_tickets_id = space_info->tickets_id; spin_unlock(&space_info->lock); } @@ -1202,9 +1211,19 @@ static void btrfs_async_reclaim_data_space(struct work_struct *work) } else { flush_state = 0; } + + /* Something happened, fail everything and bail. */ + if (btrfs_has_fs_error(fs_info)) + goto aborted_fs; + } spin_unlock(&space_info->lock); } + return; +aborted_fs: + maybe_fail_all_tickets(fs_info, space_info); + space_info->flush = 0; + spin_unlock(&space_info->lock); } void btrfs_init_async_reclaim_work(struct btrfs_fs_info *fs_info)
Error injection stressing uncovered a busy loop in our data reclaim loop. There are two cases here, one where we loop creating block groups until space_info->full is set, or in the main loop we will skip erroring out any tickets if space_info->full == 0. Unfortunately if we aborted the transaction then we will never allocate chunks or reclaim any space and thus never get ->full, and you'll see stack traces like this watchdog: BUG: soft lockup - CPU#0 stuck for 26s! [kworker/u4:4:139] CPU: 0 PID: 139 Comm: kworker/u4:4 Tainted: G W 5.13.0-rc1+ #328 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 Workqueue: events_unbound btrfs_async_reclaim_data_space RIP: 0010:btrfs_join_transaction+0x12/0x20 RSP: 0018:ffffb2b780b77de0 EFLAGS: 00000246 RAX: ffffb2b781863d58 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000801 RSI: ffff987952b57400 RDI: ffff987940aa3000 RBP: ffff987954d55000 R08: 0000000000000001 R09: ffff98795539e8f0 R10: 000000000000000f R11: 000000000000000f R12: ffffffffffffffff R13: ffff987952b574c8 R14: ffff987952b57400 R15: 0000000000000008 FS: 0000000000000000(0000) GS:ffff9879bbc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f0703da4000 CR3: 0000000113398004 CR4: 0000000000370ef0 Call Trace: flush_space+0x4a8/0x660 btrfs_async_reclaim_data_space+0x55/0x130 process_one_work+0x1e9/0x380 worker_thread+0x53/0x3e0 ? process_one_work+0x380/0x380 kthread+0x118/0x140 ? __kthread_bind_mask+0x60/0x60 ret_from_fork+0x1f/0x30 Fix this by checking to see if we have a btrfs fs error in either of the reclaim loops, and if so fail the tickets and bail. In addition to this, fix maybe_fail_all_tickets() to not try to grant tickets if we've aborted, simply fail everything. Signed-off-by: Josef Bacik <josef@toxicpanda.com> --- fs/btrfs/space-info.c | 27 +++++++++++++++++++++++---- 1 file changed, 23 insertions(+), 4 deletions(-)