Message ID | alpine.LRH.2.02.1908080540240.15519@file01.intranet.prod.int.rdu2.redhat.com (mailing list archive)
State      | New, archived
Series     | direct-io: use GFP_NOIO to avoid deadlock
On Thu, Aug 08, 2019 at 05:50:10AM -0400, Mikulas Patocka wrote:
> A deadlock with this stacktrace was observed.
>
> The obvious problem here is that in the call chain
> xfs_vm_direct_IO->__blockdev_direct_IO->do_blockdev_direct_IO->kmem_cache_alloc
> we do a GFP_KERNEL allocation while we are in a filesystem driver and in a
> block device driver.

But that's not the problem.  The problem is the loop driver calls into the
filesystem without calling memalloc_noio_save() / memalloc_noio_restore().
There are dozens of places in XFS which use GFP_KERNEL allocations and
all can trigger this same problem if called from the loop driver.
Hi Mikulas,

This seems not to be an issue on mainline; the mutex in
dm_bufio_shrink_count() has been removed from mainline.

Thanks,

Junxiao.

On 8/8/19 2:50 AM, Mikulas Patocka wrote:
> A deadlock with this stacktrace was observed.
>
> The obvious problem here is that in the call chain
> xfs_vm_direct_IO->__blockdev_direct_IO->do_blockdev_direct_IO->kmem_cache_alloc
> we do a GFP_KERNEL allocation while we are in a filesystem driver and in a
> block device driver.
>
> This patch changes the direct-io code to use GFP_NOIO.
>
> PID: 474    TASK: ffff8813e11f4600  CPU: 10  COMMAND: "kswapd0"
>  #0 [ffff8813dedfb938] __schedule at ffffffff8173f405
>  #1 [ffff8813dedfb990] schedule at ffffffff8173fa27
>  #2 [ffff8813dedfb9b0] schedule_timeout at ffffffff81742fec
>  #3 [ffff8813dedfba60] io_schedule_timeout at ffffffff8173f186
>  #4 [ffff8813dedfbaa0] bit_wait_io at ffffffff8174034f
>  #5 [ffff8813dedfbac0] __wait_on_bit at ffffffff8173fec8
>  #6 [ffff8813dedfbb10] out_of_line_wait_on_bit at ffffffff8173ff81
>  #7 [ffff8813dedfbb90] __make_buffer_clean at ffffffffa038736f [dm_bufio]
>  #8 [ffff8813dedfbbb0] __try_evict_buffer at ffffffffa0387bb8 [dm_bufio]
>  #9 [ffff8813dedfbbd0] dm_bufio_shrink_scan at ffffffffa0387cc3 [dm_bufio]
> #10 [ffff8813dedfbc40] shrink_slab at ffffffff811a87ce
> #11 [ffff8813dedfbd30] shrink_zone at ffffffff811ad778
> #12 [ffff8813dedfbdc0] kswapd at ffffffff811ae92f
> #13 [ffff8813dedfbec0] kthread at ffffffff810a8428
> #14 [ffff8813dedfbf50] ret_from_fork at ffffffff81745242
>
> PID: 14127  TASK: ffff881455749c00  CPU: 11  COMMAND: "loop1"
>  #0 [ffff88272f5af228] __schedule at ffffffff8173f405
>  #1 [ffff88272f5af280] schedule at ffffffff8173fa27
>  #2 [ffff88272f5af2a0] schedule_preempt_disabled at ffffffff8173fd5e
>  #3 [ffff88272f5af2b0] __mutex_lock_slowpath at ffffffff81741fb5
>  #4 [ffff88272f5af330] mutex_lock at ffffffff81742133
>  #5 [ffff88272f5af350] dm_bufio_shrink_count at ffffffffa03865f9 [dm_bufio]
>  #6 [ffff88272f5af380] shrink_slab at ffffffff811a86bd
>  #7 [ffff88272f5af470] shrink_zone at ffffffff811ad778
>  #8 [ffff88272f5af500] do_try_to_free_pages at ffffffff811adb34
>  #9 [ffff88272f5af590] try_to_free_pages at ffffffff811adef8
> #10 [ffff88272f5af610] __alloc_pages_nodemask at ffffffff811a09c3
> #11 [ffff88272f5af710] alloc_pages_current at ffffffff811e8b71
> #12 [ffff88272f5af760] new_slab at ffffffff811f4523
> #13 [ffff88272f5af7b0] __slab_alloc at ffffffff8173a1b5
> #14 [ffff88272f5af880] kmem_cache_alloc at ffffffff811f484b
> #15 [ffff88272f5af8d0] do_blockdev_direct_IO at ffffffff812535b3
> #16 [ffff88272f5afb00] __blockdev_direct_IO at ffffffff81255dc3
> #17 [ffff88272f5afb30] xfs_vm_direct_IO at ffffffffa01fe3fc [xfs]
> #18 [ffff88272f5afb90] generic_file_read_iter at ffffffff81198994
> #19 [ffff88272f5afc50] __dta_xfs_file_read_iter_2398 at ffffffffa020c970 [xfs]
> #20 [ffff88272f5afcc0] lo_rw_aio at ffffffffa0377042 [loop]
> #21 [ffff88272f5afd70] loop_queue_work at ffffffffa0377c3b [loop]
> #22 [ffff88272f5afe60] kthread_worker_fn at ffffffff810a8a0c
> #23 [ffff88272f5afec0] kthread at ffffffff810a8428
> #24 [ffff88272f5aff50] ret_from_fork at ffffffff81745242
>
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Cc: stable@vger.kernel.org
>
> ---
>  fs/direct-io.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> Index: linux-2.6/fs/direct-io.c
> ===================================================================
> --- linux-2.6.orig/fs/direct-io.c	2019-08-02 08:51:56.000000000 +0200
> +++ linux-2.6/fs/direct-io.c	2019-08-08 11:22:18.000000000 +0200
> @@ -436,7 +436,7 @@ dio_bio_alloc(struct dio *dio, struct di
>  	 * bio_alloc() is guaranteed to return a bio when allowed to sleep and
>  	 * we request a valid number of vectors.
>  	 */
> -	bio = bio_alloc(GFP_KERNEL, nr_vecs);
> +	bio = bio_alloc(GFP_NOIO, nr_vecs);
>
>  	bio_set_dev(bio, bdev);
>  	bio->bi_iter.bi_sector = first_sector;
> @@ -1197,7 +1197,7 @@ do_blockdev_direct_IO(struct kiocb *iocb
>  	if (iov_iter_rw(iter) == READ && !count)
>  		return 0;
>
> -	dio = kmem_cache_alloc(dio_cache, GFP_KERNEL);
> +	dio = kmem_cache_alloc(dio_cache, GFP_NOIO);
>  	retval = -ENOMEM;
>  	if (!dio)
>  		goto out;
On Thu, 8 Aug 2019, Matthew Wilcox wrote:

> On Thu, Aug 08, 2019 at 05:50:10AM -0400, Mikulas Patocka wrote:
> > A deadlock with this stacktrace was observed.
> >
> > The obvious problem here is that in the call chain
> > xfs_vm_direct_IO->__blockdev_direct_IO->do_blockdev_direct_IO->kmem_cache_alloc
> > we do a GFP_KERNEL allocation while we are in a filesystem driver and in a
> > block device driver.
>
> But that's not the problem.  The problem is the loop driver calls into the
> filesystem without calling memalloc_noio_save() / memalloc_noio_restore().
> There are dozens of places in XFS which use GFP_KERNEL allocations and
> all can trigger this same problem if called from the loop driver.

OK. I'll send a new patch that sets PF_MEMALLOC_NOIO in the loop driver.

Mikulas
On Thu, Aug 08, 2019 at 05:50:10AM -0400, Mikulas Patocka wrote:
> A deadlock with this stacktrace was observed.
>
> The obvious problem here is that in the call chain
> xfs_vm_direct_IO->__blockdev_direct_IO->do_blockdev_direct_IO->kmem_cache_alloc
> we do a GFP_KERNEL allocation while we are in a filesystem driver and in a
> block device driver.
>
> This patch changes the direct-io code to use GFP_NOIO.
>
> PID: 474    TASK: ffff8813e11f4600  CPU: 10  COMMAND: "kswapd0"
>  #0 [ffff8813dedfb938] __schedule at ffffffff8173f405
>  #1 [ffff8813dedfb990] schedule at ffffffff8173fa27
>  #2 [ffff8813dedfb9b0] schedule_timeout at ffffffff81742fec
>  #3 [ffff8813dedfba60] io_schedule_timeout at ffffffff8173f186
>  #4 [ffff8813dedfbaa0] bit_wait_io at ffffffff8174034f
>  #5 [ffff8813dedfbac0] __wait_on_bit at ffffffff8173fec8
>  #6 [ffff8813dedfbb10] out_of_line_wait_on_bit at ffffffff8173ff81
>  #7 [ffff8813dedfbb90] __make_buffer_clean at ffffffffa038736f [dm_bufio]
>  #8 [ffff8813dedfbbb0] __try_evict_buffer at ffffffffa0387bb8 [dm_bufio]
>  #9 [ffff8813dedfbbd0] dm_bufio_shrink_scan at ffffffffa0387cc3 [dm_bufio]
> #10 [ffff8813dedfbc40] shrink_slab at ffffffff811a87ce
> #11 [ffff8813dedfbd30] shrink_zone at ffffffff811ad778
> #12 [ffff8813dedfbdc0] kswapd at ffffffff811ae92f
> #13 [ffff8813dedfbec0] kthread at ffffffff810a8428
> #14 [ffff8813dedfbf50] ret_from_fork at ffffffff81745242
>
> PID: 14127  TASK: ffff881455749c00  CPU: 11  COMMAND: "loop1"
>  #0 [ffff88272f5af228] __schedule at ffffffff8173f405
>  #1 [ffff88272f5af280] schedule at ffffffff8173fa27
>  #2 [ffff88272f5af2a0] schedule_preempt_disabled at ffffffff8173fd5e
>  #3 [ffff88272f5af2b0] __mutex_lock_slowpath at ffffffff81741fb5
>  #4 [ffff88272f5af330] mutex_lock at ffffffff81742133
>  #5 [ffff88272f5af350] dm_bufio_shrink_count at ffffffffa03865f9 [dm_bufio]
>  #6 [ffff88272f5af380] shrink_slab at ffffffff811a86bd
>  #7 [ffff88272f5af470] shrink_zone at ffffffff811ad778
>  #8 [ffff88272f5af500] do_try_to_free_pages at ffffffff811adb34
>  #9 [ffff88272f5af590] try_to_free_pages at ffffffff811adef8
> #10 [ffff88272f5af610] __alloc_pages_nodemask at ffffffff811a09c3
> #11 [ffff88272f5af710] alloc_pages_current at ffffffff811e8b71
> #12 [ffff88272f5af760] new_slab at ffffffff811f4523
> #13 [ffff88272f5af7b0] __slab_alloc at ffffffff8173a1b5
> #14 [ffff88272f5af880] kmem_cache_alloc at ffffffff811f484b
> #15 [ffff88272f5af8d0] do_blockdev_direct_IO at ffffffff812535b3
> #16 [ffff88272f5afb00] __blockdev_direct_IO at ffffffff81255dc3
> #17 [ffff88272f5afb30] xfs_vm_direct_IO at ffffffffa01fe3fc [xfs]
> #18 [ffff88272f5afb90] generic_file_read_iter at ffffffff81198994

Um, what kernel is this? XFS stopped using __blockdev_direct_IO some
time around 4.8 or 4.9, IIRC. Perhaps it would be best to reproduce
problems on a TOT kernel first?

And, FWIW, there's an argument to be made here that the underlying
bug is dm_bufio_shrink_scan() blocking kswapd by waiting on IO
completions while holding a mutex that other IO-level reclaim
contexts require to make progress.

Cheers,

Dave.
On Fri, 9 Aug 2019, Dave Chinner wrote:

> And, FWIW, there's an argument to be made here that the underlying
> bug is dm_bufio_shrink_scan() blocking kswapd by waiting on IO
> completions while holding a mutex that other IO-level reclaim
> contexts require to make progress.
>
> Cheers,
>
> Dave.

The IO-level reclaim contexts should use GFP_NOIO. If the dm-bufio
shrinker is called with GFP_NOIO, it cannot be blocked by kswapd, because:

* it uses trylock to acquire the mutex

* it doesn't attempt to free buffers that are dirty or under I/O (see
  __try_evict_buffer)

I think that this logic is sufficient to prevent the deadlock.

Mikulas
On Fri, Aug 09, 2019 at 07:30:00AM -0400, Mikulas Patocka wrote:
> On Fri, 9 Aug 2019, Dave Chinner wrote:
>
> > And, FWIW, there's an argument to be made here that the underlying
> > bug is dm_bufio_shrink_scan() blocking kswapd by waiting on IO
> > completions while holding a mutex that other IO-level reclaim
> > contexts require to make progress.
> >
> > Cheers,
> >
> > Dave.
>
> The IO-level reclaim contexts should use GFP_NOIO. If the dm-bufio
> shrinker is called with GFP_NOIO, it cannot be blocked by kswapd, because:

No, you misunderstand. I'm talking about blocking kswapd being
wrong. i.e. Blocking kswapd in shrinkers causes problems because the
memory reclaim code does not expect kswapd to be arbitrarily delayed
by waiting on IO. We've had this problem with the XFS inode cache
shrinker for years, and there are many reports of extremely long
reclaim latencies for both direct and kswapd reclaim that result from
kswapd not making progress while waiting in shrinkers for IO to
complete.

The work I'm currently doing to fix this XFS problem can be found here:

https://lore.kernel.org/linux-fsdevel/20190801021752.4986-1-david@fromorbit.com/

i.e. the point I'm making is that waiting for IO in kswapd reclaim
context is considered harmful - kswapd context shrinker reclaim should
be as non-blocking as possible, and any back-off to wait for IO to
complete should be done by the high level reclaim core once it's
completed an entire reclaim scan cycle of everything....

What follows from that, and is pertinent in this situation, is that
if you don't block kswapd, then other reclaim contexts are not going
to get stuck waiting for it regardless of the reclaim context they
use.

Cheers,

Dave.
On Sat, 10 Aug 2019, Dave Chinner wrote:

> No, you misunderstand. I'm talking about blocking kswapd being
> wrong. i.e. Blocking kswapd in shrinkers causes problems because the
> memory reclaim code does not expect kswapd to be arbitrarily delayed
> by waiting on IO. We've had this problem with the XFS inode cache
> shrinker for years, and there are many reports of extremely long
> reclaim latencies for both direct and kswapd reclaim that result from
> kswapd not making progress while waiting in shrinkers for IO to
> complete.
>
> The work I'm currently doing to fix this XFS problem can be found here:
>
> https://lore.kernel.org/linux-fsdevel/20190801021752.4986-1-david@fromorbit.com/
>
> i.e. the point I'm making is that waiting for IO in kswapd reclaim
> context is considered harmful - kswapd context shrinker reclaim should
> be as non-blocking as possible, and any back-off to wait for IO to
> complete should be done by the high level reclaim core once it's
> completed an entire reclaim scan cycle of everything....
>
> What follows from that, and is pertinent in this situation, is that
> if you don't block kswapd, then other reclaim contexts are not going
> to get stuck waiting for it regardless of the reclaim context they
> use.
>
> Cheers,
>
> Dave.

So, what do you think the dm-bufio shrinker should do?

Currently it tries to free buffers on the clean list and if there are not
enough buffers on the clean list, it goes into the dirty list - it writes
the buffers back and then frees them.

What should it do? Should it just start writeback of the dirty list
without waiting for it? What should it do if all the buffers are under
writeback?

Mikulas
On Tue, Aug 13, 2019 at 12:35:49PM -0400, Mikulas Patocka wrote:
> On Sat, 10 Aug 2019, Dave Chinner wrote:
>
> > No, you misunderstand. I'm talking about blocking kswapd being
> > wrong. i.e. Blocking kswapd in shrinkers causes problems because the
> > memory reclaim code does not expect kswapd to be arbitrarily delayed
> > by waiting on IO. We've had this problem with the XFS inode cache
> > shrinker for years, and there are many reports of extremely long
> > reclaim latencies for both direct and kswapd reclaim that result from
> > kswapd not making progress while waiting in shrinkers for IO to
> > complete.
> >
> > The work I'm currently doing to fix this XFS problem can be found here:
> >
> > https://lore.kernel.org/linux-fsdevel/20190801021752.4986-1-david@fromorbit.com/
> >
> > i.e. the point I'm making is that waiting for IO in kswapd reclaim
> > context is considered harmful - kswapd context shrinker reclaim should
> > be as non-blocking as possible, and any back-off to wait for IO to
> > complete should be done by the high level reclaim core once it's
> > completed an entire reclaim scan cycle of everything....
> >
> > What follows from that, and is pertinent in this situation, is that
> > if you don't block kswapd, then other reclaim contexts are not going
> > to get stuck waiting for it regardless of the reclaim context they
> > use.
> >
> > Cheers,
> >
> > Dave.
>
> So, what do you think the dm-bufio shrinker should do?

I'm not familiar with the constraints the code operates under, so I
can't guarantee that I have an answer for you... :/

> Currently it tries to free buffers on the clean list and if there are not
> enough buffers on the clean list, it goes into the dirty list - it writes
> the buffers back and then frees them.
>
> What should it do? Should it just start writeback of the dirty list
> without waiting for it? What should it do if all the buffers are under
> writeback?

For kswapd, it should do what it can without blocking. e.g. kicking
an async writer thread rather than submitting the IO itself. That's
what I changed XFS to do.

And if you look at the patchset in the above link, it also introduced
a mechanism for shrinkers to communicate back to the high level
reclaim code that kswapd needs to back off
(reclaim_state->need_backoff).

With these mechanisms, the shrinker can start IO without blocking
kswapd on congested request queues and tell memory reclaim to wait
before calling this shrinker again. This allows kswapd to aggregate
all the waits that shrinkers and page reclaim require for progress to
be made into a single backoff event. That means kswapd does other
scanning work while background writeback goes on, and once everything
is scanned it does a single wait for everything that needs time to
make progress...

I think that should also work for the dm-bufio shrinker, and the
direct reclaim backoff parts of the patchset should work for
non-blocking direct reclaim scanning as well, like it now does for
XFS.

Cheers,

Dave.
Index: linux-2.6/fs/direct-io.c
===================================================================
--- linux-2.6.orig/fs/direct-io.c	2019-08-02 08:51:56.000000000 +0200
+++ linux-2.6/fs/direct-io.c	2019-08-08 11:22:18.000000000 +0200
@@ -436,7 +436,7 @@ dio_bio_alloc(struct dio *dio, struct di
 	 * bio_alloc() is guaranteed to return a bio when allowed to sleep and
 	 * we request a valid number of vectors.
 	 */
-	bio = bio_alloc(GFP_KERNEL, nr_vecs);
+	bio = bio_alloc(GFP_NOIO, nr_vecs);
 
 	bio_set_dev(bio, bdev);
 	bio->bi_iter.bi_sector = first_sector;
@@ -1197,7 +1197,7 @@ do_blockdev_direct_IO(struct kiocb *iocb
 	if (iov_iter_rw(iter) == READ && !count)
 		return 0;
 
-	dio = kmem_cache_alloc(dio_cache, GFP_KERNEL);
+	dio = kmem_cache_alloc(dio_cache, GFP_NOIO);
 	retval = -ENOMEM;
 	if (!dio)
 		goto out;
A deadlock with this stacktrace was observed.

The obvious problem here is that in the call chain
xfs_vm_direct_IO->__blockdev_direct_IO->do_blockdev_direct_IO->kmem_cache_alloc
we do a GFP_KERNEL allocation while we are in a filesystem driver and in a
block device driver.

This patch changes the direct-io code to use GFP_NOIO.

PID: 474    TASK: ffff8813e11f4600  CPU: 10  COMMAND: "kswapd0"
 #0 [ffff8813dedfb938] __schedule at ffffffff8173f405
 #1 [ffff8813dedfb990] schedule at ffffffff8173fa27
 #2 [ffff8813dedfb9b0] schedule_timeout at ffffffff81742fec
 #3 [ffff8813dedfba60] io_schedule_timeout at ffffffff8173f186
 #4 [ffff8813dedfbaa0] bit_wait_io at ffffffff8174034f
 #5 [ffff8813dedfbac0] __wait_on_bit at ffffffff8173fec8
 #6 [ffff8813dedfbb10] out_of_line_wait_on_bit at ffffffff8173ff81
 #7 [ffff8813dedfbb90] __make_buffer_clean at ffffffffa038736f [dm_bufio]
 #8 [ffff8813dedfbbb0] __try_evict_buffer at ffffffffa0387bb8 [dm_bufio]
 #9 [ffff8813dedfbbd0] dm_bufio_shrink_scan at ffffffffa0387cc3 [dm_bufio]
#10 [ffff8813dedfbc40] shrink_slab at ffffffff811a87ce
#11 [ffff8813dedfbd30] shrink_zone at ffffffff811ad778
#12 [ffff8813dedfbdc0] kswapd at ffffffff811ae92f
#13 [ffff8813dedfbec0] kthread at ffffffff810a8428
#14 [ffff8813dedfbf50] ret_from_fork at ffffffff81745242

PID: 14127  TASK: ffff881455749c00  CPU: 11  COMMAND: "loop1"
 #0 [ffff88272f5af228] __schedule at ffffffff8173f405
 #1 [ffff88272f5af280] schedule at ffffffff8173fa27
 #2 [ffff88272f5af2a0] schedule_preempt_disabled at ffffffff8173fd5e
 #3 [ffff88272f5af2b0] __mutex_lock_slowpath at ffffffff81741fb5
 #4 [ffff88272f5af330] mutex_lock at ffffffff81742133
 #5 [ffff88272f5af350] dm_bufio_shrink_count at ffffffffa03865f9 [dm_bufio]
 #6 [ffff88272f5af380] shrink_slab at ffffffff811a86bd
 #7 [ffff88272f5af470] shrink_zone at ffffffff811ad778
 #8 [ffff88272f5af500] do_try_to_free_pages at ffffffff811adb34
 #9 [ffff88272f5af590] try_to_free_pages at ffffffff811adef8
#10 [ffff88272f5af610] __alloc_pages_nodemask at ffffffff811a09c3
#11 [ffff88272f5af710] alloc_pages_current at ffffffff811e8b71
#12 [ffff88272f5af760] new_slab at ffffffff811f4523
#13 [ffff88272f5af7b0] __slab_alloc at ffffffff8173a1b5
#14 [ffff88272f5af880] kmem_cache_alloc at ffffffff811f484b
#15 [ffff88272f5af8d0] do_blockdev_direct_IO at ffffffff812535b3
#16 [ffff88272f5afb00] __blockdev_direct_IO at ffffffff81255dc3
#17 [ffff88272f5afb30] xfs_vm_direct_IO at ffffffffa01fe3fc [xfs]
#18 [ffff88272f5afb90] generic_file_read_iter at ffffffff81198994
#19 [ffff88272f5afc50] __dta_xfs_file_read_iter_2398 at ffffffffa020c970 [xfs]
#20 [ffff88272f5afcc0] lo_rw_aio at ffffffffa0377042 [loop]
#21 [ffff88272f5afd70] loop_queue_work at ffffffffa0377c3b [loop]
#22 [ffff88272f5afe60] kthread_worker_fn at ffffffff810a8a0c
#23 [ffff88272f5afec0] kthread at ffffffff810a8428
#24 [ffff88272f5aff50] ret_from_fork at ffffffff81745242

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org

---
 fs/direct-io.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)