diff mbox series

[v2] NFS: Fix nfs_netfs_issue_read() xarray locking for writeback interrupt

Message ID 20240131161006.1475094-1-dwysocha@redhat.com (mailing list archive)
State New, archived
Headers show
Series [v2] NFS: Fix nfs_netfs_issue_read() xarray locking for writeback interrupt | expand

Commit Message

David Wysochanski Jan. 31, 2024, 4:10 p.m. UTC
The loop inside nfs_netfs_issue_read() currently does not disable
interrupts while iterating through pages in the xarray to submit
for NFS read.  This is not safe though since after taking xa_lock,
another page in the mapping could be processed for writeback inside
an interrupt, and deadlock can occur.  The fix is simple and clean
if we use xa_for_each_range(), which handles the iteration with RCU
while reducing code complexity.

The problem is easily reproduced with the following test:
 mount -o vers=3,fsc 127.0.0.1:/export /mnt/nfs
 dd if=/dev/zero of=/mnt/nfs/file1.bin bs=4096 count=1
 echo 3 > /proc/sys/vm/drop_caches
 dd if=/mnt/nfs/file1.bin of=/dev/null
 umount /mnt/nfs

On the console with a lockdep-enabled kernel a message similar to
the following will be seen:

 ================================
 WARNING: inconsistent lock state
 6.7.0-lockdbg+ #10 Not tainted
 --------------------------------
 inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
 test5/1708 [HC0[0]:SC0[0]:HE1:SE1] takes:
 ffff888127baa598 (&xa->xa_lock#4){+.?.}-{3:3}, at:
nfs_netfs_issue_read+0x1b2/0x4b0 [nfs]
 {IN-SOFTIRQ-W} state was registered at:
   lock_acquire+0x144/0x380
   _raw_spin_lock_irqsave+0x4e/0xa0
   __folio_end_writeback+0x17e/0x5c0
   folio_end_writeback+0x93/0x1b0
   iomap_finish_ioend+0xeb/0x6a0
   blk_update_request+0x204/0x7f0
   blk_mq_end_request+0x30/0x1c0
   blk_complete_reqs+0x7e/0xa0
   __do_softirq+0x113/0x544
   __irq_exit_rcu+0xfe/0x120
   irq_exit_rcu+0xe/0x20
   sysvec_call_function_single+0x6f/0x90
   asm_sysvec_call_function_single+0x1a/0x20
   pv_native_safe_halt+0xf/0x20
   default_idle+0x9/0x20
   default_idle_call+0x67/0xa0
   do_idle+0x2b5/0x300
   cpu_startup_entry+0x34/0x40
   start_secondary+0x19d/0x1c0
   secondary_startup_64_no_verify+0x18f/0x19b
 irq event stamp: 176891
 hardirqs last  enabled at (176891): [<ffffffffa67a0be4>]
_raw_spin_unlock_irqrestore+0x44/0x60
 hardirqs last disabled at (176890): [<ffffffffa67a0899>]
_raw_spin_lock_irqsave+0x79/0xa0
 softirqs last  enabled at (176646): [<ffffffffa515d91e>]
__irq_exit_rcu+0xfe/0x120
 softirqs last disabled at (176633): [<ffffffffa515d91e>]
__irq_exit_rcu+0xfe/0x120

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&xa->xa_lock#4);
   <Interrupt>
     lock(&xa->xa_lock#4);

  *** DEADLOCK ***

 2 locks held by test5/1708:
  #0: ffff888127baa498 (&sb->s_type->i_mutex_key#22){++++}-{4:4}, at:
      nfs_start_io_read+0x28/0x90 [nfs]
  #1: ffff888127baa650 (mapping.invalidate_lock#3){.+.+}-{4:4}, at:
      page_cache_ra_unbounded+0xa4/0x280

 stack backtrace:
 CPU: 6 PID: 1708 Comm: test5 Kdump: loaded Not tainted 6.7.0-lockdbg+
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39
04/01/2014
 Call Trace:
  dump_stack_lvl+0x5b/0x90
  mark_lock+0xb3f/0xd20
  __lock_acquire+0x77b/0x3360
  _raw_spin_lock+0x34/0x80
  nfs_netfs_issue_read+0x1b2/0x4b0 [nfs]
  netfs_begin_read+0x77f/0x980 [netfs]
  nfs_netfs_readahead+0x45/0x60 [nfs]
  nfs_readahead+0x323/0x5a0 [nfs]
  read_pages+0xf3/0x5c0
  page_cache_ra_unbounded+0x1c8/0x280
  filemap_get_pages+0x38c/0xae0
  filemap_read+0x206/0x5e0
  nfs_file_read+0xb7/0x140 [nfs]
  vfs_read+0x2a9/0x460
  ksys_read+0xb7/0x140

Fixes: 000dbe0bec05 ("NFS: Convert buffered read paths to use netfs when
fscache is enabled")
Suggested-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
---
 fs/nfs/fscache.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

Comments

Jeff Layton Jan. 31, 2024, 4:38 p.m. UTC | #1
On Wed, 2024-01-31 at 11:10 -0500, Dave Wysochanski wrote:
> The loop inside nfs_netfs_issue_read() currently does not disable
> interrupts while iterating through pages in the xarray to submit
> for NFS read.  This is not safe though since after taking xa_lock,
> another page in the mapping could be processed for writeback inside
> an interrupt, and deadlock can occur.  The fix is simple and clean
> if we use xa_for_each_range(), which handles the iteration with RCU
> while reducing code complexity.
> 
> The problem is easily reproduced with the following test:
>  mount -o vers=3,fsc 127.0.0.1:/export /mnt/nfs
>  dd if=/dev/zero of=/mnt/nfs/file1.bin bs=4096 count=1
>  echo 3 > /proc/sys/vm/drop_caches
>  dd if=/mnt/nfs/file1.bin of=/dev/null
>  umount /mnt/nfs
> 
> On the console with a lockdep-enabled kernel a message similar to
> the following will be seen:
> 
>  ================================
>  WARNING: inconsistent lock state
>  6.7.0-lockdbg+ #10 Not tainted
>  --------------------------------
>  inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
>  test5/1708 [HC0[0]:SC0[0]:HE1:SE1] takes:
>  ffff888127baa598 (&xa->xa_lock#4){+.?.}-{3:3}, at:
> nfs_netfs_issue_read+0x1b2/0x4b0 [nfs]
>  {IN-SOFTIRQ-W} state was registered at:
>    lock_acquire+0x144/0x380
>    _raw_spin_lock_irqsave+0x4e/0xa0
>    __folio_end_writeback+0x17e/0x5c0
>    folio_end_writeback+0x93/0x1b0
>    iomap_finish_ioend+0xeb/0x6a0
>    blk_update_request+0x204/0x7f0
>    blk_mq_end_request+0x30/0x1c0
>    blk_complete_reqs+0x7e/0xa0
>    __do_softirq+0x113/0x544
>    __irq_exit_rcu+0xfe/0x120
>    irq_exit_rcu+0xe/0x20
>    sysvec_call_function_single+0x6f/0x90
>    asm_sysvec_call_function_single+0x1a/0x20
>    pv_native_safe_halt+0xf/0x20
>    default_idle+0x9/0x20
>    default_idle_call+0x67/0xa0
>    do_idle+0x2b5/0x300
>    cpu_startup_entry+0x34/0x40
>    start_secondary+0x19d/0x1c0
>    secondary_startup_64_no_verify+0x18f/0x19b
>  irq event stamp: 176891
>  hardirqs last  enabled at (176891): [<ffffffffa67a0be4>]
> _raw_spin_unlock_irqrestore+0x44/0x60
>  hardirqs last disabled at (176890): [<ffffffffa67a0899>]
> _raw_spin_lock_irqsave+0x79/0xa0
>  softirqs last  enabled at (176646): [<ffffffffa515d91e>]
> __irq_exit_rcu+0xfe/0x120
>  softirqs last disabled at (176633): [<ffffffffa515d91e>]
> __irq_exit_rcu+0xfe/0x120
> 
>  other info that might help us debug this:
>   Possible unsafe locking scenario:
> 
>         CPU0
>         ----
>    lock(&xa->xa_lock#4);
>    <Interrupt>
>      lock(&xa->xa_lock#4);
> 
>   *** DEADLOCK ***
> 
>  2 locks held by test5/1708:
>   #0: ffff888127baa498 (&sb->s_type->i_mutex_key#22){++++}-{4:4}, at:
>       nfs_start_io_read+0x28/0x90 [nfs]
>   #1: ffff888127baa650 (mapping.invalidate_lock#3){.+.+}-{4:4}, at:
>       page_cache_ra_unbounded+0xa4/0x280
> 
>  stack backtrace:
>  CPU: 6 PID: 1708 Comm: test5 Kdump: loaded Not tainted 6.7.0-lockdbg+
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39
> 04/01/2014
>  Call Trace:
>   dump_stack_lvl+0x5b/0x90
>   mark_lock+0xb3f/0xd20
>   __lock_acquire+0x77b/0x3360
>   _raw_spin_lock+0x34/0x80
>   nfs_netfs_issue_read+0x1b2/0x4b0 [nfs]
>   netfs_begin_read+0x77f/0x980 [netfs]
>   nfs_netfs_readahead+0x45/0x60 [nfs]
>   nfs_readahead+0x323/0x5a0 [nfs]
>   read_pages+0xf3/0x5c0
>   page_cache_ra_unbounded+0x1c8/0x280
>   filemap_get_pages+0x38c/0xae0
>   filemap_read+0x206/0x5e0
>   nfs_file_read+0xb7/0x140 [nfs]
>   vfs_read+0x2a9/0x460
>   ksys_read+0xb7/0x140
> 
> Fixes: 000dbe0bec05 ("NFS: Convert buffered read paths to use netfs when
> fscache is enabled")
> Suggested-by: Jeff Layton <jlayton@redhat.com>
> Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>
> ---
>  fs/nfs/fscache.c | 9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/nfs/fscache.c b/fs/nfs/fscache.c
> index b05717fe0d4e..60a3c28784e0 100644
> --- a/fs/nfs/fscache.c
> +++ b/fs/nfs/fscache.c
> @@ -307,11 +307,11 @@ static void nfs_netfs_issue_read(struct netfs_io_subrequest *sreq)
>  	struct inode *inode = sreq->rreq->inode;
>  	struct nfs_open_context *ctx = sreq->rreq->netfs_priv;
>  	struct page *page;
> +	unsigned long idx;
>  	int err;
>  	pgoff_t start = (sreq->start + sreq->transferred) >> PAGE_SHIFT;
>  	pgoff_t last = ((sreq->start + sreq->len -
>  			 sreq->transferred - 1) >> PAGE_SHIFT);
> -	XA_STATE(xas, &sreq->rreq->mapping->i_pages, start);
>  
>  	nfs_pageio_init_read(&pgio, inode, false,
>  			     &nfs_async_read_completion_ops);
> @@ -322,19 +322,14 @@ static void nfs_netfs_issue_read(struct netfs_io_subrequest *sreq)
>  
>  	pgio.pg_netfs = netfs; /* used in completion */
>  
> -	xas_lock(&xas);
> -	xas_for_each(&xas, page, last) {
> +	xa_for_each_range(&sreq->rreq->mapping->i_pages, idx, page, start, last) {
>  		/* nfs_read_add_folio() may schedule() due to pNFS layout and other RPCs  */
> -		xas_pause(&xas);
> -		xas_unlock(&xas);
>  		err = nfs_read_add_folio(&pgio, ctx, page_folio(page));
>  		if (err < 0) {
>  			netfs->error = err;
>  			goto out;
>  		}
> -		xas_lock(&xas);
>  	}
> -	xas_unlock(&xas);
>  out:
>  	nfs_pageio_complete_read(&pgio);
>  	nfs_netfs_put(netfs);

Reviewed-by: Jeff Layton <jlayton@kernel.org>
David Howells Jan. 31, 2024, 7:51 p.m. UTC | #2
Dave Wysochanski <dwysocha@redhat.com> wrote:

> The loop inside nfs_netfs_issue_read() currently does not disable
> interrupts while iterating through pages in the xarray to submit
> for NFS read.  This is not safe though since after taking xa_lock,
> another page in the mapping could be processed for writeback inside
> an interrupt, and deadlock can occur.  The fix is simple and clean
> if we use xa_for_each_range(), which handles the iteration with RCU
> while reducing code complexity.
> 
> The problem is easily reproduced with the following test:
>  mount -o vers=3,fsc 127.0.0.1:/export /mnt/nfs
>  dd if=/dev/zero of=/mnt/nfs/file1.bin bs=4096 count=1
>  echo 3 > /proc/sys/vm/drop_caches
>  dd if=/mnt/nfs/file1.bin of=/dev/null
>  umount /mnt/nfs
> 
> On the console with a lockdep-enabled kernel a message similar to
> the following will be seen:
> 
>  ================================
>  WARNING: inconsistent lock state
>  6.7.0-lockdbg+ #10 Not tainted
>  --------------------------------
>  inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
>  test5/1708 [HC0[0]:SC0[0]:HE1:SE1] takes:
>  ffff888127baa598 (&xa->xa_lock#4){+.?.}-{3:3}, at:
> nfs_netfs_issue_read+0x1b2/0x4b0 [nfs]
>  {IN-SOFTIRQ-W} state was registered at:
>    lock_acquire+0x144/0x380
>    _raw_spin_lock_irqsave+0x4e/0xa0
>    __folio_end_writeback+0x17e/0x5c0
>    folio_end_writeback+0x93/0x1b0
>    iomap_finish_ioend+0xeb/0x6a0
>    blk_update_request+0x204/0x7f0
>    blk_mq_end_request+0x30/0x1c0
>    blk_complete_reqs+0x7e/0xa0
>    __do_softirq+0x113/0x544
>    __irq_exit_rcu+0xfe/0x120
>    irq_exit_rcu+0xe/0x20
>    sysvec_call_function_single+0x6f/0x90
>    asm_sysvec_call_function_single+0x1a/0x20
>    pv_native_safe_halt+0xf/0x20
>    default_idle+0x9/0x20
>    default_idle_call+0x67/0xa0
>    do_idle+0x2b5/0x300
>    cpu_startup_entry+0x34/0x40
>    start_secondary+0x19d/0x1c0
>    secondary_startup_64_no_verify+0x18f/0x19b
>  irq event stamp: 176891
>  hardirqs last  enabled at (176891): [<ffffffffa67a0be4>]
> _raw_spin_unlock_irqrestore+0x44/0x60
>  hardirqs last disabled at (176890): [<ffffffffa67a0899>]
> _raw_spin_lock_irqsave+0x79/0xa0
>  softirqs last  enabled at (176646): [<ffffffffa515d91e>]
> __irq_exit_rcu+0xfe/0x120
>  softirqs last disabled at (176633): [<ffffffffa515d91e>]
> __irq_exit_rcu+0xfe/0x120
> 
>  other info that might help us debug this:
>   Possible unsafe locking scenario:
> 
>         CPU0
>         ----
>    lock(&xa->xa_lock#4);
>    <Interrupt>
>      lock(&xa->xa_lock#4);
> 
>   *** DEADLOCK ***
> 
>  2 locks held by test5/1708:
>   #0: ffff888127baa498 (&sb->s_type->i_mutex_key#22){++++}-{4:4}, at:
>       nfs_start_io_read+0x28/0x90 [nfs]
>   #1: ffff888127baa650 (mapping.invalidate_lock#3){.+.+}-{4:4}, at:
>       page_cache_ra_unbounded+0xa4/0x280
> 
>  stack backtrace:
>  CPU: 6 PID: 1708 Comm: test5 Kdump: loaded Not tainted 6.7.0-lockdbg+
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39
> 04/01/2014
>  Call Trace:
>   dump_stack_lvl+0x5b/0x90
>   mark_lock+0xb3f/0xd20
>   __lock_acquire+0x77b/0x3360
>   _raw_spin_lock+0x34/0x80
>   nfs_netfs_issue_read+0x1b2/0x4b0 [nfs]
>   netfs_begin_read+0x77f/0x980 [netfs]
>   nfs_netfs_readahead+0x45/0x60 [nfs]
>   nfs_readahead+0x323/0x5a0 [nfs]
>   read_pages+0xf3/0x5c0
>   page_cache_ra_unbounded+0x1c8/0x280
>   filemap_get_pages+0x38c/0xae0
>   filemap_read+0x206/0x5e0
>   nfs_file_read+0xb7/0x140 [nfs]
>   vfs_read+0x2a9/0x460
>   ksys_read+0xb7/0x140
> 
> Fixes: 000dbe0bec05 ("NFS: Convert buffered read paths to use netfs when
> fscache is enabled")
> Suggested-by: Jeff Layton <jlayton@redhat.com>
> Signed-off-by: Dave Wysochanski <dwysocha@redhat.com>

Reviewed-by: David Howells <dhowells@redhat.com>
diff mbox series

Patch

diff --git a/fs/nfs/fscache.c b/fs/nfs/fscache.c
index b05717fe0d4e..60a3c28784e0 100644
--- a/fs/nfs/fscache.c
+++ b/fs/nfs/fscache.c
@@ -307,11 +307,11 @@  static void nfs_netfs_issue_read(struct netfs_io_subrequest *sreq)
 	struct inode *inode = sreq->rreq->inode;
 	struct nfs_open_context *ctx = sreq->rreq->netfs_priv;
 	struct page *page;
+	unsigned long idx;
 	int err;
 	pgoff_t start = (sreq->start + sreq->transferred) >> PAGE_SHIFT;
 	pgoff_t last = ((sreq->start + sreq->len -
 			 sreq->transferred - 1) >> PAGE_SHIFT);
-	XA_STATE(xas, &sreq->rreq->mapping->i_pages, start);
 
 	nfs_pageio_init_read(&pgio, inode, false,
 			     &nfs_async_read_completion_ops);
@@ -322,19 +322,14 @@  static void nfs_netfs_issue_read(struct netfs_io_subrequest *sreq)
 
 	pgio.pg_netfs = netfs; /* used in completion */
 
-	xas_lock(&xas);
-	xas_for_each(&xas, page, last) {
+	xa_for_each_range(&sreq->rreq->mapping->i_pages, idx, page, start, last) {
 		/* nfs_read_add_folio() may schedule() due to pNFS layout and other RPCs  */
-		xas_pause(&xas);
-		xas_unlock(&xas);
 		err = nfs_read_add_folio(&pgio, ctx, page_folio(page));
 		if (err < 0) {
 			netfs->error = err;
 			goto out;
 		}
-		xas_lock(&xas);
 	}
-	xas_unlock(&xas);
 out:
 	nfs_pageio_complete_read(&pgio);
 	nfs_netfs_put(netfs);