diff mbox series

[1/1] NFS: Fix potential oops in nfs_inode_remove_request()

Message ID 20230725150807.8770-2-smayhew@redhat.com (mailing list archive)
State New, archived
Headers show
Series Fix potential oops in nfs_inode_remove_request() | expand

Commit Message

Scott Mayhew July 25, 2023, 3:08 p.m. UTC
Once a folio's private data has been cleared, it's possible for another
process to clear the folio->mapping (e.g. via invalidate_complete_folio2
or evict_mapping_folio), so it wouldn't be safe to call
nfs_page_to_inode() after that.

Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use folios")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
---
 fs/nfs/write.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Trond Myklebust July 25, 2023, 3:14 p.m. UTC | #1
On Tue, 2023-07-25 at 11:08 -0400, Scott Mayhew wrote:
> Once a folio's private data has been cleared, it's possible for
> another
> process to clear the folio->mapping (e.g. via
> invalidate_complete_folio2
> or evict_mapping_folio), so it wouldn't be safe to call
> nfs_page_to_inode() after that.
> 
> Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use folios")
> Signed-off-by: Scott Mayhew <smayhew@redhat.com>
> ---
>  fs/nfs/write.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index f4cca8f00c0c..489c3f9dae23 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -785,6 +785,8 @@ static void nfs_inode_add_request(struct nfs_page
> *req)
>   */
>  static void nfs_inode_remove_request(struct nfs_page *req)
>  {
> +       struct nfs_inode *nfsi = NFS_I(nfs_page_to_inode(req));
> +
>         if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
>                 struct folio *folio = nfs_page_to_folio(req-
> >wb_head);
>                 struct address_space *mapping =
> folio_file_mapping(folio);
> @@ -800,7 +802,7 @@ static void nfs_inode_remove_request(struct
> nfs_page *req)
>  
>         if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags)) {
>                 nfs_release_request(req);
> -               atomic_long_dec(&NFS_I(nfs_page_to_inode(req))-
> >nrequests);
> +               atomic_long_dec(&nfsi->nrequests);

Why not just invert the order of the atomic_long_dec() and the
nfs_release_request()? That way you are also ensuring that the inode is
still pinned in memory by the open context.
>         }
>  }
>
Scott Mayhew July 25, 2023, 4:24 p.m. UTC | #2
On Tue, 25 Jul 2023, Trond Myklebust wrote:

> On Tue, 2023-07-25 at 11:08 -0400, Scott Mayhew wrote:
> > Once a folio's private data has been cleared, it's possible for
> > another
> > process to clear the folio->mapping (e.g. via
> > invalidate_complete_folio2
> > or evict_mapping_folio), so it wouldn't be safe to call
> > nfs_page_to_inode() after that.
> > 
> > Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use folios")
> > Signed-off-by: Scott Mayhew <smayhew@redhat.com>
> > ---
> >  fs/nfs/write.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > index f4cca8f00c0c..489c3f9dae23 100644
> > --- a/fs/nfs/write.c
> > +++ b/fs/nfs/write.c
> > @@ -785,6 +785,8 @@ static void nfs_inode_add_request(struct nfs_page
> > *req)
> >   */
> >  static void nfs_inode_remove_request(struct nfs_page *req)
> >  {
> > +       struct nfs_inode *nfsi = NFS_I(nfs_page_to_inode(req));
> > +
> >         if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
> >                 struct folio *folio = nfs_page_to_folio(req-
> > >wb_head);
> >                 struct address_space *mapping =
> > folio_file_mapping(folio);
> > @@ -800,7 +802,7 @@ static void nfs_inode_remove_request(struct
> > nfs_page *req)
> >  
> >         if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags)) {
> >                 nfs_release_request(req);
> > -               atomic_long_dec(&NFS_I(nfs_page_to_inode(req))-
> > >nrequests);
> > +               atomic_long_dec(&nfsi->nrequests);
> 
> Why not just invert the order of the atomic_long_dec() and the
> nfs_release_request()? That way you are also ensuring that the inode is
> still pinned in memory by the open context.

I'm not following.  How does inverting the order prevent the
folio->mapping from getting clobbered?

-Scott
> >         }
> >  }
> >  
> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
> 
>
Trond Myklebust July 25, 2023, 5:41 p.m. UTC | #3
On Tue, 2023-07-25 at 12:24 -0400, Scott Mayhew wrote:
> On Tue, 25 Jul 2023, Trond Myklebust wrote:
> 
> > On Tue, 2023-07-25 at 11:08 -0400, Scott Mayhew wrote:
> > > Once a folio's private data has been cleared, it's possible for
> > > another
> > > process to clear the folio->mapping (e.g. via
> > > invalidate_complete_folio2
> > > or evict_mapping_folio), so it wouldn't be safe to call
> > > nfs_page_to_inode() after that.
> > > 
> > > Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use
> > > folios")
> > > Signed-off-by: Scott Mayhew <smayhew@redhat.com>
> > > ---
> > >  fs/nfs/write.c | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > > index f4cca8f00c0c..489c3f9dae23 100644
> > > --- a/fs/nfs/write.c
> > > +++ b/fs/nfs/write.c
> > > @@ -785,6 +785,8 @@ static void nfs_inode_add_request(struct
> > > nfs_page
> > > *req)
> > >   */
> > >  static void nfs_inode_remove_request(struct nfs_page *req)
> > >  {
> > > +       struct nfs_inode *nfsi = NFS_I(nfs_page_to_inode(req));
> > > +
> > >         if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
> > >                 struct folio *folio = nfs_page_to_folio(req-
> > > > wb_head);
> > >                 struct address_space *mapping =
> > > folio_file_mapping(folio);
> > > @@ -800,7 +802,7 @@ static void nfs_inode_remove_request(struct
> > > nfs_page *req)
> > >  
> > >         if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags)) {
> > >                 nfs_release_request(req);
> > > -               atomic_long_dec(&NFS_I(nfs_page_to_inode(req))-
> > > > nrequests);
> > > +               atomic_long_dec(&nfsi->nrequests);
> > 
> > Why not just invert the order of the atomic_long_dec() and the
> > nfs_release_request()? That way you are also ensuring that the
> > inode is
> > still pinned in memory by the open context.
> 
> I'm not following.  How does inverting the order prevent the
> folio->mapping from getting clobbered?
> 

The open/lock context is refcounted by the nfs_page until the latter is
released. That's why the inode is guaranteed to remain around at least
until the call to nfs_release_request().
Jeff Layton July 26, 2023, 12:40 p.m. UTC | #4
On Tue, 2023-07-25 at 17:41 +0000, Trond Myklebust wrote:
> On Tue, 2023-07-25 at 12:24 -0400, Scott Mayhew wrote:
> > On Tue, 25 Jul 2023, Trond Myklebust wrote:
> > 
> > > On Tue, 2023-07-25 at 11:08 -0400, Scott Mayhew wrote:
> > > > Once a folio's private data has been cleared, it's possible for
> > > > another
> > > > process to clear the folio->mapping (e.g. via
> > > > invalidate_complete_folio2
> > > > or evict_mapping_folio), so it wouldn't be safe to call
> > > > nfs_page_to_inode() after that.
> > > > 
> > > > Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use
> > > > folios")
> > > > Signed-off-by: Scott Mayhew <smayhew@redhat.com>
> > > > ---
> > > >  fs/nfs/write.c | 4 +++-
> > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > > > index f4cca8f00c0c..489c3f9dae23 100644
> > > > --- a/fs/nfs/write.c
> > > > +++ b/fs/nfs/write.c
> > > > @@ -785,6 +785,8 @@ static void nfs_inode_add_request(struct
> > > > nfs_page
> > > > *req)
> > > >   */
> > > >  static void nfs_inode_remove_request(struct nfs_page *req)
> > > >  {
> > > > +       struct nfs_inode *nfsi = NFS_I(nfs_page_to_inode(req));
> > > > +
> > > >         if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
> > > >                 struct folio *folio = nfs_page_to_folio(req-
> > > > > wb_head);
> > > >                 struct address_space *mapping =
> > > > folio_file_mapping(folio);
> > > > @@ -800,7 +802,7 @@ static void nfs_inode_remove_request(struct
> > > > nfs_page *req)
> > > >  
> > > >         if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags)) {
> > > >                 nfs_release_request(req);
> > > > -               atomic_long_dec(&NFS_I(nfs_page_to_inode(req))-
> > > > > nrequests);
> > > > +               atomic_long_dec(&nfsi->nrequests);
> > > 
> > > Why not just invert the order of the atomic_long_dec() and the
> > > nfs_release_request()? That way you are also ensuring that the
> > > inode is
> > > still pinned in memory by the open context.
> > 
> > I'm not following.  How does inverting the order prevent the
> > folio->mapping from getting clobbered?
> > 
> 
> The open/lock context is refcounted by the nfs_page until the latter is
> released. That's why the inode is guaranteed to remain around at least
> until the call to nfs_release_request().
> 

The problem is not that the inode is going away, but rather that we
can't guarantee that the page is still part of the mapping at this
point, and so we can't safely dereference page->mapping there. I do see
that nfs_release_request releases a reference to the page, but I don't
think that's sufficient to ensure that it remains part of the mapping.

AFAICT, once we clear page->private, the page is subject to be removed
from the mapping. So, I *think* it's safe to just move the call to
nfs_page_to_inode prior to the call to nfs_page_group_sync_on_bit.
Scott Mayhew July 26, 2023, 1:43 p.m. UTC | #5
On Wed, 26 Jul 2023, Jeff Layton wrote:

> On Tue, 2023-07-25 at 17:41 +0000, Trond Myklebust wrote:
> > On Tue, 2023-07-25 at 12:24 -0400, Scott Mayhew wrote:
> > > On Tue, 25 Jul 2023, Trond Myklebust wrote:
> > > 
> > > > On Tue, 2023-07-25 at 11:08 -0400, Scott Mayhew wrote:
> > > > > Once a folio's private data has been cleared, it's possible for
> > > > > another
> > > > > process to clear the folio->mapping (e.g. via
> > > > > invalidate_complete_folio2
> > > > > or evict_mapping_folio), so it wouldn't be safe to call
> > > > > nfs_page_to_inode() after that.
> > > > > 
> > > > > Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use
> > > > > folios")
> > > > > Signed-off-by: Scott Mayhew <smayhew@redhat.com>
> > > > > ---
> > > > >  fs/nfs/write.c | 4 +++-
> > > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > > > > index f4cca8f00c0c..489c3f9dae23 100644
> > > > > --- a/fs/nfs/write.c
> > > > > +++ b/fs/nfs/write.c
> > > > > @@ -785,6 +785,8 @@ static void nfs_inode_add_request(struct
> > > > > nfs_page
> > > > > *req)
> > > > >   */
> > > > >  static void nfs_inode_remove_request(struct nfs_page *req)
> > > > >  {
> > > > > +       struct nfs_inode *nfsi = NFS_I(nfs_page_to_inode(req));
> > > > > +
> > > > >         if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
> > > > >                 struct folio *folio = nfs_page_to_folio(req-
> > > > > > wb_head);
> > > > >                 struct address_space *mapping =
> > > > > folio_file_mapping(folio);
> > > > > @@ -800,7 +802,7 @@ static void nfs_inode_remove_request(struct
> > > > > nfs_page *req)
> > > > >  
> > > > >         if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags)) {
> > > > >                 nfs_release_request(req);
> > > > > -               atomic_long_dec(&NFS_I(nfs_page_to_inode(req))-
> > > > > > nrequests);
> > > > > +               atomic_long_dec(&nfsi->nrequests);
> > > > 
> > > > Why not just invert the order of the atomic_long_dec() and the
> > > > nfs_release_request()? That way you are also ensuring that the
> > > > inode is
> > > > still pinned in memory by the open context.
> > > 
> > > I'm not following.  How does inverting the order prevent the
> > > folio->mapping from getting clobbered?
> > > 
> > 
> > The open/lock context is refcounted by the nfs_page until the latter is
> > released. That's why the inode is guaranteed to remain around at least
> > until the call to nfs_release_request().
> > 
> 
> The problem is not that the inode is going away, but rather that we
> can't guarantee that the page is still part of the mapping at this
> point, and so we can't safely dereference page->mapping there. I do see
> that nfs_release_request releases a reference to the page, but I don't
> think that's sufficient to ensure that it remains part of the mapping.
> 
> AFAICT, once we clear page->private, the page is subject to be removed
> from the mapping. So, I *think* it's safe to just move the call to
> nfs_page_to_inode prior to the call to nfs_page_group_sync_on_bit.

Yeah, the inode hasn't gone away.  I can pick the nfs_commit_data
address off the stack in nfs_commit_release_pages:

crash> nfs_commit_data.inode c0000006774cae00
  inode = 0xc00000006c1b05f8,
 
The nfs_inode is still allocated:

crash> kmem 0xc00000006c1b05f8
CACHE             OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE  NAME
c000000030332600     1088        128      5959    101    64k  nfs_inode_cache
  SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
  c00c0000001b06c0  c00000006c1b0000     0     59          1    58
  FREE / [ALLOCATED]
  [c00000006c1b0448]

      PAGE        PHYSICAL      MAPPING       INDEX CNT FLAGS
c00c0000001b06c0  6c1b0000 c000000030332600 c00000006c1b4480  1 23ffff800000200 slab

The vfs_inode:

crash> px &((struct nfs_inode *)0xc00000006c1b0448)->vfs_inode
$7 = (struct inode *) 0xc00000006c1b05f8

Matches the inodes open by both nfs_flock programs from the test:

crash> foreach nfs_flock files
PID: 4006780  TASK: c00000009d436600  CPU: 43   COMMAND: "nfs_flock"
ROOT: /    CWD: /tmp/ltp-aFr4AJt3R9/LTP_nfslock01.9hyHNgoKqq/3/0/
 FD       FILE            DENTRY           INODE       TYPE PATH
  0 c000000196e9a000 c000000004090840 c00000000ae13bf0 CHR  /dev/null
  1 c0000000bfd1ff00 c0000000963e0e40 c00000006c573900 REG  /opt/ltp/output/nfslock01.sh_20230610112802
  2 c0000000bfd1ff00 c0000000963e0e40 c00000006c573900 REG  /opt/ltp/output/nfslock01.sh_20230610112802
  3 c000000196e97700 c000000419ccb040 c00000006c1b05f8 REG  /tmp/ltp-aFr4AJt3R9/LTP_nfslock01.9hyHNgoKqq/3/0/flock_idata
                                      ^^^^^^^^^^^^^^^^

PID: 4006781  TASK: c00000009d42d500  CPU: 42   COMMAND: "nfs_flock"
ROOT: /    CWD: /tmp/ltp-aFr4AJt3R9/LTP_nfslock01.9hyHNgoKqq/3/0/
 FD       FILE            DENTRY           INODE       TYPE PATH
  0 c0000000f0812200 c000000004090840 c00000000ae13bf0 CHR  /dev/null
  1 c0000000bfd1ff00 c0000000963e0e40 c00000006c573900 REG  /opt/ltp/output/nfslock01.sh_20230610112802
  2 c0000000bfd1ff00 c0000000963e0e40 c00000006c573900 REG  /opt/ltp/output/nfslock01.sh_20230610112802
  3 c0000000f0813c00 c000000419ccb040 c00000006c1b05f8 REG  /tmp/ltp-aFr4AJt3R9/LTP_nfslock01.9hyHNgoKqq/3/0/flock_idata
                                      ^^^^^^^^^^^^^^^^

The file->f_mapping for both struct files matches the inode->i_data:

crash> file.f_mapping c000000196e97700
  f_mapping = 0xc00000006c1b0770,
crash> file.f_mapping c0000000f0813c00
  f_mapping = 0xc00000006c1b0770,
crash> px &((struct inode *)0xc00000006c1b05f8)->i_data
$8 = (struct address_space *) 0xc00000006c1b0770

and if I look at one of those nfs_flock tasks, the folio
passed in to nfs_read_folio has the same mapping:

crash> bt 4006781
PID: 4006781  TASK: c00000009d42d500  CPU: 42   COMMAND: "nfs_flock"
 #0 [c000000177053710] __schedule at c000000000f61d9c
 #1 [c0000001770537d0] schedule at c000000000f621f4
 #2 [c000000177053840] io_schedule at c000000000f62354
 #3 [c000000177053870] folio_wait_bit_common at c00000000042dc60
 #4 [c000000177053970] nfs_read_folio at c0080000050108a8 [nfs]
 #5 [c000000177053a60] nfs_write_begin at c008000004fff06c [nfs]
 #6 [c000000177053b10] generic_perform_write at c00000000042b044
 #7 [c000000177053bc0] nfs_file_write at c008000004ffda08 [nfs]
 #8 [c000000177053c60] new_sync_write at c00000000057fdd8
 #9 [c000000177053d10] vfs_write at c000000000582fd4
#10 [c000000177053d60] ksys_write at c0000000005833a4
#11 [c000000177053db0] system_call_exception at c00000000002f434
#12 [c000000177053e10] system_call_vectored_common at c00000000000bfe8

crash> folio.mapping c00c000000564400
      mapping = 0xc00000006c1b0770,

It's just that if we go back to the nfs_page being released by our panic
task, the folio->mapping has been cleared, so we panic when we try to go
folio->mapping->host.

crash> nfs_page c00000016fb2a600
struct nfs_page {
  wb_list = {
    next = 0xc00000016fb2a600,
    prev = 0xc00000016fb2a600
  },
  {
    wb_page = 0xc00c000001d49580,
    wb_folio = 0xc00c000001d49580
  },
  wb_lock_context = 0xc00000010518b2c0,
  wb_index = 0x1,
  wb_offset = 0x6940,
  wb_pgbase = 0x6940,
  wb_bytes = 0x40,
  wb_kref = {
    refcount = {
      refs = {
        counter = 0x1
      }
    }
  },
  wb_flags = 0x5,
  wb_verf = {
    data = "\214\205_d\214\210W\036"
  },
  wb_this_page = 0xc00000016fb2a600,
  wb_head = 0xc00000016fb2a600,
  wb_nio = 0x0
}
crash> folio.mapping 0xc00c000001d49580
      mapping = 0x0,

-Scott

> -- 
> Jeff Layton <jlayton@kernel.org>
>
Benjamin Coddington Oct. 2, 2023, 8:16 p.m. UTC | #6
On 25 Jul 2023, at 11:08, Scott Mayhew wrote:

> Once a folio's private data has been cleared, it's possible for another
> process to clear the folio->mapping (e.g. via invalidate_complete_folio2
> or evict_mapping_folio), so it wouldn't be safe to call
> nfs_page_to_inode() after that.
>
> Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use folios")
> Signed-off-by: Scott Mayhew <smayhew@redhat.com>

Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>

Ben
Jeff Layton Oct. 2, 2023, 8:29 p.m. UTC | #7
On Tue, 2023-07-25 at 11:08 -0400, Scott Mayhew wrote:
> Once a folio's private data has been cleared, it's possible for another
> process to clear the folio->mapping (e.g. via invalidate_complete_folio2
> or evict_mapping_folio), so it wouldn't be safe to call
> nfs_page_to_inode() after that.
> 
> Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use folios")
> Signed-off-by: Scott Mayhew <smayhew@redhat.com>
> ---
>  fs/nfs/write.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index f4cca8f00c0c..489c3f9dae23 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -785,6 +785,8 @@ static void nfs_inode_add_request(struct nfs_page *req)
>   */
>  static void nfs_inode_remove_request(struct nfs_page *req)
>  {
> +	struct nfs_inode *nfsi = NFS_I(nfs_page_to_inode(req));
> +
>  	if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
>  		struct folio *folio = nfs_page_to_folio(req->wb_head);
>  		struct address_space *mapping = folio_file_mapping(folio);
> @@ -800,7 +802,7 @@ static void nfs_inode_remove_request(struct nfs_page *req)
>  
>  	if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags)) {
>  		nfs_release_request(req);
> -		atomic_long_dec(&NFS_I(nfs_page_to_inode(req))->nrequests);
> +		atomic_long_dec(&nfsi->nrequests);
>  	}
>  }
>  

Reviewed-by: Jeff Layton <jlayton@kernel.org>
diff mbox series

Patch

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index f4cca8f00c0c..489c3f9dae23 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -785,6 +785,8 @@  static void nfs_inode_add_request(struct nfs_page *req)
  */
 static void nfs_inode_remove_request(struct nfs_page *req)
 {
+	struct nfs_inode *nfsi = NFS_I(nfs_page_to_inode(req));
+
 	if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
 		struct folio *folio = nfs_page_to_folio(req->wb_head);
 		struct address_space *mapping = folio_file_mapping(folio);
@@ -800,7 +802,7 @@  static void nfs_inode_remove_request(struct nfs_page *req)
 
 	if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags)) {
 		nfs_release_request(req);
-		atomic_long_dec(&NFS_I(nfs_page_to_inode(req))->nrequests);
+		atomic_long_dec(&nfsi->nrequests);
 	}
 }