Message ID | 20230725150807.8770-2-smayhew@redhat.com
---|---
State | New, archived
Series | Fix potential oops in nfs_inode_remove_request()
On Tue, 2023-07-25 at 11:08 -0400, Scott Mayhew wrote:
> Once a folio's private data has been cleared, it's possible for another
> process to clear the folio->mapping (e.g. via invalidate_complete_folio2
> or evict_mapping_folio), so it wouldn't be safe to call
> nfs_page_to_inode() after that.
> 
> Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use folios")
> Signed-off-by: Scott Mayhew <smayhew@redhat.com>
> ---
>  fs/nfs/write.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index f4cca8f00c0c..489c3f9dae23 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -785,6 +785,8 @@ static void nfs_inode_add_request(struct nfs_page *req)
>   */
>  static void nfs_inode_remove_request(struct nfs_page *req)
>  {
> +	struct nfs_inode *nfsi = NFS_I(nfs_page_to_inode(req));
> +
>  	if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
>  		struct folio *folio = nfs_page_to_folio(req->wb_head);
>  		struct address_space *mapping = folio_file_mapping(folio);
> @@ -800,7 +802,7 @@ static void nfs_inode_remove_request(struct nfs_page *req)
> 
>  	if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags)) {
>  		nfs_release_request(req);
> -		atomic_long_dec(&NFS_I(nfs_page_to_inode(req))->nrequests);
> +		atomic_long_dec(&nfsi->nrequests);

Why not just invert the order of the atomic_long_dec() and the
nfs_release_request()? That way you are also ensuring that the inode is
still pinned in memory by the open context.

> 	}
> }
On Tue, 25 Jul 2023, Trond Myklebust wrote:

> On Tue, 2023-07-25 at 11:08 -0400, Scott Mayhew wrote:
> > Once a folio's private data has been cleared, it's possible for another
> > process to clear the folio->mapping (e.g. via invalidate_complete_folio2
> > or evict_mapping_folio), so it wouldn't be safe to call
> > nfs_page_to_inode() after that.
> > 
> > Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use folios")
> > Signed-off-by: Scott Mayhew <smayhew@redhat.com>
> > ---
> >  fs/nfs/write.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > index f4cca8f00c0c..489c3f9dae23 100644
> > --- a/fs/nfs/write.c
> > +++ b/fs/nfs/write.c
> > @@ -785,6 +785,8 @@ static void nfs_inode_add_request(struct nfs_page *req)
> >   */
> >  static void nfs_inode_remove_request(struct nfs_page *req)
> >  {
> > +	struct nfs_inode *nfsi = NFS_I(nfs_page_to_inode(req));
> > +
> >  	if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
> >  		struct folio *folio = nfs_page_to_folio(req->wb_head);
> >  		struct address_space *mapping = folio_file_mapping(folio);
> > @@ -800,7 +802,7 @@ static void nfs_inode_remove_request(struct nfs_page *req)
> > 
> >  	if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags)) {
> >  		nfs_release_request(req);
> > -		atomic_long_dec(&NFS_I(nfs_page_to_inode(req))->nrequests);
> > +		atomic_long_dec(&nfsi->nrequests);
> 
> Why not just invert the order of the atomic_long_dec() and the
> nfs_release_request()? That way you are also ensuring that the inode is
> still pinned in memory by the open context.

I'm not following.  How does inverting the order prevent the
folio->mapping from getting clobbered?

-Scott

> > 	}
> > }
> > 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
On Tue, 2023-07-25 at 12:24 -0400, Scott Mayhew wrote:
> On Tue, 25 Jul 2023, Trond Myklebust wrote:
> 
> > On Tue, 2023-07-25 at 11:08 -0400, Scott Mayhew wrote:
> > > Once a folio's private data has been cleared, it's possible for another
> > > process to clear the folio->mapping (e.g. via invalidate_complete_folio2
> > > or evict_mapping_folio), so it wouldn't be safe to call
> > > nfs_page_to_inode() after that.
> > > 
> > > Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use folios")
> > > Signed-off-by: Scott Mayhew <smayhew@redhat.com>
> > > ---
> > >  fs/nfs/write.c | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > > index f4cca8f00c0c..489c3f9dae23 100644
> > > --- a/fs/nfs/write.c
> > > +++ b/fs/nfs/write.c
> > > @@ -785,6 +785,8 @@ static void nfs_inode_add_request(struct nfs_page *req)
> > >   */
> > >  static void nfs_inode_remove_request(struct nfs_page *req)
> > >  {
> > > +	struct nfs_inode *nfsi = NFS_I(nfs_page_to_inode(req));
> > > +
> > >  	if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
> > >  		struct folio *folio = nfs_page_to_folio(req->wb_head);
> > >  		struct address_space *mapping = folio_file_mapping(folio);
> > > @@ -800,7 +802,7 @@ static void nfs_inode_remove_request(struct nfs_page *req)
> > > 
> > >  	if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags)) {
> > >  		nfs_release_request(req);
> > > -		atomic_long_dec(&NFS_I(nfs_page_to_inode(req))->nrequests);
> > > +		atomic_long_dec(&nfsi->nrequests);
> > 
> > Why not just invert the order of the atomic_long_dec() and the
> > nfs_release_request()? That way you are also ensuring that the inode is
> > still pinned in memory by the open context.
> 
> I'm not following.  How does inverting the order prevent the
> folio->mapping from getting clobbered?

The open/lock context is refcounted by the nfs_page until the latter is
released. That's why the inode is guaranteed to remain around at least
until the call to nfs_release_request().
On Tue, 2023-07-25 at 17:41 +0000, Trond Myklebust wrote:
> On Tue, 2023-07-25 at 12:24 -0400, Scott Mayhew wrote:
> > On Tue, 25 Jul 2023, Trond Myklebust wrote:
> > 
> > > On Tue, 2023-07-25 at 11:08 -0400, Scott Mayhew wrote:
> > > > Once a folio's private data has been cleared, it's possible for another
> > > > process to clear the folio->mapping (e.g. via invalidate_complete_folio2
> > > > or evict_mapping_folio), so it wouldn't be safe to call
> > > > nfs_page_to_inode() after that.
> > > > 
> > > > Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use folios")
> > > > Signed-off-by: Scott Mayhew <smayhew@redhat.com>
> > > > ---
> > > >  fs/nfs/write.c | 4 +++-
> > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > > > index f4cca8f00c0c..489c3f9dae23 100644
> > > > --- a/fs/nfs/write.c
> > > > +++ b/fs/nfs/write.c
> > > > @@ -785,6 +785,8 @@ static void nfs_inode_add_request(struct nfs_page *req)
> > > >   */
> > > >  static void nfs_inode_remove_request(struct nfs_page *req)
> > > >  {
> > > > +	struct nfs_inode *nfsi = NFS_I(nfs_page_to_inode(req));
> > > > +
> > > >  	if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
> > > >  		struct folio *folio = nfs_page_to_folio(req->wb_head);
> > > >  		struct address_space *mapping = folio_file_mapping(folio);
> > > > @@ -800,7 +802,7 @@ static void nfs_inode_remove_request(struct nfs_page *req)
> > > > 
> > > >  	if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags)) {
> > > >  		nfs_release_request(req);
> > > > -		atomic_long_dec(&NFS_I(nfs_page_to_inode(req))->nrequests);
> > > > +		atomic_long_dec(&nfsi->nrequests);
> > > 
> > > Why not just invert the order of the atomic_long_dec() and the
> > > nfs_release_request()? That way you are also ensuring that the inode is
> > > still pinned in memory by the open context.
> > 
> > I'm not following.  How does inverting the order prevent the
> > folio->mapping from getting clobbered?
> 
> The open/lock context is refcounted by the nfs_page until the latter is
> released. That's why the inode is guaranteed to remain around at least
> until the call to nfs_release_request().

The problem is not that the inode is going away, but rather that we
can't guarantee that the page is still part of the mapping at this
point, and so we can't safely dereference page->mapping there. I do see
that nfs_release_request releases a reference to the page, but I don't
think that's sufficient to ensure that it remains part of the mapping.

AFAICT, once we clear page->private, the page is subject to be removed
from the mapping. So, I *think* it's safe to just move the call to
nfs_page_to_inode prior to the call to nfs_page_group_sync_on_bit.
On Wed, 26 Jul 2023, Jeff Layton wrote:

> On Tue, 2023-07-25 at 17:41 +0000, Trond Myklebust wrote:
> > On Tue, 2023-07-25 at 12:24 -0400, Scott Mayhew wrote:
> > > On Tue, 25 Jul 2023, Trond Myklebust wrote:
> > > 
> > > > On Tue, 2023-07-25 at 11:08 -0400, Scott Mayhew wrote:
> > > > > Once a folio's private data has been cleared, it's possible for another
> > > > > process to clear the folio->mapping (e.g. via invalidate_complete_folio2
> > > > > or evict_mapping_folio), so it wouldn't be safe to call
> > > > > nfs_page_to_inode() after that.
> > > > > 
> > > > > Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use folios")
> > > > > Signed-off-by: Scott Mayhew <smayhew@redhat.com>
> > > > > ---
> > > > >  fs/nfs/write.c | 4 +++-
> > > > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> > > > > index f4cca8f00c0c..489c3f9dae23 100644
> > > > > --- a/fs/nfs/write.c
> > > > > +++ b/fs/nfs/write.c
> > > > > @@ -785,6 +785,8 @@ static void nfs_inode_add_request(struct nfs_page *req)
> > > > >   */
> > > > >  static void nfs_inode_remove_request(struct nfs_page *req)
> > > > >  {
> > > > > +	struct nfs_inode *nfsi = NFS_I(nfs_page_to_inode(req));
> > > > > +
> > > > >  	if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
> > > > >  		struct folio *folio = nfs_page_to_folio(req->wb_head);
> > > > >  		struct address_space *mapping = folio_file_mapping(folio);
> > > > > @@ -800,7 +802,7 @@ static void nfs_inode_remove_request(struct nfs_page *req)
> > > > > 
> > > > >  	if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags)) {
> > > > >  		nfs_release_request(req);
> > > > > -		atomic_long_dec(&NFS_I(nfs_page_to_inode(req))->nrequests);
> > > > > +		atomic_long_dec(&nfsi->nrequests);
> > > > 
> > > > Why not just invert the order of the atomic_long_dec() and the
> > > > nfs_release_request()? That way you are also ensuring that the
> > > > inode is still pinned in memory by the open context.
> > > 
> > > I'm not following.  How does inverting the order prevent the
> > > folio->mapping from getting clobbered?
> > 
> > The open/lock context is refcounted by the nfs_page until the latter is
> > released. That's why the inode is guaranteed to remain around at least
> > until the call to nfs_release_request().
> 
> The problem is not that the inode is going away, but rather that we
> can't guarantee that the page is still part of the mapping at this
> point, and so we can't safely dereference page->mapping there. I do see
> that nfs_release_request releases a reference to the page, but I don't
> think that's sufficient to ensure that it remains part of the mapping.
> 
> AFAICT, once we clear page->private, the page is subject to be removed
> from the mapping. So, I *think* it's safe to just move the call to
> nfs_page_to_inode prior to the call to nfs_page_group_sync_on_bit.

Yeah, the inode hasn't gone away.  I can pick the nfs_commit_data
address off the stack in nfs_commit_release_pages:

crash> nfs_commit_data.inode c0000006774cae00
  inode = 0xc00000006c1b05f8,

The nfs_inode is still allocated:

crash> kmem 0xc00000006c1b05f8
CACHE             OBJSIZE  ALLOCATED     TOTAL  SLABS  SSIZE  NAME
c000000030332600     1088        128      5959    101    64k  nfs_inode_cache
  SLAB              MEMORY            NODE  TOTAL  ALLOCATED  FREE
  c00c0000001b06c0  c00000006c1b0000     0     59          1    58
  FREE / [ALLOCATED]
  [c00000006c1b0448]

      PAGE          PHYSICAL      MAPPING           INDEX CNT FLAGS
c00c0000001b06c0  6c1b0000 c000000030332600 c00000006c1b4480  1 23ffff800000200 slab

The vfs_inode:

crash> px &((struct nfs_inode *)0xc00000006c1b0448)->vfs_inode
$7 = (struct inode *) 0xc00000006c1b05f8

Matches the inodes open by both nfs_flock programs from the test:

crash> foreach nfs_flock files
PID: 4006780  TASK: c00000009d436600  CPU: 43   COMMAND: "nfs_flock"
ROOT: /    CWD: /tmp/ltp-aFr4AJt3R9/LTP_nfslock01.9hyHNgoKqq/3/0/
 FD       FILE            DENTRY           INODE       TYPE PATH
  0 c000000196e9a000 c000000004090840 c00000000ae13bf0 CHR  /dev/null
  1 c0000000bfd1ff00 c0000000963e0e40 c00000006c573900 REG  /opt/ltp/output/nfslock01.sh_20230610112802
  2 c0000000bfd1ff00 c0000000963e0e40 c00000006c573900 REG  /opt/ltp/output/nfslock01.sh_20230610112802
  3 c000000196e97700 c000000419ccb040 c00000006c1b05f8 REG  /tmp/ltp-aFr4AJt3R9/LTP_nfslock01.9hyHNgoKqq/3/0/flock_idata
                                      ^^^^^^^^^^^^^^^^
PID: 4006781  TASK: c00000009d42d500  CPU: 42   COMMAND: "nfs_flock"
ROOT: /    CWD: /tmp/ltp-aFr4AJt3R9/LTP_nfslock01.9hyHNgoKqq/3/0/
 FD       FILE            DENTRY           INODE       TYPE PATH
  0 c0000000f0812200 c000000004090840 c00000000ae13bf0 CHR  /dev/null
  1 c0000000bfd1ff00 c0000000963e0e40 c00000006c573900 REG  /opt/ltp/output/nfslock01.sh_20230610112802
  2 c0000000bfd1ff00 c0000000963e0e40 c00000006c573900 REG  /opt/ltp/output/nfslock01.sh_20230610112802
  3 c0000000f0813c00 c000000419ccb040 c00000006c1b05f8 REG  /tmp/ltp-aFr4AJt3R9/LTP_nfslock01.9hyHNgoKqq/3/0/flock_idata
                                      ^^^^^^^^^^^^^^^^

The file->f_mapping for both struct files matches the inode->i_data:

crash> file.f_mapping c000000196e97700
  f_mapping = 0xc00000006c1b0770,
crash> file.f_mapping c0000000f0813c00
  f_mapping = 0xc00000006c1b0770,
crash> px &((struct inode *)0xc00000006c1b05f8)->i_data
$8 = (struct address_space *) 0xc00000006c1b0770

and if I look at one of those nfs_flock tasks, the folio passed in to
nfs_read_folio has the same mapping:

crash> bt 4006781
PID: 4006781  TASK: c00000009d42d500  CPU: 42   COMMAND: "nfs_flock"
 #0 [c000000177053710] __schedule at c000000000f61d9c
 #1 [c0000001770537d0] schedule at c000000000f621f4
 #2 [c000000177053840] io_schedule at c000000000f62354
 #3 [c000000177053870] folio_wait_bit_common at c00000000042dc60
 #4 [c000000177053970] nfs_read_folio at c0080000050108a8 [nfs]
 #5 [c000000177053a60] nfs_write_begin at c008000004fff06c [nfs]
 #6 [c000000177053b10] generic_perform_write at c00000000042b044
 #7 [c000000177053bc0] nfs_file_write at c008000004ffda08 [nfs]
 #8 [c000000177053c60] new_sync_write at c00000000057fdd8
 #9 [c000000177053d10] vfs_write at c000000000582fd4
#10 [c000000177053d60] ksys_write at c0000000005833a4
#11 [c000000177053db0] system_call_exception at c00000000002f434
#12 [c000000177053e10] system_call_vectored_common at c00000000000bfe8

crash> folio.mapping c00c000000564400
  mapping = 0xc00000006c1b0770,

It's just that if we go back to the nfs_page being released by our panic
task, the folio->mapping has been cleared, so we panic when we try to go
folio->mapping->host.

crash> nfs_page c00000016fb2a600
struct nfs_page {
  wb_list = {
    next = 0xc00000016fb2a600,
    prev = 0xc00000016fb2a600
  },
  {
    wb_page = 0xc00c000001d49580,
    wb_folio = 0xc00c000001d49580
  },
  wb_lock_context = 0xc00000010518b2c0,
  wb_index = 0x1,
  wb_offset = 0x6940,
  wb_pgbase = 0x6940,
  wb_bytes = 0x40,
  wb_kref = {
    refcount = {
      refs = {
        counter = 0x1
      }
    }
  },
  wb_flags = 0x5,
  wb_verf = {
    data = "\214\205_d\214\210W\036"
  },
  wb_this_page = 0xc00000016fb2a600,
  wb_head = 0xc00000016fb2a600,
  wb_nio = 0x0
}
crash> folio.mapping 0xc00c000001d49580
  mapping = 0x0,

-Scott

> -- 
> Jeff Layton <jlayton@kernel.org>
On 25 Jul 2023, at 11:08, Scott Mayhew wrote:

> Once a folio's private data has been cleared, it's possible for another
> process to clear the folio->mapping (e.g. via invalidate_complete_folio2
> or evict_mapping_folio), so it wouldn't be safe to call
> nfs_page_to_inode() after that.
> 
> Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use folios")
> Signed-off-by: Scott Mayhew <smayhew@redhat.com>

Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Tested-by: Benjamin Coddington <bcodding@redhat.com>

Ben
On Tue, 2023-07-25 at 11:08 -0400, Scott Mayhew wrote:
> Once a folio's private data has been cleared, it's possible for another
> process to clear the folio->mapping (e.g. via invalidate_complete_folio2
> or evict_mapping_folio), so it wouldn't be safe to call
> nfs_page_to_inode() after that.
> 
> Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use folios")
> Signed-off-by: Scott Mayhew <smayhew@redhat.com>
> ---
>  fs/nfs/write.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index f4cca8f00c0c..489c3f9dae23 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -785,6 +785,8 @@ static void nfs_inode_add_request(struct nfs_page *req)
>   */
>  static void nfs_inode_remove_request(struct nfs_page *req)
>  {
> +	struct nfs_inode *nfsi = NFS_I(nfs_page_to_inode(req));
> +
>  	if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
>  		struct folio *folio = nfs_page_to_folio(req->wb_head);
>  		struct address_space *mapping = folio_file_mapping(folio);
> @@ -800,7 +802,7 @@ static void nfs_inode_remove_request(struct nfs_page *req)
> 
>  	if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags)) {
>  		nfs_release_request(req);
> -		atomic_long_dec(&NFS_I(nfs_page_to_inode(req))->nrequests);
> +		atomic_long_dec(&nfsi->nrequests);
>  	}
>  }

Reviewed-by: Jeff Layton <jlayton@kernel.org>
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index f4cca8f00c0c..489c3f9dae23 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -785,6 +785,8 @@ static void nfs_inode_add_request(struct nfs_page *req)
  */
 static void nfs_inode_remove_request(struct nfs_page *req)
 {
+	struct nfs_inode *nfsi = NFS_I(nfs_page_to_inode(req));
+
 	if (nfs_page_group_sync_on_bit(req, PG_REMOVE)) {
 		struct folio *folio = nfs_page_to_folio(req->wb_head);
 		struct address_space *mapping = folio_file_mapping(folio);
@@ -800,7 +802,7 @@ static void nfs_inode_remove_request(struct nfs_page *req)
 
 	if (test_and_clear_bit(PG_INODE_REF, &req->wb_flags)) {
 		nfs_release_request(req);
-		atomic_long_dec(&NFS_I(nfs_page_to_inode(req))->nrequests);
+		atomic_long_dec(&nfsi->nrequests);
 	}
 }
Once a folio's private data has been cleared, it's possible for another
process to clear the folio->mapping (e.g. via invalidate_complete_folio2
or evict_mapping_folio), so it wouldn't be safe to call
nfs_page_to_inode() after that.

Fixes: 0c493b5cf16e ("NFS: Convert buffered writes to use folios")
Signed-off-by: Scott Mayhew <smayhew@redhat.com>
---
 fs/nfs/write.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)