[2/4] ceph: fix buffer free while holding i_ceph_lock in __ceph_setxattr()

Message ID 20190719143222.16058-3-lhenriques@suse.com
State New
Series
  • Sleeping functions in invalid context bug fixes

Commit Message

Luis Henriques July 19, 2019, 2:32 p.m. UTC
Calling ceph_buffer_put() in __ceph_setxattr() may end up freeing the
i_xattrs.prealloc_blob buffer while holding the i_ceph_lock.  This can be
fixed by postponing the call until later, when the lock is released.

The following backtrace was triggered by fstests generic/117.

  BUG: sleeping function called from invalid context at mm/vmalloc.c:2283
  in_atomic(): 1, irqs_disabled(): 0, pid: 650, name: fsstress
  3 locks held by fsstress/650:
   #0: 00000000870a0fe8 (sb_writers#8){.+.+}, at: mnt_want_write+0x20/0x50
   #1: 00000000ba0c4c74 (&type->i_mutex_dir_key#6){++++}, at: vfs_setxattr+0x55/0xa0
   #2: 000000008dfbb3f2 (&(&ci->i_ceph_lock)->rlock){+.+.}, at: __ceph_setxattr+0x297/0x810
  CPU: 1 PID: 650 Comm: fsstress Not tainted 5.2.0+ #437
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x67/0x90
   ___might_sleep.cold+0x9f/0xb1
   vfree+0x4b/0x60
   ceph_buffer_release+0x1b/0x60
   __ceph_setxattr+0x2b4/0x810
   __vfs_setxattr+0x66/0x80
   __vfs_setxattr_noperm+0x59/0xf0
   vfs_setxattr+0x81/0xa0
   setxattr+0x115/0x230
   ? filename_lookup+0xc9/0x140
   ? rcu_read_lock_sched_held+0x74/0x80
   ? rcu_sync_lockdep_assert+0x2e/0x60
   ? __sb_start_write+0x142/0x1a0
   ? mnt_want_write+0x20/0x50
   path_setxattr+0xba/0xd0
   __x64_sys_lsetxattr+0x24/0x30
   do_syscall_64+0x50/0x1c0
   entry_SYSCALL_64_after_hwframe+0x49/0xbe
  RIP: 0033:0x7ff23514359a

Signed-off-by: Luis Henriques <lhenriques@suse.com>
---
 fs/ceph/xattr.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

Comments

Jeff Layton July 19, 2019, 11:07 p.m. UTC | #1
On Fri, 2019-07-19 at 15:32 +0100, Luis Henriques wrote:
> Calling ceph_buffer_put() in __ceph_setxattr() may end up freeing the
> i_xattrs.prealloc_blob buffer while holding the i_ceph_lock.  This can be
> fixed by postponing the call until later, when the lock is released.
> 
> The following backtrace was triggered by fstests generic/117.
> 
>   BUG: sleeping function called from invalid context at mm/vmalloc.c:2283
>   in_atomic(): 1, irqs_disabled(): 0, pid: 650, name: fsstress
>   3 locks held by fsstress/650:
>    #0: 00000000870a0fe8 (sb_writers#8){.+.+}, at: mnt_want_write+0x20/0x50
>    #1: 00000000ba0c4c74 (&type->i_mutex_dir_key#6){++++}, at: vfs_setxattr+0x55/0xa0
>    #2: 000000008dfbb3f2 (&(&ci->i_ceph_lock)->rlock){+.+.}, at: __ceph_setxattr+0x297/0x810
>   CPU: 1 PID: 650 Comm: fsstress Not tainted 5.2.0+ #437
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
>   Call Trace:
>    dump_stack+0x67/0x90
>    ___might_sleep.cold+0x9f/0xb1
>    vfree+0x4b/0x60
>    ceph_buffer_release+0x1b/0x60
>    __ceph_setxattr+0x2b4/0x810
>    __vfs_setxattr+0x66/0x80
>    __vfs_setxattr_noperm+0x59/0xf0
>    vfs_setxattr+0x81/0xa0
>    setxattr+0x115/0x230
>    ? filename_lookup+0xc9/0x140
>    ? rcu_read_lock_sched_held+0x74/0x80
>    ? rcu_sync_lockdep_assert+0x2e/0x60
>    ? __sb_start_write+0x142/0x1a0
>    ? mnt_want_write+0x20/0x50
>    path_setxattr+0xba/0xd0
>    __x64_sys_lsetxattr+0x24/0x30
>    do_syscall_64+0x50/0x1c0
>    entry_SYSCALL_64_after_hwframe+0x49/0xbe
>   RIP: 0033:0x7ff23514359a
> 
> Signed-off-by: Luis Henriques <lhenriques@suse.com>
> ---
>  fs/ceph/xattr.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
> index 37b458a9af3a..c083557b3657 100644
> --- a/fs/ceph/xattr.c
> +++ b/fs/ceph/xattr.c
> @@ -1036,6 +1036,7 @@ int __ceph_setxattr(struct inode *inode, const char *name,
>  	struct ceph_inode_info *ci = ceph_inode(inode);
>  	struct ceph_mds_client *mdsc = ceph_sb_to_client(inode->i_sb)->mdsc;
>  	struct ceph_cap_flush *prealloc_cf = NULL;
> +	struct ceph_buffer *old_blob = NULL;
>  	int issued;
>  	int err;
>  	int dirty = 0;
> @@ -1109,13 +1110,15 @@ int __ceph_setxattr(struct inode *inode, const char *name,
>  		struct ceph_buffer *blob;
>  
>  		spin_unlock(&ci->i_ceph_lock);
> -		dout(" preaallocating new blob size=%d\n", required_blob_size);
> +		ceph_buffer_put(old_blob); /* Shouldn't be required */
> +		dout(" pre-allocating new blob size=%d\n", required_blob_size);
>  		blob = ceph_buffer_new(required_blob_size, GFP_NOFS);
>  		if (!blob)
>  			goto do_sync_unlocked;
>  		spin_lock(&ci->i_ceph_lock);
> +		/* prealloc_blob can't be released while holding i_ceph_lock */
>  		if (ci->i_xattrs.prealloc_blob)
> -			ceph_buffer_put(ci->i_xattrs.prealloc_blob);
> +			old_blob = ci->i_xattrs.prealloc_blob;
>  		ci->i_xattrs.prealloc_blob = blob;
>  		goto retry;
>  	}
> @@ -1131,6 +1134,7 @@ int __ceph_setxattr(struct inode *inode, const char *name,
>  	}
>  
>  	spin_unlock(&ci->i_ceph_lock);
> +	ceph_buffer_put(old_blob);
>  	if (lock_snap_rwsem)
>  		up_read(&mdsc->snap_rwsem);
>  	if (dirty)

(cc'ing Al)

Al pointed out on IRC that vfree should be callable under spinlock. It
only sleeps if !in_interrupt(), and I think that should return true if
we're holding a spinlock.

I'll plan to try replicating this soon.
Al Viro July 19, 2019, 11:23 p.m. UTC | #2
On Fri, Jul 19, 2019 at 07:07:49PM -0400, Jeff Layton wrote:

> Al pointed out on IRC that vfree should be callable under spinlock.

Al had been near-terminally low on caffeine at the time, posted
a retraction a few minutes later and went to grab some coffee...

> It
> only sleeps if !in_interrupt(), and I think that should return true if
> we're holding a spinlock.

It can be used from RCU callbacks and all such; it *can't* be used from
under spinlock - on non-preempt builds there's no way to recognize that.
Al Viro July 19, 2019, 11:30 p.m. UTC | #3
On Sat, Jul 20, 2019 at 12:23:08AM +0100, Al Viro wrote:
> On Fri, Jul 19, 2019 at 07:07:49PM -0400, Jeff Layton wrote:
> 
> > Al pointed out on IRC that vfree should be callable under spinlock.
> 
> Al had been near-terminally low on caffeine at the time, posted
> a retraction a few minutes later and went to grab some coffee...
> 
> > It
> > only sleeps if !in_interrupt(), and I think that should return true if
> > we're holding a spinlock.
> 
> It can be used from RCU callbacks and all such; it *can't* be used from
> under spinlock - on non-preempt builds there's no way to recognize that.

	Re original patch: looks like the sane way to handle that.
Alternatively, we could add kvfree_atomic() for use in such situations,
but I rather doubt that it's a good idea - not unless you need to free
something under a spinlock held over a large area, which is generally
a bad idea to start with...

	Note that vfree_atomic() has only one caller in the entire tree,
BTW.
Jeff Layton July 20, 2019, 12:35 a.m. UTC | #4
On Sat, 2019-07-20 at 00:30 +0100, Al Viro wrote:
> On Sat, Jul 20, 2019 at 12:23:08AM +0100, Al Viro wrote:
> > On Fri, Jul 19, 2019 at 07:07:49PM -0400, Jeff Layton wrote:
> > 
> > > Al pointed out on IRC that vfree should be callable under spinlock.
> > 
> > Al had been near-terminally low on caffeine at the time, posted
> > a retraction a few minutes later and went to grab some coffee...
> > 
> > > It
> > > only sleeps if !in_interrupt(), and I think that should return true if
> > > we're holding a spinlock.
> > 
> > It can be used from RCU callbacks and all such; it *can't* be used from
> > under spinlock - on non-preempt builds there's no way to recognize that.
> 
> 	Re original patch: looks like the sane way to handle that.
> Alternatively, we could add kvfree_atomic() for use in such situations,
> but I rather doubt that it's a good idea - not unless you need to free
> something under a spinlock held over a large area, which is generally
> a bad idea to start with...
> 
> 	Note that vfree_atomic() has only one caller in the entire tree,
> BTW.

In that case, I wonder if we ought to add this to the top of kvfree():

	might_sleep_if(!in_interrupt());

Might there be other places that are calling it under spinlock that are
almost always going down the kfree() path?

Patch

diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index 37b458a9af3a..c083557b3657 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -1036,6 +1036,7 @@  int __ceph_setxattr(struct inode *inode, const char *name,
 	struct ceph_inode_info *ci = ceph_inode(inode);
 	struct ceph_mds_client *mdsc = ceph_sb_to_client(inode->i_sb)->mdsc;
 	struct ceph_cap_flush *prealloc_cf = NULL;
+	struct ceph_buffer *old_blob = NULL;
 	int issued;
 	int err;
 	int dirty = 0;
@@ -1109,13 +1110,15 @@  int __ceph_setxattr(struct inode *inode, const char *name,
 		struct ceph_buffer *blob;
 
 		spin_unlock(&ci->i_ceph_lock);
-		dout(" preaallocating new blob size=%d\n", required_blob_size);
+		ceph_buffer_put(old_blob); /* Shouldn't be required */
+		dout(" pre-allocating new blob size=%d\n", required_blob_size);
 		blob = ceph_buffer_new(required_blob_size, GFP_NOFS);
 		if (!blob)
 			goto do_sync_unlocked;
 		spin_lock(&ci->i_ceph_lock);
+		/* prealloc_blob can't be released while holding i_ceph_lock */
 		if (ci->i_xattrs.prealloc_blob)
-			ceph_buffer_put(ci->i_xattrs.prealloc_blob);
+			old_blob = ci->i_xattrs.prealloc_blob;
 		ci->i_xattrs.prealloc_blob = blob;
 		goto retry;
 	}
@@ -1131,6 +1134,7 @@  int __ceph_setxattr(struct inode *inode, const char *name,
 	}
 
 	spin_unlock(&ci->i_ceph_lock);
+	ceph_buffer_put(old_blob);
 	if (lock_snap_rwsem)
 		up_read(&mdsc->snap_rwsem);
 	if (dirty)