
libceph: avoid a __vmalloc() deadlock in ceph_kvmalloc()

Message ID 20190910151748.914-1-idryomov@gmail.com (mailing list archive)
State New, archived

Commit Message

Ilya Dryomov Sept. 10, 2019, 3:17 p.m. UTC
The vmalloc allocator doesn't fully respect the specified gfp mask:
while the actual pages are allocated as requested, the page table pages
are always allocated with GFP_KERNEL.  ceph_kvmalloc() may be called
with GFP_NOFS and GFP_NOIO (for ceph and rbd respectively), so this may
result in a deadlock.

There is no real reason for the current PAGE_ALLOC_COSTLY_ORDER logic;
it's just something that seemed sensible at the time (ceph_kvmalloc()
predates kvmalloc()).  kvmalloc() is smarter: in an attempt to reduce
long-term fragmentation, it first tries to kmalloc non-disruptively.

Switch to kvmalloc() and set the respective PF_MEMALLOC_* flag using
the scope API to avoid the deadlock.  Note that kvmalloc() needs to be
passed GFP_KERNEL to enable the fallback.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
---
 net/ceph/ceph_common.c | 29 +++++++++++++++++++++++------
 1 file changed, 23 insertions(+), 6 deletions(-)

Comments

Christoph Hellwig Sept. 11, 2019, 6:31 a.m. UTC | #1
On Tue, Sep 10, 2019 at 05:17:48PM +0200, Ilya Dryomov wrote:
> The vmalloc allocator doesn't fully respect the specified gfp mask:
> while the actual pages are allocated as requested, the page table pages
> are always allocated with GFP_KERNEL.  ceph_kvmalloc() may be called
> with GFP_NOFS and GFP_NOIO (for ceph and rbd respectively), so this may
> result in a deadlock.
> 
> There is no real reason for the current PAGE_ALLOC_COSTLY_ORDER logic,
> it's just something that seemed sensible at the time (ceph_kvmalloc()
> predates kvmalloc()).  kvmalloc() is smarter: in an attempt to reduce
> long term fragmentation, it first tries to kmalloc non-disruptively.
> 
> Switch to kvmalloc() and set the respective PF_MEMALLOC_* flag using
> the scope API to avoid the deadlock.  Note that kvmalloc() needs to be
> passed GFP_KERNEL to enable the fallback.

If you can, please just stop using GFP_NOFS altogether and set
PF_MEMALLOC_* for the actual contexts.
Ilya Dryomov Sept. 11, 2019, 1:33 p.m. UTC | #2
On Wed, Sep 11, 2019 at 8:32 AM Christoph Hellwig <hch@infradead.org> wrote:
>
> On Tue, Sep 10, 2019 at 05:17:48PM +0200, Ilya Dryomov wrote:
> > The vmalloc allocator doesn't fully respect the specified gfp mask:
> > while the actual pages are allocated as requested, the page table pages
> > are always allocated with GFP_KERNEL.  ceph_kvmalloc() may be called
> > with GFP_NOFS and GFP_NOIO (for ceph and rbd respectively), so this may
> > result in a deadlock.
> >
> > There is no real reason for the current PAGE_ALLOC_COSTLY_ORDER logic,
> > it's just something that seemed sensible at the time (ceph_kvmalloc()
> > predates kvmalloc()).  kvmalloc() is smarter: in an attempt to reduce
> > long term fragmentation, it first tries to kmalloc non-disruptively.
> >
> > Switch to kvmalloc() and set the respective PF_MEMALLOC_* flag using
> > the scope API to avoid the deadlock.  Note that kvmalloc() needs to be
> > passed GFP_KERNEL to enable the fallback.
>
> If you can please just stop using GFP_NOFS altogether and set
> PF_MEMALLOC_* for the actual contexts.

Hi Christoph,

ceph_kvmalloc() is indirectly called from dozens of places, everywhere
a new RPC message is allocated.  Some of them are used for client setup
and don't need a scope (GFP_KERNEL is fine), but the vast majority do.
I don't think wrapping each call is practical.

As for getting rid of GFP_NOFS and GFP_NOIO entirely (i.e. dropping the
gfp mask from all libceph APIs and using scopes instead), it's something
that I have had in the back of my head for a while now because we cheat
in a few places and hard-code GFP_NOIO as the lowest common denominator
instead of properly propagating the gfp mask.  It's more of a project
though, and won't be backportable.

Thanks,

                Ilya
Jeff Layton Sept. 11, 2019, 2:45 p.m. UTC | #3
On Tue, 2019-09-10 at 17:17 +0200, Ilya Dryomov wrote:
> The vmalloc allocator doesn't fully respect the specified gfp mask:
> while the actual pages are allocated as requested, the page table pages
> are always allocated with GFP_KERNEL.  ceph_kvmalloc() may be called
> with GFP_NOFS and GFP_NOIO (for ceph and rbd respectively), so this may
> result in a deadlock.
> 
> There is no real reason for the current PAGE_ALLOC_COSTLY_ORDER logic,
> it's just something that seemed sensible at the time (ceph_kvmalloc()
> predates kvmalloc()).  kvmalloc() is smarter: in an attempt to reduce
> long term fragmentation, it first tries to kmalloc non-disruptively.
> 
> Switch to kvmalloc() and set the respective PF_MEMALLOC_* flag using
> the scope API to avoid the deadlock.  Note that kvmalloc() needs to be
> passed GFP_KERNEL to enable the fallback.
> 
> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
> ---
>  net/ceph/ceph_common.c | 29 +++++++++++++++++++++++------
>  1 file changed, 23 insertions(+), 6 deletions(-)
> 
> diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c
> index c41789154cdb..970e74b46213 100644
> --- a/net/ceph/ceph_common.c
> +++ b/net/ceph/ceph_common.c
> @@ -13,6 +13,7 @@
>  #include <linux/nsproxy.h>
>  #include <linux/fs_parser.h>
>  #include <linux/sched.h>
> +#include <linux/sched/mm.h>
>  #include <linux/seq_file.h>
>  #include <linux/slab.h>
>  #include <linux/statfs.h>
> @@ -185,18 +186,34 @@ int ceph_compare_options(struct ceph_options *new_opt,
>  }
>  EXPORT_SYMBOL(ceph_compare_options);
>  
> +/*
> + * kvmalloc() doesn't fall back to the vmalloc allocator unless flags are
> + * compatible with (a superset of) GFP_KERNEL.  This is because while the
> + * actual pages are allocated with the specified flags, the page table pages
> + * are always allocated with GFP_KERNEL.  map_vm_area() doesn't even take
> + * flags because GFP_KERNEL is hard-coded in {p4d,pud,pmd,pte}_alloc().
> + *
> + * ceph_kvmalloc() may be called with GFP_KERNEL, GFP_NOFS or GFP_NOIO.
> + */
>  void *ceph_kvmalloc(size_t size, gfp_t flags)
>  {
> -	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
> -		void *ptr = kmalloc(size, flags | __GFP_NOWARN);
> -		if (ptr)
> -			return ptr;
> +	void *p;
> +
> +	if ((flags & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS)) {
> +		p = kvmalloc(size, flags);
> +	} else if ((flags & (__GFP_IO | __GFP_FS)) == __GFP_IO) {
> +		unsigned int nofs_flag = memalloc_nofs_save();
> +		p = kvmalloc(size, GFP_KERNEL);
> +		memalloc_nofs_restore(nofs_flag);
> +	} else {
> +		unsigned int noio_flag = memalloc_noio_save();
> +		p = kvmalloc(size, GFP_KERNEL);
> +		memalloc_noio_restore(noio_flag);
>  	}
>  
> -	return __vmalloc(size, flags, PAGE_KERNEL);
> +	return p;
>  }
>  
> -
>  static int parse_fsid(const char *str, struct ceph_fsid *fsid)
>  {
>  	int i = 0;

Reviewed-by: Jeff Layton <jlayton@kernel.org>

Patch

diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c
index c41789154cdb..970e74b46213 100644
--- a/net/ceph/ceph_common.c
+++ b/net/ceph/ceph_common.c
@@ -13,6 +13,7 @@ 
 #include <linux/nsproxy.h>
 #include <linux/fs_parser.h>
 #include <linux/sched.h>
+#include <linux/sched/mm.h>
 #include <linux/seq_file.h>
 #include <linux/slab.h>
 #include <linux/statfs.h>
@@ -185,18 +186,34 @@  int ceph_compare_options(struct ceph_options *new_opt,
 }
 EXPORT_SYMBOL(ceph_compare_options);
 
+/*
+ * kvmalloc() doesn't fall back to the vmalloc allocator unless flags are
+ * compatible with (a superset of) GFP_KERNEL.  This is because while the
+ * actual pages are allocated with the specified flags, the page table pages
+ * are always allocated with GFP_KERNEL.  map_vm_area() doesn't even take
+ * flags because GFP_KERNEL is hard-coded in {p4d,pud,pmd,pte}_alloc().
+ *
+ * ceph_kvmalloc() may be called with GFP_KERNEL, GFP_NOFS or GFP_NOIO.
+ */
 void *ceph_kvmalloc(size_t size, gfp_t flags)
 {
-	if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
-		void *ptr = kmalloc(size, flags | __GFP_NOWARN);
-		if (ptr)
-			return ptr;
+	void *p;
+
+	if ((flags & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS)) {
+		p = kvmalloc(size, flags);
+	} else if ((flags & (__GFP_IO | __GFP_FS)) == __GFP_IO) {
+		unsigned int nofs_flag = memalloc_nofs_save();
+		p = kvmalloc(size, GFP_KERNEL);
+		memalloc_nofs_restore(nofs_flag);
+	} else {
+		unsigned int noio_flag = memalloc_noio_save();
+		p = kvmalloc(size, GFP_KERNEL);
+		memalloc_noio_restore(noio_flag);
 	}
 
-	return __vmalloc(size, flags, PAGE_KERNEL);
+	return p;
 }
 
-
 static int parse_fsid(const char *str, struct ceph_fsid *fsid)
 {
 	int i = 0;