__vmalloc() vs. GFP_NOIO/GFP_NOFS

Message ID 20160103071246.GK9938@ZenIV.linux.org.uk
State New
Headers show

Commit Message

Al Viro Jan. 3, 2016, 7:12 a.m. UTC
While trying to write documentation on allocator choice, I've run
into something odd:
        /*
         * __vmalloc() will allocate data pages and auxiliary structures (e.g.
         * pagetables) with GFP_KERNEL, yet we may be under GFP_NOFS context
         * here. Hence we need to tell memory reclaim that we are in such a
         * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
         * the filesystem here and potentially deadlocking.
         */
in XFS kmem_zalloc_large().  The comment is correct - __vmalloc() (actually,
map_vm_area() called from __vmalloc_area_node()) ignores gfp_flags; prior
to that point it does take care to pass __GFP_IO/__GFP_FS to the page
allocator, but once the data pages are allocated and we get around to
inserting them into page tables those flags are ignored.

Allocating page tables doesn't take a gfp argument at all.  Propagating it
down there could be done, but it's not attractive.

Another approach is memalloc_noio_save(), actually used by XFS and some other
__vmalloc() callers that might be getting GFP_NOIO or GFP_NOFS.  That
works, but not all such callers are using that mechanism.  For example,
drbd bm_realloc_pages() has GFP_NOIO __vmalloc() with no memalloc_noio_...
in sight.  Either that GFP_NOIO is not needed there (quite possible) or
there's a deadlock in that code.  The same goes for ipoib.c ipoib_cm_tx_init();
again, either that GFP_NOIO is not needed, or it can deadlock.

Those, AFAICS, are the only such callers with GFP_NOIO; however, there's a shitload
of GFP_NOFS ones.  XFS uses memalloc_noio_save(), but a _lot_ of other
callers do not.  For example, all call chains leading to ceph_kvmalloc()
pass GFP_NOFS and none of them is under memalloc_noio_save().  The same
goes for GFS2 __vmalloc() callers, etc.  Again, quite a few of those probably
do not need GFP_NOFS at all, but those that do would appear to have
hard-to-trigger deadlocks.

Why do we do that in callers, though?  I.e. why not do something like this:

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Al Viro Jan. 3, 2016, 4:56 p.m. UTC | #1
On Sun, Jan 03, 2016 at 07:12:47AM +0000, Al Viro wrote:

> Allocating page tables doesn't take a gfp argument at all.  Propagating it
> down there could be done, but it's not attractive.

While we are at it, is there ever a reason to _not_ pass __GFP_HIGHMEM in
__vmalloc() flags?  After all, we explicitly put the pages we'd allocated
into the page table at the vmalloc range we'd grabbed, and these are the
addresses visible to the caller.  Is there any point in having another alias
for those pages?

vmalloc() itself passes __GFP_HIGHMEM and so do a lot of __vmalloc()
callers; in fact, most of those that do not look like the result of
"we want vmalloc(), but we want to avoid it going into fs code and possibly
deadlocking us; vmalloc() has no gfp_t argument, so let's use __vmalloc()
and give it GFP_NOFS".

Another very weird thing is the use of GFP_ATOMIC by alloc_large_system_hash();
if we want _that_ honoured, we'd probably have to pass gfp_t to alloc_one_pmd()
and friends, but I'm not sure what exactly that caller is requesting.
Confused...
Dave Chinner Jan. 3, 2016, 8:12 p.m. UTC | #2
On Sun, Jan 03, 2016 at 07:12:47AM +0000, Al Viro wrote:
> 	While trying to write documentation on allocator choice, I've run
> into something odd:
>         /*
>          * __vmalloc() will allocate data pages and auxiliary structures (e.g.
>          * pagetables) with GFP_KERNEL, yet we may be under GFP_NOFS context
>          * here. Hence we need to tell memory reclaim that we are in such a
>          * context via PF_MEMALLOC_NOIO to prevent memory reclaim re-entering
>          * the filesystem here and potentially deadlocking.
>          */
> in XFS kmem_zalloc_large().  The comment is correct - __vmalloc() (actually,
> map_vm_area() called from __vmalloc_area_node()) ignores gfp_flags; prior
> to that point it does take care to pass __GFP_IO/__GFP_FS to the page
> allocator, but once the data pages are allocated and we get around to
> inserting them into page tables those flags are ignored.
> 
> Allocating page tables doesn't take a gfp argument at all.  Propagating it
> down there could be done, but it's not attractive.

Patches were written to do this years ago:

https://lkml.org/lkml/2012/4/23/77

But, well, using vmalloc is "lame"(*) and so it never got fixed. I
did have a rant about the "nobody should use vmalloc" answer to any
problem reported with vmalloc at the time:

https://lkml.org/lkml/2012/6/13/628

Nothing has really changed, except that we ended up with a
per-task flag hack similar to what was suggested here:

https://lkml.org/lkml/2012/4/25/475

> Another approach is memalloc_noio_save(), actually used by XFS and some other
> __vmalloc() callers that might be getting GFP_NOIO or GFP_NOFS.  That
> works, but not all such callers are using that mechanism.  For example,
> drbd bm_realloc_pages() has GFP_NOIO __vmalloc() with no memalloc_noio_...
> in sight.  Either that GFP_NOIO is not needed there (quite possible) or
> there's a deadlock in that code.  The same goes for ipoib.c ipoib_cm_tx_init();
> again, either that GFP_NOIO is not needed, or it can deadlock.
> 
> Those, AFAICS, are such callers with GFP_NOIO; however, there's a shitload
> of GFP_NOFS ones.  XFS uses memalloc_noio_save(), but a _lot_ of other
> callers do not.  For example, all call chains leading to ceph_kvmalloc()
> pass GFP_NOFS and none of them is under memalloc_noio_save().  The same
> goes for GFS2 __vmalloc() callers, etc.  Again, quite a few of those probably
> do not need GFP_NOFS at all, but those that do would appear to have
> hard-to-trigger deadlocks.

Yup, this has been addressed piecemeal in subsystems that can
reproduce vmalloc deadlocks, or at least have produced lockdep
warnings about it because most developers don't realise that vmalloc
is not fs/io context safe.

> Why do we do that in callers, though? 

I think it's because nobody could get a change for vmalloc actually
accepted (see "lame" comments above) and so per-callsite flag hacks
are the path of least resistance.

> I.e. why not do something like this:
> 
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 8e3c9c5..412c5d6 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -1622,6 +1622,16 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  			cond_resched();
>  	}
>  
> +	if (unlikely(!(gfp_mask & __GFP_IO))) {
> +		unsigned flags = memalloc_noio_save();
> +		if (map_vm_area(area, prot, pages)) {
> +			memalloc_noio_restore(flags);
> +			goto fail;
> +		}
> +		memalloc_noio_restore(flags);
> +		return area->addr;
> +	}
> +
>  	if (map_vm_area(area, prot, pages))
>  		goto fail;
>  	return area->addr;

That'd be a nice start, though it doesn't address callers of
vm_map_ram(), which also has hard-coded GFP_KERNEL allocation masks
for various allocations. It should probably also have the comment
from the XFS code added to it.

Cheers,

Dave.
Al Viro Jan. 3, 2016, 8:35 p.m. UTC | #3
On Mon, Jan 04, 2016 at 07:12:33AM +1100, Dave Chinner wrote:

> That'd be a nice start, though it doesn't address callers of
> vm_map_ram() which also has hard-coded GFP_KERNEL allocation masks
> for various allocations.

... all 3 of them, that is - XFS, android/ion/ion_heap.c and
v4l2-core.  5 call sites total.  Adding a gfp_t argument to those
shouldn't be an issue...

BTW, a far scarier one is not GFP_NOFS or GFP_NOIO - there's a weird
caller passing GFP_ATOMIC to __vmalloc(), for no reason I can guess.

_That_ really couldn't be handled without passing gfp_t to page allocation
primitives, but I very much doubt that it's needed there at all; it's in
alloc_large_system_hash() and I really cannot imagine a situation when
it would be used in e.g. a nonblocking context.

Folks, what is that one for?
Tetsuo Handa Jan. 4, 2016, 1:40 p.m. UTC | #4
On 2016/01/03 16:12, Al Viro wrote:
> Those, AFAICS, are such callers with GFP_NOIO; however, there's a shitload
> of GFP_NOFS ones.  XFS uses memalloc_noio_save(), but a _lot_ of other
> callers do not.  For example, all call chains leading to ceph_kvmalloc()
> pass GFP_NOFS and none of them is under memalloc_noio_save().  The same
> goes for GFS2 __vmalloc() callers, etc.  Again, quite a few of those probably
> do not need GFP_NOFS at all, but those that do would appear to have
> hard-to-trigger deadlocks.
> 
> Why do we do that in callers, though?  I.e. why not do something like this:

This problem is not specific to vmalloc(). It is difficult for
non-fs developers to determine whether they need to use GFP_NOFS rather
than GFP_KERNEL in their code. Can't we annotate GFP_NOFS/GFP_NOIO sections
like http://marc.info/?l=linux-mm&m=142797559822655 ?

Michal Hocko Jan. 5, 2016, 3:35 p.m. UTC | #5
On Sun 03-01-16 20:35:14, Al Viro wrote:
[...]
> BTW, far scarier one is not GFP_NOFS or GFP_IO - there's a weird
> caller passing GFP_ATOMIC to __vmalloc(), for no reason I can guess.
> 
> _That_ really couldn't be handled without passing gfp_t to page allocation
> primitives, but I very much doubt that it's needed there at all; it's in
> alloc_large_system_hash() and I really cannot imagine a situation when
> it would be used in e.g. a nonblocking context.

Yeah, this is an __init context. The original commit that added it
doesn't explain GFP_ATOMIC at all; it just converted alloc_bootmem to
__vmalloc or __get_free_pages, depending on the size. So we can only guess
that it wanted to (ab)use memory reserves.

Patch

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 8e3c9c5..412c5d6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -1622,6 +1622,16 @@  static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 			cond_resched();
 	}
 
+	if (unlikely(!(gfp_mask & __GFP_IO))) {
+		unsigned flags = memalloc_noio_save();
+		if (map_vm_area(area, prot, pages)) {
+			memalloc_noio_restore(flags);
+			goto fail;
+		}
+		memalloc_noio_restore(flags);
+		return area->addr;
+	}
+
 	if (map_vm_area(area, prot, pages))
 		goto fail;
 	return area->addr;