Message ID: 20211122153233.9924-3-mhocko@kernel.org (mailing list archive)
State: New
Series: extend vmalloc support for constrained allocations
On Mon, Nov 22, 2021 at 04:32:31PM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> Dave Chinner has mentioned that some of the xfs code would benefit from
> kvmalloc support for __GFP_NOFAIL because they have allocations that
> cannot fail and they do not fit into a single page.
>
> The large part of the vmalloc implementation already complies with the
> given gfp flags so there is no work for those to be done. The area
> and page table allocations are an exception to that. Implement a retry
> loop for those.
>
> Add a short sleep before retrying. 1 jiffy is a completely random
> timeout. Ideally the retry would wait for an explicit event - e.g.
> a change to the vmalloc space change if the failure was caused by
> the space fragmentation or depletion. But there are multiple different
> reasons to retry and this could become much more complex. Keep the retry
> simple for now and just sleep to prevent from hogging CPUs.
>
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/vmalloc.c | 22 +++++++++++++++++-----
>  1 file changed, 17 insertions(+), 5 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 17ca7001de1f..b6aed4f94a85 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2844,6 +2844,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>  	 * more permissive.
>  	 */
>  	if (!order) {
> +		gfp_t bulk_gfp = gfp & ~__GFP_NOFAIL;
> +
>  		while (nr_allocated < nr_pages) {
>  			unsigned int nr, nr_pages_request;
>
> @@ -2861,12 +2863,12 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
>  			 * but mempolcy want to alloc memory by interleaving.
>  			 */
>  			if (IS_ENABLED(CONFIG_NUMA) && nid == NUMA_NO_NODE)
> -				nr = alloc_pages_bulk_array_mempolicy(gfp,
> +				nr = alloc_pages_bulk_array_mempolicy(bulk_gfp,
>  							nr_pages_request,
>  							pages + nr_allocated);
>
>  			else
> -				nr = alloc_pages_bulk_array_node(gfp, nid,
> +				nr = alloc_pages_bulk_array_node(bulk_gfp, nid,
>  					nr_pages_request,
>  					pages + nr_allocated);
>
> @@ -2921,6 +2923,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  {
>  	const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
>  	const gfp_t orig_gfp_mask = gfp_mask;
> +	bool nofail = gfp_mask & __GFP_NOFAIL;
>  	unsigned long addr = (unsigned long)area->addr;
>  	unsigned long size = get_vm_area_size(area);
>  	unsigned long array_size;
> @@ -2978,8 +2981,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  	else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
>  		flags = memalloc_noio_save();
>
> -	ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> +	do {
> +		ret = vmap_pages_range(addr, addr + size, prot, area->pages,
>  			page_shift);
> +		if (nofail && (ret < 0))
> +			schedule_timeout_uninterruptible(1);
> +	} while (nofail && (ret < 0));
>
>  	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
>  		memalloc_nofs_restore(flags);
> @@ -3074,9 +3081,14 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>  					VM_UNINITIALIZED | vm_flags, start, end, node,
>  					gfp_mask, caller);
>  	if (!area) {
> +		bool nofail = gfp_mask & __GFP_NOFAIL;
>  		warn_alloc(gfp_mask, NULL,
> -			"vmalloc error: size %lu, vm_struct allocation failed",
> -			real_size);
> +			"vmalloc error: size %lu, vm_struct allocation failed%s",
> +			real_size, (nofail) ? ". Retrying." : "");
> +		if (nofail) {
> +			schedule_timeout_uninterruptible(1);
> +			goto again;
> +		}
>  		goto fail;
>  	}
>
> --
> 2.30.2
>
I have raised two concerns in our previous discussion about this change,
well that is sad...

--
Vlad Rezki
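For context, the caller pattern this series is meant to enable, once kvmalloc() also honours the flag, looks roughly like the following minimal sketch; the helper name and size are made up and this is not actual xfs code:

<snip>
#include <linux/mm.h>
#include <linux/slab.h>

/*
 * Hypothetical helper: an allocation that must not fail and may be
 * larger than a single page, so kmalloc(__GFP_NOFAIL) alone is not
 * enough and the vmalloc fallback has to honour the flag as well.
 */
static void *example_alloc_nofail(size_t size)
{
	/*
	 * With __GFP_NOFAIL the allocator is expected to retry
	 * internally, so the caller does not need a NULL check.
	 */
	return kvmalloc(size, GFP_KERNEL | __GFP_NOFAIL);
}
<snip>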
On Tue 23-11-21 20:01:50, Uladzislau Rezki wrote:
[...]
> I have raised two concerns in our previous discussion about this change,
> well that is sad...

I definitely didn't mean to ignore any of the feedback. IIRC we were in
disagreement about the failure mode of the retry loop - i.e. whether to
free all the already allocated pages when the page table pages cannot be
allocated. I still maintain my position, and until there is a wider
consensus on that I will keep my current implementation. The other
concern was about the failure warning, but that shouldn't be a problem
when basing on the current Linus tree. Am I missing something?
On Tue, 23 Nov 2021 20:01:50 +0100 Uladzislau Rezki <urezki@gmail.com> wrote: > On Mon, Nov 22, 2021 at 04:32:31PM +0100, Michal Hocko wrote: > > From: Michal Hocko <mhocko@suse.com> > > > > Dave Chinner has mentioned that some of the xfs code would benefit from > > kvmalloc support for __GFP_NOFAIL because they have allocations that > > cannot fail and they do not fit into a single page. Perhaps we should tell xfs "no, do it internally". Because this is a rather nasty-looking thing - do we want to encourage other callsites to start using it? > > The large part of the vmalloc implementation already complies with the > > given gfp flags so there is no work for those to be done. The area > > and page table allocations are an exception to that. Implement a retry > > loop for those. > > > > Add a short sleep before retrying. 1 jiffy is a completely random > > timeout. Ideally the retry would wait for an explicit event - e.g. > > a change to the vmalloc space change if the failure was caused by > > the space fragmentation or depletion. But there are multiple different > > reasons to retry and this could become much more complex. Keep the retry > > simple for now and just sleep to prevent from hogging CPUs. > > Yes, the horse has already bolted. But we didn't want that horse anyway ;) I added GFP_NOFAIL back in the mesozoic era because quite a lot of sites were doing open-coded try-forever loops. I thought "hey, they shouldn't be doing that in the first place, but let's at least centralize the concept to reduce code size, code duplication and so it's something we can now grep for". But longer term, all GFP_NOFAIL sites should be reworked to no longer need to do the retry-forever thing. In retrospect, this bright idea of mine seems to have added license for more sites to use retry-forever. Sigh. > > + if (nofail) { > > + schedule_timeout_uninterruptible(1); > > + goto again; > > + } The idea behind congestion_wait() is to prevent us from having to hard-wire delays like this. congestion_wait(1) would sleep for up to one millisecond, but will return earlier if reclaim events happened which make it likely that the caller can now proceed with the allocation event, successfully. However it turns out that congestion_wait() was quietly broken at the block level some time ago. We could perhaps resurrect the concept at another level - say by releasing congestion_wait() callers if an amount of memory newly becomes allocatable. This obviously asks for inclusion of zone/node/etc info from the congestion_wait() caller. But that's just an optimization - if the newly-available memory isn't useful to the congestion_wait() caller, they just fail the allocation attempts and wait again. > well that is sad... > I have raised two concerns in our previous discussion about this change, Can you please reiterate those concerns here?
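For readers new to the history, the two idioms being contrasted above look roughly like this; a schematic sketch, with a made-up structure name:

<snip>
struct foo *f;

/* The open-coded try-forever loop that used to be scattered around. */
do {
	f = kmalloc(sizeof(*f), GFP_KERNEL);
	if (!f)
		cond_resched();
} while (!f);

/* The centralized - and greppable - equivalent. */
f = kmalloc(sizeof(*f), GFP_KERNEL | __GFP_NOFAIL);
<snip>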
On Wed, 24 Nov 2021, Andrew Morton wrote: > > I added GFP_NOFAIL back in the mesozoic era because quite a lot of > sites were doing open-coded try-forever loops. I thought "hey, they > shouldn't be doing that in the first place, but let's at least > centralize the concept to reduce code size, code duplication and so > it's something we can now grep for". But longer term, all GFP_NOFAIL > sites should be reworked to no longer need to do the retry-forever > thing. In retrospect, this bright idea of mine seems to have added > license for more sites to use retry-forever. Sigh. One of the costs of not having GFP_NOFAIL (or similar) is lots of untested failure-path code. When does an allocation that is allowed to retry and reclaim ever fail anyway? I think the answer is "only when it has been killed by the oom killer". That of course cannot happen to kernel threads, so maybe kernel threads should never need GFP_NOFAIL?? I'm not sure the above is 100%, but I do think that is the sort of semantic that we want. We want to know what kmalloc failure *means*. We also need well defined and documented strategies to handle it. mempools are one such strategy, but not always suitable. preallocating can also be useful but can be clumsy to implement. Maybe we should support a process preallocating a bunch of pages which can only be used by the process - and are auto-freed when the process returns to user-space. That might allow the "error paths" to be simple and early, and subsequent allocations that were GFP_USEPREALLOC would be safe. i.e. we need a plan for how to rework all those no-fail call-sites. NeilBrown
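For reference, the mempool strategy mentioned above usually takes the following shape; a minimal sketch, with cache, pool and structure names that are illustrative only:

<snip>
#include <linux/mempool.h>
#include <linux/slab.h>

struct foo_request {
	int data;
};

static struct kmem_cache *foo_cache;
static mempool_t *foo_pool;

static int __init foo_init(void)
{
	foo_cache = KMEM_CACHE(foo_request, 0);
	if (!foo_cache)
		return -ENOMEM;

	/* Keep 16 objects in reserve so the writeback path can always progress. */
	foo_pool = mempool_create_slab_pool(16, foo_cache);
	if (!foo_pool) {
		kmem_cache_destroy(foo_cache);
		return -ENOMEM;
	}
	return 0;
}

static void foo_submit(void)
{
	/*
	 * Dips into the reserve under memory pressure and, if that is
	 * exhausted, sleeps until mempool_free() returns an object -
	 * it does not fail.
	 */
	struct foo_request *req = mempool_alloc(foo_pool, GFP_NOIO);

	/* ... fill in and submit the request ... */

	mempool_free(req, foo_pool);
}
<snip>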
On Wed, 24 Nov 2021 14:16:56 +1100 "NeilBrown" <neilb@suse.de> wrote: > On Wed, 24 Nov 2021, Andrew Morton wrote: > > > > I added GFP_NOFAIL back in the mesozoic era because quite a lot of > > sites were doing open-coded try-forever loops. I thought "hey, they > > shouldn't be doing that in the first place, but let's at least > > centralize the concept to reduce code size, code duplication and so > > it's something we can now grep for". But longer term, all GFP_NOFAIL > > sites should be reworked to no longer need to do the retry-forever > > thing. In retrospect, this bright idea of mine seems to have added > > license for more sites to use retry-forever. Sigh. > > One of the costs of not having GFP_NOFAIL (or similar) is lots of > untested failure-path code. Well that's bad of the relevant developers and testers! It isn't that hard to fake up allocation failures. Either with the formal fault injection framework or with ad-hackery. > When does an allocation that is allowed to retry and reclaim ever fail > anyway? I think the answer is "only when it has been killed by the oom > killer". That of course cannot happen to kernel threads, so maybe > kernel threads should never need GFP_NOFAIL?? > I'm not sure the above is 100%, but I do think that is the sort of > semantic that we want. We want to know what kmalloc failure *means*. > We also need well defined and documented strategies to handle it. > mempools are one such strategy, but not always suitable. Well, mempools aren't "handling" it. They're just another layer to make memory allocation attempts appear to be magical. The preferred answer is "just handle the damned error and return ENOMEM". Obviously this gets very painful at times (arguably because of high-level design shortcomings). The old radix_tree_preload approach was bulletproof, but was quite a lot of fuss. > preallocating can also be useful but can be clumsy to implement. Maybe > we should support a process preallocating a bunch of pages which can > only be used by the process - and are auto-freed when the process > returns to user-space. That might allow the "error paths" to be simple > and early, and subsequent allocations that were GFP_USEPREALLOC would be > safe. Yes, I think something like that would be quite feasible. Need to prevent interrupt code from stealing the task's page store. It can be a drag calculating (by hand) what the max possible amount of allocation will be and one can end up allocating and then not using a lot of memory. I forget why radix_tree_preload used a cpu-local store rather than a per-task one. Plus "what order pages would you like" and "on which node" and "in which zone", etc...
On Wed, 24 Nov 2021, Andrew Morton wrote:
> On Wed, 24 Nov 2021 14:16:56 +1100 "NeilBrown" <neilb@suse.de> wrote:
>
> > On Wed, 24 Nov 2021, Andrew Morton wrote:
> > >
> > > I added GFP_NOFAIL back in the mesozoic era because quite a lot of
> > > sites were doing open-coded try-forever loops. I thought "hey, they
> > > shouldn't be doing that in the first place, but let's at least
> > > centralize the concept to reduce code size, code duplication and so
> > > it's something we can now grep for". But longer term, all GFP_NOFAIL
> > > sites should be reworked to no longer need to do the retry-forever
> > > thing. In retrospect, this bright idea of mine seems to have added
> > > license for more sites to use retry-forever. Sigh.
> >
> > One of the costs of not having GFP_NOFAIL (or similar) is lots of
> > untested failure-path code.
>
> Well that's bad of the relevant developers and testers! It isn't that
> hard to fake up allocation failures. Either with the formal fault
> injection framework or with ad-hackery.

If we have CONFIG_RANDOM_ALLOC_PAGE_FAILURE then I might agree.
lockdep is AWESOME. It makes testing for deadlock problems *so* easy.
That is the level of ease-of-use we need if you want people to handle
page-alloc failures reliably.

> > When does an allocation that is allowed to retry and reclaim ever fail
> > anyway? I think the answer is "only when it has been killed by the oom
> > killer". That of course cannot happen to kernel threads, so maybe
> > kernel threads should never need GFP_NOFAIL??
> >
> > I'm not sure the above is 100%, but I do think that is the sort of
> > semantic that we want. We want to know what kmalloc failure *means*.
> > We also need well defined and documented strategies to handle it.
> > mempools are one such strategy, but not always suitable.
>
> Well, mempools aren't "handling" it. They're just another layer to
> make memory allocation attempts appear to be magical. The preferred
> answer is "just handle the damned error and return ENOMEM".

No. mempools are EXACTLY handling it - in a specific context.
When writing out dirty pages so as to free up memory, you often need to
allocate memory. And you may need a sequence of allocations, typically
at different levels in the stack. Global water-marks cannot help
reliably as they might all be used up by top-level requests, and the
lower levels can starve indefinitely. mempools ensure that when memory
is freed, it is returned to the same level it was allocated for. That
ensures you can get at least one request at a time all the way out to
storage. If swap-out just returned ENOMEM, you'd really be in trouble.

> Obviously this gets very painful at times (arguably because of
> high-level design shortcomings). The old radix_tree_preload approach
> was bulletproof, but was quite a lot of fuss.

It would get particularly painful if some system call started returning
-ENOMEM, which had never returned that before. I note that ext4 uses
__GFP_NOFAIL when handling truncate. I don't think user-space would be
happy with ENOMEM from truncate (or fallocate(PUNCH_HOLE)), though a
recent commit which adds it focuses more on wanting to avoid the need
for fsck.

> > preallocating can also be useful but can be clumsy to implement. Maybe
> > we should support a process preallocating a bunch of pages which can
> > only be used by the process - and are auto-freed when the process
> > returns to user-space. That might allow the "error paths" to be simple
> > and early, and subsequent allocations that were GFP_USEPREALLOC would be
> > safe.
>
> Yes, I think something like that would be quite feasible. Need to
> prevent interrupt code from stealing the task's page store.
>
> It can be a drag calculating (by hand) what the max possible amount of
> allocation will be and one can end up allocating and then not using a
> lot of memory.

CONFIG_DEBUG_PREALLOC could force every GFP_USE_PREALLOC request to use
a different page, and warn if there weren't enough. Not perfect, but
useful.
Concerning the "lot of memory", having prealloc as an easy option means
people can use it until it becomes too expensive, then find a better
solution. There will likely always be a better solution, but it might
not often be worth the effort.

> I forget why radix_tree_preload used a cpu-local store rather than a
> per-task one.
>
> Plus "what order pages would you like" and "on which node" and "in
> which zone", etc...

"what order" - only order-0 I hope. I'd hazard a guess that 90% of
current NOFAIL allocations only need one page (providing slub is used -
slab seems to insist on high-order pages sometimes).
"which node" - whichever. Unless __GFP_HARDWALL is set, alloc_page()
will fall back to "whichever" anyway, and NOFAIL with HARDWALL is
probably a poor choice.
"which zone" - NORMAL. I cannot find any NOFAIL allocations that want
DMA. fs/ntfs asks for __GFP_HIGHMEM with NOFAIL, but that doesn't
*require* highmem.

Of course, before designing this interface too precisely we should check
if anyone can use it. From a quick look through some of the 100-ish
users of __GFP_NOFAIL I'd guess that mempools would help - the
preallocation should happen at init-time, not request-time. Maybe if we
made mempools even more light weight .... though that risks allocating a
lot of memory that will never get used.

This brings me back to the idea that
   alloc_page(wait and reclaim allowed)
should only fail on OOM_KILL. That way kernel threads are safe, and
user-threads are free to return ENOMEM knowing it won't get to
user-space. If user-thread code really needs NOFAIL, it punts to a
workqueue and waits - aborting the wait if it is killed, while the work
item still runs eventually.

NeilBrown
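To make the shape of that proposal easier to picture, here is a purely hypothetical sketch; prealloc_pages(), prealloc_release() and __GFP_USEPREALLOC do not exist in the kernel and are only meant to illustrate the suggested API:

<snip>
/* Hypothetical interface - not actual kernel API. */
int prealloc_pages(unsigned int nr);	/* reserve order-0 pages for current */
void prealloc_release(void);		/* would be auto-called on return to user-space */

static long example_syscall(void)
{
	struct page *page;

	/* Fail early, before any locks are taken or state is modified. */
	if (prealloc_pages(4))
		return -ENOMEM;

	/*
	 * Later allocations draw from the task-local reserve and are
	 * therefore guaranteed to succeed (hypothetical flag).
	 */
	page = alloc_page(GFP_KERNEL | __GFP_USEPREALLOC);

	/* ... use the page ... */
	__free_page(page);

	prealloc_release();
	return 0;
}
<snip>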
On Tue 23-11-21 17:02:38, Andrew Morton wrote: > On Tue, 23 Nov 2021 20:01:50 +0100 Uladzislau Rezki <urezki@gmail.com> wrote: > > > On Mon, Nov 22, 2021 at 04:32:31PM +0100, Michal Hocko wrote: > > > From: Michal Hocko <mhocko@suse.com> > > > > > > Dave Chinner has mentioned that some of the xfs code would benefit from > > > kvmalloc support for __GFP_NOFAIL because they have allocations that > > > cannot fail and they do not fit into a single page. > > Perhaps we should tell xfs "no, do it internally". Because this is a > rather nasty-looking thing - do we want to encourage other callsites to > start using it? This is what xfs is likely going to do if we do not provide the functionality. I just do not see why that would be a better outcome though. My longterm experience tells me that whenever we ignore requirements by other subsystems then those requirements materialize in some form in the end. In many cases done either suboptimaly or outright wrong. This might be not the case for xfs as the quality of implementation is high there but this is not the case in general. Even if people start using vmalloc(GFP_NOFAIL) out of lazyness or for any other stupid reason then what? Is that something we should worry about? Retrying within the allocator doesn't make the things worse. In fact it is just easier to find such abusers by grep which would be more elaborate with custom retry loops. [...] > > > + if (nofail) { > > > + schedule_timeout_uninterruptible(1); > > > + goto again; > > > + } > > The idea behind congestion_wait() is to prevent us from having to > hard-wire delays like this. congestion_wait(1) would sleep for up to > one millisecond, but will return earlier if reclaim events happened > which make it likely that the caller can now proceed with the > allocation event, successfully. > > However it turns out that congestion_wait() was quietly broken at the > block level some time ago. We could perhaps resurrect the concept at > another level - say by releasing congestion_wait() callers if an amount > of memory newly becomes allocatable. This obviously asks for inclusion > of zone/node/etc info from the congestion_wait() caller. But that's > just an optimization - if the newly-available memory isn't useful to > the congestion_wait() caller, they just fail the allocation attempts > and wait again. vmalloc has two potential failure modes. Depleted memory and vmalloc space. So there are two different events to wait for. I do agree that schedule_timeout_uninterruptible is both ugly and very simple but do we really need a much more sophisticated solution at this stage?
On Tue, Nov 23, 2021 at 05:02:38PM -0800, Andrew Morton wrote: > On Tue, 23 Nov 2021 20:01:50 +0100 Uladzislau Rezki <urezki@gmail.com> wrote: > > > On Mon, Nov 22, 2021 at 04:32:31PM +0100, Michal Hocko wrote: > > > From: Michal Hocko <mhocko@suse.com> > > > > > > Dave Chinner has mentioned that some of the xfs code would benefit from > > > kvmalloc support for __GFP_NOFAIL because they have allocations that > > > cannot fail and they do not fit into a single page. > > Perhaps we should tell xfs "no, do it internally". Because this is a > rather nasty-looking thing - do we want to encourage other callsites to > start using it? > > > > The large part of the vmalloc implementation already complies with the > > > given gfp flags so there is no work for those to be done. The area > > > and page table allocations are an exception to that. Implement a retry > > > loop for those. > > > > > > Add a short sleep before retrying. 1 jiffy is a completely random > > > timeout. Ideally the retry would wait for an explicit event - e.g. > > > a change to the vmalloc space change if the failure was caused by > > > the space fragmentation or depletion. But there are multiple different > > > reasons to retry and this could become much more complex. Keep the retry > > > simple for now and just sleep to prevent from hogging CPUs. > > > > > Yes, the horse has already bolted. But we didn't want that horse anyway ;) > > I added GFP_NOFAIL back in the mesozoic era because quite a lot of > sites were doing open-coded try-forever loops. I thought "hey, they > shouldn't be doing that in the first place, but let's at least > centralize the concept to reduce code size, code duplication and so > it's something we can now grep for". But longer term, all GFP_NOFAIL > sites should be reworked to no longer need to do the retry-forever > thing. In retrospect, this bright idea of mine seems to have added > license for more sites to use retry-forever. Sigh. > > > > + if (nofail) { > > > + schedule_timeout_uninterruptible(1); > > > + goto again; > > > + } > > The idea behind congestion_wait() is to prevent us from having to > hard-wire delays like this. congestion_wait(1) would sleep for up to > one millisecond, but will return earlier if reclaim events happened > which make it likely that the caller can now proceed with the > allocation event, successfully. > > However it turns out that congestion_wait() was quietly broken at the > block level some time ago. We could perhaps resurrect the concept at > another level - say by releasing congestion_wait() callers if an amount > of memory newly becomes allocatable. This obviously asks for inclusion > of zone/node/etc info from the congestion_wait() caller. But that's > just an optimization - if the newly-available memory isn't useful to > the congestion_wait() caller, they just fail the allocation attempts > and wait again. > > > well that is sad... > > I have raised two concerns in our previous discussion about this change, > > Can you please reiterate those concerns here? > 1. I proposed to repeat(if fails) in one solid place, i.e. get rid of duplication and spreading the logic across several places. This is about simplification. 2. Second one is about to do an unwinding and release everything what we have just accumulated in terms of memory consumption. The failure might occur, if so a condition we are in is a low memory one or high memory pressure. 
In this case, since we are about to sleep some milliseconds in order to
repeat later, IMHO it makes sense to release memory:

- to prevent killing apps or possible OOM;
- we can end up looping quite a lot of time or even forever if users do
  nasty things with the vmalloc API and the __GFP_NOFAIL flag.

--
Vlad Rezki
On Wed, Nov 24, 2021 at 09:43:12AM +0100, Michal Hocko wrote: > On Tue 23-11-21 17:02:38, Andrew Morton wrote: > > On Tue, 23 Nov 2021 20:01:50 +0100 Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > On Mon, Nov 22, 2021 at 04:32:31PM +0100, Michal Hocko wrote: > > > > From: Michal Hocko <mhocko@suse.com> > > > > > > > > Dave Chinner has mentioned that some of the xfs code would benefit from > > > > kvmalloc support for __GFP_NOFAIL because they have allocations that > > > > cannot fail and they do not fit into a single page. > > > > Perhaps we should tell xfs "no, do it internally". Because this is a > > rather nasty-looking thing - do we want to encourage other callsites to > > start using it? > > This is what xfs is likely going to do if we do not provide the > functionality. I just do not see why that would be a better outcome > though. My longterm experience tells me that whenever we ignore > requirements by other subsystems then those requirements materialize in > some form in the end. In many cases done either suboptimaly or outright > wrong. This might be not the case for xfs as the quality of > implementation is high there but this is not the case in general. > > Even if people start using vmalloc(GFP_NOFAIL) out of lazyness or for > any other stupid reason then what? Is that something we should worry > about? Retrying within the allocator doesn't make the things worse. In > fact it is just easier to find such abusers by grep which would be more > elaborate with custom retry loops. > > [...] > > > > + if (nofail) { > > > > + schedule_timeout_uninterruptible(1); > > > > + goto again; > > > > + } > > > > The idea behind congestion_wait() is to prevent us from having to > > hard-wire delays like this. congestion_wait(1) would sleep for up to > > one millisecond, but will return earlier if reclaim events happened > > which make it likely that the caller can now proceed with the > > allocation event, successfully. > > > > However it turns out that congestion_wait() was quietly broken at the > > block level some time ago. We could perhaps resurrect the concept at > > another level - say by releasing congestion_wait() callers if an amount > > of memory newly becomes allocatable. This obviously asks for inclusion > > of zone/node/etc info from the congestion_wait() caller. But that's > > just an optimization - if the newly-available memory isn't useful to > > the congestion_wait() caller, they just fail the allocation attempts > > and wait again. > > vmalloc has two potential failure modes. Depleted memory and vmalloc > space. So there are two different events to wait for. I do agree that > schedule_timeout_uninterruptible is both ugly and very simple but do we > really need a much more sophisticated solution at this stage? > I would say there is at least one more. It is about when users set their own range(start:end) where to allocate. In that scenario we might never return to a user, because there might not be any free vmap space on specified range. To address this, we can allow __GFP_NOFAIL only for entire vmalloc address space, i.e. within VMALLOC_START:VMALLOC_END. By doing so we will guarantee that we will not run out of vmap space, at least for 64 bit systems, for smaller 32 bit ones we can not guarantee it but it is populated back when the "lazily free logic" is kicked. -- Vlad Rezki
On Tue, Nov 23, 2021 at 09:09:08PM +0100, Michal Hocko wrote: > On Tue 23-11-21 20:01:50, Uladzislau Rezki wrote: > [...] > > I have raised two concerns in our previous discussion about this change, > > well that is sad... > > I definitely didn't mean to ignore any of the feedback. IIRC we were in > a disagreement in the failure mode for retry loop - i.e. free all the > allocated pages in case page table pages cannot be allocated. I still > maintain my position and until there is a wider consensus on that I will > keep my current implementation. The other concern was about failures > warning but that shouldn't be a problem when basing on the current Linus > tree. Am I missing something? > Sorry i was answering vice-versa from latest toward first messages :) -- Vlad Rezki
On Wed, Nov 24, 2021 at 04:23:31PM +1100, NeilBrown wrote:
>
> It would get particularly painful if some system call started returning
> -ENOMEM, which had never returned that before. I note that ext4 uses
> __GFP_NOFAIL when handling truncate. I don't think user-space would be
> happy with ENOMEM from truncate (or fallocate(PUNCH_HOLE)), though a
> recent commit which adds it focuses more on wanting to avoid the need
> for fsck.

If the inode is in use (via an open file descriptor) when it is
unlinked, we can't actually do the truncate until the inode is evicted,
and at that point, there is no user space to return to. For that
reason, the evict_inode() method is not *allowed* to fail. So this is
why we need to use GFP_NOFAIL or an open-coded retry loop.

The alternative would be to mark the file system corrupt, and then
either remount the file system, panic the system and reboot, or leave
the file system corrupted ("don't worry, be happy"). I considered
GFP_NOFAIL to be the lesser of the evils. :-)

If the VFS allowed evict_inode() to fail, all it could do is put the
inode back on the list of inodes to be evicted later --- which is to
say, we would have to add a lot of complexity to effectively add a
gigantic retry loop. Granted, we wouldn't need to be holding any locks
in between retries, so perhaps it's better than adding a retry loop deep
in the guts of the ext4 truncate codepath. But then we would need to
worry about userspace getting ENOMEM from system calls which,
historically, users have never seen fail.

I suppose we could also solve this problem by adding retry logic in the
top-level VFS truncate codepath, so instead of returning ENOMEM, we just
retry the truncate(2) system call and hope that we have enough memory to
succeed this time. After all, what can userspace do if truncate() fails
with ENOMEM? It can fail the userspace program, which in the case of a
long-running daemon such as mysqld is basically the userspace equivalent
of "panic and reboot", or it can retry the truncate(2) system call at
the userspace level.

Are we detecting a pattern here? There will always be cases where the
choice is "panic" or "retry".

					- Ted
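Schematically, the open-coded alternative being weighed against GFP_NOFAIL in an eviction path looks something like the following; this is an illustrative sketch, not actual ext4 code, and the names are made up:

<snip>
struct example_truncate_ctx {
	int dummy;
};

static void example_evict_inode(struct inode *inode)
{
	struct example_truncate_ctx *ctx;

	/*
	 * There is no user context to return -ENOMEM to, so the only
	 * realistic options are retrying forever or declaring the file
	 * system corrupted; __GFP_NOFAIL expresses the former in one flag.
	 */
	do {
		ctx = kzalloc(sizeof(*ctx), GFP_NOFS);
		if (!ctx)
			schedule_timeout_uninterruptible(1);
	} while (!ctx);

	/* ... perform the truncate using ctx ... */

	kfree(ctx);
}
<snip>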
On Wed 24-11-21 21:11:42, Uladzislau Rezki wrote: > On Tue, Nov 23, 2021 at 05:02:38PM -0800, Andrew Morton wrote: > > On Tue, 23 Nov 2021 20:01:50 +0100 Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > On Mon, Nov 22, 2021 at 04:32:31PM +0100, Michal Hocko wrote: > > > > From: Michal Hocko <mhocko@suse.com> > > > > > > > > Dave Chinner has mentioned that some of the xfs code would benefit from > > > > kvmalloc support for __GFP_NOFAIL because they have allocations that > > > > cannot fail and they do not fit into a single page. > > > > Perhaps we should tell xfs "no, do it internally". Because this is a > > rather nasty-looking thing - do we want to encourage other callsites to > > start using it? > > > > > > The large part of the vmalloc implementation already complies with the > > > > given gfp flags so there is no work for those to be done. The area > > > > and page table allocations are an exception to that. Implement a retry > > > > loop for those. > > > > > > > > Add a short sleep before retrying. 1 jiffy is a completely random > > > > timeout. Ideally the retry would wait for an explicit event - e.g. > > > > a change to the vmalloc space change if the failure was caused by > > > > the space fragmentation or depletion. But there are multiple different > > > > reasons to retry and this could become much more complex. Keep the retry > > > > simple for now and just sleep to prevent from hogging CPUs. > > > > > > > > Yes, the horse has already bolted. But we didn't want that horse anyway ;) > > > > I added GFP_NOFAIL back in the mesozoic era because quite a lot of > > sites were doing open-coded try-forever loops. I thought "hey, they > > shouldn't be doing that in the first place, but let's at least > > centralize the concept to reduce code size, code duplication and so > > it's something we can now grep for". But longer term, all GFP_NOFAIL > > sites should be reworked to no longer need to do the retry-forever > > thing. In retrospect, this bright idea of mine seems to have added > > license for more sites to use retry-forever. Sigh. > > > > > > + if (nofail) { > > > > + schedule_timeout_uninterruptible(1); > > > > + goto again; > > > > + } > > > > The idea behind congestion_wait() is to prevent us from having to > > hard-wire delays like this. congestion_wait(1) would sleep for up to > > one millisecond, but will return earlier if reclaim events happened > > which make it likely that the caller can now proceed with the > > allocation event, successfully. > > > > However it turns out that congestion_wait() was quietly broken at the > > block level some time ago. We could perhaps resurrect the concept at > > another level - say by releasing congestion_wait() callers if an amount > > of memory newly becomes allocatable. This obviously asks for inclusion > > of zone/node/etc info from the congestion_wait() caller. But that's > > just an optimization - if the newly-available memory isn't useful to > > the congestion_wait() caller, they just fail the allocation attempts > > and wait again. > > > > > well that is sad... > > > I have raised two concerns in our previous discussion about this change, > > > > Can you please reiterate those concerns here? > > > 1. I proposed to repeat(if fails) in one solid place, i.e. get rid of > duplication and spreading the logic across several places. This is about > simplification. I am all for simplifications. But the presented simplification lead to 2) and ... > 2. 
Second one is about to do an unwinding and release everything what we > have just accumulated in terms of memory consumption. The failure might > occur, if so a condition we are in is a low memory one or high memory > pressure. In this case, since we are about to sleep some milliseconds > in order to repeat later, IMHO it makes sense to release memory: > > - to prevent killing apps or possible OOM; > - we can end up looping quite a lot of time or even forever if users do > nasty things with vmalloc API and __GFP_NOFAIL flag. ... this is where we disagree and I have tried to explain why. The primary memory to allocate are pages to back the vmalloc area. Failing to allocate few page tables - which btw. do not fail as they are order-0 - and result into the whole and much more expensive work to allocate the former is really wasteful. You've had a concern about OOM killer invocation while retrying the page table allocation but you should realize that page table allocations might already invoke OOM killer so that is absolutely nothing new.
On Wed 24-11-21 21:37:54, Uladzislau Rezki wrote: > On Wed, Nov 24, 2021 at 09:43:12AM +0100, Michal Hocko wrote: > > On Tue 23-11-21 17:02:38, Andrew Morton wrote: > > > On Tue, 23 Nov 2021 20:01:50 +0100 Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > > > On Mon, Nov 22, 2021 at 04:32:31PM +0100, Michal Hocko wrote: > > > > > From: Michal Hocko <mhocko@suse.com> > > > > > > > > > > Dave Chinner has mentioned that some of the xfs code would benefit from > > > > > kvmalloc support for __GFP_NOFAIL because they have allocations that > > > > > cannot fail and they do not fit into a single page. > > > > > > Perhaps we should tell xfs "no, do it internally". Because this is a > > > rather nasty-looking thing - do we want to encourage other callsites to > > > start using it? > > > > This is what xfs is likely going to do if we do not provide the > > functionality. I just do not see why that would be a better outcome > > though. My longterm experience tells me that whenever we ignore > > requirements by other subsystems then those requirements materialize in > > some form in the end. In many cases done either suboptimaly or outright > > wrong. This might be not the case for xfs as the quality of > > implementation is high there but this is not the case in general. > > > > Even if people start using vmalloc(GFP_NOFAIL) out of lazyness or for > > any other stupid reason then what? Is that something we should worry > > about? Retrying within the allocator doesn't make the things worse. In > > fact it is just easier to find such abusers by grep which would be more > > elaborate with custom retry loops. > > > > [...] > > > > > + if (nofail) { > > > > > + schedule_timeout_uninterruptible(1); > > > > > + goto again; > > > > > + } > > > > > > The idea behind congestion_wait() is to prevent us from having to > > > hard-wire delays like this. congestion_wait(1) would sleep for up to > > > one millisecond, but will return earlier if reclaim events happened > > > which make it likely that the caller can now proceed with the > > > allocation event, successfully. > > > > > > However it turns out that congestion_wait() was quietly broken at the > > > block level some time ago. We could perhaps resurrect the concept at > > > another level - say by releasing congestion_wait() callers if an amount > > > of memory newly becomes allocatable. This obviously asks for inclusion > > > of zone/node/etc info from the congestion_wait() caller. But that's > > > just an optimization - if the newly-available memory isn't useful to > > > the congestion_wait() caller, they just fail the allocation attempts > > > and wait again. > > > > vmalloc has two potential failure modes. Depleted memory and vmalloc > > space. So there are two different events to wait for. I do agree that > > schedule_timeout_uninterruptible is both ugly and very simple but do we > > really need a much more sophisticated solution at this stage? > > > I would say there is at least one more. It is about when users set their > own range(start:end) where to allocate. In that scenario we might never > return to a user, because there might not be any free vmap space on > specified range. > > To address this, we can allow __GFP_NOFAIL only for entire vmalloc > address space, i.e. within VMALLOC_START:VMALLOC_END. How should we do that?
On Thu, Nov 25, 2021 at 9:46 AM Michal Hocko <mhocko@suse.com> wrote: > > On Wed 24-11-21 21:11:42, Uladzislau Rezki wrote: > > On Tue, Nov 23, 2021 at 05:02:38PM -0800, Andrew Morton wrote: > > > On Tue, 23 Nov 2021 20:01:50 +0100 Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > > > On Mon, Nov 22, 2021 at 04:32:31PM +0100, Michal Hocko wrote: > > > > > From: Michal Hocko <mhocko@suse.com> > > > > > > > > > > Dave Chinner has mentioned that some of the xfs code would benefit from > > > > > kvmalloc support for __GFP_NOFAIL because they have allocations that > > > > > cannot fail and they do not fit into a single page. > > > > > > Perhaps we should tell xfs "no, do it internally". Because this is a > > > rather nasty-looking thing - do we want to encourage other callsites to > > > start using it? > > > > > > > > The large part of the vmalloc implementation already complies with the > > > > > given gfp flags so there is no work for those to be done. The area > > > > > and page table allocations are an exception to that. Implement a retry > > > > > loop for those. > > > > > > > > > > Add a short sleep before retrying. 1 jiffy is a completely random > > > > > timeout. Ideally the retry would wait for an explicit event - e.g. > > > > > a change to the vmalloc space change if the failure was caused by > > > > > the space fragmentation or depletion. But there are multiple different > > > > > reasons to retry and this could become much more complex. Keep the retry > > > > > simple for now and just sleep to prevent from hogging CPUs. > > > > > > > > > > > Yes, the horse has already bolted. But we didn't want that horse anyway ;) > > > > > > I added GFP_NOFAIL back in the mesozoic era because quite a lot of > > > sites were doing open-coded try-forever loops. I thought "hey, they > > > shouldn't be doing that in the first place, but let's at least > > > centralize the concept to reduce code size, code duplication and so > > > it's something we can now grep for". But longer term, all GFP_NOFAIL > > > sites should be reworked to no longer need to do the retry-forever > > > thing. In retrospect, this bright idea of mine seems to have added > > > license for more sites to use retry-forever. Sigh. > > > > > > > > + if (nofail) { > > > > > + schedule_timeout_uninterruptible(1); > > > > > + goto again; > > > > > + } > > > > > > The idea behind congestion_wait() is to prevent us from having to > > > hard-wire delays like this. congestion_wait(1) would sleep for up to > > > one millisecond, but will return earlier if reclaim events happened > > > which make it likely that the caller can now proceed with the > > > allocation event, successfully. > > > > > > However it turns out that congestion_wait() was quietly broken at the > > > block level some time ago. We could perhaps resurrect the concept at > > > another level - say by releasing congestion_wait() callers if an amount > > > of memory newly becomes allocatable. This obviously asks for inclusion > > > of zone/node/etc info from the congestion_wait() caller. But that's > > > just an optimization - if the newly-available memory isn't useful to > > > the congestion_wait() caller, they just fail the allocation attempts > > > and wait again. > > > > > > > well that is sad... > > > > I have raised two concerns in our previous discussion about this change, > > > > > > Can you please reiterate those concerns here? > > > > > 1. I proposed to repeat(if fails) in one solid place, i.e. get rid of > > duplication and spreading the logic across several places. 
This is about > > simplification. > > I am all for simplifications. But the presented simplification lead to 2) and ... > > > 2. Second one is about to do an unwinding and release everything what we > > have just accumulated in terms of memory consumption. The failure might > > occur, if so a condition we are in is a low memory one or high memory > > pressure. In this case, since we are about to sleep some milliseconds > > in order to repeat later, IMHO it makes sense to release memory: > > > > - to prevent killing apps or possible OOM; > > - we can end up looping quite a lot of time or even forever if users do > > nasty things with vmalloc API and __GFP_NOFAIL flag. > > ... this is where we disagree and I have tried to explain why. The primary > memory to allocate are pages to back the vmalloc area. Failing to > allocate few page tables - which btw. do not fail as they are order-0 - > and result into the whole and much more expensive work to allocate the > former is really wasteful. You've had a concern about OOM killer > invocation while retrying the page table allocation but you should > realize that page table allocations might already invoke OOM killer so that > is absolutely nothing new. > We are in a slow path and this is a corner case, it means we will timeout for many milliseconds, for example for CONFIG_HZ_100 it is 10 milliseconds. I would agree with you if it was requesting some memory and repeating in a tight loop because of any time constraint and workloads sensitive to latency. Is it sensitive to any workload? If so, we definitely can not go with any delay there. As for OOM, right you are. But it also can be that we are the source who triggers it, directly or indirectly. Unwinding and cleaning is a maximum what we actually can do here staying fair to OOM. Therefore i root for simplification and OOM related concerns :) But maybe there will be other opinions. Thanks! -- Uladzislau Rezki
On Thu, Nov 25, 2021 at 9:48 AM Michal Hocko <mhocko@suse.com> wrote: > > On Wed 24-11-21 21:37:54, Uladzislau Rezki wrote: > > On Wed, Nov 24, 2021 at 09:43:12AM +0100, Michal Hocko wrote: > > > On Tue 23-11-21 17:02:38, Andrew Morton wrote: > > > > On Tue, 23 Nov 2021 20:01:50 +0100 Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > > > > > On Mon, Nov 22, 2021 at 04:32:31PM +0100, Michal Hocko wrote: > > > > > > From: Michal Hocko <mhocko@suse.com> > > > > > > > > > > > > Dave Chinner has mentioned that some of the xfs code would benefit from > > > > > > kvmalloc support for __GFP_NOFAIL because they have allocations that > > > > > > cannot fail and they do not fit into a single page. > > > > > > > > Perhaps we should tell xfs "no, do it internally". Because this is a > > > > rather nasty-looking thing - do we want to encourage other callsites to > > > > start using it? > > > > > > This is what xfs is likely going to do if we do not provide the > > > functionality. I just do not see why that would be a better outcome > > > though. My longterm experience tells me that whenever we ignore > > > requirements by other subsystems then those requirements materialize in > > > some form in the end. In many cases done either suboptimaly or outright > > > wrong. This might be not the case for xfs as the quality of > > > implementation is high there but this is not the case in general. > > > > > > Even if people start using vmalloc(GFP_NOFAIL) out of lazyness or for > > > any other stupid reason then what? Is that something we should worry > > > about? Retrying within the allocator doesn't make the things worse. In > > > fact it is just easier to find such abusers by grep which would be more > > > elaborate with custom retry loops. > > > > > > [...] > > > > > > + if (nofail) { > > > > > > + schedule_timeout_uninterruptible(1); > > > > > > + goto again; > > > > > > + } > > > > > > > > The idea behind congestion_wait() is to prevent us from having to > > > > hard-wire delays like this. congestion_wait(1) would sleep for up to > > > > one millisecond, but will return earlier if reclaim events happened > > > > which make it likely that the caller can now proceed with the > > > > allocation event, successfully. > > > > > > > > However it turns out that congestion_wait() was quietly broken at the > > > > block level some time ago. We could perhaps resurrect the concept at > > > > another level - say by releasing congestion_wait() callers if an amount > > > > of memory newly becomes allocatable. This obviously asks for inclusion > > > > of zone/node/etc info from the congestion_wait() caller. But that's > > > > just an optimization - if the newly-available memory isn't useful to > > > > the congestion_wait() caller, they just fail the allocation attempts > > > > and wait again. > > > > > > vmalloc has two potential failure modes. Depleted memory and vmalloc > > > space. So there are two different events to wait for. I do agree that > > > schedule_timeout_uninterruptible is both ugly and very simple but do we > > > really need a much more sophisticated solution at this stage? > > > > > I would say there is at least one more. It is about when users set their > > own range(start:end) where to allocate. In that scenario we might never > > return to a user, because there might not be any free vmap space on > > specified range. > > > > To address this, we can allow __GFP_NOFAIL only for entire vmalloc > > address space, i.e. within VMALLOC_START:VMALLOC_END. > > How should we do that? 
>
<snip>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d2a00ad4e1dd..664935bee2a2 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3029,6 +3029,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
 		return NULL;
 	}

+	if (gfp_mask & __GFP_NOFAIL) {
+		if (start != VMALLOC_START || end != VMALLOC_END) {
+			gfp_mask &= ~__GFP_NOFAIL;
+			WARN_ONCE(1, "__GFP_NOFAIL is allowed only for entire vmalloc space.");
+		}
+	}
+
 	if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP)) {
 		unsigned long size_per_node;
<snip>

Or just allow the __GFP_NOFAIL flag only for the high level API, i.e.
__vmalloc(), where the gfp can be passed. Because it uses the whole
vmalloc address space, we do not need to check the range and other
parameters like align, etc. This variant is preferable.

But the problem is that there are internal functions which are publicly
available to kernel users, like __vmalloc_node_range(). In that case we
can add a big comment like: the __GFP_NOFAIL flag can be used __only__
with the high level API, i.e. __vmalloc().

Any thoughts?
On Thu 25-11-21 19:40:56, Uladzislau Rezki wrote: > On Thu, Nov 25, 2021 at 9:48 AM Michal Hocko <mhocko@suse.com> wrote: > > > > On Wed 24-11-21 21:37:54, Uladzislau Rezki wrote: > > > On Wed, Nov 24, 2021 at 09:43:12AM +0100, Michal Hocko wrote: > > > > On Tue 23-11-21 17:02:38, Andrew Morton wrote: > > > > > On Tue, 23 Nov 2021 20:01:50 +0100 Uladzislau Rezki <urezki@gmail.com> wrote: > > > > > > > > > > > On Mon, Nov 22, 2021 at 04:32:31PM +0100, Michal Hocko wrote: > > > > > > > From: Michal Hocko <mhocko@suse.com> > > > > > > > > > > > > > > Dave Chinner has mentioned that some of the xfs code would benefit from > > > > > > > kvmalloc support for __GFP_NOFAIL because they have allocations that > > > > > > > cannot fail and they do not fit into a single page. > > > > > > > > > > Perhaps we should tell xfs "no, do it internally". Because this is a > > > > > rather nasty-looking thing - do we want to encourage other callsites to > > > > > start using it? > > > > > > > > This is what xfs is likely going to do if we do not provide the > > > > functionality. I just do not see why that would be a better outcome > > > > though. My longterm experience tells me that whenever we ignore > > > > requirements by other subsystems then those requirements materialize in > > > > some form in the end. In many cases done either suboptimaly or outright > > > > wrong. This might be not the case for xfs as the quality of > > > > implementation is high there but this is not the case in general. > > > > > > > > Even if people start using vmalloc(GFP_NOFAIL) out of lazyness or for > > > > any other stupid reason then what? Is that something we should worry > > > > about? Retrying within the allocator doesn't make the things worse. In > > > > fact it is just easier to find such abusers by grep which would be more > > > > elaborate with custom retry loops. > > > > > > > > [...] > > > > > > > + if (nofail) { > > > > > > > + schedule_timeout_uninterruptible(1); > > > > > > > + goto again; > > > > > > > + } > > > > > > > > > > The idea behind congestion_wait() is to prevent us from having to > > > > > hard-wire delays like this. congestion_wait(1) would sleep for up to > > > > > one millisecond, but will return earlier if reclaim events happened > > > > > which make it likely that the caller can now proceed with the > > > > > allocation event, successfully. > > > > > > > > > > However it turns out that congestion_wait() was quietly broken at the > > > > > block level some time ago. We could perhaps resurrect the concept at > > > > > another level - say by releasing congestion_wait() callers if an amount > > > > > of memory newly becomes allocatable. This obviously asks for inclusion > > > > > of zone/node/etc info from the congestion_wait() caller. But that's > > > > > just an optimization - if the newly-available memory isn't useful to > > > > > the congestion_wait() caller, they just fail the allocation attempts > > > > > and wait again. > > > > > > > > vmalloc has two potential failure modes. Depleted memory and vmalloc > > > > space. So there are two different events to wait for. I do agree that > > > > schedule_timeout_uninterruptible is both ugly and very simple but do we > > > > really need a much more sophisticated solution at this stage? > > > > > > > I would say there is at least one more. It is about when users set their > > > own range(start:end) where to allocate. In that scenario we might never > > > return to a user, because there might not be any free vmap space on > > > specified range. 
> > >
> > > To address this, we can allow __GFP_NOFAIL only for entire vmalloc
> > > address space, i.e. within VMALLOC_START:VMALLOC_END.
> >
> > How should we do that?
> >
> <snip>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d2a00ad4e1dd..664935bee2a2 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3029,6 +3029,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>  		return NULL;
>  	}
>
> +	if (gfp_mask & __GFP_NOFAIL) {
> +		if (start != VMALLOC_START || end != VMALLOC_END) {
> +			gfp_mask &= ~__GFP_NOFAIL;
> +			WARN_ONCE(1, "__GFP_NOFAIL is allowed only for entire vmalloc space.");
> +		}
> +	}

So the called function effectively ignores the flag, which could lead
to an actual failure, and that is something the caller has told us not
to do. I do not consider such an API great, to say the least.

> +
>  	if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP)) {
>  		unsigned long size_per_node;
> <snip>
>
> Or just allow the __GFP_NOFAIL flag only for the high level API, i.e.
> __vmalloc(), where the gfp can be passed. Because it uses the whole
> vmalloc address space, we do not need to check the range and other
> parameters like align, etc. This variant is preferable.
>
> But the problem is that there are internal functions which are publicly
> available to kernel users, like __vmalloc_node_range(). In that case we
> can add a big comment like: the __GFP_NOFAIL flag can be used __only__
> with the high level API, i.e. __vmalloc().
>
> Any thoughts?

I dunno. I find it rather ugly. We can surely document that some APIs
shouldn't be used with __GFP_NOFAIL because they could result in an
endless loop, but I find it rather subtle to change the contract under
the caller's feet and cause other problems. I am rather curious about
other opinions, but at this moment this is trying to handle a
non-existing problem IMHO. vmalloc, and for that matter other
allocators, are not trying to be defensive in their APIs because we
assume in-kernel users to be good citizens.
On Thu 25-11-21 19:02:09, Uladzislau Rezki wrote: [...] > Therefore i root for simplification and OOM related concerns :) But > maybe there will be other opinions. I have to say that I disagree with your view. I am not sure we have other precedence where an allocator would throw away the primary allocation just because a metadata allocation failure. In any case, if there is a wider consensus that we really want to do that I can rework. I would prefer to keep the patch as is though.
On Thu, Nov 25, 2021 at 08:24:04PM +0100, Michal Hocko wrote: > On Thu 25-11-21 19:02:09, Uladzislau Rezki wrote: > [...] > > Therefore i root for simplification and OOM related concerns :) But > > maybe there will be other opinions. > > I have to say that I disagree with your view. I am not sure we have > other precedence where an allocator would throw away the primary > allocation just because a metadata allocation failure. > Well, i tried to do some code review and raised some concerns and proposals. If you do not agree, then of course i will not be arguing here. -- Vlad Rezki
On Thu 25-11-21 21:03:23, Uladzislau Rezki wrote: > On Thu, Nov 25, 2021 at 08:24:04PM +0100, Michal Hocko wrote: > > On Thu 25-11-21 19:02:09, Uladzislau Rezki wrote: > > [...] > > > Therefore i root for simplification and OOM related concerns :) But > > > maybe there will be other opinions. > > > > I have to say that I disagree with your view. I am not sure we have > > other precedence where an allocator would throw away the primary > > allocation just because a metadata allocation failure. > > > Well, i tried to do some code review and raised some concerns and > proposals. I do appreciate your review! No question about that. I was just surprised by your reaction that your review feedback had been ignored because I do not think this is the case. We were in a disagreement and that is just fine. It is quite normal to disagree. The question is whether that disagreement is fundamental and poses a roadblock for merging. I definitely do not want and mean to push anything by force. My previous understanding was that your concerns are mostly about aesthetics rather than blocking further progress.
> On Thu 25-11-21 21:03:23, Uladzislau Rezki wrote: > > On Thu, Nov 25, 2021 at 08:24:04PM +0100, Michal Hocko wrote: > > > On Thu 25-11-21 19:02:09, Uladzislau Rezki wrote: > > > [...] > > > > Therefore i root for simplification and OOM related concerns :) But > > > > maybe there will be other opinions. > > > > > > I have to say that I disagree with your view. I am not sure we have > > > other precedence where an allocator would throw away the primary > > > allocation just because a metadata allocation failure. > > > > > Well, i tried to do some code review and raised some concerns and > > proposals. > > I do appreciate your review! No question about that. > > I was just surprised by your reaction that your review feedback had been > ignored because I do not think this is the case. > > We were in a disagreement and that is just fine. It is quite normal to > disagree. The question is whether that disagreement is fundamental and > poses a roadblock for merging. I definitely do not want and mean to push > anything by force. My previous understanding was that your concerns are > mostly about aesthetics rather than blocking further progress. > It is not up to me to block you from further progress and of course it was not my intention. You asked for a review i did it, that is it :) -- Vlad Rezki
On Mon 22-11-21 16:32:31, Michal Hocko wrote: > From: Michal Hocko <mhocko@suse.com> > > Dave Chinner has mentioned that some of the xfs code would benefit from > kvmalloc support for __GFP_NOFAIL because they have allocations that > cannot fail and they do not fit into a single page. > > The large part of the vmalloc implementation already complies with the > given gfp flags so there is no work for those to be done. The area > and page table allocations are an exception to that. Implement a retry > loop for those. > > Add a short sleep before retrying. 1 jiffy is a completely random > timeout. Ideally the retry would wait for an explicit event - e.g. > a change to the vmalloc space change if the failure was caused by > the space fragmentation or depletion. But there are multiple different > reasons to retry and this could become much more complex. Keep the retry > simple for now and just sleep to prevent from hogging CPUs. > > Signed-off-by: Michal Hocko <mhocko@suse.com> Are there still any concerns around this patch or the approach in general?
On 11/24/21 06:23, NeilBrown wrote:
>>
>> I forget why radix_tree_preload used a cpu-local store rather than a
>> per-task one.
>>
>> Plus "what order pages would you like" and "on which node" and "in
>> which zone", etc...
>
> "what order" - only order-0 I hope. I'd hazard a guess that 90% of
> current NOFAIL allocations only need one page (providing slub is used -
> slab seems to insist on high-order pages sometimes).

Yeah AFAIK SLUB can prefer higher orders than SLAB, but also allows
fallback to the smallest order that's enough (thus 0 unless the objects
are larger than a page).

> "which node" - whichever. Unless __GFP_HARDWALL is set, alloc_page()
> will fall-back to "whichever" anyway, and NOFAIL with HARDWALL is
> probably a poor choice.
> "which zone" - NORMAL. I cannot find any NOFAIL allocations that want
> DMA. fs/ntfs asks for __GFP_HIGHMEM with NOFAIL, but that doesn't
> *require* highmem.
>
> Of course, before designing this interface too precisely we should check
> if anyone can use it. From a quick look through some of the 100-ish
> users of __GFP_NOFAIL I'd guess that mempools would help - the
> preallocation should happen at init-time, not request-time. Maybe if we
> made mempools even more light weight .... though that risks allocating a
> lot of memory that will never get used.
>
> This brings me back to the idea that
>    alloc_page(wait and reclaim allowed)
> should only fail on OOM_KILL. That way kernel threads are safe, and
> user-threads are free to return ENOMEM knowing it won't get to

Hm I thought that's already pretty much the case with the "too small to
fail" of today. IIRC there's exactly that gotcha that OOM KILL can result
in such an allocation failure. But I believe that approach is rather
fragile. If you encounter such an allocation not checking the resulting
page != NULL, you can only guess which one is true:

- the author simply forgot to check at all
- the author relied on "too small to fail" without realizing the gotcha
- at the time of writing the code was verified that it can be only run in
  kernel thread context, not user, and
  - it is still true
  - it stopped being true at some later point
  - might be hard to even decide which is the case

IIRC at some point we tried to abolish the "too small to fail" rule
because of this, but Linus denied that. But the opposite - make it a hard
guarantee in all cases - also didn't happen, so...

> user-space. If user-thread code really needs NOFAIL, it punts to a
> workqueue and waits - aborting the wait if it is killed, while the work
> item still runs eventually.
>
> NeilBrown
>
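To make the init-time preallocation idea above concrete, here is a minimal
sketch using the existing mempool API. It is an illustration only, not a
proposal from the thread; the pool size, gfp flags, and all function names
prefixed with my_ are assumptions.

#include <linux/init.h>
#include <linux/gfp.h>
#include <linux/mempool.h>

/* Hypothetical driver-local reserve of order-0 pages. */
static mempool_t *my_page_pool;

static int __init my_driver_init(void)
{
	/* init-time: preallocate 16 order-0 pages as a guaranteed reserve */
	my_page_pool = mempool_create_page_pool(16, 0);
	return my_page_pool ? 0 : -ENOMEM;
}

static struct page *my_get_page_reliably(void)
{
	/*
	 * request-time: tries the page allocator first and falls back to
	 * the reserve; with a reclaim-capable gfp mask this waits for a
	 * freed element instead of returning NULL.
	 */
	return mempool_alloc(my_page_pool, GFP_NOIO);
}

static void my_put_page(struct page *page)
{
	/* returning the page refills the reserve for later waiters */
	mempool_free(page, my_page_pool);
}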
On Fri 26-11-21 15:50:15, Vlastimil Babka wrote:
> On 11/24/21 06:23, NeilBrown wrote:
> >>
> >> I forget why radix_tree_preload used a cpu-local store rather than a
> >> per-task one.
> >>
> >> Plus "what order pages would you like" and "on which node" and "in
> >> which zone", etc...
> >
> > "what order" - only order-0 I hope. I'd hazard a guess that 90% of
> > current NOFAIL allocations only need one page (providing slub is used -
> > slab seems to insist on high-order pages sometimes).
>
> Yeah AFAIK SLUB can prefer higher orders than SLAB, but also allows
> fallback to the smallest order that's enough (thus 0 unless the objects
> are larger than a page).
>
> > "which node" - whichever. Unless __GFP_HARDWALL is set, alloc_page()
> > will fall-back to "whichever" anyway, and NOFAIL with HARDWALL is
> > probably a poor choice.
> > "which zone" - NORMAL. I cannot find any NOFAIL allocations that want
> > DMA. fs/ntfs asks for __GFP_HIGHMEM with NOFAIL, but that doesn't
> > *require* highmem.
> >
> > Of course, before designing this interface too precisely we should check
> > if anyone can use it. From a quick look through some of the 100-ish
> > users of __GFP_NOFAIL I'd guess that mempools would help - the
> > preallocation should happen at init-time, not request-time. Maybe if we
> > made mempools even more light weight .... though that risks allocating a
> > lot of memory that will never get used.
> >
> > This brings me back to the idea that
> >    alloc_page(wait and reclaim allowed)
> > should only fail on OOM_KILL. That way kernel threads are safe, and
> > user-threads are free to return ENOMEM knowing it won't get to
>
> Hm I thought that's already pretty much the case with the "too small to
> fail" of today. IIRC there's exactly that gotcha that OOM KILL can result
> in such an allocation failure. But I believe that approach is rather
> fragile. If you encounter such an allocation not checking the resulting
> page != NULL, you can only guess which one is true:
>
> - the author simply forgot to check at all
> - the author relied on "too small to fail" without realizing the gotcha
> - at the time of writing the code was verified that it can be only run in
>   kernel thread context, not user, and
>   - it is still true
>   - it stopped being true at some later point
>   - might be hard to even decide which is the case
>
> IIRC at some point we tried to abolish the "too small to fail" rule
> because of this, but Linus denied that. But the opposite - make it a hard
> guarantee in all cases - also didn't happen, so...

Yeah. IMHO we should treat each missing check for allocation failure
(except for GFP_NOFAIL) as a bug, regardless of the practical
implementation detail that small allocations do not fail. Because they
can fail, and we should never subscribe to officially supporting the
implicit non-fail semantics.
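As a small illustration of that point, the difference between silently
relying on "too small to fail" and stating the requirement explicitly is
sketched below; the surrounding helper functions are hypothetical and only
the gfp usage pattern matters.

#include <linux/gfp.h>
#include <linux/mm.h>

/* Hypothetical helper: failure is handled, so the check is mandatory. */
static struct page *get_buffer_page(void)
{
	struct page *page;

	/*
	 * Dropping this check and relying on "too small to fail" would be
	 * a latent bug once the allocation can in fact fail (e.g. for an
	 * OOM-killed task).
	 */
	page = alloc_page(GFP_KERNEL);
	if (!page)
		return NULL;	/* caller must handle the -ENOMEM case */

	return page;
}

/* If failure genuinely cannot be handled, say so explicitly instead. */
static struct page *get_buffer_page_nofail(void)
{
	return alloc_page(GFP_KERNEL | __GFP_NOFAIL);
}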
On 11/22/21 16:32, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
>
> Dave Chinner has mentioned that some of the xfs code would benefit from
> kvmalloc support for __GFP_NOFAIL because they have allocations that
> cannot fail and they do not fit into a single page.
>
> The large part of the vmalloc implementation already complies with the
> given gfp flags so there is no work for those to be done. The area
> and page table allocations are an exception to that. Implement a retry
> loop for those.
>
> Add a short sleep before retrying. 1 jiffy is a completely random
> timeout. Ideally the retry would wait for an explicit event - e.g.
> a change to the vmalloc space change if the failure was caused by
> the space fragmentation or depletion. But there are multiple different
> reasons to retry and this could become much more complex. Keep the retry
> simple for now and just sleep to prevent from hogging CPUs.
>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> @@ -2921,6 +2923,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  {
>  	const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
>  	const gfp_t orig_gfp_mask = gfp_mask;
> +	bool nofail = gfp_mask & __GFP_NOFAIL;
>  	unsigned long addr = (unsigned long)area->addr;
>  	unsigned long size = get_vm_area_size(area);
>  	unsigned long array_size;
> @@ -2978,8 +2981,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
>  	else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
>  		flags = memalloc_noio_save();
>
> -	ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> +	do {
> +		ret = vmap_pages_range(addr, addr + size, prot, area->pages,
>  			page_shift);
> +		if (nofail && (ret < 0))
> +			schedule_timeout_uninterruptible(1);
> +	} while (nofail && (ret < 0));

Kind of ugly to have the same condition twice here, but no hard feelings.

>
>  	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
>  		memalloc_nofs_restore(flags);
> @@ -3074,9 +3081,14 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>  			VM_UNINITIALIZED | vm_flags, start, end, node,
>  			gfp_mask, caller);
>  	if (!area) {
> +		bool nofail = gfp_mask & __GFP_NOFAIL;
>  		warn_alloc(gfp_mask, NULL,
> -			"vmalloc error: size %lu, vm_struct allocation failed",
> -			real_size);
> +			"vmalloc error: size %lu, vm_struct allocation failed%s",
> +			real_size, (nofail) ? ". Retrying." : "");
> +		if (nofail) {
> +			schedule_timeout_uninterruptible(1);
> +			goto again;
> +		}
>  		goto fail;
>  	}
>
>
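For reference, one way the repeated condition could be written only once.
This is just a sketch of the reviewer's aesthetic point, restructuring the
hunk quoted above inside __vmalloc_area_node(); it is not what was posted
or merged, and it keeps the same behaviour.

	/* Equivalent control flow with the nofail/ret test written once. */
	do {
		ret = vmap_pages_range(addr, addr + size, prot, area->pages,
				       page_shift);
		if (ret >= 0 || !nofail)
			break;
		/* mapping failed and we may not fail: back off and retry */
		schedule_timeout_uninterruptible(1);
	} while (1);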
On Fri, 26 Nov 2021 11:48:46 +0100 Michal Hocko <mhocko@suse.com> wrote:

> On Mon 22-11-21 16:32:31, Michal Hocko wrote:
> > From: Michal Hocko <mhocko@suse.com>
> >
> > Dave Chinner has mentioned that some of the xfs code would benefit from
> > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > cannot fail and they do not fit into a single page.
> >
> > The large part of the vmalloc implementation already complies with the
> > given gfp flags so there is no work for those to be done. The area
> > and page table allocations are an exception to that. Implement a retry
> > loop for those.
> >
> > Add a short sleep before retrying. 1 jiffy is a completely random
> > timeout. Ideally the retry would wait for an explicit event - e.g.
> > a change to the vmalloc space change if the failure was caused by
> > the space fragmentation or depletion. But there are multiple different
> > reasons to retry and this could become much more complex. Keep the retry
> > simple for now and just sleep to prevent from hogging CPUs.
> >
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
>
> Are there still any concerns around this patch or the approach in
> general?

Well. Like GFP_NOFAIL, every use is a sin. But I don't think I've
ever seen a real-world report of anyone hitting GFP_NOFAIL's
theoretical issues.

I assume there will be a v3?
On Sat 27-11-21 16:00:43, Andrew Morton wrote:
> On Fri, 26 Nov 2021 11:48:46 +0100 Michal Hocko <mhocko@suse.com> wrote:
>
> > On Mon 22-11-21 16:32:31, Michal Hocko wrote:
> > > From: Michal Hocko <mhocko@suse.com>
> > >
> > > Dave Chinner has mentioned that some of the xfs code would benefit from
> > > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > > cannot fail and they do not fit into a single page.
> > >
> > > The large part of the vmalloc implementation already complies with the
> > > given gfp flags so there is no work for those to be done. The area
> > > and page table allocations are an exception to that. Implement a retry
> > > loop for those.
> > >
> > > Add a short sleep before retrying. 1 jiffy is a completely random
> > > timeout. Ideally the retry would wait for an explicit event - e.g.
> > > a change to the vmalloc space change if the failure was caused by
> > > the space fragmentation or depletion. But there are multiple different
> > > reasons to retry and this could become much more complex. Keep the retry
> > > simple for now and just sleep to prevent from hogging CPUs.
> > >
> > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> >
> > Are there still any concerns around this patch or the approach in
> > general?
>
> Well. Like GFP_NOFAIL, every use is a sin. But I don't think I've
> ever seen a real-world report of anyone hitting GFP_NOFAIL's
> theoretical issues.

I am not sure what you mean here. If you are missing real GFP_NOFAIL
use cases then some have been mentioned in the discussion.

> I assume there will be a v3?

Currently I do not have any follow-up changes on top of either of the
patches except for acks and review tags. I can repost with those if you
prefer.
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 17ca7001de1f..b6aed4f94a85 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2844,6 +2844,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 	 * more permissive.
 	 */
 	if (!order) {
+		gfp_t bulk_gfp = gfp & ~__GFP_NOFAIL;
+
 		while (nr_allocated < nr_pages) {
 			unsigned int nr, nr_pages_request;
 
@@ -2861,12 +2863,12 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 			 * but mempolcy want to alloc memory by interleaving.
 			 */
 			if (IS_ENABLED(CONFIG_NUMA) && nid == NUMA_NO_NODE)
-				nr = alloc_pages_bulk_array_mempolicy(gfp,
+				nr = alloc_pages_bulk_array_mempolicy(bulk_gfp,
 							nr_pages_request,
 							pages + nr_allocated);
 
 			else
-				nr = alloc_pages_bulk_array_node(gfp, nid,
+				nr = alloc_pages_bulk_array_node(bulk_gfp, nid,
 							nr_pages_request,
 							pages + nr_allocated);
 
@@ -2921,6 +2923,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 {
 	const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
 	const gfp_t orig_gfp_mask = gfp_mask;
+	bool nofail = gfp_mask & __GFP_NOFAIL;
 	unsigned long addr = (unsigned long)area->addr;
 	unsigned long size = get_vm_area_size(area);
 	unsigned long array_size;
@@ -2978,8 +2981,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 	else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
 		flags = memalloc_noio_save();
 
-	ret = vmap_pages_range(addr, addr + size, prot, area->pages,
+	do {
+		ret = vmap_pages_range(addr, addr + size, prot, area->pages,
 			page_shift);
+		if (nofail && (ret < 0))
+			schedule_timeout_uninterruptible(1);
+	} while (nofail && (ret < 0));
 
 	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
 		memalloc_nofs_restore(flags);
@@ -3074,9 +3081,14 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
 			VM_UNINITIALIZED | vm_flags, start, end, node,
 			gfp_mask, caller);
 	if (!area) {
+		bool nofail = gfp_mask & __GFP_NOFAIL;
 		warn_alloc(gfp_mask, NULL,
-			"vmalloc error: size %lu, vm_struct allocation failed",
-			real_size);
+			"vmalloc error: size %lu, vm_struct allocation failed%s",
+			real_size, (nofail) ? ". Retrying." : "");
+		if (nofail) {
+			schedule_timeout_uninterruptible(1);
+			goto again;
+		}
 		goto fail;
 	}
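For context, the kind of caller the series is meant to enable is sketched
below. This is an illustration only: the xfs-side conversion mentioned in
the changelog is not part of this patch, and the buffer helper name here
is made up.

#include <linux/mm.h>
#include <linux/slab.h>

/*
 * Hypothetical caller: a large metadata buffer whose allocation must not
 * fail. With __GFP_NOFAIL supported by vmalloc, kvmalloc() can honour the
 * flag even when the request is too large for a single page and falls
 * back to the vmalloc path.
 */
static void *alloc_metadata_buffer(size_t size)
{
	return kvmalloc(size, GFP_KERNEL | __GFP_NOFAIL);
}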