Message ID: 20191212190427.ouyohviijf5inhur@linux-p48b (mailing list archive)
State: New, archived
Series: [v2] mm/hugetlb: defer free_huge_page() to a workqueue
On 12/12/19 11:04 AM, Davidlohr Bueso wrote: > There have been deadlock reports[1, 2] where put_page is called > from softirq context and this causes trouble with the hugetlb_lock, > as well as potentially the subpool lock. > > For such an unlikely scenario, lets not add irq dancing overhead > to the lock+unlock operations, which could incur in expensive > instruction dependencies, particularly when considering hard-irq > safety. For example PUSHF+POPF on x86. > > Instead, just use a workqueue and do the free_huge_page() in regular > task context. > > [1] https://lore.kernel.org/lkml/20191211194615.18502-1-longman@redhat.com/ > [2] https://lore.kernel.org/lkml/20180905112341.21355-1-aneesh.kumar@linux.ibm.com/ > > Reported-by: Waiman Long <longman@redhat.com> > Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> > Signed-off-by: Davidlohr Bueso <dbueso@suse.de> Thank you Davidlohr. The patch does seem fairly simple and straight forward. I need to brush up on my workqueue knowledge to provide a full review. Longman, Do you have a test to reproduce the issue? If so, can you try running with this patch.
On Thu, 12 Dec 2019, Mike Kravetz wrote: >The patch does seem fairly simple and straight forward. I need to brush up >on my workqueue knowledge to provide a full review. The flush_work() call is actually bogus as we obviously cannot block in softirq...
On 12/12/19 2:22 PM, Mike Kravetz wrote: > On 12/12/19 11:04 AM, Davidlohr Bueso wrote: >> There have been deadlock reports[1, 2] where put_page is called >> from softirq context and this causes trouble with the hugetlb_lock, >> as well as potentially the subpool lock. >> >> For such an unlikely scenario, lets not add irq dancing overhead >> to the lock+unlock operations, which could incur in expensive >> instruction dependencies, particularly when considering hard-irq >> safety. For example PUSHF+POPF on x86. >> >> Instead, just use a workqueue and do the free_huge_page() in regular >> task context. >> >> [1] https://lore.kernel.org/lkml/20191211194615.18502-1-longman@redhat.com/ >> [2] https://lore.kernel.org/lkml/20180905112341.21355-1-aneesh.kumar@linux.ibm.com/ >> >> Reported-by: Waiman Long <longman@redhat.com> >> Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> >> Signed-off-by: Davidlohr Bueso <dbueso@suse.de> > Thank you Davidlohr. > > The patch does seem fairly simple and straight forward. I need to brush up > on my workqueue knowledge to provide a full review. > > Longman, > Do you have a test to reproduce the issue? If so, can you try running with > this patch. Yes, I do have a test that can reproduce the issue. I will run it with the patch and report the status tomorrow. -Longman
On 12/12/19 2:04 PM, Davidlohr Bueso wrote:
> There have been deadlock reports[1, 2] where put_page is called
> from softirq context and this causes trouble with the hugetlb_lock,
> as well as potentially the subpool lock.
>
> For such an unlikely scenario, lets not add irq dancing overhead
> to the lock+unlock operations, which could incur in expensive
> instruction dependencies, particularly when considering hard-irq
> safety. For example PUSHF+POPF on x86.
>
> Instead, just use a workqueue and do the free_huge_page() in regular
> task context.
>
> [1] https://lore.kernel.org/lkml/20191211194615.18502-1-longman@redhat.com/
> [2] https://lore.kernel.org/lkml/20180905112341.21355-1-aneesh.kumar@linux.ibm.com/
>
> Reported-by: Waiman Long <longman@redhat.com>
> Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
> ---
>
> - Changes from v1: Only use wq when in_interrupt(), otherwise business
>   as usual. Also include the proper header file.
>
> - While I have not reproduced this issue, the v1 using wq passes all
>   hugetlb related tests in ltp.
>
>  mm/hugetlb.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 44 insertions(+), 1 deletion(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ac65bb5e38ac..f28cf601938d 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -27,6 +27,7 @@
>  #include <linux/swapops.h>
>  #include <linux/jhash.h>
>  #include <linux/numa.h>
> +#include <linux/workqueue.h>
>
>  #include <asm/page.h>
>  #include <asm/pgtable.h>
> @@ -1136,7 +1137,13 @@ static inline void ClearPageHugeTemporary(struct page *page)
>  	page[2].mapping = NULL;
>  }
>
> -void free_huge_page(struct page *page)
> +static struct workqueue_struct *hugetlb_free_page_wq;
> +struct hugetlb_free_page_work {
> +	struct page *page;
> +	struct work_struct work;
> +};
> +
> +static void __free_huge_page(struct page *page)
>  {
>  	/*
>  	 * Can't pass hstate in here because it is called from the
> @@ -1199,6 +1206,36 @@ void free_huge_page(struct page *page)
>  	spin_unlock(&hugetlb_lock);
>  }
>
> +static void free_huge_page_workfn(struct work_struct *work)
> +{
> +	struct page *page;
> +
> +	page = container_of(work, struct hugetlb_free_page_work, work)->page;
> +	__free_huge_page(page);
> +}
> +
> +void free_huge_page(struct page *page)
> +{
> +	if (unlikely(in_interrupt())) {

in_interrupt() also includes contexts where softIRQ is disabled. So maybe !in_task() is a better fit here.

> +		/*
> +		 * While uncommon, free_huge_page() can be at least
> +		 * called from softirq context, defer freeing such
> +		 * that the hugetlb_lock and spool->lock need not have
> +		 * to deal with irq dances just for this.
> +		 */
> +		struct hugetlb_free_page_work work;
> +
> +		work.page = page;
> +		INIT_WORK_ONSTACK(&work.work, free_huge_page_workfn);
> +		queue_work(hugetlb_free_page_wq, &work.work);
> +
> +		/* wait until the huge page freeing is done */
> +		flush_work(&work.work);
> +		destroy_work_on_stack(&work.work);

The problem I see is that you don't want to wait too long while in the hardirq context. However, the latency for the work to finish is indeterminate.

Cheers,
Longman
On 12/12/19 3:52 PM, Waiman Long wrote: > On 12/12/19 2:22 PM, Mike Kravetz wrote: >> On 12/12/19 11:04 AM, Davidlohr Bueso wrote: >>> There have been deadlock reports[1, 2] where put_page is called >>> from softirq context and this causes trouble with the hugetlb_lock, >>> as well as potentially the subpool lock. >>> >>> For such an unlikely scenario, lets not add irq dancing overhead >>> to the lock+unlock operations, which could incur in expensive >>> instruction dependencies, particularly when considering hard-irq >>> safety. For example PUSHF+POPF on x86. >>> >>> Instead, just use a workqueue and do the free_huge_page() in regular >>> task context. >>> >>> [1] https://lore.kernel.org/lkml/20191211194615.18502-1-longman@redhat.com/ >>> [2] https://lore.kernel.org/lkml/20180905112341.21355-1-aneesh.kumar@linux.ibm.com/ >>> >>> Reported-by: Waiman Long <longman@redhat.com> >>> Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> >>> Signed-off-by: Davidlohr Bueso <dbueso@suse.de> >> Thank you Davidlohr. >> >> The patch does seem fairly simple and straight forward. I need to brush up >> on my workqueue knowledge to provide a full review. >> >> Longman, >> Do you have a test to reproduce the issue? If so, can you try running with >> this patch. > Yes, I do have a test that can reproduce the issue. I will run it with > the patch and report the status tomorrow. I don't think Davidlohr's patch is ready for prime time yet. So I will wait until a better version is available. Cheers, Longman
On Thu 12-12-19 15:52:20, Waiman Long wrote: > On 12/12/19 2:22 PM, Mike Kravetz wrote: > > On 12/12/19 11:04 AM, Davidlohr Bueso wrote: > >> There have been deadlock reports[1, 2] where put_page is called > >> from softirq context and this causes trouble with the hugetlb_lock, > >> as well as potentially the subpool lock. > >> > >> For such an unlikely scenario, lets not add irq dancing overhead > >> to the lock+unlock operations, which could incur in expensive > >> instruction dependencies, particularly when considering hard-irq > >> safety. For example PUSHF+POPF on x86. > >> > >> Instead, just use a workqueue and do the free_huge_page() in regular > >> task context. > >> > >> [1] https://lore.kernel.org/lkml/20191211194615.18502-1-longman@redhat.com/ > >> [2] https://lore.kernel.org/lkml/20180905112341.21355-1-aneesh.kumar@linux.ibm.com/ > >> > >> Reported-by: Waiman Long <longman@redhat.com> > >> Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> > >> Signed-off-by: Davidlohr Bueso <dbueso@suse.de> > > Thank you Davidlohr. > > > > The patch does seem fairly simple and straight forward. I need to brush up > > on my workqueue knowledge to provide a full review. > > > > Longman, > > Do you have a test to reproduce the issue? If so, can you try running with > > this patch. > > Yes, I do have a test that can reproduce the issue. I will run it with > the patch and report the status tomorrow. Can you extract guts of the testcase and integrate them into hugetlb test suite?
On Thu 12-12-19 11:04:27, Davidlohr Bueso wrote: > There have been deadlock reports[1, 2] where put_page is called > from softirq context and this causes trouble with the hugetlb_lock, > as well as potentially the subpool lock. > > For such an unlikely scenario, lets not add irq dancing overhead > to the lock+unlock operations, which could incur in expensive > instruction dependencies, particularly when considering hard-irq > safety. For example PUSHF+POPF on x86. > > Instead, just use a workqueue and do the free_huge_page() in regular > task context. I am afraid that work_struct is too large to be stuffed into the struct page array (because of the lockdep part). I think that it would be just safer to make hugetlb_lock irq safe. Are there any other locks that would require the same?
On 12/16/19 8:26 AM, Michal Hocko wrote: > On Thu 12-12-19 15:52:20, Waiman Long wrote: >> On 12/12/19 2:22 PM, Mike Kravetz wrote: >>> On 12/12/19 11:04 AM, Davidlohr Bueso wrote: >>>> There have been deadlock reports[1, 2] where put_page is called >>>> from softirq context and this causes trouble with the hugetlb_lock, >>>> as well as potentially the subpool lock. >>>> >>>> For such an unlikely scenario, lets not add irq dancing overhead >>>> to the lock+unlock operations, which could incur in expensive >>>> instruction dependencies, particularly when considering hard-irq >>>> safety. For example PUSHF+POPF on x86. >>>> >>>> Instead, just use a workqueue and do the free_huge_page() in regular >>>> task context. >>>> >>>> [1] https://lore.kernel.org/lkml/20191211194615.18502-1-longman@redhat.com/ >>>> [2] https://lore.kernel.org/lkml/20180905112341.21355-1-aneesh.kumar@linux.ibm.com/ >>>> >>>> Reported-by: Waiman Long <longman@redhat.com> >>>> Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> >>>> Signed-off-by: Davidlohr Bueso <dbueso@suse.de> >>> Thank you Davidlohr. >>> >>> The patch does seem fairly simple and straight forward. I need to brush up >>> on my workqueue knowledge to provide a full review. >>> >>> Longman, >>> Do you have a test to reproduce the issue? If so, can you try running with >>> this patch. >> Yes, I do have a test that can reproduce the issue. I will run it with >> the patch and report the status tomorrow. > Can you extract guts of the testcase and integrate them into hugetlb > test suite? The test case that I used is the Red Hat internal "Fork vs. fast GUP race test" written by Jarod Wilson. I would have to ask him if he is OK with that. Cheers, Longman
On 12/16/19 8:37 AM, Michal Hocko wrote: > On Thu 12-12-19 11:04:27, Davidlohr Bueso wrote: >> There have been deadlock reports[1, 2] where put_page is called >> from softirq context and this causes trouble with the hugetlb_lock, >> as well as potentially the subpool lock. >> >> For such an unlikely scenario, lets not add irq dancing overhead >> to the lock+unlock operations, which could incur in expensive >> instruction dependencies, particularly when considering hard-irq >> safety. For example PUSHF+POPF on x86. >> >> Instead, just use a workqueue and do the free_huge_page() in regular >> task context. > I am afraid that work_struct is too large to be stuffed into the struct > page array (because of the lockdep part). > > I think that it would be just safer to make hugetlb_lock irq safe. Are > there any other locks that would require the same? Currently, free_huge_page() can be called from the softIRQ context. The hugetlb_lock will be acquired during that call. The subpool lock may conditionally be acquired as well. I am still torn between converting both locks to be irq-safe and deferring the freeing to a workqueue. Cheers, Longman
On Mon, 16 Dec 2019, Michal Hocko wrote: >I am afraid that work_struct is too large to be stuffed into the struct >page array (because of the lockdep part). Yeah, this needs to be done without touching struct page. Which is why I had done the stack allocated way in this patch, but we cannot wait for it to complete in irq, so that's out the window. Andi had suggested percpu allocated work items, but having played with the idea over the weekend, I don't see how we can prevent another page being freed on the same cpu before previous work on the same cpu is complete (cpu0 wants to free pageA, schedules the work, in the mean time cpu0 wants to free pageB and workerfn for pageA still hasn't been called). >I think that it would be just safer to make hugetlb_lock irq safe. Are >there any other locks that would require the same? It would be simpler. Any performance issues that arise would probably be only seen in microbenchmarks, assuming we want to have full irq safety. If we don't need to worry about hardirq, then even better. The subpool lock would also need to be irq safe. Thanks, Davidlohr
On Mon, Dec 16, 2019 at 08:17:48AM -0800, Davidlohr Bueso wrote: > On Mon, 16 Dec 2019, Michal Hocko wrote: > > I am afraid that work_struct is too large to be stuffed into the struct > > page array (because of the lockdep part). > > Yeah, this needs to be done without touching struct page. > > Which is why I had done the stack allocated way in this patch, but we > cannot wait for it to complete in irq, so that's out the window. Andi > had suggested percpu allocated work items, but having played with the > idea over the weekend, I don't see how we can prevent another page being > freed on the same cpu before previous work on the same cpu is complete > (cpu0 wants to free pageA, schedules the work, in the mean time cpu0 > wants to free pageB and workerfn for pageA still hasn't been called). Why is it that we can call functions after-an-RCU-period-has-elapsed time, at returning-to-userspace time and after-exiting-hardirq-handler time easily, but the mechanism for calling a function after-we've-finished-handling-softirqs is so bloody hard to use? That's surely the major problem we need to fix.
On 12/16/19 10:38 AM, Waiman Long wrote: > On 12/16/19 8:26 AM, Michal Hocko wrote: >> On Thu 12-12-19 15:52:20, Waiman Long wrote: >>> On 12/12/19 2:22 PM, Mike Kravetz wrote: >>>> On 12/12/19 11:04 AM, Davidlohr Bueso wrote: >>>>> There have been deadlock reports[1, 2] where put_page is called >>>>> from softirq context and this causes trouble with the hugetlb_lock, >>>>> as well as potentially the subpool lock. >>>>> >>>>> For such an unlikely scenario, lets not add irq dancing overhead >>>>> to the lock+unlock operations, which could incur in expensive >>>>> instruction dependencies, particularly when considering hard-irq >>>>> safety. For example PUSHF+POPF on x86. >>>>> >>>>> Instead, just use a workqueue and do the free_huge_page() in regular >>>>> task context. >>>>> >>>>> [1] https://lore.kernel.org/lkml/20191211194615.18502-1-longman@redhat.com/ >>>>> [2] https://lore.kernel.org/lkml/20180905112341.21355-1-aneesh.kumar@linux.ibm.com/ >>>>> >>>>> Reported-by: Waiman Long <longman@redhat.com> >>>>> Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> >>>>> Signed-off-by: Davidlohr Bueso <dbueso@suse.de> >>>> Thank you Davidlohr. >>>> >>>> The patch does seem fairly simple and straight forward. I need to brush up >>>> on my workqueue knowledge to provide a full review. >>>> >>>> Longman, >>>> Do you have a test to reproduce the issue? If so, can you try running with >>>> this patch. >>> Yes, I do have a test that can reproduce the issue. I will run it with >>> the patch and report the status tomorrow. >> Can you extract guts of the testcase and integrate them into hugetlb >> test suite? BTW, what hugetlb test suite are you talking about? Cheers, Longman
On 12/16/19 8:17 AM, Davidlohr Bueso wrote: > On Mon, 16 Dec 2019, Michal Hocko wrote: >> I am afraid that work_struct is too large to be stuffed into the struct >> page array (because of the lockdep part). > > Yeah, this needs to be done without touching struct page. > > Which is why I had done the stack allocated way in this patch, but we > cannot wait for it to complete in irq, so that's out the window. Andi > had suggested percpu allocated work items, but having played with the > idea over the weekend, I don't see how we can prevent another page being > freed on the same cpu before previous work on the same cpu is complete > (cpu0 wants to free pageA, schedules the work, in the mean time cpu0 > wants to free pageB and workerfn for pageA still hasn't been called). > >> I think that it would be just safer to make hugetlb_lock irq safe. Are >> there any other locks that would require the same? > > It would be simpler. Any performance issues that arise would probably > be only seen in microbenchmarks, assuming we want to have full irq safety. > If we don't need to worry about hardirq, then even better. > > The subpool lock would also need to be irq safe. I do think we need to worry about hardirq. There are no restrictions preventing put_page from being called from hardirq context. I am concerned about the latency of making hugetlb_lock (and potentially subpool lock) hardirq safe. When these locks were introduced (before my time) the concept of making them irq safe was not considered. Recently, I learned that the hugetlb_lock is held for a linear scan of ALL hugetlb pages during a cgroup reparenting operation. That is just too long. If there is no viable work queue solution, then I think we would like to restructure the hugetlb locking before a change to just make hugetlb_lock irq safe. The idea would be to split the scope of what is done under hugetlb_lock. Most of it would never be executed in irq context. 
Then have a small/limited set of functionality that really needs to be irq safe protected by an irq safe lock.
On 12/16/19 2:08 PM, Mike Kravetz wrote: > On 12/16/19 8:17 AM, Davidlohr Bueso wrote: >> On Mon, 16 Dec 2019, Michal Hocko wrote: >>> I am afraid that work_struct is too large to be stuffed into the struct >>> page array (because of the lockdep part). >> Yeah, this needs to be done without touching struct page. >> >> Which is why I had done the stack allocated way in this patch, but we >> cannot wait for it to complete in irq, so that's out the window. Andi >> had suggested percpu allocated work items, but having played with the >> idea over the weekend, I don't see how we can prevent another page being >> freed on the same cpu before previous work on the same cpu is complete >> (cpu0 wants to free pageA, schedules the work, in the mean time cpu0 >> wants to free pageB and workerfn for pageA still hasn't been called). >> >>> I think that it would be just safer to make hugetlb_lock irq safe. Are >>> there any other locks that would require the same? >> It would be simpler. Any performance issues that arise would probably >> be only seen in microbenchmarks, assuming we want to have full irq safety. >> If we don't need to worry about hardirq, then even better. >> >> The subpool lock would also need to be irq safe. > I do think we need to worry about hardirq. There are no restruictions that > put_page can not be called from hardirq context. > > I am concerned about the latency of making hugetlb_lock (and potentially > subpool lock) hardirq safe. When these locks were introduced (before my > time) the concept of making them irq safe was not considered. Recently, > I learned that the hugetlb_lock is held for a linear scan of ALL hugetlb > pages during a cgroup reparentling operation. That is just too long. > > If there is no viable work queue solution, then I think we would like to > restructure the hugetlb locking before a change to just make hugetlb_lock > irq safe. The idea would be to split the scope of what is done under > hugetlb_lock. 
Most of it would never be executed in irq context. Then > have a small/limited set of functionality that really needs to be irq > safe protected by an irq safe lock. > Please take a look at my recently posted patch to see if that is an acceptable workqueue based solution. Thanks, Longman
On Mon 16-12-19 13:44:53, Waiman Long wrote: > On 12/16/19 10:38 AM, Waiman Long wrote: [...] > >> Can you extract guts of the testcase and integrate them into hugetlb > >> test suite? > > BTW, what hugetlb test suite are you talking about? I was using tests from the libhugetlbfs package in the past. There are a few tests in the LTP project but the libhugetlbfs coverage used to cover the largest part of the functionality. Is there any newer home for the package than [1], Mike? Btw. would it make sense to migrate those tests to a more common place, LTP or kernel selftests? [1] https://github.com/libhugetlbfs/libhugetlbfs/tree/master/tests
Cc: Eric On 12/17/19 1:00 AM, Michal Hocko wrote: > On Mon 16-12-19 13:44:53, Waiman Long wrote: >> On 12/16/19 10:38 AM, Waiman Long wrote: > [...] >>>> Can you extract guts of the testcase and integrate them into hugetlb >>>> test suite? >> >> BTW, what hugetlb test suite are you talking about? > > I was using tests from libhugetlbfs package in the past. There are few > tests in LTP project but the libhugetlbfs coverage used to cover the > largest part of the functionality. > > Is there any newer home for the package than [1], Mike? Btw. would it > mak sense to migrate those tests to a more common place, LTP or kernel > selftests? That is the latest home/release for libhugetlbfs. The libhugetlbfs test suite is somewhat strange in that I suspect it started as testing for libhugetlbfs itself. When it was written, the thought may have been that people would use libhugetlbfs as the primary interface to hugetlb pages. That is not the case today. Over time, hugetlbfs tests not associated with libhugetlbfs were added. If we want to migrate libhugetlbfs tests, then I think we would only want to migrate the non-libhugetlbfs test cases. Although, the libhugetlbfs specific tests are useful as they 'could' point out regressions. Added Eric (libhugetlbfs maintainer) on Cc for another opinion. > > [1] https://github.com/libhugetlbfs/libhugetlbfs/tree/master/tests >
On Tue 17-12-19 10:05:06, Mike Kravetz wrote: > Cc: Eric > > On 12/17/19 1:00 AM, Michal Hocko wrote: > > On Mon 16-12-19 13:44:53, Waiman Long wrote: > >> On 12/16/19 10:38 AM, Waiman Long wrote: > > [...] > >>>> Can you extract guts of the testcase and integrate them into hugetlb > >>>> test suite? > >> > >> BTW, what hugetlb test suite are you talking about? > > > > I was using tests from libhugetlbfs package in the past. There are few > > tests in LTP project but the libhugetlbfs coverage used to cover the > > largest part of the functionality. > > > > Is there any newer home for the package than [1], Mike? Btw. would it > > mak sense to migrate those tests to a more common place, LTP or kernel > > selftests? > > That is the latest home/release for libhugetlbfs. > > The libhugetlbfs test suite is somewhat strange in that I suspect it started > as testing for libhugetlbfs itself. When it was written, the thought may have > been that people would use libhugetlfs as the primary interface to hugetlb > pages. That is not the case today. Over time, hugetlbfs tests not associated > with libhugetlbfs were added. > > If we want to migrate libhugetlbfs tests, then I think we would only want to > migrate the non-libhugetlbfs test cases. Although, the libhugetlbfs specific > tests are useful as they 'could' point out regressions. Yeah, I can second that. I remember using the suite and it pointed to real issues when I was touching the area in the past. So if we can get as many tests to be independent on the library and integrate it to some existing testing project - be it kernel selftest or LTP - then it would be really great and I assume the testing coverage of the hugetlb functionality would increase dramatically.
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ac65bb5e38ac..f28cf601938d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -27,6 +27,7 @@
 #include <linux/swapops.h>
 #include <linux/jhash.h>
 #include <linux/numa.h>
+#include <linux/workqueue.h>
 
 #include <asm/page.h>
 #include <asm/pgtable.h>
@@ -1136,7 +1137,13 @@ static inline void ClearPageHugeTemporary(struct page *page)
 	page[2].mapping = NULL;
 }
 
-void free_huge_page(struct page *page)
+static struct workqueue_struct *hugetlb_free_page_wq;
+struct hugetlb_free_page_work {
+	struct page *page;
+	struct work_struct work;
+};
+
+static void __free_huge_page(struct page *page)
 {
 	/*
 	 * Can't pass hstate in here because it is called from the
@@ -1199,6 +1206,36 @@ void free_huge_page(struct page *page)
 	spin_unlock(&hugetlb_lock);
 }
 
+static void free_huge_page_workfn(struct work_struct *work)
+{
+	struct page *page;
+
+	page = container_of(work, struct hugetlb_free_page_work, work)->page;
+	__free_huge_page(page);
+}
+
+void free_huge_page(struct page *page)
+{
+	if (unlikely(in_interrupt())) {
+		/*
+		 * While uncommon, free_huge_page() can be at least
+		 * called from softirq context, defer freeing such
+		 * that the hugetlb_lock and spool->lock need not have
+		 * to deal with irq dances just for this.
+		 */
+		struct hugetlb_free_page_work work;
+
+		work.page = page;
+		INIT_WORK_ONSTACK(&work.work, free_huge_page_workfn);
+		queue_work(hugetlb_free_page_wq, &work.work);
+
+		/* wait until the huge page freeing is done */
+		flush_work(&work.work);
+		destroy_work_on_stack(&work.work);
+	} else
+		__free_huge_page(page);
+}
+
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 {
 	INIT_LIST_HEAD(&page->lru);
@@ -2816,6 +2853,12 @@ static int __init hugetlb_init(void)
 	for (i = 0; i < num_fault_mutexes; i++)
 		mutex_init(&hugetlb_fault_mutex_table[i]);
 
+	hugetlb_free_page_wq = alloc_workqueue("hugetlb_free_page_wq",
+					       WQ_MEM_RECLAIM, 0);
+	if (!hugetlb_free_page_wq)
+		return -ENOMEM;
+
 	return 0;
 }
 subsys_initcall(hugetlb_init);
There have been deadlock reports[1, 2] where put_page is called
from softirq context and this causes trouble with the hugetlb_lock,
as well as potentially the subpool lock.

For such an unlikely scenario, lets not add irq dancing overhead
to the lock+unlock operations, which could incur in expensive
instruction dependencies, particularly when considering hard-irq
safety. For example PUSHF+POPF on x86.

Instead, just use a workqueue and do the free_huge_page() in regular
task context.

[1] https://lore.kernel.org/lkml/20191211194615.18502-1-longman@redhat.com/
[2] https://lore.kernel.org/lkml/20180905112341.21355-1-aneesh.kumar@linux.ibm.com/

Reported-by: Waiman Long <longman@redhat.com>
Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---

- Changes from v1: Only use wq when in_interrupt(), otherwise business
  as usual. Also include the proper header file.

- While I have not reproduced this issue, the v1 using wq passes all
  hugetlb related tests in ltp.

 mm/hugetlb.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 44 insertions(+), 1 deletion(-)

-- 
2.16.4