
[v2] mm/hugetlb: Defer freeing of huge pages if in non-task context

Message ID 20191217012508.31495-1-longman@redhat.com (mailing list archive)
State New, archived

Commit Message

Waiman Long Dec. 17, 2019, 1:25 a.m. UTC
The following lockdep splat was observed when a certain hugetlbfs test
was run:

[  612.388273] ================================
[  612.411273] WARNING: inconsistent lock state
[  612.432273] 4.18.0-159.el8.x86_64+debug #1 Tainted: G        W --------- -  -
[  612.469273] --------------------------------
[  612.489273] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[  612.517273] swapper/30/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
[  612.541273] ffffffff9acdc038 (hugetlb_lock){+.?.}, at: free_huge_page+0x36f/0xaa0
[  612.576273] {SOFTIRQ-ON-W} state was registered at:
[  612.598273]   lock_acquire+0x14f/0x3b0
[  612.616273]   _raw_spin_lock+0x30/0x70
[  612.634273]   __nr_hugepages_store_common+0x11b/0xb30
[  612.657273]   hugetlb_sysctl_handler_common+0x209/0x2d0
[  612.681273]   proc_sys_call_handler+0x37f/0x450
[  612.703273]   vfs_write+0x157/0x460
[  612.719273]   ksys_write+0xb8/0x170
[  612.736273]   do_syscall_64+0xa5/0x4d0
[  612.753273]   entry_SYSCALL_64_after_hwframe+0x6a/0xdf
[  612.777273] irq event stamp: 691296
[  612.794273] hardirqs last  enabled at (691296): [<ffffffff99bb034b>] _raw_spin_unlock_irqrestore+0x4b/0x60
[  612.839273] hardirqs last disabled at (691295): [<ffffffff99bb0ad2>] _raw_spin_lock_irqsave+0x22/0x81
[  612.882273] softirqs last  enabled at (691284): [<ffffffff97ff0c63>] irq_enter+0xc3/0xe0
[  612.922273] softirqs last disabled at (691285): [<ffffffff97ff0ebe>] irq_exit+0x23e/0x2b0
[  612.962273]
[  612.962273] other info that might help us debug this:
[  612.993273]  Possible unsafe locking scenario:
[  612.993273]
[  613.020273]        CPU0
[  613.031273]        ----
[  613.042273]   lock(hugetlb_lock);
[  613.057273]   <Interrupt>
[  613.069273]     lock(hugetlb_lock);
[  613.085273]
[  613.085273]  *** DEADLOCK ***
      :
[  613.245273] Call Trace:
[  613.256273]  <IRQ>
[  613.265273]  dump_stack+0x9a/0xf0
[  613.281273]  mark_lock+0xd0c/0x12f0
[  613.297273]  ? print_shortest_lock_dependencies+0x80/0x80
[  613.322273]  ? sched_clock_cpu+0x18/0x1e0
[  613.341273]  __lock_acquire+0x146b/0x48c0
[  613.360273]  ? trace_hardirqs_on+0x10/0x10
[  613.379273]  ? trace_hardirqs_on_caller+0x27b/0x580
[  613.401273]  lock_acquire+0x14f/0x3b0
[  613.419273]  ? free_huge_page+0x36f/0xaa0
[  613.440273]  _raw_spin_lock+0x30/0x70
[  613.458273]  ? free_huge_page+0x36f/0xaa0
[  613.477273]  free_huge_page+0x36f/0xaa0
[  613.495273]  bio_check_pages_dirty+0x2fc/0x5c0
[  613.516273]  clone_endio+0x17f/0x670 [dm_mod]
[  613.536273]  ? disable_discard+0x90/0x90 [dm_mod]
[  613.558273]  ? bio_endio+0x4ba/0x930
[  613.575273]  ? blk_account_io_completion+0x400/0x530
[  613.598273]  blk_update_request+0x276/0xe50
[  613.617273]  scsi_end_request+0x7b/0x6a0
[  613.636273]  ? lock_downgrade+0x6f0/0x6f0
[  613.654273]  scsi_io_completion+0x1c6/0x1570
[  613.674273]  ? sd_completed_bytes+0x3a0/0x3a0 [sd_mod]
[  613.698273]  ? scsi_mq_requeue_cmd+0xc0/0xc0
[  613.718273]  blk_done_softirq+0x22e/0x350
[  613.737273]  ? blk_softirq_cpu_dead+0x230/0x230
[  613.758273]  __do_softirq+0x23d/0xad8
[  613.776273]  irq_exit+0x23e/0x2b0
[  613.792273]  do_IRQ+0x11a/0x200
[  613.806273]  common_interrupt+0xf/0xf
[  613.823273]  </IRQ>

Both the hugetlb_lock and the subpool lock can be acquired in
free_huge_page(). One way to solve the problem is to make both locks
irq-safe. An alternative is to defer the freeing to a workqueue job.

This patch implements the deferred freeing by adding a
free_hpage_workfn() work function to do the actual freeing. The
free_huge_page() call in a non-task context saves the page to be freed
in the hpage_freelist linked list in a lockless manner.

The system workqueue (via schedule_work()) is used to process the work,
but a dedicated workqueue can be used instead if it is desirable to have
the huge page freed ASAP.
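
As a rough illustration of that dedicated-workqueue alternative (not part
of this patch; the hpage_free_wq name and the init hook are hypothetical),
the setup could look something like:

static struct workqueue_struct *hpage_free_wq;

static int __init hpage_free_wq_init(void)
{
	/* WQ_MEM_RECLAIM guarantees forward progress under memory pressure */
	hpage_free_wq = alloc_workqueue("hpage_free",
					WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
	return hpage_free_wq ? 0 : -ENOMEM;
}
subsys_initcall(hpage_free_wq_init);

and free_huge_page() would then call queue_work(hpage_free_wq,
&free_hpage_work) instead of schedule_work(&free_hpage_work).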

 [v2: Add more comment & remove unneeded racing check]

Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 mm/hugetlb.c | 70 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 69 insertions(+), 1 deletion(-)

Comments

Michal Hocko Dec. 17, 2019, 9:31 a.m. UTC | #1
On Mon 16-12-19 20:25:08, Waiman Long wrote:
[...]
> Both the hugetbl_lock and the subpool lock can be acquired in
> free_huge_page(). One way to solve the problem is to make both locks
> irq-safe.

Please document why we do not take this quite natural path and instead
have to come up with a more elaborate way. I believe the primary
motivation is that some operations under those locks are quite
expensive. Please add that to the changelog and ideally to the code as
well. We probably want to fix those anyway and then this would be a
temporary workaround.

> Another alternative is to defer the freeing to a workqueue job.
> 
> This patch implements the deferred freeing by adding a
> free_hpage_workfn() work function to do the actual freeing. The
> free_huge_page() call in a non-task context saves the page to be freed
> in the hpage_freelist linked list in a lockless manner.

Do we need to overcomplicate this (presumably) rare event with a lockless
algorithm? Why can't we use a dedicated spin lock for the linked
list manipulation? This should really be trivial code without the
additional burden of all the lockless subtleties.
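
For reference, a dedicated-spinlock variant along those lines might look
roughly like the sketch below (hypothetical and untested; the lock name is
illustrative). The lock must be taken irq-safe since free_huge_page() can
be reached from softirq context:

static DEFINE_SPINLOCK(hpage_freelist_lock);
static struct page *hpage_freelist;

void free_huge_page(struct page *page)
{
	if (!in_task()) {
		unsigned long flags;

		/* reuse page->mapping as the next pointer, as the patch does */
		spin_lock_irqsave(&hpage_freelist_lock, flags);
		page->mapping = (struct address_space *)hpage_freelist;
		hpage_freelist = page;
		spin_unlock_irqrestore(&hpage_freelist_lock, flags);
		schedule_work(&free_hpage_work);
		return;
	}
	__free_huge_page(page);
}

The work function would take the same lock, detach the whole list, drop
the lock and then free the pages one by one.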

> +	pr_debug("HugeTLB: free_hpage_workfn() frees %d huge page(s)\n", cnt);

Why do we need the debugging message here?
Kirill Tkhai Dec. 17, 2019, 10:50 a.m. UTC | #2
On 17.12.2019 12:31, Michal Hocko wrote:
> On Mon 16-12-19 20:25:08, Waiman Long wrote:
> [...]
>> Both the hugetbl_lock and the subpool lock can be acquired in
>> free_huge_page(). One way to solve the problem is to make both locks
>> irq-safe.
> 
> Please document why we do not take this, quite natural path and instead
> we have to come up with an elaborate way instead. I believe the primary
> motivation is that some operations under those locks are quite
> expensive. Please add that to the changelog and ideally to the code as
> well. We probably want to fix those anyway and then this would be a
> temporary workaround.
> 
>> Another alternative is to defer the freeing to a workqueue job.
>>
>> This patch implements the deferred freeing by adding a
>> free_hpage_workfn() work function to do the actual freeing. The
>> free_huge_page() call in a non-task context saves the page to be freed
>> in the hpage_freelist linked list in a lockless manner.
> 
> Do we need to over complicate this (presumably) rare event by a lockless
> algorithm? Why cannot we use a dedicated spin lock for for the linked
> list manipulation? This should be really a trivial code without an
> additional burden of all the lockless subtleties.

Why not llist_add()/llist_del_all() ?
Waiman Long Dec. 17, 2019, 2 p.m. UTC | #3
On 12/17/19 5:50 AM, Kirill Tkhai wrote:
> On 17.12.2019 12:31, Michal Hocko wrote:
>> On Mon 16-12-19 20:25:08, Waiman Long wrote:
>> [...]
>>> Both the hugetbl_lock and the subpool lock can be acquired in
>>> free_huge_page(). One way to solve the problem is to make both locks
>>> irq-safe.
>> Please document why we do not take this, quite natural path and instead
>> we have to come up with an elaborate way instead. I believe the primary
>> motivation is that some operations under those locks are quite
>> expensive. Please add that to the changelog and ideally to the code as
>> well. We probably want to fix those anyway and then this would be a
>> temporary workaround.
>>
>>> Another alternative is to defer the freeing to a workqueue job.
>>>
>>> This patch implements the deferred freeing by adding a
>>> free_hpage_workfn() work function to do the actual freeing. The
>>> free_huge_page() call in a non-task context saves the page to be freed
>>> in the hpage_freelist linked list in a lockless manner.
>> Do we need to over complicate this (presumably) rare event by a lockless
>> algorithm? Why cannot we use a dedicated spin lock for for the linked
>> list manipulation? This should be really a trivial code without an
>> additional burden of all the lockless subtleties.
> Why not llist_add()/llist_del_all() ?
>
The llist_add() and llist_del_all() are just simple helpers. Because
this lockless case involves synchronizing two variables (the
hpage_freelist head and the page->mapping next pointer), the llist
helpers do not directly apply here, so the rest of the llist API cannot
be used either. It would look awkward if the code were only partially
converted to use the helpers. If we convert to using a lock as suggested
by Michal, the helpers would be overkill as xchg() will not be needed.

Cheers,
Longman
Waiman Long Dec. 17, 2019, 2:06 p.m. UTC | #4
On 12/17/19 4:31 AM, Michal Hocko wrote:
> On Mon 16-12-19 20:25:08, Waiman Long wrote:
> [...]
>> Both the hugetbl_lock and the subpool lock can be acquired in
>> free_huge_page(). One way to solve the problem is to make both locks
>> irq-safe.
> Please document why we do not take this, quite natural path and instead
> we have to come up with an elaborate way instead. I believe the primary
> motivation is that some operations under those locks are quite
> expensive. Please add that to the changelog and ideally to the code as
> well. We probably want to fix those anyway and then this would be a
> temporary workaround.
>
Fair enough, I will include data from Mike about some use cases where
hugetlb_lock will be held for a long time.


>> Another alternative is to defer the freeing to a workqueue job.
>>
>> This patch implements the deferred freeing by adding a
>> free_hpage_workfn() work function to do the actual freeing. The
>> free_huge_page() call in a non-task context saves the page to be freed
>> in the hpage_freelist linked list in a lockless manner.
> Do we need to over complicate this (presumably) rare event by a lockless
> algorithm? Why cannot we use a dedicated spin lock for for the linked
> list manipulation? This should be really a trivial code without an
> additional burden of all the lockless subtleties.

Right, I can use an irq-safe raw spinlock instead. I am fine doing that.


>
>> +	pr_debug("HugeTLB: free_hpage_workfn() frees %d huge page(s)\n", cnt);
> Why do we need the debugging message here?

It is there just to verify that the workfn is properly activated and
frees the huge page. This message won't be printed by default. I can
remove it if you guys don't really want a debug statement here.

Cheers,
Longman
Kirill Tkhai Dec. 17, 2019, 2:13 p.m. UTC | #5
On 17.12.2019 17:00, Waiman Long wrote:
> On 12/17/19 5:50 AM, Kirill Tkhai wrote:
>> On 17.12.2019 12:31, Michal Hocko wrote:
>>> On Mon 16-12-19 20:25:08, Waiman Long wrote:
>>> [...]
>>>> Both the hugetbl_lock and the subpool lock can be acquired in
>>>> free_huge_page(). One way to solve the problem is to make both locks
>>>> irq-safe.
>>> Please document why we do not take this, quite natural path and instead
>>> we have to come up with an elaborate way instead. I believe the primary
>>> motivation is that some operations under those locks are quite
>>> expensive. Please add that to the changelog and ideally to the code as
>>> well. We probably want to fix those anyway and then this would be a
>>> temporary workaround.
>>>
>>>> Another alternative is to defer the freeing to a workqueue job.
>>>>
>>>> This patch implements the deferred freeing by adding a
>>>> free_hpage_workfn() work function to do the actual freeing. The
>>>> free_huge_page() call in a non-task context saves the page to be freed
>>>> in the hpage_freelist linked list in a lockless manner.
>>> Do we need to over complicate this (presumably) rare event by a lockless
>>> algorithm? Why cannot we use a dedicated spin lock for for the linked
>>> list manipulation? This should be really a trivial code without an
>>> additional burden of all the lockless subtleties.
>> Why not llist_add()/llist_del_all() ?
>>
> The llist_add() and llist_del_all() are just simple helpers. Because
> this lockless case involve synchronization of two variables, the llist
> helpers do not directly apply here. So the rests cannot be used. It will
> look awkward it is partially converted to use the helpers. If we convert
> to use a lock as suggested by Michal, using the helpers will be an
> overkill as xchg() will not be needed.

I don't understand you. What are two variables?

Why can't you simply do the below?

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ac65bb5e38ac..e8ec753f3d92 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1136,7 +1136,7 @@ static inline void ClearPageHugeTemporary(struct page *page)
 	page[2].mapping = NULL;
 }
 
-void free_huge_page(struct page *page)
+static void __free_huge_page(struct page *page)
 {
 	/*
 	 * Can't pass hstate in here because it is called from the
@@ -1199,6 +1199,35 @@ void free_huge_page(struct page *page)
 	spin_unlock(&hugetlb_lock);
 }
 
+static struct llist_head hpage_freelist = LLIST_HEAD_INIT;
+
+static void free_hpage_workfn(struct work_struct *work)
+{
+	struct llist_node *node;
+	struct page *page;
+
+	node = llist_del_all(&hpage_freelist);
+
+	while (node) {
+		page = container_of(node, struct page, mapping);
+		node = node->next;
+		__free_huge_page(page);
+	}
+}
+
+static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
+
+void free_huge_page(struct page *page)
+{
+	if (!in_task()) {
+		if (llist_add((struct llist_node *)&page->mapping, &hpage_freelist))
+			schedule_work(&free_hpage_work);
+		return;
+	}
+
+	__free_huge_page(page);
+}
+
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 {
 	INIT_LIST_HEAD(&page->lru);
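
(For reference: llist_add() returns true only when the list was empty
before the addition, so schedule_work() above fires once per batch of
queued pages rather than on every free, and container_of(node, struct
page, mapping) recovers the page because the llist_node is stored in the
page->mapping slot.)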
Waiman Long Dec. 17, 2019, 2:27 p.m. UTC | #6
On 12/17/19 9:13 AM, Kirill Tkhai wrote:
> On 17.12.2019 17:00, Waiman Long wrote:
>> On 12/17/19 5:50 AM, Kirill Tkhai wrote:
>>> On 17.12.2019 12:31, Michal Hocko wrote:
>>>> On Mon 16-12-19 20:25:08, Waiman Long wrote:
>>>> [...]
>>>>> Both the hugetbl_lock and the subpool lock can be acquired in
>>>>> free_huge_page(). One way to solve the problem is to make both locks
>>>>> irq-safe.
>>>> Please document why we do not take this, quite natural path and instead
>>>> we have to come up with an elaborate way instead. I believe the primary
>>>> motivation is that some operations under those locks are quite
>>>> expensive. Please add that to the changelog and ideally to the code as
>>>> well. We probably want to fix those anyway and then this would be a
>>>> temporary workaround.
>>>>
>>>>> Another alternative is to defer the freeing to a workqueue job.
>>>>>
>>>>> This patch implements the deferred freeing by adding a
>>>>> free_hpage_workfn() work function to do the actual freeing. The
>>>>> free_huge_page() call in a non-task context saves the page to be freed
>>>>> in the hpage_freelist linked list in a lockless manner.
>>>> Do we need to over complicate this (presumably) rare event by a lockless
>>>> algorithm? Why cannot we use a dedicated spin lock for for the linked
>>>> list manipulation? This should be really a trivial code without an
>>>> additional burden of all the lockless subtleties.
>>> Why not llist_add()/llist_del_all() ?
>>>
>> The llist_add() and llist_del_all() are just simple helpers. Because
>> this lockless case involve synchronization of two variables, the llist
>> helpers do not directly apply here. So the rests cannot be used. It will
>> look awkward it is partially converted to use the helpers. If we convert
>> to use a lock as suggested by Michal, using the helpers will be an
>> overkill as xchg() will not be needed.
> I don't understand you. What are two variables?
>
> Why can't you simply do the below?
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ac65bb5e38ac..e8ec753f3d92 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1136,7 +1136,7 @@ static inline void ClearPageHugeTemporary(struct page *page)
>  	page[2].mapping = NULL;
>  }
>  
> -void free_huge_page(struct page *page)
> +static void __free_huge_page(struct page *page)
>  {
>  	/*
>  	 * Can't pass hstate in here because it is called from the
> @@ -1199,6 +1199,35 @@ void free_huge_page(struct page *page)
>  	spin_unlock(&hugetlb_lock);
>  }
>  
> +static struct llist_head hpage_freelist = LLIST_HEAD_INIT;
> +
> +static void free_hpage_workfn(struct work_struct *work)
> +{
> +	struct llist_node *node;
> +	struct page *page;
> +
> +	node = llist_del_all(&hpage_freelist);
> +
> +	while (node) {
> +		page = container_of(node, struct page, mapping);
> +		node = node->next;
> +		__free_huge_page(page);
> +	}
> +}
> +
> +static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
> +
> +void free_huge_page(struct page *page)
> +{
> +	if (!in_task()) {
> +		if (llist_add((struct llist_node *)&page->mapping, &hpage_freelist))
> +			schedule_work(&free_hpage_work);
> +		return;
> +	}
> +
> +	__free_huge_page(page);
> +}
> +
>  static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
>  {
>  	INIT_LIST_HEAD(&page->lru);
>
You are right. That should work. I was not aware of llist before, so I
hadn't fully grasped its capabilities. Thanks for the suggestion.

Cheers,
Longman
Michal Hocko Dec. 17, 2019, 2:59 p.m. UTC | #7
On Tue 17-12-19 09:06:31, Waiman Long wrote:
> On 12/17/19 4:31 AM, Michal Hocko wrote:
> > On Mon 16-12-19 20:25:08, Waiman Long wrote:
[...]
> >> +	pr_debug("HugeTLB: free_hpage_workfn() frees %d huge page(s)\n", cnt);
> > Why do we need the debugging message here?
> 
> It is there just to verify that the workfn is properly activated and
> frees the huge page. This message won't be printed by default. I can
> remove it if you guys don't really want a debug statement here.

Yes, drop it please. We are not adding debugging messages unless they
are really actionable. If this is a sign of a bug then put a WARN_ONCE or
something like that. But with simple code like this it doesn't really
seem to be suitable IMHO.

Thanks!
Mike Kravetz Dec. 17, 2019, 6:33 p.m. UTC | #8
On 12/17/19 1:31 AM, Michal Hocko wrote:
> On Mon 16-12-19 20:25:08, Waiman Long wrote:
> [...]
>> Both the hugetbl_lock and the subpool lock can be acquired in
>> free_huge_page(). One way to solve the problem is to make both locks
>> irq-safe.
> 
> Please document why we do not take this, quite natural path and instead
> we have to come up with an elaborate way instead. I believe the primary
> motivation is that some operations under those locks are quite
> expensive. Please add that to the changelog and ideally to the code as
> well. We probably want to fix those anyway and then this would be a
> temporary workaround.

We may have talked in the past about how hugetlbfs locking is not ideal.
However, there are many things in hugetlbfs that are not ideal. :(  The
thought was to avoid making changes unless they showed up as real problems.
Looks like the locking is now a real issue.

I'll start work on restructuring at least this part of the locking.  But,
let's move forward with the deferred freeing approach until that is ready.

Patch

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ac65bb5e38ac..1744481bb9a1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1136,7 +1136,7 @@  static inline void ClearPageHugeTemporary(struct page *page)
 	page[2].mapping = NULL;
 }
 
-void free_huge_page(struct page *page)
+static void __free_huge_page(struct page *page)
 {
 	/*
 	 * Can't pass hstate in here because it is called from the
@@ -1199,6 +1199,74 @@  void free_huge_page(struct page *page)
 	spin_unlock(&hugetlb_lock);
 }
 
+/*
+ * As free_huge_page() can be called from a non-task context, we have
+ * to defer the actual freeing in a workqueue to prevent potential
+ * hugetlb_lock deadlock.
+ *
+ * free_hpage_workfn() locklessly retrieves the linked list of pages to
+ * be freed and frees them one-by-one. As the page->mapping pointer is
+ * going to be cleared in __free_huge_page() anyway, it is reused as the
+ * next pointer of a singly linked list of huge pages to be freed.
+ */
+#define NEXT_PENDING	((struct page *)-1)
+static struct page *hpage_freelist;
+
+static void free_hpage_workfn(struct work_struct *work)
+{
+	struct page *curr, *next;
+	int cnt = 0;
+
+	do {
+		curr = xchg(&hpage_freelist, NULL);
+		if (!curr)
+			break;
+
+		while (curr) {
+			next = (struct page *)READ_ONCE(curr->mapping);
+			if (next == NEXT_PENDING) {
+				cpu_relax();
+				continue;
+			}
+			__free_huge_page(curr);
+			curr = next;
+			cnt++;
+		}
+	} while (!READ_ONCE(hpage_freelist));
+
+	if (!cnt)
+		return;
+	pr_debug("HugeTLB: free_hpage_workfn() frees %d huge page(s)\n", cnt);
+}
+static DECLARE_WORK(free_hpage_work, free_hpage_workfn);
+
+void free_huge_page(struct page *page)
+{
+	/*
+	 * Defer freeing if in non-task context to avoid hugetlb_lock deadlock.
+	 */
+	if (!in_task()) {
+		struct page *next;
+
+		/*
+		 * Since we cannot atomically update both page->mapping
+		 * and hpage_freelist without lock, a handshake using an
+		 * intermediate NEXT_PENDING value in mapping is used to
+		 * make sure that the reader will see both values correctly.
+		 * The initial write of NEXT_PENDING is before the page
+		 * is visible to a concurrent reader, so WRITE_ONCE()
+		 * isn't needed.
+		 */
+		page->mapping = (struct address_space *)NEXT_PENDING;
+		next = xchg(&hpage_freelist, page);
+		WRITE_ONCE(page->mapping, (struct address_space *)next);
+		schedule_work(&free_hpage_work);
+		return;
+	}
+
+	__free_huge_page(page);
+}
+
 static void prep_new_huge_page(struct hstate *h, struct page *page, int nid)
 {
 	INIT_LIST_HEAD(&page->lru);