
[v2,1/1] mm/vmalloc: Move draining areas out of caller context

Message ID 20220125163912.2809-1-urezki@gmail.com
State New
Series [v2,1/1] mm/vmalloc: Move draining areas out of caller context

Commit Message

Uladzislau Rezki Jan. 25, 2022, 4:39 p.m. UTC
A caller initiates the drain process from its own context once the
drain threshold is reached or passed. There are at least two
drawbacks to doing so:

a) a caller can be a high-priority or RT task. In that case it can
   get stuck doing the actual drain of all lazily freed areas. This
   is not optimal because such tasks are usually latency sensitive,
   and control should be returned to them as soon as possible so the
   workload can be driven in time. See
   96e2db456135 ("mm/vmalloc: rework the drain logic")

b) It is not safe to call vfree() while holding a spinlock because of
   the vmap_purge_lock mutex. There was a report about this from
   Zeal Robot <zealci@zte.com.cn> here:
   https://lore.kernel.org/all/20211222081026.484058-1-chi.minghao@zte.com.cn

Moving the drain to a separate work context addresses both of those
issues.

v1->v2:
   - Added prefix "_work" to the drain worker function.

Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
---
 mm/vmalloc.c | 35 ++++++++++++++++++++++-------------
 1 file changed, 22 insertions(+), 13 deletions(-)

Comments

Matthew Wilcox Jan. 25, 2022, 4:50 p.m. UTC | #1
On Tue, Jan 25, 2022 at 05:39:12PM +0100, Uladzislau Rezki (Sony) wrote:
> @@ -1768,7 +1776,8 @@ static void free_vmap_area_noflush(struct vmap_area *va)
>  
>  	/* After this point, we may free va at any time */
>  	if (unlikely(nr_lazy > lazy_max_pages()))
> -		try_purge_vmap_area_lazy();
> +		if (!atomic_xchg(&drain_vmap_work_in_progress, 1))
> +			schedule_work(&drain_vmap_work);
>  }

Is it necessary to have drain_vmap_work_in_progress?  The documentation
says:

 * This puts a job in the kernel-global workqueue if it was not already
 * queued and leaves it in the same position on the kernel-global
 * workqueue otherwise.

and the implementation seems to use test_and_set_bit() to ensure this
is true.
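
For reference, the guard inside queue_work_on() (which schedule_work()
ends up calling) looks roughly like this, with the irq save/restore and
the __queue_work() internals omitted:

	bool queue_work_on(int cpu, struct workqueue_struct *wq,
			   struct work_struct *work)
	{
		bool ret = false;

		/* Queue only if the PENDING bit was not already set. */
		if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT,
				      work_data_bits(work))) {
			__queue_work(cpu, wq, work);
			ret = true;
		}

		return ret;
	}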
Uladzislau Rezki Jan. 25, 2022, 5:12 p.m. UTC | #2
On Tue, Jan 25, 2022 at 04:50:14PM +0000, Matthew Wilcox wrote:
> On Tue, Jan 25, 2022 at 05:39:12PM +0100, Uladzislau Rezki (Sony) wrote:
> > @@ -1768,7 +1776,8 @@ static void free_vmap_area_noflush(struct vmap_area *va)
> >  
> >  	/* After this point, we may free va at any time */
> >  	if (unlikely(nr_lazy > lazy_max_pages()))
> > -		try_purge_vmap_area_lazy();
> > +		if (!atomic_xchg(&drain_vmap_work_in_progress, 1))
> > +			schedule_work(&drain_vmap_work);
> >  }
> 
> Is it necessary to have drain_vmap_work_in_progress?  The documentation
> says:
> 
>  * This puts a job in the kernel-global workqueue if it was not already
>  * queued and leaves it in the same position on the kernel-global
>  * workqueue otherwise.
> 
> and the implementation seems to use test_and_set_bit() to ensure this
> is true.
>
It only checks the pending state; once the work is already running it
can be queued one more time. The motivation for having the flag is to
prevent the drain work from being queued several times at once, which
is what I see in my stress testing.

CPU_1: invokes vfree() -> queues the drain work -> TASK_RUNNING
CPU_2: invokes vfree() -> queues the drain work one more time since it was not pending
...
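
In other words, the window is between the work starting to run (which
clears the pending bit) and it finishing. The atomic_xchg() at the call
site (the hunk quoted above) is what closes it:

	/*
	 * Only the first caller flips the flag 0 -> 1 and queues the work;
	 * later callers see 1 and skip queuing even while the worker is
	 * already running. The worker resets the flag to 0 when it is done.
	 */
	if (unlikely(nr_lazy > lazy_max_pages()))
		if (!atomic_xchg(&drain_vmap_work_in_progress, 1))
			schedule_work(&drain_vmap_work);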

Instead of the drain_vmap_work_in_progress hack we could make use of the
work_busy() helper. The main concern with that approach is the comment
above that function:

/**
 * work_busy - test whether a work is currently pending or running
 * @work: the work to be tested
 *
 * Test whether @work is currently pending or running.  There is no
 * synchronization around this function and the test result is
 * unreliable and only useful as advisory hints or for debugging.
 *
 * Return:
 * OR'd bitmask of WORK_BUSY_* bits.
 */

I am not sure how reliable that is.
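
For illustration, the work_busy() based variant would look something like
this (untested sketch, reusing the same drain_vmap_work item):

	/*
	 * Gate the queuing on work_busy() instead of the extra atomic
	 * flag. Per the comment above, the result is only advisory, so
	 * this check can race with the worker finishing.
	 */
	if (unlikely(nr_lazy > lazy_max_pages()))
		if (!work_busy(&drain_vmap_work))
			schedule_work(&drain_vmap_work);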

Thoughts?

--
Vlad Rezki
Matthew Wilcox Jan. 25, 2022, 6:46 p.m. UTC | #3
On Tue, Jan 25, 2022 at 06:12:48PM +0100, Uladzislau Rezki wrote:
> On Tue, Jan 25, 2022 at 04:50:14PM +0000, Matthew Wilcox wrote:
> > On Tue, Jan 25, 2022 at 05:39:12PM +0100, Uladzislau Rezki (Sony) wrote:
> > > @@ -1768,7 +1776,8 @@ static void free_vmap_area_noflush(struct vmap_area *va)
> > >  
> > >  	/* After this point, we may free va at any time */
> > >  	if (unlikely(nr_lazy > lazy_max_pages()))
> > > -		try_purge_vmap_area_lazy();
> > > +		if (!atomic_xchg(&drain_vmap_work_in_progress, 1))
> > > +			schedule_work(&drain_vmap_work);
> > >  }
> > 
> > Is it necessary to have drain_vmap_work_in_progress?  The documentation
> > says:
> > 
> >  * This puts a job in the kernel-global workqueue if it was not already
> >  * queued and leaves it in the same position on the kernel-global
> >  * workqueue otherwise.
> > 
> > and the implementation seems to use test_and_set_bit() to ensure this
> > is true.
> >
> It only checks the pending state; once the work is already running it
> can be queued one more time. The motivation for having the flag is to
> prevent the drain work from being queued several times at once, which
> is what I see in my stress testing.
> 
> CPU_1: invokes vfree() -> queues the drain work -> TASK_RUNNING
> CPU_2: invokes vfree() -> queues the drain work one more time since it was not pending

But why not unconditionally call schedule_work() here?
Uladzislau Rezki Jan. 25, 2022, 7:17 p.m. UTC | #4
On Tue, Jan 25, 2022 at 06:46:35PM +0000, Matthew Wilcox wrote:
> On Tue, Jan 25, 2022 at 06:12:48PM +0100, Uladzislau Rezki wrote:
> > On Tue, Jan 25, 2022 at 04:50:14PM +0000, Matthew Wilcox wrote:
> > > On Tue, Jan 25, 2022 at 05:39:12PM +0100, Uladzislau Rezki (Sony) wrote:
> > > > @@ -1768,7 +1776,8 @@ static void free_vmap_area_noflush(struct vmap_area *va)
> > > >  
> > > >  	/* After this point, we may free va at any time */
> > > >  	if (unlikely(nr_lazy > lazy_max_pages()))
> > > > -		try_purge_vmap_area_lazy();
> > > > +		if (!atomic_xchg(&drain_vmap_work_in_progress, 1))
> > > > +			schedule_work(&drain_vmap_work);
> > > >  }
> > > 
> > > Is it necessary to have drain_vmap_work_in_progress?  The documentation
> > > says:
> > > 
> > >  * This puts a job in the kernel-global workqueue if it was not already
> > >  * queued and leaves it in the same position on the kernel-global
> > >  * workqueue otherwise.
> > > 
> > > and the implementation seems to use test_and_set_bit() to ensure this
> > > is true.
> > >
> > It only checks the pending state; once the work is already running it
> > can be queued one more time. The motivation for having the flag is to
> > prevent the drain work from being queued several times at once, which
> > is what I see in my stress testing.
> > 
> > CPU_1: invokes vfree() -> queues the drain work -> TASK_RUNNING
> > CPU_2: invokes vfree() -> queues the drain work one more time since it was not pending
> 
> But why not unconditionally call schedule_work() here?
>
We can :) The question is whether we agree that the extra queuing would be
kind of spurious, because CPU_1 will complete all the cleanups once it is
physically on a CPU and the other workers will just bail out.

We can certainly disregard those spurious wake-ups. If someone complains
about it in the future we can think about it then.

Should I re-spin and do it unconditionally? I do not have a strong opinion
about it.
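
For completeness, the unconditional variant at the call site would just
be (sketch):

	/*
	 * schedule_work() will not queue the work again while it is still
	 * pending, so the only duplication is an extra run queued after
	 * the worker has started; that run typically finds little or
	 * nothing left to purge.
	 */
	if (unlikely(nr_lazy > lazy_max_pages()))
		schedule_work(&drain_vmap_work);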

--
Vlad Rezki

Patch

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index bdc7222f87d4..e5285c9d2e2a 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -793,6 +793,9 @@  RB_DECLARE_CALLBACKS_MAX(static, free_vmap_area_rb_augment_cb,
 static void purge_vmap_area_lazy(void);
 static BLOCKING_NOTIFIER_HEAD(vmap_notify_list);
 static unsigned long lazy_max_pages(void);
+static void drain_vmap_area_work(struct work_struct *work);
+static DECLARE_WORK(drain_vmap_work, drain_vmap_area_work);
+static atomic_t drain_vmap_work_in_progress;
 
 static atomic_long_t nr_vmalloc_pages;
 
@@ -1719,18 +1722,6 @@  static bool __purge_vmap_area_lazy(unsigned long start, unsigned long end)
 	return true;
 }
 
-/*
- * Kick off a purge of the outstanding lazy areas. Don't bother if somebody
- * is already purging.
- */
-static void try_purge_vmap_area_lazy(void)
-{
-	if (mutex_trylock(&vmap_purge_lock)) {
-		__purge_vmap_area_lazy(ULONG_MAX, 0);
-		mutex_unlock(&vmap_purge_lock);
-	}
-}
-
 /*
  * Kick off a purge of the outstanding lazy areas.
  */
@@ -1742,6 +1733,23 @@  static void purge_vmap_area_lazy(void)
 	mutex_unlock(&vmap_purge_lock);
 }
 
+static void drain_vmap_area_work(struct work_struct *work)
+{
+	unsigned long nr_lazy;
+
+	do {
+		mutex_lock(&vmap_purge_lock);
+		__purge_vmap_area_lazy(ULONG_MAX, 0);
+		mutex_unlock(&vmap_purge_lock);
+
+		/* Recheck if further work is required. */
+		nr_lazy = atomic_long_read(&vmap_lazy_nr);
+	} while (nr_lazy > lazy_max_pages());
+
+	/* We are done at this point. */
+	atomic_set(&drain_vmap_work_in_progress, 0);
+}
+
 /*
  * Free a vmap area, caller ensuring that the area has been unmapped
  * and flush_cache_vunmap had been called for the correct range
@@ -1768,7 +1776,8 @@  static void free_vmap_area_noflush(struct vmap_area *va)
 
 	/* After this point, we may free va at any time */
 	if (unlikely(nr_lazy > lazy_max_pages()))
-		try_purge_vmap_area_lazy();
+		if (!atomic_xchg(&drain_vmap_work_in_progress, 1))
+			schedule_work(&drain_vmap_work);
 }
 
 /*