
[v6,17/18] mm/memory-failure: increase queued recovery work's priority

Message ID: 20180921221705.6478-18-james.morse@arm.com
State: Changes Requested, archived
Series: APEI in_nmi() rework

Commit Message

James Morse Sept. 21, 2018, 10:17 p.m. UTC
arm64 can take an NMI-like error notification when user-space steps on
some corrupt memory. APEI's GHES code will call memory_failure_queue()
to schedule the recovery work. We then return to user-space, possibly
taking the fault again.

Currently the arch code unconditionally signals user-space from this
path, so we don't get stuck in this loop, but the affected process
never benefits from memory_failure()'s recovery work. To fix this we
need to know the recovery work will run before we get back to user-space.

Increase the priority of the recovery work by scheduling it on the
system_highpri_wq, then try to bump the current task off this CPU
so that the recovery work starts immediately.

Reported-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
Tested-by: Tyler Baicar <tbaicar@codeaurora.org>
Tested-by: gengdongjiu <gengdongjiu@huawei.com>
CC: Xie XiuQi <xiexiuqi@huawei.com>
CC: gengdongjiu <gengdongjiu@huawei.com>
---
 mm/memory-failure.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)
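
For context, the work item queued here is drained by
memory_failure_work_func(), which is what actually unmaps and poisons the
page — hence the concern about whether it runs before the task returns to
user-space. The sketch below is a simplified paraphrase of that consumer
as of this era of mm/memory-failure.c; the MF_SOFT_OFFLINE branch is
elided:

static void memory_failure_work_func(struct work_struct *work)
{
	struct memory_failure_cpu *mf_cpu = this_cpu_ptr(&memory_failure_cpu);
	struct memory_failure_entry entry = { 0, };
	unsigned long proc_flags;
	int gotten;

	for (;;) {
		spin_lock_irqsave(&mf_cpu->lock, proc_flags);
		gotten = kfifo_get(&mf_cpu->fifo, &entry);
		spin_unlock_irqrestore(&mf_cpu->lock, proc_flags);
		if (!gotten)
			break;
		/* Unmap the page, poison it and signal its users. */
		memory_failure(entry.pfn, entry.flags);
	}
}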

Comments

Borislav Petkov Oct. 15, 2018, 4:49 p.m. UTC | #1
+ Peter.

On Fri, Sep 21, 2018 at 11:17:04PM +0100, James Morse wrote:
> arm64 can take an NMI-like error notification when user-space steps on
> some corrupt memory. APEI's GHES code will call memory_failure_queue()
> to schedule the recovery work. We then return to user-space, possibly
> taking the fault again.
>
> Currently the arch code unconditionally signals user-space from this
> path, so we don't get stuck in this loop, but the affected process
> never benefits from memory_failure()'s recovery work. To fix this we
> need to know the recovery work will run before we get back to user-space.
> 
> Increase the priority of the recovery work by scheduling it on the
> system_highpri_wq, then try to bump the current task off this CPU
> so that the recovery work starts immediately.
> 
> Reported-by: Xie XiuQi <xiexiuqi@huawei.com>
> Signed-off-by: James Morse <james.morse@arm.com>
> Reviewed-by: Punit Agrawal <punit.agrawal@arm.com>
> Tested-by: Tyler Baicar <tbaicar@codeaurora.org>
> Tested-by: gengdongjiu <gengdongjiu@huawei.com>
> CC: Xie XiuQi <xiexiuqi@huawei.com>
> CC: gengdongjiu <gengdongjiu@huawei.com>
> ---
>  mm/memory-failure.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 0cd3de3550f0..4e7b115cea5a 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -56,6 +56,7 @@
>  #include <linux/memory_hotplug.h>
>  #include <linux/mm_inline.h>
>  #include <linux/memremap.h>
> +#include <linux/preempt.h>
>  #include <linux/kfifo.h>
>  #include <linux/ratelimit.h>
>  #include <linux/page-isolation.h>
> @@ -1454,6 +1455,7 @@ static DEFINE_PER_CPU(struct memory_failure_cpu, memory_failure_cpu);
>   */
>  void memory_failure_queue(unsigned long pfn, int flags)
>  {
> +	int cpu = smp_processor_id();
>  	struct memory_failure_cpu *mf_cpu;
>  	unsigned long proc_flags;
>  	struct memory_failure_entry entry = {
> @@ -1463,11 +1465,14 @@ void memory_failure_queue(unsigned long pfn, int flags)
>  
>  	mf_cpu = &get_cpu_var(memory_failure_cpu);
>  	spin_lock_irqsave(&mf_cpu->lock, proc_flags);
> -	if (kfifo_put(&mf_cpu->fifo, entry))
> -		schedule_work_on(smp_processor_id(), &mf_cpu->work);
> -	else
> +	if (kfifo_put(&mf_cpu->fifo, entry)) {
> +		queue_work_on(cpu, system_highpri_wq, &mf_cpu->work);
> +		set_tsk_need_resched(current);
> +		preempt_set_need_resched();

What guarantees the workqueue would run before the process? I see this:

``WQ_HIGHPRI``
  Work items of a highpri wq are queued to the highpri
  worker-pool of the target cpu.  Highpri worker-pools are
  served by worker threads with elevated nice level.

but is that enough?
Peter Zijlstra Oct. 16, 2018, 7:43 a.m. UTC | #2
On Mon, Oct 15, 2018 at 06:49:13PM +0200, Borislav Petkov wrote:
> On Fri, Sep 21, 2018 at 11:17:04PM +0100, James Morse wrote:
> > @@ -1463,11 +1465,14 @@ void memory_failure_queue(unsigned long pfn, int flags)
> >  
> >  	mf_cpu = &get_cpu_var(memory_failure_cpu);
> >  	spin_lock_irqsave(&mf_cpu->lock, proc_flags);
> > -	if (kfifo_put(&mf_cpu->fifo, entry))
> > -		schedule_work_on(smp_processor_id(), &mf_cpu->work);
> > -	else
> > +	if (kfifo_put(&mf_cpu->fifo, entry)) {
> > +		queue_work_on(cpu, system_highpri_wq, &mf_cpu->work);
> > +		set_tsk_need_resched(current);
> > +		preempt_set_need_resched();
> 
> What guarantees the workqueue would run before the process? I see this:
> 
> ``WQ_HIGHPRI``
>   Work items of a highpri wq are queued to the highpri
>   worker-pool of the target cpu.  Highpri worker-pools are
>   served by worker threads with elevated nice level.
> 
> but is that enough?

Nope. Nice just makes it more likely, but no guarantees whatsoever.

If you want to absolutely run something before we return to userspace,
would not task_work() be what we're looking for?
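
A rough sketch of that suggestion, for illustration only: it assumes the
v4.19-era task_work_add() signature (third argument is a bool notify),
and the twork member and mf_run_queue() helper are hypothetical
additions, not part of this series:

#include <linux/sched.h>
#include <linux/task_work.h>

/* Hypothetical: add a callback_head to the per-cpu state for task_work. */
struct memory_failure_cpu {
	DECLARE_KFIFO(fifo, struct memory_failure_entry,
		      MEMORY_FAILURE_FIFO_SIZE);
	spinlock_t lock;
	struct work_struct work;
	struct callback_head twork;
};

static void mf_run_queue(struct callback_head *twork)
{
	struct memory_failure_cpu *mf_cpu =
		container_of(twork, struct memory_failure_cpu, twork);

	/* Process context, on the return-to-user path: the fifo is
	 * drained before the task can step on the bad page again. */
	memory_failure_work_func(&mf_cpu->work);
}

/* Then, in memory_failure_queue(), instead of raising need_resched: */
	init_task_work(&mf_cpu->twork, mf_run_queue);
	task_work_add(current, &mf_cpu->twork, true);	/* run before exit to user */

One caveat the sketch glosses over: task_work only helps when the
notification lands on a task that will return to user-space; errors
taken while running a kernel thread or the idle task would still need
the workqueue path.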

Patch

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 0cd3de3550f0..4e7b115cea5a 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -56,6 +56,7 @@
 #include <linux/memory_hotplug.h>
 #include <linux/mm_inline.h>
 #include <linux/memremap.h>
+#include <linux/preempt.h>
 #include <linux/kfifo.h>
 #include <linux/ratelimit.h>
 #include <linux/page-isolation.h>
@@ -1454,6 +1455,7 @@ static DEFINE_PER_CPU(struct memory_failure_cpu, memory_failure_cpu);
  */
 void memory_failure_queue(unsigned long pfn, int flags)
 {
+	int cpu = smp_processor_id();
 	struct memory_failure_cpu *mf_cpu;
 	unsigned long proc_flags;
 	struct memory_failure_entry entry = {
@@ -1463,11 +1465,14 @@ void memory_failure_queue(unsigned long pfn, int flags)
 
 	mf_cpu = &get_cpu_var(memory_failure_cpu);
 	spin_lock_irqsave(&mf_cpu->lock, proc_flags);
-	if (kfifo_put(&mf_cpu->fifo, entry))
-		schedule_work_on(smp_processor_id(), &mf_cpu->work);
-	else
+	if (kfifo_put(&mf_cpu->fifo, entry)) {
+		queue_work_on(cpu, system_highpri_wq, &mf_cpu->work);
+		set_tsk_need_resched(current);
+		preempt_set_need_resched();
+	} else {
 		pr_err("Memory failure: buffer overflow when queuing memory failure at %#lx\n",
 		       pfn);
+	}
 	spin_unlock_irqrestore(&mf_cpu->lock, proc_flags);
 	put_cpu_var(memory_failure_cpu);
 }