[1/3] perf/core: Make sure the ring-buffer is mapped in all page-tables

Message ID 1532103744-31902-2-git-send-email-joro@8bytes.org (mailing list archive)
State New, archived
Headers show

Commit Message

Joerg Roedel July 20, 2018, 4:22 p.m. UTC
From: Joerg Roedel <jroedel@suse.de>

The ring-buffer is accessed in the NMI handler, so we better
avoid faulting on it. Sync the vmalloc range with all
page-tables in the system to make sure everyone has it mapped.

This fixes a WARN_ON_ONCE() that can be triggered with PTI
enabled on x86-32:

	WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230

This triggers because with PTI enabled on a PAE kernel the
PMDs are no longer shared between the page-tables, so the
vmalloc changes do not propagate automatically.

Signed-off-by: Joerg Roedel <jroedel@suse.de>
---
 kernel/events/ring_buffer.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

Andy Lutomirski July 20, 2018, 5:06 p.m. UTC | #1
> On Jul 20, 2018, at 6:22 AM, Joerg Roedel <joro@8bytes.org> wrote:
>
> From: Joerg Roedel <jroedel@suse.de>
>
> The ring-buffer is accessed in the NMI handler, so we better
> avoid faulting on it. Sync the vmalloc range with all
> page-tables in the system to make sure everyone has it mapped.
>
> This fixes a WARN_ON_ONCE() that can be triggered with PTI
> enabled on x86-32:
>
>    WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230
>
> This triggers because with PTI enabled on a PAE kernel the
> PMDs are no longer shared between the page-tables, so the
> vmalloc changes do not propagate automatically.

It seems like it would be much more robust to fix the vmalloc_fault()
code instead.
Joerg Roedel July 20, 2018, 5:48 p.m. UTC | #2
On Fri, Jul 20, 2018 at 10:06:54AM -0700, Andy Lutomirski wrote:
> > On Jul 20, 2018, at 6:22 AM, Joerg Roedel <joro@8bytes.org> wrote:
> >
> > From: Joerg Roedel <jroedel@suse.de>
> >
> > The ring-buffer is accessed in the NMI handler, so we better
> > avoid faulting on it. Sync the vmalloc range with all
> > page-tables in the system to make sure everyone has it mapped.
> >
> > This fixes a WARN_ON_ONCE() that can be triggered with PTI
> > enabled on x86-32:
> >
> >    WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230
> >
> > This triggers because with PTI enabled on a PAE kernel the
> > PMDs are no longer shared between the page-tables, so the
> > vmalloc changes do not propagate automatically.
> 
> It seems like it would be much more robust to fix the vmalloc_fault()
> code instead.

The question is whether the NMI path is nesting-safe; if it is, we can
remove the WARN_ON_ONCE(in_nmi()) in the vmalloc_fault() path. It should
be nesting-safe on x86-32 because of the way the stack-switch happens
there. If it's also nesting-safe on x86-64, the warning there can be
removed.

Or did you think of something else to fix there?


Thanks,

	Joerg
Thomas Gleixner July 20, 2018, 7:27 p.m. UTC | #3
On Fri, 20 Jul 2018, Andy Lutomirski wrote:
> > On Jul 20, 2018, at 6:22 AM, Joerg Roedel <joro@8bytes.org> wrote:
> >
> > From: Joerg Roedel <jroedel@suse.de>
> >
> > The ring-buffer is accessed in the NMI handler, so we better
> > avoid faulting on it. Sync the vmalloc range with all
> > page-tables in the system to make sure everyone has it mapped.
> >
> > This fixes a WARN_ON_ONCE() that can be triggered with PTI
> > enabled on x86-32:
> >
> >    WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230
> >
> > This triggers because with PTI enabled on a PAE kernel the
> > PMDs are no longer shared between the page-tables, so the
> > vmalloc changes do not propagate automatically.
> 
> It seems like it would be much more robust to fix the vmalloc_fault()
> code instead.

Right, but now the obvious fix for the issue at hand is this. We surely
should revisit this.

Thanks,

	tglx
Andy Lutomirski July 20, 2018, 7:32 p.m. UTC | #4
On Fri, Jul 20, 2018 at 10:48 AM, Joerg Roedel <joro@8bytes.org> wrote:
> On Fri, Jul 20, 2018 at 10:06:54AM -0700, Andy Lutomirski wrote:
>> > On Jul 20, 2018, at 6:22 AM, Joerg Roedel <joro@8bytes.org> wrote:
>> >
>> > From: Joerg Roedel <jroedel@suse.de>
>> >
>> > The ring-buffer is accessed in the NMI handler, so we better
>> > avoid faulting on it. Sync the vmalloc range with all
>> > page-tables in the system to make sure everyone has it mapped.
>> >
>> > This fixes a WARN_ON_ONCE() that can be triggered with PTI
>> > enabled on x86-32:
>> >
>> >    WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230
>> >
>> > This triggers because with PTI enabled on a PAE kernel the
>> > PMDs are no longer shared between the page-tables, so the
>> > vmalloc changes do not propagate automatically.
>>
>> It seems like it would be much more robust to fix the vmalloc_fault()
>> code instead.
>
> The question is whether the NMI path is nesting-safe; if it is, we can
> remove the WARN_ON_ONCE(in_nmi()) in the vmalloc_fault() path. It should
> be nesting-safe on x86-32 because of the way the stack-switch happens
> there. If it's also nesting-safe on x86-64, the warning there can be
> removed.
>
> Or did you think of something else to fix there?

I'm just reading your changelog, and you said the PMDs are no longer
shared between the page tables.  So this presumably means that
vmalloc_fault() no longer actually works correctly on PTI systems.  I
didn't read the code to figure out *why* it doesn't work, but throwing
random vmalloc_sync_all() calls around is wrong.

Or maybe the bug really just is the warning.  The warning can probably go.

>
>
> Thanks,
>
>         Joerg
>
Andy Lutomirski July 20, 2018, 7:33 p.m. UTC | #5
On Fri, Jul 20, 2018 at 12:27 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Fri, 20 Jul 2018, Andy Lutomirski wrote:
>> > On Jul 20, 2018, at 6:22 AM, Joerg Roedel <joro@8bytes.org> wrote:
>> >
>> > From: Joerg Roedel <jroedel@suse.de>
>> >
>> > The ring-buffer is accessed in the NMI handler, so we better
>> > avoid faulting on it. Sync the vmalloc range with all
>> > page-tables in the system to make sure everyone has it mapped.
>> >
>> > This fixes a WARN_ON_ONCE() that can be triggered with PTI
>> > enabled on x86-32:
>> >
>> >    WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230
>> >
>> > This triggers because with PTI enabled on a PAE kernel the
>> > PMDs are no longer shared between the page-tables, so the
>> > vmalloc changes do not propagate automatically.
>>
>> It seems like it would be much more robust to fix the vmalloc_fault()
>> code instead.
>
> Right, but now the obvious fix for the issue at hand is this. We surely
> should revisit this.

If you commit this under this reasoning, then please at least make it say:

/* XXX: The vmalloc_fault() code is buggy on PTI+PAE systems, and this
is a workaround. */

Let's not have code in the kernel that pretends to make sense but is
actually voodoo magic that works around bugs elsewhere.  It's no fun
to maintain down the road.
Thomas Gleixner July 20, 2018, 7:43 p.m. UTC | #6
On Fri, 20 Jul 2018, Andy Lutomirski wrote:
> On Fri, Jul 20, 2018 at 12:27 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Fri, 20 Jul 2018, Andy Lutomirski wrote:
> >> > On Jul 20, 2018, at 6:22 AM, Joerg Roedel <joro@8bytes.org> wrote:
> >> >
> >> > From: Joerg Roedel <jroedel@suse.de>
> >> >
> >> > The ring-buffer is accessed in the NMI handler, so we better
> >> > avoid faulting on it. Sync the vmalloc range with all
> >> > page-tables in the system to make sure everyone has it mapped.
> >> >
> >> > This fixes a WARN_ON_ONCE() that can be triggered with PTI
> >> > enabled on x86-32:
> >> >
> >> >    WARNING: CPU: 4 PID: 0 at arch/x86/mm/fault.c:320 vmalloc_fault+0x220/0x230
> >> >
> >> > This triggers because with PTI enabled on a PAE kernel the
> >> > PMDs are no longer shared between the page-tables, so the
> >> > vmalloc changes do not propagate automatically.
> >>
> >> It seems like it would be much more robust to fix the vmalloc_fault()
> >> code instead.
> >
> > Right, but now the obvious fix for the issue at hand is this. We surely
> > should revisit this.
> 
> If you commit this under this reasoning, then please at least make it say:
> 
> /* XXX: The vmalloc_fault() code is buggy on PTI+PAE systems, and this
> is a workaround. */
> 
> Let's not have code in the kernel that pretends to make sense but is
> actually voodoo magic that works around bugs elsewhere.  It's no fun
> to maintain down the road.

Fair enough. Lemme amend it. Joerg is looking into it, but I surely want to
get that stuff some exposure in -next ASAP.

Thanks,

	tglx
Joerg Roedel July 20, 2018, 9:37 p.m. UTC | #7
On Fri, Jul 20, 2018 at 12:32:10PM -0700, Andy Lutomirski wrote:
> I'm just reading your changelog, and you said the PMDs are no longer
> shared between the page tables.  So this presumably means that
> vmalloc_fault() no longer actually works correctly on PTI systems.  I
> didn't read the code to figure out *why* it doesn't work, but throwing
> random vmalloc_sync_all() calls around is wrong.

Hmm, so the whole point of vmalloc_fault() is to sync changes from
swapper_pg_dir to the process page-tables when the relevant parts of the
kernel page-table are not shared, no?

That is also the reason we don't see this on 64 bit, because there these
parts *are* shared.

So with that reasoning vmalloc_fault() works as designed, except that
a warning is issued when it happens in the NMI path. That warning comes
from

	ebc8827f75954 x86: Barf when vmalloc and kmemcheck faults happen in NMI

which went into 2.6.37 and was added because the NMI handlers were not
nesting-safe back then. The reason probably was that the handler on
64-bit has to use an IST stack, and a nested NMI would overwrite the
stack of the outer handler. We don't have this problem on 32-bit, as a
nested NMI will not do another stack-switch there.

I am not sure about 64 bit, but there is a lot of assembly magic to make
NMIs nesting-safe, so I guess the problem should be gone there too.


Regards,

	Joerg
Andy Lutomirski July 20, 2018, 10:20 p.m. UTC | #8
> On Jul 20, 2018, at 11:37 AM, Joerg Roedel <jroedel@suse.de> wrote:
> 
>> On Fri, Jul 20, 2018 at 12:32:10PM -0700, Andy Lutomirski wrote:
>> I'm just reading your changelog, and you said the PMDs are no longer
>> shared between the page tables.  So this presumably means that
>> vmalloc_fault() no longer actually works correctly on PTI systems.  I
>> didn't read the code to figure out *why* it doesn't work, but throwing
>> random vmalloc_sync_all() calls around is wrong.
> 
> Hmm, so the whole point of vmalloc_fault() is to sync changes from
> swapper_pg_dir to the process page-tables when the relevant parts of the
> kernel page-table are not shared, no?
> 
> That is also the reason we don't see this on 64 bit, because there these
> parts *are* shared.
> 
> So with that reasoning vmalloc_fault() works as designed, except that
> a warning is issued when it happens in the NMI path. That warning comes
> from
> 
>    ebc8827f75954 x86: Barf when vmalloc and kmemcheck faults happen in NMI
> 
> which went into 2.6.37 and was added because the NMI handlers were not
> nesting-safe back then. The reason probably was that the handler on
> 64-bit has to use an IST stack, and a nested NMI would overwrite the
> stack of the outer handler. We don't have this problem on 32-bit, as a
> nested NMI will not do another stack-switch there.
> 

Thanks for digging!  The problem was presumably that vmalloc_fault() will IRET and re-enable NMIs on the way out.  But we’ve supported page faults on user memory in NMI handlers on 32-bit and 64-bit for quite a while, and it’s fine now.

I would remove the warning, re-test, and revert the other patch.

The one case we can’t handle in vmalloc_fault() is a fault on a stack access. I don’t expect this to be a problem for PTI. It was a problem for CONFIG_VMAP_STACK, though.

> I am not sure about 64 bit, but there is a lot of assembly magic to make
> NMIs nesting-safe, so I guess the problem should be gone there too.
> 
> 
> Regards,
> 
>    Joerg
Linus Torvalds July 21, 2018, 9:06 p.m. UTC | #9
On Fri, Jul 20, 2018 at 3:20 PM Andy Lutomirski <luto@amacapital.net> wrote:
> Thanks for digging!  The problem was presumably that vmalloc_fault() will IRET and re-enable NMIs on the way out.
>  But we’ve supported page faults on user memory in NMI handlers on 32-bit and 64-bit for quite a while, and it’s fine now.
>
> I would remove the warning, re-test, and revert the other patch.

Agreed. I don't think we have any issues with page faults during NMI
any more.  Afaik the kprobe people depend on it.

That said, 64-bit mode has that scary PV-op case
(arch_flush_lazy_mmu_mode). Being PV mode, I can't find it in myself
to worry about it; I'm assuming it's ok.

                Linus

Patch

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 5d3cf40..7b0e9aa 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -814,6 +814,9 @@ static void rb_free_work(struct work_struct *work)
 
 	vfree(base);
 	kfree(rb);
+
+	/* Make sure buffer is unmapped in all page-tables */
+	vmalloc_sync_all();
 }
 
 void rb_free(struct ring_buffer *rb)
@@ -840,6 +843,13 @@ struct ring_buffer *rb_alloc(int nr_pages, long watermark, int cpu, int flags)
 	if (!all_buf)
 		goto fail_all_buf;
 
+	/*
+	 * The buffer is accessed in NMI handlers, make sure it is
+	 * mapped in all page-tables in the system so that we don't
+	 * fault on the range in an NMI handler.
+	 */
+	vmalloc_sync_all();
+
 	rb->user_page = all_buf;
 	rb->data_pages[0] = all_buf + PAGE_SIZE;
 	if (nr_pages) {