
[v2] x86/mm/pti: in pti_clone_pgtable(), increase addr properly

Message ID 20190820202314.1083149-1-songliubraving@fb.com (mailing list archive)
State New, archived
Series [v2] x86/mm/pti: in pti_clone_pgtable(), increase addr properly

Commit Message

Song Liu Aug. 20, 2019, 8:23 p.m. UTC
Before 32-bit support, pti_clone_pmds() always added PMD_SIZE to addr.
This behavior changed with the 32-bit support: pti_clone_pgtable()
increases addr by PUD_SIZE in the pud_none(*pud) case, and by PMD_SIZE
in the pmd_none(*pmd) case. However, this is not accurate, because addr
may not be PUD_SIZE/PMD_SIZE aligned.

Fix this issue by properly rounding addr up to the next PUD_SIZE/PMD_SIZE
boundary in these two cases.

The following explains how we debugged this issue:

We use huge pages for hot text and thus reduce iTLB misses. While
benchmarking a 5.2 based kernel (vs. a 4.16 based one), we found ~2.5x
more iTLB misses.

To figure out the issue, I used a debug patch that dumps the page table
for a given pid. The following is information from the workload pid.

For the 4.16 based kernel:

host-4.16 # grep "x  pmd" /sys/kernel/debug/page_tables/dump_pid
0x0000000000600000-0x0000000000e00000           8M USR ro         PSE         x  pmd
0xffffffff81a00000-0xffffffff81c00000           2M     ro         PSE         x  pmd

For the 5.2 based kernel before this patch:

host-5.2-before # grep "x  pmd" /sys/kernel/debug/page_tables/dump_pid
0x0000000000600000-0x0000000000e00000           8M USR ro         PSE         x  pmd

The 8MB text in pmd is from user space. The 4.16 kernel has 1 pmd for the
irq entry table, while the 5.2 kernel doesn't have it.

For the 5.2 based kernel after this patch:

host-5.2-after # grep "x  pmd" /sys/kernel/debug/page_tables/dump_pid
0x0000000000600000-0x0000000000e00000           8M USR ro         PSE         x  pmd
0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd

So after this patch, the 5.2 based kernel has 7 PMDs instead of the 1 PMD
in the 4.16 kernel. This further reduces the iTLB miss rate.

Cc: stable@vger.kernel.org # v4.19+
Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
Reviewed-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/mm/pti.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Song Liu Aug. 21, 2019, 9:41 a.m. UTC | #1
> On Aug 20, 2019, at 1:23 PM, Song Liu <songliubraving@fb.com> wrote:
> 
> Before 32-bit support, pti_clone_pmds() always adds PMD_SIZE to addr.
> This behavior changes after the 32-bit support:  pti_clone_pgtable()
> increases addr by PUD_SIZE for pud_none(*pud) case, and increases addr by
> PMD_SIZE for pmd_none(*pmd) case. However, this is not accurate because
> addr may not be PUD_SIZE/PMD_SIZE aligned.
> 
> Fix this issue by properly rounding up addr to next PUD_SIZE/PMD_SIZE
> in these two cases.

After poking around more, I found the following doesn't really make 
sense. 

Sorry for the noise. 
Song


<nonsense> 

> 
> The following explains how we debugged this issue:
> 
> We use huge page for hot text and thus reduces iTLB misses. As we
> benchmark 5.2 based kernel (vs. 4.16 based), we found ~2.5x more
> iTLB misses.
> 
> To figure out the issue, I use a debug patch that dumps page table for
> a pid. The following are information from the workload pid.
> 
> For the 4.16 based kernel:
> 
> host-4.16 # grep "x  pmd" /sys/kernel/debug/page_tables/dump_pid
> 0x0000000000600000-0x0000000000e00000           8M USR ro         PSE         x  pmd
> 0xffffffff81a00000-0xffffffff81c00000           2M     ro         PSE         x  pmd
> 
> For the 5.2 based kernel before this patch:
> 
> host-5.2-before # grep "x  pmd" /sys/kernel/debug/page_tables/dump_pid
> 0x0000000000600000-0x0000000000e00000           8M USR ro         PSE         x  pmd
> 
> The 8MB text in pmd is from user space. The 4.16 kernel has 1 pmd for the
> irq entry table, while the 5.2 kernel doesn't have it.
> 
> For the 5.2 based kernel after this patch:
> 
> host-5.2-after # grep "x  pmd" /sys/kernel/debug/page_tables/dump_pid
> 0x0000000000600000-0x0000000000e00000           8M USR ro         PSE         x  pmd
> 0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
> 
> So after this patch, the 5.2 based kernel has 7 PMDs instead of 1 PMD
> in 4.16 kernel. This further reduces iTLB miss rate

</nonsense>

> 
> Cc: stable@vger.kernel.org # v4.19+
> Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
> Reviewed-by: Rik van Riel <riel@surriel.com>
> Signed-off-by: Song Liu <songliubraving@fb.com>
> Cc: Joerg Roedel <jroedel@suse.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> ---
> arch/x86/mm/pti.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index b196524759ec..1337494e22ef 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -330,13 +330,13 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
> 
> 		pud = pud_offset(p4d, addr);
> 		if (pud_none(*pud)) {
> -			addr += PUD_SIZE;
> +			addr = round_up(addr + 1, PUD_SIZE);
> 			continue;
> 		}
> 
> 		pmd = pmd_offset(pud, addr);
> 		if (pmd_none(*pmd)) {
> -			addr += PMD_SIZE;
> +			addr = round_up(addr + 1, PMD_SIZE);
> 			continue;
> 		}
> 
> -- 
> 2.17.1
>
Peter Zijlstra Aug. 21, 2019, 10:10 a.m. UTC | #2
On Tue, Aug 20, 2019 at 01:23:14PM -0700, Song Liu wrote:
> Before 32-bit support, pti_clone_pmds() always adds PMD_SIZE to addr.
> This behavior changes after the 32-bit support:  pti_clone_pgtable()
> increases addr by PUD_SIZE for pud_none(*pud) case, and increases addr by
> PMD_SIZE for pmd_none(*pmd) case. However, this is not accurate because
> addr may not be PUD_SIZE/PMD_SIZE aligned.
> 
> Fix this issue by properly rounding up addr to next PUD_SIZE/PMD_SIZE
> in these two cases.

So the patch is fine, ACK on that, but that still leaves us with the
puzzle of why this didn't explode mightily and the story needs a little
more work.

> The following explains how we debugged this issue:
> 
> We use huge page for hot text and thus reduces iTLB misses. As we
> benchmark 5.2 based kernel (vs. 4.16 based), we found ~2.5x more
> iTLB misses.
> 
> To figure out the issue, I use a debug patch that dumps page table for
> a pid. The following are information from the workload pid.
> 
> For the 4.16 based kernel:
> 
> host-4.16 # grep "x  pmd" /sys/kernel/debug/page_tables/dump_pid
> 0x0000000000600000-0x0000000000e00000           8M USR ro         PSE         x  pmd
> 0xffffffff81a00000-0xffffffff81c00000           2M     ro         PSE         x  pmd
> 
> For the 5.2 based kernel before this patch:
> 
> host-5.2-before # grep "x  pmd" /sys/kernel/debug/page_tables/dump_pid
> 0x0000000000600000-0x0000000000e00000           8M USR ro         PSE         x  pmd
> 
> The 8MB text in pmd is from user space. The 4.16 kernel has 1 pmd for the
> irq entry table, while the 5.2 kernel doesn't have it.
> 
> For the 5.2 based kernel after this patch:
> 
> host-5.2-after # grep "x  pmd" /sys/kernel/debug/page_tables/dump_pid
> 0x0000000000600000-0x0000000000e00000           8M USR ro         PSE         x  pmd
> 0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
> 
> So after this patch, the 5.2 based kernel has 7 PMDs instead of 1 PMD
> in 4.16 kernel.

This basically gives rise to more questions than it provides answers.
You seem to have 'forgotten' to provide the equivalent mappings on the
two older kernels. The fact that they're not PMD is evident, but it
would be very good to know what is mapped, and what -- if anything --
lives in the holes we've (accidentally) created.

Can you please provide more complete mappings? Basically provide the
whole cpu_entry_area mapping.

> This further reduces iTLB miss rate

What you're saying is that by using PMDs, we reduce 4K iTLB usage. But
we increase 2M iTLB usage, but for your workload this works out
favourably (a quick look at the PMU event tables for SKL didn't show me
separate 4K/2M iTLB counters :/).

> Cc: stable@vger.kernel.org # v4.19+
> Fixes: 16a3fe634f6a ("x86/mm/pti: Clone kernel-image on PTE level for 32 bit")
> Reviewed-by: Rik van Riel <riel@surriel.com>
> Signed-off-by: Song Liu <songliubraving@fb.com>
> Cc: Joerg Roedel <jroedel@suse.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> ---
>  arch/x86/mm/pti.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
> index b196524759ec..1337494e22ef 100644
> --- a/arch/x86/mm/pti.c
> +++ b/arch/x86/mm/pti.c
> @@ -330,13 +330,13 @@ pti_clone_pgtable(unsigned long start, unsigned long end,
>  
>  		pud = pud_offset(p4d, addr);
>  		if (pud_none(*pud)) {
> -			addr += PUD_SIZE;
> +			addr = round_up(addr + 1, PUD_SIZE);
>  			continue;
>  		}
>  
>  		pmd = pmd_offset(pud, addr);
>  		if (pmd_none(*pmd)) {
> -			addr += PMD_SIZE;
> +			addr = round_up(addr + 1, PMD_SIZE);
>  			continue;
>  		}
>  
> -- 
> 2.17.1
>
Thomas Gleixner Aug. 21, 2019, 10:17 a.m. UTC | #3
On Wed, 21 Aug 2019, Song Liu wrote:
> > On Aug 20, 2019, at 1:23 PM, Song Liu <songliubraving@fb.com> wrote:
> > 
> > Before 32-bit support, pti_clone_pmds() always adds PMD_SIZE to addr.
> > This behavior changes after the 32-bit support:  pti_clone_pgtable()
> > increases addr by PUD_SIZE for pud_none(*pud) case, and increases addr by
> > PMD_SIZE for pmd_none(*pmd) case. However, this is not accurate because
> > addr may not be PUD_SIZE/PMD_SIZE aligned.
> > 
> > Fix this issue by properly rounding up addr to next PUD_SIZE/PMD_SIZE
> > in these two cases.
> 
> After poking around more, I found the following doesn't really make 
> sense. 

I'm glad you figured that out yourself. Was about to write up something to
that effect.

Still interesting questions remain:

  1) How did you end up feeding an unaligned address into that which points
     to a 0 PUD?

  2) Is this related to Facebook specific changes and unlikely to affect any
     regular kernel? I can't come up with a way to trigger that in mainline

  3) As this is a user page table and the missing mapping is related to
     mappings required by PTI, how is the machine going in/out of user
     space in the first place? Or did I just trip over what you called
     nonsense?

Thanks,

	tglx
Peter Zijlstra Aug. 21, 2019, 10:30 a.m. UTC | #4
On Wed, Aug 21, 2019 at 12:10:08PM +0200, Peter Zijlstra wrote:
> On Tue, Aug 20, 2019 at 01:23:14PM -0700, Song Liu wrote:

> > host-5.2-after # grep "x  pmd" /sys/kernel/debug/page_tables/dump_pid
> > 0x0000000000600000-0x0000000000e00000           8M USR ro         PSE         x  pmd
> > 0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
> > 
> > So after this patch, the 5.2 based kernel has 7 PMDs instead of 1 PMD
> > in 4.16 kernel.
> 
> This basically gives rise to more questions than it provides answers.
> You seem to have 'forgotten' to provide the equivalent mappings on the
> two older kernels. The fact that they're not PMD is evident, but it
> would be very good to know what is mapped, and what -- if anything --
> lives in the holes we've (accidentally) created.
> 
> Can you please provide more complete mappings? Basically provide the
> whole cpu_entry_area mapping.

I tried on my local machine and:

  cat /debug/page_tables/kernel | awk '/^---/ { p=0 } /CPU entry/ { p=1 } { if (p) print $0 }' > ~/cea-{before,after}.txt

resulted in _identical_ files ?!?!

Can you share your before and after dumps?
Thomas Gleixner Aug. 24, 2019, 12:59 a.m. UTC | #5
On Wed, 21 Aug 2019, Thomas Gleixner wrote:
> On Wed, 21 Aug 2019, Song Liu wrote:
> > > On Aug 20, 2019, at 1:23 PM, Song Liu <songliubraving@fb.com> wrote:
> > > 
> > > Before 32-bit support, pti_clone_pmds() always adds PMD_SIZE to addr.
> > > This behavior changes after the 32-bit support:  pti_clone_pgtable()
> > > increases addr by PUD_SIZE for pud_none(*pud) case, and increases addr by
> > > PMD_SIZE for pmd_none(*pmd) case. However, this is not accurate because
> > > addr may not be PUD_SIZE/PMD_SIZE aligned.
> > > 
> > > Fix this issue by properly rounding up addr to next PUD_SIZE/PMD_SIZE
> > > in these two cases.
> > 
> > After poking around more, I found the following doesn't really make 
> > sense. 
> 
> I'm glad you figured that out yourself. Was about to write up something to
> that effect.
> 
> Still interesting questions remain:
> 
>   1) How did you end up feeding an unaligned address into that which points
>      to a 0 PUD?
> 
>   2) Is this related to Facebook specific changes and unlikely to affect any
>      regular kernel? I can't come up with a way to trigger that in mainline
> 
>   3) As this is a user page table and the missing mapping is related to
>      mappings required by PTI, how is the machine going in/out of user
>      space in the first place? Or did I just trip over what you called
>      nonsense?

And just because this ended in silence I looked at it myself, after Peter
told me that this was on a kernel with PTI disabled. Aside from that, my
built-in distrust for debug war stories combined with fairy-tale changelogs
triggered my curiosity anyway.

So that cannot be a kernel with PTI disabled at compile time, because in
that case the functions are not available, unless it's FB hackery which I
do not care about.

So the only way this can be reached is when PTI is configured but disabled
at boot time via pti=off or nopti.

For some silly reason, which goes back to before the 32-bit support and
which Joerg did not notice either when he introduced pti_finalize(), this
results in the following functions being called unconditionally:

     pti_clone_entry_text()
     pti_clone_kernel_text()

pti_clone_kernel_text() was called unconditionally before the 32bit support
already and the only reason why it did not have any effect in that
situation is that it invokes pti_kernel_image_global_ok() and that returns
false when PTI is disabled on the kernel command line. Oh well. It
obviously never checked whether X86_FEATURE_PTI was disabled or enabled in
the first place.

Now 32bit moved that around into pti_finalize() and added the call to
pti_clone_entry_text() which just runs unconditionally.

Now there is still the interesting question of why this matters. The
to-be-cloned page table entries are mapped, and the start address, even if
unaligned, never points to something unmapped. The unmapped case only
covers holes, and holes are obviously aligned at the upper levels even if
the address of the hole is unaligned.

So colour me still confused what's wrong there but the proper fix is the
trivial:

--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -666,6 +666,8 @@ void __init pti_init(void)
  */
 void pti_finalize(void)
 {
+	if (!boot_cpu_has(X86_FEATURE_PTI))
+		return;
 	/*
 	 * We need to clone everything (again) that maps parts of the
 	 * kernel image.

Hmm?

I'm going to look whether that makes any difference in the page tables
tomorrow with brain awake, but I wanted to share this before the .us
vanishes into the weekend :)

Thanks,

	tglx
Song Liu Aug. 24, 2019, 2:10 a.m. UTC | #6
> On Aug 23, 2019, at 5:59 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Wed, 21 Aug 2019, Thomas Gleixner wrote:
>> On Wed, 21 Aug 2019, Song Liu wrote:
>>>> On Aug 20, 2019, at 1:23 PM, Song Liu <songliubraving@fb.com> wrote:
>>>> 
>>>> Before 32-bit support, pti_clone_pmds() always adds PMD_SIZE to addr.
>>>> This behavior changes after the 32-bit support:  pti_clone_pgtable()
>>>> increases addr by PUD_SIZE for pud_none(*pud) case, and increases addr by
>>>> PMD_SIZE for pmd_none(*pmd) case. However, this is not accurate because
>>>> addr may not be PUD_SIZE/PMD_SIZE aligned.
>>>> 
>>>> Fix this issue by properly rounding up addr to next PUD_SIZE/PMD_SIZE
>>>> in these two cases.
>>> 
>>> After poking around more, I found the following doesn't really make 
>>> sense. 
>> 
>> I'm glad you figured that out yourself. Was about to write up something to
>> that effect.
>> 
>> Still interesting questions remain:
>> 
>>  1) How did you end up feeding an unaligned address into that which points
>>     to a 0 PUD?
>> 
>>  2) Is this related to Facebook specific changes and unlikely to affect any
>>     regular kernel? I can't come up with a way to trigger that in mainline
>> 
>>  3) As this is a user page table and the missing mapping is related to
>>     mappings required by PTI, how is the machine going in/out of user
>>     space in the first place? Or did I just trip over what you called
>>     nonsense?
> 
> And just because this ended in silence I looked at it myself after Peter
> told me that this was on a kernel with PTI disabled. Aside of that my built
> in distrust for debug war stories combined with fairy tale changelogs
> triggered my curiousity anyway.

I am really sorry that I was silent. Somehow I didn't see this in my inbox
(or it didn't show up until just now?). 

For this patch, I really mixed this up with something else. The issue we
are seeing is that kprobe with CONFIG_KPROBES_ON_FTRACE splits the PMD
located at 0xffffffff81a00000. I sent another patch last night, but that
might not be the right fix either.

I haven't started testing our PTI enabled kernel, so I am not sure whether
there is really an issue with the PTI code. 

Thanks,
Song
Song Liu Aug. 24, 2019, 2:12 a.m. UTC | #7
> On Aug 21, 2019, at 3:30 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> On Wed, Aug 21, 2019 at 12:10:08PM +0200, Peter Zijlstra wrote:
>> On Tue, Aug 20, 2019 at 01:23:14PM -0700, Song Liu wrote:
> 
>>> host-5.2-after # grep "x  pmd" /sys/kernel/debug/page_tables/dump_pid
>>> 0x0000000000600000-0x0000000000e00000           8M USR ro         PSE         x  pmd
>>> 0xffffffff81000000-0xffffffff81e00000          14M     ro         PSE     GLB x  pmd
>>> 
>>> So after this patch, the 5.2 based kernel has 7 PMDs instead of 1 PMD
>>> in 4.16 kernel.
>> 
>> This basically gives rise to more questions than it provides answers.
>> You seem to have 'forgotten' to provide the equivalent mappings on the
>> two older kernels. The fact that they're not PMD is evident, but it
>> would be very good to know what is mapped, and what -- if anything --
>> lives in the holes we've (accidentally) created.
>> 
>> Can you please provide more complete mappings? Basically provide the
>> whole cpu_entry_area mapping.
> 
> I tried on my local machine and:
> 
>  cat /debug/page_tables/kernel | awk '/^---/ { p=0 } /CPU entry/ { p=1 } { if (p) print $0 }' > ~/cea-{before,after}.txt
> 
> resulted in _identical_ files ?!?!
> 
> Can you share your before and after dumps?

I was really dumb on this. The actual issue is that kprobe with
CONFIG_KPROBES_ON_FTRACE splits the kernel text PMDs (0xffffffff81000000-).

I will dig more into this. 

Sorry for being silent, somehow I didn't see this email until just now.

Song

Patch

diff --git a/arch/x86/mm/pti.c b/arch/x86/mm/pti.c
index b196524759ec..1337494e22ef 100644
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -330,13 +330,13 @@  pti_clone_pgtable(unsigned long start, unsigned long end,
 
 		pud = pud_offset(p4d, addr);
 		if (pud_none(*pud)) {
-			addr += PUD_SIZE;
+			addr = round_up(addr + 1, PUD_SIZE);
 			continue;
 		}
 
 		pmd = pmd_offset(pud, addr);
 		if (pmd_none(*pmd)) {
-			addr += PMD_SIZE;
+			addr = round_up(addr + 1, PMD_SIZE);
 			continue;
 		}