
x86/mm/vmfault: Make vmalloc_fault() handle large pages

Message ID 1454976038-22486-1-git-send-email-toshi.kani@hpe.com (mailing list archive)
State Accepted
Commit f4eafd8bcd52

Commit Message

Kani, Toshi Feb. 9, 2016, midnight UTC
Since 4.1, ioremap() supports large page (pud/pmd) mappings on
x86_64 and PAE.  vmalloc_fault() however assumes that the vmalloc
range is limited to pte mappings.

pgd_ctor() copies the kernel's pgd entries into a new process's pgd
during fork(), which makes user processes share the kernel's page
tables for the kernel ranges.  When an ioremap() call at run-time
allocates a new 2nd-level table (pud on 64-bit, pmd on PAE), a user
process needs to re-sync with the updated kernel pgd entry via
vmalloc_fault().
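
For reference, the sharing described above comes from pgd_ctor()
cloning the kernel half of the reference page table into every new
pgd.  A simplified sketch, paraphrased from arch/x86/mm/pgtable.c
(exact details vary across kernel versions):

static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
{
	/*
	 * Copy the kernel-range pgd entries from swapper_pg_dir into
	 * the new process's pgd, so all processes point at the same
	 * kernel 2nd-level tables.  A later ioremap() that installs a
	 * brand-new 2nd-level table updates only the kernel pgd; user
	 * pgds catch up lazily via vmalloc_fault().
	 */
	clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
			swapper_pg_dir + KERNEL_PGD_BOUNDARY,
			KERNEL_PGD_PTRS);
}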

The following changes are made to vmalloc_fault():

64-bit:
- No change for the sync operation as set_pgd() takes care of
  huge pages as well.
- Add pud_huge() and pmd_huge() to the validation code to
  handle huge pages.
- Change pud_page_vaddr() to pud_pfn() since an ioremap range
  is not directly mapped (although the if-statement still works
  with a bogus addr).
- Change pmd_page() to pmd_pfn() since an ioremap range is not
  backed by struct page (although the if-statement still works
  with a bogus addr).

PAE:
- No change for the sync operation since the pgd entry at index 3
  covers the entire vmalloc range, which is always valid.
  (A separate change will be needed if this assumption gets
  changed regardless of the page size.)
- Add pmd_huge() to the validation code to handle huge pages.
  This is only for completeness since vmalloc_fault() won't
  happen for ioremap'd ranges as its pgd entry is always valid.
  (I was unable to test this part of the changes as a result.)

Reported-by: Henning Schild <henning.schild@siemens.com>
Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@alien8.de>
---
When this patch is accepted, please copy it to stable, back to 4.1.
---
 arch/x86/mm/fault.c |   15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

Comments

Ingo Molnar Feb. 9, 2016, 9:10 a.m. UTC | #1
* Toshi Kani <toshi.kani@hpe.com> wrote:

> Since 4.1, ioremap() supports large page (pud/pmd) mappings in x86_64 and PAE.  
> vmalloc_fault() however assumes that the vmalloc range is limited to pte 
> mappings.
> 
> pgd_ctor() sets the kernel's pgd entries to user's during fork(), which makes 
> user processes share the same page tables for the kernel ranges.  When a call to 
> ioremap() is made at run-time that leads to allocate a new 2nd level table (pud 
> in 64-bit and pmd in PAE), user process needs to re-sync with the updated kernel 
> pgd entry with vmalloc_fault().
> 
> Following changes are made to vmalloc_fault().

So what were the effects of this shortcoming? Were large page ioremap()s unusable? 
Was this harmless because no driver used this facility?

If so then the changelog needs to spell this out clearly ...

Thanks,

	Ingo
Henning Schild Feb. 9, 2016, 9:53 a.m. UTC | #2
On Tue, 9 Feb 2016 10:10:03 +0100
Ingo Molnar <mingo@kernel.org> wrote:

> * Toshi Kani <toshi.kani@hpe.com> wrote:
> 
> > Since 4.1, ioremap() supports large page (pud/pmd) mappings in
> > x86_64 and PAE. vmalloc_fault() however assumes that the vmalloc
> > range is limited to pte mappings.
> > 
> > pgd_ctor() sets the kernel's pgd entries to user's during fork(),
> > which makes user processes share the same page tables for the
> > kernel ranges.  When a call to ioremap() is made at run-time that
> > leads to allocate a new 2nd level table (pud in 64-bit and pmd in
> > PAE), user process needs to re-sync with the updated kernel pgd
> > entry with vmalloc_fault().
> > 
> > Following changes are made to vmalloc_fault().  
> 
> So what were the effects of this shortcoming? Were large page
> ioremap()s unusable? Was this harmless because no driver used this
> facility?

Drivers do use huge ioremap()s.  Now if a pre-existing mm is used to
access the device memory, the resulting #PF and call to vmalloc_fault()
eventually make the kernel treat device memory as if it were a
page table.
The results are illegal reads/writes on iomem and dereferencing iomem
content as if it were a pointer to a lower-level page table.
- #PF if you are lucky
- funny modification of arbitrary memory possible
- can be abused with uio or regular userland ??

Henning
  
> If so then the changelog needs to spell this out clearly ...



> Thanks,
> 
> 	Ingo
Ingo Molnar Feb. 9, 2016, 10:22 a.m. UTC | #3
* Henning Schild <henning.schild@siemens.com> wrote:

> On Tue, 9 Feb 2016 10:10:03 +0100
> Ingo Molnar <mingo@kernel.org> wrote:
> 
> > * Toshi Kani <toshi.kani@hpe.com> wrote:
> > 
> > > Since 4.1, ioremap() supports large page (pud/pmd) mappings in
> > > x86_64 and PAE. vmalloc_fault() however assumes that the vmalloc
> > > range is limited to pte mappings.
> > > 
> > > pgd_ctor() sets the kernel's pgd entries to user's during fork(),
> > > which makes user processes share the same page tables for the
> > > kernel ranges.  When a call to ioremap() is made at run-time that
> > > leads to allocate a new 2nd level table (pud in 64-bit and pmd in
> > > PAE), user process needs to re-sync with the updated kernel pgd
> > > entry with vmalloc_fault().
> > > 
> > > Following changes are made to vmalloc_fault().  
> > 
> > So what were the effects of this shortcoming? Were large page
> > ioremap()s unusable? Was this harmless because no driver used this
> > facility?
> 
> Drivers do use huge ioremap()s. Now if a pre-existing mm is used to
> access the device memory a #PF and the call to vmalloc_fault would
> eventually make the kernel treat device memory as if it was a
> pagetable.
> The results are illegal reads/writes on iomem and dereferencing iomem
> content like it was a pointer to a lower level pagetable.
> - #PF if you are lucky
> - funny modification of arbitrary memory possible
> - can be abused with uio or regular userland ?? 

Ok, so this is a serious live bug exposed to drivers, that also requires a
Cc: stable tag.

All of this should have been in the changelog!

Thanks,

	Ingo
Henning Schild Feb. 9, 2016, 12:26 p.m. UTC | #4
On Tue, 9 Feb 2016 11:22:35 +0100
Ingo Molnar <mingo@kernel.org> wrote:

> * Henning Schild <henning.schild@siemens.com> wrote:
> 
> > On Tue, 9 Feb 2016 10:10:03 +0100
> > Ingo Molnar <mingo@kernel.org> wrote:
> >   
> > > * Toshi Kani <toshi.kani@hpe.com> wrote:
> > >   
> > > > Since 4.1, ioremap() supports large page (pud/pmd) mappings in
> > > > x86_64 and PAE. vmalloc_fault() however assumes that the vmalloc
> > > > range is limited to pte mappings.
> > > > 
> > > > pgd_ctor() sets the kernel's pgd entries to user's during
> > > > fork(), which makes user processes share the same page tables
> > > > for the kernel ranges.  When a call to ioremap() is made at
> > > > run-time that leads to allocate a new 2nd level table (pud in
> > > > 64-bit and pmd in PAE), user process needs to re-sync with the
> > > > updated kernel pgd entry with vmalloc_fault().
> > > > 
> > > > Following changes are made to vmalloc_fault().    
> > > 
> > > So what were the effects of this shortcoming? Were large page
> > > ioremap()s unusable? Was this harmless because no driver used this
> > > facility?  
> > 
> > Drivers do use huge ioremap()s. Now if a pre-existing mm is used to
> > access the device memory a #PF and the call to vmalloc_fault would
> > eventually make the kernel treat device memory as if it was a
> > pagetable.
> > The results are illegal reads/writes on iomem and dereferencing
> > iomem content like it was a pointer to a lower level pagetable.
> > - #PF if you are lucky

> > - funny modification of arbitrary memory possible
> > - can be abused with uio or regular userland ??   

Looking over the code again, I am not sure the last two are even
possible; it is just the pointer deref that can cause a #PF.
If the pointer turns out to "work", the code will just read and
eventually BUG().

> Ok, so this is a serious live bug exposed to drivers, that also
> requires a Cc: stable tag.
>
> All of this should have been in the changelog!
> 
> Thanks,
> 
> 	Ingo
Kani, Toshi Feb. 9, 2016, 4:03 p.m. UTC | #5
On Tue, 2016-02-09 at 10:10 +0100, Ingo Molnar wrote:
> * Toshi Kani <toshi.kani@hpe.com> wrote:
> 
> > Since 4.1, ioremap() supports large page (pud/pmd) mappings in x86_64
> > and PAE.  vmalloc_fault() however assumes that the vmalloc range is
> > limited to pte mappings.
> > 
> > pgd_ctor() sets the kernel's pgd entries to user's during fork(), which
> > makes user processes share the same page tables for the kernel
> > ranges.  When a call to ioremap() is made at run-time that leads to
> > allocate a new 2nd level table (pud in 64-bit and pmd in PAE), user
> > process needs to re-sync with the updated kernel pgd entry with
> > vmalloc_fault().
> > 
> > Following changes are made to vmalloc_fault().
> 
> So what were the effects of this shortcoming? Were large page ioremap()s
> unusable? Was this harmless because no driver used this facility?
>
> If so then the changelog needs to spell this out clearly ...

Large page support of ioremap() has been used for persistent memory
mappings for a while.

In order to hit this problem, i.e. to cause a vmalloc fault, a large
amount of ioremap allocations at run-time is required.  The following
example repeatedly allocates a 16GB range.

# cat /proc/vmallocinfo | grep memremap
0xffffc90040000000-0xffffc90440001000 17179873280 memremap+0xb4/0x110
phys=480000000 ioremap
0xffffc90480000000-0xffffc90880001000 17179873280 memremap+0xb4/0x110
phys=480000000 ioremap
0xffffc908c0000000-0xffffc90cc0001000 17179873280 memremap+0xb4/0x110
phys=c80000000 ioremap
0xffffc90d00000000-0xffffc91100001000 17179873280 memremap+0xb4/0x110
phys=c80000000 ioremap
0xffffc91140000000-0xffffc91540001000 17179873280 memremap+0xb4/0x110 
phys=480000000 ioremap
  :
0xffffc97300000000-0xffffc97700001000 17179873280 memremap+0xb4/0x110
phys=c80000000 ioremap
0xffffc97740000000-0xffffc97b40001000 17179873280 memremap+0xb4/0x110
phys=480000000 ioremap
0xffffc97b80000000-0xffffc97f80001000 17179873280 memremap+0xb4/0x110
phys=c80000000 ioremap
0xffffc97fc0000000-0xffffc983c0001000 17179873280 memremap+0xb4/0x110
phys=480000000 ioremap

The last ioremap call above crossed a 512GB boundary (0x8000000000), which
allocated a new pud table and updated the kernel pgd entry to point to it.
Because a user process's page table does not have this pgd entry update, a
read/write syscall request to the range will hit a vmalloc fault.  Since
vmalloc_fault() does not handle a large page properly, this causes an Oops
as follows.

 BUG: unable to handle kernel paging request at ffff880840000ff8
 IP: [<ffffffff810664ae>] vmalloc_fault+0x1be/0x300
 PGD c7f03a067 PUD 0 
 Oops: 0000 [#1] SM
   :
 Call Trace:
 [<ffffffff81067335>] __do_page_fault+0x285/0x3e0
 [<ffffffff810674bf>] do_page_fault+0x2f/0x80
 [<ffffffff810d6d85>] ? put_prev_entity+0x35/0x7a0
 [<ffffffff817a6888>] page_fault+0x28/0x30
 [<ffffffff813bb976>] ? memcpy_erms+0x6/0x10
 [<ffffffff817a0845>] ? schedule+0x35/0x80
 [<ffffffffa006350a>] ? pmem_rw_bytes+0x6a/0x190 [nd_pmem]
 [<ffffffff817a3713>] ? schedule_timeout+0x183/0x240
 [<ffffffffa028d2b3>] btt_log_read+0x63/0x140 [nd_btt]
   :
 [<ffffffff811201d0>] ? __symbol_put+0x60/0x60
 [<ffffffff8122dc60>] ? kernel_read+0x50/0x80
 [<ffffffff81124489>] SyS_finit_module+0xb9/0xf0
 [<ffffffff817a4632>] entry_SYSCALL_64_fastpath+0x1a/0xa4
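
For illustration, here is the pre-patch 64-bit validation path (the
hunks changed in the patch below), annotated with where the walk goes
wrong once *pud or *pmd is a huge-page leaf of an ioremap'd range:

	if (pud_none(*pud) || pud_page_vaddr(*pud) != pud_page_vaddr(*pud_ref))
		BUG();

	/* If *pud is a 1GB leaf, pmd_offset() computes a bogus "pmd
	 * table" address from the mapped physical range rather than
	 * from a real page table. */
	pmd = pmd_offset(pud, address);
	pmd_ref = pmd_offset(pud_ref, address);
	if (pmd_none(*pmd_ref))		/* dereferences the bogus address */
		return -1;

	if (pmd_none(*pmd) || pmd_page(*pmd) != pmd_page(*pmd_ref))
		BUG();

	/* Likewise if *pmd is a 2MB leaf: pte_offset_kernel() treats
	 * the mapped range as a pte table and the pte_present() check
	 * dereferences it, which can Oops as shown above. */
	pte_ref = pte_offset_kernel(pmd_ref, address);
	if (!pte_present(*pte_ref))
		return -1;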

Note that this issue is limited to 64-bit.  32-bit uses only pgd index 3
to cover the entire vmalloc range, which is always valid.

I will add this information to the change log.

Thanks,
-Toshi
Kani, Toshi Feb. 9, 2016, 4:08 p.m. UTC | #6
On Tue, 2016-02-09 at 13:26 +0100, Henning Schild wrote:
> On Tue, 9 Feb 2016 11:22:35 +0100
> Ingo Molnar <mingo@kernel.org> wrote:
> 
> > * Henning Schild <henning.schild@siemens.com> wrote:
> > 
> > > On Tue, 9 Feb 2016 10:10:03 +0100
> > > Ingo Molnar <mingo@kernel.org> wrote:
> > >   
> > > > * Toshi Kani <toshi.kani@hpe.com> wrote:
> > > >   
> > > > > Since 4.1, ioremap() supports large page (pud/pmd) mappings in
> > > > > x86_64 and PAE. vmalloc_fault() however assumes that the vmalloc
> > > > > range is limited to pte mappings.
> > > > > 
> > > > > pgd_ctor() sets the kernel's pgd entries to user's during
> > > > > fork(), which makes user processes share the same page tables
> > > > > for the kernel ranges.  When a call to ioremap() is made at
> > > > > run-time that leads to allocate a new 2nd level table (pud in
> > > > > 64-bit and pmd in PAE), user process needs to re-sync with the
> > > > > updated kernel pgd entry with vmalloc_fault().
> > > > > 
> > > > > Following changes are made to vmalloc_fault().    
> > > > 
> > > > So what were the effects of this shortcoming? Were large page
> > > > ioremap()s unusable? Was this harmless because no driver used this
> > > > facility?  
> > > 
> > > Drivers do use huge ioremap()s. Now if a pre-existing mm is used to
> > > access the device memory a #PF and the call to vmalloc_fault would
> > > eventually make the kernel treat device memory as if it was a
> > > pagetable.
> > > The results are illegal reads/writes on iomem and dereferencing
> > > iomem content like it was a pointer to a lower level pagetable.
> > > - #PF if you are lucky

#PF -> vmalloc_fault -> oops

> > > - funny modification of arbitrary memory possible
> > > - can be abused with uio or regular userland ??   
> 
> Looking over the code again i am not sure the last two are even
> possible, it is just the pointer deref that can cause a #PF.
> If the pointer turns out to "work" the code will just read and
> eventually BUG().

The last two cases are not possible.

> > Ok, so this is a serious live bug exposed to drivers, that also
> > requires a Cc: stable tag.

Yes, the fix should go to stable as well.

Thanks,
-Toshi

Patch

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index eef44d9..e830c71 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -287,6 +287,9 @@  static noinline int vmalloc_fault(unsigned long address)
 	if (!pmd_k)
 		return -1;
 
+	if (pmd_huge(*pmd_k))
+		return 0;
+
 	pte_k = pte_offset_kernel(pmd_k, address);
 	if (!pte_present(*pte_k))
 		return -1;
@@ -360,8 +363,6 @@  void vmalloc_sync_all(void)
  * 64-bit:
  *
  *   Handle a fault on the vmalloc area
- *
- * This assumes no large pages in there.
  */
 static noinline int vmalloc_fault(unsigned long address)
 {
@@ -403,17 +404,23 @@  static noinline int vmalloc_fault(unsigned long address)
 	if (pud_none(*pud_ref))
 		return -1;
 
-	if (pud_none(*pud) || pud_page_vaddr(*pud) != pud_page_vaddr(*pud_ref))
+	if (pud_none(*pud) || pud_pfn(*pud) != pud_pfn(*pud_ref))
 		BUG();
 
+	if (pud_huge(*pud))
+		return 0;
+
 	pmd = pmd_offset(pud, address);
 	pmd_ref = pmd_offset(pud_ref, address);
 	if (pmd_none(*pmd_ref))
 		return -1;
 
-	if (pmd_none(*pmd) || pmd_page(*pmd) != pmd_page(*pmd_ref))
+	if (pmd_none(*pmd) || pmd_pfn(*pmd) != pmd_pfn(*pmd_ref))
 		BUG();
 
+	if (pmd_huge(*pmd))
+		return 0;
+
 	pte_ref = pte_offset_kernel(pmd_ref, address);
 	if (!pte_present(*pte_ref))
 		return -1;