diff mbox

memremap: Fix NULL pointer BUG in get_zone_device_page()

Message ID CAPcyv4g1ULSDXzp5ufpXzp5JB3z1TN_za=_AqUC8E6f2FNsycw@mail.gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Dan Williams Aug. 24, 2016, 3:58 a.m. UTC
On Tue, Aug 23, 2016 at 7:53 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> On Tue, Aug 23, 2016 at 6:29 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
>>> On Tue, Aug 23, 2016 at 4:47 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
>>> wrote:
>>> > On Tue, 2016-08-23 at 15:32 -0700, Dan Williams wrote:
>>> >> On Tue, Aug 23, 2016 at 11:43 AM, Toshi Kani <toshi.kani@hpe.com>
>>> >> wrote:
>>> >  :
>>> >> I'm not sure about this fix.  The point of honoring
>>> >> vmem_altmap_offset() is because a portion of the resource that is
>>> >> passed to devm_memremap_pages() also contains the metadata info
>>> block
>>> >> for the device.  The offset says "use everything past this point for
>>> >> pages".  This may work for avoiding a crash, but it may corrupt info
>>> >> block metadata in the process.  Can you provide more information
>>> >> about the failing scenario to be sure that we are not triggering a
>>> >> fault on an address that is not meant to have a page mapping?  I.e.
>>> >> what is the host physical address of the page that caused this fault,
>>> >> and is it valid?
>>> >
>>> > The fault address in question was the 2nd page of an NVDIMM range.  I
>>> > assumed this fault address was valid and needed to be handled.  Here is
>>> > some info about the base and patched cases.  Let me know if you need
>>> > more info.
>>> >
>>> > Base
>>> > ====
>>> >
>>> > The following NVDIMM range was set to /dev/dax.
>>>
>>> With ndctl create-namespace or manually via sysfs?  Specifically I'm
>>> looking for what the 'align' attribute was set to when the
>>> configuration was established.  Can you provide a dump of the sysfs
>>> attributes for the /dev/dax parent device?
>>
>> I used the ndctl command below.
>> ndctl create-namespace -f -e namespace0.0 -m dax
>>
>> Here is additional info from my note for the base case.
>>
>> p {struct dev_pagemap} 0xffff88046d0453f0
>> $3 = {
>>   altmap = 0xffff88046d045410,
>>   res = 0xffff88046d0453a8,
>>   ref = 0xffff88046d0452f0,
>>   dev = 0xffff880464790410
>> }
>>
>> crash> p {struct vmem_altmap} 0xffff88046d045410
>> $6 = {
>>   base_pfn = 0x480000,
>>   reserve = 0x2,        // PHYS_PFN(SZ_8K)
>>   free = 0x101fe,
>>   align = 0x1fe,
>>   alloc = 0x10000
>> }
>
> Ah, so, on second look the 0x490200000 data offset looks correct.  The
> total size of the address range is 16GB which equates to 256MB needed
> for struct page, plus 2MB more to re-align the data on the next 2MB
> boundary.
>
> The question now is why is the guest faulting on an access to an
> address less than 0x490200000?

Does the attached patch fix this for you?

Comments

Kani, Toshi Aug. 24, 2016, 4:28 a.m. UTC | #1
> On Tue, Aug 23, 2016 at 7:53 PM, Dan Williams <dan.j.williams@intel.com>
> wrote:
> > On Tue, Aug 23, 2016 at 6:29 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> wrote:
> >>> On Tue, Aug 23, 2016 at 4:47 PM, Kani, Toshimitsu
> <toshi.kani@hpe.com>
> >>> wrote:
 :
> >>
> >> crash> p {struct vmem_altmap} 0xffff88046d045410
> >> $6 = {
> >>   base_pfn = 0x480000,
> >>   reserve = 0x2,        // PHYS_PFN(SZ_8K)
> >>   free = 0x101fe,
> >>   align = 0x1fe,
> >>   alloc = 0x10000
> >> }
> >
> > Ah, so, on second look the 0x490200000 data offset looks correct.  The
> > total size of the address range is 16GB which equates to 256MB needed
> > for struct page, plus 2MB more to re-align the data on the next 2MB
> > boundary.
> >
> > The question now is why is the guest faulting on an access to an
> > address less than 0x490200000?
> 
> Does the attached patch fix this for you?

Yeah, that makes sense.  I will test it tomorrow.

BTW, why does devm_memremap_pages() put a whole range to pgmap_radix
as device memory, but only initialize page->pgmap for its data range?  Is there
particular reason for this inconsistency?

Thanks,
-Toshi
Dan Williams Aug. 24, 2016, 4:48 a.m. UTC | #2
On Tue, Aug 23, 2016 at 9:28 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
>> On Tue, Aug 23, 2016 at 7:53 PM, Dan Williams <dan.j.williams@intel.com>
>> wrote:
>> > On Tue, Aug 23, 2016 at 6:29 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
>> wrote:
>> >>> On Tue, Aug 23, 2016 at 4:47 PM, Kani, Toshimitsu
>> <toshi.kani@hpe.com>
>> >>> wrote:
>  :
>> >>
>> >> crash> p {struct vmem_altmap} 0xffff88046d045410
>> >> $6 = {
>> >>   base_pfn = 0x480000,
>> >>   reserve = 0x2,        // PHYS_PFN(SZ_8K)
>> >>   free = 0x101fe,
>> >>   align = 0x1fe,
>> >>   alloc = 0x10000
>> >> }
>> >
>> > Ah, so, on second look the 0x490200000 data offset looks correct.  The
>> > total size of the address range is 16GB which equates to 256MB needed
>> > for struct page, plus 2MB more to re-align the data on the next 2MB
>> > boundary.
>> >
>> > The question now is why is the guest faulting on an access to an
>> > address less than 0x490200000?
>>
>> Does the attached patch fix this for you?
>
> Yeah, that makes sense.  I will test it tomorrow.
>
> BTW, why does devm_memremap_pages() put a whole range to pgmap_radix
> as device memory, but only initialize page->pgmap for its data range?  Is there
> particular reason for this inconsistency?
>

The radix tree is indexed by section number, but we don't always
initialize a full section.  The cases when we don't use a full section
is when it overlaps device metadata, or if a platform multiplexes the
device memory range with another resource within the same section.
Kani, Toshi Aug. 24, 2016, 2:40 p.m. UTC | #3
On Tue, 2016-08-23 at 21:48 -0700, Dan Williams wrote:
> On Tue, Aug 23, 2016 at 9:28 PM, Kani, Toshimitsu <toshi.kani@hpe.com

> > 

> > BTW, why does devm_memremap_pages() put a whole range to

> > pgmap_radix as device memory, but only initialize page->pgmap for

> > its data range?  Is there particular reason for this inconsistency?

>

> The radix tree is indexed by section number, but we don't always

> initialize a full section.  The cases when we don't use a full

> section is when it overlaps device metadata, or if a platform

> multiplexes the device memory range with another resource within the

> same section.


I see, but I still feel odd about making get_dev_pagemap() to work for
metadata, but get_page() -> get_zone_device_page() to crash like this.
 follow_devmap_pmd() assumes get_page() to work when get_dev_pagemap()
returns a valid pgmap...

Thanks,
-Toshi
Dan Williams Aug. 24, 2016, 2:55 p.m. UTC | #4
[ adding Konstantin ]

On Wed, Aug 24, 2016 at 7:40 AM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> On Tue, 2016-08-23 at 21:48 -0700, Dan Williams wrote:
>> On Tue, Aug 23, 2016 at 9:28 PM, Kani, Toshimitsu <toshi.kani@hpe.com
>> >
>> > BTW, why does devm_memremap_pages() put a whole range to
>> > pgmap_radix as device memory, but only initialize page->pgmap for
>> > its data range?  Is there particular reason for this inconsistency?
>>
>> The radix tree is indexed by section number, but we don't always
>> initialize a full section.  The cases when we don't use a full
>> section is when it overlaps device metadata, or if a platform
>> multiplexes the device memory range with another resource within the
>> same section.
>
> I see, but I still feel odd about making get_dev_pagemap() to work for
> metadata, but get_page() -> get_zone_device_page() to crash like this.
>  follow_devmap_pmd() assumes get_page() to work when get_dev_pagemap()
> returns a valid pgmap...
>

We could leave the unmapped portions of a section out of the radix,
but I'm also worried about keeping the get_dev_pagemap() lookup cheap.
I saw that Konstantin has some proposed changes to the radix api to
make it easier to fill a range [1].  I might switch to
radix_tree_fill_range() when it becomes available.

[1]: https://github.com/koct9i/linux/commits/radix-tree
diff mbox

Patch

From 506cdd2c4ec0f9283ac4eb92259f2e8a71c5b637 Mon Sep 17 00:00:00 2001
From: Dan Williams <dan.j.williams@intel.com>
Date: Tue, 23 Aug 2016 19:59:31 -0700
Subject: [PATCH] dax: fix device-dax region base

The data offset for a dax region needs to account for an altmap
reservation in the resource range.  Otherwise, device-dax is allowing
mappings directly into the memmap or device info-block area, with crash
signatures like the following:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
 IP: [<ffffffff811ac851>] get_zone_device_page+0x11/0x30
 Call Trace:
   follow_devmap_pmd+0x298/0x2c0
   follow_page_mask+0x275/0x530
   __get_user_pages+0xe3/0x750
   __gfn_to_pfn_memslot+0x1b2/0x450 [kvm]
   ? hrtimer_try_to_cancel+0x2c/0x120
   ? kvm_read_l1_tsc+0x55/0x60 [kvm]
   try_async_pf+0x66/0x230 [kvm]
   ? kvm_host_page_size+0x90/0xa0 [kvm]
   tdp_page_fault+0x130/0x280 [kvm]
   kvm_mmu_page_fault+0x5f/0xf0 [kvm]
   handle_ept_violation+0x94/0x180 [kvm_intel]
   vmx_handle_exit+0x1d3/0x1440 [kvm_intel]
   ? atomic_switch_perf_msrs+0x6f/0xa0 [kvm_intel]
   ? vmx_vcpu_run+0x2d1/0x490 [kvm_intel]
   kvm_arch_vcpu_ioctl_run+0x81d/0x16a0 [kvm]
   ? wake_up_q+0x44/0x80
   kvm_vcpu_ioctl+0x33c/0x620 [kvm]
   ? __vfs_write+0x37/0x160
   do_vfs_ioctl+0xa2/0x5d0
   SyS_ioctl+0x79/0x90
   entry_SYSCALL_64_fastpath+0x1a/0xa4

Cc: <stable@vger.kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: ab68f2622136 ("/dev/dax, pmem: direct access to persistent memory")
Reported-by: Abhilash Kumar Mulumudi <m.abhilash-kumar@hpe.com>
Reported-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/pmem.c       | 2 ++
 include/linux/memremap.h | 1 +
 kernel/memremap.c        | 9 +++++++++
 3 files changed, 12 insertions(+)

diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index dfb168568af1..0603637df162 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -116,6 +116,8 @@  static int dax_pmem_probe(struct device *dev)
 	if (rc)
 		return rc;
 
+	res.start += vmem_altmap_resource_offset(altmap);
+
 	nd_region = to_nd_region(dev->parent);
 	dax_region = alloc_dax_region(dev, nd_region->id, &res,
 			le32_to_cpu(pfn_sb->align), addr, PFN_DEV|PFN_MAP);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 93416196ba64..7265b83cdbea 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -24,6 +24,7 @@  struct vmem_altmap {
 };
 
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
+resource_size_t vmem_altmap_resource_offset(struct vmem_altmap *altmap);
 void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
 
 #ifdef CONFIG_ZONE_DEVICE
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 251d16b4cb41..941e897c52a8 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -384,6 +384,15 @@  unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
 	return altmap->reserve + altmap->free;
 }
 
+resource_size_t vmem_altmap_resource_offset(struct vmem_altmap *altmap)
+{
+	/* number of bytes from base where data starts after the memmap */
+	if (!altmap)
+		return 0;
+	return vmem_altmap_offset(altmap) * PAGE_SIZE;
+}
+EXPORT_SYMBOL_GPL(vmem_altmap_resource_offset);
+
 void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns)
 {
 	altmap->alloc -= nr_pfns;
-- 
2.5.5