| Message ID | CAPcyv4g1ULSDXzp5ufpXzp5JB3z1TN_za=_AqUC8E6f2FNsycw@mail.gmail.com (mailing list archive) |
|---|---|
| State | New, archived |
> On Tue, Aug 23, 2016 at 7:53 PM, Dan Williams <dan.j.williams@intel.com> wrote:
> > On Tue, Aug 23, 2016 at 6:29 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> >>> On Tue, Aug 23, 2016 at 4:47 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> >>> wrote:
 :
> >>
> >> crash> p {struct vmem_altmap} 0xffff88046d045410
> >> $6 = {
> >>   base_pfn = 0x480000,
> >>   reserve = 0x2,        // PHYS_PFN(SZ_8K)
> >>   free = 0x101fe,
> >>   align = 0x1fe,
> >>   alloc = 0x10000
> >> }
> >
> > Ah, so, on second look the 0x490200000 data offset looks correct.  The
> > total size of the address range is 16GB, which equates to 256MB needed
> > for struct page, plus 2MB more to re-align the data on the next 2MB
> > boundary.
> >
> > The question now is why the guest is faulting on an access to an
> > address less than 0x490200000.
>
> Does the attached patch fix this for you?

Yeah, that makes sense.  I will test it tomorrow.

BTW, why does devm_memremap_pages() put the whole range into pgmap_radix
as device memory, but only initialize page->pgmap for its data range?
Is there a particular reason for this inconsistency?

Thanks,
-Toshi
On Tue, Aug 23, 2016 at 9:28 PM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
[..]
>> Does the attached patch fix this for you?
>
> Yeah, that makes sense.  I will test it tomorrow.
>
> BTW, why does devm_memremap_pages() put the whole range into pgmap_radix
> as device memory, but only initialize page->pgmap for its data range?
> Is there a particular reason for this inconsistency?

The radix tree is indexed by section number, but we don't always
initialize a full section.  The cases where we don't use a full section
are when it overlaps device metadata, or when a platform multiplexes the
device memory range with another resource within the same section.
On Tue, 2016-08-23 at 21:48 -0700, Dan Williams wrote:
> On Tue, Aug 23, 2016 at 9:28 PM, Kani, Toshimitsu <toshi.kani@hpe.com>
> > BTW, why does devm_memremap_pages() put the whole range into
> > pgmap_radix as device memory, but only initialize page->pgmap for
> > its data range?  Is there a particular reason for this inconsistency?
>
> The radix tree is indexed by section number, but we don't always
> initialize a full section.  The cases where we don't use a full
> section are when it overlaps device metadata, or when a platform
> multiplexes the device memory range with another resource within the
> same section.

I see, but it still feels odd that get_dev_pagemap() works for
metadata while get_page() -> get_zone_device_page() crashes like this.
follow_devmap_pmd() assumes that get_page() works whenever
get_dev_pagemap() returns a valid pgmap...

Thanks,
-Toshi
[ adding Konstantin ]

On Wed, Aug 24, 2016 at 7:40 AM, Kani, Toshimitsu <toshi.kani@hpe.com> wrote:
> On Tue, 2016-08-23 at 21:48 -0700, Dan Williams wrote:
[..]
> I see, but it still feels odd that get_dev_pagemap() works for
> metadata while get_page() -> get_zone_device_page() crashes like this.
> follow_devmap_pmd() assumes that get_page() works whenever
> get_dev_pagemap() returns a valid pgmap...

We could leave the unmapped portions of a section out of the radix, but
I'm also worried about keeping the get_dev_pagemap() lookup cheap.  I
saw that Konstantin has some proposed changes to the radix API to make
it easier to fill a range [1].  I might switch to
radix_tree_fill_range() when it becomes available.

[1]: https://github.com/koct9i/linux/commits/radix-tree
From 506cdd2c4ec0f9283ac4eb92259f2e8a71c5b637 Mon Sep 17 00:00:00 2001
From: Dan Williams <dan.j.williams@intel.com>
Date: Tue, 23 Aug 2016 19:59:31 -0700
Subject: [PATCH] dax: fix device-dax region base

The data offset for a dax region needs to account for an altmap
reservation in the resource range.  Otherwise, device-dax is allowing
mappings directly into the memmap or device info-block area, with crash
signatures like the following:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
 IP: [<ffffffff811ac851>] get_zone_device_page+0x11/0x30
 Call Trace:
  follow_devmap_pmd+0x298/0x2c0
  follow_page_mask+0x275/0x530
  __get_user_pages+0xe3/0x750
  __gfn_to_pfn_memslot+0x1b2/0x450 [kvm]
  ? hrtimer_try_to_cancel+0x2c/0x120
  ? kvm_read_l1_tsc+0x55/0x60 [kvm]
  try_async_pf+0x66/0x230 [kvm]
  ? kvm_host_page_size+0x90/0xa0 [kvm]
  tdp_page_fault+0x130/0x280 [kvm]
  kvm_mmu_page_fault+0x5f/0xf0 [kvm]
  handle_ept_violation+0x94/0x180 [kvm_intel]
  vmx_handle_exit+0x1d3/0x1440 [kvm_intel]
  ? atomic_switch_perf_msrs+0x6f/0xa0 [kvm_intel]
  ? vmx_vcpu_run+0x2d1/0x490 [kvm_intel]
  kvm_arch_vcpu_ioctl_run+0x81d/0x16a0 [kvm]
  ? wake_up_q+0x44/0x80
  kvm_vcpu_ioctl+0x33c/0x620 [kvm]
  ? __vfs_write+0x37/0x160
  do_vfs_ioctl+0xa2/0x5d0
  SyS_ioctl+0x79/0x90
  entry_SYSCALL_64_fastpath+0x1a/0xa4

Cc: <stable@vger.kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Fixes: ab68f2622136 ("/dev/dax, pmem: direct access to persistent memory")
Reported-by: Abhilash Kumar Mulumudi <m.abhilash-kumar@hpe.com>
Reported-by: Toshi Kani <toshi.kani@hpe.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
 drivers/dax/pmem.c       | 2 ++
 include/linux/memremap.h | 1 +
 kernel/memremap.c        | 9 +++++++++
 3 files changed, 12 insertions(+)

diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index dfb168568af1..0603637df162 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -116,6 +116,8 @@ static int dax_pmem_probe(struct device *dev)
 	if (rc)
 		return rc;
 
+	res.start += vmem_altmap_resource_offset(altmap);
+
 	nd_region = to_nd_region(dev->parent);
 	dax_region = alloc_dax_region(dev, nd_region->id, &res,
 			le32_to_cpu(pfn_sb->align), addr, PFN_DEV|PFN_MAP);
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 93416196ba64..7265b83cdbea 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -24,6 +24,7 @@ struct vmem_altmap {
 };
 
 unsigned long vmem_altmap_offset(struct vmem_altmap *altmap);
+resource_size_t vmem_altmap_resource_offset(struct vmem_altmap *altmap);
 void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns);
 
 #ifdef CONFIG_ZONE_DEVICE
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 251d16b4cb41..941e897c52a8 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -384,6 +384,15 @@ unsigned long vmem_altmap_offset(struct vmem_altmap *altmap)
 	return altmap->reserve + altmap->free;
 }
 
+resource_size_t vmem_altmap_resource_offset(struct vmem_altmap *altmap)
+{
+	/* number of bytes from base where data starts after the memmap */
+	if (!altmap)
+		return 0;
+	return vmem_altmap_offset(altmap) * PAGE_SIZE;
+}
+EXPORT_SYMBOL_GPL(vmem_altmap_resource_offset);
+
 void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns)
 {
 	altmap->alloc -= nr_pfns;
-- 
2.5.5