Message ID | 20191009091205.11753-1-david@redhat.com (mailing list archive) |
---|---|
State | New, archived |
Series | [v1] mm: Fix access of uninitialized memmaps in fs/proc/page.c |
On Wed 09-10-19 11:12:04, David Hildenbrand wrote: > There are various places where we access uninitialized memmaps, namely: > - /proc/kpagecount > - /proc/kpageflags > - /proc/kpagecgroup > - memory_failure() - which reuses stable_page_flags() from fs/proc/page.c > > We have initialized memmaps either when the section is online or when > the page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain > garbage and in the worst case trigger kernel BUGs, especially with > CONFIG_PAGE_POISONING. > > For example, not onlining a DIMM during boot and calling /proc/kpagecount > with CONFIG_PAGE_POISONING: > :/# cat /proc/kpagecount > tmp.test > [ 95.600592] BUG: unable to handle page fault for address: fffffffffffffffe > [ 95.601238] #PF: supervisor read access in kernel mode > [ 95.601675] #PF: error_code(0x0000) - not-present page > [ 95.602116] PGD 114616067 P4D 114616067 PUD 114618067 PMD 0 > [ 95.602596] Oops: 0000 [#1] SMP NOPTI > [ 95.602920] CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11 > [ 95.603547] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4 > [ 95.604521] RIP: 0010:kpagecount_read+0xce/0x1e0 > [ 95.604917] Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480 > [ 95.606450] RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202 > [ 95.606904] RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000 > [ 95.607519] RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000 > [ 95.608128] RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000 > [ 95.608731] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000 > [ 95.609327] R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08 > [ 95.609924] FS: 00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000 > [ 95.610599] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 95.611083] CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0 > [ 95.611686] Call Trace: > [ 95.611906] proc_reg_read+0x3c/0x60 > [ 95.612228] vfs_read+0xc5/0x180 > [ 95.612505] ksys_read+0x68/0xe0 > [ 95.612785] do_syscall_64+0x5c/0xa0 > [ 95.613092] entry_SYSCALL_64_after_hwframe+0x49/0xbe > > Note that there are still two possible races as far as I can see: > - pfn_to_online_page() succeeding but the memory getting offlined and > removed. get_online_mems() could help once we run into this. > - pfn_zone_device() succeeding but the memmap not being fully > initialized yet. As the memmap is initialized outside of the memory > hoptlug lock, get_online_mems() can't help. > > Let's keep the existing interfaces working with ZONE_DEVICE memory. We > can later come back and fix these rare races and eventually speed-up the > ZONE_DEVICE detection. This patch now also makes sure we don't dump data > about memory blocks that are already offline again. As I've already expressed already I am not really happy to have explicit checks for zone device pages in pfn walkers. This is simply too fragile. pfn_to_online_page makes sense because offline pages are not really in a defined state. This would be worth a patch of its own. I remember there was a discussion about the uninitialized zone device memmaps. It would be really good to summarize this discussion in the changelog and conclude why the explicit check is really good and what were other alternatives considered. Thanks! 
> Reported-by: Qian Cai <cai@lca.pw> > Cc: Alexey Dobriyan <adobriyan@gmail.com> > Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Stephen Rothwell <sfr@canb.auug.org.au> > Cc: Michal Hocko <mhocko@kernel.org> > Cc: Toshiki Fukasawa <t-fukasawa@vx.jp.nec.com> > Cc: David Hildenbrand <david@redhat.com> > Cc: Konstantin Khlebnikov <koct9i@gmail.com> > Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> > Cc: Anthony Yznaga <anthony.yznaga@oracle.com> > Cc: Jason Gunthorpe <jgg@ziepe.ca> > Cc: Dan Williams <dan.j.williams@intel.com> > Cc: Logan Gunthorpe <logang@deltatee.com> > Cc: Ira Weiny <ira.weiny@intel.com> > Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> > Cc: linux-fsdevel@vger.kernel.org > Signed-off-by: David Hildenbrand <david@redhat.com> > --- > fs/proc/page.c | 12 ++++++------ > include/linux/memremap.h | 11 +++++++++-- > mm/memory-failure.c | 22 ++++++++++++++++------ > mm/memremap.c | 19 +++++++++++-------- > 4 files changed, 42 insertions(+), 22 deletions(-) > > diff --git a/fs/proc/page.c b/fs/proc/page.c > index decd3fe39674..76502af461e2 100644 > --- a/fs/proc/page.c > +++ b/fs/proc/page.c > @@ -42,7 +42,8 @@ static ssize_t kpagecount_read(struct file *file, char __user *buf, > return -EINVAL; > > while (count > 0) { > - if (pfn_valid(pfn)) > + if (pfn_valid(pfn) && > + (pfn_to_online_page(pfn) || pfn_zone_device(pfn))) > ppage = pfn_to_page(pfn); > else > ppage = NULL; > @@ -97,9 +98,6 @@ u64 stable_page_flags(struct page *page) > if (!page) > return BIT_ULL(KPF_NOPAGE); > > - if (pfn_zone_device_reserved(page_to_pfn(page))) > - return BIT_ULL(KPF_RESERVED); > - > k = page->flags; > u = 0; > > @@ -218,7 +216,8 @@ static ssize_t kpageflags_read(struct file *file, char __user *buf, > return -EINVAL; > > while (count > 0) { > - if (pfn_valid(pfn)) > + if (pfn_valid(pfn) && > + (pfn_to_online_page(pfn) || pfn_zone_device(pfn))) > ppage = pfn_to_page(pfn); > else > ppage = NULL; > @@ -263,7 +262,8 @@ static ssize_t kpagecgroup_read(struct file *file, char __user *buf, > return -EINVAL; > > while (count > 0) { > - if (pfn_valid(pfn)) > + if (pfn_valid(pfn) && > + (pfn_to_online_page(pfn) || pfn_zone_device(pfn))) > ppage = pfn_to_page(pfn); > else > ppage = NULL; > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > index c676e33205d3..c076bb163c2f 100644 > --- a/include/linux/memremap.h > +++ b/include/linux/memremap.h > @@ -123,7 +123,8 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap) > } > > #ifdef CONFIG_ZONE_DEVICE > -bool pfn_zone_device_reserved(unsigned long pfn); > +bool pfn_zone_device(unsigned long pfn); > +bool __pfn_zone_device(unsigned long pfn, struct dev_pagemap *pgmap); > void *memremap_pages(struct dev_pagemap *pgmap, int nid); > void memunmap_pages(struct dev_pagemap *pgmap); > void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); > @@ -134,7 +135,13 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn, > unsigned long vmem_altmap_offset(struct vmem_altmap *altmap); > void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns); > #else > -static inline bool pfn_zone_device_reserved(unsigned long pfn) > +static inline bool pfn_zone_device(unsigned long pfn) > +{ > + return false; > +} > + > +static inline bool __pfn_zone_device(unsigned long pfn, > + struct dev_pagemap *pgmap) > { > return false; > } > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index 7ef849da8278..2b4cc6b67720 100644 > --- a/mm/memory-failure.c > +++ 
b/mm/memory-failure.c > @@ -1161,6 +1161,14 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags, > loff_t start; > dax_entry_t cookie; > > + /* memmaps of driver reserved memory is not initialized */ > + if (!__pfn_zone_device(pfn, pgmap)) { > + pr_err("Memory failure: %#lx: driver reserved memory\n", > + pfn); > + rc = -ENXIO; > + goto out; > + } > + > /* > * Prevent the inode from being freed while we are interrogating > * the address_space, typically this would be handled by > @@ -1253,17 +1261,19 @@ int memory_failure(unsigned long pfn, int flags) > if (!sysctl_memory_failure_recovery) > panic("Memory failure on page %lx", pfn); > > - if (!pfn_valid(pfn)) { > + p = pfn_to_online_page(pfn); > + if (!p) { > + if (pfn_valid(pfn)) { > + pgmap = get_dev_pagemap(pfn, NULL); > + if (pgmap) > + return memory_failure_dev_pagemap(pfn, flags, > + pgmap); > + } > pr_err("Memory failure: %#lx: memory outside kernel control\n", > pfn); > return -ENXIO; > } > > - pgmap = get_dev_pagemap(pfn, NULL); > - if (pgmap) > - return memory_failure_dev_pagemap(pfn, flags, pgmap); > - > - p = pfn_to_page(pfn); > if (PageHuge(p)) > return memory_failure_hugetlb(pfn, flags); > if (TestSetPageHWPoison(p)) { > diff --git a/mm/memremap.c b/mm/memremap.c > index 7fed8bd32a18..9f3bb223aec7 100644 > --- a/mm/memremap.c > +++ b/mm/memremap.c > @@ -73,21 +73,24 @@ static unsigned long pfn_next(unsigned long pfn) > return pfn + 1; > } > > +bool __pfn_zone_device(unsigned long pfn, struct dev_pagemap *pgmap) > +{ > + return pfn >= pfn_first(pgmap) && pfn <= pfn_end(pgmap); > +} > + > /* > - * This returns true if the page is reserved by ZONE_DEVICE driver. > + * Returns true if the page was initialized to the ZONE_DEVICE (especially, > + * is not reserved for driver usage). > */ > -bool pfn_zone_device_reserved(unsigned long pfn) > +bool pfn_zone_device(unsigned long pfn) > { > struct dev_pagemap *pgmap; > - struct vmem_altmap *altmap; > - bool ret = false; > + bool ret; > > pgmap = get_dev_pagemap(pfn, NULL); > if (!pgmap) > - return ret; > - altmap = pgmap_altmap(pgmap); > - if (altmap && pfn < (altmap->base_pfn + altmap->reserve)) > - ret = true; > + return false; > + ret = __pfn_zone_device(pfn, pgmap); > put_dev_pagemap(pgmap); > > return ret; > -- > 2.21.0
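For reference, the two directions weighed in this exchange can be condensed as follows; this is an editorial illustration distilled from the patch hunks and the comment above, not code taken from either mail:

    /* As proposed by this patch: also admit initialized ZONE_DEVICE memmaps. */
    if (pfn_valid(pfn) &&
        (pfn_to_online_page(pfn) || pfn_zone_device(pfn)))
            ppage = pfn_to_page(pfn);
    else
            ppage = NULL;

    /* Alternative hinted at above: only touch memmaps of online sections. */
    ppage = pfn_to_online_page(pfn);

The first variant keeps the /proc interfaces working for ZONE_DEVICE at the price of an explicit zone-device check in every pfn walker; the second avoids that check but silently skips ZONE_DEVICE memory.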
Hi David, On Wed, Oct 09, 2019 at 11:12:04AM +0200, David Hildenbrand wrote: > There are various places where we access uninitialized memmaps, namely: > - /proc/kpagecount > - /proc/kpageflags > - /proc/kpagecgroup > - memory_failure() - which reuses stable_page_flags() from fs/proc/page.c Ah right, memory_failure is another victim of this bug. > > We have initialized memmaps either when the section is online or when > the page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain > garbage and in the worst case trigger kernel BUGs, especially with > CONFIG_PAGE_POISONING. > > For example, not onlining a DIMM during boot and calling /proc/kpagecount > with CONFIG_PAGE_POISONING: > :/# cat /proc/kpagecount > tmp.test > [ 95.600592] BUG: unable to handle page fault for address: fffffffffffffffe > [ 95.601238] #PF: supervisor read access in kernel mode > [ 95.601675] #PF: error_code(0x0000) - not-present page > [ 95.602116] PGD 114616067 P4D 114616067 PUD 114618067 PMD 0 > [ 95.602596] Oops: 0000 [#1] SMP NOPTI > [ 95.602920] CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11 > [ 95.603547] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4 > [ 95.604521] RIP: 0010:kpagecount_read+0xce/0x1e0 > [ 95.604917] Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480 > [ 95.606450] RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202 > [ 95.606904] RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000 > [ 95.607519] RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000 > [ 95.608128] RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000 > [ 95.608731] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000 > [ 95.609327] R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08 > [ 95.609924] FS: 00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000 > [ 95.610599] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 95.611083] CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0 > [ 95.611686] Call Trace: > [ 95.611906] proc_reg_read+0x3c/0x60 > [ 95.612228] vfs_read+0xc5/0x180 > [ 95.612505] ksys_read+0x68/0xe0 > [ 95.612785] do_syscall_64+0x5c/0xa0 > [ 95.613092] entry_SYSCALL_64_after_hwframe+0x49/0xbe > > Note that there are still two possible races as far as I can see: > - pfn_to_online_page() succeeding but the memory getting offlined and > removed. get_online_mems() could help once we run into this. > - pfn_zone_device() succeeding but the memmap not being fully > initialized yet. As the memmap is initialized outside of the memory > hoptlug lock, get_online_mems() can't help. > > Let's keep the existing interfaces working with ZONE_DEVICE memory. We > can later come back and fix these rare races and eventually speed-up the > ZONE_DEVICE detection. Actually, Toshiki is writing code to refactor and optimize the pfn walking part, where we find the pfn ranges covered by zone devices by running over xarray pgmap_array and use the range info to reduce pointer dereferences to speed up pfn walk. I hope he will share it soon. Thanks, Naoya Horiguchi > This patch now also makes sure we don't dump data > about memory blocks that are already offline again. 
> > Reported-by: Qian Cai <cai@lca.pw> > Cc: Alexey Dobriyan <adobriyan@gmail.com> > Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Stephen Rothwell <sfr@canb.auug.org.au> > Cc: Michal Hocko <mhocko@kernel.org> > Cc: Toshiki Fukasawa <t-fukasawa@vx.jp.nec.com> > Cc: David Hildenbrand <david@redhat.com> > Cc: Konstantin Khlebnikov <koct9i@gmail.com> > Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> > Cc: Anthony Yznaga <anthony.yznaga@oracle.com> > Cc: Jason Gunthorpe <jgg@ziepe.ca> > Cc: Dan Williams <dan.j.williams@intel.com> > Cc: Logan Gunthorpe <logang@deltatee.com> > Cc: Ira Weiny <ira.weiny@intel.com> > Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> > Cc: linux-fsdevel@vger.kernel.org > Signed-off-by: David Hildenbrand <david@redhat.com> > --- > fs/proc/page.c | 12 ++++++------ > include/linux/memremap.h | 11 +++++++++-- > mm/memory-failure.c | 22 ++++++++++++++++------ > mm/memremap.c | 19 +++++++++++-------- > 4 files changed, 42 insertions(+), 22 deletions(-) > > diff --git a/fs/proc/page.c b/fs/proc/page.c > index decd3fe39674..76502af461e2 100644 > --- a/fs/proc/page.c > +++ b/fs/proc/page.c > @@ -42,7 +42,8 @@ static ssize_t kpagecount_read(struct file *file, char __user *buf, > return -EINVAL; > > while (count > 0) { > - if (pfn_valid(pfn)) > + if (pfn_valid(pfn) && > + (pfn_to_online_page(pfn) || pfn_zone_device(pfn))) > ppage = pfn_to_page(pfn); > else > ppage = NULL; > @@ -97,9 +98,6 @@ u64 stable_page_flags(struct page *page) > if (!page) > return BIT_ULL(KPF_NOPAGE); > > - if (pfn_zone_device_reserved(page_to_pfn(page))) > - return BIT_ULL(KPF_RESERVED); > - > k = page->flags; > u = 0; > > @@ -218,7 +216,8 @@ static ssize_t kpageflags_read(struct file *file, char __user *buf, > return -EINVAL; > > while (count > 0) { > - if (pfn_valid(pfn)) > + if (pfn_valid(pfn) && > + (pfn_to_online_page(pfn) || pfn_zone_device(pfn))) > ppage = pfn_to_page(pfn); > else > ppage = NULL; > @@ -263,7 +262,8 @@ static ssize_t kpagecgroup_read(struct file *file, char __user *buf, > return -EINVAL; > > while (count > 0) { > - if (pfn_valid(pfn)) > + if (pfn_valid(pfn) && > + (pfn_to_online_page(pfn) || pfn_zone_device(pfn))) > ppage = pfn_to_page(pfn); > else > ppage = NULL; > diff --git a/include/linux/memremap.h b/include/linux/memremap.h > index c676e33205d3..c076bb163c2f 100644 > --- a/include/linux/memremap.h > +++ b/include/linux/memremap.h > @@ -123,7 +123,8 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap) > } > > #ifdef CONFIG_ZONE_DEVICE > -bool pfn_zone_device_reserved(unsigned long pfn); > +bool pfn_zone_device(unsigned long pfn); > +bool __pfn_zone_device(unsigned long pfn, struct dev_pagemap *pgmap); > void *memremap_pages(struct dev_pagemap *pgmap, int nid); > void memunmap_pages(struct dev_pagemap *pgmap); > void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); > @@ -134,7 +135,13 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn, > unsigned long vmem_altmap_offset(struct vmem_altmap *altmap); > void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns); > #else > -static inline bool pfn_zone_device_reserved(unsigned long pfn) > +static inline bool pfn_zone_device(unsigned long pfn) > +{ > + return false; > +} > + > +static inline bool __pfn_zone_device(unsigned long pfn, > + struct dev_pagemap *pgmap) > { > return false; > } > diff --git a/mm/memory-failure.c b/mm/memory-failure.c > index 7ef849da8278..2b4cc6b67720 100644 > --- a/mm/memory-failure.c > +++ 
b/mm/memory-failure.c > @@ -1161,6 +1161,14 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags, > loff_t start; > dax_entry_t cookie; > > + /* memmaps of driver reserved memory is not initialized */ > + if (!__pfn_zone_device(pfn, pgmap)) { > + pr_err("Memory failure: %#lx: driver reserved memory\n", > + pfn); > + rc = -ENXIO; > + goto out; > + } > + > /* > * Prevent the inode from being freed while we are interrogating > * the address_space, typically this would be handled by > @@ -1253,17 +1261,19 @@ int memory_failure(unsigned long pfn, int flags) > if (!sysctl_memory_failure_recovery) > panic("Memory failure on page %lx", pfn); > > - if (!pfn_valid(pfn)) { > + p = pfn_to_online_page(pfn); > + if (!p) { > + if (pfn_valid(pfn)) { > + pgmap = get_dev_pagemap(pfn, NULL); > + if (pgmap) > + return memory_failure_dev_pagemap(pfn, flags, > + pgmap); > + } > pr_err("Memory failure: %#lx: memory outside kernel control\n", > pfn); > return -ENXIO; > } > > - pgmap = get_dev_pagemap(pfn, NULL); > - if (pgmap) > - return memory_failure_dev_pagemap(pfn, flags, pgmap); > - > - p = pfn_to_page(pfn); > if (PageHuge(p)) > return memory_failure_hugetlb(pfn, flags); > if (TestSetPageHWPoison(p)) { > diff --git a/mm/memremap.c b/mm/memremap.c > index 7fed8bd32a18..9f3bb223aec7 100644 > --- a/mm/memremap.c > +++ b/mm/memremap.c > @@ -73,21 +73,24 @@ static unsigned long pfn_next(unsigned long pfn) > return pfn + 1; > } > > +bool __pfn_zone_device(unsigned long pfn, struct dev_pagemap *pgmap) > +{ > + return pfn >= pfn_first(pgmap) && pfn <= pfn_end(pgmap); > +} > + > /* > - * This returns true if the page is reserved by ZONE_DEVICE driver. > + * Returns true if the page was initialized to the ZONE_DEVICE (especially, > + * is not reserved for driver usage). > */ > -bool pfn_zone_device_reserved(unsigned long pfn) > +bool pfn_zone_device(unsigned long pfn) > { > struct dev_pagemap *pgmap; > - struct vmem_altmap *altmap; > - bool ret = false; > + bool ret; > > pgmap = get_dev_pagemap(pfn, NULL); > if (!pgmap) > - return ret; > - altmap = pgmap_altmap(pgmap); > - if (altmap && pfn < (altmap->base_pfn + altmap->reserve)) > - ret = true; > + return false; > + ret = __pfn_zone_device(pfn, pgmap); > put_dev_pagemap(pgmap); > > return ret; > -- > 2.21.0 > >
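To make the pfn-walk optimization described above more concrete, here is a rough editorial sketch of caching the pfn range of the last dev_pagemap hit, so that a linear walk does not have to call get_dev_pagemap() for every single pfn; example_pgmap_pfn_range() is a hypothetical helper standing in for whatever the real series would derive from the pgmap_array xarray:

    struct zone_device_pfn_cache {
            unsigned long start_pfn;        /* inclusive */
            unsigned long end_pfn;          /* exclusive; 0 means nothing cached */
    };

    /*
     * example_pgmap_pfn_range() is an assumed helper that would report the
     * pfn span covered by a dev_pagemap; it does not exist in the kernel as-is.
     */
    static bool pfn_is_zone_device_cached(unsigned long pfn,
                                          struct zone_device_pfn_cache *cache)
    {
            struct dev_pagemap *pgmap;

            if (pfn >= cache->start_pfn && pfn < cache->end_pfn)
                    return true;

            pgmap = get_dev_pagemap(pfn, NULL);
            if (!pgmap)
                    return false;

            example_pgmap_pfn_range(pgmap, &cache->start_pfn, &cache->end_pfn);
            put_dev_pagemap(pgmap);

            return pfn >= cache->start_pfn && pfn < cache->end_pfn;
    }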
On 09.10.19 11:37, Michal Hocko wrote: > On Wed 09-10-19 11:12:04, David Hildenbrand wrote: >> There are various places where we access uninitialized memmaps, namely: >> - /proc/kpagecount >> - /proc/kpageflags >> - /proc/kpagecgroup >> - memory_failure() - which reuses stable_page_flags() from fs/proc/page.c >> >> We have initialized memmaps either when the section is online or when >> the page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain >> garbage and in the worst case trigger kernel BUGs, especially with >> CONFIG_PAGE_POISONING. >> >> For example, not onlining a DIMM during boot and calling /proc/kpagecount >> with CONFIG_PAGE_POISONING: >> :/# cat /proc/kpagecount > tmp.test >> [ 95.600592] BUG: unable to handle page fault for address: fffffffffffffffe >> [ 95.601238] #PF: supervisor read access in kernel mode >> [ 95.601675] #PF: error_code(0x0000) - not-present page >> [ 95.602116] PGD 114616067 P4D 114616067 PUD 114618067 PMD 0 >> [ 95.602596] Oops: 0000 [#1] SMP NOPTI >> [ 95.602920] CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11 >> [ 95.603547] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4 >> [ 95.604521] RIP: 0010:kpagecount_read+0xce/0x1e0 >> [ 95.604917] Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480 >> [ 95.606450] RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202 >> [ 95.606904] RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000 >> [ 95.607519] RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000 >> [ 95.608128] RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000 >> [ 95.608731] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000 >> [ 95.609327] R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08 >> [ 95.609924] FS: 00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000 >> [ 95.610599] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> [ 95.611083] CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0 >> [ 95.611686] Call Trace: >> [ 95.611906] proc_reg_read+0x3c/0x60 >> [ 95.612228] vfs_read+0xc5/0x180 >> [ 95.612505] ksys_read+0x68/0xe0 >> [ 95.612785] do_syscall_64+0x5c/0xa0 >> [ 95.613092] entry_SYSCALL_64_after_hwframe+0x49/0xbe >> >> Note that there are still two possible races as far as I can see: >> - pfn_to_online_page() succeeding but the memory getting offlined and >> removed. get_online_mems() could help once we run into this. >> - pfn_zone_device() succeeding but the memmap not being fully >> initialized yet. As the memmap is initialized outside of the memory >> hoptlug lock, get_online_mems() can't help. >> >> Let's keep the existing interfaces working with ZONE_DEVICE memory. We >> can later come back and fix these rare races and eventually speed-up the >> ZONE_DEVICE detection. This patch now also makes sure we don't dump data >> about memory blocks that are already offline again. > > As I've already expressed already I am not really happy to have explicit > checks for zone device pages in pfn walkers. This is simply too fragile. > > pfn_to_online_page makes sense because offline pages are not really in a > defined state. This would be worth a patch of its own. I remember there The issue is, once I check for pfn_to_online_page(), these functions can't handle ZONE_DEVICE at all anymore. Especially in regards to memory_failure() I don't think this is acceptable. 
So while I (personally) only care about adding pfn_to_online_page() checks, the in-this-sense-fragile-subsection ZONE_DEVICE implementation requires me to introduce a temporary check for initialized memmaps. > was a discussion about the uninitialized zone device memmaps. It would > be really good to summarize this discussion in the changelog and > conclude why the explicit check is really good and what were other > alternatives considered. Yeah, I also expressed my feelings and the issues to be solved by ZONE_DEVICE people in https://lkml.org/lkml/2019/9/6/114. However, the discussion stalled and nobody really proposed a solution or followed up. I want to have this fixed now, as it is much easier to trigger with "[PATCH v6 00/10] mm/memory_hotplug: Shrink zones before removing memory" for !ZONE_DEVICE. As we already do have a similar check (pfn_zone_device_reserved()) that I merely convert for now, I will summarize and make it clearer that this is something that should be reworked by ZONE_DEVICE people rather sooner than later. Thanks! > > Thanks! > >> Reported-by: Qian Cai <cai@lca.pw> >> Cc: Alexey Dobriyan <adobriyan@gmail.com> >> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> >> Cc: Andrew Morton <akpm@linux-foundation.org> >> Cc: Stephen Rothwell <sfr@canb.auug.org.au> >> Cc: Michal Hocko <mhocko@kernel.org> >> Cc: Toshiki Fukasawa <t-fukasawa@vx.jp.nec.com> >> Cc: David Hildenbrand <david@redhat.com> >> Cc: Konstantin Khlebnikov <koct9i@gmail.com> >> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> >> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> >> Cc: Jason Gunthorpe <jgg@ziepe.ca> >> Cc: Dan Williams <dan.j.williams@intel.com> >> Cc: Logan Gunthorpe <logang@deltatee.com> >> Cc: Ira Weiny <ira.weiny@intel.com> >> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> >> Cc: linux-fsdevel@vger.kernel.org >> Signed-off-by: David Hildenbrand <david@redhat.com> >> --- >> fs/proc/page.c | 12 ++++++------ >> include/linux/memremap.h | 11 +++++++++-- >> mm/memory-failure.c | 22 ++++++++++++++++------ >> mm/memremap.c | 19 +++++++++++-------- >> 4 files changed, 42 insertions(+), 22 deletions(-) >> >> diff --git a/fs/proc/page.c b/fs/proc/page.c >> index decd3fe39674..76502af461e2 100644 >> --- a/fs/proc/page.c >> +++ b/fs/proc/page.c >> @@ -42,7 +42,8 @@ static ssize_t kpagecount_read(struct file *file, char __user *buf, >> return -EINVAL; >> >> while (count > 0) { >> - if (pfn_valid(pfn)) >> + if (pfn_valid(pfn) && >> + (pfn_to_online_page(pfn) || pfn_zone_device(pfn))) >> ppage = pfn_to_page(pfn); >> else >> ppage = NULL; >> @@ -97,9 +98,6 @@ u64 stable_page_flags(struct page *page) >> if (!page) >> return BIT_ULL(KPF_NOPAGE); >> >> - if (pfn_zone_device_reserved(page_to_pfn(page))) >> - return BIT_ULL(KPF_RESERVED); >> - >> k = page->flags; >> u = 0; >> >> @@ -218,7 +216,8 @@ static ssize_t kpageflags_read(struct file *file, char __user *buf, >> return -EINVAL; >> >> while (count > 0) { >> - if (pfn_valid(pfn)) >> + if (pfn_valid(pfn) && >> + (pfn_to_online_page(pfn) || pfn_zone_device(pfn))) >> ppage = pfn_to_page(pfn); >> else >> ppage = NULL; >> @@ -263,7 +262,8 @@ static ssize_t kpagecgroup_read(struct file *file, char __user *buf, >> return -EINVAL; >> >> while (count > 0) { >> - if (pfn_valid(pfn)) >> + if (pfn_valid(pfn) && >> + (pfn_to_online_page(pfn) || pfn_zone_device(pfn))) >> ppage = pfn_to_page(pfn); >> else >> ppage = NULL; >> diff --git a/include/linux/memremap.h b/include/linux/memremap.h >> index c676e33205d3..c076bb163c2f 100644 >> --- a/include/linux/memremap.h >> +++ 
b/include/linux/memremap.h >> @@ -123,7 +123,8 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap) >> } >> >> #ifdef CONFIG_ZONE_DEVICE >> -bool pfn_zone_device_reserved(unsigned long pfn); >> +bool pfn_zone_device(unsigned long pfn); >> +bool __pfn_zone_device(unsigned long pfn, struct dev_pagemap *pgmap); >> void *memremap_pages(struct dev_pagemap *pgmap, int nid); >> void memunmap_pages(struct dev_pagemap *pgmap); >> void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); >> @@ -134,7 +135,13 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn, >> unsigned long vmem_altmap_offset(struct vmem_altmap *altmap); >> void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns); >> #else >> -static inline bool pfn_zone_device_reserved(unsigned long pfn) >> +static inline bool pfn_zone_device(unsigned long pfn) >> +{ >> + return false; >> +} >> + >> +static inline bool __pfn_zone_device(unsigned long pfn, >> + struct dev_pagemap *pgmap) >> { >> return false; >> } >> diff --git a/mm/memory-failure.c b/mm/memory-failure.c >> index 7ef849da8278..2b4cc6b67720 100644 >> --- a/mm/memory-failure.c >> +++ b/mm/memory-failure.c >> @@ -1161,6 +1161,14 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags, >> loff_t start; >> dax_entry_t cookie; >> >> + /* memmaps of driver reserved memory is not initialized */ >> + if (!__pfn_zone_device(pfn, pgmap)) { >> + pr_err("Memory failure: %#lx: driver reserved memory\n", >> + pfn); >> + rc = -ENXIO; >> + goto out; >> + } >> + >> /* >> * Prevent the inode from being freed while we are interrogating >> * the address_space, typically this would be handled by >> @@ -1253,17 +1261,19 @@ int memory_failure(unsigned long pfn, int flags) >> if (!sysctl_memory_failure_recovery) >> panic("Memory failure on page %lx", pfn); >> >> - if (!pfn_valid(pfn)) { >> + p = pfn_to_online_page(pfn); >> + if (!p) { >> + if (pfn_valid(pfn)) { >> + pgmap = get_dev_pagemap(pfn, NULL); >> + if (pgmap) >> + return memory_failure_dev_pagemap(pfn, flags, >> + pgmap); >> + } >> pr_err("Memory failure: %#lx: memory outside kernel control\n", >> pfn); >> return -ENXIO; >> } >> >> - pgmap = get_dev_pagemap(pfn, NULL); >> - if (pgmap) >> - return memory_failure_dev_pagemap(pfn, flags, pgmap); >> - >> - p = pfn_to_page(pfn); >> if (PageHuge(p)) >> return memory_failure_hugetlb(pfn, flags); >> if (TestSetPageHWPoison(p)) { >> diff --git a/mm/memremap.c b/mm/memremap.c >> index 7fed8bd32a18..9f3bb223aec7 100644 >> --- a/mm/memremap.c >> +++ b/mm/memremap.c >> @@ -73,21 +73,24 @@ static unsigned long pfn_next(unsigned long pfn) >> return pfn + 1; >> } >> >> +bool __pfn_zone_device(unsigned long pfn, struct dev_pagemap *pgmap) >> +{ >> + return pfn >= pfn_first(pgmap) && pfn <= pfn_end(pgmap); >> +} >> + >> /* >> - * This returns true if the page is reserved by ZONE_DEVICE driver. >> + * Returns true if the page was initialized to the ZONE_DEVICE (especially, >> + * is not reserved for driver usage). >> */ >> -bool pfn_zone_device_reserved(unsigned long pfn) >> +bool pfn_zone_device(unsigned long pfn) >> { >> struct dev_pagemap *pgmap; >> - struct vmem_altmap *altmap; >> - bool ret = false; >> + bool ret; >> >> pgmap = get_dev_pagemap(pfn, NULL); >> if (!pgmap) >> - return ret; >> - altmap = pgmap_altmap(pgmap); >> - if (altmap && pfn < (altmap->base_pfn + altmap->reserve)) >> - ret = true; >> + return false; >> + ret = __pfn_zone_device(pfn, pgmap); >> put_dev_pagemap(pgmap); >> >> return ret; >> -- >> 2.21.0 >
On Wed 09-10-19 12:19:59, David Hildenbrand wrote:
[...]
> > pfn_to_online_page makes sense because offline pages are not really in a
> > defined state. This would be worth a patch of its own. I remember there
> 
> The issue is, once I check for pfn_to_online_page(), these functions
> can't handle ZONE_DEVICE at all anymore. Especially in regards to
> memory_failure() I don't think this is acceptable.

Could you be more specific please? I am not sure I am following.

> So while I
> (personally) only care about adding pfn_to_online_page() checks, the
> in-this-sense-fragile-subsection ZONE_DEVICE implementation requires me
> to introduce a temporary check for initialized memmaps.
> 
> > was a discussion about the uninitialized zone device memmaps. It would
> > be really good to summarize this discussion in the changelog and
> > conclude why the explicit check is really good and what were other
> > alternatives considered.
> 
> Yeah, I also expressed my feelings and the issues to be solved by
> ZONE_DEVICE people in https://lkml.org/lkml/2019/9/6/114. However, the
> discussion stalled and nobody really proposed a solution or followed up.

I will try to get back to that discussion but is there any technical
reason that prevents any conclusion or is it just stuck on a lack of
time of the participants?
On 09.10.19 13:24, Michal Hocko wrote:
> On Wed 09-10-19 12:19:59, David Hildenbrand wrote:
> [...]
>>> pfn_to_online_page makes sense because offline pages are not really in a
>>> defined state. This would be worth a patch of its own. I remember there
>>
>> The issue is, once I check for pfn_to_online_page(), these functions
>> can't handle ZONE_DEVICE at all anymore. Especially in regards to
>> memory_failure() I don't think this is acceptable.
> 
> Could you be more specific please? I am not sure I am following.

I wasn't quite clear, let me try to be more precise:

if (pfn_to_online_page(pfn)) {
        /* memmap initialized */
} else if (pfn_valid(pfn)) {
        /* ???
         * a) offline memory. memmap garbage.
         * b) memremap memory: memmap initialized to ZONE_DEVICE.
         * c) memremap memory: reserved for driver. memmap garbage.
         * (d) memremap memory: memmap currently initializing - garbage)
         */
}

To distinguish between a) and b/c), we can currently only use
get_dev_pagemap(pfn, NULL). To distinguish between b) and c), we can
currently only use pfn_zone_device_reserved().

That implies, that - right now - if we want to fix what is described in the
patch without introducing more users of get_dev_pagemap(pfn, NULL), we will
not be able to support ZONE_DEVICE in:
- /proc/kpagecount
- /proc/kpageflags
- /proc/kpagecgroup

if (pfn_to_online_page(pfn)) {
        /* memmap initialized */
} else {
        /* skip */
}

Now, memory_failure() already contains a get_dev_pagemap(pfn, NULL) check
and adding pfn_to_online_page(pfn) would also work.

I would be fine with this, but it means that - for now - the three
/proc/ files won't be able to deal with ZONE_DEVICE memory.

> 
>> So while I
>> (personally) only care about adding pfn_to_online_page() checks, the
>> in-this-sense-fragile-subsection ZONE_DEVICE implementation requires me
>> to introduce a temporary check for initialized memmaps.
>>
>>> was a discussion about the uninitialized zone device memmaps. It would
>>> be really good to summarize this discussion in the changelog and
>>> conclude why the explicit check is really good and what were other
>>> alternatives considered.
>>
>> Yeah, I also expressed my feelings and the issues to be solved by
>> ZONE_DEVICE people in https://lkml.org/lkml/2019/9/6/114. However, the
>> discussion stalled and nobody really proposed a solution or followed up.
> 
> I will try to get back to that discussion but is there any technical
> reason that prevents any conclusion or it is just stuck on a lack of
> time of the participants?

I think it was both. People not responding to questions and not having
decided on a solution.
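Put together, the case analysis above roughly corresponds to the following lookup. This is an editorial sketch built from the helpers named in this thread (pfn_to_online_page(), get_dev_pagemap(), pfn_zone_device_reserved()), not code from the patch, and it deliberately leaves the racy case (d) unsolved:

    static struct page *example_pfn_to_initialized_page(unsigned long pfn)
    {
            struct dev_pagemap *pgmap;
            struct page *page;

            page = pfn_to_online_page(pfn);
            if (page)
                    return page;            /* online section: memmap initialized */

            if (!pfn_valid(pfn))
                    return NULL;            /* no memmap at all */

            /* a) offline memory: no pgmap, the memmap is garbage */
            pgmap = get_dev_pagemap(pfn, NULL);
            if (!pgmap)
                    return NULL;

            if (pfn_zone_device_reserved(pfn))
                    page = NULL;            /* c) reserved for the driver: garbage */
            else
                    page = pfn_to_page(pfn); /* b) initialized ZONE_DEVICE memmap */
            put_dev_pagemap(pgmap);

            return page;
    }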
On Wed 09-10-19 14:58:24, David Hildenbrand wrote:
[...]
> I would be fine with this, but it means that - for now - the three
> /proc/ files won't be able to deal with ZONE_DEVICE memory.

Thanks for the clarification. Is this an actual problem though? Do we
have any consumers of the functionality?
On 09.10.19 15:22, Michal Hocko wrote:
> On Wed 09-10-19 14:58:24, David Hildenbrand wrote:
> [...]
>> I would be fine with this, but it means that - for now - the three
>> /proc/ files won't be able to deal with ZONE_DEVICE memory.
> 
> Thanks for the clarification. Is this an actual problem though? Do we
> have any consumers of the functionality?
> 

I don't know, that's why I was being careful in not changing its behavior.
On Wed 09-10-19 15:24:05, David Hildenbrand wrote:
> On 09.10.19 15:22, Michal Hocko wrote:
> > On Wed 09-10-19 14:58:24, David Hildenbrand wrote:
> > [...]
> >> I would be fine with this, but it means that - for now - the three
> >> /proc/ files won't be able to deal with ZONE_DEVICE memory.
> > 
> > Thanks for the clarification. Is this an actual problem though? Do we
> > have any consumers of the functionality?
> > 
> 
> I don't know, that's why I was being careful in not changing its behavior.

Can we simply go with pfn_to_online_page. I would be quite surprised if
anybody was really examining zone device memory ranges as they are
static IIUC. If there is some usecase I am pretty sure we will learn
that and can address it accordingly.
On 09.10.19 15:29, Michal Hocko wrote:
> On Wed 09-10-19 15:24:05, David Hildenbrand wrote:
>> On 09.10.19 15:22, Michal Hocko wrote:
>>> On Wed 09-10-19 14:58:24, David Hildenbrand wrote:
>>> [...]
>>>> I would be fine with this, but it means that - for now - the three
>>>> /proc/ files won't be able to deal with ZONE_DEVICE memory.
>>>
>>> Thanks for the clarification. Is this an actual problem though? Do we
>>> have any consumers of the functionality?
>>>
>>
>> I don't know, that's why I was being careful in not changing its behavior.
> 
> Can we simply go with pfn_to_online_page. I would be quite surprised if
> anybody was really examining zone device memory ranges as they are
> static IIUC. If there is some usecase I am pretty sure we will learn
> that and can address it accordingly.
> 

I consider it mostly a debug interface either way. Will rework, test and
resend. Thanks!
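The rework agreed on here boils down to using only the pfn_to_online_page() check in the three /proc walkers. A minimal sketch of what the kpagecount loop body would then look like (variable names follow the existing kpagecount_read(); the surrounding setup and copy-out plumbing is unchanged and not shown, and the real function filters a few additional page types):

    while (count > 0) {
            /*
             * pfn_to_online_page() returns NULL for offline sections and for
             * ZONE_DEVICE memory, so only initialized memmaps are dereferenced;
             * such pfns are simply reported as having no page.
             */
            ppage = pfn_to_online_page(pfn);
            if (!ppage)
                    pcount = 0;
            else
                    pcount = page_mapcount(ppage);

            if (put_user(pcount, out)) {
                    ret = -EFAULT;
                    break;
            }

            pfn++;
            out++;
            count -= KPMSIZE;
    }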
On 09.10.19 11:57, Naoya Horiguchi wrote: > Hi David, > > On Wed, Oct 09, 2019 at 11:12:04AM +0200, David Hildenbrand wrote: >> There are various places where we access uninitialized memmaps, namely: >> - /proc/kpagecount >> - /proc/kpageflags >> - /proc/kpagecgroup >> - memory_failure() - which reuses stable_page_flags() from fs/proc/page.c > > Ah right, memory_failure is another victim of this bug. > >> >> We have initialized memmaps either when the section is online or when >> the page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain >> garbage and in the worst case trigger kernel BUGs, especially with >> CONFIG_PAGE_POISONING. >> >> For example, not onlining a DIMM during boot and calling /proc/kpagecount >> with CONFIG_PAGE_POISONING: >> :/# cat /proc/kpagecount > tmp.test >> [ 95.600592] BUG: unable to handle page fault for address: fffffffffffffffe >> [ 95.601238] #PF: supervisor read access in kernel mode >> [ 95.601675] #PF: error_code(0x0000) - not-present page >> [ 95.602116] PGD 114616067 P4D 114616067 PUD 114618067 PMD 0 >> [ 95.602596] Oops: 0000 [#1] SMP NOPTI >> [ 95.602920] CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11 >> [ 95.603547] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4 >> [ 95.604521] RIP: 0010:kpagecount_read+0xce/0x1e0 >> [ 95.604917] Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480 >> [ 95.606450] RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202 >> [ 95.606904] RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000 >> [ 95.607519] RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000 >> [ 95.608128] RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000 >> [ 95.608731] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000 >> [ 95.609327] R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08 >> [ 95.609924] FS: 00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000 >> [ 95.610599] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> [ 95.611083] CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0 >> [ 95.611686] Call Trace: >> [ 95.611906] proc_reg_read+0x3c/0x60 >> [ 95.612228] vfs_read+0xc5/0x180 >> [ 95.612505] ksys_read+0x68/0xe0 >> [ 95.612785] do_syscall_64+0x5c/0xa0 >> [ 95.613092] entry_SYSCALL_64_after_hwframe+0x49/0xbe >> >> Note that there are still two possible races as far as I can see: >> - pfn_to_online_page() succeeding but the memory getting offlined and >> removed. get_online_mems() could help once we run into this. >> - pfn_zone_device() succeeding but the memmap not being fully >> initialized yet. As the memmap is initialized outside of the memory >> hoptlug lock, get_online_mems() can't help. >> >> Let's keep the existing interfaces working with ZONE_DEVICE memory. We >> can later come back and fix these rare races and eventually speed-up the >> ZONE_DEVICE detection. > > Actually, Toshiki is writing code to refactor and optimize the pfn walking > part, where we find the pfn ranges covered by zone devices by running over > xarray pgmap_array and use the range info to reduce pointer dereferences > to speed up pfn walk. I hope he will share it soon. AFAIKT, Michal is not a friend of special-casing PFN walkers in that way. We should have a mechanism to detect if a memmap was initialized without having to go via pgmap, special-casing. See my other mail where I draft one basic approach.
On Thu, Oct 10, 2019 at 09:30:01AM +0200, David Hildenbrand wrote: > On 09.10.19 11:57, Naoya Horiguchi wrote: > > Hi David, > > > > On Wed, Oct 09, 2019 at 11:12:04AM +0200, David Hildenbrand wrote: > >> There are various places where we access uninitialized memmaps, namely: > >> - /proc/kpagecount > >> - /proc/kpageflags > >> - /proc/kpagecgroup > >> - memory_failure() - which reuses stable_page_flags() from fs/proc/page.c > > > > Ah right, memory_failure is another victim of this bug. > > > >> > >> We have initialized memmaps either when the section is online or when > >> the page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain > >> garbage and in the worst case trigger kernel BUGs, especially with > >> CONFIG_PAGE_POISONING. > >> > >> For example, not onlining a DIMM during boot and calling /proc/kpagecount > >> with CONFIG_PAGE_POISONING: > >> :/# cat /proc/kpagecount > tmp.test > >> [ 95.600592] BUG: unable to handle page fault for address: fffffffffffffffe > >> [ 95.601238] #PF: supervisor read access in kernel mode > >> [ 95.601675] #PF: error_code(0x0000) - not-present page > >> [ 95.602116] PGD 114616067 P4D 114616067 PUD 114618067 PMD 0 > >> [ 95.602596] Oops: 0000 [#1] SMP NOPTI > >> [ 95.602920] CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11 > >> [ 95.603547] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4 > >> [ 95.604521] RIP: 0010:kpagecount_read+0xce/0x1e0 > >> [ 95.604917] Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480 > >> [ 95.606450] RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202 > >> [ 95.606904] RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000 > >> [ 95.607519] RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000 > >> [ 95.608128] RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000 > >> [ 95.608731] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000 > >> [ 95.609327] R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08 > >> [ 95.609924] FS: 00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000 > >> [ 95.610599] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > >> [ 95.611083] CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0 > >> [ 95.611686] Call Trace: > >> [ 95.611906] proc_reg_read+0x3c/0x60 > >> [ 95.612228] vfs_read+0xc5/0x180 > >> [ 95.612505] ksys_read+0x68/0xe0 > >> [ 95.612785] do_syscall_64+0x5c/0xa0 > >> [ 95.613092] entry_SYSCALL_64_after_hwframe+0x49/0xbe > >> > >> Note that there are still two possible races as far as I can see: > >> - pfn_to_online_page() succeeding but the memory getting offlined and > >> removed. get_online_mems() could help once we run into this. > >> - pfn_zone_device() succeeding but the memmap not being fully > >> initialized yet. As the memmap is initialized outside of the memory > >> hoptlug lock, get_online_mems() can't help. > >> > >> Let's keep the existing interfaces working with ZONE_DEVICE memory. We > >> can later come back and fix these rare races and eventually speed-up the > >> ZONE_DEVICE detection. > > > > Actually, Toshiki is writing code to refactor and optimize the pfn walking > > part, where we find the pfn ranges covered by zone devices by running over > > xarray pgmap_array and use the range info to reduce pointer dereferences > > to speed up pfn walk. I hope he will share it soon. > > AFAIKT, Michal is not a friend of special-casing PFN walkers in that > way. 
We should have a mechanism to detect if a memmap was initialized
> without having to go via pgmap, special-casing. See my other mail where
> I draft one basic approach.

OK, so considering your v2 approach, we could have another pfn_to_page()
variant like pfn_to_zone_device_page(), where we check that a given pfn
belongs to the memory section backed by ZONE_DEVICE memory, then another
check whether the pfn has an initialized memmap or not, and return NULL
if the memmap is not initialized. We'll try this approach then, but if
you find problems/concerns, please let me know.

Thanks,
Naoya Horiguchi
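A rough sketch of the variant described above might look like the following; example_memmap_initialized() is a placeholder for whatever mechanism ends up recording that the ZONE_DEVICE memmap of a pfn has been fully initialized, and is not an existing kernel API:

    static inline struct page *pfn_to_zone_device_page(unsigned long pfn)
    {
            struct dev_pagemap *pgmap;
            struct page *page = NULL;

            if (!pfn_valid(pfn))
                    return NULL;

            pgmap = get_dev_pagemap(pfn, NULL);
            if (!pgmap)
                    return NULL;            /* not ZONE_DEVICE at all */

            if (example_memmap_initialized(pfn))    /* assumed helper */
                    page = pfn_to_page(pfn);
            put_dev_pagemap(pgmap);

            return page;
    }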
On Fri, Oct 11, 2019 at 12:11:25AM +0000, Horiguchi Naoya(堀口 直也) wrote: > On Thu, Oct 10, 2019 at 09:30:01AM +0200, David Hildenbrand wrote: > > On 09.10.19 11:57, Naoya Horiguchi wrote: > > > Hi David, > > > > > > On Wed, Oct 09, 2019 at 11:12:04AM +0200, David Hildenbrand wrote: > > >> There are various places where we access uninitialized memmaps, namely: > > >> - /proc/kpagecount > > >> - /proc/kpageflags > > >> - /proc/kpagecgroup > > >> - memory_failure() - which reuses stable_page_flags() from fs/proc/page.c > > > > > > Ah right, memory_failure is another victim of this bug. > > > > > >> > > >> We have initialized memmaps either when the section is online or when > > >> the page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain > > >> garbage and in the worst case trigger kernel BUGs, especially with > > >> CONFIG_PAGE_POISONING. > > >> > > >> For example, not onlining a DIMM during boot and calling /proc/kpagecount > > >> with CONFIG_PAGE_POISONING: > > >> :/# cat /proc/kpagecount > tmp.test > > >> [ 95.600592] BUG: unable to handle page fault for address: fffffffffffffffe > > >> [ 95.601238] #PF: supervisor read access in kernel mode > > >> [ 95.601675] #PF: error_code(0x0000) - not-present page > > >> [ 95.602116] PGD 114616067 P4D 114616067 PUD 114618067 PMD 0 > > >> [ 95.602596] Oops: 0000 [#1] SMP NOPTI > > >> [ 95.602920] CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11 > > >> [ 95.603547] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4 > > >> [ 95.604521] RIP: 0010:kpagecount_read+0xce/0x1e0 > > >> [ 95.604917] Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480 > > >> [ 95.606450] RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202 > > >> [ 95.606904] RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000 > > >> [ 95.607519] RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000 > > >> [ 95.608128] RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000 > > >> [ 95.608731] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000 > > >> [ 95.609327] R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08 > > >> [ 95.609924] FS: 00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000 > > >> [ 95.610599] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > >> [ 95.611083] CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0 > > >> [ 95.611686] Call Trace: > > >> [ 95.611906] proc_reg_read+0x3c/0x60 > > >> [ 95.612228] vfs_read+0xc5/0x180 > > >> [ 95.612505] ksys_read+0x68/0xe0 > > >> [ 95.612785] do_syscall_64+0x5c/0xa0 > > >> [ 95.613092] entry_SYSCALL_64_after_hwframe+0x49/0xbe > > >> > > >> Note that there are still two possible races as far as I can see: > > >> - pfn_to_online_page() succeeding but the memory getting offlined and > > >> removed. get_online_mems() could help once we run into this. > > >> - pfn_zone_device() succeeding but the memmap not being fully > > >> initialized yet. As the memmap is initialized outside of the memory > > >> hoptlug lock, get_online_mems() can't help. > > >> > > >> Let's keep the existing interfaces working with ZONE_DEVICE memory. We > > >> can later come back and fix these rare races and eventually speed-up the > > >> ZONE_DEVICE detection. 
> > > > > > Actually, Toshiki is writing code to refactor and optimize the pfn walking > > > part, where we find the pfn ranges covered by zone devices by running over > > > xarray pgmap_array and use the range info to reduce pointer dereferences > > > to speed up pfn walk. I hope he will share it soon. > > > > AFAIKT, Michal is not a friend of special-casing PFN walkers in that > > way. We should have a mechanism to detect if a memmap was initialized > > without having to go via pgmap, special-casing. See my other mail where > > I draft one basic approach. > > OK, so considering your v2 approach, we could have another pfn_to_page() > variant like pfn_to_zone_device_page(), where we check that a given pfn > belongs to the memory section backed by zone memory, then another check if > the pfn has initialized memmap or not, and return NULL if memmap not > initialied. We'll try this approach then, but if you find problems/concerns, > please let me know. Sorry, you already mentioned detail here, https://lore.kernel.org/lkml/c6198acd-8ff7-c40c-cb4e-f0f12f841b38@redhat.com/
On 11.10.19 02:50, Naoya Horiguchi wrote: > On Fri, Oct 11, 2019 at 12:11:25AM +0000, Horiguchi Naoya(堀口 直也) wrote: >> On Thu, Oct 10, 2019 at 09:30:01AM +0200, David Hildenbrand wrote: >>> On 09.10.19 11:57, Naoya Horiguchi wrote: >>>> Hi David, >>>> >>>> On Wed, Oct 09, 2019 at 11:12:04AM +0200, David Hildenbrand wrote: >>>>> There are various places where we access uninitialized memmaps, namely: >>>>> - /proc/kpagecount >>>>> - /proc/kpageflags >>>>> - /proc/kpagecgroup >>>>> - memory_failure() - which reuses stable_page_flags() from fs/proc/page.c >>>> >>>> Ah right, memory_failure is another victim of this bug. >>>> >>>>> >>>>> We have initialized memmaps either when the section is online or when >>>>> the page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain >>>>> garbage and in the worst case trigger kernel BUGs, especially with >>>>> CONFIG_PAGE_POISONING. >>>>> >>>>> For example, not onlining a DIMM during boot and calling /proc/kpagecount >>>>> with CONFIG_PAGE_POISONING: >>>>> :/# cat /proc/kpagecount > tmp.test >>>>> [ 95.600592] BUG: unable to handle page fault for address: fffffffffffffffe >>>>> [ 95.601238] #PF: supervisor read access in kernel mode >>>>> [ 95.601675] #PF: error_code(0x0000) - not-present page >>>>> [ 95.602116] PGD 114616067 P4D 114616067 PUD 114618067 PMD 0 >>>>> [ 95.602596] Oops: 0000 [#1] SMP NOPTI >>>>> [ 95.602920] CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11 >>>>> [ 95.603547] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4 >>>>> [ 95.604521] RIP: 0010:kpagecount_read+0xce/0x1e0 >>>>> [ 95.604917] Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480 >>>>> [ 95.606450] RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202 >>>>> [ 95.606904] RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000 >>>>> [ 95.607519] RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000 >>>>> [ 95.608128] RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000 >>>>> [ 95.608731] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000 >>>>> [ 95.609327] R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08 >>>>> [ 95.609924] FS: 00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000 >>>>> [ 95.610599] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>>> [ 95.611083] CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0 >>>>> [ 95.611686] Call Trace: >>>>> [ 95.611906] proc_reg_read+0x3c/0x60 >>>>> [ 95.612228] vfs_read+0xc5/0x180 >>>>> [ 95.612505] ksys_read+0x68/0xe0 >>>>> [ 95.612785] do_syscall_64+0x5c/0xa0 >>>>> [ 95.613092] entry_SYSCALL_64_after_hwframe+0x49/0xbe >>>>> >>>>> Note that there are still two possible races as far as I can see: >>>>> - pfn_to_online_page() succeeding but the memory getting offlined and >>>>> removed. get_online_mems() could help once we run into this. >>>>> - pfn_zone_device() succeeding but the memmap not being fully >>>>> initialized yet. As the memmap is initialized outside of the memory >>>>> hoptlug lock, get_online_mems() can't help. >>>>> >>>>> Let's keep the existing interfaces working with ZONE_DEVICE memory. We >>>>> can later come back and fix these rare races and eventually speed-up the >>>>> ZONE_DEVICE detection. 
>>>> >>>> Actually, Toshiki is writing code to refactor and optimize the pfn walking >>>> part, where we find the pfn ranges covered by zone devices by running over >>>> xarray pgmap_array and use the range info to reduce pointer dereferences >>>> to speed up pfn walk. I hope he will share it soon. >>> >>> AFAIKT, Michal is not a friend of special-casing PFN walkers in that >>> way. We should have a mechanism to detect if a memmap was initialized >>> without having to go via pgmap, special-casing. See my other mail where >>> I draft one basic approach. >> >> OK, so considering your v2 approach, we could have another pfn_to_page() >> variant like pfn_to_zone_device_page(), where we check that a given pfn >> belongs to the memory section backed by zone memory, then another check if >> the pfn has initialized memmap or not, and return NULL if memmap not >> initialied. We'll try this approach then, but if you find problems/concerns, >> please let me know. > > Sorry, you already mentioned detail here, > https://lore.kernel.org/lkml/c6198acd-8ff7-c40c-cb4e-f0f12f841b38@redhat.com/ > I'm planning on sending a proper writeup of the overall approach and pitfalls maybe next week. Feel free to ping me in case I forget.
diff --git a/fs/proc/page.c b/fs/proc/page.c index decd3fe39674..76502af461e2 100644 --- a/fs/proc/page.c +++ b/fs/proc/page.c @@ -42,7 +42,8 @@ static ssize_t kpagecount_read(struct file *file, char __user *buf, return -EINVAL; while (count > 0) { - if (pfn_valid(pfn)) + if (pfn_valid(pfn) && + (pfn_to_online_page(pfn) || pfn_zone_device(pfn))) ppage = pfn_to_page(pfn); else ppage = NULL; @@ -97,9 +98,6 @@ u64 stable_page_flags(struct page *page) if (!page) return BIT_ULL(KPF_NOPAGE); - if (pfn_zone_device_reserved(page_to_pfn(page))) - return BIT_ULL(KPF_RESERVED); - k = page->flags; u = 0; @@ -218,7 +216,8 @@ static ssize_t kpageflags_read(struct file *file, char __user *buf, return -EINVAL; while (count > 0) { - if (pfn_valid(pfn)) + if (pfn_valid(pfn) && + (pfn_to_online_page(pfn) || pfn_zone_device(pfn))) ppage = pfn_to_page(pfn); else ppage = NULL; @@ -263,7 +262,8 @@ static ssize_t kpagecgroup_read(struct file *file, char __user *buf, return -EINVAL; while (count > 0) { - if (pfn_valid(pfn)) + if (pfn_valid(pfn) && + (pfn_to_online_page(pfn) || pfn_zone_device(pfn))) ppage = pfn_to_page(pfn); else ppage = NULL; diff --git a/include/linux/memremap.h b/include/linux/memremap.h index c676e33205d3..c076bb163c2f 100644 --- a/include/linux/memremap.h +++ b/include/linux/memremap.h @@ -123,7 +123,8 @@ static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap) } #ifdef CONFIG_ZONE_DEVICE -bool pfn_zone_device_reserved(unsigned long pfn); +bool pfn_zone_device(unsigned long pfn); +bool __pfn_zone_device(unsigned long pfn, struct dev_pagemap *pgmap); void *memremap_pages(struct dev_pagemap *pgmap, int nid); void memunmap_pages(struct dev_pagemap *pgmap); void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap); @@ -134,7 +135,13 @@ struct dev_pagemap *get_dev_pagemap(unsigned long pfn, unsigned long vmem_altmap_offset(struct vmem_altmap *altmap); void vmem_altmap_free(struct vmem_altmap *altmap, unsigned long nr_pfns); #else -static inline bool pfn_zone_device_reserved(unsigned long pfn) +static inline bool pfn_zone_device(unsigned long pfn) +{ + return false; +} + +static inline bool __pfn_zone_device(unsigned long pfn, + struct dev_pagemap *pgmap) { return false; } diff --git a/mm/memory-failure.c b/mm/memory-failure.c index 7ef849da8278..2b4cc6b67720 100644 --- a/mm/memory-failure.c +++ b/mm/memory-failure.c @@ -1161,6 +1161,14 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags, loff_t start; dax_entry_t cookie; + /* memmaps of driver reserved memory is not initialized */ + if (!__pfn_zone_device(pfn, pgmap)) { + pr_err("Memory failure: %#lx: driver reserved memory\n", + pfn); + rc = -ENXIO; + goto out; + } + /* * Prevent the inode from being freed while we are interrogating * the address_space, typically this would be handled by @@ -1253,17 +1261,19 @@ int memory_failure(unsigned long pfn, int flags) if (!sysctl_memory_failure_recovery) panic("Memory failure on page %lx", pfn); - if (!pfn_valid(pfn)) { + p = pfn_to_online_page(pfn); + if (!p) { + if (pfn_valid(pfn)) { + pgmap = get_dev_pagemap(pfn, NULL); + if (pgmap) + return memory_failure_dev_pagemap(pfn, flags, + pgmap); + } pr_err("Memory failure: %#lx: memory outside kernel control\n", pfn); return -ENXIO; } - pgmap = get_dev_pagemap(pfn, NULL); - if (pgmap) - return memory_failure_dev_pagemap(pfn, flags, pgmap); - - p = pfn_to_page(pfn); if (PageHuge(p)) return memory_failure_hugetlb(pfn, flags); if (TestSetPageHWPoison(p)) { diff --git a/mm/memremap.c b/mm/memremap.c index 
7fed8bd32a18..9f3bb223aec7 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -73,21 +73,24 @@ static unsigned long pfn_next(unsigned long pfn) return pfn + 1; } +bool __pfn_zone_device(unsigned long pfn, struct dev_pagemap *pgmap) +{ + return pfn >= pfn_first(pgmap) && pfn <= pfn_end(pgmap); +} + /* - * This returns true if the page is reserved by ZONE_DEVICE driver. + * Returns true if the page was initialized to the ZONE_DEVICE (especially, + * is not reserved for driver usage). */ -bool pfn_zone_device_reserved(unsigned long pfn) +bool pfn_zone_device(unsigned long pfn) { struct dev_pagemap *pgmap; - struct vmem_altmap *altmap; - bool ret = false; + bool ret; pgmap = get_dev_pagemap(pfn, NULL); if (!pgmap) - return ret; - altmap = pgmap_altmap(pgmap); - if (altmap && pfn < (altmap->base_pfn + altmap->reserve)) - ret = true; + return false; + ret = __pfn_zone_device(pfn, pgmap); put_dev_pagemap(pgmap); return ret;
There are various places where we access uninitialized memmaps, namely: - /proc/kpagecount - /proc/kpageflags - /proc/kpagecgroup - memory_failure() - which reuses stable_page_flags() from fs/proc/page.c We have initialized memmaps either when the section is online or when the page was initialized to the ZONE_DEVICE. Uninitialized memmaps contain garbage and in the worst case trigger kernel BUGs, especially with CONFIG_PAGE_POISONING. For example, not onlining a DIMM during boot and calling /proc/kpagecount with CONFIG_PAGE_POISONING: :/# cat /proc/kpagecount > tmp.test [ 95.600592] BUG: unable to handle page fault for address: fffffffffffffffe [ 95.601238] #PF: supervisor read access in kernel mode [ 95.601675] #PF: error_code(0x0000) - not-present page [ 95.602116] PGD 114616067 P4D 114616067 PUD 114618067 PMD 0 [ 95.602596] Oops: 0000 [#1] SMP NOPTI [ 95.602920] CPU: 0 PID: 469 Comm: cat Not tainted 5.4.0-rc1-next-20191004+ #11 [ 95.603547] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.4 [ 95.604521] RIP: 0010:kpagecount_read+0xce/0x1e0 [ 95.604917] Code: e8 09 83 e0 3f 48 0f a3 02 73 2d 4c 89 e7 48 c1 e7 06 48 03 3d ab 51 01 01 74 1d 48 8b 57 08 480 [ 95.606450] RSP: 0018:ffffa14e409b7e78 EFLAGS: 00010202 [ 95.606904] RAX: fffffffffffffffe RBX: 0000000000020000 RCX: 0000000000000000 [ 95.607519] RDX: 0000000000000001 RSI: 00007f76b5595000 RDI: fffff35645000000 [ 95.608128] RBP: 00007f76b5595000 R08: 0000000000000001 R09: 0000000000000000 [ 95.608731] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000140000 [ 95.609327] R13: 0000000000020000 R14: 00007f76b5595000 R15: ffffa14e409b7f08 [ 95.609924] FS: 00007f76b577d580(0000) GS:ffff8f41bd400000(0000) knlGS:0000000000000000 [ 95.610599] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 95.611083] CR2: fffffffffffffffe CR3: 0000000078960000 CR4: 00000000000006f0 [ 95.611686] Call Trace: [ 95.611906] proc_reg_read+0x3c/0x60 [ 95.612228] vfs_read+0xc5/0x180 [ 95.612505] ksys_read+0x68/0xe0 [ 95.612785] do_syscall_64+0x5c/0xa0 [ 95.613092] entry_SYSCALL_64_after_hwframe+0x49/0xbe Note that there are still two possible races as far as I can see: - pfn_to_online_page() succeeding but the memory getting offlined and removed. get_online_mems() could help once we run into this. - pfn_zone_device() succeeding but the memmap not being fully initialized yet. As the memmap is initialized outside of the memory hoptlug lock, get_online_mems() can't help. Let's keep the existing interfaces working with ZONE_DEVICE memory. We can later come back and fix these rare races and eventually speed-up the ZONE_DEVICE detection. This patch now also makes sure we don't dump data about memory blocks that are already offline again. 
Reported-by: Qian Cai <cai@lca.pw> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Michal Hocko <mhocko@kernel.org> Cc: Toshiki Fukasawa <t-fukasawa@vx.jp.nec.com> Cc: David Hildenbrand <david@redhat.com> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Logan Gunthorpe <logang@deltatee.com> Cc: Ira Weiny <ira.weiny@intel.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: David Hildenbrand <david@redhat.com> --- fs/proc/page.c | 12 ++++++------ include/linux/memremap.h | 11 +++++++++-- mm/memory-failure.c | 22 ++++++++++++++++------ mm/memremap.c | 19 +++++++++++-------- 4 files changed, 42 insertions(+), 22 deletions(-)
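For completeness, the /proc files touched by this patch are read from userspace by seeking to pfn * 8 and fetching one 64-bit entry per pfn. A small stand-alone example reader of /proc/kpagecount (not part of the patch; it assumes root privileges and a valid pfn argument):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
            unsigned long pfn;
            uint64_t count;
            int fd;

            if (argc != 2) {
                    fprintf(stderr, "usage: %s <pfn>\n", argv[0]);
                    return 1;
            }
            pfn = strtoul(argv[1], NULL, 0);

            fd = open("/proc/kpagecount", O_RDONLY);
            if (fd < 0) {
                    perror("open /proc/kpagecount");
                    return 1;
            }

            /* One 64-bit mapcount per pfn, indexed by pfn. */
            if (pread(fd, &count, sizeof(count), (off_t)pfn * sizeof(count)) !=
                (ssize_t)sizeof(count)) {
                    perror("pread");
                    close(fd);
                    return 1;
            }

            printf("pfn %#lx: mapcount %llu\n", pfn, (unsigned long long)count);
            close(fd);
            return 0;
    }

With the checks discussed in this thread in place, a pfn whose memmap is not initialized simply reads back as zero here instead of triggering the oops quoted above.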