KASAN vs ZONE_DEVICE (was: Re: [PATCH v2 2/7] dax: change bdev_dax_supported()...)

Message ID 71c55f5c-deed-b307-9022-8a41dd898822@virtuozzo.com
State New, archived

Commit Message

Andrey Ryabinin June 5, 2018, 2:01 p.m. UTC
On 06/05/2018 07:22 AM, Dan Williams wrote:
> On Mon, Jun 4, 2018 at 8:32 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>> [ adding KASAN devs...]
>>
>> On Mon, Jun 4, 2018 at 4:40 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>> On Sun, Jun 3, 2018 at 6:48 PM, Dan Williams <dan.j.williams@intel.com> wrote:
>>>> On Sun, Jun 3, 2018 at 5:25 PM, Dave Chinner <david@fromorbit.com> wrote:
>>>>> On Mon, Jun 04, 2018 at 08:20:38AM +1000, Dave Chinner wrote:
>>>>>> On Thu, May 31, 2018 at 09:02:52PM -0700, Dan Williams wrote:
>>>>>>> On Thu, May 31, 2018 at 7:24 PM, Dave Chinner <david@fromorbit.com> wrote:
>>>>>>>> On Thu, May 31, 2018 at 06:57:33PM -0700, Dan Williams wrote:
>>>>>>>>>> FWIW, XFS+DAX used to just work on this setup (I hadn't even
>>>>>>>>>> installed ndctl until this morning!) but after changing the kernel
>>>>>>>>>> it no longer works. That would make it a regression, yes?
>>>>>>
>>>>>> [....]
>>>>>>
>>>>>>>>> I suspect your kernel does not have CONFIG_ZONE_DEVICE enabled which
>>>>>>>>> has the following dependencies:
>>>>>>>>>
>>>>>>>>>         depends on MEMORY_HOTPLUG
>>>>>>>>>         depends on MEMORY_HOTREMOVE
>>>>>>>>>         depends on SPARSEMEM_VMEMMAP
>>>>>>>>
>>>>>>>> Filesystem DAX now has a dependency on memory hotplug?
>>>>>>
>>>>>> [....]
>>>>>>
>>>>>>>> OK, works now I've found the magic config incantations to turn
>>>>>>>> everything I now need on.
>>>>>>
>>>>>> By enabling these options, my test VM now has a ~30s pause in the
>>>>>> boot very soon after the nvdimm subsystem is initialised.
>>>>>>
>>>>>> [    1.523718] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
>>>>>> [    1.550353] 00:05: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
>>>>>> [    1.552175] Non-volatile memory driver v1.3
>>>>>> [    2.332045] tsc: Refined TSC clocksource calibration: 2199.909 MHz
>>>>>> [    2.333280] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x1fb5dcd4620, max_idle_ns: 440795264143 ns
>>>>>> [   37.217453] brd: module loaded
>>>>>> [   37.225423] loop: module loaded
>>>>>> [   37.228441] virtio_blk virtio2: [vda] 10485760 512-byte logical blocks (5.37 GB/5.00 GiB)
>>>>>> [   37.245418] virtio_blk virtio3: [vdb] 146800640 512-byte logical blocks (75.2 GB/70.0 GiB)
>>>>>> [   37.255794] virtio_blk virtio4: [vdc] 1073741824000 512-byte logical blocks (550 TB/500 TiB)
>>>>>> [   37.265403] nd_pmem namespace1.0: unable to guarantee persistence of writes
>>>>>> [   37.265618] nd_pmem namespace0.0: unable to guarantee persistence of writes
>>>>>>
>>>>>> The system does not appear to be consuming CPU, but it is blocking
>>>>>> NMIs so I can't get a CPU trace. For a VM that I rely on booting in
>>>>>> a few seconds because I reboot it tens of times a day, this is a
>>>>>> problem....
>>>>>
>>>>> And when I turn on KASAN, the kernel fails to boot to a login prompt
>>>>> because:
>>>>
>>>> What's your qemu and kernel command line? I'll take a look at this first
>>>> thing tomorrow.
>>>
>>> I was able to reproduce this crash by just turning on KASAN...
>>> investigating. It would still help to have your config for our own
>>> regression testing purposes; it makes sense for us to prioritize
>>> "Dave's test config", similar to the priority of not breaking Linus'
>>> laptop.
>>
>> I believe this is a bug in KASAN, or a bug in devm_memremap_pages(),
>> depending on your point of view. At the very least it is a mismatch of
>> assumptions. KASAN learns of hot added memory via the memory hotplug
>> notifier. However, the devm_memremap_pages() implementation is
>> intentionally limited to the "first half" of the memory hotplug
>> procedure. I.e. it does just enough to set up the linear map for
>> pfn_to_page() and initialize the "struct page" memmap, but then stops
>> short of onlining the pages. This is why we are getting a NULL ptr
>> deref and not a KASAN report, because KASAN has no shadow area set up
>> for the linearly mapped pmem range.
>>
>> In terms of solving it we could refactor kasan_mem_notifier() so that
>> devm_memremap_pages() can call it outside of the notifier... I'll give
>> this a shot.
> 
> Well, the attached patch got me slightly further, but only slightly...
> 
> [   14.998394] BUG: KASAN: unknown-crash in pmem_do_bvec+0x19e/0x790 [nd_pmem]
> [   15.000006] Read of size 4096 at addr ffff880200000000 by task systemd-udevd/915
> [   15.001991]
> [   15.002590] CPU: 15 PID: 915 Comm: systemd-udevd Tainted: G           OE     4.17.0-rc5+ #1982
> [   15.004783] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
> [   15.007652] Call Trace:
> [   15.008339]  dump_stack+0x9a/0xeb
> [   15.009344]  print_address_description+0x73/0x280
> [   15.010524]  kasan_report+0x258/0x380
> [   15.011528]  ? pmem_do_bvec+0x19e/0x790 [nd_pmem]
> [   15.012747]  memcpy+0x1f/0x50
> [   15.013659]  pmem_do_bvec+0x19e/0x790 [nd_pmem]
> 
> ...I've exhausted my limited kasan internals knowledge, any ideas what
> it's missing?
> 

Initialization is missing. kasan_mem_notifier() doesn't initialize the shadow because
it expects kasan_free_pages()/kasan_alloc_pages() to do that when pages are allocated/freed.

So adding memset(shadow_start, 0, shadow_size); will make this work.
But we shouldn't use kasan_mem_notifier here, as that would mean wasting a lot of memory only
to store zeroes.
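
To put a rough number on that waste: generic KASAN keeps one shadow byte per
8 bytes of memory (KASAN_SHADOW_SCALE_SHIFT is 3), so fully backed shadow costs
1/8 of the range it covers. The userspace sketch below only models that
arithmetic; the helper names and the 4 KiB page size are assumptions made for
illustration, not kernel code.

```c
/* Userspace sketch only: models the shadow-size arithmetic, not kernel code. */
#include <assert.h>
#include <stdint.h>

/* Generic KASAN tracks 8 bytes of memory with 1 shadow byte. */
#define KASAN_SHADOW_SCALE_SHIFT 3
#define SKETCH_PAGE_SIZE 4096ULL /* assumption: 4 KiB pages */

/* Hypothetical helper: bytes of shadow needed to cover `size` bytes. */
uint64_t sketch_shadow_bytes(uint64_t size)
{
	return size >> KASAN_SHADOW_SCALE_SHIFT;
}

/*
 * Hypothetical helper: pages consumed if every shadow page is backed by
 * its own physical page (rounded up to whole pages).
 */
uint64_t sketch_real_shadow_pages(uint64_t size)
{
	return (sketch_shadow_bytes(size) + SKETCH_PAGE_SIZE - 1) /
	       SKETCH_PAGE_SIZE;
}
```

Under these assumptions a 16 GiB pmem range would need 2 GiB of real shadow
(524288 pages), all of it holding zeroes.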

A better solution would be to map kasan_zero_page into the shadow.
The draft patch below demonstrates the idea (build tested only).
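
The address arithmetic behind this can be sketched in userspace with two small
helpers: one translating an address to its shadow address, and one checking
that a range is aligned so that its shadow covers whole pages. The shadow
offset used here is the usual x86_64 value and the page size is 4 KiB; both are
assumptions for illustration, as are the sketch_* names.

```c
/* Userspace sketch only: models the address arithmetic, not kernel code. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define KASAN_SHADOW_SCALE_SHIFT 3
#define KASAN_SHADOW_SCALE_SIZE (1ULL << KASAN_SHADOW_SCALE_SHIFT)
#define SKETCH_PAGE_SIZE 4096ULL                   /* assumption: 4 KiB pages */
#define SKETCH_SHADOW_OFFSET 0xdffffc0000000000ULL /* assumption: x86_64 */

/* Each shadow byte describes 8 bytes of memory, hence the shift. */
uint64_t sketch_mem_to_shadow(uint64_t addr)
{
	return (addr >> KASAN_SHADOW_SCALE_SHIFT) + SKETCH_SHADOW_OFFSET;
}

/*
 * One shadow page describes KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE bytes of
 * memory. Only ranges aligned to that granule have a shadow made of whole
 * pages, which is what allows each of them to map the shared zero page.
 */
bool sketch_range_ok(uint64_t start, uint64_t size)
{
	uint64_t granule = KASAN_SHADOW_SCALE_SIZE * SKETCH_PAGE_SIZE;

	return (start % granule) == 0 && (size % granule) == 0;
}
```

With these assumptions, the address from the report above, ffff880200000000,
translates to shadow address ffffed0040000000, which is page aligned.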


---
 include/linux/kasan.h | 14 ++++++++++++++
 kernel/memremap.c     | 10 ++++++++++
 mm/kasan/kasan_init.c | 46 ++++++++++++++++++++++++++++++++++++----------
 3 files changed, 60 insertions(+), 10 deletions(-)

Comments

Dan Williams June 5, 2018, 7:10 p.m. UTC | #1
On Tue, Jun 5, 2018 at 7:01 AM, Andrey Ryabinin <aryabinin@virtuozzo.com> wrote:
>
>
> Initialization is missing. kasan_mem_notifier() doesn't initialize the shadow because
> it expects kasan_free_pages()/kasan_alloc_pages() to do that when pages are allocated/freed.
>
> So adding memset(shadow_start, 0, shadow_size); will make this work.
> But we shouldn't use kasan_mem_notifier here, as that would mean wasting a lot of memory only
> to store zeroes.
>
> A better solution would be to map kasan_zero_page into the shadow.
> The draft patch below demonstrates the idea (build tested only).
>
>
> ---
>  include/linux/kasan.h | 14 ++++++++++++++
>  kernel/memremap.c     | 10 ++++++++++
>  mm/kasan/kasan_init.c | 46 ++++++++++++++++++++++++++++++++++++----------
>  3 files changed, 60 insertions(+), 10 deletions(-)


Thank you! This RFC patch works for me. For now we don't necessarily
need kasan_remove_zero_shadow(), but in the future we might
dynamically switch the same physical address between being mapped by
devm_memremap_pages() and traditional memory hotplug.

Patch

diff --git a/include/linux/kasan.h b/include/linux/kasan.h
index de784fd11d12..b5f5d2d9e46f 100644
--- a/include/linux/kasan.h
+++ b/include/linux/kasan.h
@@ -71,6 +71,10 @@  struct kasan_cache {
 int kasan_module_alloc(void *addr, size_t size);
 void kasan_free_shadow(const struct vm_struct *vm);
 
+int kasan_add_zero_shadow(unsigned long start, unsigned long size);
+void kasan_remove_zero_shadow(unsigned long start, unsigned long size);
+
+
 size_t ksize(const void *);
 static inline void kasan_unpoison_slab(const void *ptr) { ksize(ptr); }
 size_t kasan_metadata_size(struct kmem_cache *cache);
@@ -124,6 +128,16 @@  static inline bool kasan_slab_free(struct kmem_cache *s, void *object,
 static inline int kasan_module_alloc(void *addr, size_t size) { return 0; }
 static inline void kasan_free_shadow(const struct vm_struct *vm) {}
 
+static inline int kasan_add_zero_shadow(unsigned long start, unsigned long size)
+{
+	return 0;
+}
+static inline void kasan_remove_zero_shadow(unsigned long start,
+					unsigned long size)
+{
+	/* Nothing to undo when KASAN is disabled; matches the void prototype. */
+}
+
 static inline void kasan_unpoison_slab(const void *ptr) { }
 static inline size_t kasan_metadata_size(struct kmem_cache *cache) { return 0; }
 
diff --git a/kernel/memremap.c b/kernel/memremap.c
index 895e6b76b25e..1524dda52667 100644
--- a/kernel/memremap.c
+++ b/kernel/memremap.c
@@ -15,6 +15,7 @@ 
 #include <linux/types.h>
 #include <linux/pfn_t.h>
 #include <linux/io.h>
+#include <linux/kasan.h>
 #include <linux/mm.h>
 #include <linux/memory_hotplug.h>
 #include <linux/swap.h>
@@ -309,6 +310,7 @@  static void devm_memremap_pages_release(void *data)
 	mem_hotplug_begin();
 	arch_remove_memory(align_start, align_size, pgmap->altmap_valid ?
 			&pgmap->altmap : NULL);
+	kasan_remove_zero_shadow((unsigned long)__va(align_start), align_size);
 	mem_hotplug_done();
 
 	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
@@ -395,6 +397,12 @@  void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 		goto err_pfn_remap;
 
 	mem_hotplug_begin();
+	error = kasan_add_zero_shadow((unsigned long)__va(align_start), align_size);
+	if (error) {
+		mem_hotplug_done();
+		goto err_kasan;
+	}
+
 	error = arch_add_memory(nid, align_start, align_size, altmap, false);
 	if (!error)
 		move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
@@ -423,6 +431,8 @@  void *devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 	return __va(res->start);
 
  err_add_memory:
+	kasan_remove_zero_shadow((unsigned long)__va(align_start), align_size);
+ err_kasan:
 	untrack_pfn(NULL, PHYS_PFN(align_start), align_size);
  err_pfn_remap:
  err_radix:
diff --git a/mm/kasan/kasan_init.c b/mm/kasan/kasan_init.c
index f436246ccc79..160d35d28e62 100644
--- a/mm/kasan/kasan_init.c
+++ b/mm/kasan/kasan_init.c
@@ -21,6 +21,8 @@ 
 #include <asm/page.h>
 #include <asm/pgalloc.h>
 
+#include "kasan.h"
+
 /*
  * This page serves two purposes:
  *   - It used as early shadow memory. The entire shadow region populated
@@ -41,13 +43,16 @@  pmd_t kasan_zero_pmd[PTRS_PER_PMD] __page_aligned_bss;
 #endif
 pte_t kasan_zero_pte[PTRS_PER_PTE] __page_aligned_bss;
 
-static __init void *early_alloc(size_t size, int node)
+static void *kasan_alloc(size_t size, int node)
 {
+	if (slab_is_available())
+		return (void *)get_zeroed_page(GFP_KERNEL | __GFP_NOFAIL);
+
 	return memblock_virt_alloc_try_nid(size, size, __pa(MAX_DMA_ADDRESS),
 					BOOTMEM_ALLOC_ACCESSIBLE, node);
 }
 
-static void __init zero_pte_populate(pmd_t *pmd, unsigned long addr,
+static void __ref zero_pte_populate(pmd_t *pmd, unsigned long addr,
 				unsigned long end)
 {
 	pte_t *pte = pte_offset_kernel(pmd, addr);
@@ -63,7 +68,7 @@  static void __init zero_pte_populate(pmd_t *pmd, unsigned long addr,
 	}
 }
 
-static void __init zero_pmd_populate(pud_t *pud, unsigned long addr,
+static void __ref zero_pmd_populate(pud_t *pud, unsigned long addr,
 				unsigned long end)
 {
 	pmd_t *pmd = pmd_offset(pud, addr);
@@ -79,13 +84,13 @@  static void __init zero_pmd_populate(pud_t *pud, unsigned long addr,
 
 		if (pmd_none(*pmd)) {
 			pmd_populate_kernel(&init_mm, pmd,
-					early_alloc(PAGE_SIZE, NUMA_NO_NODE));
+					kasan_alloc(PAGE_SIZE, NUMA_NO_NODE));
 		}
 		zero_pte_populate(pmd, addr, next);
 	} while (pmd++, addr = next, addr != end);
 }
 
-static void __init zero_pud_populate(p4d_t *p4d, unsigned long addr,
+static void __ref zero_pud_populate(p4d_t *p4d, unsigned long addr,
 				unsigned long end)
 {
 	pud_t *pud = pud_offset(p4d, addr);
@@ -104,13 +109,13 @@  static void __init zero_pud_populate(p4d_t *p4d, unsigned long addr,
 
 		if (pud_none(*pud)) {
 			pud_populate(&init_mm, pud,
-				early_alloc(PAGE_SIZE, NUMA_NO_NODE));
+				kasan_alloc(PAGE_SIZE, NUMA_NO_NODE));
 		}
 		zero_pmd_populate(pud, addr, next);
 	} while (pud++, addr = next, addr != end);
 }
 
-static void __init zero_p4d_populate(pgd_t *pgd, unsigned long addr,
+static void __ref zero_p4d_populate(pgd_t *pgd, unsigned long addr,
 				unsigned long end)
 {
 	p4d_t *p4d = p4d_offset(pgd, addr);
@@ -133,7 +138,7 @@  static void __init zero_p4d_populate(pgd_t *pgd, unsigned long addr,
 
 		if (p4d_none(*p4d)) {
 			p4d_populate(&init_mm, p4d,
-				early_alloc(PAGE_SIZE, NUMA_NO_NODE));
+				kasan_alloc(PAGE_SIZE, NUMA_NO_NODE));
 		}
 		zero_pud_populate(p4d, addr, next);
 	} while (p4d++, addr = next, addr != end);
@@ -145,7 +150,7 @@  static void __init zero_p4d_populate(pgd_t *pgd, unsigned long addr,
  * @shadow_start - start of the memory range to populate
  * @shadow_end   - end of the memory range to populate
  */
-void __init kasan_populate_zero_shadow(const void *shadow_start,
+void __ref kasan_populate_zero_shadow(const void *shadow_start,
 				const void *shadow_end)
 {
 	unsigned long addr = (unsigned long)shadow_start;
@@ -192,8 +197,29 @@  void __init kasan_populate_zero_shadow(const void *shadow_start,
 
 		if (pgd_none(*pgd)) {
 			pgd_populate(&init_mm, pgd,
-				early_alloc(PAGE_SIZE, NUMA_NO_NODE));
+				kasan_alloc(PAGE_SIZE, NUMA_NO_NODE));
 		}
 		zero_p4d_populate(pgd, addr, next);
 	} while (pgd++, addr = next, addr != end);
 }
+
+int kasan_add_zero_shadow(unsigned long start, unsigned long size)
+{
+	unsigned long shadow_start, shadow_end;
+
+	shadow_start = (unsigned long)kasan_mem_to_shadow((void *)start);
+	shadow_end = shadow_start + (size >> KASAN_SHADOW_SCALE_SHIFT);
+
+	if (WARN_ON(start % (KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE)) ||
+	    WARN_ON(size % (KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE)))
+		return -EINVAL;
+
+	kasan_populate_zero_shadow((void *)shadow_start,
+				(void *)shadow_end);
+	return 0;
+}
+
+void kasan_remove_zero_shadow(unsigned long start, unsigned long size)
+{
+	/* TODO */
+}