Message ID | 20200902114222.181353-1-aneesh.kumar@linux.ibm.com (mailing list archive) |
---|---|
Headers | show |
Series | mm/debug_vm_pgtable fixes | expand |
On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote: > This patch series includes fixes for debug_vm_pgtable test code so that > they follow page table updates rules correctly. The first two patches introduce > changes w.r.t ppc64. The patches are included in this series for completeness. We can > merge them via ppc64 tree if required. > > Hugetlb test is disabled on ppc64 because that needs larger change to satisfy > page table update rules. > > These tests are broken w.r.t page table update rules and results in kernel > crash as below. > > [ 21.083519] kernel BUG at arch/powerpc/mm/pgtable.c:304! > cpu 0x0: Vector: 700 (Program Check) at [c000000c6d1e76c0] > pc: c00000000009a5ec: assert_pte_locked+0x14c/0x380 > lr: c0000000005eeeec: pte_update+0x11c/0x190 > sp: c000000c6d1e7950 > msr: 8000000002029033 > current = 0xc000000c6d172c80 > paca = 0xc000000003ba0000 irqmask: 0x03 irq_happened: 0x01 > pid = 1, comm = swapper/0 > kernel BUG at arch/powerpc/mm/pgtable.c:304! > [link register ] c0000000005eeeec pte_update+0x11c/0x190 > [c000000c6d1e7950] 0000000000000001 (unreliable) > [c000000c6d1e79b0] c0000000005eee14 pte_update+0x44/0x190 > [c000000c6d1e7a10] c000000001a2ca9c pte_advanced_tests+0x160/0x3d8 > [c000000c6d1e7ab0] c000000001a2d4fc debug_vm_pgtable+0x7e8/0x1338 > [c000000c6d1e7ba0] c0000000000116ec do_one_initcall+0xac/0x5f0 > [c000000c6d1e7c80] c0000000019e4fac kernel_init_freeable+0x4dc/0x5a4 > [c000000c6d1e7db0] c000000000012474 kernel_init+0x24/0x160 > [c000000c6d1e7e20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c > > With DEBUG_VM disabled > > [ 20.530152] BUG: Kernel NULL pointer dereference on read at 0x00000000 > [ 20.530183] Faulting instruction address: 0xc0000000000df330 > cpu 0x33: Vector: 380 (Data SLB Access) at [c000000c6d19f700] > pc: c0000000000df330: memset+0x68/0x104 > lr: c00000000009f6d8: hash__pmdp_huge_get_and_clear+0xe8/0x1b0 > sp: c000000c6d19f990 > msr: 8000000002009033 > dar: 0 > current = 0xc000000c6d177480 > paca = 0xc00000001ec4f400 irqmask: 0x03 irq_happened: 0x01 > pid = 1, comm = swapper/0 > [link register ] c00000000009f6d8 hash__pmdp_huge_get_and_clear+0xe8/0x1b0 > [c000000c6d19f990] c00000000009f748 hash__pmdp_huge_get_and_clear+0x158/0x1b0 (unreliable) > [c000000c6d19fa10] c0000000019ebf30 pmd_advanced_tests+0x1f0/0x378 > [c000000c6d19fab0] c0000000019ed088 debug_vm_pgtable+0x79c/0x1244 > [c000000c6d19fba0] c0000000000116ec do_one_initcall+0xac/0x5f0 > [c000000c6d19fc80] c0000000019a4fac kernel_init_freeable+0x4dc/0x5a4 > [c000000c6d19fdb0] c000000000012474 kernel_init+0x24/0x160 > [c000000c6d19fe20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c > > Changes from v3: > * Address review feedback > * Move page table depost and withdraw patch after adding pmdlock to avoid bisect failure. This version - Builds on x86, arm64, s390, arc, powerpc and riscv (defconfig with DEBUG_VM_PGTABLE) - Runs on arm64 and x86 without any regression, atleast nothing that I have noticed - Will be great if this could get tested on s390, arc, riscv, ppc32 platforms as well + linux-riscv <linux-riscv@lists.infradead.org> + linux-snps-arc@lists.infradead.org <linux-snps-arc@lists.infradead.org> + linux-s390@vger.kernel.org + Gerald Schaefer <gerald.schaefer@de.ibm.com> + Vineet Gupta <vgupta@synopsys.com> There is still an open git bisect issue on arm64 platform which ideally should be fixed. - Anshuman
On Fri, 4 Sep 2020 12:18:05 +0530 Anshuman Khandual <anshuman.khandual@arm.com> wrote: > > > On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote: > > This patch series includes fixes for debug_vm_pgtable test code so that > > they follow page table updates rules correctly. The first two patches introduce > > changes w.r.t ppc64. The patches are included in this series for completeness. We can > > merge them via ppc64 tree if required. > > > > Hugetlb test is disabled on ppc64 because that needs larger change to satisfy > > page table update rules. > > > > These tests are broken w.r.t page table update rules and results in kernel > > crash as below. > > > > [ 21.083519] kernel BUG at arch/powerpc/mm/pgtable.c:304! > > cpu 0x0: Vector: 700 (Program Check) at [c000000c6d1e76c0] > > pc: c00000000009a5ec: assert_pte_locked+0x14c/0x380 > > lr: c0000000005eeeec: pte_update+0x11c/0x190 > > sp: c000000c6d1e7950 > > msr: 8000000002029033 > > current = 0xc000000c6d172c80 > > paca = 0xc000000003ba0000 irqmask: 0x03 irq_happened: 0x01 > > pid = 1, comm = swapper/0 > > kernel BUG at arch/powerpc/mm/pgtable.c:304! > > [link register ] c0000000005eeeec pte_update+0x11c/0x190 > > [c000000c6d1e7950] 0000000000000001 (unreliable) > > [c000000c6d1e79b0] c0000000005eee14 pte_update+0x44/0x190 > > [c000000c6d1e7a10] c000000001a2ca9c pte_advanced_tests+0x160/0x3d8 > > [c000000c6d1e7ab0] c000000001a2d4fc debug_vm_pgtable+0x7e8/0x1338 > > [c000000c6d1e7ba0] c0000000000116ec do_one_initcall+0xac/0x5f0 > > [c000000c6d1e7c80] c0000000019e4fac kernel_init_freeable+0x4dc/0x5a4 > > [c000000c6d1e7db0] c000000000012474 kernel_init+0x24/0x160 > > [c000000c6d1e7e20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c > > > > With DEBUG_VM disabled > > > > [ 20.530152] BUG: Kernel NULL pointer dereference on read at 0x00000000 > > [ 20.530183] Faulting instruction address: 0xc0000000000df330 > > cpu 0x33: Vector: 380 (Data SLB Access) at [c000000c6d19f700] > > pc: c0000000000df330: memset+0x68/0x104 > > lr: c00000000009f6d8: hash__pmdp_huge_get_and_clear+0xe8/0x1b0 > > sp: c000000c6d19f990 > > msr: 8000000002009033 > > dar: 0 > > current = 0xc000000c6d177480 > > paca = 0xc00000001ec4f400 irqmask: 0x03 irq_happened: 0x01 > > pid = 1, comm = swapper/0 > > [link register ] c00000000009f6d8 hash__pmdp_huge_get_and_clear+0xe8/0x1b0 > > [c000000c6d19f990] c00000000009f748 hash__pmdp_huge_get_and_clear+0x158/0x1b0 (unreliable) > > [c000000c6d19fa10] c0000000019ebf30 pmd_advanced_tests+0x1f0/0x378 > > [c000000c6d19fab0] c0000000019ed088 debug_vm_pgtable+0x79c/0x1244 > > [c000000c6d19fba0] c0000000000116ec do_one_initcall+0xac/0x5f0 > > [c000000c6d19fc80] c0000000019a4fac kernel_init_freeable+0x4dc/0x5a4 > > [c000000c6d19fdb0] c000000000012474 kernel_init+0x24/0x160 > > [c000000c6d19fe20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c > > > > Changes from v3: > > * Address review feedback > > * Move page table depost and withdraw patch after adding pmdlock to avoid bisect failure. > > This version > > - Builds on x86, arm64, s390, arc, powerpc and riscv (defconfig with DEBUG_VM_PGTABLE) > - Runs on arm64 and x86 without any regression, atleast nothing that I have noticed > - Will be great if this could get tested on s390, arc, riscv, ppc32 platforms as well When I quickly tested v3, it worked fine, but now it turned out to only work fine "sometimes", both v3 and v4. I need to look into it further, but so far it seems related to the hugetlb_advanced_tests(). I guess there was already some discussion on this test, but we did not receive all of the thread(s). Please always add at least linux-s390@vger.kernel.org and maybe myself and Vasily Gorbik <gor@linux.ibm.com> for further discussions. That being said, sorry for duplications, this might already have been discussed. Preliminary analysis showed that it only seems to go wrong for certain random vaddr values. I cannot make any sense of that yet, but what seems strange to me is that the hugetlb_advanced_tests() take a (real) pte_t pointer as input, and also use that for all kinds of operations (set_huge_pte_at, huge_ptep_get_and_clear, etc.). Although all the hugetlb code in the kernel is (mis)using pte_t pointers instead of the correct pmd/pud_t pointers like THP, that is just for historic reasons. The pointers will actually never point to a real pte_t (i.e. page table entry), but of course to a pmd or pud entry, depending on hugepage size. What is passed in as ptep to hugetlb_advanced_tests() seems to be the result from the previous ptep = pte_alloc_map(mm, pmdp, vaddr), so I would expect that it points to a real page table entry. Need to investigate further, but IIUC, using such a pointer for adding large pte entries (i.e. pmd/pud entries) at least feels very wrong to me, and I assume it is related to the issues we see on s390. We actually see different issues, e.g. once a panic directly in hugetlb_advanced_tests() -> huge_ptep_get_and_clear(), but also indirect symptoms after debug_vm_pgtable() completes, like this: [ 10.533901] BUG task_struct (Not tainted): Padding overwritten. 0x0000000019f798c7-0x0000000019f798c7 @offset=30087 Last but not least, what I said about the pte vs. pmd/pud of course also should apply to the hugetlb_basic_tests(), although they are not directly using a pte_t pointer, and especially also not writing to it. Still, the pte_aligned pfn parameter is not guaranteed to also be pmd/pud_aligned, which doesn't feel right. So, for now, until this is sorted out, I guess we also need to exclude s390 at least from the hugetlb_advanced_tests(). The hugetlb_basic_tests() seem to work fine so far (probably by chance :-))
On Fri, 4 Sep 2020 17:26:47 +0200 Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote: > On Fri, 4 Sep 2020 12:18:05 +0530 > Anshuman Khandual <anshuman.khandual@arm.com> wrote: > > > > > > > On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote: > > > This patch series includes fixes for debug_vm_pgtable test code so that > > > they follow page table updates rules correctly. The first two patches introduce > > > changes w.r.t ppc64. The patches are included in this series for completeness. We can > > > merge them via ppc64 tree if required. > > > > > > Hugetlb test is disabled on ppc64 because that needs larger change to satisfy > > > page table update rules. > > > > > > These tests are broken w.r.t page table update rules and results in kernel > > > crash as below. > > > > > > [ 21.083519] kernel BUG at arch/powerpc/mm/pgtable.c:304! > > > cpu 0x0: Vector: 700 (Program Check) at [c000000c6d1e76c0] > > > pc: c00000000009a5ec: assert_pte_locked+0x14c/0x380 > > > lr: c0000000005eeeec: pte_update+0x11c/0x190 > > > sp: c000000c6d1e7950 > > > msr: 8000000002029033 > > > current = 0xc000000c6d172c80 > > > paca = 0xc000000003ba0000 irqmask: 0x03 irq_happened: 0x01 > > > pid = 1, comm = swapper/0 > > > kernel BUG at arch/powerpc/mm/pgtable.c:304! > > > [link register ] c0000000005eeeec pte_update+0x11c/0x190 > > > [c000000c6d1e7950] 0000000000000001 (unreliable) > > > [c000000c6d1e79b0] c0000000005eee14 pte_update+0x44/0x190 > > > [c000000c6d1e7a10] c000000001a2ca9c pte_advanced_tests+0x160/0x3d8 > > > [c000000c6d1e7ab0] c000000001a2d4fc debug_vm_pgtable+0x7e8/0x1338 > > > [c000000c6d1e7ba0] c0000000000116ec do_one_initcall+0xac/0x5f0 > > > [c000000c6d1e7c80] c0000000019e4fac kernel_init_freeable+0x4dc/0x5a4 > > > [c000000c6d1e7db0] c000000000012474 kernel_init+0x24/0x160 > > > [c000000c6d1e7e20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c > > > > > > With DEBUG_VM disabled > > > > > > [ 20.530152] BUG: Kernel NULL pointer dereference on read at 0x00000000 > > > [ 20.530183] Faulting instruction address: 0xc0000000000df330 > > > cpu 0x33: Vector: 380 (Data SLB Access) at [c000000c6d19f700] > > > pc: c0000000000df330: memset+0x68/0x104 > > > lr: c00000000009f6d8: hash__pmdp_huge_get_and_clear+0xe8/0x1b0 > > > sp: c000000c6d19f990 > > > msr: 8000000002009033 > > > dar: 0 > > > current = 0xc000000c6d177480 > > > paca = 0xc00000001ec4f400 irqmask: 0x03 irq_happened: 0x01 > > > pid = 1, comm = swapper/0 > > > [link register ] c00000000009f6d8 hash__pmdp_huge_get_and_clear+0xe8/0x1b0 > > > [c000000c6d19f990] c00000000009f748 hash__pmdp_huge_get_and_clear+0x158/0x1b0 (unreliable) > > > [c000000c6d19fa10] c0000000019ebf30 pmd_advanced_tests+0x1f0/0x378 > > > [c000000c6d19fab0] c0000000019ed088 debug_vm_pgtable+0x79c/0x1244 > > > [c000000c6d19fba0] c0000000000116ec do_one_initcall+0xac/0x5f0 > > > [c000000c6d19fc80] c0000000019a4fac kernel_init_freeable+0x4dc/0x5a4 > > > [c000000c6d19fdb0] c000000000012474 kernel_init+0x24/0x160 > > > [c000000c6d19fe20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c > > > > > > Changes from v3: > > > * Address review feedback > > > * Move page table depost and withdraw patch after adding pmdlock to avoid bisect failure. > > > > This version > > > > - Builds on x86, arm64, s390, arc, powerpc and riscv (defconfig with DEBUG_VM_PGTABLE) > > - Runs on arm64 and x86 without any regression, atleast nothing that I have noticed > > - Will be great if this could get tested on s390, arc, riscv, ppc32 platforms as well > > When I quickly tested v3, it worked fine, but now it turned out to > only work fine "sometimes", both v3 and v4. I need to look into it > further, but so far it seems related to the hugetlb_advanced_tests(). > > I guess there was already some discussion on this test, but we did > not receive all of the thread(s). Please always add at least > linux-s390@vger.kernel.org and maybe myself and Vasily Gorbik <gor@linux.ibm.com> > for further discussions. BTW, with myself I mean the new address gerald.schaefer@linux.ibm.com. The old gerald.schaefer@de.ibm.com seems to work (again), but is not very reliable. BTW2, a quick test with this change (so far) made the issues on s390 go away: @@ -1069,7 +1074,7 @@ static int __init debug_vm_pgtable(void) spin_unlock(ptl); #ifndef CONFIG_PPC_BOOK3S_64 - hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot); + hugetlb_advanced_tests(mm, vma, (pte_t *) pmdp, pmd_aligned, vaddr, prot); #endif spin_lock(&mm->page_table_lock); That would more match the "pte_t pointer" usage for hugetlb code, i.e. just cast a pmd_t pointer to it. Also changed to pmd_aligned, but I think the root cause is the pte_t pointer. Not entirely sure though if that would really be the correct fix. I somehow lost whatever little track I had about what these tests really want to check, and if that would still be valid with that change.
On Fri, 4 Sep 2020 18:01:15 +0200 Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote: > On Fri, 4 Sep 2020 17:26:47 +0200 > Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote: > > > On Fri, 4 Sep 2020 12:18:05 +0530 > > Anshuman Khandual <anshuman.khandual@arm.com> wrote: > > > > > > > > > > > On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote: > > > > This patch series includes fixes for debug_vm_pgtable test code so that > > > > they follow page table updates rules correctly. The first two patches introduce > > > > changes w.r.t ppc64. The patches are included in this series for completeness. We can > > > > merge them via ppc64 tree if required. > > > > > > > > Hugetlb test is disabled on ppc64 because that needs larger change to satisfy > > > > page table update rules. > > > > > > > > These tests are broken w.r.t page table update rules and results in kernel > > > > crash as below. > > > > > > > > [ 21.083519] kernel BUG at arch/powerpc/mm/pgtable.c:304! > > > > cpu 0x0: Vector: 700 (Program Check) at [c000000c6d1e76c0] > > > > pc: c00000000009a5ec: assert_pte_locked+0x14c/0x380 > > > > lr: c0000000005eeeec: pte_update+0x11c/0x190 > > > > sp: c000000c6d1e7950 > > > > msr: 8000000002029033 > > > > current = 0xc000000c6d172c80 > > > > paca = 0xc000000003ba0000 irqmask: 0x03 irq_happened: 0x01 > > > > pid = 1, comm = swapper/0 > > > > kernel BUG at arch/powerpc/mm/pgtable.c:304! > > > > [link register ] c0000000005eeeec pte_update+0x11c/0x190 > > > > [c000000c6d1e7950] 0000000000000001 (unreliable) > > > > [c000000c6d1e79b0] c0000000005eee14 pte_update+0x44/0x190 > > > > [c000000c6d1e7a10] c000000001a2ca9c pte_advanced_tests+0x160/0x3d8 > > > > [c000000c6d1e7ab0] c000000001a2d4fc debug_vm_pgtable+0x7e8/0x1338 > > > > [c000000c6d1e7ba0] c0000000000116ec do_one_initcall+0xac/0x5f0 > > > > [c000000c6d1e7c80] c0000000019e4fac kernel_init_freeable+0x4dc/0x5a4 > > > > [c000000c6d1e7db0] c000000000012474 kernel_init+0x24/0x160 > > > > [c000000c6d1e7e20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c > > > > > > > > With DEBUG_VM disabled > > > > > > > > [ 20.530152] BUG: Kernel NULL pointer dereference on read at 0x00000000 > > > > [ 20.530183] Faulting instruction address: 0xc0000000000df330 > > > > cpu 0x33: Vector: 380 (Data SLB Access) at [c000000c6d19f700] > > > > pc: c0000000000df330: memset+0x68/0x104 > > > > lr: c00000000009f6d8: hash__pmdp_huge_get_and_clear+0xe8/0x1b0 > > > > sp: c000000c6d19f990 > > > > msr: 8000000002009033 > > > > dar: 0 > > > > current = 0xc000000c6d177480 > > > > paca = 0xc00000001ec4f400 irqmask: 0x03 irq_happened: 0x01 > > > > pid = 1, comm = swapper/0 > > > > [link register ] c00000000009f6d8 hash__pmdp_huge_get_and_clear+0xe8/0x1b0 > > > > [c000000c6d19f990] c00000000009f748 hash__pmdp_huge_get_and_clear+0x158/0x1b0 (unreliable) > > > > [c000000c6d19fa10] c0000000019ebf30 pmd_advanced_tests+0x1f0/0x378 > > > > [c000000c6d19fab0] c0000000019ed088 debug_vm_pgtable+0x79c/0x1244 > > > > [c000000c6d19fba0] c0000000000116ec do_one_initcall+0xac/0x5f0 > > > > [c000000c6d19fc80] c0000000019a4fac kernel_init_freeable+0x4dc/0x5a4 > > > > [c000000c6d19fdb0] c000000000012474 kernel_init+0x24/0x160 > > > > [c000000c6d19fe20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c > > > > > > > > Changes from v3: > > > > * Address review feedback > > > > * Move page table depost and withdraw patch after adding pmdlock to avoid bisect failure. > > > > > > This version > > > > > > - Builds on x86, arm64, s390, arc, powerpc and riscv (defconfig with DEBUG_VM_PGTABLE) > > > - Runs on arm64 and x86 without any regression, atleast nothing that I have noticed > > > - Will be great if this could get tested on s390, arc, riscv, ppc32 platforms as well > > > > When I quickly tested v3, it worked fine, but now it turned out to > > only work fine "sometimes", both v3 and v4. I need to look into it > > further, but so far it seems related to the hugetlb_advanced_tests(). > > > > I guess there was already some discussion on this test, but we did > > not receive all of the thread(s). Please always add at least > > linux-s390@vger.kernel.org and maybe myself and Vasily Gorbik <gor@linux.ibm.com> > > for further discussions. > > BTW, with myself I mean the new address gerald.schaefer@linux.ibm.com. > The old gerald.schaefer@de.ibm.com seems to work (again), but is not > very reliable. > > BTW2, a quick test with this change (so far) made the issues on s390 > go away: > > @@ -1069,7 +1074,7 @@ static int __init debug_vm_pgtable(void) > spin_unlock(ptl); > > #ifndef CONFIG_PPC_BOOK3S_64 > - hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot); > + hugetlb_advanced_tests(mm, vma, (pte_t *) pmdp, pmd_aligned, vaddr, prot); > #endif > > spin_lock(&mm->page_table_lock); > > That would more match the "pte_t pointer" usage for hugetlb code, > i.e. just cast a pmd_t pointer to it. Also changed to pmd_aligned, > but I think the root cause is the pte_t pointer. > > Not entirely sure though if that would really be the correct fix. > I somehow lost whatever little track I had about what these tests > really want to check, and if that would still be valid with that > change. Another potential issue, apparently not for s390, but maybe for others, is that the vaddr passed to hugetlb_advanced_tests() is also not pmd/pud size aligned, like you did in pmd/pud_advanced_tests(). I guess for the hugetlb_advanced_tests() you need to choose if you want to test pmd or pud hugepages, and accordingly prepare the *ptep, pfn and vaddr input. If you only check for CONFIG_HUGETLB_PAGE, then probably only pmd hugepages would be safe, there might be architectures only supporting one hugepage size. So, for s390, at least the ptep input value is a problem. Still need to better understand how it goes wrong, but it seems to be fixed when using proper pmdp, and also works with pudp. For others, especially the apparent issues on ppc64, the other non-hugepage aligned input pfn and vaddr might also be an issue, e.g. power at least seems to use the vaddr in its set_huge_pte_at() implementation for some pmd_off(mm, addr) calculation. Again, sorry if this was already discussed, I missed most of it and honestly didn't properly look at the scarce mails that we did receive...
On Fri, 4 Sep 2020 18:01:15 +0200 Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote: [...] > > BTW2, a quick test with this change (so far) made the issues on s390 > go away: > > @@ -1069,7 +1074,7 @@ static int __init debug_vm_pgtable(void) > spin_unlock(ptl); > > #ifndef CONFIG_PPC_BOOK3S_64 > - hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot); > + hugetlb_advanced_tests(mm, vma, (pte_t *) pmdp, pmd_aligned, vaddr, prot); > #endif > > spin_lock(&mm->page_table_lock); > > That would more match the "pte_t pointer" usage for hugetlb code, > i.e. just cast a pmd_t pointer to it. Also changed to pmd_aligned, > but I think the root cause is the pte_t pointer. > > Not entirely sure though if that would really be the correct fix. > I somehow lost whatever little track I had about what these tests > really want to check, and if that would still be valid with that > change. Uh oh, wasn't aware that this (or some predecessor) already went upstream, and broke our debug kernel today. I found out now what goes (horribly) wrong on s390, see below for more details. In short, using hugetlb primitives with ptep pointers that do _not_ point to a pmd or pud entry will not work on s390. It also seems to make no sense to verify / test such a thing in general, as it would also be a severe bug if any kernel code would do that. After all, with hugepages, there are no pte tables, only pmd etc. tables. My change above would fix the issue for s390, but I can still not completely judge if that would not break other things for your tests. In general, for normal kernel code, much of what you do would be very broken, but I guess your tests are doing such "special" things because they can. E.g. because they operate on some "sandbox" mm and page tables, and you also do not need properly populated page tables for some exit / free cleanup, you just throw them away explicitly with pXd_free at the end. So it might just be "the right thing" to pass a casted pmd pointer to hugetlb_advanced_tests(), to simulate and test (proper) usage of the hugetlb primitives. I also see no other way to make this work for s390, than using a proper pmd/pud pointer. If not possible, please add us to the #ifndef. So, for all those interested, here is what goes wrong on s390. huge_ptep_get_and_clear() uses the "idte" instruction for the clearing (and TLB invalidation) part. That instruction expects a "region or segment table" origin, which is a pmd/pud/p4d/pgd, but not a pte table. Even worse, when we calculate the table origin from the given ptep (which *should* not point to a pte), due to different table sizes for pte / pXd tables, we end up at some place before the given pte table. The "idte" instruction also gets the virtual address, and does corresponding index addition to the given table origin. Depending on the pmd_index we now end up either within the pte table again, in which case we see a panic because idte complains about seeing a pte value. If we are unlucky, then we end up outside the pte table, and depending on the content of that memory location, idte might succeed, effectively corrupting that memory. That explains why we only see the panic sometimes, depending on random vaddr, other symptoms other times, and probably completely silent memory corruption for the rest...
Gerald Schaefer <gerald.schaefer@linux.ibm.com> writes: > On Fri, 4 Sep 2020 18:01:15 +0200 > Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote: > > [...] >> >> BTW2, a quick test with this change (so far) made the issues on s390 >> go away: >> >> @@ -1069,7 +1074,7 @@ static int __init debug_vm_pgtable(void) >> spin_unlock(ptl); >> >> #ifndef CONFIG_PPC_BOOK3S_64 >> - hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot); >> + hugetlb_advanced_tests(mm, vma, (pte_t *) pmdp, pmd_aligned, vaddr, prot); >> #endif >> >> spin_lock(&mm->page_table_lock); >> >> That would more match the "pte_t pointer" usage for hugetlb code, >> i.e. just cast a pmd_t pointer to it. Also changed to pmd_aligned, >> but I think the root cause is the pte_t pointer. >> >> Not entirely sure though if that would really be the correct fix. >> I somehow lost whatever little track I had about what these tests >> really want to check, and if that would still be valid with that >> change. > > Uh oh, wasn't aware that this (or some predecessor) already went > upstream, and broke our debug kernel today. Not sure i followed the above. Are you finding that s390 kernel crash after this patch series or the original patchset? As noted in my patch the hugetlb test is broken and we should fix that. A quick fix is to comment out that test for s390 too as i have done for PPC64. -aneesh
On 09/04/2020 08:56 PM, Gerald Schaefer wrote: > On Fri, 4 Sep 2020 12:18:05 +0530 > Anshuman Khandual <anshuman.khandual@arm.com> wrote: > >> >> >> On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote: >>> This patch series includes fixes for debug_vm_pgtable test code so that >>> they follow page table updates rules correctly. The first two patches introduce >>> changes w.r.t ppc64. The patches are included in this series for completeness. We can >>> merge them via ppc64 tree if required. >>> >>> Hugetlb test is disabled on ppc64 because that needs larger change to satisfy >>> page table update rules. >>> >>> These tests are broken w.r.t page table update rules and results in kernel >>> crash as below. >>> >>> [ 21.083519] kernel BUG at arch/powerpc/mm/pgtable.c:304! >>> cpu 0x0: Vector: 700 (Program Check) at [c000000c6d1e76c0] >>> pc: c00000000009a5ec: assert_pte_locked+0x14c/0x380 >>> lr: c0000000005eeeec: pte_update+0x11c/0x190 >>> sp: c000000c6d1e7950 >>> msr: 8000000002029033 >>> current = 0xc000000c6d172c80 >>> paca = 0xc000000003ba0000 irqmask: 0x03 irq_happened: 0x01 >>> pid = 1, comm = swapper/0 >>> kernel BUG at arch/powerpc/mm/pgtable.c:304! >>> [link register ] c0000000005eeeec pte_update+0x11c/0x190 >>> [c000000c6d1e7950] 0000000000000001 (unreliable) >>> [c000000c6d1e79b0] c0000000005eee14 pte_update+0x44/0x190 >>> [c000000c6d1e7a10] c000000001a2ca9c pte_advanced_tests+0x160/0x3d8 >>> [c000000c6d1e7ab0] c000000001a2d4fc debug_vm_pgtable+0x7e8/0x1338 >>> [c000000c6d1e7ba0] c0000000000116ec do_one_initcall+0xac/0x5f0 >>> [c000000c6d1e7c80] c0000000019e4fac kernel_init_freeable+0x4dc/0x5a4 >>> [c000000c6d1e7db0] c000000000012474 kernel_init+0x24/0x160 >>> [c000000c6d1e7e20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c >>> >>> With DEBUG_VM disabled >>> >>> [ 20.530152] BUG: Kernel NULL pointer dereference on read at 0x00000000 >>> [ 20.530183] Faulting instruction address: 0xc0000000000df330 >>> cpu 0x33: Vector: 380 (Data SLB Access) at [c000000c6d19f700] >>> pc: c0000000000df330: memset+0x68/0x104 >>> lr: c00000000009f6d8: hash__pmdp_huge_get_and_clear+0xe8/0x1b0 >>> sp: c000000c6d19f990 >>> msr: 8000000002009033 >>> dar: 0 >>> current = 0xc000000c6d177480 >>> paca = 0xc00000001ec4f400 irqmask: 0x03 irq_happened: 0x01 >>> pid = 1, comm = swapper/0 >>> [link register ] c00000000009f6d8 hash__pmdp_huge_get_and_clear+0xe8/0x1b0 >>> [c000000c6d19f990] c00000000009f748 hash__pmdp_huge_get_and_clear+0x158/0x1b0 (unreliable) >>> [c000000c6d19fa10] c0000000019ebf30 pmd_advanced_tests+0x1f0/0x378 >>> [c000000c6d19fab0] c0000000019ed088 debug_vm_pgtable+0x79c/0x1244 >>> [c000000c6d19fba0] c0000000000116ec do_one_initcall+0xac/0x5f0 >>> [c000000c6d19fc80] c0000000019a4fac kernel_init_freeable+0x4dc/0x5a4 >>> [c000000c6d19fdb0] c000000000012474 kernel_init+0x24/0x160 >>> [c000000c6d19fe20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c >>> >>> Changes from v3: >>> * Address review feedback >>> * Move page table depost and withdraw patch after adding pmdlock to avoid bisect failure. >> >> This version >> >> - Builds on x86, arm64, s390, arc, powerpc and riscv (defconfig with DEBUG_VM_PGTABLE) >> - Runs on arm64 and x86 without any regression, atleast nothing that I have noticed >> - Will be great if this could get tested on s390, arc, riscv, ppc32 platforms as well > > When I quickly tested v3, it worked fine, but now it turned out to > only work fine "sometimes", both v3 and v4. I need to look into it > further, but so far it seems related to the hugetlb_advanced_tests(). > > I guess there was already some discussion on this test, but we did > not receive all of the thread(s). Please always add at least > linux-s390@vger.kernel.org and maybe myself and Vasily Gorbik <gor@linux.ibm.com> > for further discussions. IIRC, the V3 series previously had all these addresses copied properly but this version once again missed copying all required addresses. > > That being said, sorry for duplications, this might already have been > discussed. Preliminary analysis showed that it only seems to go wrong > for certain random vaddr values. I cannot make any sense of that yet, > but what seems strange to me is that the hugetlb_advanced_tests() > take a (real) pte_t pointer as input, and also use that for all > kinds of operations (set_huge_pte_at, huge_ptep_get_and_clear, etc.). > > Although all the hugetlb code in the kernel is (mis)using pte_t > pointers instead of the correct pmd/pud_t pointers like THP, that > is just for historic reasons. The pointers will actually never point > to a real pte_t (i.e. page table entry), but of course to a pmd > or pud entry, depending on hugepage size. HugeTLB logically operates on a PTE entry irrespective of it's real page table level position. Nonetheless, IIUC, vaddr here should have been aligned to real page table level in which the entry is being mapped currently. > > What is passed in as ptep to hugetlb_advanced_tests() seems to be > the result from the previous ptep = pte_alloc_map(mm, pmdp, vaddr), > so I would expect that it points to a real page table entry. Need > to investigate further, but IIUC, using such a pointer for adding > large pte entries (i.e. pmd/pud entries) at least feels very wrong > to me, and I assume it is related to the issues we see on s390. Will look into this further. > > We actually see different issues, e.g. once a panic directly in > hugetlb_advanced_tests() -> huge_ptep_get_and_clear(), but also > indirect symptoms after debug_vm_pgtable() completes, like this: > > [ 10.533901] BUG task_struct (Not tainted): Padding overwritten. 0x0000000019f798c7-0x0000000019f798c7 @offset=30087 > > Last but not least, what I said about the pte vs. pmd/pud of > course also should apply to the hugetlb_basic_tests(), although > they are not directly using a pte_t pointer, and especially > also not writing to it. Still, the pte_aligned pfn parameter > is not guaranteed to also be pmd/pud_aligned, which doesn't > feel right. hugetlb_basic_tests() does not directly operate on real page table entries. But I do see the point wrt using pmd_aligned pfn instead. I will look into this in detail and send out something after this series settles down. > > So, for now, until this is sorted out, I guess we also need > to exclude s390 at least from the hugetlb_advanced_tests(). > The hugetlb_basic_tests() seem to work fine so far (probably > by chance :-))
On 09/04/2020 09:31 PM, Gerald Schaefer wrote: > On Fri, 4 Sep 2020 17:26:47 +0200 > Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote: > >> On Fri, 4 Sep 2020 12:18:05 +0530 >> Anshuman Khandual <anshuman.khandual@arm.com> wrote: >> >>> >>> >>> On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote: >>>> This patch series includes fixes for debug_vm_pgtable test code so that >>>> they follow page table updates rules correctly. The first two patches introduce >>>> changes w.r.t ppc64. The patches are included in this series for completeness. We can >>>> merge them via ppc64 tree if required. >>>> >>>> Hugetlb test is disabled on ppc64 because that needs larger change to satisfy >>>> page table update rules. >>>> >>>> These tests are broken w.r.t page table update rules and results in kernel >>>> crash as below. >>>> >>>> [ 21.083519] kernel BUG at arch/powerpc/mm/pgtable.c:304! >>>> cpu 0x0: Vector: 700 (Program Check) at [c000000c6d1e76c0] >>>> pc: c00000000009a5ec: assert_pte_locked+0x14c/0x380 >>>> lr: c0000000005eeeec: pte_update+0x11c/0x190 >>>> sp: c000000c6d1e7950 >>>> msr: 8000000002029033 >>>> current = 0xc000000c6d172c80 >>>> paca = 0xc000000003ba0000 irqmask: 0x03 irq_happened: 0x01 >>>> pid = 1, comm = swapper/0 >>>> kernel BUG at arch/powerpc/mm/pgtable.c:304! >>>> [link register ] c0000000005eeeec pte_update+0x11c/0x190 >>>> [c000000c6d1e7950] 0000000000000001 (unreliable) >>>> [c000000c6d1e79b0] c0000000005eee14 pte_update+0x44/0x190 >>>> [c000000c6d1e7a10] c000000001a2ca9c pte_advanced_tests+0x160/0x3d8 >>>> [c000000c6d1e7ab0] c000000001a2d4fc debug_vm_pgtable+0x7e8/0x1338 >>>> [c000000c6d1e7ba0] c0000000000116ec do_one_initcall+0xac/0x5f0 >>>> [c000000c6d1e7c80] c0000000019e4fac kernel_init_freeable+0x4dc/0x5a4 >>>> [c000000c6d1e7db0] c000000000012474 kernel_init+0x24/0x160 >>>> [c000000c6d1e7e20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c >>>> >>>> With DEBUG_VM disabled >>>> >>>> [ 20.530152] BUG: Kernel NULL pointer dereference on read at 0x00000000 >>>> [ 20.530183] Faulting instruction address: 0xc0000000000df330 >>>> cpu 0x33: Vector: 380 (Data SLB Access) at [c000000c6d19f700] >>>> pc: c0000000000df330: memset+0x68/0x104 >>>> lr: c00000000009f6d8: hash__pmdp_huge_get_and_clear+0xe8/0x1b0 >>>> sp: c000000c6d19f990 >>>> msr: 8000000002009033 >>>> dar: 0 >>>> current = 0xc000000c6d177480 >>>> paca = 0xc00000001ec4f400 irqmask: 0x03 irq_happened: 0x01 >>>> pid = 1, comm = swapper/0 >>>> [link register ] c00000000009f6d8 hash__pmdp_huge_get_and_clear+0xe8/0x1b0 >>>> [c000000c6d19f990] c00000000009f748 hash__pmdp_huge_get_and_clear+0x158/0x1b0 (unreliable) >>>> [c000000c6d19fa10] c0000000019ebf30 pmd_advanced_tests+0x1f0/0x378 >>>> [c000000c6d19fab0] c0000000019ed088 debug_vm_pgtable+0x79c/0x1244 >>>> [c000000c6d19fba0] c0000000000116ec do_one_initcall+0xac/0x5f0 >>>> [c000000c6d19fc80] c0000000019a4fac kernel_init_freeable+0x4dc/0x5a4 >>>> [c000000c6d19fdb0] c000000000012474 kernel_init+0x24/0x160 >>>> [c000000c6d19fe20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c >>>> >>>> Changes from v3: >>>> * Address review feedback >>>> * Move page table depost and withdraw patch after adding pmdlock to avoid bisect failure. >>> >>> This version >>> >>> - Builds on x86, arm64, s390, arc, powerpc and riscv (defconfig with DEBUG_VM_PGTABLE) >>> - Runs on arm64 and x86 without any regression, atleast nothing that I have noticed >>> - Will be great if this could get tested on s390, arc, riscv, ppc32 platforms as well >> >> When I quickly tested v3, it worked fine, but now it turned out to >> only work fine "sometimes", both v3 and v4. I need to look into it >> further, but so far it seems related to the hugetlb_advanced_tests(). >> >> I guess there was already some discussion on this test, but we did >> not receive all of the thread(s). Please always add at least >> linux-s390@vger.kernel.org and maybe myself and Vasily Gorbik <gor@linux.ibm.com> >> for further discussions. > > BTW, with myself I mean the new address gerald.schaefer@linux.ibm.com. > The old gerald.schaefer@de.ibm.com seems to work (again), but is not > very reliable. Sure, noted. > > BTW2, a quick test with this change (so far) made the issues on s390 > go away: > > @@ -1069,7 +1074,7 @@ static int __init debug_vm_pgtable(void) > spin_unlock(ptl); > > #ifndef CONFIG_PPC_BOOK3S_64 > - hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot); > + hugetlb_advanced_tests(mm, vma, (pte_t *) pmdp, pmd_aligned, vaddr, prot); > #endif > > spin_lock(&mm->page_table_lock); > > That would more match the "pte_t pointer" usage for hugetlb code, > i.e. just cast a pmd_t pointer to it. Also changed to pmd_aligned, > but I think the root cause is the pte_t pointer. Ideally, the pte_t pointer used here should be from huge_pte_alloc() not from pte_alloc_map_lock() as the case currently. > > Not entirely sure though if that would really be the correct fix. > I somehow lost whatever little track I had about what these tests > really want to check, and if that would still be valid with that > change. >
On 09/04/2020 11:23 PM, Gerald Schaefer wrote: > On Fri, 4 Sep 2020 18:01:15 +0200 > Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote: > >> On Fri, 4 Sep 2020 17:26:47 +0200 >> Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote: >> >>> On Fri, 4 Sep 2020 12:18:05 +0530 >>> Anshuman Khandual <anshuman.khandual@arm.com> wrote: >>> >>>> >>>> >>>> On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote: >>>>> This patch series includes fixes for debug_vm_pgtable test code so that >>>>> they follow page table updates rules correctly. The first two patches introduce >>>>> changes w.r.t ppc64. The patches are included in this series for completeness. We can >>>>> merge them via ppc64 tree if required. >>>>> >>>>> Hugetlb test is disabled on ppc64 because that needs larger change to satisfy >>>>> page table update rules. >>>>> >>>>> These tests are broken w.r.t page table update rules and results in kernel >>>>> crash as below. >>>>> >>>>> [ 21.083519] kernel BUG at arch/powerpc/mm/pgtable.c:304! >>>>> cpu 0x0: Vector: 700 (Program Check) at [c000000c6d1e76c0] >>>>> pc: c00000000009a5ec: assert_pte_locked+0x14c/0x380 >>>>> lr: c0000000005eeeec: pte_update+0x11c/0x190 >>>>> sp: c000000c6d1e7950 >>>>> msr: 8000000002029033 >>>>> current = 0xc000000c6d172c80 >>>>> paca = 0xc000000003ba0000 irqmask: 0x03 irq_happened: 0x01 >>>>> pid = 1, comm = swapper/0 >>>>> kernel BUG at arch/powerpc/mm/pgtable.c:304! >>>>> [link register ] c0000000005eeeec pte_update+0x11c/0x190 >>>>> [c000000c6d1e7950] 0000000000000001 (unreliable) >>>>> [c000000c6d1e79b0] c0000000005eee14 pte_update+0x44/0x190 >>>>> [c000000c6d1e7a10] c000000001a2ca9c pte_advanced_tests+0x160/0x3d8 >>>>> [c000000c6d1e7ab0] c000000001a2d4fc debug_vm_pgtable+0x7e8/0x1338 >>>>> [c000000c6d1e7ba0] c0000000000116ec do_one_initcall+0xac/0x5f0 >>>>> [c000000c6d1e7c80] c0000000019e4fac kernel_init_freeable+0x4dc/0x5a4 >>>>> [c000000c6d1e7db0] c000000000012474 kernel_init+0x24/0x160 >>>>> [c000000c6d1e7e20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c >>>>> >>>>> With DEBUG_VM disabled >>>>> >>>>> [ 20.530152] BUG: Kernel NULL pointer dereference on read at 0x00000000 >>>>> [ 20.530183] Faulting instruction address: 0xc0000000000df330 >>>>> cpu 0x33: Vector: 380 (Data SLB Access) at [c000000c6d19f700] >>>>> pc: c0000000000df330: memset+0x68/0x104 >>>>> lr: c00000000009f6d8: hash__pmdp_huge_get_and_clear+0xe8/0x1b0 >>>>> sp: c000000c6d19f990 >>>>> msr: 8000000002009033 >>>>> dar: 0 >>>>> current = 0xc000000c6d177480 >>>>> paca = 0xc00000001ec4f400 irqmask: 0x03 irq_happened: 0x01 >>>>> pid = 1, comm = swapper/0 >>>>> [link register ] c00000000009f6d8 hash__pmdp_huge_get_and_clear+0xe8/0x1b0 >>>>> [c000000c6d19f990] c00000000009f748 hash__pmdp_huge_get_and_clear+0x158/0x1b0 (unreliable) >>>>> [c000000c6d19fa10] c0000000019ebf30 pmd_advanced_tests+0x1f0/0x378 >>>>> [c000000c6d19fab0] c0000000019ed088 debug_vm_pgtable+0x79c/0x1244 >>>>> [c000000c6d19fba0] c0000000000116ec do_one_initcall+0xac/0x5f0 >>>>> [c000000c6d19fc80] c0000000019a4fac kernel_init_freeable+0x4dc/0x5a4 >>>>> [c000000c6d19fdb0] c000000000012474 kernel_init+0x24/0x160 >>>>> [c000000c6d19fe20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c >>>>> >>>>> Changes from v3: >>>>> * Address review feedback >>>>> * Move page table depost and withdraw patch after adding pmdlock to avoid bisect failure. >>>> >>>> This version >>>> >>>> - Builds on x86, arm64, s390, arc, powerpc and riscv (defconfig with DEBUG_VM_PGTABLE) >>>> - Runs on arm64 and x86 without any regression, atleast nothing that I have noticed >>>> - Will be great if this could get tested on s390, arc, riscv, ppc32 platforms as well >>> >>> When I quickly tested v3, it worked fine, but now it turned out to >>> only work fine "sometimes", both v3 and v4. I need to look into it >>> further, but so far it seems related to the hugetlb_advanced_tests(). >>> >>> I guess there was already some discussion on this test, but we did >>> not receive all of the thread(s). Please always add at least >>> linux-s390@vger.kernel.org and maybe myself and Vasily Gorbik <gor@linux.ibm.com> >>> for further discussions. >> >> BTW, with myself I mean the new address gerald.schaefer@linux.ibm.com. >> The old gerald.schaefer@de.ibm.com seems to work (again), but is not >> very reliable. >> >> BTW2, a quick test with this change (so far) made the issues on s390 >> go away: >> >> @@ -1069,7 +1074,7 @@ static int __init debug_vm_pgtable(void) >> spin_unlock(ptl); >> >> #ifndef CONFIG_PPC_BOOK3S_64 >> - hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot); >> + hugetlb_advanced_tests(mm, vma, (pte_t *) pmdp, pmd_aligned, vaddr, prot); >> #endif >> >> spin_lock(&mm->page_table_lock); >> >> That would more match the "pte_t pointer" usage for hugetlb code, >> i.e. just cast a pmd_t pointer to it. Also changed to pmd_aligned, >> but I think the root cause is the pte_t pointer. >> >> Not entirely sure though if that would really be the correct fix. >> I somehow lost whatever little track I had about what these tests >> really want to check, and if that would still be valid with that >> change. > > Another potential issue, apparently not for s390, but maybe for > others, is that the vaddr passed to hugetlb_advanced_tests() is > also not pmd/pud size aligned, like you did in pmd/pud_advanced_tests(). > > I guess for the hugetlb_advanced_tests() you need to choose if > you want to test pmd or pud hugepages, and accordingly prepare > the *ptep, pfn and vaddr input. If you only check for CONFIG_HUGETLB_PAGE, > then probably only pmd hugepages would be safe, there might be > architectures only supporting one hugepage size. I guess preparing for PMD based HugeTLB tests should be sufficient for now, which can be improved later on to cover other levels. > > So, for s390, at least the ptep input value is a problem. Still > need to better understand how it goes wrong, but it seems to be > fixed when using proper pmdp, and also works with pudp. > > For others, especially the apparent issues on ppc64, the other > non-hugepage aligned input pfn and vaddr might also be an issue, > e.g. power at least seems to use the vaddr in its set_huge_pte_at() > implementation for some pmd_off(mm, addr) calculation. > > Again, sorry if this was already discussed, I missed most of it > and honestly didn't properly look at the scarce mails that we did > receive... Sure, will consider these points and try improve tests afterwards.
On Wed, 9 Sep 2020 13:45:48 +0530 Anshuman Khandual <anshuman.khandual@arm.com> wrote: [...] > > > > That would more match the "pte_t pointer" usage for hugetlb code, > > i.e. just cast a pmd_t pointer to it. Also changed to pmd_aligned, > > but I think the root cause is the pte_t pointer. > > Ideally, the pte_t pointer used here should be from huge_pte_alloc() > not from pte_alloc_map_lock() as the case currently. Ah, good point. I assumed that this would also always return casted pmd etc. pointers, and never pte pointers. Unfortunately, that doesn't seem to be true for all architectures, e.g. ia64, parisc, (some) powerpc, where they really do a pte_alloc_map() for some reason. I guess that means you cannot simply cast the pmd pointer, as suggested, although I really do not understand how any architecture can work with real ptes for hugepages. But that's fair, s390 also does some things that nobody would expect or understand for other architectures... So, for using huge_pte_alloc() you'd also need some size, maybe iterating over hstates with for_each_hstate() could be an option, if they are already initialized at that point. Then you have the size(s) with huge_page_size(hstate) and can actually call the hugetlb tests for all supported sizes, and with proper pointer from huge_pte_alloc().
On Wed, 09 Sep 2020 11:38:39 +0530 "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote: > Gerald Schaefer <gerald.schaefer@linux.ibm.com> writes: > > > On Fri, 4 Sep 2020 18:01:15 +0200 > > Gerald Schaefer <gerald.schaefer@linux.ibm.com> wrote: > > > > [...] > >> > >> BTW2, a quick test with this change (so far) made the issues on s390 > >> go away: > >> > >> @@ -1069,7 +1074,7 @@ static int __init debug_vm_pgtable(void) > >> spin_unlock(ptl); > >> > >> #ifndef CONFIG_PPC_BOOK3S_64 > >> - hugetlb_advanced_tests(mm, vma, ptep, pte_aligned, vaddr, prot); > >> + hugetlb_advanced_tests(mm, vma, (pte_t *) pmdp, pmd_aligned, vaddr, prot); > >> #endif > >> > >> spin_lock(&mm->page_table_lock); > >> > >> That would more match the "pte_t pointer" usage for hugetlb code, > >> i.e. just cast a pmd_t pointer to it. Also changed to pmd_aligned, > >> but I think the root cause is the pte_t pointer. > >> > >> Not entirely sure though if that would really be the correct fix. > >> I somehow lost whatever little track I had about what these tests > >> really want to check, and if that would still be valid with that > >> change. > > > > Uh oh, wasn't aware that this (or some predecessor) already went > > upstream, and broke our debug kernel today. > > Not sure i followed the above. Are you finding that s390 kernel crash > after this patch series or the original patchset? As noted in my patch > the hugetlb test is broken and we should fix that. A quick fix is to > comment out that test for s390 too as i have done for PPC64. We see it with both, it basically is broken since there is a hugetlb test using real pte pointers. It doesn't always show, depending on random vaddr, so it slipped through earlier testing. I guess we also would have had one or the other chance to notice that earlier, through better review, or better reading of previous mails. I must admit that I neglected this a bit.
On Wed, 9 Sep 2020 13:38:25 +0530 Anshuman Khandual <anshuman.khandual@arm.com> wrote: > > > On 09/04/2020 08:56 PM, Gerald Schaefer wrote: > > On Fri, 4 Sep 2020 12:18:05 +0530 > > Anshuman Khandual <anshuman.khandual@arm.com> wrote: > > > >> > >> > >> On 09/02/2020 05:12 PM, Aneesh Kumar K.V wrote: > >>> This patch series includes fixes for debug_vm_pgtable test code so that > >>> they follow page table updates rules correctly. The first two patches introduce > >>> changes w.r.t ppc64. The patches are included in this series for completeness. We can > >>> merge them via ppc64 tree if required. > >>> > >>> Hugetlb test is disabled on ppc64 because that needs larger change to satisfy > >>> page table update rules. > >>> > >>> These tests are broken w.r.t page table update rules and results in kernel > >>> crash as below. > >>> > >>> [ 21.083519] kernel BUG at arch/powerpc/mm/pgtable.c:304! > >>> cpu 0x0: Vector: 700 (Program Check) at [c000000c6d1e76c0] > >>> pc: c00000000009a5ec: assert_pte_locked+0x14c/0x380 > >>> lr: c0000000005eeeec: pte_update+0x11c/0x190 > >>> sp: c000000c6d1e7950 > >>> msr: 8000000002029033 > >>> current = 0xc000000c6d172c80 > >>> paca = 0xc000000003ba0000 irqmask: 0x03 irq_happened: 0x01 > >>> pid = 1, comm = swapper/0 > >>> kernel BUG at arch/powerpc/mm/pgtable.c:304! > >>> [link register ] c0000000005eeeec pte_update+0x11c/0x190 > >>> [c000000c6d1e7950] 0000000000000001 (unreliable) > >>> [c000000c6d1e79b0] c0000000005eee14 pte_update+0x44/0x190 > >>> [c000000c6d1e7a10] c000000001a2ca9c pte_advanced_tests+0x160/0x3d8 > >>> [c000000c6d1e7ab0] c000000001a2d4fc debug_vm_pgtable+0x7e8/0x1338 > >>> [c000000c6d1e7ba0] c0000000000116ec do_one_initcall+0xac/0x5f0 > >>> [c000000c6d1e7c80] c0000000019e4fac kernel_init_freeable+0x4dc/0x5a4 > >>> [c000000c6d1e7db0] c000000000012474 kernel_init+0x24/0x160 > >>> [c000000c6d1e7e20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c > >>> > >>> With DEBUG_VM disabled > >>> > >>> [ 20.530152] BUG: Kernel NULL pointer dereference on read at 0x00000000 > >>> [ 20.530183] Faulting instruction address: 0xc0000000000df330 > >>> cpu 0x33: Vector: 380 (Data SLB Access) at [c000000c6d19f700] > >>> pc: c0000000000df330: memset+0x68/0x104 > >>> lr: c00000000009f6d8: hash__pmdp_huge_get_and_clear+0xe8/0x1b0 > >>> sp: c000000c6d19f990 > >>> msr: 8000000002009033 > >>> dar: 0 > >>> current = 0xc000000c6d177480 > >>> paca = 0xc00000001ec4f400 irqmask: 0x03 irq_happened: 0x01 > >>> pid = 1, comm = swapper/0 > >>> [link register ] c00000000009f6d8 hash__pmdp_huge_get_and_clear+0xe8/0x1b0 > >>> [c000000c6d19f990] c00000000009f748 hash__pmdp_huge_get_and_clear+0x158/0x1b0 (unreliable) > >>> [c000000c6d19fa10] c0000000019ebf30 pmd_advanced_tests+0x1f0/0x378 > >>> [c000000c6d19fab0] c0000000019ed088 debug_vm_pgtable+0x79c/0x1244 > >>> [c000000c6d19fba0] c0000000000116ec do_one_initcall+0xac/0x5f0 > >>> [c000000c6d19fc80] c0000000019a4fac kernel_init_freeable+0x4dc/0x5a4 > >>> [c000000c6d19fdb0] c000000000012474 kernel_init+0x24/0x160 > >>> [c000000c6d19fe20] c00000000000cbd0 ret_from_kernel_thread+0x5c/0x6c > >>> > >>> Changes from v3: > >>> * Address review feedback > >>> * Move page table depost and withdraw patch after adding pmdlock to avoid bisect failure. > >> > >> This version > >> > >> - Builds on x86, arm64, s390, arc, powerpc and riscv (defconfig with DEBUG_VM_PGTABLE) > >> - Runs on arm64 and x86 without any regression, atleast nothing that I have noticed > >> - Will be great if this could get tested on s390, arc, riscv, ppc32 platforms as well > > > > When I quickly tested v3, it worked fine, but now it turned out to > > only work fine "sometimes", both v3 and v4. I need to look into it > > further, but so far it seems related to the hugetlb_advanced_tests(). > > > > I guess there was already some discussion on this test, but we did > > not receive all of the thread(s). Please always add at least > > linux-s390@vger.kernel.org and maybe myself and Vasily Gorbik <gor@linux.ibm.com> > > for further discussions. > > IIRC, the V3 series previously had all these addresses copied properly > but this version once again missed copying all required addresses. I also had issues with the de.ibm.com address, which might also have made some mails disappear, and others might simply have been overlooked be me. Don't bother, my bad. > > > > > That being said, sorry for duplications, this might already have been > > discussed. Preliminary analysis showed that it only seems to go wrong > > for certain random vaddr values. I cannot make any sense of that yet, > > but what seems strange to me is that the hugetlb_advanced_tests() > > take a (real) pte_t pointer as input, and also use that for all > > kinds of operations (set_huge_pte_at, huge_ptep_get_and_clear, etc.). > > > > Although all the hugetlb code in the kernel is (mis)using pte_t > > pointers instead of the correct pmd/pud_t pointers like THP, that > > is just for historic reasons. The pointers will actually never point > > to a real pte_t (i.e. page table entry), but of course to a pmd > > or pud entry, depending on hugepage size. > > HugeTLB logically operates on a PTE entry irrespective of it's real > page table level position. Nonetheless, IIUC, vaddr here should have > been aligned to real page table level in which the entry is being > mapped currently. That goes back to the time where only x86 had hugepages, and they have the same layout for pte/pmd/etc entries, so it simply didn't matter that the code (mis)used pte pointers / entries. But even for x86, the hugetlb pte pointers would never have pointed to real ptes, but pmds instead. That's why I call it misuse. s390 is very sensitive to page table level, and we can also determine the level from the entry value, which is used for some primitives. Others have implicit assumptions and calculations, which go wrong if a wrong level is passed in, like in this case for huge_ptep_get_and_clear(). Simply aligning vaddr / pfn will not be enough to fix this for s390, it has to be a pmd/pud pointer. Or, as you already mentioned, the result of huge_pte_alloc(). Furthermore, the pmd and pte layout are different, so we simply cannot use any pte_xxx primitives for hugepages. That was the reason for introducing huge_ptep_get(), which will do an implicit conversion from the real pmd/pud entry to a "fake" pte entry, which can then be used with such pte_xxx primitives. Before writing it back in set_huge_pte_at() we then do the reverse conversion to a proper pmd/pud again.
On Wed, 2 Sep 2020 17:12:09 +0530 "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote: > This patch series includes fixes for debug_vm_pgtable test code so that > they follow page table updates rules correctly. The first two patches introduce > changes w.r.t ppc64. The patches are included in this series for completeness. We can > merge them via ppc64 tree if required. Do you think this series is ready to be merged? Possibly-unresolved issues which I have recorded are Against mm-debug_vm_pgtable-locks-move-non-page-table-modifying-test-together.patch: https://lkml.kernel.org/r/56830efb-887e-0000-a46e-ae015e5854cd@arm.com https://lkml.kernel.org/r/20200910075752.GC26874@shao2-debian Against mm-debug_vm_pgtable-avoid-none-pte-in-pte_clear_test.patch: https://lkml.kernel.org/r/87zh5wx51b.fsf@linux.ibm.com https://lkml.kernel.org/r/37a9facc-ca36-290f-3748-16c4a7a778fa@arm.com https://lkml.kernel.org/r/20201011200258.GA91021@roeck-us.net
On 10/14/20 2:28 AM, Andrew Morton wrote: > On Wed, 2 Sep 2020 17:12:09 +0530 "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote: > >> This patch series includes fixes for debug_vm_pgtable test code so that >> they follow page table updates rules correctly. The first two patches introduce >> changes w.r.t ppc64. The patches are included in this series for completeness. We can >> merge them via ppc64 tree if required. > > Do you think this series is ready to be merged? Hopefully, except for the Riscv crash. > > Possibly-unresolved issues which I have recorded are > > Against > mm-debug_vm_pgtable-locks-move-non-page-table-modifying-test-together.patch: > > https://lkml.kernel.org/r/56830efb-887e-0000-a46e-ae015e5854cd@arm.com I guess the full series do boot fine on arm. > https://lkml.kernel.org/r/20200910075752.GC26874@shao2-debian This should be fixed by https://ozlabs.org/~akpm/mmots/broken-out/mm-debug_vm_pgtable-avoid-doing-memory-allocation-with-pgtable_t-mapped.patch > > Against mm-debug_vm_pgtable-avoid-none-pte-in-pte_clear_test.patch: > > https://lkml.kernel.org/r/87zh5wx51b.fsf@linux.ibm.com yes this one we should get fixed. I was hoping someone familiar with Riscv pte updates rules would pitch in. IIUC we need to update RANDON_ORVALUE similar to how we updated it for s390 and ppc64. Alternatively we can do this modified mm/debug_vm_pgtable.c @@ -548,7 +548,7 @@ static void __init pte_clear_tests(struct mm_struct *mm, pte_t *ptep, pte_t pte = pfn_pte(pfn, prot); pr_debug("Validating PTE clear\n"); - pte = __pte(pte_val(pte) | RANDOM_ORVALUE); +// pte = __pte(pte_val(pte) | RANDOM_ORVALUE); set_pte_at(mm, vaddr, ptep, pte); barrier(); pte_clear(mm, vaddr, ptep); till we get that feedback from RiscV team? > https://lkml.kernel.org/r/37a9facc-ca36-290f-3748-16c4a7a778fa@arm.com same as the above. > https://lkml.kernel.org/r/20201011200258.GA91021@roeck-us.net > same as the above. -aneesh
On Wed, 14 Oct 2020 08:45:16 +0530 "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote: > > Against mm-debug_vm_pgtable-avoid-none-pte-in-pte_clear_test.patch: > > > > https://lkml.kernel.org/r/87zh5wx51b.fsf@linux.ibm.com > > > yes this one we should get fixed. I was hoping someone familiar with > Riscv pte updates rules would pitch in. IIUC we need to update > RANDON_ORVALUE similar to how we updated it for s390 and ppc64. > > > Alternatively we can do this > > modified mm/debug_vm_pgtable.c > @@ -548,7 +548,7 @@ static void __init pte_clear_tests(struct mm_struct > *mm, pte_t *ptep, > pte_t pte = pfn_pte(pfn, prot); > > pr_debug("Validating PTE clear\n"); > - pte = __pte(pte_val(pte) | RANDOM_ORVALUE); > +// pte = __pte(pte_val(pte) | RANDOM_ORVALUE); > set_pte_at(mm, vaddr, ptep, pte); > barrier(); > pte_clear(mm, vaddr, ptep); > > till we get that feedback from RiscV team? Would it be better to do #ifdef CONFIG_S390 pte = __pte(pte_val(pte) | RANDOM_ORVALUE); #endif ?
On 10/15/2020 02:06 AM, Andrew Morton wrote: > On Wed, 14 Oct 2020 08:45:16 +0530 "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote: > >>> Against mm-debug_vm_pgtable-avoid-none-pte-in-pte_clear_test.patch: >>> >>> https://lkml.kernel.org/r/87zh5wx51b.fsf@linux.ibm.com >> >> >> yes this one we should get fixed. I was hoping someone familiar with >> Riscv pte updates rules would pitch in. IIUC we need to update >> RANDON_ORVALUE similar to how we updated it for s390 and ppc64. >> >> >> Alternatively we can do this >> >> modified mm/debug_vm_pgtable.c >> @@ -548,7 +548,7 @@ static void __init pte_clear_tests(struct mm_struct >> *mm, pte_t *ptep, >> pte_t pte = pfn_pte(pfn, prot); >> >> pr_debug("Validating PTE clear\n"); >> - pte = __pte(pte_val(pte) | RANDOM_ORVALUE); >> +// pte = __pte(pte_val(pte) | RANDOM_ORVALUE); >> set_pte_at(mm, vaddr, ptep, pte); >> barrier(); >> pte_clear(mm, vaddr, ptep); >> >> till we get that feedback from RiscV team? > > Would it be better to do > > #ifdef CONFIG_S390 > pte = __pte(pte_val(pte) | RANDOM_ORVALUE); > #endif I would suggest just dropping RANDOM_ORVALUE from pte_clear_tests() possibly with a comment saying it needs to be fixed on RISCV before being added back later. pte = __pte(pte_val(pte)); OR Disable RANDOM_ORVALUE only for RISCV. #ifndef CONFIG_RISCV pte = __pte(pte_val(pte) | RANDOM_ORVALUE); #endif