
mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa

Message ID 20200216191800.22423-1-aquini@redhat.com (mailing list archive)
State New, archived
Series mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa

Commit Message

Rafael Aquini Feb. 16, 2020, 7:18 p.m. UTC
From: Mel Gorman <mgorman@techsingularity.net>
  A user reported a bug against a distribution kernel while running
  a proprietary workload described as "memory intensive that is not
  swapping"; the bug is expected to apply to mainline kernels as well. The
  workload reads, writes and modifies ranges of memory and checks the
  contents. They reported that within a few hours a bad PMD would be
  reported, followed by memory corruption where expected data was all
  zeros. A partial report of the bad PMD looked like

  [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
  [ 5195.341184] ------------[ cut here ]------------
  [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
  ....
  [ 5195.410033] Call Trace:
  [ 5195.410471]  [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930
  [ 5195.410716]  [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
  [ 5195.410918]  [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
  [ 5195.411200]  [<ffffffff81098322>] task_work_run+0x72/0x90
  [ 5195.411246]  [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2
  [ 5195.411494]  [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40
  [ 5195.411739]  [<ffffffff815e56af>] retint_user+0x8/0x10

  Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
  was a false detection. The bug does not trigger if automatic NUMA balancing
  or transparent huge pages are disabled.

  The bug is due to a race in change_pmd_range between a pmd_trans_huge and
  pmd_none_or_clear_bad check without any locks held. During the pmd_trans_huge
  check, a parallel protection update under lock can have cleared the PMD
  and filled it with a prot_numa entry between the transhuge check and the
  pmd_none_or_clear_bad check.
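
  One possible interleaving of the two unlocked checks, sketched here for
  illustration (it paraphrases the description above; it is not taken from
  the original report):

    CPU 0 (change_pmd_range,            CPU 1 (parallel protection update,
           mmap_sem held for read)             page table lock held)
    ----------------------------        ----------------------------------
    pmd_trans_huge(*pmd) returns false
                                        clears the PMD
                                        fills it with a prot_numa entry
    pmd_none_or_clear_bad(pmd)
      re-reads the PMD, sees the
      prot_numa entry, reports a bad
      PMD and clears a valid entry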

  While this could be fixed with heavy locking, it's only necessary to
  make a copy of the PMD on the stack during change_pmd_range and avoid
  races. A new helper is created for this as the check is quite subtle and the
  existing similar helper is not suitable. This passed 154 hours of testing
  (usually triggers between 20 minutes and 24 hours) without detecting bad
  PMDs or corruption. A basic test of an autonuma-intensive workload showed
  no significant change in behaviour.
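
  The core of the fix, condensed here from the patch hunk (quoted in full in
  the replies below; the CONFIG_TRANSPARENT_HUGEPAGE barrier is dropped for
  brevity): take a single snapshot of the PMD and run the none/trans_huge/bad
  checks against that snapshot instead of re-reading *pmd for each check.

      static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
      {
              pmd_t pmdval = pmd_read_atomic(pmd);  /* one snapshot for every check */

              if (pmd_none(pmdval))
                      return 1;
              if (pmd_trans_huge(pmdval))
                      return 0;       /* leave huge PMDs to the THP path */
              if (unlikely(pmd_bad(pmdval))) {
                      pmd_clear_bad(pmd);
                      return 1;
              }
              return 0;
      }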

Although Mel withdrew the patch in the face of the LKML comment at https://lkml.org/lkml/2017/4/10/922,
the aforementioned race window is still open, and we have reports of the Linpack test reporting bad
residuals after the bad PMD warning is observed. In addition to that, bad rss-counter and
non-zero pgtables assertions are triggered on mm teardown for the task hitting the bad PMD.

 host kernel: mm/pgtable-generic.c:40: bad pmd 00000000b3152f68(8000000d2d2008e7)
 ....
 host kernel: BUG: Bad rss-counter state mm:00000000b583043d idx:1 val:512
 host kernel: BUG: non-zero pgtables_bytes on freeing mm: 4096

The issue is observed on a v4.18-based distribution kernel, but the race window is
expected to be applicable to mainline kernels, as well.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Cc: stable@vger.kernel.org
Signed-off-by: Rafael Aquini <aquini@redhat.com>
---
 mm/mprotect.c | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

Comments

Mel Gorman Feb. 16, 2020, 11:32 p.m. UTC | #1
On Sun, Feb 16, 2020 at 02:18:00PM -0500, Rafael Aquini wrote:
> From: Mel Gorman <mgorman@techsingularity.net>
>   A user reported a bug against a distribution kernel while running
>   a proprietary workload described as "memory intensive that is not
>   swapping" that is expected to apply to mainline kernels. The workload
>   is read/write/modifying ranges of memory and checking the contents. They
>   reported that within a few hours that a bad PMD would be reported followed
>   by a memory corruption where expected data was all zeros.  A partial report
>   of the bad PMD looked like
> 
>   [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
>   [ 5195.341184] ------------[ cut here ]------------
>   [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
>   ....
>   [ 5195.410033] Call Trace:
>   [ 5195.410471]  [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930
>   [ 5195.410716]  [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
>   [ 5195.410918]  [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
>   [ 5195.411200]  [<ffffffff81098322>] task_work_run+0x72/0x90
>   [ 5195.411246]  [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2
>   [ 5195.411494]  [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40
>   [ 5195.411739]  [<ffffffff815e56af>] retint_user+0x8/0x10
> 
>   Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
>   was a false detection. The bug does not trigger if automatic NUMA balancing
>   or transparent huge pages is disabled.
> 
>   The bug is due a race in change_pmd_range between a pmd_trans_huge and
>   pmd_nond_or_clear_bad check without any locks held. During the pmd_trans_huge
>   check, a parallel protection update under lock can have cleared the PMD
>   and filled it with a prot_numa entry between the transhuge check and the
>   pmd_none_or_clear_bad check.
> 
>   While this could be fixed with heavy locking, it's only necessary to
>   make a copy of the PMD on the stack during change_pmd_range and avoid
>   races. A new helper is created for this as the check if quite subtle and the
>   existing similar helpful is not suitable. This passed 154 hours of testing
>   (usually triggers between 20 minutes and 24 hours) without detecting bad
>   PMDs or corruption. A basic test of an autonuma-intensive workload showed
>   no significant change in behaviour.
> 
> Although Mel withdrew the patch on the face of LKML comment https://lkml.org/lkml/2017/4/10/922
> the race window aforementioned is still open, and we have reports of Linpack test reporting bad
> residuals after the bad PMD warning is observed. In addition to that, bad rss-counter and
> non-zero pgtables assertions are triggered on mm teardown for the task hitting the bad PMD.
> 
>  host kernel: mm/pgtable-generic.c:40: bad pmd 00000000b3152f68(8000000d2d2008e7)
>  ....
>  host kernel: BUG: Bad rss-counter state mm:00000000b583043d idx:1 val:512
>  host kernel: BUG: non-zero pgtables_bytes on freeing mm: 4096
> 
> The issue is observed on a v4.18-based distribution kernel, but the race window is
> expected to be applicable to mainline kernels, as well.
> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Cc: stable@vger.kernel.org
> Signed-off-by: Rafael Aquini <aquini@redhat.com>

It's curious that it took so long for this to be caught again.
Unfortunately I cannot find exactly what it's racing against but maybe
it's not worth chasing down and the patch is simply the safer option :(
Qian Cai March 7, 2020, 2:40 a.m. UTC | #2
> On Feb 16, 2020, at 2:18 PM, Rafael Aquini <aquini@redhat.com> wrote:
> 
> From: Mel Gorman <mgorman@techsingularity.net>
>  A user reported a bug against a distribution kernel while running
>  a proprietary workload described as "memory intensive that is not
>  swapping" that is expected to apply to mainline kernels. The workload
>  is read/write/modifying ranges of memory and checking the contents. They
>  reported that within a few hours that a bad PMD would be reported followed
>  by a memory corruption where expected data was all zeros.  A partial report
>  of the bad PMD looked like
> 
>  [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
>  [ 5195.341184] ------------[ cut here ]------------
>  [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
>  ....
>  [ 5195.410033] Call Trace:
>  [ 5195.410471]  [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930
>  [ 5195.410716]  [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
>  [ 5195.410918]  [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
>  [ 5195.411200]  [<ffffffff81098322>] task_work_run+0x72/0x90
>  [ 5195.411246]  [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2
>  [ 5195.411494]  [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40
>  [ 5195.411739]  [<ffffffff815e56af>] retint_user+0x8/0x10
> 
>  Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
>  was a false detection. The bug does not trigger if automatic NUMA balancing
>  or transparent huge pages is disabled.
> 
>  The bug is due a race in change_pmd_range between a pmd_trans_huge and
>  pmd_nond_or_clear_bad check without any locks held. During the pmd_trans_huge
>  check, a parallel protection update under lock can have cleared the PMD
>  and filled it with a prot_numa entry between the transhuge check and the
>  pmd_none_or_clear_bad check.
> 
>  While this could be fixed with heavy locking, it's only necessary to
>  make a copy of the PMD on the stack during change_pmd_range and avoid
>  races. A new helper is created for this as the check if quite subtle and the
>  existing similar helpful is not suitable. This passed 154 hours of testing
>  (usually triggers between 20 minutes and 24 hours) without detecting bad
>  PMDs or corruption. A basic test of an autonuma-intensive workload showed
>  no significant change in behaviour.
> 
> Although Mel withdrew the patch on the face of LKML comment https://lkml.org/lkml/2017/4/10/922
> the race window aforementioned is still open, and we have reports of Linpack test reporting bad
> residuals after the bad PMD warning is observed. In addition to that, bad rss-counter and
> non-zero pgtables assertions are triggered on mm teardown for the task hitting the bad PMD.
> 
> host kernel: mm/pgtable-generic.c:40: bad pmd 00000000b3152f68(8000000d2d2008e7)
> ....
> host kernel: BUG: Bad rss-counter state mm:00000000b583043d idx:1 val:512
> host kernel: BUG: non-zero pgtables_bytes on freeing mm: 4096
> 
> The issue is observed on a v4.18-based distribution kernel, but the race window is
> expected to be applicable to mainline kernels, as well.

After reverting this patch, it is no longer possible to reproduce a bug under memory pressure
with swapping on linux-next 20200306.

The trace is a bit off for some reason, but it is basically,

kernel BUG at include/linux/page-flags.h:317!

end_swap_bio_write()
   SetPageError(page)
       VM_BUG_ON_PAGE(1 && PageCompound(page)) 

The page was allocated in,

do_huge_pmd_anonymous_page()

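A rough sketch of why that assertion fires (paraphrased, not the exact
page-flags.h source): SetPageError() is generated with a page-flag policy
that rejects compound pages, so for this THP it effectively expands to

	static inline void SetPageError(struct page *page)
	{
		/* the constant 1 comes from the flag policy's enforce argument */
		VM_BUG_ON_PAGE(1 && PageCompound(page), page);
		set_bit(PG_error, &page->flags);
	}

and the write error on the order-9 page allocated by
do_huge_pmd_anonymous_page() trips the VM_BUG_ON_PAGE().
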
[ 9598.551238][  T794] oom_reaper: reaped process 49120 (oom02), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB 
[ 9690.896638][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc0000 flags=0x0010] 
[ 9690.910129][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1000 flags=0x0010] 
[ 9690.923933][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1900 flags=0x0010] 
[ 9690.937669][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1d00 flags=0x0010] 
[ 9690.951468][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2000 flags=0x0010] 
[ 9690.964803][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2400 flags=0x0010] 
[ 9690.978256][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2800 flags=0x0010] 
[ 9690.991423][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2c00 flags=0x0010] 
[ 9691.004294][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc3000 flags=0x0010] 
[ 9691.017228][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc3400 flags=0x0010] 
[ 9691.029804][  T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffffffffffcc3800 flags=0x0010] 
[ 9691.041623][  T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffffffffffcc3c00 flags=0x0010] 
[ 9692.767064][    C8] smartpqi 0000:23:00.0: controller is offline: status code 0x14803 
[ 9692.775562][    C8] smartpqi 0000:23:00.0: controller offline 
[ 9692.841721][   C36] sd 0:1:0:0: [sda] tag#693 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
[ 9692.841970][   C87] blk_update_request: I/O error, dev sda, sector 59046400 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
[ 9692.842718][   C31] blk_update_request: I/O error, dev sda, sector 36509696 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
[ 9692.842748][   C18] sd 0:1:0:0: [sda] tag#616 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
[ 9692.842757][   C28] sd 0:1:0:0: [sda] tag#841 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
[ 9692.842766][  C109] sd 0:1:0:0: [sda] tag#393 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
[ 9692.842811][  C109] sd 0:1:0:0: [sda] tag#393 CDB: opcode=0x2a 2a 00 01 60 c8 00 00 02 00 00 
[ 9692.842840][   C13] sd 0:1:0:0: [sda] tag#324 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
[ 9692.842924][   C18] sd 0:1:0:0: [sda] tag#616 CDB: opcode=0x2a 2a 00 01 c7 8c 00 00 02 00 00 
[ 9692.842932][  C109] blk_update_request: I/O error, dev sda, sector 23119872 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
[ 9692.842938][   C28] sd 0:1:0:0: [sda] tag#841 CDB: opcode=0x2a 2a 00 03 2f cc 00 00 02 00 00 
[ 9692.842984][  T213] sd 0:1:0:0: [sda] tag#47 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
[ 9692.842995][   C28] blk_update_request: I/O error, dev sda, sector 53464064 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
[ 9692.843000][   C81] sd 0:1:0:0: [sda] tag#10 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
[ 9692.843031][   C18] blk_update_request: I/O error, dev sda, sector 29854720 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
[ 9692.843056][   C13] sd 0:1:0:0: [sda] tag#324 CDB: opcode=0x2a 2a 00 03 17 66 00 00 02 00 00 
[ 9692.843101][   C81] sd 0:1:0:0: [sda] tag#10 CDB: opcode=0x2a 2a 00 00 52 a4 00 00 02 00 00 
[ 9692.843129][   C13] blk_update_request: I/O error, dev sda, sector 51865088 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
[ 9692.843192][   C81] blk_update_request: I/O error, dev sda, sector 5415936 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
[ 9692.843338][  T213] sd 0:1:0:0: [sda] tag#47 CDB: opcode=0x2a 2a 00 02 be 5a 00 00 02 00 00 
[ 9692.843358][   C13] sd 0:1:0:0: [sda] tag#325 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
[ 9692.843480][  T213] blk_update_request: I/O error, dev sda, sector 46029312 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
[ 9692.843485][   C13] sd 0:1:0:0: [sda] tag#325 CDB: opcode=0x2a 2a 00 03 17 68 00 00 02 00 00 
[ 9692.843510][   C13] blk_update_request: I/O error, dev sda, sector 51865600 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
[ 9692.843685][  C103] blk_update_request: I/O error, dev sda, sector 34941952 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
[ 9692.843725][   C96] sd 0:1:0:0: [sda] tag#877 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
[ 9692.843744][   C92] Write-error on swap-device (254:1:39256632) 
[ 9692.843791][  C118] sd 0:1:0:0: [sda] tag#230 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
[ 9692.843866][   C64] Write-error on swap-device (254:1:63030016) 
[ 9692.843891][   C96] sd 0:1:0:0: [sda] tag#877 CDB: opcode=0x2a 2a 00 03 1e 48 00 00 02 00 00 
[ 9692.844013][  C118] sd 0:1:0:0: [sda] tag#230 CDB: opcode=0x2a 2a 00 03 c6 26 00 00 02 00 00 
[ 9692.844615][   C92] Write-error on swap-device (254:1:39256640) 
[ 9692.844643][   C64] Write-error on swap-device (254:1:63030024) 
[ 9692.845109][   C61] Write-error on swap-device (254:1:17105384) 
[ 9692.845192][   C25] page:fffff3b2ec3a8000 refcount:512 mapcount:0 mapping:000000009eb0338c index:0x7f6e58200 head:fffff3b2ec3a8000 order:9 compound_mapcount:0 compound_pincount:0 
[ 9692.845226][   C25] anon flags: 0x45fffe0000d8454(uptodate|lru|workingset|owner_priv_1|writeback|head|reclaim|swapbacked) 
[ 9692.845254][   C25] raw: 045fffe0000d8454 fffff3b2f9d48008 fffff3b2f3270008 ffff97c3090f8209 
[ 9692.845283][   C25] raw: 00000007f6e58200 000000000053c800 00000200ffffffff ffff97c8d71e9000 
[ 9692.845293][  C122] Write-error on swap-device (254:1:39080184) 
[ 9692.845316][  T213] page:fffff3b2f77d8000 refcount:512 mapcount:0 mapping:00000000a4517f4e index:0x7f69d2200 head:fffff3b2f77d8000 order:9 compound_mapcount:0 compound_pincount:0 
[ 9692.845330][   C25] page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page)) 
[ 9692.845427][  T213] anon flags: 0x45fffe0000d8454(uptodate|lru|workingset|owner_priv_1|writeback|head|reclaim|swapbacked) 
[ 9692.845433][   C25] page->mem_cgroup:ffff97c8d71e9000 
[ 9692.845443][   C25] page_owner tracks the page as allocated 
[ 9692.845520][  T213] raw: 045fffe0000d8454 fffff3b2f5ec2f08 fffff3b2f6b88008 ffff97c715cc0609 
[ 9692.845535][   C25] page last allocated via order 9, migratetype Movable, gfp_mask 0x1c20ca(GFP_TRANSHUGE_LIGHT) 
[ 9692.845606][   C25]  prep_new_page+0x1cd/0x1f0 
[ 9692.845628][   C25]  get_page_from_freelist+0x18ac/0x24d0 
[ 9692.845639][  T213] raw: 00000007f69d2200 00000000000bf400 00000200ffffffff ffff97c8d71e9000 
[ 9692.845654][   C25]  __alloc_pages_nodemask+0x1b1/0x450 
[ 9692.845733][  T213] page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page)) 
[ 9692.845757][   C25]  alloc_pages_vma+0x8a/0x2c0 
[ 9692.845762][  T213] page->mem_cgroup:ffff97c8d71e9000 
[ 9692.845779][   C25]  do_huge_pmd_anonymous_page+0x1d4/0xc30 
[ 9692.845786][  T213] page_owner tracks the page as allocated 
[ 9692.845807][   C25]  __handle_mm_fault+0xd27/0xd50 
[ 9692.845819][  T213] page last allocated via order 9, migratetype Movable, gfp_mask 0x1c20ca(GFP_TRANSHUGE_LIGHT) 
[ 9692.845833][   C25]  handle_mm_fault+0xfc/0x2f0 
[ 9692.845856][  T213]  prep_new_page+0x1cd/0x1f0 
[ 9692.845876][   C25]  do_page_fault+0x263/0x6f9 
[ 9692.845892][  T213]  get_page_from_free7219][   C25]  clone_endio+0xe4/0x2c0 [dm_mod] 
[ 9692.847226][  T213] kernel BUG at include/linux/page-flags.h:317! 
[ 9692.847234][   C64] Write-error on swap-device (254:1:63030064) 
[ 9692.847250][   C25]  bio_endio+0x297/0x560 
[ 9692.847255][   C61] Write-error on swap-device (254:1:17105424) 
[ 9692.847263][   C25]  ? bio_advance+0x92/0x190 
[ 9692.847267][  C122] Write-error on swap-device (254:1:39080224) 
[ 9692.847276][   C25]  blk_update_request+0x201/0x920 
[ 9692.847287][   C25]  ? ___ratelimit+0x138/0x1e0 
[ 9692.847312][   C25]  scsi_end_request+0x6b/0x4b0 
[ 9692.847320][   C25]  ? ___ratelimit+0x3e/0x1e0 
[ 9692.847329][   C25]  scsi_io_completion+0x509/0x7e0 
[ 9692.847340][   C25]  scsi_finish_command+0x1ed/0x2a0 
[ 9692.847350][   C25]  scsi_softirq_done+0x1c9/0x1d0 
[ 9692.847354][   C64] Write-error on swap-device (254:1:63030072) 
[ 9692.847363][   C25]  ? blk_mq_check_inflight+0xa0/0xa0 
[ 9692.847383][   C61] Write-error on swap-device (254:1:17105432) 
[ 9692.847397][  C122] Write-erro692.848242][   C25] RSP: 0018:ffffb02006b68bb0 EFLAGS: 00010086 
[ 9692.848333][  T213] Workqueue: events pqi_ctrl_offline_worker [smartpqi] 
[ 9692.848346][   C61] Write-error on swap-device (254:1:17105528) 
[ 9692.848351][  C122] Write-error on swap-device (254:1:39080328) 
[ 9692.848428][   C25] RAX: 0000000000000000 RBX: ffff97c76d8d4b80 RCX: ffffffff9d51e8f0 
[ 9692.848433][   C61] Write-error on swap-device (254:1:17105536) 
[ 9692.848437][  C122] Write-error on swap-device (254:1:39080336) 
[ 9692.848445][  T213] RIP: 0010:end_swap_bio_write+0x90/0x280 
[ 9692.848450][   C25] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff97c7e539f00a 
[ 9692.848461][  T213] Code: 80 7b 1a 00 0f 84 58 01 00 00 4c 89 e6 bf 10 00 00 00 e8 33 f8 ff ff 84 c0 74 1d 48 c7 c6 18 7e 3b 9a 4c 89 e7 e8 20 78 fa ff <0f> 0b 48 c7 c7 b0 0b 6e 9a e8 04 ee 2b 00 4d 8d 74 24 08 ba 08 00 
[ 9692.848466][   C25] RBP: ffffb02006b68bf0 R08: 0000000000000000 R09: 000097c7e539f00a 
[ 9692.848471][  T213] RSP: 0018:ffffb02006dd3908 EFLAGS: 00010286 
[ 9692.848485][   C25] R10: 0000b02006b689a0 R11: 000097c7e5ax_pmem_core efivars acpi_cpufreq efivarfs ip_tables x_tables xfs sd_mod smartpqi scsi_transport_sas tg3 mlx5_core 
[ 9692.849818][  C122] Write-error on swap-device (254:1:39080440) 
[ 9692.849884][  T213]  libphy 
[ 9692.849889][  C122] Write-error on swap-device (254:1:39080448) 
[ 9692.849952][  T213]  firmware_class 
[ 9692.849959][  C122] Write-error on swap-device (254:1:39080456) 
[ 9692.849963][  T213]  dm_mirror dm_region_hash dm_log dm_mod 
[ 9692.849990][  C122] Write-error on swap-device (254:1:39080464) 
[ 9692.850069][  C122] Write-error on swap-device (254:1:39080472) 
[ 9692.850107][  C122] Write-error on swap-device (254:1:39080480) 
[ 9692.850156][  C122] Write-error on swap-device (254:1:39080488) 
[ 9692.850220][  C122] Write-error on swap-device (254:1:39080496) 
[ 9692.850264][  C122] Write-error on swap-device (254:1:39080504) 
[ 9692.850323][  C122] Write-error on swap-device (254:1:39080512) 
[ 9692.850361][  C122] Write-error on swap-device (254:1:39080520) 
[ 9692.850411][  C122] Write-error on swap-device (254:1:39080528)  

> 
> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> Cc: stable@vger.kernel.org
> Signed-off-by: Rafael Aquini <aquini@redhat.com>
> ---
> mm/mprotect.c | 38 ++++++++++++++++++++++++++++++++++++--
> 1 file changed, 36 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 7a8e84f86831..9ea8cc0ab2fd 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -161,6 +161,31 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> 	return pages;
> }
> 
> +/*
> + * Used when setting automatic NUMA hinting protection where it is
> + * critical that a numa hinting PMD is not confused with a bad PMD.
> + */
> +static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
> +{
> +	pmd_t pmdval = pmd_read_atomic(pmd);
> +
> +	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	barrier();
> +#endif
> +
> +	if (pmd_none(pmdval))
> +		return 1;
> +	if (pmd_trans_huge(pmdval))
> +		return 0;
> +	if (unlikely(pmd_bad(pmdval))) {
> +		pmd_clear_bad(pmd);
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +
> static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> 		pud_t *pud, unsigned long addr, unsigned long end,
> 		pgprot_t newprot, int dirty_accountable, int prot_numa)
> @@ -178,8 +203,17 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> 		unsigned long this_pages;
> 
> 		next = pmd_addr_end(addr, end);
> -		if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
> -				&& pmd_none_or_clear_bad(pmd))
> +
> +		/*
> +		 * Automatic NUMA balancing walks the tables with mmap_sem
> +		 * held for read. It's possible a parallel update to occur
> +		 * between pmd_trans_huge() and a pmd_none_or_clear_bad()
> +		 * check leading to a false positive and clearing.
> +		 * Hence, it's ecessary to atomically read the PMD value
> +		 * for all the checks.
> +		 */
> +		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
> +		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
> 			goto next;
> 
> 		/* invoke the mmu notifier if the pmd is populated */
> -- 
> 2.24.1
> 
>
Rafael Aquini March 7, 2020, 3:05 a.m. UTC | #3
On Fri, Mar 06, 2020 at 09:40:10PM -0500, Qian Cai wrote:
> 
> 
> > On Feb 16, 2020, at 2:18 PM, Rafael Aquini <aquini@redhat.com> wrote:
> > 
> > From: Mel Gorman <mgorman@techsingularity.net>
> >  A user reported a bug against a distribution kernel while running
> >  a proprietary workload described as "memory intensive that is not
> >  swapping" that is expected to apply to mainline kernels. The workload
> >  is read/write/modifying ranges of memory and checking the contents. They
> >  reported that within a few hours that a bad PMD would be reported followed
> >  by a memory corruption where expected data was all zeros.  A partial report
> >  of the bad PMD looked like
> > 
> >  [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
> >  [ 5195.341184] ------------[ cut here ]------------
> >  [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
> >  ....
> >  [ 5195.410033] Call Trace:
> >  [ 5195.410471]  [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930
> >  [ 5195.410716]  [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
> >  [ 5195.410918]  [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
> >  [ 5195.411200]  [<ffffffff81098322>] task_work_run+0x72/0x90
> >  [ 5195.411246]  [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2
> >  [ 5195.411494]  [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40
> >  [ 5195.411739]  [<ffffffff815e56af>] retint_user+0x8/0x10
> > 
> >  Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
> >  was a false detection. The bug does not trigger if automatic NUMA balancing
> >  or transparent huge pages is disabled.
> > 
> >  The bug is due a race in change_pmd_range between a pmd_trans_huge and
> >  pmd_nond_or_clear_bad check without any locks held. During the pmd_trans_huge
> >  check, a parallel protection update under lock can have cleared the PMD
> >  and filled it with a prot_numa entry between the transhuge check and the
> >  pmd_none_or_clear_bad check.
> > 
> >  While this could be fixed with heavy locking, it's only necessary to
> >  make a copy of the PMD on the stack during change_pmd_range and avoid
> >  races. A new helper is created for this as the check if quite subtle and the
> >  existing similar helpful is not suitable. This passed 154 hours of testing
> >  (usually triggers between 20 minutes and 24 hours) without detecting bad
> >  PMDs or corruption. A basic test of an autonuma-intensive workload showed
> >  no significant change in behaviour.
> > 
> > Although Mel withdrew the patch on the face of LKML comment https://lkml.org/lkml/2017/4/10/922
> > the race window aforementioned is still open, and we have reports of Linpack test reporting bad
> > residuals after the bad PMD warning is observed. In addition to that, bad rss-counter and
> > non-zero pgtables assertions are triggered on mm teardown for the task hitting the bad PMD.
> > 
> > host kernel: mm/pgtable-generic.c:40: bad pmd 00000000b3152f68(8000000d2d2008e7)
> > ....
> > host kernel: BUG: Bad rss-counter state mm:00000000b583043d idx:1 val:512
> > host kernel: BUG: non-zero pgtables_bytes on freeing mm: 4096
> > 
> > The issue is observed on a v4.18-based distribution kernel, but the race window is
> > expected to be applicable to mainline kernels, as well.
> 
> After reverted this patch, it is no longer possible to reproduce a bug under memory pressure
> with swapping on linux-next 20200306,
>

Would you mind sharing your .config? Also it would be nice to know your
system and workload details so I can try to reproduce.

Cheers,
-- Rafael

> The trace is a bit off for some reasons, but it is basically,
> 
> kernel BUG at include/linux/page-flags.h:317!
> 
> end_swap_bio_write()
>    SetPageError(page)
>        VM_BUG_ON_PAGE(1 && PageCompound(page)) 
> 
> The page was allocated in,
> 
> do_huge_pmd_anonymous_page()
> 
> [ 9598.551238][  T794] oom_reaper: reaped process 49120 (oom02), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB 
> [ 9690.896638][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc0000 flags=0x0010] 
> [ 9690.910129][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1000 flags=0x0010] 
> [ 9690.923933][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1900 flags=0x0010] 
> [ 9690.937669][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1d00 flags=0x0010] 
> [ 9690.951468][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2000 flags=0x0010] 
> [ 9690.964803][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2400 flags=0x0010] 
> [ 9690.978256][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2800 flags=0x0010] 
> [ 9690.991423][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2c00 flags=0x0010] 
> [ 9691.004294][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc3000 flags=0x0010] 
> [ 9691.017228][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc3400 flags=0x0010] 
> [ 9691.029804][  T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffffffffffcc3800 flags=0x0010] 
> [ 9691.041623][  T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffffffffffcc3c00 flags=0x0010] 
> [ 9692.767064][    C8] smartpqi 0000:23:00.0: controller is offline: status code 0x14803 
> [ 9692.775562][    C8] smartpqi 0000:23:00.0: controller offline 
> [ 9692.841721][   C36] sd 0:1:0:0: [sda] tag#693 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> [ 9692.841970][   C87] blk_update_request: I/O error, dev sda, sector 59046400 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> [ 9692.842718][   C31] blk_update_request: I/O error, dev sda, sector 36509696 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> [ 9692.842748][   C18] sd 0:1:0:0: [sda] tag#616 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> [ 9692.842757][   C28] sd 0:1:0:0: [sda] tag#841 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> [ 9692.842766][  C109] sd 0:1:0:0: [sda] tag#393 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> [ 9692.842811][  C109] sd 0:1:0:0: [sda] tag#393 CDB: opcode=0x2a 2a 00 01 60 c8 00 00 02 00 00 
> [ 9692.842840][   C13] sd 0:1:0:0: [sda] tag#324 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> [ 9692.842924][   C18] sd 0:1:0:0: [sda] tag#616 CDB: opcode=0x2a 2a 00 01 c7 8c 00 00 02 00 00 
> [ 9692.842932][  C109] blk_update_request: I/O error, dev sda, sector 23119872 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> [ 9692.842938][   C28] sd 0:1:0:0: [sda] tag#841 CDB: opcode=0x2a 2a 00 03 2f cc 00 00 02 00 00 
> [ 9692.842984][  T213] sd 0:1:0:0: [sda] tag#47 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> [ 9692.842995][   C28] blk_update_request: I/O error, dev sda, sector 53464064 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> [ 9692.843000][   C81] sd 0:1:0:0: [sda] tag#10 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> [ 9692.843031][   C18] blk_update_request: I/O error, dev sda, sector 29854720 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> [ 9692.843056][   C13] sd 0:1:0:0: [sda] tag#324 CDB: opcode=0x2a 2a 00 03 17 66 00 00 02 00 00 
> [ 9692.843101][   C81] sd 0:1:0:0: [sda] tag#10 CDB: opcode=0x2a 2a 00 00 52 a4 00 00 02 00 00 
> [ 9692.843129][   C13] blk_update_request: I/O error, dev sda, sector 51865088 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> [ 9692.843192][   C81] blk_update_request: I/O error, dev sda, sector 5415936 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> [ 9692.843338][  T213] sd 0:1:0:0: [sda] tag#47 CDB: opcode=0x2a 2a 00 02 be 5a 00 00 02 00 00 
> [ 9692.843358][   C13] sd 0:1:0:0: [sda] tag#325 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> [ 9692.843480][  T213] blk_update_request: I/O error, dev sda, sector 46029312 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> [ 9692.843485][   C13] sd 0:1:0:0: [sda] tag#325 CDB: opcode=0x2a 2a 00 03 17 68 00 00 02 00 00 
> [ 9692.843510][   C13] blk_update_request: I/O error, dev sda, sector 51865600 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> [ 9692.843685][  C103] blk_update_request: I/O error, dev sda, sector 34941952 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> [ 9692.843725][   C96] sd 0:1:0:0: [sda] tag#877 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> [ 9692.843744][   C92] Write-error on swap-device (254:1:39256632) 
> [ 9692.843791][  C118] sd 0:1:0:0: [sda] tag#230 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> [ 9692.843866][   C64] Write-error on swap-device (254:1:63030016) 
> [ 9692.843891][   C96] sd 0:1:0:0: [sda] tag#877 CDB: opcode=0x2a 2a 00 03 1e 48 00 00 02 00 00 
> [ 9692.844013][  C118] sd 0:1:0:0: [sda] tag#230 CDB: opcode=0x2a 2a 00 03 c6 26 00 00 02 00 00 
> [ 9692.844615][   C92] Write-error on swap-device (254:1:39256640) 
> [ 9692.844643][   C64] Write-error on swap-device (254:1:63030024) 
> [ 9692.845109][   C61] Write-error on swap-device (254:1:17105384) 
> [ 9692.845192][   C25] page:fffff3b2ec3a8000 refcount:512 mapcount:0 mapping:000000009eb0338c index:0x7f6e58200 head:fffff3b2ec3a8000 order:9 compound_mapcount:0 compound_pincount:0 
> [ 9692.845226][   C25] anon flags: 0x45fffe0000d8454(uptodate|lru|workingset|owner_priv_1|writeback|head|reclaim|swapbacked) 
> [ 9692.845254][   C25] raw: 045fffe0000d8454 fffff3b2f9d48008 fffff3b2f3270008 ffff97c3090f8209 
> [ 9692.845283][   C25] raw: 00000007f6e58200 000000000053c800 00000200ffffffff ffff97c8d71e9000 
> [ 9692.845293][  C122] Write-error on swap-device (254:1:39080184) 
> [ 9692.845316][  T213] page:fffff3b2f77d8000 refcount:512 mapcount:0 mapping:00000000a4517f4e index:0x7f69d2200 head:fffff3b2f77d8000 order:9 compound_mapcount:0 compound_pincount:0 
> [ 9692.845330][   C25] page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page)) 
> [ 9692.845427][  T213] anon flags: 0x45fffe0000d8454(uptodate|lru|workingset|owner_priv_1|writeback|head|reclaim|swapbacked) 
> [ 9692.845433][   C25] page->mem_cgroup:ffff97c8d71e9000 
> [ 9692.845443][   C25] page_owner tracks the page as allocated 
> [ 9692.845520][  T213] raw: 045fffe0000d8454 fffff3b2f5ec2f08 fffff3b2f6b88008 ffff97c715cc0609 
> [ 9692.845535][   C25] page last allocated via order 9, migratetype Movable, gfp_mask 0x1c20ca(GFP_TRANSHUGE_LIGHT) 
> [ 9692.845606][   C25]  prep_new_page+0x1cd/0x1f0 
> [ 9692.845628][   C25]  get_page_from_freelist+0x18ac/0x24d0 
> [ 9692.845639][  T213] raw: 00000007f69d2200 00000000000bf400 00000200ffffffff ffff97c8d71e9000 
> [ 9692.845654][   C25]  __alloc_pages_nodemask+0x1b1/0x450 
> [ 9692.845733][  T213] page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page)) 
> [ 9692.845757][   C25]  alloc_pages_vma+0x8a/0x2c0 
> [ 9692.845762][  T213] page->mem_cgroup:ffff97c8d71e9000 
> [ 9692.845779][   C25]  do_huge_pmd_anonymous_page+0x1d4/0xc30 
> [ 9692.845786][  T213] page_owner tracks the page as allocated 
> [ 9692.845807][   C25]  __handle_mm_fault+0xd27/0xd50 
> [ 9692.845819][  T213] page last allocated via order 9, migratetype Movable, gfp_mask 0x1c20ca(GFP_TRANSHUGE_LIGHT) 
> [ 9692.845833][   C25]  handle_mm_fault+0xfc/0x2f0 
> [ 9692.845856][  T213]  prep_new_page+0x1cd/0x1f0 
> [ 9692.845876][   C25]  do_page_fault+0x263/0x6f9 
> [ 9692.845892][  T213]  get_page_from_free7219][   C25]  clone_endio+0xe4/0x2c0 [dm_mod] 
> [ 9692.847226][  T213] kernel BUG at include/linux/page-flags.h:317! 
> [ 9692.847234][   C64] Write-error on swap-device (254:1:63030064) 
> [ 9692.847250][   C25]  bio_endio+0x297/0x560 
> [ 9692.847255][   C61] Write-error on swap-device (254:1:17105424) 
> [ 9692.847263][   C25]  ? bio_advance+0x92/0x190 
> [ 9692.847267][  C122] Write-error on swap-device (254:1:39080224) 
> [ 9692.847276][   C25]  blk_update_request+0x201/0x920 
> [ 9692.847287][   C25]  ? ___ratelimit+0x138/0x1e0 
> [ 9692.847312][   C25]  scsi_end_request+0x6b/0x4b0 
> [ 9692.847320][   C25]  ? ___ratelimit+0x3e/0x1e0 
> [ 9692.847329][   C25]  scsi_io_completion+0x509/0x7e0 
> [ 9692.847340][   C25]  scsi_finish_command+0x1ed/0x2a0 
> [ 9692.847350][   C25]  scsi_softirq_done+0x1c9/0x1d0 
> [ 9692.847354][   C64] Write-error on swap-device (254:1:63030072) 
> [ 9692.847363][   C25]  ? blk_mq_check_inflight+0xa0/0xa0 
> [ 9692.847383][   C61] Write-error on swap-device (254:1:17105432) 
> [ 9692.847397][  C122] Write-erro692.848242][   C25] RSP: 0018:ffffb02006b68bb0 EFLAGS: 00010086 
> [ 9692.848333][  T213] Workqueue: events pqi_ctrl_offline_worker [smartpqi] 
> [ 9692.848346][   C61] Write-error on swap-device (254:1:17105528) 
> [ 9692.848351][  C122] Write-error on swap-device (254:1:39080328) 
> [ 9692.848428][   C25] RAX: 0000000000000000 RBX: ffff97c76d8d4b80 RCX: ffffffff9d51e8f0 
> [ 9692.848433][   C61] Write-error on swap-device (254:1:17105536) 
> [ 9692.848437][  C122] Write-error on swap-device (254:1:39080336) 
> [ 9692.848445][  T213] RIP: 0010:end_swap_bio_write+0x90/0x280 
> [ 9692.848450][   C25] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff97c7e539f00a 
> [ 9692.848461][  T213] Code: 80 7b 1a 00 0f 84 58 01 00 00 4c 89 e6 bf 10 00 00 00 e8 33 f8 ff ff 84 c0 74 1d 48 c7 c6 18 7e 3b 9a 4c 89 e7 e8 20 78 fa ff <0f> 0b 48 c7 c7 b0 0b 6e 9a e8 04 ee 2b 00 4d 8d 74 24 08 ba 08 00 
> [ 9692.848466][   C25] RBP: ffffb02006b68bf0 R08: 0000000000000000 R09: 000097c7e539f00a 
> [ 9692.848471][  T213] RSP: 0018:ffffb02006dd3908 EFLAGS: 00010286 
> [ 9692.848485][   C25] R10: 0000b02006b689a0 R11: 000097c7e5ax_pmem_core efivars acpi_cpufreq efivarfs ip_tables x_tables xfs sd_mod smartpqi scsi_transport_sas tg3 mlx5_core 
> [ 9692.849818][  C122] Write-error on swap-device (254:1:39080440) 
> [ 9692.849884][  T213]  libphy 
> [ 9692.849889][  C122] Write-error on swap-device (254:1:39080448) 
> [ 9692.849952][  T213]  firmware_class 
> [ 9692.849959][  C122] Write-error on swap-device (254:1:39080456) 
> [ 9692.849963][  T213]  dm_mirror dm_region_hash dm_log dm_mod 
> [ 9692.849990][  C122] Write-error on swap-device (254:1:39080464) 
> [ 9692.850069][  C122] Write-error on swap-device (254:1:39080472) 
> [ 9692.850107][  C122] Write-error on swap-device (254:1:39080480) 
> [ 9692.850156][  C122] Write-error on swap-device (254:1:39080488) 
> [ 9692.850220][  C122] Write-error on swap-device (254:1:39080496) 
> [ 9692.850264][  C122] Write-error on swap-device (254:1:39080504) 
> [ 9692.850323][  C122] Write-error on swap-device (254:1:39080512) 
> [ 9692.850361][  C122] Write-error on swap-device (254:1:39080520) 
> [ 9692.850411][  C122] Write-error on swap-device (254:1:39080528)  
> 
> > 
> > Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Rafael Aquini <aquini@redhat.com>
> > ---
> > mm/mprotect.c | 38 ++++++++++++++++++++++++++++++++++++--
> > 1 file changed, 36 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 7a8e84f86831..9ea8cc0ab2fd 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -161,6 +161,31 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > 	return pages;
> > }
> > 
> > +/*
> > + * Used when setting automatic NUMA hinting protection where it is
> > + * critical that a numa hinting PMD is not confused with a bad PMD.
> > + */
> > +static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
> > +{
> > +	pmd_t pmdval = pmd_read_atomic(pmd);
> > +
> > +	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +	barrier();
> > +#endif
> > +
> > +	if (pmd_none(pmdval))
> > +		return 1;
> > +	if (pmd_trans_huge(pmdval))
> > +		return 0;
> > +	if (unlikely(pmd_bad(pmdval))) {
> > +		pmd_clear_bad(pmd);
> > +		return 1;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> > 		pud_t *pud, unsigned long addr, unsigned long end,
> > 		pgprot_t newprot, int dirty_accountable, int prot_numa)
> > @@ -178,8 +203,17 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> > 		unsigned long this_pages;
> > 
> > 		next = pmd_addr_end(addr, end);
> > -		if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
> > -				&& pmd_none_or_clear_bad(pmd))
> > +
> > +		/*
> > +		 * Automatic NUMA balancing walks the tables with mmap_sem
> > +		 * held for read. It's possible a parallel update to occur
> > +		 * between pmd_trans_huge() and a pmd_none_or_clear_bad()
> > +		 * check leading to a false positive and clearing.
> > +		 * Hence, it's ecessary to atomically read the PMD value
> > +		 * for all the checks.
> > +		 */
> > +		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
> > +		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
> > 			goto next;
> > 
> > 		/* invoke the mmu notifier if the pmd is populated */
> > -- 
> > 2.24.1
> > 
> > 
>
Qian Cai March 8, 2020, 3:20 a.m. UTC | #4
Not sure if it is really worth hurrying this into -rc5 now given...

> On Mar 6, 2020, at 10:05 PM, Rafael Aquini <aquini@redhat.com> wrote:
>> 
>> After reverted this patch, it is no longer possible to reproduce a bug under memory pressure
>> with swapping on linux-next 20200306,
>> 
> 
> Would you mind sharing your .config ? Also it would be nice to know your
> system and workload details so I can try to reproduce.

LTP oom02 in a tight loop,

# i=0; while :; echo $((i++)); ./oom02; sleep 10; done

HPE ProLiant DL385 Gen10
AMD EPYC 7601 32-Core Processor
65536 MB memory, 400 GB disk space

Processors	128
Cores	64
Sockets	2

linux-next 20200306

> 
> Cheers,
> -- Rafael
> 
>> The trace is a bit off for some reasons, but it is basically,
>> 
>> kernel BUG at include/linux/page-flags.h:317!
>> 
>> end_swap_bio_write()
>>   SetPageError(page)
>>       VM_BUG_ON_PAGE(1 && PageCompound(page)) 
>> 
>> The page was allocated in,
>> 
>> do_huge_pmd_anonymous_page()
>> 
>> [ 9598.551238][  T794] oom_reaper: reaped process 49120 (oom02), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB 
>> [ 9690.896638][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc0000 flags=0x0010] 
>> [ 9690.910129][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1000 flags=0x0010] 
>> [ 9690.923933][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1900 flags=0x0010] 
>> [ 9690.937669][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1d00 flags=0x0010] 
>> [ 9690.951468][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2000 flags=0x0010] 
>> [ 9690.964803][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2400 flags=0x0010] 
>> [ 9690.978256][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2800 flags=0x0010] 
>> [ 9690.991423][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2c00 flags=0x0010] 
>> [ 9691.004294][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc3000 flags=0x0010] 
>> [ 9691.017228][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc3400 flags=0x0010] 
>> [ 9691.029804][  T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffffffffffcc3800 flags=0x0010] 
>> [ 9691.041623][  T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffffffffffcc3c00 flags=0x0010] 
>> [ 9692.767064][    C8] smartpqi 0000:23:00.0: controller is offline: status code 0x14803 
>> [ 9692.775562][    C8] smartpqi 0000:23:00.0: controller offline 
>> [ 9692.841721][   C36] sd 0:1:0:0: [sda] tag#693 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>> [ 9692.841970][   C87] blk_update_request: I/O error, dev sda, sector 59046400 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>> [ 9692.842718][   C31] blk_update_request: I/O error, dev sda, sector 36509696 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>> [ 9692.842748][   C18] sd 0:1:0:0: [sda] tag#616 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>> [ 9692.842757][   C28] sd 0:1:0:0: [sda] tag#841 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>> [ 9692.842766][  C109] sd 0:1:0:0: [sda] tag#393 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>> [ 9692.842811][  C109] sd 0:1:0:0: [sda] tag#393 CDB: opcode=0x2a 2a 00 01 60 c8 00 00 02 00 00 
>> [ 9692.842840][   C13] sd 0:1:0:0: [sda] tag#324 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>> [ 9692.842924][   C18] sd 0:1:0:0: [sda] tag#616 CDB: opcode=0x2a 2a 00 01 c7 8c 00 00 02 00 00 
>> [ 9692.842932][  C109] blk_update_request: I/O error, dev sda, sector 23119872 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>> [ 9692.842938][   C28] sd 0:1:0:0: [sda] tag#841 CDB: opcode=0x2a 2a 00 03 2f cc 00 00 02 00 00 
>> [ 9692.842984][  T213] sd 0:1:0:0: [sda] tag#47 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>> [ 9692.842995][   C28] blk_update_request: I/O error, dev sda, sector 53464064 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>> [ 9692.843000][   C81] sd 0:1:0:0: [sda] tag#10 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>> [ 9692.843031][   C18] blk_update_request: I/O error, dev sda, sector 29854720 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>> [ 9692.843056][   C13] sd 0:1:0:0: [sda] tag#324 CDB: opcode=0x2a 2a 00 03 17 66 00 00 02 00 00 
>> [ 9692.843101][   C81] sd 0:1:0:0: [sda] tag#10 CDB: opcode=0x2a 2a 00 00 52 a4 00 00 02 00 00 
>> [ 9692.843129][   C13] blk_update_request: I/O error, dev sda, sector 51865088 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>> [ 9692.843192][   C81] blk_update_request: I/O error, dev sda, sector 5415936 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>> [ 9692.843338][  T213] sd 0:1:0:0: [sda] tag#47 CDB: opcode=0x2a 2a 00 02 be 5a 00 00 02 00 00 
>> [ 9692.843358][   C13] sd 0:1:0:0: [sda] tag#325 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>> [ 9692.843480][  T213] blk_update_request: I/O error, dev sda, sector 46029312 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>> [ 9692.843485][   C13] sd 0:1:0:0: [sda] tag#325 CDB: opcode=0x2a 2a 00 03 17 68 00 00 02 00 00 
>> [ 9692.843510][   C13] blk_update_request: I/O error, dev sda, sector 51865600 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>> [ 9692.843685][  C103] blk_update_request: I/O error, dev sda, sector 34941952 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
>> [ 9692.843725][   C96] sd 0:1:0:0: [sda] tag#877 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>> [ 9692.843744][   C92] Write-error on swap-device (254:1:39256632) 
>> [ 9692.843791][  C118] sd 0:1:0:0: [sda] tag#230 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
>> [ 9692.843866][   C64] Write-error on swap-device (254:1:63030016) 
>> [ 9692.843891][   C96] sd 0:1:0:0: [sda] tag#877 CDB: opcode=0x2a 2a 00 03 1e 48 00 00 02 00 00 
>> [ 9692.844013][  C118] sd 0:1:0:0: [sda] tag#230 CDB: opcode=0x2a 2a 00 03 c6 26 00 00 02 00 00 
>> [ 9692.844615][   C92] Write-error on swap-device (254:1:39256640) 
>> [ 9692.844643][   C64] Write-error on swap-device (254:1:63030024) 
>> [ 9692.845109][   C61] Write-error on swap-device (254:1:17105384) 
>> [ 9692.845192][   C25] page:fffff3b2ec3a8000 refcount:512 mapcount:0 mapping:000000009eb0338c index:0x7f6e58200 head:fffff3b2ec3a8000 order:9 compound_mapcount:0 compound_pincount:0 
>> [ 9692.845226][   C25] anon flags: 0x45fffe0000d8454(uptodate|lru|workingset|owner_priv_1|writeback|head|reclaim|swapbacked) 
>> [ 9692.845254][   C25] raw: 045fffe0000d8454 fffff3b2f9d48008 fffff3b2f3270008 ffff97c3090f8209 
>> [ 9692.845283][   C25] raw: 00000007f6e58200 000000000053c800 00000200ffffffff ffff97c8d71e9000 
>> [ 9692.845293][  C122] Write-error on swap-device (254:1:39080184) 
>> [ 9692.845316][  T213] page:fffff3b2f77d8000 refcount:512 mapcount:0 mapping:00000000a4517f4e index:0x7f69d2200 head:fffff3b2f77d8000 order:9 compound_mapcount:0 compound_pincount:0 
>> [ 9692.845330][   C25] page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page)) 
>> [ 9692.845427][  T213] anon flags: 0x45fffe0000d8454(uptodate|lru|workingset|owner_priv_1|writeback|head|reclaim|swapbacked) 
>> [ 9692.845433][   C25] page->mem_cgroup:ffff97c8d71e9000 
>> [ 9692.845443][   C25] page_owner tracks the page as allocated 
>> [ 9692.845520][  T213] raw: 045fffe0000d8454 fffff3b2f5ec2f08 fffff3b2f6b88008 ffff97c715cc0609 
>> [ 9692.845535][   C25] page last allocated via order 9, migratetype Movable, gfp_mask 0x1c20ca(GFP_TRANSHUGE_LIGHT) 
>> [ 9692.845606][   C25]  prep_new_page+0x1cd/0x1f0 
>> [ 9692.845628][   C25]  get_page_from_freelist+0x18ac/0x24d0 
>> [ 9692.845639][  T213] raw: 00000007f69d2200 00000000000bf400 00000200ffffffff ffff97c8d71e9000 
>> [ 9692.845654][   C25]  __alloc_pages_nodemask+0x1b1/0x450 
>> [ 9692.845733][  T213] page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page)) 
>> [ 9692.845757][   C25]  alloc_pages_vma+0x8a/0x2c0 
>> [ 9692.845762][  T213] page->mem_cgroup:ffff97c8d71e9000 
>> [ 9692.845779][   C25]  do_huge_pmd_anonymous_page+0x1d4/0xc30 
>> [ 9692.845786][  T213] page_owner tracks the page as allocated 
>> [ 9692.845807][   C25]  __handle_mm_fault+0xd27/0xd50 
>> [ 9692.845819][  T213] page last allocated via order 9, migratetype Movable, gfp_mask 0x1c20ca(GFP_TRANSHUGE_LIGHT) 
>> [ 9692.845833][   C25]  handle_mm_fault+0xfc/0x2f0 
>> [ 9692.845856][  T213]  prep_new_page+0x1cd/0x1f0 
>> [ 9692.845876][   C25]  do_page_fault+0x263/0x6f9 
>> [ 9692.845892][  T213]  get_page_from_free7219][   C25]  clone_endio+0xe4/0x2c0 [dm_mod] 
>> [ 9692.847226][  T213] kernel BUG at include/linux/page-flags.h:317! 
>> [ 9692.847234][   C64] Write-error on swap-device (254:1:63030064) 
>> [ 9692.847250][   C25]  bio_endio+0x297/0x560 
>> [ 9692.847255][   C61] Write-error on swap-device (254:1:17105424) 
>> [ 9692.847263][   C25]  ? bio_advance+0x92/0x190 
>> [ 9692.847267][  C122] Write-error on swap-device (254:1:39080224) 
>> [ 9692.847276][   C25]  blk_update_request+0x201/0x920 
>> [ 9692.847287][   C25]  ? ___ratelimit+0x138/0x1e0 
>> [ 9692.847312][   C25]  scsi_end_request+0x6b/0x4b0 
>> [ 9692.847320][   C25]  ? ___ratelimit+0x3e/0x1e0 
>> [ 9692.847329][   C25]  scsi_io_completion+0x509/0x7e0 
>> [ 9692.847340][   C25]  scsi_finish_command+0x1ed/0x2a0 
>> [ 9692.847350][   C25]  scsi_softirq_done+0x1c9/0x1d0 
>> [ 9692.847354][   C64] Write-error on swap-device (254:1:63030072) 
>> [ 9692.847363][   C25]  ? blk_mq_check_inflight+0xa0/0xa0 
>> [ 9692.847383][   C61] Write-error on swap-device (254:1:17105432) 
>> [ 9692.847397][  C122] Write-erro692.848242][   C25] RSP: 0018:ffffb02006b68bb0 EFLAGS: 00010086 
>> [ 9692.848333][  T213] Workqueue: events pqi_ctrl_offline_worker [smartpqi] 
>> [ 9692.848346][   C61] Write-error on swap-device (254:1:17105528) 
>> [ 9692.848351][  C122] Write-error on swap-device (254:1:39080328) 
>> [ 9692.848428][   C25] RAX: 0000000000000000 RBX: ffff97c76d8d4b80 RCX: ffffffff9d51e8f0 
>> [ 9692.848433][   C61] Write-error on swap-device (254:1:17105536) 
>> [ 9692.848437][  C122] Write-error on swap-device (254:1:39080336) 
>> [ 9692.848445][  T213] RIP: 0010:end_swap_bio_write+0x90/0x280 
>> [ 9692.848450][   C25] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff97c7e539f00a 
>> [ 9692.848461][  T213] Code: 80 7b 1a 00 0f 84 58 01 00 00 4c 89 e6 bf 10 00 00 00 e8 33 f8 ff ff 84 c0 74 1d 48 c7 c6 18 7e 3b 9a 4c 89 e7 e8 20 78 fa ff <0f> 0b 48 c7 c7 b0 0b 6e 9a e8 04 ee 2b 00 4d 8d 74 24 08 ba 08 00 
>> [ 9692.848466][   C25] RBP: ffffb02006b68bf0 R08: 0000000000000000 R09: 000097c7e539f00a 
>> [ 9692.848471][  T213] RSP: 0018:ffffb02006dd3908 EFLAGS: 00010286 
>> [ 9692.848485][   C25] R10: 0000b02006b689a0 R11: 000097c7e5ax_pmem_core efivars acpi_cpufreq efivarfs ip_tables x_tables xfs sd_mod smartpqi scsi_transport_sas tg3 mlx5_core 
>> [ 9692.849818][  C122] Write-error on swap-device (254:1:39080440) 
>> [ 9692.849884][  T213]  libphy 
>> [ 9692.849889][  C122] Write-error on swap-device (254:1:39080448) 
>> [ 9692.849952][  T213]  firmware_class 
>> [ 9692.849959][  C122] Write-error on swap-device (254:1:39080456) 
>> [ 9692.849963][  T213]  dm_mirror dm_region_hash dm_log dm_mod 
>> [ 9692.849990][  C122] Write-error on swap-device (254:1:39080464) 
>> [ 9692.850069][  C122] Write-error on swap-device (254:1:39080472) 
>> [ 9692.850107][  C122] Write-error on swap-device (254:1:39080480) 
>> [ 9692.850156][  C122] Write-error on swap-device (254:1:39080488) 
>> [ 9692.850220][  C122] Write-error on swap-device (254:1:39080496) 
>> [ 9692.850264][  C122] Write-error on swap-device (254:1:39080504) 
>> [ 9692.850323][  C122] Write-error on swap-device (254:1:39080512) 
>> [ 9692.850361][  C122] Write-error on swap-device (254:1:39080520) 
>> [ 9692.850411][  C122] Write-error on swap-device (254:1:39080528)  
>> 
>>> 
>>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Rafael Aquini <aquini@redhat.com>
>>> ---
>>> mm/mprotect.c | 38 ++++++++++++++++++++++++++++++++++++--
>>> 1 file changed, 36 insertions(+), 2 deletions(-)
>>> 
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index 7a8e84f86831..9ea8cc0ab2fd 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -161,6 +161,31 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>>> 	return pages;
>>> }
>>> 
>>> +/*
>>> + * Used when setting automatic NUMA hinting protection where it is
>>> + * critical that a numa hinting PMD is not confused with a bad PMD.
>>> + */
>>> +static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
>>> +{
>>> +	pmd_t pmdval = pmd_read_atomic(pmd);
>>> +
>>> +	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> +	barrier();
>>> +#endif
>>> +
>>> +	if (pmd_none(pmdval))
>>> +		return 1;
>>> +	if (pmd_trans_huge(pmdval))
>>> +		return 0;
>>> +	if (unlikely(pmd_bad(pmdval))) {
>>> +		pmd_clear_bad(pmd);
>>> +		return 1;
>>> +	}
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>>> 		pud_t *pud, unsigned long addr, unsigned long end,
>>> 		pgprot_t newprot, int dirty_accountable, int prot_numa)
>>> @@ -178,8 +203,17 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>>> 		unsigned long this_pages;
>>> 
>>> 		next = pmd_addr_end(addr, end);
>>> -		if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
>>> -				&& pmd_none_or_clear_bad(pmd))
>>> +
>>> +		/*
>>> +		 * Automatic NUMA balancing walks the tables with mmap_sem
>>> +		 * held for read. It's possible a parallel update to occur
>>> +		 * between pmd_trans_huge() and a pmd_none_or_clear_bad()
>>> +		 * check leading to a false positive and clearing.
>>> +		 * Hence, it's ecessary to atomically read the PMD value
>>> +		 * for all the checks.
>>> +		 */
>>> +		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
>>> +		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
>>> 			goto next;
>>> 
>>> 		/* invoke the mmu notifier if the pmd is populated */
>>> -- 
>>> 2.24.1
>>> 
>>> 
>> 
>
Rafael Aquini March 8, 2020, 11:14 p.m. UTC | #5
On Sat, Mar 07, 2020 at 10:20:20PM -0500, Qian Cai wrote:
> Not sure if it is really worth hurrying this into -rc5 now given...
> 
> > On Mar 6, 2020, at 10:05 PM, Rafael Aquini <aquini@redhat.com> wrote:
> >> 
> >> After reverted this patch, it is no longer possible to reproduce a bug under memory pressure
> >> with swapping on linux-next 20200306,
> >> 
> > 
> > Would you mind sharing your .config ? Also it would be nice to know your
> > system and workload details so I can try to reproduce.
> 
> LTP oom02 in a tight loop,
> 
> # i=0; while :; echo $((i++)); ./oom02; sleep 10; done
> 
> HPE ProLiant DL385 Gen10
> AMD EPYC 7601 32-Core Processor
> 65536 MB memory, 400 GB disk space
> 
> Processors	128
> Cores	64
> Sockets	2
> 
> linux-next 20200306
>

Thx! what about your .config?

 

Cheers,
-- Rafael

> >> The trace is a bit off for some reason, but it is basically,
> >> 
> >> kernel BUG at include/linux/page-flags.h:317!
> >> 
> >> end_swap_bio_write()
> >>   SetPageError(page)
> >>       VM_BUG_ON_PAGE(1 && PageCompound(page)) 
> >> 
> >> The page was allocated in,
> >> 
> >> do_huge_pmd_anonymous_page()
> >> 
> >> [ 9598.551238][  T794] oom_reaper: reaped process 49120 (oom02), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB 
> >> [ 9690.896638][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc0000 flags=0x0010] 
> >> [ 9690.910129][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1000 flags=0x0010] 
> >> [ 9690.923933][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1900 flags=0x0010] 
> >> [ 9690.937669][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc1d00 flags=0x0010] 
> >> [ 9690.951468][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2000 flags=0x0010] 
> >> [ 9690.964803][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2400 flags=0x0010] 
> >> [ 9690.978256][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2800 flags=0x0010] 
> >> [ 9690.991423][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc2c00 flags=0x0010] 
> >> [ 9691.004294][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc3000 flags=0x0010] 
> >> [ 9691.017228][  T814] smartpqi 0000:23:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0027 address=0xffffffffffcc3400 flags=0x0010] 
> >> [ 9691.029804][  T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffffffffffcc3800 flags=0x0010] 
> >> [ 9691.041623][  T814] AMD-Vi: Event logged [IO_PAGE_FAULT device=23:00.0 domain=0x0027 address=0xffffffffffcc3c00 flags=0x0010] 
> >> [ 9692.767064][    C8] smartpqi 0000:23:00.0: controller is offline: status code 0x14803 
> >> [ 9692.775562][    C8] smartpqi 0000:23:00.0: controller offline 
> >> [ 9692.841721][   C36] sd 0:1:0:0: [sda] tag#693 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >> [ 9692.841970][   C87] blk_update_request: I/O error, dev sda, sector 59046400 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >> [ 9692.842718][   C31] blk_update_request: I/O error, dev sda, sector 36509696 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >> [ 9692.842748][   C18] sd 0:1:0:0: [sda] tag#616 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >> [ 9692.842757][   C28] sd 0:1:0:0: [sda] tag#841 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >> [ 9692.842766][  C109] sd 0:1:0:0: [sda] tag#393 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >> [ 9692.842811][  C109] sd 0:1:0:0: [sda] tag#393 CDB: opcode=0x2a 2a 00 01 60 c8 00 00 02 00 00 
> >> [ 9692.842840][   C13] sd 0:1:0:0: [sda] tag#324 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >> [ 9692.842924][   C18] sd 0:1:0:0: [sda] tag#616 CDB: opcode=0x2a 2a 00 01 c7 8c 00 00 02 00 00 
> >> [ 9692.842932][  C109] blk_update_request: I/O error, dev sda, sector 23119872 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >> [ 9692.842938][   C28] sd 0:1:0:0: [sda] tag#841 CDB: opcode=0x2a 2a 00 03 2f cc 00 00 02 00 00 
> >> [ 9692.842984][  T213] sd 0:1:0:0: [sda] tag#47 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >> [ 9692.842995][   C28] blk_update_request: I/O error, dev sda, sector 53464064 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >> [ 9692.843000][   C81] sd 0:1:0:0: [sda] tag#10 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >> [ 9692.843031][   C18] blk_update_request: I/O error, dev sda, sector 29854720 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >> [ 9692.843056][   C13] sd 0:1:0:0: [sda] tag#324 CDB: opcode=0x2a 2a 00 03 17 66 00 00 02 00 00 
> >> [ 9692.843101][   C81] sd 0:1:0:0: [sda] tag#10 CDB: opcode=0x2a 2a 00 00 52 a4 00 00 02 00 00 
> >> [ 9692.843129][   C13] blk_update_request: I/O error, dev sda, sector 51865088 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >> [ 9692.843192][   C81] blk_update_request: I/O error, dev sda, sector 5415936 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >> [ 9692.843338][  T213] sd 0:1:0:0: [sda] tag#47 CDB: opcode=0x2a 2a 00 02 be 5a 00 00 02 00 00 
> >> [ 9692.843358][   C13] sd 0:1:0:0: [sda] tag#325 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >> [ 9692.843480][  T213] blk_update_request: I/O error, dev sda, sector 46029312 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >> [ 9692.843485][   C13] sd 0:1:0:0: [sda] tag#325 CDB: opcode=0x2a 2a 00 03 17 68 00 00 02 00 00 
> >> [ 9692.843510][   C13] blk_update_request: I/O error, dev sda, sector 51865600 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >> [ 9692.843685][  C103] blk_update_request: I/O error, dev sda, sector 34941952 op 0x1:(WRITE) flags 0x8004000 phys_seg 4 prio class 0 
> >> [ 9692.843725][   C96] sd 0:1:0:0: [sda] tag#877 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >> [ 9692.843744][   C92] Write-error on swap-device (254:1:39256632) 
> >> [ 9692.843791][  C118] sd 0:1:0:0: [sda] tag#230 UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00 cmd_age=1s 
> >> [ 9692.843866][   C64] Write-error on swap-device (254:1:63030016) 
> >> [ 9692.843891][   C96] sd 0:1:0:0: [sda] tag#877 CDB: opcode=0x2a 2a 00 03 1e 48 00 00 02 00 00 
> >> [ 9692.844013][  C118] sd 0:1:0:0: [sda] tag#230 CDB: opcode=0x2a 2a 00 03 c6 26 00 00 02 00 00 
> >> [ 9692.844615][   C92] Write-error on swap-device (254:1:39256640) 
> >> [ 9692.844643][   C64] Write-error on swap-device (254:1:63030024) 
> >> [ 9692.845109][   C61] Write-error on swap-device (254:1:17105384) 
> >> [ 9692.845192][   C25] page:fffff3b2ec3a8000 refcount:512 mapcount:0 mapping:000000009eb0338c index:0x7f6e58200 head:fffff3b2ec3a8000 order:9 compound_mapcount:0 compound_pincount:0 
> >> [ 9692.845226][   C25] anon flags: 0x45fffe0000d8454(uptodate|lru|workingset|owner_priv_1|writeback|head|reclaim|swapbacked) 
> >> [ 9692.845254][   C25] raw: 045fffe0000d8454 fffff3b2f9d48008 fffff3b2f3270008 ffff97c3090f8209 
> >> [ 9692.845283][   C25] raw: 00000007f6e58200 000000000053c800 00000200ffffffff ffff97c8d71e9000 
> >> [ 9692.845293][  C122] Write-error on swap-device (254:1:39080184) 
> >> [ 9692.845316][  T213] page:fffff3b2f77d8000 refcount:512 mapcount:0 mapping:00000000a4517f4e index:0x7f69d2200 head:fffff3b2f77d8000 order:9 compound_mapcount:0 compound_pincount:0 
> >> [ 9692.845330][   C25] page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page)) 
> >> [ 9692.845427][  T213] anon flags: 0x45fffe0000d8454(uptodate|lru|workingset|owner_priv_1|writeback|head|reclaim|swapbacked) 
> >> [ 9692.845433][   C25] page->mem_cgroup:ffff97c8d71e9000 
> >> [ 9692.845443][   C25] page_owner tracks the page as allocated 
> >> [ 9692.845520][  T213] raw: 045fffe0000d8454 fffff3b2f5ec2f08 fffff3b2f6b88008 ffff97c715cc0609 
> >> [ 9692.845535][   C25] page last allocated via order 9, migratetype Movable, gfp_mask 0x1c20ca(GFP_TRANSHUGE_LIGHT) 
> >> [ 9692.845606][   C25]  prep_new_page+0x1cd/0x1f0 
> >> [ 9692.845628][   C25]  get_page_from_freelist+0x18ac/0x24d0 
> >> [ 9692.845639][  T213] raw: 00000007f69d2200 00000000000bf400 00000200ffffffff ffff97c8d71e9000 
> >> [ 9692.845654][   C25]  __alloc_pages_nodemask+0x1b1/0x450 
> >> [ 9692.845733][  T213] page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page)) 
> >> [ 9692.845757][   C25]  alloc_pages_vma+0x8a/0x2c0 
> >> [ 9692.845762][  T213] page->mem_cgroup:ffff97c8d71e9000 
> >> [ 9692.845779][   C25]  do_huge_pmd_anonymous_page+0x1d4/0xc30 
> >> [ 9692.845786][  T213] page_owner tracks the page as allocated 
> >> [ 9692.845807][   C25]  __handle_mm_fault+0xd27/0xd50 
> >> [ 9692.845819][  T213] page last allocated via order 9, migratetype Movable, gfp_mask 0x1c20ca(GFP_TRANSHUGE_LIGHT) 
> >> [ 9692.845833][   C25]  handle_mm_fault+0xfc/0x2f0 
> >> [ 9692.845856][  T213]  prep_new_page+0x1cd/0x1f0 
> >> [ 9692.845876][   C25]  do_page_fault+0x263/0x6f9 
> >> [ 9692.845892][  T213]  get_page_from_free7219][   C25]  clone_endio+0xe4/0x2c0 [dm_mod] 
> >> [ 9692.847226][  T213] kernel BUG at include/linux/page-flags.h:317! 
> >> [ 9692.847234][   C64] Write-error on swap-device (254:1:63030064) 
> >> [ 9692.847250][   C25]  bio_endio+0x297/0x560 
> >> [ 9692.847255][   C61] Write-error on swap-device (254:1:17105424) 
> >> [ 9692.847263][   C25]  ? bio_advance+0x92/0x190 
> >> [ 9692.847267][  C122] Write-error on swap-device (254:1:39080224) 
> >> [ 9692.847276][   C25]  blk_update_request+0x201/0x920 
> >> [ 9692.847287][   C25]  ? ___ratelimit+0x138/0x1e0 
> >> [ 9692.847312][   C25]  scsi_end_request+0x6b/0x4b0 
> >> [ 9692.847320][   C25]  ? ___ratelimit+0x3e/0x1e0 
> >> [ 9692.847329][   C25]  scsi_io_completion+0x509/0x7e0 
> >> [ 9692.847340][   C25]  scsi_finish_command+0x1ed/0x2a0 
> >> [ 9692.847350][   C25]  scsi_softirq_done+0x1c9/0x1d0 
> >> [ 9692.847354][   C64] Write-error on swap-device (254:1:63030072) 
> >> [ 9692.847363][   C25]  ? blk_mq_check_inflight+0xa0/0xa0 
> >> [ 9692.847383][   C61] Write-error on swap-device (254:1:17105432) 
> >> [ 9692.847397][  C122] Write-erro692.848242][   C25] RSP: 0018:ffffb02006b68bb0 EFLAGS: 00010086 
> >> [ 9692.848333][  T213] Workqueue: events pqi_ctrl_offline_worker [smartpqi] 
> >> [ 9692.848346][   C61] Write-error on swap-device (254:1:17105528) 
> >> [ 9692.848351][  C122] Write-error on swap-device (254:1:39080328) 
> >> [ 9692.848428][   C25] RAX: 0000000000000000 RBX: ffff97c76d8d4b80 RCX: ffffffff9d51e8f0 
> >> [ 9692.848433][   C61] Write-error on swap-device (254:1:17105536) 
> >> [ 9692.848437][  C122] Write-error on swap-device (254:1:39080336) 
> >> [ 9692.848445][  T213] RIP: 0010:end_swap_bio_write+0x90/0x280 
> >> [ 9692.848450][   C25] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff97c7e539f00a 
> >> [ 9692.848461][  T213] Code: 80 7b 1a 00 0f 84 58 01 00 00 4c 89 e6 bf 10 00 00 00 e8 33 f8 ff ff 84 c0 74 1d 48 c7 c6 18 7e 3b 9a 4c 89 e7 e8 20 78 fa ff <0f> 0b 48 c7 c7 b0 0b 6e 9a e8 04 ee 2b 00 4d 8d 74 24 08 ba 08 00 
> >> [ 9692.848466][   C25] RBP: ffffb02006b68bf0 R08: 0000000000000000 R09: 000097c7e539f00a 
> >> [ 9692.848471][  T213] RSP: 0018:ffffb02006dd3908 EFLAGS: 00010286 
> >> [ 9692.848485][   C25] R10: 0000b02006b689a0 R11: 000097c7e5ax_pmem_core efivars acpi_cpufreq efivarfs ip_tables x_tables xfs sd_mod smartpqi scsi_transport_sas tg3 mlx5_core 
> >> [ 9692.849818][  C122] Write-error on swap-device (254:1:39080440) 
> >> [ 9692.849884][  T213]  libphy 
> >> [ 9692.849889][  C122] Write-error on swap-device (254:1:39080448) 
> >> [ 9692.849952][  T213]  firmware_class 
> >> [ 9692.849959][  C122] Write-error on swap-device (254:1:39080456) 
> >> [ 9692.849963][  T213]  dm_mirror dm_region_hash dm_log dm_mod 
> >> [ 9692.849990][  C122] Write-error on swap-device (254:1:39080464) 
> >> [ 9692.850069][  C122] Write-error on swap-device (254:1:39080472) 
> >> [ 9692.850107][  C122] Write-error on swap-device (254:1:39080480) 
> >> [ 9692.850156][  C122] Write-error on swap-device (254:1:39080488) 
> >> [ 9692.850220][  C122] Write-error on swap-device (254:1:39080496) 
> >> [ 9692.850264][  C122] Write-error on swap-device (254:1:39080504) 
> >> [ 9692.850323][  C122] Write-error on swap-device (254:1:39080512) 
> >> [ 9692.850361][  C122] Write-error on swap-device (254:1:39080520) 
> >> [ 9692.850411][  C122] Write-error on swap-device (254:1:39080528)  
> >> 
> >>> 
> >>> Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
> >>> Cc: stable@vger.kernel.org
> >>> Signed-off-by: Rafael Aquini <aquini@redhat.com>
> >>> ---
> >>> mm/mprotect.c | 38 ++++++++++++++++++++++++++++++++++++--
> >>> 1 file changed, 36 insertions(+), 2 deletions(-)
> >>> 
> >>> diff --git a/mm/mprotect.c b/mm/mprotect.c
> >>> index 7a8e84f86831..9ea8cc0ab2fd 100644
> >>> --- a/mm/mprotect.c
> >>> +++ b/mm/mprotect.c
> >>> @@ -161,6 +161,31 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> >>> 	return pages;
> >>> }
> >>> 
> >>> +/*
> >>> + * Used when setting automatic NUMA hinting protection where it is
> >>> + * critical that a numa hinting PMD is not confused with a bad PMD.
> >>> + */
> >>> +static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
> >>> +{
> >>> +	pmd_t pmdval = pmd_read_atomic(pmd);
> >>> +
> >>> +	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
> >>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>> +	barrier();
> >>> +#endif
> >>> +
> >>> +	if (pmd_none(pmdval))
> >>> +		return 1;
> >>> +	if (pmd_trans_huge(pmdval))
> >>> +		return 0;
> >>> +	if (unlikely(pmd_bad(pmdval))) {
> >>> +		pmd_clear_bad(pmd);
> >>> +		return 1;
> >>> +	}
> >>> +
> >>> +	return 0;
> >>> +}
> >>> +
> >>> static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> >>> 		pud_t *pud, unsigned long addr, unsigned long end,
> >>> 		pgprot_t newprot, int dirty_accountable, int prot_numa)
> >>> @@ -178,8 +203,17 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> >>> 		unsigned long this_pages;
> >>> 
> >>> 		next = pmd_addr_end(addr, end);
> >>> -		if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
> >>> -				&& pmd_none_or_clear_bad(pmd))
> >>> +
> >>> +		/*
> >>> +		 * Automatic NUMA balancing walks the tables with mmap_sem
> >>> +		 * held for read. It's possible for a parallel update to occur
> >>> +		 * between pmd_trans_huge() and a pmd_none_or_clear_bad()
> >>> +		 * check, leading to a false positive and clearing.
> >>> +		 * Hence, it's necessary to atomically read the PMD value
> >>> +		 * for all the checks.
> >>> +		 */
> >>> +		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
> >>> +		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
> >>> 			goto next;
> >>> 
> >>> 		/* invoke the mmu notifier if the pmd is populated */
> >>> -- 
> >>> 2.24.1
> >>> 
> >>> 
> >> 
> > 
>
Qian Cai March 9, 2020, 3:27 a.m. UTC | #6
> On Mar 8, 2020, at 7:14 PM, Rafael Aquini <aquini@redhat.com> wrote:
> 
> On Sat, Mar 07, 2020 at 10:20:20PM -0500, Qian Cai wrote:
>> Not sure if it is really worth hurrying this into -rc5 now given...
>> 
>>> On Mar 6, 2020, at 10:05 PM, Rafael Aquini <aquini@redhat.com> wrote:
>>>> 
>>>> After reverting this patch, it is no longer possible to reproduce the bug under memory pressure
>>>> with swapping on linux-next 20200306,
>>>> 
>>> 
>>> Would you mind sharing your .config ? Also it would be nice to know your
>>> system and workload details so I can try to reproduce.
>> 
>> LTP oom02 in a tight loop,
>> 
>> # i=0; while :; do echo $((i++)); ./oom02; sleep 10; done
>> 
>> HPE ProLiant DL385 Gen10
>> AMD EPYC 7601 32-Core Processor
>> 65536 MB memory, 400 GB disk space
>> 
>> Processors	128
>> Cores	64
>> Sockets	2
>> 
>> linux-next 20200306
>> 
> 
> Thx! what about your .config?

https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config

My theory is this race,

From 73d4e8dad6cb248970bade5049cc072cd9090d0f Mon Sep 17 00:00:00 2001
From: Qian Cai <cai@lca.pw>
Date: Sun, 8 Mar 2020 17:37:51 -0400
Subject: [PATCH] mm/mprotect: fix a race in change_pmd_range()

The recent commit increased the race window between,

CPU0:
task_numa_work()
change_prot_numa()
change_protection()
change_protection_range()
change_pmd_range()
pmd_none_or_clear_bad_unless_trans_huge(pmd) <-- PMD is NULL.
		goto next;
				CPU1:
				do_huge_pmd_anonymous_page()
				__do_huge_pmd_anonymous_page()
				PMD is transhuge.

where it will not call __split_huge_pmd() properly and will trigger a
crash later on,

kernel BUG at include/linux/page-flags.h:317!

end_swap_bio_write()
  SetPageError(page)
      VM_BUG_ON_PAGE(1 && PageCompound(page))

Fix it by using a check similar to the one in pmd_trans_unstable().

Fixes: 8b272b3cbbb5 ("mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa")
Signed-off-by: Qian Cai <cai@lca.pw>
---
 mm/mprotect.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 311c0dadf71c..00dd2c66a434 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -165,7 +165,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
  * Used when setting automatic NUMA hinting protection where it is
  * critical that a numa hinting PMD is not confused with a bad PMD.
  */
-static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
+static inline int pmd_clear_bad_unless_none_or_trans_huge(pmd_t *pmd)
 {
 	pmd_t pmdval = pmd_read_atomic(pmd);
 
@@ -174,9 +174,7 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
 	barrier();
 #endif
 
-	if (pmd_none(pmdval))
-		return 1;
-	if (pmd_trans_huge(pmdval))
+	if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
 		return 0;
 	if (unlikely(pmd_bad(pmdval))) {
 		pmd_clear_bad(pmd);
@@ -213,7 +211,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		 * for all the checks.
 		 */
 		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
-		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
+		     pmd_clear_bad_unless_none_or_trans_huge(pmd))
 			goto next;
 
 		/* invoke the mmu notifier if the pmd is populated */
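
To make the behavioural difference between the two helpers concrete, here is a
minimal illustrative model in plain C. It is not kernel code: the enum and the
function names below are made up for illustration, and the real helpers operate
on pmd_t through pmd_read_atomic() and the pmd_*() predicates. The point is that
the original helper skips a none PMD outright, while the variant proposed above
lets a none PMD fall through, so a transparent huge PMD installed by a racing
fault is still seen and split by the caller.

enum pmd_state { PMD_STATE_NONE, PMD_STATE_TRANS_HUGE,
		 PMD_STATE_BAD, PMD_STATE_NORMAL };

/* Models the original helper: a none PMD returns 1, so change_pmd_range()
 * takes the "goto next" branch and never reaches the huge-PMD handling. */
static int skip_original(enum pmd_state s)
{
	if (s == PMD_STATE_NONE)
		return 1;	/* skip: a THP installed by a racing fault is never split */
	if (s == PMD_STATE_TRANS_HUGE)
		return 0;	/* fall through to the huge-PMD path in the caller */
	if (s == PMD_STATE_BAD)
		return 1;	/* clear the bad entry and skip */
	return 0;		/* normal page table: walk the PTEs */
}

/* Models the variant proposed in this patch: none is treated like trans_huge,
 * so the caller falls through and re-evaluates the entry instead of skipping. */
static int skip_proposed(enum pmd_state s)
{
	if (s == PMD_STATE_NONE || s == PMD_STATE_TRANS_HUGE)
		return 0;
	if (s == PMD_STATE_BAD)
		return 1;	/* still the only case that clears and skips */
	return 0;
}
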
Rafael Aquini March 9, 2020, 3:05 p.m. UTC | #7
On Sun, Mar 08, 2020 at 11:27:22PM -0400, Qian Cai wrote:
> 
> 
> > On Mar 8, 2020, at 7:14 PM, Rafael Aquini <aquini@redhat.com> wrote:
> > 
> > On Sat, Mar 07, 2020 at 10:20:20PM -0500, Qian Cai wrote:
> >> Not sure if it is really worth hurrying this into -rc5 now given...
> >> 
> >>> On Mar 6, 2020, at 10:05 PM, Rafael Aquini <aquini@redhat.com> wrote:
> >>>> 
> >>>> After reverting this patch, it is no longer possible to reproduce the bug under memory pressure
> >>>> with swapping on linux-next 20200306,
> >>>> 
> >>> 
> >>> Would you mind sharing your .config ? Also it would be nice to know your
> >>> system and workload details so I can try to reproduce.
> >> 
> >> LTP oom02 in a tight loop,
> >> 
> >> # i=0; while :; do echo $((i++)); ./oom02; sleep 10; done
> >> 
> >> HPE ProLiant DL385 Gen10
> >> AMD EPYC 7601 32-Core Processor
> >> 65536 MB memory, 400 GB disk space
> >> 
> >> Processors	128
> >> Cores	64
> >> Sockets	2
> >> 
> >> linux-next 20200306
> >> 
> > 
> > Thx! what about your .config?
> 
> https://raw.githubusercontent.com/cailca/linux-mm/master/x86.config
> 
> My theory is this race,
> 
> From 73d4e8dad6cb248970bade5049cc072cd9090d0f Mon Sep 17 00:00:00 2001
> From: Qian Cai <cai@lca.pw>
> Date: Sun, 8 Mar 2020 17:37:51 -0400
> Subject: [PATCH] mm/mprotect: fix a race in change_pmd_range()
> 
> The recent commit increased the race window between,
> 
> CPU0:
> task_numa_work()
> change_prot_numa()
> change_protection()
> change_protection_range()
> change_pmd_range()
> pmd_none_or_clear_bad_unless_trans_huge(pmd) <-- PMD is NULL.
> 		goto next;
> 				CPU1:
> 				do_huge_pmd_anonymous_page()
> 				__do_huge_pmd_anonymous_page()
> 				PMD is transhuge.
> 
> where it will not call __split_huge_pmd() properly and will trigger a
> crash later on,
> 
> kernel BUG at include/linux/page-flags.h:317!
> 
> end_swap_bio_write()
>   SetPageError(page)
>       VM_BUG_ON_PAGE(1 && PageCompound(page))
> 
> Fix it by using a check similar to the one in pmd_trans_unstable().
> 
> Fixes: 8b272b3cbbb5 ("mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa")
> Signed-off-by: Qian Cai <cai@lca.pw>
> ---
>  mm/mprotect.c | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 311c0dadf71c..00dd2c66a434 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -165,7 +165,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>   * Used when setting automatic NUMA hinting protection where it is
>   * critical that a numa hinting PMD is not confused with a bad PMD.
>   */
> -static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
> +static inline int pmd_clear_bad_unless_none_or_trans_huge(pmd_t *pmd)
>  {
>  	pmd_t pmdval = pmd_read_atomic(pmd);
>  
> @@ -174,9 +174,7 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
>  	barrier();
>  #endif
>  
> -	if (pmd_none(pmdval))
> -		return 1;
> -	if (pmd_trans_huge(pmdval))
> +	if (pmd_none(pmdval) || pmd_trans_huge(pmdval))


I'm still waiting on a similar system to become available, so I can
work on your reproducer case, as well as dig in and wrap my head around it.

I still don't think that skipping the pmd_none() case in the change-protection walk
should cause a problem as big as the one you observed here (yet, it seems it does), and
the fact that we need that race window to take the __split_huge_pmd() path suggests,
at least to me, that we might be missing the proper split somewhere else.

>  	if (unlikely(pmd_bad(pmdval))) {
>  		pmd_clear_bad(pmd);
> @@ -213,7 +211,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>  		 * for all the checks.
>  		 */
>  		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
> -		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
> +		     pmd_clear_bad_unless_none_or_trans_huge(pmd))
>  			goto next;
>  
>  		/* invoke the mmu notifier if the pmd is populated */
> -- 
> 2.21.0 (Apple Git-122.2)
> 
> 
> > 
> > 
> > 
> > Cheers,
> > -- Rafael
> > 
Qian Cai March 11, 2020, 12:04 a.m. UTC | #8
> On Mar 9, 2020, at 11:05 AM, Rafael Aquini <aquini@redhat.com> wrote:
> I'm still waiting on a similar system to become available, so I can
> work on your reproducer case, as well as to dig and wrap my head around it.
> 
> I still don't think that skipping the pmd_none() in the change-protection walk 
> should cause a big fuss like you observed here (yet, it seems it does), and 
> the fact that we need that race window to take the __split_huge_pmd() suggests,
> at least to me, that we might be missing this proper split somewhere else.

I have sent out another patch which should be more correct,

https://lore.kernel.org/lkml/20200310235846.1319-1-cai@lca.pw/
diff mbox series

Patch

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 7a8e84f86831..9ea8cc0ab2fd 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -161,6 +161,31 @@  static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 	return pages;
 }
 
+/*
+ * Used when setting automatic NUMA hinting protection where it is
+ * critical that a numa hinting PMD is not confused with a bad PMD.
+ */
+static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
+{
+	pmd_t pmdval = pmd_read_atomic(pmd);
+
+	/* See pmd_none_or_trans_huge_or_clear_bad for info on barrier */
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	barrier();
+#endif
+
+	if (pmd_none(pmdval))
+		return 1;
+	if (pmd_trans_huge(pmdval))
+		return 0;
+	if (unlikely(pmd_bad(pmdval))) {
+		pmd_clear_bad(pmd);
+		return 1;
+	}
+
+	return 0;
+}
+
 static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		pud_t *pud, unsigned long addr, unsigned long end,
 		pgprot_t newprot, int dirty_accountable, int prot_numa)
@@ -178,8 +203,17 @@  static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
 		unsigned long this_pages;
 
 		next = pmd_addr_end(addr, end);
-		if (!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd) && !pmd_devmap(*pmd)
-				&& pmd_none_or_clear_bad(pmd))
+
+		/*
+		 * Automatic NUMA balancing walks the tables with mmap_sem
+		 * held for read. It's possible for a parallel update to occur
+		 * between pmd_trans_huge() and a pmd_none_or_clear_bad()
+		 * check, leading to a false positive and clearing.
+		 * Hence, it's necessary to atomically read the PMD value
+		 * for all the checks.
+		 */
+		if (!is_swap_pmd(*pmd) && !pmd_devmap(*pmd) &&
+		     pmd_none_or_clear_bad_unless_trans_huge(pmd))
 			goto next;
 
 		/* invoke the mmu notifier if the pmd is populated */
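
The essential property of pmd_none_or_clear_bad_unless_trans_huge() above is
that the PMD is read once, via pmd_read_atomic(), into a local copy, and every
subsequent test runs against that snapshot, so a parallel update cannot change
the answer between the checks. A simplified sketch of that pattern in plain C
follows; the EXAMPLE_PMD_HUGE flag and the uint64_t representation are
hypothetical stand-ins, not the real pmd_t encoding.

#include <stdint.h>

#define EXAMPLE_PMD_HUGE	(1ULL << 7)	/* hypothetical "huge" bit */

/* Racy shape: each test dereferences the entry again, so another CPU can
 * rewrite the entry between the two reads -- the window described in the
 * comment added by the patch above. */
static int classify_racy(volatile uint64_t *pmd)
{
	if (*pmd & EXAMPLE_PMD_HUGE)	/* first read: not huge... */
		return 0;
	if (*pmd == 0)			/* second read: may already be a new entry */
		return 1;
	return 0;
}

/* Snapshot shape, mirroring the helper in the patch: one read, then every
 * check is made against the local copy. */
static int classify_snapshot(volatile uint64_t *pmd)
{
	uint64_t pmdval = *pmd;		/* single read of the entry */

	if (pmdval == 0)
		return 1;		/* none: skip */
	if (pmdval & EXAMPLE_PMD_HUGE)
		return 0;		/* huge: handled by the caller */
	return 0;			/* normal: walk the PTEs */
}
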