diff mbox series

[1/2] mm: hwpoison: don't drop slab caches for offlining non-LRU page

Message ID 20210816180909.3603-1-shy828301@gmail.com (mailing list archive)
State New
Headers show
Series [1/2] mm: hwpoison: don't drop slab caches for offlining non-LRU page | expand

Commit Message

Yang Shi Aug. 16, 2021, 6:09 p.m. UTC
In the current implementation of soft offline, if non-LRU page is met,
all the slab caches will be dropped to free the page then offline.  But
if the page is not slab page all the effort is wasted in vain.  Even
though it is a slab page, it is not guaranteed the page could be freed
at all.

However the side effect and cost is quite high.  It does not only drop
the slab caches, but also may drop a significant amount of page caches
which are associated with inode caches.  It could make the most
workingset gone in order to just offline a page.  And the offline is not
guaranteed to succeed at all, actually I really doubt the success rate
for real life workload.

Furthermore the worse consequence is the system may be locked up and
unusable since the page cache release may incur huge amount of works
queued for memcg release.

Actually we ran into such unpleasant case in our production environment.
Firstly, the workqueue of memory_failure_work_func is locked up as
below:

BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 53s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
  pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=14/256 refcnt=15
    in-flight: 409271:memory_failure_work_func
    pending: kfree_rcu_work, kfree_rcu_monitor, kfree_rcu_work, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, drain_local_stock, kfree_rcu_work
workqueue mm_percpu_wq: flags=0x8
  pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
    pending: vmstat_update
workqueue cgroup_destroy: flags=0x0
  pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1 refcnt=12072
    pending: css_release_work_fn

There were over 12K css_release_work_fn queued, and this caused a few
lockups due to the contention of worker pool lock with IRQ disabled, for
example:

NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
Modules linked in: amd64_edac_mod edac_mce_amd crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xt_DSCP iptable_mangle kvm_amd bpfilter vfat fat acpi_ipmi i2c_piix4 usb_storage ipmi_si k10temp i2c_core ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c crc32c_intel mlx5_core mlxfw nvme xhci_pci ptp nvme_core pps_core xhci_hcd
CPU: 1 PID: 205500 Comm: kworker/1:0 Tainted: G             L    5.10.32-t1.el7.twitter.x86_64 #1
Hardware name: TYAN F5AMT /z        /S8026GM2NRE-CGN, BIOS V8.030 03/30/2021
Workqueue: events memory_failure_work_func
RIP: 0010:queued_spin_lock_slowpath+0x41/0x1a0
Code: 41 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47
RSP: 0018:ffff9b2ac278f900 EFLAGS: 00000002
RAX: 0000000000480101 RBX: ffff8ce98ce71800 RCX: 0000000000000084
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8ce98ce6a140
RBP: 00000000000284c8 R08: ffffd7248dcb6808 R09: 0000000000000000
R10: 0000000000000003 R11: ffff9b2ac278f9b0 R12: 0000000000000001
R13: ffff8cb44dab9c00 R14: ffffffffbd1ce6a0 R15: ffff8cacaa37f068
FS:  0000000000000000(0000) GS:ffff8ce98ce40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fcf6e8cb000 CR3: 0000000a0c60a000 CR4: 0000000000350ee0
Call Trace:
 __queue_work+0xd6/0x3c0
 queue_work_on+0x1c/0x30
 uncharge_batch+0x10e/0x110
 mem_cgroup_uncharge_list+0x6d/0x80
 release_pages+0x37f/0x3f0
 __pagevec_release+0x1c/0x50
 __invalidate_mapping_pages+0x348/0x380
 ? xfs_alloc_buftarg+0xa4/0x120 [xfs]
 inode_lru_isolate+0x10a/0x160
 ? iput+0x1d0/0x1d0
 __list_lru_walk_one+0x7b/0x170
 ? iput+0x1d0/0x1d0
 list_lru_walk_one+0x4a/0x60
 prune_icache_sb+0x37/0x50
 super_cache_scan+0x123/0x1a0
 do_shrink_slab+0x10c/0x2c0
 shrink_slab+0x1f1/0x290
 drop_slab_node+0x4d/0x70
 soft_offline_page+0x1ac/0x5b0
 ? dev_mce_log+0xee/0x110
 ? notifier_call_chain+0x39/0x90
 memory_failure_work_func+0x6a/0x90
 process_one_work+0x19e/0x340
 ? process_one_work+0x340/0x340
 worker_thread+0x30/0x360
 ? process_one_work+0x340/0x340
 kthread+0x116/0x130

The lockup made the machine is quite unusable.  And it also made the
most workingset gone, the reclaimabled slab caches were reduced from 12G
to 300MB, the page caches were decreased from 17G to 4G.

But the most disappointing thing is all the effort doesn't make the page
offline, it just returns:

soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 ()

It seems the aggressive behavior for non-LRU page didn't pay back, so it
doesn't make too much sense to keep it considering the terrible side
effect.

Reported-by: David Mackey <tdmackey@twitter.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/mm.h   |  2 +-
 mm/hwpoison-inject.c |  2 +-
 mm/memory-failure.c  | 16 +++++-----------
 3 files changed, 7 insertions(+), 13 deletions(-)

Comments

David Hildenbrand Aug. 16, 2021, 7:02 p.m. UTC | #1
On 16.08.21 20:09, Yang Shi wrote:
> In the current implementation of soft offline, if non-LRU page is met,
> all the slab caches will be dropped to free the page then offline.  But
> if the page is not slab page all the effort is wasted in vain.  Even
> though it is a slab page, it is not guaranteed the page could be freed
> at all.
> 
> However the side effect and cost is quite high.  It does not only drop
> the slab caches, but also may drop a significant amount of page caches
> which are associated with inode caches.  It could make the most
> workingset gone in order to just offline a page.  And the offline is not
> guaranteed to succeed at all, actually I really doubt the success rate
> for real life workload.
> 
> Furthermore the worse consequence is the system may be locked up and
> unusable since the page cache release may incur huge amount of works
> queued for memcg release.
> 
> Actually we ran into such unpleasant case in our production environment.
> Firstly, the workqueue of memory_failure_work_func is locked up as
> below:
> 
> BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 53s!
> Showing busy workqueues and worker pools:
> workqueue events: flags=0x0
>    pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=14/256 refcnt=15
>      in-flight: 409271:memory_failure_work_func
>      pending: kfree_rcu_work, kfree_rcu_monitor, kfree_rcu_work, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, drain_local_stock, kfree_rcu_work
> workqueue mm_percpu_wq: flags=0x8
>    pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
>      pending: vmstat_update
> workqueue cgroup_destroy: flags=0x0
>    pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1 refcnt=12072
>      pending: css_release_work_fn
> 
> There were over 12K css_release_work_fn queued, and this caused a few
> lockups due to the contention of worker pool lock with IRQ disabled, for
> example:
> 
> NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
> Modules linked in: amd64_edac_mod edac_mce_amd crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xt_DSCP iptable_mangle kvm_amd bpfilter vfat fat acpi_ipmi i2c_piix4 usb_storage ipmi_si k10temp i2c_core ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c crc32c_intel mlx5_core mlxfw nvme xhci_pci ptp nvme_core pps_core xhci_hcd
> CPU: 1 PID: 205500 Comm: kworker/1:0 Tainted: G             L    5.10.32-t1.el7.twitter.x86_64 #1
> Hardware name: TYAN F5AMT /z        /S8026GM2NRE-CGN, BIOS V8.030 03/30/2021
> Workqueue: events memory_failure_work_func
> RIP: 0010:queued_spin_lock_slowpath+0x41/0x1a0
> Code: 41 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47
> RSP: 0018:ffff9b2ac278f900 EFLAGS: 00000002
> RAX: 0000000000480101 RBX: ffff8ce98ce71800 RCX: 0000000000000084
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8ce98ce6a140
> RBP: 00000000000284c8 R08: ffffd7248dcb6808 R09: 0000000000000000
> R10: 0000000000000003 R11: ffff9b2ac278f9b0 R12: 0000000000000001
> R13: ffff8cb44dab9c00 R14: ffffffffbd1ce6a0 R15: ffff8cacaa37f068
> FS:  0000000000000000(0000) GS:ffff8ce98ce40000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fcf6e8cb000 CR3: 0000000a0c60a000 CR4: 0000000000350ee0
> Call Trace:
>   __queue_work+0xd6/0x3c0
>   queue_work_on+0x1c/0x30
>   uncharge_batch+0x10e/0x110
>   mem_cgroup_uncharge_list+0x6d/0x80
>   release_pages+0x37f/0x3f0
>   __pagevec_release+0x1c/0x50
>   __invalidate_mapping_pages+0x348/0x380
>   ? xfs_alloc_buftarg+0xa4/0x120 [xfs]
>   inode_lru_isolate+0x10a/0x160
>   ? iput+0x1d0/0x1d0
>   __list_lru_walk_one+0x7b/0x170
>   ? iput+0x1d0/0x1d0
>   list_lru_walk_one+0x4a/0x60
>   prune_icache_sb+0x37/0x50
>   super_cache_scan+0x123/0x1a0
>   do_shrink_slab+0x10c/0x2c0
>   shrink_slab+0x1f1/0x290
>   drop_slab_node+0x4d/0x70
>   soft_offline_page+0x1ac/0x5b0
>   ? dev_mce_log+0xee/0x110
>   ? notifier_call_chain+0x39/0x90
>   memory_failure_work_func+0x6a/0x90
>   process_one_work+0x19e/0x340
>   ? process_one_work+0x340/0x340
>   worker_thread+0x30/0x360
>   ? process_one_work+0x340/0x340
>   kthread+0x116/0x130

Just curious, who actually ends up calling soft_offline_page() ? I 
cannot really make sense of this, looking at upstream Linux.

I can spot

a) drivers/base/memory.c: /sys/devices/system/memory/soft_offline_page 
seems to be a testing interface

b) MADV_SOFT_OFFLINE seems to be a testing interface as well

c) arch/parisc/kernel/pdt.c doesn't apply to your case I guess?

I'm just wondering who ends up calling soft_offline_page() in a 
production environment and via which call path. I'm most probably 
missing something.
David Hildenbrand Aug. 16, 2021, 7:04 p.m. UTC | #2
On 16.08.21 21:02, David Hildenbrand wrote:
> On 16.08.21 20:09, Yang Shi wrote:
>> In the current implementation of soft offline, if non-LRU page is met,
>> all the slab caches will be dropped to free the page then offline.  But
>> if the page is not slab page all the effort is wasted in vain.  Even
>> though it is a slab page, it is not guaranteed the page could be freed
>> at all.
>>
>> However the side effect and cost is quite high.  It does not only drop
>> the slab caches, but also may drop a significant amount of page caches
>> which are associated with inode caches.  It could make the most
>> workingset gone in order to just offline a page.  And the offline is not
>> guaranteed to succeed at all, actually I really doubt the success rate
>> for real life workload.
>>
>> Furthermore the worse consequence is the system may be locked up and
>> unusable since the page cache release may incur huge amount of works
>> queued for memcg release.
>>
>> Actually we ran into such unpleasant case in our production environment.
>> Firstly, the workqueue of memory_failure_work_func is locked up as
>> below:
>>
>> BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 53s!
>> Showing busy workqueues and worker pools:
>> workqueue events: flags=0x0
>>     pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=14/256 refcnt=15
>>       in-flight: 409271:memory_failure_work_func
>>       pending: kfree_rcu_work, kfree_rcu_monitor, kfree_rcu_work, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, drain_local_stock, kfree_rcu_work
>> workqueue mm_percpu_wq: flags=0x8
>>     pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
>>       pending: vmstat_update
>> workqueue cgroup_destroy: flags=0x0
>>     pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1 refcnt=12072
>>       pending: css_release_work_fn
>>
>> There were over 12K css_release_work_fn queued, and this caused a few
>> lockups due to the contention of worker pool lock with IRQ disabled, for
>> example:
>>
>> NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
>> Modules linked in: amd64_edac_mod edac_mce_amd crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xt_DSCP iptable_mangle kvm_amd bpfilter vfat fat acpi_ipmi i2c_piix4 usb_storage ipmi_si k10temp i2c_core ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c crc32c_intel mlx5_core mlxfw nvme xhci_pci ptp nvme_core pps_core xhci_hcd
>> CPU: 1 PID: 205500 Comm: kworker/1:0 Tainted: G             L    5.10.32-t1.el7.twitter.x86_64 #1
>> Hardware name: TYAN F5AMT /z        /S8026GM2NRE-CGN, BIOS V8.030 03/30/2021
>> Workqueue: events memory_failure_work_func
>> RIP: 0010:queued_spin_lock_slowpath+0x41/0x1a0
>> Code: 41 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47
>> RSP: 0018:ffff9b2ac278f900 EFLAGS: 00000002
>> RAX: 0000000000480101 RBX: ffff8ce98ce71800 RCX: 0000000000000084
>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8ce98ce6a140
>> RBP: 00000000000284c8 R08: ffffd7248dcb6808 R09: 0000000000000000
>> R10: 0000000000000003 R11: ffff9b2ac278f9b0 R12: 0000000000000001
>> R13: ffff8cb44dab9c00 R14: ffffffffbd1ce6a0 R15: ffff8cacaa37f068
>> FS:  0000000000000000(0000) GS:ffff8ce98ce40000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00007fcf6e8cb000 CR3: 0000000a0c60a000 CR4: 0000000000350ee0
>> Call Trace:
>>    __queue_work+0xd6/0x3c0
>>    queue_work_on+0x1c/0x30
>>    uncharge_batch+0x10e/0x110
>>    mem_cgroup_uncharge_list+0x6d/0x80
>>    release_pages+0x37f/0x3f0
>>    __pagevec_release+0x1c/0x50
>>    __invalidate_mapping_pages+0x348/0x380
>>    ? xfs_alloc_buftarg+0xa4/0x120 [xfs]
>>    inode_lru_isolate+0x10a/0x160
>>    ? iput+0x1d0/0x1d0
>>    __list_lru_walk_one+0x7b/0x170
>>    ? iput+0x1d0/0x1d0
>>    list_lru_walk_one+0x4a/0x60
>>    prune_icache_sb+0x37/0x50
>>    super_cache_scan+0x123/0x1a0
>>    do_shrink_slab+0x10c/0x2c0
>>    shrink_slab+0x1f1/0x290
>>    drop_slab_node+0x4d/0x70
>>    soft_offline_page+0x1ac/0x5b0
>>    ? dev_mce_log+0xee/0x110
>>    ? notifier_call_chain+0x39/0x90
>>    memory_failure_work_func+0x6a/0x90
>>    process_one_work+0x19e/0x340
>>    ? process_one_work+0x340/0x340
>>    worker_thread+0x30/0x360
>>    ? process_one_work+0x340/0x340
>>    kthread+0x116/0x130
> 
> Just curious, who actually ends up calling soft_offline_page() ? I
> cannot really make sense of this, looking at upstream Linux.
> 
> I can spot
> 
> a) drivers/base/memory.c: /sys/devices/system/memory/soft_offline_page
> seems to be a testing interface
> 
> b) MADV_SOFT_OFFLINE seems to be a testing interface as well
> 
> c) arch/parisc/kernel/pdt.c doesn't apply to your case I guess?
> 
> I'm just wondering who ends up calling soft_offline_page() in a
> production environment and via which call path. I'm most probably
> missing something.
> 

... and I missed memory_failure_work_func() with MF_SOFT_OFFLINE :)

Ignore my question :)
David Hildenbrand Aug. 16, 2021, 7:15 p.m. UTC | #3
On 16.08.21 20:09, Yang Shi wrote:
> In the current implementation of soft offline, if non-LRU page is met,
> all the slab caches will be dropped to free the page then offline.  But
> if the page is not slab page all the effort is wasted in vain.  Even
> though it is a slab page, it is not guaranteed the page could be freed
> at all.

... but there is a chance it could be and the current behavior is 
actually helpful in some setups.

[...]

> The lockup made the machine is quite unusable.  And it also made the
> most workingset gone, the reclaimabled slab caches were reduced from 12G
> to 300MB, the page caches were decreased from 17G to 4G.
> 
> But the most disappointing thing is all the effort doesn't make the page
> offline, it just returns:
> 
> soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 ()
> 

In your example, yes. I had a look at the introducing commit: 
facb6011f399 ("HWPOISON: Add soft page offline support")

"
     When the page is not free or LRU we try to free pages
     from slab and other caches. The slab freeing is currently
     quite dumb and does not try to focus on the specific slab
     cache which might own the page. This could be potentially
     improved later.
"

I wonder, if instead of removing it altogether, we could actually 
improve it as envisioned.

To be precise, for alloc_contig_range() it would also make sense to be 
able to shrink only in a specific physical memory range; this here seems 
to be a similar thing. (actually, alloc_contig_range(), actual memory 
offlining and hw poisoning/soft-offlining have a lot in common)

Unfortunately, the last time I took a brief look at teaching shrinkers 
to be range-aware, it turned out to be a lot of work ... so maybe this 
is really a long term goal to be mitigated in the meantime by disabling 
it, if it turns out to be more of a problem than actually help.
Yang Shi Aug. 16, 2021, 7:37 p.m. UTC | #4
On Mon, Aug 16, 2021 at 12:15 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 16.08.21 20:09, Yang Shi wrote:
> > In the current implementation of soft offline, if non-LRU page is met,
> > all the slab caches will be dropped to free the page then offline.  But
> > if the page is not slab page all the effort is wasted in vain.  Even
> > though it is a slab page, it is not guaranteed the page could be freed
> > at all.
>
> ... but there is a chance it could be and the current behavior is
> actually helpful in some setups.

I don't disagree it is kind of helpful for some cases, but the
question is how likely it is helpful and if the cost is worth it or
not. For non-slab page (of course, non-lru too), dropping slab doesn't
make any sense. Even though it is slab page, it must be a reclaimable
slab. Even though it is a reclaimable slab, dropping slab can't
guarantee all objects on the same page are dropped.

IMHO the likelihood is not worth the cost and side effect, for example
the unsuable system.

>
> [...]
>
> > The lockup made the machine is quite unusable.  And it also made the
> > most workingset gone, the reclaimabled slab caches were reduced from 12G
> > to 300MB, the page caches were decreased from 17G to 4G.
> >
> > But the most disappointing thing is all the effort doesn't make the page
> > offline, it just returns:
> >
> > soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 ()
> >
>
> In your example, yes. I had a look at the introducing commit:
> facb6011f399 ("HWPOISON: Add soft page offline support")
>
> "
>      When the page is not free or LRU we try to free pages
>      from slab and other caches. The slab freeing is currently
>      quite dumb and does not try to focus on the specific slab
>      cache which might own the page. This could be potentially
>      improved later.
> "
>
> I wonder, if instead of removing it altogether, we could actually
> improve it as envisioned.
>
> To be precise, for alloc_contig_range() it would also make sense to be
> able to shrink only in a specific physical memory range; this here seems
> to be a similar thing. (actually, alloc_contig_range(), actual memory
> offlining and hw poisoning/soft-offlining have a lot in common)
>
> Unfortunately, the last time I took a brief look at teaching shrinkers
> to be range-aware, it turned out to be a lot of work ... so maybe this
> is really a long term goal to be mitigated in the meantime by disabling
> it, if it turns out to be more of a problem than actually help.

Do you mean physical page range? Yes, it would need a lot of work.
TBH, I don't think it is quite feasible for the time being.

The problem is slabs for shrinker are managed by objects rather than
pages. For example, dentry and inode objects (the most consumed
reclaimable slabs) are linked to lru, and shrinkers traverse the lru
to shrink the objects. The objects in a certain range can not be
guaranteed in the same range of physical pages.

>
> --
> Thanks,
>
> David / dhildenb
>
Matthew Wilcox Aug. 16, 2021, 7:37 p.m. UTC | #5
On Mon, Aug 16, 2021 at 11:09:08AM -0700, Yang Shi wrote:
> But the most disappointing thing is all the effort doesn't make the page
> offline, it just returns:
> 
> soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 ()

It's a shame it doesn't call dump_page().  There might be more
interesting information somewhere in struct page that would help us
figure out what kind of page it was in your environment.  For example,
it might be a page table page or a page allocated for vmalloc(), and
in both those cases, there are things we might be able to do (we'd
certainly be able to figure out that it isn't worth shrinking slab!)
David Hildenbrand Aug. 16, 2021, 7:40 p.m. UTC | #6
On 16.08.21 21:37, Yang Shi wrote:
> On Mon, Aug 16, 2021 at 12:15 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 16.08.21 20:09, Yang Shi wrote:
>>> In the current implementation of soft offline, if non-LRU page is met,
>>> all the slab caches will be dropped to free the page then offline.  But
>>> if the page is not slab page all the effort is wasted in vain.  Even
>>> though it is a slab page, it is not guaranteed the page could be freed
>>> at all.
>>
>> ... but there is a chance it could be and the current behavior is
>> actually helpful in some setups.
> 
> I don't disagree it is kind of helpful for some cases, but the
> question is how likely it is helpful and if the cost is worth it or
> not. For non-slab page (of course, non-lru too), dropping slab doesn't
> make any sense. Even though it is slab page, it must be a reclaimable
> slab. Even though it is a reclaimable slab, dropping slab can't
> guarantee all objects on the same page are dropped.
> 
> IMHO the likelihood is not worth the cost and side effect, for example
> the unsuable system.
> 
>>
>> [...]
>>
>>> The lockup made the machine is quite unusable.  And it also made the
>>> most workingset gone, the reclaimabled slab caches were reduced from 12G
>>> to 300MB, the page caches were decreased from 17G to 4G.
>>>
>>> But the most disappointing thing is all the effort doesn't make the page
>>> offline, it just returns:
>>>
>>> soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 ()
>>>
>>
>> In your example, yes. I had a look at the introducing commit:
>> facb6011f399 ("HWPOISON: Add soft page offline support")
>>
>> "
>>       When the page is not free or LRU we try to free pages
>>       from slab and other caches. The slab freeing is currently
>>       quite dumb and does not try to focus on the specific slab
>>       cache which might own the page. This could be potentially
>>       improved later.
>> "
>>
>> I wonder, if instead of removing it altogether, we could actually
>> improve it as envisioned.
>>
>> To be precise, for alloc_contig_range() it would also make sense to be
>> able to shrink only in a specific physical memory range; this here seems
>> to be a similar thing. (actually, alloc_contig_range(), actual memory
>> offlining and hw poisoning/soft-offlining have a lot in common)
>>
>> Unfortunately, the last time I took a brief look at teaching shrinkers
>> to be range-aware, it turned out to be a lot of work ... so maybe this
>> is really a long term goal to be mitigated in the meantime by disabling
>> it, if it turns out to be more of a problem than actually help.
> 
> Do you mean physical page range? Yes, it would need a lot of work.
> TBH, I don't think it is quite feasible for the time being.
> 
> The problem is slabs for shrinker are managed by objects rather than
> pages. For example, dentry and inode objects (the most consumed
> reclaimable slabs) are linked to lru, and shrinkers traverse the lru
> to shrink the objects. The objects in a certain range can not be
> guaranteed in the same range of physical pages.

Right, essentially you would have to look at each individual object and 
test if it falls into the physical range of interest. Not that it can't 
be done I guess, but it screams to be a lot of work.
Yang Shi Aug. 16, 2021, 8:24 p.m. UTC | #7
On Mon, Aug 16, 2021 at 12:38 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Aug 16, 2021 at 11:09:08AM -0700, Yang Shi wrote:
> > But the most disappointing thing is all the effort doesn't make the page
> > offline, it just returns:
> >
> > soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 ()
>
> It's a shame it doesn't call dump_page().  There might be more
> interesting information somewhere in struct page that would help us
> figure out what kind of page it was in your environment.  For example,
> it might be a page table page or a page allocated for vmalloc(), and
> in both those cases, there are things we might be able to do (we'd
> certainly be able to figure out that it isn't worth shrinking slab!)

Yes,  dump_page() could provide more information to us. I could add a
new patch or just update this patch to call dump_page() if offline is
failed if the hwpoison maintainer agrees to this as well.
HORIGUCHI NAOYA(堀口 直也) Aug. 18, 2021, 5:02 a.m. UTC | #8
On Mon, Aug 16, 2021 at 01:24:25PM -0700, Yang Shi wrote:
> On Mon, Aug 16, 2021 at 12:38 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Aug 16, 2021 at 11:09:08AM -0700, Yang Shi wrote:
> > > But the most disappointing thing is all the effort doesn't make the page
> > > offline, it just returns:
> > >
> > > soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 ()
> >
> > It's a shame it doesn't call dump_page().  There might be more
> > interesting information somewhere in struct page that would help us
> > figure out what kind of page it was in your environment.  For example,
> > it might be a page table page or a page allocated for vmalloc(), and
> > in both those cases, there are things we might be able to do (we'd
> > certainly be able to figure out that it isn't worth shrinking slab!)
> 
> Yes,  dump_page() could provide more information to us. I could add a
> new patch or just update this patch to call dump_page() if offline is
> failed if the hwpoison maintainer agrees to this as well.

I agree with showing more information in failure case. Thanks for the input.

- Naoya Horiguchi
Naoya Horiguchi Aug. 18, 2021, 6:30 a.m. UTC | #9
On Mon, Aug 16, 2021 at 11:09:08AM -0700, Yang Shi wrote:
> In the current implementation of soft offline, if non-LRU page is met,
> all the slab caches will be dropped to free the page then offline.  But
> if the page is not slab page all the effort is wasted in vain.  Even
> though it is a slab page, it is not guaranteed the page could be freed
> at all.
> 
> However the side effect and cost is quite high.  It does not only drop
> the slab caches, but also may drop a significant amount of page caches
> which are associated with inode caches.  It could make the most
> workingset gone in order to just offline a page.  And the offline is not
> guaranteed to succeed at all, actually I really doubt the success rate
> for real life workload.
> 
> Furthermore the worse consequence is the system may be locked up and
> unusable since the page cache release may incur huge amount of works
> queued for memcg release.
> 
> Actually we ran into such unpleasant case in our production environment.
> Firstly, the workqueue of memory_failure_work_func is locked up as
> below:
> 
> BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 53s!
> Showing busy workqueues and worker pools:
> workqueue events: flags=0x0
>   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=14/256 refcnt=15
>     in-flight: 409271:memory_failure_work_func
>     pending: kfree_rcu_work, kfree_rcu_monitor, kfree_rcu_work, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, drain_local_stock, kfree_rcu_work
> workqueue mm_percpu_wq: flags=0x8
>   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
>     pending: vmstat_update
> workqueue cgroup_destroy: flags=0x0
>   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1 refcnt=12072
>     pending: css_release_work_fn
> 
> There were over 12K css_release_work_fn queued, and this caused a few
> lockups due to the contention of worker pool lock with IRQ disabled, for
> example:
> 
> NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
> Modules linked in: amd64_edac_mod edac_mce_amd crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xt_DSCP iptable_mangle kvm_amd bpfilter vfat fat acpi_ipmi i2c_piix4 usb_storage ipmi_si k10temp i2c_core ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c crc32c_intel mlx5_core mlxfw nvme xhci_pci ptp nvme_core pps_core xhci_hcd
> CPU: 1 PID: 205500 Comm: kworker/1:0 Tainted: G             L    5.10.32-t1.el7.twitter.x86_64 #1
> Hardware name: TYAN F5AMT /z        /S8026GM2NRE-CGN, BIOS V8.030 03/30/2021
> Workqueue: events memory_failure_work_func
> RIP: 0010:queued_spin_lock_slowpath+0x41/0x1a0
> Code: 41 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47
> RSP: 0018:ffff9b2ac278f900 EFLAGS: 00000002
> RAX: 0000000000480101 RBX: ffff8ce98ce71800 RCX: 0000000000000084
> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8ce98ce6a140
> RBP: 00000000000284c8 R08: ffffd7248dcb6808 R09: 0000000000000000
> R10: 0000000000000003 R11: ffff9b2ac278f9b0 R12: 0000000000000001
> R13: ffff8cb44dab9c00 R14: ffffffffbd1ce6a0 R15: ffff8cacaa37f068
> FS:  0000000000000000(0000) GS:ffff8ce98ce40000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fcf6e8cb000 CR3: 0000000a0c60a000 CR4: 0000000000350ee0
> Call Trace:
>  __queue_work+0xd6/0x3c0
>  queue_work_on+0x1c/0x30
>  uncharge_batch+0x10e/0x110
>  mem_cgroup_uncharge_list+0x6d/0x80
>  release_pages+0x37f/0x3f0
>  __pagevec_release+0x1c/0x50
>  __invalidate_mapping_pages+0x348/0x380
>  ? xfs_alloc_buftarg+0xa4/0x120 [xfs]
>  inode_lru_isolate+0x10a/0x160
>  ? iput+0x1d0/0x1d0
>  __list_lru_walk_one+0x7b/0x170
>  ? iput+0x1d0/0x1d0
>  list_lru_walk_one+0x4a/0x60
>  prune_icache_sb+0x37/0x50
>  super_cache_scan+0x123/0x1a0
>  do_shrink_slab+0x10c/0x2c0
>  shrink_slab+0x1f1/0x290
>  drop_slab_node+0x4d/0x70
>  soft_offline_page+0x1ac/0x5b0
>  ? dev_mce_log+0xee/0x110
>  ? notifier_call_chain+0x39/0x90
>  memory_failure_work_func+0x6a/0x90
>  process_one_work+0x19e/0x340
>  ? process_one_work+0x340/0x340
>  worker_thread+0x30/0x360
>  ? process_one_work+0x340/0x340
>  kthread+0x116/0x130
> 
> The lockup made the machine is quite unusable.  And it also made the
> most workingset gone, the reclaimabled slab caches were reduced from 12G
> to 300MB, the page caches were decreased from 17G to 4G.
> 
> But the most disappointing thing is all the effort doesn't make the page
> offline, it just returns:
> 
> soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 ()
> 
> It seems the aggressive behavior for non-LRU page didn't pay back, so it
> doesn't make too much sense to keep it considering the terrible side
> effect.
> 
> Reported-by: David Mackey <tdmackey@twitter.com>
> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Signed-off-by: Yang Shi <shy828301@gmail.com>

Thank you. I agree with the idea of dropping drop_slab_node() in shake_page(),
hoping that range-based slab shrinker will be implemented in the future.

This patch conflicts with the patch
https://lore.kernel.org/linux-mm/20210817053703.2267588-1-naoya.horiguchi@linux.dev/T/#u
which adds another shake_page(), so could you add the following hunk in your patch?

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 64f8ac969544..7dd2ca665866 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1198,7 +1198,7 @@ static int get_any_page(struct page *p, unsigned long flags)
 			 * page, retry.
 			 */
 			if (pass++ < 3) {
-				shake_page(p, 1);
+				shake_page(p);
 				goto try_again;
 			}
 			goto out;


Thanks,
Naoya Horiguchi
David Hildenbrand Aug. 18, 2021, 7:24 a.m. UTC | #10
On 18.08.21 08:30, Naoya Horiguchi wrote:
> On Mon, Aug 16, 2021 at 11:09:08AM -0700, Yang Shi wrote:
>> In the current implementation of soft offline, if non-LRU page is met,
>> all the slab caches will be dropped to free the page then offline.  But
>> if the page is not slab page all the effort is wasted in vain.  Even
>> though it is a slab page, it is not guaranteed the page could be freed
>> at all.
>>
>> However the side effect and cost is quite high.  It does not only drop
>> the slab caches, but also may drop a significant amount of page caches
>> which are associated with inode caches.  It could make the most
>> workingset gone in order to just offline a page.  And the offline is not
>> guaranteed to succeed at all, actually I really doubt the success rate
>> for real life workload.
>>
>> Furthermore the worse consequence is the system may be locked up and
>> unusable since the page cache release may incur huge amount of works
>> queued for memcg release.
>>
>> Actually we ran into such unpleasant case in our production environment.
>> Firstly, the workqueue of memory_failure_work_func is locked up as
>> below:
>>
>> BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 53s!
>> Showing busy workqueues and worker pools:
>> workqueue events: flags=0x0
>>    pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=14/256 refcnt=15
>>      in-flight: 409271:memory_failure_work_func
>>      pending: kfree_rcu_work, kfree_rcu_monitor, kfree_rcu_work, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, drain_local_stock, kfree_rcu_work
>> workqueue mm_percpu_wq: flags=0x8
>>    pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
>>      pending: vmstat_update
>> workqueue cgroup_destroy: flags=0x0
>>    pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1 refcnt=12072
>>      pending: css_release_work_fn
>>
>> There were over 12K css_release_work_fn queued, and this caused a few
>> lockups due to the contention of worker pool lock with IRQ disabled, for
>> example:
>>
>> NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
>> Modules linked in: amd64_edac_mod edac_mce_amd crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xt_DSCP iptable_mangle kvm_amd bpfilter vfat fat acpi_ipmi i2c_piix4 usb_storage ipmi_si k10temp i2c_core ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c crc32c_intel mlx5_core mlxfw nvme xhci_pci ptp nvme_core pps_core xhci_hcd
>> CPU: 1 PID: 205500 Comm: kworker/1:0 Tainted: G             L    5.10.32-t1.el7.twitter.x86_64 #1
>> Hardware name: TYAN F5AMT /z        /S8026GM2NRE-CGN, BIOS V8.030 03/30/2021
>> Workqueue: events memory_failure_work_func
>> RIP: 0010:queued_spin_lock_slowpath+0x41/0x1a0
>> Code: 41 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47
>> RSP: 0018:ffff9b2ac278f900 EFLAGS: 00000002
>> RAX: 0000000000480101 RBX: ffff8ce98ce71800 RCX: 0000000000000084
>> RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8ce98ce6a140
>> RBP: 00000000000284c8 R08: ffffd7248dcb6808 R09: 0000000000000000
>> R10: 0000000000000003 R11: ffff9b2ac278f9b0 R12: 0000000000000001
>> R13: ffff8cb44dab9c00 R14: ffffffffbd1ce6a0 R15: ffff8cacaa37f068
>> FS:  0000000000000000(0000) GS:ffff8ce98ce40000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00007fcf6e8cb000 CR3: 0000000a0c60a000 CR4: 0000000000350ee0
>> Call Trace:
>>   __queue_work+0xd6/0x3c0
>>   queue_work_on+0x1c/0x30
>>   uncharge_batch+0x10e/0x110
>>   mem_cgroup_uncharge_list+0x6d/0x80
>>   release_pages+0x37f/0x3f0
>>   __pagevec_release+0x1c/0x50
>>   __invalidate_mapping_pages+0x348/0x380
>>   ? xfs_alloc_buftarg+0xa4/0x120 [xfs]
>>   inode_lru_isolate+0x10a/0x160
>>   ? iput+0x1d0/0x1d0
>>   __list_lru_walk_one+0x7b/0x170
>>   ? iput+0x1d0/0x1d0
>>   list_lru_walk_one+0x4a/0x60
>>   prune_icache_sb+0x37/0x50
>>   super_cache_scan+0x123/0x1a0
>>   do_shrink_slab+0x10c/0x2c0
>>   shrink_slab+0x1f1/0x290
>>   drop_slab_node+0x4d/0x70
>>   soft_offline_page+0x1ac/0x5b0
>>   ? dev_mce_log+0xee/0x110
>>   ? notifier_call_chain+0x39/0x90
>>   memory_failure_work_func+0x6a/0x90
>>   process_one_work+0x19e/0x340
>>   ? process_one_work+0x340/0x340
>>   worker_thread+0x30/0x360
>>   ? process_one_work+0x340/0x340
>>   kthread+0x116/0x130
>>
>> The lockup made the machine is quite unusable.  And it also made the
>> most workingset gone, the reclaimabled slab caches were reduced from 12G
>> to 300MB, the page caches were decreased from 17G to 4G.
>>
>> But the most disappointing thing is all the effort doesn't make the page
>> offline, it just returns:
>>
>> soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 ()
>>
>> It seems the aggressive behavior for non-LRU page didn't pay back, so it
>> doesn't make too much sense to keep it considering the terrible side
>> effect.
>>
>> Reported-by: David Mackey <tdmackey@twitter.com>
>> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Signed-off-by: Yang Shi <shy828301@gmail.com>
> 
> Thank you. I agree with the idea of dropping drop_slab_node() in shake_page(),
> hoping that range-based slab shrinker will be implemented in the future.
> 
> This patch conflicts with the patch
> https://lore.kernel.org/linux-mm/20210817053703.2267588-1-naoya.horiguchi@linux.dev/T/#u
> which adds another shake_page(), so could you add the following hunk in your patch?
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 64f8ac969544..7dd2ca665866 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1198,7 +1198,7 @@ static int get_any_page(struct page *p, unsigned long flags)
>   			 * page, retry.
>   			 */
>   			if (pass++ < 3) {
> -				shake_page(p, 1);
> +				shake_page(p);
>   				goto try_again;
>   			}
>   			goto out;
> 
> 
> Thanks,
> Naoya Horiguchi
> 

Might we want to add a TODO in the code? We have a similar one in 
mm/page_isolation.c:set_migratetype_isolate() and it's certainly a 
reminder that something of value is missing.
HORIGUCHI NAOYA(堀口 直也) Aug. 18, 2021, 7:53 a.m. UTC | #11
On Wed, Aug 18, 2021 at 09:24:01AM +0200, David Hildenbrand wrote:
...
> 
> Might we want to add a TODO in the code? We have a similar one in
> mm/page_isolation.c:set_migratetype_isolate() and it's certainly a reminder
> that something of value is missing.

Yes, that will be helpful.  The below's what's in my mind, but if someone
has better idea, that's fine.

@@ -296,11 +296,9 @@ void shake_page(struct page *p, int access)
 	}
 	
 	/*
-	 * Only call shrink_node_slabs here (which would also shrink
-	 * other caches) if access is not potentially fatal.
+	 * TODO: Could shrink slab caches here if a lightweight range-based
+	 * shrinker will be available.
 	 */
-	if (access)
-		drop_slab_node(page_to_nid(p));
 }
 EXPORT_SYMBOL_GPL(shake_page);


- Naoya Horiguchi
David Hildenbrand Aug. 18, 2021, 7:55 a.m. UTC | #12
On 18.08.21 09:53, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Wed, Aug 18, 2021 at 09:24:01AM +0200, David Hildenbrand wrote:
> ...
>>
>> Might we want to add a TODO in the code? We have a similar one in
>> mm/page_isolation.c:set_migratetype_isolate() and it's certainly a reminder
>> that something of value is missing.
> 
> Yes, that will be helpful.  The below's what's in my mind, but if someone
> has better idea, that's fine.
> 
> @@ -296,11 +296,9 @@ void shake_page(struct page *p, int access)
>   	}
>   	
>   	/*
> -	 * Only call shrink_node_slabs here (which would also shrink
> -	 * other caches) if access is not potentially fatal.
> +	 * TODO: Could shrink slab caches here if a lightweight range-based
> +	 * shrinker will be available.
>   	 */
> -	if (access)
> -		drop_slab_node(page_to_nid(p));
>   }
>   EXPORT_SYMBOL_GPL(shake_page);

Just what I had in mind, thanks!
Yang Shi Aug. 18, 2021, 5:02 p.m. UTC | #13
On Tue, Aug 17, 2021 at 11:30 PM Naoya Horiguchi
<naoya.horiguchi@linux.dev> wrote:
>
> On Mon, Aug 16, 2021 at 11:09:08AM -0700, Yang Shi wrote:
> > In the current implementation of soft offline, if non-LRU page is met,
> > all the slab caches will be dropped to free the page then offline.  But
> > if the page is not slab page all the effort is wasted in vain.  Even
> > though it is a slab page, it is not guaranteed the page could be freed
> > at all.
> >
> > However the side effect and cost is quite high.  It does not only drop
> > the slab caches, but also may drop a significant amount of page caches
> > which are associated with inode caches.  It could make the most
> > workingset gone in order to just offline a page.  And the offline is not
> > guaranteed to succeed at all, actually I really doubt the success rate
> > for real life workload.
> >
> > Furthermore the worse consequence is the system may be locked up and
> > unusable since the page cache release may incur huge amount of works
> > queued for memcg release.
> >
> > Actually we ran into such unpleasant case in our production environment.
> > Firstly, the workqueue of memory_failure_work_func is locked up as
> > below:
> >
> > BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 53s!
> > Showing busy workqueues and worker pools:
> > workqueue events: flags=0x0
> >   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=14/256 refcnt=15
> >     in-flight: 409271:memory_failure_work_func
> >     pending: kfree_rcu_work, kfree_rcu_monitor, kfree_rcu_work, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, drain_local_stock, kfree_rcu_work
> > workqueue mm_percpu_wq: flags=0x8
> >   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
> >     pending: vmstat_update
> > workqueue cgroup_destroy: flags=0x0
> >   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1 refcnt=12072
> >     pending: css_release_work_fn
> >
> > There were over 12K css_release_work_fn queued, and this caused a few
> > lockups due to the contention of worker pool lock with IRQ disabled, for
> > example:
> >
> > NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
> > Modules linked in: amd64_edac_mod edac_mce_amd crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xt_DSCP iptable_mangle kvm_amd bpfilter vfat fat acpi_ipmi i2c_piix4 usb_storage ipmi_si k10temp i2c_core ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c crc32c_intel mlx5_core mlxfw nvme xhci_pci ptp nvme_core pps_core xhci_hcd
> > CPU: 1 PID: 205500 Comm: kworker/1:0 Tainted: G             L    5.10.32-t1.el7.twitter.x86_64 #1
> > Hardware name: TYAN F5AMT /z        /S8026GM2NRE-CGN, BIOS V8.030 03/30/2021
> > Workqueue: events memory_failure_work_func
> > RIP: 0010:queued_spin_lock_slowpath+0x41/0x1a0
> > Code: 41 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47
> > RSP: 0018:ffff9b2ac278f900 EFLAGS: 00000002
> > RAX: 0000000000480101 RBX: ffff8ce98ce71800 RCX: 0000000000000084
> > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8ce98ce6a140
> > RBP: 00000000000284c8 R08: ffffd7248dcb6808 R09: 0000000000000000
> > R10: 0000000000000003 R11: ffff9b2ac278f9b0 R12: 0000000000000001
> > R13: ffff8cb44dab9c00 R14: ffffffffbd1ce6a0 R15: ffff8cacaa37f068
> > FS:  0000000000000000(0000) GS:ffff8ce98ce40000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007fcf6e8cb000 CR3: 0000000a0c60a000 CR4: 0000000000350ee0
> > Call Trace:
> >  __queue_work+0xd6/0x3c0
> >  queue_work_on+0x1c/0x30
> >  uncharge_batch+0x10e/0x110
> >  mem_cgroup_uncharge_list+0x6d/0x80
> >  release_pages+0x37f/0x3f0
> >  __pagevec_release+0x1c/0x50
> >  __invalidate_mapping_pages+0x348/0x380
> >  ? xfs_alloc_buftarg+0xa4/0x120 [xfs]
> >  inode_lru_isolate+0x10a/0x160
> >  ? iput+0x1d0/0x1d0
> >  __list_lru_walk_one+0x7b/0x170
> >  ? iput+0x1d0/0x1d0
> >  list_lru_walk_one+0x4a/0x60
> >  prune_icache_sb+0x37/0x50
> >  super_cache_scan+0x123/0x1a0
> >  do_shrink_slab+0x10c/0x2c0
> >  shrink_slab+0x1f1/0x290
> >  drop_slab_node+0x4d/0x70
> >  soft_offline_page+0x1ac/0x5b0
> >  ? dev_mce_log+0xee/0x110
> >  ? notifier_call_chain+0x39/0x90
> >  memory_failure_work_func+0x6a/0x90
> >  process_one_work+0x19e/0x340
> >  ? process_one_work+0x340/0x340
> >  worker_thread+0x30/0x360
> >  ? process_one_work+0x340/0x340
> >  kthread+0x116/0x130
> >
> > The lockup made the machine is quite unusable.  And it also made the
> > most workingset gone, the reclaimabled slab caches were reduced from 12G
> > to 300MB, the page caches were decreased from 17G to 4G.
> >
> > But the most disappointing thing is all the effort doesn't make the page
> > offline, it just returns:
> >
> > soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 ()
> >
> > It seems the aggressive behavior for non-LRU page didn't pay back, so it
> > doesn't make too much sense to keep it considering the terrible side
> > effect.
> >
> > Reported-by: David Mackey <tdmackey@twitter.com>
> > Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > Cc: Oscar Salvador <osalvador@suse.de>
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
>
> Thank you. I agree with the idea of dropping drop_slab_node() in shake_page(),
> hoping that range-based slab shrinker will be implemented in the future.

Thank you.

>
> This patch conflicts with the patch
> https://lore.kernel.org/linux-mm/20210817053703.2267588-1-naoya.horiguchi@linux.dev/T/#u
> which adds another shake_page(), so could you add the following hunk in your patch?

Sure, I will rebase the patches on top of it.

>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 64f8ac969544..7dd2ca665866 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1198,7 +1198,7 @@ static int get_any_page(struct page *p, unsigned long flags)
>                          * page, retry.
>                          */
>                         if (pass++ < 3) {
> -                               shake_page(p, 1);
> +                               shake_page(p);
>                                 goto try_again;
>                         }
>                         goto out;
>
>
> Thanks,
> Naoya Horiguchi
Yang Shi Aug. 18, 2021, 5:04 p.m. UTC | #14
On Wed, Aug 18, 2021 at 12:55 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 18.08.21 09:53, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Wed, Aug 18, 2021 at 09:24:01AM +0200, David Hildenbrand wrote:
> > ...
> >>
> >> Might we want to add a TODO in the code? We have a similar one in
> >> mm/page_isolation.c:set_migratetype_isolate() and it's certainly a reminder
> >> that something of value is missing.
> >
> > Yes, that will be helpful.  The below's what's in my mind, but if someone
> > has better idea, that's fine.
> >
> > @@ -296,11 +296,9 @@ void shake_page(struct page *p, int access)
> >       }
> >
> >       /*
> > -      * Only call shrink_node_slabs here (which would also shrink
> > -      * other caches) if access is not potentially fatal.
> > +      * TODO: Could shrink slab caches here if a lightweight range-based
> > +      * shrinker will be available.
> >        */
> > -     if (access)
> > -             drop_slab_node(page_to_nid(p));
> >   }
> >   EXPORT_SYMBOL_GPL(shake_page);
>
> Just what I had in mind, thanks!

Fine to me, will add this in v2.

>
>
> --
> Thanks,
>
> David / dhildenb
>
>
Yang Shi Aug. 18, 2021, 5:45 p.m. UTC | #15
On Tue, Aug 17, 2021 at 10:02 PM HORIGUCHI NAOYA(堀口 直也)
<naoya.horiguchi@nec.com> wrote:
>
> On Mon, Aug 16, 2021 at 01:24:25PM -0700, Yang Shi wrote:
> > On Mon, Aug 16, 2021 at 12:38 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, Aug 16, 2021 at 11:09:08AM -0700, Yang Shi wrote:
> > > > But the most disappointing thing is all the effort doesn't make the page
> > > > offline, it just returns:
> > > >
> > > > soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 ()
> > >
> > > It's a shame it doesn't call dump_page().  There might be more
> > > interesting information somewhere in struct page that would help us
> > > figure out what kind of page it was in your environment.  For example,
> > > it might be a page table page or a page allocated for vmalloc(), and
> > > in both those cases, there are things we might be able to do (we'd
> > > certainly be able to figure out that it isn't worth shrinking slab!)
> >
> > Yes,  dump_page() could provide more information to us. I could add a
> > new patch or just update this patch to call dump_page() if offline is
> > failed if the hwpoison maintainer agrees to this as well.
>
> I agree with showing more information in failure case. Thanks for the input.

By reading the code, it seems get_any_page() is called to shake the
page for both soft offline and memory_failure(), so it seems like a
good place to call dump_page() if -EIO is going to be returned, which
hwpoison can't handle the page, otherwise we may need call dump_page()
in a couple of different places.

Although dump_page() will be called with pcp disabled and holding
memory hotplug lock if it is called by get_any_page(), but I'm
supposed it should be not a big deal.

>
> - Naoya Horiguchi
Yang Shi Aug. 18, 2021, 6:01 p.m. UTC | #16
On Tue, Aug 17, 2021 at 11:30 PM Naoya Horiguchi
<naoya.horiguchi@linux.dev> wrote:
>
> On Mon, Aug 16, 2021 at 11:09:08AM -0700, Yang Shi wrote:
> > In the current implementation of soft offline, if non-LRU page is met,
> > all the slab caches will be dropped to free the page then offline.  But
> > if the page is not slab page all the effort is wasted in vain.  Even
> > though it is a slab page, it is not guaranteed the page could be freed
> > at all.
> >
> > However the side effect and cost is quite high.  It does not only drop
> > the slab caches, but also may drop a significant amount of page caches
> > which are associated with inode caches.  It could make the most
> > workingset gone in order to just offline a page.  And the offline is not
> > guaranteed to succeed at all, actually I really doubt the success rate
> > for real life workload.
> >
> > Furthermore the worse consequence is the system may be locked up and
> > unusable since the page cache release may incur huge amount of works
> > queued for memcg release.
> >
> > Actually we ran into such unpleasant case in our production environment.
> > Firstly, the workqueue of memory_failure_work_func is locked up as
> > below:
> >
> > BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 53s!
> > Showing busy workqueues and worker pools:
> > workqueue events: flags=0x0
> >   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=14/256 refcnt=15
> >     in-flight: 409271:memory_failure_work_func
> >     pending: kfree_rcu_work, kfree_rcu_monitor, kfree_rcu_work, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, rht_deferred_worker, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, kfree_rcu_work, drain_local_stock, kfree_rcu_work
> > workqueue mm_percpu_wq: flags=0x8
> >   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
> >     pending: vmstat_update
> > workqueue cgroup_destroy: flags=0x0
> >   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/1 refcnt=12072
> >     pending: css_release_work_fn
> >
> > There were over 12K css_release_work_fn queued, and this caused a few
> > lockups due to the contention of worker pool lock with IRQ disabled, for
> > example:
> >
> > NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
> > Modules linked in: amd64_edac_mod edac_mce_amd crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xt_DSCP iptable_mangle kvm_amd bpfilter vfat fat acpi_ipmi i2c_piix4 usb_storage ipmi_si k10temp i2c_core ipmi_devintf ipmi_msghandler acpi_cpufreq sch_fq_codel xfs libcrc32c crc32c_intel mlx5_core mlxfw nvme xhci_pci ptp nvme_core pps_core xhci_hcd
> > CPU: 1 PID: 205500 Comm: kworker/1:0 Tainted: G             L    5.10.32-t1.el7.twitter.x86_64 #1
> > Hardware name: TYAN F5AMT /z        /S8026GM2NRE-CGN, BIOS V8.030 03/30/2021
> > Workqueue: events memory_failure_work_func
> > RIP: 0010:queued_spin_lock_slowpath+0x41/0x1a0
> > Code: 41 f0 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 1b 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 c3 f6 c4 01 75 04 c6 47
> > RSP: 0018:ffff9b2ac278f900 EFLAGS: 00000002
> > RAX: 0000000000480101 RBX: ffff8ce98ce71800 RCX: 0000000000000084
> > RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8ce98ce6a140
> > RBP: 00000000000284c8 R08: ffffd7248dcb6808 R09: 0000000000000000
> > R10: 0000000000000003 R11: ffff9b2ac278f9b0 R12: 0000000000000001
> > R13: ffff8cb44dab9c00 R14: ffffffffbd1ce6a0 R15: ffff8cacaa37f068
> > FS:  0000000000000000(0000) GS:ffff8ce98ce40000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00007fcf6e8cb000 CR3: 0000000a0c60a000 CR4: 0000000000350ee0
> > Call Trace:
> >  __queue_work+0xd6/0x3c0
> >  queue_work_on+0x1c/0x30
> >  uncharge_batch+0x10e/0x110
> >  mem_cgroup_uncharge_list+0x6d/0x80
> >  release_pages+0x37f/0x3f0
> >  __pagevec_release+0x1c/0x50
> >  __invalidate_mapping_pages+0x348/0x380
> >  ? xfs_alloc_buftarg+0xa4/0x120 [xfs]
> >  inode_lru_isolate+0x10a/0x160
> >  ? iput+0x1d0/0x1d0
> >  __list_lru_walk_one+0x7b/0x170
> >  ? iput+0x1d0/0x1d0
> >  list_lru_walk_one+0x4a/0x60
> >  prune_icache_sb+0x37/0x50
> >  super_cache_scan+0x123/0x1a0
> >  do_shrink_slab+0x10c/0x2c0
> >  shrink_slab+0x1f1/0x290
> >  drop_slab_node+0x4d/0x70
> >  soft_offline_page+0x1ac/0x5b0
> >  ? dev_mce_log+0xee/0x110
> >  ? notifier_call_chain+0x39/0x90
> >  memory_failure_work_func+0x6a/0x90
> >  process_one_work+0x19e/0x340
> >  ? process_one_work+0x340/0x340
> >  worker_thread+0x30/0x360
> >  ? process_one_work+0x340/0x340
> >  kthread+0x116/0x130
> >
> > The lockup made the machine is quite unusable.  And it also made the
> > most workingset gone, the reclaimabled slab caches were reduced from 12G
> > to 300MB, the page caches were decreased from 17G to 4G.
> >
> > But the most disappointing thing is all the effort doesn't make the page
> > offline, it just returns:
> >
> > soft_offline: 0x1469f2: unknown non LRU page type 5ffff0000000000 ()
> >
> > It seems the aggressive behavior for non-LRU page didn't pay back, so it
> > doesn't make too much sense to keep it considering the terrible side
> > effect.
> >
> > Reported-by: David Mackey <tdmackey@twitter.com>
> > Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > Cc: Oscar Salvador <osalvador@suse.de>
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
>
> Thank you. I agree with the idea of dropping drop_slab_node() in shake_page(),
> hoping that range-based slab shrinker will be implemented in the future.
>
> This patch conflicts with the patch
> https://lore.kernel.org/linux-mm/20210817053703.2267588-1-naoya.horiguchi@linux.dev/T/#u
> which adds another shake_page(), so could you add the following hunk in your patch?
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 64f8ac969544..7dd2ca665866 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1198,7 +1198,7 @@ static int get_any_page(struct page *p, unsigned long flags)
>                          * page, retry.
>                          */
>                         if (pass++ < 3) {
> -                               shake_page(p, 1);
> +                               shake_page(p);
>                                 goto try_again;
>                         }
>                         goto out;

BTW, a question about the return value with the above patch, will
reply in that thread.

>
>
> Thanks,
> Naoya Horiguchi
diff mbox series

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ce8fc0fd6d6e..4cd46e6ab821 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3215,7 +3215,7 @@  extern void memory_failure_queue_kick(int cpu);
 extern int unpoison_memory(unsigned long pfn);
 extern int sysctl_memory_failure_early_kill;
 extern int sysctl_memory_failure_recovery;
-extern void shake_page(struct page *p, int access);
+extern void shake_page(struct page *p);
 extern atomic_long_t num_poisoned_pages __read_mostly;
 extern int soft_offline_page(unsigned long pfn, int flags);
 
diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
index 1ae1ebc2b9b1..aff4d27ec235 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -30,7 +30,7 @@  static int hwpoison_inject(void *data, u64 val)
 	if (!hwpoison_filter_enable)
 		goto inject;
 
-	shake_page(hpage, 0);
+	shake_page(hpage);
 	/*
 	 * This implies unable to support non-LRU pages.
 	 */
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 03f83e7d075b..82986fca9961 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -282,9 +282,9 @@  static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags)
 
 /*
  * Unknown page type encountered. Try to check whether it can turn PageLRU by
- * lru_add_drain_all, or a free page by reclaiming slabs when possible.
+ * lru_add_drain_all.
  */
-void shake_page(struct page *p, int access)
+void shake_page(struct page *p)
 {
 	if (PageHuge(p))
 		return;
@@ -295,12 +295,6 @@  void shake_page(struct page *p, int access)
 			return;
 	}
 
-	/*
-	 * Only call shrink_node_slabs here (which would also shrink
-	 * other caches) if access is not potentially fatal.
-	 */
-	if (access)
-		drop_slab_node(page_to_nid(p));
 }
 EXPORT_SYMBOL_GPL(shake_page);
 
@@ -1215,7 +1209,7 @@  static int get_any_page(struct page *p, unsigned long flags)
 		 */
 		if (pass++ < 3) {
 			put_page(p);
-			shake_page(p, 1);
+			shake_page(p);
 			count_increased = false;
 			goto try_again;
 		}
@@ -1363,7 +1357,7 @@  static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	 * shake_page() again to ensure that it's flushed.
 	 */
 	if (mlocked)
-		shake_page(hpage, 0);
+		shake_page(hpage);
 
 	/*
 	 * Now that the dirty bit has been propagated to the
@@ -1718,7 +1712,7 @@  int memory_failure(unsigned long pfn, int flags)
 	 * The check (unnecessarily) ignores LRU pages being isolated and
 	 * walked by the page reclaim code, however that's not a big loss.
 	 */
-	shake_page(p, 0);
+	shake_page(p);
 
 	lock_page(p);