Message ID | 20210924161253.D7673E31@davehans-spike.ostc.intel.com (mailing list archive)
---|---
State | New
Series | mm/migrate: 5.15 fixes for automatic demotion
On 24.09.21 18:12, Dave Hansen wrote:
> 
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> == tl;dr ==
> 
> Automatic demotion opted for a simple, lazy approach to handling
> hotplug events.  This noticeably slows down memory hotplug[1].
> Optimize away updates to the demotion order when memory hotplug
> events should have no effect.
> 
> This has no effect on CPU hotplug.  There is no known problem on
> the CPU side and any work there will be in a separate series.
> 
> == Background ==
> 
> Automatic demotion is a memory migration strategy to ensure that
> new allocations have room in faster memory tiers on tiered memory
> systems.  The kernel maintains an array (node_demotion[]) to
> drive these migrations.
> 
> The node_demotion[] path is calculated by starting at nodes with
> CPUs and then "walking" to nodes with memory.  Only hotplug
> events which online or offline a node with memory (N_ONLINE) or
> CPUs (N_CPU) will actually affect the migration order.
> 
> == Problem ==
> 
> However, the current code is lazy.  It completely regenerates the
> migration order on *any* CPU or memory hotplug event.  The logic
> was that these events are extremely rare and that the overhead
> from indiscriminate order regeneration is minimal.
> 
> Part of the update logic involves a synchronize_rcu(), which is a
> pretty big hammer.  Its overhead was large enough to be detected
> by some 0day tests that watch memory hotplug performance[1].
> 
> == Solution ==
> 
> Add a new helper (node_demotion_topo_changed()) which can
> differentiate between superfluous and impactful hotplug events.
> Skip the expensive update operation for superfluous events.
> 
> == Aside: Locking ==
> 
> It took me a few moments to declare the locking to be safe enough
> for node_demotion_topo_changed() to work.  It all hinges on the
> memory hotplug lock:
> 
> During memory hotplug events, 'mem_hotplug_lock' is held for
> write.  This ensures that two memory hotplug events can not be
> called simultaneously.
> 
> CPU hotplug has a similar lock (cpuhp_state_mutex) which also
> provides mutual exclusion between CPU hotplug events.  In
> addition, the demotion code acquires and holds the
> mem_hotplug_lock for read during its CPU hotplug handlers.  This
> provides mutual exclusion between the demotion memory hotplug
> callbacks and the CPU hotplug callbacks.
> 
> This effectively allows the migration target generation code to
> act as if it is single-threaded.
> 
> 1. https://lore.kernel.org/all/20210905135932.GE15026@xsang-OptiPlex-9020/
> 
> Fixes: 884a6e5d1f93 ("mm/migrate: update node demotion order on hotplug events")
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: "Huang, Ying" <ying.huang@intel.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Wei Xu <weixugc@google.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Greg Thelen <gthelen@google.com>
> Cc: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  b/mm/migrate.c |   12 +++++++++++-
>  1 file changed, 11 insertions(+), 1 deletion(-)
> 
> diff -puN mm/migrate.c~faster-node-order mm/migrate.c
> --- a/mm/migrate.c~faster-node-order	2021-09-24 09:12:30.988377798 -0700
> +++ b/mm/migrate.c	2021-09-24 09:12:30.988377798 -0700
> @@ -3239,8 +3239,18 @@ static int migration_offline_cpu(unsigne
>   * set_migration_target_nodes().
>   */
>  static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
> -						 unsigned long action, void *arg)
> +						 unsigned long action, void *_arg)
>  {
> +	struct memory_notify *arg = _arg;
> +
> +	/*
> +	 * Only update the node migration order when a node is
> +	 * changing status, like online->offline.  This avoids
> +	 * the overhead of synchronize_rcu() in most cases.
> +	 */
> +	if (arg->status_change_nid < 0)
> +		return notifier_from_errno(0);
> +
>  	switch (action) {
>  	case MEM_GOING_OFFLINE:
>  		/*
> _

Reviewed-by: David Hildenbrand <david@redhat.com>