Message ID | 20190429105939.11962-1-jslaby@suse.cz (mailing list archive)
---|---
State | New, archived
Series | memcg: make it work on sparse non-0-node systems
On Mon 29-04-19 12:59:39, Jiri Slaby wrote:
[...]
>  static inline bool list_lru_memcg_aware(struct list_lru *lru)
>  {
> -        /*
> -         * This needs node 0 to be always present, even
> -         * in the systems supporting sparse numa ids.
> -         */
> -        return !!lru->node[0].memcg_lrus;
> +        return !!lru->node[first_online_node].memcg_lrus;
>  }
>
>  static inline struct list_lru_one *

How come this doesn't blow up later - e.g. in the memcg_destroy_list_lru
path, which does iterate over all existing nodes, thus including node 0.
On 29. 04. 19, 13:30, Michal Hocko wrote:
> On Mon 29-04-19 12:59:39, Jiri Slaby wrote:
> [...]
>>  static inline bool list_lru_memcg_aware(struct list_lru *lru)
>>  {
>> -        /*
>> -         * This needs node 0 to be always present, even
>> -         * in the systems supporting sparse numa ids.
>> -         */
>> -        return !!lru->node[0].memcg_lrus;
>> +        return !!lru->node[first_online_node].memcg_lrus;
>>  }
>>
>>  static inline struct list_lru_one *
>
> How come this doesn't blow up later - e.g. in the memcg_destroy_list_lru
> path, which does iterate over all existing nodes, thus including node 0.

If the node is not disabled (i.e. is N_POSSIBLE), lru->node is allocated
for that node too. It will also have memcg_lrus properly set.

If it is disabled, it will never be iterated.

Well, I could have used first_node. But I am not sure whether the first
POSSIBLE node is also ONLINE during boot.

thanks,
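The question turns on the difference between the N_POSSIBLE and N_ONLINE
node state masks. A minimal sketch of the nodemask helpers involved
(a simplified kernel-context illustration, not part of the patch):

    #include <linux/nodemask.h>
    #include <linux/printk.h>

    static void show_first_nodes(void)
    {
            /* lowest node id that can ever exist on this boot (N_POSSIBLE) */
            int possible = first_node(node_states[N_POSSIBLE]);

            /* lowest node id that is actually up right now (N_ONLINE) */
            int online = first_online_node;

            /*
             * memcg_init_list_lru() initializes memcg_lrus via
             * for_each_node(), i.e. over N_POSSIBLE, so probing either
             * id works. On the reported machine node 0 is not even
             * possible, so both evaluate to 1.
             */
            pr_info("first possible node: %d, first online node: %d\n",
                    possible, online);
    }

Since every id in node_states[N_POSSIBLE] has memcg_lrus initialized,
first_node and first_online_node are interchangeable here, which is what
Jiri concludes in the next message.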
On 29. 04. 19, 13:55, Jiri Slaby wrote:
> Well, I could have used first_node. But I am not sure whether the first
> POSSIBLE node is also ONLINE during boot.

Thinking about it, it does not matter, actually. Both the first_node and
first_online_node entries are allocated and set up, no matter which node
is ONLINE. So first_node should work as well as first_online_node.

thanks,
On Mon 29-04-19 13:55:26, Jiri Slaby wrote:
> On 29. 04. 19, 13:30, Michal Hocko wrote:
> > On Mon 29-04-19 12:59:39, Jiri Slaby wrote:
> > [...]
> >>  static inline bool list_lru_memcg_aware(struct list_lru *lru)
> >>  {
> >> -        /*
> >> -         * This needs node 0 to be always present, even
> >> -         * in the systems supporting sparse numa ids.
> >> -         */
> >> -        return !!lru->node[0].memcg_lrus;
> >> +        return !!lru->node[first_online_node].memcg_lrus;
> >>  }
> >>
> >>  static inline struct list_lru_one *
> >
> > How come this doesn't blow up later - e.g. in the memcg_destroy_list_lru
> > path, which does iterate over all existing nodes, thus including node 0.
>
> If the node is not disabled (i.e. is N_POSSIBLE), lru->node is allocated
> for that node too. It will also have memcg_lrus properly set.
>
> If it is disabled, it will never be iterated.
>
> Well, I could have used first_node. But I am not sure whether the first
> POSSIBLE node is also ONLINE during boot.

I dunno. I would have to think about this much more. The whole
expectation that node 0 is always around is simply broken. But
list_lru_memcg_aware also looks very suspicious. We should have a flag
or something rather than what we have now.

I am still not sure I have completely understood the problem though.
I will try to get to this during the week, but Vladimir should be a much
better fit to judge here.
Vladimir, as you are perhaps the one most familiar with the code, could
you take a look at this?

On 29. 04. 19, 12:59, Jiri Slaby wrote:
> We have a single node system with node 0 disabled:
>   Scanning NUMA topology in Northbridge 24
>   Number of physical nodes 2
>   Skipping disabled node 0
>   Node 1 MemBase 0000000000000000 Limit 00000000fbff0000
>   NODE_DATA(1) allocated [mem 0xfbfda000-0xfbfeffff]
>
> This causes crashes in memcg when the system boots:
>   BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
>   #PF error: [normal kernel read fault]
>   ...
>   RIP: 0010:list_lru_add+0x94/0x170
>   ...
>   Call Trace:
>    d_lru_add+0x44/0x50
>    dput.part.34+0xfc/0x110
>    __fput+0x108/0x230
>    task_work_run+0x9f/0xc0
>    exit_to_usermode_loop+0xf5/0x100
>
> It is reproducible as far back as 4.12. I did not try older kernels. You
> have to have a new enough systemd, e.g. 241 (the reason is unknown -- it
> was not investigated). It cannot be reproduced with systemd 234.
>
> The system crashes because the size of the lru array is never updated in
> memcg_update_all_list_lrus and the reads are past the zero-sized array,
> causing dereferences of random memory.
>
> The root cause is the list_lru_memcg_aware check in the list_lru code.
> The test in list_lru_memcg_aware is broken: it assumes node 0 is always
> present, but that is not true on some systems, as can be seen above.
>
> So fix this by checking the first online node instead of node 0.
>
> Signed-off-by: Jiri Slaby <jslaby@suse.cz>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: <cgroups@vger.kernel.org>
> Cc: <linux-mm@kvack.org>
> Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
> ---
>  mm/list_lru.c | 6 +-----
>  1 file changed, 1 insertion(+), 5 deletions(-)
>
> diff --git a/mm/list_lru.c b/mm/list_lru.c
> index 0730bf8ff39f..7689910f1a91 100644
> --- a/mm/list_lru.c
> +++ b/mm/list_lru.c
> @@ -37,11 +37,7 @@ static int lru_shrinker_id(struct list_lru *lru)
>
>  static inline bool list_lru_memcg_aware(struct list_lru *lru)
>  {
> -        /*
> -         * This needs node 0 to be always present, even
> -         * in the systems supporting sparse numa ids.
> -         */
> -        return !!lru->node[0].memcg_lrus;
> +        return !!lru->node[first_online_node].memcg_lrus;
>  }
>
>  static inline struct list_lru_one *
On Mon, Apr 29, 2019 at 12:59:39PM +0200, Jiri Slaby wrote:
> We have a single node system with node 0 disabled:
[...]
> So fix this by checking the first online node instead of node 0.
[...]
>  static inline bool list_lru_memcg_aware(struct list_lru *lru)
>  {
> -        /*
> -         * This needs node 0 to be always present, even
> -         * in the systems supporting sparse numa ids.
> -         */
> -        return !!lru->node[0].memcg_lrus;
> +        return !!lru->node[first_online_node].memcg_lrus;
>  }
>
>  static inline struct list_lru_one *

Yep, I didn't expect node 0 could ever be unavailable, my bad.
The patch looks fine to me:

Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>

However, I tend to agree with Michal that (ab)using node[0].memcg_lrus
to check if a list_lru is memcg aware looks confusing. I guess we could
simply add a bool flag to list_lru instead. Something like this, maybe:

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index aa5efd9351eb..d5ceb2839a2d 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -54,6 +54,7 @@ struct list_lru {
 #ifdef CONFIG_MEMCG_KMEM
         struct list_head list;
         int shrinker_id;
+        bool memcg_aware;
 #endif
 };

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 0730bf8ff39f..8e605e40a4c6 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -37,11 +37,7 @@ static int lru_shrinker_id(struct list_lru *lru)

 static inline bool list_lru_memcg_aware(struct list_lru *lru)
 {
-        /*
-         * This needs node 0 to be always present, even
-         * in the systems supporting sparse numa ids.
-         */
-        return !!lru->node[0].memcg_lrus;
+        return lru->memcg_aware;
 }

 static inline struct list_lru_one *
@@ -451,6 +447,7 @@ static int memcg_init_list_lru(struct list_lru *lru, bool memcg_aware)
 {
         int i;

+        lru->memcg_aware = memcg_aware;
         if (!memcg_aware)
                 return 0;
On Thu, May 9, 2019 at 5:25 AM Vladimir Davydov <vdavydov.dev@gmail.com> wrote:
>
> On Mon, Apr 29, 2019 at 12:59:39PM +0200, Jiri Slaby wrote:
> > We have a single node system with node 0 disabled:
[...]
> Yep, I didn't expect node 0 could ever be unavailable, my bad.
> The patch looks fine to me:
>
> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
>
> However, I tend to agree with Michal that (ab)using node[0].memcg_lrus
> to check if a list_lru is memcg aware looks confusing. I guess we could
> simply add a bool flag to list_lru instead. Something like this, maybe:

I think the bool flag approach is much better. No assumption on the node
initialization.

If we go with the bool approach then add:

Reviewed-by: Shakeel Butt <shakeelb@google.com>

> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
[...]
On Thu 09-05-19 15:25:26, Vladimir Davydov wrote:
> On Mon, Apr 29, 2019 at 12:59:39PM +0200, Jiri Slaby wrote:
> > We have a single node system with node 0 disabled:
[...]
> Yep, I didn't expect node 0 could ever be unavailable, my bad.
> The patch looks fine to me:
>
> Acked-by: Vladimir Davydov <vdavydov.dev@gmail.com>
>
> However, I tend to agree with Michal that (ab)using node[0].memcg_lrus
> to check if a list_lru is memcg aware looks confusing. I guess we could
> simply add a bool flag to list_lru instead. Something like this, maybe:

Yes, this makes much more sense to me!

> diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
[...]
On 16. 05. 19, 15:59, Michal Hocko wrote:
>> However, I tend to agree with Michal that (ab)using node[0].memcg_lrus
>> to check if a list_lru is memcg aware looks confusing. I guess we could
>> simply add a bool flag to list_lru instead. Something like this, maybe:
>
> Yes, this makes much more sense to me!

I am not sure whether I should send a patch with this solution or
Vladimir will (given he is the author and already has a diff)?

thanks,
On Fri, May 17, 2019 at 06:48:37AM +0200, Jiri Slaby wrote:
> On 16. 05. 19, 15:59, Michal Hocko wrote:
> >> However, I tend to agree with Michal that (ab)using node[0].memcg_lrus
> >> to check if a list_lru is memcg aware looks confusing. I guess we could
> >> simply add a bool flag to list_lru instead. Something like this, maybe:
> >
> > Yes, this makes much more sense to me!
>
> I am not sure whether I should send a patch with this solution or
> Vladimir will (given he is the author and already has a diff)?

I didn't even try to compile it, let alone test it. I'd appreciate it if
you could wrap it up and send it out under your authorship. Feel free to
add my Acked-by.
On 17. 05. 19, 10:00, Vladimir Davydov wrote:
> On Fri, May 17, 2019 at 06:48:37AM +0200, Jiri Slaby wrote:
>> On 16. 05. 19, 15:59, Michal Hocko wrote:
>>>> However, I tend to agree with Michal that (ab)using node[0].memcg_lrus
>>>> to check if a list_lru is memcg aware looks confusing. I guess we could
>>>> simply add a bool flag to list_lru instead. Something like this, maybe:
>>>
>>> Yes, this makes much more sense to me!
>>
>> I am not sure whether I should send a patch with this solution or
>> Vladimir will (given he is the author and already has a diff)?
>
> I didn't even try to compile it, let alone test it. I'd appreciate it if
> you could wrap it up and send it out under your authorship. Feel free to
> add my Acked-by.

OK, NP.

thanks,
diff --git a/mm/list_lru.c b/mm/list_lru.c
index 0730bf8ff39f..7689910f1a91 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -37,11 +37,7 @@ static int lru_shrinker_id(struct list_lru *lru)

 static inline bool list_lru_memcg_aware(struct list_lru *lru)
 {
-        /*
-         * This needs node 0 to be always present, even
-         * in the systems supporting sparse numa ids.
-         */
-        return !!lru->node[0].memcg_lrus;
+        return !!lru->node[first_online_node].memcg_lrus;
 }

 static inline struct list_lru_one *
We have a single node system with node 0 disabled:
  Scanning NUMA topology in Northbridge 24
  Number of physical nodes 2
  Skipping disabled node 0
  Node 1 MemBase 0000000000000000 Limit 00000000fbff0000
  NODE_DATA(1) allocated [mem 0xfbfda000-0xfbfeffff]

This causes crashes in memcg when the system boots:
  BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
  #PF error: [normal kernel read fault]
  ...
  RIP: 0010:list_lru_add+0x94/0x170
  ...
  Call Trace:
   d_lru_add+0x44/0x50
   dput.part.34+0xfc/0x110
   __fput+0x108/0x230
   task_work_run+0x9f/0xc0
   exit_to_usermode_loop+0xf5/0x100

It is reproducible as far back as 4.12. I did not try older kernels. You
have to have a new enough systemd, e.g. 241 (the reason is unknown -- it
was not investigated). It cannot be reproduced with systemd 234.

The system crashes because the size of the lru array is never updated in
memcg_update_all_list_lrus and the reads are past the zero-sized array,
causing dereferences of random memory.

The root cause is the list_lru_memcg_aware check in the list_lru code.
The test in list_lru_memcg_aware is broken: it assumes node 0 is always
present, but that is not true on some systems, as can be seen above.

So fix this by checking the first online node instead of node 0.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: <cgroups@vger.kernel.org>
Cc: <linux-mm@kvack.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
---
 mm/list_lru.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)
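To make the failure mechanism above concrete, here is a minimal,
self-contained userspace model of it. This is illustrative only: the
struct layout and helper names are simplified stand-ins for the ones in
mm/list_lru.c, not the real kernel definitions.

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NR_NODES 2  /* the reported box: node 0 disabled, node 1 online */

    struct list_lru_node {
            void **memcg_lrus;  /* per-memcg lists; NULL on an impossible node */
            int nr_ids;         /* allocated size of memcg_lrus */
    };

    struct list_lru {
            struct list_lru_node node[NR_NODES];
    };

    /* The broken check: probes node 0, which was never initialized here. */
    static bool list_lru_memcg_aware(struct list_lru *lru)
    {
            return lru->node[0].memcg_lrus != NULL;  /* wrongly false */
    }

    /* Models memcg_update_all_list_lrus() growing the per-memcg arrays. */
    static void memcg_update_list_lru(struct list_lru *lru, int new_size)
    {
            if (!list_lru_memcg_aware(lru))
                    return;  /* resize silently skipped: arrays stay small */
            for (int i = 0; i < NR_NODES; i++) {
                    if (!lru->node[i].memcg_lrus)
                            continue;
                    lru->node[i].memcg_lrus = realloc(lru->node[i].memcg_lrus,
                                                      new_size * sizeof(void *));
                    lru->node[i].nr_ids = new_size;
            }
    }

    int main(void)
    {
            struct list_lru lru = { 0 };
            int kmemcg_id = 5;  /* id of a freshly created cgroup */

            /* init walks possible nodes only, so just node 1 gets an array */
            lru.node[1].memcg_lrus = calloc(1, sizeof(void *));
            lru.node[1].nr_ids = 1;

            memcg_update_list_lru(&lru, kmemcg_id + 1);  /* no-op, due to the bug */

            /* list_lru_add() would now read memcg_lrus[kmemcg_id]: */
            if (kmemcg_id >= lru.node[1].nr_ids)
                    printf("OOB: index %d into array of size %d -> random memory\n",
                           kmemcg_id, lru.node[1].nr_ids);
            return 0;
    }

In the kernel, the equivalent out-of-bounds read returns a garbage
pointer, which is why the oops above fires in list_lru_add().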