Message ID: 20230311051712.4095040-1-mcgrof@kernel.org
Series: module: avoid userspace pressure on unwanted allocations
On 11.03.23 06:17, Luis Chamberlain wrote:
> A long time ago we had some issues with userspace doing stupid stuff.
> Well, it turns out even the kernel can do stupid stuff too, as we're
> learning with the ACPI module aliases and that hammering tons of loads.
>
> So add a bit of code which gets us a bit more on the defensive about
> these situations.
>
> To experiment, this also adds in-kernel alias support to see if this helps
> with some larger systems.
>
> This is all based on some old code which tried to add defensive
> mechanisms, the last of which was here, and I had dropped the ball:
>
> https://lore.kernel.org/all/20171208001540.23696-1-mcgrof@kernel.org/
>
> I've only compile tested this for now. Will need to stress test
> with kmod tests 0008 and 0009 to see if there are any differences.
> I'll have to re-test and re-gnuplot stuff there. But early feedback
> is appreciated, hence the RFC.
>
> David Hildenbrand had reported a while ago issues with userspace
> doing insane things with allocations, bringing a system down to
> its knees. This is part of the motivation for this series.

I'll try to grab a system where I can reproduce the issue and give your
patches a churn.

Thanks!
On Wed, Mar 15, 2023 at 01:24:41PM +0100, David Hildenbrand wrote:
> On 11.03.23 06:17, Luis Chamberlain wrote:
>
> <-- snip -->
>
> I'll try to grab a system where I can reproduce the issue and give your
> patches a churn.

Great, then please wait for v2 RFC as the first patch was missing an
obvious mutex grab / release. I already have some memory pressure data
that shows gains. Hope to post soon.

  Luis
On 15.03.23 17:10, Luis Chamberlain wrote:
> On Wed, Mar 15, 2023 at 01:24:41PM +0100, David Hildenbrand wrote:
>> I'll try to grab a system where I can reproduce the issue and give your
>> patches a churn.
>
> Great, then please wait for v2 RFC as the first patch was missing an
> obvious mutex grab / release. I already have some memory pressure data
> that shows gains. Hope to post soon.

I expect to have a machine (with a crazy number of CPUs/devices)
available in a couple of days (1-2), so no need to rush.

The original machine I was able to reproduce with is blocked for a
little bit longer; so I hope the alternative I looked up will similarly
trigger the issue easily.
On Wed, Mar 15, 2023 at 05:41:53PM +0100, David Hildenbrand wrote:
> I expect to have a machine (with a crazy number of CPUs/devices)
> available in a couple of days (1-2), so no need to rush.
>
> The original machine I was able to reproduce with is blocked for a
> little bit longer; so I hope the alternative I looked up will similarly
> trigger the issue easily.

OK give this a spin:

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230316-module-alloc-opts

I'm seeing an ~86 MiB saving on the upper bound of memory usage while
hammering on kmod test 0008, and this is on a small system.

Probably won't help *that* much, but am curious... if it helps somewhat.

  Luis
On Thu, Mar 16, 2023 at 04:55:31PM -0700, Luis Chamberlain wrote:
> I'm seeing an ~86 MiB saving on the upper bound of memory usage while
> hammering on kmod test 0008, and this is on a small system.
>
> Probably won't help *that* much, but am curious... if it helps somewhat.

What's the CPU count, BTW?

  Luis
On Thu, Mar 16, 2023 at 04:56:56PM -0700, Luis Chamberlain wrote:
> On Thu, Mar 16, 2023 at 04:55:31PM -0700, Luis Chamberlain wrote:
>> OK give this a spin:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230316-module-alloc-opts

Today I am up to here:

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230317-module-alloc-opts

The last patch really would have no justification yet at all unless it
does help your case.

  Luis
On 17.03.23 00:56, Luis Chamberlain wrote:
> What's the CPU count, BTW?

220 physical ones and 440 virtual ones.
On 18.03.23 01:11, Luis Chamberlain wrote:
> Today I am up to here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230317-module-alloc-opts
>
> The last patch really would have no justification yet at all unless it
> does help your case.

Still waiting on the system (the replacement system I was able to grab
broke ...).

I'll let you know once I succeed in reproducing + testing your fixes.
On 20.03.23 10:38, David Hildenbrand wrote:
> Still waiting on the system (the replacement system I was able to grab
> broke ...).
>
> I'll let you know once I succeed in reproducing + testing your fixes.

Okay, I have a system where I can reproduce.

Should I give

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts

from yesterday a churn?
On Mon, Mar 20, 2023 at 08:40:07PM +0100, David Hildenbrand wrote:
> Okay, I have a system where I can reproduce.
>
> Should I give
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts
>
> from yesterday a churn?

Yes please give that a run. Please collect systemd-analyze output, given
the lack of any other tool to evaluate any deltas. Can't think of
anything else to gather other than seeing if it booted.

If that boot works, then try removing the last patch "module: add a
sanity check prior to allowing kernel module auto-loading" to see if
that last patch helped or was just noise. As it stands I'm not convinced
yet that it did help; if it *does* help, we probably need to rethink
some finit_module() allocation things.

If you *still* can't boot the system, well, we should re-think the
finit_module() allocation path a bit more too.

  Luis
On 20.03.23 22:09, Luis Chamberlain wrote:
> On Mon, Mar 20, 2023 at 08:40:07PM +0100, David Hildenbrand wrote:
>> Should I give
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts
>>
>> from yesterday a churn?
>
> Yes please give that a run.

Reproduced with v6.3.0-rc1 (on 1st try).

Not able to reproduce with 20230319-module-alloc-opts so far (2 tries).

> Please collect systemd-analyze output, given the lack of any other tool
> to evaluate any deltas. Can't think of anything else to gather other
> than seeing if it booted.

Issue is that some services (kdump, tuned) seem to sometimes take ages
on that system to start for some reason, and systemd-analyze refuses to
do something reasonable while the system is still booting up. I'll see
if I can come up with some data.

> If that boot works, then try removing the last patch "module: add a
> sanity check prior to allowing kernel module auto-loading" to see if
> that last patch helped or was just noise. As it stands I'm not convinced
> yet that it did help; if it *does* help, we probably need to rethink
> some finit_module() allocation things.

Okay, will try without the last patch tomorrow.
On Mon, Mar 20, 2023 at 10:15:23PM +0100, David Hildenbrand wrote:
> Reproduced with v6.3.0-rc1 (on 1st try).

By reproduced, you mean it fails to boot?

> Not able to reproduce with 20230319-module-alloc-opts so far (2 tries).

Oh wow, so to clarify, it boots OK?

> Issue is that some services (kdump, tuned) seem to sometimes take ages
> on that system to start for some reason,

How about disabling that?

> and systemd-analyze refuses to do something reasonable while the system
> is still booting up.

I see.

> I'll see if I can come up with some data.

Thanks!

> Okay, will try without the last patch tomorrow.

Thanks!!

  Luis
On Mon, Mar 20, 2023 at 02:23:36PM -0700, Luis Chamberlain wrote:
> On Mon, Mar 20, 2023 at 10:15:23PM +0100, David Hildenbrand wrote:
>> Not able to reproduce with 20230319-module-alloc-opts so far (2 tries).
>
> Oh wow, so to clarify, it boots OK?

Now that we know that tree works, I'm curious also now if you can
confirm that just re-ordering the patches still works (it should):

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts-adjust

And although it's *probably* just noise, I'm very curious how much
difference, if any, there is if you just revert "module: use
list_add_tail_rcu() when adding module".

The data on that commit log is pretty small as I have a low-end system,
and I'm not yet done beating the hell out of a system with stress-ng,
but getting some data from a pretty large system would be great,
especially if this series seems to prove fixing boot on them.

  Luis
On 20.03.23 22:23, Luis Chamberlain wrote:
> On Mon, Mar 20, 2023 at 10:15:23PM +0100, David Hildenbrand wrote:
>> Reproduced with v6.3.0-rc1 (on 1st try).
>
> By reproduced, you mean it fails to boot?

It boots, but we get vmap allocation warnings, because the ~440 CPUs
manage to completely exhaust the module vmap area due to KASAN.

>> Not able to reproduce with 20230319-module-alloc-opts so far (2 tries).
>
> Oh wow, so to clarify, it boots OK?

It boots and I don't get the vmap allocation warnings.

>> Issue is that some services (kdump, tuned) seem to sometimes take ages
>> on that system to start for some reason,
>
> How about disabling that?

It seems to be random services. On my debug kernel with KASAN everything
is just super slow. I'll try to measure on a !debug kernel.
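[For a sense of scale, with purely illustrative numbers rather than
measurements from this thread: generic KASAN allocates shadow memory at
1/8th the size of each vmalloc mapping, so every in-flight duplicate
module load costs the module copy plus its shadow. Assuming one load in
flight per CPU and a ~3 MiB module copy:

	440 loads x 3 MiB x (1 + 1/8) ~= 1.45 GiB

On x86-64, for instance, the module mapping area is only roughly
1.5 GiB, so a boot-time burst of duplicate loads can exhaust it even
though each individual allocation is modest.]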
On Tue, Mar 21, 2023 at 04:11:27PM +0100, David Hildenbrand wrote:
> On 20.03.23 22:23, Luis Chamberlain wrote:
>> By reproduced, you mean it fails to boot?
>
> It boots, but we get vmap allocation warnings, because the ~440 CPUs
> manage to completely exhaust the module vmap area due to KASAN.

Thanks, can you post a trace?

>> Oh wow, so to clarify, it boots OK?
>
> It boots and I don't get the vmap allocation warnings.

Wonderful! Now to zero in on whether all the commits are required, and
to measure the small value-add of each of them. That is the hard part,
and thanks for any time you can help dedicate towards that. There are
really only 3 functional changes we need to measure.

In the 20230319-module-alloc-opts-adjust branch they are left towards
the end in history, the top being the most recent commit:

600f1f769d06 module: add a sanity check prior to allowing kernel module auto-loading
680f2c1fff2d module: use list_add_tail_rcu() when adding module
6023a6c7d98c module: avoid allocation if module is already present and ready

My guess is the majority of the fix comes from commit 6023a6c7d98c
("module: avoid allocation if module is already present and ready").
The rcu one probably doesn't help much, but I think it makes sense
while we're at it. The auto-loading sanity check is the one I'm less
convinced makes sense. But *if* it does help, it means such
finit_module() situations / stresses are much larger than we ever
anticipated.
>>>> Please collect systemd-analyze output, given the lack of any other
>>>> tool to evaluate any deltas. Can't think of anything else to gather
>>>> other than seeing if it booted.
>>>
>>> Issue is that some services (kdump, tuned) seem to sometimes take
>>> ages on that system to start for some reason,
>>
>> How about disabling that?
>
> It seems to be random services. On my debug kernel with KASAN
> everything is just super slow. I'll try to measure on a !debug kernel.

Got it, thanks!

  Luis
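[A minimal sketch of what the idea behind commit 6023a6c7d98c boils
down to -- not the literal patch; the function name and the error
choices here are invented, while find_module_all() is the existing
module-list lookup helper from kernel/module/internal.h. Checking the
list under module_mutex before layout_and_allocate() lets a duplicate
request fail cheaply instead of paying for another multi-MiB vmalloc:

	static int reject_duplicate_module(const char *name)
	{
		struct module *mod;
		int err = 0;

		mutex_lock(&module_mutex);
		/* even_unformed == true: also catch loads still in progress */
		mod = find_module_all(name, strlen(name), true);
		if (mod) {
			/*
			 * Someone already loaded (or is loading) this module;
			 * don't allocate another copy that will only be
			 * thrown away.
			 */
			err = mod->state == MODULE_STATE_LIVE ? -EEXIST : -EBUSY;
		}
		mutex_unlock(&module_mutex);

		return err;
	}

Called early in load_module(), something of this shape turns a
duplicate request into a cheap -EEXIST/-EBUSY instead of a second
multi-MiB allocation.]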
On 21.03.23 17:52, Luis Chamberlain wrote:
> On Tue, Mar 21, 2023 at 04:11:27PM +0100, David Hildenbrand wrote:
>> It boots, but we get vmap allocation warnings, because the ~440 CPUs
>> manage to completely exhaust the module vmap area due to KASAN.
>
> Thanks, can you post a trace?

See https://lkml.kernel.org/r/20221013180518.217405-1-david@redhat.com
where I also describe the issue in more detail.

>> It boots and I don't get the vmap allocation warnings.
>
> Wonderful!

I'm still compiling/testing the debug kernels
(20230319-module-alloc-opts-adjust + single revert of the rcu patch).

I should have some systemd-analyze results for !debug kernels later.
On 20.03.23 22:27, Luis Chamberlain wrote:
> Now that we know that tree works, I'm curious also now if you can
> confirm that just re-ordering the patches still works (it should):
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts-adjust

So far no vmap errors booting the debug/kasan kernel (2 tries).

> And although it's *probably* just noise, I'm very curious how much
> difference, if any, there is if you just revert "module: use
> list_add_tail_rcu() when adding module".

Ditto, no vmap errors booting the debug/kasan kernel (2 tries).

> The data on that commit log is pretty small as I have a low-end system,
> and I'm not yet done beating the hell out of a system with stress-ng,
> but getting some data from a pretty large system would be great,
> especially if this series seems to prove fixing boot on them.

2x booting RHEL9.1 on a !debug kernel. Something went wrong with kdump
in 2 runs (I think I had to delete the kdump initrd to make space for
another kernel), but we can mostly ignore that. I wanted to rerun with
kdump disabled completely, but I'm out of time for today.

1) v6.3-rc1:

#1

Startup finished in 25.354s (kernel) + 7.662s (initrd) + 1min 8.805s (userspace) = 1min 41.822s
multi-user.target reached after 29.186s in userspace

47.178s kdump.service
14.898s tuned.service
11.394s chrony-wait.service
 7.486s systemd-udev-settle.service
 7.334s NetworkManager-wait-online.service
 2.908s initrd-switch-root.service
 2.451s smartd.service
 2.316s dracut-initqueue.service
 2.057s polkit.service
 1.290s NetworkManager.service
 1.046s cups.service
...

#2

Startup finished in 25.375s (kernel) + 7.497s (initrd) + 29.428s (userspace) = 1min 2.301s
multi-user.target reached after 29.392s in userspace

14.552s tuned.service
 9.410s chrony-wait.service
 8.126s systemd-udev-settle.service
 7.502s NetworkManager-wait-online.service
 2.871s initrd-switch-root.service
 2.401s kdump.service
 2.297s polkit.service
 2.116s dracut-initqueue.service
 2.027s smartd.service
 1.262s NetworkManager.service
 1.102s cups.service
 1.011s sshd.service
...

2) 20230319-module-alloc-opts-adjust + revert of list_add_tail_rcu():

#1

Startup finished in 25.441s (kernel) + 7.285s (initrd) + 1min 7.644s (userspace) = 1min 40.371s
multi-user.target reached after 27.213s in userspace

47.232s kdump.service
14.235s tuned.service
 8.220s chrony-wait.service
 7.444s NetworkManager-wait-online.service
 5.986s systemd-udev-settle.service
 2.881s initrd-switch-root.service
 2.236s smartd.service
 1.899s dracut-initqueue.service
 1.812s polkit.service
 1.554s NetworkManager.service
 1.140s ModemManager.service
...
#2

Startup finished in 25.377s (kernel) + 7.271s (initrd) + 28.247s (userspace) = 1min 897ms
multi-user.target reached after 28.210s in userspace

15.435s tuned.service
11.365s chrony-wait.service
 7.512s NetworkManager-wait-online.service
 5.962s systemd-udev-settle.service
 2.889s initrd-switch-root.service
 2.846s smartd.service
 2.819s kdump.service
 2.228s polkit.service
 1.872s dracut-initqueue.service
 1.312s NetworkManager.service
 1.152s ModemManager.service
 1.011s sshd.service
...

3) 20230319-module-alloc-opts-adjust:

#1

Startup finished in 25.320s (kernel) + 7.192s (initrd) + 1min 6.511s (userspace) = 1min 39.024s
multi-user.target reached after 28.205s in userspace

46.307s kdump.service
14.199s tuned.service
13.358s chrony-wait.service
 7.468s NetworkManager-wait-online.service
 6.329s systemd-udev-settle.service
 2.910s initrd-switch-root.service
 2.752s smartd.service
 2.142s polkit.service
 1.909s dracut-initqueue.service
 1.210s NetworkManager.service
 1.041s ModemManager.service
 925ms sshd.service
...

#2

Startup finished in 25.368s (kernel) + 7.303s (initrd) + 1min 6.897s (userspace) = 1min 39.569s
multi-user.target reached after 27.326s in userspace

45.626s kdump.service
14.707s tuned.service
13.246s chrony-wait.service
 7.428s NetworkManager-wait-online.service
 6.507s systemd-udev-settle.service
 3.038s initrd-switch-root.service
 3.001s smartd.service
 2.057s polkit.service
 1.746s dracut-initqueue.service
 1.221s NetworkManager.service
...

I think we primarily only care about systemd-udev-settle.service.

That is fastest without the rcu patch (~6s), compared to with the rcu
patch (~6.5s) and with stock (~7.5s -- 8s).

Looks like dracut-initqueue also might be a bit faster with your
changes, but maybe it's mostly noise (would have to do more runs).

So maybe drop that rcu patch? But of course, there could be other
scenarios where it's helpful ...
On 21.03.23 20:32, David Hildenbrand wrote:
> On 20.03.23 22:27, Luis Chamberlain wrote:
>> The data on that commit log is pretty small as I have a low-end system,
>> and I'm not yet done beating the hell out of a system with stress-ng,
>> but getting some data from a pretty large system would be great,
>> especially if this series seems to prove fixing boot on them.
>
> 2x booting RHEL9.1 on a !debug kernel.

<-- snip -->

> So maybe drop that rcu patch? But of course, there could be other
> scenarios where it's helpful ...

Are there any other things you would like me to measure/test? I'll have
to hand back that test machine soonish.
On Fri, Mar 24, 2023 at 10:27:14AM +0100, David Hildenbrand wrote:
> On 21.03.23 20:32, David Hildenbrand wrote:
>> So far no vmap errors booting the debug/kasan kernel (2 tries).

<-- snip -->

>> I think we primarily only care about systemd-udev-settle.service.
>>
>> That is fastest without the rcu patch (~6s), compared to with the rcu
>> patch (~6.5s) and with stock (~7.5s -- 8s).
>>
>> Looks like dracut-initqueue also might be a bit faster with your
>> changes, but maybe it's mostly noise (would have to do more runs).
>>
>> So maybe drop that rcu patch? But of course, there could be other
>> scenarios where it's helpful ...

Yes, I confirm the RCU patch does not help at all now, also using
stress-ng.

> Are there any other things you would like me to measure/test? I'll have
> to hand back that test machine soonish.

Yes, please test the below. Perhaps it's not the final form we want, but
it *does* fix OOM'ing when thrashing with stress-ng now with the module
option, and even with 100 threads it brings down max memory consumption
by 259 MiB. The reason is that we also vmalloc during each
kernel_read_file() for each module, way before we even do
layout_and_allocate(), and so obviously if we fix the module path but
not this path, this will eventually catch up with us.

I'm not at all happy with the current approach, given that ideally we'd
bump the counter when the user is done with the file, but we don't yet
have any tracking of that for users; they just vfree the memory itself.
And so this is just trying to catch heavy immediate abuse on the caller
to fend off abuse of vmalloc uses in a lightweight manner.

There's gotta be a better way to do this, but it's just an idea I have
so far. If we *want* to keep tabs until the user is done, we have to
modify most users of these APIs and introduce our own free. I don't
think we're in a rush to fix this, so maybe that's the better approach.

And so I've managed to reproduce the issues you found now with my new
stress-ng module stressor as well:

https://github.com/ColinIanKing/stress-ng.git

Even though you have 400 CPUs, with stress-ng we can likely reproduce it
with (use a module not loaded on your system):

./stress-ng --module 100 --module-name xfs

Without the patch below, using 400 threads still OOMs easily due to the
kread issue. Max threads allowed are 8192.
From ec5099b15bb74f154319034547184e81e4ad8ba0 Mon Sep 17 00:00:00 2001
From: Luis Chamberlain <mcgrof@kernel.org>
Date: Fri, 24 Mar 2023 10:35:41 -0700
Subject: [PATCH] fs/kernel_read_file: add a clutch

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 fs/kernel_read_file.c | 52 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 50 insertions(+), 2 deletions(-)

diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c
index 5d826274570c..2d55abe73b21 100644
--- a/fs/kernel_read_file.c
+++ b/fs/kernel_read_file.c
@@ -1,10 +1,52 @@
 // SPDX-License-Identifier: GPL-2.0-only
+
+#define pr_fmt(fmt) "kread: " fmt
+
 #include <linux/fs.h>
 #include <linux/fs_struct.h>
 #include <linux/kernel_read_file.h>
 #include <linux/security.h>
 #include <linux/vmalloc.h>
 
+/*
+ * This clutch ensures we only allow a certain number concurrent threads at a
+ * time allocating space concurrently and they must all finish within the
+ * timeout specified. Anything more we know we're thrashing.
+ */
+#define MAX_KREAD_CONCURRENT 20
+static atomic_t kread_concurrent_max = ATOMIC_INIT(MAX_KREAD_CONCURRENT);
+static DECLARE_WAIT_QUEUE_HEAD(kread_wq);
+
+/*
+ * How many seconds to wait for *all* MAX_KREAD_CONCURRENT threads running
+ * at the same time without returning.
+ */
+#define MAX_KREAD_ALL_BUSY_TIMEOUT 1
+
+static int kernel_read_check_concurrent(void)
+{
+	int ret;
+
+	if (atomic_dec_if_positive(&kread_concurrent_max) < 0) {
+		pr_warn_ratelimited("kread_concurrent_max (%u) close to 0 (max_loads: %u), throttling...",
+				    atomic_read(&kread_concurrent_max),
+				    MAX_KREAD_CONCURRENT);
+		ret = wait_event_killable_timeout(kread_wq,
+				atomic_dec_if_positive(&kread_concurrent_max) >= 0,
+				MAX_KREAD_ALL_BUSY_TIMEOUT * HZ);
+		if (!ret) {
+			pr_warn_ratelimited("reading cannot be processed, kernel busy with %d threads reading files now for more than %d seconds",
+					    MAX_KREAD_CONCURRENT, MAX_KREAD_ALL_BUSY_TIMEOUT);
+			return -ETIME;
+		} else if (ret == -ERESTARTSYS) {
+			pr_warn_ratelimited("sigkill sent for kernel read, giving up");
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
 /**
  * kernel_read_file() - read file contents into a kernel buffer
  *
@@ -68,10 +110,14 @@ ssize_t kernel_read_file(struct file *file, loff_t offset, void **buf,
 		goto out;
 	}
 
+	ret = kernel_read_check_concurrent();
+	if (ret)
+		goto out;
+
 	whole_file = (offset == 0 && i_size <= buf_size);
 	ret = security_kernel_read_file(file, id, whole_file);
 	if (ret)
-		goto out;
+		goto out_allow_new_read;
 
 	if (file_size)
 		*file_size = i_size;
@@ -117,7 +163,9 @@ ssize_t kernel_read_file(struct file *file, loff_t offset, void **buf,
 			*buf = NULL;
 		}
 	}
-
+out_allow_new_read:
+	atomic_inc(&kread_concurrent_max);
+	wake_up(&kread_wq);
 out:
 	allow_write_access(file);
 	return ret == 0 ? copied : ret;
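[A sketch of the "introduce our own free" direction Luis mentions above,
reusing the kread_concurrent_max/kread_wq machinery from his patch. The
function names are invented and this is not part of the posted series;
it assumes kernel_read_file() itself no longer takes a slot. The point
is that the limiter slot would be held for the buffer's whole lifetime,
not just for the duration of the read:

	/* Hypothetical wrapper: holds a throttle slot until the buffer is freed. */
	ssize_t kernel_read_file_throttled(struct file *file, loff_t offset,
					   void **buf, size_t buf_size,
					   size_t *file_size,
					   enum kernel_read_file_id id)
	{
		ssize_t ret;

		ret = kernel_read_check_concurrent();
		if (ret)
			return ret;

		ret = kernel_read_file(file, offset, buf, buf_size, file_size, id);
		if (ret < 0) {
			/* Nothing was handed to the caller; release the slot now. */
			atomic_inc(&kread_concurrent_max);
			wake_up(&kread_wq);
		}
		return ret;
	}

	/* Hypothetical pairing free; every caller would have to switch to it. */
	void kernel_read_file_free(void *buf)
	{
		vfree(buf);
		/* The buffer is gone, so admit the next reader. */
		atomic_inc(&kread_concurrent_max);
		wake_up(&kread_wq);
	}

This also illustrates why "most users of these APIs" would need
modifying: any caller that keeps vfree()ing the buffer directly would
leak a throttle slot.]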
On Fri, Mar 24, 2023 at 10:54 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> +/*
> + * This clutch ensures we only allow a certain number concurrent threads at a

kludge, not clutch.

And it's much worse than a kludge. It's just wrong and disgusting.

> +		pr_warn_ratelimited("kread_concurrent_max (%u) close to 0 (max_loads: %u), throttling...",
> +				    atomic_read(&kread_concurrent_max),
> +				    MAX_KREAD_CONCURRENT);

This is also wrong, since it's not kernel_read_file() that is the
problem, but whatever broken caller.

Yeah, yeah, in practice it's presumably always just finit_module()
doing kernel_read_file_from_fd(), but it's still *completely* wrong to
just say "function X is throttling" when "X" isn't the problem, and
doesn't tell what the _real_ problem is.

I really think this all needs some core fixing at the module layer,
not these kinds of horrific hacks.

          Linus
On Fri, Mar 24, 2023 at 12:11:07PM -0700, Linus Torvalds wrote:
> On Fri, Mar 24, 2023 at 10:54 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
>> +/*
>> + * This clutch ensures we only allow a certain number concurrent threads at a
>
> kludge, not clutch.
>
> And it's much worse than a kludge. It's just wrong and disgusting.

I wasn't happy with it either...

> This is also wrong, since it's not kernel_read_file() that is the
> problem, but whatever broken caller.
>
> Yeah, yeah, in practice it's presumably always just finit_module()
> doing kernel_read_file_from_fd(), but it's still *completely* wrong to
> just say "function X is throttling" when "X" isn't the problem, and
> doesn't tell what the _real_ problem is.

True.

> I really think this all needs some core fixing at the module layer,
> not these kinds of horrific hacks.

On the modules side of things we can be super defensive on the second
vmalloc allocation [0], but other than that the initial kread also needs
care. To address the kread abuse within finit_module() we could just
move the kludge to the modules side of things until each free happens,
as in the alternative below. That just means any easy user-interfacing
call with kernel_read*() would likely have to be just as careful.
Untested below.

[0] https://lkml.kernel.org/r/20230319214926.1794108-4-mcgrof@kernel.org

  Luis

From 3c3f7e597ab35b4482ccb4064bb897eefa449071 Mon Sep 17 00:00:00 2001
From: Luis Chamberlain <mcgrof@kernel.org>
Date: Fri, 24 Mar 2023 12:51:44 -0700
Subject: [PATCH] module: kludge

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/module/main.c | 45 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 44 insertions(+), 1 deletion(-)

diff --git a/kernel/module/main.c b/kernel/module/main.c
index 145e15f19576..a96de989532a 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -62,6 +62,16 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/module.h>
 
+#define MAX_INITMOD_CONCURRENT 50
+static atomic_t initmod_concurrent_max = ATOMIC_INIT(MAX_INITMOD_CONCURRENT);
+static DECLARE_WAIT_QUEUE_HEAD(initmod_wq);
+
+/*
+ * How much time to wait for *all* MAX_INITMOD_CONCURRENT threads running
+ * at the same time without returning.
+ */
+#define MAX_INITMOD_ALL_BUSY_TIMEOUT 5
+
 /*
  * Mutex protects:
  * 1) List of modules (also safely readable with preempt_disable),
@@ -3015,6 +3025,30 @@ SYSCALL_DEFINE3(init_module, void __user *, umod,
 	return load_module(&info, uargs, 0);
 }
 
+static int module_kread_concurrent(void)
+{
+	int err;
+
+	if (atomic_dec_if_positive(&initmod_concurrent_max) < 0) {
+		pr_warn_ratelimited("finit_module: initmod_concurrent_max (%u) close to 0 (max_loads: %u), throttling...",
+				    atomic_read(&initmod_concurrent_max),
+				    MAX_INITMOD_CONCURRENT);
+		err = wait_event_killable_timeout(initmod_wq,
+				atomic_dec_if_positive(&initmod_concurrent_max) >= 0,
+				MAX_INITMOD_ALL_BUSY_TIMEOUT * HZ);
+		if (!err) {
+			pr_warn_ratelimited("finit_module: loading module cannot be processed, kernel busy with %d threads loading modules now for more than %d seconds",
+					    MAX_INITMOD_CONCURRENT, MAX_INITMOD_ALL_BUSY_TIMEOUT);
+			return -ETIME;
+		} else if (err == -ERESTARTSYS) {
+			pr_warn_ratelimited("finit_module: sigkill sent for load_module, giving up");
+			return err;
+		}
+	}
+
+	return 0;
+}
+
 SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
 {
 	struct load_info info = { };
@@ -3033,6 +3067,10 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
 		      |MODULE_INIT_COMPRESSED_FILE))
 		return -EINVAL;
 
+	err = module_kread_concurrent();
+	if (err)
+		return err;
+
 	len = kernel_read_file_from_fd(fd, 0, &buf, INT_MAX, NULL,
 				       READING_MODULE);
 	if (len < 0)
@@ -3048,7 +3086,12 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
 		info.len = len;
 	}
 
-	return load_module(&info, uargs, flags);
+	err = load_module(&info, uargs, flags);
+
+	atomic_inc(&initmod_concurrent_max);
+	wake_up(&initmod_wq);
+
+	return err;
 }
 
 /* Keep in sync with MODULE_FLAGS_BUF_SIZE !!! */
On Fri, Mar 24, 2023 at 1:00 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On the modules side of things we can be super defensive on the second
> vmalloc allocation [0], but other than that the initial kread also
> needs care.

Please don't re-implement semaphores. They are a *classic* concurrency
limiter.

In fact, probably *THE* classic one.

So just do something like this instead:

--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2937,6 +2937,11 @@ SYSCALL_DEFINE3(init_module, void __user *, umod,
 	return load_module(&info, uargs, 0);
 }
 
+#define CONCURRENCY_LIMITER(name, n) \
+	struct semaphore name = __SEMAPHORE_INITIALIZER(name, n)
+
+static CONCURRENCY_LIMITER(module_loading_concurrency, 50);
+
 SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
 {
 	struct load_info info = { };
@@ -2955,8 +2960,12 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
 		      |MODULE_INIT_COMPRESSED_FILE))
 		return -EINVAL;
 
+	err = down_killable(&module_loading_concurrency);
+	if (err)
+		return err;
 	len = kernel_read_file_from_fd(fd, 0, &buf, INT_MAX, NULL,
 				       READING_MODULE);
+	up(&module_loading_concurrency);
 	if (len < 0)
 		return len;

NOTE! Entirely untested. Surprise surprise.

I'm a tiny bit worried about deadlocks here, so somebody needs to make
sure that the kernel_read_file_from_fd() case cannot possibly in turn
cause a "request_module()" recursion.

We most definitely have had module recursion before, but I *think*
it's always just in the module init function (ie one module requests
another when the mod->init() function is called).

I think by the time we have opened the module file descriptors and are
just reading the data, we should be ok and the above would never
deadlock, but I want people to at least think about it.

Of course, with that "max 50 concurrent loaders" limit, maybe it's
never an issue anyway. Even if you get a recursion a few deep, at most
you just wait for somebody else to get out of their loaders. Unless
*everybody* ends up waiting on some progress.

          Linus
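[Why this works as a limiter: a counting semaphore initialized to N lets
N tasks hold it before down() blocks. At the time of this thread,
DEFINE_SEMAPHORE() hard-coded a count of 1, which is presumably why the
initializer is open-coded above. The generic shape of the pattern, as a
sketch only:

	static struct semaphore throttle = __SEMAPHORE_INITIALIZER(throttle, 20);

	static int do_throttled_work(void)
	{
		int err;

		err = down_killable(&throttle);	/* blocks once 20 tasks hold it */
		if (err)
			return err;	/* -EINTR: fatal signal while waiting */
		/* ... at most 20 tasks execute this section at once ... */
		up(&throttle);		/* admit the next waiter, if any */
		return 0;
	}
]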
On Fri, Mar 24, 2023 at 01:28:51PM -0700, Linus Torvalds wrote:
> Please don't re-implement semaphores. They are a *classic* concurrency
> limiter.
>
> In fact, probably *THE* classic one.
>
> So just do something like this instead:

<-- snip -->

> NOTE! Entirely untested. Surprise surprise.

I'll give it a good whack, thanks. But it still begs the question
whether *other* vmalloc user-interfacing places need similar practices.
It's not just system calls that use it willy-nilly, but anything that
could in the end use it. Surely a lot of "issues" could only happen in
an insane pathological use case, but it's a generic thing to keep in
mind in the end.

> I'm a tiny bit worried about deadlocks here, so somebody needs to make
> sure that the kernel_read_file_from_fd() case cannot possibly in turn
> cause a "request_module()" recursion.

Automount on a path where the module lies, for a module not loaded yet,
triggering a kernel module autoload is the only thing I can think of
that could cause that, but that just calls the userspace modprobe. And
I think that'd be an insane setup.

> We most definitely have had module recursion before, but I *think*
> it's always just in the module init function (ie one module requests
> another when the mod->init() function is called).

Since you up() right after the kernel_read_file_from_fd(), we would not
be racing module inits with this. If however we place the up() after
load_module(), then that does incur that recursion possibility.

> I think by the time we have opened the module file descriptors and are
> just reading the data, we should be ok and the above would never
> deadlock, but I want people to at least think about it.
>
> Of course, with that "max 50 concurrent loaders" limit, maybe it's
> never an issue anyway. Even if you get a recursion a few deep, at most
> you just wait for somebody else to get out of their loaders. Unless
> *everybody* ends up waiting on some progress.

Yeah, these certainly are pathological corner cases. I'm more interested
in doing something sane for 1000 CPUs or 2000 CPUs for now, even if the
issue was the kernel (not userspace) to blame (and even if those use
cases are being fixed now, like the queued-up linux-next ACPI CPU
frequency modules per CPU).

  Luis
On Fri, Mar 24, 2023 at 02:14:11PM -0700, Luis Chamberlain wrote:
> On Fri, Mar 24, 2023 at 01:28:51PM -0700, Linus Torvalds wrote:
>> +static CONCURRENCY_LIMITER(module_loading_concurrency, 50);
>
> I'll give it a good whack, thanks.

Using a value of 20 above matches what I need to get up to 100 stress-ng
module ops without any OOM (which seems to mimic a 400 CPU environment):

echo 0 > /proc/sys/vm/oom_dump_tasks
./stress-ng --module 100 --module-name xfs

I'll go with that and evaluate max memory usage as well, but this is
still inviting us to consider other user-interfacing areas with an
implicated vmalloc, and the scale of this as we grow CPU counts. A magic
value of 20 is completely empirical.

  Luis
On Fri, Mar 24, 2023 at 4:27 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> A magic value of 20 is completely empirical.

I suspect a magic value is fine in this context. I do not believe that
"hundreds of concurrent module loads" is a valid real-life load. It
sounds more like a "systemd does horrible things at startup" situation.

20 sounds plenty big enough to me. But I guess this is one of those
"only real life will tell" things.

              Linus
On 24.03.23 18:54, Luis Chamberlain wrote:
> On Fri, Mar 24, 2023 at 10:27:14AM +0100, David Hildenbrand wrote:
>> On 21.03.23 20:32, David Hildenbrand wrote:
>>> On 20.03.23 22:27, Luis Chamberlain wrote:
>>>> On Mon, Mar 20, 2023 at 02:23:36PM -0700, Luis Chamberlain wrote:
>>>>> On Mon, Mar 20, 2023 at 10:15:23PM +0100, David Hildenbrand wrote:
>>>>>> Not able to reproduce with 20230319-module-alloc-opts so far (2 tries).
>>>>>
>>>>> Oh wow, so to clarify, it boots OK?
>>>>
>>>> Now that we know that tree works, I'm curious also now if you can
>>>> confirm just re-ordering the patches still works (it should):
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts-adjust
>>>
>>> So far no vmap errors booting the debug/kasan kernel (2 tries).
>
> <-- snip -->
>
>>> I think we primarily only care about systemd-udev-settle.service.
>>>
>>> That is fastest without the rcu patch (~6s), compared to with the rcu
>>> patch (~6.5s) and with stock (~7.5s -- 8s).
>>>
>>> Looks like dracut-initqueue also might be a bit faster with your changes, but
>>> maybe it's mostly noise (would have to do more runs).
>>>
>>> So maybe drop that rcu patch? But of course, there could be other scenarios where it's
>>> helpful ...
>
> Yes, I confirm the RCU patch does not help at all now, also using
> stress-ng.
>
>> Are there any other things you would like me to measure/test? I'll have to
>> hand back that test machine soonish.
>
> Yes, please test the below. Perhaps it's not the final form we want, but
> it *does* fix OOM'ing when thrashing with stress-ng now with the module
> option, and even with 100 threads it brings down max memory consumption
> by 259 MiB. The reason is that we also vmalloc during the kread for
> each module, well before we even get to layout_and_allocate(), so
> obviously if we fix the module path but not this path, it will
> eventually catch up with us. I'm not at all happy with the current
> approach, given that ideally we'd bump the counter when the user is
> done with the file, but we don't yet have any such tracking for users;
> they just vfree the memory themselves. And so this is just trying to
> catch heavy immediate abuse on the caller side, to fend off vmalloc
> abuse in a lightweight manner.

Understood. (I'm planning on reviewing once I have some spare cycles.)

>
> There's gotta be a better way to do this, but it's just an idea I have so far.
> If we *want* to keep tabs until the user is done, we have to modify
> most users of these APIs and introduce our own free. I don't think we're
> in a rush to fix this, so maybe that's the better approach.
>
> And so I've managed to reproduce the issues you found with my new
> stress-ng module stressor as well.

Nice!

>
> https://github.com/ColinIanKing/stress-ng.git
>
> Even though you have 400 CPUs, with stress-ng we can likely reproduce
> it with (use a module not loaded on your system):
>
> ./stress-ng --module 100 --module-name xfs

I'll give that a churn on that machine with the updated patch ...

>
> Without the patch below, using 400 threads still OOMs easily due to the
> kread issue. Max threads allowed are 8192.
>

... do you have an updated patch/branch that includes the feedback from
Linus, so I can give it a churn tomorrow?
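[For readers following the "keep tabs until the user is done" idea
quoted above, a rough userspace model of that accounting is sketched
below: a byte budget charged at allocation time and released only by a
matching, bespoke free. The names budget_alloc()/budget_free() and the
256 MiB budget are hypothetical and not part of the series; they just
illustrate why every caller of the kread APIs would have to be converted
to a paired free.]

#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

static atomic_long bytes_in_flight;		/* charged bytes */
static const long budget = 256L << 20;		/* 256 MiB, made up */

/* Charge the budget up front; refuse the allocation if we'd exceed it. */
static void *budget_alloc(size_t size)
{
	long old = atomic_fetch_add(&bytes_in_flight, (long)size);

	if (old + (long)size > budget) {
		atomic_fetch_sub(&bytes_in_flight, (long)size);
		return NULL;	/* the kernel would return an error here */
	}
	void *p = malloc(size);
	if (!p)
		atomic_fetch_sub(&bytes_in_flight, (long)size);
	return p;
}

/* The "our own free": the only place the charge is ever released. */
static void budget_free(void *p, size_t size)
{
	free(p);
	atomic_fetch_sub(&bytes_in_flight, (long)size);
}

int main(void)
{
	void *a = budget_alloc(200L << 20);
	void *b = budget_alloc(100L << 20);	/* over budget: NULL */

	printf("a=%p b=%p bytes in flight=%ld\n", a, b,
	       (long)atomic_load(&bytes_in_flight));
	budget_free(a, 200L << 20);
	return 0;
}

[The point of the sketch: once callers vfree() (here: free()) the buffer
directly, the accounting has no way to know the user is done, which is
exactly the problem described above.]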
On Tue, Mar 28, 2023 at 05:44:40AM +0200, David Hildenbrand wrote:
> ... do you have an updated patch/branch that includes the feedback from
> Linus, so I can give it a churn tomorrow?

Yeah, sure:

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230327-module-alloc-opts

The commit log needs updating to reflect the results I just collected:

With the alloc patch ("module: avoid allocation if module is already
present and ready") I see a 145 MiB difference in memory use compared to
the patch before it, "module: extract patient module check into helper".
So I think that's a clear keeper and should help large CPU count boots.

The patch "module: add concurrency limiter", which puts the concurrency
limiter on the kread only, saves just about 2 MiB with 100 stress-ng
ops, which seems to be what I needed to reproduce your original 400 CPU
count issue.

The program used to reproduce is stress-ng with the new module option:

echo 0 > /proc/sys/vm/oom_dump_tasks
./stress-ng --module 100 --module-name xfs

To see how much max memory I use, I just use:

free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > foo.log

Run the test in another window, CTRL-C the test when the above finishes
after 40 seconds, and then:

sort -n -r foo.log | head -1

If you have xfs loaded already, you probably want to pick a similarly
large module that you don't have loaded. You must have its dependencies
loaded already, as stress-ng doesn't call modprobe; it just
finit_module()s the module.

The last patch, "modules/kmod: replace implementation with a semaphore",
just takes Linus' simplification suggestion to replace kmod's solution.
I tested it with:

tools/testing/selftests/kmod/kmod.sh -t 0008

This stress tests kmod autoloading. Using time, the timing is similar,
perhaps mildly slower, but I like the simplification.

I'm still not happy with "module: add concurrency limiter", especially
since it doesn't seem to really buy us much; 2 MiB seems within noise.
But stress-ng clearly shows it is a possible source of issues as we grow
the ops. So it may make sense to just use the idea to replace the
limiter in kmod for now, unless you see different results.

The stress-ng module test can at least be used now to replicate insane
setups at bootup. Let me know if you get other useful results.

  Luis
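[The free/awk sampling pipeline above can also be expressed as a small
standalone C tool, for whoever prefers one binary over a shell pipeline.
A sketch, assuming a Linux /proc/meminfo and approximating "used" as
MemTotal - MemAvailable, which is close to, but not identical to, what
free(1) reports; the sample count and interval mirror the 40 x 1s used
above.]

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Return the KiB value for a /proc/meminfo key like "MemTotal:", or -1. */
static long meminfo_kib(const char *key)
{
	char line[256];
	long val = -1;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, key, strlen(key))) {
			sscanf(line + strlen(key), " %ld", &val);
			break;
		}
	}
	fclose(f);
	return val;
}

int main(void)
{
	long total = meminfo_kib("MemTotal:");
	long peak = 0;

	for (int i = 0; i < 40; i++) {	/* 40 samples, one second apart */
		long used = total - meminfo_kib("MemAvailable:");
		if (used > peak)
			peak = used;
		sleep(1);
	}
	printf("peak used estimate: %ld KiB\n", peak);
	return 0;
}

[Run it in one window while stress-ng runs in another; the single
printed number replaces the sort -n -r | head -1 step.]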
On 28.03.23 08:16, Luis Chamberlain wrote:
> On Tue, Mar 28, 2023 at 05:44:40AM +0200, David Hildenbrand wrote:
>> ... do you have an updated patch/branch that includes the feedback from
>> Linus, so I can give it a churn tomorrow?
>
> Yeah, sure:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230327-module-alloc-opts
>

I gave that one a go and got for system bootup:

#1:
13.761s tuned.service
12.261s chrony-wait.service
 7.386s NetworkManager-wait-online.service
 5.227s systemd-udev-settle.service
 2.893s initrd-switch-root.service
 2.148s polkit.service
 2.137s smartd.service
 1.893s dracut-initqueue.service
 1.290s NetworkManager.service
 1.032s cups.service

#2:
13.881s tuned.service
 9.255s chrony-wait.service
 7.404s NetworkManager-wait-online.service
 5.826s systemd-udev-settle.service
 2.859s initrd-switch-root.service
 2.847s smartd.service
 2.172s polkit.service
 1.884s dracut-initqueue.service
 1.371s NetworkManager.service
 1.119s ModemManager.service

So we're a bit faster (0.2 -- 0.7s) than the original version without
the rcu patch (~6s).

> The commit log needs updating to reflect the results I just collected:
>
> With the alloc patch ("module: avoid allocation if module is already
> present and ready") I see a 145 MiB difference in memory use compared to
> the patch before it, "module: extract patient module check into helper".
> So I think that's a clear keeper and should help large CPU count boots.
>
> The patch "module: add concurrency limiter", which puts the concurrency
> limiter on the kread only, saves just about 2 MiB with 100 stress-ng
> ops, which seems to be what I needed to reproduce your original 400 CPU
> count issue.
>
> The program used to reproduce is stress-ng with the new module option:
>
> echo 0 > /proc/sys/vm/oom_dump_tasks
> ./stress-ng --module 100 --module-name xfs

The above command fills the kernel log for me with nfs (but also ext4):

...
[  883.036035] nfs: Unknown symbol xdr_reserve_space (err -2)
[  883.042221] nfs: Unknown symbol rpc_init_wait_queue (err -2)
[  883.048549] nfs: Unknown symbol put_rpccred (err -2)
[  883.054104] nfs: Unknown symbol __fscache_invalidate (err -2)
[  883.060540] nfs: Unknown symbol __fscache_use_cookie (err -2)
[  883.066969] nfs: Unknown symbol rpc_clnt_xprt_switch_has_addr (err -2)
[  883.074264] nfs: Unknown symbol __fscache_begin_write_operation (err -2)
[  883.081743] nfs: Unknown symbol nlmclnt_init (err -2)
[  883.087396] nfs: Unknown symbol nlmclnt_done (err -2)
[  883.093074] nfs: Unknown symbol nfs_debug (err -2)
[  883.098429] nfs: Unknown symbol rpc_wait_for_completion_task (err -2)
[  883.105640] nfs: Unknown symbol __fscache_acquire_cookie (err -2)
[  883.163764] nfs: Unknown symbol rpc_put_task (err -2)
[  883.169461] nfs: Unknown symbol __fscache_acquire_volume (err -2)
[  883.176297] nfs: Unknown symbol rpc_proc_register (err -2)
[  883.182430] nfs: Unknown symbol rpc_shutdown_client (err -2)
[  883.188765] nfs: Unknown symbol rpc_clnt_show_stats (err -2)
[  883.195097] nfs: Unknown symbol __fscache_begin_read_operation (err -2)
...

I do *not* get these errors on manual modprobe/rmmod. BUG in the
concurrency handling, or just a side effect of the concurrent loading?
>
> To see how much max memory I use, I just use:
>
> free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > foo.log
>
> Run the test in another window, CTRL-C the test when the above finishes
> after 40 seconds, and then:
>
> sort -n -r foo.log | head -1

[root@lenovo-sr950-01 fs]# sort -n -r foo.log | head -1
14254024
[root@lenovo-sr950-01 fs]# sort -n -r foo.log | tail -1
12862528

So a difference of 1391496 (KiB I assume, so 1.3 GiB!?) compared to
before the test (I first start capturing and then run stress-ng).

>
> If you have xfs loaded already, you probably want to pick a similarly
> large module that you don't have loaded. You must have its dependencies
> loaded already, as stress-ng doesn't call modprobe; it just
> finit_module()s the module.

My setup already has xfs in use. nfs and ext4 are a bit smaller, but
still big.
On Tue, Mar 28, 2023 at 11:02:49PM +0200, David Hildenbrand wrote:
>
> So we're a bit faster (0.2 -- 0.7s) than the original version without
> the rcu patch (~6s).

Groovy.

> > The commit log needs updating to reflect the results I just collected:
> >
> > With the alloc patch ("module: avoid allocation if module is already
> > present and ready") I see a 145 MiB difference in memory use compared to
> > the patch before it, "module: extract patient module check into helper".
> > So I think that's a clear keeper and should help large CPU count boots.
> >
> > The patch "module: add concurrency limiter", which puts the concurrency
> > limiter on the kread only, saves just about 2 MiB with 100 stress-ng
> > ops, which seems to be what I needed to reproduce your original 400 CPU
> > count issue.
> >
> > The program used to reproduce is stress-ng with the new module option:
> >
> > echo 0 > /proc/sys/vm/oom_dump_tasks
> > ./stress-ng --module 100 --module-name xfs
>
> The above command fills the kernel log for me with nfs (but also ext4):
>
> ...
> [  883.036035] nfs: Unknown symbol xdr_reserve_space (err -2)
> [  883.042221] nfs: Unknown symbol rpc_init_wait_queue (err -2)
> [  883.048549] nfs: Unknown symbol put_rpccred (err -2)
> [  883.054104] nfs: Unknown symbol __fscache_invalidate (err -2)
> [  883.060540] nfs: Unknown symbol __fscache_use_cookie (err -2)
> [  883.066969] nfs: Unknown symbol rpc_clnt_xprt_switch_has_addr (err -2)
> [  883.074264] nfs: Unknown symbol __fscache_begin_write_operation (err -2)
> [  883.081743] nfs: Unknown symbol nlmclnt_init (err -2)
> [  883.087396] nfs: Unknown symbol nlmclnt_done (err -2)
> [  883.093074] nfs: Unknown symbol nfs_debug (err -2)
> [  883.098429] nfs: Unknown symbol rpc_wait_for_completion_task (err -2)
> [  883.105640] nfs: Unknown symbol __fscache_acquire_cookie (err -2)
> [  883.163764] nfs: Unknown symbol rpc_put_task (err -2)
> [  883.169461] nfs: Unknown symbol __fscache_acquire_volume (err -2)
> [  883.176297] nfs: Unknown symbol rpc_proc_register (err -2)
> [  883.182430] nfs: Unknown symbol rpc_shutdown_client (err -2)
> [  883.188765] nfs: Unknown symbol rpc_clnt_show_stats (err -2)
> [  883.195097] nfs: Unknown symbol __fscache_begin_read_operation (err -2)
> ...
>
> I do *not* get these errors on manual modprobe/rmmod. BUG in the
> concurrency handling, or just a side effect of the concurrent loading?

It is just because modprobe deals with module dependencies; stress-ng
doesn't, it just calls finit_module() and expects the dependencies to be
loaded first.

> > To see how much max memory I use, I just use:
> >
> > free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > foo.log
> >
> > Run the test in another window, CTRL-C the test when the above finishes
> > after 40 seconds, and then:
> >
> > sort -n -r foo.log | head -1
>
> [root@lenovo-sr950-01 fs]# sort -n -r foo.log | head -1
> 14254024
> [root@lenovo-sr950-01 fs]# sort -n -r foo.log | tail -1
> 12862528
>
> So a difference of 1391496 (KiB I assume, so 1.3 GiB!?)

That is sadly correct.

> compared to
> before the test (I first start capturing and then run stress-ng).

That's a good chunk :-D

On a recent thread I do a full analysis of average module sizes [0]; you
can use that to also get an average size estimate of your currently
loaded modules. Do a bit of math and that could give you the average
memory pressure on real vmap allocations.
For a synthetic test as with the stress-ng module stressor, we'd be
doing twice the memory for a successful load, and only a single
allocation (due to the kread) for a failed load, since we always
allocate memory even for bogus modules on the kread side.

My xfs.ko is 42 MiB on my test guest system. With 100 concurrent threads
doing two allocations each, that puts the expected worst-case vmap
allocation at around:

42*2*100
8400
8400/1024
8.20312500000000000000

8.2 GiB

So in the worst case that's what we'd see. Since I saw the patches
helped by around 145 MiB overall, let's see what that means. Let me
re-test:

Idle memory (note free -k uses kibibytes, KiB):

root@kmod ~ # free -k -s 1 -c 3 | grep Mem | awk '{print $3}' | sort -n -r | head -1
416452

So while idle, the system is using 416452 KiB, so 406 MiB.

While running stress-ng with 100 module ops:

free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > 2023-03-28-kmodsem-stress-ng-v1.txt
echo 0 > /proc/sys/vm/oom_dump_tasks
./stress-ng --module 100 --module-name xfs
root@kmod ~ # sort -n -r 2023-03-28-kmodsem-stress-ng-v1.txt | head -1
4886492

4886492/1024
4771.96484375000000000000
4886492/1024/1024
4.66012191772460937500

So during the stress test the highest memory usage was about
4771.96 MiB, or ~4.66 GiB.

What's the difference between idle and during the stress test:

4886492 - 416452
4470040
4470040/1024
4365.27343750000000000000
4470040/1024/1024
4.26296234130859375000

So about 4365.27 MiB, or 4.26 GiB.

Now, the upper limit for the test should have been a delta of 8.2 GiB,
and we get about 4.26, so it means we're rejecting more than half of the
requests. Why more than half, and not just half? I only see 40
successful loads of XFS during a 40 second window:

dmesg -c > /dev/null
# run tests
dmesg -c | grep XFS | wc -l

At 100 ops all running finit_module() for 40 seconds, those successful
loads alone should have consumed only about 40 * 2 * the size of XFS
(42 MiB): 3360 MiB, or 3.28 GiB. But only one copy of the module can
actually be loaded at a time, so at the *worst* consumption point an XFS
load in this test should only be consuming 2 * size_of_XFS (42 MiB), so
84 MiB.

4365.27 - 84
4281.27

So about 4281.27 MiB (4.18 GiB) is consumed by the 98.07% of module
loads that failed, due to the kread*() calls from finit_module().

And get this... *if* you use module compression, that also uses vmap()
*after* the kernel_read*() call, which uses vmalloc(). *At least* in
that case we immediately free the buffer for the compressed module, but
*still* -- that is 3 possible vmap space allocations for every module!

It'd be hard, but not impossible, to collect stats for failed
finit_module() calls. *That* could be indicative of areas in the kernel
we need to brush up on to stop doing stupid things, like we learned
about for the CPU frequency scaling modules. As I wrote this paragraph I
realized that this *is* what we really have wanted all along to help us
debug these stupid things, so we can slowly learn where to point fingers
to help optimize things. Although I recently did some tool scraping to
collect stats and *wished* for this from userspace [0], it wasn't hard
to just add a debug option for this to help us debug these failures. So
I added support for that and finally sent a new patch series.

[0] https://lkml.kernel.org/r/CAB=NE6UTC4VkNM57GGJ3XkG_PWLkMfXv2e2=yQJhtM6Fc-uMsQ@mail.gmail.com

> > If you have xfs loaded already, you probably want to pick a similarly
> > large module that you don't have loaded. You must have its dependencies
> > loaded already, as stress-ng doesn't call modprobe; it just
> > finit_module()s the module.
> My setup already has xfs in use. nfs and ext4 are a bit smaller, but
> still big.

Sure.

  Luis
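[The worst-case arithmetic in the analysis above boils down to a simple
model: peak vmap pressure is roughly module size x allocations per load
x concurrent loaders. A throwaway calculator mirroring the numbers used
above (42 MiB xfs.ko, 100 stress-ng threads, two allocations per load);
the figures come from this thread, the program itself is purely
illustrative.]

#include <stdio.h>

int main(void)
{
	double module_mib = 42.0;	/* xfs.ko size from the test above */
	int threads = 100;		/* ./stress-ng --module 100 */
	int allocs_per_load = 2;	/* kread buffer + layout_and_allocate() */

	double worst = module_mib * allocs_per_load * threads;
	printf("worst case: %.1f MiB (%.2f GiB)\n", worst, worst / 1024.0);
	/* prints: worst case: 8400.0 MiB (8.20 GiB), matching the math above */
	return 0;
}

[With module compression adding a third vmap allocation per load, as
noted above, the same model would give roughly 12.3 GiB.]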
On 29.03.23 07:31, Luis Chamberlain wrote:
> On Tue, Mar 28, 2023 at 11:02:49PM +0200, David Hildenbrand wrote:
>>
>> So we're a bit faster (0.2 -- 0.7s) than the original version without
>> the rcu patch (~6s).
>
> Groovy.

Just to clarify, it was up to 0.7s faster (I phrased it in a confusing
way).

>
>>> The commit log needs updating to reflect the results I just collected:
>>>
>>> With the alloc patch ("module: avoid allocation if module is already
>>> present and ready") I see a 145 MiB difference in memory use compared to
>>> the patch before it, "module: extract patient module check into helper".
>>> So I think that's a clear keeper and should help large CPU count boots.
>>>
>>> The patch "module: add concurrency limiter", which puts the concurrency
>>> limiter on the kread only, saves just about 2 MiB with 100 stress-ng
>>> ops, which seems to be what I needed to reproduce your original 400 CPU
>>> count issue.
>>>
>>> The program used to reproduce is stress-ng with the new module option:
>>>
>>> echo 0 > /proc/sys/vm/oom_dump_tasks
>>> ./stress-ng --module 100 --module-name xfs
>>
>> The above command fills the kernel log for me with nfs (but also ext4):
>>
>> ...
>> [  883.036035] nfs: Unknown symbol xdr_reserve_space (err -2)
>> [  883.042221] nfs: Unknown symbol rpc_init_wait_queue (err -2)
>> [  883.048549] nfs: Unknown symbol put_rpccred (err -2)
>> [  883.054104] nfs: Unknown symbol __fscache_invalidate (err -2)
>> [  883.060540] nfs: Unknown symbol __fscache_use_cookie (err -2)
>> [  883.066969] nfs: Unknown symbol rpc_clnt_xprt_switch_has_addr (err -2)
>> [  883.074264] nfs: Unknown symbol __fscache_begin_write_operation (err -2)
>> [  883.081743] nfs: Unknown symbol nlmclnt_init (err -2)
>> [  883.087396] nfs: Unknown symbol nlmclnt_done (err -2)
>> [  883.093074] nfs: Unknown symbol nfs_debug (err -2)
>> [  883.098429] nfs: Unknown symbol rpc_wait_for_completion_task (err -2)
>> [  883.105640] nfs: Unknown symbol __fscache_acquire_cookie (err -2)
>> [  883.163764] nfs: Unknown symbol rpc_put_task (err -2)
>> [  883.169461] nfs: Unknown symbol __fscache_acquire_volume (err -2)
>> [  883.176297] nfs: Unknown symbol rpc_proc_register (err -2)
>> [  883.182430] nfs: Unknown symbol rpc_shutdown_client (err -2)
>> [  883.188765] nfs: Unknown symbol rpc_clnt_show_stats (err -2)
>> [  883.195097] nfs: Unknown symbol __fscache_begin_read_operation (err -2)
>> ...
>>
>> I do *not* get these errors on manual modprobe/rmmod. BUG in the
>> concurrency handling, or just a side effect of the concurrent loading?
>
> It is just because modprobe deals with module dependencies; stress-ng
> doesn't, it just calls finit_module() and expects the dependencies to be
> loaded first.

Oh, right. That makes sense.

>
>>> To see how much max memory I use, I just use:
>>>
>>> free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > foo.log
>>>
>>> Run the test in another window, CTRL-C the test when the above finishes
>>> after 40 seconds, and then:
>>>
>>> sort -n -r foo.log | head -1
>>
>> [root@lenovo-sr950-01 fs]# sort -n -r foo.log | head -1
>> 14254024
>> [root@lenovo-sr950-01 fs]# sort -n -r foo.log | tail -1
>> 12862528
>>
>> So a difference of 1391496 (KiB I assume, so 1.3 GiB!?)
>
> That is sadly correct.
>
>> compared to
>> before the test (I first start capturing and then run stress-ng).
>
> That's a good chunk :-D
>
> On a recent thread I do a full analysis of average module sizes [0]; you
> can use that to also get an average size estimate of your currently
> loaded modules.
> Do a bit of math and that could give you the average memory pressure on
> real vmap allocations. For a synthetic test as with the stress-ng
> module stressor, we'd be doing twice the memory for a successful load,
> and only a single allocation (due to the kread) for a failed load,
> since we always allocate memory even for bogus modules on the kread
> side.
>
> My xfs.ko is 42 MiB on my test guest system. With 100 concurrent threads
> doing two allocations each, that puts the expected worst-case vmap
> allocation at around:
>
> 42*2*100
> 8400
> 8400/1024
> 8.20312500000000000000
>
> 8.2 GiB
>
> So in the worst case that's what we'd see. Since I saw the patches
> helped by around 145 MiB overall, let's see what that means. Let me
> re-test:
>
> Idle memory (note free -k uses kibibytes, KiB):
>
> root@kmod ~ # free -k -s 1 -c 3 | grep Mem | awk '{print $3}' | sort -n -r | head -1
> 416452
>
> So while idle, the system is using 416452 KiB, so 406 MiB.
>
> While running stress-ng with 100 module ops:
>
> free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > 2023-03-28-kmodsem-stress-ng-v1.txt
> echo 0 > /proc/sys/vm/oom_dump_tasks
> ./stress-ng --module 100 --module-name xfs
> root@kmod ~ # sort -n -r 2023-03-28-kmodsem-stress-ng-v1.txt | head -1
> 4886492
>
> 4886492/1024
> 4771.96484375000000000000
> 4886492/1024/1024
> 4.66012191772460937500
>
> So during the stress test the highest memory usage was about
> 4771.96 MiB, or ~4.66 GiB.
>
> What's the difference between idle and during the stress test:
>
> 4886492 - 416452
> 4470040
> 4470040/1024
> 4365.27343750000000000000
> 4470040/1024/1024
> 4.26296234130859375000
>
> So about 4365.27 MiB, or 4.26 GiB.
>
> Now, the upper limit for the test should have been a delta of 8.2 GiB,
> and we get about 4.26, so it means we're rejecting more than half of the
> requests. Why more than half, and not just half? I only see 40
> successful loads of XFS during a 40 second window:
>
> dmesg -c > /dev/null
> # run tests
> dmesg -c | grep XFS | wc -l
>
> At 100 ops all running finit_module() for 40 seconds, those successful
> loads alone should have consumed only about 40 * 2 * the size of XFS
> (42 MiB): 3360 MiB, or 3.28 GiB. But only one copy of the module can
> actually be loaded at a time, so at the *worst* consumption point an XFS
> load in this test should only be consuming 2 * size_of_XFS (42 MiB), so
> 84 MiB.
>
> 4365.27 - 84
> 4281.27
>
> So about 4281.27 MiB (4.18 GiB) is consumed by the 98.07% of module
> loads that failed, due to the kread*() calls from finit_module().
>
> And get this... *if* you use module compression, that also uses vmap()
> *after* the kernel_read*() call, which uses vmalloc(). *At least* in
> that case we immediately free the buffer for the compressed module, but
> *still* -- that is 3 possible vmap space allocations for every module!

:/

>
> It'd be hard, but not impossible, to collect stats for failed
> finit_module() calls. *That* could be indicative of areas in the kernel
> we need to brush up on to stop doing stupid things, like we learned
> about for the CPU frequency scaling modules. As I wrote this paragraph I
> realized that this *is* what we really have wanted all along to help us
> debug these stupid things, so we can slowly learn where to point fingers
> to help optimize things. Although I recently did some tool scraping to
> collect stats and *wished* for this from userspace [0], it wasn't hard
> to just add a debug option for this to help us debug these failures. So
> I added support for that and finally sent a new patch series.

Makes sense to me.
I didn't have time to look at the code yet (currently traveling); I hope
the kasan vmap issues are at least history. I'll do another test with
debug kernels using your latest version on that machine before I have to
hand it back.

Thanks!