[RFC,00/12] module: avoid userspace pressure on unwanted allocations

Message ID: 20230311051712.4095040-1-mcgrof@kernel.org

Message

Luis Chamberlain March 11, 2023, 5:17 a.m. UTC
A long time ago we had some issues with userspace doing stupid stuff.
Well, it turns out the kernel can do stupid stuff too, as we're
learning with the ACPI module aliases and their hammering of tons of
module loads.

So add a bit of code which puts us a bit more on the defensive about
these situations.

To experiment, this also adds in-kernel alias support to see if this helps
with some larger systems.

This is all based on some old code which tried to add defensive
mechanisms, the last of which was posted here before I dropped the ball:

https://lore.kernel.org/all/20171208001540.23696-1-mcgrof@kernel.org/

I've only compile tested this for now. I will need to stress test
with kmod tests 0008 and 0009 to see if there are any differences.
I'll have to re-test and re-gnuplot things there. But early feedback
is appreciated, hence the RFC.

David Hildenbrand had reported a while ago issues with userspace
doing insane things with allocations bringing a system down to
its knees. This is part of the motivation for this series.

I repeat: I have only compile tested this so far.

A few Suggested-by tags linger since Linus suggested a few of these
ideas a long time ago and we just never picked them up.

Luis Chamberlain (12):
  module: use goto errors on check_modinfo() and layout_and_allocate()
  module: move get_modinfo() helpers all above
  module: rename next_string() to module_next_tag_pair()
  module: add a for_each_modinfo_entry()
  module: add debugging alias parsing support
  module: move early sanity checks into a helper
  module: move check_modinfo() early to early_mod_check()
  module: move finished_loading()
  module: extract patient module check into helper
  module: avoid allocation if module is already present and ready
  module: use list_add_tail_rcu() when adding module
  module: use aliases to find module on find_module_all()

 include/linux/module.h   |   4 +
 kernel/module/Kconfig    |  19 +++
 kernel/module/Makefile   |   1 +
 kernel/module/aliases.c  | 109 +++++++++++++
 kernel/module/internal.h |  25 +++
 kernel/module/main.c     | 324 +++++++++++++++++++++++----------------
 6 files changed, 346 insertions(+), 136 deletions(-)
 create mode 100644 kernel/module/aliases.c

Comments

David Hildenbrand March 15, 2023, 12:24 p.m. UTC | #1
On 11.03.23 06:17, Luis Chamberlain wrote:
> A long time ago we had some issues with userspace doing stupid stuff.
> Well, it turns out even the kernel and do stupid stuff too, as we're
> learning with the ACPI modules aliaes and that hammering tons of loads.
> 
> So add a bit of code which gets us a bit more in the defensive about
> these situations.
> 
> To experiment, this also adds in-kernel alias support to see if this helps
> with some larger systems.
> 
> This is all based on some old code which tried to add defensive
> mechanisms the last of which was here and I had dropped the ball:
> 
> https://lore.kernel.org/all/20171208001540.23696-1-mcgrof@kernel.org/
> 
> I've only compile tested this for now. Will need to stress to test
> with kmod tests 0008 and 0009 to see if there's any differences.
> I'll have to re-test and re-gnuplot stuff there. But early feedback
> is appreciated, hence the RFC.
> 
> David Hildenbrand had reported a while ago issues with userspace
> doing insane things with allocations bringing a system down to
> its knees. This is part of the motivation for this series.


I'll try to grab a system where I can reproduce the issue and give your 
patches a churn.

Thanks!
Luis Chamberlain March 15, 2023, 4:10 p.m. UTC | #2
On Wed, Mar 15, 2023 at 01:24:41PM +0100, David Hildenbrand wrote:
> On 11.03.23 06:17, Luis Chamberlain wrote:
> > A long time ago we had some issues with userspace doing stupid stuff.
> > Well, it turns out even the kernel and do stupid stuff too, as we're
> > learning with the ACPI modules aliaes and that hammering tons of loads.
> > 
> > So add a bit of code which gets us a bit more in the defensive about
> > these situations.
> > 
> > To experiment, this also adds in-kernel alias support to see if this helps
> > with some larger systems.
> > 
> > This is all based on some old code which tried to add defensive
> > mechanisms the last of which was here and I had dropped the ball:
> > 
> > https://lore.kernel.org/all/20171208001540.23696-1-mcgrof@kernel.org/
> > 
> > I've only compile tested this for now. Will need to stress to test
> > with kmod tests 0008 and 0009 to see if there's any differences.
> > I'll have to re-test and re-gnuplot stuff there. But early feedback
> > is appreciated, hence the RFC.
> > 
> > David Hildenbrand had reported a while ago issues with userspace
> > doing insane things with allocations bringing a system down to
> > its knees. This is part of the motivation for this series.
> 
> 
> I'll try to grab a system where I can reproduce the issue and give your
> patches a churn.

Great, then please wait for v2 RFC as the first patch was missing an
obvious mutex grab / release, I already have some memory pressure data
that shows gains. Hope to post soon.

  Luis
David Hildenbrand March 15, 2023, 4:41 p.m. UTC | #3
On 15.03.23 17:10, Luis Chamberlain wrote:
> On Wed, Mar 15, 2023 at 01:24:41PM +0100, David Hildenbrand wrote:
>> On 11.03.23 06:17, Luis Chamberlain wrote:
>>> A long time ago we had some issues with userspace doing stupid stuff.
>>> Well, it turns out even the kernel and do stupid stuff too, as we're
>>> learning with the ACPI modules aliaes and that hammering tons of loads.
>>>
>>> So add a bit of code which gets us a bit more in the defensive about
>>> these situations.
>>>
>>> To experiment, this also adds in-kernel alias support to see if this helps
>>> with some larger systems.
>>>
>>> This is all based on some old code which tried to add defensive
>>> mechanisms the last of which was here and I had dropped the ball:
>>>
>>> https://lore.kernel.org/all/20171208001540.23696-1-mcgrof@kernel.org/
>>>
>>> I've only compile tested this for now. Will need to stress to test
>>> with kmod tests 0008 and 0009 to see if there's any differences.
>>> I'll have to re-test and re-gnuplot stuff there. But early feedback
>>> is appreciated, hence the RFC.
>>>
>>> David Hildenbrand had reported a while ago issues with userspace
>>> doing insane things with allocations bringing a system down to
>>> its knees. This is part of the motivation for this series.
>>
>>
>> I'll try to grab a system where I can reproduce the issue and give your
>> patches a churn.
> 
> Great, then please wait for v2 RFC as the first patch was missing an
> obvious mutex grab / release, I already have some memory pressure data
> that shows gains. Hope to post soon.

I expect to have a machine (with a crazy number of CPUs/devices) 
available in a couple of days (1-2), so no need to rush.

The original machine I was able to reproduce with is blocked for a 
little bit longer; so I hope the alternative I looked up will similarly 
trigger the issue easily.
Luis Chamberlain March 16, 2023, 11:55 p.m. UTC | #4
On Wed, Mar 15, 2023 at 05:41:53PM +0100, David Hildenbrand wrote:
> On 15.03.23 17:10, Luis Chamberlain wrote:
> > On Wed, Mar 15, 2023 at 01:24:41PM +0100, David Hildenbrand wrote:
> > > On 11.03.23 06:17, Luis Chamberlain wrote:
> > > > A long time ago we had some issues with userspace doing stupid stuff.
> > > > Well, it turns out even the kernel and do stupid stuff too, as we're
> > > > learning with the ACPI modules aliaes and that hammering tons of loads.
> > > > 
> > > > So add a bit of code which gets us a bit more in the defensive about
> > > > these situations.
> > > > 
> > > > To experiment, this also adds in-kernel alias support to see if this helps
> > > > with some larger systems.
> > > > 
> > > > This is all based on some old code which tried to add defensive
> > > > mechanisms the last of which was here and I had dropped the ball:
> > > > 
> > > > https://lore.kernel.org/all/20171208001540.23696-1-mcgrof@kernel.org/
> > > > 
> > > > I've only compile tested this for now. Will need to stress to test
> > > > with kmod tests 0008 and 0009 to see if there's any differences.
> > > > I'll have to re-test and re-gnuplot stuff there. But early feedback
> > > > is appreciated, hence the RFC.
> > > > 
> > > > David Hildenbrand had reported a while ago issues with userspace
> > > > doing insane things with allocations bringing a system down to
> > > > its knees. This is part of the motivation for this series.
> > > 
> > > 
> > > I'll try to grab a system where I can reproduce the issue and give your
> > > patches a churn.
> > 
> > Great, then please wait for v2 RFC as the first patch was missing an
> > obvious mutex grab / release, I already have some memory pressure data
> > that shows gains. Hope to post soon.
> 
> I expect to have a machine (with a crazy number of CPUs/devices) available
> in a couple of days (1-2), so no need to rush.
> 
> The original machine I was able to reproduce with is blocked for a little
> bit longer; so I hope the alternative I looked up will similarly trigger the
> issue easily.

OK give this a spin:

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230316-module-alloc-opts

I'm seeing about ~86 MiB saving on the upper bound on memory usage
while hammering on kmod test 0008, and this is on a small system.

Probably won't help *that* much but am curious... if it helps somewhat.

  Luis
Luis Chamberlain March 16, 2023, 11:56 p.m. UTC | #5
On Thu, Mar 16, 2023 at 04:55:31PM -0700, Luis Chamberlain wrote:
> On Wed, Mar 15, 2023 at 05:41:53PM +0100, David Hildenbrand wrote:
> > I expect to have a machine (with a crazy number of CPUs/devices) available
> > in a couple of days (1-2), so no need to rush.
> > 
> > The original machine I was able to reproduce with is blocked for a little
> > bit longer; so I hope the alternative I looked up will similarly trigger the
> > issue easily.
> 
> OK give this a spin:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230316-module-alloc-opts
> 
> I'm seeing about ~86 MiB saving on the upper bound on memory usage
> while hammering on kmod test 0008, and this is on a small system.
> 
> Probably won't help *that* much but am curious... if it helps somewhat.

How much cpu count BTW?

  Luis
Luis Chamberlain March 18, 2023, 12:11 a.m. UTC | #6
On Thu, Mar 16, 2023 at 04:56:56PM -0700, Luis Chamberlain wrote:
> On Thu, Mar 16, 2023 at 04:55:31PM -0700, Luis Chamberlain wrote:
> > On Wed, Mar 15, 2023 at 05:41:53PM +0100, David Hildenbrand wrote:
> > > I expect to have a machine (with a crazy number of CPUs/devices) available
> > > in a couple of days (1-2), so no need to rush.
> > > 
> > > The original machine I was able to reproduce with is blocked for a little
> > > bit longer; so I hope the alternative I looked up will similarly trigger the
> > > issue easily.
> > 
> > OK give this a spin:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230316-module-alloc-opts

Today I am up to here:

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230317-module-alloc-opts

The last patch really would have no justification yet at all unless it
does help your case.

  Luis
David Hildenbrand March 20, 2023, 9:37 a.m. UTC | #7
On 17.03.23 00:56, Luis Chamberlain wrote:
> On Thu, Mar 16, 2023 at 04:55:31PM -0700, Luis Chamberlain wrote:
>> On Wed, Mar 15, 2023 at 05:41:53PM +0100, David Hildenbrand wrote:
>>> I expect to have a machine (with a crazy number of CPUs/devices) available
>>> in a couple of days (1-2), so no need to rush.
>>>
>>> The original machine I was able to reproduce with is blocked for a little
>>> bit longer; so I hope the alternative I looked up will similarly trigger the
>>> issue easily.
>>
>> OK give this a spin:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230316-module-alloc-opts
>>
>> I'm seeing about ~86 MiB saving on the upper bound on memory usage
>> while hammering on kmod test 0008, and this is on a small system.
>>
>> Probably won't help *that* much but am curious... if it helps somewhat.
> 
> How much cpu count BTW?

220 physical ones and 440 virtual ones.
David Hildenbrand March 20, 2023, 9:38 a.m. UTC | #8
On 18.03.23 01:11, Luis Chamberlain wrote:
> On Thu, Mar 16, 2023 at 04:56:56PM -0700, Luis Chamberlain wrote:
>> On Thu, Mar 16, 2023 at 04:55:31PM -0700, Luis Chamberlain wrote:
>>> On Wed, Mar 15, 2023 at 05:41:53PM +0100, David Hildenbrand wrote:
>>>> I expect to have a machine (with a crazy number of CPUs/devices) available
>>>> in a couple of days (1-2), so no need to rush.
>>>>
>>>> The original machine I was able to reproduce with is blocked for a little
>>>> bit longer; so I hope the alternative I looked up will similarly trigger the
>>>> issue easily.
>>>
>>> OK give this a spin:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230316-module-alloc-opts
> 
> Today I am up to here:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230317-module-alloc-opts
> 
> The last patch really would have no justification yet at all unless it
> does help your case.

Still waiting on the system (the replacement system I was able to grab 
broke ...).

I'll let you know once I succeeded in reproducing + testing your fixes.
David Hildenbrand March 20, 2023, 7:40 p.m. UTC | #9
On 20.03.23 10:38, David Hildenbrand wrote:
> On 18.03.23 01:11, Luis Chamberlain wrote:
>> On Thu, Mar 16, 2023 at 04:56:56PM -0700, Luis Chamberlain wrote:
>>> On Thu, Mar 16, 2023 at 04:55:31PM -0700, Luis Chamberlain wrote:
>>>> On Wed, Mar 15, 2023 at 05:41:53PM +0100, David Hildenbrand wrote:
>>>>> I expect to have a machine (with a crazy number of CPUs/devices) available
>>>>> in a couple of days (1-2), so no need to rush.
>>>>>
>>>>> The original machine I was able to reproduce with is blocked for a little
>>>>> bit longer; so I hope the alternative I looked up will similarly trigger the
>>>>> issue easily.
>>>>
>>>> OK give this a spin:
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230316-module-alloc-opts
>>
>> Today I am up to here:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230317-module-alloc-opts
>>
>> The last patch really would have no justification yet at all unless it
>> does help your case.
> 
> Still waiting on the system (the replacement system I was able to grab
> broke ...).
> 
> I'll let you know once I succeeded in reproducing + testing your fixes.

Okay, I have a system where I can reproduce.

Should I give

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts

from yesterday a churn?
Luis Chamberlain March 20, 2023, 9:09 p.m. UTC | #10
On Mon, Mar 20, 2023 at 08:40:07PM +0100, David Hildenbrand wrote:
> On 20.03.23 10:38, David Hildenbrand wrote:
> > On 18.03.23 01:11, Luis Chamberlain wrote:
> > > On Thu, Mar 16, 2023 at 04:56:56PM -0700, Luis Chamberlain wrote:
> > > > On Thu, Mar 16, 2023 at 04:55:31PM -0700, Luis Chamberlain wrote:
> > > > > On Wed, Mar 15, 2023 at 05:41:53PM +0100, David Hildenbrand wrote:
> > > > > > I expect to have a machine (with a crazy number of CPUs/devices) available
> > > > > > in a couple of days (1-2), so no need to rush.
> > > > > > 
> > > > > > The original machine I was able to reproduce with is blocked for a little
> > > > > > bit longer; so I hope the alternative I looked up will similarly trigger the
> > > > > > issue easily.
> > > > > 
> > > > > OK give this a spin:
> > > > > 
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230316-module-alloc-opts
> > > 
> > > Today I am up to here:
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230317-module-alloc-opts
> > > 
> > > The last patch really would have no justification yet at all unless it
> > > does help your case.
> > 
> > Still waiting on the system (the replacement system I was able to grab
> > broke ...).
> > 
> > I'll let you know once I succeeded in reproducing + testing your fixes.
> 
> Okay, I have a system where I can reproduce.
> 
> Should I give
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts
> 
> from yesterday a churn?

Yes please give that a run.

Please collect systemd-analyze given lack of any other tool to evaluate
any deltas. Can't think of anything else to gather other than seeing if
it booted.

If that boots works then try removing the last patch "module: add a
sanity check prior to allowing kernel module auto-loading" to see if
that last patch helped or was just noise. As it stands I'm not convinced
yet if it did help, if it *does* help we probably need to rethink some
finit_module() allocations things.

If you *still* can't boot the system, well, we should re-think
finit_module() allocation path a bit more too.

  Luis
David Hildenbrand March 20, 2023, 9:15 p.m. UTC | #11
On 20.03.23 22:09, Luis Chamberlain wrote:
> On Mon, Mar 20, 2023 at 08:40:07PM +0100, David Hildenbrand wrote:
>> On 20.03.23 10:38, David Hildenbrand wrote:
>>> On 18.03.23 01:11, Luis Chamberlain wrote:
>>>> On Thu, Mar 16, 2023 at 04:56:56PM -0700, Luis Chamberlain wrote:
>>>>> On Thu, Mar 16, 2023 at 04:55:31PM -0700, Luis Chamberlain wrote:
>>>>>> On Wed, Mar 15, 2023 at 05:41:53PM +0100, David Hildenbrand wrote:
>>>>>>> I expect to have a machine (with a crazy number of CPUs/devices) available
>>>>>>> in a couple of days (1-2), so no need to rush.
>>>>>>>
>>>>>>> The original machine I was able to reproduce with is blocked for a little
>>>>>>> bit longer; so I hope the alternative I looked up will similarly trigger the
>>>>>>> issue easily.
>>>>>>
>>>>>> OK give this a spin:
>>>>>>
>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230316-module-alloc-opts
>>>>
>>>> Today I am up to here:
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230317-module-alloc-opts
>>>>
>>>> The last patch really would have no justification yet at all unless it
>>>> does help your case.
>>>
>>> Still waiting on the system (the replacement system I was able to grab
>>> broke ...).
>>>
>>> I'll let you know once I succeeded in reproducing + testing your fixes.
>>
>> Okay, I have a system where I can reproduce.
>>
>> Should I give
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts
>>
>> from yesterday a churn?
> 
> Yes please give that a run.

Reproduced with v6.3.0-rc1 (on 1st try)

Not able to reproduce with 20230319-module-alloc-opts so far (2 tries).

> 
> Please collect systemd-analyze given lack of any other tool to evaluate
> any deltas. Can't think of anything else to gather other than seeing if
> it booted.

Issue is that some services (kdump, tuned) seem to take sometimes ages 
on that system to start for some reason, and systemd-analyze refuses to 
do something reasonable while the system is still booting up.

I'll see if I can come up with some data.

> 
> If that boots works then try removing the last patch "module: add a
> sanity check prior to allowing kernel module auto-loading" to see if
> that last patch helped or was just noise. As it stands I'm not convinced
> yet if it did help, if it *does* help we probably need to rethink some
> finit_module() allocations things.

Okay, will try without the last patch tomorrow.
Luis Chamberlain March 20, 2023, 9:23 p.m. UTC | #12
On Mon, Mar 20, 2023 at 10:15:23PM +0100, David Hildenbrand wrote:
> On 20.03.23 22:09, Luis Chamberlain wrote:
> > On Mon, Mar 20, 2023 at 08:40:07PM +0100, David Hildenbrand wrote:
> > > On 20.03.23 10:38, David Hildenbrand wrote:
> > > > On 18.03.23 01:11, Luis Chamberlain wrote:
> > > > > On Thu, Mar 16, 2023 at 04:56:56PM -0700, Luis Chamberlain wrote:
> > > > > > On Thu, Mar 16, 2023 at 04:55:31PM -0700, Luis Chamberlain wrote:
> > > > > > > On Wed, Mar 15, 2023 at 05:41:53PM +0100, David Hildenbrand wrote:
> > > > > > > > I expect to have a machine (with a crazy number of CPUs/devices) available
> > > > > > > > in a couple of days (1-2), so no need to rush.
> > > > > > > > 
> > > > > > > > The original machine I was able to reproduce with is blocked for a little
> > > > > > > > bit longer; so I hope the alternative I looked up will similarly trigger the
> > > > > > > > issue easily.
> > > > > > > 
> > > > > > > OK give this a spin:
> > > > > > > 
> > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230316-module-alloc-opts
> > > > > 
> > > > > Today I am up to here:
> > > > > 
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230317-module-alloc-opts
> > > > > 
> > > > > The last patch really would have no justification yet at all unless it
> > > > > does help your case.
> > > > 
> > > > Still waiting on the system (the replacement system I was able to grab
> > > > broke ...).
> > > > 
> > > > I'll let you know once I succeeded in reproducing + testing your fixes.
> > > 
> > > Okay, I have a system where I can reproduce.
> > > 
> > > Should I give
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts
> > > 
> > > from yesterday a churn?
> > 
> > Yes please give that a run.
> 
> Reproduced with v6.3.0-rc1 (on 1st try)

By reproduced, you mean it fails to boot?

> Not able to reproduce with 20230319-module-alloc-opts so far (2 tries).

Oh wow, so to clarify, it boots OK?

> > Please collect systemd-analyze given lack of any other tool to evaluate
> > any deltas. Can't think of anything else to gather other than seeing if
> > it booted.
> 
> Issue is that some services (kdump, tuned) seem to take sometimes ages on
> that system to start for some reason, 

How about disabling that?

> and systemd-analyze refuses to do
> something reasonable while the system is still booting up.

I see.

> I'll see if I can come up with some data.

Thanks!

> > If that boots works then try removing the last patch "module: add a
> > sanity check prior to allowing kernel module auto-loading" to see if
> > that last patch helped or was just noise. As it stands I'm not convinced
> > yet if it did help, if it *does* help we probably need to rethink some
> > finit_module() allocations things.
> 
> Okay, will try without the last patch tomorrow.

Thanks!!

  Luis
Luis Chamberlain March 20, 2023, 9:27 p.m. UTC | #13
On Mon, Mar 20, 2023 at 02:23:36PM -0700, Luis Chamberlain wrote:
> On Mon, Mar 20, 2023 at 10:15:23PM +0100, David Hildenbrand wrote:
> > Not able to reproduce with 20230319-module-alloc-opts so far (2 tries).
> 
> Oh wow, so to clarify, it boots OK?
> 

Now that we know that tree works, I'm curious also now if you can
confirm just re-ordering the patches still works (it should)

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts-adjust

And although it's *probably* just noise, but I'm very curious how much,
if any difference there is if you just revert "module: use
list_add_tail_rcu() when adding module".

The data on that commit log is pretty small as I have a low end system,
and I'm not yet done beating the hell out of a system with stress-ng,
but getting some data from a pretty large system would be great.
Especially if this series proves to fix boot on them.

  Luis
David Hildenbrand March 21, 2023, 3:11 p.m. UTC | #14
On 20.03.23 22:23, Luis Chamberlain wrote:
> On Mon, Mar 20, 2023 at 10:15:23PM +0100, David Hildenbrand wrote:
>> On 20.03.23 22:09, Luis Chamberlain wrote:
>>> On Mon, Mar 20, 2023 at 08:40:07PM +0100, David Hildenbrand wrote:
>>>> On 20.03.23 10:38, David Hildenbrand wrote:
>>>>> On 18.03.23 01:11, Luis Chamberlain wrote:
>>>>>> On Thu, Mar 16, 2023 at 04:56:56PM -0700, Luis Chamberlain wrote:
>>>>>>> On Thu, Mar 16, 2023 at 04:55:31PM -0700, Luis Chamberlain wrote:
>>>>>>>> On Wed, Mar 15, 2023 at 05:41:53PM +0100, David Hildenbrand wrote:
>>>>>>>>> I expect to have a machine (with a crazy number of CPUs/devices) available
>>>>>>>>> in a couple of days (1-2), so no need to rush.
>>>>>>>>>
>>>>>>>>> The original machine I was able to reproduce with is blocked for a little
>>>>>>>>> bit longer; so I hope the alternative I looked up will similarly trigger the
>>>>>>>>> issue easily.
>>>>>>>>
>>>>>>>> OK give this a spin:
>>>>>>>>
>>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230316-module-alloc-opts
>>>>>>
>>>>>> Today I am up to here:
>>>>>>
>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230317-module-alloc-opts
>>>>>>
>>>>>> The last patch really would have no justification yet at all unless it
>>>>>> does help your case.
>>>>>
>>>>> Still waiting on the system (the replacement system I was able to grab
>>>>> broke ...).
>>>>>
>>>>> I'll let you know once I succeeded in reproducing + testing your fixes.
>>>>
>>>> Okay, I have a system where I can reproduce.
>>>>
>>>> Should I give
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts
>>>>
>>>> from yesterday a churn?
>>>
>>> Yes please give that a run.
>>
>> Reproduced with v6.3.0-rc1 (on 1st try)
> 
> By reproduced, you mean it fails to boot?

It boots but we get vmap allocation warnings, because the ~440 CPUs 
manage to completely exhaust the module vmap area due to KASAN.

> 
>> Not able to reproduce with 20230319-module-alloc-opts so far (2 tries).
> 
> Oh wow, so to clarify, it boots OK?

It boots and I don't get the vmap allocation warnings.

> 
>>> Please collect systemd-analyze given lack of any other tool to evaluate
>>> any deltas. Can't think of anything else to gather other than seeing if
>>> it booted.
>>
>> Issue is that some services (kdump, tuned) seem to take sometimes ages on
>> that system to start for some reason,
> 
> How about disabling that?

It seems to be random services. On my debug kernel with KASAN everything 
is just super slow. I'll try to measure on a !debug kernel.
Luis Chamberlain March 21, 2023, 4:52 p.m. UTC | #15
On Tue, Mar 21, 2023 at 04:11:27PM +0100, David Hildenbrand wrote:
> On 20.03.23 22:23, Luis Chamberlain wrote:
> > On Mon, Mar 20, 2023 at 10:15:23PM +0100, David Hildenbrand wrote:
> > > On 20.03.23 22:09, Luis Chamberlain wrote:
> > > > On Mon, Mar 20, 2023 at 08:40:07PM +0100, David Hildenbrand wrote:
> > > > > On 20.03.23 10:38, David Hildenbrand wrote:
> > > > > > On 18.03.23 01:11, Luis Chamberlain wrote:
> > > > > > > On Thu, Mar 16, 2023 at 04:56:56PM -0700, Luis Chamberlain wrote:
> > > > > > > > On Thu, Mar 16, 2023 at 04:55:31PM -0700, Luis Chamberlain wrote:
> > > > > > > > > On Wed, Mar 15, 2023 at 05:41:53PM +0100, David Hildenbrand wrote:
> > > > > > > > > > I expect to have a machine (with a crazy number of CPUs/devices) available
> > > > > > > > > > in a couple of days (1-2), so no need to rush.
> > > > > > > > > > 
> > > > > > > > > > The original machine I was able to reproduce with is blocked for a little
> > > > > > > > > > bit longer; so I hope the alternative I looked up will similarly trigger the
> > > > > > > > > > issue easily.
> > > > > > > > > 
> > > > > > > > > OK give this a spin:
> > > > > > > > > 
> > > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230316-module-alloc-opts
> > > > > > > 
> > > > > > > Today I am up to here:
> > > > > > > 
> > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230317-module-alloc-opts
> > > > > > > 
> > > > > > > The last patch really would have no justification yet at all unless it
> > > > > > > does help your case.
> > > > > > 
> > > > > > Still waiting on the system (the replacement system I was able to grab
> > > > > > broke ...).
> > > > > > 
> > > > > > I'll let you know once I succeeded in reproducing + testing your fixes.
> > > > > 
> > > > > Okay, I have a system where I can reproduce.
> > > > > 
> > > > > Should I give
> > > > > 
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts
> > > > > 
> > > > > from yesterday a churn?
> > > > 
> > > > Yes please give that a run.
> > > 
> > > Reproduced with v6.3.0-rc1 (on 1st try)
> > 
> > By reproduced, you mean it fails to boot?
> 
> It boots but we get vmap allocation warnings, because the ~440 CPUs manage
> to completely exhaust the module vmap area due to KASAN.

Thanks, can you post a trace?

> > > Not able to reproduce with 20230319-module-alloc-opts so far (2 tries).
> > 
> > Oh wow, so to clarify, it boots OK?
> 
> It boots and I don't get the vmap allocation warnings.

Wonderful!

Now to zero in on whether all commits are required, and to measure the
small value-add of each. That is the hard part, and thanks for any time
you can help dedicate towards that.

There's really only 3 functional changes we need to measure:

In the 20230319-module-alloc-opts-adjust branch they are left towards
the end, in history, the top being the most recent commit:

600f1f769d06 module: add a sanity check prior to allowing kernel module auto-loading
680f2c1fff2d module: use list_add_tail_rcu() when adding module
6023a6c7d98c module: avoid allocation if module is already present and ready

My guess is the majority of the fix comes from the last commit, so
commit 6023a6c7d98c ("module: avoid allocation if module is already
present and ready").

The rcu one probably doesn't help much but I think it makes sense while
we're at it.

The last one is the one I'm less convinced makes sense. But *if* it does
help, it means such finit_module() situations / stresses are much larger
than we ever anticipated.

> > > > Please collect systemd-analyze given lack of any other tool to evaluate
> > > > any deltas. Can't think of anything else to gather other than seeing if
> > > > it booted.
> > > 
> > > Issue is that some services (kdump, tuned) seem to take sometimes ages on
> > > that system to start for some reason,
> > 
> > How about disabling that?
> 
> It seems to be random services. On my debug kernel with KASAN everything is
> just super slow. I'll try to measure on a !debug kernel.

Got it thanks!

  Luis
David Hildenbrand March 21, 2023, 5:01 p.m. UTC | #16
On 21.03.23 17:52, Luis Chamberlain wrote:
> On Tue, Mar 21, 2023 at 04:11:27PM +0100, David Hildenbrand wrote:
>> On 20.03.23 22:23, Luis Chamberlain wrote:
>>> On Mon, Mar 20, 2023 at 10:15:23PM +0100, David Hildenbrand wrote:
>>>> On 20.03.23 22:09, Luis Chamberlain wrote:
>>>>> On Mon, Mar 20, 2023 at 08:40:07PM +0100, David Hildenbrand wrote:
>>>>>> On 20.03.23 10:38, David Hildenbrand wrote:
>>>>>>> On 18.03.23 01:11, Luis Chamberlain wrote:
>>>>>>>> On Thu, Mar 16, 2023 at 04:56:56PM -0700, Luis Chamberlain wrote:
>>>>>>>>> On Thu, Mar 16, 2023 at 04:55:31PM -0700, Luis Chamberlain wrote:
>>>>>>>>>> On Wed, Mar 15, 2023 at 05:41:53PM +0100, David Hildenbrand wrote:
>>>>>>>>>>> I expect to have a machine (with a crazy number of CPUs/devices) available
>>>>>>>>>>> in a couple of days (1-2), so no need to rush.
>>>>>>>>>>>
>>>>>>>>>>> The original machine I was able to reproduce with is blocked for a little
>>>>>>>>>>> bit longer; so I hope the alternative I looked up will similarly trigger the
>>>>>>>>>>> issue easily.
>>>>>>>>>>
>>>>>>>>>> OK give this a spin:
>>>>>>>>>>
>>>>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230316-module-alloc-opts
>>>>>>>>
>>>>>>>> Today I am up to here:
>>>>>>>>
>>>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230317-module-alloc-opts
>>>>>>>>
>>>>>>>> The last patch really would have no justification yet at all unless it
>>>>>>>> does help your case.
>>>>>>>
>>>>>>> Still waiting on the system (the replacement system I was able to grab
>>>>>>> broke ...).
>>>>>>>
>>>>>>> I'll let you know once I succeeded in reproducing + testing your fixes.
>>>>>>
>>>>>> Okay, I have a system where I can reproduce.
>>>>>>
>>>>>> Should I give
>>>>>>
>>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts
>>>>>>
>>>>>> from yesterday a churn?
>>>>>
>>>>> Yes please give that a run.
>>>>
>>>> Reproduced with v6.3.0-rc1 (on 1st try)
>>>
>>> By reproduced, you mean it fails to boot?
>>
>> It boots but we get vmap allocation warnings, because the ~440 CPUs manage
>> to completely exhaust the module vmap area due to KASAN.
> 
> Thanks, can you post a trace?

See

https://lkml.kernel.org/r/20221013180518.217405-1-david@redhat.com

where I also describe the issue in more detail.

> 
>>>> Not able to reproduce with 20230319-module-alloc-opts so far (2 tries).
>>>
>>> Oh wow, so to clarify, it boots OK?
>>
>> It boots and I don't get the vmap allocation warnings.
> 
> Wonderful!

I'm still compiling/testing the debug kernels
(20230319-module-alloc-opts-adjust + single revert of the rcu patch). I
should have some systemd-analyze results for !debug kernels later.
David Hildenbrand March 21, 2023, 7:32 p.m. UTC | #17
On 20.03.23 22:27, Luis Chamberlain wrote:
> On Mon, Mar 20, 2023 at 02:23:36PM -0700, Luis Chamberlain wrote:
>> On Mon, Mar 20, 2023 at 10:15:23PM +0100, David Hildenbrand wrote:
>>> Not able to reproduce with 20230319-module-alloc-opts so far (2 tries).
>>
>> Oh wow, so to clarify, it boots OK?
>>
> 
> Now that we know that tree works, I'm curious also now if you can
> confirm just re-ordering the patches still works (it should)
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts-adjust
> 

So far no vmap errors booting the debug/kasan kernel (2 tries).

> And although it's *probably* just noise, but I'm very curious how much,
> if any difference there is if you just revert "module: use
> list_add_tail_rcu() when adding module".

Ditto, no vmap errors booting the debug/kasan kernel (2 tries).

> 
> The data on that commit log is pretty small as I have a low end system,
> and I'm not yet done beating the hell out of a system with stress-ng,
> but getting some data froma  pretty large system would be great.
> Specially if this series seems to prove fixing boot on them.

2x booting RHEL9.1 on a !debug kernel. Something went wrong with kdump in 2 runs (I think I
had to delete the kdump initrd to make space for another kernel), but we can mostly
ignore that. I wanted to rerun with kdump disabled completely, but I'm out of time for today.


1) v6.3-rc1:

#1

Startup finished in 25.354s (kernel) + 7.662s (initrd) + 1min 8.805s (userspace) = 1min 41.822s
multi-user.target reached after 29.186s in userspace

47.178s kdump.service
14.898s tuned.service
11.394s chrony-wait.service
  7.486s systemd-udev-settle.service
  7.334s NetworkManager-wait-online.service
  2.908s initrd-switch-root.service
  2.451s smartd.service
  2.316s dracut-initqueue.service
  2.057s polkit.service
  1.290s NetworkManager.service
  1.046s cups.service
  ...

#2

Startup finished in 25.375s (kernel) + 7.497s (initrd) + 29.428s (userspace) = 1min 2.301s
multi-user.target reached after 29.392s in userspace

14.552s tuned.service
  9.410s chrony-wait.service
  8.126s systemd-udev-settle.service
  7.502s NetworkManager-wait-online.service
  2.871s initrd-switch-root.service
  2.401s kdump.service
  2.297s polkit.service
  2.116s dracut-initqueue.service
  2.027s smartd.service
  1.262s NetworkManager.service
  1.102s cups.service
  1.011s sshd.service
  ...


2) 20230319-module-alloc-opts-adjust + revert of list_add_tail_rcu():

#1

Startup finished in 25.441s (kernel) + 7.285s (initrd) + 1min 7.644s (userspace) = 1min 40.371s
multi-user.target reached after 27.213s in userspace

47.232s kdump.service
14.235s tuned.service
  8.220s chrony-wait.service
  7.444s NetworkManager-wait-online.service
  5.986s systemd-udev-settle.service
  2.881s initrd-switch-root.service
  2.236s smartd.service
  1.899s dracut-initqueue.service
  1.812s polkit.service
  1.554s NetworkManager.service
  1.140s ModemManager.service
  ...

#2

Startup finished in 25.377s (kernel) + 7.271s (initrd) + 28.247s (userspace) = 1min 897ms
multi-user.target reached after 28.210s in userspace

15.435s tuned.service
11.365s chrony-wait.service
  7.512s NetworkManager-wait-online.service
  5.962s systemd-udev-settle.service
  2.889s initrd-switch-root.service
  2.846s smartd.service
  2.819s kdump.service
  2.228s polkit.service
  1.872s dracut-initqueue.service
  1.312s NetworkManager.service
  1.152s ModemManager.service
  1.011s sshd.service
  ...

3) 20230319-module-alloc-opts-adjust:


#1

Startup finished in 25.320s (kernel) + 7.192s (initrd) + 1min 6.511s (userspace) = 1min 39.024s
multi-user.target reached after 28.205s in userspace

46.307s kdump.service
14.199s tuned.service
13.358s chrony-wait.service
  7.468s NetworkManager-wait-online.service
  6.329s systemd-udev-settle.service
  2.910s initrd-switch-root.service
  2.752s smartd.service
  2.142s polkit.service
  1.909s dracut-initqueue.service
  1.210s NetworkManager.service
  1.041s ModemManager.service
   925ms sshd.service
  ...

#2

Startup finished in 25.368s (kernel) + 7.303s (initrd) + 1min 6.897s (userspace) = 1min 39.569s
multi-user.target reached after 27.326s in userspace

45.626s kdump.service
14.707s tuned.service
13.246s chrony-wait.service
  7.428s NetworkManager-wait-online.service
  6.507s systemd-udev-settle.service
  3.038s initrd-switch-root.service
  3.001s smartd.service
  2.057s polkit.service
  1.746s dracut-initqueue.service
  1.221s NetworkManager.service
  ...


I think we primarily only care about systemd-udev-settle.service.

That is fastest without the rcu patch (~6s), compared to with the rcu
patch (~6.5s) and with stock (~7.5s -- 8s).

Looks like dracut-initqueue also might be a bit faster with your changes, but
maybe it's mostly noise (would have to do more runs).

So maybe drop that rcu patch? But of course, there could be other scenarios where it's
helpful ...
David Hildenbrand March 24, 2023, 9:27 a.m. UTC | #18
On 21.03.23 20:32, David Hildenbrand wrote:
> On 20.03.23 22:27, Luis Chamberlain wrote:
>> On Mon, Mar 20, 2023 at 02:23:36PM -0700, Luis Chamberlain wrote:
>>> On Mon, Mar 20, 2023 at 10:15:23PM +0100, David Hildenbrand wrote:
>>>> Not able to reproduce with 20230319-module-alloc-opts so far (2 tries).
>>>
>>> Oh wow, so to clarify, it boots OK?
>>>
>>
>> Now that we know that tree works, I'm curious also now if you can
>> confirm just re-ordering the patches still works (it should)
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts-adjust
>>
> 
> So far no vmap errors booting the debug/kasan kernel (2 tries).

<-- snip -->

> I think we primarily only care about systemd-udev-settle.service.
> 
> That is fastest without the rcu patch (~6s), compared to with the rcu
> patch (~6.5s) and with stock (~7.5s -- 8s).
> 
> Looks like dracut-initqueue also might be a bit faster with your changes, but
> maybe it's mostly noise (would have to do more runs).
> 
> So maybe drop that rcu patch? But of course, there could be other scenarios where it's
> helpful ...

Are there any other things you would like me to measure/test? I'll have 
to hand back that test machine soonish.
Luis Chamberlain March 24, 2023, 5:54 p.m. UTC | #19
On Fri, Mar 24, 2023 at 10:27:14AM +0100, David Hildenbrand wrote:
> On 21.03.23 20:32, David Hildenbrand wrote:
> > On 20.03.23 22:27, Luis Chamberlain wrote:
> > > On Mon, Mar 20, 2023 at 02:23:36PM -0700, Luis Chamberlain wrote:
> > > > On Mon, Mar 20, 2023 at 10:15:23PM +0100, David Hildenbrand wrote:
> > > > > Not able to reproduce with 20230319-module-alloc-opts so far (2 tries).
> > > > 
> > > > Oh wow, so to clarify, it boots OK?
> > > > 
> > > 
> > > Now that we know that tree works, I'm curious also now if you can
> > > confirm just re-ordering the patches still works (it should)
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts-adjust
> > > 
> > 
> > So far no vmap errors booting the debug/kasan kernel (2 tries).

<-- snip -->

> > I think we primarily only care about systemd-udev-settle.service.
> > 
> > That is fastest without the rcu patch (~6s), compared to with the rcu
> > patch (~6.5s) and with stock (~7.5s -- 8s).
> > 
> > Looks like dracut-initqueue also might be a bit faster with your changes, but
> > maybe it's mostly noise (would have to do more runs).
> > 
> > So maybe drop that rcu patch? But of course, there could be other scenarios where it's
> > helpful ...

Yes I confirm the RCU patch does not help at all now also using
stress-ng.

> Are there any other things you would like me to measure/test? I'll have to
> hand back that test machine soonish.

Yes please test the below. Perhaps it's not the final form we want, but
it *does* fix OOM'ing when thrashing with stress-ng now with the module
option, and even with 100 threads brings down max memory consumption by
259 MiB. The reason is that we also vmalloc during the kernel_read_file()
for each module, well before we even get to layout_and_allocate(), and
so obviously if we fix the module path but not this path it will eventually
catch up with us. I'm not at all happy with the current approach given
ideally we'd bump the counter when the user is done with the file, but we
don't yet have any tracking of that for users; they just vfree the memory
itself. And so this is just trying to catch heavy immediate abuse on the
caller to fend off abuse of vmalloc uses in a lightweight manner.

There's gotta be a better way to do this, but it's just an idea I have so far.
If we *want* to keep tabs until the user is done, we have to just modify
most users of these APIs and introduce our own free. I don't think we're
in a rush to fix this so maybe that's the better approach.

And so I've managed to reproduce the issues you found now with my new stress-ng
module stressor as well.

https://github.com/ColinIanKing/stress-ng.git

Even though you have 400 CPUs with stress-ng we can likely reproduce it
with (use a module not loaded on your system):

./stress-ng --module 100 --module-name xfs

Without the patch below using 400 threads still OOMs easily due to the
kread issue. Max threads allowed are 8192.

From ec5099b15bb74f154319034547184e81e4ad8ba0 Mon Sep 17 00:00:00 2001
From: Luis Chamberlain <mcgrof@kernel.org>
Date: Fri, 24 Mar 2023 10:35:41 -0700
Subject: [PATCH] fs/kernel_read_file: add a clutch

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 fs/kernel_read_file.c | 52 +++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 50 insertions(+), 2 deletions(-)

diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c
index 5d826274570c..2d55abe73b21 100644
--- a/fs/kernel_read_file.c
+++ b/fs/kernel_read_file.c
@@ -1,10 +1,52 @@
 // SPDX-License-Identifier: GPL-2.0-only
+
+#define pr_fmt(fmt) "kread: " fmt
+
 #include <linux/fs.h>
 #include <linux/fs_struct.h>
 #include <linux/kernel_read_file.h>
 #include <linux/security.h>
 #include <linux/vmalloc.h>
 
+/*
+ * This clutch ensures we only allow a certain number concurrent threads at a
+ * time allocating space concurrently and they must all finish within the
+ * timeout specified.  Anything more we know we're thrashing.
+ */
+#define MAX_KREAD_CONCURRENT 20
+static atomic_t kread_concurrent_max = ATOMIC_INIT(MAX_KREAD_CONCURRENT);
+static DECLARE_WAIT_QUEUE_HEAD(kread_wq);
+
+/*
+ * How many seconds to wait for *all*  MAX_KREAD_CONCURRENT threads running
+ * at the same time without returning.
+ */
+#define MAX_KREAD_ALL_BUSY_TIMEOUT 1
+
+static int kernel_read_check_concurrent(void)
+{
+	int ret;
+
+	if (atomic_dec_if_positive(&kread_concurrent_max) < 0) {
+		pr_warn_ratelimited("kread_concurrent_max (%u) close to 0 (max_loads: %u), throttling...",
+				    atomic_read(&kread_concurrent_max),
+				    MAX_KREAD_CONCURRENT);
+		ret = wait_event_killable_timeout(kread_wq,
+						  atomic_dec_if_positive(&kread_concurrent_max) >= 0,
+						  MAX_KREAD_ALL_BUSY_TIMEOUT * HZ);
+		if (!ret) {
+			pr_warn_ratelimited("reading cannot be processed, kernel busy with %d threads reading files now for more than %d seconds",
+					    MAX_KREAD_CONCURRENT, MAX_KREAD_ALL_BUSY_TIMEOUT);
+			return -ETIME;
+		} else if (ret == -ERESTARTSYS) {
+			pr_warn_ratelimited("sigkill sent for kernel read, giving up");
+			return ret;
+		}
+	}
+
+	return 0;
+}
+
 /**
  * kernel_read_file() - read file contents into a kernel buffer
  *
@@ -68,10 +110,14 @@ ssize_t kernel_read_file(struct file *file, loff_t offset, void **buf,
 		goto out;
 	}
 
+	ret = kernel_read_check_concurrent();
+	if (ret)
+		goto out;
+
 	whole_file = (offset == 0 && i_size <= buf_size);
 	ret = security_kernel_read_file(file, id, whole_file);
 	if (ret)
-		goto out;
+		goto out_allow_new_read;
 
 	if (file_size)
 		*file_size = i_size;
@@ -117,7 +163,9 @@ ssize_t kernel_read_file(struct file *file, loff_t offset, void **buf,
 			*buf = NULL;
 		}
 	}
-
+out_allow_new_read:
+	atomic_inc(&kread_concurrent_max);
+	wake_up(&kread_wq);
 out:
 	allow_write_access(file);
 	return ret == 0 ? copied : ret;
Linus Torvalds March 24, 2023, 7:11 p.m. UTC | #20
On Fri, Mar 24, 2023 at 10:54 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> +/*
> + * This clutch ensures we only allow a certain number concurrent threads at a

kludge, not clutch.

And it's much worse than a kludge. It's just wrong and disgusting.

> +               pr_warn_ratelimited("kread_concurrent_max (%u) close to 0 (max_loads: %u), throttling...",
> +                                   atomic_read(&kread_concurrent_max),
> +                                   MAX_KREAD_CONCURRENT);

This is also wrong, since it's not kernel_read_file() that is the
problem, but whatever broken caller.

Yeah, yeah, in practice it's presumably always just finit_module()
doing kernel_read_file_from_fd(), but it's still *completely* wrong to
just say "function X is throttling" when "X" isn't the problem, and
doesn't tell what the _real_ problem is.

I really think this all needs some core fixing at the module layer,
not these kinds of horrific hacks.

          Linus
Luis Chamberlain March 24, 2023, 7:59 p.m. UTC | #21
On Fri, Mar 24, 2023 at 12:11:07PM -0700, Linus Torvalds wrote:
> On Fri, Mar 24, 2023 at 10:54 AM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > +/*
> > + * This clutch ensures we only allow a certain number concurrent threads at a
> 
> kludge, not clutch.
> 
> And it's much worse than a kludge. It's just wrong and disgusting.

I wasn't happy with it either...

> > +               pr_warn_ratelimited("kread_concurrent_max (%u) close to 0 (max_loads: %u), throttling...",
> > +                                   atomic_read(&kread_concurrent_max),
> > +                                   MAX_KREAD_CONCURRENT);
> 
> This is also wrong, since it's not kernel_read_file() that is the
> problem, but whatever broken caller.
> 
> Yeah, yeah, in practice it's presumably always just finit_module()
> doing kernel_read_file_from_fd(), but it's still *completely* wrong to
> just say "function X is throttling" when "X" isn't the problem, and
> doesn't tell what the _real_ problem is.

True.

> I really think this all needs some core fixing at the module layer,
> not these kinds of horrific hacks.

On the modules side of things we can be super defensive on the second
vmalloc allocation defensive [0] but other than this the initial kread
also needs care too.

To address the kread abuse within finit_module() we could just move the
kludge to the modules side of things until each free happens, as in the
alternative below. That just means any user-interfacing call into
kernel_read*() would likely have to be just as careful. Untested below.

[0] https://lkml.kernel.org/r/20230319214926.1794108-4-mcgrof@kernel.org

  Luis

From 3c3f7e597ab35b4482ccb4064bb897eefa449071 Mon Sep 17 00:00:00 2001
From: Luis Chamberlain <mcgrof@kernel.org>
Date: Fri, 24 Mar 2023 12:51:44 -0700
Subject: [PATCH] module: kludge

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
---
 kernel/module/main.c | 45 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 44 insertions(+), 1 deletion(-)

diff --git a/kernel/module/main.c b/kernel/module/main.c
index 145e15f19576..a96de989532a 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -62,6 +62,16 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/module.h>
 
+#define MAX_INITMOD_CONCURRENT 50
+static atomic_t initmod_concurrent_max = ATOMIC_INIT(MAX_INITMOD_CONCURRENT);
+static DECLARE_WAIT_QUEUE_HEAD(initmod_wq);
+
+/*
+ * How much time to wait for *all*  MAX_INITMOD_CONCURRENT threads running
+ * at the same time without returning.
+ */
+#define MAX_INITMOD_ALL_BUSY_TIMEOUT 5
+
 /*
  * Mutex protects:
  * 1) List of modules (also safely readable with preempt_disable),
@@ -3015,6 +3025,30 @@ SYSCALL_DEFINE3(init_module, void __user *, umod,
 	return load_module(&info, uargs, 0);
 }
 
+static int module_kread_concurrent(void)
+{
+	int err;
+
+	if (atomic_dec_if_positive(&initmod_concurrent_max) < 0) {
+		pr_warn_ratelimited("finit_module: initmod_concurrent_max (%u) close to 0 (max_loads: %u), throttling...",
+				    atomic_read(&initmod_concurrent_max),
+				    MAX_INITMOD_CONCURRENT);
+		err = wait_event_killable_timeout(initmod_wq,
+						  atomic_dec_if_positive(&initmod_concurrent_max) >= 0,
+						  MAX_INITMOD_ALL_BUSY_TIMEOUT * HZ);
+		if (!err) {
+			pr_warn_ratelimited("finit_module: loading module cannot be processed, kernel busy with %d threads loading modules now for more than %d seconds",
+					    MAX_INITMOD_CONCURRENT, MAX_INITMOD_ALL_BUSY_TIMEOUT);
+			return -ETIME;
+		} else if (err == -ERESTARTSYS) {
+			pr_warn_ratelimited("finit_module: sigkill sent for load_module giving up");
+			return err;
+		}
+	}
+
+	return 0;
+}
+
 SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
 {
 	struct load_info info = { };
@@ -3033,6 +3067,10 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
 		      |MODULE_INIT_COMPRESSED_FILE))
 		return -EINVAL;
 
+	err = module_kread_concurrent();
+	if (err)
+		return err;
+
 	len = kernel_read_file_from_fd(fd, 0, &buf, INT_MAX, NULL,
 				       READING_MODULE);
 	if (len < 0)
@@ -3048,7 +3086,12 @@ SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
 		info.len = len;
 	}
 
-	return load_module(&info, uargs, flags);
+	err = load_module(&info, uargs, flags);
+
+	atomic_inc(&initmod_concurrent_max);
+	wake_up(&initmod_wq);
+
+	return err;
 }
 
 /* Keep in sync with MODULE_FLAGS_BUF_SIZE !!! */
Linus Torvalds March 24, 2023, 8:28 p.m. UTC | #22
On Fri, Mar 24, 2023 at 1:00 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> On the modules side of things we can be super defensive on the second
> vmalloc allocation defensive [0] but other than this the initial kread
> also needs care too.

Please don't re-implement semaphores. They are a *classic* concurrency limiter.

In fact, probably *THE* classic one.

So just do something like this instead:

   --- a/kernel/module/main.c
   +++ b/kernel/module/main.c
   @@ -2937,6 +2937,11 @@ SYSCALL_DEFINE3(init_module, void __user *, umod,
        return load_module(&info, uargs, 0);
    }

   +#define CONCURRENCY_LIMITER(name, n) \
   +    struct semaphore name = __SEMAPHORE_INITIALIZER(name, n)
   +
   +static CONCURRENCY_LIMITER(module_loading_concurrency, 50);
   +
    SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs,
int, flags)
    {
        struct load_info info = { };
   @@ -2955,8 +2960,12 @@ SYSCALL_DEFINE3(finit_module, int, fd, const
char __user *, uargs, int, flags)
                      |MODULE_INIT_COMPRESSED_FILE))
                return -EINVAL;

   +    err = down_killable(&module_loading_concurrency);
   +    if (err)
   +            return err;
        len = kernel_read_file_from_fd(fd, 0, &buf, INT_MAX, NULL,
                                       READING_MODULE);
   +    up(&module_loading_concurrency);
        if (len < 0)
                return len;

NOTE! Entirely untested. Surprise surprise.

I'm a tiny bit worried about deadlocks here, so somebody needs to make
sure that the kernel_read_file_from_fd() case cannot possibly in turn
cause a "request_module()" recursion.

We most definitely have had module recursion before, but I *think*
it's always just in the module init function (ie one module requests
another when the mod->init() function is called).

I think by the time we have opened the module file descriptors and are
just reading the data, we should be ok and the above would never
deadlock, but I want people to at least think about it.

Of course, with that "max 50 concurrent loaders" limit, maybe it's
never an issue anyway. Even if you get a recursion a few deep, at most
you just wait for somebody else to get out of their loaders. Unless
*everybody* ends up waiting on some progress.

              Linus
Luis Chamberlain March 24, 2023, 9:14 p.m. UTC | #23
On Fri, Mar 24, 2023 at 01:28:51PM -0700, Linus Torvalds wrote:
> On Fri, Mar 24, 2023 at 1:00 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
> >
> > On the modules side of things we can be super defensive on the second
> > vmalloc allocation defensive [0] but other than this the initial kread
> > also needs care too.
> 
> Please don't re-implement semaphores. They are a *classic* concurrency limiter.
> 
> In fact, probably *THE* classic one.
> 
> So just do something like this instead:
> 
>    --- a/kernel/module/main.c
>    +++ b/kernel/module/main.c
>    @@ -2937,6 +2937,11 @@ SYSCALL_DEFINE3(init_module, void __user *, umod,
>         return load_module(&info, uargs, 0);
>     }
> 
>    +#define CONCURRENCY_LIMITER(name, n) \
>    +    struct semaphore name = __SEMAPHORE_INITIALIZER(name, n)
>    +
>    +static CONCURRENCY_LIMITER(module_loading_concurrency, 50);
>    +
>     SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs,
> int, flags)
>     {
>         struct load_info info = { };
>    @@ -2955,8 +2960,12 @@ SYSCALL_DEFINE3(finit_module, int, fd, const
> char __user *, uargs, int, flags)
>                       |MODULE_INIT_COMPRESSED_FILE))
>                 return -EINVAL;
> 
>    +    err = down_killable(&module_loading_concurrency);
>    +    if (err)
>    +            return err;
>         len = kernel_read_file_from_fd(fd, 0, &buf, INT_MAX, NULL,
>                                        READING_MODULE);
>    +    up(&module_loading_concurrency);
>         if (len < 0)
>                 return len;
> 
> NOTE! Entirely untested. Surprise surprise.

I'll give it a good wack thanks.

But it still raises the question of whether *other* vmalloc user-interfacing
places need similar practices. It's not just system calls that use it
willy-nilly, but anything that could end up using it. Surely a lot of
"issues" could only happen in an insane pathological use case, but it's
a generic thing to keep in mind in the end.

> I'm a tiny bit worried about deadlocks here, so somebody needs to make
> sure that the kernel_read_file_from_fd() case cannot possibly in turn
> cause a "request_module()" recursion.

Automount on a path where the module file lives on a filesystem whose
module is not loaded yet, triggering a kernel module autoload, is the
only thing I can think of that could cause that, but that just calls
userspace modprobe. And I think that'd be an insane setup.

> We most definitely have had module recursion before, but I *think*
> it's always just in the module init function (ie one module requests
> another when the mod->init() function is called).

Since you up() right after the kernel_read_file_from_fd() we would not
be racing module inits with this. If however we place the up() after
the load_module() then that does incur that recursion possibility.

> I think by the time we have opened the module file descriptors and are
> just reading the data, we should be ok and the above would never
> deadlock, but I want people to at least think about it.
> 
> Of course, with that "max 50 concurrent loaders" limit, maybe it's
> never an issue anyway. Even if you get a recursion a few deep, at most
> you just wait for somebody else to get out of their loaders. Unless
> *everybody* ends up waiting on some progress.

Yeah, these certainly are pathological corner cases. I'm more interested
in doing something sane for 1000 CPUs or 2000 CPUs for now, even
if the issue was the kernel (not userspace) to blame (and even if those
use cases are being fixed now, like the queued up linux-next ACPI
CPU frequency modules per CPU).

  Luis
Luis Chamberlain March 24, 2023, 11:27 p.m. UTC | #24
On Fri, Mar 24, 2023 at 02:14:11PM -0700, Luis Chamberlain wrote:
> On Fri, Mar 24, 2023 at 01:28:51PM -0700, Linus Torvalds wrote:
> >    +static CONCURRENCY_LIMITER(module_loading_concurrency, 50);
> I'll give it a good wack thanks.

Using a value of 20 above matches what I need to get up to 100 stress-ng
module ops without any OOM (which seems to mimic a 400 CPU environment):

echo 0 > /proc/sys/vm/oom_dump_tasks
./stress-ng --module 100 --module-name xfs

I'll go with that and evaluate max memory usage as well, but this is
still inviting us to consider other user-interfacing areas with an
implicated vmalloc, and how this scales as we grow CPU counts.

A magic value of 20 is completely empirical.

  Luis
Linus Torvalds March 24, 2023, 11:41 p.m. UTC | #25
On Fri, Mar 24, 2023 at 4:27 PM Luis Chamberlain <mcgrof@kernel.org> wrote:
>
> A magic value of 20 is completely empirical.

I suspect a magic value is fine in this context. I do not believe that
"hundreds of concurrent module loads" is a valid real-life load. It
sounds more like a "systemd does horrible things at startup".

20 sounds plenty big enough to me.

But I guess this is one of those "only real life will tell".

              Linus
David Hildenbrand March 28, 2023, 3:44 a.m. UTC | #26
On 24.03.23 18:54, Luis Chamberlain wrote:
> On Fri, Mar 24, 2023 at 10:27:14AM +0100, David Hildenbrand wrote:
>> On 21.03.23 20:32, David Hildenbrand wrote:
>>> On 20.03.23 22:27, Luis Chamberlain wrote:
>>>> On Mon, Mar 20, 2023 at 02:23:36PM -0700, Luis Chamberlain wrote:
>>>>> On Mon, Mar 20, 2023 at 10:15:23PM +0100, David Hildenbrand wrote:
>>>>>> Not able to reproduce with 20230319-module-alloc-opts so far (2 tries).
>>>>>
>>>>> Oh wow, so to clarify, it boots OK?
>>>>>
>>>>
>>>> Now that we know that tree works, I'm curious also now if you can
>>>> confirm just re-ordering the patches still works (it should)
>>>>
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230319-module-alloc-opts-adjust
>>>>
>>>
>>> So far no vmap errors booting the debug/kasan kernel (2 tries).
> 
> <-- snip -->
> 
>>> I think we primarily only care about systemd-udev-settle.service.
>>>
>>> That is fastest without the rcu patch (~6s), compared to with the rcu
>>> patch (~6.5s) and with stock (~7.5s -- 8s).
>>>
>>> Looks like dracut-initqueue also might be a bit faster with your changes, but
>>> maybe it's mostly noise (would have to do more runs).
>>>
>>> So maybe drop that rcu patch? But of course, there could be other scenarios where it's
>>> helpful ...
> 
> Yes I confirm the RCU patch does not help at all now also using
> stress-ng.
> 
>> Are there any other things you would like me to measure/test? I'll have to
>> hand back that test machine soonish.
> 
> Yes please test the below. Perhaps its not the final form we want, but
> it *does* fix OOM'ing when thrashing with stress-ng now with the module
> option and even with 100 threads brings down max memory consumption by
> 259 MiB. The reason is that we also vmalloc during each finit_read_file()
> for each module as well way before we even do layout_and_allocate(), and
> so obviously if we fix the module path but not this path this will eventually
> catch up with us as. I'm not at all happy with the current approach given
> ideally we'd bump the counter when the user is done with the file, but we
> don't yet have any tracking of that for users, they just vfree the memory
> itself. And so this is just trying to catch heavy immediate abuse on the
> caller to fend off abuse of vmalloc uses in a lightway manner.

Understood. (I'm planning on reviewing once I have some spare cycles.)

> 
> There's gotta be a better way to do this, but its just an idea I have so far.
> If we *want* to keep tabs until the user is done, we have to just modify
> most users of these APIs and intrudce our own free. I don't think we're
> in a rush to fix this so maybe that's the better approach.
> 
> And so I've managed to reproduce the issues you found now with my new stress-ng
> module stressor as well.

Nice!

> 
> https://github.com/ColinIanKing/stress-ng.git
> 
> Even though you have 400 CPUs with stress-ng we can likely reproduce it
> with (use a module not loaded on your system):
> 
> ./stress-ng --module 100 --module-name xfs

I'll give that a churn on that machine with the updated patch ...

> 
> Without the patch below using 400 threads still OOMs easily due to the
> kread issue. Max threads allowed are 8192.
> 

... do you have an updated patch/branch that includes the feedback from 
Linus so I can give it a churn tomorrow?
Luis Chamberlain March 28, 2023, 6:16 a.m. UTC | #27
On Tue, Mar 28, 2023 at 05:44:40AM +0200, David Hildenbrand wrote:
> ... do you have an updated patch/branch that includes the feedback from
> Linus so I can give it a churn tomorrow?

Yeah sure:

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230327-module-alloc-opts

The commit log needs updating to reflect the results I just collected:

With the alloc patch ("module: avoid allocation if module is already
present and ready") I see a 145 MiB difference in memory in comparison
to the patch before it, "module: extract patient module check into helper".
So I think that's a clear keeper and should help boots on large CPU count
systems.

The patch "module: add concurrency limiter", which puts the concurrency
limiter on the kread side, only saves about 2 MiB with 100 stress-ng ops,
which seems to be what I needed to reproduce your 400 CPU count original
issue.

The program used to reproduce is stress-ng with the new module option:

echo 0 > /proc/sys/vm/oom_dump_tasks
./stress-ng --module 100 --module-name xfs

To see how much max memory I use, I just use:

free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > foo.log

Run the test in another window, CTRL-C it when the above
finishes after 40 seconds, and then:

sort -n -r foo.log  | head -1
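
For convenience, the free/grep/awk/sort pipeline above can be folded into one
script. This is a hypothetical stdlib-only Python equivalent (the
parse_meminfo/used_kib/sample_peak names are made up); it approximates "used"
as MemTotal - MemAvailable from /proc/meminfo, which tracks free's used
column closely but not exactly:

```python
import time

def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of {field: KiB}."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key] = int(parts[0])
    return fields

def used_kib(text):
    """Approximate 'used' memory as MemTotal - MemAvailable, in KiB."""
    info = parse_meminfo(text)
    return info["MemTotal"] - info["MemAvailable"]

def sample_peak(samples=40, interval=1.0):
    """Sample used memory once per interval; return the peak seen, in KiB."""
    peak = 0
    for _ in range(samples):
        with open("/proc/meminfo") as f:
            peak = max(peak, used_kib(f.read()))
        time.sleep(interval)
    return peak
```

Start sample_peak() in one window, run stress-ng in the other, and it returns
the highest used-memory sample in KiB once the window elapses.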

If you have xfs loaded already you probably want to pick a module about as
big that you don't have loaded. You must have its dependencies loaded
already, as stress-ng doesn't call modprobe, it just finit_module()'s the
module.

The last patch "modules/kmod: replace implementation with a semaphore"
just takes Linus' simplification suggestion to replace kmod's solution.
I tested it with:

tools/testing/selftests/kmod/kmod.sh -t 0008

This stress tests the kmod autoloading. Using time, the timing is
similar, perhaps mildly slower, but I like the simplification. I'm still
not happy with "module: add concurrency limiter", especially since it
doesn't seem to really buy us much; 2 MiB seems within noise.
But stress-ng clearly shows it is a possible source of issues as we
grow the ops. So it may make sense to just use the idea to replace the
limiter in kmod for now, unless you see different results.
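
Linus' semaphore suggestion is easy to model in userspace. Below is only an
illustrative Python sketch (the names and the fake load are made up here; the
kernel side would use a struct semaphore with down_interruptible()/up()
instead), showing how a counting semaphore initialized to the magic value
caps how many loads run at once while the rest simply wait:

```python
import threading
import time

MAX_CONCURRENT = 20  # the empirical magic value discussed earlier
limiter = threading.Semaphore(MAX_CONCURRENT)

in_flight = 0
peak = 0
lock = threading.Lock()

def fake_module_load(delay=0.01):
    """Stand-in for the expensive part of a module load."""
    global in_flight, peak
    with limiter:  # at most MAX_CONCURRENT threads get past this point
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        time.sleep(delay)  # pretend to read/decompress/lay out the module
        with lock:
            in_flight -= 1

def stress(threads=100):
    """Launch many concurrent 'loads'; return the peak concurrency seen."""
    ts = [threading.Thread(target=fake_module_load) for _ in range(threads)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return peak
```

Even with 100 threads hammering fake_module_load(), the peak concurrency
never exceeds 20; everything else blocks patiently instead of piling up
allocations.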

The stress-ng module test can at least be used now to replicate insane setups
at bootup.

Let me know if you get other useful results.

  Luis
David Hildenbrand March 28, 2023, 9:02 p.m. UTC | #28
On 28.03.23 08:16, Luis Chamberlain wrote:
> On Tue, Mar 28, 2023 at 05:44:40AM +0200, David Hildenbrand wrote:
>> ... do you have an updated patch/branch that includes the feedback from
>> Linus so I can give it a churn tomorrow?
> 
> Yeah sure:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/log/?h=20230327-module-alloc-opts
> 

I gave that one a go and got this for system bootup:

#1:

13.761s tuned.service
12.261s chrony-wait.service
  7.386s NetworkManager-wait-online.service
  5.227s systemd-udev-settle.service
  2.893s initrd-switch-root.service
  2.148s polkit.service
  2.137s smartd.service
  1.893s dracut-initqueue.service
  1.290s NetworkManager.service
  1.032s cups.service


#2

13.881s tuned.service
  9.255s chrony-wait.service
  7.404s NetworkManager-wait-online.service
  5.826s systemd-udev-settle.service
  2.859s initrd-switch-root.service
  2.847s smartd.service
  2.172s polkit.service
  1.884s dracut-initqueue.service
  1.371s NetworkManager.service
  1.119s ModemManager.service


So we're a bit faster (0.2 -- 0.7s) than the original version without 
the rcu patch (~6s).


> The commit log needs updateing to reflect the results I just collected:
> 
> With the alloc patch ("module: avoid allocation if module is already
> present and ready") I see 145 MiB in memory difference in comparison
> to its last patch, "module: extract patient module check into helper".
> So I think that's a clear keeper and should help large CPU count boots.
> 
> The patch "module: add concurrency limiter" which puts the concurency
> delimiter on the kread only saves about 2 MiB with 100 stress-ng ops,
> which seems to be what I needed to reproduce your 400 CPU count original
> issue.
> 
> The program used to reproduce is stress-ng with the new module option:
> 
> echo 0 > /proc/sys/vm/oom_dump_tasks
> ./stress-ng --module 100 --module-name xfs

The above command, with nfs (but also ext4), fills the kernel log for me with:

...
[  883.036035] nfs: Unknown symbol xdr_reserve_space (err -2)
[  883.042221] nfs: Unknown symbol rpc_init_wait_queue (err -2)
[  883.048549] nfs: Unknown symbol put_rpccred (err -2)
[  883.054104] nfs: Unknown symbol __fscache_invalidate (err -2)
[  883.060540] nfs: Unknown symbol __fscache_use_cookie (err -2)
[  883.066969] nfs: Unknown symbol rpc_clnt_xprt_switch_has_addr (err -2)
[  883.074264] nfs: Unknown symbol __fscache_begin_write_operation (err -2)
[  883.081743] nfs: Unknown symbol nlmclnt_init (err -2)
[  883.087396] nfs: Unknown symbol nlmclnt_done (err -2)
[  883.093074] nfs: Unknown symbol nfs_debug (err -2)
[  883.098429] nfs: Unknown symbol rpc_wait_for_completion_task (err -2)
[  883.105640] nfs: Unknown symbol __fscache_acquire_cookie (err -2)
[  883.163764] nfs: Unknown symbol rpc_put_task (err -2)
[  883.169461] nfs: Unknown symbol __fscache_acquire_volume (err -2)
[  883.176297] nfs: Unknown symbol rpc_proc_register (err -2)
[  883.182430] nfs: Unknown symbol rpc_shutdown_client (err -2)
[  883.188765] nfs: Unknown symbol rpc_clnt_show_stats (err -2)
[  883.195097] nfs: Unknown symbol __fscache_begin_read_operation (err -2)
...


I do *not* get these errors on manual modprobe/rmmod. BUG in concurrent 
handling or just a side effect of the concurrent loading?

> 
> To see how much max memory I use, I just use:
> 
> free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > foo.log
> 
> Run the test in another window, CTRL-C the test when above
> finishes after 40 seconds and then:
> 
> sort -n -r foo.log  | head -1

[root@lenovo-sr950-01 fs]# sort -n -r foo.log  | head -1
14254024
[root@lenovo-sr950-01 fs]# sort -n -r foo.log  | tail -1
12862528

So 1391496 (KiB I assume, so 1.3 GiB !?) difference compared to before 
the test (I first start capturing and then run stress-ng).


> 
> If you have xfs loaded already you probably wanna pick module just as big
> that you don't have loaded. You must have dependencies loaded already as
> it doesn't call modprobe, it just finit_module's the module.


My setup already has xfs in use. nfs and ext4 are a bit smaller, but 
still big.
Luis Chamberlain March 29, 2023, 5:31 a.m. UTC | #29
On Tue, Mar 28, 2023 at 11:02:49PM +0200, David Hildenbrand wrote:
> 
> So we're a bit faster (0.2 -- 0.7s) than the original version without the
> rcu patch (~6s).

Groovy.

> > The commit log needs updateing to reflect the results I just collected:
> > 
> > With the alloc patch ("module: avoid allocation if module is already
> > present and ready") I see 145 MiB in memory difference in comparison
> > to its last patch, "module: extract patient module check into helper".
> > So I think that's a clear keeper and should help large CPU count boots.
> > 
> > The patch "module: add concurrency limiter" which puts the concurency
> > delimiter on the kread only saves about 2 MiB with 100 stress-ng ops,
> > which seems to be what I needed to reproduce your 400 CPU count original
> > issue.
> > 
> > The program used to reproduce is stress-ng with the new module option:
> > 
> > echo 0 > /proc/sys/vm/oom_dump_tasks
> > ./stress-ng --module 100 --module-name xfs
> 
> Above command fills for me with nfs (but also ext4) the kernel log with:
> 
> ...
> [  883.036035] nfs: Unknown symbol xdr_reserve_space (err -2)
> [  883.042221] nfs: Unknown symbol rpc_init_wait_queue (err -2)
> [  883.048549] nfs: Unknown symbol put_rpccred (err -2)
> [  883.054104] nfs: Unknown symbol __fscache_invalidate (err -2)
> [  883.060540] nfs: Unknown symbol __fscache_use_cookie (err -2)
> [  883.066969] nfs: Unknown symbol rpc_clnt_xprt_switch_has_addr (err -2)
> [  883.074264] nfs: Unknown symbol __fscache_begin_write_operation (err -2)
> [  883.081743] nfs: Unknown symbol nlmclnt_init (err -2)
> [  883.087396] nfs: Unknown symbol nlmclnt_done (err -2)
> [  883.093074] nfs: Unknown symbol nfs_debug (err -2)
> [  883.098429] nfs: Unknown symbol rpc_wait_for_completion_task (err -2)
> [  883.105640] nfs: Unknown symbol __fscache_acquire_cookie (err -2)
> [  883.163764] nfs: Unknown symbol rpc_put_task (err -2)
> [  883.169461] nfs: Unknown symbol __fscache_acquire_volume (err -2)
> [  883.176297] nfs: Unknown symbol rpc_proc_register (err -2)
> [  883.182430] nfs: Unknown symbol rpc_shutdown_client (err -2)
> [  883.188765] nfs: Unknown symbol rpc_clnt_show_stats (err -2)
> [  883.195097] nfs: Unknown symbol __fscache_begin_read_operation (err -2)
> ...
> 
> 
> I do *not* get these errors on manual morprobe/rmmod. BUG in concurrent
> handling or just side-effect of the concurrent loading?

It is just because modprobe deals with module dependencies; stress-ng
doesn't, it just calls finit_module() and expects dependencies
to be loaded first.

> > To see how much max memory I use, I just use:
> > 
> > free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > foo.log
> > 
> > Run the test in another window, CTRL-C the test when above
> > finishes after 40 seconds and then:
> > 
> > sort -n -r foo.log  | head -1
> 
> [root@lenovo-sr950-01 fs]# sort -n -r foo.log  | head -1
> 14254024
> [root@lenovo-sr950-01 fs]# sort -n -r foo.log  | tail -1
> 12862528
> 
> So 1391496 (KiB I assume, so 1.3 GiB !?)

That is sadly correct.

> difference compared to before the
> test (I first start capturing and then run stress-ng).

That's a good chunk :-D

On a recent thread I do a full analysis of average module sizes [0], you
can use that to also get an average size estimate of your currently
loaded modules. Do a bit of math and that could give you the average
memory pressure on real vmap allocations. For a synthetic test as with
stress-ng modules we'd be doing twice the memory for a successful load, and
only a one-time allocation due to kread for a failed load, since we
always allocate memory even for bogus modules on the kread side.

My xfs.ko is 42M on my test guest system, with 100 concurrent threads
doing two allocations each that puts the expected vmap allocation at
around:

42*2*100
8400
8400/1024
8.20312500000000000000

8.2 GiB

So in the worst case that's what we'd see. Since I saw the patches
helped overall by around 145 MiB, let's see what that means.
Let me re-test:

Idle memory (note free -k uses kibibytes, KiB):

root@kmod ~ # free -k -s 1 -c 3 | grep Mem | awk '{print $3}' | sort -n -r  | head -1
416452

So while idle, the system is using up 416452 KiB, so 406 MiB.

While running stress-ng with 100 module ops

free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > 2023-03-28-kmodsem-stress-ng-v1.txt
echo 0 > /proc/sys/vm/oom_dump_tasks
./stress-ng --module 100 --module-name xfs
root@kmod ~ # sort -n -r 2023-03-28-kmodsem-stress-ng-v1.txt | head -1
4886492

4886492/1024
4771.96484375000000000000
4886492/1024/1024
4.66012191772460937500

So during the stress test the highest memory usage was about
4771.96 MiB, or ~4.66 GiB.

What's the difference between idle and after the stress test:

4886492 - 416452
4470040
4470040/1024
4365.27343750000000000000
4470040/1024/1024
4.26296234130859375000

So about 4365.27 MiB or 4.26 GiB.

Now, the upper limit for the test should have been a delta of 8.2 GiB
and we get about 4.26 GiB, so it means we're rejecting more than half
of the requests. Why only about half? I only see 40 successful loads
of XFS during a 40 second window:

  dmesg -c > /dev/null
  # run tests
  dmesg -c | grep XFS | wc -l

At 100 ops all running finit_module() for 40 seconds, those successful
loads alone should have consumed about 40 * 2 * the size of XFS (42 MiB):
3360 MiB or 3.28 GiB. But only one copy of a module can be loaded at any
point in time, so at the *worst* consumption point XFS in this
test should only be consuming 2 * size_of_XFS (42 MiB), so 84 MiB.

4365.27 - 84
4281.27

So about 4281.27 MiB (4.18 GiB) are consumed by the 98.07% of loads that
failed, due to the kread*() calls from finit_module().
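
The arithmetic in this thread can be double-checked in a few lines; this
sketch just redoes the numbers above (worst case, measured delta, and the
share attributed to failed loads):

```python
XFS_MIB = 42          # size of xfs.ko on the test guest
THREADS = 100         # stress-ng module ops
ALLOCS_PER_LOAD = 2   # kernel_read*() buffer + module layout

# Worst case: every thread holds both allocations at once.
worst_case_mib = XFS_MIB * ALLOCS_PER_LOAD * THREADS
print(worst_case_mib / 1024)   # ~8.2 GiB

# Measured: peak used minus idle, in KiB from free -k.
peak_kib, idle_kib = 4886492, 416452
delta_mib = (peak_kib - idle_kib) / 1024
print(round(delta_mib, 2))     # ~4365.27 MiB, i.e. ~4.26 GiB

# At most one successful load at a time, holding 2 * 42 MiB.
failed_mib = delta_mib - ALLOCS_PER_LOAD * XFS_MIB
print(round(failed_mib, 2))    # ~4281.27 MiB left for the failed loads
```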

And get this... *if* you use module compression that also uses vmap()
*after* the kernel_read*() call which uses vmalloc(). *At least* in
that case we immediately free the buffer for the compressed module,
but *still* -- that is 3 possible vmap space allocations for every
module!

It'd be hard, but not impossible, to collect stats for failed
finit_module() calls. *That* could be indicative of areas in the kernel
we need to brush up on to stop doing stupid things, like we learned
about for the CPU frequency scaling modules. As I wrote this paragraph
I realized that this *is* what we have really wanted all along to
help us debug these stupid things, so we can slowly learn where to
point fingers to help optimize things. Although I recently did some
tool scraping to collect stats and *wished* for this from userspace [0],
it wasn't hard to just add a debug option to help us debug these
failures. So I added support for that and finally sent a new patch
series.

[0] https://lkml.kernel.org/r/CAB=NE6UTC4VkNM57GGJ3XkG_PWLkMfXv2e2=yQJhtM6Fc-uMsQ@mail.gmail.com

> > If you have xfs loaded already you probably wanna pick module just as big
> > that you don't have loaded. You must have dependencies loaded already as
> > it doesn't call modprobe, it just finit_module's the module.
> 
> My setup already has xfs in use. nfs and ext4 are a bit smaller, but still
> big.

Sure.

  Luis
David Hildenbrand March 30, 2023, 4:42 a.m. UTC | #30
On 29.03.23 07:31, Luis Chamberlain wrote:
> On Tue, Mar 28, 2023 at 11:02:49PM +0200, David Hildenbrand wrote:
>>
>> So we're a bit faster (0.2 -- 0.7s) than the original version without the
>> rcu patch (~6s).
> 
> Groovy.

Just to clarify, it was up to 0.7s faster (I phrased it in a confusing way).

> 
>>> The commit log needs updateing to reflect the results I just collected:
>>>
>>> With the alloc patch ("module: avoid allocation if module is already
>>> present and ready") I see 145 MiB in memory difference in comparison
>>> to its last patch, "module: extract patient module check into helper".
>>> So I think that's a clear keeper and should help large CPU count boots.
>>>
>>> The patch "module: add concurrency limiter" which puts the concurency
>>> delimiter on the kread only saves about 2 MiB with 100 stress-ng ops,
>>> which seems to be what I needed to reproduce your 400 CPU count original
>>> issue.
>>>
>>> The program used to reproduce is stress-ng with the new module option:
>>>
>>> echo 0 > /proc/sys/vm/oom_dump_tasks
>>> ./stress-ng --module 100 --module-name xfs
>>
>> Above command fills for me with nfs (but also ext4) the kernel log with:
>>
>> ...
>> [  883.036035] nfs: Unknown symbol xdr_reserve_space (err -2)
>> [  883.042221] nfs: Unknown symbol rpc_init_wait_queue (err -2)
>> [  883.048549] nfs: Unknown symbol put_rpccred (err -2)
>> [  883.054104] nfs: Unknown symbol __fscache_invalidate (err -2)
>> [  883.060540] nfs: Unknown symbol __fscache_use_cookie (err -2)
>> [  883.066969] nfs: Unknown symbol rpc_clnt_xprt_switch_has_addr (err -2)
>> [  883.074264] nfs: Unknown symbol __fscache_begin_write_operation (err -2)
>> [  883.081743] nfs: Unknown symbol nlmclnt_init (err -2)
>> [  883.087396] nfs: Unknown symbol nlmclnt_done (err -2)
>> [  883.093074] nfs: Unknown symbol nfs_debug (err -2)
>> [  883.098429] nfs: Unknown symbol rpc_wait_for_completion_task (err -2)
>> [  883.105640] nfs: Unknown symbol __fscache_acquire_cookie (err -2)
>> [  883.163764] nfs: Unknown symbol rpc_put_task (err -2)
>> [  883.169461] nfs: Unknown symbol __fscache_acquire_volume (err -2)
>> [  883.176297] nfs: Unknown symbol rpc_proc_register (err -2)
>> [  883.182430] nfs: Unknown symbol rpc_shutdown_client (err -2)
>> [  883.188765] nfs: Unknown symbol rpc_clnt_show_stats (err -2)
>> [  883.195097] nfs: Unknown symbol __fscache_begin_read_operation (err -2)
>> ...
>>
>>
>> I do *not* get these errors on manual morprobe/rmmod. BUG in concurrent
>> handling or just side-effect of the concurrent loading?
> 
> It is just because modprobe deals with module dependencies, stress-ng
> modprobe doesn't, it just calls finit_module() and expects dependencies
> to be loaded first.

Oh, right. That makes sense.

> 
>>> To see how much max memory I use, I just use:
>>>
>>> free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > foo.log
>>>
>>> Run the test in another window, CTRL-C the test when above
>>> finishes after 40 seconds and then:
>>>
>>> sort -n -r foo.log  | head -1
>>
>> [root@lenovo-sr950-01 fs]# sort -n -r foo.log  | head -1
>> 14254024
>> [root@lenovo-sr950-01 fs]# sort -n -r foo.log  | tail -1
>> 12862528
>>
>> So 1391496 (KiB I assume, so 1.3 GiB !?)
> 
> That is sadly correct.
> 
>> difference compared to before the
>> test (I first start capturing and then run stress-ng).
> 
> That's a good chunk :-D
> 
> On a recent thread I do a full analysis of average module sizes [0], you
> can use that to also get an average size estimate of your currently
> loaded modules. Do a bit of math and that could give you the average
> memory pressure on real vmap allocations. For a synthentic test as with
> stress-ng modules we'd be doing twice the memory for a success load, and
> only one time allocation due to kread for a failed allocation, since we
> always allocate memory even for bogus modules on the kread side.
> 
> My xfs.ko is 42M on my test guest system, with 100 concurrent threads
> doing two allocations each that puts the expected vmap allocation at
> around:
> 
> 42*2*100
> 8400
> 8400/1024
> 8.20312500000000000000
> 
> 8.2 GiB
> 
> So on the worst case that's what we'd see. Since I saw seeing the
> patches helped overall around 145 MiB in difference let's see what
> that means. Let me re-test:
> 
> Idle memory: (note free -k uses kibibytes, KiB.
> 
> root@kmod ~ # free -k -s 1 -c 3 | grep Mem | awk '{print $3}' | sort -n -r  | head -1
> 416452
> 
> So while idle, the system is using up 416452 KiB, so 406 MiB.
> 
> While running stress-ng with 100 module ops
> 
> free -k -s 1 -c 40 | grep Mem | awk '{print $3}' > 2023-03-28-kmodsem-stress-ng-v1.txt
> echo 0 > /proc/sys/vm/oom_dump_tasks
> ./stress-ng --module 100 --module-name xfs
> root@kmod ~ # sort -n -r 2023-03-28-kmodsem-stress-ng-v1.txt | head -1
> 4886492
> 
> 4886492/1024
> 4771.96484375000000000000
> 4886492/1024/1024
> 4.66012191772460937500
> 
> So during the stress test we see memory highest usage was about
> 4771.96 MiB or ~4.66 GiB.
> 
> What's the difference between idle and after the stress test:
> 
> 4886492 - 416452
> 4470040
> 4470040/1024
> 4365.27343750000000000000
> 4470040/1024/1024
> 4.26296234130859375000
> 
> So about 4365.27 MiB or 4.26 GiB.
> 
> Now, the upper limit for the test should have been a delta of 8.2 GiB
> and we get about 4.26, so it means we're rejecting more than half of
> the requests. Why half not just half? I only see 40 successful loads
> of XFS during a 40 second window:
> 
>    dmesg -c > /dev/null
>    # run tests
>    dmesg -c | grep XFS | wc -l
> 
> At 100 ops all running finit_module for 40 seconds those successful
> loads only should have consumed about 40 * 2 * size of XFS (42 MiB):
> 3360 MiB or 3.28 GiB. But at any point in time only one module could
> be loaded at a time, so in the *worst* consumption point a XFS in this
> test should only be consuming 2*size_of_XFS(42 MiB) so 84 MiB.
> 
> 4365.27 - 84
> 4281.27
> 
> So about 4281.27 MiB (4.18 GiB) are consumed by the 98.07% of the
> failed module loads due to the kread*() calls from finit_module().
> 
> And get this... *if* you use module compression that also uses vmap()
> *after* the kernel_read*() call which uses vmalloc(). *At least* in
> that case we immediately free the buffer for the compressed module,
> but *still* -- that is 3 possible vmap space allocations for every
> module!

:/

> 
> It'd be hard, but not impossible to collect stats for failed
> finit_modules(). *That* could be indicative of areas in the kernel
> we need to brush up on to stop doing stupid things, like we learned
> about for the CPU frequency scaling modules. As I wrote this paragraph
> I realized -- that this *is*  what we really have wanted all along to
> help us debug these stupid things, so we can slowly learn where to
> point fingers at to help optimize things. Altough I recently did some
> tool scraping to collect stats and *wished* for this from userspace [0],
> it wasn't hard to just a debug option for this to help us debug these
> failures. So I added support for that and sent finally a new patch
> series.

Makes sense to me.


I didn't have time to look at the code yet (currently traveling); I hope 
that the KASAN vmap issues are at least history.

I'll do another test with debug kernels using your latest version on 
that machine before I have to hand it back.

Thanks!