
[RFC,v3,0/3] Rlimit for module space

Message ID 20181019204723.3903-1-rick.p.edgecombe@intel.com (mailing list archive)

Message

Rick Edgecombe Oct. 19, 2018, 8:47 p.m. UTC
If BPF JIT is on, there is no effective limit to prevent filling the entire
module space with JITed classic/eBPF filters. For classic BPF filters attached
with setsockopt(SO_ATTACH_FILTER), there is no memlock rlimit check to limit
the number of insertions the way there is for the bpf() syscall.
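
To make that attack surface concrete, a classic filter can be attached from an
unprivileged process with just a few lines (illustrative sketch only, not part
of the patches); each such attach can end up JITed into module space when the
JIT is enabled, with no memlock accounting:

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <linux/filter.h>

int main(void)
{
    struct sock_filter insns[] = {
        { BPF_RET | BPF_K, 0, 0, 0xffff },  /* accept the packet */
    };
    struct sock_fprog prog = {
        .len = sizeof(insns) / sizeof(insns[0]),
        .filter = insns,
    };
    int fd = socket(AF_INET, SOCK_DGRAM, 0);

    /* attaching the classic filter is what triggers the JIT */
    if (fd < 0 || setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER,
                             &prog, sizeof(prog)) < 0)
        perror("SO_ATTACH_FILTER");
    return 0;
}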

This patchset adds a per user rlimit for module space, as well as a system wide
limit for BPF JIT. In a previously reviewed patchset, Jann Horn pointed out the
problem that in some cases a user can get access to 65536 UIDs, so the effective
limit cannot be set low enough to stop an attacker and still be useful for the
general case. A discussed alternative solution was a system wide limit for BPF
JIT filters. This resolves the problem of exhaustion and de-randomization much
more simply in the non-CONFIG_BPF_JIT_ALWAYS_ON case. If
CONFIG_BPF_JIT_ALWAYS_ON is on, however, BPF insertions will fail if another
user exhausts the BPF JIT limit, so in that case a per user limit is still
needed. If the subuid facility is disabled for normal users, this should still
be ok, because the higher limit cannot be worked around that way.
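
Roughly speaking, the per user side charges module space pages to the
allocating user and checks them against the new rlimit. The sketch below is
just to convey the idea; the names (RLIMIT_MODSPACE, the user_struct counter)
are placeholders, not necessarily the literal patch code:

static int check_inc_mod_rlimit(unsigned long size)
{
    struct user_struct *user = get_current_user();
    unsigned long limit = rlimit(RLIMIT_MODSPACE);
    unsigned long pages = PAGE_ALIGN(size) >> PAGE_SHIFT;

    /* charge the pages to the user, back out if over the rlimit */
    if (limit != RLIM_INFINITY &&
        atomic_long_add_return(pages, &user->module_pages) >
        (limit >> PAGE_SHIFT)) {
        atomic_long_sub(pages, &user->module_pages);
        free_uid(user);
        return -ENOMEM;
    }

    /* the user reference is held until the matching free */
    return 0;
}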

The new BPF JIT limit can be set like this:
echo 5000000 > /proc/sys/net/core/bpf_jit_limit
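
The system wide check is conceptually just a global counter compared against
that sysctl; again only a sketch with assumed names, not the exact patch code:

static atomic_long_t bpf_jit_current;

static int bpf_jit_charge(unsigned long size)
{
    /* fail the JIT allocation once the global limit is crossed */
    if (atomic_long_add_return(size, &bpf_jit_current) > bpf_jit_limit) {
        atomic_long_sub(size, &bpf_jit_current);
        return -ENOMEM;
    }
    return 0;
}

static void bpf_jit_uncharge(unsigned long size)
{
    atomic_long_sub(size, &bpf_jit_current);
}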

So I *think* this patchset should resolve that issue, except for the
configuration where CONFIG_BPF_JIT_ALWAYS_ON is set and subuids are allowed for
normal users. Better module space KASLR is another way to resolve the
de-randomization issue, and then you would just be left with the BPF DOS in
that configuration.

Jann also pointed out that by purposely fragmenting the module space, you could
make the effectively blocked area much larger. This is also somewhat
unresolved. The impact would depend on how big a space you are trying to
allocate. The default limit has been lowered on x86_64 so that at least
typically sized BPF filters cannot be blocked.

If anyone with more experience with subuid/user namespaces has any suggestions,
I'd be glad to hear them. On an Ubuntu machine it didn't seem like an
unprivileged user could do this. I am going to keep working on this and see if
I can find a better solution.

Changes since v2:
 - System wide BPF JIT limit (discussion with Jann Horn)
 - Holding reference to user correctly (Jann)
 - Having arch versions of module_alloc (Dave Hansen, Jessica Yu)
 - Shrinking of default limits, to help prevent the limit being worked around
   with fragmentation (Jann)

Changes since v1:
 - Plug in for non-x86
 - Arch specific default values


Rick Edgecombe (3):
  modules: Create arch versions of module alloc/free
  modules: Create rlimit for module space
  bpf: Add system wide BPF JIT limit

 arch/arm/kernel/module.c                |   2 +-
 arch/arm64/kernel/module.c              |   2 +-
 arch/mips/kernel/module.c               |   2 +-
 arch/nds32/kernel/module.c              |   2 +-
 arch/nios2/kernel/module.c              |   4 +-
 arch/parisc/kernel/module.c             |   2 +-
 arch/s390/kernel/module.c               |   2 +-
 arch/sparc/kernel/module.c              |   2 +-
 arch/unicore32/kernel/module.c          |   2 +-
 arch/x86/include/asm/pgtable_32_types.h |   3 +
 arch/x86/include/asm/pgtable_64_types.h |   2 +
 arch/x86/kernel/module.c                |   2 +-
 fs/proc/base.c                          |   1 +
 include/asm-generic/resource.h          |   8 ++
 include/linux/bpf.h                     |   7 ++
 include/linux/filter.h                  |   1 +
 include/linux/sched/user.h              |   4 +
 include/uapi/asm-generic/resource.h     |   3 +-
 kernel/bpf/core.c                       |  22 +++-
 kernel/bpf/inode.c                      |  16 +++
 kernel/module.c                         | 152 +++++++++++++++++++++++-
 net/core/sysctl_net_core.c              |   7 ++
 22 files changed, 233 insertions(+), 15 deletions(-)

Comments

Ard Biesheuvel Oct. 20, 2018, 5:20 p.m. UTC | #1
Hi Rick,

On 19 October 2018 at 22:47, Rick Edgecombe <rick.p.edgecombe@intel.com> wrote:
> If BPF JIT is on, there is no effective limit to prevent filling the entire
> module space with JITed e/BPF filters.

Why do BPF filters use the module space, and does this reason apply to
all architectures?

On arm64, we already support loading plain modules far away from the
core kernel (i.e. out of range for ordinary relative jump/call
instructions), and so I'd expect BPF to be able to deal with this
already as well. So for arm64, I wonder why an ordinary vmalloc_exec()
wouldn't be more appropriate.

So before refactoring the module alloc/free routines to accommodate
BPF, I'd like to take one step back and assess whether it wouldn't be
more appropriate to have a separate bpf_alloc/free API, which could be
totally separate from module alloc/free if the arch permits it.


> For classic BPF filters attached with
> setsockopt SO_ATTACH_FILTER, there is no memlock rlimit check to limit the
> number of insertions like there is for the bpf syscall.
>
> This patch adds a per user rlimit for module space, as well as a system wide
> limit for BPF JIT. In a previously reviewed patchset, Jann Horn pointed out the
> problem that in some cases a user can get access to 65536 UIDs, so the effective
> limit cannot be set low enough to stop an attacker and be useful for the general
> case. A discussed alternative solution was a system wide limit for BPF JIT
> filters. This much more simply resolves the problem of exhaustion and
> de-randomizing in the case of non-CONFIG_BPF_JIT_ALWAYS_ON. If
> CONFIG_BPF_JIT_ALWAYS_ON is on however, BPF insertions will fail if another user
> exhausts the BPF JIT limit. In this case a per user limit is still needed. If
> the subuid facility is disabled for normal users, this should still be ok
> because the higher limit will not be able to be worked around that way.
>
> The new BPF JIT limit can be set like this:
> echo 5000000 > /proc/sys/net/core/bpf_jit_limit
>
> So I *think* this patchset should resolve that issue except for the
> configuration of CONFIG_BPF_JIT_ALWAYS_ON and subuid allowed for normal users.
> Better module space KASLR is another way to resolve the de-randomizing issue,
> and so then you would just be left with the BPF DOS in that configuration.
>
> Jann also pointed out how, with purposely fragmenting the module space, you
> could make the effective module space blockage area much larger. This is also
> somewhat un-resolved. The impact would depend on how big of a space you are
> trying to allocate. The limit has been lowered on x86_64 so that at least
> typical sized BPF filters cannot be blocked.
>
> If anyone with more experience with subuid/user namespaces has any suggestions
> I'd be glad to hear. On an Ubuntu machine it didn't seem like a un-privileged
> user could do this. I am going to keep working on this and see if I can find a
> better solution.
>
> Changes since v2:
>  - System wide BPF JIT limit (discussion with Jann Horn)
>  - Holding reference to user correctly (Jann)
>  - Having arch versions of modulde_alloc (Dave Hansen, Jessica Yu)
>  - Shrinking of default limits, to help prevent the limit being worked around
>    with fragmentation (Jann)
>
> Changes since v1:
>  - Plug in for non-x86
>  - Arch specific default values
>
>
> Rick Edgecombe (3):
>   modules: Create arch versions of module alloc/free
>   modules: Create rlimit for module space
>   bpf: Add system wide BPF JIT limit
>
>  arch/arm/kernel/module.c                |   2 +-
>  arch/arm64/kernel/module.c              |   2 +-
>  arch/mips/kernel/module.c               |   2 +-
>  arch/nds32/kernel/module.c              |   2 +-
>  arch/nios2/kernel/module.c              |   4 +-
>  arch/parisc/kernel/module.c             |   2 +-
>  arch/s390/kernel/module.c               |   2 +-
>  arch/sparc/kernel/module.c              |   2 +-
>  arch/unicore32/kernel/module.c          |   2 +-
>  arch/x86/include/asm/pgtable_32_types.h |   3 +
>  arch/x86/include/asm/pgtable_64_types.h |   2 +
>  arch/x86/kernel/module.c                |   2 +-
>  fs/proc/base.c                          |   1 +
>  include/asm-generic/resource.h          |   8 ++
>  include/linux/bpf.h                     |   7 ++
>  include/linux/filter.h                  |   1 +
>  include/linux/sched/user.h              |   4 +
>  include/uapi/asm-generic/resource.h     |   3 +-
>  kernel/bpf/core.c                       |  22 +++-
>  kernel/bpf/inode.c                      |  16 +++
>  kernel/module.c                         | 152 +++++++++++++++++++++++-
>  net/core/sysctl_net_core.c              |   7 ++
>  22 files changed, 233 insertions(+), 15 deletions(-)
>
> --
> 2.17.1
>
Rick Edgecombe Oct. 22, 2018, 11:06 p.m. UTC | #2
On Sat, 2018-10-20 at 19:20 +0200, Ard Biesheuvel wrote:
> Hi Rick,
> 
> On 19 October 2018 at 22:47, Rick Edgecombe <rick.p.edgecombe@intel.com>
> wrote:
> > If BPF JIT is on, there is no effective limit to prevent filling the entire
> > module space with JITed e/BPF filters.
> 
> Why do BPF filters use the module space, and does this reason apply to
> all architectures?
> 
> On arm64, we already support loading plain modules far away from the
> core kernel (i.e. out of range for ordinary relative jump/calll
> instructions), and so I'd expect BPF to be able to deal with this
> already as well. So for arm64, I wonder why an ordinary vmalloc_exec()
> wouldn't be more appropriate.
AFAIK, it's like you said about relative instruction limits, but also because
some branch predictors don't predict past a certain distance, so it is about
performance as well. I'm not sure of the reasons for each arch, or whether they
all apply for BPF JIT. By my count there seem to be 8 architectures that have a
dedicated module space for some reason.

> So before refactoring the module alloc/free routines to accommodate
> BPF, I'd like to take one step back and assess whether it wouldn't be
> more appropriate to have a separate bpf_alloc/free API, which could be
> totally separate from module alloc/free if the arch permits it.
> 
I am not a BPF JIT expert unfortunately; hopefully someone more authoritative
will chime in. I only ran into this because I was trying to increase
randomization for the module space and wanted to find out how many allocations
needed to be supported.

I'd guess, though, that the BPF JIT is just assuming that there will be some
arch specific constraints on where text can be placed optimally, and that those
are already taken into account in the module space allocator.

If there are no constraints for some arch, I'd wonder why not just update its
module_alloc to use the whole space available. What exactly are the constraints
for arm64 for normal modules?

It seems fine to me for architectures to have the option of solving this a
different way. If the rlimit potentially ends up being the best solution for
some architectures, though, do you think this refactor (pretty close to just a
name change) is that intrusive?

I guess it could be just a BPF JIT per user limit and not touch module space,
but I thought the module space limit was a more general solution as that is the
actual limited resource.

> > For classic BPF filters attached with
> > setsockopt SO_ATTACH_FILTER, there is no memlock rlimit check to limit the
> > number of insertions like there is for the bpf syscall.
> > 
> > This patch adds a per user rlimit for module space, as well as a system wide
> > limit for BPF JIT. In a previously reviewed patchset, Jann Horn pointed out
> > the
> > problem that in some cases a user can get access to 65536 UIDs, so the
> > effective
> > limit cannot be set low enough to stop an attacker and be useful for the
> > general
> > case. A discussed alternative solution was a system wide limit for BPF JIT
> > filters. This much more simply resolves the problem of exhaustion and
> > de-randomizing in the case of non-CONFIG_BPF_JIT_ALWAYS_ON. If
> > CONFIG_BPF_JIT_ALWAYS_ON is on however, BPF insertions will fail if another
> > user
> > exhausts the BPF JIT limit. In this case a per user limit is still needed.
> > If
> > the subuid facility is disabled for normal users, this should still be ok
> > because the higher limit will not be able to be worked around that way.
> > 
> > The new BPF JIT limit can be set like this:
> > echo 5000000 > /proc/sys/net/core/bpf_jit_limit
> > 
> > So I *think* this patchset should resolve that issue except for the
> > configuration of CONFIG_BPF_JIT_ALWAYS_ON and subuid allowed for normal
> > users.
> > Better module space KASLR is another way to resolve the de-randomizing
> > issue,
> > and so then you would just be left with the BPF DOS in that configuration.
> > 
> > Jann also pointed out how, with purposely fragmenting the module space, you
> > could make the effective module space blockage area much larger. This is
> > also
> > somewhat un-resolved. The impact would depend on how big of a space you are
> > trying to allocate. The limit has been lowered on x86_64 so that at least
> > typical sized BPF filters cannot be blocked.
> > 
> > If anyone with more experience with subuid/user namespaces has any
> > suggestions
> > I'd be glad to hear. On an Ubuntu machine it didn't seem like a un-
> > privileged
> > user could do this. I am going to keep working on this and see if I can find
> > a
> > better solution.
> > 
> > Changes since v2:
> >  - System wide BPF JIT limit (discussion with Jann Horn)
> >  - Holding reference to user correctly (Jann)
> >  - Having arch versions of modulde_alloc (Dave Hansen, Jessica Yu)
> >  - Shrinking of default limits, to help prevent the limit being worked
> > around
> >    with fragmentation (Jann)
> > 
> > Changes since v1:
> >  - Plug in for non-x86
> >  - Arch specific default values
> > 
> > 
> > Rick Edgecombe (3):
> >   modules: Create arch versions of module alloc/free
> >   modules: Create rlimit for module space
> >   bpf: Add system wide BPF JIT limit
> > 
> >  arch/arm/kernel/module.c                |   2 +-
> >  arch/arm64/kernel/module.c              |   2 +-
> >  arch/mips/kernel/module.c               |   2 +-
> >  arch/nds32/kernel/module.c              |   2 +-
> >  arch/nios2/kernel/module.c              |   4 +-
> >  arch/parisc/kernel/module.c             |   2 +-
> >  arch/s390/kernel/module.c               |   2 +-
> >  arch/sparc/kernel/module.c              |   2 +-
> >  arch/unicore32/kernel/module.c          |   2 +-
> >  arch/x86/include/asm/pgtable_32_types.h |   3 +
> >  arch/x86/include/asm/pgtable_64_types.h |   2 +
> >  arch/x86/kernel/module.c                |   2 +-
> >  fs/proc/base.c                          |   1 +
> >  include/asm-generic/resource.h          |   8 ++
> >  include/linux/bpf.h                     |   7 ++
> >  include/linux/filter.h                  |   1 +
> >  include/linux/sched/user.h              |   4 +
> >  include/uapi/asm-generic/resource.h     |   3 +-
> >  kernel/bpf/core.c                       |  22 +++-
> >  kernel/bpf/inode.c                      |  16 +++
> >  kernel/module.c                         | 152 +++++++++++++++++++++++-
> >  net/core/sysctl_net_core.c              |   7 ++
> >  22 files changed, 233 insertions(+), 15 deletions(-)
> > 
> > --
> > 2.17.1
> > 
Ard Biesheuvel Oct. 23, 2018, 11:54 a.m. UTC | #3
On 22 October 2018 at 20:06, Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
> On Sat, 2018-10-20 at 19:20 +0200, Ard Biesheuvel wrote:
>> Hi Rick,
>>
>> On 19 October 2018 at 22:47, Rick Edgecombe <rick.p.edgecombe@intel.com>
>> wrote:
>> > If BPF JIT is on, there is no effective limit to prevent filling the entire
>> > module space with JITed e/BPF filters.
>>
>> Why do BPF filters use the module space, and does this reason apply to
>> all architectures?
>>
>> On arm64, we already support loading plain modules far away from the
>> core kernel (i.e. out of range for ordinary relative jump/calll
>> instructions), and so I'd expect BPF to be able to deal with this
>> already as well. So for arm64, I wonder why an ordinary vmalloc_exec()
>> wouldn't be more appropriate.
> AFAIK, it's like you said about relative instruction limits, but also because
> some predictors don't predict past a certain distance. So performance as well.
> Not sure the reasons for each arch, or if they all apply for BPF JIT. There seem
> to be 8 by my count, that have a dedicated module space for some reason.
>
>> So before refactoring the module alloc/free routines to accommodate
>> BPF, I'd like to take one step back and assess whether it wouldn't be
>> more appropriate to have a separate bpf_alloc/free API, which could be
>> totally separate from module alloc/free if the arch permits it.
>>
> I am not a BPF JIT expert unfortunately, hopefully someone more authoritative
> will chime in. I only ran into this because I was trying to increase
> randomization for the module space and wanted to find out how many allocations
> needed to be supported.
>
> I'd guess though, that BPF JIT is just assuming that there will be some arch
> specific constraints about where text can be placed optimally and they would
> already be taken into account in the module space allocator.
>
> If there are no constraints for some arch, I'd wonder why not just update its
> module_alloc to use the whole space available. What exactly are the constraints
> for arm64 for normal modules?
>

Relative branches and the interactions with KASAN.

We just fixed something similar in the kprobes code: it was using
RWX-mapped module memory to store kprobed instructions, and we
replaced that with a simple vmalloc_exec() [and code to remap it
read-only], which was possible given that relative branches are always
emulated by arm64 kprobes.
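
Roughly, what we ended up with there looks like the following (paraphrased from
memory, not the verbatim commit):

void *alloc_insn_page(void)
{
    void *page;

    /* plain executable vmalloc memory, then flipped read-only;
     * fine because relative branches are emulated by arm64 kprobes */
    page = vmalloc_exec(PAGE_SIZE);
    if (page)
        set_memory_ro((unsigned long)page, 1);

    return page;
}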

So it depends on whether BPF code needs to be in relative branching
range from the calling code, and whether the BPF code itself is
emitted using relative branches into other parts of the code.

> It seems fine to me for architectures to have the option of solving this a
> different way. If potentially the rlimit ends up being the best solution for
> some architectures though, do you think this refactor (pretty close to just a
> name change) is that intrusive?
>
> I guess it could be just a BPF JIT per user limit and not touch module space,
> but I thought the module space limit was a more general solution as that is the
> actual limited resource.
>

I think it is wrong to conflate the two things. Limiting the number of
BPF allocations and limiting the number of module allocations are two
separate things, and the fact that BPF reuses module_alloc() out of
convenience does not mean a single rlimit for both is appropriate.


>> > For classic BPF filters attached with
>> > setsockopt SO_ATTACH_FILTER, there is no memlock rlimit check to limit the
>> > number of insertions like there is for the bpf syscall.
>> >
>> > This patch adds a per user rlimit for module space, as well as a system wide
>> > limit for BPF JIT. In a previously reviewed patchset, Jann Horn pointed out
>> > the
>> > problem that in some cases a user can get access to 65536 UIDs, so the
>> > effective
>> > limit cannot be set low enough to stop an attacker and be useful for the
>> > general
>> > case. A discussed alternative solution was a system wide limit for BPF JIT
>> > filters. This much more simply resolves the problem of exhaustion and
>> > de-randomizing in the case of non-CONFIG_BPF_JIT_ALWAYS_ON. If
>> > CONFIG_BPF_JIT_ALWAYS_ON is on however, BPF insertions will fail if another
>> > user
>> > exhausts the BPF JIT limit. In this case a per user limit is still needed.
>> > If
>> > the subuid facility is disabled for normal users, this should still be ok
>> > because the higher limit will not be able to be worked around that way.
>> >
>> > The new BPF JIT limit can be set like this:
>> > echo 5000000 > /proc/sys/net/core/bpf_jit_limit
>> >
>> > So I *think* this patchset should resolve that issue except for the
>> > configuration of CONFIG_BPF_JIT_ALWAYS_ON and subuid allowed for normal
>> > users.
>> > Better module space KASLR is another way to resolve the de-randomizing
>> > issue,
>> > and so then you would just be left with the BPF DOS in that configuration.
>> >
>> > Jann also pointed out how, with purposely fragmenting the module space, you
>> > could make the effective module space blockage area much larger. This is
>> > also
>> > somewhat un-resolved. The impact would depend on how big of a space you are
>> > trying to allocate. The limit has been lowered on x86_64 so that at least
>> > typical sized BPF filters cannot be blocked.
>> >
>> > If anyone with more experience with subuid/user namespaces has any
>> > suggestions
>> > I'd be glad to hear. On an Ubuntu machine it didn't seem like a un-
>> > privileged
>> > user could do this. I am going to keep working on this and see if I can find
>> > a
>> > better solution.
>> >
>> > Changes since v2:
>> >  - System wide BPF JIT limit (discussion with Jann Horn)
>> >  - Holding reference to user correctly (Jann)
>> >  - Having arch versions of modulde_alloc (Dave Hansen, Jessica Yu)
>> >  - Shrinking of default limits, to help prevent the limit being worked
>> > around
>> >    with fragmentation (Jann)
>> >
>> > Changes since v1:
>> >  - Plug in for non-x86
>> >  - Arch specific default values
>> >
>> >
>> > Rick Edgecombe (3):
>> >   modules: Create arch versions of module alloc/free
>> >   modules: Create rlimit for module space
>> >   bpf: Add system wide BPF JIT limit
>> >
>> >  arch/arm/kernel/module.c                |   2 +-
>> >  arch/arm64/kernel/module.c              |   2 +-
>> >  arch/mips/kernel/module.c               |   2 +-
>> >  arch/nds32/kernel/module.c              |   2 +-
>> >  arch/nios2/kernel/module.c              |   4 +-
>> >  arch/parisc/kernel/module.c             |   2 +-
>> >  arch/s390/kernel/module.c               |   2 +-
>> >  arch/sparc/kernel/module.c              |   2 +-
>> >  arch/unicore32/kernel/module.c          |   2 +-
>> >  arch/x86/include/asm/pgtable_32_types.h |   3 +
>> >  arch/x86/include/asm/pgtable_64_types.h |   2 +
>> >  arch/x86/kernel/module.c                |   2 +-
>> >  fs/proc/base.c                          |   1 +
>> >  include/asm-generic/resource.h          |   8 ++
>> >  include/linux/bpf.h                     |   7 ++
>> >  include/linux/filter.h                  |   1 +
>> >  include/linux/sched/user.h              |   4 +
>> >  include/uapi/asm-generic/resource.h     |   3 +-
>> >  kernel/bpf/core.c                       |  22 +++-
>> >  kernel/bpf/inode.c                      |  16 +++
>> >  kernel/module.c                         | 152 +++++++++++++++++++++++-
>> >  net/core/sysctl_net_core.c              |   7 ++
>> >  22 files changed, 233 insertions(+), 15 deletions(-)
>> >
>> > --
>> > 2.17.1
>> >
Jessica Yu Oct. 24, 2018, 3:07 p.m. UTC | #4
+++ Ard Biesheuvel [23/10/18 08:54 -0300]:
>On 22 October 2018 at 20:06, Edgecombe, Rick P
><rick.p.edgecombe@intel.com> wrote:
>> On Sat, 2018-10-20 at 19:20 +0200, Ard Biesheuvel wrote:
>>> Hi Rick,
>>>
>>> On 19 October 2018 at 22:47, Rick Edgecombe <rick.p.edgecombe@intel.com>
>>> wrote:
>>> > If BPF JIT is on, there is no effective limit to prevent filling the entire
>>> > module space with JITed e/BPF filters.
>>>
>>> Why do BPF filters use the module space, and does this reason apply to
>>> all architectures?
>>>
>>> On arm64, we already support loading plain modules far away from the
>>> core kernel (i.e. out of range for ordinary relative jump/calll
>>> instructions), and so I'd expect BPF to be able to deal with this
>>> already as well. So for arm64, I wonder why an ordinary vmalloc_exec()
>>> wouldn't be more appropriate.
>> AFAIK, it's like you said about relative instruction limits, but also because
>> some predictors don't predict past a certain distance. So performance as well.
>> Not sure the reasons for each arch, or if they all apply for BPF JIT. There seem
>> to be 8 by my count, that have a dedicated module space for some reason.
>>
>>> So before refactoring the module alloc/free routines to accommodate
>>> BPF, I'd like to take one step back and assess whether it wouldn't be
>>> more appropriate to have a separate bpf_alloc/free API, which could be
>>> totally separate from module alloc/free if the arch permits it.
>>>
>> I am not a BPF JIT expert unfortunately, hopefully someone more authoritative
>> will chime in. I only ran into this because I was trying to increase
>> randomization for the module space and wanted to find out how many allocations
>> needed to be supported.
>>
>> I'd guess though, that BPF JIT is just assuming that there will be some arch
>> specific constraints about where text can be placed optimally and they would
>> already be taken into account in the module space allocator.
>>
>> If there are no constraints for some arch, I'd wonder why not just update its
>> module_alloc to use the whole space available. What exactly are the constraints
>> for arm64 for normal modules?
>>
>
>Relative branches and the interactions with KAsan.
>
>We just fixed something similar in the kprobes code: it was using
>RWX-mapped module memory to store kprobed instructions, and we
>replaced that with a simple vmalloc_exec() [and code to remap it
>read-only], which was possible given that relative branches are always
>emulated by arm64 kprobes.
>
>So it depends on whether BPF code needs to be in relative branching
>range from the calling code, and whether the BPF code itself is
>emitted using relative branches into other parts of the code.
>
>> It seems fine to me for architectures to have the option of solving this a
>> different way. If potentially the rlimit ends up being the best solution for
>> some architectures though, do you think this refactor (pretty close to just a
>> name change) is that intrusive?
>>
>> I guess it could be just a BPF JIT per user limit and not touch module space,
>> but I thought the module space limit was a more general solution as that is the
>> actual limited resource.
>>
>
>I think it is wrong to conflate the two things. Limiting the number of
>BPF allocations and the limiting number of module allocations are two
>separate things, and the fact that BPF reuses module_alloc() out of
>convenience does not mean a single rlimit for both is appropriate.

Hm, I think Ard has a good point. AFAIK, and correct me if I'm wrong,
users of module_alloc(), i.e. kprobes, ftrace, bpf, seem to use it
because it is an easy way to obtain executable kernel memory (and,
depending on the needs of the architecture, memory additionally
reachable via relative branches) at runtime. The side effect is
that all these users share the "module" memory space, even though this
memory region is not exclusively used by modules (well, personally I
think it technically should be, because seeing module_alloc() usage
outside of the module loader is kind of a misuse of the module API, and
it's confusing for people who don't know the reason behind its usage
outside of the module loader).

Right now I'm not sure if it makes sense to impose a blanket limit on
all module_alloc() allocations when the real motivation behind the
rlimit is related to BPF, i.e., to stop unprivileged users from
hogging up all the vmalloc space for modules with JITed BPF filters.
So the rlimit has more to do with limiting the memory usage of BPF
filters than it has to do with modules themselves.

I think Ard's suggestion of having a separate bpf_alloc/free API makes
a lot of sense if we want to keep track of bpf-related allocations
(and then the rlimit would be enforced for those). Maybe part of the
module mapping space could be carved out for bpf filters (e.g. have
BPF_VADDR, BPF_VSIZE, etc like how we have it for modules), or
continue sharing the region but explicitly define a separate bpf_alloc
API, depending on an architecture's needs. What do people think?
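
A very rough sketch of that carve-out variant (BPF_VADDR/BPF_VSIZE are
hypothetical names mirroring MODULES_VADDR/MODULES_END, nothing that exists
today) might look like:

void *bpf_jit_alloc(unsigned long size)
{
    /* dedicated executable region for JITed images, separate from
     * the module loader's range */
    return __vmalloc_node_range(size, MODULE_ALIGN,
                                BPF_VADDR, BPF_VADDR + BPF_VSIZE,
                                GFP_KERNEL, PAGE_KERNEL_EXEC, 0,
                                NUMA_NO_NODE,
                                __builtin_return_address(0));
}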

Thanks,

Jessica
Daniel Borkmann Oct. 24, 2018, 4:04 p.m. UTC | #5
[ +Alexei, netdev ]

On 10/24/2018 05:07 PM, Jessica Yu wrote:
> +++ Ard Biesheuvel [23/10/18 08:54 -0300]:
>> On 22 October 2018 at 20:06, Edgecombe, Rick P
>> <rick.p.edgecombe@intel.com> wrote:
[...]
>> I think it is wrong to conflate the two things. Limiting the number of
>> BPF allocations and the limiting number of module allocations are two
>> separate things, and the fact that BPF reuses module_alloc() out of
>> convenience does not mean a single rlimit for both is appropriate.
> 
> Hm, I think Ard has a good point. AFAIK, and correct me if I'm wrong,
> users of module_alloc() i.e. kprobes, ftrace, bpf, seem to use it
> because it is an easy way to obtain executable kernel memory (and
> depending on the needs of the architecture, being additionally
> reachable via relative branches) during runtime. The side effect is
> that all these users share the "module" memory space, even though this
> memory region is not exclusively used by modules (well, personally I
> think it technically should be, because seeing module_alloc() usage
> outside of the module loader is kind of a misuse of the module API and
> it's confusing for people who don't know the reason behind its usage
> outside of the module loader).
> 
> Right now I'm not sure if it makes sense to impose a blanket limit on
> all module_alloc() allocations when the real motivation behind the
> rlimit is related to BPF, i.e., to stop unprivileged users from
> hogging up all the vmalloc space for modules with JITed BPF filters.
> So the rlimit has more to do with limiting the memory usage of BPF
> filters than it has to do with modules themselves.
> 
> I think Ard's suggestion of having a separate bpf_alloc/free API makes
> a lot of sense if we want to keep track of bpf-related allocations
> (and then the rlimit would be enforced for those). Maybe part of the
> module mapping space could be carved out for bpf filters (e.g. have
> BPF_VADDR, BPF_VSIZE, etc like how we have it for modules), or
> continue sharing the region but explicitly define a separate bpf_alloc
> API, depending on an architecture's needs. What do people think?

Hmm, I think there are several issues mixed up here at the same time, which is
just very confusing, imho:

1) The fact that there are several non-module users of module_alloc(),
as Jessica notes, such as kprobes, ftrace and bpf, for example. While none of
them are modules, they all need to alloc some piece of executable
memory. It's nothing new; this has existed for 7 years now, since 0a14842f5a3c
("net: filter: Just In Time compiler for x86-64") on the BPF side; effectively
that is even /before/ eBPF existed. Having some different API perhaps for all
these users seems to make sense if the goal is not to interfere with modules
themselves. It might also help as a benefit to potentially increase that
memory pool if we're hitting limits at scale, which would not be a concern
for normal kernel modules since there's usually just a very few of them
needed (unlike dynamically tracing various kernel parts 24/7 w/ or w/o BPF,
running BPF-seccomp policies, networking BPF policies, etc which need to
scale w/ application or container deployment; so this is of a much more
dynamic and unpredictable nature).

2) Then there is the rlimit, which proposes to have a "fairer" share among
unprivileged users. I'm not fully sure yet whether an rlimit is actually a
nice usable interface for all this. I'd agree that something is needed
in that regard, but I also tend to like Michal Hocko's cgroup proposal,
iow, having such resource control as part of the memory cgroup could be
something to consider, _especially_ since it already _exists_ for vmalloc()
backed memory, so this should not be much different than that. It sounds
like 2) comes on top of 1).

3) Last but not least, there's a short term fix which is needed independently
of 1) and 2) and should be done immediately: account for
unprivileged users and restrict them based on a globally configurable
limit such that privileged use keeps functioning, and 2) could enforce
based on the global upper limit, for example. The pending fix is under
https://patchwork.ozlabs.org/patch/987971/ which we intend to ship to
Linus as soon as possible as a short term fix. Then something like memcg
can be considered on top, since that seems to make the most sense from a
usability point of view.

Thanks a lot,
Daniel
Rick Edgecombe Oct. 25, 2018, 1:01 a.m. UTC | #6
On Wed, 2018-10-24 at 18:04 +0200, Daniel Borkmann wrote:
> [ +Alexei, netdev ]
> 
> On 10/24/2018 05:07 PM, Jessica Yu wrote:
> > +++ Ard Biesheuvel [23/10/18 08:54 -0300]:
> > > On 22 October 2018 at 20:06, Edgecombe, Rick P
> > > <rick.p.edgecombe@intel.com> wrote:
> 
> [...]
> > > I think it is wrong to conflate the two things. Limiting the number of
> > > BPF allocations and the limiting number of module allocations are two
> > > separate things, and the fact that BPF reuses module_alloc() out of
> > > convenience does not mean a single rlimit for both is appropriate.
> > 
> > Hm, I think Ard has a good point. AFAIK, and correct me if I'm wrong,
> > users of module_alloc() i.e. kprobes, ftrace, bpf, seem to use it
> > because it is an easy way to obtain executable kernel memory (and
> > depending on the needs of the architecture, being additionally
> > reachable via relative branches) during runtime. The side effect is
> > that all these users share the "module" memory space, even though this
> > memory region is not exclusively used by modules (well, personally I
> > think it technically should be, because seeing module_alloc() usage
> > outside of the module loader is kind of a misuse of the module API and
> > it's confusing for people who don't know the reason behind its usage
> > outside of the module loader).
> > 
> > Right now I'm not sure if it makes sense to impose a blanket limit on
> > all module_alloc() allocations when the real motivation behind the
> > rlimit is related to BPF, i.e., to stop unprivileged users from
> > hogging up all the vmalloc space for modules with JITed BPF filters.
> > So the rlimit has more to do with limiting the memory usage of BPF
> > filters than it has to do with modules themselves.
> > 
> > I think Ard's suggestion of having a separate bpf_alloc/free API makes
> > a lot of sense if we want to keep track of bpf-related allocations
> > (and then the rlimit would be enforced for those). Maybe part of the
> > module mapping space could be carved out for bpf filters (e.g. have
> > BPF_VADDR, BPF_VSIZE, etc like how we have it for modules), or
> > continue sharing the region but explicitly define a separate bpf_alloc
> > API, depending on an architecture's needs. What do people think?
> 
> Hmm, I think here are several issues mixed up at the same time which is just
> very confusing, imho:
> 
> 1) The fact that there are several non-module users of module_alloc()
> as Jessica notes such as kprobes, ftrace, bpf, for example. While all of
> them are not being modules, they all need to alloc some piece of executable
> memory. It's nothing new, this exists for 7 years now since 0a14842f5a3c
> ("net: filter: Just In Time compiler for x86-64") from BPF side; effectively
> that is even /before/ eBPF existed. Having some different API perhaps for all
> these users seems to make sense if the goal is not to interfere with modules
> themselves. It might also help as a benefit to potentially increase that
> memory pool if we're hitting limits at scale which would not be a concern
> for normal kernel modules since there's usually just a very few of them
> needed (unlike dynamically tracing various kernel parts 24/7 w/ or w/o BPF,
> running BPF-seccomp policies, networking BPF policies, etc which need to
> scale w/ application or container deployment; so this is of much more
> dynamic and unpredictable nature).
> 
> 2) Then there is rlimit which is proposing to have a "fairer" share among
> unprivileged users. I'm not fully sure yet whether rlimit is actually a
> nice usable interface for all this. I'd agree that something is needed
> on that regard, but I also tend to like Michal Hocko's cgroup proposal,
> iow, to have such resource control as part of memory cgroup could be
> something to consider _especially_ since it already _exists_ for vmalloc()
> backed memory so this should be not much different than that. It sounds
> like 2) comes on top of 1).
FWIW, cgroups seem like a better solution than an rlimit to me too. Maybe you
all already know, but it looks like the cgroups vmalloc charge is done in the
main page allocator and counts against the whole kernel memory limit. A user
may want to have a higher kernel limit than the module space size, so it seems
it isn't enough by itself and some new limit would need to be added.

As for what the limit should be, I wonder if some of the disagreement is just
from the name "module space".

There is a limited resource of physical memory, so we have limits for it. There
is a limited resource of CPU time, so we have limits for it. If there is a
limited resource of preferred address space for executable code, why not just
continue that trend? If other forms of unprivileged JIT come along, would it be
better to have N limits, one for each type? request_module() probably can't
fill the space, but technically there are already 2 unprivileged users. So
IMHO, it's a more forward looking solution.

If there are some usage/architecture combos that don't need the preferred
space, they can allocate in vmalloc and have it not count against the preferred
space limit, but still count against the cgroups kernel memory limit.

Another benefit of centralizing the allocation of the "executable memory
preferred space" is KASLR. Right now that randomization is only done in
module_alloc(), and so all users of it get randomized. If they each called
vmalloc() by themselves they would just use the normal allocator.
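
For example, x86's module_alloc() today looks roughly like this (paraphrased
from memory, details elided), with the random offset applied before handing off
to the vmalloc machinery:

void *module_alloc(unsigned long size)
{
    void *p;

    if (PAGE_ALIGN(size) > MODULES_LEN)
        return NULL;

    /* KASLR: the start of the module area is shifted by a random
     * offset, so every module_alloc() caller benefits */
    p = __vmalloc_node_range(size, MODULE_ALIGN,
                             MODULES_VADDR + get_module_load_offset(),
                             MODULES_END, GFP_KERNEL,
                             PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
                             __builtin_return_address(0));
    return p;
}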

Anyway, it seems like either type of limit (BPF JIT or all module space) will
solve the problem equally well today.

> 3) Last but not least, there's a short term fix which is needed independently
> of 1) and 2) and should be done immediately which is to account for
> unprivileged users and restrict them based on a global configurable
> limit such that privileged use keeps functioning, and 2) could enforce
> based on the global upper limit, for example. Pending fix is under
> https://patchwork.ozlabs.org/patch/987971/ which we intend to ship to
> Linus as soon as possible as short term fix. Then something like memcg
> can be considered on top since it seems this makes most sense from a
> usability point.
> 
> Thanks a lot,
> Daniel
Michal Hocko Oct. 25, 2018, 2:18 p.m. UTC | #7
On Thu 25-10-18 01:01:44, Edgecombe, Rick P wrote:
[...]
> FWIW, cgroups seems like a better solution than rlimit to me too. Maybe you all
> already know, but it looks like the cgroups vmalloc charge is done in the main
> page allocator and counts against the whole kernel memory limit.

I am not sure I understand what you are saying, but let me clarify that
vmalloc memory is accounted against the memory cgroup associated with the
user context calling vmalloc. All you need to do is to add __GFP_ACCOUNT
to the gfp mask. The only challenge here is the charged memory life
cycle. When does it get deallocated? In the worst case the memory is not
tied to any user context and as such it doesn't get deallocated by
killing all processes, which could lead to memcg limit exhaustion.
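
For illustration (just a sketch; 'size' stands in for whatever the caller
needs):

    /* charged to the memcg of the task doing the allocation */
    p = __vmalloc(size, GFP_KERNEL | __GFP_ACCOUNT, PAGE_KERNEL);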

> A user may want
> to have a higher kernel limit than the module space size, so it seems it isn't
> enough by itself and some new limit would need to be added.

If there is another restriction on this memory then you are right.