diff mbox series

[v14,2/7] mm: add VM_DROPPABLE for designating always lazily freeable mappings

Message ID 20230101162910.710293-3-Jason@zx2c4.com (mailing list archive)
State New
Headers show
Series None | expand

Commit Message

Jason A. Donenfeld Jan. 1, 2023, 4:29 p.m. UTC
The vDSO getrandom() implementation works with a buffer allocated with a
new system call that has certain requirements:

- It shouldn't be written to core dumps.
  * Easy: VM_DONTDUMP.
- It should be zeroed on fork.
  * Easy: VM_WIPEONFORK.

- It shouldn't be written to swap.
  * Uh-oh: mlock is rlimited.
  * Uh-oh: mlock isn't inherited by forks.

- It shouldn't reserve actual memory, but it also shouldn't crash when
  page faulting in memory if none is available
  * Uh-oh: MAP_NORESERVE respects vm.overcommit_memory=2.
  * Uh-oh: VM_NORESERVE means segfaults.

It turns out that the vDSO getrandom() function has three really nice
characteristics that we can exploit to solve this problem:

1) Due to being wiped during fork(), the vDSO code is already robust to
   having the contents of the pages it reads zeroed out midway through
   the function's execution.

2) In the absolute worst case of whatever contingency we're coding for,
   we have the option to fallback to the getrandom() syscall, and
   everything is fine.

3) The buffers the function uses are only ever useful for a maximum of
   60 seconds -- a sort of cache, rather than a long term allocation.

These characteristics mean that we can introduce VM_DROPPABLE, which
has the following semantics:

a) It never is written out to swap.
b) Under memory pressure, mm can just drop the pages (so that they're
   zero when read back again).
c) If there's not enough memory to service a page fault, it's not fatal,
   and no signal is sent. Instead, writes are simply lost.
d) It is inherited by fork.
e) It doesn't count against the mlock budget, since nothing is locked.

This is fairly simple to implement, with the one snag that we have to
use 64-bit VM_* flags, but this shouldn't be a problem, since the only
consumers will probably be 64-bit anyway.

This way, allocations used by vDSO getrandom() can use:

    VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE

And there will be no problem with OOMing, crashing on overcommitment,
using memory when not in use, not wiping on fork(), coredumps, or
writing out to swap.

At the moment, rather than skipping writes on OOM, the fault handler
just returns to userspace, and the instruction is retried. This isn't
terrible, but it's not quite what is intended. The actual instruction
skipping has to be implemented arch-by-arch, but so does this whole
vDSO series, so that's fine. The following commit addresses it for x86.

Cc: linux-mm@kvack.org
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 fs/proc/task_mmu.c             | 3 +++
 include/linux/mm.h             | 8 ++++++++
 include/trace/events/mmflags.h | 7 +++++++
 mm/Kconfig                     | 3 +++
 mm/memory.c                    | 4 ++++
 mm/mempolicy.c                 | 3 +++
 mm/mprotect.c                  | 2 +-
 mm/rmap.c                      | 5 +++--
 8 files changed, 32 insertions(+), 3 deletions(-)

Comments

Ingo Molnar Jan. 3, 2023, 10:50 a.m. UTC | #1
* Jason A. Donenfeld <Jason@zx2c4.com> wrote:

> The vDSO getrandom() implementation works with a buffer allocated with a
> new system call that has certain requirements:
> 
> - It shouldn't be written to core dumps.
>   * Easy: VM_DONTDUMP.
> - It should be zeroed on fork.
>   * Easy: VM_WIPEONFORK.
> 
> - It shouldn't be written to swap.
>   * Uh-oh: mlock is rlimited.
>   * Uh-oh: mlock isn't inherited by forks.
> 
> - It shouldn't reserve actual memory, but it also shouldn't crash when
>   page faulting in memory if none is available
>   * Uh-oh: MAP_NORESERVE respects vm.overcommit_memory=2.
>   * Uh-oh: VM_NORESERVE means segfaults.
> 
> It turns out that the vDSO getrandom() function has three really nice
> characteristics that we can exploit to solve this problem:
> 
> 1) Due to being wiped during fork(), the vDSO code is already robust to
>    having the contents of the pages it reads zeroed out midway through
>    the function's execution.
> 
> 2) In the absolute worst case of whatever contingency we're coding for,
>    we have the option to fallback to the getrandom() syscall, and
>    everything is fine.
> 
> 3) The buffers the function uses are only ever useful for a maximum of
>    60 seconds -- a sort of cache, rather than a long term allocation.
> 
> These characteristics mean that we can introduce VM_DROPPABLE, which
> has the following semantics:
> 
> a) It never is written out to swap.
> b) Under memory pressure, mm can just drop the pages (so that they're
>    zero when read back again).
> c) If there's not enough memory to service a page fault, it's not fatal,
>    and no signal is sent. Instead, writes are simply lost.
> d) It is inherited by fork.
> e) It doesn't count against the mlock budget, since nothing is locked.
> 
> This is fairly simple to implement, with the one snag that we have to
> use 64-bit VM_* flags, but this shouldn't be a problem, since the only
> consumers will probably be 64-bit anyway.
> 
> This way, allocations used by vDSO getrandom() can use:
> 
>     VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE
> 
> And there will be no problem with OOMing, crashing on overcommitment,
> using memory when not in use, not wiping on fork(), coredumps, or
> writing out to swap.
> 
> At the moment, rather than skipping writes on OOM, the fault handler
> just returns to userspace, and the instruction is retried. This isn't
> terrible, but it's not quite what is intended. The actual instruction
> skipping has to be implemented arch-by-arch, but so does this whole
> vDSO series, so that's fine. The following commit addresses it for x86.

Yeah, so VM_DROPPABLE adds a whole lot of complexity, corner cases, per 
arch low level work and rarely tested functionality (seriously, whose 
desktop system touches swap these days?), just so we can add a few pages of 
per thread vDSO data of a quirky type that in 99.999% of cases won't ever 
be 'dropped' from under the functionality that is using it and will thus 
bitrot fast?

The maintainability baby is being thrown out with the bath water IMO ...

And we want to add more complexity to a facility people desperately want to 
trust *more*? [RNG]

What's wrong with making mlock() more usable? Or just saying that "yeah, 
the vDSO can allocate a few more permanent pages outside of existing 
rlimits & mlock budgets"?

The rest of the series looks fine to me, but this special one of a kind 
VM_DROPPABLE is just way over-engineered cornercase functionality that 
pushes us well past the maintenance_overhead<->complexity trade-off sweet spot ...

Thanks,

	Ingo
Jason A. Donenfeld Jan. 3, 2023, 3:01 p.m. UTC | #2
On Tue, Jan 03, 2023 at 11:50:43AM +0100, Ingo Molnar wrote:
> 
> * Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> 
> > The vDSO getrandom() implementation works with a buffer allocated with a
> > new system call that has certain requirements:
> > 
> > - It shouldn't be written to core dumps.
> >   * Easy: VM_DONTDUMP.
> > - It should be zeroed on fork.
> >   * Easy: VM_WIPEONFORK.
> > 
> > - It shouldn't be written to swap.
> >   * Uh-oh: mlock is rlimited.
> >   * Uh-oh: mlock isn't inherited by forks.
> > 
> > - It shouldn't reserve actual memory, but it also shouldn't crash when
> >   page faulting in memory if none is available
> >   * Uh-oh: MAP_NORESERVE respects vm.overcommit_memory=2.
> >   * Uh-oh: VM_NORESERVE means segfaults.
> > 
> > It turns out that the vDSO getrandom() function has three really nice
> > characteristics that we can exploit to solve this problem:
> > 
> > 1) Due to being wiped during fork(), the vDSO code is already robust to
> >    having the contents of the pages it reads zeroed out midway through
> >    the function's execution.
> > 
> > 2) In the absolute worst case of whatever contingency we're coding for,
> >    we have the option to fallback to the getrandom() syscall, and
> >    everything is fine.
> > 
> > 3) The buffers the function uses are only ever useful for a maximum of
> >    60 seconds -- a sort of cache, rather than a long term allocation.
> > 
> > These characteristics mean that we can introduce VM_DROPPABLE, which
> > has the following semantics:
> > 
> > a) It never is written out to swap.
> > b) Under memory pressure, mm can just drop the pages (so that they're
> >    zero when read back again).
> > c) If there's not enough memory to service a page fault, it's not fatal,
> >    and no signal is sent. Instead, writes are simply lost.
> > d) It is inherited by fork.
> > e) It doesn't count against the mlock budget, since nothing is locked.
> > 
> > This is fairly simple to implement, with the one snag that we have to
> > use 64-bit VM_* flags, but this shouldn't be a problem, since the only
> > consumers will probably be 64-bit anyway.
> > 
> > This way, allocations used by vDSO getrandom() can use:
> > 
> >     VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE
> > 
> > And there will be no problem with OOMing, crashing on overcommitment,
> > using memory when not in use, not wiping on fork(), coredumps, or
> > writing out to swap.
> > 
> > At the moment, rather than skipping writes on OOM, the fault handler
> > just returns to userspace, and the instruction is retried. This isn't
> > terrible, but it's not quite what is intended. The actual instruction
> > skipping has to be implemented arch-by-arch, but so does this whole
> > vDSO series, so that's fine. The following commit addresses it for x86.
> 
> Yeah, so VM_DROPPABLE adds a whole lot of complexity, corner cases, per 
> arch low level work and rarely tested functionality (seriously, whose 
> desktop system touches swap these days?), just so we can add a few pages of 
> per thread vDSO data of a quirky type that in 99.999% of cases won't ever 
> be 'dropped' from under the functionality that is using it and will thus 
> bitrot fast?

It sounds like you've misunderstood the issue.

Firstly, the arch work is +19 lines (in the vdso branch of random.git).
That's very small and basic. Don't misrepresent it just to make a point.

Secondly, and more importantly, swapping this data is *not* permissible.
It doesn't matter if you think that certain systems don't use swap
anyway (you're wrong though regardless -- desktops will swap things
pretty regularly). It simply must *not* be swapped under any
circumstances. It's not that swapping is simply undesirable; it's
absolutely catastrophic. Do you understand that distinction?

If so, then if you don't like this solution, perhaps you could mention
something that accomplishes the goals in the commit message.

> The maintainability baby is being thrown out with the bath water IMO ...

Really? Are you sure this isn't just a matter of you not having written
it yourself or something? The commit is pretty short and basic. All the
actual internals for this kind of thing are already in present. The
commit just pipes it through.

> And we want to add more complexity to a facility people desperately want to 
> trust *more*? [RNG]

That seems like a ridiculous rhetorical leap.

> What's wrong with making mlock() more usable? Or just saying that "yeah, 
> the vDSO can allocate a few more permanent pages outside of existing 
> rlimits & mlock budgets"?

Did you actually read the commit message?

And now you're suggesting a rlimit backdoor? And adding more cases for
fork() to fail or block? That sounds terrible.

> The rest of the series looks fine to me, but this special one of a kind 
> VM_DROPPABLE is just way over-engineered cornercase functionality that 
> pushes us well past the maintenance_overhead<->complexity trade-off sweet spot ...

I think you should reevaluate that judgement. Consider the actual
problem. I think if you look at it carefully you'll see that this is a
pretty straight forward solution. It's deliberately not over-engineered.
It uses existing facilities and just does the plumbing in the right way
to make it work. It's really not that complicated.

The commit is +38, -4. The commit message has more lines than that.

Jason
Ingo Molnar Jan. 3, 2023, 6:15 p.m. UTC | #3
* Jason A. Donenfeld <Jason@zx2c4.com> wrote:

> On Tue, Jan 03, 2023 at 11:50:43AM +0100, Ingo Molnar wrote:
> > 
> > * Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> > 
> > > The vDSO getrandom() implementation works with a buffer allocated with a
> > > new system call that has certain requirements:
> > > 
> > > - It shouldn't be written to core dumps.
> > >   * Easy: VM_DONTDUMP.
> > > - It should be zeroed on fork.
> > >   * Easy: VM_WIPEONFORK.
> > > 
> > > - It shouldn't be written to swap.
> > >   * Uh-oh: mlock is rlimited.
> > >   * Uh-oh: mlock isn't inherited by forks.
> > > 
> > > - It shouldn't reserve actual memory, but it also shouldn't crash when
> > >   page faulting in memory if none is available
> > >   * Uh-oh: MAP_NORESERVE respects vm.overcommit_memory=2.
> > >   * Uh-oh: VM_NORESERVE means segfaults.
> > > 
> > > It turns out that the vDSO getrandom() function has three really nice
> > > characteristics that we can exploit to solve this problem:
> > > 
> > > 1) Due to being wiped during fork(), the vDSO code is already robust to
> > >    having the contents of the pages it reads zeroed out midway through
> > >    the function's execution.
> > > 
> > > 2) In the absolute worst case of whatever contingency we're coding for,
> > >    we have the option to fallback to the getrandom() syscall, and
> > >    everything is fine.
> > > 
> > > 3) The buffers the function uses are only ever useful for a maximum of
> > >    60 seconds -- a sort of cache, rather than a long term allocation.
> > > 
> > > These characteristics mean that we can introduce VM_DROPPABLE, which
> > > has the following semantics:
> > > 
> > > a) It never is written out to swap.
> > > b) Under memory pressure, mm can just drop the pages (so that they're
> > >    zero when read back again).
> > > c) If there's not enough memory to service a page fault, it's not fatal,
> > >    and no signal is sent. Instead, writes are simply lost.
> > > d) It is inherited by fork.
> > > e) It doesn't count against the mlock budget, since nothing is locked.
> > > 
> > > This is fairly simple to implement, with the one snag that we have to
> > > use 64-bit VM_* flags, but this shouldn't be a problem, since the only
> > > consumers will probably be 64-bit anyway.
> > > 
> > > This way, allocations used by vDSO getrandom() can use:
> > > 
> > >     VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE
> > > 
> > > And there will be no problem with OOMing, crashing on overcommitment,
> > > using memory when not in use, not wiping on fork(), coredumps, or
> > > writing out to swap.
> > > 
> > > At the moment, rather than skipping writes on OOM, the fault handler
> > > just returns to userspace, and the instruction is retried. This isn't
> > > terrible, but it's not quite what is intended. The actual instruction
> > > skipping has to be implemented arch-by-arch, but so does this whole
> > > vDSO series, so that's fine. The following commit addresses it for x86.
> > 
> > Yeah, so VM_DROPPABLE adds a whole lot of complexity, corner cases, per 
> > arch low level work and rarely tested functionality (seriously, whose 
> > desktop system touches swap these days?), just so we can add a few pages of 
> > per thread vDSO data of a quirky type that in 99.999% of cases won't ever 
> > be 'dropped' from under the functionality that is using it and will thus 
> > bitrot fast?
> 
> It sounds like you've misunderstood the issue.
> 
> Firstly, the arch work is +19 lines (in the vdso branch of random.git).

For a single architecture: x86.

And it's only 19 lines because x86 already happens to have a bunch of 
complexity implemented, such as a safe instruction decoder that allows the 
skipping of an instruction - which relies on thousands of lines of 
complexity.

On an architecture where this isn't present, it would have to be 
implemented to support the instruction-skipping aspect of VM_DROPPABLE.

Even on x86, it's not common today for the software-decoder to be used in 
unprivileged code - primary use was debugging & instrumentation code. So 
your patches bring this piece of complexity to a much larger scope of 
untrusted user-space functionality.

> That's very small and basic. Don't misrepresent it just to make a point.

I'm not misrepresenting anything.

> Secondly, and more importantly, swapping this data is *not* permissible.

I did not suggest to swap it: my suggestion is to just pin these vDSO data 
pages. The per thread memory overhead is infinitesimal on the vast majority 
of the target systems, and the complexity trade-off you are proposing is 
poorly reasoned IMO.

Anyway:

  > Don't misrepresent it just to make a point.
  ...
  > That seems like a ridiculous rhetorical leap.
  ...
  > Did you actually read the commit message?

Frankly, I don't appreciate your condescending discussion style that 
borders on the toxic, and to save time I'm nacking this technical approach 
until both the patch-set and your reaction to constructive review feedback 
improves:

    NAcked-by: Ingo Molnar <mingo@kernel.org>

I think my core point that it would be much simpler to simply pin those 
pages and not introduce rarely-excercised 'discardable memory' semantics in 
Linux is a fair one - so it's straightforward to lift this NAK.

I'll re-evaluate the NACK on every new iteration of this patchset I see.

Thanks,

	Ingo
Andy Lutomirski Jan. 3, 2023, 6:36 p.m. UTC | #4
On Tue, Jan 3, 2023 at 2:50 AM Ingo Molnar <mingo@kernel.org> wrote:
>
>
> * Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> > The vDSO getrandom() implementation works with a buffer allocated with a
> > new system call that has certain requirements:
> >
> > - It shouldn't be written to core dumps.
> >   * Easy: VM_DONTDUMP.
> > - It should be zeroed on fork.
> >   * Easy: VM_WIPEONFORK.

I have a rather different suggestion: make a special mapping.  Jason,
you're trying to shoehorn all kinds of bizarre behavior into the core
mm, and none of that seems to me to belong to the core mm.  Instead,
have an actual special mapping with callbacks that does the right
thing.  No fancy VM flags.

Memory pressure: have it free and unmap it self.  Gets accessed again?
 ->fault can handle it.

Want to mlock it?  No, don't do that -- that's absurd.  Just arrange
so that, if it gets evicted, it's not written out anywhere.  And when
it gets faulted back in it does the right thing -- see above.

Zero on fork?  I'm sure that's manageable with a special mapping.  If
not, you can add a new vm operation or similar to make it work.  (Kind
of like how we extended special mappings to get mremap right a couple
years go.)  But maybe you don't want to *zero* it on fork and you want
to do something more intelligent.  Fine -- you control ->fault!

> >
> > - It shouldn't be written to swap.
> >   * Uh-oh: mlock is rlimited.
> >   * Uh-oh: mlock isn't inherited by forks.

No mlock, no problems.

> >
> > - It shouldn't reserve actual memory, but it also shouldn't crash when
> >   page faulting in memory if none is available
> >   * Uh-oh: MAP_NORESERVE respects vm.overcommit_memory=2.
> >   * Uh-oh: VM_NORESERVE means segfaults.

->fault can do whatever you want.

And there is no shortage of user memory that *must* be made available
on fault in order to resume the faulting process.  ->fault can handle
this.

> >
> > It turns out that the vDSO getrandom() function has three really nice
> > characteristics that we can exploit to solve this problem:
> >
> > 1) Due to being wiped during fork(), the vDSO code is already robust to
> >    having the contents of the pages it reads zeroed out midway through
> >    the function's execution.
> >
> > 2) In the absolute worst case of whatever contingency we're coding for,
> >    we have the option to fallback to the getrandom() syscall, and
> >    everything is fine.
> >
> > 3) The buffers the function uses are only ever useful for a maximum of
> >    60 seconds -- a sort of cache, rather than a long term allocation.
> >
> > These characteristics mean that we can introduce VM_DROPPABLE, which
> > has the following semantics:

No need for another vm flag.

> >
> > a) It never is written out to swap.

No need to complicate the swap logic for this.

> > b) Under memory pressure, mm can just drop the pages (so that they're
> >    zero when read back again).

Or ->fault could even repopulate it without needing to ever read zeros.

> > c) If there's not enough memory to service a page fault, it's not fatal,
> >    and no signal is sent. Instead, writes are simply lost.

This just seems massively overcomplicated to me.  If there isn't
enough memory to fault in a page of code, we don't have some magic
instruction emulator in the kernel.  We either OOM or we wait for
memory to show up.

> > d) It is inherited by fork.

If you have a special mapping and you fork, it doesn't magically turn
into normal memory.

> > e) It doesn't count against the mlock budget, since nothing is locked.

Special mapping -> no mlock.

> >
> > This is fairly simple to implement, with the one snag that we have to
> > use 64-bit VM_* flags, but this shouldn't be a problem, since the only
> > consumers will probably be 64-bit anyway.
> >
> > This way, allocations used by vDSO getrandom() can use:
> >
> >     VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE
> >
> > And there will be no problem with OOMing, crashing on overcommitment,
> > using memory when not in use, not wiping on fork(), coredumps, or
> > writing out to swap.
> >
> > At the moment, rather than skipping writes on OOM, the fault handler
> > just returns to userspace, and the instruction is retried. This isn't
> > terrible, but it's not quite what is intended. The actual instruction
> > skipping has to be implemented arch-by-arch, but so does this whole
> > vDSO series, so that's fine. The following commit addresses it for x86.

I really dislike this.  I'm with Ingo.
Jason A. Donenfeld Jan. 3, 2023, 6:51 p.m. UTC | #5
On Tue, Jan 03, 2023 at 07:15:50PM +0100, Ingo Molnar wrote:
> Frankly, I don't appreciate your condescending discussion style that 
> borders on the toxic, and to save time I'm nacking this technical approach 
> until both the patch-set and your reaction to constructive review feedback 
> improves:
> 
>     NAcked-by: Ingo Molnar <mingo@kernel.org>

Your initial email to me did not strike me as constructive at all. All I
gotta say is that you really seem to live up to your reputation here...

But trying to steer things back to the technical realm:

> For a single architecture: x86.
> 
> And it's only 19 lines because x86 already happens to have a bunch of 
> complexity implemented, such as a safe instruction decoder that allows the 
> skipping of an instruction - which relies on thousands of lines of 
> complexity.
> 
> On an architecture where this isn't present, it would have to be 
> implemented to support the instruction-skipping aspect of VM_DROPPABLE.

My assumption is actually the opposite: that x86 (CISC) is basically the
most complex, and that the RISC architectures will all be a good deal
more trivial -- e.g. just adding 4 to IP on some. It looks like some
architectures also already have mature decoders where required.

> Even on x86, it's not common today for the software-decoder to be used in 
> unprivileged code - primary use was debugging & instrumentation code. So 
> your patches bring this piece of complexity to a much larger scope of 
> untrusted user-space functionality.

As far as I can tell, this decoder *is* used with userspace already.
It's used by SEV and by UMIP, in a totally userspace accessible way. Am
I misunderstanding its use there? It looks to me like that operates on
untrusted code.

*However* - if your big objection to this patch is that the instruction
skipping is problematic, we could actually punt that part. The result
will be that userspace just retries the memory write and the fault
happens again, and eventually it succeeds. From a perspective of
vgetrandom(), that's perhaps worse -- with this v14 patchset, it'll
immediately fallback to the syscall under memory pressure -- but you
could argue that nobody really cares about performance at that point
anyway, and so just retrying the fault until it succeeds is a less
complex behavior that would be just fine.

Let me know if you think that'd be an acceptable compromise, and I'll
roll it into v15. As a preview, it pretty much amounts to dropping 3/7
and editing the commit message in this 2/7 patch.

> I did not suggest to swap it: my suggestion is to just pin these vDSO data 
> pages. The per thread memory overhead is infinitesimal on the vast majority 
> of the target systems, and the complexity trade-off you are proposing is 
> poorly reasoned IMO.
> 
> I think my core point that it would be much simpler to simply pin those 
> pages and not introduce rarely-excercised 'discardable memory' semantics in 
> Linux is a fair one - so it's straightforward to lift this NAK.

Okay so this is where I think we're really not lined up and is a large
part of why I wondered whether you'd read the commit messages before
dismissing this. This VM_DROPPABLE mapping comes as a result of a
vgetrandom_alloc() syscall, which (g)libc makes at some point, and then
the result of that is passed to the vDSO getrandom() function. The
memory in vgetrandom_alloc() is then carved up, one per thread, with
(g)libc's various internal pthread creation/exit functions.

So that means this isn't a thing that's trivially limited to just one
per thread. Userspace can call vgetrandom_alloc() all it wants.

Thus, I'm having a hard time seeing how relaxing rlimits here as you
suggested doesn't amount to an rlimit backdoor. I'm also not seeing
other fancy options for "pinning pages" as you mentioned in this email.
Something about having the kernel allocate them on clone()? That seems
terrible and complex. And if you do want this all to go through mlock(),
somehow, there's still the fork() inheritabiity issue. (This was all
discussed on the thread a few versions ago that surfaced these issues,
by the way.)

So I'm not really seeing any good alternatives, no matter how hard I
squint at your suggestions. Maybe you can elaborate a bit?
Alternatively, perhaps the compromise I suggested above where we ditch
the instruction decoder stuff is actually fine with you?

> rarely-excercised 'discardable memory' semantics in 
> Linux is a fair one - so it's straightforward to lift this NAK.

I still don't think calling this "rarely-exercised" is true. Desktop
machines regularly OOM with lots of Chrome tabs, and this is
functionality that'll be in glibc, so it'll be exercised quite often.
Even on servers, many operators work with the philosophy that unused RAM
is wasted RAM, and so servers are run pretty close to capacity. Not to
mention Android, where lots of handsets have way too little RAM.
Swapping and memory pressure and so forth is very real. So claiming that
this is somehow obscure or rarely used or what have you isn't very
accurate. These are code paths that will certainly get exercised.

Jason
Jason A. Donenfeld Jan. 3, 2023, 7:05 p.m. UTC | #6
Hi Andy,

Thanks for your constructive suggestions.

On Tue, Jan 03, 2023 at 10:36:01AM -0800, Andy Lutomirski wrote:
> > > c) If there's not enough memory to service a page fault, it's not fatal,
> > >    and no signal is sent. Instead, writes are simply lost.
> 
> This just seems massively overcomplicated to me.  If there isn't
> enough memory to fault in a page of code, we don't have some magic
> instruction emulator in the kernel.  We either OOM or we wait for
> memory to show up.

Before addressing the other parts of your email, I thought I'd touch on
this. Quoting from the email I just wrote Ingo:

| *However* - if your big objection to this patch is that the instruction
| skipping is problematic, we could actually punt that part. The result
| will be that userspace just retries the memory write and the fault
| happens again, and eventually it succeeds. From a perspective of
| vgetrandom(), that's perhaps worse -- with this v14 patchset, it'll
| immediately fallback to the syscall under memory pressure -- but you
| could argue that nobody really cares about performance at that point
| anyway, and so just retrying the fault until it succeeds is a less
| complex behavior that would be just fine.
| 
| Let me know if you think that'd be an acceptable compromise, and I'll
| roll it into v15. As a preview, it pretty much amounts to dropping 3/7
| and editing the commit message in this 2/7 patch.

IOW, I think the main ideas of the patch work just fine without "point
c" with the instruction skipping. Instead, waiting/retrying could
potentially work. So, okay, it seems like the two of you both hate the
instruction decoder stuff, so I'll plan on working that part in, in one
way or another, for v15.

> On Tue, Jan 3, 2023 at 2:50 AM Ingo Molnar <mingo@kernel.org> wrote:
> > > The vDSO getrandom() implementation works with a buffer allocated with a
> > > new system call that has certain requirements:
> > >
> > > - It shouldn't be written to core dumps.
> > >   * Easy: VM_DONTDUMP.
> > > - It should be zeroed on fork.
> > >   * Easy: VM_WIPEONFORK.
> 
> I have a rather different suggestion: make a special mapping.  Jason,
> you're trying to shoehorn all kinds of bizarre behavior into the core
> mm, and none of that seems to me to belong to the core mm.  Instead,
> have an actual special mapping with callbacks that does the right
> thing.  No fancy VM flags.

Oooo! I like this. Avoiding adding VM_* flags would indeed be nice.
I had seen things that I thought looked in this direction with the shmem
API, but when I got into the details, it looked like this was meant for
something else and couldn't address most of what I wanted here.

If you say this is possible, I'll look again to see if I can figure it
out. Though, if you have some API name at the top of your head, you
might save me some code squinting time.

> Memory pressure: have it free and unmap it self.  Gets accessed again?
>  ->fault can handle it.

Right.

> Want to mlock it?  No, don't do that -- that's absurd.  Just arrange
> so that, if it gets evicted, it's not written out anywhere.  And when
> it gets faulted back in it does the right thing -- see above.

Presumably mlock calls are redirected to some function pointer so I can
just return EINTR?

> Zero on fork?  I'm sure that's manageable with a special mapping.  If
> not, you can add a new vm operation or similar to make it work.  (Kind
> of like how we extended special mappings to get mremap right a couple
> years go.)  But maybe you don't want to *zero* it on fork and you want
> to do something more intelligent.  Fine -- you control ->fault!

Doing something more intelligent would be an interesting development, I
guess... But, before I think about that, all mapping have flags;
couldn't I *still* set VM_WIPEONFORK on the special mapping? Or does the
API you have in mind not work that way? (Side note: I also want
VM_DONTDUMP to work.)

> > > - It shouldn't reserve actual memory, but it also shouldn't crash when
> > >   page faulting in memory if none is available
> > >   * Uh-oh: MAP_NORESERVE respects vm.overcommit_memory=2.
> > >   * Uh-oh: VM_NORESERVE means segfaults.
> 
> ->fault can do whatever you want.
> 
> And there is no shortage of user memory that *must* be made available
> on fault in order to resume the faulting process.  ->fault can handle
> this.

I'll look to see how other things handle this...

Anyway, thanks for the suggestion. That seems like a good future
direction for this.

Jason
Linus Torvalds Jan. 3, 2023, 7:19 p.m. UTC | #7
On Tue, Jan 3, 2023 at 10:36 AM Andy Lutomirski <luto@kernel.org> wrote:
>
> I have a rather different suggestion: make a special mapping.  Jason,
> you're trying to shoehorn all kinds of bizarre behavior into the core
> mm, and none of that seems to me to belong to the core mm.  Instead,
> have an actual special mapping with callbacks that does the right
> thing.  No fancy VM flags.

I don't disagree, but honestly, my deeper reaction is "this is not worth it".

I think the real issue here is that Jason has to prove that this is
such a big deal that the VM has to be modified *at*all* for this.

Honestly, "getrandom()" performance simply is not important enough to
design VM changes for.

Do some people care? Debatable. Jason cares, sure. But people have
gotten used to doing their own random number implementations if they
*really* care, and yes, they've gotten them wrong, or they've not
performed as well as they could, but on the whole this is still a
really tiny thing, and Jason is trying to micro-optimize something
that THE KERNEL SHOULD NOT CARE ABOUT.

This should all be in libc. Not in the kernel with special magic vdso
support and special buffer allocations. The kernel should give good
enough support that libc can do a good job, but the kernel should
simply *not* take the approach of "libc will get this wrong, so let's
just do all the work for it".

Let user space do their own percpu areas if they really care. And most
(as in 99.9%) of all user space won't care at all, and is perfectly
happy just doing a read() from /dev/urandom or using our existing
"getrandom()" without any super-clever vdso support with magic
temporary buffer handling.

Now, if the magic buffers were something cool that were a generic
concept that a lot of *other* cases would also kill for, that's one
thing. But this is such a small special case that absolutely *nobody*
has asked for, and that nothing else wants.

             Linus
Jason A. Donenfeld Jan. 3, 2023, 7:35 p.m. UTC | #8
Hi Linus,

On Tue, Jan 03, 2023 at 11:19:36AM -0800, Linus Torvalds wrote:
> performed as well as they could, but on the whole this is still a
> really tiny thing, and Jason is trying to micro-optimize something
> that THE KERNEL SHOULD NOT CARE ABOUT.

I don't think this is about micro-optimization. Rather, userspace RNGs
aren't really possible in a safe way at the moment. This patchset aims
to make that possible, by providing things that libc will use. The cover
letter of this series makes that case.

> This should all be in libc. Not in the kernel with special magic vdso
> support and special buffer allocations. The kernel should give good
> enough support that libc can do a good job, but the kernel should
> simply *not* take the approach of "libc will get this wrong, so let's
> just do all the work for it".

That's not what this patchset does. libc still needs to handle
per-thread semantics itself and slice up buffers and so forth. The vDSO
doesn't allocate any memory. I suspect this was Ingo's presumption too,
and you extrapolated from that. But that's not what's happening.

> Now, if the magic buffers were something cool that were a generic
> concept that a lot of *other* cases would also kill for, that's one

Actually, I was thinking VM_DROPPABLE might be a somewhat interesting
thing to introduce for database caches and so forth, where dropping
things under memory pressure is actually useful. Obviously that's the
result of a thought process involving a solution looking for a problem,
but I considered this a month or so ago when I first sent this in, and
decided that if I was to expose this via a MAP_* flag in mmap(), that
should come later, so I didn't here. Anyway, that is all to say it's not
like this is the only use for it. But either way, I don't actually have
my sights set on it as a general solution -- after all, I am not in the
process of authoring a database cache or something -- and if I can make
Andy's vm_ops suggestion work, that sounds perfectly fine to me.

> thing. But this is such a small special case that absolutely *nobody*
> has asked for, and that nothing else wants.

Okay so that's where I think you're really quite mistaken. If you recall
the original discussion on this, I was initially a bit hesitant to do it
and didn't really want to do it that much. And then I looked into it,
and talked to a bunch of library and program authors, and saw that
there's actually quite a bit of demand for this, and generally an
unhealthy ecosystem of bad solutions that have cropped up in lieu of a
good one.

(I talked about this a bit with tglx at Plumbers, and I had hoped to
discuss with you as well, but you weren't available.)

Jason
Linus Torvalds Jan. 3, 2023, 7:54 p.m. UTC | #9
On Tue, Jan 3, 2023 at 11:35 AM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> I don't think this is about micro-optimization. Rather, userspace RNGs
> aren't really possible in a safe way at the moment.

"Bah, humbug", to quote a modern-time philosopher.

It's humbug simply because it makes two assumptions that aren't even valid:

 (a) that you have to do it in user space in the first place

 (b) that you care about the particular semantics that you are looking for

The thing is, you can just do getrandom(). It's what people already
do. Problem solved.

The system call interface just isn't that expensive. Sure, the various
spectre mitigations have screwed us over in a big way on various
hardware, but even with that in mind system calls aren't some kind of
insurmountable hazard. There are absolutely tons of system calls that
are way more performance-critical than getrandom() has ever been.

So 99% of the time, the solution really is just "getrandom()",
generally with the usual buffering ("read more than you need, so that
you don't do it all the time").\

Which is not at all different from all the other system calls like
"read()" and friends.

And that's entirely ignoring the fact that something like "read()"
basically *has* to be a system call, because there are no alternatives
(ok, mmap, but that's actually much slower, unless you're in it for
the reuse-and/or-sparse-use situation).

With random numbers, you have tons of other alternatives, including
just hardware giving you the random numbers almost for free and you
just using your own rng in user space entirely.

And yes, some users might want to do things like periodic re-seeding,
but we actually do have fast ways to check for periodic events in user
space, ranging from entirely non-kernel things like rdtsc to actual
vdso's for gettimeofday().

So when you say that this isn't about micro-optimizations, I really
say "humbug". It's *clearly* about micro-optimization of an area that
very few people care about, since the alternative is just our existing
"getrandom()" that is not at all horribly slow.

Let me guess: the people you talked to who were excited about this are
mainly just library people?

And they are excited about this not because they actually need the
random numbers themselves, but because they just don't want to do the
work. They want to get nice benchmark numbers, and push the onus of
complexity onto somebody else?

Because the people who actually *use* the random numbers and are *so*
performance-critical about them already have their own per-thread
buffers or what-not, and need to have them anyway because they write
real applications that possibly work on other systems than Linux, but
that *definitely* work on enterprise systems that won't even have this
kind of new support for another several years even if we implemented
it today.

In fact, they probably are people inside of cloud vendors that are
still running 4.x kernels on a large portion of their machines.

          Linus
Jason A. Donenfeld Jan. 3, 2023, 8:03 p.m. UTC | #10
Hi Linus,

On Tue, Jan 03, 2023 at 11:54:35AM -0800, Linus Torvalds wrote:
> So 99% of the time, the solution really is just "getrandom()",
> generally with the usual buffering ("read more than you need, so that
> you don't do it all the time").\

That buffering cannot be done safely currently -- VM forks, reseeding
semantics, and so forth. Again, discussed in the cover letter of the
patch if you'd like to engage with those ideas.

> just using your own rng in user space entirely.

This is the thing that isn't quite safe.

> Let me guess: the people you talked to who were excited about this are
> mainly just library people?

No, actually. Mainly people deploying production network-facing things
that need a lot of randomness often. e.g. CBC nonces in TLS, or random
sequence numbers in UDP-based protocols.

> So when you say that this isn't about micro-optimizations, I really
> say "humbug". It's *clearly* about micro-optimization of an area that
> very few people care about, since the alternative is just our existing
> "getrandom()" that is not at all horribly slow.

The alternative is what people currently do, which is attempt to
implement a userspace RNG, which cannot be done safely. Again, -->
cover letter.

> Because the people who actually *use* the random numbers and are *so*
> performance-critical about them already have their own per-thread
> buffers or what-not

...which are not safe.

Anyway, if you're NACK'ing the whole getrandom() vDSO project, just
please outright say so, so I don't spend another 14 revisions in vain.

Jason
Linus Torvalds Jan. 3, 2023, 8:15 p.m. UTC | #11
On Tue, Jan 3, 2023 at 12:03 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> That buffering cannot be done safely currently

.. again, this is "your semantics" (the (b) in my humbug list), not
necessarily reality for anybody else.

I'm NAK'ing making invasive changes to the VM for something this
specialized. I really believe that the people who have this issue are
*so* few and far between that they can deal with the VM forking and
reseeding issues quite well on their own.

            Linus
Linus Torvalds Jan. 3, 2023, 8:25 p.m. UTC | #12
On Tue, Jan 3, 2023 at 12:15 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, Jan 3, 2023 at 12:03 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >
> > That buffering cannot be done safely currently
>
> .. again, this is "your semantics" (the (b) in my humbug list), not
> necessarily reality for anybody else.

Just to make an example: fork() is already problematic for something
as fundamental as <stdio.h>.

That doesn't mean that we do a special fork-safe stdio.h
infrastructure in the kernel. It just means that people have to do
things like fflush(NULL) (or use variations of setbuf() and friends)
when they deal with fork() and stdio interactions.

The random number generator really isn't that different. Periodic
reseeding isn't something stdio has to deal with, but having a
timestamp in user space library and forcing a re-seed isn't unheard of
in other contexts.

You don't even need anything as fancy as an actual timer, because it
doesn't need *active* flushing, just a "oh, it's been too long since
we read the random data, let's do it again".

And yes, I bet it would be a good idea to have an actual library for
this that handles (and *documents*) these kinds of issues - exactly
like a <stdio.h>, just for randomness.

I just don't think it should be involved in the kernel - again exactly
like <stdio.h>.

            Linus
Jason A. Donenfeld Jan. 3, 2023, 8:44 p.m. UTC | #13
On Tue, Jan 03, 2023 at 12:15:57PM -0800, Linus Torvalds wrote:
> On Tue, Jan 3, 2023 at 12:03 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >
> > That buffering cannot be done safely currently
> 
> .. again, this is "your semantics" (the (b) in my humbug list), not
> necessarily reality for anybody else.

Yea that's fair. Except, of course, I maintain that my semantics are
important ones. :)

> I'm NAK'ing making invasive changes to the VM for something this
> specialized. I really believe that the people who have this issue are
> *so* few and far between that they can deal with the VM forking and
> reseeding issues quite well on their own.

Okay, that's fine. I'll see if I can make this work without having to do
surgery on mm and introduce a new VM_* flag and such. Hopefully I'll
succeed there.

Jason
Andy Lutomirski Jan. 3, 2023, 8:52 p.m. UTC | #14
On Tue, Jan 3, 2023 at 11:06 AM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> Hi Andy,
>
> Thanks for your constructive suggestions.
>
> On Tue, Jan 03, 2023 at 10:36:01AM -0800, Andy Lutomirski wrote:
> > > > c) If there's not enough memory to service a page fault, it's not fatal,
> > > >    and no signal is sent. Instead, writes are simply lost.
> >
> > This just seems massively overcomplicated to me.  If there isn't
> > enough memory to fault in a page of code, we don't have some magic
> > instruction emulator in the kernel.  We either OOM or we wait for
> > memory to show up.
>
> Before addressing the other parts of your email, I thought I'd touch on
> this. Quoting from the email I just wrote Ingo:
>
> | *However* - if your big objection to this patch is that the instruction
> | skipping is problematic, we could actually punt that part. The result
> | will be that userspace just retries the memory write and the fault
> | happens again, and eventually it succeeds. From a perspective of
> | vgetrandom(), that's perhaps worse -- with this v14 patchset, it'll
> | immediately fallback to the syscall under memory pressure -- but you
> | could argue that nobody really cares about performance at that point
> | anyway, and so just retrying the fault until it succeeds is a less
> | complex behavior that would be just fine.
> |
> | Let me know if you think that'd be an acceptable compromise, and I'll
> | roll it into v15. As a preview, it pretty much amounts to dropping 3/7
> | and editing the commit message in this 2/7 patch.
>
> IOW, I think the main ideas of the patch work just fine without "point
> c" with the instruction skipping. Instead, waiting/retrying could
> potentially work. So, okay, it seems like the two of you both hate the
> instruction decoder stuff, so I'll plan on working that part in, in one
> way or another, for v15.
>
> > On Tue, Jan 3, 2023 at 2:50 AM Ingo Molnar <mingo@kernel.org> wrote:
> > > > The vDSO getrandom() implementation works with a buffer allocated with a
> > > > new system call that has certain requirements:
> > > >
> > > > - It shouldn't be written to core dumps.
> > > >   * Easy: VM_DONTDUMP.
> > > > - It should be zeroed on fork.
> > > >   * Easy: VM_WIPEONFORK.
> >
> > I have a rather different suggestion: make a special mapping.  Jason,
> > you're trying to shoehorn all kinds of bizarre behavior into the core
> > mm, and none of that seems to me to belong to the core mm.  Instead,
> > have an actual special mapping with callbacks that does the right
> > thing.  No fancy VM flags.
>
> Oooo! I like this. Avoiding adding VM_* flags would indeed be nice.
> I had seen things that I thought looked in this direction with the shmem
> API, but when I got into the details, it looked like this was meant for
> something else and couldn't address most of what I wanted here.
>
> If you say this is possible, I'll look again to see if I can figure it
> out. Though, if you have some API name at the top of your head, you
> might save me some code squinting time.

Look for _install_special_mapping().

--Andy

> > Want to mlock it?  No, don't do that -- that's absurd.  Just arrange
> > so that, if it gets evicted, it's not written out anywhere.  And when
> > it gets faulted back in it does the right thing -- see above.
>
> Presumably mlock calls are redirected to some function pointer so I can
> just return EINTR?

Or just don't worry about it.  If someone mlocks() it, that's their
problem.  The point is that no one needs to.

>
> > Zero on fork?  I'm sure that's manageable with a special mapping.  If
> > not, you can add a new vm operation or similar to make it work.  (Kind
> > of like how we extended special mappings to get mremap right a couple
> > years go.)  But maybe you don't want to *zero* it on fork and you want
> > to do something more intelligent.  Fine -- you control ->fault!
>
> Doing something more intelligent would be an interesting development, I
> guess... But, before I think about that, all mapping have flags;
> couldn't I *still* set VM_WIPEONFORK on the special mapping? Or does the
> API you have in mind not work that way? (Side note: I also want
> VM_DONTDUMP to work.)

You really want unmap (the pages, not the vma) on fork, not wipe on
fork.  It'll be VM_SHARED, and I'm not sure what VM_WIPEONFORK |
VM_SHARED does.
Yann Droneaud Jan. 5, 2023, 9:57 p.m. UTC | #15
Hi,

Le 03/01/2023 à 21:44, Jason A. Donenfeld a écrit :
> On Tue, Jan 03, 2023 at 12:15:57PM -0800, Linus Torvalds wrote:
>> On Tue, Jan 3, 2023 at 12:03 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>>> That buffering cannot be done safely currently
>> .. again, this is "your semantics" (the (b) in my humbug list), not
>> necessarily reality for anybody else.
> Yea that's fair. Except, of course, I maintain that my semantics are
> important ones. :)


I concur.

To hold secret material, we need MADV_WIPEONFORK | MADV_DONTDUMP and the side effect of mlock() (pages' content never written to swap), inherited across fork().
And I want mlock() without paying the price.

Jason's proposed semantics, which I call MADV_WIPEONSWAP, provide a mean to hold /unlimited/ amount secrets in userspace memory (not limited by RLIMIT_MEMLOCK).
The only constraint for userspace is to handle the case pages are wiped, which is already the case of userspace arc4random()'s implementation.

Regards.
Jason A. Donenfeld Jan. 5, 2023, 10:57 p.m. UTC | #16
On Thu, Jan 05, 2023 at 10:57:48PM +0100, Yann Droneaud wrote:
> Hi,
> 
> Le 03/01/2023 à 21:44, Jason A. Donenfeld a écrit :
> > On Tue, Jan 03, 2023 at 12:15:57PM -0800, Linus Torvalds wrote:
> >> On Tue, Jan 3, 2023 at 12:03 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >>> That buffering cannot be done safely currently
> >> .. again, this is "your semantics" (the (b) in my humbug list), not
> >> necessarily reality for anybody else.
> > Yea that's fair. Except, of course, I maintain that my semantics are
> > important ones. :)
> 
> 
> I concur.
> 
> To hold secret material, we need MADV_WIPEONFORK | MADV_DONTDUMP and the side effect of mlock() (pages' content never written to swap), inherited across fork().
> And I want mlock() without paying the price.
> 
> Jason's proposed semantics, which I call MADV_WIPEONSWAP, provide a mean to hold /unlimited/ amount secrets in userspace memory (not limited by RLIMIT_MEMLOCK).
> The only constraint for userspace is to handle the case pages are wiped, which is already the case of userspace arc4random()'s implementation.

If you're actually serious about wanting a generic mechanism for
userspace, I think the moral of yesterday's poo-poo'ing all over this
cool new idea is that the Linux innercircle doesn't really care for
"security things" as a motivator and just takes the shortest and easiest
route toward swatting it away like a gadfly, assuming that the concerns
are unreal or niche or paranoid or whatever. This is obviously nothing
new - it's an old complaint beaten to death for years, with people who
are diehard it about eventually getting burnt out and leaving. So,
practically speaking, if you want this to exist, I think you have to
find some other cool use cases. Like, see if the database cache people
would actually love this. Or if it could be used as an opportunistic
renderer cache in Chrome that wouldn't result in OOMing with lots of
tabs. Or if shared process compiler servers could benefit from it.
"Droppable cache" is likely useful lots of places. So just find
SOMETHING that doesn't mean having to convince folks of a new security
model that justifies tickling mm/ innards in uncomfortable ways.

Jason
Linus Torvalds Jan. 6, 2023, 1:02 a.m. UTC | #17
On Thu, Jan 5, 2023 at 2:57 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> On Thu, Jan 05, 2023 at 10:57:48PM +0100, Yann Droneaud wrote:
> >
> > To hold secret material, we need MADV_WIPEONFORK | MADV_DONTDUMP and the side effect of mlock() (pages' content never written to swap), inherited across fork().
> > And I want mlock() without paying the price.
> >
> > Jason's proposed semantics, which I call MADV_WIPEONSWAP, provide a mean to hold /unlimited/ amount secrets in userspace memory (not limited by RLIMIT_MEMLOCK).
> > The only constraint for userspace is to handle the case pages are wiped, which is already the case of userspace arc4random()'s implementation.
>
> If you're actually serious about wanting a generic mechanism for
> userspace, I think the moral of yesterday's poo-poo'ing all over this
> cool new idea is that the Linux innercircle doesn't really care for
> "security things" as a motivator

No.

We don't take stupid statements as a motivator.

Stop with the histrionics and silly security theater BS.

There is *nop* security in "MADV_WIPEONFORK". You claiming that that
is "security" is just making you less believable and me ignoring your
arguments more.

It's a complete make-believe fairy tale.

Why would it be "security" to dump random state data? In most
situations it's a complete non-issue, and nobody cares.

And those situations that want to be extra careful, and are actually
generating keys, those situations can do all of this very carefully on
their own using existing machinery.

If you don't want a core-dump because you have sensitive information,
you do "ulimit -c 0". Or you use MADV_DONTDUMP that we've had forever.

And you don't want to have wipe-on-fork, because

 (a) if you want things to be wiped on fork, you just wipe it before
the fork (duh!)

 (b) but more likely, and more relevantly, you want to make *DAMN
SURE* you wiped things much earlier than that if you are really
security-conscious and just generated a secret key, because you don't
want to leak things accidentally other ways.

 (c) and you can use MADV_DONTFORK to not copy it at all, which again
we've had for a long time.

And if you don't want to have it written to swap, you're just making
sh*t up at that point.

First off, it's a purely theoretical thing in the first place. See (b)
above. Don't keep those random things around long enough (and
untouched enough) to hit the disk.

Secondly, anybody who can read swap space can already ptrace you and
read things much more easily that way.

Thirdly, you can just use mlock, and make sure you never have so much
super-sikret stuff pending for long times and in big buffers.

Fourth, if your keys are *that* sensitive, and *that* secret, just use
/dev/random or getrandom(), because you're not generating that kind of
volume of long-term keys, so the whole "I have a huge random buffer
that is super-secret" is a complete non-issue.

So stop making stupid arguments. The kernel is not supposed to
baby-sit programs that do things wrong on purpose, and that are
literally trying to do things wrong, and leaving secret stuff around
while they do a lot of other things.

You guys have literally MADE UP bad examples of so-called "security",
and then you use those as arguments for bad coding, and for
bad-mouthing kernel developers who just don't happen to believe in
that bad model.

None of what you ask for is for any kind of real security, it's all
just crazy "but I want to feel the warm and fuzzies and take shortcuts
elsewhere, and push my pain onto other people".

          Linus
Linus Torvalds Jan. 6, 2023, 2:08 a.m. UTC | #18
On Thu, Jan 5, 2023 at 5:02 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> None of what you ask for is for any kind of real security, it's all
> just crazy "but I want to feel the warm and fuzzies and take shortcuts
> elsewhere, and push my pain onto other people".

Actually, let me maybe soften that a bit and say that it's
"convenience features". It might make some things more _convenient_ to
do, exactly because it might allow other parts to do short-cuts.

But because it's a convenience-feature, it had also better either be
(a) really easy and clear to do in the kernel and (b) have
sufficiently *wide* convenience so that it doesn't end up being one of
those "corner case things we have to maintain forever and nobody
uses".

And I think VM_DROPPABLE matches (a), and would be fine if it had some
other non-made-up use (although honestly, we should solve the 32-bit
problem first - ignoring it isn't fine for anything that is supposed
to be widely useful).

We *have* talked about features kind of like it before, for people
doing basically caches in user space that they can re-create on demand
and are ok with just going away under memory pressure.

But those people almost invariably want dropped pages to cause a
SIGSEGV or SIGBUS, not to come back as zeroes.

So you were insulting when you said kernel people don't care about
security issues.  And I'm just telling you that's not true, but it
*is* 100% true that kernel people are often really fed up with
security people who have their blinders on, focus on some small thing,
and think nothing else ever matters.

So yes, the way to get something like VM_DROPPABLE accepted is to
remove the blinders, and have it be something more widely useful, and
not be a "for made up bad code".

Side note: making the 32-bit issue go away is likely trivial. We can
make 'vm_flags' be 64-bit, and a patch for that has been floating
around for over a decade:

   https://lore.kernel.org/all/20110412151116.B50D.A69D9226@jp.fujitsu.com/

but there was enough push-back on that patch that I didn't want to
take it, and some of the arguments for it were not that convincing (at
the time).

But see commit ca16d140af91 ("mm: don't access vm_flags as 'int'"),
which happened as a result, and which I (obviously very naively)
believed would be a way to get the conversion to happen in a more
controlled manner. Sadly, it never actually took off, and we have very
few "vm_flags_t" users in the kernel, and a lot of "unsigned long
flags". We even started out with a "__nocast" annotation to try to
make sparse trigger on people who didn't use vm_flags_t properly. That
was removed due to it just never happening.

But converting things to vm_flags_t with a coccinelle script
(hand-wave: look for variables of of "unsigned long" that use the
VM_xyz constants), and then just making vm_flags_t be a "u64" instead
sounds like a way forward.

But again: this is all about new flags like VM_DROPPABLE not being
some corner-case that nobody is expected to use other than some
special code that is relegated to 64-bit only because it is *so*
special.

                 Linus
Jason A. Donenfeld Jan. 6, 2023, 2:14 a.m. UTC | #19
On Thu, Jan 05, 2023 at 05:02:05PM -0800, Linus Torvalds wrote:
> On Thu, Jan 5, 2023 at 2:57 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >
> > On Thu, Jan 05, 2023 at 10:57:48PM +0100, Yann Droneaud wrote:
> > >
> > > To hold secret material, we need MADV_WIPEONFORK | MADV_DONTDUMP and the side effect of mlock() (pages' content never written to swap), inherited across fork().
> > > And I want mlock() without paying the price.
> > >
> > > Jason's proposed semantics, which I call MADV_WIPEONSWAP, provide a mean to hold /unlimited/ amount secrets in userspace memory (not limited by RLIMIT_MEMLOCK).
> > > The only constraint for userspace is to handle the case pages are wiped, which is already the case of userspace arc4random()'s implementation.
> >
> > If you're actually serious about wanting a generic mechanism for
> > userspace, I think the moral of yesterday's poo-poo'ing all over this
> > cool new idea is that the Linux innercircle doesn't really care for
> > "security things" as a motivator
> 
> No.
> 
> We don't take stupid statements as a motivator.
> 
> Stop with the histrionics and silly security theater BS.
> 
> There is *nop* security in "MADV_WIPEONFORK". You claiming that that
> is "security" is just making you less believable and me ignoring your
> arguments more.
> 
> It's a complete make-believe fairy tale.
> 
> Why would it be "security" to dump random state data? In most
> situations it's a complete non-issue, and nobody cares.
> 
> And those situations that want to be extra careful, and are actually
> generating keys, those situations can do all of this very carefully on
> their own using existing machinery.
> 
> If you don't want a core-dump because you have sensitive information,
> you do "ulimit -c 0". Or you use MADV_DONTDUMP that we've had forever.
> 
> And you don't want to have wipe-on-fork, because
> 
>  (a) if you want things to be wiped on fork, you just wipe it before
> the fork (duh!)
> 
>  (b) but more likely, and more relevantly, you want to make *DAMN
> SURE* you wiped things much earlier than that if you are really
> security-conscious and just generated a secret key, because you don't
> want to leak things accidentally other ways.
> 
>  (c) and you can use MADV_DONTFORK to not copy it at all, which again
> we've had for a long time.

I never made any of these arguments. This just reads like angry-strawman
to me.

In fact there's actually an intelligent argument against using
VM_DONTDUMP, which is that a core dump will only ever contain state that
hasn't yet been used by anything, since it'll either be a random value
that hasn't yet been returned to the caller, or the overwritten/ratcheted
state that can't be wound backwards to recover the previous state. So it
doesn't really matter that much if it gets written into a core dump.

But in fact, I didn't make any argument one way or another for it.

> And if you don't want to have it written to swap, you're just making
> sh*t up at that point.

I'm really not making stuff up. I've reconstructed and recovered keys
and secrets from pilfered swap partitions before on more than one
occasion. The whole idea is to prevent a state that's going to be used
to generate secrets in the future from being written to a place that
can't be easily and deterministically wiped after use.

I get that you're pretty convinced this is all bologna, and that
"forward secrecy" is really just for security LARPers. But there are
plenty of people whose threat models disagree, including my own
experience reconstructing keys in swap from pulled drives.

> anybody who can read swap space can already ptrace you and
> read things much more easily that way.

As you probably inferred from the above, this isn't a ptrace question.
If you have code execution on the same machine before random bytes are
generated, all bets are off and none of this matters anyway. (And more
generally, if you have code exec prior to use, you can probably not even
care about cryptographic stuff, since in many cases you can just access
whatever sensitive data you were after in the first place.)

> Thirdly, you can just use mlock, and make sure you never have so much
> super-sikret stuff pending for long times and in big buffers.

Yea this is what I initially tried. Then people (Jann and Florian, IIRC)
pointed out all sorts of thorny issues with inheriting that on fork()
and segfaulting if non-reserved and such. So eventually I came up with
this monster as a way to avoid the pitfalls -- as just one particular
technical solution.

It turns out that twiddling with mm/ internals isn't your idea of a fun
time, so I'll keep futzing with things and see if I can come up with a
less intrusive technical solution. IOW, I'm not really wedded at all to
this VM_DROPPABLE thing, and if I find another way to do it, great.

> Fourth, if your keys are *that* sensitive, and *that* secret, just use
> /dev/random or getrandom(), because you're not generating that kind of
> volume of long-term keys, so the whole "I have a huge random buffer
> that is super-secret" is a complete non-issue.

I don't think that's really so accurate, actually. Firstly, this whole
swap discussion is about forward secrecy, so we're most likely talking
about ephemeral keys, not long-term keys. Secondly, many protocols that
have ephemeral keys generate them in every handshake, and that might
mean volume. Thirdly, I'm not sure I'd give greater importance to the
secrecy of long-term keys over ephemeral keys, as your point seems to
do; if you're leaking ephemeral keys for some reason, that's pretty bad.

> You guys have literally MADE UP bad examples of so-called "security",
> and then you use those as arguments for bad coding, and for
> bad-mouthing kernel developers who just don't happen to believe in
> that bad model.

I think you might have made up a bunch of arguments I never made. And if
that "you guys" includes me in it, I also object to being summarily
grouped in with any other folks here.

Most importantly, though, I'm not bad-mouthing anyone. Your whole email
strikes me as an indication that my characterization of the matter is a
pretty accurate one.

That characterization from my last email could be summarized as: you
think the model is bogus, that these sorts of security arguments aren't
typically compelling, and so if this VM_DROPPABLE business is to have
any future, it'll need to be motivated by something non-security related
(if there even is one; I made a few guesses as to other applications,
but they're at best guesses, as I obviously have no credibility or
experience in, e.g., claiming what a database program actually might
need). You even said so yourself a few days ago:

| Now, if the magic buffers were something cool that were a generic
| concept that a lot of *other* cases would also kill for, that's one
| thing.

So I would describe my perspective as basically a realistic and
practical one. I don't really mind that you're highly skeptical of
security things and security people, as you have been for as long as I
can remember. I'm not a diehard purist or something. And it's not
bad-mouthing for me to point out the obvious there. Rather, I'll keep
poking around and seeing if I can come up with better technical
solutions to what I think are worthy goals, and you and others will swat
away things that are distasteful (in this case, mm/ tickling), and
perhaps eventually something will emerge. I don't think there's really
anything wrong with that.

> None of what you ask for is for any kind of real security, it's all
> just crazy "but I want to feel the warm and fuzzies and take shortcuts
> elsewhere, and push my pain onto other people".

No. As discussed, it's simply not toward a security model you care for.
But that's fine, you don't need to care about it; I will do that part.

It's also certainly not at all about wanting to feel warm and fuzzies or
take shortcuts or push pain elsewhere. If anything, this pushes the pain
onto me, since I'm the one implementing and maintaining this! I tried to
make mlock() work vanilla for as long as I could before I couldn't
figure out how to do it, groaned, and then tried to implement something
else. I am generally pretty reluctant to code any thing at all.

Jason
Jason A. Donenfeld Jan. 6, 2023, 2:42 a.m. UTC | #20
On Thu, Jan 05, 2023 at 06:08:28PM -0800, Linus Torvalds wrote:
> (although honestly, we should solve the 32-bit
> problem first - ignoring it isn't fine for anything that is supposed
> to be widely useful).

Yea, the 64-bit aspect of that patch is pretty gnarly. I would hope that
if converting the field to be 64-bit everywhere, the compiler would
still generate good code for most cases, when testing against
common-case constant bit masks that are half unset, but I didn't look
into it. Indeed if this is to become something real, that'd be a
worthwhile pre-requisite project.

> And I think VM_DROPPABLE matches (a), and would be fine if it had some
> other non-made-up use (although honestly, we should solve the 32-bit
> problem first - ignoring it isn't fine for anything that is supposed
> to be widely useful).
> 
> We *have* talked about features kind of like it before, for people
> doing basically caches in user space that they can re-create on demand
> and are ok with just going away under memory pressure.
> 
> But those people almost invariably want dropped pages to cause a
> SIGSEGV or SIGBUS, not to come back as zeroes.

Yea it seems intuitively like that would be a useful thing. Rather than
trying to heuristically monitor memory pressure from inside
uncoordinated userspace apps, just have the kernel nuke pages when it
makes sense to do.

The thing is -- and this is what I was trying to express to Yann --
while I have some theoretical notion of how it /might/ be useful, that's
mainly just a guess from me. But one could run around discussing the
idea with actual potential users of it, and see if anyone's eyes light
up. And then once, for example, the Chrome renderer team is clamoring
for it, then hey, there'd be a good use that wouldn't be bound up with
security models not everybody cares about.

> So you were insulting when you said kernel people don't care about
> security issues.  And I'm just telling you that's not true, but it

Maybe you read that without looking at the end of the sentence which
read: ", assuming that the concerns are unreal or niche or paranoid or
whatever." Because the email you sent in response pretty much described
the concerns as either unreal or niche. So I didn't mean to be
insulting. I actually think we agree with each other on the nature of
your position.

> *is* 100% true that kernel people are often really fed up with
> security people who have their blinders on, focus on some small thing,
> and think nothing else ever matters.
> 
> So yes, the way to get something like VM_DROPPABLE accepted is to
> remove the blinders, and have it be something more widely useful, and
> not be a "for made up bad code".

I get this part. Either the implementation is trivial/convenient/
unintrusive/unnoticeable, or it really needs to be motivated by
something you find compelling and preferably has wide application. No
objection from me there.

If you ignore the instruction decoder part, though -- which I mentioned
to Ingo could be nixed anyway -- my initial estimation of this patch
here was that it wasn't /that/ bad, being +32/-3, and I didn't buy the
arguments that it'd be rarely used code paths and that nobody OOMs and
nobody swaps, because it doesn't match my own experience of seeing
OOMing regularly and the fact that if this is in glibc, it'll wind up
being used. So the initial arguments against it were not compelling.

But then you and others elaborated that VM_* flags really ought to be
for general things, and that as a matter of taste and cleanliness,
sticking things into the *core* of mm/ is just really a sensitive thing
not to be done lightly and without a really compelling reason. And so I
also get that.

So, I'm playing around with other technical solutions to the mlock()
limitation thing that started all of this (amidst other things in my
end-of-the-year backlog), and I'll post results if I can get something
working.

Jason
Andy Lutomirski Jan. 6, 2023, 8:53 p.m. UTC | #21
On Thu, Jan 5, 2023, at 6:08 PM, Linus Torvalds wrote:
> On Thu, Jan 5, 2023 at 5:02 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> None of what you ask for is for any kind of real security, it's all
>> just crazy "but I want to feel the warm and fuzzies and take shortcuts
>> elsewhere, and push my pain onto other people".
>
> Actually, let me maybe soften that a bit and say that it's
> "convenience features". It might make some things more _convenient_ to
> do, exactly because it might allow other parts to do short-cuts.
>
> But because it's a convenience-feature, it had also better either be
> (a) really easy and clear to do in the kernel and (b) have
> sufficiently *wide* convenience so that it doesn't end up being one of
> those "corner case things we have to maintain forever and nobody
> uses".
>
> And I think VM_DROPPABLE matches (a), and would be fine if it had some
> other non-made-up use (although honestly, we should solve the 32-bit
> problem first - ignoring it isn't fine for anything that is supposed
> to be widely useful).
>
> We *have* talked about features kind of like it before, for people
> doing basically caches in user space that they can re-create on demand
> and are ok with just going away under memory pressure.
>
> But those people almost invariably want dropped pages to cause a
> SIGSEGV or SIGBUS, not to come back as zeroes.
>
> So you were insulting when you said kernel people don't care about
> security issues.  And I'm just telling you that's not true, but it
> *is* 100% true that kernel people are often really fed up with
> security people who have their blinders on, focus on some small thing,
> and think nothing else ever matters.
>
> So yes, the way to get something like VM_DROPPABLE accepted is to
> remove the blinders, and have it be something more widely useful, and
> not be a "for made up bad code".

I'm going to suggest a very very different approach: fix secret storage in memory for real.  That is, don't lock "super secret sensitive stuff" into memory, and don't wipe it either.  *Encrypt* it.

This boils down to implementing proper encrypted swap support, which is not conceptually that difficult.  The kernel already has identifiers (mapping, offset, etc) for every page in swap and already stores some metadata.  Using that as part of a cryptographic operation seems conceptually fairly straightforward.

And a nice implementation of this could be on by default, and the kernel could even tell userspace that it's on, and then userspace could just stop worrying about this particular issue.
Linus Torvalds Jan. 6, 2023, 9:10 p.m. UTC | #22
On Fri, Jan 6, 2023 at 12:54 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> I'm going to suggest a very very different approach: fix secret
> storage in memory for real. That is, don't lock "super secret
> sensitive stuff" into memory, and don't wipe it either. *Encrypt* it.

I don't think you're wrong, but people will complain about key
management, and worry about that part instead.

Honestly, this is what SGX and CPU enclaves is _supposed_ to all do
for you, but then nobody uses it for various reasons.

               Linus
Jason A. Donenfeld Jan. 6, 2023, 9:36 p.m. UTC | #23
On Fri, Jan 06, 2023 at 12:53:41PM -0800, Andy Lutomirski wrote:
> 
> 
> On Thu, Jan 5, 2023, at 6:08 PM, Linus Torvalds wrote:
> > On Thu, Jan 5, 2023 at 5:02 PM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> >>
> >> None of what you ask for is for any kind of real security, it's all
> >> just crazy "but I want to feel the warm and fuzzies and take shortcuts
> >> elsewhere, and push my pain onto other people".
> >
> > Actually, let me maybe soften that a bit and say that it's
> > "convenience features". It might make some things more _convenient_ to
> > do, exactly because it might allow other parts to do short-cuts.
> >
> > But because it's a convenience-feature, it had also better either be
> > (a) really easy and clear to do in the kernel and (b) have
> > sufficiently *wide* convenience so that it doesn't end up being one of
> > those "corner case things we have to maintain forever and nobody
> > uses".
> >
> > And I think VM_DROPPABLE matches (a), and would be fine if it had some
> > other non-made-up use (although honestly, we should solve the 32-bit
> > problem first - ignoring it isn't fine for anything that is supposed
> > to be widely useful).
> >
> > We *have* talked about features kind of like it before, for people
> > doing basically caches in user space that they can re-create on demand
> > and are ok with just going away under memory pressure.
> >
> > But those people almost invariably want dropped pages to cause a
> > SIGSEGV or SIGBUS, not to come back as zeroes.
> >
> > So you were insulting when you said kernel people don't care about
> > security issues.  And I'm just telling you that's not true, but it
> > *is* 100% true that kernel people are often really fed up with
> > security people who have their blinders on, focus on some small thing,
> > and think nothing else ever matters.
> >
> > So yes, the way to get something like VM_DROPPABLE accepted is to
> > remove the blinders, and have it be something more widely useful, and
> > not be a "for made up bad code".
> 
> I'm going to suggest a very very different approach: fix secret storage in memory for real.  That is, don't lock "super secret sensitive stuff" into memory, and don't wipe it either.  *Encrypt* it.
> 
> This boils down to implementing proper encrypted swap support, which is not conceptually that difficult.  The kernel already has identifiers (mapping, offset, etc) for every page in swap and already stores some metadata.  Using that as part of a cryptographic operation seems conceptually fairly straightforward.

Not sure this solves the right problem, which is primarily related to
forward secrecy, which means the important property is timely secret
erasure. Writing things out to disk complicates that, and encrypted swap
means moving the problem into key lifetime, which sounds like a can of
worms to synchronize. So this doesn't sound so appealing to me as a
solution.

It does sound like a potentially nice thing for other uses, though.

Jason
Matthew Wilcox Jan. 6, 2023, 9:42 p.m. UTC | #24
On Thu, Jan 05, 2023 at 06:08:28PM -0800, Linus Torvalds wrote:
> Side note: making the 32-bit issue go away is likely trivial. We can
> make 'vm_flags' be 64-bit, and a patch for that has been floating
> around for over a decade:
> 
>    https://lore.kernel.org/all/20110412151116.B50D.A69D9226@jp.fujitsu.com/
> 
> but there was enough push-back on that patch that I didn't want to
> take it, and some of the arguments for it were not that convincing (at
> the time).
> 
> But see commit ca16d140af91 ("mm: don't access vm_flags as 'int'"),
> which happened as a result, and which I (obviously very naively)
> believed would be a way to get the conversion to happen in a more
> controlled manner. Sadly, it never actually took off, and we have very
> few "vm_flags_t" users in the kernel, and a lot of "unsigned long
> flags". We even started out with a "__nocast" annotation to try to
> make sparse trigger on people who didn't use vm_flags_t properly. That
> was removed due to it just never happening.
> 
> But converting things to vm_flags_t with a coccinelle script
> (hand-wave: look for variables of of "unsigned long" that use the
> VM_xyz constants), and then just making vm_flags_t be a "u64" instead
> sounds like a way forward.

I'd be more inclined to do:

typedef unsigned int vm_flags_t[2];

and deal with all the fallout.  That'll find all the problems (although
leave us vulnerable to people forgetting which half of the flags they
want to be looking at).

Hm.  We never actually converted vma->vm_flags to be vm_flags_t.  Only
vm_region, which aiui is only used on nommu.
Linus Torvalds Jan. 6, 2023, 10:06 p.m. UTC | #25
On Fri, Jan 6, 2023 at 1:42 PM Matthew Wilcox <willy@infradead.org> wrote:
>
>
> I'd be more inclined to do:
>
> typedef unsigned int vm_flags_t[2];

No, that's entirely invalid.

Never *ever* use arrays in C for type safety. Arrays are not type
safe. They can't be assigned sanely, and they silently become pointers
(which also aren't type-safe, since they end up converting silently to
'void *').

If you want to use the type system to enforce things, and you don't
want to rely on sparse, you absolutely have to use a struct (or union)
type.

So something like

   typedef struct { u64 val; } vm_flags_t;

would be an option.

              Linus
Florian Weimer Jan. 9, 2023, 10:34 a.m. UTC | #26
* Linus Torvalds:

> On Tue, Jan 3, 2023 at 11:35 AM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>>
>> I don't think this is about micro-optimization. Rather, userspace RNGs
>> aren't really possible in a safe way at the moment.
>
> "Bah, humbug", to quote a modern-time philosopher.
>
> It's humbug simply because it makes two assumptions that aren't even valid:
>
>  (a) that you have to do it in user space in the first place
>
>  (b) that you care about the particular semantics that you are looking for
>
> The thing is, you can just do getrandom(). It's what people already
> do. Problem solved.

We are currently doing this in glibc for our arc4random implementation,
after Jason opposed userspace buffering.  If chrony is recompiled
against the glibc version of arc4random (instead of its OpenBSD compat
version, which uses userspace buffering), the result is a 25% drop in
NTP packet response rate:

| The new arc4random using getrandom() seems to have a significant
| impact on performance of chronyd operating as an NTP server. On an
| Intel E3-1220 CPU, I see that the maximum number of requests per
| second dropped by about 25%. That would be an issue for some public
| NTP servers.

arc4random is too slow
<https://sourceware.org/bugzilla/show_bug.cgi?id=29437>

This is *not* “arc4random is 25% slower”, it is the measured overall
impact on server performance.

Historically, the solution space for getrandom and arc4random are
slightly different.  The backronym is “A Replacement Call For random”,
i.e., you should be able to use arc4random without worrying about
performance.  I don't expect cryptographic libraries to turn to
arc4random to implement their random number generators, and that
programmers that use low-level OpenSSL primitives (for example) keep
calling RAND_bytes instead of arc4random because it is available to
them.

We did these changes on the glibc side because Jason sounded very
confident that he's able to deliver vDSO acceleration for getrandom.  If
that fails to materialize, we'll just have to add back userspace
buffering in glibc.  At least we can get process fork protection via
MADV_WIPEONFORK, solving a real problem with the usual arc4random compat
implementation.  (The OpenBSD mechanism for this is slightly different.)
We won't get VM fork protection or forward secrecy against ptrace.  But
the latter is rather speculative anyway because if you can do ptrace
once, you can likely do ptrace twice, the first time patching the
process to remove forward secrecy.  There is a real gap for VM forks,
but I'm not sure how much that matters in practice.  Live migration has
to be implemented in such a way that this isn't observable (otherwise
TCP connections etc. would break), and long-term keys probably shouldn't
be generated under virtualization anyway.

Thanks,
Florian
Linus Torvalds Jan. 9, 2023, 2:28 p.m. UTC | #27
On Mon, Jan 9, 2023 at 4:34 AM Florian Weimer <fweimer@redhat.com> wrote:
>
> We did these changes on the glibc side because Jason sounded very
> confident that he's able to deliver vDSO acceleration for getrandom.  If
> that fails to materialize, we'll just have to add back userspace
> buffering in glibc.

My whole argument has been that user-space buffering is the sane thing
to do. Most definitely for something like glibc.

The number of people who go "oh, no, my buffer or randomness could be
exposed by insert-odd-situation-here" is approximately zero, and then
the onus should be on *them* to do something special.

Because *they* are special. Precious little snowflake special.

             Linus
Dr. Greg Jan. 10, 2023, 11:01 a.m. UTC | #28
On Fri, Jan 06, 2023 at 01:10:44PM -0800, Linus Torvalds wrote:

Good morning, I hope the week is going well for everyone.

> On Fri, Jan 6, 2023 at 12:54 PM Andy Lutomirski <luto@kernel.org> wrote:
> >
> > I'm going to suggest a very very different approach: fix secret
> > storage in memory for real. That is, don't lock "super secret
> > sensitive stuff" into memory, and don't wipe it either. *Encrypt* it.
> 
> I don't think you're wrong, but people will complain about key
> management, and worry about that part instead.
> 
> Honestly, this is what SGX and CPU enclaves is _supposed_ to all do
> for you, but then nobody uses it for various reasons.

The principal problem is that enclave technology was not made either
ubiquitous or accessible, long story there, suitable for multiple
snifters of single malt.

Unfortunately, the same goes for just about every other hardware
security technology.  Every conversation comes down to; "what is the
business case for the technology", which translated means, how much
money are we going to make off it.

Encrypting memory based secrets, as an alternative to wiping them, is
attractive, but hardware support is needed to do key management
securely and correctly.  Even than, by definition, there will be a
window when the material needs to be in memory as plaintext.

A discussion can be had in this arena about perfection being the enemy
of good.  If you are truely interested in perfection in this endeavor,
you need to have a trusted platform definition and implementation.

Which, if history is any indication, needs to be an open architecture
with respect to both software and hardware.

>                Linus

Best wishes to everyone for a productive remainder of the week.

As always,
Dr. Greg

The Quixote Project - Flailing at the Travails of Cybersecurity
Eric Biggers Jan. 11, 2023, 7:27 a.m. UTC | #29
On Mon, Jan 09, 2023 at 08:28:58AM -0600, Linus Torvalds wrote:
> On Mon, Jan 9, 2023 at 4:34 AM Florian Weimer <fweimer@redhat.com> wrote:
> >
> > We did these changes on the glibc side because Jason sounded very
> > confident that he's able to deliver vDSO acceleration for getrandom.  If
> > that fails to materialize, we'll just have to add back userspace
> > buffering in glibc.
> 
> My whole argument has been that user-space buffering is the sane thing
> to do. Most definitely for something like glibc.
> 
> The number of people who go "oh, no, my buffer or randomness could be
> exposed by insert-odd-situation-here" is approximately zero, and then
> the onus should be on *them* to do something special.
> 
> Because *they* are special. Precious little snowflake special.
> 
>              Linus

How would userspace decide when to reseed its CRNGs, then?

IMO, the main benefit of the VDSO getrandom over a traditional userspace CRNG is
that it makes reseeds of the kernel's CRNG take effect immediately.  See the
cover letter, where Jason explains this.

It's definitely important to make the memory used by userspace CRNGs have
appropriate semantics, but my understanding is that's not the main point.

- Eric
Linus Torvalds Jan. 11, 2023, 12:07 p.m. UTC | #30
On Wed, Jan 11, 2023 at 1:27 AM Eric Biggers <ebiggers@kernel.org> wrote:
>
> How would userspace decide when to reseed its CRNGs, then?

.. and that is relevant to all the VM discussions exactly why?

                Linus
diff mbox series

Patch

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index e35a0398db63..47c7c046f2be 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -711,6 +711,9 @@  static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR
 		[ilog2(VM_UFFD_MINOR)]	= "ui",
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */
+#ifdef CONFIG_NEED_VM_DROPPABLE
+		[ilog2(VM_DROPPABLE)]	= "dp",
+#endif
 	};
 	size_t i;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3f196e4d66d..fba3f1e8616b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -315,11 +315,13 @@  extern unsigned int kobjsize(const void *objp);
 #define VM_HIGH_ARCH_BIT_2	34	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_3	35	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_BIT_4	36	/* bit only usable on 64-bit architectures */
+#define VM_HIGH_ARCH_BIT_5	37	/* bit only usable on 64-bit architectures */
 #define VM_HIGH_ARCH_0	BIT(VM_HIGH_ARCH_BIT_0)
 #define VM_HIGH_ARCH_1	BIT(VM_HIGH_ARCH_BIT_1)
 #define VM_HIGH_ARCH_2	BIT(VM_HIGH_ARCH_BIT_2)
 #define VM_HIGH_ARCH_3	BIT(VM_HIGH_ARCH_BIT_3)
 #define VM_HIGH_ARCH_4	BIT(VM_HIGH_ARCH_BIT_4)
+#define VM_HIGH_ARCH_5	BIT(VM_HIGH_ARCH_BIT_5)
 #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -335,6 +337,12 @@  extern unsigned int kobjsize(const void *objp);
 #endif
 #endif /* CONFIG_ARCH_HAS_PKEYS */
 
+#ifdef CONFIG_NEED_VM_DROPPABLE
+# define VM_DROPPABLE VM_HIGH_ARCH_5
+#else
+# define VM_DROPPABLE 0
+#endif
+
 #if defined(CONFIG_X86)
 # define VM_PAT		VM_ARCH_1	/* PAT reserves whole VMA at once (x86) */
 #elif defined(CONFIG_PPC)
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 412b5a46374c..82b2fb811d06 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -163,6 +163,12 @@  IF_HAVE_PG_SKIP_KASAN_POISON(PG_skip_kasan_poison, "skip_kasan_poison")
 # define IF_HAVE_UFFD_MINOR(flag, name)
 #endif
 
+#ifdef CONFIG_NEED_VM_DROPPABLE
+# define IF_HAVE_VM_DROPPABLE(flag, name) {flag, name},
+#else
+# define IF_HAVE_VM_DROPPABLE(flag, name)
+#endif
+
 #define __def_vmaflag_names						\
 	{VM_READ,			"read"		},		\
 	{VM_WRITE,			"write"		},		\
@@ -195,6 +201,7 @@  IF_HAVE_VM_SOFTDIRTY(VM_SOFTDIRTY,	"softdirty"	)		\
 	{VM_MIXEDMAP,			"mixedmap"	},		\
 	{VM_HUGEPAGE,			"hugepage"	},		\
 	{VM_NOHUGEPAGE,			"nohugepage"	},		\
+IF_HAVE_VM_DROPPABLE(VM_DROPPABLE,	"droppable"	)		\
 	{VM_MERGEABLE,			"mergeable"	}		\
 
 #define show_vma_flags(flags)						\
diff --git a/mm/Kconfig b/mm/Kconfig
index ff7b209dec05..91fd0be96ca4 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1030,6 +1030,9 @@  config ARCH_USES_HIGH_VMA_FLAGS
 	bool
 config ARCH_HAS_PKEYS
 	bool
+config NEED_VM_DROPPABLE
+	select ARCH_USES_HIGH_VMA_FLAGS
+	bool
 
 config ARCH_USES_PG_ARCH_X
 	bool
diff --git a/mm/memory.c b/mm/memory.c
index aad226daf41b..1ade407ccbf9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5220,6 +5220,10 @@  vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 
 	lru_gen_exit_fault();
 
+	/* If the mapping is droppable, then errors due to OOM aren't fatal. */
+	if (vma->vm_flags & VM_DROPPABLE)
+		ret &= ~VM_FAULT_OOM;
+
 	if (flags & FAULT_FLAG_USER) {
 		mem_cgroup_exit_user_fault();
 		/*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 02c8a712282f..ebf2e3694a0a 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2173,6 +2173,9 @@  struct folio *vma_alloc_folio(gfp_t gfp, int order, struct vm_area_struct *vma,
 	int preferred_nid;
 	nodemask_t *nmask;
 
+	if (vma->vm_flags & VM_DROPPABLE)
+		gfp |= __GFP_NOWARN | __GFP_NORETRY;
+
 	pol = get_vma_policy(vma, addr);
 
 	if (pol->mode == MPOL_INTERLEAVE) {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 908df12caa26..a679cc5d1c75 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -593,7 +593,7 @@  mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
 				may_expand_vm(mm, oldflags, nrpages))
 			return -ENOMEM;
 		if (!(oldflags & (VM_ACCOUNT|VM_WRITE|VM_HUGETLB|
-						VM_SHARED|VM_NORESERVE))) {
+				  VM_SHARED|VM_NORESERVE|VM_DROPPABLE))) {
 			charged = nrpages;
 			if (security_vm_enough_memory_mm(mm, charged))
 				return -ENOMEM;
diff --git a/mm/rmap.c b/mm/rmap.c
index b616870a09be..5ed46e59dfcd 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1294,7 +1294,8 @@  void page_add_new_anon_rmap(struct page *page,
 	int nr;
 
 	VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
-	__SetPageSwapBacked(page);
+	if (!(vma->vm_flags & VM_DROPPABLE))
+		__SetPageSwapBacked(page);
 
 	if (likely(!PageCompound(page))) {
 		/* increment count (starts at -1) */
@@ -1683,7 +1684,7 @@  static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 				 * plus the rmap(s) (dropped by discard:).
 				 */
 				if (ref_count == 1 + map_count &&
-				    !folio_test_dirty(folio)) {
+				    (!folio_test_dirty(folio) || (vma->vm_flags & VM_DROPPABLE))) {
 					/* Invalidate as we cleared the pte */
 					mmu_notifier_invalidate_range(mm,
 						address, address + PAGE_SIZE);