
[2/2] random: add fork_event sysctl for polling VM forks

Message ID 20220502140602.130373-2-Jason@zx2c4.com
State Not Applicable
Delegated to: Herbert Xu
Series [1/2] sysctl: read() must consume poll events, not poll()

Commit Message

Jason A. Donenfeld May 2, 2022, 2:06 p.m. UTC
In order to inform userspace of virtual machine forks, this commit adds
a "fork_event" sysctl, which does not return any data, but allows
userspace processes to poll() on it for notification of VM forks.

It avoids exposing the actual vmgenid from the hypervisor to userspace,
in case there is any value, from a randomness perspective, in keeping it
secret. Rather, userspace is expected to simply use getrandom() if it
wants a fresh value.

For example, the following snippet can be used to print a message every
time a VM forks, after the RNG has been reseeded:

  struct pollfd fd = { .fd = open("/proc/sys/kernel/random/fork_event", O_RDONLY) };
  assert(fd.fd >= 0);
  for (;;) {
    read(fd.fd, NULL, 0);            /* consume pending event (patch 1/2) */
    assert(poll(&fd, 1, -1) > 0);    /* block until the next VM fork */
    puts("vm fork detected");
  }

Various programs and libraries that utilize cryptographic operations
depending on fresh randomness can invalidate old keys or take other
appropriate actions when receiving that event. While this is racier than
allowing userspace to mmap/vDSO the vmgenid itself, it's an incremental
step forward that's not as heavyweight.

Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Alexander Graf <graf@amazon.com>
Cc: Colm MacCarthaigh <colmmacc@amazon.com>
Cc: Torben Hansen <htorben@amazon.co.uk>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
 Documentation/admin-guide/sysctl/kernel.rst |  6 ++++--
 drivers/char/random.c                       | 24 +++++++++++++++++++++
 2 files changed, 28 insertions(+), 2 deletions(-)

Comments

Lennart Poettering May 2, 2022, 3:40 p.m. UTC | #1
On Mo, 02.05.22 16:06, Jason A. Donenfeld (Jason@zx2c4.com) wrote:

> In order to inform userspace of virtual machine forks, this commit adds
> a "fork_event" sysctl, which does not return any data, but allows
> userspace processes to poll() on it for notification of VM forks.
>
> It avoids exposing the actual vmgenid from the hypervisor to userspace,
> in case there is any randomness value in keeping it secret. Rather,
> userspace is expected to simply use getrandom() if it wants a fresh
> value.

Wouldn't it make sense to expose a monotonic 64bit counter of detected
VM forks since boot through read()? It might be interesting to know
for userspace how many forks it missed the fork events for. Moreover it
might be interesting to userspace to know if any fork happened so far
*at* *all*, by checking if the counter is non-zero.

(Ideally that counter file would even be mmapable...)

Does this file really belong below the `kernel.random` sysctl
hierarchy? Shouldn't this rather be something like
/sys/kernel/vmclone_event or so?

Most people see sysctl as a place to tweak settings. But this sysctl
you propose here cannot be written, and (so far…) not even read! So it
sounds like a weird thing to place into /proc/sys/.

Note that in userspace there's a whole infrastructure for managing
sysctls, but this one doesn't look like it would be manageable at all
by userspace with these tools.

Lennart

--
Lennart Poettering, Berlin
Jason A. Donenfeld May 2, 2022, 4:12 p.m. UTC | #2
On Mon, May 02, 2022 at 05:40:02PM +0200, Lennart Poettering wrote:
> On Mo, 02.05.22 16:06, Jason A. Donenfeld (Jason@zx2c4.com) wrote:
> 
> > In order to inform userspace of virtual machine forks, this commit adds
> > a "fork_event" sysctl, which does not return any data, but allows
> > userspace processes to poll() on it for notification of VM forks.
> >
> > It avoids exposing the actual vmgenid from the hypervisor to userspace,
> > in case there is any randomness value in keeping it secret. Rather,
> > userspace is expected to simply use getrandom() if it wants a fresh
> > value.
> 
> Wouldn't it make sense to expose a monotonic 64bit counter of detected
> VM forks since boot through read()? It might be interesting to know
> for userspace how many forks it missed the fork events for. Moreover it
> might be interesting to userspace to know if any fork happened so far
> *at* *all*, by checking if the counter is non-zero.

"Might be interesting" is different from "definitely useful". I'm not
going to add this without a clear use case. This feature is pretty
narrowly scoped in its objectives right now, and I intend to keep it
that way if possible. (And yes, I realize that is likely considerably
different from your development philosophy.)

> 
> (Ideally that counter file would even be mmapable...)

You missed the last year of discussion about this and why we have wound
up here as a first step. Check the archives for extensive discussion.

Jason
Lennart Poettering May 2, 2022, 4:51 p.m. UTC | #3
On Mo, 02.05.22 18:12, Jason A. Donenfeld (Jason@zx2c4.com) wrote:

> > > In order to inform userspace of virtual machine forks, this commit adds
> > > a "fork_event" sysctl, which does not return any data, but allows
> > > userspace processes to poll() on it for notification of VM forks.
> > >
> > > It avoids exposing the actual vmgenid from the hypervisor to userspace,
> > > in case there is any randomness value in keeping it secret. Rather,
> > > userspace is expected to simply use getrandom() if it wants a fresh
> > > value.
> >
> > Wouldn't it make sense to expose a monotonic 64bit counter of detected
> > VM forks since boot through read()? It might be interesting to know
> > for userspace how many forks it missed the fork events for. Moreover it
> > might be interesting to userspace to know if any fork happened so far
> > *at* *all*, by checking if the counter is non-zero.
>
> "Might be interesting" is different from "definitely useful". I'm not
> going to add this without a clear use case. This feature is pretty
> narrowly scoped in its objectives right now, and I intend to keep it
> that way if possible.

Sure, whatever. I mean, if you think it's preferable to have 3 API
abstractions for the same concept, each for its special use case, then
that's certainly one way to do things. I personally would try to
figure out a modicum of generalization for things like this. But maybe
that's just me…

I can just tell you, that in systemd we'd have a usecase for consuming
such a generation counter: we try to provide stable MAC addresses for
synthetic network interfaces managed by networkd, so we hash them from
/etc/machine-id, but otoh people also want them to change when they
clone their VMs. We could very nicely solve this if we had a
generation counter easily accessible from userspace, that starts at 0
initially. Because then we can hash as we always did when the counter
is zero, but otherwise use something else, possibly hashed from the
generation counter.

But anyway, I understand you are not interested in
generalization/other usecases, so I'll shut up.

Lennart

--
Lennart Poettering, Berlin
Alexander Graf May 2, 2022, 5:59 p.m. UTC | #4
On 02.05.22 18:51, Lennart Poettering wrote:
> On Mo, 02.05.22 18:12, Jason A. Donenfeld (Jason@zx2c4.com) wrote:
>
>>>> In order to inform userspace of virtual machine forks, this commit adds
>>>> a "fork_event" sysctl, which does not return any data, but allows
>>>> userspace processes to poll() on it for notification of VM forks.
>>>>
>>>> It avoids exposing the actual vmgenid from the hypervisor to userspace,
>>>> in case there is any randomness value in keeping it secret. Rather,
>>>> userspace is expected to simply use getrandom() if it wants a fresh
>>>> value.
>>> Wouldn't it make sense to expose a monotonic 64bit counter of detected
>>> VM forks since boot through read()? It might be interesting to know
>>> for userspace how many forks it missed the fork events for. Moreover it
>>> might be interesting to userspace to know if any fork happened so far
>>> *at* *all*, by checking if the counter is non-zero.
>> "Might be interesting" is different from "definitely useful". I'm not
>> going to add this without a clear use case. This feature is pretty
>> narrowly scoped in its objectives right now, and I intend to keep it
>> that way if possible.
> Sure, whatever. I mean, if you think it's preferable to have 3 API
> abstractions for the same concept each for it's special usecase, then
> that's certainly one way to do things. I personally would try to
> figure out a modicum of generalization for things like this. But maybe
> that' just me…
>
> I can just tell you, that in systemd we'd have a usecase for consuming
> such a generation counter: we try to provide stable MAC addresses for
> synthetic network interfaces managed by networkd, so we hash them from
> /etc/machine-id, but otoh people also want them to change when they
> clone their VMs. We could very nicely solve this if we had a
> generation counter easily accessible from userspace, that starts at 0
> initially. Because then we can hash as we always did when the counter
> is zero, but otherwise use something else, possibly hashed from the
> generation counter.
>
> But anyway, I understand you are not interested in
> generalization/other usecases, so I'll shut up.


Let's not turn this into a pit fight please :). I think it's a good idea 
to collect the use cases we all have and evaluate whether this patch is 
a good stepping stone towards the final solution.

At the end of the road, what I would like to see is

1) A way for libraries such as s2n to identify that a clone occurred. 
Because it's a deep-down library with no access to its own thread or the 
main loop, it cannot rely on poll/select. Mmap of a file, however, would 
work great, as you can create transactions on top of a 64-bit mmap'ed 
value, for example (see the sketch after this list).

We can have systemd generate and then provide such a file as long as we 
have an event based notification mechanism.

2) A way to notify larger applications (think Java here) that a system 
is going to be suspended soon, so they can wipe PII before it gets 
cloned, for example.

3) Notifications after clone so applications know they can regenerate VM 
unique data based on randomness.

For 2 and 3, applications should have native support for these events 
(think of poll() on the fork file) as well as external script based 
support (think "invoke systemctl restart smbd.service on clone").
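
To illustrate the transaction idea in 1), roughly something like this
(untested sketch; the mmap'able 64-bit counter file is hypothetical,
nothing in this patch provides it yet):

  #include <assert.h>
  #include <fcntl.h>
  #include <stdint.h>
  #include <sys/mman.h>
  #include <unistd.h>

  static volatile uint64_t *vm_generation;  /* hypothetical mmap'ed counter */

  static void map_generation(const char *path)
  {
    int fd = open(path, O_RDONLY);
    assert(fd >= 0);
    vm_generation = mmap(NULL, sizeof(*vm_generation), PROT_READ, MAP_SHARED, fd, 0);
    assert(vm_generation != MAP_FAILED);
    close(fd);
  }

  /* Transaction: snapshot the generation, do the sensitive work, and only
     accept the result if the generation did not change in the meantime. */
  static int do_crypto_transaction(void)
  {
    uint64_t before = *vm_generation;
    /* ... use cached key material, produce output ... */
    return *vm_generation == before ? 0 : -1;  /* -1: cloned, redo with fresh keys */
  }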



Lennart, looking at the current sysctl proposal, systemd could poll() on 
the fork file. It would then be able to generate a /run/fork-id file 
which it can use for the flow above, right?

The sysctl proposal also gives us 3, if we implement the inhibitor 
proposal [1] in systemd.

That leaves 2, which we don't have a hardware interface for yet. We can 
still get to at least script level automation with inhibitors and 
manually triggering a "hey, you'll be suspended soon" event via systemd.

Overall, it sounds to me like the sysctl poll based kernel interface in 
this patch in combination with systemd inhibitors gives us an answer to 
most of the flows above.

I can see attractiveness in providing the /run/fork-id directly from the 
kernel though, to remove the dependency on systemd for poll-less 
notification of libraries.

Jason, how much complexity would it add to provide an mmap() and read() 
interface to a fork counter value to the sysctl? Read sounds like a 
trivial change on top of what you have already, mmap a bit more heavy 
lift. If we had both, it would allow us to implement a Linux standard 
fork detect path in libraries that does not rely on systemd.



Alex


[1] https://github.com/systemd/systemd/issues/20222




Jason A. Donenfeld May 2, 2022, 6:04 p.m. UTC | #5
Hey Lennart,

On Mon, May 02, 2022 at 06:51:19PM +0200, Lennart Poettering wrote:
> On Mo, 02.05.22 18:12, Jason A. Donenfeld (Jason@zx2c4.com) wrote:
> 
> > > > In order to inform userspace of virtual machine forks, this commit adds
> > > > a "fork_event" sysctl, which does not return any data, but allows
> > > > userspace processes to poll() on it for notification of VM forks.
> > > >
> > > > It avoids exposing the actual vmgenid from the hypervisor to userspace,
> > > > in case there is any randomness value in keeping it secret. Rather,
> > > > userspace is expected to simply use getrandom() if it wants a fresh
> > > > value.
> > >
> > > Wouldn't it make sense to expose a monotonic 64bit counter of detected
> > > VM forks since boot through read()? It might be interesting to know
> > > for userspace how many forks it missed the fork events for. Moreover it
> > > might be interesting to userspace to know if any fork happened so far
> > > *at* *all*, by checking if the counter is non-zero.
> >
> > "Might be interesting" is different from "definitely useful". I'm not
> > going to add this without a clear use case. This feature is pretty
> > narrowly scoped in its objectives right now, and I intend to keep it
> > that way if possible.
> 
> Sure, whatever. I mean, if you think it's preferable to have 3 API
> abstractions for the same concept each for it's special usecase, then
> that's certainly one way to do things. I personally would try to
> figure out a modicum of generalization for things like this. But maybe
> that' just me…
> 
> I can just tell you, that in systemd we'd have a usecase for consuming
> such a generation counter: we try to provide stable MAC addresses for
> synthetic network interfaces managed by networkd, so we hash them from
> /etc/machine-id, but otoh people also want them to change when they
> clone their VMs. We could very nicely solve this if we had a
> generation counter easily accessible from userspace, that starts at 0
> initially. Because then we can hash as we always did when the counter
> is zero, but otherwise use something else, possibly hashed from the
> generation counter.

This doesn't work, because you could have memory-A split into memory-A.1
and memory-A.2, and both A.2 and A.1 would ++counter, and wind up with
the same new value "2". The solution is to instead have the hypervisor
pass a unique value and a counter. We currently have a 16 byte unique
value from the hypervisor, which I'm keeping as a kernel space secret
for the RNG; we're waiting on a word-sized monotonic counter interface
from hypervisors in the future. When we have the latter, then we can
start talking about mmapable things. Your use case would probably be
served by exposing that 16-byte unique value (hashed with some constant
for safety I suppose), but I'm hesitant to start going down that route
all at once, especially if we're to have a more useful counter in the
future.

Jason
Jason A. Donenfeld May 2, 2022, 6:29 p.m. UTC | #6
Hi Alex,

On Mon, May 02, 2022 at 07:59:08PM +0200, Alexander Graf wrote:
> to collect the use cases we all have and evaluate whether this patch is 
> a good stepping stone towards the final solution.

Indeed, I'm all for collecting use cases. What I meant to say is that
we're not going to add something "just 'cuz"; I'd like to have some
concrete things in mind.

To date, I've basically had your s2n case in mind, but as you haven't
responded to this in the last month, I started looking to see if this
was useful elsewhere or if I should abandon it, so I filed this issue
with the Go project: <https://github.com/golang/go/issues/52544>. We're
over halfway through 5.18 now, and only at this point have you arrived
to discuss and finalize things. So in all likelihood we'll wind up
tabling this until 5.20 or never, since what I thought was an easy
consensus before now apparently is not.

> 1) A way for libraries such as s2n to identify that a clone occurred. 
> Because it's a deep-down library with no access to its own thread or the 
> main loop, it can not rely on poll/select. Mmap of a file however would 
> work great, as you can create transactions on top of a 64bit mmap'ed 
> value for example.

I didn't realize that s2n can't poll. That's surprising. In the worst
case, can't you just spawn a thread?

> 2) A way to notify larger applications (think Java here) that a system 
> is going to be suspended soon so it can wipe PII before it gets cloned 
> for example.

Suspension, like S3 power notification stuff? Talk to Rafael about that;
this isn't related to the VM fork issue. I use those PM notifiers
happily in kernel space but AFAICT, there's still no userspace thing for
it. This seems orthogonal to this conversation though, so let's not veer
off into that topic.

If you didn't mean S3 but actually meant notification prior to snapshot
taking, we don't have any virtual hardware for that, so it's a moot
point.

> 3) Notifications after clone so applications know they can regenerate VM 
> unique data based on randomness.

You mean this as "the same as (1) but with poll() instead of mmap()",
right?

> Lennart, looking at the current sysctl proposal, systemd could poll() on 
> the fork file. It would then be able to generate a /run/fork-id file 
> which it can use for the flow above, right?

For the reasons I gave in my last email to Lennart, I don't think
there's a good way for systemd to generate a fork-id file on its own
either. I don't think systemd should really be involved here as a
provider of values, just as a potential consumer of what the kernel
provides.

> The sysctl proposal also gives us 3, if we implement the inhibitor 
> proposal [1] in systemd.

These userspace components you're proposing seem like a lot of
overengineering for little gain, which is why this conversation went
nowhere when Amazon attempted all this year. But it sounds like you
agree with me based on your remark below about systemd-less interfaces
provided by the kernel.

> Overall, it sounds to me like the sysctl poll based kernel interface in 
> this patch in combination with systemd inhibitors gives us an answer to 
> most of the flows above.
> 
> I can see attractiveness in providing the /run/fork-id directly from the 
> kernel though, to remove the dependency on systemd for poll-less 
> notification of libraries.
>
> Jason, how much complexity would it add to provide an mmap() and read() 
> interface to a fork counter value to the sysctl? Read sounds like a 
> trivial change on top of what you have already, mmap a bit more heavy 
> lift. If we had both, it would allow us to implement a Linux standard 
> fork detect path in libraries that does not rely on systemd.

mmap() does not give us anything if we're not going to expose the raw
ACPI-mapped ID directly. It will still be a racy mechanism until we do
that. So I think we should wait until there's a proper vmgenid
word-sized counter to expose something mmap()able. If you have the
energy to talk to Microsoft about this and make it happen, please be my
guest. As I wrote at the beginning of this email, I haven't gotten a
response from you at all about this stuff in quite some time, so I'm not
really itching to take that on alone now.

Jason
Alexander Graf May 2, 2022, 6:34 p.m. UTC | #7
On 02.05.22 20:04, Jason A. Donenfeld wrote:
> Hey Lennart,
>
> On Mon, May 02, 2022 at 06:51:19PM +0200, Lennart Poettering wrote:
>> On Mo, 02.05.22 18:12, Jason A. Donenfeld (Jason@zx2c4.com) wrote:
>>
>>>>> In order to inform userspace of virtual machine forks, this commit adds
>>>>> a "fork_event" sysctl, which does not return any data, but allows
>>>>> userspace processes to poll() on it for notification of VM forks.
>>>>>
>>>>> It avoids exposing the actual vmgenid from the hypervisor to userspace,
>>>>> in case there is any randomness value in keeping it secret. Rather,
>>>>> userspace is expected to simply use getrandom() if it wants a fresh
>>>>> value.
>>>> Wouldn't it make sense to expose a monotonic 64bit counter of detected
>>>> VM forks since boot through read()? It might be interesting to know
>>>> for userspace how many forks it missed the fork events for. Moreover it
>>>> might be interesting to userspace to know if any fork happened so far
>>>> *at* *all*, by checking if the counter is non-zero.
>>> "Might be interesting" is different from "definitely useful". I'm not
>>> going to add this without a clear use case. This feature is pretty
>>> narrowly scoped in its objectives right now, and I intend to keep it
>>> that way if possible.
>> Sure, whatever. I mean, if you think it's preferable to have 3 API
>> abstractions for the same concept each for it's special usecase, then
>> that's certainly one way to do things. I personally would try to
>> figure out a modicum of generalization for things like this. But maybe
>> that' just me…
>>
>> I can just tell you, that in systemd we'd have a usecase for consuming
>> such a generation counter: we try to provide stable MAC addresses for
>> synthetic network interfaces managed by networkd, so we hash them from
>> /etc/machine-id, but otoh people also want them to change when they
>> clone their VMs. We could very nicely solve this if we had a
>> generation counter easily accessible from userspace, that starts at 0
>> initially. Because then we can hash as we always did when the counter
>> is zero, but otherwise use something else, possibly hashed from the
>> generation counter.
> This doesn't work, because you could have memory-A split into memory-A.1
> and memory-A.2, and both A.2 and A.1 would ++counter, and wind up with
> the same new value "2". The solution is to instead have the hypervisor
> pass a unique value and a counter. We currently have a 16 byte unique
> value from the hypervisor, which I'm keeping as a kernel space secret
> for the RNG; we're waiting on a word-sized monotonic counter interface
> from hypervisors in the future. When we have the latter, then we can
> start talking about mmapable things. Your use case would probably be
> served by exposing that 16-byte unique value (hashed with some constant
> for safety I suppose), but I'm hesitant to start going down that route
> all at once, especially if we're to have a more useful counter in the
> future.


Michael, since we already changed the CID in the spec, can we add a 
property to the device that indicates the first 4 bytes of the UUID will 
always be different between parent and child?

That should give us the ability to mmap the vmgenid directly to user 
space and act based on a simple u32 compare for clone notification, no?


Thanks;

Alex





Jason A. Donenfeld May 2, 2022, 6:44 p.m. UTC | #8
Err...

On Mon, May 02, 2022 at 08:04:21PM +0200, Jason A. Donenfeld wrote:
> This doesn't work, because you could have memory-A split into memory-A.1
> and memory-A.2, and both A.2 and A.1 would ++counter, and wind up with
> the same new value "2". The solution is to instead have the hypervisor
> pass a unique value and a counter. We currently have a 16 byte unique
> value from the hypervisor, which I'm keeping as a kernel space secret
> for the RNG; we're waiting on a word-sized monotonic counter interface
> from hypervisors in the future. When we have the latter, then we can
> start talking about mmapable things. Your use case would probably be
> served by exposing that 16-byte unique value (hashed with some constant
> for safety I suppose), but I'm hesitant to start going down that route
> all at once, especially if we're to have a more useful counter in the
> future.

I kind of muddled things a bit by conflating two issues.

I'd like the hypervisor to provide a counter so that we can mmap it to
userspace so that userspace programs can do word-sized comparisons on
mmap'd counters, avoiding the race that currently exists from relying on
the async ACPI notification, which arrives after the system is already
up and running. That's one thing, but not what we're talking about here
with the MAC addresses.

The point over here is that neither the guest *nor* the hypervisor can
maintain a counter that actually represents something unique. A.1 and
A.2 will both ++counter to the same value in the example above. The
guest can't do it (neither in systemd nor in the kernel), because it
will always start with the same counter value of A and ++ it to the same
next value. The hypervisor can't do it either, because snapshots can be
shipped around to different computers that aren't coordinated.

So, put that way, the counter thing that I'd like wouldn't be for having
a unique snapshot ID, but just as a mmap-able way of learning when a
snapshot forks. It wouldn't be more useful than that.

If you want a unique ID, we have two options for that: the first is
exposing the vmgenid 16 byte value to userspace (which I don't want to
do). The second is just calling getrandom() after you get a poll()
notification, and that'll be guaranteed to be unique to that VM because
of the vmgenid driver in 5.18.

This last suggestion is thus what you should do for your MAC addresses.
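
I.e., roughly (untested sketch):

  #include <assert.h>
  #include <fcntl.h>
  #include <poll.h>
  #include <sys/random.h>
  #include <unistd.h>

  int main(void)
  {
    unsigned char fork_id[16];
    struct pollfd fd = { .fd = open("/proc/sys/kernel/random/fork_event", O_RDONLY) };

    assert(fd.fd >= 0);
    assert(getrandom(fork_id, sizeof(fork_id), 0) == sizeof(fork_id));
    for (;;) {
      read(fd.fd, NULL, 0);          /* consume any pending event */
      assert(poll(&fd, 1, -1) > 0);  /* wait for the next VM fork */
      /* The RNG has already been reseeded by the vmgenid driver by the
         time this fires, so this yields a value unique to this clone. */
      assert(getrandom(fork_id, sizeof(fork_id), 0) == sizeof(fork_id));
      /* derive MAC addresses, session keys, etc. from fork_id here */
    }
  }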

Jason
Jason A. Donenfeld May 2, 2022, 6:46 p.m. UTC | #9
On Mon, May 02, 2022 at 08:34:38PM +0200, Alexander Graf wrote:
> Michael, since we already changed the CID in the spec, can we add a 
> property to the device that indicates the first 4 bytes of the UUID will 
> always be different between parent and child?
> 
> That should give us the ability to mmap the vmgenid directly to user 
> space and act based on a simple u32 compare for clone notification, no?

That is not a good idea. We want an _additional_ 4 bytes, so that we can
keep the first 16 bytes (128 bits) as a kernel space secret.

Jason
Alexander Graf May 2, 2022, 6:56 p.m. UTC | #10
On 02.05.22 20:46, Jason A. Donenfeld wrote:
> On Mon, May 02, 2022 at 08:34:38PM +0200, Alexander Graf wrote:
>> Michael, since we already changed the CID in the spec, can we add a
>> property to the device that indicates the first 4 bytes of the UUID will
>> always be different between parent and child?
>>
>> That should give us the ability to mmap the vmgenid directly to user
>> space and act based on a simple u32 compare for clone notification, no?
> That is not a good idea. We want an _additional_ 4 bytes, so that we can
> keep the first 16 bytes (128 bits) as a kernel space secret.


An additional 4 bytes would be an additional 4kb (or 64kb on ARM) page. 
Do we really rely on these 16 bytes to reseed after clone? If so, we'd 
need to bite the bullet and provide an additional page, yes.


Alex





Alexander Graf May 2, 2022, 6:57 p.m. UTC | #11
Hey Jason,

On 02.05.22 20:29, Jason A. Donenfeld wrote:
> Hi Alex,
>
> On Mon, May 02, 2022 at 07:59:08PM +0200, Alexander Graf wrote:
>> to collect the use cases we all have and evaluate whether this patch is
>> a good stepping stone towards the final solution.
> Indeed, I'm all for collecting use cases. What I meant to say is that
> we're not going to add something "just 'cuz"; I'd like to have some
> concrete things in mind.
>
> To date, I've basically had your s2n case in mind, but as you haven't
> responded to this in the last month, I started looking to see if this


Unfortunate vacation timing on my side I suppose :)


> was useful elsewhere or if I should abandon it, so I filed this issue
> with the Go project: <https://github.com/golang/go/issues/52544>. We're
> over halfway through 5.18 now, and only at this point have you arrived
> to discuss and finalize things. So in all likelihood we'll wind up
> tabling this until 5.20 or never, since what I thought was an easy
> consensus before now apparently is not.


So far I see little that would block your patch? It seems to go into a 
good direction from all I can tell.


>> 1) A way for libraries such as s2n to identify that a clone occurred.
>> Because it's a deep-down library with no access to its own thread or the
>> main loop, it can not rely on poll/select. Mmap of a file however would
>> work great, as you can create transactions on top of a 64bit mmap'ed
>> value for example.
> I didn't realize that s2n can't poll. That's surprising. In the worst
> case, can't you just spawn a thread?


You block the thread on poll, so the only option is a thread. I was 
until now always under the working assumption that we can't do this in a 
thread because you don't want your single threaded application to turn 
into a pthreaded one, but you make me wonder. Let me check with Torben 
tomorrow.


>
>> 2) A way to notify larger applications (think Java here) that a system
>> is going to be suspended soon so it can wipe PII before it gets cloned
>> for example.
> Suspension, like S3 power notification stuff? Talk to Rafael about that;


Whether you go running -> S3 -> clone or you go running -> paused ->
clone is an implementation detail I'm not terribly worried about. Users
can do either, because in both cases the VM is in a paused state.


> this isn't related to the VM fork issue. I use those PM notifiers
> happily in kernel space but AFAICT, there's still no userspace thing for
> it. This seems orthogonal to this conversation though, so let's not veer
> off into that topic.
>
> If you didn't mean S3 but actually meant notification prior to snapshot
> taking, we don't have any virtual hardware for that, so it's a moot
> point.


I think we'll want to have an external button similar to the ACPI sleep 
button or lid close event eventually, so that we can loop the VM in on 
the suspend path the same way S3 does.

Today we don't have it, I agree. And if it's possible to maintain the 
same user space interface with this in mind, all is good. Let's just 
keep it in mind as something that will probably come eventually so that 
we don't redesign the UAPI in a year from now.


>
>> 3) Notifications after clone so applications know they can regenerate VM
>> unique data based on randomness.
> You mean this as "the same as (1) but with poll() instead of mmap()",
> right?


Yes :)


>
>> Lennart, looking at the current sysctl proposal, systemd could poll() on
>> the fork file. It would then be able to generate a /run/fork-id file
>> which it can use for the flow above, right?
> For the reasons I gave in my last email to Lennart, I don't think
> there's a good way for systemd to generate a fork-id file on its own
> either. I don't think systemd should really be involved here as a
> provider of values, just as a potential consumer of what the kernel
> provides.


Yes, systemd would poll on fork_event. When that returns, it reads 
/dev/urandom and writes a new /run/fork-id file with that. Libraries 
that don't want to be in the business of spawning threads can use that 
file to identify that they were cloned. Systemd can use that id as seed 
for networkd.
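
Library side, consuming that would be something like (sketch;
/run/fork-id and its exact format are of course made up at this point):

  #include <stdio.h>
  #include <string.h>

  /* Re-read the (hypothetical) /run/fork-id before sensitive operations
     and compare against the value cached when the keys were generated. */
  static int vm_forked_since(const char *cached_id)
  {
    char current[128] = "";
    FILE *f = fopen("/run/fork-id", "re");

    if (f) {
      if (!fgets(current, sizeof(current), f))
        current[0] = '\0';
      fclose(f);
    }
    return strcmp(current, cached_id) != 0;
  }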


>
>> The sysctl proposal also gives us 3, if we implement the inhibitor
>> proposal [1] in systemd.
> These userspace components you're proposing seem like a lot of
> overengineering for little gain, which is why this conversation went
> nowhere when Amazon attempted all this year. But it sounds like you
> agree with me based on your remark below about systemd-less interfaces
> provided by the kernel.


I would be happy to get to 99% of this with pure kernel based 
interfaces. The reason the kernel conversation seemingly went nowhere 
was not because of "overengineering". It was because the review feedback 
eventually was "This is a file, make user space manage it".


>> Overall, it sounds to me like the sysctl poll based kernel interface in
>> this patch in combination with systemd inhibitors gives us an answer to
>> most of the flows above.
>>
>> I can see attractiveness in providing the /run/fork-id directly from the
>> kernel though, to remove the dependency on systemd for poll-less
>> notification of libraries.
>>
>> Jason, how much complexity would it add to provide an mmap() and read()
>> interface to a fork counter value to the sysctl? Read sounds like a
>> trivial change on top of what you have already, mmap a bit more heavy
>> lift. If we had both, it would allow us to implement a Linux standard
>> fork detect path in libraries that does not rely on systemd.
> mmap() does not give us anything if we're not going to expose the raw
> ACPI-mapped ID directly. It will still be a racy mechanism until we do
> that. So I think we should wait until there's a proper vmgenid
> word-sized counter to expose something mmap()able. If you have the
> energy to talk to Microsoft about this and make it happen, please be my
> guest. As I wrote at the beginning of this email. I haven't gotten a
> response from you at all about this stuff in quite some time, so I'm not
> really itching take that on alone now.


Absolutely! Let's see where it goes :)


Alex





Jason A. Donenfeld May 2, 2022, 7:27 p.m. UTC | #12
On Mon, May 02, 2022 at 08:56:05PM +0200, Alexander Graf wrote:
> 
> On 02.05.22 20:46, Jason A. Donenfeld wrote:
> > On Mon, May 02, 2022 at 08:34:38PM +0200, Alexander Graf wrote:
> >> Michael, since we already changed the CID in the spec, can we add a
> >> property to the device that indicates the first 4 bytes of the UUID will
> >> always be different between parent and child?
> >>
> >> That should give us the ability to mmap the vmgenid directly to user
> >> space and act based on a simple u32 compare for clone notification, no?
> > That is not a good idea. We want an _additional_ 4 bytes, so that we can
> > keep the first 16 bytes (128 bits) as a kernel space secret.
> 
> 
> An additional 4 bytes would be an additional 4kb (or 64kb on ARM) page. 
> Do we really rely on these 16 bytes to reseed after clone? If so, we'd 
> need to bite the bullet and provide an additional page, yes.
 
Ugh, you're right; memory mapping is pages. The other option would be
relying on RDRAND (both existing and being trusted by the user etc), but
generally people aren't too jazzed about that. We pretty much have to
assume that the existing pool is compromised, since people share cloned
VMs casually. The 128-bit vmgenid is a nice input to have.
Alexander Graf May 2, 2022, 7:41 p.m. UTC | #13
On 02.05.22 21:27, Jason A. Donenfeld wrote:
> On Mon, May 02, 2022 at 08:56:05PM +0200, Alexander Graf wrote:
>> On 02.05.22 20:46, Jason A. Donenfeld wrote:
>>> On Mon, May 02, 2022 at 08:34:38PM +0200, Alexander Graf wrote:
>>>> Michael, since we already changed the CID in the spec, can we add a
>>>> property to the device that indicates the first 4 bytes of the UUID will
>>>> always be different between parent and child?
>>>>
>>>> That should give us the ability to mmap the vmgenid directly to user
>>>> space and act based on a simple u32 compare for clone notification, no?
>>> That is not a good idea. We want an _additional_ 4 bytes, so that we can
>>> keep the first 16 bytes (128 bits) as a kernel space secret.
>>
>> An additional 4 bytes would be an additional 4kb (or 64kb on ARM) page.
>> Do we really rely on these 16 bytes to reseed after clone? If so, we'd
>> need to bite the bullet and provide an additional page, yes.
> Ugh, you're right; memory mapping is pages. The other option would be
> relying on RDRAND (both existing and being trusted by the user etc), but
> generally people aren't too jazzed about that. We pretty much have to
> assume that the existing pool is compromised, since people share cloned
> VMs casually. The 128-bit vmgenid is a nice input to have.


I can see the merit. So yes, we'd want a second function to the 
VM_GEN_COUNTER device in addition to "ADDR" that - in a fully user space 
mappable separate page - gives us a 32-bit vmgenid that is guaranteed to 
be different from the previous one.


Alex




Jason A. Donenfeld May 2, 2022, 8:03 p.m. UTC | #14
Hi Alex,

On Mon, May 02, 2022 at 08:57:01PM +0200, Alexander Graf wrote:
> So far I see little that would block your patch? It seems to go into a 
> good direction from all I can tell.

With kernel code for the kernel, if it's good enough, it's good enough,
and can be made better later. With kernel code for userspace, it might
not be possible to make it better later, so if we've got a plethora of
options, it makes the incrementalism I was enthusiastic about a little
bit more risky.

Here's where we're at:

- /proc/sys has this easy to use poll() interface [which currently breaks
  normal poll() semantics; see patch 1/2]. It'd be nice to use this as
  an async notifier interface.

- A race-free mmap()d mechanism needs virtual hardware support to be
  useful, so I don't want to add it now. But when we do have it, where
  should it be added?

  * Atop the same /proc/sys file that this patch has returning -ENODATA
    at the moment? /proc/sys doesn't do mmap now so that'd be some
    plumbing work, which would be pretty odd for the sysctl abstraction.

  * Atop some new file elsewhere? Not hard to do, but also begs the
    question of which driver should do it, where the file should be, and
    a whole host of bikesheddy questions.

  * In the VDSO? This would make the most sense to me if we're ever
    going to do a VDSO RNG, and to design it along with the userspace
    RNG stuff people think they want, even if it's maybe not the best
    idea in the end. This is a huge can of worms. Termites, even, and
    they'll eat through the beams of your house.

  * Nowhere, because even though this race-free mechanism is attractive,
    it's way too hard to program for, and nobody is going to use it, and
    not many people care /that/ much about having this mechanism being
    100% race-free. Colm said Amazon doesn't, for example, since it uses
    quiescent states. And apparently Microsoft doesn't care much either,
    and they don't even use quiescent states. Maybe I'm the only one
    raising a fuss about the race window (while hypocritically also not
    being very eager to write code around an interface that has to check
    a variable all the time).

- A file in /proc/sys can return -ENODATA, as this patch has it doing,
  or it can return something else:

  * A counter. Flawed for reasons discussed elsewhere.

  * A UUID, just like /proc/sys/kernel/random/boot_id, except we'd name
    it generation_id or something and it'd change on VM fork. An earlier
    version of this patch did that.

  * -ENODATA, as this patch does, because each caller can get a new unique
    value with getrandom(), as there's no point in having some global
    identifier like with boot_id. This is what I'm leaning toward. It
    seems like getrandom() would handle Lennart's case, Go's case, s2n's
    case, and we don't need a particular ID.

  * The direct value of the ACPI vmgenid value. Discussed in earlier
    emails; I don't want to do this.

- You don't care about race-free, but you do care about mmap(), for
  "programming reasons" in s2n. I'm not yet sympathetic to this case
  :-P. The point of mmap() should be for the race-free stuff. Otherwise
  poll() should do the job, and if you can't poll, you can hack around
  it with another process or a thread or whatever other trick. Your
  systemd proposal with a file was just punting the idea elsewhere in
  userspace (which I think is a bit ugly too...).

This state of affairs inclines me to:

- Have a /proc/sys/kernel/random/fork_event that has a poll() interface
  with reads returning -ENODATA (this patch).

- Try to convince whomever that poll()ing on fork_event and then calling
  getrandom() for a process-local snapshot ID is better than poll()ing
  on and subsequently read()ing a /proc/sys/kernel/random/generation_id
  that returns a global UUID like boot_id. [Lennart - I'm specifically
  interested in trying to convince you of this. Thoughts?]

- Given that you are okay with an async mechanism, figure out how poll()
  can work for s2n rather than mmap().

- Not really pursue the race-free mmap() thing unless Microsoft goes
  full steam ahead with it because it's complicated to program for and
  maybe (or maybe not? I don't know? data please?) a theoretical
  concern.
 
> You block the thread on poll, so the only option is a thread. I was 
> until now always under the working assumption that we can't do this in a 
> thread because you don't want your single threaded application to turn 
> into a pthreaded one, but you make me wonder. Let me check with Torben 
> tomorrow.

Can't you use raw clone() and just have a super dirty thread that shares
a single page mapping and nothing else? Poll on the pidfd of the parent
and the fork_event fd, quit on the former, and ++ on the latter.
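
Something in that direction (very rough, untested sketch; the pidfd and
clone() details are from memory, so double-check them):

  #define _GNU_SOURCE
  #include <assert.h>
  #include <fcntl.h>
  #include <poll.h>
  #include <sched.h>
  #include <signal.h>
  #include <stdint.h>
  #include <sys/mman.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  struct shared { volatile uint64_t fork_generation; };

  static int watcher(void *arg)
  {
    struct shared *s = arg;
    struct pollfd fds[2] = {
      { .fd = open("/proc/sys/kernel/random/fork_event", O_RDONLY) },
      { .fd = (int)syscall(SYS_pidfd_open, getppid(), 0), .events = POLLIN },
    };

    for (;;) {
      read(fds[0].fd, NULL, 0);      /* consume any pending event */
      assert(poll(fds, 2, -1) > 0);
      if (fds[1].revents)            /* parent went away: quit */
        return 0;
      ++s->fork_generation;          /* VM forked: bump the shared counter */
    }
  }

  int main(void)
  {
    /* The one MAP_SHARED page is the only thing the watcher and the main
       program share after the plain (non-CLONE_VM) clone below. */
    struct shared *s = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    static char stack[64 * 1024];

    assert(s != MAP_FAILED);
    assert(clone(watcher, stack + sizeof(stack), SIGCHLD, s) > 0);

    /* ... main program compares s->fork_generation against a cached copy ... */
    pause();
  }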

> >> 2) A way to notify larger applications (think Java here) that a system
> >> is going to be suspended soon so it can wipe PII before it gets cloned
> >> for example.
> > Suspension, like S3 power notification stuff? Talk to Rafael about that;
> 
> 
> Whether you go running -> S3 -> clone or you go running -> paused -> 
> clone is an implementation detail I'm not terribly worried about. Users 
> can do either, because on both cases the VM is in paused state.

Ahh, I see.
https://lore.kernel.org/lkml/20220501123849.3858-1-Jason@zx2c4.com/
might interest you.

Jason
Lennart Poettering May 3, 2022, 7:42 a.m. UTC | #15
On Mo, 02.05.22 20:04, Jason A. Donenfeld (Jason@zx2c4.com) wrote:

> > I can just tell you, that in systemd we'd have a usecase for consuming
> > such a generation counter: we try to provide stable MAC addresses for
> > synthetic network interfaces managed by networkd, so we hash them from
> > /etc/machine-id, but otoh people also want them to change when they
> > clone their VMs. We could very nicely solve this if we had a
> > generation counter easily accessible from userspace, that starts at 0
> > initially. Because then we can hash as we always did when the counter
> > is zero, but otherwise use something else, possibly hashed from the
> > generation counter.
>
> This doesn't work, because you could have memory-A split into memory-A.1
> and memory-A.2, and both A.2 and A.1 would ++counter, and wind up with
> the same new value "2".

Yes, that's why I was vague about what to switch to if the counter is
non-zero, i.e. "something else, *possibly* hashed…".

For this MAC address use case it's entirely sufficient to be able to
distinguish if the system was cloned at all, i.e. if the counter is
zero or is non-zero. Because that would already be great for a policy
of "hash it in a stable way from /etc/machine-id, if counter == 0" +
"use a random MAC once counter > 0".

Such a MAC address policy, I think, should probably even be the new
default in networkd, if we could implement it. For that we'd need a
single bit of info from the kernel, indicating whether the system was
cloned at all, i.e. whether the vmgenid UUID is different from the one
the system first booted up with.

Lennart

--
Lennart Poettering, Berlin
Lennart Poettering May 3, 2022, 8:29 a.m. UTC | #16
On Mo, 02.05.22 19:59, Alexander Graf (graf@amazon.com) wrote:

> Lennart, looking at the current sysctl proposal, systemd could poll() on the
> fork file. It would then be able to generate a /run/fork-id file which it
> can use for the flow above, right?

I am not too keen on making systemd such a proxy. Sounds like something
the kernel could do on its own, better, and resulting in an ultimately
simpler system...

If systemd is the proxy, this adds extra raciness. I.e. in an ideal
world, if we have some form of notification fd, then it would be great
if that fd is guaranteed to have POLLIN set and its contents updated
the instant the clone happens. But if we proxy this through userspace,
there's necessarily a latency involved: the time it takes userspace to
catch up and propagate the POLLIN and updated contents to its client
apps.

I understand the underlying VM hypervisor APIs are currently designed
in a way that always implies some notification latency. Which sucks,
but I think we should be very careful not to replicate this design
mistake in any userspace APIs we add.

I.e. I am pretty sure that even if the underlying VM hypervisor
primitive isn't as good as we want, the Linux kernel→userspace API
should be built so that if one day a better VM hypervisor interface
exists, it can be plugged in behind it without such limitations.

> Overall, it sounds to me like the sysctl poll based kernel interface in this
> patch in combination with systemd inhibitors gives us an answer to most of
> the flows above.

As mentioned earlier, I am not convinced sysctl is the right place for
this. sysctls are understood by most people as being the place for
tweaking kernel settings. This is not a kernel setting, but a
notification concept, and the way Jason defined it there's nothing to
read nor write, which strongly suggests to move it elsewhere, but not
/proc/sys/.

Use /sys/kernel/ or so. Or maybe O_NOTIFICATION_PIPE or whatever, but
/proc/sys/ looks really wrong.

> I can see attractiveness in providing the /run/fork-id directly from the
> kernel though, to remove the dependency on systemd for poll-less
> notification of libraries.

I agree.

Lennart

--
Lennart Poettering, Berlin
Jason A. Donenfeld May 3, 2022, 9:08 a.m. UTC | #17
Hey Lennart,

On Tue, May 03, 2022 at 09:42:40AM +0200, Lennart Poettering wrote:
> For this MAC address usecase it's entirely sufficient to be able to
> distinguish if the system was closed at all, i.e. if the counter is
> zero or is non-zero. Because that would already be great for a policy
> of "hash it in a stable way from /etc/machine-id, if counter == 0" +
> "use random MAC once counter > 0".

Hm, are you sure that's actually what you want? It turns out this
vmgenid notification from the hypervisor might not be sufficiently
granular for this use case:

- vmgenid changes when you fork a new snapshot, so now you have two VMs
- vmgenid also changes when you rewind to 2 minutes ago

The first is what I assume you care about for this networkd business.
The second is probably not what any networkd user expects.

[Aside: I hope there are few networkd users; having seen what Yu did
with wireguard and how fast and recklessly that went, I can't recommend
that part of systemd to anyone.]

From the perspective of randomness, both of these events imply the same
thing. The situation is BAD; reseed immediately. From the perspective of
MAC addresses, though, these events would imply different behavior,
right? So it seems like vmgenid might need an additional field for this
use case. Relatedly, VMware has that prompt where you select about your
VM whether, "I moved it" or "I copied it." Presumably something like
that would play a part in what is decided as part of this hypothetical
second field.

Let me know if this seems right to you, or if actually you had in mind
changing MAC addresses in both cases instead.

Jason
Lennart Poettering May 3, 2022, 9:32 a.m. UTC | #18
On Di, 03.05.22 11:08, Jason A. Donenfeld (Jason@zx2c4.com) wrote:

> Hey Lennart,
>
> On Tue, May 03, 2022 at 09:42:40AM +0200, Lennart Poettering wrote:
> > For this MAC address usecase it's entirely sufficient to be able to
> > distinguish if the system was closed at all, i.e. if the counter is
> > zero or is non-zero. Because that would already be great for a policy
> > of "hash it in a stable way from /etc/machine-id, if counter == 0" +
> > "use random MAC once counter > 0".
>
> Hm, are you sure that's actually what you want? It turns out this
> vmgenid notification from the hypervisor might not be sufficiently
> granular for this use case:
>
> - vmgenid changes when you fork a new snapshot, so now you have two VMs
> - vmgenid also changes when you rewind to 2 minutes ago
>
> The first is what I assume you care about for this networkd business.
> The second is probably not what any networkd user expects.

So first of all, it appears to me that rewinding a VM is something people
would do for debugging things, i.e. not how things are done on
deployment.

> > From the perspective of randomness, both of these events imply the same
> thing. The situation is BAD; reseed immediately. From the perspective of
> MAC addresses, though, these events would imply different behavior,
> right? So it seems like vmgenid might need an additional field for this
> use case. Relatedly, VMware has that prompt where you select about your
> VM whether, "I moved it" or "I copied it." Presumably something like
> that would play a part in what is decided as part of this hypothetical
> second field.

networkd doesn't change MAC addresses in the middle of everything, but
only when a network interface is downed and upped again. This happens,
for example, when a link beat goes away and comes back. In the
rewind-2min case I'd assume the link beat would probably be restored
to what it was 2min ago (and thus stay online), but in the clone case
it would probably drop momentarily and then be restored, to tell
software to reacquire DHCP and so on.

Or in other words: if the hypervisor wants the system to
reconfigure/reacquire its network, there are explicit ways already, and
they work AFAICS. What's missing there is simply a reasonable way to
ensure that we won't end up sticking to the same fixed MAC address
when we set things up in order to acquire a new DHCP lease.

So I am not too concerned.

Lennart

--
Lennart Poettering, Berlin
Jason A. Donenfeld May 3, 2022, 10:07 a.m. UTC | #19
On Tue, May 03, 2022 at 11:32:26AM +0200, Lennart Poettering wrote:
> So first of all, it appears to me that rewinding a VM is something people
> would do for debugging things, i.e. not how things are done on
> deployment.

I wouldn't be so sure here... Some people have processes around, "always
start out from the same place", like for build machines, and employ a
single VM snapshot that's always rewound after use. It's never forked
into multiple snapshots, but just always goes back to that same starting
point.

> > From the perspective of randomness, both of these events imply the same
> > thing. The situation is BAD; reseed immediately. From the perspective of
> > MAC addresses, though, these events would imply different behavior,
> > right? So it seems like vmgenid might need an additional field for this
> > use case. Relatedly, VMware has that prompt where you select about your
> > VM whether, "I moved it" or "I copied it." Presumably something like
> > that would play a part in what is decided as part of this hypothetical
> > second field.
> 
> networkd doesn't change MAC addresses in the middle of everything, but
> only when a network interface is downed and upped again. This for
> example happens when a link beat goes away and comes back. In the
> rewind-2min case i'd assume the link beat would probably be restored
> to what it was 2min ago (and thus stay online), but in the clone case
> it would probably drop momentarily and be restored than, to tell
> software to reacquire dhcp and so on.

That sounds like it's going to be sort of confusing. Let's say we've got
some VM scenario in which rewinds are common due to whatever weird
process (such as a build machine that wants to start out at the same
place each time). During its course of execution, it reboots, or maybe
there's some network connectivity issue and the link goes down. In that
case, when the link comes up, it's going to have a different MAC
address? That doesn't make much sense to me, but maybe I'm missing some
bigger picture detail.

Jason
Jason A. Donenfeld May 3, 2022, 11:55 a.m. UTC | #20
On Tue, May 03, 2022 at 10:29:49AM +0200, Lennart Poettering wrote:
> As mentioned earlier, I am not convinced sysctl is the right place for
> this. sysctls are understood by most people as being the place for
> tweaking kernel settings. This is not a kernel setting, but a
> notification concept, and the way Jason defined it there's nothing to
> read nor write, which strongly suggests to move it elsewhere, but not
> /proc/sys/.

I think I'm coming around to this view that having a sysctl return
-ENODATA is weird. It makes `sysctl -a` always complain to stderr, for
example, which seems bad.

> > I can see attractiveness in providing the /run/fork-id directly from the
> > kernel though, to remove the dependency on systemd for poll-less
> > notification of libraries.
> 
> I agree.

I'm still not convinced there's value in having a counter or a UUID, but
if you had to choose, would you prefer a counter or a UUID? It sounds
like the former, because you see a use for distinguishing between zero
and non-zero? Or did you finally agree with me that vmgenid isn't
granular enough for that?

Jason
Lennart Poettering May 3, 2022, 12:33 p.m. UTC | #21
On Di, 03.05.22 13:55, Jason A. Donenfeld (Jason@zx2c4.com) wrote:

> I'm still not convinced there's value in having a counter or a UUID, but
> if you had to choose, would you prefer a counter or a UUID? It sounds
> like the former, because you see a use for distinguishing between zero
> and non-zero? Or did you finally agree with me that vmgenid isn't
> granular enough for that?

I would prefer a monotonic counter, since it allows answering
questions like the following:

1. Did this image get cloned at all? (i.e. counter != 0; use case as per the MAC address
   discussion)

2. Did the image get cloned since the last time I looked? (i.e. counter
   != my_previously_saved_counter; use case: detect clones in an "offline"
   fashion, i.e. from a component that doesn't continuously run, but
   only runs from time to time)

3. How many clones did I miss? (i.e. missed_clones =
   counter - my_previously_saved_counter; use case: catch up with
   generating proxy D-Bus signal messages for clones).

There might be more.

Using a UUID would not give us #1 or #3. It would deliver #2 however.
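
In code, roughly (sketch; the counter file name is made up, nothing
exposes one today):

  #include <inttypes.h>
  #include <stdio.h>

  static uint64_t read_fork_count(void)
  {
    uint64_t n = 0;
    FILE *f = fopen("/proc/sys/kernel/random/fork_count", "re");

    if (f) {
      if (fscanf(f, "%" SCNu64, &n) != 1)
        n = 0;
      fclose(f);
    }
    return n;
  }

  /* #1: cloned at all?           read_fork_count() != 0
   * #2: cloned since last look?  read_fork_count() != saved
   * #3: how many missed?         read_fork_count() - saved */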

Lennart

--
Lennart Poettering, Berlin
Lennart Poettering May 3, 2022, 12:42 p.m. UTC | #22
On Di, 03.05.22 12:07, Jason A. Donenfeld (Jason@zx2c4.com) wrote:

> I wouldn't be so sure here... Some people have processes around, "always
> start out from the same place", like for build machines, and employ a
> single VM snapshot that's always rewound after use. It's never forked
> into multiple snapshots, but just always goes back to that same starting
> point.

At some point it's futile to fantasize about which exotic use cases
people might have.

> > > From the perspective of randomness, both of these events imply the same
> > > thing. The situation is BAD; reseed immediately. From the perspective of
> > > MAC addresses, though, these events would imply different behavior,
> > > right? So it seems like vmgenid might need an additional field for this
> > > use case. Relatedly, VMware has that prompt where you select about your
> > > VM whether, "I moved it" or "I copied it." Presumably something like
> > > that would play a part in what is decided as part of this hypothetical
> > > second field.
> >
> > networkd doesn't change MAC addresses in the middle of everything, but
> > only when a network interface is downed and upped again. This for
> > example happens when a link beat goes away and comes back. In the
> > rewind-2min case i'd assume the link beat would probably be restored
> > to what it was 2min ago (and thus stay online), but in the clone case
> > it would probably drop momentarily and be restored than, to tell
> > software to reacquire dhcp and so on.
>
> That sounds like it's going to be sort of confusing. Let's say we've got
> some VM scenario in which rewinds are common due to whatever weird
> process (such as a build machine that wants to start out at the same
> place each time). During its course of execution, it reboots, or maybe
> there's some network connectivity issue and the link goes down. In that
> case, when the link comes up, it's going to have a different MAC
> address? That doesn't make much sense to me, but maybe I'm missing some
> bigger picture detail.

It's still better than sticking to the same MAC address for all clones
in all cases...

Dunno, in systemd the MAC address policies are configurable, for a
reason. We'll never find a default that really makes everyone
happy. Some people prefer the anonymity of randomized MAC addresses,
others like the stability promises of hashed MAC addresses. We support
both policies. I think it would make sense to add a policy that says
"stable MAC until the first clone, then random", for example. In fact
I think it's a choice that has a chance of being a better default
than the current "always stable" approach we employ. So at the very
least we should have the option to come up with a policy taking vm
generations into account; whether to make it opt-in or the default is
a separate discussion, and I doubt that part of the discussion really
matters here...

Lennart

--
Lennart Poettering, Berlin
Michael Kelley (LINUX) May 4, 2022, 3:45 p.m. UTC | #23
From: Alexander Graf <graf@amazon.com> Sent: Monday, May 2, 2022 11:35 AM
> 
> On 02.05.22 20:04, Jason A. Donenfeld wrote:
> > Hey Lennart,
> >
> > On Mon, May 02, 2022 at 06:51:19PM +0200, Lennart Poettering wrote:
> >> On Mo, 02.05.22 18:12, Jason A. Donenfeld (Jason@zx2c4.com) wrote:
> >>
> >>>>> In order to inform userspace of virtual machine forks, this commit adds
> >>>>> a "fork_event" sysctl, which does not return any data, but allows
> >>>>> userspace processes to poll() on it for notification of VM forks.
> >>>>>
> >>>>> It avoids exposing the actual vmgenid from the hypervisor to userspace,
> >>>>> in case there is any randomness value in keeping it secret. Rather,
> >>>>> userspace is expected to simply use getrandom() if it wants a fresh
> >>>>> value.
> >>>> Wouldn't it make sense to expose a monotonic 64bit counter of detected
> >>>> VM forks since boot through read()? It might be interesting to know
> >>>> for userspace how many forks it missed the fork events for. Moreover it
> >>>> might be interesting to userspace to know if any fork happened so far
> >>>> *at* *all*, by checking if the counter is non-zero.
> >>> "Might be interesting" is different from "definitely useful". I'm not
> >>> going to add this without a clear use case. This feature is pretty
> >>> narrowly scoped in its objectives right now, and I intend to keep it
> >>> that way if possible.
> >> Sure, whatever. I mean, if you think it's preferable to have 3 API
> >> abstractions for the same concept each for it's special usecase, then
> >> that's certainly one way to do things. I personally would try to
> >> figure out a modicum of generalization for things like this. But maybe
> >> that' just me…
> >>
> >> I can just tell you, that in systemd we'd have a usecase for consuming
> >> such a generation counter: we try to provide stable MAC addresses for
> >> synthetic network interfaces managed by networkd, so we hash them from
> >> /etc/machine-id, but otoh people also want them to change when they
> >> clone their VMs. We could very nicely solve this if we had a
> >> generation counter easily accessible from userspace, that starts at 0
> >> initially. Because then we can hash as we always did when the counter
> >> is zero, but otherwise use something else, possibly hashed from the
> >> generation counter.
> > This doesn't work, because you could have memory-A split into memory-A.1
> > and memory-A.2, and both A.2 and A.1 would ++counter, and wind up with
> > the same new value "2". The solution is to instead have the hypervisor
> > pass a unique value and a counter. We currently have a 16 byte unique
> > value from the hypervisor, which I'm keeping as a kernel space secret
> > for the RNG; we're waiting on a word-sized monotonic counter interface
> > from hypervisors in the future. When we have the latter, then we can
> > start talking about mmapable things. Your use case would probably be
> > served by exposing that 16-byte unique value (hashed with some constant
> > for safety I suppose), but I'm hesitant to start going down that route
> > all at once, especially if we're to have a more useful counter in the
> > future.
> 
> 
> Michael, since we already changed the CID in the spec, can we add a
> property to the device that indicates the first 4 bytes of the UUID will
> always be different between parent and child?
> 
> That should give us the ability to mmap the vmgenid directly to user
> space and act based on a simple u32 compare for clone notification, no?
> 

I'm not ignoring this request, but my interpretation of the subsequent
discussion is that it's probably not the path that we want to go down
anyway.  Is that a correct interpretation?

Also, the chances of getting the Windows team to focus on a revision
to the spec are not high, especially a revision that has new semantics. :-(
Getting the new CID added was a relatively low bar, though I'm still trying
to get the publicly available version of the spec updated to include the
new CID.

Michael
Simo Sorce May 11, 2022, 12:40 a.m. UTC | #24
On Mon, 2022-05-02 at 16:06 +0200, Jason A. Donenfeld wrote:
> In order to inform userspace of virtual machine forks, this commit adds
> a "fork_event" sysctl, which does not return any data, but allows
> userspace processes to poll() on it for notification of VM forks.
> 
> It avoids exposing the actual vmgenid from the hypervisor to userspace,
> in case there is any randomness value in keeping it secret. Rather,
> userspace is expected to simply use getrandom() if it wants a fresh
> value.
> 
> For example, the following snippet can be used to print a message every
> time a VM forks, after the RNG has been reseeded:
> 
>   struct pollfd fd = { .fd = open("/proc/sys/kernel/random/fork_event", O_RDONLY)  };
>   assert(fd.fd >= 0);
>   for (;;) {
>     read(fd.fd, NULL, 0);
>     assert(poll(&fd, 1, -1) > 0);
>     puts("vm fork detected");
>   }
> 
> Various programs and libraries that utilize cryptographic operations
> depending on fresh randomness can invalidate old keys or take other
> appropriate actions when receiving that event. While this is racier than
> allowing userspace to mmap/vDSO the vmgenid itself, it's an incremental
> step forward that's not as heavyweight.

At your request, I'm teleporting here the answer I gave on a different
thread, reinforced by some further thinking.

As a user space crypto library person I think the only reasonable
interface is something like a vDSO.

Poll() interfaces are nice and all for system programs that have full
control of their event loop and do not have to react immediately to
this event; crypto libraries, however, do not have the luxury of
controlling the main loop of the application.

Additionally, crypto libraries really need to ensure the value they
return from their PRNG is safe to use, which means they must not return
a value if the vmgenid has changed before they can reseed, or there
could be catastrophic duplication of "random" values used in IVs or
ECDSA signatures or ids/cookies or whatever.

For crypto libraries it is much simpler to poll for this information
than to use notifications of any kind, given that libraries are
generally not in full control of what the process does.

This needs to be polled fast as well, because the whole point of
initializing a PRNG in the library is that asking /dev/urandom all the
time is too slow (due to context switches and syscall overhead), so
anything that would require a context switch in order to pull data from
the PRNG would not really fly.

A vDSO or similar would allow pulling in the vmgenid or whatever epoch
value before generating the random numbers, and then doing a
barrier-style check that the value is still unchanged before returning
the random data to the caller. This reduces the race condition (which
simply cannot be completely avoided) to a very unlikely event.
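
In code, that barrier-style check would look roughly like this (a sketch
only: vdso_vm_epoch() is a hypothetical accessor for an epoch value that
changes on every VM fork, and prng_generate()/prng_reseed() stand in for
the library's internal DRBG):

  #include <stddef.h>

  extern unsigned long vdso_vm_epoch(void);                  /* hypothetical */
  extern void prng_reseed(void);                             /* library internal */
  extern void prng_generate(unsigned char *out, size_t len); /* library internal */

  static void prng_bytes(unsigned char *out, size_t len)
  {
    for (;;) {
      unsigned long epoch = vdso_vm_epoch();
      prng_generate(out, len);
      /* barrier-style check: if a fork happened while we were
       * generating, throw the output away, reseed, and retry */
      if (vdso_vm_epoch() == epoch)
        return;
      prng_reseed();
    }
  }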

HTH,
Simo.
Jason A. Donenfeld May 11, 2022, 1:18 a.m. UTC | #25
Hi Simo,

On Tue, May 10, 2022 at 08:40:48PM -0400, Simo Sorce wrote:
> At your request teleporting here the answer I gave on a different
> thread, reinforced by some thinking.
> 
> As a user space crypto library person I think the only reasonable
> interface is something like a vDSO.
> 
> Poll() interfaces are nice and all for system programs that have full
> control of their event loop and do not have to react immediately to
> this event, however crypto libraries do not have the luxury of
> controlling the main loop of the application.
> 
> Additionally crypto libraries really need to ensure the value they
> return from their PRNG is fine, which means they do not return a value
> if the vmgenid has changed before they can reseed, or there could be
> catastrophic duplication of "random" values used in IVs or ECDSA
> Signatures or ids/cookies or whatever.
> 
> For crypto libraries it is much simpler to poll for this information 
> than using notifications of any kind given libraries are
> generally not in full control of what the process does.
> 
> This needs to be polled fast as well, because the whole point of
> initializing a PRNG in the library is that asking /dev/urandom all the
> time is too slow (due to context switches and syscall overhead), so
> anything that would require a context switch in order to pull data from
> the PRNG would not really fly.
> 
> A vDSO or similar would allow to pull the vmgenid or whatever epoch
> value in before generating the random numbers and then barrier-style
> check that the value is still unchanged before returning the random
> data to the caller. This will reduce the race condition (which simply
> cannot be completely avoided) to a very unlikely event.

It sounds like your library issue is somewhat similar to what Alex was
talking about with regards to having a hard time using poll in s2n. I'm
still waiting to hear if Amazon figured out some way that this is
possible (with, e.g., a thread). But anyway, it seems like this is
something library authors might hit.

My proposal here is made with nonce reuse in mind, for things like
session keys that use sequential nonces.

A different issue is random nonces. For these, it seems like a call to
getrandom() for each nonce is probably the best bet. But it sounds like
you're interested in a userspace RNG, akin to OpenBSD's arc4random(3). I
hope you saw these threads:

- https://lore.kernel.org/lkml/YnA5CUJKvqmXJxf2@zx2c4.com/
- https://lore.kernel.org/lkml/Yh4+9+UpanJWAIyZ@zx2c4.com/
- https://lore.kernel.org/lkml/CAHmME9qHGSF8w3DoyCP+ud_N0MAJ5_8zsUWx=rxQB1mFnGcu9w@mail.gmail.com/

Each one of those touches on vDSO things quite a bit. Basically, the
motivation for doing that is to make userspace RNGs safe and to
promote their use, with a variety of kernel enhancements to make that
easy. And IF we are to ship a vDSO RNG, then certainly this vmgenid
business should be exposed that way, over and above other mechanisms.
It'd make the most sense...IF we're going to ship a vDSO RNG.

So the question really is: should we ship a vDSO RNG? I could work on
designing that right. But I'm a little bit skeptical generally of the
whole userspace RNG concept. By and large they always turn out to be
less safe and more complex than the kernel one. So if we're to go that
way, I'd like to understand what the strongest arguments for it are.

Jason
Simo Sorce May 11, 2022, 12:59 p.m. UTC | #26
Hi Jason,

On Wed, 2022-05-11 at 03:18 +0200, Jason A. Donenfeld wrote:
> My proposal here is made with nonce reuse in mind, for things like
> session keys that use sequential nonces.

Although this makes sense, the problem is that changing applications to
do the right thing based on which situation they are in will never be
done right or soon enough. So I would focus on a solution that makes
the CSPRNGs in crypto libraries safe.

> A different issue is random nonces. For these, it seems like a call to
> getrandom() for each nonce is probably the best bet. But it sounds like
> you're interested in a userspace RNG, akin to OpenBSD's arc4random(3). I
> hope you saw these threads:
> 
> - https://lore.kernel.org/lkml/YnA5CUJKvqmXJxf2@zx2c4.com/
> - https://lore.kernel.org/lkml/Yh4+9+UpanJWAIyZ@zx2c4.com/
> - https://lore.kernel.org/lkml/CAHmME9qHGSF8w3DoyCP+ud_N0MAJ5_8zsUWx=rxQB1mFnGcu9w@mail.gmail.com/

4c does sound like a decent solution; it is semantically identical to
an epoch vmgenid. All the library needs to do is create such a mmap
region, stick a value on it, and verify it is not zero after computing
the next random value but before returning it to the caller.
This reduces the race to a very small window where the machine is frozen
right after the random value is returned to the caller but before it is
used, but hopefully that just means the two machines will make parallel
computations that yield the exact same value, so no catastrophic
consequence will arise (there is the odd case where two random values
are sought and the split happens between when the two are retrieved,
which has bad consequences, but I think we can ignore that).
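
A rough sketch of that check, assuming a hypothetical madvise flag
(spelled MADV_WIPEONCLONE below; it does not exist) that zeroes the page
when the VM is cloned, analogous to the existing MADV_WIPEONFORK for
process forks:

  #include <stddef.h>
  #include <sys/mman.h>

  static volatile unsigned long *clone_canary;

  static int canary_init(void)
  {
    clone_canary = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (clone_canary == MAP_FAILED)
      return -1;
    /* madvise((void *)clone_canary, 4096, MADV_WIPEONCLONE); -- hypothetical */
    *clone_canary = 1; /* any non-zero value */
    return 0;
  }

  /* Called after computing a random value, before handing it out:
   * if the canary reads zero, a clone happened; reseed and retry. */
  static int clone_happened(void)
  {
    return *clone_canary == 0;
  }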

> Each one of those touches on vDSO things quite a bit. Basically, the
> motivation for doing that is for making userspace RNGs safe and
> promoting their use with a variety of kernel enhancements to make that
> easy. And IF we are to ship a vDSO RNG, then certainly this vmgenid
> business should be exposed that way, over and above other mechanisms.
> It'd make the most sense...IF we're going to ship a vDSO RNG.
> 
> So the question really is: should we ship a vDSO RNG? I could work on
> designing that right. But I'm a little bit skeptical generally of the
> whole userspace RNG concept. By and large they always turn out to be
> less safe and more complex than the kernel one. So if we're to go that
> way, I'd like to understand what the strongest arguments for it are.

I am not entirely sure how a vDSO RNG would work; I think exposing the
epoch (or whatever indicator) is enough. Crypto libraries have pretty
good PRNGs; what they require is simply a good source of entropy for
the initial seeding and this safety mechanism to avoid state
duplication on machine cloning.
All the decent libraries already support detecting process forks.
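
For comparison, the usual process-fork detection pattern is roughly the
following sketch (not taken from any particular library):

  #include <pthread.h>

  static volatile int prng_needs_reseed;

  static void mark_reseed(void) /* runs in the child after fork() */
  {
    prng_needs_reseed = 1;
  }

  static void prng_init_fork_detection(void)
  {
    pthread_atfork(NULL, NULL, mark_reseed);
  }

No equivalent hook exists for VM clones, which is exactly the gap the
epoch check above would fill.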

Simo.
Alexander Graf May 11, 2022, 1:19 p.m. UTC | #27
Hi Simo,

On 11.05.22 14:59, Simo Sorce wrote:
> Hi Jason,
>
> On Wed, 2022-05-11 at 03:18 +0200, Jason A. Donenfeld wrote:
>> My proposal here is made with nonce reuse in mind, for things like
>> session keys that use sequential nonces.
> Although this makes sense the problem is that changing applications to
> do the right thing based on which situation they are in will never be
> done right or soon enough. So I would focus on a solution that makes
> the CSPRNGs in crypto libraries safe.


I think we intrinsically have 2 sets of applications: ones that want an
event-based notification and don't care about the raciness of it, and
ones that want an atomic way to determine the epoch. Userspace RNGs are
naturally in the second category.


>
>> A different issue is random nonces. For these, it seems like a call to
>> getrandom() for each nonce is probably the best bet. But it sounds like
>> you're interested in a userspace RNG, akin to OpenBSD's arc4random(3). I
>> hope you saw these threads:
>>
>> - https://lore.kernel.org/lkml/YnA5CUJKvqmXJxf2@zx2c4.com/
>> - https://lore.kernel.org/lkml/Yh4+9+UpanJWAIyZ@zx2c4.com/
>> - https://lore.kernel.org/lkml/CAHmME9qHGSF8w3DoyCP+ud_N0MAJ5_8zsUWx=rxQB1mFnGcu9w@mail.gmail.com/
> 4c does sound like a decent solution, it is semantically identical to
> an epoch vmgenid, all the library needs to do is to create such a mmap
> region, stick a value on  it, verify it is not zero after computing the
> next random value but before returning it to the caller.
> This reduces the race to a very small window when the machine is frozen
> right after the random value is returned to the caller but before it is
> used, but hopefully this just means that the two machines will just
> make parallel computations that yield the exact same value, so no
> catastrophic consequence will arise (there is the odd case where two
> random values are sought and the split happens between the two are
> retrieved and this has bad consequences, I think we can ignore that).


The problem with wiping memory on clone is that it means you must keep
these special wipe-on-clone pages always present and pinned in memory,
and can no longer swap them out, compact them, or move them for memory
hotplug.

We started the journey with a WIPEONSUSPEND flag and eventually 
abandoned it because it seemed clunky. Happy to reopen it if we believe 
it's a viable path:

   https://lwn.net/Articles/825230/


Alex
Jason A. Donenfeld May 11, 2022, 1:19 p.m. UTC | #28
Hi Simo,

On Wed, May 11, 2022 at 08:59:07AM -0400, Simo Sorce wrote:
> Hi Jason,
> 
> On Wed, 2022-05-11 at 03:18 +0200, Jason A. Donenfeld wrote:
> > My proposal here is made with nonce reuse in mind, for things like
> > session keys that use sequential nonces.
> 
> Although this makes sense the problem is that changing applications to
> do the right thing based on which situation they are in will never be
> done right or soon enough. So I would focus on a solution that makes
> the CSPRNGs in crypto libraries safe.

Please don't dismiss this. I realize you have your one single use case
in mind, but there are others, and the distinction you gave for why we
should dismiss the others to focus on yours doesn't really make any
sense. Here's why:

In my email I pointed out two places where VM forks impact crypto in bad
ways:

- Session keys, wrt nonce reuse.

- Random nonces, wrt nonce reuse.

There are other problems that arise from VM forks too. But these stand
out because they are both quite catastrophic, whether it's duplicated
ECDSA random nonces, or whether it's the same session key used with the
same sequential counter to encrypt different plaintexts with something
like AES-GCM or ChaCha20Poly1305. These are both very, very bad things.

And both things happen in:

- Libraries: crypto lib random number generators (e.g. OpenSSL), crypto
  lib session keys (e.g. any TLS library).

- Applications: application level random number generators (e.g.
  Bitcoin Core *facepalm*), application level session keys (e.g.
  OpenSSH).

So I don't think the "library vs application" distinction is really
meaningful here. Rather, things kind of fall apart all over the place
for a variety of reasons on VM fork.

> > - https://lore.kernel.org/lkml/YnA5CUJKvqmXJxf2@zx2c4.com/
> > - https://lore.kernel.org/lkml/Yh4+9+UpanJWAIyZ@zx2c4.com/
> > - https://lore.kernel.org/lkml/CAHmME9qHGSF8w3DoyCP+ud_N0MAJ5_8zsUWx=rxQB1mFnGcu9w@mail.gmail.com/
> 
> 4c does sound like a decent solution, it is semantically identical to

It does, yeah, but realistically it's never going to happen. I don't
think there's a near- or medium-term chance of changing hypervisor
semantics again. That means for 4-like solutions, there's 4a and 4b.

By the way, that email of mine has an inaccuracy in it. I complain about
being in irq context, but it turns out not to be the case; we're inside
of a kthread during the notification, which means we have a lot more
options for what we can do.

If 4 is the solution that appeals to you most, do you want to try your
hand at a RFC patch for it? I don't yet know if that's the best
direction to take, but the devil is kind of in the details, so it might
be interesting to see how it pans out.

Jason
Alexander Graf May 11, 2022, 1:20 p.m. UTC | #29
On 11.05.22 03:18, Jason A. Donenfeld wrote:
> Hi Simo,
>
> On Tue, May 10, 2022 at 08:40:48PM -0400, Simo Sorce wrote:
>> At your request teleporting here the answer I gave on a different
>> thread, reinforced by some thinking.
>>
>> As a user space crypto library person I think the only reasonable
>> interface is something like a vDSO.
>>
>> Poll() interfaces are nice and all for system programs that have full
>> control of their event loop and do not have to react immediately to
>> this event, however crypto libraries do not have the luxury of
>> controlling the main loop of the application.
>>
>> Additionally crypto libraries really need to ensure the value they
>> return from their PRNG is fine, which means they do not return a value
>> if the vmgenid has changed before they can reseed, or there could be
>> catastrophic duplication of "random" values used in IVs or ECDSA
>> Signatures or ids/cookies or whatever.
>>
>> For crypto libraries it is much simpler to poll for this information
>> than using notifications of any kind given libraries are
>> generally not in full control of what the process does.
>>
>> This needs to be polled fast as well, because the whole point of
>> initializing a PRNG in the library is that asking /dev/urandom all the
>> time is too slow (due to context switches and syscall overhead), so
>> anything that would require a context switch in order to pull data from
>> the PRNG would not really fly.
>>
>> A vDSO or similar would allow to pull the vmgenid or whatever epoch
>> value in before generating the random numbers and then barrier-style
>> check that the value is still unchanged before returning the random
>> data to the caller. This will reduce the race condition (which simply
>> cannot be completely avoided) to a very unlikely event.
> It sounds like your library issue is somewhat similar to what Alex was
> talking about with regards to having a hard time using poll in s2n. I'm
> still waiting to hear if Amazon figured out some way that this is
> possible (with, e.g., a thread). But anyway, it seems like this is


Sounds like I didn't reply with my findings - sorry about that. Our s2n
people *could* build something based on a thread, but are afraid that
it's racy, and it would mean the library creates a thread, which it does
not do today.

So in a nutshell, possible yes, desirable no.

I think we're maybe a bit too scared of building something from scratch 
here. What would the best case situation be? Let's roll backwards from 
that then.

From what I read, we want a "VMGenID v2" device that gives us the
ability to

   * Get an IRQ on VM clone
   * Store and update an RNG seed value (128-bit? Configurable len?) at a
     physical address or standalone page on clone
   * Store and update a unique-to-this-VM rolling 32-bit identifier at a
     physical address or standalone page on clone

We can either make the device expose these values as individual pages 
(like VMGenID does today) or as guest physical addresses that it needs 
to store into (like virtio-rng). The latter makes protection from DMA 
attacks of the hypervisor and kexec slightly more complicated, but it 
would be doable.

VMGenID has 2 out of 3 features above. So why don't we just go the easy
route: add a second property to VMGenID that gives us another page with
that 32-bit value, and then provide a /dev/vmgenid device node you can
open and mmap() that 32-bit value page from?

User space libraries can then try to open it on init and determine their
epoch. For the async event, we add the poll() logic again to /dev/vmgenid
and make networkd, for example, use that.
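
If such a node existed (it is only a proposal at this point), consuming
it might look roughly like this:

  /* Sketch of the proposed /dev/vmgenid interface: mmap a page holding
   * a 32-bit generation value, and optionally poll() the fd for the
   * asynchronous clone notification. Nothing here exists yet. */
  #include <fcntl.h>
  #include <poll.h>
  #include <stdint.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
    int fd = open("/dev/vmgenid", O_RDONLY); /* proposed device node */
    if (fd < 0)
      return 1;

    volatile uint32_t *gen = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
    if (gen == MAP_FAILED)
      return 1;

    /* libraries: remember the value and compare it before returning
     * PRNG output, as discussed earlier in the thread */
    uint32_t seen = *gen;

    /* daemons like networkd: wait for the async event instead */
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    for (;;) {
      if (poll(&pfd, 1, -1) > 0 && *gen != seen) {
        seen = *gen;
        /* react to the clone, e.g. re-roll MAC addresses */
      }
    }
  }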


Alex
Simo Sorce May 11, 2022, 2:32 p.m. UTC | #30
Hi Jason,

On Wed, 2022-05-11 at 15:19 +0200, Jason A. Donenfeld wrote:
> Please don't dismiss this. I realize you have your one single use case
> in mind, but there are others, and the distinction you gave for why we
> should dismiss the others to focus on yours doesn't really make any
> sense. Here's why:

I do not think I am dismissing any other use cases; clearly anything
that depends on unique random numbers for security is impacted, but I
tend to focus where we can get the biggest impact.

> In my email I pointed out two places where VM forks impact crypto in bad
> ways:
> 
> - Session keys, wrt nonce reuse.
> 
> - Random nonces, wrt nonce reuse.
> 
> There are other problems that arise from VM forks too. But these stand
> out because they are both quite catastrophic, whether it's duplicated
> ECDSA random nonces, or whether it's the same session key used with the
> same sequential counter to encrypt different plaintexts with something
> like AES-GCM or ChaCha20Poly1305. These are both very, very bad things.
> 
> And both things happen in:
> 
> - Libraries: crypto lib random number generators (e.g. OpenSSL), crypto
>   lib session keys (e.g. any TLS library).
> 
> - Applications: application level random number generators (e.g.
>   Bitcoin Core *facepalm*), application level session keys (e.g.
>   OpenSSH).

Yes, some applications that are involved with security do have their
own application-level PRNGs; clearly they will have to either stop
using customized PRNGs and use the library-provided ones (or even just
/dev/urandom if their needs are not performance-critical) or adjust
their own PRNGs to be safe using whatever mechanism will be provided.

> So I don't think the "library vs application" distinction is really
> meaningful here. Rather, things kind of fall apart all over the place
> for a variety of reasons on VM fork.

I am not really making a library vs application distinction; what I am
saying is that the library use case has a set of tighter constraints
than the application one. Basically, anything a library can use, an
application can as well, while the contrary is not true. Therefore, if
we resolve the library problem, applications will have a solution as
well.

> > > - https://lore.kernel.org/lkml/YnA5CUJKvqmXJxf2@zx2c4.com/
> > > - https://lore.kernel.org/lkml/Yh4+9+UpanJWAIyZ@zx2c4.com/
> > > - https://lore.kernel.org/lkml/CAHmME9qHGSF8w3DoyCP+ud_N0MAJ5_8zsUWx=rxQB1mFnGcu9w@mail.gmail.com/
> > 
> > 4c does sound like a decent solution, it is semantically identical to
> 
> It does, yeah, but realistically it's never going to happen. I don't
> think there's a near- or medium-term chance of changing hypervisor
> semantics again. That means for 4-like solutions, there's 4a and 4b.

I think 4a and 4b are fine mechanisms too; 4c is just more efficient,
and potentially optimizable in HW.
That said, I think 3 (vDSO) is also a fine solution, and I would not be
disappointed if 3 were chosen over 4.

I am not really after evaluating how it is done below the kernel
boundary, as long as the effects are the same, semantically, from the
user space PoV.

> By the way, that email of mine has inaccuracy in it. I complain about
> being in irq context, but it turns out not to be the case; we're inside
> of a kthread during the notification, which means we have a lot more
> options on what we can do.
> 
> If 4 is the solution that appeals to you most, do you want to try your
> hand at a RFC patch for it? I don't yet know if that's the best
> direction to take, but the devil is kind of in the details, so it might
> be interesting to see how it pans out.

I think it would be prudent to agree on the correct mechanisms before
venturing into potentially invasive patches.

Simo.

Patch

diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst
index 1144ea3229a3..ddbd603f0be7 100644
--- a/Documentation/admin-guide/sysctl/kernel.rst
+++ b/Documentation/admin-guide/sysctl/kernel.rst
@@ -1001,7 +1001,7 @@  This is a directory, with the following entries:
 * ``urandom_min_reseed_secs``: obsolete (used to determine the minimum
   number of seconds between urandom pool reseeding). This file is
   writable for compatibility purposes, but writing to it has no effect
-  on any RNG behavior.
+  on any RNG behavior;
 
 * ``uuid``: a UUID generated every time this is retrieved (this can
   thus be used to generate UUIDs at will);
@@ -1009,8 +1009,10 @@  This is a directory, with the following entries:
 * ``write_wakeup_threshold``: when the entropy count drops below this
   (as a number of bits), processes waiting to write to ``/dev/random``
   are woken up. This file is writable for compatibility purposes, but
-  writing to it has no effect on any RNG behavior.
+  writing to it has no effect on any RNG behavior;
 
+* ``fork_event``: unreadable, but can be poll()'d on for notifications
+  delivered after the RNG reseeds following a virtual machine fork.
 
 randomize_va_space
 ==================
diff --git a/drivers/char/random.c b/drivers/char/random.c
index bffc8682d6b8..39eda91b07ec 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1213,6 +1213,9 @@  EXPORT_SYMBOL_GPL(add_bootloader_randomness);
 
 #if IS_ENABLED(CONFIG_VMGENID)
 static BLOCKING_NOTIFIER_HEAD(vmfork_chain);
+#ifdef CONFIG_SYSCTL
+static DEFINE_CTL_TABLE_POLL(sysctl_fork_event_poll);
+#endif
 
 /*
  * Handle a new unique VM ID, which is unique, not secret, so we
@@ -1227,6 +1230,9 @@  void add_vmfork_randomness(const void *unique_vm_id, size_t size)
 		pr_notice("crng reseeded due to virtual machine fork\n");
 	}
 	blocking_notifier_call_chain(&vmfork_chain, 0, NULL);
+#ifdef CONFIG_SYSCTL
+	proc_sys_poll_notify(&sysctl_fork_event_poll);
+#endif
 }
 #if IS_MODULE(CONFIG_VMGENID)
 EXPORT_SYMBOL_GPL(add_vmfork_randomness);
@@ -1694,6 +1700,8 @@  const struct file_operations urandom_fops = {
  *   It is writable to avoid breaking old userspaces, but writing
  *   to it does not change any behavior of the RNG.
  *
+ * - fork_event - an unreadable file that can be poll()'d on for VM forks.
+ *
  ********************************************************************/
 
 #ifdef CONFIG_SYSCTL
@@ -1747,6 +1755,14 @@  static int proc_do_rointvec(struct ctl_table *table, int write, void *buffer,
 	return write ? 0 : proc_dointvec(table, 0, buffer, lenp, ppos);
 }
 
+#if IS_ENABLED(CONFIG_VMGENID)
+static int proc_do_nodata(struct ctl_table *table, int write, void *buffer,
+			  size_t *lenp, loff_t *ppos)
+{
+	return -ENODATA;
+}
+#endif
+
 static struct ctl_table random_table[] = {
 	{
 		.procname	= "poolsize",
@@ -1787,6 +1803,14 @@  static struct ctl_table random_table[] = {
 		.mode		= 0444,
 		.proc_handler	= proc_do_uuid,
 	},
+#if IS_ENABLED(CONFIG_VMGENID)
+	{
+		.procname	= "fork_event",
+		.mode		= 0444,
+		.poll		= &sysctl_fork_event_poll,
+		.proc_handler	= proc_do_nodata,
+	},
+#endif
 	{ }
 };