Message ID | 20220502140602.130373-2-Jason@zx2c4.com (mailing list archive) |
---|---|
State | Not Applicable |
Delegated to: | Herbert Xu |
Headers | show |
Series | [1/2] sysctl: read() must consume poll events, not poll() | expand |
On Mo, 02.05.22 16:06, Jason A. Donenfeld (Jason@zx2c4.com) wrote: > In order to inform userspace of virtual machine forks, this commit adds > a "fork_event" sysctl, which does not return any data, but allows > userspace processes to poll() on it for notification of VM forks. > > It avoids exposing the actual vmgenid from the hypervisor to userspace, > in case there is any randomness value in keeping it secret. Rather, > userspace is expected to simply use getrandom() if it wants a fresh > value. Wouldn't it make sense to expose a monotonic 64bit counter of detected VM forks since boot through read()? It might be interesting to know for userspace how many forks it missed the fork events for. Moreover it might be interesting to userspace to know if any fork happened so far *at* *all*, by checking if the counter is non-zero. (Ideally that counter file would even be mmapable...) does this file really belong below the `kernel.random` sysctl hierarchy? shouldn't this rather be something like /sys/kernel/vmclone_event or so? Most people see sysctl as a place to tweak settings with. But this sysctl you proposed here cannot be written, and (so far…) not even be read! So it sounds like a weird thing to place into /proc/sys/. Note that in userspace there's a whole infrastructure for managing sysctls, but this one doesn't look like it would be managable at all by userspace with these tools. Lennart -- Lennart Poettering, Berlin
On Mon, May 02, 2022 at 05:40:02PM +0200, Lennart Poettering wrote: > On Mo, 02.05.22 16:06, Jason A. Donenfeld (Jason@zx2c4.com) wrote: > > > In order to inform userspace of virtual machine forks, this commit adds > > a "fork_event" sysctl, which does not return any data, but allows > > userspace processes to poll() on it for notification of VM forks. > > > > It avoids exposing the actual vmgenid from the hypervisor to userspace, > > in case there is any randomness value in keeping it secret. Rather, > > userspace is expected to simply use getrandom() if it wants a fresh > > value. > > Wouldn't it make sense to expose a monotonic 64bit counter of detected > VM forks since boot through read()? It might be interesting to know > for userspace how many forks it missed the fork events for. Moreover it > might be interesting to userspace to know if any fork happened so far > *at* *all*, by checking if the counter is non-zero. "Might be interesting" is different from "definitely useful". I'm not going to add this without a clear use case. This feature is pretty narrowly scoped in its objectives right now, and I intend to keep it that way if possible. (And yes, I realize that is likely considerably different from your development philosophy.) > > (Ideally that counter file would even be mmapable...) You missed the last year of discussion about this and why we have wound up here as a first step. Check the archives for extensive discussion. Jason
On Mo, 02.05.22 18:12, Jason A. Donenfeld (Jason@zx2c4.com) wrote: > > > In order to inform userspace of virtual machine forks, this commit adds > > > a "fork_event" sysctl, which does not return any data, but allows > > > userspace processes to poll() on it for notification of VM forks. > > > > > > It avoids exposing the actual vmgenid from the hypervisor to userspace, > > > in case there is any randomness value in keeping it secret. Rather, > > > userspace is expected to simply use getrandom() if it wants a fresh > > > value. > > > > Wouldn't it make sense to expose a monotonic 64bit counter of detected > > VM forks since boot through read()? It might be interesting to know > > for userspace how many forks it missed the fork events for. Moreover it > > might be interesting to userspace to know if any fork happened so far > > *at* *all*, by checking if the counter is non-zero. > > "Might be interesting" is different from "definitely useful". I'm not > going to add this without a clear use case. This feature is pretty > narrowly scoped in its objectives right now, and I intend to keep it > that way if possible. Sure, whatever. I mean, if you think it's preferable to have 3 API abstractions for the same concept each for it's special usecase, then that's certainly one way to do things. I personally would try to figure out a modicum of generalization for things like this. But maybe that' just me… I can just tell you, that in systemd we'd have a usecase for consuming such a generation counter: we try to provide stable MAC addresses for synthetic network interfaces managed by networkd, so we hash them from /etc/machine-id, but otoh people also want them to change when they clone their VMs. We could very nicely solve this if we had a generation counter easily accessible from userspace, that starts at 0 initially. Because then we can hash as we always did when the counter is zero, but otherwise use something else, possibly hashed from the generation counter. But anyway, I understand you are not interested in generalization/other usecases, so I'll shut up. Lennart -- Lennart Poettering, Berlin
On 02.05.22 18:51, Lennart Poettering wrote: > On Mo, 02.05.22 18:12, Jason A. Donenfeld (Jason@zx2c4.com) wrote: > >>>> In order to inform userspace of virtual machine forks, this commit adds >>>> a "fork_event" sysctl, which does not return any data, but allows >>>> userspace processes to poll() on it for notification of VM forks. >>>> >>>> It avoids exposing the actual vmgenid from the hypervisor to userspace, >>>> in case there is any randomness value in keeping it secret. Rather, >>>> userspace is expected to simply use getrandom() if it wants a fresh >>>> value. >>> Wouldn't it make sense to expose a monotonic 64bit counter of detected >>> VM forks since boot through read()? It might be interesting to know >>> for userspace how many forks it missed the fork events for. Moreover it >>> might be interesting to userspace to know if any fork happened so far >>> *at* *all*, by checking if the counter is non-zero. >> "Might be interesting" is different from "definitely useful". I'm not >> going to add this without a clear use case. This feature is pretty >> narrowly scoped in its objectives right now, and I intend to keep it >> that way if possible. > Sure, whatever. I mean, if you think it's preferable to have 3 API > abstractions for the same concept each for it's special usecase, then > that's certainly one way to do things. I personally would try to > figure out a modicum of generalization for things like this. But maybe > that' just me… > > I can just tell you, that in systemd we'd have a usecase for consuming > such a generation counter: we try to provide stable MAC addresses for > synthetic network interfaces managed by networkd, so we hash them from > /etc/machine-id, but otoh people also want them to change when they > clone their VMs. We could very nicely solve this if we had a > generation counter easily accessible from userspace, that starts at 0 > initially. Because then we can hash as we always did when the counter > is zero, but otherwise use something else, possibly hashed from the > generation counter. > > But anyway, I understand you are not interested in > generalization/other usecases, so I'll shut up. Let's not turn this into a pit fight please :). I think it's a good idea to collect the use cases we all have and evaluate whether this patch is a good stepping stone towards the final solution. At the end of the road, what I would like to see is 1) A way for libraries such as s2n to identify that a clone occurred. Because it's a deep-down library with no access to its own thread or the main loop, it can not rely on poll/select. Mmap of a file however would work great, as you can create transactions on top of a 64bit mmap'ed value for example. We can have systemd generate and then provide such a file as long as we have an event based notification mechanism. 2) A way to notify larger applications (think Java here) that a system is going to be suspended soon so it can wipe PII before it gets cloned for example. 3) Notifications after clone so applications know they can regenerate VM unique data based on randomness. For 2 and 3, applications should have native support for these events (think of poll() on the fork file) as well as external script based support (think "invoke systemctl restart smbd.service on clone"). Lennart, looking at the current sysctl proposal, systemd could poll() on the fork file. It would then be able to generate a /run/fork-id file which it can use for the flow above, right? The sysctl proposal also gives us 3, if we implement the inhibitor proposal [1] in systemd. That leaves 2, which we don't have a hardware interface for yet. We can still get to at least script level automation with inhibitors and manually triggering a "hey, you'll be suspended soon" event via systemd. Overall, it sounds to me like the sysctl poll based kernel interface in this patch in combination with systemd inhibitors gives us an answer to most of the flows above. I can see attractiveness in providing the /run/fork-id directly from the kernel though, to remove the dependency on systemd for poll-less notification of libraries. Jason, how much complexity would it add to provide an mmap() and read() interface to a fork counter value to the sysctl? Read sounds like a trivial change on top of what you have already, mmap a bit more heavy lift. If we had both, it would allow us to implement a Linux standard fork detect path in libraries that does not rely on systemd. Alex [1] https://github.com/systemd/systemd/issues/20222 Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879
Hey Lennart, On Mon, May 02, 2022 at 06:51:19PM +0200, Lennart Poettering wrote: > On Mo, 02.05.22 18:12, Jason A. Donenfeld (Jason@zx2c4.com) wrote: > > > > > In order to inform userspace of virtual machine forks, this commit adds > > > > a "fork_event" sysctl, which does not return any data, but allows > > > > userspace processes to poll() on it for notification of VM forks. > > > > > > > > It avoids exposing the actual vmgenid from the hypervisor to userspace, > > > > in case there is any randomness value in keeping it secret. Rather, > > > > userspace is expected to simply use getrandom() if it wants a fresh > > > > value. > > > > > > Wouldn't it make sense to expose a monotonic 64bit counter of detected > > > VM forks since boot through read()? It might be interesting to know > > > for userspace how many forks it missed the fork events for. Moreover it > > > might be interesting to userspace to know if any fork happened so far > > > *at* *all*, by checking if the counter is non-zero. > > > > "Might be interesting" is different from "definitely useful". I'm not > > going to add this without a clear use case. This feature is pretty > > narrowly scoped in its objectives right now, and I intend to keep it > > that way if possible. > > Sure, whatever. I mean, if you think it's preferable to have 3 API > abstractions for the same concept each for it's special usecase, then > that's certainly one way to do things. I personally would try to > figure out a modicum of generalization for things like this. But maybe > that' just me… > > I can just tell you, that in systemd we'd have a usecase for consuming > such a generation counter: we try to provide stable MAC addresses for > synthetic network interfaces managed by networkd, so we hash them from > /etc/machine-id, but otoh people also want them to change when they > clone their VMs. We could very nicely solve this if we had a > generation counter easily accessible from userspace, that starts at 0 > initially. Because then we can hash as we always did when the counter > is zero, but otherwise use something else, possibly hashed from the > generation counter. This doesn't work, because you could have memory-A split into memory-A.1 and memory-A.2, and both A.2 and A.1 would ++counter, and wind up with the same new value "2". The solution is to instead have the hypervisor pass a unique value and a counter. We currently have a 16 byte unique value from the hypervisor, which I'm keeping as a kernel space secret for the RNG; we're waiting on a word-sized monotonic counter interface from hypervisors in the future. When we have the latter, then we can start talking about mmapable things. Your use case would probably be served by exposing that 16-byte unique value (hashed with some constant for safety I suppose), but I'm hesitant to start going down that route all at once, especially if we're to have a more useful counter in the future. Jason
Hi Alex, On Mon, May 02, 2022 at 07:59:08PM +0200, Alexander Graf wrote: > to collect the use cases we all have and evaluate whether this patch is > a good stepping stone towards the final solution. Indeed, I'm all for collecting use cases. What I meant to say is that we're not going to add something "just 'cuz"; I'd like to have some concrete things in mind. To date, I've basically had your s2n case in mind, but as you haven't responded to this in the last month, I started looking to see if this was useful elsewhere or if I should abandon it, so I filed this issue with the Go project: <https://github.com/golang/go/issues/52544>. We're over halfway through 5.18 now, and only at this point have you arrived to discuss and finalize things. So in all likelihood we'll wind up tabling this until 5.20 or never, since what I thought was an easy consensus before now apparently is not. > 1) A way for libraries such as s2n to identify that a clone occurred. > Because it's a deep-down library with no access to its own thread or the > main loop, it can not rely on poll/select. Mmap of a file however would > work great, as you can create transactions on top of a 64bit mmap'ed > value for example. I didn't realize that s2n can't poll. That's surprising. In the worst case, can't you just spawn a thread? > 2) A way to notify larger applications (think Java here) that a system > is going to be suspended soon so it can wipe PII before it gets cloned > for example. Suspension, like S3 power notification stuff? Talk to Rafael about that; this isn't related to the VM fork issue. I use those PM notifiers happily in kernel space but AFAICT, there's still no userspace thing for it. This seems orthogonal to this conversation though, so let's not veer off into that topic. If you didn't mean S3 but actually meant notification prior to snapshot taking, we don't have any virtual hardware for that, so it's a moot point. > 3) Notifications after clone so applications know they can regenerate VM > unique data based on randomness. You mean this as "the same as (1) but with poll() instead of mmap()", right? > Lennart, looking at the current sysctl proposal, systemd could poll() on > the fork file. It would then be able to generate a /run/fork-id file > which it can use for the flow above, right? For the reasons I gave in my last email to Lennart, I don't think there's a good way for systemd to generate a fork-id file on its own either. I don't think systemd should really be involved here as a provider of values, just as a potential consumer of what the kernel provides. > The sysctl proposal also gives us 3, if we implement the inhibitor > proposal [1] in systemd. These userspace components you're proposing seem like a lot of overengineering for little gain, which is why this conversation went nowhere when Amazon attempted all this year. But it sounds like you agree with me based on your remark below about systemd-less interfaces provided by the kernel. > Overall, it sounds to me like the sysctl poll based kernel interface in > this patch in combination with systemd inhibitors gives us an answer to > most of the flows above. > > I can see attractiveness in providing the /run/fork-id directly from the > kernel though, to remove the dependency on systemd for poll-less > notification of libraries. > > Jason, how much complexity would it add to provide an mmap() and read() > interface to a fork counter value to the sysctl? Read sounds like a > trivial change on top of what you have already, mmap a bit more heavy > lift. If we had both, it would allow us to implement a Linux standard > fork detect path in libraries that does not rely on systemd. mmap() does not give us anything if we're not going to expose the raw ACPI-mapped ID directly. It will still be a racy mechanism until we do that. So I think we should wait until there's a proper vmgenid word-sized counter to expose something mmap()able. If you have the energy to talk to Microsoft about this and make it happen, please be my guest. As I wrote at the beginning of this email. I haven't gotten a response from you at all about this stuff in quite some time, so I'm not really itching take that on alone now. Jason
On 02.05.22 20:04, Jason A. Donenfeld wrote: > Hey Lennart, > > On Mon, May 02, 2022 at 06:51:19PM +0200, Lennart Poettering wrote: >> On Mo, 02.05.22 18:12, Jason A. Donenfeld (Jason@zx2c4.com) wrote: >> >>>>> In order to inform userspace of virtual machine forks, this commit adds >>>>> a "fork_event" sysctl, which does not return any data, but allows >>>>> userspace processes to poll() on it for notification of VM forks. >>>>> >>>>> It avoids exposing the actual vmgenid from the hypervisor to userspace, >>>>> in case there is any randomness value in keeping it secret. Rather, >>>>> userspace is expected to simply use getrandom() if it wants a fresh >>>>> value. >>>> Wouldn't it make sense to expose a monotonic 64bit counter of detected >>>> VM forks since boot through read()? It might be interesting to know >>>> for userspace how many forks it missed the fork events for. Moreover it >>>> might be interesting to userspace to know if any fork happened so far >>>> *at* *all*, by checking if the counter is non-zero. >>> "Might be interesting" is different from "definitely useful". I'm not >>> going to add this without a clear use case. This feature is pretty >>> narrowly scoped in its objectives right now, and I intend to keep it >>> that way if possible. >> Sure, whatever. I mean, if you think it's preferable to have 3 API >> abstractions for the same concept each for it's special usecase, then >> that's certainly one way to do things. I personally would try to >> figure out a modicum of generalization for things like this. But maybe >> that' just me… >> >> I can just tell you, that in systemd we'd have a usecase for consuming >> such a generation counter: we try to provide stable MAC addresses for >> synthetic network interfaces managed by networkd, so we hash them from >> /etc/machine-id, but otoh people also want them to change when they >> clone their VMs. We could very nicely solve this if we had a >> generation counter easily accessible from userspace, that starts at 0 >> initially. Because then we can hash as we always did when the counter >> is zero, but otherwise use something else, possibly hashed from the >> generation counter. > This doesn't work, because you could have memory-A split into memory-A.1 > and memory-A.2, and both A.2 and A.1 would ++counter, and wind up with > the same new value "2". The solution is to instead have the hypervisor > pass a unique value and a counter. We currently have a 16 byte unique > value from the hypervisor, which I'm keeping as a kernel space secret > for the RNG; we're waiting on a word-sized monotonic counter interface > from hypervisors in the future. When we have the latter, then we can > start talking about mmapable things. Your use case would probably be > served by exposing that 16-byte unique value (hashed with some constant > for safety I suppose), but I'm hesitant to start going down that route > all at once, especially if we're to have a more useful counter in the > future. Michael, since we already changed the CID in the spec, can we add a property to the device that indicates the first 4 bytes of the UUID will always be different between parent and child? That should give us the ability to mmap the vmgenid directly to user space and act based on a simple u32 compare for clone notification, no? Thanks; Alex Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879
Err... On Mon, May 02, 2022 at 08:04:21PM +0200, Jason A. Donenfeld wrote: > This doesn't work, because you could have memory-A split into memory-A.1 > and memory-A.2, and both A.2 and A.1 would ++counter, and wind up with > the same new value "2". The solution is to instead have the hypervisor > pass a unique value and a counter. We currently have a 16 byte unique > value from the hypervisor, which I'm keeping as a kernel space secret > for the RNG; we're waiting on a word-sized monotonic counter interface > from hypervisors in the future. When we have the latter, then we can > start talking about mmapable things. Your use case would probably be > served by exposing that 16-byte unique value (hashed with some constant > for safety I suppose), but I'm hesitant to start going down that route > all at once, especially if we're to have a more useful counter in the > future. I kind of muddled things a bit by conflating two issues. I'd like the hypervisor to provide a counter so that we can mmap it to userspace so that userspace programs can do word-sized comparisons on mmap'd counters, avoiding the race that currently exists from relying on the async ACPI notification, which arrives after the system is already up and running. That's one thing, but not what we're talking about here with the MAC addresses. The point over here is that neither the guest *nor* the hypervisor can maintain a counter that actually represents something unique. A.1 and A.2 will both ++counter to the same value in the example above. The guest can't do it (neither in systemd nor in the kernel), because it will always start with the same counter value of A and ++ it to the same next value. The hypervisor can't do it either, because snapshots can be shipped around to different computers that aren't coordinated. So, put that way, the counter thing that I'd like wouldn't be for having a unique snapshot ID, but just as a mmap-able way of learning when a snapshot forks. It wouldn't be more useful than that. If you want a unique ID, we have two options for that: the first is exposing the vmgenid 16 byte value to userspace (which I don't want to do). The second is just calling getrandom() after you get a poll() notification, and that'll be guaranteed to be unique to that VM because of the vmgenid driver in 5.18. This last suggestion is thus what you should do for your MAC addresses. Jason
On Mon, May 02, 2022 at 08:34:38PM +0200, Alexander Graf wrote: > Michael, since we already changed the CID in the spec, can we add a > property to the device that indicates the first 4 bytes of the UUID will > always be different between parent and child? > > That should give us the ability to mmap the vmgenid directly to user > space and act based on a simple u32 compare for clone notification, no? That is not a good idea. We want an _additional_ 4 bytes, so that we can keep the first 16 bytes (128 bits) as a kernel space secret. Jason
On 02.05.22 20:46, Jason A. Donenfeld wrote: > On Mon, May 02, 2022 at 08:34:38PM +0200, Alexander Graf wrote: >> Michael, since we already changed the CID in the spec, can we add a >> property to the device that indicates the first 4 bytes of the UUID will >> always be different between parent and child? >> >> That should give us the ability to mmap the vmgenid directly to user >> space and act based on a simple u32 compare for clone notification, no? > That is not a good idea. We want an _additional_ 4 bytes, so that we can > keep the first 16 bytes (128 bits) as a kernel space secret. An additional 4 bytes would be an additional 4kb (or 64kb on ARM) page. Do we really rely on these 16 bytes to reseed after clone? If so, we'd need to bite the bullet and provide an additional page, yes. Alex Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879
Hey Jason, On 02.05.22 20:29, Jason A. Donenfeld wrote: > Hi Alex, > > On Mon, May 02, 2022 at 07:59:08PM +0200, Alexander Graf wrote: >> to collect the use cases we all have and evaluate whether this patch is >> a good stepping stone towards the final solution. > Indeed, I'm all for collecting use cases. What I meant to say is that > we're not going to add something "just 'cuz"; I'd like to have some > concrete things in mind. > > To date, I've basically had your s2n case in mind, but as you haven't > responded to this in the last month, I started looking to see if this Unfortunate vacation timing on my side I suppose :) > was useful elsewhere or if I should abandon it, so I filed this issue > with the Go project: <https://github.com/golang/go/issues/52544>. We're > over halfway through 5.18 now, and only at this point have you arrived > to discuss and finalize things. So in all likelihood we'll wind up > tabling this until 5.20 or never, since what I thought was an easy > consensus before now apparently is not. So far I see little that would block your patch? It seems to go into a good direction from all I can tell. >> 1) A way for libraries such as s2n to identify that a clone occurred. >> Because it's a deep-down library with no access to its own thread or the >> main loop, it can not rely on poll/select. Mmap of a file however would >> work great, as you can create transactions on top of a 64bit mmap'ed >> value for example. > I didn't realize that s2n can't poll. That's surprising. In the worst > case, can't you just spawn a thread? You block the thread on poll, so the only option is a thread. I was until now always under the working assumption that we can't do this in a thread because you don't want your single threaded application to turn into a pthreaded one, but you make me wonder. Let me check with Torben tomorrow. > >> 2) A way to notify larger applications (think Java here) that a system >> is going to be suspended soon so it can wipe PII before it gets cloned >> for example. > Suspension, like S3 power notification stuff? Talk to Rafael about that; Whether you go running -> S3 -> clone or you go running -> paused -> clone is an implementation detail I'm not terribly worried about. Users can do either, because on both cases the VM is in paused state. > this isn't related to the VM fork issue. I use those PM notifiers > happily in kernel space but AFAICT, there's still no userspace thing for > it. This seems orthogonal to this conversation though, so let's not veer > off into that topic. > > If you didn't mean S3 but actually meant notification prior to snapshot > taking, we don't have any virtual hardware for that, so it's a moot > point. I think we'll want to have an external button similar to the ACPI sleep button or lid close event eventually, so that we can loop the VM in on the suspend path the same way S3 does. Today we don't have it, I agree. And if it's possible to maintain the same user space interface with this in mind, all is good. Let's just keep it in mind as something that will probably come eventually so that we don't redesign the UAPI in a year from now. > >> 3) Notifications after clone so applications know they can regenerate VM >> unique data based on randomness. > You mean this as "the same as (1) but with poll() instead of mmap()", > right? Yes :) > >> Lennart, looking at the current sysctl proposal, systemd could poll() on >> the fork file. It would then be able to generate a /run/fork-id file >> which it can use for the flow above, right? > For the reasons I gave in my last email to Lennart, I don't think > there's a good way for systemd to generate a fork-id file on its own > either. I don't think systemd should really be involved here as a > provider of values, just as a potential consumer of what the kernel > provides. Yes, systemd would poll on fork_event. When that returns, it reads /dev/urandom and writes a new /run/fork-id file with that. Libraries that don't want to be in the business of spawning threads can use that file to identify that they were cloned. Systemd can use that id as seed for networkd. > >> The sysctl proposal also gives us 3, if we implement the inhibitor >> proposal [1] in systemd. > These userspace components you're proposing seem like a lot of > overengineering for little gain, which is why this conversation went > nowhere when Amazon attempted all this year. But it sounds like you > agree with me based on your remark below about systemd-less interfaces > provided by the kernel. I would be happy to get to 99% of this with pure kernel based interfaces. The reason the kernel conversation seemingly went nowhere was not because of "overengineering". It was because the review feedback eventually was "This is a file, make user space manage it". >> Overall, it sounds to me like the sysctl poll based kernel interface in >> this patch in combination with systemd inhibitors gives us an answer to >> most of the flows above. >> >> I can see attractiveness in providing the /run/fork-id directly from the >> kernel though, to remove the dependency on systemd for poll-less >> notification of libraries. >> >> Jason, how much complexity would it add to provide an mmap() and read() >> interface to a fork counter value to the sysctl? Read sounds like a >> trivial change on top of what you have already, mmap a bit more heavy >> lift. If we had both, it would allow us to implement a Linux standard >> fork detect path in libraries that does not rely on systemd. > mmap() does not give us anything if we're not going to expose the raw > ACPI-mapped ID directly. It will still be a racy mechanism until we do > that. So I think we should wait until there's a proper vmgenid > word-sized counter to expose something mmap()able. If you have the > energy to talk to Microsoft about this and make it happen, please be my > guest. As I wrote at the beginning of this email. I haven't gotten a > response from you at all about this stuff in quite some time, so I'm not > really itching take that on alone now. Absolutely! Let's see where it goes :) Alex Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879
On Mon, May 02, 2022 at 08:56:05PM +0200, Alexander Graf wrote: > > On 02.05.22 20:46, Jason A. Donenfeld wrote: > > On Mon, May 02, 2022 at 08:34:38PM +0200, Alexander Graf wrote: > >> Michael, since we already changed the CID in the spec, can we add a > >> property to the device that indicates the first 4 bytes of the UUID will > >> always be different between parent and child? > >> > >> That should give us the ability to mmap the vmgenid directly to user > >> space and act based on a simple u32 compare for clone notification, no? > > That is not a good idea. We want an _additional_ 4 bytes, so that we can > > keep the first 16 bytes (128 bits) as a kernel space secret. > > > An additional 4 bytes would be an additional 4kb (or 64kb on ARM) page. > Do we really rely on these 16 bytes to reseed after clone? If so, we'd > need to bite the bullet and provide an additional page, yes. Ugh, you're right; memory mapping is pages. The other option would be relying on RDRAND (both existing and being trusted by the user etc), but generally people aren't too jazzed about that. We pretty much have to assume that the existing pool is compromised, since people share cloned VMs casually. The 128-bit vmgenid is a nice input to have.
On 02.05.22 21:27, Jason A. Donenfeld wrote: > On Mon, May 02, 2022 at 08:56:05PM +0200, Alexander Graf wrote: >> On 02.05.22 20:46, Jason A. Donenfeld wrote: >>> On Mon, May 02, 2022 at 08:34:38PM +0200, Alexander Graf wrote: >>>> Michael, since we already changed the CID in the spec, can we add a >>>> property to the device that indicates the first 4 bytes of the UUID will >>>> always be different between parent and child? >>>> >>>> That should give us the ability to mmap the vmgenid directly to user >>>> space and act based on a simple u32 compare for clone notification, no? >>> That is not a good idea. We want an _additional_ 4 bytes, so that we can >>> keep the first 16 bytes (128 bits) as a kernel space secret. >> >> An additional 4 bytes would be an additional 4kb (or 64kb on ARM) page. >> Do we really rely on these 16 bytes to reseed after clone? If so, we'd >> need to bite the bullet and provide an additional page, yes. > Ugh, you're right; memory mapping is pages. The other option would be > relying on RDRAND (both existing and being trusted by the user etc), but > generally people aren't too jazzed about that. We pretty much have to > assume that the existing pool is compromised, since people share cloned > VMs casually. The 128-bit vmgenid is a nice input to have. I can see the merit. So yes, we'd want a second function to the VM_GEN_COUNTER device in addition to "ADDR" that - in a fully user space mappable separate page - gives us a 32-bit vmgenid that is guaranteed to be different from the previous one. Alex Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879
Hi Alex, On Mon, May 02, 2022 at 08:57:01PM +0200, Alexander Graf wrote: > So far I see little that would block your patch? It seems to go into a > good direction from all I can tell. With kernel code for the kernel, if it's good enough, it's good enough, and can be made better later. With kernel code for userspace, it might not be possible to make it better later, so if we've got a plethora of options, it makes the incrementalism I was enthusiastic about a little bit more risky. Here's where we're at: - /proc/sys has this easy to use poll() interface [which currently breaks normal poll() semantics; see patch 1/2]. It'd be nice to use this as an async notifier interface. - A race-free mmap()d mechanism needs virtual hardware support to be useful, so I don't want to add it now. But when we do have it, where should it be added? * Atop the same /proc/sys file that this patch has returning -ENODATA at the moment? /proc/sys doesn't do mmap now so that'd be some plumbing work, which would be pretty odd for the sysctl abstraction. * Atop some new file elsewhere? Not hard to do, but also begs the question of which driver should do it, where the file should be, and a whole host of bikesheddy questions. * In the VDSO? This would make the most sense to me if we're ever going to do a VDSO RNG, and to design it along with the userspace RNG stuff people think they want, even if it's maybe not the best idea in the end. This is a huge can of worms. Termites, even, and they'll eat through the beams of your house. * Nowhere, because even though this race-free mechanism is attractive, it's way too hard to program for, and nobody is going to use it, and not many people care /that/ much about having this mechanism being 100% race-free. Colm said Amazon doesn't, for example, since it uses quiescent states. And apparently Microsoft doesn't care much either, and they don't even use quiescent states. Maybe I'm the only one raising a fuss about the race window (while hypocritically also not being very eager to write code around an interface that has to check a variable all the time). - A file in /proc/sys can return -ENODATA, as this patch has it doing, or it can return something else: * A counter. Flawed for reasons discussed elsewhere. * A UUID, just like /proc/sys/kernel/random/boot_id, except we'd name it generation_id or something and it'd change on VM fork. An earlier version of this patch did that. * -ENODATA, as this patch does, because each caller can get a new unique value with getrandom(), as there's no point in having some global identifier like with boot_id. This is what I'm leaning toward. It seems like getrandom() would handle Lennart's case, Go's case, s2n's case, and we don't need a particular ID. * The direct value of the ACPI vmgenid value. Discussed in earlier emails; I don't want to do this. - You don't care about race-free, but you do care about mmap(), for "programming reasons" in s2n. I'm not yet sympathetic to this case :-P. The point of mmap() should be for the race-free stuff. Otherwise poll() should do the job, and if you can't poll, you can hack around it with another process or a thread or whatever other trick. Your systemd proposal with a file was just punting the idea elsewhere in userspace (which I think is a bit ugly too...). This state of affairs inclines me to: - Have a /proc/sys/kernel/random/fork_event that has a poll() interface with reads returning -ENODATA (this patch). - Try to convince whomever that poll()ing on fork_event and then calling getrandom() for a process-local snapshot ID is better than poll()ing on and subsequently read()ing a /proc/sys/kernel/random/generation_id that returns a global UUID like boot_id. [Lennart - I'm specifically interested in trying to convince you of this. Thoughts?] - Given that you are okay with an async mechanism, figure out how poll() can work for s2n rather than mmap(). - Not really pursue the race-free mmap() thing unless Microsoft goes full steam ahead with it because it's complicated to program for and maybe (or maybe not? I don't know? data please?) a theoretical concern. > You block the thread on poll, so the only option is a thread. I was > until now always under the working assumption that we can't do this in a > thread because you don't want your single threaded application to turn > into a pthreaded one, but you make me wonder. Let me check with Torben > tomorrow. Can't you use raw clone() and just have a super dirty thread that shares a single page mapping and nothing else? Poll on the pidfd of the parent and the fork_event fd, quit on the former, and ++ on the latter. > >> 2) A way to notify larger applications (think Java here) that a system > >> is going to be suspended soon so it can wipe PII before it gets cloned > >> for example. > > Suspension, like S3 power notification stuff? Talk to Rafael about that; > > > Whether you go running -> S3 -> clone or you go running -> paused -> > clone is an implementation detail I'm not terribly worried about. Users > can do either, because on both cases the VM is in paused state. Ahh, I see. https://lore.kernel.org/lkml/20220501123849.3858-1-Jason@zx2c4.com/ might interest you. Jason
On Mo, 02.05.22 20:04, Jason A. Donenfeld (Jason@zx2c4.com) wrote: > > I can just tell you, that in systemd we'd have a usecase for consuming > > such a generation counter: we try to provide stable MAC addresses for > > synthetic network interfaces managed by networkd, so we hash them from > > /etc/machine-id, but otoh people also want them to change when they > > clone their VMs. We could very nicely solve this if we had a > > generation counter easily accessible from userspace, that starts at 0 > > initially. Because then we can hash as we always did when the counter > > is zero, but otherwise use something else, possibly hashed from the > > generation counter. > > This doesn't work, because you could have memory-A split into memory-A.1 > and memory-A.2, and both A.2 and A.1 would ++counter, and wind up with > the same new value "2". Yes, that's why I as vague about what to switch to if the counter is non-zero, i.e. "something else, *possibly* hashed…". For this MAC address usecase it's entirely sufficient to be able to distinguish if the system was closed at all, i.e. if the counter is zero or is non-zero. Because that would already be great for a policy of "hash it in a stable way from /etc/machine-id, if counter == 0" + "use random MAC once counter > 0". Such a MAC address policy I think should probably even be the new default in networkd, if we could implement it. For that we'd need a single bit of info from the kernel, indicating whether the sysem was cloned at all. i.e. if the vmgenid uuid is different from the one the system booted up first. Lennart -- Lennart Poettering, Berlin
On Mo, 02.05.22 19:59, Alexander Graf (graf@amazon.com) wrote: > Lennart, looking at the current sysctl proposal, systemd could poll() on the > fork file. It would then be able to generate a /run/fork-id file which it > can use for the flow above, right? I am not to keen on making sytemd such a proxy. Sounds like something the kernel could do on its own, better and resulting in an ultimately simpler system... If systemd its the proxy this adds in extra raciness. i.e. in a ideal world, if we have some form of notification fd, then it would be great if that fd is guaranteed to have POLLIN set and its contents updated the instant the clone happened. But if we proxy this through userspace, there's necessarily a latency involved that it it takes userspace to catch up and effect the POLLIN and updated contents towards its client apps. I understand the underlying VM hypervisor APIs currently are designed to always imply some notification latency. Which sucks, but I think we should be very careful with replicating this design mistake with any userspace APIs we add. i.e. I am pretty sure that even if the underlying VM hypervisor primitive isn't as good as we wanted, the Linux kernel→userspace API should be built so that if one day a better VM hypervisor interface exists it can be plugged behind it without such limitations. > Overall, it sounds to me like the sysctl poll based kernel interface in this > patch in combination with systemd inhibitors gives us an answer to most of > the flows above. As mentioned earlier, I am not convinced sysctl is the right place for this. sysctls are understood by most people as being the place for tweaking kernel settings. This is not a kernel setting, but a notification concept, and the way Jason defined it there's nothing to read nor write, which strongly suggests to move it elsewhere, but not /proc/sys/. Use /sys/kernel/ or so. Or maybe O_NOTIFICATION_PIPE or whatever, but /proc/sys/ looks really wrong. > I can see attractiveness in providing the /run/fork-id directly from the > kernel though, to remove the dependency on systemd for poll-less > notification of libraries. I agree. Lennart -- Lennart Poettering, Berlin
Hey Lennart, On Tue, May 03, 2022 at 09:42:40AM +0200, Lennart Poettering wrote: > For this MAC address usecase it's entirely sufficient to be able to > distinguish if the system was closed at all, i.e. if the counter is > zero or is non-zero. Because that would already be great for a policy > of "hash it in a stable way from /etc/machine-id, if counter == 0" + > "use random MAC once counter > 0". Hm, are you sure that's actually what you want? It turns out this vmgenid notification from the hypervisor might not be sufficiently granular for this use case: - vmgenid changes when you fork a new snapshot, so now you have two VMs - vmgenid also changes when you rewind to 2 minutes ago The first is what I assume you care about for this networkd business. The second is probably not what any networkd user expects. [Aside: I hope there are few networkd users; having seen what Yu did with wireguard and how fast and recklessly that went, I can't recommend that part of systemd to anyone.] From the perspective of randomness, both of these events imply the same thing. The situation is BAD; reseed immediately. From the perspective of MAC addresses, though, these events would imply different behavior, right? So it seems like vmgenid might need an additional field for this use case. Relatedly, VMware has that prompt where you select about your VM whether, "I moved it" or "I copied it." Presumably something like that would play a part in what is decided as part of this hypothetical second field. Let me know if this seems right to you, or if actually you had in mind changing MAC addresses in both cases instead. Jason
On Di, 03.05.22 11:08, Jason A. Donenfeld (Jason@zx2c4.com) wrote: > Hey Lennart, > > On Tue, May 03, 2022 at 09:42:40AM +0200, Lennart Poettering wrote: > > For this MAC address usecase it's entirely sufficient to be able to > > distinguish if the system was closed at all, i.e. if the counter is > > zero or is non-zero. Because that would already be great for a policy > > of "hash it in a stable way from /etc/machine-id, if counter == 0" + > > "use random MAC once counter > 0". > > Hm, are you sure that's actually what you want? It turns out this > vmgenid notification from the hypervisor might not be sufficiently > granular for this use case: > > - vmgenid changes when you fork a new snapshot, so now you have two VMs > - vmgenid also changes when you rewind to 2 minutes ago > > The first is what I assume you care about for this networkd business. > The second is probably not what any networkd user expects. So first of all, it appears to me that rewinding a VM is something people would do for debugging things, i.e. not how things are done on deployment. > >From the perspective of randomness, both of these events imply the same > thing. The situation is BAD; reseed immediately. From the perspective of > MAC addresses, though, these events would imply different behavior, > right? So it seems like vmgenid might need an additional field for this > use case. Relatedly, VMware has that prompt where you select about your > VM whether, "I moved it" or "I copied it." Presumably something like > that would play a part in what is decided as part of this hypothetical > second field. networkd doesn't change MAC addresses in the middle of everything, but only when a network interface is downed and upped again. This for example happens when a link beat goes away and comes back. In the rewind-2min case i'd assume the link beat would probably be restored to what it was 2min ago (and thus stay online), but in the clone case it would probably drop momentarily and be restored than, to tell software to reacquire dhcp and so on. or in other words: if the hypervisor wants the system to reconfigure/reacquire its network there are explicit ways already, and they work afaics. what's missing tehre is simply a reasonable way to ensure that we won't end up sticking to the same fixed MAC address when we set things up in order to acquire a new DHCP lease. So I am not too concerned. Lennart -- Lennart Poettering, Berlin
On Tue, May 03, 2022 at 11:32:26AM +0200, Lennart Poettering wrote: > So first of all, it appears to me that rewinding a VM is something people > would do for debugging things, i.e. not how things are done on > deployment. I wouldn't be so sure here... Some people have processes around, "always start out from the same place", like for build machines, and employ a single VM snapshot that's always rewound after use. It's never forked into multiple snapshots, but just always goes back to that same starting point. > > >From the perspective of randomness, both of these events imply the same > > thing. The situation is BAD; reseed immediately. From the perspective of > > MAC addresses, though, these events would imply different behavior, > > right? So it seems like vmgenid might need an additional field for this > > use case. Relatedly, VMware has that prompt where you select about your > > VM whether, "I moved it" or "I copied it." Presumably something like > > that would play a part in what is decided as part of this hypothetical > > second field. > > networkd doesn't change MAC addresses in the middle of everything, but > only when a network interface is downed and upped again. This for > example happens when a link beat goes away and comes back. In the > rewind-2min case i'd assume the link beat would probably be restored > to what it was 2min ago (and thus stay online), but in the clone case > it would probably drop momentarily and be restored than, to tell > software to reacquire dhcp and so on. That sounds like it's going to be sort of confusing. Let's say we've got some VM scenario in which rewinds are common due to whatever weird process (such as a build machine that wants to start out at the same place each time). During its course of execution, it reboots, or maybe there's some network connectivity issue and the link goes down. In that case, when the link comes up, it's going to have a different MAC address? That doesn't make much sense to me, but maybe I'm missing some bigger picture detail. Jason
On Tue, May 03, 2022 at 10:29:49AM +0200, Lennart Poettering wrote: > As mentioned earlier, I am not convinced sysctl is the right place for > this. sysctls are understood by most people as being the place for > tweaking kernel settings. This is not a kernel setting, but a > notification concept, and the way Jason defined it there's nothing to > read nor write, which strongly suggests to move it elsewhere, but not > /proc/sys/. I think I'm coming around to this view that having a sysctl return -ENODATA is weird. It makes `sysctl -a` always complain to stderr, for example, which seems bad. > > I can see attractiveness in providing the /run/fork-id directly from the > > kernel though, to remove the dependency on systemd for poll-less > > notification of libraries. > > I agree. I'm still not convinced there's value in having a counter or a UUID, but if you had to choose, would you prefer a counter or a UUID? It sounds like the former, because you see a use for distinguishing between zero and non-zero? Or did you finally agree with me that vmgenid isn't granular enough for that? Jason
On Di, 03.05.22 13:55, Jason A. Donenfeld (Jason@zx2c4.com) wrote: > I'm still not convinced there's value in having a counter or a UUID, but > if you had to choose, would you prefer a counter or a UUID? It sounds > like the former, because you see a use for distinguishing between zero > and non-zero? Or did you finally agree with me that vmgenid isn't > granular enough for that? I would prefer a monotonic counter, since it allows answering questions like the following: 1. Did this image get cloned at all? (i.e. counter != 0; usecase as per the MAC address discussion) 2. Did the image get cloned since the last time I looked? (i.e. counter != my_previously_saved_counter; usecase: detect clones in an "offline" fashion, i.e. from a component that doesn't continously run, but only from time to time) 3. How many clones did I miss? (i.e. missed_clones = my_previously_saved_counter - counter; usecase: catch up with generating proxy D-Bus signal messages for clones). There might be more. Using a UUID would not give us #1 or #3. It would deliver #2 however. Lennart -- Lennart Poettering, Berlin
On Di, 03.05.22 12:07, Jason A. Donenfeld (Jason@zx2c4.com) wrote: > I wouldn't be so sure here... Some people have processes around, "always > start out from the same place", like for build machines, and employ a > single VM snapshot that's always rewound after use. It's never forked > into multiple snapshots, but just always goes back to that same starting > point. At sometimes it's futile phantasizing which exotic usecases people might have. > > > >From the perspective of randomness, both of these events imply the same > > > thing. The situation is BAD; reseed immediately. From the perspective of > > > MAC addresses, though, these events would imply different behavior, > > > right? So it seems like vmgenid might need an additional field for this > > > use case. Relatedly, VMware has that prompt where you select about your > > > VM whether, "I moved it" or "I copied it." Presumably something like > > > that would play a part in what is decided as part of this hypothetical > > > second field. > > > > networkd doesn't change MAC addresses in the middle of everything, but > > only when a network interface is downed and upped again. This for > > example happens when a link beat goes away and comes back. In the > > rewind-2min case i'd assume the link beat would probably be restored > > to what it was 2min ago (and thus stay online), but in the clone case > > it would probably drop momentarily and be restored than, to tell > > software to reacquire dhcp and so on. > > That sounds like it's going to be sort of confusing. Let's say we've got > some VM scenario in which rewinds are common due to whatever weird > process (such as a build machine that wants to start out at the same > place each time). During its course of execution, it reboots, or maybe > there's some network connectivity issue and the link goes down. In that > case, when the link comes up, it's going to have a different MAC > address? That doesn't make much sense to me, but maybe I'm missing some > bigger picture detail. It's still better than sticking to the same MAC address for all clones in all cases... Dunno, in systemd the MAC address policies are configurable, for a reason. We'll never find a default that really makes everyone happy. Some people prefer the anonmymity of randomized MAC addresses, others like the stability promises of hashed MAC addresses. We support both policies. I think it would make sense to add a policy that says "stable MAC until the first clone, then random" for example. In fact I think it's a choice that has the chance of being a better default than the current "always stable" approach we employ. So at the very least we should have the option to come up with a policy taking vm generations into account, it's a separate discussion to decide whether to make it opt-in or the default then, and I doubt that part of the discussion really matters here... Lennart -- Lennart Poettering, Berlin
From: Alexander Graf <graf@amazon.com> Sent: Monday, May 2, 2022 11:35 AM > > On 02.05.22 20:04, Jason A. Donenfeld wrote: > > Hey Lennart, > > > > On Mon, May 02, 2022 at 06:51:19PM +0200, Lennart Poettering wrote: > >> On Mo, 02.05.22 18:12, Jason A. Donenfeld (Jason@zx2c4.com) wrote: > >> > >>>>> In order to inform userspace of virtual machine forks, this commit adds > >>>>> a "fork_event" sysctl, which does not return any data, but allows > >>>>> userspace processes to poll() on it for notification of VM forks. > >>>>> > >>>>> It avoids exposing the actual vmgenid from the hypervisor to userspace, > >>>>> in case there is any randomness value in keeping it secret. Rather, > >>>>> userspace is expected to simply use getrandom() if it wants a fresh > >>>>> value. > >>>> Wouldn't it make sense to expose a monotonic 64bit counter of detected > >>>> VM forks since boot through read()? It might be interesting to know > >>>> for userspace how many forks it missed the fork events for. Moreover it > >>>> might be interesting to userspace to know if any fork happened so far > >>>> *at* *all*, by checking if the counter is non-zero. > >>> "Might be interesting" is different from "definitely useful". I'm not > >>> going to add this without a clear use case. This feature is pretty > >>> narrowly scoped in its objectives right now, and I intend to keep it > >>> that way if possible. > >> Sure, whatever. I mean, if you think it's preferable to have 3 API > >> abstractions for the same concept each for it's special usecase, then > >> that's certainly one way to do things. I personally would try to > >> figure out a modicum of generalization for things like this. But maybe > >> that' just me… > >> > >> I can just tell you, that in systemd we'd have a usecase for consuming > >> such a generation counter: we try to provide stable MAC addresses for > >> synthetic network interfaces managed by networkd, so we hash them from > >> /etc/machine-id, but otoh people also want them to change when they > >> clone their VMs. We could very nicely solve this if we had a > >> generation counter easily accessible from userspace, that starts at 0 > >> initially. Because then we can hash as we always did when the counter > >> is zero, but otherwise use something else, possibly hashed from the > >> generation counter. > > This doesn't work, because you could have memory-A split into memory-A.1 > > and memory-A.2, and both A.2 and A.1 would ++counter, and wind up with > > the same new value "2". The solution is to instead have the hypervisor > > pass a unique value and a counter. We currently have a 16 byte unique > > value from the hypervisor, which I'm keeping as a kernel space secret > > for the RNG; we're waiting on a word-sized monotonic counter interface > > from hypervisors in the future. When we have the latter, then we can > > start talking about mmapable things. Your use case would probably be > > served by exposing that 16-byte unique value (hashed with some constant > > for safety I suppose), but I'm hesitant to start going down that route > > all at once, especially if we're to have a more useful counter in the > > future. > > > Michael, since we already changed the CID in the spec, can we add a > property to the device that indicates the first 4 bytes of the UUID will > always be different between parent and child? > > That should give us the ability to mmap the vmgenid directly to user > space and act based on a simple u32 compare for clone notification, no? > I'm not ignoring this request, but my interpretation of the subsequent discussion is that it's probably not the path that we want to go down anyway. Is that a correct interpretation? Also, the chances of getting the Windows team to focus on a revision to the spec are not high, especially a revision that has new semantics. :-( Getting the new CID added was a relatively low bar, though I'm still trying to get the publicly available version of the spec updated to include the new CID. Michael
On Mon, 2022-05-02 at 16:06 +0200, Jason A. Donenfeld wrote: > In order to inform userspace of virtual machine forks, this commit adds > a "fork_event" sysctl, which does not return any data, but allows > userspace processes to poll() on it for notification of VM forks. > > It avoids exposing the actual vmgenid from the hypervisor to userspace, > in case there is any randomness value in keeping it secret. Rather, > userspace is expected to simply use getrandom() if it wants a fresh > value. > > For example, the following snippet can be used to print a message every > time a VM forks, after the RNG has been reseeded: > > struct pollfd fd = { .fd = open("/proc/sys/kernel/random/fork_event", O_RDONLY) }; > assert(fd.fd >= 0); > for (;;) { > read(fd.fd, NULL, 0); > assert(poll(&fd, 1, -1) > 0); > puts("vm fork detected"); > } > > Various programs and libraries that utilize cryptographic operations > depending on fresh randomness can invalidate old keys or take other > appropriate actions when receiving that event. While this is racier than > allowing userspace to mmap/vDSO the vmgenid itself, it's an incremental > step forward that's not as heavyweight. At your request teleporting here the answer I gave on a different thread, reinforced by some thinking. As a user space crypto library person I think the only reasonable interface is something like a vDSO. Poll() interfaces are nice and all for system programs that have full control of their event loop and do not have to react immediately to this event, however crypto libraries do not have the luxury of controlling the main loop of the application. Additionally crypto libraries really need to ensure the value they return from their PRNG is fine, which means they do not return a value if the vmgenid has changed before they can reseed, or there could be catastrophic duplication of "random" values used in IVs or ECDSA Signatures or ids/cookies or whatever. For crypto libraries it is much simpler to poll for this information than using notifications of any kind given libraries are generally not in full control of what the process does. This needs to be polled fast as well, because the whole point of initializing a PRNG in the library is that asking /dev/urandom all the time is too slow (due to context switches and syscall overhead), so anything that would require a context switch in order to pull data from the PRNG would not really fly. A vDSO or similar would allow to pull the vmgenid or whatever epoch value in before generating the random numbers and then barrier-style check that the value is still unchanged before returning the random data to the caller. This will reduce the race condition (which simply cannot be completely avoided) to a very unlikely event. HTH, Simo.
Hi Simo, On Tue, May 10, 2022 at 08:40:48PM -0400, Simo Sorce wrote: > At your request teleporting here the answer I gave on a different > thread, reinforced by some thinking. > > As a user space crypto library person I think the only reasonable > interface is something like a vDSO. > > Poll() interfaces are nice and all for system programs that have full > control of their event loop and do not have to react immediately to > this event, however crypto libraries do not have the luxury of > controlling the main loop of the application. > > Additionally crypto libraries really need to ensure the value they > return from their PRNG is fine, which means they do not return a value > if the vmgenid has changed before they can reseed, or there could be > catastrophic duplication of "random" values used in IVs or ECDSA > Signatures or ids/cookies or whatever. > > For crypto libraries it is much simpler to poll for this information > than using notifications of any kind given libraries are > generally not in full control of what the process does. > > This needs to be polled fast as well, because the whole point of > initializing a PRNG in the library is that asking /dev/urandom all the > time is too slow (due to context switches and syscall overhead), so > anything that would require a context switch in order to pull data from > the PRNG would not really fly. > > A vDSO or similar would allow to pull the vmgenid or whatever epoch > value in before generating the random numbers and then barrier-style > check that the value is still unchanged before returning the random > data to the caller. This will reduce the race condition (which simply > cannot be completely avoided) to a very unlikely event. It sounds like your library issue is somewhat similar to what Alex was talking about with regards to having a hard time using poll in s2n. I'm still waiting to hear if Amazon figured out some way that this is possible (with, e.g., a thread). But anyway, it seems like this is something library authors might hit. My proposal here is made with nonce reuse in mind, for things like session keys that use sequential nonces. A different issue is random nonces. For these, it seems like a call to getrandom() for each nonce is probably the best bet. But it sounds like you're interested in a userspace RNG, akin to OpenBSD's arc4random(3). I hope you saw these threads: - https://lore.kernel.org/lkml/YnA5CUJKvqmXJxf2@zx2c4.com/ - https://lore.kernel.org/lkml/Yh4+9+UpanJWAIyZ@zx2c4.com/ - https://lore.kernel.org/lkml/CAHmME9qHGSF8w3DoyCP+ud_N0MAJ5_8zsUWx=rxQB1mFnGcu9w@mail.gmail.com/ Each one of those touches on vDSO things quite a bit. Basically, the motivation for doing that is for making userspace RNGs safe and promoting their use with a variety of kernel enhancements to make that easy. And IF we are to ship a vDSO RNG, then certainly this vmgenid business should be exposed that way, over and above other mechanisms. It'd make the most sense...IF we're going to ship a vDSO RNG. So the question really is: should we ship a vDSO RNG? I could work on designing that right. But I'm a little bit skeptical generally of the whole userspace RNG concept. By and large they always turn out to be less safe and more complex than the kernel one. So if we're to go that way, I'd like to understand what the strongest arguments for it are. Jason
Hi Jason, On Wed, 2022-05-11 at 03:18 +0200, Jason A. Donenfeld wrote: > My proposal here is made with nonce reuse in mind, for things like > session keys that use sequential nonces. Although this makes sense the problem is that changing applications to do the right thing based on which situation they are in will never be done right or soon enough. So I would focus on a solution that makes the CSPRNGs in crypto libraries safe. > A different issue is random nonces. For these, it seems like a call to > getrandom() for each nonce is probably the best bet. But it sounds like > you're interested in a userspace RNG, akin to OpenBSD's arc4random(3). I > hope you saw these threads: > > - https://lore.kernel.org/lkml/YnA5CUJKvqmXJxf2@zx2c4.com/ > - https://lore.kernel.org/lkml/Yh4+9+UpanJWAIyZ@zx2c4.com/ > - https://lore.kernel.org/lkml/CAHmME9qHGSF8w3DoyCP+ud_N0MAJ5_8zsUWx=rxQB1mFnGcu9w@mail.gmail.com/ 4c does sound like a decent solution, it is semantically identical to an epoch vmgenid, all the library needs to do is to create such a mmap region, stick a value on it, verify it is not zero after computing the next random value but before returning it to the caller. This reduces the race to a very small window when the machine is frozen right after the random value is returned to the caller but before it is used, but hopefully this just means that the two machines will just make parallel computations that yield the exact same value, so no catastrophic consequence will arise (there is the odd case where two random values are sought and the split happens between the two are retrieved and this has bad consequences, I think we can ignore that). > Each one of those touches on vDSO things quite a bit. Basically, the > motivation for doing that is for making userspace RNGs safe and > promoting their use with a variety of kernel enhancements to make that > easy. And IF we are to ship a vDSO RNG, then certainly this vmgenid > business should be exposed that way, over and above other mechanisms. > It'd make the most sense...IF we're going to ship a vDSO RNG. > > So the question really is: should we ship a vDSO RNG? I could work on > designing that right. But I'm a little bit skeptical generally of the > whole userspace RNG concept. By and large they always turn out to be > less safe and more complex than the kernel one. So if we're to go that > way, I'd like to understand what the strongest arguments for it are. I am not entirely sure how a vDSO RNG would work, I think exposing the epoch(or whatever indicator) is enough, crypto libraries have pretty good PRNGs, what they require is simply a good source of entropy for the initial seeding and this safety mechanism to avoid state duplication on machine cloning. All the decent libraries already support detecting process forks. Simo.
Hi Simo, On 11.05.22 14:59, Simo Sorce wrote: > Hi Jason, > > On Wed, 2022-05-11 at 03:18 +0200, Jason A. Donenfeld wrote: >> My proposal here is made with nonce reuse in mind, for things like >> session keys that use sequential nonces. > Although this makes sense the problem is that changing applications to > do the right thing based on which situation they are in will never be > done right or soon enough. So I would focus on a solution that makes > the CSPRNGs in crypto libraries safe. I think we intrinsically have 2 sets of applications: Ones that want an event based notification and don't care about the racyness of it and ones that want an atomic way to determine the epoch. Userspace RNGs are naturally in the second category. > >> A different issue is random nonces. For these, it seems like a call to >> getrandom() for each nonce is probably the best bet. But it sounds like >> you're interested in a userspace RNG, akin to OpenBSD's arc4random(3). I >> hope you saw these threads: >> >> - https://lore.kernel.org/lkml/YnA5CUJKvqmXJxf2@zx2c4.com/ >> - https://lore.kernel.org/lkml/Yh4+9+UpanJWAIyZ@zx2c4.com/ >> - https://lore.kernel.org/lkml/CAHmME9qHGSF8w3DoyCP+ud_N0MAJ5_8zsUWx=rxQB1mFnGcu9w@mail.gmail.com/ > 4c does sound like a decent solution, it is semantically identical to > an epoch vmgenid, all the library needs to do is to create such a mmap > region, stick a value on it, verify it is not zero after computing the > next random value but before returning it to the caller. > This reduces the race to a very small window when the machine is frozen > right after the random value is returned to the caller but before it is > used, but hopefully this just means that the two machines will just > make parallel computations that yield the exact same value, so no > catastrophic consequence will arise (there is the odd case where two > random values are sought and the split happens between the two are > retrieved and this has bad consequences, I think we can ignore that). The problem with wiping memory on clone is that it means you must keep these special wipe on clone pages always present and pinned in memory and can no longer swap them out, compact them or move them for memory hotplug. We started the journey with a WIPEONSUSPEND flag and eventually abandoned it because it seemed clunky. Happy to reopen it if we believe it's a viable path: https://lwn.net/Articles/825230/ Alex Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879
Hi Simo, On Wed, May 11, 2022 at 08:59:07AM -0400, Simo Sorce wrote: > Hi Jason, > > On Wed, 2022-05-11 at 03:18 +0200, Jason A. Donenfeld wrote: > > My proposal here is made with nonce reuse in mind, for things like > > session keys that use sequential nonces. > > Although this makes sense the problem is that changing applications to > do the right thing based on which situation they are in will never be > done right or soon enough. So I would focus on a solution that makes > the CSPRNGs in crypto libraries safe. Please don't dismiss this. I realize you have your one single use case in mind, but there are others, and the distinction you gave for why we should dismiss the others to focus on yours doesn't really make any sense. Here's why: In my email I pointed out two places where VM forks impact crypto in bad ways: - Session keys, wrt nonce reuse. - Random nonces, wrt nonce reuse. There are other problems that arise from VM forks too. But these stand out because they are both quite catastrophic, whether it's duplicated ECDSA random nonces, or whether it's the same session key used with the same sequential counter to encrypt different plaintexts with something like AES-GCM or ChaCha20Poly1305. These are both very, very bad things. And both things happen in: - Libraries: crypto lib random number generators (e.g. OpenSSL), crypto lib session keys (e.g. any TLS library). - Applications: application level random number generators (e.g. Bitcoin Core *facepalm*), application level session keys (e.g. OpenSSH). So I don't think the "library vs application" distinction is really meaningful here. Rather, things kind of fall apart all over the place for a variety of reasons on VM fork. > > - https://lore.kernel.org/lkml/YnA5CUJKvqmXJxf2@zx2c4.com/ > > - https://lore.kernel.org/lkml/Yh4+9+UpanJWAIyZ@zx2c4.com/ > > - https://lore.kernel.org/lkml/CAHmME9qHGSF8w3DoyCP+ud_N0MAJ5_8zsUWx=rxQB1mFnGcu9w@mail.gmail.com/ > > 4c does sound like a decent solution, it is semantically identical to It does, yeah, but realistically it's never going to happen. I don't think there's a near- or medium-term chance of changing hypervisor semantics again. That means for 4-like solutions, there's 4a and 4b. By the way, that email of mine has inaccuracy in it. I complain about being in irq context, but it turns out not to be the case; we're inside of a kthread during the notification, which means we have a lot more options on what we can do. If 4 is the solution that appeals to you most, do you want to try your hand at a RFC patch for it? I don't yet know if that's the best direction to take, but the devil is kind of in the details, so it might be interesting to see how it pans out. Jason
On 11.05.22 03:18, Jason A. Donenfeld wrote: > Hi Simo, > > On Tue, May 10, 2022 at 08:40:48PM -0400, Simo Sorce wrote: >> At your request teleporting here the answer I gave on a different >> thread, reinforced by some thinking. >> >> As a user space crypto library person I think the only reasonable >> interface is something like a vDSO. >> >> Poll() interfaces are nice and all for system programs that have full >> control of their event loop and do not have to react immediately to >> this event, however crypto libraries do not have the luxury of >> controlling the main loop of the application. >> >> Additionally crypto libraries really need to ensure the value they >> return from their PRNG is fine, which means they do not return a value >> if the vmgenid has changed before they can reseed, or there could be >> catastrophic duplication of "random" values used in IVs or ECDSA >> Signatures or ids/cookies or whatever. >> >> For crypto libraries it is much simpler to poll for this information >> than using notifications of any kind given libraries are >> generally not in full control of what the process does. >> >> This needs to be polled fast as well, because the whole point of >> initializing a PRNG in the library is that asking /dev/urandom all the >> time is too slow (due to context switches and syscall overhead), so >> anything that would require a context switch in order to pull data from >> the PRNG would not really fly. >> >> A vDSO or similar would allow to pull the vmgenid or whatever epoch >> value in before generating the random numbers and then barrier-style >> check that the value is still unchanged before returning the random >> data to the caller. This will reduce the race condition (which simply >> cannot be completely avoided) to a very unlikely event. > It sounds like your library issue is somewhat similar to what Alex was > talking about with regards to having a hard time using poll in s2n. I'm > still waiting to hear if Amazon figured out some way that this is > possible (with, e.g., a thread). But anyway, it seems like this is Sounds like I didn't reply with my findings - sorry about that. Our s2n people *could* build something based on a thread, but are afraid that it's racy and would introduce creating a thread which the library does not do today. So in a nutshell, possible yes, desirable no. I think we're maybe a bit too scared of building something from scratch here. What would the best case situation be? Let's roll backwards from that then. From what I read, we want a "VMGenID v2" device that gives us the ability to * Get an IRQ on VM clone * Store and update an RNG seed value (128bit? Configurable len?) at a physical address or stand alone page on clone * Store and update a unique-to-this-VM rolling 32bit identifier at a physical address or stand alone page on clone We can either make the device expose these values as individual pages (like VMGenID does today) or as guest physical addresses that it needs to store into (like virtio-rng). The latter makes protection from DMA attacks of the hypervisor and kexec slightly more complicated, but it would be doable. VMGenID has 2 out of 3 features above. So why don't we just go the easy route, add a second property to VMGenID that gives us another page with that 32bit value and then provide a /dev/vmgenid device node you can open and mmap() that 32bit value page from? User space libraries can then try to open on init and determine their epoch. For the async event, we add the poll() logic again to /dev/vmgenid and make networkd for example use that. Alex Amazon Development Center Germany GmbH Krausenstr. 38 10117 Berlin Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B Sitz: Berlin Ust-ID: DE 289 237 879
Hi Jason, On Wed, 2022-05-11 at 15:19 +0200, Jason A. Donenfeld wrote: > Please don't dismiss this. I realize you have your one single use case > in mind, but there are others, and the distinction you gave for why we > should dismiss the others to focus on yours doesn't really make any > sense. Here's why: I do not think I am dismissing any other use cases, clearly anything that depend on unique random numbers for security is impacted, but I tend to focus where we can get the biggest impact. > In my email I pointed out two places where VM forks impact crypto in bad > ways: > > - Session keys, wrt nonce reuse. > > - Random nonces, wrt nonce reuse. > > There are other problems that arise from VM forks too. But these stand > out because they are both quite catastrophic, whether it's duplicated > ECDSA random nonces, or whether it's the same session key used with the > same sequential counter to encrypt different plaintexts with something > like AES-GCM or ChaCha20Poly1305. These are both very, very bad things. > > And both things happen in: > > - Libraries: crypto lib random number generators (e.g. OpenSSL), crypto > lib session keys (e.g. any TLS library). > > - Applications: application level random number generators (e.g. > Bitcoin Core *facepalm*), application level session keys (e.g. > OpenSSH). Yes, some applications that are involved with security do have their own application level PRNGs, clearly they will have to either stop using customized PRNGs and use the library provided ones (or even just /dev/urandom if their needs are no performance critical) or adjust their own PRNGs to be safe using whatever mechanism will be provided. > So I don't think the "library vs application" distinction is really > meaningful here. Rather, things kind of fall apart all over the place > for a variety of reasons on VM fork. I am not really making a library vs application distinction, what I am saying is that the library uses case has a set of tighter constraints than the application one. Basically anything a library can use an application can as well, while the contrary is not true. Therefore it if we resolve the library problem, applications will have a solution as well. > > > - https://lore.kernel.org/lkml/YnA5CUJKvqmXJxf2@zx2c4.com/ > > > - https://lore.kernel.org/lkml/Yh4+9+UpanJWAIyZ@zx2c4.com/ > > > - https://lore.kernel.org/lkml/CAHmME9qHGSF8w3DoyCP+ud_N0MAJ5_8zsUWx=rxQB1mFnGcu9w@mail.gmail.com/ > > > > 4c does sound like a decent solution, it is semantically identical to > > It does, yeah, but realistically it's never going to happen. I don't > think there's a near- or medium-term chance of changing hypervisor > semantics again. That means for 4-like solutions, there's 4a and 4b. I think 4a and 4b are fine mechanisms too, 4c is just more efficient, and potentially optimizable in HW. That said I think 3 (vDSO) is also a fine solution, and would not be disappointed if 3 was chosen over 4. I am not really after evaluating how it is done below the kernel boundary. As long as the effects are the same, semantically, from the user space pov. > By the way, that email of mine has inaccuracy in it. I complain about > being in irq context, but it turns out not to be the case; we're inside > of a kthread during the notification, which means we have a lot more > options on what we can do. > > If 4 is the solution that appeals to you most, do you want to try your > hand at a RFC patch for it? I don't yet know if that's the best > direction to take, but the devil is kind of in the details, so it might > be interesting to see how it pans out. I think it would be prudent to agree on the correct mechanisms before venturing into potentially invasive patches. Simo.
diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index 1144ea3229a3..ddbd603f0be7 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -1001,7 +1001,7 @@ This is a directory, with the following entries: * ``urandom_min_reseed_secs``: obsolete (used to determine the minimum number of seconds between urandom pool reseeding). This file is writable for compatibility purposes, but writing to it has no effect - on any RNG behavior. + on any RNG behavior; * ``uuid``: a UUID generated every time this is retrieved (this can thus be used to generate UUIDs at will); @@ -1009,8 +1009,10 @@ This is a directory, with the following entries: * ``write_wakeup_threshold``: when the entropy count drops below this (as a number of bits), processes waiting to write to ``/dev/random`` are woken up. This file is writable for compatibility purposes, but - writing to it has no effect on any RNG behavior. + writing to it has no effect on any RNG behavior; +* ``fork_event``: unreadable, but can be poll()'d on for notifications + delivered after the RNG reseeds following a virtual machine fork. randomize_va_space ================== diff --git a/drivers/char/random.c b/drivers/char/random.c index bffc8682d6b8..39eda91b07ec 100644 --- a/drivers/char/random.c +++ b/drivers/char/random.c @@ -1213,6 +1213,9 @@ EXPORT_SYMBOL_GPL(add_bootloader_randomness); #if IS_ENABLED(CONFIG_VMGENID) static BLOCKING_NOTIFIER_HEAD(vmfork_chain); +#ifdef CONFIG_SYSCTL +static DEFINE_CTL_TABLE_POLL(sysctl_fork_event_poll); +#endif /* * Handle a new unique VM ID, which is unique, not secret, so we @@ -1227,6 +1230,9 @@ void add_vmfork_randomness(const void *unique_vm_id, size_t size) pr_notice("crng reseeded due to virtual machine fork\n"); } blocking_notifier_call_chain(&vmfork_chain, 0, NULL); +#ifdef CONFIG_SYSCTL + proc_sys_poll_notify(&sysctl_fork_event_poll); +#endif } #if IS_MODULE(CONFIG_VMGENID) EXPORT_SYMBOL_GPL(add_vmfork_randomness); @@ -1694,6 +1700,8 @@ const struct file_operations urandom_fops = { * It is writable to avoid breaking old userspaces, but writing * to it does not change any behavior of the RNG. * + * - fork_event - an unreadable file that can be poll()'d on for VM forks. + * ********************************************************************/ #ifdef CONFIG_SYSCTL @@ -1747,6 +1755,14 @@ static int proc_do_rointvec(struct ctl_table *table, int write, void *buffer, return write ? 0 : proc_dointvec(table, 0, buffer, lenp, ppos); } +#if IS_ENABLED(CONFIG_VMGENID) +static int proc_do_nodata(struct ctl_table *table, int write, void *buffer, + size_t *lenp, loff_t *ppos) +{ + return -ENODATA; +} +#endif + static struct ctl_table random_table[] = { { .procname = "poolsize", @@ -1787,6 +1803,14 @@ static struct ctl_table random_table[] = { .mode = 0444, .proc_handler = proc_do_uuid, }, +#if IS_ENABLED(CONFIG_VMGENID) + { + .procname = "fork_event", + .mode = 0444, + .poll = &sysctl_fork_event_poll, + .proc_handler = proc_do_nodata, + }, +#endif { } };
In order to inform userspace of virtual machine forks, this commit adds a "fork_event" sysctl, which does not return any data, but allows userspace processes to poll() on it for notification of VM forks. It avoids exposing the actual vmgenid from the hypervisor to userspace, in case there is any randomness value in keeping it secret. Rather, userspace is expected to simply use getrandom() if it wants a fresh value. For example, the following snippet can be used to print a message every time a VM forks, after the RNG has been reseeded: struct pollfd fd = { .fd = open("/proc/sys/kernel/random/fork_event", O_RDONLY) }; assert(fd.fd >= 0); for (;;) { read(fd.fd, NULL, 0); assert(poll(&fd, 1, -1) > 0); puts("vm fork detected"); } Various programs and libraries that utilize cryptographic operations depending on fresh randomness can invalidate old keys or take other appropriate actions when receiving that event. While this is racier than allowing userspace to mmap/vDSO the vmgenid itself, it's an incremental step forward that's not as heavyweight. Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Alexander Graf <graf@amazon.com> Cc: Colm MacCarthaigh <colmmacc@amazon.com> Cc: Torben Hansen <htorben@amazon.co.uk> Cc: Jann Horn <jannh@google.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> --- Documentation/admin-guide/sysctl/kernel.rst | 6 ++++-- drivers/char/random.c | 24 +++++++++++++++++++++ 2 files changed, 28 insertions(+), 2 deletions(-)