diff mbox series

[v8] drivers/misc: sysgenid: add system generation id driver

Message ID 1615213083-29869-1-git-send-email-acatan@amazon.com (mailing list archive)
State New, archived
Headers show
Series [v8] drivers/misc: sysgenid: add system generation id driver | expand

Commit Message

Catangiu, Adrian Costin March 8, 2021, 2:18 p.m. UTC
- Background and problem

The System Generation ID feature is required in virtualized or
containerized environments by applications that work with local copies
or caches of world-unique data such as random values, uuids,
monotonically increasing counters, cryptographic nonces, etc.
Such applications can be negatively affected by VM or container
snapshotting when the VM or container is either cloned or returned to
an earlier point in time.

Solving the uniqueness problem strongly enough for cryptographic
purposes requires a mechanism which can deterministically reseed
userspace PRNGs with new entropy at restore time. This mechanism must
also support the high-throughput and low-latency use-cases that led
programmers to pick a userspace PRNG in the first place; be usable by
both application code and libraries; allow transparent retrofitting
behind existing popular PRNG interfaces without changing application
code; it must be efficient, especially on snapshot restore; and be
simple enough for wide adoption.

- Solution

This commit introduces a mechanism that standardizes an API for
applications and libraries to be made aware of uniqueness breaking
events such as VM or container snapshotting, and allow them to react
and adapt to such events.

The System Generation ID is meant to help in these scenarios by
providing a monotonically increasing u32 counter that changes each time
the VM or container is restored from a snapshot.

The `sysgenid` driver exposes a monotonic incremental System Generation
u32 counter via a char-dev filesystem interface accessible
through `/dev/sysgenid`. It provides synchronous and asynchronous SysGen
counter update notifications, as well as counter retrieval and
confirmation mechanisms.
The counter starts from zero when the driver is initialized and
monotonically increments every time the system generation changes.

Userspace applications or libraries can (a)synchronously consume the
system generation counter through the provided filesystem interface, to
make any necessary internal adjustments following a system generation
update.

The provided filesystem interface operations can be used to build a
system level safe workflow that guest software can follow to protect
itself from negative system snapshot effects.

System generation changes are driven by userspace software through a
dedicated driver ioctl.

**Please note**, SysGenID alone does not guarantee complete snapshot
safety to applications using it. A certain workflow needs to be
followed at the system level, in order to make the system
snapshot-resilient. Please see the "Snapshot Safety Prerequisites"
section in the included documentation.

Signed-off-by: Adrian Catangiu <acatan@amazon.com>
---
v7 -> v8:
  - remove vmgenid driver
  - sysgenid: remove support for HW backends, SysGenID is now driven
    by software through a safe, consistent interface
  - without HW backend there is no IRQ to race with mmap() interface
  - update documentation

v6 -> v7:
  - remove sysgenid uevent

v5 -> v6:

  - sysgenid: watcher tracking disabled by default
  - sysgenid: add SYSGENID_SET_WATCHER_TRACKING ioctl to allow each
    file descriptor to set whether they should be tracked as watchers
  - rename SYSGENID_FORCE_GEN_UPDATE -> SYSGENID_TRIGGER_GEN_UPDATE
  - rework all documentation to clearly capture all prerequisites for
    achieving snapshot safety when using the provided mechanism
  - sysgenid documentation: replace individual filesystem operations
    examples with a higher level example showcasing system-level
    snapshot-safe workflow

v4 -> v5:

  - sysgenid: generation changes are also exported through uevents
  - remove SYSGENID_GET_OUTDATED_WATCHERS ioctl
  - document sysgenid ioctl major/minor numbers

v3 -> v4:

  - split functionality in two separate kernel modules: 
    1. drivers/misc/sysgenid.c which provides the generic userspace
       interface and mechanisms
    2. drivers/virt/vmgenid.c as VMGENID acpi device driver that seeds
       kernel entropy and acts as a driving backend for the generic
       sysgenid
  - rename /dev/vmgenid -> /dev/sysgenid
  - rename uapi header file vmgenid.h -> sysgenid.h
  - rename ioctls VMGENID_* -> SYSGENID_*
  - add ‘min_gen’ parameter to SYSGENID_FORCE_GEN_UPDATE ioctl
  - fix races in documentation examples

v2 -> v3:

  - separate the core driver logic and interface, from the ACPI device.
    The ACPI vmgenid device is now one possible backend
  - fix issue when timeout=0 in VMGENID_WAIT_WATCHERS
  - add locking to avoid races between fs ops handlers and hw irq
    driven generation updates
  - change VMGENID_WAIT_WATCHERS ioctl so if the current caller is
    outdated or a generation change happens while waiting (thus making
    current caller outdated), the ioctl returns -EINTR to signal the
    user to handle event and retry. Fixes blocking on oneself
  - add VMGENID_FORCE_GEN_UPDATE ioctl conditioned by
    CAP_CHECKPOINT_RESTORE capability, through which software can force
    generation bump

v1 -> v2:

  - expose to userspace a monotonically increasing u32 Vm Gen Counter
    instead of the hw VmGen UUID
  - since the hw/hypervisor-provided 128-bit UUID is not public
    anymore, add it to the kernel RNG as device randomness
  - insert driver page containing Vm Gen Counter in the user vma in
    the driver's mmap handler instead of using a fault handler
  - turn driver into a misc device driver to auto-create /dev/vmgenid
  - change ioctl arg to avoid leaking kernel structs to userspace
  - update documentation
---
 Documentation/misc-devices/sysgenid.rst            | 228 +++++++++++++++
 Documentation/userspace-api/ioctl/ioctl-number.rst |   1 +
 MAINTAINERS                                        |   8 +
 drivers/misc/Kconfig                               |  15 +
 drivers/misc/Makefile                              |   1 +
 drivers/misc/sysgenid.c                            | 316 +++++++++++++++++++++
 include/uapi/linux/sysgenid.h                      |  14 +
 7 files changed, 583 insertions(+)
 create mode 100644 Documentation/misc-devices/sysgenid.rst
 create mode 100644 drivers/misc/sysgenid.c
 create mode 100644 include/uapi/linux/sysgenid.h

Comments

Greg Kroah-Hartman March 8, 2021, 2:36 p.m. UTC | #1
On Mon, Mar 08, 2021 at 04:18:03PM +0200, Adrian Catangiu wrote:
> +static struct miscdevice sysgenid_misc = {
> +	.minor = MISC_DYNAMIC_MINOR,
> +	.name = "sysgenid",
> +	.fops = &fops,
> +};

Much cleaner, but:

> +static int __init sysgenid_init(void)
> +{
> +	int ret;
> +
> +	sysgenid_data.map_buf = get_zeroed_page(GFP_KERNEL);
> +	if (!sysgenid_data.map_buf)
> +		return -ENOMEM;
> +
> +	atomic_set(&sysgenid_data.generation_counter, 0);
> +	atomic_set(&sysgenid_data.outdated_watchers, 0);
> +	init_waitqueue_head(&sysgenid_data.read_waitq);
> +	init_waitqueue_head(&sysgenid_data.outdated_waitq);
> +	spin_lock_init(&sysgenid_data.lock);
> +
> +	ret = misc_register(&sysgenid_misc);
> +	if (ret < 0) {
> +		pr_err("misc_register() failed for sysgenid\n");
> +		goto err;
> +	}
> +
> +	return 0;
> +
> +err:
> +	free_pages(sysgenid_data.map_buf, 0);
> +	sysgenid_data.map_buf = 0;
> +
> +	return ret;
> +}
> +
> +static void __exit sysgenid_exit(void)
> +{
> +	misc_deregister(&sysgenid_misc);
> +	free_pages(sysgenid_data.map_buf, 0);
> +	sysgenid_data.map_buf = 0;
> +}
> +
> +module_init(sysgenid_init);
> +module_exit(sysgenid_exit);

So you do this for any bit of hardware that happens to be out there?
Will that really work?  You do not have any hwid to trigger off of to
know that this is a valid device you can handle?

> +
> +MODULE_AUTHOR("Adrian Catangiu");
> +MODULE_DESCRIPTION("System Generation ID");
> +MODULE_LICENSE("GPL");
> +MODULE_VERSION("0.1");

MODULE_VERSION() isn't a thing, just drop it please :)

thnaks,

greg k-h
Alexander Graf March 8, 2021, 4:03 p.m. UTC | #2
On 08.03.21 15:36, Greg KH wrote:
> 
> On Mon, Mar 08, 2021 at 04:18:03PM +0200, Adrian Catangiu wrote:
>> +static struct miscdevice sysgenid_misc = {
>> +     .minor = MISC_DYNAMIC_MINOR,
>> +     .name = "sysgenid",
>> +     .fops = &fops,
>> +};
> 
> Much cleaner, but:
> 
>> +static int __init sysgenid_init(void)
>> +{
>> +     int ret;
>> +
>> +     sysgenid_data.map_buf = get_zeroed_page(GFP_KERNEL);
>> +     if (!sysgenid_data.map_buf)
>> +             return -ENOMEM;
>> +
>> +     atomic_set(&sysgenid_data.generation_counter, 0);
>> +     atomic_set(&sysgenid_data.outdated_watchers, 0);
>> +     init_waitqueue_head(&sysgenid_data.read_waitq);
>> +     init_waitqueue_head(&sysgenid_data.outdated_waitq);
>> +     spin_lock_init(&sysgenid_data.lock);
>> +
>> +     ret = misc_register(&sysgenid_misc);
>> +     if (ret < 0) {
>> +             pr_err("misc_register() failed for sysgenid\n");
>> +             goto err;
>> +     }
>> +
>> +     return 0;
>> +
>> +err:
>> +     free_pages(sysgenid_data.map_buf, 0);
>> +     sysgenid_data.map_buf = 0;
>> +
>> +     return ret;
>> +}
>> +
>> +static void __exit sysgenid_exit(void)
>> +{
>> +     misc_deregister(&sysgenid_misc);
>> +     free_pages(sysgenid_data.map_buf, 0);
>> +     sysgenid_data.map_buf = 0;
>> +}
>> +
>> +module_init(sysgenid_init);
>> +module_exit(sysgenid_exit);
> 
> So you do this for any bit of hardware that happens to be out there?
> Will that really work?  You do not have any hwid to trigger off of to
> know that this is a valid device you can handle?

The interface is already useful in a pure container context where the 
generation change request is triggered by software.

And yes, there are hardware triggers, but Michael was quite unhappy 
about potential races between VMGenID change and SysGenID change and 
thus wanted to ideally separate the interfaces. So we went ahead and 
isolated the SysGenID one, as it's already useful as is.

Hardware drivers to inject change events into SysGenID can then follow 
later, for all different hardware platforms. But SysGenID as in this 
patch is a completely hardware agnostic concept.


Alex



Amazon Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 149173 B
Sitz: Berlin
Ust-ID: DE 289 237 879
Greg Kroah-Hartman March 8, 2021, 5:24 p.m. UTC | #3
On Mon, Mar 08, 2021 at 05:03:58PM +0100, Alexander Graf wrote:
> 
> 
> On 08.03.21 15:36, Greg KH wrote:
> > 
> > On Mon, Mar 08, 2021 at 04:18:03PM +0200, Adrian Catangiu wrote:
> > > +static struct miscdevice sysgenid_misc = {
> > > +     .minor = MISC_DYNAMIC_MINOR,
> > > +     .name = "sysgenid",
> > > +     .fops = &fops,
> > > +};
> > 
> > Much cleaner, but:
> > 
> > > +static int __init sysgenid_init(void)
> > > +{
> > > +     int ret;
> > > +
> > > +     sysgenid_data.map_buf = get_zeroed_page(GFP_KERNEL);
> > > +     if (!sysgenid_data.map_buf)
> > > +             return -ENOMEM;
> > > +
> > > +     atomic_set(&sysgenid_data.generation_counter, 0);
> > > +     atomic_set(&sysgenid_data.outdated_watchers, 0);
> > > +     init_waitqueue_head(&sysgenid_data.read_waitq);
> > > +     init_waitqueue_head(&sysgenid_data.outdated_waitq);
> > > +     spin_lock_init(&sysgenid_data.lock);
> > > +
> > > +     ret = misc_register(&sysgenid_misc);
> > > +     if (ret < 0) {
> > > +             pr_err("misc_register() failed for sysgenid\n");
> > > +             goto err;
> > > +     }
> > > +
> > > +     return 0;
> > > +
> > > +err:
> > > +     free_pages(sysgenid_data.map_buf, 0);
> > > +     sysgenid_data.map_buf = 0;
> > > +
> > > +     return ret;
> > > +}
> > > +
> > > +static void __exit sysgenid_exit(void)
> > > +{
> > > +     misc_deregister(&sysgenid_misc);
> > > +     free_pages(sysgenid_data.map_buf, 0);
> > > +     sysgenid_data.map_buf = 0;
> > > +}
> > > +
> > > +module_init(sysgenid_init);
> > > +module_exit(sysgenid_exit);
> > 
> > So you do this for any bit of hardware that happens to be out there?
> > Will that really work?  You do not have any hwid to trigger off of to
> > know that this is a valid device you can handle?
> 
> The interface is already useful in a pure container context where the
> generation change request is triggered by software.
> 
> And yes, there are hardware triggers, but Michael was quite unhappy about
> potential races between VMGenID change and SysGenID change and thus wanted
> to ideally separate the interfaces. So we went ahead and isolated the
> SysGenID one, as it's already useful as is.
> 
> Hardware drivers to inject change events into SysGenID can then follow
> later, for all different hardware platforms. But SysGenID as in this patch
> is a completely hardware agnostic concept.

Ok, so what is going to cause this driver to be automatically loaded?
How will userspace "know" they want this and know to load it?

This really is just a shared memory "driver", it's gotten so small now,
so why can't this just be a userspace program/server now?  :)

thanks,

greg k-h
Greg Kroah-Hartman March 23, 2021, 12:56 p.m. UTC | #4
On Mon, Mar 08, 2021 at 05:03:58PM +0100, Alexander Graf wrote:
> 
> 
> On 08.03.21 15:36, Greg KH wrote:
> > 
> > On Mon, Mar 08, 2021 at 04:18:03PM +0200, Adrian Catangiu wrote:
> > > +static struct miscdevice sysgenid_misc = {
> > > +     .minor = MISC_DYNAMIC_MINOR,
> > > +     .name = "sysgenid",
> > > +     .fops = &fops,
> > > +};
> > 
> > Much cleaner, but:
> > 
> > > +static int __init sysgenid_init(void)
> > > +{
> > > +     int ret;
> > > +
> > > +     sysgenid_data.map_buf = get_zeroed_page(GFP_KERNEL);
> > > +     if (!sysgenid_data.map_buf)
> > > +             return -ENOMEM;
> > > +
> > > +     atomic_set(&sysgenid_data.generation_counter, 0);
> > > +     atomic_set(&sysgenid_data.outdated_watchers, 0);
> > > +     init_waitqueue_head(&sysgenid_data.read_waitq);
> > > +     init_waitqueue_head(&sysgenid_data.outdated_waitq);
> > > +     spin_lock_init(&sysgenid_data.lock);
> > > +
> > > +     ret = misc_register(&sysgenid_misc);
> > > +     if (ret < 0) {
> > > +             pr_err("misc_register() failed for sysgenid\n");
> > > +             goto err;
> > > +     }
> > > +
> > > +     return 0;
> > > +
> > > +err:
> > > +     free_pages(sysgenid_data.map_buf, 0);
> > > +     sysgenid_data.map_buf = 0;
> > > +
> > > +     return ret;
> > > +}
> > > +
> > > +static void __exit sysgenid_exit(void)
> > > +{
> > > +     misc_deregister(&sysgenid_misc);
> > > +     free_pages(sysgenid_data.map_buf, 0);
> > > +     sysgenid_data.map_buf = 0;
> > > +}
> > > +
> > > +module_init(sysgenid_init);
> > > +module_exit(sysgenid_exit);
> > 
> > So you do this for any bit of hardware that happens to be out there?
> > Will that really work?  You do not have any hwid to trigger off of to
> > know that this is a valid device you can handle?
> 
> The interface is already useful in a pure container context where the
> generation change request is triggered by software.
> 
> And yes, there are hardware triggers, but Michael was quite unhappy about
> potential races between VMGenID change and SysGenID change and thus wanted
> to ideally separate the interfaces. So we went ahead and isolated the
> SysGenID one, as it's already useful as is.
> 
> Hardware drivers to inject change events into SysGenID can then follow
> later, for all different hardware platforms. But SysGenID as in this patch
> is a completely hardware agnostic concept.

Ok, this is going to play havoc with fuzzers and other "automated
testers", should be fun to watch!  :)

Let's queue this up and see what happens...

thanks,

greg k-h
Catangiu, Adrian Costin March 23, 2021, 4:10 p.m. UTC | #5
Hi Greg,

After your previous reply on this thread we started considering to provide this interface and framework/functionality through a userspace service instead of a kernel interface.
The latest iteration on this evolving patch-set doesn’t have strong reasons for living in the kernel anymore - the only objectively strong advantage would be easier driving of ecosystem integration; but I am not sure that's a good enough reason to create a new kernel interface.

I am now looking into adding this through Systemd. Either as a pluggable service or maybe even a systemd builtin offering.

What are your thoughts on it?

Thanks,
Adrian.

On 23/03/2021, 14:57, "Greg KH" <gregkh@linuxfoundation.org> wrote:

    CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you can confirm the sender and know the content is safe.



    On Mon, Mar 08, 2021 at 05:03:58PM +0100, Alexander Graf wrote:
    >
    >
    > On 08.03.21 15:36, Greg KH wrote:
    > >
    > > On Mon, Mar 08, 2021 at 04:18:03PM +0200, Adrian Catangiu wrote:
    > > > +static struct miscdevice sysgenid_misc = {
    > > > +     .minor = MISC_DYNAMIC_MINOR,
    > > > +     .name = "sysgenid",
    > > > +     .fops = &fops,
    > > > +};
    > >
    > > Much cleaner, but:
    > >
    > > > +static int __init sysgenid_init(void)
    > > > +{
    > > > +     int ret;
    > > > +
    > > > +     sysgenid_data.map_buf = get_zeroed_page(GFP_KERNEL);
    > > > +     if (!sysgenid_data.map_buf)
    > > > +             return -ENOMEM;
    > > > +
    > > > +     atomic_set(&sysgenid_data.generation_counter, 0);
    > > > +     atomic_set(&sysgenid_data.outdated_watchers, 0);
    > > > +     init_waitqueue_head(&sysgenid_data.read_waitq);
    > > > +     init_waitqueue_head(&sysgenid_data.outdated_waitq);
    > > > +     spin_lock_init(&sysgenid_data.lock);
    > > > +
    > > > +     ret = misc_register(&sysgenid_misc);
    > > > +     if (ret < 0) {
    > > > +             pr_err("misc_register() failed for sysgenid\n");
    > > > +             goto err;
    > > > +     }
    > > > +
    > > > +     return 0;
    > > > +
    > > > +err:
    > > > +     free_pages(sysgenid_data.map_buf, 0);
    > > > +     sysgenid_data.map_buf = 0;
    > > > +
    > > > +     return ret;
    > > > +}
    > > > +
    > > > +static void __exit sysgenid_exit(void)
    > > > +{
    > > > +     misc_deregister(&sysgenid_misc);
    > > > +     free_pages(sysgenid_data.map_buf, 0);
    > > > +     sysgenid_data.map_buf = 0;
    > > > +}
    > > > +
    > > > +module_init(sysgenid_init);
    > > > +module_exit(sysgenid_exit);
    > >
    > > So you do this for any bit of hardware that happens to be out there?
    > > Will that really work?  You do not have any hwid to trigger off of to
    > > know that this is a valid device you can handle?
    >
    > The interface is already useful in a pure container context where the
    > generation change request is triggered by software.
    >
    > And yes, there are hardware triggers, but Michael was quite unhappy about
    > potential races between VMGenID change and SysGenID change and thus wanted
    > to ideally separate the interfaces. So we went ahead and isolated the
    > SysGenID one, as it's already useful as is.
    >
    > Hardware drivers to inject change events into SysGenID can then follow
    > later, for all different hardware platforms. But SysGenID as in this patch
    > is a completely hardware agnostic concept.

    Ok, this is going to play havoc with fuzzers and other "automated
    testers", should be fun to watch!  :)

    Let's queue this up and see what happens...

    thanks,

    greg k-h




Amazon Development Center (Romania) S.R.L. registered office: 27A Sf. Lazar Street, UBC5, floor 2, Iasi, Iasi County, 700045, Romania. Registered in Romania. Registration number J22/2621/2005.
Greg Kroah-Hartman March 23, 2021, 4:35 p.m. UTC | #6
On Tue, Mar 23, 2021 at 04:10:27PM +0000, Catangiu, Adrian Costin wrote:
> Hi Greg,
> 
> After your previous reply on this thread we started considering to provide this interface and framework/functionality through a userspace service instead of a kernel interface.
> The latest iteration on this evolving patch-set doesn’t have strong reasons for living in the kernel anymore - the only objectively strong advantage would be easier driving of ecosystem integration; but I am not sure that's a good enough reason to create a new kernel interface.
> 
> I am now looking into adding this through Systemd. Either as a pluggable service or maybe even a systemd builtin offering.
> 
> What are your thoughts on it?

I'll gladly drop this patch if it's not needed in the kernel, thanks for
letting me know.

greg k-h
Greg Kroah-Hartman March 24, 2021, 7:22 a.m. UTC | #7
On Tue, Mar 23, 2021 at 04:10:27PM +0000, Catangiu, Adrian Costin wrote:
> Hi Greg,
> 
> After your previous reply on this thread we started considering to provide this interface and framework/functionality through a userspace service instead of a kernel interface.
> The latest iteration on this evolving patch-set doesn’t have strong reasons for living in the kernel anymore - the only objectively strong advantage would be easier driving of ecosystem integration; but I am not sure that's a good enough reason to create a new kernel interface.
> 
> I am now looking into adding this through Systemd. Either as a pluggable service or maybe even a systemd builtin offering.
> 
> What are your thoughts on it?

Now dropped from my char-misc-testing branch.  If you all decide you
want this as a kernel driver, please resubmit it.

Also next time, you might give me a heads-up that you don't want a patch
merged...

thanks,

greg k-h
Michael S. Tsirkin March 24, 2021, 4:38 p.m. UTC | #8
On Tue, Mar 23, 2021 at 05:35:14PM +0100, Greg KH wrote:
> On Tue, Mar 23, 2021 at 04:10:27PM +0000, Catangiu, Adrian Costin wrote:
> > Hi Greg,
> > 
> > After your previous reply on this thread we started considering to provide this interface and framework/functionality through a userspace service instead of a kernel interface.
> > The latest iteration on this evolving patch-set doesn’t have strong reasons for living in the kernel anymore - the only objectively strong advantage would be easier driving of ecosystem integration; but I am not sure that's a good enough reason to create a new kernel interface.
> > 
> > I am now looking into adding this through Systemd. Either as a pluggable service or maybe even a systemd builtin offering.
> > 
> > What are your thoughts on it?
> 
> I'll gladly drop this patch if it's not needed in the kernel, thanks for
> letting me know.
> 
> greg k-h

Systemd sounds good to me too.
diff mbox series

Patch

diff --git a/Documentation/misc-devices/sysgenid.rst b/Documentation/misc-devices/sysgenid.rst
new file mode 100644
index 0000000..21f748d
--- /dev/null
+++ b/Documentation/misc-devices/sysgenid.rst
@@ -0,0 +1,228 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+========
+SYSGENID
+========
+
+The System Generation ID feature is required in virtualized or
+containerized environments by applications that work with local copies
+or caches of world-unique data such as random values, UUIDs,
+monotonically increasing counters, etc.
+Such applications can be negatively affected by VM or container
+snapshotting when the VM or container is either cloned or returned to
+an earlier point in time.
+
+The System Generation ID device is meant to help in these scenarios by
+providing a system level interface for applications and libraries to
+consume and synchronize against the System Generation ID.
+The SysGenID is a monotonically increasing counter that is meant to
+changes eachtime the system generation changes, for example after a VM
+or container is restored from a snapshot. The driver for it lives at
+``drivers/misc/sysgenid.c``.
+
+The ``sysgenid`` driver exposes a monotonic incremental System
+Generation u32 counter via a char-dev filesystem interface accessible
+through ``/dev/sysgenid`` that provides sync and async SysGen counter
+update notifications. It also provides SysGen counter retrieval and
+confirmation mechanisms.
+
+The counter starts from zero when the driver is initialized and
+monotonically increments every time the system generation changes.
+
+System generation changes are driven by userspace software through a
+dedicated driver ioctl.
+
+Userspace applications or libraries can (a)synchronously consume the
+system generation counter through the provided filesystem interface, to
+make any necessary internal adjustments following a system generation
+update.
+
+**Please note**, SysGenID alone does not guarantee complete snapshot
+safety to applications using it. A certain workflow needs to be
+followed at the system level, in order to make the system
+snapshot-resilient. Please see the "Snapshot Safety Prerequisites"
+section below.
+
+Driver filesystem interface
+===========================
+
+``open()``:
+  When the device is opened, a copy of the current SysGenID (counter)
+  is associated with the open file descriptor. Every open file
+  descriptor will have readable data available (EPOLLIN) while its
+  current copy of the SysGenID is outdated. Reading from the fd will
+  provide the latest SysGenID, while writing to the fd will update the
+  fd-local copy of the SysGenID and is used as a confirmation
+  mechanism.
+
+``read()``:
+  Read is meant to provide the *new* system generation counter when a
+  generation change takes place. The read operation blocks until the
+  associated counter is no longer up to date, at which point the new
+  counter is provided/returned.  Nonblocking ``read()`` returns
+  ``EAGAIN`` to signal that there is no *new* counter value available.
+  The generation counter is considered *new* for each open file
+  descriptor that hasn't confirmed the new value following a generation
+  change. Therefore, once a generation change takes place, all
+  ``read()`` calls will immediately return the new generation counter
+  and will continue to do so until the new value is confirmed back to
+  the driver through ``write()``.
+  Partial reads are not allowed - read buffer needs to be at least
+  32 bits in size.
+
+``write()``:
+  Write is used to confirm the up-to-date SysGenID counter back to the
+  driver.
+  Following a VM generation change, all existing watchers are marked
+  as *outdated*. Each file descriptor will maintain the *outdated*
+  status until a ``write()`` containing the new up-to-date generation
+  counter is used as an update confirmation mechanism.
+  Partial writes are not allowed - write buffer should be exactly
+  32 bits in size.
+
+``poll()``:
+  Poll is implemented to allow polling for generation counter updates.
+  Such updates result in ``EPOLLIN`` polling status until the new
+  up-to-date counter is confirmed back to the driver through a
+  ``write()``.
+
+``ioctl()``:
+  The driver also adds support for waiting on open file descriptors
+  that haven't acknowledged a generation counter update, as well as a
+  mechanism for userspace to *trigger* a generation update:
+
+  - SYSGENID_SET_WATCHER_TRACKING: takes a bool argument to set tracking
+    status for current file descriptor. When watcher tracking is
+    enabled, the driver tracks this file descriptor as an independent
+    *watcher*. The driver keeps accounting of how many watchers have
+    confirmed the latest Sys-Gen-Id counter and how many of them are
+    *outdated*; an outdated watcher is a *tracked* open file descriptor
+    that has lived through a Sys-Gen-Id change but has not yet confirmed
+    the new generation counter.
+    Software that wants to be waited on by the system while it adjusts
+    to generation changes, should turn tracking on. The sysgenid driver
+    then keeps track of it and can block system-level adjustment process
+    until the software has finished adjusting and confirmed it through a
+    ``write()``.
+    Tracking is disabled by default and file descriptors need to
+    explicitly opt-in using this IOCTL.
+  - SYSGENID_WAIT_WATCHERS: blocks until there are no more *outdated*
+    tracked watchers or, if a ``timeout`` argument is provided, until
+    the timeout expires.
+    If the current caller is *outdated* or a generation change happens
+    while waiting (thus making current caller *outdated*), the ioctl
+    returns ``-EINTR`` to signal the user to handle event and retry.
+  - SYSGENID_TRIGGER_GEN_UPDATE: triggers a generation counter increment.
+    It takes a ``minimum-generation`` argument which represents the
+    minimum value the generation counter will be set to. For example if
+    current generation is ``x`` and ``SYSGENID_TRIGGER_GEN_UPDATE(y)``
+    is called, the generation counter will increment to ``MAX(y,x+1)``.
+    This IOCTL can only be used by processes with CAP_CHECKPOINT_RESTORE
+    or CAP_SYS_ADMIN capabilities.
+
+``mmap()``:
+  The driver supports ``PROT_READ, MAP_SHARED`` mmaps of a single page
+  in size. The first 4 bytes of the mapped page will contain an
+  up-to-date u32 copy of the system generation counter.
+  The mapped memory can be used as a low-latency generation counter
+  probe mechanism in critical sections.
+  The mmap() interface is targeted at libraries or code that needs to
+  check for generation changes in-line, where an event loop is not
+  available or read()/write() syscalls are too expensive.
+  In such cases, logic can be added in-line with the sensitive code to
+  check and trigger on-demand/just-in-time readjustments when changes
+  are detected on the memory mapped generation counter.
+  Users of this interface that plan to lazily adjust should not enable
+  watcher tracking, since waiting on them doesn't make sense.
+
+``close()``:
+  Removes the file descriptor as a system generation counter *watcher*.
+
+Snapshot Safety Prerequisites
+=============================
+
+If VM, container or other system-level snapshots happen asynchronously,
+at arbitrary times during an active workload there is no practical way
+to ensure that in-flight local copies or caches of world-unique data
+such as random values, secrets, UUIDs, etc are properly scrubbed and
+regenerated.
+The challenge stems from the fact that the categorization of data as
+snapshot-sensitive is only known to the software working with it, and
+this software has no logical control over the moment in time when an
+external system snapshot occurs.
+
+Let's take an OpenSSL session token for example. Even if the library
+code is made 100% snapshot-safe, meaning the library guarantees that
+the session token is unique (any snapshot that happened during the
+library call did not duplicate or leak the token), the token is still
+vulnerable to snapshot events while it transits the various layers of
+the library caller, then the various layers of the OS before leaving
+the system.
+
+To catch a secret while it's in-flight, we'd have to validate system
+generation at every layer, every step of the way. Even if that would
+be deemed the right solution, it would be a long road and a whole
+universe to patch before we get there.
+
+Bottom line is we don't have a way to track all of these in-flight
+secrets and dynamically scrub them from existence with snapshot
+events happening arbitrarily.
+
+Simplifyng assumption - safety prerequisite
+-------------------------------------------
+
+**Control the snapshot flow**, disallow snapshots coming at arbitrary
+moments in the workload lifetime.
+
+Use a system-level overseer entity that quiesces the system before
+snapshot, and post-snapshot-resume oversees that software components
+have readjusted to new environment, to the new generation. Only after,
+will the overseer un-quiesce the system and allow active workloads.
+
+Software components can choose whether they want to be tracked and
+waited on by the overseer by using the ``SYSGENID_SET_WATCHER_TRACKING``
+IOCTL.
+
+The sysgenid framework standardizes the API for system software to
+find out about needing to readjust and at the same time provides a
+mechanism for the overseer entity to wait for everyone to be done, the
+system to have readjusted, so it can un-quiesce.
+
+Example snapshot-safe workflow
+------------------------------
+
+1) Before taking a snapshot, quiesce the VM/container/system. Exactly
+   how this is achieved is very workload-specific, but the general
+   description is to get all software to an expected state where their
+   event loops dry up and they are effectively quiesced.
+2) Take snapshot.
+3) Resume the VM/container/system from said snapshot.
+4) Overseer will trigger generation bump using
+   ``SYSGENID_TRIGGER_GEN_UPDATE`` IOCLT.
+5) Software components which have ``/dev/sysgenid`` in their event
+   loops (either using ``poll()`` or ``read()``) are notified of the
+   generation change.
+   They do their specific internal adjustments. Some may have requested
+   to be tracked and waited on by the overseer, others might choose to
+   do their adjustments out of band and not block the overseer.
+   Tracked ones *must* signal when they are done/ready with a ``write()``
+   while the rest *should* also do so for cleanliness, but it's not
+   mandatory.
+6) Overseer will block and wait for all tracked watchers by using the
+   ``SYSGENID_WAIT_WATCHERS`` IOCTL. Once all tracked watchers are done
+   in step 5, this overseer will return from this blocking ioctl knowing
+   that the system has readjusted and is ready for active workload.
+7) Overseer un-quiesces system.
+8) There is a class of software, usually libraries, most notably PRNGs
+   or SSLs, that don't fit the event-loop model and also have strict
+   latency requirements. These can take advantage of the ``mmap()``
+   interface and lazily adjust on-demand whenever they are called after
+   un-quiesce.
+   For a well-designed service stack, these libraries should not be
+   called while system is quiesced. When workload is resumed by the
+   overseer, on the first call into these libs, they will safely JIT
+   readjust.
+   Users of this lazy on-demand readjustment model should not enable
+   watcher tracking since doing so would introduce a logical deadlock:
+   lazy adjustments happen only after un-quiesce, but un-quiesce is
+   blocked until all tracked watchers are up-to-date.
diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index 599bd44..5e9e3856 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -360,6 +360,7 @@  Code  Seq#    Include File                                           Comments
 0xDB  00-0F  drivers/char/mwave/mwavepub.h
 0xDD  00-3F                                                          ZFCP device driver see drivers/s390/scsi/
                                                                      <mailto:aherrman@de.ibm.com>
+0xE4  01-03  uapi/linux/sysgenid.h                                   SysGenID misc driver
 0xE5  00-3F  linux/fuse.h
 0xEC  00-01  drivers/platform/chrome/cros_ec_dev.h                   ChromeOS EC driver
 0xF3  00-3F  drivers/usb/misc/sisusbvga/sisusb.h                     sisfb (in development)
diff --git a/MAINTAINERS b/MAINTAINERS
index d92f85c..39f8e8d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17325,6 +17325,14 @@  L:	linux-mmc@vger.kernel.org
 S:	Maintained
 F:	drivers/mmc/host/sdhci-pci-dwc-mshc.c
 
+SYSGENID
+M:	Adrian Catangiu <acatan@amazon.com>
+L:	linux-kernel@vger.kernel.org
+S:	Supported
+F:	Documentation/misc-devices/sysgenid.rst
+F:	drivers/misc/sysgenid.c
+F:	include/uapi/linux/sysgenid.h
+
 SYSTEM CONFIGURATION (SYSCON)
 M:	Lee Jones <lee.jones@linaro.org>
 M:	Arnd Bergmann <arnd@arndb.de>
diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index f532c59..e345620 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -435,6 +435,21 @@  config PVPANIC
 	  a paravirtualized device provided by QEMU; it lets a virtual machine
 	  (guest) communicate panic events to the host.
 
+config SYSGENID
+	tristate "System Generation ID driver"
+	help
+	  This is a System Generation ID driver which provides a system
+	  generation counter. The driver exposes FS ops on /dev/sysgenid
+	  through which it can provide information and notifications on system
+	  generation changes that happen because of VM or container snapshots
+	  or cloning.
+	  This enables applications and libraries that store or cache
+	  sensitive information, to know that they need to regenerate it
+	  after process memory has been exposed to potential copying.
+
+	  To compile this driver as a module, choose M here: the
+	  module will be called sysgenid.
+
 config HISI_HIKEY_USB
 	tristate "USB GPIO Hub on HiSilicon Hikey 960/970 Platform"
 	depends on (OF && GPIOLIB) || COMPILE_TEST
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index 99b6f15..3dbf0d3 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -56,3 +56,4 @@  obj-$(CONFIG_HABANA_AI)		+= habanalabs/
 obj-$(CONFIG_UACCE)		+= uacce/
 obj-$(CONFIG_XILINX_SDFEC)	+= xilinx_sdfec.o
 obj-$(CONFIG_HISI_HIKEY_USB)	+= hisi_hikey_usb.o
+obj-$(CONFIG_SYSGENID)		+= sysgenid.o
diff --git a/drivers/misc/sysgenid.c b/drivers/misc/sysgenid.c
new file mode 100644
index 0000000..b3b5288
--- /dev/null
+++ b/drivers/misc/sysgenid.c
@@ -0,0 +1,316 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * System Generation ID driver
+ *
+ * Copyright (C) 2020 Amazon. All rights reserved.
+ *
+ *	Authors:
+ *	  Adrian Catangiu <acatan@amazon.com>
+ *
+ */
+#include <linux/acpi.h>
+#include <linux/kernel.h>
+#include <linux/minmax.h>
+#include <linux/miscdevice.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/poll.h>
+#include <linux/random.h>
+#include <linux/uuid.h>
+#include <linux/sysgenid.h>
+
+struct sysgenid_data {
+	unsigned long		map_buf;
+	wait_queue_head_t	read_waitq;
+	atomic_t		generation_counter;
+
+	unsigned int		watchers;
+	atomic_t		outdated_watchers;
+	wait_queue_head_t	outdated_waitq;
+	spinlock_t		lock;
+};
+static struct sysgenid_data sysgenid_data;
+
+struct file_data {
+	bool tracked_watcher;
+	int acked_gen_counter;
+};
+
+static int equals_gen_counter(unsigned int counter)
+{
+	return counter == atomic_read(&sysgenid_data.generation_counter);
+}
+
+static void bump_generation(int min_gen)
+{
+	unsigned long flags;
+	int counter;
+
+	spin_lock_irqsave(&sysgenid_data.lock, flags);
+	counter = max(min_gen, 1 + atomic_read(&sysgenid_data.generation_counter));
+	atomic_set(&sysgenid_data.generation_counter, counter);
+	*((int *) sysgenid_data.map_buf) = counter;
+	atomic_set(&sysgenid_data.outdated_watchers, sysgenid_data.watchers);
+
+	wake_up_interruptible(&sysgenid_data.read_waitq);
+	wake_up_interruptible(&sysgenid_data.outdated_waitq);
+	spin_unlock_irqrestore(&sysgenid_data.lock, flags);
+}
+
+static void put_outdated_watchers(void)
+{
+	if (atomic_dec_and_test(&sysgenid_data.outdated_watchers))
+		wake_up_interruptible(&sysgenid_data.outdated_waitq);
+}
+
+static void start_fd_tracking(struct file_data *fdata)
+{
+	unsigned long flags;
+
+	if (!fdata->tracked_watcher) {
+		/* enable tracking this fd as a watcher */
+		spin_lock_irqsave(&sysgenid_data.lock, flags);
+			fdata->tracked_watcher = 1;
+			++sysgenid_data.watchers;
+			if (!equals_gen_counter(fdata->acked_gen_counter))
+				atomic_inc(&sysgenid_data.outdated_watchers);
+		spin_unlock_irqrestore(&sysgenid_data.lock, flags);
+	}
+}
+
+static void stop_fd_tracking(struct file_data *fdata)
+{
+	unsigned long flags;
+
+	if (fdata->tracked_watcher) {
+		/* stop tracking this fd as a watcher */
+		spin_lock_irqsave(&sysgenid_data.lock, flags);
+		if (!equals_gen_counter(fdata->acked_gen_counter))
+			put_outdated_watchers();
+		--sysgenid_data.watchers;
+		fdata->tracked_watcher = 0;
+		spin_unlock_irqrestore(&sysgenid_data.lock, flags);
+	}
+}
+
+static int sysgenid_open(struct inode *inode, struct file *file)
+{
+	struct file_data *fdata = kzalloc(sizeof(struct file_data), GFP_KERNEL);
+
+	if (!fdata)
+		return -ENOMEM;
+	fdata->tracked_watcher = 0;
+	fdata->acked_gen_counter = atomic_read(&sysgenid_data.generation_counter);
+	file->private_data = fdata;
+
+	return 0;
+}
+
+static int sysgenid_close(struct inode *inode, struct file *file)
+{
+	struct file_data *fdata = file->private_data;
+
+	stop_fd_tracking(fdata);
+	kfree(fdata);
+
+	return 0;
+}
+
+static ssize_t sysgenid_read(struct file *file, char __user *ubuf,
+		size_t nbytes, loff_t *ppos)
+{
+	struct file_data *fdata = file->private_data;
+	ssize_t ret;
+	int gen_counter;
+
+	if (nbytes == 0)
+		return 0;
+	/* disallow partial reads */
+	if (nbytes < sizeof(gen_counter))
+		return -EINVAL;
+
+	if (equals_gen_counter(fdata->acked_gen_counter)) {
+		if (file->f_flags & O_NONBLOCK)
+			return -EAGAIN;
+		ret = wait_event_interruptible(
+			sysgenid_data.read_waitq,
+			!equals_gen_counter(fdata->acked_gen_counter)
+		);
+		if (ret)
+			return ret;
+	}
+
+	gen_counter = atomic_read(&sysgenid_data.generation_counter);
+	ret = copy_to_user(ubuf, &gen_counter, sizeof(gen_counter));
+	if (ret)
+		return -EFAULT;
+
+	return sizeof(gen_counter);
+}
+
+static ssize_t sysgenid_write(struct file *file, const char __user *ubuf,
+		size_t count, loff_t *ppos)
+{
+	struct file_data *fdata = file->private_data;
+	unsigned int new_acked_gen;
+	unsigned long flags;
+
+	/* disallow partial writes */
+	if (count != sizeof(new_acked_gen))
+		return -ENOBUFS;
+	if (copy_from_user(&new_acked_gen, ubuf, count))
+		return -EFAULT;
+
+	spin_lock_irqsave(&sysgenid_data.lock, flags);
+	/* wrong gen-counter acknowledged */
+	if (!equals_gen_counter(new_acked_gen)) {
+		spin_unlock_irqrestore(&sysgenid_data.lock, flags);
+		return -EINVAL;
+	}
+	/* update acked gen-counter if necessary */
+	if (!equals_gen_counter(fdata->acked_gen_counter)) {
+		fdata->acked_gen_counter = new_acked_gen;
+		if (fdata->tracked_watcher)
+			put_outdated_watchers();
+	}
+	spin_unlock_irqrestore(&sysgenid_data.lock, flags);
+
+	return (ssize_t)count;
+}
+
+static __poll_t sysgenid_poll(struct file *file, poll_table *wait)
+{
+	__poll_t mask = 0;
+	struct file_data *fdata = file->private_data;
+
+	if (!equals_gen_counter(fdata->acked_gen_counter))
+		return EPOLLIN | EPOLLRDNORM;
+
+	poll_wait(file, &sysgenid_data.read_waitq, wait);
+
+	if (!equals_gen_counter(fdata->acked_gen_counter))
+		mask = EPOLLIN | EPOLLRDNORM;
+
+	return mask;
+}
+
+static long sysgenid_ioctl(struct file *file,
+		unsigned int cmd, unsigned long arg)
+{
+	struct file_data *fdata = file->private_data;
+	bool tracking = !!arg;
+	unsigned long timeout_ns, min_gen;
+	ktime_t until;
+	int ret = 0;
+
+	switch (cmd) {
+	case SYSGENID_SET_WATCHER_TRACKING:
+		if (tracking)
+			start_fd_tracking(fdata);
+		else
+			stop_fd_tracking(fdata);
+		break;
+	case SYSGENID_WAIT_WATCHERS:
+		timeout_ns = arg * NSEC_PER_MSEC;
+		until = timeout_ns ? ktime_set(0, timeout_ns) : KTIME_MAX;
+
+		ret = wait_event_interruptible_hrtimeout(
+			sysgenid_data.outdated_waitq,
+			(!atomic_read(&sysgenid_data.outdated_watchers) ||
+					!equals_gen_counter(fdata->acked_gen_counter)),
+			until
+		);
+		if (!equals_gen_counter(fdata->acked_gen_counter))
+			ret = -EINTR;
+		break;
+	case SYSGENID_TRIGGER_GEN_UPDATE:
+		if (!checkpoint_restore_ns_capable(current_user_ns()))
+			return -EACCES;
+		min_gen = arg;
+		bump_generation(min_gen);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+	return ret;
+}
+
+static int sysgenid_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct file_data *fdata = file->private_data;
+
+	if (vma->vm_pgoff != 0 || vma_pages(vma) > 1)
+		return -EINVAL;
+
+	if ((vma->vm_flags & VM_WRITE) != 0)
+		return -EPERM;
+
+	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;
+	vma->vm_flags &= ~VM_MAYWRITE;
+	vma->vm_private_data = fdata;
+
+	return vm_insert_page(vma, vma->vm_start,
+			virt_to_page(sysgenid_data.map_buf));
+}
+
+static const struct file_operations fops = {
+	.owner		= THIS_MODULE,
+	.mmap		= sysgenid_mmap,
+	.open		= sysgenid_open,
+	.release	= sysgenid_close,
+	.read		= sysgenid_read,
+	.write		= sysgenid_write,
+	.poll		= sysgenid_poll,
+	.unlocked_ioctl	= sysgenid_ioctl,
+};
+
+static struct miscdevice sysgenid_misc = {
+	.minor = MISC_DYNAMIC_MINOR,
+	.name = "sysgenid",
+	.fops = &fops,
+};
+
+static int __init sysgenid_init(void)
+{
+	int ret;
+
+	sysgenid_data.map_buf = get_zeroed_page(GFP_KERNEL);
+	if (!sysgenid_data.map_buf)
+		return -ENOMEM;
+
+	atomic_set(&sysgenid_data.generation_counter, 0);
+	atomic_set(&sysgenid_data.outdated_watchers, 0);
+	init_waitqueue_head(&sysgenid_data.read_waitq);
+	init_waitqueue_head(&sysgenid_data.outdated_waitq);
+	spin_lock_init(&sysgenid_data.lock);
+
+	ret = misc_register(&sysgenid_misc);
+	if (ret < 0) {
+		pr_err("misc_register() failed for sysgenid\n");
+		goto err;
+	}
+
+	return 0;
+
+err:
+	free_pages(sysgenid_data.map_buf, 0);
+	sysgenid_data.map_buf = 0;
+
+	return ret;
+}
+
+static void __exit sysgenid_exit(void)
+{
+	misc_deregister(&sysgenid_misc);
+	free_pages(sysgenid_data.map_buf, 0);
+	sysgenid_data.map_buf = 0;
+}
+
+module_init(sysgenid_init);
+module_exit(sysgenid_exit);
+
+MODULE_AUTHOR("Adrian Catangiu");
+MODULE_DESCRIPTION("System Generation ID");
+MODULE_LICENSE("GPL");
+MODULE_VERSION("0.1");
diff --git a/include/uapi/linux/sysgenid.h b/include/uapi/linux/sysgenid.h
new file mode 100644
index 0000000..6cce0d0
--- /dev/null
+++ b/include/uapi/linux/sysgenid.h
@@ -0,0 +1,14 @@ 
+/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
+
+#ifndef _UAPI_LINUX_SYSGENID_H
+#define _UAPI_LINUX_SYSGENID_H
+
+#include <linux/ioctl.h>
+
+#define SYSGENID_IOCTL			0xE4
+#define SYSGENID_SET_WATCHER_TRACKING	_IO(SYSGENID_IOCTL, 1)
+#define SYSGENID_WAIT_WATCHERS		_IO(SYSGENID_IOCTL, 2)
+#define SYSGENID_TRIGGER_GEN_UPDATE	_IO(SYSGENID_IOCTL, 3)
+
+#endif /* _UAPI_LINUX_SYSGENID_H */
+