diff mbox series

[v3] /dev/mem: Revoke mappings when a driver claims the region

Message ID 159002475918.686697.11844615159862491335.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State New, archived
Headers show
Series [v3] /dev/mem: Revoke mappings when a driver claims the region | expand

Commit Message

Dan Williams May 21, 2020, 1:35 a.m. UTC
Close the hole of holding a mapping over kernel driver takeover event of
a given address range.

Commit 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
introduced CONFIG_IO_STRICT_DEVMEM with the goal of protecting the
kernel against scenarios where a /dev/mem user tramples memory that a
kernel driver owns. However, this protection only prevents *new* read(),
write() and mmap() requests. Established mappings prior to the driver
calling request_mem_region() are left alone.

Especially with persistent memory, and the core kernel metadata that is
stored there, there are plentiful scenarios for a /dev/mem user to
violate the expectations of the driver and cause amplified damage.

Teach request_mem_region() to find and shoot down active /dev/mem
mappings that it believes it has successfully claimed for the exclusive
use of the driver. Effectively a driver call to request_mem_region()
becomes a hole-punch on the /dev/mem device.

The typical usage of unmap_mapping_range() is part of
truncate_pagecache() to punch a hole in a file, but in this case the
implementation is only doing the "first half" of a hole punch. Namely it
is just evacuating current established mappings of the "hole", and it
relies on the fact that /dev/mem establishes mappings in terms of
absolute physical address offsets. Once existing mmap users are
invalidated they can attempt to re-establish the mapping, or attempt to
continue issuing read(2) / write(2) to the invalidated extent, but they
will then be subject to the CONFIG_IO_STRICT_DEVMEM checking that can
block those subsequent accesses.

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
Changes since v2 [1]:

- Fix smp_wmb() placement relative to publishing write (Matthew)

[1]: http://lore.kernel.org/r/158987153989.4000084.17143582803685077783.stgit@dwillia2-desk3.amr.corp.intel.com

 drivers/char/mem.c         |  104 +++++++++++++++++++++++++++++++++++++++++++-
 include/linux/ioport.h     |    6 +++
 include/uapi/linux/magic.h |    1 
 kernel/resource.c          |    5 ++
 4 files changed, 114 insertions(+), 2 deletions(-)

Comments

Matthew Wilcox May 21, 2020, 2:26 a.m. UTC | #1
On Wed, May 20, 2020 at 06:35:25PM -0700, Dan Williams wrote:
> +static struct inode *devmem_inode;
> +
> +#ifdef CONFIG_IO_STRICT_DEVMEM
> +void revoke_devmem(struct resource *res)
> +{
> +	struct inode *inode = READ_ONCE(devmem_inode);
> +
> +	/*
> +	 * Check that the initialization has completed. Losing the race
> +	 * is ok because it means drivers are claiming resources before
> +	 * the fs_initcall level of init and prevent /dev/mem from
> +	 * establishing mappings.
> +	 */
> +	smp_rmb();
> +	if (!inode)
> +		return;

But we don't need the smp_rmb() here, right?  READ_ONCE and WRITE_ONCE
are a DATA DEPENDENCY barrier (in Documentation/memory-barriers.txt parlance)
so the smp_rmb() is superfluous ...

> +	/*
> +	 * Use a unified address space to have a single point to manage
> +	 * revocations when drivers want to take over a /dev/mem mapped
> +	 * range.
> +	 */
> +	inode->i_mapping = devmem_inode->i_mapping;
> +	inode->i_mapping->host = devmem_inode;

umm ... devmem_inode->i_mapping->host doesn't already point to devmem_inode?

> +
> +	/* publish /dev/mem initialized */
> +	smp_wmb();
> +	WRITE_ONCE(devmem_inode, inode);

As above, unnecessary barrier, I think.
Dan Williams May 21, 2020, 4:37 a.m. UTC | #2
On Wed, May 20, 2020 at 7:26 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, May 20, 2020 at 06:35:25PM -0700, Dan Williams wrote:
> > +static struct inode *devmem_inode;
> > +
> > +#ifdef CONFIG_IO_STRICT_DEVMEM
> > +void revoke_devmem(struct resource *res)
> > +{
> > +     struct inode *inode = READ_ONCE(devmem_inode);
> > +
> > +     /*
> > +      * Check that the initialization has completed. Losing the race
> > +      * is ok because it means drivers are claiming resources before
> > +      * the fs_initcall level of init and prevent /dev/mem from
> > +      * establishing mappings.
> > +      */
> > +     smp_rmb();
> > +     if (!inode)
> > +             return;
>
> But we don't need the smp_rmb() here, right?  READ_ONCE and WRITE_ONCE
> are a DATA DEPENDENCY barrier (in Documentation/memory-barriers.txt parlance)
> so the smp_rmb() is superfluous ...

Is it? I did not grok that from Documentation/memory-barriers.txt.
READ_ONCE and WRITE_ONCE are certainly ordered with respect to each
other in the same function, but I thought they still depend on
barriers for smp ordering?

>
> > +     /*
> > +      * Use a unified address space to have a single point to manage
> > +      * revocations when drivers want to take over a /dev/mem mapped
> > +      * range.
> > +      */
> > +     inode->i_mapping = devmem_inode->i_mapping;
> > +     inode->i_mapping->host = devmem_inode;
>
> umm ... devmem_inode->i_mapping->host doesn't already point to devmem_inode?

Not if inode is coming from:

     mknod ./newmem c 1 1

...that's the problem that a unified inode solves. You can mknod all
you want, but mapping and mapping->host will point to a common
instance.

>
> > +
> > +     /* publish /dev/mem initialized */
> > +     smp_wmb();
> > +     WRITE_ONCE(devmem_inode, inode);
>
> As above, unnecessary barrier, I think.

Well, if you're not sure, how sure should I be?
Dan Williams May 21, 2020, 4:39 a.m. UTC | #3
On Wed, May 20, 2020 at 9:37 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Wed, May 20, 2020 at 7:26 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Wed, May 20, 2020 at 06:35:25PM -0700, Dan Williams wrote:
> > > +static struct inode *devmem_inode;
> > > +
> > > +#ifdef CONFIG_IO_STRICT_DEVMEM
> > > +void revoke_devmem(struct resource *res)
> > > +{
> > > +     struct inode *inode = READ_ONCE(devmem_inode);
> > > +
> > > +     /*
> > > +      * Check that the initialization has completed. Losing the race
> > > +      * is ok because it means drivers are claiming resources before
> > > +      * the fs_initcall level of init and prevent /dev/mem from
> > > +      * establishing mappings.
> > > +      */
> > > +     smp_rmb();
> > > +     if (!inode)
> > > +             return;
> >
> > But we don't need the smp_rmb() here, right?  READ_ONCE and WRITE_ONCE
> > are a DATA DEPENDENCY barrier (in Documentation/memory-barriers.txt parlance)
> > so the smp_rmb() is superfluous ...
>
> Is it? I did not grok that from Documentation/memory-barriers.txt.
> READ_ONCE and WRITE_ONCE are certainly ordered with respect to each
> other in the same function, but I thought they still depend on
> barriers for smp ordering?
>
> >
> > > +     /*
> > > +      * Use a unified address space to have a single point to manage
> > > +      * revocations when drivers want to take over a /dev/mem mapped
> > > +      * range.
> > > +      */
> > > +     inode->i_mapping = devmem_inode->i_mapping;
> > > +     inode->i_mapping->host = devmem_inode;
> >
> > umm ... devmem_inode->i_mapping->host doesn't already point to devmem_inode?
>
> Not if inode is coming from:
>
>      mknod ./newmem c 1 1
>
> ...that's the problem that a unified inode solves. You can mknod all
> you want, but mapping and mapping->host will point to a common
> instance.
>
> >
> > > +
> > > +     /* publish /dev/mem initialized */
> > > +     smp_wmb();
> > > +     WRITE_ONCE(devmem_inode, inode);
> >
> > As above, unnecessary barrier, I think.
>
> Well, if you're not sure, how sure should I be?

I'm pretty sure they are needed, because I need the prior writes to
initialize the inode to be fenced before the final write to publish
the inode. I don't think WRITE_ONCE() enforces that prior writes have
completed.
Matthew Wilcox May 21, 2020, 11:41 a.m. UTC | #4
On Wed, May 20, 2020 at 09:39:49PM -0700, Dan Williams wrote:
> On Wed, May 20, 2020 at 9:37 PM Dan Williams <dan.j.williams@intel.com> wrote:
> > On Wed, May 20, 2020 at 7:26 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > On Wed, May 20, 2020 at 06:35:25PM -0700, Dan Williams wrote:
> > > > +static struct inode *devmem_inode;
> > > > +
> > > > +#ifdef CONFIG_IO_STRICT_DEVMEM
> > > > +void revoke_devmem(struct resource *res)
> > > > +{
> > > > +     struct inode *inode = READ_ONCE(devmem_inode);
> > > > +
> > > > +     /*
> > > > +      * Check that the initialization has completed. Losing the race
> > > > +      * is ok because it means drivers are claiming resources before
> > > > +      * the fs_initcall level of init and prevent /dev/mem from
> > > > +      * establishing mappings.
> > > > +      */
> > > > +     smp_rmb();
> > > > +     if (!inode)
> > > > +             return;
> > >
> > > But we don't need the smp_rmb() here, right?  READ_ONCE and WRITE_ONCE
> > > are a DATA DEPENDENCY barrier (in Documentation/memory-barriers.txt parlance)
> > > so the smp_rmb() is superfluous ...
> >
> > Is it? I did not grok that from Documentation/memory-barriers.txt.
> > READ_ONCE and WRITE_ONCE are certainly ordered with respect to each
> > other in the same function, but I thought they still depend on
> > barriers for smp ordering?
> >
> > > > +
> > > > +     /* publish /dev/mem initialized */
> > > > +     smp_wmb();
> > > > +     WRITE_ONCE(devmem_inode, inode);
> > >
> > > As above, unnecessary barrier, I think.
> >
> > Well, if you're not sure, how sure should I be?
> 
> I'm pretty sure they are needed, because I need the prior writes to
> initialize the inode to be fenced before the final write to publish
> the inode. I don't think WRITE_ONCE() enforces that prior writes have
> completed.

Completed, no, but I think it does enforce that they're visible to other
CPUs before this write is visible to other CPUs.

I'll quote relevant bits from the document ...

 (2) Data dependency barriers.

     A data dependency barrier is a weaker form of read barrier.  In the case
     where two loads are performed such that the second depends on the result
     of the first (eg: the first load retrieves the address to which the second
     load will be directed), a data dependency barrier would be required to
     make sure that the target of the second load is updated after the address
     obtained by the first load is accessed.

[...]
SMP BARRIER PAIRING
-------------------
[...]
        CPU 1                 CPU 2
        ===============       ===============================
        a = 1;
        <write barrier>
        WRITE_ONCE(b, &a);    x = READ_ONCE(b);
                              <data dependency barrier>
                              y = *x;


> > >
> > > > +     /*
> > > > +      * Use a unified address space to have a single point to manage
> > > > +      * revocations when drivers want to take over a /dev/mem mapped
> > > > +      * range.
> > > > +      */
> > > > +     inode->i_mapping = devmem_inode->i_mapping;
> > > > +     inode->i_mapping->host = devmem_inode;
> > >
> > > umm ... devmem_inode->i_mapping->host doesn't already point to devmem_inode?
> >
> > Not if inode is coming from:
> >
> >      mknod ./newmem c 1 1
> >
> > ...that's the problem that a unified inode solves. You can mknod all
> > you want, but mapping and mapping->host will point to a common
> > instance.

I don't think I explained myself well enough.

When we initialise devmem_inode, does devmem_inode->i_mapping->host point
to somewhere other than devmem_inode?

I appreciate in this function, inode->i_mapping->host will point to inode.
But we're now changing i_mapping to be devmem_inode's i_mapping.  Why
do we need to change devmem_inode's i_mapping->host pointer?
Dan Williams May 21, 2020, 6:21 p.m. UTC | #5
On Thu, May 21, 2020 at 4:41 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, May 20, 2020 at 09:39:49PM -0700, Dan Williams wrote:
> > On Wed, May 20, 2020 at 9:37 PM Dan Williams <dan.j.williams@intel.com> wrote:
> > > On Wed, May 20, 2020 at 7:26 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > On Wed, May 20, 2020 at 06:35:25PM -0700, Dan Williams wrote:
> > > > > +static struct inode *devmem_inode;
> > > > > +
> > > > > +#ifdef CONFIG_IO_STRICT_DEVMEM
> > > > > +void revoke_devmem(struct resource *res)
> > > > > +{
> > > > > +     struct inode *inode = READ_ONCE(devmem_inode);
> > > > > +
> > > > > +     /*
> > > > > +      * Check that the initialization has completed. Losing the race
> > > > > +      * is ok because it means drivers are claiming resources before
> > > > > +      * the fs_initcall level of init and prevent /dev/mem from
> > > > > +      * establishing mappings.
> > > > > +      */
> > > > > +     smp_rmb();
> > > > > +     if (!inode)
> > > > > +             return;
> > > >
> > > > But we don't need the smp_rmb() here, right?  READ_ONCE and WRITE_ONCE
> > > > are a DATA DEPENDENCY barrier (in Documentation/memory-barriers.txt parlance)
> > > > so the smp_rmb() is superfluous ...
> > >
> > > Is it? I did not grok that from Documentation/memory-barriers.txt.
> > > READ_ONCE and WRITE_ONCE are certainly ordered with respect to each
> > > other in the same function, but I thought they still depend on
> > > barriers for smp ordering?
> > >
> > > > > +
> > > > > +     /* publish /dev/mem initialized */
> > > > > +     smp_wmb();
> > > > > +     WRITE_ONCE(devmem_inode, inode);
> > > >
> > > > As above, unnecessary barrier, I think.
> > >
> > > Well, if you're not sure, how sure should I be?
> >
> > I'm pretty sure they are needed, because I need the prior writes to
> > initialize the inode to be fenced before the final write to publish
> > the inode. I don't think WRITE_ONCE() enforces that prior writes have
> > completed.
>
> Completed, no, but I think it does enforce that they're visible to other
> CPUs before this write is visible to other CPUs.
>
> I'll quote relevant bits from the document ...
>
>  (2) Data dependency barriers.
>
>      A data dependency barrier is a weaker form of read barrier.  In the case
>      where two loads are performed such that the second depends on the result
>      of the first (eg: the first load retrieves the address to which the second
>      load will be directed), a data dependency barrier would be required to
>      make sure that the target of the second load is updated after the address
>      obtained by the first load is accessed.
>
> [...]
> SMP BARRIER PAIRING
> -------------------
> [...]
>         CPU 1                 CPU 2
>         ===============       ===============================
>         a = 1;
>         <write barrier>
>         WRITE_ONCE(b, &a);    x = READ_ONCE(b);
>                               <data dependency barrier>
>                               y = *x;
>

Oh, I read those <* barrier> lines as a requirement not an implied
side effect of READ/WRITE_ONCE(). I can see that WRITE_ONCE() is
effectively a barrier() and READ_ONCE() includes
smp_read_barrier_depends(). I'll drop.

>
> > > >
> > > > > +     /*
> > > > > +      * Use a unified address space to have a single point to manage
> > > > > +      * revocations when drivers want to take over a /dev/mem mapped
> > > > > +      * range.
> > > > > +      */
> > > > > +     inode->i_mapping = devmem_inode->i_mapping;
> > > > > +     inode->i_mapping->host = devmem_inode;
> > > >
> > > > umm ... devmem_inode->i_mapping->host doesn't already point to devmem_inode?
> > >
> > > Not if inode is coming from:
> > >
> > >      mknod ./newmem c 1 1
> > >
> > > ...that's the problem that a unified inode solves. You can mknod all
> > > you want, but mapping and mapping->host will point to a common
> > > instance.
>
> I don't think I explained myself well enough.
>
> When we initialise devmem_inode, does devmem_inode->i_mapping->host point
> to somewhere other than devmem_inode?
>
> I appreciate in this function, inode->i_mapping->host will point to inode.
> But we're now changing i_mapping to be devmem_inode's i_mapping.  Why
> do we need to change devmem_inode's i_mapping->host pointer?
>

Yeah, mistook your comment. The setting of ->host is indeed redundant.
diff mbox series

Patch

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 43dd0891ca1e..46bea7a25983 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -31,11 +31,15 @@ 
 #include <linux/uio.h>
 #include <linux/uaccess.h>
 #include <linux/security.h>
+#include <linux/pseudo_fs.h>
+#include <uapi/linux/magic.h>
+#include <linux/mount.h>
 
 #ifdef CONFIG_IA64
 # include <linux/efi.h>
 #endif
 
+#define DEVMEM_MINOR	1
 #define DEVPORT_MINOR	4
 
 static inline unsigned long size_inside_page(unsigned long start,
@@ -805,12 +809,66 @@  static loff_t memory_lseek(struct file *file, loff_t offset, int orig)
 	return ret;
 }
 
+static struct inode *devmem_inode;
+
+#ifdef CONFIG_IO_STRICT_DEVMEM
+void revoke_devmem(struct resource *res)
+{
+	struct inode *inode = READ_ONCE(devmem_inode);
+
+	/*
+	 * Check that the initialization has completed. Losing the race
+	 * is ok because it means drivers are claiming resources before
+	 * the fs_initcall level of init and prevent /dev/mem from
+	 * establishing mappings.
+	 */
+	smp_rmb();
+	if (!inode)
+		return;
+
+	/*
+	 * The expectation is that the driver has successfully marked
+	 * the resource busy by this point, so devmem_is_allowed()
+	 * should start returning false, however for performance this
+	 * does not iterate the entire resource range.
+	 */
+	if (devmem_is_allowed(PHYS_PFN(res->start)) &&
+	    devmem_is_allowed(PHYS_PFN(res->end))) {
+		/*
+		 * *cringe* iomem=relaxed says "go ahead, what's the
+		 * worst that can happen?"
+		 */
+		return;
+	}
+
+	unmap_mapping_range(inode->i_mapping, res->start, resource_size(res), 1);
+}
+#endif
+
 static int open_port(struct inode *inode, struct file *filp)
 {
+	int rc;
+
 	if (!capable(CAP_SYS_RAWIO))
 		return -EPERM;
 
-	return security_locked_down(LOCKDOWN_DEV_MEM);
+	rc = security_locked_down(LOCKDOWN_DEV_MEM);
+	if (rc)
+		return rc;
+
+	if (iminor(inode) != DEVMEM_MINOR)
+		return 0;
+
+	/*
+	 * Use a unified address space to have a single point to manage
+	 * revocations when drivers want to take over a /dev/mem mapped
+	 * range.
+	 */
+	inode->i_mapping = devmem_inode->i_mapping;
+	inode->i_mapping->host = devmem_inode;
+	filp->f_mapping = inode->i_mapping;
+
+	return 0;
 }
 
 #define zero_lseek	null_lseek
@@ -885,7 +943,7 @@  static const struct memdev {
 	fmode_t fmode;
 } devlist[] = {
 #ifdef CONFIG_DEVMEM
-	 [1] = { "mem", 0, &mem_fops, FMODE_UNSIGNED_OFFSET },
+	 [DEVMEM_MINOR] = { "mem", 0, &mem_fops, FMODE_UNSIGNED_OFFSET },
 #endif
 #ifdef CONFIG_DEVKMEM
 	 [2] = { "kmem", 0, &kmem_fops, FMODE_UNSIGNED_OFFSET },
@@ -939,6 +997,46 @@  static char *mem_devnode(struct device *dev, umode_t *mode)
 
 static struct class *mem_class;
 
+static int devmem_fs_init_fs_context(struct fs_context *fc)
+{
+	return init_pseudo(fc, DEVMEM_MAGIC) ? 0 : -ENOMEM;
+}
+
+static struct file_system_type devmem_fs_type = {
+	.name		= "devmem",
+	.owner		= THIS_MODULE,
+	.init_fs_context = devmem_fs_init_fs_context,
+	.kill_sb	= kill_anon_super,
+};
+
+static int devmem_init_inode(void)
+{
+	static struct vfsmount *devmem_vfs_mount;
+	static int devmem_fs_cnt;
+	struct inode *inode;
+	int rc;
+
+	rc = simple_pin_fs(&devmem_fs_type, &devmem_vfs_mount, &devmem_fs_cnt);
+	if (rc < 0) {
+		pr_err("Cannot mount /dev/mem pseudo filesystem: %d\n", rc);
+		return rc;
+	}
+
+	inode = alloc_anon_inode(devmem_vfs_mount->mnt_sb);
+	if (IS_ERR(inode)) {
+		rc = PTR_ERR(inode);
+		pr_err("Cannot allocate inode for /dev/mem: %d\n", rc);
+		simple_release_fs(&devmem_vfs_mount, &devmem_fs_cnt);
+		return rc;
+	}
+
+	/* publish /dev/mem initialized */
+	smp_wmb();
+	WRITE_ONCE(devmem_inode, inode);
+
+	return 0;
+}
+
 static int __init chr_dev_init(void)
 {
 	int minor;
@@ -960,6 +1058,8 @@  static int __init chr_dev_init(void)
 		 */
 		if ((minor == DEVPORT_MINOR) && !arch_has_dev_port())
 			continue;
+		if ((minor == DEVMEM_MINOR) && devmem_init_inode() != 0)
+			continue;
 
 		device_create(mem_class, NULL, MKDEV(MEM_MAJOR, minor),
 			      NULL, devlist[minor].name);
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index a9b9170b5dd2..6c3eca90cbc4 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -301,5 +301,11 @@  struct resource *devm_request_free_mem_region(struct device *dev,
 struct resource *request_free_mem_region(struct resource *base,
 		unsigned long size, const char *name);
 
+#ifdef CONFIG_IO_STRICT_DEVMEM
+void revoke_devmem(struct resource *res);
+#else
+static inline void revoke_devmem(struct resource *res) { };
+#endif
+
 #endif /* __ASSEMBLY__ */
 #endif	/* _LINUX_IOPORT_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index d78064007b17..f3956fc11de6 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -94,6 +94,7 @@ 
 #define BALLOON_KVM_MAGIC	0x13661366
 #define ZSMALLOC_MAGIC		0x58295829
 #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
+#define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define Z3FOLD_MAGIC		0x33
 #define PPC_CMM_MAGIC		0xc7571590
 
diff --git a/kernel/resource.c b/kernel/resource.c
index 76036a41143b..841737bbda9e 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -1126,6 +1126,7 @@  struct resource * __request_region(struct resource *parent,
 {
 	DECLARE_WAITQUEUE(wait, current);
 	struct resource *res = alloc_resource(GFP_KERNEL);
+	struct resource *orig_parent = parent;
 
 	if (!res)
 		return NULL;
@@ -1176,6 +1177,10 @@  struct resource * __request_region(struct resource *parent,
 		break;
 	}
 	write_unlock(&resource_lock);
+
+	if (res && orig_parent == &iomem_resource)
+		revoke_devmem(res);
+
 	return res;
 }
 EXPORT_SYMBOL(__request_region);