Message ID | 159002475918.686697.11844615159862491335.stgit@dwillia2-desk3.amr.corp.intel.com
---|---
State | New, archived
Series | [v3] /dev/mem: Revoke mappings when a driver claims the region
On Wed, May 20, 2020 at 06:35:25PM -0700, Dan Williams wrote:
> +static struct inode *devmem_inode;
> +
> +#ifdef CONFIG_IO_STRICT_DEVMEM
> +void revoke_devmem(struct resource *res)
> +{
> +	struct inode *inode = READ_ONCE(devmem_inode);
> +
> +	/*
> +	 * Check that the initialization has completed. Losing the race
> +	 * is ok because it means drivers are claiming resources before
> +	 * the fs_initcall level of init and prevent /dev/mem from
> +	 * establishing mappings.
> +	 */
> +	smp_rmb();
> +	if (!inode)
> +		return;

But we don't need the smp_rmb() here, right?  READ_ONCE and WRITE_ONCE
are a DATA DEPENDENCY barrier (in Documentation/memory-barriers.txt parlance)
so the smp_rmb() is superfluous ...

> +	/*
> +	 * Use a unified address space to have a single point to manage
> +	 * revocations when drivers want to take over a /dev/mem mapped
> +	 * range.
> +	 */
> +	inode->i_mapping = devmem_inode->i_mapping;
> +	inode->i_mapping->host = devmem_inode;

umm ... devmem_inode->i_mapping->host doesn't already point to devmem_inode?

> +
> +	/* publish /dev/mem initialized */
> +	smp_wmb();
> +	WRITE_ONCE(devmem_inode, inode);

As above, unnecessary barrier, I think.
On Wed, May 20, 2020 at 7:26 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, May 20, 2020 at 06:35:25PM -0700, Dan Williams wrote:
> > +static struct inode *devmem_inode;
> > +
> > +#ifdef CONFIG_IO_STRICT_DEVMEM
> > +void revoke_devmem(struct resource *res)
> > +{
> > +	struct inode *inode = READ_ONCE(devmem_inode);
> > +
> > +	/*
> > +	 * Check that the initialization has completed. Losing the race
> > +	 * is ok because it means drivers are claiming resources before
> > +	 * the fs_initcall level of init and prevent /dev/mem from
> > +	 * establishing mappings.
> > +	 */
> > +	smp_rmb();
> > +	if (!inode)
> > +		return;
>
> But we don't need the smp_rmb() here, right?  READ_ONCE and WRITE_ONCE
> are a DATA DEPENDENCY barrier (in Documentation/memory-barriers.txt parlance)
> so the smp_rmb() is superfluous ...

Is it? I did not grok that from Documentation/memory-barriers.txt.
READ_ONCE and WRITE_ONCE are certainly ordered with respect to each
other in the same function, but I thought they still depend on
barriers for smp ordering?

>
> > +	/*
> > +	 * Use a unified address space to have a single point to manage
> > +	 * revocations when drivers want to take over a /dev/mem mapped
> > +	 * range.
> > +	 */
> > +	inode->i_mapping = devmem_inode->i_mapping;
> > +	inode->i_mapping->host = devmem_inode;
>
> umm ... devmem_inode->i_mapping->host doesn't already point to devmem_inode?

Not if inode is coming from:

mknod ./newmem c 1 1

...that's the problem that a unified inode solves. You can mknod all
you want, but mapping and mapping->host will point to a common
instance.

>
> > +
> > +	/* publish /dev/mem initialized */
> > +	smp_wmb();
> > +	WRITE_ONCE(devmem_inode, inode);
>
> As above, unnecessary barrier, I think.

Well, if you're not sure, how sure should I be?
On Wed, May 20, 2020 at 9:37 PM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Wed, May 20, 2020 at 7:26 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Wed, May 20, 2020 at 06:35:25PM -0700, Dan Williams wrote:
> > > +static struct inode *devmem_inode;
> > > +
> > > +#ifdef CONFIG_IO_STRICT_DEVMEM
> > > +void revoke_devmem(struct resource *res)
> > > +{
> > > +	struct inode *inode = READ_ONCE(devmem_inode);
> > > +
> > > +	/*
> > > +	 * Check that the initialization has completed. Losing the race
> > > +	 * is ok because it means drivers are claiming resources before
> > > +	 * the fs_initcall level of init and prevent /dev/mem from
> > > +	 * establishing mappings.
> > > +	 */
> > > +	smp_rmb();
> > > +	if (!inode)
> > > +		return;
> >
> > But we don't need the smp_rmb() here, right?  READ_ONCE and WRITE_ONCE
> > are a DATA DEPENDENCY barrier (in Documentation/memory-barriers.txt parlance)
> > so the smp_rmb() is superfluous ...
>
> Is it? I did not grok that from Documentation/memory-barriers.txt.
> READ_ONCE and WRITE_ONCE are certainly ordered with respect to each
> other in the same function, but I thought they still depend on
> barriers for smp ordering?
>
> >
> > > +	/*
> > > +	 * Use a unified address space to have a single point to manage
> > > +	 * revocations when drivers want to take over a /dev/mem mapped
> > > +	 * range.
> > > +	 */
> > > +	inode->i_mapping = devmem_inode->i_mapping;
> > > +	inode->i_mapping->host = devmem_inode;
> >
> > umm ... devmem_inode->i_mapping->host doesn't already point to devmem_inode?
>
> Not if inode is coming from:
>
> mknod ./newmem c 1 1
>
> ...that's the problem that a unified inode solves. You can mknod all
> you want, but mapping and mapping->host will point to a common
> instance.
>
> >
> > > +
> > > +	/* publish /dev/mem initialized */
> > > +	smp_wmb();
> > > +	WRITE_ONCE(devmem_inode, inode);
> >
> > As above, unnecessary barrier, I think.
>
> Well, if you're not sure, how sure should I be?

I'm pretty sure they are needed, because I need the prior writes to
initialize the inode to be fenced before the final write to publish
the inode. I don't think WRITE_ONCE() enforces that prior writes have
completed.
On Wed, May 20, 2020 at 09:39:49PM -0700, Dan Williams wrote:
> On Wed, May 20, 2020 at 9:37 PM Dan Williams <dan.j.williams@intel.com> wrote:
> > On Wed, May 20, 2020 at 7:26 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > On Wed, May 20, 2020 at 06:35:25PM -0700, Dan Williams wrote:
> > > > +static struct inode *devmem_inode;
> > > > +
> > > > +#ifdef CONFIG_IO_STRICT_DEVMEM
> > > > +void revoke_devmem(struct resource *res)
> > > > +{
> > > > +	struct inode *inode = READ_ONCE(devmem_inode);
> > > > +
> > > > +	/*
> > > > +	 * Check that the initialization has completed. Losing the race
> > > > +	 * is ok because it means drivers are claiming resources before
> > > > +	 * the fs_initcall level of init and prevent /dev/mem from
> > > > +	 * establishing mappings.
> > > > +	 */
> > > > +	smp_rmb();
> > > > +	if (!inode)
> > > > +		return;
> > >
> > > But we don't need the smp_rmb() here, right?  READ_ONCE and WRITE_ONCE
> > > are a DATA DEPENDENCY barrier (in Documentation/memory-barriers.txt parlance)
> > > so the smp_rmb() is superfluous ...
> >
> > Is it? I did not grok that from Documentation/memory-barriers.txt.
> > READ_ONCE and WRITE_ONCE are certainly ordered with respect to each
> > other in the same function, but I thought they still depend on
> > barriers for smp ordering?
> >
> > > > +
> > > > +	/* publish /dev/mem initialized */
> > > > +	smp_wmb();
> > > > +	WRITE_ONCE(devmem_inode, inode);
> > >
> > > As above, unnecessary barrier, I think.
> >
> > Well, if you're not sure, how sure should I be?
>
> I'm pretty sure they are needed, because I need the prior writes to
> initialize the inode to be fenced before the final write to publish
> the inode. I don't think WRITE_ONCE() enforces that prior writes have
> completed.

Completed, no, but I think it does enforce that they're visible to other
CPUs before this write is visible to other CPUs.

I'll quote relevant bits from the document ...

 (2) Data dependency barriers.

     A data dependency barrier is a weaker form of read barrier.  In the case
     where two loads are performed such that the second depends on the result
     of the first (eg: the first load retrieves the address to which the second
     load will be directed), a data dependency barrier would be required to
     make sure that the target of the second load is updated after the address
     obtained by the first load is accessed.

[...]

SMP BARRIER PAIRING
-------------------

[...]

        CPU 1                 CPU 2
        ===============       ===============================
        a = 1;
        <write barrier>
        WRITE_ONCE(b, &a);    x = READ_ONCE(b);
                              <data dependency barrier>
                              y = *x;

> >
> > >
> > > > +	/*
> > > > +	 * Use a unified address space to have a single point to manage
> > > > +	 * revocations when drivers want to take over a /dev/mem mapped
> > > > +	 * range.
> > > > +	 */
> > > > +	inode->i_mapping = devmem_inode->i_mapping;
> > > > +	inode->i_mapping->host = devmem_inode;
> > >
> > > umm ... devmem_inode->i_mapping->host doesn't already point to devmem_inode?
> >
> > Not if inode is coming from:
> >
> > mknod ./newmem c 1 1
> >
> > ...that's the problem that a unified inode solves. You can mknod all
> > you want, but mapping and mapping->host will point to a common
> > instance.

I don't think I explained myself well enough.

When we initialise devmem_inode, does devmem_inode->i_mapping->host point
to somewhere other than devmem_inode?

I appreciate in this function, inode->i_mapping->host will point to inode.
But we're now changing i_mapping to be devmem_inode's i_mapping.  Why
do we need to change devmem_inode's i_mapping->host pointer?
On Thu, May 21, 2020 at 4:41 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Wed, May 20, 2020 at 09:39:49PM -0700, Dan Williams wrote:
> > On Wed, May 20, 2020 at 9:37 PM Dan Williams <dan.j.williams@intel.com> wrote:
> > > On Wed, May 20, 2020 at 7:26 PM Matthew Wilcox <willy@infradead.org> wrote:
> > > > On Wed, May 20, 2020 at 06:35:25PM -0700, Dan Williams wrote:
> > > > > +static struct inode *devmem_inode;
> > > > > +
> > > > > +#ifdef CONFIG_IO_STRICT_DEVMEM
> > > > > +void revoke_devmem(struct resource *res)
> > > > > +{
> > > > > +	struct inode *inode = READ_ONCE(devmem_inode);
> > > > > +
> > > > > +	/*
> > > > > +	 * Check that the initialization has completed. Losing the race
> > > > > +	 * is ok because it means drivers are claiming resources before
> > > > > +	 * the fs_initcall level of init and prevent /dev/mem from
> > > > > +	 * establishing mappings.
> > > > > +	 */
> > > > > +	smp_rmb();
> > > > > +	if (!inode)
> > > > > +		return;
> > > >
> > > > But we don't need the smp_rmb() here, right?  READ_ONCE and WRITE_ONCE
> > > > are a DATA DEPENDENCY barrier (in Documentation/memory-barriers.txt parlance)
> > > > so the smp_rmb() is superfluous ...
> > >
> > > Is it? I did not grok that from Documentation/memory-barriers.txt.
> > > READ_ONCE and WRITE_ONCE are certainly ordered with respect to each
> > > other in the same function, but I thought they still depend on
> > > barriers for smp ordering?
> > >
> > > > > +
> > > > > +	/* publish /dev/mem initialized */
> > > > > +	smp_wmb();
> > > > > +	WRITE_ONCE(devmem_inode, inode);
> > > >
> > > > As above, unnecessary barrier, I think.
> > >
> > > Well, if you're not sure, how sure should I be?
> >
> > I'm pretty sure they are needed, because I need the prior writes to
> > initialize the inode to be fenced before the final write to publish
> > the inode. I don't think WRITE_ONCE() enforces that prior writes have
> > completed.
>
> Completed, no, but I think it does enforce that they're visible to other
> CPUs before this write is visible to other CPUs.
>
> I'll quote relevant bits from the document ...
>
>  (2) Data dependency barriers.
>
>      A data dependency barrier is a weaker form of read barrier.  In the case
>      where two loads are performed such that the second depends on the result
>      of the first (eg: the first load retrieves the address to which the second
>      load will be directed), a data dependency barrier would be required to
>      make sure that the target of the second load is updated after the address
>      obtained by the first load is accessed.
>
> [...]
> SMP BARRIER PAIRING
> -------------------
> [...]
>         CPU 1                 CPU 2
>         ===============       ===============================
>         a = 1;
>         <write barrier>
>         WRITE_ONCE(b, &a);    x = READ_ONCE(b);
>                               <data dependency barrier>
>                               y = *x;
>

Oh, I read those <* barrier> lines as a requirement not an implied side
effect of READ/WRITE_ONCE(). I can see that WRITE_ONCE() is effectively
a barrier() and READ_ONCE() includes smp_read_barrier_depends(). I'll
drop.

> > >
> > > >
> > > > > +	/*
> > > > > +	 * Use a unified address space to have a single point to manage
> > > > > +	 * revocations when drivers want to take over a /dev/mem mapped
> > > > > +	 * range.
> > > > > +	 */
> > > > > +	inode->i_mapping = devmem_inode->i_mapping;
> > > > > +	inode->i_mapping->host = devmem_inode;
> > > >
> > > > umm ... devmem_inode->i_mapping->host doesn't already point to devmem_inode?
> > >
> > > Not if inode is coming from:
> > >
> > > mknod ./newmem c 1 1
> > >
> > > ...that's the problem that a unified inode solves. You can mknod all
> > > you want, but mapping and mapping->host will point to a common
> > > instance.
>
> I don't think I explained myself well enough.
>
> When we initialise devmem_inode, does devmem_inode->i_mapping->host point
> to somewhere other than devmem_inode?
>
> I appreciate in this function, inode->i_mapping->host will point to inode.
> But we're now changing i_mapping to be devmem_inode's i_mapping.  Why
> do we need to change devmem_inode's i_mapping->host pointer?
>

Yeah, mistook your comment. The setting of ->host is indeed redundant.
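The ordering question resolved in this exchange is the standard pointer-publication idiom: order the stores that initialize the object before the store that publishes its pointer with smp_wmb(), and rely on READ_ONCE()'s address-dependency ordering on the reader side. A minimal stand-alone sketch of that idiom in kernel-style C follows; the function and variable names (published_inode, publish_example, consume_example) are illustrative and not part of the patch.

#include <linux/fs.h>
#include <linux/compiler.h>
#include <asm/barrier.h>

static struct inode *published_inode;

static void publish_example(struct inode *inode)
{
	/*
	 * Order every store that initialized *inode before the store that
	 * makes the pointer visible to other CPUs.
	 */
	smp_wmb();
	WRITE_ONCE(published_inode, inode);
}

static struct inode *consume_example(void)
{
	/*
	 * READ_ONCE() provides the read-side (address dependency) ordering,
	 * so a caller that dereferences the result sees the initialized
	 * fields; no explicit smp_rmb() is needed, which is the point
	 * Matthew makes above.
	 */
	return READ_ONCE(published_inode);
}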
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 43dd0891ca1e..46bea7a25983 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -31,11 +31,15 @@
 #include <linux/uio.h>
 #include <linux/uaccess.h>
 #include <linux/security.h>
+#include <linux/pseudo_fs.h>
+#include <uapi/linux/magic.h>
+#include <linux/mount.h>
 
 #ifdef CONFIG_IA64
 # include <linux/efi.h>
 #endif
 
+#define DEVMEM_MINOR	1
 #define DEVPORT_MINOR	4
 
 static inline unsigned long size_inside_page(unsigned long start,
@@ -805,12 +809,66 @@ static loff_t memory_lseek(struct file *file, loff_t offset, int orig)
 	return ret;
 }
 
+static struct inode *devmem_inode;
+
+#ifdef CONFIG_IO_STRICT_DEVMEM
+void revoke_devmem(struct resource *res)
+{
+	struct inode *inode = READ_ONCE(devmem_inode);
+
+	/*
+	 * Check that the initialization has completed. Losing the race
+	 * is ok because it means drivers are claiming resources before
+	 * the fs_initcall level of init and prevent /dev/mem from
+	 * establishing mappings.
+	 */
+	smp_rmb();
+	if (!inode)
+		return;
+
+	/*
+	 * The expectation is that the driver has successfully marked
+	 * the resource busy by this point, so devmem_is_allowed()
+	 * should start returning false, however for performance this
+	 * does not iterate the entire resource range.
+	 */
+	if (devmem_is_allowed(PHYS_PFN(res->start)) &&
+	    devmem_is_allowed(PHYS_PFN(res->end))) {
+		/*
+		 * *cringe* iomem=relaxed says "go ahead, what's the
+		 * worst that can happen?"
+		 */
+		return;
+	}
+
+	unmap_mapping_range(inode->i_mapping, res->start, resource_size(res), 1);
+}
+#endif
+
 static int open_port(struct inode *inode, struct file *filp)
 {
+	int rc;
+
 	if (!capable(CAP_SYS_RAWIO))
 		return -EPERM;
 
-	return security_locked_down(LOCKDOWN_DEV_MEM);
+	rc = security_locked_down(LOCKDOWN_DEV_MEM);
+	if (rc)
+		return rc;
+
+	if (iminor(inode) != DEVMEM_MINOR)
+		return 0;
+
+	/*
+	 * Use a unified address space to have a single point to manage
+	 * revocations when drivers want to take over a /dev/mem mapped
+	 * range.
+	 */
+	inode->i_mapping = devmem_inode->i_mapping;
+	inode->i_mapping->host = devmem_inode;
+	filp->f_mapping = inode->i_mapping;
+
+	return 0;
 }
 
 #define zero_lseek	null_lseek
@@ -885,7 +943,7 @@ static const struct memdev {
 	fmode_t fmode;
 } devlist[] = {
 #ifdef CONFIG_DEVMEM
-	 [1] = { "mem", 0, &mem_fops, FMODE_UNSIGNED_OFFSET },
+	 [DEVMEM_MINOR] = { "mem", 0, &mem_fops, FMODE_UNSIGNED_OFFSET },
 #endif
 #ifdef CONFIG_DEVKMEM
 	 [2] = { "kmem", 0, &kmem_fops, FMODE_UNSIGNED_OFFSET },
@@ -939,6 +997,46 @@ static char *mem_devnode(struct device *dev, umode_t *mode)
 
 static struct class *mem_class;
 
+static int devmem_fs_init_fs_context(struct fs_context *fc)
+{
+	return init_pseudo(fc, DEVMEM_MAGIC) ? 0 : -ENOMEM;
+}
+
+static struct file_system_type devmem_fs_type = {
+	.name		= "devmem",
+	.owner		= THIS_MODULE,
+	.init_fs_context = devmem_fs_init_fs_context,
+	.kill_sb	= kill_anon_super,
+};
+
+static int devmem_init_inode(void)
+{
+	static struct vfsmount *devmem_vfs_mount;
+	static int devmem_fs_cnt;
+	struct inode *inode;
+	int rc;
+
+	rc = simple_pin_fs(&devmem_fs_type, &devmem_vfs_mount, &devmem_fs_cnt);
+	if (rc < 0) {
+		pr_err("Cannot mount /dev/mem pseudo filesystem: %d\n", rc);
+		return rc;
+	}
+
+	inode = alloc_anon_inode(devmem_vfs_mount->mnt_sb);
+	if (IS_ERR(inode)) {
+		rc = PTR_ERR(inode);
+		pr_err("Cannot allocate inode for /dev/mem: %d\n", rc);
+		simple_release_fs(&devmem_vfs_mount, &devmem_fs_cnt);
+		return rc;
+	}
+
+	/* publish /dev/mem initialized */
+	smp_wmb();
+	WRITE_ONCE(devmem_inode, inode);
+
+	return 0;
+}
+
 static int __init chr_dev_init(void)
 {
 	int minor;
@@ -960,6 +1058,8 @@ static int __init chr_dev_init(void)
 		 */
 		if ((minor == DEVPORT_MINOR) && !arch_has_dev_port())
 			continue;
+		if ((minor == DEVMEM_MINOR) && devmem_init_inode() != 0)
+			continue;
 
 		device_create(mem_class, NULL, MKDEV(MEM_MAJOR, minor),
 			      NULL, devlist[minor].name);
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index a9b9170b5dd2..6c3eca90cbc4 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -301,5 +301,11 @@ struct resource *devm_request_free_mem_region(struct device *dev,
 struct resource *request_free_mem_region(struct resource *base,
 		unsigned long size, const char *name);
 
+#ifdef CONFIG_IO_STRICT_DEVMEM
+void revoke_devmem(struct resource *res);
+#else
+static inline void revoke_devmem(struct resource *res) { };
+#endif
+
 #endif /* __ASSEMBLY__ */
 #endif /* _LINUX_IOPORT_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index d78064007b17..f3956fc11de6 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -94,6 +94,7 @@
 #define BALLOON_KVM_MAGIC	0x13661366
 #define ZSMALLOC_MAGIC		0x58295829
 #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
+#define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define Z3FOLD_MAGIC		0x33
 #define PPC_CMM_MAGIC		0xc7571590
 
diff --git a/kernel/resource.c b/kernel/resource.c
index 76036a41143b..841737bbda9e 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -1126,6 +1126,7 @@ struct resource * __request_region(struct resource *parent,
 {
 	DECLARE_WAITQUEUE(wait, current);
 	struct resource *res = alloc_resource(GFP_KERNEL);
+	struct resource *orig_parent = parent;
 
 	if (!res)
 		return NULL;
@@ -1176,6 +1177,10 @@ struct resource * __request_region(struct resource *parent,
 			break;
 		}
 		write_unlock(&resource_lock);
+
+	if (res && orig_parent == &iomem_resource)
+		revoke_devmem(res);
+
 	return res;
 }
 EXPORT_SYMBOL(__request_region);
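For context on how the revocation gets triggered by the kernel/resource.c hunk above: request_mem_region() expands to __request_region(&iomem_resource, ...), so with this patch any successful claim against the iomem tree ends in a revoke_devmem() call. A hypothetical driver-side sketch (the device name and address range are made up for illustration, not taken from the patch):

#include <linux/ioport.h>
#include <linux/io.h>

/* Hypothetical range; a real driver discovers this from firmware or PCI. */
#define EXAMPLE_PHYS_BASE	0xfed40000UL
#define EXAMPLE_PHYS_SIZE	0x1000UL

static void __iomem *example_claim(void)
{
	/*
	 * request_mem_region() -> __request_region(&iomem_resource, ...),
	 * which, with this patch and CONFIG_IO_STRICT_DEVMEM, also calls
	 * revoke_devmem() and zaps any established /dev/mem mappings that
	 * overlap the claimed range.
	 */
	if (!request_mem_region(EXAMPLE_PHYS_BASE, EXAMPLE_PHYS_SIZE, "example"))
		return NULL;

	return ioremap(EXAMPLE_PHYS_BASE, EXAMPLE_PHYS_SIZE);
}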
Close the hole of holding a mapping over kernel driver takeover event of
a given address range.

Commit 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
introduced CONFIG_IO_STRICT_DEVMEM with the goal of protecting the
kernel against scenarios where a /dev/mem user tramples memory that a
kernel driver owns. However, this protection only prevents *new* read(),
write() and mmap() requests. Established mappings prior to the driver
calling request_mem_region() are left alone. Especially with persistent
memory, and the core kernel metadata that is stored there, there are
plentiful scenarios for a /dev/mem user to violate the expectations of
the driver and cause amplified damage.

Teach request_mem_region() to find and shoot down active /dev/mem
mappings that it believes it has successfully claimed for the exclusive
use of the driver. Effectively a driver call to request_mem_region()
becomes a hole-punch on the /dev/mem device.

The typical usage of unmap_mapping_range() is part of
truncate_pagecache() to punch a hole in a file, but in this case the
implementation is only doing the "first half" of a hole punch. Namely it
is just evacuating current established mappings of the "hole", and it
relies on the fact that /dev/mem establishes mappings in terms of
absolute physical address offsets. Once existing mmap users are
invalidated they can attempt to re-establish the mapping, or attempt to
continue issuing read(2) / write(2) to the invalidated extent, but they
will then be subject to the CONFIG_IO_STRICT_DEVMEM checking that can
block those subsequent accesses.

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
Changes since v2 [1]:
- Fix smp_wmb() placement relative to publishing write (Matthew)

[1]: http://lore.kernel.org/r/158987153989.4000084.17143582803685077783.stgit@dwillia2-desk3.amr.corp.intel.com

 drivers/char/mem.c         | 104 +++++++++++++++++++++++++++++++++++++++++++-
 include/linux/ioport.h     |   6 +++
 include/uapi/linux/magic.h |   1 
 kernel/resource.c          |   5 ++
 4 files changed, 114 insertions(+), 2 deletions(-)
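From userspace, the behavior described above can be observed with a trivial /dev/mem client: an established mapping keeps working while the range is idle, and once a driver claims the range the zapped pages fault on the next access instead of silently racing the driver. A rough sketch, assuming a 64-bit system, root, CONFIG_IO_STRICT_DEVMEM=y, and an example physical offset chosen for illustration:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	off_t phys = 0xfed40000;	/* example offset; must be idle iomem */
	int fd = open("/dev/mem", O_RDONLY);
	if (fd < 0)
		return 1;

	volatile uint32_t *p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, phys);
	if (p == MAP_FAILED)
		return 1;

	printf("read while idle: %#x\n", p[0]);

	/*
	 * If a driver now calls request_mem_region() on this range, the
	 * kernel unmaps the page from this process; the next dereference
	 * is expected to fault (typically SIGBUS) rather than touch
	 * driver-owned memory.
	 */
	pause();
	printf("read after claim: %#x\n", p[0]);
	return 0;
}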