
[v4] /dev/mem: Revoke mappings when a driver claims the region

Message ID 159009507306.847224.8502634072429766747.stgit@dwillia2-desk3.amr.corp.intel.com (mailing list archive)
State New, archived

Commit Message

Dan Williams May 21, 2020, 9:06 p.m. UTC
Close the hole of holding a mapping over a kernel driver takeover event of
a given address range.

Commit 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
introduced CONFIG_IO_STRICT_DEVMEM with the goal of protecting the
kernel against scenarios where a /dev/mem user tramples memory that a
kernel driver owns. However, this protection only prevents *new* read(),
write() and mmap() requests. Established mappings prior to the driver
calling request_mem_region() are left alone.

Especially with persistent memory, and the core kernel metadata that is
stored there, there are plentiful scenarios for a /dev/mem user to
violate the expectations of the driver and cause amplified damage.

Teach request_mem_region() to find and shoot down active /dev/mem
mappings that it believes it has successfully claimed for the exclusive
use of the driver. Effectively a driver call to request_mem_region()
becomes a hole-punch on the /dev/mem device.

The typical usage of unmap_mapping_range() is part of
truncate_pagecache() to punch a hole in a file, but in this case the
implementation is only doing the "first half" of a hole punch. Namely it
is just evacuating current established mappings of the "hole", and it
relies on the fact that /dev/mem establishes mappings in terms of
absolute physical address offsets. Once existing mmap users are
invalidated they can attempt to re-establish the mapping, or attempt to
continue issuing read(2) / write(2) to the invalidated extent, but they
will then be subject to the CONFIG_IO_STRICT_DEVMEM checking that can
block those subsequent accesses.

Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Fixes: 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
Changes since v3 [1]:

- Drop redundant memory barriers, READ_ONCE() and WRITE_ONCE() ensure
  sufficient ordering (Matthew)

- Drop redundant setting of i_mapping->host. (Matthew)

[1]: http://lore.kernel.org/r/159002475918.686697.11844615159862491335.stgit@dwillia2-desk3.amr.corp.intel.com

 drivers/char/mem.c         |  101 +++++++++++++++++++++++++++++++++++++++++++-
 include/linux/ioport.h     |    6 +++
 include/uapi/linux/magic.h |    1 +
 kernel/resource.c          |    5 ++
 4 files changed, 111 insertions(+), 2 deletions(-)
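
For orientation, the heart of the change is a single call: once a
driver's claim of a range succeeds, that physical span is unmapped from
the shared /dev/mem address_space. A simplified sketch (the actual patch
also fast-paths via devmem_is_allowed() checks, and names may differ):

/*
 * Sketch: called from __request_region() once the range is claimed.
 * /dev/mem file offsets are absolute physical addresses, so the
 * resource span translates directly to an offset/length in the
 * shared /dev/mem address_space.
 */
static void revoke_devmem(struct resource *res)
{
        /* published by an fs_initcall; NULL until then */
        struct inode *inode = READ_ONCE(devmem_inode);

        if (!inode)
                return;

        /* final argument 1 == even_cows: zap private copies too */
        unmap_mapping_range(inode->i_mapping, res->start,
                            resource_size(res), 1);
}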

Comments

Kees Cook May 22, 2020, 3:01 a.m. UTC | #1
On Thu, May 21, 2020 at 02:06:17PM -0700, Dan Williams wrote:
> The typical usage of unmap_mapping_range() is part of
> truncate_pagecache() to punch a hole in a file, but in this case the
> implementation is only doing the "first half" of a hole punch. Namely it
> is just evacuating current established mappings of the "hole", and it
> relies on the fact that /dev/mem establishes mappings in terms of
> absolute physical address offsets. Once existing mmap users are
> invalidated they can attempt to re-establish the mapping, or attempt to
> continue issuing read(2) / write(2) to the invalidated extent, but they
> will then be subject to the CONFIG_IO_STRICT_DEVMEM checking that can
> block those subsequent accesses.

Nice!

Reviewed-by: Kees Cook <keescook@chromium.org>

And a thread hijack...   ;)

I think this is very close to providing a way to solve another issue
I've had with /dev/mem, which is to zero the view of the first 1MB of
/dev/mem via mmap. I only fixed the read/write accesses:
a4866aa81251 ("mm: Tighten x86 /dev/mem with zeroing reads")
I.e. the low 1MB range should be considered allowed, but any reads will see
zeros.

> +	unmap_mapping_range(inode->i_mapping, res->start, resource_size(res), 1);

Is unmap_mapping_range() sufficient for this? Would it need to happen
once during open_port() or something more special during mmap_mem()?
Bjorn Helgaas May 27, 2021, 8:58 p.m. UTC | #2
[+cc Daniel, Krzysztof, Jason, Christoph, linux-pci]

On Thu, May 21, 2020 at 02:06:17PM -0700, Dan Williams wrote:
> Close the hole of holding a mapping over a kernel driver takeover event of
> a given address range.
> 
> Commit 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
> introduced CONFIG_IO_STRICT_DEVMEM with the goal of protecting the
> kernel against scenarios where a /dev/mem user tramples memory that a
> kernel driver owns. However, this protection only prevents *new* read(),
> write() and mmap() requests. Established mappings prior to the driver
> calling request_mem_region() are left alone.
> 
> Especially with persistent memory, and the core kernel metadata that is
> stored there, there are plentiful scenarios for a /dev/mem user to
> violate the expectations of the driver and cause amplified damage.
> 
> Teach request_mem_region() to find and shoot down active /dev/mem
> mappings that it believes it has successfully claimed for the exclusive
> use of the driver. Effectively a driver call to request_mem_region()
> becomes a hole-punch on the /dev/mem device.

This idea of hole-punching /dev/mem has since been extended to PCI
BARs via [1].

Correct me if I'm wrong: I think this means that if a user process has
mmapped a PCI BAR via sysfs, and a kernel driver subsequently requests
that region via pci_request_region() or similar, we punch holes in
the user process mmap.  The driver might be happy, but my guess is the
user starts seeing segmentation violations for no obvious reason and
is not happy.

Apart from the user process issue, the implementation of [1] is
problematic for PCI because the mmappable sysfs attributes now depend
on iomem_init_inode(), an fs_initcall, which means they can't be
static attributes, which ultimately leads to races in creating them.

So I'm raising the question of whether this hole-punch is the right
strategy.

  - Prior to revoke_iomem(), __request_region() was very
    self-contained and really only depended on the resource tree.  Now
    it depends on a lot of higher-level MM machinery to shoot down
    mappings of other tasks.  This adds quite a bit of complexity and
    some new ordering constraints.

  - Punching holes in the address space of an existing process seems
    unfriendly.  Maybe the driver's __request_region() should fail
    instead, since the driver should be prepared to handle failure
    there anyway.

  - [2] suggests that the hole punch protects drivers from /dev/mem
    writers, especially with persistent memory.  I'm not really
    convinced.  The hole punch does nothing to prevent a user process
    from mmapping and corrupting something before the driver loads.

Bjorn

[1] https://git.kernel.org/linus/636b21b50152
[2] https://git.kernel.org/linus/3234ac664a87

> The typical usage of unmap_mapping_range() is part of
> truncate_pagecache() to punch a hole in a file, but in this case the
> implementation is only doing the "first half" of a hole punch. Namely it
> is just evacuating current established mappings of the "hole", and it
> relies on the fact that /dev/mem establishes mappings in terms of
> absolute physical address offsets. Once existing mmap users are
> invalidated they can attempt to re-establish the mapping, or attempt to
> continue issuing read(2) / write(2) to the invalidated extent, but they
> will then be subject to the CONFIG_IO_STRICT_DEVMEM checking that can
> block those subsequent accesses.
> 
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Russell King <linux@arm.linux.org.uk>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Fixes: 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Dan Williams May 27, 2021, 9:30 p.m. UTC | #3
On Thu, May 27, 2021 at 1:58 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> [+cc Daniel, Krzysztof, Jason, Christoph, linux-pci]
>
> On Thu, May 21, 2020 at 02:06:17PM -0700, Dan Williams wrote:
> > Close the hole of holding a mapping over a kernel driver takeover event of
> > a given address range.
> >
> > Commit 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
> > introduced CONFIG_IO_STRICT_DEVMEM with the goal of protecting the
> > kernel against scenarios where a /dev/mem user tramples memory that a
> > kernel driver owns. However, this protection only prevents *new* read(),
> > write() and mmap() requests. Established mappings prior to the driver
> > calling request_mem_region() are left alone.
> >
> > Especially with persistent memory, and the core kernel metadata that is
> > stored there, there are plentiful scenarios for a /dev/mem user to
> > violate the expectations of the driver and cause amplified damage.
> >
> > Teach request_mem_region() to find and shoot down active /dev/mem
> > mappings that it believes it has successfully claimed for the exclusive
> > use of the driver. Effectively a driver call to request_mem_region()
> > becomes a hole-punch on the /dev/mem device.
>
> This idea of hole-punching /dev/mem has since been extended to PCI
> BARs via [1].
>
> Correct me if I'm wrong: I think this means that if a user process has
> mmapped a PCI BAR via sysfs, and a kernel driver subsequently requests
> that region via pci_request_region() or similar, we punch holes in
> the user process mmap.  The driver might be happy, but my guess is the
> user starts seeing segmentation violations for no obvious reason and
> is not happy.
>
> Apart from the user process issue, the implementation of [1] is
> problematic for PCI because the mmappable sysfs attributes now depend
> on iomem_init_inode(), an fs_initcall, which means they can't be
> static attributes, which ultimately leads to races in creating them.

See the comments in iomem_get_mapping(), and revoke_iomem():

        /*
         * Check that the initialization has completed. Losing the race
         * is ok because it means drivers are claiming resources before
         * the fs_initcall level of init and prevent iomem_get_mapping users
         * from establishing mappings.
         */

...the observation being that it is ok for the revocation inode to
come on later in the boot process because userspace won't be able to
use the fs yet. So any missed calls to revoke_iomem() would fall back
to userspace just seeing the resource busy in the first instance. I.e.
through the normal devmem_is_allowed() exclusion.
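
Concretely, the guard described here is just a NULL check against the
not-yet-published inode. A rough sketch using the v4 naming (devmem_sb
stands in for the pseudo-fs superblock the real code sets up;
READ_ONCE()/WRITE_ONCE() provide the ordering per the v4 changelog):

static struct inode *devmem_inode;      /* NULL until fs_initcall level */

/* fs_initcall: runs after early driver claims, but before userspace
 * can possibly open /dev/mem */
static int devmem_init_inode(void)
{
        struct inode *inode = alloc_anon_inode(devmem_sb);

        if (IS_ERR(inode))
                return PTR_ERR(inode);

        /* publish; pairs with the READ_ONCE() in revoke_devmem() */
        WRITE_ONCE(devmem_inode, inode);
        return 0;
}
fs_initcall(devmem_init_inode);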

>
> So I'm raising the question of whether this hole-punch is the right
> strategy.
>
>   - Prior to revoke_iomem(), __request_region() was very
>     self-contained and really only depended on the resource tree.  Now
>     it depends on a lot of higher-level MM machinery to shoot down
>     mappings of other tasks.  This adds quite a bit of complexity and
>     some new ordering constraints.
>
>   - Punching holes in the address space of an existing process seems
>     unfriendly.  Maybe the driver's __request_region() should fail
>     instead, since the driver should be prepared to handle failure
>     there anyway.

It's prepared to handle failure, but in this case it is dealing with a
root user of 2 minds.

>
>   - [2] suggests that the hole punch protects drivers from /dev/mem
>     writers, especially with persistent memory.  I'm not really
>     convinced.  The hole punch does nothing to prevent a user process
>     from mmapping and corrupting something before the driver loads.

The motivation for this was a case that was swapping between /dev/mem
access and /dev/pmem0 access and they forgot to stop using /dev/mem
when they switched to /dev/pmem0. If root wants to use /dev/mem it can
use it, if root wants to stop the driver from loading it can set
modprobe policy or manually unbind, and if root asks the kernel to
load the driver while it is actively using /dev/mem something has to
give. Given root has other options to stop a driver the decision to
revoke userspace access when root messes up and causes a collision
seems prudent to me.
David Hildenbrand May 28, 2021, 8:58 a.m. UTC | #4
On 27.05.21 23:30, Dan Williams wrote:
> On Thu, May 27, 2021 at 1:58 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>>
>> [+cc Daniel, Krzysztof, Jason, Christoph, linux-pci]
>>
>> On Thu, May 21, 2020 at 02:06:17PM -0700, Dan Williams wrote:
>>> Close the hole of holding a mapping over a kernel driver takeover event of
>>> a given address range.
>>>
>>> Commit 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
>>> introduced CONFIG_IO_STRICT_DEVMEM with the goal of protecting the
>>> kernel against scenarios where a /dev/mem user tramples memory that a
>>> kernel driver owns. However, this protection only prevents *new* read(),
>>> write() and mmap() requests. Established mappings prior to the driver
>>> calling request_mem_region() are left alone.
>>>
>>> Especially with persistent memory, and the core kernel metadata that is
>>> stored there, there are plentiful scenarios for a /dev/mem user to
>>> violate the expectations of the driver and cause amplified damage.
>>>
>>> Teach request_mem_region() to find and shoot down active /dev/mem
>>> mappings that it believes it has successfully claimed for the exclusive
>>> use of the driver. Effectively a driver call to request_mem_region()
>>> becomes a hole-punch on the /dev/mem device.
>>
>> This idea of hole-punching /dev/mem has since been extended to PCI
>> BARs via [1].
>>
>> Correct me if I'm wrong: I think this means that if a user process has
>> mmapped a PCI BAR via sysfs, and a kernel driver subsequently requests
>> that region via pci_request_region() or similar, we punch holes in
>> the user process mmap.  The driver might be happy, but my guess is the
>> user starts seeing segmentation violations for no obvious reason and
>> is not happy.
>>
>> Apart from the user process issue, the implementation of [1] is
>> problematic for PCI because the mmappable sysfs attributes now depend
>> on iomem_init_inode(), an fs_initcall, which means they can't be
>> static attributes, which ultimately leads to races in creating them.
> 
> See the comments in iomem_get_mapping(), and revoke_iomem():
> 
>          /*
>           * Check that the initialization has completed. Losing the race
>           * is ok because it means drivers are claiming resources before
>           * the fs_initcall level of init and prevent iomem_get_mapping users
>           * from establishing mappings.
>           */
> 
> ...the observation being that it is ok for the revocation inode to
> come on later in the boot process because userspace won't be able to
> use the fs yet. So any missed calls to revoke_iomem() would fall back
> to userspace just seeing the resource busy in the first instance. I.e.
> through the normal devmem_is_allowed() exclusion.
> 
>>
>> So I'm raising the question of whether this hole-punch is the right
>> strategy.
>>
>>    - Prior to revoke_iomem(), __request_region() was very
>>      self-contained and really only depended on the resource tree.  Now
>>      it depends on a lot of higher-level MM machinery to shoot down
>>      mappings of other tasks.  This adds quite a bit of complexity and
>>      some new ordering constraints.
>>
>>    - Punching holes in the address space of an existing process seems
>>      unfriendly.  Maybe the driver's __request_region() should fail
>>      instead, since the driver should be prepared to handle failure
>>      there anyway.
> 
> It's prepared to handle failure, but in this case it is dealing with a
> root user of 2 minds.
> 
>>
>>    - [2] suggests that the hole punch protects drivers from /dev/mem
>>      writers, especially with persistent memory.  I'm not really
>>      convinced.  The hole punch does nothing to prevent a user process
>>      from mmapping and corrupting something before the driver loads.
> 
> The motivation for this was a case that was swapping between /dev/mem
> access and /dev/pmem0 access and they forgot to stop using /dev/mem
> when they switched to /dev/pmem0. If root wants to use /dev/mem it can
> use it, if root wants to stop the driver from loading it can set
> modprobe policy or manually unbind, and if root asks the kernel to
> load the driver while it is actively using /dev/mem something has to
> give. Given root has other options to stop a driver the decision to
> revoke userspace access when root messes up and causes a collision
> seems prudent to me.
> 

Is there a real use case for mapping pmem via /dev/mem or could we just 
prohibit the access to these areas completely?

What's the use case for "swapping between /dev/mem access and /dev/pmem0 
access" ?
Dan Williams May 28, 2021, 4:42 p.m. UTC | #5
On Fri, May 28, 2021 at 1:58 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 27.05.21 23:30, Dan Williams wrote:
> > On Thu, May 27, 2021 at 1:58 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >>
> >> [+cc Daniel, Krzysztof, Jason, Christoph, linux-pci]
> >>
> >> On Thu, May 21, 2020 at 02:06:17PM -0700, Dan Williams wrote:
> >>> Close the hole of holding a mapping over a kernel driver takeover event of
> >>> a given address range.
> >>>
> >>> Commit 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
> >>> introduced CONFIG_IO_STRICT_DEVMEM with the goal of protecting the
> >>> kernel against scenarios where a /dev/mem user tramples memory that a
> >>> kernel driver owns. However, this protection only prevents *new* read(),
> >>> write() and mmap() requests. Established mappings prior to the driver
> >>> calling request_mem_region() are left alone.
> >>>
> >>> Especially with persistent memory, and the core kernel metadata that is
> >>> stored there, there are plentiful scenarios for a /dev/mem user to
> >>> violate the expectations of the driver and cause amplified damage.
> >>>
> >>> Teach request_mem_region() to find and shoot down active /dev/mem
> >>> mappings that it believes it has successfully claimed for the exclusive
> >>> use of the driver. Effectively a driver call to request_mem_region()
> >>> becomes a hole-punch on the /dev/mem device.
> >>
> >> This idea of hole-punching /dev/mem has since been extended to PCI
> >> BARs via [1].
> >>
> >> Correct me if I'm wrong: I think this means that if a user process has
> >> mmapped a PCI BAR via sysfs, and a kernel driver subsequently requests
> >> that region via pci_request_region() or similar, we punch holes in
> >> the user process mmap.  The driver might be happy, but my guess is the
> >> user starts seeing segmentation violations for no obvious reason and
> >> is not happy.
> >>
> >> Apart from the user process issue, the implementation of [1] is
> >> problematic for PCI because the mmappable sysfs attributes now depend
> >> on iomem_init_inode(), an fs_initcall, which means they can't be
> >> static attributes, which ultimately leads to races in creating them.
> >
> > See the comments in iomem_get_mapping(), and revoke_iomem():
> >
> >          /*
> >           * Check that the initialization has completed. Losing the race
> >           * is ok because it means drivers are claiming resources before
> >           * the fs_initcall level of init and prevent iomem_get_mapping users
> >           * from establishing mappings.
> >           */
> >
> > ...the observation being that it is ok for the revocation inode to
> > come on later in the boot process because userspace won't be able to
> > use the fs yet. So any missed calls to revoke_iomem() would fall back
> > to userspace just seeing the resource busy in the first instance. I.e.
> > through the normal devmem_is_allowed() exclusion.
> >
> >>
> >> So I'm raising the question of whether this hole-punch is the right
> >> strategy.
> >>
> >>    - Prior to revoke_iomem(), __request_region() was very
> >>      self-contained and really only depended on the resource tree.  Now
> >>      it depends on a lot of higher-level MM machinery to shoot down
> >>      mappings of other tasks.  This adds quite a bit of complexity and
> >>      some new ordering constraints.
> >>
> >>    - Punching holes in the address space of an existing process seems
> >>      unfriendly.  Maybe the driver's __request_region() should fail
> >>      instead, since the driver should be prepared to handle failure
> >>      there anyway.
> >
> > It's prepared to handle failure, but in this case it is dealing with a
> > root user of 2 minds.
> >
> >>
> >>    - [2] suggests that the hole punch protects drivers from /dev/mem
> >>      writers, especially with persistent memory.  I'm not really
> >>      convinced.  The hole punch does nothing to prevent a user process
> >>      from mmapping and corrupting something before the driver loads.
> >
> > The motivation for this was a case that was swapping between /dev/mem
> > access and /dev/pmem0 access and they forgot to stop using /dev/mem
> > when they switched to /dev/pmem0. If root wants to use /dev/mem it can
> > use it, if root wants to stop the driver from loading it can set
> > modprobe policy or manually unbind, and if root asks the kernel to
> > load the driver while it is actively using /dev/mem something has to
> > give. Given root has other options to stop a driver the decision to
> > revoke userspace access when root messes up and causes a collision
> > seems prudent to me.
> >
>
> Is there a real use case for mapping pmem via /dev/mem or could we just
> prohibit the access to these areas completely?

The kernel offers conflicting access to iomem resources and a
long-standing mechanism to enforce mutual exclusion
(CONFIG_IO_STRICT_DEVMEM) between those interfaces. That mechanism was
found to be incomplete for the case where a /dev/mem mapping is
maintained after a kernel driver is attached, and incomplete for other
mechanisms to map iomem like pci-sysfs. This was found with PMEM, but
the issue is larger and applies to userspace drivers / debug in
general.

> What's the use case for "swapping between /dev/mem access and /dev/pmem0
> access" ?

"Who knows". I mean, I know in this case it was a platform validation
test using /dev/mem for "reasons", but I am not sure that is relevant
to the wider concern. If CONFIG_IO_STRICT_DEVMEM=n, exclusion is
enforced when drivers pass the IORESOURCE_EXCLUSIVE flag; if
CONFIG_IO_STRICT_DEVMEM=y, exclusion is enforced whenever the kernel
marks a resource IORESOURCE_BUSY; and if kernel lockdown is enabled,
the driver state is moot as the LOCKDOWN_DEV_MEM and LOCKDOWN_PCI_ACCESS
policies are in effect.
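
Stated as code, the policy described above amounts to roughly this
predicate (a paraphrase for illustration, not the literal kernel
implementation; lockdown policy is enforced separately):

/* Should a /dev/mem access to a page overlapping @res be refused? */
static bool devmem_conflict(const struct resource *res)
{
        if (IS_ENABLED(CONFIG_IO_STRICT_DEVMEM))
                /* any driver-claimed (BUSY) range is off limits */
                return res->flags & IORESOURCE_BUSY;

        /* otherwise only ranges a driver explicitly marked exclusive */
        return (res->flags & (IORESOURCE_BUSY | IORESOURCE_EXCLUSIVE)) ==
               (IORESOURCE_BUSY | IORESOURCE_EXCLUSIVE);
}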
David Hildenbrand May 28, 2021, 4:51 p.m. UTC | #6
On 28.05.21 18:42, Dan Williams wrote:
> On Fri, May 28, 2021 at 1:58 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 27.05.21 23:30, Dan Williams wrote:
>>> On Thu, May 27, 2021 at 1:58 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>>>>
>>>> [+cc Daniel, Krzysztof, Jason, Christoph, linux-pci]
>>>>
>>>> On Thu, May 21, 2020 at 02:06:17PM -0700, Dan Williams wrote:
>>>>> Close the hole of holding a mapping over a kernel driver takeover event of
>>>>> a given address range.
>>>>>
>>>>> Commit 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
>>>>> introduced CONFIG_IO_STRICT_DEVMEM with the goal of protecting the
>>>>> kernel against scenarios where a /dev/mem user tramples memory that a
>>>>> kernel driver owns. However, this protection only prevents *new* read(),
>>>>> write() and mmap() requests. Established mappings prior to the driver
>>>>> calling request_mem_region() are left alone.
>>>>>
>>>>> Especially with persistent memory, and the core kernel metadata that is
>>>>> stored there, there are plentiful scenarios for a /dev/mem user to
>>>>> violate the expectations of the driver and cause amplified damage.
>>>>>
>>>>> Teach request_mem_region() to find and shoot down active /dev/mem
>>>>> mappings that it believes it has successfully claimed for the exclusive
>>>>> use of the driver. Effectively a driver call to request_mem_region()
>>>>> becomes a hole-punch on the /dev/mem device.
>>>>
>>>> This idea of hole-punching /dev/mem has since been extended to PCI
>>>> BARs via [1].
>>>>
>>>> Correct me if I'm wrong: I think this means that if a user process has
>>>> mmapped a PCI BAR via sysfs, and a kernel driver subsequently requests
>>>> that region via pci_request_region() or similar, we punch holes in
>>>> the user process mmap.  The driver might be happy, but my guess is the
>>>> user starts seeing segmentation violations for no obvious reason and
>>>> is not happy.
>>>>
>>>> Apart from the user process issue, the implementation of [1] is
>>>> problematic for PCI because the mmappable sysfs attributes now depend
>>>> on iomem_init_inode(), an fs_initcall, which means they can't be
>>>> static attributes, which ultimately leads to races in creating them.
>>>
>>> See the comments in iomem_get_mapping(), and revoke_iomem():
>>>
>>>           /*
>>>            * Check that the initialization has completed. Losing the race
>>>            * is ok because it means drivers are claiming resources before
>>>            * the fs_initcall level of init and prevent iomem_get_mapping users
>>>            * from establishing mappings.
>>>            */
>>>
>>> ...the observation being that it is ok for the revocation inode to
>>> come on later in the boot process because userspace won't be able to
>>> use the fs yet. So any missed calls to revoke_iomem() would fall back
>>> to userspace just seeing the resource busy in the first instance. I.e.
>>> through the normal devmem_is_allowed() exclusion.
>>>
>>>>
>>>> So I'm raising the question of whether this hole-punch is the right
>>>> strategy.
>>>>
>>>>     - Prior to revoke_iomem(), __request_region() was very
>>>>       self-contained and really only depended on the resource tree.  Now
>>>>       it depends on a lot of higher-level MM machinery to shoot down
>>>>       mappings of other tasks.  This adds quite a bit of complexity and
>>>>       some new ordering constraints.
>>>>
>>>>     - Punching holes in the address space of an existing process seems
>>>>       unfriendly.  Maybe the driver's __request_region() should fail
>>>>       instead, since the driver should be prepared to handle failure
>>>>       there anyway.
>>>
>>> It's prepared to handle failure, but in this case it is dealing with a
>>> root user of 2 minds.
>>>
>>>>
>>>>     - [2] suggests that the hole punch protects drivers from /dev/mem
>>>>       writers, especially with persistent memory.  I'm not really
>>>>       convinced.  The hole punch does nothing to prevent a user process
>>>>       from mmapping and corrupting something before the driver loads.
>>>
>>> The motivation for this was a case that was swapping between /dev/mem
>>> access and /dev/pmem0 access and they forgot to stop using /dev/mem
>>> when they switched to /dev/pmem0. If root wants to use /dev/mem it can
>>> use it, if root wants to stop the driver from loading it can set
>>> modprobe policy or manually unbind, and if root asks the kernel to
>>> load the driver while it is actively using /dev/mem something has to
>>> give. Given root has other options to stop a driver the decision to
>>> revoke userspace access when root messes up and causes a collision
>>> seems prudent to me.
>>>
>>
>> Is there a real use case for mapping pmem via /dev/mem or could we just
>> prohibit the access to these areas completely?
> 
> The kernel offers conflicting access to iomem resources and a
> long-standing mechanism to enforce mutual exclusion
> (CONFIG_IO_STRICT_DEVMEM) between those interfaces. That mechanism was
> found to be incomplete for the case where a /dev/mem mapping is
> maintained after a kernel driver is attached, and incomplete for other
> mechanisms to map iomem like pci-sysfs. This was found with PMEM, but
> the issue is larger and applies to userspace drivers / debug in
> general.
> 
>> What's the use case for "swapping between /dev/mem access and /dev/pmem0
>> access" ?
> 
> "Who knows". I mean, I know in this case it was a platform validation
> test using /dev/mem for "reasons", but I am not sure that is relevant
> to the wider concern. If CONFIG_IO_STRICT_DEVMEM=n, exclusion is
> enforced when drivers pass the IORESOURCE_EXCLUSIVE flag; if
> CONFIG_IO_STRICT_DEVMEM=y, exclusion is enforced whenever the kernel
> marks a resource IORESOURCE_BUSY; and if kernel lockdown is enabled,
> the driver state is moot as the LOCKDOWN_DEV_MEM and LOCKDOWN_PCI_ACCESS
> policies are in effect.
> 

I was thinking about a mechanism to permanently disallow /dev/mem access 
to specific memory regions (BUSY or not) in any /dev/mem mode. In my 
case, it would apply to the whole virtio-mem provided memory region. 
Once the driver is loaded, it would disallow access to the whole region.

I thought about doing it via the kernel resource tree, extending the 
EXCLUSIVE flag to !BUSY SYSRAM regions. But a simplistic list managed in 
/dev/mem code would also be possible.

That's why I wondered if we could similarly just disallow access to these 
physical PMEM regions right from the start, so that we don't really have 
to care about revoking in the PMEM case anymore.
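
For comparison, the opt-in that exists today is the exclusive variant of
the region claim, which covers only BUSY ranges (hence the suggestion
above to extend the flag to !BUSY regions). A minimal sketch, with a
made-up driver name:

#include <linux/ioport.h>

/* hypothetical probe: claim the range and mark it IORESOURCE_EXCLUSIVE
 * in one step, so /dev/mem refuses it even with IO_STRICT_DEVMEM=n */
static int pmemish_claim(resource_size_t start, resource_size_t len)
{
        if (!request_mem_region_exclusive(start, len, "pmemish"))
                return -EBUSY;
        return 0;
}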
Bjorn Helgaas June 3, 2021, 3:39 a.m. UTC | #7
[+cc Pali, Oliver]

On Thu, May 27, 2021 at 02:30:31PM -0700, Dan Williams wrote:
> On Thu, May 27, 2021 at 1:58 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > [+cc Daniel, Krzysztof, Jason, Christoph, linux-pci]
> >
> > On Thu, May 21, 2020 at 02:06:17PM -0700, Dan Williams wrote:
> > > Close the hole of holding a mapping over a kernel driver takeover event of
> > > a given address range.
> > >
> > > Commit 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
> > > introduced CONFIG_IO_STRICT_DEVMEM with the goal of protecting the
> > > kernel against scenarios where a /dev/mem user tramples memory that a
> > > kernel driver owns. However, this protection only prevents *new* read(),
> > > write() and mmap() requests. Established mappings prior to the driver
> > > calling request_mem_region() are left alone.
> > >
> > > Especially with persistent memory, and the core kernel metadata that is
> > > stored there, there are plentiful scenarios for a /dev/mem user to
> > > violate the expectations of the driver and cause amplified damage.
> > >
> > > Teach request_mem_region() to find and shoot down active /dev/mem
> > > mappings that it believes it has successfully claimed for the exclusive
> > > use of the driver. Effectively a driver call to request_mem_region()
> > > becomes a hole-punch on the /dev/mem device.
> >
> > This idea of hole-punching /dev/mem has since been extended to PCI
> > BARs via [1].
> >
> > Correct me if I'm wrong: I think this means that if a user process has
> > mmapped a PCI BAR via sysfs, and a kernel driver subsequently requests
> > that region via pci_request_region() or similar, we punch holes in
> > the user process mmap.  The driver might be happy, but my guess is the
> > user starts seeing segmentation violations for no obvious reason and
> > is not happy.
> >
> > Apart from the user process issue, the implementation of [1] is
> > problematic for PCI because the mmappable sysfs attributes now depend
> > on iomem_init_inode(), an fs_initcall, which means they can't be
> > static attributes, which ultimately leads to races in creating them.
> 
> See the comments in iomem_get_mapping(), and revoke_iomem():
> 
>         /*
>          * Check that the initialization has completed. Losing the race
>          * is ok because it means drivers are claiming resources before
>          * the fs_initcall level of init and prevent iomem_get_mapping users
>          * from establishing mappings.
>          */
> 
> ...the observation being that it is ok for the revocation inode to
> come on later in the boot process because userspace won't be able to
> use the fs yet. So any missed calls to revoke_iomem() would fall back
> to userspace just seeing the resource busy in the first instance. I.e.
> through the normal devmem_is_allowed() exclusion.

I did see that comment, but the race I meant is different.  Pali wrote
up a nice analysis of it [3].

Here's the typical enumeration flow for PCI:

  acpi_pci_root_add                 <-- subsys_initcall (4)
    pci_acpi_scan_root
      ...
        pci_device_add
          device_initialize
          device_add
            device_add_attrs        <-- static sysfs attributes created
    ...
    pci_bus_add_devices
      pci_bus_add_device
        pci_create_sysfs_dev_files
          if (!sysfs_initialized) return;    <-- Ugh :)
          ...
            attr->mmap = pci_mmap_resource_uc
            attr->mapping = iomem_get_mapping()  <-- new dependency
              return iomem_inode->i_mapping
            sysfs_create_bin_file   <-- dynamic sysfs attributes created

  iomem_init_inode                  <-- fs_initcall (5)
    iomem_inode = ...               <-- now iomem_get_mapping() works

  pci_sysfs_init                    <-- late_initcall (7)
    sysfs_initialized = 1           <-- Ugh (see above)
    for_each_pci_dev(dev)           <-- Ugh
      pci_create_sysfs_dev_files(dev)

The race is between the pci_sysfs_init() initcall (intended for
boot-time devices) and the pci_bus_add_device() path (used for all
devices including hot-added ones).  Pali outlined cases where we call
pci_create_sysfs_dev_files() from both paths for the same device.

"sysfs_initialized" is a gross hack that prevents this most of the
time, but not always.  I want to get rid of it and pci_sysfs_init().

Oliver had the excellent idea of using static sysfs attributes to do
this cleanly [4].  If we can convert things to static attributes, the
device core creates them in device_add(), so we don't have to create
them in pci_create_sysfs_dev_files().

Krzysztof recently did some very nice work to convert most things to
static attributes, e.g., [5].  But we can't do this for the PCI BAR
attributes because they support ->mmap(), which now depends on
iomem_get_mapping(), which IIUC doesn't work until after fs_initcalls.

> > So I'm raising the question of whether this hole-punch is the right
> > strategy.
> >
> >   - Prior to revoke_iomem(), __request_region() was very
> >     self-contained and really only depended on the resource tree.  Now
> >     it depends on a lot of higher-level MM machinery to shoot down
> >     mappings of other tasks.  This adds quite a bit of complexity and
> >     some new ordering constraints.
> >
> >   - Punching holes in the address space of an existing process seems
> >     unfriendly.  Maybe the driver's __request_region() should fail
> >     instead, since the driver should be prepared to handle failure
> >     there anyway.
> 
> It's prepared to handle failure, but in this case it is dealing with a
> root user of 2 minds.
> 
> >   - [2] suggests that the hole punch protects drivers from /dev/mem
> >     writers, especially with persistent memory.  I'm not really
> >     convinced.  The hole punch does nothing to prevent a user process
> >     from mmapping and corrupting something before the driver loads.
> 
> The motivation for this was a case that was swapping between /dev/mem
> access and /dev/pmem0 access and they forgot to stop using /dev/mem
> when they switched to /dev/pmem0. If root wants to use /dev/mem it can
> use it, if root wants to stop the driver from loading it can set
> modprobe policy or manually unbind, and if root asks the kernel to
> load the driver while it is actively using /dev/mem something has to
> give. Given root has other options to stop a driver the decision to
> revoke userspace access when root messes up and causes a collision
> seems prudent to me.

[3] https://lore.kernel.org/linux-pci/20200716110423.xtfyb3n6tn5ixedh@pali/
[4] https://lore.kernel.org/linux-pci/CAOSf1CHss03DBSDO4PmTtMp0tCEu5kScn704ZEwLKGXQzBfqaA@mail.gmail.com/
[5] https://git.kernel.org/linus/e1d3f3268b0e
Dan Williams June 3, 2021, 4:15 a.m. UTC | #8
On Wed, Jun 2, 2021 at 8:40 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> [+cc Pali, Oliver]
>
> On Thu, May 27, 2021 at 02:30:31PM -0700, Dan Williams wrote:
> > On Thu, May 27, 2021 at 1:58 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > >
> > > [+cc Daniel, Krzysztof, Jason, Christoph, linux-pci]
> > >
> > > On Thu, May 21, 2020 at 02:06:17PM -0700, Dan Williams wrote:
> > > > Close the hole of holding a mapping over a kernel driver takeover event of
> > > > a given address range.
> > > >
> > > > Commit 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
> > > > introduced CONFIG_IO_STRICT_DEVMEM with the goal of protecting the
> > > > kernel against scenarios where a /dev/mem user tramples memory that a
> > > > kernel driver owns. However, this protection only prevents *new* read(),
> > > > write() and mmap() requests. Established mappings prior to the driver
> > > > calling request_mem_region() are left alone.
> > > >
> > > > Especially with persistent memory, and the core kernel metadata that is
> > > > stored there, there are plentiful scenarios for a /dev/mem user to
> > > > violate the expectations of the driver and cause amplified damage.
> > > >
> > > > Teach request_mem_region() to find and shoot down active /dev/mem
> > > > mappings that it believes it has successfully claimed for the exclusive
> > > > use of the driver. Effectively a driver call to request_mem_region()
> > > > becomes a hole-punch on the /dev/mem device.
> > >
> > > This idea of hole-punching /dev/mem has since been extended to PCI
> > > BARs via [1].
> > >
> > > Correct me if I'm wrong: I think this means that if a user process has
> > > mmapped a PCI BAR via sysfs, and a kernel driver subsequently requests
> > > that region via pci_request_region() or similar, we punch holes in
> > > the user process mmap.  The driver might be happy, but my guess is the
> > > user starts seeing segmentation violations for no obvious reason and
> > > is not happy.
> > >
> > > Apart from the user process issue, the implementation of [1] is
> > > problematic for PCI because the mmappable sysfs attributes now depend
> > > on iomem_init_inode(), an fs_initcall, which means they can't be
> > > static attributes, which ultimately leads to races in creating them.
> >
> > See the comments in iomem_get_mapping(), and revoke_iomem():
> >
> >         /*
> >          * Check that the initialization has completed. Losing the race
> >          * is ok because it means drivers are claiming resources before
> >          * the fs_initcall level of init and prevent iomem_get_mapping users
> >          * from establishing mappings.
> >          */
> >
> > ...the observation being that it is ok for the revocation inode to
> > come on later in the boot process because userspace won't be able to
> > use the fs yet. So any missed calls to revoke_iomem() would fall back
> > to userspace just seeing the resource busy in the first instance. I.e.
> > through the normal devmem_is_allowed() exclusion.
>
> I did see that comment, but the race I meant is different.  Pali wrote
> up a nice analysis of it [3].
>
> Here's the typical enumeration flow for PCI:
>
>   acpi_pci_root_add                 <-- subsys_initcall (4)
>     pci_acpi_scan_root
>       ...
>         pci_device_add
>           device_initialize
>           device_add
>             device_add_attrs        <-- static sysfs attributes created
>     ...
>     pci_bus_add_devices
>       pci_bus_add_device
>         pci_create_sysfs_dev_files
>           if (!sysfs_initialized) return;    <-- Ugh :)
>           ...
>             attr->mmap = pci_mmap_resource_uc
>             attr->mapping = iomem_get_mapping()  <-- new dependency
>               return iomem_inode->i_mapping
>             sysfs_create_bin_file   <-- dynamic sysfs attributes created
>
>   iomem_init_inode                  <-- fs_initcall (5)
>     iomem_inode = ...               <-- now iomem_get_mapping() works
>
>   pci_sysfs_init                    <-- late_initcall (7)
>     sysfs_initialized = 1           <-- Ugh (see above)
>     for_each_pci_dev(dev)           <-- Ugh
>       pci_create_sysfs_dev_files(dev)
>
> The race is between the pci_sysfs_init() initcall (intended for
> boot-time devices) and the pci_bus_add_device() path (used for all
> devices including hot-added ones).  Pali outlined cases where we call
> pci_create_sysfs_dev_files() from both paths for the same device.
>
> "sysfs_initialized" is a gross hack that prevents this most of the
> time, but not always.  I want to get rid of it and pci_sysfs_init().
>
> Oliver had the excellent idea of using static sysfs attributes to do
> this cleanly [4].  If we can convert things to static attributes, the
> device core creates them in device_add(), so we don't have to create
> them in pci_create_sysfs_dev_files().
>
> Krzysztof recently did some very nice work to convert most things to
> static attributes, e.g., [5].  But we can't do this for the PCI BAR
> attributes because they support ->mmap(), which now depends on
> iomem_get_mapping(), which IIUC doesn't work until after fs_initcalls.

Ah, sorry, yes, I see the race now. And yes, anything that gets in the
way of the static attribute conversion needs fixing. How about
something like this?

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index beb8d1f4fafe..c8bc249750d6 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -1195,7 +1195,7 @@ static int pci_create_attr(struct pci_dev *pdev, int num, int write_combine)
                }
        }
        if (res_attr->mmap)
-               res_attr->mapping = iomem_get_mapping();
+               res_attr->mapping = iomem_get_mapping;
        res_attr->attr.name = res_attr_name;
        res_attr->attr.mode = 0600;
        res_attr->size = pci_resource_len(pdev, num);
diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
index 9aefa7779b29..a3ee4c32a264 100644
--- a/fs/sysfs/file.c
+++ b/fs/sysfs/file.c
@@ -175,7 +175,7 @@ static int sysfs_kf_bin_open(struct kernfs_open_file *of)
        struct bin_attribute *battr = of->kn->priv;

        if (battr->mapping)
-               of->file->f_mapping = battr->mapping;
+               of->file->f_mapping = battr->mapping();

        return 0;
 }
diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
index d76a1ddf83a3..fbb7c7df545c 100644
--- a/include/linux/sysfs.h
+++ b/include/linux/sysfs.h
@@ -170,7 +170,7 @@ struct bin_attribute {
        struct attribute        attr;
        size_t                  size;
        void                    *private;
-       struct address_space    *mapping;
+       struct address_space *(*mapping)(void);
        ssize_t (*read)(struct file *, struct kobject *, struct bin_attribute *,
                        char *, loff_t, size_t);
        ssize_t (*write)(struct file *, struct kobject *, struct bin_attribute *,
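
The effect of the indirection: the address_space lookup moves from
attribute-creation time to open() time, which is what makes a build-time
static attribute possible. A sketch of what that enables (hypothetical,
not part of the patch above):

/* hypothetical static BAR attribute: safe to declare at build time
 * because iomem_get_mapping() is not called until userspace open(),
 * which cannot happen before the fs_initcall that creates the inode */
static struct bin_attribute resource0_attr = {
        .attr    = { .name = "resource0", .mode = 0600 },
        .mmap    = pci_mmap_resource_uc,
        .mapping = iomem_get_mapping,   /* function pointer, no call */
};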
Bjorn Helgaas June 3, 2021, 6:11 p.m. UTC | #9
On Wed, Jun 02, 2021 at 09:15:35PM -0700, Dan Williams wrote:
> On Wed, Jun 2, 2021 at 8:40 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> >
> > [+cc Pali, Oliver]
> >
> > On Thu, May 27, 2021 at 02:30:31PM -0700, Dan Williams wrote:
> > > On Thu, May 27, 2021 at 1:58 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > >
> > > > [+cc Daniel, Krzysztof, Jason, Christoph, linux-pci]
> > > >
> > > > On Thu, May 21, 2020 at 02:06:17PM -0700, Dan Williams wrote:
> > > > > Close the hole of holding a mapping over a kernel driver takeover event of
> > > > > a given address range.
> > > > >
> > > > > Commit 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
> > > > > introduced CONFIG_IO_STRICT_DEVMEM with the goal of protecting the
> > > > > kernel against scenarios where a /dev/mem user tramples memory that a
> > > > > kernel driver owns. However, this protection only prevents *new* read(),
> > > > > write() and mmap() requests. Established mappings prior to the driver
> > > > > calling request_mem_region() are left alone.
> > > > >
> > > > > Especially with persistent memory, and the core kernel metadata that is
> > > > > stored there, there are plentiful scenarios for a /dev/mem user to
> > > > > violate the expectations of the driver and cause amplified damage.
> > > > >
> > > > > Teach request_mem_region() to find and shoot down active /dev/mem
> > > > > mappings that it believes it has successfully claimed for the exclusive
> > > > > use of the driver. Effectively a driver call to request_mem_region()
> > > > > becomes a hole-punch on the /dev/mem device.
> > > >
> > > > This idea of hole-punching /dev/mem has since been extended to PCI
> > > > BARs via [1].
> > > >
> > > > Correct me if I'm wrong: I think this means that if a user process has
> > > > mmapped a PCI BAR via sysfs, and a kernel driver subsequently requests
> > > > that region via pci_request_region() or similar, we punch holes in
> > > > the user process mmap.  The driver might be happy, but my guess is the
> > > > user starts seeing segmentation violations for no obvious reason and
> > > > is not happy.
> > > >
> > > > Apart from the user process issue, the implementation of [1] is
> > > > problematic for PCI because the mmappable sysfs attributes now depend
> > > > on iomem_init_inode(), an fs_initcall, which means they can't be
> > > > static attributes, which ultimately leads to races in creating them.
> > >
> > > See the comments in iomem_get_mapping(), and revoke_iomem():
> > >
> > >         /*
> > >          * Check that the initialization has completed. Losing the race
> > >          * is ok because it means drivers are claiming resources before
> > >          * the fs_initcall level of init and prevent iomem_get_mapping users
> > >          * from establishing mappings.
> > >          */
> > >
> > > ...the observation being that it is ok for the revocation inode to
> > > come on later in the boot process because userspace won't be able to
> > > use the fs yet. So any missed calls to revoke_iomem() would fall back
> > > to userspace just seeing the resource busy in the first instance. I.e.
> > > through the normal devmem_is_allowed() exclusion.
> >
> > I did see that comment, but the race I meant is different.  Pali wrote
> > up a nice analysis of it [3].
> >
> > Here's the typical enumeration flow for PCI:
> >
> >   acpi_pci_root_add                 <-- subsys_initcall (4)
> >     pci_acpi_scan_root
> >       ...
> >         pci_device_add
> >           device_initialize
> >           device_add
> >             device_add_attrs        <-- static sysfs attributes created
> >     ...
> >     pci_bus_add_devices
> >       pci_bus_add_device
> >         pci_create_sysfs_dev_files
> >           if (!sysfs_initialized) return;    <-- Ugh :)
> >           ...
> >             attr->mmap = pci_mmap_resource_uc
> >             attr->mapping = iomem_get_mapping()  <-- new dependency
> >               return iomem_inode->i_mapping
> >             sysfs_create_bin_file   <-- dynamic sysfs attributes created
> >
> >   iomem_init_inode                  <-- fs_initcall (5)
> >     iomem_inode = ...               <-- now iomem_get_mapping() works
> >
> >   pci_sysfs_init                    <-- late_initcall (7)
> >     sysfs_initialized = 1           <-- Ugh (see above)
> >     for_each_pci_dev(dev)           <-- Ugh
> >       pci_create_sysfs_dev_files(dev)
> >
> > The race is between the pci_sysfs_init() initcall (intended for
> > boot-time devices) and the pci_bus_add_device() path (used for all
> > devices including hot-added ones).  Pali outlined cases where we call
> > pci_create_sysfs_dev_files() from both paths for the same device.
> >
> > "sysfs_initialized" is a gross hack that prevents this most of the
> > time, but not always.  I want to get rid of it and pci_sysfs_init().
> >
> > Oliver had the excellent idea of using static sysfs attributes to do
> > this cleanly [4].  If we can convert things to static attributes, the
> > device core creates them in device_add(), so we don't have to create
> > them in pci_create_sysfs_dev_files().
> >
> > Krzysztof recently did some very nice work to convert most things to
> > static attributes, e.g., [5].  But we can't do this for the PCI BAR
> > attributes because they support ->mmap(), which now depends on
> > iomem_get_mapping(), which IIUC doesn't work until after fs_initcalls.
> 
> Ah, sorry, yes, I see the race now. And yes, anything that gets in the
> way of the static attribute conversion needs fixing. How about
> something like this?

That looks like it would solve our problem, thanks a lot!  Obvious in
retrospect, like all good ideas :)

Krzysztof noticed a couple other users of iomem_get_mapping()
added by:

  71a1d8ed900f ("resource: Move devmem revoke code to resource framework")
  636b21b50152 ("PCI: Revoke mappings like devmem")

I *could* extend your patch below to cover all these, but it's kind of
outside my comfort zone, so I'd feel better if Daniel V (who wrote the
commits above) could take a look and do a follow-up.

If I could take the resulting patch via PCI, we might even be able to
get the last static attribute conversions in this cycle.

> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index beb8d1f4fafe..c8bc249750d6 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -1195,7 +1195,7 @@ static int pci_create_attr(struct pci_dev *pdev, int num, int write_combine)
>                 }
>         }
>         if (res_attr->mmap)
> -               res_attr->mapping = iomem_get_mapping();
> +               res_attr->mapping = iomem_get_mapping;
>         res_attr->attr.name = res_attr_name;
>         res_attr->attr.mode = 0600;
>         res_attr->size = pci_resource_len(pdev, num);
> diff --git a/fs/sysfs/file.c b/fs/sysfs/file.c
> index 9aefa7779b29..a3ee4c32a264 100644
> --- a/fs/sysfs/file.c
> +++ b/fs/sysfs/file.c
> @@ -175,7 +175,7 @@ static int sysfs_kf_bin_open(struct kernfs_open_file *of)
>         struct bin_attribute *battr = of->kn->priv;
> 
>         if (battr->mapping)
> -               of->file->f_mapping = battr->mapping;
> +               of->file->f_mapping = battr->mapping();
> 
>         return 0;
>  }
> diff --git a/include/linux/sysfs.h b/include/linux/sysfs.h
> index d76a1ddf83a3..fbb7c7df545c 100644
> --- a/include/linux/sysfs.h
> +++ b/include/linux/sysfs.h
> @@ -170,7 +170,7 @@ struct bin_attribute {
>         struct attribute        attr;
>         size_t                  size;
>         void                    *private;
> -       struct address_space    *mapping;
> +       struct address_space *(*mapping)(void);
>         ssize_t (*read)(struct file *, struct kobject *, struct bin_attribute *,
>                         char *, loff_t, size_t);
>         ssize_t (*write)(struct file *, struct kobject *, struct bin_attribute *,
Dan Williams June 3, 2021, 6:28 p.m. UTC | #10
On Thu, Jun 3, 2021 at 11:12 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Wed, Jun 02, 2021 at 09:15:35PM -0700, Dan Williams wrote:
> > On Wed, Jun 2, 2021 at 8:40 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > >
> > > [+cc Pali, Oliver]
> > >
> > > On Thu, May 27, 2021 at 02:30:31PM -0700, Dan Williams wrote:
> > > > On Thu, May 27, 2021 at 1:58 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > > > >
> > > > > [+cc Daniel, Krzysztof, Jason, Christoph, linux-pci]
> > > > >
> > > > > On Thu, May 21, 2020 at 02:06:17PM -0700, Dan Williams wrote:
> > > > > > Close the hole of holding a mapping over a kernel driver takeover event of
> > > > > > a given address range.
> > > > > >
> > > > > > Commit 90a545e98126 ("restrict /dev/mem to idle io memory ranges")
> > > > > > introduced CONFIG_IO_STRICT_DEVMEM with the goal of protecting the
> > > > > > kernel against scenarios where a /dev/mem user tramples memory that a
> > > > > > kernel driver owns. However, this protection only prevents *new* read(),
> > > > > > write() and mmap() requests. Established mappings prior to the driver
> > > > > > calling request_mem_region() are left alone.
> > > > > >
> > > > > > Especially with persistent memory, and the core kernel metadata that is
> > > > > > stored there, there are plentiful scenarios for a /dev/mem user to
> > > > > > violate the expectations of the driver and cause amplified damage.
> > > > > >
> > > > > > Teach request_mem_region() to find and shoot down active /dev/mem
> > > > > > mappings that it believes it has successfully claimed for the exclusive
> > > > > > use of the driver. Effectively a driver call to request_mem_region()
> > > > > > becomes a hole-punch on the /dev/mem device.
> > > > >
> > > > > This idea of hole-punching /dev/mem has since been extended to PCI
> > > > > BARs via [1].
> > > > >
> > > > > Correct me if I'm wrong: I think this means that if a user process has
> > > > > mmapped a PCI BAR via sysfs, and a kernel driver subsequently requests
> > > > > that region via pci_request_region() or similar, we punch holes in
> > > > > the user process mmap.  The driver might be happy, but my guess is the
> > > > > user starts seeing segmentation violations for no obvious reason and
> > > > > is not happy.
> > > > >
> > > > > Apart from the user process issue, the implementation of [1] is
> > > > > problematic for PCI because the mmappable sysfs attributes now depend
> > > > > on iomem_init_inode(), an fs_initcall, which means they can't be
> > > > > static attributes, which ultimately leads to races in creating them.
> > > >
> > > > See the comments in iomem_get_mapping(), and revoke_iomem():
> > > >
> > > >         /*
> > > >          * Check that the initialization has completed. Losing the race
> > > >          * is ok because it means drivers are claiming resources before
> > > >          * the fs_initcall level of init and prevent iomem_get_mapping users
> > > >          * from establishing mappings.
> > > >          */
> > > >
> > > > ...the observation being that it is ok for the revocation inode to
> > > > come on later in the boot process because userspace won't be able to
> > > > use the fs yet. So any missed calls to revoke_iomem() would fall back
> > > > to userspace just seeing the resource busy in the first instance. I.e.
> > > > through the normal devmem_is_allowed() exclusion.
> > >
> > > I did see that comment, but the race I meant is different.  Pali wrote
> > > up a nice analysis of it [3].
> > >
> > > Here's the typical enumeration flow for PCI:
> > >
> > >   acpi_pci_root_add                 <-- subsys_initcall (4)
> > >     pci_acpi_scan_root
> > >       ...
> > >         pci_device_add
> > >           device_initialize
> > >           device_add
> > >             device_add_attrs        <-- static sysfs attributes created
> > >     ...
> > >     pci_bus_add_devices
> > >       pci_bus_add_device
> > >         pci_create_sysfs_dev_files
> > >           if (!sysfs_initialized) return;    <-- Ugh :)
> > >           ...
> > >             attr->mmap = pci_mmap_resource_uc
> > >             attr->mapping = iomem_get_mapping()  <-- new dependency
> > >               return iomem_inode->i_mapping
> > >             sysfs_create_bin_file   <-- dynamic sysfs attributes created
> > >
> > >   iomem_init_inode                  <-- fs_initcall (5)
> > >     iomem_inode = ...               <-- now iomem_get_mapping() works
> > >
> > >   pci_sysfs_init                    <-- late_initcall (7)
> > >     sysfs_initialized = 1           <-- Ugh (see above)
> > >     for_each_pci_dev(dev)           <-- Ugh
> > >       pci_create_sysfs_dev_files(dev)
> > >
> > > The race is between the pci_sysfs_init() initcall (intended for
> > > boot-time devices) and the pci_bus_add_device() path (used for all
> > > devices including hot-added ones).  Pali outlined cases where we call
> > > pci_create_sysfs_dev_files() from both paths for the same device.
> > >
> > > "sysfs_initialized" is a gross hack that prevents this most of the
> > > time, but not always.  I want to get rid of it and pci_sysfs_init().
> > >
> > > Oliver had the excellent idea of using static sysfs attributes to do
> > > this cleanly [4].  If we can convert things to static attributes, the
> > > device core creates them in device_add(), so we don't have to create
> > > them in pci_create_sysfs_dev_files().
> > >
> > > Krzysztof recently did some very nice work to convert most things to
> > > static attributes, e.g., [5].  But we can't do this for the PCI BAR
> > > attributes because they support ->mmap(), which now depends on
> > > iomem_get_mapping(), which IIUC doesn't work until after fs_initcalls.
> >
> > Ah, sorry, yes, I see the race now. And yes, anything that gets in the
> > way of the static attribute conversion needs fixing. How about
> > something like this?
>
> That looks like it would solve our problem, thanks a lot!  Obvious in
> retrospect, like all good ideas :)
>
> Krzysztof noticed a couple other users of iomem_get_mapping()
> added by:
>
>   71a1d8ed900f ("resource: Move devmem revoke code to resource framework")
>   636b21b50152 ("PCI: Revoke mappings like devmem")
>
> I *could* extend your patch below to cover all these, but it's kind of
> outside my comfort zone, so I'd feel better if Daniel V (who wrote the
> commits above) could take a look and do a follow-up.
>
> If I could take the resulting patch via PCI, we might even be able to
> get the last static attribute conversions in this cycle.

Sounds good, I'll circle back and give it a try if Daniel does not get
a chance to chime in in the next few days.
Kees Cook April 6, 2022, 7:45 p.m. UTC | #11
*thread necromancy*

Hi Dan,

I'm doing a KSPP bug scrub and am reviewing
https://github.com/KSPP/linux/issues/74 again.

Do you have a chance to look at this? I'd love a way to make mmap()
behave the same way as read() for the first meg of /dev/mem.

-Kees

On Thu, May 21, 2020 at 08:01:53PM -0700, Kees Cook wrote:
> On Thu, May 21, 2020 at 02:06:17PM -0700, Dan Williams wrote:
> > The typical usage of unmap_mapping_range() is part of
> > truncate_pagecache() to punch a hole in a file, but in this case the
> > implementation is only doing the "first half" of a hole punch. Namely it
> > is just evacuating current established mappings of the "hole", and it
> > relies on the fact that /dev/mem establishes mappings in terms of
> > absolute physical address offsets. Once existing mmap users are
> > invalidated they can attempt to re-establish the mapping, or attempt to
> > continue issuing read(2) / write(2) to the invalidated extent, but they
> > will then be subject to the CONFIG_IO_STRICT_DEVMEM checking that can
> > block those subsequent accesses.
> 
> Nice!
> 
> Reviewed-by: Kees Cook <keescook@chromium.org>
> 
> And a thread hijack...   ;)
> 
> I think this is very close to providing a way to solve another issue
> I've had with /dev/mem, which is to zero the view of the first 1MB of
> /dev/mem via mmap. I only fixed the read/write accesses:
> a4866aa81251 ("mm: Tighten x86 /dev/mem with zeroing reads")
> I.e. the low 1MB range should be considered allowed, but any reads will see
> zeros.
> 
> > +	unmap_mapping_range(inode->i_mapping, res->start, resource_size(res), 1);
> 
> Is unmap_mapping_range() sufficient for this? Would it need to happen
> once during open_port() or something more special during mmap_mem()?
> 
> -- 
> Kees Cook
Dan Williams April 7, 2022, 6:47 p.m. UTC | #12
On Wed, Apr 6, 2022 at 12:46 PM Kees Cook <keescook@chromium.org> wrote:
>
> *thread necromancy*

It's alive!

>
> Hi Dan,
>
> I'm doing a KSPP bug scrub and am reviewing
> https://github.com/KSPP/linux/issues/74 again.
>
> Do you have a chance to look at this? I'd love a way to make mmap()
> behave the same way as read() for the first meg of /dev/mem.

You want 0-reads or SIGBUS when attempting to access the first 1MB?

Because it sounds like what you want is instead of loudly failing with
-EPERM in drivers/char/mem.c::mmap_mem() you want it to silently
succeed but swap in the zero page, right? Otherwise if it's SIGBUS
then IO_STRICT_DEVMEM=y + marking that span as IORESOURCE_BUSY will
"Do the Right Thing (TM).".


>
> -Kees
>
> On Thu, May 21, 2020 at 08:01:53PM -0700, Kees Cook wrote:
> > On Thu, May 21, 2020 at 02:06:17PM -0700, Dan Williams wrote:
> > > The typical usage of unmap_mapping_range() is part of
> > > truncate_pagecache() to punch a hole in a file, but in this case the
> > > implementation is only doing the "first half" of a hole punch. Namely it
> > > is just evacuating current established mappings of the "hole", and it
> > > relies on the fact that /dev/mem establishes mappings in terms of
> > > absolute physical address offsets. Once existing mmap users are
> > > invalidated they can attempt to re-establish the mapping, or attempt to
> > > continue issuing read(2) / write(2) to the invalidated extent, but they
> > > will then be subject to the CONFIG_IO_STRICT_DEVMEM checking that can
> > > block those subsequent accesses.
> >
> > Nice!
> >
> > Reviewed-by: Kees Cook <keescook@chromium.org>
> >
> > And a thread hijack...   ;)
> >
> > I think this is very close to providing a way to solve another issue
> > I've had with /dev/mem, which is to zero the view of the first 1MB of
> > /dev/mem via mmap. I only fixed the read/write accesses:
> > a4866aa81251 ("mm: Tighten x86 /dev/mem with zeroing reads")
> > I.e. the low 1MB range should be considered allowed, but any reads will see
> > zeros.
> >
> > > +   unmap_mapping_range(inode->i_mapping, res->start, resource_size(res), 1);
> >
> > Is unmap_mapping_range() sufficient for this? Would it need to happen
> > once during open_port() or something more special during mmap_mem()?
> >
> > --
> > Kees Cook
>
> --
> Kees Cook
Dan Williams April 7, 2022, 11:43 p.m. UTC | #13
On Thu, Apr 7, 2022 at 11:47 AM Dan Williams <dan.j.williams@intel.com> wrote:
>
> On Wed, Apr 6, 2022 at 12:46 PM Kees Cook <keescook@chromium.org> wrote:
> >
> > *thread necromancy*
>
> It's alive!
>
> >
> > Hi Dan,
> >
> > I'm doing a KSPP bug scrub and am reviewing
> > https://github.com/KSPP/linux/issues/74 again.
> >
> > Do you have a chance to look at this? I'd love a way to make mmap()
> > behave the same way as read() for the first meg of /dev/mem.
>
> You want 0-reads or SIGBUS when attempting to access the first 1MB?
>
> Because it sounds like what you want is instead of loudly failing with
> -EPERM in drivers/char/mem.c::mmap_mem() you want it to silently
> succeed but swap in the zero page, right? Otherwise if it's SIGBUS
> then IO_STRICT_DEVMEM=y + marking that span as IORESOURCE_BUSY will
> "Do the Right Thing (TM).".

In other words, if IO_STRICT_DEVMEM is enabled then the enforcement is
already there at least for anything marked IORESOURCE_BUSY. So if
tools are ok with that protection today, maybe there is no need to do
the zero page dance. I.e. legacy tools that read(2) /dev/mem below 1MB
get zeroes, and apparently no tools were mmap'ing below 1MB, otherwise
they would have complained by now? At least Fedora is shipping
IO_STRICT_DEVMEM these days:

https://src.fedoraproject.org/rpms/kernel/blob/rawhide/f/kernel-x86_64-fedora.config#_2799
Kees Cook April 7, 2022, 11:46 p.m. UTC | #14
On Thu, Apr 07, 2022 at 11:47:40AM -0700, Dan Williams wrote:
> On Wed, Apr 6, 2022 at 12:46 PM Kees Cook <keescook@chromium.org> wrote:
> >
> > *thread necromancy*
> 
> It's alive!

*lightning*

> > I'm doing a KSPP bug scrub and am reviewing
> > https://github.com/KSPP/linux/issues/74 again.
> >
> > Do you have a chance to look at this? I'd love a way to make mmap()
> > behave the same way as read() for the first meg of /dev/mem.
> 
> You want 0-reads or SIGBUS when attempting to access the first 1MB?
> 
> Because it sounds like what you want is instead of loudly failing with
> -EPERM in drivers/char/mem.c::mmap_mem() you want it to silently
> succeed but swap in the zero page, right? Otherwise if it's SIGBUS
> then IO_STRICT_DEVMEM=y + marking that span as IORESOURCE_BUSY will
> "Do the Right Thing (TM).".

I'd like it to just return zeros, which is what "read" currently does.
(i.e. if page_is_allowed(), really devmem_is_allowed(), returns 2, then
the user should see all zeros.)

So, yes, "silently succeed but swap in the zero page" is exactly what
I'd like. When I last looked, for the first meg, mmap did _not_ fail,
but _actually_ read it.
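
For reference, the read() side after a4866aa81251 does roughly this (a
paraphrased sketch from memory, not a verbatim copy of
drivers/char/mem.c):

	/* paraphrased sketch of the read_mem() loop body, not verbatim */
	allowed = page_is_allowed(p >> PAGE_SHIFT);
	if (!allowed)
		return -EPERM;
	if (allowed == 2) {
		/* Show zeros for restricted memory. */
		remaining = clear_user(buf, sz);
		if (remaining)
			return -EFAULT;
	} else {
		/* ... translate and copy the real page contents ... */
	}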
Kees Cook April 8, 2022, 3:35 a.m. UTC | #15
On Thu, Apr 07, 2022 at 04:43:10PM -0700, Dan Williams wrote:
> On Thu, Apr 7, 2022 at 11:47 AM Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > On Wed, Apr 6, 2022 at 12:46 PM Kees Cook <keescook@chromium.org> wrote:
> > >
> > > *thread necromancy*
> >
> > It's alive!
> >
> > >
> > > Hi Dan,
> > >
> > > I'm doing a KSPP bug scrub and am reviewing
> > > https://github.com/KSPP/linux/issues/74 again.
> > >
> > > Do you have a chance to look at this? I'd love a way to make mmap()
> > > behave the same way as read() for the first meg of /dev/mem.
> >
> > You want 0-reads or SIGBUS when attempting to access the first 1MB?
> >
> > Because it sounds like what you want is instead of loudly failing with
> > -EPERM in drivers/char/mem.c::mmap_mem() you want it to silently
> > succeed but swap in the zero page, right? Otherwise if it's SIGBUS
> > then IO_STRICT_DEVMEM=y + marking that span as IORESOURCE_BUSY will
> > "Do the Right Thing (TM).".
> 
> In other words, if IO_STRICT_DEVMEM is enabled then the enforcement is
> already there at least for anything marked IORESOURCE_BUSY. So if
> tools are ok with that protection today, maybe there is no need to do
> the zero page dance. I.e. legacy tools that read(2) /dev/mem below 1MB
> get zeroes, and apparently no tools were mmap'ing below 1MB, otherwise
> they would have complained by now? At least Fedora is shipping
> IO_STRICT_DEVMEM these days:
> 
> https://src.fedoraproject.org/rpms/kernel/blob/rawhide/f/kernel-x86_64-fedora.config#_2799

When I try to mmap a RAM area <1MiB, mmap succeeds (range_is_allowed()
is non-zero), so I don't think IO_STRICT_DEVMEM would trip anything
using mmap on /dev/mem there.

I am only reading 0s from there, though I don't see what's actually
happening. I thought maybe it was just literally unused, but even with
CONFIG_PAGE_POISONING=y booted with page_poison=1, I still read 0s (not
0xaa), but I'd like to understand _why_ (i.e. I can't tell if it is
accidentally safe, intentionally safe, or my test is bad.)

For example:

# cat /proc/iomem
00000000-00000fff : Reserved
00001000-0009fbff : System RAM
0009fc00-0009ffff : Reserved
000a0000-000bffff : PCI Bus 0000:00
000c0000-000c99ff : Video ROM
...

If I mmap page 0, it's rejected (non-RAM). If I mmap page 1, it works,
but it's all 0s. (Which is what I'd like, but I don't see where this is
happening.)
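
A minimal version of the test I'm describing (a sketch, not my exact
program; needs root and CONFIG_DEVMEM=y):

/* mmap physical page 1 of /dev/mem (System RAM per the /proc/iomem
 * output above) read-only and dump the first bytes */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long pg = sysconf(_SC_PAGESIZE);
	int fd = open("/dev/mem", O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* offset pg == physical address 0x1000, i.e. page 1 */
	unsigned char *p = mmap(NULL, pg, PROT_READ, MAP_SHARED, fd, pg);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	for (int i = 0; i < 16; i++)
		printf("%02x ", p[i]);
	printf("\n");

	munmap(p, pg);
	close(fd);
	return 0;
}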

Hmmm.
Dan Williams April 8, 2022, 6:51 a.m. UTC | #16
On Thu, Apr 7, 2022 at 8:35 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Thu, Apr 07, 2022 at 04:43:10PM -0700, Dan Williams wrote:
> > On Thu, Apr 7, 2022 at 11:47 AM Dan Williams <dan.j.williams@intel.com> wrote:
> > >
> > > On Wed, Apr 6, 2022 at 12:46 PM Kees Cook <keescook@chromium.org> wrote:
> > > >
> > > > *thread necromancy*
> > >
> > > It's alive!
> > >
> > > >
> > > > Hi Dan,
> > > >
> > > > I'm doing a KSPP bug scrub and am reviewing
> > > > https://github.com/KSPP/linux/issues/74 again.
> > > >
> > > > Do you have a chance to look at this? I'd love a way to make mmap()
> > > > behave the same way as read() for the first meg of /dev/mem.
> > >
> > > You want 0-reads or SIGBUS when attempting to access the first 1MB?
> > >
> > > Because it sounds like what you want is instead of loudly failing with
> > > -EPERM in drivers/char/mem.c::mmap_mem() you want it to silently
> > > succeed but swap in the zero page, right? Otherwise if it's SIGBUS
> > > then IO_STRICT_DEVMEM=y + marking that span as IORESOURCE_BUSY will
> > > "Do the Right Thing (TM).".
> >
> > In other words, if IO_STRICT_DEVMEM is enabled then the enforcement is
> > already there at least for anything marked IORESOURCE_BUSY. So if
> > tools are ok with that protection today, maybe there is no need to do
> > the zero page dance. I.e. legacy tools that read(2) /dev/mem below 1MB
> > get zeroes, and apparently no tools were mmap'ing below 1MB, otherwise
> > they would have complained by now? At least Fedora is shipping
> > IO_STRICT_DEVMEM these days:
> >
> > https://src.fedoraproject.org/rpms/kernel/blob/rawhide/f/kernel-x86_64-fedora.config#_2799
>
> When I try to mmap a RAM area <1MiB, mmap succeeds (range_is_allowed()
> is non-zero), so I don't think IO_STRICT_DEVMEM would trip anything
> using mmap on /dev/mem there.
>
> I am only reading 0s from there, though I don't see what's actually
> happening. I thought maybe it was just literally unused, but even with
> CONFIG_PAGE_POISONING=y booted with page_poison=1, I still read 0s (not
> 0xaa), but I'd like to understand _why_ (i.e. I can't tell if it is
> accidentally safe, intentionally safe, or my test is bad.)
>
> For example:
>
> # cat /proc/iomem
> 00000000-00000fff : Reserved
> 00001000-0009fbff : System RAM
> 0009fc00-0009ffff : Reserved
> 000a0000-000bffff : PCI Bus 0000:00
> 000c0000-000c99ff : Video ROM
> ...
>
> If I mmap page 0, it's rejected (non-RAM). If I mmap page 1, it works,
> but it's all 0s. (Which is what I'd like, but I don't see where this is
> happening.)

I'm worried it's all zeros by luck, and that the return-2 logic in
devmem_is_allowed() is actually allowing the mmap() case to bypass
STRICT_DEVMEM where read(2) would have had the buffer cleared by the
kernel.

mmap_mem() would need to walk the range and map the zero_page pfn for
all of the intersections with system-ram, but if the mapping is
writable it would need to allocate memory to prevent the zero page
from being written. If you can write to it and still see your data on
the next attempt, then STRICT_DEVMEM is being bypassed.
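
Something like this for the read-only half (untested sketch;
page_is_ram()/my_zero_pfn() usage assumed, and the writable case would
need freshly allocated pages instead):

/*
 * Walk the vma and substitute the shared zero page pfn wherever the
 * requested range intersects System RAM. For /dev/mem, vm_pgoff is
 * the physical pfn.
 */
static int mmap_mem_zero_ram(struct vm_area_struct *vma)
{
	unsigned long addr, pfn = vma->vm_pgoff;

	for (addr = vma->vm_start; addr < vma->vm_end;
	     addr += PAGE_SIZE, pfn++) {
		unsigned long map_pfn = pfn;

		if (page_is_ram(pfn))
			map_pfn = my_zero_pfn(addr);

		if (remap_pfn_range(vma, addr, map_pfn, PAGE_SIZE,
				    vma->vm_page_prot))
			return -EAGAIN;
	}

	return 0;
}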
Patch

diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 43dd0891ca1e..31cae88a730b 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -31,11 +31,15 @@ 
 #include <linux/uio.h>
 #include <linux/uaccess.h>
 #include <linux/security.h>
+#include <linux/pseudo_fs.h>
+#include <uapi/linux/magic.h>
+#include <linux/mount.h>
 
 #ifdef CONFIG_IA64
 # include <linux/efi.h>
 #endif
 
+#define DEVMEM_MINOR	1
 #define DEVPORT_MINOR	4
 
 static inline unsigned long size_inside_page(unsigned long start,
@@ -805,12 +809,64 @@  static loff_t memory_lseek(struct file *file, loff_t offset, int orig)
 	return ret;
 }
 
+static struct inode *devmem_inode;
+
+#ifdef CONFIG_IO_STRICT_DEVMEM
+void revoke_devmem(struct resource *res)
+{
+	struct inode *inode = READ_ONCE(devmem_inode);
+
+	/*
+	 * Check that the initialization has completed. Losing the race
+	 * is ok because it means drivers are claiming resources before
+	 * the fs_initcall level of init and prevent /dev/mem from
+	 * establishing mappings.
+	 */
+	if (!inode)
+		return;
+
+	/*
+	 * The expectation is that the driver has successfully marked
+	 * the resource busy by this point, so devmem_is_allowed()
+	 * should start returning false, however for performance this
+	 * does not iterate the entire resource range.
+	 */
+	if (devmem_is_allowed(PHYS_PFN(res->start)) &&
+	    devmem_is_allowed(PHYS_PFN(res->end))) {
+		/*
+		 * *cringe* iomem=relaxed says "go ahead, what's the
+		 * worst that can happen?"
+		 */
+		return;
+	}
+
+	unmap_mapping_range(inode->i_mapping, res->start, resource_size(res), 1);
+}
+#endif
+
 static int open_port(struct inode *inode, struct file *filp)
 {
+	int rc;
+
 	if (!capable(CAP_SYS_RAWIO))
 		return -EPERM;
 
-	return security_locked_down(LOCKDOWN_DEV_MEM);
+	rc = security_locked_down(LOCKDOWN_DEV_MEM);
+	if (rc)
+		return rc;
+
+	if (iminor(inode) != DEVMEM_MINOR)
+		return 0;
+
+	/*
+	 * Use a unified address space to have a single point to manage
+	 * revocations when drivers want to take over a /dev/mem mapped
+	 * range.
+	 */
+	inode->i_mapping = devmem_inode->i_mapping;
+	filp->f_mapping = inode->i_mapping;
+
+	return 0;
 }
 
 #define zero_lseek	null_lseek
@@ -885,7 +941,7 @@  static const struct memdev {
 	fmode_t fmode;
 } devlist[] = {
 #ifdef CONFIG_DEVMEM
-	 [1] = { "mem", 0, &mem_fops, FMODE_UNSIGNED_OFFSET },
+	 [DEVMEM_MINOR] = { "mem", 0, &mem_fops, FMODE_UNSIGNED_OFFSET },
 #endif
 #ifdef CONFIG_DEVKMEM
 	 [2] = { "kmem", 0, &kmem_fops, FMODE_UNSIGNED_OFFSET },
@@ -939,6 +995,45 @@  static char *mem_devnode(struct device *dev, umode_t *mode)
 
 static struct class *mem_class;
 
+static int devmem_fs_init_fs_context(struct fs_context *fc)
+{
+	return init_pseudo(fc, DEVMEM_MAGIC) ? 0 : -ENOMEM;
+}
+
+static struct file_system_type devmem_fs_type = {
+	.name		= "devmem",
+	.owner		= THIS_MODULE,
+	.init_fs_context = devmem_fs_init_fs_context,
+	.kill_sb	= kill_anon_super,
+};
+
+static int devmem_init_inode(void)
+{
+	static struct vfsmount *devmem_vfs_mount;
+	static int devmem_fs_cnt;
+	struct inode *inode;
+	int rc;
+
+	rc = simple_pin_fs(&devmem_fs_type, &devmem_vfs_mount, &devmem_fs_cnt);
+	if (rc < 0) {
+		pr_err("Cannot mount /dev/mem pseudo filesystem: %d\n", rc);
+		return rc;
+	}
+
+	inode = alloc_anon_inode(devmem_vfs_mount->mnt_sb);
+	if (IS_ERR(inode)) {
+		rc = PTR_ERR(inode);
+		pr_err("Cannot allocate inode for /dev/mem: %d\n", rc);
+		simple_release_fs(&devmem_vfs_mount, &devmem_fs_cnt);
+		return rc;
+	}
+
+	/* publish /dev/mem initialized */
+	WRITE_ONCE(devmem_inode, inode);
+
+	return 0;
+}
+
 static int __init chr_dev_init(void)
 {
 	int minor;
@@ -960,6 +1055,8 @@  static int __init chr_dev_init(void)
 		 */
 		if ((minor == DEVPORT_MINOR) && !arch_has_dev_port())
 			continue;
+		if ((minor == DEVMEM_MINOR) && devmem_init_inode() != 0)
+			continue;
 
 		device_create(mem_class, NULL, MKDEV(MEM_MAJOR, minor),
 			      NULL, devlist[minor].name);
diff --git a/include/linux/ioport.h b/include/linux/ioport.h
index a9b9170b5dd2..6c3eca90cbc4 100644
--- a/include/linux/ioport.h
+++ b/include/linux/ioport.h
@@ -301,5 +301,11 @@  struct resource *devm_request_free_mem_region(struct device *dev,
 struct resource *request_free_mem_region(struct resource *base,
 		unsigned long size, const char *name);
 
+#ifdef CONFIG_IO_STRICT_DEVMEM
+void revoke_devmem(struct resource *res);
+#else
+static inline void revoke_devmem(struct resource *res) { };
+#endif
+
 #endif /* __ASSEMBLY__ */
 #endif	/* _LINUX_IOPORT_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index d78064007b17..f3956fc11de6 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -94,6 +94,7 @@ 
 #define BALLOON_KVM_MAGIC	0x13661366
 #define ZSMALLOC_MAGIC		0x58295829
 #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
+#define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define Z3FOLD_MAGIC		0x33
 #define PPC_CMM_MAGIC		0xc7571590
 
diff --git a/kernel/resource.c b/kernel/resource.c
index 76036a41143b..841737bbda9e 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -1126,6 +1126,7 @@  struct resource * __request_region(struct resource *parent,
 {
 	DECLARE_WAITQUEUE(wait, current);
 	struct resource *res = alloc_resource(GFP_KERNEL);
+	struct resource *orig_parent = parent;
 
 	if (!res)
 		return NULL;
@@ -1176,6 +1177,10 @@  struct resource * __request_region(struct resource *parent,
 		break;
 	}
 	write_unlock(&resource_lock);
+
+	if (res && orig_parent == &iomem_resource)
+		revoke_devmem(res);
+
 	return res;
 }
 EXPORT_SYMBOL(__request_region);
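
For illustration, the driver-side trigger: request_mem_region() expands
to __request_region(&iomem_resource, ...), so a claim like the one in
this hypothetical module (example address and name assumed) now also
revokes any established /dev/mem mappings of the range:

/* Hypothetical example, not part of the patch */
#include <linux/module.h>
#include <linux/ioport.h>

#define EXAMPLE_BASE	0xfed40000UL	/* assumed example address */
#define EXAMPLE_SIZE	0x1000UL

static int __init example_init(void)
{
	/* marks the range IORESOURCE_BUSY and triggers revoke_devmem() */
	if (!request_mem_region(EXAMPLE_BASE, EXAMPLE_SIZE, "example"))
		return -EBUSY;
	return 0;
}

static void __exit example_exit(void)
{
	release_mem_region(EXAMPLE_BASE, EXAMPLE_SIZE);
}

module_init(example_init);
module_exit(example_exit);
MODULE_LICENSE("GPL");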