diff mbox series

[RFC,v5,1/7] mseal, system mappings: kernel config and header change

Message ID 20250212032155.1276806-2-jeffxu@google.com (mailing list archive)
State New
Headers show
Series mseal system mappings | expand

Commit Message

Jeff Xu Feb. 12, 2025, 3:21 a.m. UTC
From: Jeff Xu <jeffxu@chromium.org>

Provide infrastructure to mseal system mappings. Establish
two kernel configs (CONFIG_MSEAL_SYSTEM_MAPPINGS,
ARCH_HAS_MSEAL_SYSTEM_MAPPINGS) and a header file (userprocess.h)
for future patches.

As discussed during mseal() upstream process [1], mseal() protects
the VMAs of a given virtual memory range against modifications, such
as the read/write (RW) and no-execute (NX) bits. For complete
descriptions of memory sealing, please see mseal.rst [2].

The mseal() is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For
example, such an attacker primitive can break control-flow integrity
guarantees since read-only memory that is supposed to be trusted can
become writable or .text pages can get remapped.

The system mappings are readonly only, memory sealing can protect
them from ever changing to writable or unmmap/remapped as different
attributes.

System mappings such as vdso, vvar, and sigpage (arm), vectors (arm)
are created by the kernel during program initialization, and could
be sealed after creation.

Unlike the aforementioned mappings, the uprobe mapping is not
established during program startup. However, its lifetime is the same
as the process's lifetime [3]. It could be sealed from creation.

The vsyscall on x86-64 uses a special address (0xffffffffff600000),
which is outside the mm managed range. This means mprotect, munmap, and
mremap won't work on the vsyscall. Since sealing doesn't enhance
the vsyscall's security, it is skipped in this patch. If we ever seal
the vsyscall, it is probably only for decorative purpose, i.e. showing
the 'sl' flag in the /proc/pid/smaps. For this patch, it is ignored.

It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may
alter the system mappings during restore operations. UML(User Mode Linux)
and gVisor are also known to change the vdso/vvar mappings. Consequently,
this feature cannot be universally enabled across all systems. As such,
CONFIG_MSEAL_SYSTEM_MAPPINGS is disabled by default.

To support mseal of system mappings, architectures must define
CONFIG_ARCH_HAS_MSEAL_SYSTEM_MAPPINGS and update their special mappings
calls to pass mseal flag. Additionally, architectures must confirm they
do not unmap/remap system mappings during the process lifetime.

In this version, we've improved the handling of system mapping sealing from
previous versions, instead of modifying the _install_special_mapping
function itself, which would affect all architectures, we now call
_install_special_mapping with a sealing flag only within the specific
architecture that requires it. This targeted approach offers two key
advantages: 1) It limits the code change's impact to the necessary
architectures, and 2) It aligns with the software architecture by keeping
the core memory management within the mm layer, while delegating the
decision of sealing system mappings to the individual architecture, which
is particularly relevant since 32-bit architectures never require sealing.

Additionally, this patch introduces a new header,
include/linux/usrprocess.h, which contains the mseal_system_mappings()
function. This function helps the architecture determine if system
mapping is enabled within the current kernel configuration. It can be
extended in the future to handle opt-in/out prctl for enabling system
mapping sealing at the process level or a kernel cmdline feature.

A new header file was introduced because it was difficult to find the
best location for this function. Although include/linux/mm.h was
considered, this feature is more closely related to user processes
than core memory management. Additionally, future prctl or kernel
cmd-line implementations for this feature would not fit within the
scope of core memory management or mseal.c. This is relevant because
if we add unit-tests for mseal.c in the future, we would want to limit
mseal.c's dependencies to core memory management.

Prior to this patch series, we explored sealing special mappings from
userspace using glibc's dynamic linker. This approach revealed several
issues:
- The PT_LOAD header may report an incorrect length for vdso, (smaller
  than its actual size). The dynamic linker, which relies on PT_LOAD
  information to determine mapping size, would then split and partially
  seal the vdso mapping. Since each architecture has its own vdso/vvar
  code, fixing this in the kernel would require going through each
  archiecture. Our initial goal was to enable sealing readonly mappings,
  e.g. .text, across all architectures, sealing vdso from kernel since
  creation appears to be simpler than sealing vdso at glibc.
- The [vvar] mapping header only contains address information, not length
  information. Similar issues might exist for other special mappings.
- Mappings like uprobe are not covered by the dynamic linker,
  and there is no effective solution for them.

This feature's security enhancements will benefit ChromeOS, Android,
and other high security systems.

Testing:
This feature was tested on ChromeOS and Android for both x86-64 and ARM64.
- Enable sealing and verify vdso/vvar, sigpage, vector are sealed properly,
  i.e. "sl" shown in the smaps for those mappings, and mremap is blocked.
- Passing various automation tests (e.g. pre-checkin) on ChromeOS and
  Android to ensure the sealing doesn't affect the functionality of
  Chromebook and Android phone.

I also tested the feature on Ubuntu on x86-64:
- With config disabled, vdso/vvar is not sealed,
- with config enabled, vdso/vvar is sealed, and booting up Ubuntu is OK,
  normal operations such as browsing the web, open/edit doc are OK.

In addition, Benjamin Berg tested this on UML.

Link: https://lore.kernel.org/all/20240415163527.626541-1-jeffxu@chromium.org/ [1]
Link: Documentation/userspace-api/mseal.rst [2]
Link: https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/ [3]

Signed-off-by: Jeff Xu <jeffxu@chromium.org>
---
 include/linux/userprocess.h | 18 ++++++++++++++++++
 init/Kconfig                | 18 ++++++++++++++++++
 security/Kconfig            | 18 ++++++++++++++++++
 3 files changed, 54 insertions(+)
 create mode 100644 include/linux/userprocess.h

Comments

Randy Dunlap Feb. 12, 2025, 3:31 a.m. UTC | #1
On 2/11/25 7:21 PM, jeffxu@chromium.org wrote:
> From: Jeff Xu <jeffxu@chromium.org>
> 

> ---
>  include/linux/userprocess.h | 18 ++++++++++++++++++
>  init/Kconfig                | 18 ++++++++++++++++++
>  security/Kconfig            | 18 ++++++++++++++++++
>  3 files changed, 54 insertions(+)
>  create mode 100644 include/linux/userprocess.h
> 

> diff --git a/init/Kconfig b/init/Kconfig
> index d0d021b3fa3b..892d2bcdf397 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1882,6 +1882,24 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS
>  config ARCH_HAS_MEMBARRIER_SYNC_CORE
>  	bool
>  
> +config ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
> +	bool
> +	help
> +	  Control MSEAL_SYSTEM_MAPPINGS access based on architecture.
> +
> +	  A 64-bit kernel is required for the memory sealing feature.
> +	  No specific hardware features from the CPU are needed.
> +
> +	  To enable this feature, the architecture needs to update their
> +	  speical mappings calls to include the sealing flag and confirm

	  special

> +	  that it doesn't unmap/remap system mappings during the life
> +	  time of the process. After the architecture enables this, a
> +	  distribution can set CONFIG_MSEAL_SYSTEM_MAPPING to manage access
> +	  to the feature.
> +
> +	  For complete descriptions of memory sealing, please see
> +	  Documentation/userspace-api/mseal.rst
> +
>  config HAVE_PERF_EVENTS
>  	bool
>  	help
Jeff Xu Feb. 12, 2025, 3:40 a.m. UTC | #2
On Tue, Feb 11, 2025 at 7:31 PM Randy Dunlap <rdunlap@infradead.org> wrote:
>
>
>
> On 2/11/25 7:21 PM, jeffxu@chromium.org wrote:
> > From: Jeff Xu <jeffxu@chromium.org>
> >
>
> > ---
> >  include/linux/userprocess.h | 18 ++++++++++++++++++
> >  init/Kconfig                | 18 ++++++++++++++++++
> >  security/Kconfig            | 18 ++++++++++++++++++
> >  3 files changed, 54 insertions(+)
> >  create mode 100644 include/linux/userprocess.h
> >
>
> > diff --git a/init/Kconfig b/init/Kconfig
> > index d0d021b3fa3b..892d2bcdf397 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -1882,6 +1882,24 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS
> >  config ARCH_HAS_MEMBARRIER_SYNC_CORE
> >       bool
> >
> > +config ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
> > +     bool
> > +     help
> > +       Control MSEAL_SYSTEM_MAPPINGS access based on architecture.
> > +
> > +       A 64-bit kernel is required for the memory sealing feature.
> > +       No specific hardware features from the CPU are needed.
> > +
> > +       To enable this feature, the architecture needs to update their
> > +       speical mappings calls to include the sealing flag and confirm
>
>           special
>
Ack, will fix.

Thanks !
-Jeff

> > +       that it doesn't unmap/remap system mappings during the life
> > +       time of the process. After the architecture enables this, a
> > +       distribution can set CONFIG_MSEAL_SYSTEM_MAPPING to manage access
> > +       to the feature.
> > +
> > +       For complete descriptions of memory sealing, please see
> > +       Documentation/userspace-api/mseal.rst
> > +
> >  config HAVE_PERF_EVENTS
> >       bool
> >       help
>
>
> --
> ~Randy
>
Liam R. Howlett Feb. 12, 2025, 3:05 p.m. UTC | #3
* jeffxu@chromium.org <jeffxu@chromium.org> [250211 22:22]:
> From: Jeff Xu <jeffxu@chromium.org>
> 
> Provide infrastructure to mseal system mappings. Establish
> two kernel configs (CONFIG_MSEAL_SYSTEM_MAPPINGS,
> ARCH_HAS_MSEAL_SYSTEM_MAPPINGS) and a header file (userprocess.h)
> for future patches.
> 
> As discussed during mseal() upstream process [1], mseal() protects
> the VMAs of a given virtual memory range against modifications, such
> as the read/write (RW) and no-execute (NX) bits. For complete
> descriptions of memory sealing, please see mseal.rst [2].
> 
> The mseal() is useful to mitigate memory corruption issues where a
> corrupted pointer is passed to a memory management system. For
> example, such an attacker primitive can break control-flow integrity
> guarantees since read-only memory that is supposed to be trusted can
> become writable or .text pages can get remapped.
> 
> The system mappings are readonly only, memory sealing can protect
> them from ever changing to writable or unmmap/remapped as different
> attributes.
> 
> System mappings such as vdso, vvar, and sigpage (arm), vectors (arm)
> are created by the kernel during program initialization, and could
> be sealed after creation.
> 
> Unlike the aforementioned mappings, the uprobe mapping is not
> established during program startup. However, its lifetime is the same
> as the process's lifetime [3]. It could be sealed from creation.
> 
> The vsyscall on x86-64 uses a special address (0xffffffffff600000),
> which is outside the mm managed range. This means mprotect, munmap, and
> mremap won't work on the vsyscall. Since sealing doesn't enhance
> the vsyscall's security, it is skipped in this patch. If we ever seal
> the vsyscall, it is probably only for decorative purpose, i.e. showing
> the 'sl' flag in the /proc/pid/smaps. For this patch, it is ignored.
> 
> It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may
> alter the system mappings during restore operations. UML(User Mode Linux)
> and gVisor are also known to change the vdso/vvar mappings. Consequently,
> this feature cannot be universally enabled across all systems. As such,
> CONFIG_MSEAL_SYSTEM_MAPPINGS is disabled by default.
> 
> To support mseal of system mappings, architectures must define
> CONFIG_ARCH_HAS_MSEAL_SYSTEM_MAPPINGS and update their special mappings
> calls to pass mseal flag. Additionally, architectures must confirm they
> do not unmap/remap system mappings during the process lifetime.
> 
> In this version, we've improved the handling of system mapping sealing from
> previous versions, instead of modifying the _install_special_mapping
> function itself, which would affect all architectures, we now call
> _install_special_mapping with a sealing flag only within the specific
> architecture that requires it. This targeted approach offers two key
> advantages: 1) It limits the code change's impact to the necessary
> architectures, and 2) It aligns with the software architecture by keeping
> the core memory management within the mm layer, while delegating the
> decision of sealing system mappings to the individual architecture, which
> is particularly relevant since 32-bit architectures never require sealing.
> 
> Additionally, this patch introduces a new header,
> include/linux/usrprocess.h, which contains the mseal_system_mappings()
> function. This function helps the architecture determine if system
> mapping is enabled within the current kernel configuration. It can be
> extended in the future to handle opt-in/out prctl for enabling system
> mapping sealing at the process level or a kernel cmdline feature.
> 
> A new header file was introduced because it was difficult to find the
> best location for this function. Although include/linux/mm.h was
> considered, this feature is more closely related to user processes
> than core memory management. Additionally, future prctl or kernel
> cmd-line implementations for this feature would not fit within the
> scope of core memory management or mseal.c. This is relevant because
> if we add unit-tests for mseal.c in the future, we would want to limit
> mseal.c's dependencies to core memory management.
> 
> Prior to this patch series, we explored sealing special mappings from
> userspace using glibc's dynamic linker. This approach revealed several
> issues:
> - The PT_LOAD header may report an incorrect length for vdso, (smaller
>   than its actual size). The dynamic linker, which relies on PT_LOAD
>   information to determine mapping size, would then split and partially
>   seal the vdso mapping. Since each architecture has its own vdso/vvar
>   code, fixing this in the kernel would require going through each
>   archiecture. Our initial goal was to enable sealing readonly mappings,
>   e.g. .text, across all architectures, sealing vdso from kernel since
>   creation appears to be simpler than sealing vdso at glibc.
> - The [vvar] mapping header only contains address information, not length
>   information. Similar issues might exist for other special mappings.
> - Mappings like uprobe are not covered by the dynamic linker,
>   and there is no effective solution for them.
> 
> This feature's security enhancements will benefit ChromeOS, Android,
> and other high security systems.
> 
> Testing:
> This feature was tested on ChromeOS and Android for both x86-64 and ARM64.
> - Enable sealing and verify vdso/vvar, sigpage, vector are sealed properly,
>   i.e. "sl" shown in the smaps for those mappings, and mremap is blocked.
> - Passing various automation tests (e.g. pre-checkin) on ChromeOS and
>   Android to ensure the sealing doesn't affect the functionality of
>   Chromebook and Android phone.
> 
> I also tested the feature on Ubuntu on x86-64:
> - With config disabled, vdso/vvar is not sealed,
> - with config enabled, vdso/vvar is sealed, and booting up Ubuntu is OK,
>   normal operations such as browsing the web, open/edit doc are OK.
> 
> In addition, Benjamin Berg tested this on UML.
> 
> Link: https://lore.kernel.org/all/20240415163527.626541-1-jeffxu@chromium.org/ [1]
> Link: Documentation/userspace-api/mseal.rst [2]
> Link: https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/ [3]

I found it odd that the history and V4 links were not here, but I
realised that was in 0/7 instead.

> 
> Signed-off-by: Jeff Xu <jeffxu@chromium.org>
> ---
>  include/linux/userprocess.h | 18 ++++++++++++++++++
>  init/Kconfig                | 18 ++++++++++++++++++
>  security/Kconfig            | 18 ++++++++++++++++++
>  3 files changed, 54 insertions(+)
>  create mode 100644 include/linux/userprocess.h
> 
> diff --git a/include/linux/userprocess.h b/include/linux/userprocess.h
> new file mode 100644
> index 000000000000..bd11e2e972c5
> --- /dev/null
> +++ b/include/linux/userprocess.h
> @@ -0,0 +1,18 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_USER_PROCESS_H
> +#define _LINUX_USER_PROCESS_H
> +#include <linux/mm.h>

Why is this in a new file and not mm.h directly?  Anyone who needs to
use this will pull in mm.h anyways, and probably already has mm.h
included.

> +
> +/*
> + * mseal of userspace process's system mappings.
> + */
> +static inline unsigned long mseal_system_mappings(void)
> +{
> +#ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS
> +	return VM_SEALED;
> +#else
> +	return 0;
> +#endif
> +}
> +
> +#endif

Looking in mm.h, there are examples of config checks which either set
the bit or set it to 0.  This means you wouldn't need the function at
all and the precompiler would do all the work (although with a static
inline, I'm not sure it would make a big difference in instructions).

For instance, you could do:
#ifdef CONFIG_64BIT                                                                                                      
/* VM is sealed, in vm_flags */
#define VM_SEALED       _BITUL(63)

#ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS
#define VM_SYSTEM_SEAL		VM_SEALED
else
#define VM_SYSTEM_SEAL		VM_NONE
#endif

#else /* CONFIG_64BIT */
#define VM_SYSTEM_SEAL		VM_NONE
#endif

Then you can use VM_SYSTEM_SEAL in the system mappings function in the
list of flags instead of having a variable + calling the static
function.

I didn't look close enough to see if the 32bit version is necessary.

> diff --git a/init/Kconfig b/init/Kconfig
> index d0d021b3fa3b..892d2bcdf397 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1882,6 +1882,24 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS
>  config ARCH_HAS_MEMBARRIER_SYNC_CORE
>  	bool
>  
> +config ARCH_HAS_MSEAL_SYSTEM_MAPPINGS

ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS is more clear.  HAS may mean it's
supported or it could mean it's enabled. SUPPORTS is more clear that
this is set if an arch supports the feature.  After all, I think HAS
here means it "has support for" mseal system mappings.

I see that both are used (HAS and SUPPORTS), but I'm still not sure what
HAS means in other contexts without digging.

> +	bool
> +	help
> +	  Control MSEAL_SYSTEM_MAPPINGS access based on architecture.
> +
> +	  A 64-bit kernel is required for the memory sealing feature.
> +	  No specific hardware features from the CPU are needed.
> +
> +	  To enable this feature, the architecture needs to update their
> +	  speical mappings calls to include the sealing flag and confirm
> +	  that it doesn't unmap/remap system mappings during the life
> +	  time of the process. After the architecture enables this, a
> +	  distribution can set CONFIG_MSEAL_SYSTEM_MAPPING to manage access
> +	  to the feature.
> +
> +	  For complete descriptions of memory sealing, please see
> +	  Documentation/userspace-api/mseal.rst
> +
>  config HAVE_PERF_EVENTS
>  	bool
>  	help
> diff --git a/security/Kconfig b/security/Kconfig
> index f10dbf15c294..bfb35fc7a3c6 100644
> --- a/security/Kconfig
> +++ b/security/Kconfig
> @@ -51,6 +51,24 @@ config PROC_MEM_NO_FORCE
>  
>  endchoice
>  
> +config MSEAL_SYSTEM_MAPPINGS
> +	bool "mseal system mappings"
> +	depends on 64BIT
> +	depends on ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
> +	depends on !CHECKPOINT_RESTORE
> +	help
> +	  Seal system mappings such as vdso, vvar, sigpage, uprobes, etc.
> +
> +	  A 64-bit kernel is required for the memory sealing feature.
> +	  No specific hardware features from the CPU are needed.
> +
> +	  Note: CHECKPOINT_RESTORE, UML, gVisor are known to relocate or
> +	  unmap system mapping, therefore this config can't be enabled
> +	  universally.

I guess add rr to the list, since that's also reported to have issues as
well.

> +
> +	  For complete descriptions of memory sealing, please see
> +	  Documentation/userspace-api/mseal.rst
> +
>  config SECURITY
>  	bool "Enable different security models"
>  	depends on SYSFS
> -- 
> 2.48.1.502.g6dc24dfdaf-goog
>
Jeff Xu Feb. 13, 2025, 5:15 p.m. UTC | #4
On Wed, Feb 12, 2025 at 7:05 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
...
> >
> > In this version, we've improved the handling of system mapping sealing from
> > previous versions, instead of modifying the _install_special_mapping
> > function itself, which would affect all architectures, we now call
> > _install_special_mapping with a sealing flag only within the specific
> > architecture that requires it. This targeted approach offers two key
> > advantages: 1) It limits the code change's impact to the necessary
> > architectures, and 2) It aligns with the software architecture by keeping
> > the core memory management within the mm layer, while delegating the
> > decision of sealing system mappings to the individual architecture, which
> > is particularly relevant since 32-bit architectures never require sealing.
> >
> > Additionally, this patch introduces a new header,
> > include/linux/usrprocess.h, which contains the mseal_system_mappings()
> > function. This function helps the architecture determine if system
> > mapping is enabled within the current kernel configuration. It can be
> > extended in the future to handle opt-in/out prctl for enabling system
> > mapping sealing at the process level or a kernel cmdline feature.
> >
> > A new header file was introduced because it was difficult to find the
> > best location for this function. Although include/linux/mm.h was
> > considered, this feature is more closely related to user processes
> > than core memory management. Additionally, future prctl or kernel
> > cmd-line implementations for this feature would not fit within the
> > scope of core memory management or mseal.c. This is relevant because
> > if we add unit-tests for mseal.c in the future, we would want to limit
> > mseal.c's dependencies to core memory management.
> >
...
> >
> > diff --git a/include/linux/userprocess.h b/include/linux/userprocess.h
> > new file mode 100644
> > index 000000000000..bd11e2e972c5
> > --- /dev/null
> > +++ b/include/linux/userprocess.h
> > @@ -0,0 +1,18 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _LINUX_USER_PROCESS_H
> > +#define _LINUX_USER_PROCESS_H
> > +#include <linux/mm.h>
>
> Why is this in a new file and not mm.h directly?  Anyone who needs to
> use this will pull in mm.h anyways, and probably already has mm.h
> included.
>
The commit message includes the reason. I've snipped and left the
relevant portion for easy reference, please see the beginning of this
email.

> > +
> > +/*
> > + * mseal of userspace process's system mappings.
> > + */
> > +static inline unsigned long mseal_system_mappings(void)
> > +{
> > +#ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS
> > +     return VM_SEALED;
> > +#else
> > +     return 0;
> > +#endif
> > +}
> > +
> > +#endif
>
> Looking in mm.h, there are examples of config checks which either set
> the bit or set it to 0.  This means you wouldn't need the function at
> all and the precompiler would do all the work (although with a static
> inline, I'm not sure it would make a big difference in instructions).
>
> For instance, you could do:
> #ifdef CONFIG_64BIT
> /* VM is sealed, in vm_flags */
> #define VM_SEALED       _BITUL(63)
>
> #ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS
> #define VM_SYSTEM_SEAL          VM_SEALED
> else
> #define VM_SYSTEM_SEAL          VM_NONE
> #endif
>
> #else /* CONFIG_64BIT */
> #define VM_SYSTEM_SEAL          VM_NONE
> #endif
>
> Then you can use VM_SYSTEM_SEAL in the system mappings function in the
> list of flags instead of having a variable + calling the static
> function.
>
> I didn't look close enough to see if the 32bit version is necessary.
>
I'm aware of the other pattern.

VM_SEALED isn't defined in 32-bit systems, and mseal.c isn't part of
the build. This is intentional. Any 32-bit code trying to use the
sealing function or the VM_SEALED flag will immediately fail
compilation. This makes it easier to identify incorrect usage.

> > diff --git a/init/Kconfig b/init/Kconfig
> > index d0d021b3fa3b..892d2bcdf397 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -1882,6 +1882,24 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS
> >  config ARCH_HAS_MEMBARRIER_SYNC_CORE
> >       bool
> >
> > +config ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
>
> ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS is more clear.  HAS may mean it's
> supported or it could mean it's enabled. SUPPORTS is more clear that
> this is set if an arch supports the feature.  After all, I think HAS
> here means it "has support for" mseal system mappings.
>
> I see that both are used (HAS and SUPPORTS), but I'm still not sure what
> HAS means in other contexts without digging.
>
The existing pattern is to indicate arch support with
CONFIG_ARCH_HAS_N and   security/KConfig to have CONFIG_N as the main
switch. For example, CONFIG_ARCH_HAS_FORTIFY_SOURCE and
CONFIG_FORTIFY_SOURCE. Since the MSEAL_SYSTEM_MAPPINGS is placed in
security/Kconfig, hence I'm following the existing pattern in
security/Kconfig.

> > diff --git a/security/Kconfig b/security/Kconfig
> > index f10dbf15c294..bfb35fc7a3c6 100644
> > --- a/security/Kconfig
> > +++ b/security/Kconfig
> > @@ -51,6 +51,24 @@ config PROC_MEM_NO_FORCE
> >
> >  endchoice
> >
> > +config MSEAL_SYSTEM_MAPPINGS
> > +     bool "mseal system mappings"
> > +     depends on 64BIT
> > +     depends on ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
> > +     depends on !CHECKPOINT_RESTORE
> > +     help
> > +       Seal system mappings such as vdso, vvar, sigpage, uprobes, etc.
> > +
> > +       A 64-bit kernel is required for the memory sealing feature.
> > +       No specific hardware features from the CPU are needed.
> > +
> > +       Note: CHECKPOINT_RESTORE, UML, gVisor are known to relocate or
> > +       unmap system mapping, therefore this config can't be enabled
> > +       universally.
>
> I guess add rr to the list, since that's also reported to have issues as
> well.
>
sure.

Thanks for reviewing.

-Jeff
Liam R. Howlett Feb. 13, 2025, 6:29 p.m. UTC | #5
* Jeff Xu <jeffxu@chromium.org> [250213 12:17]:
> On Wed, Feb 12, 2025 at 7:05 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> ...
> > >
> > > In this version, we've improved the handling of system mapping sealing from
> > > previous versions, instead of modifying the _install_special_mapping
> > > function itself, which would affect all architectures, we now call
> > > _install_special_mapping with a sealing flag only within the specific
> > > architecture that requires it. This targeted approach offers two key
> > > advantages: 1) It limits the code change's impact to the necessary
> > > architectures, and 2) It aligns with the software architecture by keeping
> > > the core memory management within the mm layer, while delegating the
> > > decision of sealing system mappings to the individual architecture, which
> > > is particularly relevant since 32-bit architectures never require sealing.
> > >
> > > Additionally, this patch introduces a new header,
> > > include/linux/usrprocess.h, which contains the mseal_system_mappings()
> > > function. This function helps the architecture determine if system
> > > mapping is enabled within the current kernel configuration. It can be
> > > extended in the future to handle opt-in/out prctl for enabling system
> > > mapping sealing at the process level or a kernel cmdline feature.
> > >
> > > A new header file was introduced because it was difficult to find the
> > > best location for this function. Although include/linux/mm.h was
> > > considered, this feature is more closely related to user processes
> > > than core memory management. Additionally, future prctl or kernel
> > > cmd-line implementations for this feature would not fit within the
> > > scope of core memory management or mseal.c. This is relevant because
> > > if we add unit-tests for mseal.c in the future, we would want to limit
> > > mseal.c's dependencies to core memory management.
> > >
> ...
> > >
> > > diff --git a/include/linux/userprocess.h b/include/linux/userprocess.h
> > > new file mode 100644
> > > index 000000000000..bd11e2e972c5
> > > --- /dev/null
> > > +++ b/include/linux/userprocess.h
> > > @@ -0,0 +1,18 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +#ifndef _LINUX_USER_PROCESS_H
> > > +#define _LINUX_USER_PROCESS_H
> > > +#include <linux/mm.h>
> >
> > Why is this in a new file and not mm.h directly?  Anyone who needs to
> > use this will pull in mm.h anyways, and probably already has mm.h
> > included.
> >
> The commit message includes the reason. I've snipped and left the
> relevant portion for easy reference, please see the beginning of this
> email.

This is a _really_ good reason NOT to stack the entire patch set
description into a single patch description.  Within the above 111 lines
of text, I missed that statement.

When Andrew takes the first patch, he'll put the patch description in
there, and then we can actually concentrate on the patch with the
limited context of what is going on.

If it doesn't go through Andrew's branch, then we can fall back to each
patch standing on its own with the change stating why things are done.


Also, I don't agree with your stated reason for a new header.

Please remove this header until it is needed.  It can be added later if
it is needed.  If we all had tiny headers because we might need them in
the future, we'd have serious issues finding anything and the list of
included headers would be huge.

You have added unnecessary changes and complications to this patch set
on v5.

> 
> > > +
> > > +/*
> > > + * mseal of userspace process's system mappings.
> > > + */
> > > +static inline unsigned long mseal_system_mappings(void)
> > > +{
> > > +#ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS
> > > +     return VM_SEALED;
> > > +#else
> > > +     return 0;
> > > +#endif
> > > +}
> > > +
> > > +#endif
> >
> > Looking in mm.h, there are examples of config checks which either set
> > the bit or set it to 0.  This means you wouldn't need the function at
> > all and the precompiler would do all the work (although with a static
> > inline, I'm not sure it would make a big difference in instructions).
> >
> > For instance, you could do:
> > #ifdef CONFIG_64BIT
> > /* VM is sealed, in vm_flags */
> > #define VM_SEALED       _BITUL(63)
> >
> > #ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS
> > #define VM_SYSTEM_SEAL          VM_SEALED
> > else
> > #define VM_SYSTEM_SEAL          VM_NONE
> > #endif
> >
> > #else /* CONFIG_64BIT */
> > #define VM_SYSTEM_SEAL          VM_NONE
> > #endif
> >
> > Then you can use VM_SYSTEM_SEAL in the system mappings function in the
> > list of flags instead of having a variable + calling the static
> > function.
> >
> > I didn't look close enough to see if the 32bit version is necessary.
> >
> I'm aware of the other pattern.
> 
> VM_SEALED isn't defined in 32-bit systems, and mseal.c isn't part of
> the build. This is intentional. Any 32-bit code trying to use the
> sealing function or the VM_SEALED flag will immediately fail
> compilation. This makes it easier to identify incorrect usage.

So you are against using the #define because the VM_SYSTEM_SEAL will be
defined, even though it will be VM_NONE?  This is no worse than a
function that returns 0, and it aligns better with what we have today in
that it can be put in the list of other flags.

Also, my vote for where you should put this code is clear: it should
live with the mseal #define in mm.h.  You're going to need mm.h anyways,
and that's a big hint as to where it should live.

> 
> > > diff --git a/init/Kconfig b/init/Kconfig
> > > index d0d021b3fa3b..892d2bcdf397 100644
> > > --- a/init/Kconfig
> > > +++ b/init/Kconfig
> > > @@ -1882,6 +1882,24 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS
> > >  config ARCH_HAS_MEMBARRIER_SYNC_CORE
> > >       bool
> > >
> > > +config ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
> >
> > ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS is more clear.  HAS may mean it's
> > supported or it could mean it's enabled. SUPPORTS is more clear that
> > this is set if an arch supports the feature.  After all, I think HAS
> > here means it "has support for" mseal system mappings.
> >
> > I see that both are used (HAS and SUPPORTS), but I'm still not sure what
> > HAS means in other contexts without digging.
> >
> The existing pattern is to indicate arch support with
> CONFIG_ARCH_HAS_N and   security/KConfig to have CONFIG_N as the main
> switch. For example, CONFIG_ARCH_HAS_FORTIFY_SOURCE and
> CONFIG_FORTIFY_SOURCE. Since the MSEAL_SYSTEM_MAPPINGS is placed in
> security/Kconfig, hence I'm following the existing pattern in
> security/Kconfig.

Okay, thanks.  I really think SUPPORTS is more clear, but sticking with
whatever your area uses is also fine.

...

Thanks,
Liam
Kees Cook Feb. 13, 2025, 8:11 p.m. UTC | #6
On Thu, Feb 13, 2025 at 01:29:46PM -0500, Liam R. Howlett wrote:
> * Jeff Xu <jeffxu@chromium.org> [250213 12:17]:
> > On Wed, Feb 12, 2025 at 7:05 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > >
> > ...
> > > >
> > > > In this version, we've improved the handling of system mapping sealing from
> > > > previous versions, instead of modifying the _install_special_mapping
> > > > function itself, which would affect all architectures, we now call
> > > > _install_special_mapping with a sealing flag only within the specific
> > > > architecture that requires it. This targeted approach offers two key
> > > > advantages: 1) It limits the code change's impact to the necessary
> > > > architectures, and 2) It aligns with the software architecture by keeping
> > > > the core memory management within the mm layer, while delegating the
> > > > decision of sealing system mappings to the individual architecture, which
> > > > is particularly relevant since 32-bit architectures never require sealing.
> > > >
> > > > Additionally, this patch introduces a new header,
> > > > include/linux/usrprocess.h, which contains the mseal_system_mappings()
> > > > function. This function helps the architecture determine if system
> > > > mapping is enabled within the current kernel configuration. It can be
> > > > extended in the future to handle opt-in/out prctl for enabling system
> > > > mapping sealing at the process level or a kernel cmdline feature.
> > > >
> > > > A new header file was introduced because it was difficult to find the
> > > > best location for this function. Although include/linux/mm.h was
> > > > considered, this feature is more closely related to user processes
> > > > than core memory management. Additionally, future prctl or kernel
> > > > cmd-line implementations for this feature would not fit within the
> > > > scope of core memory management or mseal.c. This is relevant because
> > > > if we add unit-tests for mseal.c in the future, we would want to limit
> > > > mseal.c's dependencies to core memory management.
> > > >
> > ...
> > > >
> > > > diff --git a/include/linux/userprocess.h b/include/linux/userprocess.h
> > > > new file mode 100644
> > > > index 000000000000..bd11e2e972c5
> > > > --- /dev/null
> > > > +++ b/include/linux/userprocess.h
> > > > @@ -0,0 +1,18 @@
> > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > +#ifndef _LINUX_USER_PROCESS_H
> > > > +#define _LINUX_USER_PROCESS_H
> > > > +#include <linux/mm.h>
> [...] 
> 
> Also, I don't agree with your stated reason for a new header.
> 
> Please remove this header until it is needed.  It can be added later if
> it is needed.  If we all had tiny headers because we might need them in
> the future, we'd have serious issues finding anything and the list of
> included headers would be huge.
> 
> You have added unnecessary changes and complications to this patch set
> on v5.

In all fairness, I think v5 is significantly less complex overall. Jeff
used his best judgement with a new header, and I don't think it's
unreasonable. That it is unwanted is fine, mm.h it is, but I think it's
clear the intent was trying to make this as least burdensome for the mm
subsystem as possible.

> > > > +/*
> > > > + * mseal of userspace process's system mappings.
> > > > + */
> > > > +static inline unsigned long mseal_system_mappings(void)
> > > > +{
> > > > +#ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS
> > > > +     return VM_SEALED;
> > > > +#else
> > > > +     return 0;
> > > > +#endif
> > > > +}
> > > > +
> > > > +#endif
> > >
> > > Looking in mm.h, there are examples of config checks which either set
> > > the bit or set it to 0.  This means you wouldn't need the function at
> > > all and the precompiler would do all the work (although with a static
> > > inline, I'm not sure it would make a big difference in instructions).
> > >
> > > For instance, you could do:
> > > #ifdef CONFIG_64BIT
> > > /* VM is sealed, in vm_flags */
> > > #define VM_SEALED       _BITUL(63)
> > >
> > > #ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS
> > > #define VM_SYSTEM_SEAL          VM_SEALED
> > > else
> > > #define VM_SYSTEM_SEAL          VM_NONE
> > > #endif
> > >
> > > #else /* CONFIG_64BIT */
> > > #define VM_SYSTEM_SEAL          VM_NONE
> > > #endif
> > >
> > > Then you can use VM_SYSTEM_SEAL in the system mappings function in the
> > > list of flags instead of having a variable + calling the static
> > > function.
> > >
> > > I didn't look close enough to see if the 32bit version is necessary.
> > >
> > I'm aware of the other pattern.
> > 
> > VM_SEALED isn't defined in 32-bit systems, and mseal.c isn't part of
> > the build. This is intentional. Any 32-bit code trying to use the
> > sealing function or the VM_SEALED flag will immediately fail
> > compilation. This makes it easier to identify incorrect usage.
> 
> So you are against using the #define because the VM_SYSTEM_SEAL will be
> defined, even though it will be VM_NONE?  This is no worse than a
> function that returns 0, and it aligns better with what we have today in
> that it can be put in the list of other flags.

When I was reading through all of this and considering the history of
its development goals, it strikes me that a function is easier for the
future if/when this can be made a boot-time setting.

If mm maintainers prefer a #define for now, that's fine of course. The
value of VM_SYSTEM_SEAL can be VM_NONE on 32-bit.

> Also, my vote for where you should put this code is clear: it should
> live with the mseal #define in mm.h.  You're going to need mm.h anyways,
> and that's a big hint as to where it should live.

Sounds good.

> > > > diff --git a/init/Kconfig b/init/Kconfig
> > > > index d0d021b3fa3b..892d2bcdf397 100644
> > > > --- a/init/Kconfig
> > > > +++ b/init/Kconfig
> > > > @@ -1882,6 +1882,24 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS
> > > >  config ARCH_HAS_MEMBARRIER_SYNC_CORE
> > > >       bool
> > > >
> > > > +config ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
> > >
> > > ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS is more clear.  HAS may mean it's
> > > supported or it could mean it's enabled. SUPPORTS is more clear that
> > > this is set if an arch supports the feature.  After all, I think HAS
> > > here means it "has support for" mseal system mappings.
> > >
> > > I see that both are used (HAS and SUPPORTS), but I'm still not sure what
> > > HAS means in other contexts without digging.
> > >
> > The existing pattern is to indicate arch support with
> > CONFIG_ARCH_HAS_N and   security/KConfig to have CONFIG_N as the main
> > switch. For example, CONFIG_ARCH_HAS_FORTIFY_SOURCE and
> > CONFIG_FORTIFY_SOURCE. Since the MSEAL_SYSTEM_MAPPINGS is placed in
> > security/Kconfig, hence I'm following the existing pattern in
> > security/Kconfig.
> 
> Okay, thanks.  I really think SUPPORTS is more clear, but sticking with
> whatever your area uses is also fine.

Yeah, I've really scratched my head over what the best practice is here.
There's a real mixture of "HAS" vs "SUPPORTS" across both hardware
architectural support, software features, interface implementations,
and compiler behavior that bridges between those. When I've looked in
the past, it seemed that "HAS" was in the majority, but I've never been
able to make sense of it.

Let me look again. Today, HAS shows:

$ git grep 'config .*_HAS_' | grep -v Documentation/ | \
  awk '{print $2}' | cut -d_ -f1 | sort | uniq -c | sort -n
      1 ARM
      1 MAC80211
      1 PGTABLE
      1 PPC
      1 RUSTC
      1 USB
      2 ARM64
      2 SIBYTE
      2 SOC
      3 FS
      3 PAHOLE
      6 SGI
      6 TOOLCHAIN
     10 ARC
     20 AS
     32 SYS
     34 CPU
     38 CC
    105 ARCH

Looking through them, it's tools (CC, AS, PAHOLE, etc), and
system/cpu/architecture prefixes, with ARCH being a clear winner.

SUPPORTS looks similar, but isn't as widely used:

$ git grep 'config .*_SUPPORTS_' | grep -v Documentation/ | \
  awk '{print $2}' | cut -d_ -f1 | sort | uniq -c | sort -n
      1 PPC64
      1 X86
      2 CLANG
      2 GCC
      2 RUSTC
      8 CPU
     38 SYS
     71 ARCH

The mips arch is using SYS, and distinguishes between HAS and SUPPORTS
in the sense of "this kernel HAS support for CPU $foo, which SUPPORTS
features x, y, z".

Looking through ARCH_SUPPORTS, it's all software features. Looking
through ARCH_HAS, it looks like a mix of hardware features
and software features. Some software features are more about
implementation availability, though (so "HAS" makes sense,
but should maybe be "IMPLEMENTS"?) e.g. ARCH_HAS_SYSCALL_WRAPPER,
ARCH_HAS_STRNLEN_USER, ARCH_HAS_FORTIFY_SOURCE, ARCH_HAS_ELF_RANDOMIZE,
ARCH_HAS_STRICT_KERNEL_RWX, ARCH_HAS_STRICT_MODULE_RWX.

So, I think I'd agree: this is about a software features and not an API
implementation nor hardware feature, so let's use ARCH_SUPPORTS_.

-Kees
Liam R. Howlett Feb. 13, 2025, 8:54 p.m. UTC | #7
* Kees Cook <kees@kernel.org> [250213 15:11]:
> On Thu, Feb 13, 2025 at 01:29:46PM -0500, Liam R. Howlett wrote:
> > * Jeff Xu <jeffxu@chromium.org> [250213 12:17]:
> > > On Wed, Feb 12, 2025 at 7:05 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > >
> > > ...
> > > > >
> > > > > In this version, we've improved the handling of system mapping sealing from
> > > > > previous versions, instead of modifying the _install_special_mapping
> > > > > function itself, which would affect all architectures, we now call
> > > > > _install_special_mapping with a sealing flag only within the specific
> > > > > architecture that requires it. This targeted approach offers two key
> > > > > advantages: 1) It limits the code change's impact to the necessary
> > > > > architectures, and 2) It aligns with the software architecture by keeping
> > > > > the core memory management within the mm layer, while delegating the
> > > > > decision of sealing system mappings to the individual architecture, which
> > > > > is particularly relevant since 32-bit architectures never require sealing.
> > > > >
> > > > > Additionally, this patch introduces a new header,
> > > > > include/linux/usrprocess.h, which contains the mseal_system_mappings()
> > > > > function. This function helps the architecture determine if system
> > > > > mapping is enabled within the current kernel configuration. It can be
> > > > > extended in the future to handle opt-in/out prctl for enabling system
> > > > > mapping sealing at the process level or a kernel cmdline feature.
> > > > >
> > > > > A new header file was introduced because it was difficult to find the
> > > > > best location for this function. Although include/linux/mm.h was
> > > > > considered, this feature is more closely related to user processes
> > > > > than core memory management. Additionally, future prctl or kernel
> > > > > cmd-line implementations for this feature would not fit within the
> > > > > scope of core memory management or mseal.c. This is relevant because
> > > > > if we add unit-tests for mseal.c in the future, we would want to limit
> > > > > mseal.c's dependencies to core memory management.
> > > > >
> > > ...
> > > > >
> > > > > diff --git a/include/linux/userprocess.h b/include/linux/userprocess.h
> > > > > new file mode 100644
> > > > > index 000000000000..bd11e2e972c5
> > > > > --- /dev/null
> > > > > +++ b/include/linux/userprocess.h
> > > > > @@ -0,0 +1,18 @@
> > > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > > +#ifndef _LINUX_USER_PROCESS_H
> > > > > +#define _LINUX_USER_PROCESS_H
> > > > > +#include <linux/mm.h>
> > [...] 
> > 
> > Also, I don't agree with your stated reason for a new header.
> > 
> > Please remove this header until it is needed.  It can be added later if
> > it is needed.  If we all had tiny headers because we might need them in
> > the future, we'd have serious issues finding anything and the list of
> > included headers would be huge.
> > 
> > You have added unnecessary changes and complications to this patch set
> > on v5.
> 
> In all fairness, I think v5 is significantly less complex overall. Jeff
> used his best judgement with a new header, and I don't think it's
> unreasonable. That it is unwanted is fine, mm.h it is, but I think it's
> clear the intent was trying to make this as least burdensome for the mm
> subsystem as possible.

This version is certainly easier to follow, but it's still more
complicated than it needs to be.  Adding custom headers seems like an
unnecessary addition to previous versions.

I didn't mean this to be a jab or upsetting - and sorry to both of you,
I can see how it could be taken this way.

I was trying to point out that this is the same sort of thing that got
Jeff into trouble with Linus about over-engineering the v8 of the mseal
patch set [1].

...

> > > 
> > > VM_SEALED isn't defined in 32-bit systems, and mseal.c isn't part of
> > > the build. This is intentional. Any 32-bit code trying to use the
> > > sealing function or the VM_SEALED flag will immediately fail
> > > compilation. This makes it easier to identify incorrect usage.
> > 
> > So you are against using the #define because the VM_SYSTEM_SEAL will be
> > defined, even though it will be VM_NONE?  This is no worse than a
> > function that returns 0, and it aligns better with what we have today in
> > that it can be put in the list of other flags.
> 
> When I was reading through all of this and considering the history of
> its development goals, it strikes me that a function is easier for the
> future if/when this can be made a boot-time setting.
> 

Reworking this change to function as a boot-time parameter, or whatever,
would not be a significant amount of work, if/when the time comes.
Since we don't know what the future holds, it it unnecessary to engineer
in a potential change for a future version when the change is easy
enough to make later and keep the code cleaner now.


> If mm maintainers prefer a #define for now, that's fine of course. The
> value of VM_SYSTEM_SEAL can be VM_NONE on 32-bit.

Thanks.  I think having a flag with VM_NONE on 32-bit is just as sane as
a "flags |= system_seal()" call that unconditionally returns 0 on
32-bit.

> 
> > Also, my vote for where you should put this code is clear: it should
> > live with the mseal #define in mm.h.  You're going to need mm.h anyways,
> > and that's a big hint as to where it should live.
> 
> Sounds good.
> 
> > > > > diff --git a/init/Kconfig b/init/Kconfig
> > > > > index d0d021b3fa3b..892d2bcdf397 100644
> > > > > --- a/init/Kconfig
> > > > > +++ b/init/Kconfig
> > > > > @@ -1882,6 +1882,24 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS
> > > > >  config ARCH_HAS_MEMBARRIER_SYNC_CORE
> > > > >       bool
> > > > >
> > > > > +config ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
> > > >
> > > > ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS is more clear.  HAS may mean it's
> > > > supported or it could mean it's enabled. SUPPORTS is more clear that
> > > > this is set if an arch supports the feature.  After all, I think HAS
> > > > here means it "has support for" mseal system mappings.
> > > >
> > > > I see that both are used (HAS and SUPPORTS), but I'm still not sure what
> > > > HAS means in other contexts without digging.
> > > >
> > > The existing pattern is to indicate arch support with
> > > CONFIG_ARCH_HAS_N and   security/KConfig to have CONFIG_N as the main
> > > switch. For example, CONFIG_ARCH_HAS_FORTIFY_SOURCE and
> > > CONFIG_FORTIFY_SOURCE. Since the MSEAL_SYSTEM_MAPPINGS is placed in
> > > security/Kconfig, hence I'm following the existing pattern in
> > > security/Kconfig.
> > 
> > Okay, thanks.  I really think SUPPORTS is more clear, but sticking with
> > whatever your area uses is also fine.
> 
> Yeah, I've really scratched my head over what the best practice is here.
> There's a real mixture of "HAS" vs "SUPPORTS" across both hardware
> architectural support, software features, interface implementations,
> and compiler behavior that bridges between those. When I've looked in
> the past, it seemed that "HAS" was in the majority, but I've never been
> able to make sense of it.
> 
> Let me look again. Today, HAS shows:
> 
> $ git grep 'config .*_HAS_' | grep -v Documentation/ | \
>   awk '{print $2}' | cut -d_ -f1 | sort | uniq -c | sort -n
>       1 ARM
>       1 MAC80211
>       1 PGTABLE
>       1 PPC
>       1 RUSTC
>       1 USB
>       2 ARM64
>       2 SIBYTE
>       2 SOC
>       3 FS
>       3 PAHOLE
>       6 SGI
>       6 TOOLCHAIN
>      10 ARC
>      20 AS
>      32 SYS
>      34 CPU
>      38 CC
>     105 ARCH
> 
> Looking through them, it's tools (CC, AS, PAHOLE, etc), and
> system/cpu/architecture prefixes, with ARCH being a clear winner.
> 
> SUPPORTS looks similar, but isn't as widely used:
> 
> $ git grep 'config .*_SUPPORTS_' | grep -v Documentation/ | \
>   awk '{print $2}' | cut -d_ -f1 | sort | uniq -c | sort -n
>       1 PPC64
>       1 X86
>       2 CLANG
>       2 GCC
>       2 RUSTC
>       8 CPU
>      38 SYS
>      71 ARCH
> 
> The mips arch is using SYS, and distinguishes between HAS and SUPPORTS
> in the sense of "this kernel HAS support for CPU $foo, which SUPPORTS
> features x, y, z".
> 
> Looking through ARCH_SUPPORTS, it's all software features. Looking
> through ARCH_HAS, it looks like a mix of hardware features
> and software features. Some software features are more about
> implementation availability, though (so "HAS" makes sense,
> but should maybe be "IMPLEMENTS"?) e.g. ARCH_HAS_SYSCALL_WRAPPER,
> ARCH_HAS_STRNLEN_USER, ARCH_HAS_FORTIFY_SOURCE, ARCH_HAS_ELF_RANDOMIZE,
> ARCH_HAS_STRICT_KERNEL_RWX, ARCH_HAS_STRICT_MODULE_RWX.
> 
> So, I think I'd agree: this is about a software features and not an API
> implementation nor hardware feature, so let's use ARCH_SUPPORTS_.

This makes sense.  I grepped for both as well and found that HAS
outnumbers SUPPORTS, but there are a lot of each.  I figured using
whatever the subsystem uses is probably the right answer - your answer
is better.

Seeing them in the same kconfig option together gave me pause as to why
they were different.  Thanks for digging into this!


Regards,
Liam

[1] https://lore.kernel.org/linux-mm/CAHk-=wjGGgfAoiEdPqLdib7VvQgG7uVXpTPzJ9jTW0HesRpPwQ@mail.gmail.com/
Jeff Xu Feb. 13, 2025, 10 p.m. UTC | #8
On Thu, Feb 13, 2025 at 12:54 PM Liam R. Howlett
<Liam.Howlett@oracle.com> wrote:

> > > >
> > > > VM_SEALED isn't defined in 32-bit systems, and mseal.c isn't part of
> > > > the build. This is intentional. Any 32-bit code trying to use the
> > > > sealing function or the VM_SEALED flag will immediately fail
> > > > compilation. This makes it easier to identify incorrect usage.
> > >
> > > So you are against using the #define because the VM_SYSTEM_SEAL will be
> > > defined, even though it will be VM_NONE?  This is no worse than a
> > > function that returns 0, and it aligns better with what we have today in
> > > that it can be put in the list of other flags.
> >
> > When I was reading through all of this and considering the history of
> > its development goals, it strikes me that a function is easier for the
> > future if/when this can be made a boot-time setting.
> >
>
> Reworking this change to function as a boot-time parameter, or whatever,
> would not be a significant amount of work, if/when the time comes.
> Since we don't know what the future holds, it it unnecessary to engineer
> in a potential change for a future version when the change is easy
> enough to make later and keep the code cleaner now.
>
Sure, I will put the function in mm.h for this patch. We can find a
proper place when it is time to implement the boot-time parameter
change.

The call stack for sealing system mapping is something like below:

install_special_mapping (mm/mmap.c)
map_vdso (arch/x86/entry/vdso/vma.c)
load_elf_binary (fs/binfmt_elf.c)
load_misc_binary (fs/binfmt_misc.c)
bprm_execve (fs/exec.c)
do_execveat_common
__x64_sys_execve
do_syscall_64

IMO, there's a clear divide between the API implementer and the API user.
mm and mm.h are the providers, offering the core mm functionality
through APIs/data structures like install_special_mapping().

The exe layer (bprm_execve, map_vdso, etc)  is the consumer of the
install_special_mapping.
The logic related to checking if sealing system mapping is enabled
belongs to the exe layer.

>
> > If mm maintainers prefer a #define for now, that's fine of course. The
> > value of VM_SYSTEM_SEAL can be VM_NONE on 32-bit.
>
> Thanks.  I think having a flag with VM_NONE on 32-bit is just as sane as
> a "flags |= system_seal()" call that unconditionally returns 0 on
> 32-bit.
>
Consider the case below in src/third_party/kernel/v6.6/fs/proc/task_mmu.c,

#ifdef CONFIG_64BIT
[ilog2(VM_SEALED)] = "sl",
#endif

If #ifdef CONFIG_64BIT is missing, it won't be detected during compile time.

Setting VM_SEALED to VM_NONE could simplify the coding in some cases
(get/set case), but it might make other cases error prone.

I would prefer to not have VM_SEALED for 32 bit.


-Jeff
Liam R. Howlett Feb. 14, 2025, 12:14 a.m. UTC | #9
* Jeff Xu <jeffxu@chromium.org> [250213 17:00]:
> On Thu, Feb 13, 2025 at 12:54 PM Liam R. Howlett
> <Liam.Howlett@oracle.com> wrote:
> 
> > > > >
> > > > > VM_SEALED isn't defined in 32-bit systems, and mseal.c isn't part of
> > > > > the build. This is intentional. Any 32-bit code trying to use the
> > > > > sealing function or the VM_SEALED flag will immediately fail
> > > > > compilation. This makes it easier to identify incorrect usage.
> > > >
> > > > So you are against using the #define because the VM_SYSTEM_SEAL will be
> > > > defined, even though it will be VM_NONE?  This is no worse than a
> > > > function that returns 0, and it aligns better with what we have today in
> > > > that it can be put in the list of other flags.
> > >
> > > When I was reading through all of this and considering the history of
> > > its development goals, it strikes me that a function is easier for the
> > > future if/when this can be made a boot-time setting.
> > >
> >
> > Reworking this change to function as a boot-time parameter, or whatever,
> > would not be a significant amount of work, if/when the time comes.
> > Since we don't know what the future holds, it it unnecessary to engineer
> > in a potential change for a future version when the change is easy
> > enough to make later and keep the code cleaner now.
> >
> Sure, I will put the function in mm.h for this patch. We can find a
> proper place when it is time to implement the boot-time parameter
> change.
> 
> The call stack for sealing system mapping is something like below:
> 
> install_special_mapping (mm/mmap.c)
> map_vdso (arch/x86/entry/vdso/vma.c)
> load_elf_binary (fs/binfmt_elf.c)
> load_misc_binary (fs/binfmt_misc.c)
> bprm_execve (fs/exec.c)
> do_execveat_common
> __x64_sys_execve
> do_syscall_64
> 
> IMO, there's a clear divide between the API implementer and the API user.
> mm and mm.h are the providers, offering the core mm functionality
> through APIs/data structures like install_special_mapping().
> 
> The exe layer (bprm_execve, map_vdso, etc)  is the consumer of the
> install_special_mapping.
> The logic related to checking if sealing system mapping is enabled
> belongs to the exe layer.

Since this is an all or nothing enabling, there is no reason to have
each caller check the same thing and do the same action. You should put
the logic into the provider - they all end up doing the same thing.

Also, this is a compile time option so it doesn't even need to be
checked on execution - just apply it in the first place, at the source.
Your static inline function was already doing this...?

I'm confused as to what you are arguing here because it goes against
what you had and what I suggested.  The alternative you are suggesting
is more code, more instructions, and the best outcome is the same
result.

> 
> >
> > > If mm maintainers prefer a #define for now, that's fine of course. The
> > > value of VM_SYSTEM_SEAL can be VM_NONE on 32-bit.
> >
> > Thanks.  I think having a flag with VM_NONE on 32-bit is just as sane as
> > a "flags |= system_seal()" call that unconditionally returns 0 on
> > 32-bit.
> >
> Consider the case below in src/third_party/kernel/v6.6/fs/proc/task_mmu.c,
> 
> #ifdef CONFIG_64BIT
> [ilog2(VM_SEALED)] = "sl",
> #endif
> 
> If #ifdef CONFIG_64BIT is missing, it won't be detected during compile time.
> 
> Setting VM_SEALED to VM_NONE could simplify the coding in some cases
> (get/set case), but it might make other cases error prone.
> 
> I would prefer to not have VM_SEALED for 32 bit.

But what I posted leaves VM_SEALED undefined for 32 bit, it defines
VM_SYSTEM_SEALED which can be placed, for instance, into
_install_special_mapping() arguments directly.  Reducing the change to
just adding a new flag to the list.  And it's applied to all system
mappings based on if it's enabled on compile or not.

Also:
include/linux/mm.h:#define VM_NONE              0x00000000
so, I'm not sure what evaluation you are concerned about?  It would be
defined as 0.

Thanks,
Liam
Liam R. Howlett Feb. 14, 2025, 1:10 a.m. UTC | #10
* Liam R. Howlett <Liam.Howlett@oracle.com> [250213 19:14]:
> * Jeff Xu <jeffxu@chromium.org> [250213 17:00]:
> > On Thu, Feb 13, 2025 at 12:54 PM Liam R. Howlett
> > <Liam.Howlett@oracle.com> wrote:
> > 
> > > > > >
> > > > > > VM_SEALED isn't defined in 32-bit systems, and mseal.c isn't part of
> > > > > > the build. This is intentional. Any 32-bit code trying to use the
> > > > > > sealing function or the VM_SEALED flag will immediately fail
> > > > > > compilation. This makes it easier to identify incorrect usage.
> > > > >
> > > > > So you are against using the #define because the VM_SYSTEM_SEAL will be
> > > > > defined, even though it will be VM_NONE?  This is no worse than a
> > > > > function that returns 0, and it aligns better with what we have today in
> > > > > that it can be put in the list of other flags.
> > > >
> > > > When I was reading through all of this and considering the history of
> > > > its development goals, it strikes me that a function is easier for the
> > > > future if/when this can be made a boot-time setting.
> > > >
> > >
> > > Reworking this change to function as a boot-time parameter, or whatever,
> > > would not be a significant amount of work, if/when the time comes.
> > > Since we don't know what the future holds, it it unnecessary to engineer
> > > in a potential change for a future version when the change is easy
> > > enough to make later and keep the code cleaner now.
> > >
> > Sure, I will put the function in mm.h for this patch. We can find a
> > proper place when it is time to implement the boot-time parameter
> > change.
> > 
> > The call stack for sealing system mapping is something like below:
> > 
> > install_special_mapping (mm/mmap.c)
> > map_vdso (arch/x86/entry/vdso/vma.c)
> > load_elf_binary (fs/binfmt_elf.c)
> > load_misc_binary (fs/binfmt_misc.c)
> > bprm_execve (fs/exec.c)
> > do_execveat_common
> > __x64_sys_execve
> > do_syscall_64
> > 
> > IMO, there's a clear divide between the API implementer and the API user.
> > mm and mm.h are the providers, offering the core mm functionality
> > through APIs/data structures like install_special_mapping().
> > 
> > The exe layer (bprm_execve, map_vdso, etc)  is the consumer of the
> > install_special_mapping.
> > The logic related to checking if sealing system mapping is enabled
> > belongs to the exe layer.
> 
> Since this is an all or nothing enabling, there is no reason to have
> each caller check the same thing and do the same action. You should put
> the logic into the provider - they all end up doing the same thing.
> 
> Also, this is a compile time option so it doesn't even need to be
> checked on execution - just apply it in the first place, at the source.
> Your static inline function was already doing this...?
> 
> I'm confused as to what you are arguing here because it goes against
> what you had and what I suggested.  The alternative you are suggesting
> is more code, more instructions, and the best outcome is the same
> result.

I think I understand what you are saying now: the interface to
install_special_mapping() needs to take the vma flag, as it does today.
What I don't understand is that what you proposed and what I proposed
both do that.

What I'm saying is that, since system mappings are enabled, we can
already know implicitly by having VM_SYSTEM_SEAL either VM_NONE or
VM_SEAL.

Turning this:

@@ -264,11 +266,12 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
 	/*
 	 * MAYWRITE to allow gdb to COW and set breakpoints
 	 */
+	vm_flags = VM_READ|VM_EXEC|VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC;
+	vm_flags |= mseal_system_mappings();
 	vma = _install_special_mapping(mm,
 				       text_start,
 				       image->size,
-				       VM_READ|VM_EXEC|
-				       VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
+				       vm_flags,
 				       &vdso_mapping);

to this:

 	/*
 	 * MAYWRITE to allow gdb to COW and set breakpoints
 	 */
 	vma = _install_special_mapping(mm,
 				       text_start,
 				       image->size,
				       VM_READ|VM_EXEC|
-				       VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
+				       VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC|
+				       VM_SYSTEM_SEAL,
 				       &vdso_mapping);

No unsigned long vm_flags needed.  It's easier to read and I don't think
it's any more hidden than the vm_flags |= function call option.

> 
> > 
> > >
> > > > If mm maintainers prefer a #define for now, that's fine of course. The
> > > > value of VM_SYSTEM_SEAL can be VM_NONE on 32-bit.
> > >
> > > Thanks.  I think having a flag with VM_NONE on 32-bit is just as sane as
> > > a "flags |= system_seal()" call that unconditionally returns 0 on
> > > 32-bit.
> > >
> > Consider the case below in src/third_party/kernel/v6.6/fs/proc/task_mmu.c,
> > 
> > #ifdef CONFIG_64BIT
> > [ilog2(VM_SEALED)] = "sl",
> > #endif
> > 
> > If #ifdef CONFIG_64BIT is missing, it won't be detected during compile time.
> > 
> > Setting VM_SEALED to VM_NONE could simplify the coding in some cases
> > (get/set case), but it might make other cases error prone.
> > 
> > I would prefer to not have VM_SEALED for 32 bit.
> 
> But what I posted leaves VM_SEALED undefined for 32 bit, it defines
> VM_SYSTEM_SEALED which can be placed, for instance, into
> _install_special_mapping() arguments directly.  Reducing the change to
> just adding a new flag to the list.  And it's applied to all system
> mappings based on if it's enabled on compile or not.
> 
> Also:
> include/linux/mm.h:#define VM_NONE              0x00000000
> so, I'm not sure what evaluation you are concerned about?  It would be
> defined as 0.
> 
> Thanks,
> Liam
> 
>
Jeff Xu Feb. 14, 2025, 2:39 p.m. UTC | #11
On Thu, Feb 13, 2025 at 5:10 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Liam R. Howlett <Liam.Howlett@oracle.com> [250213 19:14]:
> > * Jeff Xu <jeffxu@chromium.org> [250213 17:00]:
> > > On Thu, Feb 13, 2025 at 12:54 PM Liam R. Howlett
> > > <Liam.Howlett@oracle.com> wrote:
> > >
> > > > > > >
> > > > > > > VM_SEALED isn't defined in 32-bit systems, and mseal.c isn't part of
> > > > > > > the build. This is intentional. Any 32-bit code trying to use the
> > > > > > > sealing function or the VM_SEALED flag will immediately fail
> > > > > > > compilation. This makes it easier to identify incorrect usage.
> > > > > >
> > > > > > So you are against using the #define because the VM_SYSTEM_SEAL will be
> > > > > > defined, even though it will be VM_NONE?  This is no worse than a
> > > > > > function that returns 0, and it aligns better with what we have today in
> > > > > > that it can be put in the list of other flags.
> > > > >
> > > > > When I was reading through all of this and considering the history of
> > > > > its development goals, it strikes me that a function is easier for the
> > > > > future if/when this can be made a boot-time setting.
> > > > >
> > > >
> > > > Reworking this change to function as a boot-time parameter, or whatever,
> > > > would not be a significant amount of work, if/when the time comes.
> > > > Since we don't know what the future holds, it it unnecessary to engineer
> > > > in a potential change for a future version when the change is easy
> > > > enough to make later and keep the code cleaner now.
> > > >
> > > Sure, I will put the function in mm.h for this patch. We can find a
> > > proper place when it is time to implement the boot-time parameter
> > > change.
> > >
> > > The call stack for sealing system mapping is something like below:
> > >
> > > install_special_mapping (mm/mmap.c)
> > > map_vdso (arch/x86/entry/vdso/vma.c)
> > > load_elf_binary (fs/binfmt_elf.c)
> > > load_misc_binary (fs/binfmt_misc.c)
> > > bprm_execve (fs/exec.c)
> > > do_execveat_common
> > > __x64_sys_execve
> > > do_syscall_64
> > >
> > > IMO, there's a clear divide between the API implementer and the API user.
> > > mm and mm.h are the providers, offering the core mm functionality
> > > through APIs/data structures like install_special_mapping().
> > >
> > > The exe layer (bprm_execve, map_vdso, etc)  is the consumer of the
> > > install_special_mapping.
> > > The logic related to checking if sealing system mapping is enabled
> > > belongs to the exe layer.
> >
> > Since this is an all or nothing enabling, there is no reason to have
> > each caller check the same thing and do the same action. You should put
> > the logic into the provider - they all end up doing the same thing.
> >
> > Also, this is a compile time option so it doesn't even need to be
> > checked on execution - just apply it in the first place, at the source.
> > Your static inline function was already doing this...?
> >
> > I'm confused as to what you are arguing here because it goes against
> > what you had and what I suggested.  The alternative you are suggesting
> > is more code, more instructions, and the best outcome is the same
> > result.
>
> I think I understand what you are saying now: the interface to
> install_special_mapping() needs to take the vma flag, as it does today.
> What I don't understand is that what you proposed and what I proposed
> both do that.
>
> What I'm saying is that, since system mappings are enabled, we can
> already know implicitly by having VM_SYSTEM_SEAL either VM_NONE or
> VM_SEAL.
>
> Turning this:
>
> @@ -264,11 +266,12 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr)
>         /*
>          * MAYWRITE to allow gdb to COW and set breakpoints
>          */
> +       vm_flags = VM_READ|VM_EXEC|VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC;
> +       vm_flags |= mseal_system_mappings();
>         vma = _install_special_mapping(mm,
>                                        text_start,
>                                        image->size,
> -                                      VM_READ|VM_EXEC|
> -                                      VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> +                                      vm_flags,
>                                        &vdso_mapping);
>
> to this:
>
>         /*
>          * MAYWRITE to allow gdb to COW and set breakpoints
>          */
>         vma = _install_special_mapping(mm,
>                                        text_start,
>                                        image->size,
>                                        VM_READ|VM_EXEC|
> -                                      VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
> +                                      VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC|
> +                                      VM_SYSTEM_SEAL,
>                                        &vdso_mapping);
>
> No unsigned long vm_flags needed.  It's easier to read and I don't think
> it's any more hidden than the vm_flags |= function call option.
>
The arch code needs a mseal_system_mappings() function. Otherwise,
I'll have to change this line (in arch) again when I implement the kernel
command line or pre-process opt-in/opt-out, which requires a function
call. This isn't overthinking; based on our discussion so far, there
are clear needs for a subsequent patch series. This patch is just the
first step.

For software layering, I'd like to see a clear separation between
layers. mm implements _install_special_mapping, which accepts any
combination of vm_flags as input.  Then I'd like the caller (in arch
code) to have all the necessary code to compose the vm_flags in one
place. This helps readability. In the past, we discussed merging the
vdso/vvar code across all architectures. When that happens, the new
code in arch will likely have its own .c and .h files, but it will still sit
above mm.  That means mm won't need to change, and the
_install_special_mapping API from mm will remain unchanged.

The mseal_system_mappings() function can already return VM_SEALED,
and future patches will add more logic into mseal_system_mappings(), e.g.
check for kernel cmdline or opt-in/opt-out. we don't need a separate
VM_SYSTEM_SEAL, which is a *** redirect macro ***, I prefer to keep
all the important code logic relevant to configuration of enabling system
mapping sealing in one place, for easy reading.

mseal_system_mappings() can be placed in mm.h in this patch, as you
suggested. But in the near future, it will be moved out of mm.h and find
a right header. The functionality belongs to exe namespace, because of
the reasons I put in the commit message and discussions.

Thanks
-Jeff

-Jeff
Lorenzo Stoakes Feb. 14, 2025, 2:59 p.m. UTC | #12
On Fri, Feb 14, 2025 at 06:39:48AM -0800, Jeff Xu wrote:
> mseal_system_mappings() can be placed in mm.h in this patch, as you
> suggested. But in the near future, it will be moved out of mm.h and find
> a right header. The functionality belongs to exe namespace, because of
> the reasons I put in the commit message and discussions.

With respect Jeff - and I feel that this might simply be a miscommunciation
here - but this doesn't read wonderfully. 'can be placed' 'it will be moved
out', etc.

Please try to be respectful of experienced maintainers who are taking their
time to review your code, and respond politely and respectfully. I think
what you meant to say is something more like:

"I'm more than happy to do that, but I feel that it would be more suited in
a separate header, as this strictly belongs to the kernel functionality
surrounding the execution of code. However we can revisit this at a later
time!"

My feeling is that this is exactly what you mean, but you are just
essentially abbreviating this. However it reads rather rudely, which I'm
sure you don't intend.

Ultimately the fact of the matter is that your series will be merged when
it reaches the standards required of you by the relevant maintainers, as is
the case will all code submitted to the kernel when we reach consensus.

In this series you have addressed a great number of concerns which has
brought the merging of it very much closer, so I hope we can continue in
the same vein and reach this consensus soon.

Let's try to avoid any miscommunication which might delay us reaching this
aim!

Thanks, Lorenzo

>
> Thanks
> -Jeff
>
> -Jeff
Jeff Xu Feb. 14, 2025, 3:18 p.m. UTC | #13
On Fri, Feb 14, 2025 at 7:00 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Fri, Feb 14, 2025 at 06:39:48AM -0800, Jeff Xu wrote:
> > mseal_system_mappings() can be placed in mm.h in this patch, as you
> > suggested. But in the near future, it will be moved out of mm.h and find
> > a right header. The functionality belongs to exe namespace, because of
> > the reasons I put in the commit message and discussions.
>
> With respect Jeff - and I feel that this might simply be a miscommunciation
> here - but this doesn't read wonderfully. 'can be placed' 'it will be moved
> out', etc.
>
> Please try to be respectful of experienced maintainers who are taking their
> time to review your code, and respond politely and respectfully. I think
> what you meant to say is something more like:
>
> "I'm more than happy to do that, but I feel that it would be more suited in
> a separate header, as this strictly belongs to the kernel functionality
> surrounding the execution of code. However we can revisit this at a later
> time!"
>
Thanks!
I apologize if my expression appears to be rude. Your revision
reflects what I'm trying to say, in better english.

> My feeling is that this is exactly what you mean, but you are just
> essentially abbreviating this. However it reads rather rudely, which I'm
> sure you don't intend.
>
> Ultimately the fact of the matter is that your series will be merged when
> it reaches the standards required of you by the relevant maintainers, as is
> the case will all code submitted to the kernel when we reach consensus.
>
> In this series you have addressed a great number of concerns which has
> brought the merging of it very much closer, so I hope we can continue in
> the same vein and reach this consensus soon.
>
> Let's try to avoid any miscommunication which might delay us reaching this
> aim!
>
Thanks !
-Jeff

> Thanks, Lorenzo
>
> >
> > Thanks
> > -Jeff
> >
> > -Jeff
diff mbox series

Patch

diff --git a/include/linux/userprocess.h b/include/linux/userprocess.h
new file mode 100644
index 000000000000..bd11e2e972c5
--- /dev/null
+++ b/include/linux/userprocess.h
@@ -0,0 +1,18 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_USER_PROCESS_H
+#define _LINUX_USER_PROCESS_H
+#include <linux/mm.h>
+
+/*
+ * mseal of userspace process's system mappings.
+ */
+static inline unsigned long mseal_system_mappings(void)
+{
+#ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS
+	return VM_SEALED;
+#else
+	return 0;
+#endif
+}
+
+#endif
diff --git a/init/Kconfig b/init/Kconfig
index d0d021b3fa3b..892d2bcdf397 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1882,6 +1882,24 @@  config ARCH_HAS_MEMBARRIER_CALLBACKS
 config ARCH_HAS_MEMBARRIER_SYNC_CORE
 	bool
 
+config ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
+	bool
+	help
+	  Control MSEAL_SYSTEM_MAPPINGS access based on architecture.
+
+	  A 64-bit kernel is required for the memory sealing feature.
+	  No specific hardware features from the CPU are needed.
+
+	  To enable this feature, the architecture needs to update their
+	  speical mappings calls to include the sealing flag and confirm
+	  that it doesn't unmap/remap system mappings during the life
+	  time of the process. After the architecture enables this, a
+	  distribution can set CONFIG_MSEAL_SYSTEM_MAPPING to manage access
+	  to the feature.
+
+	  For complete descriptions of memory sealing, please see
+	  Documentation/userspace-api/mseal.rst
+
 config HAVE_PERF_EVENTS
 	bool
 	help
diff --git a/security/Kconfig b/security/Kconfig
index f10dbf15c294..bfb35fc7a3c6 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -51,6 +51,24 @@  config PROC_MEM_NO_FORCE
 
 endchoice
 
+config MSEAL_SYSTEM_MAPPINGS
+	bool "mseal system mappings"
+	depends on 64BIT
+	depends on ARCH_HAS_MSEAL_SYSTEM_MAPPINGS
+	depends on !CHECKPOINT_RESTORE
+	help
+	  Seal system mappings such as vdso, vvar, sigpage, uprobes, etc.
+
+	  A 64-bit kernel is required for the memory sealing feature.
+	  No specific hardware features from the CPU are needed.
+
+	  Note: CHECKPOINT_RESTORE, UML, gVisor are known to relocate or
+	  unmap system mapping, therefore this config can't be enabled
+	  universally.
+
+	  For complete descriptions of memory sealing, please see
+	  Documentation/userspace-api/mseal.rst
+
 config SECURITY
 	bool "Enable different security models"
 	depends on SYSFS