Message ID | 20250212032155.1276806-2-jeffxu@google.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | mseal system mappings | expand |
On 2/11/25 7:21 PM, jeffxu@chromium.org wrote: > From: Jeff Xu <jeffxu@chromium.org> > > --- > include/linux/userprocess.h | 18 ++++++++++++++++++ > init/Kconfig | 18 ++++++++++++++++++ > security/Kconfig | 18 ++++++++++++++++++ > 3 files changed, 54 insertions(+) > create mode 100644 include/linux/userprocess.h > > diff --git a/init/Kconfig b/init/Kconfig > index d0d021b3fa3b..892d2bcdf397 100644 > --- a/init/Kconfig > +++ b/init/Kconfig > @@ -1882,6 +1882,24 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS > config ARCH_HAS_MEMBARRIER_SYNC_CORE > bool > > +config ARCH_HAS_MSEAL_SYSTEM_MAPPINGS > + bool > + help > + Control MSEAL_SYSTEM_MAPPINGS access based on architecture. > + > + A 64-bit kernel is required for the memory sealing feature. > + No specific hardware features from the CPU are needed. > + > + To enable this feature, the architecture needs to update their > + speical mappings calls to include the sealing flag and confirm special > + that it doesn't unmap/remap system mappings during the life > + time of the process. After the architecture enables this, a > + distribution can set CONFIG_MSEAL_SYSTEM_MAPPING to manage access > + to the feature. > + > + For complete descriptions of memory sealing, please see > + Documentation/userspace-api/mseal.rst > + > config HAVE_PERF_EVENTS > bool > help
On Tue, Feb 11, 2025 at 7:31 PM Randy Dunlap <rdunlap@infradead.org> wrote: > > > > On 2/11/25 7:21 PM, jeffxu@chromium.org wrote: > > From: Jeff Xu <jeffxu@chromium.org> > > > > > --- > > include/linux/userprocess.h | 18 ++++++++++++++++++ > > init/Kconfig | 18 ++++++++++++++++++ > > security/Kconfig | 18 ++++++++++++++++++ > > 3 files changed, 54 insertions(+) > > create mode 100644 include/linux/userprocess.h > > > > > diff --git a/init/Kconfig b/init/Kconfig > > index d0d021b3fa3b..892d2bcdf397 100644 > > --- a/init/Kconfig > > +++ b/init/Kconfig > > @@ -1882,6 +1882,24 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS > > config ARCH_HAS_MEMBARRIER_SYNC_CORE > > bool > > > > +config ARCH_HAS_MSEAL_SYSTEM_MAPPINGS > > + bool > > + help > > + Control MSEAL_SYSTEM_MAPPINGS access based on architecture. > > + > > + A 64-bit kernel is required for the memory sealing feature. > > + No specific hardware features from the CPU are needed. > > + > > + To enable this feature, the architecture needs to update their > > + speical mappings calls to include the sealing flag and confirm > > special > Ack, will fix. Thanks ! -Jeff > > + that it doesn't unmap/remap system mappings during the life > > + time of the process. After the architecture enables this, a > > + distribution can set CONFIG_MSEAL_SYSTEM_MAPPING to manage access > > + to the feature. > > + > > + For complete descriptions of memory sealing, please see > > + Documentation/userspace-api/mseal.rst > > + > > config HAVE_PERF_EVENTS > > bool > > help > > > -- > ~Randy >
* jeffxu@chromium.org <jeffxu@chromium.org> [250211 22:22]: > From: Jeff Xu <jeffxu@chromium.org> > > Provide infrastructure to mseal system mappings. Establish > two kernel configs (CONFIG_MSEAL_SYSTEM_MAPPINGS, > ARCH_HAS_MSEAL_SYSTEM_MAPPINGS) and a header file (userprocess.h) > for future patches. > > As discussed during mseal() upstream process [1], mseal() protects > the VMAs of a given virtual memory range against modifications, such > as the read/write (RW) and no-execute (NX) bits. For complete > descriptions of memory sealing, please see mseal.rst [2]. > > The mseal() is useful to mitigate memory corruption issues where a > corrupted pointer is passed to a memory management system. For > example, such an attacker primitive can break control-flow integrity > guarantees since read-only memory that is supposed to be trusted can > become writable or .text pages can get remapped. > > The system mappings are readonly only, memory sealing can protect > them from ever changing to writable or unmmap/remapped as different > attributes. > > System mappings such as vdso, vvar, and sigpage (arm), vectors (arm) > are created by the kernel during program initialization, and could > be sealed after creation. > > Unlike the aforementioned mappings, the uprobe mapping is not > established during program startup. However, its lifetime is the same > as the process's lifetime [3]. It could be sealed from creation. > > The vsyscall on x86-64 uses a special address (0xffffffffff600000), > which is outside the mm managed range. This means mprotect, munmap, and > mremap won't work on the vsyscall. Since sealing doesn't enhance > the vsyscall's security, it is skipped in this patch. If we ever seal > the vsyscall, it is probably only for decorative purpose, i.e. showing > the 'sl' flag in the /proc/pid/smaps. For this patch, it is ignored. > > It is important to note that the CHECKPOINT_RESTORE feature (CRIU) may > alter the system mappings during restore operations. UML(User Mode Linux) > and gVisor are also known to change the vdso/vvar mappings. Consequently, > this feature cannot be universally enabled across all systems. As such, > CONFIG_MSEAL_SYSTEM_MAPPINGS is disabled by default. > > To support mseal of system mappings, architectures must define > CONFIG_ARCH_HAS_MSEAL_SYSTEM_MAPPINGS and update their special mappings > calls to pass mseal flag. Additionally, architectures must confirm they > do not unmap/remap system mappings during the process lifetime. > > In this version, we've improved the handling of system mapping sealing from > previous versions, instead of modifying the _install_special_mapping > function itself, which would affect all architectures, we now call > _install_special_mapping with a sealing flag only within the specific > architecture that requires it. This targeted approach offers two key > advantages: 1) It limits the code change's impact to the necessary > architectures, and 2) It aligns with the software architecture by keeping > the core memory management within the mm layer, while delegating the > decision of sealing system mappings to the individual architecture, which > is particularly relevant since 32-bit architectures never require sealing. > > Additionally, this patch introduces a new header, > include/linux/usrprocess.h, which contains the mseal_system_mappings() > function. This function helps the architecture determine if system > mapping is enabled within the current kernel configuration. It can be > extended in the future to handle opt-in/out prctl for enabling system > mapping sealing at the process level or a kernel cmdline feature. > > A new header file was introduced because it was difficult to find the > best location for this function. Although include/linux/mm.h was > considered, this feature is more closely related to user processes > than core memory management. Additionally, future prctl or kernel > cmd-line implementations for this feature would not fit within the > scope of core memory management or mseal.c. This is relevant because > if we add unit-tests for mseal.c in the future, we would want to limit > mseal.c's dependencies to core memory management. > > Prior to this patch series, we explored sealing special mappings from > userspace using glibc's dynamic linker. This approach revealed several > issues: > - The PT_LOAD header may report an incorrect length for vdso, (smaller > than its actual size). The dynamic linker, which relies on PT_LOAD > information to determine mapping size, would then split and partially > seal the vdso mapping. Since each architecture has its own vdso/vvar > code, fixing this in the kernel would require going through each > archiecture. Our initial goal was to enable sealing readonly mappings, > e.g. .text, across all architectures, sealing vdso from kernel since > creation appears to be simpler than sealing vdso at glibc. > - The [vvar] mapping header only contains address information, not length > information. Similar issues might exist for other special mappings. > - Mappings like uprobe are not covered by the dynamic linker, > and there is no effective solution for them. > > This feature's security enhancements will benefit ChromeOS, Android, > and other high security systems. > > Testing: > This feature was tested on ChromeOS and Android for both x86-64 and ARM64. > - Enable sealing and verify vdso/vvar, sigpage, vector are sealed properly, > i.e. "sl" shown in the smaps for those mappings, and mremap is blocked. > - Passing various automation tests (e.g. pre-checkin) on ChromeOS and > Android to ensure the sealing doesn't affect the functionality of > Chromebook and Android phone. > > I also tested the feature on Ubuntu on x86-64: > - With config disabled, vdso/vvar is not sealed, > - with config enabled, vdso/vvar is sealed, and booting up Ubuntu is OK, > normal operations such as browsing the web, open/edit doc are OK. > > In addition, Benjamin Berg tested this on UML. > > Link: https://lore.kernel.org/all/20240415163527.626541-1-jeffxu@chromium.org/ [1] > Link: Documentation/userspace-api/mseal.rst [2] > Link: https://lore.kernel.org/all/CABi2SkU9BRUnqf70-nksuMCQ+yyiWjo3fM4XkRkL-NrCZxYAyg@mail.gmail.com/ [3] I found it odd that the history and V4 links were not here, but I realised that was in 0/7 instead. > > Signed-off-by: Jeff Xu <jeffxu@chromium.org> > --- > include/linux/userprocess.h | 18 ++++++++++++++++++ > init/Kconfig | 18 ++++++++++++++++++ > security/Kconfig | 18 ++++++++++++++++++ > 3 files changed, 54 insertions(+) > create mode 100644 include/linux/userprocess.h > > diff --git a/include/linux/userprocess.h b/include/linux/userprocess.h > new file mode 100644 > index 000000000000..bd11e2e972c5 > --- /dev/null > +++ b/include/linux/userprocess.h > @@ -0,0 +1,18 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +#ifndef _LINUX_USER_PROCESS_H > +#define _LINUX_USER_PROCESS_H > +#include <linux/mm.h> Why is this in a new file and not mm.h directly? Anyone who needs to use this will pull in mm.h anyways, and probably already has mm.h included. > + > +/* > + * mseal of userspace process's system mappings. > + */ > +static inline unsigned long mseal_system_mappings(void) > +{ > +#ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS > + return VM_SEALED; > +#else > + return 0; > +#endif > +} > + > +#endif Looking in mm.h, there are examples of config checks which either set the bit or set it to 0. This means you wouldn't need the function at all and the precompiler would do all the work (although with a static inline, I'm not sure it would make a big difference in instructions). For instance, you could do: #ifdef CONFIG_64BIT /* VM is sealed, in vm_flags */ #define VM_SEALED _BITUL(63) #ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS #define VM_SYSTEM_SEAL VM_SEALED else #define VM_SYSTEM_SEAL VM_NONE #endif #else /* CONFIG_64BIT */ #define VM_SYSTEM_SEAL VM_NONE #endif Then you can use VM_SYSTEM_SEAL in the system mappings function in the list of flags instead of having a variable + calling the static function. I didn't look close enough to see if the 32bit version is necessary. > diff --git a/init/Kconfig b/init/Kconfig > index d0d021b3fa3b..892d2bcdf397 100644 > --- a/init/Kconfig > +++ b/init/Kconfig > @@ -1882,6 +1882,24 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS > config ARCH_HAS_MEMBARRIER_SYNC_CORE > bool > > +config ARCH_HAS_MSEAL_SYSTEM_MAPPINGS ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS is more clear. HAS may mean it's supported or it could mean it's enabled. SUPPORTS is more clear that this is set if an arch supports the feature. After all, I think HAS here means it "has support for" mseal system mappings. I see that both are used (HAS and SUPPORTS), but I'm still not sure what HAS means in other contexts without digging. > + bool > + help > + Control MSEAL_SYSTEM_MAPPINGS access based on architecture. > + > + A 64-bit kernel is required for the memory sealing feature. > + No specific hardware features from the CPU are needed. > + > + To enable this feature, the architecture needs to update their > + speical mappings calls to include the sealing flag and confirm > + that it doesn't unmap/remap system mappings during the life > + time of the process. After the architecture enables this, a > + distribution can set CONFIG_MSEAL_SYSTEM_MAPPING to manage access > + to the feature. > + > + For complete descriptions of memory sealing, please see > + Documentation/userspace-api/mseal.rst > + > config HAVE_PERF_EVENTS > bool > help > diff --git a/security/Kconfig b/security/Kconfig > index f10dbf15c294..bfb35fc7a3c6 100644 > --- a/security/Kconfig > +++ b/security/Kconfig > @@ -51,6 +51,24 @@ config PROC_MEM_NO_FORCE > > endchoice > > +config MSEAL_SYSTEM_MAPPINGS > + bool "mseal system mappings" > + depends on 64BIT > + depends on ARCH_HAS_MSEAL_SYSTEM_MAPPINGS > + depends on !CHECKPOINT_RESTORE > + help > + Seal system mappings such as vdso, vvar, sigpage, uprobes, etc. > + > + A 64-bit kernel is required for the memory sealing feature. > + No specific hardware features from the CPU are needed. > + > + Note: CHECKPOINT_RESTORE, UML, gVisor are known to relocate or > + unmap system mapping, therefore this config can't be enabled > + universally. I guess add rr to the list, since that's also reported to have issues as well. > + > + For complete descriptions of memory sealing, please see > + Documentation/userspace-api/mseal.rst > + > config SECURITY > bool "Enable different security models" > depends on SYSFS > -- > 2.48.1.502.g6dc24dfdaf-goog >
On Wed, Feb 12, 2025 at 7:05 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > ... > > > > In this version, we've improved the handling of system mapping sealing from > > previous versions, instead of modifying the _install_special_mapping > > function itself, which would affect all architectures, we now call > > _install_special_mapping with a sealing flag only within the specific > > architecture that requires it. This targeted approach offers two key > > advantages: 1) It limits the code change's impact to the necessary > > architectures, and 2) It aligns with the software architecture by keeping > > the core memory management within the mm layer, while delegating the > > decision of sealing system mappings to the individual architecture, which > > is particularly relevant since 32-bit architectures never require sealing. > > > > Additionally, this patch introduces a new header, > > include/linux/usrprocess.h, which contains the mseal_system_mappings() > > function. This function helps the architecture determine if system > > mapping is enabled within the current kernel configuration. It can be > > extended in the future to handle opt-in/out prctl for enabling system > > mapping sealing at the process level or a kernel cmdline feature. > > > > A new header file was introduced because it was difficult to find the > > best location for this function. Although include/linux/mm.h was > > considered, this feature is more closely related to user processes > > than core memory management. Additionally, future prctl or kernel > > cmd-line implementations for this feature would not fit within the > > scope of core memory management or mseal.c. This is relevant because > > if we add unit-tests for mseal.c in the future, we would want to limit > > mseal.c's dependencies to core memory management. > > ... > > > > diff --git a/include/linux/userprocess.h b/include/linux/userprocess.h > > new file mode 100644 > > index 000000000000..bd11e2e972c5 > > --- /dev/null > > +++ b/include/linux/userprocess.h > > @@ -0,0 +1,18 @@ > > +/* SPDX-License-Identifier: GPL-2.0 */ > > +#ifndef _LINUX_USER_PROCESS_H > > +#define _LINUX_USER_PROCESS_H > > +#include <linux/mm.h> > > Why is this in a new file and not mm.h directly? Anyone who needs to > use this will pull in mm.h anyways, and probably already has mm.h > included. > The commit message includes the reason. I've snipped and left the relevant portion for easy reference, please see the beginning of this email. > > + > > +/* > > + * mseal of userspace process's system mappings. > > + */ > > +static inline unsigned long mseal_system_mappings(void) > > +{ > > +#ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS > > + return VM_SEALED; > > +#else > > + return 0; > > +#endif > > +} > > + > > +#endif > > Looking in mm.h, there are examples of config checks which either set > the bit or set it to 0. This means you wouldn't need the function at > all and the precompiler would do all the work (although with a static > inline, I'm not sure it would make a big difference in instructions). > > For instance, you could do: > #ifdef CONFIG_64BIT > /* VM is sealed, in vm_flags */ > #define VM_SEALED _BITUL(63) > > #ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS > #define VM_SYSTEM_SEAL VM_SEALED > else > #define VM_SYSTEM_SEAL VM_NONE > #endif > > #else /* CONFIG_64BIT */ > #define VM_SYSTEM_SEAL VM_NONE > #endif > > Then you can use VM_SYSTEM_SEAL in the system mappings function in the > list of flags instead of having a variable + calling the static > function. > > I didn't look close enough to see if the 32bit version is necessary. > I'm aware of the other pattern. VM_SEALED isn't defined in 32-bit systems, and mseal.c isn't part of the build. This is intentional. Any 32-bit code trying to use the sealing function or the VM_SEALED flag will immediately fail compilation. This makes it easier to identify incorrect usage. > > diff --git a/init/Kconfig b/init/Kconfig > > index d0d021b3fa3b..892d2bcdf397 100644 > > --- a/init/Kconfig > > +++ b/init/Kconfig > > @@ -1882,6 +1882,24 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS > > config ARCH_HAS_MEMBARRIER_SYNC_CORE > > bool > > > > +config ARCH_HAS_MSEAL_SYSTEM_MAPPINGS > > ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS is more clear. HAS may mean it's > supported or it could mean it's enabled. SUPPORTS is more clear that > this is set if an arch supports the feature. After all, I think HAS > here means it "has support for" mseal system mappings. > > I see that both are used (HAS and SUPPORTS), but I'm still not sure what > HAS means in other contexts without digging. > The existing pattern is to indicate arch support with CONFIG_ARCH_HAS_N and security/KConfig to have CONFIG_N as the main switch. For example, CONFIG_ARCH_HAS_FORTIFY_SOURCE and CONFIG_FORTIFY_SOURCE. Since the MSEAL_SYSTEM_MAPPINGS is placed in security/Kconfig, hence I'm following the existing pattern in security/Kconfig. > > diff --git a/security/Kconfig b/security/Kconfig > > index f10dbf15c294..bfb35fc7a3c6 100644 > > --- a/security/Kconfig > > +++ b/security/Kconfig > > @@ -51,6 +51,24 @@ config PROC_MEM_NO_FORCE > > > > endchoice > > > > +config MSEAL_SYSTEM_MAPPINGS > > + bool "mseal system mappings" > > + depends on 64BIT > > + depends on ARCH_HAS_MSEAL_SYSTEM_MAPPINGS > > + depends on !CHECKPOINT_RESTORE > > + help > > + Seal system mappings such as vdso, vvar, sigpage, uprobes, etc. > > + > > + A 64-bit kernel is required for the memory sealing feature. > > + No specific hardware features from the CPU are needed. > > + > > + Note: CHECKPOINT_RESTORE, UML, gVisor are known to relocate or > > + unmap system mapping, therefore this config can't be enabled > > + universally. > > I guess add rr to the list, since that's also reported to have issues as > well. > sure. Thanks for reviewing. -Jeff
* Jeff Xu <jeffxu@chromium.org> [250213 12:17]: > On Wed, Feb 12, 2025 at 7:05 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > > ... > > > > > > In this version, we've improved the handling of system mapping sealing from > > > previous versions, instead of modifying the _install_special_mapping > > > function itself, which would affect all architectures, we now call > > > _install_special_mapping with a sealing flag only within the specific > > > architecture that requires it. This targeted approach offers two key > > > advantages: 1) It limits the code change's impact to the necessary > > > architectures, and 2) It aligns with the software architecture by keeping > > > the core memory management within the mm layer, while delegating the > > > decision of sealing system mappings to the individual architecture, which > > > is particularly relevant since 32-bit architectures never require sealing. > > > > > > Additionally, this patch introduces a new header, > > > include/linux/usrprocess.h, which contains the mseal_system_mappings() > > > function. This function helps the architecture determine if system > > > mapping is enabled within the current kernel configuration. It can be > > > extended in the future to handle opt-in/out prctl for enabling system > > > mapping sealing at the process level or a kernel cmdline feature. > > > > > > A new header file was introduced because it was difficult to find the > > > best location for this function. Although include/linux/mm.h was > > > considered, this feature is more closely related to user processes > > > than core memory management. Additionally, future prctl or kernel > > > cmd-line implementations for this feature would not fit within the > > > scope of core memory management or mseal.c. This is relevant because > > > if we add unit-tests for mseal.c in the future, we would want to limit > > > mseal.c's dependencies to core memory management. > > > > ... > > > > > > diff --git a/include/linux/userprocess.h b/include/linux/userprocess.h > > > new file mode 100644 > > > index 000000000000..bd11e2e972c5 > > > --- /dev/null > > > +++ b/include/linux/userprocess.h > > > @@ -0,0 +1,18 @@ > > > +/* SPDX-License-Identifier: GPL-2.0 */ > > > +#ifndef _LINUX_USER_PROCESS_H > > > +#define _LINUX_USER_PROCESS_H > > > +#include <linux/mm.h> > > > > Why is this in a new file and not mm.h directly? Anyone who needs to > > use this will pull in mm.h anyways, and probably already has mm.h > > included. > > > The commit message includes the reason. I've snipped and left the > relevant portion for easy reference, please see the beginning of this > email. This is a _really_ good reason NOT to stack the entire patch set description into a single patch description. Within the above 111 lines of text, I missed that statement. When Andrew takes the first patch, he'll put the patch description in there, and then we can actually concentrate on the patch with the limited context of what is going on. If it doesn't go through Andrew's branch, then we can fall back to each patch standing on its own with the change stating why things are done. Also, I don't agree with your stated reason for a new header. Please remove this header until it is needed. It can be added later if it is needed. If we all had tiny headers because we might need them in the future, we'd have serious issues finding anything and the list of included headers would be huge. You have added unnecessary changes and complications to this patch set on v5. > > > > + > > > +/* > > > + * mseal of userspace process's system mappings. > > > + */ > > > +static inline unsigned long mseal_system_mappings(void) > > > +{ > > > +#ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS > > > + return VM_SEALED; > > > +#else > > > + return 0; > > > +#endif > > > +} > > > + > > > +#endif > > > > Looking in mm.h, there are examples of config checks which either set > > the bit or set it to 0. This means you wouldn't need the function at > > all and the precompiler would do all the work (although with a static > > inline, I'm not sure it would make a big difference in instructions). > > > > For instance, you could do: > > #ifdef CONFIG_64BIT > > /* VM is sealed, in vm_flags */ > > #define VM_SEALED _BITUL(63) > > > > #ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS > > #define VM_SYSTEM_SEAL VM_SEALED > > else > > #define VM_SYSTEM_SEAL VM_NONE > > #endif > > > > #else /* CONFIG_64BIT */ > > #define VM_SYSTEM_SEAL VM_NONE > > #endif > > > > Then you can use VM_SYSTEM_SEAL in the system mappings function in the > > list of flags instead of having a variable + calling the static > > function. > > > > I didn't look close enough to see if the 32bit version is necessary. > > > I'm aware of the other pattern. > > VM_SEALED isn't defined in 32-bit systems, and mseal.c isn't part of > the build. This is intentional. Any 32-bit code trying to use the > sealing function or the VM_SEALED flag will immediately fail > compilation. This makes it easier to identify incorrect usage. So you are against using the #define because the VM_SYSTEM_SEAL will be defined, even though it will be VM_NONE? This is no worse than a function that returns 0, and it aligns better with what we have today in that it can be put in the list of other flags. Also, my vote for where you should put this code is clear: it should live with the mseal #define in mm.h. You're going to need mm.h anyways, and that's a big hint as to where it should live. > > > > diff --git a/init/Kconfig b/init/Kconfig > > > index d0d021b3fa3b..892d2bcdf397 100644 > > > --- a/init/Kconfig > > > +++ b/init/Kconfig > > > @@ -1882,6 +1882,24 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS > > > config ARCH_HAS_MEMBARRIER_SYNC_CORE > > > bool > > > > > > +config ARCH_HAS_MSEAL_SYSTEM_MAPPINGS > > > > ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS is more clear. HAS may mean it's > > supported or it could mean it's enabled. SUPPORTS is more clear that > > this is set if an arch supports the feature. After all, I think HAS > > here means it "has support for" mseal system mappings. > > > > I see that both are used (HAS and SUPPORTS), but I'm still not sure what > > HAS means in other contexts without digging. > > > The existing pattern is to indicate arch support with > CONFIG_ARCH_HAS_N and security/KConfig to have CONFIG_N as the main > switch. For example, CONFIG_ARCH_HAS_FORTIFY_SOURCE and > CONFIG_FORTIFY_SOURCE. Since the MSEAL_SYSTEM_MAPPINGS is placed in > security/Kconfig, hence I'm following the existing pattern in > security/Kconfig. Okay, thanks. I really think SUPPORTS is more clear, but sticking with whatever your area uses is also fine. ... Thanks, Liam
On Thu, Feb 13, 2025 at 01:29:46PM -0500, Liam R. Howlett wrote: > * Jeff Xu <jeffxu@chromium.org> [250213 12:17]: > > On Wed, Feb 12, 2025 at 7:05 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > > > > ... > > > > > > > > In this version, we've improved the handling of system mapping sealing from > > > > previous versions, instead of modifying the _install_special_mapping > > > > function itself, which would affect all architectures, we now call > > > > _install_special_mapping with a sealing flag only within the specific > > > > architecture that requires it. This targeted approach offers two key > > > > advantages: 1) It limits the code change's impact to the necessary > > > > architectures, and 2) It aligns with the software architecture by keeping > > > > the core memory management within the mm layer, while delegating the > > > > decision of sealing system mappings to the individual architecture, which > > > > is particularly relevant since 32-bit architectures never require sealing. > > > > > > > > Additionally, this patch introduces a new header, > > > > include/linux/usrprocess.h, which contains the mseal_system_mappings() > > > > function. This function helps the architecture determine if system > > > > mapping is enabled within the current kernel configuration. It can be > > > > extended in the future to handle opt-in/out prctl for enabling system > > > > mapping sealing at the process level or a kernel cmdline feature. > > > > > > > > A new header file was introduced because it was difficult to find the > > > > best location for this function. Although include/linux/mm.h was > > > > considered, this feature is more closely related to user processes > > > > than core memory management. Additionally, future prctl or kernel > > > > cmd-line implementations for this feature would not fit within the > > > > scope of core memory management or mseal.c. This is relevant because > > > > if we add unit-tests for mseal.c in the future, we would want to limit > > > > mseal.c's dependencies to core memory management. > > > > > > ... > > > > > > > > diff --git a/include/linux/userprocess.h b/include/linux/userprocess.h > > > > new file mode 100644 > > > > index 000000000000..bd11e2e972c5 > > > > --- /dev/null > > > > +++ b/include/linux/userprocess.h > > > > @@ -0,0 +1,18 @@ > > > > +/* SPDX-License-Identifier: GPL-2.0 */ > > > > +#ifndef _LINUX_USER_PROCESS_H > > > > +#define _LINUX_USER_PROCESS_H > > > > +#include <linux/mm.h> > [...] > > Also, I don't agree with your stated reason for a new header. > > Please remove this header until it is needed. It can be added later if > it is needed. If we all had tiny headers because we might need them in > the future, we'd have serious issues finding anything and the list of > included headers would be huge. > > You have added unnecessary changes and complications to this patch set > on v5. In all fairness, I think v5 is significantly less complex overall. Jeff used his best judgement with a new header, and I don't think it's unreasonable. That it is unwanted is fine, mm.h it is, but I think it's clear the intent was trying to make this as least burdensome for the mm subsystem as possible. > > > > +/* > > > > + * mseal of userspace process's system mappings. > > > > + */ > > > > +static inline unsigned long mseal_system_mappings(void) > > > > +{ > > > > +#ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS > > > > + return VM_SEALED; > > > > +#else > > > > + return 0; > > > > +#endif > > > > +} > > > > + > > > > +#endif > > > > > > Looking in mm.h, there are examples of config checks which either set > > > the bit or set it to 0. This means you wouldn't need the function at > > > all and the precompiler would do all the work (although with a static > > > inline, I'm not sure it would make a big difference in instructions). > > > > > > For instance, you could do: > > > #ifdef CONFIG_64BIT > > > /* VM is sealed, in vm_flags */ > > > #define VM_SEALED _BITUL(63) > > > > > > #ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS > > > #define VM_SYSTEM_SEAL VM_SEALED > > > else > > > #define VM_SYSTEM_SEAL VM_NONE > > > #endif > > > > > > #else /* CONFIG_64BIT */ > > > #define VM_SYSTEM_SEAL VM_NONE > > > #endif > > > > > > Then you can use VM_SYSTEM_SEAL in the system mappings function in the > > > list of flags instead of having a variable + calling the static > > > function. > > > > > > I didn't look close enough to see if the 32bit version is necessary. > > > > > I'm aware of the other pattern. > > > > VM_SEALED isn't defined in 32-bit systems, and mseal.c isn't part of > > the build. This is intentional. Any 32-bit code trying to use the > > sealing function or the VM_SEALED flag will immediately fail > > compilation. This makes it easier to identify incorrect usage. > > So you are against using the #define because the VM_SYSTEM_SEAL will be > defined, even though it will be VM_NONE? This is no worse than a > function that returns 0, and it aligns better with what we have today in > that it can be put in the list of other flags. When I was reading through all of this and considering the history of its development goals, it strikes me that a function is easier for the future if/when this can be made a boot-time setting. If mm maintainers prefer a #define for now, that's fine of course. The value of VM_SYSTEM_SEAL can be VM_NONE on 32-bit. > Also, my vote for where you should put this code is clear: it should > live with the mseal #define in mm.h. You're going to need mm.h anyways, > and that's a big hint as to where it should live. Sounds good. > > > > diff --git a/init/Kconfig b/init/Kconfig > > > > index d0d021b3fa3b..892d2bcdf397 100644 > > > > --- a/init/Kconfig > > > > +++ b/init/Kconfig > > > > @@ -1882,6 +1882,24 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS > > > > config ARCH_HAS_MEMBARRIER_SYNC_CORE > > > > bool > > > > > > > > +config ARCH_HAS_MSEAL_SYSTEM_MAPPINGS > > > > > > ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS is more clear. HAS may mean it's > > > supported or it could mean it's enabled. SUPPORTS is more clear that > > > this is set if an arch supports the feature. After all, I think HAS > > > here means it "has support for" mseal system mappings. > > > > > > I see that both are used (HAS and SUPPORTS), but I'm still not sure what > > > HAS means in other contexts without digging. > > > > > The existing pattern is to indicate arch support with > > CONFIG_ARCH_HAS_N and security/KConfig to have CONFIG_N as the main > > switch. For example, CONFIG_ARCH_HAS_FORTIFY_SOURCE and > > CONFIG_FORTIFY_SOURCE. Since the MSEAL_SYSTEM_MAPPINGS is placed in > > security/Kconfig, hence I'm following the existing pattern in > > security/Kconfig. > > Okay, thanks. I really think SUPPORTS is more clear, but sticking with > whatever your area uses is also fine. Yeah, I've really scratched my head over what the best practice is here. There's a real mixture of "HAS" vs "SUPPORTS" across both hardware architectural support, software features, interface implementations, and compiler behavior that bridges between those. When I've looked in the past, it seemed that "HAS" was in the majority, but I've never been able to make sense of it. Let me look again. Today, HAS shows: $ git grep 'config .*_HAS_' | grep -v Documentation/ | \ awk '{print $2}' | cut -d_ -f1 | sort | uniq -c | sort -n 1 ARM 1 MAC80211 1 PGTABLE 1 PPC 1 RUSTC 1 USB 2 ARM64 2 SIBYTE 2 SOC 3 FS 3 PAHOLE 6 SGI 6 TOOLCHAIN 10 ARC 20 AS 32 SYS 34 CPU 38 CC 105 ARCH Looking through them, it's tools (CC, AS, PAHOLE, etc), and system/cpu/architecture prefixes, with ARCH being a clear winner. SUPPORTS looks similar, but isn't as widely used: $ git grep 'config .*_SUPPORTS_' | grep -v Documentation/ | \ awk '{print $2}' | cut -d_ -f1 | sort | uniq -c | sort -n 1 PPC64 1 X86 2 CLANG 2 GCC 2 RUSTC 8 CPU 38 SYS 71 ARCH The mips arch is using SYS, and distinguishes between HAS and SUPPORTS in the sense of "this kernel HAS support for CPU $foo, which SUPPORTS features x, y, z". Looking through ARCH_SUPPORTS, it's all software features. Looking through ARCH_HAS, it looks like a mix of hardware features and software features. Some software features are more about implementation availability, though (so "HAS" makes sense, but should maybe be "IMPLEMENTS"?) e.g. ARCH_HAS_SYSCALL_WRAPPER, ARCH_HAS_STRNLEN_USER, ARCH_HAS_FORTIFY_SOURCE, ARCH_HAS_ELF_RANDOMIZE, ARCH_HAS_STRICT_KERNEL_RWX, ARCH_HAS_STRICT_MODULE_RWX. So, I think I'd agree: this is about a software features and not an API implementation nor hardware feature, so let's use ARCH_SUPPORTS_. -Kees
* Kees Cook <kees@kernel.org> [250213 15:11]: > On Thu, Feb 13, 2025 at 01:29:46PM -0500, Liam R. Howlett wrote: > > * Jeff Xu <jeffxu@chromium.org> [250213 12:17]: > > > On Wed, Feb 12, 2025 at 7:05 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > > > > > > ... > > > > > > > > > > In this version, we've improved the handling of system mapping sealing from > > > > > previous versions, instead of modifying the _install_special_mapping > > > > > function itself, which would affect all architectures, we now call > > > > > _install_special_mapping with a sealing flag only within the specific > > > > > architecture that requires it. This targeted approach offers two key > > > > > advantages: 1) It limits the code change's impact to the necessary > > > > > architectures, and 2) It aligns with the software architecture by keeping > > > > > the core memory management within the mm layer, while delegating the > > > > > decision of sealing system mappings to the individual architecture, which > > > > > is particularly relevant since 32-bit architectures never require sealing. > > > > > > > > > > Additionally, this patch introduces a new header, > > > > > include/linux/usrprocess.h, which contains the mseal_system_mappings() > > > > > function. This function helps the architecture determine if system > > > > > mapping is enabled within the current kernel configuration. It can be > > > > > extended in the future to handle opt-in/out prctl for enabling system > > > > > mapping sealing at the process level or a kernel cmdline feature. > > > > > > > > > > A new header file was introduced because it was difficult to find the > > > > > best location for this function. Although include/linux/mm.h was > > > > > considered, this feature is more closely related to user processes > > > > > than core memory management. Additionally, future prctl or kernel > > > > > cmd-line implementations for this feature would not fit within the > > > > > scope of core memory management or mseal.c. This is relevant because > > > > > if we add unit-tests for mseal.c in the future, we would want to limit > > > > > mseal.c's dependencies to core memory management. > > > > > > > > ... > > > > > > > > > > diff --git a/include/linux/userprocess.h b/include/linux/userprocess.h > > > > > new file mode 100644 > > > > > index 000000000000..bd11e2e972c5 > > > > > --- /dev/null > > > > > +++ b/include/linux/userprocess.h > > > > > @@ -0,0 +1,18 @@ > > > > > +/* SPDX-License-Identifier: GPL-2.0 */ > > > > > +#ifndef _LINUX_USER_PROCESS_H > > > > > +#define _LINUX_USER_PROCESS_H > > > > > +#include <linux/mm.h> > > [...] > > > > Also, I don't agree with your stated reason for a new header. > > > > Please remove this header until it is needed. It can be added later if > > it is needed. If we all had tiny headers because we might need them in > > the future, we'd have serious issues finding anything and the list of > > included headers would be huge. > > > > You have added unnecessary changes and complications to this patch set > > on v5. > > In all fairness, I think v5 is significantly less complex overall. Jeff > used his best judgement with a new header, and I don't think it's > unreasonable. That it is unwanted is fine, mm.h it is, but I think it's > clear the intent was trying to make this as least burdensome for the mm > subsystem as possible. This version is certainly easier to follow, but it's still more complicated than it needs to be. Adding custom headers seems like an unnecessary addition to previous versions. I didn't mean this to be a jab or upsetting - and sorry to both of you, I can see how it could be taken this way. I was trying to point out that this is the same sort of thing that got Jeff into trouble with Linus about over-engineering the v8 of the mseal patch set [1]. ... > > > > > > VM_SEALED isn't defined in 32-bit systems, and mseal.c isn't part of > > > the build. This is intentional. Any 32-bit code trying to use the > > > sealing function or the VM_SEALED flag will immediately fail > > > compilation. This makes it easier to identify incorrect usage. > > > > So you are against using the #define because the VM_SYSTEM_SEAL will be > > defined, even though it will be VM_NONE? This is no worse than a > > function that returns 0, and it aligns better with what we have today in > > that it can be put in the list of other flags. > > When I was reading through all of this and considering the history of > its development goals, it strikes me that a function is easier for the > future if/when this can be made a boot-time setting. > Reworking this change to function as a boot-time parameter, or whatever, would not be a significant amount of work, if/when the time comes. Since we don't know what the future holds, it it unnecessary to engineer in a potential change for a future version when the change is easy enough to make later and keep the code cleaner now. > If mm maintainers prefer a #define for now, that's fine of course. The > value of VM_SYSTEM_SEAL can be VM_NONE on 32-bit. Thanks. I think having a flag with VM_NONE on 32-bit is just as sane as a "flags |= system_seal()" call that unconditionally returns 0 on 32-bit. > > > Also, my vote for where you should put this code is clear: it should > > live with the mseal #define in mm.h. You're going to need mm.h anyways, > > and that's a big hint as to where it should live. > > Sounds good. > > > > > > diff --git a/init/Kconfig b/init/Kconfig > > > > > index d0d021b3fa3b..892d2bcdf397 100644 > > > > > --- a/init/Kconfig > > > > > +++ b/init/Kconfig > > > > > @@ -1882,6 +1882,24 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS > > > > > config ARCH_HAS_MEMBARRIER_SYNC_CORE > > > > > bool > > > > > > > > > > +config ARCH_HAS_MSEAL_SYSTEM_MAPPINGS > > > > > > > > ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS is more clear. HAS may mean it's > > > > supported or it could mean it's enabled. SUPPORTS is more clear that > > > > this is set if an arch supports the feature. After all, I think HAS > > > > here means it "has support for" mseal system mappings. > > > > > > > > I see that both are used (HAS and SUPPORTS), but I'm still not sure what > > > > HAS means in other contexts without digging. > > > > > > > The existing pattern is to indicate arch support with > > > CONFIG_ARCH_HAS_N and security/KConfig to have CONFIG_N as the main > > > switch. For example, CONFIG_ARCH_HAS_FORTIFY_SOURCE and > > > CONFIG_FORTIFY_SOURCE. Since the MSEAL_SYSTEM_MAPPINGS is placed in > > > security/Kconfig, hence I'm following the existing pattern in > > > security/Kconfig. > > > > Okay, thanks. I really think SUPPORTS is more clear, but sticking with > > whatever your area uses is also fine. > > Yeah, I've really scratched my head over what the best practice is here. > There's a real mixture of "HAS" vs "SUPPORTS" across both hardware > architectural support, software features, interface implementations, > and compiler behavior that bridges between those. When I've looked in > the past, it seemed that "HAS" was in the majority, but I've never been > able to make sense of it. > > Let me look again. Today, HAS shows: > > $ git grep 'config .*_HAS_' | grep -v Documentation/ | \ > awk '{print $2}' | cut -d_ -f1 | sort | uniq -c | sort -n > 1 ARM > 1 MAC80211 > 1 PGTABLE > 1 PPC > 1 RUSTC > 1 USB > 2 ARM64 > 2 SIBYTE > 2 SOC > 3 FS > 3 PAHOLE > 6 SGI > 6 TOOLCHAIN > 10 ARC > 20 AS > 32 SYS > 34 CPU > 38 CC > 105 ARCH > > Looking through them, it's tools (CC, AS, PAHOLE, etc), and > system/cpu/architecture prefixes, with ARCH being a clear winner. > > SUPPORTS looks similar, but isn't as widely used: > > $ git grep 'config .*_SUPPORTS_' | grep -v Documentation/ | \ > awk '{print $2}' | cut -d_ -f1 | sort | uniq -c | sort -n > 1 PPC64 > 1 X86 > 2 CLANG > 2 GCC > 2 RUSTC > 8 CPU > 38 SYS > 71 ARCH > > The mips arch is using SYS, and distinguishes between HAS and SUPPORTS > in the sense of "this kernel HAS support for CPU $foo, which SUPPORTS > features x, y, z". > > Looking through ARCH_SUPPORTS, it's all software features. Looking > through ARCH_HAS, it looks like a mix of hardware features > and software features. Some software features are more about > implementation availability, though (so "HAS" makes sense, > but should maybe be "IMPLEMENTS"?) e.g. ARCH_HAS_SYSCALL_WRAPPER, > ARCH_HAS_STRNLEN_USER, ARCH_HAS_FORTIFY_SOURCE, ARCH_HAS_ELF_RANDOMIZE, > ARCH_HAS_STRICT_KERNEL_RWX, ARCH_HAS_STRICT_MODULE_RWX. > > So, I think I'd agree: this is about a software features and not an API > implementation nor hardware feature, so let's use ARCH_SUPPORTS_. This makes sense. I grepped for both as well and found that HAS outnumbers SUPPORTS, but there are a lot of each. I figured using whatever the subsystem uses is probably the right answer - your answer is better. Seeing them in the same kconfig option together gave me pause as to why they were different. Thanks for digging into this! Regards, Liam [1] https://lore.kernel.org/linux-mm/CAHk-=wjGGgfAoiEdPqLdib7VvQgG7uVXpTPzJ9jTW0HesRpPwQ@mail.gmail.com/
On Thu, Feb 13, 2025 at 12:54 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > > > > > > > VM_SEALED isn't defined in 32-bit systems, and mseal.c isn't part of > > > > the build. This is intentional. Any 32-bit code trying to use the > > > > sealing function or the VM_SEALED flag will immediately fail > > > > compilation. This makes it easier to identify incorrect usage. > > > > > > So you are against using the #define because the VM_SYSTEM_SEAL will be > > > defined, even though it will be VM_NONE? This is no worse than a > > > function that returns 0, and it aligns better with what we have today in > > > that it can be put in the list of other flags. > > > > When I was reading through all of this and considering the history of > > its development goals, it strikes me that a function is easier for the > > future if/when this can be made a boot-time setting. > > > > Reworking this change to function as a boot-time parameter, or whatever, > would not be a significant amount of work, if/when the time comes. > Since we don't know what the future holds, it it unnecessary to engineer > in a potential change for a future version when the change is easy > enough to make later and keep the code cleaner now. > Sure, I will put the function in mm.h for this patch. We can find a proper place when it is time to implement the boot-time parameter change. The call stack for sealing system mapping is something like below: install_special_mapping (mm/mmap.c) map_vdso (arch/x86/entry/vdso/vma.c) load_elf_binary (fs/binfmt_elf.c) load_misc_binary (fs/binfmt_misc.c) bprm_execve (fs/exec.c) do_execveat_common __x64_sys_execve do_syscall_64 IMO, there's a clear divide between the API implementer and the API user. mm and mm.h are the providers, offering the core mm functionality through APIs/data structures like install_special_mapping(). The exe layer (bprm_execve, map_vdso, etc) is the consumer of the install_special_mapping. The logic related to checking if sealing system mapping is enabled belongs to the exe layer. > > > If mm maintainers prefer a #define for now, that's fine of course. The > > value of VM_SYSTEM_SEAL can be VM_NONE on 32-bit. > > Thanks. I think having a flag with VM_NONE on 32-bit is just as sane as > a "flags |= system_seal()" call that unconditionally returns 0 on > 32-bit. > Consider the case below in src/third_party/kernel/v6.6/fs/proc/task_mmu.c, #ifdef CONFIG_64BIT [ilog2(VM_SEALED)] = "sl", #endif If #ifdef CONFIG_64BIT is missing, it won't be detected during compile time. Setting VM_SEALED to VM_NONE could simplify the coding in some cases (get/set case), but it might make other cases error prone. I would prefer to not have VM_SEALED for 32 bit. -Jeff
* Jeff Xu <jeffxu@chromium.org> [250213 17:00]: > On Thu, Feb 13, 2025 at 12:54 PM Liam R. Howlett > <Liam.Howlett@oracle.com> wrote: > > > > > > > > > > > VM_SEALED isn't defined in 32-bit systems, and mseal.c isn't part of > > > > > the build. This is intentional. Any 32-bit code trying to use the > > > > > sealing function or the VM_SEALED flag will immediately fail > > > > > compilation. This makes it easier to identify incorrect usage. > > > > > > > > So you are against using the #define because the VM_SYSTEM_SEAL will be > > > > defined, even though it will be VM_NONE? This is no worse than a > > > > function that returns 0, and it aligns better with what we have today in > > > > that it can be put in the list of other flags. > > > > > > When I was reading through all of this and considering the history of > > > its development goals, it strikes me that a function is easier for the > > > future if/when this can be made a boot-time setting. > > > > > > > Reworking this change to function as a boot-time parameter, or whatever, > > would not be a significant amount of work, if/when the time comes. > > Since we don't know what the future holds, it it unnecessary to engineer > > in a potential change for a future version when the change is easy > > enough to make later and keep the code cleaner now. > > > Sure, I will put the function in mm.h for this patch. We can find a > proper place when it is time to implement the boot-time parameter > change. > > The call stack for sealing system mapping is something like below: > > install_special_mapping (mm/mmap.c) > map_vdso (arch/x86/entry/vdso/vma.c) > load_elf_binary (fs/binfmt_elf.c) > load_misc_binary (fs/binfmt_misc.c) > bprm_execve (fs/exec.c) > do_execveat_common > __x64_sys_execve > do_syscall_64 > > IMO, there's a clear divide between the API implementer and the API user. > mm and mm.h are the providers, offering the core mm functionality > through APIs/data structures like install_special_mapping(). > > The exe layer (bprm_execve, map_vdso, etc) is the consumer of the > install_special_mapping. > The logic related to checking if sealing system mapping is enabled > belongs to the exe layer. Since this is an all or nothing enabling, there is no reason to have each caller check the same thing and do the same action. You should put the logic into the provider - they all end up doing the same thing. Also, this is a compile time option so it doesn't even need to be checked on execution - just apply it in the first place, at the source. Your static inline function was already doing this...? I'm confused as to what you are arguing here because it goes against what you had and what I suggested. The alternative you are suggesting is more code, more instructions, and the best outcome is the same result. > > > > > > If mm maintainers prefer a #define for now, that's fine of course. The > > > value of VM_SYSTEM_SEAL can be VM_NONE on 32-bit. > > > > Thanks. I think having a flag with VM_NONE on 32-bit is just as sane as > > a "flags |= system_seal()" call that unconditionally returns 0 on > > 32-bit. > > > Consider the case below in src/third_party/kernel/v6.6/fs/proc/task_mmu.c, > > #ifdef CONFIG_64BIT > [ilog2(VM_SEALED)] = "sl", > #endif > > If #ifdef CONFIG_64BIT is missing, it won't be detected during compile time. > > Setting VM_SEALED to VM_NONE could simplify the coding in some cases > (get/set case), but it might make other cases error prone. > > I would prefer to not have VM_SEALED for 32 bit. But what I posted leaves VM_SEALED undefined for 32 bit, it defines VM_SYSTEM_SEALED which can be placed, for instance, into _install_special_mapping() arguments directly. Reducing the change to just adding a new flag to the list. And it's applied to all system mappings based on if it's enabled on compile or not. Also: include/linux/mm.h:#define VM_NONE 0x00000000 so, I'm not sure what evaluation you are concerned about? It would be defined as 0. Thanks, Liam
* Liam R. Howlett <Liam.Howlett@oracle.com> [250213 19:14]: > * Jeff Xu <jeffxu@chromium.org> [250213 17:00]: > > On Thu, Feb 13, 2025 at 12:54 PM Liam R. Howlett > > <Liam.Howlett@oracle.com> wrote: > > > > > > > > > > > > > > VM_SEALED isn't defined in 32-bit systems, and mseal.c isn't part of > > > > > > the build. This is intentional. Any 32-bit code trying to use the > > > > > > sealing function or the VM_SEALED flag will immediately fail > > > > > > compilation. This makes it easier to identify incorrect usage. > > > > > > > > > > So you are against using the #define because the VM_SYSTEM_SEAL will be > > > > > defined, even though it will be VM_NONE? This is no worse than a > > > > > function that returns 0, and it aligns better with what we have today in > > > > > that it can be put in the list of other flags. > > > > > > > > When I was reading through all of this and considering the history of > > > > its development goals, it strikes me that a function is easier for the > > > > future if/when this can be made a boot-time setting. > > > > > > > > > > Reworking this change to function as a boot-time parameter, or whatever, > > > would not be a significant amount of work, if/when the time comes. > > > Since we don't know what the future holds, it it unnecessary to engineer > > > in a potential change for a future version when the change is easy > > > enough to make later and keep the code cleaner now. > > > > > Sure, I will put the function in mm.h for this patch. We can find a > > proper place when it is time to implement the boot-time parameter > > change. > > > > The call stack for sealing system mapping is something like below: > > > > install_special_mapping (mm/mmap.c) > > map_vdso (arch/x86/entry/vdso/vma.c) > > load_elf_binary (fs/binfmt_elf.c) > > load_misc_binary (fs/binfmt_misc.c) > > bprm_execve (fs/exec.c) > > do_execveat_common > > __x64_sys_execve > > do_syscall_64 > > > > IMO, there's a clear divide between the API implementer and the API user. > > mm and mm.h are the providers, offering the core mm functionality > > through APIs/data structures like install_special_mapping(). > > > > The exe layer (bprm_execve, map_vdso, etc) is the consumer of the > > install_special_mapping. > > The logic related to checking if sealing system mapping is enabled > > belongs to the exe layer. > > Since this is an all or nothing enabling, there is no reason to have > each caller check the same thing and do the same action. You should put > the logic into the provider - they all end up doing the same thing. > > Also, this is a compile time option so it doesn't even need to be > checked on execution - just apply it in the first place, at the source. > Your static inline function was already doing this...? > > I'm confused as to what you are arguing here because it goes against > what you had and what I suggested. The alternative you are suggesting > is more code, more instructions, and the best outcome is the same > result. I think I understand what you are saying now: the interface to install_special_mapping() needs to take the vma flag, as it does today. What I don't understand is that what you proposed and what I proposed both do that. What I'm saying is that, since system mappings are enabled, we can already know implicitly by having VM_SYSTEM_SEAL either VM_NONE or VM_SEAL. Turning this: @@ -264,11 +266,12 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr) /* * MAYWRITE to allow gdb to COW and set breakpoints */ + vm_flags = VM_READ|VM_EXEC|VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC; + vm_flags |= mseal_system_mappings(); vma = _install_special_mapping(mm, text_start, image->size, - VM_READ|VM_EXEC| - VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, + vm_flags, &vdso_mapping); to this: /* * MAYWRITE to allow gdb to COW and set breakpoints */ vma = _install_special_mapping(mm, text_start, image->size, VM_READ|VM_EXEC| - VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, + VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC| + VM_SYSTEM_SEAL, &vdso_mapping); No unsigned long vm_flags needed. It's easier to read and I don't think it's any more hidden than the vm_flags |= function call option. > > > > > > > > > > If mm maintainers prefer a #define for now, that's fine of course. The > > > > value of VM_SYSTEM_SEAL can be VM_NONE on 32-bit. > > > > > > Thanks. I think having a flag with VM_NONE on 32-bit is just as sane as > > > a "flags |= system_seal()" call that unconditionally returns 0 on > > > 32-bit. > > > > > Consider the case below in src/third_party/kernel/v6.6/fs/proc/task_mmu.c, > > > > #ifdef CONFIG_64BIT > > [ilog2(VM_SEALED)] = "sl", > > #endif > > > > If #ifdef CONFIG_64BIT is missing, it won't be detected during compile time. > > > > Setting VM_SEALED to VM_NONE could simplify the coding in some cases > > (get/set case), but it might make other cases error prone. > > > > I would prefer to not have VM_SEALED for 32 bit. > > But what I posted leaves VM_SEALED undefined for 32 bit, it defines > VM_SYSTEM_SEALED which can be placed, for instance, into > _install_special_mapping() arguments directly. Reducing the change to > just adding a new flag to the list. And it's applied to all system > mappings based on if it's enabled on compile or not. > > Also: > include/linux/mm.h:#define VM_NONE 0x00000000 > so, I'm not sure what evaluation you are concerned about? It would be > defined as 0. > > Thanks, > Liam > >
On Thu, Feb 13, 2025 at 5:10 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote: > > * Liam R. Howlett <Liam.Howlett@oracle.com> [250213 19:14]: > > * Jeff Xu <jeffxu@chromium.org> [250213 17:00]: > > > On Thu, Feb 13, 2025 at 12:54 PM Liam R. Howlett > > > <Liam.Howlett@oracle.com> wrote: > > > > > > > > > > > > > > > > > VM_SEALED isn't defined in 32-bit systems, and mseal.c isn't part of > > > > > > > the build. This is intentional. Any 32-bit code trying to use the > > > > > > > sealing function or the VM_SEALED flag will immediately fail > > > > > > > compilation. This makes it easier to identify incorrect usage. > > > > > > > > > > > > So you are against using the #define because the VM_SYSTEM_SEAL will be > > > > > > defined, even though it will be VM_NONE? This is no worse than a > > > > > > function that returns 0, and it aligns better with what we have today in > > > > > > that it can be put in the list of other flags. > > > > > > > > > > When I was reading through all of this and considering the history of > > > > > its development goals, it strikes me that a function is easier for the > > > > > future if/when this can be made a boot-time setting. > > > > > > > > > > > > > Reworking this change to function as a boot-time parameter, or whatever, > > > > would not be a significant amount of work, if/when the time comes. > > > > Since we don't know what the future holds, it it unnecessary to engineer > > > > in a potential change for a future version when the change is easy > > > > enough to make later and keep the code cleaner now. > > > > > > > Sure, I will put the function in mm.h for this patch. We can find a > > > proper place when it is time to implement the boot-time parameter > > > change. > > > > > > The call stack for sealing system mapping is something like below: > > > > > > install_special_mapping (mm/mmap.c) > > > map_vdso (arch/x86/entry/vdso/vma.c) > > > load_elf_binary (fs/binfmt_elf.c) > > > load_misc_binary (fs/binfmt_misc.c) > > > bprm_execve (fs/exec.c) > > > do_execveat_common > > > __x64_sys_execve > > > do_syscall_64 > > > > > > IMO, there's a clear divide between the API implementer and the API user. > > > mm and mm.h are the providers, offering the core mm functionality > > > through APIs/data structures like install_special_mapping(). > > > > > > The exe layer (bprm_execve, map_vdso, etc) is the consumer of the > > > install_special_mapping. > > > The logic related to checking if sealing system mapping is enabled > > > belongs to the exe layer. > > > > Since this is an all or nothing enabling, there is no reason to have > > each caller check the same thing and do the same action. You should put > > the logic into the provider - they all end up doing the same thing. > > > > Also, this is a compile time option so it doesn't even need to be > > checked on execution - just apply it in the first place, at the source. > > Your static inline function was already doing this...? > > > > I'm confused as to what you are arguing here because it goes against > > what you had and what I suggested. The alternative you are suggesting > > is more code, more instructions, and the best outcome is the same > > result. > > I think I understand what you are saying now: the interface to > install_special_mapping() needs to take the vma flag, as it does today. > What I don't understand is that what you proposed and what I proposed > both do that. > > What I'm saying is that, since system mappings are enabled, we can > already know implicitly by having VM_SYSTEM_SEAL either VM_NONE or > VM_SEAL. > > Turning this: > > @@ -264,11 +266,12 @@ static int map_vdso(const struct vdso_image *image, unsigned long addr) > /* > * MAYWRITE to allow gdb to COW and set breakpoints > */ > + vm_flags = VM_READ|VM_EXEC|VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC; > + vm_flags |= mseal_system_mappings(); > vma = _install_special_mapping(mm, > text_start, > image->size, > - VM_READ|VM_EXEC| > - VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, > + vm_flags, > &vdso_mapping); > > to this: > > /* > * MAYWRITE to allow gdb to COW and set breakpoints > */ > vma = _install_special_mapping(mm, > text_start, > image->size, > VM_READ|VM_EXEC| > - VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, > + VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC| > + VM_SYSTEM_SEAL, > &vdso_mapping); > > No unsigned long vm_flags needed. It's easier to read and I don't think > it's any more hidden than the vm_flags |= function call option. > The arch code needs a mseal_system_mappings() function. Otherwise, I'll have to change this line (in arch) again when I implement the kernel command line or pre-process opt-in/opt-out, which requires a function call. This isn't overthinking; based on our discussion so far, there are clear needs for a subsequent patch series. This patch is just the first step. For software layering, I'd like to see a clear separation between layers. mm implements _install_special_mapping, which accepts any combination of vm_flags as input. Then I'd like the caller (in arch code) to have all the necessary code to compose the vm_flags in one place. This helps readability. In the past, we discussed merging the vdso/vvar code across all architectures. When that happens, the new code in arch will likely have its own .c and .h files, but it will still sit above mm. That means mm won't need to change, and the _install_special_mapping API from mm will remain unchanged. The mseal_system_mappings() function can already return VM_SEALED, and future patches will add more logic into mseal_system_mappings(), e.g. check for kernel cmdline or opt-in/opt-out. we don't need a separate VM_SYSTEM_SEAL, which is a *** redirect macro ***, I prefer to keep all the important code logic relevant to configuration of enabling system mapping sealing in one place, for easy reading. mseal_system_mappings() can be placed in mm.h in this patch, as you suggested. But in the near future, it will be moved out of mm.h and find a right header. The functionality belongs to exe namespace, because of the reasons I put in the commit message and discussions. Thanks -Jeff -Jeff
On Fri, Feb 14, 2025 at 06:39:48AM -0800, Jeff Xu wrote: > mseal_system_mappings() can be placed in mm.h in this patch, as you > suggested. But in the near future, it will be moved out of mm.h and find > a right header. The functionality belongs to exe namespace, because of > the reasons I put in the commit message and discussions. With respect Jeff - and I feel that this might simply be a miscommunciation here - but this doesn't read wonderfully. 'can be placed' 'it will be moved out', etc. Please try to be respectful of experienced maintainers who are taking their time to review your code, and respond politely and respectfully. I think what you meant to say is something more like: "I'm more than happy to do that, but I feel that it would be more suited in a separate header, as this strictly belongs to the kernel functionality surrounding the execution of code. However we can revisit this at a later time!" My feeling is that this is exactly what you mean, but you are just essentially abbreviating this. However it reads rather rudely, which I'm sure you don't intend. Ultimately the fact of the matter is that your series will be merged when it reaches the standards required of you by the relevant maintainers, as is the case will all code submitted to the kernel when we reach consensus. In this series you have addressed a great number of concerns which has brought the merging of it very much closer, so I hope we can continue in the same vein and reach this consensus soon. Let's try to avoid any miscommunication which might delay us reaching this aim! Thanks, Lorenzo > > Thanks > -Jeff > > -Jeff
On Fri, Feb 14, 2025 at 7:00 AM Lorenzo Stoakes <lorenzo.stoakes@oracle.com> wrote: > > On Fri, Feb 14, 2025 at 06:39:48AM -0800, Jeff Xu wrote: > > mseal_system_mappings() can be placed in mm.h in this patch, as you > > suggested. But in the near future, it will be moved out of mm.h and find > > a right header. The functionality belongs to exe namespace, because of > > the reasons I put in the commit message and discussions. > > With respect Jeff - and I feel that this might simply be a miscommunciation > here - but this doesn't read wonderfully. 'can be placed' 'it will be moved > out', etc. > > Please try to be respectful of experienced maintainers who are taking their > time to review your code, and respond politely and respectfully. I think > what you meant to say is something more like: > > "I'm more than happy to do that, but I feel that it would be more suited in > a separate header, as this strictly belongs to the kernel functionality > surrounding the execution of code. However we can revisit this at a later > time!" > Thanks! I apologize if my expression appears to be rude. Your revision reflects what I'm trying to say, in better english. > My feeling is that this is exactly what you mean, but you are just > essentially abbreviating this. However it reads rather rudely, which I'm > sure you don't intend. > > Ultimately the fact of the matter is that your series will be merged when > it reaches the standards required of you by the relevant maintainers, as is > the case will all code submitted to the kernel when we reach consensus. > > In this series you have addressed a great number of concerns which has > brought the merging of it very much closer, so I hope we can continue in > the same vein and reach this consensus soon. > > Let's try to avoid any miscommunication which might delay us reaching this > aim! > Thanks ! -Jeff > Thanks, Lorenzo > > > > > Thanks > > -Jeff > > > > -Jeff
diff --git a/include/linux/userprocess.h b/include/linux/userprocess.h new file mode 100644 index 000000000000..bd11e2e972c5 --- /dev/null +++ b/include/linux/userprocess.h @@ -0,0 +1,18 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#ifndef _LINUX_USER_PROCESS_H +#define _LINUX_USER_PROCESS_H +#include <linux/mm.h> + +/* + * mseal of userspace process's system mappings. + */ +static inline unsigned long mseal_system_mappings(void) +{ +#ifdef CONFIG_MSEAL_SYSTEM_MAPPINGS + return VM_SEALED; +#else + return 0; +#endif +} + +#endif diff --git a/init/Kconfig b/init/Kconfig index d0d021b3fa3b..892d2bcdf397 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -1882,6 +1882,24 @@ config ARCH_HAS_MEMBARRIER_CALLBACKS config ARCH_HAS_MEMBARRIER_SYNC_CORE bool +config ARCH_HAS_MSEAL_SYSTEM_MAPPINGS + bool + help + Control MSEAL_SYSTEM_MAPPINGS access based on architecture. + + A 64-bit kernel is required for the memory sealing feature. + No specific hardware features from the CPU are needed. + + To enable this feature, the architecture needs to update their + speical mappings calls to include the sealing flag and confirm + that it doesn't unmap/remap system mappings during the life + time of the process. After the architecture enables this, a + distribution can set CONFIG_MSEAL_SYSTEM_MAPPING to manage access + to the feature. + + For complete descriptions of memory sealing, please see + Documentation/userspace-api/mseal.rst + config HAVE_PERF_EVENTS bool help diff --git a/security/Kconfig b/security/Kconfig index f10dbf15c294..bfb35fc7a3c6 100644 --- a/security/Kconfig +++ b/security/Kconfig @@ -51,6 +51,24 @@ config PROC_MEM_NO_FORCE endchoice +config MSEAL_SYSTEM_MAPPINGS + bool "mseal system mappings" + depends on 64BIT + depends on ARCH_HAS_MSEAL_SYSTEM_MAPPINGS + depends on !CHECKPOINT_RESTORE + help + Seal system mappings such as vdso, vvar, sigpage, uprobes, etc. + + A 64-bit kernel is required for the memory sealing feature. + No specific hardware features from the CPU are needed. + + Note: CHECKPOINT_RESTORE, UML, gVisor are known to relocate or + unmap system mapping, therefore this config can't be enabled + universally. + + For complete descriptions of memory sealing, please see + Documentation/userspace-api/mseal.rst + config SECURITY bool "Enable different security models" depends on SYSFS