
[v10,0/5] Introduce mseal

Message ID 20240415163527.626541-1-jeffxu@chromium.org (mailing list archive)

Message

Jeff Xu April 15, 2024, 4:35 p.m. UTC
From: Jeff Xu <jeffxu@chromium.org>

This is the V10 version; it rebases the v9 patch onto 6.9-rc3.
We also applied and tested mseal() in Chrome and on a Chromebook.

------------------------------------------------------------------

This patchset proposes a new mseal() syscall for the Linux kernel.

In a nutshell, mseal() protects the VMAs of a given virtual memory
range against modifications, such as changes to their permission bits.

Modern CPUs support memory permissions, such as the read/write (RW)
and no-execute (NX) bits. Linux has supported NX since the release of
kernel version 2.6.8 in August 2004 [1]. The memory permission feature
improves the security stance on memory corruption bugs, as an attacker
cannot simply write to arbitrary memory and redirect execution to it:
the memory must be marked with the X bit, or an exception will occur.
Internally, the kernel maintains the memory permissions in a data
structure called VMA (vm_area_struct). mseal() additionally protects
the VMA itself against modifications of the selected seal type.

Memory sealing is useful to mitigate memory corruption issues where a
corrupted pointer is passed to a memory management system. For
example, such an attacker primitive can break control-flow integrity
guarantees since read-only memory that is supposed to be trusted can
become writable or .text pages can get remapped. Memory sealing can
automatically be applied by the runtime loader to seal .text and
.rodata pages and applications can additionally seal security critical
data at runtime. A similar feature already exists in the XNU kernel
with the VM_FLAGS_PERMANENT [3] flag and on OpenBSD with the
mimmutable syscall [4]. Also, Chrome wants to adopt this feature for
their CFI work [2] and this patchset has been designed to be
compatible with the Chrome use case.

Two system calls are involved in sealing the map:  mmap() and mseal().

The new mseal() is a syscall on 64-bit CPUs, with the
following signature:

int mseal(void *addr, size_t len, unsigned long flags)
addr/len: memory range.
flags: reserved.

mseal() blocks the following operations for the given memory range:

1> Unmapping, moving to another location, and shrinking the size,
   via munmap() and mremap(): these can leave an empty space in the
   range, which can then be filled by a VMA with a new set of
   attributes.

2> Moving or expanding a different VMA into the current location,
   via mremap().

3> Modifying a VMA via mmap(MAP_FIXED).

4> Size expansion, via mremap(), does not appear to pose any specific
   risks to sealed VMAs. It is included anyway because the use case is
   unclear. In any case, users can rely on merging to expand a sealed VMA.

5> mprotect() and pkey_mprotect().

6> Some destructive madvise() behaviors (e.g. MADV_DONTNEED) for
   anonymous memory, when users don't have write permission to the
   memory. Those behaviors can alter region contents by discarding
   pages, effectively a memset(0) for anonymous memory.
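
For illustration, here is a minimal user-space sketch of the intended
behavior (not part of the patchset). It assumes kernel headers that
define __NR_mseal, since libc provides no wrapper yet; error handling
is elided:

#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* no libc wrapper yet, invoke the raw syscall */
static int mseal(void *addr, size_t len, unsigned long flags)
{
    return syscall(__NR_mseal, addr, len, flags);
}

int main(void)
{
    size_t len = 4096;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

    mseal(p, len, 0);    /* seal the mapping */

    /* modifications of the sealed VMA are now rejected with EPERM */
    if (mprotect(p, len, PROT_READ) < 0 && errno == EPERM)
        printf("mprotect blocked by seal\n");
    if (munmap(p, len) < 0 && errno == EPERM)
        printf("munmap blocked by seal\n");

    return 0;
}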

The idea that inspired this patch comes from Stephen Röttger’s work in
V8 CFI [5]. The Chrome browser in ChromeOS will be the first user of
this API.

Indeed, the Chrome browser has very specific requirements for sealing,
which are distinct from those of most applications. For example, in
the case of libc, sealing is only applied to read-only (RO) or
read-execute (RX) memory segments (such as .text and .RELRO) to
prevent them from becoming writable; the lifetime of those mappings
is tied to the lifetime of the process.

Chrome wants to seal two large address space reservations that are
managed by different allocators. The memory is mapped RW- and RWX
respectively but write access to it is restricted using pkeys (or in
the future ARM permission overlay extensions). The lifetime of those
mappings is not tied to the lifetime of the process; therefore, while
the memory is sealed, the allocators still need to free or discard the
unused memory, for example with madvise(DONTNEED).

However, always allowing madvise(DONTNEED) on this range poses a
security risk. For example, if a jump instruction crosses a page
boundary and the second page gets discarded, it will overwrite the
target bytes with zeros and change the control flow. Checking
write-permission before the discard operation allows us to control
when the operation is valid. In this case, the madvise will only
succeed if the executing thread has PKEY write permissions and PKRU
changes are protected in software by control-flow integrity.
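
Below is a sketch of this usage pattern (illustrative only; it assumes
x86 pkey support with glibc's pkey_alloc()/pkey_mprotect()/pkey_set()
wrappers, plus the same raw-syscall mseal() wrapper as in the earlier
sketch):

#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

static int mseal(void *addr, size_t len, unsigned long flags)
{
    return syscall(__NR_mseal, addr, len, flags);
}

int main(void)
{
    size_t len = 4096;
    int pkey = pkey_alloc(0, 0);
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

    /* tag the mapping with the pkey, then seal it */
    pkey_mprotect(p, len, PROT_READ | PROT_WRITE, pkey);
    mseal(p, len, 0);

    /* with write access revoked for this thread, the destructive
       discard on the sealed range is refused */
    pkey_set(pkey, PKEY_DISABLE_WRITE);
    if (madvise(p, len, MADV_DONTNEED) < 0 && errno == EPERM)
        printf("discard refused while write-disabled\n");

    /* with write access restored, the allocator can still discard
       unused memory */
    pkey_set(pkey, 0);
    if (madvise(p, len, MADV_DONTNEED) == 0)
        printf("discard allowed with write access\n");

    return 0;
}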

Although the initial version of this patch series is targeting the
Chrome browser as its first user, it became evident during upstream
discussions that we would also want to ensure that the patch set
eventually is a complete solution for memory sealing and compatible
with other use cases. The specific scenario currently in mind is
glibc's use case of loading and sealing ELF executables. To this end,
Stephen is working on a change to glibc to add sealing support to the
dynamic linker, which will seal all non-writable segments at startup.
Once this work is completed, all applications will be able to
automatically benefit from these new protections.

In closing, I would like to formally acknowledge the valuable
contributions received during the RFC process, which were instrumental
in shaping this patch:

Jann Horn: raising awareness and providing valuable insights on the
  destructive madvise operations.
Liam R. Howlett: perf optimization.
Linus Torvalds: assisting in defining system call signature and scope.
Theo de Raadt: sharing the experiences and insight gained from
  implementing mimmutable() in OpenBSD.

MM perf benchmarks
==================
This patch adds a loop to the mprotect/munmap/madvise(DONTNEED) paths
to check the VMAs’ sealing flag, so that no partial update can be made
when any segment within the given memory range is sealed.

To measure the performance impact of this loop, two tests were
developed [8].

The first measures the time taken by a particular system call, using
clock_gettime(CLOCK_MONOTONIC). The second uses
PERF_COUNT_HW_REF_CPU_CYCLES (excluding user space). Both tests show
similar results.
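
For reference, the cycle counter used by the second test can be set up
roughly as follows (a sketch; the actual harness is in [8]):

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* open a per-thread counter of reference CPU cycles spent in the
   kernel; user-space cycles are excluded, as in the second test */
static int open_cycle_counter(void)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_REF_CPU_CYCLES;
    attr.exclude_user = 1;

    return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

The counter is then read() before and after the sampled syscalls; the
difference gives the kernel-cycle cost.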

The tests roughly follow this sequence:
for (i = 0; i < 1000; i++)
    create 1000 mappings (1 page per VMA)
    start the sampling
    for (j = 0; j < 1000; j++)
        mprotect one mapping
    stop and save the sample
    delete 1000 mappings
calculate over all samples.
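
A sketch of how one time sample could be taken in the first test
(illustrative names, not the actual test code from [8]):

#include <sys/mman.h>
#include <time.h>

/* time 'count' mprotect() calls, one mapping each, and return the
   total elapsed nanoseconds; dividing by 'count' gives the per-call
   cost reported in the tables below */
static long long sample_mprotect(void **maps, size_t len, int count)
{
    struct timespec a, b;

    clock_gettime(CLOCK_MONOTONIC, &a);
    for (int j = 0; j < count; j++)
        mprotect(maps[j], len, PROT_READ);
    clock_gettime(CLOCK_MONOTONIC, &b);

    return (b.tv_sec - a.tv_sec) * 1000000000LL +
           (b.tv_nsec - a.tv_nsec);
}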

The tests below were performed on an Intel(R) Pentium(R) Gold 7505 @
2.00GHz Chromebook with 4GB of memory.

Based on the latest upstream code:
The first test (measuring time)
syscall__	vmas	t	t_mseal	delta_ns	per_vma	%
munmap__  	1	909	944	35	35	104%
munmap__  	2	1398	1502	104	52	107%
munmap__  	4	2444	2594	149	37	106%
munmap__  	8	4029	4323	293	37	107%
munmap__  	16	6647	6935	288	18	104%
munmap__  	32	11811	12398	587	18	105%
mprotect	1	439	465	26	26	106%
mprotect	2	1659	1745	86	43	105%
mprotect	4	3747	3889	142	36	104%
mprotect	8	6755	6969	215	27	103%
mprotect	16	13748	14144	396	25	103%
mprotect	32	27827	28969	1142	36	104%
madvise_	1	240	262	22	22	109%
madvise_	2	366	442	76	38	121%
madvise_	4	623	751	128	32	121%
madvise_	8	1110	1324	215	27	119%
madvise_	16	2127	2451	324	20	115%
madvise_	32	4109	4642	534	17	113%

The second test (measuring cpu cycle)
syscall__	vmas	cpu	cmseal	delta_cpu	per_vma	%
munmap__	1	1790	1890	100	100	106%
munmap__	2	2819	3033	214	107	108%
munmap__	4	4959	5271	312	78	106%
munmap__	8	8262	8745	483	60	106%
munmap__	16	13099	14116	1017	64	108%
munmap__	32	23221	24785	1565	49	107%
mprotect	1	906	967	62	62	107%
mprotect	2	3019	3203	184	92	106%
mprotect	4	6149	6569	420	105	107%
mprotect	8	9978	10524	545	68	105%
mprotect	16	20448	21427	979	61	105%
mprotect	32	40972	42935	1963	61	105%
madvise_	1	434	497	63	63	115%
madvise_	2	752	899	147	74	120%
madvise_	4	1313	1513	200	50	115%
madvise_	8	2271	2627	356	44	116%
madvise_	16	4312	4883	571	36	113%
madvise_	32	8376	9319	943	29	111%

Based on these results, for the 6.8 kernel the sealing check adds
20-40 nanoseconds, or around 50-100 CPU cycles, per VMA.

In addition, I applied the sealing patch to the 5.10 kernel:
The first test (measuring time)
syscall__	vmas	t	t_mseal	delta_ns	per_vma	%
munmap__	1	357	390	33	33	109%
munmap__	2	442	463	21	11	105%
munmap__	4	614	634	20	5	103%
munmap__	8	1017	1137	120	15	112%
munmap__	16	1889	2153	263	16	114%
munmap__	32	4109	4088	-21	-1	99%
mprotect	1	235	227	-7	-7	97%
mprotect	2	495	464	-30	-15	94%
mprotect	4	741	764	24	6	103%
mprotect	8	1434	1437	2	0	100%
mprotect	16	2958	2991	33	2	101%
mprotect	32	6431	6608	177	6	103%
madvise_	1	191	208	16	16	109%
madvise_	2	300	324	24	12	108%
madvise_	4	450	473	23	6	105%
madvise_	8	753	806	53	7	107%
madvise_	16	1467	1592	125	8	108%
madvise_	32	2795	3405	610	19	122%
					
The second test (measuring cpu cycle)
syscall__	vmas	cpu	cmseal	delta_cpu	per_vma	%
munmap__	1	684	715	31	31	105%
munmap__	2	861	898	38	19	104%
munmap__	4	1183	1235	51	13	104%
munmap__	8	1999	2045	46	6	102%
munmap__	16	3839	3816	-23	-1	99%
munmap__	32	7672	7887	216	7	103%
mprotect	1	397	443	46	46	112%
mprotect	2	738	788	50	25	107%
mprotect	4	1221	1256	35	9	103%
mprotect	8	2356	2429	72	9	103%
mprotect	16	4961	4935	-26	-2	99%
mprotect	32	9882	10172	291	9	103%
madvise_	1	351	380	29	29	108%
madvise_	2	565	615	49	25	109%
madvise_	4	872	933	61	15	107%
madvise_	8	1508	1640	132	16	109%
madvise_	16	3078	3323	245	15	108%
madvise_	32	5893	6704	811	25	114%

For the 5.10 kernel, the sealing check adds 0-15 ns in time, or 10-30
CPU cycles; in some cases there is even a decrease.

It might be interesting to compare the 5.10 and 6.8 kernels:
The first test (measuring time)
syscall__	vmas	t_5_10	t_6_8	delta_ns	per_vma	%
munmap__	1	357	909	552	552	254%
munmap__	2	442	1398	956	478	316%
munmap__	4	614	2444	1830	458	398%
munmap__	8	1017	4029	3012	377	396%
munmap__	16	1889	6647	4758	297	352%
munmap__	32	4109	11811	7702	241	287%
mprotect	1	235	439	204	204	187%
mprotect	2	495	1659	1164	582	335%
mprotect	4	741	3747	3006	752	506%
mprotect	8	1434	6755	5320	665	471%
mprotect	16	2958	13748	10790	674	465%
mprotect	32	6431	27827	21397	669	433%
madvise_	1	191	240	49	49	125%
madvise_	2	300	366	67	33	122%
madvise_	4	450	623	173	43	138%
madvise_	8	753	1110	357	45	147%
madvise_	16	1467	2127	660	41	145%
madvise_	32	2795	4109	1314	41	147%

The second test (measuring cpu cycle)
syscall__	vmas	cpu_5_10	cpu_6_8	delta_cpu	per_vma	%
munmap__	1	684	1790	1106	1106	262%
munmap__	2	861	2819	1958	979	327%
munmap__	4	1183	4959	3776	944	419%
munmap__	8	1999	8262	6263	783	413%
munmap__	16	3839	13099	9260	579	341%
munmap__	32	7672	23221	15549	486	303%
mprotect	1	397	906	509	509	228%
mprotect	2	738	3019	2281	1140	409%
mprotect	4	1221	6149	4929	1232	504%
mprotect	8	2356	9978	7622	953	423%
mprotect	16	4961	20448	15487	968	412%
mprotect	32	9882	40972	31091	972	415%
madvise_	1	351	434	82	82	123%
madvise_	2	565	752	186	93	133%
madvise_	4	872	1313	442	110	151%
madvise_	8	1508	2271	763	95	151%
madvise_	16	3078	4312	1234	77	140%
madvise_	32	5893	8376	2483	78	142%

From 5.10 to 6.8:
munmap: adds 250-550 ns in time, or 500-1100 CPU cycles, per VMA.
mprotect: adds 200-750 ns in time, or 500-1200 CPU cycles, per VMA.
madvise: adds 33-50 ns in time, or 70-110 CPU cycles, per VMA.

In comparison to the mseal check, which adds 20-40 ns or 50-100 CPU
cycles, the increase from 5.10 to 6.8 is significantly larger,
approximately ten times greater for munmap and mprotect.

When I discussed mm performance with Brian Makin, an engineer who
works on performance, it was brought to my attention that such
benchmarks, which measure millions of mm syscalls in a tight loop, may
not accurately reflect real-world scenarios, such as that of a
database service. Also, this was tested on a single hardware
configuration running ChromeOS; the data from other hardware or
distributions might differ. It might be best to take this data with a
grain of salt.


Change history:
===============
V10:
- rebase to 6.9-rc3 (no code change, conflict resolution only)
- Stephen Röttger applied mseal() in Chrome code, and I tested it on a
  Chromebook; mseal() works as designed.

V9:
- remove mmap(PROT_SEAL) and mmap(MAP_SEALABLE) (Linus, Theo de Raadt)
- Update mseal_test to check for prot bit (Liam R. Howlett)
- Update documentation to give more detail on sealing check (Liam R. Howlett)
- Add seal_elf test.
- Add performance measure data.
- mseal_test: fix arm build.
https://lore.kernel.org/all/20240214151130.616240-1-jeffxu@chromium.org/

V8:
- perf optimization in mmap. (Liam R. Howlett)
- add one testcase (test_seal_zero_address) 
- Update mseal.rst to add note for MAP_SEALABLE.
https://lore.kernel.org/lkml/20240131175027.3287009-1-jeffxu@chromium.org/

V7:
- fix index.rst (Randy Dunlap)
- fix arm build (Randy Dunlap)
- return EPERM for blocked operations (Theo de Raadt)
https://lore.kernel.org/linux-mm/20240122152905.2220849-2-jeffxu@chromium.org/T/

V6:
- Drop RFC from subject, given Linus's general approval.
- Adjust syscall number for mseal (main Jan.11/2024) 
- Code style fix (Matthew Wilcox)
- selftest: use ksft macros (Muhammad Usama Anjum)
- Document fix. (Randy Dunlap)
https://lore.kernel.org/all/20240111234142.2944934-1-jeffxu@chromium.org/

V5:
- fix build issue in mseal-Wire-up-mseal-syscall
  (Suggested by Linus Torvalds, and Greg KH)
- updates on selftest.
https://lore.kernel.org/lkml/20240109154547.1839886-1-jeffxu@chromium.org/#r

V4:
(Suggested by Linus Torvalds)
- new signature: mseal(start,len,flags)
- 32 bit is not supported. vm_seal is removed, use vm_flags instead.
- single bit in vm_flags for sealed state.
- CONFIG_MSEAL kernel config is removed.
- single bit of PROT_SEAL in the "Prot" field of mmap().
Other changes:
- update selftest (Suggested by Muhammad Usama Anjum)
- update documentation.
https://lore.kernel.org/all/20240104185138.169307-1-jeffxu@chromium.org/

V3:
- Abandon the per-syscall approach. (Suggested by Linus Torvalds)
- Organize sealing types around their functionality, such as
  MM_SEAL_BASE, MM_SEAL_PROT_PKEY.
- Extend the scope of sealing from calls originated in userspace to
  both kernel and userspace. (Suggested by Linus Torvalds)
- Add seal type support in mmap(). (Suggested by Pedro Falcato)
- Add a new sealing type: MM_SEAL_DISCARD_RO_ANON to prevent
  destructive operations of madvise. (Suggested by Jann Horn and
  Stephen Röttger)
- Make sealed VMAs mergeable. (Suggested by Jann Horn)
- Add MAP_SEALABLE to mmap()
- Add documentation - mseal.rst
https://lore.kernel.org/linux-mm/20231212231706.2680890-2-jeffxu@chromium.org/

v2:
Use _BITUL to define MM_SEAL_XX type.
Use unsigned long for seal type in sys_mseal() and other functions.
Remove internal VM_SEAL_XX type and convert_user_seal_type().
Remove MM_ACTION_XX type.
Remove caller_origin(ON_BEHALF_OF_XX) and replace with sealing bitmask.
Add more comments in code.
Add a detailed commit message.
https://lore.kernel.org/lkml/20231017090815.1067790-1-jeffxu@chromium.org/

v1:
https://lore.kernel.org/lkml/20231016143828.647848-1-jeffxu@chromium.org/

----------------------------------------------------------------
[1] https://kernelnewbies.org/Linux_2_6_8
[2] https://v8.dev/blog/control-flow-integrity
[3] https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/osfmk/mach/vm_statistics.h#L274
[4] https://man.openbsd.org/mimmutable.2
[5] https://docs.google.com/document/d/1O2jwK4dxI3nRcOJuPYkonhTkNQfbmwdvxQMyXgeaRHo/edit#heading=h.bvaojj9fu6hc
[6] https://lore.kernel.org/lkml/CAG48ez3ShUYey+ZAFsU2i1RpQn0a5eOs2hzQ426FkcgnfUGLvA@mail.gmail.com/
[7] https://lore.kernel.org/lkml/20230515130553.2311248-1-jeffxu@chromium.org/
[8] https://github.com/peaktocreek/mmperf



Jeff Xu (5):
  mseal: Wire up mseal syscall
  mseal: add mseal syscall
  selftest mm/mseal memory sealing
  mseal:add documentation
  selftest mm/mseal read-only elf memory segment

 Documentation/userspace-api/index.rst       |    1 +
 Documentation/userspace-api/mseal.rst       |  199 ++
 arch/alpha/kernel/syscalls/syscall.tbl      |    1 +
 arch/arm/tools/syscall.tbl                  |    1 +
 arch/arm64/include/asm/unistd.h             |    2 +-
 arch/arm64/include/asm/unistd32.h           |    2 +
 arch/m68k/kernel/syscalls/syscall.tbl       |    1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |    1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |    1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |    1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |    1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |    1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |    1 +
 arch/s390/kernel/syscalls/syscall.tbl       |    1 +
 arch/sh/kernel/syscalls/syscall.tbl         |    1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |    1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |    1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |    1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |    1 +
 include/linux/syscalls.h                    |    1 +
 include/uapi/asm-generic/unistd.h           |    5 +-
 kernel/sys_ni.c                             |    1 +
 mm/Makefile                                 |    4 +
 mm/internal.h                               |   37 +
 mm/madvise.c                                |   12 +
 mm/mmap.c                                   |   31 +-
 mm/mprotect.c                               |   10 +
 mm/mremap.c                                 |   31 +
 mm/mseal.c                                  |  307 ++++
 tools/testing/selftests/mm/.gitignore       |    2 +
 tools/testing/selftests/mm/Makefile         |    2 +
 tools/testing/selftests/mm/mseal_test.c     | 1836 +++++++++++++++++++
 tools/testing/selftests/mm/seal_elf.c       |  183 ++
 33 files changed, 2678 insertions(+), 3 deletions(-)
 create mode 100644 Documentation/userspace-api/mseal.rst
 create mode 100644 mm/mseal.c
 create mode 100644 tools/testing/selftests/mm/mseal_test.c
 create mode 100644 tools/testing/selftests/mm/seal_elf.c

Comments

Liam R. Howlett April 16, 2024, 3:13 p.m. UTC | #1
* jeffxu@chromium.org <jeffxu@chromium.org> [240415 12:35]:
> From: Jeff Xu <jeffxu@chromium.org>
> 
> This is V10 version, it rebases v9 patch to 6.9.rc3.
> We also applied and tested mseal() in chrome and chromebook.
> 
> ------------------------------------------------------------------
...

> MM perf benchmarks
> ==================
> This patch adds a loop in the mprotect/munmap/madvise(DONTNEED) to
> check the VMAs’ sealing flag, so that no partial update can be made,
> when any segment within the given memory range is sealed.
> 
> To measure the performance impact of this loop, two tests are developed.
> [8]
> 
> The first is measuring the time taken for a particular system call,
> by using clock_gettime(CLOCK_MONOTONIC). The second is using
> PERF_COUNT_HW_REF_CPU_CYCLES (exclude user space). Both tests have
> similar results.
> 
> The tests have roughly below sequence:
> for (i = 0; i < 1000, i++)
>     create 1000 mappings (1 page per VMA)
>     start the sampling
>     for (j = 0; j < 1000, j++)
>         mprotect one mapping
>     stop and save the sample
>     delete 1000 mappings
> calculates all samples.


Thank you for doing this performance testing.

> 
> Below tests are performed on Intel(R) Pentium(R) Gold 7505 @ 2.00GHz,
> 4G memory, Chromebook.
> 
> Based on the latest upstream code:
> The first test (measuring time)
> syscall__	vmas	t	t_mseal	delta_ns	per_vma	%
> munmap__  	1	909	944	35	35	104%
> munmap__  	2	1398	1502	104	52	107%
> munmap__  	4	2444	2594	149	37	106%
> munmap__  	8	4029	4323	293	37	107%
> munmap__  	16	6647	6935	288	18	104%
> munmap__  	32	11811	12398	587	18	105%
> mprotect	1	439	465	26	26	106%
> mprotect	2	1659	1745	86	43	105%
> mprotect	4	3747	3889	142	36	104%
> mprotect	8	6755	6969	215	27	103%
> mprotect	16	13748	14144	396	25	103%
> mprotect	32	27827	28969	1142	36	104%
> madvise_	1	240	262	22	22	109%
> madvise_	2	366	442	76	38	121%
> madvise_	4	623	751	128	32	121%
> madvise_	8	1110	1324	215	27	119%
> madvise_	16	2127	2451	324	20	115%
> madvise_	32	4109	4642	534	17	113%
> 
> The second test (measuring cpu cycle)
> syscall__	vmas	cpu	cmseal	delta_cpu	per_vma	%
> munmap__	1	1790	1890	100	100	106%
> munmap__	2	2819	3033	214	107	108%
> munmap__	4	4959	5271	312	78	106%
> munmap__	8	8262	8745	483	60	106%
> munmap__	16	13099	14116	1017	64	108%
> munmap__	32	23221	24785	1565	49	107%
> mprotect	1	906	967	62	62	107%
> mprotect	2	3019	3203	184	92	106%
> mprotect	4	6149	6569	420	105	107%
> mprotect	8	9978	10524	545	68	105%
> mprotect	16	20448	21427	979	61	105%
> mprotect	32	40972	42935	1963	61	105%
> madvise_	1	434	497	63	63	115%
> madvise_	2	752	899	147	74	120%
> madvise_	4	1313	1513	200	50	115%
> madvise_	8	2271	2627	356	44	116%
> madvise_	16	4312	4883	571	36	113%
> madvise_	32	8376	9319	943	29	111%
> 

If I am reading this right, madvise() is affected more than the other
calls?  Is that expected or do we need to have a closer look?

...

> When I discuss the mm performance with Brian Makin, an engineer worked
> on performance, it was brought to my attention that such a performance
> benchmarks, which measuring millions of mm syscall in a tight loop, may
> not accurately reflect real-world scenarios, such as that of a database
> service. Also this is tested using a single HW and ChromeOS, the data
> from another HW or distribution might be different. It might be best
> to take this data with a grain of salt.
> 

Absolutely, these types of benchmarks are pointless to simulate what
will really happen with any sane program.

However, they are valuable in that they can highlight areas where
something may have been made more inefficient.  These inefficiencies
would otherwise be lost in the noise of regular system use.  They can be
used as a relatively high-level sanity check on what you believe is going on.

I appreciate you doing the work on testing the performance here.

...
Jeff Xu April 16, 2024, 7:40 p.m. UTC | #2
On Tue, Apr 16, 2024 at 8:13 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * jeffxu@chromium.org <jeffxu@chromium.org> [240415 12:35]:
> > ...
>
> If I am reading this right, madvise() is affected more than the other
> calls?  Is that expected or do we need to have a closer look?
>
madvise() has a bigger percentage increase (per_vma %), but it also
has a smaller base value (cpu).

-Jeff
Suren Baghdasaryan April 18, 2024, 8:19 p.m. UTC | #3
On Tue, Apr 16, 2024 at 12:40 PM Jeff Xu <jeffxu@chromium.org> wrote:
>
> On Tue, Apr 16, 2024 at 8:13 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> >
> > * jeffxu@chromium.org <jeffxu@chromium.org> [240415 12:35]:
> > > ...
> >
> > If I am reading this right, madvise() is affected more than the other
> > calls?  Is that expected or do we need to have a closer look?
> >
> The madvise() has a bigger percentage (per_vma %), but it also has a
> smaller base value (cpu).

Sorry, it's unclear to me what the "vmas" column denotes. Is that how
many VMAs were created before timing the syscall? If so, then 32 is
the max that you show here while you seem to have tested with 1000
VMAs. What is the overhead with 1000 VMAs?
My worry is that if the overhead grows linearly with the number of
VMAs then the effects will be quite noticeable on Android where an
application with a few thousand VMAs is not so unusual.

>
> -Jeff
Jeff Xu April 19, 2024, 1:22 a.m. UTC | #4
On Thu, Apr 18, 2024 at 1:19 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Apr 16, 2024 at 12:40 PM Jeff Xu <jeffxu@chromium.org> wrote:
> >
> > On Tue, Apr 16, 2024 at 8:13 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > >
> > > * jeffxu@chromium.org <jeffxu@chromium.org> [240415 12:35]:
> > > > ...
> > >
> > > If I am reading this right, madvise() is affected more than the other
> > > calls?  Is that expected or do we need to have a closer look?
> > >
> > The madvise() has a bigger percentage (per_vma %), but it also has a
> > smaller base value (cpu).
>
> Sorry, it's unclear to me what the "vmas" column denotes. Is that how
> many VMAs were created before timing the syscall? If so, then 32 is
> the max that you show here while you seem to have tested with 1000
> VMAs. What is the overhead with 1000 VMAs?

The vmas column is the number of VMAs used in one call.

For example: for 32 and mprotect(ptr,size), the memory range used in
mprotect has 32 VMAs.

It also matters how many memory ranges are in use at the time of the
test; this is where 1000 comes in. The test creates 1000 memory
ranges, each memory range has 32 VMAs, then calls mprotect on the 1000
memory ranges. (The pseudocode was included in the original email.)

> My worry is that if the overhead grows linearly with the number of
> VMAs then the effects will be quite noticeable on Android where an
> application with a few thousand VMAs is not so unusual.
>
The overhead is likely to grow linearly with the number of VMAs, since
it takes time to retrieve each VMA's metadata.

Let's use one data sample to look at impact:

Test: munmap 1000 memory ranges, each memory range has 1 VMA

syscall__       vmas    t       t_mseal delta_ns        per_vma %
munmap__        1       909     944     35      35      104%

For those 1000 munmap calls, sealing adds 35000 ns in total, or 35 ns per call.

The delta seems insignificant, e.g. it would take about 28571 munmap
calls to accumulate a 1 ms difference (1000000/35 = 28571).

When I look at the data from 5.10 to 6.8, for the same munmap call,
6.8 adds 552 ns per call, which is 15 times bigger.

syscall__       vmas    t_5_10  t_6_8   delta_ns        per_vma %
munmap__        1       357     909     552     552     254%


> >
> > -Jeff
Suren Baghdasaryan April 19, 2024, 2:57 p.m. UTC | #5
On Thu, Apr 18, 2024 at 6:22 PM Jeff Xu <jeffxu@chromium.org> wrote:
>
> On Thu, Apr 18, 2024 at 1:19 PM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Tue, Apr 16, 2024 at 12:40 PM Jeff Xu <jeffxu@chromium.org> wrote:
> > >
> > > On Tue, Apr 16, 2024 at 8:13 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > >
> > > > > ...
> >
> > Sorry, it's unclear to me what the "vmas" column denotes. Is that how
> > many VMAs were created before timing the syscall? If so, then 32 is
> > the max that you show here while you seem to have tested with 1000
> > VMAs. What is the overhead with 1000 VMAs?
>
> The vmas column is the number of VMA used in one call.
>
> For example: for 32 and mprotect(ptr,size), the memory range used in
> mprotect has 32 VMAs.

Ok, so the 32 here denotes how many VMAs one mprotect() call spans?

>
> It also matters how many memory ranges are in-use at the time of the
> test, This is where 1000 comes in. The test creates 1000 memory
> ranges, each memory range has 32 vmas, then calls mprotect on the 1000
> memory range. (the pseudocode was included in the original email)

So, if each range has 32 vmas and you have 1000 ranges then you are
creating 32000 vmas? Sorry, your pseudocode does not clarify that. My
current understanding is this:

for (i = 0; i < 1000; i++)
    mmap N*1000 areas (N=[1-32])
    start the sampling
    for (j = 0; j < 1000; j++)
        mprotect N areas with one syscall
    stop and save the sample
    munmap N*1000 areas
calculates all samples.

Is that correct?

>
> > My worry is that if the overhead grows linearly with the number of
> > VMAs then the effects will be quite noticeable on Android where an
> > application with a few thousand VMAs is not so unusual.
> >
> The overhead is likely to grow linearly with the number of VMA, since
> it takes time to retrieve VMA's metadata.
>
> Let's use one data sample to look at impact:
>
> Test: munmap 1000 memory range, each memory range has 1 VMA
>
> syscall__       vmas    t       t_mseal delta_ns        per_vma %
> munmap__        1       909     944     35      35      104%
>
> For those 1000 munmap calls, sealing adds 35000 ns in total, or 35 ns per call.
>
> The delta seems to be insignificant. e.g. it will take about 28571
> munmap call to have 1 ms difference (1000000/35=28571)
>
> When I look at the data from 5.10 to 6.8, for the same munmap call,
> 6.8 adds 552 ns per call, which is 15 times bigger.
>
> syscall__       vmas    t_5_10  t_6_8   delta_ns        per_vma %
> munmap__        1       357     909     552     552     254%

I'm not yet claiming the overhead is too high. I want to understand
first what is being measured here.
Thanks,
Suren.

>
>
> > >
> > > -Jeff
Jeff Xu April 19, 2024, 3:14 p.m. UTC | #6
On Fri, Apr 19, 2024 at 7:57 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Thu, Apr 18, 2024 at 6:22 PM Jeff Xu <jeffxu@chromium.org> wrote:
> >
> > On Thu, Apr 18, 2024 at 1:19 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > On Tue, Apr 16, 2024 at 12:40 PM Jeff Xu <jeffxu@chromium.org> wrote:
> > > >
> > > > On Tue, Apr 16, 2024 at 8:13 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > > >
> > > > > * jeffxu@chromium.org <jeffxu@chromium.org> [240415 12:35]:
> > > > > > From: Jeff Xu <jeffxu@chromium.org>
> > > > > >
> > > > > > This is V10 version, it rebases v9 patch to 6.9.rc3.
> > > > > > We also applied and tested mseal() in chrome and chromebook.
> > > > > >
> > > > > > ------------------------------------------------------------------
> > > > > ...
> > > > >
> > > > > > MM perf benchmarks
> > > > > > ==================
> > > > > > This patch adds a loop in the mprotect/munmap/madvise(DONTNEED) to
> > > > > > check the VMAs’ sealing flag, so that no partial update can be made,
> > > > > > when any segment within the given memory range is sealed.
> > > > > >
> > > > > > To measure the performance impact of this loop, two tests are developed.
> > > > > > [8]
> > > > > >
> > > > > > The first is measuring the time taken for a particular system call,
> > > > > > by using clock_gettime(CLOCK_MONOTONIC). The second is using
> > > > > > PERF_COUNT_HW_REF_CPU_CYCLES (exclude user space). Both tests have
> > > > > > similar results.
> > > > > >
> > > > > > The tests have roughly below sequence:
> > > > > > for (i = 0; i < 1000, i++)
> > > > > >     create 1000 mappings (1 page per VMA)
> > > > > >     start the sampling
> > > > > >     for (j = 0; j < 1000, j++)
> > > > > >         mprotect one mapping
> > > > > >     stop and save the sample
> > > > > >     delete 1000 mappings
> > > > > > calculates all samples.
> > > > >
> > > > >
> > > > > Thank you for doing this performance testing.
> > > > >
> > > > > >
> > > > > > Below tests are performed on Intel(R) Pentium(R) Gold 7505 @ 2.00GHz,
> > > > > > 4G memory, Chromebook.
> > > > > >
> > > > > > Based on the latest upstream code:
> > > > > > The first test (measuring time)
> > > > > > syscall__     vmas    t       t_mseal delta_ns        per_vma %
> > > > > > munmap__      1       909     944     35      35      104%
> > > > > > munmap__      2       1398    1502    104     52      107%
> > > > > > munmap__      4       2444    2594    149     37      106%
> > > > > > munmap__      8       4029    4323    293     37      107%
> > > > > > munmap__      16      6647    6935    288     18      104%
> > > > > > munmap__      32      11811   12398   587     18      105%
> > > > > > mprotect      1       439     465     26      26      106%
> > > > > > mprotect      2       1659    1745    86      43      105%
> > > > > > mprotect      4       3747    3889    142     36      104%
> > > > > > mprotect      8       6755    6969    215     27      103%
> > > > > > mprotect      16      13748   14144   396     25      103%
> > > > > > mprotect      32      27827   28969   1142    36      104%
> > > > > > madvise_      1       240     262     22      22      109%
> > > > > > madvise_      2       366     442     76      38      121%
> > > > > > madvise_      4       623     751     128     32      121%
> > > > > > madvise_      8       1110    1324    215     27      119%
> > > > > > madvise_      16      2127    2451    324     20      115%
> > > > > > madvise_      32      4109    4642    534     17      113%
> > > > > >
> > > > > > The second test (measuring cpu cycle)
> > > > > > syscall__     vmas    cpu     cmseal  delta_cpu       per_vma %
> > > > > > munmap__      1       1790    1890    100     100     106%
> > > > > > munmap__      2       2819    3033    214     107     108%
> > > > > > munmap__      4       4959    5271    312     78      106%
> > > > > > munmap__      8       8262    8745    483     60      106%
> > > > > > munmap__      16      13099   14116   1017    64      108%
> > > > > > munmap__      32      23221   24785   1565    49      107%
> > > > > > mprotect      1       906     967     62      62      107%
> > > > > > mprotect      2       3019    3203    184     92      106%
> > > > > > mprotect      4       6149    6569    420     105     107%
> > > > > > mprotect      8       9978    10524   545     68      105%
> > > > > > mprotect      16      20448   21427   979     61      105%
> > > > > > mprotect      32      40972   42935   1963    61      105%
> > > > > > madvise_      1       434     497     63      63      115%
> > > > > > madvise_      2       752     899     147     74      120%
> > > > > > madvise_      4       1313    1513    200     50      115%
> > > > > > madvise_      8       2271    2627    356     44      116%
> > > > > > madvise_      16      4312    4883    571     36      113%
> > > > > > madvise_      32      8376    9319    943     29      111%
> > > > > >
> > > > >
> > > > > If I am reading this right, madvise() is affected more than the other
> > > > > calls?  Is that expected or do we need to have a closer look?
> > > > >
> > > > The madvise() has a bigger percentage (per_vma %), but it also has a
> > > > smaller base value (cpu).
> > >
> > > Sorry, it's unclear to me what the "vmas" column denotes. Is that how
> > > many VMAs were created before timing the syscall? If so, then 32 is
> > > the max that you show here while you seem to have tested with 1000
> > > VMAs. What is the overhead with 1000 VMAs?
> >
> > The vmas column is the number of VMA used in one call.
> >
> > For example: for 32 and mprotect(ptr,size), the memory range used in
> > mprotect has 32 VMAs.
>
> Ok, so the 32 here denotes how many VMAs one mprotect() call spans?
>
Yes.

> >
> > It also matters how many memory ranges are in-use at the time of the
> > test, This is where 1000 comes in. The test creates 1000 memory
> > ranges, each memory range has 32 vmas, then calls mprotect on the 1000
> > memory range. (the pseudocode was included in the original email)
>
> So, if each range has 32 vmas and you have 1000 ranges then you are
> creating 32000 vmas? Sorry, your pseudocode does not clarify that. My
> current understanding is this:
>
> for (i = 0; i < 1000, i++)
>     mmap N*1000 areas (N=[1-32])
>     start the sampling
>     for (j = 0; j < 1000, j++)
>         mprotect N areas with one syscall
>     stop and save the sample
>     munmap N*1000 areas
> calculates all samples.
>
> Is that correct?
>
Yes, there will be 32000 VMAs in the system.

The pseudocode is correct in concept.
The test implementation is slightly different: it uses mprotect to
split the memory and make sure the VMAs don't merge. For details,
reference [8] in the original email links to the test code.
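
For illustration, the splitting works roughly like this (a sketch, not
the actual code from [8]): giving every other page a different
protection keeps neighbouring VMAs from merging.

#include <sys/mman.h>

/* carve nvma one-page VMAs out of one mapping; alternating
   protections prevent adjacent VMAs from merging back together */
static void *make_unmergeable_vmas(int nvma, size_t pagesz)
{
    char *base = mmap(NULL, nvma * pagesz, PROT_READ | PROT_WRITE,
                      MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

    if (base == MAP_FAILED)
        return NULL;

    for (int i = 0; i < nvma; i += 2)
        mprotect(base + (size_t)i * pagesz, pagesz, PROT_READ);

    return base;
}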

-Jeff
Suren Baghdasaryan April 19, 2024, 4:54 p.m. UTC | #7
On Fri, Apr 19, 2024 at 3:15 PM Jeff Xu <jeffxu@chromium.org> wrote:
>
> On Fri, Apr 19, 2024 at 7:57 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > On Thu, Apr 18, 2024 at 6:22 PM Jeff Xu <jeffxu@chromium.org> wrote:
> > >
> > > On Thu, Apr 18, 2024 at 1:19 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > >
> > > > On Tue, Apr 16, 2024 at 12:40 PM Jeff Xu <jeffxu@chromium.org> wrote:
> > > > >
> > > > > On Tue, Apr 16, 2024 at 8:13 AM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
> > > > > >
> > > > > > * jeffxu@chromium.org <jeffxu@chromium.org> [240415 12:35]:
> > > > > > > From: Jeff Xu <jeffxu@chromium.org>
> > > > > > >
> > > > > > > This is V10 version, it rebases v9 patch to 6.9.rc3.
> > > > > > > We also applied and tested mseal() in chrome and chromebook.
> > > > > > >
> > > > > > > ------------------------------------------------------------------
> > > > > > ...
> > > > > >
> > > > > > > MM perf benchmarks
> > > > > > > ==================
> > > > > > > This patch adds a loop in the mprotect/munmap/madvise(DONTNEED) to
> > > > > > > check the VMAs’ sealing flag, so that no partial update can be made,
> > > > > > > when any segment within the given memory range is sealed.
> > > > > > >
> > > > > > > To measure the performance impact of this loop, two tests are developed.
> > > > > > > [8]
> > > > > > >
> > > > > > > The first is measuring the time taken for a particular system call,
> > > > > > > by using clock_gettime(CLOCK_MONOTONIC). The second is using
> > > > > > > PERF_COUNT_HW_REF_CPU_CYCLES (exclude user space). Both tests have
> > > > > > > similar results.
> > > > > > >
> > > > > > > The tests have roughly below sequence:
> > > > > > > for (i = 0; i < 1000, i++)
> > > > > > >     create 1000 mappings (1 page per VMA)
> > > > > > >     start the sampling
> > > > > > >     for (j = 0; j < 1000, j++)
> > > > > > >         mprotect one mapping
> > > > > > >     stop and save the sample
> > > > > > >     delete 1000 mappings
> > > > > > > calculates all samples.
> > > > > >
> > > > > >
> > > > > > Thank you for doing this performance testing.
> > > > > >
> > > > > > >
> > > > > > > Below tests are performed on Intel(R) Pentium(R) Gold 7505 @ 2.00GHz,
> > > > > > > 4G memory, Chromebook.
> > > > > > >
> > > > > > > Based on the latest upstream code:
> > > > > > > The first test (measuring time)
> > > > > > > syscall__     vmas    t       t_mseal delta_ns        per_vma %
> > > > > > > munmap__      1       909     944     35      35      104%
> > > > > > > munmap__      2       1398    1502    104     52      107%
> > > > > > > munmap__      4       2444    2594    149     37      106%
> > > > > > > munmap__      8       4029    4323    293     37      107%
> > > > > > > munmap__      16      6647    6935    288     18      104%
> > > > > > > munmap__      32      11811   12398   587     18      105%
> > > > > > > mprotect      1       439     465     26      26      106%
> > > > > > > mprotect      2       1659    1745    86      43      105%
> > > > > > > mprotect      4       3747    3889    142     36      104%
> > > > > > > mprotect      8       6755    6969    215     27      103%
> > > > > > > mprotect      16      13748   14144   396     25      103%
> > > > > > > mprotect      32      27827   28969   1142    36      104%
> > > > > > > madvise_      1       240     262     22      22      109%
> > > > > > > madvise_      2       366     442     76      38      121%
> > > > > > > madvise_      4       623     751     128     32      121%
> > > > > > > madvise_      8       1110    1324    215     27      119%
> > > > > > > madvise_      16      2127    2451    324     20      115%
> > > > > > > madvise_      32      4109    4642    534     17      113%
> > > > > > >
> > > > > > > The second test (measuring cpu cycle)
> > > > > > > syscall__     vmas    cpu     cmseal  delta_cpu       per_vma %
> > > > > > > munmap__      1       1790    1890    100     100     106%
> > > > > > > munmap__      2       2819    3033    214     107     108%
> > > > > > > munmap__      4       4959    5271    312     78      106%
> > > > > > > munmap__      8       8262    8745    483     60      106%
> > > > > > > munmap__      16      13099   14116   1017    64      108%
> > > > > > > munmap__      32      23221   24785   1565    49      107%
> > > > > > > mprotect      1       906     967     62      62      107%
> > > > > > > mprotect      2       3019    3203    184     92      106%
> > > > > > > mprotect      4       6149    6569    420     105     107%
> > > > > > > mprotect      8       9978    10524   545     68      105%
> > > > > > > mprotect      16      20448   21427   979     61      105%
> > > > > > > mprotect      32      40972   42935   1963    61      105%
> > > > > > > madvise_      1       434     497     63      63      115%
> > > > > > > madvise_      2       752     899     147     74      120%
> > > > > > > madvise_      4       1313    1513    200     50      115%
> > > > > > > madvise_      8       2271    2627    356     44      116%
> > > > > > > madvise_      16      4312    4883    571     36      113%
> > > > > > > madvise_      32      8376    9319    943     29      111%
> > > > > > >
> > > > > >
> > > > > > If I am reading this right, madvise() is affected more than the other
> > > > > > calls?  Is that expected or do we need to have a closer look?
> > > > > >
> > > > > The madvise() has a bigger percentage (per_vma %), but it also has a
> > > > > smaller base value (cpu).
> > > >
> > > > Sorry, it's unclear to me what the "vmas" column denotes. Is that how
> > > > many VMAs were created before timing the syscall? If so, then 32 is
> > > > the max that you show here while you seem to have tested with 1000
> > > > VMAs. What is the overhead with 1000 VMAs?
> > >
> > > The vmas column is the number of VMAs spanned by one call.
> > >
> > > For example: for 32 and mprotect(ptr,size), the memory range used in
> > > mprotect has 32 VMAs.
> >
> > Ok, so the 32 here denotes how many VMAs one mprotect() call spans?
> >
> Yes.
>
> > >
> > > It also matters how many memory ranges are in use at the time of the
> > > test; this is where 1000 comes in. The test creates 1000 memory
> > > ranges, each memory range has 32 VMAs, then calls mprotect on the 1000
> > > memory ranges. (The pseudocode was included in the original email.)
> >
> > So, if each range has 32 vmas and you have 1000 ranges then you are
> > creating 32000 vmas? Sorry, your pseudocode does not clarify that. My
> > current understanding is this:
> >
> > for (i = 0; i < 1000; i++)
> >     mmap N*1000 areas (N=[1-32])
> >     start the sampling
> >     for (j = 0; j < 1000; j++)
> >         mprotect N areas with one syscall
> >     stop and save the sample
> >     munmap N*1000 areas
> > calculate over all samples.
> >
> > Is that correct?
> >
> Yes, there will be 32000 VMAs in the system.
>
> The pseudocode is correct in concept.
> The test implementation is slightly different: it uses mprotect to
> split the memory and make sure the VMAs don't merge. For details,
> reference [8] of the original email links to the test code.

Ok, thanks for the clarifications. I don't think the overhead is high
enough to worry about.
Thanks,
Suren.


>
> -Jeff
Pedro Falcato April 19, 2024, 5:59 p.m. UTC | #8
On Fri, Apr 19, 2024 at 2:22 AM Jeff Xu <jeffxu@chromium.org> wrote:
> The overhead is likely to grow linearly with the number of VMAs, since
> it takes time to retrieve each VMA's metadata.
>
> Let's use one data sample to look at impact:
>
> Test: munmap 1000 memory ranges, each memory range has 1 VMA
>
> syscall__       vmas    t       t_mseal delta_ns        per_vma %
> munmap__        1       909     944     35      35      104%
>
> For those 1000 munmap calls, sealing adds 35000 ns in total, or 35 ns per call.

Have you tried to spray around some likely() and unlikely()s? Does
that make a difference? I'm thinking that sealing VMAs will be very
rare, and mprotect/munmapping them is probably a programming error
anyway, so the extra branches in the mprotect/munmap/madvise (etc)
should be a nice target for some branch annotation.
Jeff Xu April 20, 2024, 1:23 a.m. UTC | #9
On Fri, Apr 19, 2024 at 10:59 AM Pedro Falcato <pedro.falcato@gmail.com> wrote:
>
> On Fri, Apr 19, 2024 at 2:22 AM Jeff Xu <jeffxu@chromium.org> wrote:
> > The overhead is likely to grow linearly with the number of VMAs, since
> > it takes time to retrieve each VMA's metadata.
> >
> > Let's use one data sample to look at impact:
> >
> > Test: munmap 1000 memory ranges, each memory range has 1 VMA
> >
> > syscall__       vmas    t       t_mseal delta_ns        per_vma %
> > munmap__        1       909     944     35      35      104%
> >
> > For those 1000 munmap calls, sealing adds 35000 ns in total, or 35 ns per call.
>
> Have you tried to spray around some likely() and unlikely()s? Does
> that make a difference? I'm thinking that sealing VMAs will be very
> rare, and mprotect/munmapping them is probably a programming error
> anyway, so the extra branches in the mprotect/munmap/madvise (etc)
> should be a nice target for some branch annotation.
>
Most of the cost will be in locating the node in the maple tree that
stores the VMAs; branch annotation is not possible there.

We could put unlikely() in the can_modify_mm check. I suspect it
won't make a measurable difference in the real world. On the other
hand, it also causes no harm, and existing mm code uses
likely/unlikely in a lot of places, so why not.
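
For illustration, a minimal sketch of what that annotation could look
like (the call site is simplified here, this is not the exact v10 hunk):

    /* Sealing violations are expected to be rare; mark the check as
     * the cold path so the common case stays cheap. */
    if (unlikely(!can_modify_mm(mm, start, end)))
        return -EPERM;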

I will send a follow-up patch in the next few days.

Thanks
-Jeff
> --
> Pedro
Andrew Morton May 14, 2024, 5:46 p.m. UTC | #10
On Mon, 15 Apr 2024 16:35:19 +0000 jeffxu@chromium.org wrote:

> This patchset proposes a new mseal() syscall for the Linux kernel.

I have not moved this into mm-stable for a 6.10 merge.  Mainly because
of the total lack of Reviewed-by:s and Acked-by:s.

The code appears to be stable enough for a merge.

It's awkward that we're in conference this week, but I ask people to
give consideration to the desirability of moving mseal() into mainline
sometime over the next week, please.
Kees Cook May 14, 2024, 7:52 p.m. UTC | #11
On Tue, May 14, 2024 at 10:46:46AM -0700, Andrew Morton wrote:
> On Mon, 15 Apr 2024 16:35:19 +0000 jeffxu@chromium.org wrote:
> 
> > This patchset proposes a new mseal() syscall for the Linux kernel.
> 
> I have not moved this into mm-stable for a 6.10 merge.  Mainly because
> of the total lack of Reviewed-by:s and Acked-by:s.

Oh, I thought I had already reviewed it. FWIW, please consider it:

Reviewed-by: Kees Cook <keescook@chromium.org>

> The code appears to be stable enough for a merge.

Agreed.

> It's awkward that we're in conference this week, but I ask people to
> give consideration to the desirability of moving mseal() into mainline
> sometime over the next week, please.

Yes please. :)
Jonathan Corbet May 14, 2024, 8:59 p.m. UTC | #12
Andrew Morton <akpm@linux-foundation.org> writes:

> On Mon, 15 Apr 2024 16:35:19 +0000 jeffxu@chromium.org wrote:
>
>> This patchset proposes a new mseal() syscall for the Linux kernel.
>
> I have not moved this into mm-stable for a 6.10 merge.  Mainly because
> of the total lack of Reviewed-by:s and Acked-by:s.
>
> The code appears to be stable enough for a merge.
>
> It's awkward that we're in conference this week, but I ask people to
> give consideration to the desirability of moving mseal() into mainline
> sometime over the next week, please.

I hate to be obnoxious, but I *was* copied ... :)

Not taking a position on merging, but I have to ask: are we convinced at
this point that mseal() isn't a chrome-only system call?  Did we ever
see the glibc patches that were promised?

Thanks,

jon
Liam R. Howlett May 14, 2024, 9:28 p.m. UTC | #13
* Andrew Morton <akpm@linux-foundation.org> [240514 13:47]:
> On Mon, 15 Apr 2024 16:35:19 +0000 jeffxu@chromium.org wrote:
> 
> > This patchset proposes a new mseal() syscall for the Linux kernel.
> 
> I have not moved this into mm-stable for a 6.10 merge.  Mainly because
> of the total lack of Reviewed-by:s and Acked-by:s.
> 
> The code appears to be stable enough for a merge.
> 
> It's awkward that we're in conference this week, but I ask people to
> give consideration to the desirability of moving mseal() into mainline
> sometime over the next week, please.

I have looked at this code a fair bit at this point, but I wanted to get
some clarification on the fact that we now have mseal doing checks
upfront while others fail in the middle.

mbind:
                /*
                 * If any vma in the range got policy other than MPOL_BIND                                                             
                 * or MPOL_PREFERRED_MANY we return error. We don't reset                                                              
                 * the home node for vmas we already updated before.
                 */ 


mlock:
mlock will abort (through one path) when it sees a gap in mapped areas,
but will not undo what it did so far.

mlock_fixup can fail on vma_modify_flags(), but previous vmas are not
updated.  This can fail due to allocation failures on the splitting of
VMAs (or failed merging).  The allocations could happen before, but this
is more work (but doable, no doubt).

userfaultfd is... complicated.

And even munmap() can fail and NOT undo the previous split(s).
mmap.c:
                        /*
                         * If userfaultfd_unmap_prep returns an error the vmas                                                         
                         * will remain split, but userland will get a                                                                  
                         * highly unexpected error anyway. This is no
                         * different than the case where the first of the two                                                          
                         * __split_vma fails, but we don't undo the first
                         * split, despite we could. This is unlikely enough                                                            
                         * failure that it's not worth optimizing it for.                                                              
                         */


But this one is different from the others...
madvise:
        /*
         * If the interval [start,end) covers some unmapped address                                                                    
         * ranges, just ignore them, but return -ENOMEM at the end.
         * - different from the way of handling in mlock etc.
         */

Either we are planning to clean this up and do what we can up-front, or
just move the mseal check with the rest.  Otherwise we are making a
larger mess with more technical debt for a single user, and I think this
is not an acceptable trade-off.

Considering the benchmarks that were provided, performance arguments
seem like they are not a concern.

I want to know if we are planning to sort and move existing checks if we
proceed with this change?

Thanks,
Liam
Matthew Wilcox May 14, 2024, 9:28 p.m. UTC | #14
On Tue, May 14, 2024 at 02:59:57PM -0600, Jonathan Corbet wrote:
> Andrew Morton <akpm@linux-foundation.org> writes:
> 
> > On Mon, 15 Apr 2024 16:35:19 +0000 jeffxu@chromium.org wrote:
> >
> >> This patchset proposes a new mseal() syscall for the Linux kernel.
> >
> > I have not moved this into mm-stable for a 6.10 merge.  Mainly because
> > of the total lack of Reviewed-by:s and Acked-by:s.
> >
> > The code appears to be stable enough for a merge.
> >
> > It's awkward that we're in conference this week, but I ask people to
> > give consideration to the desirability of moving mseal() into mainline
> > sometime over the next week, please.
> 
> I hate to be obnoxious, but I *was* copied ... :)
> 
> Not taking a position on merging, but I have to ask: are we convinced at
> this point that mseal() isn't a chrome-only system call?  Did we ever
> see the glibc patches that were promised?

I think _this_ version of mseal() is OpenBSD's mimmutable() with a
basically unused extra 'flags' argument.  As such, we have an existence
proof that it's useful beyond Chrome.

I think Liam still had concerns around the
walk-the-vmas-twice-to-error-out-early part of the implementation?
Although we can always fix the implementation later; changing the API
is hard.
Theo de Raadt May 14, 2024, 10:48 p.m. UTC | #15
Matthew Wilcox <willy@infradead.org> wrote:

> > Not taking a position on merging, but I have to ask: are we convinced at
> > this point that mseal() isn't a chrome-only system call?  Did we ever
> > see the glibc patches that were promised?
> 
> I think _this_ version of mseal() is OpenBSD's mimmutable() with a
> basically unused extra 'flags' argument.  As such, we have an existence
> proof that it's useful beyond Chrome.

Yes, it is close enough.

> I think Liam still had concerns around the
> walk-the-vmas-twice-to-error-out-early part of the implementation?
> Although we can always fix the implementation later; changing the API
> is hard.

Yes I am a bit worried about the point Liam brings up -- we've discussed
it privately at length.  Matthew, to keep it short I have a different
viewpoint:

Some of the Linux m* system calls have non-conforming, partial-work-then-return-error
behaviour.  I cannot find anything like this in any system call in any other
operating system, and I believe there is a de facto rule against doing this.
Linux has an optimization which violates this; I think it could be fixed
with fairly minor expense, and I can't imagine it affecting a single application.

I worry that the non-atomicity will one day be used by an attacker.
Andrew Morton May 14, 2024, 11:01 p.m. UTC | #16
On Tue, 14 May 2024 16:48:47 -0600 "Theo de Raadt" <deraadt@openbsd.org> wrote:

> Matthew Wilcox <willy@infradead.org> wrote:
> 
> > > Not taking a position on merging, but I have to ask: are we convinced at
> > > this point that mseal() isn't a chrome-only system call?  Did we ever
> > > see the glibc patches that were promised?
> > 
> > I think _this_ version of mseal() is OpenBSD's mimmutable() with a
> > basically unused extra 'flags' argument.  As such, we have an existence
> > proof that it's useful beyond Chrome.
> 
> Yes, it is close enough.
> 
> > I think Liam still had concerns around the
> > walk-the-vmas-twice-to-error-out-early part of the implementation?
> > Although we can always fix the implementation later; changing the API
> > is hard.
> 
> Yes I am a bit worried about the point Liam brings up -- we've discussed
> it privately at length.  Matthew, to keep it short I have a different
> viewpoint:
> 
> Some of the Linux m* system calls have non-conforming, partial-work-then-return-error
> behaviour.  I cannot find anything like this in any system call in any other
> operating system, and I believe there is a de facto rule against doing this.
> Linux has an optimization which violates this; I think it could be fixed
> with fairly minor expense, and I can't imagine it affecting a single application.

Thanks.

> I worry that the non-atomicity will one day be used by an attacker.

How might an attacker exploit this?
Theo de Raadt May 14, 2024, 11:47 p.m. UTC | #17
Andrew Morton <akpm@linux-foundation.org> wrote:

> > I worry that the non-atomicity will one day be used by an attacker.
> 
> How might an attacker exploit this?

Various ways which are going to be very application specific. Most ways
will depend on munmap / mprotect arguments being incorrect for some
reason, and callers not checking the return values.

After the system call, the memory is in a very surprising configuration.

Consider a larger memory region containing the following sections:

  [regular memory]  [sealed memory]  [regular memory containing a secret]

munmap() gets called on the whole region, for some reason.  The first
section is removed.  It hits the sealed memory, and returns EPERM.  It does
not unmap the sealed region, nor the memory containing the secret.

The return values of mprotect and munmap are *very rarely* checked,
which adds additional intrigue. They are not checked because these
system calls never failed in this way on systems before Linux.

It is difficult to write test programs which fail under the current ENOMEM
situation (the only current failure mode, AFAIK).  But with the new mseal()
EPERM condition, it will be very easy to write programs which leave memory
behind.
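
For instance, a sketch of such a test (whether anything before the
sealed page actually gets unmapped depends on whether the kernel checks
sealing up front or mid-walk; SYS_mseal is assumed to be wired up on
the test kernel):

    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        /* Three adjacent pages; seal only the middle one. */
        char *a = mmap(NULL, 3 * 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        syscall(SYS_mseal, a + 4096, 4096, 0);

        /* Fails with EPERM because of the sealed page; the question
         * is what happened to the first and third pages by then. */
        if (munmap(a, 3 * 4096) < 0)
            perror("munmap");
        return 0;
    }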

I don't know how you'll document this trap in the manual page; let me try.

    If msealed memory is found inside the range [start, start+len],
    earlier memory will be unmapped, but later memory will remain mapped
    and the system call returns error EPERM.

    If kernel memory shortage occurs while unmapping the region, early
    regions may be unmapped but higher regions may remain mapped, and
    the system call may return ENOMEM.

I feel so gross now, time for a shower..
Linus Torvalds May 15, 2024, 12:43 a.m. UTC | #18
On Tue, 14 May 2024 at 15:48, Theo de Raadt <deraadt@openbsd.org> wrote:
>
> and can't imagine it affecting a single application

Honestly, that's the reason for not caring.

You have to do actively wrong things for this to matter AT ALL.

So no, we're not making gratuitous changes for stupid reasons.

> I worry that the non-atomicity will one day be used by an attacker.

Blah blah blah. That's a made-up scare tactic if I ever heard one.
It's unworthy of you.

Anybody who does mprotect/mmap/munmap/whatever over multiple
independent memory mappings had better know exactly what mappings they
are touching. Otherwise they are *already* just doing random crap.

In other words: nobody actually does that. Yes, you have people who
first carve out one big area with an mmap(), and then do their own
memory management within that area. But the point is, they are very
much in control and if they do something inconsistent, they absolutely
only have themselves to blame.

And if you have some app that randomly does mprotect etc over multiple
memory mappings that it doesn't know what the f*^% they are, then
there is no saving such a piece of unbelievable garbage.

So stop the pointless fear-mongering. Linux does the smart thing,
which is to not waste a single cycle on something that cannot possibly
be relevant.

             Linus
Theo de Raadt May 15, 2024, 12:57 a.m. UTC | #19
> > I worry that the non-atomicity will one day be used by an attacker.
> 
> Blah blah blah. That's a made-up scare tactic if I ever heard one.
> It's unworthy of you.

Let's wait and see.

(Linus, don't be a jerk)
Linus Torvalds May 15, 2024, 1:20 a.m. UTC | #20
On Tue, 14 May 2024 at 17:57, Theo de Raadt <deraadt@openbsd.org> wrote:
>
> Let's wait and see.

You may not be aware, but the Open Group literally endorses the Linux model:

  "When mprotect() fails for reasons other than [EINVAL], the
protections on some of the pages in the range [addr,addr+len) may have
been changed"

at least according to this:

    https://pubs.opengroup.org/onlinepubs/9699919799/functions/mprotect.html

so I think your atomicity arguments have always been misleading. At
least for mprotect, POSIX is very explicit about this not being
atomic.

I find very similar wording in mmap:

  "If mmap() fails for reasons other than [EBADF], [EINVAL], or
[ENOTSUP], some of the mappings in the address range starting at addr
and continuing for len bytes may have been unmapped"

Maybe some atomicity rules have always been true for BSD, but they've
never been true for Linux, and while I don't know how authoritative
that opengroup thing is, it's what google found.

> (Linus, don't be a jerk)

I'm not the one who makes unsubstantiated statements and uses scare
tactics to try to make said arguments sound more valid than they are.

So keep your arguments real, please.

               Linus
Theo de Raadt May 15, 2024, 1:47 a.m. UTC | #21
Linus Torvalds <torvalds@linux-foundation.org> wrote:

Regarding mprotect(), POSIX also says:

    An implementation may permit accesses other than those specified by
    prot; however, no implementation shall permit a write to succeed where
    PROT_WRITE has not been set or shall permit any access where PROT_NONE
    alone has been set.

When sealed memory is encountered in the middle of a range, an error
will be returned (which almost no one looks at). Memory after the sealed
region will not be fixed to follow this rule.

It may retain higher permission.

> Maybe some atomicity rules have always been true for BSD, but they've
> never been true for Linux, and while I don't know how authoritative
> that opengroup thing is, it's what google found.

It is not a BSD thing.  I searched many kernels.  I did not find the
Linux behaviour anywhere else.

> > (Linus, don't be a jerk)
> 
> I'm not the one who makes unsubstantiated statements and uses scare
> tactics to try to make said arguments sound more valid than they are.
> 
> So keep your arguments real, please.


CAN YOU PLEASE SHUT IT WITH THE PERSONAL ATTACKS?  ARE YOU SO INSECURE
THAT YOU NEED TO TAKE A TECHNICAL DISCUSSION AND MAKE IT PERSONAL?


In a new world of immutable / sealed memory, I believe there is a much
bigger problem and I would appreciate if the Linux team would give it
some consideration.

mprotect and munmap (and other calls) can now fail, due to intentional
address space manipulation requested by a process (previously).

The other previous errors have been transient system effects, like ENOMEM.

This EPERM with partial change is not transient.  A 5 line test program
can show memory which is not released, or memory which retains
incorrect permissions.

Have any of you written test programs?
Linus Torvalds May 15, 2024, 2:28 a.m. UTC | #22
On Tue, 14 May 2024 at 18:47, Theo de Raadt <deraadt@openbsd.org> wrote:
>
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Regarding mprotect(), POSIX also says:
>
>     An implementation may permit accesses other than those specified by
>     prot; however, no implementation shall permit a write to succeed where
>     PROT_WRITE has not been set or shall permit any access where PROT_NONE
>     alone has been set.

Why do you quote entirely irrelevant issues?

If the mprotect didn't succeed, then clearly the above is irrelevant.

> When sealed memory is encountered in the middle of a range, an error
> will be returned (which almost no one looks at). Memory after the sealed
> region will not be fixed to follow this rule.
>
> It may retain higher permission.

This is not in any way specific to mseal().

Theo, you're making shit up.

You claim that this is somehow new behavior:

> The other previous errors have been transient system effects, like ENOMEM.

but that's simply NOT TRUE. Try this:

    #include <stdio.h>
    #include <sys/mman.h>

    int main(int argc, char **argv)
    {
        /* Just three pages for VM space allocation */
        void *a = mmap(NULL, 3*4096, PROT_READ,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* Make the second page a shared read mapping of stdin */
        mmap(a+4096, 4096, PROT_READ, MAP_FIXED | MAP_SHARED, 0, 0);

        /* Turn them all PROT_WRITE */
        mprotect(a, 3*4096, PROT_WRITE);

        fprintf(stderr, "Write to first page\n");
        *(int *) (a+0) = 0;

        fprintf(stderr, "Write to second page\n");
        *(int *) (a+4096) = 0;

        fprintf(stderr, "Write to third page\n");
        *(int *) (a+2*4096) = 0;
    }

and what you will get (under Linux) is

    $ ./a.out < ./a.out
    Write to first page
    Write to second page
    Segmentation fault (core dumped)

because that mprotect() will have returned EACCES on the shared
mapping, but will have successfully made the first one writable.

End result: this whole "transient system effects" is just not true.
And "mseal()" isn't somethign new.

If somebody makes random mprotect() calls, and doesn't check the
result, they get exactly what they deserve. And mseal() isn't the
issue - bad programming is.

Anyway, you're just making things up for your nonexistent arguments. I'm done.

            Linus
Theo de Raadt May 15, 2024, 2:42 a.m. UTC | #23
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Tue, 14 May 2024 at 18:47, Theo de Raadt <deraadt@openbsd.org> wrote:
> >
> > Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > Regarding mprotect(), POSIX also says:
> >
> >     An implementation may permit accesses other than those specified by
> >     prot; however, no implementation shall permit a write to succeed where
> >     PROT_WRITE has not been set or shall permit any access where PROT_NONE
> >     alone has been set.
> 
> Why do you quote entirely irrelevant issues?
> 
> If the mprotect didn't succeed, then clearly the above is irrelevant.

Imagine the following region:


    <--------------------------------------------- len
    [region PROT_READ] [region PROT_READ + sealed] 
addr ^

then perform
    mprotect(addr, len, PROT_WRITE | PROT_READ);

This will return -1, with EPERM, when it encounters the sealed region.

I believe in Linux, since it has not checked for errors as a first
phase, this changes the first region of memory to PROT_READ |
PROT_WRITE.  Liam, is that correct?  If I am correct, then this
follows:

So tell me -- did the mprotect() system call succeed or did it not
succeed?

If EPERM means it did not succeed, then why is the first region now
writable?

Immediately after this "call that failed", the process can perform a
write to that first region.  But no successful system call was made to
change that memory to PROT_WRITE.

Alternatively, does EPERM mean it did not completely fail, and therefore
it is OK that the prot value has been applied?  That's really obscure
and undocumented.

In any case, it seems PROT_WRITE can be set on memory, and it is even
more pointless than before for userland to check the errno, because you
can't determine the resulting protection on every page of memory.  It's
all a mishmash after that.

(There is no POSIX system call to ask "what is the permission of a page
or region?".)
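
A sketch of a test program to answer that empirically (SYS_mseal is
assumed to be wired up; the write at the end either succeeds or
faults, depending on whether the failed call changed the first
region):

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        /* Two adjacent PROT_READ regions; seal the second one. */
        char *addr = mmap(NULL, 2 * 4096, PROT_READ,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        syscall(SYS_mseal, addr + 4096, 4096, 0);

        if (mprotect(addr, 2 * 4096, PROT_READ | PROT_WRITE) < 0)
            printf("mprotect: %s\n", strerror(errno));

        addr[0] = 1;    /* did the failed call make this writable? */
        printf("first region became writable\n");
        return 0;
    }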

> Theo, you're making shit up.

I'm trying to have a technical discussion.  Please change your approach,
Linus.
Willy Tarreau May 15, 2024, 2:58 a.m. UTC | #24
On Tue, May 14, 2024 at 05:47:30PM -0600, Theo de Raadt wrote:
> Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > > I worry that the non-atomicity will one day be used by an attacker.
> > 
> > How might an attacker exploit this?
> 
> Various ways which are going to be very application specific. Most ways
> will depend on munmap / mprotect arguments being incorrect for some
> reason, and callers not checking the return values.
> 
> After the system call, the memory is in a very surprising configuration.
> 
> Consider a larger memory region containing the following sections:
> 
>   [regular memory]  [sealed memory]  [regular memory containing a secret]
> 
> munmap() gets called on the whole region, for some reason.  The first
> section is removed.  It hits the sealed memory, and returns EPERM.  It does
> not unmap the sealed region, nor the memory containing the secret.

If we consider that the attack consists, for an attacker, in mseal()ing
the beginning of an area so as to pin the whole area by making a
future munmap() fail, maybe we could make munmap() not stop anymore on
such errors and continue to unmap the rest of the area, so that such a
theoretical attack vector cannot work? After all, munmap()
currently skips holes and continues to unmap the area. But then what
would prevent the attacker from doing mseal() on the whole area in this
case, to prevent it from being unmapped?

Wouldn't it be more effective to have a non-resettable prctl() allowing
the application to prefer to be killed upon such an munmap() failure in
order to stay consistent and more robust against such class of attacks?

Willy
Linus Torvalds May 15, 2024, 3:36 a.m. UTC | #25
On Tue, 14 May 2024 at 20:13, Willy Tarreau <w@1wt.eu> wrote:
>
> Wouldn't it be more effective to have a non-resettable prctl() allowing
> the application to prefer to be killed upon such an munmap() failure in
> order to stay consistent and more robust against such class of attacks?

This whole argument is based on a castle of sand, and some notion that
this is a problem in the first place.

Guys, if you let untrusted code execute random system calls, the whole
"look, now unmap() acts oddly" IS THE LEAST OF YOUR ISSUES.

This whole "problem" is made-up. It's not real. Theo is literally
upset about something that Linux has done forever, and that has never
been an issue.

Stop inventing make-believe problems - there are enough *real* bugs
people can look at that you really don't need to.

                 Linus
Linus Torvalds May 15, 2024, 4:14 a.m. UTC | #26
On Tue, 14 May 2024 at 20:36, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Guys, if you let untrusted code execute random system calls, the whole
> "look, now unmap() acts oddly" IS THE LEAST OF YOUR ISSUES.

Side note: it doesn't even help to make things "atomic". munmap() acts
oddly whether it fails completely or whether it fails partially, and if
the user doesn't check the result, neither case is great.

If you want to have some "hardened mseal()", you make any attempt to
change a mseal'ed memory area be a fatal error. The whole "atomic or
not" is a complete red herring.

I'd certainly be ok with that. If the point of mseal is "you can't
change this mapping", then anybody who tries to change it is obviously
untrustworthy, and killing the whole thing sounds perfectly sane to
me.

Maybe that's a first valid use-case for the flags argument.
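
(As a sketch of what that could look like from userspace --
MSEAL_ENFORCE_KILL is a made-up flag name and value, not part of this
series:)

    /* Hypothetical: any later attempt to modify the sealed mapping
     * kills the process instead of returning EPERM. */
    #define MSEAL_ENFORCE_KILL 0x1UL

    syscall(SYS_mseal, addr, len, MSEAL_ENFORCE_KILL);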

            Linus
Liam R. Howlett May 15, 2024, 4:53 a.m. UTC | #27
* Theo de Raadt <deraadt@openbsd.org> [240514 22:42]:
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > On Tue, 14 May 2024 at 18:47, Theo de Raadt <deraadt@openbsd.org> wrote:
> > >
> > > Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > >
> > > Regarding mprotect(), POSIX also says:
> > >
> > >     An implementation may permit accesses other than those specified by
> > >     prot; however, no implementation shall permit a write to succeed where
> > >     PROT_WRITE has not been set or shall permit any access where PROT_NONE
> > >     alone has been set.
> > 
> > Why do you quote entirely irrelevant issues?
> > 
> > If the mprotect didn't succeed, then clearly the above is irrelevant.
> 
> Imagine the following region:
> 
> 
>     <--------------------------------------------- len
>     [region PROT_READ] [region PROT_READ + sealed] 
> addr ^
> 
> then perform
>     mprotect(addr, len, PROT_WRITE | PROT_READ);
> 
> This will return -1, with EPERM, when it encounters the sealed region.
> 
> I believe in Linux, since it has not checked for errors as a first
> phase, this changes the first region of memory to PROT_READ |
> PROT_WRITE.  Liam, is that correct?

I really don't want to fight about this - I just want to have reliable
code that is maintainable.  I think the correctness argument is always
going to be unclear because we're all going to interpret the
documentation from our point of view - which is probably how we got here
in the first place.  My opinion of the matter of correctness is,
obviously, the least important.

My problem right now is that we're changing it so that we are not
consistent in when we should check.  I'm not sure how doing both fits
into either model, but it increases the chance that the next change goes
to the 'wrong' side of the argument (whatever side that happens to be
from your view).

If there isn't a technical reason to keep the check before, then we
should treat mseal the same as all other checks.

If we are going to have an up-front check, then it makes sense to keep
the checks that we can (reasonably) do at the same time together.
Linus, you said up-front checks are a good thing to aim for.

That said, I don't think the example above will allow the mprotect to
succeed at all.  mseal checks the entire region up front while most
other checks occur during the loop across vmas.

Thanks,
Liam
Willy Tarreau May 15, 2024, 6:14 a.m. UTC | #28
On Tue, May 14, 2024 at 09:14:37PM -0700, Linus Torvalds wrote:
> On Tue, 14 May 2024 at 20:36, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Guys, if you let untrusted code execute random system calls, the whole
> > "look, now unmap() acts oddly" IS THE LEAST OF YOUR ISSUES.

I totally agree with this; I'm speaking more about a general
hardening measure, like what is commonly offered via prctl() and that,
for example, manages to mitigate the consequences of a successful RCE.

> Side note: it doesn't even help to make things "atomic". munmap() acts
> oddly whether it fals completely or whether it fails partially, and if
> the user doesn't check the result, neither case is great.

I don't find the "atomic" aspect that important either, however the
munmap() man page says:

  All pages containing a part of the indicated range are unmapped,
  and subsequent references to these pages will generate SIGSEGV.
  It is not an error if the indicated range does not contain any
  mapped pages.

This alone is an encouragement not to check the result. And BTW, what
should one do to try to repair the situation after a failed munmap()?
It reads as "best effort" above: usually upon return, anything that
could be unmapped was unmapped. That's how I'm reading it. I think it's
a nice property that makes this syscall trustable by its users, and
contrary to the atomic aspect I would find it nice if munmap() would
properly unmap everything it can and then return the error caused by
encountering a sealed area. For me that's even the opposite of an
atomic approach; it's really about making sure to best follow the
developer's intent regardless of any obstacles.

> If you want to have some "hardened mseal()", you make any attempt to
> change a mseal'ed memory area be a fatal error. The whole "atomic or
> not" is a complete red herring.

Yep, agreed.

> I'd certainly be ok with that. If the point of mseal is "you can't
> change this mapping", then anybody who tries to change it is obviously
> untrustworthy, and killing the whole thing sounds perfectly sane to
> me.
> 
> Maybe that's a first valid use-case for the flags argument.

That could work for that use case (developer doing mseal, attacker
trying munmap), indeed, though it wouldn't cover the other
way around (attacker doing mseal() in the hope of making a future
munmap() fail).

That's what I like about prctl(): it offers the developer a panoply of
options to decide when and how to lock down a process in order to
mitigate the consequences of exploited bugs.

And it could be independent of this series, by essentially focusing on
the ability to kill a process that fails to munmap() a sealed area. I.e.
no need to tie that property to the area itself; it's a matter of whether
we consider the process sensitive enough or not.
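
(As a sketch, with a made-up prctl name and value -- this is not an
existing prctl, just an illustration of the idea:)

    #include <sys/prctl.h>

    /* Hypothetical one-way knob: once set, a failed munmap() or
     * mprotect() on a sealed area kills the process. */
    #define PR_SET_SEAL_VIOLATION_KILL 77

    prctl(PR_SET_SEAL_VIOLATION_KILL, 1, 0, 0, 0);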

Willy
Jeff Xu May 15, 2024, 5:18 p.m. UTC | #29
On Tue, May 14, 2024 at 2:28 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Andrew Morton <akpm@linux-foundation.org> [240514 13:47]:
> > On Mon, 15 Apr 2024 16:35:19 +0000 jeffxu@chromium.org wrote:
> >
> > > This patchset proposes a new mseal() syscall for the Linux kernel.
> >
> > I have not moved this into mm-stable for a 6.10 merge.  Mainly because
> > of the total lack of Reviewed-by:s and Acked-by:s.
> >
> > The code appears to be stable enough for a merge.
> >
> > It's awkward that we're in conference this week, but I ask people to
> > give consideration to the desirability of moving mseal() into mainline
> > sometime over the next week, please.
>
> I have looked at this code a fair bit at this point, but I wanted to get
> some clarification on the fact that we now have mseal doing checks
> upfront while others fail in the middle.
>
> mbind:
>                 /*
>                  * If any vma in the range got policy other than MPOL_BIND
>                  * or MPOL_PREFERRED_MANY we return error. We don't reset
>                  * the home node for vmas we already updated before.
>                  */
>
>
> mlock:
> mlock will abort (through one path) when it sees a gap in mapped areas,
> but will not undo what it did so far.
>
> mlock_fixup can fail on vma_modify_flags(), but previous vmas are not
> updated.  This can fail due to allocation failures on the splitting of
> VMAs (or failed merging).  The allocations could happen before, but this
> is more work (but doable, no doubt).
>
> userfaultfd is... complicated.
>
> And even munmap() can fail and NOT undo the previous split(s).
> mmap.c:
>                         /*
>                          * If userfaultfd_unmap_prep returns an error the vmas
>                          * will remain split, but userland will get a
>                          * highly unexpected error anyway. This is no
>                          * different than the case where the first of the two
>                          * __split_vma fails, but we don't undo the first
>                          * split, despite we could. This is unlikely enough
>                          * failure that it's not worth optimizing it for.
>                          */
>
>
> But this one is different from the others...
> madvise:
>         /*
>          * If the interval [start,end) covers some unmapped address
>          * ranges, just ignore them, but return -ENOMEM at the end.
>          * - different from the way of handling in mlock etc.
>          */
>
Thanks.

The current mseal patch does up-front checking in two different situations:
1. When applying mseal():
   checking for unallocated memory in the given memory range.

2. When checking the mseal flag during mprotect/munmap/mremap/mmap:
   the check of the mseal flag is placed ahead of the main business logic,
   and treated the same as the input argument checks.

> Either we are planning to clean this up and do what we can up-front, or
> just move the mseal check with the rest.  Otherwise we are making a
> larger mess with more technical debt for a single user, and I think this
> is not an acceptable trade-off.
>
The sealing use case is different from the regular mm APIs and this
doesn't create additional technical debt.  Please allow me to explain
those separately.

The main use case and threat model is that an attacker exploits a
vulnerability and has arbitrary write access to the process, and can
manipulate some arguments to syscalls from some threads. Placing the
check of the mseal flag ahead of mprotect's main business logic is
stricter compared with doing it in-place. It is meant to be harder for
the attacker, e.g. blocking an opportunistic attempt to munmap
by modifying the size argument.
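
(For example -- a sketch of the attack shape; scratch, SCRATCH_LEN and
SEALED_LEN are made-up names for illustration:)

    /* The call the program intended to make: */
    munmap(scratch, SCRATCH_LEN);

    /* The attacker corrupts the length so the range now also spans a
     * sealed region placed right after the scratch area.  With the
     * up-front check the whole call fails with EPERM and nothing is
     * unmapped; with an in-place check the scratch part would already
     * be gone by the time the sealed VMA is reached. */
    munmap(scratch, SCRATCH_LEN + SEALED_LEN);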

The legit app code won't call mprotect/munmap on sealed memory.  It is
irrelevant for both precheck and in-place check approaches, from a
legit app code point of view.

The regular mm APIs' (other than sealing) use case is legit code;
legit code knows the whole picture of its memory mappings and is
unlikely to rely on the opportunistic partial result.

The use cases are different; I hope we look into the difference and
not force them into one direction.

About tech debt: code-wise, placing the pre-check ahead of the main
business logic of the mprotect/munmap APIs reduces the size of the code
change, and is easy to carry from release to release, or to backport.

But let's compare the alternatives - doing it in-place without precheck.
- munmap
munmap calls arch_unmap(mm, start, end) ahead of the main business logic,
so the checking of sealing flags would need to be architecture-specific. In
addition, if arch_unmap() fails due to sealing, the code should
still proceed, till the main business logic fails again.

- mremap/mmap
The check of sealing would be scattered, e.g. checking the src address
range in-place, the dest range in-place, the unmap in-place, etc. The code
is complex and prone to error.

- mprotect/madvise
Easy to change to in-place.

- mseal
mseal() checks for unallocated memory in the given memory range in the
pre-check. Easy to change to in-place (same as mprotect).

The situation in munmap and mremap/mmap makes in-place checks less
desirable imo (see the sketch below).
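
Roughly, the difference in shape between the two approaches (sketch
only; can_modify_vma() is a hypothetical per-VMA helper and the loop
is heavily simplified):

    /* Pre-check: walk the VMAs once, fail before modifying anything. */
    if (!can_modify_mm(mm, start, end))
        return -EPERM;
    /* ... main business logic runs only after this point ... */

    /* In-place alternative: the check sits inside the loop, after
     * earlier VMAs may already have been modified. */
    for_each_vma_range(vmi, vma, end) {
        if (!can_modify_vma(vma))
            return -EPERM;    /* partial work has already been done */
        /* ... modify vma ... */
    }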

> Considering the benchmarks that were provided, performance arguments
> seem like they are not a concern.
>
Yes. Performance is not a factor in making a design choice on this.

> I want to know if we are planning to sort and move existing checks if we
> proceed with this change?
>
I would argue that we should not change the existing mm code. mseal is
new and has no backward-compatibility problem. That is not the case for
mprotect and the other mm APIs. E.g. if we were to change mprotect to add
a precheck for memory gaps, some badly written application might break.

The 'atomic' approach is also really difficult to enforce across the
whole MM area, and mseal() doesn't claim to be atomic. Most regular mm
APIs go deep into mm data structures to update page tables, HW, etc.
Rolling back those error cases adds complexity and performance cost,
and I'm not sure the benefit is worth it. However, atomicity is another
topic to discuss, unrelated to mm sealing.  The current design of mm
sealing is due to its use case and practical coding reasons.

Thanks
-Jeff

> Thanks,
> Liam
Liam R. Howlett May 15, 2024, 10:19 p.m. UTC | #30
* Jeff Xu <jeffxu@chromium.org> [240515 13:18]:
...

> The current mseal patch does up-front checking in two different situations:
> 1. When applying mseal():
>    checking for unallocated memory in the given memory range.
> 
> 2. When checking the mseal flag during mprotect/munmap/mremap/mmap:
>    the check of the mseal flag is placed ahead of the main business logic,
>    and treated the same as the input argument checks.
> 
> > Either we are planning to clean this up and do what we can up-front, or
> > just move the mseal check with the rest.  Otherwise we are making a
> > larger mess with more technical debt for a single user, and I think this
> > is not an acceptable trade-off.
> >
> The sealing use case is different from the regular mm APIs and this
> doesn't create additional technical debt.  Please allow me to explain
> those separately.
> 
> The main use case and threat model is that an attacker exploits a
> vulnerability and has arbitrary write access to the process, and can
> manipulate some arguments to syscalls from some threads. Placing the
> check of the mseal flag ahead of mprotect's main business logic is
> stricter compared with doing it in-place. It is meant to be harder for
> the attacker, e.g. blocking an opportunistic attempt to munmap
> by modifying the size argument.

If you can manipulate some arguments to syscalls, couldn't it avoid
having the VMA mseal'ed?

Again I don't care where the check goes - but having it happen alone is
pointless.

> 
> The legit app code won't call mprotect/munmap on sealed memory.  It is
> irrelevant for both precheck and in-place check approaches, from a
> legit app code point of view.

So let's do them together.

...

> About tech debt: code-wise, placing the pre-check ahead of the main
> business logic of the mprotect/munmap APIs reduces the size of the code
> change, and is easy to carry from release to release, or to backport.

It sounds like the other changes to the looping code in recent kernels
are going to mess up the backporting if we do this with the rest of the
checks.

> 
> But let's compare the alternatives - doing it in-place without precheck.
> - munmap
> munmap calls arch_unmap(mm, start, end) ahead of the main business logic,
> so the checking of sealing flags would need to be architecture-specific. In
> addition, if arch_unmap() fails due to sealing, the code should
> still proceed, till the main business logic fails again.

You are going to mseal the vdso?

> 
> - mremap/mmap
> The check of sealing would be scattered, e.g. checking the src address
> range in-place, the dest range in-place, the unmap in-place, etc. The code
> is complex and prone to error.
> 
> - mprotect/madvise
> Easy to change to in-place.
> 
> - mseal
> mseal() checks for unallocated memory in the given memory range in the
> pre-check. Easy to change to in-place (same as mprotect).
> 
> The situation in munmap and mremap/mmap makes in-place checks less desirable imo.
> 
> > Considering the benchmarks that were provided, performance arguments
> > seem like they are not a concern.
> >
> Yes. Performance is not a factor in making a design choice on this.
> 
> > I want to know if we are planning to sort and move existing checks if we
> > proceed with this change?
> >
> I would argue that we should not change the existing mm code. mseal is
> new and has no backward-compatibility problem. That is not the case for
> mprotect and the other mm APIs. E.g. if we were to change mprotect to add
> a precheck for memory gaps, some badly written application might break.

This is a weak argument. Your new function may break these badly written
applications *if* glibc adds support.  If you're not checking the return
value then it doesn't really matter - the application will run into
issues rather quickly anyways.  The only thing that you could argue is
the speed - but you've proven that false.

> 
> The 'atomic' approach is also really difficult to enforce across the
> whole MM area, and mseal() doesn't claim to be atomic. Most regular mm
> APIs go deep into mm data structures to update page tables, HW, etc.
> Rolling back those error cases adds complexity and performance cost,
> and I'm not sure the benefit is worth it. However, atomicity is another
> topic to discuss, unrelated to mm sealing.  The current design of mm
> sealing is due to its use case and practical coding reasons.

"best effort" is what I'm saying.  It's actually not really difficult to
do atomic, but no one cares besides Theo.

How hard is it to put userfaultfd into your loop and avoid having that
horrible userfaultfd in munmap?  For years people see horrible failure
paths and just dump in a huge comment saying "but it's okay because it's
probably not going to happen".  But now we're putting this test up
front, and doing it alone - Why?

Thanks,
Liam
Jeff Xu May 16, 2024, 12:59 a.m. UTC | #31
On Wed, May 15, 2024 at 3:19 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Jeff Xu <jeffxu@chromium.org> [240515 13:18]:
> ...
>
> > The current mseal patch does up-front checking in two different situations:
> > 1. When applying mseal():
> >    checking for unallocated memory in the given memory range.
> >
> > 2. When checking the mseal flag during mprotect/munmap/mremap/mmap:
> >    the check of the mseal flag is placed ahead of the main business logic,
> >    and treated the same as the input argument checks.
> >
> > > Either we are planning to clean this up and do what we can up-front, or
> > > just move the mseal check with the rest.  Otherwise we are making a
> > > larger mess with more technical debt for a single user, and I think this
> > > is not an acceptable trade-off.
> > >
> > The sealing use case is different from the regular mm APIs and this
> > doesn't create additional technical debt.  Please allow me to explain
> > those separately.
> >
> > The main use case and threat model is that an attacker exploits a
> > vulnerability and has arbitrary write access to the process, and can
> > manipulate some arguments to syscalls from some threads. Placing the
> > check of the mseal flag ahead of mprotect's main business logic is
> > stricter compared with doing it in-place. It is meant to be harder for
> > the attacker, e.g. blocking an opportunistic attempt to munmap
> > by modifying the size argument.
>
> If you can manipulate some arguments to syscalls, couldn't it avoid
> having the VMA mseal'ed?
>
The mm sealing can be applied in advance. This type of approach is
common in sandboxers, e.g. setting up a restrictive environment in
advance.
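
(E.g. a sandbox-style setup, sketched below; fill_trusted_data() is a
made-up helper and SYS_mseal is assumed to be wired up:)

    /* Map and populate the region first, then lock it down before any
     * untrusted code runs in the process. */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    fill_trusted_data(p, len);
    mprotect(p, len, PROT_READ);
    syscall(SYS_mseal, p, len, 0);   /* mprotect/munmap now fail */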

> Again I don't care where the check goes - but having it happen alone is
> pointless.
>
> >
> > The legit app code won't call mprotect/munmap on sealed memory.  It is
> > irrelevant for both precheck and in-place check approaches, from a
> > legit app code point of view.
>
> So let's do them together.
>
For the use case I describe in the threat model, precheck is a better
approach. Legit code doesn't care.

> ...
>
> > About tech debt: code-wise, placing the pre-check ahead of the main
> > business logic of the mprotect/munmap APIs reduces the size of the code
> > change, and is easy to carry from release to release, or to backport.
>
> It sounds like the other changes to the looping code in recent kernels
> are going to mess up the backporting if we do this with the rest of the
> checks.
>
What other changes do you refer to?

I backported V9 to 5.10 when I ran the performance test at your
request, and the backporting to 5.10 is relatively straightforward:
the mseal flag check is placed after the input argument checks and
before the main business logic.

> >
> > But let's compare the alternatives - doing it in-place without precheck.
> > - munmap
> > munmap calls arch_unmap(mm, start, end) ahead of the main business logic,
> > so the checking of sealing flags would need to be architecture-specific. In
> > addition, if arch_unmap() fails due to sealing, the code should
> > still proceed, till the main business logic fails again.
>
> You are going to mseal the vdso?
>
How is that relevant?
To answer your question: I don't know at this moment.
The initial scope of the libc change is sealing the RO/RX parts during
ELF loading, e.g. .text and .RELRO.
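
(A rough sketch of what such a loader-side pass could look like, using
dl_iterate_phdr(); this is only an illustration, not the promised
glibc patch, and SYS_mseal is assumed to be wired up.  Page-alignment
handling is simplified for brevity:)

    #include <link.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Seal every non-writable PT_LOAD segment of each loaded object. */
    static int seal_ro_segments(struct dl_phdr_info *info, size_t size,
                                void *data)
    {
        for (int i = 0; i < info->dlpi_phnum; i++) {
            const ElfW(Phdr) *ph = &info->dlpi_phdr[i];

            if (ph->p_type == PT_LOAD && !(ph->p_flags & PF_W))
                syscall(SYS_mseal,
                        (void *)(info->dlpi_addr + ph->p_vaddr),
                        (size_t)ph->p_memsz, 0);
        }
        return 0;
    }

    /* e.g. called once after relocation is finished: */
    /* dl_iterate_phdr(seal_ro_segments, NULL); */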

> >
> > - mremap/mmap
> > The check of sealing would be scattered, e.g. checking the src address
> > range in-place, the dest range in-place, the unmap in-place, etc. The code
> > is complex and prone to error.
> >
> > - mprotect/madvise
> > Easy to change to in-place.
> >
> > - mseal
> > mseal() checks for unallocated memory in the given memory range in the
> > pre-check. Easy to change to in-place (same as mprotect).
> >
> > The situation in munmap and mremap/mmap makes in-place checks less desirable imo.
> >
> > > Considering the benchmarks that were provided, performance arguments
> > > seem like they are not a concern.
> > >
> > Yes. Performance is not a factor in making a design choice on this.
> >
> > > I want to know if we are planning to sort and move existing checks if we
> > > proceed with this change?
> > >
> > I would argue that we should not change the existing mm code. mseal is
> > new and has no backward-compatibility problem. That is not the case for
> > mprotect and the other mm APIs. E.g. if we were to change mprotect to add
> > a precheck for memory gaps, some badly written application might break.
>
> This is a weak argument. Your new function may break these badly written
> applications *if* glibc adds support.  If you're not checking the return
> value then it doesn't really matter - the application will run into
> issues rather quickly anyways.  The only thing that you could argue is
> the speed - but you've proven that false.
>
The point I raised here is that there is a risk in modifying the mm
APIs' established behavior. The kernel doesn't usually make this kind
of behavior change.

mm sealing is new functionality; I think applications will need to
opt in, e.g. to allow the dynamic linker to seal .text.

> >
> > The 'atomic' approach is also really difficult to enforce across the
> > whole MM area, and mseal() doesn't claim to be atomic. Most regular mm
> > APIs go deep into mm data structures to update page tables, HW, etc.
> > Rolling back those error cases adds complexity and performance cost,
> > and I'm not sure the benefit is worth it. However, atomicity is another
> > topic to discuss, unrelated to mm sealing.  The current design of mm
> > sealing is due to its use case and practical coding reasons.
>
> "best effort" is what I'm saying.  It's actually not really difficult to
> do atomic, but no one cares besides Theo.
>
OK, if you strongly believe in 'atomic' or 'best effort atomic',
whatever it is, consider sending a patch and getting feedback from the
community?

> How hard is it to put userfaultfd into your loop and avoid having that
> horrible userfaultfd in munmap?  For years people see horrible failure
> paths and just dump in a huge comment saying "but it's okay because it's
> probably not going to happen".  But now we're putting this test up
> front, and doing it alone - Why?
>
As a summary of why:
- The use case: it makes it harder for attackers to modify memory
opportunistically.
- Code: less and simpler code changes.

Thanks
-Jeff

> Thanks,
> Liam