From: Anthony Yznaga
To: akpm@linux-foundation.org, willy@infradead.org, markhemm@googlemail.com, viro@zeniv.linux.org.uk, david@redhat.com, khalid@kernel.org
Cc: anthony.yznaga@oracle.com, andreyknvl@gmail.com, dave.hansen@intel.com, luto@kernel.org, brauner@kernel.org, arnd@arndb.de, ebiederm@xmission.com, catalin.marinas@arm.com, linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mhiramat@kernel.org, rostedt@goodmis.org, vasily.averin@linux.dev, xhao@linux.alibaba.com, pcc@google.com, neilb@suse.de, maz@kernel.org
Subject: [PATCH v2 00/20] Add support for shared PTEs across processes
Date: Thu, 3 Apr 2025 19:18:42 -0700
Message-ID: <20250404021902.48863-1-anthony.yznaga@oracle.com>
X-Patchwork-Submitter: Anthony Yznaga
X-Patchwork-Id: 14037988

Memory pages shared between processes require page table entries (PTEs) for each process.
Each of these PTEs consumes some memory, and as long as the number of mappings being maintained is small enough, the space consumed by page tables is not objectionable. When very few memory pages are shared between processes, the number of PTEs to maintain is mostly constrained by the number of pages of memory on the system. As the number of shared pages and the number of times pages are shared goes up, the amount of memory consumed by page tables starts to become significant. This issue does not apply to threads. Any number of threads can share the same pages inside a process while sharing the same PTEs. Extending this same model to sharing pages across processes can eliminate this issue for sharing across processes as well.

Some field deployments commonly see memory pages shared across 1000s of processes. On x86_64, each page requires a PTE that is 8 bytes long, which is very small compared to the 4K page size. When 2000 processes map the same page in their address space, each one of them requires 8 bytes for its PTE, and together that adds up to 16K of memory just to hold the PTEs for one 4K page. On a database server with a 300GB SGA, a system crash was seen with an out-of-memory condition when 1500+ clients tried to share this SGA even though the system had 512GB of memory. On this server, the worst case scenario of all 1500 processes mapping every page of the SGA would have required 878GB+ for just the PTEs. If these PTEs could be shared, a substantial amount of memory would be saved.

This patch series implements a mechanism that allows userspace processes to opt into sharing PTEs. It adds a new in-memory filesystem - msharefs. A file created on msharefs represents a shared region where all processes mapping that region will map objects within it with shared PTEs. When the file is created, a new host mm struct is created to hold the shared page tables and vmas for objects later mapped into the shared region.
This host mm struct is associated with the file and not with a task.

When a process mmap's the shared region, a vm flag VM_MSHARE is added to the vma. On page fault the vma is checked for the presence of the VM_MSHARE flag. If found, the host mm is searched for a vma that covers the fault address. Fault handling then continues using that host vma, which establishes PTEs in the host mm. Fault handling in a shared region also links the shared page table to the process page table if the shared page table already exists.

Ioctls are used to map and unmap objects in the shared region and to (eventually) perform other operations on the shared objects such as changing protections.

API
===

The steps to use this feature are:

1. Mount msharefs on /sys/fs/mshare -
        mount -t msharefs msharefs /sys/fs/mshare

2. mshare regions have alignment and size requirements. The start
   address of a region must be aligned to a fixed boundary, and the
   size of the region must be a multiple of that same value. The value
   can be obtained by reading the file /sys/fs/mshare/mshare_info,
   which returns a number in text format.

3. For the process creating an mshare region:
        a. Create a file on /sys/fs/mshare, for example -
                fd = open("/sys/fs/mshare/shareme",
                                O_RDWR|O_CREAT|O_EXCL, 0600);

        b. Establish the size of the region
                ftruncate(fd, BUFFER_SIZE);

        c. Map some memory in the region
                struct mshare_create mcreate;

                mcreate.region_offset = 0;
                mcreate.size = BUFFER_SIZE;
                mcreate.offset = 0;
                mcreate.prot = PROT_READ | PROT_WRITE;
                mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
                mcreate.fd = -1;

                ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate);

        d. Map the mshare region into the process
                mmap((void *)TB(2), BUFFER_SIZE, PROT_READ | PROT_WRITE,
                        MAP_FIXED | MAP_SHARED, fd, 0);

        e. Read from and write to the mshare'd region normally.

4. For processes attaching an mshare region:
        a. Open the file on msharefs, for example -
                fd = open("/sys/fs/mshare/shareme", O_RDWR);

        b. Get information about the mshare'd region from the file:
                struct stat sb;

                fstat(fd, &sb);
                mshare_size = sb.st_size;

        c. Map the mshare'd region into the process
                mmap((void *)TB(2), mshare_size, PROT_READ | PROT_WRITE,
                        MAP_FIXED | MAP_SHARED, fd, 0);

5. To delete the mshare region -
                unlink("/sys/fs/mshare/shareme");


Example Code
============

A snippet of the code that a donor process would run looks like this:

-----------------
        struct mshare_create mcreate;

        fd = open("/sys/fs/mshare/mshare_info", O_RDONLY);
        /* req is a 128-byte buffer; keep one byte for the NUL. */
        n = read(fd, req, 127);
        if (n < 0)
                n = 0;
        req[n] = '\0';
        /* The alignment may not fit in an int, so avoid atoi(). */
        alignsize = strtoul(req, NULL, 10);
        close(fd);
        fd = open("/sys/fs/mshare/shareme", O_RDWR|O_CREAT|O_EXCL, 0600);
        start = alignsize * 4;
        size = alignsize * 2;

        ftruncate(fd, size);

        mcreate.region_offset = 0;
        mcreate.size = size;
        mcreate.offset = 0;
        mcreate.prot = PROT_READ | PROT_WRITE;
        mcreate.flags = MAP_ANONYMOUS | MAP_SHARED | MAP_FIXED;
        mcreate.fd = -1;
        ret = ioctl(fd, MSHAREFS_CREATE_MAPPING, &mcreate);
        if (ret < 0)
                perror("ERROR: MSHAREFS_CREATE_MAPPING");

        addr = mmap((void *)start, size, PROT_READ | PROT_WRITE,
                        MAP_FIXED | MAP_SHARED, fd, 0);
        if (addr == MAP_FAILED)
                perror("ERROR: mmap failed");

        strncpy(addr, "Some random shared text",
                        sizeof("Some random shared text"));
-----------------

A snippet of the code that a consumer process would execute looks like this:

-----------------
        /* O_RDWR is needed for a writable MAP_SHARED mapping. */
        fd = open("/sys/fs/mshare/shareme", O_RDWR);

        fstat(fd, &sb);
        size = sb.st_size;

        if (!size)
                perror("ERROR: mshare region not init'd");

        addr = mmap(0, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        printf("Guest mmap at %p:\n", addr);
        printf("%s\n", addr);
        printf("\nDone\n");
-----------------

v2:
  - Based on mm-unstable as of 2025-04-03 (8ff02705ba8f)
  - Set mshare size via fallocate or ftruncate instead of MSHAREFS_SET_SIZE. Removed MSHAREFS_SET_SIZE/MSHAREFS_GET_SIZE ioctls. Use stat to get size. (David H)
  - Remove spinlock from mshare_data. Initializing the size is protected by the inode lock.
  - Support mapping a single mshare region at different virtual addresses.
  - Support system selection of the start address when mmap'ing an mshare region.
  - Changed MSHAREFS_CREATE_MAPPING and MSHAREFS_UNMAP to use a byte offset to specify the start of a mapping.
  - Updated documentation.

v1: (https://lore.kernel.org/linux-mm/20250124235454.84587-1-anthony.yznaga@oracle.com/)
  - Based on mm-unstable mm-hotfixes-stable-2025-01-16-21-11
  - Use mshare size instead of start address to check if mshare region has been initialized.
  - Share page tables at PUD level instead of PGD.
  - Rename vma_is_shared() to vma_is_mshare() (James H / David H)
  - Introduce and use mmap_read_lock_nested() (Kirill)
  - Use an mmu notifier to flush all TLBs when updating shared pagetable mappings. (Dave Hansen)
  - Move logic for finding the shared vma to use to handle a fault from handle_mm_fault() to do_user_addr_fault() because the arch-specific fault handling checks vma flags for access permissions.
  - Add CONFIG_MSHARE / ARCH_SUPPORTS_MSHARE
  - Add msharefs_get_unmapped_area()
  - Implemented vm_ops->unmap_page_range (Kirill)
  - Update free_pgtables/free_pgd_range to free process pagetable levels but not shared pagetable levels.
  - A first take at cgroup support

RFC v2 -> v3:
  - Now based on 6.11-rc5
  - Addressed many comments from v2.
  - Simplified filesystem code. Removed refcounting of the shared mm_struct allocated for an mshare file. The mm_struct and the pagetables and mappings it contains are freed when the inode is evicted.
  - Switched to an ioctl-based interface. Ioctls implemented are used to set and get the start address and size of an mshare region and to map objects into an mshare region (only anon shared memory is supported in this series).
  - Updated example code

[1] v2: https://lore.kernel.org/linux-mm/cover.1656531090.git.khalid.aziz@oracle.com/

RFC v1 -> v2:
  - Eliminated mshare and mshare_unlink system calls and replaced API with standard mmap and unlink (Based upon v1 patch discussions and LSF/MM discussions)
  - All fd based API (based upon feedback and suggestions from Andy Lutomirski, Eric Biederman, Kirill and others)
  - Added a file /sys/fs/mshare/mshare_info to provide alignment and size requirement info (based upon feedback from Dave Hansen, Mark Hemment and discussions at LSF/MM)
  - Addressed TODOs in v1
  - Added support for directories in msharefs
  - Added locks around any time vma is touched (Dave Hansen)
  - Eliminated the need to point vm_mm in original vmas to the newly synthesized mshare mm
  - Ensured mmap_read_unlock is called for correct mm in handle_mm_fault (Dave Hansen)

Anthony Yznaga (13):
  mm/mshare: allocate an mm_struct for msharefs files
  mm/mshare: add ways to set the size of an mshare region
  mm/mshare: flush all TLBs when updating PTEs in an mshare range
  sched/numa: do not scan msharefs vmas
  mm: add mmap_read_lock_killable_nested()
  mm: add and use unmap_page_range vm_ops hook
  x86/mm: enable page table sharing
  mm: create __do_mmap() to take an mm_struct * arg
  mm: pass the mm in vma_munmap_struct
  mm/mshare: Add an ioctl for unmapping objects in an mshare region
  mm/mshare: provide a way to identify an mm as an mshare host mm
  mm/mshare: get memcg from current->mm instead of mshare mm
  mm/mshare: associate a mem cgroup with an mshare file

Khalid Aziz (7):
  mm: Add msharefs filesystem
  mm/mshare: pre-populate msharefs with information file
  mm/mshare: make msharefs writable and support directories
  mm/mshare: Add a vma flag to indicate an mshare region
  mm/mshare: Add mmap support
  mm/mshare: prepare for page table sharing support
  mm/mshare: Add an ioctl for mapping objects in an mshare region

 Documentation/filesystems/index.rst           |   1 +
 Documentation/filesystems/msharefs.rst        |  96 +++
 .../userspace-api/ioctl/ioctl-number.rst      |   1 +
 arch/Kconfig                                  |   3 +
 arch/x86/Kconfig                              |   1 +
 arch/x86/mm/fault.c                           |  48 +-
 include/linux/memcontrol.h                    |   3 +
 include/linux/mm.h                            |  56 ++
 include/linux/mm_types.h                      |   2 +
 include/linux/mmap_lock.h                     |   7 +
 include/trace/events/mmflags.h                |   7 +
 include/uapi/linux/magic.h                    |   1 +
 include/uapi/linux/msharefs.h                 |  38 +
 ipc/shm.c                                     |  17 +
 kernel/sched/fair.c                           |   3 +-
 mm/Kconfig                                    |   9 +
 mm/Makefile                                   |   4 +
 mm/hugetlb.c                                  |  25 +
 mm/memcontrol.c                               |   3 +-
 mm/memory.c                                   |  81 +-
 mm/mmap.c                                     |  10 +-
 mm/mshare.c                                   | 785 ++++++++++++++++++
 mm/vma.c                                      |  22 +-
 mm/vma.h                                      |   3 +-
 24 files changed, 1176 insertions(+), 50 deletions(-)
 create mode 100644 Documentation/filesystems/msharefs.rst
 create mode 100644 include/uapi/linux/msharefs.h
 create mode 100644 mm/mshare.c
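
The page table overhead figures quoted in the motivation at the top of this letter can be checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope estimate and is not part of the series: the helper names are hypothetical, it counts last-level PTEs only (8 bytes each with 4K pages, as stated for x86_64), and it ignores the upper page table levels.

```c
#include <assert.h>
#include <stdint.h>

/* Size of one x86_64 PTE and of a base page, as stated in the cover letter. */
#define PTE_BYTES  8ULL
#define PAGE_BYTES 4096ULL

/*
 * Hypothetical helper: bytes of last-level PTEs needed when nprocs
 * processes each map all of region_bytes through private page tables.
 */
static uint64_t pte_bytes_unshared(uint64_t region_bytes, uint64_t nprocs)
{
        /* Assumption: the region is a whole number of 4K pages. */
        assert(region_bytes % PAGE_BYTES == 0);

        return (region_bytes / PAGE_BYTES) * PTE_BYTES * nprocs;
}

/*
 * Hypothetical helper: bytes of last-level PTEs needed when one set of
 * PTEs is shared by every process mapping the region.
 */
static uint64_t pte_bytes_shared(uint64_t region_bytes)
{
        assert(region_bytes % PAGE_BYTES == 0);

        return (region_bytes / PAGE_BYTES) * PTE_BYTES;
}
```

Under these assumptions, pte_bytes_unshared(4096, 2000) is 16000 bytes for a single 4K page, and pte_bytes_unshared(300ULL << 30, 1500) is about 878GB, against roughly 600MB from pte_bytes_shared(300ULL << 30) when a single set of PTEs serves all processes.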