From patchwork Fri Nov 22 20:38:27 2024
X-Patchwork-Submitter: Bijan Tabatabai
X-Patchwork-Id: 13883600
From: Bijan Tabatabai
To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, btabatabai@wisc.edu
Cc: akpm@linux-foundation.org, viro@zeniv.linux.org.uk, brauner@kernel.org,
 mingo@redhat.com
Subject: [RFC PATCH 1/4] mm: Add support for File Based Memory Management
Date: Fri, 22 Nov 2024 14:38:27 -0600
Message-Id: <20241122203830.2381905-2-btabatabai@wisc.edu>
In-Reply-To: <20241122203830.2381905-1-btabatabai@wisc.edu>
References: <20241122203830.2381905-1-btabatabai@wisc.edu>

This patch introduces File Based Memory Management (FBMM), which allows
memory managers that are written as filesystems, similar to HugeTLBFS,
to be used transparently by applications.

The steps for using FBMM are the following:
1) Mount the memory management filesystem (MFS)
2) Enable FBMM by writing 1 to /sys/kernel/mm/fbmm/state
3) Set the MFS an application should allocate its memory from by writing
   the MFS's mount directory to /proc/<pid>/fbmm_mnt_dir, where <pid> is
   the PID of the target process.

To have a process use an MFS for the entirety of its execution, one could
use a shim program that writes /proc/self/fbmm_mnt_dir then calls exec for
the target process. We have created such a shim, which can be found at [1].

Providing this transparency is useful because it allows applications to
use an arbitrary MFS to manage their memory without having to modify the
application. Writing memory management functionality as an MFS is useful
for more easily prototyping MM functionality and for maintaining support
for a variety of different hardware configurations or application needs.
FBMM was originally created as a research project at the University of
Wisconsin-Madison [2].

The core of FBMM is found in fs/file_based_mm.c. Other parts of the kernel
call into functions in that file to allow processes to allocate their
memory from an MFS without changing the application's code. For example,
the do_mmap function is modified so that when it is called with the
MAP_ANONYMOUS flag by a process using FBMM, fbmm_get_file is called to
acquire a file in the MFS used by the process, along with the page offset
at which to map that file. do_mmap then proceeds to map that file instead
of anonymous memory, allowing the desired MFS to control the memory
behavior of the mapped region. A similar process happens inside the brk
syscall implementation. Another example is handle_mm_fault being modified
to call fbmm_fault for regions using FBMM, which invokes the MFS's page
fault handler.
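For illustration, a minimal sketch of the shim approach described above
might look like the following. This is not the wrapper from [1] (which may
differ in details); the program name and argument handling here are
hypothetical:

/* fbmm_shim.c - illustrative sketch, not the shim from [1] */
/* usage: fbmm_shim <mfs mount dir> <command> [args...] */
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	FILE *f;

	if (argc < 3) {
		fprintf(stderr, "usage: %s <mfs mount dir> <command> [args...]\n",
			argv[0]);
		return 1;
	}

	/*
	 * Point this process at the MFS. The shim approach relies on the
	 * fbmm_mnt_dir setting surviving exec so the target inherits it.
	 */
	f = fopen("/proc/self/fbmm_mnt_dir", "w");
	if (!f) {
		perror("fopen");
		return 1;
	}
	fprintf(f, "%s", argv[1]);
	fclose(f);

	execvp(argv[2], &argv[2]);
	perror("execvp");
	return 1;
}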
The main overhead of FBMM comes from creating the files for the process to
memory map. To amortize this cost, we give files created by FBMM a large
virtual size (currently 128GB) and have multiple calls to mmap/brk share a
file. The fbmm_get_file function handles this logic. It takes the size of
a new allocation and the virtual address it will be mapped to. On a
process's first call to fbmm_get_file, it creates a new file and assigns
the file a virtual address range that it can be mapped to. Files created
by FBMM are added to a per-process tree indexed by each file's virtual
address range. On subsequent calls to fbmm_get_file, it searches the tree
for a file that can fit the new memory allocation. If such a file does not
exist, a new file is created and added to the tree of files.

A pointer to an fbmm_info struct is added to task_struct to keep track of
the state used by FBMM. This includes the path to the MFS used by the
process and the tree of files used by the process.

Signed-off-by: Bijan Tabatabai

[1] https://github.com/multifacet/fbmm-workspace/blob/main/bmks/fbmm_wrapper.c
[2] https://www.usenix.org/conference/atc24/presentation/tabatabai
---
 fs/Kconfig                    |   7 +
 fs/Makefile                   |   1 +
 fs/file_based_mm.c            | 564 ++++++++++++++++++++++++++++++++++
 fs/proc/base.c                |   4 +
 include/linux/file_based_mm.h |  81 +++++
 include/linux/mm.h            |  10 +
 include/linux/sched.h         |   4 +
 kernel/exit.c                 |   3 +
 kernel/fork.c                 |   3 +
 mm/gup.c                      |   1 +
 mm/memory.c                   |   2 +
 mm/mmap.c                     |  42 ++-
 12 files changed, 719 insertions(+), 3 deletions(-)
 create mode 100644 fs/file_based_mm.c
 create mode 100644 include/linux/file_based_mm.h

diff --git a/fs/Kconfig b/fs/Kconfig
index a46b0cbc4d8f..52994b2491fe 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -96,6 +96,13 @@ config FS_DAX_PMD
 	depends on ZONE_DEVICE
 	depends on TRANSPARENT_HUGEPAGE
 
+config FILE_BASED_MM
+	bool "File Based Memory Management"
+	help
+	  This option enables file based memory management (FBMM). FBMM allows users
+	  to have a process transparently allocate its memory from a memory manager
+	  that is written as a filesystem.
+
 # Selected by DAX drivers that do not expect filesystem DAX to support
 # get_user_pages() of DAX mappings. I.e.
"limited" indicates no support # for fork() of processes with MAP_SHARED mappings or support for diff --git a/fs/Makefile b/fs/Makefile index 6ecc9b0a53f2..f1a5e540fe72 100644 --- a/fs/Makefile +++ b/fs/Makefile @@ -45,6 +45,7 @@ obj-$(CONFIG_FS_POSIX_ACL) += posix_acl.o obj-$(CONFIG_NFS_COMMON) += nfs_common/ obj-$(CONFIG_COREDUMP) += coredump.o obj-$(CONFIG_SYSCTL) += drop_caches.o sysctls.o +obj-$(CONFIG_FILE_BASED_MM) += file_based_mm.o obj-$(CONFIG_FHANDLE) += fhandle.o obj-y += iomap/ diff --git a/fs/file_based_mm.c b/fs/file_based_mm.c new file mode 100644 index 000000000000..c05797d51cb3 --- /dev/null +++ b/fs/file_based_mm.c @@ -0,0 +1,564 @@ +// SPDX-License-Identifier: GPL-2.0-only +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "proc/internal.h" + +enum file_based_mm_state { + FBMM_OFF = 0, + FBMM_ON = 1, +}; + +#define FBMM_DEFAULT_FILE_SIZE (128L << 30) +struct fbmm_file { + struct file *f; + /* The starting virtual address assigned to this file (inclusive) */ + unsigned long va_start; + /* The ending virtual address assigned to this file (exclusive) */ + unsigned long va_end; +}; + +static enum file_based_mm_state fbmm_state = FBMM_OFF; + +const int GUA_OPEN_FLAGS = O_EXCL | O_TMPFILE | O_RDWR; +const umode_t GUA_OPEN_MODE = S_IFREG | 0600; + +static struct fbmm_info *fbmm_create_new_info(char *mnt_dir_str) +{ + struct fbmm_info *info; + int ret; + + info = kmalloc(sizeof(struct fbmm_info), GFP_KERNEL); + if (!info) + return NULL; + + info->mnt_dir_str = mnt_dir_str; + ret = kern_path(mnt_dir_str, LOOKUP_DIRECTORY | LOOKUP_FOLLOW, &info->mnt_dir_path); + if (ret) { + kfree(info); + return NULL; + } + + info->get_unmapped_area_file = file_open_root(&info->mnt_dir_path, "", + GUA_OPEN_FLAGS, GUA_OPEN_MODE); + if (IS_ERR(info->get_unmapped_area_file)) { + path_put(&info->mnt_dir_path); + kfree(info); + return NULL; + } + + mt_init(&info->files_mt); + + return info; +} + +static void drop_fbmm_file(struct fbmm_file *file) +{ + if (atomic_dec_return(&file->refcount) == 0) { + fput(file->f); + kfree(file); + } +} + +static pmdval_t fbmm_alloc_pmd(struct vm_fault *vmf) +{ + struct mm_struct *mm = vmf->vma->vm_mm; + unsigned long address = vmf->address; + pgd_t *pgd; + p4d_t *p4d; + + pgd = pgd_offset(mm, address); + p4d = p4d_alloc(mm, pgd, address); + if (!p4d) + return VM_FAULT_OOM; + + vmf->pud = pud_alloc(mm, p4d, address); + if (!vmf->pud) + return VM_FAULT_OOM; + + vmf->pmd = pmd_alloc(mm, vmf->pud, address); + if (!vmf->pmd) + return VM_FAULT_OOM; + + vmf->orig_pmd = pmdp_get_lockless(vmf->pmd); + + return pmd_val(*vmf->pmd); +} + +inline bool is_vm_fbmm_page(struct vm_area_struct *vma) +{ + return !!(vma->vm_flags & VM_FBMM); +} + +int fbmm_fault(struct vm_area_struct *vma, unsigned long address, unsigned int flags) +{ + struct vm_fault vmf = { + .vma = vma, + .address = address & PAGE_MASK, + .real_address = address, + .flags = flags, + .pgoff = linear_page_index(vma, address), + .gfp_mask = mapping_gfp_mask(vma->vm_file->f_mapping) | __GFP_FS | __GFP_IO, + }; + + if (fbmm_alloc_pmd(&vmf) == VM_FAULT_OOM) + return VM_FAULT_OOM; + + return vma->vm_ops->fault(&vmf); +} + +bool use_file_based_mm(struct task_struct *tsk) +{ + if (fbmm_state == FBMM_OFF) + return false; + else + return tsk->fbmm_info && tsk->fbmm_info->mnt_dir_str; +} + +unsigned long fbmm_get_unmapped_area(unsigned long addr, unsigned long len, + unsigned long pgoff, unsigned long flags) 
+{ + struct fbmm_info *info; + + info = current->fbmm_info; + if (!info) + return -EINVAL; + + return get_unmapped_area(info->get_unmapped_area_file, addr, len, pgoff, flags); +} + +struct file *fbmm_get_file(struct task_struct *tsk, unsigned long addr, unsigned long len, + unsigned long prot, int flags, bool topdown, unsigned long *pgoff) +{ + struct file *f; + struct fbmm_file *fbmm_file; + struct fbmm_info *info; + struct path *path; + int open_flags = O_EXCL | O_TMPFILE; + unsigned long truncate_len; + umode_t open_mode = S_IFREG; + s64 ret = 0; + + info = tsk->fbmm_info; + if (!info) + return NULL; + + /* Does a file exist that will already fit this mmap call? */ + fbmm_file = mt_prev(&info->files_mt, addr + 1, 0); + if (fbmm_file) { + /* + * Just see if this mmap will fit inside the file. + * We don't need to check if other mappings in the file overlap + * because get_unmapped_area should have done that already. + */ + if (fbmm_file->va_start <= addr && addr + len <= fbmm_file->va_end) { + f = fbmm_file->f; + goto end; + } + } + + /* Determine what flags to use for the call to open */ + if (prot & PROT_EXEC) + open_mode |= 0100; + + if ((prot & (PROT_READ | PROT_WRITE)) == (PROT_READ | PROT_WRITE)) { + open_flags |= O_RDWR; + open_mode |= 0600; + } else if (prot & PROT_WRITE) { + open_flags |= O_WRONLY; + open_mode |= 0200; + } else if (prot & PROT_READ) { + /* It doesn't make sense for anon memory to be read only */ + return NULL; + } + + path = &info->mnt_dir_path; + f = file_open_root(path, "", open_flags, open_mode); + if (IS_ERR(f)) + return NULL; + + /* + * It takes time to create new files and create new VMAs for mappings + * with different files, so we want to create huge files that we can reuse + * for different calls to mmap + */ + if (len < FBMM_DEFAULT_FILE_SIZE) + truncate_len = FBMM_DEFAULT_FILE_SIZE; + else + truncate_len = len; + ret = vfs_truncate(&f->f_path, truncate_len); + if (ret) { + filp_close(f, current->files); + return (struct file *)ret; + } + + fbmm_file = kmalloc(sizeof(struct fbmm_file), GFP_KERNEL); + if (!fbmm_file) { + filp_close(f, current->files); + return NULL; + } + fbmm_file->f = f; + if (topdown) { + /* + * Since VAs in this region grow down, this mapping will be the + * "end" of the file + */ + fbmm_file->va_end = addr + len; + fbmm_file->va_start = fbmm_file->va_end - truncate_len; + } else { + fbmm_file->va_start = addr; + fbmm_file->va_end = addr + truncate_len; + } + + mtree_store(&info->files_mt, fbmm_file->va_start, fbmm_file, GFP_KERNEL); + +end: + if (f && !IS_ERR(f)) + *pgoff = (addr - fbmm_file->va_start) >> PAGE_SHIFT; + + return f; +} + +void fbmm_populate_file(unsigned long start, unsigned long len) +{ + struct fbmm_info *info; + struct fbmm_file *file = NULL; + loff_t offset; + + info = current->fbmm_info; + if (!info) + return; + + file = mt_prev(&info->files_mt, start, 0); + if (!file || file->va_end <= start) + return; + + offset = start - file->va_start; + vfs_fallocate(file->f, 0, offset, len); +} + +int fbmm_munmap(struct task_struct *tsk, unsigned long start, unsigned long len) +{ + struct fbmm_info *info = NULL; + struct fbmm_file *fbmm_file = NULL; + struct fbmm_file *prev_file = NULL; + unsigned long end = start + len; + unsigned long falloc_start_offset, falloc_end_offset, falloc_len; + int ret = 0; + + info = tsk->fbmm_info; + if (!info) + return 0; + + /* + * Finds the last (by va_start) mapping where file->va_start <= start, so we have to + * check this file is actually within the range + */ + fbmm_file = 
mt_prev(&info->files_mt, start + 1, 0);
+	if (!fbmm_file || fbmm_file->va_end <= start)
+		goto exit;
+
+	/*
+	 * Since the ranges overlap, we have to keep going backwards until we
+	 * find the first mapping where file->va_start <= start and file->va_end > start
+	 */
+	while (1) {
+		prev_file = mt_prev(&info->files_mt, fbmm_file->va_start, 0);
+		if (!prev_file || prev_file->va_end <= start)
+			break;
+		fbmm_file = prev_file;
+	}
+
+	/*
+	 * A munmap call can span multiple memory ranges, so we might have to do this
+	 * multiple times
+	 */
+	while (fbmm_file) {
+		if (start > fbmm_file->va_start)
+			falloc_start_offset = start - fbmm_file->va_start;
+		else
+			falloc_start_offset = 0;
+
+		if (fbmm_file->va_end <= end)
+			falloc_end_offset = fbmm_file->va_end - fbmm_file->va_start;
+		else
+			falloc_end_offset = end - fbmm_file->va_start;
+
+		falloc_len = falloc_end_offset - falloc_start_offset;
+
+		ret = vfs_fallocate(fbmm_file->f,
+				FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+				falloc_start_offset, falloc_len);
+
+		fbmm_file = mt_next(&info->files_mt, fbmm_file->va_start, ULONG_MAX);
+		if (!fbmm_file || fbmm_file->va_end <= start)
+			break;
+	}
+
+exit:
+	return ret;
+}
+
+static void fbmm_free_info(struct task_struct *tsk)
+{
+	struct fbmm_file *file;
+	struct fbmm_info *info = tsk->fbmm_info;
+	unsigned long index = 0;
+
+	mt_for_each(&info->files_mt, file, index, ULONG_MAX) {
+		drop_fbmm_file(file);
+	}
+	mtree_destroy(&info->files_mt);
+
+	if (info->mnt_dir_str) {
+		path_put(&info->mnt_dir_path);
+		fput(info->get_unmapped_area_file);
+		kfree(info->mnt_dir_str);
+	}
+	kfree(info);
+}
+
+void fbmm_exit(struct task_struct *tsk)
+{
+	if (tsk->tgid != tsk->pid)
+		return;
+
+	if (!tsk->fbmm_info)
+		return;
+
+	fbmm_free_info(tsk);
+}
+
+int fbmm_copy(struct task_struct *src_tsk, struct task_struct *dst_tsk, u64 clone_flags)
+{
+	struct fbmm_info *info;
+	char *buffer;
+	char *src_dir;
+
+	/* If this new task is just a thread, not a new process, just copy fbmm info */
+	if (clone_flags & CLONE_THREAD) {
+		dst_tsk->fbmm_info = src_tsk->fbmm_info;
+		return 0;
+	}
+
+	/* Does the src actually have a default mnt dir */
+	if (!use_file_based_mm(src_tsk)) {
+		dst_tsk->fbmm_info = NULL;
+		return 0;
+	}
+	info = src_tsk->fbmm_info;
+
+	/* Make a new fbmm_info with the same mnt dir */
+	src_dir = info->mnt_dir_str;
+
+	buffer = kstrndup(src_dir, PATH_MAX, GFP_KERNEL);
+	if (!buffer)
+		return -ENOMEM;
+
+	dst_tsk->fbmm_info = fbmm_create_new_info(buffer);
+	if (!dst_tsk->fbmm_info) {
+		kfree(buffer);
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
+static ssize_t fbmm_state_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%d\n", fbmm_state);
+}
+
+static ssize_t fbmm_state_store(struct kobject *kobj,
+		struct kobj_attribute *attr,
+		const char *buf, size_t count)
+{
+	int state;
+	int ret;
+
+	ret = kstrtoint(buf, 0, &state);
+
+	if (ret != 0) {
+		fbmm_state = FBMM_OFF;
+		return ret;
+	} else if (state == 0) {
+		fbmm_state = FBMM_OFF;
+	} else {
+		fbmm_state = FBMM_ON;
+	}
+	return count;
+}
+static struct kobj_attribute fbmm_state_attribute =
+__ATTR(state, 0644, fbmm_state_show, fbmm_state_store);
+
+static struct attribute *file_based_mm_attr[] = {
+	&fbmm_state_attribute.attr,
+	NULL,
+};
+
+static const struct attribute_group file_based_mm_attr_group = {
+	.attrs = file_based_mm_attr,
+};
+
+static ssize_t fbmm_mnt_dir_read(struct file *file, char __user *ubuf,
+		size_t count, loff_t *ppos)
+{
+	struct task_struct *task = get_proc_task(file_inode(file));
+	char *buffer;
+
struct fbmm_info *info; + size_t len, ret; + + if (!task) + return -ESRCH; + + buffer = kmalloc(PATH_MAX + 1, GFP_KERNEL); + if (!buffer) { + put_task_struct(task); + return -ENOMEM; + } + + info = task->fbmm_info; + if (info && info->mnt_dir_str) + len = sprintf(buffer, "%s\n", info->mnt_dir_str); + else + len = sprintf(buffer, "not enabled\n"); + + ret = simple_read_from_buffer(ubuf, count, ppos, buffer, len); + + kfree(buffer); + put_task_struct(task); + + return ret; +} + +static ssize_t fbmm_mnt_dir_write(struct file *file, const char __user *ubuf, + size_t count, loff_t *ppos) +{ + struct task_struct *task; + struct path p; + char *buffer; + struct fbmm_info *info; + int ret = 0; + + if (count > PATH_MAX) + return -ENOMEM; + + buffer = kmalloc(count + 1, GFP_KERNEL); + if (!buffer) + return -ENOMEM; + + if (copy_from_user(buffer, ubuf, count)) { + kfree(buffer); + return -EFAULT; + } + buffer[count] = 0; + + /* + * echo likes to put an extra \n at the end of the string + * if it's there, remove it + */ + if (buffer[count - 1] == '\n') + buffer[count - 1] = 0; + + task = get_proc_task(file_inode(file)); + if (!task) { + kfree(buffer); + return -ESRCH; + } + + /* Check if the given path is actually a valid directory */ + ret = kern_path(buffer, LOOKUP_DIRECTORY | LOOKUP_FOLLOW, &p); + if (!ret) { + path_put(&p); + info = task->fbmm_info; + + if (!info) { + info = fbmm_create_new_info(buffer); + task->fbmm_info = info; + if (!info) + ret = -ENOMEM; + } else { + /* + * Cleanup the old directory info, but keep the fbmm files + * stuff because the application may still be using them + */ + if (info->mnt_dir_str) { + path_put(&info->mnt_dir_path); + fput(info->get_unmapped_area_file); + kfree(info->mnt_dir_str); + } + + info->mnt_dir_str = buffer; + ret = kern_path(buffer, LOOKUP_DIRECTORY | LOOKUP_FOLLOW, + &info->mnt_dir_path); + if (ret) + goto end; + + fput(info->get_unmapped_area_file); + info->get_unmapped_area_file = file_open_root(&info->mnt_dir_path, "", + GUA_OPEN_FLAGS, GUA_OPEN_MODE); + if (IS_ERR(info->get_unmapped_area_file)) + ret = PTR_ERR(info->get_unmapped_area_file); + } + } else { + kfree(buffer); + + info = task->fbmm_info; + if (info && info->mnt_dir_str) { + kfree(info->mnt_dir_str); + path_put(&info->mnt_dir_path); + fput(info->get_unmapped_area_file); + info->mnt_dir_str = NULL; + } + } + +end: + put_task_struct(task); + if (ret) + return ret; + return count; +} + +const struct file_operations proc_fbmm_mnt_dir = { + .read = fbmm_mnt_dir_read, + .write = fbmm_mnt_dir_write, + .llseek = default_llseek, +}; + + +static int __init file_based_mm_init(void) +{ + struct kobject *fbmm_kobj; + int err; + + fbmm_kobj = kobject_create_and_add("fbmm", mm_kobj); + if (unlikely(!fbmm_kobj)) { + pr_warn("failed to create the fbmm kobject\n"); + return -ENOMEM; + } + + err = sysfs_create_group(fbmm_kobj, &file_based_mm_attr_group); + if (err) { + pr_warn("failed to register the fbmm group\n"); + kobject_put(fbmm_kobj); + return err; + } + + return 0; +} +subsys_initcall(file_based_mm_init); diff --git a/fs/proc/base.c b/fs/proc/base.c index 72a1acd03675..ef5688f0ab95 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -97,6 +97,7 @@ #include #include #include +#include #include #include #include "internal.h" @@ -3359,6 +3360,9 @@ static const struct pid_entry tgid_base_stuff[] = { ONE("ksm_merging_pages", S_IRUSR, proc_pid_ksm_merging_pages), ONE("ksm_stat", S_IRUSR, proc_pid_ksm_stat), #endif +#ifdef CONFIG_FILE_BASED_MM + REG("fbmm_mnt_dir", S_IRUGO|S_IWUSR, proc_fbmm_mnt_dir), 
+#endif }; static int proc_tgid_base_readdir(struct file *file, struct dir_context *ctx) diff --git a/include/linux/file_based_mm.h b/include/linux/file_based_mm.h new file mode 100644 index 000000000000..c1c5e82e36ec --- /dev/null +++ b/include/linux/file_based_mm.h @@ -0,0 +1,81 @@ +/* SPDX-License-Identifier: GPL-2.0-only */ +#ifndef _FILE_BASED_MM_H_ +#define _FILE_BASED_MM_H_ + +#include +#include +#include +#include + +struct fbmm_info { + char *mnt_dir_str; + struct path mnt_dir_path; + /* This file exists just to be passed to get_unmapped_area in mmap */ + struct file *get_unmapped_area_file; + struct maple_tree files_mt; +}; + + +#ifdef CONFIG_FILE_BASED_MM +extern const struct file_operations proc_fbmm_mnt_dir; + +bool use_file_based_mm(struct task_struct *task); + +bool is_vm_fbmm_page(struct vm_area_struct *vma); +int fbmm_fault(struct vm_area_struct *vma, unsigned long address, unsigned int flags); +unsigned long fbmm_get_unmapped_area(unsigned long addr, unsigned long len, unsigned long pgoff, + unsigned long flags); +struct file *fbmm_get_file(struct task_struct *tsk, unsigned long addr, unsigned long len, + unsigned long prot, int flags, bool topdown, unsigned long *pgoff); +void fbmm_populate_file(unsigned long start, unsigned long len); +int fbmm_munmap(struct task_struct *tsk, unsigned long start, unsigned long len); +void fbmm_exit(struct task_struct *tsk); +int fbmm_copy(struct task_struct *src_tsk, struct task_struct *dst_tsk, u64 clone_flags); + +#else /* CONFIG_FILE_BASED_MM */ + +static inline bool is_vm_fbmm_page(struct vm_area_struct *vma) +{ + return 0; +} + +static inline bool use_file_based_mm(struct task_struct *tsk) +{ + return false; +} + +static inline int fbmm_fault(struct vm_area_struct *vma, unsigned long address, unsigned int flags) +{ + return 0; +} + +static inline unsigned long fbmm_get_unmapped_area(unsigned long addr, unsigned long len, + unsigned long pgoff, unsigned long flags) +{ + return 0; +} + +static inline struct file *fbmm_get_file(struct task_struct *tsk, unsigned long addr, + unsigned long len, unsigned long prot, int flags, bool topdown, + unsigned long *pgoff) +{ + return NULL; +} + +static inline void fbmm_populate_file(unsigned long start, unsigned long len) {} + +static inline int fbmm_munmap(struct task_struct *tsk, unsigned long start, unsigned long len) +{ + return 0; +} + +static inline void fbmm_exit(struct task_struct *tsk) {} + +static inline int fbmm_copy(struct task_struct *src_tsk, struct task_struct *dst_tsk, + u64 clone_flags) +{ + return 0; +} +#endif /* CONFIG_FILE_BASED_MM */ + +#endif /* __FILE_BASED_MM_H */ diff --git a/include/linux/mm.h b/include/linux/mm.h index eb7c96d24ac0..614d40ef249a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -31,6 +31,7 @@ #include #include #include +#include struct mempolicy; struct anon_vma; @@ -321,12 +322,14 @@ extern unsigned int kobjsize(const void *objp); #define VM_HIGH_ARCH_BIT_3 35 /* bit only usable on 64-bit architectures */ #define VM_HIGH_ARCH_BIT_4 36 /* bit only usable on 64-bit architectures */ #define VM_HIGH_ARCH_BIT_5 37 /* bit only usable on 64-bit architectures */ +#define VM_HIGH_ARCH_BIT_6 38 /* bit only usable on 64-bit architectures */ #define VM_HIGH_ARCH_0 BIT(VM_HIGH_ARCH_BIT_0) #define VM_HIGH_ARCH_1 BIT(VM_HIGH_ARCH_BIT_1) #define VM_HIGH_ARCH_2 BIT(VM_HIGH_ARCH_BIT_2) #define VM_HIGH_ARCH_3 BIT(VM_HIGH_ARCH_BIT_3) #define VM_HIGH_ARCH_4 BIT(VM_HIGH_ARCH_BIT_4) #define VM_HIGH_ARCH_5 BIT(VM_HIGH_ARCH_BIT_5) +#define VM_HIGH_ARCH_6 
BIT(VM_HIGH_ARCH_BIT_6) #endif /* CONFIG_ARCH_USES_HIGH_VMA_FLAGS */ #ifdef CONFIG_ARCH_HAS_PKEYS @@ -357,6 +360,12 @@ extern unsigned int kobjsize(const void *objp); # define VM_SHADOW_STACK VM_NONE #endif +#ifdef CONFIG_FILE_BASED_MM +# define VM_FBMM VM_HIGH_ARCH_6 +#else +# define VM_FBMM VM_NONE +#endif + #if defined(CONFIG_X86) # define VM_PAT VM_ARCH_1 /* PAT reserves whole VMA at once (x86) */ #elif defined(CONFIG_PPC) @@ -3465,6 +3474,7 @@ extern int __mm_populate(unsigned long addr, unsigned long len, int ignore_errors); static inline void mm_populate(unsigned long addr, unsigned long len) { + fbmm_populate_file(addr, len); /* Ignore errors */ (void) __mm_populate(addr, len, 1); } diff --git a/include/linux/sched.h b/include/linux/sched.h index a5f4b48fca18..8a98490618b0 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1554,6 +1554,10 @@ struct task_struct { struct user_event_mm *user_event_mm; #endif +#ifdef CONFIG_FILE_BASED_MM + struct fbmm_info *fbmm_info; +#endif + /* * New fields for task_struct should be added above here, so that * they are included in the randomized portion of task_struct. diff --git a/kernel/exit.c b/kernel/exit.c index 81fcee45d630..49a76f7f6cc6 100644 --- a/kernel/exit.c +++ b/kernel/exit.c @@ -70,6 +70,7 @@ #include #include #include +#include #include @@ -824,6 +825,8 @@ void __noreturn do_exit(long code) WARN_ON(tsk->plug); + fbmm_exit(tsk); + kcov_task_exit(tsk); kmsan_task_exit(tsk); diff --git a/kernel/fork.c b/kernel/fork.c index 99076dbe27d8..2b47276b1300 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2369,6 +2369,9 @@ __latent_entropy struct task_struct *copy_process( goto bad_fork_cleanup_perf; /* copy all the process information */ shm_init_task(p); + retval = fbmm_copy(current, p, clone_flags); + if (retval) + goto bad_fork_cleanup_audit; retval = security_task_alloc(p, clone_flags); if (retval) goto bad_fork_cleanup_audit; diff --git a/mm/gup.c b/mm/gup.c index f1d6bc06eb52..762bbaf1cabf 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -22,6 +22,7 @@ #include #include +#include #include "internal.h" diff --git a/mm/memory.c b/mm/memory.c index d10e616d7389..fa2fe3ee0867 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -5685,6 +5685,8 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, if (unlikely(is_vm_hugetlb_page(vma))) ret = hugetlb_fault(vma->vm_mm, vma, address, flags); + else if (unlikely(is_vm_fbmm_page(vma))) + ret = fbmm_fault(vma, address, flags); else ret = __handle_mm_fault(vma, address, flags); diff --git a/mm/mmap.c b/mm/mmap.c index 83b4682ec85c..d684d8bd218b 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -182,6 +182,7 @@ SYSCALL_DEFINE1(brk, unsigned long, brk) struct vm_area_struct *brkvma, *next = NULL; unsigned long min_brk; bool populate = false; + bool used_fbmm = false; LIST_HEAD(uf); struct vma_iterator vmi; @@ -256,8 +257,23 @@ SYSCALL_DEFINE1(brk, unsigned long, brk) brkvma = vma_prev_limit(&vmi, mm->start_brk); /* Ok, looks good - let it rip. 
 */
-	if (do_brk_flags(&vmi, brkvma, oldbrk, newbrk - oldbrk, 0) < 0)
-		goto out;
+	if (use_file_based_mm(current)) {
+		vm_flags_t vm_flags;
+		unsigned long prot = PROT_READ | PROT_WRITE;
+		unsigned long pgoff = 0;
+		struct file *f = fbmm_get_file(current, oldbrk, newbrk-oldbrk, prot, 0, false,
+				&pgoff);
+
+		if (f) {
+			vm_flags = VM_DATA_DEFAULT_FLAGS | VM_ACCOUNT | mm->def_flags | VM_FBMM;
+			mmap_region(f, oldbrk, newbrk-oldbrk, vm_flags, pgoff, NULL);
+			used_fbmm = true;
+		}
+	}
+	if (!used_fbmm) {
+		if (do_brk_flags(&vmi, brkvma, oldbrk, newbrk - oldbrk, 0) < 0)
+			goto out;
+	}
 	mm->brk = brk;
 	if (mm->def_flags & VM_LOCKED)
@@ -1219,6 +1235,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 {
 	struct mm_struct *mm = current->mm;
 	int pkey = 0;
+	bool used_fbmm = false;
 
 	*populate = 0;
@@ -1278,10 +1295,28 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
 	vm_flags |= calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
 			mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
 
+	/* Do we want to use FBMM? */
+	if (!file && (flags & MAP_ANONYMOUS) && use_file_based_mm(current)) {
+		addr = fbmm_get_unmapped_area(addr, len, pgoff, flags);
+
+		if (!IS_ERR_VALUE(addr)) {
+			bool topdown = test_bit(MMF_TOPDOWN, &mm->flags);
+
+			file = fbmm_get_file(current, addr, len, prot, flags, topdown, &pgoff);
+
+			if (file) {
+				used_fbmm = true;
+				flags = flags & ~MAP_ANONYMOUS;
+				vm_flags |= VM_FBMM;
+			}
+		}
+	}
+
 	/* Obtain the address to map to. we verify (or select) it and ensure
 	 * that it represents a valid section of the address space.
 	 */
-	addr = __get_unmapped_area(file, addr, len, pgoff, flags, vm_flags);
+	if (!used_fbmm)
+		addr = __get_unmapped_area(file, addr, len, pgoff, flags, vm_flags);
 	if (IS_ERR_VALUE(addr))
 		return addr;
 
@@ -2690,6 +2725,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	mmap_read_unlock(mm);
 
 	__mt_destroy(&mt_detach);
+	fbmm_munmap(current, start, end - start);
 	return 0;
 
 clear_tree_failed:
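To make the flow enabled by this patch concrete, a user-space view might
look like the following sketch. It is not part of the patch; it assumes an
MFS is already mounted at the placeholder path /mnt/mfs and that FBMM was
enabled by writing 1 to /sys/kernel/mm/fbmm/state:

/* fbmm_demo.c - hypothetical demo, not part of this series */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	FILE *f = fopen("/proc/self/fbmm_mnt_dir", "w");
	char *buf;

	if (!f) {
		perror("fopen");
		return 1;
	}
	fprintf(f, "/mnt/mfs");	/* placeholder MFS mount directory */
	fclose(f);

	/*
	 * With FBMM enabled, do_mmap converts this anonymous mapping into a
	 * mapping of an MFS-backed file (MAP_ANONYMOUS is cleared, VM_FBMM is
	 * set), so faults below are resolved by the MFS's fault handler.
	 */
	buf = mmap(NULL, 1 << 21, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	memset(buf, 0x5a, 1 << 21);
	munmap(buf, 1 << 21);
	return 0;
}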
From patchwork Fri Nov 22 20:38:28 2024
X-Patchwork-Submitter: Bijan Tabatabai
X-Patchwork-Id: 13883601

From: Bijan Tabatabai
To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, btabatabai@wisc.edu
Cc: akpm@linux-foundation.org, viro@zeniv.linux.org.uk, brauner@kernel.org,
 mingo@redhat.com
Subject: [RFC PATCH 2/4] fbmm: Add helper functions for FBMM MM Filesystems
Date: Fri, 22 Nov 2024 14:38:28 -0600
Message-Id: <20241122203830.2381905-3-btabatabai@wisc.edu>
In-Reply-To: <20241122203830.2381905-1-btabatabai@wisc.edu>
References: <20241122203830.2381905-1-btabatabai@wisc.edu>

This patch adds four helper functions to simplify the implementation of
MFSs:

fbmm_swapout_folio: Takes a folio to swap out.
fbmm_writepage: An implementation of the
  address_space_operations.writepage callback that simply writes the page
  to the default swap space.
fbmm_read_swap_entry: Reads the contents of a swap entry into a page.
fbmm_copy_page_range: Copies the page table corresponding to the VMA of
  one process into the page table of another. The pages in both processes
  are write protected for CoW.
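As a sketch of how an MFS might wire up the swap helpers, consider the
following. The "myfs" names are hypothetical placeholders; only the fbmm_*
helpers and the writepage prototype come from this patch:

/* Hypothetical MFS glue; "myfs" is a placeholder, not part of this series. */
#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>
#include <linux/file_based_mm.h>

static const struct address_space_operations myfs_aops = {
	/* Reclaim writes dirty MFS pages to the default swap space */
	.writepage	= fbmm_writepage,
	.dirty_folio	= filemap_dirty_folio,
};

/*
 * In its fault path, the MFS can hand a freshly allocated page to
 * fbmm_read_swap_entry() to fill it from a previously swapped-out entry.
 */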
This patch also adds infrastructure for FBMM to support copy on write. The
dup_mmap function is modified to create new FBMM files for a forked
process. We also add a callback to the super_operations struct called
copy_page_range, which is called in place of the normal copy_page_range
function in dup_mmap to copy the page table entries to the forked process.
The fbmm_copy_page_range helper is our base implementation of this that
MFSs can use to write protect pages for CoW. However, an MFS can have its
own copy_page_range implementation if, for example, its creators prefer to
do a deep copy of the pages on fork.

Logic is added to FBMM to handle multiple processes sharing files. A
forked process keeps the list of FBMM files it depends on for CoW and
takes a reference to those FBMM files. To ensure one process doesn't free
memory used by another, FBMM will only free memory from a file if its
reference count is 1.

Signed-off-by: Bijan Tabatabai
---
 fs/exec.c                     |   2 +
 fs/file_based_mm.c            | 105 +++++++++-
 include/linux/file_based_mm.h |  18 ++
 include/linux/fs.h            |   1 +
 kernel/fork.c                 |  54 ++++-
 mm/Makefile                   |   1 +
 mm/fbmm_helpers.c             | 372 ++++++++++++++++++++++++++++++++++
 mm/internal.h                 |  13 ++
 mm/vmscan.c                   |  14 +-
 9 files changed, 558 insertions(+), 22 deletions(-)
 create mode 100644 mm/fbmm_helpers.c

diff --git a/fs/exec.c b/fs/exec.c
index 40073142288f..f8f8d3d3ccd1 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -68,6 +68,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -1900,6 +1901,7 @@ static int bprm_execve(struct linux_binprm *bprm)
 	user_events_execve(current);
 	acct_update_integrals(current);
 	task_numa_free(current, false);
+	fbmm_clear_cow_files(current);
 	return retval;
 
 out:
diff --git a/fs/file_based_mm.c b/fs/file_based_mm.c
index c05797d51cb3..1feabdea1b77 100644
--- a/fs/file_based_mm.c
+++ b/fs/file_based_mm.c
@@ -30,6 +30,12 @@ struct fbmm_file {
 	unsigned long va_start;
 	/* The ending virtual address assigned to this file (exclusive) */
 	unsigned long va_end;
+	atomic_t refcount;
+};
+
+struct fbmm_cow_list_entry {
+	struct list_head node;
+	struct fbmm_file *file;
 };
 
 static enum file_based_mm_state fbmm_state = FBMM_OFF;
@@ -62,6 +68,7 @@ static struct fbmm_info *fbmm_create_new_info(char *mnt_dir_str)
 	}
 
 	mt_init(&info->files_mt);
+	INIT_LIST_HEAD(&info->cow_files);
 
 	return info;
 }
@@ -74,6 +81,11 @@ static void drop_fbmm_file(struct fbmm_file *file)
 	}
 }
 
+static void get_fbmm_file(struct fbmm_file *file)
+{
+	atomic_inc(&file->refcount);
+}
+
 static pmdval_t fbmm_alloc_pmd(struct vm_fault *vmf)
 {
 	struct mm_struct *mm = vmf->vma->vm_mm;
@@ -212,6 +224,7 @@ struct file *fbmm_get_file(struct task_struct *tsk, unsigned long addr, unsigned
 		return NULL;
 	}
 	fbmm_file->f = f;
+	atomic_set(&fbmm_file->refcount, 1);
 	if (topdown) {
 		/*
 		 * Since VAs in this region grow down, this mapping will be the
@@ -300,9 +313,18 @@ int fbmm_munmap(struct task_struct *tsk, unsigned long start, unsigned long len)
 
 		falloc_len = falloc_end_offset - falloc_start_offset;
 
-		ret = vfs_fallocate(fbmm_file->f,
-				FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
-				falloc_start_offset, falloc_len);
+ /* + * Because shared mappings via fork are hard, only punch a hole if there + * is only one proc using this file. + * It would be nice to be able to free the memory if all procs sharing + * the file have unmapped it, but that would require tracking usage at + * a page granularity. + */ + if (atomic_read(&fbmm_file->refcount) == 1) { + ret = vfs_fallocate(fbmm_file->f, + FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, + falloc_start_offset, falloc_len); + } fbmm_file = mt_next(&info->files_mt, fbmm_file->va_start, ULONG_MAX); if (!fbmm_file || fbmm_file->va_end <= start) @@ -324,6 +346,8 @@ static void fbmm_free_info(struct task_struct *tsk) } mtree_destroy(&info->files_mt); + fbmm_clear_cow_files(tsk); + if (info->mnt_dir_str) { path_put(&info->mnt_dir_path); fput(info->get_unmapped_area_file); @@ -346,6 +370,7 @@ void fbmm_exit(struct task_struct *tsk) int fbmm_copy(struct task_struct *src_tsk, struct task_struct *dst_tsk, u64 clone_flags) { struct fbmm_info *info; + struct fbmm_cow_list_entry *src_cow, *dst_cow; char *buffer; char *src_dir; @@ -375,9 +400,83 @@ int fbmm_copy(struct task_struct *src_tsk, struct task_struct *dst_tsk, u64 clon return -ENOMEM; } + /* + * If the source has CoW files, they may also be CoW files in the destination + * so we need to copy that too + */ + list_for_each_entry(src_cow, &info->cow_files, node) { + dst_cow = kmalloc(sizeof(struct fbmm_cow_list_entry), GFP_KERNEL); + if (!dst_cow) { + fbmm_free_info(dst_tsk); + dst_tsk->fbmm_info = NULL; + return -ENOMEM; + } + + get_fbmm_file(src_cow->file); + dst_cow->file = src_cow->file; + + list_add(&dst_cow->node, &dst_tsk->fbmm_info->cow_files); + } + return 0; } +int fbmm_add_cow_file(struct task_struct *new_tsk, struct task_struct *old_tsk, + struct file *file, unsigned long start) +{ + struct fbmm_info *new_info; + struct fbmm_info *old_info; + struct fbmm_file *fbmm_file; + struct fbmm_cow_list_entry *cow_entry; + unsigned long search_start = start + 1; + + new_info = new_tsk->fbmm_info; + old_info = old_tsk->fbmm_info; + if (!new_info || !old_info) + return -EINVAL; + + /* + * Find the fbmm_file that corresponds with the struct file. 
+ * fbmm files can overlap, so make sure to find the one that corresponds
+ * to this file
+ */
+	do {
+		fbmm_file = mt_prev(&old_info->files_mt, search_start, 0);
+		if (!fbmm_file || fbmm_file->va_end <= start) {
+			/* Could not find the corresponding fbmm file */
+			return -ENOMEM;
+		}
+		search_start = fbmm_file->va_start;
+	} while (fbmm_file->f != file);
+
+	cow_entry = kmalloc(sizeof(struct fbmm_cow_list_entry), GFP_KERNEL);
+	if (!cow_entry)
+		return -ENOMEM;
+
+	get_fbmm_file(fbmm_file);
+	cow_entry->file = fbmm_file;
+
+	list_add(&cow_entry->node, &new_info->cow_files);
+	return 0;
+}
+
+void fbmm_clear_cow_files(struct task_struct *tsk)
+{
+	struct fbmm_info *info;
+	struct fbmm_cow_list_entry *cow_entry, *tmp;
+
+	info = tsk->fbmm_info;
+	if (!info)
+		return;
+
+	list_for_each_entry_safe(cow_entry, tmp, &info->cow_files, node) {
+		list_del(&cow_entry->node);
+
+		drop_fbmm_file(cow_entry->file);
+		kfree(cow_entry);
+	}
+}
+
 static ssize_t fbmm_state_show(struct kobject *kobj,
 		struct kobj_attribute *attr, char *buf)
 {
diff --git a/include/linux/file_based_mm.h b/include/linux/file_based_mm.h
index c1c5e82e36ec..22bb8e890144 100644
--- a/include/linux/file_based_mm.h
+++ b/include/linux/file_based_mm.h
@@ -13,6 +13,7 @@ struct fbmm_info {
 	/* This file exists just to be passed to get_unmapped_area in mmap */
 	struct file *get_unmapped_area_file;
 	struct maple_tree files_mt;
+	struct list_head cow_files;
 };
 
@@ -31,6 +32,16 @@ void fbmm_populate_file(unsigned long start, unsigned long len);
 int fbmm_munmap(struct task_struct *tsk, unsigned long start, unsigned long len);
 void fbmm_exit(struct task_struct *tsk);
 int fbmm_copy(struct task_struct *src_tsk, struct task_struct *dst_tsk, u64 clone_flags);
+int fbmm_add_cow_file(struct task_struct *new_tsk, struct task_struct *old_tsk,
+		struct file *file, unsigned long start);
+void fbmm_clear_cow_files(struct task_struct *tsk);
+
+/* FBMM helper functions for MFSs */
+int fbmm_swapout_folio(struct folio *folio);
+int fbmm_writepage(struct page *page, struct writeback_control *wbc);
+struct page *fbmm_read_swap_entry(struct vm_fault *vmf, swp_entry_t entry, unsigned long pgoff,
+		struct page *page);
+int fbmm_copy_page_range(struct vm_area_struct *dst, struct vm_area_struct *src);
 
 #else /* CONFIG_FILE_BASED_MM */
 
@@ -76,6 +87,13 @@ static inline int fbmm_copy(struct task_struct *src_tsk, struct task_struct *dst
 {
 	return 0;
 }
+
+static inline int fbmm_add_cow_file(struct task_struct *new_tsk, struct task_struct *old_tsk,
+		struct file *file, unsigned long start) {
+	return 0;
+}
+
+static inline void fbmm_clear_cow_files(struct task_struct *tsk) {}
 #endif /* CONFIG_FILE_BASED_MM */
 
 #endif /* __FILE_BASED_MM_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0283cf366c2a..d38691819880 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2181,6 +2181,7 @@ struct super_operations {
 	long (*free_cached_objects)(struct super_block *,
 				struct shrink_control *);
 	void (*shutdown)(struct super_block *sb);
+	int (*copy_page_range)(struct vm_area_struct *dst, struct vm_area_struct *src);
 };
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index 2b47276b1300..249367110519 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -625,8 +625,8 @@ static void dup_mm_exe_file(struct mm_struct *mm, struct mm_struct *oldmm)
 }
 
 #ifdef CONFIG_MMU
-static __latent_entropy int dup_mmap(struct mm_struct *mm,
-					struct mm_struct *oldmm)
+static __latent_entropy int dup_mmap(struct task_struct *tsk,
+		struct mm_struct *mm, struct mm_struct *oldmm)
 {
 	struct
vm_area_struct *mpnt, *tmp; int retval; @@ -732,7 +732,45 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, tmp->vm_ops->open(tmp); file = tmp->vm_file; - if (file) { + if (file && use_file_based_mm(tsk) && + (tmp->vm_flags & (VM_SHARED | VM_FBMM)) == VM_FBMM) { + /* + * If this is a private FBMM file, we need to create a new + * file for this allocation + */ + bool topdown = test_bit(MMF_TOPDOWN, &mm->flags); + unsigned long len = tmp->vm_end - tmp->vm_start; + unsigned long prot = 0; + unsigned long pgoff; + struct file *orig_file = file; + + if (tmp->vm_flags & VM_READ) + prot |= PROT_READ; + if (tmp->vm_flags & VM_WRITE) + prot |= PROT_WRITE; + if (tmp->vm_flags & VM_EXEC) + prot |= PROT_EXEC; + + /* + * topdown may be incorrect if it is true but this is for a region created + * by brk, which grows up, but if it's wrong, it'll only affect the next + * brk allocation + */ + file = fbmm_get_file(tsk, tmp->vm_start, len, prot, 0, topdown, &pgoff); + if (!file) { + retval = -ENOMEM; + goto loop_out; + } + + tmp->vm_pgoff = pgoff; + tmp->vm_file = get_file(file); + call_mmap(file, tmp); + + retval = fbmm_add_cow_file(tsk, current, orig_file, tmp->vm_start); + if (retval) { + goto loop_out; + } + } else if (file) { struct address_space *mapping = file->f_mapping; get_file(file); @@ -747,8 +785,12 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, i_mmap_unlock_write(mapping); } - if (!(tmp->vm_flags & VM_WIPEONFORK)) - retval = copy_page_range(tmp, mpnt); + if (!(tmp->vm_flags & VM_WIPEONFORK)) { + if (file && file->f_inode->i_sb->s_op->copy_page_range) + retval = file->f_inode->i_sb->s_op->copy_page_range(tmp, mpnt); + else + retval = copy_page_range(tmp, mpnt); + } if (retval) { mpnt = vma_next(&vmi); @@ -1685,7 +1727,7 @@ static struct mm_struct *dup_mm(struct task_struct *tsk, if (!mm_init(mm, tsk, mm->user_ns)) goto fail_nomem; - err = dup_mmap(mm, oldmm); + err = dup_mmap(tsk, mm, oldmm); if (err) goto free_pt; diff --git a/mm/Makefile b/mm/Makefile index 8fb85acda1b1..fc5d1c4e0d5e 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -139,3 +139,4 @@ obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o obj-$(CONFIG_SHRINKER_DEBUG) += shrinker_debug.o obj-$(CONFIG_EXECMEM) += execmem.o +obj-$(CONFIG_FILE_BASED_MM) += fbmm_helpers.o diff --git a/mm/fbmm_helpers.c b/mm/fbmm_helpers.c new file mode 100644 index 000000000000..2c3c5522f34c --- /dev/null +++ b/mm/fbmm_helpers.c @@ -0,0 +1,372 @@ +// SPDX-License-Identifier: GPL-2.0-only +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include "internal.h" +#include "swap.h" + +/****************************************************************************** + * Swap Helpers + *****************************************************************************/ +static bool fbmm_try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, + unsigned long address, void *arg) +{ + struct mm_struct *mm = vma->vm_mm; + DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0); + pte_t pteval, swp_pte; + swp_entry_t entry; + struct page *page; + bool ret = true; + struct mmu_notifier_range range; + + range.end = vma_address_end(&pvmw); + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm, + address, range.end); + mmu_notifier_invalidate_range_start(&range); + + while (page_vma_mapped_walk(&pvmw)) { + page = folio_page(folio, pte_pfn(*pvmw.pte) - folio_pfn(folio)); + address = pvmw.address; + + pteval 
= ptep_clear_flush(vma, address, pvmw.pte);
+
+		if (pte_dirty(pteval))
+			folio_mark_dirty(folio);
+
+		entry.val = page_private(page);
+
+		if (swap_duplicate(entry) < 0) {
+			set_pte_at(mm, address, pvmw.pte, pteval);
+			ret = false;
+			page_vma_mapped_walk_done(&pvmw);
+			break;
+		}
+
+		dec_mm_counter(mm, MM_FILEPAGES);
+		inc_mm_counter(mm, MM_SWAPENTS);
+		swp_pte = swp_entry_to_pte(entry);
+		if (pte_soft_dirty(pteval))
+			swp_pte = pte_swp_mksoft_dirty(swp_pte);
+
+		set_pte_at(mm, address, pvmw.pte, swp_pte);
+
+		folio_remove_rmap_pte(folio, page, vma);
+		folio_put(folio);
+	}
+
+	mmu_notifier_invalidate_range_end(&range);
+
+	return ret;
+}
+
+static int folio_not_mapped(struct folio *folio)
+{
+	return !folio_mapped(folio);
+}
+
+static void fbmm_try_to_unmap(struct folio *folio)
+{
+	struct rmap_walk_control rwc = {
+		.rmap_one = fbmm_try_to_unmap_one,
+		.arg = NULL,
+		.done = folio_not_mapped,
+	};
+
+	rmap_walk(folio, &rwc);
+}
+
+/*
+ * fbmm_swapout_folio - Helper function for MFSs to swapout a folio
+ * @folio: The folio to swap out. Must have a reference count of at least 3:
+ * one that the thread is holding on to, one for the file mapping, and one
+ * for each page table entry it is mapped to
+ *
+ * Returns 0 on success and nonzero otherwise
+ */
+int fbmm_swapout_folio(struct folio *folio)
+{
+	struct address_space *mapping;
+	struct swap_info_struct *si;
+	unsigned long offset;
+	struct swap_iocb *plug = NULL;
+	swp_entry_t entry;
+
+	if (!folio_trylock(folio))
+		return 1;
+
+	entry = folio_alloc_swap(folio);
+	if (!entry.val)
+		goto unlock;
+
+	offset = swp_offset(entry);
+
+	folio->swap = entry;
+
+	folio_mark_dirty(folio);
+
+	if (folio_ref_count(folio) < 3)
+		goto unlock;
+
+	if (folio_mapped(folio)) {
+		fbmm_try_to_unmap(folio);
+		if (folio_mapped(folio))
+			goto unlock;
+	}
+
+	mapping = folio_mapping(folio);
+	if (folio_test_dirty(folio)) {
+		try_to_unmap_flush_dirty();
+		switch (pageout(folio, mapping, &plug)) {
+		case PAGE_KEEP:
+			fallthrough;
+		case PAGE_ACTIVATE:
+			goto unlock;
+		case PAGE_SUCCESS:
+			/* pageout eventually unlocks the folio on success, so lock it */
+			if (!folio_trylock(folio))
+				return 1;
+			fallthrough;
+		case PAGE_CLEAN:
+			;
+		}
+	}
+
+	remove_mapping(mapping, folio);
+	folio_unlock(folio);
+
+	si = get_swap_device(entry);
+	si->swap_map[offset] &= ~SWAP_HAS_CACHE;
+	put_swap_device(si);
+
+	return 0;
+
+unlock:
+	folio_unlock(folio);
+	return 1;
+}
+EXPORT_SYMBOL(fbmm_swapout_folio);
+
+static void fbmm_end_swap_bio_write(struct bio *bio)
+{
+	struct folio *folio = bio_first_folio_all(bio);
+	int ret;
+
+	/* This is the simplification of __folio_end_writeback */
+	ret = folio_test_clear_writeback(folio);
+	if (!ret)
+		return;
+
+	sb_clear_inode_writeback(folio_mapping(folio)->host);
+
+	/* Simplification of folio_end_writeback */
+	smp_mb__after_atomic();
+	acct_reclaim_writeback(folio);
+}
+
+/* Analogue to __swap_writepage */
+static void __fbmm_writepage(struct folio *folio, struct writeback_control *wbc)
+{
+	struct bio bio;
+	struct bio_vec bv;
+	struct swap_info_struct *sis = swp_swap_info(folio->swap);
+
+	bio_init(&bio, sis->bdev, &bv, 1,
+		 REQ_OP_WRITE | REQ_SWAP | wbc_to_write_flags(wbc));
+	bio.bi_iter.bi_sector = swap_folio_sector(folio);
+	bio_add_folio_nofail(&bio, folio, folio_size(folio), 0);
+
+	count_vm_events(PSWPOUT, folio_nr_pages(folio));
+	folio_start_writeback(folio);
+	folio_unlock(folio);
+
+	submit_bio_wait(&bio);
+	fbmm_end_swap_bio_write(&bio);
+}
+
+int fbmm_writepage(struct page *page, struct writeback_control *wbc)
+{
+	struct folio *folio = page_folio(page);
+	int ret = 0;
+
+	ret = arch_prepare_to_swap(folio);
+	if (ret) {
+		folio_mark_dirty(folio);
+		folio_unlock(folio);
+		return 0;
+	}
+
+	__fbmm_writepage(folio, wbc);
+	return 0;
+}
+EXPORT_SYMBOL(fbmm_writepage);
+
+struct page *fbmm_read_swap_entry(struct vm_fault *vmf, swp_entry_t entry, unsigned long pgoff,
+		struct page *page)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	struct swap_info_struct *si;
+	struct folio *folio;
+
+	if (unlikely(non_swap_entry(entry)))
+		return NULL;
+
+	/*
+	 * If a folio is already mapped here, just return that.
+	 * Another process has probably already brought in the shared page
+	 */
+	folio = filemap_get_folio(mapping, pgoff);
+	if (!IS_ERR(folio))
+		return folio_page(folio, 0);
+
+	si = get_swap_device(entry);
+	if (!si)
+		return NULL;
+
+	folio = page_folio(page);
+
+	folio_lock(folio);
+	folio->swap = entry;
+	/* swap_read_folio unlocks the folio */
+	swap_read_folio(folio, true, NULL);
+	folio->private = NULL;
+
+	swap_free(entry);
+
+	put_swap_device(si);
+	count_vm_events(PSWPIN, folio_nr_pages(folio));
+	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
+	return folio_page(folio, 0);
+}
+EXPORT_SYMBOL(fbmm_read_swap_entry);
+
+/******************************************************************************
+ * Copy on write helpers
+ *****************************************************************************/
+struct page_walk_levels {
+	struct vm_area_struct *vma;
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+};
+
+static int fbmm_copy_pgd(pgd_t *pgd, unsigned long addr, unsigned long next, struct mm_walk *walk)
+{
+	struct page_walk_levels *dst_levels = walk->private;
+
+	dst_levels->pgd = pgd_offset(dst_levels->vma->vm_mm, addr);
+	return 0;
+}
+
+static int fbmm_copy_p4d(p4d_t *p4d, unsigned long addr, unsigned long next, struct mm_walk *walk)
+{
+	struct page_walk_levels *dst_levels = walk->private;
+
+	dst_levels->p4d = p4d_alloc(dst_levels->vma->vm_mm, dst_levels->pgd, addr);
+	if (!dst_levels->p4d)
+		return -ENOMEM;
+	return 0;
+}
+
+static int fbmm_copy_pud(pud_t *pud, unsigned long addr, unsigned long next, struct mm_walk *walk)
+{
+	struct page_walk_levels *dst_levels = walk->private;
+
+	dst_levels->pud = pud_alloc(dst_levels->vma->vm_mm, dst_levels->p4d, addr);
+	if (!dst_levels->pud)
+		return -ENOMEM;
+	return 0;
+}
+
+static int fbmm_copy_pmd(pmd_t *pmd, unsigned long addr, unsigned long next, struct mm_walk *walk)
+{
+	struct page_walk_levels *dst_levels = walk->private;
+
+	dst_levels->pmd = pmd_alloc(dst_levels->vma->vm_mm, dst_levels->pud, addr);
+	if (!dst_levels->pmd)
+		return -ENOMEM;
+	return 0;
+}
+
+static int fbmm_copy_pte(pte_t *pte, unsigned long addr, unsigned long next, struct mm_walk *walk)
+{
+	struct page_walk_levels *dst_levels = walk->private;
+	struct mm_struct *dst_mm = dst_levels->vma->vm_mm;
+	struct mm_struct *src_mm = walk->mm;
+	pte_t *src_pte = pte;
+	pte_t *dst_pte;
+	spinlock_t *dst_ptl;
+	pte_t entry;
+	struct page *page;
+	struct folio *folio;
+	int ret = 0;
+
+	dst_pte = pte_alloc_map(dst_mm, dst_levels->pmd, addr);
+	if (!dst_pte)
+		return -ENOMEM;
+	dst_ptl = pte_lockptr(dst_mm, dst_levels->pmd);
+	/* The spinlock for the src pte should already be taken */
+	spin_lock_nested(dst_ptl, SINGLE_DEPTH_NESTING);
+
+	if (pte_none(*src_pte))
+		goto unlock;
+
+	/* I don't really want to handle the swap case, so I won't for now */
+	if (unlikely(!pte_present(*src_pte))) {
+		ret = -EIO;
+		goto unlock;
+	}
+
+	entry = ptep_get(src_pte);
+	page = vm_normal_page(walk->vma, addr, entry);
+	if (page)
+		folio = page_folio(page);
+
+	folio_get(folio);
+	folio_dup_file_rmap_pte(folio, page);
+	percpu_counter_inc(&dst_mm->rss_stat[MM_FILEPAGES]);
+
+	if (!(walk->vma->vm_flags & VM_SHARED) && pte_write(entry)) {
+		ptep_set_wrprotect(src_mm, addr, src_pte);
+		entry = pte_wrprotect(entry);
+	}
+
+	entry = pte_mkold(entry);
+	set_pte_at(dst_mm, addr, dst_pte, entry);
+
+unlock:
+	pte_unmap_unlock(dst_pte, dst_ptl);
+	return ret;
+}
+
+int fbmm_copy_page_range(struct vm_area_struct *dst, struct vm_area_struct *src)
+{
+	struct page_walk_levels dst_levels;
+	struct mm_walk_ops walk_ops = {
+		.pgd_entry = fbmm_copy_pgd,
+		.p4d_entry = fbmm_copy_p4d,
+		.pud_entry = fbmm_copy_pud,
+		.pmd_entry = fbmm_copy_pmd,
+		.pte_entry = fbmm_copy_pte,
+	};
+
+	dst_levels.vma = dst;
+
+	return walk_page_range(src->vm_mm, src->vm_start, src->vm_end,
+			&walk_ops, &dst_levels);
+}
+EXPORT_SYMBOL(fbmm_copy_page_range);
diff --git a/mm/internal.h b/mm/internal.h
index cc2c5e07fad3..bed53f3a6ed3 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1515,4 +1515,17 @@ static inline void shrinker_debugfs_remove(struct dentry *debugfs_entry,
 void workingset_update_node(struct xa_node *node);
 extern struct list_lru shadow_nodes;
 
+/* possible outcome of pageout() */
+typedef enum {
+	/* failed to write folio out, folio is locked */
+	PAGE_KEEP,
+	/* move folio to the active list, folio is locked */
+	PAGE_ACTIVATE,
+	/* folio has been sent to the disk successfully, folio is unlocked */
+	PAGE_SUCCESS,
+	/* folio is clean and locked */
+	PAGE_CLEAN,
+} pageout_t;
+pageout_t pageout(struct folio *folio, struct address_space *mapping,
+		struct swap_iocb **plug);
 #endif /* __MM_INTERNAL_H */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2e34de9cd0d4..93291d25eb11 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -591,23 +591,11 @@ void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
 		wake_up(&pgdat->reclaim_wait[VMSCAN_THROTTLE_WRITEBACK]);
 }
 
-/* possible outcome of pageout() */
-typedef enum {
-	/* failed to write folio out, folio is locked */
-	PAGE_KEEP,
-	/* move folio to the active list, folio is locked */
-	PAGE_ACTIVATE,
-	/* folio has been sent to the disk successfully, folio is unlocked */
-	PAGE_SUCCESS,
-	/* folio is clean and locked */
-	PAGE_CLEAN,
-} pageout_t;
-
 /*
  * pageout is called by shrink_folio_list() for each dirty folio.
  * Calls ->writepage().
  */
-static pageout_t pageout(struct folio *folio, struct address_space *mapping,
+pageout_t pageout(struct folio *folio, struct address_space *mapping,
 		struct swap_iocb **plug)
 {
 	/*
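To show where the CoW helper plugs in, here is a sketch of an MFS wiring
the new super_operations callback introduced by this patch. The "myfs"
name is a hypothetical placeholder; an MFS that prefers deep copies on
fork would supply its own callback here instead:

/* Hypothetical MFS glue; "myfs" is a placeholder, not part of this series. */
#include <linux/fs.h>
#include <linux/file_based_mm.h>

static const struct super_operations myfs_super_ops = {
	.statfs		= simple_statfs,
	/*
	 * Called by dup_mmap in place of copy_page_range for VMAs backed by
	 * this filesystem; the helper write-protects the PTEs in both the
	 * parent and the child so private pages are CoW'd on the next write.
	 */
	.copy_page_range = fbmm_copy_page_range,
};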
*/ -static pageout_t pageout(struct folio *folio, struct address_space *mapping, +pageout_t pageout(struct folio *folio, struct address_space *mapping, struct swap_iocb **plug) { /*
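[Editorial note: a hedged sketch, not part of the series, of how a caller outside mm/vmscan.c would consume the now-shared pageout(); the locking rules follow the pageout_t comments in the mm/internal.h hunk above.]

/*
 * Illustrative only: drive pageout() for one locked folio.  The folio
 * must be locked on entry; which outcomes leave it locked is spelled
 * out by the pageout_t enum comments above.
 */
static int example_write_back_folio(struct folio *folio,
				    struct address_space *mapping)
{
	struct swap_iocb *plug = NULL;

	switch (pageout(folio, mapping, &plug)) {
	case PAGE_SUCCESS:
		/* Sent to disk; pageout() already unlocked the folio */
		return 0;
	case PAGE_CLEAN:
		/* Nothing to write; the folio is still locked */
		folio_unlock(folio);
		return 0;
	case PAGE_KEEP:
	case PAGE_ACTIVATE:
		/* Write failed or folio should stay resident; still locked */
		folio_unlock(folio);
		return -EBUSY;
	}
	return -EINVAL;
}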
From patchwork Fri Nov 22 20:38:29 2024 X-Patchwork-Submitter: Bijan Tabatabai X-Patchwork-Id: 13883602 From: Bijan Tabatabai To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, btabatabai@wisc.edu Cc: akpm@linux-foundation.org, viro@zeniv.linux.org.uk, brauner@kernel.org, mingo@redhat.com Subject: [RFC PATCH 3/4] mm: Export functions for writing MM Filesystems Date: Fri, 22 Nov 2024 14:38:29 -0600 Message-Id: <20241122203830.2381905-4-btabatabai@wisc.edu> In-Reply-To: <20241122203830.2381905-1-btabatabai@wisc.edu> References: <20241122203830.2381905-1-btabatabai@wisc.edu> This patch exports memory management functions that are useful to memory managers, so that memory management filesystems implemented as kernel modules can use them. Signed-off-by: Bijan Tabatabai --- arch/x86/include/asm/tlbflush.h | 2 -- arch/x86/mm/tlb.c | 1 + mm/filemap.c | 2 ++ mm/memory.c | 1 + mm/mmap.c | 2 ++ mm/pgtable-generic.c | 1 + mm/rmap.c | 2 ++ 7 files changed, 9 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 25726893c6f4..9877176d396f 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -57,7 +57,6 @@ static inline void cr4_clear_bits(unsigned long mask) local_irq_restore(flags); } -#ifndef MODULE /* * 6 because 6 should be plenty and struct tlb_state will fit in two cache * lines.
@@ -417,7 +416,6 @@ static inline void set_tlbstate_lam_mode(struct mm_struct *mm) { } #endif -#endif /* !MODULE */ static inline void __native_tlb_flush_global(unsigned long cr4) { diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 44ac64f3a047..f054cee7bc7c 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -1036,6 +1036,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, put_cpu(); mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end); } +EXPORT_SYMBOL_GPL(flush_tlb_mm_range); static void do_flush_tlb_all(void *info) diff --git a/mm/filemap.c b/mm/filemap.c index 657bcd887fdb..8532ddd37e7f 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -269,6 +269,7 @@ void filemap_remove_folio(struct folio *folio) filemap_free_folio(mapping, folio); } +EXPORT_SYMBOL_GPL(filemap_remove_folio); /* * page_cache_delete_batch - delete several folios from page cache @@ -955,6 +956,7 @@ noinline int __filemap_add_folio(struct address_space *mapping, return xas_error(&xas); } ALLOW_ERROR_INJECTION(__filemap_add_folio, ERRNO); +EXPORT_SYMBOL_GPL(__filemap_add_folio); int filemap_add_folio(struct address_space *mapping, struct folio *folio, pgoff_t index, gfp_t gfp) diff --git a/mm/memory.c b/mm/memory.c index fa2fe3ee0867..23e74a0397fa 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -448,6 +448,7 @@ int __pte_alloc(struct mm_struct *mm, pmd_t *pmd) pte_free(mm, new); return 0; } +EXPORT_SYMBOL_GPL(__pte_alloc); int __pte_alloc_kernel(pmd_t *pmd) { diff --git a/mm/mmap.c b/mm/mmap.c index d684d8bd218b..1090ef982929 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1780,6 +1780,7 @@ generic_get_unmapped_area(struct file *filp, unsigned long addr, info.high_limit = mmap_end; return vm_unmapped_area(&info); } +EXPORT_SYMBOL_GPL(generic_get_unmapped_area); #ifndef HAVE_ARCH_UNMAPPED_AREA unsigned long @@ -1844,6 +1845,7 @@ generic_get_unmapped_area_topdown(struct file *filp, unsigned long addr, return addr; } +EXPORT_SYMBOL_GPL(generic_get_unmapped_area_topdown); #ifndef HAVE_ARCH_UNMAPPED_AREA_TOPDOWN unsigned long diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c index a78a4adf711a..1a3b4a86b005 100644 --- a/mm/pgtable-generic.c +++ b/mm/pgtable-generic.c @@ -304,6 +304,7 @@ pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp) rcu_read_unlock(); return NULL; } +EXPORT_SYMBOL_GPL(__pte_offset_map); pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd, unsigned long addr, spinlock_t **ptlp) diff --git a/mm/rmap.c b/mm/rmap.c index e8fc5ecb59b2..fdade910cc95 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1468,6 +1468,7 @@ void folio_add_file_rmap_ptes(struct folio *folio, struct page *page, { __folio_add_file_rmap(folio, page, nr_pages, vma, RMAP_LEVEL_PTE); } +EXPORT_SYMBOL_GPL(folio_add_file_rmap_ptes); /** * folio_add_file_rmap_pmd - add a PMD mapping to a page range of a folio @@ -1594,6 +1595,7 @@ void folio_remove_rmap_ptes(struct folio *folio, struct page *page, { __folio_remove_rmap(folio, page, nr_pages, vma, RMAP_LEVEL_PTE); } +EXPORT_SYMBOL_GPL(folio_remove_rmap_ptes); /** * folio_remove_rmap_pmd - remove a PMD mapping from a page range of a folio
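[Editorial note: to make the purpose of these exports concrete, here is a hedged sketch of the kind of fault-path code an out-of-tree MFS module could now write. All mfs_* names are invented for illustration; the BasicMFS fault handler in patch 4 performs this sequence in full, with locking and error handling.]

/*
 * Illustrative only: map a freshly allocated page at a faulting
 * address.  This relies on __filemap_add_folio(),
 * folio_add_file_rmap_ptes() (via its one-page _pte wrapper) and
 * flush_tlb_mm_range(), all exported by this patch.
 */
static vm_fault_t mfs_map_new_page(struct vm_fault *vmf, struct page *page)
{
	struct vm_area_struct *vma = vmf->vma;
	struct address_space *mapping = vma->vm_file->f_mapping;
	pte_t entry;

	/* Track the page in the file's mapping without putting it on the LRU */
	__filemap_add_folio(mapping, page_folio(page), vmf->pgoff,
			    GFP_KERNEL, NULL);

	/* Account the page and hook up its reverse mapping */
	folio_add_file_rmap_pte(page_folio(page), page, vma);
	percpu_counter_inc(&vma->vm_mm->rss_stat[MM_FILEPAGES]);

	/* Install the PTE and flush the stale translation */
	entry = mk_pte(page, vma->vm_page_prot);
	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
	flush_tlb_mm_range(vma->vm_mm, vmf->address,
			   vmf->address + PAGE_SIZE, PAGE_SHIFT, false);

	return VM_FAULT_NOPAGE;
}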
From patchwork Fri Nov 22 20:38:30 2024 X-Patchwork-Submitter: Bijan Tabatabai X-Patchwork-Id: 13883603 From: Bijan Tabatabai To: linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, btabatabai@wisc.edu Cc: akpm@linux-foundation.org, viro@zeniv.linux.org.uk, brauner@kernel.org, mingo@redhat.com Subject: [RFC PATCH 4/4] Add base implementation of an MFS Date: Fri, 22 Nov 2024 14:38:30 -0600 Message-Id: <20241122203830.2381905-5-btabatabai@wisc.edu> In-Reply-To: <20241122203830.2381905-1-btabatabai@wisc.edu> References: <20241122203830.2381905-1-btabatabai@wisc.edu> Mount by running sudo mount -t BasicMFS BasicMFS -o numpages=<npages> <dir> where <npages> is the max number of 4KB pages the filesystem can use, and <dir> is the directory to mount the filesystem to. This patch is meant to serve as a reference for the reviewers and is not intended to be upstreamed. Signed-off-by: Bijan Tabatabai --- BasicMFS/Kconfig | 3 + BasicMFS/Makefile | 8 + BasicMFS/basic.c | 717 ++++++++++++++++++++++++++++++++++++++++++++++ BasicMFS/basic.h | 29 ++ 4 files changed, 757 insertions(+) create mode 100644 BasicMFS/Kconfig create mode 100644 BasicMFS/Makefile create mode 100644 BasicMFS/basic.c create mode 100644 BasicMFS/basic.h diff --git a/BasicMFS/Kconfig b/BasicMFS/Kconfig new file mode 100644 index 000000000000..3b536eded0ed --- /dev/null +++ b/BasicMFS/Kconfig @@ -0,0 +1,3 @@ +config BASICMMFS + tristate "Adds the BasicMMFS" + default m diff --git a/BasicMFS/Makefile b/BasicMFS/Makefile new file mode 100644 index 000000000000..e50d27819c3c --- /dev/null +++ b/BasicMFS/Makefile @@ -0,0 +1,8 @@ +obj-m += basicmfs.o +basicmfs-y += basic.o + +all: + make -C ../kbuild M=$(PWD) modules + +clean: + make -C ../kbuild M=$(PWD) clean diff --git a/BasicMFS/basic.c b/BasicMFS/basic.c new file mode 100644 index 000000000000..88490de64db4 --- /dev/null +++ b/BasicMFS/basic.c @@ -0,0 +1,717 @@ +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include + +#include "basic.h" + +static const struct super_operations basicmfs_ops; +static const struct inode_operations basicmfs_dir_inode_operations; + +static struct basicmfs_sb_info *BMFS_SB(struct super_block *sb) +{ + return sb->s_fs_info; +} + +static struct basicmfs_inode_info *BMFS_I(struct inode *inode) +{ + return inode->i_private; +} + +/* + * Allocate a base page and assign it to the inode at the given page offset. + * Takes the sbi->lock. + * Returns the allocated page if there is one, else NULL. + */ +static struct page *basicmfs_alloc_page(struct basicmfs_inode_info *inode_info, + struct basicmfs_sb_info *sbi, u64 page_offset) +{ + u8 *kaddr; + u64 pages_added; + u64 alloc_size = 64; + struct page *page = NULL; + + spin_lock(&sbi->lock); + + /* First, do we have any free pages available?
*/ + if (sbi->free_pages == 0) { + /* Try to allocate more pages if we can */ + alloc_size = min(alloc_size, sbi->max_pages - sbi->num_pages); + if (alloc_size == 0) + goto unlock; + + pages_added = alloc_pages_bulk_list(GFP_HIGHUSER, alloc_size, &sbi->free_list); + + if (pages_added == 0) + goto unlock; + + sbi->num_pages += pages_added; + sbi->free_pages += pages_added; + } + + page = list_first_entry(&sbi->free_list, struct page, lru); + list_del(&page->lru); + sbi->free_pages--; + + /* Zero the page outside of the critical section */ + spin_unlock(&sbi->lock); + + kaddr = kmap_local_page(page); + memset(kaddr, 0, PAGE_SIZE); + kunmap_local(kaddr); + + spin_lock(&sbi->lock); + + list_add(&page->lru, &sbi->active_list); + +unlock: + spin_unlock(&sbi->lock); + return page; +} + +static void basicmfs_return_page(struct page *page, struct basicmfs_sb_info *sbi) +{ + spin_lock(&sbi->lock); + + list_del(&page->lru); + /* + * We don't need to drop a reference here for the page being unmapped; + * the unmapping code has already handled that. + */ + + list_add_tail(&page->lru, &sbi->free_list); + sbi->free_pages++; + + spin_unlock(&sbi->lock); +} + +static void basicmfs_free_range(struct inode *inode, u64 offset, loff_t len) +{ + struct basicmfs_sb_info *sbi = BMFS_SB(inode->i_sb); + struct basicmfs_inode_info *inode_info = BMFS_I(inode); + struct address_space *mapping = inode_info->mapping; + struct folio_batch fbatch; + int i; + pgoff_t cur_offset = offset >> PAGE_SHIFT; + pgoff_t end_offset = (offset + len) >> PAGE_SHIFT; + + folio_batch_init(&fbatch); + while (cur_offset < end_offset) { + /* Stop once there are no more folios in the range */ + if (!filemap_get_folios(mapping, &cur_offset, end_offset - 1, &fbatch)) + break; + + for (i = 0; i < fbatch.nr; i++) { + folio_lock(fbatch.folios[i]); + filemap_remove_folio(fbatch.folios[i]); + folio_unlock(fbatch.folios[i]); + basicmfs_return_page(folio_page(fbatch.folios[i], 0), sbi); + } + + folio_batch_release(&fbatch); + } +} + +static vm_fault_t basicmfs_fault(struct vm_fault *vmf) +{ + struct vm_area_struct *vma = vmf->vma; + struct address_space *mapping = vma->vm_file->f_mapping; + struct inode *inode = vma->vm_file->f_inode; + struct basicmfs_inode_info *inode_info; + struct basicmfs_sb_info *sbi; + struct page *page = NULL; + bool new_page = true; + bool cow_fault = false; + u64 pgoff = ((vmf->address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff; + vm_fault_t ret = 0; + pte_t entry; + + inode_info = BMFS_I(inode); + sbi = BMFS_SB(inode->i_sb); + + if (!vmf->pte) { + if (pte_alloc(vma->vm_mm, vmf->pmd)) + return VM_FAULT_OOM; + } + + vmf->pte = pte_offset_map(vmf->pmd, vmf->address); + vmf->orig_pte = *vmf->pte; + if (!pte_none(vmf->orig_pte) && pte_present(vmf->orig_pte)) { + if (!(vmf->flags & FAULT_FLAG_WRITE)) { + /* + * The PTE is already populated; another thread + * probably raced us to handle this fault first.
+ */ + ret = VM_FAULT_NOPAGE; + goto unmap; + } + + cow_fault = true; + } + + /* Get the page if it was preallocated */ + page = mtree_erase(&inode_info->falloc_mt, pgoff); + + /* Try to allocate the page if it hasn't been already */ + if (!page) { + page = basicmfs_alloc_page(inode_info, sbi, pgoff); + if (!page) { + ret = VM_FAULT_OOM; + goto unmap; + } + } + + if (!pte_none(vmf->orig_pte) && !pte_present(vmf->orig_pte)) { + /* Swapped out page */ + struct page *ret_page; + swp_entry_t swp_entry = pte_to_swp_entry(vmf->orig_pte); + + ret_page = fbmm_read_swap_entry(vmf, swp_entry, pgoff, page); + if (page != ret_page) { + /* + * A physical page was already being used for this virt page + * or there was an error, so we can return the page we allocated. + */ + basicmfs_return_page(page, sbi); + page = ret_page; + new_page = false; + } + if (!page) { + pr_warn("BasicMFS: Error swapping in page! %lx\n", vmf->address); + ret = VM_FAULT_SIGBUS; + goto unmap; + } + } + + vmf->ptl = pte_lockptr(vma->vm_mm, vmf->pmd); + spin_lock(vmf->ptl); + /* Check if some other thread faulted here */ + if (!pte_same(vmf->orig_pte, *vmf->pte)) { + if (new_page) + basicmfs_return_page(page, sbi); + goto unlock; + } + + /* Handle COW fault */ + if (cow_fault) { + u8 *src_kaddr, *dst_kaddr; + struct page *old_page; + struct folio *old_folio; + unsigned long old_pfn; + + old_pfn = pte_pfn(vmf->orig_pte); + old_page = pfn_to_page(old_pfn); + + lock_page(old_page); + + /* + * If there's more than one reference to this page, we need to copy it. + * Otherwise, we can just reuse it. + */ + if (page_mapcount(old_page) > 1) { + src_kaddr = kmap_local_page(old_page); + dst_kaddr = kmap_local_page(page); + memcpy(dst_kaddr, src_kaddr, PAGE_SIZE); + kunmap_local(dst_kaddr); + kunmap_local(src_kaddr); + } else { + basicmfs_return_page(page, sbi); + page = old_page; + } + /* + * Drop a reference to old_page even if we are going to keep it + * because the reference will be increased at the end of the fault. + */ + put_page(old_page); + /* Decrease the filepage and rmap count for the same reason */ + percpu_counter_dec(&vma->vm_mm->rss_stat[MM_FILEPAGES]); + folio_remove_rmap_pte(page_folio(old_page), old_page, vma); + + old_folio = page_folio(old_page); + /* + * If we are copying a page for the process that originally faulted the + * page, we have to replace the mapping. + */ + if (mapping == old_folio->mapping) { + if (old_page != page) + replace_page_cache_folio(old_folio, page_folio(page)); + new_page = false; + } + unlock_page(old_page); + } + + if (new_page) + /* + * We want to manage the folio ourselves, and don't want it on the LRU lists, + * so we use __filemap_add_folio instead of filemap_add_folio.
+ */ + __filemap_add_folio(mapping, page_folio(page), pgoff, GFP_KERNEL, NULL); + + /* Construct the pte entry */ + entry = mk_pte(page, vma->vm_page_prot); + entry = pte_mkyoung(entry); + if (vma->vm_flags & VM_WRITE) + entry = pte_mkwrite_novma(pte_mkdirty(entry)); + + folio_add_file_rmap_pte(page_folio(page), page, vma); + percpu_counter_inc(&vma->vm_mm->rss_stat[MM_FILEPAGES]); + set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry); + + update_mmu_cache(vma, vmf->address, vmf->pte); + vmf->page = page; + get_page(page); + flush_tlb_page(vma, vmf->address); + ret = VM_FAULT_NOPAGE; + +unlock: + spin_unlock(vmf->ptl); +unmap: + pte_unmap(vmf->pte); + return ret; +} + +const struct vm_operations_struct basicmfs_vm_ops = { + .fault = basicmfs_fault, + .page_mkwrite = basicmfs_fault, + .pfn_mkwrite = basicmfs_fault, +}; + +static int basicmfs_mmap(struct file *file, struct vm_area_struct *vma) +{ + struct inode *inode = file_inode(file); + struct basicmfs_inode_info *inode_info = BMFS_I(inode); + + file_accessed(file); + vma->vm_ops = &basicmfs_vm_ops; + + inode_info->file_va_start = vma->vm_start - (vma->vm_pgoff << PAGE_SHIFT); + inode_info->mapping = file->f_mapping; + + return 0; +} + +static int basicmfs_release(struct inode *inode, struct file *file) +{ + struct basicmfs_sb_info *sbi = BMFS_SB(inode->i_sb); + struct basicmfs_inode_info *inode_info = BMFS_I(inode); + struct page *page; + unsigned long index = 0; + unsigned long free_count = 0; + + basicmfs_free_range(inode, 0, inode->i_size); + + mt_for_each(&inode_info->falloc_mt, page, index, ULONG_MAX) { + basicmfs_return_page(page, sbi); + free_count++; + } + + mtree_destroy(&inode_info->falloc_mt); + kfree(inode_info); + + return 0; +} + +static long basicmfs_fallocate(struct file *file, int mode, loff_t offset, loff_t len) +{ + struct inode *inode = file_inode(file); + struct basicmfs_sb_info *sbi = BMFS_SB(inode->i_sb); + struct basicmfs_inode_info *inode_info = BMFS_I(inode); + struct page *page; + loff_t off; + + if (mode & FALLOC_FL_PUNCH_HOLE) { + basicmfs_free_range(inode, offset, len); + return 0; + } else if (mode != 0) { + return -EOPNOTSUPP; + } + + for (off = offset; off < offset + len; off += PAGE_SIZE) { + page = basicmfs_alloc_page(inode_info, sbi, off >> PAGE_SHIFT); + /* Don't store a NULL page in the maple tree on allocation failure */ + if (!page) + return -ENOMEM; + mtree_store(&inode_info->falloc_mt, off >> PAGE_SHIFT, page, GFP_KERNEL); + } + + return 0; +} + +const struct file_operations basicmfs_file_operations = { + .mmap = basicmfs_mmap, + .release = basicmfs_release, + .fsync = noop_fsync, + .llseek = generic_file_llseek, + .get_unmapped_area = generic_get_unmapped_area_topdown, + .fallocate = basicmfs_fallocate, +}; + +const struct inode_operations basicmfs_file_inode_operations = { + .setattr = simple_setattr, + .getattr = simple_getattr, +}; + +const struct address_space_operations basicmfs_aops = { + .direct_IO = noop_direct_IO, + .dirty_folio = noop_dirty_folio, + .writepage = fbmm_writepage, +}; + +static struct inode *basicmfs_get_inode(struct super_block *sb, + const struct inode *dir, umode_t mode, dev_t dev) +{ + struct inode *inode = new_inode(sb); + struct basicmfs_inode_info *info; + + if (!inode) + return NULL; + + info = kzalloc(sizeof(struct basicmfs_inode_info), GFP_KERNEL); + if (!info) { + iput(inode); + return NULL; + } + mt_init(&info->falloc_mt); + info->file_va_start = 0; + + inode->i_ino = get_next_ino(); + inode_init_owner(&nop_mnt_idmap, inode, dir, mode); + inode->i_mapping->a_ops = &basicmfs_aops; + inode->i_flags |= S_DAX; + inode->i_private = info; + switch (mode &
S_IFMT) { + case S_IFREG: + inode->i_op = &basicmfs_file_inode_operations; + inode->i_fop = &basicmfs_file_operations; + break; + case S_IFDIR: + inode->i_op = &basicmfs_dir_inode_operations; + inode->i_fop = &simple_dir_operations; + + /* Directory inodes start off with i_nlink == 2 (for "." entry) */ + inc_nlink(inode); + break; + default: + /* Don't leak the inode or its private info on unsupported modes */ + iput(inode); + kfree(info); + return NULL; + } + + return inode; +} + +static int basicmfs_mknod(struct mnt_idmap *idmap, struct inode *dir, + struct dentry *dentry, umode_t mode, dev_t dev) +{ + struct inode *inode = basicmfs_get_inode(dir->i_sb, dir, mode, dev); + int error = -ENOSPC; + + if (inode) { + d_instantiate(dentry, inode); + dget(dentry); /* Extra count - pin the dentry in core */ + error = 0; + } + + return error; +} + +static int basicmfs_mkdir(struct mnt_idmap *idmap, struct inode *dir, + struct dentry *dentry, umode_t mode) +{ + return -EINVAL; +} + +static int basicmfs_create(struct mnt_idmap *idmap, struct inode *dir, + struct dentry *dentry, umode_t mode, bool excl) +{ + // TODO: Replace 0777 with mode and see if anything breaks + return basicmfs_mknod(idmap, dir, dentry, 0777 | S_IFREG, 0); +} + +static int basicmfs_symlink(struct mnt_idmap *idmap, struct inode *dir, + struct dentry *dentry, const char *symname) +{ + return -EINVAL; +} + +static int basicmfs_tmpfile(struct mnt_idmap *idmap, + struct inode *dir, struct file *file, umode_t mode) +{ + struct inode *inode; + + inode = basicmfs_get_inode(dir->i_sb, dir, mode, 0); + if (!inode) + return -ENOSPC; + d_tmpfile(file, inode); + return finish_open_simple(file, 0); +} + +static const struct inode_operations basicmfs_dir_inode_operations = { + .create = basicmfs_create, + .lookup = simple_lookup, + .link = simple_link, + .unlink = simple_unlink, + .symlink = basicmfs_symlink, + .mkdir = basicmfs_mkdir, + .rmdir = simple_rmdir, + .mknod = basicmfs_mknod, + .rename = simple_rename, + .tmpfile = basicmfs_tmpfile, +}; + +static int basicmfs_statfs(struct dentry *dentry, struct kstatfs *buf) +{ + struct super_block *sb = dentry->d_sb; + struct basicmfs_sb_info *sbi = BMFS_SB(sb); + + buf->f_type = sb->s_magic; + buf->f_bsize = PAGE_SIZE; + buf->f_blocks = sbi->num_pages; + buf->f_bfree = buf->f_bavail = sbi->free_pages; + buf->f_files = LONG_MAX; + buf->f_ffree = LONG_MAX; + buf->f_namelen = 255; + + return 0; +} + +static int basicmfs_show_options(struct seq_file *m, struct dentry *root) +{ + return 0; +} + +#define BASICMFS_MAX_PAGEOUT 512 +static long basicmfs_nr_cached_objects(struct super_block *sb, struct shrink_control *sc) +{ + struct basicmfs_sb_info *sbi = BMFS_SB(sb); + long nr = 0; + + spin_lock(&sbi->lock); + if (sbi->free_pages > 0) + nr = sbi->free_pages; + else + nr = max(sbi->num_pages - sbi->free_pages, (u64)BASICMFS_MAX_PAGEOUT); + spin_unlock(&sbi->lock); + + return nr; +} + +static long basicmfs_free_cached_objects(struct super_block *sb, struct shrink_control *sc) +{ + LIST_HEAD(folio_list); + LIST_HEAD(fail_list); + struct basicmfs_sb_info *sbi = BMFS_SB(sb); + struct page *page; + u64 i = 0, num_scanned; + + if (sbi->free_pages > 0) { + spin_lock(&sbi->lock); + for (i = 0; i < sc->nr_to_scan && i < sbi->free_pages; i++) { + page = list_first_entry(&sbi->free_list, struct page, lru); + list_del(&page->lru); + put_page(page); + } + + sbi->num_pages -= i; + sbi->free_pages -= i; + spin_unlock(&sbi->lock); + } else if (sbi->num_pages > 0) { + spin_lock(&sbi->lock); + for (i = 0; i < sc->nr_to_scan && sbi->num_pages > 0; i++) { + page = list_first_entry(&sbi->active_list, struct page, lru); +
list_move(&page->lru, &folio_list); + sbi->num_pages--; + } + spin_unlock(&sbi->lock); + + num_scanned = i; + for (i = 0; i < num_scanned && !list_empty(&folio_list); i++) { + page = list_first_entry(&folio_list, struct page, lru); + list_del(&page->lru); + if (fbmm_swapout_folio(page_folio(page))) + list_add_tail(&page->lru, &fail_list); + else + put_page(page); + } + + spin_lock(&sbi->lock); + while (!list_empty(&fail_list)) { + page = list_first_entry(&fail_list, struct page, lru); + list_del(&page->lru); + list_add_tail(&page->lru, &sbi->active_list); + sbi->num_pages++; + } + spin_unlock(&sbi->lock); + + } + + sc->nr_scanned = i; + return i; +} + +static const struct super_operations basicmfs_ops = { + .statfs = basicmfs_statfs, + .drop_inode = generic_delete_inode, + .show_options = basicmfs_show_options, + .nr_cached_objects = basicmfs_nr_cached_objects, + .free_cached_objects = basicmfs_free_cached_objects, + .copy_page_range = fbmm_copy_page_range, +}; + +static int basicmfs_fill_super(struct super_block *sb, struct fs_context *fc) +{ + struct inode *inode; + struct basicmfs_sb_info *sbi = kzalloc(sizeof(struct basicmfs_sb_info), GFP_KERNEL); + u64 nr_pages = *(u64 *)fc->fs_private; + u64 alloc_size = 1024; + + if (!sbi) + return -ENOMEM; + + sb->s_fs_info = sbi; + sb->s_maxbytes = MAX_LFS_FILESIZE; + sb->s_magic = 0xDEADBEEF; + sb->s_op = &basicmfs_ops; + sb->s_time_gran = 1; + sb->s_blocksize = PAGE_SIZE; + sb->s_blocksize_bits = PAGE_SHIFT; + + spin_lock_init(&sbi->lock); + INIT_LIST_HEAD(&sbi->free_list); + INIT_LIST_HEAD(&sbi->active_list); + sbi->max_pages = nr_pages; + sbi->num_pages = 0; + for (int i = 0; i < nr_pages / alloc_size; i++) + sbi->num_pages += alloc_pages_bulk_list(GFP_HIGHUSER, alloc_size, &sbi->free_list); + sbi->free_pages = sbi->num_pages; + + inode = basicmfs_get_inode(sb, NULL, S_IFDIR | 0755, 0); + sb->s_root = d_make_root(inode); + if (!sb->s_root) { + kfree(sbi); + return -ENOMEM; + } + + return 0; +} + +static int basicmfs_get_tree(struct fs_context *fc) +{ + return get_tree_nodev(fc, basicmfs_fill_super); +} + +enum basicmfs_param { + Opt_numpages, +}; + +const struct fs_parameter_spec basicmfs_fs_parameters[] = { + fsparam_u64("numpages", Opt_numpages), + {}, +}; + +static int basicmfs_parse_param(struct fs_context *fc, struct fs_parameter *param) +{ + struct fs_parse_result result; + u64 *num_pages = (u64 *)fc->fs_private; + int opt; + + opt = fs_parse(fc, basicmfs_fs_parameters, param, &result); + if (opt < 0) { + /* + * We might like to report bad mount options here; + * but traditionally ramfs has ignored all mount options, + * and as it is used as a !CONFIG_SHMEM simple substitute + * for tmpfs, better continue to ignore other mount options. 
+ */ + if (opt == -ENOPARAM) + opt = 0; + return opt; + } + + switch (opt) { + case Opt_numpages: + *num_pages = result.uint_64; + break; + } + + return 0; +} + +static void basicmfs_free_fc(struct fs_context *fc) +{ + kfree(fc->fs_private); +} + +static const struct fs_context_operations basicmfs_context_ops = { + .free = basicmfs_free_fc, + .parse_param = basicmfs_parse_param, + .get_tree = basicmfs_get_tree, +}; + +static int basicmfs_init_fs_context(struct fs_context *fc) +{ + fc->ops = &basicmfs_context_ops; + + fc->fs_private = kzalloc(sizeof(u64), GFP_KERNEL); + if (!fc->fs_private) + return -ENOMEM; + /* Set a default number of pages to use */ + *(u64 *)fc->fs_private = 128 * 1024; + return 0; +} + +static void basicmfs_kill_sb(struct super_block *sb) +{ + struct basicmfs_sb_info *sbi = BMFS_SB(sb); + struct page *page, *tmp; + + spin_lock(&sbi->lock); + + /* + * Return the pages we took to the kernel. + * All the pages should be in the free list at this point. + */ + list_for_each_entry_safe(page, tmp, &sbi->free_list, lru) { + list_del(&page->lru); + put_page(page); + } + + spin_unlock(&sbi->lock); + + kfree(sbi); + + kill_litter_super(sb); +} + +static struct file_system_type basicmfs_fs_type = { + .owner = THIS_MODULE, + .name = "BasicMFS", + .init_fs_context = basicmfs_init_fs_context, + .parameters = basicmfs_fs_parameters, + .kill_sb = basicmfs_kill_sb, + .fs_flags = FS_USERNS_MOUNT, +}; + +static int __init init_basicmfs(void) +{ + printk(KERN_INFO "Starting BasicMFS\n"); + return register_filesystem(&basicmfs_fs_type); +} +module_init(init_basicmfs); + +static void cleanup_basicmfs(void) +{ + printk(KERN_INFO "Removing BasicMFS\n"); + unregister_filesystem(&basicmfs_fs_type); +} +module_exit(cleanup_basicmfs); + +MODULE_LICENSE("GPL"); +MODULE_AUTHOR("Bijan Tabatabai"); diff --git a/BasicMFS/basic.h b/BasicMFS/basic.h new file mode 100644 index 000000000000..8e727201aca3 --- /dev/null +++ b/BasicMFS/basic.h @@ -0,0 +1,29 @@ +#ifndef BASIC_MMFS_H +#define BASIC_MMFS_H + +#include +#include +#include +#include +#include + +struct basicmfs_sb_info { + spinlock_t lock; + struct list_head free_list; + struct list_head active_list; + u64 num_pages; + u64 max_pages; + u64 free_pages; +}; + +struct basicmfs_inode_info { + // Maple tree mapping the page offset to the folio mapped to that offset. + // Used to hold preallocated pages that haven't been mapped yet. + struct maple_tree falloc_mt; + // The first virtual address this file is associated with. + u64 file_va_start; + // The file's offset-to-folio mapping (its address_space). + struct address_space *mapping; +}; + +#endif //BASIC_MMFS_H
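[Editorial note: a hedged userspace sketch, not part of the series, of one way to exercise BasicMFS directly once it is mounted. It assumes a mount point of /mnt/bmfs, standing in for the <dir> in the mount command above; the first touch of each mapped page should go through basicmfs_fault().]

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const size_t len = 4 * 4096;	/* four base pages */
	int fd = open("/mnt/bmfs/test", O_CREAT | O_RDWR, 0666);
	char *buf;

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Give the file a size so the mapping below is backed */
	if (ftruncate(fd, len) != 0) {
		perror("ftruncate");
		return 1;
	}
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* First touch of each page faults it in through the MFS fault handler */
	memset(buf, 0x5a, len);
	printf("first byte: 0x%x\n", (unsigned char)buf[0]);
	munmap(buf, len);
	close(fd);
	return 0;
}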