Message ID | 20210922070645.47345-2-rongwei.wang@linux.alibaba.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v2,1/2] mm, thp: check page mapping when truncating page cache | expand |
On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote: > Transparent huge page has supported read-only non-shmem files. The file- > backed THP is collapsed by khugepaged and truncated when written (for > shared libraries). > > However, there is race in two possible places. > > 1) multiple writers truncate the same page cache concurrently; > 2) collapse_file rolls back when writer truncates the page cache; As I've said before, the bug here is that somehow there is a writable fd to a file with THPs. That's what we need to track down and fix. https://lore.kernel.org/linux-mm/YUdL3lFLFHzC80Wt@casper.infradead.org/
> On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <willy@infradead.org> wrote: > > On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote: >> Transparent huge page has supported read-only non-shmem files. The file- >> backed THP is collapsed by khugepaged and truncated when written (for >> shared libraries). >> >> However, there is race in two possible places. >> >> 1) multiple writers truncate the same page cache concurrently; >> 2) collapse_file rolls back when writer truncates the page cache; > > As I've said before, the bug here is that somehow there is a writable fd > to a file with THPs. That's what we need to track down and fix. Hi, Matthew I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs" Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way. So, the Pagecache of DSO need to be cleaned when someone opened this DSO in a writeable way. The process of Pagecache cleaned mainly refers to ’truncate_inode_pages_range’. Of course, the above is consensus for us. The ’somehow’ that your said is sure for us. Maybe I can try to describe roughly here: In collapse_file <khugepaged>, when PTEs scan failed (e.g. SCAN_FAIL, SCAN_TRUNCATED..), the following code will be run: } else { struct page *page; /* Something went wrong: roll back page cache changes */ xas_lock_irq(&xas); mapping->nrpages -= nr_none; if (is_shmem) shmem_uncharge(mapping->host, nr_none); xas_set(&xas, start); xas_for_each(&xas, page, end - 1) { page = list_first_entry_or_null(&pagelist, struct page, lru); if (!page || xas.xa_index < page->index) { if (!nr_none) break; nr_none--; /* Put holes back where they were */ xas_store(&xas, NULL); continue; } VM_BUG_ON_PAGE(page->index != xas.xa_index, page); /* Unfreeze the page. */ list_del(&page->lru); page_ref_unfreeze(page, 2); line1 xas_store(&xas, page); xas_pause(&xas); xas_unlock_irq(&xas); unlock_page(page); putback_lru_page(page); xas_lock_irq(&xas); } …. new_page->mapping = NULL; } line2 unlock_page(new_page); --- The above code refers to roll back. When someone starts open a DSO in a writeable way, and the collapse_file is scanning PTEs. The hugepage had been locked and was mapped in page cache before line1. And the hugepage not be in pagecache and be unlocked at line2. The race that between ‘collapse_file’ and ’truncate_inode_pages_range’ will happen when ‘collapse_file' is executing line1 and line2. This race can be shown concisely: Khugepaged: writer Collapse_file: truncate_inode_pages_range lock_page(hugepage) skip hugepage because locked by others scan_left_pages() and wait_lock_page(hugepage) scan_fail() & xas_store(&xas, 4k_page) unlock_page(hugepage) get_lock_page(hugepage) hugepage is not in pagecache here, but not be checked This situation triggers easily if we setting a very small ‘scan_sleep_millisecs’. The above descriptions are roughly my understanding of ’somehow’. It says a lot of nonsense, maybe it helps! > > https://lore.kernel.org/linux-mm/YUdL3lFLFHzC80Wt@casper.infradead.org/ All in all, what you mean is that we should solve this race at the source? However, a writer happens to appear in the middle of a process in ‘collapse_file', so this bug solved when roll back. The above is my understanding, but it may be wrong. Thanks
On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote: > > > > On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <willy@infradead.org> wrote: > > > > On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote: > >> Transparent huge page has supported read-only non-shmem files. The file- > >> backed THP is collapsed by khugepaged and truncated when written (for > >> shared libraries). > >> > >> However, there is race in two possible places. > >> > >> 1) multiple writers truncate the same page cache concurrently; > >> 2) collapse_file rolls back when writer truncates the page cache; > > > > As I've said before, the bug here is that somehow there is a writable fd > > to a file with THPs. That's what we need to track down and fix. > Hi, Matthew > I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs" > Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way. > > ... > > > https://lore.kernel.org/linux-mm/YUdL3lFLFHzC80Wt@casper.infradead.org/ > All in all, what you mean is that we should solve this race at the source? Matthew is being pretty clear here: we shouldn't be permitting userspace to get a writeable fd for a thp-backed file. Why are we permitting the DSO to be opened writeably? If there's a legitimate case for doing this then presumably "mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs: should be fixed or reverted. If there is no legitimate use case for returning a writeable fd for a thp-backed file then we should fail such an attempt at open(). This approach has back-compatibility issues which need to be thought about. Perhaps we should permit the open-writeably attempt to appear to succeed, but to really return a read-only fd?
On Thu, Sep 23, 2021 at 7:43 PM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote: > > > > > > > > On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <willy@infradead.org> wrote: > > > > > > On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote: > > >> Transparent huge page has supported read-only non-shmem files. The file- > > >> backed THP is collapsed by khugepaged and truncated when written (for > > >> shared libraries). > > >> > > >> However, there is race in two possible places. > > >> > > >> 1) multiple writers truncate the same page cache concurrently; > > >> 2) collapse_file rolls back when writer truncates the page cache; > > > > > > As I've said before, the bug here is that somehow there is a writable fd > > > to a file with THPs. That's what we need to track down and fix. > > Hi, Matthew > > I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs" > > Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way. > > > > ... > > > > > https://lore.kernel.org/linux-mm/YUdL3lFLFHzC80Wt@casper.infradead.org/ > > All in all, what you mean is that we should solve this race at the source? > > Matthew is being pretty clear here: we shouldn't be permitting > userspace to get a writeable fd for a thp-backed file. No, he doesn't mean it IIRC. Actually we had the same conversation for another patch. Quoted below: " > > Things have already gone wrong before we get to this point. See > > do_dentry_open(). You aren't supposed to be able to get a writable file > > descriptor on a file which has had huge pages added to the page cache > > without the filesystem's knowledge. That's the problem that needs to > > be fixed. > > I don't quite understand your point here. Do you mean do_dentry_open() > should fail for such cases instead of truncating the page cache? No, do_dentry_open() should have truncated the page cache when it was called and found that there were THPs in the cache. Then khugepaged should see that someone has the file open for write and decline to create new THPs. So it shouldn't be possible to get here with THPs in the cache." Please see https://lore.kernel.org/linux-mm/YUkCI2I085Sos%2F64@casper.infradead.org/ But actually "mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs" did so exactly. > > Why are we permitting the DSO to be opened writeably? If there's a > legitimate case for doing this then presumably "mm, thp: relax the > VM_DENYWRITE constraint on file-backed THPs: should be fixed or > reverted. Unfortunately we can't revert this commit anymore since VM_DENYWRITE is gone due to commit 8d0920bde5eb ("mm: remove VM_DENYWRITE") > > If there is no legitimate use case for returning a writeable fd for a > thp-backed file then we should fail such an attempt at open(). This > approach has back-compatibility issues which need to be thought about. > Perhaps we should permit the open-writeably attempt to appear to > succeed, but to really return a read-only fd? > >
On 9/24/21 10:43 AM, Andrew Morton wrote: > On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote: > >> >>> On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <willy@infradead.org> wrote: >>> >>> On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote: >>>> Transparent huge page has supported read-only non-shmem files. The file- >>>> backed THP is collapsed by khugepaged and truncated when written (for >>>> shared libraries). >>>> >>>> However, there is race in two possible places. >>>> >>>> 1) multiple writers truncate the same page cache concurrently; >>>> 2) collapse_file rolls back when writer truncates the page cache; >>> As I've said before, the bug here is that somehow there is a writable fd >>> to a file with THPs. That's what we need to track down and fix. >> Hi, Matthew >> I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs" >> Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way. >> >> ... >> >>> https://lore.kernel.org/linux-mm/YUdL3lFLFHzC80Wt@casper.infradead.org/ >> All in all, what you mean is that we should solve this race at the source? > Matthew is being pretty clear here: we shouldn't be permitting > userspace to get a writeable fd for a thp-backed file. > > Why are we permitting the DSO to be opened writeably? If there's a > legitimate case for doing this then presumably "mm, thp: relax the Hi, we have written a user case to trigger this race mentioned above. this case just create one reader to open DSO in RDONLY mode, and keep making the mapping page of DSO use huge pages by madvise, then multiple writer to open and write the same DSO. I will send it later after simple adjust. Thanks > VM_DENYWRITE constraint on file-backed THPs: should be fixed or > reverted. > > If there is no legitimate use case for returning a writeable fd for a > thp-backed file then we should fail such an attempt at open(). This > approach has back-compatibility issues which need to be thought about. > Perhaps we should permit the open-writeably attempt to appear to > succeed, but to really return a read-only fd?
On 9/24/21 10:43 AM, Andrew Morton wrote: > On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote: > >> >> >>> On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <willy@infradead.org> wrote: >>> >>> On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote: >>>> Transparent huge page has supported read-only non-shmem files. The file- >>>> backed THP is collapsed by khugepaged and truncated when written (for >>>> shared libraries). >>>> >>>> However, there is race in two possible places. >>>> >>>> 1) multiple writers truncate the same page cache concurrently; >>>> 2) collapse_file rolls back when writer truncates the page cache; >>> >>> As I've said before, the bug here is that somehow there is a writable fd >>> to a file with THPs. That's what we need to track down and fix. >> Hi, Matthew >> I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs" >> Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way. >> >> ... >> >>> https://lore.kernel.org/linux-mm/YUdL3lFLFHzC80Wt@casper.infradead.org/ >> All in all, what you mean is that we should solve this race at the source? > > Matthew is being pretty clear here: we shouldn't be permitting > userspace to get a writeable fd for a thp-backed file. > > Why are we permitting the DSO to be opened writeably? If there's a > legitimate case for doing this then presumably "mm, thp: relax the There is a use case to stress file-backed THP within attachment. I test this case in a system which has enabled CONFIG_READ_ONLY_THP_FOR_FS: $ gcc -Wall -g -o stress_madvise_dso stress_madvise_dso.c $ ulimit -s unlimited $ ./stress_madvise_dso 10000 <libtest.so> the meaning of above parameters: 10000: the max test time; <libtest.so>: the DSO that will been mapped into file-backed THP by madvise. It recommended that the text segment of DSO to be tested is greater than 2M. The crash will been triggered at once in the latest kernel. And this case also can used to trigger the bug that mentioned in our another patch. > VM_DENYWRITE constraint on file-backed THPs: should be fixed or > reverted. > > If there is no legitimate use case for returning a writeable fd for a > thp-backed file then we should fail such an attempt at open(). This > approach has back-compatibility issues which need to be thought about. > Perhaps we should permit the open-writeably attempt to appear to > succeed, but to really return a read-only fd? > /* * case: stress file-backed THP */ #ifndef _GNU_SOURCE #define _GNU_SOURCE #endif #include <assert.h> #include <fcntl.h> #include <stdio.h> #include <stdlib.h> #include <stdint.h> #include <sched.h> #include <time.h> #include <string.h> #include <fcntl.h> #include <signal.h> /* for signal */ #include <sys/mman.h> #include <sys/time.h> #include <sys/types.h> #include <sys/wait.h> #include <unistd.h> #include <errno.h> #define PATH_MAX 1024 #define BUFF_MAX 1024 #define TIME_DFL 180 /* seconds */ void signal_handler(int signo) { /* Restore env */ system("echo never > /sys/kernel/mm/transparent_hugepage/enabled"); system("echo 10000 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs"); printf("\nrestore env:\n"); printf(" echo never > /sys/kernel/mm/transparent_hugepage/enabled\n"); printf(" echo 10000 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs\n"); exit(-1); } /* in KB */ #define text_size (14UL << 10) #define PROCMAP_SZ 8 struct procmap { uint64_t vm_start; uint64_t vm_end; uint64_t pgoff; uint32_t maj; uint32_t min; uint32_t ino; #define PROT_SZ 5 char prot[PROT_SZ]; char fname[PATH_MAX]; }; unsigned long sleep_secs = 0; /* * Routines of procmap, i.e., /proc/pid/(s)maps */ static int get_memory_map(pid_t pid, struct procmap *procmap, const char *fname) { char path[PATH_MAX]; char line[BUFF_MAX]; FILE *fp = NULL; char *end = NULL; char *pos, *sp = NULL, *in[PROCMAP_SZ]; char dlm[] = "- : "; uint64_t counter; int i; snprintf(path, PATH_MAX, "/proc/%u/maps", pid); fp = fopen(path, "r"); if (fp == NULL) { printf("fopen: %s: %s\n", path, strerror(errno)); return -1; } if (procmap == NULL || fname == NULL) { perror("fail: procmap or fname is NULL"); goto failed; } while (fgets(line, BUFF_MAX, fp)) { /* Split line into fields */ pos = line; for (i = 0; i < PROCMAP_SZ; i++) { in[i] = strtok_r(pos, &dlm[i], &sp); if (in[i] == NULL) break; pos = NULL; } /* Check this line is procmap item header */ if (i != PROCMAP_SZ) continue; memcpy(procmap->prot, in[2], PROT_SZ); memcpy(procmap->fname, in[7], PATH_MAX); /* Find the target entry */ if (strcmp(procmap->prot, "r-xp") || !strstr(procmap->fname, fname)) continue; /* Convert/Copy each field as needed */ errno = 0; procmap->vm_start = strtoull(in[0], &end, 16); if ((in[0] == '\0') || (end == NULL) || (*end != '\0') || (errno != 0)) goto failed; procmap->vm_end = strtoull(in[1], &end, 16); if ((in[1] == '\0') || (end == NULL) || (*end != '\0') || (errno != 0)) goto failed; procmap->pgoff = strtoull(in[3], &end, 16); if ((in[3] == '\0') || (end == NULL) || (*end != '\0') || (errno != 0)) goto failed; procmap->ino = strtoul(in[6], &end, 16); if ((in[6] == '\0') || (end == NULL) || (*end != '\0') || (errno != 0)) goto failed; } if (fp) fclose(fp); return 0; failed: if (fp) fclose(fp); printf("fail: exit\n"); return -1; } #define NR_CPU 32 uint64_t gettimeofday_sec(void); inline uint64_t gettimeofday_sec(void) { struct timeval tv; gettimeofday(&tv, NULL); return tv.tv_sec; } void thread_read(int cpu, char *args) { int fd; char *dso_path = args; char buf[0x800000]; struct procmap maps; pid_t pid = getpid(); cpu_set_t mask; CPU_ZERO(&mask); CPU_SET(cpu, &mask); if (sched_setaffinity(0, sizeof(cpu_set_t), &mask) == -1) { printf("warning: can not set CPU affinity\n"); } printf("read %s\n", dso_path); fd = open(dso_path, O_RDONLY); /* The start addr must be alignment with 2M */ void *p = mmap((void *)0x40000dc00000UL, 0x800000, PROT_READ | PROT_EXEC, MAP_PRIVATE, fd, 0); if (p == MAP_FAILED) { perror("mmap"); goto out; } /* get the mapping address (ONLY r-xp) of the DSO */ get_memory_map(pid, &maps, dso_path); printf("pid: %d\n", pid); printf("text vm_start: 0x%lx\n", maps.vm_start); printf("text vm_end: 0x%lx\n", maps.vm_end); madvise((void *)maps.vm_start, maps.vm_end - maps.vm_start, MADV_HUGEPAGE); lseek(fd, 0, SEEK_SET); for(;;) { memcpy(buf, p, 0x800000 - 1); sleep(1); } sleep(100); out: /* Restore env */ system("echo never > /sys/kernel/mm/transparent_hugepage/enabled"); system("echo 10000 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs"); printf("read exit %s\n", dso_path); } void thread_write(int cpu, char *args) { void *p = NULL; char buf[32]; uint64_t sec = 1; uint64_t count = 0; char *dso_path = args; cpu_set_t mask; CPU_ZERO(&mask); CPU_SET(cpu, &mask); if (sched_setaffinity(0, sizeof(cpu_set_t), &mask) == -1) { printf("warning: can not set CPU affinity\n"); } sleep(3); printf("write %s\n", dso_path); for (;;) { sec = gettimeofday_sec(); while ((sec % 10) >= 3) { sec = gettimeofday_sec(); } int fd = open(dso_path, O_RDWR); p = mmap(NULL, 0x800000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); if (p == MAP_FAILED) { perror("mmap"); goto out; /* fail */ } lseek(fd, 0x1600, SEEK_SET); for(long i=1; i <= 2; i++){ memcpy(p + 0x10, buf, 16); } munmap(p, 0x800000); close(fd); sleep(2); count++; if (count >= sleep_secs) { printf("test finish: %ld\n", count); break; } } /* end for */ out: printf("write exit %s\n", dso_path); } /* * usage: * stress_madvise_dso <test time> <libtest.so> */ int main(int argc, char *argv[]) { struct timeval start, end; char dso_path[80]; int ret = 0; pid_t pid; if (argc > 2) { sleep_secs = strtoul(argv[1], NULL, 10); realpath(argv[2], dso_path); } else { printf("usage error:\n"\ " stress_madvise_dso <test time> <libtest.so>\n"\ " e.g. stress_madvise_dso 10000 libtest.so\n"); exit(-1); } /* Set env */ system("ulimit -s unlimited"); system("echo madvise > /sys/kernel/mm/transparent_hugepage/enabled"); system("echo 1000 > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs"); gettimeofday(&start, NULL); /* * fork 32 task to read and write the same DSO: * task 0: read dso; * task 1 - 31: write dso; */ for (int i = 0; i < NR_CPU; ++i) { pid = fork(); if (pid == 0) { if (i == 0) thread_read(i, dso_path); else thread_write(i, dso_path); break; /* forbid child fork */ } else { /* parent */ } } if (pid != 0) { signal(SIGINT, signal_handler); signal(SIGSEGV, signal_handler); signal(SIGABRT, signal_handler); /* wait */ while (1) { int status; pid_t done = wait(&status); if (done == -1) { if (errno == ECHILD) break; /* No more child processes */ } else { if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) { printf("Pid:%d failed\n", done); goto out; } } } } out: if (ret == 0) { gettimeofday(&end, NULL); printf("time to collapse file thp: %ld ms\n", 1000 * (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1000); } return ret; }
On Fri, Sep 24, 2021 at 12:12 AM Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote: > > > > On 9/24/21 10:43 AM, Andrew Morton wrote: > > On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote: > > > >> > >> > >>> On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <willy@infradead.org> wrote: > >>> > >>> On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote: > >>>> Transparent huge page has supported read-only non-shmem files. The file- > >>>> backed THP is collapsed by khugepaged and truncated when written (for > >>>> shared libraries). > >>>> > >>>> However, there is race in two possible places. > >>>> > >>>> 1) multiple writers truncate the same page cache concurrently; > >>>> 2) collapse_file rolls back when writer truncates the page cache; > >>> > >>> As I've said before, the bug here is that somehow there is a writable fd > >>> to a file with THPs. That's what we need to track down and fix. > >> Hi, Matthew > >> I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs" > >> Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way. > >> > >> ... > >> > >>> https://lore.kernel.org/linux-mm/YUdL3lFLFHzC80Wt@casper.infradead.org/ > >> All in all, what you mean is that we should solve this race at the source? > > > > Matthew is being pretty clear here: we shouldn't be permitting > > userspace to get a writeable fd for a thp-backed file. > > > > Why are we permitting the DSO to be opened writeably? If there's a > > legitimate case for doing this then presumably "mm, thp: relax the > There is a use case to stress file-backed THP within attachment. > I test this case in a system which has enabled CONFIG_READ_ONLY_THP_FOR_FS: > > $ gcc -Wall -g -o stress_madvise_dso stress_madvise_dso.c > $ ulimit -s unlimited > $ ./stress_madvise_dso 10000 <libtest.so> > > the meaning of above parameters: > 10000: the max test time; > <libtest.so>: the DSO that will been mapped into file-backed THP by > madvise. It recommended that the text segment of DSO to be tested is > greater than 2M. > > The crash will been triggered at once in the latest kernel. And this > case also can used to trigger the bug that mentioned in our another patch. Hmm.. I am not able to use the repro program to crash the system. Not sure what I did wrong. OTOH, does it make sense to block writes within khugepaged, like: diff --git i/mm/khugepaged.c w/mm/khugepaged.c index 045cc579f724e..ad7c41ec15027 100644 --- i/mm/khugepaged.c +++ w/mm/khugepaged.c @@ -51,6 +51,7 @@ enum scan_result { SCAN_CGROUP_CHARGE_FAIL, SCAN_TRUNCATED, SCAN_PAGE_HAS_PRIVATE, + SCAN_BUSY_WRITE, }; #define CREATE_TRACE_POINTS @@ -1652,6 +1653,11 @@ static void collapse_file(struct mm_struct *mm, /* Only allocate from the target node */ gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE; + if (deny_write_access(file)) { + result = SCAN_BUSY_WRITE; + return; + } + new_page = khugepaged_alloc_page(hpage, gfp, node); if (!new_page) { result = SCAN_ALLOC_HUGE_PAGE_FAIL; @@ -1863,19 +1869,6 @@ static void collapse_file(struct mm_struct *mm, else { __mod_lruvec_page_state(new_page, NR_FILE_THPS, nr); filemap_nr_thps_inc(mapping); - /* - * Paired with smp_mb() in do_dentry_open() to ensure - * i_writecount is up to date and the update to nr_thps is - * visible. Ensures the page cache will be truncated if the - * file is opened writable. - */ - smp_mb(); - if (inode_is_open_for_write(mapping->host)) { - result = SCAN_FAIL; - __mod_lruvec_page_state(new_page, NR_FILE_THPS, -nr); - filemap_nr_thps_dec(mapping); - goto xa_locked; - } } if (nr_none) { @@ -1976,6 +1969,7 @@ static void collapse_file(struct mm_struct *mm, VM_BUG_ON(!list_empty(&pagelist)); if (!IS_ERR_OR_NULL(*hpage)) mem_cgroup_uncharge(*hpage); + allow_write_access(file); /* TODO: tracepoints */ }
On Mon, Sep 27, 2021 at 03:24:47PM -0700, Song Liu wrote: > OTOH, does it make sense to block writes within khugepaged, like: > @@ -1652,6 +1653,11 @@ static void collapse_file(struct mm_struct *mm, > /* Only allocate from the target node */ > gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE; > > + if (deny_write_access(file)) { > + result = SCAN_BUSY_WRITE; > + return; > + } The problem is that it denies, rather than blocking. That means that the writer gets ETXTBSY instead of waiting until khugepaged is done.
On 9/28/21 6:24 AM, Song Liu wrote: > On Fri, Sep 24, 2021 at 12:12 AM Rongwei Wang > <rongwei.wang@linux.alibaba.com> wrote: >> >> >> >> On 9/24/21 10:43 AM, Andrew Morton wrote: >>> On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote: >>> >>>> >>>> >>>>> On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <willy@infradead.org> wrote: >>>>> >>>>> On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote: >>>>>> Transparent huge page has supported read-only non-shmem files. The file- >>>>>> backed THP is collapsed by khugepaged and truncated when written (for >>>>>> shared libraries). >>>>>> >>>>>> However, there is race in two possible places. >>>>>> >>>>>> 1) multiple writers truncate the same page cache concurrently; >>>>>> 2) collapse_file rolls back when writer truncates the page cache; >>>>> >>>>> As I've said before, the bug here is that somehow there is a writable fd >>>>> to a file with THPs. That's what we need to track down and fix. >>>> Hi, Matthew >>>> I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs" >>>> Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way. >>>> >>>> ... >>>> >>>>> https://lore.kernel.org/linux-mm/YUdL3lFLFHzC80Wt@casper.infradead.org/ >>>> All in all, what you mean is that we should solve this race at the source? >>> >>> Matthew is being pretty clear here: we shouldn't be permitting >>> userspace to get a writeable fd for a thp-backed file. >>> >>> Why are we permitting the DSO to be opened writeably? If there's a >>> legitimate case for doing this then presumably "mm, thp: relax the >> There is a use case to stress file-backed THP within attachment. >> I test this case in a system which has enabled CONFIG_READ_ONLY_THP_FOR_FS: >> >> $ gcc -Wall -g -o stress_madvise_dso stress_madvise_dso.c >> $ ulimit -s unlimited >> $ ./stress_madvise_dso 10000 <libtest.so> >> >> the meaning of above parameters: >> 10000: the max test time; >> <libtest.so>: the DSO that will been mapped into file-backed THP by >> madvise. It recommended that the text segment of DSO to be tested is >> greater than 2M. >> >> The crash will been triggered at once in the latest kernel. And this >> case also can used to trigger the bug that mentioned in our another patch. > > Hmm.. I am not able to use the repro program to crash the system. Not > sure what I did wrong. > Hi I have tried to check my test case again. Can you make sure the DSO that you test have THP mapping? If you are willing to try again, I can send my libtest.c which is used to test by myself (actually, it shouldn't be target DSO problem). Thanks very much! > OTOH, does it make sense to block writes within khugepaged, like: > > diff --git i/mm/khugepaged.c w/mm/khugepaged.c > index 045cc579f724e..ad7c41ec15027 100644 > --- i/mm/khugepaged.c > +++ w/mm/khugepaged.c > @@ -51,6 +51,7 @@ enum scan_result { > SCAN_CGROUP_CHARGE_FAIL, > SCAN_TRUNCATED, > SCAN_PAGE_HAS_PRIVATE, > + SCAN_BUSY_WRITE, > }; > > #define CREATE_TRACE_POINTS > @@ -1652,6 +1653,11 @@ static void collapse_file(struct mm_struct *mm, > /* Only allocate from the target node */ > gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE; > > + if (deny_write_access(file)) { > + result = SCAN_BUSY_WRITE; > + return; > + } > + This can indeed avoid some possible races from source. But, I am thinking about whether this will lead to DDoS attack? I remember the reason of DSO has ignored MAP_DENYWRITE in kernel is that DDoS attack. In addition, 'deny_write_access' will change the behavior, such as user will get 'Text file busy' during collapse_file. I am not sure whether the behavior changing is acceptable in user space. If it is acceptable, I am very willing to fix the races like your way. Thanks! > new_page = khugepaged_alloc_page(hpage, gfp, node); > if (!new_page) { > result = SCAN_ALLOC_HUGE_PAGE_FAIL; > @@ -1863,19 +1869,6 @@ static void collapse_file(struct mm_struct *mm, > else { > __mod_lruvec_page_state(new_page, NR_FILE_THPS, nr); > filemap_nr_thps_inc(mapping); > - /* > - * Paired with smp_mb() in do_dentry_open() to ensure > - * i_writecount is up to date and the update to nr_thps is > - * visible. Ensures the page cache will be truncated if the > - * file is opened writable. > - */ > - smp_mb(); > - if (inode_is_open_for_write(mapping->host)) { > - result = SCAN_FAIL; > - __mod_lruvec_page_state(new_page, NR_FILE_THPS, -nr); > - filemap_nr_thps_dec(mapping); > - goto xa_locked; > - } > } > > if (nr_none) { > @@ -1976,6 +1969,7 @@ static void collapse_file(struct mm_struct *mm, > VM_BUG_ON(!list_empty(&pagelist)); > if (!IS_ERR_OR_NULL(*hpage)) > mem_cgroup_uncharge(*hpage); > + allow_write_access(file); > /* TODO: tracepoints */ > } > #ifndef _DEFAULT_SOURCE #define _DEFAULT_SOURCE #endif #include <fcntl.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/mman.h> #include <unistd.h> /* "volatile" to forbid compiler optimization */ volatile static int x; #define DO_0 ++x; #define DO_1 {DO_0; DO_0; DO_0; DO_0; DO_0; DO_0; DO_0; DO_0; DO_0} #define DO_2 {DO_1; DO_1; DO_1; DO_1; DO_1; DO_1; DO_1; DO_1; DO_1} #define DO_3 {DO_2; DO_2; DO_2; DO_2; DO_2; DO_2; DO_2; DO_2; DO_2} #define DO_4 {DO_3; DO_3; DO_3; DO_3; DO_3; DO_3; DO_3; DO_3; DO_3} #define DO_5 {DO_4; DO_4; DO_4; DO_4; DO_4; DO_4; DO_4; DO_4; DO_4} #define DO_6 {DO_5; DO_5; DO_5; DO_5; DO_5; DO_5; DO_5; DO_5; DO_5} #define DO_7 {DO_6; DO_6; DO_6; DO_6; DO_6; DO_6; DO_6; DO_6; DO_6} void libtest_work1(void) { printf("work 1\n"); DO_0; } void libtest_work2(void) { printf("work 2\n"); DO_2; } void libtest_work3(void) { printf("work 3\n"); DO_4; } void libtest_work4(void) { printf("work 4\n"); DO_6; }
On Tue, Sep 28, 2021 at 5:07 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Mon, Sep 27, 2021 at 03:24:47PM -0700, Song Liu wrote: > > OTOH, does it make sense to block writes within khugepaged, like: > > @@ -1652,6 +1653,11 @@ static void collapse_file(struct mm_struct *mm, > > /* Only allocate from the target node */ > > gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE; > > > > + if (deny_write_access(file)) { > > + result = SCAN_BUSY_WRITE; > > + return; > > + } > > The problem is that it denies, rather than blocking. That means that the > writer gets ETXTBSY instead of waiting until khugepaged is done. > Yes, I was thinking about the same problem last night and this morning. Unfortunately, I haven't got a good solution yet. Do you have some suggestions on this? Thanks, Song
On Tue, Sep 28, 2021 at 9:20 AM Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote: > > > > On 9/28/21 6:24 AM, Song Liu wrote: > > On Fri, Sep 24, 2021 at 12:12 AM Rongwei Wang > > <rongwei.wang@linux.alibaba.com> wrote: > >> > >> > >> > >> On 9/24/21 10:43 AM, Andrew Morton wrote: > >>> On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote: > >>> > >>>> > >>>> > >>>>> On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <willy@infradead.org> wrote: > >>>>> > >>>>> On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote: > >>>>>> Transparent huge page has supported read-only non-shmem files. The file- > >>>>>> backed THP is collapsed by khugepaged and truncated when written (for > >>>>>> shared libraries). > >>>>>> > >>>>>> However, there is race in two possible places. > >>>>>> > >>>>>> 1) multiple writers truncate the same page cache concurrently; > >>>>>> 2) collapse_file rolls back when writer truncates the page cache; > >>>>> > >>>>> As I've said before, the bug here is that somehow there is a writable fd > >>>>> to a file with THPs. That's what we need to track down and fix. > >>>> Hi, Matthew > >>>> I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs" > >>>> Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way. > >>>> > >>>> ... > >>>> > >>>>> https://lore.kernel.org/linux-mm/YUdL3lFLFHzC80Wt@casper.infradead.org/ > >>>> All in all, what you mean is that we should solve this race at the source? > >>> > >>> Matthew is being pretty clear here: we shouldn't be permitting > >>> userspace to get a writeable fd for a thp-backed file. > >>> > >>> Why are we permitting the DSO to be opened writeably? If there's a > >>> legitimate case for doing this then presumably "mm, thp: relax the > >> There is a use case to stress file-backed THP within attachment. > >> I test this case in a system which has enabled CONFIG_READ_ONLY_THP_FOR_FS: > >> > >> $ gcc -Wall -g -o stress_madvise_dso stress_madvise_dso.c > >> $ ulimit -s unlimited > >> $ ./stress_madvise_dso 10000 <libtest.so> > >> > >> the meaning of above parameters: > >> 10000: the max test time; > >> <libtest.so>: the DSO that will been mapped into file-backed THP by > >> madvise. It recommended that the text segment of DSO to be tested is > >> greater than 2M. > >> > >> The crash will been triggered at once in the latest kernel. And this > >> case also can used to trigger the bug that mentioned in our another patch. > > > > Hmm.. I am not able to use the repro program to crash the system. Not > > sure what I did wrong. > > > Hi > I have tried to check my test case again. Can you make sure the DSO that > you test have THP mapping? > > If you are willing to try again, I can send my libtest.c which is used > to test by myself (actually, it shouldn't be target DSO problem). > > Thanks very much! > > OTOH, does it make sense to block writes within khugepaged, like: > > > > diff --git i/mm/khugepaged.c w/mm/khugepaged.c > > index 045cc579f724e..ad7c41ec15027 100644 > > --- i/mm/khugepaged.c > > +++ w/mm/khugepaged.c > > @@ -51,6 +51,7 @@ enum scan_result { > > SCAN_CGROUP_CHARGE_FAIL, > > SCAN_TRUNCATED, > > SCAN_PAGE_HAS_PRIVATE, > > + SCAN_BUSY_WRITE, > > }; > > > > #define CREATE_TRACE_POINTS > > @@ -1652,6 +1653,11 @@ static void collapse_file(struct mm_struct *mm, > > /* Only allocate from the target node */ > > gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE; > > > > + if (deny_write_access(file)) { > > + result = SCAN_BUSY_WRITE; > > + return; > > + } > > + > This can indeed avoid some possible races from source. > > But, I am thinking about whether this will lead to DDoS attack? > I remember the reason of DSO has ignored MAP_DENYWRITE in kernel > is that DDoS attack. In addition, 'deny_write_access' will change > the behavior, such as user will get 'Text file busy' during > collapse_file. I am not sure whether the behavior changing is acceptable > in user space. > > If it is acceptable, I am very willing to fix the races like your way. I guess we should not let the write get ETXTBUSY for khugepaged work. I am getting some segfault on stress_madvise_dso. And it doesn't really generate the bug stack in my vm (qemu-system-x86_64). Is there an newer version of it? Thanks, Song
On 9/29/21 3:14 PM, Song Liu wrote: > On Tue, Sep 28, 2021 at 9:20 AM Rongwei Wang > <rongwei.wang@linux.alibaba.com> wrote: >> >> >> >> On 9/28/21 6:24 AM, Song Liu wrote: >>> On Fri, Sep 24, 2021 at 12:12 AM Rongwei Wang >>> <rongwei.wang@linux.alibaba.com> wrote: >>>> >>>> >>>> >>>> On 9/24/21 10:43 AM, Andrew Morton wrote: >>>>> On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote: >>>>> >>>>>> >>>>>> >>>>>>> On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <willy@infradead.org> wrote: >>>>>>> >>>>>>> On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote: >>>>>>>> Transparent huge page has supported read-only non-shmem files. The file- >>>>>>>> backed THP is collapsed by khugepaged and truncated when written (for >>>>>>>> shared libraries). >>>>>>>> >>>>>>>> However, there is race in two possible places. >>>>>>>> >>>>>>>> 1) multiple writers truncate the same page cache concurrently; >>>>>>>> 2) collapse_file rolls back when writer truncates the page cache; >>>>>>> >>>>>>> As I've said before, the bug here is that somehow there is a writable fd >>>>>>> to a file with THPs. That's what we need to track down and fix. >>>>>> Hi, Matthew >>>>>> I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs" >>>>>> Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way. >>>>>> >>>>>> ... >>>>>> >>>>>>> https://lore.kernel.org/linux-mm/YUdL3lFLFHzC80Wt@casper.infradead.org/ >>>>>> All in all, what you mean is that we should solve this race at the source? >>>>> >>>>> Matthew is being pretty clear here: we shouldn't be permitting >>>>> userspace to get a writeable fd for a thp-backed file. >>>>> >>>>> Why are we permitting the DSO to be opened writeably? If there's a >>>>> legitimate case for doing this then presumably "mm, thp: relax the >>>> There is a use case to stress file-backed THP within attachment. >>>> I test this case in a system which has enabled CONFIG_READ_ONLY_THP_FOR_FS: >>>> >>>> $ gcc -Wall -g -o stress_madvise_dso stress_madvise_dso.c >>>> $ ulimit -s unlimited >>>> $ ./stress_madvise_dso 10000 <libtest.so> >>>> >>>> the meaning of above parameters: >>>> 10000: the max test time; >>>> <libtest.so>: the DSO that will been mapped into file-backed THP by >>>> madvise. It recommended that the text segment of DSO to be tested is >>>> greater than 2M. >>>> >>>> The crash will been triggered at once in the latest kernel. And this >>>> case also can used to trigger the bug that mentioned in our another patch. >>> >>> Hmm.. I am not able to use the repro program to crash the system. Not >>> sure what I did wrong. >>> >> Hi >> I have tried to check my test case again. Can you make sure the DSO that >> you test have THP mapping? >> >> If you are willing to try again, I can send my libtest.c which is used >> to test by myself (actually, it shouldn't be target DSO problem). >> >> Thanks very much! >>> OTOH, does it make sense to block writes within khugepaged, like: >>> >>> diff --git i/mm/khugepaged.c w/mm/khugepaged.c >>> index 045cc579f724e..ad7c41ec15027 100644 >>> --- i/mm/khugepaged.c >>> +++ w/mm/khugepaged.c >>> @@ -51,6 +51,7 @@ enum scan_result { >>> SCAN_CGROUP_CHARGE_FAIL, >>> SCAN_TRUNCATED, >>> SCAN_PAGE_HAS_PRIVATE, >>> + SCAN_BUSY_WRITE, >>> }; >>> >>> #define CREATE_TRACE_POINTS >>> @@ -1652,6 +1653,11 @@ static void collapse_file(struct mm_struct *mm, >>> /* Only allocate from the target node */ >>> gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE; >>> >>> + if (deny_write_access(file)) { >>> + result = SCAN_BUSY_WRITE; >>> + return; >>> + } >>> + >> This can indeed avoid some possible races from source. >> >> But, I am thinking about whether this will lead to DDoS attack? >> I remember the reason of DSO has ignored MAP_DENYWRITE in kernel >> is that DDoS attack. In addition, 'deny_write_access' will change >> the behavior, such as user will get 'Text file busy' during >> collapse_file. I am not sure whether the behavior changing is acceptable >> in user space. >> >> If it is acceptable, I am very willing to fix the races like your way. > > I guess we should not let the write get ETXTBUSY for khugepaged work. > > I am getting some segfault on stress_madvise_dso. And it doesn't really > generate the bug stack in my vm (qemu-system-x86_64). Is there an newer Hi, I can sure I am not update the stress_madvise_dso.c. My test environment is vm (qemu-system-aarch64, 32 cores). And I can think of the following possibilities: (1) in thread_read() printf("read %s\n", dso_path); d = open(dso_path, O_RDONLY); /* The start addr must be alignment with 2M */ void *p = mmap((void *)0x40000dc00000UL, 0x800000, PROT_READ | PROT_EXEC,MAP_PRIVATE, fd, 0); if (p == MAP_FAILED) { perror("mmap"); goto out; } 0x40000dc00000 is random setting by myself. I am not sure this address is available in your vm. (2) in thread_write() int fd = open(dso_path, O_RDWR); p = mmap(NULL, 0x800000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); if (p == MAP_FAILED) { perror("mmap"); goto out; /* fail */ } because of I am sure the DSO is bigger than 0x800000, so directly map the DSO using 0x800000. Maybe I had use '-z max-page-size=0x200000' to compile the DSO? likes: $ gcc -z max-page-size=0x200000 -o libtest.so -shared libtest.o If you don't mind, you can send the segment fault log to me. And I will find x86 environment to test. Thanks! > version of it? > > Thanks, > Song >
On Wed, Sep 29, 2021 at 12:50 AM Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote: > > > > On 9/29/21 3:14 PM, Song Liu wrote: > > On Tue, Sep 28, 2021 at 9:20 AM Rongwei Wang > > <rongwei.wang@linux.alibaba.com> wrote: > >> > >> > >> > >> On 9/28/21 6:24 AM, Song Liu wrote: > >>> On Fri, Sep 24, 2021 at 12:12 AM Rongwei Wang > >>> <rongwei.wang@linux.alibaba.com> wrote: > >>>> > >>>> > >>>> > >>>> On 9/24/21 10:43 AM, Andrew Morton wrote: > >>>>> On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote: > >>>>> > >>>>>> > >>>>>> > >>>>>>> On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <willy@infradead.org> wrote: > >>>>>>> > >>>>>>> On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote: > >>>>>>>> Transparent huge page has supported read-only non-shmem files. The file- > >>>>>>>> backed THP is collapsed by khugepaged and truncated when written (for > >>>>>>>> shared libraries). > >>>>>>>> > >>>>>>>> However, there is race in two possible places. > >>>>>>>> > >>>>>>>> 1) multiple writers truncate the same page cache concurrently; > >>>>>>>> 2) collapse_file rolls back when writer truncates the page cache; > >>>>>>> > >>>>>>> As I've said before, the bug here is that somehow there is a writable fd > >>>>>>> to a file with THPs. That's what we need to track down and fix. > >>>>>> Hi, Matthew > >>>>>> I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs" > >>>>>> Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way. > >>>>>> > >>>>>> ... > >>>>>> > >>>>>>> https://lore.kernel.org/linux-mm/YUdL3lFLFHzC80Wt@casper.infradead.org/ > >>>>>> All in all, what you mean is that we should solve this race at the source? > >>>>> > >>>>> Matthew is being pretty clear here: we shouldn't be permitting > >>>>> userspace to get a writeable fd for a thp-backed file. > >>>>> > >>>>> Why are we permitting the DSO to be opened writeably? If there's a > >>>>> legitimate case for doing this then presumably "mm, thp: relax the > >>>> There is a use case to stress file-backed THP within attachment. > >>>> I test this case in a system which has enabled CONFIG_READ_ONLY_THP_FOR_FS: > >>>> > >>>> $ gcc -Wall -g -o stress_madvise_dso stress_madvise_dso.c > >>>> $ ulimit -s unlimited > >>>> $ ./stress_madvise_dso 10000 <libtest.so> > >>>> > >>>> the meaning of above parameters: > >>>> 10000: the max test time; > >>>> <libtest.so>: the DSO that will been mapped into file-backed THP by > >>>> madvise. It recommended that the text segment of DSO to be tested is > >>>> greater than 2M. > >>>> > >>>> The crash will been triggered at once in the latest kernel. And this > >>>> case also can used to trigger the bug that mentioned in our another patch. > >>> > >>> Hmm.. I am not able to use the repro program to crash the system. Not > >>> sure what I did wrong. > >>> > >> Hi > >> I have tried to check my test case again. Can you make sure the DSO that > >> you test have THP mapping? > >> > >> If you are willing to try again, I can send my libtest.c which is used > >> to test by myself (actually, it shouldn't be target DSO problem). > >> > >> Thanks very much! > >>> OTOH, does it make sense to block writes within khugepaged, like: > >>> > >>> diff --git i/mm/khugepaged.c w/mm/khugepaged.c > >>> index 045cc579f724e..ad7c41ec15027 100644 > >>> --- i/mm/khugepaged.c > >>> +++ w/mm/khugepaged.c > >>> @@ -51,6 +51,7 @@ enum scan_result { > >>> SCAN_CGROUP_CHARGE_FAIL, > >>> SCAN_TRUNCATED, > >>> SCAN_PAGE_HAS_PRIVATE, > >>> + SCAN_BUSY_WRITE, > >>> }; > >>> > >>> #define CREATE_TRACE_POINTS > >>> @@ -1652,6 +1653,11 @@ static void collapse_file(struct mm_struct *mm, > >>> /* Only allocate from the target node */ > >>> gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE; > >>> > >>> + if (deny_write_access(file)) { > >>> + result = SCAN_BUSY_WRITE; > >>> + return; > >>> + } > >>> + > >> This can indeed avoid some possible races from source. > >> > >> But, I am thinking about whether this will lead to DDoS attack? > >> I remember the reason of DSO has ignored MAP_DENYWRITE in kernel > >> is that DDoS attack. In addition, 'deny_write_access' will change > >> the behavior, such as user will get 'Text file busy' during > >> collapse_file. I am not sure whether the behavior changing is acceptable > >> in user space. > >> > >> If it is acceptable, I am very willing to fix the races like your way. > > > > I guess we should not let the write get ETXTBUSY for khugepaged work. > > > > I am getting some segfault on stress_madvise_dso. And it doesn't really > > generate the bug stack in my vm (qemu-system-x86_64). Is there an newer > Hi, I can sure I am not update the stress_madvise_dso.c. > > My test environment is vm (qemu-system-aarch64, 32 cores). And I can > think of the following possibilities: > > (1) in thread_read() > > printf("read %s\n", dso_path); > d = open(dso_path, O_RDONLY); > /* The start addr must be alignment with 2M */ > void *p = mmap((void *)0x40000dc00000UL, 0x800000, PROT_READ | > PROT_EXEC,MAP_PRIVATE, fd, 0); > if (p == MAP_FAILED) { > perror("mmap"); > goto out; > } > > 0x40000dc00000 is random setting by myself. I am not sure this address > is available in your vm. > > (2) in thread_write() > int fd = open(dso_path, O_RDWR); > p = mmap(NULL, 0x800000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); > if (p == MAP_FAILED) { > perror("mmap"); > goto out; /* fail */ > } > > because of I am sure the DSO is bigger than 0x800000, so directly map > the DSO using 0x800000. Maybe I had use '-z max-page-size=0x200000' to > compile the DSO? likes: > $ gcc -z max-page-size=0x200000 -o libtest.so -shared libtest.o > > If you don't mind, you can send the segment fault log to me. And I will > find x86 environment to test. I fixed the segfault with 1. malloc buf (as it is too big for stack) in thread_read 2. reduce memcpy() size in thread_read. Now, I am able to crash the system on find_lock_entries () { ... VM_BUG_ON_PAGE(page->index != xas.xa_index, page); } I guess it is related. I will test more. Thanks, Song
On Wed, Sep 29, 2021 at 09:59:38AM -0700, Song Liu wrote: > On Wed, Sep 29, 2021 at 12:50 AM Rongwei Wang > <rongwei.wang@linux.alibaba.com> wrote: > > > > > > > > On 9/29/21 3:14 PM, Song Liu wrote: > > > On Tue, Sep 28, 2021 at 9:20 AM Rongwei Wang > > > <rongwei.wang@linux.alibaba.com> wrote: > > >> > > >> > > >> > > >> On 9/28/21 6:24 AM, Song Liu wrote: > > >>> On Fri, Sep 24, 2021 at 12:12 AM Rongwei Wang > > >>> <rongwei.wang@linux.alibaba.com> wrote: > > >>>> > > >>>> > > >>>> > > >>>> On 9/24/21 10:43 AM, Andrew Morton wrote: > > >>>>> On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote: > > >>>>> > > >>>>>> > > >>>>>> > > >>>>>>> On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <willy@infradead.org> wrote: > > >>>>>>> > > >>>>>>> On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote: > > >>>>>>>> Transparent huge page has supported read-only non-shmem files. The file- > > >>>>>>>> backed THP is collapsed by khugepaged and truncated when written (for > > >>>>>>>> shared libraries). > > >>>>>>>> > > >>>>>>>> However, there is race in two possible places. > > >>>>>>>> > > >>>>>>>> 1) multiple writers truncate the same page cache concurrently; > > >>>>>>>> 2) collapse_file rolls back when writer truncates the page cache; > > >>>>>>> > > >>>>>>> As I've said before, the bug here is that somehow there is a writable fd > > >>>>>>> to a file with THPs. That's what we need to track down and fix. > > >>>>>> Hi, Matthew > > >>>>>> I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs" > > >>>>>> Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way. > > >>>>>> > > >>>>>> ... > > >>>>>> > > >>>>>>> https://lore.kernel.org/linux-mm/YUdL3lFLFHzC80Wt@casper.infradead.org/ > > >>>>>> All in all, what you mean is that we should solve this race at the source? > > >>>>> > > >>>>> Matthew is being pretty clear here: we shouldn't be permitting > > >>>>> userspace to get a writeable fd for a thp-backed file. > > >>>>> > > >>>>> Why are we permitting the DSO to be opened writeably? If there's a > > >>>>> legitimate case for doing this then presumably "mm, thp: relax the > > >>>> There is a use case to stress file-backed THP within attachment. > > >>>> I test this case in a system which has enabled CONFIG_READ_ONLY_THP_FOR_FS: > > >>>> > > >>>> $ gcc -Wall -g -o stress_madvise_dso stress_madvise_dso.c > > >>>> $ ulimit -s unlimited > > >>>> $ ./stress_madvise_dso 10000 <libtest.so> > > >>>> > > >>>> the meaning of above parameters: > > >>>> 10000: the max test time; > > >>>> <libtest.so>: the DSO that will been mapped into file-backed THP by > > >>>> madvise. It recommended that the text segment of DSO to be tested is > > >>>> greater than 2M. > > >>>> > > >>>> The crash will been triggered at once in the latest kernel. And this > > >>>> case also can used to trigger the bug that mentioned in our another patch. > > >>> > > >>> Hmm.. I am not able to use the repro program to crash the system. Not > > >>> sure what I did wrong. > > >>> > > >> Hi > > >> I have tried to check my test case again. Can you make sure the DSO that > > >> you test have THP mapping? > > >> > > >> If you are willing to try again, I can send my libtest.c which is used > > >> to test by myself (actually, it shouldn't be target DSO problem). > > >> > > >> Thanks very much! > > >>> OTOH, does it make sense to block writes within khugepaged, like: > > >>> > > >>> diff --git i/mm/khugepaged.c w/mm/khugepaged.c > > >>> index 045cc579f724e..ad7c41ec15027 100644 > > >>> --- i/mm/khugepaged.c > > >>> +++ w/mm/khugepaged.c > > >>> @@ -51,6 +51,7 @@ enum scan_result { > > >>> SCAN_CGROUP_CHARGE_FAIL, > > >>> SCAN_TRUNCATED, > > >>> SCAN_PAGE_HAS_PRIVATE, > > >>> + SCAN_BUSY_WRITE, > > >>> }; > > >>> > > >>> #define CREATE_TRACE_POINTS > > >>> @@ -1652,6 +1653,11 @@ static void collapse_file(struct mm_struct *mm, > > >>> /* Only allocate from the target node */ > > >>> gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE; > > >>> > > >>> + if (deny_write_access(file)) { > > >>> + result = SCAN_BUSY_WRITE; > > >>> + return; > > >>> + } > > >>> + > > >> This can indeed avoid some possible races from source. > > >> > > >> But, I am thinking about whether this will lead to DDoS attack? > > >> I remember the reason of DSO has ignored MAP_DENYWRITE in kernel > > >> is that DDoS attack. In addition, 'deny_write_access' will change > > >> the behavior, such as user will get 'Text file busy' during > > >> collapse_file. I am not sure whether the behavior changing is acceptable > > >> in user space. > > >> > > >> If it is acceptable, I am very willing to fix the races like your way. > > > > > > I guess we should not let the write get ETXTBUSY for khugepaged work. > > > > > > I am getting some segfault on stress_madvise_dso. And it doesn't really > > > generate the bug stack in my vm (qemu-system-x86_64). Is there an newer > > Hi, I can sure I am not update the stress_madvise_dso.c. > > > > My test environment is vm (qemu-system-aarch64, 32 cores). And I can > > think of the following possibilities: > > > > (1) in thread_read() > > > > printf("read %s\n", dso_path); > > d = open(dso_path, O_RDONLY); > > /* The start addr must be alignment with 2M */ > > void *p = mmap((void *)0x40000dc00000UL, 0x800000, PROT_READ | > > PROT_EXEC,MAP_PRIVATE, fd, 0); > > if (p == MAP_FAILED) { > > perror("mmap"); > > goto out; > > } > > > > 0x40000dc00000 is random setting by myself. I am not sure this address > > is available in your vm. > > > > (2) in thread_write() > > int fd = open(dso_path, O_RDWR); > > p = mmap(NULL, 0x800000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0); > > if (p == MAP_FAILED) { > > perror("mmap"); > > goto out; /* fail */ > > } > > > > because of I am sure the DSO is bigger than 0x800000, so directly map > > the DSO using 0x800000. Maybe I had use '-z max-page-size=0x200000' to > > compile the DSO? likes: > > $ gcc -z max-page-size=0x200000 -o libtest.so -shared libtest.o > > > > If you don't mind, you can send the segment fault log to me. And I will > > find x86 environment to test. > > I fixed the segfault with > 1. malloc buf (as it is too big for stack) in thread_read > 2. reduce memcpy() size in thread_read. > > Now, I am able to crash the system on > find_lock_entries () { > ... > VM_BUG_ON_PAGE(page->index != xas.xa_index, page); > } > I guess it is related. I will test more. That's a bogus VM_BUG_ON. I have a patch in my tree to delete it. Andrew has it too, but for some reason, he hasn't sent it on to Linus. +++ b/mm/filemap.c @@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start, if (!xa_is_value(page)) { if (page->index < start) goto put; - VM_BUG_ON_PAGE(page->index != xas.xa_index, page); if (page->index + thp_nr_pages(page) - 1 > end) goto put; if (!trylock_page(page))
On Wed, Sep 29, 2021 at 10:56 AM Matthew Wilcox <willy@infradead.org> wrote: > [...] > > Now, I am able to crash the system on > > find_lock_entries () { > > ... > > VM_BUG_ON_PAGE(page->index != xas.xa_index, page); > > } > > I guess it is related. I will test more. > > That's a bogus VM_BUG_ON. I have a patch in my tree to delete it. > Andrew has it too, but for some reason, he hasn't sent it on to Linus. > > +++ b/mm/filemap.c > @@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start, > if (!xa_is_value(page)) { > if (page->index < start) > goto put; > - VM_BUG_ON_PAGE(page->index != xas.xa_index, page); > if (page->index + thp_nr_pages(page) - 1 > end) > goto put; > if (!trylock_page(page)) Yes, after removing this line, I am able to see the same bug. Here is my finding so far: The issue is NOT caused by concurrent khugepaged:collapse_file() and truncate_pagecache(inode, 0). With some printks, we can see a clear time gap (>2 second ) between collapse_file() finishes, and truncate_pagecache() (which crashes soon). Therefore, my earlier suggestion that adds deny_write_access() to collapse_file() does NOT work. The crash is actually caused by concurrent truncate_pagecache(inode, 0). If I change the number of write thread in stress_madvise_dso.c to one, (IOW, one thread_read and one thread_write), I cannot reproduce the crash anymore. I think this means we cannot fix this issue in collapse_file(), because it finishes long before the crash. Thanks, Song
On Wed, Sep 29, 2021 at 04:41:48PM -0700, Song Liu wrote: > The issue is NOT caused by concurrent khugepaged:collapse_file() and > truncate_pagecache(inode, 0). With some printks, we can see a clear > time gap (>2 second ) between collapse_file() finishes, and > truncate_pagecache() (which crashes soon). Therefore, my earlier > suggestion that adds deny_write_access() to collapse_file() does NOT > work. > > The crash is actually caused by concurrent truncate_pagecache(inode, 0). > If I change the number of write thread in stress_madvise_dso.c to one, > (IOW, one thread_read and one thread_write), I cannot reproduce the > crash anymore. > > I think this means we cannot fix this issue in collapse_file(), because it > finishes long before the crash. Ah! So are we missing one or more of these locks: inode_lock(inode); filemap_invalidate_lock(mapping); in the open path?
On Wed, Sep 29, 2021 at 5:02 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Wed, Sep 29, 2021 at 04:41:48PM -0700, Song Liu wrote: > > The issue is NOT caused by concurrent khugepaged:collapse_file() and > > truncate_pagecache(inode, 0). With some printks, we can see a clear > > time gap (>2 second ) between collapse_file() finishes, and > > truncate_pagecache() (which crashes soon). Therefore, my earlier > > suggestion that adds deny_write_access() to collapse_file() does NOT > > work. > > > > The crash is actually caused by concurrent truncate_pagecache(inode, 0). > > If I change the number of write thread in stress_madvise_dso.c to one, > > (IOW, one thread_read and one thread_write), I cannot reproduce the > > crash anymore. > > > > I think this means we cannot fix this issue in collapse_file(), because it > > finishes long before the crash. > > Ah! So are we missing one or more of these locks: > > inode_lock(inode); > filemap_invalidate_lock(mapping); > > in the open path? The following fixes the crash in my test. But I am not sure whether this is the best fix. Rongwei, could you please run more tests on it? Thanks, Song diff --git i/fs/open.c w/fs/open.c index daa324606a41f..d13c4668b2e53 100644 --- i/fs/open.c +++ w/fs/open.c @@ -856,8 +856,11 @@ static int do_dentry_open(struct file *f, * of THPs into the page cache will fail. */ smp_mb(); - if (filemap_nr_thps(inode->i_mapping)) + if (filemap_nr_thps(inode->i_mapping)) { + filemap_invalidate_lock(inode->i_mapping); truncate_pagecache(inode, 0); + filemap_invalidate_unlock(inode->i_mapping); + } } return 0;
On 9/30/21 7:41 AM, Song Liu wrote: > On Wed, Sep 29, 2021 at 10:56 AM Matthew Wilcox <willy@infradead.org> wrote: >> > [...] >>> Now, I am able to crash the system on >>> find_lock_entries () { >>> ... >>> VM_BUG_ON_PAGE(page->index != xas.xa_index, page); >>> } >>> I guess it is related. I will test more. >> >> That's a bogus VM_BUG_ON. I have a patch in my tree to delete it. >> Andrew has it too, but for some reason, he hasn't sent it on to Linus. >> >> +++ b/mm/filemap.c >> @@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start, >> if (!xa_is_value(page)) { >> if (page->index < start) >> goto put; >> - VM_BUG_ON_PAGE(page->index != xas.xa_index, page); >> if (page->index + thp_nr_pages(page) - 1 > end) >> goto put; >> if (!trylock_page(page)) > > Yes, after removing this line, I am able to see the same bug. > > Here is my finding so far: > > The issue is NOT caused by concurrent khugepaged:collapse_file() and > truncate_pagecache(inode, 0). With some printks, we can see a clear > time gap (>2 second ) between collapse_file() finishes, and > truncate_pagecache() (which crashes soon). Therefore, my earlier > suggestion that adds deny_write_access() to collapse_file() does NOT > work. > > The crash is actually caused by concurrent truncate_pagecache(inode, 0). > If I change the number of write thread in stress_madvise_dso.c to one, > (IOW, one thread_read and one thread_write), I cannot reproduce the > crash anymore. Whether CONFIG_DEBUG_VM is enabled in your vm? I think the second possibility mentioned above will been found if you enable CONFIG_DEBUG_VM: 1) multiple writers truncate the same page cache concurrently; 2) collapse_file rolls back when writer truncates the page cache; The following log will be print after enable CONFIG_DEBUG_VM: [22216.789904] do_idle+0xb4/0x104 [22216.789906] cpu_startup_entry+0x34/0x9c [22216.790144] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 0.0.0 02/06/2015 [22216.790553] secondary_start_kernel+0x104/0x180 [22216.790778] Call trace: [22216.791300] Code: d4210000 b0006161 910d4021 94013b45 (d4210000) [22216.791662] dump_backtrace+0x0/0x1ec [22216.791664] show_stack+0x24/0x30 [22216.791956] ---[ end trace dc769a61c1af087b ]--- [22216.792295] dump_stack+0xd0/0x128 [22216.792299] bad_page+0xe4/0x110 [22216.792579] Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt [22216.792937] check_free_page_bad+0x84/0x90 [22216.792940] free_pcp_prepare+0x1fc/0x21c [22216.793253] SMP: stopping secondary CPUs [22216.793525] free_unref_page+0x2c/0xec [22216.805537] __put_page+0x60/0x70 [22216.805931] collapse_file+0xdc8/0x12f0 [22216.806385] khugepaged_scan_file+0x2dc/0x37c [22216.806900] khugepaged_scan_mm_slot+0x2e0/0x380 [22216.807450] khugepaged_do_scan+0x2dc/0x2fc [22216.807946] khugepaged+0x38/0x100 [22216.808342] kthread+0x11c/0x120 [22216.808735] Kernel Offset: disabled [22216.809153] CPU features: 0x0040002,62208238 [22216.809681] Memory Limit: none [22216.813477] Starting crashdump kernel... So I think the race also exists between collapse_file and truncate_pagecache. > > I think this means we cannot fix this issue in collapse_file(), because it > finishes long before the crash. > > Thanks, > Song >
On 9/30/21 8:41 AM, Song Liu wrote: > On Wed, Sep 29, 2021 at 5:02 PM Matthew Wilcox <willy@infradead.org> wrote: >> >> On Wed, Sep 29, 2021 at 04:41:48PM -0700, Song Liu wrote: >>> The issue is NOT caused by concurrent khugepaged:collapse_file() and >>> truncate_pagecache(inode, 0). With some printks, we can see a clear >>> time gap (>2 second ) between collapse_file() finishes, and >>> truncate_pagecache() (which crashes soon). Therefore, my earlier >>> suggestion that adds deny_write_access() to collapse_file() does NOT >>> work. >>> >>> The crash is actually caused by concurrent truncate_pagecache(inode, 0). >>> If I change the number of write thread in stress_madvise_dso.c to one, >>> (IOW, one thread_read and one thread_write), I cannot reproduce the >>> crash anymore. >>> >>> I think this means we cannot fix this issue in collapse_file(), because it >>> finishes long before the crash. >> >> Ah! So are we missing one or more of these locks: >> >> inode_lock(inode); >> filemap_invalidate_lock(mapping); >> >> in the open path? > > The following fixes the crash in my test. But I am not sure whether this is the > best fix. > > Rongwei, could you please run more tests on it? Yes, I'd like to. > > Thanks, > Song > > > diff --git i/fs/open.c w/fs/open.c > index daa324606a41f..d13c4668b2e53 100644 > --- i/fs/open.c > +++ w/fs/open.c > @@ -856,8 +856,11 @@ static int do_dentry_open(struct file *f, > * of THPs into the page cache will fail. > */ > smp_mb(); > - if (filemap_nr_thps(inode->i_mapping)) > + if (filemap_nr_thps(inode->i_mapping)) { > + filemap_invalidate_lock(inode->i_mapping); Learned something, Thanks! But, the race between collapse_file and truncate_pagecache, I am not sure whether it exists or not. If exists, whether this patch only can fix truncate_pagecache concurrent? Anyway, I will run more tests on it at first. Thanks! > truncate_pagecache(inode, 0); > + filemap_invalidate_unlock(inode->i_mapping); > + } > } > > return 0; >
On Wed, Sep 29, 2021 at 6:54 PM Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote: > > > > On 9/30/21 7:41 AM, Song Liu wrote: > > On Wed, Sep 29, 2021 at 10:56 AM Matthew Wilcox <willy@infradead.org> wrote: > >> > > [...] > >>> Now, I am able to crash the system on > >>> find_lock_entries () { > >>> ... > >>> VM_BUG_ON_PAGE(page->index != xas.xa_index, page); > >>> } > >>> I guess it is related. I will test more. > >> > >> That's a bogus VM_BUG_ON. I have a patch in my tree to delete it. > >> Andrew has it too, but for some reason, he hasn't sent it on to Linus. > >> > >> +++ b/mm/filemap.c > >> @@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start, > >> if (!xa_is_value(page)) { > >> if (page->index < start) > >> goto put; > >> - VM_BUG_ON_PAGE(page->index != xas.xa_index, page); > >> if (page->index + thp_nr_pages(page) - 1 > end) > >> goto put; > >> if (!trylock_page(page)) > > > > Yes, after removing this line, I am able to see the same bug. > > > > Here is my finding so far: > > > > The issue is NOT caused by concurrent khugepaged:collapse_file() and > > truncate_pagecache(inode, 0). With some printks, we can see a clear > > time gap (>2 second ) between collapse_file() finishes, and > > truncate_pagecache() (which crashes soon). Therefore, my earlier > > suggestion that adds deny_write_access() to collapse_file() does NOT > > work. > > > > The crash is actually caused by concurrent truncate_pagecache(inode, 0). > > If I change the number of write thread in stress_madvise_dso.c to one, > > (IOW, one thread_read and one thread_write), I cannot reproduce the > > crash anymore. > Whether CONFIG_DEBUG_VM is enabled in your vm? > > I think the second possibility mentioned above will been found if you > enable CONFIG_DEBUG_VM: > > 1) multiple writers truncate the same page cache concurrently; > 2) collapse_file rolls back when writer truncates the page cache; > > The following log will be print after enable CONFIG_DEBUG_VM: > > [22216.789904] do_idle+0xb4/0x104 > [22216.789906] cpu_startup_entry+0x34/0x9c > [22216.790144] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS > 0.0.0 02/06/2015 > [22216.790553] secondary_start_kernel+0x104/0x180 > [22216.790778] Call trace: > [22216.791300] Code: d4210000 b0006161 910d4021 94013b45 (d4210000) > [22216.791662] dump_backtrace+0x0/0x1ec > [22216.791664] show_stack+0x24/0x30 > [22216.791956] ---[ end trace dc769a61c1af087b ]--- > [22216.792295] dump_stack+0xd0/0x128 > [22216.792299] bad_page+0xe4/0x110 > [22216.792579] Kernel panic - not syncing: Oops - BUG: Fatal exception > in interrupt > [22216.792937] check_free_page_bad+0x84/0x90 > [22216.792940] free_pcp_prepare+0x1fc/0x21c > [22216.793253] SMP: stopping secondary CPUs > [22216.793525] free_unref_page+0x2c/0xec > [22216.805537] __put_page+0x60/0x70 > [22216.805931] collapse_file+0xdc8/0x12f0 > [22216.806385] khugepaged_scan_file+0x2dc/0x37c > [22216.806900] khugepaged_scan_mm_slot+0x2e0/0x380 > [22216.807450] khugepaged_do_scan+0x2dc/0x2fc > [22216.807946] khugepaged+0x38/0x100 > [22216.808342] kthread+0x11c/0x120 > [22216.808735] Kernel Offset: disabled > [22216.809153] CPU features: 0x0040002,62208238 > [22216.809681] Memory Limit: none > [22216.813477] Starting crashdump kernel... > > So I think the race also exists between collapse_file and > truncate_pagecache. I do have CONFIG_DEBUG_VM, but I haven't hit this issue yet. Thanks, Song
On Wed, 29 Sep 2021, Song Liu wrote: > On Wed, Sep 29, 2021 at 6:54 PM Rongwei Wang > <rongwei.wang@linux.alibaba.com> wrote: > > On 9/30/21 7:41 AM, Song Liu wrote: > > > On Wed, Sep 29, 2021 at 10:56 AM Matthew Wilcox <willy@infradead.org> wrote: > > >> > > > [...] > > >>> Now, I am able to crash the system on > > >>> find_lock_entries () { > > >>> ... > > >>> VM_BUG_ON_PAGE(page->index != xas.xa_index, page); > > >>> } > > >>> I guess it is related. I will test more. > > >> > > >> That's a bogus VM_BUG_ON. I have a patch in my tree to delete it. > > >> Andrew has it too, but for some reason, he hasn't sent it on to Linus. I think Andrew has held back from sending it on to Linus because I pointed out that the example Matthew cited (shmem and swap cache) was wrong, and could not explain it: we wanted to understand what syzbot had actually hit before sending on. Would this bug be a good explanation for it? In the meantime, independently, I was looking at fuse_try_move_page(), which appears to do the old splice thievery that got disabled from splice itself, stealing a page from one mapping to put into another. I suspect that could result in a page->index != xas.xa_index too (outside page lock). > > >> > > >> +++ b/mm/filemap.c > > >> @@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start, > > >> if (!xa_is_value(page)) { > > >> if (page->index < start) > > >> goto put; > > >> - VM_BUG_ON_PAGE(page->index != xas.xa_index, page); > > >> if (page->index + thp_nr_pages(page) - 1 > end) > > >> goto put; > > >> if (!trylock_page(page)) > > > > > > Yes, after removing this line, I am able to see the same bug. > > > > > > Here is my finding so far: > > > > > > The issue is NOT caused by concurrent khugepaged:collapse_file() and > > > truncate_pagecache(inode, 0). With some printks, we can see a clear > > > time gap (>2 second ) between collapse_file() finishes, and > > > truncate_pagecache() (which crashes soon). Therefore, my earlier > > > suggestion that adds deny_write_access() to collapse_file() does NOT > > > work. > > > > > > The crash is actually caused by concurrent truncate_pagecache(inode, 0). > > > If I change the number of write thread in stress_madvise_dso.c to one, > > > (IOW, one thread_read and one thread_write), I cannot reproduce the > > > crash anymore. > > Whether CONFIG_DEBUG_VM is enabled in your vm? > > > > I think the second possibility mentioned above will been found if you > > enable CONFIG_DEBUG_VM: > > > > 1) multiple writers truncate the same page cache concurrently; > > 2) collapse_file rolls back when writer truncates the page cache; > > > > The following log will be print after enable CONFIG_DEBUG_VM: > > > > [22216.789904] do_idle+0xb4/0x104 > > [22216.789906] cpu_startup_entry+0x34/0x9c > > [22216.790144] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS > > 0.0.0 02/06/2015 > > [22216.790553] secondary_start_kernel+0x104/0x180 > > [22216.790778] Call trace: > > [22216.791300] Code: d4210000 b0006161 910d4021 94013b45 (d4210000) > > [22216.791662] dump_backtrace+0x0/0x1ec > > [22216.791664] show_stack+0x24/0x30 > > [22216.791956] ---[ end trace dc769a61c1af087b ]--- > > [22216.792295] dump_stack+0xd0/0x128 > > [22216.792299] bad_page+0xe4/0x110 > > [22216.792579] Kernel panic - not syncing: Oops - BUG: Fatal exception > > in interrupt > > [22216.792937] check_free_page_bad+0x84/0x90 > > [22216.792940] free_pcp_prepare+0x1fc/0x21c > > [22216.793253] SMP: stopping secondary CPUs > > [22216.793525] free_unref_page+0x2c/0xec > > [22216.805537] __put_page+0x60/0x70 > > [22216.805931] collapse_file+0xdc8/0x12f0 > > [22216.806385] khugepaged_scan_file+0x2dc/0x37c > > [22216.806900] khugepaged_scan_mm_slot+0x2e0/0x380 > > [22216.807450] khugepaged_do_scan+0x2dc/0x2fc > > [22216.807946] khugepaged+0x38/0x100 > > [22216.808342] kthread+0x11c/0x120 > > [22216.808735] Kernel Offset: disabled > > [22216.809153] CPU features: 0x0040002,62208238 > > [22216.809681] Memory Limit: none > > [22216.813477] Starting crashdump kernel... > > > > So I think the race also exists between collapse_file and > > truncate_pagecache. > > I do have CONFIG_DEBUG_VM, but I haven't hit this issue yet. Sorry, it's taken me a long time to get into this bug(s). I haven't tried to reproduce it, but I do think that Rongwei will be proved right. In doing a lockless lookup of the page cache, we tend to imagine that the THP head page will be encountered first, and the special treatment for the head will do the right thing to skip the tails. But when racing against collapse_file()'s rewind, I agree with Rongwei that it is possible for truncation (or page cache cleanup in this instance) to collect a pagevec which starts with a PageTail some way into the compound page. shmem_undo_range() (which shmem uses rather than truncate_pagecache()) would not call truncate_inode_page() on a THP tail: if it encounters a tail, it will try to split the THP, and not call truncate_inode_page() if it cannot. Unless I'm inventing the memory, I now think that I did encounter this race between truncate and collapse on huge shmem, and had to re-craft my shmem_punch_compound() to handle it correctly. If it is just a matter of collapse_file() rewind, I suppose we could reverse the direction in which that proceeds; but I'm not convinced that would be enough, and don't think we need such a "big" change. Aside from the above page->index mischeck in find_lock_entries(), I now think this bug needs nothing more than simply removing the VM_BUG_ON_PAGE(PageTail(page), page) from truncate_inode_page(). (You could say move its page->mapping check before its PageTail check; but the PageTail check would then never catch anything.) Rongwei's patch went in the right direction, but why add N lines if deleting one is good? And for a while I thought that idea of using filemap_invalidate_lock() was very good; but now think the page lock we already have is better, and solve both races. Hugh
On Wed, Sep 29, 2021 at 10:24:44PM -0700, Hugh Dickins wrote: > On Wed, 29 Sep 2021, Song Liu wrote: > > On Wed, Sep 29, 2021 at 6:54 PM Rongwei Wang > > <rongwei.wang@linux.alibaba.com> wrote: > > > On 9/30/21 7:41 AM, Song Liu wrote: > > > > On Wed, Sep 29, 2021 at 10:56 AM Matthew Wilcox <willy@infradead.org> wrote: > > > >> > > > > [...] > > > >>> Now, I am able to crash the system on > > > >>> find_lock_entries () { > > > >>> ... > > > >>> VM_BUG_ON_PAGE(page->index != xas.xa_index, page); > > > >>> } > > > >>> I guess it is related. I will test more. > > > >> > > > >> That's a bogus VM_BUG_ON. I have a patch in my tree to delete it. > > > >> Andrew has it too, but for some reason, he hasn't sent it on to Linus. > > I think Andrew has held back from sending it on to Linus because I pointed > out that the example Matthew cited (shmem and swap cache) was wrong, and > could not explain it: we wanted to understand what syzbot had actually > hit before sending on. > > Would this bug be a good explanation for it? I don't think so? Even if that specific example is not what's happening, the general principle is that you can't verify page->index until you're holding the page lock. So the VM_BUG_ON is definitely in the wrong place. > In the meantime, independently, I was looking at fuse_try_move_page(), > which appears to do the old splice thievery that got disabled from splice > itself, stealing a page from one mapping to put into another. I suspect > that could result in a page->index != xas.xa_index too (outside page lock). I think there are other examples too where page->index gets changed ... they're not springing to mind right now, but I have some other intricate details occupying that bit of my brain at the moment. > > > I think the second possibility mentioned above will been found if you > > > enable CONFIG_DEBUG_VM: > > > > > > 1) multiple writers truncate the same page cache concurrently; > > > 2) collapse_file rolls back when writer truncates the page cache; > > > > > > The following log will be print after enable CONFIG_DEBUG_VM: > > > > > > [22216.789904] do_idle+0xb4/0x104 > > > [22216.789906] cpu_startup_entry+0x34/0x9c > > > [22216.790144] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS > > > 0.0.0 02/06/2015 > > > [22216.790553] secondary_start_kernel+0x104/0x180 > > > [22216.790778] Call trace: > > > [22216.791300] Code: d4210000 b0006161 910d4021 94013b45 (d4210000) > > > [22216.791662] dump_backtrace+0x0/0x1ec > > > [22216.791664] show_stack+0x24/0x30 > > > [22216.791956] ---[ end trace dc769a61c1af087b ]--- > > > [22216.792295] dump_stack+0xd0/0x128 > > > [22216.792299] bad_page+0xe4/0x110 > > > [22216.792579] Kernel panic - not syncing: Oops - BUG: Fatal exception > > > in interrupt > > > [22216.792937] check_free_page_bad+0x84/0x90 > > > [22216.792940] free_pcp_prepare+0x1fc/0x21c > > > [22216.793253] SMP: stopping secondary CPUs > > > [22216.793525] free_unref_page+0x2c/0xec > > > [22216.805537] __put_page+0x60/0x70 > > > [22216.805931] collapse_file+0xdc8/0x12f0 > > > [22216.806385] khugepaged_scan_file+0x2dc/0x37c > > > [22216.806900] khugepaged_scan_mm_slot+0x2e0/0x380 > > > [22216.807450] khugepaged_do_scan+0x2dc/0x2fc > > > [22216.807946] khugepaged+0x38/0x100 > > > [22216.808342] kthread+0x11c/0x120 > > > [22216.808735] Kernel Offset: disabled > > > [22216.809153] CPU features: 0x0040002,62208238 > > > [22216.809681] Memory Limit: none > > > [22216.813477] Starting crashdump kernel... > > > > > > So I think the race also exists between collapse_file and > > > truncate_pagecache. > > > > I do have CONFIG_DEBUG_VM, but I haven't hit this issue yet. > > Sorry, it's taken me a long time to get into this bug(s). > > I haven't tried to reproduce it, but I do think that Rongwei will > be proved right. > > In doing a lockless lookup of the page cache, we tend to imagine > that the THP head page will be encountered first, and the special > treatment for the head will do the right thing to skip the tails. > > But when racing against collapse_file()'s rewind, I agree with > Rongwei that it is possible for truncation (or page cache cleanup > in this instance) to collect a pagevec which starts with a PageTail > some way into the compound page. > > shmem_undo_range() (which shmem uses rather than truncate_pagecache()) > would not call truncate_inode_page() on a THP tail: if it encounters a > tail, it will try to split the THP, and not call truncate_inode_page() > if it cannot. Unless I'm inventing the memory, I now think that I did > encounter this race between truncate and collapse on huge shmem, and > had to re-craft my shmem_punch_compound() to handle it correctly. > > If it is just a matter of collapse_file() rewind, I suppose we could > reverse the direction in which that proceeds; but I'm not convinced > that would be enough, and don't think we need such a "big" change. > > Aside from the above page->index mischeck in find_lock_entries(), > I now think this bug needs nothing more than simply removing the > VM_BUG_ON_PAGE(PageTail(page), page) from truncate_inode_page(). I don't think that's right. This bug was also observed when calling truncate(), so there's clearly a situation where two concurrent calls to truncate_pagecache() leaves a THP in the cache. Unless there's a case where mapping->nr_thps gets corrupted, so the open() thinks there's no THPs in the page cache, when there's actually one or more? That might be a problem that we're solving by locking around the truncate_pagecache() call? > (You could say move its page->mapping check before its PageTail > check; but the PageTail check would then never catch anything.) > > Rongwei's patch went in the right direction, but why add N lines > if deleting one is good? And for a while I thought that idea of > using filemap_invalidate_lock() was very good; but now think > the page lock we already have is better, and solve both races. > > Hugh
On Thu, 30 Sep 2021, Matthew Wilcox wrote: > On Wed, Sep 29, 2021 at 10:24:44PM -0700, Hugh Dickins wrote: > > > > Aside from the above page->index mischeck in find_lock_entries(), > > I now think this bug needs nothing more than simply removing the > > VM_BUG_ON_PAGE(PageTail(page), page) from truncate_inode_page(). > > I don't think that's right. This bug was also observed when calling > truncate(), so there's clearly a situation where two concurrent calls > to truncate_pagecache() leaves a THP in the cache. I assume you're thinking of one of the fuzzer blkdev ones: https://lore.kernel.org/linux-mm/CACkBjsbtF_peC7N_4mRfHML_BeiPe+O9DahTfr84puSG_J9rcQ@mail.gmail.com/ or https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/ I haven't started on those ones yet: yes, I imagine one or both of those will need a further fix (S_ISREG() check somewhere if we're lucky; but could well be nastier); but for the bug in this thread, I expect removing the VM_BUG_ON_PAGE(PageTail) to be enough. If you're thinking of something else, please send a link if you can - thanks. Hugh
On Thu, Sep 30, 2021 at 9:49 AM Hugh Dickins <hughd@google.com> wrote: > > On Thu, 30 Sep 2021, Matthew Wilcox wrote: > > On Wed, Sep 29, 2021 at 10:24:44PM -0700, Hugh Dickins wrote: > > > > > > Aside from the above page->index mischeck in find_lock_entries(), > > > I now think this bug needs nothing more than simply removing the > > > VM_BUG_ON_PAGE(PageTail(page), page) from truncate_inode_page(). > > > > I don't think that's right. This bug was also observed when calling > > truncate(), so there's clearly a situation where two concurrent calls > > to truncate_pagecache() leaves a THP in the cache. > > I assume you're thinking of one of the fuzzer blkdev ones: > https://lore.kernel.org/linux-mm/CACkBjsbtF_peC7N_4mRfHML_BeiPe+O9DahTfr84puSG_J9rcQ@mail.gmail.com/ > or > https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/ > > I haven't started on those ones yet: yes, I imagine one or both of those > will need a further fix (S_ISREG() check somewhere if we're lucky; but > could well be nastier); but for the bug in this thread, I expect Makes sense to me. We should be able to check S_ISREG() in khugepaged, if it is not a regular file, just bail out. Sounds not that nasty to me AFAIU. > removing the VM_BUG_ON_PAGE(PageTail) to be enough. > > If you're thinking of something else, please send a link if you can - thanks. > > Hugh >
On 10/1/21 12:49 AM, Hugh Dickins wrote: > On Thu, 30 Sep 2021, Matthew Wilcox wrote: >> On Wed, Sep 29, 2021 at 10:24:44PM -0700, Hugh Dickins wrote: >>> >>> Aside from the above page->index mischeck in find_lock_entries(), >>> I now think this bug needs nothing more than simply removing the >>> VM_BUG_ON_PAGE(PageTail(page), page) from truncate_inode_page(). >> >> I don't think that's right. This bug was also observed when calling >> truncate(), so there's clearly a situation where two concurrent calls >> to truncate_pagecache() leaves a THP in the cache. > > I assume you're thinking of one of the fuzzer blkdev ones: > https://lore.kernel.org/linux-mm/CACkBjsbtF_peC7N_4mRfHML_BeiPe+O9DahTfr84puSG_J9rcQ@mail.gmail.com/ > or > https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/ > > I haven't started on those ones yet: yes, I imagine one or both of those > will need a further fix (S_ISREG() check somewhere if we're lucky; but > could well be nastier); but for the bug in this thread, I expect > removing the VM_BUG_ON_PAGE(PageTail) to be enough. Thanks for your advices! I'm so sorry that delay the test due to my recent vacation. I plan to start further test tomorrow. I think removing the VM_BUG_ON_PAGE(PageTail) is a good idea, but also think using filemap_invalidate_lock is safer and necessary. And of course, this is just my own view! So, now, I tend to use filemap_invalidate_lock and either check page mapping again or remove VM_BUG_ON_PAGE(PageTail). Anyway, I will run more tests and then give feedback. Thanks! > > If you're thinking of something else, please send a link if you can - thanks. > > Hugh >
On Thu, Sep 30, 2021 at 10:39:14AM -0700, Yang Shi wrote: > On Thu, Sep 30, 2021 at 9:49 AM Hugh Dickins <hughd@google.com> wrote: > > I assume you're thinking of one of the fuzzer blkdev ones: > > https://lore.kernel.org/linux-mm/CACkBjsbtF_peC7N_4mRfHML_BeiPe+O9DahTfr84puSG_J9rcQ@mail.gmail.com/ > > or > > https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/ > > > > I haven't started on those ones yet: yes, I imagine one or both of those > > will need a further fix (S_ISREG() check somewhere if we're lucky; but > > could well be nastier); but for the bug in this thread, I expect > > Makes sense to me. We should be able to check S_ISREG() in khugepaged, > if it is not a regular file, just bail out. Sounds not that nasty to > me AFAIU. I don't see why we should have an S_ISREG() check. I agree it's not the intended usecase, but it ought to work fine. Unless there's something I'm missing?
Hi, I have run our cases these two days to stress test new Patch #1. The new Patch #1 mainly add filemap_invalidate_{un}lock before and after truncate_pagecache(), basing on original Patch #1. And the crash has not happened. Now, I keep the original Patch #1, then adding the code below which suggested by liu song (I'm not sure which one I should add in the next version, Suggested-by or Signed-off-by? If you know, please remind me). - if (filemap_nr_thps(inode->i_mapping)) + if (filemap_nr_thps(inode->i_mapping)) { + filemap_invalidate_lock(inode->i_mapping); truncate_pagecache(inode, 0); + filemap_invalidate_unlock(inode->i_mapping); + } And the reason for keeping the original Patch #1 is mainly to fix the race between collapse_file and truncate_pagecache. It seems necessary. Despite the two-day test, I did not reproduce this race any more. In addition, I also test the below method: diff --git a/mm/truncate.c b/mm/truncate.c index 3f47190f98a8..33604e4ce60a 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -210,8 +210,6 @@ invalidate_complete_page(struct address_space *mapping, struct page *page) int truncate_inode_page(struct address_space *mapping, struct page *page) { - VM_BUG_ON_PAGE(PageTail(page), page); - if (page->mapping != mapping) return -EIO; I am not very sure this VM_BUG_ON_PAGE(PageTail) is what Hugh means. And the test results show that only removing this VM_BUG_ON_PAGE(PageTail) has no effect. So, I still keep the original Patch #1 to fix one race. I plan to send Patch v3 after receiving your reply. Thanks! On 9/30/21 8:41 AM, Song Liu wrote: > On Wed, Sep 29, 2021 at 5:02 PM Matthew Wilcox <willy@infradead.org> wrote: >> >> On Wed, Sep 29, 2021 at 04:41:48PM -0700, Song Liu wrote: >>> The issue is NOT caused by concurrent khugepaged:collapse_file() and >>> truncate_pagecache(inode, 0). With some printks, we can see a clear >>> time gap (>2 second ) between collapse_file() finishes, and >>> truncate_pagecache() (which crashes soon). Therefore, my earlier >>> suggestion that adds deny_write_access() to collapse_file() does NOT >>> work. >>> >>> The crash is actually caused by concurrent truncate_pagecache(inode, 0). >>> If I change the number of write thread in stress_madvise_dso.c to one, >>> (IOW, one thread_read and one thread_write), I cannot reproduce the >>> crash anymore. >>> >>> I think this means we cannot fix this issue in collapse_file(), because it >>> finishes long before the crash. >> >> Ah! So are we missing one or more of these locks: >> >> inode_lock(inode); >> filemap_invalidate_lock(mapping); >> >> in the open path? > > The following fixes the crash in my test. But I am not sure whether this is the > best fix. > > Rongwei, could you please run more tests on it? > > Thanks, > Song > > > diff --git i/fs/open.c w/fs/open.c > index daa324606a41f..d13c4668b2e53 100644 > --- i/fs/open.c > +++ w/fs/open.c > @@ -856,8 +856,11 @@ static int do_dentry_open(struct file *f, > * of THPs into the page cache will fail. > */ > smp_mb(); > - if (filemap_nr_thps(inode->i_mapping)) > + if (filemap_nr_thps(inode->i_mapping)) { > + filemap_invalidate_lock(inode->i_mapping); > truncate_pagecache(inode, 0); > + filemap_invalidate_unlock(inode->i_mapping); > + } > } > > return 0; >
On Sat, Oct 2, 2021 at 10:09 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Thu, Sep 30, 2021 at 10:39:14AM -0700, Yang Shi wrote: > > On Thu, Sep 30, 2021 at 9:49 AM Hugh Dickins <hughd@google.com> wrote: > > > I assume you're thinking of one of the fuzzer blkdev ones: > > > https://lore.kernel.org/linux-mm/CACkBjsbtF_peC7N_4mRfHML_BeiPe+O9DahTfr84puSG_J9rcQ@mail.gmail.com/ > > > or > > > https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/ > > > > > > I haven't started on those ones yet: yes, I imagine one or both of those > > > will need a further fix (S_ISREG() check somewhere if we're lucky; but > > > could well be nastier); but for the bug in this thread, I expect > > > > Makes sense to me. We should be able to check S_ISREG() in khugepaged, > > if it is not a regular file, just bail out. Sounds not that nasty to > > me AFAIU. > > I don't see why we should have an S_ISREG() check. I agree it's not the > intended usecase, but it ought to work fine. Unless there's something > I'm missing? Check out this bug report: https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/ and the patch from me: https://lore.kernel.org/linux-mm/20210917205731.262693-1-shy828301@gmail.com/ I don't think we handle buffers correctly for file THP, right? My patch is ad hoc, so I thought Hugh's suggestion makes some sense to me. Why do we have THP collapsed for unintended usecase in the first place?
On Tue, Oct 05, 2021 at 01:26:50AM +0800, Rongwei Wang wrote: > Hi, > I have run our cases these two days to stress test new Patch #1. The new > Patch #1 mainly add filemap_invalidate_{un}lock before and after > truncate_pagecache(), basing on original Patch #1. And the crash has not > happened. You shouldn't need most of patch 1. In fact, the only two patches you should need would be this: +++ b/mm/filemap.c @@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start, if (!xa_is_value(page)) { if (page->index < start) goto put; - VM_BUG_ON_PAGE(page->index != xas.xa_index, page); if (page->index + thp_nr_pages(page) - 1 > end) goto put; if (!trylock_page(page)) (already in Andrew's tree) and: > - if (filemap_nr_thps(inode->i_mapping)) > + if (filemap_nr_thps(inode->i_mapping)) { > + filemap_invalidate_lock(inode->i_mapping); > truncate_pagecache(inode, 0); > + filemap_invalidate_unlock(inode->i_mapping); > + } If you can still hit a bug with just those two patches, then something else is going wrong, and needs to be investigated.
On Mon, Oct 04, 2021 at 11:28:45AM -0700, Yang Shi wrote: > On Sat, Oct 2, 2021 at 10:09 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Thu, Sep 30, 2021 at 10:39:14AM -0700, Yang Shi wrote: > > > On Thu, Sep 30, 2021 at 9:49 AM Hugh Dickins <hughd@google.com> wrote: > > > > I assume you're thinking of one of the fuzzer blkdev ones: > > > > https://lore.kernel.org/linux-mm/CACkBjsbtF_peC7N_4mRfHML_BeiPe+O9DahTfr84puSG_J9rcQ@mail.gmail.com/ > > > > or > > > > https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/ > > > > > > > > I haven't started on those ones yet: yes, I imagine one or both of those > > > > will need a further fix (S_ISREG() check somewhere if we're lucky; but > > > > could well be nastier); but for the bug in this thread, I expect > > > > > > Makes sense to me. We should be able to check S_ISREG() in khugepaged, > > > if it is not a regular file, just bail out. Sounds not that nasty to > > > me AFAIU. > > > > I don't see why we should have an S_ISREG() check. I agree it's not the > > intended usecase, but it ought to work fine. Unless there's something > > I'm missing? > > Check out this bug report: > https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/ > and the patch from me: > https://lore.kernel.org/linux-mm/20210917205731.262693-1-shy828301@gmail.com/ > > I don't think we handle buffers correctly for file THP, right? My > patch is ad hoc, so I thought Hugh's suggestion makes some sense to > me. Why do we have THP collapsed for unintended usecase in the first > place? OK, I've done some more digging. I think what's going on with this report is userspace opens the block device RO, causes the page cache to be loaded with data, then khugepaged comes in and creates THPs. What confuses me is that these THPs have private data attached to them. I don't know how that happens. If it's block device specific, then yes, something like your S_ISREG() idea should work fine. Otherwise, we might need to track down another problem. Hao Sun, can you try this patch and see what comes out of it? diff --git a/fs/buffer.c b/fs/buffer.c index c615387aedca..fefe05a9973d 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -872,6 +872,7 @@ link_dev_buffers(struct page *page, struct buffer_head *head) bh = bh->b_this_page; } while (bh); tail->b_this_page = head; + VM_BUG_ON_PAGE(PageCompound(page), page); attach_page_private(page, head); } @@ -1577,6 +1578,7 @@ void create_empty_buffers(struct page *page, bh = bh->b_this_page; } while (bh != head); } + VM_BUG_ON_PAGE(PageCompound(page), page); attach_page_private(page, head); spin_unlock(&page->mapping->private_lock); } @@ -2565,6 +2567,7 @@ static void attach_nobh_buffers(struct page *page, struct buffer_head *head) bh->b_this_page = head; bh = bh->b_this_page; } while (bh != head); + VM_BUG_ON_PAGE(PageCompound(page), page); attach_page_private(page, head); spin_unlock(&page->mapping->private_lock); }
On Mon, Oct 4, 2021 at 10:26 AM Rongwei Wang <rongwei.wang@linux.alibaba.com> wrote: > > Hi, > I have run our cases these two days to stress test new Patch #1. The new > Patch #1 mainly add filemap_invalidate_{un}lock before and after > truncate_pagecache(), basing on original Patch #1. And the crash has not > happened. > > Now, I keep the original Patch #1, then adding the code below which > suggested by liu song (I'm not sure which one I should add in the next > version, Suggested-by or Signed-off-by? If you know, please remind me). > > - if (filemap_nr_thps(inode->i_mapping)) > + if (filemap_nr_thps(inode->i_mapping)) { > + filemap_invalidate_lock(inode->i_mapping); > truncate_pagecache(inode, 0); > + filemap_invalidate_unlock(inode->i_mapping); > + } It is mostly suggested by Matthew. If the patch goes that way, you can add Tested-by: Song Liu <song@kernel.org>
On 10/5/21 3:05 AM, Matthew Wilcox wrote: > On Tue, Oct 05, 2021 at 01:26:50AM +0800, Rongwei Wang wrote: >> Hi, >> I have run our cases these two days to stress test new Patch #1. The new >> Patch #1 mainly add filemap_invalidate_{un}lock before and after >> truncate_pagecache(), basing on original Patch #1. And the crash has not >> happened. > > You shouldn't need most of patch 1. > > In fact, the only two patches you should need would be this: > > +++ b/mm/filemap.c > @@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start, > if (!xa_is_value(page)) { > if (page->index < start) > goto put; > - VM_BUG_ON_PAGE(page->index != xas.xa_index, page); > if (page->index + thp_nr_pages(page) - 1 > end) > goto put; > if (!trylock_page(page)) > > (already in Andrew's tree) and: > >> - if (filemap_nr_thps(inode->i_mapping)) >> + if (filemap_nr_thps(inode->i_mapping)) { >> + filemap_invalidate_lock(inode->i_mapping); >> truncate_pagecache(inode, 0); >> + filemap_invalidate_unlock(inode->i_mapping); >> + } > > If you can still hit a bug with just those two patches, then something > else is going wrong, and needs to be investigated. OK, I see what your mean. I will send Patch v3 and only keep filemap_invalidate_{un}lock in Patch #1. Thanks! >
On Mon, 4 Oct 2021, Matthew Wilcox wrote: > On Mon, Oct 04, 2021 at 11:28:45AM -0700, Yang Shi wrote: > > On Sat, Oct 2, 2021 at 10:09 AM Matthew Wilcox <willy@infradead.org> wrote: > > > On Thu, Sep 30, 2021 at 10:39:14AM -0700, Yang Shi wrote: > > > > On Thu, Sep 30, 2021 at 9:49 AM Hugh Dickins <hughd@google.com> wrote: > > > > > I assume you're thinking of one of the fuzzer blkdev ones: > > > > > https://lore.kernel.org/linux-mm/CACkBjsbtF_peC7N_4mRfHML_BeiPe+O9DahTfr84puSG_J9rcQ@mail.gmail.com/ > > > > > or > > > > > https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/ > > > > > > > > > > I haven't started on those ones yet: yes, I imagine one or both of those > > > > > will need a further fix (S_ISREG() check somewhere if we're lucky; but > > > > > could well be nastier); but for the bug in this thread, I expect > > > > > > > > Makes sense to me. We should be able to check S_ISREG() in khugepaged, > > > > if it is not a regular file, just bail out. Sounds not that nasty to > > > > me AFAIU. > > > > > > I don't see why we should have an S_ISREG() check. I agree it's not the > > > intended usecase, but it ought to work fine. Unless there's something > > > I'm missing? > > > > Check out this bug report: > > https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/ > > and the patch from me: > > https://lore.kernel.org/linux-mm/20210917205731.262693-1-shy828301@gmail.com/ > > > > I don't think we handle buffers correctly for file THP, right? My > > patch is ad hoc, so I thought Hugh's suggestion makes some sense to > > me. Why do we have THP collapsed for unintended usecase in the first > > place? > > OK, I've done some more digging. I think what's going on with this > report is userspace opens the block device RO, causes the page cache to > be loaded with data, then khugepaged comes in and creates THPs. Yes. > What confuses me is that these THPs have private data attached to them. > I don't know how that happens. If it's block device specific, then > yes, something like your S_ISREG() idea should work fine. Otherwise, > we might need to track down another problem. Agreed, the file THP is created without PagePrivate, so the puzzle was why the read-only cached page would later become page_has_private(). The C repro showed that it uses (a BTRFS_IOC_ADD_DEV ioctl which might not be relevant and) a BLKRRPART ioctl 0x125f: I didn't follow BLKRRPART all the way down, but imagine it has to attach buffer-heads to re-read the partition table. Which would explain it. Aside from that particular ioctl, it seems a good idea to insist on S_ISREG just to shrink the attack surface: as Yang Shi says, executable THP on block device was never an intended usecase, and not a usecase anyone is likely to miss! And that fuzzer appears to delight in tormenting /dev/nullb0, so let's just seal off that avenue. You're right to have some doubt, as to whether there might be other ways for buffer-heads to get attached, even on a read-only regular file; but no way has sprung to my mind, and READ_ONLY_THP_FOR_FS has survived well in its intended usage: so I think we should proceed on the assumption that no further bugs remain - then fix them when found. I wasn't able to reproduce the problem with the repro, would need to waste many hours to do so. But here's the untested S_ISREG patch I came up with. Sorry, I've mixed something else in: in moving the alignment part to clarify the conditions, I was alarmed to see that shmem with !shmem_huge_enabled was falling through to THP_FOR_FS to give unexpected huge pages: fixed that, though later found there's a separate shmem_huge_enabled() check which should exclude it. --- 5.15-rc4/mm/khugepaged.c 2021-09-12 17:39:21.943438422 -0700 +++ linux/khugepaged.c 2021-10-03 20:41:13.194822795 -0700 @@ -445,22 +445,25 @@ static bool hugepage_vma_check(struct vm if (!transhuge_vma_enabled(vma, vm_flags)) return false; + if (vma->vm_file && !IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - + vma->vm_pgoff, HPAGE_PMD_NR)) + return false; + /* Enabled via shmem mount options or sysfs settings. */ - if (shmem_file(vma->vm_file) && shmem_huge_enabled(vma)) { - return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff, - HPAGE_PMD_NR); - } + if (shmem_file(vma->vm_file)) + return shmem_huge_enabled(vma); /* THP settings require madvise. */ if (!(vm_flags & VM_HUGEPAGE) && !khugepaged_always()) return false; /* Read-only file mappings need to be aligned for THP to work. */ - if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && vma->vm_file && - !inode_is_open_for_write(vma->vm_file->f_inode) && - (vm_flags & VM_EXEC)) { - return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff, - HPAGE_PMD_NR); + if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && + (vm_flags & VM_EXEC) && vma->vm_file) { + struct inode *inode = vma->vm_file->f_inode; + + return !inode_is_open_for_write(inode) && + S_ISREG(inode->i_mode); } if (!vma->anon_vma || vma->vm_ops)
On Tue, 5 Oct 2021, Rongwei Wang wrote: > Hi, > I have run our cases these two days to stress test new Patch #1. The new Patch > #1 mainly add filemap_invalidate_{un}lock before and after > truncate_pagecache(), basing on original Patch #1. And the crash has not > happened. > > Now, I keep the original Patch #1, then adding the code below which suggested > by liu song (I'm not sure which one I should add in the next version, > Suggested-by or Signed-off-by? If you know, please remind me). > > - if (filemap_nr_thps(inode->i_mapping)) > + if (filemap_nr_thps(inode->i_mapping)) { > + filemap_invalidate_lock(inode->i_mapping); > truncate_pagecache(inode, 0); > + filemap_invalidate_unlock(inode->i_mapping); > + } I won't NAK that patch; but I still believe it's unnecessary, and don't see how it protects against all the races (collapse_file() does not use that lock, whereas collapse_file() does use page lock). And if you're hoping to fix 5.10, then you will have to backport those invalidate_lock patches there too (they're really intended to protect hole-punching). > > And the reason for keeping the original Patch #1 is mainly to fix the race > between collapse_file and truncate_pagecache. It seems necessary. Despite the > two-day test, I did not reproduce this race any more. > > In addition, I also test the below method: > > diff --git a/mm/truncate.c b/mm/truncate.c > index 3f47190f98a8..33604e4ce60a 100644 > --- a/mm/truncate.c > +++ b/mm/truncate.c > @@ -210,8 +210,6 @@ invalidate_complete_page(struct address_space *mapping, > struct page *page) > > int truncate_inode_page(struct address_space *mapping, struct page *page) > { > - VM_BUG_ON_PAGE(PageTail(page), page); > - > if (page->mapping != mapping) > return -EIO; > > I am not very sure this VM_BUG_ON_PAGE(PageTail) is what Hugh means. And > the test results show that only removing this VM_BUG_ON_PAGE(PageTail) has no > effect. So, I still keep the original Patch #1 to fix one race. Yes, that's exactly what I meant, and thank you for intending to try it. But if that patch had "no effect", then I think you were not running the kernel with that patch applied: because it deletes the BUG on line 213 of mm/truncate.c, which is what you reported in the first mail! Or, is line 213 of mm/truncate.c in your 5.10.46-hugetext+ kernel something else? I've been looking at 5.15-rc. But I wasn't proposing to delete it merely to hide the BUG: as I hope I explained, we could move it below the page->mapping check, but it wouldn't really be of any value there since tails have NULL page->mapping anyway (well, I didn't check first and second tails, maybe mapping gets reused for some compound page field in those). I was proposing to delete it because the page->mapping check then weeds out the racy case once we're holding page lock, without the need for adding anything special. Hugh
On Mon, Oct 04, 2021 at 07:58:10PM -0700, Hugh Dickins wrote: > On Tue, 5 Oct 2021, Rongwei Wang wrote: > > > Hi, > > I have run our cases these two days to stress test new Patch #1. The new Patch > > #1 mainly add filemap_invalidate_{un}lock before and after > > truncate_pagecache(), basing on original Patch #1. And the crash has not > > happened. > > > > Now, I keep the original Patch #1, then adding the code below which suggested > > by liu song (I'm not sure which one I should add in the next version, > > Suggested-by or Signed-off-by? If you know, please remind me). > > > > - if (filemap_nr_thps(inode->i_mapping)) > > + if (filemap_nr_thps(inode->i_mapping)) { > > + filemap_invalidate_lock(inode->i_mapping); > > truncate_pagecache(inode, 0); > > + filemap_invalidate_unlock(inode->i_mapping); > > + } > > I won't NAK that patch; but I still believe it's unnecessary, and don't > see how it protects against all the races (collapse_file() does not use > that lock, whereas collapse_file() does use page lock). And if you're > hoping to fix 5.10, then you will have to backport those invalidate_lock > patches there too (they're really intended to protect hole-punching). I believe all we really need to do is protect against calling truncate_pagecache() simultaneously to avoid one of the calls seeing a tail page. i_mutex would work for this purpose just as well as filemap_invalidate_lock(). See, for example, ext4_zero_range() which first takes inode_lock(), then filemap_invalidate_lock() before calling truncate_pagecache_range(). > > And the reason for keeping the original Patch #1 is mainly to fix the race > > between collapse_file and truncate_pagecache. It seems necessary. Despite the > > two-day test, I did not reproduce this race any more. > > > > In addition, I also test the below method: > > > > diff --git a/mm/truncate.c b/mm/truncate.c > > index 3f47190f98a8..33604e4ce60a 100644 > > --- a/mm/truncate.c > > +++ b/mm/truncate.c > > @@ -210,8 +210,6 @@ invalidate_complete_page(struct address_space *mapping, > > struct page *page) > > > > int truncate_inode_page(struct address_space *mapping, struct page *page) > > { > > - VM_BUG_ON_PAGE(PageTail(page), page); > > - > > if (page->mapping != mapping) > > return -EIO; > > > > I am not very sure this VM_BUG_ON_PAGE(PageTail) is what Hugh means. And > > the test results show that only removing this VM_BUG_ON_PAGE(PageTail) has no > > effect. So, I still keep the original Patch #1 to fix one race. > > Yes, that's exactly what I meant, and thank you for intending to try it. > > But if that patch had "no effect", then I think you were not running the > kernel with that patch applied: because it deletes the BUG on line 213 > of mm/truncate.c, which is what you reported in the first mail! > > Or, is line 213 of mm/truncate.c in your 5.10.46-hugetext+ kernel > something else? I've been looking at 5.15-rc. > > But I wasn't proposing to delete it merely to hide the BUG: as I hope > I explained, we could move it below the page->mapping check, but it > wouldn't really be of any value there since tails have NULL page->mapping > anyway (well, I didn't check first and second tails, maybe mapping gets > reused for some compound page field in those). I was proposing to delete > it because the page->mapping check then weeds out the racy case once > we're holding page lock, without the need for adding anything special. I think if we remove the race with the above mutex lock then we'll never see a tail page in this routine.
On 10/5/21 10:58 AM, Hugh Dickins wrote: > On Tue, 5 Oct 2021, Rongwei Wang wrote: > >> Hi, >> I have run our cases these two days to stress test new Patch #1. The new Patch >> #1 mainly add filemap_invalidate_{un}lock before and after >> truncate_pagecache(), basing on original Patch #1. And the crash has not >> happened. >> >> Now, I keep the original Patch #1, then adding the code below which suggested >> by liu song (I'm not sure which one I should add in the next version, >> Suggested-by or Signed-off-by? If you know, please remind me). >> >> - if (filemap_nr_thps(inode->i_mapping)) >> + if (filemap_nr_thps(inode->i_mapping)) { >> + filemap_invalidate_lock(inode->i_mapping); >> truncate_pagecache(inode, 0); >> + filemap_invalidate_unlock(inode->i_mapping); >> + } > > I won't NAK that patch; but I still believe it's unnecessary, and don't > see how it protects against all the races (collapse_file() does not use > that lock, whereas collapse_file() does use page lock). And if you're > hoping to fix 5.10, then you will have to backport those invalidate_lock > patches there too (they're really intended to protect hole-punching). > >> >> And the reason for keeping the original Patch #1 is mainly to fix the race >> between collapse_file and truncate_pagecache. It seems necessary. Despite the >> two-day test, I did not reproduce this race any more. >> >> In addition, I also test the below method: >> >> diff --git a/mm/truncate.c b/mm/truncate.c >> index 3f47190f98a8..33604e4ce60a 100644 >> --- a/mm/truncate.c >> +++ b/mm/truncate.c >> @@ -210,8 +210,6 @@ invalidate_complete_page(struct address_space *mapping, >> struct page *page) >> >> int truncate_inode_page(struct address_space *mapping, struct page *page) >> { >> - VM_BUG_ON_PAGE(PageTail(page), page); >> - >> if (page->mapping != mapping) >> return -EIO; >> >> I am not very sure this VM_BUG_ON_PAGE(PageTail) is what Hugh means. And >> the test results show that only removing this VM_BUG_ON_PAGE(PageTail) has no >> effect. So, I still keep the original Patch #1 to fix one race. > > Yes, that's exactly what I meant, and thank you for intending to try it. > > But if that patch had "no effect", then I think you were not running the > kernel with that patch applied: because it deletes the BUG on line 213 > of mm/truncate.c, which is what you reported in the first mail! > > Or, is line 213 of mm/truncate.c in your 5.10.46-hugetext+ kernel > something else? I've been looking at 5.15-rc. Hi, Hugh I'm sorry the confusing '5.10.46-hugetext+'. I am also look and test at 5.15-rc. > > But I wasn't proposing to delete it merely to hide the BUG: as I hope > I explained, we could move it below the page->mapping check, but it > wouldn't really be of any value there since tails have NULL page->mapping > anyway (well, I didn't check first and second tails, maybe mapping gets > reused for some compound page field in those). I was proposing to delete > it because the page->mapping check then weeds out the racy case once > we're holding page lock, without the need for adding anything special. > > Hugh Today, I try again to create some cases to reproduce the race, such as ensuring that multiple processes are always executing ‘truncate_pagecache’ and only mapping 2M DSO. In this way, I try to ensure that the target of 'collapse_file' and 'truncate_pagecache' can only be the same VMA, to increase the probability of reproducing that race. But, I can't reproduce that race any more. In fact, according to the previous experience, the current number of attempts should be able to reproduce that race. If you have the idea about creating this case, please tell me, and I can try again. Or we can solve it when it appears again. Thanks! >
diff --git a/mm/filemap.c b/mm/filemap.c index dae481293..a3af2ec 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start, if (!xa_is_value(page)) { if (page->index < start) goto put; - VM_BUG_ON_PAGE(page->index != xas.xa_index, page); if (page->index + thp_nr_pages(page) - 1 > end) goto put; if (!trylock_page(page)) @@ -2102,6 +2101,12 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start, goto unlock; VM_BUG_ON_PAGE(!thp_contains(page, xas.xa_index), page); + /* + * We can find and get head page of file THP with + * non-head index. The head page should have already + * be truncated with page->mapping reset to NULL. + */ + VM_BUG_ON_PAGE(page->index != xas.xa_index, page); } indices[pvec->nr] = xas.xa_index; if (!pagevec_add(pvec, page)) diff --git a/mm/truncate.c b/mm/truncate.c index 714eaf1..3f47190 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -319,7 +319,8 @@ void truncate_inode_pages_range(struct address_space *mapping, index = start; while (index < end && find_lock_entries(mapping, index, end - 1, &pvec, indices)) { - index = indices[pagevec_count(&pvec) - 1] + 1; + index = indices[pagevec_count(&pvec) - 1] + + thp_nr_pages(pvec.pages[pagevec_count(&pvec) - 1]); truncate_exceptional_pvec_entries(mapping, &pvec, indices); for (i = 0; i < pagevec_count(&pvec); i++) truncate_cleanup_page(pvec.pages[i]); @@ -392,6 +393,20 @@ void truncate_inode_pages_range(struct address_space *mapping, continue; lock_page(page); + /* + * Already truncated? We can find and get subpage + * of file THP, of which the head page is truncated. + * + * In addition, another race will be avoided, where + * collapse_file rolls back when writer truncates the + * page cache. + */ + if (page_mapping(page) != mapping) { + unlock_page(page); + /* Restart to make sure all gone */ + index = start - 1; + break; + } WARN_ON(page_to_index(page) != index); wait_on_page_writeback(page); truncate_inode_page(mapping, page);