From patchwork Mon Apr 13 10:04:42 2020
X-Patchwork-Submitter: "Kirill A . Shutemov" X-Patchwork-Id: 11485389
From: "Kirill A. Shutemov" To: akpm@linux-foundation.org, Andrea Arcangeli Cc: Zi Yan , Yang Shi , Ralph Campbell , John Hubbard , William Kucharski , linux-mm@kvack.org, linux-kernel@vger.kernel.org, "Kirill A.
Shutemov" Subject: [PATCH 1/6] khugepaged: Add self test Date: Mon, 13 Apr 2020 13:04:42 +0300 Message-Id: <20200413100447.20073-2-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20200413100447.20073-1-kirill.shutemov@linux.intel.com> References: <20200413100447.20073-1-kirill.shutemov@linux.intel.com> MIME-Version: 1.0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: The test checks if khugepaged is able to recover huge page where we expect to do so. It only covers anon-THP for now. Currently the test shows few failures. They are going to be addressed by the following patches. Signed-off-by: Kirill A. Shutemov --- tools/testing/selftests/vm/Makefile | 1 + tools/testing/selftests/vm/khugepaged.c | 899 ++++++++++++++++++++++++ 2 files changed, 900 insertions(+) create mode 100644 tools/testing/selftests/vm/khugepaged.c diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile index 7f9a8a8c31da..981d0dc21f9e 100644 --- a/tools/testing/selftests/vm/Makefile +++ b/tools/testing/selftests/vm/Makefile @@ -18,6 +18,7 @@ TEST_GEN_FILES += on-fault-limit TEST_GEN_FILES += thuge-gen TEST_GEN_FILES += transhuge-stress TEST_GEN_FILES += userfaultfd +TEST_GEN_FILES += khugepaged ifneq (,$(filter $(ARCH),arm64 ia64 mips64 parisc64 ppc64 riscv64 s390x sh64 sparc64 x86_64)) TEST_GEN_FILES += va_128TBswitch diff --git a/tools/testing/selftests/vm/khugepaged.c b/tools/testing/selftests/vm/khugepaged.c new file mode 100644 index 000000000000..34d945e71e2e --- /dev/null +++ b/tools/testing/selftests/vm/khugepaged.c @@ -0,0 +1,899 @@ +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#ifndef MADV_PAGEOUT +#define MADV_PAGEOUT 21 +#endif + +#define BASE_ADDR ((void *)(1UL << 30)) +static unsigned long hpage_pmd_size; +static unsigned long page_size; +static int hpage_pmd_nr; + +#define THP_SYSFS "/sys/kernel/mm/transparent_hugepage/" + +enum thp_enabled { + THP_ALWAYS, + THP_MADVISE, + THP_NEVER, +}; + +static const char *thp_enabled_strings[] = { + "always", + "madvise", + "never", + NULL +}; + +enum thp_defrag { + THP_DEFRAG_ALWAYS, + THP_DEFRAG_DEFER, + THP_DEFRAG_DEFER_MADVISE, + THP_DEFRAG_MADVISE, + THP_DEFRAG_NEVER, +}; + +static const char *thp_defrag_strings[] = { + "always", + "defer", + "defer+madvise", + "madvise", + "never", + NULL +}; + +enum shmem_enabled { + SHMEM_ALWAYS, + SHMEM_WITHIN_SIZE, + SHMEM_ADVISE, + SHMEM_NEVER, + SHMEM_DENY, + SHMEM_FORCE, +}; + +static const char *shmem_enabled_strings[] = { + "always", + "within_size", + "advise", + "never", + "deny", + "force", + NULL +}; + +struct khugepaged_settings { + bool defrag; + unsigned int alloc_sleep_millisecs; + unsigned int scan_sleep_millisecs; + unsigned int max_ptes_none; + unsigned int max_ptes_swap; + unsigned long pages_to_scan; +}; + +struct settings { + enum thp_enabled thp_enabled; + enum thp_defrag thp_defrag; + enum shmem_enabled shmem_enabled; + bool debug_cow; + bool use_zero_page; + struct khugepaged_settings khugepaged; +}; + +static struct settings default_settings = { + .thp_enabled = THP_MADVISE, + .thp_defrag = THP_DEFRAG_ALWAYS, + .shmem_enabled = SHMEM_NEVER, + .debug_cow = 0, + .use_zero_page = 0, + .khugepaged = { + .defrag = 1, + .alloc_sleep_millisecs = 10, + .scan_sleep_millisecs = 10, + }, +}; + +static struct settings saved_settings; +static bool 
skip_settings_restore; + +static int exit_status; + +static void success(const char *msg) +{ + printf(" \e[32m%s\e[0m\n", msg); +} + +static void fail(const char *msg) +{ + printf(" \e[31m%s\e[0m\n", msg); + exit_status++; +} + +static int read_file(const char *path, char *buf, size_t buflen) +{ + int fd; + ssize_t numread; + + fd = open(path, O_RDONLY); + if (fd == -1) + return 0; + + numread = read(fd, buf, buflen - 1); + if (numread < 1) { + close(fd); + return 0; + } + + buf[numread] = '\0'; + close(fd); + + return (unsigned int) numread; +} + +static int write_file(const char *path, const char *buf, size_t buflen) +{ + int fd; + ssize_t numwritten; + + fd = open(path, O_WRONLY); + if (fd == -1) + return 0; + + numwritten = write(fd, buf, buflen - 1); + close(fd); + if (numwritten < 1) + return 0; + + return (unsigned int) numwritten; +} + +static int read_string(const char *name, const char *strings[]) +{ + char path[PATH_MAX]; + char buf[256]; + char *c; + int ret; + + ret = snprintf(path, PATH_MAX, THP_SYSFS "%s", name); + if (ret >= PATH_MAX) { + printf("%s: Pathname is too long\n", __func__); + exit(EXIT_FAILURE); + } + + if (!read_file(path, buf, sizeof(buf))) { + perror(path); + exit(EXIT_FAILURE); + } + + c = strchr(buf, '['); + if (!c) { + printf("%s: Parse failure\n", __func__); + exit(EXIT_FAILURE); + } + + c++; + memmove(buf, c, sizeof(buf) - (c - buf)); + + c = strchr(buf, ']'); + if (!c) { + printf("%s: Parse failure\n", __func__); + exit(EXIT_FAILURE); + } + *c = '\0'; + + ret = 0; + while (strings[ret]) { + if (!strcmp(strings[ret], buf)) + return ret; + ret++; + } + + printf("Failed to parse %s\n", name); + exit(EXIT_FAILURE); +} + +static void write_string(const char *name, const char *val) +{ + char path[PATH_MAX]; + int ret; + + ret = snprintf(path, PATH_MAX, THP_SYSFS "%s", name); + if (ret >= PATH_MAX) { + printf("%s: Pathname is too long\n", __func__); + exit(EXIT_FAILURE); + } + + if (!write_file(path, val, strlen(val) + 1)) { + perror(path); + exit(EXIT_FAILURE); + } +} + +static const unsigned long read_num(const char *name) +{ + char path[PATH_MAX]; + char buf[21]; + int ret; + + ret = snprintf(path, PATH_MAX, THP_SYSFS "%s", name); + if (ret >= PATH_MAX) { + printf("%s: Pathname is too long\n", __func__); + exit(EXIT_FAILURE); + } + + ret = read_file(path, buf, sizeof(buf)); + if (ret < 0) { + perror("read_file(read_num)"); + exit(EXIT_FAILURE); + } + + return strtoul(buf, NULL, 10); +} + +static void write_num(const char *name, unsigned long num) +{ + char path[PATH_MAX]; + char buf[21]; + int ret; + + ret = snprintf(path, PATH_MAX, THP_SYSFS "%s", name); + if (ret >= PATH_MAX) { + printf("%s: Pathname is too long\n", __func__); + exit(EXIT_FAILURE); + } + + sprintf(buf, "%ld", num); + if (!write_file(path, buf, strlen(buf) + 1)) { + perror(path); + exit(EXIT_FAILURE); + } +} + +static void write_settings(struct settings *settings) +{ + struct khugepaged_settings *khugepaged = &settings->khugepaged; + + write_string("enabled", thp_enabled_strings[settings->thp_enabled]); + write_string("defrag", thp_defrag_strings[settings->thp_defrag]); + write_string("shmem_enabled", + shmem_enabled_strings[settings->shmem_enabled]); + write_num("debug_cow", settings->debug_cow); + write_num("use_zero_page", settings->use_zero_page); + + write_num("khugepaged/defrag", khugepaged->defrag); + write_num("khugepaged/alloc_sleep_millisecs", + khugepaged->alloc_sleep_millisecs); + write_num("khugepaged/scan_sleep_millisecs", + khugepaged->scan_sleep_millisecs); + 
write_num("khugepaged/max_ptes_none", khugepaged->max_ptes_none); + write_num("khugepaged/max_ptes_swap", khugepaged->max_ptes_swap); + write_num("khugepaged/pages_to_scan", khugepaged->pages_to_scan); +} + +static void restore_settings(int sig) +{ + if (skip_settings_restore) + goto out; + + printf("Restore THP and khugepaged settings..."); + write_settings(&saved_settings); + success("OK"); + if (sig) + exit(EXIT_FAILURE); +out: + exit(exit_status); +} + +static void save_settings(void) +{ + printf("Save THP and khugepaged settings..."); + saved_settings = (struct settings) { + .thp_enabled = read_string("enabled", thp_enabled_strings), + .thp_defrag = read_string("defrag", thp_defrag_strings), + .shmem_enabled = + read_string("shmem_enabled", shmem_enabled_strings), + .debug_cow = read_num("debug_cow"), + .use_zero_page = read_num("use_zero_page"), + }; + saved_settings.khugepaged = (struct khugepaged_settings) { + .defrag = read_num("khugepaged/defrag"), + .alloc_sleep_millisecs = + read_num("khugepaged/alloc_sleep_millisecs"), + .scan_sleep_millisecs = + read_num("khugepaged/scan_sleep_millisecs"), + .max_ptes_none = read_num("khugepaged/max_ptes_none"), + .max_ptes_swap = read_num("khugepaged/max_ptes_swap"), + .pages_to_scan = read_num("khugepaged/pages_to_scan"), + }; + success("OK"); + + signal(SIGTERM, restore_settings); + signal(SIGINT, restore_settings); + signal(SIGHUP, restore_settings); + signal(SIGQUIT, restore_settings); +} + +static void adjust_settings(void) +{ + + printf("Adjust settings..."); + write_settings(&default_settings); + success("OK"); +} + +#define CHECK_HUGE_FMT "sed -ne " \ + "'/^%lx/,/^AnonHugePages/{/^AnonHugePages:\\s*%ld kB/ q1}' " \ + "/proc/%d/smaps" + +static bool check_huge(void *p) +{ + char *cmd; + int ret; + + ret = asprintf(&cmd, CHECK_HUGE_FMT, + (unsigned long)p, hpage_pmd_size >> 10, getpid()); + if (ret < 0) { + perror("asprintf(CHECK_FMT)"); + exit(EXIT_FAILURE); + } + + ret = system(cmd); + free(cmd); + if (ret < 0 || !WIFEXITED(ret)) { + perror("system(check_huge)"); + exit(EXIT_FAILURE); + } + + return WEXITSTATUS(ret); +} + +#define CHECK_SWAP_FMT "sed -ne " \ + "'/^%lx/,/^Swap:/{/^Swap:\\s*%ld kB/ q1}' " \ + "/proc/%d/smaps" + +static bool check_swap(void *p, unsigned long size) +{ + char *cmd; + int ret; + + ret = asprintf(&cmd, CHECK_SWAP_FMT, + (unsigned long)p, size >> 10, getpid()); + if (ret < 0) { + perror("asprintf(CHECK_SWAP)"); + exit(EXIT_FAILURE); + } + + ret = system(cmd); + free(cmd); + if (ret < 0 || !WIFEXITED(ret)) { + perror("system(check_swap)"); + exit(EXIT_FAILURE); + } + + return WEXITSTATUS(ret); +} + +static void *alloc_mapping(void) +{ + void *p; + + p = mmap(BASE_ADDR, hpage_pmd_size, PROT_READ | PROT_WRITE, + MAP_ANONYMOUS | MAP_PRIVATE, -1, 0); + if (p != BASE_ADDR) { + printf("Failed to allocate VMA at %p\n", BASE_ADDR); + exit(EXIT_FAILURE); + } + + return p; +} + +static void fill_memory(int *p, unsigned long start, unsigned long end) +{ + int i; + + for (i = start / page_size; i < end / page_size; i++) + p[i * page_size / sizeof(*p)] = i + 0xdead0000; +} + +static void validate_memory(int *p, unsigned long start, unsigned long end) +{ + int i; + + for (i = start / page_size; i < end / page_size; i++) { + if (p[i * page_size / sizeof(*p)] != i + 0xdead0000) { + printf("Page %d is corrupted: %#x\n", + i, p[i * page_size / sizeof(*p)]); + exit(EXIT_FAILURE); + } + } +} + +#define TICK 500000 +static bool wait_for_scan(const char *msg, char *p) +{ + int full_scans; + int timeout = 6; /* 3 seconds */ + + /* 
Sanity check */ + if (check_huge(p)) { + printf("Unexpected huge page\n"); + exit(EXIT_FAILURE); + } + + madvise(p, hpage_pmd_size, MADV_HUGEPAGE); + + /* Wait until the second full_scan completed */ + full_scans = read_num("khugepaged/full_scans") + 2; + + printf("%s...", msg); + while (timeout--) { + if (check_huge(p)) + break; + if (read_num("khugepaged/full_scans") >= full_scans) + break; + printf("."); + usleep(TICK); + } + + madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE); + + return !timeout; +} + +static void alloc_at_fault(void) +{ + struct settings settings = default_settings; + char *p; + + settings.thp_enabled = THP_ALWAYS; + write_settings(&settings); + + p = alloc_mapping(); + *p = 1; + printf("Allocate huge page on fault..."); + if (check_huge(p)) + success("OK"); + else + fail("Fail"); + + write_settings(&default_settings); + + madvise(p, page_size, MADV_DONTNEED); + printf("Split huge PMD on MADV_DONTNEED..."); + if (!check_huge(p)) + success("OK"); + else + fail("Fail"); + munmap(p, hpage_pmd_size); +} + +static void collapse_full(void) +{ + void *p; + + p = alloc_mapping(); + fill_memory(p, 0, hpage_pmd_size); + if (wait_for_scan("Collapse fully populated PTE table", p)) + fail("Timeout"); + else if (check_huge(p)) + success("OK"); + else + fail("Fail"); + validate_memory(p, 0, hpage_pmd_size); + munmap(p, hpage_pmd_size); +} + +static void collapse_empty(void) +{ + void *p; + + p = alloc_mapping(); + if (wait_for_scan("Do not collapse empty PTE table", p)) + fail("Timeout"); + else if (check_huge(p)) + fail("Fail"); + else + success("OK"); + munmap(p, hpage_pmd_size); +} + +static void collapse_single_pte_entry(void) +{ + void *p; + + p = alloc_mapping(); + fill_memory(p, 0, page_size); + if (wait_for_scan("Collapse PTE table with single PTE entry present", p)) + fail("Timeout"); + else if (check_huge(p)) + success("OK"); + else + fail("Fail"); + validate_memory(p, 0, page_size); + munmap(p, hpage_pmd_size); +} + +static void collapse_max_ptes_none(void) +{ + int max_ptes_none = hpage_pmd_nr / 2; + struct settings settings = default_settings; + void *p; + + settings.khugepaged.max_ptes_none = max_ptes_none; + write_settings(&settings); + + p = alloc_mapping(); + + fill_memory(p, 0, (hpage_pmd_nr - max_ptes_none - 1) * page_size); + if (wait_for_scan("Do not collapse with max_ptes_none exeeded", p)) + fail("Timeout"); + else if (check_huge(p)) + fail("Fail"); + else + success("OK"); + validate_memory(p, 0, (hpage_pmd_nr - max_ptes_none - 1) * page_size); + + fill_memory(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size); + if (wait_for_scan("Collapse with max_ptes_none PTEs empty", p)) + fail("Timeout"); + else if (check_huge(p)) + success("OK"); + else + fail("Fail"); + validate_memory(p, 0, (hpage_pmd_nr - max_ptes_none) * page_size); + + munmap(p, hpage_pmd_size); + write_settings(&default_settings); +} + +static void collapse_swapin_single_pte(void) +{ + void *p; + p = alloc_mapping(); + fill_memory(p, 0, hpage_pmd_size); + + printf("Swapout one page..."); + if (madvise(p, page_size, MADV_PAGEOUT)) { + perror("madvise(MADV_PAGEOUT)"); + exit(EXIT_FAILURE); + } + if (check_swap(p, page_size)) { + success("OK"); + } else { + fail("Fail"); + goto out; + } + + if (wait_for_scan("Collapse with swaping in single PTE entry", p)) + fail("Timeout"); + else if (check_huge(p)) + success("OK"); + else + fail("Fail"); + validate_memory(p, 0, hpage_pmd_size); +out: + munmap(p, hpage_pmd_size); +} + +static void collapse_max_ptes_swap(void) +{ + int max_ptes_swap = 
read_num("khugepaged/max_ptes_swap"); + void *p; + + p = alloc_mapping(); + + fill_memory(p, 0, hpage_pmd_size); + printf("Swapout %d of %d pages...", max_ptes_swap + 1, hpage_pmd_nr); + if (madvise(p, (max_ptes_swap + 1) * page_size, MADV_PAGEOUT)) { + perror("madvise(MADV_PAGEOUT)"); + exit(EXIT_FAILURE); + } + if (check_swap(p, (max_ptes_swap + 1) * page_size)) { + success("OK"); + } else { + fail("Fail"); + goto out; + } + + if (wait_for_scan("Do not collapse with max_ptes_swap exeeded", p)) + fail("Timeout"); + else if (check_huge(p)) + fail("Fail"); + else + success("OK"); + validate_memory(p, 0, hpage_pmd_size); + + fill_memory(p, 0, hpage_pmd_size); + printf("Swapout %d of %d pages...", max_ptes_swap, hpage_pmd_nr); + if (madvise(p, max_ptes_swap * page_size, MADV_PAGEOUT)) { + perror("madvise(MADV_PAGEOUT)"); + exit(EXIT_FAILURE); + } + if (check_swap(p, max_ptes_swap * page_size)) { + success("OK"); + } else { + fail("Fail"); + goto out; + } + + if (wait_for_scan("Collapse with max_ptes_swap pages swapped out", p)) + fail("Timeout"); + else if (check_huge(p)) + success("OK"); + else + fail("Fail"); + validate_memory(p, 0, hpage_pmd_size); +out: + munmap(p, hpage_pmd_size); +} + +static void collapse_single_pte_entry_compound(void) +{ + void *p; + + p = alloc_mapping(); + + printf("Allocate huge page..."); + madvise(p, hpage_pmd_size, MADV_HUGEPAGE); + fill_memory(p, 0, hpage_pmd_size); + if (check_huge(p)) + success("OK"); + else + fail("Fail"); + madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE); + + printf("Split huge page leaving single PTE mapping compount page..."); + madvise(p + page_size, hpage_pmd_size - page_size, MADV_DONTNEED); + if (!check_huge(p)) + success("OK"); + else + fail("Fail"); + + if (wait_for_scan("Collapse PTE table with single PTE mapping compount page", p)) + fail("Timeout"); + else if (check_huge(p)) + success("OK"); + else + fail("Fail"); + validate_memory(p, 0, page_size); + munmap(p, hpage_pmd_size); +} + +static void collapse_full_of_compound(void) +{ + void *p; + + p = alloc_mapping(); + + printf("Allocate huge page..."); + madvise(p, hpage_pmd_size, MADV_HUGEPAGE); + fill_memory(p, 0, hpage_pmd_size); + if (check_huge(p)) + success("OK"); + else + fail("Fail"); + + printf("Split huge page leaving single PTE page table full of compount pages..."); + madvise(p, page_size, MADV_NOHUGEPAGE); + madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE); + if (!check_huge(p)) + success("OK"); + else + fail("Fail"); + + if (wait_for_scan("Collapse PTE table full of compound pages", p)) + fail("Timeout"); + else if (check_huge(p)) + success("OK"); + else + fail("Fail"); + validate_memory(p, 0, hpage_pmd_size); + munmap(p, hpage_pmd_size); +} + +static void collapse_compound_extreme(void) +{ + void *p; + int i; + + p = alloc_mapping(); + for (i = 0; i < hpage_pmd_nr; i++) { + printf("\rConstruct PTE page table full of different PTE-mapped compound pages %3d/%d...", + i + 1, hpage_pmd_nr); + + madvise(BASE_ADDR, hpage_pmd_size, MADV_HUGEPAGE); + fill_memory(BASE_ADDR, 0, hpage_pmd_size); + if (!check_huge(BASE_ADDR)) { + printf("Failed to allocate huge page\n"); + exit(EXIT_FAILURE); + } + madvise(BASE_ADDR, hpage_pmd_size, MADV_NOHUGEPAGE); + + p = mremap(BASE_ADDR - i * page_size, + i * page_size + hpage_pmd_size, + (i + 1) * page_size, + MREMAP_MAYMOVE | MREMAP_FIXED, + BASE_ADDR + 2 * hpage_pmd_size); + if (p == MAP_FAILED) { + perror("mremap+unmap"); + exit(EXIT_FAILURE); + } + + p = mremap(BASE_ADDR + 2 * hpage_pmd_size, + (i + 1) * page_size, + (i + 1) * page_size + 
hpage_pmd_size, + MREMAP_MAYMOVE | MREMAP_FIXED, + BASE_ADDR - (i + 1) * page_size); + if (p == MAP_FAILED) { + perror("mremap+alloc"); + exit(EXIT_FAILURE); + } + } + + munmap(BASE_ADDR, hpage_pmd_size); + fill_memory(p, 0, hpage_pmd_size); + if (!check_huge(p)) + success("OK"); + else + fail("Fail"); + + if (wait_for_scan("Collapse PTE table full of different compound pages", p)) + fail("Timeout"); + else if (check_huge(p)) + success("OK"); + else + fail("Fail"); + + validate_memory(p, 0, hpage_pmd_size); + munmap(p, hpage_pmd_size); +} + +static void collapse_fork(void) +{ + int wstatus; + void *p; + + p = alloc_mapping(); + + printf("Allocate small page..."); + fill_memory(p, 0, page_size); + if (!check_huge(p)) + success("OK"); + else + fail("Fail"); + + printf("Share small page over fork()..."); + if (!fork()) { + /* Do not touch settings on child exit */ + skip_settings_restore = true; + exit_status = 0; + + if (!check_huge(p)) + success("OK"); + else + fail("Fail"); + + fill_memory(p, page_size, 2 * page_size); + + if (wait_for_scan("Collapse PTE table with single page shared with parent process", p)) + fail("Timeout"); + else if (check_huge(p)) + success("OK"); + else + fail("Fail"); + + validate_memory(p, 0, page_size); + munmap(p, hpage_pmd_size); + exit(exit_status); + } + + wait(&wstatus); + exit_status += WEXITSTATUS(wstatus); + + printf("Check if parent still has small page..."); + if (!check_huge(p)) + success("OK"); + else + fail("Fail"); + validate_memory(p, 0, page_size); + munmap(p, hpage_pmd_size); +} + +static void collapse_fork_compound(void) +{ + int wstatus; + void *p; + + p = alloc_mapping(); + + printf("Allocate huge page..."); + madvise(p, hpage_pmd_size, MADV_HUGEPAGE); + fill_memory(p, 0, hpage_pmd_size); + if (check_huge(p)) + success("OK"); + else + fail("Fail"); + + printf("Share huge page over fork()..."); + if (!fork()) { + /* Do not touch settings on child exit */ + skip_settings_restore = true; + exit_status = 0; + + if (check_huge(p)) + success("OK"); + else + fail("Fail"); + + printf("Split huge page PMD in child process..."); + madvise(p, page_size, MADV_NOHUGEPAGE); + madvise(p, hpage_pmd_size, MADV_NOHUGEPAGE); + if (!check_huge(p)) + success("OK"); + else + fail("Fail"); + fill_memory(p, 0, page_size); + + if (wait_for_scan("Collapse PTE table full of compound pages in child", p)) + fail("Timeout"); + else if (check_huge(p)) + success("OK"); + else + fail("Fail"); + + validate_memory(p, 0, hpage_pmd_size); + munmap(p, hpage_pmd_size); + exit(exit_status); + } + + wait(&wstatus); + exit_status += WEXITSTATUS(wstatus); + + printf("Check if parent still has huge page..."); + if (check_huge(p)) + success("OK"); + else + fail("Fail"); + validate_memory(p, 0, hpage_pmd_size); + munmap(p, hpage_pmd_size); +} + +int main(void) +{ + setbuf(stdout, NULL); + + page_size = getpagesize(); + hpage_pmd_size = read_num("hpage_pmd_size"); + hpage_pmd_nr = hpage_pmd_size / page_size; + + default_settings.khugepaged.max_ptes_none = hpage_pmd_nr - 1; + default_settings.khugepaged.max_ptes_swap = hpage_pmd_nr / 8; + default_settings.khugepaged.pages_to_scan = hpage_pmd_nr * 8; + + save_settings(); + adjust_settings(); + + alloc_at_fault(); + collapse_full(); + collapse_empty(); + collapse_single_pte_entry(); + collapse_max_ptes_none(); + collapse_swapin_single_pte(); + collapse_max_ptes_swap(); + collapse_single_pte_entry_compound(); + collapse_full_of_compound(); + collapse_compound_extreme(); + collapse_fork(); + collapse_fork_compound(); + + restore_settings(0); +} 
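The selftest above drives khugepaged entirely through the THP sysfs knobs, madvise() and /proc/PID/smaps. For readers who want the core pattern outside the harness, the following is a minimal standalone sketch of what check_huge() plus MADV_HUGEPAGE amount to. It is illustrative only: the 2M huge page size, the fixed mapping hint and the 10-second wait are assumptions; the selftest instead reads hpage_pmd_size from sysfs, polls khugepaged/full_scans and parses smaps with sed.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	/* Assumptions for illustration: 2M PMD-sized THP and a 2M-aligned hint.
	 * If the kernel ignores the hint, the range may not cover an aligned
	 * PMD and khugepaged will never collapse it. */
	size_t hpage = 2UL << 20;
	char *p = mmap((void *)(1UL << 30), hpage, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	madvise(p, hpage, MADV_HUGEPAGE);	/* opt in (works with THP "madvise" mode) */
	memset(p, 1, hpage);			/* populate every base page in the range */
	sleep(10);				/* crude: give khugepaged time to collapse */

	/* Find our VMA in smaps and print its AnonHugePages counter;
	 * a non-zero value means the range is now backed by a huge page. */
	char prefix[32], line[256];
	snprintf(prefix, sizeof(prefix), "%lx-", (unsigned long)p);
	FILE *f = fopen("/proc/self/smaps", "r");
	int in_vma = 0;
	while (f && fgets(line, sizeof(line), f)) {
		if (!strncmp(line, prefix, strlen(prefix)))
			in_vma = 1;
		else if (in_vma && !strncmp(line, "AnonHugePages:", 14)) {
			printf("%s", line);
			break;
		}
	}
	if (f)
		fclose(f);
	munmap(p, hpage);
	return 0;
}

Run with transparent_hugepage/enabled set to "madvise" or "always"; the selftest automates exactly this kind of check for each collapse scenario and restores the original settings afterwards.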
From patchwork Mon Apr 13 10:04:43 2020
X-Patchwork-Submitter: "Kirill A . Shutemov" X-Patchwork-Id: 11485385
From: "Kirill A. Shutemov" To: akpm@linux-foundation.org, Andrea Arcangeli Cc: Zi Yan , Yang Shi , Ralph Campbell , John Hubbard , William Kucharski , linux-mm@kvack.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov"
Shutemov" Subject: [PATCH 2/6] khugepaged: Do not stop collapse if less than half PTEs are referenced Date: Mon, 13 Apr 2020 13:04:43 +0300 Message-Id: <20200413100447.20073-3-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20200413100447.20073-1-kirill.shutemov@linux.intel.com> References: <20200413100447.20073-1-kirill.shutemov@linux.intel.com> MIME-Version: 1.0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: __collapse_huge_page_swapin() checks the number of referenced PTE to decide if the memory range is hot enough to justify swapin. We have few problems with the approach: - It is way too late: we can do the check much earlier and safe time. khugepaged_scan_pmd() already knows if we have any pages to swap in and number of referenced page. - It stops collapse altogether if there's not enough referenced pages, not only swappingin. Fix it by making the right check early. We also can avoid additional page table scanning if khugepaged_scan_pmd() haven't found any swap entries. Signed-off-by: Kirill A. Shutemov Fixes: 0db501f7a34c ("mm, thp: convert from optimistic swapin collapsing to conservative") --- mm/khugepaged.c | 25 ++++++++++--------------- 1 file changed, 10 insertions(+), 15 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 99bab7e4d05b..5968ec5ddd6b 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -902,11 +902,6 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm, .pgoff = linear_page_index(vma, address), }; - /* we only decide to swapin, if there is enough young ptes */ - if (referenced < HPAGE_PMD_NR/2) { - trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 0); - return false; - } vmf.pte = pte_offset_map(pmd, address); for (; vmf.address < address + HPAGE_PMD_NR*PAGE_SIZE; vmf.pte++, vmf.address += PAGE_SIZE) { @@ -946,7 +941,7 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm, static void collapse_huge_page(struct mm_struct *mm, unsigned long address, struct page **hpage, - int node, int referenced) + int node, int referenced, int unmapped) { pmd_t *pmd, _pmd; pte_t *pte; @@ -1003,7 +998,8 @@ static void collapse_huge_page(struct mm_struct *mm, * If it fails, we release mmap_sem and jump out_nolock. * Continuing to collapse causes inconsistency. 
*/ - if (!__collapse_huge_page_swapin(mm, vma, address, pmd, referenced)) { + if (unmapped && !__collapse_huge_page_swapin(mm, vma, address, + pmd, referenced)) { mem_cgroup_cancel_charge(new_page, memcg, true); up_read(&mm->mmap_sem); goto out_nolock; @@ -1214,22 +1210,21 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, mmu_notifier_test_young(vma->vm_mm, address)) referenced++; } - if (writable) { - if (referenced) { + if (!writable) { + result = SCAN_PAGE_RO; + } else if (!referenced || (unmapped && referenced < HPAGE_PMD_NR/2)) { + result = SCAN_LACK_REFERENCED_PAGE; + } else { result = SCAN_SUCCEED; ret = 1; - } else { - result = SCAN_LACK_REFERENCED_PAGE; - } - } else { - result = SCAN_PAGE_RO; } out_unmap: pte_unmap_unlock(pte, ptl); if (ret) { node = khugepaged_find_target_node(); /* collapse_huge_page will return with the mmap_sem released */ - collapse_huge_page(mm, address, hpage, node, referenced); + collapse_huge_page(mm, address, hpage, node, + referenced, unmapped); } out: trace_mm_khugepaged_scan_pmd(mm, page, writable, referenced,
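The net effect of the reordering is easiest to see as the decision khugepaged_scan_pmd() now makes from its three inputs. The sketch below is an illustrative userspace restatement of that branch (hypothetical helper name, simplified types), not kernel code:

#include <stdbool.h>

enum scan_result { SCAN_SUCCEED, SCAN_PAGE_RO, SCAN_LACK_REFERENCED_PAGE };

/* Mirrors the rewritten branch: collapse proceeds only for writable ranges
 * with at least one referenced PTE, and swap-in is attempted only when at
 * least half of the PTEs are referenced (hpage_pmd_nr is 512 for 2M THP). */
static enum scan_result scan_decision(bool writable, int referenced,
				      int unmapped, int hpage_pmd_nr)
{
	if (!writable)
		return SCAN_PAGE_RO;
	if (!referenced || (unmapped && referenced < hpage_pmd_nr / 2))
		return SCAN_LACK_REFERENCED_PAGE;
	return SCAN_SUCCEED;
}

Compared to the old code, a range with swapped-out entries but few young PTEs is now rejected before any extra page table walk or swap-in work is done, and a fully mapped range no longer needs half of its PTEs to be young.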
From patchwork Mon Apr 13 10:04:44 2020
X-Patchwork-Submitter: "Kirill A . Shutemov" X-Patchwork-Id: 11485383
From: "Kirill A. Shutemov" To: akpm@linux-foundation.org, Andrea Arcangeli Cc: Zi Yan , Yang Shi , Ralph Campbell , John Hubbard , William Kucharski , linux-mm@kvack.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov"
Subject: [PATCH 3/6] khugepaged: Drain all LRU caches before scanning pages
Date: Mon, 13 Apr 2020 13:04:44 +0300
Message-Id: <20200413100447.20073-4-kirill.shutemov@linux.intel.com>
In-Reply-To: <20200413100447.20073-1-kirill.shutemov@linux.intel.com>
References: <20200413100447.20073-1-kirill.shutemov@linux.intel.com>

Having a page in the LRU add cache offsets the page refcount and gives a false negative on PageLRU(), which reduces the collapse success rate. Drain all LRU add caches before scanning. This happens relatively rarely and should not disturb the system too much.

Signed-off-by: Kirill A. Shutemov
---
mm/khugepaged.c | 2 ++ 1 file changed, 2 insertions(+)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 5968ec5ddd6b..ee66c140c2d6 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -2059,6 +2059,8 @@ static void khugepaged_do_scan(void) barrier(); /* write khugepaged_pages_to_scan to local stack */ + lru_add_drain_all(); + while (progress < pages) { if (!khugepaged_prealloc_page(&hpage, &wait)) break;
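The refcount side of the problem can be made concrete with a small worked example. The numbers are hypothetical and the check shown is the one khugepaged uses before patch 5 (page_count == 1 + PageSwapCache); this is plain userspace arithmetic for illustration, not kernel code:

#include <assert.h>

int main(void)
{
	/* A freshly faulted anonymous page, mapped once by the scanned
	 * process, not in the swap cache, still sitting in a per-CPU
	 * LRU add cache that has not been drained yet. */
	int mapped = 1;		/* one PTE maps the page */
	int swapcache = 0;	/* PageSwapCache() is false */
	int pagevec_ref = 1;	/* extra reference held by the LRU add cache */

	int page_count = mapped + swapcache + pagevec_ref;
	int expected = 1 + swapcache;

	/* khugepaged sees 2 != 1 and gives up with SCAN_PAGE_COUNT even
	 * though nothing else pins the page; draining the LRU add caches
	 * (this patch, and patch 4 for the swap-in path) removes the
	 * stray reference. */
	assert(page_count != expected);
	return 0;
}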
Shutemov" X-Patchwork-Id: 11485381 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 0B99013B2 for ; Mon, 13 Apr 2020 10:05:00 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id D6808206E9 for ; Mon, 13 Apr 2020 10:04:59 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org D6808206E9 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.intel.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id CCF778E00F9; Mon, 13 Apr 2020 06:04:56 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id C7D638E00F8; Mon, 13 Apr 2020 06:04:56 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id A81748E00F9; Mon, 13 Apr 2020 06:04:56 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0092.hostedemail.com [216.40.44.92]) by kanga.kvack.org (Postfix) with ESMTP id 91B228E0001 for ; Mon, 13 Apr 2020 06:04:56 -0400 (EDT) Received: from smtpin09.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with ESMTP id 51C4F3ABB for ; Mon, 13 Apr 2020 10:04:56 +0000 (UTC) X-FDA: 76702398192.09.store07_2b03f39509b2e X-Spam-Summary: 1,0,0,,d41d8cd98f00b204,kirill.shutemov@linux.intel.com,,RULES_HIT:30054:30090,0,RBL:134.134.136.20:@linux.intel.com:.lbl8.mailshell.net-64.95.201.95 62.18.0.100,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:neutral,Custom_rules:0:0:0,LFtime:24,LUA_SUMMARY:none X-HE-Tag: store07_2b03f39509b2e X-Filterd-Recvd-Size: 2659 Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by imf18.hostedemail.com (Postfix) with ESMTP for ; Mon, 13 Apr 2020 10:04:55 +0000 (UTC) IronPort-SDR: dnxZa8waIBgtUSy8IVzu1pV+MJ4jhBB5fYYcD409JvaVF1rUu9zCLgvBHwvuy19EMHzS3YYOfW YZSEBd4wToeA== X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga005.jf.intel.com ([10.7.209.41]) by orsmga101.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 Apr 2020 03:04:53 -0700 IronPort-SDR: n8PBSwTUDigaajs2PeudinO86a5TZgFJJ1s8S6JJnX8rkBPQxXXZyCA+YNpQ7WJmwUXQ7sLzSU c9hZYXniHjeA== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.72,378,1580803200"; d="scan'208";a="426640499" Received: from black.fi.intel.com ([10.237.72.28]) by orsmga005.jf.intel.com with ESMTP; 13 Apr 2020 03:04:51 -0700 Received: by black.fi.intel.com (Postfix, from userid 1000) id 9EFFF1B4; Mon, 13 Apr 2020 13:04:50 +0300 (EEST) From: "Kirill A. Shutemov" To: akpm@linux-foundation.org, Andrea Arcangeli Cc: Zi Yan , Yang Shi , Ralph Campbell , John Hubbard , William Kucharski , linux-mm@kvack.org, linux-kernel@vger.kernel.org, "Kirill A. 
Shutemov" Subject: [PATCH 4/6] khugepaged: Drain LRU add pagevec after swapin Date: Mon, 13 Apr 2020 13:04:45 +0300 Message-Id: <20200413100447.20073-5-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20200413100447.20073-1-kirill.shutemov@linux.intel.com> References: <20200413100447.20073-1-kirill.shutemov@linux.intel.com> MIME-Version: 1.0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: __collapse_huge_page_isolate() may fail due to extra pin in the LRU add pagevec. It's pretty common for swapin case: we swap in pages just to fail due to the extra pin. Drain LRU add pagevec on successful swapin. Signed-off-by: Kirill A. Shutemov --- mm/khugepaged.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index ee66c140c2d6..e3e41c2768d8 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -934,6 +934,11 @@ static bool __collapse_huge_page_swapin(struct mm_struct *mm, } vmf.pte--; pte_unmap(vmf.pte); + + /* Drain LRU add pagevec to remove extra pin on the swapped in pages */ + if (swapped_in) + lru_add_drain(); + trace_mm_collapse_huge_page_swapin(mm, swapped_in, referenced, 1); return true; } From patchwork Mon Apr 13 10:04:46 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Kirill A . Shutemov" X-Patchwork-Id: 11485395 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id D582E14DD for ; Mon, 13 Apr 2020 10:05:08 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id A2F1120786 for ; Mon, 13 Apr 2020 10:05:08 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org A2F1120786 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.intel.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id A0EB08E00FC; Mon, 13 Apr 2020 06:04:58 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id 997188E00FA; Mon, 13 Apr 2020 06:04:58 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 85E078E00FC; Mon, 13 Apr 2020 06:04:58 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0225.hostedemail.com [216.40.44.225]) by kanga.kvack.org (Postfix) with ESMTP id 699138E00FA for ; Mon, 13 Apr 2020 06:04:58 -0400 (EDT) Received: from smtpin19.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with ESMTP id 2EE712C06 for ; Mon, 13 Apr 2020 10:04:58 +0000 (UTC) X-FDA: 76702398276.19.steel90_2b4a897a3c632 X-Spam-Summary: 1,0,0,,d41d8cd98f00b204,kirill.shutemov@linux.intel.com,,RULES_HIT:30003:30054:30070,0,RBL:134.134.136.20:@linux.intel.com:.lbl8.mailshell.net-64.95.201.95 62.18.0.100,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:neutral,Custom_rules:0:0:0,LFtime:27,LUA_SUMMARY:none X-HE-Tag: steel90_2b4a897a3c632 X-Filterd-Recvd-Size: 5022 Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by imf18.hostedemail.com (Postfix) with 
From patchwork Mon Apr 13 10:04:46 2020
X-Patchwork-Submitter: "Kirill A . Shutemov" X-Patchwork-Id: 11485395
From: "Kirill A. Shutemov" To: akpm@linux-foundation.org, Andrea Arcangeli Cc: Zi Yan , Yang Shi , Ralph Campbell , John Hubbard , William Kucharski , linux-mm@kvack.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov"
Subject: [PATCH 5/6] khugepaged: Allow to collapse a page shared across fork
Date: Mon, 13 Apr 2020 13:04:46 +0300
Message-Id: <20200413100447.20073-6-kirill.shutemov@linux.intel.com>
In-Reply-To: <20200413100447.20073-1-kirill.shutemov@linux.intel.com>
References: <20200413100447.20073-1-kirill.shutemov@linux.intel.com>

The page can be included into collapse as long as it doesn't have extra pins (from GUP or otherwise). The logic that checks the refcount is moved to a separate function. Note that the function is ready to deal with compound pages; this is preparation for the following patch.

VM_BUG_ON_PAGE() was removed from __collapse_huge_page_copy() as the invariant it checks is no longer valid: the source can be mapped multiple times now.

Signed-off-by: Kirill A. Shutemov
Reviewed-by: William Kucharski
---
mm/khugepaged.c | 41 ++++++++++++++++++++++++++++++----------- 1 file changed, 30 insertions(+), 11 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c index e3e41c2768d8..f9864644c3b7 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -529,6 +529,24 @@ static void release_pte_pages(pte_t *pte, pte_t *_pte) } } +static bool is_refcount_suitable(struct page *page) +{ + int expected_refcount, refcount; + + refcount = page_count(page); + expected_refcount = total_mapcount(page); + if (PageSwapCache(page)) + expected_refcount += compound_nr(page); + + if (IS_ENABLED(CONFIG_DEBUG_VM) && expected_refcount > refcount) { + pr_err("expected_refcount: %d, refcount: %d\n", + expected_refcount, refcount); + dump_page(page, "Unexpected refcount"); + } + + return page_count(page) == expected_refcount; +} + static int __collapse_huge_page_isolate(struct vm_area_struct *vma, unsigned long address, pte_t *pte) @@ -581,11 +599,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma, } /* - * cannot use mapcount: can't collapse if there's a gup pin. - * The page must only be referenced by the scanned process - * and page swap cache. + * Check if the page has any GUP (or other external) pins. + * + * The page table that maps the page has been already unlinked + * from the page table tree and this process cannot get + * an additional pin on the page. + + * New pins can come later if the page is shared across fork, + * but not for this process. It is fine.
The other process + * cannot write to the page, only trigger CoW. */ - if (page_count(page) != 1 + PageSwapCache(page)) { + if (!is_refcount_suitable(page)) { unlock_page(page); result = SCAN_PAGE_COUNT; goto out; @@ -672,7 +696,6 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page, } else { src_page = pte_page(pteval); copy_user_highpage(page, src_page, address, vma); - VM_BUG_ON_PAGE(page_mapcount(src_page) != 1, src_page); release_pte_page(src_page); /* * ptl mostly unnecessary, but preempt has to @@ -1201,12 +1224,8 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, goto out_unmap; } - /* - * cannot use mapcount: can't collapse if there's a gup pin. - * The page must only be referenced by the scanned process - * and page swap cache. - */ - if (page_count(page) != 1 + PageSwapCache(page)) { + /* Check if the page has any GUP (or other external) pins */ + if (!is_refcount_suitable(page)) { result = SCAN_PAGE_COUNT; goto out_unmap; } From patchwork Mon Apr 13 10:04:47 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "Kirill A . Shutemov" X-Patchwork-Id: 11485397 Return-Path: Received: from mail.kernel.org (pdx-korg-mail-1.web.codeaurora.org [172.30.200.123]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id E5B60912 for ; Mon, 13 Apr 2020 10:05:10 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id BC868206E9 for ; Mon, 13 Apr 2020 10:05:10 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org BC868206E9 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.intel.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id D1BF28E00FD; Mon, 13 Apr 2020 06:04:59 -0400 (EDT) Delivered-To: linux-mm-outgoing@kvack.org Received: by kanga.kvack.org (Postfix, from userid 40) id C82028E00FA; Mon, 13 Apr 2020 06:04:59 -0400 (EDT) X-Original-To: int-list-linux-mm@kvack.org X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id A36D08E00FD; Mon, 13 Apr 2020 06:04:59 -0400 (EDT) X-Original-To: linux-mm@kvack.org X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0129.hostedemail.com [216.40.44.129]) by kanga.kvack.org (Postfix) with ESMTP id 83FD48E00FA for ; Mon, 13 Apr 2020 06:04:59 -0400 (EDT) Received: from smtpin12.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id 4094E824556B for ; Mon, 13 Apr 2020 10:04:59 +0000 (UTC) X-FDA: 76702398318.12.nail46_2b7728d8ce61c X-Spam-Summary: 1,0,0,,d41d8cd98f00b204,kirill.shutemov@linux.intel.com,,RULES_HIT:30003:30054:30070,0,RBL:134.134.136.100:@linux.intel.com:.lbl8.mailshell.net-64.95.201.95 62.18.0.100,CacheIP:none,Bayesian:0.5,0.5,0.5,Netcheck:none,DomainCache:0,MSF:not bulk,SPF:fn,MSBL:0,DNSBL:neutral,Custom_rules:0:0:0,LFtime:27,LUA_SUMMARY:none X-HE-Tag: nail46_2b7728d8ce61c X-Filterd-Recvd-Size: 9039 Received: from mga07.intel.com (mga07.intel.com [134.134.136.100]) by imf21.hostedemail.com (Postfix) with ESMTP for ; Mon, 13 Apr 2020 10:04:58 +0000 (UTC) IronPort-SDR: oHL87OfjodmymcY3UcqnBLiw7/AEPRbYU63EM6K1Udr8SGRdm44H4RFfKDDvBaleix9hfSZHmX 4Ze8ivc2KvtA== X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from orsmga002.jf.intel.com ([10.7.209.21]) by orsmga105.jf.intel.com with 
ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 Apr 2020 03:04:57 -0700 IronPort-SDR: sbs5y4txjLbh2ReyQC9+iBTnKjXOFCJ2aXjvfVF7qGn/cQIFSECBqCv74wwxE04W/E4mtAEbJ8 6ydiq5kTEmtw== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.72,378,1580803200"; d="scan'208";a="271027084" Received: from black.fi.intel.com ([10.237.72.28]) by orsmga002.jf.intel.com with ESMTP; 13 Apr 2020 03:04:54 -0700 Received: by black.fi.intel.com (Postfix, from userid 1000) id B36B2236; Mon, 13 Apr 2020 13:04:50 +0300 (EEST) From: "Kirill A. Shutemov" To: akpm@linux-foundation.org, Andrea Arcangeli Cc: Zi Yan , Yang Shi , Ralph Campbell , John Hubbard , William Kucharski , linux-mm@kvack.org, linux-kernel@vger.kernel.org, "Kirill A. Shutemov" Subject: [PATCH 6/6] khugepaged: Allow to collapse PTE-mapped compound pages Date: Mon, 13 Apr 2020 13:04:47 +0300 Message-Id: <20200413100447.20073-7-kirill.shutemov@linux.intel.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: <20200413100447.20073-1-kirill.shutemov@linux.intel.com> References: <20200413100447.20073-1-kirill.shutemov@linux.intel.com> MIME-Version: 1.0 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: We can collapse PTE-mapped compound pages. We only need to avoid handling them more than once: lock/unlock page only once if it's present in the PMD range multiple times as it handled on compound level. The same goes for LRU isolation and putback. Signed-off-by: Kirill A. Shutemov --- mm/khugepaged.c | 99 ++++++++++++++++++++++++++++++++----------------- 1 file changed, 65 insertions(+), 34 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index f9864644c3b7..11d500396d85 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -515,17 +515,30 @@ void __khugepaged_exit(struct mm_struct *mm) static void release_pte_page(struct page *page) { - dec_node_page_state(page, NR_ISOLATED_ANON + page_is_file_cache(page)); + mod_node_page_state(page_pgdat(page), + NR_ISOLATED_ANON + page_is_file_cache(page), + -compound_nr(page)); unlock_page(page); putback_lru_page(page); } -static void release_pte_pages(pte_t *pte, pte_t *_pte) +static void release_pte_pages(pte_t *pte, pte_t *_pte, + struct list_head *compound_pagelist) { + struct page *page, *tmp; + while (--_pte >= pte) { pte_t pteval = *_pte; - if (!pte_none(pteval) && !is_zero_pfn(pte_pfn(pteval))) - release_pte_page(pte_page(pteval)); + + page = pte_page(pteval); + if (!pte_none(pteval) && !is_zero_pfn(pte_pfn(pteval)) && + !PageCompound(page)) + release_pte_page(page); + } + + list_for_each_entry_safe(page, tmp, compound_pagelist, lru) { + list_del(&page->lru); + release_pte_page(page); } } @@ -549,7 +562,8 @@ static bool is_refcount_suitable(struct page *page) static int __collapse_huge_page_isolate(struct vm_area_struct *vma, unsigned long address, - pte_t *pte) + pte_t *pte, + struct list_head *compound_pagelist) { struct page *page = NULL; pte_t *_pte; @@ -579,13 +593,21 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma, goto out; } - /* TODO: teach khugepaged to collapse THP mapped with pte */ + VM_BUG_ON_PAGE(!PageAnon(page), page); + if (PageCompound(page)) { - result = SCAN_PAGE_COMPOUND; - goto out; - } + struct page *p; + page = compound_head(page); - VM_BUG_ON_PAGE(!PageAnon(page), page); + /* + * Check if we have dealt with the compound page + * already + */ + list_for_each_entry(p, compound_pagelist, lru) { + if (page == p) + goto next; + } + } /* * We can do it before 
isolate_lru_page because the @@ -614,19 +636,15 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma, result = SCAN_PAGE_COUNT; goto out; } - if (pte_write(pteval)) { - writable = true; - } else { - if (PageSwapCache(page) && - !reuse_swap_page(page, NULL)) { - unlock_page(page); - result = SCAN_SWAP_CACHE_PAGE; - goto out; - } + if (!pte_write(pteval) && PageSwapCache(page) && + !reuse_swap_page(page, NULL)) { /* - * Page is not in the swap cache. It can be collapsed - * into a THP. + * Page is in the swap cache and cannot be re-used. + * It cannot be collapsed into a THP. */ + unlock_page(page); + result = SCAN_SWAP_CACHE_PAGE; + goto out; } /* @@ -638,16 +656,23 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma, result = SCAN_DEL_PAGE_LRU; goto out; } - inc_node_page_state(page, - NR_ISOLATED_ANON + page_is_file_cache(page)); + mod_node_page_state(page_pgdat(page), + NR_ISOLATED_ANON + page_is_file_cache(page), + compound_nr(page)); VM_BUG_ON_PAGE(!PageLocked(page), page); VM_BUG_ON_PAGE(PageLRU(page), page); + if (PageCompound(page)) + list_add_tail(&page->lru, compound_pagelist); +next: /* There should be enough young pte to collapse the page */ if (pte_young(pteval) || page_is_young(page) || PageReferenced(page) || mmu_notifier_test_young(vma->vm_mm, address)) referenced++; + + if (pte_write(pteval)) + writable = true; } if (likely(writable)) { if (likely(referenced)) { @@ -661,7 +686,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma, } out: - release_pte_pages(pte, _pte); + release_pte_pages(pte, _pte, compound_pagelist); trace_mm_collapse_huge_page_isolate(page, none_or_zero, referenced, writable, result); return 0; @@ -670,13 +695,14 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma, static void __collapse_huge_page_copy(pte_t *pte, struct page *page, struct vm_area_struct *vma, unsigned long address, - spinlock_t *ptl) + spinlock_t *ptl, + struct list_head *compound_pagelist) { + struct page *src_page, *tmp; pte_t *_pte; for (_pte = pte; _pte < pte + HPAGE_PMD_NR; _pte++, page++, address += PAGE_SIZE) { pte_t pteval = *_pte; - struct page *src_page; if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) { clear_user_highpage(page, address); @@ -696,7 +722,8 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page, } else { src_page = pte_page(pteval); copy_user_highpage(page, src_page, address, vma); - release_pte_page(src_page); + if (!PageCompound(src_page)) + release_pte_page(src_page); /* * ptl mostly unnecessary, but preempt has to * be disabled to update the per-cpu stats @@ -713,6 +740,11 @@ static void __collapse_huge_page_copy(pte_t *pte, struct page *page, free_page_and_swap_cache(src_page); } } + + list_for_each_entry_safe(src_page, tmp, compound_pagelist, lru) { + list_del(&src_page->lru); + release_pte_page(src_page); + } } static void khugepaged_alloc_sleep(void) @@ -971,6 +1003,7 @@ static void collapse_huge_page(struct mm_struct *mm, struct page **hpage, int node, int referenced, int unmapped) { + LIST_HEAD(compound_pagelist); pmd_t *pmd, _pmd; pte_t *pte; pgtable_t pgtable; @@ -1071,7 +1104,8 @@ static void collapse_huge_page(struct mm_struct *mm, mmu_notifier_invalidate_range_end(&range); spin_lock(pte_ptl); - isolated = __collapse_huge_page_isolate(vma, address, pte); + isolated = __collapse_huge_page_isolate(vma, address, pte, + &compound_pagelist); spin_unlock(pte_ptl); if (unlikely(!isolated)) { @@ -1096,7 +1130,8 @@ static void collapse_huge_page(struct mm_struct 
*mm, */ anon_vma_unlock_write(vma->anon_vma); - __collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl); + __collapse_huge_page_copy(pte, new_page, vma, address, pte_ptl, + &compound_pagelist); pte_unmap(pte); __SetPageUptodate(new_page); pgtable = pmd_pgtable(_pmd); @@ -1193,11 +1228,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm, goto out_unmap; } - /* TODO: teach khugepaged to collapse THP mapped with pte */ - if (PageCompound(page)) { - result = SCAN_PAGE_COMPOUND; - goto out_unmap; - } + page = compound_head(page); /* * Record which node the original page is from and save this