From: Peter Xu
To: linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: Alistair Popple, Tiberiu Georgescu, ivan.teterevkov@nutanix.com,
    Mike Rapoport, Hugh Dickins, peterx@redhat.com, Matthew Wilcox,
    Andrea Arcangeli, David Hildenbrand, "Kirill A. Shutemov",
    Andrew Morton, Mike Kravetz
Subject: [PATCH RFC 0/4] mm: Enable PM_SWAP for shmem with PTE_MARKER
Date: Fri, 6 Aug 2021 23:25:17 -0400
Message-Id: <20210807032521.7591-1-peterx@redhat.com>

Summary
=======

[Based on v5.14-rc4]

This patchset enables PM_SWAP of pagemap on shmem: userspace will be
able to detect whether a shmem page has been swapped out, just as it
already can for anonymous pages.

The feature is gated on CONFIG_PTE_MARKER_PAGEOUT.  When enabled, it
adds roughly 0.8% overhead to swapping a shmem page back in, so I did
not make it the default yet.  However, IMHO 0.8% is still within an
acceptable range, so it could eventually become the default; comments
are welcome.

There is a previous series by Tiberiu A Georgescu that addresses the
same issue in a different way:

https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@nutanix.com/

That series looks up the page cache for every none pte, and I raised a
concern about the 4x pagemap read degradation it imposes on all shmem
pagemap users.  Unlike that approach, this series has zero overhead on
pagemap reads, because the PM_SWAP information is stored directly in
the zapped PTEs.

Goals
=====

One major goal of this series is to add PM_SWAP support for shmem; the
motivation is as stated by Tiberiu and Ivan in the other thread:

https://lore.kernel.org/lkml/CY4PR0201MB3460E372956C0E1B8D33F904E9E39@CY4PR0201MB3460.namprd02.prod.outlook.com/

In short: for some reason userspace needs to scan the pages in the
background, but that scanning can misguide page reclaim about which
pages are hot and which are cold.  With correct PM_SWAP information,
userspace can correct the reclaim behaviour by first fetching that
information from pagemap and then explicitly calling
madvise(MADV_PAGEOUT).  In their case the pages belong to a guest, but
it can be any shmem page.
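As an illustration of that flow (this is only a sketch, not part of
this series and not the test program linked below): the PM_PRESENT and
PM_SWAP bit positions follow Documentation/admin-guide/mm/pagemap.rst,
while the mapping size and the "page out whatever is still resident"
policy are made up for the example.  Error handling is omitted.

/*
 * Sketch: scan a shmem range via /proc/self/pagemap and ask the
 * kernel to reclaim pages that are not yet swapped out.
 */
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT    21              /* reclaim these pages (Linux >= 5.4) */
#endif

#define PM_SWAP         (1ULL << 62)    /* pagemap: page is swapped out */
#define PM_PRESENT      (1ULL << 63)    /* pagemap: page is present in RAM */

int main(void)
{
    size_t len = 16UL << 20;            /* some shmem region being tracked */
    size_t psize = sysconf(_SC_PAGESIZE);
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    int fd = open("/proc/self/pagemap", O_RDONLY);
    unsigned long i;

    for (i = 0; i < len; i += psize) {
        uint64_t ent;
        off_t off = (((unsigned long)buf + i) / psize) * sizeof(ent);

        if (pread(fd, &ent, sizeof(ent), off) != sizeof(ent))
            break;
        if (!(ent & PM_SWAP)) {
            /*
             * Still resident (or never faulted): page it out proactively.
             * A real scanner would apply its own cold-page heuristics here.
             */
            madvise(buf + i, psize, MADV_PAGEOUT);
        }
    }

    close(fd);
    munmap(buf, len);
    return 0;
}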
Another major goal of this series is to serve as a proof of concept of
the PTE marker idea, which is also the main reason why it is an RFC.
So far, PTE markers can potentially solve three problems that I am
aware of:

  (a) PM_SWAP on shmem
  (b) Userfaultfd-wp on shmem/hugetlbfs
  (c) PM_SOFT_DIRTY lost for shmem over swapping

This series tries to resolve problem (a), which should be the simplest;
ideally it solves the immediate live migration issue raised by Tiberiu
and Ivan about proactively paging out unused guest pages.  Both (a) and
(c) are about performance or statistics.  Scenario (b) needs pte
markers as part of the mechanism to trap writes to uffd-wp protected
regions after the pages have been e.g. swapped out or zapped for any
reason.

The uffd-wp shmem work (still under review on the list, latest v5 [1])
currently uses another solution called "special swap pte".  It works
similarly to PTE markers, as both approaches persist information into
a zapped pte, but people showed concern about that idea, and it was
suggested to use a safer (swp-entry level, not pte level) and
arch-independent approach.  Hopefully PTE markers satisfy these
demands.  Before I rework the uffd-wp series, I would like to know
whether this approach can be accepted upstream.  So besides the swap
part, comments on PTE markers are extremely welcome.

What are PTE Markers?
=====================

PTE markers are special PTEs that work like a "marker" in the everyday
sense.  A marker uses a swap type, IOW it is not a valid/present pte,
so the processor triggers a page fault when it is accessed.  Meanwhile,
the format of the pte is well defined, so it can carry information we
would like to know before/during the page access.

In this specific case, when a shmem page is paged out we install a
marker recording that the page was paged out; when pagemap later reads
that pte, we know it is a swapped-out/very-cold page.  This use case is
not the most obvious one, but it is the simplest.  The uffd-wp use case
is more obvious: wr-protect is per-pte, so it cannot be saved in the
page cache, yet the information must persist across zappings such as
THP split or pageout of shmem pages.

In the future a marker can carry more information, e.g. whether the pte
is wr-protected by userfaultfd, or whether the pte was written in this
mm context for soft-dirty tracking.  On 64-bit systems we have a total
of 58 bits (the swp_offset) to play with.

I am also curious whether it can be extended to other mm areas.
Logically it could also work for non-RAM-based memory outside
shmem/hugetlbfs, e.g. a common file system like ext4 or btrfs.  As long
as there is a need to store some per-pte information across zapping of
the ptes, it may be worth considering.
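To make the marker idea concrete, here is a hypothetical, kernel-style
sketch of one possible encoding.  The names SWP_PTE_MARKER,
make_pte_marker_entry(), is_pte_marker_entry() and pte_marker_get()
are placeholders for illustration, built on the existing swp_entry(),
swp_type() and swp_offset() helpers from <linux/swapops.h>; the real
definitions are in the patches.

/*
 * Hypothetical sketch -- not the code in this series.  A PTE marker is
 * a non-present pte using a dedicated swap type; the swp_offset bits
 * are reused as a small bitmask of marker flags.
 */
#define PTE_MARKER_PAGEOUT      (1UL << 0)      /* shmem page was paged out */
/* future bits could record e.g. uffd-wp or soft-dirty state */

static inline swp_entry_t make_pte_marker_entry(unsigned long marker)
{
        /* SWP_PTE_MARKER would be a newly reserved swap type */
        return swp_entry(SWP_PTE_MARKER, marker);
}

static inline bool is_pte_marker_entry(swp_entry_t entry)
{
        return swp_type(entry) == SWP_PTE_MARKER;
}

static inline unsigned long pte_marker_get(swp_entry_t entry)
{
        return swp_offset(entry);
}

With something along these lines, pagemap would only need to check
is_pte_marker_entry() and the PTE_MARKER_PAGEOUT bit to report PM_SWAP
for an otherwise non-present shmem pte, and the fault path would need
to recognize the marker before falling back to the normal shmem fault.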
Known Issues/Concerns
=====================

About THP
---------

Currently we do not need to worry about THP, because paged-out shmem
pages are split when shrinking.  IOW we only need to consider PTEs, and
the markers are only applied to shmem ptes, never to a pmd or anything
bigger.

About PM_SWAP Accuracy
----------------------

This is not an "accurate" solution for providing the PM_SWAP bit.  Two
examples:

- When processes A and B both map shmem page P somewhere, it can happen
  that only one of the ptes gets marked with the pte marker.  Imagine
  this sequence:

  0. Processes A and B both map shmem page P somewhere
  1. Process A zaps its pte of page P for some reason (e.g. THP split)
  2. The system decides to recycle page P
  3. The system replaces process B's pte (pointing to P) with a PTE marker
  4. The system does _not_ replace process A's pte, because it is
     already a none pte, and it stays a none pte
  5. Only process B's pte carries the PTE marker after P is swapped out

- On fork we do not copy shmem vma ptes, including the pte markers.  So
  even if page P was swapped out, only the parent has the marker
  installed; in the child it will be a none pte if fork() happened
  after the pageout.

Conclusion: just as before, PM_SWAP is best-effort.  But it should work
in 99.99% of cases and should already start solving problems.

About Performance Impact
------------------------

Due to the special PTE marker, the page fault logic needs to understand
this pte and there is some extra logic to handle it.  The overhead is
barely observable, a 0.82% perf drop.  For more information, please see
the test section below, where I wrote a test for it.  If that small
difference really matters, the shmem PM_SWAP support can be disabled
with !CONFIG_PTE_MARKER_PAGEOUT.

Tests
=====

The test case I used is here:

https://github.com/xzpeter/clibs/blob/master/bsd/pagemap.c

Functional test
---------------

Run with !CONFIG_PTE_MARKER_PAGEOUT, we miss PM_SWAP when a page is
paged out (the swap bit stays zero):

FAULT1  (expect swap==0): present bit 1, swap bit 0
PAGEOUT (expect swap==1): present bit 0, swap bit 0
FAULT2  (expect swap==0): present bit 1, swap bit 0
REMOVE  (expect swap==0): present bit 0, swap bit 0
PAGEOUT (expect swap==1): present bit 0, swap bit 0
REMOVE  (expect swap==0): present bit 0, swap bit 0

Run with CONFIG_PTE_MARKER_PAGEOUT, we observe the correct PM_SWAP:

FAULT1  (expect swap==0): present bit 1, swap bit 0
PAGEOUT (expect swap==1): present bit 0, swap bit 1
FAULT2  (expect swap==0): present bit 1, swap bit 0
REMOVE  (expect swap==0): present bit 0, swap bit 0
PAGEOUT (expect swap==1): present bit 0, swap bit 1
REMOVE  (expect swap==0): present bit 0, swap bit 0

Performance test
----------------

The performance test is not about pagemap reading, because that should
be the same as before.  The extra overhead is in the fault path, when
the page is swapped in from disk.  I ran sequential swap-in tests over
a 1GB range (5 times each, in a loop) to measure the difference.

Hardware I used:

  Processor: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
  Memory:    32GB memory, 16GB swap (on a PERC H330 Mini 2TB disk)
  Test size: 1GB shmem

I only measured the time to fault the pages back in from disk, so the
measurement does not include pageout time; one can refer to the .c file
above.  A simplified sketch of such a fault-in timing loop follows.
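This sketch is for illustration only; it is not the linked pagemap.c,
the pageout step is elided, and the 1GB size and page-by-page touch are
assumptions about the methodology rather than the actual test code.

#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

#define TEST_SIZE       (1UL << 30)     /* 1GB, matching the test above */

static uint64_t now_us(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

int main(void)
{
        /* Shmem-backed mapping; error handling omitted for brevity. */
        volatile char *buf = mmap(NULL, TEST_SIZE, PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        long psize = sysconf(_SC_PAGESIZE);
        uint64_t start, end;
        unsigned long i;

        /* ... write to the range and page it out (e.g. MADV_PAGEOUT) here ... */

        start = now_us();
        for (i = 0; i < TEST_SIZE; i += psize)
                (void)buf[i];           /* sequential swap-in faults */
        end = now_us();

        printf("swap-in of %lu bytes took %lu us\n",
               TEST_SIZE, (unsigned long)(end - start));
        return 0;
}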
Results:

|-----------------------------------+------------------+------------|
| Config                            | Time used (us)   | Change (%) |
|-----------------------------------+------------------+------------|
| !PTE_MARKER                       | 519652 (+-0.73%) | N/A        |
| PTE_MARKER && !PTE_MARKER_PAGEOUT | 519874 (+-0.40%) | -0.04%     |
| PTE_MARKER && PTE_MARKER_PAGEOUT  | 523914 (+-0.71%) | -0.82%     |
|-----------------------------------+------------------+------------|

Any comment would be greatly welcomed.

[1] https://lore.kernel.org/lkml/20210715201422.211004-1-peterx@redhat.com/

Peter Xu (4):
  mm: Introduce PTE_MARKER swap entry
  mm: Check against orig_pte for finish_fault()
  mm: Handle PTE_MARKER page faults
  mm: Install marker pte when page out for shmem pages

 fs/proc/task_mmu.c      |  1 +
 include/linux/rmap.h    |  1 +
 include/linux/swap.h    | 14 ++++++++++++-
 include/linux/swapops.h | 45 +++++++++++++++++++++++++++++++++++++++++
 mm/Kconfig              | 17 ++++++++++++++++
 mm/memory.c             | 43 ++++++++++++++++++++++++++++++++++++++-
 mm/rmap.c               | 19 +++++++++++++++++
 mm/vmscan.c             |  2 +-
 8 files changed, 139 insertions(+), 3 deletions(-)