From patchwork Wed Dec 14 22:51:21 2022
X-Patchwork-Submitter: Yuanchu Xie
X-Patchwork-Id: 13073643
Date: Wed, 14 Dec 2022 14:51:21 -0800
Message-ID: <20221214225123.2770216-1-yuanchu@google.com>
Subject: [RFC PATCH 0/2] mm: multi-gen LRU: working set extensions
From: Yuanchu Xie
To: Johannes Weiner, Michal Hocko, Roman Gushchin, Yu Zhao
Cc: Andrew Morton, Shakeel Butt, Muchun Song,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 cgroups@vger.kernel.org, Yuanchu Xie

Introduce a way of monitoring the working set of a workload, per page
type and per NUMA node, with granularity in minutes. It has page-level
granularity and minimal memory overhead by building on the
Multi-generational LRU framework, which already has most of the
infrastructure and is just missing a useful interface.

MGLRU organizes pages in generations, where an older generation contains
colder pages, and aging promotes the recently used pages into the
youngest generation and creates a new one. The working set size is how
much memory an application needs to keep working, that is, the amount of
"hot" memory that is frequently used. The only missing pieces between
MGLRU generations and working set estimation are a consistent aging
cadence and an interface; we introduce the two additions.

Periodic aging
======
MGLRU aging is currently driven by reclaim, so the amount of time
between generations is non-deterministic. With memcgs being aged
regularly, MGLRU generations become time-based working set information.

- memory.periodic_aging: a new root-level only file in cgroupfs
  Writing to memory.periodic_aging sets the aging interval and opts into
  periodic aging.
- kold: a new kthread that ages memcgs based on the set aging interval.

Page idle age stats
======
- memory.page_idle_age: we group pages into idle age ranges, and present
  the number of pages per node per pagetype in each range. This
  aggregates the time information from MGLRU generations hierarchically.

Use case: proactive reclaimer
======
The proactive reclaimer sets the aging interval and periodically reads
the page idle age stats, forming a working set estimate from which it
calculates an amount to write to memory.reclaim. With the page idle age
stats, a proactive reclaimer could calculate a precise amount of memory
to reclaim without continuously probing and inducing reclaim.

A proactive reclaimer that uses a similar interface is in use in the
Google data centers.

Use case: workload introspection
======
A workload may use the working set estimates to adjust application
behavior as needed, e.g. preemptively killing some of its workers to
avoid thrashing its working set, or dropping caches to fit within a
limit.
The estimates can also be valuable to application developers, who can
benefit from an out-of-the-box overview of the application's memory
usage behavior.

TODO List
======
- selftests
- a userspace demonstrator combining periodic aging, page idle age
  stats, memory.reclaim, and/or PSI

Open questions
======
- The MGLRU aging mechanism has a flag called force_scan. With
  force_scan=false, invoking MGLRU aging when an lruvec already has the
  maximum number of generations does not actually perform aging.
  However, with force_scan=true, MGLRU moves the pages in the oldest
  generation to the second oldest generation. The force_scan=true flag
  also disables some optimizations in MGLRU's page table walks. The
  current patch sets force_scan=true, so that periodic aging would work
  without a proactive reclaimer evicting the oldest generation.

- The page idle age format uses a fixed set of time ranges in seconds.
  I have considered having it be based on the aging interval, or just
  compiling the raw timestamps. With the age ranges based on the aging
  interval, a memcg that's undergoing memcg reclaim might have its
  generations in the 10-second range, and a much longer aging interval
  would obscure this fact. The raw timestamps from MGLRU could lead to a
  very large file when aggregated hierarchically.

Yuanchu Xie (2):
  mm: multi-gen LRU: periodic aging
  mm: multi-gen LRU: cgroup working set stats

 include/linux/kold.h   |  44 ++++++++++
 include/linux/mmzone.h |   4 +-
 mm/Makefile            |   3 +
 mm/kold.c              | 150 ++++++++++++++++++++++++++++++++
 mm/memcontrol.c        | 188 +++++++++++++++++++++++++++++++++++++++++
 mm/vmscan.c            |  35 +++++++-
 6 files changed, 422 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/kold.h
 create mode 100644 mm/kold.c
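
As an illustration of the proactive reclaimer use case described in this
cover letter, here is a minimal sketch of the arithmetic such a reclaimer
might perform once it has read the page idle age stats. The cover letter
does not specify the memory.page_idle_age file format, so the sketch
takes an already-parsed histogram as input. The histogram layout, the
idle threshold, the 4 KiB page size and all function names below are
assumptions made for illustration; none of them come from the patches.

```python
from typing import Dict

PAGE_SIZE = 4096  # assumed page size in bytes; not taken from the patches


def cold_pages(idle_hist: Dict[int, int], idle_threshold_s: int) -> int:
    """Count pages that have been idle for at least idle_threshold_s.

    idle_hist maps the lower bound of an idle age range (in seconds) to
    the number of pages in that range. This layout is hypothetical; the
    real memory.page_idle_age format is defined by the patches.
    """
    return sum(n for lower, n in idle_hist.items() if lower >= idle_threshold_s)


def working_set_pages(idle_hist: Dict[int, int], idle_threshold_s: int) -> int:
    """Estimate the working set as every page younger than the threshold."""
    return sum(idle_hist.values()) - cold_pages(idle_hist, idle_threshold_s)


def reclaim_bytes(idle_hist: Dict[int, int], idle_threshold_s: int,
                  keep_fraction: float = 0.0) -> int:
    """Byte count a reclaimer could write to memory.reclaim.

    keep_fraction is a hypothetical safety margin: the share of cold
    pages deliberately left in place.
    """
    cold = cold_pages(idle_hist, idle_threshold_s)
    return int(cold * (1.0 - keep_fraction)) * PAGE_SIZE


# Example histogram: lower bound of each idle age range in seconds -> pages.
example_hist = {0: 1000, 60: 500, 300: 200, 900: 300}
amount = reclaim_bytes(example_hist, idle_threshold_s=300)
```

A real reclaimer would repeat this once per aging interval and would
likely weigh the result against other signals such as PSI before writing
to memory.reclaim.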