From patchwork Thu Nov 7 20:53:31 2019
X-Patchwork-Submitter: Johannes Weiner <hannes@cmpxchg.org>
X-Patchwork-Id: 11233755
From: Johannes Weiner <hannes@cmpxchg.org>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>,
    Suren Baghdasaryan <surenb@google.com>,
    Shakeel Butt <shakeelb@google.com>,
    Rik van Riel <riel@surriel.com>,
    Michal Hocko <mhocko@suse.com>,
    linux-mm@kvack.org, cgroups@vger.kernel.org,
    linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH 0/3] mm: fix page aging across multiple cgroups
Date: Thu, 7 Nov 2019 12:53:31 -0800
Message-Id: <20191107205334.158354-1-hannes@cmpxchg.org>

When applications are put into unconfigured cgroups for memory
accounting purposes, the cgrouping itself should not change the
behavior of the page reclaim code. We expect the VM to reclaim the
coldest pages in the system. But right now the VM can reclaim hot
pages in one cgroup while there is eligible cold cache in others.
This is because one part of the reclaim algorithm isn't truly cgroup
hierarchy aware: the inactive/active list balancing. That is the part
that is supposed to protect hot cache data from one-off streaming IO.

The recursive cgroup reclaim scheme will scan and rotate the physical
LRU lists of each eligible cgroup at the same rate in a round-robin
fashion, thereby establishing a relative order among the pages of all
those cgroups. However, the inactive/active balancing decisions are
made locally within each cgroup, so when a cgroup is running low on
cold pages, its hot pages will get reclaimed - even when sibling
cgroups have plenty of cold cache eligible in the same reclaim run.

For example:

   [root@ham ~]# head -n1 /proc/meminfo
   MemTotal:        1016336 kB
   [root@ham ~]# ./reclaimtest2.sh
   Establishing 50M active files in cgroup A...
   Hot pages cached: 12800/12800 workingset-a
   Linearly scanning through 18G of file data in cgroup B:
   real    0m4.269s
   user    0m0.051s
   sys     0m4.182s
   Hot pages cached: 134/12800 workingset-a

The streaming IO in B, which doesn't benefit from caching at all,
pushes out most of the workingset in A.

Solution

This series fixes the problem by elevating inactive/active balancing
decisions to the toplevel of the reclaim run. The toplevel is either
the cgroup that hit its limit, or straight-up global reclaim if there
is physical memory pressure. From there, it takes a recursive view of
the cgroup subtree to decide whether page deactivation is necessary.

In the test above, the VM will then recognize that cgroup B has plenty
of eligible cold cache, and that the hot pages in A can be spared:

   [root@ham ~]# ./reclaimtest2.sh
   Establishing 50M active files in cgroup A...
   Hot pages cached: 12800/12800 workingset-a
   Linearly scanning through 18G of file data in cgroup B:
   real    0m4.244s
   user    0m0.064s
   sys     0m4.177s
   Hot pages cached: 12800/12800 workingset-a

Implementation

Whether active pages can be deactivated or not is influenced by two
factors: the inactive list dropping below a minimum size relative to
the active list, and the occurrence of refaults.

This patch series first moves refault detection to the reclaim root,
then enforces the minimum inactive size based on a recursive view of
the cgroup tree's LRUs.

History

Note that this actually never worked correctly in Linux cgroups. In
the past it worked for global reclaim and leaf limit reclaim only (we
used to have two physical LRU linkages per page), but it never worked
for intermediate limit reclaim over multiple leaf cgroups.

We're noticing this now because 1) we're putting everything into
cgroups for accounting, not just the things we want to control, and
2) we're moving away from leaf limits that invoke reclaim on
individual cgroups, toward large tree reclaim that is triggered by
high-level limits or physical memory pressure and shaped by local
protections such as memory.low and memory.min.

Requirements

These changes are based on v5.4-rc6-mmotm-2019-11-05-20-44.

 include/linux/memcontrol.h |   5 +
 include/linux/mmzone.h     |   4 +-
 include/linux/swap.h       |   2 +-
 mm/vmscan.c                | 269 +++++++++++++++++++++++++------------------
 mm/workingset.c            |  72 +++++++++---
 5 files changed, 223 insertions(+), 129 deletions(-)