From patchwork Mon Aug 20 03:26:40 2018
From: Andrea Arcangeli
To: Andrew Morton
Cc: linux-mm@kvack.org, Alex Williamson, David Rientjes, Vlastimil Babka
Subject: [PATCH 1/1] mm: thp: fix transparent_hugepage/defrag = madvise || always
Date: Sun, 19 Aug 2018 23:26:40 -0400
Message-Id: <20180820032640.9896-2-aarcange@redhat.com>
In-Reply-To: <20180820032640.9896-1-aarcange@redhat.com>
References: <20180820032204.9591-3-aarcange@redhat.com>
 <20180820032640.9896-1-aarcange@redhat.com>

qemu uses MADV_HUGEPAGE, which allows direct compaction (i.e.
__GFP_DIRECT_RECLAIM is set). The problem is that direct compaction
combined with the NUMA __GFP_THISNODE logic in mempolicy.c is telling
reclaim to swap the local node very hard, instead of failing the
allocation if there's no THP available in the local node. Such logic
was ok until __GFP_THISNODE was added to the THP allocation path even
with MPOL_DEFAULT.
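For reference, the qemu-like usage that hits this code path boils
down to an anonymous mapping marked MADV_HUGEPAGE and then faulted in
by touching it; every THP fault in that range then allocates with
__GFP_DIRECT_RECLAIM set. This is only an illustrative sketch (not
taken from qemu); the 34 GB size is an assumption chosen to exceed
one host NUMA node:

	/* Illustrative qemu-like guest RAM allocation pattern.
	 * The size is an assumption, not a qemu default. */
	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 34UL << 30;	/* larger than one NUMA node */
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;
		/* MADV_HUGEPAGE makes the THP faults below use direct
		 * compaction/reclaim, i.e. __GFP_DIRECT_RECLAIM is set */
		madvise(p, len, MADV_HUGEPAGE);
		memset(p, 0, len);		/* fault the whole range in */
		return 0;
	}

A plain memhog run reproduces the same problem without madvise() when
transparent_hugepage/defrag is set to "always", as noted below.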
The idea behind the __GFP_THISNODE addition is that it is better to
provide local memory in PAGE_SIZE units than to use remote NUMA
THP-backed memory. That largely depends on the remote latency though:
on Threadrippers, for example, the overhead is relatively low in my
experience.

The combination of __GFP_THISNODE and __GFP_DIRECT_RECLAIM results in
extremely slow qemu startup with vfio, if the VM is larger than the
size of one host NUMA node. This is because it will try very hard,
and unsuccessfully, to swap out the get_user_pages pinned pages as a
result of __GFP_THISNODE being set, instead of falling back to
PAGE_SIZE allocations and instead of trying to allocate THP on other
nodes (it would be even worse without the vfio type1 GUP pins of
course, except it'd be swapping heavily instead).

It's very easy to reproduce this by setting
transparent_hugepage/defrag to "always", even with a simple memhog.

1) This can be fixed by retaining the __GFP_THISNODE logic also for
   __GFP_DIRECT_RECLAIM by allowing only one compaction run. Not even
   COMPACT_SKIPPED (i.e. compaction failing because there is not
   enough free memory in the zone) should be allowed to invoke
   reclaim.

2) An alternative is to not use __GFP_THISNODE if
   __GFP_DIRECT_RECLAIM has been set by the caller (i.e.
   MADV_HUGEPAGE or defrag="always"). That would keep the NUMA
   locality restriction only when __GFP_DIRECT_RECLAIM is not set by
   the caller. So THP will be provided from remote nodes if available
   before falling back to PAGE_SIZE units in the local node, but an
   app using defrag = always (or madvise with MADV_HUGEPAGE)
   supposedly prefers that (see the placement-check sketch below).

These are the results of 1) (higher GB/s is better).

Finished: 30 GB mapped, 10.188535s elapsed, 2.94GB/s
Finished: 34 GB mapped, 12.274777s elapsed, 2.77GB/s
Finished: 38 GB mapped, 13.847840s elapsed, 2.74GB/s
Finished: 42 GB mapped, 14.288587s elapsed, 2.94GB/s

Finished: 30 GB mapped, 8.907367s elapsed, 3.37GB/s
Finished: 34 GB mapped, 10.724797s elapsed, 3.17GB/s
Finished: 38 GB mapped, 14.272882s elapsed, 2.66GB/s
Finished: 42 GB mapped, 13.929525s elapsed, 3.02GB/s

These are the results of 2) (higher GB/s is better).

Finished: 30 GB mapped, 10.163159s elapsed, 2.95GB/s
Finished: 34 GB mapped, 11.806526s elapsed, 2.88GB/s
Finished: 38 GB mapped, 10.369081s elapsed, 3.66GB/s
Finished: 42 GB mapped, 12.357719s elapsed, 3.40GB/s

Finished: 30 GB mapped, 8.251396s elapsed, 3.64GB/s
Finished: 34 GB mapped, 12.093030s elapsed, 2.81GB/s
Finished: 38 GB mapped, 11.824903s elapsed, 3.21GB/s
Finished: 42 GB mapped, 15.950661s elapsed, 2.63GB/s

This is current upstream (higher GB/s is better).

Finished: 30 GB mapped, 8.821632s elapsed, 3.40GB/s
Finished: 34 GB mapped, 341.979543s elapsed, 0.10GB/s
Finished: 38 GB mapped, 761.933231s elapsed, 0.05GB/s
Finished: 42 GB mapped, 1188.409235s elapsed, 0.04GB/s

vfio is a good test because, by pinning all memory, it avoids the
swapping and reclaim only wastes CPU; a memhog based test would
create swapout storms and supposedly show a bigger stddev.

Which of 1) and 2) is better depends on the hardware and on the
software. Virtualization with EPT/NPT gets a bigger boost from THP
than host applications do.

This commit implements 2).
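As a rough way to observe the behavioral difference of 2) from
userspace (THP backed by remote nodes instead of the local node being
swapped), the node backing each sampled page of a mapping can be
queried with move_pages(2) by passing a NULL "nodes" argument. This
is only a minimal sketch, not part of the patch; it assumes libnuma's
<numaif.h> wrapper, an arbitrary 4 GB mapping and a 2 MB sampling
step:

	/* Illustrative placement check, not part of the patch.
	 * Build with: gcc probe.c -lnuma  (file name is arbitrary) */
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <numaif.h>

	int main(void)
	{
		size_t len = 4UL << 30, step = 2UL << 20;
		unsigned long i, count = len / step;
		void *pages[count];
		int status[count];
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;
		madvise(p, len, MADV_HUGEPAGE);	/* opt into direct reclaim */
		memset(p, 0, len);		/* fault everything in */
		for (i = 0; i < count; i++)
			pages[i] = p + i * step;
		/* nodes == NULL: do not migrate, only report current nodes */
		if (move_pages(0, count, pages, NULL, status, 0))
			return 1;
		for (i = 0; i < count; i++)
			printf("offset %luM node %d\n", i * 2, status[i]);
		return 0;
	}

With the upstream behavior the whole mapping is expected to stay on
the local node (after potentially heavy swapping); with 2) some
offsets should report remote nodes once the local node runs out of
THP.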
Reported-by: Alex Williamson
Signed-off-by: Andrea Arcangeli
---
 mm/mempolicy.c | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d6512ef28cde..fb7f9581a835 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2047,8 +2047,36 @@ alloc_pages_vma(gfp_t gfp, int order, struct vm_area_struct *vma,
 		nmask = policy_nodemask(gfp, pol);
 		if (!nmask || node_isset(hpage_node, *nmask)) {
 			mpol_cond_put(pol);
-			page = __alloc_pages_node(hpage_node,
-						gfp | __GFP_THISNODE, order);
+			/*
+			 * We cannot invoke reclaim if __GFP_THISNODE
+			 * is set. Invoking reclaim with
+			 * __GFP_THISNODE set would cause THP
+			 * allocations to trigger heavy swapping even
+			 * though there may be tons of free memory
+			 * (including potentially plenty of THP
+			 * already available in the buddy) on all the
+			 * other NUMA nodes.
+			 *
+			 * At most we could invoke compaction when
+			 * __GFP_THISNODE is set (but we would need to
+			 * refrain from invoking reclaim even if
+			 * compaction returned COMPACT_SKIPPED because
+			 * there wasn't enough memory to succeed
+			 * compaction). For now just avoid
+			 * __GFP_THISNODE instead of limiting the
+			 * allocation path to a strict and single
+			 * compaction invocation.
+			 *
+			 * Supposedly if direct reclaim was enabled by
+			 * the caller, the app prefers THP regardless
+			 * of the node it comes from, so this is more
+			 * desirable behavior than only providing THP
+			 * originated from the local node in such a
+			 * case.
+			 */
+			if (!(gfp & __GFP_DIRECT_RECLAIM))
+				gfp |= __GFP_THISNODE;
+			page = __alloc_pages_node(hpage_node, gfp, order);
 			goto out;
 		}
 	}