From patchwork Wed Oct 10 07:19:03 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
X-Patchwork-Submitter: "Huang, Ying"
X-Patchwork-Id: 10634099
From: Huang Ying
To: Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Huang Ying,
 "Kirill A. Shutemov", Andrea Arcangeli, Michal Hocko, Johannes Weiner,
 Shaohua Li, Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
 Naoya Horiguchi, Zi Yan, Daniel Jordan
Subject: [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece
Date: Wed, 10 Oct 2018 15:19:03 +0800
Message-Id: <20181010071924.18767-1-ying.huang@intel.com>
X-Mailer: git-send-email 2.16.4

Hi, Andrew, could you help me to check whether the overall design is
reasonable?

Hi, Hugh, Shaohua, Minchan and Rik, could you help me to review the
swap part of the patchset?  Especially [02/21], [03/21], [04/21],
[05/21], [06/21], [07/21], [08/21], [09/21], [10/21], [11/21],
[12/21], [20/21], [21/21].

Hi, Andrea and Kirill, could you help me to review the THP part of the
patchset?  Especially [01/21], [07/21], [09/21], [11/21], [13/21],
[15/21], [16/21], [17/21], [18/21], [19/21], [20/21].

Hi, Johannes and Michal, could you help me to review the cgroup part
of the patchset?  Especially [14/21].

And for all, any comment is welcome!

This patchset is based on the 2018-10-3 head of mmotm/master.

This is the final step of THP (Transparent Huge Page) swap
optimization.  After the first and second steps, the splitting of the
huge page is delayed from almost the first step of swapout to after
swapout has finished.  In this step, we avoid splitting the THP for
swapout and swap out/swap in the THP in one piece.

We tested the patchset with the vm-scalability benchmark swap-w-seq
test case, with 16 processes.  The test case forks 16 processes.  Each
process allocates a large anonymous memory range and writes it from
beginning to end for 8 rounds.  The first round will swap out, while
the remaining rounds will swap in and swap out.
The test is done on a Xeon E5 v3 system; the swap device used is a
RAM-simulated PMEM (persistent memory) device.  The test results are
as follows,

            base                 optimized
---------------- --------------------------
         %stddev     %change         %stddev
             \          |                \
   1417897 ±  2%    +992.8%   15494673        vm-scalability.throughput
   1020489 ±  4%   +1091.2%   12156349        vmstat.swap.si
   1255093 ±  3%    +940.3%   13056114        vmstat.swap.so
   1259769 ±  7%   +1818.3%   24166779        meminfo.AnonHugePages
  28021761           -10.7%   25018848 ±  2%  meminfo.AnonPages
  64080064 ±  4%     -95.6%    2787565 ± 33%  interrupts.CAL:Function_call_interrupts
     13.91 ±  5%     -13.8        0.10 ± 27%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath

Here, the benchmark score (bytes written per second) improved by
992.8%.  The swapout/swapin throughput improved by 1008% (from about
2.17GB/s to 24.04GB/s).

The performance difference is huge.  In the base kernel, during the
first round of writing, the THP is swapped out and split, so in the
remaining rounds there is only normal page swapin and swapout.  In the
optimized kernel, the THP is kept after the first swapout, so THP
swapin and swapout are used in the remaining rounds.  This shows the
key benefit of swapping out/in the THP in one piece: the THP is kept
instead of being split.  The meminfo information verifies this: in the
base kernel only 4.5% of anonymous pages are THP during the test,
while in the optimized kernel that is 96.6%.  The TLB flushing IPIs
(represented as interrupts.CAL:Function_call_interrupts) were reduced
by 95.6%, while the cycles spent on the spinlock were reduced from
13.9% to 0.1%.  These are performance benefits of THP swapout/swapin
too.

Below is the description for all steps of THP swap optimization.

Recently, the performance of storage devices has improved so fast that
we cannot saturate the disk bandwidth with a single logical CPU when
doing page swapping, even on a high-end server machine, because the
performance of the storage device has improved faster than that of a
single logical CPU.  And it seems that the trend will not change in
the near future.
On the other hand, THP becomes more and more popular because of
increased memory size.  So it becomes necessary to optimize THP swap
performance.

The advantages of swapping out/in a THP in one piece include:

- Batch various swap operations for the THP.  Many operations need to
  be done once per THP instead of once per normal page, for example,
  allocating/freeing the swap space, writing/reading the swap space,
  flushing the TLB, handling the page fault, etc.  This will improve
  the performance of THP swap greatly.

- The THP swap space read/write will be large sequential IO (2M on
  x86_64).  It is particularly helpful for swapin, which is usually
  4k random IO.  This will improve the performance of THP swap too.

- It will help reduce memory fragmentation, especially when THP is
  heavily used by the applications.  The THP order pages will be
  freed up after THP swapout.

- It will improve THP utilization on systems with swap turned on,
  because khugepaged is quite slow to collapse normal pages into a
  THP.  After the THP is split during swapout, it will take quite a
  long time for the normal pages to collapse back into a THP after
  being swapped in.  High THP utilization helps the efficiency of the
  page based memory management too.

There are some concerns regarding THP swapin, mainly because the
possibly enlarged read/write IO size (for swapout/swapin) may put more
overhead on the storage device.  To deal with that, THP swapin is
turned on only when necessary.  A new sysfs interface:

  /sys/kernel/mm/transparent_hugepage/swapin_enabled

is added to configure it.  It uses "always/never/madvise" logic, to be
turned on globally, turned off globally, or turned on only for VMAs
with MADV_HUGEPAGE, etc.
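
As a rough illustration of how the new knob could be inspected from
user space, here is a minimal Python sketch.  It assumes that
swapin_enabled reports its state in the same bracketed style as the
existing /sys/kernel/mm/transparent_hugepage/enabled file (for
example "always [madvise] never"); the cover letter does not spell out
the output format, so that is an assumption.  The helper names
parse_selected() and read_swapin_mode() are hypothetical and exist
only for this example.

```python
from pathlib import Path

# Path taken from the cover letter; the file exists only on kernels
# that carry this patchset.
SWAPIN_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/swapin_enabled")


def parse_selected(text):
    """Return the bracketed token from a string such as
    'always [madvise] never', or None if no token is bracketed.

    ASSUMPTION: the new file mirrors the bracketed format used by the
    existing transparent_hugepage/enabled file.
    """
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_swapin_mode(path=SWAPIN_ENABLED):
    """Return the current THP swapin mode, or None if the file cannot
    be read (for example, on a kernel without this patchset)."""
    try:
        return parse_selected(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    print(read_swapin_mode())
```

With the "madvise" setting, only VMAs that the application has marked
with madvise(MADV_HUGEPAGE) would be eligible for THP swapin, so a
check like this would typically go together with such a call in the
application itself.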

Changelog
---------

V6:

- Rebased on 10/3 HEAD of mmotm/master
- Added return value checking in swap_duplicate() per Daniel's comments

v5:

- Rebased on 9/20 HEAD of mmotm/master
- Merged the swap operations implementation for the huge and the
  normal swap entries when possible
- Added more code comments to improve code readability
- Changed function parameter style to avoid using Boolean parameters
  as much as possible
- Fixed a deadlock issue in do_huge_pmd_swap_page(), thanks 0-Day and
  sparse

v4:

- Rebased on 6/14 HEAD of mmotm/master
- Fixed one build bug and several coding style issues, thanks Daniel
  Jordan

v3:

- Rebased on 5/18 HEAD of mmotm/master
- Fixed a build bug, thanks 0-Day!

v2:

- Fixed several build bugs, thanks 0-Day!
- Improved documentation as suggested by Randy Dunlap.
- Fixed several bugs in reading huge swap cluster