From patchwork Tue Dec 18 04:23:23 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Dan Williams
X-Patchwork-Id: 10734781
Subject: [PATCH v6 0/6] mm: Randomize free memory
From: Dan Williams
To: akpm@linux-foundation.org
Cc: "Rafael J. Wysocki", Keith Busch, Mike Rapoport, Kees Cook,
 x86@kernel.org, Michal Hocko, Dave Hansen, Peter Zijlstra,
 "Rafael J. Wysocki", Andy Lutomirski, linux-mm@kvack.org,
 x86@kernel.org, linux-kernel@vger.kernel.org, mgorman@suse.de
Date: Mon, 17 Dec 2018 20:23:23 -0800
Message-ID: <154510700291.1941238.817190985966612531.stgit@dwillia2-desk3.amr.corp.intel.com>
User-Agent: StGit/0.18-2-gc94f

Changes since v5 [1]:
* Add missing kernel-doc for new functionality, and fold some changes
  from patch 4 to patch 3. (Mike)
* Actually include the HMAT parsing, and test the autodetect in QEMU.
  (Keith)
* Test against hibernation. Note, only a basic checkout with pm_test in
  QEMU was performed. (Rafael)
* Fix up interaction between auto-detect, override, and the status value
  exported to /sys/module/page_alloc/parameters/shuffle
* Don't pollute mm.h, move the new functionality to its own header.

[1]: https://lkml.kernel.org/r/154483851047.1672629.15001135860756738866.stgit@dwillia2-desk3.amr.corp.intel.com/

---

Andrew, this needs at least an ack from Michal, or Mel before it moves
forward. It would be a nice surprise / present to see it move forward
before the holidays, but I suspect it may need to simmer until the new
year. This series is against v4.20-rc6.

Summary, quote patch 4:

Randomization of the page allocator improves the average utilization of
a direct-mapped memory-side-cache. Memory side caching is a platform
capability that Linux has been previously exposed to in HPC
(high-performance computing) environments on specialty platforms. In
that instance it was a smaller pool of high-bandwidth-memory relative to
higher-capacity / lower-bandwidth DRAM. Now, this capability is going to
be found on general purpose server platforms where DRAM is a cache in
front of higher latency persistent memory [2].

Robert offered an explanation of the state of the art of Linux
interactions with memory-side-caches [3], and I copy it here:

    It's been a problem in the HPC space:
    http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

    A kernel module called zonesort is available to try to help:
    https://software.intel.com/en-us/articles/xeon-phi-software

    and this abandoned patch series proposed that for the kernel:
    https://lkml.org/lkml/2017/8/23/195

    Dan's patch series doesn't attempt to ensure buffers won't conflict, but
    also reduces the chance that the buffers will. This will make performance
    more consistent, albeit slower than "optimal" (which is near impossible
    to attain in a general-purpose kernel). That's better than forcing
    users to deploy remedies like:
        "To eliminate this gradual degradation, we have added a Stream
         measurement to the Node Health Check that follows each job;
         nodes are rebooted whenever their measured memory bandwidth
         falls below 300 GB/s."

A replacement for zonesort was merged upstream in commit cc9aec03e58f
"x86/numa_emulation: Introduce uniform split capability". With this
numa_emulation capability, memory can be split into cache sized
("near-memory" sized) numa nodes. A bind operation to such a node, and
disabling workloads on other nodes, enables full cache performance.
However, once the workload exceeds the cache size then cache conflicts
are unavoidable. While HPC environments might be able to tolerate
time-scheduling of cache sized workloads, for general purpose server
platforms, the oversubscribed cache case will be the common case.

The worst case scenario is that a server system owner benchmarks a
workload at boot with an un-contended cache only to see that performance
degrade over time, even below the average cache performance due to
excessive conflicts. Randomization clips the peaks and fills in the
valleys of cache utilization to yield steady average performance.

See patch 3 for more details.

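To make the "clips the peaks and fills in the valleys" argument concrete,
here is a minimal userspace sketch. It is not code from this series and
does not reflect how mm/shuffle.c works internally. It assumes a
simplified direct-mapped cache in which a page frame maps to slot
"pfn modulo cache size", and every name and constant in it (CACHE_PAGES,
FAR_TO_NEAR_RATIO, fill_sequential(), fill_aliased(),
shuffle_free_list(), slots_covered()) is hypothetical. It compares how
many distinct cache slots the first cache-sized batch of allocations
covers for a best-case, a worst-case, and a shuffled free list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sizes: near memory (cache) in pages, far memory 8x larger. */
#define CACHE_PAGES		1024u
#define FAR_TO_NEAR_RATIO	8u
#define MEM_PAGES		(FAR_TO_NEAR_RATIO * CACHE_PAGES)

/* Assumed mapping: direct-mapped, one possible slot per page frame. */
unsigned int cache_slot(unsigned int pfn)
{
	return pfn % CACHE_PAGES;
}

/* Best case: ascending pfns, so consecutive allocations never alias. */
void fill_sequential(unsigned int *pfns, size_t n)
{
	for (size_t i = 0; i < n; i++)
		pfns[i] = (unsigned int)i;
}

/* Worst case: consecutive entries differ by CACHE_PAGES and so alias. */
void fill_aliased(unsigned int *pfns, size_t n)
{
	for (size_t i = 0; i < n; i++)
		pfns[i] = (unsigned int)((i % FAR_TO_NEAR_RATIO) * CACHE_PAGES +
					 i / FAR_TO_NEAR_RATIO);
}

/* Fisher-Yates shuffle; rand() stands in for a real entropy source. */
void shuffle_free_list(unsigned int *pfns, size_t n, unsigned int seed)
{
	if (n < 2)
		return;

	srand(seed);
	for (size_t i = n - 1; i > 0; i--) {
		size_t j = (size_t)rand() % (i + 1);
		unsigned int tmp = pfns[i];

		pfns[i] = pfns[j];
		pfns[j] = tmp;
	}
}

/* Count the distinct cache slots used by the first 'count' allocations. */
unsigned int slots_covered(const unsigned int *pfns, size_t count)
{
	static unsigned char seen[CACHE_PAGES];
	unsigned int covered = 0;

	assert(count <= MEM_PAGES);
	memset(seen, 0, sizeof(seen));
	for (size_t i = 0; i < count; i++) {
		unsigned int slot = cache_slot(pfns[i]);

		if (!seen[slot]) {
			seen[slot] = 1;
			covered++;
		}
	}
	return covered;
}
```

With these assumed sizes, slots_covered(pfns, CACHE_PAGES) returns
CACHE_PAGES after fill_sequential() and CACHE_PAGES / FAR_TO_NEAR_RATIO
after fill_aliased(). After shuffle_free_list() the result lands between
those two extremes, which is the steady average behaviour the series
aims for.
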
[2]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
[3]: https://lkml.org/lkml/2018/9/22/54

---

Dan Williams (3):
      mm: Shuffle initial free memory to improve memory-side-cache utilization
      mm: Move buddy list manipulations into helpers
      mm: Maintain randomization of page free lists

Keith Busch (3):
      acpi: Create subtable parsing infrastructure
      acpi: Add HMAT to generic parsing tables
      acpi/numa: Set the memory-side-cache size in memblocks


 arch/ia64/kernel/acpi.c                       |  12 +
 arch/x86/Kconfig                              |   1
 arch/x86/kernel/acpi/boot.c                   |  36 ++--
 drivers/acpi/numa.c                           |  48 ++++-
 drivers/acpi/scan.c                           |   4
 drivers/acpi/tables.c                         |  76 +++++++-
 drivers/irqchip/irq-gic-v2m.c                 |   2
 drivers/irqchip/irq-gic-v3-its-pci-msi.c      |   2
 drivers/irqchip/irq-gic-v3-its-platform-msi.c |   2
 drivers/irqchip/irq-gic-v3-its.c              |   6 -
 drivers/irqchip/irq-gic-v3.c                  |   8 -
 drivers/irqchip/irq-gic.c                     |   4
 drivers/mailbox/pcc.c                         |   2
 include/linux/acpi.h                          |   6 +
 include/linux/list.h                          |  17 ++
 include/linux/memblock.h                      |  38 ++++
 include/linux/mm.h                            |   3
 include/linux/mm_types.h                      |   3
 include/linux/mmzone.h                        |  65 +++++++
 include/linux/shuffle.h                       |  59 ++++++
 init/Kconfig                                  |  36 ++++
 mm/Kconfig                                    |   3
 mm/Makefile                                   |   7 +
 mm/compaction.c                               |   4
 mm/memblock.c                                 |  50 +++++
 mm/memory_hotplug.c                           |   3
 mm/page_alloc.c                               |  82 ++++-----
 mm/shuffle.c                                  | 231 +++++++++++++++++++++++++
 28 files changed, 702 insertions(+), 108 deletions(-)
 create mode 100644 include/linux/shuffle.h
 create mode 100644 mm/shuffle.c