From patchwork Wed Jan 16 22:57:36 2019
X-Patchwork-Submitter: Dan Williams
X-Patchwork-Id: 10767093
Subject: [PATCH v8 0/3] mm: Randomize free memory
From: Dan Williams <dan.j.williams@intel.com>
To: akpm@linux-foundation.org
Cc: Michal Hocko, Dave Hansen, Mike Rapoport, Kees Cook,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org, keith.busch@intel.com
Date: Wed, 16 Jan 2019 14:57:36 -0800
Message-ID: <154767945660.1983228.12167020940431682725.stgit@dwillia2-desk3.amr.corp.intel.com>

Changes since v7 [1]:

* Make the functionality available to non-ACPI_NUMA builds. (Kees)

* Mark the shuffle_state configuration variable __ro_after_init, since
  it is not meant to be changed after boot. (Kees)

* Collect Reviewed-by tags from Kees on patch 1 and patch 3.

[1]: https://lwn.net/Articles/776228/

---

Hi Andrew,

It has been a week since I attempted to nudge Mel from his
not-NAK-but-not-ACK position [2]. One of the concerns raised was that
buddy-merging would undo the randomization over time. That situation is
avoided by the fact that the randomization is still valuable even at
the MAX_ORDER buddy-page size, and that is the default randomization
granularity.
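For intuition, here is a minimal userspace sketch of the underlying
idea: permute the order in which MAX_ORDER-sized blocks are handed to
the free list with a Fisher-Yates shuffle. This is a toy model with
made-up names (NR_BLOCKS, the blocks array), not code from this series;
mm/shuffle.c is the real implementation and operates on zone free
lists:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define NR_BLOCKS 16	/* pretend the zone holds 16 MAX_ORDER blocks */

    int main(void)
    {
    	unsigned long blocks[NR_BLOCKS];
    	int i;

    	srand((unsigned) time(NULL));

    	/* Start from the physically contiguous, in-order layout... */
    	for (i = 0; i < NR_BLOCKS; i++)
    		blocks[i] = i;

    	/* ...then Fisher-Yates shuffle the free-list entry order. */
    	for (i = NR_BLOCKS - 1; i > 0; i--) {
    		int j = rand() % (i + 1);
    		unsigned long tmp = blocks[i];

    		blocks[i] = blocks[j];
    		blocks[j] = tmp;
    	}

    	for (i = 0; i < NR_BLOCKS; i++)
    		printf("free-list slot %d -> block %lu\n", i, blocks[i]);

    	return 0;
    }

Because whole MAX_ORDER blocks are permuted, buddy-merging within a
block can never reassemble the original cross-block physical ordering.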
CMA might also undo some of the randomization, but there's not much
else to be done about that besides noting the incompatibility. It is
not clear how widely CMA is used on servers.

Outside of the server performance use case, Kees continues to "remain a
fan" of the set for its potential security implications.

Recall that the functionality is default-disabled and self-contained.
The risk of merging this remains confined to workloads that were
somehow dependent on in-order page allocation, as Mel discovered [3],
*and* that do not benefit from increased average memory-side-cache
utilization. Everyone else is not at risk of regression, because the
functionality stays disabled in the absence of an explicit command line
option or ACPI HMAT tables indicating the presence of a direct-mapped
memory-side-cache.

My read is that "not-NAK" is the best I can hope for from other core-mm
folks at this point. Please consider this for the -mm tree and the v5.1
merge window.

[2]: https://lore.kernel.org/lkml/CAPcyv4gkSBW5Te0RZLrkxzufyVq56-7pHu__YfffBiWhoqg7Yw@mail.gmail.com/
[3]: https://lkml.org/lkml/2018/10/12/309

---

Quoting patch 1:

Randomization of the page allocator improves the average utilization of
a direct-mapped memory-side-cache. Memory-side caching is a platform
capability that Linux has previously been exposed to in HPC
(high-performance computing) environments on specialty platforms. In
that instance it was a smaller pool of high-bandwidth memory relative
to higher-capacity / lower-bandwidth DRAM. Now, this capability is
going to be found on general purpose server platforms where DRAM is a
cache in front of higher-latency persistent memory [4].

Robert offered an explanation of the state of the art of Linux
interactions with memory-side-caches [5], and I copy it here:

    It's been a problem in the HPC space:
    http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

    A kernel module called zonesort is available to try to help:
    https://software.intel.com/en-us/articles/xeon-phi-software

    and this abandoned patch series proposed that for the kernel:
    https://lkml.org/lkml/2017/8/23/195

    Dan's patch series doesn't attempt to ensure buffers won't
    conflict, but it does reduce the chance that they will. This will
    make performance more consistent, albeit slower than "optimal"
    (which is near impossible to attain in a general-purpose kernel).
    That's better than forcing users to deploy remedies like:

        "To eliminate this gradual degradation, we have added a Stream
        measurement to the Node Health Check that follows each job;
        nodes are rebooted whenever their measured memory bandwidth
        falls below 300 GB/s."

A replacement for zonesort was merged upstream in commit cc9aec03e58f
("x86/numa_emulation: Introduce uniform split capability"). With this
numa_emulation capability, memory can be split into cache-sized
("near-memory" sized) numa nodes. A bind operation to such a node, and
disabling workloads on other nodes, enables full cache performance.
However, once the workload exceeds the cache size, cache conflicts are
unavoidable. While HPC environments might be able to tolerate
time-scheduling of cache-sized workloads, for general purpose server
platforms the oversubscribed cache case will be the common case.

The worst case scenario is that a server system owner benchmarks a
workload at boot with an un-contended cache, only to see that
performance degrade over time, even below the average cache
performance, due to excessive conflicts.
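For readers unfamiliar with direct-mapped caches, a toy model (with an
assumed 16GB near-memory size; the real geometry comes from the ACPI
HMAT) shows why two hot buffers that land a cache-size apart in
physical memory will systematically evict each other:

    #include <stdio.h>

    int main(void)
    {
    	/* Assumed geometry for illustration: a 16GB direct-mapped cache. */
    	unsigned long long cache_size = 16ULL << 30;

    	unsigned long long a = 0x100000000ULL;	/* some hot physical address */
    	unsigned long long b = a + cache_size;	/* exactly one cache-size away */

    	/*
    	 * Direct-mapped: the cache slot is determined by the physical
    	 * address modulo the cache size, so 'a' and 'b' always map to
    	 * the same slot and evict each other on every access.
    	 */
    	printf("a -> slot %llu\n", a % cache_size);
    	printf("b -> slot %llu\n", b % cache_size);

    	return 0;
    }

Shuffling the free lists makes it statistically unlikely that an
allocation pattern repeatedly lands on such conflicting alignments and
stays there.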
Randomization clips the peaks and fills in the valleys of cache
utilization to yield steady average performance. See patch 1 for more
details.

[4]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
[5]: https://lkml.org/lkml/2018/9/22/54

---

Dan Williams (3):
      mm: Shuffle initial free memory to improve memory-side-cache utilization
      mm: Move buddy list manipulations into helpers
      mm: Maintain randomization of page free lists

 include/linux/list.h     |   17 +++
 include/linux/mm.h       |    3 -
 include/linux/mm_types.h |    3 +
 include/linux/mmzone.h   |   65 ++++++++++++
 include/linux/shuffle.h  |   60 +++++++++++
 init/Kconfig             |   35 ++++++
 mm/Makefile              |    7 +
 mm/compaction.c          |    4 -
 mm/memblock.c            |   10 ++
 mm/memory_hotplug.c      |    3 +
 mm/page_alloc.c          |   82 +++++++-------
 mm/shuffle.c             |  231 ++++++++++++++++++++++++++++++++++++++++++++++
 12 files changed, 470 insertions(+), 50 deletions(-)
 create mode 100644 include/linux/shuffle.h
 create mode 100644 mm/shuffle.c