From patchwork Wed Nov 14 22:49:14 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Keith Busch X-Patchwork-Id: 10683279 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id F33A413BF for ; Wed, 14 Nov 2018 22:53:04 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id E3CED2BF63 for ; Wed, 14 Nov 2018 22:53:04 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id D63102BF72; Wed, 14 Nov 2018 22:53:04 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 772B42BF63 for ; Wed, 14 Nov 2018 22:53:04 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726814AbeKOI6Q (ORCPT ); Thu, 15 Nov 2018 03:58:16 -0500 Received: from mga12.intel.com ([192.55.52.136]:50974 "EHLO mga12.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726379AbeKOI6Q (ORCPT ); Thu, 15 Nov 2018 03:58:16 -0500 X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga004.fm.intel.com ([10.253.24.48]) by fmsmga106.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 14 Nov 2018 14:53:02 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.56,234,1539673200"; d="scan'208";a="106314877" Received: from unknown (HELO localhost.lm.intel.com) ([10.232.112.69]) by fmsmga004.fm.intel.com with ESMTP; 14 Nov 2018 14:53:02 -0800 From: Keith Busch To: linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org, linux-mm@kvack.org Cc: Greg Kroah-Hartman , Rafael Wysocki , Dave Hansen , Dan Williams , Keith Busch Subject: [PATCH 1/7] node: Link memory nodes to their compute nodes Date: Wed, 14 Nov 2018 15:49:14 -0700 Message-Id: <20181114224921.12123-2-keith.busch@intel.com> X-Mailer: git-send-email 2.13.6 Sender: linux-acpi-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-acpi@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Memory-only nodes will often have affinity to a compute node, and platforms have ways to express that locality relationship. A node containing CPUs or other DMA devices that can initiate memory access are referred to as "memory iniators". A "memory target" is a node that provides at least one phyiscal address range accessible to a memory initiator. In preparation for these systems, provide a new kernel API to link the target memory node to its initiator compute node with symlinks to each other. The following example shows the new sysfs hierarchy setup for memory node 'Y' local to commpute node 'X': # ls -l /sys/devices/system/node/nodeX/initiator* /sys/devices/system/node/nodeX/targetY -> ../nodeY # ls -l /sys/devices/system/node/nodeY/target* /sys/devices/system/node/nodeY/initiatorX -> ../nodeX Signed-off-by: Keith Busch --- drivers/base/node.c | 32 ++++++++++++++++++++++++++++++++ include/linux/node.h | 2 ++ 2 files changed, 34 insertions(+) diff --git a/drivers/base/node.c b/drivers/base/node.c index 86d6cd92ce3d..a9b7512a9502 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -372,6 +372,38 @@ int register_cpu_under_node(unsigned int cpu, unsigned int nid) kobject_name(&node_devices[nid]->dev.kobj)); } +int register_memory_node_under_compute_node(unsigned int m, unsigned int p) +{ + int ret; + char initiator[20], target[17]; + + if (!node_online(p) || !node_online(m)) + return -ENODEV; + if (m == p) + return 0; + + snprintf(initiator, sizeof(initiator), "initiator%d", p); + snprintf(target, sizeof(target), "target%d", m); + + ret = sysfs_create_link(&node_devices[p]->dev.kobj, + &node_devices[m]->dev.kobj, + target); + if (ret) + return ret; + + ret = sysfs_create_link(&node_devices[m]->dev.kobj, + &node_devices[p]->dev.kobj, + initiator); + if (ret) + goto err; + + return 0; + err: + sysfs_remove_link(&node_devices[p]->dev.kobj, + kobject_name(&node_devices[m]->dev.kobj)); + return ret; +} + int unregister_cpu_under_node(unsigned int cpu, unsigned int nid) { struct device *obj; diff --git a/include/linux/node.h b/include/linux/node.h index 257bb3d6d014..1fd734a3fb3f 100644 --- a/include/linux/node.h +++ b/include/linux/node.h @@ -75,6 +75,8 @@ extern int register_mem_sect_under_node(struct memory_block *mem_blk, extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk, unsigned long phys_index); +extern int register_memory_node_under_compute_node(unsigned int m, unsigned int p); + #ifdef CONFIG_HUGETLBFS extern void register_hugetlbfs_with_node(node_registration_func_t doregister, node_registration_func_t unregister); From patchwork Wed Nov 14 22:49:15 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Keith Busch X-Patchwork-Id: 10683301 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 3488C1759 for ; Wed, 14 Nov 2018 22:53:37 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 282DE2AC61 for ; Wed, 14 Nov 2018 22:53:37 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 1CF0B2BF88; Wed, 14 Nov 2018 22:53:37 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id ACC902BF70 for ; Wed, 14 Nov 2018 22:53:36 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727325AbeKOI6T (ORCPT ); Thu, 15 Nov 2018 03:58:19 -0500 Received: from mga12.intel.com ([192.55.52.136]:50974 "EHLO mga12.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726704AbeKOI6Q (ORCPT ); Thu, 15 Nov 2018 03:58:16 -0500 X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga004.fm.intel.com ([10.253.24.48]) by fmsmga106.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 14 Nov 2018 14:53:03 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.56,234,1539673200"; d="scan'208";a="106314880" Received: from unknown (HELO localhost.lm.intel.com) ([10.232.112.69]) by fmsmga004.fm.intel.com with ESMTP; 14 Nov 2018 14:53:02 -0800 From: Keith Busch To: linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org, linux-mm@kvack.org Cc: Greg Kroah-Hartman , Rafael Wysocki , Dave Hansen , Dan Williams , Keith Busch Subject: [PATCH 2/7] node: Add heterogenous memory performance Date: Wed, 14 Nov 2018 15:49:15 -0700 Message-Id: <20181114224921.12123-3-keith.busch@intel.com> X-Mailer: git-send-email 2.13.6 In-Reply-To: <20181114224921.12123-2-keith.busch@intel.com> References: <20181114224921.12123-2-keith.busch@intel.com> Sender: linux-acpi-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-acpi@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Heterogeneous memory systems provide memory nodes with latency and bandwidth performance attributes that are different from other nodes. Create an interface for the kernel to register these attributes under the node that provides the memory. If the system provides this information, applications can query the node attributes when deciding which node to request memory. When multiple memory initiators exist, accessing the same memory target from each may not perform the same as the other. The highest performing initiator to a given target is considered to be a local initiator for that target. The kernel provides performance attributes only for the local initiators. The memory's compute node should be symlinked in sysfs as one of the node's initiators. The following example shows the new sysfs hierarchy for a node exporting performance attributes: # tree /sys/devices/system/node/nodeY/initiator_access /sys/devices/system/node/nodeY/initiator_access |-- read_bandwidth |-- read_latency |-- write_bandwidth `-- write_latency The bandwidth is exported as MB/s and latency is reported in nanoseconds. Signed-off-by: Keith Busch --- drivers/base/Kconfig | 8 ++++++++ drivers/base/node.c | 44 ++++++++++++++++++++++++++++++++++++++++++++ include/linux/node.h | 22 ++++++++++++++++++++++ 3 files changed, 74 insertions(+) diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig index ae213ed2a7c8..2cf67c80046d 100644 --- a/drivers/base/Kconfig +++ b/drivers/base/Kconfig @@ -149,6 +149,14 @@ config DEBUG_TEST_DRIVER_REMOVE unusable. You should say N here unless you are explicitly looking to test this functionality. +config HMEM + bool + default y + depends on NUMA + help + Enable reporting for heterogenous memory access attributes under + their non-uniform memory nodes. + source "drivers/base/test/Kconfig" config SYS_HYPERVISOR diff --git a/drivers/base/node.c b/drivers/base/node.c index a9b7512a9502..232535761998 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -59,6 +59,50 @@ static inline ssize_t node_read_cpulist(struct device *dev, static DEVICE_ATTR(cpumap, S_IRUGO, node_read_cpumask, NULL); static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL); +#ifdef CONFIG_HMEM +const struct attribute_group node_access_attrs_group; + +#define ACCESS_ATTR(name) \ +static ssize_t name##_show(struct device *dev, \ + struct device_attribute *attr, \ + char *buf) \ +{ \ + return sprintf(buf, "%d\n", to_node(dev)->hmem_attrs.name); \ +} \ +static DEVICE_ATTR_RO(name); + +ACCESS_ATTR(read_bandwidth) +ACCESS_ATTR(read_latency) +ACCESS_ATTR(write_bandwidth) +ACCESS_ATTR(write_latency) + +static struct attribute *access_attrs[] = { + &dev_attr_read_bandwidth.attr, + &dev_attr_read_latency.attr, + &dev_attr_write_bandwidth.attr, + &dev_attr_write_latency.attr, + NULL, +}; + +const struct attribute_group node_access_attrs_group = { + .name = "initiator_access", + .attrs = access_attrs, +}; + +void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs) +{ + struct node *node; + + if (WARN_ON_ONCE(!node_online(nid))) + return; + node = node_devices[nid]; + node->hmem_attrs = *hmem_attrs; + if (sysfs_create_group(&node->dev.kobj, &node_access_attrs_group)) + pr_info("failed to add performance attribute group to node %d\n", + nid); +} +#endif + #define K(x) ((x) << (PAGE_SHIFT - 10)) static ssize_t node_read_meminfo(struct device *dev, struct device_attribute *attr, char *buf) diff --git a/include/linux/node.h b/include/linux/node.h index 1fd734a3fb3f..6a1aa6a153f8 100644 --- a/include/linux/node.h +++ b/include/linux/node.h @@ -17,14 +17,36 @@ #include #include +#include #include +#ifdef CONFIG_HMEM +/** + * struct node_hmem_attrs - heterogeneous memory performance attributes + * + * read_bandwidth: Read bandwidth in MB/s + * write_bandwidth: Write bandwidth in MB/s + * read_latency: Read latency in nanoseconds + * write_latency: Write latency in nanoseconds + */ +struct node_hmem_attrs { + unsigned int read_bandwidth; + unsigned int write_bandwidth; + unsigned int read_latency; + unsigned int write_latency; +}; +void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs); +#endif + struct node { struct device dev; #if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HUGETLBFS) struct work_struct node_work; #endif +#ifdef CONFIG_HMEM + struct node_hmem_attrs hmem_attrs; +#endif }; struct memory_block; From patchwork Wed Nov 14 22:49:16 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Keith Busch X-Patchwork-Id: 10683303 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id B3A8513BF for ; Wed, 14 Nov 2018 22:53:37 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id A7FEC2AC61 for ; Wed, 14 Nov 2018 22:53:37 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 9C57F2BF88; Wed, 14 Nov 2018 22:53:37 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 3DBBA2AC61 for ; Wed, 14 Nov 2018 22:53:37 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726379AbeKOI6S (ORCPT ); Thu, 15 Nov 2018 03:58:18 -0500 Received: from mga12.intel.com ([192.55.52.136]:50985 "EHLO mga12.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727084AbeKOI6R (ORCPT ); Thu, 15 Nov 2018 03:58:17 -0500 X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga004.fm.intel.com ([10.253.24.48]) by fmsmga106.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 14 Nov 2018 14:53:03 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.56,234,1539673200"; d="scan'208";a="106314883" Received: from unknown (HELO localhost.lm.intel.com) ([10.232.112.69]) by fmsmga004.fm.intel.com with ESMTP; 14 Nov 2018 14:53:03 -0800 From: Keith Busch To: linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org, linux-mm@kvack.org Cc: Greg Kroah-Hartman , Rafael Wysocki , Dave Hansen , Dan Williams , Keith Busch Subject: [PATCH 3/7] doc/vm: New documentation for memory performance Date: Wed, 14 Nov 2018 15:49:16 -0700 Message-Id: <20181114224921.12123-4-keith.busch@intel.com> X-Mailer: git-send-email 2.13.6 In-Reply-To: <20181114224921.12123-2-keith.busch@intel.com> References: <20181114224921.12123-2-keith.busch@intel.com> Sender: linux-acpi-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-acpi@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Platforms may provide system memory where some physical address ranges perform differently than others. These heterogeneous memory attributes are common to the node that provides the memory and exported by the kernel. Add new documentation providing a brief overview of such systems and the attributes the kernel makes available to aid applications wishing to query this information. Signed-off-by: Keith Busch --- Documentation/vm/numaperf.rst | 71 +++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 71 insertions(+) create mode 100644 Documentation/vm/numaperf.rst diff --git a/Documentation/vm/numaperf.rst b/Documentation/vm/numaperf.rst new file mode 100644 index 000000000000..5a3ecaff5474 --- /dev/null +++ b/Documentation/vm/numaperf.rst @@ -0,0 +1,71 @@ +.. _numaperf: + +================ +NUMA Performance +================ + +Some platforms may have multiple types of memory attached to a single +CPU. These disparate memory ranges share some characteristics, such as +CPU cache coherence, but may have different performance. For example, +different media types and buses affect bandwidth and latency. + +A system supporting such heterogeneous memory groups each memory type +under different "nodes" based on similar CPU locality and performance +characteristics. Some memory may share the same node as a CPU, and +others are provided as memory-only nodes. While memory only nodes do not +provide CPUs, they may still be local to one or more compute nodes. The +following diagram shows one such example of two compute noes with local +memory and a memory only node for each of compute node: + + +------------------+ +------------------+ + | Compute Node 0 +-----+ Compute Node 1 | + | Local Node0 Mem | | Local Node1 Mem | + +--------+---------+ +--------+---------+ + | | + +--------+---------+ +--------+---------+ + | Slower Node2 Mem | | Slower Node3 Mem | + +------------------+ +--------+---------+ + +A "memory initiator" is a node containing one or more devices such as +CPUs or separate memory I/O devices that can initiate memory requests. A +"memory target" is a node containing one or more CPU-accessible physical +address ranges. + +When multiple memory initiators exist, accessing the same memory +target may not perform the same as each other. The highest performing +initiator to a given target is considered to be one of that target's +local initiators. + +To aid applications matching memory targets with their initiators, +the kernel provide symlinks to each other like the following example:: + + # ls -l /sys/devices/system/node/nodeX/initiator* + /sys/devices/system/node/nodeX/targetY -> ../nodeY + + # ls -l /sys/devices/system/node/nodeY/target* + /sys/devices/system/node/nodeY/initiatorX -> ../nodeX + +Applications may wish to consider which node they want their memory to +be allocated from based on the nodes performance characteristics. If +the system provides these attributes, the kernel exports them under the +node sysfs hierarchy by appending the initiator_access directory under +the node as follows:: + + /sys/devices/system/node/nodeY/initiator_access/ + +The kernel does not provide performance attributes for non-local memory +initiators. The performance characteristics the kernel provides for +the local initiators are exported are as follows:: + + # tree /sys/devices/system/node/nodeY/initiator_access + /sys/devices/system/node/nodeY/initiator_access + |-- read_bandwidth + |-- read_latency + |-- write_bandwidth + `-- write_latency + +The bandwidth attributes are provided in MiB/second. + +The latency attributes are provided in nanoseconds. + +See also: https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf From patchwork Wed Nov 14 22:49:17 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Keith Busch X-Patchwork-Id: 10683305 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id 1E01313BF for ; Wed, 14 Nov 2018 22:53:43 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 119122AC61 for ; Wed, 14 Nov 2018 22:53:43 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 05E782BF88; Wed, 14 Nov 2018 22:53:43 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 7EA0F2AC61 for ; Wed, 14 Nov 2018 22:53:42 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728303AbeKOI6t (ORCPT ); Thu, 15 Nov 2018 03:58:49 -0500 Received: from mga12.intel.com ([192.55.52.136]:50974 "EHLO mga12.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727128AbeKOI6S (ORCPT ); Thu, 15 Nov 2018 03:58:18 -0500 X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga004.fm.intel.com ([10.253.24.48]) by fmsmga106.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 14 Nov 2018 14:53:03 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.56,234,1539673200"; d="scan'208";a="106314887" Received: from unknown (HELO localhost.lm.intel.com) ([10.232.112.69]) by fmsmga004.fm.intel.com with ESMTP; 14 Nov 2018 14:53:03 -0800 From: Keith Busch To: linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org, linux-mm@kvack.org Cc: Greg Kroah-Hartman , Rafael Wysocki , Dave Hansen , Dan Williams , Keith Busch Subject: [PATCH 4/7] node: Add memory caching attributes Date: Wed, 14 Nov 2018 15:49:17 -0700 Message-Id: <20181114224921.12123-5-keith.busch@intel.com> X-Mailer: git-send-email 2.13.6 In-Reply-To: <20181114224921.12123-2-keith.busch@intel.com> References: <20181114224921.12123-2-keith.busch@intel.com> Sender: linux-acpi-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-acpi@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP System memory may have side caches to help improve access speed. While the system provided cache is transparent to the software accessing these memory ranges, applications can optimize their own access based on cache attributes. In preparation for such systems, provide a new API for the kernel to register these memory side caches under the memory node that provides it. The kernel's sysfs representation is modeled from the cpu cacheinfo attributes, as seen from /sys/devices/system/cpu/cpuX/cache/. Unlike CPU cacheinfo, though, a higher node's memory cache level is nearer to the CPU, while lower levels are closer to the backing memory. Also unlike CPU cache, the system handles flushing any dirty cached memory to the last level the memory on a power failure if the range is persistent. The exported attributes are the cache size, the line size, associativity, and write back policy. Signed-off-by: Keith Busch --- drivers/base/node.c | 117 +++++++++++++++++++++++++++++++++++++++++++++++++++ include/linux/node.h | 23 ++++++++++ 2 files changed, 140 insertions(+) diff --git a/drivers/base/node.c b/drivers/base/node.c index 232535761998..bb94f1d18115 100644 --- a/drivers/base/node.c +++ b/drivers/base/node.c @@ -60,6 +60,12 @@ static DEVICE_ATTR(cpumap, S_IRUGO, node_read_cpumask, NULL); static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL); #ifdef CONFIG_HMEM +struct node_cache_obj { + struct kobject kobj; + struct list_head node; + struct node_cache_attrs cache_attrs; +}; + const struct attribute_group node_access_attrs_group; #define ACCESS_ATTR(name) \ @@ -101,6 +107,115 @@ void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs) pr_info("failed to add performance attribute group to node %d\n", nid); } + +struct cache_attribute_entry { + struct attribute attr; + ssize_t (*show)(struct node_cache_attrs *, char *); +}; + +#define CACHE_ATTR(name, fmt) \ +static ssize_t name##_show(struct node_cache_attrs *cache, \ + char *buf) \ +{ \ + return sprintf(buf, fmt "\n", cache->name); \ +} \ +static struct cache_attribute_entry cache_attr_##name = __ATTR_RO(name); + +CACHE_ATTR(size, "%lld") +CACHE_ATTR(level, "%d") +CACHE_ATTR(line_size, "%d") +CACHE_ATTR(associativity, "%d") +CACHE_ATTR(write_policy, "%d") + +static struct attribute *cache_attrs[] = { + &cache_attr_level.attr, + &cache_attr_associativity.attr, + &cache_attr_size.attr, + &cache_attr_line_size.attr, + &cache_attr_write_policy.attr, + NULL, +}; + +static ssize_t cache_attr_show(struct kobject *kobj, struct attribute *attr, + char *page) +{ + struct cache_attribute_entry *entry = + container_of(attr, struct cache_attribute_entry, attr); + struct node_cache_obj *cache_obj = + container_of(kobj, struct node_cache_obj, kobj); + return entry->show(&cache_obj->cache_attrs, page); +} + +static const struct sysfs_ops cache_ops = { + .show = &cache_attr_show, +}; + +static struct kobj_type cache_ktype = { + .default_attrs = cache_attrs, + .sysfs_ops = &cache_ops, +}; + +void node_add_cache(unsigned int nid, struct node_cache_attrs *cache_attrs) +{ + struct node_cache_obj *cache_obj; + struct node *node; + + if (!node_online(nid) || !node_devices[nid]) + return; + + node = node_devices[nid]; + list_for_each_entry(cache_obj, &node->cache_attrs, node) { + if (cache_obj->cache_attrs.level == cache_attrs->level) { + dev_warn(&node->dev, + "attempt to add duplicate cache level:%d\n", + cache_attrs->level); + return; + } + } + + if (!node->cache_kobj) + node->cache_kobj = kobject_create_and_add("cache", + &node->dev.kobj); + if (!node->cache_kobj) + return; + + cache_obj = kzalloc(sizeof(*cache_obj), GFP_KERNEL); + if (!cache_obj) + return; + + cache_obj->cache_attrs = *cache_attrs; + if (kobject_init_and_add(&cache_obj->kobj, &cache_ktype, node->cache_kobj, + "index%d", cache_attrs->level)) { + dev_warn(&node->dev, "failed to add cache level:%d\n", + cache_attrs->level); + kfree(cache_obj); + return; + } + list_add_tail(&cache_obj->node, &node->cache_attrs); +} + +static void node_remove_caches(struct node *node) +{ + struct node_cache_obj *obj, *next; + + if (!node->cache_kobj) + return; + + list_for_each_entry_safe(obj, next, &node->cache_attrs, node) { + list_del(&obj->node); + kobject_put(&obj->kobj); + kfree(obj); + } + kobject_put(node->cache_kobj); +} + +static void node_init_caches(unsigned int nid) +{ + INIT_LIST_HEAD(&node_devices[nid]->cache_attrs); +} +#else +static void node_init_caches(unsigned int nid) { } +static void node_remove_caches(struct node *node) { } #endif #define K(x) ((x) << (PAGE_SHIFT - 10)) @@ -345,6 +460,7 @@ static void node_device_release(struct device *dev) */ flush_work(&node->node_work); #endif + node_remove_caches(node); kfree(node); } @@ -658,6 +774,7 @@ int __register_one_node(int nid) /* initialize work queue for memory hot plug */ init_node_hugetlb_work(nid); + node_init_caches(nid); return error; } diff --git a/include/linux/node.h b/include/linux/node.h index 6a1aa6a153f8..f499a17f84bc 100644 --- a/include/linux/node.h +++ b/include/linux/node.h @@ -36,6 +36,27 @@ struct node_hmem_attrs { unsigned int write_latency; }; void node_set_perf_attrs(unsigned int nid, struct node_hmem_attrs *hmem_attrs); + +enum cache_associativity { + NODE_CACHE_DIRECT_MAP, + NODE_CACHE_INDEXED, + NODE_CACHE_OTHER, +}; + +enum cache_write_policy { + NODE_CACHE_WRITE_BACK, + NODE_CACHE_WRITE_THROUGH, + NODE_CACHE_WRITE_OTHER, +}; + +struct node_cache_attrs { + enum cache_associativity associativity; + enum cache_write_policy write_policy; + u64 size; + u16 line_size; + u8 level; +}; +void node_add_cache(unsigned int nid, struct node_cache_attrs *cache_attrs); #endif struct node { @@ -46,6 +67,8 @@ struct node { #endif #ifdef CONFIG_HMEM struct node_hmem_attrs hmem_attrs; + struct list_head cache_attrs; + struct kobject *cache_kobj; #endif }; From patchwork Wed Nov 14 22:49:18 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Keith Busch X-Patchwork-Id: 10683299 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id D958813BF for ; Wed, 14 Nov 2018 22:53:36 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id CD53F2AC61 for ; Wed, 14 Nov 2018 22:53:36 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id BFDEC2BF8B; Wed, 14 Nov 2018 22:53:36 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=unavailable version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 4C27E2AC61 for ; Wed, 14 Nov 2018 22:53:36 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727264AbeKOI6n (ORCPT ); Thu, 15 Nov 2018 03:58:43 -0500 Received: from mga12.intel.com ([192.55.52.136]:50985 "EHLO mga12.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727141AbeKOI6T (ORCPT ); Thu, 15 Nov 2018 03:58:19 -0500 X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga004.fm.intel.com ([10.253.24.48]) by fmsmga106.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 14 Nov 2018 14:53:04 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.56,234,1539673200"; d="scan'208";a="106314891" Received: from unknown (HELO localhost.lm.intel.com) ([10.232.112.69]) by fmsmga004.fm.intel.com with ESMTP; 14 Nov 2018 14:53:04 -0800 From: Keith Busch To: linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org, linux-mm@kvack.org Cc: Greg Kroah-Hartman , Rafael Wysocki , Dave Hansen , Dan Williams , Keith Busch Subject: [PATCH 5/7] doc/vm: New documentation for memory cache Date: Wed, 14 Nov 2018 15:49:18 -0700 Message-Id: <20181114224921.12123-6-keith.busch@intel.com> X-Mailer: git-send-email 2.13.6 In-Reply-To: <20181114224921.12123-2-keith.busch@intel.com> References: <20181114224921.12123-2-keith.busch@intel.com> Sender: linux-acpi-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-acpi@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Platforms may provide system memory that contains side caches to help spped up access. These memory caches are part of a memory node and the cache attributes are exported by the kernel. Add new documentation providing a brief overview of system memory side caches and the kernel provided attributes for application optimization. Signed-off-by: Keith Busch --- Documentation/vm/numacache.rst | 76 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 76 insertions(+) create mode 100644 Documentation/vm/numacache.rst diff --git a/Documentation/vm/numacache.rst b/Documentation/vm/numacache.rst new file mode 100644 index 000000000000..e79c801b7e3b --- /dev/null +++ b/Documentation/vm/numacache.rst @@ -0,0 +1,76 @@ +.. _numacache: + +========== +NUMA Cache +========== + +System memory may be constructed in a hierarchy of various performing +characteristics in order to provide large address space of slower +performing memory cached by a smaller size of higher performing +memory. The system physical addresses that software is aware of see +is provided by the last memory level in the hierarchy, while higher +performing memory transparently provides caching to slower levels. + +The term "far memory" is used to denote the last level memory in the +hierarchy. Each increasing cache level provides higher performing CPU +access, and the term "near memory" represents the highest level cache +provided by the system. This number is different than CPU caches where +the cache level (ex: L1, L2, L3) uses a CPU centric view with each level +being lower performing and closer to system memory. The memory cache +level is centric to the last level memory, so the higher numbered cache +level denotes memory nearer to the CPU, and further from far memory. + +The memory side caches are not directly addressable by software. When +software accesses a system address, the system will return it from the +near memory cache if it is present. If it is not present, the system +accesses the next level of memory until there is either a hit in that +cache level, or it reaches far memory. + +In order to maximize the performance out of such a setup, software may +wish to query the memory cache attributes. If the system provides a way +to query this information, for example with ACPI HMAT (Heterogeneous +Memory Attribute Table)[1], the kernel will append these attributes to +the NUMA node that provides the memory. + +When the kernel first registers a memory cache with a node, the kernel +will create the following directory:: + + /sys/devices/system/node/nodeX/cache/ + +If that directory is not present, then either the memory does not have +a side cache, or that information is not provided to the kernel. + +The attributes for each level of cache is provided under its cache +level index:: + + /sys/devices/system/node/nodeX/cache/indexA/ + /sys/devices/system/node/nodeX/cache/indexB/ + /sys/devices/system/node/nodeX/cache/indexC/ + +Each cache level's directory provides its attributes. For example, +the following is a single cache level and the attributes available for +software to query:: + + # tree sys/devices/system/node/node0/cache/ + /sys/devices/system/node/node0/cache/ + |-- index1 + | |-- associativity + | |-- level + | |-- line_size + | |-- size + | `-- write_policy + +The cache "associativity" will be 0 if it is a direct-mapped cache, and +non-zero for any other indexed based, multi-way associativity. + +The "level" is the distance from the far memory, and matches the number +appended to its "index" directory. + +The "line_size" is the number of bytes accessed on a cache miss. + +The "size" is the number of bytes provided by this cache level. + +The "write_policy" will be 0 for write-back, and non-zero for +write-through caching. + +[1] https://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf From patchwork Wed Nov 14 22:49:19 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Keith Busch X-Patchwork-Id: 10683289 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id E168613BF for ; Wed, 14 Nov 2018 22:53:15 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id D55352BF63 for ; Wed, 14 Nov 2018 22:53:15 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id C98F22BF7F; Wed, 14 Nov 2018 22:53:15 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 5EA8B2BF63 for ; Wed, 14 Nov 2018 22:53:15 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726704AbeKOI6V (ORCPT ); Thu, 15 Nov 2018 03:58:21 -0500 Received: from mga12.intel.com ([192.55.52.136]:50988 "EHLO mga12.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727086AbeKOI6U (ORCPT ); Thu, 15 Nov 2018 03:58:20 -0500 X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga004.fm.intel.com ([10.253.24.48]) by fmsmga106.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 14 Nov 2018 14:53:04 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.56,234,1539673200"; d="scan'208";a="106314894" Received: from unknown (HELO localhost.lm.intel.com) ([10.232.112.69]) by fmsmga004.fm.intel.com with ESMTP; 14 Nov 2018 14:53:04 -0800 From: Keith Busch To: linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org, linux-mm@kvack.org Cc: Greg Kroah-Hartman , Rafael Wysocki , Dave Hansen , Dan Williams , Keith Busch Subject: [PATCH 6/7] acpi: Create subtable parsing infrastructure Date: Wed, 14 Nov 2018 15:49:19 -0700 Message-Id: <20181114224921.12123-7-keith.busch@intel.com> X-Mailer: git-send-email 2.13.6 In-Reply-To: <20181114224921.12123-2-keith.busch@intel.com> References: <20181114224921.12123-2-keith.busch@intel.com> Sender: linux-acpi-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-acpi@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP Parsing entries in an ACPI table had assumed a generic header structure that is most common. There is no standard ACPI header, though, so less common types would need custom parsers if they want go walk their subtable entry list. Create the infrastructure for adding different table types so parsing the entries array may be more reused for all ACPI system tables. Signed-off-by: Keith Busch --- drivers/acpi/tables.c | 75 ++++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 65 insertions(+), 10 deletions(-) diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c index 61203eebf3a1..15ee77780f68 100644 --- a/drivers/acpi/tables.c +++ b/drivers/acpi/tables.c @@ -49,6 +49,19 @@ static struct acpi_table_desc initial_tables[ACPI_MAX_TABLES] __initdata; static int acpi_apic_instance __initdata; +enum acpi_subtable_type { + ACPI_SUBTABLE_COMMON, +}; + +union acpi_subtable_headers { + struct acpi_subtable_header common; +}; + +struct acpi_subtable_entry { + union acpi_subtable_headers *hdr; + enum acpi_subtable_type type; +}; + /* * Disable table checksum verification for the early stage due to the size * limitation of the current x86 early mapping implementation. @@ -217,6 +230,45 @@ void acpi_table_print_madt_entry(struct acpi_subtable_header *header) } } +static unsigned long __init +acpi_get_entry_type(struct acpi_subtable_entry *entry) +{ + switch (entry->type) { + case ACPI_SUBTABLE_COMMON: + return entry->hdr->common.type; + } + WARN_ONCE(1, "invalid acpi type\n"); + return 0; +} + +static unsigned long __init +acpi_get_entry_length(struct acpi_subtable_entry *entry) +{ + switch (entry->type) { + case ACPI_SUBTABLE_COMMON: + return entry->hdr->common.length; + } + WARN_ONCE(1, "invalid acpi type\n"); + return 0; +} + +static unsigned long __init +acpi_get_subtable_header_length(struct acpi_subtable_entry *entry) +{ + switch (entry->type) { + case ACPI_SUBTABLE_COMMON: + return sizeof(entry->hdr->common); + } + WARN_ONCE(1, "invalid acpi type\n"); + return 0; +} + +static enum acpi_subtable_type __init +acpi_get_subtable_type(char *id) +{ + return ACPI_SUBTABLE_COMMON; +} + /** * acpi_parse_entries_array - for each proc_num find a suitable subtable * @@ -246,8 +298,8 @@ acpi_parse_entries_array(char *id, unsigned long table_size, struct acpi_subtable_proc *proc, int proc_num, unsigned int max_entries) { - struct acpi_subtable_header *entry; - unsigned long table_end; + struct acpi_subtable_entry entry; + unsigned long table_end, subtable_len, entry_len; int count = 0; int errs = 0; int i; @@ -270,19 +322,21 @@ acpi_parse_entries_array(char *id, unsigned long table_size, /* Parse all entries looking for a match. */ - entry = (struct acpi_subtable_header *) + entry.type = acpi_get_subtable_type(id); + entry.hdr = (union acpi_subtable_headers *) ((unsigned long)table_header + table_size); + subtable_len = acpi_get_subtable_header_length(&entry); - while (((unsigned long)entry) + sizeof(struct acpi_subtable_header) < - table_end) { + while (((unsigned long)entry.hdr) + subtable_len < table_end) { if (max_entries && count >= max_entries) break; for (i = 0; i < proc_num; i++) { - if (entry->type != proc[i].id) + if (acpi_get_entry_type(&entry) != proc[i].id) continue; if (!proc[i].handler || - (!errs && proc[i].handler(entry, table_end))) { + (!errs && proc[i].handler(&entry.hdr->common, + table_end))) { errs++; continue; } @@ -297,13 +351,14 @@ acpi_parse_entries_array(char *id, unsigned long table_size, * If entry->length is 0, break from this loop to avoid * infinite loop. */ - if (entry->length == 0) { + entry_len = acpi_get_entry_length(&entry); + if (entry_len == 0) { pr_err("[%4.4s:0x%02x] Invalid zero length\n", id, proc->id); return -EINVAL; } - entry = (struct acpi_subtable_header *) - ((unsigned long)entry + entry->length); + entry.hdr = (union acpi_subtable_headers *) + ((unsigned long)entry.hdr + entry_len); } if (max_entries && count > max_entries) { From patchwork Wed Nov 14 22:49:20 2018 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Keith Busch X-Patchwork-Id: 10683293 Return-Path: Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org [172.30.200.125]) by pdx-korg-patchwork-2.web.codeaurora.org (Postfix) with ESMTP id AF54B13BF for ; Wed, 14 Nov 2018 22:53:16 +0000 (UTC) Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 9FEB42BF63 for ; Wed, 14 Nov 2018 22:53:16 +0000 (UTC) Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486) id 93E4D2BF6C; Wed, 14 Nov 2018 22:53:16 +0000 (UTC) X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on pdx-wl-mail.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-7.9 required=2.0 tests=BAYES_00,MAILING_LIST_MULTI, RCVD_IN_DNSWL_HI autolearn=ham version=3.3.1 Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 7FA9B2BF72 for ; Wed, 14 Nov 2018 22:53:15 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727086AbeKOI61 (ORCPT ); Thu, 15 Nov 2018 03:58:27 -0500 Received: from mga12.intel.com ([192.55.52.136]:50974 "EHLO mga12.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727264AbeKOI60 (ORCPT ); Thu, 15 Nov 2018 03:58:26 -0500 X-Amp-Result: SKIPPED(no attachment in message) X-Amp-File-Uploaded: False Received: from fmsmga004.fm.intel.com ([10.253.24.48]) by fmsmga106.fm.intel.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 14 Nov 2018 14:53:05 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.56,234,1539673200"; d="scan'208";a="106314898" Received: from unknown (HELO localhost.lm.intel.com) ([10.232.112.69]) by fmsmga004.fm.intel.com with ESMTP; 14 Nov 2018 14:53:04 -0800 From: Keith Busch To: linux-kernel@vger.kernel.org, linux-acpi@vger.kernel.org, linux-mm@kvack.org Cc: Greg Kroah-Hartman , Rafael Wysocki , Dave Hansen , Dan Williams , Keith Busch Subject: [PATCH 7/7] acpi/hmat: Parse and report heterogeneous memory Date: Wed, 14 Nov 2018 15:49:20 -0700 Message-Id: <20181114224921.12123-8-keith.busch@intel.com> X-Mailer: git-send-email 2.13.6 In-Reply-To: <20181114224921.12123-2-keith.busch@intel.com> References: <20181114224921.12123-2-keith.busch@intel.com> Sender: linux-acpi-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-acpi@vger.kernel.org X-Virus-Scanned: ClamAV using ClamSMTP The ACPI Heterogeneous Memory Attribute Table (HMAT) table added in ACPI 6.2 provides a way for platforms to express different memory address range characteristics. Parse these tables provided by the platform and register the local initiator performance access and caching attributes with the memory numa nodes so they can be observed through sysfs. Signed-off-by: Keith Busch --- drivers/acpi/Kconfig | 9 ++ drivers/acpi/Makefile | 1 + drivers/acpi/hmat.c | 384 ++++++++++++++++++++++++++++++++++++++++++++++++++ drivers/acpi/tables.c | 10 ++ 4 files changed, 404 insertions(+) create mode 100644 drivers/acpi/hmat.c diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig index 7cea769c37df..9e6b767dd366 100644 --- a/drivers/acpi/Kconfig +++ b/drivers/acpi/Kconfig @@ -327,6 +327,15 @@ config ACPI_NUMA depends on (X86 || IA64 || ARM64) default y if IA64_GENERIC || IA64_SGI_SN2 || ARM64 +config ACPI_HMAT + bool "ACPI Heterogeneous Memory Attribute Table Support" + depends on ACPI_NUMA + select HMEM + help + Parses representation of the ACPI Heterogeneous Memory Attributes + Table (HMAT) and set the memory node relationships and access + attributes. + config ACPI_CUSTOM_DSDT_FILE string "Custom DSDT Table file to include" default "" diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile index edc039313cd6..b5e13499f88b 100644 --- a/drivers/acpi/Makefile +++ b/drivers/acpi/Makefile @@ -55,6 +55,7 @@ acpi-$(CONFIG_X86) += x86/apple.o acpi-$(CONFIG_X86) += x86/utils.o acpi-$(CONFIG_DEBUG_FS) += debugfs.o acpi-$(CONFIG_ACPI_NUMA) += numa.o +acpi-$(CONFIG_ACPI_HMAT) += hmat.o acpi-$(CONFIG_ACPI_PROCFS_POWER) += cm_sbs.o acpi-y += acpi_lpat.o acpi-$(CONFIG_ACPI_LPIT) += acpi_lpit.o diff --git a/drivers/acpi/hmat.c b/drivers/acpi/hmat.c new file mode 100644 index 000000000000..9070e84fd30e --- /dev/null +++ b/drivers/acpi/hmat.c @@ -0,0 +1,384 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Heterogeneous Memory Attributes Table (HMAT) representation + * + * Copyright (c) 2018, Intel Corporation. + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +static LIST_HEAD(targets); + +struct memory_target { + struct list_head node; + unsigned int memory_pxm; + unsigned long processor_nodes[BITS_TO_LONGS(MAX_NUMNODES)]; + struct node_hmem_attrs hmem; +}; + +static __init struct memory_target *find_mem_target(unsigned int m) +{ + struct memory_target *t; + + list_for_each_entry(t, &targets, node) + if (t->memory_pxm == m) + return t; + return NULL; +} + +static __init void alloc_memory_target(unsigned int mem_pxm) +{ + struct memory_target *t; + + if (pxm_to_node(mem_pxm) == NUMA_NO_NODE) + return; + + t = find_mem_target(mem_pxm); + if (t) + return; + + t = kzalloc(sizeof(*t), GFP_KERNEL); + if (!t) + return; + + t->memory_pxm = mem_pxm; + list_add_tail(&t->node, &targets); +} + +static __init const char *hmat_data_type(u8 type) +{ + switch (type) { + case ACPI_HMAT_ACCESS_LATENCY: + return "Access Latency"; + case ACPI_HMAT_READ_LATENCY: + return "Read Latency"; + case ACPI_HMAT_WRITE_LATENCY: + return "Write Latency"; + case ACPI_HMAT_ACCESS_BANDWIDTH: + return "Access Bandwidth"; + case ACPI_HMAT_READ_BANDWIDTH: + return "Read Bandwidth"; + case ACPI_HMAT_WRITE_BANDWIDTH: + return "Write Bandwidth"; + default: + return "Reserved"; + }; +} + +static __init const char *hmat_data_type_suffix(u8 type) +{ + switch (type) { + case ACPI_HMAT_ACCESS_LATENCY: + case ACPI_HMAT_READ_LATENCY: + case ACPI_HMAT_WRITE_LATENCY: + return " nsec"; + case ACPI_HMAT_ACCESS_BANDWIDTH: + case ACPI_HMAT_READ_BANDWIDTH: + case ACPI_HMAT_WRITE_BANDWIDTH: + return " MB/s"; + default: + return ""; + }; +} + +static __init void hmat_update_value(u32 *dest, u32 value, bool gt, + bool *update, bool *tie) +{ + if (!(*dest)) { + *update = true; + *dest = value; + } else if (gt && value > *dest) { + *update = true; + *dest = value; + } else if (!gt && value < *dest) { + *update = true; + *dest = value; + } else if (*dest == value) { + *tie = true; + } +} + +static __init void hmat_update_target_access(unsigned int p, unsigned int m, u8 type, + unsigned int value) +{ + struct memory_target *t = find_mem_target(m); + int p_node = pxm_to_node(p); + bool update = false, tie = false; + + if (!t || p_node == NUMA_NO_NODE || !value) + return; + + switch (type) { + case ACPI_HMAT_ACCESS_LATENCY: + hmat_update_value(&t->hmem.read_latency, value, false, + &update, &tie); + hmat_update_value(&t->hmem.write_latency, value, false, + &update, &tie); + break; + case ACPI_HMAT_READ_LATENCY: + hmat_update_value(&t->hmem.read_latency, value, false, + &update, &tie); + break; + case ACPI_HMAT_WRITE_LATENCY: + hmat_update_value(&t->hmem.write_latency, value, false, + &update, &tie); + break; + case ACPI_HMAT_ACCESS_BANDWIDTH: + hmat_update_value(&t->hmem.read_bandwidth, value, true, + &update, &tie); + hmat_update_value(&t->hmem.write_bandwidth, value, true, + &update, &tie); + break; + case ACPI_HMAT_READ_BANDWIDTH: + hmat_update_value(&t->hmem.read_bandwidth, value, true, + &update, &tie); + break; + case ACPI_HMAT_WRITE_BANDWIDTH: + hmat_update_value(&t->hmem.write_bandwidth, value, true, + &update, &tie); + break; + } + + if (update) { + bitmap_zero(t->processor_nodes, MAX_NUMNODES); + set_bit(p_node, t->processor_nodes); + } else if (tie) + set_bit(p_node, t->processor_nodes); +} + +static __init int hmat_parse_locality(struct acpi_subtable_header *header, + const unsigned long end) +{ + struct acpi_hmat_locality *loc = (void *)header; + unsigned int i, j, total_size, ipds, tpds; + u32 *inits, *targs, value; + u16 *entries; + u8 type; + + if (loc->header.length < sizeof(*loc)) { + pr_err("HMAT: Unexpected locality header length: %d\n", + loc->header.length); + return -EINVAL; + } + + type = loc->data_type; + ipds = loc->number_of_initiator_Pds; + tpds = loc->number_of_target_Pds; + total_size = sizeof(*loc) + sizeof(*entries) * ipds * tpds + + sizeof(*inits) * ipds + sizeof(*targs) * tpds; + if (loc->header.length < total_size) { + pr_err("HMAT: Unexpected locality header length:%d, minimum required:%d\n", + loc->header.length, total_size); + return -EINVAL; + } + + pr_info("HMAT: Locality: Flags:%02x Type:%s Initiator Domains:%d Target Domains:%d Base:%lld\n", + loc->flags, hmat_data_type(type), ipds, tpds, + loc->entry_base_unit); + + inits = (u32 *)(loc + 1); + targs = &inits[ipds]; + entries = (u16 *)(&targs[tpds]); + for (i = 0; i < ipds; i++) { + for (j = 0; j < loc->number_of_target_Pds; j++) { + value = (*entries * loc->entry_base_unit) / 10; + pr_info(" Initiator-Target[%d-%d]:%d%s\n", inits[i], + targs[j], value, hmat_data_type_suffix(type)); + + if ((loc->flags & ACPI_HMAT_MEMORY_HIERARCHY) == + ACPI_HMAT_MEMORY) + hmat_update_target_access(inits[i], targs[j], + type, value); + entries++; + } + } + return 0; +} + +static __init int hmat_parse_cache(struct acpi_subtable_header *header, + const unsigned long end) +{ + struct acpi_hmat_cache *cache = (void *)header; + struct node_cache_attrs cache_attrs; + u32 attrs; + + if (cache->header.length < sizeof(*cache)) { + pr_err("HMAT: Unexpected cache header length: %d\n", + cache->header.length); + return -EINVAL; + } + + attrs = cache->cache_attributes; + pr_info("HMAT: Cache: Domain:%d Size:%llu Attrs:%08x SMBIOS Handles:%d\n", + cache->memory_PD, cache->cache_size, attrs, + cache->number_of_SMBIOShandles); + + cache_attrs.size = cache->cache_size; + cache_attrs.level = (attrs & ACPI_HMAT_CACHE_LEVEL) >> 4; + cache_attrs.line_size = (attrs & ACPI_HMAT_CACHE_LINE_SIZE) >> 16; + + switch ((attrs & ACPI_HMAT_CACHE_ASSOCIATIVITY) >> 8) { + case ACPI_HMAT_CA_DIRECT_MAPPED: + cache_attrs.associativity = NODE_CACHE_DIRECT_MAP; + break; + case ACPI_HMAT_CA_COMPLEX_CACHE_INDEXING: + cache_attrs.associativity = NODE_CACHE_INDEXED; + break; + case ACPI_HMAT_CA_NONE: + default: + cache_attrs.associativity = NODE_CACHE_OTHER; + break; + } + switch ((attrs & ACPI_HMAT_WRITE_POLICY) >> 12) { + case ACPI_HMAT_CP_WB: + cache_attrs.write_policy = NODE_CACHE_WRITE_BACK; + break; + case ACPI_HMAT_CP_WT: + cache_attrs.write_policy = NODE_CACHE_WRITE_THROUGH; + break; + case ACPI_HMAT_CP_NONE: + default: + cache_attrs.write_policy = NODE_CACHE_WRITE_OTHER; + break; + } + + node_add_cache(pxm_to_node(cache->memory_PD), &cache_attrs); + return 0; +} + +static int __init hmat_parse_address_range(struct acpi_subtable_header *header, + const unsigned long end) +{ + struct acpi_hmat_address_range *spa = (void *)header; + + if (spa->header.length != sizeof(*spa)) { + pr_err("HMAT: Unexpected address range header length: %d\n", + spa->header.length); + return -EINVAL; + } + pr_info("HMAT: Memory (%#llx length %#llx) Flags:%04x Processor Domain:%d Memory Domain:%d\n", + spa->physical_address_base, spa->physical_address_length, + spa->flags, spa->processor_PD, spa->memory_PD); + return 0; +} + +static int __init hmat_parse_subtable(struct acpi_subtable_header *header, + const unsigned long end) +{ + struct acpi_hmat_structure *hdr = (void *)header; + + if (!hdr) + return -EINVAL; + + switch (hdr->type) { + case ACPI_HMAT_TYPE_ADDRESS_RANGE: + return hmat_parse_address_range(header, end); + case ACPI_HMAT_TYPE_LOCALITY: + return hmat_parse_locality(header, end); + case ACPI_HMAT_TYPE_CACHE: + return hmat_parse_cache(header, end); + default: + return -EINVAL; + } +} + +static __init int srat_parse_mem_affinity(struct acpi_subtable_header *header, + const unsigned long end) +{ + struct acpi_srat_mem_affinity *ma = (void *)header; + + if (!ma) + return -EINVAL; + if (!(ma->flags & ACPI_SRAT_MEM_ENABLED)) + return 0; + alloc_memory_target(ma->proximity_domain); + return 0; +} + +static __init void hmat_register_targets(void) +{ + struct memory_target *t, *next; + unsigned m, p; + + list_for_each_entry_safe(t, next, &targets, node) { + bool hmem_valid = false; + + list_del(&t->node); + m = pxm_to_node(t->memory_pxm); + + for_each_set_bit(p, t->processor_nodes, MAX_NUMNODES) { + if (register_memory_node_under_compute_node(m, p)) + goto free_target; + hmem_valid = true; + } + if (hmem_valid) + node_set_perf_attrs(m, &t->hmem); +free_target: + kfree(t); + } +} + +static __init int parse_noop(struct acpi_table_header *table) +{ + return 0; +} + +static __init int hmat_init(void) +{ + struct acpi_subtable_proc subtable_proc; + struct acpi_table_header *tbl; + enum acpi_hmat_type i; + acpi_status status; + + if (srat_disabled()) + return 0; + + status = acpi_get_table(ACPI_SIG_SRAT, 0, &tbl); + if (ACPI_FAILURE(status)) + return 0; + + if (acpi_table_parse(ACPI_SIG_SRAT, parse_noop)) + goto out_put; + + memset(&subtable_proc, 0, sizeof(subtable_proc)); + subtable_proc.id = ACPI_SRAT_TYPE_MEMORY_AFFINITY; + subtable_proc.handler = srat_parse_mem_affinity; + + if (acpi_table_parse_entries_array(ACPI_SIG_SRAT, + sizeof(struct acpi_table_srat), + &subtable_proc, 1, 0) < 0) + goto out_put; + acpi_put_table(tbl); + + status = acpi_get_table(ACPI_SIG_HMAT, 0, &tbl); + if (ACPI_FAILURE(status)) + return 0; + + if (acpi_table_parse(ACPI_SIG_HMAT, parse_noop)) + goto out_put; + + memset(&subtable_proc, 0, sizeof(subtable_proc)); + subtable_proc.handler = hmat_parse_subtable; + for (i = ACPI_HMAT_TYPE_ADDRESS_RANGE; i < ACPI_HMAT_TYPE_RESERVED; i++) { + subtable_proc.id = i; + if (acpi_table_parse_entries_array(ACPI_SIG_HMAT, + sizeof(struct acpi_table_hmat), + &subtable_proc, 1, 0) < 0) + goto out_put; + } + hmat_register_targets(); + out_put: + acpi_put_table(tbl); + return 0; +} +device_initcall(hmat_init); diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c index 15ee77780f68..a34669a1d47a 100644 --- a/drivers/acpi/tables.c +++ b/drivers/acpi/tables.c @@ -51,10 +51,12 @@ static int acpi_apic_instance __initdata; enum acpi_subtable_type { ACPI_SUBTABLE_COMMON, + ACPI_SUBTABLE_HMAT, }; union acpi_subtable_headers { struct acpi_subtable_header common; + struct acpi_hmat_structure hmat; }; struct acpi_subtable_entry { @@ -236,6 +238,8 @@ acpi_get_entry_type(struct acpi_subtable_entry *entry) switch (entry->type) { case ACPI_SUBTABLE_COMMON: return entry->hdr->common.type; + case ACPI_SUBTABLE_HMAT: + return entry->hdr->hmat.type; } WARN_ONCE(1, "invalid acpi type\n"); return 0; @@ -247,6 +251,8 @@ acpi_get_entry_length(struct acpi_subtable_entry *entry) switch (entry->type) { case ACPI_SUBTABLE_COMMON: return entry->hdr->common.length; + case ACPI_SUBTABLE_HMAT: + return entry->hdr->hmat.length; } WARN_ONCE(1, "invalid acpi type\n"); return 0; @@ -258,6 +264,8 @@ acpi_get_subtable_header_length(struct acpi_subtable_entry *entry) switch (entry->type) { case ACPI_SUBTABLE_COMMON: return sizeof(entry->hdr->common); + case ACPI_SUBTABLE_HMAT: + return sizeof(entry->hdr->hmat); } WARN_ONCE(1, "invalid acpi type\n"); return 0; @@ -266,6 +274,8 @@ acpi_get_subtable_header_length(struct acpi_subtable_entry *entry) static enum acpi_subtable_type __init acpi_get_subtable_type(char *id) { + if (strncmp(id, ACPI_SIG_HMAT, 4) == 0) + return ACPI_SUBTABLE_HMAT; return ACPI_SUBTABLE_COMMON; }