From patchwork Fri Feb 10 09:07:13 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Dan Williams
X-Patchwork-Id: 13135566
Subject: [PATCH v2 19/20] dax: Assign RAM regions to memory-hotplug by default
From: Dan Williams
To: linux-cxl@vger.kernel.org
Cc: Michal Hocko, David Hildenbrand, Dave Hansen, Gregory Price, Fan Ni,
 vishal.l.verma@intel.com, linux-mm@kvack.org, linux-acpi@vger.kernel.org
Date: Fri, 10 Feb 2023 01:07:13 -0800
Message-ID: <167602003336.1924368.6809503401422267885.stgit@dwillia2-xfh.jf.intel.com>
In-Reply-To: <167601992097.1924368.18291887895351917895.stgit@dwillia2-xfh.jf.intel.com>
References: <167601992097.1924368.18291887895351917895.stgit@dwillia2-xfh.jf.intel.com>
User-Agent: StGit/0.18-3-g996c
Sender: owner-linux-mm@kvack.org
Precedence: bulk

The default mode for device-dax instances is backwards for RAM-regions
as evidenced by the fact that it tends to catch end users by surprise.
"Where is my memory?". Recall that platforms are increasingly shipping
with performance-differentiated memory pools beyond typical DRAM and
NUMA effects. This includes HBM (high-bandwidth-memory) and CXL (dynamic
interleave, varied media types, and future fabric attached
possibilities).

For this reason the EFI_MEMORY_SP (EFI Special Purpose Memory => Linux
'Soft Reserved') attribute is expected to be applied to all memory-pools
that are not the general purpose pool. This designation gives an
Operating System a chance to defer usage of a memory pool until later in
the boot process where its performance properties can be interrogated
and administrator policy can be applied.

'Soft Reserved' memory can be anything from too limited and precious to
be part of the general purpose pool (HBM), too slow to host hot kernel
data structures (some PMEM media), or anything in between. However, in
the absence of an explicit policy, the memory should at least be made
usable by default. The current device-dax default hides all
non-general-purpose memory behind a device interface.

The expectation is that the distribution of users that want the memory
online by default vs device-dedicated-access by default follows the
Pareto principle.
A small number of enlightened users may want to do
userspace memory management through a device, but general users just
want the kernel to make the memory available with an option to get more
advanced later.

Arrange for all device-dax instances not backed by PMEM to default to
attaching to the dax_kmem driver. From there the baseline memory hotplug
policy (CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE / memhp_default_state=)
gates whether the memory comes online or stays offline. Where, if it
stays offline, it can be reliably converted back to device-mode where it
can be partitioned, or fronted by a userspace allocator.

So, if someone wants device-dax instances for their 'Soft Reserved'
memory:

1/ Build a kernel with CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n or boot
   with memhp_default_state=offline, or roll the dice and hope that the
   kernel has not pinned a page in that memory before step 2.

2/ Write a udev rule to convert the target dax device(s) from
   'system-ram' mode to 'devdax' mode:

   daxctl reconfigure-device $dax -m devdax -f

Cc: Michal Hocko
Cc: David Hildenbrand
Cc: Dave Hansen
Reviewed-by: Gregory Price
Tested-by: Fan Ni
Link: https://lore.kernel.org/r/167564544513.847146.4645646177864365755.stgit@dwillia2-xfh.jf.intel.com
Signed-off-by: Dan Williams
Reviewed-by: Dave Jiang
Reviewed-by: Vishal Verma
---
 drivers/dax/Kconfig     |    2 +-
 drivers/dax/bus.c       |   53 ++++++++++++++++++++---------------------------
 drivers/dax/bus.h       |   12 +++++++++--
 drivers/dax/device.c    |    3 +--
 drivers/dax/hmem/hmem.c |   12 ++++++++++-
 drivers/dax/kmem.c      |    1 +
 6 files changed, 46 insertions(+), 37 deletions(-)

diff --git a/drivers/dax/Kconfig b/drivers/dax/Kconfig
index d13c889c2a64..1163eb62e5f6 100644
--- a/drivers/dax/Kconfig
+++ b/drivers/dax/Kconfig
@@ -50,7 +50,7 @@ config DEV_DAX_HMEM_DEVICES
 	def_bool y
 
 config DEV_DAX_KMEM
-	tristate "KMEM DAX: volatile-use of persistent memory"
+	tristate "KMEM DAX: map dax-devices as System-RAM"
 	default DEV_DAX
 	depends on DEV_DAX
 	depends on MEMORY_HOTPLUG # for add_memory() and friends
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 1dad813ee4a6..012d576004e9 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -56,6 +56,25 @@ static int dax_match_id(struct dax_device_driver *dax_drv, struct device *dev)
 	return match;
 }
 
+static int dax_match_type(struct dax_device_driver *dax_drv, struct device *dev)
+{
+	enum dax_driver_type type = DAXDRV_DEVICE_TYPE;
+	struct dev_dax *dev_dax = to_dev_dax(dev);
+
+	if (dev_dax->region->res.flags & IORESOURCE_DAX_KMEM)
+		type = DAXDRV_KMEM_TYPE;
+
+	if (dax_drv->type == type)
+		return 1;
+
+	/* default to device mode if dax_kmem is disabled */
+	if (dax_drv->type == DAXDRV_DEVICE_TYPE &&
+	    !IS_ENABLED(CONFIG_DEV_DAX_KMEM))
+		return 1;
+
+	return 0;
+}
+
 enum id_action {
 	ID_REMOVE,
 	ID_ADD,
@@ -216,14 +235,9 @@ static int dax_bus_match(struct device *dev, struct device_driver *drv)
 {
 	struct dax_device_driver *dax_drv = to_dax_drv(drv);
 
-	/*
-	 * All but the 'device-dax' driver, which has 'match_always'
-	 * set, requires an exact id match.
-	 */
-	if (dax_drv->match_always)
+	if (dax_match_id(dax_drv, dev))
 		return 1;
-
-	return dax_match_id(dax_drv, dev);
+	return dax_match_type(dax_drv, dev);
 }
 
 /*
@@ -1413,13 +1427,10 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
 }
 EXPORT_SYMBOL_GPL(devm_create_dev_dax);
 
-static int match_always_count;
-
 int __dax_driver_register(struct dax_device_driver *dax_drv,
 		struct module *module, const char *mod_name)
 {
 	struct device_driver *drv = &dax_drv->drv;
-	int rc = 0;
 
 	/*
 	 * dax_bus_probe() calls dax_drv->probe() unconditionally.
@@ -1434,26 +1445,7 @@ int __dax_driver_register(struct dax_device_driver *dax_drv,
 	drv->mod_name = mod_name;
 	drv->bus = &dax_bus_type;
 
-	/* there can only be one default driver */
-	mutex_lock(&dax_bus_lock);
-	match_always_count += dax_drv->match_always;
-	if (match_always_count > 1) {
-		match_always_count--;
-		WARN_ON(1);
-		rc = -EINVAL;
-	}
-	mutex_unlock(&dax_bus_lock);
-	if (rc)
-		return rc;
-
-	rc = driver_register(drv);
-	if (rc && dax_drv->match_always) {
-		mutex_lock(&dax_bus_lock);
-		match_always_count -= dax_drv->match_always;
-		mutex_unlock(&dax_bus_lock);
-	}
-
-	return rc;
+	return driver_register(drv);
 }
 EXPORT_SYMBOL_GPL(__dax_driver_register);
 
@@ -1463,7 +1455,6 @@ void dax_driver_unregister(struct dax_device_driver *dax_drv)
 	struct dax_id *dax_id, *_id;
 
 	mutex_lock(&dax_bus_lock);
-	match_always_count -= dax_drv->match_always;
 	list_for_each_entry_safe(dax_id, _id, &dax_drv->ids, list) {
 		list_del(&dax_id->list);
 		kfree(dax_id);
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index fbb940293d6d..8cd79ab34292 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -11,7 +11,10 @@ struct dax_device;
 struct dax_region;
 void dax_region_put(struct dax_region *dax_region);
 
-#define IORESOURCE_DAX_STATIC (1UL << 0)
+/* dax bus specific ioresource flags */
+#define IORESOURCE_DAX_STATIC BIT(0)
+#define IORESOURCE_DAX_KMEM BIT(1)
+
 struct dax_region *alloc_dax_region(struct device *parent, int region_id,
 		struct range *range, int target_node, unsigned int align,
 		unsigned long flags);
@@ -25,10 +28,15 @@ struct dev_dax_data {
 
 struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data);
 
+enum dax_driver_type {
+	DAXDRV_KMEM_TYPE,
+	DAXDRV_DEVICE_TYPE,
+};
+
 struct dax_device_driver {
 	struct device_driver drv;
 	struct list_head ids;
-	int match_always;
+	enum dax_driver_type type;
 	int (*probe)(struct dev_dax *dev);
 	void (*remove)(struct dev_dax *dev);
 };
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 5494d745ced5..ecdff79e31f2 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -475,8 +475,7 @@ EXPORT_SYMBOL_GPL(dev_dax_probe);
 
 static struct dax_device_driver device_dax_driver = {
 	.probe = dev_dax_probe,
-	/* all probe actions are unwound by devm, so .remove isn't necessary */
-	.match_always = 1,
+	.type = DAXDRV_DEVICE_TYPE,
 };
 
 static int __init dax_init(void)
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index e7bdff3132fa..5ec08f9f8a57 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -11,15 +11,25 @@ module_param_named(region_idle, region_idle, bool, 0644);
 
 static int dax_hmem_probe(struct platform_device *pdev)
 {
+	unsigned long flags = IORESOURCE_DAX_KMEM;
 	struct device *dev = &pdev->dev;
 	struct dax_region *dax_region;
 	struct memregion_info *mri;
 	struct dev_dax_data data;
 	struct dev_dax *dev_dax;
 
+	/*
+	 * @region_idle == true indicates that an administrative agent
+	 * wants to manipulate the range partitioning before the devices
+	 * are created, so do not send them to the dax_kmem driver by
+	 * default.
+	 */
+	if (region_idle)
+		flags = 0;
+
 	mri = dev->platform_data;
 	dax_region = alloc_dax_region(dev, pdev->id, &mri->range,
-			mri->target_node, PMD_SIZE, 0);
+				      mri->target_node, PMD_SIZE, flags);
 	if (!dax_region)
 		return -ENOMEM;
 
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 4852a2dbdb27..918d01d3fbaa 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -239,6 +239,7 @@ static void dev_dax_kmem_remove(struct dev_dax *dev_dax)
 static struct dax_device_driver device_dax_kmem_driver = {
 	.probe = dev_dax_kmem_probe,
 	.remove = dev_dax_kmem_remove,
+	.type = DAXDRV_KMEM_TYPE,
 };
 
 static int __init dax_kmem_init(void)
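
For readers who want to reason about the new matching order outside the
kernel, the following is a minimal userspace sketch of the precedence
that dax_bus_match() and dax_match_type() implement in this patch. It is
an illustration only, not kernel code: model_match_type(),
model_bus_match(), model_selfcheck(), and the id_match and kmem_enabled
parameters are hypothetical names invented here. id_match stands in for
a hit in dax_match_id(), and kmem_enabled stands in for
IS_ENABLED(CONFIG_DEV_DAX_KMEM). IORESOURCE_DAX_KMEM is spelled with a
plain shift because BIT() is a kernel macro.

```c
#include <assert.h>
#include <stdbool.h>

/* Same enumerators as the patch adds to drivers/dax/bus.h. */
enum dax_driver_type {
	DAXDRV_KMEM_TYPE,
	DAXDRV_DEVICE_TYPE,
};

/* Userspace stand-in for BIT(1). */
#define IORESOURCE_DAX_KMEM (1UL << 1)

/*
 * Hypothetical model of dax_match_type(): the region flags pick the
 * preferred driver type, and device mode is the fallback when the
 * kmem driver is not built.
 */
static int model_match_type(enum dax_driver_type drv_type,
			    unsigned long region_flags, bool kmem_enabled)
{
	enum dax_driver_type type = DAXDRV_DEVICE_TYPE;

	if (region_flags & IORESOURCE_DAX_KMEM)
		type = DAXDRV_KMEM_TYPE;

	if (drv_type == type)
		return 1;

	/* default to device mode if dax_kmem is disabled */
	if (drv_type == DAXDRV_DEVICE_TYPE && !kmem_enabled)
		return 1;

	return 0;
}

/*
 * Hypothetical model of dax_bus_match(): an explicit id match always
 * wins, otherwise the type default decides.
 */
static int model_bus_match(enum dax_driver_type drv_type,
			   unsigned long region_flags, bool id_match,
			   bool kmem_enabled)
{
	if (id_match)
		return 1;
	return model_match_type(drv_type, region_flags, kmem_enabled);
}

/* Walks through the cases described in the changelog. */
static void model_selfcheck(void)
{
	/* RAM region, kmem built: dax_kmem claims it, device-dax does not. */
	assert(model_bus_match(DAXDRV_KMEM_TYPE, IORESOURCE_DAX_KMEM, false, true));
	assert(!model_bus_match(DAXDRV_DEVICE_TYPE, IORESOURCE_DAX_KMEM, false, true));

	/* RAM region, kmem not built: device-dax is the fallback. */
	assert(model_bus_match(DAXDRV_DEVICE_TYPE, IORESOURCE_DAX_KMEM, false, false));

	/* Region without the flag (e.g. region_idle set): device-dax by default. */
	assert(model_bus_match(DAXDRV_DEVICE_TYPE, 0, false, true));

	/* Explicit id binding to device-dax overrides the type default. */
	assert(model_bus_match(DAXDRV_DEVICE_TYPE, IORESOURCE_DAX_KMEM, true, true));
}
```

Calling model_selfcheck() from a test main() exercises each case. The
last assertion corresponds to step 2/ of the changelog, on the assumption
that the reconfiguration ends with the device id explicitly bound to the
device-dax driver.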