From patchwork Mon Jul 16 17:00:42 2018
X-Patchwork-Submitter: Dan Williams
X-Patchwork-Id: 10527233
Subject: [PATCH v2 04/14] mm: Multithread ZONE_DEVICE initialization
From: Dan Williams <dan.j.williams@intel.com>
To: akpm@linux-foundation.org
Cc: Michal Hocko, Daniel Jordan, Pavel Tatashin, vishal.l.verma@intel.com,
 hch@lst.de, linux-mm@kvack.org, jack@suse.cz, linux-nvdimm@lists.01.org,
 linux-kernel@vger.kernel.org
Date: Mon, 16 Jul 2018 10:00:42 -0700
Message-ID: <153176044282.12695.11215811746294680749.stgit@dwillia2-desk3.amr.corp.intel.com>
In-Reply-To: <153176041838.12695.3365448145295112857.stgit@dwillia2-desk3.amr.corp.intel.com>
References: <153176041838.12695.3365448145295112857.stgit@dwillia2-desk3.amr.corp.intel.com>
User-Agent: StGit/0.18-2-gc94f
MIME-Version: 1.0

On large / multi-socket persistent memory systems it can take minutes to
initialize the memmap. Even though such systems have multiple persistent
memory namespaces that are registered asynchronously, they serialize on
the mem_hotplug_begin() lock. The method for hiding memmap initialization
in the typical memory case cannot be directly reused for persistent
memory.
In the typical / volatile memory case, pages are freed to the memory
allocator in the background as they become initialized. For persistent
memory the aim is to push everything to the background, but since it is
dax-mapped there is no way to redirect applications to limit their usage
to the initialized set, i.e. any address may be directly accessed at any
time.

The bulk of the work is memmap_init_zone(). Splitting that work across
threads yields a 1.5x to 2x improvement in the time to initialize a
128GB namespace. However, the work is still serialized when there are
multiple namespaces, and it is ultimately limited by memory-media write
bandwidth. So this commit is only a preparation step towards ultimately
moving all memmap initialization completely into the background.

Cc: Andrew Morton
Cc: Michal Hocko
Cc: Daniel Jordan
Cc: Pavel Tatashin
Signed-off-by: Dan Williams
---
 include/linux/memmap_async.h |   25 +++++++
 mm/page_alloc.c              |  147 ++++++++++++++++++++++++++++--------------
 2 files changed, 123 insertions(+), 49 deletions(-)

diff --git a/include/linux/memmap_async.h b/include/linux/memmap_async.h
index c641b80a3c24..2b1a0636d5bb 100644
--- a/include/linux/memmap_async.h
+++ b/include/linux/memmap_async.h
@@ -2,18 +2,33 @@
 #ifndef __LINUX_MEMMAP_ASYNC_H
 #define __LINUX_MEMMAP_ASYNC_H
 #include
+#include
 
+struct dev_pagemap;
 struct vmem_altmap;
 
+/*
+ * Regardless of how many threads we request here the workqueue core may
+ * limit based on the amount of other concurrent 'async' work in the
+ * system, see WQ_MAX_ACTIVE
+ */
+#define NR_MEMMAP_THREADS 8
+
 /**
  * struct memmap_init_env - common global data for all async memmap operations
  * @altmap: set-aside / alternative memory for allocating the memmap
+ * @pgmap: pgmap for describing ZONE_DEVICE ranges
  * @want_memblock: typical memory (!ZONE_DEVICE) is onlined via memblock
+ * @zone: zone number when initializing a given struct page
+ * @context: indicate hotplug vs early boot memmap init
  * @nid: home node for the memory range
  */
 struct memmap_init_env {
 	struct vmem_altmap *altmap;
+	struct dev_pagemap *pgmap;
 	bool want_memblock;
+	unsigned long zone;
+	enum memmap_context context;
 	int nid;
 };
 
@@ -34,6 +49,16 @@ struct memmap_init_memmap {
 };
 
 /**
+ * struct memmap_init_pages - arguments for async 'struct page' init
+ * @res: range for one instance of memmap_init_async() to operate
+ * @env: link to thread range invariant parameters
+ */
+struct memmap_init_pages {
+	struct resource res;
+	struct memmap_init_env *env;
+};
+
+/**
  * struct memmap_async_state - support and track async memmap operations
  * @env: storage for common memmap init parameters
  * @memmap: storage for background page-table setup operations
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fb45cfeb4a50..71e3f01a1548 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -38,6 +38,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -5455,6 +5456,70 @@ void __ref build_all_zonelists(pg_data_t *pgdat)
 
 ASYNC_DOMAIN_EXCLUSIVE(memmap_init_domain);
 
+static void __meminit memmap_init_one(unsigned long pfn, unsigned long zone,
+		int nid, enum memmap_context context, struct dev_pagemap *pgmap)
+{
+	struct page *page = pfn_to_page(pfn);
+
+	__init_single_page(page, pfn, zone, nid);
+	if (context == MEMMAP_HOTPLUG)
+		SetPageReserved(page);
+
+	/*
+	 * Mark the block movable so that blocks are reserved for
+	 * movable at startup. This will force kernel allocations to
+	 * reserve their blocks rather than leaking throughout the
+	 * address space during boot when many long-lived kernel
+	 * allocations are made.
+	 *
+	 * bitmap is created for zone's valid pfn range. but memmap can
+	 * be created for invalid pages (for alignment) check here not
+	 * to call set_pageblock_migratetype() against pfn out of zone.
+	 *
+	 * Please note that MEMMAP_HOTPLUG path doesn't clear memmap
+	 * because this is done early in sparse_add_one_section
+	 */
+	if (!(pfn & (pageblock_nr_pages - 1))) {
+		set_pageblock_migratetype(page, MIGRATE_MOVABLE);
+		cond_resched();
+	}
+
+	if (is_zone_device_page(page)) {
+		struct vmem_altmap *altmap = &pgmap->altmap;
+
+		if (WARN_ON_ONCE(!pgmap))
+			return;
+
+		/* skip invalid device pages */
+		if (pgmap->altmap_valid && (pfn < (altmap->base_pfn
+						+ vmem_altmap_offset(altmap))))
+			return;
+		/*
+		 * ZONE_DEVICE pages union ->lru with a ->pgmap back
+		 * pointer. It is a bug if a ZONE_DEVICE page is ever
+		 * freed or placed on a driver-private list. Seed the
+		 * storage with poison.
+		 */
+		page->lru.prev = LIST_POISON2;
+		page->pgmap = pgmap;
+		percpu_ref_get(pgmap->ref);
+	}
+}
+
+static void __ref memmap_init_async(void *data, async_cookie_t cookie)
+{
+	struct memmap_init_pages *args = data;
+	struct memmap_init_env *env = args->env;
+	struct resource *res = &args->res;
+	unsigned long pfn, start, end;
+
+	start = PHYS_PFN(res->start);
+	end = PHYS_PFN(res->end+1);
+	for (pfn = start; pfn < end; pfn++)
+		memmap_init_one(pfn, env->zone, env->nid, env->context,
+				env->pgmap);
+}
+
 /*
  * Initially all pages are reserved - free ones are freed
  * up by free_all_bootmem() once the early boot process is
@@ -5469,7 +5534,6 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 	struct vmem_altmap *altmap = NULL;
 	unsigned long pfn;
 	unsigned long nr_initialised = 0;
-	struct page *page;
 #ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
 	struct memblock_region *r = NULL, *tmp;
 #endif
@@ -5486,14 +5550,43 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 	if (altmap && start_pfn == altmap->base_pfn)
 		start_pfn += altmap->reserve;
 
-	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
+	if (context != MEMMAP_EARLY) {
 		/*
 		 * There can be holes in boot-time mem_map[]s handed to this
 		 * function.  They do not exist on hotplugged memory.
 		 */
-		if (context != MEMMAP_EARLY)
-			goto not_early;
+		ASYNC_DOMAIN_EXCLUSIVE(local);
+		struct memmap_init_pages args[NR_MEMMAP_THREADS];
+		struct memmap_init_env env = {
+			.nid = nid,
+			.zone = zone,
+			.pgmap = pgmap,
+			.context = context,
+		};
+		unsigned long step, rem;
+		int i;
+
+		size = end_pfn - start_pfn;
+		step = size / NR_MEMMAP_THREADS;
+		rem = size % NR_MEMMAP_THREADS;
+		for (i = 0; i < NR_MEMMAP_THREADS; i++) {
+			struct memmap_init_pages *t = &args[i];
+
+			t->env = &env;
+			t->res.start = PFN_PHYS(start_pfn);
+			t->res.end = PFN_PHYS(start_pfn + step) - 1;
+			if (i == NR_MEMMAP_THREADS-1)
+				t->res.end += PFN_PHYS(rem);
+
+			async_schedule_domain(memmap_init_async, t, &local);
+
+			start_pfn += step;
+		}
+		async_synchronize_full_domain(&local);
+		return;
+	}
+	for (pfn = start_pfn; pfn < end_pfn; pfn++) {
 		if (!early_pfn_valid(pfn))
 			continue;
 		if (!early_pfn_in_nid(pfn, nid))
@@ -5522,51 +5615,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
 			}
 		}
 #endif
-
-not_early:
-		page = pfn_to_page(pfn);
-		__init_single_page(page, pfn, zone, nid);
-		if (context == MEMMAP_HOTPLUG)
-			SetPageReserved(page);
-
-		/*
-		 * Mark the block movable so that blocks are reserved for
-		 * movable at startup. This will force kernel allocations
-		 * to reserve their blocks rather than leaking throughout
-		 * the address space during boot when many long-lived
-		 * kernel allocations are made.
-		 *
-		 * bitmap is created for zone's valid pfn range. but memmap
-		 * can be created for invalid pages (for alignment)
-		 * check here not to call set_pageblock_migratetype() against
-		 * pfn out of zone.
-		 *
-		 * Please note that MEMMAP_HOTPLUG path doesn't clear memmap
-		 * because this is done early in sparse_add_one_section
-		 */
-		if (!(pfn & (pageblock_nr_pages - 1))) {
-			set_pageblock_migratetype(page, MIGRATE_MOVABLE);
-			cond_resched();
-		}
-
-		if (is_zone_device_page(page)) {
-			if (WARN_ON_ONCE(!pgmap))
-				continue;
-
-			/* skip invalid device pages */
-			if (altmap && (pfn < (altmap->base_pfn
-					+ vmem_altmap_offset(altmap))))
-				continue;
-			/*
-			 * ZONE_DEVICE pages union ->lru with a ->pgmap back
-			 * pointer. It is a bug if a ZONE_DEVICE page is ever
-			 * freed or placed on a driver-private list. Seed the
-			 * storage with poison.
-			 */
-			page->lru.prev = LIST_POISON2;
-			page->pgmap = pgmap;
-			percpu_ref_get(pgmap->ref);
-		}
+		memmap_init_one(pfn, zone, nid, context, NULL);
 	}
 }
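
For readers unfamiliar with the splitting scheme used in memmap_init_zone()
above, here is a minimal standalone userspace sketch (not part of the patch)
of the same range-partitioning arithmetic: the pfn span is divided into
NR_MEMMAP_THREADS equal steps, the last worker absorbs the remainder, and all
workers are joined before returning. pthread_create()/pthread_join() merely
stand in for the kernel's async_schedule_domain() /
async_synchronize_full_domain(); struct chunk, init_chunk() and init_one()
are hypothetical names invented for the illustration.

/*
 * Userspace analogy of the pfn-range splitting in memmap_init_zone().
 * init_one() is a placeholder for memmap_init_one().
 * Build with: cc -pthread example.c
 */
#include <pthread.h>
#include <stdio.h>

#define NR_MEMMAP_THREADS 8	/* same thread-count cap the patch requests */

struct chunk {
	unsigned long start_pfn;	/* inclusive */
	unsigned long end_pfn;		/* exclusive */
};

/* placeholder for the per-pfn initialization work */
static void init_one(unsigned long pfn)
{
	(void)pfn;
}

static void *init_chunk(void *arg)
{
	struct chunk *c = arg;
	unsigned long pfn;

	for (pfn = c->start_pfn; pfn < c->end_pfn; pfn++)
		init_one(pfn);
	return NULL;
}

int main(void)
{
	/* example span: 1M pfns (4GB of 4K pages) starting at pfn 0 */
	unsigned long start_pfn = 0, end_pfn = 1UL << 20;
	unsigned long size = end_pfn - start_pfn;
	unsigned long step = size / NR_MEMMAP_THREADS;
	unsigned long rem = size % NR_MEMMAP_THREADS;
	struct chunk chunks[NR_MEMMAP_THREADS];
	pthread_t threads[NR_MEMMAP_THREADS];
	int i;

	for (i = 0; i < NR_MEMMAP_THREADS; i++) {
		chunks[i].start_pfn = start_pfn;
		chunks[i].end_pfn = start_pfn + step;
		/* as in the patch, the last worker absorbs the remainder */
		if (i == NR_MEMMAP_THREADS - 1)
			chunks[i].end_pfn += rem;
		pthread_create(&threads[i], NULL, init_chunk, &chunks[i]);
		start_pfn += step;
	}

	/* analogous to async_synchronize_full_domain(): wait for all workers */
	for (i = 0; i < NR_MEMMAP_THREADS; i++)
		pthread_join(threads[i], NULL);

	printf("initialized %lu pfns across %d threads\n", size, NR_MEMMAP_THREADS);
	return 0;
}

As in the patch, an uneven division only affects the final chunk; every other
worker touches exactly 'step' pfns.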