From patchwork Wed Jun 14 06:19:00 2023
X-Patchwork-Submitter: mawupeng
X-Patchwork-Id: 13279549
From: Wupeng Ma
CC: Wei Yang, "Michael S. Tsirkin", Jason Wang, Pankaj Gupta, Michal Hocko, Oscar Salvador
Subject: [PATCH stable 5.10 1/1] mm/memory_hotplug: extend offline_and_remove_memory() to handle more than one memory block
Date: Wed, 14 Jun 2023 14:19:00 +0800
Message-ID: <20230614061900.3296725-2-mawupeng1@huawei.com>
In-Reply-To: <20230614061900.3296725-1-mawupeng1@huawei.com>
References: <20230614061900.3296725-1-mawupeng1@huawei.com>

From: David Hildenbrand

virtio-mem soon wants to use offline_and_remove_memory() on memory that
exceeds a single Linux memory block (memory_block_size_bytes()). Let's
remove that restriction.

Let's remember the old state and try to restore that if anything goes
wrong. While re-onlining can, in general, fail, it's highly unlikely to
happen (usually only when a notifier fails to allocate memory, and these
are rather rare).

This will be used by virtio-mem to offline+remove memory ranges that are
bigger than a single memory block - for example, with a device block
size of 1 GiB (e.g., gigantic pages in the hypervisor) and a Linux memory
block size of 128 MiB.

While we could compress the state into 2 bits, using 8 bits is much
easier.

This handling is similar to, but different from,
acpi_scan_try_to_offline():

a) We don't try to offline twice. I am not sure if this CONFIG_MEMCG
   optimization is still relevant - it should only apply to ZONE_NORMAL
   (where we have no guarantees). If relevant, we can always add it.

b) acpi_scan_try_to_offline() simply onlines all memory in case
   something goes wrong. It doesn't restore the previous online type.
   Let's do that, so we won't overwrite what e.g., user space configured.

Reviewed-by: Wei Yang
Cc: "Michael S. Tsirkin"
Cc: Jason Wang
Cc: Pankaj Gupta
Cc: Michal Hocko
Cc: Oscar Salvador
Cc: Wei Yang
Cc: Andrew Morton
Signed-off-by: David Hildenbrand
Link: https://lore.kernel.org/r/20201112133815.13332-28-david@redhat.com
Signed-off-by: Michael S. Tsirkin
Acked-by: Andrew Morton
Signed-off-by: Ma Wupeng
---
 mm/memory_hotplug.c | 105 +++++++++++++++++++++++++++++++++++++-------
 1 file changed, 89 insertions(+), 16 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f0633f9a9116..9ec9e1e67705 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1788,39 +1788,112 @@ int remove_memory(int nid, u64 start, u64 size)
 }
 EXPORT_SYMBOL_GPL(remove_memory);
 
+static int try_offline_memory_block(struct memory_block *mem, void *arg)
+{
+	uint8_t online_type = MMOP_ONLINE_KERNEL;
+	uint8_t **online_types = arg;
+	struct page *page;
+	int rc;
+
+	/*
+	 * Sense the online_type via the zone of the memory block. Offlining
+	 * with multiple zones within one memory block will be rejected
+	 * by offlining code ... so we don't care about that.
+	 */
+	page = pfn_to_online_page(section_nr_to_pfn(mem->start_section_nr));
+	if (page && zone_idx(page_zone(page)) == ZONE_MOVABLE)
+		online_type = MMOP_ONLINE_MOVABLE;
+
+	rc = device_offline(&mem->dev);
+	/*
+	 * Default is MMOP_OFFLINE - change it only if offlining succeeded,
+	 * so try_reonline_memory_block() can do the right thing.
+	 */
+	if (!rc)
+		**online_types = online_type;
+
+	(*online_types)++;
+	/* Ignore if already offline. */
+	return rc < 0 ? rc : 0;
+}
+
+static int try_reonline_memory_block(struct memory_block *mem, void *arg)
+{
+	uint8_t **online_types = arg;
+	int rc;
+
+	if (**online_types != MMOP_OFFLINE) {
+		mem->online_type = **online_types;
+		rc = device_online(&mem->dev);
+		if (rc < 0)
+			pr_warn("%s: Failed to re-online memory: %d",
+				__func__, rc);
+	}
+
+	/* Continue processing all remaining memory blocks. */
+	(*online_types)++;
+	return 0;
+}
+
 /*
- * Try to offline and remove a memory block. Might take a long time to
- * finish in case memory is still in use. Primarily useful for memory devices
- * that logically unplugged all memory (so it's no longer in use) and want to
- * offline + remove the memory block.
+ * Try to offline and remove memory. Might take a long time to finish in case
+ * memory is still in use. Primarily useful for memory devices that logically
+ * unplugged all memory (so it's no longer in use) and want to offline + remove
+ * that memory.
  */
 int offline_and_remove_memory(int nid, u64 start, u64 size)
 {
-	struct memory_block *mem;
-	int rc = -EINVAL;
+	const unsigned long mb_count = size / memory_block_size_bytes();
+	uint8_t *online_types, *tmp;
+	int rc;
 
 	if (!IS_ALIGNED(start, memory_block_size_bytes()) ||
-	    size != memory_block_size_bytes())
-		return rc;
+	    !IS_ALIGNED(size, memory_block_size_bytes()) || !size)
+		return -EINVAL;
+
+	/*
+	 * We'll remember the old online type of each memory block, so we can
+	 * try to revert whatever we did when offlining one memory block fails
+	 * after offlining some others succeeded.
+	 */
+	online_types = kmalloc_array(mb_count, sizeof(*online_types),
+				     GFP_KERNEL);
+	if (!online_types)
+		return -ENOMEM;
+	/*
+	 * Initialize all states to MMOP_OFFLINE, so when we abort processing in
+	 * try_offline_memory_block(), we'll skip all unprocessed blocks in
+	 * try_reonline_memory_block().
+	 */
+	memset(online_types, MMOP_OFFLINE, mb_count);
 
 	lock_device_hotplug();
-	mem = find_memory_block(__pfn_to_section(PFN_DOWN(start)));
-	if (mem)
-		rc = device_offline(&mem->dev);
-	/* Ignore if the device is already offline. */
-	if (rc > 0)
-		rc = 0;
+
+	tmp = online_types;
+	rc = walk_memory_blocks(start, size, &tmp, try_offline_memory_block);
 
 	/*
-	 * In case we succeeded to offline the memory block, remove it.
+	 * In case we succeeded to offline all memory, remove it.
 	 * This cannot fail as it cannot get onlined in the meantime.
 	 */
 	if (!rc) {
 		rc = try_remove_memory(nid, start, size);
-		WARN_ON_ONCE(rc);
+		if (rc)
+			pr_err("%s: Failed to remove memory: %d", __func__, rc);
+	}
+
+	/*
+	 * Rollback what we did. While memory onlining might theoretically fail
+	 * (nacked by a notifier), it barely ever happens.
+	 */
+	if (rc) {
+		tmp = online_types;
+		walk_memory_blocks(start, size, &tmp,
+				   try_reonline_memory_block);
 	}
 	unlock_device_hotplug();
 
+	kfree(online_types);
 	return rc;
 }
 EXPORT_SYMBOL_GPL(offline_and_remove_memory);
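
For readers who want to poke at the rollback logic outside the kernel, here is a rough userspace sketch of the same idea: remember each block's previous online type in a byte array pre-filled with "offline", stop at the first block that refuses to go offline, and re-online only the blocks this call actually changed. Everything named sim_* (and the fail_offline knob) is hypothetical and exists only for illustration; it does not model device_offline(), walk_memory_blocks(), notifiers or locking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's MMOP_* online types. */
enum { SIM_OFFLINE = 0, SIM_ONLINE_KERNEL = 1, SIM_ONLINE_MOVABLE = 2 };

/* Hypothetical model of one memory block; not a kernel structure. */
struct sim_block {
	uint8_t state;		/* current online type, SIM_OFFLINE if offline */
	int fail_offline;	/* non-zero: offlining this block fails */
};

/* Number of memory blocks covering a range; 0 if misaligned or empty. */
static size_t sim_block_count(uint64_t start, uint64_t size,
			      uint64_t block_size)
{
	if (!size || start % block_size || size % block_size)
		return 0;
	return (size_t)(size / block_size);
}

/*
 * Offline all blocks or none. Returns 0 on success and -1 on failure,
 * in which case every block offlined by this call is back in its old state.
 */
static int sim_offline_all(struct sim_block *blocks, size_t count)
{
	uint8_t *old_types;
	size_t i;
	int rc = 0;

	if (!count)
		return -1;
	old_types = malloc(count);
	if (!old_types)
		return -1;
	/* SIM_OFFLINE doubles as "nothing to restore" for untouched blocks. */
	memset(old_types, SIM_OFFLINE, count);

	for (i = 0; i < count; i++) {
		if (blocks[i].state == SIM_OFFLINE)
			continue;	/* already offline: not an error */
		if (blocks[i].fail_offline) {
			rc = -1;	/* abort; later blocks stay untouched */
			break;
		}
		old_types[i] = blocks[i].state;
		blocks[i].state = SIM_OFFLINE;
	}

	if (rc) {
		/* Roll back only what we changed, restoring the old type. */
		for (i = 0; i < count; i++)
			if (old_types[i] != SIM_OFFLINE)
				blocks[i].state = old_types[i];
	}

	free(old_types);
	return rc;
}
```

With the sizes from the commit message, sim_block_count(0, 1ULL << 30, 128ULL << 20) gives 8 blocks per 1 GiB device block. A block that was already offline before the call stays offline after a rollback, which mirrors why the patch only records an online type when offlining actually succeeded.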