From patchwork Mon Jan 20 22:47:29 2025
X-Patchwork-Submitter: Vinay Banakar
X-Patchwork-Id: 13945554
From: Vinay Banakar <vny@google.com>
Date: Mon, 20 Jan 2025 16:47:29 -0600
Subject: [PATCH] mm: Optimize TLB flushes during page reclaim
To: linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: akpm@linux-foundation.org, willy@infradead.org, mgorman@suse.de,
 Wei Xu, Greg Thelen

The current implementation in shrink_folio_list() performs full TLB
flushes and issues IPIs for each individual page being reclaimed.
This causes unnecessary overhead during memory reclaim, whether
triggered by madvise(MADV_PAGEOUT) or kswapd, especially in scenarios
where applications are actively moving cold pages to swap while
maintaining high performance requirements for hot pages.

The current code:
1. Clears the PTE and unmaps each page individually
2. Performs a full TLB flush on all cores using the VMA (via a CR3
   write), or issues individual TLB shootdowns (invlpg+invpcid) for
   single-core usage
3. Submits each page individually to BIO

This approach results in:
- Excessive full TLB flushes across all cores
- Unnecessary IPI storms when processing multiple pages
- Suboptimal I/O submission patterns

I initially tried using selective TLB shootdowns (invlpg) instead of a
full TLB flush for each page, to avoid interfering with other threads.
However, this approach still required sending IPIs to all cores for
each page and did not significantly improve application throughput.

This patch instead batches the operations, issuing one IPI per PMD
rather than per page. Since a PMD maps 512 4KiB pages on x86-64, this
reduces interrupts by a factor of 512 and enables batching page
submissions to BIO. The new approach (illustrated by the sketch further
below):
1. Collects the dirty pages that need to be written back
2. Issues a single TLB flush for all dirty pages in the batch
3. Processes the collected pages for writeback (submits them to BIO)

Testing shows a significant reduction in application throughput impact
during page-out operations: applications maintain better performance
during memory reclaim, particularly when it is triggered by explicit
madvise(MADV_PAGEOUT) calls.

I'd appreciate your feedback on this approach, especially on the
correctness of the batched BIO submissions. Looking forward to your
comments.
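To make the intent concrete, here is a minimal compilable userspace
sketch of the batching pattern. It is purely illustrative: struct page,
tlb_flush_all() and submit_io() are hypothetical stand-ins, not the
kernel's types or functions.

#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for kernel objects, for illustration only. */
struct page { int id; int dirty; };

static void tlb_flush_all(void)       { puts("IPI broadcast + TLB flush"); }
static void submit_io(struct page *p) { printf("submit page %d to BIO\n", p->id); }

/* Old pattern: one flush (one round of IPIs) per dirty page. */
static void reclaim_per_page(struct page *pages, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!pages[i].dirty)
			continue;
		tlb_flush_all();	/* per-page shootdown */
		submit_io(&pages[i]);
	}
}

/* New pattern: one flush for the whole batch (up to 512 pages per PMD). */
static void reclaim_batched(struct page *pages, size_t n)
{
	tlb_flush_all();		/* single shootdown for the batch */
	for (size_t i = 0; i < n; i++)
		if (pages[i].dirty)
			submit_io(&pages[i]);
}

int main(void)
{
	struct page batch[4] = { {0, 1}, {1, 0}, {2, 1}, {3, 1} };

	reclaim_per_page(batch, 4);	/* three flushes */
	reclaim_batched(batch, 4);	/* one flush */
	return 0;
}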
Signed-off-by: Vinay Banakar <vny@google.com>
---
 mm/vmscan.c | 107 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++---------------------------------
 1 file changed, 74 insertions(+), 33 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd489c1af..1bd510622 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1035,6 +1035,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	struct folio_batch free_folios;
 	LIST_HEAD(ret_folios);
 	LIST_HEAD(demote_folios);
+	LIST_HEAD(pageout_list);
 	unsigned int nr_reclaimed = 0;
 	unsigned int pgactivate = 0;
 	bool do_demote_pass;
@@ -1351,39 +1352,9 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			if (!sc->may_writepage)
 				goto keep_locked;
 
-			/*
-			 * Folio is dirty. Flush the TLB if a writable entry
-			 * potentially exists to avoid CPU writes after I/O
-			 * starts and then write it out here.
-			 */
-			try_to_unmap_flush_dirty();
-			switch (pageout(folio, mapping, &plug)) {
-			case PAGE_KEEP:
-				goto keep_locked;
-			case PAGE_ACTIVATE:
-				goto activate_locked;
-			case PAGE_SUCCESS:
-				stat->nr_pageout += nr_pages;
-
-				if (folio_test_writeback(folio))
-					goto keep;
-				if (folio_test_dirty(folio))
-					goto keep;
-
-				/*
-				 * A synchronous write - probably a ramdisk. Go
-				 * ahead and try to reclaim the folio.
-				 */
-				if (!folio_trylock(folio))
-					goto keep;
-				if (folio_test_dirty(folio) ||
-				    folio_test_writeback(folio))
-					goto keep_locked;
-				mapping = folio_mapping(folio);
-				fallthrough;
-			case PAGE_CLEAN:
-				; /* try to free the folio below */
-			}
+			/* Add to pageout list for deferred bio submissions */
+			list_add(&folio->lru, &pageout_list);
+			continue;
 		}
 
 		/*
@@ -1494,6 +1465,76 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	}
 	/* 'folio_list' is always empty here */
 
+	if (!list_empty(&pageout_list)) {
+		/*
+		 * Batch TLB flushes by flushing once before processing all
+		 * dirty pages. Since we operate on one PMD at a time, this
+		 * batches TLB flushes at PMD granularity rather than
+		 * per-page, reducing IPIs.
+		 */
+		struct address_space *mapping;
+		try_to_unmap_flush_dirty();
+
+		while (!list_empty(&pageout_list)) {
+			struct folio *folio = lru_to_folio(&pageout_list);
+			list_del(&folio->lru);
+
+			/* Recheck if page got reactivated */
+			if (folio_test_active(folio) ||
+			    (folio_mapped(folio) && folio_test_young(folio)))
+				goto skip_pageout_locked;
+
+			mapping = folio_mapping(folio);
+			pageout_t pageout_res = pageout(folio, mapping, &plug);
+			switch (pageout_res) {
+			case PAGE_KEEP:
+				goto skip_pageout_locked;
+			case PAGE_ACTIVATE:
+				goto skip_pageout_locked;
+			case PAGE_SUCCESS:
+				stat->nr_pageout += folio_nr_pages(folio);
+
+				if (folio_test_writeback(folio) ||
+				    folio_test_dirty(folio))
+					goto skip_pageout;
+
+				/*
+				 * A synchronous write - probably a ramdisk. Go
+				 * ahead and try to reclaim the folio.
+				 */
+				if (!folio_trylock(folio))
+					goto skip_pageout;
+				if (folio_test_dirty(folio) ||
+				    folio_test_writeback(folio))
+					goto skip_pageout_locked;
+
+				/* Try to free the page */
+				if (!mapping ||
+				    !__remove_mapping(mapping, folio, true,
+						      sc->target_mem_cgroup))
+					goto skip_pageout_locked;
+
+				nr_reclaimed += folio_nr_pages(folio);
+				folio_unlock(folio);
+				continue;
+
+			case PAGE_CLEAN:
+				if (!mapping ||
+				    !__remove_mapping(mapping, folio, true,
+						      sc->target_mem_cgroup))
+					goto skip_pageout_locked;
+
+				nr_reclaimed += folio_nr_pages(folio);
+				folio_unlock(folio);
+				continue;
+			}
+
+skip_pageout_locked:
+			folio_unlock(folio);
+skip_pageout:
+			list_add(&folio->lru, &ret_folios);
+		}
+	}
+
 	/* Migrate folios selected for demotion */
 	nr_reclaimed += demote_folio_list(&demote_folios, pgdat);
 	/* Folios that could not be demoted are still in @demote_folios */
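For reference, the madvise(MADV_PAGEOUT)-driven workload described
above can be reproduced with a small self-contained program such as the
one below. The 64 MiB buffer size is arbitrary, and the fallback define
for MADV_PAGEOUT (value 21, per asm-generic/mman-common.h) is only
needed where libc headers predate Linux 5.4.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21	/* from <linux/mman.h>, Linux >= 5.4 */
#endif

int main(void)
{
	size_t len = 64UL << 20;	/* 64 MiB, arbitrary */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	memset(buf, 1, len);	/* dirty the pages */

	/*
	 * Ask the kernel to reclaim this now-cold range; dirty anonymous
	 * pages go through the reclaim writeback path this patch batches.
	 */
	if (madvise(buf, len, MADV_PAGEOUT))
		perror("madvise(MADV_PAGEOUT)");

	munmap(buf, len);
	return 0;
}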