From patchwork Mon Aug 26 03:45:12 2024
X-Patchwork-Submitter: Qi Zheng
X-Patchwork-Id: 13777045
From: Qi Zheng <zhengqi.arch@bytedance.com>
To: david@redhat.com, hughd@google.com, willy@infradead.org,
    muchun.song@linux.dev, vbabka@kernel.org, akpm@linux-foundation.org,
    rppt@kernel.org, vishal.moola@gmail.com, peterx@redhat.com,
    ryan.roberts@arm.com, christophe.leroy2@cs-soprasteria.com
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, Qi Zheng
Subject: [PATCH v2 01/14 update] mm: pgtable: introduce pte_offset_map_{ro|rw}_nolock()
Date: Mon, 26 Aug 2024 11:45:12 +0800
Message-Id: <20240826034512.76917-1-zhengqi.arch@bytedance.com>

Currently, the usage of pte_offset_map_nolock() can be divided into the
following two cases:

1) After acquiring PTL, only read-only operations are performed on the
   PTE page.
   In this case, the RCU lock in pte_offset_map_nolock() will ensure
   that the PTE page will not be freed, and there is no need to worry
   about whether the pmd entry is modified.

2) After acquiring PTL, the pte or pmd entries may be modified. At this
   time, we need to ensure that the pmd entry has not been modified
   concurrently.

To more clearly distinguish between these two cases, this commit
introduces two new helper functions to replace pte_offset_map_nolock().
For 1), just rename it to pte_offset_map_ro_nolock(). For 2), in
addition to changing the name to pte_offset_map_rw_nolock(), it also
outputs the pmdval when successful. This can help the caller recheck
*pmd once the PTL is taken. In some cases, either holding the mmap_lock
for write or a pte_same() check on the contents is also enough to
ensure that the pmd entry is stable. But in order to prevent the
interface from being abused, we choose to pass in a dummy local
variable instead of NULL.

Subsequent commits will convert pte_offset_map_nolock() into the above
two functions one by one, and finally delete it completely.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
Change to use VM_WARN_ON_ONCE() instead of BUG_ON().
(David Hildenbrand)

 Documentation/mm/split_page_table_lock.rst |  7 ++++
 include/linux/mm.h                         |  5 +++
 mm/pgtable-generic.c                       | 43 ++++++++++++++++++++++
 3 files changed, 55 insertions(+)

diff --git a/Documentation/mm/split_page_table_lock.rst b/Documentation/mm/split_page_table_lock.rst
index e4f6972eb6c04..08d0e706a32db 100644
--- a/Documentation/mm/split_page_table_lock.rst
+++ b/Documentation/mm/split_page_table_lock.rst
@@ -19,6 +19,13 @@ There are helpers to lock/unlock a table and other accessor functions:
  - pte_offset_map_nolock()
	maps PTE, returns pointer to PTE with pointer to its PTE table
	lock (not taken), or returns NULL if no PTE table;
+ - pte_offset_map_ro_nolock()
+	maps PTE, returns pointer to PTE with pointer to its PTE table
+	lock (not taken), or returns NULL if no PTE table;
+ - pte_offset_map_rw_nolock()
+	maps PTE, returns pointer to PTE with pointer to its PTE table
+	lock (not taken) and the value of its pmd entry, or returns NULL
+	if no PTE table;
  - pte_offset_map()
	maps PTE, returns pointer to PTE, or returns NULL if no PTE table;
  - pte_unmap()
diff --git a/include/linux/mm.h b/include/linux/mm.h
index da29b066495d6..a00cb35ce065f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2954,6 +2954,11 @@ static inline pte_t *pte_offset_map_lock(struct mm_struct *mm, pmd_t *pmd,
 
 pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
 			unsigned long addr, spinlock_t **ptlp);
+pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
+			unsigned long addr, spinlock_t **ptlp);
+pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
+			unsigned long addr, pmd_t *pmdvalp,
+			spinlock_t **ptlp);
 
 #define pte_unmap_unlock(pte, ptl)	do {		\
 	spin_unlock(ptl);				\
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index a78a4adf711ac..7abcea4451a71 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -317,6 +317,33 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
 	return pte;
 }
 
+pte_t *pte_offset_map_ro_nolock(struct mm_struct *mm, pmd_t *pmd,
+				unsigned long addr, spinlock_t **ptlp)
+{
+	pmd_t pmdval;
+	pte_t *pte;
+
+	pte = __pte_offset_map(pmd, addr, &pmdval);
+	if (likely(pte))
+		*ptlp = pte_lockptr(mm, &pmdval);
+	return pte;
+}
+
+pte_t *pte_offset_map_rw_nolock(struct mm_struct *mm, pmd_t *pmd,
+				unsigned long addr, pmd_t *pmdvalp,
+				spinlock_t **ptlp)
+{
+	pmd_t pmdval;
+	pte_t *pte;
+
+	VM_WARN_ON_ONCE(!pmdvalp);
+	pte = __pte_offset_map(pmd, addr, &pmdval);
+	if (likely(pte))
+		*ptlp = pte_lockptr(mm, &pmdval);
+	*pmdvalp = pmdval;
+	return pte;
+}
+
 /*
  * pte_offset_map_lock(mm, pmd, addr, ptlp), and its internal implementation
  * __pte_offset_map_lock() below, is usually called with the pmd pointer for
@@ -356,6 +383,22 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
  * recheck *pmd once the lock is taken; in practice, no callsite needs that -
  * either the mmap_lock for write, or pte_same() check on contents, is enough.
  *
+ * pte_offset_map_ro_nolock(mm, pmd, addr, ptlp), above, is like
+ * pte_offset_map(); but when successful, it also outputs a pointer to the
+ * spinlock in ptlp - as pte_offset_map_lock() does, but in this case without
+ * locking it. This helps the caller to avoid a later pte_lockptr(mm, *pmd),
+ * which might by that time act on a changed *pmd: pte_offset_map_ro_nolock()
+ * provides the correct spinlock pointer for the page table that it returns.
+ * In the read-only case, the caller does not need to recheck *pmd after the
+ * lock is taken, because the RCU lock ensures that the PTE page is not freed.
+ *
+ * pte_offset_map_rw_nolock(mm, pmd, addr, pmdvalp, ptlp), above, is like
+ * pte_offset_map_ro_nolock(); but when successful, it also outputs the
+ * pmdval. In the maywrite case, where pte or pmd entries may be modified,
+ * this helps the caller recheck *pmd once the lock is taken. In some cases,
+ * either the mmap_lock for write, or a pte_same() check on contents, is also
+ * enough to ensure that the pmd entry is stable.
+ *
  * Note that free_pgtables(), used after unmapping detached vmas, or when
  * exiting the whole mm, does not take page table lock before freeing a page
  * table, and may not use RCU at all: "outsiders" like khugepaged should avoid