From patchwork Thu Aug 1 08:16:57 2024
X-Patchwork-Submitter: Dev Jain <dev.jain@arm.com>
X-Patchwork-Id: 13749961
From: Dev Jain <dev.jain@arm.com>
To: akpm@linux-foundation.org, david@redhat.com, willy@infradead.org
Cc: ryan.roberts@arm.com, anshuman.khandual@arm.com, catalin.marinas@arm.com,
    cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com,
    osalvador@suse.de, baolin.wang@linux.alibaba.com,
    dave.hansen@linux.intel.com, will@kernel.org, baohua@kernel.org,
    ioworker0@gmail.com, gshan@redhat.com, mark.rutland@arm.com,
    kirill.shutemov@linux.intel.com, hughd@google.com,
    aneesh.kumar@kernel.org, yang@os.amperecomputing.com, peterx@redhat.com,
    broonie@kernel.org, linux-arm-kernel@lists.infradead.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    Dev Jain <dev.jain@arm.com>
Subject: Race condition observed between page migration and page fault handling on arm64 machines
Date: Thu, 1 Aug 2024 13:46:57 +0530
Message-Id: <20240801081657.1386743-1-dev.jain@arm.com>
X-Mailer: git-send-email 2.34.1

Ryan and I had a discussion, and we thought it would be best to get
feedback from the community. The migration mm selftest currently fails on
arm64 for shared anon mappings due to the following race:

Migration:                          Page fault:

try_to_migrate_one():               handle_pte_fault():
1. Nuke the PTE                     PTE has been deleted => do_pte_missing()
2. Mark the PTE for migration       PTE has not been deleted, but is just
                                    not "present" => do_swap_page()

In the test, the parent thread migrates a single page between nodes, and
the child threads read from that page. The race happens when the PTE has
been nuked but not yet marked for migration: the reader faults, sees the
PTE as missing, and calls do_pte_missing() instead of the correct
do_swap_page(); the latter would have made the reader call
migration_entry_wait() and wait for the PTE to be corrected.

The former path ends up following this chain:

  do_fault() -> do_read_fault() -> __do_fault() -> vma->vm_ops->fault,
  which is shmem_fault() -> shmem_get_folio_gfp() -> filemap_get_entry()

which then calls folio_try_get() and takes a reference on the folio that is
under migration. On the way back, the reader blocks on folio_lock() since
the migration path holds the lock, and migration ends up failing in
folio_migrate_mapping(), which expects a reference count of 2 on the folio.

The following hack makes the race vanish (the full diff is at the end of
this mail):

 mm/rmap.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

It's a hack because the invalidation is non-atomic, and pte_mkinvalid() is
defined only on arm64.
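For reference, the fault-side dispatch that produces the two outcomes in
the race above looks roughly like this (abridged sketch of
handle_pte_fault() in mm/memory.c, not verbatim; locking, the PMD paths and
the remaining cases are omitted, and details differ between kernel
versions):

static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
{
	/* Lockless snapshot of the PTE that the fault hit. */
	vmf->orig_pte = ptep_get_lockless(vmf->pte);

	/*
	 * Window 1 of the race: try_to_migrate_one() has nuked the PTE but
	 * has not yet installed the migration entry, so the PTE reads as
	 * none and we take the "missing page" path.
	 */
	if (pte_none(vmf->orig_pte))
		return do_pte_missing(vmf);

	/*
	 * Window 2: the migration entry is in place; do_swap_page()
	 * recognises it and ends up in migration_entry_wait().
	 */
	if (!pte_present(vmf->orig_pte))
		return do_swap_page(vmf);

	/* A present PTE continues with normal fault handling. */
	return 0;
}

The race is entirely about which of these two branches the reader hits
during the window opened by try_to_migrate_one().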
I could think of the following solutions:

1. Introduce an (atomic) pte_mkinvalid() for all arches and call the arch
   API if defined, else fall back to ptep_get_and_clear(). The theoretical
   race would still exist for the other arches.

2. Prepare the migration swap entry and do an atomic exchange to fill the
   PTE (this currently seems like the best option to me, but I have not
   tried it out; a rough, untested sketch of what I mean is appended after
   the diff at the end of this mail).

3. In try_to_migrate_one(), freeze the refcount of the folio before the
   nuking. This would solve the race, but then we would have to defer all
   the reference releases until later, and I don't know whether that is
   correct or feasible.

4. Since the extra refcount taken in filemap_get_entry() comes from looking
   the folio up in the pagecache, delete/invalidate the folio in the
   pagecache until migration is complete. I tried to do that with some of
   the helpers in mm/filemap.c, but I guess I am not aware of the
   subtleties, and I end up triggering a BUG() somewhere.

5. In do_fault(), we already check, under an if block, whether this is just
   a temporary clearing of the PTE. We could move that check out of the if
   block, but wouldn't that be a performance bottleneck, since we would be
   taking the PTL?

Thanks,
Dev

diff --git a/mm/rmap.c b/mm/rmap.c
index e8fc5ecb59b2..bb10b3376595 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2126,7 +2126,9 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 			 * transition on a cached TLB entry is written through
 			 * and traps if the PTE is unmapped.
 			 */
-			pteval = ptep_get_and_clear(mm, address, pvmw.pte);
+			pteval = pte_mkinvalid(*(pvmw.pte));
+			*(pvmw.pte) = pteval;
+
 
 			set_tlb_ubc_flush_pending(mm, pteval, address);
 		} else {
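For completeness, here is a rough, untested illustration of what option 2
could look like. Both install_migration_pte() and ptep_xchg() are made-up
names for this sketch; ptep_xchg() is not an existing generic API and just
stands for an arch-provided "atomically exchange the PTE and return the old
value" primitive. Dirty/soft-dirty/uffd-wp propagation, TLB flushing and
large folios are all glossed over:

/*
 * Hypothetical sketch: build the migration entry first and install it over
 * the live PTE in one shot, so there is no window in which the PTE reads
 * as pte_none(). ptep_xchg() does not exist today; it stands for an atomic
 * arch-level PTE exchange (e.g. xchg/cmpxchg on the PTE word).
 */
static pte_t install_migration_pte(struct vm_area_struct *vma,
				   unsigned long address, pte_t *ptep,
				   struct page *subpage, bool writable)
{
	swp_entry_t entry;
	pte_t swp_pte;

	if (writable)
		entry = make_writable_migration_entry(page_to_pfn(subpage));
	else
		entry = make_readable_migration_entry(page_to_pfn(subpage));
	swp_pte = swp_entry_to_pte(entry);

	/*
	 * Single atomic exchange: a concurrent fault observes either the
	 * old present PTE or the migration entry, never an empty PTE, so it
	 * can no longer take the do_pte_missing() path.
	 */
	return ptep_xchg(vma->vm_mm, address, ptep, swp_pte);
}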