From patchwork Thu Apr 10 01:49:38 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Luis Chamberlain X-Patchwork-Id: 14045744 Received: from bombadil.infradead.org (bombadil.infradead.org [198.137.202.133]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B0E391ADC90; Thu, 10 Apr 2025 01:49:51 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=198.137.202.133 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744249794; cv=none; b=MqpYhuC1RS6BNMv4pBhT3RjsKUsaVS0KedraqkM0vWc6/NR4/b5sAXC4NYzx+duhEKGF3YhRGU7/0TfCH4UI6tHjvT/zekLU1R4/cHbrqAoRooDBUBfw2znY+xhHHDjsK9haHxtlWHvE2Hz6xJTzAUx64zfYC9e7bjtK8ZY6VcU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744249794; c=relaxed/simple; bh=ThRGS12HG1C9teSRCGWbQ8jkrJD1nLkUQAc86uUErSk=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=m/iV/RpXr5VlO/bIeLkzRYvV7QxuHkjPctfqzpGtm7B52Dvc/1PBfRefYbLK+EbFkaJLpMSByNnquCC8hUSnWjvzLV6Ix38CmexdjhpbctGCqJnVDinq6myeaczhZO+HG8Oxeu6Q6E1mWgUBlIsS0n1hOA8yYNRFsmuIF3CvYVw= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=fail (p=quarantine dis=none) header.from=kernel.org; spf=none smtp.mailfrom=infradead.org; dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b=a/4PN9sb; arc=none smtp.client-ip=198.137.202.133 Authentication-Results: smtp.subspace.kernel.org; dmarc=fail (p=quarantine dis=none) header.from=kernel.org Authentication-Results: smtp.subspace.kernel.org; spf=none smtp.mailfrom=infradead.org Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=infradead.org header.i=@infradead.org header.b="a/4PN9sb" DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=bombadil.20210309; h=Sender:Content-Transfer-Encoding: Content-Type:MIME-Version:References:In-Reply-To:Message-ID:Date:Subject:Cc: To:From:Reply-To:Content-ID:Content-Description; bh=CL+6HEhZASTZTm0n+Z7z4M5wQOhYNXjmk5hIzXt/heo=; b=a/4PN9sb92D7cFqc3K7oeSik7t Q+3oQ2p8GkMhEL/NDp8lQAFxC6bkdTJSMpF0WP0WvzM/+GP/1hKhavztS3KblNtKFvUEjpenwScBe Hb1qNVm1TJLT2GecaI6PPeM3SapBsIbUuujrA5qaDs3R2iDv3LFpejYBEfCufwqRvCef1Swk0S2Es 9c6zatte5dgYw84XvL/jwERxw2Ed9nxrnGkYKolyC2QiGz2KtP2MaTvuitMv2Zhcenq9L8mRvVW3r Dbndqjo6JIIUg+j8oo6JD/6d323P3UgBAGgDtmY9129Mgvr9vMfEMQHJt7s+73+P1QI3VfZVZDpCt sK63AXDA==; Received: from mcgrof by bombadil.infradead.org with local (Exim 4.98.2 #2 (Red Hat Linux)) id 1u2h35-00000008yv5-3Gk9; Thu, 10 Apr 2025 01:49:47 +0000 From: Luis Chamberlain To: brauner@kernel.org, jack@suse.cz, tytso@mit.edu, adilger.kernel@dilger.ca, linux-ext4@vger.kernel.org, riel@surriel.com Cc: dave@stgolabs.net, willy@infradead.org, hannes@cmpxchg.org, oliver.sang@intel.com, david@redhat.com, axboe@kernel.dk, hare@suse.de, david@fromorbit.com, djwong@kernel.org, ritesh.list@gmail.com, linux-fsdevel@vger.kernel.org, linux-block@vger.kernel.org, linux-mm@kvack.org, gost.dev@samsung.com, p.raghav@samsung.com, da.gomez@samsung.com, mcgrof@kernel.org, syzbot+f3c6fda1297c748a7076@syzkaller.appspotmail.com Subject: [PATCH v2 1/8] migrate: fix skipping metadata buffer heads on migration Date: Wed, 9 Apr 2025 18:49:38 -0700 Message-ID: <20250410014945.2140781-2-mcgrof@kernel.org> X-Mailer: git-send-email 2.49.0 In-Reply-To: <20250410014945.2140781-1-mcgrof@kernel.org> References: <20250410014945.2140781-1-mcgrof@kernel.org> Precedence: bulk X-Mailing-List: linux-fsdevel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Sender: Luis Chamberlain Filesystems which use buffer-heads where it cannot guarantees that there are no other references to the folio, for example with a folio lock, must use buffer_migrate_folio_norefs() for the address space mapping migrate_folio() callback. There are only 3 filesystems which use this callback: 1) the block device cache 2) ext4 for its ext4_journalled_aops, ie, jbd2 3) nilfs2 jbd2's use of this however callback however is very race prone, consider folio migration while reviewing jbd2_journal_write_metadata_buffer() and the fact that jbd2: - does not hold the folio lock - does not have have page writeback bit set - does not lock the buffer And so, it can race with folio_set_bh() on folio migration. The commit ebdf4de5642fb6 ("mm: migrate: fix reference check race between __find_get_block() and migration") added a spin lock to prevent races with page migration which ext4 users were reporting through the SUSE bugzilla (bnc#1137609 [0]). Although we don't have exact traces of the original filesystem corruption we can can reproduce fs corruption on ext4 by just removing the spinlock and stress testing the filesystem with generic/750, we eventually end up after 3 hours of testing with kdevops using libvirt on the ext4 profiles ext4-4k and ext4-2k. It turns out that the spin lock doesn't in the end protect against corruption, it *helps* reduce the possibility, but ext4 filesystem corruption can still happen even with the spin lock held. A test was done using vanilla Linux and adding a udelay(2000) right before we spin_lock(&bd_mapping->i_private_lock) on __find_get_block_slow() and we can reproduce the same exact filesystem corruption issues as observed without the spinlock with generic/750 [1]. We now have a slew of traces collected for the ext4 corruptions possible without the current spin lock held [2] [3] [4] but the general pattern is as follows, as summarized by ChatGPT from all traces: do_writepages() # write back --> ext4_map_block() # performs logical to physical block mapping --> ext4_ext_insert_extent() # updates extent tree --> jbd2_journal_dirty_metadata() # marks metadata as dirty for # journaling. This can lead # to any of the following hints # as to what happened from # ext4 / jbd2 - Directory and extent metadata corruption splats or - Filure to handle out-of-space conditions gracefully, with cascading metadata errors and eventual filesystem shutdown to prevent further damage. - Failure to journal new extent metadata during extent tree growth, triggered under memory pressure or heavy writeback. Commonly results in ENOSPC, journal abort, and read-only fallback. ** - Journal metadata failure during extent tree growth causes read-only fallback. Seen repeatedly on small-block (2k) filesystems under stress (e.g. fsstress). Triggers errors in bitmap and inode updates, and persists in journal replay logs. "Error count since last fsck" shows long-term corruption footprint. ** Reproduced on vanilla Linux with udelay(2000) ** Call trace (ENOSPC journal failure): do_writepages() → ext4_do_writepages() → ext4_map_blocks() → ext4_ext_map_blocks() → ext4_ext_insert_extent() → __ext4_handle_dirty_metadata() → jbd2_journal_dirty_metadata() → ERROR -28 (ENOSPC) And so jbd2 still needs more work to avoid races with folio migration. So replace the current spin lock solution by just skipping jbd buffers on folio migration. We identify jbd buffers as its the only user of set_buffer_meta() on __ext4_handle_dirty_metadata(). By checking for buffer_meta() and bailing on migration we fix the existing racy ext4 corruption while also removing the spin lock to be held while sleeping complaints originally reported by 0-day [5], and paves the way for buffer-heads for more users of large folios other than the block device cache. Link: https://bugzilla.suse.com/show_bug.cgi?id=1137609 # [0] Link: https://web.git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux.git/commit/?h=20250408-ext4-jbd-force-corruption # [1] Link: https://lkml.kernel.org/r/Z-ZwToVfJbdTVRtG@bombadil.infradead.org # [2] Link: https://lore.kernel.org/all/Z-rzyrS0Jr7t984Y@bombadil.infradead.org/ # [3] Link: https://lore.kernel.org/all/Z_AA9SHZdRGq6tb4@bombadil.infradead.org/ # [4] Reported-by: kernel test robot Reported-by: syzbot+f3c6fda1297c748a7076@syzkaller.appspotmail.com Closes: https://lore.kernel.org/oe-lkp/202503101536.27099c77-lkp@intel.com # [5] Fixes: ebdf4de5642fb6 ("mm: migrate: fix reference check race between __find_get_block() and migration") Signed-off-by: Luis Chamberlain --- mm/migrate.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/mm/migrate.c b/mm/migrate.c index f3ee6d8d5e2e..32fa72ba10b4 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -841,6 +841,9 @@ static int __buffer_migrate_folio(struct address_space *mapping, if (folio_ref_count(src) != expected_count) return -EAGAIN; + if (buffer_meta(head)) + return -EAGAIN; + if (!buffer_migrate_lock_buffers(head, mode)) return -EAGAIN; @@ -859,12 +862,12 @@ static int __buffer_migrate_folio(struct address_space *mapping, } bh = bh->b_this_page; } while (bh != head); + spin_unlock(&mapping->i_private_lock); if (busy) { if (invalidated) { rc = -EAGAIN; goto unlock_buffers; } - spin_unlock(&mapping->i_private_lock); invalidate_bh_lrus(); invalidated = true; goto recheck_buffers; @@ -880,10 +883,7 @@ static int __buffer_migrate_folio(struct address_space *mapping, folio_set_bh(bh, dst, bh_offset(bh)); bh = bh->b_this_page; } while (bh != head); - unlock_buffers: - if (check_refs) - spin_unlock(&mapping->i_private_lock); bh = head; do { unlock_buffer(bh);