From patchwork Wed Feb 23 13:54:43 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Vlastimil Babka X-Patchwork-Id: 12756970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id CC448C4332F for ; Wed, 23 Feb 2022 13:54:47 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 1DEDB8D0006; Wed, 23 Feb 2022 08:54:47 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 167308D0001; Wed, 23 Feb 2022 08:54:47 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id F230D8D0006; Wed, 23 Feb 2022 08:54:46 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0103.hostedemail.com [216.40.44.103]) by kanga.kvack.org (Postfix) with ESMTP id CA5C78D0001 for ; Wed, 23 Feb 2022 08:54:46 -0500 (EST) Received: from smtpin29.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id 53F2A181740B1 for ; Wed, 23 Feb 2022 13:54:46 +0000 (UTC) X-FDA: 79174190172.29.977BD7D Received: from smtp-out1.suse.de (smtp-out1.suse.de [195.135.220.28]) by imf12.hostedemail.com (Postfix) with ESMTP id 4534240003 for ; Wed, 23 Feb 2022 13:54:45 +0000 (UTC) Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by smtp-out1.suse.de (Postfix) with ESMTPS id BE5F621155; Wed, 23 Feb 2022 13:54:43 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_rsa; t=1645624483; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type; bh=BDDtMFSlNaRroadtmjh5Rpea2LPkFe//rdeeAscwciI=; b=1hG1h0oScQB5MQ5JpmLkarELJg9Sd+dGOSWF9Sx/uh3llAxMIDnb4K2uM8fkw4yOUU/zwm Na0iR8+Mx5Q4UAJCX2wi4hCGpmu0EvBT8IwdUTgY8kmVNM2JjqBVpE3cUixDIFBoEuOhfH joUBQVQJi+MmhJCRQEmG2S6P531LddM= DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=suse.cz; s=susede2_ed25519; t=1645624483; h=from:from:reply-to:date:date:message-id:message-id:to:to:cc:cc: mime-version:mime-version:content-type:content-type; bh=BDDtMFSlNaRroadtmjh5Rpea2LPkFe//rdeeAscwciI=; b=Ls2d1Dv6wKxOqma4CFw+uqyp1dbt4ahwDfB7wifMeJu6wkPAFpyUIaXLvGZVElJ/1/91+u dMitCnBiPMsvGNDg== Received: from imap2.suse-dmz.suse.de (imap2.suse-dmz.suse.de [192.168.254.74]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (P-521) server-digest SHA512) (No client certificate requested) by imap2.suse-dmz.suse.de (Postfix) with ESMTPS id 8FB4E13D72; Wed, 23 Feb 2022 13:54:43 +0000 (UTC) Received: from dovecot-director2.suse.de ([192.168.254.65]) by imap2.suse-dmz.suse.de with ESMTPSA id gELmIaM8FmIwOAAAMHmgww (envelope-from ); Wed, 23 Feb 2022 13:54:43 +0000 Message-ID: Date: Wed, 23 Feb 2022 14:54:43 +0100 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.6.1 Content-Language: en-US From: Vlastimil Babka To: Matthew Wilcox , "stable@vger.kernel.org" Cc: Miaohe Lin , Christoph Hellwig , Jan Kara , Takashi Iwai , "linux-mm@kvack.org" , patches@lists.linux.dev, LKML , "Kirill A. Shutemov" Subject: read() data corruption with CONFIG_READ_ONLY_THP_FOR_FS=y Authentication-Results: imf12.hostedemail.com; dkim=pass header.d=suse.cz header.s=susede2_rsa header.b=1hG1h0oS; dkim=pass header.d=suse.cz header.s=susede2_ed25519 header.b=Ls2d1Dv6; dmarc=none; spf=pass (imf12.hostedemail.com: domain of vbabka@suse.cz designates 195.135.220.28 as permitted sender) smtp.mailfrom=vbabka@suse.cz X-Rspam-User: X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: 4534240003 X-Stat-Signature: u1gp8unnk5t36hq4z6wnjd64racz7rxx X-HE-Tag: 1645624485-248944 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Hi, we have found a bug involving CONFIG_READ_ONLY_THP_FOR_FS=y, introduced in 5.12 by cbd59c48ae2b ("mm/filemap: use head pages in generic_file_buffered_read") and apparently fixed in 5.17-rc1 by 6b24ca4a1a8d ("mm: Use multi-index entries in the page cache") The latter commit is part of folio rework so likely not stable material, so it would be nice to have a small fix for e.g. 5.15 LTS. Preferably from someone who understands xarray :) The bug was found while building nodejs16, which involves running some tests, first with a non-stripped node16 binary, and then stripping it. This has a good chance of the stripped result becoming corrupted and not executable anymore due to dynamic loader failures. It turns out that while executed during tests, CONFIG_READ_ONLY_THP_FOR_FS=y allows khugepaged to collapse the large executable mappings. Then /usr/bin/strip uses read() to process the binary and triggers a bug introduced in cbd59c48ae2b where if a read spans two or more collapsed THPs in the page cache, the first one will be read multiple times instead. Takashi Iwai has bisected using the nodejs build scenario to commit cbd59c48ae2b. I have distilled the scenario to the attached reproducer. There are some assumptions for it to work: - the passed path for file it creates/works with should be on a filesystem such as xfs or ext4 that uses generic_file_buffered_read() - the kernel should have e6be37b2e7bd ("mm/huge_memory.c: add missing read-only THP checking in transparent_hugepage_enabled()") otherwise khugepaged will not recognize the reproducer's mm as thp eligible (it had to be some other mapping in nodejs that made it still possible to trigger this during bisect) - there's a pause to allow khugepaged to do its job, you can increase the speed as instructed and verify with /proc/pid/smaps and meminfo that the collapse in page cache has happened - if the bug is reproduced, the reproducer will complain like this: mismatch at offset 2097152: read value expected for offset 0 instead of 2097152 I've hacked some printk on top 5.16 (attached debug.patch) which gives this output: i=0 page=ffffea0004340000 page_offset=0 uoff=0 bytes=2097152 i=1 page=ffffea0004340000 page_offset=0 uoff=0 bytes=2097152 i=2 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0 i=3 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0 i=4 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0 i=5 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0 i=6 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0 i=7 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0 i=8 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0 i=9 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0 i=10 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0 i=11 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0 i=12 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0 i=13 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0 i=14 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0 It seems filemap_get_read_batch() should be returning pages ffffea0004340000 and ffffea0004470000 consecutively in the pvec, but returns the first one 8 times, so it's read twice and then the rest is just skipped over as it's beyond the requested read size. I suspect these lines: xas.xa_index = head->index + thp_nr_pages(head) - 1; xas.xa_offset = (xas.xa_index >> xas.xa_shift) & XA_CHUNK_MASK; commit 6b24ca4a1a8d changes those to xas_advance() (introduced one patch earlier), so some self-contained fix should be possible for prior kernels? But I don't understand xarray well enough. My colleagues should be credited for initial analysis of the nodejs build scenario: Analyzed-by: Adam Majer Analyzed-by: Dirk Mueller Bisected-by: Takashi Iwai Thanks, Vlastimil diff --git a/mm/filemap.c b/mm/filemap.c index 39c4c46c6133..ce39c15e8379 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -2682,6 +2682,11 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter, break; if (i > 0) mark_page_accessed(page); + + if (page_size > PAGE_SIZE) + pr_info("i=%d page=%px page_offset=%lld off=%lu bytes=%lu\n", + i, page, page_offset(page), offset, bytes); + /* * If users can be writing to this page using arbitrary * virtual addresses, take care about potential aliasing