diff mbox series

read() data corruption with CONFIG_READ_ONLY_THP_FOR_FS=y

Message ID df3b5d1c-a36b-2c73-3e27-99e74983de3a@suse.cz (mailing list archive)
State New
Headers show
Series read() data corruption with CONFIG_READ_ONLY_THP_FOR_FS=y | expand

Commit Message

Vlastimil Babka Feb. 23, 2022, 1:54 p.m. UTC
Hi,

we have found a bug involving CONFIG_READ_ONLY_THP_FOR_FS=y, introduced in
5.12 by cbd59c48ae2b ("mm/filemap: use head pages in
generic_file_buffered_read")
and apparently fixed in 5.17-rc1 by 6b24ca4a1a8d ("mm: Use multi-index
entries in the page cache")
The latter commit is part of folio rework so likely not stable material, so
it would be nice to have a small fix for e.g. 5.15 LTS. Preferably from
someone who understands xarray :)

The bug was found while building nodejs16, which involves running some
tests, first with a non-stripped node16 binary, and then stripping it. This
has a good chance of the stripped result becoming corrupted and not
executable anymore due to dynamic loader failures. It turns out that while
executed during tests, CONFIG_READ_ONLY_THP_FOR_FS=y allows khugepaged to
collapse the large executable mappings. Then /usr/bin/strip uses read() to
process the binary and triggers a bug introduced in cbd59c48ae2b where if a
read spans two or more collapsed THPs in the page cache, the first one will
be read multiple times instead.

Takashi Iwai has bisected using the nodejs build scenario to commit
cbd59c48ae2b.

I have distilled the scenario to the attached reproducer. There are some
assumptions for it to work:

- the passed path for file it creates/works with should be on a filesystem
such as xfs or ext4 that uses generic_file_buffered_read()
- the kernel should have e6be37b2e7bd ("mm/huge_memory.c: add missing
read-only THP checking in transparent_hugepage_enabled()") otherwise
khugepaged will not recognize the reproducer's mm as thp eligible (it had to
be some other mapping in nodejs that made it still possible to trigger this
during bisect)
- there's a pause to allow khugepaged to do its job, you can increase the
speed as instructed and verify with /proc/pid/smaps and meminfo that the
collapse in page cache has happened
- if the bug is reproduced, the reproducer will complain like this:
mismatch at offset 2097152: read value expected for offset 0 instead of 2097152

I've hacked some printk on top 5.16 (attached debug.patch)
which gives this output:

i=0 page=ffffea0004340000 page_offset=0 uoff=0 bytes=2097152
i=1 page=ffffea0004340000 page_offset=0 uoff=0 bytes=2097152
i=2 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
i=3 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
i=4 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
i=5 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
i=6 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
i=7 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
i=8 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
i=9 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
i=10 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
i=11 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
i=12 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
i=13 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
i=14 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0

It seems filemap_get_read_batch() should be returning pages ffffea0004340000
and ffffea0004470000 consecutively in the pvec, but returns the first one 8
times, so it's read twice and then the rest is just skipped over as it's
beyond the requested read size.

I suspect these lines:
  xas.xa_index = head->index + thp_nr_pages(head) - 1;
  xas.xa_offset = (xas.xa_index >> xas.xa_shift) & XA_CHUNK_MASK;

commit 6b24ca4a1a8d changes those to xas_advance() (introduced one patch
earlier), so some self-contained fix should be possible for prior kernels?
But I don't understand xarray well enough.

My colleagues should be credited for initial analysis of the nodejs build
scenario:

Analyzed-by: Adam Majer <amajer@suse.com>
Analyzed-by: Dirk Mueller <dmueller@suse.com>
Bisected-by: Takashi Iwai <tiwai@suse.de>

Thanks,
Vlastimil

Comments

Matthew Wilcox Feb. 23, 2022, 2:33 p.m. UTC | #1
On Wed, Feb 23, 2022 at 02:54:43PM +0100, Vlastimil Babka wrote:
> we have found a bug involving CONFIG_READ_ONLY_THP_FOR_FS=y, introduced in
> 5.12 by cbd59c48ae2b ("mm/filemap: use head pages in
> generic_file_buffered_read")
> and apparently fixed in 5.17-rc1 by 6b24ca4a1a8d ("mm: Use multi-index
> entries in the page cache")
> The latter commit is part of folio rework so likely not stable material, so
> it would be nice to have a small fix for e.g. 5.15 LTS. Preferably from
> someone who understands xarray :)

[...]

> I've hacked some printk on top 5.16 (attached debug.patch)
> which gives this output:
> 
> i=0 page=ffffea0004340000 page_offset=0 uoff=0 bytes=2097152
> i=1 page=ffffea0004340000 page_offset=0 uoff=0 bytes=2097152
> i=2 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
> i=3 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
> i=4 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
> i=5 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
> i=6 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
> i=7 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
> i=8 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
> i=9 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
> i=10 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
> i=11 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
> i=12 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
> i=13 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
> i=14 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
> 
> It seems filemap_get_read_batch() should be returning pages ffffea0004340000
> and ffffea0004470000 consecutively in the pvec, but returns the first one 8
> times, so it's read twice and then the rest is just skipped over as it's
> beyond the requested read size.
> 
> I suspect these lines:
>   xas.xa_index = head->index + thp_nr_pages(head) - 1;
>   xas.xa_offset = (xas.xa_index >> xas.xa_shift) & XA_CHUNK_MASK;
> 
> commit 6b24ca4a1a8d changes those to xas_advance() (introduced one patch
> earlier), so some self-contained fix should be possible for prior kernels?
> But I don't understand xarray well enough.

I figured it out!

In v5.15 (indeed, everything before commit 6b24ca4a1a8d), an order-9
page is stored in 512 consecutive slots.  The XArray stores 64 entries
per level.  So what happens is we start looking at index 0 and we walk
down to the bottom of the tree and find the THP at index 0.

                xas.xa_index = head->index + thp_nr_pages(head) - 1;
                xas.xa_offset = (xas.xa_index >> xas.xa_shift) & XA_CHUNK_MASK;

So we've advanced xas.xa_index to 511, but advanced xas.xa_offset to 63.
Then we call xas_next() which calls __xas_next(), which moves us along to
array index 64 while we think we're looking at index 512.

We could make __xas_next() more resistant to this kind of abuse (by
extracting the correct offset in the parent node from xa_index), but
as you say, we're looking for a small fix for LTS.  I suggest this
will probably do the right thing:

+++ b/mm/filemap.c
@@ -2354,8 +2354,7 @@ static void filemap_get_read_batch(struct address_space *mapping,
                        break;
                if (PageReadahead(head))
                        break;
-               xas.xa_index = head->index + thp_nr_pages(head) - 1;
-               xas.xa_offset = (xas.xa_index >> xas.xa_shift) & XA_CHUNK_MASK;
+               xas_set(&xas, head->index + thp_nr_pages(head) - 1);
                continue;
 put_page:
                put_page(head);

but I'll start trying the reproducer now.
Vlastimil Babka Feb. 23, 2022, 5:07 p.m. UTC | #2
On 2/23/22 15:33, Matthew Wilcox wrote:
> On Wed, Feb 23, 2022 at 02:54:43PM +0100, Vlastimil Babka wrote:
>> we have found a bug involving CONFIG_READ_ONLY_THP_FOR_FS=y, introduced in
>> 5.12 by cbd59c48ae2b ("mm/filemap: use head pages in
>> generic_file_buffered_read")
>> and apparently fixed in 5.17-rc1 by 6b24ca4a1a8d ("mm: Use multi-index
>> entries in the page cache")
>> The latter commit is part of folio rework so likely not stable material, so
>> it would be nice to have a small fix for e.g. 5.15 LTS. Preferably from
>> someone who understands xarray :)
> 
> [...]
> 
>> I've hacked some printk on top 5.16 (attached debug.patch)
>> which gives this output:
>> 
>> i=0 page=ffffea0004340000 page_offset=0 uoff=0 bytes=2097152
>> i=1 page=ffffea0004340000 page_offset=0 uoff=0 bytes=2097152
>> i=2 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
>> i=3 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
>> i=4 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
>> i=5 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
>> i=6 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
>> i=7 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
>> i=8 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
>> i=9 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
>> i=10 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
>> i=11 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
>> i=12 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
>> i=13 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
>> i=14 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
>> 
>> It seems filemap_get_read_batch() should be returning pages ffffea0004340000
>> and ffffea0004470000 consecutively in the pvec, but returns the first one 8
>> times, so it's read twice and then the rest is just skipped over as it's
>> beyond the requested read size.
>> 
>> I suspect these lines:
>>   xas.xa_index = head->index + thp_nr_pages(head) - 1;
>>   xas.xa_offset = (xas.xa_index >> xas.xa_shift) & XA_CHUNK_MASK;
>> 
>> commit 6b24ca4a1a8d changes those to xas_advance() (introduced one patch
>> earlier), so some self-contained fix should be possible for prior kernels?
>> But I don't understand xarray well enough.
> 
> I figured it out!
> 
> In v5.15 (indeed, everything before commit 6b24ca4a1a8d), an order-9
> page is stored in 512 consecutive slots.  The XArray stores 64 entries
> per level.  So what happens is we start looking at index 0 and we walk
> down to the bottom of the tree and find the THP at index 0.
> 
>                 xas.xa_index = head->index + thp_nr_pages(head) - 1;
>                 xas.xa_offset = (xas.xa_index >> xas.xa_shift) & XA_CHUNK_MASK;
> 
> So we've advanced xas.xa_index to 511, but advanced xas.xa_offset to 63.
> Then we call xas_next() which calls __xas_next(), which moves us along to
> array index 64 while we think we're looking at index 512.
> 
> We could make __xas_next() more resistant to this kind of abuse (by
> extracting the correct offset in the parent node from xa_index), but
> as you say, we're looking for a small fix for LTS.  I suggest this
> will probably do the right thing:

Great!

Just so others are aware: the final fix is here:
https://lore.kernel.org/all/20220223155918.927140-1-willy@infradead.org/

> +++ b/mm/filemap.c
> @@ -2354,8 +2354,7 @@ static void filemap_get_read_batch(struct address_space *mapping,
>                         break;
>                 if (PageReadahead(head))
>                         break;
> -               xas.xa_index = head->index + thp_nr_pages(head) - 1;
> -               xas.xa_offset = (xas.xa_index >> xas.xa_shift) & XA_CHUNK_MASK;
> +               xas_set(&xas, head->index + thp_nr_pages(head) - 1);
>                 continue;
>  put_page:
>                 put_page(head);
> 
> but I'll start trying the reproducer now.
> 
>
diff mbox series

Patch

diff --git a/mm/filemap.c b/mm/filemap.c
index 39c4c46c6133..ce39c15e8379 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2682,6 +2682,11 @@  ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 				break;
 			if (i > 0)
 				mark_page_accessed(page);
+
+			if (page_size > PAGE_SIZE)
+				pr_info("i=%d page=%px page_offset=%lld off=%lu bytes=%lu\n",
+						i, page, page_offset(page), offset, bytes);
+
 			/*
 			 * If users can be writing to this page using arbitrary
 			 * virtual addresses, take care about potential aliasing