Message ID | df3b5d1c-a36b-2c73-3e27-99e74983de3a@suse.cz (mailing list archive)
---|---
State | New
Series | read() data corruption with CONFIG_READ_ONLY_THP_FOR_FS=y
On Wed, Feb 23, 2022 at 02:54:43PM +0100, Vlastimil Babka wrote:
> we have found a bug involving CONFIG_READ_ONLY_THP_FOR_FS=y, introduced in
> 5.12 by cbd59c48ae2b ("mm/filemap: use head pages in
> generic_file_buffered_read")
> and apparently fixed in 5.17-rc1 by 6b24ca4a1a8d ("mm: Use multi-index
> entries in the page cache")
> The latter commit is part of folio rework so likely not stable material, so
> it would be nice to have a small fix for e.g. 5.15 LTS. Preferably from
> someone who understands xarray :)

[...]

> I've hacked some printk on top 5.16 (attached debug.patch)
> which gives this output:
>
> i=0 page=ffffea0004340000 page_offset=0 uoff=0 bytes=2097152
> i=1 page=ffffea0004340000 page_offset=0 uoff=0 bytes=2097152
> i=2 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
> i=3 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
> i=4 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
> i=5 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
> i=6 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
> i=7 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
> i=8 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
> i=9 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
> i=10 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
> i=11 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
> i=12 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
> i=13 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
> i=14 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
>
> It seems filemap_get_read_batch() should be returning pages ffffea0004340000
> and ffffea0004470000 consecutively in the pvec, but returns the first one 8
> times, so it's read twice and then the rest is just skipped over as it's
> beyond the requested read size.
>
> I suspect these lines:
> xas.xa_index = head->index + thp_nr_pages(head) - 1;
> xas.xa_offset = (xas.xa_index >> xas.xa_shift) & XA_CHUNK_MASK;
>
> commit 6b24ca4a1a8d changes those to xas_advance() (introduced one patch
> earlier), so some self-contained fix should be possible for prior kernels?
> But I don't understand xarray well enough.

I figured it out!

In v5.15 (indeed, everything before commit 6b24ca4a1a8d), an order-9
page is stored in 512 consecutive slots.  The XArray stores 64 entries
per level.  So what happens is we start looking at index 0 and we walk
down to the bottom of the tree and find the THP at index 0.

	xas.xa_index = head->index + thp_nr_pages(head) - 1;
	xas.xa_offset = (xas.xa_index >> xas.xa_shift) & XA_CHUNK_MASK;

So we've advanced xas.xa_index to 511, but advanced xas.xa_offset to 63.
Then we call xas_next() which calls __xas_next(), which moves us along to
array index 64 while we think we're looking at index 512.

We could make __xas_next() more resistant to this kind of abuse (by
extracting the correct offset in the parent node from xa_index), but
as you say, we're looking for a small fix for LTS.  I suggest this
will probably do the right thing:

+++ b/mm/filemap.c
@@ -2354,8 +2354,7 @@ static void filemap_get_read_batch(struct address_space *mapping,
 			break;
 		if (PageReadahead(head))
 			break;
-		xas.xa_index = head->index + thp_nr_pages(head) - 1;
-		xas.xa_offset = (xas.xa_index >> xas.xa_shift) & XA_CHUNK_MASK;
+		xas_set(&xas, head->index + thp_nr_pages(head) - 1);
 		continue;
put_page:
 		put_page(head);

but I'll start trying the reproducer now.
On 2/23/22 15:33, Matthew Wilcox wrote:
> On Wed, Feb 23, 2022 at 02:54:43PM +0100, Vlastimil Babka wrote:
>> we have found a bug involving CONFIG_READ_ONLY_THP_FOR_FS=y, introduced in
>> 5.12 by cbd59c48ae2b ("mm/filemap: use head pages in
>> generic_file_buffered_read")
>> and apparently fixed in 5.17-rc1 by 6b24ca4a1a8d ("mm: Use multi-index
>> entries in the page cache")
>> The latter commit is part of folio rework so likely not stable material, so
>> it would be nice to have a small fix for e.g. 5.15 LTS. Preferably from
>> someone who understands xarray :)
>
> [...]
>
>> I've hacked some printk on top 5.16 (attached debug.patch)
>> which gives this output:
>>
>> i=0 page=ffffea0004340000 page_offset=0 uoff=0 bytes=2097152
>> i=1 page=ffffea0004340000 page_offset=0 uoff=0 bytes=2097152
>> i=2 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
>> i=3 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
>> i=4 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
>> i=5 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
>> i=6 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
>> i=7 page=ffffea0004340000 page_offset=0 uoff=0 bytes=0
>> i=8 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
>> i=9 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
>> i=10 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
>> i=11 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
>> i=12 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
>> i=13 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
>> i=14 page=ffffea0004470000 page_offset=2097152 uoff=0 bytes=0
>>
>> It seems filemap_get_read_batch() should be returning pages ffffea0004340000
>> and ffffea0004470000 consecutively in the pvec, but returns the first one 8
>> times, so it's read twice and then the rest is just skipped over as it's
>> beyond the requested read size.
>>
>> I suspect these lines:
>> xas.xa_index = head->index + thp_nr_pages(head) - 1;
>> xas.xa_offset = (xas.xa_index >> xas.xa_shift) & XA_CHUNK_MASK;
>>
>> commit 6b24ca4a1a8d changes those to xas_advance() (introduced one patch
>> earlier), so some self-contained fix should be possible for prior kernels?
>> But I don't understand xarray well enough.
>
> I figured it out!
>
> In v5.15 (indeed, everything before commit 6b24ca4a1a8d), an order-9
> page is stored in 512 consecutive slots.  The XArray stores 64 entries
> per level.  So what happens is we start looking at index 0 and we walk
> down to the bottom of the tree and find the THP at index 0.
>
>	xas.xa_index = head->index + thp_nr_pages(head) - 1;
>	xas.xa_offset = (xas.xa_index >> xas.xa_shift) & XA_CHUNK_MASK;
>
> So we've advanced xas.xa_index to 511, but advanced xas.xa_offset to 63.
> Then we call xas_next() which calls __xas_next(), which moves us along to
> array index 64 while we think we're looking at index 512.
>
> We could make __xas_next() more resistant to this kind of abuse (by
> extracting the correct offset in the parent node from xa_index), but
> as you say, we're looking for a small fix for LTS.  I suggest this
> will probably do the right thing:

Great! Just so others are aware: the final fix is here:
https://lore.kernel.org/all/20220223155918.927140-1-willy@infradead.org/

> +++ b/mm/filemap.c
> @@ -2354,8 +2354,7 @@ static void filemap_get_read_batch(struct address_space *mapping,
>  			break;
>  		if (PageReadahead(head))
>  			break;
> -		xas.xa_index = head->index + thp_nr_pages(head) - 1;
> -		xas.xa_offset = (xas.xa_index >> xas.xa_shift) & XA_CHUNK_MASK;
> +		xas_set(&xas, head->index + thp_nr_pages(head) - 1);
>  		continue;
> put_page:
>  		put_page(head);
>
> but I'll start trying the reproducer now.
diff --git a/mm/filemap.c b/mm/filemap.c
index 39c4c46c6133..ce39c15e8379 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2682,6 +2682,11 @@ ssize_t filemap_read(struct kiocb *iocb, struct iov_iter *iter,
 			break;
 		if (i > 0)
 			mark_page_accessed(page);
+
+		if (page_size > PAGE_SIZE)
+			pr_info("i=%d page=%px page_offset=%lld off=%lu bytes=%lu\n",
+				i, page, page_offset(page), offset, bytes);
+
 		/*
 		 * If users can be writing to this page using arbitrary
 		 * virtual addresses, take care about potential aliasing