Message ID | 20241104143015.34684-1-laoar.shao@gmail.com (mailing list archive) |
---|---|
State | New |
Series | [RFC] mm: Add large folio support for async readahead |
On Mon, Nov 04, 2024 at 10:30:15PM +0800, Yafang Shao wrote:
> When testing large folio support with XFS on our servers, we observed that
> only a few large folios are mapped when reading large files via mmap. This
> behavior occurs because large folio support is currently implemented only
> for sync readahead, not for async readahead. Consequently, while the
> first filemap fault may map to a large folio, subsequent filemap faults are
> mapped to regular folios. This can be verified with a simple test case, as
> shown below:

I swear I tested this and it worked at the time. Here's what's supposed
to happen:

You touch the first byte of the mapping, and there's no page in the page
cache. So we go into the sync readahead path, and force it to read two
PMDs. From your email, I assume this is succeeding. The readahead flag
gets set on the second PMD so that when we get to 2MB, we go into the
'time to do more readahead' path, i.e. the 'do_async_mmap_readahead'
function you patch below.

Now, page_cache_async_ra() does this:

	unsigned int order = folio_order(folio);
...
	if (index == expected) {
		ra->start += ra->size;
		ra->size = get_next_ra_size(ra, max_pages);
		ra->async_size = ra->size;
		goto readit;
	}
readit:
	ractl->_index = ra->start;
	page_cache_ra_order(ractl, ra, order);

So it should already be doing readahead of PMD size. Can you dig into
what's going wrong, now that you understand that this is supposed to
work already?
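The flow described here can be modelled outside the kernel. The sketch below is plain userspace C, not kernel code: HPAGE_PMD_NR = 512 (2MB of 4KB pages), the window state left behind by the sync mmap path, and the mainline get_next_ra_size() logic are all assumptions written into it. With a read_ahead_kb large enough to cover a PMD, the async window keeps advancing in PMD-sized multiples, which is the behaviour described above:

/* Userspace model of the window advance quoted above; illustration only. */
#include <stdio.h>

#define HPAGE_PMD_NR 512UL	/* assumption: 2MB PMD, 4KB pages */

struct ra_state { unsigned long start, size, async_size; };

static unsigned long get_next_ra_size(struct ra_state *ra, unsigned long max)
{
	unsigned long cur = ra->size;

	if (cur < max / 16)
		return 4 * cur;
	if (cur <= max / 2)
		return 2 * cur;
	return max;
}

int main(void)
{
	/* State left by the sync mmap path for MADV_HUGEPAGE: two PMDs
	 * read, readahead flag placed on the second one. */
	struct ra_state ra = { .start = 0, .size = 2 * HPAGE_PMD_NR,
			       .async_size = HPAGE_PMD_NR };
	unsigned long max_pages = 4 * HPAGE_PMD_NR;	/* read_ahead_kb = 8192 */
	int i;

	for (i = 0; i < 4; i++) {	/* faults hitting the readahead flag */
		ra.start += ra.size;
		ra.size = get_next_ra_size(&ra, max_pages);
		ra.async_size = ra.size;
		printf("async ra: start=%lu pages, size=%lu pages (%lu KB)\n",
		       ra.start, ra.size, ra.size * 4);
	}
	return 0;
}

Every window it prints is a multiple of 512 pages, so page_cache_ra_order() can keep allocating PMD-sized folios.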
On Tue, Nov 5, 2024 at 1:00 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Nov 04, 2024 at 10:30:15PM +0800, Yafang Shao wrote:
> > When testing large folio support with XFS on our servers, we observed that
> > only a few large folios are mapped when reading large files via mmap. This
> > behavior occurs because large folio support is currently implemented only
> > for sync readahead, not for async readahead. Consequently, while the
> > first filemap fault may map to a large folio, subsequent filemap faults are
> > mapped to regular folios. This can be verified with a simple test case, as
> > shown below:
>
> I swear I tested this and it worked at the time. Here's what's supposed
> to happen:
>
> You touch the first byte of the mapping, and there's no page in the page
> cache. So we go into the sync readahead path, and force it to read two
> PMDs. From your email, I assume this is succeeding. The readahead flag
> gets set on the second PMD so that when we get to 2MB, we go into the
> 'time to do more readahead' path, i.e. the 'do_async_mmap_readahead'
> function you patch below.
>
> Now, page_cache_async_ra() does this:
>
> 	unsigned int order = folio_order(folio);
> ...
> 	if (index == expected) {
> 		ra->start += ra->size;
> 		ra->size = get_next_ra_size(ra, max_pages);
> 		ra->async_size = ra->size;
> 		goto readit;
> 	}
> readit:
> 	ractl->_index = ra->start;
> 	page_cache_ra_order(ractl, ra, order);
>
> So it should already be doing readahead of PMD size. Can you dig into
> what's going wrong, now that you understand that this is supposed to
> work already?

Upon further analysis, it appears that the behavior depends on the
/sys/block/*/queue/read_ahead_kb setting. On my test server, this
parameter is set to 128KB. For the current approach to work without
modification, it would need to be 2MB-aligned.

However, I believe MADV_HUGEPAGE behavior should not be dependent on
the value of read_ahead_kb. It would be more robust if the kernel
automatically aligned it to 2MB when MADV_HUGEPAGE is enabled. The
following changes ensure compatibility with non-2MB-aligned
read_ahead_kb values:

diff --git a/mm/readahead.c b/mm/readahead.c
index 3dc6c7a128dd..c024245497cc 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -641,9 +641,9 @@ void page_cache_async_ra(struct readahead_control *ractl,
 	expected = round_down(ra->start + ra->size - ra->async_size,
 			1UL << order);
 	if (index == expected) {
-		ra->start += ra->size;
-		ra->size = get_next_ra_size(ra, max_pages);
-		ra->async_size = ra->size;
+		ra->start += ALIGN(ra->size, HPAGE_PMD_NR);
+		ra->size = ALIGN(get_next_ra_size(ra, max_pages), HPAGE_PMD_NR);
+		ra->async_size = ALIGN(ra->size, HPAGE_PMD_NR);
 		goto readit;
 	}

In addition, adjustments may be needed in the sync readahead path to
allow MADV_HUGEPAGE readahead sizes to be tunable based on
read_ahead_kb. I will continue to analyze this and submit a revised
version accordingly.

--
Regards
Yafang
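To put numbers on this (assuming 4KB pages, HPAGE_PMD_NR = 512, and read_ahead_kb = 128, i.e. max_pages = 32): the unmodified calculation shrinks the async window from the two PMDs set up by the sync path down to 32 pages, while the ALIGN-based variant rounds it back up to a PMD. A small userspace sketch of both calculations:

/* Window sizes with read_ahead_kb = 128 (max_pages = 32); not kernel code. */
#include <stdio.h>

#define HPAGE_PMD_NR	512UL
#define ALIGN(x, a)	(((x) + (a) - 1) & ~((a) - 1))

static unsigned long get_next_ra_size(unsigned long cur, unsigned long max)
{
	if (cur < max / 16)
		return 4 * cur;
	if (cur <= max / 2)
		return 2 * cur;
	return max;
}

int main(void)
{
	unsigned long max_pages = 32;		/* 128KB of 4KB pages */
	unsigned long cur = 2 * HPAGE_PMD_NR;	/* left by the sync mmap path */
	unsigned long next = get_next_ra_size(cur, max_pages);

	printf("current: next size = %lu pages (%lu KB)\n", next, next * 4);
	printf("ALIGNed: next size = %lu pages (%lu KB)\n",
	       ALIGN(next, HPAGE_PMD_NR), ALIGN(next, HPAGE_PMD_NR) * 4);
	return 0;
}

This prints 32 pages (128 KB) for the current code, too small for a PMD-sized folio, and 512 pages (2048 KB) for the ALIGN-based variant.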
On Tue, Nov 05, 2024 at 11:01:01AM +0800, Yafang Shao wrote:
> Upon further analysis, it appears that the behavior depends on the
> /sys/block/*/queue/read_ahead_kb setting. On my test server, this
> parameter is set to 128KB. For the current approach to work without
> modification, it would need to be 2MB-aligned.
>
> However, I believe MADV_HUGEPAGE behavior should not be dependent on
> the value of read_ahead_kb. It would be more robust if the kernel
> automatically aligned it to 2MB when MADV_HUGEPAGE is enabled. The
> following changes ensure compatibility with non-2MB-aligned
> read_ahead_kb values:

or maybe:

static unsigned long get_next_ra_size(struct file_ra_state *ra,
				      unsigned long max)
{
	unsigned long cur = ra->size;

	if (cur < max / 16)
		return 4 * cur;
	if (cur <= max / 2)
		return 2 * cur;
+	if (cur > max)
+		return cur;
	return max;
}
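With this extra check, a window that is already larger than max is simply kept rather than clamped, so the two PMDs grown by the sync mmap path survive even a small read_ahead_kb. A userspace sketch of the arithmetic (max_pages = 32 again stands in for read_ahead_kb = 128):

/* Userspace sketch of the suggested "cur > max" check; illustration only. */
#include <stdio.h>

static unsigned long get_next_ra_size_patched(unsigned long cur,
					      unsigned long max)
{
	if (cur < max / 16)
		return 4 * cur;
	if (cur <= max / 2)
		return 2 * cur;
	if (cur > max)		/* keep a window the sync path already grew */
		return cur;
	return max;
}

int main(void)
{
	unsigned long max_pages = 32;	/* read_ahead_kb = 128, 4KB pages */
	unsigned long cur = 1024;	/* two PMDs from the sync mmap path */
	unsigned long next = get_next_ra_size_patched(cur, max_pages);

	/* prints 1024 pages (4096 KB) instead of collapsing to 32 pages */
	printf("patched: next size = %lu pages (%lu KB)\n", next, next * 4);
	return 0;
}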
On Tue, Nov 5, 2024 at 12:21 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Nov 05, 2024 at 11:01:01AM +0800, Yafang Shao wrote:
> > Upon further analysis, it appears that the behavior depends on the
> > /sys/block/*/queue/read_ahead_kb setting. On my test server, this
> > parameter is set to 128KB. For the current approach to work without
> > modification, it would need to be 2MB-aligned.
> >
> > However, I believe MADV_HUGEPAGE behavior should not be dependent on
> > the value of read_ahead_kb. It would be more robust if the kernel
> > automatically aligned it to 2MB when MADV_HUGEPAGE is enabled. The
> > following changes ensure compatibility with non-2MB-aligned
> > read_ahead_kb values:
>
> or maybe:
>
> static unsigned long get_next_ra_size(struct file_ra_state *ra,
> 				      unsigned long max)
> {
> 	unsigned long cur = ra->size;
>
> 	if (cur < max / 16)
> 		return 4 * cur;
> 	if (cur <= max / 2)
> 		return 2 * cur;
> +	if (cur > max)
> +		return cur;
> 	return max;
> }

That is better. Thanks for your suggestion.
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 68a5f1ff3301..b77eb5d3a3dd 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -1350,8 +1350,10 @@ struct readahead_control {
 	pgoff_t _index;
 	unsigned int _nr_pages;
 	unsigned int _batch_count;
-	bool _workingset;
 	unsigned long _pflags;
+	bool _workingset;
+	bool _large_folio;
+	bool _rand_read;
 };

 #define DEFINE_READAHEAD(ractl, f, r, m, i)				\
diff --git a/mm/filemap.c b/mm/filemap.c
index 36d22968be9a..c9695effdd88 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3226,6 +3226,10 @@ static struct file *do_async_mmap_readahead(struct vm_fault *vmf,
 	WRITE_ONCE(ra->mmap_miss, --mmap_miss);

 	if (folio_test_readahead(folio)) {
+		if (vmf->vma->vm_flags & VM_HUGEPAGE)
+			ractl._large_folio = true;
+		if (vmf->vma->vm_flags & VM_RAND_READ)
+			ractl._rand_read = true;
 		fpin = maybe_unlock_mmap_for_io(vmf, fpin);
 		page_cache_async_ra(&ractl, folio, ra->ra_pages);
 	}
diff --git a/mm/readahead.c b/mm/readahead.c
index 3dc6c7a128dd..8c4f95f87b71 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -667,6 +667,15 @@ void page_cache_async_ra(struct readahead_control *ractl,
 		ra->async_size = ra->size;
 readit:
 	ractl->_index = ra->start;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	if (ractl->_large_folio) {
+		ractl->_index &= ~((unsigned long)HPAGE_PMD_NR - 1);
+		if (!ractl->_rand_read)
+			ra->size = 2 * HPAGE_PMD_NR;
+		ra->async_size = HPAGE_PMD_NR;
+		order = HPAGE_PMD_ORDER;
+	}
+#endif
 	page_cache_ra_order(ractl, ra, order);
 }
 EXPORT_SYMBOL_GPL(page_cache_async_ra);
When testing large folio support with XFS on our servers, we observed that
only a few large folios are mapped when reading large files via mmap. This
behavior occurs because large folio support is currently implemented only
for sync readahead, not for async readahead. Consequently, while the
first filemap fault may map to a large folio, subsequent filemap faults are
mapped to regular folios. This can be verified with a simple test case, as
shown below:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define LEN (1024 * 1024 * 1024) // 1GB file

int main(int argc, char *argv[])
{
	char *addr;
	int fd, i;

	fd = open("data", O_RDWR);
	if (fd < 0) {
		perror("open");
		exit(-1);
	}

	addr = mmap(NULL, LEN, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED) {
		perror("mmap");
		exit(-1);
	}

	if (madvise(addr, LEN, MADV_HUGEPAGE)) {
		perror("madvise");
		exit(-1);
	}

	for (i = 0; i < LEN / 4096; i++)
		memset(addr + i * 4096, 1, 1);

	while (1) {} // Verifiable with /proc/meminfo

	munmap(addr, LEN);
	close(fd);
	exit(0);
}

This patch adds large folio support to async readahead, aligning its
behavior with sync readahead.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
---
 include/linux/pagemap.h | 4 +++-
 mm/filemap.c            | 4 ++++
 mm/readahead.c          | 9 +++++++++
 3 files changed, 16 insertions(+), 1 deletion(-)
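The "// Verifiable with /proc/meminfo" comment in the test above can be made concrete with a small companion program. This is only a sketch: it assumes a kernel built with CONFIG_TRANSPARENT_HUGEPAGE, which is what makes the FileHugePages and FilePmdMapped counters appear in /proc/meminfo. With async large-folio readahead working, both counters should grow toward the 1GB file size while the test spins in its busy loop:

/* Print the page-cache THP counters from /proc/meminfo. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *fp = fopen("/proc/meminfo", "r");

	if (!fp) {
		perror("fopen");
		return 1;
	}
	while (fgets(line, sizeof(line), fp)) {
		if (!strncmp(line, "FileHugePages:", 14) ||
		    !strncmp(line, "FilePmdMapped:", 14))
			fputs(line, stdout);
	}
	fclose(fp);
	return 0;
}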