Message ID | 20170126115819.58875-12-kirill.shutemov@linux.intel.com (mailing list archive)
---|---
State | New, archived
On Thu, Jan 26, 2017 at 02:57:53PM +0300, Kirill A. Shutemov wrote:
> Most page cache allocation happens via readahead (sync or async), so if
> we want to have a significant number of huge pages in the page cache we
> need to find a way to allocate them from readahead.
>
> Unfortunately, huge pages don't fit into the current readahead design:
> the 128-page max readahead window, assumptions about page size, and
> PageReadahead() to track hit/miss.
>
> I haven't found a way to get it right yet.
>
> This patch just allocates a huge page if allowed, but doesn't really
> provide any readahead if a huge page is allocated. We read out 2M at a
> time and I would expect spikes in latency without readahead.
>
> Therefore HACK.
>
> That said, I don't think it should prevent huge page support from being
> applied. The future will show whether lacking readahead is a big deal
> for huge pages in the page cache.
>
> Any suggestions are welcome.

Well ... what if we made readahead 2 hugepages in size for inodes which
are using huge pages? That's only 8x our current readahead window, and
if you're asking for hugepages, you're accepting that IOs are going to
be larger, and you probably have the kind of storage system which can
handle doing larger IOs.
On Feb 9, 2017, at 4:34 PM, Matthew Wilcox <willy@infradead.org> wrote:
> On Thu, Jan 26, 2017 at 02:57:53PM +0300, Kirill A. Shutemov wrote:
>> Most page cache allocation happens via readahead (sync or async), so if
>> we want to have a significant number of huge pages in the page cache we
>> need to find a way to allocate them from readahead.
>> [...]
>
> Well ... what if we made readahead 2 hugepages in size for inodes which
> are using huge pages? That's only 8x our current readahead window, and
> if you're asking for hugepages, you're accepting that IOs are going to
> be larger, and you probably have the kind of storage system which can
> handle doing larger IOs.

It would be nice if the bdi had a parameter for the maximum readahead size.
Currently, readahead is capped at 2MB chunks by force_page_cache_readahead()
even if bdi->ra_pages and bdi->io_pages are much larger.

It should be up to the filesystem to decide how large the readahead chunks
are rather than imposing some policy in the MM code. For high-speed (network)
storage it is better to have at least 4MB read chunks; for RAID storage it is
desirable to have stripe-aligned readahead to avoid read inflation when
verifying the parity.

Any fixed size will eventually be inadequate as disks and filesystems change,
so it may as well be a per-bdi tunable that can be set by the filesystem as
needed, or possibly with a mount option if needed.

Cheers, Andreas
On Thu, Feb 09, 2017 at 05:23:31PM -0700, Andreas Dilger wrote:
> On Feb 9, 2017, at 4:34 PM, Matthew Wilcox <willy@infradead.org> wrote:
> > Well ... what if we made readahead 2 hugepages in size for inodes which
> > are using huge pages? [...]
>
> It would be nice if the bdi had a parameter for the maximum readahead size.
> Currently, readahead is capped at 2MB chunks by force_page_cache_readahead()
> even if bdi->ra_pages and bdi->io_pages are much larger.
>
> It should be up to the filesystem to decide how large the readahead chunks
> are rather than imposing some policy in the MM code.
> [...]

I think the filesystem should provide a hint, but ultimately it needs to be
up to the MM to decide how far to read ahead. The filesystem doesn't (and
shouldn't) have the global view into how much memory is available for
readahead, nor should it be tracking how well this app is being served by
readahead.

That 2MB chunk restriction is allegedly there "so that we don't pin too much
memory at once". Maybe that should be scaled with the amount of memory in the
system (pinning 2MB of a 256MB system is a bit different from pinning 2MB of
a 1TB memory system).
diff --git a/mm/readahead.c b/mm/readahead.c
index c4ca70239233..289527a06254 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -174,6 +174,21 @@ int __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 		if (page_offset > end_index)
 			break;
 
+		if (IS_ENABLED(CONFIG_TRANSPARENT_HUGE_PAGECACHE) &&
+				(!page_idx || !(page_offset % HPAGE_PMD_NR)) &&
+				page_cache_allow_huge(mapping, page_offset)) {
+			page = __page_cache_alloc_order(gfp_mask | __GFP_COMP,
+					HPAGE_PMD_ORDER);
+			if (page) {
+				prep_transhuge_page(page);
+				page->index = round_down(page_offset,
+						HPAGE_PMD_NR);
+				list_add(&page->lru, &page_pool);
+				ret++;
+				goto start_io;
+			}
+		}
+
 		rcu_read_lock();
 		page = radix_tree_lookup(&mapping->page_tree, page_offset);
 		rcu_read_unlock();
@@ -189,7 +204,7 @@ int __do_page_cache_readahead(struct address_space *mapping, struct file *filp,
 			SetPageReadahead(page);
 			ret++;
 		}
-
+start_io:
 	/*
 	 * Now start the IO. We ignore I/O errors - if the page is not
 	 * uptodate then the caller will launch readpage again, and
Most page cache allocation happens via readahead (sync or async), so if
we want to have a significant number of huge pages in the page cache we
need to find a way to allocate them from readahead.

Unfortunately, huge pages don't fit into the current readahead design:
the 128-page max readahead window, assumptions about page size, and
PageReadahead() to track hit/miss.

I haven't found a way to get it right yet.

This patch just allocates a huge page if allowed, but doesn't really
provide any readahead if a huge page is allocated. We read out 2M at a
time and I would expect spikes in latency without readahead.

Therefore HACK.

That said, I don't think it should prevent huge page support from being
applied. The future will show whether lacking readahead is a big deal
for huge pages in the page cache.

Any suggestions are welcome.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/readahead.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)