[PATCHv3,15/41] filemap: handle huge pages in do_generic_file_read()

Message ID 20160915115523.29737-16-kirill.shutemov@linux.intel.com (mailing list archive)
State New, archived

Commit Message

Kirill A. Shutemov Sept. 15, 2016, 11:54 a.m. UTC
Most of the work happens on the head page. Only when we need to copy data to
userspace do we find the relevant subpage.

We are still limited by PAGE_SIZE per iteration. Lifting this limitation
would require some more work.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)
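
The head/subpage split described in the commit message can be modelled in
userspace. The following is only a sketch: struct toy_page and the toy_*()
helpers are hypothetical stand-ins, not the kernel's struct page or API, and
the model assumes the subpages of a huge page are contiguous in one array.
It shows how "page + index - page->index" steps from the head page to the
subpage backing a given file index.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct page; NOT the kernel layout. */
struct toy_page {
	unsigned long index;     /* file offset in pages; only trusted on the head */
	struct toy_page *head;   /* head of the compound page (head points to itself) */
};

/* Set up nr contiguous toy pages as one compound page starting at first_index. */
static void toy_init_compound(struct toy_page *pages, size_t nr,
			      unsigned long first_index)
{
	assert(pages != NULL && nr > 0);
	for (size_t i = 0; i < nr; i++) {
		pages[i].head = &pages[0];
		pages[i].index = 0;  /* tail fields are deliberately not relied on */
	}
	pages[0].index = first_index;
}

/* Models compound_head(): any subpage resolves to its head page. */
static struct toy_page *toy_compound_head(struct toy_page *page)
{
	return page->head;
}

/*
 * Models "page + index - page->index" from the patch: starting from the
 * head page, step forward to the subpage that backs file index `index`.
 */
static struct toy_page *toy_subpage(struct toy_page *head, unsigned long index,
				    size_t nr)
{
	assert(index >= head->index && index - head->index < nr);
	return head + (index - head->index);
}
```

With a 512-page compound page whose head sits at file index 1024, looking up
index 1061 from any subpage first resolves to the head and then lands on
subpage 37.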

Comments

Jan Kara Oct. 13, 2016, 9:33 a.m. UTC | #1
On Thu 15-09-16 14:54:57, Kirill A. Shutemov wrote:
> Most of the work happens on the head page. Only when we need to copy data to
> userspace do we find the relevant subpage.
> 
> We are still limited by PAGE_SIZE per iteration. Lifting this limitation
> would require some more work.

Hum, I'm kind of lost. Can you point me to some design document / email
that would explain some high level ideas how are huge pages in page cache
supposed to work? When are we supposed to operate on the head page and when
on subpage? What is protected by the page lock of the head page? Do page
locks of subpages play any role? If I understand right, e.g.
pagecache_get_page() will return subpages but is it generally safe to
operate on subpages individually or do we have to be aware that they are
part of a huge page?

If I understand the motivation right, it is mostly about being able to mmap
PMD-sized chunks to userspace. So my naive idea would be that we could just
implement it by allocating PMD sized chunks of pages when adding pages to
page cache, we don't even have to read them all unless we come from PMD
fault path. Reclaim may need to be aware not to split pages unnecessarily
but that's about it. So I'd like to understand what's wrong with this
naive idea and why do filesystems need to be aware that someone wants to
map in PMD sized chunks...

								Honza
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  mm/filemap.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 50afe17230e7..b77bcf6843ee 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1860,6 +1860,7 @@ find_page:
>  			if (unlikely(page == NULL))
>  				goto no_cached_page;
>  		}
> +		page = compound_head(page);
>  		if (PageReadahead(page)) {
>  			page_cache_async_readahead(mapping,
>  					ra, filp, page,
> @@ -1936,7 +1937,8 @@ page_ok:
>  		 * now we can copy it to user space...
>  		 */
>  
> -		ret = copy_page_to_iter(page, offset, nr, iter);
> +		ret = copy_page_to_iter(page + index - page->index, offset,
> +				nr, iter);
>  		offset += ret;
>  		index += offset >> PAGE_SHIFT;
>  		offset &= ~PAGE_MASK;
> @@ -2356,6 +2358,7 @@ page_not_uptodate:
>  	 * because there really aren't any performance issues here
>  	 * and we need to check for errors.
>  	 */
> +	page = compound_head(page);
>  	ClearPageError(page);
>  	error = mapping->a_ops->readpage(file, page);
>  	if (!error) {
> -- 
> 2.9.3
> 
>
Kirill A. Shutemov Oct. 31, 2016, 6:10 p.m. UTC | #2
[ My mail system broke and the original reply didn't get through. Resent. ]

On Thu, Oct 13, 2016 at 11:33:13AM +0200, Jan Kara wrote:
> On Thu 15-09-16 14:54:57, Kirill A. Shutemov wrote:
> > Most of the work happens on the head page. Only when we need to copy data to
> > userspace do we find the relevant subpage.
> > 
> > We are still limited by PAGE_SIZE per iteration. Lifting this limitation
> > would require some more work.
>
> Hum, I'm kind of lost.

The limitation here comes from how copy_page_to_iter() and
copy_page_from_iter() work wrt. highmem: they can only handle one small
page at a time.

On the write side, we also have a problem with assuming small pages: the
write length and offset within the page are calculated before we know whether
a small or huge page is allocated. It's not easy to fix. It looks like it would
require a change in the ->write_begin() interface to accept len > PAGE_SIZE.
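
To make the PAGE_SIZE-per-iteration limit on the read side concrete, here is
a userspace sketch of the loop bookkeeping. It mirrors the "offset += ret;
index += offset >> PAGE_SHIFT; offset &= ~PAGE_MASK" lines from the patch
context, but toy_read(), struct toy_read_result and the TOY_* constants are
hypothetical names, 4k base pages are assumed, and no data is copied.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)
#define TOY_PAGE_MASK  (~(TOY_PAGE_SIZE - 1))

struct toy_read_result {
	unsigned long iterations;  /* number of per-page copy steps */
	unsigned long index;       /* page index after the read */
	unsigned long offset;      /* offset within that page after the read */
};

/* Walk a read of `count` bytes at file position `pos`, one small page at a time. */
static struct toy_read_result toy_read(unsigned long long pos, size_t count)
{
	struct toy_read_result r = { 0, 0, 0 };

	r.index = (unsigned long)(pos >> TOY_PAGE_SHIFT);
	r.offset = (unsigned long)(pos & ~TOY_PAGE_MASK);

	while (count > 0) {
		size_t nr;

		assert(r.offset < TOY_PAGE_SIZE);
		/* Never cross a small-page boundary in a single step. */
		nr = TOY_PAGE_SIZE - r.offset;
		if (nr > count)
			nr = count;

		/* The kernel would call copy_page_to_iter() on the subpage here. */
		count -= nr;
		r.offset += nr;
		r.index += r.offset >> TOY_PAGE_SHIFT;
		r.offset &= ~TOY_PAGE_MASK;
		r.iterations++;
	}
	return r;
}
```

Under these assumptions, reading one full 2MB huge page still takes 512
iterations, which is the cost the commit message says remains to be lifted.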

> Can you point me to some design document / email that would explain some
> high level ideas how are huge pages in page cache supposed to work?

I'll elaborate more in the cover letter to the next revision.

> When are we supposed to operate on the head page and when on subpage?

It's case-by-case. See above explanation why we're limited to PAGE_SIZE
here.

> What is protected by the page lock of the head page?

Whole huge page. As with anon pages.

> Do page locks of subpages play any role?

lock_page() on any subpage would lock whole huge page.

> If I understand right, e.g. pagecache_get_page() will return subpages but
> is it generally safe to operate on subpages individually or do we have
> to be aware that they are part of a huge page?

I tried to make it as transparent as possible: page flag operations will
be redirected to head page, if necessary. Things like page_mapping() and
page_to_pgoff() know about huge pages.

Direct access to struct page fields must be avoided for tail pages, as most
of them don't have the meaning you would expect for small pages.
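
The locking and flag rules above can be sketched in userspace as follows.
This is a toy model under stated assumptions: struct model_page and the
model_*() helpers are hypothetical, the "lock" is a plain boolean rather than
a real page lock, and only one flag is modelled. The point is that every
operation first resolves to the head page, so tail pages never carry their
own lock or flag state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; NOT the kernel layout. */
struct model_page {
	struct model_page *head;  /* head of the huge page (head points to itself) */
	bool locked;              /* only meaningful on the head */
	bool uptodate;            /* only meaningful on the head */
};

static void model_init(struct model_page *pages, size_t nr)
{
	assert(pages != NULL && nr > 0);
	for (size_t i = 0; i < nr; i++) {
		pages[i].head = &pages[0];
		pages[i].locked = false;
		pages[i].uptodate = false;
	}
}

static struct model_page *model_head(struct model_page *page)
{
	return page->head;
}

/* Locking any subpage takes the single lock kept on the head page. */
static bool model_trylock(struct model_page *page)
{
	struct model_page *head = model_head(page);

	if (head->locked)
		return false;
	head->locked = true;
	return true;
}

static void model_unlock(struct model_page *page)
{
	struct model_page *head = model_head(page);

	assert(head->locked);
	head->locked = false;
}

/* Flag operations are redirected to the head page as well. */
static void model_set_uptodate(struct model_page *page)
{
	model_head(page)->uptodate = true;
}

static bool model_uptodate(struct model_page *page)
{
	return model_head(page)->uptodate;
}
```

In this model, locking subpage 3 makes a trylock on subpage 1 fail, and
reading a tail page's own fields directly (instead of going through the
helpers) gives a stale answer, which is why direct field access on tail pages
has to be avoided.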

> If I understand the motivation right, it is mostly about being able to mmap
> PMD-sized chunks to userspace. So my naive idea would be that we could just
> implement it by allocating PMD sized chunks of pages when adding pages to
> page cache, we don't even have to read them all unless we come from PMD
> fault path.

Well, no. We have one PG_{uptodate,dirty,writeback,mappedtodisk,etc}
per-hugepage, one common list of buffer heads...

PG_dirty and PG_uptodate behaviour is inherited from anon-THP (where handling
it otherwise doesn't make sense), and handling it differently for file-THP
is a nightmare from a maintenance POV.

> Reclaim may need to be aware not to split pages unnecessarily
> but that's about it. So I'd like to understand what's wrong with this
> naive idea and why do filesystems need to be aware that someone wants to
> map in PMD sized chunks...

In addition to flags, THP uses some space in struct page of tail pages to
encode additional information. See compound_{mapcount,head,dtor,order},
page_deferred_list().

--
 Kirill A. Shutemov

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara Nov. 1, 2016, 4:39 p.m. UTC | #3
On Mon 31-10-16 21:10:35, Kirill A. Shutemov wrote:
> [ My mail system broke and the original reply didn't get through. Resent. ]

OK, this answers some of my questions from previous email so disregard that
one.

> On Thu, Oct 13, 2016 at 11:33:13AM +0200, Jan Kara wrote:
> > On Thu 15-09-16 14:54:57, Kirill A. Shutemov wrote:
> > > Most of the work happens on the head page. Only when we need to copy data to
> > > userspace do we find the relevant subpage.
> > > 
> > > We are still limited by PAGE_SIZE per iteration. Lifting this limitation
> > > would require some more work.
> >
> > Hum, I'm kind of lost.
> 
> The limitation here comes from how copy_page_to_iter() and
> copy_page_from_iter() work wrt. highmem: it can only handle one small
> page at a time.
> 
> On write side, we also have problem with assuming small page: write length
> and offset within page calculated before we know if small or huge page is
> allocated. It's not easy to fix. Looks like it would require change in
> ->write_begin() interface to accept len > PAGE_SIZE.
>
> > Can you point me to some design document / email that would explain some
> > high level ideas how are huge pages in page cache supposed to work?
> 
> I'll elaborate more in cover letter to next revision.
> 
> > When are we supposed to operate on the head page and when on subpage?
> 
> It's case-by-case. See above explanation why we're limited to PAGE_SIZE
> here.
> 
> > What is protected by the page lock of the head page?
> 
> Whole huge page. As with anon pages.
> 
> > Do page locks of subpages play any role?
> 
> lock_page() on any subpage would lock whole huge page.
> 
> > If I understand right, e.g. pagecache_get_page() will return subpages but
> > is it generally safe to operate on subpages individually or do we have
> > to be aware that they are part of a huge page?
> 
> I tried to make it as transparent as possible: page flag operations will
> be redirected to head page, if necessary. Things like page_mapping() and
> page_to_pgoff() know about huge pages.
> 
> Direct access to struct page fields must be avoided for tail pages as most
> of them don't have the meaning you would expect for small pages.

OK, good to know.

> > If I understand the motivation right, it is mostly about being able to mmap
> > PMD-sized chunks to userspace. So my naive idea would be that we could just
> > implement it by allocating PMD sized chunks of pages when adding pages to
> > page cache, we don't even have to read them all unless we come from PMD
> > fault path.
> 
> Well, no. We have one PG_{uptodate,dirty,writeback,mappedtodisk,etc}
> per-hugepage, one common list of buffer heads...
> 
> PG_dirty and PG_uptodate behaviour is inherited from anon-THP (where handling
> it otherwise doesn't make sense) and handling it differently for file-THP
> is nightmare from maintenance POV.

But the complexity of two different page sizes for page cache and *each*
filesystem that wants to support it does not make the maintenance easy
either. So I'm not convinced that using the same rules for anon-THP and
file-THP is a clear win. And if we have these two options neither of which
has negligible maintenance cost, I'd also like to see more justification
for why it is a good idea to have file-THP for normal filesystems. Do you
have any performance numbers that show it is a win under some realistic
workload?

I'd also note that having PMD-sized pages has some obvious disadvantages as
well:

1) I'm not sure buffer head handling code will quite scale to 512 or even
2048 buffer_heads on a linked list referenced from a page. It may work but
I suspect the performance will suck. 

2) PMD-sized pages result in increased space & memory usage.

3) In ext4 we have to estimate how much metadata we may need to modify when
allocating blocks underlying a page in the worst case (you don't seem to
update this estimate in your patch set). With 2048 blocks underlying a page,
each possibly in a different block group, it is a lot of metadata forcing
us to reserve a large transaction (not sure if you'll be able to even
reserve such large transaction with the default journal size), which again
makes things slower.

4) As you have noted some places like write_begin() still depend on 4k
pages which creates a strange mix of places that use subpages and that use
head pages.

All this would be a non-issue (well, except 2 I guess) if we just didn't
expose filesystems to the fact that something like file-THP exists.
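
For reference, the 512 and 2048 figures in points 1 and 3 follow from simple
division. The sketch below assumes a 2MB PMD-sized page (as on x86-64) and
filesystem block sizes of 4k and 1k; blocks_per_page() is a hypothetical
helper name, not a kernel function.

```c
#include <assert.h>

/* Number of filesystem blocks (and hence buffer_heads) backing one page. */
static unsigned long blocks_per_page(unsigned long page_size,
				     unsigned long block_size)
{
	assert(block_size > 0 && page_size % block_size == 0);
	return page_size / block_size;
}
```

So blocks_per_page(2UL << 20, 4096) gives 512 and blocks_per_page(2UL << 20,
1024) gives 2048, compared to 1 and 4 for a 4k page.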

> > Reclaim may need to be aware not to split pages unnecessarily
> > but that's about it. So I'd like to understand what's wrong with this
> > naive idea and why do filesystems need to be aware that someone wants to
> > map in PMD sized chunks...
> 
> In addition to flags, THP uses some space in struct page of tail pages to
> encode additional information. See compound_{mapcount,head,dtor,order},
> page_deferred_list().

Thanks, I'll check that.

								Honza
Kirill A. Shutemov Nov. 2, 2016, 8:32 a.m. UTC | #4
On Tue, Nov 01, 2016 at 05:39:40PM +0100, Jan Kara wrote:
> On Mon 31-10-16 21:10:35, Kirill A. Shutemov wrote:
> > > If I understand the motivation right, it is mostly about being able to mmap
> > > PMD-sized chunks to userspace. So my naive idea would be that we could just
> > > implement it by allocating PMD sized chunks of pages when adding pages to
> > > page cache, we don't even have to read them all unless we come from PMD
> > > fault path.
> > 
> > Well, no. We have one PG_{uptodate,dirty,writeback,mappedtodisk,etc}
> > per-hugepage, one common list of buffer heads...
> > 
> > PG_dirty and PG_uptodate behaviour is inherited from anon-THP (where handling
> > it otherwise doesn't make sense) and handling it differently for file-THP
> > is nightmare from maintenance POV.
> 
> But the complexity of two different page sizes for page cache and *each*
> filesystem that wants to support it does not make the maintenance easy
> either.

I think with time we can make small pages just a subcase of huge pages.
And some generalization can be made once more than one filesystem with
backing storage adopts huge pages.

> So I'm not convinced that using the same rules for anon-THP and
> file-THP is a clear win.

We already have file-THP with the same rules: tmpfs. Backing storage is
what changes the picture.

> And if we have these two options neither of which has negligible
> maintenance cost, I'd also like to see more justification for why it is
> a good idea to have file-THP for normal filesystems. Do you have any
> performance numbers that show it is a win under some realistic workload?

See below. As usual with huge pages, they make sense when you have plenty of
memory.

> I'd also note that having PMD-sized pages has some obvious disadvantages as
> well:
>
> 1) I'm not sure buffer head handling code will quite scale to 512 or even
> 2048 buffer_heads on a linked list referenced from a page. It may work but
> I suspect the performance will suck.

Yes, the buffer_head list doesn't scale. That's the main reason (along with 4)
why syscall-based IO sucks. We spend a lot of time looking for the desired
block.

We need to switch to some other data structure for storing buffer_heads.
Is there a reason why we have a list there in the first place?
Why not just an array?

I will look into it, but this sounds like a separate infrastructure change
project.
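
The list-versus-array cost can be illustrated with a small userspace model.
This is only a sketch: struct toy_bh and the toy_bh_*() helpers are
hypothetical, and the circular "next" pointer merely imitates how a page's
buffer_heads are chained through b_this_page. It shows that reaching the Nth
block on a list takes N pointer hops, while an array is a single index.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for struct buffer_head. */
struct toy_bh {
	unsigned long blocknr;
	struct toy_bh *next;   /* circular list, in the spirit of b_this_page */
};

/* Chain nr buffer heads into a circular list covering consecutive blocks. */
static void toy_bh_link(struct toy_bh *bhs, size_t nr, unsigned long first_block)
{
	assert(bhs != NULL && nr > 0);
	for (size_t i = 0; i < nr; i++) {
		bhs[i].blocknr = first_block + i;
		bhs[i].next = &bhs[(i + 1) % nr];
	}
}

/* List lookup: walk from the head, counting pointer hops. */
static struct toy_bh *toy_bh_walk(struct toy_bh *head, size_t n, size_t *hops)
{
	struct toy_bh *bh = head;

	assert(hops != NULL);
	*hops = 0;
	while (*hops < n) {
		bh = bh->next;
		(*hops)++;
	}
	return bh;
}

/* Array lookup: a single index operation, independent of n. */
static struct toy_bh *toy_bh_at(struct toy_bh *bhs, size_t n)
{
	return &bhs[n];
}
```

With 2048 buffer heads on one page, finding the last block costs 2047 hops on
the list, which is the kind of per-lookup overhead described above.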

> 2) PMD-sized pages result in increased space & memory usage.

Space? Do you mean disk space? Not really: we still don't write beyond
i_size or into holes.

Behaviour wrt. holes may change with mmap()-IO as we have coarser
granularity, but the same can be seen just between different
architectures: 4k vs. 64k base page size.

> 3) In ext4 we have to estimate how much metadata we may need to modify when
> allocating blocks underlying a page in the worst case (you don't seem to
> update this estimate in your patch set). With 2048 blocks underlying a page,
> each possibly in a different block group, it is a lot of metadata forcing
> us to reserve a large transaction (not sure if you'll be able to even
> reserve such large transaction with the default journal size), which again
> makes things slower.

I didn't see this in profiles. And xfstests looks fine. I probably need to
run them with 1k blocks once again.

> 4) As you have noted some places like write_begin() still depend on 4k
> pages which creates a strange mix of places that use subpages and that use
> head pages.

Yes, this needs to be addressed to restore syscall-IO performance and take
advantage of huge pages.

But again, it's an infrastructure change that would likely affect
interface between VFS and filesystems. It deserves a separate patchset.

> All this would be a non-issue (well, except 2 I guess) if we just didn't
> expose filesystems to the fact that something like file-THP exists.

The numbers below were generated with fio. The working set is relatively small,
so it fits into page cache and the writing set doesn't hit dirty_ratio.

I think the mmap performance should be enough to justify initial inclusion
of an experimental feature: it is useful for workloads that target mmap()-IO.
It will take time for the feature to mature anyway.

Configuration:
 - 2x E5-2697v2, 64G RAM;
 - INTEL SSDSC2CW24;
 - IO request size is 4k;
 - 8 processes, 512MB data set each;

Workload
 read/write	baseline	stddev	huge=always	stddev		change
--------------------------------------------------------------------------------
sync-read
 read		  21439.00	348.14	  20297.33	259.62		 -5.33%
sync-write
 write		   6833.20	147.08	   3630.13	 52.86		-46.88%
sync-readwrite
 read		   4377.17	 17.53	   2366.33	 19.52		-45.94%
 write		   4378.50	 17.83	   2365.80	 19.94		-45.97%
sync-randread
 read		   5491.20	 66.66	  14664.00	288.29		167.05%
sync-randwrite
 write		   6396.13	 98.79	   2035.80	  8.17		-68.17%
sync-randrw
 read		   2927.30	115.81	   1036.08	 34.67		-64.61%
 write		   2926.47	116.45	   1036.11	 34.90		-64.60%
libaio-read
 read		    254.36	 12.49	    258.63	 11.29		  1.68%
libaio-write
 write		   4979.20	122.75	   2904.77	 17.93		-41.66%
libaio-readwrite
 read		   2738.57	142.72	   2045.80	  4.12		-25.30%
 write		   2729.93	141.80	   2039.77	  3.79		-25.28%
libaio-randread
 read		    113.63	  2.98	    210.63	  5.07		 85.37%
libaio-randwrite
 write		   4456.10	 76.21	   1649.63	  7.00		-62.98%
libaio-randrw
 read		     97.85	  8.03	    877.49	 28.27		796.80%
 write		     97.55	  7.99	    874.83	 28.19		796.77%
mmap-read
 read		  20654.67	304.48	  24696.33	1064.07		 19.57%
mmap-write
 write		   8652.33	272.44	  13187.33	499.10		 52.41%
mmap-readwrite
 read		   6620.57	 16.05	   9221.60	399.56		 39.29%
 write		   6623.63	 16.34	   9222.13	399.31		 39.23%
mmap-randread
 read		   6717.23	1360.55	  21939.33	326.38		226.61%
mmap-randwrite
 write		   3204.63	253.66	  12371.00	 61.49		286.03%
mmap-randrw
 read		   2150.50	 78.00	   7682.67	188.59		257.25%
 write		   2149.50	 78.00	   7685.40	188.35		257.54%
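
The "change" column appears to be the relative difference between the
huge=always and baseline columns. A minimal sketch of that calculation
follows; change_percent() is a hypothetical helper name, and because the
table lists rounded averages, recomputing from them may differ from the
printed value in the last digit on some rows.

```c
#include <assert.h>

/* Relative change of the huge=always result against the baseline, in percent. */
static double change_percent(double baseline, double huge)
{
	assert(baseline > 0.0);
	return (huge - baseline) / baseline * 100.0;
}
```

For example, change_percent(21439.00, 20297.33) reproduces the sync-read row
at roughly -5.33, and change_percent(3204.63, 12371.00) reproduces the
mmap-randwrite row at roughly 286.03.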
Christoph Hellwig Nov. 2, 2016, 2:36 p.m. UTC | #5
On Tue, Nov 01, 2016 at 05:39:40PM +0100, Jan Kara wrote:
> I'd also note that having PMD-sized pages has some obvious disadvantages as
> well:
> 
> 1) I'm not sure buffer head handling code will quite scale to 512 or even
> 2048 buffer_heads on a linked list referenced from a page. It may work but
> I suspect the performance will suck. 

buffer_head handling always sucks.  For the iomap based buffered write
path I plan to support a buffer_head-less mode for the block size ==
PAGE_SIZE case in 4.11 at the latest, or even in 4.10 if I get enough other
things off my plate in time.  I think that's the right way to go for
THP, especially if we require the fs to allocate the whole huge page
as a single extent, similar to the DAX PMD mapping case.

> 2) PMD-sized pages result in increased space & memory usage.

How so?

> 3) In ext4 we have to estimate how much metadata we may need to modify when
> allocating blocks underlying a page in the worst case (you don't seem to
> update this estimate in your patch set). With 2048 blocks underlying a page,
> each possibly in a different block group, it is a lot of metadata forcing
> us to reserve a large transaction (not sure if you'll be able to even
> reserve such large transaction with the default journal size), which again
> makes things slower.

As said above I think we should only use huge page mappings if there is
a single underlying extent, same as in DAX to keep the complexity down.

> 4) As you have noted some places like write_begin() still depend on 4k
> pages which creates a strange mix of places that use subpages and that use
> head pages.

Just use the iomap buffered I/O code and all these issues will go away.
Christoph Hellwig Nov. 2, 2016, 2:37 p.m. UTC | #6
On Wed, Nov 02, 2016 at 11:32:04AM +0300, Kirill A. Shutemov wrote:
> Yes, buffer_head list doesn't scale. That's the main reason (along with 4)
> why syscall-based IO sucks. We spend a lot of time looking for desired
> block.
> 
> We need to switch to some other data structure for storing buffer_heads.
> Is there a reason why we have list there in first place?
> Why not just array?
> 
> I will look into it, but this sounds like a separate infrastructure change
> project.

We're working on it with the iomap code.  And yes, it's really something
that needs to be done before we can consider the THP patches.  Same for
the biovec thing where we really need the > PAGE_SIZE bio_vecs first.
Jan Kara Nov. 3, 2016, 5:56 p.m. UTC | #7
On Wed 02-11-16 07:36:12, Christoph Hellwig wrote:
> On Tue, Nov 01, 2016 at 05:39:40PM +0100, Jan Kara wrote:
> > I'd also note that having PMD-sized pages has some obvious disadvantages as
> > well:
> > 
> > 1) I'm not sure buffer head handling code will quite scale to 512 or even
> > 2048 buffer_heads on a linked list referenced from a page. It may work but
> > I suspect the performance will suck. 
> 
> buffer_head handling always sucks.  For the iomap based buffered write
> path I plan to support a buffer_head-less mode for the block size ==
> PAGE_SIZE case in 4.11 latest, but if I get enough other things of my
> plate in time even for 4.10.  I think that's the right way to go for
> THP, especially if we require the fs to allocate the whole huge page
> as a single extent, similar to the DAX PMD mapping case.

Yeah, if we require whole THP to be backed by a single extent, things get
simpler. But still there's the issue that ext4 cannot easily use iomap code
for buffered writes because of the data exposure issue we already talked
about - well, ext4 could actually work (it supports unwritten extents) but
old compatibility modes won't work and I'd strongly prefer not to have two
independent write paths in ext4... But I'll put more thought into this, I
have some idea how we could hack around the problem even for on-disk formats
that don't support unwritten extents. The trick we could use is that we'd
just mark the range of file as unwritten in memory in extent cache we have,
that should protect us against exposing uninitialized pages in racing
faults.

> > 2) PMD-sized pages result in increased space & memory usage.
> 
> How so?

Well, memory usage is clear I guess - if the files are smaller than THP
size, or if you don't use all the 4k pages that are forming THP you are
wasting memory. Sure it can be somewhat controlled by the heuristics
deciding when to use THP in pagecache and when to fall back to 4k pages.

Regarding space usage - it is mostly the case for sparse mmaped IO where
you always have to allocate (and write out) all the blocks underlying a THP
that gets written to, even though you may only need 4K from that area...

> > 3) In ext4 we have to estimate how much metadata we may need to modify when
> > allocating blocks underlying a page in the worst case (you don't seem to
> > update this estimate in your patch set). With 2048 blocks underlying a page,
> > each possibly in a different block group, it is a lot of metadata forcing
> > us to reserve a large transaction (not sure if you'll be able to even
> > reserve such large transaction with the default journal size), which again
> > makes things slower.
> 
> As said above I think we should only use huge page mappings if there is
> a single underlying extent, same as in DAX to keep the complexity down.
> 
> > 4) As you have noted some places like write_begin() still depend on 4k
> > pages which creates a strange mix of places that use subpages and that use
> > head pages.
> 
> Just use the iomap buffered I/O code and all these issues will go away.

Yep, the above two things would make things somewhat less ugly I agree.

								Honza
Jan Kara Nov. 3, 2016, 8:40 p.m. UTC | #8
On Wed 02-11-16 11:32:04, Kirill A. Shutemov wrote:
> On Tue, Nov 01, 2016 at 05:39:40PM +0100, Jan Kara wrote:
> > On Mon 31-10-16 21:10:35, Kirill A. Shutemov wrote:
> > > > If I understand the motivation right, it is mostly about being able to mmap
> > > > PMD-sized chunks to userspace. So my naive idea would be that we could just
> > > > implement it by allocating PMD sized chunks of pages when adding pages to
> > > > page cache, we don't even have to read them all unless we come from PMD
> > > > fault path.
> > > 
> > > Well, no. We have one PG_{uptodate,dirty,writeback,mappedtodisk,etc}
> > > per-hugepage, one common list of buffer heads...
> > > 
> > > PG_dirty and PG_uptodate behaviour is inherited from anon-THP (where handling
> > > it otherwise doesn't make sense) and handling it differently for file-THP
> > > is nightmare from maintenance POV.
> > 
> > But the complexity of two different page sizes for page cache and *each*
> > filesystem that wants to support it does not make the maintenance easy
> > either.
> 
> I think with time we can make small pages just a subcase of huge pages.
> And some generalization can be made once more than one filesystem with
> backing storage will adopt huge pages.

My objection is that IMHO currently the code is too ugly to go in. Too many
places need to know about THP and I'm not even sure you have patched all
the places or whether some corner cases remained unfixed and how should I
find that out.

> > So I'm not convinced that using the same rules for anon-THP and
> > file-THP is a clear win.
> 
> We already have file-THP with the same rules: tmpfs. Backing storage is
> what changes the picture.

Right, the ugliness comes from access to backing storage having to deal
with huge pages.

> > I'd also note that having PMD-sized pages has some obvious disadvantages as
> > well:
> >
> > 1) I'm not sure buffer head handling code will quite scale to 512 or even
> > 2048 buffer_heads on a linked list referenced from a page. It may work but
> > I suspect the performance will suck.
> 
> Yes, buffer_head list doesn't scale. That's the main reason (along with 4)
> why syscall-based IO sucks. We spend a lot of time looking for desired
> block.
> 
> We need to switch to some other data structure for storing buffer_heads.
> Is there a reason why we have list there in first place?
> Why not just array?
> 
> I will look into it, but this sounds like a separate infrastructure change
> project.

As Christoph said iomap code should help you with that and make things
simpler. If things go as we imagine, we should be able to pretty much avoid
buffer heads. But it will take some time to get there.

> > 2) PMD-sized pages result in increased space & memory usage.
> 
> Space? Do you mean disk space? Not really: we still don't write beyond
> i_size or into holes.
> 
> Behaviour wrt to holes may change with mmap()-IO as we have less
> granularity, but the same can be seen just between different
> architectures: 4k vs. 64k base page size.

Yes, I meant different granularity of mmap based IO. And I agree it isn't a
new problem but the scale of the problem is much larger with 2MB pages than
with say 64K pages. And actually the overhead of higher IO granularity of
64K pages has been one of the reasons we have switched SLES PPC kernels
from 64K pages to 4K pages (we've got complaints from customers). 
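
To put rough numbers on the granularity concern, the sketch below computes
how many bytes have to be allocated and written out when a sparse mmap
workload dirties a number of scattered locations, each in a different page.
dirty_footprint() and amplification() are hypothetical helper names, and the
model assumes every touched page is written back in full.

```c
#include <assert.h>

/* Bytes written back when `touched` distinct pages of `page_size` are dirtied. */
static unsigned long long dirty_footprint(unsigned long long touched,
					  unsigned long long page_size)
{
	return touched * page_size;
}

/* How much more is written per touch compared to a smaller base page size. */
static unsigned long long amplification(unsigned long long page_size,
					unsigned long long base_size)
{
	assert(base_size > 0 && page_size % base_size == 0);
	return page_size / base_size;
}
```

Under these assumptions, 1000 scattered small writes cost about 4MB with 4K
pages, about 64MB with 64K pages and about 2GB with 2MB pages, i.e. 16x and
512x the 4K case.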

> > 3) In ext4 we have to estimate how much metadata we may need to modify when
> > allocating blocks underlying a page in the worst case (you don't seem to
> > update this estimate in your patch set). With 2048 blocks underlying a page,
> > each possibly in a different block group, it is a lot of metadata forcing
> > us to reserve a large transaction (not sure if you'll be able to even
> > reserve such large transaction with the default journal size), which again
> > makes things slower.
> 
> I didn't see this in profiles. And xfstests looks fine. I probably need to
> run them with 1k blocks once again.

You wouldn't see this in profiles - it is a correctness thing. And it won't
be triggered unless the file is heavily fragmented which likely does not
happen with any test in xfstests. If it happens you'll notice though - the
filesystem will just report error and shut itself down.

> The numbers below were generated with fio. The working set is relatively small,
> so it fits into page cache and the writing set doesn't hit dirty_ratio.
> 
> I think the mmap performance should be enough to justify initial inclusion
> of an experimental feature: it is useful for workloads that target mmap()-IO.
> It will take time for the feature to mature anyway.

I agree it will take time for the feature to mature so I'm fine with
suboptimal performance in some cases. But I'm not fine with some of the
hacks you do currently because code maintainability is an issue even if
people don't actually use the feature...

> Configuration:
>  - 2x E5-2697v2, 64G RAM;
>  - INTEL SSDSC2CW24;
>  - IO request size is 4k;
>  - 8 processes, 512MB data set each;

The numbers indeed look interesting for mmaped case. Can you post the fio
cmdline? I'd like to compare profiles...

								Honza
> 
> Workload
>  read/write	baseline	stddev	huge=always	stddev		change
> --------------------------------------------------------------------------------
> sync-read
>  read		  21439.00	348.14	  20297.33	259.62		 -5.33%
> sync-write
>  write		   6833.20	147.08	   3630.13	 52.86		-46.88%
> sync-readwrite
>  read		   4377.17	 17.53	   2366.33	 19.52		-45.94%
>  write		   4378.50	 17.83	   2365.80	 19.94		-45.97%
> sync-randread
>  read		   5491.20	 66.66	  14664.00	288.29		167.05%
> sync-randwrite
>  write		   6396.13	 98.79	   2035.80	  8.17		-68.17%
> sync-randrw
>  read		   2927.30	115.81	   1036.08	 34.67		-64.61%
>  write		   2926.47	116.45	   1036.11	 34.90		-64.60%
> libaio-read
>  read		    254.36	 12.49	    258.63	 11.29		  1.68%
> libaio-write
>  write		   4979.20	122.75	   2904.77	 17.93		-41.66%
> libaio-readwrite
>  read		   2738.57	142.72	   2045.80	  4.12		-25.30%
>  write		   2729.93	141.80	   2039.77	  3.79		-25.28%
> libaio-randread
>  read		    113.63	  2.98	    210.63	  5.07		 85.37%
> libaio-randwrite
>  write		   4456.10	 76.21	   1649.63	  7.00		-62.98%
> libaio-randrw
>  read		     97.85	  8.03	    877.49	 28.27		796.80%
>  write		     97.55	  7.99	    874.83	 28.19		796.77%
> mmap-read
>  read		  20654.67	304.48	  24696.33	1064.07		 19.57%
> mmap-write
>  write		   8652.33	272.44	  13187.33	499.10		 52.41%
> mmap-readwrite
>  read		   6620.57	 16.05	   9221.60	399.56		 39.29%
>  write		   6623.63	 16.34	   9222.13	399.31		 39.23%
> mmap-randread
>  read		   6717.23	1360.55	  21939.33	326.38		226.61%
> mmap-randwrite
>  write		   3204.63	253.66	  12371.00	 61.49		286.03%
> mmap-randrw
>  read		   2150.50	 78.00	   7682.67	188.59		257.25%
>  write		   2149.50	 78.00	   7685.40	188.35		257.54%
> 
> -- 
>  Kirill A. Shutemov
Kirill A. Shutemov Nov. 7, 2016, 11:07 a.m. UTC | #9
On Thu, Nov 03, 2016 at 09:40:12PM +0100, Jan Kara wrote:
> On Wed 02-11-16 11:32:04, Kirill A. Shutemov wrote:
> > Yes, buffer_head list doesn't scale. That's the main reason (along with 4)
> > why syscall-based IO sucks. We spend a lot of time looking for desired
> > block.
> > 
> > We need to switch to some other data structure for storing buffer_heads.
> > Is there a reason why we have list there in first place?
> > Why not just array?
> > 
> > I will look into it, but this sounds like a separate infrastructure change
> > project.
> 
> As Christoph said iomap code should help you with that and make things
> simpler. If things go as we imagine, we should be able to pretty much avoid
> buffer heads. But it will take some time to get there.

Just to clarify: is it a show-stopper, or can we live with the buffer_head list
for now?

> > > 2) PMD-sized pages result in increased space & memory usage.
> > 
> > Space? Do you mean disk space? Not really: we still don't write beyond
> > i_size or into holes.
> > 
> > Behaviour wrt to holes may change with mmap()-IO as we have less
> > granularity, but the same can be seen just between different
> > architectures: 4k vs. 64k base page size.
> 
> Yes, I meant different granularity of mmap based IO. And I agree it isn't a
> new problem but the scale of the problem is much larger with 2MB pages than
> with say 64K pages. And actually the overhead of higher IO granularity of
> 64K pages has been one of the reasons we have switched SLES PPC kernels
> from 64K pages to 4K pages (we've got complaints from customers). 

I guess fadvise()/madvise() hints for opt-in/opt-out should be good enough
to deal with this. I probably need to wire them up.

> > > 3) In ext4 we have to estimate how much metadata we may need to modify when
> > > allocating blocks underlying a page in the worst case (you don't seem to
> > > update this estimate in your patch set). With 2048 blocks underlying a page,
> > > each possibly in a different block group, it is a lot of metadata forcing
> > > us to reserve a large transaction (not sure if you'll be able to even
> > > reserve such large transaction with the default journal size), which again
> > > makes things slower.
> > 
> > I didn't see this in profiles. And xfstests looks fine. I probably need to
> > run them with 1k blocks once again.
> 
> You wouldn't see this in profiles - it is a correctness thing. And it won't
> be triggered unless the file is heavily fragmented which likely does not
> happen with any test in xfstests. If it happens you'll notice though - the
> filesystem will just report error and shut itself down.

Any suggestion how I can simulate this situation?

> > The numbers below generated with fio. The working set is relatively small,
> > so it fits into page cache and writing set doesn't hit dirty_ratio.
> > 
> > I think the mmap performance should be enough to justify initial inclusion
> > of an experimental feature: it is useful for workloads that target
> > mmap()-IO. It will take time for the feature to mature anyway.
> 
> I agree it will take time for the feature to mature so I'm fine with
> suboptimal performance in some cases. But I'm not fine with some of the
> hacks you do currently because code maintainability is an issue even if
> people don't actually use the feature...

Hm. Okay, I'll try to check what I can do to make it more maintainable.
My worry is that it will make the patchset even bigger...

> > Configuration:
> >  - 2x E5-2697v2, 64G RAM;
> >  - INTEL SSDSC2CW24;
> >  - IO request size is 4k;
> >  - 8 processes, 512MB data set each;
> 
> The numbers indeed look interesting for mmaped case. Can you post the fio
> cmdline? I'd like to compare profiles...

	fio \
		--directory=/mnt/ \
		--name="$engine-$rw" \
		--ioengine="$engine" \
		--rw="$rw" \
		--size=512M \
		--invalidate=1 \
		--numjobs=8 \
		--runtime=60 \
		--time_based \
		--group_reporting
Kirill A. Shutemov Nov. 7, 2016, 11:13 a.m. UTC | #10
On Wed, Nov 02, 2016 at 07:36:12AM -0700, Christoph Hellwig wrote:
> On Tue, Nov 01, 2016 at 05:39:40PM +0100, Jan Kara wrote:
> > I'd also note that having PMD-sized pages has some obvious disadvantages as
> > well:
> > 
> > 1) I'm not sure buffer head handling code will quite scale to 512 or even
> > 2048 buffer_heads on a linked list referenced from a page. It may work but
> > I suspect the performance will suck. 
> 
> buffer_head handling always sucks.  For the iomap based buffered write
> path I plan to support a buffer_head-less mode for the block size ==
> PAGE_SIZE case in 4.11 at the latest, or even in 4.10 if I get enough
> other things off my plate in time.  I think that's the right way to go for
> THP, especially if we require the fs to allocate the whole huge page
> as a single extent, similar to the DAX PMD mapping case.
> 
> > 2) PMD-sized pages result in increased space & memory usage.
> 
> How so?
> 
> > 3) In ext4 we have to estimate how much metadata we may need to modify when
> > allocating blocks underlying a page in the worst case (you don't seem to
> > update this estimate in your patch set). With 2048 blocks underlying a page,
> > each possibly in a different block group, it is a lot of metadata forcing
> > us to reserve a large transaction (not sure if you'll be able to even
> > reserve such large transaction with the default journal size), which again
> > makes things slower.
> 
> As said above I think we should only use huge page mappings if there is
> a single underlying extent, same as in DAX to keep the complexity down.

It looks like a huge limitation to me.

> > 4) As you have noted some places like write_begin() still depend on 4k
> > pages which creates a strange mix of places that use subpages and that use
> > head pages.
> 
> Just use the iomap buffered I/O code and all these issues will go away.

Not really.

I'm looking at iomap_write_actor(): we still calculate 'offset' and
'bytes' based on PAGE_SIZE before we even get the page.
This way we limit ourselves to PAGE_SIZE per iteration.
Christoph Hellwig Nov. 7, 2016, 2:59 p.m. UTC | #11
On Mon, Nov 07, 2016 at 02:07:36PM +0300, Kirill A. Shutemov wrote:
> Just to clarify: is it a show-stopper, or can we live with the buffer_head
> list for now?

I'm not Jan, but I will NAK anything that looks like the current THP
series.  It's a great prototype, but it also shows up all the areas
that we need to fix first, and the buffer_head chain is one of them.

> Hm. Okay, I'll try to check what I can do to make it more maintainable.
> My worry is that it will make the patchset even bigger...

So start splitting out parts that are useful on their own, or spend
time on fixing fundamental underlying issues that will make it smaller
as a side effect.  That's how everyone else does kernel development.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig Nov. 7, 2016, 3:01 p.m. UTC | #12
On Mon, Nov 07, 2016 at 02:13:05PM +0300, Kirill A. Shutemov wrote:
> It looks like a huge limitation to me.

The DAX PMD fault code can live just fine with it.  And without it
performance would suck anyway.

> I'm looking at iomap_write_actor(): we still calculate 'offset' and
> 'bytes' based on PAGE_SIZE before we even get the page.
> This way we limit ourselves to PAGE_SIZE per iteration.

Of course it does, given that it does not support huge pages _yet_.
But the important part is that this is now isolated to the high-level
code, and the fs can get iomap_begin calls for a huge page (or in fact
much larger sizes than that).
Kirill A . Shutemov Nov. 7, 2016, 4:03 p.m. UTC | #13
On Mon, Nov 07, 2016 at 07:01:03AM -0800, Christoph Hellwig wrote:
> On Mon, Nov 07, 2016 at 02:13:05PM +0300, Kirill A. Shutemov wrote:
> > It looks like a huge limitation to me.
> 
> The DAX PMD fault code can live just fine with it.

There's no way out for DAX as we map backing storage directly into
userspace. There's no such limitation for page cache, and I don't see the
point of introducing such a limitation artificially.

Backing storage fragmentation can weigh on the decision whether to
allocate a huge page, but it shouldn't be a show-stopper.

> And without it performance would suck anyway.

It depends on workload, obviously.

Patch

diff --git a/mm/filemap.c b/mm/filemap.c
index 50afe17230e7..b77bcf6843ee 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1860,6 +1860,7 @@  find_page:
 			if (unlikely(page == NULL))
 				goto no_cached_page;
 		}
+		page = compound_head(page);
 		if (PageReadahead(page)) {
 			page_cache_async_readahead(mapping,
 					ra, filp, page,
@@ -1936,7 +1937,8 @@  page_ok:
 		 * now we can copy it to user space...
 		 */
 
-		ret = copy_page_to_iter(page, offset, nr, iter);
+		ret = copy_page_to_iter(page + index - page->index, offset,
+				nr, iter);
 		offset += ret;
 		index += offset >> PAGE_SHIFT;
 		offset &= ~PAGE_MASK;
@@ -2356,6 +2358,7 @@  page_not_uptodate:
 	 * because there really aren't any performance issues here
 	 * and we need to check for errors.
 	 */
+	page = compound_head(page);
 	ClearPageError(page);
 	error = mapping->a_ops->readpage(file, page);
 	if (!error) {