diff mbox

[PATCHv3,17/41] filemap: handle huge pages in filemap_fdatawait_range()

Message ID 20160915115523.29737-18-kirill.shutemov@linux.intel.com (mailing list archive)
State New, archived
Headers show

Commit Message

Kirill A . Shutemov Sept. 15, 2016, 11:54 a.m. UTC
We writeback whole huge page a time.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 mm/filemap.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Jan Kara Oct. 13, 2016, 9:44 a.m. UTC | #1
On Thu 15-09-16 14:54:59, Kirill A. Shutemov wrote:
> We writeback whole huge page a time.

This is one of the things I don't understand. Firstly I didn't see where
changes of writeback like this would happen (maybe they come later).
Secondly I'm not sure why e.g. writeback should behave atomically wrt huge
pages. Is this because radix-tree multiorder entry tracks dirtiness for us
at that granularity? BTW, can you also explain why do we need multiorder
entries? What do they solve for us?

I'm sorry for these basic questions but I'd just like to understand how is
this supposed to work...

								Honza


> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  mm/filemap.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 05b42d3e5ed8..53da93156e60 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -372,9 +372,14 @@ static int __filemap_fdatawait_range(struct address_space *mapping,
>  			if (page->index > end)
>  				continue;
>  
> +			page = compound_head(page);
>  			wait_on_page_writeback(page);
>  			if (TestClearPageError(page))
>  				ret = -EIO;
> +			if (PageTransHuge(page)) {
> +				index = page->index + HPAGE_PMD_NR;
> +				i += index - pvec.pages[i]->index - 1;
> +			}
>  		}
>  		pagevec_release(&pvec);
>  		cond_resched();
> -- 
> 2.9.3
> 
>
Kirill A. Shutemov Oct. 13, 2016, 12:08 p.m. UTC | #2
On Thu, Oct 13, 2016 at 11:44:41AM +0200, Jan Kara wrote:
> On Thu 15-09-16 14:54:59, Kirill A. Shutemov wrote:
> > We writeback whole huge page a time.
> 
> This is one of the things I don't understand. Firstly I didn't see where
> changes of writeback like this would happen (maybe they come later).
> Secondly I'm not sure why e.g. writeback should behave atomically wrt huge
> pages. Is this because radix-tree multiorder entry tracks dirtiness for us
> at that granularity?

We track dirty/writeback on per-compound pages: meaning we have one
dirty/writeback flag for whole compound page, not on every individual
4k subpage. The same story for radix-tree tags.

> BTW, can you also explain why do we need multiorder entries? What do
> they solve for us?

It helps us having coherent view on tags in radix-tree: no matter which
index we refer from the range huge page covers we will get the same
answer on which tags set.

> I'm sorry for these basic questions but I'd just like to understand how is
> this supposed to work...
> 
> 								Honza
> 
> 
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >  mm/filemap.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 05b42d3e5ed8..53da93156e60 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -372,9 +372,14 @@ static int __filemap_fdatawait_range(struct address_space *mapping,
> >  			if (page->index > end)
> >  				continue;
> >  
> > +			page = compound_head(page);
> >  			wait_on_page_writeback(page);
> >  			if (TestClearPageError(page))
> >  				ret = -EIO;
> > +			if (PageTransHuge(page)) {
> > +				index = page->index + HPAGE_PMD_NR;
> > +				i += index - pvec.pages[i]->index - 1;
> > +			}
> >  		}
> >  		pagevec_release(&pvec);
> >  		cond_resched();
> > -- 
> > 2.9.3
> > 
> > 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
Jan Kara Oct. 13, 2016, 1:18 p.m. UTC | #3
On Thu 13-10-16 15:08:44, Kirill A. Shutemov wrote:
> On Thu, Oct 13, 2016 at 11:44:41AM +0200, Jan Kara wrote:
> > On Thu 15-09-16 14:54:59, Kirill A. Shutemov wrote:
> > > We writeback whole huge page a time.
> > 
> > This is one of the things I don't understand. Firstly I didn't see where
> > changes of writeback like this would happen (maybe they come later).
> > Secondly I'm not sure why e.g. writeback should behave atomically wrt huge
> > pages. Is this because radix-tree multiorder entry tracks dirtiness for us
> > at that granularity?
> 
> We track dirty/writeback on per-compound pages: meaning we have one
> dirty/writeback flag for whole compound page, not on every individual
> 4k subpage. The same story for radix-tree tags.
> 
> > BTW, can you also explain why do we need multiorder entries? What do
> > they solve for us?
> 
> It helps us having coherent view on tags in radix-tree: no matter which
> index we refer from the range huge page covers we will get the same
> answer on which tags set.

OK, understand that. But why do we need a coherent view? For which purposes
exactly do we care that it is not just a bunch of 4k pages that happen to
be physically contiguous and thus can be mapped in one PMD?

								Honza
Kirill A. Shutemov Oct. 24, 2016, 11:36 a.m. UTC | #4
On Thu, Oct 13, 2016 at 03:18:02PM +0200, Jan Kara wrote:
> On Thu 13-10-16 15:08:44, Kirill A. Shutemov wrote:
> > On Thu, Oct 13, 2016 at 11:44:41AM +0200, Jan Kara wrote:
> > > On Thu 15-09-16 14:54:59, Kirill A. Shutemov wrote:
> > > > We writeback whole huge page a time.
> > > 
> > > This is one of the things I don't understand. Firstly I didn't see where
> > > changes of writeback like this would happen (maybe they come later).
> > > Secondly I'm not sure why e.g. writeback should behave atomically wrt huge
> > > pages. Is this because radix-tree multiorder entry tracks dirtiness for us
> > > at that granularity?
> > 
> > We track dirty/writeback on per-compound pages: meaning we have one
> > dirty/writeback flag for whole compound page, not on every individual
> > 4k subpage. The same story for radix-tree tags.
> > 
> > > BTW, can you also explain why do we need multiorder entries? What do
> > > they solve for us?
> > 
> > It helps us having coherent view on tags in radix-tree: no matter which
> > index we refer from the range huge page covers we will get the same
> > answer on which tags set.
> 
> OK, understand that. But why do we need a coherent view? For which purposes
> exactly do we care that it is not just a bunch of 4k pages that happen to
> be physically contiguous and thus can be mapped in one PMD?

My understanding is that things like PageDirty() should be handled on the
same granularity as PAGECACHE_TAG_DIRTY, otherwise things can go horribly
wrong...
Jan Kara Oct. 30, 2016, 5:31 p.m. UTC | #5
On Mon 24-10-16 14:36:25, Kirill A. Shutemov wrote:
> On Thu, Oct 13, 2016 at 03:18:02PM +0200, Jan Kara wrote:
> > On Thu 13-10-16 15:08:44, Kirill A. Shutemov wrote:
> > > On Thu, Oct 13, 2016 at 11:44:41AM +0200, Jan Kara wrote:
> > > > On Thu 15-09-16 14:54:59, Kirill A. Shutemov wrote:
> > > > > We writeback whole huge page a time.
> > > > 
> > > > This is one of the things I don't understand. Firstly I didn't see where
> > > > changes of writeback like this would happen (maybe they come later).
> > > > Secondly I'm not sure why e.g. writeback should behave atomically wrt huge
> > > > pages. Is this because radix-tree multiorder entry tracks dirtiness for us
> > > > at that granularity?
> > > 
> > > We track dirty/writeback on per-compound pages: meaning we have one
> > > dirty/writeback flag for whole compound page, not on every individual
> > > 4k subpage. The same story for radix-tree tags.
> > > 
> > > > BTW, can you also explain why do we need multiorder entries? What do
> > > > they solve for us?
> > > 
> > > It helps us having coherent view on tags in radix-tree: no matter which
> > > index we refer from the range huge page covers we will get the same
> > > answer on which tags set.
> > 
> > OK, understand that. But why do we need a coherent view? For which purposes
> > exactly do we care that it is not just a bunch of 4k pages that happen to
> > be physically contiguous and thus can be mapped in one PMD?
> 
> My understanding is that things like PageDirty() should be handled on the
> same granularity as PAGECACHE_TAG_DIRTY, otherwise things can go horribly
> wrong...

Yeah, I agree with that. My question was rather aiming in the direction:
Why don't we keep PageDirty and PAGECACHE_TAG_DIRTY on a page granularity?
Why do we push all this to happen only in the head page?

In your coverletter of the latest version (BTW thanks for expanding
explanations there) you write:
  - head page (the first subpage) on LRU represents whole huge page;
  - head page's flags represent state of whole huge page (with few
    exceptions);
  - mm can't migrate subpages of the compound page individually;

So the fact that flags of a head page represent flags of each individual
page is the decision that I'm questioning, at least for PageDirty and
PageWriteback flags. I'm asking because frankly, I don't like the series
much. IMHO too many places need to know about huge pages and things will
get broken frequently. And from filesystem POV I don't really see why a
filesystem should care about huge pages *at all*. Sure functions allocating
pages into page cache need to care, sure functions mapping pages into page
tables need to care. But nobody else should need to be aware we are playing
some huge page games... At least that is my idea how things ought to work
;)

Your solution seems to go more towards the direction where we have two
different sizes of pages in the system and everyone has to cope with it.
But I'd also note that you go only half way there - e.g. page lookup
functions still work with subpages, some places still use PAGE_SIZE &
page->index, ... - so the result is a strange mix.

So what are the reasons for having pages forming a huge page bound so
tightly?


								Honza
diff mbox

Patch

diff --git a/mm/filemap.c b/mm/filemap.c
index 05b42d3e5ed8..53da93156e60 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -372,9 +372,14 @@  static int __filemap_fdatawait_range(struct address_space *mapping,
 			if (page->index > end)
 				continue;
 
+			page = compound_head(page);
 			wait_on_page_writeback(page);
 			if (TestClearPageError(page))
 				ret = -EIO;
+			if (PageTransHuge(page)) {
+				index = page->index + HPAGE_PMD_NR;
+				i += index - pvec.pages[i]->index - 1;
+			}
 		}
 		pagevec_release(&pvec);
 		cond_resched();