
[RFC,7/9] mm/gup: Decrement head page once for group of subpages

Message ID 20201208172901.17384-9-joao.m.martins@oracle.com (mailing list archive)
State New, archived
Series mm, sparse-vmemmap: Introduce compound pagemaps

Commit Message

Joao Martins Dec. 8, 2020, 5:28 p.m. UTC
Rather than decrementing the ref count one page at a time, walk the
page array and check which pages belong to the same compound_head.
Then decrement the calculated number of references with a single
write to the head page.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
---
 mm/gup.c | 41 ++++++++++++++++++++++++++++++++---------
 1 file changed, 32 insertions(+), 9 deletions(-)

Comments

Jason Gunthorpe Dec. 8, 2020, 7:34 p.m. UTC | #1
On Tue, Dec 08, 2020 at 05:28:59PM +0000, Joao Martins wrote:
> Rather than decrementing the ref count one by one, we
> walk the page array and checking which belong to the same
> compound_head. Later on we decrement the calculated amount
> of references in a single write to the head page.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>  mm/gup.c | 41 ++++++++++++++++++++++++++++++++---------
>  1 file changed, 32 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index 194e6981eb03..3a9a7229f418 100644
> +++ b/mm/gup.c
> @@ -212,6 +212,18 @@ static bool __unpin_devmap_managed_user_page(struct page *page)
>  }
>  #endif /* CONFIG_DEV_PAGEMAP_OPS */
>  
> +static int record_refs(struct page **pages, int npages)
> +{
> +	struct page *head = compound_head(pages[0]);
> +	int refs = 1, index;
> +
> +	for (index = 1; index < npages; index++, refs++)
> +		if (compound_head(pages[index]) != head)
> +			break;
> +
> +	return refs;
> +}
> +
>  /**
>   * unpin_user_page() - release a dma-pinned page
>   * @page:            pointer to page to be released
> @@ -221,9 +233,9 @@ static bool __unpin_devmap_managed_user_page(struct page *page)
>   * that such pages can be separately tracked and uniquely handled. In
>   * particular, interactions with RDMA and filesystems need special handling.
>   */
> -void unpin_user_page(struct page *page)
> +static void __unpin_user_page(struct page *page, int refs)

Refs should be unsigned everywhere.

I suggest using clear language: 'page' here should always be a compound
head called 'head' (or do we have another common variable name for
this?)

'refs' is the number of tail pages within the compound, so 'ntails' or
something

>  {
> -	int refs = 1;
> +	int orig_refs = refs;
>  
>  	page = compound_head(page);

Caller should always do this

> @@ -237,14 +249,19 @@ void unpin_user_page(struct page *page)
>  		return;
>  
>  	if (hpage_pincount_available(page))
> -		hpage_pincount_sub(page, 1);
> +		hpage_pincount_sub(page, refs);
>  	else
> -		refs = GUP_PIN_COUNTING_BIAS;
> +		refs *= GUP_PIN_COUNTING_BIAS;
>  
>  	if (page_ref_sub_and_test(page, refs))
>  		__put_page(page);
>  
> -	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
> +	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, orig_refs);
> +}

And really this should be placed directly after
try_grab_compound_head() and be given a similar name
'unpin_compound_head()'. Even better would be to split the FOLL_PIN
part into a function so there was a clear logical pairing.

And reviewing it like that I want to ask if this unpin sequence is in
the right order.. I would expect it to be the reverse order of the get

John?

Is it safe to call mod_node_page_state() after releasing the refcount?
This could race with hot-unplugging the struct pages so I think it is
wrong.

> +void unpin_user_page(struct page *page)
> +{
> +	__unpin_user_page(page, 1);

Thus this is

	__unpin_user_page(compound_head(page), 1);

> @@ -274,6 +291,7 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>  				 bool make_dirty)
>  {
>  	unsigned long index;
> +	int refs = 1;
>  
>  	/*
>  	 * TODO: this can be optimized for huge pages: if a series of pages is
> @@ -286,8 +304,9 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>  		return;
>  	}
>  
> -	for (index = 0; index < npages; index++) {
> +	for (index = 0; index < npages; index += refs) {
>  		struct page *page = compound_head(pages[index]);
> +

I think this is really hard to read; it should end up as something like:

for_each_compound_head(page_list, page_list_len, &head, &ntails) {
       		if (!PageDirty(head))
			set_page_dirty_lock(head, ntails);
		unpin_user_page(head, ntails);
}

And maybe you open code that iteration, but the basic idea of finding a
compound_head and its ntails should be the computational work performed.
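
A rough sketch of what that open-coded iteration could look like (purely
illustrative; neither the helper nor the macro exists yet, and the names
are made up):

	/* Find the compound head at list[i] and how many entries share it. */
	static inline void compound_next(unsigned long i, unsigned long npages,
					 struct page **list, struct page **head,
					 unsigned int *ntails)
	{
		struct page *page;
		unsigned long nr;

		if (i >= npages)
			return;

		page = compound_head(list[i]);
		for (nr = i + 1; nr < npages; nr++)
			if (compound_head(list[nr]) != page)
				break;

		*head = page;
		*ntails = nr - i;
	}

	#define for_each_compound_head(i, list, npages, head, ntails)		\
		for (i = 0, compound_next(i, npages, list, &head, &ntails);	\
		     i < npages;						\
		     i += ntails, compound_next(i, npages, list, &head, &ntails))

The dirty-and-unpin loop would then read as the snippet above, with 'head'
and 'ntails' declared by the caller.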

No reason not to fix set_page_dirty_lock() too while you are here.

Also, this patch and the next can be completely independent of the
rest of the series, it is valuable regardless of the other tricks. You
can split them and progress them independently.

.. and I was just talking about this with Daniel Jordan and some other
people at your company :)

Thanks,
Jason
John Hubbard Dec. 9, 2020, 5:06 a.m. UTC | #2
On 12/8/20 11:34 AM, Jason Gunthorpe wrote:
> On Tue, Dec 08, 2020 at 05:28:59PM +0000, Joao Martins wrote:
>> Rather than decrementing the ref count one by one, we
>> walk the page array and checking which belong to the same
>> compound_head. Later on we decrement the calculated amount
>> of references in a single write to the head page.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>   mm/gup.c | 41 ++++++++++++++++++++++++++++++++---------
>>   1 file changed, 32 insertions(+), 9 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 194e6981eb03..3a9a7229f418 100644
>> +++ b/mm/gup.c
>> @@ -212,6 +212,18 @@ static bool __unpin_devmap_managed_user_page(struct page *page)
>>   }
>>   #endif /* CONFIG_DEV_PAGEMAP_OPS */
>>   
>> +static int record_refs(struct page **pages, int npages)
>> +{
>> +	struct page *head = compound_head(pages[0]);
>> +	int refs = 1, index;
>> +
>> +	for (index = 1; index < npages; index++, refs++)
>> +		if (compound_head(pages[index]) != head)
>> +			break;
>> +
>> +	return refs;
>> +}
>> +
>>   /**
>>    * unpin_user_page() - release a dma-pinned page
>>    * @page:            pointer to page to be released
>> @@ -221,9 +233,9 @@ static bool __unpin_devmap_managed_user_page(struct page *page)
>>    * that such pages can be separately tracked and uniquely handled. In
>>    * particular, interactions with RDMA and filesystems need special handling.
>>    */
>> -void unpin_user_page(struct page *page)
>> +static void __unpin_user_page(struct page *page, int refs)
> 
> Refs should be unsigned everywhere.

That's fine (although, see my comments in the previous patch for
pitfalls). But it should be a preparatory patch, in order to avoid
clouding up this one and your others as well.


> 
> I suggest using clear language 'page' here should always be a compound
> head called 'head' (or do we have another common variable name for
> this?)
> 

Agreed. Matthew's struct folio upgrade will allow us to really make
things clear in a typesafe way, but meanwhile, it's probably good to use
one of the following patterns:

page = compound_head(page); // at the very beginning of a routine

or

do_things_to_this_single_page(page);

head = compound_head(page);
do_things_to_this_compound_page(head);


> 'refs' is number of tail pages within the compound, so 'ntails' or
> something
> 

I think it's OK to leave it as "refs", because within gup.c, refs has
a very particular meaning. But if you change to ntails or something, I'd
want to see a complete change: no leftovers of refs that are really ntails.

So far I'd rather leave it as refs, but it's not a big deal either way.

>>   {
>> -	int refs = 1;
>> +	int orig_refs = refs;
>>   
>>   	page = compound_head(page);
> 
> Caller should always do this
> 
>> @@ -237,14 +249,19 @@ void unpin_user_page(struct page *page)
>>   		return;
>>   
>>   	if (hpage_pincount_available(page))
>> -		hpage_pincount_sub(page, 1);
>> +		hpage_pincount_sub(page, refs);

Maybe a nice touch would be to pass in orig_refs, because there
is no intention to use a possibly modified refs. So:

		hpage_pincount_sub(page, orig_refs);

...obviously a fine point, I realize. :)

>>   	else
>> -		refs = GUP_PIN_COUNTING_BIAS;
>> +		refs *= GUP_PIN_COUNTING_BIAS;
>>   
>>   	if (page_ref_sub_and_test(page, refs))
>>   		__put_page(page);
>>   
>> -	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
>> +	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, orig_refs);
>> +}
> 
> And really this should be placed directly after
> try_grab_compound_head() and be given a similar name
> 'unpin_compound_head()'. Even better would be to split the FOLL_PIN
> part into a function so there was a clear logical pairing.
> 
> And reviewing it like that I want to ask if this unpin sequence is in
> the right order.. I would expect it to be the reverse order of the get
> 
> John?
> 
> Is it safe to call mod_node_page_state() after releasing the refcount?
> This could race with hot-unplugging the struct pages so I think it is
> wrong.

Yes, I think you are right! I wasn't in a hot unplug state of mind when I
thought about the ordering there, but I should have been. :)

> 
>> +void unpin_user_page(struct page *page)
>> +{
>> +	__unpin_user_page(page, 1);
> 
> Thus this is
> 
> 	__unpin_user_page(compound_head(page), 1);
> 
>> @@ -274,6 +291,7 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>>   				 bool make_dirty)
>>   {
>>   	unsigned long index;
>> +	int refs = 1;
>>   
>>   	/*
>>   	 * TODO: this can be optimized for huge pages: if a series of pages is

I think you can delete this TODO block now, and the one in unpin_user_pages(),
as a result of these changes.

>> @@ -286,8 +304,9 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>>   		return;
>>   	}
>>   
>> -	for (index = 0; index < npages; index++) {
>> +	for (index = 0; index < npages; index += refs) {
>>   		struct page *page = compound_head(pages[index]);
>> +
> 
> I think this is really hard to read, it should end up as some:
> 
> for_each_compond_head(page_list, page_list_len, &head, &ntails) {
>         		if (!PageDirty(head))
> 			set_page_dirty_lock(head, ntails);
> 		unpin_user_page(head, ntails);
> }
> 
> And maybe you open code that iteration, but that basic idea to find a
> compound_head and ntails should be computational work performed.
> 
> No reason not to fix set_page_dirty_lock() too while you are here.

Eh? What's wrong with set_page_dirty_lock()?

> 
> Also, this patch and the next can be completely independent of the
> rest of the series, it is valuable regardless of the other tricks. You
> can split them and progress them independently.
> 
> .. and I was just talking about this with Daniel Jordan and some other
> people at your company :)
> 
> Thanks,
> Jason
> 

thanks,
Joao Martins Dec. 9, 2020, 12:17 p.m. UTC | #3
On 12/8/20 7:34 PM, Jason Gunthorpe wrote:
> On Tue, Dec 08, 2020 at 05:28:59PM +0000, Joao Martins wrote:
>> Rather than decrementing the ref count one by one, we
>> walk the page array and checking which belong to the same
>> compound_head. Later on we decrement the calculated amount
>> of references in a single write to the head page.
>>
>> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
>>  mm/gup.c | 41 ++++++++++++++++++++++++++++++++---------
>>  1 file changed, 32 insertions(+), 9 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index 194e6981eb03..3a9a7229f418 100644
>> +++ b/mm/gup.c
>> @@ -212,6 +212,18 @@ static bool __unpin_devmap_managed_user_page(struct page *page)
>>  }
>>  #endif /* CONFIG_DEV_PAGEMAP_OPS */
>>  
>> +static int record_refs(struct page **pages, int npages)
>> +{
>> +	struct page *head = compound_head(pages[0]);
>> +	int refs = 1, index;
>> +
>> +	for (index = 1; index < npages; index++, refs++)
>> +		if (compound_head(pages[index]) != head)
>> +			break;
>> +
>> +	return refs;
>> +}
>> +
>>  /**
>>   * unpin_user_page() - release a dma-pinned page
>>   * @page:            pointer to page to be released
>> @@ -221,9 +233,9 @@ static bool __unpin_devmap_managed_user_page(struct page *page)
>>   * that such pages can be separately tracked and uniquely handled. In
>>   * particular, interactions with RDMA and filesystems need special handling.
>>   */
>> -void unpin_user_page(struct page *page)
>> +static void __unpin_user_page(struct page *page, int refs)
> 
> Refs should be unsigned everywhere.
> 
/me nods

> I suggest using clear language 'page' here should always be a compound
> head called 'head' (or do we have another common variable name for
> this?)
> 
> 'refs' is number of tail pages within the compound, so 'ntails' or
> something
> 
The usage of 'refs' seems to align with the rest of the GUP code. It always refers to
tail pages, and the unpin case isn't any different IIUC.

I suppose we can always change that, but maybe it's better to do that renaming in one
shot as a follow-up cleanup?

>>  {
>> -	int refs = 1;
>> +	int orig_refs = refs;
>>  
>>  	page = compound_head(page);
> 
> Caller should always do this
> 
/me nods

>> @@ -237,14 +249,19 @@ void unpin_user_page(struct page *page)
>>  		return;
>>  
>>  	if (hpage_pincount_available(page))
>> -		hpage_pincount_sub(page, 1);
>> +		hpage_pincount_sub(page, refs);
>>  	else
>> -		refs = GUP_PIN_COUNTING_BIAS;
>> +		refs *= GUP_PIN_COUNTING_BIAS;
>>  
>>  	if (page_ref_sub_and_test(page, refs))
>>  		__put_page(page);
>>  
>> -	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
>> +	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, orig_refs);
>> +}
> 
> And really this should be placed directly after
> try_grab_compound_head() and be given a similar name
> 'unpin_compound_head()'. Even better would be to split the FOLL_PIN
> part into a function so there was a clear logical pairing.
> 
> And reviewing it like that I want to ask if this unpin sequence is in
> the right order.. I would expect it to be the reverse order of the get
> 
> John?
> 
> Is it safe to call mod_node_page_state() after releasing the refcount?
> This could race with hot-unplugging the struct pages so I think it is
> wrong.
> 
It appears to be the case, based on John's follow-up comment.

>> +void unpin_user_page(struct page *page)
>> +{
>> +	__unpin_user_page(page, 1);
> 
> Thus this is
> 
> 	__unpin_user_page(compound_head(page), 1);
> 
Got it.

>> @@ -274,6 +291,7 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>>  				 bool make_dirty)
>>  {
>>  	unsigned long index;
>> +	int refs = 1;
>>  
>>  	/*
>>  	 * TODO: this can be optimized for huge pages: if a series of pages is
>> @@ -286,8 +304,9 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>>  		return;
>>  	}
>>  
>> -	for (index = 0; index < npages; index++) {
>> +	for (index = 0; index < npages; index += refs) {
>>  		struct page *page = compound_head(pages[index]);
>> +
> 
> I think this is really hard to read, it should end up as some:
> 
> for_each_compond_head(page_list, page_list_len, &head, &ntails) {
>        		if (!PageDirty(head))
> 			set_page_dirty_lock(head, ntails);
> 		unpin_user_page(head, ntails);
> }
> 
/me nods. Let me attempt that.

> And maybe you open code that iteration, but that basic idea to find a
> compound_head and ntails should be computational work performed.
> 
I like the idea of a page range API as an alternative to unpin_user_pages(), but
improving the current unpin_user_pages() would improve the other unpin users too.

Perhaps the logic can be common: the current unpin_user_pages() would keep the
array-iteration part, while the new (faster) API would be based on computation.
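
A very rough sketch of what the computation-based variant could look like for a
physically contiguous range (the function name is made up, and the compound_nr()
arithmetic would need checking against the actual pgmap geometry):

	/*
	 * Purely illustrative: unpin a physically contiguous range by computing
	 * how many pages share each compound head, instead of scanning a
	 * struct page array.
	 */
	static void unpin_contiguous_pages(struct page *page, unsigned long npages)
	{
		unsigned long i = 0;

		while (i < npages) {
			struct page *head = compound_head(page + i);
			unsigned int ntails;

			/* pages left in this compound, capped by the request */
			ntails = min_t(unsigned long, npages - i,
				       compound_nr(head) - (page + i - head));

			__unpin_user_page(head, ntails);
			i += ntails;
		}
	}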

> No reason not to fix set_page_dirty_lock() too while you are here.
> 
OK.

> Also, this patch and the next can be completely independent of the
> rest of the series, it is valuable regardless of the other tricks. You
> can split them and progress them independently.
> 
Yeap, let me do that.

> .. and I was just talking about this with Daniel Jordan and some other
> people at your company :)
> 

:)
Jason Gunthorpe Dec. 9, 2020, 1:43 p.m. UTC | #4
On Tue, Dec 08, 2020 at 09:06:50PM -0800, John Hubbard wrote:
> > I suggest using clear language 'page' here should always be a compound
> > head called 'head' (or do we have another common variable name for
> > this?)
> 
> Agreed. Matthew's struct folio upgrade will allow us to really make
> things clear in a typesafe way, but meanwhile, it's probably good to use
> one of the following patterns:

Yes, this fits very well with the folio patches, and is much clearer

> page = compound_head(page); // at the very beginning of a routine

No, these routines really want to operate on heads/folios; that is the whole
point.
 
> do_things_to_this_single_page(page);
> 
> head = compound_head(page);
> do_things_to_this_compound_page(head);

Yes, but wordy though

> > Is it safe to call mod_node_page_state() after releasing the refcount?
> > This could race with hot-unplugging the struct pages so I think it is
> > wrong.
> 
> Yes, I think you are right! I wasn't in a hot unplug state of mind when I
> thought about the ordering there, but I should have been. :)

Ok

Hmm..

unpin_user_page() and put_compound_head() do exactly the same thing, 
and the latter gets it all right. I'll make a patch to fix this
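
For reference, a sketch of the ordering that avoids the race (modelled on
put_compound_head(), not a verbatim copy of it): the NR_FOLL_PIN_RELEASED
update happens before the reference is dropped, so the struct page cannot go
away underneath mod_node_page_state():

	static void unpin_compound_head(struct page *head, int refs)
	{
		/* account the release while we still hold the pin reference */
		mod_node_page_state(page_pgdat(head), NR_FOLL_PIN_RELEASED, refs);

		if (hpage_pincount_available(head))
			hpage_pincount_sub(head, refs);
		else
			refs *= GUP_PIN_COUNTING_BIAS;

		/* only now drop the references, possibly freeing the page */
		if (page_ref_sub_and_test(head, refs))
			__put_page(head);
	}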

> > And maybe you open code that iteration, but that basic idea to find a
> > compound_head and ntails should be computational work performed.
> > 
> > No reason not to fix set_page_dirty_lock() too while you are here.
> 
> Eh? What's wrong with set_page_dirty_lock() ?

Look at the code:

	for (index = 0; index < npages; index++) {
		struct page *page = compound_head(pages[index]);

		if (!PageDirty(page))
			set_page_dirty_lock(page);

So we really want

   set_folio_dirty_lock(folio, ntails)

Just like unpin_user_folio(folio, ntails)

(wow this is much clearer to explain using Matt's language)

set_page_dirty_lock() does another whack of atomics, so this should be
a healthy speedup on the large page benchmark.

Jason
Joao Martins Dec. 17, 2020, 7:05 p.m. UTC | #5
On 12/8/20 7:34 PM, Jason Gunthorpe wrote:
>> @@ -274,6 +291,7 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>>  				 bool make_dirty)
>>  {
>>  	unsigned long index;
>> +	int refs = 1;
>>  
>>  	/*
>>  	 * TODO: this can be optimized for huge pages: if a series of pages is
>> @@ -286,8 +304,9 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
>>  		return;
>>  	}
>>  
>> -	for (index = 0; index < npages; index++) {
>> +	for (index = 0; index < npages; index += refs) {
>>  		struct page *page = compound_head(pages[index]);
>> +
> 
> I think this is really hard to read, it should end up as some:
> 
> for_each_compond_head(page_list, page_list_len, &head, &ntails) {
>        		if (!PageDirty(head))
> 			set_page_dirty_lock(head, ntails);
> 		unpin_user_page(head, ntails);
> }
> 
> And maybe you open code that iteration, but that basic idea to find a
> compound_head and ntails should be computational work performed.
> 
> No reason not to fix set_page_dirty_lock() too while you are here.
> 

The whack of atomics you mentioned earlier, I suppose, ends up being
account_page_dirtied(). See the partial diff at the end.

I was looking at the latter part and renaming all the filesystems that supply
set_page_dirty()... But now my concern is whether it's really safe to
assume that filesystems that supply it ... indeed have the ability to dirty
@ntails pages. Functionally, fixing set_page_dirty_lock() means we don't call
set_page_dirty(head) @ntails times as happens today; we would only call it once,
with ntails as an argument.

Perhaps the safest thing to do is still to iterate over
@ntails and call .set_page_dirty(page), and instead introduce
a set_page_range_dirty() which individual filesystems can separately
supply, giving precedence to ->set_page_range_dirty() over
->set_page_dirty()?

	Joao

--------------------->8------------------------------

diff --git a/mm/gup.c b/mm/gup.c
index 41ab3d48e1bb..5f8a0f16ab62 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -295,7 +295,7 @@ void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
                 * next writeback cycle. This is harmless.
                 */
                if (!PageDirty(head))
-                       set_page_dirty_lock(head);
+                       set_page_range_dirty_lock(head, ntails);
                put_compound_head(head, ntails, FOLL_PIN);
        }
 }
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 088729ea80b2..4642d037f657 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2417,7 +2417,8 @@ int __set_page_dirty_no_writeback(struct page *page, unsigned int ntails)
  *
  * NOTE: This relies on being atomic wrt interrupts.
  */
-void account_page_dirtied(struct page *page, struct address_space *mapping)
+void account_page_dirtied(struct page *page, struct address_space *mapping,
+                         unsigned int ntails)
 {
        struct inode *inode = mapping->host;

@@ -2425,17 +2426,18 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)

        if (mapping_can_writeback(mapping)) {
                struct bdi_writeback *wb;
+               int nr = ntails + 1;

                inode_attach_wb(inode, page);
                wb = inode_to_wb(inode);

-               __inc_lruvec_page_state(page, NR_FILE_DIRTY);
-               __inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
-               __inc_node_page_state(page, NR_DIRTIED);
-               inc_wb_stat(wb, WB_RECLAIMABLE);
-               inc_wb_stat(wb, WB_DIRTIED);
-               task_io_account_write(PAGE_SIZE);
-               current->nr_dirtied++;
+               mod_lruvec_page_state(page, NR_FILE_DIRTY, nr);
+               mod_zone_page_state(page_zone(page), NR_ZONE_WRITE_PENDING, nr);
+               mod_node_page_state(page_pgdat(page), NR_DIRTIED, nr);
+               __add_wb_stat(wb, WB_RECLAIMABLE, nr);
+               __add_wb_stat(wb, WB_DIRTIED, nr);
+               task_io_account_write(nr * PAGE_SIZE);
+               current->nr_dirtied += nr;
                this_cpu_inc(bdp_ratelimits);

                mem_cgroup_track_foreign_dirty(page, wb);
@@ -2485,7 +2487,7 @@ int __set_page_dirty_nobuffers(struct page *page, unsigned int ntails)
                xa_lock_irqsave(&mapping->i_pages, flags);
                BUG_ON(page_mapping(page) != mapping);
                WARN_ON_ONCE(!PagePrivate(page) && !PageUptodate(page));
-               account_page_dirtied(page, mapping);
+               account_page_dirtied(page, mapping, ntails);
                __xa_set_mark(&mapping->i_pages, page_index(page),
                                   PAGECACHE_TAG_DIRTY);
                xa_unlock_irqrestore(&mapping->i_pages, flags);
@@ -2624,6 +2626,27 @@ int set_page_dirty_lock(struct page *page)
 }
 EXPORT_SYMBOL(set_page_dirty_lock);

+/*
+ * set_page_range_dirty() is racy if the caller has no reference against
+ * page->mapping->host, and if the page is unlocked.  This is because another
+ * CPU could truncate the page off the mapping and then free the mapping.
+ *
+ * Usually, the page _is_ locked, or the caller is a user-space process which
+ * holds a reference on the inode by having an open file.
+ *
+ * In other cases, the page should be locked before running set_page_range_dirty().
+ */
+int set_page_range_dirty_lock(struct page *page, unsigned int ntails)
+{
+       int ret;
+
+       lock_page(page);
+       ret = set_page_range_dirty(page, ntails);
+       unlock_page(page);
+       return ret;
+}
+EXPORT_SYMBOL(set_page_range_dirty_lock);
+
Jason Gunthorpe Dec. 17, 2020, 8:05 p.m. UTC | #6
On Thu, Dec 17, 2020 at 07:05:37PM +0000, Joao Martins wrote:
> > No reason not to fix set_page_dirty_lock() too while you are here.
> 
> The wack of atomics you mentioned earlier you referred to, I suppose it
> ends being account_page_dirtied(). See partial diff at the end.

Well, even just eliminating the lock_page, page_mapping, PageDirty,
etc is already a big win.

If mapping->a_ops->set_page_dirty() needs to be called multiple times
on the head page I'd probably just suggest:

  while (ntails--)
        rc |= (*spd)(head);

At least as a start.

If you have workloads that have page_mapping != NULL then look at
another series to optimize that. Looks a bit large though given the
number of places implementing set_page_dirty

I think the current reality is calling set_page_dirty on an actual
file system is busted anyhow, so I think mapping is generally going to
be NULL here?

Jason
Joao Martins Dec. 17, 2020, 10:34 p.m. UTC | #7
On 12/17/20 8:05 PM, Jason Gunthorpe wrote:
> On Thu, Dec 17, 2020 at 07:05:37PM +0000, Joao Martins wrote:
>>> No reason not to fix set_page_dirty_lock() too while you are here.
>>
>> The wack of atomics you mentioned earlier you referred to, I suppose it
>> ends being account_page_dirtied(). See partial diff at the end.
> 
> Well, even just eliminating the lock_page, page_mapping, PageDirty,
> etc is already a big win.
> 
> If mapping->a_ops->set_page_dirty() needs to be called multiple times
> on the head page I'd probably just suggest:
> 
>   while (ntails--)
>         rc |= (*spd)(head);
> 
> At least as a start.
> 
/me nods

> If you have workloads that have page_mapping != NULL then look at
> another series to optimze that. Looks a bit large though given the
> number of places implementing set_page_dirty
> 
Yes. I don't have a particular workload; I was just wondering what you had in
mind, as at a glance, changing all those places without breaking filesystems looks
like the subject of a separate series.

> I think the current reality is calling set_page_dirty on an actual
> file system is busted anyhow, so I think mapping is generally going to
> be NULL here?

Perhaps -- I'll have to check.

	Joao
Jason Gunthorpe Dec. 18, 2020, 2:25 p.m. UTC | #8
On Thu, Dec 17, 2020 at 10:34:54PM +0000, Joao Martins wrote:
> On 12/17/20 8:05 PM, Jason Gunthorpe wrote:
> > On Thu, Dec 17, 2020 at 07:05:37PM +0000, Joao Martins wrote:
> >>> No reason not to fix set_page_dirty_lock() too while you are here.
> >>
> >> The wack of atomics you mentioned earlier you referred to, I suppose it
> >> ends being account_page_dirtied(). See partial diff at the end.
> > 
> > Well, even just eliminating the lock_page, page_mapping, PageDirty,
> > etc is already a big win.
> > 
> > If mapping->a_ops->set_page_dirty() needs to be called multiple times
> > on the head page I'd probably just suggest:
> > 
> >   while (ntails--)
> >         rc |= (*spd)(head);
> > 
> > At least as a start.
> > 
> /me nods

You might just be able to call it once as well; doesn't the page fault
handler dirty an entire compound in one call? Someone from the FS
world probably knows :)

A good example of Matt's folio concept lending clarity: if this accepted
a folio, we'd be certain it only needs to be called once.

Jason
John Hubbard Dec. 19, 2020, 2:06 a.m. UTC | #9
On 12/17/20 12:05 PM, Jason Gunthorpe wrote:
> On Thu, Dec 17, 2020 at 07:05:37PM +0000, Joao Martins wrote:
>>> No reason not to fix set_page_dirty_lock() too while you are here.
>>
>> The wack of atomics you mentioned earlier you referred to, I suppose it
>> ends being account_page_dirtied(). See partial diff at the end.
> 
> Well, even just eliminating the lock_page, page_mapping, PageDirty,
> etc is already a big win.
> 
> If mapping->a_ops->set_page_dirty() needs to be called multiple times
> on the head page I'd probably just suggest:
> 
>    while (ntails--)
>          rc |= (*spd)(head);

I think once should be enough. There is no counter for page dirtiness,
and this kind of accounting is always tracked in the head page, so there
is no reason to repeatedly call set_page_dirty() from the same
spot.


thanks,
Joao Martins Dec. 19, 2020, 1:10 p.m. UTC | #10
On 12/19/20 2:06 AM, John Hubbard wrote:
> On 12/17/20 12:05 PM, Jason Gunthorpe wrote:
>> On Thu, Dec 17, 2020 at 07:05:37PM +0000, Joao Martins wrote:
>>>> No reason not to fix set_page_dirty_lock() too while you are here.
>>>
>>> The wack of atomics you mentioned earlier you referred to, I suppose it
>>> ends being account_page_dirtied(). See partial diff at the end.
>>
>> Well, even just eliminating the lock_page, page_mapping, PageDirty,
>> etc is already a big win.
>>
>> If mapping->a_ops->set_page_dirty() needs to be called multiple times
>> on the head page I'd probably just suggest:
>>
>>    while (ntails--)
>>          rc |= (*spd)(head);
> 
> I think once should be enough. There is no counter for page dirtiness,
> and this kind of accounting is always tracked in the head page, so there
> is no reason to repeatedly call set_page_dirty() from the same
> spot.
> 
I think that's what we do even today, considering the Dirty bit is only set on the
compound head (regardless of accounting). Even without this patch,
IIUC we don't call a second set_page_dirty(head) after the first time
we dirty it. So probably there's no optimization to do here, as you say.

	Joao

Patch

diff --git a/mm/gup.c b/mm/gup.c
index 194e6981eb03..3a9a7229f418 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -212,6 +212,18 @@  static bool __unpin_devmap_managed_user_page(struct page *page)
 }
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
+static int record_refs(struct page **pages, int npages)
+{
+	struct page *head = compound_head(pages[0]);
+	int refs = 1, index;
+
+	for (index = 1; index < npages; index++, refs++)
+		if (compound_head(pages[index]) != head)
+			break;
+
+	return refs;
+}
+
 /**
  * unpin_user_page() - release a dma-pinned page
  * @page:            pointer to page to be released
@@ -221,9 +233,9 @@  static bool __unpin_devmap_managed_user_page(struct page *page)
  * that such pages can be separately tracked and uniquely handled. In
  * particular, interactions with RDMA and filesystems need special handling.
  */
-void unpin_user_page(struct page *page)
+static void __unpin_user_page(struct page *page, int refs)
 {
-	int refs = 1;
+	int orig_refs = refs;
 
 	page = compound_head(page);
 
@@ -237,14 +249,19 @@  void unpin_user_page(struct page *page)
 		return;
 
 	if (hpage_pincount_available(page))
-		hpage_pincount_sub(page, 1);
+		hpage_pincount_sub(page, refs);
 	else
-		refs = GUP_PIN_COUNTING_BIAS;
+		refs *= GUP_PIN_COUNTING_BIAS;
 
 	if (page_ref_sub_and_test(page, refs))
 		__put_page(page);
 
-	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1);
+	mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, orig_refs);
+}
+
+void unpin_user_page(struct page *page)
+{
+	__unpin_user_page(page, 1);
 }
 EXPORT_SYMBOL(unpin_user_page);
 
@@ -274,6 +291,7 @@  void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 				 bool make_dirty)
 {
 	unsigned long index;
+	int refs = 1;
 
 	/*
 	 * TODO: this can be optimized for huge pages: if a series of pages is
@@ -286,8 +304,9 @@  void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 		return;
 	}
 
-	for (index = 0; index < npages; index++) {
+	for (index = 0; index < npages; index += refs) {
 		struct page *page = compound_head(pages[index]);
+
 		/*
 		 * Checking PageDirty at this point may race with
 		 * clear_page_dirty_for_io(), but that's OK. Two key
@@ -310,7 +329,8 @@  void unpin_user_pages_dirty_lock(struct page **pages, unsigned long npages,
 		 */
 		if (!PageDirty(page))
 			set_page_dirty_lock(page);
-		unpin_user_page(page);
+		refs = record_refs(pages + index, npages - index);
+		__unpin_user_page(page, refs);
 	}
 }
 EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
@@ -327,6 +347,7 @@  EXPORT_SYMBOL(unpin_user_pages_dirty_lock);
 void unpin_user_pages(struct page **pages, unsigned long npages)
 {
 	unsigned long index;
+	int refs = 1;
 
 	/*
 	 * If this WARN_ON() fires, then the system *might* be leaking pages (by
@@ -340,8 +361,10 @@  void unpin_user_pages(struct page **pages, unsigned long npages)
 	 * physically contiguous and part of the same compound page, then a
 	 * single operation to the head page should suffice.
 	 */
-	for (index = 0; index < npages; index++)
-		unpin_user_page(pages[index]);
+	for (index = 0; index < npages; index += refs) {
+		refs = record_refs(pages + index, npages - index);
+		__unpin_user_page(pages[index], refs);
+	}
 }
 EXPORT_SYMBOL(unpin_user_pages);