mm: shmem: convert to use folio_zero_range()

Message ID	20241017142504.1170208-2-wangkefeng.wang@huawei.com (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: Kefeng Wang <wangkefeng.wang@huawei.com> To: Andrew Morton <akpm@linux-foundation.org>, Hugh Dickins <hughd@google.com> CC: David Hildenbrand <david@redhat.com>, Baolin Wang <baolin.wang@linux.alibaba.com>, Matthew Wilcox <willy@infradead.org>, <linux-mm@kvack.org>, Kefeng Wang <wangkefeng.wang@huawei.com> Subject: [PATCH] mm: shmem: convert to use folio_zero_range() Date: Thu, 17 Oct 2024 22:25:04 +0800 Message-ID: <20241017142504.1170208-2-wangkefeng.wang@huawei.com> In-Reply-To: <20241017142504.1170208-1-wangkefeng.wang@huawei.com> References: <20241017142504.1170208-1-wangkefeng.wang@huawei.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	mm: shmem: convert to use folio_zero_range() \| expand mm: shmem: convert to use folio_zero_range()

Message ID

20241017142504.1170208-2-wangkefeng.wang@huawei.com (mailing list archive)

State

New

Headers

From: Kefeng Wang <wangkefeng.wang@huawei.com>
To: Andrew Morton <akpm@linux-foundation.org>, Hugh Dickins <hughd@google.com>
CC: David Hildenbrand <david@redhat.com>, Baolin Wang
	<baolin.wang@linux.alibaba.com>, Matthew Wilcox <willy@infradead.org>,
	<linux-mm@kvack.org>, Kefeng Wang <wangkefeng.wang@huawei.com>
Subject: [PATCH] mm: shmem: convert to use folio_zero_range()
Date: Thu, 17 Oct 2024 22:25:04 +0800
Message-ID: <20241017142504.1170208-2-wangkefeng.wang@huawei.com>
In-Reply-To: <20241017142504.1170208-1-wangkefeng.wang@huawei.com>
References: <20241017142504.1170208-1-wangkefeng.wang@huawei.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain
Sender: owner-linux-mm@kvack.org
Precedence: bulk

Series

mm: shmem: convert to use folio_zero_range() | expand

Comments

Matthew Wilcox (Oracle) Oct. 17, 2024, 3:09 p.m. UTC | #1

On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
> Directly use folio_zero_range() to cleanup code.

Are you sure there's no performance regression introduced by this?
clear_highpage() is often optimised in ways that we can't optimise for
a plain memset().  On the other hand, if the folio is large, maybe a
modern CPU will be able to do better than clear-one-page-at-a-time.

IOW, what performance testing have you done with this patch?

>  	if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
> -		long i, n = folio_nr_pages(folio);
> -
> -		for (i = 0; i < n; i++)
> -			clear_highpage(folio_page(folio, i));
> -		flush_dcache_folio(folio);
> +		folio_zero_range(folio, 0, folio_size(folio));
>  		folio_mark_uptodate(folio);
>  	}

Kefeng Wang Oct. 18, 2024, 5:20 a.m. UTC | #2

On 2024/10/17 23:09, Matthew Wilcox wrote:
> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
>> Directly use folio_zero_range() to cleanup code.
> 
> Are you sure there's no performance regression introduced by this?
> clear_highpage() is often optimised in ways that we can't optimise for
> a plain memset().  On the other hand, if the folio is large, maybe a
> modern CPU will be able to do better than clear-one-page-at-a-time.
> 

Right, I missing this, clear_page might be better than memset, I change 
this one when look at the shmem_writepage(), which already convert to
use folio_zero_range() from clear_highpage(), also I grep 
folio_zero_range(), there are some other to use folio_zero_range().

fs/bcachefs/fs-io-buffered.c:		folio_zero_range(folio, 0, 
folio_size(folio));
fs/bcachefs/fs-io-buffered.c:			folio_zero_range(f, 0, folio_size(f));
fs/bcachefs/fs-io-buffered.c:			folio_zero_range(f, 0, folio_size(f));
fs/libfs.c:	folio_zero_range(folio, 0, folio_size(folio));
fs/ntfs3/frecord.c:		folio_zero_range(folio, 0, folio_size(folio));
mm/page_io.c:	folio_zero_range(folio, 0, folio_size(folio));
mm/shmem.c:		folio_zero_range(folio, 0, folio_size(folio));


> IOW, what performance testing have you done with this patch?

No performance test before, but I write a testcase,

1) allocate N large folios (folio_alloc(PMD_ORDER))
2) then calculate the diff(us) when clear all N folios
    clear_highpage/folio_zero_range/folio_zero_user
3) release N folios

the result(run 5 times) shown below on my machine,

N=1,
	clear_highpage  folio_zero_range    folio_zero_user
   1	  69                   74                 177
   2	  57                   62                 168
   3	  54                   58                 234
   4	  54                   58                 157
   5	  56                   62                 148
avg	  58                   62.8               176.8


N=100
	clear_highpage  folio_zero_range    folio_zero_user
   1	11015                 11309               32833
   2	10385                 11110               49751
   3	10369                 11056               33095
   4	10332                 11017               33106
   5	10483                 11000               49032
avg	10516.8               11098.4             39563.4

N=512
	clear_highpage  folio_zero_range   folio_zero_user
   1	55560                 60055              156876
   2	55485                 60024              157132
   3	55474                 60129              156658
   4	55555                 59867              157259
   5	55528                 59932              157108
avg	55520.4               60001.4            157006.6



folio_zero_user with many cond_resched(), so time fluctuates a lot,
clear_highpage is better folio_zero_range as you said.

Maybe add a new helper to convert all folio_zero_range(folio, 0, 
folio_size(folio))
to use clear_highpage + flush_dcache_folio?


> 
>>   	if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
>> -		long i, n = folio_nr_pages(folio);
>> -
>> -		for (i = 0; i < n; i++)
>> -			clear_highpage(folio_page(folio, i));
>> -		flush_dcache_folio(folio);
>> +		folio_zero_range(folio, 0, folio_size(folio));
>>   		folio_mark_uptodate(folio);
>>   	}
>

Barry Song Oct. 18, 2024, 5:23 a.m. UTC | #3

On Fri, Oct 18, 2024 at 6:20 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>
>
>
> On 2024/10/17 23:09, Matthew Wilcox wrote:
> > On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
> >> Directly use folio_zero_range() to cleanup code.
> >
> > Are you sure there's no performance regression introduced by this?
> > clear_highpage() is often optimised in ways that we can't optimise for
> > a plain memset().  On the other hand, if the folio is large, maybe a
> > modern CPU will be able to do better than clear-one-page-at-a-time.
> >
>
> Right, I missing this, clear_page might be better than memset, I change
> this one when look at the shmem_writepage(), which already convert to
> use folio_zero_range() from clear_highpage(), also I grep
> folio_zero_range(), there are some other to use folio_zero_range().
>
> fs/bcachefs/fs-io-buffered.c:           folio_zero_range(folio, 0,
> folio_size(folio));
> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f, 0, folio_size(f));
> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f, 0, folio_size(f));
> fs/libfs.c:     folio_zero_range(folio, 0, folio_size(folio));
> fs/ntfs3/frecord.c:             folio_zero_range(folio, 0, folio_size(folio));
> mm/page_io.c:   folio_zero_range(folio, 0, folio_size(folio));
> mm/shmem.c:             folio_zero_range(folio, 0, folio_size(folio));
>
>
> > IOW, what performance testing have you done with this patch?
>
> No performance test before, but I write a testcase,
>
> 1) allocate N large folios (folio_alloc(PMD_ORDER))
> 2) then calculate the diff(us) when clear all N folios
>     clear_highpage/folio_zero_range/folio_zero_user
> 3) release N folios
>
> the result(run 5 times) shown below on my machine,
>
> N=1,
>         clear_highpage  folio_zero_range    folio_zero_user
>    1      69                   74                 177
>    2      57                   62                 168
>    3      54                   58                 234
>    4      54                   58                 157
>    5      56                   62                 148
> avg       58                   62.8               176.8
>
>
> N=100
>         clear_highpage  folio_zero_range    folio_zero_user
>    1    11015                 11309               32833
>    2    10385                 11110               49751
>    3    10369                 11056               33095
>    4    10332                 11017               33106
>    5    10483                 11000               49032
> avg     10516.8               11098.4             39563.4
>
> N=512
>         clear_highpage  folio_zero_range   folio_zero_user
>    1    55560                 60055              156876
>    2    55485                 60024              157132
>    3    55474                 60129              156658
>    4    55555                 59867              157259
>    5    55528                 59932              157108
> avg     55520.4               60001.4            157006.6
>
>
>
> folio_zero_user with many cond_resched(), so time fluctuates a lot,
> clear_highpage is better folio_zero_range as you said.
>
> Maybe add a new helper to convert all folio_zero_range(folio, 0,
> folio_size(folio))
> to use clear_highpage + flush_dcache_folio?

If this also improves performance for other existing callers of
folio_zero_range(), then that's a positive outcome.

>
>
> >
> >>      if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
> >> -            long i, n = folio_nr_pages(folio);
> >> -
> >> -            for (i = 0; i < n; i++)
> >> -                    clear_highpage(folio_page(folio, i));
> >> -            flush_dcache_folio(folio);
> >> +            folio_zero_range(folio, 0, folio_size(folio));
> >>              folio_mark_uptodate(folio);
> >>      }
> >
>
>

Thanks
Barry

Kefeng Wang Oct. 18, 2024, 7:32 a.m. UTC | #4

On 2024/10/18 13:23, Barry Song wrote:
> On Fri, Oct 18, 2024 at 6:20 PM Kefeng Wang <wangkefeng.wang@huawei.com> wrote:
>>
>>
>>
>> On 2024/10/17 23:09, Matthew Wilcox wrote:
>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
>>>> Directly use folio_zero_range() to cleanup code.
>>>
>>> Are you sure there's no performance regression introduced by this?
>>> clear_highpage() is often optimised in ways that we can't optimise for
>>> a plain memset().  On the other hand, if the folio is large, maybe a
>>> modern CPU will be able to do better than clear-one-page-at-a-time.
>>>
>>
>> Right, I missing this, clear_page might be better than memset, I change
>> this one when look at the shmem_writepage(), which already convert to
>> use folio_zero_range() from clear_highpage(), also I grep
>> folio_zero_range(), there are some other to use folio_zero_range().
>>
>> fs/bcachefs/fs-io-buffered.c:           folio_zero_range(folio, 0,
>> folio_size(folio));
>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f, 0, folio_size(f));
>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f, 0, folio_size(f));
>> fs/libfs.c:     folio_zero_range(folio, 0, folio_size(folio));
>> fs/ntfs3/frecord.c:             folio_zero_range(folio, 0, folio_size(folio));
>> mm/page_io.c:   folio_zero_range(folio, 0, folio_size(folio));
>> mm/shmem.c:             folio_zero_range(folio, 0, folio_size(folio));
>>
>>
>>> IOW, what performance testing have you done with this patch?
>>
>> No performance test before, but I write a testcase,
>>
>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
>> 2) then calculate the diff(us) when clear all N folios
>>      clear_highpage/folio_zero_range/folio_zero_user
>> 3) release N folios
>>
>> the result(run 5 times) shown below on my machine,
>>
>> N=1,
>>          clear_highpage  folio_zero_range    folio_zero_user
>>     1      69                   74                 177
>>     2      57                   62                 168
>>     3      54                   58                 234
>>     4      54                   58                 157
>>     5      56                   62                 148
>> avg       58                   62.8               176.8
>>
>>
>> N=100
>>          clear_highpage  folio_zero_range    folio_zero_user
>>     1    11015                 11309               32833
>>     2    10385                 11110               49751
>>     3    10369                 11056               33095
>>     4    10332                 11017               33106
>>     5    10483                 11000               49032
>> avg     10516.8               11098.4             39563.4
>>
>> N=512
>>          clear_highpage  folio_zero_range   folio_zero_user
>>     1    55560                 60055              156876
>>     2    55485                 60024              157132
>>     3    55474                 60129              156658
>>     4    55555                 59867              157259
>>     5    55528                 59932              157108
>> avg     55520.4               60001.4            157006.6
>>
>>
>>
>> folio_zero_user with many cond_resched(), so time fluctuates a lot,
>> clear_highpage is better folio_zero_range as you said.
>>
>> Maybe add a new helper to convert all folio_zero_range(folio, 0,
>> folio_size(folio))
>> to use clear_highpage + flush_dcache_folio?
> 
> If this also improves performance for other existing callers of
> folio_zero_range(), then that's a positive outcome.


rm -f /tmp/test && fallocate -l 20G /tmp/test && fallocate -d -l 20G 
/tmp/test && time fallocate -l 20G /tmp/test

1）mount always(2M folio)
	with patch	without patch
real	0m1.214s	0m1.111s
user	0m0.000s	0m0.000s
sys	0m1.210s	0m1.109s

With this patch, the performance does have regression,
folio_zero_range() is bad than clear_highpage + flush_dcache_folio

with patch

   99.95%     0.00%  fallocate  [kernel.vmlinux]       [k] vfs_fallocate 

    vfs_fallocate 

  - shmem_fallocate 

       98.54% __pi_clear_page 

     - 1.38% shmem_get_folio_gfp 

          filemap_get_entry

without patch
  99.89%     0.00%  fallocate  [kernel.vmlinux]       [k] 
shmem_fallocate
   shmem_fallocate 

- shmem_get_folio_gfp 

      90.12% __memset 

    - 9.42% zero_user_segments.constprop.0 

         8.16% flush_dcache_page 

         1.03% flush_dcache_folio




2）mount  never (4K folio)
real	0m3.159s	0m3.176s
user	0m0.000s	0m0.000s
sys	0m3.150s	0m3.169s

But with this patch, the performance is improved a little,
folio_zero_range() is better than clear_highpage + flush_dcache_folio

with patch
  97.77%     3.37%  fallocate  [kernel.vmlinux]       [k] 
shmem_fallocate
- 94.40% shmem_fallocate 

    - 93.70% shmem_get_folio_gfp 

         66.60% __memset 

       - 7.43% filemap_get_entry 

            3.49% xas_load 

         1.32% zero_user_segments.constprop.0

without patch
   97.82%     3.22%  fallocate  [kernel.vmlinux]       [k] 
shmem_fallocate
  - 94.61% shmem_fallocate 

       68.18% __pi_clear_page 

     - 25.60% shmem_get_folio_gfp 

        - 7.64% filemap_get_entry 

             3.51% xas_load
> 
>>
>>
>>>
>>>>       if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
>>>> -            long i, n = folio_nr_pages(folio);
>>>> -
>>>> -            for (i = 0; i < n; i++)
>>>> -                    clear_highpage(folio_page(folio, i));
>>>> -            flush_dcache_folio(folio);
>>>> +            folio_zero_range(folio, 0, folio_size(folio));
>>>>               folio_mark_uptodate(folio);
>>>>       }
>>>
>>
>>
> 
> Thanks
> Barry

Kefeng Wang Oct. 18, 2024, 7:47 a.m. UTC | #5

On 2024/10/18 15:32, Kefeng Wang wrote:
> 
> 
> On 2024/10/18 13:23, Barry Song wrote:
>> On Fri, Oct 18, 2024 at 6:20 PM Kefeng Wang 
>> <wangkefeng.wang@huawei.com> wrote:
>>>
>>>
>>>
>>> On 2024/10/17 23:09, Matthew Wilcox wrote:
>>>> On Thu, Oct 17, 2024 at 10:25:04PM +0800, Kefeng Wang wrote:
>>>>> Directly use folio_zero_range() to cleanup code.
>>>>
>>>> Are you sure there's no performance regression introduced by this?
>>>> clear_highpage() is often optimised in ways that we can't optimise for
>>>> a plain memset().  On the other hand, if the folio is large, maybe a
>>>> modern CPU will be able to do better than clear-one-page-at-a-time.
>>>>
>>>
>>> Right, I missing this, clear_page might be better than memset, I change
>>> this one when look at the shmem_writepage(), which already convert to
>>> use folio_zero_range() from clear_highpage(), also I grep
>>> folio_zero_range(), there are some other to use folio_zero_range().
>>>
>>> fs/bcachefs/fs-io-buffered.c:           folio_zero_range(folio, 0,
>>> folio_size(folio));
>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f, 
>>> 0, folio_size(f));
>>> fs/bcachefs/fs-io-buffered.c:                   folio_zero_range(f, 
>>> 0, folio_size(f));
>>> fs/libfs.c:     folio_zero_range(folio, 0, folio_size(folio));
>>> fs/ntfs3/frecord.c:             folio_zero_range(folio, 0, 
>>> folio_size(folio));
>>> mm/page_io.c:   folio_zero_range(folio, 0, folio_size(folio));
>>> mm/shmem.c:             folio_zero_range(folio, 0, folio_size(folio));
>>>
>>>
>>>> IOW, what performance testing have you done with this patch?
>>>
>>> No performance test before, but I write a testcase,
>>>
>>> 1) allocate N large folios (folio_alloc(PMD_ORDER))
>>> 2) then calculate the diff(us) when clear all N folios
>>>      clear_highpage/folio_zero_range/folio_zero_user
>>> 3) release N folios
>>>
>>> the result(run 5 times) shown below on my machine,
>>>
>>> N=1,
>>>          clear_highpage  folio_zero_range    folio_zero_user
>>>     1      69                   74                 177
>>>     2      57                   62                 168
>>>     3      54                   58                 234
>>>     4      54                   58                 157
>>>     5      56                   62                 148
>>> avg       58                   62.8               176.8
>>>
>>>
>>> N=100
>>>          clear_highpage  folio_zero_range    folio_zero_user
>>>     1    11015                 11309               32833
>>>     2    10385                 11110               49751
>>>     3    10369                 11056               33095
>>>     4    10332                 11017               33106
>>>     5    10483                 11000               49032
>>> avg     10516.8               11098.4             39563.4
>>>
>>> N=512
>>>          clear_highpage  folio_zero_range   folio_zero_user
>>>     1    55560                 60055              156876
>>>     2    55485                 60024              157132
>>>     3    55474                 60129              156658
>>>     4    55555                 59867              157259
>>>     5    55528                 59932              157108
>>> avg     55520.4               60001.4            157006.6
>>>
>>>
>>>
>>> folio_zero_user with many cond_resched(), so time fluctuates a lot,
>>> clear_highpage is better folio_zero_range as you said.
>>>
>>> Maybe add a new helper to convert all folio_zero_range(folio, 0,
>>> folio_size(folio))
>>> to use clear_highpage + flush_dcache_folio?
>>
>> If this also improves performance for other existing callers of
>> folio_zero_range(), then that's a positive outcome.
> 
> 
> rm -f /tmp/test && fallocate -l 20G /tmp/test && fallocate -d -l 20G / 
> tmp/test && time fallocate -l 20G /tmp/test
> 
> 1）mount always(2M folio)
>      with patch    without patch
> real    0m1.214s    0m1.111s
> user    0m0.000s    0m0.000s
> sys    0m1.210s    0m1.109s
> 
> With this patch, the performance does have regression,
> folio_zero_range() is bad than clear_highpage + flush_dcache_folio
> 
> with patch

Oh, this should without patch since it uses clear_highpage,

> 
>    99.95%     0.00%  fallocate  [kernel.vmlinux]       [k] vfs_fallocate
>     vfs_fallocate
>   - shmem_fallocate
>        98.54% __pi_clear_page
>      - 1.38% shmem_get_folio_gfp
>           filemap_get_entry
> 
and this one is with patch
> without patch
>   99.89%     0.00%  fallocate  [kernel.vmlinux]       [k] shmem_fallocate
>    shmem_fallocate
> - shmem_get_folio_gfp
>       90.12% __memset
>     - 9.42% zero_user_segments.constprop.0
>          8.16% flush_dcache_page
>          1.03% flush_dcache_folio
> 
> 
> 
> 
> 2）mount  never (4K folio)
> real    0m3.159s    0m3.176s
> user    0m0.000s    0m0.000s
> sys    0m3.150s    0m3.169s
> 
> But with this patch, the performance is improved a little,
> folio_zero_range() is better than clear_highpage + flush_dcache_folio
> 

For 4K, the result is fluctuating, so maybe no different.

> with patch
>   97.77%     3.37%  fallocate  [kernel.vmlinux]       [k] shmem_fallocate
> - 94.40% shmem_fallocate
>     - 93.70% shmem_get_folio_gfp
>          66.60% __memset
>        - 7.43% filemap_get_entry
>             3.49% xas_load
>          1.32% zero_user_segments.constprop.0
> 
> without patch
>    97.82%     3.22%  fallocate  [kernel.vmlinux]       [k] shmem_fallocate
>   - 94.61% shmem_fallocate
>        68.18% __pi_clear_page
>      - 25.60% shmem_get_folio_gfp
>         - 7.64% filemap_get_entry
>              3.51% xas_load
>>
>>>
>>>
>>>>
>>>>>       if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
>>>>> -            long i, n = folio_nr_pages(folio);
>>>>> -
>>>>> -            for (i = 0; i < n; i++)
>>>>> -                    clear_highpage(folio_page(folio, i));
>>>>> -            flush_dcache_folio(folio);
>>>>> +            folio_zero_range(folio, 0, folio_size(folio));
>>>>>               folio_mark_uptodate(folio);
>>>>>       }
>>>>
>>>
>>>
>>
>> Thanks
>> Barry
> 
>

diff --git a/mm/shmem.c b/mm/shmem.c
index bd5ba016567d..247c0403af83 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2392,11 +2392,7 @@  static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 	 * it now, lest undo on failure cancel our earlier guarantee.
 	 */
 	if (sgp != SGP_WRITE && !folio_test_uptodate(folio)) {
-		long i, n = folio_nr_pages(folio);
-
-		for (i = 0; i < n; i++)
-			clear_highpage(folio_page(folio, i));
-		flush_dcache_folio(folio);
+		folio_zero_range(folio, 0, folio_size(folio));
 		folio_mark_uptodate(folio);
 	}

mm: shmem: convert to use folio_zero_range()

Commit Message

Comments

Patch