
[v2] tmpfs: fault in smaller chunks if large folio allocation not allowed

Message ID 20240920143654.1008756-1-wangkefeng.wang@huawei.com (mailing list archive)
State New
Series [v2] tmpfs: fault in smaller chunks if large folio allocation not allowed

Commit Message

Kefeng Wang Sept. 20, 2024, 2:36 p.m. UTC
tmpfs supports large folios, but there are several configurable options
that enable/disable large folio allocation, and for huge=within_size a
large folio is only allowed if it fits fully within i_size. This causes
a performance issue when a write is performed without a large folio; the
issue is similar to commit 4e527d5841e2 ("iomap: fault in smaller chunks
for non-large folio mappings").

Fix it by checking whether large folio allocation is allowed before
performing the write.

Fixes: 9aac777aaf94 ("filemap: Convert generic_perform_write() to support large folios")
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
---
v2:
- Don't use IOCB flags

 mm/filemap.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

Comments

Matthew Wilcox Sept. 22, 2024, 12:35 a.m. UTC | #1
On Fri, Sep 20, 2024 at 10:36:54PM +0800, Kefeng Wang wrote:
> tmpfs supports large folios, but there are several configurable options
> that enable/disable large folio allocation, and for huge=within_size a
> large folio is only allowed if it fits fully within i_size. This causes
> a performance issue when a write is performed without a large folio; the
> issue is similar to commit 4e527d5841e2 ("iomap: fault in smaller chunks
> for non-large folio mappings").

No.  What's wrong with my earlier suggestion?
Kefeng Wang Sept. 23, 2024, 1:39 a.m. UTC | #2
On 2024/9/22 8:35, Matthew Wilcox wrote:
> On Fri, Sep 20, 2024 at 10:36:54PM +0800, Kefeng Wang wrote:
>> tmpfs supports large folios, but there are several configurable options
>> that enable/disable large folio allocation, and for huge=within_size a
>> large folio is only allowed if it fits fully within i_size. This causes
>> a performance issue when a write is performed without a large folio; the
>> issue is similar to commit 4e527d5841e2 ("iomap: fault in smaller chunks
>> for non-large folio mappings").
> 
> No.  What's wrong with my earlier suggestion?
> 

tmpfs has mount options (never/always/within_size/madvise) for large
folios, and also the sysfs file /sys/kernel/mm/transparent_hugepage/shmem_enabled
to deny/force large folios at runtime. As I replied in v1, I think this
breaks the rules of mapping_set_folio_order_range():

   "Do not tune it based on, eg, i_size."
   --- for tmpfs, whether to use a large folio is chosen based on i_size

   "Context: This should not be called while the inode is active as it
is non-atomic."
   --- during the write, the inode is active

So this is why I don't use mapping_set_folio_order_range() here, but
correct me if I am wrong.
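
For reference, a rough sketch of what mapping_set_folio_order_range()
does in recent kernels (paraphrased from include/linux/pagemap.h; details
may differ across versions), showing why the non-atomic update matters:

/* Paraphrased sketch, not verbatim kernel source. */
static inline void mapping_set_folio_order_range(struct address_space *mapping,
						 unsigned int min, unsigned int max)
{
	if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
		return;

	if (min > MAX_PAGECACHE_ORDER)
		min = MAX_PAGECACHE_ORDER;
	if (max > MAX_PAGECACHE_ORDER)
		max = MAX_PAGECACHE_ORDER;
	if (max < min)
		max = min;

	/*
	 * Plain read-modify-write of mapping->flags: racing with any
	 * concurrent flags update on an active inode could lose bits,
	 * hence "should not be called while the inode is active".
	 */
	mapping->flags = (mapping->flags & ~AS_FOLIO_ORDER_MASK) |
			 (min << AS_FOLIO_ORDER_MIN) | (max << AS_FOLIO_ORDER_MAX);
}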
Pankaj Raghav (Samsung) Sept. 26, 2024, 8:38 a.m. UTC | #3
On Mon, Sep 23, 2024 at 09:39:07AM +0800, Kefeng Wang wrote:
> 
> 
> On 2024/9/22 8:35, Matthew Wilcox wrote:
> > On Fri, Sep 20, 2024 at 10:36:54PM +0800, Kefeng Wang wrote:
> > > tmpfs supports large folios, but there are several configurable options
> > > that enable/disable large folio allocation, and for huge=within_size a
> > > large folio is only allowed if it fits fully within i_size. This causes
> > > a performance issue when a write is performed without a large folio; the
> > > issue is similar to commit 4e527d5841e2 ("iomap: fault in smaller chunks
> > > for non-large folio mappings").
> > 
> > No.  What's wrong with my earlier suggestion?
> > 
> 
> tmpfs has mount options (never/always/within_size/madvise) for large
> folios, and also the sysfs file /sys/kernel/mm/transparent_hugepage/shmem_enabled
> to deny/force large folios at runtime. As I replied in v1, I think this
> breaks the rules of mapping_set_folio_order_range():
> 
>   "Do not tune it based on, eg, i_size."
>   --- for tmpfs, whether to use a large folio is chosen based on i_size
> 
>   "Context: This should not be called while the inode is active as it is
> non-atomic."
>   --- during the write, the inode is active
> 
> So this is why I don't use mapping_set_folio_order_range() here, but
> correct me if I am wrong.

Yeah, the inode is active here as the max folio size is decided based on
the write size, so probably mapping_set_folio_order_range() will not be
a safe option.
Matthew Wilcox Sept. 26, 2024, 1:52 p.m. UTC | #4
On Thu, Sep 26, 2024 at 10:38:34AM +0200, Pankaj Raghav (Samsung) wrote:
> > So this is why I don't use mapping_set_folio_order_range() here, but
> > correct me if I am wrong.
> 
> Yeah, the inode is active here as the max folio size is decided based on
> the write size, so probably mapping_set_folio_order_range() will not be
> a safe option.

You really are all making too much of this.  Here's the patch I think we
need:

+++ b/mm/shmem.c
@@ -2831,7 +2831,8 @@ static struct inode *__shmem_get_inode(struct mnt_idmap *idmap,
        cache_no_acl(inode);
        if (sbinfo->noswap)
                mapping_set_unevictable(inode->i_mapping);
-       mapping_set_large_folios(inode->i_mapping);
+       if (sbinfo->huge)
+               mapping_set_large_folios(inode->i_mapping);

        switch (mode & S_IFMT) {
        default:
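
For context, mapping_set_large_folios() is a thin wrapper that opens up
the full supported order range; a rough sketch of the helper (from
include/linux/pagemap.h in recent kernels, exact form may vary):

/* Rough sketch: enable every folio order the page cache supports. */
static inline void mapping_set_large_folios(struct address_space *mapping)
{
	mapping_set_folio_order_range(mapping, 0, MAX_PAGECACHE_ORDER);
}

With the change above, a huge=never mount (sbinfo->huge == 0) would
simply leave the mapping restricted to order-0 folios.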
Kefeng Wang Sept. 26, 2024, 2:20 p.m. UTC | #5
On 2024/9/26 21:52, Matthew Wilcox wrote:
> On Thu, Sep 26, 2024 at 10:38:34AM +0200, Pankaj Raghav (Samsung) wrote:
>>> So this is why I don't use mapping_set_folio_order_range() here, but
>>> correct me if I am wrong.
>>
>> Yeah, the inode is active here as the max folio size is decided based on
>> the write size, so probably mapping_set_folio_order_range() will not be
>> a safe option.
> 
> You really are all making too much of this.  Here's the patch I think we
> need:
> 
> +++ b/mm/shmem.c
> @@ -2831,7 +2831,8 @@ static struct inode *__shmem_get_inode(struct mnt_idmap *idmap,
>          cache_no_acl(inode);
>          if (sbinfo->noswap)
>                  mapping_set_unevictable(inode->i_mapping);
> -       mapping_set_large_folios(inode->i_mapping);
> +       if (sbinfo->huge)
> +               mapping_set_large_folios(inode->i_mapping);
> 

But it can't solve all issues, e.g.:
   mount with huge=SHMEM_HUGE_WITHIN_SIZE, or
   mount with SHMEM_HUGE_ALWAYS + runtime SHMEM_HUGE_DENY

and the above change will break:
   mount with SHMEM_HUGE_NEVER + runtime SHMEM_HUGE_FORCE
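
For reference, a condensed sketch of how the runtime sysfs value
interacts with the per-mount option in mm/shmem.c (paraphrased from
recent kernels; huge_fits_within_i_size() is an illustrative stand-in,
not a real helper):

/* Condensed sketch of shmem's "is a huge order allowed?" decision. */
static bool shmem_huge_global_enabled_sketch(struct inode *inode)
{
	/* The runtime sysfs value (shmem_enabled) overrides the mount option. */
	if (shmem_huge == SHMEM_HUGE_DENY)
		return false;		/* deny wins even over huge=always */
	if (shmem_huge == SHMEM_HUGE_FORCE)
		return true;		/* force wins even over huge=never */

	switch (SHMEM_SB(inode->i_sb)->huge) {
	case SHMEM_HUGE_ALWAYS:
		return true;
	case SHMEM_HUGE_WITHIN_SIZE:
		/* Allowed only while the huge folio fits fully within
		 * i_size, so the answer changes as the file grows. */
		return huge_fits_within_i_size(inode);	/* illustrative */
	case SHMEM_HUGE_ADVISE:
		return false;		/* needs MADV_HUGEPAGE on the VMA */
	default:
		return false;
	}
}

Since shmem_enabled can change at any time, a decision made once at
inode creation cannot track it; that is the concern here.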
Matthew Wilcox Sept. 26, 2024, 2:58 p.m. UTC | #6
On Thu, Sep 26, 2024 at 10:20:54PM +0800, Kefeng Wang wrote:
> On 2024/9/26 21:52, Matthew Wilcox wrote:
> > On Thu, Sep 26, 2024 at 10:38:34AM +0200, Pankaj Raghav (Samsung) wrote:
> > > > So this is why I don't use mapping_set_folio_order_range() here, but
> > > > correct me if I am wrong.
> > > 
> > > Yeah, the inode is active here as the max folio size is decided based on
> > > the write size, so probably mapping_set_folio_order_range() will not be
> > > a safe option.
> > 
> > You really are all making too much of this.  Here's the patch I think we
> > need:
> > 
> > -       mapping_set_large_folios(inode->i_mapping);
> > +       if (sbinfo->huge)
> > +               mapping_set_large_folios(inode->i_mapping);
> 
> But it can't solve all issues, e.g.:
>   mount with huge=SHMEM_HUGE_WITHIN_SIZE, or

The page cache will not create folios which overhang the end of the file
by more than the minimum folio size for that mapping.  So this is wrong.

>   mount with SHMEM_HUGE_ALWAYS + runtime SHMEM_HUGE_DENY

That's a tweak to this patch, not a fundamental problem with it.

> and the above change will break:
>   mount with SHMEM_HUGE_NEVER + runtime SHMEM_HUGE_FORCE

Likewise.
Kefeng Wang Sept. 30, 2024, 1:27 a.m. UTC | #7
On 2024/9/26 22:58, Matthew Wilcox wrote:
> On Thu, Sep 26, 2024 at 10:20:54PM +0800, Kefeng Wang wrote:
>> On 2024/9/26 21:52, Matthew Wilcox wrote:
>>> On Thu, Sep 26, 2024 at 10:38:34AM +0200, Pankaj Raghav (Samsung) wrote:
>>>>> So this is why I don't use mapping_set_folio_order_range() here, but
>>>>> correct me if I am wrong.
>>>>
>>>> Yeah, the inode is active here as the max folio size is decided based on
>>>> the write size, so probably mapping_set_folio_order_range() will not be
>>>> a safe option.
>>>
>>> You really are all making too much of this.  Here's the patch I think we
>>> need:
>>>
>>> -       mapping_set_large_folios(inode->i_mapping);
>>> +       if (sbinfo->huge)
>>> +               mapping_set_large_folios(inode->i_mapping);
>>
>> But it can't solve all issues, e.g.:
>>    mount with huge=SHMEM_HUGE_WITHIN_SIZE, or
> 
> The page cache will not create folios which overhang the end of the file
> by more than the minimum folio size for that mapping.  So this is wrong.


Sorry for the late reply; I'm not very familiar with this, and will test
after getting back to the office in the next few days.

> 
>>    mount with SHMEM_HUGE_ALWAYS + runtime SHMEM_HUGE_DENY
> 
> That's a tweak to this patch, not a fundamental problem with it.
> 
>> and the above change will break:
>>    mount with SHMEM_HUGE_NEVER + runtime SHMEM_HUGE_FORCE
> 
> Likewise.
> 

But SHMEM_HUGE_DENY/FORCE can be changed at runtime, and I haven't found
a better way to fix this; any further suggestions would be appreciated,
thanks.
Baolin Wang Sept. 30, 2024, 2:02 a.m. UTC | #8
On 2024/9/26 21:52, Matthew Wilcox wrote:
> On Thu, Sep 26, 2024 at 10:38:34AM +0200, Pankaj Raghav (Samsung) wrote:
>>> So this is why I don't use mapping_set_folio_order_range() here, but
>>> correct me if I am wrong.
>>
>> Yeah, the inode is active here as the max folio size is decided based on
>> the write size, so probably mapping_set_folio_order_range() will not be
>> a safe option.
> 
> You really are all making too much of this.  Here's the patch I think we
> need:
> 
> +++ b/mm/shmem.c
> @@ -2831,7 +2831,8 @@ static struct inode *__shmem_get_inode(struct mnt_idmap *idmap,
>          cache_no_acl(inode);
>          if (sbinfo->noswap)
>                  mapping_set_unevictable(inode->i_mapping);
> -       mapping_set_large_folios(inode->i_mapping);
> +       if (sbinfo->huge)
> +               mapping_set_large_folios(inode->i_mapping);
> 
>          switch (mode & S_IFMT) {
>          default:

IMHO, we no longer need the 'sbinfo->huge' validation after adding
support for large folios in the tmpfs write and fallocate paths [1].

Kefeng, can you check whether the following RFC patch [1] solves your
problem? Thanks.
(PS: I will revise the patch according to Matthew's suggestion)

[1] 
https://lore.kernel.org/all/c03ec1cb1392332726ab265a3d826fe1c408c7e7.1727338549.git.baolin.wang@linux.alibaba.com/
Kefeng Wang Sept. 30, 2024, 2:30 a.m. UTC | #9
On 2024/9/30 10:02, Baolin Wang wrote:
> 
> 
> On 2024/9/26 21:52, Matthew Wilcox wrote:
>> On Thu, Sep 26, 2024 at 10:38:34AM +0200, Pankaj Raghav (Samsung) wrote:
>>>> So this is why I don't use mapping_set_folio_order_range() here, but
>>>> correct me if I am wrong.
>>>
>>> Yeah, the inode is active here as the max folio size is decided based on
>>> the write size, so probably mapping_set_folio_order_range() will not be
>>> a safe option.
>>
>> You really are all making too much of this.  Here's the patch I think we
>> need:
>>
>> +++ b/mm/shmem.c
>> @@ -2831,7 +2831,8 @@ static struct inode *__shmem_get_inode(struct 
>> mnt_idmap *idmap,
>>          cache_no_acl(inode);
>>          if (sbinfo->noswap)
>>                  mapping_set_unevictable(inode->i_mapping);
>> -       mapping_set_large_folios(inode->i_mapping);
>> +       if (sbinfo->huge)
>> +               mapping_set_large_folios(inode->i_mapping);
>>
>>          switch (mode & S_IFMT) {
>>          default:
> 
> IMHO, we no longer need the 'sbinfo->huge' validation after adding
> support for large folios in the tmpfs write and fallocate paths [1].
> 
> Kefeng, can you check whether the following RFC patch [1] solves your
> problem? Thanks.
> (PS: I will revise the patch according to Matthew's suggestion)

Sure, I will try once I'm back, but [1] won't solve the issue when
force/deny is set at runtime, e.g., mount with always/within_size but set
deny at runtime: we still fault in large chunks but can't allocate a
large folio, so write performance will degrade.


> 
> [1] https://lore.kernel.org/all/c03ec1cb1392332726ab265a3d826fe1c408c7e7.1727338549.git.baolin.wang@linux.alibaba.com/
Baolin Wang Sept. 30, 2024, 2:52 a.m. UTC | #10
On 2024/9/30 10:30, Kefeng Wang wrote:
> 
> 
> On 2024/9/30 10:02, Baolin Wang wrote:
>>
>>
>> On 2024/9/26 21:52, Matthew Wilcox wrote:
>>> On Thu, Sep 26, 2024 at 10:38:34AM +0200, Pankaj Raghav (Samsung) wrote:
>>>>> So this is why I don't use mapping_set_folio_order_range() here, but
>>>>> correct me if I am wrong.
>>>>
>>>> Yeah, the inode is active here as the max folio size is decided 
>>>> based on
>>>> the write size, so probably mapping_set_folio_order_range() will not be
>>>> a safe option.
>>>
>>> You really are all making too much of this.  Here's the patch I think we
>>> need:
>>>
>>> +++ b/mm/shmem.c
>>> @@ -2831,7 +2831,8 @@ static struct inode *__shmem_get_inode(struct 
>>> mnt_idmap *idmap,
>>>          cache_no_acl(inode);
>>>          if (sbinfo->noswap)
>>>                  mapping_set_unevictable(inode->i_mapping);
>>> -       mapping_set_large_folios(inode->i_mapping);
>>> +       if (sbinfo->huge)
>>> +               mapping_set_large_folios(inode->i_mapping);
>>>
>>>          switch (mode & S_IFMT) {
>>>          default:
>>
>> IMHO, we no longer need the 'sbinfo->huge' validation after adding
>> support for large folios in the tmpfs write and fallocate paths [1].
>>
>> Kefeng, can you check whether the following RFC patch [1] solves your
>> problem? Thanks.
>> (PS: I will revise the patch according to Matthew's suggestion)
> 
> Sure, I will try once I'm back, but [1] won't solve the issue when

Cool. Thanks.

> force/deny is set at runtime, e.g., mount with always/within_size but set
> deny at runtime: we still fault in large chunks but can't allocate a
> large folio, so write performance will degrade.

Yes, but as Hugh mentioned before, the options 'force' and 'deny' are
rather testing artifacts from the old days. So I doubt the 'deny' option
will be set in real products; do we really need to consider this case
too much?
Kefeng Wang Sept. 30, 2024, 3:15 a.m. UTC | #11
On 2024/9/30 10:52, Baolin Wang wrote:
> 
> 
> On 2024/9/30 10:30, Kefeng Wang wrote:
>>
>>
>> On 2024/9/30 10:02, Baolin Wang wrote:
>>>
>>>
>>> On 2024/9/26 21:52, Matthew Wilcox wrote:
>>>> On Thu, Sep 26, 2024 at 10:38:34AM +0200, Pankaj Raghav (Samsung) 
>>>> wrote:
>>>>>> So this is why I don't use mapping_set_folio_order_range() here, but
>>>>>> correct me if I am wrong.
>>>>>
>>>>> Yeah, the inode is active here as the max folio size is decided 
>>>>> based on
>>>>> the write size, so probably mapping_set_folio_order_range() will 
>>>>> not be
>>>>> a safe option.
>>>>
>>>> You really are all making too much of this.  Here's the patch I 
>>>> think we
>>>> need:
>>>>
>>>> +++ b/mm/shmem.c
>>>> @@ -2831,7 +2831,8 @@ static struct inode *__shmem_get_inode(struct 
>>>> mnt_idmap *idmap,
>>>>          cache_no_acl(inode);
>>>>          if (sbinfo->noswap)
>>>>                  mapping_set_unevictable(inode->i_mapping);
>>>> -       mapping_set_large_folios(inode->i_mapping);
>>>> +       if (sbinfo->huge)
>>>> +               mapping_set_large_folios(inode->i_mapping);
>>>>
>>>>          switch (mode & S_IFMT) {
>>>>          default:
>>>
>>> IMHO, we no longer need the 'sbinfo->huge' validation after 
>>> adding support for large folios in the tmpfs write and fallocate 
>>> paths [1].

Forgot to mention: we still need to check sbinfo->huge. If we mount with
huge=never but fault in large chunks, the write is slower than without
9aac777aaf94; the above change or my patch could fix it.

>>>
>>> Kefeng, can you check whether the following RFC patch [1] solves your
>>> problem? Thanks.
>>> (PS: I will revise the patch according to Matthew's suggestion)
>>
>> Sure, I will try once I'm back, but [1] won't solve the issue when
> 
> Cool. Thanks.
> 
>> force/deny is set at runtime, e.g., mount with always/within_size but set
>> deny at runtime: we still fault in large chunks but can't allocate a
>> large folio, so write performance will degrade.
> 
> Yes, but as Hugh mentioned before, the options 'force' and 'deny' are
> rather testing artifacts from the old days. So I doubt the 'deny' option
> will be set in real products; do we really need to consider this case
> too much?

OK, so maybe we should just update the documentation.
Baolin Wang Sept. 30, 2024, 6:48 a.m. UTC | #12
On 2024/9/30 11:15, Kefeng Wang wrote:
> 
> 
> On 2024/9/30 10:52, Baolin Wang wrote:
>>
>>
>> On 2024/9/30 10:30, Kefeng Wang wrote:
>>>
>>>
>>> On 2024/9/30 10:02, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/9/26 21:52, Matthew Wilcox wrote:
>>>>> On Thu, Sep 26, 2024 at 10:38:34AM +0200, Pankaj Raghav (Samsung) 
>>>>> wrote:
>>>>>>> So this is why I don't use mapping_set_folio_order_range() here, but
>>>>>>> correct me if I am wrong.
>>>>>>
>>>>>> Yeah, the inode is active here as the max folio size is decided 
>>>>>> based on
>>>>>> the write size, so probably mapping_set_folio_order_range() will 
>>>>>> not be
>>>>>> a safe option.
>>>>>
>>>>> You really are all making too much of this.  Here's the patch I 
>>>>> think we
>>>>> need:
>>>>>
>>>>> +++ b/mm/shmem.c
>>>>> @@ -2831,7 +2831,8 @@ static struct inode *__shmem_get_inode(struct 
>>>>> mnt_idmap *idmap,
>>>>>          cache_no_acl(inode);
>>>>>          if (sbinfo->noswap)
>>>>>                  mapping_set_unevictable(inode->i_mapping);
>>>>> -       mapping_set_large_folios(inode->i_mapping);
>>>>> +       if (sbinfo->huge)
>>>>> +               mapping_set_large_folios(inode->i_mapping);
>>>>>
>>>>>          switch (mode & S_IFMT) {
>>>>>          default:
>>>>
>>>> IMHO, we no longer need the 'sbinfo->huge' validation after 
>>>> adding support for large folios in the tmpfs write and fallocate 
>>>> paths [1].
> 
> Forgot to mention: we still need to check sbinfo->huge. If we mount with
> huge=never but fault in large chunks, the write is slower than without
> 9aac777aaf94; the above change or my patch could fix it.

My patch will allow allocating large folios in the tmpfs write and
fallocate paths even though the 'huge' option is 'never'.

My initial thought for supporting large folios was that, if the 'huge'
option is enabled, to maintain backward compatibility, we only allow 2M
PMD-sized order allocations. If the 'huge' option is disabled
(huge=never), we still allow large folio allocations based on the write
length.

Another choice is to allow different-sized large folio allocations based
on the write length when the 'huge' option is enabled, rather than just
the 2M PMD size, but to force the huge orders off if the 'huge' option
is disabled.

We still need some discussion to determine which method is preferable.
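
A rough sketch contrasting the two policies (purely illustrative;
orders_from_write_length() and the masks below are not from the RFC
patch):

/* Illustrative only: mask of allowable folio orders for a tmpfs write. */
unsigned long orders;

/* Choice 1: keep 'huge' backward compatible (PMD-only when enabled),
 * while huge=never still gets write-length-based large folios. */
orders = sbinfo->huge ? BIT(HPAGE_PMD_ORDER)
		      : orders_from_write_length(len);

/* Choice 2: let 'huge' enable write-length-based orders as well,
 * but force all large orders off when huge=never. */
orders = sbinfo->huge ? orders_from_write_length(len) : 0;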
Kefeng Wang Oct. 9, 2024, 7:09 a.m. UTC | #13
On 2024/9/30 14:48, Baolin Wang wrote:
> 
> 
> On 2024/9/30 11:15, Kefeng Wang wrote:
>>
>>
>> On 2024/9/30 10:52, Baolin Wang wrote:
>>>
>>>
>>> On 2024/9/30 10:30, Kefeng Wang wrote:
>>>>
>>>>
>>>> On 2024/9/30 10:02, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/9/26 21:52, Matthew Wilcox wrote:
>>>>>> On Thu, Sep 26, 2024 at 10:38:34AM +0200, Pankaj Raghav (Samsung) 
>>>>>> wrote:
>>>>>>>> So this is why I don't use mapping_set_folio_order_range() here, 
>>>>>>>> but
>>>>>>>> correct me if I am wrong.
>>>>>>>
>>>>>>> Yeah, the inode is active here as the max folio size is decided 
>>>>>>> based on
>>>>>>> the write size, so probably mapping_set_folio_order_range() will 
>>>>>>> not be
>>>>>>> a safe option.
>>>>>>
>>>>>> You really are all making too much of this.  Here's the patch I 
>>>>>> think we
>>>>>> need:
>>>>>>
>>>>>> +++ b/mm/shmem.c
>>>>>> @@ -2831,7 +2831,8 @@ static struct inode 
>>>>>> *__shmem_get_inode(struct mnt_idmap *idmap,
>>>>>>          cache_no_acl(inode);
>>>>>>          if (sbinfo->noswap)
>>>>>>                  mapping_set_unevictable(inode->i_mapping);
>>>>>> -       mapping_set_large_folios(inode->i_mapping);
>>>>>> +       if (sbinfo->huge)
>>>>>> +               mapping_set_large_folios(inode->i_mapping);
>>>>>>
>>>>>>          switch (mode & S_IFMT) {
>>>>>>          default:
>>>>>
>>>>> IMHO, we no longer need the 'sbinfo->huge' validation after 
>>>>> adding support for large folios in the tmpfs write and fallocate 
>>>>> paths [1].
>>
>> Forgot to mention: we still need to check sbinfo->huge. If we mount with
>> huge=never but fault in large chunks, the write is slower than without
>> 9aac777aaf94; the above change or my patch could fix it.
> 
> My patch will allow allocating large folios in the tmpfs write and
> fallocate paths even though the 'huge' option is 'never'.

Yes, indeed, after checking your patch:

The 'Writing intelligently' results from 'Bonnie -d /mnt/tmpfs/ -s 1024',
based on next-20241008:

1) huge=never
    the base:                                    2016438 K/Sec
    my v1/v2 or Matthew's patch :                2874504 K/Sec
    your patch with filemap_get_order() fix:     6330604 K/Sec

2) huge=always
    the write performance:                       7168917 K/Sec

Since large folios are now supported in the tmpfs write path, we do get
better performance, as shown above; that's great.

> 
>> My initial thought for supporting large folios was that, if the 'huge'
>> option is enabled, to maintain backward compatibility, we only allow 2M
>> PMD-sized order allocations. If the 'huge' option is disabled
>> (huge=never), we still allow large folio allocations based on the write
>> length.
> 
>> Another choice is to allow different-sized large folio allocations based
>> on the write length when the 'huge' option is enabled, rather than just
>> the 2M PMD size, but to force the huge orders off if the 'huge' option
>> is disabled.
> 

"huge=never  Do not allocate huge pages. This is the default."
 From the document, it's better not to allocate large folio, but we need
some special handle for huge=never or runtime deny/force.

> We still need some discussion to determine which method is preferable.

Personally, I like your current implementation, but it does not match
the documentation.
Baolin Wang Oct. 9, 2024, 8:52 a.m. UTC | #14
On 2024/10/9 15:09, Kefeng Wang wrote:
> 
> 
> On 2024/9/30 14:48, Baolin Wang wrote:
>>
>>
>> On 2024/9/30 11:15, Kefeng Wang wrote:
>>>
>>>
>>> On 2024/9/30 10:52, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2024/9/30 10:30, Kefeng Wang wrote:
>>>>>
>>>>>
>>>>> On 2024/9/30 10:02, Baolin Wang wrote:
>>>>>>
>>>>>>
>>>>>> On 2024/9/26 21:52, Matthew Wilcox wrote:
>>>>>>> On Thu, Sep 26, 2024 at 10:38:34AM +0200, Pankaj Raghav (Samsung) 
>>>>>>> wrote:
>>>>>>>>> So this is why I don't use mapping_set_folio_order_range() 
>>>>>>>>> here, but
>>>>>>>>> correct me if I am wrong.
>>>>>>>>
>>>>>>>> Yeah, the inode is active here as the max folio size is decided 
>>>>>>>> based on
>>>>>>>> the write size, so probably mapping_set_folio_order_range() will 
>>>>>>>> not be
>>>>>>>> a safe option.
>>>>>>>
>>>>>>> You really are all making too much of this.  Here's the patch I 
>>>>>>> think we
>>>>>>> need:
>>>>>>>
>>>>>>> +++ b/mm/shmem.c
>>>>>>> @@ -2831,7 +2831,8 @@ static struct inode 
>>>>>>> *__shmem_get_inode(struct mnt_idmap *idmap,
>>>>>>>          cache_no_acl(inode);
>>>>>>>          if (sbinfo->noswap)
>>>>>>>                  mapping_set_unevictable(inode->i_mapping);
>>>>>>> -       mapping_set_large_folios(inode->i_mapping);
>>>>>>> +       if (sbinfo->huge)
>>>>>>> +               mapping_set_large_folios(inode->i_mapping);
>>>>>>>
>>>>>>>          switch (mode & S_IFMT) {
>>>>>>>          default:
>>>>>>
>>>>>> IMHO, we no longer need the 'sbinfo->huge' validation after 
>>>>>> adding support for large folios in the tmpfs write and fallocate 
>>>>>> paths [1].
>>>
>>> Forgot to mention: we still need to check sbinfo->huge. If we mount with
>>> huge=never but fault in large chunks, the write is slower than without
>>> 9aac777aaf94; the above change or my patch could fix it.
>>
>> My patch will allow allocating large folios in the tmpfs write and
>> fallocate paths even though the 'huge' option is 'never'.
> 
> Yes, indeed, after checking your patch:
> 
> The 'Writing intelligently' results from 'Bonnie -d /mnt/tmpfs/ -s 1024',
> based on next-20241008:
> 
> 1) huge=never
>     the base:                                    2016438 K/Sec
>     my v1/v2 or Matthew's patch :                2874504 K/Sec
>     your patch with filemap_get_order() fix:     6330604 K/Sec
> 
> 2) huge=always
>     the write performance:                       7168917 K/Sec
> 
> Since large folios are now supported in the tmpfs write path, we do get
> better performance, as shown above; that's great.

Great. Thanks for testing.

>> My initial thought for supporting large folios was that, if the 'huge'
>> option is enabled, to maintain backward compatibility, we only allow 2M
>> PMD-sized order allocations. If the 'huge' option is disabled
>> (huge=never), we still allow large folio allocations based on the write
>> length.
>>
>> Another choice is to allow different-sized large folio allocations based
>> on the write length when the 'huge' option is enabled, rather than just
>> the 2M PMD size, but to force the huge orders off if the 'huge' option
>> is disabled.
>>
> 
> "huge=never  Do not allocate huge pages. This is the default."
>  From the document, it's better not to allocate large folio, but we need
> some special handle for huge=never or runtime deny/force.

Yes. I'm thinking of adding a new option (something like 'huge=mTHP') to 
allocate large folios based on the write size.

I will resend the patchset, and we can discuss it there.

Patch

diff --git a/mm/filemap.c b/mm/filemap.c
index 3e46ca45e13d..b33f260fa32f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -4126,13 +4126,28 @@  generic_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
 }
 EXPORT_SYMBOL(generic_file_direct_write);
 
+static size_t generic_mapping_max_folio_size(struct address_space *mapping,
+					     loff_t pos)
+{
+	struct inode *inode = mapping->host;
+	pgoff_t index = pos >> PAGE_SHIFT;
+
+	if (!shmem_mapping(mapping))
+		goto out;
+
+	if (!shmem_allowable_huge_orders(inode, NULL, index, 0, false))
+		return PAGE_SIZE;
+out:
+	return mapping_max_folio_size(mapping);
+}
+
 ssize_t generic_perform_write(struct kiocb *iocb, struct iov_iter *i)
 {
 	struct file *file = iocb->ki_filp;
 	loff_t pos = iocb->ki_pos;
 	struct address_space *mapping = file->f_mapping;
 	const struct address_space_operations *a_ops = mapping->a_ops;
-	size_t chunk = mapping_max_folio_size(mapping);
+	size_t chunk = generic_mapping_max_folio_size(mapping, pos);
 	long status = 0;
 	ssize_t written = 0;
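
For context, 'chunk' bounds both the pre-fault and the copy sizes later
in generic_perform_write(); a paraphrased sketch of that loop (from
mm/filemap.c, not part of this patch):

	do {
		size_t offset = pos & (chunk - 1);	/* offset within the chunk */
		size_t bytes = min(chunk - offset, iov_iter_count(i));

		/*
		 * Pre-fault at most one chunk of the user buffer;
		 * fault_in_iov_iter_readable() returns the number of bytes
		 * that could not be faulted in, so '== bytes' means none.
		 * With chunk clamped to PAGE_SIZE when large folios are not
		 * allowed, we no longer pre-fault memory that the page-sized
		 * copies below cannot consume in one pass.
		 */
		if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
			status = -EFAULT;
			break;
		}

		/* ... allocate a folio, copy 'bytes' into it, mark it dirty ... */
	} while (iov_iter_count(i));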