diff mbox series

[v2] vma remove the unneeded avc bound with non-CoWed folio

Message ID 20240823140139.263-1-justinjiang@vivo.com (mailing list archive)
State New
Headers show
Series [v2] vma remove the unneeded avc bound with non-CoWed folio | expand

Commit Message

zhiguojiang Aug. 23, 2024, 2:01 p.m. UTC
After CoWed by do_wp_page, the vma established a new mapping relationship
with the CoWed folio instead of the non-CoWed folio. However, regarding
the situation where vma->anon_vma and the non-CoWed folio's anon_vma are
not same, the avc binding relationship between them will no longer be
needed, so it is issue for the avc binding relationship still existing
between them.

This patch will remove the avc binding relationship between vma and the
non-CoWed folio's anon_vma, which each has their own independent
anon_vma. It can also alleviates rmap overhead simultaneously.

Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
---
-v2:
 * Solve the kernel test robot noticed "WARNING"
   Reported-by: kernel test robot <oliver.sang@intel.com>
   Closes: https://lore.kernel.org/oe-lkp/202408230938.43f55b4-lkp@intel.com
 * Update comments to more accurately describe this patch.

-v1:
 https://lore.kernel.org/linux-mm/20240820143359.199-1-justinjiang@vivo.com/

 include/linux/rmap.h |  1 +
 mm/memory.c          |  8 +++++++
 mm/rmap.c            | 53 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 62 insertions(+)

Comments

David Hildenbrand Aug. 26, 2024, 5:24 p.m. UTC | #1
On 23.08.24 16:01, Zhiguo Jiang wrote:
> After CoWed by do_wp_page, the vma established a new mapping relationship
> with the CoWed folio instead of the non-CoWed folio. However, regarding
> the situation where vma->anon_vma and the non-CoWed folio's anon_vma are
> not same, the avc binding relationship between them will no longer be
> needed, so it is issue for the avc binding relationship still existing
> between them.
> 
> This patch will remove the avc binding relationship between vma and the
> non-CoWed folio's anon_vma, which each has their own independent
> anon_vma. It can also alleviates rmap overhead simultaneously.
> 
> Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
> ---
> -v2:
>   * Solve the kernel test robot noticed "WARNING"
>     Reported-by: kernel test robot <oliver.sang@intel.com>
>     Closes: https://lore.kernel.org/oe-lkp/202408230938.43f55b4-lkp@intel.com
>   * Update comments to more accurately describe this patch.
> 
> -v1:
>   https://lore.kernel.org/linux-mm/20240820143359.199-1-justinjiang@vivo.com/
> 
>   include/linux/rmap.h |  1 +
>   mm/memory.c          |  8 +++++++
>   mm/rmap.c            | 53 ++++++++++++++++++++++++++++++++++++++++++++
>   3 files changed, 62 insertions(+)
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 91b5935e8485..8607d28a3146
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -257,6 +257,7 @@ void folio_remove_rmap_ptes(struct folio *, struct page *, int nr_pages,
>   	folio_remove_rmap_ptes(folio, page, 1, vma)
>   void folio_remove_rmap_pmd(struct folio *, struct page *,
>   		struct vm_area_struct *);
> +void folio_remove_anon_avc(struct folio *, struct vm_area_struct *);
>   
>   void hugetlb_add_anon_rmap(struct folio *, struct vm_area_struct *,
>   		unsigned long address, rmap_t flags);
> diff --git a/mm/memory.c b/mm/memory.c
> index 93c0c25433d0..4c89cb1cb73e
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3428,6 +3428,14 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
>   			 * old page will be flushed before it can be reused.
>   			 */
>   			folio_remove_rmap_pte(old_folio, vmf->page, vma);
> +
> +			/*
> +			 * If the new_folio's anon_vma is different from the
> +			 * old_folio's anon_vma, the avc binding relationship
> +			 * between vma and the old_folio's anon_vma is removed,
> +			 * avoiding rmap redundant overhead.
> +			 */
> +			folio_remove_anon_avc(old_folio, vma);

... by increasing write fault latency, introducing an RMAP walk (!)? Hmm?

On the reuse path, we do a folio_move_anon_rmap(), to optimize that.
zhiguojiang Aug. 27, 2024, 1:50 a.m. UTC | #2
在 2024/8/27 1:24, David Hildenbrand 写道:
> On 23.08.24 16:01, Zhiguo Jiang wrote:
>> After CoWed by do_wp_page, the vma established a new mapping 
>> relationship
>> with the CoWed folio instead of the non-CoWed folio. However, regarding
>> the situation where vma->anon_vma and the non-CoWed folio's anon_vma are
>> not same, the avc binding relationship between them will no longer be
>> needed, so it is issue for the avc binding relationship still existing
>> between them.
>>
>> This patch will remove the avc binding relationship between vma and the
>> non-CoWed folio's anon_vma, which each has their own independent
>> anon_vma. It can also alleviates rmap overhead simultaneously.
>>
>> Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
>> ---
>> -v2:
>>   * Solve the kernel test robot noticed "WARNING"
>>     Reported-by: kernel test robot <oliver.sang@intel.com>
>>     Closes: 
>> https://lore.kernel.org/oe-lkp/202408230938.43f55b4-lkp@intel.com
>>   * Update comments to more accurately describe this patch.
>>
>> -v1:
>> https://lore.kernel.org/linux-mm/20240820143359.199-1-justinjiang@vivo.com/
>>
>>   include/linux/rmap.h |  1 +
>>   mm/memory.c          |  8 +++++++
>>   mm/rmap.c            | 53 ++++++++++++++++++++++++++++++++++++++++++++
>>   3 files changed, 62 insertions(+)
>>
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index 91b5935e8485..8607d28a3146
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -257,6 +257,7 @@ void folio_remove_rmap_ptes(struct folio *, 
>> struct page *, int nr_pages,
>>       folio_remove_rmap_ptes(folio, page, 1, vma)
>>   void folio_remove_rmap_pmd(struct folio *, struct page *,
>>           struct vm_area_struct *);
>> +void folio_remove_anon_avc(struct folio *, struct vm_area_struct *);
>>     void hugetlb_add_anon_rmap(struct folio *, struct vm_area_struct *,
>>           unsigned long address, rmap_t flags);
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 93c0c25433d0..4c89cb1cb73e
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -3428,6 +3428,14 @@ static vm_fault_t wp_page_copy(struct vm_fault 
>> *vmf)
>>                * old page will be flushed before it can be reused.
>>                */
>>               folio_remove_rmap_pte(old_folio, vmf->page, vma);
>> +
>> +            /*
>> +             * If the new_folio's anon_vma is different from the
>> +             * old_folio's anon_vma, the avc binding relationship
>> +             * between vma and the old_folio's anon_vma is removed,
>> +             * avoiding rmap redundant overhead.
>> +             */
>> +            folio_remove_anon_avc(old_folio, vma);
>
> ... by increasing write fault latency, introducing an RMAP walk (!)? Hmm?
>
> On the reuse path, we do a folio_move_anon_rmap(), to optimize that.
>
Thanks for your comments. This may not be a good fixup patch. The
resue patch folio_move_anon_rmap() seems to be exclusive or
_refcount = 1 folios. The fork() path seems to clear exclusive flag
in copy_page_range() --> ... --> __folio_try_dup_anon_rmap(). However,
I observed lots of orphan avcs by the above debug trace logs in
wp_page_copy(). But they may be not removed by discussing with Mika.

Thanks
Zhiguo
David Hildenbrand Aug. 27, 2024, 5:35 p.m. UTC | #3
On 27.08.24 03:50, zhiguojiang wrote:
> 
> 
> 在 2024/8/27 1:24, David Hildenbrand 写道:
>> On 23.08.24 16:01, Zhiguo Jiang wrote:
>>> After CoWed by do_wp_page, the vma established a new mapping
>>> relationship
>>> with the CoWed folio instead of the non-CoWed folio. However, regarding
>>> the situation where vma->anon_vma and the non-CoWed folio's anon_vma are
>>> not same, the avc binding relationship between them will no longer be
>>> needed, so it is issue for the avc binding relationship still existing
>>> between them.
>>>
>>> This patch will remove the avc binding relationship between vma and the
>>> non-CoWed folio's anon_vma, which each has their own independent
>>> anon_vma. It can also alleviates rmap overhead simultaneously.
>>>
>>> Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
>>> ---
>>> -v2:
>>>    * Solve the kernel test robot noticed "WARNING"
>>>      Reported-by: kernel test robot <oliver.sang@intel.com>
>>>      Closes:
>>> https://lore.kernel.org/oe-lkp/202408230938.43f55b4-lkp@intel.com
>>>    * Update comments to more accurately describe this patch.
>>>
>>> -v1:
>>> https://lore.kernel.org/linux-mm/20240820143359.199-1-justinjiang@vivo.com/
>>>
>>>    include/linux/rmap.h |  1 +
>>>    mm/memory.c          |  8 +++++++
>>>    mm/rmap.c            | 53 ++++++++++++++++++++++++++++++++++++++++++++
>>>    3 files changed, 62 insertions(+)
>>>
>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>> index 91b5935e8485..8607d28a3146
>>> --- a/include/linux/rmap.h
>>> +++ b/include/linux/rmap.h
>>> @@ -257,6 +257,7 @@ void folio_remove_rmap_ptes(struct folio *,
>>> struct page *, int nr_pages,
>>>        folio_remove_rmap_ptes(folio, page, 1, vma)
>>>    void folio_remove_rmap_pmd(struct folio *, struct page *,
>>>            struct vm_area_struct *);
>>> +void folio_remove_anon_avc(struct folio *, struct vm_area_struct *);
>>>      void hugetlb_add_anon_rmap(struct folio *, struct vm_area_struct *,
>>>            unsigned long address, rmap_t flags);
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 93c0c25433d0..4c89cb1cb73e
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -3428,6 +3428,14 @@ static vm_fault_t wp_page_copy(struct vm_fault
>>> *vmf)
>>>                 * old page will be flushed before it can be reused.
>>>                 */
>>>                folio_remove_rmap_pte(old_folio, vmf->page, vma);
>>> +
>>> +            /*
>>> +             * If the new_folio's anon_vma is different from the
>>> +             * old_folio's anon_vma, the avc binding relationship
>>> +             * between vma and the old_folio's anon_vma is removed,
>>> +             * avoiding rmap redundant overhead.
>>> +             */
>>> +            folio_remove_anon_avc(old_folio, vma);
>>
>> ... by increasing write fault latency, introducing an RMAP walk (!)? Hmm?
>>
>> On the reuse path, we do a folio_move_anon_rmap(), to optimize that.
>>
> Thanks for your comments. This may not be a good fixup patch. The
> resue patch folio_move_anon_rmap() seems to be exclusive or
> _refcount = 1 folios. The fork() path seems to clear exclusive flag
> in copy_page_range() --> ... --> __folio_try_dup_anon_rmap(). However,
> I observed lots of orphan avcs by the above debug trace logs in
> wp_page_copy(). But they may be not removed by discussing with Mika.

Was this patch ever tested? I cannot even boot a simple VM without an endless stream of

[    5.804598] ------------[ cut here ]------------
[    5.805494] WARNING: CPU: 11 PID: 595 at mm/rmap.c:443 unlink_anon_vmas+0x19b/0x1d0
[    5.806962] Modules linked in: qemu_fw_cfg
[    5.807762] CPU: 11 UID: 0 PID: 595 Comm: dracut-rootfs-g Tainted: G        W          6.11.0-rc4+ #72
[    5.809546] Tainted: [W]=WARN
[    5.810127] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-2.fc40 04/01/2014
[    5.811753] RIP: 0010:unlink_anon_vmas+0x19b/0x1d0
[    5.812680] Code: b0 00 00 00 00 75 1f f0 ff 8f a0 00 00 00 75 a2 e8 8a fd ff ff eb 9b 5b 5d 41 5c 41 5d 41 5e 41 5f e9 d4 82 d0 00 0f 0b eb dd <0f> 0b eb cf 0f 0b 48 83 c7 08 e8 16 40 d7 ff e9 ea fe ff ff 48 8b
[    5.816247] RSP: 0018:ffffa19f43bb78d0 EFLAGS: 00010286
[    5.817258] RAX: ffff8a71c1bdd2d0 RBX: ffff8a71c1bdd2c0 RCX: ffff8a71c27a86c8
[    5.818624] RDX: 0000000000000001 RSI: ffff8a71c2771b28 RDI: ffff8a71c27a9e60
[    5.820011] RBP: dead000000000122 R08: 0000000000000000 R09: 0000000000000001
[    5.821380] R10: 0000000000000200 R11: 0000000000000001 R12: ffff8a71c2771b28
[    5.822748] R13: dead000000000100 R14: ffff8a71c2771b18 R15: ffff8a71c27a9e60
[    5.824122] FS:  0000000000000000(0000) GS:ffff8a7337980000(0000) knlGS:0000000000000000
[    5.825665] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    5.826775] CR2: 00007fca7f70ac58 CR3: 00000001027b2004 CR4: 0000000000770ef0
[    5.828146] PKRU: 55555554
[    5.828686] Call Trace:
[    5.829169]  <TASK>
[    5.829594]  ? __warn.cold+0xb1/0x13e
[    5.830312]  ? unlink_anon_vmas+0x19b/0x1d0
[    5.831118]  ? report_bug+0xff/0x140
[    5.831840]  ? handle_bug+0x3c/0x80
[    5.832524]  ? exc_invalid_op+0x17/0x70
[    5.833262]  ? asm_exc_invalid_op+0x1a/0x20
[    5.834086]  ? unlink_anon_vmas+0x19b/0x1d0
[    5.834908]  free_pgtables+0x130/0x290
[    5.835661]  exit_mmap+0x19a/0x460
[    5.836351]  __mmput+0x4b/0x120
[    5.836965]  do_exit+0x2e1/0xac0
[    5.837601]  ? lock_release+0xd5/0x2c0
[    5.838343]  do_group_exit+0x36/0xa0
[    5.839035]  __x64_sys_exit_group+0x18/0x20
[    5.839866]  x64_sys_call+0x14b4/0x14c0


Andrew, please remove this from mm-unstable.
zhiguojiang Aug. 28, 2024, 1:14 a.m. UTC | #4
在 2024/8/28 1:35, David Hildenbrand 写道:
> On 27.08.24 03:50, zhiguojiang wrote:
>>
>>
>> 在 2024/8/27 1:24, David Hildenbrand 写道:
>>> On 23.08.24 16:01, Zhiguo Jiang wrote:
>>>> After CoWed by do_wp_page, the vma established a new mapping
>>>> relationship
>>>> with the CoWed folio instead of the non-CoWed folio. However, 
>>>> regarding
>>>> the situation where vma->anon_vma and the non-CoWed folio's 
>>>> anon_vma are
>>>> not same, the avc binding relationship between them will no longer be
>>>> needed, so it is issue for the avc binding relationship still existing
>>>> between them.
>>>>
>>>> This patch will remove the avc binding relationship between vma and 
>>>> the
>>>> non-CoWed folio's anon_vma, which each has their own independent
>>>> anon_vma. It can also alleviates rmap overhead simultaneously.
>>>>
>>>> Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
>>>> ---
>>>> -v2:
>>>>    * Solve the kernel test robot noticed "WARNING"
>>>>      Reported-by: kernel test robot <oliver.sang@intel.com>
>>>>      Closes:
>>>> https://lore.kernel.org/oe-lkp/202408230938.43f55b4-lkp@intel.com 
>>>>
>>>>    * Update comments to more accurately describe this patch.
>>>>
>>>> -v1:
>>>> https://lore.kernel.org/linux-mm/20240820143359.199-1-justinjiang@vivo.com/ 
>>>>
>>>>
>>>>    include/linux/rmap.h |  1 +
>>>>    mm/memory.c          |  8 +++++++
>>>>    mm/rmap.c            | 53 
>>>> ++++++++++++++++++++++++++++++++++++++++++++
>>>>    3 files changed, 62 insertions(+)
>>>>
>>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>>> index 91b5935e8485..8607d28a3146
>>>> --- a/include/linux/rmap.h
>>>> +++ b/include/linux/rmap.h
>>>> @@ -257,6 +257,7 @@ void folio_remove_rmap_ptes(struct folio *,
>>>> struct page *, int nr_pages,
>>>>        folio_remove_rmap_ptes(folio, page, 1, vma)
>>>>    void folio_remove_rmap_pmd(struct folio *, struct page *,
>>>>            struct vm_area_struct *);
>>>> +void folio_remove_anon_avc(struct folio *, struct vm_area_struct *);
>>>>      void hugetlb_add_anon_rmap(struct folio *, struct 
>>>> vm_area_struct *,
>>>>            unsigned long address, rmap_t flags);
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 93c0c25433d0..4c89cb1cb73e
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -3428,6 +3428,14 @@ static vm_fault_t wp_page_copy(struct vm_fault
>>>> *vmf)
>>>>                 * old page will be flushed before it can be reused.
>>>>                 */
>>>>                folio_remove_rmap_pte(old_folio, vmf->page, vma);
>>>> +
>>>> +            /*
>>>> +             * If the new_folio's anon_vma is different from the
>>>> +             * old_folio's anon_vma, the avc binding relationship
>>>> +             * between vma and the old_folio's anon_vma is removed,
>>>> +             * avoiding rmap redundant overhead.
>>>> +             */
>>>> +            folio_remove_anon_avc(old_folio, vma);
>>>
>>> ... by increasing write fault latency, introducing an RMAP walk (!)? 
>>> Hmm?
>>>
>>> On the reuse path, we do a folio_move_anon_rmap(), to optimize that.
>>>
>> Thanks for your comments. This may not be a good fixup patch. The
>> resue patch folio_move_anon_rmap() seems to be exclusive or
>> _refcount = 1 folios. The fork() path seems to clear exclusive flag
>> in copy_page_range() --> ... --> __folio_try_dup_anon_rmap(). However,
>> I observed lots of orphan avcs by the above debug trace logs in
>> wp_page_copy(). But they may be not removed by discussing with Mika.
>
> Was this patch ever tested? I cannot even boot a simple VM without an 
> endless stream of
>
> [    5.804598] ------------[ cut here ]------------
> [    5.805494] WARNING: CPU: 11 PID: 595 at mm/rmap.c:443 
> unlink_anon_vmas+0x19b/0x1d0
> [    5.806962] Modules linked in: qemu_fw_cfg
> [    5.807762] CPU: 11 UID: 0 PID: 595 Comm: dracut-rootfs-g Tainted: 
> G        W          6.11.0-rc4+ #72
> [    5.809546] Tainted: [W]=WARN
> [    5.810127] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), 
> BIOS 1.16.3-2.fc40 04/01/2014
> [    5.811753] RIP: 0010:unlink_anon_vmas+0x19b/0x1d0
> [    5.812680] Code: b0 00 00 00 00 75 1f f0 ff 8f a0 00 00 00 75 a2 
> e8 8a fd ff ff eb 9b 5b 5d 41 5c 41 5d 41 5e 41 5f e9 d4 82 d0 00 0f 
> 0b eb dd <0f> 0b eb cf 0f 0b 48 83 c7 08 e8 16 40 d7 ff e9 ea fe ff ff 
> 48 8b
> [    5.816247] RSP: 0018:ffffa19f43bb78d0 EFLAGS: 00010286
> [    5.817258] RAX: ffff8a71c1bdd2d0 RBX: ffff8a71c1bdd2c0 RCX: 
> ffff8a71c27a86c8
> [    5.818624] RDX: 0000000000000001 RSI: ffff8a71c2771b28 RDI: 
> ffff8a71c27a9e60
> [    5.820011] RBP: dead000000000122 R08: 0000000000000000 R09: 
> 0000000000000001
> [    5.821380] R10: 0000000000000200 R11: 0000000000000001 R12: 
> ffff8a71c2771b28
> [    5.822748] R13: dead000000000100 R14: ffff8a71c2771b18 R15: 
> ffff8a71c27a9e60
> [    5.824122] FS:  0000000000000000(0000) GS:ffff8a7337980000(0000) 
> knlGS:0000000000000000
> [    5.825665] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    5.826775] CR2: 00007fca7f70ac58 CR3: 00000001027b2004 CR4: 
> 0000000000770ef0
> [    5.828146] PKRU: 55555554
> [    5.828686] Call Trace:
> [    5.829169]  <TASK>
> [    5.829594]  ? __warn.cold+0xb1/0x13e
> [    5.830312]  ? unlink_anon_vmas+0x19b/0x1d0
> [    5.831118]  ? report_bug+0xff/0x140
> [    5.831840]  ? handle_bug+0x3c/0x80
> [    5.832524]  ? exc_invalid_op+0x17/0x70
> [    5.833262]  ? asm_exc_invalid_op+0x1a/0x20
> [    5.834086]  ? unlink_anon_vmas+0x19b/0x1d0
> [    5.834908]  free_pgtables+0x130/0x290
> [    5.835661]  exit_mmap+0x19a/0x460
> [    5.836351]  __mmput+0x4b/0x120
> [    5.836965]  do_exit+0x2e1/0xac0
> [    5.837601]  ? lock_release+0xd5/0x2c0
> [    5.838343]  do_group_exit+0x36/0xa0
> [    5.839035]  __x64_sys_exit_group+0x18/0x20
> [    5.839866]  x64_sys_call+0x14b4/0x14c0
Arm64 machine tested it and no crashes detected. You may try the
attachment modifition provided by Lorenzo Stoakes. Can you please
check if there are any opportunities for further improvement?
>
>
> Andrew, please remove this from mm-unstable.

Thanks
Zhiguo
From 973ce5f0aea78196088cd527905cc0fad40edb29 Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date: Sat, 24 Aug 2024 18:55:31 +0100
Subject: [RFC PATCH] mm: fixup orphan avc cleanup logic

Existing logic failed to reparent the anon_vma whose avc was removed which
resulted in assertion failures.

This patch corrects this, fixes up some comments, and does some other
cleanups.

We also do not do anything relating to anon_vma->parent manipulation if no
orphaned AVC is found.

I still feel this logic is highly dubious, but this does fix the issue with
anon_vma->num_children accounting.

This doesn't correctly handle locking of the reparented anon_vma.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/linux/rmap.h |   2 +-
 mm/memory.c          |   2 +-
 mm/rmap.c            | 101 ++++++++++++++++++++++++++++---------------
 3 files changed, 68 insertions(+), 37 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 8607d28a3146..f1a835f54064 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -257,7 +257,7 @@ void folio_remove_rmap_ptes(struct folio *, struct page *, int nr_pages,
 	folio_remove_rmap_ptes(folio, page, 1, vma)
 void folio_remove_rmap_pmd(struct folio *, struct page *,
 		struct vm_area_struct *);
-void folio_remove_anon_avc(struct folio *, struct vm_area_struct *);
+void cleanup_orphan_avc(struct folio *, struct vm_area_struct *);
 
 void hugetlb_add_anon_rmap(struct folio *, struct vm_area_struct *,
 		unsigned long address, rmap_t flags);
diff --git a/mm/memory.c b/mm/memory.c
index 4c89cb1cb73e..989b078dd860 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3435,7 +3435,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 			 * between vma and the old_folio's anon_vma is removed,
 			 * avoiding rmap redundant overhead.
 			 */
-			folio_remove_anon_avc(old_folio, vma);
+			cleanup_orphan_avc(old_folio, vma);
 		}
 
 		/* Free the old page.. */
diff --git a/mm/rmap.c b/mm/rmap.c
index 56fc16fcf2a9..3ac264962917 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1523,56 +1523,87 @@ void folio_add_file_rmap_pmd(struct folio *folio, struct page *page,
 }
 
 /**
- * folio_remove_anon_avc - remove the avc binding relationship between
- * folio and vma with different anon_vmas.
- * @folio:	The folio with anon_vma to remove the binded avc from
- * @vma:	The vm area to remove the binded avc with folio's anon_vma
+ * cleanup_orphan_avc - remove the avc binding relationship between a parent
+ * folio and child vma with different anon_vmas which, due to an operation such
+ * as CoW'ing a folio, is no longer meaningful.
  *
- * The caller is currently used for CoWed scene.
+ * (insert ASCII diagrams and explanation here...)
+ *
+ * @old_folio:  The folio which contains the parent anon_vma which has an unneeded
+ *              avc binding.
+ * @new_vma:	The VMA which is unnecessarily bound to folio.
  */
-void folio_remove_anon_avc(struct folio *folio,
-		struct vm_area_struct *vma)
+void cleanup_orphan_avc(struct folio *old_folio, struct vm_area_struct *new_vma)
 {
-	struct anon_vma *anon_vma = folio_anon_vma(folio);
+	struct anon_vma *parent_anon_vma = folio_anon_vma(old_folio);
+	struct anon_vma *child_anon_vma = new_vma->anon_vma;
 	pgoff_t pgoff_start, pgoff_end;
 	struct anon_vma_chain *avc;
+	bool removed = false;
 
 	/*
-	 * Ensure that the vma's anon_vma and the folio's
-	 * anon_vma exist and are not same.
+	 * If this folio were not anonymous, folio_anon_vma() would have
+	 * returned NULL. Equally, if the parent and child anon_vma objects are
+	 * the same, then we have nothing to do here.
 	 */
-	if (!folio_test_anon(folio) || unlikely(!anon_vma) ||
-	    anon_vma == vma->anon_vma)
+	if (!parent_anon_vma || parent_anon_vma == child_anon_vma)
 		return;
 
-	pgoff_start = folio_pgoff(folio);
-	pgoff_end = pgoff_start + folio_nr_pages(folio) - 1;
+	pgoff_start = folio_pgoff(old_folio);
+	pgoff_end = pgoff_start + folio_nr_pages(old_folio) - 1;
 
-	if (!anon_vma_trylock_write(anon_vma))
+	/* This is an optimistic attempt. */
+	if (!anon_vma_trylock_write(parent_anon_vma))
 		return;
 
-	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
-			pgoff_start, pgoff_end) {
-		/*
-		 * Find the avc associated with vma from the folio's
-		 * anon_vma and remove it.
-		 */
-		if (avc->vma == vma) {
-			anon_vma_interval_tree_remove(avc, &anon_vma->rb_root);
-			/*
-			 * When removing the avc with anon_vma that is
-			 * different from the parent anon_vma from parent
-			 * anon_vma->rb_root, the parent num_children
-			 * count value is needed to reduce one.
-			 */
-			anon_vma->num_children--;
+	/*
+	 * Iterate through all AVC's tied to the old folio, looking for the
+	 * redundant one pointing at the new VMA.
+	 */
+	anon_vma_interval_tree_foreach(avc, &parent_anon_vma->rb_root,
+				       pgoff_start, pgoff_end) {
+		if (avc->vma != new_vma)
+			continue;
 
-			list_del(&avc->same_vma);
-			anon_vma_chain_free(avc);
-			break;
-		}
+		/* Remove the unneeded avc. */
+		anon_vma_interval_tree_remove(avc, &parent_anon_vma->rb_root);
+		list_del(&avc->same_vma);
+		anon_vma_chain_free(avc);
+
+		removed = true;
+		break;
 	}
-	anon_vma_unlock_write(anon_vma);
+
+	if (!removed)
+		goto unlock;
+
+	/*
+	 * Removing an avc implies that the associated avc MAY no longer need
+	 * to point to its parent, and we need to reparent it.
+	 */
+
+	/*
+	 * If somehow we aren't already the child of the parent anon_vma, we
+	 * have nothing to do here.
+	 */
+	if (child_anon_vma->parent != parent_anon_vma)
+		goto unlock;
+
+	/* OK, we abandon our parent, and reparent to ourselves. */
+
+	parent_anon_vma->num_children--;
+
+	child_anon_vma->parent = child_anon_vma;
+	child_anon_vma->num_children++;
+
+	/*
+	 * Here we should probably reset the anon_vma->root, as per
+	 * anon_vma_ctor() but this feels icky and horrible. Bit weird to share
+	 * a lock with the old parent's root.
+	 */
+
+unlock:
+	anon_vma_unlock_write(parent_anon_vma);
 }
 
 static __always_inline void __folio_remove_rmap(struct folio *folio,
Mika Penttilä Aug. 28, 2024, 3:51 a.m. UTC | #5
Hi,

On 8/28/24 04:14, zhiguojiang wrote:
>
>
> 在 2024/8/28 1:35, David Hildenbrand 写道:
>> On 27.08.24 03:50, zhiguojiang wrote:
>>>
>>>
>>> 在 2024/8/27 1:24, David Hildenbrand 写道:
>>>> On 23.08.24 16:01, Zhiguo Jiang wrote:
>>>>> After CoWed by do_wp_page, the vma established a new mapping
>>>>> relationship
>>>>> with the CoWed folio instead of the non-CoWed folio. However,
>>>>> regarding
>>>>> the situation where vma->anon_vma and the non-CoWed folio's
>>>>> anon_vma are
>>>>> not same, the avc binding relationship between them will no longer be
>>>>> needed, so it is issue for the avc binding relationship still
>>>>> existing
>>>>> between them.
>>>>>
>>>>> This patch will remove the avc binding relationship between vma
>>>>> and the
>>>>> non-CoWed folio's anon_vma, which each has their own independent
>>>>> anon_vma. It can also alleviates rmap overhead simultaneously.
>>>>>
>>>>> Signed-off-by: Zhiguo Jiang <justinjiang@vivo.com>
>>>>> ---
>>>>> -v2:
>>>>>    * Solve the kernel test robot noticed "WARNING"
>>>>>      Reported-by: kernel test robot <oliver.sang@intel.com>
>>>>>      Closes:
>>>>> https://lore.kernel.org/oe-lkp/202408230938.43f55b4-lkp@intel.com
>>>>>    * Update comments to more accurately describe this patch.
>>>>>
>>>>> -v1:
>>>>> https://lore.kernel.org/linux-mm/20240820143359.199-1-justinjiang@vivo.com/
>>>>>
>>>>>
>>>>>    include/linux/rmap.h |  1 +
>>>>>    mm/memory.c          |  8 +++++++
>>>>>    mm/rmap.c            | 53
>>>>> ++++++++++++++++++++++++++++++++++++++++++++
>>>>>    3 files changed, 62 insertions(+)
>>>>>
>>>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>>>> index 91b5935e8485..8607d28a3146
>>>>> --- a/include/linux/rmap.h
>>>>> +++ b/include/linux/rmap.h
>>>>> @@ -257,6 +257,7 @@ void folio_remove_rmap_ptes(struct folio *,
>>>>> struct page *, int nr_pages,
>>>>>        folio_remove_rmap_ptes(folio, page, 1, vma)
>>>>>    void folio_remove_rmap_pmd(struct folio *, struct page *,
>>>>>            struct vm_area_struct *);
>>>>> +void folio_remove_anon_avc(struct folio *, struct vm_area_struct *);
>>>>>      void hugetlb_add_anon_rmap(struct folio *, struct
>>>>> vm_area_struct *,
>>>>>            unsigned long address, rmap_t flags);
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index 93c0c25433d0..4c89cb1cb73e
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -3428,6 +3428,14 @@ static vm_fault_t wp_page_copy(struct vm_fault
>>>>> *vmf)
>>>>>                 * old page will be flushed before it can be reused.
>>>>>                 */
>>>>>                folio_remove_rmap_pte(old_folio, vmf->page, vma);
>>>>> +
>>>>> +            /*
>>>>> +             * If the new_folio's anon_vma is different from the
>>>>> +             * old_folio's anon_vma, the avc binding relationship
>>>>> +             * between vma and the old_folio's anon_vma is removed,
>>>>> +             * avoiding rmap redundant overhead.
>>>>> +             */
>>>>> +            folio_remove_anon_avc(old_folio, vma);
>>>>
>>>> ... by increasing write fault latency, introducing an RMAP walk
>>>> (!)? Hmm?
>>>>
>>>> On the reuse path, we do a folio_move_anon_rmap(), to optimize that.
>>>>
>>> Thanks for your comments. This may not be a good fixup patch. The
>>> resue patch folio_move_anon_rmap() seems to be exclusive or
>>> _refcount = 1 folios. The fork() path seems to clear exclusive flag
>>> in copy_page_range() --> ... --> __folio_try_dup_anon_rmap(). However,
>>> I observed lots of orphan avcs by the above debug trace logs in
>>> wp_page_copy(). But they may be not removed by discussing with Mika.
>>
>> Was this patch ever tested? I cannot even boot a simple VM without an
>> endless stream of
>>
>> [    5.804598] ------------[ cut here ]------------
>> [    5.805494] WARNING: CPU: 11 PID: 595 at mm/rmap.c:443
>> unlink_anon_vmas+0x19b/0x1d0
>> [    5.806962] Modules linked in: qemu_fw_cfg
>> [    5.807762] CPU: 11 UID: 0 PID: 595 Comm: dracut-rootfs-g Tainted:
>> G        W          6.11.0-rc4+ #72
>> [    5.809546] Tainted: [W]=WARN
>> [    5.810127] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
>> BIOS 1.16.3-2.fc40 04/01/2014
>> [    5.811753] RIP: 0010:unlink_anon_vmas+0x19b/0x1d0
>> [    5.812680] Code: b0 00 00 00 00 75 1f f0 ff 8f a0 00 00 00 75 a2
>> e8 8a fd ff ff eb 9b 5b 5d 41 5c 41 5d 41 5e 41 5f e9 d4 82 d0 00 0f
>> 0b eb dd <0f> 0b eb cf 0f 0b 48 83 c7 08 e8 16 40 d7 ff e9 ea fe ff
>> ff 48 8b
>> [    5.816247] RSP: 0018:ffffa19f43bb78d0 EFLAGS: 00010286
>> [    5.817258] RAX: ffff8a71c1bdd2d0 RBX: ffff8a71c1bdd2c0 RCX:
>> ffff8a71c27a86c8
>> [    5.818624] RDX: 0000000000000001 RSI: ffff8a71c2771b28 RDI:
>> ffff8a71c27a9e60
>> [    5.820011] RBP: dead000000000122 R08: 0000000000000000 R09:
>> 0000000000000001
>> [    5.821380] R10: 0000000000000200 R11: 0000000000000001 R12:
>> ffff8a71c2771b28
>> [    5.822748] R13: dead000000000100 R14: ffff8a71c2771b18 R15:
>> ffff8a71c27a9e60
>> [    5.824122] FS:  0000000000000000(0000) GS:ffff8a7337980000(0000)
>> knlGS:0000000000000000
>> [    5.825665] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [    5.826775] CR2: 00007fca7f70ac58 CR3: 00000001027b2004 CR4:
>> 0000000000770ef0
>> [    5.828146] PKRU: 55555554
>> [    5.828686] Call Trace:
>> [    5.829169]  <TASK>
>> [    5.829594]  ? __warn.cold+0xb1/0x13e
>> [    5.830312]  ? unlink_anon_vmas+0x19b/0x1d0
>> [    5.831118]  ? report_bug+0xff/0x140
>> [    5.831840]  ? handle_bug+0x3c/0x80
>> [    5.832524]  ? exc_invalid_op+0x17/0x70
>> [    5.833262]  ? asm_exc_invalid_op+0x1a/0x20
>> [    5.834086]  ? unlink_anon_vmas+0x19b/0x1d0
>> [    5.834908]  free_pgtables+0x130/0x290
>> [    5.835661]  exit_mmap+0x19a/0x460
>> [    5.836351]  __mmput+0x4b/0x120
>> [    5.836965]  do_exit+0x2e1/0xac0
>> [    5.837601]  ? lock_release+0xd5/0x2c0
>> [    5.838343]  do_group_exit+0x36/0xa0
>> [    5.839035]  __x64_sys_exit_group+0x18/0x20
>> [    5.839866]  x64_sys_call+0x14b4/0x14c0
> Arm64 machine tested it and no crashes detected. You may try the
> attachment modifition provided by Lorenzo Stoakes. Can you please
> check if there are any opportunities for further improvement?


This patch is still wrong afaics in the main logic, you can not remove
the avc because the non cowed folios of child are not reached then.


>>
>>
>> Andrew, please remove this from mm-unstable.
>
> Thanks
> Zhiguo


Thanks,

Mika
diff mbox series

Patch

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 91b5935e8485..8607d28a3146
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -257,6 +257,7 @@  void folio_remove_rmap_ptes(struct folio *, struct page *, int nr_pages,
 	folio_remove_rmap_ptes(folio, page, 1, vma)
 void folio_remove_rmap_pmd(struct folio *, struct page *,
 		struct vm_area_struct *);
+void folio_remove_anon_avc(struct folio *, struct vm_area_struct *);
 
 void hugetlb_add_anon_rmap(struct folio *, struct vm_area_struct *,
 		unsigned long address, rmap_t flags);
diff --git a/mm/memory.c b/mm/memory.c
index 93c0c25433d0..4c89cb1cb73e
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3428,6 +3428,14 @@  static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 			 * old page will be flushed before it can be reused.
 			 */
 			folio_remove_rmap_pte(old_folio, vmf->page, vma);
+
+			/*
+			 * If the new_folio's anon_vma is different from the
+			 * old_folio's anon_vma, the avc binding relationship
+			 * between vma and the old_folio's anon_vma is removed,
+			 * avoiding rmap redundant overhead.
+			 */
+			folio_remove_anon_avc(old_folio, vma);
 		}
 
 		/* Free the old page.. */
diff --git a/mm/rmap.c b/mm/rmap.c
index 1103a536e474..4c339c571ea6
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1522,6 +1522,59 @@  void folio_add_file_rmap_pmd(struct folio *folio, struct page *page,
 #endif
 }
 
+/**
+ * folio_remove_anon_avc - remove the avc binding relationship between
+ * folio and vma with different anon_vmas.
+ * @folio:	The folio with anon_vma to remove the binded avc from
+ * @vma:	The vm area to remove the binded avc with folio's anon_vma
+ *
+ * The caller is currently used for CoWed scene.
+ */
+void folio_remove_anon_avc(struct folio *folio,
+		struct vm_area_struct *vma)
+{
+	struct anon_vma *anon_vma = folio_anon_vma(folio);
+	pgoff_t pgoff_start, pgoff_end;
+	struct anon_vma_chain *avc;
+
+	/*
+	 * Ensure that the vma's anon_vma and the folio's
+	 * anon_vma exist and are not same.
+	 */
+	if (!folio_test_anon(folio) || unlikely(!anon_vma) ||
+	    anon_vma == vma->anon_vma)
+		return;
+
+	pgoff_start = folio_pgoff(folio);
+	pgoff_end = pgoff_start + folio_nr_pages(folio) - 1;
+
+	if (!anon_vma_trylock_write(anon_vma))
+		return;
+
+	anon_vma_interval_tree_foreach(avc, &anon_vma->rb_root,
+			pgoff_start, pgoff_end) {
+		/*
+		 * Find the avc associated with vma from the folio's
+		 * anon_vma and remove it.
+		 */
+		if (avc->vma == vma) {
+			anon_vma_interval_tree_remove(avc, &anon_vma->rb_root);
+			/*
+			 * When removing the avc with anon_vma that is
+			 * different from the parent anon_vma from parent
+			 * anon_vma->rb_root, the parent->num_children
+			 * count value is needed to reduce one.
+			 */
+			anon_vma->parent->num_children--;
+
+			list_del(&avc->same_vma);
+			anon_vma_chain_free(avc);
+			break;
+		}
+	}
+	anon_vma_unlock_write(anon_vma);
+}
+
 static __always_inline void __folio_remove_rmap(struct folio *folio,
 		struct page *page, int nr_pages, struct vm_area_struct *vma,
 		enum rmap_level level)