Message ID | 20221025014215.3466904-1-mawupeng1@huawei.com (mailing list archive)
---|---
State | New
Series | [-next,1/1] mm: hugetlb_vmemmap: Fix WARN_ON in vmemmap_remap_pte
> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
>
> From: Ma Wupeng <mawupeng1@huawei.com>
>
> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
> each HugeTLB page") added vmemmap_remap_pte to remap the tail pages as
> read-only to catch illegal write operations to the tail page.
>
> However, this leads to a WARN_ON on arm64 in __check_racy_pte_update(),

Thanks for finding this issue.

> since the remapping may clear the dirty state. This check was introduced by
> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
> access and dirty pte bits"), and the initial check was:
>
> BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
>
> Since we do need to mark this pte as read-only to catch illegal write
> operations to the tail pages, use set_pte instead of set_pte_at to bypass
> this check.

In theory, the warning does not affect anything, since the tail vmemmap
pages are supposed to be read-only. So skipping this check for vmemmap
pages seems feasible. But I am not sure it is appropriate to use set_pte
here instead of set_pte_at: I didn't see any users of set_pte in common
code outside of arch/xxx. And since this issue is specific to arm64, I
suggest fixing it in __check_racy_pte_update() itself.

Something like (just some thoughts of mine):

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index b5df82aa99e6..df7716965a93 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -292,7 +292,8 @@ extern void __sync_icache_dcache(pte_t pteval);
  *   PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY)
  */

-static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
+static inline void __check_racy_pte_update(struct mm_struct *mm,
+                                           unsigned long addr, pte_t *ptep,
                                            pte_t pte)
 {
        pte_t old_pte;
@@ -307,6 +308,10 @@ static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
        if (mm != current->active_mm && atomic_read(&mm->mm_users) <= 1)
                return;

+       if (IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP) &&
+           addr >= VMEMMAP_START && addr <= VMEMMAP_END)
+               return;
+
        /*
         * Check for potential race with hardware updates of the pte
         * (ptep_set_access_flags safely changes valid ptes without going

>
> The following shell commands can be used to reproduce this WARN_ON on
> 6.1.0-rc1:
>
> echo 1 > /proc/sys/vm/hugetlb_optimize_vmemmap
> cat /proc/sys/vm/hugetlb_optimize_vmemmap
>
> echo 1024 > /proc/sys/vm/nr_overcommit_hugepages
> mkdir -p /root/hugetlb
> mount none /root/hugetlb -t hugetlbfs
> fallocate -l 2g /root/hugetlb/xx &
>
> Here is the detailed WARN_ON log:
>
> ------------[ cut here ]------------
> __check_racy_pte_update: racy dirty state clearing: 0x0068000416899f03 -> 0x0060000416898f83
> WARNING: CPU: 3 PID: 394 at arch/arm64/include/asm/pgtable.h:318 vmemmap_remap_pte+0x118/0x120
> Modules linked in:
> CPU: 3 PID: 394 Comm: fallocate Not tainted 6.1.0-rc1 #224
> Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
> Call trace:
>  vmemmap_remap_pte+0x118/0x120
>  vmemmap_remap_range+0x30c/0x560
>  hugetlb_vmemmap_optimize+0x158/0x408
>  __prep_new_huge_page+0x24/0x150
>  prep_new_huge_page+0x30/0x60
>  alloc_fresh_huge_page+0x1c4/0x1e0
>  alloc_surplus_huge_page+0xa0/0x168
>  alloc_huge_page+0x264/0x5b8
>  hugetlbfs_fallocate+0x294/0x680
>  vfs_fallocate+0x12c/0x568
>  ksys_fallocate+0x50/0xa0
>  __arm64_sys_fallocate+0x28/0x38
>  invoke_syscall+0x4c/0x110
>  el0_svc_common.constprop.0+0x68/0x128
>  do_el0_svc+0x34/0xd0
>  el0_svc+0x48/0xb8
>  el0t_64_sync_handler+0xb8/0xc0
>  el0t_64_sync+0x18c/0x190
> ---[ end trace 0000000000000000 ]---
>
> Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")

Actually, this commit does not pose the issue for arm64. I think the correct
commit to be fixed is 1e63ac088f20f7a4425c430c31ecd3cf167fb3f2.

Thanks.

> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
> ---
> mm/hugetlb_vmemmap.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index ba2a2596fb4e..cb056265c31e 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -249,7 +249,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>  	struct page *page = pte_page(*pte);
>
>  	list_add_tail(&page->lru, walk->vmemmap_pages);
> -	set_pte_at(&init_mm, addr, pte, entry);
> +	set_pte(pte, entry);
>  }
>
>  /*
> --
> 2.25.1
On 2022/10/25 14:36, Muchun Song wrote:
>
>
>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
>>
>> From: Ma Wupeng <mawupeng1@huawei.com>
>>
>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
>> each HugeTLB page") added vmemmap_remap_pte to remap the tail pages as
>> read-only to catch illegal write operations to the tail page.
>>
>> However, this leads to a WARN_ON on arm64 in __check_racy_pte_update(),
>
> Thanks for finding this issue.
>
>> since the remapping may clear the dirty state. This check was introduced by
>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
>> access and dirty pte bits"), and the initial check was:
>>
>> BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
>>
>> Since we do need to mark this pte as read-only to catch illegal write
>> operations to the tail pages, use set_pte instead of set_pte_at to bypass
>> this check.
>
> In theory, the warning does not affect anything, since the tail vmemmap
> pages are supposed to be read-only. So skipping this check for vmemmap
> pages seems feasible. But I am not sure it is appropriate to use set_pte
> here instead of set_pte_at: I didn't see any users of set_pte in common
> code outside of arch/xxx. And since this issue is specific to arm64, I
> suggest fixing it in __check_racy_pte_update() itself.
>
> Something like (just some thoughts of mine):
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index b5df82aa99e6..df7716965a93 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -292,7 +292,8 @@ extern void __sync_icache_dcache(pte_t pteval);
>   *   PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY)
>   */
>
> -static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
> +static inline void __check_racy_pte_update(struct mm_struct *mm,
> +                                           unsigned long addr, pte_t *ptep,
>                                             pte_t pte)
>  {
>         pte_t old_pte;
> @@ -307,6 +308,10 @@ static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
>         if (mm != current->active_mm && atomic_read(&mm->mm_users) <= 1)
>                 return;
>
> +       if (IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP) &&
> +           addr >= VMEMMAP_START && addr <= VMEMMAP_END)
> +               return;
> +
>         /*
>          * Check for potential race with hardware updates of the pte
>          * (ptep_set_access_flags safely changes valid ptes without going
>
>>
>> The following shell commands can be used to reproduce this WARN_ON on
>> 6.1.0-rc1:
>>
>> echo 1 > /proc/sys/vm/hugetlb_optimize_vmemmap
>> cat /proc/sys/vm/hugetlb_optimize_vmemmap
>>
>> echo 1024 > /proc/sys/vm/nr_overcommit_hugepages
>> mkdir -p /root/hugetlb
>> mount none /root/hugetlb -t hugetlbfs
>> fallocate -l 2g /root/hugetlb/xx &
>>
>> Here is the detailed WARN_ON log:
>>
>> ------------[ cut here ]------------
>> __check_racy_pte_update: racy dirty state clearing: 0x0068000416899f03 -> 0x0060000416898f83
>> WARNING: CPU: 3 PID: 394 at arch/arm64/include/asm/pgtable.h:318 vmemmap_remap_pte+0x118/0x120
>> Modules linked in:
>> CPU: 3 PID: 394 Comm: fallocate Not tainted 6.1.0-rc1 #224
>> Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
>> Call trace:
>>  vmemmap_remap_pte+0x118/0x120
>>  vmemmap_remap_range+0x30c/0x560
>>  hugetlb_vmemmap_optimize+0x158/0x408
>>  __prep_new_huge_page+0x24/0x150
>>  prep_new_huge_page+0x30/0x60
>>  alloc_fresh_huge_page+0x1c4/0x1e0
>>  alloc_surplus_huge_page+0xa0/0x168
>>  alloc_huge_page+0x264/0x5b8
>>  hugetlbfs_fallocate+0x294/0x680
>>  vfs_fallocate+0x12c/0x568
>>  ksys_fallocate+0x50/0xa0
>>  __arm64_sys_fallocate+0x28/0x38
>>  invoke_syscall+0x4c/0x110
>>  el0_svc_common.constprop.0+0x68/0x128
>>  do_el0_svc+0x34/0xd0
>>  el0_svc+0x48/0xb8
>>  el0t_64_sync_handler+0xb8/0xc0
>>  el0t_64_sync+0x18c/0x190
>> ---[ end trace 0000000000000000 ]---
>>
>> Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
>
> Actually, this commit does not pose the issue for arm64. I think the correct
> commit to be fixed is 1e63ac088f20f7a4425c430c31ecd3cf167fb3f2.
>
> Thanks.

Thanks for pointing it out. I have tested it, and the culprit is indeed
commit 1e63ac088f20f7a4425c430c31ecd3cf167fb3f2. This VM_WARN_ONCE is
produced once HUGETLB_PAGE_FREE_VMEMMAP is enabled on arm64. I will update
my changelog and fix this in the next patch.

>
>> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
>> ---
>> mm/hugetlb_vmemmap.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>> index ba2a2596fb4e..cb056265c31e 100644
>> --- a/mm/hugetlb_vmemmap.c
>> +++ b/mm/hugetlb_vmemmap.c
>> @@ -249,7 +249,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>>  	struct page *page = pte_page(*pte);
>>
>>  	list_add_tail(&page->lru, walk->vmemmap_pages);
>> -	set_pte_at(&init_mm, addr, pte, entry);
>> +	set_pte(pte, entry);
>>  }
>>
>>  /*
>> --
>> 2.25.1
On 10/25/22 12:06, Muchun Song wrote:
>
>
>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
>>
>> From: Ma Wupeng <mawupeng1@huawei.com>
>>
>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
>> each HugeTLB page") added vmemmap_remap_pte to remap the tail pages as
>> read-only to catch illegal write operations to the tail page.
>>
>> However, this leads to a WARN_ON on arm64 in __check_racy_pte_update(),
>
> Thanks for finding this issue.
>
>> since the remapping may clear the dirty state. This check was introduced by
>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
>> access and dirty pte bits"), and the initial check was:
>>
>> BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
>>
>> Since we do need to mark this pte as read-only to catch illegal write
>> operations to the tail pages, use set_pte instead of set_pte_at to bypass
>> this check.
>
> In theory, the warning does not affect anything, since the tail vmemmap
> pages are supposed to be read-only. So skipping this check for vmemmap

Tail vmemmap pages are supposed to be read-only in practice, but their
backing ptes do have pte_write() enabled. Otherwise the VM_WARN_ONCE()
warning would not have triggered.

VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
             "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
             __func__, pte_val(old_pte), pte_val(pte));

Also, is it not true that the pte is being remapped, as read-only, to a
different page than the one it originally mapped (which will be freed up),
i.e. the PFN in 'old_pte' and 'pte' will be different? Hence is there still
a possibility for a race condition even when the PFN changes?

> pages seems feasible. But I am not sure it is appropriate to use set_pte
> here instead of set_pte_at: I didn't see any users of set_pte in common
> code outside of arch/xxx. And since this issue is specific to arm64, I
> suggest fixing it in __check_racy_pte_update() itself.

Right, we should not change it to the yet lower-level platform helper
set_pte() just to work around this warning. Instead,
__check_racy_pte_update() is the right place where it should be fixed.

>
> Something like (just some thoughts of mine):
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index b5df82aa99e6..df7716965a93 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -292,7 +292,8 @@ extern void __sync_icache_dcache(pte_t pteval);
>   *   PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY)
>   */
>
> -static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
> +static inline void __check_racy_pte_update(struct mm_struct *mm,
> +                                           unsigned long addr, pte_t *ptep,
>                                             pte_t pte)
>  {
>         pte_t old_pte;
> @@ -307,6 +308,10 @@ static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
>         if (mm != current->active_mm && atomic_read(&mm->mm_users) <= 1)
>                 return;
>
> +       if (IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP) &&
> +           addr >= VMEMMAP_START && addr <= VMEMMAP_END)
> +               return;
> +
>         /*
>          * Check for potential race with hardware updates of the pte
>          * (ptep_set_access_flags safely changes valid ptes without going
>
>>
>> The following shell commands can be used to reproduce this WARN_ON on
>> 6.1.0-rc1:
>>
>> echo 1 > /proc/sys/vm/hugetlb_optimize_vmemmap
>> cat /proc/sys/vm/hugetlb_optimize_vmemmap
>>
>> echo 1024 > /proc/sys/vm/nr_overcommit_hugepages
>> mkdir -p /root/hugetlb
>> mount none /root/hugetlb -t hugetlbfs
>> fallocate -l 2g /root/hugetlb/xx &
>>
>> Here is the detailed WARN_ON log:
>>
>> ------------[ cut here ]------------
>> __check_racy_pte_update: racy dirty state clearing: 0x0068000416899f03 -> 0x0060000416898f83
>> WARNING: CPU: 3 PID: 394 at arch/arm64/include/asm/pgtable.h:318 vmemmap_remap_pte+0x118/0x120
>> Modules linked in:
>> CPU: 3 PID: 394 Comm: fallocate Not tainted 6.1.0-rc1 #224
>> Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
>> Call trace:
>>  vmemmap_remap_pte+0x118/0x120
>>  vmemmap_remap_range+0x30c/0x560
>>  hugetlb_vmemmap_optimize+0x158/0x408
>>  __prep_new_huge_page+0x24/0x150
>>  prep_new_huge_page+0x30/0x60
>>  alloc_fresh_huge_page+0x1c4/0x1e0
>>  alloc_surplus_huge_page+0xa0/0x168
>>  alloc_huge_page+0x264/0x5b8
>>  hugetlbfs_fallocate+0x294/0x680
>>  vfs_fallocate+0x12c/0x568
>>  ksys_fallocate+0x50/0xa0
>>  __arm64_sys_fallocate+0x28/0x38
>>  invoke_syscall+0x4c/0x110
>>  el0_svc_common.constprop.0+0x68/0x128
>>  do_el0_svc+0x34/0xd0
>>  el0_svc+0x48/0xb8
>>  el0t_64_sync_handler+0xb8/0xc0
>>  el0t_64_sync+0x18c/0x190
>> ---[ end trace 0000000000000000 ]---
>>
>> Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
>
> Actually, this commit does not pose the issue for arm64. I think the correct
> commit to be fixed is 1e63ac088f20f7a4425c430c31ecd3cf167fb3f2.
>
> Thanks.
>
>> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
>> ---
>> mm/hugetlb_vmemmap.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>> index ba2a2596fb4e..cb056265c31e 100644
>> --- a/mm/hugetlb_vmemmap.c
>> +++ b/mm/hugetlb_vmemmap.c
>> @@ -249,7 +249,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>>  	struct page *page = pte_page(*pte);
>>
>>  	list_add_tail(&page->lru, walk->vmemmap_pages);
>> -	set_pte_at(&init_mm, addr, pte, entry);
>> +	set_pte(pte, entry);
>>  }
>>
>>  /*
>> --
>> 2.25.1
> On Oct 26, 2022, at 13:06, Anshuman Khandual <anshuman.khandual@arm.com> wrote:
>
>
>
> On 10/25/22 12:06, Muchun Song wrote:
>>
>>
>>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
>>>
>>> From: Ma Wupeng <mawupeng1@huawei.com>
>>>
>>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
>>> each HugeTLB page") added vmemmap_remap_pte to remap the tail pages as
>>> read-only to catch illegal write operations to the tail page.
>>>
>>> However, this leads to a WARN_ON on arm64 in __check_racy_pte_update(),
>>
>> Thanks for finding this issue.
>>
>>> since the remapping may clear the dirty state. This check was introduced by
>>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
>>> access and dirty pte bits"), and the initial check was:
>>>
>>> BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
>>>
>>> Since we do need to mark this pte as read-only to catch illegal write
>>> operations to the tail pages, use set_pte instead of set_pte_at to bypass
>>> this check.
>>
>> In theory, the warning does not affect anything, since the tail vmemmap
>> pages are supposed to be read-only. So skipping this check for vmemmap
>
> Tail vmemmap pages are supposed to be read-only in practice, but their
> backing ptes do have pte_write() enabled. Otherwise the VM_WARN_ONCE()
> warning would not have triggered.

Right.

>
> VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
>              "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
>              __func__, pte_val(old_pte), pte_val(pte));
>
> Also, is it not true that the pte is being remapped, as read-only, to a
> different page than the one it originally mapped (which will be freed up),
> i.e. the PFN in 'old_pte' and 'pte' will be different? Hence is there still

Right.

> a possibility for a race condition even when the PFN changes?

Sorry, I didn't get this question. Did you mean the PTE is changed from the
new one (pte) back to the old one (old_pte) by the hardware because of a
dirty-bit update when a concurrent write hits the tail vmemmap page?

Thanks,
Muchun

>
>> pages seems feasible. But I am not sure it is appropriate to use set_pte
>> here instead of set_pte_at: I didn't see any users of set_pte in common
>> code outside of arch/xxx. And since this issue is specific to arm64, I
>> suggest fixing it in __check_racy_pte_update() itself.
>
> Right, we should not change it to the yet lower-level platform helper
> set_pte() just to work around this warning. Instead,
> __check_racy_pte_update() is the right place where it should be fixed.
>
>>
>> Something like (just some thoughts of mine):
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index b5df82aa99e6..df7716965a93 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -292,7 +292,8 @@ extern void __sync_icache_dcache(pte_t pteval);
>>   *   PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY)
>>   */
>>
>> -static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
>> +static inline void __check_racy_pte_update(struct mm_struct *mm,
>> +                                           unsigned long addr, pte_t *ptep,
>>                                             pte_t pte)
>>  {
>>         pte_t old_pte;
>> @@ -307,6 +308,10 @@ static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
>>         if (mm != current->active_mm && atomic_read(&mm->mm_users) <= 1)
>>                 return;
>>
>> +       if (IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP) &&
>> +           addr >= VMEMMAP_START && addr <= VMEMMAP_END)
>> +               return;
>> +
>>         /*
>>          * Check for potential race with hardware updates of the pte
>>          * (ptep_set_access_flags safely changes valid ptes without going
>>
>>>
>>> The following shell commands can be used to reproduce this WARN_ON on
>>> 6.1.0-rc1:
>>>
>>> echo 1 > /proc/sys/vm/hugetlb_optimize_vmemmap
>>> cat /proc/sys/vm/hugetlb_optimize_vmemmap
>>>
>>> echo 1024 > /proc/sys/vm/nr_overcommit_hugepages
>>> mkdir -p /root/hugetlb
>>> mount none /root/hugetlb -t hugetlbfs
>>> fallocate -l 2g /root/hugetlb/xx &
>>>
>>> Here is the detailed WARN_ON log:
>>>
>>> ------------[ cut here ]------------
>>> __check_racy_pte_update: racy dirty state clearing: 0x0068000416899f03 -> 0x0060000416898f83
>>> WARNING: CPU: 3 PID: 394 at arch/arm64/include/asm/pgtable.h:318 vmemmap_remap_pte+0x118/0x120
>>> Modules linked in:
>>> CPU: 3 PID: 394 Comm: fallocate Not tainted 6.1.0-rc1 #224
>>> Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
>>> Call trace:
>>>  vmemmap_remap_pte+0x118/0x120
>>>  vmemmap_remap_range+0x30c/0x560
>>>  hugetlb_vmemmap_optimize+0x158/0x408
>>>  __prep_new_huge_page+0x24/0x150
>>>  prep_new_huge_page+0x30/0x60
>>>  alloc_fresh_huge_page+0x1c4/0x1e0
>>>  alloc_surplus_huge_page+0xa0/0x168
>>>  alloc_huge_page+0x264/0x5b8
>>>  hugetlbfs_fallocate+0x294/0x680
>>>  vfs_fallocate+0x12c/0x568
>>>  ksys_fallocate+0x50/0xa0
>>>  __arm64_sys_fallocate+0x28/0x38
>>>  invoke_syscall+0x4c/0x110
>>>  el0_svc_common.constprop.0+0x68/0x128
>>>  do_el0_svc+0x34/0xd0
>>>  el0_svc+0x48/0xb8
>>>  el0t_64_sync_handler+0xb8/0xc0
>>>  el0t_64_sync+0x18c/0x190
>>> ---[ end trace 0000000000000000 ]---
>>>
>>> Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
>>
>> Actually, this commit does not pose the issue for arm64. I think the correct
>> commit to be fixed is 1e63ac088f20f7a4425c430c31ecd3cf167fb3f2.
>>
>> Thanks.
>>
>>> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
>>> ---
>>> mm/hugetlb_vmemmap.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>> index ba2a2596fb4e..cb056265c31e 100644
>>> --- a/mm/hugetlb_vmemmap.c
>>> +++ b/mm/hugetlb_vmemmap.c
>>> @@ -249,7 +249,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>>>  	struct page *page = pte_page(*pte);
>>>
>>>  	list_add_tail(&page->lru, walk->vmemmap_pages);
>>> -	set_pte_at(&init_mm, addr, pte, entry);
>>> +	set_pte(pte, entry);
>>>  }
>>>
>>>  /*
>>> --
>>> 2.25.1
On 10/26/22 12:31, Muchun Song wrote:
>
>
>> On Oct 26, 2022, at 13:06, Anshuman Khandual <anshuman.khandual@arm.com> wrote:
>>
>>
>>
>> On 10/25/22 12:06, Muchun Song wrote:
>>>
>>>
>>>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
>>>>
>>>> From: Ma Wupeng <mawupeng1@huawei.com>
>>>>
>>>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
>>>> each HugeTLB page") added vmemmap_remap_pte to remap the tail pages as
>>>> read-only to catch illegal write operations to the tail page.
>>>>
>>>> However, this leads to a WARN_ON on arm64 in __check_racy_pte_update(),
>>>
>>> Thanks for finding this issue.
>>>
>>>> since the remapping may clear the dirty state. This check was introduced by
>>>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
>>>> access and dirty pte bits"), and the initial check was:
>>>>
>>>> BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
>>>>
>>>> Since we do need to mark this pte as read-only to catch illegal write
>>>> operations to the tail pages, use set_pte instead of set_pte_at to bypass
>>>> this check.
>>>
>>> In theory, the warning does not affect anything, since the tail vmemmap
>>> pages are supposed to be read-only. So skipping this check for vmemmap
>>
>> Tail vmemmap pages are supposed to be read-only in practice, but their
>> backing ptes do have pte_write() enabled. Otherwise the VM_WARN_ONCE()
>> warning would not have triggered.
>
> Right.
>
>>
>> VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
>>              "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
>>              __func__, pte_val(old_pte), pte_val(pte));
>>
>> Also, is it not true that the pte is being remapped, as read-only, to a
>> different page than the one it originally mapped (which will be freed up),
>> i.e. the PFN in 'old_pte' and 'pte' will be different? Hence is there still
>
> Right.
>
>> a possibility for a race condition even when the PFN changes?
>
> Sorry, I didn't get this question. Did you mean the PTE is changed from the
> new one (pte) back to the old one (old_pte) by the hardware because of a
> dirty-bit update when a concurrent write hits the tail vmemmap page?

No, but does not vmemmap_remap_pte() reuse walk->reuse_page for all remaining
tail pages? Is there not a PFN change, along with an access permission change,
involved in this remapping process?
> On Oct 26, 2022, at 16:36, Anshuman Khandual <anshuman.khandual@arm.com> wrote:
>
>
>
> On 10/26/22 12:31, Muchun Song wrote:
>>
>>
>>> On Oct 26, 2022, at 13:06, Anshuman Khandual <anshuman.khandual@arm.com> wrote:
>>>
>>>
>>>
>>> On 10/25/22 12:06, Muchun Song wrote:
>>>>
>>>>
>>>>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
>>>>>
>>>>> From: Ma Wupeng <mawupeng1@huawei.com>
>>>>>
>>>>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
>>>>> each HugeTLB page") added vmemmap_remap_pte to remap the tail pages as
>>>>> read-only to catch illegal write operations to the tail page.
>>>>>
>>>>> However, this leads to a WARN_ON on arm64 in __check_racy_pte_update(),
>>>>
>>>> Thanks for finding this issue.
>>>>
>>>>> since the remapping may clear the dirty state. This check was introduced by
>>>>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
>>>>> access and dirty pte bits"), and the initial check was:
>>>>>
>>>>> BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
>>>>>
>>>>> Since we do need to mark this pte as read-only to catch illegal write
>>>>> operations to the tail pages, use set_pte instead of set_pte_at to bypass
>>>>> this check.
>>>>
>>>> In theory, the warning does not affect anything, since the tail vmemmap
>>>> pages are supposed to be read-only. So skipping this check for vmemmap
>>>
>>> Tail vmemmap pages are supposed to be read-only in practice, but their
>>> backing ptes do have pte_write() enabled. Otherwise the VM_WARN_ONCE()
>>> warning would not have triggered.
>>
>> Right.
>>
>>>
>>> VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
>>>              "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
>>>              __func__, pte_val(old_pte), pte_val(pte));
>>>
>>> Also, is it not true that the pte is being remapped, as read-only, to a
>>> different page than the one it originally mapped (which will be freed up),
>>> i.e. the PFN in 'old_pte' and 'pte' will be different? Hence is there still
>>
>> Right.
>>
>>> a possibility for a race condition even when the PFN changes?
>>
>> Sorry, I didn't get this question. Did you mean the PTE is changed from the
>> new one (pte) back to the old one (old_pte) by the hardware because of a
>> dirty-bit update when a concurrent write hits the tail vmemmap page?
>
> No, but does not vmemmap_remap_pte() reuse walk->reuse_page for all remaining
> tail pages? Is there not a PFN change, along with an access permission change,
> involved in this remapping process?

Alright, yes, both the PFN and the access permission are changed.
On 2022/10/26 13:06, Anshuman Khandual wrote:
>
>
> On 10/25/22 12:06, Muchun Song wrote:
>>
>>
>>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
>>>
>>> From: Ma Wupeng <mawupeng1@huawei.com>
>>>
>>> Commit f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with
>>> each HugeTLB page") added vmemmap_remap_pte to remap the tail pages as
>>> read-only to catch illegal write operations to the tail page.
>>>
>>> However, this leads to a WARN_ON on arm64 in __check_racy_pte_update(),
>>
>> Thanks for finding this issue.
>>
>>> since the remapping may clear the dirty state. This check was introduced by
>>> commit 2f4b829c625e ("arm64: Add support for hardware updates of the
>>> access and dirty pte bits"), and the initial check was:
>>>
>>> BUG_ON(pte_write(*ptep) && !pte_dirty(pte));
>>>
>>> Since we do need to mark this pte as read-only to catch illegal write
>>> operations to the tail pages, use set_pte instead of set_pte_at to bypass
>>> this check.
>>
>> In theory, the warning does not affect anything, since the tail vmemmap
>> pages are supposed to be read-only. So skipping this check for vmemmap
>
> Tail vmemmap pages are supposed to be read-only in practice, but their
> backing ptes do have pte_write() enabled. Otherwise the VM_WARN_ONCE()
> warning would not have triggered.
>
> VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
>              "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
>              __func__, pte_val(old_pte), pte_val(pte));

So arm64 will trigger this WARN_ON under this condition, and the condition
is not unusual, as this scenario shows? It is true that we should change
the logic on arm64, but how do we change it so that the code stays generic?
Any thoughts?

Thanks for reviewing.

>
> Also, is it not true that the pte is being remapped, as read-only, to a
> different page than the one it originally mapped (which will be freed up),
> i.e. the PFN in 'old_pte' and 'pte' will be different? Hence is there still
> a possibility for a race condition even when the PFN changes?
>
>> pages seems feasible. But I am not sure it is appropriate to use set_pte
>> here instead of set_pte_at: I didn't see any users of set_pte in common
>> code outside of arch/xxx. And since this issue is specific to arm64, I
>> suggest fixing it in __check_racy_pte_update() itself.
>
> Right, we should not change it to the yet lower-level platform helper
> set_pte() just to work around this warning. Instead,
> __check_racy_pte_update() is the right place where it should be fixed.
>
>>
>> Something like (just some thoughts of mine):
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index b5df82aa99e6..df7716965a93 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -292,7 +292,8 @@ extern void __sync_icache_dcache(pte_t pteval);
>>   *   PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY)
>>   */
>>
>> -static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
>> +static inline void __check_racy_pte_update(struct mm_struct *mm,
>> +                                           unsigned long addr, pte_t *ptep,
>>                                             pte_t pte)
>>  {
>>         pte_t old_pte;
>> @@ -307,6 +308,10 @@ static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
>>         if (mm != current->active_mm && atomic_read(&mm->mm_users) <= 1)
>>                 return;
>>
>> +       if (IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP) &&
>> +           addr >= VMEMMAP_START && addr <= VMEMMAP_END)
>> +               return;
>> +
>>         /*
>>          * Check for potential race with hardware updates of the pte
>>          * (ptep_set_access_flags safely changes valid ptes without going
>>
>>>
>>> The following shell commands can be used to reproduce this WARN_ON on
>>> 6.1.0-rc1:
>>>
>>> echo 1 > /proc/sys/vm/hugetlb_optimize_vmemmap
>>> cat /proc/sys/vm/hugetlb_optimize_vmemmap
>>>
>>> echo 1024 > /proc/sys/vm/nr_overcommit_hugepages
>>> mkdir -p /root/hugetlb
>>> mount none /root/hugetlb -t hugetlbfs
>>> fallocate -l 2g /root/hugetlb/xx &
>>>
>>> Here is the detailed WARN_ON log:
>>>
>>> ------------[ cut here ]------------
>>> __check_racy_pte_update: racy dirty state clearing: 0x0068000416899f03 -> 0x0060000416898f83
>>> WARNING: CPU: 3 PID: 394 at arch/arm64/include/asm/pgtable.h:318 vmemmap_remap_pte+0x118/0x120
>>> Modules linked in:
>>> CPU: 3 PID: 394 Comm: fallocate Not tainted 6.1.0-rc1 #224
>>> Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
>>> Call trace:
>>>  vmemmap_remap_pte+0x118/0x120
>>>  vmemmap_remap_range+0x30c/0x560
>>>  hugetlb_vmemmap_optimize+0x158/0x408
>>>  __prep_new_huge_page+0x24/0x150
>>>  prep_new_huge_page+0x30/0x60
>>>  alloc_fresh_huge_page+0x1c4/0x1e0
>>>  alloc_surplus_huge_page+0xa0/0x168
>>>  alloc_huge_page+0x264/0x5b8
>>>  hugetlbfs_fallocate+0x294/0x680
>>>  vfs_fallocate+0x12c/0x568
>>>  ksys_fallocate+0x50/0xa0
>>>  __arm64_sys_fallocate+0x28/0x38
>>>  invoke_syscall+0x4c/0x110
>>>  el0_svc_common.constprop.0+0x68/0x128
>>>  do_el0_svc+0x34/0xd0
>>>  el0_svc+0x48/0xb8
>>>  el0t_64_sync_handler+0xb8/0xc0
>>>  el0t_64_sync+0x18c/0x190
>>> ---[ end trace 0000000000000000 ]---
>>>
>>> Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
>>
>> Actually, this commit does not pose the issue for arm64. I think the correct
>> commit to be fixed is 1e63ac088f20f7a4425c430c31ecd3cf167fb3f2.
>>
>> Thanks.
>>
>>> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
>>> ---
>>> mm/hugetlb_vmemmap.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>> index ba2a2596fb4e..cb056265c31e 100644
>>> --- a/mm/hugetlb_vmemmap.c
>>> +++ b/mm/hugetlb_vmemmap.c
>>> @@ -249,7 +249,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>>>  	struct page *page = pte_page(*pte);
>>>
>>>  	list_add_tail(&page->lru, walk->vmemmap_pages);
>>> -	set_pte_at(&init_mm, addr, pte, entry);
>>> +	set_pte(pte, entry);
>>>  }
>>>
>>>  /*
>>> --
>>> 2.25.1
On Wed, Oct 26, 2022 at 02:06:00PM +0530, Anshuman Khandual wrote:
> On 10/26/22 12:31, Muchun Song wrote:
> >> On 10/25/22 12:06, Muchun Song wrote:
> >>>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
[...]
> >>> In theory, the waring does not affect anything since the tail vmemmap
> >>> pages are supposed to be read-only. So, skipping this check for vmemmap
> >>
> >> Tails vmemmap pages are supposed to be read-only, in practice but their
> >> backing pages do have pte_write() enabled. Otherwise the VM_WARN_ONCE()
> >> warning would not have triggered.
> >
> > Right.
> >
> >> VM_WARN_ONCE(pte_write(old_pte) && !pte_dirty(pte),
> >>              "%s: racy dirty state clearing: 0x%016llx -> 0x%016llx",
> >>              __func__, pte_val(old_pte), pte_val(pte));
> >>
> >> Also, is not it true that the pte being remapped into a different page
> >> as read only, than what it had originally (which will be freed up) i.e
> >> the PFN in 'old_pte' and 'pte' will be different. Hence is there still
> >
> > Right.
> >
> >> a possibility for a race condition even when the PFN changes ?
> >
> > Sorry, I didn't get this question. Did you mean the PTE is changed from
> > new (pte) to the old one (old_pte) by the hardware because of the update
> > of dirty bit when a concurrent write operation to the tail vmemmap page?
>
> No, but is not vmemmap_remap_pte() reuses walk->reuse_page for all remaining
> tails pages ? Is not there a PFN change, along with access permission change
> involved in this remapping process ?

For the record, as we discussed offline, changing the output address
(pfn) of a pte is not safe without break-before-make if at least one of
the mappings was writeable. The caller (vmemmap_remap_pte()) would need
to be fixed to first invalidate the pte and then write the new pte. I
assume no other CPU accesses this part of the vmemmap while the pte is
being remapped.
> On Oct 27, 2022, at 18:50, Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Wed, Oct 26, 2022 at 02:06:00PM +0530, Anshuman Khandual wrote:
>> On 10/26/22 12:31, Muchun Song wrote:
[...]
>>>> a possibility for a race condition even when the PFN changes ?
>>>
>>> Sorry, I didn't get this question. Did you mean the PTE is changed from
>>> new (pte) to the old one (old_pte) by the hardware because of the update
>>> of dirty bit when a concurrent write operation to the tail vmemmap page?
>>
>> No, but is not vmemmap_remap_pte() reuses walk->reuse_page for all remaining
>> tails pages ? Is not there a PFN change, along with access permission change
>> involved in this remapping process ?
>
> For the record, as we discussed offline, changing the output address
> (pfn) of a pte is not safe without break-before-make if at least one of
> the mappings was writeable. The caller (vmemmap_remap_pte()) would need
> to be fixed to first invalidate the pte and then write the new pte. I

Hi Catalin,

Could you expose more details about what issue it will be caused? I am
not familiar with arm64.

> assume no other CPU accesses this part of the vmemmap while the pte is
> being remapped.

However, there is no guarantee that no other CPU accesses this pte.
E.g. memory failure or memory compaction, both can obtain head page
from any tail struct pages (only read) anytime.

Thanks.

>
> --
> Catalin
On Fri, Oct 28, 2022 at 10:45:09AM +0800, Muchun Song wrote:
> On Oct 27, 2022, at 18:50, Catalin Marinas <catalin.marinas@arm.com> wrote:
> > On Wed, Oct 26, 2022 at 02:06:00PM +0530, Anshuman Khandual wrote:
> >> On 10/26/22 12:31, Muchun Song wrote:
[...]
> > For the record, as we discussed offline, changing the output address
> > (pfn) of a pte is not safe without break-before-make if at least one of
> > the mappings was writeable. The caller (vmemmap_remap_pte()) would need
> > to be fixed to first invalidate the pte and then write the new pte. I
>
> Could you expose more details about what issue it will be caused? I am
> not familiar with arm64.

Well, it's not allowed by the architecture, so some CPU implementations
may do weird things like accessing incorrect memory or triggering TLB
conflict aborts if, for some reason, they end up with two entries in
the TLB for the same VA but pointing to different pfns. The hardware
expects an invalid PTE and TLB invalidation between such changes. In
practice most likely nothing happens and this works fine but we need to
stick to the architecture requirements in case some CPUs take advantage
of this requirement.

> > assume no other CPU accesses this part of the vmemmap while the pte is
> > being remapped.
>
> However, there is no guarantee that no other CPU accesses this pte.
> E.g. memory failure or memory compaction, both can obtain head page
> from any tail struct pages (only read) anytime.

Oh, so we cannot safely go through a break-before-make sequence here
(zero the PTE, flush the TLB, write the new PTE) as some CPU may access
this pte.
On 2022/10/26 11:01, mawupeng wrote:
> On 2022/10/25 14:36, Muchun Song wrote:
>>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
[...]
>> In theory, the waring does not affect anything since the tail vmemmap
>> pages are supposed to be read-only. So, skipping this check for vmemmap
>> pages seem feasible. But I am not sure whether it is general to use
>> set_pte here instead of set_pte_at, I didn’t see any users of set_pte
>> from the common code routines except the code from arch/xxx. And this
>> issue is specific for arm64, so I suggest fixing it in __check_racy_pte_update()
>> itself.
>>
>> Something like (Just some thoughts from mine):
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index b5df82aa99e6..df7716965a93 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -292,7 +292,8 @@ extern void __sync_icache_dcache(pte_t pteval);
>>   * PTE_DIRTY || (PTE_WRITE && !PTE_RDONLY)
>>   */
>>
>> -static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
>> +static inline void __check_racy_pte_update(struct mm_struct *mm,
>> +                                           unsigned long addr, pte_t *ptep,
>>                                             pte_t pte)
>>  {
>>  	pte_t old_pte;
>> @@ -307,6 +308,10 @@ static inline void __check_racy_pte_update(struct mm_struct *mm, pte_t *ptep,
>>  	if (mm != current->active_mm && atomic_read(&mm->mm_users) <= 1)
>>  		return;
>>
>> +	if (IS_ENABLED(CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP) &&
>> +	    addr >= VMEMMAP_START && addr <= VMEMMAP_END)
>> +		return;
>> +
>>  	/*
>>  	 * Check for potential race with hardware updates of the pte
>>  	 * (ptep_set_access_flags safely changes valid ptes without going

IMHO, arm64 or other archs do some work on the dirty bit and rdonly bit in
pte in commit 2f4b829c625e ("arm64: Add support for hardware updates of the
access and dirty pte bits").

Maybe we can use pte_wrprotect() to mark this pte read-only? It will add
PTE_DIRTY bit for the new pte entry compare to the old one.

Here is the diff:

diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index ba2a2596fb4e..24a230895316 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -244,8 +244,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
 	 * Remap the tail pages as read-only to catch illegal write operation
 	 * to the tail pages.
 	 */
-	pgprot_t pgprot = PAGE_KERNEL_RO;
-	pte_t entry = mk_pte(walk->reuse_page, pgprot);
+	pte_t entry = pte_wrprotect(mk_pte(walk->reuse_page, PAGE_KERNEL));
 	struct page *page = pte_page(*pte);

 	list_add_tail(&page->lru, walk->vmemmap_pages);

[...]
>>> Fixes: f41f2ed43ca5 ("mm: hugetlb: free the vmemmap pages associated with each HugeTLB page")
>>
>> Actually, this commit does not pose the issue for arm64. I think the correct commit
>> which should be fixed is 1e63ac088f20f7a4425c430c31ecd3cf167fb3f2.
>>
>> Thanks.
>
> Thanks for pointing it out.
>
> I have tested the commit, it sure is patch 1e63ac088f20f7a4425c430c31ecd3cf167fb3f2.
> This VM_WARN_ONCE will be produced if HUGETLB_PAGE_FREE_VMEMMAP for arm64 is enabled.
>
> I will change my change log and fix this in next patch.
[...]
> On Oct 28, 2022, at 23:53, Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> On Fri, Oct 28, 2022 at 10:45:09AM +0800, Muchun Song wrote:
>> On Oct 27, 2022, at 18:50, Catalin Marinas <catalin.marinas@arm.com> wrote:
>>> On Wed, Oct 26, 2022 at 02:06:00PM +0530, Anshuman Khandual wrote:
[...]
>>> For the record, as we discussed offline, changing the output address
>>> (pfn) of a pte is not safe without break-before-make if at least one of
>>> the mappings was writeable. The caller (vmemmap_remap_pte()) would need
>>> to be fixed to first invalidate the pte and then write the new pte. I
>>
>> Could you expose more details about what issue it will be caused? I am
>> not familiar with arm64.
>
> Well, it's not allowed by the architecture, so some CPU implementations
> may do weird things like accessing incorrect memory or triggering TLB
> conflict aborts if, for some reason, they end up with two entries in
> the TLB for the same VA but pointing to different pfns. The hardware
> expects an invalid PTE and TLB invalidation between such changes. In
> practice most likely nothing happens and this works fine but we need to
> stick to the architecture requirements in case some CPUs take advantage
> of this requirement.

Got it. Thanks for your nice explanation.

>>> assume no other CPU accesses this part of the vmemmap while the pte is
>>> being remapped.
>>
>> However, there is no guarantee that no other CPU accesses this pte.
>> E.g. memory failure or memory compaction, both can obtain head page
>> from any tail struct pages (only read) anytime.
>
> Oh, so we cannot safely go through a break-before-make sequence here
> (zero the PTE, flush the TLB, write the new PTE) as some CPU may access
> this pte.

Right.

Muchun

>
> --
> Catalin
On Sat, Oct 29, 2022 at 09:55:15AM +0800, mawupeng wrote:
> On 2022/10/26 11:01, mawupeng wrote:
> > On 2022/10/25 14:36, Muchun Song wrote:
> >>> On Oct 25, 2022, at 09:42, Wupeng Ma <mawupeng1@huawei.com> wrote:
[...]
> IMHO, arm64 or other archs do some work on the dirty bit and rdonly bit in
> pte in commit 2f4b829c625e ("arm64: Add support for hardware updates of the
> access and dirty pte bits").
>
> Maybe we can use pte_wrprotect() to mark this pte read-only? It will add
> PTE_DIRTY bit for the new pte entry compare to the old one.
>
> Here is the diff:
>
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index ba2a2596fb4e..24a230895316 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -244,8 +244,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>  	 * Remap the tail pages as read-only to catch illegal write operation
>  	 * to the tail pages.
>  	 */
> -	pgprot_t pgprot = PAGE_KERNEL_RO;
> -	pte_t entry = mk_pte(walk->reuse_page, pgprot);
> +	pte_t entry = pte_wrprotect(mk_pte(walk->reuse_page, PAGE_KERNEL));

This may silence the warning but we plan to add another to detect a
change in the pfn without going through a break-before-make sequence.
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index ba2a2596fb4e..cb056265c31e 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -249,7 +249,7 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
 	struct page *page = pte_page(*pte);

 	list_add_tail(&page->lru, walk->vmemmap_pages);
-	set_pte_at(&init_mm, addr, pte, entry);
+	set_pte(pte, entry);
 }

 /*