Message ID | 20210714152426.216217-2-tiberiu.georgescu@nutanix.com (mailing list archive) |
---|---|
State | New, archived |
Series | pagemap: report swap location for shared pages |
On Wed, Jul 14, 2021 at 03:24:26PM +0000, Tiberiu Georgescu wrote:
> When a page allocated using the MAP_SHARED flag is swapped out, its pagemap
> entry is cleared. In many cases, there is no difference between swapped-out
> shared pages and newly allocated, non-dirty pages in the pagemap interface.
>
> This patch addresses the behaviour and modifies pte_to_pagemap_entry() to
> make use of the XArray associated with the virtual memory area struct
> passed as an argument. The XArray contains the location of virtual pages
> in the page cache, swap cache or on disk. If they are on either of the
> caches, then the original implementation still works. If not, then the
> missing information will be retrieved from the XArray.
>
> Co-developed-by: Florian Schmidt <florian.schmidt@nutanix.com>
> Signed-off-by: Florian Schmidt <florian.schmidt@nutanix.com>
> Co-developed-by: Carl Waldspurger <carl.waldspurger@nutanix.com>
> Signed-off-by: Carl Waldspurger <carl.waldspurger@nutanix.com>
> Co-developed-by: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
> Signed-off-by: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
> Signed-off-by: Tiberiu Georgescu <tiberiu.georgescu@nutanix.com>
> ---
>  fs/proc/task_mmu.c | 37 +++++++++++++++++++++++++++++--------
>  1 file changed, 29 insertions(+), 8 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index eb97468dfe4c..b17c8aedd32e 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -1359,12 +1359,25 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
>  	return err;
>  }
>  
> +static void *get_xa_entry_at_vma_addr(struct vm_area_struct *vma,
> +		unsigned long addr)
> +{
> +	struct inode *inode = file_inode(vma->vm_file);
> +	struct address_space *mapping = inode->i_mapping;
> +	pgoff_t offset = linear_page_index(vma, addr);
> +
> +	return xa_load(&mapping->i_pages, offset);
> +}
> +
>  static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>  		struct vm_area_struct *vma, unsigned long addr, pte_t pte)
>  {
>  	u64 frame = 0, flags = 0;
>  	struct page *page = NULL;
>  
> +	if (vma->vm_flags & VM_SOFTDIRTY)
> +		flags |= PM_SOFT_DIRTY;
> +
>  	if (pte_present(pte)) {
>  		if (pm->show_pfn)
>  			frame = pte_pfn(pte);
> @@ -1374,13 +1387,22 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>  			flags |= PM_SOFT_DIRTY;
>  		if (pte_uffd_wp(pte))
>  			flags |= PM_UFFD_WP;
> -	} else if (is_swap_pte(pte)) {
> +	} else if (is_swap_pte(pte) || shmem_file(vma->vm_file)) {
>  		swp_entry_t entry;
> -		if (pte_swp_soft_dirty(pte))
> -			flags |= PM_SOFT_DIRTY;
> -		if (pte_swp_uffd_wp(pte))
> -			flags |= PM_UFFD_WP;
> -		entry = pte_to_swp_entry(pte);
> +		if (is_swap_pte(pte)) {
> +			entry = pte_to_swp_entry(pte);
> +			if (pte_swp_soft_dirty(pte))
> +				flags |= PM_SOFT_DIRTY;
> +			if (pte_swp_uffd_wp(pte))
> +				flags |= PM_UFFD_WP;
> +		} else {
> +			void *xa_entry = get_xa_entry_at_vma_addr(vma, addr);
> +
> +			if (xa_is_value(xa_entry))
> +				entry = radix_to_swp_entry(xa_entry);
> +			else
> +				goto out;
> +		}
>  		if (pm->show_pfn)
>  			frame = swp_type(entry) |
>  				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
> @@ -1393,9 +1415,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>  		flags |= PM_FILE;
>  	if (page && page_mapcount(page) == 1)
>  		flags |= PM_MMAP_EXCLUSIVE;
> -	if (vma->vm_flags & VM_SOFTDIRTY)
> -		flags |= PM_SOFT_DIRTY;

IMHO moving this to the entry will only work for the initial iteration, however
it won't really help anything, as soft-dirty should always be used in pair with
clear_refs written with value "4" first otherwise all pages will be marked
soft-dirty then the pagemap data is meaningless.

After the "write 4" op VM_SOFTDIRTY will be cleared and I expect the test case
to see all zeros again even with the patch.

I think one way to fix this is to do something similar to uffd-wp: we leave a
marker in pte showing that this is soft-dirtied pte even if swapped out.
However we don't have a mechanism for that yet in current linux, and the
uffd-wp series is the first one trying to introduce something like that.

Thanks,
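The pairing of soft-dirty with clear_refs described above can be sketched from userspace. This is only an illustration, assuming a kernel built with soft-dirty support; the helper names (pme_soft_dirty, clear_soft_dirty_self, read_pme_self) and the PME_SOFT_DIRTY macro are hypothetical and do not come from the patch or the kernel. The sketch clears the soft-dirty state by writing "4" to /proc/self/clear_refs, and tests bit 55 of a raw pagemap entry, which is where the pagemap interface reports soft-dirty.

```c
#define _POSIX_C_SOURCE 200809L /* for pread() */

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* Assumption: userspace name for bit 55 of a pagemap entry (soft-dirty). */
#define PME_SOFT_DIRTY (UINT64_C(1) << 55)

/* Return 1 if the soft-dirty bit is set in a raw 64-bit pagemap entry. */
int pme_soft_dirty(uint64_t pme)
{
	return (pme & PME_SOFT_DIRTY) != 0;
}

/*
 * Reset soft-dirty tracking for the calling process by writing "4" to
 * clear_refs. Returns 0 on success, -1 on failure.
 */
int clear_soft_dirty_self(void)
{
	int fd = open("/proc/self/clear_refs", O_WRONLY);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, "4", 1);
	close(fd);
	return n == 1 ? 0 : -1;
}

/*
 * Read the pagemap entry that covers vaddr in the calling process.
 * Each virtual page has one 8-byte entry, indexed by page number.
 * Returns 0 on success, -1 on failure.
 */
int read_pme_self(const void *vaddr, uint64_t *pme)
{
	long page_size = sysconf(_SC_PAGESIZE);
	off_t off;
	ssize_t n;
	int fd;

	if (page_size <= 0)
		return -1;
	off = (off_t)((uintptr_t)vaddr / (uintptr_t)page_size) *
	      (off_t)sizeof(*pme);
	fd = open("/proc/self/pagemap", O_RDONLY);
	if (fd < 0)
		return -1;
	n = pread(fd, pme, sizeof(*pme), off);
	close(fd);
	return n == (ssize_t)sizeof(*pme) ? 0 : -1;
}
```

A tracking loop would call clear_soft_dirty_self() first, let the workload run, and then call read_pme_self() and pme_soft_dirty() per page. Without the initial clear, the VMA-level VM_SOFTDIRTY flag makes every page report as soft-dirty, which is the situation the review comment warns about.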
On 14.07.21 18:08, Peter Xu wrote:
> On Wed, Jul 14, 2021 at 03:24:26PM +0000, Tiberiu Georgescu wrote:
[...]
>> -	if (vma->vm_flags & VM_SOFTDIRTY)
>> -		flags |= PM_SOFT_DIRTY;
>
> IMHO moving this to the entry will only work for the initial iteration, however
> it won't really help anything, as soft-dirty should always be used in pair with
> clear_refs written with value "4" first otherwise all pages will be marked
> soft-dirty then the pagemap data is meaningless.
>
> After the "write 4" op VM_SOFTDIRTY will be cleared and I expect the test case
> to see all zeros again even with the patch.
>
> I think one way to fix this is to do something similar to uffd-wp: we leave a
> marker in pte showing that this is soft-dirtied pte even if swapped out.

How exactly does such a pte look like? Simply pte_none() with another
bit set?

> However we don't have a mechanism for that yet in current linux, and the
> uffd-wp series is the first one trying to introduce something like that.

Can you give me a pointer? I'm very interested in learning how to
identify this case.
On 14.07.21 18:24, David Hildenbrand wrote:
> On 14.07.21 18:08, Peter Xu wrote:
[...]
>> I think one way to fix this is to do something similar to uffd-wp: we leave a
>> marker in pte showing that this is soft-dirtied pte even if swapped out.
>
> How exactly does such a pte look like? Simply pte_none() with another
> bit set?
>
>> However we don't have a mechanism for that yet in current linux, and the
>> uffd-wp series is the first one trying to introduce something like that.
>
> Can you give me a pointer? I'm very interested in learning how to
> identify this case.
>

I assume it's

https://lore.kernel.org/lkml/20210527202117.30689-1-peterx@redhat.com/
On Wed, Jul 14, 2021 at 06:30:05PM +0200, David Hildenbrand wrote:
> On 14.07.21 18:24, David Hildenbrand wrote:
> > On 14.07.21 18:08, Peter Xu wrote:
[...]
> > > I think one way to fix this is to do something similar to uffd-wp: we leave a
> > > marker in pte showing that this is soft-dirtied pte even if swapped out.
> >
> > How exactly does such a pte look like? Simply pte_none() with another
> > bit set?

Yes, something like that. The pte can be defined at will, as long as it is
never used elsewhere.

> >
> > > However we don't have a mechanism for that yet in current linux, and the
> > > uffd-wp series is the first one trying to introduce something like that.
> >
> > Can you give me a pointer? I'm very interested in learning how to
> > identify this case.
> >
>
> I assume it's
> https://lore.kernel.org/lkml/20210527202117.30689-1-peterx@redhat.com/

Yes.
> On 14 Jul 2021, at 17:08, Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Jul 14, 2021 at 03:24:26PM +0000, Tiberiu Georgescu wrote:
>>
>>  static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>>  		struct vm_area_struct *vma, unsigned long addr, pte_t pte)
>>  {
>>  	u64 frame = 0, flags = 0;
>>  	struct page *page = NULL;
>>  
>> +	if (vma->vm_flags & VM_SOFTDIRTY)
>> +		flags |= PM_SOFT_DIRTY;
>> +
>>  	if (pte_present(pte)) {
>>  		if (pm->show_pfn)
>>  			frame = pte_pfn(pte);
>> @@ -1374,13 +1387,22 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>>  			flags |= PM_SOFT_DIRTY;
>>  		if (pte_uffd_wp(pte))
>>  			flags |= PM_UFFD_WP;
>> -	} else if (is_swap_pte(pte)) {
>> +	} else if (is_swap_pte(pte) || shmem_file(vma->vm_file)) {
>>  		swp_entry_t entry;
>> -		if (pte_swp_soft_dirty(pte))
>> -			flags |= PM_SOFT_DIRTY;
>> -		if (pte_swp_uffd_wp(pte))
>> -			flags |= PM_UFFD_WP;
>> -		entry = pte_to_swp_entry(pte);
>> +		if (is_swap_pte(pte)) {
>> +			entry = pte_to_swp_entry(pte);
>> +			if (pte_swp_soft_dirty(pte))
>> +				flags |= PM_SOFT_DIRTY;
>> +			if (pte_swp_uffd_wp(pte))
>> +				flags |= PM_UFFD_WP;
>> +		} else {
>> +			void *xa_entry = get_xa_entry_at_vma_addr(vma, addr);
>> +
>> +			if (xa_is_value(xa_entry))
>> +				entry = radix_to_swp_entry(xa_entry);
>> +			else
>> +				goto out;
>> +		}
>>  		if (pm->show_pfn)
>>  			frame = swp_type(entry) |
>>  				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
>> @@ -1393,9 +1415,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
>>  		flags |= PM_FILE;
>>  	if (page && page_mapcount(page) == 1)
>>  		flags |= PM_MMAP_EXCLUSIVE;
>> -	if (vma->vm_flags & VM_SOFTDIRTY)
>> -		flags |= PM_SOFT_DIRTY;
>
> IMHO moving this to the entry will only work for the initial iteration, however
> it won't really help anything, as soft-dirty should always be used in pair with
> clear_refs written with value "4" first otherwise all pages will be marked
> soft-dirty then the pagemap data is meaningless.
>
> After the "write 4" op VM_SOFTDIRTY will be cleared and I expect the test case
> to see all zeros again even with the patch.

Indeed, the SOFT_DIRTY bit gets cleared and does not get set when we dirty the
page and swap it out again. However, the pagemap entries are not completely
zeroed out. The patch mostly deals with adding the swap frame offset on the
pagemap entries of swappable, non-syncable pages, even if they are MAP_SHARED.

Example output post-patch, after writing 4 to clear_refs and dirtying the
pages:

$ dd if=/proc/$PID/pagemap ibs=8 skip=$(($VADDR / $PAGESIZE)) count=256 | hexdump -C
00000000  80 13 01 00 00 00 00 40  a0 13 01 00 00 00 00 40  |.......@.......@|
...........more swapped-out entries............
000005e0  e0 2a 01 00 00 00 00 40  00 2b 01 00 00 00 00 40  |.*.....@.+.....@|
000005f0  20 2b 01 00 00 00 00 40  40 2b 01 00 00 00 00 40  | +.....@@+.....@|
00000600  72 6c 1d 00 00 00 80 a1  c1 34 12 00 00 00 80 a1  |rl.......4......|
...........more in-memory entries............
000007f0  3c 21 18 00 00 00 80 a1  69 ec 17 00 00 00 80 a1  |<!......i.......|

You may find the pre-patch example output on the RFC cover letter, for
reference: https://lkml.org/lkml/2021/7/14/594

> I think one way to fix this is to do something similar to uffd-wp: we leave a
> marker in pte showing that this is soft-dirtied pte even if swapped out.
> However we don't have a mechanism for that yet in current linux, and the
> uffd-wp series is the first one trying to introduce something like that.

I am taking a look at the uffd-wp patch today. Hope it gets upstreamed soon, so
I can adapt one of the mechanisms in there to keep track of the SOFT_DIRTY bit
on the PTE after swap.

Kind regards,
Tibi
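To make the hexdump above easier to read, here is a minimal userspace sketch that decodes raw pagemap entries. It assumes the documented pagemap layout: bits 0-54 hold the PFN for present pages, or the swap type in bits 0-4 and the swap offset in bits 5-54 for swapped pages; bit 55 is soft-dirty, bit 56 exclusively mapped, bit 61 file-page or shared-anon, bit 62 swapped, bit 63 present. The PME_* macros and pme_* helpers are hypothetical names chosen for this illustration; they mirror, but are not, the kernel's PM_* definitions.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed userspace names for the documented pagemap entry fields. */
#define PME_PFRAME_MASK    ((UINT64_C(1) << 55) - 1) /* bits 0-54 */
#define PME_SOFT_DIRTY     (UINT64_C(1) << 55)
#define PME_MMAP_EXCLUSIVE (UINT64_C(1) << 56)
#define PME_FILE           (UINT64_C(1) << 61)
#define PME_SWAP           (UINT64_C(1) << 62)
#define PME_PRESENT        (UINT64_C(1) << 63)
#define PME_SWAP_TYPE_BITS 5 /* same width as MAX_SWAPFILES_SHIFT */

/* Assemble a 64-bit entry from 8 bytes as printed by hexdump -C
 * on a little-endian machine. */
uint64_t pme_from_le_bytes(const unsigned char b[8])
{
	uint64_t v = 0;
	int i;

	for (i = 7; i >= 0; i--)
		v = (v << 8) | b[i];
	return v;
}

int pme_is_present(uint64_t pme) { return (pme & PME_PRESENT) != 0; }
int pme_is_swapped(uint64_t pme) { return (pme & PME_SWAP) != 0; }

/* Raw frame field: PFN if present, packed swap type/offset if swapped. */
uint64_t pme_frame(uint64_t pme) { return pme & PME_PFRAME_MASK; }

/* Low 5 bits of the frame field identify the swap device. */
unsigned int pme_swap_type(uint64_t pme)
{
	return (unsigned int)(pme_frame(pme) &
			      ((1u << PME_SWAP_TYPE_BITS) - 1));
}

/* Remaining frame bits give the offset within that swap device. */
uint64_t pme_swap_offset(uint64_t pme)
{
	return pme_frame(pme) >> PME_SWAP_TYPE_BITS;
}

/* Print a one-line summary of an entry. */
void pme_print(uint64_t pme)
{
	if (pme_is_present(pme))
		printf("present pfn=0x%llx soft-dirty=%d\n",
		       (unsigned long long)pme_frame(pme),
		       (pme & PME_SOFT_DIRTY) != 0);
	else if (pme_is_swapped(pme))
		printf("swapped type=%u offset=%llu soft-dirty=%d\n",
		       pme_swap_type(pme),
		       (unsigned long long)pme_swap_offset(pme),
		       (pme & PME_SOFT_DIRTY) != 0);
	else
		printf("not present, not swapped\n");
}
```

Applied to the first entry of the dump (bytes 80 13 01 00 00 00 00 40), this yields a swapped entry with swap type 0 and offset 2204 and no soft-dirty bit, while the first in-memory entry (72 6c 1d 00 00 00 80 a1) is present with the soft-dirty bit set. Note that the frame field is only populated for readers the kernel allows to see it (pm->show_pfn in the patch); otherwise it reads as zero.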
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index eb97468dfe4c..b17c8aedd32e 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1359,12 +1359,25 @@ static int pagemap_pte_hole(unsigned long start, unsigned long end,
 	return err;
 }
 
+static void *get_xa_entry_at_vma_addr(struct vm_area_struct *vma,
+		unsigned long addr)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	struct address_space *mapping = inode->i_mapping;
+	pgoff_t offset = linear_page_index(vma, addr);
+
+	return xa_load(&mapping->i_pages, offset);
+}
+
 static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		struct vm_area_struct *vma, unsigned long addr, pte_t pte)
 {
 	u64 frame = 0, flags = 0;
 	struct page *page = NULL;
 
+	if (vma->vm_flags & VM_SOFTDIRTY)
+		flags |= PM_SOFT_DIRTY;
+
 	if (pte_present(pte)) {
 		if (pm->show_pfn)
 			frame = pte_pfn(pte);
@@ -1374,13 +1387,22 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 			flags |= PM_SOFT_DIRTY;
 		if (pte_uffd_wp(pte))
 			flags |= PM_UFFD_WP;
-	} else if (is_swap_pte(pte)) {
+	} else if (is_swap_pte(pte) || shmem_file(vma->vm_file)) {
 		swp_entry_t entry;
-		if (pte_swp_soft_dirty(pte))
-			flags |= PM_SOFT_DIRTY;
-		if (pte_swp_uffd_wp(pte))
-			flags |= PM_UFFD_WP;
-		entry = pte_to_swp_entry(pte);
+		if (is_swap_pte(pte)) {
+			entry = pte_to_swp_entry(pte);
+			if (pte_swp_soft_dirty(pte))
+				flags |= PM_SOFT_DIRTY;
+			if (pte_swp_uffd_wp(pte))
+				flags |= PM_UFFD_WP;
+		} else {
+			void *xa_entry = get_xa_entry_at_vma_addr(vma, addr);
+
+			if (xa_is_value(xa_entry))
+				entry = radix_to_swp_entry(xa_entry);
+			else
+				goto out;
+		}
 		if (pm->show_pfn)
 			frame = swp_type(entry) |
 				(swp_offset(entry) << MAX_SWAPFILES_SHIFT);
@@ -1393,9 +1415,8 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		flags |= PM_FILE;
 	if (page && page_mapcount(page) == 1)
 		flags |= PM_MMAP_EXCLUSIVE;
-	if (vma->vm_flags & VM_SOFTDIRTY)
-		flags |= PM_SOFT_DIRTY;
 
+out:
 	return make_pme(frame, flags);
 }