diff mbox series

arm64/mm: adds soft dirty page tracking

Message ID MW4PR12MB687563EFB56373E8D55DDEABB92B2@MW4PR12MB6875.namprd12.prod.outlook.com (mailing list archive)
State New, archived
Series arm64/mm: adds soft dirty page tracking

Commit Message

Shivansh Vij March 12, 2024, 1:16 a.m. UTC
Checkpoint/Restore in Userspace (CRIU) needs to be able
to track changes to a memory page in order to support
pre-dumping, which is important for live migration.

On arm64, the PTE_DIRTY bit (defined in pgtable-prot.h) is
already used to track software dirty pages, while the
PTE_WRITE (DBM) and PTE_RDONLY bits are used to track
hardware dirty pages.
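
For reference, the relevant existing helpers in
arch/arm64/include/asm/pgtable.h look roughly as follows
(a sketch for illustration only; the exact form varies a
little between kernel versions):

  #define pte_sw_dirty(pte)  (!!(pte_val(pte) & PTE_DIRTY))
  #define pte_hw_dirty(pte)  (pte_write(pte) && !(pte_val(pte) & PTE_RDONLY))
  #define pte_dirty(pte)     (pte_sw_dirty(pte) || pte_hw_dirty(pte))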

This patch enables full soft dirty page tracking
(including swap PTE support) for arm64 systems, and is
based very closely on the x86 implementation.

It is based on an unfinished patch by
Bin Lu (bin.lu@arm.com) from 2017
(https://patchwork.kernel.org/project/linux-arm-kernel/patch/1512029649-61312-1-git-send-email-bin.lu@arm.com/),
but has been updated for newer 6.x kernels and has
also been tested on various 5.x kernels.

The main difference is that this attempts to fix the bug
identified in the original patch, where calling pte_mkclean()
on a page would result in pte_soft_dirty() == false. That
behaviour is invalid because pte_soft_dirty() should only
return false if the PTE_DIRTY bit is not set and the
pte_mksoft_dirty() function has not been called. The x86
implementation expects this behaviour as well.

To achieve this, an additional software dirty bit called
PTE_SOFT_DIRTY is defined (in pgtable-prot.h), which is used
exclusively to track soft dirty pages.
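
As a minimal sketch of the intended semantics (for
illustration only, assuming the helpers behave as described
above; this is not part of the diff below):

  /* Sketch only: soft-dirty must survive pte_mkclean(). */
  static void soft_dirty_example(pte_t pte)
  {
          pte = pte_mkdirty(pte);          /* sets PTE_DIRTY and PTE_SOFT_DIRTY */
          pte = pte_mkclean(pte);          /* clears PTE_DIRTY, sets PTE_RDONLY,
                                              leaves PTE_SOFT_DIRTY untouched */
          WARN_ON(!pte_soft_dirty(pte));   /* must still report soft-dirty */

          pte = pte_clear_soft_dirty(pte); /* explicit clear, e.g. via clear_refs */
          WARN_ON(pte_soft_dirty(pte));    /* now clean for soft-dirty purposes */
  }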

This patch also reuses the _PAGE_SWP_SOFT_DIRTY bit
(defined in pgtable.h) from the original patch to add
support for swapped-out pages, and it implements the
corresponding pmd_* helpers so that THP (including
MADV_FREE'd huge pages) is covered as well.
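
For context, the generic mm code is what consumes these
helpers to carry soft-dirty state across swap-out and
swap-in; the pattern is roughly the following (a sketch of
existing code in mm/rmap.c and mm/memory.c, not new code
in this patch):

  /* swap-out path (roughly, try_to_unmap_one()): */
  swp_pte = swp_entry_to_pte(entry);
  if (pte_soft_dirty(pteval))
          swp_pte = pte_swp_mksoft_dirty(swp_pte);  /* -> _PAGE_SWP_SOFT_DIRTY */

  /* swap-in path (roughly, do_swap_page()): */
  pte = mk_pte(page, vma->vm_page_prot);
  if (pte_swp_soft_dirty(vmf->orig_pte))
          pte = pte_mksoft_dirty(pte);              /* -> PTE_SOFT_DIRTY */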

This patch has been tested with CRIU's ZDTM test suite on
5.x and 6.x kernels using the following command:
test/zdtm.py run --page-server --remote-lazy-pages --keep-going --pre 3 -a
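
In addition, the interface can be sanity-checked directly
through the soft-dirty procfs ABI (see
Documentation/admin-guide/mm/soft-dirty.rst). A minimal
userspace sketch, error handling omitted:

  /* Clear soft-dirty, touch a page, read back pagemap bit 55. */
  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          long psz = sysconf(_SC_PAGESIZE);
          char *page = mmap(NULL, psz, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          uint64_t entry;
          int fd;

          fd = open("/proc/self/clear_refs", O_WRONLY);
          write(fd, "4", 1);              /* "4" clears all soft-dirty bits */
          close(fd);

          page[0] = 1;                    /* re-dirty the page */

          fd = open("/proc/self/pagemap", O_RDONLY);
          pread(fd, &entry, sizeof(entry),
                ((uintptr_t)page / psz) * sizeof(entry));
          close(fd);

          printf("soft-dirty: %d\n", (int)((entry >> 55) & 1)); /* expect 1 */
          return 0;
  }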

Signed-off-by: Shivansh Vij <shivanshvij@outlook.com>
---
 arch/arm64/Kconfig                    |  1 +
 arch/arm64/include/asm/pgtable-prot.h |  6 +++
 arch/arm64/include/asm/pgtable.h      | 54 ++++++++++++++++++++++++++-
 3 files changed, 60 insertions(+), 1 deletion(-)

Comments

David Hildenbrand March 12, 2024, 8:22 a.m. UTC | #1
On 12.03.24 02:16, Shivansh Vij wrote:

Hi,

> Checkpoint-Restore in Userspace (CRIU) needs to be able
> to track a memory page's changes if we want to enable
> pre-dumping, which is important for live migrations.
> 
> The PTE_DIRTY bit (defined in pgtable-prot.h) is already
> used to track software dirty pages, and the PTE_WRITE and
> PTE_READ bits are used to track hardware dirty pages.
> 
> This patch enables full soft dirty page tracking
> (including swap PTE support) for arm64 systems, and is
> based very closely on the x86 implementation.
> 
> It is based on an unfinished patch by
> Bin Lu (bin.lu@arm.com) from 2017
> (https://patchwork.kernel.org/project/linux-arm-kernel/patch/1512029649-61312-1-git-send-email-bin.lu@arm.com/),
> but has been updated for newer 6.x kernels as well as
> tested on various 5.x kernels.

There has also been more recently:

https://lore.kernel.org/lkml/20230703135526.930004-1-npache@redhat.com/#r

I recall that we are short on SW PTE bits:

"
So if you need software dirty, it can only be done with another software
PTE bit. The problem is that we are short of such bits (only one left if
we move PTE_PROT_NONE to a different location). The userfaultfd people
also want such bit.

Personally I'd reuse the four PBHA bits but I keep hearing that they may
be used with some out of tree patches.
"

https://lore.kernel.org/lkml/ZLQIaSMI74KpqsQQ@arm.com/
Joey Gouly March 12, 2024, 10:26 a.m. UTC | #2
On Tue, Mar 12, 2024 at 09:22:25AM +0100, David Hildenbrand wrote:
> On 12.03.24 02:16, Shivansh Vij wrote:
> 
> Hi,
> 
> > Checkpoint-Restore in Userspace (CRIU) needs to be able
> > to track a memory page's changes if we want to enable
> > pre-dumping, which is important for live migrations.
> > 
> > The PTE_DIRTY bit (defined in pgtable-prot.h) is already
> > used to track software dirty pages, and the PTE_WRITE and
> > PTE_READ bits are used to track hardware dirty pages.
> > 
> > This patch enables full soft dirty page tracking
> > (including swap PTE support) for arm64 systems, and is
> > based very closely on the x86 implementation.
> > 
> > It is based on an unfinished patch by
> > Bin Lu (bin.lu@arm.com) from 2017
> > (https://patchwork.kernel.org/project/linux-arm-kernel/patch/1512029649-61312-1-git-send-email-bin.lu@arm.com/),
> > but has been updated for newer 6.x kernels as well as
> > tested on various 5.x kernels.
> 
> There has also been more recently:
> 
> https://lore.kernel.org/lkml/20230703135526.930004-1-npache@redhat.com/#r
> 
> I recall that we are short on SW PTE bits:
> 
> "
> So if you need software dirty, it can only be done with another software
> PTE bit. The problem is that we are short of such bits (only one left if
> we move PTE_PROT_NONE to a different location). The userfaultfd people
> also want such bit.
> 
> Personally I'd reuse the four PBHA bits but I keep hearing that they may
> be used with some out of tree patches.
> "
> 
> https://lore.kernel.org/lkml/ZLQIaSMI74KpqsQQ@arm.com/
> 

I have some patches on the list (Permission Overlay) that also use bit 60:

	series: https://lore.kernel.org/linux-arm-kernel/20231124163510.1835740-1-joey.gouly@arm.com/
	commit: https://lore.kernel.org/linux-arm-kernel/20231124163510.1835740-9-joey.gouly@arm.com/

I will be sending out a v4 of that in several weeks.

Thanks,
Joey
Shivansh Vij March 12, 2024, 10:32 p.m. UTC | #3
Hi David,

On Tue, Mar 12, 2024 at 09:22:25AM +0100, David Hildenbrand wrote:
> On 12.03.24 02:16, Shivansh Vij wrote:
> 
> Hi,
> 
> > Checkpoint-Restore in Userspace (CRIU) needs to be able
> > to track a memory page's changes if we want to enable
> > pre-dumping, which is important for live migrations.
> > 
> > The PTE_DIRTY bit (defined in pgtable-prot.h) is already
> > used to track software dirty pages, and the PTE_WRITE and
> > PTE_READ bits are used to track hardware dirty pages.
> > 
> > This patch enables full soft dirty page tracking
> > (including swap PTE support) for arm64 systems, and is
> > based very closely on the x86 implementation.
> > 
> > It is based on an unfinished patch by
> > Bin Lu (bin.lu@arm.com) from 2017
> > (https://patchwork.kernel.org/project/linux-arm-kernel/patch/1512029649-61312-1-git-send-email-bin.lu@arm.com/),
> > but has been updated for newer 6.x kernels as well as
> > tested on various 5.x kernels.
> 
> There has also been more recently:
> 
> https://lore.kernel.org/lkml/20230703135526.930004-1-npache@redhat.com/#r
> 
> I recall that we are short on SW PTE bits:
> 
> "
> So if you need software dirty, it can only be done with another software
> PTE bit. The problem is that we are short of such bits (only one left if
> we move PTE_PROT_NONE to a different location). The userfaultfd people
> also want such bit.
> 
> Personally I'd reuse the four PBHA bits but I keep hearing that they may
> be used with some out of tree patches.
> "
> 
> https://lore.kernel.org/lkml/ZLQIaSMI74KpqsQQ@arm.com/

If I'm understanding the previous discussion (https://patchwork.kernel.org/project/linux-arm-kernel/patch/20230703135526.930004-1-npache@redhat.com/) correctly, the core issue is that we do need a dedicated SW PTE bit (like the PTE_SOFT_DIRTY bit in this patch), but PTE bits are in high demand, so it would be ideal to reuse an existing bit (maybe one of the PBHA bits, as you suggested) instead of claiming a new one.

Is my understanding correct?

Thanks,
Shivansh
David Hildenbrand March 15, 2024, 9:30 a.m. UTC | #4
On 12.03.24 23:32, Shivansh Vij wrote:
> Hi David,
> 
> On Tue, Mar 12, 2024 at 09:22:25AM +0100, David Hildenbrand wrote:
>> On 12.03.24 02:16, Shivansh Vij wrote:
>>
>> Hi,
>>
>>> Checkpoint-Restore in Userspace (CRIU) needs to be able
>>> to track a memory page's changes if we want to enable
>>> pre-dumping, which is important for live migrations.
>>>
>>> The PTE_DIRTY bit (defined in pgtable-prot.h) is already
>>> used to track software dirty pages, and the PTE_WRITE and
>>> PTE_READ bits are used to track hardware dirty pages.
>>>
>>> This patch enables full soft dirty page tracking
>>> (including swap PTE support) for arm64 systems, and is
>>> based very closely on the x86 implementation.
>>>
>>> It is based on an unfinished patch by
>>> Bin Lu (bin.lu@arm.com) from 2017
>>> (https://patchwork.kernel.org/project/linux-arm-kernel/patch/1512029649-61312-1-git-send-email-bin.lu@arm.com/),
>>> but has been updated for newer 6.x kernels as well as
>>> tested on various 5.x kernels.
>>
>> There has also been more recently:
>>
>> https://lore.kernel.org/lkml/20230703135526.930004-1-npache@redhat.com/#r
>>
>> I recall that we are short on SW PTE bits:
>>
>> "
>> So if you need software dirty, it can only be done with another software
>> PTE bit. The problem is that we are short of such bits (only one left if
>> we move PTE_PROT_NONE to a different location). The userfaultfd people
>> also want such bit.
>>
>> Personally I'd reuse the four PBHA bits but I keep hearing that they may
>> be used with some out of tree patches.
>> "
>>
>> https://lore.kernel.org/lkml/ZLQIaSMI74KpqsQQ@arm.com/
> 
> If I'm understanding the previous discussion (https://patchwork.kernel.org/project/linux-arm-kernel/patch/20230703135526.930004-1-npache@redhat.com/) correctly, the core issue is that we actually do need to use a special SW PTE bit (like the PTE_SOFT_DIRTY that's in this patch) - but at the same time, the PTE bits are highly contentious so it would be ideal if we could reuse an existing bit (maybe one of the PBHA bits like you suggested) instead of creating a new one.
>   
> Is my understanding correct?

Yes, that matches my understanding. As Joey noted, the bit you chose is 
defined by HW and might soon get used.

As Catalin wrote, some OOT patches might use the PBHA bits, although I 
am not sure what the latest state of that is, or whether we really 
should care about OOT patches. Maybe it would be good enough to allow 
driver use only in PFNMAP mappings, and simply not use the bit for 
softdirty/uffd-wp there.

I don't know much about PBHA; this [1] never got merged but is an 
interesting read. We are certainly short on SW bits in any case.

There were recently some discussions around why soft-dirty tracking is 
not suitable (unfixable) for some use cases, buried in previous 
iterations of [2]. The outcome of that was a new UFFD_FEATURE_WP_ASYNC 
mode as a replacement for soft-dirty tracking.

So in the long term, not introducing soft-dirty tracking and instead 
supporting uffd-wp might be the better choice on arm64.

[1] https://lkml.kernel.org/r/20211015161416.2196-8-james.morse@arm.com
[2] 
https://lore.kernel.org/all/20230821141518.870589-1-usama.anjum@collabora.com/
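
For completeness, the uffd-wp based alternative looks roughly like the 
following from userspace (a rough sketch assuming a kernel with 
UFFD_FEATURE_WP_ASYNC; error handling omitted):

  /* Rough sketch of async uffd-wp dirty tracking (not arm64-specific). */
  #include <fcntl.h>
  #include <linux/userfaultfd.h>
  #include <sys/ioctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  static int wp_async_track(void *addr, unsigned long len)
  {
          int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
          struct uffdio_api api = {
                  .api = UFFD_API,
                  .features = UFFD_FEATURE_WP_ASYNC | UFFD_FEATURE_WP_UNPOPULATED,
          };
          struct uffdio_register reg = {
                  .range = { .start = (unsigned long)addr, .len = len },
                  .mode  = UFFDIO_REGISTER_MODE_WP,
          };
          struct uffdio_writeprotect wp = {
                  .range = { .start = (unsigned long)addr, .len = len },
                  .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
          };

          ioctl(uffd, UFFDIO_API, &api);
          ioctl(uffd, UFFDIO_REGISTER, &reg);
          ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);  /* "clear" the dirty state */

          /*
           * Writes after this point resolve the wp bit asynchronously;
           * the dirty set can then be read back via the PAGEMAP_SCAN
           * ioctl on /proc/<pid>/pagemap (PAGE_IS_WRITTEN).
           */
          return uffd;
  }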

Patch

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index aa7c1d435139..fe73d4809c7e 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -178,6 +178,7 @@  config ARM64
 	select HAVE_ARCH_RANDOMIZE_KSTACK_OFFSET
 	select HAVE_ARCH_SECCOMP_FILTER
 	select HAVE_ARCH_STACKLEAK
+	select HAVE_ARCH_SOFT_DIRTY
 	select HAVE_ARCH_THREAD_STRUCT_WHITELIST
 	select HAVE_ARCH_TRACEHOOK
 	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index 483dbfa39c4c..1b4119bbdf01 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -27,6 +27,12 @@ 
  */
 #define PMD_PRESENT_INVALID	(_AT(pteval_t, 1) << 59) /* only when !PMD_SECT_VALID */
 
+#ifdef CONFIG_MEM_SOFT_DIRTY
+#define PTE_SOFT_DIRTY          (_AT(pteval_t, 1) << 60) /* for soft dirty tracking */
+#else
+#define PTE_SOFT_DIRTY          0UL
+#endif /* CONFIG_MEM_SOFT_DIRTY */
+
 #define _PROT_DEFAULT		(PTE_TYPE_PAGE | PTE_AF | PTE_SHARED)
 #define _PROT_SECT_DEFAULT	(PMD_TYPE_SECT | PMD_SECT_AF | PMD_SECT_S)
 
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 79ce70fbb751..0e699e7d96da 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -198,7 +198,7 @@  static inline pte_t pte_mkclean(pte_t pte)
 
 static inline pte_t pte_mkdirty(pte_t pte)
 {
-	pte = set_pte_bit(pte, __pgprot(PTE_DIRTY));
+	pte = set_pte_bit(pte, __pgprot(PTE_DIRTY | PTE_SOFT_DIRTY));
 
 	if (pte_write(pte))
 		pte = clear_pte_bit(pte, __pgprot(PTE_RDONLY));
@@ -443,6 +443,29 @@  static inline pgprot_t pte_pgprot(pte_t pte)
 	return __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte));
 }
 
+#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
+static inline bool pte_soft_dirty(pte_t pte)
+{
+	return pte_sw_dirty(pte) || (!!(pte_val(pte) & PTE_SOFT_DIRTY));
+}
+
+static inline pte_t pte_mksoft_dirty(pte_t pte)
+{
+	pte = set_pte_bit(pte, __pgprot(PTE_SOFT_DIRTY));
+	return pte;
+}
+
+static inline pte_t pte_clear_soft_dirty(pte_t pte)
+{
+	pte = clear_pte_bit(pte, __pgprot(PTE_SOFT_DIRTY));
+	return pte;
+}
+
+#define pmd_soft_dirty(pmd)    pte_soft_dirty(pmd_pte(pmd))
+#define pmd_mksoft_dirty(pmd)  pte_pmd(pte_mksoft_dirty(pmd_pte(pmd)))
+#define pmd_clear_soft_dirty(pmd) pte_pmd(pte_clear_soft_dirty(pmd_pte(pmd)))
+#endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */
+
 #ifdef CONFIG_NUMA_BALANCING
 /*
  * See the comment in include/linux/pgtable.h
@@ -1013,10 +1036,12 @@  static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
  *	bits 3-7:	swap type
  *	bits 8-57:	swap offset
  *	bit  58:	PTE_PROT_NONE (must be zero)
+ *	bit  59:        swap software dirty tracking
  */
 #define __SWP_TYPE_SHIFT	3
 #define __SWP_TYPE_BITS		5
 #define __SWP_OFFSET_BITS	50
+#define __SWP_PROT_NONE_BITS    1
 #define __SWP_TYPE_MASK		((1 << __SWP_TYPE_BITS) - 1)
 #define __SWP_OFFSET_SHIFT	(__SWP_TYPE_BITS + __SWP_TYPE_SHIFT)
 #define __SWP_OFFSET_MASK	((1UL << __SWP_OFFSET_BITS) - 1)
@@ -1033,6 +1058,33 @@  static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
 #define __swp_entry_to_pmd(swp)		__pmd((swp).val)
 #endif /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
 
+#ifdef CONFIG_MEM_SOFT_DIRTY
+#define _PAGE_SWP_SOFT_DIRTY   (1UL << (__SWP_OFFSET_SHIFT + __SWP_OFFSET_BITS + __SWP_PROT_NONE_BITS))
+#else
+#define _PAGE_SWP_SOFT_DIRTY    0UL
+#endif /* CONFIG_MEM_SOFT_DIRTY */
+
+#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
+static inline bool pte_swp_soft_dirty(pte_t pte)
+{
+	return !!(pte_val(pte) & _PAGE_SWP_SOFT_DIRTY);
+}
+
+static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
+{
+	return __pte(pte_val(pte) | _PAGE_SWP_SOFT_DIRTY);
+}
+
+static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
+{
+	return __pte(pte_val(pte) & ~_PAGE_SWP_SOFT_DIRTY);
+}
+
+#define pmd_swp_soft_dirty(pmd)        pte_swp_soft_dirty(pmd_pte(pmd))
+#define pmd_swp_mksoft_dirty(pmd)      pte_pmd(pte_swp_mksoft_dirty(pmd_pte(pmd)))
+#define pmd_swp_clear_soft_dirty(pmd)  pte_pmd(pte_swp_clear_soft_dirty(pmd_pte(pmd)))
+#endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */
+
 /*
  * Ensure that there are not more swap files than can be encoded in the kernel
  * PTEs.