[RFC,02/12] mm: add config option and per-NUMA node VMS support

Message ID 20231228131056.602411-3-artem.kuzin@huawei.com (mailing list archive)
State New
Series x86 NUMA-aware kernel replication

Commit Message

Artem Kuzin Dec. 28, 2023, 1:10 p.m. UTC
From: Artem Kuzin <artem.kuzin@huawei.com>

Co-developed-by: Nikita Panov <nikita.panov@huawei-partners.com>
Signed-off-by: Nikita Panov <nikita.panov@huawei-partners.com>
Co-developed-by: Alexander Grubnikov <alexander.grubnikov@huawei.com>
Signed-off-by: Alexander Grubnikov <alexander.grubnikov@huawei.com>
Signed-off-by: Artem Kuzin <artem.kuzin@huawei.com>
---
 include/linux/mm_types.h | 11 ++++++++++-
 mm/Kconfig               | 10 ++++++++++
 2 files changed, 20 insertions(+), 1 deletion(-)

Comments

Christoph Lameter (Ampere) Jan. 3, 2024, 7:43 p.m. UTC | #1
On Thu, 28 Dec 2023, artem.kuzin@huawei.com wrote:

> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -626,7 +628,14 @@ struct mm_struct {
> 		unsigned long mmap_compat_legacy_base;
> #endif
> 		unsigned long task_size;	/* size of task vm space */
> -		pgd_t * pgd;
> +#ifndef CONFIG_KERNEL_REPLICATION
> +		pgd_t *pgd;
> +#else
> +		union {
> +			pgd_t *pgd;
> +			pgd_t *pgd_numa[MAX_NUMNODES];
> +		};
> +#endif


Hmmm... This is adding the pgd pointers for all mm_structs. But we
only need the NUMA pgd pointers for the init_mm. Can this be a separate
variable? There are some architectures with a larger number of nodes.
Artem Kuzin Jan. 9, 2024, 4:57 p.m. UTC | #2
On 1/3/2024 10:43 PM, Christoph Lameter (Ampere) wrote:
> On Thu, 28 Dec 2023, artem.kuzin@huawei.com wrote:
>
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -626,7 +628,14 @@ struct mm_struct {
>>         unsigned long mmap_compat_legacy_base;
>> #endif
>>         unsigned long task_size;    /* size of task vm space */
>> -        pgd_t * pgd;
>> +#ifndef CONFIG_KERNEL_REPLICATION
>> +        pgd_t *pgd;
>> +#else
>> +        union {
>> +            pgd_t *pgd;
>> +            pgd_t *pgd_numa[MAX_NUMNODES];
>> +        };
>> +#endif
>
>
> Hmmm... This is adding the pgd pointers for all mm_structs. But we only need the NUMA pgd pointers for the init_mm. Can this be a separate variable? There are some architectures with a larger number of nodes.
>
>
>

Hi, Christoph.

Sorry for the long delay in replying.

We already have a per-NUMA node init_mm, but this is not enough.
We need this array of pointers in the mm_struct because the proper (per-NUMA node) pgd must be used for threads of a process that occupies more than one NUMA node.
On x86 we have one translation table per process that contains both the kernel and the user space parts. With kernel text and rodata replication enabled, we need to take
the per-NUMA node kernel text and rodata replicas into account during context switches and similar paths. For example, if a particular thread enters a system call, we need to use the
kernel replica that corresponds to the NUMA node the thread is running on. At the same time, a process can occupy several NUMA nodes, and its threads running on different
NUMA nodes should observe the same user space but different kernel versions (per-NUMA node replicas).
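
To illustrate, something like the following is what the union enables (just
a sketch, not part of this series; the mm_node_pgd() helper is hypothetical,
while numa_node_id() is the existing kernel API):

    static inline pgd_t *mm_node_pgd(struct mm_struct *mm)
    {
    #ifdef CONFIG_KERNEL_REPLICATION
            /* The kernel half of this pgd maps the local node's
             * text/rodata replica; the user space half is the same
             * on every node. */
            return mm->pgd_numa[numa_node_id()];
    #else
            return mm->pgd;
    #endif
    }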

But you are right that this place should be optimized. We do not need this array for processes that are not expected to work across NUMA nodes. Possibly, we
need to implement some "lazy" approach to per-NUMA node translation table allocation. The current version of kernel replication support is implemented in a way
that tries to keep everything as simple as possible.

Thank you!

Best regards,
Artem
Dave Hansen Jan. 25, 2024, 3:07 p.m. UTC | #3
On 1/9/24 08:57, Artem Kuzin wrote:
> We already have a per-NUMA node init_mm, but this is not enough.
> We need this array of pointers in the mm_struct because the proper
> (per-NUMA node) pgd must be used for threads of a process that
> occupies more than one NUMA node.

Let me repeat what Christoph said in a bit more forceful way.

MAX_NUMNODES can be 1024.  You're adding 1023*8 bytes of overhead for
each process ... everywhere, including on my single node laptop.  That's
completely unacceptable.  You need to find another way to do this.
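
(Spelled out: with 8-byte pointers the union grows from a single pointer
to a 1024-entry array, i.e. 1023 * 8 = 8184 extra bytes, roughly 8 KiB,
in every mm_struct on the system.)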

I'd suggest just ignoring the problem for now.  Do multi-node processes
with a later optimization.
Artem Kuzin Jan. 29, 2024, 6:22 a.m. UTC | #4
On 1/25/2024 6:07 PM, Dave Hansen wrote:
> On 1/9/24 08:57, Artem Kuzin wrote:
>> We already have a per-NUMA node init_mm, but this is not enough.
>> We need this array of pointers in the mm_struct because the proper
>> (per-NUMA node) pgd must be used for threads of a process that
>> occupies more than one NUMA node.
> Let me repeat what Christoph said in a bit more forceful way.
>
> MAX_NUMNODES can be 1024.  You're adding 1023*8 bytes of overhead for
> each process ... everywhere, including on my single node laptop.  That's
> completely unacceptable.  You need to find another way to do this.
>
> I'd suggest just ignoring the problem for now.  Do multi-node processes
> with a later optimization.

Hi Dave, thanks to you and Christoph for the comments. I just gave some details on why this is necessary; I didn't want to push the MAX_NUMNODES solution forward. It is a temporary
thing, and this place should definitely be updated in the future.

As for possible options, for now I am thinking about two:
1. an additional config option to limit the number of page tables and corresponding replicas
2. setting up per-NUMA node page tables and replicas in a lazy way, allocating them on demand (rough sketch below)

But here I need to try and test everything.
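
As a rough sketch of option 2 (the get_pgd_for_node() helper is hypothetical
and untested; pgd_alloc() is the existing per-arch allocator):

    static pgd_t *get_pgd_for_node(struct mm_struct *mm, int nid)
    {
            /* Allocate a node's pgd replica only on first use on
             * that node, instead of up front for all nodes. */
            if (!mm->pgd_numa[nid]) {
                    pgd_t *pgd = pgd_alloc(mm);

                    if (!pgd)
                            return NULL;
                    /* Racy as written; a real version needs a lock or
                     * cmpxchg against a concurrent allocation. */
                    mm->pgd_numa[nid] = pgd;
            }
            return mm->pgd_numa[nid];
    }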

Thanks!
Dave Hansen Jan. 30, 2024, 11:36 p.m. UTC | #5
On 1/28/24 22:22, Artem Kuzin wrote:
> As for possible options, for now I am thinking about two:
> 1. an additional config option to limit the number of page tables and corresponding replicas

This doesn't help.  Everybody runs distro kernels with the same configs
with MAX_NUMNODES set to 1024.

> 2. setting up per-NUMA node page tables and replicas in a lazy way, allocating them on demand

... and probably shrinking them on demand too (or at least limiting
their growth).  You don't want to have a random long-lived process
that's migrated to 50 different nodes having 50 different PGD pages when
it can only use one at a time.

Patch

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7d30dc4ff0ff..1fafb8425994 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -22,6 +22,8 @@ 
 
 #include <asm/mmu.h>
 
+#include <linux/numa.h>
+
 #ifndef AT_VECTOR_SIZE_ARCH
 #define AT_VECTOR_SIZE_ARCH 0
 #endif
@@ -626,7 +628,14 @@  struct mm_struct {
 		unsigned long mmap_compat_legacy_base;
 #endif
 		unsigned long task_size;	/* size of task vm space */
-		pgd_t * pgd;
+#ifndef CONFIG_KERNEL_REPLICATION
+		pgd_t *pgd;
+#else
+		union {
+			pgd_t *pgd;
+			pgd_t *pgd_numa[MAX_NUMNODES];
+		};
+#endif
 
 #ifdef CONFIG_MEMBARRIER
 		/**
diff --git a/mm/Kconfig b/mm/Kconfig
index 09130434e30d..5fe5b3ba7f99 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1236,6 +1236,16 @@  config LOCK_MM_AND_FIND_VMA
 	bool
 	depends on !STACK_GROWSUP
 
+config KERNEL_REPLICATION
+	bool "Enable kernel text and ro-data replication between NUMA nodes"
+	default n
+	depends on (X86_64 && !(KASAN && X86_5LEVEL)) && MMU && NUMA && !MAXSMP
+
+	help
+	  Creates per-NUMA node replicas of kernel text and rodata sections.
+	  Page tables are replicated partially, according to the replicated kernel memory range.
+	  If unsure, say "n".
+
 source "mm/damon/Kconfig"
 
 endmenu