
[V2,3/3] sched/numa: Reset the accessing PID information periodically

Message ID b0f273113fedffb02f9b1358c88813ff355a81d6.1675159422.git.raghavendra.kt@amd.com (mailing list archive)
State New
Series sched/numa: Enhance vma scanning

Commit Message

Raghavendra K T Feb. 1, 2023, 8:02 a.m. UTC
This helps to ensure that only PIDs which have recently accessed the
VMA scan it.

Current implementation:
 Reset the accessing PIDs every (4 * sysctl_numa_balancing_scan_delay)
interval, after the initial scan delay period expires. The reset logic
is implemented in the scan path.

Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
Some of the potential ideas for clearing the accessing PIDs

1) A flag to indicate the phase in the life cycle of the VMA, tied to a timestamp (reuse next_scan or so)

VMA life cycle

t1         t2         t3                    t4         t5                   t6
|<-  DS  ->|<-  US  ->|<-        CS       ->|<-  US  ->|<-        CS       ->|
flags used to indicate whether we are in DS/CS/US phase

DS (delay scan): Initial phase where scanning is avoided for a new VMA
US (unconditional scan): Brief period where scanning is allowed irrespective of which task faulted the VMA
CS (conditional scan): Longer conditional scanning phase where task scanning is allowed only for VMAs of interest
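
A purely illustrative sketch of idea 1 (the numab->phase field and the
helper below are hypothetical, not part of this patch):

	enum vma_scan_phase { VMA_PHASE_DS, VMA_PHASE_US, VMA_PHASE_CS };

	static bool vma_scan_allowed(struct vm_area_struct *vma, bool task_accessed)
	{
		switch (vma->numab->phase) {	/* assumed field */
		case VMA_PHASE_DS:	/* delay scan: new VMA, skip */
			return false;
		case VMA_PHASE_US:	/* unconditional scan: always allow */
			return true;
		case VMA_PHASE_CS:	/* conditional scan: only accessing tasks */
			return task_accessed;
		}
		return false;
	}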


2) Maintain a duplicate list of accessing PIDs to keep track of access
history, and switch/reset between them; use an OR operation during iteration.

 Two lists of PIDs are maintained. At a regular interval the old list is
reset and the current list becomes the old list.
At any point in time, the set of PIDs accessing the VMA is determined by
ORing list1 and list2.

accessing_pids_list1 <-  current list
accessing_pids_list2 <-  old list

3) Maintain a per-VMA numa_seq as well
Currently numa_seq (how many times we have scanned the entire set of VMAs) is maintained at the mm level.
Having a numa_seq (roughly, how many times the current VMA has been considered for scanning) per VMA may be
helpful in some contexts (e.g., whether we need to allow VMA scanning unconditionally for a newly created VMA).
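
A minimal sketch of idea 3, assuming a hypothetical per-VMA numa_seq
field: a VMA whose seq lags the mm-wide numa_scan_seq has not yet been
considered in the current pass (e.g. a newly created VMA), so it could
be scanned unconditionally before the PID filter applies:

	/* numab->numa_seq is an assumed field, not part of this patch */
	static bool vma_needs_unconditional_scan(struct mm_struct *mm,
						 struct vm_area_struct *vma)
	{
		return vma->numab->numa_seq < mm->numa_scan_seq;
	}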

4) Reset accessing PIDs at regular intervals (current implementation)

t1       t2         t3         t4         t5         t6
|<- DS ->|<-  CS  ->|<-  CS  ->|<-  CS  ->|<-  CS  ->|

The current implementation resets the accessing PIDs every 4*scan_delay interval after the initial scan
delay time expires. The reset logic is implemented in the scan path.

 include/linux/mm_types.h |  1 +
 kernel/sched/fair.c      | 17 +++++++++++++++++
 2 files changed, 18 insertions(+)

Comments

Peter Zijlstra Feb. 3, 2023, 11:35 a.m. UTC | #1
On Wed, Feb 01, 2023 at 01:32:22PM +0530, Raghavendra K T wrote:

> 2) Maintain duplicate list of accessing PIDs to keep track of history of access. and switch/reset. use OR operation during iteration
> 
>  Two lists of PIDs maintained. At regular interval old list is reset and we make current list as old list
> At any point of time tracking of PIDs accessing VMA is determined by ORing list1 and list2  
> 
> accessing_pids_list1 <-  current list
> accessing_pids_list2 <-  old list

( I'm not sure why you think this part of the email doesn't need to be
  nicely wrapped at 76 chars.. )

This seems simple enough to me and can be trivially extended to N if
needed.

The typical implementation would look something like:

	unsigned long pids[N];
	unsigned int pid_idx;

set:
	unsigned long *pids = numab->pids + pid_idx;
	if (!__test_bit(bit, pids))
		__set_bit(bit, pids);

test:
	unsigned long pids = 0;
	for (int i = 0; i < N; i++)
		pids |= numab->pids[i];
	return __test_bit(bit, &pids);

rotate:
	idx = READ_ONCE(numab->pid_idx);
	WRITE_ONCE(numab->pid_idx, (idx + 1) % N);
	numab->pids[idx] = 0;

Note the actual rotate can be simplified to ^1 for N:=2.
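
For reference, a self-contained userspace sketch of this N-generation
scheme (names are illustrative, not kernel code; here the slot that
becomes the new current generation is cleared on rotate, so stale bits
age out after N rotations):

	#define N 4	/* number of PID-bitmask generations kept */

	struct pid_hist {
		unsigned long pids[N];
		unsigned int pid_idx;	/* index of the current generation */
	};

	static void pid_set(struct pid_hist *h, int bit)
	{
		h->pids[h->pid_idx] |= 1UL << bit;
	}

	static int pid_test(const struct pid_hist *h, int bit)
	{
		unsigned long all = 0;

		for (int i = 0; i < N; i++)
			all |= h->pids[i];	/* OR across all generations */
		return !!(all & (1UL << bit));
	}

	static void pid_rotate(struct pid_hist *h)
	{
		h->pid_idx = (h->pid_idx + 1) % N;	/* ^ 1 suffices for N == 2 */
		h->pids[h->pid_idx] = 0;	/* fresh generation starts empty */
	}
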
Raghavendra K T Feb. 4, 2023, 6:32 p.m. UTC | #2
On 2/3/2023 5:05 PM, Peter Zijlstra wrote:
> On Wed, Feb 01, 2023 at 01:32:22PM +0530, Raghavendra K T wrote:
> 
>> 2) Maintain duplicate list of accessing PIDs to keep track of history of access. and switch/reset. use OR operation during iteration
>>
>>   Two lists of PIDs maintained. At regular interval old list is reset and we make current list as old list
>> At any point of time tracking of PIDs accessing VMA is determined by ORing list1 and list2
>>
>> accessing_pids_list1 <-  current list
>> accessing_pids_list2 <-  old list
> 
> ( I'm not sure why you think this part of the email doesn't need to be
>    nicely wrapped at 76 chars.. )
> 

Sorry.. copy-pasted from my "idea" notes, and then word wrap fooled me..

> This seems simple enough to me and can be trivially extended to N if
> needed.
> 
> The typical implementation would look something like:
> 
> 	unsigned long pids[N];
> 	unsigned int pid_idx;
> 
> set:
> 	unsigned long *pids = numab->pids + pid_idx;
> 	if (!__test_bit(bit, pids))
> 		__set_bit(bit, pids);
> 
> test:
> 	unsigned long pids = 0;
> 	for (int i = 0; i < N; i++)
> 		pids |= numab->pids[i];
> 	return __test_bit(bit, &pids);
> 
> rotate:
> 	idx = READ_ONCE(numab->pid_idx);
> 	WRITE_ONCE(numab->pid_idx, (idx + 1) % N);
> 	numab->pids[idx] = 0;
> 
> Note the actual rotate can be simplified to ^1 for N:=2.

Thanks, good idea. This will be very helpful when we want to
differentiate accessing PIDs in a more granular way. Perhaps we can go
with N=2 and stick to the below simplification of your code?

something like:

unsigned long pids[2];

/* Assume pids[1] always holds the latest detail */
set:
	if (!__test_bit(bit, &pids[1]))
		__set_bit(bit, &pids[1]);

test:
	unsigned long all = pids[0] | pids[1];
	return __test_bit(bit, &all);

rotate:
	pids[0] = pids[1];
	pids[1] = 0;

Patch

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 980a6a4308b6..08a007744ea1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -437,6 +437,7 @@  struct anon_vma_name {
 
 struct vma_numab {
 	unsigned long next_scan;
+	unsigned long next_pid_reset;
 	unsigned long accessing_pids;
 };
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3505ae57c07c..14db6d8a5090 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2928,6 +2928,8 @@  static bool vma_is_accessed(struct vm_area_struct *vma)
 	return vma->numab->accessing_pids & (1UL << active_pid_bit);
 }
 
+#define VMA_PID_RESET_PERIOD (4 * sysctl_numa_balancing_scan_delay)
+
 /*
  * The expensive part of numa migration is done from task_work context.
  * Triggered from task_tick_numa().
@@ -3035,6 +3037,10 @@  static void task_numa_work(struct callback_head *work)
 
 			vma->numab->next_scan = now +
 				msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
+
+			/* Reset happens after 4 times the scan delay from scan start */
+			vma->numab->next_pid_reset = vma->numab->next_scan +
+				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
 		}
 
 		/*
@@ -3047,6 +3053,17 @@  static void task_numa_work(struct callback_head *work)
 		if (!vma_is_accessed(vma))
 			continue;
 
+		/*
+		 * Reset the accessing PIDs regularly for old VMAs. Reset after
+		 * checking the VMA for recent access, to avoid clearing the PID
+		 * info before the access is taken into account.
+		 */
+		if (mm->numa_scan_seq &&
+				time_after(jiffies, vma->numab->next_pid_reset)) {
+			vma->numab->next_pid_reset = vma->numab->next_pid_reset +
+				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
+			vma->numab->accessing_pids = 0;
+		}
+
 		do {
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);