mm: introduce clear all vm events counters

Message ID 1614595766-7640-1-git-send-email-pintu@codeaurora.org
State New, archived
Series mm: introduce clear all vm events counters

Commit Message

Pintu Kumar March 1, 2021, 10:49 a.m. UTC
At times there is a need to regularly monitor vm counters while we
reproduce some issue, or it could be as simple as gathering some system
statistics for a scenario run, where every time we would like to start
from the beginning.
The current steps are:
Dump /proc/vmstat
Run some scenario
Dump /proc/vmstat again
Generate some data or graph
Reboot and repeat

So, each time we want to capture fresh data, a reboot is required.

Thus, this patch introduces a way to clear/reset all the vm event
counters to 0 without rebooting the system.
With this patch the steps can be:
Dump /proc/vmstat
Run some scenario
Dump /proc/vmstat again
Generate some data or graph
Do: echo 1 > /proc/sys/vm/clear_all_vm_events
Then repeat again
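
For scripted runs, the same reset could also be driven from a small C
helper; a minimal sketch, assuming this patch is applied (the helper
name is illustrative, error handling kept minimal):

#include <fcntl.h>
#include <unistd.h>

/* Reset all vm event counters via the interface proposed in this
 * patch (hypothetical until merged). */
static int reset_vm_events(void)
{
	int fd = open("/proc/sys/vm/clear_all_vm_events", O_WRONLY);

	if (fd < 0)
		return -1;
	/* Writing "1" clears all counters under /proc/vmstat. */
	if (write(fd, "1", 1) != 1) {
		close(fd);
		return -1;
	}
	return close(fd);
}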

When 1 is written to this file, all the vm event counters present under
/proc/vmstat are reset to their default value of zero.
The next time /proc/vmstat is read, each field reflects only the events
that occurred after the reset.

There could be a few other use cases where this is useful.

Signed-off-by: Pintu Kumar <pintu@codeaurora.org>
Signed-off-by: Pintu Agarwal <ping.ping@gmail.com>
---
 Documentation/admin-guide/sysctl/vm.rst |  8 ++++++++
 include/linux/vmstat.h                  |  7 +++++++
 kernel/sysctl.c                         |  9 +++++++++
 mm/vmstat.c                             | 26 ++++++++++++++++++++++++++
 4 files changed, 50 insertions(+)

Comments

Matthew Wilcox March 1, 2021, 12:13 p.m. UTC | #1
On Mon, Mar 01, 2021 at 04:19:26PM +0530, Pintu Kumar wrote:
> +EXPORT_SYMBOL_GPL(clear_all_vm_events);

What module uses this function?
Pintu Kumar March 1, 2021, 1:55 p.m. UTC | #2
On 2021-03-01 17:43, Matthew Wilcox wrote:
> On Mon, Mar 01, 2021 at 04:19:26PM +0530, Pintu Kumar wrote:
>> +EXPORT_SYMBOL_GPL(clear_all_vm_events);
> 
> What module uses this function?

Oh sorry, I will remove the EXPORT.
Thanks for the review.


Regards,
Pintu
Johannes Weiner March 1, 2021, 3:11 p.m. UTC | #3
On Mon, Mar 01, 2021 at 04:19:26PM +0530, Pintu Kumar wrote:
> At times there is a need to regularly monitor vm counters while we
> reproduce some issue, or it could be as simple as gathering some system
> statistics for a scenario run, where every time we would like to start
> from the beginning.
> The current steps are:
> Dump /proc/vmstat
> Run some scenario
> Dump /proc/vmstat again
> Generate some data or graph
> Reboot and repeat

You can subtract the first vmstat dump from the second to get the
event delta for the scenario run. That's what I do, and I'd assume
most people are doing. Am I missing something?
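
For reference, a minimal userspace sketch of that subtraction approach
(an illustrative standalone program, not kernel code): snapshot
/proc/vmstat before and after the run and print per-field deltas,
relying on the field order staying stable within one boot:

#include <stdio.h>

#define MAX_FIELDS 256

struct snapshot {
	char name[MAX_FIELDS][64];
	unsigned long val[MAX_FIELDS];
	int n;
};

/* Read every "name value" pair from /proc/vmstat. */
static int snap(struct snapshot *s)
{
	FILE *f = fopen("/proc/vmstat", "r");

	if (!f)
		return -1;
	s->n = 0;
	while (s->n < MAX_FIELDS &&
	       fscanf(f, "%63s %lu", s->name[s->n], &s->val[s->n]) == 2)
		s->n++;
	fclose(f);
	return 0;
}

int main(void)
{
	struct snapshot before, after;
	int i;

	if (snap(&before))
		return 1;
	/* ... run the scenario of interest here ... */
	if (snap(&after))
		return 1;

	/* Subtract positionally; unsigned arithmetic keeps the delta
	 * correct even if a counter wrapped around in between. */
	for (i = 0; i < after.n; i++)
		printf("%-32s %lu\n", after.name[i],
		       after.val[i] - before.val[i]);
	return 0;
}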
Pintu Kumar March 2, 2021, 10:30 a.m. UTC | #4
On 2021-03-01 20:41, Johannes Weiner wrote:
> On Mon, Mar 01, 2021 at 04:19:26PM +0530, Pintu Kumar wrote:
>> At times there is a need to regularly monitor vm counters while we
>> reproduce some issue, or it could be as simple as gathering some system
>> statistics for a scenario run, where every time we would like to start
>> from the beginning.
>> The current steps are:
>> Dump /proc/vmstat
>> Run some scenario
>> Dump /proc/vmstat again
>> Generate some data or graph
>> Reboot and repeat
> 
> You can subtract the first vmstat dump from the second to get the
> event delta for the scenario run. That's what I do, and I'd assume
> most people are doing. Am I missing something?

Thanks so much for your comments.
Yes, in most cases that works.

But I guess there are times when we need to compare against fresh data
(just like after a reboot), at least for some of the counters.
Suppose we want to monitor pgalloc_normal and pgfree.
Or, suppose we want to monitor until a field becomes non-zero.
Or, see how certain values change compared to a fresh reboot.
Or, suppose we want to reset all counters after boot and start capturing
fresh stats.

Some of the counters can grow very large and very fast. Is there a
chance of overflow?
Then resetting them using this interface could help without rebooting.

Suppose the system came back from hibernation, and we want to reset all
counters again (to capture fresh data)?

Here I am sharing one output (from qemu-arm32 with 256MB RAM) just to
indicate which counters change:
Scenario: generate an OOM-kill case and check the oom_kill counter

BEFORE	AFTER	/proc/vmstat field
------  ------  -----------------------
49991	49916	nr_free_pages
4467	4481	nr_zone_inactive_anon
68	68	nr_zone_active_anon
3189	3067	nr_zone_inactive_file
223	444	nr_zone_active_file
0	0	nr_zone_unevictable
122	136	nr_zone_write_pending
0	0	nr_mlock
139	139	nr_page_table_pages
0	0	nr_bounce
0	0	nr_zspages
4032	4032	nr_free_cma
4467	4481	nr_inactive_anon
68	68	nr_active_anon
3189	3067	nr_inactive_file
223	444	nr_active_file
0	0	nr_unevictable
1177	1178	nr_slab_reclaimable
1889	1889	nr_slab_unreclaimable
0	0	nr_isolated_anon
0	0	nr_isolated_file
176	163	workingset_nodes
0	0	workingset_refault_anon
3295	3369	workingset_refault_file
0	0	workingset_activate_anon
4	4	workingset_activate_file
0	0	workingset_restore_anon
4	4	workingset_restore_file
0	0	workingset_nodereclaim
4436	4436	nr_anon_pages
2636	2678	nr_mapped
3559	3645	nr_file_pages
122	136	nr_dirty
0	0	nr_writeback
0	0	nr_writeback_temp
126	126	nr_shmem
0	0	nr_shmem_hugepages
0	0	nr_shmem_pmdmapped
0	0	nr_file_hugepages
0	0	nr_file_pmdmapped
0	0	nr_anon_transparent_hugepages
1	1	nr_vmscan_write
1	1	nr_vmscan_immediate_reclaim
1024	1038	nr_dirtied
902	902	nr_written
0	0	nr_kernel_misc_reclaimable
0	0	nr_foll_pin_acquired
0	0	nr_foll_pin_released
616	616	nr_kernel_stack
10529	10533	nr_dirty_threshold
5258	5260	nr_dirty_background_threshold
50714	256	pgpgin
3932	16	pgpgout
0	0	pswpin
0	0	pswpout
86828	122	pgalloc_normal
0	0	pgalloc_movable
0	0	allocstall_normal
22	0	allocstall_movable
0	0	pgskip_normal
0	0	pgskip_movable
139594	34	pgfree
4998	155	pgactivate
5738	0	pgdeactivate
0	0	pglazyfree
82113	122	pgfault
374	2	pgmajfault
0	0	pglazyfreed
7695	0	pgrefill
2718	20	pgreuse
9261	0	pgsteal_kswapd
173	0	pgsteal_direct
12627	0	pgscan_kswapd
283	0	pgscan_direct
2	0	pgscan_direct_throttle
0	0	pgscan_anon
12910	0	pgscan_file
0	0	pgsteal_anon
9434	0	pgsteal_file
0	0	pginodesteal
7008	0	slabs_scanned
109	0	kswapd_inodesteal
16	0	kswapd_low_wmark_hit_quickly
24	0	kswapd_high_wmark_hit_quickly
43	0	pageoutrun
1	0	pgrotated
0	0	drop_pagecache
0	0	drop_slab
1	0	oom_kill
1210	0	pgmigrate_success
0	0	pgmigrate_fail
0	0	thp_migration_success
0	0	thp_migration_fail
0	0	thp_migration_split
1509	0	compact_migrate_scanned
9015	0	compact_free_scanned
3911	0	compact_isolated
0	0	compact_stall
0	0	compact_fail
0	0	compact_success
3	0	compact_daemon_wake
1509	0	compact_daemon_migrate_scanned
9015	0	compact_daemon_free_scanned
0	0	unevictable_pgs_culled
0	0	unevictable_pgs_scanned
0	0	unevictable_pgs_rescued
0	0	unevictable_pgs_mlocked
0	0	unevictable_pgs_munlocked
0	0	unevictable_pgs_cleared
0	0	unevictable_pgs_stranded
0	0	balloon_inflate
0	0	balloon_deflate
0	0	balloon_migrate
0	0	swap_ra
0	0	swap_ra_hit
0	0	nr_unstable



Thanks,
Pintu
Johannes Weiner March 2, 2021, 3:56 p.m. UTC | #5
On Tue, Mar 02, 2021 at 04:00:34PM +0530, pintu@codeaurora.org wrote:
> On 2021-03-01 20:41, Johannes Weiner wrote:
> > On Mon, Mar 01, 2021 at 04:19:26PM +0530, Pintu Kumar wrote:
> > > At times there is a need to regularly monitor vm counters while we
> > > reproduce some issue, or it could be as simple as gathering some system
> > > statistics for a scenario run, where every time we would like to start
> > > from the beginning.
> > > The current steps are:
> > > Dump /proc/vmstat
> > > Run some scenario
> > > Dump /proc/vmstat again
> > > Generate some data or graph
> > > Reboot and repeat
> > 
> > You can subtract the first vmstat dump from the second to get the
> > event delta for the scenario run. That's what I do, and I'd assume
> > most people are doing. Am I missing something?
> 
> Thanks so much for your comments.
> Yes, in most cases that works.
> 
> But I guess there are times when we need to compare against fresh data
> (just like after a reboot), at least for some of the counters.
> Suppose we want to monitor pgalloc_normal and pgfree.

Hopefully these would already be balanced out pretty well before you
run a test, or there is a risk that whatever outstanding allocations
there are can cause a large number of frees during your test that
don't match up to your recorded allocation events. Resetting to zero
doesn't eliminate the risk of such background noise.

> Or, suppose we want to monitor until a field becomes non-zero.
> Or, see how certain values change compared to a fresh reboot.
> Or, suppose we want to reset all counters after boot and start capturing
> fresh stats.

Again, there simply is no mathematical difference between

	reset events to 0
	run test
	look at events - 0

and

	read events baseline
	run test
	look at events - baseline

> Some of the counters can grow very large and very fast. Is there a
> chance of overflow?
> Then resetting them using this interface could help without rebooting.

Overflows are just a fact of life on 32 bit systems. However, they can
also be trivially handled - you can always subtract a ulong start
state from a ulong end state and get a reliable delta of up to 2^32
events, whether the end state has overflowed or not.
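
A tiny standalone illustration of that modular arithmetic, using a
32-bit unsigned type to make the wrap visible (ulong behaves the same
way at 2^32 or 2^64, depending on the platform):

#include <stdio.h>

int main(void)
{
	/* Counter captured shortly before a 32-bit overflow ... */
	unsigned int start = 4294967290u;	/* 2^32 - 6 */
	/* ... and again after it wrapped past zero. */
	unsigned int end = 14;

	/* end - start is computed modulo 2^32, so this prints
	 * "delta = 20" even though end < start numerically. */
	printf("delta = %u\n", end - start);
	return 0;
}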

The bottom line is that this patch adds a minor convenience for
something that can already be done in userspace. But the downside is
that it would be one more possible source of noise for kernel
developers to consider when looking at a bug report, plus the extra
code and user interface that need to be maintained.

I don't think we should merge this patch.
Pintu Kumar March 3, 2021, 12:25 p.m. UTC | #6
On 2021-03-02 21:26, Johannes Weiner wrote:
> On Tue, Mar 02, 2021 at 04:00:34PM +0530, pintu@codeaurora.org wrote:
>> On 2021-03-01 20:41, Johannes Weiner wrote:
>> > On Mon, Mar 01, 2021 at 04:19:26PM +0530, Pintu Kumar wrote:
>> > > At times there is a need to regularly monitor vm counters while we
>> > > reproduce some issue, or it could be as simple as gathering some system
>> > > statistics for a scenario run, where every time we would like to start
>> > > from the beginning.
>> > > The current steps are:
>> > > Dump /proc/vmstat
>> > > Run some scenario
>> > > Dump /proc/vmstat again
>> > > Generate some data or graph
>> > > Reboot and repeat
>> >
>> > You can subtract the first vmstat dump from the second to get the
>> > event delta for the scenario run. That's what I do, and I'd assume
>> > most people are doing. Am I missing something?
>> 
>> Thanks so much for your comments.
>> Yes, in most cases that works.
>> 
>> But I guess there are times when we need to compare against fresh data
>> (just like after a reboot), at least for some of the counters.
>> Suppose we want to monitor pgalloc_normal and pgfree.
> 
> Hopefully these would already be balanced out pretty well before you
> run a test, or there is a risk that whatever outstanding allocations
> there are can cause a large number of frees during your test that
> don't match up to your recorded allocation events. Resetting to zero
> doesn't eliminate the risk of such background noise.
> 
>> Or, suppose we want to monitor until a field becomes non-zero.
>> Or, see how certain values change compared to a fresh reboot.
>> Or, suppose we want to reset all counters after boot and start
>> capturing fresh stats.
> 
> Again, there simply is no mathematical difference between
> 
> 	reset events to 0
> 	run test
> 	look at events - 0
> 
> and
> 
> 	read events baseline
> 	run test
> 	look at events - baseline
> 
>> Some of the counters can grow very large and very fast. Is there a
>> chance of overflow?
>> Then resetting them using this interface could help without rebooting.
> 
> Overflows are just a fact of life on 32 bit systems. However, they can
> also be trivially handled - you can always subtract a ulong start
> state from a ulong end state and get a reliable delta of up to 2^32
> events, whether the end state has overflowed or not.
> 
> The bottom line is that this patch adds a minor convenience for
> something that can already be done in userspace. But the downside is
> that it would be one more possible source of noise for kernel
> developers to consider when looking at a bug report, plus the extra
> code and user interface that need to be maintained.
> 
> I don't think we should merge this patch.

Okay, no problem. Thank you so much for your review and feedback.
Yes, I agree the benefits are minor, but I thought it might be useful
for someone somewhere.
I worked on it and found it easy and convenient, and thus proposed it.
If others feel it is not important, I am okay with dropping it.

Thanks once again to all who helped to review it.

Patch

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index e35a3f2..54382ab 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -26,6 +26,7 @@  Currently, these files are in /proc/sys/vm:
 
 - admin_reserve_kbytes
 - block_dump
+- clear_all_vm_events
 - compact_memory
 - compaction_proactiveness
 - compact_unevictable_allowed
@@ -112,6 +113,13 @@  block_dump
 block_dump enables block I/O debugging when set to a nonzero value. More
 information on block I/O debugging is in Documentation/admin-guide/laptops/laptop-mode.rst.
 
+clear_all_vm_events
+===================
+
+When 1 is written to this file, all the vm event counters present under
+/proc/vmstat are reset to their default value of zero.
+The next time /proc/vmstat is read, each field reflects only the events
+that occurred after the reset.
 
 compact_memory
 ==============
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 773135f..8ab9be5 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -11,6 +11,8 @@ 
 #include <linux/mmdebug.h>
 
 extern int sysctl_stat_interval;
+extern int sysctl_clearvmevents_handler(struct ctl_table *table, int write,
+	void __user *buffer, size_t *length, loff_t *ppos);
 
 #ifdef CONFIG_NUMA
 #define ENABLE_NUMA_STAT   1
@@ -83,6 +85,8 @@  static inline void count_vm_events(enum vm_event_item item, long delta)
 
 extern void all_vm_events(unsigned long *);
 
+extern void clear_all_vm_events(void);
+
 extern void vm_events_fold_cpu(int cpu);
 
 #else
@@ -103,6 +107,9 @@  static inline void __count_vm_events(enum vm_event_item item, long delta)
 static inline void all_vm_events(unsigned long *ret)
 {
 }
+static inline void clear_all_vm_events(void)
+{
+}
 static inline void vm_events_fold_cpu(int cpu)
 {
 }
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c9fbdd8..25299a4 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -200,6 +200,8 @@  static int min_extfrag_threshold;
 static int max_extfrag_threshold = 1000;
 #endif
 
+static int sysctl_clear_vm_events;
+
 #endif /* CONFIG_SYSCTL */
 
 #if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_SYSCTL)
@@ -2891,6 +2893,13 @@  static struct ctl_table vm_table[] = {
 
 #endif /* CONFIG_COMPACTION */
 	{
+		.procname	= "clear_all_vm_events",
+		.data		= &sysctl_clear_vm_events,
+		.maxlen		= sizeof(int),
+		.mode		= 0200,
+		.proc_handler	= sysctl_clearvmevents_handler,
+	},
+	{
 		.procname	= "min_free_kbytes",
 		.data		= &min_free_kbytes,
 		.maxlen		= sizeof(min_free_kbytes),
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f894216..26b8f81 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -136,6 +136,32 @@  void all_vm_events(unsigned long *ret)
 }
 EXPORT_SYMBOL_GPL(all_vm_events);
 
+void clear_all_vm_events(void)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		struct vm_event_state *this = &per_cpu(vm_event_states, cpu);
+		int sz = NR_VM_EVENT_ITEMS * sizeof(unsigned long);
+
+		memset(this->event, 0, sz);
+	}
+}
+EXPORT_SYMBOL_GPL(clear_all_vm_events);
+
+/*
+ * This is the node to reset all vm events in /proc/vmstat
+ * via /proc/sys/vm/clear_all_vm_events
+ */
+int sysctl_clearvmevents_handler(struct ctl_table *table, int write,
+	void __user *buffer, size_t *length, loff_t *ppos)
+{
+	if (write)
+		clear_all_vm_events();
+
+	return 0;
+}
+
 /*
  * Fold the foreign cpu events into our own.
  *