Message ID: 1614595766-7640-1-git-send-email-pintu@codeaurora.org (mailing list archive)
State: New, archived
Series: mm: introduce clear all vm events counters
On Mon, Mar 01, 2021 at 04:19:26PM +0530, Pintu Kumar wrote:
> +EXPORT_SYMBOL_GPL(clear_all_vm_events);
What module uses this function?
On 2021-03-01 17:43, Matthew Wilcox wrote:
> On Mon, Mar 01, 2021 at 04:19:26PM +0530, Pintu Kumar wrote:
>> +EXPORT_SYMBOL_GPL(clear_all_vm_events);
>
> What module uses this function?

Oh sorry, I will remove the EXPORT.
Thanks for the review.

Regards,
Pintu
On Mon, Mar 01, 2021 at 04:19:26PM +0530, Pintu Kumar wrote:
> At times there is a need to regularly monitor vm counters while we
> reproduce some issue, or it could be as simple as gathering some system
> statistics when we run some scenario and every time we like to start from
> beginning.
> The current steps are:
> Dump /proc/vmstat
> Run some scenario
> Dump /proc/vmstat again
> Generate some data or graph
> reboot and repeat again

You can subtract the first vmstat dump from the second to get the
event delta for the scenario run. That's what I do, and I'd assume
most people are doing. Am I missing something?
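The subtract-the-two-dumps workflow described above is easy to script. The sketch below is only an illustration (the helper names `parse_vmstat` and `vmstat_delta` are made up for this example, not code from the thread): it parses two /proc/vmstat snapshots, taken before and after a test run, and prints the per-counter delta.

```python
def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.strip().splitlines():
        name, value = line.split()
        counters[name] = int(value)
    return counters


def vmstat_delta(before, after):
    """Per-counter difference for counters present in both snapshots."""
    return {name: after[name] - before[name]
            for name in before if name in after}


if __name__ == "__main__":
    # On a real system you would read /proc/vmstat before and after
    # the scenario; sample data is used here for illustration.
    before = parse_vmstat("pgfault 82113\npgmajfault 374\noom_kill 0\n")
    after = parse_vmstat("pgfault 82235\npgmajfault 376\noom_kill 1\n")
    for name, diff in vmstat_delta(before, after).items():
        print(name, diff)
```

This is the "read baseline, run test, subtract baseline" approach: no kernel change or reboot is needed to start a measurement from a clean slate.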
On 2021-03-01 20:41, Johannes Weiner wrote:
> On Mon, Mar 01, 2021 at 04:19:26PM +0530, Pintu Kumar wrote:
>> At times there is a need to regularly monitor vm counters while we
>> reproduce some issue, or it could be as simple as gathering some system
>> statistics when we run some scenario and every time we like to start from
>> beginning.
>> The current steps are:
>> Dump /proc/vmstat
>> Run some scenario
>> Dump /proc/vmstat again
>> Generate some data or graph
>> reboot and repeat again
>
> You can subtract the first vmstat dump from the second to get the
> event delta for the scenario run. That's what I do, and I'd assume
> most people are doing. Am I missing something?

Thanks so much for your comments.
Yes, in most cases it works.

But I guess there are times when we need to compare against fresh data
(just like after a reboot), at least for some of the counters.
Suppose we wanted to monitor pgalloc_normal and pgfree.
Or, suppose we want to monitor until a field becomes non-zero.
Or, how certain values change compared to a fresh reboot.
Or, suppose we want to reset all counters after boot and start capturing
fresh stats.

Some of the counters could be growing too large and too fast. Will there
be chances of overflow? Then resetting using this could help without
rebooting.

Suppose the system came back from hibernation, and we want to reset all
counters again (to capture fresh data)?
Here I am sharing one output (from qemu-arm32 with 256MB RAM) just to
indicate what could be changed.

Scenario: Generate OOM kill case and check oom_kill counter

 BEFORE   AFTER   proc/vmstat
 ------   -----   -----------------------
  49991   49916   nr_free_pages
   4467    4481   nr_zone_inactive_anon
     68      68   nr_zone_active_anon
   3189    3067   nr_zone_inactive_file
    223     444   nr_zone_active_file
      0       0   nr_zone_unevictable
    122     136   nr_zone_write_pending
      0       0   nr_mlock
    139     139   nr_page_table_pages
      0       0   nr_bounce
      0       0   nr_zspages
   4032    4032   nr_free_cma
   4467    4481   nr_inactive_anon
     68      68   nr_active_anon
   3189    3067   nr_inactive_file
    223     444   nr_active_file
      0       0   nr_unevictable
   1177    1178   nr_slab_reclaimable
   1889    1889   nr_slab_unreclaimable
      0       0   nr_isolated_anon
      0       0   nr_isolated_file
    176     163   workingset_nodes
      0       0   workingset_refault_anon
   3295    3369   workingset_refault_file
      0       0   workingset_activate_anon
      4       4   workingset_activate_file
      0       0   workingset_restore_anon
      4       4   workingset_restore_file
      0       0   workingset_nodereclaim
   4436    4436   nr_anon_pages
   2636    2678   nr_mapped
   3559    3645   nr_file_pages
    122     136   nr_dirty
      0       0   nr_writeback
      0       0   nr_writeback_temp
    126     126   nr_shmem
      0       0   nr_shmem_hugepages
      0       0   nr_shmem_pmdmapped
      0       0   nr_file_hugepages
      0       0   nr_file_pmdmapped
      0       0   nr_anon_transparent_hugepages
      1       1   nr_vmscan_write
      1       1   nr_vmscan_immediate_reclaim
   1024    1038   nr_dirtied
    902     902   nr_written
      0       0   nr_kernel_misc_reclaimable
      0       0   nr_foll_pin_acquired
      0       0   nr_foll_pin_released
    616     616   nr_kernel_stack
  10529   10533   nr_dirty_threshold
   5258    5260   nr_dirty_background_threshold
  50714     256   pgpgin
   3932      16   pgpgout
      0       0   pswpin
      0       0   pswpout
  86828     122   pgalloc_normal
      0       0   pgalloc_movable
      0       0   allocstall_normal
     22       0   allocstall_movable
      0       0   pgskip_normal
      0       0   pgskip_movable
 139594      34   pgfree
   4998     155   pgactivate
   5738       0   pgdeactivate
      0       0   pglazyfree
  82113     122   pgfault
    374       2   pgmajfault
      0       0   pglazyfreed
   7695       0   pgrefill
   2718      20   pgreuse
   9261       0   pgsteal_kswapd
    173       0   pgsteal_direct
  12627       0   pgscan_kswapd
    283       0   pgscan_direct
      2       0   pgscan_direct_throttle
      0       0   pgscan_anon
  12910       0   pgscan_file
      0       0   pgsteal_anon
   9434       0   pgsteal_file
      0       0   pginodesteal
   7008       0   slabs_scanned
    109       0   kswapd_inodesteal
     16       0   kswapd_low_wmark_hit_quickly
     24       0   kswapd_high_wmark_hit_quickly
     43       0   pageoutrun
      1       0   pgrotated
      0       0   drop_pagecache
      0       0   drop_slab
      1       0   oom_kill
   1210       0   pgmigrate_success
      0       0   pgmigrate_fail
      0       0   thp_migration_success
      0       0   thp_migration_fail
      0       0   thp_migration_split
   1509       0   compact_migrate_scanned
   9015       0   compact_free_scanned
   3911       0   compact_isolated
      0       0   compact_stall
      0       0   compact_fail
      0       0   compact_success
      3       0   compact_daemon_wake
   1509       0   compact_daemon_migrate_scanned
   9015       0   compact_daemon_free_scanned
      0       0   unevictable_pgs_culled
      0       0   unevictable_pgs_scanned
      0       0   unevictable_pgs_rescued
      0       0   unevictable_pgs_mlocked
      0       0   unevictable_pgs_munlocked
      0       0   unevictable_pgs_cleared
      0       0   unevictable_pgs_stranded
      0       0   balloon_inflate
      0       0   balloon_deflate
      0       0   balloon_migrate
      0       0   swap_ra
      0       0   swap_ra_hit
      0       0   nr_unstable

Thanks,
Pintu
On Tue, Mar 02, 2021 at 04:00:34PM +0530, pintu@codeaurora.org wrote:
> On 2021-03-01 20:41, Johannes Weiner wrote:
> > On Mon, Mar 01, 2021 at 04:19:26PM +0530, Pintu Kumar wrote:
> > > At times there is a need to regularly monitor vm counters while we
> > > reproduce some issue, or it could be as simple as gathering some
> > > system statistics when we run some scenario and every time we like
> > > to start from beginning.
> > > The current steps are:
> > > Dump /proc/vmstat
> > > Run some scenario
> > > Dump /proc/vmstat again
> > > Generate some data or graph
> > > reboot and repeat again
> >
> > You can subtract the first vmstat dump from the second to get the
> > event delta for the scenario run. That's what I do, and I'd assume
> > most people are doing. Am I missing something?
>
> Thanks so much for your comments.
> Yes in most cases it works.
>
> But I guess there are sometimes where we need to compare with fresh data
> (just like reboot) at least for some of the counters.
> Suppose we wanted to monitor pgalloc_normal and pgfree.

Hopefully these would already be balanced out pretty well before you
run a test, or there is a risk that whatever outstanding allocations
there are can cause a large number of frees during your test that
don't match up to your recorded allocation events. Resetting to zero
doesn't eliminate the risk of such background noise.

> Or, suppose we want to monitor until the field becomes non-zero..
> Or, how certain values are changing compared to fresh reboot.
> Or, suppose we want to reset all counters after boot and start capturing
> fresh stats.

Again, there simply is no mathematical difference between

	reset events to 0
	run test
	look at events - 0

and

	read events baseline
	run test
	look at events - baseline

> Some of the counters could be growing too large and too fast. Will there
> be chances of overflow ?
> Then resetting using this could help without rebooting.

Overflows are just a fact of life on 32 bit systems. However, they can
also be trivially handled - you can always subtract a ulong start
state from a ulong end state and get a reliable delta of up to 2^32
events, whether the end state has overflowed or not.

The bottom line is that the benefit of this patch adds a minor
convenience for something that can already be done in userspace. But
the downside is that there would be one more possible source of noise
for kernel developers to consider when looking at a bug report. Plus
the extra code and user interface that need to be maintained.

I don't think we should merge this patch.
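The wraparound arithmetic described above can be checked with a few lines. This sketch (an illustration only, not code from the thread) models a 32-bit unsigned long counter: reducing the end-minus-start difference modulo 2^32 recovers the true delta whether or not the counter overflowed in between, as long as fewer than 2^32 events occurred.

```python
ULONG_BITS = 32            # unsigned long width on a 32-bit kernel
ULONG_MOD = 1 << ULONG_BITS


def counter_delta(start, end):
    """Delta between two wrapping unsigned-long counter readings.

    Correct for up to 2**32 - 1 events between the two readings,
    whether or not the counter overflowed in between.
    """
    return (end - start) % ULONG_MOD


# A counter read shortly before overflow...
start = ULONG_MOD - 5            # 4294967291
# ...and again after 15 more events, having wrapped around to 10.
end = (start + 15) % ULONG_MOD
print(counter_delta(start, end))  # 15, despite the wraparound
```

This is the same modular arithmetic the CPU performs on native unsigned subtraction, which is why the userspace baseline-subtraction workflow keeps working across overflows without any counter reset.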
On 2021-03-02 21:26, Johannes Weiner wrote:
> On Tue, Mar 02, 2021 at 04:00:34PM +0530, pintu@codeaurora.org wrote:
>> On 2021-03-01 20:41, Johannes Weiner wrote:
>> > On Mon, Mar 01, 2021 at 04:19:26PM +0530, Pintu Kumar wrote:
>> > > At times there is a need to regularly monitor vm counters while we
>> > > reproduce some issue, or it could be as simple as gathering some
>> > > system statistics when we run some scenario and every time we like
>> > > to start from beginning.
>> > > The current steps are:
>> > > Dump /proc/vmstat
>> > > Run some scenario
>> > > Dump /proc/vmstat again
>> > > Generate some data or graph
>> > > reboot and repeat again
>> >
>> > You can subtract the first vmstat dump from the second to get the
>> > event delta for the scenario run. That's what I do, and I'd assume
>> > most people are doing. Am I missing something?
>>
>> Thanks so much for your comments.
>> Yes in most cases it works.
>>
>> But I guess there are sometimes where we need to compare with fresh
>> data (just like reboot) at least for some of the counters.
>> Suppose we wanted to monitor pgalloc_normal and pgfree.
>
> Hopefully these would already be balanced out pretty well before you
> run a test, or there is a risk that whatever outstanding allocations
> there are can cause a large number of frees during your test that
> don't match up to your recorded allocation events. Resetting to zero
> doesn't eliminate the risk of such background noise.
>
>> Or, suppose we want to monitor until the field becomes non-zero..
>> Or, how certain values are changing compared to fresh reboot.
>> Or, suppose we want to reset all counters after boot and start
>> capturing fresh stats.
>
> Again, there simply is no mathematical difference between
>
>     reset events to 0
>     run test
>     look at events - 0
>
> and
>
>     read events baseline
>     run test
>     look at events - baseline
>
>> Some of the counters could be growing too large and too fast.
>> Will there be chances of overflow ?
>> Then resetting using this could help without rebooting.
>
> Overflows are just a fact of life on 32 bit systems. However, they can
> also be trivially handled - you can always subtract a ulong start
> state from a ulong end state and get a reliable delta of up to 2^32
> events, whether the end state has overflowed or not.
>
> The bottom line is that the benefit of this patch adds a minor
> convenience for something that can already be done in userspace. But
> the downside is that there would be one more possible source of noise
> for kernel developers to consider when looking at a bug report. Plus
> the extra code and user interface that need to be maintained.
>
> I don't think we should merge this patch.

Okay, no problem. Thank you so much for your review and feedback.
Yes, I agree the benefits are minor, but I thought it might be useful
for someone somewhere. I worked on it and found it easy and convenient,
and thus proposed it. If others feel it is not important, I am okay to
drop it.

Thanks once again to all who helped to review it.
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index e35a3f2..54382ab 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -26,6 +26,7 @@ Currently, these files are in /proc/sys/vm:
 
 - admin_reserve_kbytes
 - block_dump
+- clear_all_vm_events
 - compact_memory
 - compaction_proactiveness
 - compact_unevictable_allowed
@@ -112,6 +113,13 @@ block_dump
 block_dump enables block I/O debugging when set to a nonzero value. More
 information on block I/O debugging is in
 Documentation/admin-guide/laptops/laptop-mode.rst.
 
+clear_all_vm_events
+===================
+
+When 1 is written to this file, it resets all the vm counters present under
+/proc/vmstat to its default value or zeros.
+The next time when you read /proc/vmstat, each node is filled with default or
+current value.
 
 compact_memory
 ==============
diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 773135f..8ab9be5 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -11,6 +11,8 @@
 #include <linux/mmdebug.h>
 
 extern int sysctl_stat_interval;
+extern int sysctl_clearvmevents_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *length, loff_t *ppos);
 
 #ifdef CONFIG_NUMA
 #define ENABLE_NUMA_STAT 1
@@ -83,6 +85,8 @@ static inline void count_vm_events(enum vm_event_item item, long delta)
 
 extern void all_vm_events(unsigned long *);
 
+extern void clear_all_vm_events(void);
+
 extern void vm_events_fold_cpu(int cpu);
 
 #else
@@ -103,6 +107,9 @@ static inline void __count_vm_events(enum vm_event_item item, long delta)
 static inline void all_vm_events(unsigned long *ret)
 {
 }
+static inline void clear_all_vm_events(void)
+{
+}
 static inline void vm_events_fold_cpu(int cpu)
 {
 }
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c9fbdd8..25299a4 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -200,6 +200,8 @@ static int min_extfrag_threshold;
 static int max_extfrag_threshold = 1000;
 #endif
 
+static int sysctl_clear_vm_events;
+
 #endif /* CONFIG_SYSCTL */
 
 #if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_SYSCTL)
@@ -2891,6 +2893,13 @@ static struct ctl_table vm_table[] = {
 #endif /* CONFIG_COMPACTION */
 	{
+		.procname	= "clear_all_vm_events",
+		.data		= &sysctl_clear_vm_events,
+		.maxlen		= sizeof(int),
+		.mode		= 0200,
+		.proc_handler	= sysctl_clearvmevents_handler,
+	},
+	{
 		.procname	= "min_free_kbytes",
 		.data		= &min_free_kbytes,
 		.maxlen		= sizeof(min_free_kbytes),
diff --git a/mm/vmstat.c b/mm/vmstat.c
index f894216..26b8f81 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -136,6 +136,32 @@ void all_vm_events(unsigned long *ret)
 }
 EXPORT_SYMBOL_GPL(all_vm_events);
 
+void clear_all_vm_events(void)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		struct vm_event_state *this = &per_cpu(vm_event_states, cpu);
+		int sz = NR_VM_EVENT_ITEMS * sizeof(unsigned long);
+
+		memset(this->event, 0, sz);
+	}
+}
+EXPORT_SYMBOL_GPL(clear_all_vm_events);
+
+/*
+ * This is the node to reset all vm events in /proc/vmstat
+ * via /proc/sys/vm/clear_all_vm_events
+ */
+int sysctl_clearvmevents_handler(struct ctl_table *table, int write,
+		void __user *buffer, size_t *length, loff_t *ppos)
+{
+	if (write)
+		clear_all_vm_events();
+
+	return 0;
+}
+
 /*
  * Fold the foreign cpu events into our own.
  *