[2/2] MM: Discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead.
diff mbox series

Message ID 878sj8w55y.fsf@notabene.neil.brown.name
State New
Headers show
Series
  • [1/2] MM: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE
Related show

Commit Message

NeilBrown April 6, 2020, 11:44 p.m. UTC
After an NFS page has been written it is considered "unstable" until a
COMMIT request succeeds.  If the COMMIT fails, the page will be
re-written.

These "unstable" pages are currently accounted as "reclaimable", either
in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a
'reclaimable' count.  This might have made sense when sending the COMMIT
required a separate action by the VFS/MM (e.g.  releasepage() used to
send a COMMIT).  However now that all writes generated by ->writepages()
will automatically be followed by a COMMIT (since commit 919e3bd9a875
("NFS: Ensure we commit after writeback is complete")) it makes more
sense to treat them as writeback pages.

So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in
NR_WRITEBACK and WB_WRITEBACK.

A particular effect of this change is that when
wb_check_background_flush() calls wb_over_bg_threshold(), the latter
will report 'true' a lot less often as the 'unstable' pages are no
longer considered 'dirty' (as there is nothing that writeback can do
about them anyway).

Currently wb_check_background_flush() will trigger writeback to NFS even
when there are relatively few dirty pages (if there are lots of unstable
pages), this can result in small writes going to the server (10s of
Kilobytes rather than a Megabyte) which hurts throughput.
With this patch, there are fewer writes which are each larger on average.

Where the NR_UNSTABLE_NFS count was included in statistics
virtual-files, the entry is retained, but the value is hard-coded as
zero.  static trace points which record this counter no longer report
it.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 Documentation/filesystems/proc.rst |  4 ++--
 drivers/base/node.c                |  2 +-
 fs/fs-writeback.c                  |  1 -
 fs/nfs/internal.h                  | 10 +++++++---
 fs/nfs/write.c                     |  4 ++--
 fs/proc/meminfo.c                  |  3 +--
 include/linux/mmzone.h             |  1 -
 include/trace/events/writeback.h   |  5 +----
 mm/memcontrol.c                    |  1 -
 mm/page-writeback.c                | 17 ++++-------------
 mm/page_alloc.c                    |  6 ++----
 mm/vmstat.c                        | 13 ++++++++++---
 12 files changed, 30 insertions(+), 37 deletions(-)

Comments

Trond Myklebust April 7, 2020, 2:34 a.m. UTC | #1
On Tue, 2020-04-07 at 09:44 +1000, NeilBrown wrote:
> After an NFS page has been written it is considered "unstable" until
> a
> COMMIT request succeeds.  If the COMMIT fails, the page will be
> re-written.
> 
> These "unstable" pages are currently accounted as "reclaimable",
> either
> in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a
> 'reclaimable' count.  This might have made sense when sending the
> COMMIT
> required a separate action by the VFS/MM (e.g.  releasepage() used to
> send a COMMIT).  However now that all writes generated by
> ->writepages()
> will automatically be followed by a COMMIT (since commit 919e3bd9a875
> ("NFS: Ensure we commit after writeback is complete")) it makes more
> sense to treat them as writeback pages.
> 
> So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in
> NR_WRITEBACK and WB_WRITEBACK.
> 
> A particular effect of this change is that when
> wb_check_background_flush() calls wb_over_bg_threshold(), the latter
> will report 'true' a lot less often as the 'unstable' pages are no
> longer considered 'dirty' (as there is nothing that writeback can do
> about them anyway).
> 
> Currently wb_check_background_flush() will trigger writeback to NFS
> even
> when there are relatively few dirty pages (if there are lots of
> unstable
> pages), this can result in small writes going to the server (10s of
> Kilobytes rather than a Megabyte) which hurts throughput.
> With this patch, there are fewer writes which are each larger on
> average.
> 
> Where the NR_UNSTABLE_NFS count was included in statistics
> virtual-files, the entry is retained, but the value is hard-coded as
> zero.  static trace points which record this counter no longer report
> it.

I'm OK with this. It puts the control at the right level IMO. I'm
assuming it will go through the MM maintainers, so:

Acked-by: Trond Myklebust <trond.myklebust@hammerspace.com>

> 
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
>  Documentation/filesystems/proc.rst |  4 ++--
>  drivers/base/node.c                |  2 +-
>  fs/fs-writeback.c                  |  1 -
>  fs/nfs/internal.h                  | 10 +++++++---
>  fs/nfs/write.c                     |  4 ++--
>  fs/proc/meminfo.c                  |  3 +--
>  include/linux/mmzone.h             |  1 -
>  include/trace/events/writeback.h   |  5 +----
>  mm/memcontrol.c                    |  1 -
>  mm/page-writeback.c                | 17 ++++-------------
>  mm/page_alloc.c                    |  6 ++----
>  mm/vmstat.c                        | 13 ++++++++++---
>  12 files changed, 30 insertions(+), 37 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.rst
> b/Documentation/filesystems/proc.rst
> index 38b606991065..092b7b44d158 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -1042,8 +1042,8 @@ PageTables
>                amount of memory dedicated to the lowest level of page
>                tables.
>  NFS_Unstable
> -              NFS pages sent to the server, but not yet committed to
> stable
> -	      storage
> +              Always zero. Previous counted pages which had been
> written to
> +              the server, but has not been committed to stable
> storage.
>  Bounce
>                Memory used for block device "bounce buffers"
>  WritebackTmp
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 10d7e818e118..15f5ed6a8830 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -439,7 +439,7 @@ static ssize_t node_read_meminfo(struct device
> *dev,
>  		       nid, K(i.sharedram),
>  		       nid, sum_zone_node_page_state(nid,
> NR_KERNEL_STACK_KB),
>  		       nid, K(sum_zone_node_page_state(nid,
> NR_PAGETABLE)),
> -		       nid, K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
> +		       nid, 0,
>  		       nid, K(sum_zone_node_page_state(nid,
> NR_BOUNCE)),
>  		       nid, K(node_page_state(pgdat,
> NR_WRITEBACK_TEMP)),
>  		       nid, K(sreclaimable +
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 76ac9c7d32ec..c5bdf46e3b4b 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1070,7 +1070,6 @@ static void bdi_split_work_to_wbs(struct
> backing_dev_info *bdi,
>  static unsigned long get_nr_dirty_pages(void)
>  {
>  	return global_node_page_state(NR_FILE_DIRTY) +
> -		global_node_page_state(NR_UNSTABLE_NFS) +
>  		get_nr_dirty_inodes();
>  }
>  
> diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
> index f80c47d5ff27..749da02b547a 100644
> --- a/fs/nfs/internal.h
> +++ b/fs/nfs/internal.h
> @@ -652,7 +652,8 @@ void nfs_super_set_maxbytes(struct super_block
> *sb, __u64 maxfilesize)
>  }
>  
>  /*
> - * Record the page as unstable and mark its inode as dirty.
> + * Record the page as unstable (an extra writeback period) and mark
> its
> + * inode as dirty.
>   */
>  static inline
>  void nfs_mark_page_unstable(struct page *page, struct
> nfs_commit_info *cinfo)
> @@ -660,8 +661,11 @@ void nfs_mark_page_unstable(struct page *page,
> struct nfs_commit_info *cinfo)
>  	if (!cinfo->dreq) {
>  		struct inode *inode = page_file_mapping(page)->host;
>  
> -		inc_node_page_state(page, NR_UNSTABLE_NFS);
> -		inc_wb_stat(&inode_to_bdi(inode)->wb, WB_RECLAIMABLE);
> +		/* This page is really still in write-back - just that
> the
> +		 * writeback is happening on the server now.
> +		 */
> +		inc_node_page_state(page, NR_WRITEBACK);
> +		inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
>  		__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
>  	}
>  }
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index c478b772cc49..2e15a56620b3 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -958,9 +958,9 @@ nfs_mark_request_commit(struct nfs_page *req,
> struct pnfs_layout_segment *lseg,
>  static void
>  nfs_clear_page_commit(struct page *page)
>  {
> -	dec_node_page_state(page, NR_UNSTABLE_NFS);
> +	dec_node_page_state(page, NR_WRITEBACK);
>  	dec_wb_stat(&inode_to_bdi(page_file_mapping(page)->host)->wb,
> -		    WB_RECLAIMABLE);
> +		    WB_WRITEBACK);
>  }
>  
>  /* Called holding the request lock on @req */
> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> index 8c1f1bb1a5ce..9bd94b5a9658 100644
> --- a/fs/proc/meminfo.c
> +++ b/fs/proc/meminfo.c
> @@ -106,8 +106,7 @@ static int meminfo_proc_show(struct seq_file *m,
> void *v)
>  	show_val_kb(m, "PageTables:     ",
>  		    global_zone_page_state(NR_PAGETABLE));
>  
> -	show_val_kb(m, "NFS_Unstable:   ",
> -		    global_node_page_state(NR_UNSTABLE_NFS));
> +	show_val_kb(m, "NFS_Unstable:   ", 0);
>  	show_val_kb(m, "Bounce:         ",
>  		    global_zone_page_state(NR_BOUNCE));
>  	show_val_kb(m, "WritebackTmp:   ",
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index e84d448988b6..3937f2be27d8 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -237,7 +237,6 @@ enum node_stat_item {
>  	NR_FILE_THPS,
>  	NR_FILE_PMDMAPPED,
>  	NR_ANON_THPS,
> -	NR_UNSTABLE_NFS,	/* NFS unstable pages */
>  	NR_VMSCAN_WRITE,
>  	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when
> writeback ends */
>  	NR_DIRTIED,		/* page dirtyings since bootup */
> diff --git a/include/trace/events/writeback.h
> b/include/trace/events/writeback.h
> index d94def25e4dc..45b5fbdb1f62 100644
> --- a/include/trace/events/writeback.h
> +++ b/include/trace/events/writeback.h
> @@ -542,7 +542,6 @@ TRACE_EVENT(global_dirty_state,
>  	TP_STRUCT__entry(
>  		__field(unsigned long,	nr_dirty)
>  		__field(unsigned long,	nr_writeback)
> -		__field(unsigned long,	nr_unstable)
>  		__field(unsigned long,	background_thresh)
>  		__field(unsigned long,	dirty_thresh)
>  		__field(unsigned long,	dirty_limit)
> @@ -553,7 +552,6 @@ TRACE_EVENT(global_dirty_state,
>  	TP_fast_assign(
>  		__entry->nr_dirty	=
> global_node_page_state(NR_FILE_DIRTY);
>  		__entry->nr_writeback	=
> global_node_page_state(NR_WRITEBACK);
> -		__entry->nr_unstable	=
> global_node_page_state(NR_UNSTABLE_NFS);
>  		__entry->nr_dirtied	=
> global_node_page_state(NR_DIRTIED);
>  		__entry->nr_written	=
> global_node_page_state(NR_WRITTEN);
>  		__entry->background_thresh = background_thresh;
> @@ -561,12 +559,11 @@ TRACE_EVENT(global_dirty_state,
>  		__entry->dirty_limit	=
> global_wb_domain.dirty_limit;
>  	),
>  
> -	TP_printk("dirty=%lu writeback=%lu unstable=%lu "
> +	TP_printk("dirty=%lu writeback=%lu "
>  		  "bg_thresh=%lu thresh=%lu limit=%lu "
>  		  "dirtied=%lu written=%lu",
>  		  __entry->nr_dirty,
>  		  __entry->nr_writeback,
> -		  __entry->nr_unstable,
>  		  __entry->background_thresh,
>  		  __entry->dirty_thresh,
>  		  __entry->dirty_limit,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ca194864d802..41b450b0ca29 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4326,7 +4326,6 @@ void mem_cgroup_wb_stats(struct bdi_writeback
> *wb, unsigned long *pfilepages,
>  
>  	*pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY);
>  
> -	/* this should eventually include NR_UNSTABLE_NFS */
>  	*pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK);
>  	*pfilepages = memcg_exact_page_state(memcg, NR_INACTIVE_FILE) +
>  			memcg_exact_page_state(memcg, NR_ACTIVE_FILE);
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 4c9875971de5..d16f6a59bce4 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -504,7 +504,6 @@ bool node_dirty_ok(struct pglist_data *pgdat)
>  	unsigned long nr_pages = 0;
>  
>  	nr_pages += node_page_state(pgdat, NR_FILE_DIRTY);
> -	nr_pages += node_page_state(pgdat, NR_UNSTABLE_NFS);
>  	nr_pages += node_page_state(pgdat, NR_WRITEBACK);
>  
>  	return nr_pages <= limit;
> @@ -758,7 +757,7 @@ static void mdtc_calc_avail(struct
> dirty_throttle_control *mdtc,
>   * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters,
> if set.
>   *
>   * Return: @wb's dirty limit in pages. The term "dirty" in the
> context of
> - * dirty balancing includes all PG_dirty, PG_writeback and NFS
> unstable pages.
> + * dirty balancing includes all PG_dirty and PG_writeback pages.
>   */
>  static unsigned long __wb_calc_thresh(struct dirty_throttle_control
> *dtc)
>  {
> @@ -1566,7 +1565,7 @@ static void balance_dirty_pages(struct
> bdi_writeback *wb,
>  	struct dirty_throttle_control * const mdtc =
> mdtc_valid(&mdtc_stor) ?
>  						     &mdtc_stor : NULL;
>  	struct dirty_throttle_control *sdtc;
> -	unsigned long nr_reclaimable;	/* = file_dirty +
> unstable_nfs */
> +	unsigned long nr_reclaimable;	/* = file_dirty */
>  	long period;
>  	long pause;
>  	long max_pause;
> @@ -1589,14 +1588,7 @@ static void balance_dirty_pages(struct
> bdi_writeback *wb,
>  		unsigned long m_thresh = 0;
>  		unsigned long m_bg_thresh = 0;
>  
> -		/*
> -		 * Unstable writes are a feature of certain networked
> -		 * filesystems (i.e. NFS) in which data may have been
> -		 * written to the server's write cache, but has not yet
> -		 * been flushed to permanent storage.
> -		 */
> -		nr_reclaimable = global_node_page_state(NR_FILE_DIRTY)
> +
> -					global_node_page_state(NR_UNSTA
> BLE_NFS);
> +		nr_reclaimable = global_node_page_state(NR_FILE_DIRTY);
>  		gdtc->avail = global_dirtyable_memory();
>  		gdtc->dirty = nr_reclaimable +
> global_node_page_state(NR_WRITEBACK);
>  
> @@ -1940,8 +1932,7 @@ bool wb_over_bg_thresh(struct bdi_writeback
> *wb)
>  	 * as we're trying to decide whether to put more under
> writeback.
>  	 */
>  	gdtc->avail = global_dirtyable_memory();
> -	gdtc->dirty = global_node_page_state(NR_FILE_DIRTY) +
> -		      global_node_page_state(NR_UNSTABLE_NFS);
> +	gdtc->dirty = global_node_page_state(NR_FILE_DIRTY);
>  	domain_dirty_limits(gdtc);
>  
>  	if (gdtc->dirty > gdtc->bg_thresh)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e5f76da8cd4e..24678d6e308d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5237,7 +5237,7 @@ void show_free_areas(unsigned int filter,
> nodemask_t *nodemask)
>  
>  	printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
>  		" active_file:%lu inactive_file:%lu
> isolated_file:%lu\n"
> -		" unevictable:%lu dirty:%lu writeback:%lu
> unstable:%lu\n"
> +		" unevictable:%lu dirty:%lu writeback:%lu unstable:0\n"
>  		" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
>  		" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
>  		" free:%lu free_pcp:%lu free_cma:%lu\n",
> @@ -5250,7 +5250,6 @@ void show_free_areas(unsigned int filter,
> nodemask_t *nodemask)
>  		global_node_page_state(NR_UNEVICTABLE),
>  		global_node_page_state(NR_FILE_DIRTY),
>  		global_node_page_state(NR_WRITEBACK),
> -		global_node_page_state(NR_UNSTABLE_NFS),
>  		global_node_page_state(NR_SLAB_RECLAIMABLE),
>  		global_node_page_state(NR_SLAB_UNRECLAIMABLE),
>  		global_node_page_state(NR_FILE_MAPPED),
> @@ -5283,7 +5282,7 @@ void show_free_areas(unsigned int filter,
> nodemask_t *nodemask)
>  			" anon_thp: %lukB"
>  #endif
>  			" writeback_tmp:%lukB"
> -			" unstable:%lukB"
> +			" unstable:0kB"
>  			" all_unreclaimable? %s"
>  			"\n",
>  			pgdat->node_id,
> @@ -5305,7 +5304,6 @@ void show_free_areas(unsigned int filter,
> nodemask_t *nodemask)
>  			K(node_page_state(pgdat, NR_ANON_THPS) *
> HPAGE_PMD_NR),
>  #endif
>  			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
> -			K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
>  			pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
>  				"yes" : "no");
>  	}
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index c9c0d71f917f..228d9f6e1c5c 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1108,7 +1108,7 @@ int fragmentation_index(struct zone *zone,
> unsigned int order)
>  					TEXT_FOR_HIGHMEM(xx) xx
> "_movable",
>  
>  const char * const vmstat_text[] = {
> -	/* enum zone_stat_item countes */
> +	/* enum zone_stat_item counters */
>  	"nr_free_pages",
>  	"nr_zone_inactive_anon",
>  	"nr_zone_active_anon",
> @@ -1162,7 +1162,6 @@ const char * const vmstat_text[] = {
>  	"nr_file_hugepages",
>  	"nr_file_pmdmapped",
>  	"nr_anon_transparent_hugepages",
> -	"nr_unstable",
>  	"nr_vmscan_write",
>  	"nr_vmscan_immediate_reclaim",
>  	"nr_dirtied",
> @@ -1707,8 +1706,16 @@ static void *vmstat_start(struct seq_file *m,
> loff_t *pos)
>  static void *vmstat_next(struct seq_file *m, void *arg, loff_t *pos)
>  {
>  	(*pos)++;
> -	if (*pos >= NR_VMSTAT_ITEMS)
> +	if (*pos >= NR_VMSTAT_ITEMS) {
> +		/*
> +		 * Deprecated counters which are no longer represented
> +		 * in vmstat arrays. We just lie about them to be
> always
> +		 * 0 to not break userspace which might expect them in
> +		 * the output.
> +		 */
> +		seq_puts(m, "nr_unstable 0");
>  		return NULL;
> +	}
>  	return (unsigned long *)m->private + *pos;
>  }
>
Michal Hocko April 7, 2020, 9:19 a.m. UTC | #2
On Tue 07-04-20 09:44:25, Neil Brown wrote:
> 
> After an NFS page has been written it is considered "unstable" until a
> COMMIT request succeeds.  If the COMMIT fails, the page will be
> re-written.
> 
> These "unstable" pages are currently accounted as "reclaimable", either
> in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a
> 'reclaimable' count.  This might have made sense when sending the COMMIT
> required a separate action by the VFS/MM (e.g.  releasepage() used to
> send a COMMIT).  However now that all writes generated by ->writepages()
> will automatically be followed by a COMMIT (since commit 919e3bd9a875
> ("NFS: Ensure we commit after writeback is complete")) it makes more
> sense to treat them as writeback pages.
> 
> So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in
> NR_WRITEBACK and WB_WRITEBACK.
> 
> A particular effect of this change is that when
> wb_check_background_flush() calls wb_over_bg_threshold(), the latter
> will report 'true' a lot less often as the 'unstable' pages are no
> longer considered 'dirty' (as there is nothing that writeback can do
> about them anyway).
> 
> Currently wb_check_background_flush() will trigger writeback to NFS even
> when there are relatively few dirty pages (if there are lots of unstable
> pages), this can result in small writes going to the server (10s of
> Kilobytes rather than a Megabyte) which hurts throughput.
> With this patch, there are fewer writes which are each larger on average.
> 
> Where the NR_UNSTABLE_NFS count was included in statistics
> virtual-files, the entry is retained, but the value is hard-coded as
> zero.  static trace points which record this counter no longer report
> it.

I do not have sufficient insight to nfs so I cannot judge that part but
the core MM changes make sense and I do not see any problems there. It
is PITA to keep the counter in user visible interfaces like meminfo and
vmstat but I believe this really makes sense here as this is a counter
that is usually considered. Maybe its non-existence will not be fatal
for existing scripts but risking that is not worth it now. Maybe we can
drop the fake value in future.
 
> Signed-off-by: NeilBrown <neilb@suse.de>

Acked-by: Michal Hocko <mhocko@suse.com> # for MM parts

Thanks!
> ---
>  Documentation/filesystems/proc.rst |  4 ++--
>  drivers/base/node.c                |  2 +-
>  fs/fs-writeback.c                  |  1 -
>  fs/nfs/internal.h                  | 10 +++++++---
>  fs/nfs/write.c                     |  4 ++--
>  fs/proc/meminfo.c                  |  3 +--
>  include/linux/mmzone.h             |  1 -
>  include/trace/events/writeback.h   |  5 +----
>  mm/memcontrol.c                    |  1 -
>  mm/page-writeback.c                | 17 ++++-------------
>  mm/page_alloc.c                    |  6 ++----
>  mm/vmstat.c                        | 13 ++++++++++---
>  12 files changed, 30 insertions(+), 37 deletions(-)
> 
> diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> index 38b606991065..092b7b44d158 100644
> --- a/Documentation/filesystems/proc.rst
> +++ b/Documentation/filesystems/proc.rst
> @@ -1042,8 +1042,8 @@ PageTables
>                amount of memory dedicated to the lowest level of page
>                tables.
>  NFS_Unstable
> -              NFS pages sent to the server, but not yet committed to stable
> -	      storage
> +              Always zero. Previous counted pages which had been written to
> +              the server, but has not been committed to stable storage.
>  Bounce
>                Memory used for block device "bounce buffers"
>  WritebackTmp
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 10d7e818e118..15f5ed6a8830 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -439,7 +439,7 @@ static ssize_t node_read_meminfo(struct device *dev,
>  		       nid, K(i.sharedram),
>  		       nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK_KB),
>  		       nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)),
> -		       nid, K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
> +		       nid, 0,
>  		       nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)),
>  		       nid, K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
>  		       nid, K(sreclaimable +
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 76ac9c7d32ec..c5bdf46e3b4b 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -1070,7 +1070,6 @@ static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
>  static unsigned long get_nr_dirty_pages(void)
>  {
>  	return global_node_page_state(NR_FILE_DIRTY) +
> -		global_node_page_state(NR_UNSTABLE_NFS) +
>  		get_nr_dirty_inodes();
>  }
>  
> diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
> index f80c47d5ff27..749da02b547a 100644
> --- a/fs/nfs/internal.h
> +++ b/fs/nfs/internal.h
> @@ -652,7 +652,8 @@ void nfs_super_set_maxbytes(struct super_block *sb, __u64 maxfilesize)
>  }
>  
>  /*
> - * Record the page as unstable and mark its inode as dirty.
> + * Record the page as unstable (an extra writeback period) and mark its
> + * inode as dirty.
>   */
>  static inline
>  void nfs_mark_page_unstable(struct page *page, struct nfs_commit_info *cinfo)
> @@ -660,8 +661,11 @@ void nfs_mark_page_unstable(struct page *page, struct nfs_commit_info *cinfo)
>  	if (!cinfo->dreq) {
>  		struct inode *inode = page_file_mapping(page)->host;
>  
> -		inc_node_page_state(page, NR_UNSTABLE_NFS);
> -		inc_wb_stat(&inode_to_bdi(inode)->wb, WB_RECLAIMABLE);
> +		/* This page is really still in write-back - just that the
> +		 * writeback is happening on the server now.
> +		 */
> +		inc_node_page_state(page, NR_WRITEBACK);
> +		inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
>  		__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
>  	}
>  }
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index c478b772cc49..2e15a56620b3 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -958,9 +958,9 @@ nfs_mark_request_commit(struct nfs_page *req, struct pnfs_layout_segment *lseg,
>  static void
>  nfs_clear_page_commit(struct page *page)
>  {
> -	dec_node_page_state(page, NR_UNSTABLE_NFS);
> +	dec_node_page_state(page, NR_WRITEBACK);
>  	dec_wb_stat(&inode_to_bdi(page_file_mapping(page)->host)->wb,
> -		    WB_RECLAIMABLE);
> +		    WB_WRITEBACK);
>  }
>  
>  /* Called holding the request lock on @req */
> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> index 8c1f1bb1a5ce..9bd94b5a9658 100644
> --- a/fs/proc/meminfo.c
> +++ b/fs/proc/meminfo.c
> @@ -106,8 +106,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
>  	show_val_kb(m, "PageTables:     ",
>  		    global_zone_page_state(NR_PAGETABLE));
>  
> -	show_val_kb(m, "NFS_Unstable:   ",
> -		    global_node_page_state(NR_UNSTABLE_NFS));
> +	show_val_kb(m, "NFS_Unstable:   ", 0);
>  	show_val_kb(m, "Bounce:         ",
>  		    global_zone_page_state(NR_BOUNCE));
>  	show_val_kb(m, "WritebackTmp:   ",
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index e84d448988b6..3937f2be27d8 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -237,7 +237,6 @@ enum node_stat_item {
>  	NR_FILE_THPS,
>  	NR_FILE_PMDMAPPED,
>  	NR_ANON_THPS,
> -	NR_UNSTABLE_NFS,	/* NFS unstable pages */
>  	NR_VMSCAN_WRITE,
>  	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
>  	NR_DIRTIED,		/* page dirtyings since bootup */
> diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
> index d94def25e4dc..45b5fbdb1f62 100644
> --- a/include/trace/events/writeback.h
> +++ b/include/trace/events/writeback.h
> @@ -542,7 +542,6 @@ TRACE_EVENT(global_dirty_state,
>  	TP_STRUCT__entry(
>  		__field(unsigned long,	nr_dirty)
>  		__field(unsigned long,	nr_writeback)
> -		__field(unsigned long,	nr_unstable)
>  		__field(unsigned long,	background_thresh)
>  		__field(unsigned long,	dirty_thresh)
>  		__field(unsigned long,	dirty_limit)
> @@ -553,7 +552,6 @@ TRACE_EVENT(global_dirty_state,
>  	TP_fast_assign(
>  		__entry->nr_dirty	= global_node_page_state(NR_FILE_DIRTY);
>  		__entry->nr_writeback	= global_node_page_state(NR_WRITEBACK);
> -		__entry->nr_unstable	= global_node_page_state(NR_UNSTABLE_NFS);
>  		__entry->nr_dirtied	= global_node_page_state(NR_DIRTIED);
>  		__entry->nr_written	= global_node_page_state(NR_WRITTEN);
>  		__entry->background_thresh = background_thresh;
> @@ -561,12 +559,11 @@ TRACE_EVENT(global_dirty_state,
>  		__entry->dirty_limit	= global_wb_domain.dirty_limit;
>  	),
>  
> -	TP_printk("dirty=%lu writeback=%lu unstable=%lu "
> +	TP_printk("dirty=%lu writeback=%lu "
>  		  "bg_thresh=%lu thresh=%lu limit=%lu "
>  		  "dirtied=%lu written=%lu",
>  		  __entry->nr_dirty,
>  		  __entry->nr_writeback,
> -		  __entry->nr_unstable,
>  		  __entry->background_thresh,
>  		  __entry->dirty_thresh,
>  		  __entry->dirty_limit,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ca194864d802..41b450b0ca29 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4326,7 +4326,6 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
>  
>  	*pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY);
>  
> -	/* this should eventually include NR_UNSTABLE_NFS */
>  	*pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK);
>  	*pfilepages = memcg_exact_page_state(memcg, NR_INACTIVE_FILE) +
>  			memcg_exact_page_state(memcg, NR_ACTIVE_FILE);
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 4c9875971de5..d16f6a59bce4 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -504,7 +504,6 @@ bool node_dirty_ok(struct pglist_data *pgdat)
>  	unsigned long nr_pages = 0;
>  
>  	nr_pages += node_page_state(pgdat, NR_FILE_DIRTY);
> -	nr_pages += node_page_state(pgdat, NR_UNSTABLE_NFS);
>  	nr_pages += node_page_state(pgdat, NR_WRITEBACK);
>  
>  	return nr_pages <= limit;
> @@ -758,7 +757,7 @@ static void mdtc_calc_avail(struct dirty_throttle_control *mdtc,
>   * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set.
>   *
>   * Return: @wb's dirty limit in pages. The term "dirty" in the context of
> - * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages.
> + * dirty balancing includes all PG_dirty and PG_writeback pages.
>   */
>  static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc)
>  {
> @@ -1566,7 +1565,7 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
>  	struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ?
>  						     &mdtc_stor : NULL;
>  	struct dirty_throttle_control *sdtc;
> -	unsigned long nr_reclaimable;	/* = file_dirty + unstable_nfs */
> +	unsigned long nr_reclaimable;	/* = file_dirty */
>  	long period;
>  	long pause;
>  	long max_pause;
> @@ -1589,14 +1588,7 @@ static void balance_dirty_pages(struct bdi_writeback *wb,
>  		unsigned long m_thresh = 0;
>  		unsigned long m_bg_thresh = 0;
>  
> -		/*
> -		 * Unstable writes are a feature of certain networked
> -		 * filesystems (i.e. NFS) in which data may have been
> -		 * written to the server's write cache, but has not yet
> -		 * been flushed to permanent storage.
> -		 */
> -		nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) +
> -					global_node_page_state(NR_UNSTABLE_NFS);
> +		nr_reclaimable = global_node_page_state(NR_FILE_DIRTY);
>  		gdtc->avail = global_dirtyable_memory();
>  		gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK);
>  
> @@ -1940,8 +1932,7 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
>  	 * as we're trying to decide whether to put more under writeback.
>  	 */
>  	gdtc->avail = global_dirtyable_memory();
> -	gdtc->dirty = global_node_page_state(NR_FILE_DIRTY) +
> -		      global_node_page_state(NR_UNSTABLE_NFS);
> +	gdtc->dirty = global_node_page_state(NR_FILE_DIRTY);
>  	domain_dirty_limits(gdtc);
>  
>  	if (gdtc->dirty > gdtc->bg_thresh)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e5f76da8cd4e..24678d6e308d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5237,7 +5237,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  
>  	printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
>  		" active_file:%lu inactive_file:%lu isolated_file:%lu\n"
> -		" unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n"
> +		" unevictable:%lu dirty:%lu writeback:%lu unstable:0\n"
>  		" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
>  		" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
>  		" free:%lu free_pcp:%lu free_cma:%lu\n",
> @@ -5250,7 +5250,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  		global_node_page_state(NR_UNEVICTABLE),
>  		global_node_page_state(NR_FILE_DIRTY),
>  		global_node_page_state(NR_WRITEBACK),
> -		global_node_page_state(NR_UNSTABLE_NFS),
>  		global_node_page_state(NR_SLAB_RECLAIMABLE),
>  		global_node_page_state(NR_SLAB_UNRECLAIMABLE),
>  		global_node_page_state(NR_FILE_MAPPED),
> @@ -5283,7 +5282,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  			" anon_thp: %lukB"
>  #endif
>  			" writeback_tmp:%lukB"
> -			" unstable:%lukB"
> +			" unstable:0kB"
>  			" all_unreclaimable? %s"
>  			"\n",
>  			pgdat->node_id,
> @@ -5305,7 +5304,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  			K(node_page_state(pgdat, NR_ANON_THPS) * HPAGE_PMD_NR),
>  #endif
>  			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
> -			K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
>  			pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
>  				"yes" : "no");
>  	}
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index c9c0d71f917f..228d9f6e1c5c 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1108,7 +1108,7 @@ int fragmentation_index(struct zone *zone, unsigned int order)
>  					TEXT_FOR_HIGHMEM(xx) xx "_movable",
>  
>  const char * const vmstat_text[] = {
> -	/* enum zone_stat_item countes */
> +	/* enum zone_stat_item counters */
>  	"nr_free_pages",
>  	"nr_zone_inactive_anon",
>  	"nr_zone_active_anon",
> @@ -1162,7 +1162,6 @@ const char * const vmstat_text[] = {
>  	"nr_file_hugepages",
>  	"nr_file_pmdmapped",
>  	"nr_anon_transparent_hugepages",
> -	"nr_unstable",
>  	"nr_vmscan_write",
>  	"nr_vmscan_immediate_reclaim",
>  	"nr_dirtied",
> @@ -1707,8 +1706,16 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
>  static void *vmstat_next(struct seq_file *m, void *arg, loff_t *pos)
>  {
>  	(*pos)++;
> -	if (*pos >= NR_VMSTAT_ITEMS)
> +	if (*pos >= NR_VMSTAT_ITEMS) {
> +		/*
> +		 * Deprecated counters which are no longer represented
> +		 * in vmstat arrays. We just lie about them to be always
> +		 * 0 to not break userspace which might expect them in
> +		 * the output.
> +		 */
> +		seq_puts(m, "nr_unstable 0");
>  		return NULL;
> +	}
>  	return (unsigned long *)m->private + *pos;
>  }
>  
> -- 
> 2.26.0
>
Jan Kara April 7, 2020, 10:25 a.m. UTC | #3
On Tue 07-04-20 09:44:25, NeilBrown wrote:
> 
> After an NFS page has been written it is considered "unstable" until a
> COMMIT request succeeds.  If the COMMIT fails, the page will be
> re-written.
> 
> These "unstable" pages are currently accounted as "reclaimable", either
> in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a
> 'reclaimable' count.  This might have made sense when sending the COMMIT
> required a separate action by the VFS/MM (e.g.  releasepage() used to
> send a COMMIT).  However now that all writes generated by ->writepages()
> will automatically be followed by a COMMIT (since commit 919e3bd9a875
> ("NFS: Ensure we commit after writeback is complete")) it makes more
> sense to treat them as writeback pages.
> 
> So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in
> NR_WRITEBACK and WB_WRITEBACK.
> 
> A particular effect of this change is that when
> wb_check_background_flush() calls wb_over_bg_threshold(), the latter
> will report 'true' a lot less often as the 'unstable' pages are no
> longer considered 'dirty' (as there is nothing that writeback can do
> about them anyway).
> 
> Currently wb_check_background_flush() will trigger writeback to NFS even
> when there are relatively few dirty pages (if there are lots of unstable
> pages), this can result in small writes going to the server (10s of
> Kilobytes rather than a Megabyte) which hurts throughput.
> With this patch, there are fewer writes which are each larger on average.
> 
> Where the NR_UNSTABLE_NFS count was included in statistics
> virtual-files, the entry is retained, but the value is hard-coded as
> zero.  static trace points which record this counter no longer report
> it.
> 
> Signed-off-by: NeilBrown <neilb@suse.de>

...

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e5f76da8cd4e..24678d6e308d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5237,7 +5237,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  
>  	printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
>  		" active_file:%lu inactive_file:%lu isolated_file:%lu\n"
> -		" unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n"
> +		" unevictable:%lu dirty:%lu writeback:%lu unstable:0\n"
>  		" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
>  		" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
>  		" free:%lu free_pcp:%lu free_cma:%lu\n",
> @@ -5250,7 +5250,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  		global_node_page_state(NR_UNEVICTABLE),
>  		global_node_page_state(NR_FILE_DIRTY),
>  		global_node_page_state(NR_WRITEBACK),
> -		global_node_page_state(NR_UNSTABLE_NFS),
>  		global_node_page_state(NR_SLAB_RECLAIMABLE),
>  		global_node_page_state(NR_SLAB_UNRECLAIMABLE),
>  		global_node_page_state(NR_FILE_MAPPED),
> @@ -5283,7 +5282,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  			" anon_thp: %lukB"
>  #endif
>  			" writeback_tmp:%lukB"
> -			" unstable:%lukB"
> +			" unstable:0kB"
>  			" all_unreclaimable? %s"
>  			"\n",
>  			pgdat->node_id,
> @@ -5305,7 +5304,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
>  			K(node_page_state(pgdat, NR_ANON_THPS) * HPAGE_PMD_NR),
>  #endif
>  			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
> -			K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
>  			pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
>  				"yes" : "no");
>  	}

These are just page allocator splats on OOM. I don't think preserving
'unstable' in these reports is needed.

> @@ -1707,8 +1706,16 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
>  static void *vmstat_next(struct seq_file *m, void *arg, loff_t *pos)
>  {
>  	(*pos)++;
> -	if (*pos >= NR_VMSTAT_ITEMS)
> +	if (*pos >= NR_VMSTAT_ITEMS) {
> +		/*
> +		 * Deprecated counters which are no longer represented
> +		 * in vmstat arrays. We just lie about them to be always
> +		 * 0 to not break userspace which might expect them in
> +		 * the output.
> +		 */
> +		seq_puts(m, "nr_unstable 0");
>  		return NULL;
> +	}
>  	return (unsigned long *)m->private + *pos;
>  }

Umm, how is this supposed to work? vmstat_next() should return next element
of the sequence, not fill anything into seq_file - that's the job of
vmstat_show(). Looking at seq_read() implementation it may actually end up
working fine but I wouldn't really bet much on it especially in corner
cases like when we are just about to fill the user buffer and then need to
restart reading close to an end of vmstat file or so.

Michal, won't it be cleaner to have NR_VM_DEPRECATED_ITEMS included in
NR_VMSTAT_ITEMS, have names of these items in vmstat_text, and just set
appropriate number of 0 entries at the end of the array generated in
vmstat_start() and be done with it? That seems conceptually simpler and the
overhead is minimal.

								Honza
Michal Hocko April 7, 2020, 11:24 a.m. UTC | #4
On Tue 07-04-20 12:25:15, Jan Kara wrote:
> On Tue 07-04-20 09:44:25, NeilBrown wrote:
> > @@ -5283,7 +5282,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
> >  			" anon_thp: %lukB"
> >  #endif
> >  			" writeback_tmp:%lukB"
> > -			" unstable:%lukB"
> > +			" unstable:0kB"
> >  			" all_unreclaimable? %s"
> >  			"\n",
> >  			pgdat->node_id,
> > @@ -5305,7 +5304,6 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
> >  			K(node_page_state(pgdat, NR_ANON_THPS) * HPAGE_PMD_NR),
> >  #endif
> >  			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
> > -			K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
> >  			pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
> >  				"yes" : "no");
> >  	}
> 
> These are just page allocator splats on OOM. I don't think preserving
> 'unstable' in these reports is needed.

YOu are right and the less we dump from this path the better. I could
have noticed.

> > @@ -1707,8 +1706,16 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
> >  static void *vmstat_next(struct seq_file *m, void *arg, loff_t *pos)
> >  {
> >  	(*pos)++;
> > -	if (*pos >= NR_VMSTAT_ITEMS)
> > +	if (*pos >= NR_VMSTAT_ITEMS) {
> > +		/*
> > +		 * Deprecated counters which are no longer represented
> > +		 * in vmstat arrays. We just lie about them to be always
> > +		 * 0 to not break userspace which might expect them in
> > +		 * the output.
> > +		 */
> > +		seq_puts(m, "nr_unstable 0");
> >  		return NULL;
> > +	}
> >  	return (unsigned long *)m->private + *pos;
> >  }
> 
> Umm, how is this supposed to work? vmstat_next() should return next element
> of the sequence, not fill anything into seq_file - that's the job of
> vmstat_show(). Looking at seq_read() implementation it may actually end up
> working fine but I wouldn't really bet much on it especially in corner
> cases like when we are just about to fill the user buffer and then need to
> restart reading close to an end of vmstat file or so.

Well, I have to confess I haven't really tested this myself but the
logic was to have this output close to NR_VMSTAT_ITEMS break out of
the counters loop.

> Michal, won't it be cleaner to have NR_VM_DEPRECATED_ITEMS included in
> NR_VMSTAT_ITEMS, have names of these items in vmstat_text, and just set
> appropriate number of 0 entries at the end of the array generated in
> vmstat_start() and be done with it? That seems conceptually simpler and the
> overhead is minimal.

Yes, that would be much nicer, albeit more code.  So I believe you meant
something like this?

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 462f6873905a..a18611197bea 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -237,7 +237,6 @@ enum node_stat_item {
 	NR_FILE_THPS,
 	NR_FILE_PMDMAPPED,
 	NR_ANON_THPS,
-	NR_UNSTABLE_NFS,	/* NFS unstable pages */
 	NR_VMSCAN_WRITE,
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
 	NR_DIRTIED,		/* page dirtyings since bootup */
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 78d53378db99..992e162f1886 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1162,7 +1162,6 @@ const char * const vmstat_text[] = {
 	"nr_file_hugepages",
 	"nr_file_pmdmapped",
 	"nr_anon_transparent_hugepages",
-	"nr_unstable",
 	"nr_vmscan_write",
 	"nr_vmscan_immediate_reclaim",
 	"nr_dirtied",
@@ -1293,9 +1292,13 @@ const char * const vmstat_text[] = {
 	"swap_ra_hit",
 #endif
 #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
+	/* Deprecated counters. Count them in NR_VM_DEPRECATED_ITEMS */
+	"nr_unstable",
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
 
+#define NR_VM_DEPRECATED_ITEMS 1
+
 #if (defined(CONFIG_DEBUG_FS) && defined(CONFIG_COMPACTION)) || \
      defined(CONFIG_PROC_FS)
 static void *frag_start(struct seq_file *m, loff_t *pos)
@@ -1661,7 +1664,8 @@ static const struct seq_operations zoneinfo_op = {
 			 NR_VM_NODE_STAT_ITEMS + \
 			 NR_VM_WRITEBACK_STAT_ITEMS + \
 			 (IS_ENABLED(CONFIG_VM_EVENT_COUNTERS) ? \
-			  NR_VM_EVENT_ITEMS : 0))
+			  NR_VM_EVENT_ITEMS : 0) + \
+			  NR_VM_DEPRECATED_ITEMS)
 
 static void *vmstat_start(struct seq_file *m, loff_t *pos)
 {
@@ -1698,7 +1702,11 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
 	all_vm_events(v);
 	v[PGPGIN] /= 2;		/* sectors -> kbytes */
 	v[PGPGOUT] /= 2;
+	v += NR_VM_EVENT_ITEMS;
 #endif
+	for (i = 0; i < NR_VM_DEPRECATED_ITEMS)
+		v[i] = 0;
+
 	return (unsigned long *)m->private + *pos;
 }
Jan Kara April 7, 2020, 11:33 a.m. UTC | #5
On Tue 07-04-20 13:24:12, Michal Hocko wrote:
> On Tue 07-04-20 12:25:15, Jan Kara wrote:
> > > @@ -1707,8 +1706,16 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
> > >  static void *vmstat_next(struct seq_file *m, void *arg, loff_t *pos)
> > >  {
> > >  	(*pos)++;
> > > -	if (*pos >= NR_VMSTAT_ITEMS)
> > > +	if (*pos >= NR_VMSTAT_ITEMS) {
> > > +		/*
> > > +		 * Deprecated counters which are no longer represented
> > > +		 * in vmstat arrays. We just lie about them to be always
> > > +		 * 0 to not break userspace which might expect them in
> > > +		 * the output.
> > > +		 */
> > > +		seq_puts(m, "nr_unstable 0");
> > >  		return NULL;
> > > +	}
> > >  	return (unsigned long *)m->private + *pos;
> > >  }
> > 
> > Umm, how is this supposed to work? vmstat_next() should return next element
> > of the sequence, not fill anything into seq_file - that's the job of
> > vmstat_show(). Looking at seq_read() implementation it may actually end up
> > working fine but I wouldn't really bet much on it especially in corner
> > cases like when we are just about to fill the user buffer and then need to
> > restart reading close to an end of vmstat file or so.
> 
> Well, I have to confess I haven't really tested this myself but the
> logic was to have this output close to NR_VMSTAT_ITEMS break out of
> the counters loop.
> 
> > Michal, won't it be cleaner to have NR_VM_DEPRECATED_ITEMS included in
> > NR_VMSTAT_ITEMS, have names of these items in vmstat_text, and just set
> > appropriate number of 0 entries at the end of the array generated in
> > vmstat_start() and be done with it? That seems conceptually simpler and the
> > overhead is minimal.
> 
> Yes, that would be much nicer, albeit more code.  So I believe you meant
> something like this?
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 462f6873905a..a18611197bea 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -237,7 +237,6 @@ enum node_stat_item {
>  	NR_FILE_THPS,
>  	NR_FILE_PMDMAPPED,
>  	NR_ANON_THPS,
> -	NR_UNSTABLE_NFS,	/* NFS unstable pages */
>  	NR_VMSCAN_WRITE,
>  	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
>  	NR_DIRTIED,		/* page dirtyings since bootup */
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 78d53378db99..992e162f1886 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1162,7 +1162,6 @@ const char * const vmstat_text[] = {
>  	"nr_file_hugepages",
>  	"nr_file_pmdmapped",
>  	"nr_anon_transparent_hugepages",
> -	"nr_unstable",
>  	"nr_vmscan_write",
>  	"nr_vmscan_immediate_reclaim",
>  	"nr_dirtied",
> @@ -1293,9 +1292,13 @@ const char * const vmstat_text[] = {
>  	"swap_ra_hit",
>  #endif
>  #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */
> +	/* Deprecated counters. Count them in NR_VM_DEPRECATED_ITEMS */
> +	"nr_unstable",
>  };
>  #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */
>  
> +#define NR_VM_DEPRECATED_ITEMS 1
> +
>  #if (defined(CONFIG_DEBUG_FS) && defined(CONFIG_COMPACTION)) || \
>       defined(CONFIG_PROC_FS)
>  static void *frag_start(struct seq_file *m, loff_t *pos)
> @@ -1661,7 +1664,8 @@ static const struct seq_operations zoneinfo_op = {
>  			 NR_VM_NODE_STAT_ITEMS + \
>  			 NR_VM_WRITEBACK_STAT_ITEMS + \
>  			 (IS_ENABLED(CONFIG_VM_EVENT_COUNTERS) ? \
> -			  NR_VM_EVENT_ITEMS : 0))
> +			  NR_VM_EVENT_ITEMS : 0) + \
> +			  NR_VM_DEPRECATED_ITEMS)
>  
>  static void *vmstat_start(struct seq_file *m, loff_t *pos)
>  {
> @@ -1698,7 +1702,11 @@ static void *vmstat_start(struct seq_file *m, loff_t *pos)
>  	all_vm_events(v);
>  	v[PGPGIN] /= 2;		/* sectors -> kbytes */
>  	v[PGPGOUT] /= 2;
> +	v += NR_VM_EVENT_ITEMS;
>  #endif
> +	for (i = 0; i < NR_VM_DEPRECATED_ITEMS)
					     ^^ ; i++ here

but otherwise yes, this is what I meant. Thanks!

> +		v[i] = 0;
> +
>  	return (unsigned long *)m->private + *pos;
>  }

								Honza

Patch
diff mbox series

diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 38b606991065..092b7b44d158 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -1042,8 +1042,8 @@  PageTables
               amount of memory dedicated to the lowest level of page
               tables.
 NFS_Unstable
-              NFS pages sent to the server, but not yet committed to stable
-	      storage
+              Always zero. Previous counted pages which had been written to
+              the server, but has not been committed to stable storage.
 Bounce
               Memory used for block device "bounce buffers"
 WritebackTmp
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 10d7e818e118..15f5ed6a8830 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -439,7 +439,7 @@  static ssize_t node_read_meminfo(struct device *dev,
 		       nid, K(i.sharedram),
 		       nid, sum_zone_node_page_state(nid, NR_KERNEL_STACK_KB),
 		       nid, K(sum_zone_node_page_state(nid, NR_PAGETABLE)),
-		       nid, K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
+		       nid, 0,
 		       nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)),
 		       nid, K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
 		       nid, K(sreclaimable +
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 76ac9c7d32ec..c5bdf46e3b4b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1070,7 +1070,6 @@  static void bdi_split_work_to_wbs(struct backing_dev_info *bdi,
 static unsigned long get_nr_dirty_pages(void)
 {
 	return global_node_page_state(NR_FILE_DIRTY) +
-		global_node_page_state(NR_UNSTABLE_NFS) +
 		get_nr_dirty_inodes();
 }
 
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index f80c47d5ff27..749da02b547a 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -652,7 +652,8 @@  void nfs_super_set_maxbytes(struct super_block *sb, __u64 maxfilesize)
 }
 
 /*
- * Record the page as unstable and mark its inode as dirty.
+ * Record the page as unstable (an extra writeback period) and mark its
+ * inode as dirty.
  */
 static inline
 void nfs_mark_page_unstable(struct page *page, struct nfs_commit_info *cinfo)
@@ -660,8 +661,11 @@  void nfs_mark_page_unstable(struct page *page, struct nfs_commit_info *cinfo)
 	if (!cinfo->dreq) {
 		struct inode *inode = page_file_mapping(page)->host;
 
-		inc_node_page_state(page, NR_UNSTABLE_NFS);
-		inc_wb_stat(&inode_to_bdi(inode)->wb, WB_RECLAIMABLE);
+		/* This page is really still in write-back - just that the
+		 * writeback is happening on the server now.
+		 */
+		inc_node_page_state(page, NR_WRITEBACK);
+		inc_wb_stat(&inode_to_bdi(inode)->wb, WB_WRITEBACK);
 		__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
 	}
 }
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index c478b772cc49..2e15a56620b3 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -958,9 +958,9 @@  nfs_mark_request_commit(struct nfs_page *req, struct pnfs_layout_segment *lseg,
 static void
 nfs_clear_page_commit(struct page *page)
 {
-	dec_node_page_state(page, NR_UNSTABLE_NFS);
+	dec_node_page_state(page, NR_WRITEBACK);
 	dec_wb_stat(&inode_to_bdi(page_file_mapping(page)->host)->wb,
-		    WB_RECLAIMABLE);
+		    WB_WRITEBACK);
 }
 
 /* Called holding the request lock on @req */
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 8c1f1bb1a5ce..9bd94b5a9658 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -106,8 +106,7 @@  static int meminfo_proc_show(struct seq_file *m, void *v)
 	show_val_kb(m, "PageTables:     ",
 		    global_zone_page_state(NR_PAGETABLE));
 
-	show_val_kb(m, "NFS_Unstable:   ",
-		    global_node_page_state(NR_UNSTABLE_NFS));
+	show_val_kb(m, "NFS_Unstable:   ", 0);
 	show_val_kb(m, "Bounce:         ",
 		    global_zone_page_state(NR_BOUNCE));
 	show_val_kb(m, "WritebackTmp:   ",
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index e84d448988b6..3937f2be27d8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -237,7 +237,6 @@  enum node_stat_item {
 	NR_FILE_THPS,
 	NR_FILE_PMDMAPPED,
 	NR_ANON_THPS,
-	NR_UNSTABLE_NFS,	/* NFS unstable pages */
 	NR_VMSCAN_WRITE,
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
 	NR_DIRTIED,		/* page dirtyings since bootup */
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index d94def25e4dc..45b5fbdb1f62 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -542,7 +542,6 @@  TRACE_EVENT(global_dirty_state,
 	TP_STRUCT__entry(
 		__field(unsigned long,	nr_dirty)
 		__field(unsigned long,	nr_writeback)
-		__field(unsigned long,	nr_unstable)
 		__field(unsigned long,	background_thresh)
 		__field(unsigned long,	dirty_thresh)
 		__field(unsigned long,	dirty_limit)
@@ -553,7 +552,6 @@  TRACE_EVENT(global_dirty_state,
 	TP_fast_assign(
 		__entry->nr_dirty	= global_node_page_state(NR_FILE_DIRTY);
 		__entry->nr_writeback	= global_node_page_state(NR_WRITEBACK);
-		__entry->nr_unstable	= global_node_page_state(NR_UNSTABLE_NFS);
 		__entry->nr_dirtied	= global_node_page_state(NR_DIRTIED);
 		__entry->nr_written	= global_node_page_state(NR_WRITTEN);
 		__entry->background_thresh = background_thresh;
@@ -561,12 +559,11 @@  TRACE_EVENT(global_dirty_state,
 		__entry->dirty_limit	= global_wb_domain.dirty_limit;
 	),
 
-	TP_printk("dirty=%lu writeback=%lu unstable=%lu "
+	TP_printk("dirty=%lu writeback=%lu "
 		  "bg_thresh=%lu thresh=%lu limit=%lu "
 		  "dirtied=%lu written=%lu",
 		  __entry->nr_dirty,
 		  __entry->nr_writeback,
-		  __entry->nr_unstable,
 		  __entry->background_thresh,
 		  __entry->dirty_thresh,
 		  __entry->dirty_limit,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ca194864d802..41b450b0ca29 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4326,7 +4326,6 @@  void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 
 	*pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY);
 
-	/* this should eventually include NR_UNSTABLE_NFS */
 	*pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK);
 	*pfilepages = memcg_exact_page_state(memcg, NR_INACTIVE_FILE) +
 			memcg_exact_page_state(memcg, NR_ACTIVE_FILE);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 4c9875971de5..d16f6a59bce4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -504,7 +504,6 @@  bool node_dirty_ok(struct pglist_data *pgdat)
 	unsigned long nr_pages = 0;
 
 	nr_pages += node_page_state(pgdat, NR_FILE_DIRTY);
-	nr_pages += node_page_state(pgdat, NR_UNSTABLE_NFS);
 	nr_pages += node_page_state(pgdat, NR_WRITEBACK);
 
 	return nr_pages <= limit;
@@ -758,7 +757,7 @@  static void mdtc_calc_avail(struct dirty_throttle_control *mdtc,
  * bounded by the bdi->min_ratio and/or bdi->max_ratio parameters, if set.
  *
  * Return: @wb's dirty limit in pages. The term "dirty" in the context of
- * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages.
+ * dirty balancing includes all PG_dirty and PG_writeback pages.
  */
 static unsigned long __wb_calc_thresh(struct dirty_throttle_control *dtc)
 {
@@ -1566,7 +1565,7 @@  static void balance_dirty_pages(struct bdi_writeback *wb,
 	struct dirty_throttle_control * const mdtc = mdtc_valid(&mdtc_stor) ?
 						     &mdtc_stor : NULL;
 	struct dirty_throttle_control *sdtc;
-	unsigned long nr_reclaimable;	/* = file_dirty + unstable_nfs */
+	unsigned long nr_reclaimable;	/* = file_dirty */
 	long period;
 	long pause;
 	long max_pause;
@@ -1589,14 +1588,7 @@  static void balance_dirty_pages(struct bdi_writeback *wb,
 		unsigned long m_thresh = 0;
 		unsigned long m_bg_thresh = 0;
 
-		/*
-		 * Unstable writes are a feature of certain networked
-		 * filesystems (i.e. NFS) in which data may have been
-		 * written to the server's write cache, but has not yet
-		 * been flushed to permanent storage.
-		 */
-		nr_reclaimable = global_node_page_state(NR_FILE_DIRTY) +
-					global_node_page_state(NR_UNSTABLE_NFS);
+		nr_reclaimable = global_node_page_state(NR_FILE_DIRTY);
 		gdtc->avail = global_dirtyable_memory();
 		gdtc->dirty = nr_reclaimable + global_node_page_state(NR_WRITEBACK);
 
@@ -1940,8 +1932,7 @@  bool wb_over_bg_thresh(struct bdi_writeback *wb)
 	 * as we're trying to decide whether to put more under writeback.
 	 */
 	gdtc->avail = global_dirtyable_memory();
-	gdtc->dirty = global_node_page_state(NR_FILE_DIRTY) +
-		      global_node_page_state(NR_UNSTABLE_NFS);
+	gdtc->dirty = global_node_page_state(NR_FILE_DIRTY);
 	domain_dirty_limits(gdtc);
 
 	if (gdtc->dirty > gdtc->bg_thresh)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e5f76da8cd4e..24678d6e308d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5237,7 +5237,7 @@  void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 
 	printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
 		" active_file:%lu inactive_file:%lu isolated_file:%lu\n"
-		" unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n"
+		" unevictable:%lu dirty:%lu writeback:%lu unstable:0\n"
 		" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
 		" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
 		" free:%lu free_pcp:%lu free_cma:%lu\n",
@@ -5250,7 +5250,6 @@  void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 		global_node_page_state(NR_UNEVICTABLE),
 		global_node_page_state(NR_FILE_DIRTY),
 		global_node_page_state(NR_WRITEBACK),
-		global_node_page_state(NR_UNSTABLE_NFS),
 		global_node_page_state(NR_SLAB_RECLAIMABLE),
 		global_node_page_state(NR_SLAB_UNRECLAIMABLE),
 		global_node_page_state(NR_FILE_MAPPED),
@@ -5283,7 +5282,7 @@  void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			" anon_thp: %lukB"
 #endif
 			" writeback_tmp:%lukB"
-			" unstable:%lukB"
+			" unstable:0kB"
 			" all_unreclaimable? %s"
 			"\n",
 			pgdat->node_id,
@@ -5305,7 +5304,6 @@  void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			K(node_page_state(pgdat, NR_ANON_THPS) * HPAGE_PMD_NR),
 #endif
 			K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
-			K(node_page_state(pgdat, NR_UNSTABLE_NFS)),
 			pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
 				"yes" : "no");
 	}
diff --git a/mm/vmstat.c b/mm/vmstat.c
index c9c0d71f917f..228d9f6e1c5c 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1108,7 +1108,7 @@  int fragmentation_index(struct zone *zone, unsigned int order)
 					TEXT_FOR_HIGHMEM(xx) xx "_movable",
 
 const char * const vmstat_text[] = {
-	/* enum zone_stat_item countes */
+	/* enum zone_stat_item counters */
 	"nr_free_pages",
 	"nr_zone_inactive_anon",
 	"nr_zone_active_anon",
@@ -1162,7 +1162,6 @@  const char * const vmstat_text[] = {
 	"nr_file_hugepages",
 	"nr_file_pmdmapped",
 	"nr_anon_transparent_hugepages",
-	"nr_unstable",
 	"nr_vmscan_write",
 	"nr_vmscan_immediate_reclaim",
 	"nr_dirtied",
@@ -1707,8 +1706,16 @@  static void *vmstat_start(struct seq_file *m, loff_t *pos)
 static void *vmstat_next(struct seq_file *m, void *arg, loff_t *pos)
 {
 	(*pos)++;
-	if (*pos >= NR_VMSTAT_ITEMS)
+	if (*pos >= NR_VMSTAT_ITEMS) {
+		/*
+		 * Deprecated counters which are no longer represented
+		 * in vmstat arrays. We just lie about them to be always
+		 * 0 to not break userspace which might expect them in
+		 * the output.
+		 */
+		seq_puts(m, "nr_unstable 0");
 		return NULL;
+	}
 	return (unsigned long *)m->private + *pos;
 }