[v2] mm: implement write-behind policy for sequential file writes

Message ID 156896493723.4334.13340481207144634918.stgit@buzz (mailing list archive)
State New, archived

Commit Message

Konstantin Khlebnikov Sept. 20, 2019, 7:35 a.m. UTC
Traditional writeback tries to accumulate as much dirty data as possible.
This is a worthwhile strategy for extremely short-lived files and for
batching writes to save battery power. But for workloads where disk latency
is important, this policy generates periodic disk load spikes which increase
latency for concurrent operations.

Also, dirty pages in the file cache cannot be reclaimed and reused immediately.
As a result, massive I/O such as file copying affects memory allocation latency.

The present writeback engine allows tuning only the dirty data size or the
expiration time. Such tuning cannot eliminate spikes; it just makes them
smaller and more frequent. The other option is switching to sync mode, which
flushes written data right after each write; obviously this has a significant
performance impact. Such tuning is also system-wide and affects memory-mapped
and randomly written files, which flusher threads handle much better.

This patch implements a write-behind policy which tracks sequential writes
and starts background writeback when a file has enough dirty pages.

Global switch in sysctl vm.dirty_write_behind:
=0: disabled, default
=1: enabled for strictly sequential writes (append, copying)
=2: enabled for all sequential writes

The only parameter is window size: maximum amount of dirty pages behind
current position and maximum amount of pages in background writeback.

Setup is per-disk in sysfs in file /sys/block/$DISK/bdi/write_behind_kb.
Default: 16MiB, '0' disables write-behind for this disk.
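
A note on units: the sysfs knob is expressed in KiB, while the code reads a
page count (bdi->write_behind_pages). Below is a minimal sketch of that
conversion, assuming it follows the usual kernel idiom used for read_ahead_kb.
The helper name kb_to_pages and the explicit page_shift argument are
hypothetical and do not come from the patch.

```c
#include <assert.h>

/*
 * Hypothetical helper: convert a size in KiB into whole pages.
 * page_shift is log2 of the page size (12 for 4 KiB pages).
 */
static unsigned long kb_to_pages(unsigned long kb, unsigned int page_shift)
{
	assert(page_shift >= 10);	/* a page is at least 1 KiB */
	return kb >> (page_shift - 10);
}
```

With 4 KiB pages the 16MiB default (16384 KiB) would correspond to a window
of 4096 pages.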

When the amount of unwritten pages exceeds the window size, write-behind starts
background writeback for max(excess, max_sectors_kb) and then waits for
the same amount of background writeback initiated previously.

 |<-wait-this->|           |<-send-this->|<---pending-write-behind--->|
 |<--async-write-behind--->|<--------previous-data------>|<-new-data->|
              current head-^    new head-^              file position-^
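
The window arithmetic in the diagram can be sketched in user-space C. This is
only one possible reading of the description, not the code from the patch:
all names are hypothetical, all quantities are in pages, and clamping the
request to the pending amount is an assumption added to keep the sketch safe.

```c
#include <assert.h>
#include <stdint.h>

/* All names below are hypothetical; units are pages. */
struct wb_step {
	int triggered;		/* non-zero if writeback should be started */
	uint64_t send_begin;	/* [send_begin, send_end): start async writeback */
	uint64_t send_end;
	uint64_t wait_begin;	/* [wait_begin, wait_end): wait for older writeback */
	uint64_t wait_end;
	uint64_t new_head;	/* head cursor after this step */
};

static uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }
static uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }

static struct wb_step write_behind_step(uint64_t head, uint64_t pos,
					uint64_t window, uint64_t request)
{
	struct wb_step s = { 0 };
	uint64_t pending, send;

	assert(pos >= head);
	s.new_head = head;
	pending = pos - head;		/* written but not yet submitted */
	if (window == 0 || pending <= window)
		return s;		/* disabled, or still inside the window */

	/* Submit max(excess, request), clamped to what is actually pending. */
	send = min_u64(max_u64(pending - window, request), pending);
	s.triggered = 1;
	s.send_begin = head;
	s.send_end = head + send;
	s.new_head = s.send_end;

	/* Wait on the oldest part of the async window behind the old head. */
	s.wait_begin = head > window ? head - window : 0;
	s.wait_end = min_u64(s.wait_begin + send, head);
	return s;
}
```

For example, with head=100, pos=150, window=40 and request=8, the excess is 10
pages, so pages 100..109 are submitted, pages 60..69 are waited on, and the
head moves to 110.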

Remaining tail pages are flushed when the file is closed if async write-behind
was started, or if this is a new file and it is at least max_sectors_kb long.

Overall behavior depending on total data size:
< max_sectors_kb - no writes
> max_sectors_kb - write new files in background after close
> write_behind_kb - streaming write, write tail at close
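
The three size classes above can be written as a small decision function. This
is a sketch for a newly created file with write-behind enabled; the enum and
function names are hypothetical, and the handling of the exact boundaries
("at least max_sectors_kb", "exceeds window size") is an assumption based on
the wording of this message.

```c
#include <assert.h>

enum wb_behavior {		/* hypothetical names */
	WB_NO_WRITE,		/* write-behind starts no writes */
	WB_WRITE_AFTER_CLOSE,	/* new file is written in background after close */
	WB_STREAMING,		/* write-behind while writing, tail at close */
};

/* Classify a new file by its total size; all sizes are in KiB. */
static enum wb_behavior classify_new_file(unsigned long total_kb,
					  unsigned long max_sectors_kb,
					  unsigned long write_behind_kb)
{
	/* Assumes write-behind is enabled and the window covers one request. */
	assert(max_sectors_kb <= write_behind_kb);

	if (total_kb > write_behind_kb)
		return WB_STREAMING;
	if (total_kb >= max_sectors_kb)
		return WB_WRITE_AFTER_CLOSE;
	return WB_NO_WRITE;
}
```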

Special cases:

* files with POSIX_FADV_RANDOM, O_DIRECT, O_[D]SYNC are ignored

* writing cursor for O_APPEND is aligned to cover previous small appends
  Appends might happen via multiple files or via a new file each time.

* mode vm.dirty_write_behind=1 ignores non-append writes
  This reacts only to completely sequential writes like copying files,
  writing logs with O_APPEND or rewriting files after O_TRUNC.
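
As an illustration of the first special case, an application could opt a file
out of write-behind from user space with the standard posix_fadvise() call.
This is a sketch, assuming a kernel that carries this patch; the function name
open_random_access is hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open a file for writing and mark it as randomly
 * accessed, so that (per this patch) write-behind ignores it.
 * Returns the file descriptor, or -1 on failure.
 */
static int open_random_access(const char *path)
{
	int fd;

	assert(path != NULL);
	fd = open(path, O_WRONLY | O_CREAT, 0644);
	if (fd < 0)
		return -1;
	/* posix_fadvise() returns an error number directly, not -1/errno. */
	if (posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM) != 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

On Linux, POSIX_FADV_RANDOM sets FMODE_RANDOM on the open file, which is the
flag the write-behind code checks before doing anything else.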

Note: the ext4 feature "auto_da_alloc" also writes back the cache when a file
is closed after being truncated to 0, and after one file is renamed over another.

Changes since v1 (2017-10-02):
* rework window management:
* change default window 1MiB -> 16MiB
* change default request 256KiB -> max_sectors_kb
* drop always-async behavior for O_NONBLOCK
* drop handling POSIX_FADV_NOREUSE (should be in separate patch)
* ignore writes with O_DIRECT, O_SYNC, O_DSYNC
* align head position for O_APPEND
* add strictly sequential mode
* write tail pages for new files
* make void, keep errors at mapping

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Link: https://lore.kernel.org/patchwork/patch/836149/ (v1)
---
 Documentation/ABI/testing/sysfs-class-bdi |    5 +
 Documentation/admin-guide/sysctl/vm.rst   |   15 +++
 fs/file_table.c                           |    2 
 include/linux/backing-dev-defs.h          |    1 
 include/linux/fs.h                        |    8 +-
 include/linux/mm.h                        |    1 
 kernel/sysctl.c                           |    9 ++
 mm/backing-dev.c                          |   43 +++++----
 mm/filemap.c                              |  136 +++++++++++++++++++++++++++++
 9 files changed, 199 insertions(+), 21 deletions(-)

Comments

Konstantin Khlebnikov Sept. 20, 2019, 7:39 a.m. UTC | #1
A script for a trivial demo is attached.

$ bash test_writebehind.sh
SIZE
3,2G	dummy
vm.dirty_write_behind = 0
COPY

real	0m3.629s
user	0m0.016s
sys	0m3.613s
Dirty:           3254552 kB
SYNC

real	0m31.953s
user	0m0.002s
sys	0m0.000s
vm.dirty_write_behind = 1
COPY

real	0m32.738s
user	0m0.008s
sys	0m4.047s
Dirty:              2900 kB
SYNC

real	0m0.427s
user	0m0.000s
sys	0m0.004s
vm.dirty_write_behind = 2
COPY

real	0m32.168s
user	0m0.000s
sys	0m4.066s
Dirty:              3088 kB
SYNC

real	0m0.421s
user	0m0.004s
sys	0m0.001s


With vm.dirty_write_behind set to 1 or 2, the copy plus sync completes slightly
faster overall (about 33s versus 35.6s), and during copying the amount of dirty
memory always stays around 16MiB.


Linus Torvalds Sept. 20, 2019, 11:05 p.m. UTC | #2
On Fri, Sep 20, 2019 at 12:35 AM Konstantin Khlebnikov
<khlebnikov@yandex-team.ru> wrote:
>
> This patch implements write-behind policy which tracks sequential writes
> and starts background writeback when file have enough dirty pages.

Apart from a spelling error ("contigious"), my only reaction is that
I've wanted this for the multi-file writes, not just for single big
files.

Yes, single big files may be simpler and perhaps the "10% effort for
90% of the gain", and thus the right thing to do, but I do wonder if
you've looked at simply extending it to cover multiple files when
people copy a whole directory (or unpack a tar-file, or similar).

Now, I hear you say "those are so small these days that it doesn't
matter". And maybe you're right. But particularly for slow media,
triggering good streaming write behavior has been a problem in the
past.

So I'm wondering whether the "writebehind" state should perhaps be
considered to be a process state, rather than "struct file" state, and
also start triggering for writing smaller files.

Maybe this was already discussed and people decided that the big-file
case was so much easier that it wasn't worth worrying about
writebehind for multiple files.

            Linus
Linus Torvalds Sept. 20, 2019, 11:10 p.m. UTC | #3
On Fri, Sep 20, 2019 at 4:05 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
> Now, I hear you say "those are so small these days that it doesn't
> matter". And maybe you're right. But particularly for slow media,
> triggering good streaming write behavior has been a problem in the
> past.

Which reminds me: the writebehind trigger should likely be tied to the
estimate of the bdi write speed.

We _do_ have that avg_write_bandwidth thing in the bdi_writeback
structure, it sounds like a potentially good idea to try to use that
to estimate when to do writebehind.

No?

            Linus
kernel test robot Sept. 22, 2019, 7:47 a.m. UTC | #4
Hi Konstantin,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[cannot apply to v5.3 next-20190919]
[if your patch is applied to the wrong git tree, please drop us a note to help
improve the system. BTW, we also suggest to use '--base' option to specify the
base tree in git format-patch, please see https://stackoverflow.com/a/37406982]

url:    https://github.com/0day-ci/linux/commits/Konstantin-Khlebnikov/mm-implement-write-behind-policy-for-sequential-file-writes/20190920-155606
reproduce: make htmldocs
:::::: branch date: 8 hours ago
:::::: commit date: 8 hours ago

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c:1: warning: no structured comments found
   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c:1: warning: no structured comments found
   drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c:1: warning: 'pp_dpm_sclk pp_dpm_mclk pp_dpm_pcie' not found
   drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h:132: warning: Incorrect use of kernel-doc format: Documentation Makefile include scripts source @atomic_obj
   drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h:238: warning: Incorrect use of kernel-doc format: Documentation Makefile include scripts source gpu_info FW provided soc bounding box struct or 0 if not
   drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h:243: warning: Function parameter or member 'atomic_obj' not described in 'amdgpu_display_manager'
   drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h:243: warning: Function parameter or member 'backlight_link' not described in 'amdgpu_display_manager'
   drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h:243: warning: Function parameter or member 'backlight_caps' not described in 'amdgpu_display_manager'
   drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h:243: warning: Function parameter or member 'freesync_module' not described in 'amdgpu_display_manager'
   drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h:243: warning: Function parameter or member 'fw_dmcu' not described in 'amdgpu_display_manager'
   drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h:243: warning: Function parameter or member 'dmcu_fw_version' not described in 'amdgpu_display_manager'
   drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.h:243: warning: Function parameter or member 'soc_bounding_box' not described in 'amdgpu_display_manager'
   drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c:1: warning: 'register_hpd_handlers' not found
   drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c:1: warning: 'dm_crtc_high_irq' not found
   drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm.c:1: warning: 'dm_pflip_high_irq' not found
   include/linux/spi/spi.h:190: warning: Function parameter or member 'driver_override' not described in 'spi_device'
   drivers/gpio/gpiolib-of.c:92: warning: Excess function parameter 'dev' description in 'of_gpio_need_valid_mask'
   include/linux/i2c.h:337: warning: Function parameter or member 'init_irq' not described in 'i2c_client'
   include/linux/regulator/machine.h:196: warning: Function parameter or member 'max_uV_step' not described in 'regulation_constraints'
   include/linux/regulator/driver.h:223: warning: Function parameter or member 'resume' not described in 'regulator_ops'
   fs/fs-writeback.c:913: warning: Excess function parameter 'nr_pages' description in 'cgroup_writeback_by_id'
   fs/direct-io.c:258: warning: Excess function parameter 'offset' description in 'dio_complete'
   fs/libfs.c:496: warning: Excess function parameter 'available' description in 'simple_write_end'
   fs/posix_acl.c:647: warning: Function parameter or member 'inode' not described in 'posix_acl_update_mode'
   fs/posix_acl.c:647: warning: Function parameter or member 'mode_p' not described in 'posix_acl_update_mode'
   fs/posix_acl.c:647: warning: Function parameter or member 'acl' not described in 'posix_acl_update_mode'
   drivers/usb/typec/bus.c:1: warning: 'typec_altmode_unregister_driver' not found
   drivers/usb/typec/bus.c:1: warning: 'typec_altmode_register_driver' not found
   drivers/usb/typec/class.c:1: warning: 'typec_altmode_register_notifier' not found
   drivers/usb/typec/class.c:1: warning: 'typec_altmode_unregister_notifier' not found
   kernel/dma/coherent.c:1: warning: no structured comments found
   include/linux/input/sparse-keymap.h:43: warning: Function parameter or member 'sw' not described in 'key_entry'
   include/linux/skbuff.h:888: warning: Function parameter or member 'dev_scratch' not described in 'sk_buff'
   include/linux/skbuff.h:888: warning: Function parameter or member 'list' not described in 'sk_buff'
   include/linux/skbuff.h:888: warning: Function parameter or member 'ip_defrag_offset' not described in 'sk_buff'
   include/linux/skbuff.h:888: warning: Function parameter or member 'skb_mstamp_ns' not described in 'sk_buff'
   include/linux/skbuff.h:888: warning: Function parameter or member '__cloned_offset' not described in 'sk_buff'
   include/linux/skbuff.h:888: warning: Function parameter or member 'head_frag' not described in 'sk_buff'
   include/linux/skbuff.h:888: warning: Function parameter or member '__pkt_type_offset' not described in 'sk_buff'
   include/linux/skbuff.h:888: warning: Function parameter or member 'encapsulation' not described in 'sk_buff'
   include/linux/skbuff.h:888: warning: Function parameter or member 'encap_hdr_csum' not described in 'sk_buff'
   include/linux/skbuff.h:888: warning: Function parameter or member 'csum_valid' not described in 'sk_buff'
   include/linux/skbuff.h:888: warning: Function parameter or member '__pkt_vlan_present_offset' not described in 'sk_buff'
   include/linux/skbuff.h:888: warning: Function parameter or member 'vlan_present' not described in 'sk_buff'
   include/linux/skbuff.h:888: warning: Function parameter or member 'csum_complete_sw' not described in 'sk_buff'
   include/linux/skbuff.h:888: warning: Function parameter or member 'csum_level' not described in 'sk_buff'
   include/linux/skbuff.h:888: warning: Function parameter or member 'inner_protocol_type' not described in 'sk_buff'
   include/linux/skbuff.h:888: warning: Function parameter or member 'remcsum_offload' not described in 'sk_buff'
   include/linux/skbuff.h:888: warning: Function parameter or member 'sender_cpu' not described in 'sk_buff'
   include/linux/skbuff.h:888: warning: Function parameter or member 'reserved_tailroom' not described in 'sk_buff'
   include/linux/skbuff.h:888: warning: Function parameter or member 'inner_ipproto' not described in 'sk_buff'
   include/net/sock.h:233: warning: Function parameter or member 'skc_addrpair' not described in 'sock_common'
   include/net/sock.h:233: warning: Function parameter or member 'skc_portpair' not described in 'sock_common'
   include/net/sock.h:233: warning: Function parameter or member 'skc_ipv6only' not described in 'sock_common'
   include/net/sock.h:233: warning: Function parameter or member 'skc_net_refcnt' not described in 'sock_common'
   include/net/sock.h:233: warning: Function parameter or member 'skc_v6_daddr' not described in 'sock_common'
   include/net/sock.h:233: warning: Function parameter or member 'skc_v6_rcv_saddr' not described in 'sock_common'
   include/net/sock.h:233: warning: Function parameter or member 'skc_cookie' not described in 'sock_common'
   include/net/sock.h:233: warning: Function parameter or member 'skc_listener' not described in 'sock_common'
   include/net/sock.h:233: warning: Function parameter or member 'skc_tw_dr' not described in 'sock_common'
   include/net/sock.h:233: warning: Function parameter or member 'skc_rcv_wnd' not described in 'sock_common'
   include/net/sock.h:233: warning: Function parameter or member 'skc_tw_rcv_nxt' not described in 'sock_common'
   include/net/sock.h:515: warning: Function parameter or member 'sk_rx_skb_cache' not described in 'sock'
   include/net/sock.h:515: warning: Function parameter or member 'sk_wq_raw' not described in 'sock'
   include/net/sock.h:515: warning: Function parameter or member 'tcp_rtx_queue' not described in 'sock'
   include/net/sock.h:515: warning: Function parameter or member 'sk_tx_skb_cache' not described in 'sock'
   include/net/sock.h:515: warning: Function parameter or member 'sk_route_forced_caps' not described in 'sock'
   include/net/sock.h:515: warning: Function parameter or member 'sk_txtime_report_errors' not described in 'sock'
   include/net/sock.h:515: warning: Function parameter or member 'sk_validate_xmit_skb' not described in 'sock'
   include/net/sock.h:515: warning: Function parameter or member 'sk_bpf_storage' not described in 'sock'
   include/net/sock.h:2439: warning: Function parameter or member 'tcp_rx_skb_cache_key' not described in 'DECLARE_STATIC_KEY_FALSE'
   include/net/sock.h:2439: warning: Excess function parameter 'sk' description in 'DECLARE_STATIC_KEY_FALSE'
   include/net/sock.h:2439: warning: Excess function parameter 'skb' description in 'DECLARE_STATIC_KEY_FALSE'
   include/linux/netdevice.h:2053: warning: Function parameter or member 'gso_partial_features' not described in 'net_device'
   include/linux/netdevice.h:2053: warning: Function parameter or member 'l3mdev_ops' not described in 'net_device'
   include/linux/netdevice.h:2053: warning: Function parameter or member 'xfrmdev_ops' not described in 'net_device'
   include/linux/netdevice.h:2053: warning: Function parameter or member 'tlsdev_ops' not described in 'net_device'
   include/linux/netdevice.h:2053: warning: Function parameter or member 'name_assign_type' not described in 'net_device'
   include/linux/netdevice.h:2053: warning: Function parameter or member 'ieee802154_ptr' not described in 'net_device'
   include/linux/netdevice.h:2053: warning: Function parameter or member 'mpls_ptr' not described in 'net_device'
   include/linux/netdevice.h:2053: warning: Function parameter or member 'xdp_prog' not described in 'net_device'
   include/linux/netdevice.h:2053: warning: Function parameter or member 'gro_flush_timeout' not described in 'net_device'
   include/linux/netdevice.h:2053: warning: Function parameter or member 'nf_hooks_ingress' not described in 'net_device'
   include/linux/netdevice.h:2053: warning: Function parameter or member '____cacheline_aligned_in_smp' not described in 'net_device'
   include/linux/netdevice.h:2053: warning: Function parameter or member 'qdisc_hash' not described in 'net_device'
   include/linux/netdevice.h:2053: warning: Function parameter or member 'xps_cpus_map' not described in 'net_device'
   include/linux/netdevice.h:2053: warning: Function parameter or member 'xps_rxqs_map' not described in 'net_device'
   include/linux/phylink.h:56: warning: Function parameter or member '__ETHTOOL_DECLARE_LINK_MODE_MASK(advertising' not described in 'phylink_link_state'
   include/linux/phylink.h:56: warning: Function parameter or member '__ETHTOOL_DECLARE_LINK_MODE_MASK(lp_advertising' not described in 'phylink_link_state'
   drivers/net/phy/phylink.c:595: warning: Function parameter or member 'config' not described in 'phylink_create'
   drivers/net/phy/phylink.c:595: warning: Excess function parameter 'ndev' description in 'phylink_create'
   lib/genalloc.c:1: warning: 'gen_pool_add_virt' not found
   lib/genalloc.c:1: warning: 'gen_pool_alloc' not found
   lib/genalloc.c:1: warning: 'gen_pool_free' not found
   lib/genalloc.c:1: warning: 'gen_pool_alloc_algo' not found
   include/linux/bitmap.h:341: warning: Function parameter or member 'nbits' not described in 'bitmap_or_equal'
   include/linux/rculist.h:374: warning: Excess function parameter 'cond' description in 'list_for_each_entry_rcu'
   include/linux/rculist.h:651: warning: Excess function parameter 'cond' description in 'hlist_for_each_entry_rcu'
   mm/util.c:1: warning: 'get_user_pages_fast' not found
   mm/slab.c:4215: warning: Function parameter or member 'objp' not described in '__ksize'
>> mm/filemap.c:3551: warning: Function parameter or member 'iocb' not described in 'generic_write_behind'
>> mm/filemap.c:3551: warning: Function parameter or member 'count' not described in 'generic_write_behind'
   include/drm/drm_modeset_helper_vtables.h:1053: warning: Function parameter or member 'prepare_writeback_job' not described in 'drm_connector_helper_funcs'
   include/drm/drm_modeset_helper_vtables.h:1053: warning: Function parameter or member 'cleanup_writeback_job' not described in 'drm_connector_helper_funcs'
   include/drm/drm_atomic_state_helper.h:1: warning: no structured comments found
   include/drm/drm_gem_shmem_helper.h:87: warning: Function parameter or member 'madv' not described in 'drm_gem_shmem_object'
   include/drm/drm_gem_shmem_helper.h:87: warning: Function parameter or member 'madv_list' not described in 'drm_gem_shmem_object'
   drivers/gpu/drm/i915/display/intel_dpll_mgr.h:158: warning: Enum value 'DPLL_ID_TGL_MGPLL5' not described in enum 'intel_dpll_id'
   drivers/gpu/drm/i915/display/intel_dpll_mgr.h:158: warning: Enum value 'DPLL_ID_TGL_MGPLL6' not described in enum 'intel_dpll_id'
   drivers/gpu/drm/i915/display/intel_dpll_mgr.h:158: warning: Excess enum value 'DPLL_ID_TGL_TCPLL6' description in 'intel_dpll_id'
   drivers/gpu/drm/i915/display/intel_dpll_mgr.h:158: warning: Excess enum value 'DPLL_ID_TGL_TCPLL5' description in 'intel_dpll_id'
   drivers/gpu/drm/i915/display/intel_dpll_mgr.h:342: warning: Function parameter or member 'wakeref' not described in 'intel_shared_dpll'
   Error: Cannot open file drivers/gpu/drm/i915/i915_gem_batch_pool.c
   Error: Cannot open file drivers/gpu/drm/i915/i915_gem_batch_pool.c
   Error: Cannot open file drivers/gpu/drm/i915/i915_gem_batch_pool.c
   drivers/gpu/drm/i915/i915_drv.h:1129: warning: Incorrect use of kernel-doc format: Documentation Makefile include scripts source The OA context specific information.
   drivers/gpu/drm/i915/i915_drv.h:1143: warning: Incorrect use of kernel-doc format: Documentation Makefile include scripts source State of the OA buffer.
   drivers/gpu/drm/i915/i915_drv.h:1154: warning: Incorrect use of kernel-doc format: Documentation Makefile include scripts source Locks reads and writes to all head/tail state
   drivers/gpu/drm/i915/i915_drv.h:1176: warning: Incorrect use of kernel-doc format: Documentation Makefile include scripts source One 'aging' tail pointer and one 'aged' tail pointer ready to
   drivers/gpu/drm/i915/i915_drv.h:1188: warning: Incorrect use of kernel-doc format: Documentation Makefile include scripts source Index for the aged tail ready to read() data up to.
   drivers/gpu/drm/i915/i915_drv.h:1193: warning: Incorrect use of kernel-doc format: Documentation Makefile include scripts source A monotonic timestamp for when the current aging tail pointer
   drivers/gpu/drm/i915/i915_drv.h:1199: warning: Incorrect use of kernel-doc format: Documentation Makefile include scripts source Although we can always read back the head pointer register,
   drivers/gpu/drm/i915/i915_drv.h:1207: warning: Function parameter or member 'pinned_ctx' not described in 'i915_perf_stream'
   drivers/gpu/drm/i915/i915_drv.h:1207: warning: Function parameter or member 'specific_ctx_id' not described in 'i915_perf_stream'
   drivers/gpu/drm/i915/i915_drv.h:1207: warning: Function parameter or member 'specific_ctx_id_mask' not described in 'i915_perf_stream'
   drivers/gpu/drm/i915/i915_drv.h:1207: warning: Function parameter or member 'poll_check_timer' not described in 'i915_perf_stream'
   drivers/gpu/drm/i915/i915_drv.h:1207: warning: Function parameter or member 'poll_wq' not described in 'i915_perf_stream'
   drivers/gpu/drm/i915/i915_drv.h:1207: warning: Function parameter or member 'pollin' not described in 'i915_perf_stream'
   drivers/gpu/drm/i915/i915_drv.h:1207: warning: Function parameter or member 'periodic' not described in 'i915_perf_stream'
   drivers/gpu/drm/i915/i915_drv.h:1207: warning: Function parameter or member 'period_exponent' not described in 'i915_perf_stream'
   drivers/gpu/drm/i915/i915_drv.h:1207: warning: Function parameter or member 'oa_buffer' not described in 'i915_perf_stream'
   drivers/gpu/drm/i915/i915_drv.h:1129: warning: Incorrect use of kernel-doc format: Documentation Makefile include scripts source The OA context specific information.
   drivers/gpu/drm/i915/i915_drv.h:1143: warning: Incorrect use of kernel-doc format: Documentation Makefile include scripts source State of the OA buffer.
   drivers/gpu/drm/i915/i915_drv.h:1154: warning: Incorrect use of kernel-doc format: Documentation Makefile include scripts source Locks reads and writes to all head/tail state
   drivers/gpu/drm/i915/i915_drv.h:1176: warning: Incorrect use of kernel-doc format: Documentation Makefile include scripts source One 'aging' tail pointer and one 'aged' tail pointer ready to
   drivers/gpu/drm/i915/i915_drv.h:1188: warning: Incorrect use of kernel-doc format: Documentation Makefile include scripts source Index for the aged tail ready to read() data up to.
   drivers/gpu/drm/i915/i915_drv.h:1193: warning: Incorrect use of kernel-doc format: Documentation Makefile include scripts source A monotonic timestamp for when the current aging tail pointer
   drivers/gpu/drm/i915/i915_drv.h:1199: warning: Incorrect use of kernel-doc format: Documentation Makefile include scripts source Although we can always read back the head pointer register,
   drivers/gpu/drm/i915/i915_drv.h:1129: warning: Incorrect use of kernel-doc format: Documentation Makefile include scripts source The OA context specific information.
   drivers/gpu/drm/i915/i915_drv.h:1143: warning: Incorrect use of kernel-doc format: Documentation Makefile include scripts source State of the OA buffer.
   drivers/gpu/drm/i915/i915_drv.h:1154: warning: Incorrect use of kernel-doc format: Documentation Makefile include scripts source Locks reads and writes to all head/tail state
   drivers/gpu/drm/i915/i915_drv.h:1176: warning: Incorrect use of kernel-doc format: Documentation Makefile include scripts source One 'aging' tail pointer and one 'aged' tail pointer ready to
   drivers/gpu/drm/i915/i915_drv.h:1188: warning: Incorrect use of kernel-doc format: Documentation Makefile include scripts source Index for the aged tail ready to read() data up to.
   drivers/gpu/drm/i915/i915_drv.h:1193: warning: Incorrect use of kernel-doc format: Documentation Makefile include scripts source A monotonic timestamp for when the current aging tail pointer
   drivers/gpu/drm/i915/i915_drv.h:1199: warning: Incorrect use of kernel-doc format: Documentation Makefile include scripts source Although we can always read back the head pointer register,
   drivers/gpu/drm/mcde/mcde_drv.c:1: warning: 'ST-Ericsson MCDE DRM Driver' not found
   include/net/cfg80211.h:1185: warning: Function parameter or member 'txpwr' not described in 'station_parameters'
   include/net/mac80211.h:4056: warning: Function parameter or member 'sta_set_txpwr' not described in 'ieee80211_ops'
   include/net/mac80211.h:2018: warning: Function parameter or member 'txpwr' not described in 'ieee80211_sta'
   Documentation/admin-guide/perf/imx-ddr.rst:21: WARNING: Unexpected indentation.
   Documentation/admin-guide/perf/imx-ddr.rst:34: WARNING: Unexpected indentation.
   Documentation/admin-guide/perf/imx-ddr.rst:40: WARNING: Unexpected indentation.
   Documentation/admin-guide/perf/imx-ddr.rst:45: WARNING: Unexpected indentation.
   Documentation/admin-guide/perf/imx-ddr.rst:52: WARNING: Unexpected indentation.
   Documentation/hwmon/inspur-ipsps1.rst:2: WARNING: Title underline too short.
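
The two warnings marked with >> are raised because the kernel-doc comment of
generic_write_behind() does not describe its parameters. One possible fix is
sketched below. The parameter wording is an assumption inferred from how the
function uses iocb->ki_pos and count, and struct kiocb is only forward-declared
here so that the fragment compiles on its own; the function body is omitted.

```c
#include <sys/types.h>	/* ssize_t */

struct kiocb;		/* opaque stand-in for the kernel type */

/**
 * generic_write_behind() - writeback dirty pages behind current position.
 * @iocb: I/O control block of the write that has just completed
 *        (description wording is an assumption)
 * @count: number of bytes written by that write
 *         (description wording is an assumption)
 *
 * This function tracks writing position. If file has enough sequentially
 * written data it starts background writeback and then waits for previous
 * writeback initiated some iterations ago.
 */
void generic_write_behind(struct kiocb *iocb, ssize_t count)
{
	/* Body omitted in this sketch; only the comment format matters. */
	(void)iocb;
	(void)count;
}
```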

# https://github.com/0day-ci/linux/commit/e0e7df8d5b71bf59ad93fe75e662c929b580d805
git remote add linux-review https://github.com/0day-ci/linux
git remote update linux-review
git checkout e0e7df8d5b71bf59ad93fe75e662c929b580d805
vim +3551 mm/filemap.c

e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3534  
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3535  /**
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3536   * generic_write_behind() - writeback dirty pages behind current position.
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3537   *
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3538   * This function tracks writing position. If file has enough sequentially
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3539   * written data it starts background writeback and then waits for previous
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3540   * writeback initiated some iterations ago.
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3541   *
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3542   * Write-behind maintains per-file head cursor in file->f_write_behind and
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3543   * two windows around: background writeback before and pending data after.
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3544   *
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3545   * |<-wait-this->|           |<-send-this->|<---pending-write-behind--->|
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3546   * |<--async-write-behind--->|<--------previous-data------>|<-new-data->|
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3547   *              current head-^    new head-^              file position-^
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3548   */
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3549  void generic_write_behind(struct kiocb *iocb, ssize_t count)
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3550  {
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20 @3551  	struct file *file = iocb->ki_filp;
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3552  	struct address_space *mapping = file->f_mapping;
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3553  	struct inode *inode = mapping->host;
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3554  	struct backing_dev_info *bdi = inode_to_bdi(inode);
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3555  	unsigned long window = READ_ONCE(bdi->write_behind_pages);
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3556  	pgoff_t head = file->f_write_behind;
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3557  	pgoff_t begin = (iocb->ki_pos - count) >> PAGE_SHIFT;
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3558  	pgoff_t end = iocb->ki_pos >> PAGE_SHIFT;
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3559  
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3560  	/* Skip if write is random, direct, sync or disabled for disk */
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3561  	if ((file->f_mode & FMODE_RANDOM) || !window ||
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3562  	    (iocb->ki_flags & (IOCB_DIRECT | IOCB_DSYNC)))
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3563  		return;
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3564  
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3565  	/* Skip non-sequential writes in strictly sequential mode. */
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3566  	if (vm_dirty_write_behind < 2 &&
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3567  	    iocb->ki_pos != i_size_read(inode) &&
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3568  	    !(iocb->ki_flags & IOCB_APPEND))
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3569  		return;
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3570  
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3571  	/* Contiguous write and still within window. */
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3572  	if (end - head < window)
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3573  		return;
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3574  
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3575  	spin_lock(&file->f_lock);
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3576  
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3577  	/* Re-read under lock. */
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3578  	head = file->f_write_behind;
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3579  
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3580  	/* Non-contiguous, move head position. */
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3581  	if (head > end || begin - head > window) {
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3582  		/*
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3583  		 * Append might happen through multiple files or via new file
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3584  		 * every time. Align head cursor to cover previous appends.
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3585  		 */
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3586  		if (iocb->ki_flags & IOCB_APPEND)
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3587  			begin = roundup(begin - min(begin, window - 1),
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3588  					bdi->io_pages);
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3589  
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3590  		file->f_write_behind = head = begin;
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3591  	}
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3592  
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3593  	/* Still not big enough. */
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3594  	if (end - head < window) {
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3595  		spin_unlock(&file->f_lock);
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3596  		return;
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3597  	}
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3598  
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3599  	/* Write excess and try at least max_sectors_kb if possible */
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3600  	end = head + max(end - head - window, min(end - head, bdi->io_pages));
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3601  
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3602  	/* Set head for next iteration, everything behind will be written. */
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3603  	file->f_write_behind = end;
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3604  
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3605  	spin_unlock(&file->f_lock);
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3606  
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3607  	/* Start background writeback. */
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3608  	__filemap_fdatawrite_range(mapping,
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3609  				   (loff_t)head << PAGE_SHIFT,
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3610  				   ((loff_t)end << PAGE_SHIFT) - 1,
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3611  				   WB_SYNC_NONE);
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3612  
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3613  	if (head < window)
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3614  		return;
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3615  
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3616  	/* Wait for pages falling behind writeback window. */
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3617  	head -= window;
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3618  	end -= window;
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3619  	__filemap_fdatawait_range(mapping,
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3620  				  (loff_t)head << PAGE_SHIFT,
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3621  				  ((loff_t)end << PAGE_SHIFT) - 1);
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3622  }
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3623  EXPORT_SYMBOL(generic_write_behind);
e0e7df8d5b71bf Konstantin Khlebnikov 2019-09-20  3624  
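
The cursor arithmetic in generic_write_behind() above is easier to follow
with concrete numbers. What follows is only a userspace sketch of that
arithmetic, not kernel code: struct wb_model, struct wb_result and
wb_model_step() are hypothetical names, the sysctl mode check, the
FMODE_RANDOM/IOCB_* checks, the IOCB_APPEND alignment and the locking are all
left out, and uint64_t stands in for pgoff_t on a 64-bit build.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model of the cursor arithmetic; all values are page indices. */
struct wb_model {
	uint64_t head;     /* stands in for file->f_write_behind */
	uint64_t window;   /* stands in for bdi->write_behind_pages */
	uint64_t io_pages; /* stands in for bdi->io_pages */
};

struct wb_result {
	int start_io;                  /* 1 if background writeback is started */
	uint64_t write_from, write_to; /* pages [write_from, write_to) to submit */
	int wait;                      /* 1 if an older range is waited for */
	uint64_t wait_from, wait_to;   /* pages [wait_from, wait_to) to wait on */
};

static uint64_t min_u64(uint64_t a, uint64_t b) { return a < b ? a : b; }
static uint64_t max_u64(uint64_t a, uint64_t b) { return a > b ? a : b; }

/* One call per completed write that covered pages [begin, end). */
static struct wb_result wb_model_step(struct wb_model *m,
				      uint64_t begin, uint64_t end)
{
	struct wb_result r = { 0 };
	uint64_t head = m->head;

	assert(m->window > 0 && begin <= end);

	/* Fast path: still within the window (unsigned math, as in the patch). */
	if (end - head < m->window)
		return r;

	/* Non-contiguous write: restart tracking at the new position. */
	if (head > end || begin - head > m->window)
		m->head = head = begin;

	/* Still not enough data behind the write position. */
	if (end - head < m->window)
		return r;

	/* Submit the excess over the window, but at least io_pages if possible. */
	end = head + max_u64(end - head - m->window,
			     min_u64(end - head, m->io_pages));
	m->head = end;

	r.start_io = 1;
	r.write_from = head;
	r.write_to = end;

	/* Throttle: wait for the range submitted one window earlier. */
	if (head >= m->window) {
		r.wait = 1;
		r.wait_from = head - m->window;
		r.wait_to = end - m->window;
	}
	return r;
}
```

With a window of 4096 pages (the 16MiB default, assuming 4KiB pages) and an
assumed io_pages of 320, the first write that reaches page 4096 submits pages
[0, 320) and does not wait. Once the head has moved past one full window,
each step also waits for the chunk that lies exactly one window behind the
one just submitted.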

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Tejun Heo Sept. 23, 2019, 2:52 p.m. UTC | #5
Hello, Konstantin.

On Fri, Sep 20, 2019 at 10:39:33AM +0300, Konstantin Khlebnikov wrote:
> With vm.dirty_write_behind 1 or 2 files are written even faster and

Is the faster speed reproducible?  I don't quite understand why this
would be.

> during copying amount of dirty memory always stays around at 16MiB.

The following is the test part of a slightly modified version of your
test script which should run fine on any modern system.

  for mode in 0 1; do
	  if [ $mode == 0 ]; then
		  prefix=''
	  else
		  prefix='systemd-run --user --scope -p MemoryMax=64M'
	  fi

	  echo COPY
	  time $prefix cp -r dummy copy

	  grep Dirty /proc/meminfo

	  echo SYNC
	  time sync

	  rm -fr copy
  done

and the result looks like the following.

  $ ./test-writebehind.sh
  SIZE
  3.3G    dummy
  COPY

  real    0m2.859s
  user    0m0.015s
  sys     0m2.843s
  Dirty:           3416780 kB
  SYNC

  real    0m34.008s
  user    0m0.000s
  sys     0m0.008s
  COPY
  Running scope as unit: run-r69dca5326a9a435d80e036435ff9e1da.scope

  real    0m32.267s
  user    0m0.032s
  sys     0m4.186s
  Dirty:             14304 kB
  SYNC

  real    0m1.783s
  user    0m0.000s
  sys     0m0.006s

This is how we are solving the massive dirtier problem.  It's easy,
works pretty well and can easily be tailored to the specific
requirements.

Generic write-behind would definitely have other benefits and also a
bunch of regression possibilities.  I'm not trying to say that
write-behind isn't a good idea but it'd be useful to consider that a
good portion of the benefits can already be obtained fairly easily.

Thanks.
Konstantin Khlebnikov Sept. 23, 2019, 3:06 p.m. UTC | #6
On 23/09/2019 17.52, Tejun Heo wrote:
> Hello, Konstantin.
> 
> On Fri, Sep 20, 2019 at 10:39:33AM +0300, Konstantin Khlebnikov wrote:
>> With vm.dirty_write_behind 1 or 2 files are written even faster and
> 
> Is the faster speed reproducible?  I don't quite understand why this
> would be.

Writing to disk simply starts earlier.

> 
>> during copying amount of dirty memory always stays around at 16MiB.
> 
> The following is the test part of a slightly modified version of your
> test script which should run fine on any modern system.
> 
>    for mode in 0 1; do
> 	  if [ $mode == 0 ]; then
> 		  prefix=''
> 	  else
> 		  prefix='systemd-run --user --scope -p MemoryMax=64M'
> 	  fi
> 
> 	  echo COPY
> 	  time $prefix cp -r dummy copy
> 
> 	  grep Dirty /proc/meminfo
> 
> 	  echo SYNC
> 	  time sync
> 
> 	  rm -fr copy
>    done
> 
> and the result looks like the following.
> 
>    $ ./test-writebehind.sh
>    SIZE
>    3.3G    dummy
>    COPY
> 
>    real    0m2.859s
>    user    0m0.015s
>    sys     0m2.843s
>    Dirty:           3416780 kB
>    SYNC
> 
>    real    0m34.008s
>    user    0m0.000s
>    sys     0m0.008s
>    COPY
>    Running scope as unit: run-r69dca5326a9a435d80e036435ff9e1da.scope
> 
>    real    0m32.267s
>    user    0m0.032s
>    sys     0m4.186s
>    Dirty:             14304 kB
>    SYNC
> 
>    real    0m1.783s
>    user    0m0.000s
>    sys     0m0.006s
> 
> This is how we are solving the massive dirtier problem.  It's easy,
> works pretty well and can easily be tailored to the specific
> requirements.
> 
> Generic write-behind would definitely have other benefits and also a
> bunch of regression possibilities.  I'm not trying to say that
> write-behind isn't a good idea but it'd be useful to consider that a
> good portion of the benefits can already be obtained fairly easily.
> 

I'm afraid this could end badly if every simple task like file copying
requires its own systemd job and container with manual tuning.
Tejun Heo Sept. 23, 2019, 3:19 p.m. UTC | #7
Hello,

On Mon, Sep 23, 2019 at 06:06:46PM +0300, Konstantin Khlebnikov wrote:
> On 23/09/2019 17.52, Tejun Heo wrote:
> >Hello, Konstantin.
> >
> >On Fri, Sep 20, 2019 at 10:39:33AM +0300, Konstantin Khlebnikov wrote:
> >>With vm.dirty_write_behind 1 or 2 files are written even faster and
> >
> >Is the faster speed reproducible?  I don't quite understand why this
> >would be.
> 
> Writing to disk simply starts earlier.

I see.

> >Generic write-behind would definitely have other benefits and also a
> >bunch of regression possibilities.  I'm not trying to say that
> >write-behind isn't a good idea but it'd be useful to consider that a
> >good portion of the benefits can already be obtained fairly easily.
> >
> 
> I'm afraid this could end badly if every simple task like file copying
> requires its own systemd job and container with manual tuning.

At least the write window size part of it is pretty easy - the range
of acceptable values is fairly wide - and setting up a cgroup and
running a command in it isn't that expensive.  It's not like these
need full-on containers.  That said, yes, there sure are benefits to
the kernel being able to detect and handle these conditions
automagically.

Thanks.
Jens Axboe Sept. 23, 2019, 3:36 p.m. UTC | #8
On 9/20/19 5:10 PM, Linus Torvalds wrote:
> On Fri, Sep 20, 2019 at 4:05 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>>
>> Now, I hear you say "those are so small these days that it doesn't
>> matter". And maybe you're right. But particularly for slow media,
>> triggering good streaming write behavior has been a problem in the
>> past.
> 
> Which reminds me: the writebehind trigger should likely be tied to the
> estimate of the bdi write speed.
> 
> We _do_ have that avg_write_bandwidth thing in the bdi_writeback
> structure, it sounds like a potentially good idea to try to use that
> to estimate when to do writebehind.
> 
> No?

I really like the feature, and agree it should be tied to the bdi write
speed. How about just making the tunable the acceptable time of write-behind
dirty data? E.g. if write_behind_msec is 1000, allow 1s of pending dirty data
before starting writeback.
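
As a rough sketch of that suggestion, the time budget could be turned into a
page window as below. This is illustrative only: write_behind_window_pages()
is a hypothetical helper, write_behind_msec is just the name floated above,
and the bandwidth argument is assumed to be an estimate in pages per second,
such as avg_write_bandwidth might provide.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a time budget into a window size in pages.
 * bw_pages_per_sec is assumed to be a write bandwidth estimate in pages
 * per second; write_behind_msec is the proposed time-based tunable.
 */
static uint64_t write_behind_window_pages(uint64_t bw_pages_per_sec,
					  uint64_t write_behind_msec)
{
	/* Guard against overflow of the intermediate product. */
	assert(write_behind_msec == 0 ||
	       bw_pages_per_sec <= UINT64_MAX / write_behind_msec);

	return bw_pages_per_sec * write_behind_msec / 1000;
}
```

With 4KiB pages, an estimate of 25600 pages/s (100MiB/s) and a budget of
1000 msec gives a window of 25600 pages, i.e. 100MiB of pending dirty data.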
Konstantin Khlebnikov Sept. 23, 2019, 4:05 p.m. UTC | #9
On 23/09/2019 18.36, Jens Axboe wrote:
> On 9/20/19 5:10 PM, Linus Torvalds wrote:
>> On Fri, Sep 20, 2019 at 4:05 PM Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>>
>>>
>>> Now, I hear you say "those are so small these days that it doesn't
>>> matter". And maybe you're right. But particularly for slow media,
>>> triggering good streaming write behavior has been a problem in the
>>> past.
>>
>> Which reminds me: the writebehind trigger should likely be tied to the
>> estimate of the bdi write speed.
>>
>> We _do_ have that avg_write_bandwidth thing in the bdi_writeback
>> structure, it sounds like a potentially good idea to try to use that
>> to estimate when to do writebehind.
>>
>> No?
> 
> I really like the feature, and agree it should be tied to the bdi write
> speed. How about just making the tunable the acceptable time of write-behind
> dirty data? E.g. if write_behind_msec is 1000, allow 1s of pending dirty data
> before starting writeback.
> 

I haven't dug into it yet.

But IIRR writeback speed estimation has some problems:

There is no "slow start" - initial speed is 100MiB/s.
This is especially bad for slow usb disks - right after plugging
we'll accumulate too much dirty cache before starting writeback.

And I've seen problems with cgroup-writeback:
each cgroup has its own estimate, which doesn't work well for short-lived cgroups.
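
To put a number on the "slow start" concern, here is a small sketch. It
assumes 4KiB pages and a made-up USB disk that really sustains 10MiB/s;
both helpers are hypothetical, and only the 100MiB/s initial estimate comes
from the text above.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096u /* assumption: 4KiB pages */

/* Convert MiB/s into pages per second. */
static uint64_t mib_per_sec_to_pages(uint64_t mib)
{
	return mib * (1024u * 1024u / PAGE_SIZE_BYTES);
}

/*
 * Time in msec a device with real bandwidth real_bw_pages (pages/s) needs
 * to write out the dirty data accumulated for budget_msec at the estimated
 * bandwidth est_bw_pages (pages/s).
 */
static uint64_t actual_drain_msec(uint64_t est_bw_pages,
				  uint64_t real_bw_pages,
				  uint64_t budget_msec)
{
	assert(real_bw_pages > 0);
	return est_bw_pages * budget_msec / real_bw_pages;
}
```

Under these assumptions, a 1000 msec budget computed from the initial
100MiB/s estimate lets 100MiB of dirty data pile up, which the 10MiB/s disk
then needs about 10 seconds to write out.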
Konstantin Khlebnikov Sept. 23, 2019, 7:11 p.m. UTC | #10
On Mon, Sep 23, 2019 at 3:37 AM kernel test robot <rong.a.chen@intel.com> wrote:
>
> Greeting,
>
> FYI, we noticed a -7.3% regression of will-it-scale.per_process_ops due to commit:

Most likely this is caused by the change in struct file layout after adding the new field.
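
A rough illustration of how one extra field can matter, under loud
assumptions: objects are aligned to a 64-byte cache line and packed into a
single 4KiB page, and the sizes 256 and 264 are made-up examples, not
measured sizes of struct file. Real slab caches may use larger page orders.
Both helpers are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64   /* assumption: 64-byte cache lines */
#define SLAB_PAGE_BYTES  4096 /* assumption: one 4KiB page per slab */

/* Round size up to a multiple of align (align must be a power of two). */
static size_t align_up(size_t size, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (size + align - 1) & ~(align - 1);
}

/* Number of cache-line-aligned objects that fit into one slab page. */
static size_t objs_per_page(size_t obj_size)
{
	return SLAB_PAGE_BYTES / align_up(obj_size, CACHE_LINE_BYTES);
}
```

In this model, growing an object from 256 to 264 bytes rounds it up to 320
bytes, so a page holds 12 objects instead of 16 and each object spans one
more cache line. That would be in the same direction as the growth of
slabinfo.filp and of cache misses in the report below, but it is only a
model.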

>
>
> commit: e0e7df8d5b71bf59ad93fe75e662c929b580d805 ("[PATCH v2] mm: implement write-behind policy for sequential file writes")
> url: https://github.com/0day-ci/linux/commits/Konstantin-Khlebnikov/mm-implement-write-behind-policy-for-sequential-file-writes/20190920-155606
>
>
> in testcase: will-it-scale
> on test machine: 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory
> with following parameters:
>
>         nr_task: 100%
>         mode: process
>         test: open1
>         cpufreq_governor: performance
>
> test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
> test-url: https://github.com/antonblanchard/will-it-scale
>
>
>
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot <rong.a.chen@intel.com>
>
>
> Details are as below:
> -------------------------------------------------------------------------------------------------->
>
>
> To reproduce:
>
>         git clone https://github.com/intel/lkp-tests.git
>         cd lkp-tests
>         bin/lkp install job.yaml  # job file is attached in this email
>         bin/lkp run     job.yaml
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
>   gcc-7/performance/x86_64-rhel-7.6/process/100%/debian-x86_64-2019-05-14.cgz/lkp-csl-2ap4/open1/will-it-scale
>
> commit:
>   574cc45397 (" drm main pull for 5.4-rc1")
>   e0e7df8d5b ("mm: implement write-behind policy for sequential file writes")
>
> 574cc4539762561d e0e7df8d5b71bf59ad93fe75e66
> ---------------- ---------------------------
>          %stddev     %change         %stddev
>              \          |                \
>     370456            -7.3%     343238        will-it-scale.per_process_ops
>   71127653            -7.3%   65901758        will-it-scale.workload
>     828565 ± 23%     +66.8%    1381984 ± 23%  cpuidle.C1.time
>       1499            +1.1%       1515        turbostat.Avg_MHz
>     163498 ±  5%     +26.4%     206691 ±  4%  slabinfo.filp.active_slabs
>     163498 ±  5%     +26.4%     206691 ±  4%  slabinfo.filp.num_slabs
>      39055 ±  2%     +17.1%      45720 ±  5%  meminfo.Inactive
>      38615 ±  2%     +17.3%      45291 ±  5%  meminfo.Inactive(anon)
>      51382 ±  3%     +19.6%      61469 ±  7%  meminfo.Mapped
>    5163010 ±  2%     +12.7%    5819765 ±  3%  meminfo.Memused
>    2840181 ±  3%     +22.5%    3478003 ±  5%  meminfo.SUnreclaim
>    2941874 ±  3%     +21.7%    3579791 ±  5%  meminfo.Slab
>      67755 ±  5%     +23.8%      83884 ±  3%  meminfo.max_used_kB
>   79719901           +17.3%   93512842        numa-numastat.node0.local_node
>   79738690           +17.3%   93533079        numa-numastat.node0.numa_hit
>   81987497           +16.6%   95625946        numa-numastat.node1.local_node
>   82018695           +16.6%   95652480        numa-numastat.node1.numa_hit
>   82693483           +15.8%   95762465        numa-numastat.node2.local_node
>   82705924           +15.8%   95789007        numa-numastat.node2.numa_hit
>   80329941           +17.1%   94048289        numa-numastat.node3.local_node
>   80361116           +17.1%   94068512        numa-numastat.node3.numa_hit
>       9678 ±  2%     +17.1%      11334 ±  5%  proc-vmstat.nr_inactive_anon
>      13001 ±  3%     +19.2%      15503 ±  7%  proc-vmstat.nr_mapped
>     738232 ±  4%     +18.5%     875062 ±  2%  proc-vmstat.nr_slab_unreclaimable
>       9678 ±  2%     +17.1%      11334 ±  5%  proc-vmstat.nr_zone_inactive_anon
>       2391 ± 92%     -84.5%     369.50 ± 46%  proc-vmstat.numa_hint_faults
>  3.243e+08           +16.8%  3.789e+08        proc-vmstat.numa_hit
>  3.242e+08           +16.8%  3.788e+08        proc-vmstat.numa_local
>  1.296e+09           +16.8%  1.514e+09        proc-vmstat.pgalloc_normal
>  1.296e+09           +16.8%  1.514e+09        proc-vmstat.pgfree
>     862.61 ±  5%     +37.7%       1188 ±  5%  sched_debug.cfs_rq:/.exec_clock.stddev
>     229663 ± 62%    +113.3%     489907 ± 29%  sched_debug.cfs_rq:/.load.max
>     491.04 ±  4%      -9.5%     444.29 ±  7%  sched_debug.cfs_rq:/.nr_spread_over.min
>     229429 ± 62%    +113.4%     489618 ± 29%  sched_debug.cfs_rq:/.runnable_weight.max
>   -1959962           +36.2%   -2669681        sched_debug.cfs_rq:/.spread0.min
>    1416008 ±  2%     -13.3%    1227494 ±  5%  sched_debug.cpu.avg_idle.avg
>    1240763 ±  8%     -28.2%     891028 ± 18%  sched_debug.cpu.avg_idle.stddev
>     352361 ±  6%     -29.6%     248105 ± 25%  sched_debug.cpu.max_idle_balance_cost.stddev
>     -20.00           +51.0%     -30.21        sched_debug.cpu.nr_uninterruptible.min
>       6618 ± 10%     -20.8%       5240 ±  8%  sched_debug.cpu.ttwu_count.max
>    1452719 ±  4%      +7.2%    1557262 ±  3%  numa-meminfo.node0.MemUsed
>     797565 ±  2%     +20.8%     963538 ±  2%  numa-meminfo.node0.SUnreclaim
>     835343 ±  3%     +19.6%     998867 ±  2%  numa-meminfo.node0.Slab
>     831114 ±  2%     +20.1%     998248 ±  2%  numa-meminfo.node1.SUnreclaim
>     848052           +19.8%    1016069 ±  2%  numa-meminfo.node1.Slab
>    1441558 ±  6%     +15.7%    1668466 ±  3%  numa-meminfo.node2.MemUsed
>     879835 ±  2%     +20.4%    1059441        numa-meminfo.node2.SUnreclaim
>     901359 ±  3%     +20.3%    1084727 ±  2%  numa-meminfo.node2.Slab
>    1446041 ±  5%     +15.5%    1669477 ±  3%  numa-meminfo.node3.MemUsed
>     899442 ±  5%     +23.0%    1106354        numa-meminfo.node3.SUnreclaim
>     924903 ±  5%     +22.1%    1129709        numa-meminfo.node3.Slab
>     198945           +19.8%     238298 ±  2%  numa-vmstat.node0.nr_slab_unreclaimable
>   40181885           +17.3%   47129598        numa-vmstat.node0.numa_hit
>   40163521           +17.3%   47110122        numa-vmstat.node0.numa_local
>     208512           +20.9%     252000 ±  2%  numa-vmstat.node1.nr_slab_unreclaimable
>   41144466           +16.7%   48021716        numa-vmstat.node1.numa_hit
>   41027051           +16.8%   47908675        numa-vmstat.node1.numa_local
>     220763 ±  2%     +21.9%     269115 ±  2%  numa-vmstat.node2.nr_slab_unreclaimable
>   41437805           +16.2%   48167791        numa-vmstat.node2.numa_hit
>   41338581           +16.2%   48054485        numa-vmstat.node2.numa_local
>     225216 ±  2%     +24.7%     280851 ±  2%  numa-vmstat.node3.nr_slab_unreclaimable
>   40385721           +16.9%   47195289        numa-vmstat.node3.numa_hit
>   40268228           +16.9%   47088405        numa-vmstat.node3.numa_local
>      77.00 ± 29%    +494.8%     458.00 ±110%  interrupts.CPU10.RES:Rescheduling_interrupts
>     167.25 ± 65%    +347.8%     749.00 ± 85%  interrupts.CPU103.RES:Rescheduling_interrupts
>     136.50 ± 42%    +309.2%     558.50 ± 85%  interrupts.CPU107.RES:Rescheduling_interrupts
>     132.50 ± 26%    +637.5%     977.25 ± 50%  interrupts.CPU109.RES:Rescheduling_interrupts
>     212.50 ± 51%     -65.2%      74.00 ±  9%  interrupts.CPU115.RES:Rescheduling_interrupts
>     270.25 ± 20%     -77.2%      61.50 ± 10%  interrupts.CPU121.RES:Rescheduling_interrupts
>     184.00 ± 50%     -57.5%      78.25 ± 51%  interrupts.CPU128.RES:Rescheduling_interrupts
>      85.25 ± 38%    +911.4%     862.25 ±135%  interrupts.CPU137.RES:Rescheduling_interrupts
>      72.25 ±  6%    +114.2%     154.75 ± 25%  interrupts.CPU147.RES:Rescheduling_interrupts
>     415.00 ± 75%     -69.8%     125.25 ± 59%  interrupts.CPU15.RES:Rescheduling_interrupts
>     928.25 ± 93%     -89.8%      94.50 ± 50%  interrupts.CPU182.RES:Rescheduling_interrupts
>     359.75 ± 76%     -58.8%     148.25 ± 85%  interrupts.CPU19.RES:Rescheduling_interrupts
>      95.75 ± 30%    +103.9%     195.25 ± 48%  interrupts.CPU45.RES:Rescheduling_interrupts
>      60.25 ±  9%    +270.5%     223.25 ± 93%  interrupts.CPU83.RES:Rescheduling_interrupts
>     906.75 ±136%     -90.5%      85.75 ± 36%  interrupts.CPU85.RES:Rescheduling_interrupts
>     199.25 ± 25%     -52.1%      95.50 ± 43%  interrupts.CPU90.RES:Rescheduling_interrupts
>       5192 ± 34%     +41.5%       7347 ± 24%  interrupts.CPU95.NMI:Non-maskable_interrupts
>       5192 ± 34%     +41.5%       7347 ± 24%  interrupts.CPU95.PMI:Performance_monitoring_interrupts
>       1.75           +26.1%       2.20        perf-stat.i.MPKI
>  7.975e+10            -6.8%  7.435e+10        perf-stat.i.branch-instructions
>  3.782e+08            -5.9%  3.558e+08        perf-stat.i.branch-misses
>      75.36            +0.9       76.29        perf-stat.i.cache-miss-rate%
>  5.484e+08           +18.8%  6.515e+08        perf-stat.i.cache-misses
>  7.276e+08           +17.3%  8.539e+08        perf-stat.i.cache-references
>       1.37            +8.2%       1.48        perf-stat.i.cpi
>  5.701e+11            +0.7%  5.744e+11        perf-stat.i.cpu-cycles
>       1040           -15.2%     882.10        perf-stat.i.cycles-between-cache-misses
>  1.253e+11            -7.2%  1.163e+11        perf-stat.i.dTLB-loads
>  7.443e+10            -7.2%  6.904e+10        perf-stat.i.dTLB-stores
>  3.336e+08           +12.6%  3.755e+08        perf-stat.i.iTLB-load-misses
>    5004598 ±  7%     -60.9%    1954451 ±  6%  perf-stat.i.iTLB-loads
>  4.175e+11            -6.9%  3.887e+11        perf-stat.i.instructions
>       1251           -17.3%       1035        perf-stat.i.instructions-per-iTLB-miss
>       0.73            -7.6%       0.68        perf-stat.i.ipc
>      19.77            -1.5       18.31        perf-stat.i.node-load-miss-rate%
>    5003202 ±  2%     +16.5%    5829006        perf-stat.i.node-load-misses
>   20521507           +28.1%   26283838        perf-stat.i.node-loads
>       1.84            +0.4        2.28        perf-stat.i.node-store-miss-rate%
>    1469703           +29.0%    1895783        perf-stat.i.node-store-misses
>   78304054            +4.0%   81463725        perf-stat.i.node-stores
>       1.74           +26.1%       2.20        perf-stat.overall.MPKI
>      75.37            +0.9       76.30        perf-stat.overall.cache-miss-rate%
>       1.37            +8.2%       1.48        perf-stat.overall.cpi
>       1039           -15.2%     881.41        perf-stat.overall.cycles-between-cache-misses
>       1251           -17.3%       1035        perf-stat.overall.instructions-per-iTLB-miss
>       0.73            -7.6%       0.68        perf-stat.overall.ipc
>      19.59            -1.5       18.14        perf-stat.overall.node-load-miss-rate%
>       1.84            +0.4        2.27        perf-stat.overall.node-store-miss-rate%
>  7.943e+10            -6.8%  7.404e+10        perf-stat.ps.branch-instructions
>  3.767e+08            -5.9%  3.543e+08        perf-stat.ps.branch-misses
>  5.465e+08           +18.8%  6.492e+08        perf-stat.ps.cache-misses
>   7.25e+08           +17.4%  8.508e+08        perf-stat.ps.cache-references
>   5.68e+11            +0.7%  5.722e+11        perf-stat.ps.cpu-cycles
>  1.248e+11            -7.2%  1.158e+11        perf-stat.ps.dTLB-loads
>  7.413e+10            -7.3%  6.874e+10        perf-stat.ps.dTLB-stores
>  3.322e+08           +12.5%  3.739e+08        perf-stat.ps.iTLB-load-misses
>    4986239 ±  7%     -61.0%    1946378 ±  6%  perf-stat.ps.iTLB-loads
>  4.158e+11            -6.9%   3.87e+11        perf-stat.ps.instructions
>    4982520 ±  2%     +16.5%    5803884        perf-stat.ps.node-load-misses
>   20448588           +28.1%   26201547        perf-stat.ps.node-loads
>    1463675           +29.0%    1887791        perf-stat.ps.node-store-misses
>   77979119            +4.0%   81107191        perf-stat.ps.node-stores
>   1.25e+14            -6.8%  1.165e+14        perf-stat.total.instructions
>      10.11            -1.9        8.21        perf-profile.calltrace.cycles-pp.file_free_rcu.rcu_do_batch.rcu_core.__softirqentry_text_start.run_ksoftirqd
>      17.28            -0.8       16.48        perf-profile.calltrace.cycles-pp.close
>       9.41            -0.7        8.69        perf-profile.calltrace.cycles-pp.link_path_walk.path_openat.do_filp_open.do_sys_open.do_syscall_64
>       6.32            -0.7        5.64        perf-profile.calltrace.cycles-pp.do_dentry_open.path_openat.do_filp_open.do_sys_open.do_syscall_64
>       5.27            -0.5        4.72        perf-profile.calltrace.cycles-pp.__fput.task_work_run.exit_to_usermode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe
>      13.96            -0.5       13.49        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.close
>      13.58            -0.4       13.14        perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.close
>       0.92            -0.3        0.64        perf-profile.calltrace.cycles-pp.__close_fd.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.close
>       3.10            -0.2        2.86        perf-profile.calltrace.cycles-pp.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.close
>       2.44            -0.2        2.21        perf-profile.calltrace.cycles-pp.walk_component.link_path_walk.path_openat.do_filp_open.do_sys_open
>       4.02            -0.2        3.80        perf-profile.calltrace.cycles-pp.selinux_inode_permission.security_inode_permission.link_path_walk.path_openat.do_filp_open
>       1.82            -0.2        1.60        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64.open64
>       9.26            -0.2        9.04        perf-profile.calltrace.cycles-pp.task_work_run.exit_to_usermode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe.close
>       2.12 ±  2%      -0.2        1.90        perf-profile.calltrace.cycles-pp.lookup_fast.walk_component.link_path_walk.path_openat.do_filp_open
>       1.03 ± 10%      -0.2        0.82        perf-profile.calltrace.cycles-pp.inode_permission.link_path_walk.path_openat.do_filp_open.do_sys_open
>       2.55 ±  2%      -0.2        2.36 ±  3%  perf-profile.calltrace.cycles-pp.security_inode_permission.may_open.path_openat.do_filp_open.do_sys_open
>       1.37            -0.2        1.18        perf-profile.calltrace.cycles-pp.kmem_cache_alloc.getname_flags.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       1.15            -0.2        0.95        perf-profile.calltrace.cycles-pp.ima_file_check.path_openat.do_filp_open.do_sys_open.do_syscall_64
>       1.79            -0.2        1.60        perf-profile.calltrace.cycles-pp.entry_SYSCALL_64.close
>       2.41 ±  3%      -0.2        2.22 ±  3%  perf-profile.calltrace.cycles-pp.selinux_inode_permission.security_inode_permission.may_open.path_openat.do_filp_open
>       2.88            -0.2        2.71        perf-profile.calltrace.cycles-pp.security_file_open.do_dentry_open.path_openat.do_filp_open.do_sys_open
>       2.38            -0.2        2.22        perf-profile.calltrace.cycles-pp.security_file_alloc.__alloc_file.alloc_empty_file.path_openat.do_filp_open
>       4.31            -0.2        4.16        perf-profile.calltrace.cycles-pp.security_inode_permission.link_path_walk.path_openat.do_filp_open.do_sys_open
>       9.93            -0.1        9.80        perf-profile.calltrace.cycles-pp.exit_to_usermode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe.close
>       1.63            -0.1        1.50        perf-profile.calltrace.cycles-pp.kmem_cache_alloc.security_file_alloc.__alloc_file.alloc_empty_file.path_openat
>       1.38            -0.1        1.26        perf-profile.calltrace.cycles-pp.__alloc_fd.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe.open64
>       5.16            -0.1        5.04        perf-profile.calltrace.cycles-pp.getname_flags.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe.open64
>       1.13            -0.1        1.02        perf-profile.calltrace.cycles-pp.dput.terminate_walk.path_openat.do_filp_open.do_sys_open
>       2.26            -0.1        2.15        perf-profile.calltrace.cycles-pp.selinux_file_open.security_file_open.do_dentry_open.path_openat.do_filp_open
>       0.63            -0.1        0.52 ±  2%  perf-profile.calltrace.cycles-pp.__check_heap_object.__check_object_size.strncpy_from_user.getname_flags.do_sys_open
>       1.29            -0.1        1.18        perf-profile.calltrace.cycles-pp.lookup_fast.path_openat.do_filp_open.do_sys_open.do_syscall_64
>       1.75            -0.1        1.65        perf-profile.calltrace.cycles-pp.terminate_walk.path_openat.do_filp_open.do_sys_open.do_syscall_64
>       0.67            -0.1        0.58        perf-profile.calltrace.cycles-pp.kmem_cache_free.__fput.task_work_run.exit_to_usermode_loop.do_syscall_64
>       1.22 ±  2%      -0.1        1.12        perf-profile.calltrace.cycles-pp.avc_has_perm_noaudit.selinux_inode_permission.security_inode_permission.link_path_walk.path_openat
>       1.21            -0.1        1.12        perf-profile.calltrace.cycles-pp.fput_many.filp_close.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       0.74            -0.1        0.66        perf-profile.calltrace.cycles-pp.__inode_security_revalidate.selinux_file_open.security_file_open.do_dentry_open.path_openat
>       0.89            -0.1        0.81        perf-profile.calltrace.cycles-pp.inode_security_rcu.selinux_inode_permission.security_inode_permission.may_open.path_openat
>       0.79 ±  4%      -0.1        0.72        perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.task_work_run.exit_to_usermode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       0.76            -0.1        0.70        perf-profile.calltrace.cycles-pp.__inode_security_revalidate.inode_security_rcu.selinux_inode_permission.security_inode_permission.may_open
>       0.67 ±  3%      -0.1        0.61        perf-profile.calltrace.cycles-pp.__d_lookup_rcu.lookup_fast.path_openat.do_filp_open.do_sys_open
>       0.66 ±  3%      -0.1        0.60        perf-profile.calltrace.cycles-pp.inode_permission.may_open.path_openat.do_filp_open.do_sys_open
>       1.02            -0.1        0.96        perf-profile.calltrace.cycles-pp.path_init.path_openat.do_filp_open.do_sys_open.do_syscall_64
>       0.81            -0.1        0.75        perf-profile.calltrace.cycles-pp.task_work_add.fput_many.filp_close.__x64_sys_close.do_syscall_64
>       0.67            -0.0        0.63        perf-profile.calltrace.cycles-pp.rcu_segcblist_enqueue.__call_rcu.task_work_run.exit_to_usermode_loop.do_syscall_64
>       0.78            -0.0        0.74        perf-profile.calltrace.cycles-pp.__slab_free.kmem_cache_free.rcu_do_batch.rcu_core.__softirqentry_text_start
>       0.55            -0.0        0.53        perf-profile.calltrace.cycles-pp.selinux_file_alloc_security.security_file_alloc.__alloc_file.alloc_empty_file.path_openat
>       0.71            +0.1        0.82        perf-profile.calltrace.cycles-pp.memset_erms.kmem_cache_alloc.__alloc_file.alloc_empty_file.path_openat
>       3.38            +0.1        3.50        perf-profile.calltrace.cycles-pp.strncpy_from_user.getname_flags.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       1.66            +0.1        1.78        perf-profile.calltrace.cycles-pp.__call_rcu.task_work_run.exit_to_usermode_loop.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       0.70            +0.1        0.84        perf-profile.calltrace.cycles-pp.__virt_addr_valid.__check_object_size.strncpy_from_user.getname_flags.do_sys_open
>       1.81            +0.4        2.23        perf-profile.calltrace.cycles-pp.__check_object_size.strncpy_from_user.getname_flags.do_sys_open.do_syscall_64
>      39.47            +0.7       40.17        perf-profile.calltrace.cycles-pp.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe.open64
>       0.00            +0.8        0.75        perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages_nodemask.new_slab.___slab_alloc.__slab_alloc
>      38.69            +0.8       39.45        perf-profile.calltrace.cycles-pp.path_openat.do_filp_open.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
>       0.00            +0.8        0.84        perf-profile.calltrace.cycles-pp.__alloc_pages_nodemask.new_slab.___slab_alloc.__slab_alloc.kmem_cache_alloc
>      29.90            +0.9       30.79        perf-profile.calltrace.cycles-pp.__softirqentry_text_start.run_ksoftirqd.smpboot_thread_fn.kthread.ret_from_fork
>      29.90            +0.9       30.79        perf-profile.calltrace.cycles-pp.run_ksoftirqd.smpboot_thread_fn.kthread.ret_from_fork
>      29.87            +0.9       30.76        perf-profile.calltrace.cycles-pp.rcu_do_batch.rcu_core.__softirqentry_text_start.run_ksoftirqd.smpboot_thread_fn
>      29.88            +0.9       30.78        perf-profile.calltrace.cycles-pp.rcu_core.__softirqentry_text_start.run_ksoftirqd.smpboot_thread_fn.kthread
>      29.93            +0.9       30.84        perf-profile.calltrace.cycles-pp.smpboot_thread_fn.kthread.ret_from_fork
>      29.94            +0.9       30.85        perf-profile.calltrace.cycles-pp.ret_from_fork
>      29.94            +0.9       30.85        perf-profile.calltrace.cycles-pp.kthread.ret_from_fork
>       0.89 ± 29%      +0.9        1.81        perf-profile.calltrace.cycles-pp.setup_object_debug.new_slab.___slab_alloc.__slab_alloc.kmem_cache_alloc
>       7.25 ±  3%      +1.1        8.36        perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.free_one_page.__free_pages_ok.unfreeze_partials
>       7.75 ±  3%      +1.1        8.87        perf-profile.calltrace.cycles-pp.__free_pages_ok.unfreeze_partials.put_cpu_partial.kmem_cache_free.rcu_do_batch
>       7.72 ±  3%      +1.1        8.85        perf-profile.calltrace.cycles-pp.free_one_page.__free_pages_ok.unfreeze_partials.put_cpu_partial.kmem_cache_free
>       7.29 ±  3%      +1.1        8.41        perf-profile.calltrace.cycles-pp._raw_spin_lock.free_one_page.__free_pages_ok.unfreeze_partials.put_cpu_partial
>       9.12 ±  3%      +1.1       10.25        perf-profile.calltrace.cycles-pp.kmem_cache_free.rcu_do_batch.rcu_core.__softirqentry_text_start.run_ksoftirqd
>       7.96 ±  3%      +1.1        9.10        perf-profile.calltrace.cycles-pp.put_cpu_partial.kmem_cache_free.rcu_do_batch.rcu_core.__softirqentry_text_start
>       7.92 ±  3%      +1.1        9.07        perf-profile.calltrace.cycles-pp.unfreeze_partials.put_cpu_partial.kmem_cache_free.rcu_do_batch.rcu_core
>       2.38            +1.5        3.83        perf-profile.calltrace.cycles-pp.new_slab.___slab_alloc.__slab_alloc.kmem_cache_alloc.__alloc_file
>      10.53            +1.7       12.19        perf-profile.calltrace.cycles-pp.rcu_cblist_dequeue.rcu_do_batch.rcu_core.__softirqentry_text_start.run_ksoftirqd
>       5.47            +2.2        7.64        perf-profile.calltrace.cycles-pp.kmem_cache_alloc.__alloc_file.alloc_empty_file.path_openat.do_filp_open
>       3.34            +2.2        5.56        perf-profile.calltrace.cycles-pp.___slab_alloc.__slab_alloc.kmem_cache_alloc.__alloc_file.alloc_empty_file
>       3.39            +2.3        5.65        perf-profile.calltrace.cycles-pp.__slab_alloc.kmem_cache_alloc.__alloc_file.alloc_empty_file.path_openat
>      11.39            +2.7       14.08        perf-profile.calltrace.cycles-pp.alloc_empty_file.path_openat.do_filp_open.do_sys_open.do_syscall_64
>      10.91            +2.7       13.63        perf-profile.calltrace.cycles-pp.__alloc_file.alloc_empty_file.path_openat.do_filp_open.do_sys_open
>      10.62            -2.1        8.54        perf-profile.children.cycles-pp.file_free_rcu
>      17.31            -0.8       16.51        perf-profile.children.cycles-pp.close
>       9.47            -0.7        8.74        perf-profile.children.cycles-pp.link_path_walk
>       6.37            -0.7        5.68        perf-profile.children.cycles-pp.do_dentry_open
>       5.48            -0.6        4.90        perf-profile.children.cycles-pp.__fput
>       6.49            -0.4        6.08        perf-profile.children.cycles-pp.selinux_inode_permission
>       6.95            -0.3        6.60        perf-profile.children.cycles-pp.security_inode_permission
>       3.48            -0.3        3.15        perf-profile.children.cycles-pp.lookup_fast
>       2.38            -0.3        2.09        perf-profile.children.cycles-pp.entry_SYSCALL_64
>       1.74 ±  5%      -0.3        1.46        perf-profile.children.cycles-pp.inode_permission
>       0.94            -0.3        0.66        perf-profile.children.cycles-pp.__close_fd
>       3.10            -0.2        2.86        perf-profile.children.cycles-pp.__x64_sys_close
>       2.27 ±  2%      -0.2        2.04 ±  2%  perf-profile.children.cycles-pp.dput
>       2.47            -0.2        2.24        perf-profile.children.cycles-pp.walk_component
>       2.21 ±  2%      -0.2        1.98        perf-profile.children.cycles-pp.___might_sleep
>       2.24            -0.2        2.02        perf-profile.children.cycles-pp.syscall_return_via_sysret
>       9.32            -0.2        9.12        perf-profile.children.cycles-pp.task_work_run
>       1.17            -0.2        0.97        perf-profile.children.cycles-pp.ima_file_check
>       1.99            -0.2        1.80        perf-profile.children.cycles-pp.__inode_security_revalidate
>       2.92            -0.2        2.73        perf-profile.children.cycles-pp.security_file_open
>       0.56            -0.2        0.38        perf-profile.children.cycles-pp.selinux_task_getsecid
>       0.69            -0.2        0.51        perf-profile.children.cycles-pp.security_task_getsecid
>       2.40            -0.2        2.24        perf-profile.children.cycles-pp.security_file_alloc
>       0.20 ±  4%      -0.1        0.06 ± 11%  perf-profile.children.cycles-pp.try_module_get
>       1.44            -0.1        1.31        perf-profile.children.cycles-pp.__might_sleep
>      10.01            -0.1        9.88        perf-profile.children.cycles-pp.exit_to_usermode_loop
>       1.46            -0.1        1.33        perf-profile.children.cycles-pp.inode_security_rcu
>       1.00            -0.1        0.87        perf-profile.children.cycles-pp._cond_resched
>       5.20            -0.1        5.08        perf-profile.children.cycles-pp.getname_flags
>       1.05            -0.1        0.93        perf-profile.children.cycles-pp.__fsnotify_parent
>       1.42            -0.1        1.30        perf-profile.children.cycles-pp.fsnotify
>       1.41            -0.1        1.29        perf-profile.children.cycles-pp.__alloc_fd
>       2.29            -0.1        2.18        perf-profile.children.cycles-pp.selinux_file_open
>       0.64            -0.1        0.53        perf-profile.children.cycles-pp.__check_heap_object
>       1.42 ±  2%      -0.1        1.31 ±  2%  perf-profile.children.cycles-pp.irq_exit
>       1.80            -0.1        1.69        perf-profile.children.cycles-pp.terminate_walk
>       0.33            -0.1        0.23        perf-profile.children.cycles-pp.file_ra_state_init
>       0.65 ±  3%      -0.1        0.56        perf-profile.children.cycles-pp.generic_permission
>       1.23            -0.1        1.15        perf-profile.children.cycles-pp.fput_many
>       0.83 ±  3%      -0.1        0.74        perf-profile.children.cycles-pp._raw_spin_lock_irq
>       0.53            -0.1        0.45 ±  2%  perf-profile.children.cycles-pp.rcu_all_qs
>       0.58 ±  5%      -0.1        0.51 ±  2%  perf-profile.children.cycles-pp.mntput_no_expire
>       0.75            -0.1        0.69        perf-profile.children.cycles-pp.lockref_put_or_lock
>       1.03            -0.1        0.97        perf-profile.children.cycles-pp.path_init
>       0.84            -0.1        0.78        perf-profile.children.cycles-pp.task_work_add
>       0.14 ±  3%      -0.1        0.08 ±  5%  perf-profile.children.cycles-pp.ima_file_free
>       0.26 ±  7%      -0.1        0.21 ±  2%  perf-profile.children.cycles-pp.path_get
>       0.83            -0.0        0.78        perf-profile.children.cycles-pp.__slab_free
>       0.62            -0.0        0.58        perf-profile.children.cycles-pp.percpu_counter_add_batch
>       0.67            -0.0        0.63        perf-profile.children.cycles-pp.rcu_segcblist_enqueue
>       0.20 ± 11%      -0.0        0.16 ±  2%  perf-profile.children.cycles-pp.mntget
>       0.22 ±  4%      -0.0        0.19 ±  3%  perf-profile.children.cycles-pp.get_unused_fd_flags
>       0.10 ± 14%      -0.0        0.07 ± 10%  perf-profile.children.cycles-pp.close@plt
>       0.34 ±  2%      -0.0        0.31        perf-profile.children.cycles-pp.lockref_get
>       0.24            -0.0        0.21 ±  2%  perf-profile.children.cycles-pp.__x64_sys_open
>       0.11 ±  8%      -0.0        0.08 ± 10%  perf-profile.children.cycles-pp.putname
>       0.18 ±  2%      -0.0        0.16        perf-profile.children.cycles-pp.should_failslab
>       0.55            -0.0        0.53        perf-profile.children.cycles-pp.selinux_file_alloc_security
>       0.21 ±  3%      -0.0        0.19 ±  2%  perf-profile.children.cycles-pp.expand_files
>       0.07 ±  6%      -0.0        0.05        perf-profile.children.cycles-pp.module_put
>       0.12            -0.0        0.10 ±  4%  perf-profile.children.cycles-pp.security_file_free
>       0.17            -0.0        0.15 ±  3%  perf-profile.children.cycles-pp.find_next_zero_bit
>       0.07 ±  5%      -0.0        0.06        perf-profile.children.cycles-pp.memset
>       0.07 ±  5%      -0.0        0.06        perf-profile.children.cycles-pp.__mutex_init
>       0.10            -0.0        0.09        perf-profile.children.cycles-pp.mntput
>       0.12            +0.0        0.13 ±  3%  perf-profile.children.cycles-pp.__list_del_entry_valid
>       0.12 ±  3%      +0.0        0.14 ±  3%  perf-profile.children.cycles-pp.discard_slab
>       0.08            +0.0        0.10 ±  4%  perf-profile.children.cycles-pp.kick_process
>       0.04 ± 57%      +0.0        0.07 ±  7%  perf-profile.children.cycles-pp.native_irq_return_iret
>       0.12 ±  4%      +0.0        0.15 ±  3%  perf-profile.children.cycles-pp.blkcg_maybe_throttle_current
>       1.31            +0.0        1.34        perf-profile.children.cycles-pp.memset_erms
>       0.40            +0.0        0.44        perf-profile.children.cycles-pp.lockref_get_not_dead
>       0.07 ±  6%      +0.0        0.11 ±  3%  perf-profile.children.cycles-pp.rcu_segcblist_pend_cbs
>       0.01 ±173%      +0.0        0.06 ± 11%  perf-profile.children.cycles-pp.native_write_msr
>       0.27 ±  6%      +0.1        0.33 ± 10%  perf-profile.children.cycles-pp.ktime_get
>       0.16 ±  5%      +0.1        0.22 ±  4%  perf-profile.children.cycles-pp.get_partial_node
>       0.01 ±173%      +0.1        0.07 ± 30%  perf-profile.children.cycles-pp.perf_mux_hrtimer_handler
>       0.00            +0.1        0.07 ±  5%  perf-profile.children.cycles-pp.____fput
>       0.05            +0.1        0.15 ±  3%  perf-profile.children.cycles-pp.__mod_zone_page_state
>       0.12 ± 16%      +0.1        0.23 ± 10%  perf-profile.children.cycles-pp.ktime_get_update_offsets_now
>       0.05 ±  8%      +0.1        0.16 ±  2%  perf-profile.children.cycles-pp.legitimize_links
>       1.71            +0.1        1.83        perf-profile.children.cycles-pp.__call_rcu
>       3.40            +0.1        3.52        perf-profile.children.cycles-pp.strncpy_from_user
>       0.30 ±  2%      +0.1        0.43 ±  2%  perf-profile.children.cycles-pp.locks_remove_posix
>       0.72            +0.1        0.86        perf-profile.children.cycles-pp.__virt_addr_valid
>       0.08            +0.1        0.23 ±  3%  perf-profile.children.cycles-pp._raw_spin_lock_irqsave
>       0.84 ±  9%      +0.2        1.02 ±  8%  perf-profile.children.cycles-pp.hrtimer_interrupt
>       0.15 ±  3%      +0.2        0.38        perf-profile.children.cycles-pp.check_stack_object
>       0.65            +0.4        1.05        perf-profile.children.cycles-pp.setup_object_debug
>       0.36            +0.4        0.76        perf-profile.children.cycles-pp.get_page_from_freelist
>       1.90            +0.4        2.31        perf-profile.children.cycles-pp.__check_object_size
>       0.39            +0.4        0.84        perf-profile.children.cycles-pp.__alloc_pages_nodemask
>      39.52            +0.7       40.22        perf-profile.children.cycles-pp.do_filp_open
>      38.84            +0.8       39.59        perf-profile.children.cycles-pp.path_openat
>      31.27            +0.8       32.05        perf-profile.children.cycles-pp.rcu_core
>      31.26            +0.8       32.03        perf-profile.children.cycles-pp.rcu_do_batch
>      31.31            +0.8       32.09        perf-profile.children.cycles-pp.__softirqentry_text_start
>      29.90            +0.9       30.79        perf-profile.children.cycles-pp.run_ksoftirqd
>      29.93            +0.9       30.84        perf-profile.children.cycles-pp.smpboot_thread_fn
>      29.94            +0.9       30.85        perf-profile.children.cycles-pp.kthread
>      29.94            +0.9       30.85        perf-profile.children.cycles-pp.ret_from_fork
>      10.63 ±  2%      +1.0       11.61        perf-profile.children.cycles-pp.kmem_cache_free
>       8.45 ±  3%      +1.1        9.57        perf-profile.children.cycles-pp._raw_spin_lock
>       7.96 ±  3%      +1.1        9.11        perf-profile.children.cycles-pp.__free_pages_ok
>       7.93 ±  3%      +1.2        9.08        perf-profile.children.cycles-pp.free_one_page
>       8.19 ±  3%      +1.2        9.36        perf-profile.children.cycles-pp.put_cpu_partial
>       8.15 ±  3%      +1.2        9.32        perf-profile.children.cycles-pp.unfreeze_partials
>       7.59 ±  3%      +1.3        8.89        perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
>      11.12            +1.7       12.83        perf-profile.children.cycles-pp.rcu_cblist_dequeue
>       2.88            +1.7        4.61        perf-profile.children.cycles-pp.new_slab
>       8.73            +1.8       10.54        perf-profile.children.cycles-pp.kmem_cache_alloc
>       3.34            +2.2        5.56        perf-profile.children.cycles-pp.___slab_alloc
>       3.39            +2.3        5.65        perf-profile.children.cycles-pp.__slab_alloc
>      11.45            +2.7       14.12        perf-profile.children.cycles-pp.alloc_empty_file
>      10.98            +2.7       13.70        perf-profile.children.cycles-pp.__alloc_file
>      10.53            -2.1        8.47        perf-profile.self.cycles-pp.file_free_rcu
>       2.37            -0.3        2.05        perf-profile.self.cycles-pp.kmem_cache_alloc
>       1.43            -0.3        1.17        perf-profile.self.cycles-pp.strncpy_from_user
>       0.50            -0.2        0.26 ±  2%  perf-profile.self.cycles-pp.__close_fd
>       2.22            -0.2        2.00        perf-profile.self.cycles-pp.syscall_return_via_sysret
>       2.07 ±  2%      -0.2        1.86        perf-profile.self.cycles-pp.___might_sleep
>       1.01 ±  7%      -0.2        0.82        perf-profile.self.cycles-pp.inode_permission
>       1.13            -0.2        0.96 ±  2%  perf-profile.self.cycles-pp.entry_SYSCALL_64
>       3.10            -0.2        2.93        perf-profile.self.cycles-pp.selinux_inode_permission
>       0.52            -0.2        0.35        perf-profile.self.cycles-pp.selinux_task_getsecid
>       1.33            -0.1        1.18 ±  2%  perf-profile.self.cycles-pp.do_dentry_open
>       1.55            -0.1        1.42        perf-profile.self.cycles-pp.kmem_cache_free
>       1.55            -0.1        1.43        perf-profile.self.cycles-pp.link_path_walk
>       0.17 ±  4%      -0.1        0.04 ± 57%  perf-profile.self.cycles-pp.try_module_get
>       1.35            -0.1        1.24        perf-profile.self.cycles-pp.fsnotify
>       0.96            -0.1        0.85        perf-profile.self.cycles-pp.__fsnotify_parent
>       1.25 ±  2%      -0.1        1.14        perf-profile.self.cycles-pp.__might_sleep
>       0.87 ±  2%      -0.1        0.78 ±  2%  perf-profile.self.cycles-pp.lookup_fast
>       0.79 ±  2%      -0.1        0.70 ±  2%  perf-profile.self.cycles-pp.do_syscall_64
>       1.02            -0.1        0.93        perf-profile.self.cycles-pp.do_sys_open
>       0.30            -0.1        0.22        perf-profile.self.cycles-pp.file_ra_state_init
>       0.58            -0.1        0.50        perf-profile.self.cycles-pp.__check_heap_object
>       0.80 ±  3%      -0.1        0.72        perf-profile.self.cycles-pp._raw_spin_lock_irq
>       0.59 ±  2%      -0.1        0.52 ±  2%  perf-profile.self.cycles-pp.generic_permission
>       1.17            -0.1        1.09        perf-profile.self.cycles-pp.__fput
>       0.68            -0.1        0.60        perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
>       0.84            -0.1        0.77        perf-profile.self.cycles-pp.__inode_security_revalidate
>       0.73            -0.1        0.66        perf-profile.self.cycles-pp.task_work_add
>       0.39            -0.1        0.32 ±  2%  perf-profile.self.cycles-pp.rcu_all_qs
>       0.54 ±  5%      -0.1        0.47 ±  2%  perf-profile.self.cycles-pp.mntput_no_expire
>       0.50 ±  6%      -0.1        0.44 ±  4%  perf-profile.self.cycles-pp.dput
>       0.46            -0.1        0.40        perf-profile.self.cycles-pp._cond_resched
>       0.93            -0.1        0.88 ±  2%  perf-profile.self.cycles-pp.close
>       0.11 ±  4%      -0.1        0.06 ± 11%  perf-profile.self.cycles-pp.ima_file_free
>       0.83            -0.1        0.77        perf-profile.self.cycles-pp.__slab_free
>       0.61            -0.1        0.56        perf-profile.self.cycles-pp.__alloc_fd
>       0.87            -0.1        0.82        perf-profile.self.cycles-pp._raw_spin_lock
>       0.69            -0.0        0.64        perf-profile.self.cycles-pp.lockref_put_or_lock
>       0.67            -0.0        0.62        perf-profile.self.cycles-pp.rcu_segcblist_enqueue
>       0.46            -0.0        0.41        perf-profile.self.cycles-pp.do_filp_open
>       0.56            -0.0        0.51        perf-profile.self.cycles-pp.percpu_counter_add_batch
>       1.05 ±  2%      -0.0        1.01        perf-profile.self.cycles-pp.path_openat
>       0.28 ±  2%      -0.0        0.24 ±  4%  perf-profile.self.cycles-pp.security_file_open
>       0.94 ±  2%      -0.0        0.90        perf-profile.self.cycles-pp.open64
>       0.21 ±  6%      -0.0        0.17 ±  4%  perf-profile.self.cycles-pp.get_unused_fd_flags
>       0.17 ± 13%      -0.0        0.14 ±  3%  perf-profile.self.cycles-pp.mntget
>       0.39            -0.0        0.35        perf-profile.self.cycles-pp.fput_many
>       0.37            -0.0        0.34        perf-profile.self.cycles-pp.getname_flags
>       0.43            -0.0        0.41        perf-profile.self.cycles-pp.path_init
>       0.33            -0.0        0.30 ±  2%  perf-profile.self.cycles-pp.lockref_get
>       0.20 ±  2%      -0.0        0.17        perf-profile.self.cycles-pp.filp_close
>       0.52            -0.0        0.50        perf-profile.self.cycles-pp.selinux_file_alloc_security
>       0.22            -0.0        0.20 ±  2%  perf-profile.self.cycles-pp.__x64_sys_open
>       0.28 ±  2%      -0.0        0.26        perf-profile.self.cycles-pp.inode_security_rcu
>       0.19 ±  4%      -0.0        0.17 ±  2%  perf-profile.self.cycles-pp.expand_files
>       0.09 ±  9%      -0.0        0.07 ±  6%  perf-profile.self.cycles-pp.putname
>       0.10            -0.0        0.08 ±  5%  perf-profile.self.cycles-pp.security_file_free
>       0.15 ±  3%      -0.0        0.14 ±  3%  perf-profile.self.cycles-pp.find_next_zero_bit
>       0.12 ±  3%      -0.0        0.11 ±  4%  perf-profile.self.cycles-pp.nd_jump_root
>       0.08            -0.0        0.07        perf-profile.self.cycles-pp.fd_install
>       0.06            -0.0        0.05        perf-profile.self.cycles-pp.path_get
>       0.12            +0.0        0.13 ±  3%  perf-profile.self.cycles-pp.__list_del_entry_valid
>       0.07 ±  5%      +0.0        0.09        perf-profile.self.cycles-pp.get_partial_node
>       0.12 ±  3%      +0.0        0.14 ±  3%  perf-profile.self.cycles-pp.discard_slab
>       0.11            +0.0        0.13 ±  3%  perf-profile.self.cycles-pp.blkcg_maybe_throttle_current
>       0.06            +0.0        0.08 ±  5%  perf-profile.self.cycles-pp.kick_process
>       0.28 ±  2%      +0.0        0.30        perf-profile.self.cycles-pp.__x64_sys_close
>       0.04 ± 57%      +0.0        0.07 ±  7%  perf-profile.self.cycles-pp.native_irq_return_iret
>       0.39            +0.0        0.42        perf-profile.self.cycles-pp.lockref_get_not_dead
>       0.53            +0.0        0.57        perf-profile.self.cycles-pp.exit_to_usermode_loop
>       0.06 ±  7%      +0.0        0.10 ±  4%  perf-profile.self.cycles-pp.rcu_segcblist_pend_cbs
>       0.01 ±173%      +0.0        0.06 ± 11%  perf-profile.self.cycles-pp.native_write_msr
>       0.28            +0.0        0.33        perf-profile.self.cycles-pp.terminate_walk
>       0.27 ±  5%      +0.1        0.32 ± 11%  perf-profile.self.cycles-pp.ktime_get
>       0.00            +0.1        0.05 ±  9%  perf-profile.self.cycles-pp._raw_spin_lock_irqsave
>       0.43 ±  5%      +0.1        0.49 ±  3%  perf-profile.self.cycles-pp.security_inode_permission
>       0.00            +0.1        0.06 ±  6%  perf-profile.self.cycles-pp.____fput
>       0.00            +0.1        0.08        perf-profile.self.cycles-pp.__alloc_pages_nodemask
>       0.05            +0.1        0.15 ±  3%  perf-profile.self.cycles-pp.__mod_zone_page_state
>       0.25 ±  3%      +0.1        0.35 ±  2%  perf-profile.self.cycles-pp.locks_remove_posix
>       0.12 ± 17%      +0.1        0.23 ± 11%  perf-profile.self.cycles-pp.ktime_get_update_offsets_now
>       0.14            +0.1        0.25        perf-profile.self.cycles-pp.setup_object_debug
>       0.93            +0.1        1.04        perf-profile.self.cycles-pp.__call_rcu
>       0.46            +0.1        0.58        perf-profile.self.cycles-pp.__check_object_size
>       0.13 ±  3%      +0.1        0.26 ±  3%  perf-profile.self.cycles-pp.get_page_from_freelist
>       0.00            +0.1        0.13 ±  3%  perf-profile.self.cycles-pp.legitimize_links
>       0.68            +0.1        0.81        perf-profile.self.cycles-pp.__virt_addr_valid
>       0.40 ± 11%      +0.2        0.58        perf-profile.self.cycles-pp.may_open
>       0.12            +0.2        0.31        perf-profile.self.cycles-pp.check_stack_object
>       0.90            +0.3        1.22        perf-profile.self.cycles-pp.task_work_run
>       0.30            +0.4        0.73        perf-profile.self.cycles-pp.___slab_alloc
>       2.88            +0.7        3.59        perf-profile.self.cycles-pp.__alloc_file
>       2.27            +1.1        3.38        perf-profile.self.cycles-pp.new_slab
>       7.59 ±  3%      +1.3        8.89        perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
>      11.04            +1.7       12.73        perf-profile.self.cycles-pp.rcu_cblist_dequeue
>
>
>
>                             will-it-scale.per_process_ops
>
>   385000 +-+----------------------------------------------------------------+
>          |     .+                     .+.      .+                           |
>   380000 +-+.++  :              +.++.+   +.++.+  :                          |
>   375000 +-+     :             +                 :                          |
>          |        +.+   .+.++.+                   ++.      .+.+.+ .+.+.+ .+.|
>   370000 +-+         +.+                             +.+.++      +      +   |
>   365000 +-+                                                                |
>          |                                                                  |
>   360000 +-+                                                                |
>   355000 +-+                                                                |
>          |                                                                  |
>   350000 +-+                                                                |
>   345000 O-+ OO O O OO O O OO O O O                                         |
>          | O                       O O O O OO O O                           |
>   340000 +-+----------------------------------------------------------------+
>
>
>                                 will-it-scale.workload
>
>   7.4e+07 +-+---------------------------------------------------------------+
>           |     .+                     .+      .+                           |
>   7.3e+07 +-++.+  :              ++.+.+  +.+.++  :                          |
>   7.2e+07 +-+     :             +                :                          |
>           |        ++.   .++.+.+                  +.+       +.+.+.++.+.+. +.|
>   7.1e+07 +-+         +.+                            +.+.+.+             +  |
>     7e+07 +-+                                                               |
>           |                                                                 |
>   6.9e+07 +-+                                                               |
>   6.8e+07 +-+                                                               |
>           |                                                                 |
>   6.7e+07 +-+           O        O                                          |
>   6.6e+07 O-OO O O OO O   OO O O  O O   O  O  O                             |
>           |                           O  O   O  O                           |
>   6.5e+07 +-+---------------------------------------------------------------+
>
>
> [*] bisect-good sample
> [O] bisect-bad  sample
>
>
>
> Disclaimer:
> Results have been estimated based on internal Intel analysis and are provided
> for informational purposes only. Any difference in system hardware or software
> design or configuration may affect actual performance.
>
>
> Thanks,
> Rong Chen
>
Dave Chinner Sept. 24, 2019, 7:39 a.m. UTC | #11
On Mon, Sep 23, 2019 at 06:06:46PM +0300, Konstantin Khlebnikov wrote:
> On 23/09/2019 17.52, Tejun Heo wrote:
> > Hello, Konstantin.
> > 
> > On Fri, Sep 20, 2019 at 10:39:33AM +0300, Konstantin Khlebnikov wrote:
> > > With vm.dirty_write_behind 1 or 2 files are written even faster and
> > 
> > Is the faster speed reproducible?  I don't quite understand why this
> > would be.
> 
> Writing to disk simply starts earlier.

Stupid question: how is this any different to simply winding down
our dirty writeback and throttling thresholds like so:

# echo $((100 * 1000 * 1000)) > /proc/sys/vm/dirty_background_bytes

to start background writeback when there's 100MB of dirty pages in
memory, and then:

# echo $((200 * 1000 * 1000)) > /proc/sys/vm/dirty_bytes

So that writers are directly throttled at 200MB of dirty pages in
memory?

This effectively gives us global writebehind behaviour with a
100-200MB cache write burst for initial writes.
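
For reference, a minimal sketch of the arithmetic behind those two
commands, and of which regime a given amount of dirty data falls into.
The helper name and the three-state classification are illustrative
assumptions, not kernel code; the kernel ramps throttling up gradually
rather than switching at a hard boundary.

```python
# Illustrative only: mirrors the two shell commands above.
DIRTY_BACKGROUND_BYTES = 100 * 1000 * 1000  # value written to vm.dirty_background_bytes
DIRTY_BYTES = 200 * 1000 * 1000             # value written to vm.dirty_bytes


def writeback_regime(dirty_bytes_now: int) -> str:
    """Classify an amount of dirty page cache against the two thresholds.

    Hypothetical helper: a simplified model, not how the kernel decides.
    """
    if dirty_bytes_now < DIRTY_BACKGROUND_BYTES:
        return "cached"        # below the background threshold: no threshold-driven writeback
    if dirty_bytes_now < DIRTY_BYTES:
        return "background"    # flusher threads write back, writers are not blocked
    return "throttled"         # writers are throttled until dirty data drains


for mb in (50, 150, 250):
    print(mb, "MB dirty ->", writeback_regime(mb * 1000 * 1000))
```

The gap between the two values is the 100-200MB burst mentioned above.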

And, really, such strict writebehind behaviour is going to cause all
sorts of unintended problems with filesystems because there will be
adverse interactions with delayed allocation. We need a substantial
amount of dirty data to be cached for writeback for fragmentation
minimisation algorithms to be able to do their job....

Cheers,

Dave.
Konstantin Khlebnikov Sept. 24, 2019, 9 a.m. UTC | #12
On 24/09/2019 10.39, Dave Chinner wrote:
> On Mon, Sep 23, 2019 at 06:06:46PM +0300, Konstantin Khlebnikov wrote:
>> On 23/09/2019 17.52, Tejun Heo wrote:
>>> Hello, Konstantin.
>>>
>>> On Fri, Sep 20, 2019 at 10:39:33AM +0300, Konstantin Khlebnikov wrote:
>>>> With vm.dirty_write_behind 1 or 2 files are written even faster and
>>>
>>> Is the faster speed reproducible?  I don't quite understand why this
>>> would be.
>>
>> Writing to disk simply starts earlier.
> 
> Stupid question: how is this any different to simply winding down
> our dirty writeback and throttling thresholds like so:
> 
> # echo $((100 * 1000 * 1000)) > /proc/sys/vm/dirty_background_bytes
> 
> to start background writeback when there's 100MB of dirty pages in
> memory, and then:
> 
> # echo $((200 * 1000 * 1000)) > /proc/sys/vm/dirty_bytes
> 
> So that writers are directly throttled at 200MB of dirty pages in
> memory?
> 
> This effectively gives us global writebehind behaviour with a
> 100-200MB cache write burst for initial writes.

Global limits affect all dirty pages, including memory-mapped and
randomly touched ones. Write-behind targets only sequential streams.
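
As a rough illustration of that distinction, the sketch below treats a
write as sequential only when it starts where the previous write to the
same file ended. The class and field names are hypothetical and the
check is simplified; this is not the detection code from the patch.

```python
class StreamTracker:
    """Hypothetical per-file state: remembers where the previous write ended."""

    def __init__(self) -> None:
        self.prev_end = 0          # file offset just past the previous write
        self.sequential_bytes = 0  # length of the current sequential run

    def note_write(self, offset: int, length: int) -> bool:
        """Record a write; return True if it continues a sequential run."""
        sequential = offset == self.prev_end
        if sequential:
            self.sequential_bytes += length
        else:
            self.sequential_bytes = 0  # random write: left to the flusher threads
        self.prev_end = offset + length
        return sequential


t = StreamTracker()
print(t.note_write(0, 4096))        # write at the start of the file
print(t.note_write(4096, 4096))     # continues where the previous write ended
print(t.note_write(1 << 20, 512))   # jumps elsewhere in the file
```

Only writes that extend the run would count towards the write-behind
window; memory-mapped and randomly touched pages never enter this path.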

> 
> And, really, such strict writebehind behaviour is going to cause all
> sorts of unintended problems with filesystems because there will be
> adverse interactions with delayed allocation. We need a substantial
> amount of dirty data to be cached for writeback for fragmentation
> minimisation algorithms to be able to do their job....

I think most sequentially written files never change after close.
Apart from not knowing the final size of huge files (>16MiB in my patch),
there should be no difference for delayed allocation.

Write-behind could probably provide a hint about the streaming pattern:
pass something like "MSG_MORE" into the writeback call.

> 
> Cheers,
> 
> Dave.
>
Konstantin Khlebnikov Sept. 24, 2019, 9:29 a.m. UTC | #13
On 21/09/2019 02.05, Linus Torvalds wrote:
> On Fri, Sep 20, 2019 at 12:35 AM Konstantin Khlebnikov
> <khlebnikov@yandex-team.ru> wrote:
>>
>> This patch implements write-behind policy which tracks sequential writes
>> and starts background writeback when file have enough dirty pages.
> 
> Apart from a spelling error ("contigious"), my only reaction is that
> I've wanted this for the multi-file writes, not just for single big
> files.
> 
> Yes, single big files may be a simpler and perhaps the "10% effort for
> 90% of the gain", and thus the right thing to do, but I do wonder if
> you've looked at simply extending it to cover multiple files when
> people copy a whole directory (or unpack a tar-file, or similar).
> 
> Now, I hear you say "those are so small these days that it doesn't
> matter". And maybe you're right. But particularly for slow media,
> triggering good streaming write behavior has been a problem in the
> past.
> 
> So I'm wondering whether the "writebehind" state should perhaps be
> considered to be a process state, rather than "struct file" state, and
> also start triggering for writing smaller files.

It's simple to extend the existing state with a per-task counter of sequential
writes to detect patterns like unpacking a tarball with small files.
After reaching some threshold, write-behind could flush files at close.

But in this case it's hard to wait for previous writes in order to limit the
amount of requests and pages in writeback for each stream.

Theoretically we could build a chain of inodes for delaying and batching.
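
A minimal sketch of the per-task counter idea, written as userspace C for
illustration only. The struct task_wb_state, its fields and
task_wb_on_close() are hypothetical names and are not part of the posted
patch. The sketch only shows how a streak of sequentially written files
could arm flush-at-close once a threshold is reached.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-task state; not part of the posted patch. */
struct task_wb_state {
	unsigned int seq_files;	/* files written front to back in a row */
	unsigned int threshold;	/* streak length that arms flush-at-close */
};

/*
 * Called when a file that was opened for writing is closed.
 * 'sequential' says whether the file was written strictly front to back.
 * Returns true if the file should be queued for background writeback now.
 */
static bool task_wb_on_close(struct task_wb_state *st, bool sequential)
{
	assert(st != NULL);

	if (!sequential) {
		st->seq_files = 0;	/* pattern broken, start over */
		return false;
	}

	/* Saturate at the threshold so the counter cannot overflow. */
	if (st->seq_files < st->threshold)
		st->seq_files++;

	return st->seq_files >= st->threshold;
}
```

With a threshold of 3, the first two sequentially written files are left to
normal writeback and every later one is flushed at close until a
non-sequential file resets the streak. Limiting the amount of writeback in
flight, the hard part mentioned above, is not modelled here.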

> 
> Maybe this was already discussed and people decided that the big-file
> case was so much easier that it wasn't worth worrying about
> writebehind for multiple files.
> 
>              Linus
>
Linus Torvalds Sept. 24, 2019, 7:08 p.m. UTC | #14
On Tue, Sep 24, 2019 at 12:39 AM Dave Chinner <david@fromorbit.com> wrote:
>
> Stupid question: how is this any different to simply winding down
> our dirty writeback and throttling thresholds like so:
>
> # echo $((100 * 1000 * 1000)) > /proc/sys/vm/dirty_background_bytes

Our dirty_background stuff is very questionable, but it exists (and
has those insane defaults) because of various legacy reasons.

But it probably _shouldn't_ exist any more (except perhaps as a
last-ditch hard limit), and I don't think it really ends up being the
primary throttling any more in many cases.

It used to make sense to make it a "percentage of memory" back when we
were talking old machines with 8MB of RAM, and having an appreciable
percentage of memory dirty was "normal".

And we've kept that model and not touched it, because some benchmarks
really want enormous amounts of dirty data (particularly various dirty
shared mappings).

But our default really is fairly crazy and questionable. 10% of memory
being dirty may be ok when you have a small amount of memory, but it's
rather less sane if you have gigs and gigs of RAM.

Of course, SSD's made it work slightly better again, but our
"dirty_background" stuff really is legacy and not very good.

The whole dirty limit when seen as percentage of memory (which is our
default) is particularly questionable, but even when seen as total
bytes is bad.

If you have slow filesystems (say, FAT on a USB stick), the limit
should be very different from a fast one (eg XFS on a RAID of proper
SSDs).

So the limit really needs to be per-bdi, not some global ratio or bytes.

As a result we've grown various _other_ heuristics over time, and the
simplistic dirty_background stuff is only a very small part of the
picture these days.

To the point of almost being irrelevant in many situations, I suspect.

> to start background writeback when there's 100MB of dirty pages in
> memory, and then:
>
> # echo $((200 * 1000 * 1000)) > /proc/sys/vm/dirty_bytes

The thing is, that also accounts for dirty shared mmap pages. And it
really will kill some benchmarks that people take very very seriously.

And 200MB is peanuts when you're doing a benchmark on some studly
machine that has a million iops per second, and 200MB of dirty data is
nothing.

Yet it's probably much too big when you're on a workstation that still
has rotational media.

And the whole memcg code obviously makes this even more complicated.

Anyway, the end result of all this is that we have that
balance_dirty_pages() that is pretty darn complex and I suspect very
few people understand everything that goes on in that function.

So I think that the point of any write-behind logic would be to avoid
triggering the global limits as much as humanly possible - not just
getting the simple cases to write things out more quickly, but to
remove the complex global limit questions from (one) common and fairly
simple case.

Now, whether write-behind really _does_ help that, or whether it's
just yet another tweak and complication, I can't actually say. But I
don't think 'dirty_background_bytes' is really an argument against
write-behind, it's just one knob on the very complex dirty handling we
have.

            Linus
Dave Chinner Sept. 25, 2019, 7:18 a.m. UTC | #15
On Tue, Sep 24, 2019 at 12:00:17PM +0300, Konstantin Khlebnikov wrote:
> On 24/09/2019 10.39, Dave Chinner wrote:
> > On Mon, Sep 23, 2019 at 06:06:46PM +0300, Konstantin Khlebnikov wrote:
> > > On 23/09/2019 17.52, Tejun Heo wrote:
> > > > Hello, Konstantin.
> > > > 
> > > > On Fri, Sep 20, 2019 at 10:39:33AM +0300, Konstantin Khlebnikov wrote:
> > > > > With vm.dirty_write_behind 1 or 2 files are written even faster and
> > > > 
> > > > Is the faster speed reproducible?  I don't quite understand why this
> > > > would be.
> > > 
> > > Writing to disk simply starts earlier.
> > 
> > Stupid question: how is this any different to simply winding down
> > our dirty writeback and throttling thresholds like so:
> > 
> > # echo $((100 * 1000 * 1000)) > /proc/sys/vm/dirty_background_bytes
> > 
> > to start background writeback when there's 100MB of dirty pages in
> > memory, and then:
> > 
> > # echo $((200 * 1000 * 1000)) > /proc/sys/vm/dirty_bytes
> > 
> > So that writers are directly throttled at 200MB of dirty pages in
> > memory?
> > 
> > This effectively gives us global writebehind behaviour with a
> > 100-200MB cache write burst for initial writes.
> 
> Global limits affect all dirty pages including memory-mapped and
> randomly touched. Write-behind aims only into sequential streams.

There are apps that do sequential writes via mmap()d files.
They should do writebehind too, yes?

> > And, really such strict writebehind behaviour is going to cause all
> > sorts of unintended problems with filesystems because there will be
> > adverse interactions with delayed allocation. We need a substantial
> > amount of dirty data to be cached for writeback for fragmentation
> > minimisation algorithms to be able to do their job....
> 
> I think most sequentially written files never change after close.

There are lots of apps that write zeros to initialise and allocate
space, then go write real data to them. Database WAL files are
commonly initialised like this...

> Except of knowing final size of huge files (>16Mb in my patch)
> there should be no difference for delayed allocation.

There is, because you throttle the writes down such that there is
only 16MB of dirty data in memory. Hence filesystems will only
typically allocate in 16MB chunks as that's all the delalloc range
spans.

I'm not so concerned for XFS here, because our speculative
preallocation will handle this just fine, but for ext4 and btrfs
it's going to interleave the allocation of concurrent streaming writes
and fragment the crap out of the files.

In general, the smaller you make the individual file writeback
window, the worse the fragmentation problems get....
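
A toy model of the interleaving described above, as a rough sketch rather
than a description of any real allocator. It assumes a simple bump allocator
that hands out physical blocks in flush order and streams that flush
round-robin. extents_of_first_file() and its parameters are made up for
illustration; ext4, btrfs and XFS are all far more sophisticated than this.

```c
#include <assert.h>
#include <stddef.h>

/*
 * 'nstreams' files of 'file_blocks' blocks each are written concurrently.
 * Each stream flushes at most 'window' blocks per turn, round-robin, and a
 * bump allocator hands out physical blocks in flush order.
 * Returns the number of extents the first file ends up with.
 */
static size_t extents_of_first_file(size_t nstreams, size_t file_blocks,
				    size_t window)
{
	size_t next_phys = 0;	/* next free physical block */
	size_t written = 0;	/* blocks of file 0 flushed so far */
	size_t last_end = 0;	/* physical end of file 0's last extent */
	size_t extents = 0;

	assert(nstreams > 0 && window > 0);

	while (written < file_blocks) {
		size_t chunk = file_blocks - written;

		if (chunk > window)
			chunk = window;

		/* File 0 flushes first in every round. */
		if (extents == 0 || next_phys != last_end)
			extents++;	/* not adjacent: new extent */
		next_phys += chunk;
		last_end = next_phys;
		written += chunk;

		/* The other streams flush the same amount after it. */
		next_phys += chunk * (nstreams - 1);
	}
	return extents;
}
```

With 4KiB blocks, a 64MiB file is 16384 blocks and a 16MiB window is 4096
blocks. In this model a single stream stays in one extent, while two
concurrent streams leave each file in four extents, one per window.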

> Probably write behind could provide hint about streaming pattern:
> pass something like "MSG_MORE" into writeback call.

How does that help when we've only got dirty data and block
reservations up to EOF which is no more than 16MB away?

Cheers,

Dave.
Dave Chinner Sept. 25, 2019, 8 a.m. UTC | #16
On Tue, Sep 24, 2019 at 12:08:04PM -0700, Linus Torvalds wrote:
> On Tue, Sep 24, 2019 at 12:39 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > Stupid question: how is this any different to simply winding down
> > our dirty writeback and throttling thresholds like so:
> >
> > # echo $((100 * 1000 * 1000)) > /proc/sys/vm/dirty_background_bytes
> 
> Our dirty_background stuff is very questionable, but it exists (and
> has those insane defaults) because of various legacy reasons.

That's not what I was asking about.  The context is in the previous
lines you didn't quote:

> > > > Is the faster speed reproducible?  I don't quite understand why this
> > > > would be.
> > >
> > > Writing to disk simply starts earlier.
> >
> > Stupid question: how is this any different to simply winding down
> > our dirty writeback and throttling thresholds like so:

i.e. I'm asking about the reasons for the performance differential
not asking for an explanation of what writebehind is. If the
performance differential really is caused by writeback starting
sooner, then winding down dirty_background_bytes should produce
exactly the same performance because it will start writeback -much
faster-.

If it doesn't, then the assertion that the difference is caused by
earlier writeout is questionable and the code may not actually be
doing what is claimed....

Basically, I'm asking for proof that the explanation is correct.

> > to start background writeback when there's 100MB of dirty pages in
> > memory, and then:
> >
> > # echo $((200 * 1000 * 1000)) > /proc/sys/vm/dirty_bytes
> 
> The thing is, that also accounts for dirty shared mmap pages. And it
> really will kill some benchmarks that people take very very seriously.

Yes, I know that. I'm not suggesting that we do this,

[snip]

> Anyway, the end result of all this is that we have that
> balance_dirty_pages() that is pretty darn complex and I suspect very
> few people understand everything that goes on in that function.

I'd agree with you there - most of the ground work for the
balance_dirty_pages IO throttling feedback loop was all based on
concepts I developed to solve dirty page writeback thrashing
problems on Irix back in 2003.  The code we have in Linux was
written by Fengguang Wu with help from a lot of people, but the
underlying concepts of delegating IO to dedicated writeback threads
that calculate and track page cleaning rates (BDI writeback rates)
and then throttling incoming page dirtying rate to the page cleaning
rate all came out of my head....

So, much as it may surprise you, I am one of the few people who do
actually understand how that whole complex mass of accounting and
feedback is supposed to work. :)

> Now, whether write-behind really _does_ help that, or whether it's
> just yet another tweak and complication, I can't actually say.

Neither can I at this point - I lack the data and that's why I was
asking if there was a perf difference with the existing limits wound
right down. Knowing whether the performance difference is simply a
result of starting writeback IO sooner tells me an awful lot about
what other behaviour is happening as a result of the changes in this
patch.

> But I
> don't think 'dirty_background_bytes' is really an argument against
> write-behind, it's just one knob on the very complex dirty handling we
> have.

Never said it was - just trying to determine if a one line
explanation is true or not.

Cheers,

Dave.
Konstantin Khlebnikov Sept. 25, 2019, 8:15 a.m. UTC | #17
On 25/09/2019 10.18, Dave Chinner wrote:
> On Tue, Sep 24, 2019 at 12:00:17PM +0300, Konstantin Khlebnikov wrote:
>> On 24/09/2019 10.39, Dave Chinner wrote:
>>> On Mon, Sep 23, 2019 at 06:06:46PM +0300, Konstantin Khlebnikov wrote:
>>>> On 23/09/2019 17.52, Tejun Heo wrote:
>>>>> Hello, Konstantin.
>>>>>
>>>>> On Fri, Sep 20, 2019 at 10:39:33AM +0300, Konstantin Khlebnikov wrote:
>>>>>> With vm.dirty_write_behind 1 or 2 files are written even faster and
>>>>>
>>>>> Is the faster speed reproducible?  I don't quite understand why this
>>>>> would be.
>>>>
>>>> Writing to disk simply starts earlier.
>>>
>>> Stupid question: how is this any different to simply winding down
>>> our dirty writeback and throttling thresholds like so:
>>>
>>> # echo $((100 * 1000 * 1000)) > /proc/sys/vm/dirty_background_bytes
>>>
>>> to start background writeback when there's 100MB of dirty pages in
>>> memory, and then:
>>>
>>> # echo $((200 * 1000 * 1000)) > /proc/sys/vm/dirty_bytes
>>>
>>> So that writers are directly throttled at 200MB of dirty pages in
>>> memory?
>>>
>>> This effectively gives us global writebehind behaviour with a
>>> 100-200MB cache write burst for initial writes.
>>
>> Global limits affect all dirty pages including memory-mapped and
>> randomly touched. Write-behind aims only into sequential streams.
> 
> There are apps that do sequential writes via mmap()d files.
> They should do writebehind too, yes?

I see no reason for that. This is a different scenario.

Mmap has no clear signal about "end of write", only a page fault at the
beginning. Theoretically we could implement a similar sliding window and
start writeback on subsequent page faults.

But applications that use memory-mapped files probably know better what
to do with this data. I prefer to leave them alone for now.

> 
>>> And, really such strict writebehind behaviour is going to cause all
>>> sorts of unintended problems with filesystems because there will be
>>> adverse interactions with delayed allocation. We need a substantial
>>> amount of dirty data to be cached for writeback for fragmentation
>>> minimisation algorithms to be able to do their job....
>>
>> I think most sequentially written files never change after close.
> 
> There are lots of apps that write zeros to initialise and allocate
> space, then go write real data to them. Database WAL files are
> commonly initialised like this...

Those zeros are just a bunch of dirty pages which have to be written.
Sync and memory pressure will do that, so why shouldn't write-behind?

> 
>> Except of knowing final size of huge files (>16Mb in my patch)
>> there should be no difference for delayed allocation.
> 
> There is, because you throttle the writes down such that there is
> only 16MB of dirty data in memory. Hence filesystems will only
> typically allocate in 16MB chunks as that's all the delalloc range
> spans.
> 
> I'm not so concerned for XFS here, because our speculative
> preallocation will handle this just fine, but for ext4 and btrfs
> it's going to interleave the allocate of concurrent streaming writes
> and fragment the crap out of the files.
> 
> In general, the smaller you make the individual file writeback
> window, the worse the fragmentation problems gets....

AFAIR ext4 already preallocates extents beyond EOF too.

But this must be carefully tested on all modern filesystems, for sure.

> 
>> Probably write behind could provide hint about streaming pattern:
>> pass something like "MSG_MORE" into writeback call.
> 
> How does that help when we've only got dirty data and block
> reservations up to EOF which is no more than 16MB away?

The block allocator should interpret this flag as "more data is
expected" and preallocate an extent bigger than the data and beyond EOF.

> 
> Cheers,
> 
> Dave.
>
Theodore Ts'o Sept. 25, 2019, 12:54 p.m. UTC | #18
On Wed, Sep 25, 2019 at 05:18:54PM +1000, Dave Chinner wrote:
> > > And, really such strict writebehind behaviour is going to cause all
> > > sorts of unintended problems with filesystems because there will be
> > > adverse interactions with delayed allocation. We need a substantial
> > > amount of dirty data to be cached for writeback for fragmentation
> > > minimisation algorithms to be able to do their job....
> > 
> > I think most sequentially written files never change after close.
> 
> There are lots of apps that write zeros to initialise and allocate
> space, then go write real data to them. Database WAL files are
> commonly initialised like this...

Fortunately, most of the time Enterprise Database files are
initialized with a fd which is then kept open.  And it's only a single
file.  So that's a heuristic that's not too bad to handle so long as
it's only triggered when there are no open file descriptors on said
inode.  If something is still keeping the file open, then we do need
to be very careful about writebehind.

That being said, with databases, they are going to be calling
fdatasync(2) and fsync(2) all the time, so it's unlikely writebehind
is going to be that much of an issue, so long as the max writebehind
knob isn't set too insanely low.  It's been over ten years since I
last looked at this, and so things may have very likely changed, but
one enterprise database I looked at would fallocate 32M, and then
write 32M of zeros to make sure blocks were marked as initialized, so
that further random writes wouldn't cause metadata updates.
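
The initialisation pattern described above can be sketched roughly as
follows. This is an illustration under assumptions, not the code of any
particular database: preallocate_and_zero() is a made-up helper, it uses the
portable posix_fallocate() call, and it closes the fd at the end for
brevity, whereas the databases discussed here keep it open.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Reserve 'len' bytes for 'path', then overwrite them with zeros so that
 * the blocks are marked as initialised. Returns 0 or an errno value.
 */
static int preallocate_and_zero(const char *path, off_t len)
{
	static const char zeros[64 * 1024];
	off_t done = 0;
	int fd, err;

	assert(path != NULL && len > 0);

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return errno;

	/* posix_fallocate() returns the error number, it does not set errno. */
	err = posix_fallocate(fd, 0, len);

	while (err == 0 && done < len) {
		size_t want = sizeof(zeros);
		ssize_t n;

		if ((off_t)want > len - done)
			want = (size_t)(len - done);

		n = write(fd, zeros, want);
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			err = errno;
		} else {
			done += n;
		}
	}

	/* Push the zeros to stable storage before real data is written. */
	if (err == 0 && fsync(fd) < 0)
		err = errno;
	if (close(fd) < 0 && err == 0)
		err = errno;
	return err;
}
```

The zero-filling loop is a purely sequential write, which is exactly the
kind of stream a write-behind policy would pick up.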

Now, there *are* applications which log to files via append, and in
the worst case, they don't actually keep a fd open.  Examples of this
would include scripts that call logger(1) very often.  But in general,
taking into account whether or not there is still a fd holding the
inode open to influence how aggressively we do writeback does make
sense.

Finally, we should remember that this will impact battery life on
laptops.  Perhaps not so much now that most laptops have SSD's instead
of HDD's, but aggressive writebehind does certainly have tradeoffs,
and what makes sense for a NVMe attached SSD is going to be very
different for a $2 USB thumb drive picked up at the checkout aisle of
Staples....

						- Ted
Dave Chinner Sept. 25, 2019, 11:25 p.m. UTC | #19
On Wed, Sep 25, 2019 at 11:15:30AM +0300, Konstantin Khlebnikov wrote:
> On 25/09/2019 10.18, Dave Chinner wrote:
> > On Tue, Sep 24, 2019 at 12:00:17PM +0300, Konstantin Khlebnikov wrote:
> > > On 24/09/2019 10.39, Dave Chinner wrote:
> > > > On Mon, Sep 23, 2019 at 06:06:46PM +0300, Konstantin Khlebnikov wrote:
> > > > > On 23/09/2019 17.52, Tejun Heo wrote:
> > > > > > Hello, Konstantin.
> > > > > > 
> > > > > > On Fri, Sep 20, 2019 at 10:39:33AM +0300, Konstantin Khlebnikov wrote:
> > > > > > > With vm.dirty_write_behind 1 or 2 files are written even faster and
> > > > > > 
> > > > > > Is the faster speed reproducible?  I don't quite understand why this
> > > > > > would be.
> > > > > 
> > > > > Writing to disk simply starts earlier.
> > > > 
> > > > Stupid question: how is this any different to simply winding down
> > > > our dirty writeback and throttling thresholds like so:
> > > > 
> > > > # echo $((100 * 1000 * 1000)) > /proc/sys/vm/dirty_background_bytes
> > > > 
> > > > to start background writeback when there's 100MB of dirty pages in
> > > > memory, and then:
> > > > 
> > > > # echo $((200 * 1000 * 1000)) > /proc/sys/vm/dirty_bytes
> > > > 
> > > > So that writers are directly throttled at 200MB of dirty pages in
> > > > memory?
> > > > 
> > > > This effectively gives us global writebehind behaviour with a
> > > > 100-200MB cache write burst for initial writes.
> > > 
> > > Global limits affect all dirty pages including memory-mapped and
> > > randomly touched. Write-behind aims only into sequential streams.
> > 
> > There are apps that do sequential writes via mmap()d files.
> > They should do writebehind too, yes?
> 
> I see no reason for that. This is different scenario.

It is?

> Mmap have no clear signal about "end of write", only page fault at
> beginning. Theoretically we could implement similar sliding window and
> start writeback on consequent page faults.

sequential IO doing pwrite() in a loop has no clear signal about
"end of write", either. It's exactly the same as doing a memset(0)
on a mmap()d region to zero the file. i.e. the write doesn't stop
until EOF is reached...

> But applications who use memory mapped files probably knows better what
> to do with this data. I prefer to leave them alone for now.

By that argument, we shouldn't have readahead for mmap() access or
even read-around for page faults. We can track read and write faults
exactly for mmap(), so if you are tracking sequential page dirtying
for writebehind we can do that just as easily for mmap (via
->page_mkwrite) as we can for write() IO.
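
As a rough userspace sketch of that point: a tracker that only sees page
indices does not care whether they come from write() or from a write fault.
struct seq_dirty_tracker and seq_dirty_note() are hypothetical names, and
the flush rule (write out the whole run once it reaches the window size) is
a simplification chosen for this example, not the logic of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tracker; the names are illustrative, not kernel API. */
struct seq_dirty_tracker {
	unsigned long head;	/* first page not yet sent to writeback */
	unsigned long next;	/* page expected to be dirtied next */
	unsigned long window;	/* run length that triggers a flush */
};

/*
 * Record that page 'index' was dirtied. If the sequential run
 * [head, next) has reached 'window' pages, report the range
 * [*start, *end) that should be written back and advance 'head'.
 */
static bool seq_dirty_note(struct seq_dirty_tracker *t, unsigned long index,
			   unsigned long *start, unsigned long *end)
{
	assert(t != NULL && start != NULL && end != NULL);
	assert(t->window > 0);

	if (index == t->next) {
		t->next = index + 1;	/* the run continues */
	} else if (index < t->head || index > t->next) {
		t->head = index;	/* random access: restart the run */
		t->next = index + 1;
	}
	/* Otherwise a page inside the current run was dirtied again. */

	if (t->next - t->head < t->window)
		return false;

	*start = t->head;
	*end = t->next;
	t->head = t->next;
	return true;
}
```

The caller would feed it the page index from whichever path dirtied the
page and start writeback for the reported range.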

> > > > And, really such strict writebehind behaviour is going to cause all
> > > > sorts of unintended problems with filesystems because there will be
> > > > adverse interactions with delayed allocation. We need a substantial
> > > > amount of dirty data to be cached for writeback for fragmentation
> > > > minimisation algorithms to be able to do their job....
> > > 
> > > I think most sequentially written files never change after close.
> > 
> > There are lots of apps that write zeros to initialise and allocate
> > space, then go write real data to them. Database WAL files are
> > commonly initialised like this...
> 
> Those zeros are just bunch of dirty pages which have to be written.
> Sync and memory pressure will do that, why write-behind don't have to?

Huh? IIUC, the writebehind flag is a global behaviour flag for the
kernel - everything does writebehind or nothing does it, right?

Hence if you turn on writebehind, the writebehind will write the
zeros to disk before real data can be written. We no longer have
zeroing as something that sits in the cache until it's overwritten
with real data - that file now gets written twice and it delays the
application from actually writing real data until the zeros are all
on disk.

strict writebehind without the ability to burst temporary/short-term
data/state into the cache is going to cause a lot of performance
regressions in applications....

> > > Except of knowing final size of huge files (>16Mb in my patch)
> > > there should be no difference for delayed allocation.
> > 
> > There is, because you throttle the writes down such that there is
> > only 16MB of dirty data in memory. Hence filesystems will only
> > typically allocate in 16MB chunks as that's all the delalloc range
> > spans.
> > 
> > I'm not so concerned for XFS here, because our speculative
> > preallocation will handle this just fine, but for ext4 and btrfs
> > it's going to interleave the allocate of concurrent streaming writes
> > and fragment the crap out of the files.
> > 
> > In general, the smaller you make the individual file writeback
> > window, the worse the fragmentation problems gets....
> 
> AFAIR ext4 already preallocates extent beyond EOF too.

Only via fallocate(), not for delayed allocation.

> > > Probably write behind could provide hint about streaming pattern:
> > > pass something like "MSG_MORE" into writeback call.
> > 
> > How does that help when we've only got dirty data and block
> > reservations up to EOF which is no more than 16MB away?
> 
> Block allocator should interpret this flags as "more data are
> expected" and preallocate extent bigger than data and beyond EOF.

Can't do that: delayed allocation is a 2-phase operation that is not
separable from the context that is dirtying the pages. The space
is _accounted as used_ during the write() context, but the _physical
allocation_ of that space is done in the writeback context. We
cannot reserve more space in the writeback context, because we may
already be at ENOSPC by the time writeback comes along. Hence
writeback must already have all the space it needs to write back the
dirty pages in memory already accounted as used space before it
starts running physical allocations.

IOWs, we cannot magically allocate more space than was reserved for
the data being written in because of some special flag from the
writeback code. That way lies angry users because we lost their
data due to ENOSPC issues in writeback.
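
The two phases can be modelled with a few lines of toy accounting. This is
only a sketch of the argument above: struct toy_fs and both helpers are
invented for illustration and do not correspond to any filesystem's real
data structures.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy filesystem space accounting; the names are illustrative only. */
struct toy_fs {
	size_t free_blocks;	/* blocks neither reserved nor allocated */
	size_t reserved;	/* accounted at write() time, not yet placed */
	size_t allocated;	/* physically placed at writeback time */
};

/* Phase 1, write() context: account the space or fail right away. */
static int toy_write_reserve(struct toy_fs *fs, size_t blocks)
{
	assert(fs != NULL);

	if (blocks > fs->free_blocks)
		return -ENOSPC;	/* the writer still sees this error */
	fs->free_blocks -= blocks;
	fs->reserved += blocks;
	return 0;
}

/*
 * Phase 2, writeback context: turn reservations into physical blocks.
 * It may only consume what phase 1 reserved. Asking for more would mean
 * a possible ENOSPC after write() has already returned success.
 */
static int toy_writeback_allocate(struct toy_fs *fs, size_t blocks)
{
	assert(fs != NULL);

	if (blocks > fs->reserved)
		return -ENOSPC;	/* nobody is left to report this to */
	fs->reserved -= blocks;
	fs->allocated += blocks;
	return 0;
}
```

In this model a writeback-time hint to allocate "a bit more" has nowhere to
take the extra blocks from once free_blocks has dropped to zero, which is
the ENOSPC problem described above.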

Cheers,

Dave.
Patch

diff --git a/Documentation/ABI/testing/sysfs-class-bdi b/Documentation/ABI/testing/sysfs-class-bdi
index d773d5697cf5..f16be656cbd5 100644
--- a/Documentation/ABI/testing/sysfs-class-bdi
+++ b/Documentation/ABI/testing/sysfs-class-bdi
@@ -30,6 +30,11 @@  read_ahead_kb (read-write)
 
 	Size of the read-ahead window in kilobytes
 
+write_behind_kb (read-write)
+
+	Size of the write-behind window in kilobytes.
+	0 -> disable write-behind for this disk.
+
 min_ratio (read-write)
 
 	Under normal circumstances each device is given a part of the
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 64aeee1009ca..a275fa42579f 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -35,6 +35,7 @@  Currently, these files are in /proc/sys/vm:
 - dirty_ratio
 - dirtytime_expire_seconds
 - dirty_writeback_centisecs
+- dirty_write_behind
 - drop_caches
 - extfrag_threshold
 - hugetlb_shm_group
@@ -210,6 +211,20 @@  out to disk.  This tunable expresses the interval between those wakeups, in
 Setting this to zero disables periodic writeback altogether.
 
 
+dirty_write_behind
+==================
+
+This controls write-behind writeback policy - automatic background writeback
+for sequentially written data behind current writing position.
+
+=0: disabled, default
+=1: enabled for strictly sequential writes (append, copying)
+=2: enabled for all sequential writes
+
+Write-behind window size configured in sysfs for each block device:
+/sys/block/$DEV/bdi/write_behind_kb
+
+
 drop_caches
 ===========
 
diff --git a/fs/file_table.c b/fs/file_table.c
index b07b53f24ff5..bb40b45f27d3 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -276,6 +276,8 @@  static void __fput(struct file *file)
 		if (file->f_op->fasync)
 			file->f_op->fasync(-1, file, 0);
 	}
+	if ((mode & FMODE_WRITE) && vm_dirty_write_behind)
+		generic_write_behind_close(file);
 	if (file->f_op->release)
 		file->f_op->release(inode, file);
 	if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL &&
diff --git a/include/linux/backing-dev-defs.h b/include/linux/backing-dev-defs.h
index 4fc87dee005a..4f1abd1d64a7 100644
--- a/include/linux/backing-dev-defs.h
+++ b/include/linux/backing-dev-defs.h
@@ -191,6 +191,7 @@  struct backing_dev_info {
 	struct list_head bdi_list;
 	unsigned long ra_pages;	/* max readahead in PAGE_SIZE units */
 	unsigned long io_pages;	/* max allowed IO size */
+	unsigned long write_behind_pages; /* write-behind window in pages */
 	congested_fn *congested_fn; /* Function pointer if device is md/dm */
 	void *congested_data;	/* Pointer to aux data for congested func */
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 997a530ff4e9..42cad18aaec7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -942,6 +942,7 @@  struct file {
 	struct fown_struct	f_owner;
 	const struct cred	*f_cred;
 	struct file_ra_state	f_ra;
+	pgoff_t			f_write_behind;
 
 	u64			f_version;
 #ifdef CONFIG_SECURITY
@@ -2788,6 +2789,10 @@  extern int vfs_fsync(struct file *file, int datasync);
 extern int sync_file_range(struct file *file, loff_t offset, loff_t nbytes,
 				unsigned int flags);
 
+extern int vm_dirty_write_behind;
+extern void generic_write_behind(struct kiocb *iocb, ssize_t count);
+extern void generic_write_behind_close(struct file *file);
+
 /*
  * Sync the bytes written if this was a synchronous write.  Expect ki_pos
  * to already be updated for the write, and will return either the amount
@@ -2801,7 +2806,8 @@  static inline ssize_t generic_write_sync(struct kiocb *iocb, ssize_t count)
 				(iocb->ki_flags & IOCB_SYNC) ? 0 : 1);
 		if (ret)
 			return ret;
-	}
+	} else if (vm_dirty_write_behind)
+		generic_write_behind(iocb, count);
 
 	return count;
 }
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0334ca97c584..1b47a6e06ef2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2443,6 +2443,7 @@  void task_dirty_inc(struct task_struct *tsk);
 
 /* readahead.c */
 #define VM_READAHEAD_PAGES	(SZ_128K / PAGE_SIZE)
+#define VM_WRITE_BEHIND_PAGES	(SZ_16M / PAGE_SIZE)
 
 int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
 			pgoff_t offset, unsigned long nr_to_read);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 078950d9605b..74b6b66ee8da 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1404,6 +1404,15 @@  static struct ctl_table vm_table[] = {
 		.proc_handler	= dirtytime_interval_handler,
 		.extra1		= SYSCTL_ZERO,
 	},
+	{
+		.procname	= "dirty_write_behind",
+		.data		= &vm_dirty_write_behind,
+		.maxlen		= sizeof(vm_dirty_write_behind),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= &two,
+	},
 	{
 		.procname	= "swappiness",
 		.data		= &vm_swappiness,
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index d9daa3e422d0..7fee95c02862 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -131,25 +131,6 @@  static inline void bdi_debug_unregister(struct backing_dev_info *bdi)
 }
 #endif
 
-static ssize_t read_ahead_kb_store(struct device *dev,
-				  struct device_attribute *attr,
-				  const char *buf, size_t count)
-{
-	struct backing_dev_info *bdi = dev_get_drvdata(dev);
-	unsigned long read_ahead_kb;
-	ssize_t ret;
-
-	ret = kstrtoul(buf, 10, &read_ahead_kb);
-	if (ret < 0)
-		return ret;
-
-	bdi->ra_pages = read_ahead_kb >> (PAGE_SHIFT - 10);
-
-	return count;
-}
-
-#define K(pages) ((pages) << (PAGE_SHIFT - 10))
-
 #define BDI_SHOW(name, expr)						\
 static ssize_t name##_show(struct device *dev,				\
 			   struct device_attribute *attr, char *page)	\
@@ -160,7 +141,26 @@  static ssize_t name##_show(struct device *dev,				\
 }									\
 static DEVICE_ATTR_RW(name);
 
-BDI_SHOW(read_ahead_kb, K(bdi->ra_pages))
+#define BDI_ATTR_KB(name, field)					\
+static ssize_t name##_store(struct device *dev,				\
+			    struct device_attribute *attr,		\
+			    const char *buf, size_t count)		\
+{									\
+	struct backing_dev_info *bdi = dev_get_drvdata(dev);		\
+	unsigned long kb;						\
+	ssize_t ret;							\
+									\
+	ret = kstrtoul(buf, 10, &kb);					\
+	if (ret < 0)							\
+		return ret;						\
+									\
+	bdi->field = kb >> (PAGE_SHIFT - 10);				\
+	return count;							\
+}									\
+BDI_SHOW(name, ((bdi->field) << (PAGE_SHIFT - 10)))
+
+BDI_ATTR_KB(read_ahead_kb, ra_pages)
+BDI_ATTR_KB(write_behind_kb, write_behind_pages)
 
 static ssize_t min_ratio_store(struct device *dev,
 		struct device_attribute *attr, const char *buf, size_t count)
@@ -213,6 +213,7 @@  static DEVICE_ATTR_RO(stable_pages_required);
 
 static struct attribute *bdi_dev_attrs[] = {
 	&dev_attr_read_ahead_kb.attr,
+	&dev_attr_write_behind_kb.attr,
 	&dev_attr_min_ratio.attr,
 	&dev_attr_max_ratio.attr,
 	&dev_attr_stable_pages_required.attr,
@@ -859,6 +860,8 @@  static int bdi_init(struct backing_dev_info *bdi)
 	INIT_LIST_HEAD(&bdi->wb_list);
 	init_waitqueue_head(&bdi->wb_waitq);
 
+	bdi->write_behind_pages = VM_WRITE_BEHIND_PAGES;
+
 	ret = cgwb_bdi_init(bdi);
 
 	return ret;
diff --git a/mm/filemap.c b/mm/filemap.c
index d0cf700bf201..5398b1bea1bf 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3525,3 +3525,139 @@  int try_to_release_page(struct page *page, gfp_t gfp_mask)
 }
 
 EXPORT_SYMBOL(try_to_release_page);
+
+int vm_dirty_write_behind __read_mostly;
+EXPORT_SYMBOL(vm_dirty_write_behind);
+
+/**
+ * generic_write_behind() - writeback dirty pages behind current position.
+ *
+ * This function tracks writing position. If file has enough sequentially
+ * written data it starts background writeback and then waits for previous
+ * writeback initiated some iterations ago.
+ *
+ * Write-behind maintains per-file head cursor in file->f_write_behind and
+ * two windows around: background writeback before and pending data after.
+ *
+ * |<-wait-this->|           |<-send-this->|<---pending-write-behind--->|
+ * |<--async-write-behind--->|<--------previous-data------>|<-new-data->|
+ *              current head-^    new head-^              file position-^
+ */
+void generic_write_behind(struct kiocb *iocb, ssize_t count)
+{
+	struct file *file = iocb->ki_filp;
+	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = mapping->host;
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+	unsigned long window = READ_ONCE(bdi->write_behind_pages);
+	pgoff_t head = file->f_write_behind;
+	pgoff_t begin = (iocb->ki_pos - count) >> PAGE_SHIFT;
+	pgoff_t end = iocb->ki_pos >> PAGE_SHIFT;
+
+	/* Skip if write is random, direct, sync or disabled for disk */
+	if ((file->f_mode & FMODE_RANDOM) || !window ||
+	    (iocb->ki_flags & (IOCB_DIRECT | IOCB_DSYNC)))
+		return;
+
+	/* Skip non-sequential writes in strictly sequential mode. */
+	if (vm_dirty_write_behind < 2 &&
+	    iocb->ki_pos != i_size_read(inode) &&
+	    !(iocb->ki_flags & IOCB_APPEND))
+		return;
+
+	/* Contiguous write and still within window. */
+	if (end - head < window)
+		return;
+
+	spin_lock(&file->f_lock);
+
+	/* Re-read under lock. */
+	head = file->f_write_behind;
+
+	/* Non-contiguous, move head position. */
+	if (head > end || begin - head > window) {
+		/*
+		 * Append might happen through multiple files or via a new file
+		 * every time. Align head cursor to cover previous appends.
+		 */
+		if (iocb->ki_flags & IOCB_APPEND)
+			begin = roundup(begin - min(begin, window - 1),
+					bdi->io_pages);
+
+		file->f_write_behind = head = begin;
+	}
+
+	/* Still not big enough. */
+	if (end - head < window) {
+		spin_unlock(&file->f_lock);
+		return;
+	}
+
+	/* Write excess and try at least max_sectors_kb if possible */
+	end = head + max(end - head - window, min(end - head, bdi->io_pages));
+
+	/* Set head for next iteration, everything behind will be written. */
+	file->f_write_behind = end;
+
+	spin_unlock(&file->f_lock);
+
+	/* Start background writeback. */
+	__filemap_fdatawrite_range(mapping,
+				   (loff_t)head << PAGE_SHIFT,
+				   ((loff_t)end << PAGE_SHIFT) - 1,
+				   WB_SYNC_NONE);
+
+	if (head < window)
+		return;
+
+	/* Wait for pages falling behind writeback window. */
+	head -= window;
+	end -= window;
+	__filemap_fdatawait_range(mapping,
+				  (loff_t)head << PAGE_SHIFT,
+				  ((loff_t)end << PAGE_SHIFT) - 1);
+}
+EXPORT_SYMBOL(generic_write_behind);
+
+/**
+ * generic_write_behind_close() - write tail pages
+ *
+ * This function finishes the write-behind stream and writes remaining tail
+ * pages in the background. It starts writeback if a write-behind stream was
+ * already started (i.e. total written size exceeds the write-behind window)
+ * or if this is a new file bigger than max_sectors_kb.
+ */
+void generic_write_behind_close(struct file *file)
+{
+	struct address_space *mapping = file->f_mapping;
+	struct inode *inode = mapping->host;
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+	unsigned long window = READ_ONCE(bdi->write_behind_pages);
+	pgoff_t head = file->f_write_behind;
+	pgoff_t end = (file->f_pos + PAGE_SIZE - 1) >> PAGE_SHIFT;
+
+	if ((file->f_mode & FMODE_RANDOM) ||
+	    (file->f_flags & (O_APPEND | O_DSYNC | O_DIRECT)) ||
+	    !bdi_cap_writeback_dirty(bdi) || !window)
+		return;
+
+	/* Skip non-sequential writes in strictly sequential mode. */
+	if (vm_dirty_write_behind < 2 &&
+	    file->f_pos != i_size_read(inode))
+		return;
+
+	/* Non-contiguous */
+	if (head > end || end - head > window)
+		return;
+
+	/* Start stream only for new files bigger than max_sectors_kb. */
+	if (end - head < (window - min(window, bdi->io_pages)) &&
+	    (!(file->f_mode & FMODE_CREATED) || end - head < bdi->io_pages))
+		return;
+
+	/* Write tail pages in background. */
+	__filemap_fdatawrite_range(mapping,
+				   (loff_t)head << PAGE_SHIFT,
+				   file->f_pos - 1,
+				   WB_SYNC_NONE);
+}