[RFC,0/1] Large folios in block buffered IO path

Message ID 20241127054737.33351-1-bharata@amd.com (mailing list archive)

Bharata B Rao Nov. 27, 2024, 5:47 a.m. UTC
Recently we discussed the scalability issues seen while running large
instances of FIO with the buffered IO option on NVMe block devices here:

https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com/

One of the suggestions Chris Mason gave (during private discussions) was
to enable large folios in the block buffered IO path, as that could
ease the scalability problems and reduce lock contention.

This is an attempt to check the feasibility and potential benefit of
that suggestion. To keep the changes to a minimum and to test this
non-disruptively on only the required block device, I have added an
ioctl that enables large folio support on the block device mapping. I
understand that this is not the right way to do this, but it is just an
experiment to evaluate the potential benefit.

Experimental setup
------------------
2-node Zen5-based EPYC server with 512G of memory in each node.

Disk layout for FIO:
nvme2n1     259:12   0   3.5T  0 disk 
├─nvme2n1p1 259:13   0 894.3G  0 part 
├─nvme2n1p2 259:14   0 894.3G  0 part 
├─nvme2n1p3 259:15   0 894.3G  0 part 
└─nvme2n1p4 259:16   0 894.1G  0 part 

Four parallel instances of FIO are run on the above 4 partitions with
the following options:

-filename=/dev/nvme2n1p[1,2,3,4] -direct=0 -thread -size=800G -rw=rw -rwmixwrite=[10,30,50] --norandommap --randrepeat=0 -ioengine=sync -bs=64k -numjobs=252 -runtime=3600 --time_based -group_reporting

Results
-------
default: Unmodified kernel and FIO.
patched: Kernel with the BLKSETLFOLIO ioctl (introduced in this patchset)
and FIO modified to issue that ioctl.
In the table below, r is the READ bandwidth and w is the WRITE bandwidth
reported by FIO.

		default				patched
ro (w/o -rw=rw option)
Instance 1	r=12.3GiB/s			r=39.4GiB/s
Instance 2	r=12.2GiB/s			r=39.1GiB/s
Instance 3	r=16.3GiB/s			r=37.1GiB/s
Instance 4	r=14.9GiB/s			r=42.9GiB/s

rwmixwrite=10%
Instance 1	r=27.5GiB/s,w=3125MiB/s		r=75.9GiB/s,w=8636MiB/s
Instance 2	r=25.5GiB/s,w=2898MiB/s		r=87.6GiB/s,w=9967MiB/s
Instance 3	r=25.7GiB/s,w=2922MiB/s		r=78.3GiB/s,w=8904MiB/s
Instance 4	r=27.5GiB/s,w=3134MiB/s		r=73.5GiB/s,w=8365MiB/s

rwmixwrite=30%
Instance 1	r=55.7GiB/s,w=23.9GiB/s		r=59.2GiB/s,w=25.4GiB/s
Instance 2	r=38.5GiB/s,w=16.5GiB/s		r=57.6GiB/s,w=24.7GiB/s
Instance 3	r=37.5GiB/s,w=16.1GiB/s		r=59.5GiB/s,w=25.5GiB/s
Instance 4	r=37.4GiB/s,w=16.0GiB/s		r=63.3GiB/s,w=27.1GiB/s

rwmixwrite=50%
Instance 1	r=37.1GiB/s,w=37.1GiB/s		r=40.7GiB/s,w=40.7GiB/s
Instance 2	r=37.6GiB/s,w=37.6GiB/s		r=45.9GiB/s,w=45.9GiB/s
Instance 3	r=35.1GiB/s,w=35.1GiB/s		r=49.2GiB/s,w=49.2GiB/s
Instance 4	r=43.6GiB/s,w=43.6GiB/s		r=41.2GiB/s,w=41.2GiB/s

Summary of FIO throughput
-------------------------
- Significant increase (~3x) in bandwidth for the ro case.
- Significant increase (~3x) in bandwidth for rwmixwrite=10%.
- Good gains (~1.15x to 1.5x) for rwmixwrite=30% and 50%.
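
As a cross-check of the ratios in this summary, here is a minimal sketch that sums the READ bandwidth of the four instances from the results table and computes the patched/default ratio per workload. The struct, array and function names are illustrative only and are not part of the patch or of FIO.

```c
#include <assert.h>
#include <stdio.h>

#define NR_INSTANCES 4

/* READ bandwidth in GiB/s per FIO instance, copied from the results table. */
struct workload {
	const char *name;
	double def[NR_INSTANCES];	/* unmodified kernel */
	double patched[NR_INSTANCES];	/* kernel with BLKSETLFOLIO */
};

static const struct workload workloads[] = {
	{ "ro",            { 12.3, 12.2, 16.3, 14.9 }, { 39.4, 39.1, 37.1, 42.9 } },
	{ "rwmixwrite=10", { 27.5, 25.5, 25.7, 27.5 }, { 75.9, 87.6, 78.3, 73.5 } },
	{ "rwmixwrite=30", { 55.7, 38.5, 37.5, 37.4 }, { 59.2, 57.6, 59.5, 63.3 } },
	{ "rwmixwrite=50", { 37.1, 37.6, 35.1, 43.6 }, { 40.7, 45.9, 49.2, 41.2 } },
};

#define NR_WORKLOADS (sizeof(workloads) / sizeof(workloads[0]))

/* Sum of n bandwidth samples. */
static double sum(const double *v, int n)
{
	double total = 0.0;
	int i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		total += v[i];
	return total;
}

/* Aggregate patched/default READ bandwidth ratio for one workload. */
static double read_speedup(const struct workload *w)
{
	return sum(w->patched, NR_INSTANCES) / sum(w->def, NR_INSTANCES);
}

/* Print one line per workload, e.g. "ro  2.85x". */
static void print_speedups(void)
{
	size_t i;

	for (i = 0; i < NR_WORKLOADS; i++)
		printf("%-14s %.2fx\n", workloads[i].name,
		       read_speedup(&workloads[i]));
}
```

Calling print_speedups() from a main() gives aggregate ratios of about 2.85x (ro), 2.97x (10%), 1.42x (30%) and 1.15x (50%). Per-instance numbers vary more than the aggregates: for example, Instance 4 in the 50% case goes from 43.6GiB/s to 41.2GiB/s.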

perf-lock contention output
---------------------------
The lock contention data doesn't look all that conclusive, but for the
30% rwmixwrite case it looks like this:

perf-lock contention default
 contended   total wait     max wait     avg wait         type   caller

1337359017     64.69 h     769.04 us    174.14 us     spinlock   rwsem_wake.isra.0+0x42
                        0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
                        0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
                        0xffffffff8f39e7d2  rwsem_wake.isra.0+0x42
                        0xffffffff8f39e88f  up_write+0x4f
                        0xffffffff8f9d598e  blkdev_llseek+0x4e
                        0xffffffff8f703322  ksys_lseek+0x72
                        0xffffffff8f7033a8  __x64_sys_lseek+0x18
                        0xffffffff8f20b983  x64_sys_call+0x1fb3
   2665573     64.38 h       1.98 s      86.95 ms      rwsem:W   blkdev_llseek+0x31
                        0xffffffff903f15bc  rwsem_down_write_slowpath+0x36c
                        0xffffffff903f18fb  down_write+0x5b
                        0xffffffff8f9d5971  blkdev_llseek+0x31
                        0xffffffff8f703322  ksys_lseek+0x72
                        0xffffffff8f7033a8  __x64_sys_lseek+0x18
                        0xffffffff8f20b983  x64_sys_call+0x1fb3
                        0xffffffff903dce5e  do_syscall_64+0x7e
                        0xffffffff9040012b  entry_SYSCALL_64_after_hwframe+0x76
 134057198     14.27 h      35.93 ms    383.14 us     spinlock   clear_shadow_entries+0x57
                        0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
                        0xffffffff903f5c7f  _raw_spin_lock+0x3f
                        0xffffffff8f5e7967  clear_shadow_entries+0x57
                        0xffffffff8f5e90e3  mapping_try_invalidate+0x163
                        0xffffffff8f5e9160  invalidate_mapping_pages+0x10
                        0xffffffff8f9d3872  invalidate_bdev+0x42
                        0xffffffff8f9fac3e  blkdev_common_ioctl+0x9ae
                        0xffffffff8f9faea1  blkdev_ioctl+0xc1
  33351524      1.76 h      35.86 ms    190.43 us     spinlock   __remove_mapping+0x5d
                        0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
                        0xffffffff903f5c7f  _raw_spin_lock+0x3f
                        0xffffffff8f5ec71d  __remove_mapping+0x5d
                        0xffffffff8f5f9be6  remove_mapping+0x16
                        0xffffffff8f5e8f5b  mapping_evict_folio+0x7b
                        0xffffffff8f5e9068  mapping_try_invalidate+0xe8
                        0xffffffff8f5e9160  invalidate_mapping_pages+0x10
                        0xffffffff8f9d3872  invalidate_bdev+0x42
   9448820     14.96 m       1.54 ms     95.01 us     spinlock   folio_lruvec_lock_irqsave+0x64
                        0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
                        0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
                        0xffffffff8f6e3ed4  folio_lruvec_lock_irqsave+0x64
                        0xffffffff8f5e587c  folio_batch_move_lru+0x5c
                        0xffffffff8f5e5a41  __folio_batch_add_and_move+0xd1
                        0xffffffff8f5e7593  deactivate_file_folio+0x43
                        0xffffffff8f5e90b7  mapping_try_invalidate+0x137
                        0xffffffff8f5e9160  invalidate_mapping_pages+0x10
   1488531     11.07 m       1.07 ms    446.39 us     spinlock   try_to_free_buffers+0x56
                        0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
                        0xffffffff903f5c7f  _raw_spin_lock+0x3f
                        0xffffffff8f768c76  try_to_free_buffers+0x56
                        0xffffffff8f5cf647  filemap_release_folio+0x87
                        0xffffffff8f5e8f4c  mapping_evict_folio+0x6c
                        0xffffffff8f5e9068  mapping_try_invalidate+0xe8
                        0xffffffff8f5e9160  invalidate_mapping_pages+0x10
                        0xffffffff8f9d3872  invalidate_bdev+0x42
   2556868      6.78 m     474.72 us    159.07 us     spinlock   blkdev_llseek+0x31
                        0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
                        0xffffffff903f5d01  _raw_spin_lock_irq+0x51
                        0xffffffff903f14c4  rwsem_down_write_slowpath+0x274
                        0xffffffff903f18fb  down_write+0x5b
                        0xffffffff8f9d5971  blkdev_llseek+0x31
                        0xffffffff8f703322  ksys_lseek+0x72
                        0xffffffff8f7033a8  __x64_sys_lseek+0x18
                        0xffffffff8f20b983  x64_sys_call+0x1fb3
   2512627      3.75 m     450.96 us     89.55 us     spinlock   blkdev_llseek+0x31
                        0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
                        0xffffffff903f5d01  _raw_spin_lock_irq+0x51
                        0xffffffff903f12f0  rwsem_down_write_slowpath+0xa0
                        0xffffffff903f18fb  down_write+0x5b
                        0xffffffff8f9d5971  blkdev_llseek+0x31
                        0xffffffff8f703322  ksys_lseek+0x72
                        0xffffffff8f7033a8  __x64_sys_lseek+0x18
                        0xffffffff8f20b983  x64_sys_call+0x1fb3
    908184      1.52 m     439.58 us    100.58 us     spinlock   blkdev_llseek+0x31
                        0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
                        0xffffffff903f5d01  _raw_spin_lock_irq+0x51
                        0xffffffff903f1367  rwsem_down_write_slowpath+0x117
                        0xffffffff903f18fb  down_write+0x5b
                        0xffffffff8f9d5971  blkdev_llseek+0x31
                        0xffffffff8f703322  ksys_lseek+0x72
                        0xffffffff8f7033a8  __x64_sys_lseek+0x18
                        0xffffffff8f20b983  x64_sys_call+0x1fb3
       134      1.48 m       1.22 s     663.88 ms        mutex   bdev_release+0x69
                        0xffffffff903ef1de  __mutex_lock.constprop.0+0x17e
                        0xffffffff903ef863  __mutex_lock_slowpath+0x13
                        0xffffffff903ef8bb  mutex_lock+0x3b
                        0xffffffff8f9d5249  bdev_release+0x69
                        0xffffffff8f9d5921  blkdev_release+0x11
                        0xffffffff8f7089f3  __fput+0xe3
                        0xffffffff8f708c9b  __fput_sync+0x1b
                        0xffffffff8f6fe8ed  __x64_sys_close+0x3d


perf-lock contention patched
 contended   total wait     max wait     avg wait         type   caller

   1153627     40.15 h      48.67 s     125.30 ms      rwsem:W   blkdev_llseek+0x31
                        0xffffffff903f15bc  rwsem_down_write_slowpath+0x36c
                        0xffffffff903f18fb  down_write+0x5b
                        0xffffffff8f9d5971  blkdev_llseek+0x31
                        0xffffffff8f703322  ksys_lseek+0x72
                        0xffffffff8f7033a8  __x64_sys_lseek+0x18
                        0xffffffff8f20b983  x64_sys_call+0x1fb3
                        0xffffffff903dce5e  do_syscall_64+0x7e
                        0xffffffff9040012b  entry_SYSCALL_64_after_hwframe+0x76
 276512439     39.19 h      46.90 ms    510.22 us     spinlock   clear_shadow_entries+0x57
                        0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
                        0xffffffff903f5c7f  _raw_spin_lock+0x3f
                        0xffffffff8f5e7967  clear_shadow_entries+0x57
                        0xffffffff8f5e90e3  mapping_try_invalidate+0x163
                        0xffffffff8f5e9160  invalidate_mapping_pages+0x10
                        0xffffffff8f9d3872  invalidate_bdev+0x42
                        0xffffffff8f9fac3e  blkdev_common_ioctl+0x9ae
                        0xffffffff8f9faea1  blkdev_ioctl+0xc1
 763119320     26.37 h     887.44 us    124.38 us     spinlock   rwsem_wake.isra.0+0x42
                        0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
                        0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
                        0xffffffff8f39e7d2  rwsem_wake.isra.0+0x42
                        0xffffffff8f39e88f  up_write+0x4f
                        0xffffffff8f9d598e  blkdev_llseek+0x4e
                        0xffffffff8f703322  ksys_lseek+0x72
                        0xffffffff8f7033a8  __x64_sys_lseek+0x18
                        0xffffffff8f20b983  x64_sys_call+0x1fb3
  33263910      2.87 h      29.43 ms    310.56 us     spinlock   __remove_mapping+0x5d
                        0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
                        0xffffffff903f5c7f  _raw_spin_lock+0x3f
                        0xffffffff8f5ec71d  __remove_mapping+0x5d
                        0xffffffff8f5f9be6  remove_mapping+0x16
                        0xffffffff8f5e8f5b  mapping_evict_folio+0x7b
                        0xffffffff8f5e9068  mapping_try_invalidate+0xe8
                        0xffffffff8f5e9160  invalidate_mapping_pages+0x10
                        0xffffffff8f9d3872  invalidate_bdev+0x42
  58671816      2.50 h     519.68 us    153.45 us     spinlock   folio_lruvec_lock_irqsave+0x64
                        0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
                        0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
                        0xffffffff8f6e3ed4  folio_lruvec_lock_irqsave+0x64
                        0xffffffff8f5e587c  folio_batch_move_lru+0x5c
                        0xffffffff8f5e5a41  __folio_batch_add_and_move+0xd1
                        0xffffffff8f5e7593  deactivate_file_folio+0x43
                        0xffffffff8f5e90b7  mapping_try_invalidate+0x137
                        0xffffffff8f5e9160  invalidate_mapping_pages+0x10
       284     22.33 m       5.35 s       4.72 s         mutex   bdev_release+0x69
                        0xffffffff903ef1de  __mutex_lock.constprop.0+0x17e
                        0xffffffff903ef863  __mutex_lock_slowpath+0x13
                        0xffffffff903ef8bb  mutex_lock+0x3b
                        0xffffffff8f9d5249  bdev_release+0x69
                        0xffffffff8f9d5921  blkdev_release+0x11
                        0xffffffff8f7089f3  __fput+0xe3
                        0xffffffff8f708c9b  __fput_sync+0x1b
                        0xffffffff8f6fe8ed  __x64_sys_close+0x3d
   2181469     21.38 m       1.15 ms    587.98 us     spinlock   try_to_free_buffers+0x56
                        0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
                        0xffffffff903f5c7f  _raw_spin_lock+0x3f
                        0xffffffff8f768c76  try_to_free_buffers+0x56
                        0xffffffff8f5cf647  filemap_release_folio+0x87
                        0xffffffff8f5e8f4c  mapping_evict_folio+0x6c
                        0xffffffff8f5e9068  mapping_try_invalidate+0xe8
                        0xffffffff8f5e9160  invalidate_mapping_pages+0x10
                        0xffffffff8f9d3872  invalidate_bdev+0x42
    454398      4.22 m      37.54 ms    557.13 us     spinlock   __remove_mapping+0x5d
                        0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
                        0xffffffff903f5c7f  _raw_spin_lock+0x3f
                        0xffffffff8f5ec71d  __remove_mapping+0x5d
                        0xffffffff8f5f4f04  shrink_folio_list+0xbc4
                        0xffffffff8f5f5a6b  evict_folios+0x34b
                        0xffffffff8f5f772f  try_to_shrink_lruvec+0x20f
                        0xffffffff8f5f79ef  shrink_one+0x10f
                        0xffffffff8f5fb975  shrink_node+0xb45
       773      3.53 m       2.60 s     273.76 ms        mutex   __lru_add_drain_all+0x3a
                        0xffffffff903ef1de  __mutex_lock.constprop.0+0x17e
                        0xffffffff903ef863  __mutex_lock_slowpath+0x13
                        0xffffffff903ef8bb  mutex_lock+0x3b
                        0xffffffff8f5e3d7a  __lru_add_drain_all+0x3a
                        0xffffffff8f5e77a0  lru_add_drain_all+0x10
                        0xffffffff8f9d3861  invalidate_bdev+0x31
                        0xffffffff8f9fac3e  blkdev_common_ioctl+0x9ae
                        0xffffffff8f9faea1  blkdev_ioctl+0xc1
   1997851      3.09 m     651.65 us     92.83 us     spinlock   folio_lruvec_lock_irqsave+0x64
                        0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
                        0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
                        0xffffffff8f6e3ed4  folio_lruvec_lock_irqsave+0x64
                        0xffffffff8f5e587c  folio_batch_move_lru+0x5c
                        0xffffffff8f5e5a41  __folio_batch_add_and_move+0xd1
                        0xffffffff8f5e5ae4  folio_add_lru+0x54
                        0xffffffff8f5d075d  filemap_add_folio+0xcd
                        0xffffffff8f5e30c0  page_cache_ra_order+0x220

Observations from perf-lock contention
--------------------------------------
- Significant reduction of contention for inode_lock (inode->i_rwsem)
  from blkdev_llseek() path.
- Significant increase in contention for inode->i_lock from invalidate
  and remove_mapping paths.
- Significant increase in contention for lruvec spinlock from
  deactivate_file_folio path.
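
To put rough numbers on these observations, the sketch below takes the "total wait" column for a few callers from the two perf-lock outputs above and computes the patched/default ratio. The struct, labels and function names are illustrative only; the 14.96 m entry is converted to hours so that all values share one unit.

```c
#include <assert.h>
#include <stdio.h>

/* Total wait in hours for one caller, from the default and patched runs. */
struct lock_wait {
	const char *caller;
	double def_hours;
	double patched_hours;
};

static const struct lock_wait waits[] = {
	{ "rwsem:W  blkdev_llseek",                      64.38,        40.15 },
	{ "spin     rwsem_wake (blkdev_llseek)",         64.69,        26.37 },
	{ "spin     clear_shadow_entries",               14.27,        39.19 },
	{ "spin     __remove_mapping (invalidate_bdev)",  1.76,         2.87 },
	{ "spin     lruvec (deactivate_file_folio)",     14.96 / 60.0,  2.50 },
};

#define NR_WAITS (sizeof(waits) / sizeof(waits[0]))

/* Patched/default ratio of total wait time; below 1.0 means less waiting. */
static double wait_ratio(const struct lock_wait *w)
{
	assert(w->def_hours > 0.0);
	return w->patched_hours / w->def_hours;
}

/* Print one line per caller with its ratio. */
static void print_wait_ratios(void)
{
	size_t i;

	for (i = 0; i < NR_WAITS; i++)
		printf("%-46s %6.2fx\n", waits[i].caller,
		       wait_ratio(&waits[i]));
}
```

Calling print_wait_ratios() from a main() gives ratios of about 0.62x and 0.41x for the two blkdev_llseek entries, 2.75x for clear_shadow_entries, 1.63x for __remove_mapping and 10.0x for the lruvec lock. These totals are not normalized by the amount of IO completed in each run, so they only indicate where the waiting moved.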

Comments on the above are welcome; I am specifically looking for input
on these:

- Lock contention results and usefulness of large folios in bringing
  down the contention in this specific case.
- If enabling large folios in block buffered IO path is a feasible
  approach, inputs on doing this cleanly and correctly.

Bharata B Rao (1):
  block/ioctl: Add an ioctl to enable large folios for block buffered IO
    path

 block/ioctl.c           | 8 ++++++++
 include/uapi/linux/fs.h | 2 ++
 2 files changed, 10 insertions(+)

Comments

Mateusz Guzik Nov. 27, 2024, 6:13 a.m. UTC | #1
On Wed, Nov 27, 2024 at 6:48 AM Bharata B Rao <bharata@amd.com> wrote:
>
> Recently we discussed the scalability issues while running large
> instances of FIO with buffered IO option on NVME block devices here:
>
> https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com/
>
> One of the suggestions Chris Mason gave (during private discussions) was
> to enable large folios in block buffered IO path as that could
> improve the scalability problems and improve the lock contention
> scenarios.
>

I have no basis to comment on the idea.

However, it is pretty apparent that, whatever the situation is, it is
being heavily disfigured by lock contention in blkdev_llseek:

> perf-lock contention output
> ---------------------------
> The lock contention data doesn't look all that conclusive but for 30% rwmixwrite
> mix it looks like this:
>
> perf-lock contention default
>  contended   total wait     max wait     avg wait         type   caller
>
> 1337359017     64.69 h     769.04 us    174.14 us     spinlock   rwsem_wake.isra.0+0x42
>                         0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>                         0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
>                         0xffffffff8f39e7d2  rwsem_wake.isra.0+0x42
>                         0xffffffff8f39e88f  up_write+0x4f
>                         0xffffffff8f9d598e  blkdev_llseek+0x4e
>                         0xffffffff8f703322  ksys_lseek+0x72
>                         0xffffffff8f7033a8  __x64_sys_lseek+0x18
>                         0xffffffff8f20b983  x64_sys_call+0x1fb3
>    2665573     64.38 h       1.98 s      86.95 ms      rwsem:W   blkdev_llseek+0x31
>                         0xffffffff903f15bc  rwsem_down_write_slowpath+0x36c
>                         0xffffffff903f18fb  down_write+0x5b
>                         0xffffffff8f9d5971  blkdev_llseek+0x31
>                         0xffffffff8f703322  ksys_lseek+0x72
>                         0xffffffff8f7033a8  __x64_sys_lseek+0x18
>                         0xffffffff8f20b983  x64_sys_call+0x1fb3
>                         0xffffffff903dce5e  do_syscall_64+0x7e
>                         0xffffffff9040012b  entry_SYSCALL_64_after_hwframe+0x76

Admittedly I'm not familiar with this code, but at a quick glance the
lock can be just straight up removed here?

static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
{
	struct inode *bd_inode = bdev_file_inode(file);
	loff_t retval;

	inode_lock(bd_inode);
	retval = fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
	inode_unlock(bd_inode);
	return retval;
}

At best it stabilizes the size for the duration of the call. Sounds
like it helps nothing since if the size can change, the file offset
will still be altered as if there was no locking?

Suppose the lock cannot be avoided when grabbing the size, for whatever reason.

While the above fio invocation did not work for me, I ran some crapper
which I had in my shell history and according to strace:
[pid 271829] lseek(7, 0, SEEK_SET)      = 0
[pid 271829] lseek(7, 0, SEEK_SET)      = 0
[pid 271830] lseek(7, 0, SEEK_SET)      = 0

... the lseeks just rewind to the beginning, *definitely* not needing
to know the size. One would have to check but this is most likely the
case in your test as well.

And for that there is 0 need to grab the size, and consequently the inode lock.
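
The argument above can be illustrated with a small user-space model. This is a sketch under stated assumptions, not kernel code and not a proposed patch: struct model_bdev, struct model_file and model_llseek() are hypothetical names, the size is read once with a C11 atomic load instead of under a lock, and arithmetic overflow of the offset is not handled.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdio.h>	/* SEEK_SET, SEEK_CUR, SEEK_END */

/* Hypothetical stand-in for the block device inode: only the size matters. */
struct model_bdev {
	_Atomic long long size;	/* device size in bytes */
};

/* Hypothetical stand-in for an open file: a device and a file position. */
struct model_file {
	struct model_bdev *bdev;
	long long pos;
};

/*
 * Seek using one lockless snapshot of the size. The size may change right
 * after the load, but a lock held only for the duration of the call would
 * not prevent that either: the position is stored based on whatever size
 * was observed at that moment.
 */
static long long model_llseek(struct model_file *f, long long offset, int whence)
{
	long long size;

	assert(f && f->bdev);
	size = atomic_load(&f->bdev->size);

	switch (whence) {
	case SEEK_SET:
		break;			/* absolute offset */
	case SEEK_CUR:
		offset += f->pos;	/* relative to current position */
		break;
	case SEEK_END:
		offset += size;		/* relative to the size snapshot */
		break;
	default:
		return -EINVAL;
	}

	/* Reject positions outside [0, size]. */
	if (offset < 0 || offset > size)
		return -EINVAL;

	f->pos = offset;
	return offset;
}
```

In this model an lseek(fd, 0, SEEK_SET) style rewind, as seen in the strace output, never contends on anything, because there is no lock to take and the snapshot is only used for the bounds check.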

>  134057198     14.27 h      35.93 ms    383.14 us     spinlock   clear_shadow_entries+0x57
>                         0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>                         0xffffffff903f5c7f  _raw_spin_lock+0x3f
>                         0xffffffff8f5e7967  clear_shadow_entries+0x57
>                         0xffffffff8f5e90e3  mapping_try_invalidate+0x163
>                         0xffffffff8f5e9160  invalidate_mapping_pages+0x10
>                         0xffffffff8f9d3872  invalidate_bdev+0x42
>                         0xffffffff8f9fac3e  blkdev_common_ioctl+0x9ae
>                         0xffffffff8f9faea1  blkdev_ioctl+0xc1
>   33351524      1.76 h      35.86 ms    190.43 us     spinlock   __remove_mapping+0x5d
>                         0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>                         0xffffffff903f5c7f  _raw_spin_lock+0x3f
>                         0xffffffff8f5ec71d  __remove_mapping+0x5d
>                         0xffffffff8f5f9be6  remove_mapping+0x16
>                         0xffffffff8f5e8f5b  mapping_evict_folio+0x7b
>                         0xffffffff8f5e9068  mapping_try_invalidate+0xe8
>                         0xffffffff8f5e9160  invalidate_mapping_pages+0x10
>                         0xffffffff8f9d3872  invalidate_bdev+0x42
>    9448820     14.96 m       1.54 ms     95.01 us     spinlock   folio_lruvec_lock_irqsave+0x64
>                         0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>                         0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
>                         0xffffffff8f6e3ed4  folio_lruvec_lock_irqsave+0x64
>                         0xffffffff8f5e587c  folio_batch_move_lru+0x5c
>                         0xffffffff8f5e5a41  __folio_batch_add_and_move+0xd1
>                         0xffffffff8f5e7593  deactivate_file_folio+0x43
>                         0xffffffff8f5e90b7  mapping_try_invalidate+0x137
>                         0xffffffff8f5e9160  invalidate_mapping_pages+0x10
>    1488531     11.07 m       1.07 ms    446.39 us     spinlock   try_to_free_buffers+0x56
>                         0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>                         0xffffffff903f5c7f  _raw_spin_lock+0x3f
>                         0xffffffff8f768c76  try_to_free_buffers+0x56
>                         0xffffffff8f5cf647  filemap_release_folio+0x87
>                         0xffffffff8f5e8f4c  mapping_evict_folio+0x6c
>                         0xffffffff8f5e9068  mapping_try_invalidate+0xe8
>                         0xffffffff8f5e9160  invalidate_mapping_pages+0x10
>                         0xffffffff8f9d3872  invalidate_bdev+0x42
>    2556868      6.78 m     474.72 us    159.07 us     spinlock   blkdev_llseek+0x31
>                         0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>                         0xffffffff903f5d01  _raw_spin_lock_irq+0x51
>                         0xffffffff903f14c4  rwsem_down_write_slowpath+0x274
>                         0xffffffff903f18fb  down_write+0x5b
>                         0xffffffff8f9d5971  blkdev_llseek+0x31
>                         0xffffffff8f703322  ksys_lseek+0x72
>                         0xffffffff8f7033a8  __x64_sys_lseek+0x18
>                         0xffffffff8f20b983  x64_sys_call+0x1fb3
>    2512627      3.75 m     450.96 us     89.55 us     spinlock   blkdev_llseek+0x31
>                         0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>                         0xffffffff903f5d01  _raw_spin_lock_irq+0x51
>                         0xffffffff903f12f0  rwsem_down_write_slowpath+0xa0
>                         0xffffffff903f18fb  down_write+0x5b
>                         0xffffffff8f9d5971  blkdev_llseek+0x31
>                         0xffffffff8f703322  ksys_lseek+0x72
>                         0xffffffff8f7033a8  __x64_sys_lseek+0x18
>                         0xffffffff8f20b983  x64_sys_call+0x1fb3
>     908184      1.52 m     439.58 us    100.58 us     spinlock   blkdev_llseek+0x31
>                         0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>                         0xffffffff903f5d01  _raw_spin_lock_irq+0x51
>                         0xffffffff903f1367  rwsem_down_write_slowpath+0x117
>                         0xffffffff903f18fb  down_write+0x5b
>                         0xffffffff8f9d5971  blkdev_llseek+0x31
>                         0xffffffff8f703322  ksys_lseek+0x72
>                         0xffffffff8f7033a8  __x64_sys_lseek+0x18
>                         0xffffffff8f20b983  x64_sys_call+0x1fb3
>        134      1.48 m       1.22 s     663.88 ms        mutex   bdev_release+0x69
>                         0xffffffff903ef1de  __mutex_lock.constprop.0+0x17e
>                         0xffffffff903ef863  __mutex_lock_slowpath+0x13
>                         0xffffffff903ef8bb  mutex_lock+0x3b
>                         0xffffffff8f9d5249  bdev_release+0x69
>                         0xffffffff8f9d5921  blkdev_release+0x11
>                         0xffffffff8f7089f3  __fput+0xe3
>                         0xffffffff8f708c9b  __fput_sync+0x1b
>                         0xffffffff8f6fe8ed  __x64_sys_close+0x3d
>
>
> perf-lock contention patched
>  contended   total wait     max wait     avg wait         type   caller
>
>    1153627     40.15 h      48.67 s     125.30 ms      rwsem:W   blkdev_llseek+0x31
>                         0xffffffff903f15bc  rwsem_down_write_slowpath+0x36c
>                         0xffffffff903f18fb  down_write+0x5b
>                         0xffffffff8f9d5971  blkdev_llseek+0x31
>                         0xffffffff8f703322  ksys_lseek+0x72
>                         0xffffffff8f7033a8  __x64_sys_lseek+0x18
>                         0xffffffff8f20b983  x64_sys_call+0x1fb3
>                         0xffffffff903dce5e  do_syscall_64+0x7e
>                         0xffffffff9040012b  entry_SYSCALL_64_after_hwframe+0x76
>  276512439     39.19 h      46.90 ms    510.22 us     spinlock   clear_shadow_entries+0x57
>                         0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>                         0xffffffff903f5c7f  _raw_spin_lock+0x3f
>                         0xffffffff8f5e7967  clear_shadow_entries+0x57
>                         0xffffffff8f5e90e3  mapping_try_invalidate+0x163
>                         0xffffffff8f5e9160  invalidate_mapping_pages+0x10
>                         0xffffffff8f9d3872  invalidate_bdev+0x42
>                         0xffffffff8f9fac3e  blkdev_common_ioctl+0x9ae
>                         0xffffffff8f9faea1  blkdev_ioctl+0xc1
>  763119320     26.37 h     887.44 us    124.38 us     spinlock   rwsem_wake.isra.0+0x42
>                         0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>                         0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
>                         0xffffffff8f39e7d2  rwsem_wake.isra.0+0x42
>                         0xffffffff8f39e88f  up_write+0x4f
>                         0xffffffff8f9d598e  blkdev_llseek+0x4e
>                         0xffffffff8f703322  ksys_lseek+0x72
>                         0xffffffff8f7033a8  __x64_sys_lseek+0x18
>                         0xffffffff8f20b983  x64_sys_call+0x1fb3
>   33263910      2.87 h      29.43 ms    310.56 us     spinlock   __remove_mapping+0x5d
>                         0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>                         0xffffffff903f5c7f  _raw_spin_lock+0x3f
>                         0xffffffff8f5ec71d  __remove_mapping+0x5d
>                         0xffffffff8f5f9be6  remove_mapping+0x16
>                         0xffffffff8f5e8f5b  mapping_evict_folio+0x7b
>                         0xffffffff8f5e9068  mapping_try_invalidate+0xe8
>                         0xffffffff8f5e9160  invalidate_mapping_pages+0x10
>                         0xffffffff8f9d3872  invalidate_bdev+0x42
>   58671816      2.50 h     519.68 us    153.45 us     spinlock   folio_lruvec_lock_irqsave+0x64
>                         0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>                         0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
>                         0xffffffff8f6e3ed4  folio_lruvec_lock_irqsave+0x64
>                         0xffffffff8f5e587c  folio_batch_move_lru+0x5c
>                         0xffffffff8f5e5a41  __folio_batch_add_and_move+0xd1
>                         0xffffffff8f5e7593  deactivate_file_folio+0x43
>                         0xffffffff8f5e90b7  mapping_try_invalidate+0x137
>                         0xffffffff8f5e9160  invalidate_mapping_pages+0x10
>        284     22.33 m       5.35 s       4.72 s         mutex   bdev_release+0x69
>                         0xffffffff903ef1de  __mutex_lock.constprop.0+0x17e
>                         0xffffffff903ef863  __mutex_lock_slowpath+0x13
>                         0xffffffff903ef8bb  mutex_lock+0x3b
>                         0xffffffff8f9d5249  bdev_release+0x69
>                         0xffffffff8f9d5921  blkdev_release+0x11
>                         0xffffffff8f7089f3  __fput+0xe3
>                         0xffffffff8f708c9b  __fput_sync+0x1b
>                         0xffffffff8f6fe8ed  __x64_sys_close+0x3d
>    2181469     21.38 m       1.15 ms    587.98 us     spinlock   try_to_free_buffers+0x56
>                         0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>                         0xffffffff903f5c7f  _raw_spin_lock+0x3f
>                         0xffffffff8f768c76  try_to_free_buffers+0x56
>                         0xffffffff8f5cf647  filemap_release_folio+0x87
>                         0xffffffff8f5e8f4c  mapping_evict_folio+0x6c
>                         0xffffffff8f5e9068  mapping_try_invalidate+0xe8
>                         0xffffffff8f5e9160  invalidate_mapping_pages+0x10
>                         0xffffffff8f9d3872  invalidate_bdev+0x42
>     454398      4.22 m      37.54 ms    557.13 us     spinlock   __remove_mapping+0x5d
>                         0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>                         0xffffffff903f5c7f  _raw_spin_lock+0x3f
>                         0xffffffff8f5ec71d  __remove_mapping+0x5d
>                         0xffffffff8f5f4f04  shrink_folio_list+0xbc4
>                         0xffffffff8f5f5a6b  evict_folios+0x34b
>                         0xffffffff8f5f772f  try_to_shrink_lruvec+0x20f
>                         0xffffffff8f5f79ef  shrink_one+0x10f
>                         0xffffffff8f5fb975  shrink_node+0xb45
>        773      3.53 m       2.60 s     273.76 ms        mutex   __lru_add_drain_all+0x3a
>                         0xffffffff903ef1de  __mutex_lock.constprop.0+0x17e
>                         0xffffffff903ef863  __mutex_lock_slowpath+0x13
>                         0xffffffff903ef8bb  mutex_lock+0x3b
>                         0xffffffff8f5e3d7a  __lru_add_drain_all+0x3a
>                         0xffffffff8f5e77a0  lru_add_drain_all+0x10
>                         0xffffffff8f9d3861  invalidate_bdev+0x31
>                         0xffffffff8f9fac3e  blkdev_common_ioctl+0x9ae
>                         0xffffffff8f9faea1  blkdev_ioctl+0xc1
>    1997851      3.09 m     651.65 us     92.83 us     spinlock   folio_lruvec_lock_irqsave+0x64
>                         0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
>                         0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
>                         0xffffffff8f6e3ed4  folio_lruvec_lock_irqsave+0x64
>                         0xffffffff8f5e587c  folio_batch_move_lru+0x5c
>                         0xffffffff8f5e5a41  __folio_batch_add_and_move+0xd1
>                         0xffffffff8f5e5ae4  folio_add_lru+0x54
>                         0xffffffff8f5d075d  filemap_add_folio+0xcd
>                         0xffffffff8f5e30c0  page_cache_ra_order+0x220
>
> Observations from perf-lock contention
> --------------------------------------
> - Significant reduction of contention for inode_lock (inode->i_rwsem)
>   from blkdev_llseek() path.
> - Significant increase in contention for inode->i_lock from invalidate
>   and remove_mapping paths.
> - Significant increase in contention for lruvec spinlock from
>   deactivate_file_folio path.
>
> Requesting comments on the above; I am specifically looking for inputs
> on these:
>
> - Lock contention results and usefulness of large folios in bringing
>   down the contention in this specific case.
> - If enabling large folios in block buffered IO path is a feasible
>   approach, inputs on doing this cleanly and correctly.
>
> Bharata B Rao (1):
>   block/ioctl: Add an ioctl to enable large folios for block buffered IO
>     path
>
>  block/ioctl.c           | 8 ++++++++
>  include/uapi/linux/fs.h | 2 ++
>  2 files changed, 10 insertions(+)
>
> --
> 2.34.1
>
Mateusz Guzik Nov. 27, 2024, 6:19 a.m. UTC | #2
On Wed, Nov 27, 2024 at 7:13 AM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> On Wed, Nov 27, 2024 at 6:48 AM Bharata B Rao <bharata@amd.com> wrote:
> >
> > Recently we discussed the scalability issues while running large
> > instances of FIO with buffered IO option on NVME block devices here:
> >
> > https://lore.kernel.org/linux-mm/d2841226-e27b-4d3d-a578-63587a3aa4f3@amd.com/
> >
> > One of the suggestions Chris Mason gave (during private discussions) was
> > to enable large folios in block buffered IO path as that could
> > improve the scalability problems and improve the lock contention
> > scenarios.
> >
>
> I have no basis to comment on the idea.
>
> However, it is pretty apparent that, whatever the situation, it is being
> heavily disfigured by lock contention in blkdev_llseek:
>
> > perf-lock contention output
> > ---------------------------
> > The lock contention data doesn't look all that conclusive but for 30% rwmixwrite
> > mix it looks like this:
> >
> > perf-lock contention default
> >  contended   total wait     max wait     avg wait         type   caller
> >
> > 1337359017     64.69 h     769.04 us    174.14 us     spinlock   rwsem_wake.isra.0+0x42
> >                         0xffffffff903f60a3  native_queued_spin_lock_slowpath+0x1f3
> >                         0xffffffff903f537c  _raw_spin_lock_irqsave+0x5c
> >                         0xffffffff8f39e7d2  rwsem_wake.isra.0+0x42
> >                         0xffffffff8f39e88f  up_write+0x4f
> >                         0xffffffff8f9d598e  blkdev_llseek+0x4e
> >                         0xffffffff8f703322  ksys_lseek+0x72
> >                         0xffffffff8f7033a8  __x64_sys_lseek+0x18
> >                         0xffffffff8f20b983  x64_sys_call+0x1fb3
> >    2665573     64.38 h       1.98 s      86.95 ms      rwsem:W   blkdev_llseek+0x31
> >                         0xffffffff903f15bc  rwsem_down_write_slowpath+0x36c
> >                         0xffffffff903f18fb  down_write+0x5b
> >                         0xffffffff8f9d5971  blkdev_llseek+0x31
> >                         0xffffffff8f703322  ksys_lseek+0x72
> >                         0xffffffff8f7033a8  __x64_sys_lseek+0x18
> >                         0xffffffff8f20b983  x64_sys_call+0x1fb3
> >                         0xffffffff903dce5e  do_syscall_64+0x7e
> >                         0xffffffff9040012b  entry_SYSCALL_64_after_hwframe+0x76
>
> Admittedly I'm not familiar with this code, but at a quick glance the
> lock can be just straight up removed here?
>
> static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
> {
>         struct inode *bd_inode = bdev_file_inode(file);
>         loff_t retval;
>
>         inode_lock(bd_inode);
>         retval = fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
>         inode_unlock(bd_inode);
>         return retval;
> }
>
> At best it stabilizes the size for the duration of the call. Sounds
> like it helps nothing since if the size can change, the file offset
> will still be altered as if there was no locking?
>
> Suppose grabbing the size cannot be avoided for whatever reason.
>
> While the above fio invocation did not work for me, I ran some crapper
> which I had in my shell history and according to strace:
> [pid 271829] lseek(7, 0, SEEK_SET)      = 0
> [pid 271829] lseek(7, 0, SEEK_SET)      = 0
> [pid 271830] lseek(7, 0, SEEK_SET)      = 0
>
> ... the lseeks just rewind to the beginning, *definitely* not needing
> to know the size. One would have to check but this is most likely the
> case in your test as well.
>
> And for that there is 0 need to grab the size, and consequently the inode lock.

That is to say, at a bare minimum this needs to be benchmarked before/after
with the lock removed from the picture, like so:

diff --git a/block/fops.c b/block/fops.c
index 2d01c9007681..7f9e9e2f9081 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -534,12 +534,8 @@ const struct address_space_operations def_blk_aops = {
 static loff_t blkdev_llseek(struct file *file, loff_t offset, int whence)
 {
        struct inode *bd_inode = bdev_file_inode(file);
-       loff_t retval;

-       inode_lock(bd_inode);
-       retval = fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
-       inode_unlock(bd_inode);
-       return retval;
+       return fixed_size_llseek(file, offset, whence, i_size_read(bd_inode));
 }

 static int blkdev_fsync(struct file *filp, loff_t start, loff_t end,

To be aborted if it blows up (but I don't see why it would).
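
For a quick before/after comparison of the lseek path on its own, something
like the following userspace loop could be used. This is only a minimal sketch
and not part of the patch: the names run_seek_bench() and seek_loop() are made
up for illustration, and it uses forked processes rather than fio's -thread
mode, on the assumption that the inode lock is per inode and so contended in
the same way. It only repeats the lseek(fd, 0, SEEK_SET) pattern seen in the
strace output above and does no IO.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Child body: rewind the shared descriptor `iterations` times. */
static int seek_loop(int fd, long iterations)
{
	assert(fd >= 0);
	for (long i = 0; i < iterations; i++)
		if (lseek(fd, 0, SEEK_SET) != 0)
			return 1;	/* unexpected offset or error */
	return 0;
}

/*
 * Hypothetical helper: fork `nprocs` children that each issue
 * lseek(fd, 0, SEEK_SET) `iterations` times on one descriptor.
 * Returns elapsed wall-clock seconds, or -1.0 on any failure.
 */
static double run_seek_bench(const char *path, int nprocs, long iterations)
{
	struct timespec start, end;
	int started = 0, failed = 0;
	pid_t *pids;
	int fd;

	if (nprocs <= 0 || iterations <= 0)
		return -1.0;
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1.0;
	pids = malloc(sizeof(*pids) * (size_t)nprocs);
	if (!pids) {
		close(fd);
		return -1.0;
	}

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (int i = 0; i < nprocs; i++) {
		pid_t pid = fork();
		if (pid < 0) {
			failed = 1;
			break;
		}
		if (pid == 0)	/* child: seek, then exit without flushing stdio */
			_exit(seek_loop(fd, iterations));
		pids[started++] = pid;
	}
	/* Reap every child that was actually started. */
	for (int i = 0; i < started; i++) {
		int status;
		if (waitpid(pids[i], &status, 0) < 0 ||
		    !WIFEXITED(status) || WEXITSTATUS(status) != 0)
			failed = 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	free(pids);
	close(fd);
	if (failed)
		return -1.0;
	return (double)(end.tv_sec - start.tv_sec) +
	       (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Calling run_seek_bench() from a small main() with one of the partitions from
the test setup (e.g. /dev/nvme2n1p1) and a process count matching numjobs=252
on both kernels would give a rough idea of how much the lock costs in this
path. The iteration count is arbitrary and the number is only meaningful
relative to the other kernel on the same machine.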