Message ID: 20230628105624.150352-1-lipeng.zhu@intel.com
State:      New
Series:     fs/address_space: add alignment padding for i_map and i_mmap_rwsem to mitigate a false sharing.
On Wed, 28 Jun 2023 18:56:25 +0800 "Zhu, Lipeng" <lipeng.zhu@intel.com> wrote:

> When running UnixBench/Shell Scripts, we observed high false sharing
> for accessing i_mmap against i_mmap_rwsem.
>
> UnixBench/Shell Scripts are typical load/execute command test scenarios,
> the i_mmap will be accessed frequently to insert/remove vma_interval_tree.
> Meanwhile, the i_mmap_rwsem is frequently loaded. Unfortunately, they are
> in the same cacheline.

That sounds odd. One would expect these two fields to be used in close
conjunction, so any sharing might even be beneficial. Can you identify
in more detail what's actually going on in there?

> The patch places the i_mmap and i_mmap_rwsem in separate cache lines to
> avoid this false sharing problem.
>
> With this patch, on Intel Sapphire Rapids 2 sockets 112c/224t platform,
> based on kernel v6.4-rc4, the 224 parallel score is improved ~2.5% for
> UnixBench/Shell Scripts case. And perf c2c tool shows the false sharing
> is resolved as expected, the symbol vma_interval_tree_remove disappeared
> in cache line 0 after this change.

There can be many address_spaces in memory, so a size increase is a
concern. Is there anything we can do to minimize the cost of this?
Hi Andrew,

> -----Original Message-----
> From: Andrew Morton <akpm@linux-foundation.org>
> Sent: Monday, July 3, 2023 5:11 AM
> To: Zhu, Lipeng <lipeng.zhu@intel.com>
> Cc: viro@zeniv.linux.org.uk; brauner@kernel.org;
> linux-fsdevel@vger.kernel.org; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; Deng, Pan <pan.deng@intel.com>; Ma, Yu
> <yu.ma@intel.com>; Li, Tianyou <tianyou.li@intel.com>;
> tim.c.chen@linux.intel.com
> Subject: Re: [PATCH] fs/address_space: add alignment padding for i_map
> and i_mmap_rwsem to mitigate a false sharing.
>
> On Wed, 28 Jun 2023 18:56:25 +0800 "Zhu, Lipeng" <lipeng.zhu@intel.com>
> wrote:
>
> > When running UnixBench/Shell Scripts, we observed high false sharing
> > for accessing i_mmap against i_mmap_rwsem.
> >
> > UnixBench/Shell Scripts are typical load/execute command test
> > scenarios, the i_mmap will be accessed frequently to insert/remove
> > vma_interval_tree. Meanwhile, the i_mmap_rwsem is frequently loaded.
> > Unfortunately, they are in the same cacheline.
>
> That sounds odd. One would expect these two fields to be used in close
> conjunction, so any sharing might even be beneficial. Can you identify
> in more detail what's actually going on in there?

Yes, I'm running UnixBench/Shell Script, which concurrently
launches, executes, and exits a lot of shell commands. While the
workload is running:

1: A lot of processes invoke vma_interval_tree_remove, which touches
   'i_mmap'. The call stack:

   ----vma_interval_tree_remove
       |----unlink_file_vma
       |    free_pgtables
       |    |----exit_mmap
       |    |    mmput
       |    |    |----begin_new_exec
       |    |    |    load_elf_binary
       |    |    |    bprm_execve

2: Also, a lot of processes touch 'i_mmap_rwsem' to acquire the
   semaphore in order to access 'i_mmap'. In the existing
   'address_space' layout, 'i_mmap' and 'i_mmap_rwsem' are in the
   same cacheline.
struct address_space {
        struct inode *             host;                 /*     0     8 */
        struct xarray              i_pages;              /*     8    16 */
        struct rw_semaphore        invalidate_lock;      /*    24    40 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        gfp_t                      gfp_mask;             /*    64     4 */
        atomic_t                   i_mmap_writable;      /*    68     4 */
        struct rb_root_cached      i_mmap;               /*    72    16 */
        struct rw_semaphore        i_mmap_rwsem;         /*    88    40 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        long unsigned int          nrpages;              /*   128     8 */
        long unsigned int          writeback_index;      /*   136     8 */
        const struct address_space_operations * a_ops;   /*   144     8 */
        long unsigned int          flags;                /*   152     8 */
        errseq_t                   wb_err;               /*   160     4 */
        spinlock_t                 private_lock;         /*   164     4 */
        struct list_head           private_list;         /*   168    16 */
        void *                     private_data;         /*   184     8 */

        /* size: 192, cachelines: 3, members: 15 */
};

The following perf c2c result shows heavy c2c bounce due to this false
sharing:

=================================================
      Shared Cache Line Distribution Pareto
=================================================
-------------------------------------------------------------
    0     3729     5791        0       0  0xff19b3818445c740
-------------------------------------------------------------
  3.27%  3.02%  0.00%  0.00%  0x18  0  1  0xffffffffa194403b   604   483   389   692  203  [k] vma_interval_tree_insert   [kernel.kallsyms]  vma_interval_tree_insert+75    0  1
  4.13%  3.63%  0.00%  0.00%  0x20  0  1  0xffffffffa19440a2   553   413   415   962  215  [k] vma_interval_tree_remove   [kernel.kallsyms]  vma_interval_tree_remove+18    0  1
  2.04%  1.35%  0.00%  0.00%  0x28  0  1  0xffffffffa219a1d6  1210   855   460  1229  222  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  rwsem_down_write_slowpath+678  0  1
  0.62%  1.85%  0.00%  0.00%  0x28  0  1  0xffffffffa219a1bf   762   329   577   527  198  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  rwsem_down_write_slowpath+655  0  1
  0.48%  0.31%  0.00%  0.00%  0x28  0  1  0xffffffffa219a58c  1677  1476   733  1544  224  [k] down_write                 [kernel.kallsyms]  down_write+28                  0  1
  0.05%  0.07%  0.00%  0.00%  0x28  0  1  0xffffffffa219a21d  1040   819   689    33   27  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  rwsem_down_write_slowpath+749  0  1
  0.00%  0.05%  0.00%  0.00%  0x28  0  1  0xffffffffa17707db     0  1005   786  1373  223  [k] up_write                   [kernel.kallsyms]  up_write+27                    0  1
  0.00%  0.02%  0.00%  0.00%  0x28  0  1  0xffffffffa219a064     0   233   778    32   30  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  rwsem_down_write_slowpath+308  0  1
 33.82% 34.10%  0.00%  0.00%  0x30  0  1  0xffffffffa1770945   779   495   534  6011  224  [k] rwsem_spin_on_owner        [kernel.kallsyms]  rwsem_spin_on_owner+53         0  1
 17.06% 15.28%  0.00%  0.00%  0x30  0  1  0xffffffffa1770915   593   438   468  2715  224  [k] rwsem_spin_on_owner        [kernel.kallsyms]  rwsem_spin_on_owner+5          0  1
  3.54%  3.52%  0.00%  0.00%  0x30  0  1  0xffffffffa2199f84   881   601   583  1421  223  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  rwsem_down_write_slowpath+84   0  1

> > The patch places the i_mmap and i_mmap_rwsem in separate cache lines
> > to avoid this false sharing problem.
> >
> > With this patch, on Intel Sapphire Rapids 2 sockets 112c/224t
> > platform, based on kernel v6.4-rc4, the 224 parallel score is improved
> > ~2.5% for UnixBench/Shell Scripts case. And perf c2c tool shows the
> > false sharing is resolved as expected, the symbol
> > vma_interval_tree_remove disappeared in cache line 0 after this change.
>
> There can be many address_spaces in memory, so a size increase is a
> concern. Is there anything we can do to minimize the cost of this?

Thanks for the reminder about the memory size increase. After the
padding, struct address_space needs 4 cachelines instead of 3, a size
increase of around 33%.

I then tried another approach: moving 'i_mmap_rwsem' below the field
'flags'. This does not increase the size of struct address_space, and
based on v6.4.0, on the Intel Sapphire Rapids 112C/224T platform, the
score improves by ~5.3%. The perf c2c record data shows the false
sharing between 'i_mmap' and 'i_mmap_rwsem' is fixed:
=================================================
      Shared Cache Line Distribution Pareto
=================================================
-------------------------------------------------------------
    0      556      838        0       0  0xff2780d7965d2780
-------------------------------------------------------------
  0.18%  0.60%  0.00%  0.00%   0x8  0  1  0xffffffffafff27b8   503   453   569    14   13  [k] do_dentry_open             [kernel.kallsyms]  do_dentry_open+456             0  1
  0.54%  0.12%  0.00%  0.00%   0x8  0  1  0xffffffffaffc51ac   510   199   428    15   12  [k] hugepage_vma_check         [kernel.kallsyms]  hugepage_vma_check+252         0  1
  1.80%  2.15%  0.00%  0.00%  0x18  0  1  0xffffffffb079a1d6  1778   799   343   215  136  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  rwsem_down_write_slowpath+678  0  1
  0.54%  1.31%  0.00%  0.00%  0x18  0  1  0xffffffffb079a1bf   547   296   528    91   71  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  rwsem_down_write_slowpath+655  0  1
  0.72%  0.72%  0.00%  0.00%  0x18  0  1  0xffffffffb079a58c  1479  1534   676   288  163  [k] down_write                 [kernel.kallsyms]  down_write+28                  0  1
  0.00%  0.12%  0.00%  0.00%  0x18  0  1  0xffffffffafd707db     0  2381   744   282  158  [k] up_write                   [kernel.kallsyms]  up_write+27                    0  1
  0.00%  0.12%  0.00%  0.00%  0x18  0  1  0xffffffffb079a064     0   239   518     6    6  [k] rwsem_down_write_slowpath  [kernel.kallsyms]  rwsem_down_write_slowpath+308  0  1
 46.58% 47.02%  0.00%  0.00%  0x20  0  1  0xffffffffafd70945   704   403   499  1137  219  [k] rwsem_spin_on_owner        [kernel.kallsyms]  rwsem_spin_on_owner+53         0  1
 23.92% 25.78%  0.00%  0.00%  0x20  0  1  0xffffffffafd70915   558   413   500   542  185  [k] rwsem_spin_on_owner        [kernel.kallsyms]  rwsem_spin_on_owner+5          0  1

I will send another patch out and update the commit log.

Best Regards,
Lipeng Zhu
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c85916e9f7db..d3dd8dcc9b8b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -434,7 +434,7 @@ struct address_space {
 	atomic_t		nr_thps;
 #endif
 	struct rb_root_cached	i_mmap;
-	struct rw_semaphore	i_mmap_rwsem;
+	struct rw_semaphore	i_mmap_rwsem ____cacheline_aligned_in_smp;
 	unsigned long		nrpages;
 	pgoff_t			writeback_index;
 	const struct address_space_operations *a_ops;