Message ID | 20230601092400.27162-1-zhiyin.chen@intel.com (mailing list archive)
---|---
State | Accepted
Series | fs.h: Optimize file struct to prevent false sharing
On Thu, Jun 01, 2023 at 05:24:00AM -0400, chenzhiyin wrote:
> In the syscall test of UnixBench, performance regression occurred due
> to false sharing.
>
> The lock and atomic members, including file::f_lock, file::f_count and
> file::f_pos_lock are highly contended and frequently updated in the
> high-concurrency test scenarios. perf c2c identified one affected
> read access, file::f_op.
>
> [...]

Dave had some more concerns and perf analysis requests for this. So
this will be put on hold until these are addressed.
On 6/1/23 11:24, chenzhiyin wrote:
> In the syscall test of UnixBench, performance regression occurred due
> to false sharing.
>
> [...]
>
> +	struct path		f_path;
> +	struct inode		*f_inode;	/* cached value */
> +	const struct file_operations	*f_op;

Maybe add a comment to the struct noting that the member order is
cache-line optimized? I.e. any change to the structure that does not
take cache lines into account might/will invalidate your optimization.
As it stands, your patch adds maintenance overhead without giving a
hint about that.

Thanks,
Bernd
On Thu, 01 Jun 2023 05:24:00 -0400, chenzhiyin wrote:
> In the syscall test of UnixBench, performance regression occurred due
> to false sharing.
>
> [...]

Applied to the vfs.misc branch of the vfs/vfs.git tree. Patches in the
vfs.misc branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs.misc

[1/1] fs.h: Optimize file struct to prevent false sharing
      https://git.kernel.org/vfs/vfs/c/b63bfcf3c65d
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 133f0640fb24..cf1388e4dad0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -962,23 +962,23 @@ struct file {
 		struct rcu_head 	f_rcuhead;
 		unsigned int 		f_iocb_flags;
 	};
-	struct path		f_path;
-	struct inode		*f_inode;	/* cached value */
-	const struct file_operations	*f_op;
 
 	/*
 	 * Protects f_ep, f_flags.
 	 * Must not be taken from IRQ context.
 	 */
 	spinlock_t		f_lock;
-	atomic_long_t		f_count;
-	unsigned int		f_flags;
 	fmode_t			f_mode;
+	atomic_long_t		f_count;
 	struct mutex		f_pos_lock;
 	loff_t			f_pos;
+	unsigned int		f_flags;
 	struct fown_struct	f_owner;
 	const struct cred	*f_cred;
 	struct file_ra_state	f_ra;
+	struct path		f_path;
+	struct inode		*f_inode;	/* cached value */
+	const struct file_operations	*f_op;
 
 	u64			f_version;
 #ifdef CONFIG_SECURITY
In the syscall test of UnixBench, performance regression occurred due
to false sharing.

The lock and atomic members, including file::f_lock, file::f_count and
file::f_pos_lock, are highly contended and frequently updated in the
high-concurrency test scenarios. perf c2c identified one affected read
access, file::f_op.

To prevent false sharing, the layout of the file struct is changed as
follows:
(A) f_lock, f_count and f_pos_lock are put together to share the same
    cache line.
(B) The read-mostly members, including f_path, f_inode and f_op, are
    put into a separate cache line.
(C) f_mode is put together with f_count, since they are frequently used
    at the same time.

Due to the '__randomize_layout' attribute of the file struct, the
updated layout can only take effect when CONFIG_RANDSTRUCT_NONE is 'y'.

The optimization has been validated in the syscall test of UnixBench;
the performance gain is 30~50%. Furthermore, to confirm the
effectiveness of the optimization on other code paths, the results of
fsdisk, fsbuffer and fstime are also shown.

Here are the detailed test results of UnixBench.
Command: numactl -C 3-18 ./Run -c 16 syscall fsbuffer fstime fsdisk

Without Patch
------------------------------------------------------------------------
File Copy 1024 bufsize 2000 maxblocks    875052.1 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks      235484.0 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks   2815153.5 KBps  (30.0 s, 2 samples)
System Call Overhead                    5772268.3 lps   (10.0 s, 7 samples)

System Benchmarks Partial Index          BASELINE     RESULT    INDEX
File Copy 1024 bufsize 2000 maxblocks      3960.0   875052.1   2209.7
File Copy 256 bufsize 500 maxblocks        1655.0   235484.0   1422.9
File Copy 4096 bufsize 8000 maxblocks      5800.0  2815153.5   4853.7
System Call Overhead                      15000.0  5772268.3   3848.2
                                                             ========
System Benchmarks Index Score (Partial Only)                   2768.3

With Patch
------------------------------------------------------------------------
File Copy 1024 bufsize 2000 maxblocks   1009977.2 KBps  (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks      264765.9 KBps  (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks   3052236.0 KBps  (30.0 s, 2 samples)
System Call Overhead                    8237404.4 lps   (10.0 s, 7 samples)

System Benchmarks Partial Index          BASELINE     RESULT    INDEX
File Copy 1024 bufsize 2000 maxblocks      3960.0  1009977.2   2550.4
File Copy 256 bufsize 500 maxblocks        1655.0   264765.9   1599.8
File Copy 4096 bufsize 8000 maxblocks      5800.0  3052236.0   5262.5
System Call Overhead                      15000.0  8237404.4   5491.6
                                                             ========
System Benchmarks Index Score (Partial Only)                   3295.3

Signed-off-by: chenzhiyin <zhiyin.chen@intel.com>
---
 include/linux/fs.h | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)