Message ID: 1572866980-13001-1-git-send-email-zhangshaokun@hisilicon.com (mailing list archive)
State: New, archived
Series: xfs: optimise xfs_mod_icount/ifree when delta < 0
On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
> From: Yang Guo <guoyang2@huawei.com>
>
> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
> whether the counter less than 0 and it is a expensive function.
> let's check it only when delta < 0, it will be good for xfs's performance.

How much overhead do you see?  In the end the compare is just a debug
check, so if it actually shows up we should remove it entirely.
On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
> From: Yang Guo <guoyang2@huawei.com>
>
> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
> whether the counter less than 0 and it is a expensive function.
> let's check it only when delta < 0, it will be good for xfs's performance.

Hmmm. I don't recall this as being expensive.

How did you find this? Can you please always document how you found
the problem being addressed in the commit message so that we don't
then have to ask how the problem being fixed is reproduced.

> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Signed-off-by: Yang Guo <guoyang2@huawei.com>
> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
> ---
>  fs/xfs/xfs_mount.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index ba5b6f3b2b88..5e8314e6565e 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -1174,6 +1174,9 @@ xfs_mod_icount(
> 	int64_t		delta)
>  {
>  	percpu_counter_add_batch(&mp->m_icount, delta, XFS_ICOUNT_BATCH);
> +	if (delta > 0)
> +		return 0;
> +
>  	if (__percpu_counter_compare(&mp->m_icount, 0, XFS_ICOUNT_BATCH) < 0) {
>  		ASSERT(0);
>  		percpu_counter_add(&mp->m_icount, -delta);

I struggle to see how this is expensive when you have more than
num_online_cpus() * XFS_ICOUNT_BATCH inodes allocated.
__percpu_counter_compare() will always take the fast path so ends up
being very little code at all.

> @@ -1188,6 +1191,9 @@ xfs_mod_ifree(
> 	int64_t		delta)
>  {
>  	percpu_counter_add(&mp->m_ifree, delta);
> +	if (delta > 0)
> +		return 0;
> +
>  	if (percpu_counter_compare(&mp->m_ifree, 0) < 0) {
>  		ASSERT(0);
>  		percpu_counter_add(&mp->m_ifree, -delta);

This one might have some overhead because the count is often at or
around zero, but I haven't noticed it being expensive in kernel
profiles when creating/freeing hundreds of thousands of inodes every
second.

IOWs, we typically measure the overhead of such functions by kernel
profile.
Creating ~200,000 inodes a second, so hammering the icount and ifree
counters, I see:

  0.16%  [kernel]  [k] percpu_counter_add_batch
  0.03%  [kernel]  [k] __percpu_counter_compare

Almost nothing - it's way down the long tail of noise in the profile.

IOWs, the CPU consumed by percpu_counter_compare() is so low that
optimisation isn't going to produce any measurable performance
improvement. Hence it's not really something we've concerned
ourselves about. The profile is pretty much identical for removing
hundreds of thousands of files a second, too, so there really isn't
any performance gain to be had here.

If you want to optimise code to make it faster and show a noticeable
performance improvement, start by running kernel profiles while your
performance critical workload is running. Then look at the functions
and call chains that consume the most CPU and work out how to do them
better. Those are the places where optimisation will result in
measurable performance gains....

Cheers,

Dave.
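For reference, the effect of the proposed `delta > 0` early return can be sketched with a single-threaded userspace model (an assumption-laden sketch, not the kernel code: a plain `long long` stands in for the per-cpu counter, and a simple `< 0` test stands in for `__percpu_counter_compare()`):

```c
#include <assert.h>

/* Userspace model of the patched xfs_mod_icount(): the underflow
 * check is only performed when delta is negative, since adding a
 * positive delta can never drive the counter below zero. */
static long long icount;

static int mod_icount(long long delta)
{
	icount += delta;
	if (delta > 0)
		return 0;	/* proposed early return: skip the compare */
	if (icount < 0) {	/* stands in for __percpu_counter_compare() */
		icount -= delta;	/* undo the change and report underflow */
		return -1;
	}
	return 0;
}
```

The behaviour is unchanged for delta <= 0; only the (claimed expensive) check on the increment side is skipped.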
Hi Christoph,

On 2019/11/4 23:25, Christoph Hellwig wrote:
> On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
>> From: Yang Guo <guoyang2@huawei.com>
>>
>> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
>> whether the counter less than 0 and it is a expensive function.
>> let's check it only when delta < 0, it will be good for xfs's performance.
>
> How much overhead do you see?  In the end the compare is just a debug

Thanks for your reply, and sorry for my unclear description.
__percpu_counter_compare itself is not expensive, but
__percpu_counter_sum, which it calls, is a high load; I will show
the numbers in the next mail.

> check, so if it actually shows up we should remove it entirely.
>

I'm not sure about that, so I check the delta to keep the modification
small.

Thanks,
Shaokun
Hi Dave,

On 2019/11/5 4:49, Dave Chinner wrote:
> On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
>> From: Yang Guo <guoyang2@huawei.com>
>>
>> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
>> whether the counter less than 0 and it is a expensive function.
>> let's check it only when delta < 0, it will be good for xfs's performance.
>
> Hmmm. I don't recall this as being expensive.
>

Sorry about the misleading information in the commit message.

> How did you find this? Can you please always document how you found

If a user creates millions of files and then deletes them, we found that
__percpu_counter_compare cost 5.78% CPU usage. You are right that it
itself is not expensive, but it calls __percpu_counter_sum, which takes
a spin_lock and reads the other CPUs' counts. perf record -g was used to
profile it:

  - 5.88%  0.02%  rm  [kernel.vmlinux]  [k] xfs_mod_ifree
     - 5.86% xfs_mod_ifree
        - 5.78% __percpu_counter_compare
             5.61% __percpu_counter_sum

> the problem being addressed in the commit message so that we don't
> then have to ask how the problem being fixed is reproduced.
>
>> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
>> Signed-off-by: Yang Guo <guoyang2@huawei.com>
>> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
>> ---
>>  fs/xfs/xfs_mount.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
>> index ba5b6f3b2b88..5e8314e6565e 100644
>> --- a/fs/xfs/xfs_mount.c
>> +++ b/fs/xfs/xfs_mount.c
>> @@ -1174,6 +1174,9 @@ xfs_mod_icount(
>> 	int64_t		delta)
>>  {
>>  	percpu_counter_add_batch(&mp->m_icount, delta, XFS_ICOUNT_BATCH);
>> +	if (delta > 0)
>> +		return 0;
>> +
>>  	if (__percpu_counter_compare(&mp->m_icount, 0, XFS_ICOUNT_BATCH) < 0) {
>>  		ASSERT(0);
>>  		percpu_counter_add(&mp->m_icount, -delta);
>
> I struggle to see how this is expensive when you have more than
> num_online_cpus() * XFS_ICOUNT_BATCH inodes allocated.
> __percpu_counter_compare() will always take the fast path so ends up
> being very little code at all.
>
>> @@ -1188,6 +1191,9 @@ xfs_mod_ifree(
>> 	int64_t		delta)
>>  {
>>  	percpu_counter_add(&mp->m_ifree, delta);
>> +	if (delta > 0)
>> +		return 0;
>> +
>>  	if (percpu_counter_compare(&mp->m_ifree, 0) < 0) {
>>  		ASSERT(0);
>>  		percpu_counter_add(&mp->m_ifree, -delta);
>
> This one might have some overhead because the count is often at or
> around zero, but I haven't noticed it being expensive in kernel
> profiles when creating/freeing hundreds of thousands of inodes every
> second.
>
> IOWs, we typically measure the overhead of such functions by kernel
> profile. Creating ~200,000 inodes a second, so hammering the icount
> and ifree counters, I see:
>
>   0.16%  [kernel]  [k] percpu_counter_add_batch
>   0.03%  [kernel]  [k] __percpu_counter_compare
>

0.03% is just __percpu_counter_compare's usage.

> Almost nothing - it's way down the long tail of noise in the
> profile.
>
> IOWs, the CPU consumed by percpu_counter_compare() is so low that
> optimisation isn't going to produce any measurable performance
> improvement. Hence it's not really something we've concerned
> ourselves about. The profile is pretty much identical for removing
> hundreds of thousands of files a second, too, so there really isn't
> any performance gain to be had here.
>
> If you want to optimise code to make it faster and show a noticeable
> performance improvement, start by running kernel profiles while your
> performance critical workload is running. Then look at the functions
> and call chains that consume the most CPU and work out how to do them
> better. Those are the places where optimisation will result in
> measurable performance gains....

Hmm, I did do that, but I didn't describe the problem clearly: with this
patch, the 5.78% (__percpu_counter_compare) disappears. I will follow
your method and reduce unnecessary noise.

Thanks,
Shaokun

>
> Cheers,
>
> Dave.
>
On Tue, Nov 05, 2019 at 11:26:32AM +0800, Shaokun Zhang wrote:
> Hi Dave,
>
> On 2019/11/5 4:49, Dave Chinner wrote:
> > On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
> >> From: Yang Guo <guoyang2@huawei.com>
> >>
> >> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
> >> whether the counter less than 0 and it is a expensive function.
> >> let's check it only when delta < 0, it will be good for xfs's performance.
> >
> > Hmmm. I don't recall this as being expensive.
> >
>
> Sorry about the misleading information in the commit message.
>
> > How did you find this? Can you please always document how you found
>
> If a user creates millions of files and then deletes them, we found that
> __percpu_counter_compare cost 5.78% CPU usage. You are right that it
> itself is not expensive, but it calls __percpu_counter_sum, which takes
> a spin_lock and reads the other CPUs' counts. perf record -g was used to
> profile it:
>
>   - 5.88%  0.02%  rm  [kernel.vmlinux]  [k] xfs_mod_ifree
>      - 5.86% xfs_mod_ifree
>         - 5.78% __percpu_counter_compare
>              5.61% __percpu_counter_sum

Interesting. Your workload is hitting the slow path, which I most
certainly do not see when creating lots of files. What's your
workload?

> > IOWs, we typically measure the overhead of such functions by kernel
> > profile. Creating ~200,000 inodes a second, so hammering the icount
> > and ifree counters, I see:
> >
> >   0.16%  [kernel]  [k] percpu_counter_add_batch
> >   0.03%  [kernel]  [k] __percpu_counter_compare
> >
>
> 0.03% is just __percpu_counter_compare's usage.

No, that's the total of _all_ the percpu counter functions captured
by the profile - it was the list of all samples filtered by
"percpu".
I just re-ran the profile again, and got:

  0.23%  [kernel]  [k] percpu_counter_add_batch
  0.04%  [kernel]  [k] __percpu_counter_compare
  0.00%  [kernel]  [k] collect_percpu_times
  0.00%  [kernel]  [k] __handle_irq_event_percpu
  0.00%  [kernel]  [k] __percpu_counter_sum
  0.00%  [kernel]  [k] handle_irq_event_percpu
  0.00%  [kernel]  [k] fprop_reflect_period_percpu.isra.0
  0.00%  [kernel]  [k] percpu_ref_switch_to_atomic_rcu
  0.00%  [kernel]  [k] free_percpu
  0.00%  [kernel]  [k] percpu_ref_exit

So you can see that there are essentially no samples in
__percpu_counter_sum at all - my tests are not hitting the slow path
at all, despite allocating inodes continuously.

IOWs, your workload is hitting the slow path repeatedly, and so the
question that needs to be answered is "why is the slow path actually
being exercised?". IOWs, we need to know what your workload is, what
the filesystem config is, what hardware (cpus, storage, etc) you are
running on, etc. There must be some reason for the slow path being
used, and that's what we need to understand first before deciding
what the best fix might be...

I suspect that you are only running one or two threads creating files
and you have lots of idle CPU, and hence the inode allocation is not
clearing the fast path batch threshold on the ifree counter. And
because you have lots of CPUs, the cost of a sum is very expensive
compared to running single threaded creates. That's my current
hypothesis, based on what I see on my workloads, where xfs_mod_ifree
overhead goes down as concurrency goes up....
FWIW, the profiles I took came from running this on 16 and 32p
machines:

--
dirs=""
for i in `seq 1 $THREADS`; do
        dirs="$dirs -d /mnt/scratch/$i"
done

cycles=$((512 / $THREADS))

time ./fs_mark $XATTR -D 10000 -S0 -n $NFILES -s 0 -L $cycles $dirs
--

With THREADS=16 or 32 and NFILES=100000 on a big sparse filesystem
image:

meta-data=/dev/vdc               isize=512    agcount=500, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=134217727500, imaxpct=1
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

That's allocating enough inodes to keep the free inode counter
entirely out of the slow path...

Cheers,

Dave.
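The fast/slow path split being discussed can be modelled in userspace C (a sketch, not the kernel implementation: the real `__percpu_counter_compare()` reads an approximate shared count and falls back to `__percpu_counter_sum()`; here the approximate and precise values are simply passed in, and `slow_path_taken` is an instrumentation flag added for illustration):

```c
#include <assert.h>
#include <stdlib.h>

/* Flag recording whether the model fell back to the expensive sum. */
static int slow_path_taken;

static int compare_model(long long approx, long long precise,
			 long long rhs, int batch, int ncpus)
{
	long long err = approx - rhs;

	/* Fast path: if the approximate count is further from rhs than
	 * the maximum possible per-cpu drift (batch * ncpus), the sign
	 * of the comparison is already certain - no precise sum needed. */
	if (llabs(err) > (long long)batch * ncpus)
		return err > 0 ? 1 : -1;

	/* Slow path: take the expensive precise sum. */
	slow_path_taken = 1;
	if (precise > rhs)
		return 1;
	if (precise < rhs)
		return -1;
	return 0;
}
```

With 128 CPUs and a batch of 256, a counter hovering around 999 (as reported later in this thread) never clears the 32768 threshold, so every compare against 0 takes the slow path.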
Hi Dave,

On 2019/11/5 12:03, Dave Chinner wrote:
> On Tue, Nov 05, 2019 at 11:26:32AM +0800, Shaokun Zhang wrote:
>> Hi Dave,
>>
>> On 2019/11/5 4:49, Dave Chinner wrote:
>>> On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
>>>> From: Yang Guo <guoyang2@huawei.com>
>>>>
>>>> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
>>>> whether the counter less than 0 and it is a expensive function.
>>>> let's check it only when delta < 0, it will be good for xfs's performance.
>>>
>>> Hmmm. I don't recall this as being expensive.
>>>
>>
>> Sorry about the misleading information in the commit message.
>>
>>> How did you find this? Can you please always document how you found
>>
>> If a user creates millions of files and then deletes them, we found that
>> __percpu_counter_compare cost 5.78% CPU usage. You are right that it
>> itself is not expensive, but it calls __percpu_counter_sum, which takes
>> a spin_lock and reads the other CPUs' counts. perf record -g was used to
>> profile it:
>>
>>   - 5.88%  0.02%  rm  [kernel.vmlinux]  [k] xfs_mod_ifree
>>      - 5.86% xfs_mod_ifree
>>         - 5.78% __percpu_counter_compare
>>              5.61% __percpu_counter_sum
>
> Interesting. Your workload is hitting the slow path, which I most
> certainly do not see when creating lots of files. What's your
> workload?
>

The hardware has 128 cpu cores, and the xfs filesystem format config is
the default, while the test is single threaded, as follows:

  ./mdtest -I 10 -z 6 -b 8 -d /mnt/ -t -c 2

xfs info:

meta-data=/dev/bcache2           isize=512    agcount=4, agsize=244188661 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1 spinodes=1 rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=976754644, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=476930, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

disk info:

Disk /dev/bcache2: 4000.8 GB, 4000787021824 bytes, 7814037152 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

>>> IOWs, we typically measure the overhead of such functions by kernel
>>> profile. Creating ~200,000 inodes a second, so hammering the icount
>>> and ifree counters, I see:
>>>
>>>   0.16%  [kernel]  [k] percpu_counter_add_batch
>>>   0.03%  [kernel]  [k] __percpu_counter_compare
>>>
>>
>> 0.03% is just __percpu_counter_compare's usage.
>
> No, that's the total of _all_ the percpu counter functions captured
> by the profile - it was the list of all samples filtered by
> "percpu". I just re-ran the profile again, and got:
>
>   0.23%  [kernel]  [k] percpu_counter_add_batch
>   0.04%  [kernel]  [k] __percpu_counter_compare
>   0.00%  [kernel]  [k] collect_percpu_times
>   0.00%  [kernel]  [k] __handle_irq_event_percpu
>   0.00%  [kernel]  [k] __percpu_counter_sum
>   0.00%  [kernel]  [k] handle_irq_event_percpu
>   0.00%  [kernel]  [k] fprop_reflect_period_percpu.isra.0
>   0.00%  [kernel]  [k] percpu_ref_switch_to_atomic_rcu
>   0.00%  [kernel]  [k] free_percpu
>   0.00%  [kernel]  [k] percpu_ref_exit
>
> So you can see that there are essentially no samples in
> __percpu_counter_sum at all - my tests are not hitting the slow path
> at all, despite allocating inodes continuously.
Got it,

> IOWs, your workload is hitting the slow path repeatedly, and so the
> question that needs to be answered is "why is the slow path actually
> being exercised?". IOWs, we need to know what your workload is, what
> the filesystem config is, what hardware (cpus, storage, etc) you are
> running on, etc. There must be some reason for the slow path being
> used, and that's what we need to understand first before deciding
> what the best fix might be...
>
> I suspect that you are only running one or two threads creating

Yeah, we just ran a single-threaded test.

> files and you have lots of idle CPU, and hence the inode allocation
> is not clearing the fast path batch threshold on the ifree counter.
> And because you have lots of CPUs, the cost of a sum is very
> expensive compared to running single threaded creates. That's my
> current hypothesis, based on what I see on my workloads, where
> xfs_mod_ifree overhead goes down as concurrency goes up....
>

Agree. We added some debug info in xfs_mod_ifree and found that most of
the time m_ifree.count < batch * num_online_cpus(), because we have 128
online cpus and m_ifree.count is around 999.
> FWIW, the profiles I took came from running this on 16 and 32p
> machines:
>
> --
> dirs=""
> for i in `seq 1 $THREADS`; do
>         dirs="$dirs -d /mnt/scratch/$i"
> done
>
> cycles=$((512 / $THREADS))
>
> time ./fs_mark $XATTR -D 10000 -S0 -n $NFILES -s 0 -L $cycles $dirs
> --
>
> With THREADS=16 or 32 and NFILES=100000 on a big sparse filesystem
> image:
>
> meta-data=/dev/vdc               isize=512    agcount=500, agsize=268435455 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=0
>          =                       reflink=1
> data     =                       bsize=4096   blocks=134217727500, imaxpct=1
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=521728, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
> That's allocating enough inodes to keep the free inode counter
> entirely out of the slow path...

percpu_counter_read, which reads the count, will incur a cache
synchronization cost if another cpu changes the count; maybe it's
better not to call percpu_counter_compare if possible.

Thanks,
Shaokun

>
> Cheers,
>
> Dave.
>
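The cost asymmetry behind this whole discussion can be sketched in userspace C (a simplified model, not the kernel struct: one shared approximate total plus an array of unflushed per-cpu deltas; the real code involves locking and per-cpu variables):

```c
#include <assert.h>

#define NCPUS 128	/* assumed CPU count matching the machine discussed */

struct counter_model {
	long long shared;		/* approximate total, cheap to read */
	long long pcpu[NCPUS];		/* per-cpu deltas not yet folded in */
};

/* Models percpu_counter_read(): O(1), a single load of the shared value. */
static long long read_fast(const struct counter_model *c)
{
	return c->shared;
}

/* Models __percpu_counter_sum(): O(ncpus), walks every CPU's delta, so it
 * touches one cacheline per CPU - this is the slow path's expense. */
static long long sum_slow(const struct counter_model *c)
{
	long long v = c->shared;
	int i;

	for (i = 0; i < NCPUS; i++)
		v += c->pcpu[i];
	return v;
}
```

The fast read is approximate (it misses the unflushed deltas), which is exactly why a compare close to the threshold must fall back to the precise but expensive sum.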
On Wed, Nov 06, 2019 at 02:00:58PM +0800, Shaokun Zhang wrote:
> Hi Dave,
>
> On 2019/11/5 12:03, Dave Chinner wrote:
> > On Tue, Nov 05, 2019 at 11:26:32AM +0800, Shaokun Zhang wrote:
> >> Hi Dave,
> >>
> >> On 2019/11/5 4:49, Dave Chinner wrote:
> >>> On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
> >>>> From: Yang Guo <guoyang2@huawei.com>
> >>>>
> >>>> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
> >>>> whether the counter less than 0 and it is a expensive function.
> >>>> let's check it only when delta < 0, it will be good for xfs's performance.
> >>>
> >>> Hmmm. I don't recall this as being expensive.
> >>>
> >>
> >> Sorry about the misleading information in the commit message.
> >>
> >>> How did you find this? Can you please always document how you found
> >>
> >> If a user creates millions of files and then deletes them, we found that
> >> __percpu_counter_compare cost 5.78% CPU usage. You are right that it
> >> itself is not expensive, but it calls __percpu_counter_sum, which takes
> >> a spin_lock and reads the other CPUs' counts. perf record -g was used to
> >> profile it:
> >>
> >>   - 5.88%  0.02%  rm  [kernel.vmlinux]  [k] xfs_mod_ifree
> >>      - 5.86% xfs_mod_ifree
> >>         - 5.78% __percpu_counter_compare
> >>              5.61% __percpu_counter_sum
> >
> > Interesting. Your workload is hitting the slow path, which I most
> > certainly do not see when creating lots of files. What's your
> > workload?
> >
>
> The hardware has 128 cpu cores, and the xfs filesystem format config is
> the default, while the test is single threaded, as follows:
>
>   ./mdtest -I 10 -z 6 -b 8 -d /mnt/ -t -c 2

What version and where do I get it?

Hmmm - isn't mdtest an MPI benchmark intended for highly concurrent
metadata workload testing? How representative is it of your actual
production workload? Is that single threaded?
> xfs info:
> meta-data=/dev/bcache2           isize=512    agcount=4, agsize=244188661 blks

Only 4 AGs, which explains the lack of free inodes - there isn't
enough concurrency in the filesystem layout to push the free inode
count in all AGs beyond batchsize * num_online_cpus().

i.e. single threaded workloads typically drain the free inode count
all the way down to zero before new inodes are allocated. Workloads
that are highly concurrent allocate from lots of AGs at once,
leaving free inodes in every AG that is not currently being actively
allocated out of.

As a test, can you remake that test filesystem with "-d agcount=32"
and see if the overhead you are seeing disappears?

> > files and you have lots of idle CPU, and hence the inode allocation
> > is not clearing the fast path batch threshold on the ifree counter.
> > And because you have lots of CPUs, the cost of a sum is very
> > expensive compared to running single threaded creates. That's my
> > current hypothesis, based on what I see on my workloads, where
> > xfs_mod_ifree overhead goes down as concurrency goes up....
> >
>
> Agree. We added some debug info in xfs_mod_ifree and found that most of
> the time m_ifree.count < batch * num_online_cpus(), because we have 128
> online cpus and m_ifree.count is around 999.

Ok, the threshold is 32 * 128 = ~4000 to get out of the slow path.
32 AGs may well push the count over this threshold, so it's
definitely worth trying....

> > FWIW, the profiles I took came from running this on 16 and 32p
> > machines:
> >
> > --
> > dirs=""
> > for i in `seq 1 $THREADS`; do
> >         dirs="$dirs -d /mnt/scratch/$i"
> > done
> >
> > cycles=$((512 / $THREADS))
> >
> > time ./fs_mark $XATTR -D 10000 -S0 -n $NFILES -s 0 -L $cycles $dirs
> > --
> >
> > With THREADS=16 or 32 and NFILES=100000 on a big sparse filesystem
> > image:
> >
> > meta-data=/dev/vdc               isize=512    agcount=500, agsize=268435455 blks
> >          =                       sectsz=512   attr=2, projid32bit=1
> >          =                       crc=1        finobt=1, sparse=1, rmapbt=0
> >          =                       reflink=1
> > data     =                       bsize=4096   blocks=134217727500, imaxpct=1
> >          =                       sunit=0      swidth=0 blks
> > naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> > log      =internal log           bsize=4096   blocks=521728, version=2
> >          =                       sectsz=512   sunit=0 blks, lazy-count=1
> > realtime =none                   extsz=4096   blocks=0, rtextents=0
> >
> > That's allocating enough inodes to keep the free inode counter
> > entirely out of the slow path...
>
> percpu_counter_read, which reads the count, will incur a cache
> synchronization cost if another cpu changes the count; maybe it's
> better not to call percpu_counter_compare if possible.

Depends. Sometimes we trade off ultimate single threaded
performance and efficiency for substantially better scalability.
i.e. if we lose 5% on single threaded performance but gain 10x on
concurrent workloads, then that is a good tradeoff to make.

Cheers,

Dave.
Hi Dave,

On 2019/11/7 5:20, Dave Chinner wrote:
> On Wed, Nov 06, 2019 at 02:00:58PM +0800, Shaokun Zhang wrote:
>> Hi Dave,
>>
>> On 2019/11/5 12:03, Dave Chinner wrote:
>>> On Tue, Nov 05, 2019 at 11:26:32AM +0800, Shaokun Zhang wrote:
>>>> Hi Dave,
>>>>
>>>> On 2019/11/5 4:49, Dave Chinner wrote:
>>>>> On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
>>>>>> From: Yang Guo <guoyang2@huawei.com>
>>>>>>
>>>>>> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
>>>>>> whether the counter less than 0 and it is a expensive function.
>>>>>> let's check it only when delta < 0, it will be good for xfs's performance.
>>>>>
>>>>> Hmmm. I don't recall this as being expensive.
>>>>>
>>>>
>>>> Sorry about the misleading information in the commit message.
>>>>
>>>>> How did you find this? Can you please always document how you found
>>>>
>>>> If a user creates millions of files and then deletes them, we found that
>>>> __percpu_counter_compare cost 5.78% CPU usage. You are right that it
>>>> itself is not expensive, but it calls __percpu_counter_sum, which takes
>>>> a spin_lock and reads the other CPUs' counts. perf record -g was used to
>>>> profile it:
>>>>
>>>>   - 5.88%  0.02%  rm  [kernel.vmlinux]  [k] xfs_mod_ifree
>>>>      - 5.86% xfs_mod_ifree
>>>>         - 5.78% __percpu_counter_compare
>>>>              5.61% __percpu_counter_sum
>>>
>>> Interesting. Your workload is hitting the slow path, which I most
>>> certainly do not see when creating lots of files. What's your
>>> workload?
>>>
>>
>> The hardware has 128 cpu cores, and the xfs filesystem format config is
>> the default, while the test is single threaded, as follows:
>>
>>   ./mdtest -I 10 -z 6 -b 8 -d /mnt/ -t -c 2
>
> What version and where do I get it?

You can get mdtest from github: https://github.com/LLNL/mdtest.

>
> Hmmm - isn't mdtest an MPI benchmark intended for highly concurrent
> metadata workload testing? How representative is it of your actual
> production workload? Is that single threaded?
>

We just use mdtest to test the performance of a file system; it isn't
representative of the actual workload, and it is single threaded. But we
also found that it goes to the slow path when we remove a dir with many
files. The cmd is below: rm -rf xxx.

>> xfs info:
>> meta-data=/dev/bcache2           isize=512    agcount=4, agsize=244188661 blks
>
> Only 4 AGs, which explains the lack of free inodes - there isn't
> enough concurrency in the filesystem layout to push the free inode
> count in all AGs beyond batchsize * num_online_cpus().
>
> i.e. single threaded workloads typically drain the free inode count
> all the way down to zero before new inodes are allocated. Workloads
> that are highly concurrent allocate from lots of AGs at once,
> leaving free inodes in every AG that is not currently being actively
> allocated out of.
>
> As a test, can you remake that test filesystem with "-d agcount=32"
> and see if the overhead you are seeing disappears?
>

We tried remaking the filesystem with "-d agcount=32" and it still
enters the slow path mostly. Printing batch * num_online_cpus() showed
that it is 32768, because percpu_counter_batch is initialized to 256
when there are 128 cpu cores. We then changed to agcount=1024, and it
also goes to the slow path frequently because mostly there are not
32768 free inodes.

>>> files and you have lots of idle CPU, and hence the inode allocation
>>> is not clearing the fast path batch threshold on the ifree counter.
>>> And because you have lots of CPUs, the cost of a sum is very
>>> expensive compared to running single threaded creates. That's my
>>> current hypothesis, based on what I see on my workloads, where
>>> xfs_mod_ifree overhead goes down as concurrency goes up....
>>>
>>
>> Agree. We added some debug info in xfs_mod_ifree and found that most of
>> the time m_ifree.count < batch * num_online_cpus(), because we have 128
>> online cpus and m_ifree.count is around 999.
>
> Ok, the threshold is 32 * 128 = ~4000 to get out of the slow
> path. 32 AGs may well push the count over this threshold, so it's
> definitely worth trying....
>

Yes, we tried it and found that the threshold was actually 32768,
because percpu_counter_batch is initialized to 2 * num_online_cpus().

>>> FWIW, the profiles I took came from running this on 16 and 32p
>>> machines:
>>>
>>> --
>>> dirs=""
>>> for i in `seq 1 $THREADS`; do
>>>         dirs="$dirs -d /mnt/scratch/$i"
>>> done
>>>
>>> cycles=$((512 / $THREADS))
>>>
>>> time ./fs_mark $XATTR -D 10000 -S0 -n $NFILES -s 0 -L $cycles $dirs
>>> --
>>>
>>> With THREADS=16 or 32 and NFILES=100000 on a big sparse filesystem
>>> image:
>>>
>>> meta-data=/dev/vdc               isize=512    agcount=500, agsize=268435455 blks
>>>          =                       sectsz=512   attr=2, projid32bit=1
>>>          =                       crc=1        finobt=1, sparse=1, rmapbt=0
>>>          =                       reflink=1
>>> data     =                       bsize=4096   blocks=134217727500, imaxpct=1
>>>          =                       sunit=0      swidth=0 blks
>>> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
>>> log      =internal log           bsize=4096   blocks=521728, version=2
>>>          =                       sectsz=512   sunit=0 blks, lazy-count=1
>>> realtime =none                   extsz=4096   blocks=0, rtextents=0
>>>
>>> That's allocating enough inodes to keep the free inode counter
>>> entirely out of the slow path...
>>
>> percpu_counter_read, which reads the count, will incur a cache
>> synchronization cost if another cpu changes the count; maybe it's
>> better not to call percpu_counter_compare if possible.
>
> Depends. Sometimes we trade off ultimate single threaded
> performance and efficiency for substantially better scalability.
> i.e. if we lose 5% on single threaded performance but gain 10x on
> concurrent workloads, then that is a good tradeoff to make.
>

Agree. I mean that when delta > 0, there is no need to call
percpu_counter_compare in xfs_mod_ifree/icount.

Thanks,
Shaokun

> Cheers,
>
> Dave.
>
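The numbers traded back and forth above fit together as follows (a sketch assuming the `percpu_counter_batch = max(32, 2 * nr_cpus)` initialisation described in this thread; the helper names are illustrative, not kernel APIs):

```c
#include <assert.h>

/* Model of the batch sizing discussed upthread: the generic batch is
 * initialised to max(32, 2 * nr_cpus), and the slow-path threshold for
 * a percpu_counter_compare() against 0 is batch * num_online_cpus(). */
static int percpu_counter_batch_model(int nr_cpus)
{
	int batch = 2 * nr_cpus;

	return batch > 32 ? batch : 32;
}

static long long slow_path_threshold(int nr_cpus)
{
	return (long long)percpu_counter_batch_model(nr_cpus) * nr_cpus;
}
```

For 128 cores the batch is 256 and the threshold 32768 (not the 32 * 128 = ~4000 estimated earlier with the minimum batch of 32), which is why even 32 or 1024 AGs could not lift an ifree count of ~999 out of the slow path.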
Hi Dave,

With the "-d agcount=32" configuration it still enters the slow path
frequently when there are 128 cpu cores; any thoughts on this issue?
Can we remove the debug check entirely, as Christoph suggested?

Thanks,
Shaokun

On 2019/11/8 13:58, Shaokun Zhang wrote:
> Hi Dave,
>
> On 2019/11/7 5:20, Dave Chinner wrote:
>> On Wed, Nov 06, 2019 at 02:00:58PM +0800, Shaokun Zhang wrote:
>>> Hi Dave,
>>>
>>> On 2019/11/5 12:03, Dave Chinner wrote:
>>>> On Tue, Nov 05, 2019 at 11:26:32AM +0800, Shaokun Zhang wrote:
>>>>> Hi Dave,
>>>>>
>>>>> On 2019/11/5 4:49, Dave Chinner wrote:
>>>>>> On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
>>>>>>> From: Yang Guo <guoyang2@huawei.com>
>>>>>>>
>>>>>>> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
>>>>>>> whether the counter less than 0 and it is a expensive function.
>>>>>>> let's check it only when delta < 0, it will be good for xfs's performance.
>>>>>>
>>>>>> Hmmm. I don't recall this as being expensive.
>>>>>>
>>>>>
>>>>> Sorry about the misleading information in the commit message.
>>>>>
>>>>>> How did you find this? Can you please always document how you found
>>>>>
>>>>> If a user creates millions of files and then deletes them, we found that
>>>>> __percpu_counter_compare cost 5.78% CPU usage. You are right that it
>>>>> itself is not expensive, but it calls __percpu_counter_sum, which takes
>>>>> a spin_lock and reads the other CPUs' counts. perf record -g was used to
>>>>> profile it:
>>>>>
>>>>>   - 5.88%  0.02%  rm  [kernel.vmlinux]  [k] xfs_mod_ifree
>>>>>      - 5.86% xfs_mod_ifree
>>>>>         - 5.78% __percpu_counter_compare
>>>>>              5.61% __percpu_counter_sum
>>>>
>>>> Interesting. Your workload is hitting the slow path, which I most
>>>> certainly do not see when creating lots of files. What's your
>>>> workload?
>>>> >>> >>> The hardware has 128 cpu cores, and the xfs filesystem format config is default, >>> while the test is a single thread, as follow: >>> ./mdtest -I 10 -z 6 -b 8 -d /mnt/ -t -c 2 >> >> What version and where do I get it? > > You can get the mdtest from github: https://github.com/LLNL/mdtest. > >> >> Hmmm - isn't mdtest a MPI benchmark intended for highly concurrent >> metadata workload testing? How representative is it of your actual >> production workload? Is that single threaded? >> > > We just use mdtest to test the performance of a file system, it can't representative > the actual workload and it's single threaded. But we also find that it goes to slow > path when we remove a dir with many files. The cmd is below: > rm -rf xxx. > >>> xfs info: >>> meta-data=/dev/bcache2 isize=512 agcount=4, agsize=244188661 blks >> >> only 4 AGs, which explains the lack of free inodes - there isn't >> enough concurrency in the filesystem layout to push the free inode >> count in all AGs beyond the batchsize * num_online_cpus(). >> >> i.e. single threaded workloads typically drain the free inode count >> all the way down to zero before new inodes are allocated. Workloads >> that are highly concurrent allocate from lots of AGs at once, >> leaving free inodes in every AG that is not current being actively >> allocated out of. >> >> As a test, can you remake that test filesystem with "-d agcount=32" >> and see if the overhead you are seeing disappears? >> > > We try to remake the filesystem with "-d agcount=32" and it also enters slow path > mostly. Print the batch * num_online_cpus() and find that it's 32768. > Because percpu_counter_batch was initialized to 256 when there are 128 cpu cores. > Then we change the agcount=1024, and it also goes to slow path frequently because > mostly there are no 32768 free inodes. > >>>> files and you have lots of idle CPU and hence the inode allocation >>>> is not clearing the fast path batch threshold on the ifree counter. 
>>>> And because you have lots of CPUs, the cost of a sum is very
>>>> expensive compared to running single threaded creates. That's my
>>>> current hypothesis based on what I see on my workloads, where
>>>> xfs_mod_ifree overhead goes down as concurrency goes up....
>>>>
>>>
>>> Agree, we added some debug info in xfs_mod_ifree and found most times
>>> m_ifree.count < batch * num_online_cpus(), because we have 128 online
>>> cpus and m_ifree.count is around 999.
>>
>> Ok, the threshold is 32 * 128 = ~4000 to get out of the slow
>> path. 32 AGs may well push the count over this threshold, so it's
>> definitely worth trying....
>>
>
> Yes, we tried it and found that the threshold was 32768, because
> percpu_counter_batch was initialized to 2 * num_online_cpus().
>
>>>> FWIW, the profiles I took came from running this on 16 and 32p
>>>> machines:
>>>>
>>>> --
>>>> dirs=""
>>>> for i in `seq 1 $THREADS`; do
>>>>         dirs="$dirs -d /mnt/scratch/$i"
>>>> done
>>>>
>>>> cycles=$((512 / $THREADS))
>>>>
>>>> time ./fs_mark $XATTR -D 10000 -S0 -n $NFILES -s 0 -L $cycles $dirs
>>>> --
>>>>
>>>> With THREADS=16 or 32 and NFILES=100000 on a big sparse filesystem
>>>> image:
>>>>
>>>> meta-data=/dev/vdc       isize=512    agcount=500, agsize=268435455 blks
>>>>          =               sectsz=512   attr=2, projid32bit=1
>>>>          =               crc=1        finobt=1, sparse=1, rmapbt=0
>>>>          =               reflink=1
>>>> data     =               bsize=4096   blocks=134217727500, imaxpct=1
>>>>          =               sunit=0      swidth=0 blks
>>>> naming   =version 2      bsize=4096   ascii-ci=0, ftype=1
>>>> log      =internal log   bsize=4096   blocks=521728, version=2
>>>>          =               sectsz=512   sunit=0 blks, lazy-count=1
>>>> realtime =none           extsz=4096   blocks=0, rtextents=0
>>>>
>>>> That's allocating enough inodes to keep the free inode counter
>>>> entirely out of the slow path...
>>>
>>> percpu_counter_read that reads the count will cause cache synchronization
>>> cost if other cpus change the count. Maybe it's better not to call
>>> percpu_counter_compare if possible.
>>
>> Depends.
>> Sometimes we trade off ultimate single threaded
>> performance and efficiency for substantially better scalability.
>> i.e. if we lose 5% on single threaded performance but gain 10x on
>> concurrent workloads, then that is a good tradeoff to make.
>>
>
> Agree, I mean that when delta > 0, there is no need to call
> percpu_counter_compare in xfs_mod_ifree/icount.
>
> Thanks,
> Shaokun
>
>> Cheers,
>>
>> Dave.
>>
On Fri, Nov 08, 2019 at 01:58:56PM +0800, Shaokun Zhang wrote:
> Hi Dave,
>
> On 2019/11/7 5:20, Dave Chinner wrote:
> > On Wed, Nov 06, 2019 at 02:00:58PM +0800, Shaokun Zhang wrote:
> >> Hi Dave,
> >>
> >> On 2019/11/5 12:03, Dave Chinner wrote:
> >>> On Tue, Nov 05, 2019 at 11:26:32AM +0800, Shaokun Zhang wrote:
> >>>> Hi Dave,
> >>>>
> >>>> On 2019/11/5 4:49, Dave Chinner wrote:
> >>>>> On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
> >>>>>> From: Yang Guo <guoyang2@huawei.com>
> >>>>>>
> >>>>>> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
> >>>>>> whether the counter less than 0 and it is a expensive function.
> >>>>>> let's check it only when delta < 0, it will be good for xfs's performance.
> >>>>>
> >>>>> Hmmm. I don't recall this as being expensive.
> >>>>>
> >>>>
> >>>> Sorry about the misleading information in the commit message.
> >>>>
> >>>>> How did you find this? Can you please always document how you found
> >>>>
> >>>> If a user creates millions of files and then deletes them, we found that
> >>>> __percpu_counter_compare cost 5.78% CPU usage. You are right that itself
> >>>> is not expensive, but it calls __percpu_counter_sum which will use
> >>>> spin_lock and read other cpus' counts. perf record -g is used to profile it:
> >>>>
> >>>> - 5.88%  0.02%  rm  [kernel.vmlinux]  [k] xfs_mod_ifree
> >>>>    - 5.86% xfs_mod_ifree
> >>>>       - 5.78% __percpu_counter_compare
> >>>>            5.61% __percpu_counter_sum
> >>>
> >>> Interesting. Your workload is hitting the slow path, which I most
> >>> certainly do not see when creating lots of files. What's your
> >>> workload?
> >>>
> >>
> >> The hardware has 128 CPU cores, and the xfs filesystem format config is
> >> default, while the test is single threaded, as follows:
> >> ./mdtest -I 10 -z 6 -b 8 -d /mnt/ -t -c 2
> >
> > What version and where do I get it?
>
> You can get mdtest from github: https://github.com/LLNL/mdtest.

Oh what a pain.
$ MPI_CC=mpicc make
mpicc -DLinux -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE=1 -D__USE_LARGEFILE64=1 -g -o mdtest mdtest.c -lm
mdtest.c:32:10: fatal error: mpi.h: No such file or directory
   32 | #include "mpi.h"
      |          ^~~~~~~
compilation terminated.
make: *** [Makefile:59: mdtest] Error 1

It doesn't build with a modern linux distro openmpi installation.

/me looks at the readme.

Oh, it's an unmaintained historical archive. Hmmm, it got forked to
https://github.com/MDTEST-LANL/mdtest. That's an unmaintained archive,
too. Oh, there's a version at https://github.com/hpc/ior

Ngggh.

$ ./configure
....
checking for mpicc... mpicc
checking for gcc... (cached) mpicc
checking whether the C compiler works... no
configure: error: in `/home/dave/src/ior':
configure: error: C compiler cannot create executables
See `config.log' for more details
$
$ less config.log
....
configure:3805: checking whether the C compiler works
configure:3827: mpicc conftest.c >&5
/usr/bin/ld: cannot find -lopen-rte
/usr/bin/ld: cannot find -lopen-pal
/usr/bin/ld: cannot find -lhwloc
/usr/bin/ld: cannot find -levent
/usr/bin/ld: cannot find -levent_pthreads
collect2: error: ld returned 1 exit status
configure:3831: $? = 1
.....

So, the mpicc compiler wrapper uses a bunch of libraries that don't get
installed with the compiler wrapper. Yay?

> > Hmmm - isn't mdtest a MPI benchmark intended for highly concurrent
> > metadata workload testing? How representative is it of your actual
> > production workload? Is that single threaded?
> >
>
> We just use mdtest to test the performance of a file system; it can't
> represent the actual workload and it's single threaded. But we also find
> that it goes to the slow path when we remove a dir with many files. The
> cmd is below:
> rm -rf xxx

The benchmark doesn't reproduce that.
It will only occur when you do sequential inode removal so you have no
free inodes in the filesystem and every set of 64 inodes that is removed
lands in the same chunk and so is immediately freed, leaving zero free
inodes in the filesystem. i.e. this is a result of sequential inode
create, which typically implies "empty filesystem".

Aged filesystems typically don't behave like this - they have free
inodes distributed all through the inode chunks on disk. And if you do
random removal, you won't see it for the same reason, either - the
remove phase of mdtest doesn't show any CPU usage in
percpu_counter_sum() at all.

Please understand that I'm not saying that it isn't a problem, that it
doesn't happen, or that we don't need to change the behaviour. What I'm
trying to do is understand the conditions in which this problem occurs
and whether they are real world or just micro-benchmark related
problems...

> Because percpu_counter_batch was initialized to 256 when there are 128 cpu cores.
> Then we change the agcount=1024, and it also goes to slow path frequently because
> mostly there are no 32768 free inodes.

Hmmm. I had forgotten the batch size increased with CPU count like
that - I had the thought it was log(ncpus), not linear(ncpus).

That kinda points out that scaling the batch size by CPU count is
somewhat silly, as concurrency on the counter is not defined by the
number of CPUs in the system. Inode counter concurrency is defined by
the number of inode allocation/free operations that can be performed
at any given time. As such, having a counter threshold of 256 *
num_cpus doesn't make a whole lot of sense for a 4 AG filesystem
because you can't have 128 CPUs all banging on it at once.

And even if you have hundreds of AGs, the CPUs are going to be
somewhat serialised through the transaction commit path by the CIL
locks that are taken to insert objects into the CIL.
Hence the unbound concurrency of transaction commit is going to be
soaked up by the CIL list insert spin lock, and it's mostly going to
be just a dribble of CPUs at a time through the transaction commit
accounting code.

Ok, so maybe we just need a small batch size here, like a value of 4
or 8, just to avoid every inode alloc/free transaction having to pick
up a global spin lock every time...

Let me do some testing here...

> >> percpu_counter_read that reads the count will cause cache synchronization
> >> cost if other cpus change the count. Maybe it's better not to call
> >> percpu_counter_compare if possible.
> >
> > Depends. Sometimes we trade off ultimate single threaded
> > performance and efficiency for substantially better scalability.
> > i.e. if we lose 5% on single threaded performance but gain 10x on
> > concurrent workloads, then that is a good tradeoff to make.
> >
>
> Agree, I mean that when delta > 0, there is no need to call
> percpu_counter_compare in xfs_mod_ifree/icount.

It also doesn't totally avoid the issue, either, because of the way we
dynamically alloc/free inode clusters, and so the counters go both up
and down during continuous allocation or continuous freeing.

i.e. The pattern we see on inode allocation is:

	icount += 64
	ifree += 64
	ifree--
	....
	ifree--
	icount += 64
	ifree += 64
	ifree--
	....
	ifree--

And on inode freeing, we see the opposite:

	ifree++
	....
	ifree++
	icount -= 64
	ifree -= 64
	ifree++
	....
	ifree++
	icount -= 64
	ifree -= 64

So the values of ifree will still decrement during both free and
allocation operations, hence it doesn't avoid the issue totally.

Cheers,

Dave.
On Mon, Nov 18, 2019 at 07:12:12PM +1100, Dave Chinner wrote:
> On Fri, Nov 08, 2019 at 01:58:56PM +0800, Shaokun Zhang wrote:
> > On 2019/11/7 5:20, Dave Chinner wrote:
> > Because percpu_counter_batch was initialized to 256 when there are 128 cpu cores.
> > Then we change the agcount=1024, and it also goes to slow path frequently because
> > mostly there are no 32768 free inodes.
>
> Hmmm. I had forgotten the batch size increased with CPU count like
> that - I had the thought it was log(ncpus), not linear(ncpus).

[.....]

> Ok, so maybe we just need a small batch size here, like a value of 4
> or 8, just to avoid every inode alloc/free transaction having to pick
> up a global spin lock every time...
>
> Let me do some testing here...

[....]

> > Agree, I mean that when delta > 0, there is no need to call
> > percpu_counter_compare in xfs_mod_ifree/icount.

Ok, so the percpu_counter_sum() only shows up during a create workload
here, at ~1.5% of the CPU used. Only doing the check when delta < 0
makes no difference to that value. CPU usage of percpu_counter_sum is
noise for all other parts of the workload, including the "remove
files" part of the benchmark. Hence it is this pattern:

> i.e. The pattern we see on inode allocation is:
>
> 	icount += 64
> 	ifree += 64
> 	ifree--
> 	....
> 	ifree--
> 	icount += 64
> 	ifree += 64
> 	ifree--
> 	....
> 	ifree--

That is causing the compare CPU usage - all the single decrements are
triggering it because we are operating on a new filesystem that has no
free inodes other than the cluster we just allocated. Hence avoiding
the value compare when delta > 0 makes no difference to CPU usage
because most of the modifications are subtractions.

> And on inode freeing, we see the opposite:
>
> 	ifree++
> 	....
> 	ifree++
> 	icount -= 64
> 	ifree -= 64

For freeing, we aren't always freeing from the same inode cluster, so
the ifree count actually goes up quite a bit before it starts going
down as we free clusters. Hence if we keep the batch size small here,
we should mostly stay out of the compare path, and the "no compare
when delta > 0" change will also make a substantial difference here if
the ifree count is low.

So, I reduced the batch size to 8, and CPU usage during creates
dropped from 1.5% to 0.6% on a 16p machine. That's a substantial
reduction - if this translates to larger machines, that should bring
CPU usage down under 2%....

I was going to send this patch for you to test, but I left this email
sitting unsent overnight and I thought about it some more. The reality
is, as Christoph said, this ends up being just debug code because we
silently ignore the underflow error on production kernels. Hence I
think the right thing to do is gut the transaction superblock
accounting code so that we need only simple asserts and don't bother
trying to behave like it's an error we can actually handle sanely.

This removes the compare from production kernels completely, so you
should see this all go away with the patch below. The difference in
performance on my 16p machine is not significant - it is within the
normal run-to-run variability of the mdtest benchmark, but the counter
compare overhead is gone from the profiles.

Cheers,

Dave.
> +static void
>  xfs_sb_mod8(
>  	uint8_t		*field,
>  	int8_t		delta)
>  {
>  	int8_t		counter = *field;
>
> +	if (!delta)
> +		return;
>  	counter += delta;
> +	ASSERT(counter >= 0);
>  	*field = counter;
>  }

I'd almost find it easier to keep the != 0 check in the caller, and in
fact also move the assert there and just kill these helpers, e.g.

	if (tp->t_imaxpct_delta != 0) {
		mp->m_sb.sb_imax_pct += tp->t_imaxpct_delta;
		ASSERT(mp->m_sb.sb_imax_pct >= 0);
	}
	if (tp->t_rextsize_delta != 0) {
		mp->m_sb.sb_rextsize += tp->t_rextsize_delta;
		ASSERT(mp->m_sb.sb_rextsize >= 0);
	}

While this is 2 more lines per counter it is a lot more explicit and
easier to follow.

> 	if (idelta) {
> -		error = xfs_mod_icount(mp, idelta);
> -		if (error)
> -			goto out_undo_fdblocks;
> +		percpu_counter_add_batch(&mp->m_icount, idelta,
> +					 XFS_ICOUNT_BATCH);
> +		if (idelta < 0)
> +			ASSERT(__percpu_counter_compare(&mp->m_icount, 0,
> +						XFS_ICOUNT_BATCH) >= 0)

And here I wonder if keeping single use helpers wouldn't have been a
little nicer. Not that it really matters.
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index ba5b6f3b2b88..5e8314e6565e 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1174,6 +1174,9 @@ xfs_mod_icount(
 	int64_t		delta)
 {
 	percpu_counter_add_batch(&mp->m_icount, delta, XFS_ICOUNT_BATCH);
+	if (delta > 0)
+		return 0;
+
 	if (__percpu_counter_compare(&mp->m_icount, 0, XFS_ICOUNT_BATCH) < 0) {
 		ASSERT(0);
 		percpu_counter_add(&mp->m_icount, -delta);
@@ -1188,6 +1191,9 @@ xfs_mod_ifree(
 	int64_t		delta)
 {
 	percpu_counter_add(&mp->m_ifree, delta);
+	if (delta > 0)
+		return 0;
+
 	if (percpu_counter_compare(&mp->m_ifree, 0) < 0) {
 		ASSERT(0);
 		percpu_counter_add(&mp->m_ifree, -delta);