Message ID: 1572866980-13001-1-git-send-email-zhangshaokun@hisilicon.com (mailing list archive)
State: New, archived
Series: xfs: optimise xfs_mod_icount/ifree when delta < 0
On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
> From: Yang Guo <guoyang2@huawei.com>
>
> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
> whether the counter less than 0 and it is a expensive function.
> let's check it only when delta < 0, it will be good for xfs's performance.

How much overhead do you see?  In the end the compare is just a debug
check, so if it actually shows up we should remove it entirely.
On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
> From: Yang Guo <guoyang2@huawei.com>
>
> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
> whether the counter less than 0 and it is a expensive function.
> let's check it only when delta < 0, it will be good for xfs's performance.

Hmmm. I don't recall this as being expensive.

How did you find this? Can you please always document how you found
the problem being addressed in the commit message so that we don't
then have to ask how the problem being fixed is reproduced.

> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
> Signed-off-by: Yang Guo <guoyang2@huawei.com>
> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
> ---
>  fs/xfs/xfs_mount.c | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index ba5b6f3b2b88..5e8314e6565e 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -1174,6 +1174,9 @@ xfs_mod_icount(
> 	int64_t		delta)
>  {
>  	percpu_counter_add_batch(&mp->m_icount, delta, XFS_ICOUNT_BATCH);
> +	if (delta > 0)
> +		return 0;
> +
>  	if (__percpu_counter_compare(&mp->m_icount, 0, XFS_ICOUNT_BATCH) < 0) {
>  		ASSERT(0);
>  		percpu_counter_add(&mp->m_icount, -delta);

I struggle to see how this is expensive when you have more than
num_online_cpus() * XFS_ICOUNT_BATCH inodes allocated.
__percpu_counter_compare() will always take the fast path so ends up
being very little code at all.

> @@ -1188,6 +1191,9 @@ xfs_mod_ifree(
> 	int64_t		delta)
>  {
>  	percpu_counter_add(&mp->m_ifree, delta);
> +	if (delta > 0)
> +		return 0;
> +
>  	if (percpu_counter_compare(&mp->m_ifree, 0) < 0) {
>  		ASSERT(0);
>  		percpu_counter_add(&mp->m_ifree, -delta);

This one might have some overhead because the count is often at or
around zero, but I haven't noticed it being expensive in kernel
profiles when creating/freeing hundreds of thousands of inodes every
second.

IOWs, we typically measure the overhead of such functions by kernel
profile.
Creating ~200,000 inodes a second, so hammering the icount and ifree
counters, I see:

  0.16%  [kernel]  [k] percpu_counter_add_batch
  0.03%  [kernel]  [k] __percpu_counter_compare

Almost nothing - it's way down the long tail of noise in the profile.

IOWs, the CPU consumed by percpu_counter_compare() is so low that
optimisation isn't going to produce any measurable performance
improvement. Hence it's not really something we've concerned
ourselves about. The profile is pretty much identical for removing
hundreds of thousands of files a second, too, so there really isn't
any performance gain to be had here.

If you want to optimise code to make it faster and show a noticeable
performance improvement, start by running kernel profiles while your
performance critical workload is running. Then look at the functions
and call chains that consume the most CPU and work out how to do them
better. Those are the places where optimisation will result in
measurable performance gains....

Cheers,

Dave.
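For reference, the effect of the proposed `delta > 0` early return can be sketched with a single-threaded userspace model (an assumption-laden sketch, not the kernel code: a plain `long long` stands in for the per-cpu counter, and a simple `< 0` test stands in for `__percpu_counter_compare()`):

```c
#include <assert.h>

/* Userspace model of the patched xfs_mod_icount(): the underflow
 * check is only performed when delta is negative, since adding a
 * positive delta can never drive the counter below zero. */
static long long icount;

static int mod_icount(long long delta)
{
	icount += delta;
	if (delta > 0)
		return 0;	/* proposed early return: skip the compare */
	if (icount < 0) {	/* stands in for __percpu_counter_compare() */
		icount -= delta;	/* undo the change and report underflow */
		return -1;
	}
	return 0;
}
```

The behaviour is unchanged for delta <= 0; only the (claimed expensive) check on the increment side is skipped.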
Hi Christoph,

On 2019/11/4 23:25, Christoph Hellwig wrote:
> On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
>> From: Yang Guo <guoyang2@huawei.com>
>>
>> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
>> whether the counter less than 0 and it is a expensive function.
>> let's check it only when delta < 0, it will be good for xfs's performance.
>
> How much overhead do you see?  In the end the compare is just a debug

Thanks for your reply, and sorry for my unclear description.
__percpu_counter_compare itself is not expensive, but
__percpu_counter_sum, which it calls, is a high load; I will show
the numbers in the next mail.

> check, so if it actually shows up we should remove it entirely.
>

I'm not sure about that, so I check the delta to keep the modification
small.

Thanks,
Shaokun
Hi Dave,

On 2019/11/5 4:49, Dave Chinner wrote:
> On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
>> From: Yang Guo <guoyang2@huawei.com>
>>
>> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
>> whether the counter less than 0 and it is a expensive function.
>> let's check it only when delta < 0, it will be good for xfs's performance.
>
> Hmmm. I don't recall this as being expensive.
>

Sorry about the misleading information in the commit message.

> How did you find this? Can you please always document how you found

If a user creates millions of files and then deletes them, we found that
__percpu_counter_compare cost 5.78% CPU usage. You are right that it
itself is not expensive, but it calls __percpu_counter_sum, which takes
a spin_lock and reads the other CPUs' counts. perf record -g was used to
profile it:

  - 5.88%  0.02%  rm  [kernel.vmlinux]  [k] xfs_mod_ifree
     - 5.86% xfs_mod_ifree
        - 5.78% __percpu_counter_compare
             5.61% __percpu_counter_sum

> the problem being addressed in the commit message so that we don't
> then have to ask how the problem being fixed is reproduced.
>
>> Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
>> Signed-off-by: Yang Guo <guoyang2@huawei.com>
>> Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
>> ---
>>  fs/xfs/xfs_mount.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
>> index ba5b6f3b2b88..5e8314e6565e 100644
>> --- a/fs/xfs/xfs_mount.c
>> +++ b/fs/xfs/xfs_mount.c
>> @@ -1174,6 +1174,9 @@ xfs_mod_icount(
>> 	int64_t		delta)
>>  {
>>  	percpu_counter_add_batch(&mp->m_icount, delta, XFS_ICOUNT_BATCH);
>> +	if (delta > 0)
>> +		return 0;
>> +
>>  	if (__percpu_counter_compare(&mp->m_icount, 0, XFS_ICOUNT_BATCH) < 0) {
>>  		ASSERT(0);
>>  		percpu_counter_add(&mp->m_icount, -delta);
>
> I struggle to see how this is expensive when you have more than
> num_online_cpus() * XFS_ICOUNT_BATCH inodes allocated.
> __percpu_counter_compare() will always take the fast path so ends up
> being very little code at all.
>
>> @@ -1188,6 +1191,9 @@ xfs_mod_ifree(
>> 	int64_t		delta)
>>  {
>>  	percpu_counter_add(&mp->m_ifree, delta);
>> +	if (delta > 0)
>> +		return 0;
>> +
>>  	if (percpu_counter_compare(&mp->m_ifree, 0) < 0) {
>>  		ASSERT(0);
>>  		percpu_counter_add(&mp->m_ifree, -delta);
>
> This one might have some overhead because the count is often at or
> around zero, but I haven't noticed it being expensive in kernel
> profiles when creating/freeing hundreds of thousands of inodes every
> second.
>
> IOWs, we typically measure the overhead of such functions by kernel
> profile. Creating ~200,000 inodes a second, so hammering the icount
> and ifree counters, I see:
>
>   0.16%  [kernel]  [k] percpu_counter_add_batch
>   0.03%  [kernel]  [k] __percpu_counter_compare
>

0.03% is just __percpu_counter_compare's usage.

> Almost nothing - it's way down the long tail of noise in the
> profile.
>
> IOWs, the CPU consumed by percpu_counter_compare() is so low that
> optimisation isn't going to produce any measurable performance
> improvement. Hence it's not really something we've concerned
> ourselves about. The profile is pretty much identical for removing
> hundreds of thousands of files a second, too, so there really isn't
> any performance gain to be had here.
>
> If you want to optimise code to make it faster and show a noticeable
> performance improvement, start by running kernel profiles while your
> performance critical workload is running. Then look at the functions
> and call chains that consume the most CPU and work out how to do them
> better. Those are the places where optimisation will result in
> measurable performance gains....

Hmm, I did do that, but I didn't describe the problem clearly: with this
patch, the 5.78% (__percpu_counter_compare) disappears. I will follow
your method and reduce unnecessary noise.

Thanks,
Shaokun

>
> Cheers,
>
> Dave.
>
On Tue, Nov 05, 2019 at 11:26:32AM +0800, Shaokun Zhang wrote:
> Hi Dave,
>
> On 2019/11/5 4:49, Dave Chinner wrote:
> > On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
> >> From: Yang Guo <guoyang2@huawei.com>
> >>
> >> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
> >> whether the counter less than 0 and it is a expensive function.
> >> let's check it only when delta < 0, it will be good for xfs's performance.
> >
> > Hmmm. I don't recall this as being expensive.
> >
>
> Sorry about the misleading information in the commit message.
>
> > How did you find this? Can you please always document how you found
>
> If a user creates millions of files and then deletes them, we found that
> __percpu_counter_compare cost 5.78% CPU usage. You are right that it
> itself is not expensive, but it calls __percpu_counter_sum, which takes
> a spin_lock and reads the other CPUs' counts. perf record -g was used to
> profile it:
>
>   - 5.88%  0.02%  rm  [kernel.vmlinux]  [k] xfs_mod_ifree
>      - 5.86% xfs_mod_ifree
>         - 5.78% __percpu_counter_compare
>              5.61% __percpu_counter_sum

Interesting. Your workload is hitting the slow path, which I most
certainly do not see when creating lots of files. What's your
workload?

> > IOWs, we typically measure the overhead of such functions by kernel
> > profile. Creating ~200,000 inodes a second, so hammering the icount
> > and ifree counters, I see:
> >
> >   0.16%  [kernel]  [k] percpu_counter_add_batch
> >   0.03%  [kernel]  [k] __percpu_counter_compare
> >
>
> 0.03% is just __percpu_counter_compare's usage.

No, that's the total of _all_ the percpu counter functions captured
by the profile - it was the list of all samples filtered by
"percpu".
I just re-ran the profile again, and got:

  0.23%  [kernel]  [k] percpu_counter_add_batch
  0.04%  [kernel]  [k] __percpu_counter_compare
  0.00%  [kernel]  [k] collect_percpu_times
  0.00%  [kernel]  [k] __handle_irq_event_percpu
  0.00%  [kernel]  [k] __percpu_counter_sum
  0.00%  [kernel]  [k] handle_irq_event_percpu
  0.00%  [kernel]  [k] fprop_reflect_period_percpu.isra.0
  0.00%  [kernel]  [k] percpu_ref_switch_to_atomic_rcu
  0.00%  [kernel]  [k] free_percpu
  0.00%  [kernel]  [k] percpu_ref_exit

So you can see that there are essentially no samples in
__percpu_counter_sum at all - my tests are not hitting the slow path
at all, despite allocating inodes continuously.

IOWs, your workload is hitting the slow path repeatedly, and so the
question that needs to be answered is "why is the slow path actually
being exercised?". IOWs, we need to know what your workload is, what
the filesystem config is, what hardware (cpus, storage, etc) you are
running on, etc. There must be some reason for the slow path being
used, and that's what we need to understand first before deciding
what the best fix might be...

I suspect that you are only running one or two threads creating files
and you have lots of idle CPU, and hence the inode allocation is not
clearing the fast path batch threshold on the ifree counter. And
because you have lots of CPUs, the cost of a sum is very expensive
compared to running single threaded creates. That's my current
hypothesis, based on what I see on my workloads, where xfs_mod_ifree
overhead goes down as concurrency goes up....
FWIW, the profiles I took came from running this on 16 and 32p
machines:

--
dirs=""
for i in `seq 1 $THREADS`; do
        dirs="$dirs -d /mnt/scratch/$i"
done

cycles=$((512 / $THREADS))

time ./fs_mark $XATTR -D 10000 -S0 -n $NFILES -s 0 -L $cycles $dirs
--

With THREADS=16 or 32 and NFILES=100000 on a big sparse filesystem
image:

meta-data=/dev/vdc               isize=512    agcount=500, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=134217727500, imaxpct=1
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

That's allocating enough inodes to keep the free inode counter
entirely out of the slow path...

Cheers,

Dave.
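The fast/slow path split being discussed can be modelled in userspace C (a sketch, not the kernel implementation: the real `__percpu_counter_compare()` reads an approximate shared count and falls back to `__percpu_counter_sum()`; here the approximate and precise values are simply passed in, and `slow_path_taken` is an instrumentation flag added for illustration):

```c
#include <assert.h>
#include <stdlib.h>

/* Flag recording whether the model fell back to the expensive sum. */
static int slow_path_taken;

static int compare_model(long long approx, long long precise,
			 long long rhs, int batch, int ncpus)
{
	long long err = approx - rhs;

	/* Fast path: if the approximate count is further from rhs than
	 * the maximum possible per-cpu drift (batch * ncpus), the sign
	 * of the comparison is already certain - no precise sum needed. */
	if (llabs(err) > (long long)batch * ncpus)
		return err > 0 ? 1 : -1;

	/* Slow path: take the expensive precise sum. */
	slow_path_taken = 1;
	if (precise > rhs)
		return 1;
	if (precise < rhs)
		return -1;
	return 0;
}
```

With 128 CPUs and a batch of 256, a counter hovering around 999 (as reported later in this thread) never clears the 32768 threshold, so every compare against 0 takes the slow path.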
Hi Dave,

On 2019/11/5 12:03, Dave Chinner wrote:
> On Tue, Nov 05, 2019 at 11:26:32AM +0800, Shaokun Zhang wrote:
>> Hi Dave,
>>
>> On 2019/11/5 4:49, Dave Chinner wrote:
>>> On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
>>>> From: Yang Guo <guoyang2@huawei.com>
>>>>
>>>> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
>>>> whether the counter less than 0 and it is a expensive function.
>>>> let's check it only when delta < 0, it will be good for xfs's performance.
>>>
>>> Hmmm. I don't recall this as being expensive.
>>>
>>
>> Sorry about the misleading information in the commit message.
>>
>>> How did you find this? Can you please always document how you found
>>
>> If a user creates millions of files and then deletes them, we found that
>> __percpu_counter_compare cost 5.78% CPU usage. You are right that it
>> itself is not expensive, but it calls __percpu_counter_sum, which takes
>> a spin_lock and reads the other CPUs' counts. perf record -g was used to
>> profile it:
>>
>>   - 5.88%  0.02%  rm  [kernel.vmlinux]  [k] xfs_mod_ifree
>>      - 5.86% xfs_mod_ifree
>>         - 5.78% __percpu_counter_compare
>>              5.61% __percpu_counter_sum
>
> Interesting. Your workload is hitting the slow path, which I most
> certainly do not see when creating lots of files. What's your
> workload?
>

The hardware has 128 cpu cores, and the xfs filesystem format config is
the default, while the test is single threaded, as follows:

  ./mdtest -I 10 -z 6 -b 8 -d /mnt/ -t -c 2

xfs info:

meta-data=/dev/bcache2           isize=512    agcount=4, agsize=244188661 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1 spinodes=1 rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=976754644, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=476930, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

disk info:

Disk /dev/bcache2: 4000.8 GB, 4000787021824 bytes, 7814037152 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

>>> IOWs, we typically measure the overhead of such functions by kernel
>>> profile. Creating ~200,000 inodes a second, so hammering the icount
>>> and ifree counters, I see:
>>>
>>>   0.16%  [kernel]  [k] percpu_counter_add_batch
>>>   0.03%  [kernel]  [k] __percpu_counter_compare
>>>
>>
>> 0.03% is just __percpu_counter_compare's usage.
>
> No, that's the total of _all_ the percpu counter functions captured
> by the profile - it was the list of all samples filtered by
> "percpu". I just re-ran the profile again, and got:
>
>   0.23%  [kernel]  [k] percpu_counter_add_batch
>   0.04%  [kernel]  [k] __percpu_counter_compare
>   0.00%  [kernel]  [k] collect_percpu_times
>   0.00%  [kernel]  [k] __handle_irq_event_percpu
>   0.00%  [kernel]  [k] __percpu_counter_sum
>   0.00%  [kernel]  [k] handle_irq_event_percpu
>   0.00%  [kernel]  [k] fprop_reflect_period_percpu.isra.0
>   0.00%  [kernel]  [k] percpu_ref_switch_to_atomic_rcu
>   0.00%  [kernel]  [k] free_percpu
>   0.00%  [kernel]  [k] percpu_ref_exit
>
> So you can see that there are essentially no samples in
> __percpu_counter_sum at all - my tests are not hitting the slow path
> at all, despite allocating inodes continuously.
Got it,

> IOWs, your workload is hitting the slow path repeatedly, and so the
> question that needs to be answered is "why is the slow path actually
> being exercised?". IOWs, we need to know what your workload is, what
> the filesystem config is, what hardware (cpus, storage, etc) you are
> running on, etc. There must be some reason for the slow path being
> used, and that's what we need to understand first before deciding
> what the best fix might be...
>
> I suspect that you are only running one or two threads creating

Yeah, we just ran a single-threaded test.

> files and you have lots of idle CPU, and hence the inode allocation
> is not clearing the fast path batch threshold on the ifree counter.
> And because you have lots of CPUs, the cost of a sum is very
> expensive compared to running single threaded creates. That's my
> current hypothesis, based on what I see on my workloads, where
> xfs_mod_ifree overhead goes down as concurrency goes up....
>

Agree. We added some debug info in xfs_mod_ifree and found that most of
the time m_ifree.count < batch * num_online_cpus(), because we have 128
online cpus and m_ifree.count is around 999.
> FWIW, the profiles I took came from running this on 16 and 32p
> machines:
>
> --
> dirs=""
> for i in `seq 1 $THREADS`; do
>         dirs="$dirs -d /mnt/scratch/$i"
> done
>
> cycles=$((512 / $THREADS))
>
> time ./fs_mark $XATTR -D 10000 -S0 -n $NFILES -s 0 -L $cycles $dirs
> --
>
> With THREADS=16 or 32 and NFILES=100000 on a big sparse filesystem
> image:
>
> meta-data=/dev/vdc               isize=512    agcount=500, agsize=268435455 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=0
>          =                       reflink=1
> data     =                       bsize=4096   blocks=134217727500, imaxpct=1
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=521728, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
> That's allocating enough inodes to keep the free inode counter
> entirely out of the slow path...

percpu_counter_read, which reads the count, will incur a cache
synchronization cost if another cpu changes the count; maybe it's
better not to call percpu_counter_compare if possible.

Thanks,
Shaokun

>
> Cheers,
>
> Dave.
>
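The cost asymmetry behind this whole discussion can be sketched in userspace C (a simplified model, not the kernel struct: one shared approximate total plus an array of unflushed per-cpu deltas; the real code involves locking and per-cpu variables):

```c
#include <assert.h>

#define NCPUS 128	/* assumed CPU count matching the machine discussed */

struct counter_model {
	long long shared;		/* approximate total, cheap to read */
	long long pcpu[NCPUS];		/* per-cpu deltas not yet folded in */
};

/* Models percpu_counter_read(): O(1), a single load of the shared value. */
static long long read_fast(const struct counter_model *c)
{
	return c->shared;
}

/* Models __percpu_counter_sum(): O(ncpus), walks every CPU's delta, so it
 * touches one cacheline per CPU - this is the slow path's expense. */
static long long sum_slow(const struct counter_model *c)
{
	long long v = c->shared;
	int i;

	for (i = 0; i < NCPUS; i++)
		v += c->pcpu[i];
	return v;
}
```

The fast read is approximate (it misses the unflushed deltas), which is exactly why a compare close to the threshold must fall back to the precise but expensive sum.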
On Wed, Nov 06, 2019 at 02:00:58PM +0800, Shaokun Zhang wrote:
> Hi Dave,
>
> On 2019/11/5 12:03, Dave Chinner wrote:
> > On Tue, Nov 05, 2019 at 11:26:32AM +0800, Shaokun Zhang wrote:
> >> Hi Dave,
> >>
> >> On 2019/11/5 4:49, Dave Chinner wrote:
> >>> On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
> >>>> From: Yang Guo <guoyang2@huawei.com>
> >>>>
> >>>> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
> >>>> whether the counter less than 0 and it is a expensive function.
> >>>> let's check it only when delta < 0, it will be good for xfs's performance.
> >>>
> >>> Hmmm. I don't recall this as being expensive.
> >>>
> >>
> >> Sorry about the misleading information in the commit message.
> >>
> >>> How did you find this? Can you please always document how you found
> >>
> >> If a user creates millions of files and then deletes them, we found that
> >> __percpu_counter_compare cost 5.78% CPU usage. You are right that it
> >> itself is not expensive, but it calls __percpu_counter_sum, which takes
> >> a spin_lock and reads the other CPUs' counts. perf record -g was used to
> >> profile it:
> >>
> >>   - 5.88%  0.02%  rm  [kernel.vmlinux]  [k] xfs_mod_ifree
> >>      - 5.86% xfs_mod_ifree
> >>         - 5.78% __percpu_counter_compare
> >>              5.61% __percpu_counter_sum
> >
> > Interesting. Your workload is hitting the slow path, which I most
> > certainly do not see when creating lots of files. What's your
> > workload?
> >
>
> The hardware has 128 cpu cores, and the xfs filesystem format config is
> the default, while the test is single threaded, as follows:
>
>   ./mdtest -I 10 -z 6 -b 8 -d /mnt/ -t -c 2

What version and where do I get it?

Hmmm - isn't mdtest an MPI benchmark intended for highly concurrent
metadata workload testing? How representative is it of your actual
production workload? Is that single threaded?
> xfs info:
> meta-data=/dev/bcache2           isize=512    agcount=4, agsize=244188661 blks

Only 4 AGs, which explains the lack of free inodes - there isn't
enough concurrency in the filesystem layout to push the free inode
count in all AGs beyond batchsize * num_online_cpus().

i.e. single threaded workloads typically drain the free inode count
all the way down to zero before new inodes are allocated. Workloads
that are highly concurrent allocate from lots of AGs at once,
leaving free inodes in every AG that is not currently being actively
allocated out of.

As a test, can you remake that test filesystem with "-d agcount=32"
and see if the overhead you are seeing disappears?

> > files and you have lots of idle CPU, and hence the inode allocation
> > is not clearing the fast path batch threshold on the ifree counter.
> > And because you have lots of CPUs, the cost of a sum is very
> > expensive compared to running single threaded creates. That's my
> > current hypothesis, based on what I see on my workloads, where
> > xfs_mod_ifree overhead goes down as concurrency goes up....
> >
>
> Agree. We added some debug info in xfs_mod_ifree and found that most of
> the time m_ifree.count < batch * num_online_cpus(), because we have 128
> online cpus and m_ifree.count is around 999.

Ok, the threshold is 32 * 128 = ~4000 to get out of the slow path.
32 AGs may well push the count over this threshold, so it's
definitely worth trying....

> > FWIW, the profiles I took came from running this on 16 and 32p
> > machines:
> >
> > --
> > dirs=""
> > for i in `seq 1 $THREADS`; do
> >         dirs="$dirs -d /mnt/scratch/$i"
> > done
> >
> > cycles=$((512 / $THREADS))
> >
> > time ./fs_mark $XATTR -D 10000 -S0 -n $NFILES -s 0 -L $cycles $dirs
> > --
> >
> > With THREADS=16 or 32 and NFILES=100000 on a big sparse filesystem
> > image:
> >
> > meta-data=/dev/vdc               isize=512    agcount=500, agsize=268435455 blks
> >          =                       sectsz=512   attr=2, projid32bit=1
> >          =                       crc=1        finobt=1, sparse=1, rmapbt=0
> >          =                       reflink=1
> > data     =                       bsize=4096   blocks=134217727500, imaxpct=1
> >          =                       sunit=0      swidth=0 blks
> > naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> > log      =internal log           bsize=4096   blocks=521728, version=2
> >          =                       sectsz=512   sunit=0 blks, lazy-count=1
> > realtime =none                   extsz=4096   blocks=0, rtextents=0
> >
> > That's allocating enough inodes to keep the free inode counter
> > entirely out of the slow path...
>
> percpu_counter_read, which reads the count, will incur a cache
> synchronization cost if another cpu changes the count; maybe it's
> better not to call percpu_counter_compare if possible.

Depends. Sometimes we trade off ultimate single threaded
performance and efficiency for substantially better scalability.
i.e. if we lose 5% on single threaded performance but gain 10x on
concurrent workloads, then that is a good tradeoff to make.

Cheers,

Dave.
Hi Dave,

On 2019/11/7 5:20, Dave Chinner wrote:
> On Wed, Nov 06, 2019 at 02:00:58PM +0800, Shaokun Zhang wrote:
>> Hi Dave,
>>
>> On 2019/11/5 12:03, Dave Chinner wrote:
>>> On Tue, Nov 05, 2019 at 11:26:32AM +0800, Shaokun Zhang wrote:
>>>> Hi Dave,
>>>>
>>>> On 2019/11/5 4:49, Dave Chinner wrote:
>>>>> On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
>>>>>> From: Yang Guo <guoyang2@huawei.com>
>>>>>>
>>>>>> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
>>>>>> whether the counter less than 0 and it is a expensive function.
>>>>>> let's check it only when delta < 0, it will be good for xfs's performance.
>>>>>
>>>>> Hmmm. I don't recall this as being expensive.
>>>>>
>>>>
>>>> Sorry about the misleading information in the commit message.
>>>>
>>>>> How did you find this? Can you please always document how you found
>>>>
>>>> If a user creates millions of files and then deletes them, we found that
>>>> __percpu_counter_compare cost 5.78% CPU usage. You are right that it
>>>> itself is not expensive, but it calls __percpu_counter_sum, which takes
>>>> a spin_lock and reads the other CPUs' counts. perf record -g was used to
>>>> profile it:
>>>>
>>>>   - 5.88%  0.02%  rm  [kernel.vmlinux]  [k] xfs_mod_ifree
>>>>      - 5.86% xfs_mod_ifree
>>>>         - 5.78% __percpu_counter_compare
>>>>              5.61% __percpu_counter_sum
>>>
>>> Interesting. Your workload is hitting the slow path, which I most
>>> certainly do not see when creating lots of files. What's your
>>> workload?
>>>
>>
>> The hardware has 128 cpu cores, and the xfs filesystem format config is
>> the default, while the test is single threaded, as follows:
>>
>>   ./mdtest -I 10 -z 6 -b 8 -d /mnt/ -t -c 2
>
> What version and where do I get it?

You can get mdtest from github: https://github.com/LLNL/mdtest.

>
> Hmmm - isn't mdtest an MPI benchmark intended for highly concurrent
> metadata workload testing? How representative is it of your actual
> production workload? Is that single threaded?
>

We just use mdtest to test the performance of a file system; it isn't
representative of the actual workload, and it is single threaded. But we
also found that it goes to the slow path when we remove a dir with many
files. The cmd is below: rm -rf xxx.

>> xfs info:
>> meta-data=/dev/bcache2           isize=512    agcount=4, agsize=244188661 blks
>
> Only 4 AGs, which explains the lack of free inodes - there isn't
> enough concurrency in the filesystem layout to push the free inode
> count in all AGs beyond batchsize * num_online_cpus().
>
> i.e. single threaded workloads typically drain the free inode count
> all the way down to zero before new inodes are allocated. Workloads
> that are highly concurrent allocate from lots of AGs at once,
> leaving free inodes in every AG that is not currently being actively
> allocated out of.
>
> As a test, can you remake that test filesystem with "-d agcount=32"
> and see if the overhead you are seeing disappears?
>

We tried remaking the filesystem with "-d agcount=32" and it still
enters the slow path mostly. Printing batch * num_online_cpus() showed
that it is 32768, because percpu_counter_batch is initialized to 256
when there are 128 cpu cores. We then changed to agcount=1024, and it
also goes to the slow path frequently because mostly there are not
32768 free inodes.

>>> files and you have lots of idle CPU, and hence the inode allocation
>>> is not clearing the fast path batch threshold on the ifree counter.
>>> And because you have lots of CPUs, the cost of a sum is very
>>> expensive compared to running single threaded creates. That's my
>>> current hypothesis, based on what I see on my workloads, where
>>> xfs_mod_ifree overhead goes down as concurrency goes up....
>>>
>>
>> Agree. We added some debug info in xfs_mod_ifree and found that most of
>> the time m_ifree.count < batch * num_online_cpus(), because we have 128
>> online cpus and m_ifree.count is around 999.
>
> Ok, the threshold is 32 * 128 = ~4000 to get out of the slow
> path. 32 AGs may well push the count over this threshold, so it's
> definitely worth trying....
>

Yes, we tried it and found that the threshold was actually 32768,
because percpu_counter_batch is initialized to 2 * num_online_cpus().

>>> FWIW, the profiles I took came from running this on 16 and 32p
>>> machines:
>>>
>>> --
>>> dirs=""
>>> for i in `seq 1 $THREADS`; do
>>>         dirs="$dirs -d /mnt/scratch/$i"
>>> done
>>>
>>> cycles=$((512 / $THREADS))
>>>
>>> time ./fs_mark $XATTR -D 10000 -S0 -n $NFILES -s 0 -L $cycles $dirs
>>> --
>>>
>>> With THREADS=16 or 32 and NFILES=100000 on a big sparse filesystem
>>> image:
>>>
>>> meta-data=/dev/vdc               isize=512    agcount=500, agsize=268435455 blks
>>>          =                       sectsz=512   attr=2, projid32bit=1
>>>          =                       crc=1        finobt=1, sparse=1, rmapbt=0
>>>          =                       reflink=1
>>> data     =                       bsize=4096   blocks=134217727500, imaxpct=1
>>>          =                       sunit=0      swidth=0 blks
>>> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
>>> log      =internal log           bsize=4096   blocks=521728, version=2
>>>          =                       sectsz=512   sunit=0 blks, lazy-count=1
>>> realtime =none                   extsz=4096   blocks=0, rtextents=0
>>>
>>> That's allocating enough inodes to keep the free inode counter
>>> entirely out of the slow path...
>>
>> percpu_counter_read, which reads the count, will incur a cache
>> synchronization cost if another cpu changes the count; maybe it's
>> better not to call percpu_counter_compare if possible.
>
> Depends. Sometimes we trade off ultimate single threaded
> performance and efficiency for substantially better scalability.
> i.e. if we lose 5% on single threaded performance but gain 10x on
> concurrent workloads, then that is a good tradeoff to make.
>

Agree. I mean that when delta > 0, there is no need to call
percpu_counter_compare in xfs_mod_ifree/icount.

Thanks,
Shaokun

> Cheers,
>
> Dave.
>
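The numbers traded back and forth above fit together as follows (a sketch assuming the `percpu_counter_batch = max(32, 2 * nr_cpus)` initialisation described in this thread; the helper names are illustrative, not kernel APIs):

```c
#include <assert.h>

/* Model of the batch sizing discussed upthread: the generic batch is
 * initialised to max(32, 2 * nr_cpus), and the slow-path threshold for
 * a percpu_counter_compare() against 0 is batch * num_online_cpus(). */
static int percpu_counter_batch_model(int nr_cpus)
{
	int batch = 2 * nr_cpus;

	return batch > 32 ? batch : 32;
}

static long long slow_path_threshold(int nr_cpus)
{
	return (long long)percpu_counter_batch_model(nr_cpus) * nr_cpus;
}
```

For 128 cores the batch is 256 and the threshold 32768 (not the 32 * 128 = ~4000 estimated earlier with the minimum batch of 32), which is why even 32 or 1024 AGs could not lift an ifree count of ~999 out of the slow path.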
Hi Dave,

With the "-d agcount=32" configuration it still enters the slow path
frequently when there are 128 cpu cores; any thoughts on this issue?
Can we remove the debug check entirely, as Christoph suggested?

Thanks,
Shaokun

On 2019/11/8 13:58, Shaokun Zhang wrote:
> Hi Dave,
>
> On 2019/11/7 5:20, Dave Chinner wrote:
>> On Wed, Nov 06, 2019 at 02:00:58PM +0800, Shaokun Zhang wrote:
>>> Hi Dave,
>>>
>>> On 2019/11/5 12:03, Dave Chinner wrote:
>>>> On Tue, Nov 05, 2019 at 11:26:32AM +0800, Shaokun Zhang wrote:
>>>>> Hi Dave,
>>>>>
>>>>> On 2019/11/5 4:49, Dave Chinner wrote:
>>>>>> On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
>>>>>>> From: Yang Guo <guoyang2@huawei.com>
>>>>>>>
>>>>>>> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
>>>>>>> whether the counter less than 0 and it is a expensive function.
>>>>>>> let's check it only when delta < 0, it will be good for xfs's performance.
>>>>>>
>>>>>> Hmmm. I don't recall this as being expensive.
>>>>>>
>>>>>
>>>>> Sorry about the misleading information in the commit message.
>>>>>
>>>>>> How did you find this? Can you please always document how you found
>>>>>
>>>>> If a user creates millions of files and then deletes them, we found that
>>>>> __percpu_counter_compare cost 5.78% CPU usage. You are right that it
>>>>> itself is not expensive, but it calls __percpu_counter_sum, which takes
>>>>> a spin_lock and reads the other CPUs' counts. perf record -g was used to
>>>>> profile it:
>>>>>
>>>>>   - 5.88%  0.02%  rm  [kernel.vmlinux]  [k] xfs_mod_ifree
>>>>>      - 5.86% xfs_mod_ifree
>>>>>         - 5.78% __percpu_counter_compare
>>>>>              5.61% __percpu_counter_sum
>>>>
>>>> Interesting. Your workload is hitting the slow path, which I most
>>>> certainly do not see when creating lots of files. What's your
>>>> workload?
>>>> >>> >>> The hardware has 128 cpu cores, and the xfs filesystem format config is default, >>> while the test is a single thread, as follow: >>> ./mdtest -I 10 -z 6 -b 8 -d /mnt/ -t -c 2 >> >> What version and where do I get it? > > You can get the mdtest from github: https://github.com/LLNL/mdtest. > >> >> Hmmm - isn't mdtest a MPI benchmark intended for highly concurrent >> metadata workload testing? How representative is it of your actual >> production workload? Is that single threaded? >> > > We just use mdtest to test the performance of a file system, it can't representative > the actual workload and it's single threaded. But we also find that it goes to slow > path when we remove a dir with many files. The cmd is below: > rm -rf xxx. > >>> xfs info: >>> meta-data=/dev/bcache2 isize=512 agcount=4, agsize=244188661 blks >> >> only 4 AGs, which explains the lack of free inodes - there isn't >> enough concurrency in the filesystem layout to push the free inode >> count in all AGs beyond the batchsize * num_online_cpus(). >> >> i.e. single threaded workloads typically drain the free inode count >> all the way down to zero before new inodes are allocated. Workloads >> that are highly concurrent allocate from lots of AGs at once, >> leaving free inodes in every AG that is not current being actively >> allocated out of. >> >> As a test, can you remake that test filesystem with "-d agcount=32" >> and see if the overhead you are seeing disappears? >> > > We try to remake the filesystem with "-d agcount=32" and it also enters slow path > mostly. Print the batch * num_online_cpus() and find that it's 32768. > Because percpu_counter_batch was initialized to 256 when there are 128 cpu cores. > Then we change the agcount=1024, and it also goes to slow path frequently because > mostly there are no 32768 free inodes. > >>>> files and you have lots of idle CPU and hence the inode allocation >>>> is not clearing the fast path batch threshold on the ifree counter. 
>>>> And because you have lots of CPUs, the cost of a sum is very
>>>> expensive compared to running single threaded creates. That's my
>>>> current hypothesis based on what I see on my workloads, where
>>>> xfs_mod_ifree overhead goes down as concurrency goes up....
>>>>
>>>
>>> Agree, we added some debug info in xfs_mod_ifree and found most times
>>> m_ifree.count < batch * num_online_cpus(), because we have 128 online
>>> cpus and m_ifree.count is around 999.
>>
>> Ok, the threshold is 32 * 128 = ~4000 to get out of the slow
>> path. 32 AGs may well push the count over this threshold, so it's
>> definitely worth trying....
>>
>
> Yes, we tried it and found that the threshold was 32768, because
> percpu_counter_batch was initialized to 2 * num_online_cpus().
>
>>>> FWIW, the profiles I took came from running this on 16 and 32p
>>>> machines:
>>>>
>>>> --
>>>> dirs=""
>>>> for i in `seq 1 $THREADS`; do
>>>>         dirs="$dirs -d /mnt/scratch/$i"
>>>> done
>>>>
>>>> cycles=$((512 / $THREADS))
>>>>
>>>> time ./fs_mark $XATTR -D 10000 -S0 -n $NFILES -s 0 -L $cycles $dirs
>>>> --
>>>>
>>>> With THREADS=16 or 32 and NFILES=100000 on a big sparse filesystem
>>>> image:
>>>>
>>>> meta-data=/dev/vdc       isize=512    agcount=500, agsize=268435455 blks
>>>>          =               sectsz=512   attr=2, projid32bit=1
>>>>          =               crc=1        finobt=1, sparse=1, rmapbt=0
>>>>          =               reflink=1
>>>> data     =               bsize=4096   blocks=134217727500, imaxpct=1
>>>>          =               sunit=0      swidth=0 blks
>>>> naming   =version 2      bsize=4096   ascii-ci=0, ftype=1
>>>> log      =internal log   bsize=4096   blocks=521728, version=2
>>>>          =               sectsz=512   sunit=0 blks, lazy-count=1
>>>> realtime =none           extsz=4096   blocks=0, rtextents=0
>>>>
>>>> That's allocating enough inodes to keep the free inode counter
>>>> entirely out of the slow path...
>>>
>>> percpu_counter_read that reads the count will cause cache synchronization
>>> cost if other cpus change the count. Maybe it's better not to call
>>> percpu_counter_compare if possible.
>>
>> Depends.
>> Sometimes we trade off ultimate single threaded
>> performance and efficiency for substantially better scalability.
>> i.e. if we lose 5% on single threaded performance but gain 10x on
>> concurrent workloads, then that is a good tradeoff to make.
>>
>
> Agree, I mean that when delta > 0, there is no need to call
> percpu_counter_compare in xfs_mod_ifree/icount.
>
> Thanks,
> Shaokun
>
>> Cheers,
>>
>> Dave.
>>
On Fri, Nov 08, 2019 at 01:58:56PM +0800, Shaokun Zhang wrote:
> Hi Dave,
>
> On 2019/11/7 5:20, Dave Chinner wrote:
> > On Wed, Nov 06, 2019 at 02:00:58PM +0800, Shaokun Zhang wrote:
> >> Hi Dave,
> >>
> >> On 2019/11/5 12:03, Dave Chinner wrote:
> >>> On Tue, Nov 05, 2019 at 11:26:32AM +0800, Shaokun Zhang wrote:
> >>>> Hi Dave,
> >>>>
> >>>> On 2019/11/5 4:49, Dave Chinner wrote:
> >>>>> On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
> >>>>>> From: Yang Guo <guoyang2@huawei.com>
> >>>>>>
> >>>>>> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
> >>>>>> whether the counter less than 0 and it is a expensive function.
> >>>>>> let's check it only when delta < 0, it will be good for xfs's performance.
> >>>>>
> >>>>> Hmmm. I don't recall this as being expensive.
> >>>>>
> >>>>
> >>>> Sorry about the misleading information in the commit message.
> >>>>
> >>>>> How did you find this? Can you please always document how you found
> >>>>
> >>>> If a user creates millions of files and then deletes them, we found that
> >>>> __percpu_counter_compare cost 5.78% CPU usage. You are right that itself
> >>>> is not expensive, but it calls __percpu_counter_sum which will use
> >>>> spin_lock and read other cpus' counts. perf record -g is used to profile it:
> >>>>
> >>>> - 5.88%  0.02%  rm  [kernel.vmlinux]  [k] xfs_mod_ifree
> >>>>    - 5.86% xfs_mod_ifree
> >>>>       - 5.78% __percpu_counter_compare
> >>>>            5.61% __percpu_counter_sum
> >>>
> >>> Interesting. Your workload is hitting the slow path, which I most
> >>> certainly do not see when creating lots of files. What's your
> >>> workload?
> >>>
> >>
> >> The hardware has 128 CPU cores, and the xfs filesystem format config is
> >> default, while the test is single threaded, as follows:
> >> ./mdtest -I 10 -z 6 -b 8 -d /mnt/ -t -c 2
> >
> > What version and where do I get it?
>
> You can get mdtest from github: https://github.com/LLNL/mdtest.

Oh what a pain.
$ MPI_CC=mpicc make
mpicc -DLinux -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE=1 -D__USE_LARGEFILE64=1 -g -o mdtest mdtest.c -lm
mdtest.c:32:10: fatal error: mpi.h: No such file or directory
   32 | #include "mpi.h"
      |          ^~~~~~~
compilation terminated.
make: *** [Makefile:59: mdtest] Error 1

It doesn't build with a modern linux distro openmpi installation.

/me looks at the readme.

Oh, it's an unmaintained historical archive. Hmmm, it got forked to
https://github.com/MDTEST-LANL/mdtest. That's an unmaintained archive,
too. Oh, there's a version at https://github.com/hpc/ior

Ngggh.

$ ./configure
....
checking for mpicc... mpicc
checking for gcc... (cached) mpicc
checking whether the C compiler works... no
configure: error: in `/home/dave/src/ior':
configure: error: C compiler cannot create executables
See `config.log' for more details
$
$ less config.log
....
configure:3805: checking whether the C compiler works
configure:3827: mpicc conftest.c >&5
/usr/bin/ld: cannot find -lopen-rte
/usr/bin/ld: cannot find -lopen-pal
/usr/bin/ld: cannot find -lhwloc
/usr/bin/ld: cannot find -levent
/usr/bin/ld: cannot find -levent_pthreads
collect2: error: ld returned 1 exit status
configure:3831: $? = 1
.....

So, the mpicc compiler wrapper uses a bunch of libraries that don't get
installed with the compiler wrapper. Yay?

> > Hmmm - isn't mdtest a MPI benchmark intended for highly concurrent
> > metadata workload testing? How representative is it of your actual
> > production workload? Is that single threaded?
> >
>
> We just use mdtest to test the performance of a file system; it can't
> represent the actual workload and it's single threaded. But we also find
> that it goes to the slow path when we remove a dir with many files. The
> cmd is below:
> rm -rf xxx

The benchmark doesn't reproduce that.
It will only occur when you do sequential inode removal so you have no
free inodes in the filesystem and every set of 64 inodes that is removed
lands in the same chunk and so is immediately freed, leaving zero free
inodes in the filesystem. i.e. this is a result of sequential inode
create, which typically implies "empty filesystem".

Aged filesystems typically don't behave like this - they have free
inodes distributed all through the inode chunks on disk. And if you do
random removal, you won't see it for the same reason, either - the
remove phase of mdtest doesn't show any CPU usage in
percpu_counter_sum() at all.

Please understand that I'm not saying that it isn't a problem, that it
doesn't happen, or that we don't need to change the behaviour. What I'm
trying to do is understand the conditions in which this problem occurs
and whether they are real world or just micro-benchmark related
problems...

> Because percpu_counter_batch was initialized to 256 when there are 128 cpu cores.
> Then we change the agcount=1024, and it also goes to slow path frequently because
> mostly there are no 32768 free inodes.

Hmmm. I had forgotten the batch size increased with CPU count like
that - I had the thought it was log(ncpus), not linear(ncpus).

That kinda points out that scaling the batch size by CPU count is
somewhat silly, as concurrency on the counter is not defined by the
number of CPUs in the system. Inode counter concurrency is defined by
the number of inode allocation/free operations that can be performed
at any given time. As such, having a counter threshold of 256 *
num_cpus doesn't make a whole lot of sense for a 4 AG filesystem
because you can't have 128 CPUs all banging on it at once.

And even if you have hundreds of AGs, the CPUs are going to be
somewhat serialised through the transaction commit path by the CIL
locks that are taken to insert objects into the CIL.
Hence the unbound concurrency of transaction commit is going to be
soaked up by the CIL list insert spin lock, and it's mostly going to
be just a dribble of CPUs at a time through the transaction commit
accounting code.

Ok, so maybe we just need a small batch size here, like a value of 4
or 8, just to avoid every inode alloc/free transaction having to pick
up a global spin lock every time...

Let me do some testing here...

> >> percpu_counter_read that reads the count will cause cache synchronization
> >> cost if other cpus change the count. Maybe it's better not to call
> >> percpu_counter_compare if possible.
> >
> > Depends. Sometimes we trade off ultimate single threaded
> > performance and efficiency for substantially better scalability.
> > i.e. if we lose 5% on single threaded performance but gain 10x on
> > concurrent workloads, then that is a good tradeoff to make.
> >
>
> Agree, I mean that when delta > 0, there is no need to call
> percpu_counter_compare in xfs_mod_ifree/icount.

It also doesn't totally avoid the issue, either, because of the way we
dynamically alloc/free inode clusters, and so the counters go both up
and down during continuous allocation or continuous freeing.

i.e. The pattern we see on inode allocation is:

	icount += 64
	ifree += 64
	ifree--
	....
	ifree--
	icount += 64
	ifree += 64
	ifree--
	....
	ifree--

And on inode freeing, we see the opposite:

	ifree++
	....
	ifree++
	icount -= 64
	ifree -= 64
	ifree++
	....
	ifree++
	icount -= 64
	ifree -= 64

So the values of ifree will still decrement during both free and
allocation operations, hence it doesn't avoid the issue totally.

Cheers,

Dave.
On Mon, Nov 18, 2019 at 07:12:12PM +1100, Dave Chinner wrote:
> On Fri, Nov 08, 2019 at 01:58:56PM +0800, Shaokun Zhang wrote:
> > On 2019/11/7 5:20, Dave Chinner wrote:
> > Because percpu_counter_batch was initialized to 256 when there are 128 cpu cores.
> > Then we change the agcount=1024, and it also goes to slow path frequently because
> > mostly there are no 32768 free inodes.
>
> Hmmm. I had forgotten the batch size increased with CPU count like
> that - I had the thought it was log(ncpus), not linear(ncpus).

[.....]

> Ok, so maybe we just need a small batch size here, like a value of 4
> or 8, just to avoid every inode alloc/free transaction having to pick
> up a global spin lock every time...
>
> Let me do some testing here...

[....]

> > Agree, I mean that when delta > 0, there is no need to call
> > percpu_counter_compare in xfs_mod_ifree/icount.

Ok, so the percpu_counter_sum() only shows up during a create workload
here, at ~1.5% of the CPU used. Only doing the check when delta < 0
makes no difference to that value. CPU usage of percpu_counter_sum is
noise for all other parts of the workload, including the "remove
files" part of the benchmark. Hence it is this pattern:

> i.e. The pattern we see on inode allocation is:
>
> 	icount += 64
> 	ifree += 64
> 	ifree--
> 	....
> 	ifree--
> 	icount += 64
> 	ifree += 64
> 	ifree--
> 	....
> 	ifree--

That is causing the compare CPU usage - all the single decrements are
triggering it because we are operating on a new filesystem that has no
free inodes other than the cluster we just allocated. Hence avoiding
the value compare when delta > 0 makes no difference to CPU usage
because most of the modifications are subtractions.

> And on inode freeing, we see the opposite:
>
> 	ifree++
> 	....
> 	ifree++
> 	icount -= 64
> 	ifree -= 64

For freeing, we aren't always freeing from the same inode cluster, so
the ifree count actually goes up quite a bit before it starts going
down as we free clusters. Hence if we keep the batch size small here,
we should mostly stay out of the compare path, and the "no compare
when delta > 0" change will also make a substantial difference here if
the ifree count is low.

So, I reduced the batch size to 8, and CPU usage during creates
dropped from 1.5% to 0.6% on a 16p machine. That's a substantial
reduction - if this translates to larger machines, that should bring
CPU usage down under 2%....

I was going to send this patch for you to test, but I left this email
sitting unsent overnight and I thought about it some more. The reality
is, as Christoph said, this ends up being just debug code because we
silently ignore the underflow error on production kernels. Hence I
think the right thing to do is gut the transaction superblock
accounting code so that we need only simple asserts and don't bother
trying to behave like it's an error we can actually handle sanely.

This removes the compare from production kernels completely, so you
should see this all go away with the patch below. The difference in
performance on my 16p machine is not significant - it is within the
normal run-to-run variability of the mdtest benchmark, but the counter
compare overhead is gone from the profiles.

Cheers,

Dave.
> +static void
>  xfs_sb_mod8(
>  	uint8_t		*field,
>  	int8_t		delta)
>  {
>  	int8_t		counter = *field;
>
> +	if (!delta)
> +		return;
>  	counter += delta;
> +	ASSERT(counter >= 0);
>  	*field = counter;
>  }

I'd almost find it easier to keep the != 0 check in the caller, and in
fact also move the assert there and just kill these helpers, e.g.

	if (tp->t_imaxpct_delta != 0) {
		mp->m_sb.sb_imax_pct += tp->t_imaxpct_delta;
		ASSERT(mp->m_sb.sb_imax_pct >= 0);
	}
	if (tp->t_rextsize_delta != 0) {
		mp->m_sb.sb_rextsize += tp->t_rextsize_delta;
		ASSERT(mp->m_sb.sb_rextsize >= 0);
	}

While this is 2 more lines per counter it is a lot more explicit and
easier to follow.

> 	if (idelta) {
> -		error = xfs_mod_icount(mp, idelta);
> -		if (error)
> -			goto out_undo_fdblocks;
> +		percpu_counter_add_batch(&mp->m_icount, idelta,
> +					 XFS_ICOUNT_BATCH);
> +		if (idelta < 0)
> +			ASSERT(__percpu_counter_compare(&mp->m_icount, 0,
> +						XFS_ICOUNT_BATCH) >= 0)

And here I wonder if keeping single use helpers wouldn't have been a
little nicer. Not that it really matters.
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index ba5b6f3b2b88..5e8314e6565e 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1174,6 +1174,9 @@ xfs_mod_icount(
 	int64_t		delta)
 {
 	percpu_counter_add_batch(&mp->m_icount, delta, XFS_ICOUNT_BATCH);
+	if (delta > 0)
+		return 0;
+
 	if (__percpu_counter_compare(&mp->m_icount, 0, XFS_ICOUNT_BATCH) < 0) {
 		ASSERT(0);
 		percpu_counter_add(&mp->m_icount, -delta);
@@ -1188,6 +1191,9 @@ xfs_mod_ifree(
 	int64_t		delta)
 {
 	percpu_counter_add(&mp->m_ifree, delta);
+	if (delta > 0)
+		return 0;
+
 	if (percpu_counter_compare(&mp->m_ifree, 0) < 0) {
 		ASSERT(0);
 		percpu_counter_add(&mp->m_ifree, -delta);