Message ID | 0849fc77-1a6e-46f8-a18d-15699f99158e@linux.alibaba.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [report] Unixbench shell1 performance regression | expand |
On Sat, Mar 15, 2025 at 01:19:31AM +0800, Gao Xiang wrote: > Hi folks, > > Days ago, I received a XFS Unixbench[1] shell1 (high-concurrency) > performance regression during a benchmark comparison between XFS and > EXT4: The XFS result was lower than EXT4 by 15% on Linux 6.6.y with > 144-core aarch64 (64K page size). Since Unixbench is somewhat important > to indicate overall system performance for many end users, it's not > a good result. Unixbench isn't really that indicative of typical worklaods on large core-count machines these days. It's an ancient benchmark, and it's exceedingly rare that a modern machine is fully loaded with shell scripts such as the shell1 test is running because it's highly inefficient to do large scale concurrent processing of data in this way.... Indeed, look at the file copy "benchmarks" it runs - the use buffer sizes of 256, 1024 and 4096 bytes to tell you how well the filesystem performs. Using sub-page size buffers might have been common for 1983-era CPUs to get the highest possible file copy throughput, but these days these are slow paths that we largely don't optimise for highest throughput. Measuring modern system scalability via how such operations perform is largely meaningless because applications don't behave this way anymore.... > shell1 test[2] basically runs in a loop that it executes commands > to generate files (sort.$$, od.$$, grep.$$, wc.$$) and then remove > them. The testcase lasts for one minute and then show the total number > of iterations. > > While no difference was observed in single-threaded results, it showed > a noticeable difference above if `./Run shell1 -c 144 -i 1` is used. I'm betting that the XFS filesystem is small and only has 4 AGs, and so has very limited concurrency in allocation. i.e. you're trying to run a massively concurrent workload on a filesystem that only has - at best - the ability to do 4 allocations or frees at a time. Of course it is going to contend on the allocation group locks.... > The original report was on aarch64, but I could still reproduce some > difference on Linux 6.13 with a X86 physical machine: > > Intel(R) Xeon(R) Platinum 8331C CPU @ 2.50GHz * 96 cores > 512 GiB memory > > XFS (35649.6) is still lower than EXT4 (37146.0) by 4% and > the kconfig is attached. > > However, I don't observe much difference on 5.10.y kernels. After > collecting some off-CPU trace, I found there are many new agi buf > lock waits compared with the correspoinding 5.10.y trace, as below: Yes, because background inactivation can increase the contention on AGF/AGI buffer locks when there is insufficient concurrency in the filesystem layout. It is rare, however, that any workload other that benchmarks generate enough load and/or concurrency to reach the thresholds where such lock breakdown occurs. > I tried to do some hack to disable defer inode inactivation as below, > the shell1 benchmark then recovered: XFS (35649.6 -> 37810.9): > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > index 7b6c026d01a1..d9fb2ef3686a 100644 > --- a/fs/xfs/xfs_icache.c > +++ b/fs/xfs/xfs_icache.c > @@ -2059,6 +2059,7 @@ void > xfs_inodegc_start( > struct xfs_mount *mp) > { > + return; > if (xfs_set_inodegc_enabled(mp)) > return; > > @@ -2180,6 +2181,12 @@ xfs_inodegc_queue( > ip->i_flags |= XFS_NEED_INACTIVE; > spin_unlock(&ip->i_flags_lock); > > + if (1) { > + xfs_iflags_set(ip, XFS_INACTIVATING); > + xfs_inodegc_inactivate(ip); > + return; > + } That reintroduces potential deadlock vectors by running blocking transactions directly from iput() and/or memory reclaim. That's one of the main reasons we moved inactivation to a background thread - it gets rid of an entire class of potential deadlock problems.... -Dave.
Hi Dave, On 2025/3/17 05:25, Dave Chinner wrote: > On Sat, Mar 15, 2025 at 01:19:31AM +0800, Gao Xiang wrote: >> Hi folks, >> >> Days ago, I received a XFS Unixbench[1] shell1 (high-concurrency) >> performance regression during a benchmark comparison between XFS and >> EXT4: The XFS result was lower than EXT4 by 15% on Linux 6.6.y with >> 144-core aarch64 (64K page size). Since Unixbench is somewhat important >> to indicate overall system performance for many end users, it's not >> a good result. > > Unixbench isn't really that indicative of typical worklaods on large > core-count machines these days. It's an ancient benchmark, and it's > exceedingly rare that a modern machine is fully loaded with shell > scripts such as the shell1 test is running because it's highly > inefficient to do large scale concurrent processing of data in this > way.... > > Indeed, look at the file copy "benchmarks" it runs - the use buffer > sizes of 256, 1024 and 4096 bytes to tell you how well the > filesystem performs. Using sub-page size buffers might have been > common for 1983-era CPUs to get the highest possible file copy > throughput, but these days these are slow paths that we largely > don't optimise for highest throughput. Measuring modern system > scalability via how such operations perform is largely meaningless > because applications don't behave this way anymore.... Thanks for your reply! Sigh. Many customers really care, and they select the whole software stack based on this benchmark. If they think the results are not good, they might ask us to move away of XFS filesystem. It's not what I could do anything, you know. > >> shell1 test[2] basically runs in a loop that it executes commands >> to generate files (sort.$$, od.$$, grep.$$, wc.$$) and then remove >> them. The testcase lasts for one minute and then show the total number >> of iterations. >> >> While no difference was observed in single-threaded results, it showed >> a noticeable difference above if `./Run shell1 -c 144 -i 1` is used. > > I'm betting that the XFS filesystem is small and only has 4 AGs, > and so has very limited concurrency in allocation. > > i.e. you're trying to run a massively concurrent workload on a > filesystem that only has - at best - the ability to do 4 allocations > or frees at a time. Of course it is going to contend on the > allocation group locks.... I've adjusted this, from 4 AG to 20 AG. No real impact. > >> The original report was on aarch64, but I could still reproduce some >> difference on Linux 6.13 with a X86 physical machine: >> >> Intel(R) Xeon(R) Platinum 8331C CPU @ 2.50GHz * 96 cores >> 512 GiB memory >> >> XFS (35649.6) is still lower than EXT4 (37146.0) by 4% and >> the kconfig is attached. >> >> However, I don't observe much difference on 5.10.y kernels. After >> collecting some off-CPU trace, I found there are many new agi buf >> lock waits compared with the correspoinding 5.10.y trace, as below: > > Yes, because background inactivation can increase the contention on > AGF/AGI buffer locks when there is insufficient concurrency in the > filesystem layout. It is rare, however, that any workload other that > benchmarks generate enough load and/or concurrency to reach the > thresholds where such lock breakdown occurs. > >> I tried to do some hack to disable defer inode inactivation as below, >> the shell1 benchmark then recovered: XFS (35649.6 -> 37810.9): >> >> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c >> index 7b6c026d01a1..d9fb2ef3686a 100644 >> --- a/fs/xfs/xfs_icache.c >> +++ b/fs/xfs/xfs_icache.c >> @@ -2059,6 +2059,7 @@ void >> xfs_inodegc_start( >> struct xfs_mount *mp) >> { >> + return; >> if (xfs_set_inodegc_enabled(mp)) >> return; >> >> @@ -2180,6 +2181,12 @@ xfs_inodegc_queue( >> ip->i_flags |= XFS_NEED_INACTIVE; >> spin_unlock(&ip->i_flags_lock); >> >> + if (1) { >> + xfs_iflags_set(ip, XFS_INACTIVATING); >> + xfs_inodegc_inactivate(ip); >> + return; >> + } > > That reintroduces potential deadlock vectors by running blocking > transactions directly from iput() and/or memory reclaim. That's one > of the main reasons we moved inactivation to a background thread - > it gets rid of an entire class of potential deadlock problems.... Yeah, I noticed that too, mainly commit 68b957f64fca ("xfs: load uncached unlinked inodes into memory on demand"). Thanks, Gao Xiang > > -Dave.
On Mon, Mar 17, 2025 at 08:25:16AM +0800, Gao Xiang wrote: > Hi Dave, > > On 2025/3/17 05:25, Dave Chinner wrote: > > On Sat, Mar 15, 2025 at 01:19:31AM +0800, Gao Xiang wrote: > > > Hi folks, > > > > > > Days ago, I received a XFS Unixbench[1] shell1 (high-concurrency) > > > performance regression during a benchmark comparison between XFS and > > > EXT4: The XFS result was lower than EXT4 by 15% on Linux 6.6.y with > > > 144-core aarch64 (64K page size). Since Unixbench is somewhat important > > > to indicate overall system performance for many end users, it's not > > > a good result. > > > > Unixbench isn't really that indicative of typical worklaods on large > > core-count machines these days. It's an ancient benchmark, and it's > > exceedingly rare that a modern machine is fully loaded with shell > > scripts such as the shell1 test is running because it's highly > > inefficient to do large scale concurrent processing of data in this > > way.... > > > > Indeed, look at the file copy "benchmarks" it runs - the use buffer > > sizes of 256, 1024 and 4096 bytes to tell you how well the > > filesystem performs. Using sub-page size buffers might have been > > common for 1983-era CPUs to get the highest possible file copy > > throughput, but these days these are slow paths that we largely > > don't optimise for highest throughput. Measuring modern system > > scalability via how such operations perform is largely meaningless > > because applications don't behave this way anymore.... > > Thanks for your reply! > > Sigh. Many customers really care, and they select the whole software > stack based on this benchmark. People using benchmarks that have no relevance to their software/application stack behaviour as the basis of their purchase decisions has been happening for decades. > If they think the results are not good, they might ask us to move away > of XFS filesystem. It's not what I could do anything, you know. If they think there is a filesystem better suited to their requirements than XFS, then they are free to make that decision themselves. We can point out that their selection metrics are irrelevant to their actual workload, but in my experience this just makes the people running the selection trial more convinced they are right and they still make a poor decision.... > > > shell1 test[2] basically runs in a loop that it executes commands > > > to generate files (sort.$$, od.$$, grep.$$, wc.$$) and then remove > > > them. The testcase lasts for one minute and then show the total number > > > of iterations. > > > > > > While no difference was observed in single-threaded results, it showed > > > a noticeable difference above if `./Run shell1 -c 144 -i 1` is used. > > > > I'm betting that the XFS filesystem is small and only has 4 AGs, > > and so has very limited concurrency in allocation. > > > > i.e. you're trying to run a massively concurrent workload on a > > filesystem that only has - at best - the ability to do 4 allocations > > or frees at a time. Of course it is going to contend on the > > allocation group locks.... > > I've adjusted this, from 4 AG to 20 AG. No real impact. Yup, still very limited concurrency considering that you are running 144 instances of that workload (which, AFAICT, are all doing independent work). This implies that a couple of hundred AGs would be needed to provide sufficient allocation concurrency for this sort of workload. > > > I tried to do some hack to disable defer inode inactivation as below, > > > the shell1 benchmark then recovered: XFS (35649.6 -> 37810.9): > > > > > > diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c > > > index 7b6c026d01a1..d9fb2ef3686a 100644 > > > --- a/fs/xfs/xfs_icache.c > > > +++ b/fs/xfs/xfs_icache.c > > > @@ -2059,6 +2059,7 @@ void > > > xfs_inodegc_start( > > > struct xfs_mount *mp) > > > { > > > + return; > > > if (xfs_set_inodegc_enabled(mp)) > > > return; > > > > > > @@ -2180,6 +2181,12 @@ xfs_inodegc_queue( > > > ip->i_flags |= XFS_NEED_INACTIVE; > > > spin_unlock(&ip->i_flags_lock); > > > > > > + if (1) { > > > + xfs_iflags_set(ip, XFS_INACTIVATING); > > > + xfs_inodegc_inactivate(ip); > > > + return; > > > + } > > > > That reintroduces potential deadlock vectors by running blocking > > transactions directly from iput() and/or memory reclaim. That's one > > of the main reasons we moved inactivation to a background thread - > > it gets rid of an entire class of potential deadlock problems.... > > Yeah, I noticed that too, mainly > commit 68b957f64fca ("xfs: load uncached unlinked inodes into memory > on demand"). That is not related to the class of deadlocks and issues I'm referring to. Running a transaction in memory reclaim context (i.e. shrinker evicts the inode from memory) means that memory reclaim now blocks on journal space, IO, buffer locks, etc. The sort of deadlock that this can cause is a non-transactional operation above memory reclaim holding a buffer lock (e.g. bulkstat reading the AGI btree), then requiring memory allocation (e.g. pulling a AGI btree block into memory) which triggers direct memory reclaim, which then tries to inactivate an inode, which then (for whatever reason) requires taking a AGI btree block lock.... That is the class of potential deadlock that background inode inactivation avoids completely. It also avoids excessive inode eviction latency (important as shrinkers run from direct reclaim are supposed to be non-blocking) and other sub-optimal inode eviction behaviours from occurring... -Dave.
On 2025/3/18 04:43, Dave Chinner wrote: > On Mon, Mar 17, 2025 at 08:25:16AM +0800, Gao Xiang wrote: >> Hi Dave, >> >> On 2025/3/17 05:25, Dave Chinner wrote: >>> On Sat, Mar 15, 2025 at 01:19:31AM +0800, Gao Xiang wrote: >>>> Hi folks, >>>> >>>> Days ago, I received a XFS Unixbench[1] shell1 (high-concurrency) >>>> performance regression during a benchmark comparison between XFS and >>>> EXT4: The XFS result was lower than EXT4 by 15% on Linux 6.6.y with >>>> 144-core aarch64 (64K page size). Since Unixbench is somewhat important >>>> to indicate overall system performance for many end users, it's not >>>> a good result. >>> >>> Unixbench isn't really that indicative of typical worklaods on large >>> core-count machines these days. It's an ancient benchmark, and it's >>> exceedingly rare that a modern machine is fully loaded with shell >>> scripts such as the shell1 test is running because it's highly >>> inefficient to do large scale concurrent processing of data in this >>> way.... >>> >>> Indeed, look at the file copy "benchmarks" it runs - the use buffer >>> sizes of 256, 1024 and 4096 bytes to tell you how well the >>> filesystem performs. Using sub-page size buffers might have been >>> common for 1983-era CPUs to get the highest possible file copy >>> throughput, but these days these are slow paths that we largely >>> don't optimise for highest throughput. Measuring modern system >>> scalability via how such operations perform is largely meaningless >>> because applications don't behave this way anymore.... >> >> Thanks for your reply! >> >> Sigh. Many customers really care, and they select the whole software >> stack based on this benchmark. > > People using benchmarks that have no relevance to their > software/application stack behaviour as the basis of their purchase > decisions has been happening for decades. They even know nothing. Unixbench is already a practice for them, I cannot argue about that. > >> If they think the results are not good, they might ask us to move away >> of XFS filesystem. It's not what I could do anything, you know. > > If they think there is a filesystem better suited to their > requirements than XFS, then they are free to make that decision > themselves. We can point out that their selection metrics are > irrelevant to their actual workload, but in my experience this just > makes the people running the selection trial more convinced they are > right and they still make a poor decision.... The problem is not simple like this, what we'd like is to provide a unique cloud image for users to use. It's impossible for us to provide two images for two filesystems. But Unixbench is still important for many users, so either we still to XFS or switch back to EXT4. > >>>> shell1 test[2] basically runs in a loop that it executes commands >>>> to generate files (sort.$$, od.$$, grep.$$, wc.$$) and then remove >>>> them. The testcase lasts for one minute and then show the total number >>>> of iterations. >>>> >>>> While no difference was observed in single-threaded results, it showed >>>> a noticeable difference above if `./Run shell1 -c 144 -i 1` is used. >>> >>> I'm betting that the XFS filesystem is small and only has 4 AGs, >>> and so has very limited concurrency in allocation. >>> >>> i.e. you're trying to run a massively concurrent workload on a >>> filesystem that only has - at best - the ability to do 4 allocations >>> or frees at a time. Of course it is going to contend on the >>> allocation group locks.... >> >> I've adjusted this, from 4 AG to 20 AG. No real impact. > > Yup, still very limited concurrency considering that you are running > 144 instances of that workload (which, AFAICT, are all doing > independent work). This implies that a couple of hundred AGs would > be needed to provide sufficient allocation concurrency for this sort > of workload. Hmm.. We shipped a unique traditional rootfs image to all users, we cannot adjust AG numbers according to specific workload like this because increasing AG numbers is not always good for many other workloads. But I could try to use 144 AG and retest again. > >>>> I tried to do some hack to disable defer inode inactivation as below, >>>> the shell1 benchmark then recovered: XFS (35649.6 -> 37810.9): >>>> >>>> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c >>>> index 7b6c026d01a1..d9fb2ef3686a 100644 >>>> --- a/fs/xfs/xfs_icache.c >>>> +++ b/fs/xfs/xfs_icache.c >>>> @@ -2059,6 +2059,7 @@ void >>>> xfs_inodegc_start( >>>> struct xfs_mount *mp) >>>> { >>>> + return; >>>> if (xfs_set_inodegc_enabled(mp)) >>>> return; >>>> >>>> @@ -2180,6 +2181,12 @@ xfs_inodegc_queue( >>>> ip->i_flags |= XFS_NEED_INACTIVE; >>>> spin_unlock(&ip->i_flags_lock); >>>> >>>> + if (1) { >>>> + xfs_iflags_set(ip, XFS_INACTIVATING); >>>> + xfs_inodegc_inactivate(ip); >>>> + return; >>>> + } >>> >>> That reintroduces potential deadlock vectors by running blocking >>> transactions directly from iput() and/or memory reclaim. That's one >>> of the main reasons we moved inactivation to a background thread - >>> it gets rid of an entire class of potential deadlock problems.... >> >> Yeah, I noticed that too, mainly >> commit 68b957f64fca ("xfs: load uncached unlinked inodes into memory >> on demand"). > > That is not related to the class of deadlocks and issues I'm > referring to. Running a transaction in memory reclaim context (i.e. > shrinker evicts the inode from memory) means that memory reclaim now > blocks on journal space, IO, buffer locks, etc. > > The sort of deadlock that this can cause is a non-transactional > operation above memory reclaim holding a buffer lock (e.g. bulkstat > reading the AGI btree), then requiring memory allocation (e.g. > pulling a AGI btree block into memory) which triggers direct memory > reclaim, which then tries to inactivate an inode, which then > (for whatever reason) requires taking a AGI btree block lock.... Yeah, we don't have a context to know AGI lock is locked before. > > That is the class of potential deadlock that background inode > inactivation avoids completely. It also avoids excessive inode > eviction latency (important as shrinkers run from direct reclaim > are supposed to be non-blocking) and other sub-optimal inode > eviction behaviours from occurring... I know, what I meant was a direct problem if I hack to revert to the old way even I don't care about the memory reclaim deadlock. It seems the in-core unlink list is heavily relied on this feature too, so I don't have a good way anyway. Thanks, Gao Xiang > > -Dave.
On Tue, Mar 18, 2025 at 08:29:46AM +0800, Gao Xiang wrote: > On 2025/3/18 04:43, Dave Chinner wrote: > > On Mon, Mar 17, 2025 at 08:25:16AM +0800, Gao Xiang wrote: > > > If they think the results are not good, they might ask us to move away > > > of XFS filesystem. It's not what I could do anything, you know. > > > > If they think there is a filesystem better suited to their > > requirements than XFS, then they are free to make that decision > > themselves. We can point out that their selection metrics are > > irrelevant to their actual workload, but in my experience this just > > makes the people running the selection trial more convinced they are > > right and they still make a poor decision.... > > The problem is not simple like this, what we'd like is to provide > a unique cloud image for users to use. It's impossible for us to > provide two images for two filesystems. But Unixbench is still > important for many users, so either we still to XFS or switch back > to EXT4. Well, that means your company has the motivation to try to improve the XFS code, doesn't it? If they won't put up the resources to address issues that affect their customers, then why should anyone else do that work for them for free? -Dave.
On Tue, Mar 18, 2025 at 11:58:53AM +1100, Dave Chinner wrote: > On Tue, Mar 18, 2025 at 08:29:46AM +0800, Gao Xiang wrote: > > On 2025/3/18 04:43, Dave Chinner wrote: > > > On Mon, Mar 17, 2025 at 08:25:16AM +0800, Gao Xiang wrote: > > > > If they think the results are not good, they might ask us to move away > > > > of XFS filesystem. It's not what I could do anything, you know. > > > > > > If they think there is a filesystem better suited to their > > > requirements than XFS, then they are free to make that decision > > > themselves. We can point out that their selection metrics are > > > irrelevant to their actual workload, but in my experience this just > > > makes the people running the selection trial more convinced they are > > > right and they still make a poor decision.... > > > > The problem is not simple like this, what we'd like is to provide > > a unique cloud image for users to use. It's impossible for us to > > provide two images for two filesystems. But Unixbench is still > > important for many users, so either we still to XFS or switch back > > to EXT4. > > Well, that means your company has the motivation to try to improve > the XFS code, doesn't it? If they won't put up the resources to > address issues that affect their customers, then why should anyone > else do that work for them for free? Disclose: I don't speak for my company. So the following is just my own thoughts. It may be true in 2023, however, the only resource now is me (I suspect there will be more resource), but I still have other work to do on my hand which impacts my own performance review more. For this issue, I spent much time (up to the mid-night) last week to find out, which greatly impact my own physical health. Yes, I will speed more on this, but: There is no clear evidence that a cloud image could directly benefit to our AI infra. So from POV of a company, the worst case is that they may revisit the overall benefits among all possible choices and find out the best one. Anyway, I've got the community view of Unixbench. I will arrange my own TODO list. Thanks, Gao Xiang > > -Dave. > -- > Dave Chinner > david@fromorbit.com
On Tue, Mar 18, 2025 at 10:03:10AM +0800, Gao Xiang wrote: > Anyway, I've got the community view of Unixbench. I will arrange my > own TODO list. FYI, while I agree with all of Dave's observation on Unixbench I don't entirely agree with the conclusion. Yes, we should not bend over backwards to optimize for this workload. But if we can find an easy enough way to improve it I'm all in favour. While everyone should know not to do subsector writes there's actually still plenty of workloads that do. And while those are not the heavily optimized ones that really matter it would still be nice to get good numbers for them. So if you can find a way to optimize it without much negative impact I'd love to see that.
Hi Christoph, On 2025/3/18 16:10, Christoph Hellwig wrote: > On Tue, Mar 18, 2025 at 10:03:10AM +0800, Gao Xiang wrote: >> Anyway, I've got the community view of Unixbench. I will arrange my >> own TODO list. > > FYI, while I agree with all of Dave's observation on Unixbench I don't > entirely agree with the conclusion. Yes, we should not bend over > backwards to optimize for this workload. But if we can find an easy > enough way to improve it I'm all in favour. While everyone should > know not to do subsector writes there's actually still plenty of > workloads that do. And while those are not the heavily optimized ones > that really matter it would still be nice to get good numbers for them. > > So if you can find a way to optimize it without much negative impact > I'd love to see that. Got it. I will find time digging into more (but honestly, the priority is not high), if there could be some clean solution, I will feedback then. BTW, if any folks are really interested in this case and get much high priority compared to me (that is one of reasons I reported it immediately), I really appreciate too. Thanks, Gao Xiang
diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 7b6c026d01a1..d9fb2ef3686a 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -2059,6 +2059,7 @@ void xfs_inodegc_start( struct xfs_mount *mp) { + return; if (xfs_set_inodegc_enabled(mp)) return; @@ -2180,6 +2181,12 @@ xfs_inodegc_queue( ip->i_flags |= XFS_NEED_INACTIVE; spin_unlock(&ip->i_flags_lock); + if (1) { + xfs_iflags_set(ip, XFS_INACTIVATING); + xfs_inodegc_inactivate(ip); + return; + } + cpu_nr = get_cpu(); gc = this_cpu_ptr(mp->m_inodegc); llist_add(&ip->i_gclist, &gc->list);