Message ID: 1477420904-1399-6-git-send-email-jbacik@fb.com (mailing list archive)
State: New, archived
On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote: > With anything that populates the inode/dentry cache with a lot of one time use > inodes we can really put a lot of pressure on the system for things we don't > need to keep in cache. It takes two runs through the LRU to evict these one use > entries, and if you have a lot of memory you can end up with 10's of millions of > entries in the dcache or icache that have never actually been touched since they > were first instantiated, and it will take a lot of CPU and a lot of pressure to > evict all of them. > > So instead do what we do with pagecache, only set the *REFERENCED flags if we > are being used after we've been put onto the LRU. This makes a significant > difference in the system's ability to evict these useless cache entries. With a > fs_mark workload that creates 40 million files we get the following results (all > in files/sec) What's the workload, storage, etc? > Btrfs Patched Unpatched > Average Files/sec: 72209.3 63254.2 > p50 Files/sec: 70850 57560 > p90 Files/sec: 68757 53085 > p99 Files/sec: 68757 53085 So how much of this is from changing the dentry referenced behaviour, and how much from the inode? Can you separate out the two changes so we know which one is actually affecting reclaim performance? Indeed, I wonder if just changing the superblock shrinker default_seeks for btrfs would have exactly the same impact because that can be used to exactly double the reclaim scan rate for the same memory pressure. If that doesn't change performance by a similar amount (changing default_seeks is the normal way of changing shrinker balance), then more digging is required here to explain why the referenced bits make such an impact to steady state performance... > XFS Patched Unpatched > Average Files/sec: 61025.5 60719.5 > p50 Files/sec: 60107 59465 > p90 Files/sec: 59300 57966 > p99 Files/sec: 59227 57528 You made XFS never use I_REFERENCED at all (hint: not all filesystems use find_inode/find_inode_fast()), so it's not clear that the extra scanning (less than 1% difference in average throughput) is actually the cause of things being slower in btrfs. > The reason Btrfs has a much larger improvement is because it holds a lot more > things in memory so benefits more from faster slab reclaim, but across the board > is an improvement for each of the file systems. Less than 1% for XFS and ~1.5% for ext4 is well within the run-to-run variation of fsmark. It looks like it might be slightly faster, but it's not a cut-and-dried win for anything other than btrfs. Cheers, Dave.
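For context on the default_seeks suggestion above: the number of objects a superblock shrinker is asked to scan is derived from how much page cache was scanned and is divided by shrinker->seeks, so halving seeks from DEFAULT_SEEKS (2) roughly doubles the pressure for the same amount of page cache scanning. A simplified sketch of the do_shrink_slab() arithmetic in mm/vmscan.c of this era (paraphrased for illustration, not the verbatim kernel code; clamping and deferred-work handling omitted):

static unsigned long long shrink_delta(unsigned long seeks,       /* shrinker->seeks */
                                       unsigned long pgs_scanned, /* page cache pages scanned */
                                       unsigned long lru_pgs,     /* page cache pages on the LRUs */
                                       unsigned long freeable)    /* ->count_objects() result */
{
        unsigned long long delta;

        /* pressure scales with page cache scanning, inversely with seeks */
        delta = (4ULL * pgs_scanned) / seeks;
        delta *= freeable;
        delta /= lru_pgs + 1;

        return delta;   /* objects to scan, before deferred-work carry-over */
}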
On Wed, Oct 26, 2016 at 09:01:13AM +1100, Dave Chinner wrote: > On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote: > > With anything that populates the inode/dentry cache with a lot of one time use > > inodes we can really put a lot of pressure on the system for things we don't > > need to keep in cache. It takes two runs through the LRU to evict these one use > > entries, and if you have a lot of memory you can end up with 10's of millions of > > entries in the dcache or icache that have never actually been touched since they > > were first instantiated, and it will take a lot of CPU and a lot of pressure to > > evict all of them. > > > > So instead do what we do with pagecache, only set the *REFERENCED flags if we > > are being used after we've been put onto the LRU. This makes a significant > > difference in the system's ability to evict these useless cache entries. With a > > fs_mark workload that creates 40 million files we get the following results (all > > in files/sec) > > What's the workload, storage, etc? > > > Btrfs Patched Unpatched > > Average Files/sec: 72209.3 63254.2 > > p50 Files/sec: 70850 57560 > > p90 Files/sec: 68757 53085 > > p99 Files/sec: 68757 53085 > > So how much of this is from changing the dentry referenced > behaviour, and how much from the inode? Can you separate out the two > changes so we know which one is actually affecting reclaim > performance? FWIW, I just went to run my usual zero-length file creation fsmark test (16-way create, large sparse FS image on SSDs). XFS (with debug enabled) takes 4m10s to run at an average of ~230k files/s, with a std deviation of +/-1.7k files/s. Unfortunately, btrfs turns that into more heat than it ever has done before. It's only managing 35k files/s and the profile looks like this: 58.79% [kernel] [k] __pv_queued_spin_lock_slowpath 5.61% [kernel] [k] queued_write_lock_slowpath 1.65% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock 1.38% [kernel] [k] reschedule_interrupt 1.08% [kernel] [k] _raw_spin_lock 0.92% [kernel] [k] __radix_tree_lookup 0.86% [kernel] [k] _raw_spin_lock_irqsave 0.83% [kernel] [k] btrfs_set_lock_blocking_rw I killed it because this would take too long to run. I reduced the concurrency down to 4-way, spinlock contention went down to about 40% of the CPU time. I reduced the concurrency down to 2 and saw about 16% of CPU time being spent in lock contention. Throughput results: btrfs throughput 2-way 4-way unpatched 46938.1+/-2.8e+03 40273.4+/-3.1e+03 patched 45697.2+/-2.4e+03 49287.1+/-3e+03 So, 2-way has not improved. If changing referenced behaviour was an obvious win for btrfs, we'd expect to see that here as well. however, because 4-way improved by 20%, I think all we're seeing is a slight change in lock contention levels inside btrfs. Indeed, looking at the profiles I see that lock contention time was reduced to around 32% of the total CPU used (down by about 20%): 20.79% [kernel] [k] __pv_queued_spin_lock_slowpath 3.85% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock 3.68% [kernel] [k] _raw_spin_lock 3.40% [kernel] [k] queued_write_lock_slowpath ..... IOWs, the performance increase comes from the fact that changing inode cache eviction patterns causes slightly less lock contention in btrfs inode reclaim. IOWs, the problem that needs fixing is the btrfs lock contention, not the VFS cache LRU algorithms. Root cause analysis needs to be done properly before behavioural changes like this are proposed, people! -Dave. 
PS: I caught this profile on unmount when the 8 million cached inodes were being reclaimed. evict_inodes() ignores the referenced bit, so this gives a lot of insight into the work being done by inode reclaim in a filesystem: 18.54% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock 9.43% [kernel] [k] rb_erase 8.03% [kernel] [k] __btrfs_release_delayed_node 7.23% [kernel] [k] _raw_spin_lock 6.93% [kernel] [k] __list_del_entry 4.35% [kernel] [k] __slab_free 3.93% [kernel] [k] __mutex_lock_slowpath 2.77% [kernel] [k] bit_waitqueue 2.58% [kernel] [k] kmem_cache_alloc 2.50% [kernel] [k] __radix_tree_lookup 2.44% [kernel] [k] _raw_spin_lock_irq 2.18% [kernel] [k] kmem_cache_free 2.17% [kernel] [k] evict <<<<<<<<<<< 2.13% [kernel] [k] fsnotify_destroy_marks 1.68% [kernel] [k] btrfs_remove_delayed_node 1.61% [kernel] [k] __call_rcu.constprop.70 1.50% [kernel] [k] __remove_inode_hash 1.49% [kernel] [k] kmem_cache_alloc_trace 1.39% [kernel] [k] ___might_sleep 1.15% [kernel] [k] __memset 1.12% [kernel] [k] __mutex_unlock_slowpath 1.03% [kernel] [k] evict_inodes <<<<<<<<<< 1.02% [kernel] [k] cmpxchg_double_slab.isra.66 0.93% [kernel] [k] free_extent_map 0.83% [kernel] [k] _raw_write_lock 0.69% [kernel] [k] __might_sleep 0.56% [kernel] [k] _raw_spin_unlock The VFS inode cache traversal to evict inodes is a very small part of the CPU usage being recorded. btrfs lock traffic alone accounts for more than 10x as much CPU usage as inode cache traversal....
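The flat profiles quoted above and elsewhere in this thread are in perf's kernel-symbol format; a comparable profile can be captured during a run with something like the following (an illustrative invocation, not the exact commands used for the numbers above):

# system-wide kernel profile for ~30 seconds while the workload (or unmount) runs
perf record -a -g -- sleep 30
perf report -n --sort symbol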
On 10/25/2016 06:01 PM, Dave Chinner wrote: > On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote: >> With anything that populates the inode/dentry cache with a lot of one time use >> inodes we can really put a lot of pressure on the system for things we don't >> need to keep in cache. It takes two runs through the LRU to evict these one use >> entries, and if you have a lot of memory you can end up with 10's of millions of >> entries in the dcache or icache that have never actually been touched since they >> were first instantiated, and it will take a lot of CPU and a lot of pressure to >> evict all of them. >> >> So instead do what we do with pagecache, only set the *REFERENCED flags if we >> are being used after we've been put onto the LRU. This makes a significant >> difference in the system's ability to evict these useless cache entries. With a >> fs_mark workload that creates 40 million files we get the following results (all >> in files/sec) > > What's the workload, storage, etc? Oops sorry I thought I said it. It's fs_mark creating 20 million empty files on a single NVME drive. > >> Btrfs Patched Unpatched >> Average Files/sec: 72209.3 63254.2 >> p50 Files/sec: 70850 57560 >> p90 Files/sec: 68757 53085 >> p99 Files/sec: 68757 53085 > > So how much of this is from changing the dentry referenced > behaviour, and how much from the inode? Can you separate out the two > changes so we know which one is actually affecting reclaim > performance? > > Indeed, I wonder if just changing the superblock shrinker > default_seeks for btrfs would have exactly the same impact because > that canbe used to exactly double the reclaim scan rate for the same > memory pressure. If that doesn't change performance by a similar > amount (changing defaults seeks is the normal way of changing > shrinker balance), then more digging is required here to explain why > the referenced bits make such an impact to steady state > performance... > I'll tease out the impact of changing dcache vs icache vs both. Yeah I'll reduce default_seeks and see what that turns out to be. >> XFS Patched Unpatched >> Average Files/sec: 61025.5 60719.5 >> p50 Files/sec: 60107 59465 >> p90 Files/sec: 59300 57966 >> p99 Files/sec: 59227 57528 > > You made XFS never use I_REFERENCED at all (hint: not all > filesystems use find_inode/find_inode_fast()), so it's not clear > that the extra scanning (less than 1% difference in average > throughput) is actuallly the cause of things being slower in btrfs. > >> The reason Btrfs has a much larger improvement is because it holds a lot more >> things in memory so benefits more from faster slab reclaim, but across the board >> is an improvement for each of the file systems. > > Less than 1% for XFS and ~1.5% for ext4 is well within the > run-to-run variation of fsmark. It looks like it might be slightly > faster, but it's not a cut-and-dried win for anything other than > btrfs. > Sure the win in this benchmark is clearly benefiting btrfs the most, but I think the overall approach is sound and likely to help everybody in theory. Inside FB we definitely have had problems where the memory pressure induced by some idi^H^H^Hprocess goes along and runs find / which causes us to evict real things that are being used rather than these one use inodes. This sort of behavior could possibly be mitigated by this patch, but I haven't sat down to figure out a reliable way to mirror this workload to test that theory. 
Thanks, Josef
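For reference, the kind of fs_mark invocation Josef describes (zero-length files, no syncing, one working directory or btrfs subvolume per thread) is shaped roughly like this -- the flags, counts and paths below are illustrative only; the actual script is linked later in the thread:

# zero-length files, no syncing, one working directory (btrfs subvolume) per thread
fs_mark -S 0 -s 0 -n 100000 -L 25 \
        -d /mnt/btrfs/subvol0 -d /mnt/btrfs/subvol1 \
        -d /mnt/btrfs/subvol2 -d /mnt/btrfs/subvol3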
On 10/25/2016 07:36 PM, Dave Chinner wrote: > On Wed, Oct 26, 2016 at 09:01:13AM +1100, Dave Chinner wrote: >> On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote: >>> With anything that populates the inode/dentry cache with a lot of one time use >>> inodes we can really put a lot of pressure on the system for things we don't >>> need to keep in cache. It takes two runs through the LRU to evict these one use >>> entries, and if you have a lot of memory you can end up with 10's of millions of >>> entries in the dcache or icache that have never actually been touched since they >>> were first instantiated, and it will take a lot of CPU and a lot of pressure to >>> evict all of them. >>> >>> So instead do what we do with pagecache, only set the *REFERENCED flags if we >>> are being used after we've been put onto the LRU. This makes a significant >>> difference in the system's ability to evict these useless cache entries. With a >>> fs_mark workload that creates 40 million files we get the following results (all >>> in files/sec) >> >> What's the workload, storage, etc? >> >>> Btrfs Patched Unpatched >>> Average Files/sec: 72209.3 63254.2 >>> p50 Files/sec: 70850 57560 >>> p90 Files/sec: 68757 53085 >>> p99 Files/sec: 68757 53085 >> >> So how much of this is from changing the dentry referenced >> behaviour, and how much from the inode? Can you separate out the two >> changes so we know which one is actually affecting reclaim >> performance? (Ok figured out the problem, apparently outlook thinks you are spam and has been filtering all your emails into its junk folder without giving them to me. I've fixed this so now we should be able to get somewhere) > > FWIW, I just went to run my usual zero-length file creation fsmark > test (16-way create, large sparse FS image on SSDs). XFS (with debug > enabled) takes 4m10s to run at an average of ~230k files/s, with a > std deviation of +/-1.7k files/s. > > Unfortunately, btrfs turns that into more heat than it ever has done > before. It's only managing 35k files/s and the profile looks like > this: > > 58.79% [kernel] [k] __pv_queued_spin_lock_slowpath > 5.61% [kernel] [k] queued_write_lock_slowpath > 1.65% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock > 1.38% [kernel] [k] reschedule_interrupt > 1.08% [kernel] [k] _raw_spin_lock > 0.92% [kernel] [k] __radix_tree_lookup > 0.86% [kernel] [k] _raw_spin_lock_irqsave > 0.83% [kernel] [k] btrfs_set_lock_blocking_rw > > I killed it because this would take too long to run. > > I reduced the concurrency down to 4-way, spinlock contention went > down to about 40% of the CPU time. I reduced the concurrency down > to 2 and saw about 16% of CPU time being spent in lock contention. > > Throughput results: > btrfs throughput > 2-way 4-way > unpatched 46938.1+/-2.8e+03 40273.4+/-3.1e+03 > patched 45697.2+/-2.4e+03 49287.1+/-3e+03 > > > So, 2-way has not improved. If changing referenced behaviour was an > obvious win for btrfs, we'd expect to see that here as well. > however, because 4-way improved by 20%, I think all we're seeing is > a slight change in lock contention levels inside btrfs. Indeed, > looking at the profiles I see that lock contention time was reduced > to around 32% of the total CPU used (down by about 20%): > > 20.79% [kernel] [k] __pv_queued_spin_lock_slowpath > 3.85% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock > 3.68% [kernel] [k] _raw_spin_lock > 3.40% [kernel] [k] queued_write_lock_slowpath > ..... 
> > IOWs, the performance increase comes from the fact that changing > inode cache eviction patterns causes slightly less lock contention > in btrfs inode reclaim. IOWs, the problem that needs fixing is the > btrfs lock contention, not the VFS cache LRU algorithms. > > Root cause analysis needs to be done properly before behavioural > changes like this are proposed, people! > We are doing different things. Yes when you do it all into the same subvol the lock contention is obviously much worse and the bottleneck, but that's not what I'm doing, I'm creating a subvol for each thread to reduce the lock contention on the root nodes. If you retest by doing that then you will see the differences I was seeing. Are you suggesting that I just made these numbers up? Instead of assuming I'm an idiot and don't know how to root cause issues, please instead ask what is different from your run versus my run. Thanks, Josef
On Wed, Oct 26, 2016 at 04:03:54PM -0400, Josef Bacik wrote: > On 10/25/2016 07:36 PM, Dave Chinner wrote: > >So, 2-way has not improved. If changing referenced behaviour was an > >obvious win for btrfs, we'd expect to see that here as well. > >however, because 4-way improved by 20%, I think all we're seeing is > >a slight change in lock contention levels inside btrfs. Indeed, > >looking at the profiles I see that lock contention time was reduced > >to around 32% of the total CPU used (down by about 20%): > > > > 20.79% [kernel] [k] __pv_queued_spin_lock_slowpath > > 3.85% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock > > 3.68% [kernel] [k] _raw_spin_lock > > 3.40% [kernel] [k] queued_write_lock_slowpath > > ..... > > > >IOWs, the performance increase comes from the fact that changing > >inode cache eviction patterns causes slightly less lock contention > >in btrfs inode reclaim. IOWs, the problem that needs fixing is the > >btrfs lock contention, not the VFS cache LRU algorithms. > > > >Root cause analysis needs to be done properly before behavioural > >changes like this are proposed, people! > > > > We are doing different things. Well, no, we're doing the same thing. Except... > Yes when you do it all into the same > subvol the lock contention is obviously much worse and the > bottleneck, but that's not what I'm doing, I'm creating a subvol for > each thread to reduce the lock contention on the root nodes. .. you have a non-default filesystem config that you didn't mention even in response to a /direct question/. Seriously, this is a major configuration detail that is necessary for anyone who wants to replicate your results! The difference is that I'm using is the /default/ btrfs filesystem configuration and you're using a carefully contrived structure that is designed to work around fundamental scalability issues the benchmark normally exposes. This raises another question: Does this subvol setup reflect production configurations, or is it simply a means to get the benchmark to put enough load on the inode cache to demonstrate the effect of changing the reference bits? > If you > retest by doing that then you will see the differences I was seeing. > Are you suggesting that I just made these numbers up? No, I'm suggesting that you haven't explained the problem or how to reproduce it in sufficient detail. You haven't explained why reducing scanning (which in and of iteself has minimal overhead and is not visible to btrfs) has such a marked effect on the behaviour of btrfs. You haven't given us details on the workload or storage config so we can reproduce the numbers (and still haven't!). You're arguing that numbers far below variablity and accuracy measurement limits are significant (i.e. the patched numbers are better for XFS and ext4). There's lots of information you've chosen to omit that we need to know. To fill in the gaps, I've done some analysis, provided perf numbers and CPU profiles, and put them into an explanation that explains why there is a difference in performance for a default btrfs config. I note that you've not actually addressed that analysis and the problems it uncovers, but instead just said "I'm using a different configuration". What about all the btrfs users that aren't using your configuration that will see these problems? How does the scanning change actually benefit them more than fixing the locking issues that exist? Memory reclaim and LRU algorithms affect everyone, not just a benchmark configuration. 
We need to know a lot more about the issue at hand to be able to evaluate whether this is the /right solution for the problem./ > Instead of > assuming I'm an idiot and don't know how to root cause issues please > instead ask what is different from your run versus my run. Thanks, I'm assuming you're a competent engineer, Josef, who knows that details about benchmarks and performance measurement are extremely important, and that someone reviewing code needs to understand why the behaviour change had the impact it did to be able to evaluate it properly. I'm assuming that you also know that if you understand the root cause of a problem, you put it in the commit message so that everyone else knows it, too. So if you know the root cause of the btrfs problem that reducing reclaim scanning avoids, then please explain it. Cheers, Dave.
On Wed, Oct 26, 2016 at 11:11:35AM -0400, Josef Bacik wrote: > On 10/25/2016 06:01 PM, Dave Chinner wrote: > >On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote: > >>With anything that populates the inode/dentry cache with a lot of one time use > >>inodes we can really put a lot of pressure on the system for things we don't > >>need to keep in cache. It takes two runs through the LRU to evict these one use > >>entries, and if you have a lot of memory you can end up with 10's of millions of > >>entries in the dcache or icache that have never actually been touched since they > >>were first instantiated, and it will take a lot of CPU and a lot of pressure to > >>evict all of them. > >> > >>So instead do what we do with pagecache, only set the *REFERENCED flags if we > >>are being used after we've been put onto the LRU. This makes a significant > >>difference in the system's ability to evict these useless cache entries. With a > >>fs_mark workload that creates 40 million files we get the following results (all > >>in files/sec) > > > >What's the workload, storage, etc? > > Oops sorry I thought I said it. It's fs_mark creating 20 million > empty files on a single NVME drive. How big is the drive/filesystem (e.g. has impact on XFS allocation concurrency)? And multiple btrfs subvolumes, too, by the sound of it. How did you set those up? What about concurrency, directory sizes, etc? Can you post the fsmark command line as these details do actually matter... Getting the benchmark configuration to reproduce posted results should not require playing 20 questions! > >>The reason Btrfs has a much larger improvement is because it holds a lot more > >>things in memory so benefits more from faster slab reclaim, but across the board > >>is an improvement for each of the file systems. > > > >Less than 1% for XFS and ~1.5% for ext4 is well within the > >run-to-run variation of fsmark. It looks like it might be slightly > >faster, but it's not a cut-and-dried win for anything other than > >btrfs. > > > > Sure the win in this benchmark is clearly benefiting btrfs the most, > but I think the overall approach is sound and likely to help > everybody in theory. Yup, but without an explanation of why it makes such a big change to btrfs, we can't really say what effect it's really going to have. Why does cycling the inode a second time through the LRU make any difference? Once we're in steady state on this workload, one or two cycles through the LRU should make no difference at all to performance because all the inodes are instantiated in identical states (including the referenced bit) and so scanning treats every inode identically. i.e. the reclaim rate (i.e. how often evict_inode() is called) should be exactly the same and the only difference is the length of time between the inode being put on the LRU and when it is evicted. Is there an order difference in reclaim as a result of earlier reclaim? Is btrfs_evict_inode() blocking on a sleeping lock in btrfs rather than contending on spinning locks? Is it having to wait for transaction commits or some other metadata IO because reclaim is happening earlier? i.e. The question I'm asking is what, exactly, leads to such a marked performance improvement in steady state behaviour? I want to know because if there's behavioural changes in LRU reclaim order having a significant effect on affecting btrfs, then there is going to be some effects on other filesystems, too. 
Maybe not in this benchmark, but we can't anticipate potential problems if we don't understand exactly what is going on here. > Inside FB we definitely have had problems > where the memory pressure induced by some idi^H^H^Hprocess goes > along and runs find / which causes us to evict real things that are > being used rather than these one use inodes. That's one of the problems the IO throttling in the XFS shrinker tends to avoid. i.e. This is one of the specific cases we expect to see on all production systems - backup applications are the common cause of regular full filesystem traversals. FWIW, there's an element of deja vu in this thread: that XFS inode cache shrinker IO throttling is exactly what Chris proposed we gut last week to solve some other FB memory reclaim problem that had no explanation of the root cause. (http://www.spinics.net/lists/linux-xfs/msg01541.html) > This sort of behavior > could possibly be mitigated by this patch, but I haven't sat down to > figure out a reliable way to mirror this workload to test that > theory. Thanks I use fsmark to create filesystems with tens of millions of small files, then do things like run concurrent greps repeatedly over a small portion of the directory structure (e.g. 1M cached inodes and 4GB worth of cached data) and then run concurrent find traversals across the entire filesystem to simulate this sort of use-once vs working set reclaim... Cheers, Dave.
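A sketch of the use-once vs working-set test Dave describes above (paths, thread counts and search strings are placeholders, not his actual setup):

# keep a small working set hot: repeated greps over one part of the tree
for i in 1 2 3 4; do
        ( while :; do grep -r needle /mnt/scratch/0/dir_00 > /dev/null; done ) &
done

# generate use-once inode/dentry pressure: concurrent full-filesystem traversals
find_pids=""
for i in 1 2 3 4; do
        find /mnt/scratch -type f > /dev/null &
        find_pids="$find_pids $!"
done
wait $find_pids
# stop the background grep loops once the traversals have finished
kill %1 %2 %3 %4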
On 10/26/2016 08:30 PM, Dave Chinner wrote: > On Wed, Oct 26, 2016 at 11:11:35AM -0400, Josef Bacik wrote: >> On 10/25/2016 06:01 PM, Dave Chinner wrote: >>> On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote: >>>> With anything that populates the inode/dentry cache with a lot of one time use >>>> inodes we can really put a lot of pressure on the system for things we don't >>>> need to keep in cache. It takes two runs through the LRU to evict these one use >>>> entries, and if you have a lot of memory you can end up with 10's of millions of >>>> entries in the dcache or icache that have never actually been touched since they >>>> were first instantiated, and it will take a lot of CPU and a lot of pressure to >>>> evict all of them. >>>> >>>> So instead do what we do with pagecache, only set the *REFERENCED flags if we >>>> are being used after we've been put onto the LRU. This makes a significant >>>> difference in the system's ability to evict these useless cache entries. With a >>>> fs_mark workload that creates 40 million files we get the following results (all >>>> in files/sec) >>> >>> What's the workload, storage, etc? >> >> Oops sorry I thought I said it. It's fs_mark creating 20 million >> empty files on a single NVME drive. > > How big is the drive/filesystem (e.g. has impact on XFS allocation > concurrency)? And multiple btrfs subvolumes, too, by the sound of > it. How did you set those up? What about concurrency, directory > sizes, etc? Can you post the fsmark command line as these details > do actually matter... > > Getting the benchmark configuration to reproduce posted results > should not require playing 20 questions! This is the disk Disk /dev/nvme0n1: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors This is the script https://paste.fedoraproject.org/461874/73910147/1 It's on a 1 socket 8 core cpu with 16gib of ram. > >>>> The reason Btrfs has a much larger improvement is because it holds a lot more >>>> things in memory so benefits more from faster slab reclaim, but across the board >>>> is an improvement for each of the file systems. >>> >>> Less than 1% for XFS and ~1.5% for ext4 is well within the >>> run-to-run variation of fsmark. It looks like it might be slightly >>> faster, but it's not a cut-and-dried win for anything other than >>> btrfs. >>> >> >> Sure the win in this benchmark is clearly benefiting btrfs the most, >> but I think the overall approach is sound and likely to help >> everybody in theory. > > Yup, but without an explanation of why it makes such a big change to > btrfs, we can't really say what effect it's really going to have. > Why does cycling the inode a second time through the LRU make any > difference? Once we're in steady state on this workload, one or two > cycles through the LRU should make no difference at all to > performance because all the inodes are instantiated in identical > states (including the referenced bit) and so scanning treats every > inode identically. i.e. the reclaim rate (i.e. how often > evict_inode() is called) should be exactly the same and the only > difference is the length of time between the inode being put on the > LRU and when it is evicted. > > Is there an order difference in reclaim as a result of earlier > reclaim? Is btrfs_evict_inode() blocking on a sleeping lock in btrfs > rather than contending on spinning locks? Is it having to wait for > transaction commits or some other metadata IO because reclaim is > happening earlier? i.e. 
The question I'm asking is what, exactly, > leads to such a marked performance improvement in steady state > behaviour? I would have seen this in my traces. There's tooooons of places to improve btrfs's performance and behavior here no doubt. But simply moving from pagecache to a slab shrinker shouldn't have drastically changed how we perform in this test. I feel like the shrinkers need to be used differently, but I completely destroyed vmscan.c trying different approaches and none of them made a difference like this patch made. From what I was seeing in my trace we were simply reclaiming less per kswapd scan iteration with the old approach vs. the new approach. > > I want to know because if there's behavioural changes in LRU reclaim > order having a significant effect on affecting btrfs, then there is > going to be some effects on other filesystems, too. Maybe not in > this benchmark, but we can't anticipate potential problems if we > don't understand exactly what is going on here. So I'll just describe to you what I was seeing and maybe we can work out where we think the problem is. 1) We go at X speed until we fill up all of the memory with the various caches. 2) We lost about 15% when kswapd kicks in and it never recovered. Doing tracing I was seeing that we would try to scan very small batches of objects, usually less than the batch size, because with btrfs not using pagecache for anything and the shrinker scanning being based on how much pagecache was scanned our scan totals would be very much smaller than the total cache size. I originally thought this was the problem, but short of just forcing us to scan the whole cache every time nothing I did made any difference. Once we'd scanned through the entire LRU at once suddenly we started reclaiming almost as many objects as we scanned, as opposed to < 10 items. This is because of the whole referenced thing. So we'd see a small perf bump when we did manage to do this, and then it would drop again once the lru was full of fresh inodes again. So we'd get this saw-toothy sort of reclaim. This is when I realized the whole REFERENCED thing was probably screwing us, so I went to look at what pagecache did to eliminate this problem. They do two things, first start all pages on the inactive list, and second balance between scanning the active list vs inactive list based on the size of either list. I rigged up something to do an active/inactive list inside of list_lru which again did basically nothing. So I changed how the REFERENCED was marked to match how pagecache did things and viola, my performance was back. I did all sorts of tracing to try and figure out what exactly it was about the reclaim that made everything so much slower, but with that many things going on it was hard to decide what was noise and what was actually causing the problem. Without this patch I see kswapd stay at higher CPU usage for longer, because it has to keep scanning things to satisfy the high watermark and it takes scanning multiple million object lists multiple times before it starts to make any progress towards reclaim. This logically makes sense and is not ideal behavior. > >> Inside FB we definitely have had problems >> where the memory pressure induced by some idi^H^H^Hprocess goes >> along and runs find / which causes us to evict real things that are >> being used rather than these one use inodes. > > That's one of the problems the IO throttling in the XFS shrinker > tends to avoid. i.e. 
This is one of the specific cases we expect to see > on all production systems - backup applications are the common cause > of regular full filesystem traversals. > > FWIW, there's an element of deja vu in this thread: that XFS inode > cache shrinker IO throttling is exactly what Chris proposed we gut > last week to solve some other FB memory reclaim problem that had no > explanation of the root cause. > > (http://www.spinics.net/lists/linux-xfs/msg01541.html) > >> This sort of behavior >> could possibly be mitigated by this patch, but I haven't sat down to >> figure out a reliable way to mirror this workload to test that >> theory. Thanks > > I use fsmark to create filesystems with tens of millions of small > files, then do things like run concurrent greps repeatedly over a > small portion of the directory structure (e.g. 1M cached inodes > and 4GB worth of cached data) and then run concurrent find > traversals across the entire filesystem to simulate this sort of > use-once vs working set reclaim... > Yeah, that sounds reasonable; like I said, I just haven't tried to test it as my numbers got bigger and I was happy with that. I'll rig this up and see how it performs with my patch to see if there's a significant difference with FS's other than btrfs. Thanks, Josef
On Thu, Oct 27, 2016 at 09:13:44AM -0400, Josef Bacik wrote: > On 10/26/2016 08:30 PM, Dave Chinner wrote: > >On Wed, Oct 26, 2016 at 11:11:35AM -0400, Josef Bacik wrote: > >>On 10/25/2016 06:01 PM, Dave Chinner wrote: > >>>On Tue, Oct 25, 2016 at 02:41:44PM -0400, Josef Bacik wrote: > >>>>With anything that populates the inode/dentry cache with a lot of one time use > >>>>inodes we can really put a lot of pressure on the system for things we don't > >>>>need to keep in cache. It takes two runs through the LRU to evict these one use > >>>>entries, and if you have a lot of memory you can end up with 10's of millions of > >>>>entries in the dcache or icache that have never actually been touched since they > >>>>were first instantiated, and it will take a lot of CPU and a lot of pressure to > >>>>evict all of them. > >>>> > >>>>So instead do what we do with pagecache, only set the *REFERENCED flags if we > >>>>are being used after we've been put onto the LRU. This makes a significant > >>>>difference in the system's ability to evict these useless cache entries. With a > >>>>fs_mark workload that creates 40 million files we get the following results (all > >>>>in files/sec) > >>> > >>>What's the workload, storage, etc? > >> > >>Oops sorry I thought I said it. It's fs_mark creating 20 million > >>empty files on a single NVME drive. > > > >How big is the drive/filesystem (e.g. has impact on XFS allocation > >concurrency)? And multiple btrfs subvolumes, too, by the sound of > >it. How did you set those up? What about concurrency, directory > >sizes, etc? Can you post the fsmark command line as these details > >do actually matter... > > > >Getting the benchmark configuration to reproduce posted results > >should not require playing 20 questions! > > This is the disk > > Disk /dev/nvme0n1: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors > > This is the script > > https://paste.fedoraproject.org/461874/73910147/1 Link no workee. This does, though: https://paste.fedoraproject.org/461874/73910147 > > It's on a 1 socket 8 core cpu with 16gib of ram. Thanks! > >Is there an order difference in reclaim as a result of earlier > >reclaim? Is btrfs_evict_inode() blocking on a sleeping lock in btrfs > >rather than contending on spinning locks? Is it having to wait for > >transaction commits or some other metadata IO because reclaim is > >happening earlier? i.e. The question I'm asking is what, exactly, > >leads to such a marked performance improvement in steady state > >behaviour? > > I would have seen this in my traces. There's tooooons of places to > improve btrfs's performance and behavior here no doubt. But simply > moving from pagecache to a slab shrinker shouldn't have drastically > changed how we perform in this test. Not sure what you mean by this. Do you mean that because of the preceeding patches that change the page cache accounting for btrfs metadata there is now not enough pressure being placed on the btrfs inode cache shrinker? FWIW, this zero length file fsmark workload produces zero page cache pressure on XFS. If the case is that btrfs is no longer using the pa > I feel like the shrinkers need > to be used differently, but I completely destroyed vmscan.c trying > different approaches and none of them made a difference like this > patch made. From what I was seeing in my trace we were simply > reclaiming less per kswapd scan iteration with the old approach vs. > the new approach. 
Yup, that's what default_seeks is supposed to be used to tune - the relative pressure that should be placed on the slab compared to everything else. > >I want to know because if there's behavioural changes in LRU reclaim > >order having a significant effect on affecting btrfs, then there is > >going to be some effects on other filesystems, too. Maybe not in > >this benchmark, but we can't anticipate potential problems if we > >don't understand exactly what is going on here. > > So I'll just describe to you what I was seeing and maybe we can work > out where we think the problem is. > > 1) We go at X speed until we fill up all of the memory with the various caches. > 2) We lost about 15% when kswapd kicks in and it never recovered. That's expected behaviour (i.e. that's exactly what I see when XFS fills memory and reclaim gates new allocations). Memory reclaim does not come for free. > Doing tracing I was seeing that we would try to scan very small > batches of objects, usually less than the batch size, because with > btrfs not using pagecache for anything and the shrinker scanning > being based on how much pagecache was scanned our scan totals would > be very much smaller than the total cache size. I originally > thought this was the problem, but short of just forcing us to scan > the whole cache every time nothing I did made any difference. So this is from a kernel just running your patch 5 (the referenced bit changes) and samples were taken for about a minute at steady state after memory filled (about 20m files into a 16-way 50m file create with subvols running at ~130k files/s in steady state) using: # ~/trace-cmd/trace-cmd record -e mm_shrink_slab\* There were this many shrinker invocations, but all of them came from kswapd (no direct reclaim at all). $ grep 0xffff88036a2774c0 t.t |grep shrink_slab_start |wc -l 1923 The amount of reclaim each shrinker invocation did was: # grep 0xffff88036a2774c0 t.t |awk -e '/mm_shrink_slab_end/ { print $23 }' |sort -n |uniq -c |sort -nr 623 1025 481 0 172 3075 145 2050 131 6150 82 12300 54 7175 40 4100 40 11275 34 5125 32 14350 30 13325 19 23575 18 24600 7 15375 4 8200 3 22550 3 10250 2 47150 2 16400 1 9225 1 25625 1 14349 1 11274 I don't see any partial batches there, but I also don't see enough reclaim pressure being put on the inode cache. I'll come back to this, but first let's look at how XFS behaves, because it doesn't create any page cache pressure on these workloads: $ grep 0xffff880094aea4c0 t.t |awk -e '/mm_shrink_slab_end/ { print $23 }' |sort -n |uniq -c |sort -nr 1059 0 5 1000 4 7045 4 1008 3 7970 3 7030 2 996 2 995 2 962 2 9005 2 8961 2 8937 2 8062 2 8049 ..... Very different look to the work sizes. Sorting on the reclaim batch sizes shows this: 1 901131 1 874354 1 872911 1 847679 1 760696 1 616491 1 590231 1 542313 1 517882 1 513400 1 507431 1 479111 1 439889 1 433734 1 410811 1 388127 ..... There are massive batches of inodes being reclaimed. Let's have a look at one of those traces more closely: kswapd1-862 [001] 183735.856649: mm_shrink_slab_start: super_cache_scan+0x0 0xffff880094aea4c0: nid: 1 objects to shrink 882936 gfp_flags GFP_KERNEL pgs_scanned 32 lru_pgs 6236 cache items 2416122 delta 24792 total_scan 907728 So here we see only 32 pages were scanned in the page cache, of a very small 6236 pages - there is no page cache to speak of, and the 32 pages scanned is the defined minimum for the page cache. However, the calculated delta is still reasonable at 24792, with the cache size at 2416122 objects.
It's reasonable because the delta is almost exactly 1% of the cache size and that's a good reclaim batch size for a single shrinker invocation. However, total_scan is at 907728, which means there have been lots of shrinker invocations that have not done work due to GFP_NOFS context, and so that work has been deferred. kswapd is invoked with GFP_KERNEL context, so we'd expect it to do most of that work right now: kswapd1-862 [009] 183737.582875: mm_shrink_slab_end: super_cache_scan+0x0 0xffff880094aea4c0: nid: 1 unused scan count 882936 new scan count 464 total_scan 464 last shrinker return val 874354 And it does - it freed 874354 objects from the cache, and the deferred scan count returned to the shrinker was only 464. I'll also point out that the reclaim rate here is about 500,000 inodes/s (per kswapd) so we've got plenty of headroom and scalability here. This is behaving exactly as we'd expect if there was no page cache pressure and substantial GFP_NOFS allocations. THe delta calculation is in the correct ballpark, the reclaim batch size is being determined by the amount of work we are deferring to kswapd. Just to show what GFP_NOFS windup from direct reclaim looks like: fs_mark-30202 [006] 183695.757995: mm_shrink_slab_start: super_cache_scan+0x0 0xffff880094aea4c0: nid: 1 objects to shrink 64064 gfp_flags GFP_NOFS|__GFP_NOWARN|__GFP_COMP|__GFP_NOTRACK pgs_scanned 32 lru_pgs 6697 cache items 2236078 delta 21365 total_scan 85429 fs_mark-30202 [006] 183695.757995: mm_shrink_slab_end: super_cache_scan+0x0 0xffff880094aea4c0: nid: 1 unused scan count 64064 new scan count 85429 total_scan 85429 last shrinker return val 0 incoming deferred work was 64064 objects, delta added was 21365 (again, 1% of total cache size), zero objects were freed, outgoing deferred work in now 85429 objects. So we can see that regardless of the invocation context of the shrinker, XFS is asking for 1% of the cache to be removed. Knowing this, let's go back to the btrfs traces and examine that: kswapd1-862 [013] 185883.279076: mm_shrink_slab_start: super_cache_scan+0x0 0xffff8803329404c0: nid: 1 objects to shrink 1006 gfp_flags GFP_KERNEL pgs_scanned 203 lru_pgs 251067 cache items 1787431 delta 2890 total_scan 3896 Ok, the first thing to note here is that the delta is much smaller than what we see with XFS. That is also no obvious deferred work, which we'd expect because there is no direct reclaim records in the trace for btrfs. It's pretty clear from this that each shrinker invocation is asking for about 0.15% of the cache to be scanned - there's much lower pressure being put on the shrinker when compared to XFS. Hence it will take 5-10x as many invocations of the shrinker to do the same amount of work. Occasionally we see the page cache get scanned enough to bring deltas up to 1% of the cache size: kswapd1-862 [013] 185883.943647: mm_shrink_slab_start: super_cache_scan+0x0 0xffff8803329404c0: nid: 1 objects to shrink 459 gfp_flags GFP_KERNEL pgs_scanned 1585 lru_pgs 249194 cache items 1784345 delta 22698 total_scan 23157 See the difference here? pgs_scanned is much higher than the trace from than half a second ago, and now the delta reaches that ~1% of cache size number. This is an outlier, though, but what it indicates is that the problem is with the amount of page cache scanning being done during reclaim. There are two ways to deal with this: fix whatever problem is causing the page cache to be underscanned, or make the pgs_scanned:shrinker delta ratio larger. 
The latter can be done via changing the shrinker seek count or modifying /proc/sys/vm/vfs_cache_pressure to make the cache seem 10x larger than it really is.... > Once we'd scanned through the entire LRU at once suddenly we started > reclaiming almost as many objects as we scanned, as opposed to < 10 > items. This is because of the whole referenced thing. That looks like a side effect of not generating enough pressure on the shrinker. > So we'd see > a small perf bump when we did manage to do this, and then it would > drop again once the lru was full of fresh inodes again. So we'd get > this saw-toothy sort of reclaim. Yup - but in normal conditions this workload should result in a relatively even mix of referenced and unreferenced inodes on the LRU, so this behaviour should not happen. Indeed, I just pulled the referenced patch out of the kernel and reran the XFS tests, and I see regular "delta 24000, freed 13000" traces. There are much fewer huge reclaim peaks (i.e. less GFP_NOFS windup) and no two shrinker invocations freed the exact same number of objects. $ grep 0xffff88033a638cc0 t.t |awk -e '/mm_shrink_slab_end/ { print $23 }' |sort -n |uniq -c |sort -nr -k2 1 381458 1 333756 1 90265 1 88664 1 82371 1 82340 1 81849 1 80772 1 79268 i.e. changing the referenced bit behaviour has made the XFS inode shrinker too aggressive for this workload, so rebalancing would be necessary... At the same time, I changed btrfs to have "seeks = 1". This shows /substantially/ more sustained and larger amounts of pressure being put on the btrfs inode shrinker than the referenced patch: $ grep 0xffff88004ed79cc0: b.t |awk -e '/mm_shrink_slab_end/ { print $23 }' |sort -n |uniq -c |sort -nr -k2 1 59021 1 56271 1 53743 1 52885 1 47992 1 44442 1 41265 1 40718 1 36038 1 34872 .... .... but it still has a really, really long tail of small reclaim attempts so it's better but still not perfect.... Unfortunately, because default_seeks is only 2, we're going to have to change this value and the constants in do_shrink_slab() to give us more range to increase reclaim pressure in this manner.... > I did all sorts of tracing to try and figure out what exactly it was > about the reclaim that made everything so much slower, but with that > many things going on it was hard to decide what was noise and what > was actually causing the problem. Without this patch I see kswapd > stay at higher CPU usage for longer, because it has to keep scanning > things to satisfy the high watermark and it takes scanning multiple > million object lists multiple times before it starts to make any > progress towards reclaim. This logically makes sense and is not > ideal behavior. I would expect the elevated kswapd CPU is due to the fact it is being called so many more times to do the inode cache reclaim work. It has to do more page cache scanning because the shrinker is not removing the memory pressure, and so it's a downward spiral of inefficiency. Let's see what happens when we get the shrinker pressure to reliably scan 1% of the cache on every invocation.... Cheers, Dave.
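To make the arithmetic concrete: the deltas in the traces above are reproduced exactly by the scan-delta calculation sketched earlier in the thread, assuming seeks == DEFAULT_SEEKS (2) and the default vfs_cache_pressure of 100 (so ->count_objects() reports the raw cache size). A small stand-alone check (editor's illustration, not kernel code):

#include <stdio.h>

/* delta = (4 * pgs_scanned / seeks) * freeable / (lru_pgs + 1) */
static unsigned long long delta(unsigned long long pgs_scanned,
                                unsigned long long lru_pgs,
                                unsigned long long freeable,
                                unsigned long long seeks)
{
        return (4 * pgs_scanned / seeks) * freeable / (lru_pgs + 1);
}

int main(void)
{
        /* XFS trace:   pgs_scanned 32,  lru_pgs 6236,   cache items 2416122 */
        printf("xfs delta:   %llu\n", delta(32, 6236, 2416122, 2));     /* 24792 */
        /* btrfs trace: pgs_scanned 203, lru_pgs 251067, cache items 1787431 */
        printf("btrfs delta: %llu\n", delta(203, 251067, 1787431, 2));  /* 2890 */
        return 0;
}

Raising /proc/sys/vm/vfs_cache_pressure above 100, or lowering seeks, scales the same formula up, which is the rebalancing being discussed here.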
diff --git a/fs/dcache.c b/fs/dcache.c
index 5c7cc95..a558075 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -779,8 +779,6 @@ repeat:
 		goto kill_it;
 	}
 
-	if (!(dentry->d_flags & DCACHE_REFERENCED))
-		dentry->d_flags |= DCACHE_REFERENCED;
 	dentry_lru_add(dentry);
 
 	dentry->d_lockref.count--;
@@ -803,6 +801,13 @@ static inline void __dget_dlock(struct dentry *dentry)
 	dentry->d_lockref.count++;
 }
 
+static inline void __dget_dlock_reference(struct dentry *dentry)
+{
+	if (dentry->d_flags & DCACHE_LRU_LIST)
+		dentry->d_flags |= DCACHE_REFERENCED;
+	dentry->d_lockref.count++;
+}
+
 static inline void __dget(struct dentry *dentry)
 {
 	lockref_get(&dentry->d_lockref);
@@ -875,7 +880,7 @@ again:
 		    (alias->d_flags & DCACHE_DISCONNECTED)) {
 			discon_alias = alias;
 		} else {
-			__dget_dlock(alias);
+			__dget_dlock_reference(alias);
 			spin_unlock(&alias->d_lock);
 			return alias;
 		}
@@ -886,7 +891,7 @@ again:
 		alias = discon_alias;
 		spin_lock(&alias->d_lock);
 		if (S_ISDIR(inode->i_mode) || !d_unhashed(alias)) {
-			__dget_dlock(alias);
+			__dget_dlock_reference(alias);
 			spin_unlock(&alias->d_lock);
 			return alias;
 		}
@@ -2250,7 +2255,7 @@ struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
 		if (!d_same_name(dentry, parent, name))
 			goto next;
 
-		dentry->d_lockref.count++;
+		__dget_dlock_reference(dentry);
 		found = dentry;
 		spin_unlock(&dentry->d_lock);
 		break;
diff --git a/fs/inode.c b/fs/inode.c
index 7e3ef3a..5937d3a 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -796,6 +796,8 @@ repeat:
 			__wait_on_freeing_inode(inode);
 			goto repeat;
 		}
+		if (!list_empty(&inode->i_lru))
+			inode->i_state |= I_REFERENCED;
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
 		return inode;
@@ -823,6 +825,8 @@ repeat:
 			__wait_on_freeing_inode(inode);
 			goto repeat;
 		}
+		if (!list_empty(&inode->i_lru))
+			inode->i_state |= I_REFERENCED;
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
 		return inode;
@@ -1463,7 +1467,6 @@ static void iput_final(struct inode *inode)
 	drop = generic_drop_inode(inode);
 
 	if (!drop && (sb->s_flags & MS_ACTIVE)) {
-		inode->i_state |= I_REFERENCED;
 		inode_add_lru(inode);
 		spin_unlock(&inode->i_lock);
 		return;
With anything that populates the inode/dentry cache with a lot of one time use
inodes we can really put a lot of pressure on the system for things we don't
need to keep in cache.  It takes two runs through the LRU to evict these one use
entries, and if you have a lot of memory you can end up with 10's of millions of
entries in the dcache or icache that have never actually been touched since they
were first instantiated, and it will take a lot of CPU and a lot of pressure to
evict all of them.

So instead do what we do with pagecache, only set the *REFERENCED flags if we
are being used after we've been put onto the LRU.  This makes a significant
difference in the system's ability to evict these useless cache entries.  With a
fs_mark workload that creates 40 million files we get the following results (all
in files/sec)

Btrfs               Patched     Unpatched
Average Files/sec:  72209.3     63254.2
p50 Files/sec:      70850       57560
p90 Files/sec:      68757       53085
p99 Files/sec:      68757       53085

XFS                 Patched     Unpatched
Average Files/sec:  61025.5     60719.5
p50 Files/sec:      60107       59465
p90 Files/sec:      59300       57966
p99 Files/sec:      59227       57528

Ext4                Patched     Unpatched
Average Files/sec:  121785.4    119248.0
p50 Files/sec:      120852      119274
p90 Files/sec:      116703      112783
p99 Files/sec:      114393      104934

The reason Btrfs has a much larger improvement is because it holds a lot more
things in memory so benefits more from faster slab reclaim, but across the board
is an improvement for each of the file systems.

Signed-off-by: Josef Bacik <jbacik@fb.com>
---
 fs/dcache.c | 15 ++++++++++-----
 fs/inode.c  |  5 ++++-
 2 files changed, 14 insertions(+), 6 deletions(-)