Message ID: 20240202093407.12536-1-JonasZhou-oc@zhaoxin.com (mailing list archive)
State:      New
Series:     fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.
[That needs a review from Willy.]

On Fri, Feb 02, 2024 at 05:34:07PM +0800, JonasZhou-oc wrote:
> In the struct address_space, there is a 32-byte gap between i_mmap
> and i_mmap_rwsem. Due to the alignment of struct address_space
> variables to 8 bytes, in certain situations, i_mmap and
> i_mmap_rwsem may end up in the same CACHE line.
>
> While running Unixbench/execl, we observe high false sharing issues
> when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem
> after i_private_list, ensuring a 64-byte gap between i_mmap and
> i_mmap_rwsem.
>
> For Intel Silver machines (2 sockets) using kernel v6.8 rc-2, the
> score of Unixbench/execl improves by ~3.94%, and the score of
> Unixbench/shell improves by ~3.26%.
>
> Baseline:
> -------------------------------------------------------------
> 162 546 748 11374 21 0xffff92e266af90c0
> -------------------------------------------------------------
> 46.89% 44.65% 0.00% 0.00% 0x0 1 1 0xffffffff86d5fb96 460 258 271 1069 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1
> 4.21% 4.41% 0.00% 0.00% 0x4 1 1 0xffffffff86d0ed54 473 311 288 95 28 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1
> 0.00% 0.00% 0.04% 4.76% 0x8 1 1 0xffffffff86d4bcf1 0 0 0 5 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1
> 6.41% 6.02% 0.00% 0.00% 0x8 1 1 0xffffffff86d4ba85 411 271 339 210 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1
> 0.00% 0.00% 0.47% 95.24% 0x10 1 1 0xffffffff86d4bd34 0 0 0 74 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1
> 0.37% 0.13% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4f 328 212 380 7 5 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
> 5.13% 5.08% 0.00% 0.00% 0x10 1 1 0xffffffff86d4bb4b 416 255 357 197 32 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
> 1.10% 0.53% 0.00% 0.00% 0x28 1 1 0xffffffff86e06eb8 395 228 351 24 14 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1
> 1.10% 2.14% 57.07% 0.00% 0x38 1 1 0xffffffff878c9225 1364 792 462 7003 32 [k] down_write [kernel.vmlinux] atomic64_64.h:109 0 1
> 0.00% 0.00% 0.01% 0.00% 0x38 1 1 0xffffffff878c8e75 0 0 252 3 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1
> 0.00% 0.13% 0.00% 0.00% 0x38 1 1 0xffffffff878c8e23 0 596 63 2 2 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1
> 2.38% 2.94% 6.53% 0.00% 0x38 1 1 0xffffffff878c8ccb 1150 818 570 1197 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:109 0 1
> 30.59% 32.22% 0.00% 0.00% 0x38 1 1 0xffffffff878c8cb4 423 251 380 648 32 [k] rwsem_down_write_slowpath [kernel.vmlinux] atomic64_64.h:15 0 1
> 1.83% 1.74% 35.88% 0.00% 0x38 1 1 0xffffffff86b4f833 1217 1112 565 4586 32 [k] up_write [kernel.vmlinux] atomic64_64.h:91 0 1
>
> with this change:
> -------------------------------------------------------------
> 360 12 300 57 35 0xffff982cdae76400
> -------------------------------------------------------------
> 50.00% 59.67% 0.00% 0.00% 0x0 1 1 0xffffffff8215fb86 352 200 191 558 32 [k] __handle_mm_fault [kernel.vmlinux] memory.c:2940 0 1
> 8.33% 5.00% 0.00% 0.00% 0x4 1 1 0xffffffff8210ed44 370 284 263 42 24 [k] filemap_read [kernel.vmlinux] atomic.h:23 0 1
> 0.00% 0.00% 5.26% 2.86% 0x8 1 1 0xffffffff8214bce1 0 0 0 4 4 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:204 0 1
> 33.33% 14.33% 0.00% 0.00% 0x8 1 1 0xffffffff8214ba75 344 186 219 140 32 [k] vma_interval_tree_insert [kernel.vmlinux] interval_tree.c:23 0 1
> 0.00% 0.00% 94.74% 97.14% 0x10 1 1 0xffffffff8214bd24 0 0 0 88 29 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:339 0 1
> 8.33% 20.00% 0.00% 0.00% 0x10 1 1 0xffffffff8214bb3b 296 209 226 167 31 [k] vma_interval_tree_remove [kernel.vmlinux] rbtree_augmented.h:338 0 1
> 0.00% 0.67% 0.00% 0.00% 0x28 1 1 0xffffffff82206f45 0 140 334 4 3 [k] do_dentry_open [kernel.vmlinux] open.c:966 0 1
> 0.00% 0.33% 0.00% 0.00% 0x38 1 1 0xffffffff8250a6c4 0 286 126 5 5 [k] errseq_sample [kernel.vmlinux] errseq.c:125 0
>
> Signed-off-by: JonasZhou-oc <JonasZhou-oc@zhaoxin.com>
> ---
>  include/linux/fs.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index ed5966a70495..2d6ccde5d1be 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -482,10 +482,10 @@ struct address_space {
>  	pgoff_t			writeback_index;
>  	const struct address_space_operations *a_ops;
>  	unsigned long		flags;
> -	struct rw_semaphore	i_mmap_rwsem;
>  	errseq_t		wb_err;
>  	spinlock_t		i_private_lock;
>  	struct list_head	i_private_list;
> +	struct rw_semaphore	i_mmap_rwsem;
>  	void *			i_private_data;
>  } __attribute__((aligned(sizeof(long)))) __randomize_layout;
>  /*
> --
> 2.25.1
On Fri, Feb 02, 2024 at 05:34:07PM +0800, JonasZhou-oc wrote:
> In the struct address_space, there is a 32-byte gap between i_mmap
> and i_mmap_rwsem. Due to the alignment of struct address_space
> variables to 8 bytes, in certain situations, i_mmap and
> i_mmap_rwsem may end up in the same CACHE line.
>
> While running Unixbench/execl, we observe high false sharing issues
> when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem
> after i_private_list, ensuring a 64-byte gap between i_mmap and
> i_mmap_rwsem.

I'm confused. i_mmap_rwsem protects i_mmap. Usually you want the lock
and the thing it's protecting in the same cacheline. Why is that not
the case here?
On Fri, Feb 02, 2024 at 03:03:51PM +0000, Matthew Wilcox wrote:
> On Fri, Feb 02, 2024 at 05:34:07PM +0800, JonasZhou-oc wrote:
> > In the struct address_space, there is a 32-byte gap between i_mmap
> > and i_mmap_rwsem. Due to the alignment of struct address_space
> > variables to 8 bytes, in certain situations, i_mmap and
> > i_mmap_rwsem may end up in the same CACHE line.
> >
> > While running Unixbench/execl, we observe high false sharing issues
> > when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem
> > after i_private_list, ensuring a 64-byte gap between i_mmap and
> > i_mmap_rwsem.
>
> I'm confused. i_mmap_rwsem protects i_mmap. Usually you want the lock
> and the thing it's protecting in the same cacheline. Why is that not
> the case here?

We actually had this seven months ago:

https://lore.kernel.org/all/20230628105624.150352-1-lipeng.zhu@intel.com/

Unfortunately, no argumentation was forthcoming about *why* this was
the right approach. All we got was a different patch and an assertion
that it still improved performance.

We need to understand what's going on! Please don't do the same thing
as the other submitter and just assert that it does.
On Fri, Feb 02, 2024 at 07:32:36PM +0000, Matthew Wilcox wrote:
> On Fri, Feb 02, 2024 at 03:03:51PM +0000, Matthew Wilcox wrote:
> > On Fri, Feb 02, 2024 at 05:34:07PM +0800, JonasZhou-oc wrote:
> > > In the struct address_space, there is a 32-byte gap between i_mmap
> > > and i_mmap_rwsem. Due to the alignment of struct address_space
> > > variables to 8 bytes, in certain situations, i_mmap and
> > > i_mmap_rwsem may end up in the same CACHE line.
> > >
> > > While running Unixbench/execl, we observe high false sharing issues
> > > when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem
> > > after i_private_list, ensuring a 64-byte gap between i_mmap and
> > > i_mmap_rwsem.
> >
> > I'm confused. i_mmap_rwsem protects i_mmap. Usually you want the lock
> > and the thing it's protecting in the same cacheline.

You are correct in the case that there is never any significant
contention on the lock. i.e. gaining the lock will also pull the
cacheline for the object it protects and so avoid an extra memory
fetch.

However....

> > Why is that not the case here?
>
> We actually had this seven months ago:
>
> https://lore.kernel.org/all/20230628105624.150352-1-lipeng.zhu@intel.com/
>
> Unfortunately, no argumentation was forthcoming about *why* this was
> the right approach. All we got was a different patch and an assertion
> that it still improved performance.
>
> We need to understand what's going on! Please don't do the same thing
> as the other submitter and just assert that it does.

Intuition tells me that what the OP is seeing is the opposite case to
above: there is significant contention on the lock. In that case,
optimal "contention performance" comes from separating the lock and
the objects it protects into different cachelines.

The reason for this is that if the lock and objects it protects are
on the same cacheline, lock contention affects both the lock and the
objects being manipulated inside the critical section. i.e. attempts
to grab the lock pull the cacheline away from the CPU that holds the
lock, and then accesses to the object that are protected by the lock
then have to pull the cacheline back.

i.e. the cost of the extra memory fetch from an uncontended cacheline
is less than the cost of having to repeatedly fetch the memory inside
a critical section on a contended cacheline.

I consider optimisation attempts like this the canary in the mine: it
won't be long before these or similar workloads report catastrophic
lock contention on the lock in question. Moving items in the structure
is equivalent to re-arranging the deck chairs whilst the ship sinks -
we might keep our heads above water a little longer, but the ship is
still sinking and we're still going to have to fix the leak sooner
rather than later...

-Dave.
> On Fri, Feb 02, 2024 at 03:03:51PM +0000, Matthew Wilcox wrote:
> > On Fri, Feb 02, 2024 at 05:34:07PM +0800, JonasZhou-oc wrote:
> > > In the struct address_space, there is a 32-byte gap between i_mmap
> > > and i_mmap_rwsem. Due to the alignment of struct address_space
> > > variables to 8 bytes, in certain situations, i_mmap and
> > > i_mmap_rwsem may end up in the same CACHE line.
> > >
> > > While running Unixbench/execl, we observe high false sharing issues
> > > when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem
> > > after i_private_list, ensuring a 64-byte gap between i_mmap and
> > > i_mmap_rwsem.
> >
> > I'm confused. i_mmap_rwsem protects i_mmap. Usually you want the lock
> > and the thing it's protecting in the same cacheline. Why is that not
> > the case here?
>
> We actually had this seven months ago:
>
> https://lore.kernel.org/all/20230628105624.150352-1-lipeng.zhu@intel.com/
>
> Unfortunately, no argumentation was forthcoming about *why* this was
> the right approach. All we got was a different patch and an assertion
> that it still improved performance.
>
> We need to understand what's going on! Please don't do the same thing
> as the other submitter and just assert that it does.

When running UnixBench/execl, each execl process repeatedly performs
i_mmap_lock_write -> vma_interval_tree_remove/insert ->
i_mmap_unlock_write. As indicated below, when i_mmap and i_mmap_rwsem
are in the same CACHE Line, there will be more HITM.

Func0: i_mmap_lock_write
Func1: vma_interval_tree_remove/insert
Func2: i_mmap_unlock_write

In the same CACHE Line
Process A | Process B | Process C | Process D | CACHE Line state
----------+-----------+-----------+-----------+-----------------
Func0     |           |           |           | I->M
          | Func0     |           |           | HITM M->S
Func1     |           |           |           | may change to M
          |           | Func0     |           | HITM M->S
Func2     |           |           |           | S->M
          |           |           | Func0     | HITM M->S

In different CACHE Lines
Process A | Process B | Process C | Process D | CACHE Line state
----------+-----------+-----------+-----------+-----------------
Func0     |           |           |           | I->M
          | Func0     |           |           | HITM M->S
Func1     |           |           |           |
          |           | Func0     |           | S->S
Func2     |           |           |           | S->M
          |           |           | Func0     | HITM M->S

The same issue will occur in Unixbench/shell because the shell
launches a lot of shell commands, loads executable files and dynamic
libraries into memory, executes them, and exits.

Yes, his commit has been merged into the Linux kernel, but there is an
issue. After moving i_mmap_rwsem below flags, there is a 32-byte gap
between i_mmap_rwsem and i_mmap. However, the struct address_space is
aligned to sizeof(long), which is 8 on the x86-64 architecture. As a
result, i_mmap_rwsem and i_mmap may be placed on the same CACHE Line,
causing a false sharing problem. This issue has been observed using
the perf c2c tool.
On Mon, Feb 05, 2024 at 02:22:29PM +0800, JonasZhou wrote:
> When running UnixBench/execl, each execl process repeatedly performs
> i_mmap_lock_write -> vma_interval_tree_remove/insert ->
> i_mmap_unlock_write. As indicated below, when i_mmap and i_mmap_rwsem
> are in the same CACHE Line, there will be more HITM.

(I wasn't familiar with the term HITM. For anyone else who's
unfamiliar, this appears to mean a HIT in another core's cache, which
has the cacheline in the Modified state)

> Func0: i_mmap_lock_write
> Func1: vma_interval_tree_remove/insert
> Func2: i_mmap_unlock_write
>
> In the same CACHE Line
> Process A | Process B | Process C | Process D | CACHE Line state
> ----------+-----------+-----------+-----------+-----------------
> Func0     |           |           |           | I->M
>           | Func0     |           |           | HITM M->S
> Func1     |           |           |           | may change to M
>           |           | Func0     |           | HITM M->S
> Func2     |           |           |           | S->M
>           |           |           | Func0     | HITM M->S
>
> In different CACHE Lines
> Process A | Process B | Process C | Process D | CACHE Line state
> ----------+-----------+-----------+-----------+-----------------
> Func0     |           |           |           | I->M
>           | Func0     |           |           | HITM M->S
> Func1     |           |           |           |
>           |           | Func0     |           | S->S
> Func2     |           |           |           | S->M
>           |           |           | Func0     | HITM M->S
>
> The same issue will occur in Unixbench/shell because the shell
> launches a lot of shell commands, loads executable files and dynamic
> libraries into memory, executes them, and exits.

OK, I see.

> Yes, his commit has been merged into the Linux kernel, but there is
> an issue. After moving i_mmap_rwsem below flags, there is a 32-byte
> gap between i_mmap_rwsem and i_mmap. However, the struct
> address_space is aligned to sizeof(long), which is 8 on the x86-64
> architecture. As a result, i_mmap_rwsem and i_mmap may be placed on
> the same CACHE Line, causing a false sharing problem. This issue has
> been observed using the perf c2c tool.

Got it. OK, let's put this patch in. It's a stopgap measure, clearly.
I'll reply to Dave's email with a longer term solution.
On Mon, Feb 05, 2024 at 02:22:29PM +0800, JonasZhou wrote:
> > On Fri, Feb 02, 2024 at 03:03:51PM +0000, Matthew Wilcox wrote:
> > > On Fri, Feb 02, 2024 at 05:34:07PM +0800, JonasZhou-oc wrote:
> > > > In the struct address_space, there is a 32-byte gap between i_mmap
> > > > and i_mmap_rwsem. Due to the alignment of struct address_space
> > > > variables to 8 bytes, in certain situations, i_mmap and
> > > > i_mmap_rwsem may end up in the same CACHE line.
> > > >
> > > > While running Unixbench/execl, we observe high false sharing issues
> > > > when accessing i_mmap against i_mmap_rwsem. We move i_mmap_rwsem
> > > > after i_private_list, ensuring a 64-byte gap between i_mmap and
> > > > i_mmap_rwsem.
> > >
> > > I'm confused. i_mmap_rwsem protects i_mmap. Usually you want the lock
> > > and the thing it's protecting in the same cacheline. Why is that not
> > > the case here?
> >
> > We actually had this seven months ago:
> >
> > https://lore.kernel.org/all/20230628105624.150352-1-lipeng.zhu@intel.com/
> >
> > Unfortunately, no argumentation was forthcoming about *why* this was
> > the right approach. All we got was a different patch and an assertion
> > that it still improved performance.
> >
> > We need to understand what's going on! Please don't do the same thing
> > as the other submitter and just assert that it does.
>
> When running UnixBench/execl, each execl process repeatedly performs
> i_mmap_lock_write -> vma_interval_tree_remove/insert ->
> i_mmap_unlock_write. As indicated below, when i_mmap and i_mmap_rwsem
> are in the same CACHE Line, there will be more HITM.

As I expected, your test is exercising the contention case rather
than the single, uncontended case. As such, your patch is simply
optimising the structure layout for the contended case at the
expense of an extra cacheline miss in the uncontended case.

I'm not an mm expert, so I don't know which case we should optimise
for.

However, the existing code is not obviously wrong, it's just that
your micro-benchmark exercises the pathological worst case for the
optimisation choices made for this structure. Whether the contention
case is worth optimising is the first decision that needs to be made,
then people can decide whether hacking minor optimisations into the
code is better than reworking the locking and/or algorithm to avoid
the contention altogether...

-Dave.
On Mon, Feb 05, 2024 at 02:22:18PM +1100, Dave Chinner wrote:
> Intuition tells me that what the OP is seeing is the opposite case
> to above: there is significant contention on the lock. In that case,
> optimal "contention performance" comes from separating the lock and
> the objects it protects into different cachelines.
>
> The reason for this is that if the lock and objects it protects are
> on the same cacheline, lock contention affects both the lock and the
> objects being manipulated inside the critical section. i.e. attempts
> to grab the lock pull the cacheline away from the CPU that holds the
> lock, and then accesses to the object that are protected by the lock
> then have to pull the cacheline back.
>
> i.e. the cost of the extra memory fetch from an uncontended
> cacheline is less than the cost of having to repeatedly fetch the
> memory inside a critical section on a contended cacheline.
>
> I consider optimisation attempts like this the canary in the mine:
> it won't be long before these or similar workloads report
> catastrophic lock contention on the lock in question. Moving items
> in the structure is equivalent to re-arranging the deck chairs
> whilst the ship sinks - we might keep our heads above water a
> little longer, but the ship is still sinking and we're still going
> to have to fix the leak sooner rather than later...

So the fundamental problem is our data structure. It's actually two
problems wrapped up in one bad idea.

i_mmap is a struct rb_root_cached:

	struct rb_root_cached {
		struct rb_root rb_root;
		struct rb_node *rb_leftmost;
	};

	struct rb_root {
		struct rb_node *rb_node;
	};

so it's two pointers into the tree; one to the leftmost node, one to
the topmost node. That means we're modifying one or both of these
pointers frequently. I imagine it's the rb_root, which is being
modified frequently because we're always ... appending to the right?
And so we're rotating frequently. Worst case would be if we're
appending to the left and modifying both pointers.

There are things we could do to address this by making rotations less
frequent for the common case, which I suspect is mapping the entire
file. And perhaps we should do these things as a stopgap.

The larger problem is that rbtrees suck. They suck the same way that
linked lists suck; the CPU can't prefetch very far ahead (or in this
case down). It can do a little more work in that it can prefetch both
the left & right pointers, but it can't fetch the grandchildren until
the children fetches have finished, so the dependent reads have us
stumped.

The solution to this problem is to change the interval tree data
structure from a Red-Black tree to a B-tree, or something similar
where we use an array of pointers instead of a single pointer. Not
the Maple Tree; that is designed for non-overlapping ranges. One
could take inspiration from the Maple Tree and design a very similar
data structure that can store and iterate over overlapping ranges.

I can't understand the macros this late at night, so I don't fully
understand what the interval tree is doing, but I can probably make a
more fully informed observation by the end of the week.
> Got it. OK, let's put this patch in. It's a stopgap measure, clearly.
> I'll reply to Dave's email with a longer term solution.

Fyi, I've picked this up yesterday. You should've gotten a
notification, hopefully.
On Mon, Feb 05, 2024 at 11:28:29PM +0000, Matthew Wilcox wrote:
> On Mon, Feb 05, 2024 at 02:22:18PM +1100, Dave Chinner wrote:
> > Intuition tells me that what the OP is seeing is the opposite case
> > to above: there is significant contention on the lock. In that case,
> > optimal "contention performance" comes from separating the lock and
> > the objects it protects into different cachelines.
> >
> > The reason for this is that if the lock and objects it protects are
> > on the same cacheline, lock contention affects both the lock and the
> > objects being manipulated inside the critical section. i.e. attempts
> > to grab the lock pull the cacheline away from the CPU that holds the
> > lock, and then accesses to the object that are protected by the lock
> > then have to pull the cacheline back.
> >
> > i.e. the cost of the extra memory fetch from an uncontended
> > cacheline is less than the cost of having to repeatedly fetch the
> > memory inside a critical section on a contended cacheline.
> >
> > I consider optimisation attempts like this the canary in the mine:
> > it won't be long before these or similar workloads report
> > catastrophic lock contention on the lock in question. Moving items
> > in the structure is equivalent to re-arranging the deck chairs
> > whilst the ship sinks - we might keep our heads above water a
> > little longer, but the ship is still sinking and we're still going
> > to have to fix the leak sooner rather than later...
>
> So the fundamental problem is our data structure. It's actually two
> problems wrapped up in one bad idea.
>
> i_mmap is a struct rb_root_cached:
>
> 	struct rb_root_cached {
> 		struct rb_root rb_root;
> 		struct rb_node *rb_leftmost;
> 	};
>
> 	struct rb_root {
> 		struct rb_node *rb_node;
> 	};
>
> so it's two pointers into the tree; one to the leftmost node, one
> to the topmost node. That means we're modifying one or both of these
> pointers frequently. I imagine it's the rb_root, which is being
> modified frequently because we're always ... appending to the right?
> And so we're rotating frequently. Worst case would be if we're
> appending to the left and modifying both pointers.
>
> There are things we could do to address this by making rotations less
> frequent for the common case, which I suspect is mapping the entire file.
> And perhaps we should do these things as a stopgap.
>
> The larger problem is that rbtrees suck. They suck the same way that
> linked lists suck; the CPU can't prefetch very far ahead (or in this
> case down). It can do a little more work in that it can prefetch both
> the left & right pointers, but it can't fetch the grandchildren until the
> children fetches have finished, so the dependent reads have us stumped.

Yes, pretty much all balanced binary search trees have this problem.
I pointed out this problem with rbtrees many years ago when someone
tried to implement range locks with an rbtree to track locked ranges.
On modern CPUs, trees with array based nodes (btrees, the linux radix
tree, etc) are far more efficient. But....

> The solution to this problem is to change the interval tree data structure
> from a Red-Black tree to a B-tree, or something similar where we use
> an array of pointers instead of a single pointer.

.... B-trees are not immune to pointer chasing problems, either. Most
use binary searches within the nodes, and so we have the problem of
unpredictable cacheline misses within the nodes as well as being
unable to do depth based prefetch similar to rbtrees.

Perhaps we should be looking at something like this:

https://lore.kernel.org/linux-fsdevel/20240126220655.395093-2-kent.overstreet@linux.dev/

-Dave.
On Wed, Feb 07, 2024 at 08:35:59AM +1100, Dave Chinner wrote:
> > The solution to this problem is to change the interval tree data structure
> > from a Red-Black tree to a B-tree, or something similar where we use
> > an array of pointers instead of a single pointer.
>
> .... B-trees are not immune to pointer chasing problems, either.
> Most use binary searches within the nodes, and so we have the
> problem of unpredictable cacheline misses within the nodes as well
> as being unable to do depth based prefetch similar to rbtrees.
>
> Perhaps we should be looking at something like this:
>
> https://lore.kernel.org/linux-fsdevel/20240126220655.395093-2-kent.overstreet@linux.dev/

I need more data (and maybe Kent has it!)

Without any data, I believe that Eytzinger layout is an idea that was
a really good one in the 1990s/2000s and has now passed its expiration
date because of the changed nature of hardware. In the mid-90s, we
could do O(10) instructions in the time it took to service a LLC miss.
Today, it's more like O(2500). That means it is far more important to
be predictable than it is to execute the minimum number of
instructions.

If your B-tree nodes are 256kB in size (I believe that's what bcachefs
is using?), then Eytzinger layout may make some sense, but if you're
using smaller nodes (which I further believe is appropriate for an
in-memory B-tree), then a straight 'load each index and compare it'
may outperform a search that jumps around inside a node.

I'm pontificating and will happily yield to someone who has data.
I've tried to mark my assumptions clearly above.

Something else I'm not aware of (and filesystem B-trees will not have
any experience of) is what research exists on efficiently adding new
entries to a balanced tree so as to minimise rebalances. Filesystems
are like the Maple Tree in that for every logical offset inside a
file, there is precisely one answer to "what physical block does this
map to".

For the i_mmap tree, we want instead to answer the question "Which
VMAs have an intersection with this range of the file", and for the
benchmark in question, we will have a large number of entries that
compare equal to each other (they have the same start, and same end,
but different values). So they could be inserted at many different
points in the tree. We'd like to choose the point which causes the
least amount of rebalancing.
> On Mon, Feb 05, 2024 at 02:22:29PM +0800, JonasZhou wrote:
>
> As I expected, your test is exercising the contention case rather
> than the single, uncontended case. As such, your patch is simply
> optimising the structure layout for the contended case at the
> expense of an extra cacheline miss in the uncontended case.
>
> I'm not an mm expert, so I don't know which case we should optimise
> for.
>
> However, the existing code is not obviously wrong, it's just that
> your micro-benchmark exercises the pathological worst case for the
> optimisation choices made for this structure. Whether the contention
> case is worth optimising is the first decision that needs to be
> made, then people can decide if hacking minor optimisations into the
> code is better than reworking the locking and/or algorithm to avoid
> the contention altogether...

According to
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/log/?qt=grep&q=unixbench,
many people use Unixbench and submit optimization patches to Linux
based on its scores, so this is not just my own micro-benchmark.

When multiple processes open the same file, such as multiple processes
opening libc.so, there will be contention.

> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ed5966a70495..2d6ccde5d1be 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -482,10 +482,10 @@ struct address_space {
 	pgoff_t			writeback_index;
 	const struct address_space_operations *a_ops;
 	unsigned long		flags;
-	struct rw_semaphore	i_mmap_rwsem;
 	errseq_t		wb_err;
 	spinlock_t		i_private_lock;
 	struct list_head	i_private_list;
+	struct rw_semaphore	i_mmap_rwsem;
 	void *			i_private_data;
 } __attribute__((aligned(sizeof(long)))) __randomize_layout;
 /*