| Message ID | 20160224082840.GB10096@quack.suse.cz (mailing list archive) |
|---|---|
| State | New, archived |
* Jan Kara <jack@suse.cz> wrote:

> On Tue 23-02-16 14:04:32, Waiman Long wrote:
> > When many threads are trying to add or delete an inode to or from a superblock's s_inodes list, spinlock contention on the list can become a performance bottleneck.
> >
> > This patch changes the s_inodes field to become a per-cpu list with per-cpu spinlocks. As a result, the following superblock inode list (sb->s_inodes) iteration functions in vfs are also being modified:
> >
> > 1. iterate_bdevs()
> > 2. drop_pagecache_sb()
> > 3. wait_sb_inodes()
> > 4. evict_inodes()
> > 5. invalidate_inodes()
> > 6. fsnotify_unmount_inodes()
> > 7. add_dquot_ref()
> > 8. remove_dquot_ref()
> >
> > In an exit microbenchmark that creates a large number of threads, attaches many inodes to them, and then exits, the runtimes with 1000 threads before and after the patch on a 4-socket Intel E7-4820 v3 system (40 cores, 80 threads) were as follows:
> >
> > Kernel            Elapsed Time    System Time
> > ------            ------------    -----------
> > Vanilla 4.5-rc4   65.29s          82m14s
> > Patched 4.5-rc4   22.81s          23m03s
> >
> > Before the patch, spinlock contention in the inode_sb_list_add() function at the startup phase and the inode_sb_list_del() function at the exit phase accounted for about 79% and 93% of total CPU time respectively (as measured by perf). After the patch, the percpu_list_add() function consumed only about 0.04% of CPU time at the startup phase, and the percpu_list_del() function about 0.4% at the exit phase. There was still some spinlock contention, but it happened elsewhere.
>
> While looking through this patch, I have noticed that the list_for_each_entry_safe() iterations in evict_inodes() and invalidate_inodes() are actually unnecessary. So if you first apply the attached patch, you don't have to implement safe iteration variants at all.
>
> As a second comment, I'd note that this patch grows struct inode by 1 pointer. It is probably acceptable for large machines given the speedup, but it should be noted in the changelog. Furthermore, for UP or even small SMP systems this is IMHO undesired bloat, since the speedup won't be noticeable.
>
> So for these small systems it would be good if the per-cpu list magic would just fall back to a single linked list with a spinlock. Do you think that is reasonably doable?

Even many 'small' systems tend to be SMP these days.

If you do this then please keep it a separate add-on patch, so that the 'UP cost' becomes apparent. Uniprocessor #ifdeffery is really painful in places and we might be better off with a single extra pointer. Furthermore, UP kernels are tested a lot less stringently than SMP kernels. It's just 4 bytes for a truly small 32-bit system.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed 24-02-16 09:36:30, Ingo Molnar wrote:
> * Jan Kara <jack@suse.cz> wrote:
> > On Tue 23-02-16 14:04:32, Waiman Long wrote:
> > > When many threads are trying to add or delete an inode to or from a superblock's s_inodes list, spinlock contention on the list can become a performance bottleneck.
> > > [...]
> >
> > While looking through this patch, I have noticed that the list_for_each_entry_safe() iterations in evict_inodes() and invalidate_inodes() are actually unnecessary. So if you first apply the attached patch, you don't have to implement safe iteration variants at all.
> >
> > As a second comment, I'd note that this patch grows struct inode by 1 pointer. It is probably acceptable for large machines given the speedup, but it should be noted in the changelog. Furthermore, for UP or even small SMP systems this is IMHO undesired bloat, since the speedup won't be noticeable.
> >
> > So for these small systems it would be good if the per-cpu list magic would just fall back to a single linked list with a spinlock. Do you think that is reasonably doable?
>
> Even many 'small' systems tend to be SMP these days.

Yes, I know. But my tablet with 4 ARM cores is unlikely to benefit from this change either. It will just have to pay the memory cost. And frankly, I don't care that much myself, but if there is some reasonably easy way to avoid the cost, it would be welcome.

> If you do this then please keep it a separate add-on patch, so that the 'UP cost' becomes apparent. Uniprocessor #ifdeffery is really painful in places and we might be better off with a single extra pointer. Furthermore, UP kernels are tested a lot less stringently than SMP kernels. It's just 4 bytes for a truly small 32-bit system.

								Honza
On 02/24/2016 03:28 AM, Jan Kara wrote:
> On Tue 23-02-16 14:04:32, Waiman Long wrote:
>> When many threads are trying to add or delete an inode to or from a superblock's s_inodes list, spinlock contention on the list can become a performance bottleneck.
>> [...]
>
> While looking through this patch, I have noticed that the list_for_each_entry_safe() iterations in evict_inodes() and invalidate_inodes() are actually unnecessary. So if you first apply the attached patch, you don't have to implement safe iteration variants at all.

Thanks for the patch. I will apply that in my next update. As for the safe iteration variant, I think I will keep it since I have implemented it already, just in case it may be needed in some other places.

> As a second comment, I'd note that this patch grows struct inode by 1 pointer. It is probably acceptable for large machines given the speedup, but it should be noted in the changelog. Furthermore, for UP or even small SMP systems this is IMHO undesired bloat, since the speedup won't be noticeable.
>
> So for these small systems it would be good if the per-cpu list magic would just fall back to a single linked list with a spinlock. Do you think that is reasonably doable?

I already have a somewhat separate code path for UP, so I can remove the lock pointer for that. For small SMP systems, however, the only way to avoid the extra pointer is to add a config parameter to turn this feature off. That can be added as a separate patch, if necessary.

BTW, I think the current inode structure is already pretty big, so adding one more pointer shouldn't have too much impact on its overall size.

Cheers,
Longman
* Jan Kara <jack@suse.cz> wrote:

> > > > In an exit microbenchmark that creates a large number of threads, attaches many inodes to them, and then exits, the runtimes with 1000 threads before and after the patch on a 4-socket Intel E7-4820 v3 system (40 cores, 80 threads) were as follows:
> > > >
> > > > Kernel            Elapsed Time    System Time
> > > > ------            ------------    -----------
> > > > Vanilla 4.5-rc4   65.29s          82m14s
> > > > Patched 4.5-rc4   22.81s          23m03s
> > > > [...]
> >
> > Even many 'small' systems tend to be SMP these days.
>
> Yes, I know. But my tablet with 4 ARM cores is unlikely to benefit from this change either. [...]

I'm not sure about that at all; the above numbers are showing a 3x-4x speedup in system time, which ought to be noticeable on smaller SMP systems as well.

Waiman, could you please post the microbenchmark?

Thanks,

	Ingo
On 02/25/2016 03:06 AM, Ingo Molnar wrote:
> * Jan Kara <jack@suse.cz> wrote:
>>> Even many 'small' systems tend to be SMP these days.
>> Yes, I know. But my tablet with 4 ARM cores is unlikely to benefit from this change either. [...]
>
> I'm not sure about that at all; the above numbers are showing a 3x-4x speedup in system time, which ought to be noticeable on smaller SMP systems as well.
>
> Waiman, could you please post the microbenchmark?

The microbenchmark that I used is attached. I do agree that the performance benefit will decrease as the number of CPUs gets smaller. The system that I used for testing has 4 sockets with 40 cores (80 threads). Dave Chinner had run his fstests on a 16-core system (probably 2-socket), which showed a modest improvement in performance (~4m40s vs 4m30s in runtime).

This patch enables parallel insertion and deletion to/from the inode list, which used to be a serialized operation. So if that list operation is a bottleneck, you will see significant improvement. If it is not, we may not notice that much of a difference. For a single-socket 4-core system, I agree that the performance benefit, if any, will be limited.

Cheers,
Longman

```c
/*
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * Authors: Waiman Long <waiman.long@hp.com>
 */

/*
 * This is an exit test
 */
#include <ctype.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/syscall.h>

#define do_exit()	syscall(SYS_exit)
#define gettid()	syscall(SYS_gettid)
#define MAX_THREADS	2048

static inline void cpu_relax(void)
{
	__asm__ __volatile__("rep;nop": : :"memory");
}

static inline void atomic_inc(volatile int *v)
{
	__asm__ __volatile__("lock incl %0": "+m" (*v));
}

static volatile int exit_now  = 0;
static volatile int threadcnt = 0;

/*
 * Walk the /proc/<pid> filesystem to make them fill the dentry cache
 */
static void walk_procfs(void)
{
	char cmdbuf[256];
	pid_t tid = gettid();

	snprintf(cmdbuf, sizeof(cmdbuf), "find /proc/%d > /dev/null 2>&1", tid);
	if (system(cmdbuf) < 0)
		perror("system() failed!");
}

static void *exit_thread(void *dummy)
{
	long tid = (long)dummy;

	walk_procfs();
	atomic_inc(&threadcnt);
	/*
	 * Busy wait until the do_exit flag is set and then call exit
	 */
	while (!exit_now)
		sleep(1);
	do_exit();
}

static void exit_test(int threads)
{
	pthread_t thread[threads];
	long i = 0, finish;
	time_t start = time(NULL);

	while (i++ < threads) {
		if (pthread_create(thread + i - 1, NULL, exit_thread,
				   (void *)i)) {
			perror("pthread_create");
			exit(1);
		}
#if 0
		/*
		 * Pipelining to reduce contention & improve speed
		 */
		if ((i & 0xf) == 0)
			while (i - threadcnt > 12)
				usleep(1);
#endif
	}
	while (threadcnt != threads)
		usleep(1);
	walk_procfs();
	printf("Setup time = %lus\n", time(NULL) - start);
	printf("Process ready to exit!\n");
	kill(0, SIGKILL);
	exit(0);
}

int main(int argc, char *argv[])
{
	int tcnt;	/* Thread counts */
	char *cmd = argv[0];

	if ((argc != 2) || !isdigit(argv[1][0])) {
		fprintf(stderr, "Usage: %s <thread count>\n", cmd);
		exit(1);
	}
	tcnt = strtoul(argv[1], NULL, 10);
	if (tcnt > MAX_THREADS) {
		fprintf(stderr, "Error: thread count should be <= %d\n",
			MAX_THREADS);
		exit(1);
	}
	exit_test(tcnt);
	return 0;	/* Not reachable */
}
```
On 02/24/2016 03:23 PM, Waiman Long wrote:
> On 02/24/2016 03:28 AM, Jan Kara wrote:
>> [...]
>> So for these small systems it would be good if the per-cpu list magic would just fall back to a single linked list with a spinlock. Do you think that is reasonably doable?
>
> I already have a somewhat separate code path for UP, so I can remove the lock pointer for that. For small SMP systems, however, the only way to avoid the extra pointer is to add a config parameter to turn this feature off. That can be added as a separate patch, if necessary.

I am sorry that I need to retract this promise for UP. Removing the lock pointer would require a change in the list deletion API to pass in the lock information. So I am not going to change it for the time being.

Cheers,
Longman
```diff
From ede070a2159d4c49c6a29be601594f7119872417 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Wed, 24 Feb 2016 09:05:32 +0100
Subject: [PATCH] vfs: Remove unnecessary list_for_each_entry_safe() variants

evict_inodes() and invalidate_inodes() use list_for_each_entry_safe()
to iterate sb->s_inodes list. However, since we use i_lru list entry for
our local temporary list of inodes to destroy, the inode is guaranteed
to stay in sb->s_inodes list while we hold sb->s_inode_list_lock. So
there is no real need for safe iteration variant and we can use
list_for_each_entry() just fine.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/inode.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 9f62db3bcc3e..efb1bea53695 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -590,12 +590,12 @@ static void dispose_list(struct list_head *head)
  */
 void evict_inodes(struct super_block *sb)
 {
-	struct inode *inode, *next;
+	struct inode *inode;
 	LIST_HEAD(dispose);
 
 again:
 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		if (atomic_read(&inode->i_count))
 			continue;
 
@@ -640,11 +640,11 @@ again:
 int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 {
 	int busy = 0;
-	struct inode *inode, *next;
+	struct inode *inode;
 	LIST_HEAD(dispose);
 
 	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
-- 
2.6.2
```