
[RFC,2/2] vfs: Use per-cpu list for superblock's inode list

Message ID 1455672680-7153-3-git-send-email-Waiman.Long@hpe.com (mailing list archive)
State New, archived

Commit Message

Waiman Long Feb. 17, 2016, 1:31 a.m. UTC
When many threads are trying to add inodes to or delete inodes from
a superblock's s_inodes list, spinlock contention on the list can
become a performance bottleneck.

This patch changes the s_inodes field to become a per-cpu list with
per-cpu spinlocks.
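
The percpu_list primitives used below come from patch 1/2 of this
series (not included here). A minimal sketch of the idea, with the
exact details elided:

struct percpu_list {
	struct list_head list;
	union {
		spinlock_t lock;	/* for a per-cpu list head */
		spinlock_t *lockptr;	/* for a node: lock of its sublist */
	};
};

/* Add to the current CPU's sublist, taking only that CPU's lock. */
static inline void percpu_list_add(struct percpu_list *node,
				   struct percpu_list __percpu *head)
{
	struct percpu_list *h = per_cpu_ptr(head, get_cpu());

	spin_lock(&h->lock);
	node->lockptr = &h->lock;
	list_add(&node->list, &h->list);
	spin_unlock(&h->lock);
	put_cpu();
}

/*
 * Delete from whichever sublist the node was added to. The deleting
 * CPU need not be the adding CPU, hence the stored lockptr. Callers
 * check list_empty() first, as inode_sb_list_del() below does.
 */
static inline void percpu_list_del(struct percpu_list *node)
{
	spinlock_t *lock = node->lockptr;

	spin_lock(lock);
	node->lockptr = NULL;
	list_del_init(&node->list);
	spin_unlock(lock);
}

Iteration (the for_all_percpu_list_entries*() macros) then walks each
CPU's sublist in turn under that sublist's lock.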

The performance impact was measured with an exit microbenchmark that
creates a large number of threads, attaches many inodes to them, and
then exits. The runtimes of that
microbenchmark with 1000 threads before and after the patch on a
4-socket Intel E7-4820 v3 system (40 cores, 80 threads) were as
follows:

  Kernel            Elapsed Time    System Time
  ------            ------------    -----------
  Vanilla 4.5-rc4      65.29s         82m14s
  Patched 4.5-rc4      22.81s         23m03s

Before the patch, spinlock contention in the inode_sb_list_add()
function during the startup phase and in the inode_sb_list_del()
function during the exit phase accounted for about 79% and 93% of
total CPU time respectively (as measured by perf). After the patch,
the percpu_list_add() function consumed only about 0.04% of CPU time
during the startup phase, and the percpu_list_del() function about
0.4% during the exit phase. There was still some spinlock contention,
but it happened elsewhere.

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 fs/block_dev.c         |   16 +++++++---------
 fs/drop_caches.c       |   11 +++++------
 fs/fs-writeback.c      |   16 +++++++---------
 fs/inode.c             |   43 +++++++++++++++++--------------------------
 fs/notify/inode_mark.c |   22 +++++++++++-----------
 fs/quota/dquot.c       |   23 ++++++++++-------------
 fs/super.c             |    7 ++++---
 include/linux/fs.h     |    8 ++++----
 8 files changed, 65 insertions(+), 81 deletions(-)

Comments

Ingo Molnar Feb. 17, 2016, 7:16 a.m. UTC | #1
* Waiman Long <Waiman.Long@hpe.com> wrote:

> When many threads are trying to add inodes to or delete inodes from
> a superblock's s_inodes list, spinlock contention on the list can
> become a performance bottleneck.
> 
> This patch changes the s_inodes field to become a per-cpu list with
> per-cpu spinlocks.
> 
> The performance impact was measured with an exit microbenchmark that
> creates a large number of threads, attaches many inodes to them, and
> then exits. The runtimes of that
> microbenchmark with 1000 threads before and after the patch on a
> 4-socket Intel E7-4820 v3 system (40 cores, 80 threads) were as
> follows:
> 
>   Kernel            Elapsed Time    System Time
>   ------            ------------    -----------
>   Vanilla 4.5-rc4      65.29s         82m14s
>   Patched 4.5-rc4      22.81s         23m03s
> 
> Before the patch, spinlock contention in the inode_sb_list_add()
> function during the startup phase and in the inode_sb_list_del()
> function during the exit phase accounted for about 79% and 93% of
> total CPU time respectively (as measured by perf). After the patch,
> the percpu_list_add() function consumed only about 0.04% of CPU time
> during the startup phase, and the percpu_list_del() function about
> 0.4% during the exit phase. There was still some spinlock contention,
> but it happened elsewhere.

Pretty impressive IMHO!

Just for the record, here's your former 'batched list' number inserted into the 
above table:

   Kernel                       Elapsed Time    System Time
   ------                       ------------    -----------
   Vanilla      [v4.5-rc4]      65.29s          82m14s
   batched list [v4.4]          45.69s          49m44s
   percpu list  [v4.5-rc4]      22.81s          23m03s

i.e. the proper per CPU data structure and the resulting improvement in cache 
locality gave another doubling in performance.

Just out of curiosity, could you post the profile of the latest patches - is there 
any (bigger) SMP overhead left, or is the profile pretty flat now?

Thanks,

	Ingo
Dave Chinner Feb. 17, 2016, 10:37 a.m. UTC | #2
On Tue, Feb 16, 2016 at 08:31:20PM -0500, Waiman Long wrote:
> When many threads are trying to add inodes to or delete inodes from
> a superblock's s_inodes list, spinlock contention on the list can
> become a performance bottleneck.
> 
> This patch changes the s_inodes field to become a per-cpu list with
> per-cpu spinlocks.
> 
> The performance impact was measured with an exit microbenchmark that
> creates a large number of threads, attaches many inodes to them, and
> then exits. The runtimes of that
> microbenchmark with 1000 threads before and after the patch on a
> 4-socket Intel E7-4820 v3 system (40 cores, 80 threads) were as
> follows:
> 
>   Kernel            Elapsed Time    System Time
>   ------            ------------    -----------
>   Vanilla 4.5-rc4      65.29s         82m14s
>   Patched 4.5-rc4      22.81s         23m03s

Pretty good :)

My fsmark tests usually show up a fair bit of contention - moving
250k inodes through the cache every second over 16p does generate a
bit of load on the list. The patch makes the inode list add/del
operations disappear completely from the perf profiles, and there's
a marginal decrease in runtime (~4m40s vs 4m30s). I think the global
lock is right on the edge of breakdown under this load, though, so
if I was testing on a larger system I think the difference would be
much bigger.

I'll run some more testing on it, see if anything breaks.

A few comments on the code follow.

> @@ -1866,8 +1866,8 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
>  {
>  	struct inode *inode, *old_inode = NULL;
>  
> -	spin_lock(&blockdev_superblock->s_inode_list_lock);
> -	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
> +	for_all_percpu_list_entries_simple(inode, percpu_lock,
> +			blockdev_superblock->s_inodes_cpu, i_sb_list) {

This is kind of what I meant about names getting way too long. How
about something like:

#define walk_sb_inodes(inode, sb, pcpu_lock)	\
	for_all_percpu_list_entries_simple(inode, pcpu_lock,	\
					   sb->s_inodes_list, i_sb_list)

#define walk_sb_inodes_end(pcpu_lock) end_all_percpu_list_entries(pcpu_lock)

for brevity?
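
With those, e.g. the drop_pagecache_sb() loop in this patch would read
something like:

	walk_sb_inodes(inode, sb, pcpu_lock) {
		spin_lock(&inode->i_lock);
		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
		    (inode->i_mapping->nrpages == 0)) {
			spin_unlock(&inode->i_lock);
			continue;
		}
		__iget(inode);
		spin_unlock(&inode->i_lock);
		spin_unlock(pcpu_lock);

		invalidate_mapping_pages(inode->i_mapping, 0, -1);
		iput(toput_inode);
		toput_inode = inode;

		spin_lock(pcpu_lock);
	} walk_sb_inodes_end(pcpu_lock)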

> @@ -189,7 +190,7 @@ void fsnotify_unmount_inodes(struct super_block *sb)
>  		spin_unlock(&inode->i_lock);
>  
>  		/* In case the dropping of a reference would nuke next_i. */
> -		while (&next_i->i_sb_list != &sb->s_inodes) {
> +		while (&next_i->i_sb_list.list != percpu_head) {
>  			spin_lock(&next_i->i_lock);
>  			if (!(next_i->i_state & (I_FREEING | I_WILL_FREE)) &&
>  						atomic_read(&next_i->i_count)) {
> @@ -199,16 +200,16 @@ void fsnotify_unmount_inodes(struct super_block *sb)
>  				break;
>  			}
>  			spin_unlock(&next_i->i_lock);
> -			next_i = list_next_entry(next_i, i_sb_list);
> +			next_i = list_next_entry(next_i, i_sb_list.list);

pcpu_list_next_entry(next_i, i_sb_list)?
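
i.e. a one-line wrapper, something like:

#define pcpu_list_next_entry(pos, member) \
	list_next_entry(pos, member.list)

so the ".list" indirection stays inside the percpu-list API.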

> @@ -1397,9 +1398,8 @@ struct super_block {
>  	 */
>  	int s_stack_depth;
>  
> -	/* s_inode_list_lock protects s_inodes */
> -	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
> -	struct list_head	s_inodes;	/* all inodes */
> +	/* The percpu locks protect s_inodes_cpu */
> +	PERCPU_LIST_HEAD(s_inodes_cpu);	/* all inodes */

There is no need to encode the type of list into the name.
i.e. drop the "_cpu" suffix - we can see it's a percpu list from the
declaration.
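
i.e. just:

	/* The percpu locks protect s_inodes */
	PERCPU_LIST_HEAD(s_inodes);	/* all inodes */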

Cheers,

Dave.
Waiman Long Feb. 17, 2016, 3:40 p.m. UTC | #3
On 02/17/2016 02:16 AM, Ingo Molnar wrote:
> * Waiman Long <Waiman.Long@hpe.com> wrote:
>
>> When many threads are trying to add inodes to or delete inodes from
>> a superblock's s_inodes list, spinlock contention on the list can
>> become a performance bottleneck.
>>
>> This patch changes the s_inodes field to become a per-cpu list with
>> per-cpu spinlocks.
>>
>> The performance impact was measured with an exit microbenchmark that
>> creates a large number of threads, attaches many inodes to them, and
>> then exits. The runtimes of that
>> microbenchmark with 1000 threads before and after the patch on a
>> 4-socket Intel E7-4820 v3 system (40 cores, 80 threads) were as
>> follows:
>>
>>    Kernel            Elapsed Time    System Time
>>    ------            ------------    -----------
>>    Vanilla 4.5-rc4      65.29s         82m14s
>>    Patched 4.5-rc4      22.81s         23m03s
>>
>> Before the patch, spinlock contention in the inode_sb_list_add()
>> function during the startup phase and in the inode_sb_list_del()
>> function during the exit phase accounted for about 79% and 93% of
>> total CPU time respectively (as measured by perf). After the patch,
>> the percpu_list_add() function consumed only about 0.04% of CPU time
>> during the startup phase, and the percpu_list_del() function about
>> 0.4% during the exit phase. There was still some spinlock contention,
>> but it happened elsewhere.
> Pretty impressive IMHO!
>
> Just for the record, here's your former 'batched list' number inserted into the
> above table:
>
>     Kernel                       Elapsed Time    System Time
>     ------                       ------------    -----------
>     Vanilla      [v4.5-rc4]      65.29s          82m14s
>     batched list [v4.4]          45.69s          49m44s
>     percpu list  [v4.5-rc4]      22.81s          23m03s
>
> i.e. the proper per CPU data structure and the resulting improvement in cache
> locality gave another doubling in performance.
>
> Just out of curiosity, could you post the profile of the latest patches - is there
> any (bigger) SMP overhead left, or is the profile pretty flat now?
>
> Thanks,
>
> 	Ingo

Yes, there was still spinlock contention elsewhere in the exit path.
Now the bulk of the CPU time was in:

-   79.23%    79.23%         a.out  [kernel.kallsyms]    [k] native_queued_spin_lock_slowpath
    - native_queued_spin_lock_slowpath
       - 99.99% queued_spin_lock_slowpath
          - 100.00% _raw_spin_lock
             - 99.98% list_lru_del
                - d_lru_del
                   - 100.00% select_collect
                        detach_and_collect
                        d_walk
                        d_invalidate
                        proc_flush_task
                        release_task
                        do_exit
                        do_group_exit
                        get_signal
                        do_signal
                        exit_to_usermode_loop
                        syscall_return_slowpath
                        int_ret_from_sys_call

The lock that was being contended was nlru->lock. On the 4-node system
that I used, there are four of those.
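
For reference, nlru->lock here is the per-node lock in the list_lru
code; roughly (abridged from include/linux/list_lru.h):

struct list_lru_node {
	spinlock_t		lock;	/* protects all lists on this node */
	struct list_lru_one	lru;
} ____cacheline_aligned_in_smp;

struct list_lru {
	struct list_lru_node	*node;	/* one entry per NUMA node */
};

list_lru_del() maps an item to its NUMA node and takes that node's
nlru->lock, so this contention is spread per node rather than per CPU.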

Cheers,
Longman
Waiman Long Feb. 17, 2016, 4:08 p.m. UTC | #4
On 02/17/2016 05:37 AM, Dave Chinner wrote:
> On Tue, Feb 16, 2016 at 08:31:20PM -0500, Waiman Long wrote:
>> When many threads are trying to add inodes to or delete inodes from
>> a superblock's s_inodes list, spinlock contention on the list can
>> become a performance bottleneck.
>>
>> This patch changes the s_inodes field to become a per-cpu list with
>> per-cpu spinlocks.
>>
>> The performance impact was measured with an exit microbenchmark that
>> creates a large number of threads, attaches many inodes to them, and
>> then exits. The runtimes of that
>> microbenchmark with 1000 threads before and after the patch on a
>> 4-socket Intel E7-4820 v3 system (40 cores, 80 threads) were as
>> follows:
>>
>>    Kernel            Elapsed Time    System Time
>>    ------            ------------    -----------
>>    Vanilla 4.5-rc4      65.29s         82m14s
>>    Patched 4.5-rc4      22.81s         23m03s
> Pretty good :)
>
> My fsmark tests usually show up a fair bit of contention - moving
> 250k inodes through the cache every second over 16p does generate a
> bit of load on the list. The patch makes the inode list add/del
> operations disappear completely from the perf profiles, and there's
> a marginal decrease in runtime (~4m40s vs 4m30s). I think the global
> lock is right on the edge of breakdown under this load, though, so
> if I was testing on a larger system I think the difference would be
> much bigger.
>
> I'll run some more testing on it, see if anything breaks.
>
> A few comments on the code follow.
>
>> @@ -1866,8 +1866,8 @@ void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
>>   {
>>   	struct inode *inode, *old_inode = NULL;
>>
>> -	spin_lock(&blockdev_superblock->s_inode_list_lock);
>> -	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
>> +	for_all_percpu_list_entries_simple(inode, percpu_lock,
>> +			blockdev_superblock->s_inodes_cpu, i_sb_list) {
> This is kind of what I meant about names getting way too long. How
> about something like:
>
> #define walk_sb_inodes(inode, sb, pcpu_lock)	\
> 	for_all_percpu_list_entries_simple(inode, pcpu_lock,	\
> 					   sb->s_inodes_list, i_sb_list)
>
> #define walk_sb_inodes_end(pcpu_lock) end_all_percpu_list_entries(pcpu_lock)
>
> for brevity?

Yes, I think adding some inode-specific macros in fs.h will help to
make the patch easier to read.

>> @@ -189,7 +190,7 @@ void fsnotify_unmount_inodes(struct super_block *sb)
>>   		spin_unlock(&inode->i_lock);
>>
>>   		/* In case the dropping of a reference would nuke next_i. */
>> -		while (&next_i->i_sb_list != &sb->s_inodes) {
>> +		while (&next_i->i_sb_list.list != percpu_head) {
>>   			spin_lock(&next_i->i_lock);
>>   			if (!(next_i->i_state & (I_FREEING | I_WILL_FREE)) &&
>>   						atomic_read(&next_i->i_count)) {
>> @@ -199,16 +200,16 @@ void fsnotify_unmount_inodes(struct super_block *sb)
>>   				break;
>>   			}
>>   			spin_unlock(&next_i->i_lock);
>> -			next_i = list_next_entry(next_i, i_sb_list);
>> +			next_i = list_next_entry(next_i, i_sb_list.list);
> pcpu_list_next_entry(next_i, i_sb_list)?

Will add that.

>> @@ -1397,9 +1398,8 @@ struct super_block {
>>   	 */
>>   	int s_stack_depth;
>>
>> -	/* s_inode_list_lock protects s_inodes */
>> -	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
>> -	struct list_head	s_inodes;	/* all inodes */
>> +	/* The percpu locks protect s_inodes_cpu */
>> +	PERCPU_LIST_HEAD(s_inodes_cpu);	/* all inodes */
> There is no need to encode the type of list into the name.
> i.e. drop the "_cpu" suffix - we can see it's a percpu list from the
> declaration.

Will remove that macro.

Thanks for the review.

Cheers,
Longman

Patch

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 39b3a17..30b12cb 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1866,8 +1866,8 @@  void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 {
 	struct inode *inode, *old_inode = NULL;
 
-	spin_lock(&blockdev_superblock->s_inode_list_lock);
-	list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
+	for_all_percpu_list_entries_simple(inode, percpu_lock,
+			blockdev_superblock->s_inodes_cpu, i_sb_list) {
 		struct address_space *mapping = inode->i_mapping;
 
 		spin_lock(&inode->i_lock);
@@ -1878,22 +1878,20 @@  void iterate_bdevs(void (*func)(struct block_device *, void *), void *arg)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&blockdev_superblock->s_inode_list_lock);
+		spin_unlock(percpu_lock);
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
 		 * removed from s_inodes list while we dropped the
-		 * s_inode_list_lock  We cannot iput the inode now as we can
+		 * percpu_lock. We cannot iput the inode now as we can
 		 * be holding the last reference and we cannot iput it under
-		 * s_inode_list_lock. So we keep the reference and iput it
-		 * later.
+		 * percpu_lock. So we keep the reference and iput it later.
 		 */
 		iput(old_inode);
 		old_inode = inode;
 
 		func(I_BDEV(inode), arg);
 
-		spin_lock(&blockdev_superblock->s_inode_list_lock);
-	}
-	spin_unlock(&blockdev_superblock->s_inode_list_lock);
+		spin_lock(percpu_lock);
+	} end_all_percpu_list_entries(percpu_lock)
 	iput(old_inode);
 }
diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index d72d52b..c091d91 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -17,8 +17,8 @@  static void drop_pagecache_sb(struct super_block *sb, void *unused)
 {
 	struct inode *inode, *toput_inode = NULL;
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	for_all_percpu_list_entries_simple(inode, percpu_lock,
+				    sb->s_inodes_cpu, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
 		    (inode->i_mapping->nrpages == 0)) {
@@ -27,15 +27,14 @@  static void drop_pagecache_sb(struct super_block *sb, void *unused)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(percpu_lock);
 
 		invalidate_mapping_pages(inode->i_mapping, 0, -1);
 		iput(toput_inode);
 		toput_inode = inode;
 
-		spin_lock(&sb->s_inode_list_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
+		spin_lock(percpu_lock);
+	} end_all_percpu_list_entries(percpu_lock)
 	iput(toput_inode);
 }
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 6915c95..3fcacb5 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2115,7 +2115,6 @@  static void wait_sb_inodes(struct super_block *sb)
 	WARN_ON(!rwsem_is_locked(&sb->s_umount));
 
 	mutex_lock(&sb->s_sync_lock);
-	spin_lock(&sb->s_inode_list_lock);
 
 	/*
 	 * Data integrity sync. Must wait for all pages under writeback,
@@ -2124,7 +2123,8 @@  static void wait_sb_inodes(struct super_block *sb)
 	 * In which case, the inode may not be on the dirty list, but
 	 * we still have to wait for that writeout.
 	 */
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	for_all_percpu_list_entries_simple(inode, percpu_lock,
+					   sb->s_inodes_cpu, i_sb_list) {
 		struct address_space *mapping = inode->i_mapping;
 
 		spin_lock(&inode->i_lock);
@@ -2135,15 +2135,14 @@  static void wait_sb_inodes(struct super_block *sb)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(percpu_lock);
 
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
 		 * removed from s_inodes list while we dropped the
-		 * s_inode_list_lock.  We cannot iput the inode now as we can
+		 * percpu_lock.  We cannot iput the inode now as we can
 		 * be holding the last reference and we cannot iput it under
-		 * s_inode_list_lock. So we keep the reference and iput it
-		 * later.
+		 * percpu_lock. So we keep the reference and iput it later.
 		 */
 		iput(old_inode);
 		old_inode = inode;
@@ -2157,9 +2156,8 @@  static void wait_sb_inodes(struct super_block *sb)
 
 		cond_resched();
 
-		spin_lock(&sb->s_inode_list_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
+		spin_lock(percpu_lock);
+	} end_all_percpu_list_entries(percpu_lock)
 	iput(old_inode);
 	mutex_unlock(&sb->s_sync_lock);
 }
diff --git a/fs/inode.c b/fs/inode.c
index 9f62db3..15c82e7 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -28,8 +28,8 @@ 
  *   inode->i_state, inode->i_hash, __iget()
  * Inode LRU list locks protect:
  *   inode->i_sb->s_inode_lru, inode->i_lru
- * inode->i_sb->s_inode_list_lock protects:
- *   inode->i_sb->s_inodes, inode->i_sb_list
+ * inode->i_sb->s_inodes_cpu->lock protects:
+ *   inode->i_sb->s_inodes_cpu, inode->i_sb_list
  * bdi->wb.list_lock protects:
  *   bdi->wb.b_{dirty,io,more_io,dirty_time}, inode->i_io_list
  * inode_hash_lock protects:
@@ -37,7 +37,7 @@ 
  *
  * Lock ordering:
  *
- * inode->i_sb->s_inode_list_lock
+ * inode->i_sb->s_inodes_cpu->lock
  *   inode->i_lock
  *     Inode LRU list locks
  *
@@ -45,7 +45,7 @@ 
  *   inode->i_lock
  *
  * inode_hash_lock
- *   inode->i_sb->s_inode_list_lock
+ *   inode->i_sb->s_inodes_cpu->lock
  *   inode->i_lock
  *
  * iunique_lock
@@ -424,19 +424,14 @@  static void inode_lru_list_del(struct inode *inode)
  */
 void inode_sb_list_add(struct inode *inode)
 {
-	spin_lock(&inode->i_sb->s_inode_list_lock);
-	list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
-	spin_unlock(&inode->i_sb->s_inode_list_lock);
+	percpu_list_add(&inode->i_sb_list, inode->i_sb->s_inodes_cpu);
 }
 EXPORT_SYMBOL_GPL(inode_sb_list_add);
 
 static inline void inode_sb_list_del(struct inode *inode)
 {
-	if (!list_empty(&inode->i_sb_list)) {
-		spin_lock(&inode->i_sb->s_inode_list_lock);
-		list_del_init(&inode->i_sb_list);
-		spin_unlock(&inode->i_sb->s_inode_list_lock);
-	}
+	if (!list_empty(&inode->i_sb_list.list))
+		percpu_list_del(&inode->i_sb_list);
 }
 
 static unsigned long hash(struct super_block *sb, unsigned long hashval)
@@ -590,12 +585,12 @@  static void dispose_list(struct list_head *head)
  */
 void evict_inodes(struct super_block *sb)
 {
-	struct inode *inode, *next;
+	struct inode *inode;
 	LIST_HEAD(dispose);
 
 again:
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+	for_all_percpu_list_entries_simple(inode, percpu_lock,
+					   sb->s_inodes_cpu, i_sb_list) {
 		if (atomic_read(&inode->i_count))
 			continue;
 
@@ -616,13 +611,12 @@  again:
 		 * bit so we don't livelock.
 		 */
 		if (need_resched()) {
-			spin_unlock(&sb->s_inode_list_lock);
+			spin_unlock(percpu_lock);
 			cond_resched();
 			dispose_list(&dispose);
 			goto again;
 		}
-	}
-	spin_unlock(&sb->s_inode_list_lock);
+	} end_all_percpu_list_entries(percpu_lock)
 
 	dispose_list(&dispose);
 }
@@ -640,11 +634,11 @@  again:
 int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 {
 	int busy = 0;
-	struct inode *inode, *next;
+	struct inode *inode;
 	LIST_HEAD(dispose);
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next, &sb->s_inodes, i_sb_list) {
+	for_all_percpu_list_entries_simple(inode, percpu_lock,
+					   sb->s_inodes_cpu, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
 			spin_unlock(&inode->i_lock);
@@ -665,8 +659,7 @@  int invalidate_inodes(struct super_block *sb, bool kill_dirty)
 		inode_lru_list_del(inode);
 		spin_unlock(&inode->i_lock);
 		list_add(&inode->i_lru, &dispose);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
+	} end_all_percpu_list_entries(percpu_lock)
 
 	dispose_list(&dispose);
 
@@ -881,7 +874,7 @@  struct inode *new_inode_pseudo(struct super_block *sb)
 		spin_lock(&inode->i_lock);
 		inode->i_state = 0;
 		spin_unlock(&inode->i_lock);
-		INIT_LIST_HEAD(&inode->i_sb_list);
+		INIT_PERCPU_LIST_ENTRY(&inode->i_sb_list);
 	}
 	return inode;
 }
@@ -902,8 +895,6 @@  struct inode *new_inode(struct super_block *sb)
 {
 	struct inode *inode;
 
-	spin_lock_prefetch(&sb->s_inode_list_lock);
-
 	inode = new_inode_pseudo(sb);
 	if (inode)
 		inode_sb_list_add(inode);
diff --git a/fs/notify/inode_mark.c b/fs/notify/inode_mark.c
index 741077d..96bcd4a 100644
--- a/fs/notify/inode_mark.c
+++ b/fs/notify/inode_mark.c
@@ -146,14 +146,15 @@  int fsnotify_add_inode_mark(struct fsnotify_mark *mark,
  * @sb: superblock being unmounted.
  *
  * Called during unmount with no locks held, so needs to be safe against
- * concurrent modifiers. We temporarily drop sb->s_inode_list_lock and CAN block.
+ * concurrent modifiers. We temporarily drop sb->s_inodes_cpu->lock and CAN
+ * block.
  */
 void fsnotify_unmount_inodes(struct super_block *sb)
 {
-	struct inode *inode, *next_i, *need_iput = NULL;
+	struct inode *inode, *need_iput = NULL;
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry_safe(inode, next_i, &sb->s_inodes, i_sb_list) {
+	for_all_percpu_list_entries(inode, next_i, percpu_head, percpu_lock,
+				    sb->s_inodes_cpu, i_sb_list) {
 		struct inode *need_iput_tmp;
 
 		/*
@@ -189,7 +190,7 @@  void fsnotify_unmount_inodes(struct super_block *sb)
 		spin_unlock(&inode->i_lock);
 
 		/* In case the dropping of a reference would nuke next_i. */
-		while (&next_i->i_sb_list != &sb->s_inodes) {
+		while (&next_i->i_sb_list.list != percpu_head) {
 			spin_lock(&next_i->i_lock);
 			if (!(next_i->i_state & (I_FREEING | I_WILL_FREE)) &&
 						atomic_read(&next_i->i_count)) {
@@ -199,16 +200,16 @@  void fsnotify_unmount_inodes(struct super_block *sb)
 				break;
 			}
 			spin_unlock(&next_i->i_lock);
-			next_i = list_next_entry(next_i, i_sb_list);
+			next_i = list_next_entry(next_i, i_sb_list.list);
 		}
 
 		/*
-		 * We can safely drop s_inode_list_lock here because either
+		 * We can safely drop percpu_lock here because either
 		 * we actually hold references on both inode and next_i or
 		 * end of list.  Also no new inodes will be added since the
 		 * umount has begun.
 		 */
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(percpu_lock);
 
 		if (need_iput_tmp)
 			iput(need_iput_tmp);
@@ -220,7 +221,6 @@  void fsnotify_unmount_inodes(struct super_block *sb)
 
 		iput(inode);
 
-		spin_lock(&sb->s_inode_list_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
+		spin_lock(percpu_lock);
+	} end_all_percpu_list_entries(percpu_lock)
 }
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 3c3b81b..0123ef4 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -928,8 +928,8 @@  static void add_dquot_ref(struct super_block *sb, int type)
 	int reserved = 0;
 #endif
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	for_all_percpu_list_entries_simple(inode, percpu_lock,
+					   sb->s_inodes_cpu, i_sb_list) {
 		spin_lock(&inode->i_lock);
 		if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
 		    !atomic_read(&inode->i_writecount) ||
@@ -939,7 +939,7 @@  static void add_dquot_ref(struct super_block *sb, int type)
 		}
 		__iget(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&sb->s_inode_list_lock);
+		spin_unlock(percpu_lock);
 
 #ifdef CONFIG_QUOTA_DEBUG
 		if (unlikely(inode_get_rsv_space(inode) > 0))
@@ -951,15 +951,13 @@  static void add_dquot_ref(struct super_block *sb, int type)
 		/*
 		 * We hold a reference to 'inode' so it couldn't have been
 		 * removed from s_inodes list while we dropped the
-		 * s_inode_list_lock. We cannot iput the inode now as we can be
+		 * percpu_lock. We cannot iput the inode now as we can be
 		 * holding the last reference and we cannot iput it under
-		 * s_inode_list_lock. So we keep the reference and iput it
-		 * later.
+		 * percpu_lock. So we keep the reference and iput it later.
 		 */
 		old_inode = inode;
-		spin_lock(&sb->s_inode_list_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
+		spin_lock(percpu_lock);
+	} end_all_percpu_list_entries(percpu_lock)
 	iput(old_inode);
 
 #ifdef CONFIG_QUOTA_DEBUG
@@ -1028,8 +1026,8 @@  static void remove_dquot_ref(struct super_block *sb, int type,
 	struct inode *inode;
 	int reserved = 0;
 
-	spin_lock(&sb->s_inode_list_lock);
-	list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+	for_all_percpu_list_entries_simple(inode, lock, sb->s_inodes_cpu,
+					   i_sb_list) {
 		/*
 		 *  We have to scan also I_NEW inodes because they can already
 		 *  have quota pointer initialized. Luckily, we need to touch
@@ -1043,8 +1041,7 @@  static void remove_dquot_ref(struct super_block *sb, int type,
 			remove_inode_dquot_ref(inode, type, tofree_head);
 		}
 		spin_unlock(&dq_data_lock);
-	}
-	spin_unlock(&sb->s_inode_list_lock);
+	} end_all_percpu_list_entries(lock)
 #ifdef CONFIG_QUOTA_DEBUG
 	if (reserved) {
 		printk(KERN_WARNING "VFS (%s): Writes happened after quota"
diff --git a/fs/super.c b/fs/super.c
index 1182af8..1db4f37 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -163,6 +163,7 @@  static void destroy_super(struct super_block *s)
 {
 	list_lru_destroy(&s->s_dentry_lru);
 	list_lru_destroy(&s->s_inode_lru);
+	free_percpu_list_head(&s->s_inodes_cpu);
 	security_sb_free(s);
 	WARN_ON(!list_empty(&s->s_mounts));
 	kfree(s->s_subtype);
@@ -204,9 +205,9 @@  static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	INIT_HLIST_NODE(&s->s_instances);
 	INIT_HLIST_BL_HEAD(&s->s_anon);
 	mutex_init(&s->s_sync_lock);
-	INIT_LIST_HEAD(&s->s_inodes);
-	spin_lock_init(&s->s_inode_list_lock);
 
+	if (init_percpu_list_head(&s->s_inodes_cpu))
+		goto fail;
 	if (list_lru_init_memcg(&s->s_dentry_lru))
 		goto fail;
 	if (list_lru_init_memcg(&s->s_inode_lru))
@@ -426,7 +427,7 @@  void generic_shutdown_super(struct super_block *sb)
 		if (sop->put_super)
 			sop->put_super(sb);
 
-		if (!list_empty(&sb->s_inodes)) {
+		if (!percpu_list_empty(sb->s_inodes_cpu)) {
 			printk("VFS: Busy inodes after unmount of %s. "
 			   "Self-destruct in 5 seconds.  Have a nice day...\n",
 			   sb->s_id);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ae68100..efedc59 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -27,6 +27,7 @@ 
 #include <linux/migrate_mode.h>
 #include <linux/uidgid.h>
 #include <linux/lockdep.h>
+#include <linux/percpu-list.h>
 #include <linux/percpu-rwsem.h>
 #include <linux/blk_types.h>
 #include <linux/workqueue.h>
@@ -648,7 +649,7 @@  struct inode {
 	u16			i_wb_frn_history;
 #endif
 	struct list_head	i_lru;		/* inode LRU list */
-	struct list_head	i_sb_list;
+	struct percpu_list	i_sb_list;
 	union {
 		struct hlist_head	i_dentry;
 		struct rcu_head		i_rcu;
@@ -1397,9 +1398,8 @@  struct super_block {
 	 */
 	int s_stack_depth;
 
-	/* s_inode_list_lock protects s_inodes */
-	spinlock_t		s_inode_list_lock ____cacheline_aligned_in_smp;
-	struct list_head	s_inodes;	/* all inodes */
+	/* The percpu locks protect s_inodes_cpu */
+	PERCPU_LIST_HEAD(s_inodes_cpu);	/* all inodes */
 };
 
 extern struct timespec current_fs_time(struct super_block *sb);