[RFC,1/2] lib/percpu-list: Per-cpu list with associated per-cpu locks
diff mbox

Message ID 1455672680-7153-2-git-send-email-Waiman.Long@hpe.com
State New
Headers show

Commit Message

Waiman Long Feb. 17, 2016, 1:31 a.m. UTC
Linked list is used everywhere in the Linux kernel. However, if many
threads are trying to add or delete entries into the same linked list,
it can create a performance bottleneck.

This patch introduces a new per-cpu list subystem with associated
per-cpu locks for protecting each of the lists individually. This
allows list entries insertion and deletion operations to happen in
parallel instead of being serialized with a global list and lock.

List entry insertion is strictly per cpu. List deletion, however, can
happen in a cpu other than the one that did the insertion. So we still
need lock to protect the list. Because of that, there may still be
a small amount of contention when deletion is being done.

A new header file include/linux/percpu-list.h will be added with the
associated percpu_list structure. The following functions are used
to manage the per-cpu list:

 1. int init_percpu_list_head(struct percpu_list **pclist_handle)
 2. void percpu_list_add(struct percpu_list *new,
			 struct percpu_list *head)
 3. void percpu_list_del(struct percpu_list *entry)

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 include/linux/percpu-list.h |  117 +++++++++++++++++++++++++++++++++++++++++++
 lib/Makefile                |    2 +-
 lib/percpu-list.c           |   80 +++++++++++++++++++++++++++++
 3 files changed, 198 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/percpu-list.h
 create mode 100644 lib/percpu-list.c

Comments

Dave Chinner Feb. 17, 2016, 9:53 a.m. UTC | #1
On Tue, Feb 16, 2016 at 08:31:19PM -0500, Waiman Long wrote:
> Linked list is used everywhere in the Linux kernel. However, if many
> threads are trying to add or delete entries into the same linked list,
> it can create a performance bottleneck.
> 
> This patch introduces a new per-cpu list subystem with associated
> per-cpu locks for protecting each of the lists individually. This
> allows list entries insertion and deletion operations to happen in
> parallel instead of being serialized with a global list and lock.
> 
> List entry insertion is strictly per cpu. List deletion, however, can
> happen in a cpu other than the one that did the insertion. So we still
> need lock to protect the list. Because of that, there may still be
> a small amount of contention when deletion is being done.
> 
> A new header file include/linux/percpu-list.h will be added with the
> associated percpu_list structure. The following functions are used
> to manage the per-cpu list:
> 
>  1. int init_percpu_list_head(struct percpu_list **pclist_handle)
>  2. void percpu_list_add(struct percpu_list *new,
> 			 struct percpu_list *head)
>  3. void percpu_list_del(struct percpu_list *entry)

A few comments on the code

> + * A per-cpu list protected by a per-cpu spinlock.
> + *
> + * The list head percpu_list structure contains the spinlock, the other
> + * entries in the list contain the spinlock pointer.
> + */
> +struct percpu_list {
> +	struct list_head list;
> +	union {
> +		spinlock_t lock;	/* For list head */
> +		spinlock_t *lockptr;	/* For other entries */
> +	};
> +};

This union is bad for kernels running spinlock debugging - the size
of the spinlock can blow out, and that increases the size of any
object that has a percpu_list in it. I've only got basic spinlock
debugging turned on, and the spinlock_t is 24 bytes with that
config. If I turn on lockdep, it gets much larger again....

So it might be best to separate the list head and list entry
structures, similar to a hash list?

> +static inline void INIT_PERCPU_LIST_HEAD(struct percpu_list *pcpu_list)
> +{
> +	INIT_LIST_HEAD(&pcpu_list->list);
> +	pcpu_list->lock = __SPIN_LOCK_UNLOCKED(&pcpu_list->lock);
> +}
> +
> +static inline void INIT_PERCPU_LIST_ENTRY(struct percpu_list *pcpu_list)
> +{
> +	INIT_LIST_HEAD(&pcpu_list->list);
> +	pcpu_list->lockptr = NULL;
> +}

These function names don't need to shout.

> +/**
> + * for_all_percpu_list_entries - iterate over all the per-cpu list with locking
> + * @pos:	the type * to use as a loop cursor for the current .
> + * @next:	an internal type * variable pointing to the next entry
> + * @pchead:	an internal struct list * of percpu list head
> + * @pclock:	an internal variable for the current per-cpu spinlock
> + * @head:	the head of the per-cpu list
> + * @member:	the name of the per-cpu list within the struct
> + */
> +#define for_all_percpu_list_entries(pos, next, pchead, pclock, head, member)\
> +	{								 \
> +	int cpu;							 \
> +	for_each_possible_cpu (cpu) {					 \
> +		typeof(*pos) *next;					 \
> +		spinlock_t *pclock = per_cpu_ptr(&(head)->lock, cpu);	 \
> +		struct list_head *pchead = &per_cpu_ptr(head, cpu)->list;\
> +		spin_lock(pclock);					 \
> +		list_for_each_entry_safe(pos, next, pchead, member.list)
> +
> +#define end_all_percpu_list_entries(pclock)	spin_unlock(pclock); } }

This is a bit of a landmine - the code inside he iteration is under
a spinlock hidden in the macros. People are going to forget about
that, and it's needs documenting about how it needs to be treated
w.r.t. dropping and regaining the lock so sleeping operations can be
performed on objects on the list being iterated.

Can we also think up some shorter names - names that are 30-40
characters long are getting out out of hand given we're supposed
tobe sticking to 80 character widths and we lost 8 of them in the
first indent...

Cheers,

Dave.
Peter Zijlstra Feb. 17, 2016, 11 a.m. UTC | #2
On Wed, Feb 17, 2016 at 08:53:18PM +1100, Dave Chinner wrote:
> > +/**
> > + * for_all_percpu_list_entries - iterate over all the per-cpu list with locking
> > + * @pos:	the type * to use as a loop cursor for the current .
> > + * @next:	an internal type * variable pointing to the next entry
> > + * @pchead:	an internal struct list * of percpu list head
> > + * @pclock:	an internal variable for the current per-cpu spinlock
> > + * @head:	the head of the per-cpu list
> > + * @member:	the name of the per-cpu list within the struct
> > + */
> > +#define for_all_percpu_list_entries(pos, next, pchead, pclock, head, member)\
> > +	{								 \
> > +	int cpu;							 \
> > +	for_each_possible_cpu (cpu) {					 \
> > +		typeof(*pos) *next;					 \
> > +		spinlock_t *pclock = per_cpu_ptr(&(head)->lock, cpu);	 \
> > +		struct list_head *pchead = &per_cpu_ptr(head, cpu)->list;\
> > +		spin_lock(pclock);					 \
> > +		list_for_each_entry_safe(pos, next, pchead, member.list)
> > +
> > +#define end_all_percpu_list_entries(pclock)	spin_unlock(pclock); } }
> 
> This is a bit of a landmine 

Yeah, that is pretty terrible. Maybe a visitor interface is advisable?

visit_percpu_list_entries(struct percpu_list *head, void (*visitor)(struct list_head *pos, void *data), void *data)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		spinlock_t *lock = per_cpu_ptr(&head->lock, cpu);
		struct list_head *head = per_cpu_ptr(&head->list, cpu);
		struct list_head *pos, *tmp;

		spin_lock(lock);
		for (pos = head->next, tmp = pos->next; pos != head; pos = tmp)
			visitor(pos, data);
		spin_unlock(lock);
	}
}
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra Feb. 17, 2016, 11:05 a.m. UTC | #3
On Wed, Feb 17, 2016 at 12:00:40PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 17, 2016 at 08:53:18PM +1100, Dave Chinner wrote:
> > > +/**
> > > + * for_all_percpu_list_entries - iterate over all the per-cpu list with locking

But the locking is 'pointless'. You only lock one per-cpu sublist at a
time, therefore the list can be modified concurrently anyhow.

So why not use RCU for the list iteration and avoid potentially large
lock hold times?

> > > + * @pos:	the type * to use as a loop cursor for the current .
> > > + * @next:	an internal type * variable pointing to the next entry
> > > + * @pchead:	an internal struct list * of percpu list head
> > > + * @pclock:	an internal variable for the current per-cpu spinlock
> > > + * @head:	the head of the per-cpu list
> > > + * @member:	the name of the per-cpu list within the struct
> > > + */
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner Feb. 17, 2016, 11:10 a.m. UTC | #4
On Wed, Feb 17, 2016 at 12:00:40PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 17, 2016 at 08:53:18PM +1100, Dave Chinner wrote:
> > > +/**
> > > + * for_all_percpu_list_entries - iterate over all the per-cpu list with locking
> > > + * @pos:	the type * to use as a loop cursor for the current .
> > > + * @next:	an internal type * variable pointing to the next entry
> > > + * @pchead:	an internal struct list * of percpu list head
> > > + * @pclock:	an internal variable for the current per-cpu spinlock
> > > + * @head:	the head of the per-cpu list
> > > + * @member:	the name of the per-cpu list within the struct
> > > + */
> > > +#define for_all_percpu_list_entries(pos, next, pchead, pclock, head, member)\
> > > +	{								 \
> > > +	int cpu;							 \
> > > +	for_each_possible_cpu (cpu) {					 \
> > > +		typeof(*pos) *next;					 \
> > > +		spinlock_t *pclock = per_cpu_ptr(&(head)->lock, cpu);	 \
> > > +		struct list_head *pchead = &per_cpu_ptr(head, cpu)->list;\
> > > +		spin_lock(pclock);					 \
> > > +		list_for_each_entry_safe(pos, next, pchead, member.list)
> > > +
> > > +#define end_all_percpu_list_entries(pclock)	spin_unlock(pclock); } }
> > 
> > This is a bit of a landmine 
> 
> Yeah, that is pretty terrible. Maybe a visitor interface is advisable?
> 
> visit_percpu_list_entries(struct percpu_list *head, void (*visitor)(struct list_head *pos, void *data), void *data)
> {
> 	int cpu;
> 
> 	for_each_possible_cpu(cpu) {
> 		spinlock_t *lock = per_cpu_ptr(&head->lock, cpu);
> 		struct list_head *head = per_cpu_ptr(&head->list, cpu);
> 		struct list_head *pos, *tmp;
> 
> 		spin_lock(lock);
> 		for (pos = head->next, tmp = pos->next; pos != head; pos = tmp)
> 			visitor(pos, data);

I thought about this - it's the same problem as the list_lru walking
functions. That is, the visitor has to be able to drop the list lock
to do blocking operations, so the lock has to be passed to the
visitor/internal loop context somehow, and the way the callers can
use it need to be documented.

Cheers,

Dave.
Peter Zijlstra Feb. 17, 2016, 11:26 a.m. UTC | #5
On Wed, Feb 17, 2016 at 10:10:02PM +1100, Dave Chinner wrote:
> On Wed, Feb 17, 2016 at 12:00:40PM +0100, Peter Zijlstra wrote:

> > Yeah, that is pretty terrible. Maybe a visitor interface is advisable?
> > 
> > visit_percpu_list_entries(struct percpu_list *head, void (*visitor)(struct list_head *pos, void *data), void *data)
> > {
> > 	int cpu;
> > 
> > 	for_each_possible_cpu(cpu) {
> > 		spinlock_t *lock = per_cpu_ptr(&head->lock, cpu);
> > 		struct list_head *head = per_cpu_ptr(&head->list, cpu);
> > 		struct list_head *pos, *tmp;
> > 
> > 		spin_lock(lock);
> > 		for (pos = head->next, tmp = pos->next; pos != head; pos = tmp)
> > 			visitor(pos, data);
> 
> I thought about this - it's the same problem as the list_lru walking
> functions. That is, the visitor has to be able to drop the list lock
> to do blocking operations, so the lock has to be passed to the
> visitor/internal loop context somehow, and the way the callers can
> use it need to be documented.

But you cannot drop the lock and guarantee fwd progress. The moment you
drop the lock, you have to restart the iteration from the head, since
any iterator you had might now be pointing into space.


--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra Feb. 17, 2016, 11:36 a.m. UTC | #6
On Wed, Feb 17, 2016 at 12:26:54PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 17, 2016 at 10:10:02PM +1100, Dave Chinner wrote:
> > On Wed, Feb 17, 2016 at 12:00:40PM +0100, Peter Zijlstra wrote:
> 
> > > Yeah, that is pretty terrible. Maybe a visitor interface is advisable?
> > > 
> > > visit_percpu_list_entries(struct percpu_list *head, void (*visitor)(struct list_head *pos, void *data), void *data)
> > > {
> > > 	int cpu;
> > > 
> > > 	for_each_possible_cpu(cpu) {
> > > 		spinlock_t *lock = per_cpu_ptr(&head->lock, cpu);
> > > 		struct list_head *head = per_cpu_ptr(&head->list, cpu);
> > > 		struct list_head *pos, *tmp;
> > > 
> > > 		spin_lock(lock);
> > > 		for (pos = head->next, tmp = pos->next; pos != head; pos = tmp)
> > > 			visitor(pos, data);
> > 
> > I thought about this - it's the same problem as the list_lru walking
> > functions. That is, the visitor has to be able to drop the list lock
> > to do blocking operations, so the lock has to be passed to the
> > visitor/internal loop context somehow, and the way the callers can
> > use it need to be documented.
> 
> But you cannot drop the lock and guarantee fwd progress. The moment you
> drop the lock, you have to restart the iteration from the head, since
> any iterator you had might now be pointing into space.

Ah, I see what iterate_bdevs() does. Yes, that's somewhat 'special'. Not
sure it makes sense to craft a generic 'interface' for that.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christopher Lameter Feb. 17, 2016, 3:13 p.m. UTC | #7
On Tue, 16 Feb 2016, Waiman Long wrote:

> List entry insertion is strictly per cpu. List deletion, however, can
> happen in a cpu other than the one that did the insertion. So we still
> need lock to protect the list. Because of that, there may still be
> a small amount of contention when deletion is being done.

Is there a way to avoid locking completely? You could use cmpxchg_double
to swap both list head and tail in an atomic fashion with some work. There is
even a this_cpu_cmpxchg_double available.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Waiman Long Feb. 17, 2016, 3:56 p.m. UTC | #8
On 02/17/2016 04:53 AM, Dave Chinner wrote:
> On Tue, Feb 16, 2016 at 08:31:19PM -0500, Waiman Long wrote:
>> Linked list is used everywhere in the Linux kernel. However, if many
>> threads are trying to add or delete entries into the same linked list,
>> it can create a performance bottleneck.
>>
>> This patch introduces a new per-cpu list subystem with associated
>> per-cpu locks for protecting each of the lists individually. This
>> allows list entries insertion and deletion operations to happen in
>> parallel instead of being serialized with a global list and lock.
>>
>> List entry insertion is strictly per cpu. List deletion, however, can
>> happen in a cpu other than the one that did the insertion. So we still
>> need lock to protect the list. Because of that, there may still be
>> a small amount of contention when deletion is being done.
>>
>> A new header file include/linux/percpu-list.h will be added with the
>> associated percpu_list structure. The following functions are used
>> to manage the per-cpu list:
>>
>>   1. int init_percpu_list_head(struct percpu_list **pclist_handle)
>>   2. void percpu_list_add(struct percpu_list *new,
>> 			 struct percpu_list *head)
>>   3. void percpu_list_del(struct percpu_list *entry)
> A few comments on the code
>
>> + * A per-cpu list protected by a per-cpu spinlock.
>> + *
>> + * The list head percpu_list structure contains the spinlock, the other
>> + * entries in the list contain the spinlock pointer.
>> + */
>> +struct percpu_list {
>> +	struct list_head list;
>> +	union {
>> +		spinlock_t lock;	/* For list head */
>> +		spinlock_t *lockptr;	/* For other entries */
>> +	};
>> +};
> This union is bad for kernels running spinlock debugging - the size
> of the spinlock can blow out, and that increases the size of any
> object that has a percpu_list in it. I've only got basic spinlock
> debugging turned on, and the spinlock_t is 24 bytes with that
> config. If I turn on lockdep, it gets much larger again....
>
> So it might be best to separate the list head and list entry
> structures, similar to a hash list?

Right. I will split it into 2 separate structure in the next iteration 
of the patch.

>> +static inline void INIT_PERCPU_LIST_HEAD(struct percpu_list *pcpu_list)
>> +{
>> +	INIT_LIST_HEAD(&pcpu_list->list);
>> +	pcpu_list->lock = __SPIN_LOCK_UNLOCKED(&pcpu_list->lock);
>> +}
>> +
>> +static inline void INIT_PERCPU_LIST_ENTRY(struct percpu_list *pcpu_list)
>> +{
>> +	INIT_LIST_HEAD(&pcpu_list->list);
>> +	pcpu_list->lockptr = NULL;
>> +}
> These function names don't need to shout.

I was just following the convention used in list init functions. I can 
certainly change them to lowercase.

>
>> +/**
>> + * for_all_percpu_list_entries - iterate over all the per-cpu list with locking
>> + * @pos:	the type * to use as a loop cursor for the current .
>> + * @next:	an internal type * variable pointing to the next entry
>> + * @pchead:	an internal struct list * of percpu list head
>> + * @pclock:	an internal variable for the current per-cpu spinlock
>> + * @head:	the head of the per-cpu list
>> + * @member:	the name of the per-cpu list within the struct
>> + */
>> +#define for_all_percpu_list_entries(pos, next, pchead, pclock, head, member)\
>> +	{								 \
>> +	int cpu;							 \
>> +	for_each_possible_cpu (cpu) {					 \
>> +		typeof(*pos) *next;					 \
>> +		spinlock_t *pclock = per_cpu_ptr(&(head)->lock, cpu);	 \
>> +		struct list_head *pchead =&per_cpu_ptr(head, cpu)->list;\
>> +		spin_lock(pclock);					 \
>> +		list_for_each_entry_safe(pos, next, pchead, member.list)
>> +
>> +#define end_all_percpu_list_entries(pclock)	spin_unlock(pclock); } }
> This is a bit of a landmine - the code inside he iteration is under
> a spinlock hidden in the macros. People are going to forget about
> that, and it's needs documenting about how it needs to be treated
> w.r.t. dropping and regaining the lock so sleeping operations can be
> performed on objects on the list being iterated.
>
> Can we also think up some shorter names - names that are 30-40
> characters long are getting out out of hand given we're supposed
> tobe sticking to 80 character widths and we lost 8 of them in the
> first indent...
>
> Cheers,
>
> Dave.

I will try to shorten the name and better document the macro. This is 
probably the most tricky part of the whole part.

Cheers,
Longman
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra Feb. 17, 2016, 4:02 p.m. UTC | #9
On Wed, Feb 17, 2016 at 10:56:10AM -0500, Waiman Long wrote:
> >>+/**
> >>+ * for_all_percpu_list_entries - iterate over all the per-cpu list with locking
> >>+ * @pos:	the type * to use as a loop cursor for the current .
> >>+ * @next:	an internal type * variable pointing to the next entry
> >>+ * @pchead:	an internal struct list * of percpu list head
> >>+ * @pclock:	an internal variable for the current per-cpu spinlock
> >>+ * @head:	the head of the per-cpu list
> >>+ * @member:	the name of the per-cpu list within the struct
> >>+ */
> >>+#define for_all_percpu_list_entries(pos, next, pchead, pclock, head, member)\
> >>+	{								 \
> >>+	int cpu;							 \
> >>+	for_each_possible_cpu (cpu) {					 \
> >>+		typeof(*pos) *next;					 \
> >>+		spinlock_t *pclock = per_cpu_ptr(&(head)->lock, cpu);	 \
> >>+		struct list_head *pchead =&per_cpu_ptr(head, cpu)->list;\
> >>+		spin_lock(pclock);					 \
> >>+		list_for_each_entry_safe(pos, next, pchead, member.list)
> >>+
> >>+#define end_all_percpu_list_entries(pclock)	spin_unlock(pclock); } }

> I will try to shorten the name and better document the macro. This is
> probably the most tricky part of the whole part.

Note that your use of _safe() here actually makes the usage in
iterate_bdevs() unsafe!

Because that function relies on __iget() pinning the current position,
which means that once you re-acquire the list lock, pos->next is valid.

Howveer, _safe() takes pos->next before dropping the lock, and that
object is not actually pinned and can go away.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Waiman Long Feb. 17, 2016, 4:16 p.m. UTC | #10
On 02/17/2016 06:05 AM, Peter Zijlstra wrote:
> On Wed, Feb 17, 2016 at 12:00:40PM +0100, Peter Zijlstra wrote:
>> On Wed, Feb 17, 2016 at 08:53:18PM +1100, Dave Chinner wrote:
>>>> +/**
>>>> + * for_all_percpu_list_entries - iterate over all the per-cpu list with locking
> But the locking is 'pointless'. You only lock one per-cpu sublist at a
> time, therefore the list can be modified concurrently anyhow.

For each per-cpu list, there can be only 1 task allowed to make changes. 
If you are talking about all the per-cpu lists within the group, 
operations can happen in parallel.

> So why not use RCU for the list iteration and avoid potentially large
> lock hold times?
>

I know we can use RCU for singly linked list, but I don't think we can 
use that for doubly linked list as there is no easy way to make atomic 
changes to both prev and next pointers simultaneously unless you are 
taking about 16b cmpxchg which is only supported in some architecture.

Cheers,
Longman
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra Feb. 17, 2016, 4:22 p.m. UTC | #11
On Wed, Feb 17, 2016 at 11:16:10AM -0500, Waiman Long wrote:
> >So why not use RCU for the list iteration and avoid potentially large
> >lock hold times?
> >
> 
> I know we can use RCU for singly linked list, but I don't think we can use
> that for doubly linked list as there is no easy way to make atomic changes
> to both prev and next pointers simultaneously unless you are taking about
> 16b cmpxchg which is only supported in some architecture.

See include/linux/rculist.h, the only constraint is that reverse
iteration is not allowed, because of what you say.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christopher Lameter Feb. 17, 2016, 4:27 p.m. UTC | #12
On Wed, 17 Feb 2016, Waiman Long wrote:

> I know we can use RCU for singly linked list, but I don't think we can use
> that for doubly linked list as there is no easy way to make atomic changes to
> both prev and next pointers simultaneously unless you are taking about 16b
> cmpxchg which is only supported in some architecture.

But its supported in the most important architecutes. You can fall back to
spinlocks on the ones that do not support it.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Waiman Long Feb. 17, 2016, 5:12 p.m. UTC | #13
On 02/17/2016 11:27 AM, Christoph Lameter wrote:
> On Wed, 17 Feb 2016, Waiman Long wrote:
>
>> I know we can use RCU for singly linked list, but I don't think we can use
>> that for doubly linked list as there is no easy way to make atomic changes to
>> both prev and next pointers simultaneously unless you are taking about 16b
>> cmpxchg which is only supported in some architecture.
> But its supported in the most important architecutes. You can fall back to
> spinlocks on the ones that do not support it.
>

I guess with some limitations on how the lists can be traversed, we may 
be able to do that with RCU without lock. However, that will make the 
code more complex and harder to verify. Given that in both my and Dave's 
testing that contentions with list insertion and deletion are almost 
gone from the perf profile when they used to be a bottleneck, is it 
really worth the effort to do such a conversion?

Cheers,
Longman
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra Feb. 17, 2016, 5:18 p.m. UTC | #14
On Wed, Feb 17, 2016 at 12:12:57PM -0500, Waiman Long wrote:
> On 02/17/2016 11:27 AM, Christoph Lameter wrote:
> >On Wed, 17 Feb 2016, Waiman Long wrote:
> >
> >>I know we can use RCU for singly linked list, but I don't think we can use
> >>that for doubly linked list as there is no easy way to make atomic changes to
> >>both prev and next pointers simultaneously unless you are taking about 16b
> >>cmpxchg which is only supported in some architecture.
> >But its supported in the most important architecutes. You can fall back to
> >spinlocks on the ones that do not support it.
> >
> 
> I guess with some limitations on how the lists can be traversed, we may be
> able to do that with RCU without lock. However, that will make the code more
> complex and harder to verify. Given that in both my and Dave's testing that
> contentions with list insertion and deletion are almost gone from the perf
> profile when they used to be a bottleneck, is it really worth the effort to
> do such a conversion?

My initial concern was the preempt disable delay introduced by holding
the spinlock over the entire iteration.

There is no saying how many elements are on that list and there is no
lock break.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Waiman Long Feb. 17, 2016, 5:41 p.m. UTC | #15
On 02/17/2016 12:18 PM, Peter Zijlstra wrote:
> On Wed, Feb 17, 2016 at 12:12:57PM -0500, Waiman Long wrote:
>> On 02/17/2016 11:27 AM, Christoph Lameter wrote:
>>> On Wed, 17 Feb 2016, Waiman Long wrote:
>>>
>>>> I know we can use RCU for singly linked list, but I don't think we can use
>>>> that for doubly linked list as there is no easy way to make atomic changes to
>>>> both prev and next pointers simultaneously unless you are taking about 16b
>>>> cmpxchg which is only supported in some architecture.
>>> But its supported in the most important architecutes. You can fall back to
>>> spinlocks on the ones that do not support it.
>>>
>> I guess with some limitations on how the lists can be traversed, we may be
>> able to do that with RCU without lock. However, that will make the code more
>> complex and harder to verify. Given that in both my and Dave's testing that
>> contentions with list insertion and deletion are almost gone from the perf
>> profile when they used to be a bottleneck, is it really worth the effort to
>> do such a conversion?
> My initial concern was the preempt disable delay introduced by holding
> the spinlock over the entire iteration.
>
> There is no saying how many elements are on that list and there is no
> lock break.

But preempt_disable() is called at the beginning of the spin_lock() 
call. So the additional preempt_disable() in percpu_list_add() is just 
to cover the this_cpu_ptr() call to make sure that the cpu number 
doesn't change. So we are talking about a few ns at most here.

Actually, I think I can remove the preempt_disable() and 
preempt_enable() calls as we just need to put list entry in one of the 
per-cpu lists. It doesn't need to be the same CPU of the current task.

Cheers,
Longman
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra Feb. 17, 2016, 6:22 p.m. UTC | #16
On Wed, Feb 17, 2016 at 12:41:53PM -0500, Waiman Long wrote:
> On 02/17/2016 12:18 PM, Peter Zijlstra wrote:
> >On Wed, Feb 17, 2016 at 12:12:57PM -0500, Waiman Long wrote:
> >>On 02/17/2016 11:27 AM, Christoph Lameter wrote:
> >>>On Wed, 17 Feb 2016, Waiman Long wrote:
> >>>
> >>>>I know we can use RCU for singly linked list, but I don't think we can use
> >>>>that for doubly linked list as there is no easy way to make atomic changes to
> >>>>both prev and next pointers simultaneously unless you are taking about 16b
> >>>>cmpxchg which is only supported in some architecture.
> >>>But its supported in the most important architecutes. You can fall back to
> >>>spinlocks on the ones that do not support it.
> >>>
> >>I guess with some limitations on how the lists can be traversed, we may be
> >>able to do that with RCU without lock. However, that will make the code more
> >>complex and harder to verify. Given that in both my and Dave's testing that
> >>contentions with list insertion and deletion are almost gone from the perf
> >>profile when they used to be a bottleneck, is it really worth the effort to
> >>do such a conversion?
> >My initial concern was the preempt disable delay introduced by holding
> >the spinlock over the entire iteration.
> >
> >There is no saying how many elements are on that list and there is no
> >lock break.
> 
> But preempt_disable() is called at the beginning of the spin_lock() call. So
> the additional preempt_disable() in percpu_list_add() is just to cover the
> this_cpu_ptr() call to make sure that the cpu number doesn't change. So we
> are talking about a few ns at most here.
> 

I'm talking about the list iteration, there is no preempt_disable() in
there, just the spin_lock, which you hold over the entire list, which
can be many, many element.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Waiman Long Feb. 17, 2016, 6:45 p.m. UTC | #17
On 02/17/2016 01:22 PM, Peter Zijlstra wrote:
> On Wed, Feb 17, 2016 at 12:41:53PM -0500, Waiman Long wrote:
>> On 02/17/2016 12:18 PM, Peter Zijlstra wrote:
>>> On Wed, Feb 17, 2016 at 12:12:57PM -0500, Waiman Long wrote:
>>>> On 02/17/2016 11:27 AM, Christoph Lameter wrote:
>>>>> On Wed, 17 Feb 2016, Waiman Long wrote:
>>>>>
>>>>>> I know we can use RCU for singly linked list, but I don't think we can use
>>>>>> that for doubly linked list as there is no easy way to make atomic changes to
>>>>>> both prev and next pointers simultaneously unless you are taking about 16b
>>>>>> cmpxchg which is only supported in some architecture.
>>>>> But its supported in the most important architecutes. You can fall back to
>>>>> spinlocks on the ones that do not support it.
>>>>>
>>>> I guess with some limitations on how the lists can be traversed, we may be
>>>> able to do that with RCU without lock. However, that will make the code more
>>>> complex and harder to verify. Given that in both my and Dave's testing that
>>>> contentions with list insertion and deletion are almost gone from the perf
>>>> profile when they used to be a bottleneck, is it really worth the effort to
>>>> do such a conversion?
>>> My initial concern was the preempt disable delay introduced by holding
>>> the spinlock over the entire iteration.
>>>
>>> There is no saying how many elements are on that list and there is no
>>> lock break.
>> But preempt_disable() is called at the beginning of the spin_lock() call. So
>> the additional preempt_disable() in percpu_list_add() is just to cover the
>> this_cpu_ptr() call to make sure that the cpu number doesn't change. So we
>> are talking about a few ns at most here.
>>
> I'm talking about the list iteration, there is no preempt_disable() in
> there, just the spin_lock, which you hold over the entire list, which
> can be many, many element.

Sorry for the misunderstanding.

The original code has one global lock and one single list that covers 
all the inodes in the filesystem. This patch essentially breaks it up 
into multiple smaller lists with one lock for each. So the lock hold 
time should have been greatly reduced unless we are unfortunately enough 
that most of the inodes are in one single list.

If lock hold time is a concern, I think in some cases we can set the an 
upper limit on how many inodes we want to process, release the lock, 
reacquire it and continue. I am just worry that using RCU and 16b 
cmpxchg will introduce too much complexity with no performance gain to show.

Cheers,
Longman
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Peter Zijlstra Feb. 17, 2016, 7:39 p.m. UTC | #18
On Wed, Feb 17, 2016 at 01:45:35PM -0500, Waiman Long wrote:
> The original code has one global lock and one single list that covers all
> the inodes in the filesystem. This patch essentially breaks it up into
> multiple smaller lists with one lock for each. So the lock hold time should
> have been greatly reduced unless we are unfortunately enough that most of
> the inodes are in one single list.

Most of the inode code has lock breaks in, but in general you cannot do
that.

The more I look at that inode code, the more I think you want an inode
specific visitor interface and not bother provide something generic.

iterate_bdevs(), drop_pagecache_sb(), wait_sb_inodes(), add_dquot_ref()
all have the same pattern. And maybe fsnotify_unmount_inodes() can be
man-handled into the same form.

Afaict only invalidate_inodes() really doesn't do a lock-break, but its
very similar in form to evict_inodes() which does.


> If lock hold time is a concern, I think in some cases we can set the an
> upper limit on how many inodes we want to process, release the lock,
> reacquire it and continue. I am just worry that using RCU and 16b cmpxchg
> will introduce too much complexity with no performance gain to show.

You don't actually need cmpxchg16b in order to use RCU. But given the
users of this, you don't actually need RCU either.

Just don't try and provide a for_each_list_entry() like construct.
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch
diff mbox

diff --git a/include/linux/percpu-list.h b/include/linux/percpu-list.h
new file mode 100644
index 0000000..94be520
--- /dev/null
+++ b/include/linux/percpu-list.h
@@ -0,0 +1,117 @@ 
+/*
+ * Per-cpu list
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ *
+ * Authors: Waiman Long <waiman.long@hpe.com>
+ */
+#ifndef __LINUX_PERCPU_LIST_H
+#define __LINUX_PERCPU_LIST_H
+
+#include <linux/spinlock.h>
+#include <linux/list.h>
+#include <linux/percpu.h>
+
+/*
+ * include/linux/percpu-list.h
+ *
+ * A per-cpu list protected by a per-cpu spinlock.
+ *
+ * The list head percpu_list structure contains the spinlock, the other
+ * entries in the list contain the spinlock pointer.
+ */
+struct percpu_list {
+	struct list_head list;
+	union {
+		spinlock_t lock;	/* For list head */
+		spinlock_t *lockptr;	/* For other entries */
+	};
+};
+
+/*
+ * A simplified for_all_percpu_list_entries macro without the next and pchead
+ * parameters.
+ */
+#define for_all_percpu_list_entries_simple(pos, pclock, head, member)	\
+	for_all_percpu_list_entries(pos, next, pchead, pclock, head, member)
+
+#define PERCPU_LIST_HEAD_INIT(name)				\
+	{							\
+		.list.prev = &name.list,			\
+		.list.next = &name.list,			\
+		.list.lock = __SPIN_LOCK_UNLOCKED(name),	\
+	}
+
+#define PERCPU_LIST_ENTRY_INIT(name)		\
+	{					\
+		.list.prev = &name.list,	\
+		.list.next = &name.list,	\
+		.list.lockptr = NULL		\
+	}
+
+static inline void INIT_PERCPU_LIST_HEAD(struct percpu_list *pcpu_list)
+{
+	INIT_LIST_HEAD(&pcpu_list->list);
+	pcpu_list->lock = __SPIN_LOCK_UNLOCKED(&pcpu_list->lock);
+}
+
+static inline void INIT_PERCPU_LIST_ENTRY(struct percpu_list *pcpu_list)
+{
+	INIT_LIST_HEAD(&pcpu_list->list);
+	pcpu_list->lockptr = NULL;
+}
+
+#define PERCPU_LIST_HEAD(name)	struct percpu_list __percpu *name
+
+static inline void free_percpu_list_head(struct percpu_list **pclist_handle)
+{
+	free_percpu(*pclist_handle);
+	*pclist_handle = NULL;
+}
+
+static inline bool percpu_list_empty(struct percpu_list *pcpu_list)
+{
+	int cpu;
+
+	for_each_possible_cpu (cpu)
+		if (!list_empty(&per_cpu_ptr(pcpu_list, cpu)->list))
+			return false;
+	return true;
+}
+
+/**
+ * for_all_percpu_list_entries - iterate over all the per-cpu list with locking
+ * @pos:	the type * to use as a loop cursor for the current .
+ * @next:	an internal type * variable pointing to the next entry
+ * @pchead:	an internal struct list * of percpu list head
+ * @pclock:	an internal variable for the current per-cpu spinlock
+ * @head:	the head of the per-cpu list
+ * @member:	the name of the per-cpu list within the struct
+ */
+#define for_all_percpu_list_entries(pos, next, pchead, pclock, head, member)\
+	{								 \
+	int cpu;							 \
+	for_each_possible_cpu (cpu) {					 \
+		typeof(*pos) *next;					 \
+		spinlock_t *pclock = per_cpu_ptr(&(head)->lock, cpu);	 \
+		struct list_head *pchead = &per_cpu_ptr(head, cpu)->list;\
+		spin_lock(pclock);					 \
+		list_for_each_entry_safe(pos, next, pchead, member.list)
+
+#define end_all_percpu_list_entries(pclock)	spin_unlock(pclock); } }
+
+extern int init_percpu_list_head(struct percpu_list **pclist_handle);
+extern void percpu_list_add(struct percpu_list *new, struct percpu_list *head);
+extern void percpu_list_del(struct percpu_list *entry);
+
+#endif /* __LINUX_PERCPU_LIST_H */
diff --git a/lib/Makefile b/lib/Makefile
index a7c26a4..71a25d4 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -27,7 +27,7 @@  obj-y += bcd.o div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
 	 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
 	 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
 	 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
-	 once.o
+	 once.o percpu-list.o
 obj-y += string_helpers.o
 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
 obj-y += hexdump.o
diff --git a/lib/percpu-list.c b/lib/percpu-list.c
new file mode 100644
index 0000000..e5c04bf
--- /dev/null
+++ b/lib/percpu-list.c
@@ -0,0 +1,80 @@ 
+/*
+ * Per-cpu list
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ *
+ * Authors: Waiman Long <waiman.long@hpe.com>
+ */
+#include <linux/percpu-list.h>
+
+/*
+ * Initialize the per-cpu list
+ */
+int init_percpu_list_head(struct percpu_list **pclist_handle)
+{
+	struct percpu_list *pclist = alloc_percpu(struct percpu_list);
+	int cpu;
+
+	if (!pclist)
+		return -ENOMEM;
+
+	for_each_possible_cpu (cpu)
+		INIT_PERCPU_LIST_HEAD(per_cpu_ptr(pclist, cpu));
+
+	*pclist_handle = pclist;
+	return 0;
+}
+
+/*
+ * List selection is based on the CPU being used when the percpu_list_add()
+ * function is called. However, deletion may be done by a different CPU.
+ * So we still need to use a lock to protect the content of the list.
+ */
+void percpu_list_add(struct percpu_list *new, struct percpu_list *head)
+{
+	spinlock_t *lock;
+
+	/*
+	 * We need to disable preemption before accessing the per-cpu data
+	 * to make sure that the cpu won't be changed because of preemption.
+	 */
+	preempt_disable();
+	lock = this_cpu_ptr(&head->lock);
+	spin_lock(lock);
+	new->lockptr = lock;
+	list_add(&new->list, this_cpu_ptr(&head->list));
+	spin_unlock(lock);
+	preempt_enable();
+}
+
+/*
+ * Delete an entry from a percpu list
+ *
+ * We need to check the lock pointer again after taking the lock to guard
+ * against concurrent delete of the same entry. If the lock pointer changes
+ * or becomes NULL, we assume that the deletion was done elsewhere.
+ */
+void percpu_list_del(struct percpu_list *entry)
+{
+	spinlock_t *lock = READ_ONCE(entry->lockptr);
+
+	if (unlikely(!lock))
+		return;
+
+	spin_lock(lock);
+	if (likely(entry->lockptr && (lock == entry->lockptr))) {
+		list_del_init(&entry->list);
+		entry->lockptr = NULL;
+	}
+	spin_unlock(lock);
+}