Message ID: 1455672680-7153-2-git-send-email-Waiman.Long@hpe.com (mailing list archive)
State: New, archived
On Tue, Feb 16, 2016 at 08:31:19PM -0500, Waiman Long wrote:
> Linked list is used everywhere in the Linux kernel. However, if many
> threads are trying to add or delete entries into the same linked list,
> it can create a performance bottleneck.
>
> This patch introduces a new per-cpu list subsystem with associated
> per-cpu locks for protecting each of the lists individually. This
> allows list entry insertion and deletion operations to happen in
> parallel instead of being serialized with a global list and lock.
>
> List entry insertion is strictly per cpu. List deletion, however, can
> happen in a cpu other than the one that did the insertion. So we still
> need a lock to protect the list. Because of that, there may still be
> a small amount of contention when deletion is being done.
>
> A new header file include/linux/percpu-list.h will be added with the
> associated percpu_list structure. The following functions are used
> to manage the per-cpu list:
>
> 1. int init_percpu_list_head(struct percpu_list **pclist_handle)
> 2. void percpu_list_add(struct percpu_list *new,
>                         struct percpu_list *head)
> 3. void percpu_list_del(struct percpu_list *entry)

A few comments on the code:

> + * A per-cpu list protected by a per-cpu spinlock.
> + *
> + * The list head percpu_list structure contains the spinlock, the other
> + * entries in the list contain the spinlock pointer.
> + */
> +struct percpu_list {
> +	struct list_head list;
> +	union {
> +		spinlock_t lock;	/* For list head */
> +		spinlock_t *lockptr;	/* For other entries */
> +	};
> +};

This union is bad for kernels running spinlock debugging - the size
of the spinlock can blow out, and that increases the size of any
object that has a percpu_list in it. I've only got basic spinlock
debugging turned on, and the spinlock_t is 24 bytes with that
config. If I turn on lockdep, it gets much larger again....

So it might be best to separate the list head and list entry
structures, similar to a hash list?

> +static inline void INIT_PERCPU_LIST_HEAD(struct percpu_list *pcpu_list)
> +{
> +	INIT_LIST_HEAD(&pcpu_list->list);
> +	pcpu_list->lock = __SPIN_LOCK_UNLOCKED(&pcpu_list->lock);
> +}
> +
> +static inline void INIT_PERCPU_LIST_ENTRY(struct percpu_list *pcpu_list)
> +{
> +	INIT_LIST_HEAD(&pcpu_list->list);
> +	pcpu_list->lockptr = NULL;
> +}

These function names don't need to shout.

> +/**
> + * for_all_percpu_list_entries - iterate over all the per-cpu list with locking
> + * @pos:	the type * to use as a loop cursor for the current .
> + * @next:	an internal type * variable pointing to the next entry
> + * @pchead:	an internal struct list * of percpu list head
> + * @pclock:	an internal variable for the current per-cpu spinlock
> + * @head:	the head of the per-cpu list
> + * @member:	the name of the per-cpu list within the struct
> + */
> +#define for_all_percpu_list_entries(pos, next, pchead, pclock, head, member)\
> +	{								\
> +	int cpu;							\
> +	for_each_possible_cpu (cpu) {					\
> +		typeof(*pos) *next;					\
> +		spinlock_t *pclock = per_cpu_ptr(&(head)->lock, cpu);	\
> +		struct list_head *pchead = &per_cpu_ptr(head, cpu)->list;\
> +		spin_lock(pclock);					\
> +		list_for_each_entry_safe(pos, next, pchead, member.list)
> +
> +#define end_all_percpu_list_entries(pclock)	spin_unlock(pclock); } }

This is a bit of a landmine - the code inside the iteration is under
a spinlock hidden in the macros. People are going to forget about
that, and it needs documenting about how it needs to be treated
w.r.t. dropping and regaining the lock so sleeping operations can be
performed on objects on the list being iterated.

Can we also think up some shorter names - names that are 30-40
characters long are getting out of hand given we're supposed to be
sticking to 80 character widths and we lost 8 of them in the first
indent...

Cheers,

Dave.
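To make the split that Dave suggests concrete, a head/entry layout along the lines of hlist_head/hlist_node could look roughly like this (a sketch only; the type names are hypothetical and not part of the posted patch):

struct pcpu_list_head {
	struct list_head list;
	spinlock_t lock;		/* protects this CPU's sublist only */
};

struct pcpu_list_node {
	struct list_head list;
	spinlock_t *lockptr;		/* lock of the sublist this node is on */
};

With this split, only the per-cpu head pays for a spinlock_t that grows under CONFIG_DEBUG_SPINLOCK or lockdep; objects that embed a list node carry just a pointer.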
On Wed, Feb 17, 2016 at 08:53:18PM +1100, Dave Chinner wrote: > > +/** > > + * for_all_percpu_list_entries - iterate over all the per-cpu list with locking > > + * @pos: the type * to use as a loop cursor for the current . > > + * @next: an internal type * variable pointing to the next entry > > + * @pchead: an internal struct list * of percpu list head > > + * @pclock: an internal variable for the current per-cpu spinlock > > + * @head: the head of the per-cpu list > > + * @member: the name of the per-cpu list within the struct > > + */ > > +#define for_all_percpu_list_entries(pos, next, pchead, pclock, head, member)\ > > + { \ > > + int cpu; \ > > + for_each_possible_cpu (cpu) { \ > > + typeof(*pos) *next; \ > > + spinlock_t *pclock = per_cpu_ptr(&(head)->lock, cpu); \ > > + struct list_head *pchead = &per_cpu_ptr(head, cpu)->list;\ > > + spin_lock(pclock); \ > > + list_for_each_entry_safe(pos, next, pchead, member.list) > > + > > +#define end_all_percpu_list_entries(pclock) spin_unlock(pclock); } } > > This is a bit of a landmine Yeah, that is pretty terrible. Maybe a visitor interface is advisable? visit_percpu_list_entries(struct percpu_list *head, void (*visitor)(struct list_head *pos, void *data), void *data) { int cpu; for_each_possible_cpu(cpu) { spinlock_t *lock = per_cpu_ptr(&head->lock, cpu); struct list_head *head = per_cpu_ptr(&head->list, cpu); struct list_head *pos, *tmp; spin_lock(lock); for (pos = head->next, tmp = pos->next; pos != head; pos = tmp) visitor(pos, data); spin_unlock(lock); } } -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Feb 17, 2016 at 12:00:40PM +0100, Peter Zijlstra wrote: > On Wed, Feb 17, 2016 at 08:53:18PM +1100, Dave Chinner wrote: > > > +/** > > > + * for_all_percpu_list_entries - iterate over all the per-cpu list with locking But the locking is 'pointless'. You only lock one per-cpu sublist at a time, therefore the list can be modified concurrently anyhow. So why not use RCU for the list iteration and avoid potentially large lock hold times? > > > + * @pos: the type * to use as a loop cursor for the current . > > > + * @next: an internal type * variable pointing to the next entry > > > + * @pchead: an internal struct list * of percpu list head > > > + * @pclock: an internal variable for the current per-cpu spinlock > > > + * @head: the head of the per-cpu list > > > + * @member: the name of the per-cpu list within the struct > > > + */ -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
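As an illustration of the RCU direction Peter raises (a sketch only, assuming entries are freed through RCU, e.g. kfree_rcu(), which the posted patch does not do, and using the <linux/rculist.h> helpers; "struct foo" is hypothetical), readers could walk a sublist without taking its spinlock at all:

struct foo {
	struct percpu_list link;
	struct rcu_head rcu;		/* so the object can be kfree_rcu()'d */
	int val;
};

/* Writers keep using the per-cpu spinlock, but with the _rcu list helpers. */
static void foo_unlink(struct foo *f)
{
	spinlock_t *lock = READ_ONCE(f->link.lockptr);

	if (!lock)
		return;
	spin_lock(lock);
	list_del_rcu(&f->link.list);
	spin_unlock(lock);
	kfree_rcu(f, rcu);
}

/* Readers no longer hold the lock, so lock hold times stop being a concern. */
static int foo_count(struct percpu_list __percpu *head)
{
	struct foo *f;
	int cpu, n = 0;

	for_each_possible_cpu(cpu) {
		struct list_head *sub = &per_cpu_ptr(head, cpu)->list;

		rcu_read_lock();
		list_for_each_entry_rcu(f, sub, link.list)
			n++;		/* must not sleep inside the RCU section */
		rcu_read_unlock();
	}
	return n;
}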
On Wed, Feb 17, 2016 at 12:00:40PM +0100, Peter Zijlstra wrote: > On Wed, Feb 17, 2016 at 08:53:18PM +1100, Dave Chinner wrote: > > > +/** > > > + * for_all_percpu_list_entries - iterate over all the per-cpu list with locking > > > + * @pos: the type * to use as a loop cursor for the current . > > > + * @next: an internal type * variable pointing to the next entry > > > + * @pchead: an internal struct list * of percpu list head > > > + * @pclock: an internal variable for the current per-cpu spinlock > > > + * @head: the head of the per-cpu list > > > + * @member: the name of the per-cpu list within the struct > > > + */ > > > +#define for_all_percpu_list_entries(pos, next, pchead, pclock, head, member)\ > > > + { \ > > > + int cpu; \ > > > + for_each_possible_cpu (cpu) { \ > > > + typeof(*pos) *next; \ > > > + spinlock_t *pclock = per_cpu_ptr(&(head)->lock, cpu); \ > > > + struct list_head *pchead = &per_cpu_ptr(head, cpu)->list;\ > > > + spin_lock(pclock); \ > > > + list_for_each_entry_safe(pos, next, pchead, member.list) > > > + > > > +#define end_all_percpu_list_entries(pclock) spin_unlock(pclock); } } > > > > This is a bit of a landmine > > Yeah, that is pretty terrible. Maybe a visitor interface is advisable? > > visit_percpu_list_entries(struct percpu_list *head, void (*visitor)(struct list_head *pos, void *data), void *data) > { > int cpu; > > for_each_possible_cpu(cpu) { > spinlock_t *lock = per_cpu_ptr(&head->lock, cpu); > struct list_head *head = per_cpu_ptr(&head->list, cpu); > struct list_head *pos, *tmp; > > spin_lock(lock); > for (pos = head->next, tmp = pos->next; pos != head; pos = tmp) > visitor(pos, data); I thought about this - it's the same problem as the list_lru walking functions. That is, the visitor has to be able to drop the list lock to do blocking operations, so the lock has to be passed to the visitor/internal loop context somehow, and the way the callers can use it need to be documented. Cheers, Dave.
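A rough sketch of the kind of interface Dave describes (hypothetical names, loosely patterned on the list_lru walk callbacks): the sublist lock is handed to the visitor, and a visitor that drops it must re-take it and ask for a restart, because its iterator may no longer be valid:

enum pcl_walk_ret {
	PCL_WALK_NEXT,		/* lock was held throughout, keep going */
	PCL_WALK_RESTART,	/* lock was dropped and re-taken, restart sublist */
};

typedef enum pcl_walk_ret (*pcl_visitor_t)(struct list_head *pos,
					   spinlock_t *lock, void *data);

static void pcl_walk(struct percpu_list __percpu *head, pcl_visitor_t fn,
		     void *data)
{
	int cpu;

	for_each_possible_cpu(cpu) {
		spinlock_t *lock = per_cpu_ptr(&head->lock, cpu);
		struct list_head *sub = &per_cpu_ptr(head, cpu)->list;
		struct list_head *pos;

		spin_lock(lock);
restart:
		list_for_each(pos, sub) {
			/* the visitor must hold the lock again when it returns */
			if (fn(pos, lock, data) == PCL_WALK_RESTART)
				goto restart;
		}
		spin_unlock(lock);
	}
}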
On Wed, Feb 17, 2016 at 10:10:02PM +1100, Dave Chinner wrote: > On Wed, Feb 17, 2016 at 12:00:40PM +0100, Peter Zijlstra wrote: > > Yeah, that is pretty terrible. Maybe a visitor interface is advisable? > > > > visit_percpu_list_entries(struct percpu_list *head, void (*visitor)(struct list_head *pos, void *data), void *data) > > { > > int cpu; > > > > for_each_possible_cpu(cpu) { > > spinlock_t *lock = per_cpu_ptr(&head->lock, cpu); > > struct list_head *head = per_cpu_ptr(&head->list, cpu); > > struct list_head *pos, *tmp; > > > > spin_lock(lock); > > for (pos = head->next, tmp = pos->next; pos != head; pos = tmp) > > visitor(pos, data); > > I thought about this - it's the same problem as the list_lru walking > functions. That is, the visitor has to be able to drop the list lock > to do blocking operations, so the lock has to be passed to the > visitor/internal loop context somehow, and the way the callers can > use it need to be documented. But you cannot drop the lock and guarantee fwd progress. The moment you drop the lock, you have to restart the iteration from the head, since any iterator you had might now be pointing into space. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Feb 17, 2016 at 12:26:54PM +0100, Peter Zijlstra wrote: > On Wed, Feb 17, 2016 at 10:10:02PM +1100, Dave Chinner wrote: > > On Wed, Feb 17, 2016 at 12:00:40PM +0100, Peter Zijlstra wrote: > > > > Yeah, that is pretty terrible. Maybe a visitor interface is advisable? > > > > > > visit_percpu_list_entries(struct percpu_list *head, void (*visitor)(struct list_head *pos, void *data), void *data) > > > { > > > int cpu; > > > > > > for_each_possible_cpu(cpu) { > > > spinlock_t *lock = per_cpu_ptr(&head->lock, cpu); > > > struct list_head *head = per_cpu_ptr(&head->list, cpu); > > > struct list_head *pos, *tmp; > > > > > > spin_lock(lock); > > > for (pos = head->next, tmp = pos->next; pos != head; pos = tmp) > > > visitor(pos, data); > > > > I thought about this - it's the same problem as the list_lru walking > > functions. That is, the visitor has to be able to drop the list lock > > to do blocking operations, so the lock has to be passed to the > > visitor/internal loop context somehow, and the way the callers can > > use it need to be documented. > > But you cannot drop the lock and guarantee fwd progress. The moment you > drop the lock, you have to restart the iteration from the head, since > any iterator you had might now be pointing into space. Ah, I see what iterate_bdevs() does. Yes, that's somewhat 'special'. Not sure it makes sense to craft a generic 'interface' for that. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 16 Feb 2016, Waiman Long wrote: > List entry insertion is strictly per cpu. List deletion, however, can > happen in a cpu other than the one that did the insertion. So we still > need lock to protect the list. Because of that, there may still be > a small amount of contention when deletion is being done. Is there a way to avoid locking completely? You could use cmpxchg_double to swap both list head and tail in an atomic fashion with some work. There is even a this_cpu_cmpxchg_double available. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
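For reference, the primitive Christoph mentions operates on two adjacent per-cpu words. A rough sketch of its use (hypothetical types; this only illustrates the mechanism for the local-CPU fast path and does not address the remote-CPU deletion case discussed elsewhere in the thread):

struct pcpu_stack {
	struct llist_node *first;	/* singly-linked head */
	unsigned long gen;		/* generation counter, bumped on each push */
} __aligned(2 * sizeof(void *));	/* double-word cmpxchg needs this alignment */

static void pcpu_stack_push(struct pcpu_stack __percpu *s, struct llist_node *n)
{
	struct llist_node *old;
	unsigned long gen;

	preempt_disable();		/* keep the reads and the cmpxchg on one CPU */
	do {
		old = this_cpu_read(s->first);
		gen = this_cpu_read(s->gen);
		n->next = old;
	} while (!this_cpu_cmpxchg_double(s->first, s->gen,
					  old, gen, n, gen + 1));
	preempt_enable();
}

As Christoph notes, a spinlock fallback would still be wanted where a native double-word cmpxchg is not available.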
On 02/17/2016 04:53 AM, Dave Chinner wrote: > On Tue, Feb 16, 2016 at 08:31:19PM -0500, Waiman Long wrote: >> Linked list is used everywhere in the Linux kernel. However, if many >> threads are trying to add or delete entries into the same linked list, >> it can create a performance bottleneck. >> >> This patch introduces a new per-cpu list subystem with associated >> per-cpu locks for protecting each of the lists individually. This >> allows list entries insertion and deletion operations to happen in >> parallel instead of being serialized with a global list and lock. >> >> List entry insertion is strictly per cpu. List deletion, however, can >> happen in a cpu other than the one that did the insertion. So we still >> need lock to protect the list. Because of that, there may still be >> a small amount of contention when deletion is being done. >> >> A new header file include/linux/percpu-list.h will be added with the >> associated percpu_list structure. The following functions are used >> to manage the per-cpu list: >> >> 1. int init_percpu_list_head(struct percpu_list **pclist_handle) >> 2. void percpu_list_add(struct percpu_list *new, >> struct percpu_list *head) >> 3. void percpu_list_del(struct percpu_list *entry) > A few comments on the code > >> + * A per-cpu list protected by a per-cpu spinlock. >> + * >> + * The list head percpu_list structure contains the spinlock, the other >> + * entries in the list contain the spinlock pointer. >> + */ >> +struct percpu_list { >> + struct list_head list; >> + union { >> + spinlock_t lock; /* For list head */ >> + spinlock_t *lockptr; /* For other entries */ >> + }; >> +}; > This union is bad for kernels running spinlock debugging - the size > of the spinlock can blow out, and that increases the size of any > object that has a percpu_list in it. I've only got basic spinlock > debugging turned on, and the spinlock_t is 24 bytes with that > config. If I turn on lockdep, it gets much larger again.... > > So it might be best to separate the list head and list entry > structures, similar to a hash list? Right. I will split it into 2 separate structure in the next iteration of the patch. >> +static inline void INIT_PERCPU_LIST_HEAD(struct percpu_list *pcpu_list) >> +{ >> + INIT_LIST_HEAD(&pcpu_list->list); >> + pcpu_list->lock = __SPIN_LOCK_UNLOCKED(&pcpu_list->lock); >> +} >> + >> +static inline void INIT_PERCPU_LIST_ENTRY(struct percpu_list *pcpu_list) >> +{ >> + INIT_LIST_HEAD(&pcpu_list->list); >> + pcpu_list->lockptr = NULL; >> +} > These function names don't need to shout. I was just following the convention used in list init functions. I can certainly change them to lowercase. > >> +/** >> + * for_all_percpu_list_entries - iterate over all the per-cpu list with locking >> + * @pos: the type * to use as a loop cursor for the current . 
>> + * @next: an internal type * variable pointing to the next entry >> + * @pchead: an internal struct list * of percpu list head >> + * @pclock: an internal variable for the current per-cpu spinlock >> + * @head: the head of the per-cpu list >> + * @member: the name of the per-cpu list within the struct >> + */ >> +#define for_all_percpu_list_entries(pos, next, pchead, pclock, head, member)\ >> + { \ >> + int cpu; \ >> + for_each_possible_cpu (cpu) { \ >> + typeof(*pos) *next; \ >> + spinlock_t *pclock = per_cpu_ptr(&(head)->lock, cpu); \ >> + struct list_head *pchead =&per_cpu_ptr(head, cpu)->list;\ >> + spin_lock(pclock); \ >> + list_for_each_entry_safe(pos, next, pchead, member.list) >> + >> +#define end_all_percpu_list_entries(pclock) spin_unlock(pclock); } } > This is a bit of a landmine - the code inside he iteration is under > a spinlock hidden in the macros. People are going to forget about > that, and it's needs documenting about how it needs to be treated > w.r.t. dropping and regaining the lock so sleeping operations can be > performed on objects on the list being iterated. > > Can we also think up some shorter names - names that are 30-40 > characters long are getting out out of hand given we're supposed > tobe sticking to 80 character widths and we lost 8 of them in the > first indent... > > Cheers, > > Dave. I will try to shorten the name and better document the macro. This is probably the most tricky part of the whole part. Cheers, Longman -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Feb 17, 2016 at 10:56:10AM -0500, Waiman Long wrote:
> >>+/**
> >>+ * for_all_percpu_list_entries - iterate over all the per-cpu list with locking
> >>+ * @pos:	the type * to use as a loop cursor for the current .
> >>+ * @next:	an internal type * variable pointing to the next entry
> >>+ * @pchead:	an internal struct list * of percpu list head
> >>+ * @pclock:	an internal variable for the current per-cpu spinlock
> >>+ * @head:	the head of the per-cpu list
> >>+ * @member:	the name of the per-cpu list within the struct
> >>+ */
> >>+#define for_all_percpu_list_entries(pos, next, pchead, pclock, head, member)\
> >>+	{								\
> >>+	int cpu;							\
> >>+	for_each_possible_cpu (cpu) {					\
> >>+		typeof(*pos) *next;					\
> >>+		spinlock_t *pclock = per_cpu_ptr(&(head)->lock, cpu);	\
> >>+		struct list_head *pchead = &per_cpu_ptr(head, cpu)->list;\
> >>+		spin_lock(pclock);					\
> >>+		list_for_each_entry_safe(pos, next, pchead, member.list)
> >>+
> >>+#define end_all_percpu_list_entries(pclock)	spin_unlock(pclock); } }
> I will try to shorten the name and better document the macro. This is
> probably the most tricky part of the whole part.

Note that your use of _safe() here actually makes the usage in
iterate_bdevs() unsafe! Because that function relies on __iget()
pinning the current position, which means that once you re-acquire
the list lock, pos->next is valid. However, _safe() takes pos->next
before dropping the lock, and that object is not actually pinned and
can go away.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
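To spell out the pattern Peter is referring to (heavily simplified, hypothetical helper name; the i_lock/i_state checks the real callers perform are omitted): iterate_bdevs() can drop the lock because it pins the current inode, and it uses plain list_for_each_entry() so that ->next is only read after the lock is re-taken:

static void walk_sublist_inodes(struct list_head *sublist, spinlock_t *lock,
				void (*func)(struct inode *, void *), void *arg)
{
	struct inode *inode, *prev = NULL;

	spin_lock(lock);
	list_for_each_entry(inode, sublist, i_sb_list) {
		__iget(inode);		/* pin the cursor before unlocking */
		spin_unlock(lock);

		iput(prev);		/* drop the previous pin outside the lock */
		prev = inode;
		func(inode, arg);	/* may sleep */

		spin_lock(lock);	/* "inode" is still on the list, so its
					 * ->next is safe to follow from here */
	}
	spin_unlock(lock);
	iput(prev);
}

A list_for_each_entry_safe() loop, by contrast, latches the next entry before the unlock, and nothing pins that entry.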
On 02/17/2016 06:05 AM, Peter Zijlstra wrote: > On Wed, Feb 17, 2016 at 12:00:40PM +0100, Peter Zijlstra wrote: >> On Wed, Feb 17, 2016 at 08:53:18PM +1100, Dave Chinner wrote: >>>> +/** >>>> + * for_all_percpu_list_entries - iterate over all the per-cpu list with locking > But the locking is 'pointless'. You only lock one per-cpu sublist at a > time, therefore the list can be modified concurrently anyhow. For each per-cpu list, there can be only 1 task allowed to make changes. If you are talking about all the per-cpu lists within the group, operations can happen in parallel. > So why not use RCU for the list iteration and avoid potentially large > lock hold times? > I know we can use RCU for singly linked list, but I don't think we can use that for doubly linked list as there is no easy way to make atomic changes to both prev and next pointers simultaneously unless you are taking about 16b cmpxchg which is only supported in some architecture. Cheers, Longman -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Feb 17, 2016 at 11:16:10AM -0500, Waiman Long wrote: > >So why not use RCU for the list iteration and avoid potentially large > >lock hold times? > > > > I know we can use RCU for singly linked list, but I don't think we can use > that for doubly linked list as there is no easy way to make atomic changes > to both prev and next pointers simultaneously unless you are taking about > 16b cmpxchg which is only supported in some architecture. See include/linux/rculist.h, the only constraint is that reverse iteration is not allowed, because of what you say. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 17 Feb 2016, Waiman Long wrote:
> I know we can use RCU for singly linked list, but I don't think we can use
> that for doubly linked list as there is no easy way to make atomic changes to
> both prev and next pointers simultaneously unless you are talking about 16b
> cmpxchg which is only supported in some architecture.

But it's supported in the most important architectures. You can fall back to
spinlocks on the ones that do not support it.

--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On 02/17/2016 11:27 AM, Christoph Lameter wrote: > On Wed, 17 Feb 2016, Waiman Long wrote: > >> I know we can use RCU for singly linked list, but I don't think we can use >> that for doubly linked list as there is no easy way to make atomic changes to >> both prev and next pointers simultaneously unless you are taking about 16b >> cmpxchg which is only supported in some architecture. > But its supported in the most important architecutes. You can fall back to > spinlocks on the ones that do not support it. > I guess with some limitations on how the lists can be traversed, we may be able to do that with RCU without lock. However, that will make the code more complex and harder to verify. Given that in both my and Dave's testing that contentions with list insertion and deletion are almost gone from the perf profile when they used to be a bottleneck, is it really worth the effort to do such a conversion? Cheers, Longman -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Feb 17, 2016 at 12:12:57PM -0500, Waiman Long wrote: > On 02/17/2016 11:27 AM, Christoph Lameter wrote: > >On Wed, 17 Feb 2016, Waiman Long wrote: > > > >>I know we can use RCU for singly linked list, but I don't think we can use > >>that for doubly linked list as there is no easy way to make atomic changes to > >>both prev and next pointers simultaneously unless you are taking about 16b > >>cmpxchg which is only supported in some architecture. > >But its supported in the most important architecutes. You can fall back to > >spinlocks on the ones that do not support it. > > > > I guess with some limitations on how the lists can be traversed, we may be > able to do that with RCU without lock. However, that will make the code more > complex and harder to verify. Given that in both my and Dave's testing that > contentions with list insertion and deletion are almost gone from the perf > profile when they used to be a bottleneck, is it really worth the effort to > do such a conversion? My initial concern was the preempt disable delay introduced by holding the spinlock over the entire iteration. There is no saying how many elements are on that list and there is no lock break. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 02/17/2016 12:18 PM, Peter Zijlstra wrote: > On Wed, Feb 17, 2016 at 12:12:57PM -0500, Waiman Long wrote: >> On 02/17/2016 11:27 AM, Christoph Lameter wrote: >>> On Wed, 17 Feb 2016, Waiman Long wrote: >>> >>>> I know we can use RCU for singly linked list, but I don't think we can use >>>> that for doubly linked list as there is no easy way to make atomic changes to >>>> both prev and next pointers simultaneously unless you are taking about 16b >>>> cmpxchg which is only supported in some architecture. >>> But its supported in the most important architecutes. You can fall back to >>> spinlocks on the ones that do not support it. >>> >> I guess with some limitations on how the lists can be traversed, we may be >> able to do that with RCU without lock. However, that will make the code more >> complex and harder to verify. Given that in both my and Dave's testing that >> contentions with list insertion and deletion are almost gone from the perf >> profile when they used to be a bottleneck, is it really worth the effort to >> do such a conversion? > My initial concern was the preempt disable delay introduced by holding > the spinlock over the entire iteration. > > There is no saying how many elements are on that list and there is no > lock break. But preempt_disable() is called at the beginning of the spin_lock() call. So the additional preempt_disable() in percpu_list_add() is just to cover the this_cpu_ptr() call to make sure that the cpu number doesn't change. So we are talking about a few ns at most here. Actually, I think I can remove the preempt_disable() and preempt_enable() calls as we just need to put list entry in one of the per-cpu lists. It doesn't need to be the same CPU of the current task. Cheers, Longman -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
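The change Waiman describes could look something like this (an illustrative variant, not the posted code): the CPU number merely picks a sublist, so a stale value after migration is harmless because the sublist spinlock still serializes the insertion:

void percpu_list_add(struct percpu_list *new, struct percpu_list *head)
{
	/* raw_: we explicitly do not care if we migrate after this point */
	struct percpu_list *pcl = per_cpu_ptr(head, raw_smp_processor_id());

	spin_lock(&pcl->lock);
	new->lockptr = &pcl->lock;
	list_add(&new->list, &pcl->list);
	spin_unlock(&pcl->lock);
}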
On Wed, Feb 17, 2016 at 12:41:53PM -0500, Waiman Long wrote: > On 02/17/2016 12:18 PM, Peter Zijlstra wrote: > >On Wed, Feb 17, 2016 at 12:12:57PM -0500, Waiman Long wrote: > >>On 02/17/2016 11:27 AM, Christoph Lameter wrote: > >>>On Wed, 17 Feb 2016, Waiman Long wrote: > >>> > >>>>I know we can use RCU for singly linked list, but I don't think we can use > >>>>that for doubly linked list as there is no easy way to make atomic changes to > >>>>both prev and next pointers simultaneously unless you are taking about 16b > >>>>cmpxchg which is only supported in some architecture. > >>>But its supported in the most important architecutes. You can fall back to > >>>spinlocks on the ones that do not support it. > >>> > >>I guess with some limitations on how the lists can be traversed, we may be > >>able to do that with RCU without lock. However, that will make the code more > >>complex and harder to verify. Given that in both my and Dave's testing that > >>contentions with list insertion and deletion are almost gone from the perf > >>profile when they used to be a bottleneck, is it really worth the effort to > >>do such a conversion? > >My initial concern was the preempt disable delay introduced by holding > >the spinlock over the entire iteration. > > > >There is no saying how many elements are on that list and there is no > >lock break. > > But preempt_disable() is called at the beginning of the spin_lock() call. So > the additional preempt_disable() in percpu_list_add() is just to cover the > this_cpu_ptr() call to make sure that the cpu number doesn't change. So we > are talking about a few ns at most here. > I'm talking about the list iteration, there is no preempt_disable() in there, just the spin_lock, which you hold over the entire list, which can be many, many element. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 02/17/2016 01:22 PM, Peter Zijlstra wrote: > On Wed, Feb 17, 2016 at 12:41:53PM -0500, Waiman Long wrote: >> On 02/17/2016 12:18 PM, Peter Zijlstra wrote: >>> On Wed, Feb 17, 2016 at 12:12:57PM -0500, Waiman Long wrote: >>>> On 02/17/2016 11:27 AM, Christoph Lameter wrote: >>>>> On Wed, 17 Feb 2016, Waiman Long wrote: >>>>> >>>>>> I know we can use RCU for singly linked list, but I don't think we can use >>>>>> that for doubly linked list as there is no easy way to make atomic changes to >>>>>> both prev and next pointers simultaneously unless you are taking about 16b >>>>>> cmpxchg which is only supported in some architecture. >>>>> But its supported in the most important architecutes. You can fall back to >>>>> spinlocks on the ones that do not support it. >>>>> >>>> I guess with some limitations on how the lists can be traversed, we may be >>>> able to do that with RCU without lock. However, that will make the code more >>>> complex and harder to verify. Given that in both my and Dave's testing that >>>> contentions with list insertion and deletion are almost gone from the perf >>>> profile when they used to be a bottleneck, is it really worth the effort to >>>> do such a conversion? >>> My initial concern was the preempt disable delay introduced by holding >>> the spinlock over the entire iteration. >>> >>> There is no saying how many elements are on that list and there is no >>> lock break. >> But preempt_disable() is called at the beginning of the spin_lock() call. So >> the additional preempt_disable() in percpu_list_add() is just to cover the >> this_cpu_ptr() call to make sure that the cpu number doesn't change. So we >> are talking about a few ns at most here. >> > I'm talking about the list iteration, there is no preempt_disable() in > there, just the spin_lock, which you hold over the entire list, which > can be many, many element. Sorry for the misunderstanding. The original code has one global lock and one single list that covers all the inodes in the filesystem. This patch essentially breaks it up into multiple smaller lists with one lock for each. So the lock hold time should have been greatly reduced unless we are unfortunately enough that most of the inodes are in one single list. If lock hold time is a concern, I think in some cases we can set the an upper limit on how many inodes we want to process, release the lock, reacquire it and continue. I am just worry that using RCU and 16b cmpxchg will introduce too much complexity with no performance gain to show. Cheers, Longman -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
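One way to bound the lock hold time while avoiding the dangling-iterator problem is to detach a bounded batch under the lock and process it after dropping the lock (an illustrative sketch with hypothetical names; it only suits operations, such as eviction, that are allowed to take entries off the list):

#define PCL_DISPOSE_BATCH	32

static void pcl_dispose_sublist(struct list_head *sublist, spinlock_t *lock,
				void (*dispose)(struct percpu_list *))
{
	struct percpu_list *node, *next;

	for (;;) {
		LIST_HEAD(batch);
		int n = 0;

		spin_lock(lock);
		list_for_each_entry_safe(node, next, sublist, list) {
			list_move(&node->list, &batch);
			node->lockptr = NULL;
			if (++n >= PCL_DISPOSE_BATCH)
				break;
		}
		spin_unlock(lock);

		if (!n)
			return;

		/* The detached entries are now private to us; sleeping is fine. */
		list_for_each_entry_safe(node, next, &batch, list)
			dispose(node);
	}
}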
On Wed, Feb 17, 2016 at 01:45:35PM -0500, Waiman Long wrote: > The original code has one global lock and one single list that covers all > the inodes in the filesystem. This patch essentially breaks it up into > multiple smaller lists with one lock for each. So the lock hold time should > have been greatly reduced unless we are unfortunately enough that most of > the inodes are in one single list. Most of the inode code has lock breaks in, but in general you cannot do that. The more I look at that inode code, the more I think you want an inode specific visitor interface and not bother provide something generic. iterate_bdevs(), drop_pagecache_sb(), wait_sb_inodes(), add_dquot_ref() all have the same pattern. And maybe fsnotify_unmount_inodes() can be man-handled into the same form. Afaict only invalidate_inodes() really doesn't do a lock-break, but its very similar in form to evict_inodes() which does. > If lock hold time is a concern, I think in some cases we can set the an > upper limit on how many inodes we want to process, release the lock, > reacquire it and continue. I am just worry that using RCU and 16b cmpxchg > will introduce too much complexity with no performance gain to show. You don't actually need cmpxchg16b in order to use RCU. But given the users of this, you don't actually need RCU either. Just don't try and provide a for_each_list_entry() like construct. -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/include/linux/percpu-list.h b/include/linux/percpu-list.h
new file mode 100644
index 0000000..94be520
--- /dev/null
+++ b/include/linux/percpu-list.h
@@ -0,0 +1,117 @@
+/*
+ * Per-cpu list
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ *
+ * Authors: Waiman Long <waiman.long@hpe.com>
+ */
+#ifndef __LINUX_PERCPU_LIST_H
+#define __LINUX_PERCPU_LIST_H
+
+#include <linux/spinlock.h>
+#include <linux/list.h>
+#include <linux/percpu.h>
+
+/*
+ * include/linux/percpu-list.h
+ *
+ * A per-cpu list protected by a per-cpu spinlock.
+ *
+ * The list head percpu_list structure contains the spinlock, the other
+ * entries in the list contain the spinlock pointer.
+ */
+struct percpu_list {
+	struct list_head list;
+	union {
+		spinlock_t lock;	/* For list head */
+		spinlock_t *lockptr;	/* For other entries */
+	};
+};
+
+/*
+ * A simplified for_all_percpu_list_entries macro without the next and pchead
+ * parameters.
+ */
+#define for_all_percpu_list_entries_simple(pos, pclock, head, member)	\
+	for_all_percpu_list_entries(pos, next, pchead, pclock, head, member)
+
+#define PERCPU_LIST_HEAD_INIT(name)				\
+	{							\
+		.list.prev = &name.list,			\
+		.list.next = &name.list,			\
+		.list.lock = __SPIN_LOCK_UNLOCKED(name),	\
+	}
+
+#define PERCPU_LIST_ENTRY_INIT(name)		\
+	{					\
+		.list.prev = &name.list,	\
+		.list.next = &name.list,	\
+		.list.lockptr = NULL		\
+	}
+
+static inline void INIT_PERCPU_LIST_HEAD(struct percpu_list *pcpu_list)
+{
+	INIT_LIST_HEAD(&pcpu_list->list);
+	pcpu_list->lock = __SPIN_LOCK_UNLOCKED(&pcpu_list->lock);
+}
+
+static inline void INIT_PERCPU_LIST_ENTRY(struct percpu_list *pcpu_list)
+{
+	INIT_LIST_HEAD(&pcpu_list->list);
+	pcpu_list->lockptr = NULL;
+}
+
+#define PERCPU_LIST_HEAD(name)	struct percpu_list __percpu *name
+
+static inline void free_percpu_list_head(struct percpu_list **pclist_handle)
+{
+	free_percpu(*pclist_handle);
+	*pclist_handle = NULL;
+}
+
+static inline bool percpu_list_empty(struct percpu_list *pcpu_list)
+{
+	int cpu;
+
+	for_each_possible_cpu (cpu)
+		if (!list_empty(&per_cpu_ptr(pcpu_list, cpu)->list))
+			return false;
+	return true;
+}
+
+/**
+ * for_all_percpu_list_entries - iterate over all the per-cpu list with locking
+ * @pos:	the type * to use as a loop cursor for the current .
+ * @next:	an internal type * variable pointing to the next entry
+ * @pchead:	an internal struct list * of percpu list head
+ * @pclock:	an internal variable for the current per-cpu spinlock
+ * @head:	the head of the per-cpu list
+ * @member:	the name of the per-cpu list within the struct
+ */
+#define for_all_percpu_list_entries(pos, next, pchead, pclock, head, member)\
+	{								\
+	int cpu;							\
+	for_each_possible_cpu (cpu) {					\
+		typeof(*pos) *next;					\
+		spinlock_t *pclock = per_cpu_ptr(&(head)->lock, cpu);	\
+		struct list_head *pchead = &per_cpu_ptr(head, cpu)->list;\
+		spin_lock(pclock);					\
+		list_for_each_entry_safe(pos, next, pchead, member.list)
+
+#define end_all_percpu_list_entries(pclock)	spin_unlock(pclock); } }
+
+extern int init_percpu_list_head(struct percpu_list **pclist_handle);
+extern void percpu_list_add(struct percpu_list *new, struct percpu_list *head);
+extern void percpu_list_del(struct percpu_list *entry);
+
+#endif /* __LINUX_PERCPU_LIST_H */
diff --git a/lib/Makefile b/lib/Makefile
index a7c26a4..71a25d4 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -27,7 +27,7 @@ obj-y += bcd.o div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
 	 gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
 	 bsearch.o find_bit.o llist.o memweight.o kfifo.o \
 	 percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
-	 once.o
+	 once.o percpu-list.o
 obj-y += string_helpers.o
 obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
 obj-y += hexdump.o
diff --git a/lib/percpu-list.c b/lib/percpu-list.c
new file mode 100644
index 0000000..e5c04bf
--- /dev/null
+++ b/lib/percpu-list.c
@@ -0,0 +1,80 @@
+/*
+ * Per-cpu list
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ *
+ * Authors: Waiman Long <waiman.long@hpe.com>
+ */
+#include <linux/percpu-list.h>
+
+/*
+ * Initialize the per-cpu list
+ */
+int init_percpu_list_head(struct percpu_list **pclist_handle)
+{
+	struct percpu_list *pclist = alloc_percpu(struct percpu_list);
+	int cpu;
+
+	if (!pclist)
+		return -ENOMEM;
+
+	for_each_possible_cpu (cpu)
+		INIT_PERCPU_LIST_HEAD(per_cpu_ptr(pclist, cpu));
+
+	*pclist_handle = pclist;
+	return 0;
+}
+
+/*
+ * List selection is based on the CPU being used when the percpu_list_add()
+ * function is called. However, deletion may be done by a different CPU.
+ * So we still need to use a lock to protect the content of the list.
+ */
+void percpu_list_add(struct percpu_list *new, struct percpu_list *head)
+{
+	spinlock_t *lock;
+
+	/*
+	 * We need to disable preemption before accessing the per-cpu data
+	 * to make sure that the cpu won't be changed because of preemption.
+	 */
+	preempt_disable();
+	lock = this_cpu_ptr(&head->lock);
+	spin_lock(lock);
+	new->lockptr = lock;
+	list_add(&new->list, this_cpu_ptr(&head->list));
+	spin_unlock(lock);
+	preempt_enable();
+}
+
+/*
+ * Delete an entry from a percpu list
+ *
+ * We need to check the lock pointer again after taking the lock to guard
+ * against concurrent delete of the same entry. If the lock pointer changes
+ * or becomes NULL, we assume that the deletion was done elsewhere.
+ */
+void percpu_list_del(struct percpu_list *entry)
+{
+	spinlock_t *lock = READ_ONCE(entry->lockptr);
+
+	if (unlikely(!lock))
+		return;
+
+	spin_lock(lock);
+	if (likely(entry->lockptr && (lock == entry->lockptr))) {
+		list_del_init(&entry->list);
+		entry->lockptr = NULL;
+	}
+	spin_unlock(lock);
+}
Linked list is used everywhere in the Linux kernel. However, if many threads are trying to add or delete entries into the same linked list, it can create a performance bottleneck.

This patch introduces a new per-cpu list subsystem with associated per-cpu locks for protecting each of the lists individually. This allows list entry insertion and deletion operations to happen in parallel instead of being serialized with a global list and lock.

List entry insertion is strictly per cpu. List deletion, however, can happen in a cpu other than the one that did the insertion. So we still need a lock to protect the list. Because of that, there may still be a small amount of contention when deletion is being done.

A new header file include/linux/percpu-list.h will be added with the associated percpu_list structure. The following functions are used to manage the per-cpu list:

1. int init_percpu_list_head(struct percpu_list **pclist_handle)
2. void percpu_list_add(struct percpu_list *new, struct percpu_list *head)
3. void percpu_list_del(struct percpu_list *entry)

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 include/linux/percpu-list.h | 117 +++++++++++++++++++++++++++++++++++++++++++
 lib/Makefile                |   2 +-
 lib/percpu-list.c           |  80 +++++++++++++++++++++++++++++
 3 files changed, 198 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/percpu-list.h
 create mode 100644 lib/percpu-list.c
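A minimal usage sketch of the interface as posted ("struct foo" and its helpers are hypothetical, for illustration only):

struct foo {
	int value;
	struct percpu_list link;
};

static PERCPU_LIST_HEAD(foo_list);

static int foo_subsys_init(void)
{
	/* allocates and initializes the per-cpu list heads */
	return init_percpu_list_head(&foo_list);
}

static void foo_insert(struct foo *f)
{
	INIT_PERCPU_LIST_ENTRY(&f->link);
	percpu_list_add(&f->link, foo_list);	/* goes onto the current CPU's sublist */
}

static void foo_remove(struct foo *f)
{
	percpu_list_del(&f->link);		/* may run on any CPU */
}

static int foo_count_all(void)
{
	struct foo *f;
	int n = 0;

	/* walks every per-cpu sublist; the sublist lock is held inside the loop */
	for_all_percpu_list_entries_simple(f, pclock, foo_list, link)
		n++;
	end_all_percpu_list_entries(pclock);

	return n;
}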