| Message ID | de516124-ffd7-d159-2848-00c65a8573a8@ya.ru (mailing list archive) |
|---|---|
| State | Changes Requested |
| Delegated to | Netdev Maintainers |
| Series | net: skb content must be visible for lockless skb_peek() and its variations |
On Sun, 2022-07-31 at 23:39 +0300, Kirill Tkhai wrote:
> From: Kirill Tkhai <tkhai@ya.ru>
>
> Currently, there are no barriers, and the skb->xxx update may become invisible on cpu2.
> In the below example var2 may point to initial_val0 instead of the expected var1:
>
> [cpu1]                               [cpu2]
> skb->xxx = initial_val0;
> ...
> skb->xxx = var1;                     skb = READ_ONCE(prev_skb->next);
> <no barrier>                         <no barrier>
> WRITE_ONCE(prev_skb->next, skb);     var2 = skb->xxx;
>
> This patch adds barriers and fixes the problem. Note that __skb_peek() is not patched,
> since it is a low-level function, and a caller has to understand what it does (and
> __skb_peek() is also used under the queue lock in some places).
>
> Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
> ---
> Hi, David, Eric and other developers,
>
> picking through the unix sockets code I found this problem,

Could you please report exactly how/where the problem manifests (e.g.
the involved call paths/time sequence)?

> and for me it looks like it exists. If there
> are arguments that everything is OK and it's expected, please explain.

I don't see why such barriers are needed for the locked peek/tail
variants, as the spin_lock pair implies a full memory barrier.

Cheers,

Paolo
On 01.08.2022 09:52, Paolo Abeni wrote:
> On Sun, 2022-07-31 at 23:39 +0300, Kirill Tkhai wrote:
>> From: Kirill Tkhai <tkhai@ya.ru>
>>
>> Currently, there are no barriers, and the skb->xxx update may become invisible on cpu2.
>> In the below example var2 may point to initial_val0 instead of the expected var1:
>>
>> [cpu1]                               [cpu2]
>> skb->xxx = initial_val0;
>> ...
>> skb->xxx = var1;                     skb = READ_ONCE(prev_skb->next);
>> <no barrier>                         <no barrier>
>> WRITE_ONCE(prev_skb->next, skb);     var2 = skb->xxx;
>>
>> This patch adds barriers and fixes the problem. Note that __skb_peek() is not patched,
>> since it is a low-level function, and a caller has to understand what it does (and
>> __skb_peek() is also used under the queue lock in some places).
>>
>> Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
>> ---
>> Hi, David, Eric and other developers,
>>
>> picking through the unix sockets code I found this problem,
>
> Could you please report exactly how/where the problem manifests (e.g.
> the involved call paths/time sequence)?

I didn't get why the call paths in the patch description are not enough for you. Please explain
what you want.

>> and for me it looks like it exists. If there
>> are arguments that everything is OK and it's expected, please explain.
>
> I don't see why such barriers are needed for the locked peek/tail
> variants, as the spin_lock pair implies a full memory barrier.

This is for lockless skb_peek() calls, which is why the patch is titled that way :). For locked skb_peek()
this is not needed. I'm not sure we need separate skb_peek() and skb_peek_lockless(). Do we?
On Mon, Aug 1, 2022 at 9:00 AM Kirill Tkhai <tkhai@ya.ru> wrote:
>
> On 01.08.2022 09:52, Paolo Abeni wrote:
> > On Sun, 2022-07-31 at 23:39 +0300, Kirill Tkhai wrote:
> >> From: Kirill Tkhai <tkhai@ya.ru>
> >>
> >> Currently, there are no barriers, and the skb->xxx update may become invisible on cpu2.
> >> In the below example var2 may point to initial_val0 instead of the expected var1:
> >>
> >> [cpu1]                               [cpu2]
> >> skb->xxx = initial_val0;
> >> ...
> >> skb->xxx = var1;                     skb = READ_ONCE(prev_skb->next);
> >> <no barrier>                         <no barrier>
> >> WRITE_ONCE(prev_skb->next, skb);     var2 = skb->xxx;
> >>
> >> This patch adds barriers and fixes the problem. Note that __skb_peek() is not patched,
> >> since it is a low-level function, and a caller has to understand what it does (and
> >> __skb_peek() is also used under the queue lock in some places).
> >>
> >> Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
> >> ---
> >> Hi, David, Eric and other developers,
> >>
> >> picking through the unix sockets code I found this problem,
> >
> > Could you please report exactly how/where the problem manifests (e.g.
> > the involved call paths/time sequence)?
>
> I didn't get why the call paths in the patch description are not enough for you. Please explain
> what you want.
>
> >> and for me it looks like it exists. If there
> >> are arguments that everything is OK and it's expected, please explain.
> >
> > I don't see why such barriers are needed for the locked peek/tail
> > variants, as the spin_lock pair implies a full memory barrier.
>
> This is for lockless skb_peek() calls, which is why the patch is titled that way :). For locked skb_peek()
> this is not needed. I'm not sure we need separate skb_peek() and skb_peek_lockless(). Do we?

We prefer explicit _lockless variants to document the precise points
where they are needed.

A new helper (and its initial usage) will clearly point to the problem
you saw in af_unix.

BTW, smp_mb__after_spinlock() in your patch does not really make sense to me.

Please add in your changelog the precise issue you are seeing.
On Mon, 2022-08-01 at 10:00 +0300, Kirill Tkhai wrote:
> On 01.08.2022 09:52, Paolo Abeni wrote:
> > On Sun, 2022-07-31 at 23:39 +0300, Kirill Tkhai wrote:
> > > From: Kirill Tkhai <tkhai@ya.ru>
> > >
> > > Currently, there are no barriers, and the skb->xxx update may become invisible on cpu2.
> > > In the below example var2 may point to initial_val0 instead of the expected var1:
> > >
> > > [cpu1]                               [cpu2]
> > > skb->xxx = initial_val0;
> > > ...
> > > skb->xxx = var1;                     skb = READ_ONCE(prev_skb->next);
> > > <no barrier>                         <no barrier>
> > > WRITE_ONCE(prev_skb->next, skb);     var2 = skb->xxx;
> > >
> > > This patch adds barriers and fixes the problem. Note that __skb_peek() is not patched,
> > > since it is a low-level function, and a caller has to understand what it does (and
> > > __skb_peek() is also used under the queue lock in some places).
> > >
> > > Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
> > > ---
> > > Hi, David, Eric and other developers,
> > >
> > > picking through the unix sockets code I found this problem,
> >
> > Could you please report exactly how/where the problem manifests (e.g.
> > the involved call paths/time sequence)?
>
> I didn't get why the call paths in the patch description are not enough for you. Please explain
> what you want.

You mentioned the unix socket, so I expect to see something alike (I'm
totally making up the symbol lists just to give an example):

CPU0                                 CPU1
unix_stream_read_generic()           unix_stream_sendmsg()
  skb_peek()                           skb_queue_tail(other->sk_receive_queue)

plus some wording on how the critical race is reached, if not
completely obvious.

> > > and for me it looks like it exists. If there
> > > are arguments that everything is OK and it's expected, please explain.
> >
> > I don't see why such barriers are needed for the locked peek/tail
> > variants, as the spin_lock pair implies a full memory barrier.
>
> This is for lockless skb_peek() calls, which is why the patch is titled that way :). For locked skb_peek()
> this is not needed.

But you are also unconditionally adding barriers to the locked
append/enqueue functions - which would possibly make sense only when
the latter are paired with lockless read access.

As Eric said, if any new barrier is needed, we want to apply it only
where needed, and not to every skb_queue*()/skb_peek() user, so very
likely a new helper (or a new pair of helpers) will be needed.

Thanks!

Paolo
On 01.08.2022 12:59, Paolo Abeni wrote:
> On Mon, 2022-08-01 at 10:00 +0300, Kirill Tkhai wrote:
>> On 01.08.2022 09:52, Paolo Abeni wrote:
>>> On Sun, 2022-07-31 at 23:39 +0300, Kirill Tkhai wrote:
>>>> From: Kirill Tkhai <tkhai@ya.ru>
>>>>
>>>> Currently, there are no barriers, and the skb->xxx update may become invisible on cpu2.
>>>> In the below example var2 may point to initial_val0 instead of the expected var1:
>>>>
>>>> [cpu1]                               [cpu2]
>>>> skb->xxx = initial_val0;
>>>> ...
>>>> skb->xxx = var1;                     skb = READ_ONCE(prev_skb->next);
>>>> <no barrier>                         <no barrier>
>>>> WRITE_ONCE(prev_skb->next, skb);     var2 = skb->xxx;
>>>>
>>>> This patch adds barriers and fixes the problem. Note that __skb_peek() is not patched,
>>>> since it is a low-level function, and a caller has to understand what it does (and
>>>> __skb_peek() is also used under the queue lock in some places).
>>>>
>>>> Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
>>>> ---
>>>> Hi, David, Eric and other developers,
>>>>
>>>> picking through the unix sockets code I found this problem,
>>>
>>> Could you please report exactly how/where the problem manifests (e.g.
>>> the involved call paths/time sequence)?
>>
>> I didn't get why the call paths in the patch description are not enough for you. Please explain
>> what you want.
>
> You mentioned the unix socket, so I expect to see something alike (I'm
> totally making up the symbol lists just to give an example):
>
> CPU0                                 CPU1
> unix_stream_read_generic()           unix_stream_sendmsg()
>   skb_peek()                           skb_queue_tail(other->sk_receive_queue)
>
> plus some wording on how the critical race is reached, if not
> completely obvious.

Ah, you mean the specific functions. Yes, this example, and in the rest of the places
skb_peek{,_tail} are used without the queue lock, e.g., unix_stream_data_wait().

>>>> and for me it looks like it exists. If there
>>>> are arguments that everything is OK and it's expected, please explain.
>>>
>>> I don't see why such barriers are needed for the locked peek/tail
>>> variants, as the spin_lock pair implies a full memory barrier.
>>
>> This is for lockless skb_peek() calls, which is why the patch is titled that way :). For locked skb_peek()
>> this is not needed.
>
> But you are also unconditionally adding barriers to the locked
> append/enqueue functions - which would possibly make sense only when
> the latter are paired with lockless read access.
>
> As Eric said, if any new barrier is needed, we want to apply it only
> where needed, and not to every skb_queue*()/skb_peek() user, so very
> likely a new helper (or a new pair of helpers) will be needed.

Ok, thanks, Paolo.

Kirill
On 01.08.2022 10:39, Eric Dumazet wrote:
> On Mon, Aug 1, 2022 at 9:00 AM Kirill Tkhai <tkhai@ya.ru> wrote:
>>
>> On 01.08.2022 09:52, Paolo Abeni wrote:
>>> On Sun, 2022-07-31 at 23:39 +0300, Kirill Tkhai wrote:
>>>> From: Kirill Tkhai <tkhai@ya.ru>
>>>>
>>>> Currently, there are no barriers, and the skb->xxx update may become invisible on cpu2.
>>>> In the below example var2 may point to initial_val0 instead of the expected var1:
>>>>
>>>> [cpu1]                               [cpu2]
>>>> skb->xxx = initial_val0;
>>>> ...
>>>> skb->xxx = var1;                     skb = READ_ONCE(prev_skb->next);
>>>> <no barrier>                         <no barrier>
>>>> WRITE_ONCE(prev_skb->next, skb);     var2 = skb->xxx;
>>>>
>>>> This patch adds barriers and fixes the problem. Note that __skb_peek() is not patched,
>>>> since it is a low-level function, and a caller has to understand what it does (and
>>>> __skb_peek() is also used under the queue lock in some places).
>>>>
>>>> Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
>>>> ---
>>>> Hi, David, Eric and other developers,
>>>>
>>>> picking through the unix sockets code I found this problem,
>>>
>>> Could you please report exactly how/where the problem manifests (e.g.
>>> the involved call paths/time sequence)?
>>
>> I didn't get why the call paths in the patch description are not enough for you. Please explain
>> what you want.
>>
>>>> and for me it looks like it exists. If there
>>>> are arguments that everything is OK and it's expected, please explain.
>>>
>>> I don't see why such barriers are needed for the locked peek/tail
>>> variants, as the spin_lock pair implies a full memory barrier.
>>
>> This is for lockless skb_peek() calls, which is why the patch is titled that way :). For locked skb_peek()
>> this is not needed. I'm not sure we need separate skb_peek() and skb_peek_lockless(). Do we?
>
> We prefer explicit _lockless variants to document the precise points
> where they are needed.
>
> A new helper (and its initial usage) will clearly point to the problem
> you saw in af_unix.

The problem is:

unix_stream_sendmsg()            unix_stream_read_generic()
  skb->len = size;                 skb = skb_peek();
  skb_queue_tail(skb);             unix_skb_len(skb); <- here we read the wrong len

> BTW, smp_mb__after_spinlock() in your patch does not really make sense to me.
>
> Please add in your changelog the precise issue you are seeing.

What about the queue part, do you recommend a separate helper for that as well? skb_queue_tail_for_lockless()?
On 01.08.2022 21:45, Kirill Tkhai wrote:
> On 01.08.2022 10:39, Eric Dumazet wrote:
>> On Mon, Aug 1, 2022 at 9:00 AM Kirill Tkhai <tkhai@ya.ru> wrote:
>>>
>>> On 01.08.2022 09:52, Paolo Abeni wrote:
>>>> On Sun, 2022-07-31 at 23:39 +0300, Kirill Tkhai wrote:
>>>>> From: Kirill Tkhai <tkhai@ya.ru>
>>>>>
>>>>> Currently, there are no barriers, and the skb->xxx update may become invisible on cpu2.
>>>>> In the below example var2 may point to initial_val0 instead of the expected var1:
>>>>>
>>>>> [cpu1]                               [cpu2]
>>>>> skb->xxx = initial_val0;
>>>>> ...
>>>>> skb->xxx = var1;                     skb = READ_ONCE(prev_skb->next);
>>>>> <no barrier>                         <no barrier>
>>>>> WRITE_ONCE(prev_skb->next, skb);     var2 = skb->xxx;
>>>>>
>>>>> This patch adds barriers and fixes the problem. Note that __skb_peek() is not patched,
>>>>> since it is a low-level function, and a caller has to understand what it does (and
>>>>> __skb_peek() is also used under the queue lock in some places).
>>>>>
>>>>> Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
>>>>> ---
>>>>> Hi, David, Eric and other developers,
>>>>>
>>>>> picking through the unix sockets code I found this problem,
>>>>
>>>> Could you please report exactly how/where the problem manifests (e.g.
>>>> the involved call paths/time sequence)?
>>>
>>> I didn't get why the call paths in the patch description are not enough for you. Please explain
>>> what you want.
>>>
>>>>> and for me it looks like it exists. If there
>>>>> are arguments that everything is OK and it's expected, please explain.
>>>>
>>>> I don't see why such barriers are needed for the locked peek/tail
>>>> variants, as the spin_lock pair implies a full memory barrier.
>>>
>>> This is for lockless skb_peek() calls, which is why the patch is titled that way :). For locked skb_peek()
>>> this is not needed. I'm not sure we need separate skb_peek() and skb_peek_lockless(). Do we?
>>
>> We prefer explicit _lockless variants to document the precise points
>> where they are needed.
>>
>> A new helper (and its initial usage) will clearly point to the problem
>> you saw in af_unix.
>
> The problem is:
>
> unix_stream_sendmsg()            unix_stream_read_generic()
>   skb->len = size;                 skb = skb_peek();
>   skb_queue_tail(skb);             unix_skb_len(skb); <- here we read the wrong len

Oh, there is unix_state_lock() there. Please ignore this patch...
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ca8afa382bf2..2939a5dc0ad7 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2018,7 +2018,8 @@ static inline struct sk_buff *skb_unshare(struct sk_buff *skb,
  */
 static inline struct sk_buff *skb_peek(const struct sk_buff_head *list_)
 {
-	struct sk_buff *skb = list_->next;
+	/* Pairs with mb in skb_queue_tail() and variations */
+	struct sk_buff *skb = smp_load_acquire(&list_->next);
 
 	if (skb == (struct sk_buff *)list_)
 		skb = NULL;
@@ -2048,7 +2049,8 @@ static inline struct sk_buff *__skb_peek(const struct sk_buff_head *list_)
 static inline struct sk_buff *skb_peek_next(struct sk_buff *skb,
 		const struct sk_buff_head *list_)
 {
-	struct sk_buff *next = skb->next;
+	/* Pairs with mb in skb_queue_tail() and variations */
+	struct sk_buff *next = smp_load_acquire(&skb->next);
 
 	if (next == (struct sk_buff *)list_)
 		next = NULL;
@@ -2070,7 +2072,8 @@ static inline struct sk_buff *skb_peek_next(struct sk_buff *skb,
  */
 static inline struct sk_buff *skb_peek_tail(const struct sk_buff_head *list_)
 {
-	struct sk_buff *skb = READ_ONCE(list_->prev);
+	/* Pairs with mb in skb_queue_tail() and variations */
+	struct sk_buff *skb = smp_load_acquire(&list_->prev);
 
 	if (skb == (struct sk_buff *)list_)
 		skb = NULL;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 974bbbbe7138..1de46eb91405 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3373,6 +3373,8 @@ void skb_queue_head(struct sk_buff_head *list, struct sk_buff *newsk)
 	unsigned long flags;
 
 	spin_lock_irqsave(&list->lock, flags);
+	/* Pairs with mb in skb_peek() and variations */
+	smp_mb__after_spinlock();
 	__skb_queue_head(list, newsk);
 	spin_unlock_irqrestore(&list->lock, flags);
 }
@@ -3394,6 +3396,8 @@ void skb_queue_tail(struct sk_buff_head *list, struct sk_buff *newsk)
 	unsigned long flags;
 
 	spin_lock_irqsave(&list->lock, flags);
+	/* Pairs with mb in skb_peek() and variations */
+	smp_mb__after_spinlock();
 	__skb_queue_tail(list, newsk);
 	spin_unlock_irqrestore(&list->lock, flags);
 }
@@ -3434,6 +3438,8 @@ void skb_append(struct sk_buff *old, struct sk_buff *newsk, struct sk_buff_head
 	unsigned long flags;
 
 	spin_lock_irqsave(&list->lock, flags);
+	/* Pairs with mb in skb_peek() and variations */
+	smp_mb__after_spinlock();
 	__skb_queue_after(list, old, newsk);
 	spin_unlock_irqrestore(&list->lock, flags);
 }