diff mbox series

net: skb content must be visible for lockless skb_peek() and its variations

Message ID de516124-ffd7-d159-2848-00c65a8573a8@ya.ru (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series net: skb content must be visible for lockless skb_peek() and its variations | expand

Checks

Context Check Description
netdev/tree_selection success Guessed tree name to be net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix warning Target tree name not specified in the subject
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 5527 this patch: 5527
netdev/cc_maintainers warning 3 maintainers not CCed: imagedong@tencent.com pabeni@redhat.com kuba@kernel.org
netdev/build_clang success Errors and warnings before: 1173 this patch: 1173
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 5682 this patch: 5682
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 51 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Kirill Tkhai July 31, 2022, 8:39 p.m. UTC
From: Kirill Tkhai <tkhai@ya.ru>

Currently there are no barriers, so an skb->xxx update may be invisible on cpu2.
In the example below var2 may point to initial_val0 instead of the expected var1:

[cpu1]					[cpu2]
skb->xxx = initial_val0;
...
skb->xxx = var1;			skb = READ_ONCE(prev_skb->next);
<no barrier>				<no barrier>
WRITE_ONCE(prev_skb->next, skb);	var2 = skb->xxx;

This patch adds barriers and fixes the problem. Note that __skb_peek() is not patched,
since it is a low-level function and its callers have to understand what it does (also,
__skb_peek() is used under the queue lock in some places).

Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
---
Hi, David, Eric and other developers,

while reading the unix sockets code I found this problem, and to me it looks real. If there
are arguments that everything is OK and this is expected behaviour, please explain.

Best wishes,
Kirill

 include/linux/skbuff.h |    9 ++++++---
 net/core/skbuff.c      |    6 ++++++
 2 files changed, 12 insertions(+), 3 deletions(-)

Comments

Paolo Abeni Aug. 1, 2022, 6:52 a.m. UTC | #1
On Sun, 2022-07-31 at 23:39 +0300, Kirill Tkhai wrote:
> From: Kirill Tkhai <tkhai@ya.ru>
> 
> Currently, there are no barriers, and skb->xxx update may become invisible on cpu2.
> In the below example var2 may point to intial_val0 instead of expected var1:
> 
> [cpu1]					[cpu2]
> skb->xxx = initial_val0;
> ...
> skb->xxx = var1;			skb = READ_ONCE(prev_skb->next);
> <no barrier>				<no barrier>
> WRITE_ONCE(prev_skb->next, skb);	var2 = skb->xxx;
> 
> This patch adds barriers and fixes the problem. Note, that __skb_peek() is not patched,
> since it's a lowlevel function, and a caller has to understand the things it does (and
> also __skb_peek() is used under queue lock in some places).
> 
> Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
> ---
> Hi, David, Eric and other developers,
> 
> picking unix sockets code I found this problem, 

Could you please report exactly how/where the problem manifests (e.g.
the involved call paths/time sequence)? 

> and for me it looks like it exists. If there
> are arguments that everything is OK and it's expected, please, explain.

I don't see why such barriers are needed for the locked peek/tail
variants, as the spin_lock pair implies a full memory barrier.

Cheers,

Paolo
Kirill Tkhai Aug. 1, 2022, 7 a.m. UTC | #2
On 01.08.2022 09:52, Paolo Abeni wrote:
> On Sun, 2022-07-31 at 23:39 +0300, Kirill Tkhai wrote:
>> From: Kirill Tkhai <tkhai@ya.ru>
>>
>> Currently, there are no barriers, and skb->xxx update may become invisible on cpu2.
>> In the below example var2 may point to intial_val0 instead of expected var1:
>>
>> [cpu1]					[cpu2]
>> skb->xxx = initial_val0;
>> ...
>> skb->xxx = var1;			skb = READ_ONCE(prev_skb->next);
>> <no barrier>				<no barrier>
>> WRITE_ONCE(prev_skb->next, skb);	var2 = skb->xxx;
>>
>> This patch adds barriers and fixes the problem. Note, that __skb_peek() is not patched,
>> since it's a lowlevel function, and a caller has to understand the things it does (and
>> also __skb_peek() is used under queue lock in some places).
>>
>> Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
>> ---
>> Hi, David, Eric and other developers,
>>
>> picking unix sockets code I found this problem, 
> 
> Could you please report exactly how/where the problem maifests (e.g.
> the involved call paths/time sequence)?

I didn't get why the call paths in the patch description are not enough for you. Please explain
what you want.
 
>> and for me it looks like it exists. If there
>> are arguments that everything is OK and it's expected, please, explain.
> 
> I don't see why such barriers are needed for the locked peek/tail
> variants, as the spin_lock pair implies a full memory barrier.

This is for lockless skb_peek() calls, which is why the patch is named that way :). For locked skb_peek()
this is not needed. I'm not sure we need separate skb_peek() and skb_peek_lockless(). Do we?
Eric Dumazet Aug. 1, 2022, 7:39 a.m. UTC | #3
On Mon, Aug 1, 2022 at 9:00 AM Kirill Tkhai <tkhai@ya.ru> wrote:
>
> On 01.08.2022 09:52, Paolo Abeni wrote:
> > On Sun, 2022-07-31 at 23:39 +0300, Kirill Tkhai wrote:
> >> From: Kirill Tkhai <tkhai@ya.ru>
> >>
> >> Currently, there are no barriers, and skb->xxx update may become invisible on cpu2.
> >> In the below example var2 may point to intial_val0 instead of expected var1:
> >>
> >> [cpu1]                                       [cpu2]
> >> skb->xxx = initial_val0;
> >> ...
> >> skb->xxx = var1;                     skb = READ_ONCE(prev_skb->next);
> >> <no barrier>                         <no barrier>
> >> WRITE_ONCE(prev_skb->next, skb);     var2 = skb->xxx;
> >>
> >> This patch adds barriers and fixes the problem. Note, that __skb_peek() is not patched,
> >> since it's a lowlevel function, and a caller has to understand the things it does (and
> >> also __skb_peek() is used under queue lock in some places).
> >>
> >> Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
> >> ---
> >> Hi, David, Eric and other developers,
> >>
> >> picking unix sockets code I found this problem,
> >
> > Could you please report exactly how/where the problem maifests (e.g.
> > the involved call paths/time sequence)?
>
> I didn't get why call paths in the patch description are not enough for you. Please, explain
> what you want.
>
> >> and for me it looks like it exists. If there
> >> are arguments that everything is OK and it's expected, please, explain.
> >
> > I don't see why such barriers are needed for the locked peek/tail
> > variants, as the spin_lock pair implies a full memory barrier.
>
> This is for lockless skb_peek() calls and the patch is called in that way :). For locked skb_peek()
> this is not needed. I'm not sure we need separate skb_peek() and skb_peek_lockless(). Do we?

We prefer explicit _lockless variants to document the precise points
where they are needed.

A new helper (and its initial usage) will clearly point to the problem
you saw in af_unix.

BTW, the smp_mb__after_spinlock() in your patch does not really make sense to me.
Please describe in your changelog the precise issue you are seeing.
Paolo Abeni Aug. 1, 2022, 9:59 a.m. UTC | #4
On Mon, 2022-08-01 at 10:00 +0300, Kirill Tkhai wrote:
> On 01.08.2022 09:52, Paolo Abeni wrote:
> > On Sun, 2022-07-31 at 23:39 +0300, Kirill Tkhai wrote:
> > > From: Kirill Tkhai <tkhai@ya.ru>
> > > 
> > > Currently, there are no barriers, and skb->xxx update may become invisible on cpu2.
> > > In the below example var2 may point to intial_val0 instead of expected var1:
> > > 
> > > [cpu1]					[cpu2]
> > > skb->xxx = initial_val0;
> > > ...
> > > skb->xxx = var1;			skb = READ_ONCE(prev_skb->next);
> > > <no barrier>				<no barrier>
> > > WRITE_ONCE(prev_skb->next, skb);	var2 = skb->xxx;
> > > 
> > > This patch adds barriers and fixes the problem. Note, that __skb_peek() is not patched,
> > > since it's a lowlevel function, and a caller has to understand the things it does (and
> > > also __skb_peek() is used under queue lock in some places).
> > > 
> > > Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
> > > ---
> > > Hi, David, Eric and other developers,
> > > 
> > > picking unix sockets code I found this problem, 
> > 
> > Could you please report exactly how/where the problem maifests (e.g.
> > the involved call paths/time sequence)?
> 
> I didn't get why call paths in the patch description are not enough for you. Please, explain
> what you want.

You mentioned unix sockets, so I expect to see something like this (I'm
totally making up the symbol list just to give an example): 

CPU0					CPU1
unix_stream_read_generic()		unix_stream_sendmsg()
skb_peek()				skb_queue_tail(other->sk_receive_queue)

plus some wording on how the critical race is reached, if not
completely obvious.

>  
> > > and for me it looks like it exists. If there
> > > are arguments that everything is OK and it's expected, please, explain.
> > 
> > I don't see why such barriers are needed for the locked peek/tail
> > variants, as the spin_lock pair implies a full memory barrier.
> 
> This is for lockless skb_peek() calls and the patch is called in that way :). For locked skb_peek()
> this is not needed. 

But you are also unconditionally adding barriers to the locked
append/enqueue functions - which would possibly make sense only when
the latter are paired with lockless read access.

As Eric said, if any new barrier is needed, we want to apply it only
where needed, and not to every skb_queue*()/skb_peek() user, so very
likely a new helper (or a new pair of helpers) will be needed.

Thanks!

Paolo
Kirill Tkhai Aug. 1, 2022, 6:37 p.m. UTC | #5
On 01.08.2022 12:59, Paolo Abeni wrote:
> On Mon, 2022-08-01 at 10:00 +0300, Kirill Tkhai wrote:
>> On 01.08.2022 09:52, Paolo Abeni wrote:
>>> On Sun, 2022-07-31 at 23:39 +0300, Kirill Tkhai wrote:
>>>> From: Kirill Tkhai <tkhai@ya.ru>
>>>>
>>>> Currently, there are no barriers, and skb->xxx update may become invisible on cpu2.
>>>> In the below example var2 may point to intial_val0 instead of expected var1:
>>>>
>>>> [cpu1]					[cpu2]
>>>> skb->xxx = initial_val0;
>>>> ...
>>>> skb->xxx = var1;			skb = READ_ONCE(prev_skb->next);
>>>> <no barrier>				<no barrier>
>>>> WRITE_ONCE(prev_skb->next, skb);	var2 = skb->xxx;
>>>>
>>>> This patch adds barriers and fixes the problem. Note, that __skb_peek() is not patched,
>>>> since it's a lowlevel function, and a caller has to understand the things it does (and
>>>> also __skb_peek() is used under queue lock in some places).
>>>>
>>>> Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
>>>> ---
>>>> Hi, David, Eric and other developers,
>>>>
>>>> picking unix sockets code I found this problem, 
>>>
>>> Could you please report exactly how/where the problem maifests (e.g.
>>> the involved call paths/time sequence)?
>>
>> I didn't get why call paths in the patch description are not enough for you. Please, explain
>> what you want.
> 
> You mentioned the unix socket, so I expect to see something alike (I'm
> totally making up the symbols lists just to give an example): 
> 
> CPU0					CPU1
> unix_stream_read_generic()		unix_stream_sendmsg()
> skb_peek()				skb_queue_tail(other->sk_receive_queue)
> 
> plus some wording on how the critical race is reached, if not
> completely obvious.

Ah, you mean specific functions. Yes, in this example and in the rest of the places
skb_peek{,_tail}() is used without the queue lock, e.g. in unix_stream_data_wait().
 
>>  
>>>> and for me it looks like it exists. If there
>>>> are arguments that everything is OK and it's expected, please, explain.
>>>
>>> I don't see why such barriers are needed for the locked peek/tail
>>> variants, as the spin_lock pair implies a full memory barrier.
>>
>> This is for lockless skb_peek() calls and the patch is called in that way :). For locked skb_peek()
>> this is not needed. 
> 
> But you are also unconditioanlly adding barriers to the locked
> append/enqueue functions - which would possibly make sense only when
> the latters are paired with lockless read access.
> 
> As Eric said, if any new barrier is needed, we want to apply it only
> where needed, and not to every skb_queue*()/skb_peek() user, so very
> likely a new helper (o a new pair of helpers) will be needed.

Ok, thanks, Paolo.

Kirill
Kirill Tkhai Aug. 1, 2022, 6:45 p.m. UTC | #6
On 01.08.2022 10:39, Eric Dumazet wrote:
> On Mon, Aug 1, 2022 at 9:00 AM Kirill Tkhai <tkhai@ya.ru> wrote:
>>
>> On 01.08.2022 09:52, Paolo Abeni wrote:
>>> On Sun, 2022-07-31 at 23:39 +0300, Kirill Tkhai wrote:
>>>> From: Kirill Tkhai <tkhai@ya.ru>
>>>>
>>>> Currently, there are no barriers, and skb->xxx update may become invisible on cpu2.
>>>> In the below example var2 may point to intial_val0 instead of expected var1:
>>>>
>>>> [cpu1]                                       [cpu2]
>>>> skb->xxx = initial_val0;
>>>> ...
>>>> skb->xxx = var1;                     skb = READ_ONCE(prev_skb->next);
>>>> <no barrier>                         <no barrier>
>>>> WRITE_ONCE(prev_skb->next, skb);     var2 = skb->xxx;
>>>>
>>>> This patch adds barriers and fixes the problem. Note, that __skb_peek() is not patched,
>>>> since it's a lowlevel function, and a caller has to understand the things it does (and
>>>> also __skb_peek() is used under queue lock in some places).
>>>>
>>>> Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
>>>> ---
>>>> Hi, David, Eric and other developers,
>>>>
>>>> picking unix sockets code I found this problem,
>>>
>>> Could you please report exactly how/where the problem maifests (e.g.
>>> the involved call paths/time sequence)?
>>
>> I didn't get why call paths in the patch description are not enough for you. Please, explain
>> what you want.
>>
>>>> and for me it looks like it exists. If there
>>>> are arguments that everything is OK and it's expected, please, explain.
>>>
>>> I don't see why such barriers are needed for the locked peek/tail
>>> variants, as the spin_lock pair implies a full memory barrier.
>>
>> This is for lockless skb_peek() calls and the patch is called in that way :). For locked skb_peek()
>> this is not needed. I'm not sure we need separate skb_peek() and skb_peek_lockless(). Do we?
> 
> We prefer explicit _lockless variants to document the precise points
> they are needed.
> 
> A new helper (and its initial usage) will clearly point to the problem
> you saw in af_unix.

The problem is:

unix_stream_sendmsg()	unix_stream_read_generic()
  skb->len = size;	  skb = skb_peek();
  skb_queue_tail(skb);	  unix_skb_len(skb); <- here we may read a stale len

 
> BTW, smp_mb__after_spinlock() in your patch does not really make sense to me.
> Please add in your changelog the precise issue you are seeing.

What about the queue part? Do you recommend having a separate helper for that,
e.g. skb_queue_tail_for_lockless()?
Kirill Tkhai Aug. 7, 2022, 8:17 p.m. UTC | #7
On 01.08.2022 21:45, Kirill Tkhai wrote:
> On 01.08.2022 10:39, Eric Dumazet wrote:
>> On Mon, Aug 1, 2022 at 9:00 AM Kirill Tkhai <tkhai@ya.ru> wrote:
>>>
>>> On 01.08.2022 09:52, Paolo Abeni wrote:
>>>> On Sun, 2022-07-31 at 23:39 +0300, Kirill Tkhai wrote:
>>>>> From: Kirill Tkhai <tkhai@ya.ru>
>>>>>
>>>>> Currently, there are no barriers, and skb->xxx update may become invisible on cpu2.
>>>>> In the below example var2 may point to intial_val0 instead of expected var1:
>>>>>
>>>>> [cpu1]                                       [cpu2]
>>>>> skb->xxx = initial_val0;
>>>>> ...
>>>>> skb->xxx = var1;                     skb = READ_ONCE(prev_skb->next);
>>>>> <no barrier>                         <no barrier>
>>>>> WRITE_ONCE(prev_skb->next, skb);     var2 = skb->xxx;
>>>>>
>>>>> This patch adds barriers and fixes the problem. Note, that __skb_peek() is not patched,
>>>>> since it's a lowlevel function, and a caller has to understand the things it does (and
>>>>> also __skb_peek() is used under queue lock in some places).
>>>>>
>>>>> Signed-off-by: Kirill Tkhai <tkhai@ya.ru>
>>>>> ---
>>>>> Hi, David, Eric and other developers,
>>>>>
>>>>> picking unix sockets code I found this problem,
>>>>
>>>> Could you please report exactly how/where the problem maifests (e.g.
>>>> the involved call paths/time sequence)?
>>>
>>> I didn't get why call paths in the patch description are not enough for you. Please, explain
>>> what you want.
>>>
>>>>> and for me it looks like it exists. If there
>>>>> are arguments that everything is OK and it's expected, please, explain.
>>>>
>>>> I don't see why such barriers are needed for the locked peek/tail
>>>> variants, as the spin_lock pair implies a full memory barrier.
>>>
>>> This is for lockless skb_peek() calls and the patch is called in that way :). For locked skb_peek()
>>> this is not needed. I'm not sure we need separate skb_peek() and skb_peek_lockless(). Do we?
>>
>> We prefer explicit _lockless variants to document the precise points
>> they are needed.
>>
>> A new helper (and its initial usage) will clearly point to the problem
>> you saw in af_unix.
> 
> The problem is:
> 
> unix_stream_sendmsg()	unix_stream_read_generic()
>   skb->len = size;	  skb = skb_peek();
>   skb_queue_tail(skb);	  unix_skb_len(skb); <- here we read wrong len

Oh, there is unix_state_lock() there. Please ignore this patch...

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ca8afa382bf2..2939a5dc0ad7 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2018,7 +2018,8 @@  static inline struct sk_buff *skb_unshare(struct sk_buff *skb,
  */
 static inline struct sk_buff *skb_peek(const struct sk_buff_head *list_)
 {
-	struct sk_buff *skb = list_->next;
+	/* Pairs with mb in skb_queue_tail() and variations */
+	struct sk_buff *skb = smp_load_acquire(&list_->next);
 
 	if (skb == (struct sk_buff *)list_)
 		skb = NULL;
@@ -2048,7 +2049,8 @@  static inline struct sk_buff *__skb_peek(const struct sk_buff_head *list_)
 static inline struct sk_buff *skb_peek_next(struct sk_buff *skb,
 		const struct sk_buff_head *list_)
 {
-	struct sk_buff *next = skb->next;
+	/* Pairs with mb in skb_queue_tail() and variations */
+	struct sk_buff *next = smp_load_acquire(&skb->next);
 
 	if (next == (struct sk_buff *)list_)
 		next = NULL;
@@ -2070,7 +2072,8 @@  static inline struct sk_buff *skb_peek_next(struct sk_buff *skb,
  */
 static inline struct sk_buff *skb_peek_tail(const struct sk_buff_head *list_)
 {
-	struct sk_buff *skb = READ_ONCE(list_->prev);
+	/* Pairs with mb in skb_queue_tail() and variations */
+	struct sk_buff *skb = smp_load_acquire(&list_->prev);
 
 	if (skb == (struct sk_buff *)list_)
 		skb = NULL;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 974bbbbe7138..1de46eb91405 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3373,6 +3373,8 @@  void skb_queue_head(struct sk_buff_head *list, struct sk_buff *newsk)
 	unsigned long flags;
 
 	spin_lock_irqsave(&list->lock, flags);
+	/* Pairs with mb in skb_peek() and variations */
+	smp_mb__after_spinlock();
 	__skb_queue_head(list, newsk);
 	spin_unlock_irqrestore(&list->lock, flags);
 }
@@ -3394,6 +3396,8 @@  void skb_queue_tail(struct sk_buff_head *list, struct sk_buff *newsk)
 	unsigned long flags;
 
 	spin_lock_irqsave(&list->lock, flags);
+	/* Pairs with mb in skb_peek() and variations */
+	smp_mb__after_spinlock();
 	__skb_queue_tail(list, newsk);
 	spin_unlock_irqrestore(&list->lock, flags);
 }
@@ -3434,6 +3438,8 @@  void skb_append(struct sk_buff *old, struct sk_buff *newsk, struct sk_buff_head
 	unsigned long flags;
 
 	spin_lock_irqsave(&list->lock, flags);
+	/* Pairs with mb in skb_peek() and variations */
+	smp_mb__after_spinlock();
 	__skb_queue_after(list, old, newsk);
 	spin_unlock_irqrestore(&list->lock, flags);
 }