[net-next] net: fix kfree_skb_list use of skb_mark_not_on_list

Message ID 167415060025.1124471.10712199130760214632.stgit@firesoul (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [net-next] net: fix kfree_skb_list use of skb_mark_not_on_list | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 2 this patch: 2
netdev/cc_maintainers fail 1 blamed authors not CCed: saeed@kernel.org; 1 maintainers not CCed: saeed@kernel.org
netdev/build_clang success Errors and warnings before: 1 this patch: 1
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 2 this patch: 2
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 13 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Jesper Dangaard Brouer Jan. 19, 2023, 5:50 p.m. UTC
A bug was introduced by commit eedade12f4cb ("net: kfree_skb_list use
kmem_cache_free_bulk"). It unconditionally unlinked the SKB list by
invoking skb_mark_not_on_list().

skb_mark_not_on_list() should only be called if __kfree_skb_reason()
returns true, meaning the SKB is ready to be freed, as it calls/checks
skb_unref().

This is needed as kfree_skb_list() is also invoked on the skb_shared_info
frag_list. A frag_list can have SKBs with elevated refcnt due to cloning
via skb_clone_fraglist(), which takes a reference on all SKBs in the
list. This implies the invariant that all SKBs in the list must have the
same refcnt when using kfree_skb_list().

Reported-by: syzbot+c8a2e66e37eee553c4fd@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+c8a2e66e37eee553c4fd@syzkaller.appspotmail.com
Fixes: eedade12f4cb ("net: kfree_skb_list use kmem_cache_free_bulk")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/skbuff.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
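
For context, the equal-refcnt invariant described in the commit message
comes from skb_clone_fraglist(), which takes a reference on every SKB
hanging off the frag_list (abridged from net/core/skbuff.c as of this
patch):

/* Bump the refcount of every SKB on the frag_list, so that a later
 * kfree_skb_list() on the clone's frag_list only drops references
 * instead of freeing the SKBs outright.
 */
static void skb_clone_fraglist(struct sk_buff *skb)
{
	struct sk_buff *list;

	skb_walk_frags(skb, list)
		skb_get(list);
}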

Comments

Eric Dumazet Jan. 19, 2023, 6:04 p.m. UTC | #1
On Thu, Jan 19, 2023 at 6:50 PM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> A bug was introduced by commit eedade12f4cb ("net: kfree_skb_list use
> kmem_cache_free_bulk"). It unconditionally unlinked the SKB list by
> invoking skb_mark_not_on_list().
>
> skb_mark_not_on_list() should only be called if __kfree_skb_reason()
> returns true, meaning the SKB is ready to be freed, as it calls/checks
> skb_unref().
>
> This is needed as kfree_skb_list() is also invoked on the skb_shared_info
> frag_list. A frag_list can have SKBs with elevated refcnt due to cloning
> via skb_clone_fraglist(), which takes a reference on all SKBs in the
> list. This implies the invariant that all SKBs in the list must have the
> same refcnt when using kfree_skb_list().

Yeah, or more precisely skb_drop_fraglist() calling kfree_skb_list()

>
> Reported-by: syzbot+c8a2e66e37eee553c4fd@syzkaller.appspotmail.com
> Reported-and-tested-by: syzbot+c8a2e66e37eee553c4fd@syzkaller.appspotmail.com
> Fixes: eedade12f4cb ("net: kfree_skb_list use kmem_cache_free_bulk")
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  net/core/skbuff.c |    6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 4e73ab3482b8..1bffbcbe6087 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -999,10 +999,10 @@ kfree_skb_list_reason(struct sk_buff *segs, enum skb_drop_reason reason)
>         while (segs) {
>                 struct sk_buff *next = segs->next;
>
> -               skb_mark_not_on_list(segs);
> -
> -               if (__kfree_skb_reason(segs, reason))
> +               if (__kfree_skb_reason(segs, reason)) {
> +                       skb_mark_not_on_list(segs);

Real question is: why do we need to set/change/dirty skb->next?

I would remove this completely, and save dirtying extra cache lines.

Before your patch, we were not calling skb_mark_not_on_list(segs),
so why bother?

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4e73ab3482b87d81371cff266627dab646d3e84c..180df58e85c72eaa16f5cb56b56d181a379b8921 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -999,8 +999,6 @@ kfree_skb_list_reason(struct sk_buff *segs, enum skb_drop_reason reason)
        while (segs) {
                struct sk_buff *next = segs->next;

-               skb_mark_not_on_list(segs);
-
                if (__kfree_skb_reason(segs, reason))
                        kfree_skb_add_bulk(segs, &sa, reason);
Jesper Dangaard Brouer Jan. 20, 2023, 9:09 a.m. UTC | #2
On 19/01/2023 19.04, Eric Dumazet wrote:
> On Thu, Jan 19, 2023 at 6:50 PM Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
>> [...]
>> -               skb_mark_not_on_list(segs);
>> -
>> -               if (__kfree_skb_reason(segs, reason))
>> +               if (__kfree_skb_reason(segs, reason)) {
>> +                       skb_mark_not_on_list(segs);
> 
> Real question is: why do we need to set/change/dirty skb->next?
> 
> I would remove this completely, and save dirtying extra cache lines.

First of all, we just read this cacheline via reading segs->next.
This cacheline must at minimum be in Shared (S) state.

Secondly, SLUB will write into this cacheline. Thus, we actually know
that this cacheline needs to go into Modified (M) or Exclusive (E) state.
Thus, writing into it here should be okay.  We could replace it with a
prefetchw() to help SLUB get the cacheline in Exclusive (E) cache
coherency state.
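
A sketch of that variant (hypothetical, using prefetchw() from
<linux/prefetch.h>; as the follow-up below notes, this turns out not to
help given SLUB's actual freelist layout):

	while (segs) {
		struct sk_buff *next = segs->next;

		/* Hypothetical: request the cacheline for write so it
		 * arrives in Exclusive (E) state before it is dirtied.
		 */
		prefetchw(segs);

		if (__kfree_skb_reason(segs, reason))
			kfree_skb_add_bulk(segs, &sa, reason);

		segs = next;
	}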

> Before your patch, we were not calling skb_mark_not_on_list(segs),
> so why bother?

To catch potential bugs.


> [...]
Jesper Dangaard Brouer Jan. 20, 2023, 9:30 a.m. UTC | #3
On 20/01/2023 10.09, Jesper Dangaard Brouer wrote:
> 
> On 19/01/2023 19.04, Eric Dumazet wrote:
>> On Thu, Jan 19, 2023 at 6:50 PM Jesper Dangaard Brouer
>> <brouer@redhat.com> wrote:
>>> [...]
>>
>> Real question is: why do we need to set/change/dirty skb->next?
>>
>> I would remove this completely, and save dirtying extra cache lines.
> 
> First of all, we just read this cacheline via reading segs->next.
> This cacheline must at minimum be in Shared (S) state.
> 
> Secondly, SLUB will write into this cacheline. Thus, we actually know
> that this cacheline needs to go into Modified (M) or Exclusive (E) state.
> Thus, writing into it here should be okay.  We could replace it with a
> prefetchw() to help SLUB get the cacheline in Exclusive (E) cache
> coherency state.

I looked it up, and SLUB no longer uses the first cacheline of objects to
store its freelist pointer.  Since April 2020 (v5.7), commit 3202fa62fb43
("slub: relocate freelist pointer to middle of object") (author: Kees
Cook) has kept the freelist pointer in the middle of the object for
security reasons.  Thus, my prefetchw idea is wrong (details: there is an
internal prefetch_freepointer() that finds the right location).

Also, it could make sense to save the potential (S) to (E) cache
coherency state transition, as SLUB actually writes into a different
cacheline than I first thought.
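
For reference, the relevant change in that commit amounts to roughly the
following, paraphrased from calculate_sizes() in mm/slub.c (surrounding
context elided):

	/* Commit 3202fa62fb43 (paraphrased): store the freelist pointer
	 * near the middle of the object, away from its edges, so small
	 * over/underflows from neighboring allocations cannot corrupt it.
	 * Side effect: a free no longer dirties the object's first
	 * cacheline.
	 */
	size += sizeof(void *);
	s->offset = ALIGN(size / 2, sizeof(void *));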


>> Before your patch, we were not calling skb_mark_not_on_list(segs),
>> so why bother?
> 
> To catch potential bugs.

For this purpose we could discuss creating skb_poison_list(), as you
hinted in your debugging proposal; this would likely have caused us to
catch this bug faster (via a crash on the second caller).

Let me know if you prefer that we simply remove skb_mark_not_on_list()?

>> [...]
Eric Dumazet Jan. 20, 2023, 9:46 a.m. UTC | #4
On Fri, Jan 20, 2023 at 10:30 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
> On 20/01/2023 10.09, Jesper Dangaard Brouer wrote:
> > [...]
>
> I looked it up, and SLUB no longer uses the first cacheline of objects to
> store its freelist pointer.  Since April 2020 (v5.7), commit 3202fa62fb43
> ("slub: relocate freelist pointer to middle of object") (author: Kees
> Cook) has kept the freelist pointer in the middle of the object for
> security reasons.  Thus, my prefetchw idea is wrong (details: there is an
> internal prefetch_freepointer() that finds the right location).
>
> Also, it could make sense to save the potential (S) to (E) cache
> coherency state transition, as SLUB actually writes into a different
> cacheline than I first thought.

Anyway, we should not assume anything about slub/slab ways of storing
freelists. They might switch to something less expensive, especially
considering that the bulk API passes an array of pointers, not a 'list'.


>
>
> >> Before your patch, we were not calling skb_mark_not_on_list(segs),
> >> so why bother?
> >
> > To catch potential bugs.
>
> For this purpose we could discuss creating skb_poison_list(), as you
> hinted in your debugging proposal; this would likely have caused us to
> catch this bug faster (via a crash on the second caller).
>
> Let me know if you prefer that we simply remove skb_mark_not_on_list()?

Yes. Setting NULL here might hide another bug somewhere.

skb_poison_list() could be defined for CONFIG_DEBUG_NET=y builds I guess.
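
A minimal sketch of what such a helper could look like (the name
skb_poison_list() and the poison value here are assumptions, not
necessarily what a later patch merged):

#include <linux/poison.h>

/* Hypothetical: point skb->next at a recognizable poison value on unlink,
 * so a stale user crashes loudly instead of silently walking freed memory.
 * Poisoning is compiled in only for CONFIG_DEBUG_NET=y builds.
 */
#define SKB_LIST_POISON_NEXT	((void *)(0x800 + POISON_POINTER_DELTA))

static inline void skb_poison_list(struct sk_buff *skb)
{
#ifdef CONFIG_DEBUG_NET
	skb->next = SKB_LIST_POISON_NEXT;
#endif
}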
Jesper Dangaard Brouer Jan. 20, 2023, 10:35 a.m. UTC | #5
On 20/01/2023 10.46, Eric Dumazet wrote:
> On Fri, Jan 20, 2023 at 10:30 AM Jesper Dangaard Brouer
>>
[...]
>> Let me know if you prefer that we simply remove skb_mark_not_on_list()?
> Yes. Setting NULL here might hide another bug somewhere.

Ok, sent V2

Patch

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4e73ab3482b8..1bffbcbe6087 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -999,10 +999,10 @@  kfree_skb_list_reason(struct sk_buff *segs, enum skb_drop_reason reason)
 	while (segs) {
 		struct sk_buff *next = segs->next;
 
-		skb_mark_not_on_list(segs);
-
-		if (__kfree_skb_reason(segs, reason))
+		if (__kfree_skb_reason(segs, reason)) {
+			skb_mark_not_on_list(segs);
 			kfree_skb_add_bulk(segs, &sa, reason);
+		}
 
 		segs = next;
 	}