
[net-next,V2,2/2] net: kfree_skb_list use kmem_cache_free_bulk

Message ID 167361792462.531803.224198635706602340.stgit@firesoul (mailing list archive)
State Accepted
Commit eedade12f4cb7284555c4c0314485e9575c70ab7
Delegated to: Netdev Maintainers
Series net: use kmem_cache_free_bulk in kfree_skb_list

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net-next
netdev/fixes_present success Fixes tag not required for -next series
netdev/subject_prefix success Link
netdev/cover_letter success Series has a cover letter
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 2 this patch: 2
netdev/cc_maintainers warning 5 maintainers not CCed: bpf@vger.kernel.org ast@kernel.org daniel@iogearbox.net john.fastabend@gmail.com hawk@kernel.org
netdev/build_clang success Errors and warnings before: 1 this patch: 1
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 2 this patch: 2
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 55 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Jesper Dangaard Brouer Jan. 13, 2023, 1:52 p.m. UTC
The kfree_skb_list function walks the SKB list (via skb->next) and frees
each SKB individually to the SLUB/SLAB allocator (kmem_cache). It is more
efficient to bulk free them via the kmem_cache_free_bulk API.

This patch creates a stack-local array of SKBs to bulk free while
walking the list. The bulk array size is limited to 16 SKBs to trade off
stack usage against efficiency. The SLUB kmem_cache "skbuff_head_cache"
uses an object size of 256 bytes, usually in an order-1 (8192-byte) slab
page, i.e. 32 objects per slab (this can vary across architectures and
due to SLUB sharing). Thus, for SLUB the optimal bulk-free case is 32
objects belonging to the same slab, but at runtime this is unlikely to
occur.
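For context, the kernel's bulk-free API is
void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p).
Below is a minimal, generic sketch of the batching pattern used here
(illustration only; the free_batch/batch_free/batch_flush names are made
up for this sketch and do not appear in the patch):

 #include <linux/slab.h>

 #define BULK_SIZE 16	/* same stack-usage vs efficiency trade-off as the patch */

 struct free_batch {
 	unsigned int count;
 	void *objs[BULK_SIZE];
 };

 /* Queue one object; flush to the slab allocator once the array is full. */
 static void batch_free(struct kmem_cache *cache, struct free_batch *b, void *obj)
 {
 	b->objs[b->count++] = obj;
 	if (b->count == BULK_SIZE) {
 		kmem_cache_free_bulk(cache, BULK_SIZE, b->objs);
 		b->count = 0;
 	}
 }

 /* Flush whatever is left after the walk; one call instead of N kmem_cache_free() calls. */
 static void batch_flush(struct kmem_cache *cache, struct free_batch *b)
 {
 	if (b->count) {
 		kmem_cache_free_bulk(cache, b->count, b->objs);
 		b->count = 0;
 	}
 }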

The expected gain from using the kmem_cache bulk alloc and free APIs
has been assessed via a microbenchmark kernel module[1].

The module 'slab_bulk_test01' results at bulk 16 elements:
 kmem-in-loop Per elem: 109 cycles(tsc) 30.532 ns (step:16)
 kmem-bulk    Per elem: 64 cycles(tsc) 17.905 ns (step:16)

A more detailed description of the benchmarks is available in [2].

[1] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm
[2] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/kfree_skb_list01.org

V2: rename function to kfree_skb_add_bulk.

Reviewed-by: Saeed Mahameed <saeed@kernel.org>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/skbuff.c |   40 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 39 insertions(+), 1 deletion(-)

Comments

Eric Dumazet Jan. 18, 2023, 4:05 p.m. UTC | #1
On Fri, Jan 13, 2023 at 2:52 PM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>
> The kfree_skb_list function walks SKB (via skb->next) and frees them
> individually to the SLUB/SLAB allocator (kmem_cache). It is more
> efficient to bulk free them via the kmem_cache_free_bulk API.
>
> This patches create a stack local array with SKBs to bulk free while
> walking the list. Bulk array size is limited to 16 SKBs to trade off
> stack usage and efficiency. The SLUB kmem_cache "skbuff_head_cache"
> uses objsize 256 bytes usually in an order-1 page 8192 bytes that is
> 32 objects per slab (can vary on archs and due to SLUB sharing). Thus,
> for SLUB the optimal bulk free case is 32 objects belonging to same
> slab, but runtime this isn't likely to occur.
>
> The expected gain from using kmem_cache bulk alloc and free API
> have been assessed via a microbencmark kernel module[1].
>
> The module 'slab_bulk_test01' results at bulk 16 element:
>  kmem-in-loop Per elem: 109 cycles(tsc) 30.532 ns (step:16)
>  kmem-bulk    Per elem: 64 cycles(tsc) 17.905 ns (step:16)
>
> More detailed description of benchmarks avail in [2].
>
> [1] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm
> [2] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/kfree_skb_list01.org
>
> V2: rename function to kfree_skb_add_bulk.
>
> Reviewed-by: Saeed Mahameed <saeed@kernel.org>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---

According to syzbot, this patch causes kernel panics in the IP fragmentation logic.

Can you double check if there is no obvious bug?
Jesper Dangaard Brouer Jan. 18, 2023, 4:42 p.m. UTC | #2
On 18/01/2023 17.05, Eric Dumazet wrote:
> On Fri, Jan 13, 2023 at 2:52 PM Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
>>
>> The kfree_skb_list function walks SKB (via skb->next) and frees them
>> individually to the SLUB/SLAB allocator (kmem_cache). It is more
>> efficient to bulk free them via the kmem_cache_free_bulk API.
>>
>> This patches create a stack local array with SKBs to bulk free while
>> walking the list. Bulk array size is limited to 16 SKBs to trade off
>> stack usage and efficiency. The SLUB kmem_cache "skbuff_head_cache"
>> uses objsize 256 bytes usually in an order-1 page 8192 bytes that is
>> 32 objects per slab (can vary on archs and due to SLUB sharing). Thus,
>> for SLUB the optimal bulk free case is 32 objects belonging to same
>> slab, but runtime this isn't likely to occur.
>>
>> The expected gain from using kmem_cache bulk alloc and free API
>> have been assessed via a microbencmark kernel module[1].
>>
>> The module 'slab_bulk_test01' results at bulk 16 element:
>>   kmem-in-loop Per elem: 109 cycles(tsc) 30.532 ns (step:16)
>>   kmem-bulk    Per elem: 64 cycles(tsc) 17.905 ns (step:16)
>>
>> More detailed description of benchmarks avail in [2].
>>
>> [1] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm
>> [2] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/kfree_skb_list01.org
>>
>> V2: rename function to kfree_skb_add_bulk.
>>
>> Reviewed-by: Saeed Mahameed <saeed@kernel.org>
>> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
>> ---
> 
> According to syzbot, this patch causes kernel panics, in IP fragmentation logic.
> 
> Can you double check if there is no obvious bug ?

Do you have a link to the syzbot issue?

--Jesper
Eric Dumazet Jan. 18, 2023, 4:42 p.m. UTC | #3
On Wed, Jan 18, 2023 at 5:42 PM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
> On 18/01/2023 17.05, Eric Dumazet wrote:
> > On Fri, Jan 13, 2023 at 2:52 PM Jesper Dangaard Brouer
> > <brouer@redhat.com> wrote:
> >>
> >> The kfree_skb_list function walks SKB (via skb->next) and frees them
> >> individually to the SLUB/SLAB allocator (kmem_cache). It is more
> >> efficient to bulk free them via the kmem_cache_free_bulk API.
> >>
> >> This patches create a stack local array with SKBs to bulk free while
> >> walking the list. Bulk array size is limited to 16 SKBs to trade off
> >> stack usage and efficiency. The SLUB kmem_cache "skbuff_head_cache"
> >> uses objsize 256 bytes usually in an order-1 page 8192 bytes that is
> >> 32 objects per slab (can vary on archs and due to SLUB sharing). Thus,
> >> for SLUB the optimal bulk free case is 32 objects belonging to same
> >> slab, but runtime this isn't likely to occur.
> >>
> >> The expected gain from using kmem_cache bulk alloc and free API
> >> have been assessed via a microbencmark kernel module[1].
> >>
> >> The module 'slab_bulk_test01' results at bulk 16 element:
> >>   kmem-in-loop Per elem: 109 cycles(tsc) 30.532 ns (step:16)
> >>   kmem-bulk    Per elem: 64 cycles(tsc) 17.905 ns (step:16)
> >>
> >> More detailed description of benchmarks avail in [2].
> >>
> >> [1] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm
> >> [2] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/kfree_skb_list01.org
> >>
> >> V2: rename function to kfree_skb_add_bulk.
> >>
> >> Reviewed-by: Saeed Mahameed <saeed@kernel.org>
> >> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> >> ---
> >
> > According to syzbot, this patch causes kernel panics, in IP fragmentation logic.
> >
> > Can you double check if there is no obvious bug ?
>
> Do you have a link to the syzbot issue?

Not yet. I will release it, with a repro.
Jesper Dangaard Brouer Jan. 18, 2023, 9:37 p.m. UTC | #4
(related to syzbot issue[1])

On 13/01/2023 14.52, Jesper Dangaard Brouer wrote:
> The kfree_skb_list function walks SKB (via skb->next) and frees them
> individually to the SLUB/SLAB allocator (kmem_cache). It is more
> efficient to bulk free them via the kmem_cache_free_bulk API.
> 
> This patches create a stack local array with SKBs to bulk free while
> walking the list. Bulk array size is limited to 16 SKBs to trade off
> stack usage and efficiency. The SLUB kmem_cache "skbuff_head_cache"
> uses objsize 256 bytes usually in an order-1 page 8192 bytes that is
> 32 objects per slab (can vary on archs and due to SLUB sharing). Thus,
> for SLUB the optimal bulk free case is 32 objects belonging to same
> slab, but runtime this isn't likely to occur.
> 
> The expected gain from using kmem_cache bulk alloc and free API
> have been assessed via a microbencmark kernel module[1].
> 
> The module 'slab_bulk_test01' results at bulk 16 element:
>   kmem-in-loop Per elem: 109 cycles(tsc) 30.532 ns (step:16)
>   kmem-bulk    Per elem: 64 cycles(tsc) 17.905 ns (step:16)
> 
> More detailed description of benchmarks avail in [2].
> 
> [1] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm
> [2] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/kfree_skb_list01.org
> 
> V2: rename function to kfree_skb_add_bulk.
> 
> Reviewed-by: Saeed Mahameed <saeed@kernel.org>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>   net/core/skbuff.c |   40 +++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 39 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 007a5fbe284b..79c9e795a964 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -964,16 +964,54 @@ kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason)
>   }
>   EXPORT_SYMBOL(kfree_skb_reason);
>   
> +#define KFREE_SKB_BULK_SIZE	16
> +
> +struct skb_free_array {
> +	unsigned int skb_count;
> +	void *skb_array[KFREE_SKB_BULK_SIZE];
> +};
> +
> +static void kfree_skb_add_bulk(struct sk_buff *skb,
> +			       struct skb_free_array *sa,
> +			       enum skb_drop_reason reason)
> +{
> +	/* if SKB is a clone, don't handle this case */
> +	if (unlikely(skb->fclone != SKB_FCLONE_UNAVAILABLE)) {
> +		__kfree_skb(skb);
> +		return;
> +	}
> +
> +	skb_release_all(skb, reason);
> +	sa->skb_array[sa->skb_count++] = skb;
> +
> +	if (unlikely(sa->skb_count == KFREE_SKB_BULK_SIZE)) {
> +		kmem_cache_free_bulk(skbuff_head_cache, KFREE_SKB_BULK_SIZE,
> +				     sa->skb_array);
> +		sa->skb_count = 0;
> +	}
> +}
> +
>   void __fix_address
>   kfree_skb_list_reason(struct sk_buff *segs, enum skb_drop_reason reason)
>   {
> +	struct skb_free_array sa;
> +
> +	sa.skb_count = 0;
> +
>   	while (segs) {
>   		struct sk_buff *next = segs->next;
>   
> +		skb_mark_not_on_list(segs);

The syzbot[1] bug goes away if I remove this skb_mark_not_on_list().

I don't understand why I cannot clear skb->next here?

[1] https://lore.kernel.org/all/000000000000d58eae05f28ca51f@google.com/

>   		if (__kfree_skb_reason(segs, reason))
> -			__kfree_skb(segs);
> +			kfree_skb_add_bulk(segs, &sa, reason);
> +
>   		segs = next;
>   	}
> +
> +	if (sa.skb_count)
> +		kmem_cache_free_bulk(skbuff_head_cache, sa.skb_count,
> +				     sa.skb_array);
>   }
>   EXPORT_SYMBOL(kfree_skb_list_reason);
>   
> 
>
Jakub Kicinski Jan. 19, 2023, 2:26 a.m. UTC | #5
On Wed, 18 Jan 2023 22:37:47 +0100 Jesper Dangaard Brouer wrote:
> > +		skb_mark_not_on_list(segs);  
> 
> The syzbot[1] bug goes way if I remove this skb_mark_not_on_list().
> 
> I don't understand why I cannot clear skb->next here?

Some of the skbs on the list are not private?
IOW we should only unlink them if skb_unref().
Yunsheng Lin Jan. 19, 2023, 2:39 a.m. UTC | #6
On 2023/1/19 5:37, Jesper Dangaard Brouer wrote:
> (related to syzbot issue[1])
> 
> On 13/01/2023 14.52, Jesper Dangaard Brouer wrote:
>> The kfree_skb_list function walks SKB (via skb->next) and frees them
>> individually to the SLUB/SLAB allocator (kmem_cache). It is more
>> efficient to bulk free them via the kmem_cache_free_bulk API.
>>
>> This patches create a stack local array with SKBs to bulk free while
>> walking the list. Bulk array size is limited to 16 SKBs to trade off
>> stack usage and efficiency. The SLUB kmem_cache "skbuff_head_cache"
>> uses objsize 256 bytes usually in an order-1 page 8192 bytes that is
>> 32 objects per slab (can vary on archs and due to SLUB sharing). Thus,
>> for SLUB the optimal bulk free case is 32 objects belonging to same
>> slab, but runtime this isn't likely to occur.
>>
>> The expected gain from using kmem_cache bulk alloc and free API
>> have been assessed via a microbencmark kernel module[1].
>>
>> The module 'slab_bulk_test01' results at bulk 16 element:
>>   kmem-in-loop Per elem: 109 cycles(tsc) 30.532 ns (step:16)
>>   kmem-bulk    Per elem: 64 cycles(tsc) 17.905 ns (step:16)
>>
>> More detailed description of benchmarks avail in [2].
>>
>> [1] https://github.com/netoptimizer/prototype-kernel/tree/master/kernel/mm
>> [2] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/kfree_skb_list01.org
>>
>> V2: rename function to kfree_skb_add_bulk.
>>
>> Reviewed-by: Saeed Mahameed <saeed@kernel.org>
>> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
>> ---
>>   net/core/skbuff.c |   40 +++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 39 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>> index 007a5fbe284b..79c9e795a964 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -964,16 +964,54 @@ kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason)
>>   }
>>   EXPORT_SYMBOL(kfree_skb_reason);
>>   +#define KFREE_SKB_BULK_SIZE    16
>> +
>> +struct skb_free_array {
>> +    unsigned int skb_count;
>> +    void *skb_array[KFREE_SKB_BULK_SIZE];
>> +};
>> +
>> +static void kfree_skb_add_bulk(struct sk_buff *skb,
>> +                   struct skb_free_array *sa,
>> +                   enum skb_drop_reason reason)
>> +{
>> +    /* if SKB is a clone, don't handle this case */
>> +    if (unlikely(skb->fclone != SKB_FCLONE_UNAVAILABLE)) {
>> +        __kfree_skb(skb);
>> +        return;
>> +    }
>> +
>> +    skb_release_all(skb, reason);
>> +    sa->skb_array[sa->skb_count++] = skb;
>> +
>> +    if (unlikely(sa->skb_count == KFREE_SKB_BULK_SIZE)) {
>> +        kmem_cache_free_bulk(skbuff_head_cache, KFREE_SKB_BULK_SIZE,
>> +                     sa->skb_array);
>> +        sa->skb_count = 0;
>> +    }
>> +}
>> +
>>   void __fix_address
>>   kfree_skb_list_reason(struct sk_buff *segs, enum skb_drop_reason reason)
>>   {
>> +    struct skb_free_array sa;
>> +
>> +    sa.skb_count = 0;
>> +
>>       while (segs) {
>>           struct sk_buff *next = segs->next;
>>   +        skb_mark_not_on_list(segs);
> 
> The syzbot[1] bug goes way if I remove this skb_mark_not_on_list().
> 
> I don't understand why I cannot clear skb->next here?

Clearing skb->next seems unrelated; it may just increase the probability
of the problem recurring.

It seems kfree_skb_list_reason() is also used to release the skbs in
shinfo->frag_list, which should go through the skb_unref() check, and
this patch seems to skip that check for the skbs in shinfo->frag_list.

> 
> [1] https://lore.kernel.org/all/000000000000d58eae05f28ca51f@google.com/
> 
>>           if (__kfree_skb_reason(segs, reason))
>> -            __kfree_skb(segs);
>> +            kfree_skb_add_bulk(segs, &sa, reason);
>> +
>>           segs = next;
>>       }
>> +
>> +    if (sa.skb_count)
>> +        kmem_cache_free_bulk(skbuff_head_cache, sa.skb_count,
>> +                     sa.skb_array);
>>   }
>>   EXPORT_SYMBOL(kfree_skb_list_reason);
>>  
>>
> 
> .
>
Jesper Dangaard Brouer Jan. 19, 2023, 10:17 a.m. UTC | #7
On 19/01/2023 03.26, Jakub Kicinski wrote:
> On Wed, 18 Jan 2023 22:37:47 +0100 Jesper Dangaard Brouer wrote:
>>> +		skb_mark_not_on_list(segs);
>>
>> The syzbot[1] bug goes way if I remove this skb_mark_not_on_list().
>>
>> I don't understand why I cannot clear skb->next here?
> 
> Some of the skbs on the list are not private?
> IOW we should only unlink them if skb_unref().

Yes, you are right.

skb_mark_not_on_list() should only be called if __kfree_skb_reason()
returns true, meaning the SKB is ready to be freed (as that is where
skb_unref() is called/checked).

I will send a proper fix patch shortly... after syzbot does a test on it.
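A minimal sketch of that change, moving skb_mark_not_on_list() under the
__kfree_skb_reason() check (sketch only, not necessarily the exact patch
that will be sent):

 	while (segs) {
 		struct sk_buff *next = segs->next;

 		if (__kfree_skb_reason(segs, reason)) {
 			/* We own the SKB and skb_unref() has confirmed it is
 			 * really being freed, so unlinking is safe here.
 			 */
 			skb_mark_not_on_list(segs);
 			kfree_skb_add_bulk(segs, &sa, reason);
 		}

 		segs = next;
 	}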

--Jesper
Eric Dumazet Jan. 19, 2023, 10:28 a.m. UTC | #8
On Thu, Jan 19, 2023 at 11:18 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
>
> On 19/01/2023 03.26, Jakub Kicinski wrote:
> > On Wed, 18 Jan 2023 22:37:47 +0100 Jesper Dangaard Brouer wrote:
> >>> +           skb_mark_not_on_list(segs);
> >>
> >> The syzbot[1] bug goes way if I remove this skb_mark_not_on_list().
> >>
> >> I don't understand why I cannot clear skb->next here?
> >
> > Some of the skbs on the list are not private?
> > IOW we should only unlink them if skb_unref().
>
> Yes, you are right.
>
> The skb_mark_not_on_list() should only be called if __kfree_skb_reason()
> returns true, meaning the SKB is ready to be free'ed (as it calls/check
> skb_unref()).


This was already the case before your changes.

skb->next/prev cannot be shared by multiple users.

One skb can be put on a single list by definition.

Whoever calls kfree_skb_list(list) owns all the skb->next|prev pointers
found in the list.

So you can mangle skb->next as you like, even if skb_unref() says that
someone else holds a reference on the skb.

>
> I will send a proper fix patch shortly... after syzbot do a test on it.
>
> --Jesper
>
Jesper Dangaard Brouer Jan. 19, 2023, 11:22 a.m. UTC | #9
On 19/01/2023 11.28, Eric Dumazet wrote:
> On Thu, Jan 19, 2023 at 11:18 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
>>
>>
>>
>> On 19/01/2023 03.26, Jakub Kicinski wrote:
>>> On Wed, 18 Jan 2023 22:37:47 +0100 Jesper Dangaard Brouer wrote:
>>>>> +           skb_mark_not_on_list(segs);
>>>>
>>>> The syzbot[1] bug goes way if I remove this skb_mark_not_on_list().
>>>>
>>>> I don't understand why I cannot clear skb->next here?
>>>
>>> Some of the skbs on the list are not private?
>>> IOW we should only unlink them if skb_unref().
>>
>> Yes, you are right.
>>
>> The skb_mark_not_on_list() should only be called if __kfree_skb_reason()
>> returns true, meaning the SKB is ready to be free'ed (as it calls/check
>> skb_unref()).
> 
> 
> This was the case already before your changes.
> 
> skb->next/prev can not be shared by multiple users.
> 
> One skb can be put on a single list by definition.
> 
> Whoever calls kfree_skb_list(list) owns all the skbs->next|prev found
> in the list
> 
> So you can mangle skb->next as you like, even if the unref() is
> telling that someone
> else has a reference on skb.

Then why does the bug go away if I remove the skb_mark_not_on_list() call?

>>
>> I will send a proper fix patch shortly... after syzbot do a test on it.
>>

I've sent a patch to syzbot that only calls skb_mark_not_on_list() when
skb_unref() and __kfree_skb_reason() "permit" this.
I tested it locally with the reproducer and it also fixes/"removes" the bug.

--Jesper
Eric Dumazet Jan. 19, 2023, 11:46 a.m. UTC | #10
On Thu, Jan 19, 2023 at 12:22 PM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
>
> On 19/01/2023 11.28, Eric Dumazet wrote:
> > On Thu, Jan 19, 2023 at 11:18 AM Jesper Dangaard Brouer
> > <jbrouer@redhat.com> wrote:
> >>
> >>
> >>
> >> On 19/01/2023 03.26, Jakub Kicinski wrote:
> >>> On Wed, 18 Jan 2023 22:37:47 +0100 Jesper Dangaard Brouer wrote:
> >>>>> +           skb_mark_not_on_list(segs);
> >>>>
> >>>> The syzbot[1] bug goes way if I remove this skb_mark_not_on_list().
> >>>>
> >>>> I don't understand why I cannot clear skb->next here?
> >>>
> >>> Some of the skbs on the list are not private?
> >>> IOW we should only unlink them if skb_unref().
> >>
> >> Yes, you are right.
> >>
> >> The skb_mark_not_on_list() should only be called if __kfree_skb_reason()
> >> returns true, meaning the SKB is ready to be free'ed (as it calls/check
> >> skb_unref()).
> >
> >
> > This was the case already before your changes.
> >
> > skb->next/prev can not be shared by multiple users.
> >
> > One skb can be put on a single list by definition.
> >
> > Whoever calls kfree_skb_list(list) owns all the skbs->next|prev found
> > in the list
> >
> > So you can mangle skb->next as you like, even if the unref() is
> > telling that someone
> > else has a reference on skb.
>
> Then why does the bug go way if I remove the skb_mark_not_on_list() call
> then?
>

Some side effects.

This _particular_ repro uses a specific pattern that might be defeated
by a small change (i.e. the change would just be working around another bug).

Instead of setting skb->next to NULL, try to set it to

skb->next = (struct sk_buff *)0x800;

This might show a different pattern.

> >>
> >> I will send a proper fix patch shortly... after syzbot do a test on it.
> >>
>
> I've send a patch for syzbot that only calls skb_mark_not_on_list() when
> unref() and __kfree_skb_reason() "permits" this.
> I tested it locally with reproducer and it also fixes/"removes" the bug.

This does not mean we will accept a patch with no clear explanation
other than "this removes a syzbot bug, so this must be good".

Make sure to give precise details on _why_ this is needed or not.

Again, the user of kfree_skb_list(list) _owns_ skb->next for sure.
If you think this assertion is not true, we are in big trouble.
Jesper Dangaard Brouer Jan. 19, 2023, 1:18 p.m. UTC | #11
On 19/01/2023 12.46, Eric Dumazet wrote:
> On Thu, Jan 19, 2023 at 12:22 PM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
>>
>>
>>
>> On 19/01/2023 11.28, Eric Dumazet wrote:
>>> On Thu, Jan 19, 2023 at 11:18 AM Jesper Dangaard Brouer
>>> <jbrouer@redhat.com> wrote:
>>>>
>>>>
>>>>
>>>> On 19/01/2023 03.26, Jakub Kicinski wrote:
>>>>> On Wed, 18 Jan 2023 22:37:47 +0100 Jesper Dangaard Brouer wrote:
>>>>>>> +           skb_mark_not_on_list(segs);
>>>>>>
>>>>>> The syzbot[1] bug goes way if I remove this skb_mark_not_on_list().
>>>>>>
>>>>>> I don't understand why I cannot clear skb->next here?
>>>>>
>>>>> Some of the skbs on the list are not private?
>>>>> IOW we should only unlink them if skb_unref().
>>>>
>>>> Yes, you are right.
>>>>
>>>> The skb_mark_not_on_list() should only be called if __kfree_skb_reason()
>>>> returns true, meaning the SKB is ready to be free'ed (as it calls/check
>>>> skb_unref()).
>>>
>>>
>>> This was the case already before your changes.
>>>
>>> skb->next/prev can not be shared by multiple users.
>>>
>>> One skb can be put on a single list by definition.
>>>
>>> Whoever calls kfree_skb_list(list) owns all the skbs->next|prev found
>>> in the list
>>>
>>> So you can mangle skb->next as you like, even if the unref() is
>>> telling that someone
>>> else has a reference on skb.
>>
>> Then why does the bug go way if I remove the skb_mark_not_on_list() call
>> then?
>>
> 
> Some side effects.
> 
> This _particular_ repro uses a specific pattern that might be defeated
> by a small change.
> (just working around another bug)
> 
> Instead of setting skb->next to NULL, try to set it to
> 
> skb->next = (struct sk_buff *)0x800;
> 
> This might show a different pattern.

Nice trick, I'll use this next time.

I modified the code and added a kfree_skb tracepoint with a known reason
(PROTO_MEM) to capture the call stack; see the end of this email. It shows
that the multicast code path is involved.

  trace_kfree_skb(segs, __builtin_return_address(0), SKB_DROP_REASON_PROTO_MEM);

>>>>
>>>> I will send a proper fix patch shortly... after syzbot do a test on it.
>>>>
>>
>> I've send a patch for syzbot that only calls skb_mark_not_on_list() when
>> unref() and __kfree_skb_reason() "permits" this.
>> I tested it locally with reproducer and it also fixes/"removes" the bug.
> 
> This does not mean we will accept a patch with no clear explanation
> other than "this removes a syzbot bug, so this must be good"
> 
> Make sure to give precise details on _why_ this is needed or not.
> 
> Again, the user of kfree_skb_list(list) _owns_ skb->next for sure.
> If you think this assertion is not true, we are in big trouble.
> 

I think I have found an explanation of why/when the refcnt can be elevated
on an SKB list: handling of skb_shinfo(skb)->frag_list can increase the refcnt.

See code:
  static void skb_clone_fraglist(struct sk_buff *skb)
  {
	struct sk_buff *list;

	skb_walk_frags(skb, list)
		skb_get(list);
  }

This walks the SKBs on shinfo->frag_list and increases the refcnt
(skb->users).

Notice that kfree_skb_list is also called when freeing the SKB's
"frag_list" in skb_release_data().

IMHO this explains why we may only remove the SKB from the list when
"permitted" by skb_unref(), i.e. if __kfree_skb_reason() returns true.

--Jesper


Call stack of the case with an elevated refcnt when walking the SKB list:
--------------------------------------------------------------------------

repro  3048 [003]   101.689670: skb:kfree_skb: skbaddr=0xffff888104086600 protocol=0 location=skb_release_data.cold+0x25 reason: PROTO_MEM
         ffffffff81bb6448 kfree_skb_list_reason.cold+0x3b (/boot/vmlinux-6.2.0-rc3-net-next-kfunc-xdp-hints+)
         ffffffff81bb6448 kfree_skb_list_reason.cold+0x3b (/boot/vmlinux-6.2.0-rc3-net-next-kfunc-xdp-hints+)
         ffffffff81bb6484 skb_release_data.cold+0x25 (/boot/vmlinux-6.2.0-rc3-net-next-kfunc-xdp-hints+)
         ffffffff819137e1 kfree_skb_reason+0x41 (/boot/vmlinux-6.2.0-rc3-net-next-kfunc-xdp-hints+)
         ffffffff81a1e246 igmp_rcv+0xf6 (/boot/vmlinux-6.2.0-rc3-net-next-kfunc-xdp-hints+)
         ffffffff819c5e85 ip_protocol_deliver_rcu+0x165 (/boot/vmlinux-6.2.0-rc3-net-next-kfunc-xdp-hints+)
         ffffffff819c5f12 ip_local_deliver_finish+0x72 (/boot/vmlinux-6.2.0-rc3-net-next-kfunc-xdp-hints+)
         ffffffff81930b89 __netif_receive_skb_one_core+0x69 (/boot/vmlinux-6.2.0-rc3-net-next-kfunc-xdp-hints+)
         ffffffff81930e31 process_backlog+0x91 (/boot/vmlinux-6.2.0-rc3-net-next-kfunc-xdp-hints+)
         ffffffff8193197b __napi_poll+0x2b (/boot/vmlinux-6.2.0-rc3-net-next-kfunc-xdp-hints+)
         ffffffff81931e86 net_rx_action+0x276 (/boot/vmlinux-6.2.0-rc3-net-next-kfunc-xdp-hints+)
         ffffffff81bcf083 __do_softirq+0xd3 (/boot/vmlinux-6.2.0-rc3-net-next-kfunc-xdp-hints+)
         ffffffff81087243 do_softirq+0x63 (/boot/vmlinux-6.2.0-rc3-net-next-kfunc-xdp-hints+)
         ffffffff810872e4 __local_bh_enable_ip+0x64 (/boot/vmlinux-6.2.0-rc3-net-next-kfunc-xdp-hints+)
         ffffffff8192ba8f netif_rx+0xdf (/boot/vmlinux-6.2.0-rc3-net-next-kfunc-xdp-hints+)
         ffffffff8192bb27 dev_loopback_xmit+0x77 (/boot/vmlinux-6.2.0-rc3-net-next-kfunc-xdp-hints+)
         ffffffff819c9195 ip_mc_finish_output+0x65 (/boot/vmlinux-6.2.0-rc3-net-next-kfunc-xdp-hints+)
         ffffffff819cc597 ip_mc_output+0x137 (/boot/vmlinux-6.2.0-rc3-net-next-kfunc-xdp-hints+)
         ffffffff819cd2b8 ip_push_pending_frames+0xa8 (/boot/vmlinux-6.2.0-rc3-net-next-kfunc-xdp-hints+)
         ffffffff81a04f67 raw_sendmsg+0x607 (/boot/vmlinux-6.2.0-rc3-net-next-kfunc-xdp-hints+)
         ffffffff8190153b sock_sendmsg+0x8b (/boot/vmlinux-6.2.0-rc3-net-next-kfunc-xdp-hints+)
         ffffffff8190323b __sys_sendto+0xeb (/boot/vmlinux-6.2.0-rc3-net-next-kfunc-xdp-hints+)
         ffffffff819032b0 __x64_sys_sendto+0x20 (/boot/vmlinux-6.2.0-rc3-net-next-kfunc-xdp-hints+)
         ffffffff81bb9ffa do_syscall_64+0x3a (/boot/vmlinux-6.2.0-rc3-net-next-kfunc-xdp-hints+)
         ffffffff81c000aa entry_SYSCALL_64+0xaa (/boot/vmlinux-6.2.0-rc3-net-next-kfunc-xdp-hints+)
                   10af3d syscall+0x1d (/usr/lib64/libc.so.6)
                     46d3 do_sandbox_none+0x81 (/home/jbrouer/syzbot-kfree_skb_list/repro)
                     48ce main+0xa5 (/home/jbrouer/syzbot-kfree_skb_list/repro)
                    29550 __libc_start_call_main+0x80 (/usr/lib64/libc.so.6)




Resolving some symbols:

ip_mc_finish_output+0x65
------------------------
[net-next]$ ./scripts/faddr2line net/ipv4/ip_output.o ip_mc_finish_output+0x65
ip_mc_finish_output+0x65/0x190:
ip_mc_finish_output at net/ipv4/ip_output.c:356

ip_mc_output+0x137
------------------
[net-next]$ ./scripts/faddr2line vmlinux ip_mc_output+0x137
ip_mc_output+0x137/0x2a0:
skb_network_header at include/linux/skbuff.h:2829
(inlined by) ip_hdr at include/linux/ip.h:21
(inlined by) ip_mc_output at net/ipv4/ip_output.c:401

igmp_rcv+0xf6
-------------
[net-next]$ ./scripts/faddr2line vmlinux igmp_rcv+0xf6
igmp_rcv+0xf6/0x2e0:
igmp_rcv at net/ipv4/igmp.c:1130

Patch

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 007a5fbe284b..79c9e795a964 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -964,16 +964,54 @@  kfree_skb_reason(struct sk_buff *skb, enum skb_drop_reason reason)
 }
 EXPORT_SYMBOL(kfree_skb_reason);
 
+#define KFREE_SKB_BULK_SIZE	16
+
+struct skb_free_array {
+	unsigned int skb_count;
+	void *skb_array[KFREE_SKB_BULK_SIZE];
+};
+
+static void kfree_skb_add_bulk(struct sk_buff *skb,
+			       struct skb_free_array *sa,
+			       enum skb_drop_reason reason)
+{
+	/* if SKB is a clone, don't handle this case */
+	if (unlikely(skb->fclone != SKB_FCLONE_UNAVAILABLE)) {
+		__kfree_skb(skb);
+		return;
+	}
+
+	skb_release_all(skb, reason);
+	sa->skb_array[sa->skb_count++] = skb;
+
+	if (unlikely(sa->skb_count == KFREE_SKB_BULK_SIZE)) {
+		kmem_cache_free_bulk(skbuff_head_cache, KFREE_SKB_BULK_SIZE,
+				     sa->skb_array);
+		sa->skb_count = 0;
+	}
+}
+
 void __fix_address
 kfree_skb_list_reason(struct sk_buff *segs, enum skb_drop_reason reason)
 {
+	struct skb_free_array sa;
+
+	sa.skb_count = 0;
+
 	while (segs) {
 		struct sk_buff *next = segs->next;
 
+		skb_mark_not_on_list(segs);
+
 		if (__kfree_skb_reason(segs, reason))
-			__kfree_skb(segs);
+			kfree_skb_add_bulk(segs, &sa, reason);
+
 		segs = next;
 	}
+
+	if (sa.skb_count)
+		kmem_cache_free_bulk(skbuff_head_cache, sa.skb_count,
+				     sa.skb_array);
 }
 EXPORT_SYMBOL(kfree_skb_list_reason);