diff mbox series

[net] net: fix use-after-free when UDP GRO with shared fraglist

Message ID 1609750005-115609-1-git-send-email-dseok.yi@samsung.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series [net] net: fix use-after-free when UDP GRO with shared fraglist | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for net
netdev/subject_prefix success Link
netdev/cc_maintainers warning 1 maintainers not CCed: vladimir.oltean@nxp.com
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 1 this patch: 1
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 37 lines checked
netdev/build_allmodconfig_warn success Errors and warnings before: 1 this patch: 1
netdev/header_inline success Link
netdev/stable success Stable not CCed

Commit Message

Dongseok Yi Jan. 4, 2021, 8:46 a.m. UTC
skbs in frag_list could be shared by pskb_expand_head() from BPF.

While tcpdump, sk_receive_queue of PF_PACKET has the original frag_list.
But the same frag_list is queued to PF_INET (or PF_INET6) as the fraglist
chain made by skb_segment_list().

If the new skb (not frag_list) is queued to one of the sk_receive_queue,
multiple ptypes can see this. The skb could be released by ptypes and
it causes use-after-free.

[ 4443.426215] ------------[ cut here ]------------
[ 4443.426222] refcount_t: underflow; use-after-free.
[ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
[ 4443.426808] Call trace:
[ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426823]  skb_release_data+0x144/0x264
[ 4443.426828]  kfree_skb+0x58/0xc4
[ 4443.426832]  skb_queue_purge+0x64/0x9c
[ 4443.426844]  packet_set_ring+0x5f0/0x820
[ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
[ 4443.426853]  __sys_setsockopt+0x188/0x278
[ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
[ 4443.426869]  el0_svc_common+0xf0/0x1d0
[ 4443.426873]  el0_svc_handler+0x74/0x98
[ 4443.426880]  el0_svc+0x8/0xc

Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.)
Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
---
 net/core/skbuff.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

Comments

Willem de Bruijn Jan. 4, 2021, 9:03 p.m. UTC | #1
On Mon, Jan 4, 2021 at 4:00 AM Dongseok Yi <dseok.yi@samsung.com> wrote:
>
> skbs in frag_list could be shared by pskb_expand_head() from BPF.

Can you elaborate on the BPF connection?

> While tcpdump, sk_receive_queue of PF_PACKET has the original frag_list.
> But the same frag_list is queued to PF_INET (or PF_INET6) as the fraglist
> chain made by skb_segment_list().
>
> If the new skb (not frag_list) is queued to one of the sk_receive_queue,
> multiple ptypes can see this. The skb could be released by ptypes and
> it causes use-after-free.

If I understand correctly, a udp-gro-list skb makes it up the receive
path with one or more active packet sockets.

The packet socket will call skb_clone after accepting the filter. This
replaces the head_skb, but shares the skb_shinfo and thus frag_list.

udp_rcv_segment later converts the udp-gro-list skb to a list of
regular packets to pass these one-by-one to udp_queue_rcv_one_skb.
Now all the frags are fully fledged packets, with headers pushed
before the payload. This does not change their refcount anymore than
the skb_clone in pf_packet did. This should be 1.

Eventually udp_recvmsg will call skb_consume_udp on each packet.

The packet socket eventually also frees its cloned head_skb, which triggers

  kfree_skb_list(shinfo->frag_list)
    kfree_skb
      skb_unref
        refcount_dec_and_test(&skb->users)

>
> [ 4443.426215] ------------[ cut here ]------------
> [ 4443.426222] refcount_t: underflow; use-after-free.
> [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
> refcount_dec_and_test_checked+0xa4/0xc8
> [ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO)
> [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
> [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
> [ 4443.426808] Call trace:
> [ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
> [ 4443.426823]  skb_release_data+0x144/0x264
> [ 4443.426828]  kfree_skb+0x58/0xc4
> [ 4443.426832]  skb_queue_purge+0x64/0x9c
> [ 4443.426844]  packet_set_ring+0x5f0/0x820
> [ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
> [ 4443.426853]  __sys_setsockopt+0x188/0x278
> [ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
> [ 4443.426869]  el0_svc_common+0xf0/0x1d0
> [ 4443.426873]  el0_svc_handler+0x74/0x98
> [ 4443.426880]  el0_svc+0x8/0xc
>
> Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.)
> Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
> ---
>  net/core/skbuff.c | 20 +++++++++++++++++++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f62cae3..1dcbda8 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3655,7 +3655,8 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
>         unsigned int delta_truesize = 0;
>         unsigned int delta_len = 0;
>         struct sk_buff *tail = NULL;
> -       struct sk_buff *nskb;
> +       struct sk_buff *nskb, *tmp;
> +       int err;
>
>         skb_push(skb, -skb_network_offset(skb) + offset);
>
> @@ -3665,11 +3666,28 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
>                 nskb = list_skb;
>                 list_skb = list_skb->next;
>
> +               err = 0;
> +               if (skb_shared(nskb)) {

I must be missing something still. This does not square with my
understanding that the two sockets are operating on clones, with each
frag_list skb having skb->users == 1.

Unless the packet socket patch previously also triggered an
skb_unclone/pskb_expand_head, as that call skb_clone_fraglist, which
calls skb_get on each frag_list skb.


> +                       tmp = skb_clone(nskb, GFP_ATOMIC);
> +                       if (tmp) {
> +                               kfree_skb(nskb);
> +                               nskb = tmp;
> +                               err = skb_unclone(nskb, GFP_ATOMIC);
> +                       } else {
> +                               err = -ENOMEM;
> +                       }
> +               }
> +
>                 if (!tail)
>                         skb->next = nskb;
>                 else
>                         tail->next = nskb;
>
> +               if (unlikely(err)) {
> +                       nskb->next = list_skb;
> +                       goto err_linearize;
> +               }
> +
>                 tail = nskb;
>
>                 delta_len += nskb->len;
> --
> 2.7.4
>
Dongseok Yi Jan. 6, 2021, 1:29 a.m. UTC | #2
On 2021-01-05 06:03, Willem de Bruijn wrote:
> 
> On Mon, Jan 4, 2021 at 4:00 AM Dongseok Yi <dseok.yi@samsung.com> wrote:
> >
> > skbs in frag_list could be shared by pskb_expand_head() from BPF.
> 
> Can you elaborate on the BPF connection?

With the following registered ptypes,

/proc/net # cat ptype
Type Device      Function
ALL           tpacket_rcv
0800          ip_rcv.cfi_jt
0011          llc_rcv.cfi_jt
0004          llc_rcv.cfi_jt
0806          arp_rcv
86dd          ipv6_rcv.cfi_jt

BPF checks skb_ensure_writable between tpacket_rcv and ip_rcv
(or ipv6_rcv). And it calls pskb_expand_head.

[  132.051228] pskb_expand_head+0x360/0x378
[  132.051237] skb_ensure_writable+0xa0/0xc4
[  132.051249] bpf_skb_pull_data+0x28/0x60
[  132.051262] bpf_prog_331d69c77ea5e964_schedcls_ingres+0x5f4/0x1000
[  132.051273] cls_bpf_classify+0x254/0x348
[  132.051284] tcf_classify+0xa4/0x180
[  132.051294] __netif_receive_skb_core+0x590/0xd28
[  132.051303] __netif_receive_skb+0x50/0x17c
[  132.051312] process_backlog+0x15c/0x1b8

> 
> > While tcpdump, sk_receive_queue of PF_PACKET has the original frag_list.
> > But the same frag_list is queued to PF_INET (or PF_INET6) as the fraglist
> > chain made by skb_segment_list().
> >
> > If the new skb (not frag_list) is queued to one of the sk_receive_queue,
> > multiple ptypes can see this. The skb could be released by ptypes and
> > it causes use-after-free.
> 
> If I understand correctly, a udp-gro-list skb makes it up the receive
> path with one or more active packet sockets.
> 
> The packet socket will call skb_clone after accepting the filter. This
> replaces the head_skb, but shares the skb_shinfo and thus frag_list.
> 
> udp_rcv_segment later converts the udp-gro-list skb to a list of
> regular packets to pass these one-by-one to udp_queue_rcv_one_skb.
> Now all the frags are fully fledged packets, with headers pushed
> before the payload. This does not change their refcount anymore than
> the skb_clone in pf_packet did. This should be 1.
> 
> Eventually udp_recvmsg will call skb_consume_udp on each packet.
> 
> The packet socket eventually also frees its cloned head_skb, which triggers
> 
>   kfree_skb_list(shinfo->frag_list)
>     kfree_skb
>       skb_unref
>         refcount_dec_and_test(&skb->users)

Every your understanding is right, but

> 
> >
> > [ 4443.426215] ------------[ cut here ]------------
> > [ 4443.426222] refcount_t: underflow; use-after-free.
> > [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
> > refcount_dec_and_test_checked+0xa4/0xc8
> > [ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO)
> > [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
> > [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
> > [ 4443.426808] Call trace:
> > [ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
> > [ 4443.426823]  skb_release_data+0x144/0x264
> > [ 4443.426828]  kfree_skb+0x58/0xc4
> > [ 4443.426832]  skb_queue_purge+0x64/0x9c
> > [ 4443.426844]  packet_set_ring+0x5f0/0x820
> > [ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
> > [ 4443.426853]  __sys_setsockopt+0x188/0x278
> > [ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
> > [ 4443.426869]  el0_svc_common+0xf0/0x1d0
> > [ 4443.426873]  el0_svc_handler+0x74/0x98
> > [ 4443.426880]  el0_svc+0x8/0xc
> >
> > Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.)
> > Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
> > ---
> >  net/core/skbuff.c | 20 +++++++++++++++++++-
> >  1 file changed, 19 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index f62cae3..1dcbda8 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -3655,7 +3655,8 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
> >         unsigned int delta_truesize = 0;
> >         unsigned int delta_len = 0;
> >         struct sk_buff *tail = NULL;
> > -       struct sk_buff *nskb;
> > +       struct sk_buff *nskb, *tmp;
> > +       int err;
> >
> >         skb_push(skb, -skb_network_offset(skb) + offset);
> >
> > @@ -3665,11 +3666,28 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
> >                 nskb = list_skb;
> >                 list_skb = list_skb->next;
> >
> > +               err = 0;
> > +               if (skb_shared(nskb)) {
> 
> I must be missing something still. This does not square with my
> understanding that the two sockets are operating on clones, with each
> frag_list skb having skb->users == 1.
> 
> Unless the packet socket patch previously also triggered an
> skb_unclone/pskb_expand_head, as that call skb_clone_fraglist, which
> calls skb_get on each frag_list skb.

A cloned skb after tpacket_rcv cannot go through skb_ensure_writable
with the original shinfo. pskb_expand_head reallocates the shinfo of
the skb and call skb_clone_fraglist. skb_release_data in
pskb_expand_head could not reduce skb->users of the each frag_list skb
if skb_shinfo(skb)->dataref == 2.

After the reallocation, skb_shinfo(skb)->dataref == 1 but each frag_list
skb could have skb->users == 2.

> 
> 
> > +                       tmp = skb_clone(nskb, GFP_ATOMIC);
> > +                       if (tmp) {
> > +                               kfree_skb(nskb);
> > +                               nskb = tmp;
> > +                               err = skb_unclone(nskb, GFP_ATOMIC);
> > +                       } else {
> > +                               err = -ENOMEM;
> > +                       }
> > +               }
> > +
> >                 if (!tail)
> >                         skb->next = nskb;
> >                 else
> >                         tail->next = nskb;
> >
> > +               if (unlikely(err)) {
> > +                       nskb->next = list_skb;
> > +                       goto err_linearize;
> > +               }
> > +
> >                 tail = nskb;
> >
> >                 delta_len += nskb->len;
> > --
> > 2.7.4
> >
Willem de Bruijn Jan. 6, 2021, 3:07 a.m. UTC | #3
On Tue, Jan 5, 2021 at 8:29 PM Dongseok Yi <dseok.yi@samsung.com> wrote:
>
> On 2021-01-05 06:03, Willem de Bruijn wrote:
> >
> > On Mon, Jan 4, 2021 at 4:00 AM Dongseok Yi <dseok.yi@samsung.com> wrote:
> > >
> > > skbs in frag_list could be shared by pskb_expand_head() from BPF.
> >
> > Can you elaborate on the BPF connection?
>
> With the following registered ptypes,
>
> /proc/net # cat ptype
> Type Device      Function
> ALL           tpacket_rcv
> 0800          ip_rcv.cfi_jt
> 0011          llc_rcv.cfi_jt
> 0004          llc_rcv.cfi_jt
> 0806          arp_rcv
> 86dd          ipv6_rcv.cfi_jt
>
> BPF checks skb_ensure_writable between tpacket_rcv and ip_rcv
> (or ipv6_rcv). And it calls pskb_expand_head.
>
> [  132.051228] pskb_expand_head+0x360/0x378
> [  132.051237] skb_ensure_writable+0xa0/0xc4
> [  132.051249] bpf_skb_pull_data+0x28/0x60
> [  132.051262] bpf_prog_331d69c77ea5e964_schedcls_ingres+0x5f4/0x1000
> [  132.051273] cls_bpf_classify+0x254/0x348
> [  132.051284] tcf_classify+0xa4/0x180

Ah, you have a BPF program loaded at TC. That was not entirely obvious.

This program gets called after packet sockets with ptype_all, before
those with a specific protocol.

Tcpdump will have inserted a program with ptype_all, which cloned the
skb. This triggers skb_ensure_writable -> pskb_expand_head ->
skb_clone_fraglist -> skb_get.

> [  132.051294] __netif_receive_skb_core+0x590/0xd28
> [  132.051303] __netif_receive_skb+0x50/0x17c
> [  132.051312] process_backlog+0x15c/0x1b8
>
> >
> > > While tcpdump, sk_receive_queue of PF_PACKET has the original frag_list.
> > > But the same frag_list is queued to PF_INET (or PF_INET6) as the fraglist
> > > chain made by skb_segment_list().
> > >
> > > If the new skb (not frag_list) is queued to one of the sk_receive_queue,
> > > multiple ptypes can see this. The skb could be released by ptypes and
> > > it causes use-after-free.
> >
> > If I understand correctly, a udp-gro-list skb makes it up the receive
> > path with one or more active packet sockets.
> >
> > The packet socket will call skb_clone after accepting the filter. This
> > replaces the head_skb, but shares the skb_shinfo and thus frag_list.
> >
> > udp_rcv_segment later converts the udp-gro-list skb to a list of
> > regular packets to pass these one-by-one to udp_queue_rcv_one_skb.
> > Now all the frags are fully fledged packets, with headers pushed
> > before the payload. This does not change their refcount anymore than
> > the skb_clone in pf_packet did. This should be 1.
> >
> > Eventually udp_recvmsg will call skb_consume_udp on each packet.
> >
> > The packet socket eventually also frees its cloned head_skb, which triggers
> >
> >   kfree_skb_list(shinfo->frag_list)
> >     kfree_skb
> >       skb_unref
> >         refcount_dec_and_test(&skb->users)
>
> Every your understanding is right, but
>
> >
> > >
> > > [ 4443.426215] ------------[ cut here ]------------
> > > [ 4443.426222] refcount_t: underflow; use-after-free.
> > > [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
> > > refcount_dec_and_test_checked+0xa4/0xc8
> > > [ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO)
> > > [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
> > > [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
> > > [ 4443.426808] Call trace:
> > > [ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
> > > [ 4443.426823]  skb_release_data+0x144/0x264
> > > [ 4443.426828]  kfree_skb+0x58/0xc4
> > > [ 4443.426832]  skb_queue_purge+0x64/0x9c
> > > [ 4443.426844]  packet_set_ring+0x5f0/0x820
> > > [ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
> > > [ 4443.426853]  __sys_setsockopt+0x188/0x278
> > > [ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
> > > [ 4443.426869]  el0_svc_common+0xf0/0x1d0
> > > [ 4443.426873]  el0_svc_handler+0x74/0x98
> > > [ 4443.426880]  el0_svc+0x8/0xc
> > >
> > > Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.)
> > > Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
> > > ---
> > >  net/core/skbuff.c | 20 +++++++++++++++++++-
> > >  1 file changed, 19 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index f62cae3..1dcbda8 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -3655,7 +3655,8 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
> > >         unsigned int delta_truesize = 0;
> > >         unsigned int delta_len = 0;
> > >         struct sk_buff *tail = NULL;
> > > -       struct sk_buff *nskb;
> > > +       struct sk_buff *nskb, *tmp;
> > > +       int err;
> > >
> > >         skb_push(skb, -skb_network_offset(skb) + offset);
> > >
> > > @@ -3665,11 +3666,28 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
> > >                 nskb = list_skb;
> > >                 list_skb = list_skb->next;
> > >
> > > +               err = 0;
> > > +               if (skb_shared(nskb)) {
> >
> > I must be missing something still. This does not square with my
> > understanding that the two sockets are operating on clones, with each
> > frag_list skb having skb->users == 1.
> >
> > Unless the packet socket patch previously also triggered an
> > skb_unclone/pskb_expand_head, as that call skb_clone_fraglist, which
> > calls skb_get on each frag_list skb.
>
> A cloned skb after tpacket_rcv cannot go through skb_ensure_writable
> with the original shinfo. pskb_expand_head reallocates the shinfo of
> the skb and call skb_clone_fraglist. skb_release_data in
> pskb_expand_head could not reduce skb->users of the each frag_list skb
> if skb_shinfo(skb)->dataref == 2.
>
> After the reallocation, skb_shinfo(skb)->dataref == 1 but each frag_list
> skb could have skb->users == 2.

Yes, that makes sense. skb_clone_fraglist just increments the
frag_list skb's refcounts.

skb_segment_list must create an unshared struct sk_buff before it
changes skb data to insert the protocol headers.

> >
> >
> > > +                       tmp = skb_clone(nskb, GFP_ATOMIC);
> > > +                       if (tmp) {
> > > +                               kfree_skb(nskb);
> > > +                               nskb = tmp;
> > > +                               err = skb_unclone(nskb, GFP_ATOMIC);

Calling clone and unclone in quick succession looks odd.

But you need the first to create a private skb and to trigger the
second to create a private copy of the linear data (as well as frags,
if any, but these are not touched). So this looks okay.

> > > +                       } else {
> > > +                               err = -ENOMEM;
> > > +                       }
> > > +               }
> > > +
> > >                 if (!tail)
> > >                         skb->next = nskb;
> > >                 else
> > >                         tail->next = nskb;
> > >
> > > +               if (unlikely(err)) {
> > > +                       nskb->next = list_skb;

To avoid leaking these skbs when calling kfree_skb_list(skb->next). Is
that concern new with this patch, or also needed for the existing
error case?

> > > +                       goto err_linearize;
> > > +               }
> > > +
> > >                 tail = nskb;
> > >
> > >                 delta_len += nskb->len;
> > > --
> > > 2.7.4
> > >
>
Dongseok Yi Jan. 6, 2021, 3:32 a.m. UTC | #4
On 2021-01-06 12:07, Willem de Bruijn wrote:
> 
> On Tue, Jan 5, 2021 at 8:29 PM Dongseok Yi <dseok.yi@samsung.com> wrote:
> >
> > On 2021-01-05 06:03, Willem de Bruijn wrote:
> > >
> > > On Mon, Jan 4, 2021 at 4:00 AM Dongseok Yi <dseok.yi@samsung.com> wrote:
> > > >
> > > > skbs in frag_list could be shared by pskb_expand_head() from BPF.
> > >
> > > Can you elaborate on the BPF connection?
> >
> > With the following registered ptypes,
> >
> > /proc/net # cat ptype
> > Type Device      Function
> > ALL           tpacket_rcv
> > 0800          ip_rcv.cfi_jt
> > 0011          llc_rcv.cfi_jt
> > 0004          llc_rcv.cfi_jt
> > 0806          arp_rcv
> > 86dd          ipv6_rcv.cfi_jt
> >
> > BPF checks skb_ensure_writable between tpacket_rcv and ip_rcv
> > (or ipv6_rcv). And it calls pskb_expand_head.
> >
> > [  132.051228] pskb_expand_head+0x360/0x378
> > [  132.051237] skb_ensure_writable+0xa0/0xc4
> > [  132.051249] bpf_skb_pull_data+0x28/0x60
> > [  132.051262] bpf_prog_331d69c77ea5e964_schedcls_ingres+0x5f4/0x1000
> > [  132.051273] cls_bpf_classify+0x254/0x348
> > [  132.051284] tcf_classify+0xa4/0x180
> 
> Ah, you have a BPF program loaded at TC. That was not entirely obvious.
> 
> This program gets called after packet sockets with ptype_all, before
> those with a specific protocol.
> 
> Tcpdump will have inserted a program with ptype_all, which cloned the
> skb. This triggers skb_ensure_writable -> pskb_expand_head ->
> skb_clone_fraglist -> skb_get.
> 
> > [  132.051294] __netif_receive_skb_core+0x590/0xd28
> > [  132.051303] __netif_receive_skb+0x50/0x17c
> > [  132.051312] process_backlog+0x15c/0x1b8
> >
> > >
> > > > While tcpdump, sk_receive_queue of PF_PACKET has the original frag_list.
> > > > But the same frag_list is queued to PF_INET (or PF_INET6) as the fraglist
> > > > chain made by skb_segment_list().
> > > >
> > > > If the new skb (not frag_list) is queued to one of the sk_receive_queue,
> > > > multiple ptypes can see this. The skb could be released by ptypes and
> > > > it causes use-after-free.
> > >
> > > If I understand correctly, a udp-gro-list skb makes it up the receive
> > > path with one or more active packet sockets.
> > >
> > > The packet socket will call skb_clone after accepting the filter. This
> > > replaces the head_skb, but shares the skb_shinfo and thus frag_list.
> > >
> > > udp_rcv_segment later converts the udp-gro-list skb to a list of
> > > regular packets to pass these one-by-one to udp_queue_rcv_one_skb.
> > > Now all the frags are fully fledged packets, with headers pushed
> > > before the payload. This does not change their refcount anymore than
> > > the skb_clone in pf_packet did. This should be 1.
> > >
> > > Eventually udp_recvmsg will call skb_consume_udp on each packet.
> > >
> > > The packet socket eventually also frees its cloned head_skb, which triggers
> > >
> > >   kfree_skb_list(shinfo->frag_list)
> > >     kfree_skb
> > >       skb_unref
> > >         refcount_dec_and_test(&skb->users)
> >
> > Every your understanding is right, but
> >
> > >
> > > >
> > > > [ 4443.426215] ------------[ cut here ]------------
> > > > [ 4443.426222] refcount_t: underflow; use-after-free.
> > > > [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
> > > > refcount_dec_and_test_checked+0xa4/0xc8
> > > > [ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO)
> > > > [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
> > > > [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
> > > > [ 4443.426808] Call trace:
> > > > [ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
> > > > [ 4443.426823]  skb_release_data+0x144/0x264
> > > > [ 4443.426828]  kfree_skb+0x58/0xc4
> > > > [ 4443.426832]  skb_queue_purge+0x64/0x9c
> > > > [ 4443.426844]  packet_set_ring+0x5f0/0x820
> > > > [ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
> > > > [ 4443.426853]  __sys_setsockopt+0x188/0x278
> > > > [ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
> > > > [ 4443.426869]  el0_svc_common+0xf0/0x1d0
> > > > [ 4443.426873]  el0_svc_handler+0x74/0x98
> > > > [ 4443.426880]  el0_svc+0x8/0xc
> > > >
> > > > Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.)
> > > > Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
> > > > ---
> > > >  net/core/skbuff.c | 20 +++++++++++++++++++-
> > > >  1 file changed, 19 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > index f62cae3..1dcbda8 100644
> > > > --- a/net/core/skbuff.c
> > > > +++ b/net/core/skbuff.c
> > > > @@ -3655,7 +3655,8 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
> > > >         unsigned int delta_truesize = 0;
> > > >         unsigned int delta_len = 0;
> > > >         struct sk_buff *tail = NULL;
> > > > -       struct sk_buff *nskb;
> > > > +       struct sk_buff *nskb, *tmp;
> > > > +       int err;
> > > >
> > > >         skb_push(skb, -skb_network_offset(skb) + offset);
> > > >
> > > > @@ -3665,11 +3666,28 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
> > > >                 nskb = list_skb;
> > > >                 list_skb = list_skb->next;
> > > >
> > > > +               err = 0;
> > > > +               if (skb_shared(nskb)) {
> > >
> > > I must be missing something still. This does not square with my
> > > understanding that the two sockets are operating on clones, with each
> > > frag_list skb having skb->users == 1.
> > >
> > > Unless the packet socket patch previously also triggered an
> > > skb_unclone/pskb_expand_head, as that call skb_clone_fraglist, which
> > > calls skb_get on each frag_list skb.
> >
> > A cloned skb after tpacket_rcv cannot go through skb_ensure_writable
> > with the original shinfo. pskb_expand_head reallocates the shinfo of
> > the skb and call skb_clone_fraglist. skb_release_data in
> > pskb_expand_head could not reduce skb->users of the each frag_list skb
> > if skb_shinfo(skb)->dataref == 2.
> >
> > After the reallocation, skb_shinfo(skb)->dataref == 1 but each frag_list
> > skb could have skb->users == 2.
> 
> Yes, that makes sense. skb_clone_fraglist just increments the
> frag_list skb's refcounts.
> 
> skb_segment_list must create an unshared struct sk_buff before it
> changes skb data to insert the protocol headers.
> 
> > >
> > >
> > > > +                       tmp = skb_clone(nskb, GFP_ATOMIC);
> > > > +                       if (tmp) {
> > > > +                               kfree_skb(nskb);
> > > > +                               nskb = tmp;
> > > > +                               err = skb_unclone(nskb, GFP_ATOMIC);
> 
> Calling clone and unclone in quick succession looks odd.
> 
> But you need the first to create a private skb and to trigger the
> second to create a private copy of the linear data (as well as frags,
> if any, but these are not touched). So this looks okay.
> 
> > > > +                       } else {
> > > > +                               err = -ENOMEM;
> > > > +                       }
> > > > +               }
> > > > +
> > > >                 if (!tail)
> > > >                         skb->next = nskb;
> > > >                 else
> > > >                         tail->next = nskb;
> > > >
> > > > +               if (unlikely(err)) {
> > > > +                       nskb->next = list_skb;
> 
> To avoid leaking these skbs when calling kfree_skb_list(skb->next). Is
> that concern new with this patch, or also needed for the existing
> error case?

It's new for this patch. nskb can lose next skb due to
tmp = skb_clone(nskb, GFP_ATOMIC); on the prior. I believe it is not
needed for the existing errors.

> 
> > > > +                       goto err_linearize;
> > > > +               }
> > > > +
> > > >                 tail = nskb;
> > > >
> > > >                 delta_len += nskb->len;
> > > > --
> > > > 2.7.4
> > > >
> >
Willem de Bruijn Jan. 6, 2021, 5:13 p.m. UTC | #5
On Tue, Jan 5, 2021 at 10:32 PM Dongseok Yi <dseok.yi@samsung.com> wrote:
>
> On 2021-01-06 12:07, Willem de Bruijn wrote:
> >
> > On Tue, Jan 5, 2021 at 8:29 PM Dongseok Yi <dseok.yi@samsung.com> wrote:
> > >
> > > On 2021-01-05 06:03, Willem de Bruijn wrote:
> > > >
> > > > On Mon, Jan 4, 2021 at 4:00 AM Dongseok Yi <dseok.yi@samsung.com> wrote:
> > > > >
> > > > > skbs in frag_list could be shared by pskb_expand_head() from BPF.
> > > >
> > > > Can you elaborate on the BPF connection?
> > >
> > > With the following registered ptypes,
> > >
> > > /proc/net # cat ptype
> > > Type Device      Function
> > > ALL           tpacket_rcv
> > > 0800          ip_rcv.cfi_jt
> > > 0011          llc_rcv.cfi_jt
> > > 0004          llc_rcv.cfi_jt
> > > 0806          arp_rcv
> > > 86dd          ipv6_rcv.cfi_jt
> > >
> > > BPF checks skb_ensure_writable between tpacket_rcv and ip_rcv
> > > (or ipv6_rcv). And it calls pskb_expand_head.
> > >
> > > [  132.051228] pskb_expand_head+0x360/0x378
> > > [  132.051237] skb_ensure_writable+0xa0/0xc4
> > > [  132.051249] bpf_skb_pull_data+0x28/0x60
> > > [  132.051262] bpf_prog_331d69c77ea5e964_schedcls_ingres+0x5f4/0x1000
> > > [  132.051273] cls_bpf_classify+0x254/0x348
> > > [  132.051284] tcf_classify+0xa4/0x180
> >
> > Ah, you have a BPF program loaded at TC. That was not entirely obvious.
> >
> > This program gets called after packet sockets with ptype_all, before
> > those with a specific protocol.
> >
> > Tcpdump will have inserted a program with ptype_all, which cloned the
> > skb. This triggers skb_ensure_writable -> pskb_expand_head ->
> > skb_clone_fraglist -> skb_get.
> >
> > > [  132.051294] __netif_receive_skb_core+0x590/0xd28
> > > [  132.051303] __netif_receive_skb+0x50/0x17c
> > > [  132.051312] process_backlog+0x15c/0x1b8
> > >
> > > >
> > > > > While tcpdump, sk_receive_queue of PF_PACKET has the original frag_list.
> > > > > But the same frag_list is queued to PF_INET (or PF_INET6) as the fraglist
> > > > > chain made by skb_segment_list().
> > > > >
> > > > > If the new skb (not frag_list) is queued to one of the sk_receive_queue,
> > > > > multiple ptypes can see this. The skb could be released by ptypes and
> > > > > it causes use-after-free.
> > > >
> > > > If I understand correctly, a udp-gro-list skb makes it up the receive
> > > > path with one or more active packet sockets.
> > > >
> > > > The packet socket will call skb_clone after accepting the filter. This
> > > > replaces the head_skb, but shares the skb_shinfo and thus frag_list.
> > > >
> > > > udp_rcv_segment later converts the udp-gro-list skb to a list of
> > > > regular packets to pass these one-by-one to udp_queue_rcv_one_skb.
> > > > Now all the frags are fully fledged packets, with headers pushed
> > > > before the payload. This does not change their refcount anymore than
> > > > the skb_clone in pf_packet did. This should be 1.
> > > >
> > > > Eventually udp_recvmsg will call skb_consume_udp on each packet.
> > > >
> > > > The packet socket eventually also frees its cloned head_skb, which triggers
> > > >
> > > >   kfree_skb_list(shinfo->frag_list)
> > > >     kfree_skb
> > > >       skb_unref
> > > >         refcount_dec_and_test(&skb->users)
> > >
> > > Every your understanding is right, but
> > >
> > > >
> > > > >
> > > > > [ 4443.426215] ------------[ cut here ]------------
> > > > > [ 4443.426222] refcount_t: underflow; use-after-free.
> > > > > [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
> > > > > refcount_dec_and_test_checked+0xa4/0xc8
> > > > > [ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO)
> > > > > [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
> > > > > [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
> > > > > [ 4443.426808] Call trace:
> > > > > [ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
> > > > > [ 4443.426823]  skb_release_data+0x144/0x264
> > > > > [ 4443.426828]  kfree_skb+0x58/0xc4
> > > > > [ 4443.426832]  skb_queue_purge+0x64/0x9c
> > > > > [ 4443.426844]  packet_set_ring+0x5f0/0x820
> > > > > [ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
> > > > > [ 4443.426853]  __sys_setsockopt+0x188/0x278
> > > > > [ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
> > > > > [ 4443.426869]  el0_svc_common+0xf0/0x1d0
> > > > > [ 4443.426873]  el0_svc_handler+0x74/0x98
> > > > > [ 4443.426880]  el0_svc+0x8/0xc
> > > > >
> > > > > Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.)
> > > > > Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
> > > > > ---
> > > > >  net/core/skbuff.c | 20 +++++++++++++++++++-
> > > > >  1 file changed, 19 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > > index f62cae3..1dcbda8 100644
> > > > > --- a/net/core/skbuff.c
> > > > > +++ b/net/core/skbuff.c
> > > > > @@ -3655,7 +3655,8 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
> > > > >         unsigned int delta_truesize = 0;
> > > > >         unsigned int delta_len = 0;
> > > > >         struct sk_buff *tail = NULL;
> > > > > -       struct sk_buff *nskb;
> > > > > +       struct sk_buff *nskb, *tmp;
> > > > > +       int err;
> > > > >
> > > > >         skb_push(skb, -skb_network_offset(skb) + offset);
> > > > >
> > > > > @@ -3665,11 +3666,28 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
> > > > >                 nskb = list_skb;
> > > > >                 list_skb = list_skb->next;
> > > > >
> > > > > +               err = 0;
> > > > > +               if (skb_shared(nskb)) {
> > > >
> > > > I must be missing something still. This does not square with my
> > > > understanding that the two sockets are operating on clones, with each
> > > > frag_list skb having skb->users == 1.
> > > >
> > > > Unless the packet socket patch previously also triggered an
> > > > skb_unclone/pskb_expand_head, as that call skb_clone_fraglist, which
> > > > calls skb_get on each frag_list skb.
> > >
> > > A cloned skb after tpacket_rcv cannot go through skb_ensure_writable
> > > with the original shinfo. pskb_expand_head reallocates the shinfo of
> > > the skb and call skb_clone_fraglist. skb_release_data in
> > > pskb_expand_head could not reduce skb->users of the each frag_list skb
> > > if skb_shinfo(skb)->dataref == 2.
> > >
> > > After the reallocation, skb_shinfo(skb)->dataref == 1 but each frag_list
> > > skb could have skb->users == 2.
> >
> > Yes, that makes sense. skb_clone_fraglist just increments the
> > frag_list skb's refcounts.
> >
> > skb_segment_list must create an unshared struct sk_buff before it
> > changes skb data to insert the protocol headers.
> >
> > > >
> > > >
> > > > > +                       tmp = skb_clone(nskb, GFP_ATOMIC);
> > > > > +                       if (tmp) {
> > > > > +                               kfree_skb(nskb);
> > > > > +                               nskb = tmp;
> > > > > +                               err = skb_unclone(nskb, GFP_ATOMIC);
> >
> > Calling clone and unclone in quick succession looks odd.
> >
> > But you need the first to create a private skb and to trigger the
> > second to create a private copy of the linear data (as well as frags,
> > if any, but these are not touched). So this looks okay.
> >
> > > > > +                       } else {
> > > > > +                               err = -ENOMEM;
> > > > > +                       }
> > > > > +               }
> > > > > +
> > > > >                 if (!tail)
> > > > >                         skb->next = nskb;
> > > > >                 else
> > > > >                         tail->next = nskb;
> > > > >
> > > > > +               if (unlikely(err)) {
> > > > > +                       nskb->next = list_skb;
> >
> > To avoid leaking these skbs when calling kfree_skb_list(skb->next). Is
> > that concern new with this patch, or also needed for the existing
> > error case?
>
> It's new for this patch. nskb can lose next skb due to
> tmp = skb_clone(nskb, GFP_ATOMIC); on the prior. I believe it is not
> needed for the existing errors.

Ah, skb_clone clears the next pointer, indeed. Thanks.

Yes, then this looks correct to me. Thanks for fixing, not an obvious
code path or bug at all.

Acked-by: Willem de Bruijn <willemb@google.com>

The patch is already marked as changes requested in
https://patchwork.kernel.org/project/netdevbpf , so you might have to
resubmit it.

If so, please expand a little bit in the commit message on the fact
that a bpf filter is loaded at TC, which triggers skb_ensure_writable
-> pskb_expand_head -> skb_clone_fraglist -> skb_get on each skb in
the fraglist.
Yunsheng Lin April 17, 2021, 3:44 a.m. UTC | #6
On 2021/1/6 11:32, Dongseok Yi wrote:
> On 2021-01-06 12:07, Willem de Bruijn wrote:
>>
>> On Tue, Jan 5, 2021 at 8:29 PM Dongseok Yi <dseok.yi@samsung.com> wrote:
>>>
>>> On 2021-01-05 06:03, Willem de Bruijn wrote:
>>>>
>>>> On Mon, Jan 4, 2021 at 4:00 AM Dongseok Yi <dseok.yi@samsung.com> wrote:
>>>>>
>>>>> skbs in frag_list could be shared by pskb_expand_head() from BPF.
>>>>
>>>> Can you elaborate on the BPF connection?
>>>
>>> With the following registered ptypes,
>>>
>>> /proc/net # cat ptype
>>> Type Device      Function
>>> ALL           tpacket_rcv
>>> 0800          ip_rcv.cfi_jt
>>> 0011          llc_rcv.cfi_jt
>>> 0004          llc_rcv.cfi_jt
>>> 0806          arp_rcv
>>> 86dd          ipv6_rcv.cfi_jt
>>>
>>> BPF checks skb_ensure_writable between tpacket_rcv and ip_rcv
>>> (or ipv6_rcv). And it calls pskb_expand_head.
>>>
>>> [  132.051228] pskb_expand_head+0x360/0x378
>>> [  132.051237] skb_ensure_writable+0xa0/0xc4
>>> [  132.051249] bpf_skb_pull_data+0x28/0x60
>>> [  132.051262] bpf_prog_331d69c77ea5e964_schedcls_ingres+0x5f4/0x1000
>>> [  132.051273] cls_bpf_classify+0x254/0x348
>>> [  132.051284] tcf_classify+0xa4/0x180
>>
>> Ah, you have a BPF program loaded at TC. That was not entirely obvious.
>>
>> This program gets called after packet sockets with ptype_all, before
>> those with a specific protocol.
>>
>> Tcpdump will have inserted a program with ptype_all, which cloned the
>> skb. This triggers skb_ensure_writable -> pskb_expand_head ->
>> skb_clone_fraglist -> skb_get.
>>
>>> [  132.051294] __netif_receive_skb_core+0x590/0xd28
>>> [  132.051303] __netif_receive_skb+0x50/0x17c
>>> [  132.051312] process_backlog+0x15c/0x1b8
>>>
>>>>
>>>>> While tcpdump, sk_receive_queue of PF_PACKET has the original frag_list.
>>>>> But the same frag_list is queued to PF_INET (or PF_INET6) as the fraglist
>>>>> chain made by skb_segment_list().
>>>>>
>>>>> If the new skb (not frag_list) is queued to one of the sk_receive_queue,
>>>>> multiple ptypes can see this. The skb could be released by ptypes and
>>>>> it causes use-after-free.
>>>>
>>>> If I understand correctly, a udp-gro-list skb makes it up the receive
>>>> path with one or more active packet sockets.
>>>>
>>>> The packet socket will call skb_clone after accepting the filter. This
>>>> replaces the head_skb, but shares the skb_shinfo and thus frag_list.
>>>>
>>>> udp_rcv_segment later converts the udp-gro-list skb to a list of
>>>> regular packets to pass these one-by-one to udp_queue_rcv_one_skb.
>>>> Now all the frags are fully fledged packets, with headers pushed
>>>> before the payload. This does not change their refcount anymore than
>>>> the skb_clone in pf_packet did. This should be 1.
>>>>
>>>> Eventually udp_recvmsg will call skb_consume_udp on each packet.
>>>>
>>>> The packet socket eventually also frees its cloned head_skb, which triggers
>>>>
>>>>   kfree_skb_list(shinfo->frag_list)
>>>>     kfree_skb
>>>>       skb_unref
>>>>         refcount_dec_and_test(&skb->users)
>>>
>>> Every your understanding is right, but
>>>
>>>>
>>>>>
>>>>> [ 4443.426215] ------------[ cut here ]------------
>>>>> [ 4443.426222] refcount_t: underflow; use-after-free.
>>>>> [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
>>>>> refcount_dec_and_test_checked+0xa4/0xc8
>>>>> [ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO)
>>>>> [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
>>>>> [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
>>>>> [ 4443.426808] Call trace:
>>>>> [ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
>>>>> [ 4443.426823]  skb_release_data+0x144/0x264
>>>>> [ 4443.426828]  kfree_skb+0x58/0xc4
>>>>> [ 4443.426832]  skb_queue_purge+0x64/0x9c
>>>>> [ 4443.426844]  packet_set_ring+0x5f0/0x820
>>>>> [ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
>>>>> [ 4443.426853]  __sys_setsockopt+0x188/0x278
>>>>> [ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
>>>>> [ 4443.426869]  el0_svc_common+0xf0/0x1d0
>>>>> [ 4443.426873]  el0_svc_handler+0x74/0x98
>>>>> [ 4443.426880]  el0_svc+0x8/0xc
>>>>>
>>>>> Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.)
>>>>> Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
>>>>> ---
>>>>>  net/core/skbuff.c | 20 +++++++++++++++++++-
>>>>>  1 file changed, 19 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>>> index f62cae3..1dcbda8 100644
>>>>> --- a/net/core/skbuff.c
>>>>> +++ b/net/core/skbuff.c
>>>>> @@ -3655,7 +3655,8 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
>>>>>         unsigned int delta_truesize = 0;
>>>>>         unsigned int delta_len = 0;
>>>>>         struct sk_buff *tail = NULL;
>>>>> -       struct sk_buff *nskb;
>>>>> +       struct sk_buff *nskb, *tmp;
>>>>> +       int err;
>>>>>
>>>>>         skb_push(skb, -skb_network_offset(skb) + offset);
>>>>>
>>>>> @@ -3665,11 +3666,28 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
>>>>>                 nskb = list_skb;
>>>>>                 list_skb = list_skb->next;
>>>>>
>>>>> +               err = 0;
>>>>> +               if (skb_shared(nskb)) {
>>>>
>>>> I must be missing something still. This does not square with my
>>>> understanding that the two sockets are operating on clones, with each
>>>> frag_list skb having skb->users == 1.
>>>>
>>>> Unless the packet socket patch previously also triggered an
>>>> skb_unclone/pskb_expand_head, as that call skb_clone_fraglist, which
>>>> calls skb_get on each frag_list skb.
>>>
>>> A cloned skb after tpacket_rcv cannot go through skb_ensure_writable
>>> with the original shinfo. pskb_expand_head reallocates the shinfo of
>>> the skb and call skb_clone_fraglist. skb_release_data in
>>> pskb_expand_head could not reduce skb->users of the each frag_list skb
>>> if skb_shinfo(skb)->dataref == 2.
>>>
>>> After the reallocation, skb_shinfo(skb)->dataref == 1 but each frag_list
>>> skb could have skb->users == 2.

Hi, Dongseok
   I understand there is liner head data shared between the frag_list skb in the
cloned skb(cloned by pf_packet?) and original skb, which should not be shared
when skb_segment_list() converts the frag_list skb into regular packet.

   But both skb->users of original and cloned skb is one(skb_shinfo(skb)->dataref
is one for both skb too), and skb->users of each fraglist skb is two because both
original and cloned skb is linking to the same fraglist pointer, and there is
"skb_shinfo(skb)->frag_list = NULL" for original skb in the begin of skb_segment_list(),
if kfree_skb() is called with original skb, the fraglist skb will not be freed.
If kfree_skb is called with original skb,cloned skb and each fraglist skb here, the
reference counter for three of them seem right here, so why is there a refcount_t
warning in the commit log? am I missing something obvious here?

Sorry for bringing up this thread again.

>>
>> Yes, that makes sense. skb_clone_fraglist just increments the
>> frag_list skb's refcounts.
>>
>> skb_segment_list must create an unshared struct sk_buff before it
>> changes skb data to insert the protocol headers.
>>
Dongseok Yi April 19, 2021, 12:35 a.m. UTC | #7
On Sat, Apr 17, 2021 at 11:44:35AM +0800, Yunsheng Lin wrote:
> 
> On 2021/1/6 11:32, Dongseok Yi wrote:
> > On 2021-01-06 12:07, Willem de Bruijn wrote:
> >>
> >> On Tue, Jan 5, 2021 at 8:29 PM Dongseok Yi <dseok.yi@samsung.com> wrote:
> >>>
> >>> On 2021-01-05 06:03, Willem de Bruijn wrote:
> >>>>
> >>>> On Mon, Jan 4, 2021 at 4:00 AM Dongseok Yi <dseok.yi@samsung.com> wrote:
> >>>>>
> >>>>> skbs in frag_list could be shared by pskb_expand_head() from BPF.
> >>>>
> >>>> Can you elaborate on the BPF connection?
> >>>
> >>> With the following registered ptypes,
> >>>
> >>> /proc/net # cat ptype
> >>> Type Device      Function
> >>> ALL           tpacket_rcv
> >>> 0800          ip_rcv.cfi_jt
> >>> 0011          llc_rcv.cfi_jt
> >>> 0004          llc_rcv.cfi_jt
> >>> 0806          arp_rcv
> >>> 86dd          ipv6_rcv.cfi_jt
> >>>
> >>> BPF checks skb_ensure_writable between tpacket_rcv and ip_rcv
> >>> (or ipv6_rcv). And it calls pskb_expand_head.
> >>>
> >>> [  132.051228] pskb_expand_head+0x360/0x378
> >>> [  132.051237] skb_ensure_writable+0xa0/0xc4
> >>> [  132.051249] bpf_skb_pull_data+0x28/0x60
> >>> [  132.051262] bpf_prog_331d69c77ea5e964_schedcls_ingres+0x5f4/0x1000
> >>> [  132.051273] cls_bpf_classify+0x254/0x348
> >>> [  132.051284] tcf_classify+0xa4/0x180
> >>
> >> Ah, you have a BPF program loaded at TC. That was not entirely obvious.
> >>
> >> This program gets called after packet sockets with ptype_all, before
> >> those with a specific protocol.
> >>
> >> Tcpdump will have inserted a program with ptype_all, which cloned the
> >> skb. This triggers skb_ensure_writable -> pskb_expand_head ->
> >> skb_clone_fraglist -> skb_get.
> >>
> >>> [  132.051294] __netif_receive_skb_core+0x590/0xd28
> >>> [  132.051303] __netif_receive_skb+0x50/0x17c
> >>> [  132.051312] process_backlog+0x15c/0x1b8
> >>>
> >>>>
> >>>>> While tcpdump, sk_receive_queue of PF_PACKET has the original frag_list.
> >>>>> But the same frag_list is queued to PF_INET (or PF_INET6) as the fraglist
> >>>>> chain made by skb_segment_list().
> >>>>>
> >>>>> If the new skb (not frag_list) is queued to one of the sk_receive_queue,
> >>>>> multiple ptypes can see this. The skb could be released by ptypes and
> >>>>> it causes use-after-free.
> >>>>
> >>>> If I understand correctly, a udp-gro-list skb makes it up the receive
> >>>> path with one or more active packet sockets.
> >>>>
> >>>> The packet socket will call skb_clone after accepting the filter. This
> >>>> replaces the head_skb, but shares the skb_shinfo and thus frag_list.
> >>>>
> >>>> udp_rcv_segment later converts the udp-gro-list skb to a list of
> >>>> regular packets to pass these one-by-one to udp_queue_rcv_one_skb.
> >>>> Now all the frags are fully fledged packets, with headers pushed
> >>>> before the payload. This does not change their refcount anymore than
> >>>> the skb_clone in pf_packet did. This should be 1.
> >>>>
> >>>> Eventually udp_recvmsg will call skb_consume_udp on each packet.
> >>>>
> >>>> The packet socket eventually also frees its cloned head_skb, which triggers
> >>>>
> >>>>   kfree_skb_list(shinfo->frag_list)
> >>>>     kfree_skb
> >>>>       skb_unref
> >>>>         refcount_dec_and_test(&skb->users)
> >>>
> >>> Every your understanding is right, but
> >>>
> >>>>
> >>>>>
> >>>>> [ 4443.426215] ------------[ cut here ]------------
> >>>>> [ 4443.426222] refcount_t: underflow; use-after-free.
> >>>>> [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
> >>>>> refcount_dec_and_test_checked+0xa4/0xc8
> >>>>> [ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO)
> >>>>> [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
> >>>>> [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
> >>>>> [ 4443.426808] Call trace:
> >>>>> [ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
> >>>>> [ 4443.426823]  skb_release_data+0x144/0x264
> >>>>> [ 4443.426828]  kfree_skb+0x58/0xc4
> >>>>> [ 4443.426832]  skb_queue_purge+0x64/0x9c
> >>>>> [ 4443.426844]  packet_set_ring+0x5f0/0x820
> >>>>> [ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
> >>>>> [ 4443.426853]  __sys_setsockopt+0x188/0x278
> >>>>> [ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
> >>>>> [ 4443.426869]  el0_svc_common+0xf0/0x1d0
> >>>>> [ 4443.426873]  el0_svc_handler+0x74/0x98
> >>>>> [ 4443.426880]  el0_svc+0x8/0xc
> >>>>>
> >>>>> Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.)
> >>>>> Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
> >>>>> ---
> >>>>>  net/core/skbuff.c | 20 +++++++++++++++++++-
> >>>>>  1 file changed, 19 insertions(+), 1 deletion(-)
> >>>>>
> >>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >>>>> index f62cae3..1dcbda8 100644
> >>>>> --- a/net/core/skbuff.c
> >>>>> +++ b/net/core/skbuff.c
> >>>>> @@ -3655,7 +3655,8 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
> >>>>>         unsigned int delta_truesize = 0;
> >>>>>         unsigned int delta_len = 0;
> >>>>>         struct sk_buff *tail = NULL;
> >>>>> -       struct sk_buff *nskb;
> >>>>> +       struct sk_buff *nskb, *tmp;
> >>>>> +       int err;
> >>>>>
> >>>>>         skb_push(skb, -skb_network_offset(skb) + offset);
> >>>>>
> >>>>> @@ -3665,11 +3666,28 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
> >>>>>                 nskb = list_skb;
> >>>>>                 list_skb = list_skb->next;
> >>>>>
> >>>>> +               err = 0;
> >>>>> +               if (skb_shared(nskb)) {
> >>>>
> >>>> I must be missing something still. This does not square with my
> >>>> understanding that the two sockets are operating on clones, with each
> >>>> frag_list skb having skb->users == 1.
> >>>>
> >>>> Unless the packet socket patch previously also triggered an
> >>>> skb_unclone/pskb_expand_head, as that call skb_clone_fraglist, which
> >>>> calls skb_get on each frag_list skb.
> >>>
> >>> A cloned skb after tpacket_rcv cannot go through skb_ensure_writable
> >>> with the original shinfo. pskb_expand_head reallocates the shinfo of
> >>> the skb and call skb_clone_fraglist. skb_release_data in
> >>> pskb_expand_head could not reduce skb->users of the each frag_list skb
> >>> if skb_shinfo(skb)->dataref == 2.
> >>>
> >>> After the reallocation, skb_shinfo(skb)->dataref == 1 but each frag_list
> >>> skb could have skb->users == 2.
> 
> Hi, Dongseok
>    I understand there is liner head data shared between the frag_list skb in the
> cloned skb(cloned by pf_packet?) and original skb, which should not be shared
> when skb_segment_list() converts the frag_list skb into regular packet.
> 
>    But both skb->users of original and cloned skb is one(skb_shinfo(skb)->dataref
> is one for both skb too), and skb->users of each fraglist skb is two because both
> original and cloned skb is linking to the same fraglist pointer, and there is
> "skb_shinfo(skb)->frag_list = NULL" for original skb in the begin of skb_segment_list(),
> if kfree_skb() is called with original skb, the fraglist skb will not be freed.
> If kfree_skb is called with original skb,cloned skb and each fraglist skb here, the
> reference counter for three of them seem right here, so why is there a refcount_t
> warning in the commit log? am I missing something obvious here?
> 
> Sorry for bringing up this thread again.

A skb which detects use-after-free was not a part of frag_list. Please
check the commit msg once again.

Both sk_receive_queue of PF_PACKET and PF_INET (or PF_INET6) can have
a link for the same frag_skbs chain. If a new skb (*not frags*) is
queued to one of the sk_receive_queue, multiple ptypes can see and
release this. It causes use-after-free.

> 
> >>
> >> Yes, that makes sense. skb_clone_fraglist just increments the
> >> frag_list skb's refcounts.
> >>
> >> skb_segment_list must create an unshared struct sk_buff before it
> >> changes skb data to insert the protocol headers.
> >>
Yunsheng Lin April 21, 2021, 9:42 a.m. UTC | #8
On 2021/4/19 8:35, Dongseok Yi wrote:
> On Sat, Apr 17, 2021 at 11:44:35AM +0800, Yunsheng Lin wrote:
>>
>> On 2021/1/6 11:32, Dongseok Yi wrote:
>>> On 2021-01-06 12:07, Willem de Bruijn wrote:
>>>>
>>>> On Tue, Jan 5, 2021 at 8:29 PM Dongseok Yi <dseok.yi@samsung.com> wrote:
>>>>>
>>>>> On 2021-01-05 06:03, Willem de Bruijn wrote:
>>>>>>
>>>>>> On Mon, Jan 4, 2021 at 4:00 AM Dongseok Yi <dseok.yi@samsung.com> wrote:
>>>>>>>
>>>>>>> skbs in frag_list could be shared by pskb_expand_head() from BPF.
>>>>>>
>>>>>> Can you elaborate on the BPF connection?
>>>>>
>>>>> With the following registered ptypes,
>>>>>
>>>>> /proc/net # cat ptype
>>>>> Type Device      Function
>>>>> ALL           tpacket_rcv
>>>>> 0800          ip_rcv.cfi_jt
>>>>> 0011          llc_rcv.cfi_jt
>>>>> 0004          llc_rcv.cfi_jt
>>>>> 0806          arp_rcv
>>>>> 86dd          ipv6_rcv.cfi_jt
>>>>>
>>>>> BPF checks skb_ensure_writable between tpacket_rcv and ip_rcv
>>>>> (or ipv6_rcv). And it calls pskb_expand_head.
>>>>>
>>>>> [  132.051228] pskb_expand_head+0x360/0x378
>>>>> [  132.051237] skb_ensure_writable+0xa0/0xc4
>>>>> [  132.051249] bpf_skb_pull_data+0x28/0x60
>>>>> [  132.051262] bpf_prog_331d69c77ea5e964_schedcls_ingres+0x5f4/0x1000
>>>>> [  132.051273] cls_bpf_classify+0x254/0x348
>>>>> [  132.051284] tcf_classify+0xa4/0x180
>>>>
>>>> Ah, you have a BPF program loaded at TC. That was not entirely obvious.
>>>>
>>>> This program gets called after packet sockets with ptype_all, before
>>>> those with a specific protocol.
>>>>
>>>> Tcpdump will have inserted a program with ptype_all, which cloned the
>>>> skb. This triggers skb_ensure_writable -> pskb_expand_head ->
>>>> skb_clone_fraglist -> skb_get.
>>>>
>>>>> [  132.051294] __netif_receive_skb_core+0x590/0xd28
>>>>> [  132.051303] __netif_receive_skb+0x50/0x17c
>>>>> [  132.051312] process_backlog+0x15c/0x1b8
>>>>>
>>>>>>
>>>>>>> While tcpdump, sk_receive_queue of PF_PACKET has the original frag_list.
>>>>>>> But the same frag_list is queued to PF_INET (or PF_INET6) as the fraglist
>>>>>>> chain made by skb_segment_list().
>>>>>>>
>>>>>>> If the new skb (not frag_list) is queued to one of the sk_receive_queue,
>>>>>>> multiple ptypes can see this. The skb could be released by ptypes and
>>>>>>> it causes use-after-free.
>>>>>>
>>>>>> If I understand correctly, a udp-gro-list skb makes it up the receive
>>>>>> path with one or more active packet sockets.
>>>>>>
>>>>>> The packet socket will call skb_clone after accepting the filter. This
>>>>>> replaces the head_skb, but shares the skb_shinfo and thus frag_list.
>>>>>>
>>>>>> udp_rcv_segment later converts the udp-gro-list skb to a list of
>>>>>> regular packets to pass these one-by-one to udp_queue_rcv_one_skb.
>>>>>> Now all the frags are fully fledged packets, with headers pushed
>>>>>> before the payload. This does not change their refcount anymore than
>>>>>> the skb_clone in pf_packet did. This should be 1.
>>>>>>
>>>>>> Eventually udp_recvmsg will call skb_consume_udp on each packet.
>>>>>>
>>>>>> The packet socket eventually also frees its cloned head_skb, which triggers
>>>>>>
>>>>>>   kfree_skb_list(shinfo->frag_list)
>>>>>>     kfree_skb
>>>>>>       skb_unref
>>>>>>         refcount_dec_and_test(&skb->users)
>>>>>
>>>>> Every your understanding is right, but
>>>>>
>>>>>>
>>>>>>>
>>>>>>> [ 4443.426215] ------------[ cut here ]------------
>>>>>>> [ 4443.426222] refcount_t: underflow; use-after-free.
>>>>>>> [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
>>>>>>> refcount_dec_and_test_checked+0xa4/0xc8
>>>>>>> [ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO)
>>>>>>> [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
>>>>>>> [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
>>>>>>> [ 4443.426808] Call trace:
>>>>>>> [ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
>>>>>>> [ 4443.426823]  skb_release_data+0x144/0x264
>>>>>>> [ 4443.426828]  kfree_skb+0x58/0xc4
>>>>>>> [ 4443.426832]  skb_queue_purge+0x64/0x9c
>>>>>>> [ 4443.426844]  packet_set_ring+0x5f0/0x820
>>>>>>> [ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
>>>>>>> [ 4443.426853]  __sys_setsockopt+0x188/0x278
>>>>>>> [ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
>>>>>>> [ 4443.426869]  el0_svc_common+0xf0/0x1d0
>>>>>>> [ 4443.426873]  el0_svc_handler+0x74/0x98
>>>>>>> [ 4443.426880]  el0_svc+0x8/0xc
>>>>>>>
>>>>>>> Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.)
>>>>>>> Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
>>>>>>> ---
>>>>>>>  net/core/skbuff.c | 20 +++++++++++++++++++-
>>>>>>>  1 file changed, 19 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>>>>> index f62cae3..1dcbda8 100644
>>>>>>> --- a/net/core/skbuff.c
>>>>>>> +++ b/net/core/skbuff.c
>>>>>>> @@ -3655,7 +3655,8 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
>>>>>>>         unsigned int delta_truesize = 0;
>>>>>>>         unsigned int delta_len = 0;
>>>>>>>         struct sk_buff *tail = NULL;
>>>>>>> -       struct sk_buff *nskb;
>>>>>>> +       struct sk_buff *nskb, *tmp;
>>>>>>> +       int err;
>>>>>>>
>>>>>>>         skb_push(skb, -skb_network_offset(skb) + offset);
>>>>>>>
>>>>>>> @@ -3665,11 +3666,28 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
>>>>>>>                 nskb = list_skb;
>>>>>>>                 list_skb = list_skb->next;
>>>>>>>
>>>>>>> +               err = 0;
>>>>>>> +               if (skb_shared(nskb)) {
>>>>>>
>>>>>> I must be missing something still. This does not square with my
>>>>>> understanding that the two sockets are operating on clones, with each
>>>>>> frag_list skb having skb->users == 1.
>>>>>>
>>>>>> Unless the packet socket patch previously also triggered an
>>>>>> skb_unclone/pskb_expand_head, as that call skb_clone_fraglist, which
>>>>>> calls skb_get on each frag_list skb.
>>>>>
>>>>> A cloned skb after tpacket_rcv cannot go through skb_ensure_writable
>>>>> with the original shinfo. pskb_expand_head reallocates the shinfo of
>>>>> the skb and call skb_clone_fraglist. skb_release_data in
>>>>> pskb_expand_head could not reduce skb->users of the each frag_list skb
>>>>> if skb_shinfo(skb)->dataref == 2.
>>>>>
>>>>> After the reallocation, skb_shinfo(skb)->dataref == 1 but each frag_list
>>>>> skb could have skb->users == 2.
>>
>> Hi, Dongseok
>>    I understand there is liner head data shared between the frag_list skb in the
>> cloned skb(cloned by pf_packet?) and original skb, which should not be shared
>> when skb_segment_list() converts the frag_list skb into regular packet.
>>
>>    But both skb->users of original and cloned skb is one(skb_shinfo(skb)->dataref
>> is one for both skb too), and skb->users of each fraglist skb is two because both
>> original and cloned skb is linking to the same fraglist pointer, and there is
>> "skb_shinfo(skb)->frag_list = NULL" for original skb in the begin of skb_segment_list(),
>> if kfree_skb() is called with original skb, the fraglist skb will not be freed.
>> If kfree_skb is called with original skb,cloned skb and each fraglist skb here, the
>> reference counter for three of them seem right here, so why is there a refcount_t
>> warning in the commit log? am I missing something obvious here?
>>
>> Sorry for bringing up this thread again.
> 
> A skb which detects use-after-free was not a part of frag_list. Please
> check the commit msg once again.

I checked the commit msg again, but still have not figured it out yet:)

So I tried to see if I understand the skb'reference counting correctly:

skb->user is used to reference counting the "struct sk_buff", and
skb_shinfo(skb)->dataref is used to reference counting head data.

skb_clone(): allocate a sperate "struct sk_buff" but share the head data
             with the original skb, so skb_shinfo()->dataref need
             incrmenting.

pskb_expand_head(): allocate a sperate head data(which includes the space
                    for skb_shinfo(skb)), since the original head data
		    and the new head data' skb_shinfo()->frag_list both
                    point to the same fraglist skb, so each fraglist_skb's
		    skb->users need incrmenting, and original head data's
		    skb_shinfo() need decrmenting.


So after pf_packet called skb_clone() and pskb_expand_head(), we have:

    old skb              new skb
      |                     |
      |                     |
old head data         new head data
        \                   /
          \                /
           \              /
             \           /
              \         /
             fraglist_skb1 -> fraglist_skb2 -> fraglist_skb3 .....

So both old and new skb' skb->user is one, both old and new head data's
skb_shinfo()->dataref is one, and both old and new head data'
skb_shinfo()->frag_list points to fraglist_skb1, and each fraglist_skb's
skb->user is two.

Each fraglist_skb points to a head data, and its skb_shinfo()->dataref
is one too.

Suppose old skb is called with skb_segment_list(), without this patch,
we have:

                         new skb
                            |
                            |
                     new head data
                            /
                           /
                          /
                         /
                        /
       old skb -> fraglist_skb1 -> fraglist_skb2 -> fraglist_skb3 .....
          |
          |
    old head data

And old skb and each fraglist_skb become a regular packet, so freeing
the old skb, new skb and each fraglist_skb here do not seems to have
any reference counting problem, because each fraglist_skb's skb->user
is two, right?

> 
> Both sk_receive_queue of PF_PACKET and PF_INET (or PF_INET6) can have
> a link for the same frag_skbs chain. 

Does "frag_skbs chain" means fraglist_skb1? It seems only new head data's
skb_shinfo()->frag_list points to fraglist_skb1


If a new skb (*not frags*) is
> queued to one of the sk_receive_queue, multiple ptypes can see and
> release this. It causes use-after-free.

Does "a new skb" mean each fraglist_skb after skb_segment_list()? Or other
new incoming skb?

I am not so familiar with the PF_PACKET and PF_INET, so still have hard
time figuring how the reference counting goes wrong here:)

> 
>>
>>>>
>>>> Yes, that makes sense. skb_clone_fraglist just increments the
>>>> frag_list skb's refcounts.
>>>>
>>>> skb_segment_list must create an unshared struct sk_buff before it
>>>> changes skb data to insert the protocol headers.
>>>>
> 
> 
> 
> .
>
Dongseok Yi April 21, 2021, 11:04 a.m. UTC | #9
On Wed, Apr 21, 2021 at 05:42:12PM +0800, Yunsheng Lin wrote:
> On 2021/4/19 8:35, Dongseok Yi wrote:
> > On Sat, Apr 17, 2021 at 11:44:35AM +0800, Yunsheng Lin wrote:
> >>
> >> On 2021/1/6 11:32, Dongseok Yi wrote:
> >>> On 2021-01-06 12:07, Willem de Bruijn wrote:
> >>>>
> >>>> On Tue, Jan 5, 2021 at 8:29 PM Dongseok Yi <dseok.yi@samsung.com> wrote:
> >>>>>
> >>>>> On 2021-01-05 06:03, Willem de Bruijn wrote:
> >>>>>>
> >>>>>> On Mon, Jan 4, 2021 at 4:00 AM Dongseok Yi <dseok.yi@samsung.com> wrote:
> >>>>>>>
> >>>>>>> skbs in frag_list could be shared by pskb_expand_head() from BPF.
> >>>>>>
> >>>>>> Can you elaborate on the BPF connection?
> >>>>>
> >>>>> With the following registered ptypes,
> >>>>>
> >>>>> /proc/net # cat ptype
> >>>>> Type Device      Function
> >>>>> ALL           tpacket_rcv
> >>>>> 0800          ip_rcv.cfi_jt
> >>>>> 0011          llc_rcv.cfi_jt
> >>>>> 0004          llc_rcv.cfi_jt
> >>>>> 0806          arp_rcv
> >>>>> 86dd          ipv6_rcv.cfi_jt
> >>>>>
> >>>>> BPF checks skb_ensure_writable between tpacket_rcv and ip_rcv
> >>>>> (or ipv6_rcv). And it calls pskb_expand_head.
> >>>>>
> >>>>> [  132.051228] pskb_expand_head+0x360/0x378
> >>>>> [  132.051237] skb_ensure_writable+0xa0/0xc4
> >>>>> [  132.051249] bpf_skb_pull_data+0x28/0x60
> >>>>> [  132.051262] bpf_prog_331d69c77ea5e964_schedcls_ingres+0x5f4/0x1000
> >>>>> [  132.051273] cls_bpf_classify+0x254/0x348
> >>>>> [  132.051284] tcf_classify+0xa4/0x180
> >>>>
> >>>> Ah, you have a BPF program loaded at TC. That was not entirely obvious.
> >>>>
> >>>> This program gets called after packet sockets with ptype_all, before
> >>>> those with a specific protocol.
> >>>>
> >>>> Tcpdump will have inserted a program with ptype_all, which cloned the
> >>>> skb. This triggers skb_ensure_writable -> pskb_expand_head ->
> >>>> skb_clone_fraglist -> skb_get.
> >>>>
> >>>>> [  132.051294] __netif_receive_skb_core+0x590/0xd28
> >>>>> [  132.051303] __netif_receive_skb+0x50/0x17c
> >>>>> [  132.051312] process_backlog+0x15c/0x1b8
> >>>>>
> >>>>>>
> >>>>>>> While tcpdump, sk_receive_queue of PF_PACKET has the original frag_list.
> >>>>>>> But the same frag_list is queued to PF_INET (or PF_INET6) as the fraglist
> >>>>>>> chain made by skb_segment_list().
> >>>>>>>
> >>>>>>> If the new skb (not frag_list) is queued to one of the sk_receive_queue,
> >>>>>>> multiple ptypes can see this. The skb could be released by ptypes and
> >>>>>>> it causes use-after-free.
> >>>>>>
> >>>>>> If I understand correctly, a udp-gro-list skb makes it up the receive
> >>>>>> path with one or more active packet sockets.
> >>>>>>
> >>>>>> The packet socket will call skb_clone after accepting the filter. This
> >>>>>> replaces the head_skb, but shares the skb_shinfo and thus frag_list.
> >>>>>>
> >>>>>> udp_rcv_segment later converts the udp-gro-list skb to a list of
> >>>>>> regular packets to pass these one-by-one to udp_queue_rcv_one_skb.
> >>>>>> Now all the frags are fully fledged packets, with headers pushed
> >>>>>> before the payload. This does not change their refcount anymore than
> >>>>>> the skb_clone in pf_packet did. This should be 1.
> >>>>>>
> >>>>>> Eventually udp_recvmsg will call skb_consume_udp on each packet.
> >>>>>>
> >>>>>> The packet socket eventually also frees its cloned head_skb, which triggers
> >>>>>>
> >>>>>>   kfree_skb_list(shinfo->frag_list)
> >>>>>>     kfree_skb
> >>>>>>       skb_unref
> >>>>>>         refcount_dec_and_test(&skb->users)
> >>>>>
> >>>>> Every your understanding is right, but
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> [ 4443.426215] ------------[ cut here ]------------
> >>>>>>> [ 4443.426222] refcount_t: underflow; use-after-free.
> >>>>>>> [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
> >>>>>>> refcount_dec_and_test_checked+0xa4/0xc8
> >>>>>>> [ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO)
> >>>>>>> [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
> >>>>>>> [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
> >>>>>>> [ 4443.426808] Call trace:
> >>>>>>> [ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
> >>>>>>> [ 4443.426823]  skb_release_data+0x144/0x264
> >>>>>>> [ 4443.426828]  kfree_skb+0x58/0xc4
> >>>>>>> [ 4443.426832]  skb_queue_purge+0x64/0x9c
> >>>>>>> [ 4443.426844]  packet_set_ring+0x5f0/0x820
> >>>>>>> [ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
> >>>>>>> [ 4443.426853]  __sys_setsockopt+0x188/0x278
> >>>>>>> [ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
> >>>>>>> [ 4443.426869]  el0_svc_common+0xf0/0x1d0
> >>>>>>> [ 4443.426873]  el0_svc_handler+0x74/0x98
> >>>>>>> [ 4443.426880]  el0_svc+0x8/0xc
> >>>>>>>
> >>>>>>> Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.)
> >>>>>>> Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
> >>>>>>> ---
> >>>>>>>  net/core/skbuff.c | 20 +++++++++++++++++++-
> >>>>>>>  1 file changed, 19 insertions(+), 1 deletion(-)
> >>>>>>>
> >>>>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >>>>>>> index f62cae3..1dcbda8 100644
> >>>>>>> --- a/net/core/skbuff.c
> >>>>>>> +++ b/net/core/skbuff.c
> >>>>>>> @@ -3655,7 +3655,8 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
> >>>>>>>         unsigned int delta_truesize = 0;
> >>>>>>>         unsigned int delta_len = 0;
> >>>>>>>         struct sk_buff *tail = NULL;
> >>>>>>> -       struct sk_buff *nskb;
> >>>>>>> +       struct sk_buff *nskb, *tmp;
> >>>>>>> +       int err;
> >>>>>>>
> >>>>>>>         skb_push(skb, -skb_network_offset(skb) + offset);
> >>>>>>>
> >>>>>>> @@ -3665,11 +3666,28 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
> >>>>>>>                 nskb = list_skb;
> >>>>>>>                 list_skb = list_skb->next;
> >>>>>>>
> >>>>>>> +               err = 0;
> >>>>>>> +               if (skb_shared(nskb)) {
> >>>>>>
> >>>>>> I must be missing something still. This does not square with my
> >>>>>> understanding that the two sockets are operating on clones, with each
> >>>>>> frag_list skb having skb->users == 1.
> >>>>>>
> >>>>>> Unless the packet socket patch previously also triggered an
> >>>>>> skb_unclone/pskb_expand_head, as that call skb_clone_fraglist, which
> >>>>>> calls skb_get on each frag_list skb.
> >>>>>
> >>>>> A cloned skb after tpacket_rcv cannot go through skb_ensure_writable
> >>>>> with the original shinfo. pskb_expand_head reallocates the shinfo of
> >>>>> the skb and call skb_clone_fraglist. skb_release_data in
> >>>>> pskb_expand_head could not reduce skb->users of the each frag_list skb
> >>>>> if skb_shinfo(skb)->dataref == 2.
> >>>>>
> >>>>> After the reallocation, skb_shinfo(skb)->dataref == 1 but each frag_list
> >>>>> skb could have skb->users == 2.
> >>
> >> Hi, Dongseok
> >>    I understand there is liner head data shared between the frag_list skb in the
> >> cloned skb(cloned by pf_packet?) and original skb, which should not be shared
> >> when skb_segment_list() converts the frag_list skb into regular packet.
> >>
> >>    But both skb->users of original and cloned skb is one(skb_shinfo(skb)->dataref
> >> is one for both skb too), and skb->users of each fraglist skb is two because both
> >> original and cloned skb is linking to the same fraglist pointer, and there is
> >> "skb_shinfo(skb)->frag_list = NULL" for original skb in the begin of skb_segment_list(),
> >> if kfree_skb() is called with original skb, the fraglist skb will not be freed.
> >> If kfree_skb is called with original skb,cloned skb and each fraglist skb here, the
> >> reference counter for three of them seem right here, so why is there a refcount_t
> >> warning in the commit log? am I missing something obvious here?
> >>
> >> Sorry for bringing up this thread again.
> >
> > A skb which detects use-after-free was not a part of frag_list. Please
> > check the commit msg once again.
> 
> I checked the commit msg again, but still have not figured it out yet:)
> 
> So I tried to see if I understand the skb'reference counting correctly:
> 
> skb->user is used to reference counting the "struct sk_buff", and
> skb_shinfo(skb)->dataref is used to reference counting head data.
> 
> skb_clone(): allocate a sperate "struct sk_buff" but share the head data
>              with the original skb, so skb_shinfo()->dataref need
>              incrmenting.
> 
> pskb_expand_head(): allocate a sperate head data(which includes the space
>                     for skb_shinfo(skb)), since the original head data
> 		    and the new head data' skb_shinfo()->frag_list both
>                     point to the same fraglist skb, so each fraglist_skb's
> 		    skb->users need incrmenting, and original head data's
> 		    skb_shinfo() need decrmenting.
> 
> 
> So after pf_packet called skb_clone() and pskb_expand_head(), we have:
> 
>     old skb              new skb
>       |                     |
>       |                     |
> old head data         new head data
>         \                   /
>           \                /
>            \              /
>              \           /
>               \         /
>              fraglist_skb1 -> fraglist_skb2 -> fraglist_skb3 .....
> 
> So both old and new skb' skb->user is one, both old and new head data's
> skb_shinfo()->dataref is one, and both old and new head data'
> skb_shinfo()->frag_list points to fraglist_skb1, and each fraglist_skb's
> skb->user is two.
> 
> Each fraglist_skb points to a head data, and its skb_shinfo()->dataref
> is one too.
> 
> Suppose old skb is called with skb_segment_list(), without this patch,
> we have:
> 
>                          new skb
>                             |
>                             |
>                      new head data
>                             /
>                            /
>                           /
>                          /
>                         /
>        old skb -> fraglist_skb1 -> fraglist_skb2 -> fraglist_skb3 .....
>           |
>           |
>     old head data
> 
> And old skb and each fraglist_skb become a regular packet, so freeing
> the old skb, new skb and each fraglist_skb here do not seems to have
> any reference counting problem, because each fraglist_skb's skb->user
> is two, right?
> 
> >
> > Both sk_receive_queue of PF_PACKET and PF_INET (or PF_INET6) can have
> > a link for the same frag_skbs chain.
> 
> Does "frag_skbs chain" means fraglist_skb1? It seems only new head data's
> skb_shinfo()->frag_list points to fraglist_skb1

Yes, right.

> 
> 
> If a new skb (*not frags*) is
> > queued to one of the sk_receive_queue, multiple ptypes can see and
> > release this. It causes use-after-free.
> 
> Does "a new skb" mean each fraglist_skb after skb_segment_list()? Or other
> new incoming skb?

I mean a new incoming skb.

> 
> I am not so familiar with the PF_PACKET and PF_INET, so still have hard
> time figuring how the reference counting goes wrong here:)

Let's assume a new incoming skb that is added to the next of the last
fraglist_skb. The new incoming skb->user is *one*.

                         new skb
                            |
                            |
                     new head data
                            /
                           /
                          /
                         /
                        /
       old skb -> fraglist_skb1 -> fraglist_skb2 -> ... -> new incoming skb
          |
          |
    old head data

Let's skb_queue_purge from old skb. kfree_skb from old skb will free
2 skbs (marked as xxx1 and xxx2). What happened if kfree_skb(new skb)?

                         new skb
                            |
                            |
                     new head data
                            /
                           /
                          /
                         /
                        /
          xxx1 -> fraglist_skb1 -> fraglist_skb2 -> ... -> xxx2

It will try to free xxx2.

> 
> >
> >>
> >>>>
> >>>> Yes, that makes sense. skb_clone_fraglist just increments the
> >>>> frag_list skb's refcounts.
> >>>>
> >>>> skb_segment_list must create an unshared struct sk_buff before it
> >>>> changes skb data to insert the protocol headers.
> >>>>
> >
> >
> >
> > .
> >
diff mbox series

Patch

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f62cae3..1dcbda8 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3655,7 +3655,8 @@  struct sk_buff *skb_segment_list(struct sk_buff *skb,
 	unsigned int delta_truesize = 0;
 	unsigned int delta_len = 0;
 	struct sk_buff *tail = NULL;
-	struct sk_buff *nskb;
+	struct sk_buff *nskb, *tmp;
+	int err;
 
 	skb_push(skb, -skb_network_offset(skb) + offset);
 
@@ -3665,11 +3666,28 @@  struct sk_buff *skb_segment_list(struct sk_buff *skb,
 		nskb = list_skb;
 		list_skb = list_skb->next;
 
+		err = 0;
+		if (skb_shared(nskb)) {
+			tmp = skb_clone(nskb, GFP_ATOMIC);
+			if (tmp) {
+				kfree_skb(nskb);
+				nskb = tmp;
+				err = skb_unclone(nskb, GFP_ATOMIC);
+			} else {
+				err = -ENOMEM;
+			}
+		}
+
 		if (!tail)
 			skb->next = nskb;
 		else
 			tail->next = nskb;
 
+		if (unlikely(err)) {
+			nskb->next = list_skb;
+			goto err_linearize;
+		}
+
 		tail = nskb;
 
 		delta_len += nskb->len;