
[net,v2] net: fix use-after-free when UDP GRO with shared fraglist

Message ID 1609979953-181868-1-git-send-email-dseok.yi@samsung.com (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Series [net,v2] net: fix use-after-free when UDP GRO with shared fraglist

Checks

Context Check Description
netdev/cover_letter success
netdev/fixes_present success
netdev/patch_count success
netdev/tree_selection success Clearly marked for net
netdev/subject_prefix success
netdev/cc_maintainers warning 1 maintainers not CCed: vladimir.oltean@nxp.com
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 1 this patch: 1
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 37 lines checked
netdev/build_allmodconfig_warn success Errors and warnings before: 1 this patch: 1
netdev/header_inline success
netdev/stable success Stable not CCed

Commit Message

Dongseok Yi Jan. 7, 2021, 12:39 a.m. UTC
skbs in a fraglist can be shared by a BPF filter loaded at TC. The filter
triggers skb_ensure_writable -> pskb_expand_head ->
skb_clone_fraglist -> skb_get on each skb in the fraglist.

While tcpdump is running, the sk_receive_queue of PF_PACKET holds the
original fraglist. But the same fraglist is also queued to PF_INET (or
PF_INET6) as the fraglist chain made by skb_segment_list.

If a new skb (not in the fraglist) is queued to one of the
sk_receive_queues, multiple ptypes can see it. The skb could be released
by either ptype, which causes a use-after-free.
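The refcount mechanics behind the underflow below can be modeled with a small userspace sketch (an illustration only, not kernel code; `fake_skb`, `fake_skb_get` and `fake_kfree_skb` are made-up stand-ins for skb->users, skb_get and kfree_skb):

```c
#include <stddef.h>

/* Userspace stand-in for struct sk_buff's refcount; not kernel code. */
struct fake_skb {
	int users;			/* models refcount_t skb->users */
};

/* models skb_get(), which skb_clone_fraglist() calls on each frag skb */
static inline void fake_skb_get(struct fake_skb *skb)
{
	skb->users++;
}

/* models the release path; returns -1 on the underflow the WARN reports */
static inline int fake_kfree_skb(struct fake_skb *skb)
{
	if (skb->users <= 0)
		return -1;	/* refcount_t: underflow; use-after-free */
	skb->users--;
	return 0;
}
```

With a single reference, a release from both the PF_INET and the PF_PACKET queue underflows; one reference per queue (what skb_get provides) keeps the count balanced.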

[ 4443.426215] ------------[ cut here ]------------
[ 4443.426222] refcount_t: underflow; use-after-free.
[ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
[ 4443.426808] Call trace:
[ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
[ 4443.426823]  skb_release_data+0x144/0x264
[ 4443.426828]  kfree_skb+0x58/0xc4
[ 4443.426832]  skb_queue_purge+0x64/0x9c
[ 4443.426844]  packet_set_ring+0x5f0/0x820
[ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
[ 4443.426853]  __sys_setsockopt+0x188/0x278
[ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
[ 4443.426869]  el0_svc_common+0xf0/0x1d0
[ 4443.426873]  el0_svc_handler+0x74/0x98
[ 4443.426880]  el0_svc+0x8/0xc

Fixes: 3a1296a38d0c ("net: Support GRO/GSO fraglist chaining.")
Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
Acked-by: Willem de Bruijn <willemb@google.com>
---
 net/core/skbuff.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

v2: Expand the commit message to clarify the role of a BPF filter loaded at TC

Comments

Daniel Borkmann Jan. 7, 2021, 11:05 a.m. UTC | #1
On 1/7/21 1:39 AM, Dongseok Yi wrote:
> skbs in fraglist could be shared by a BPF filter loaded at TC. It
> triggers skb_ensure_writable -> pskb_expand_head ->
> skb_clone_fraglist -> skb_get on each skb in the fraglist.
> 
> While tcpdump, sk_receive_queue of PF_PACKET has the original fraglist.
> But the same fraglist is queued to PF_INET (or PF_INET6) as the fraglist
> chain made by skb_segment_list.
> 
> If the new skb (not fraglist) is queued to one of the sk_receive_queue,
> multiple ptypes can see this. The skb could be released by ptypes and
> it causes use-after-free.
> 
> [ 4443.426215] ------------[ cut here ]------------
> [ 4443.426222] refcount_t: underflow; use-after-free.
> [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
> refcount_dec_and_test_checked+0xa4/0xc8
> [ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO)
> [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
> [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
> [ 4443.426808] Call trace:
> [ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
> [ 4443.426823]  skb_release_data+0x144/0x264
> [ 4443.426828]  kfree_skb+0x58/0xc4
> [ 4443.426832]  skb_queue_purge+0x64/0x9c
> [ 4443.426844]  packet_set_ring+0x5f0/0x820
> [ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
> [ 4443.426853]  __sys_setsockopt+0x188/0x278
> [ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
> [ 4443.426869]  el0_svc_common+0xf0/0x1d0
> [ 4443.426873]  el0_svc_handler+0x74/0x98
> [ 4443.426880]  el0_svc+0x8/0xc
> 
> Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.)
> Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
> Acked-by: Willem de Bruijn <willemb@google.com>
> ---
>   net/core/skbuff.c | 20 +++++++++++++++++++-
>   1 file changed, 19 insertions(+), 1 deletion(-)
> 
> v2: Expand the commit message to clarify a BPF filter loaded
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f62cae3..1dcbda8 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3655,7 +3655,8 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
>   	unsigned int delta_truesize = 0;
>   	unsigned int delta_len = 0;
>   	struct sk_buff *tail = NULL;
> -	struct sk_buff *nskb;
> +	struct sk_buff *nskb, *tmp;
> +	int err;
>   
>   	skb_push(skb, -skb_network_offset(skb) + offset);
>   
> @@ -3665,11 +3666,28 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
>   		nskb = list_skb;
>   		list_skb = list_skb->next;
>   
> +		err = 0;
> +		if (skb_shared(nskb)) {
> +			tmp = skb_clone(nskb, GFP_ATOMIC);
> +			if (tmp) {
> +				kfree_skb(nskb);

Should use consume_skb() to not trigger skb:kfree_skb tracepoint when looking
for drops in the stack.

> +				nskb = tmp;
> +				err = skb_unclone(nskb, GFP_ATOMIC);

Could you elaborate why you also need to unclone? This looks odd here. tc layer
(independent of BPF) from ingress & egress side generally assumes unshared skb,
so above clone + dropping ref of nskb looks okay to make the main skb struct private
for mangling attributes (e.g. mark) & should suffice. What is the exact purpose of
the additional skb_unclone() in this context?

> +			} else {
> +				err = -ENOMEM;
> +			}
> +		}
> +
>   		if (!tail)
>   			skb->next = nskb;
>   		else
>   			tail->next = nskb;
>   
> +		if (unlikely(err)) {
> +			nskb->next = list_skb;
> +			goto err_linearize;
> +		}
> +
>   		tail = nskb;
>   
>   		delta_len += nskb->len;
>
Dongseok Yi Jan. 7, 2021, 11:40 a.m. UTC | #2
On 2021-01-07 20:05, Daniel Borkmann wrote:
> 
> On 1/7/21 1:39 AM, Dongseok Yi wrote:
> > skbs in fraglist could be shared by a BPF filter loaded at TC. It
> > triggers skb_ensure_writable -> pskb_expand_head ->
> > skb_clone_fraglist -> skb_get on each skb in the fraglist.
> >
> > While tcpdump, sk_receive_queue of PF_PACKET has the original fraglist.
> > But the same fraglist is queued to PF_INET (or PF_INET6) as the fraglist
> > chain made by skb_segment_list.
> >
> > If the new skb (not fraglist) is queued to one of the sk_receive_queue,
> > multiple ptypes can see this. The skb could be released by ptypes and
> > it causes use-after-free.
> >
> > [ 4443.426215] ------------[ cut here ]------------
> > [ 4443.426222] refcount_t: underflow; use-after-free.
> > [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
> > refcount_dec_and_test_checked+0xa4/0xc8
> > [ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO)
> > [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
> > [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
> > [ 4443.426808] Call trace:
> > [ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
> > [ 4443.426823]  skb_release_data+0x144/0x264
> > [ 4443.426828]  kfree_skb+0x58/0xc4
> > [ 4443.426832]  skb_queue_purge+0x64/0x9c
> > [ 4443.426844]  packet_set_ring+0x5f0/0x820
> > [ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
> > [ 4443.426853]  __sys_setsockopt+0x188/0x278
> > [ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
> > [ 4443.426869]  el0_svc_common+0xf0/0x1d0
> > [ 4443.426873]  el0_svc_handler+0x74/0x98
> > [ 4443.426880]  el0_svc+0x8/0xc
> >
> > Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.)
> > Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
> > Acked-by: Willem de Bruijn <willemb@google.com>
> > ---
> >   net/core/skbuff.c | 20 +++++++++++++++++++-
> >   1 file changed, 19 insertions(+), 1 deletion(-)
> >
> > v2: Expand the commit message to clarify a BPF filter loaded
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index f62cae3..1dcbda8 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -3655,7 +3655,8 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
> >   	unsigned int delta_truesize = 0;
> >   	unsigned int delta_len = 0;
> >   	struct sk_buff *tail = NULL;
> > -	struct sk_buff *nskb;
> > +	struct sk_buff *nskb, *tmp;
> > +	int err;
> >
> >   	skb_push(skb, -skb_network_offset(skb) + offset);
> >
> > @@ -3665,11 +3666,28 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
> >   		nskb = list_skb;
> >   		list_skb = list_skb->next;
> >
> > +		err = 0;
> > +		if (skb_shared(nskb)) {
> > +			tmp = skb_clone(nskb, GFP_ATOMIC);
> > +			if (tmp) {
> > +				kfree_skb(nskb);
> 
> Should use consume_skb() to not trigger skb:kfree_skb tracepoint when looking
> for drops in the stack.

I will use consume_skb() in the next version.

> 
> > +				nskb = tmp;
> > +				err = skb_unclone(nskb, GFP_ATOMIC);
> 
> Could you elaborate why you also need to unclone? This looks odd here. tc layer
> (independent of BPF) from ingress & egress side generally assumes unshared skb,
> so above clone + dropping ref of nskb looks okay to make the main skb struct private
> for mangling attributes (e.g. mark) & should suffice. What is the exact purpose of
> the additional skb_unclone() in this context?

Willem de Bruijn said:
udp_rcv_segment later converts the udp-gro-list skb to a list of
regular packets to pass these one-by-one to udp_queue_rcv_one_skb.
Now all the frags are fully fledged packets, with headers pushed
before the payload.

PF_PACKET handles untouched fraglist. To modify the payload only
for udp_rcv_segment, skb_unclone is necessary.

> 
> > +			} else {
> > +				err = -ENOMEM;
> > +			}
> > +		}
> > +
> >   		if (!tail)
> >   			skb->next = nskb;
> >   		else
> >   			tail->next = nskb;
> >
> > +		if (unlikely(err)) {
> > +			nskb->next = list_skb;
> > +			goto err_linearize;
> > +		}
> > +
> >   		tail = nskb;
> >
> >   		delta_len += nskb->len;
> >
Daniel Borkmann Jan. 7, 2021, 12:49 p.m. UTC | #3
On 1/7/21 12:40 PM, Dongseok Yi wrote:
> On 2021-01-07 20:05, Daniel Borkmann wrote:
>> On 1/7/21 1:39 AM, Dongseok Yi wrote:
>>> skbs in fraglist could be shared by a BPF filter loaded at TC. It
>>> triggers skb_ensure_writable -> pskb_expand_head ->
>>> skb_clone_fraglist -> skb_get on each skb in the fraglist.
>>>
>>> While tcpdump, sk_receive_queue of PF_PACKET has the original fraglist.
>>> But the same fraglist is queued to PF_INET (or PF_INET6) as the fraglist
>>> chain made by skb_segment_list.
>>>
>>> If the new skb (not fraglist) is queued to one of the sk_receive_queue,
>>> multiple ptypes can see this. The skb could be released by ptypes and
>>> it causes use-after-free.
>>>
>>> [ 4443.426215] ------------[ cut here ]------------
>>> [ 4443.426222] refcount_t: underflow; use-after-free.
>>> [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
>>> refcount_dec_and_test_checked+0xa4/0xc8
>>> [ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO)
>>> [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
>>> [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
>>> [ 4443.426808] Call trace:
>>> [ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
>>> [ 4443.426823]  skb_release_data+0x144/0x264
>>> [ 4443.426828]  kfree_skb+0x58/0xc4
>>> [ 4443.426832]  skb_queue_purge+0x64/0x9c
>>> [ 4443.426844]  packet_set_ring+0x5f0/0x820
>>> [ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
>>> [ 4443.426853]  __sys_setsockopt+0x188/0x278
>>> [ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
>>> [ 4443.426869]  el0_svc_common+0xf0/0x1d0
>>> [ 4443.426873]  el0_svc_handler+0x74/0x98
>>> [ 4443.426880]  el0_svc+0x8/0xc
>>>
>>> Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.)
>>> Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
>>> Acked-by: Willem de Bruijn <willemb@google.com>
>>> ---
>>>    net/core/skbuff.c | 20 +++++++++++++++++++-
>>>    1 file changed, 19 insertions(+), 1 deletion(-)
>>>
>>> v2: Expand the commit message to clarify a BPF filter loaded
>>>
>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>> index f62cae3..1dcbda8 100644
>>> --- a/net/core/skbuff.c
>>> +++ b/net/core/skbuff.c
>>> @@ -3655,7 +3655,8 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
>>>    	unsigned int delta_truesize = 0;
>>>    	unsigned int delta_len = 0;
>>>    	struct sk_buff *tail = NULL;
>>> -	struct sk_buff *nskb;
>>> +	struct sk_buff *nskb, *tmp;
>>> +	int err;
>>>
>>>    	skb_push(skb, -skb_network_offset(skb) + offset);
>>>
>>> @@ -3665,11 +3666,28 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
>>>    		nskb = list_skb;
>>>    		list_skb = list_skb->next;
>>>
>>> +		err = 0;
>>> +		if (skb_shared(nskb)) {
>>> +			tmp = skb_clone(nskb, GFP_ATOMIC);
>>> +			if (tmp) {
>>> +				kfree_skb(nskb);
>>
>> Should use consume_skb() to not trigger skb:kfree_skb tracepoint when looking
>> for drops in the stack.
> 
> I will use to consume_skb() on the next version.
> 
>>> +				nskb = tmp;
>>> +				err = skb_unclone(nskb, GFP_ATOMIC);
>>
>> Could you elaborate why you also need to unclone? This looks odd here. tc layer
>> (independent of BPF) from ingress & egress side generally assumes unshared skb,
>> so above clone + dropping ref of nskb looks okay to make the main skb struct private
>> for mangling attributes (e.g. mark) & should suffice. What is the exact purpose of
>> the additional skb_unclone() in this context?
> 
> Willem de Bruijn said:
> udp_rcv_segment later converts the udp-gro-list skb to a list of
> regular packets to pass these one-by-one to udp_queue_rcv_one_skb.
> Now all the frags are fully fledged packets, with headers pushed
> before the payload.

Yes.

> PF_PACKET handles untouched fraglist. To modify the payload only
> for udp_rcv_segment, skb_unclone is necessary.

I don't parse this last sentence here, please elaborate in more detail on why
it is necessary.

For example, if tc layer would modify mark on the skb, then __copy_skb_header()
in skb_segment_list() will propagate it. If tc layer would modify payload, the
skb_ensure_writable() will take care of that internally and if needed pull in
parts from fraglist into linear section to make it private. The purpose of the
skb_clone() above iff shared is to make the struct itself private (to safely
modify its struct members). What am I missing?

>>> +			} else {
>>> +				err = -ENOMEM;
>>> +			}
>>> +		}
>>> +
>>>    		if (!tail)
>>>    			skb->next = nskb;
>>>    		else
>>>    			tail->next = nskb;
>>>
>>> +		if (unlikely(err)) {
>>> +			nskb->next = list_skb;
>>> +			goto err_linearize;
>>> +		}
>>> +
>>>    		tail = nskb;
>>>
>>>    		delta_len += nskb->len;
>>>
> 
>
Willem de Bruijn Jan. 7, 2021, 1:05 p.m. UTC | #4
On Thu, Jan 7, 2021 at 7:52 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 1/7/21 12:40 PM, Dongseok Yi wrote:
> > On 2021-01-07 20:05, Daniel Borkmann wrote:
> >> On 1/7/21 1:39 AM, Dongseok Yi wrote:
> >>> skbs in fraglist could be shared by a BPF filter loaded at TC. It
> >>> triggers skb_ensure_writable -> pskb_expand_head ->
> >>> skb_clone_fraglist -> skb_get on each skb in the fraglist.
> >>>
> >>> While tcpdump, sk_receive_queue of PF_PACKET has the original fraglist.
> >>> But the same fraglist is queued to PF_INET (or PF_INET6) as the fraglist
> >>> chain made by skb_segment_list.
> >>>
> >>> If the new skb (not fraglist) is queued to one of the sk_receive_queue,
> >>> multiple ptypes can see this. The skb could be released by ptypes and
> >>> it causes use-after-free.
> >>>
> >>> [ 4443.426215] ------------[ cut here ]------------
> >>> [ 4443.426222] refcount_t: underflow; use-after-free.
> >>> [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
> >>> refcount_dec_and_test_checked+0xa4/0xc8
> >>> [ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO)
> >>> [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
> >>> [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
> >>> [ 4443.426808] Call trace:
> >>> [ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
> >>> [ 4443.426823]  skb_release_data+0x144/0x264
> >>> [ 4443.426828]  kfree_skb+0x58/0xc4
> >>> [ 4443.426832]  skb_queue_purge+0x64/0x9c
> >>> [ 4443.426844]  packet_set_ring+0x5f0/0x820
> >>> [ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
> >>> [ 4443.426853]  __sys_setsockopt+0x188/0x278
> >>> [ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
> >>> [ 4443.426869]  el0_svc_common+0xf0/0x1d0
> >>> [ 4443.426873]  el0_svc_handler+0x74/0x98
> >>> [ 4443.426880]  el0_svc+0x8/0xc
> >>>
> >>> Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.)
> >>> Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
> >>> Acked-by: Willem de Bruijn <willemb@google.com>
> >>> ---
> >>>    net/core/skbuff.c | 20 +++++++++++++++++++-
> >>>    1 file changed, 19 insertions(+), 1 deletion(-)
> >>>
> >>> v2: Expand the commit message to clarify a BPF filter loaded
> >>>
> >>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >>> index f62cae3..1dcbda8 100644
> >>> --- a/net/core/skbuff.c
> >>> +++ b/net/core/skbuff.c
> >>> @@ -3655,7 +3655,8 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
> >>>     unsigned int delta_truesize = 0;
> >>>     unsigned int delta_len = 0;
> >>>     struct sk_buff *tail = NULL;
> >>> -   struct sk_buff *nskb;
> >>> +   struct sk_buff *nskb, *tmp;
> >>> +   int err;
> >>>
> >>>     skb_push(skb, -skb_network_offset(skb) + offset);
> >>>
> >>> @@ -3665,11 +3666,28 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
> >>>             nskb = list_skb;
> >>>             list_skb = list_skb->next;
> >>>
> >>> +           err = 0;
> >>> +           if (skb_shared(nskb)) {
> >>> +                   tmp = skb_clone(nskb, GFP_ATOMIC);
> >>> +                   if (tmp) {
> >>> +                           kfree_skb(nskb);
> >>
> >> Should use consume_skb() to not trigger skb:kfree_skb tracepoint when looking
> >> for drops in the stack.
> >
> > I will use to consume_skb() on the next version.
> >
> >>> +                           nskb = tmp;
> >>> +                           err = skb_unclone(nskb, GFP_ATOMIC);
> >>
> >> Could you elaborate why you also need to unclone? This looks odd here. tc layer
> >> (independent of BPF) from ingress & egress side generally assumes unshared skb,
> >> so above clone + dropping ref of nskb looks okay to make the main skb struct private
> >> for mangling attributes (e.g. mark) & should suffice. What is the exact purpose of
> >> the additional skb_unclone() in this context?
> >
> > Willem de Bruijn said:
> > udp_rcv_segment later converts the udp-gro-list skb to a list of
> > regular packets to pass these one-by-one to udp_queue_rcv_one_skb.
> > Now all the frags are fully fledged packets, with headers pushed
> > before the payload.
>
> Yes.
>
> > PF_PACKET handles untouched fraglist. To modify the payload only
> > for udp_rcv_segment, skb_unclone is necessary.
>
> I don't parse this last sentence here, please elaborate in more detail on why
> it is necessary.
>
> For example, if tc layer would modify mark on the skb, then __copy_skb_header()
> in skb_segment_list() will propagate it. If tc layer would modify payload, the
> skb_ensure_writable() will take care of that internally and if needed pull in
> parts from fraglist into linear section to make it private. The purpose of the
> skb_clone() above iff shared is to make the struct itself private (to safely
> modify its struct members). What am I missing?

If tc writes, it will call skb_make_writable and thus pskb_expand_head
to create a private linear section for the head_skb.

skb_segment_list overwrites part of the skb linear section of each
fragment itself. Even after skb_clone, the frag_skbs share their
linear section with their clone in pf_packet, so we need a
pskb_expand_head here, too.
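Willem's point, that a clone still shares its linear section, can be sketched in userspace (an illustration only; `fake_skb_view`, `fake_clone` and `fake_unclone` are made-up stand-ins for skb_clone and skb_unclone/pskb_expand_head):

```c
#include <stdlib.h>
#include <string.h>

/* Userspace stand-in: an skb as a struct pointing at a linear buffer. */
struct fake_skb_view {
	char *head;			/* models the skb linear section */
	size_t len;
};

/* models skb_clone(): the struct is private, the linear data is shared */
static struct fake_skb_view fake_clone(const struct fake_skb_view *skb)
{
	return *skb;		/* same head pointer, no data copy */
}

/* models skb_unclone()/pskb_expand_head(): give this view private data */
static void fake_unclone(struct fake_skb_view *skb)
{
	char *priv = malloc(skb->len);

	memcpy(priv, skb->head, skb->len);
	skb->head = priv;
}
```

After fake_clone() both views still point at the same bytes, so a header rewrite through one would be visible to the pf_packet holder; only after fake_unclone() does the write become private.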
Daniel Borkmann Jan. 7, 2021, 1:33 p.m. UTC | #5
On 1/7/21 2:05 PM, Willem de Bruijn wrote:
> On Thu, Jan 7, 2021 at 7:52 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 1/7/21 12:40 PM, Dongseok Yi wrote:
>>> On 2021-01-07 20:05, Daniel Borkmann wrote:
>>>> On 1/7/21 1:39 AM, Dongseok Yi wrote:
>>>>> skbs in fraglist could be shared by a BPF filter loaded at TC. It
>>>>> triggers skb_ensure_writable -> pskb_expand_head ->
>>>>> skb_clone_fraglist -> skb_get on each skb in the fraglist.
>>>>>
>>>>> While tcpdump, sk_receive_queue of PF_PACKET has the original fraglist.
>>>>> But the same fraglist is queued to PF_INET (or PF_INET6) as the fraglist
>>>>> chain made by skb_segment_list.
>>>>>
>>>>> If the new skb (not fraglist) is queued to one of the sk_receive_queue,
>>>>> multiple ptypes can see this. The skb could be released by ptypes and
>>>>> it causes use-after-free.
>>>>>
>>>>> [ 4443.426215] ------------[ cut here ]------------
>>>>> [ 4443.426222] refcount_t: underflow; use-after-free.
>>>>> [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
>>>>> refcount_dec_and_test_checked+0xa4/0xc8
>>>>> [ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO)
>>>>> [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
>>>>> [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
>>>>> [ 4443.426808] Call trace:
>>>>> [ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
>>>>> [ 4443.426823]  skb_release_data+0x144/0x264
>>>>> [ 4443.426828]  kfree_skb+0x58/0xc4
>>>>> [ 4443.426832]  skb_queue_purge+0x64/0x9c
>>>>> [ 4443.426844]  packet_set_ring+0x5f0/0x820
>>>>> [ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
>>>>> [ 4443.426853]  __sys_setsockopt+0x188/0x278
>>>>> [ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
>>>>> [ 4443.426869]  el0_svc_common+0xf0/0x1d0
>>>>> [ 4443.426873]  el0_svc_handler+0x74/0x98
>>>>> [ 4443.426880]  el0_svc+0x8/0xc
>>>>>
>>>>> Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.)
>>>>> Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
>>>>> Acked-by: Willem de Bruijn <willemb@google.com>
>>>>> ---
>>>>>     net/core/skbuff.c | 20 +++++++++++++++++++-
>>>>>     1 file changed, 19 insertions(+), 1 deletion(-)
>>>>>
>>>>> v2: Expand the commit message to clarify a BPF filter loaded
>>>>>
>>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
>>>>> index f62cae3..1dcbda8 100644
>>>>> --- a/net/core/skbuff.c
>>>>> +++ b/net/core/skbuff.c
>>>>> @@ -3655,7 +3655,8 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
>>>>>      unsigned int delta_truesize = 0;
>>>>>      unsigned int delta_len = 0;
>>>>>      struct sk_buff *tail = NULL;
>>>>> -   struct sk_buff *nskb;
>>>>> +   struct sk_buff *nskb, *tmp;
>>>>> +   int err;
>>>>>
>>>>>      skb_push(skb, -skb_network_offset(skb) + offset);
>>>>>
>>>>> @@ -3665,11 +3666,28 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
>>>>>              nskb = list_skb;
>>>>>              list_skb = list_skb->next;
>>>>>
>>>>> +           err = 0;
>>>>> +           if (skb_shared(nskb)) {
>>>>> +                   tmp = skb_clone(nskb, GFP_ATOMIC);
>>>>> +                   if (tmp) {
>>>>> +                           kfree_skb(nskb);
>>>>
>>>> Should use consume_skb() to not trigger skb:kfree_skb tracepoint when looking
>>>> for drops in the stack.
>>>
>>> I will use to consume_skb() on the next version.
>>>
>>>>> +                           nskb = tmp;
>>>>> +                           err = skb_unclone(nskb, GFP_ATOMIC);
>>>>
>>>> Could you elaborate why you also need to unclone? This looks odd here. tc layer
>>>> (independent of BPF) from ingress & egress side generally assumes unshared skb,
>>>> so above clone + dropping ref of nskb looks okay to make the main skb struct private
>>>> for mangling attributes (e.g. mark) & should suffice. What is the exact purpose of
>>>> the additional skb_unclone() in this context?
>>>
>>> Willem de Bruijn said:
>>> udp_rcv_segment later converts the udp-gro-list skb to a list of
>>> regular packets to pass these one-by-one to udp_queue_rcv_one_skb.
>>> Now all the frags are fully fledged packets, with headers pushed
>>> before the payload.
>>
>> Yes.
>>
>>> PF_PACKET handles untouched fraglist. To modify the payload only
>>> for udp_rcv_segment, skb_unclone is necessary.
>>
>> I don't parse this last sentence here, please elaborate in more detail on why
>> it is necessary.
>>
>> For example, if tc layer would modify mark on the skb, then __copy_skb_header()
>> in skb_segment_list() will propagate it. If tc layer would modify payload, the
>> skb_ensure_writable() will take care of that internally and if needed pull in
>> parts from fraglist into linear section to make it private. The purpose of the
>> skb_clone() above iff shared is to make the struct itself private (to safely
>> modify its struct members). What am I missing?
> 
> If tc writes, it will call skb_make_writable and thus pskb_expand_head
> to create a private linear section for the head_skb.
> 
> skb_segment_list overwrites part of the skb linear section of each
> fragment itself. Even after skb_clone, the frag_skbs share their
> linear section with their clone in pf_packet, so we need a
> pskb_expand_head here, too.

Ok, got it, thanks for the explanation. It would be great to have it in the v3 commit
log as well, since it was clearer than the above. Too bad in that case (otoh
the pf_packet situation could be considered a corner case ...); either way, the fix
makes sense then.
Willem de Bruijn Jan. 7, 2021, 2:44 p.m. UTC | #6
On Thu, Jan 7, 2021 at 8:33 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 1/7/21 2:05 PM, Willem de Bruijn wrote:
> > On Thu, Jan 7, 2021 at 7:52 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >> On 1/7/21 12:40 PM, Dongseok Yi wrote:
> >>> On 2021-01-07 20:05, Daniel Borkmann wrote:
> >>>> On 1/7/21 1:39 AM, Dongseok Yi wrote:
> >>>>> skbs in fraglist could be shared by a BPF filter loaded at TC. It
> >>>>> triggers skb_ensure_writable -> pskb_expand_head ->
> >>>>> skb_clone_fraglist -> skb_get on each skb in the fraglist.
> >>>>>
> >>>>> While tcpdump, sk_receive_queue of PF_PACKET has the original fraglist.
> >>>>> But the same fraglist is queued to PF_INET (or PF_INET6) as the fraglist
> >>>>> chain made by skb_segment_list.
> >>>>>
> >>>>> If the new skb (not fraglist) is queued to one of the sk_receive_queue,
> >>>>> multiple ptypes can see this. The skb could be released by ptypes and
> >>>>> it causes use-after-free.
> >>>>>
> >>>>> [ 4443.426215] ------------[ cut here ]------------
> >>>>> [ 4443.426222] refcount_t: underflow; use-after-free.
> >>>>> [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
> >>>>> refcount_dec_and_test_checked+0xa4/0xc8
> >>>>> [ 4443.426726] pstate: 60400005 (nZCv daif +PAN -UAO)
> >>>>> [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
> >>>>> [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
> >>>>> [ 4443.426808] Call trace:
> >>>>> [ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
> >>>>> [ 4443.426823]  skb_release_data+0x144/0x264
> >>>>> [ 4443.426828]  kfree_skb+0x58/0xc4
> >>>>> [ 4443.426832]  skb_queue_purge+0x64/0x9c
> >>>>> [ 4443.426844]  packet_set_ring+0x5f0/0x820
> >>>>> [ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
> >>>>> [ 4443.426853]  __sys_setsockopt+0x188/0x278
> >>>>> [ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
> >>>>> [ 4443.426869]  el0_svc_common+0xf0/0x1d0
> >>>>> [ 4443.426873]  el0_svc_handler+0x74/0x98
> >>>>> [ 4443.426880]  el0_svc+0x8/0xc
> >>>>>
> >>>>> Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.)
> >>>>> Signed-off-by: Dongseok Yi <dseok.yi@samsung.com>
> >>>>> Acked-by: Willem de Bruijn <willemb@google.com>
> >>>>> ---
> >>>>>     net/core/skbuff.c | 20 +++++++++++++++++++-
> >>>>>     1 file changed, 19 insertions(+), 1 deletion(-)
> >>>>>
> >>>>> v2: Expand the commit message to clarify a BPF filter loaded
> >>>>>
> >>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >>>>> index f62cae3..1dcbda8 100644
> >>>>> --- a/net/core/skbuff.c
> >>>>> +++ b/net/core/skbuff.c
> >>>>> @@ -3655,7 +3655,8 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
> >>>>>      unsigned int delta_truesize = 0;
> >>>>>      unsigned int delta_len = 0;
> >>>>>      struct sk_buff *tail = NULL;
> >>>>> -   struct sk_buff *nskb;
> >>>>> +   struct sk_buff *nskb, *tmp;
> >>>>> +   int err;
> >>>>>
> >>>>>      skb_push(skb, -skb_network_offset(skb) + offset);
> >>>>>
> >>>>> @@ -3665,11 +3666,28 @@ struct sk_buff *skb_segment_list(struct sk_buff *skb,
> >>>>>              nskb = list_skb;
> >>>>>              list_skb = list_skb->next;
> >>>>>
> >>>>> +           err = 0;
> >>>>> +           if (skb_shared(nskb)) {
> >>>>> +                   tmp = skb_clone(nskb, GFP_ATOMIC);
> >>>>> +                   if (tmp) {
> >>>>> +                           kfree_skb(nskb);
> >>>>
> >>>> Should use consume_skb() to not trigger skb:kfree_skb tracepoint when looking
> >>>> for drops in the stack.
> >>>
> >>> I will use to consume_skb() on the next version.
> >>>
> >>>>> +                           nskb = tmp;
> >>>>> +                           err = skb_unclone(nskb, GFP_ATOMIC);
> >>>>
> >>>> Could you elaborate why you also need to unclone? This looks odd here. tc layer
> >>>> (independent of BPF) from ingress & egress side generally assumes unshared skb,
> >>>> so above clone + dropping ref of nskb looks okay to make the main skb struct private
> >>>> for mangling attributes (e.g. mark) & should suffice. What is the exact purpose of
> >>>> the additional skb_unclone() in this context?
> >>>
> >>> Willem de Bruijn said:
> >>> udp_rcv_segment later converts the udp-gro-list skb to a list of
> >>> regular packets to pass these one-by-one to udp_queue_rcv_one_skb.
> >>> Now all the frags are fully fledged packets, with headers pushed
> >>> before the payload.
> >>
> >> Yes.
> >>
> >>> PF_PACKET handles untouched fraglist. To modify the payload only
> >>> for udp_rcv_segment, skb_unclone is necessary.
> >>
> >> I don't parse this last sentence here, please elaborate in more detail on why
> >> it is necessary.
> >>
> >> For example, if tc layer would modify mark on the skb, then __copy_skb_header()
> >> in skb_segment_list() will propagate it. If tc layer would modify payload, the
> >> skb_ensure_writable() will take care of that internally and if needed pull in
> >> parts from fraglist into linear section to make it private. The purpose of the
> >> skb_clone() above iff shared is to make the struct itself private (to safely
> >> modify its struct members). What am I missing?
> >
> > If tc writes, it will call skb_make_writable and thus pskb_expand_head
> > to create a private linear section for the head_skb.
> >
> > skb_segment_list overwrites part of the skb linear section of each
> > fragment itself. Even after skb_clone, the frag_skbs share their
> > linear section with their clone in pf_packet, so we need a
> > pskb_expand_head here, too.
>
> Ok, got it, thanks for the explanation. Would be great to have it in the v3 commit
> log as well as that was more clear than the above. Too bad in that case (otoh
> the pf_packet situation could be considered corner case ...); ether way, fix makes
> sense then.

Thanks for double checking the tricky logic. Pf_packet + BPF is indeed
a peculiar corner case. But there are perhaps more, like raw sockets,
and other BPF hooks that can trigger an skb_make_writable.

Come to think of it, isn't the no-touching-shared-data rule also violated
even without a BPF program? In that case nothing in the frag_skbs
themselves signifies that they are shared, but the head_skb is
cloned, so its data still may not be modified.

This modification happens to be safe in practice, as the pf_packet
clones don't access the bytes before skb->data where this path inserts
headers. But still.
Daniel Borkmann Jan. 8, 2021, 10:32 a.m. UTC | #7
On 1/7/21 3:44 PM, Willem de Bruijn wrote:
> On Thu, Jan 7, 2021 at 8:33 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 1/7/21 2:05 PM, Willem de Bruijn wrote:
>>> On Thu, Jan 7, 2021 at 7:52 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>> On 1/7/21 12:40 PM, Dongseok Yi wrote:
>>>>> On 2021-01-07 20:05, Daniel Borkmann wrote:
>>>>>> On 1/7/21 1:39 AM, Dongseok Yi wrote:
[...]
>>>>> PF_PACKET handles untouched fraglist. To modify the payload only
>>>>> for udp_rcv_segment, skb_unclone is necessary.
>>>>
>>>> I don't parse this last sentence here, please elaborate in more detail on why
>>>> it is necessary.
>>>>
>>>> For example, if tc layer would modify mark on the skb, then __copy_skb_header()
>>>> in skb_segment_list() will propagate it. If tc layer would modify payload, the
>>>> skb_ensure_writable() will take care of that internally and if needed pull in
>>>> parts from fraglist into linear section to make it private. The purpose of the
>>>> skb_clone() above iff shared is to make the struct itself private (to safely
>>>> modify its struct members). What am I missing?
>>>
>>> If tc writes, it will call skb_make_writable and thus pskb_expand_head
>>> to create a private linear section for the head_skb.
>>>
>>> skb_segment_list overwrites part of the skb linear section of each
>>> fragment itself. Even after skb_clone, the frag_skbs share their
>>> linear section with their clone in pf_packet, so we need a
>>> pskb_expand_head here, too.
>>
>> Ok, got it, thanks for the explanation. Would be great to have it in the v3 commit
>> log as well as that was more clear than the above. Too bad in that case (otoh
>> the pf_packet situation could be considered a corner case ...); either way, the
>> fix makes sense then.
> 
> Thanks for double checking the tricky logic. Pf_packet + BPF is indeed
> a peculiar corner case. But there are perhaps more, like raw sockets,
> and other BPF hooks that can trigger an skb_make_writable.
> 
> Come to think of it, isn't the no-touching-shared-data rule also
> violated even without a BPF program? In that case there is nothing in
> the frag_skbs themselves signifying that they are shared, but the
> head_skb is cloned, so its data may still not be modified.

The skb_ensure_writable() is used in plenty of places from bpf, ovs, netfilter
to core stack (e.g. vlan, mpls, icmp), but either way there should be nothing
wrong with that as it's making sure to pull the data into its linear section
before modification. Uncareful use of skb_store_bits() with offset into skb_frags
for example could defeat that, but I don't see any in-tree occurrence that would
be problematic at this point..

> This modification happens to be safe in practice, as the pf_packet
> clones don't access the bytes before skb->data where this path inserts
> headers. But still.

Patch

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f62cae3..1dcbda8 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3655,7 +3655,8 @@  struct sk_buff *skb_segment_list(struct sk_buff *skb,
 	unsigned int delta_truesize = 0;
 	unsigned int delta_len = 0;
 	struct sk_buff *tail = NULL;
-	struct sk_buff *nskb;
+	struct sk_buff *nskb, *tmp;
+	int err;
 
 	skb_push(skb, -skb_network_offset(skb) + offset);
 
@@ -3665,11 +3666,28 @@  struct sk_buff *skb_segment_list(struct sk_buff *skb,
 		nskb = list_skb;
 		list_skb = list_skb->next;
 
+		err = 0;
+		if (skb_shared(nskb)) {
+			tmp = skb_clone(nskb, GFP_ATOMIC);
+			if (tmp) {
+				kfree_skb(nskb);
+				nskb = tmp;
+				err = skb_unclone(nskb, GFP_ATOMIC);
+			} else {
+				err = -ENOMEM;
+			}
+		}
+
 		if (!tail)
 			skb->next = nskb;
 		else
 			tail->next = nskb;
 
+		if (unlikely(err)) {
+			nskb->next = list_skb;
+			goto err_linearize;
+		}
+
 		tail = nskb;
 
 		delta_len += nskb->len;