diff mbox series

[net,v3] skbuff: fix coalescing for page_pool fragment recycling

Message ID 20220331102440.1673-1-jean-philippe@linaro.org (mailing list archive)
State Accepted
Commit 1effe8ca4e34c34cdd9318436a4232dcb582ebf4
Delegated to: Netdev Maintainers
Headers show
Series [net,v3] skbuff: fix coalescing for page_pool fragment recycling | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for net
netdev/fixes_present success Fixes tag present in non-next series
netdev/subject_prefix success Link
netdev/cover_letter success Single patches do not need cover letters
netdev/patch_count success Link
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 2 this patch: 2
netdev/cc_maintainers warning 1 maintainers not CCed: edumazet@google.com
netdev/build_clang success Errors and warnings before: 9 this patch: 9
netdev/module_param success Was 0 now: 0
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/verify_fixes success Fixes tag looks correct
netdev/build_allmodconfig_warn success Errors and warnings before: 2 this patch: 2
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 22 lines checked
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Jean-Philippe Brucker March 31, 2022, 10:24 a.m. UTC
Fix a use-after-free when using page_pool with page fragments. We
encountered this problem during normal RX in the hns3 driver:

(1) Initially we have three descriptors in the RX queue. The first one
    allocates PAGE1 through page_pool, and the other two allocate one
    half of PAGE2 each. Page references look like this:

                RX_BD1 _______ PAGE1
                RX_BD2 _______ PAGE2
                RX_BD3 _________/

(2) Handle RX on the first descriptor. Allocate SKB1, eventually added
    to the receive queue by tcp_queue_rcv().

(3) Handle RX on the second descriptor. Allocate SKB2 and pass it to
    netif_receive_skb():

    netif_receive_skb(SKB2)
      ip_rcv(SKB2)
        SKB3 = skb_clone(SKB2)

    SKB2 and SKB3 share a reference to PAGE2 through
    skb_shinfo()->dataref. The other ref to PAGE2 is still held by
    RX_BD3:

                      SKB2 ---+- PAGE2
                      SKB3 __/   /
                RX_BD3 _________/

 (3b) Now while handling TCP, coalesce SKB3 with SKB1:

      tcp_v4_rcv(SKB3)
        tcp_try_coalesce(to=SKB1, from=SKB3)    // succeeds
        kfree_skb_partial(SKB3)
          skb_release_data(SKB3)                // drops one dataref

                      SKB1 _____ PAGE1
                           \____
                      SKB2 _____ PAGE2
                                 /
                RX_BD3 _________/

    In skb_try_coalesce(), __skb_frag_ref() takes a page reference to
    PAGE2, where it should instead have increased the page_pool frag
    reference, pp_frag_count. Without coalescing, when releasing both
    SKB2 and SKB3, a single reference to PAGE2 would be dropped. Now
    when releasing SKB1 and SKB2, two references to PAGE2 will be
    dropped, resulting in underflow.

 (3c) Drop SKB2:

      af_packet_rcv(SKB2)
        consume_skb(SKB2)
          skb_release_data(SKB2)                // drops second dataref
            page_pool_return_skb_page(PAGE2)    // drops one pp_frag_count

                      SKB1 _____ PAGE1
                           \____
                                 PAGE2
                                 /
                RX_BD3 _________/

(4) Userspace calls recvmsg()
    Copies SKB1 and releases it. Since SKB3 was coalesced with SKB1, we
    release the SKB3 page as well:

    tcp_eat_recv_skb(SKB1)
      skb_release_data(SKB1)
        page_pool_return_skb_page(PAGE1)
        page_pool_return_skb_page(PAGE2)        // drops second pp_frag_count

(5) PAGE2 is freed, but the third RX descriptor was still using it!
    In our case this causes IOMMU faults, but it would silently corrupt
    memory if the IOMMU was disabled.

Change the logic that checks whether pp_recycle SKBs can be coalesced.
We still reject differing pp_recycle between 'from' and 'to' SKBs, but
in order to avoid the situation described above, we also reject
coalescing when both 'from' and 'to' are pp_recycled and 'from' is
cloned.

The new logic allows coalescing a cloned pp_recycle SKB into a page
refcounted one, because in this case the release (4) will drop the right
reference, the one taken by skb_try_coalesce().

Fixes: 53e0961da1c7 ("page_pool: add frag page recycling support in page pool")
Suggested-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
v2: https://lore.kernel.org/netdev/20220328132258.78307-1-jean-philippe@linaro.org/
v1: https://lore.kernel.org/netdev/20220324172913.26293-1-jean-philippe@linaro.org/
---
 net/core/skbuff.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

Comments

Yunsheng Lin March 31, 2022, 11 a.m. UTC | #1
On 2022/3/31 18:24, Jean-Philippe Brucker wrote:
> Fix a use-after-free when using page_pool with page fragments. We
> encountered this problem during normal RX in the hns3 driver:
> 
> (1) Initially we have three descriptors in the RX queue. The first one
>     allocates PAGE1 through page_pool, and the other two allocate one
>     half of PAGE2 each. Page references look like this:
> 
>                 RX_BD1 _______ PAGE1
>                 RX_BD2 _______ PAGE2
>                 RX_BD3 _________/
> 
> (2) Handle RX on the first descriptor. Allocate SKB1, eventually added
>     to the receive queue by tcp_queue_rcv().
> 
> (3) Handle RX on the second descriptor. Allocate SKB2 and pass it to
>     netif_receive_skb():
> 
>     netif_receive_skb(SKB2)
>       ip_rcv(SKB2)
>         SKB3 = skb_clone(SKB2)
> 
>     SKB2 and SKB3 share a reference to PAGE2 through
>     skb_shinfo()->dataref. The other ref to PAGE2 is still held by
>     RX_BD3:
> 
>                       SKB2 ---+- PAGE2
>                       SKB3 __/   /
>                 RX_BD3 _________/
> 
>  (3b) Now while handling TCP, coalesce SKB3 with SKB1:
> 
>       tcp_v4_rcv(SKB3)
>         tcp_try_coalesce(to=SKB1, from=SKB3)    // succeeds
>         kfree_skb_partial(SKB3)
>           skb_release_data(SKB3)                // drops one dataref
> 
>                       SKB1 _____ PAGE1
>                            \____
>                       SKB2 _____ PAGE2
>                                  /
>                 RX_BD3 _________/
> 
>     In skb_try_coalesce(), __skb_frag_ref() takes a page reference to
>     PAGE2, where it should instead have increased the page_pool frag
>     reference, pp_frag_count. Without coalescing, when releasing both
>     SKB2 and SKB3, a single reference to PAGE2 would be dropped. Now
>     when releasing SKB1 and SKB2, two references to PAGE2 will be
>     dropped, resulting in underflow.
> 
>  (3c) Drop SKB2:
> 
>       af_packet_rcv(SKB2)
>         consume_skb(SKB2)
>           skb_release_data(SKB2)                // drops second dataref
>             page_pool_return_skb_page(PAGE2)    // drops one pp_frag_count
> 
>                       SKB1 _____ PAGE1
>                            \____
>                                  PAGE2
>                                  /
>                 RX_BD3 _________/
> 
> (4) Userspace calls recvmsg()
>     Copies SKB1 and releases it. Since SKB3 was coalesced with SKB1, we
>     release the SKB3 page as well:
> 
>     tcp_eat_recv_skb(SKB1)
>       skb_release_data(SKB1)
>         page_pool_return_skb_page(PAGE1)
>         page_pool_return_skb_page(PAGE2)        // drops second pp_frag_count
> 
> (5) PAGE2 is freed, but the third RX descriptor was still using it!
>     In our case this causes IOMMU faults, but it would silently corrupt
>     memory if the IOMMU was disabled.
> 
> Change the logic that checks whether pp_recycle SKBs can be coalesced.
> We still reject differing pp_recycle between 'from' and 'to' SKBs, but
> in order to avoid the situation described above, we also reject
> coalescing when both 'from' and 'to' are pp_recycled and 'from' is
> cloned.
> 
> The new logic allows coalescing a cloned pp_recycle SKB into a page
> refcounted one, because in this case the release (4) will drop the right
> reference, the one taken by skb_try_coalesce().

LGTM.

Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>

> 
> Fixes: 53e0961da1c7 ("page_pool: add frag page recycling support in page pool")
> Suggested-by: Alexander Duyck <alexanderduyck@fb.com>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
> v2: https://lore.kernel.org/netdev/20220328132258.78307-1-jean-philippe@linaro.org/
> v1: https://lore.kernel.org/netdev/20220324172913.26293-1-jean-philippe@linaro.org/
> ---
>  net/core/skbuff.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index ea51e23e9247..2d6ef6d7ebf5 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -5244,11 +5244,18 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
>  	if (skb_cloned(to))
>  		return false;
>  
> -	/* The page pool signature of struct page will eventually figure out
> -	 * which pages can be recycled or not but for now let's prohibit slab
> -	 * allocated and page_pool allocated SKBs from being coalesced.
> +	/* In general, avoid mixing slab allocated and page_pool allocated
> +	 * pages within the same SKB. However when @to is not pp_recycle and
> +	 * @from is cloned, we can transition frag pages from page_pool to
> +	 * reference counted.
> +	 *
> +	 * On the other hand, don't allow coalescing two pp_recycle SKBs if
> +	 * @from is cloned, in case the SKB is using page_pool fragment
> +	 * references (PP_FLAG_PAGE_FRAG). Since we only take full page
> +	 * references for cloned SKBs at the moment that would result in
> +	 * inconsistent reference counts.
>  	 */
> -	if (to->pp_recycle != from->pp_recycle)
> +	if (to->pp_recycle != (from->pp_recycle && !skb_cloned(from)))
>  		return false;
>  
>  	if (len <= skb_tailroom(to)) {
>
Alexander Duyck March 31, 2022, 6:44 p.m. UTC | #2
> -----Original Message-----
> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Sent: Thursday, March 31, 2022 3:25 AM
> To: ilias.apalodimas@linaro.org; Alexander Duyck
> <alexanderduyck@fb.com>; linyunsheng@huawei.com
> Cc: hawk@kernel.org; davem@davemloft.net; kuba@kernel.org;
> pabeni@redhat.com; netdev@vger.kernel.org; Jean-Philippe Brucker <jean-
> philippe@linaro.org>
> Subject: [PATCH net v3] skbuff: fix coalescing for page_pool fragment
> recycling
> 
> Fix a use-after-free when using page_pool with page fragments. We
> encountered this problem during normal RX in the hns3 driver:

<snip>

> ---
>  net/core/skbuff.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c index
> ea51e23e9247..2d6ef6d7ebf5 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -5244,11 +5244,18 @@ bool skb_try_coalesce(struct sk_buff *to, struct
> sk_buff *from,
>  	if (skb_cloned(to))
>  		return false;
> 
> -	/* The page pool signature of struct page will eventually figure out
> -	 * which pages can be recycled or not but for now let's prohibit slab
> -	 * allocated and page_pool allocated SKBs from being coalesced.
> +	/* In general, avoid mixing slab allocated and page_pool allocated
> +	 * pages within the same SKB. However when @to is not pp_recycle
> and
> +	 * @from is cloned, we can transition frag pages from page_pool to
> +	 * reference counted.
> +	 *
> +	 * On the other hand, don't allow coalescing two pp_recycle SKBs if
> +	 * @from is cloned, in case the SKB is using page_pool fragment
> +	 * references (PP_FLAG_PAGE_FRAG). Since we only take full page
> +	 * references for cloned SKBs at the moment that would result in
> +	 * inconsistent reference counts.
>  	 */
> -	if (to->pp_recycle != from->pp_recycle)
> +	if (to->pp_recycle != (from->pp_recycle && !skb_cloned(from)))
>  		return false;
> 
>  	if (len <= skb_tailroom(to)) {
> --
> 2.25.1

This looks good to me. The impact should be minimal since it only applies to pp_recycle pages.

Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
Ilias Apalodimas March 31, 2022, 7:36 p.m. UTC | #3
On Thu, Mar 31, 2022 at 11:24:41AM +0100, Jean-Philippe Brucker wrote:
> Fix a use-after-free when using page_pool with page fragments. We
> encountered this problem during normal RX in the hns3 driver:
> 
> (1) Initially we have three descriptors in the RX queue. The first one
>     allocates PAGE1 through page_pool, and the other two allocate one
>     half of PAGE2 each. Page references look like this:
> 
>                 RX_BD1 _______ PAGE1
>                 RX_BD2 _______ PAGE2
>                 RX_BD3 _________/
> 
> (2) Handle RX on the first descriptor. Allocate SKB1, eventually added
>     to the receive queue by tcp_queue_rcv().
> 
> (3) Handle RX on the second descriptor. Allocate SKB2 and pass it to
>     netif_receive_skb():
> 
>     netif_receive_skb(SKB2)
>       ip_rcv(SKB2)
>         SKB3 = skb_clone(SKB2)
> 
>     SKB2 and SKB3 share a reference to PAGE2 through
>     skb_shinfo()->dataref. The other ref to PAGE2 is still held by
>     RX_BD3:
> 
>                       SKB2 ---+- PAGE2
>                       SKB3 __/   /
>                 RX_BD3 _________/
> 
>  (3b) Now while handling TCP, coalesce SKB3 with SKB1:
> 
>       tcp_v4_rcv(SKB3)
>         tcp_try_coalesce(to=SKB1, from=SKB3)    // succeeds
>         kfree_skb_partial(SKB3)
>           skb_release_data(SKB3)                // drops one dataref
> 
>                       SKB1 _____ PAGE1
>                            \____
>                       SKB2 _____ PAGE2
>                                  /
>                 RX_BD3 _________/
> 
>     In skb_try_coalesce(), __skb_frag_ref() takes a page reference to
>     PAGE2, where it should instead have increased the page_pool frag
>     reference, pp_frag_count. Without coalescing, when releasing both
>     SKB2 and SKB3, a single reference to PAGE2 would be dropped. Now
>     when releasing SKB1 and SKB2, two references to PAGE2 will be
>     dropped, resulting in underflow.
> 
>  (3c) Drop SKB2:
> 
>       af_packet_rcv(SKB2)
>         consume_skb(SKB2)
>           skb_release_data(SKB2)                // drops second dataref
>             page_pool_return_skb_page(PAGE2)    // drops one pp_frag_count
> 
>                       SKB1 _____ PAGE1
>                            \____
>                                  PAGE2
>                                  /
>                 RX_BD3 _________/
> 
> (4) Userspace calls recvmsg()
>     Copies SKB1 and releases it. Since SKB3 was coalesced with SKB1, we
>     release the SKB3 page as well:
> 
>     tcp_eat_recv_skb(SKB1)
>       skb_release_data(SKB1)
>         page_pool_return_skb_page(PAGE1)
>         page_pool_return_skb_page(PAGE2)        // drops second pp_frag_count
> 
> (5) PAGE2 is freed, but the third RX descriptor was still using it!
>     In our case this causes IOMMU faults, but it would silently corrupt
>     memory if the IOMMU was disabled.
> 
> Change the logic that checks whether pp_recycle SKBs can be coalesced.
> We still reject differing pp_recycle between 'from' and 'to' SKBs, but
> in order to avoid the situation described above, we also reject
> coalescing when both 'from' and 'to' are pp_recycled and 'from' is
> cloned.
> 
> The new logic allows coalescing a cloned pp_recycle SKB into a page
> refcounted one, because in this case the release (4) will drop the right
> reference, the one taken by skb_try_coalesce().
> 
> Fixes: 53e0961da1c7 ("page_pool: add frag page recycling support in page pool")
> Suggested-by: Alexander Duyck <alexanderduyck@fb.com>
> Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
> ---
> v2: https://lore.kernel.org/netdev/20220328132258.78307-1-jean-philippe@linaro.org/
> v1: https://lore.kernel.org/netdev/20220324172913.26293-1-jean-philippe@linaro.org/
> ---
>  net/core/skbuff.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index ea51e23e9247..2d6ef6d7ebf5 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -5244,11 +5244,18 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
>  	if (skb_cloned(to))
>  		return false;
>  
> -	/* The page pool signature of struct page will eventually figure out
> -	 * which pages can be recycled or not but for now let's prohibit slab
> -	 * allocated and page_pool allocated SKBs from being coalesced.
> +	/* In general, avoid mixing slab allocated and page_pool allocated
> +	 * pages within the same SKB. However when @to is not pp_recycle and
> +	 * @from is cloned, we can transition frag pages from page_pool to
> +	 * reference counted.
> +	 *
> +	 * On the other hand, don't allow coalescing two pp_recycle SKBs if
> +	 * @from is cloned, in case the SKB is using page_pool fragment
> +	 * references (PP_FLAG_PAGE_FRAG). Since we only take full page
> +	 * references for cloned SKBs at the moment that would result in
> +	 * inconsistent reference counts.
>  	 */
> -	if (to->pp_recycle != from->pp_recycle)
> +	if (to->pp_recycle != (from->pp_recycle && !skb_cloned(from)))
>  		return false;
>  
>  	if (len <= skb_tailroom(to)) {
> -- 
> 2.25.1
> 

Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Jesper Dangaard Brouer April 1, 2022, 5:21 a.m. UTC | #4
On 31/03/2022 20.44, Alexander Duyck wrote:
>> -----Original Message-----
>> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
>> Sent: Thursday, March 31, 2022 3:25 AM
>> To: ilias.apalodimas@linaro.org; Alexander Duyck
>> <alexanderduyck@fb.com>; linyunsheng@huawei.com
>> Cc: hawk@kernel.org; davem@davemloft.net; kuba@kernel.org;
>> pabeni@redhat.com; netdev@vger.kernel.org; Jean-Philippe Brucker <jean-
>> philippe@linaro.org>
>> Subject: [PATCH net v3] skbuff: fix coalescing for page_pool fragment
>> recycling
>>
>> Fix a use-after-free when using page_pool with page fragments. We
>> encountered this problem during normal RX in the hns3 driver:
> 
> <snip>
> 
>> ---
>>   net/core/skbuff.c | 15 +++++++++++----
>>   1 file changed, 11 insertions(+), 4 deletions(-)
>>
>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c index
>> ea51e23e9247..2d6ef6d7ebf5 100644
>> --- a/net/core/skbuff.c
>> +++ b/net/core/skbuff.c
>> @@ -5244,11 +5244,18 @@ bool skb_try_coalesce(struct sk_buff *to, struct
>> sk_buff *from,
>>   	if (skb_cloned(to))
>>   		return false;
>>
>> -	/* The page pool signature of struct page will eventually figure out
>> -	 * which pages can be recycled or not but for now let's prohibit slab
>> -	 * allocated and page_pool allocated SKBs from being coalesced.
>> +	/* In general, avoid mixing slab allocated and page_pool allocated
>> +	 * pages within the same SKB. However when @to is not pp_recycle
>> and
>> +	 * @from is cloned, we can transition frag pages from page_pool to
>> +	 * reference counted.
>> +	 *
>> +	 * On the other hand, don't allow coalescing two pp_recycle SKBs if
>> +	 * @from is cloned, in case the SKB is using page_pool fragment
>> +	 * references (PP_FLAG_PAGE_FRAG). Since we only take full page
>> +	 * references for cloned SKBs at the moment that would result in
>> +	 * inconsistent reference counts.
>>   	 */
>> -	if (to->pp_recycle != from->pp_recycle)
>> +	if (to->pp_recycle != (from->pp_recycle && !skb_cloned(from)))
>>   		return false;
>>
>>   	if (len <= skb_tailroom(to)) {
>> --
>> 2.25.1
> 
> This looks good to me. The impact should be minimal since it only applies to pp_recycle pages.
> 
> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>

I really appreciate Alex that you helped review this, thanks! :-)

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
patchwork-bot+netdevbpf@kernel.org April 1, 2022, 11 a.m. UTC | #5
Hello:

This patch was applied to netdev/net.git (master)
by David S. Miller <davem@davemloft.net>:

On Thu, 31 Mar 2022 11:24:41 +0100 you wrote:
> Fix a use-after-free when using page_pool with page fragments. We
> encountered this problem during normal RX in the hns3 driver:
> 
> (1) Initially we have three descriptors in the RX queue. The first one
>     allocates PAGE1 through page_pool, and the other two allocate one
>     half of PAGE2 each. Page references look like this:
> 
> [...]

Here is the summary with links:
  - [net,v3] skbuff: fix coalescing for page_pool fragment recycling
    https://git.kernel.org/netdev/net/c/1effe8ca4e34

You are awesome, thank you!
diff mbox series

Patch

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ea51e23e9247..2d6ef6d7ebf5 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -5244,11 +5244,18 @@  bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 	if (skb_cloned(to))
 		return false;
 
-	/* The page pool signature of struct page will eventually figure out
-	 * which pages can be recycled or not but for now let's prohibit slab
-	 * allocated and page_pool allocated SKBs from being coalesced.
+	/* In general, avoid mixing slab allocated and page_pool allocated
+	 * pages within the same SKB. However when @to is not pp_recycle and
+	 * @from is cloned, we can transition frag pages from page_pool to
+	 * reference counted.
+	 *
+	 * On the other hand, don't allow coalescing two pp_recycle SKBs if
+	 * @from is cloned, in case the SKB is using page_pool fragment
+	 * references (PP_FLAG_PAGE_FRAG). Since we only take full page
+	 * references for cloned SKBs at the moment that would result in
+	 * inconsistent reference counts.
 	 */
-	if (to->pp_recycle != from->pp_recycle)
+	if (to->pp_recycle != (from->pp_recycle && !skb_cloned(from)))
 		return false;
 
 	if (len <= skb_tailroom(to)) {