[net] net: bpf: fix request_sock leak in filter.c

Message ID	20220609011844.404011-1-jmaxwell37@gmail.com (mailing list archive)
State	Superseded
Delegated to:	BPF
Headers
From: Jon Maxwell <jmaxwell37@gmail.com>
To: netdev@vger.kernel.org
Cc: davem@davemloft.net, edumazet@google.com, kuba@kernel.org, pabeni@redhat.com, atenart@kernel.org, cutaylor-pub@yahoo.com, Jon Maxwell <jmaxwell37@gmail.com>
Subject: [PATCH net] net: bpf: fix request_sock leak in filter.c
Date: Thu, 9 Jun 2022 11:18:44 +1000
Message-Id: <20220609011844.404011-1-jmaxwell37@gmail.com>
X-Mailer: git-send-email 2.31.1
List-ID: <netdev.vger.kernel.org>
X-Patchwork-Delegate: kuba@kernel.org
Series	[net] net: bpf: fix request_sock leak in filter.c

Context	Check	Description
netdev/tree_selection	success	Clearly marked for net
netdev/fixes_present	success	Fixes tag present in non-next series
netdev/subject_prefix	success	Link
netdev/cover_letter	success	Single patches do not need cover letters
netdev/patch_count	success	Link
netdev/header_inline	success	No static functions without inline keyword in header files
netdev/build_32bit	success	Errors and warnings before: 25 this patch: 25
netdev/cc_maintainers	fail	5 blamed authors not CCed: lmb@cloudflare.com daniel@iogearbox.net ast@kernel.org joe@isovalent.com kafai@fb.com; 11 maintainers not CCed: lmb@cloudflare.com daniel@iogearbox.net songliubraving@fb.com ast@kernel.org joe@isovalent.com bpf@vger.kernel.org yhs@fb.com john.fastabend@gmail.com kafai@fb.com andrii@kernel.org kpsingh@kernel.org
netdev/build_clang	success	Errors and warnings before: 6 this patch: 6
netdev/module_param	success	Was 0 now: 0
netdev/verify_signedoff	success	Signed-off-by tag matches author and committer
netdev/check_selftest	success	No net selftest shell script
netdev/verify_fixes	fail	Problems with Fixes tag: 2
netdev/build_allmodconfig_warn	success	Errors and warnings before: 25 this patch: 25
netdev/checkpatch	warning	WARNING: Co-developed-by and Signed-off-by: name/email do not match WARNING: Use a single space after Signed-off-by: WARNING: line length of 95 exceeds 80 columns
netdev/kdoc	success	Errors and warnings before: 0 this patch: 0
netdev/source_inline	success	Was 0 now: 0
bpf/vmtest-bpf-next-VM_Test-1	success	Logs for Kernel LATEST on ubuntu-latest with gcc
bpf/vmtest-bpf-next-VM_Test-2	success	Logs for Kernel LATEST on ubuntu-latest with llvm-15
bpf/vmtest-bpf-next-PR	success	PR summary
bpf/vmtest-bpf-next-VM_Test-3	success	Logs for Kernel LATEST on z15 with gcc

Jonathan Maxwell June 9, 2022, 1:18 a.m. UTC

A customer reported a request_socket leak in a Calico cloud environment. We
found that a BPF program was doing a socket lookup, which takes a refcnt on
the socket, and that it was finding the request_socket but returning the parent
LISTEN socket via sk_to_full_sk() without first decrementing the refcnt on the
child request socket, resulting in a request_sock slab object leak. This patch
retains the existing behaviour of returning full socks to the caller, but it
also decrements the refcnt on the child request_socket, if one is present,
before doing so to prevent the leak.

Thanks to Curtis Taylor for all the help in diagnosing and testing this. And 
thanks to Antoine Tenart for the reproducer and patch input.

Fixes: f7355a6c0497 bpf: ("Check sk_fullsock() before returning from bpf_sk_lookup()")
Fixes: edbf8c01de5a bpf: ("add skc_lookup_tcp helper")
Tested-by: Curtis Taylor <cutaylor-pub@yahoo.com>
Co-developed-by: Antoine Tenart <atenart@kernel.org>
Signed-off-by:: Antoine Tenart <atenart@kernel.org>
Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
---
 net/core/filter.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)
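
The leak described above can be reproduced in a small userspace model. This is
only a sketch: struct model_sock, model_put() and the other model_* names are
hypothetical stand-ins for struct sock, sock_gen_put(), sk_fullsock() and
sk_to_full_sk(), and the model assumes that the inner lookup takes a reference
on every socket except those flagged SOCK_RCU_FREE.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for kernel types; not the real definitions. */
enum model_state { MODEL_ESTABLISHED, MODEL_LISTEN, MODEL_NEW_SYN_RECV, MODEL_TIME_WAIT };

struct model_sock {
	enum model_state state;
	int refcnt;                  /* stands in for sk_refcnt */
	bool rcu_free;               /* stands in for SOCK_RCU_FREE */
	struct model_sock *listener; /* stands in for rsk_listener */
};

/* Request and timewait sockets are not full sockets. */
static bool model_fullsock(const struct model_sock *sk)
{
	return sk->state != MODEL_NEW_SYN_RECV && sk->state != MODEL_TIME_WAIT;
}

/* A request socket maps to its listener; anything else maps to itself. */
static struct model_sock *model_to_full_sk(struct model_sock *sk)
{
	return sk->state == MODEL_NEW_SYN_RECV ? sk->listener : sk;
}

static void model_put(struct model_sock *sk)
{
	sk->refcnt--;
}

/* Assumption: the inner lookup pins every socket that is not RCU-freed. */
static struct model_sock *model_skc_lookup(struct model_sock *found)
{
	if (found && !found->rcu_free)
		found->refcnt++;
	return found;
}

/* Logic before the patch: the request socket reference is never dropped. */
static struct model_sock *lookup_before(struct model_sock *found)
{
	struct model_sock *sk = model_skc_lookup(found);

	if (sk) {
		sk = model_to_full_sk(sk);
		if (!model_fullsock(sk)) {
			model_put(sk);
			return NULL;
		}
	}
	return sk;
}

/* Logic from this patch: drop the reference on the original socket first. */
static struct model_sock *lookup_after(struct model_sock *found)
{
	struct model_sock *sk = model_skc_lookup(found);
	struct model_sock *sk1 = sk;

	if (sk) {
		sk = model_to_full_sk(sk);
		if (!model_fullsock(sk1))
			model_put(sk1);
		if (!model_fullsock(sk))
			return NULL;
	}
	return sk;
}
```

With a request socket whose refcnt starts at 1 and an RCU-freed listener,
lookup_before() returns the listener and leaves the request socket at 2, while
lookup_after() returns the same listener and leaves it at 1.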

Antoine Tenart June 9, 2022, 1:35 p.m. UTC | #1

Hi Jon,

Quoting Jon Maxwell (2022-06-09 03:18:44)
> A customer reported a request_socket leak in a Calico cloud environment. We 
> found that a BPF program was doing a socket lookup with takes a refcnt on 
> the socket and that it was finding the request_socket but returning the parent 
> LISTEN socket via sk_to_full_sk() without decrementing the child request socket 
> 1st, resulting in request_sock slab object leak. This patch retains the 
> existing behaviour of returning full socks to the caller but it also decrements
> the child request_socket if one is present before doing so to prevent the leak.
> 
> Thanks to Curtis Taylor for all the help in diagnosing and testing this. And 
> thanks to Antoine Tenart for the reproducer and patch input.
> 
> Fixes: f7355a6c0497 bpf: ("Check sk_fullsock() before returning from bpf_sk_lookup()")
> Fixes: edbf8c01de5a bpf: ("add skc_lookup_tcp helper")

"bpf:" should be inside the parentheses in the two lines above.

Isn't the issue from before edbf8c01de5a for bpf_sk_lookup? Looking at a
5.1 kernel[1], __bpf_sk_lookup was called and also did the full socket
translation[2]. bpf_sk_release would not be called on the original
socket when that happens.

[1] https://elixir.bootlin.com/linux/v5.1/source/net/core/filter.c#L5204
[2] https://elixir.bootlin.com/linux/v5.1/source/net/core/filter.c#L5198

> Tested-by: Curtis Taylor <cutaylor-pub@yahoo.com>
> Co-developed-by: Antoine Tenart <atenart@kernel.org>
> Signed-off-by:: Antoine Tenart <atenart@kernel.org>

Please remove the extra ':'.

Thanks!
Antoine

> Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
> ---
>  net/core/filter.c | 20 ++++++++++++++------
>  1 file changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 2e32cee2c469..e3c04ae7381f 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -6202,13 +6202,17 @@ __bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
>  {
>         struct sock *sk = __bpf_skc_lookup(skb, tuple, len, caller_net,
>                                            ifindex, proto, netns_id, flags);
> +       struct sock *sk1 = sk;
>  
>         if (sk) {
>                 sk = sk_to_full_sk(sk);
> -               if (!sk_fullsock(sk)) {
> -                       sock_gen_put(sk);
> +               /* sk_to_full_sk() may return (sk)->rsk_listener, so make sure the original sk1
> +                * sock refcnt is decremented to prevent a request_sock leak.
> +                */
> +               if (!sk_fullsock(sk1))
> +                       sock_gen_put(sk1);
> +               if (!sk_fullsock(sk))
>                         return NULL;
> -               }
>         }
>  
>         return sk;
> @@ -6239,13 +6243,17 @@ bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
>  {
>         struct sock *sk = bpf_skc_lookup(skb, tuple, len, proto, netns_id,
>                                          flags);
> +       struct sock *sk1 = sk;
>  
>         if (sk) {
>                 sk = sk_to_full_sk(sk);
> -               if (!sk_fullsock(sk)) {
> -                       sock_gen_put(sk);
> +               /* sk_to_full_sk() may return (sk)->rsk_listener, so make sure the original sk1
> +                * sock refcnt is decremented to prevent a request_sock leak.
> +                */
> +               if (!sk_fullsock(sk1))
> +                       sock_gen_put(sk1);
> +               if (!sk_fullsock(sk))
>                         return NULL;
> -               }
>         }
>  
>         return sk;
> -- 
> 2.31.1
>

Daniel Borkmann June 9, 2022, 8:29 p.m. UTC | #2

On 6/9/22 3:18 AM, Jon Maxwell wrote:
> A customer reported a request_socket leak in a Calico cloud environment. We
> found that a BPF program was doing a socket lookup with takes a refcnt on
> the socket and that it was finding the request_socket but returning the parent
> LISTEN socket via sk_to_full_sk() without decrementing the child request socket
> 1st, resulting in request_sock slab object leak. This patch retains the
> existing behaviour of returning full socks to the caller but it also decrements
> the child request_socket if one is present before doing so to prevent the leak.
> 
> Thanks to Curtis Taylor for all the help in diagnosing and testing this. And
> thanks to Antoine Tenart for the reproducer and patch input.
> 
> Fixes: f7355a6c0497 bpf: ("Check sk_fullsock() before returning from bpf_sk_lookup()")
> Fixes: edbf8c01de5a bpf: ("add skc_lookup_tcp helper")
> Tested-by: Curtis Taylor <cutaylor-pub@yahoo.com>
> Co-developed-by: Antoine Tenart <atenart@kernel.org>
> Signed-off-by:: Antoine Tenart <atenart@kernel.org>
> Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
> ---
>   net/core/filter.c | 20 ++++++++++++++------
>   1 file changed, 14 insertions(+), 6 deletions(-)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 2e32cee2c469..e3c04ae7381f 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -6202,13 +6202,17 @@ __bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
>   {
>   	struct sock *sk = __bpf_skc_lookup(skb, tuple, len, caller_net,
>   					   ifindex, proto, netns_id, flags);
> +	struct sock *sk1 = sk;
>   
>   	if (sk) {
>   		sk = sk_to_full_sk(sk);
> -		if (!sk_fullsock(sk)) {
> -			sock_gen_put(sk);
> +		/* sk_to_full_sk() may return (sk)->rsk_listener, so make sure the original sk1
> +		 * sock refcnt is decremented to prevent a request_sock leak.
> +		 */
> +		if (!sk_fullsock(sk1))
> +			sock_gen_put(sk1);
> +		if (!sk_fullsock(sk))
>   			return NULL;

[ +Martin/Joe/Lorenz ]

I wonder, should we also add some asserts in here to ensure we don't get an imbalance for the
bpf_sk_release() case later on? Rough pseudocode could be something like below:

static struct sock *
__bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
                 struct net *caller_net, u32 ifindex, u8 proto, u64 netns_id,
                 u64 flags)
{
         struct sock *sk = __bpf_skc_lookup(skb, tuple, len, caller_net,
                                            ifindex, proto, netns_id, flags);
         if (sk) {
                 struct sock *sk2 = sk_to_full_sk(sk);

                 if (!sk_fullsock(sk2))
                         sk2 = NULL;
                 if (sk2 != sk) {
                         sock_gen_put(sk);
                         if (unlikely(sk2 && !sock_flag(sk2, SOCK_RCU_FREE))) {
                                 WARN_ONCE(1, "Found non-RCU, unreferenced socket!");
                                 sk2 = NULL;
                         }
                 }
                 sk = sk2;
         }
         return sk;
}

Thanks,
Daniel

Joe Stringer June 9, 2022, 10:15 p.m. UTC | #3

On Wed, Jun 8, 2022 at 6:21 PM Jon Maxwell <jmaxwell37@gmail.com> wrote:
>
> A customer reported a request_socket leak in a Calico cloud environment. We
> found that a BPF program was doing a socket lookup with takes a refcnt on
> the socket and that it was finding the request_socket but returning the parent
> LISTEN socket via sk_to_full_sk() without decrementing the child request socket
> 1st, resulting in request_sock slab object leak. This patch retains the
> existing behaviour of returning full socks to the caller but it also decrements
> the child request_socket if one is present before doing so to prevent the leak.
>
> Thanks to Curtis Taylor for all the help in diagnosing and testing this. And
> thanks to Antoine Tenart for the reproducer and patch input.
>
> Fixes: f7355a6c0497 bpf: ("Check sk_fullsock() before returning from bpf_sk_lookup()")
> Fixes: edbf8c01de5a bpf: ("add skc_lookup_tcp helper")
> Tested-by: Curtis Taylor <cutaylor-pub@yahoo.com>
> Co-developed-by: Antoine Tenart <atenart@kernel.org>
> Signed-off-by:: Antoine Tenart <atenart@kernel.org>
> Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
> ---
>  net/core/filter.c | 20 ++++++++++++++------
>  1 file changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 2e32cee2c469..e3c04ae7381f 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -6202,13 +6202,17 @@ __bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
>  {
>         struct sock *sk = __bpf_skc_lookup(skb, tuple, len, caller_net,
>                                            ifindex, proto, netns_id, flags);
> +       struct sock *sk1 = sk;
>
>         if (sk) {
>                 sk = sk_to_full_sk(sk);
> -               if (!sk_fullsock(sk)) {
> -                       sock_gen_put(sk);
> +               /* sk_to_full_sk() may return (sk)->rsk_listener, so make sure the original sk1
> +                * sock refcnt is decremented to prevent a request_sock leak.
> +                */
> +               if (!sk_fullsock(sk1))
> +                       sock_gen_put(sk1);
> +               if (!sk_fullsock(sk))
>                         return NULL;
> -               }
>         }
>
>         return sk;

Thinking through the constraints of this function:
1. If the return value is NULL, then all references taken during the
processing must be released.
2. If the return value is non-NULL, then the socket must either have
gained one reference OR it must have the SOCK_RCU_FREE flag set.
3. It also shouldn't return TIME_WAIT / request sockets (!sk_fullsock(sk)).

__bpf_skc_lookup() will give us the properties of (1)/(2) in a socket
that may or may not be `sk_is_refcounted()` at the start of the
function, so then we just need to consider the logic being changed
here.

Digging further, are these statements accurate?
* sk_to_full_sk() can either return the argument or a different listen socket.
* Iff sk1 and sk are the same, then we only need to consider (3),
hence the fullsock check, then depending on what type of socket it is,
we satisfy either (1) (current sock_gen_put() call + NULL) or (2)
(just return).
* Iff sk1 and sk are different, then we should release the reference
on sk1 and then do something with sk following the constraints above.
* Iff sk1 and sk are different, then sk must be a LISTEN socket.
* LISTEN sockets always have SOCK_RCU_FREE.
* Therefore, if sk1 and sk are different, we must release the
reference on sk1 and we do not need to take a reference on sk, and we
can just return sk.

Following the above, the implementation looks concise and follows the
logic for each case. I can't help but think that it would be easier to
read with an sk_is_refcounted() call in there though since the concern
is how the references for sk vs sk1 are tracked in this function.

Thanks,
Joe
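
The three constraints and the case analysis above can be checked mechanically
against a userspace model of the patched logic. This is a sketch under stated
assumptions: struct msock, m_lookup() and constraints_hold() are hypothetical
names, a put is modelled as a plain decrement, and LISTEN sockets are assumed
to carry SOCK_RCU_FREE as stated in the list above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; names and fields are illustrative only. */
enum kind { K_ESTABLISHED, K_LISTEN, K_REQUEST, K_TIMEWAIT };

struct msock {
	enum kind kind;
	int refcnt;
	bool rcu_free;          /* models SOCK_RCU_FREE */
	struct msock *listener; /* models rsk_listener, request socks only */
};

static bool m_fullsock(const struct msock *sk)
{
	return sk->kind == K_ESTABLISHED || sk->kind == K_LISTEN;
}

static struct msock *m_to_full_sk(struct msock *sk)
{
	return sk->kind == K_REQUEST ? sk->listener : sk;
}

/* Models the patched __bpf_sk_lookup() body for the socket the lookup found. */
static struct msock *m_lookup(struct msock *sk)
{
	struct msock *sk1 = sk;

	if (!sk)
		return NULL;
	if (!sk->rcu_free)
		sk->refcnt++;   /* reference taken by the inner lookup */
	sk = m_to_full_sk(sk);
	if (!m_fullsock(sk1))
		sk1->refcnt--;  /* models sock_gen_put(sk1) */
	if (!m_fullsock(sk))
		return NULL;
	return sk;
}

/* Returns true when constraints (1) to (3) hold for one lookup of found. */
static bool constraints_hold(struct msock *found, struct msock *listener)
{
	int found_before = found->refcnt;
	int lsn_before = listener ? listener->refcnt : 0;
	struct msock *ret = m_lookup(found);
	int ret_before;

	if (!ret)       /* (1) nothing may stay referenced */
		return found->refcnt == found_before &&
		       (!listener || listener->refcnt == lsn_before);
	if (!m_fullsock(ret))   /* (3) only full sockets are returned */
		return false;
	if (ret != found && found->refcnt != found_before)
		return false;   /* the original socket must not stay pinned */
	/* (2) exactly one new reference, or an RCU-freed socket with none */
	ret_before = (ret == found) ? found_before : lsn_before;
	return ret->rcu_free ? ret->refcnt == ret_before
			     : ret->refcnt == ret_before + 1;
}
```

Calling constraints_hold() once for an established socket, a listener found
directly, a request socket with its listener, and a timewait socket covers each
branch of the analysis.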

Joe Stringer June 9, 2022, 10:22 p.m. UTC | #4

On Thu, Jun 9, 2022 at 1:30 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 6/9/22 3:18 AM, Jon Maxwell wrote:
> > A customer reported a request_socket leak in a Calico cloud environment. We
> > found that a BPF program was doing a socket lookup with takes a refcnt on
> > the socket and that it was finding the request_socket but returning the parent
> > LISTEN socket via sk_to_full_sk() without decrementing the child request socket
> > 1st, resulting in request_sock slab object leak. This patch retains the
> > existing behaviour of returning full socks to the caller but it also decrements
> > the child request_socket if one is present before doing so to prevent the leak.
> >
> > Thanks to Curtis Taylor for all the help in diagnosing and testing this. And
> > thanks to Antoine Tenart for the reproducer and patch input.
> >
> > Fixes: f7355a6c0497 bpf: ("Check sk_fullsock() before returning from bpf_sk_lookup()")
> > Fixes: edbf8c01de5a bpf: ("add skc_lookup_tcp helper")
> > Tested-by: Curtis Taylor <cutaylor-pub@yahoo.com>
> > Co-developed-by: Antoine Tenart <atenart@kernel.org>
> > Signed-off-by:: Antoine Tenart <atenart@kernel.org>
> > Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
> > ---
> >   net/core/filter.c | 20 ++++++++++++++------
> >   1 file changed, 14 insertions(+), 6 deletions(-)
> >
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 2e32cee2c469..e3c04ae7381f 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -6202,13 +6202,17 @@ __bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
> >   {
> >       struct sock *sk = __bpf_skc_lookup(skb, tuple, len, caller_net,
> >                                          ifindex, proto, netns_id, flags);
> > +     struct sock *sk1 = sk;
> >
> >       if (sk) {
> >               sk = sk_to_full_sk(sk);
> > -             if (!sk_fullsock(sk)) {
> > -                     sock_gen_put(sk);
> > +             /* sk_to_full_sk() may return (sk)->rsk_listener, so make sure the original sk1
> > +              * sock refcnt is decremented to prevent a request_sock leak.
> > +              */
> > +             if (!sk_fullsock(sk1))
> > +                     sock_gen_put(sk1);
> > +             if (!sk_fullsock(sk))
> >                       return NULL;
>
> [ +Martin/Joe/Lorenz ]
>
> I wonder, should we also add some asserts in here to ensure we don't get an unbalance for the
> bpf_sk_release() case later on? Rough pseudocode could be something like below:
>
> static struct sock *
> __bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
>                  struct net *caller_net, u32 ifindex, u8 proto, u64 netns_id,
>                  u64 flags)
> {
>          struct sock *sk = __bpf_skc_lookup(skb, tuple, len, caller_net,
>                                             ifindex, proto, netns_id, flags);
>          if (sk) {
>                  struct sock *sk2 = sk_to_full_sk(sk);
>
>                  if (!sk_fullsock(sk2))
>                          sk2 = NULL;
>                  if (sk2 != sk) {
>                          sock_gen_put(sk);
>                          if (unlikely(sk2 && !sock_flag(sk2, SOCK_RCU_FREE))) {
>                                  WARN_ONCE(1, "Found non-RCU, unreferenced socket!");
>                                  sk2 = NULL;
>                          }
>                  }
>                  sk = sk2;
>          }
>          return sk;
> }

This seems a bit more readable to me from the perspective of
understanding the way that the socket references are tracked & freed.

Jonathan Maxwell June 9, 2022, 11:32 p.m. UTC | #5

On Fri, Jun 10, 2022 at 8:22 AM Joe Stringer <joe@cilium.io> wrote:
>
> On Thu, Jun 9, 2022 at 1:30 PM Daniel Borkmann <daniel@iogearbox.net> wrote:
> >
> > On 6/9/22 3:18 AM, Jon Maxwell wrote:
> > > A customer reported a request_socket leak in a Calico cloud environment. We
> > > found that a BPF program was doing a socket lookup with takes a refcnt on
> > > the socket and that it was finding the request_socket but returning the parent
> > > LISTEN socket via sk_to_full_sk() without decrementing the child request socket
> > > 1st, resulting in request_sock slab object leak. This patch retains the
> > > existing behaviour of returning full socks to the caller but it also decrements
> > > the child request_socket if one is present before doing so to prevent the leak.
> > >
> > > Thanks to Curtis Taylor for all the help in diagnosing and testing this. And
> > > thanks to Antoine Tenart for the reproducer and patch input.
> > >
> > > Fixes: f7355a6c0497 bpf: ("Check sk_fullsock() before returning from bpf_sk_lookup()")
> > > Fixes: edbf8c01de5a bpf: ("add skc_lookup_tcp helper")
> > > Tested-by: Curtis Taylor <cutaylor-pub@yahoo.com>
> > > Co-developed-by: Antoine Tenart <atenart@kernel.org>
> > > Signed-off-by:: Antoine Tenart <atenart@kernel.org>
> > > Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
> > > ---
> > >   net/core/filter.c | 20 ++++++++++++++------
> > >   1 file changed, 14 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > index 2e32cee2c469..e3c04ae7381f 100644
> > > --- a/net/core/filter.c
> > > +++ b/net/core/filter.c
> > > @@ -6202,13 +6202,17 @@ __bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
> > >   {
> > >       struct sock *sk = __bpf_skc_lookup(skb, tuple, len, caller_net,
> > >                                          ifindex, proto, netns_id, flags);
> > > +     struct sock *sk1 = sk;
> > >
> > >       if (sk) {
> > >               sk = sk_to_full_sk(sk);
> > > -             if (!sk_fullsock(sk)) {
> > > -                     sock_gen_put(sk);
> > > +             /* sk_to_full_sk() may return (sk)->rsk_listener, so make sure the original sk1
> > > +              * sock refcnt is decremented to prevent a request_sock leak.
> > > +              */
> > > +             if (!sk_fullsock(sk1))
> > > +                     sock_gen_put(sk1);
> > > +             if (!sk_fullsock(sk))
> > >                       return NULL;
> >
> > [ +Martin/Joe/Lorenz ]
> >
> > I wonder, should we also add some asserts in here to ensure we don't get an unbalance for the
> > bpf_sk_release() case later on? Rough pseudocode could be something like below:
> >
> > static struct sock *
> > __bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
> >                  struct net *caller_net, u32 ifindex, u8 proto, u64 netns_id,
> >                  u64 flags)
> > {
> >          struct sock *sk = __bpf_skc_lookup(skb, tuple, len, caller_net,
> >                                             ifindex, proto, netns_id, flags);
> >          if (sk) {
> >                  struct sock *sk2 = sk_to_full_sk(sk);
> >
> >                  if (!sk_fullsock(sk2))
> >                          sk2 = NULL;
> >                  if (sk2 != sk) {
> >                          sock_gen_put(sk);
> >                          if (unlikely(sk2 && !sock_flag(sk2, SOCK_RCU_FREE))) {
> >                                  WARN_ONCE(1, "Found non-RCU, unreferenced socket!");
> >                                  sk2 = NULL;
> >                          }
> >                  }
> >                  sk = sk2;
> >          }
> >          return sk;
> > }
>
> This seems a bit more readable to me from the perspective of
> understanding the way that the socket references are tracked & freed.

Thanks for the suggestion Daniel and Joe, looks good to me, we will run some
tests with that implemented in our reproducer.

Regards

Jon

Martin KaFai Lau June 10, 2022, 12:17 a.m. UTC | #6

On Thu, Jun 09, 2022 at 10:29:15PM +0200, Daniel Borkmann wrote:
> On 6/9/22 3:18 AM, Jon Maxwell wrote:
> > A customer reported a request_socket leak in a Calico cloud environment. We
> > found that a BPF program was doing a socket lookup with takes a refcnt on
> > the socket and that it was finding the request_socket but returning the parent
> > LISTEN socket via sk_to_full_sk() without decrementing the child request socket
> > 1st, resulting in request_sock slab object leak. This patch retains the
Great catch and debug indeed!

> > existing behaviour of returning full socks to the caller but it also decrements
> > the child request_socket if one is present before doing so to prevent the leak.
> > 
> > Thanks to Curtis Taylor for all the help in diagnosing and testing this. And
> > thanks to Antoine Tenart for the reproducer and patch input.
> > 
> > Fixes: f7355a6c0497 bpf: ("Check sk_fullsock() before returning from bpf_sk_lookup()")
> > Fixes: edbf8c01de5a bpf: ("add skc_lookup_tcp helper")
Instead of the above commits, I think this dates back to
6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF")

> > Tested-by: Curtis Taylor <cutaylor-pub@yahoo.com>
> > Co-developed-by: Antoine Tenart <atenart@kernel.org>
> > Signed-off-by:: Antoine Tenart <atenart@kernel.org>
> > Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
> > ---
> >   net/core/filter.c | 20 ++++++++++++++------
> >   1 file changed, 14 insertions(+), 6 deletions(-)
> > 
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 2e32cee2c469..e3c04ae7381f 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -6202,13 +6202,17 @@ __bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
> >   {
> >   	struct sock *sk = __bpf_skc_lookup(skb, tuple, len, caller_net,
> >   					   ifindex, proto, netns_id, flags);
> > +	struct sock *sk1 = sk;
> >   	if (sk) {
> >   		sk = sk_to_full_sk(sk);
> > -		if (!sk_fullsock(sk)) {
> > -			sock_gen_put(sk);
> > +		/* sk_to_full_sk() may return (sk)->rsk_listener, so make sure the original sk1
> > +		 * sock refcnt is decremented to prevent a request_sock leak.
> > +		 */
> > +		if (!sk_fullsock(sk1))
> > +			sock_gen_put(sk1);
> > +		if (!sk_fullsock(sk))
In this case, sk1 == sk (timewait).  It is a bit worrying to pass
sk to sk_fullsock(sk) after the above sock_gen_put().
I think Daniel's 'if (sk2 != sk) { sock_gen_put(sk); }' check is better.

> 
> [ +Martin/Joe/Lorenz ]
> 
> I wonder, should we also add some asserts in here to ensure we don't get an unbalance for the
> bpf_sk_release() case later on? Rough pseudocode could be something like below:
> 
> static struct sock *
> __bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
>                 struct net *caller_net, u32 ifindex, u8 proto, u64 netns_id,
>                 u64 flags)
> {
>         struct sock *sk = __bpf_skc_lookup(skb, tuple, len, caller_net,
>                                            ifindex, proto, netns_id, flags);
>         if (sk) {
>                 struct sock *sk2 = sk_to_full_sk(sk);
> 
>                 if (!sk_fullsock(sk2))
>                         sk2 = NULL;
>                 if (sk2 != sk) {
>                         sock_gen_put(sk);
>                         if (unlikely(sk2 && !sock_flag(sk2, SOCK_RCU_FREE))) {
I don't think it matters if the helper-returned sk2 is refcounted or not (SOCK_RCU_FREE).
The verifier has ensured the bpf_sk_lookup() and bpf_sk_release() are
always balanced regardless of the type of sk2.

bpf_sk_release() will do the right thing and check whether sk2 is refcounted
before calling sock_gen_put().

The bug here is that the helper forgot to call sock_gen_put(sk) while
the verifier only tracks sk2, so I think the 'if (unlikely...) { WARN_ONCE(...); }'
can be dropped.

>                                 WARN_ONCE(1, "Found non-RCU, unreferenced socket!");
>                                 sk2 = NULL;
>                         }
>                 }
>                 sk = sk2;
>         }
>         return sk;
> }
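
The refcount imbalance discussed above can be illustrated with a small userspace model. This is only a sketch: toy_sock and every toy_* function are hypothetical stand-ins rather than kernel APIs, and the model assumes that every lookup takes a reference (the real lookup does not do so for SOCK_RCU_FREE sockets).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for kernel objects; these are not the real kernel types. */
enum toy_state { TOY_ESTABLISHED, TOY_LISTEN, TOY_NEW_SYN_RECV, TOY_TIME_WAIT };

struct toy_sock {
	enum toy_state state;
	int refcnt;                 /* models the socket refcount */
	struct toy_sock *listener;  /* models rsk_listener of a request sock */
};

/* Models sk_fullsock(): request and timewait socks are not full sockets. */
static bool toy_fullsock(const struct toy_sock *sk)
{
	return sk->state != TOY_NEW_SYN_RECV && sk->state != TOY_TIME_WAIT;
}

/* Models sk_to_full_sk(): a request sock is translated to its listener. */
static struct toy_sock *toy_to_full_sk(struct toy_sock *sk)
{
	if (sk && sk->state == TOY_NEW_SYN_RECV)
		return sk->listener;
	return sk;
}

/* Simplifying assumption: the lookup always takes a reference. */
static struct toy_sock *toy_lookup(struct toy_sock *found)
{
	if (found)
		found->refcnt++;
	return found;
}

/* Models sock_gen_put(), with an underflow check. */
static void toy_put(struct toy_sock *sk)
{
	assert(sk->refcnt > 0);
	sk->refcnt--;
}

/* Pre-patch logic: the put only happens when the translated sock is not full. */
static struct toy_sock *toy_sk_lookup_old(struct toy_sock *found)
{
	struct toy_sock *sk = toy_lookup(found);

	if (sk) {
		sk = toy_to_full_sk(sk);
		if (!toy_fullsock(sk)) {
			toy_put(sk);
			return NULL;
		}
	}
	return sk;
}

/* Logic following the 'if (sk2 != sk)' suggestion, without the WARN_ONCE(). */
static struct toy_sock *toy_sk_lookup_new(struct toy_sock *found)
{
	struct toy_sock *sk = toy_lookup(found);

	if (sk) {
		struct toy_sock *sk2 = toy_to_full_sk(sk);

		if (!toy_fullsock(sk2))
			sk2 = NULL;
		if (sk2 != sk)
			toy_put(sk);    /* drop the ref held on the original sock */
		sk = sk2;
	}
	return sk;
}
```

In this model, looking up a request sock through toy_sk_lookup_old() returns the listener but leaves the request sock's count one higher than before, which corresponds to the reported leak. toy_sk_lookup_new() returns the same listener and leaves the count unchanged, and in the timewait case it never touches the sock after the put.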

Martin KaFai Lau June 10, 2022, 12:36 a.m. UTC | #7

On Thu, Jun 09, 2022 at 05:17:47PM -0700, Martin KaFai Lau wrote:
> On Thu, Jun 09, 2022 at 10:29:15PM +0200, Daniel Borkmann wrote:
> > On 6/9/22 3:18 AM, Jon Maxwell wrote:
> > > A customer reported a request_socket leak in a Calico cloud environment. We
> > > found that a BPF program was doing a socket lookup with takes a refcnt on
> > > the socket and that it was finding the request_socket but returning the parent
> > > LISTEN socket via sk_to_full_sk() without decrementing the child request socket
> > > 1st, resulting in request_sock slab object leak. This patch retains the
> Great catch and debug indeed!
> 
> > > existing behaviour of returning full socks to the caller but it also decrements
> > > the child request_socket if one is present before doing so to prevent the leak.
> > > 
> > > Thanks to Curtis Taylor for all the help in diagnosing and testing this. And
> > > thanks to Antoine Tenart for the reproducer and patch input.
> > > 
> > > Fixes: f7355a6c0497 bpf: ("Check sk_fullsock() before returning from bpf_sk_lookup()")
> > > Fixes: edbf8c01de5a bpf: ("add skc_lookup_tcp helper")
> Instead of the above commits, I think this dated back to
> 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF")

Since this is more bpf specific, I think it could go to the bpf tree.
In v2, please cc bpf@vger.kernel.org and tag it with 'PATCH v2 bpf'.

Jonathan Maxwell June 10, 2022, 12:45 a.m. UTC | #8

On Fri, Jun 10, 2022 at 10:36 AM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Thu, Jun 09, 2022 at 05:17:47PM -0700, Martin KaFai Lau wrote:
> > On Thu, Jun 09, 2022 at 10:29:15PM +0200, Daniel Borkmann wrote:
> > > On 6/9/22 3:18 AM, Jon Maxwell wrote:
> > > > A customer reported a request_socket leak in a Calico cloud environment. We
> > > > found that a BPF program was doing a socket lookup with takes a refcnt on
> > > > the socket and that it was finding the request_socket but returning the parent
> > > > LISTEN socket via sk_to_full_sk() without decrementing the child request socket
> > > > 1st, resulting in request_sock slab object leak. This patch retains the
> > Great catch and debug indeed!
> >
> > > > existing behaviour of returning full socks to the caller but it also decrements
> > > > the child request_socket if one is present before doing so to prevent the leak.
> > > >
> > > > Thanks to Curtis Taylor for all the help in diagnosing and testing this. And
> > > > thanks to Antoine Tenart for the reproducer and patch input.
> > > >
> > > > Fixes: f7355a6c0497 bpf: ("Check sk_fullsock() before returning from bpf_sk_lookup()")
> > > > Fixes: edbf8c01de5a bpf: ("add skc_lookup_tcp helper")
> > Instead of the above commits, I think this dated back to
> > 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF")
>
> Since this is more bpf specific, I think it could go to the bpf tree.
> In v2, please cc bpf@vger.kernel.org and tag it with 'PATCH v2 bpf'.

Okay thanks will do.

Daniel, are you okay with omitting 'if (unlikely...) { WARN_ONCE(...); }'?

If so I'll stick to the rest of the logic of your suggestion and omit that
check in v1.

Regards

Jon

Jonathan Maxwell June 10, 2022, 12:49 a.m. UTC | #9

On Thu, Jun 9, 2022 at 11:35 PM Antoine Tenart <atenart@kernel.org> wrote:
>
> Hi Jon,
>
> Quoting Jon Maxwell (2022-06-09 03:18:44)
> > A customer reported a request_socket leak in a Calico cloud environment. We
> > found that a BPF program was doing a socket lookup with takes a refcnt on
> > the socket and that it was finding the request_socket but returning the parent
> > LISTEN socket via sk_to_full_sk() without decrementing the child request socket
> > 1st, resulting in request_sock slab object leak. This patch retains the
> > existing behaviour of returning full socks to the caller but it also decrements
> > the child request_socket if one is present before doing so to prevent the leak.
> >
> > Thanks to Curtis Taylor for all the help in diagnosing and testing this. And
> > thanks to Antoine Tenart for the reproducer and patch input.
> >
> > Fixes: f7355a6c0497 bpf: ("Check sk_fullsock() before returning from bpf_sk_lookup()")
> > Fixes: edbf8c01de5a bpf: ("add skc_lookup_tcp helper")
>
> "bpf:" should be inside the parenthesis in the two above lines.
>
> Isn't the issue from before edbf8c01de5a for bpf_sk_lookup? Looking at a
> 5.1 kernel[1], __bpf_sk_lookup was called and also did the full socket
> translation[2]. bpf_sk_release would not be called on the original
> socket when that happens.
>
> [1] https://elixir.bootlin.com/linux/v5.1/source/net/core/filter.c#L5204
> [2] https://elixir.bootlin.com/linux/v5.1/source/net/core/filter.c#L5198
>
> > Tested-by: Curtis Taylor <cutaylor-pub@yahoo.com>
> > Co-developed-by: Antoine Tenart <atenart@kernel.org>
> > Signed-off-by:: Antoine Tenart <atenart@kernel.org>
>
> Please remove the extra ':'.
>

Sure will correct those typos in v1.

Regards

Jon

> Thanks!
> Antoine
>
> > Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
> > ---
> >  net/core/filter.c | 20 ++++++++++++++------
> >  1 file changed, 14 insertions(+), 6 deletions(-)
> >
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 2e32cee2c469..e3c04ae7381f 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -6202,13 +6202,17 @@ __bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
> >  {
> >         struct sock *sk = __bpf_skc_lookup(skb, tuple, len, caller_net,
> >                                            ifindex, proto, netns_id, flags);
> > +       struct sock *sk1 = sk;
> >
> >         if (sk) {
> >                 sk = sk_to_full_sk(sk);
> > -               if (!sk_fullsock(sk)) {
> > -                       sock_gen_put(sk);
> > +               /* sk_to_full_sk() may return (sk)->rsk_listener, so make sure the original sk1
> > +                * sock refcnt is decremented to prevent a request_sock leak.
> > +                */
> > +               if (!sk_fullsock(sk1))
> > +                       sock_gen_put(sk1);
> > +               if (!sk_fullsock(sk))
> >                         return NULL;
> > -               }
> >         }
> >
> >         return sk;
> > @@ -6239,13 +6243,17 @@ bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
> >  {
> >         struct sock *sk = bpf_skc_lookup(skb, tuple, len, proto, netns_id,
> >                                          flags);
> > +       struct sock *sk1 = sk;
> >
> >         if (sk) {
> >                 sk = sk_to_full_sk(sk);
> > -               if (!sk_fullsock(sk)) {
> > -                       sock_gen_put(sk);
> > +               /* sk_to_full_sk() may return (sk)->rsk_listener, so make sure the original sk1
> > +                * sock refcnt is decremented to prevent a request_sock leak.
> > +                */
> > +               if (!sk_fullsock(sk1))
> > +                       sock_gen_put(sk1);
> > +               if (!sk_fullsock(sk))
> >                         return NULL;
> > -               }
> >         }
> >
> >         return sk;
> > --
> > 2.31.1
> >

Daniel Borkmann June 10, 2022, 7:08 a.m. UTC | #10

On 6/10/22 2:17 AM, Martin KaFai Lau wrote:
> On Thu, Jun 09, 2022 at 10:29:15PM +0200, Daniel Borkmann wrote:
>> On 6/9/22 3:18 AM, Jon Maxwell wrote:
>>> A customer reported a request_socket leak in a Calico cloud environment. We
>>> found that a BPF program was doing a socket lookup with takes a refcnt on
>>> the socket and that it was finding the request_socket but returning the parent
>>> LISTEN socket via sk_to_full_sk() without decrementing the child request socket
>>> 1st, resulting in request_sock slab object leak. This patch retains the
> Great catch and debug indeed!
> 
>>> existing behaviour of returning full socks to the caller but it also decrements
>>> the child request_socket if one is present before doing so to prevent the leak.
>>>
>>> Thanks to Curtis Taylor for all the help in diagnosing and testing this. And
>>> thanks to Antoine Tenart for the reproducer and patch input.
>>>
>>> Fixes: f7355a6c0497 bpf: ("Check sk_fullsock() before returning from bpf_sk_lookup()")
>>> Fixes: edbf8c01de5a bpf: ("add skc_lookup_tcp helper")
> Instead of the above commits, I think this dated back to
> 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF")
> 
>>> Tested-by: Curtis Taylor <cutaylor-pub@yahoo.com>
>>> Co-developed-by: Antoine Tenart <atenart@kernel.org>
>>> Signed-off-by:: Antoine Tenart <atenart@kernel.org>
>>> Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
>>> ---
>>>    net/core/filter.c | 20 ++++++++++++++------
>>>    1 file changed, 14 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>> index 2e32cee2c469..e3c04ae7381f 100644
>>> --- a/net/core/filter.c
>>> +++ b/net/core/filter.c
>>> @@ -6202,13 +6202,17 @@ __bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
>>>    {
>>>    	struct sock *sk = __bpf_skc_lookup(skb, tuple, len, caller_net,
>>>    					   ifindex, proto, netns_id, flags);
>>> +	struct sock *sk1 = sk;
>>>    	if (sk) {
>>>    		sk = sk_to_full_sk(sk);
>>> -		if (!sk_fullsock(sk)) {
>>> -			sock_gen_put(sk);
>>> +		/* sk_to_full_sk() may return (sk)->rsk_listener, so make sure the original sk1
>>> +		 * sock refcnt is decremented to prevent a request_sock leak.
>>> +		 */
>>> +		if (!sk_fullsock(sk1))
>>> +			sock_gen_put(sk1);
>>> +		if (!sk_fullsock(sk))
> In this case, sk1 == sk (timewait).  It is a bit worrying to pass
> sk to sk_fullsock(sk) after the above sock_gen_put().
> I think Daniel's 'if (sk2 != sk) { sock_gen_put(sk); }' check is better.
> 
>> [ +Martin/Joe/Lorenz ]
>>
>> I wonder, should we also add some asserts in here to ensure we don't get an unbalance for the
>> bpf_sk_release() case later on? Rough pseudocode could be something like below:
>>
>> static struct sock *
>> __bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
>>                  struct net *caller_net, u32 ifindex, u8 proto, u64 netns_id,
>>                  u64 flags)
>> {
>>          struct sock *sk = __bpf_skc_lookup(skb, tuple, len, caller_net,
>>                                             ifindex, proto, netns_id, flags);
>>          if (sk) {
>>                  struct sock *sk2 = sk_to_full_sk(sk);
>>
>>                  if (!sk_fullsock(sk2))
>>                          sk2 = NULL;
>>                  if (sk2 != sk) {
>>                          sock_gen_put(sk);
>>                          if (unlikely(sk2 && !sock_flag(sk2, SOCK_RCU_FREE))) {
> I don't think it matters if the helper-returned sk2 is refcounted or not (SOCK_RCU_FREE).
> The verifier has ensured the bpf_sk_lookup() and bpf_sk_release() are
> always balanced regardless of the type of sk2.
> 
> bpf_sk_release() will do the right thing to check the sk2 is refcounted or not
> before calling sock_gen_put().
> 
> The bug here is the helper forgot to call sock_gen_put(sk) while
> the verifier only tracks the sk2, so I think the 'if (unlikely...) { WARN_ONCE(...); }'
> can be saved.

I was mainly thinking that, given in sk_lookup() we have the check around `sk && !refcounted &&
!sock_flag(sk, SOCK_RCU_FREE)` to catch an unreferenced non-SOCK_RCU_FREE socket, and
given sk_to_full_sk() can return inet_reqsk(sk)->rsk_listener, we don't have a similar
assertion there. Given we don't bump any ref on the latter, it must be SOCK_RCU_FREE, as
otherwise a later call to bpf_sk_release() will unbalance sk2. @Jon: maybe let's just
manually verify that such an sk2 has SOCK_RCU_FREE and state it in the commit message for
future reference then; either is fine with me. Thanks!

>>                                  WARN_ONCE(1, "Found non-RCU, unreferenced socket!");
>>                                  sk2 = NULL;
>>                          }
>>                  }
>>                  sk = sk2;
>>          }
>>          return sk;
>> }
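
The reasoning behind the SOCK_RCU_FREE assertion can be sketched in a userspace model as well. The toy_* names, the rcu_free field and the toy_warnings counter are hypothetical and exist only for this illustration. The model assumes that the lookup takes a reference only on refcounted socks and that the release side drops one only for refcounted socks, as described in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for kernel objects; these are not the real kernel types. */
enum toy_state { TOY_ESTABLISHED, TOY_LISTEN, TOY_NEW_SYN_RECV, TOY_TIME_WAIT };

struct toy_sock {
	enum toy_state state;
	int refcnt;
	bool rcu_free;              /* models the SOCK_RCU_FREE flag */
	struct toy_sock *listener;  /* models rsk_listener of a request sock */
};

static int toy_warnings;  /* counts what WARN_ONCE() would have reported */

/* Models sk_fullsock(): request and timewait socks are not full sockets. */
static bool toy_fullsock(const struct toy_sock *sk)
{
	return sk->state != TOY_NEW_SYN_RECV && sk->state != TOY_TIME_WAIT;
}

/* Assumed rule: only full socks flagged RCU-free skip refcounting. */
static bool toy_is_refcounted(const struct toy_sock *sk)
{
	return !toy_fullsock(sk) || !sk->rcu_free;
}

/* Models sock_gen_put(), with an underflow check. */
static void toy_put(struct toy_sock *sk)
{
	assert(sk->refcnt > 0);
	sk->refcnt--;
}

/* Lookup with the translation, the put and the assertion. */
static struct toy_sock *toy_sk_lookup(struct toy_sock *found)
{
	struct toy_sock *sk = found, *sk2;

	if (!sk)
		return NULL;
	if (toy_is_refcounted(sk))
		sk->refcnt++;           /* reference taken by the lookup */

	sk2 = (sk->state == TOY_NEW_SYN_RECV) ? sk->listener : sk;
	if (!toy_fullsock(sk2))
		sk2 = NULL;
	if (sk2 != sk) {
		/* sk is a non-full sock here, so the lookup did take a ref. */
		toy_put(sk);
		/* Ensure there is no need to bump sk2 refcnt */
		if (sk2 && !sk2->rcu_free) {
			toy_warnings++;
			sk2 = NULL;
		}
	}
	return sk2;
}

/* Release side: put only if the sock is refcounted. */
static void toy_sk_release(struct toy_sock *sk)
{
	if (sk && toy_is_refcounted(sk))
		toy_put(sk);
}
```

With an RCU-free listener, a lookup followed by a release leaves both counts unchanged. If the listener were not RCU-free, the release would drop a reference that the lookup never took on it, which is the imbalance the assertion is meant to catch.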

Daniel Borkmann June 10, 2022, 7:09 a.m. UTC | #11

On 6/10/22 2:45 AM, Jonathan Maxwell wrote:
> On Fri, Jun 10, 2022 at 10:36 AM Martin KaFai Lau <kafai@fb.com> wrote:
>>
>> On Thu, Jun 09, 2022 at 05:17:47PM -0700, Martin KaFai Lau wrote:
>>> On Thu, Jun 09, 2022 at 10:29:15PM +0200, Daniel Borkmann wrote:
>>>> On 6/9/22 3:18 AM, Jon Maxwell wrote:
>>>>> A customer reported a request_socket leak in a Calico cloud environment. We
>>>>> found that a BPF program was doing a socket lookup with takes a refcnt on
>>>>> the socket and that it was finding the request_socket but returning the parent
>>>>> LISTEN socket via sk_to_full_sk() without decrementing the child request socket
>>>>> 1st, resulting in request_sock slab object leak. This patch retains the
>>> Great catch and debug indeed!
>>>
>>>>> existing behaviour of returning full socks to the caller but it also decrements
>>>>> the child request_socket if one is present before doing so to prevent the leak.
>>>>>
>>>>> Thanks to Curtis Taylor for all the help in diagnosing and testing this. And
>>>>> thanks to Antoine Tenart for the reproducer and patch input.
>>>>>
>>>>> Fixes: f7355a6c0497 bpf: ("Check sk_fullsock() before returning from bpf_sk_lookup()")
>>>>> Fixes: edbf8c01de5a bpf: ("add skc_lookup_tcp helper")
>>> Instead of the above commits, I think this dated back to
>>> 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF")
>>
>> Since this is more bpf specific, I think it could go to the bpf tree.
>> In v2, please cc bpf@vger.kernel.org and tag it with 'PATCH v2 bpf'.
> 
> Okay thanks will do.
> 
> Daniel, are you okay with omitting 'if (unlikely...) { WARN_ONCE(...); }'?
> 
> If so I'll stick to the rest of the logic of your suggestion and omit that
> check in v1.

Ok, works for me, see also my other reply that we should at least mention it in
the commit log.

Thanks!
Daniel

Martin KaFai Lau June 10, 2022, 5:58 p.m. UTC | #12

On Fri, Jun 10, 2022 at 09:08:41AM +0200, Daniel Borkmann wrote:
> On 6/10/22 2:17 AM, Martin KaFai Lau wrote:
> > On Thu, Jun 09, 2022 at 10:29:15PM +0200, Daniel Borkmann wrote:
> > > On 6/9/22 3:18 AM, Jon Maxwell wrote:
> > > > A customer reported a request_socket leak in a Calico cloud environment. We
> > > > found that a BPF program was doing a socket lookup with takes a refcnt on
> > > > the socket and that it was finding the request_socket but returning the parent
> > > > LISTEN socket via sk_to_full_sk() without decrementing the child request socket
> > > > 1st, resulting in request_sock slab object leak. This patch retains the
> > Great catch and debug indeed!
> > 
> > > > existing behaviour of returning full socks to the caller but it also decrements
> > > > the child request_socket if one is present before doing so to prevent the leak.
> > > > 
> > > > Thanks to Curtis Taylor for all the help in diagnosing and testing this. And
> > > > thanks to Antoine Tenart for the reproducer and patch input.
> > > > 
> > > > Fixes: f7355a6c0497 bpf: ("Check sk_fullsock() before returning from bpf_sk_lookup()")
> > > > Fixes: edbf8c01de5a bpf: ("add skc_lookup_tcp helper")
> > Instead of the above commits, I think this dated back to
> > 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF")
> > 
> > > > Tested-by: Curtis Taylor <cutaylor-pub@yahoo.com>
> > > > Co-developed-by: Antoine Tenart <atenart@kernel.org>
> > > > Signed-off-by:: Antoine Tenart <atenart@kernel.org>
> > > > Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
> > > > ---
> > > >    net/core/filter.c | 20 ++++++++++++++------
> > > >    1 file changed, 14 insertions(+), 6 deletions(-)
> > > > 
> > > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > > index 2e32cee2c469..e3c04ae7381f 100644
> > > > --- a/net/core/filter.c
> > > > +++ b/net/core/filter.c
> > > > @@ -6202,13 +6202,17 @@ __bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
> > > >    {
> > > >    	struct sock *sk = __bpf_skc_lookup(skb, tuple, len, caller_net,
> > > >    					   ifindex, proto, netns_id, flags);
> > > > +	struct sock *sk1 = sk;
> > > >    	if (sk) {
> > > >    		sk = sk_to_full_sk(sk);
> > > > -		if (!sk_fullsock(sk)) {
> > > > -			sock_gen_put(sk);
> > > > +		/* sk_to_full_sk() may return (sk)->rsk_listener, so make sure the original sk1
> > > > +		 * sock refcnt is decremented to prevent a request_sock leak.
> > > > +		 */
> > > > +		if (!sk_fullsock(sk1))
> > > > +			sock_gen_put(sk1);
> > > > +		if (!sk_fullsock(sk))
> > In this case, sk1 == sk (timewait).  It is a bit worrying to pass
> > sk to sk_fullsock(sk) after the above sock_gen_put().
> > I think Daniel's 'if (sk2 != sk) { sock_gen_put(sk); }' check is better.
> > 
> > > [ +Martin/Joe/Lorenz ]
> > > 
> > > I wonder, should we also add some asserts in here to ensure we don't get an unbalance for the
> > > bpf_sk_release() case later on? Rough pseudocode could be something like below:
> > > 
> > > static struct sock *
> > > __bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
> > >                  struct net *caller_net, u32 ifindex, u8 proto, u64 netns_id,
> > >                  u64 flags)
> > > {
> > >          struct sock *sk = __bpf_skc_lookup(skb, tuple, len, caller_net,
> > >                                             ifindex, proto, netns_id, flags);
> > >          if (sk) {
> > >                  struct sock *sk2 = sk_to_full_sk(sk);
> > > 
> > >                  if (!sk_fullsock(sk2))
> > >                          sk2 = NULL;
> > >                  if (sk2 != sk) {
> > >                          sock_gen_put(sk);
> > >                          if (unlikely(sk2 && !sock_flag(sk2, SOCK_RCU_FREE))) {
> > I don't think it matters if the helper-returned sk2 is refcounted or not (SOCK_RCU_FREE).
> > The verifier has ensured the bpf_sk_lookup() and bpf_sk_release() are
> > always balanced regardless of the type of sk2.
> > 
> > bpf_sk_release() will do the right thing to check the sk2 is refcounted or not
> > before calling sock_gen_put().
> > 
> > The bug here is the helper forgot to call sock_gen_put(sk) while
> > the verifier only tracks the sk2, so I think the 'if (unlikely...) { WARN_ONCE(...); }'
> > can be saved.
> 
> I was mainly thinking given in sk_lookup() we have the check around `sk && !refcounted &&
> !sock_flag(sk, SOCK_RCU_FREE)` to check for unreferenced non-SOCK_RCU_FREE socket, and
> given sk_to_full_sk() can return inet_reqsk(sk)->rsk_listener we don't have a similar
> assertion there. Given we don't bump any ref on the latter, it must be SOCK_RCU_FREE then
Ah, got it.  Thanks for the explanation.

Yep, agree.  It is useful to have this check here to ensure
there is no need to bump the sk2 refcnt.  A comment may be useful
here also: /* Ensure there is no need to bump sk2 refcnt */

Thanks!

Jonathan Maxwell June 14, 2022, 1:05 a.m. UTC | #13

On Sat, Jun 11, 2022 at 3:58 AM Martin KaFai Lau <kafai@fb.com> wrote:
>
> On Fri, Jun 10, 2022 at 09:08:41AM +0200, Daniel Borkmann wrote:
> > On 6/10/22 2:17 AM, Martin KaFai Lau wrote:
> > > On Thu, Jun 09, 2022 at 10:29:15PM +0200, Daniel Borkmann wrote:
> > > > On 6/9/22 3:18 AM, Jon Maxwell wrote:
> > > > > A customer reported a request_socket leak in a Calico cloud environment. We
> > > > > found that a BPF program was doing a socket lookup with takes a refcnt on
> > > > > the socket and that it was finding the request_socket but returning the parent
> > > > > LISTEN socket via sk_to_full_sk() without decrementing the child request socket
> > > > > 1st, resulting in request_sock slab object leak. This patch retains the
> > > Great catch and debug indeed!
> > >
> > > > > existing behaviour of returning full socks to the caller but it also decrements
> > > > > the child request_socket if one is present before doing so to prevent the leak.
> > > > >
> > > > > Thanks to Curtis Taylor for all the help in diagnosing and testing this. And
> > > > > thanks to Antoine Tenart for the reproducer and patch input.
> > > > >
> > > > > Fixes: f7355a6c0497 bpf: ("Check sk_fullsock() before returning from bpf_sk_lookup()")
> > > > > Fixes: edbf8c01de5a bpf: ("add skc_lookup_tcp helper")
> > > Instead of the above commits, I think this dated back to
> > > 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF")
> > >
> > > > > Tested-by: Curtis Taylor <cutaylor-pub@yahoo.com>
> > > > > Co-developed-by: Antoine Tenart <atenart@kernel.org>
> > > > > Signed-off-by:: Antoine Tenart <atenart@kernel.org>
> > > > > Signed-off-by: Jon Maxwell <jmaxwell37@gmail.com>
> > > > > ---
> > > > >    net/core/filter.c | 20 ++++++++++++++------
> > > > >    1 file changed, 14 insertions(+), 6 deletions(-)
> > > > >
> > > > > diff --git a/net/core/filter.c b/net/core/filter.c
> > > > > index 2e32cee2c469..e3c04ae7381f 100644
> > > > > --- a/net/core/filter.c
> > > > > +++ b/net/core/filter.c
> > > > > @@ -6202,13 +6202,17 @@ __bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
> > > > >    {
> > > > >         struct sock *sk = __bpf_skc_lookup(skb, tuple, len, caller_net,
> > > > >                                            ifindex, proto, netns_id, flags);
> > > > > +       struct sock *sk1 = sk;
> > > > >         if (sk) {
> > > > >                 sk = sk_to_full_sk(sk);
> > > > > -               if (!sk_fullsock(sk)) {
> > > > > -                       sock_gen_put(sk);
> > > > > +               /* sk_to_full_sk() may return (sk)->rsk_listener, so make sure the original sk1
> > > > > +                * sock refcnt is decremented to prevent a request_sock leak.
> > > > > +                */
> > > > > +               if (!sk_fullsock(sk1))
> > > > > +                       sock_gen_put(sk1);
> > > > > +               if (!sk_fullsock(sk))
> > > In this case, sk1 == sk (timewait).  It is a bit worrying to pass
> > > sk to sk_fullsock(sk) after the above sock_gen_put().
> > > I think Daniel's 'if (sk2 != sk) { sock_gen_put(sk); }' check is better.
> > >
> > > > [ +Martin/Joe/Lorenz ]
> > > >
> > > > I wonder, should we also add some asserts in here to ensure we don't get an unbalance for the
> > > > bpf_sk_release() case later on? Rough pseudocode could be something like below:
> > > >
> > > > static struct sock *
> > > > __bpf_sk_lookup(struct sk_buff *skb, struct bpf_sock_tuple *tuple, u32 len,
> > > >                  struct net *caller_net, u32 ifindex, u8 proto, u64 netns_id,
> > > >                  u64 flags)
> > > > {
> > > >          struct sock *sk = __bpf_skc_lookup(skb, tuple, len, caller_net,
> > > >                                             ifindex, proto, netns_id, flags);
> > > >          if (sk) {
> > > >                  struct sock *sk2 = sk_to_full_sk(sk);
> > > >
> > > >                  if (!sk_fullsock(sk2))
> > > >                          sk2 = NULL;
> > > >                  if (sk2 != sk) {
> > > >                          sock_gen_put(sk);
> > > >                          if (unlikely(sk2 && !sock_flag(sk2, SOCK_RCU_FREE))) {
> > > I don't think it matters if the helper-returned sk2 is refcounted or not (SOCK_RCU_FREE).
> > > The verifier has ensured the bpf_sk_lookup() and bpf_sk_release() are
> > > always balanced regardless of the type of sk2.
> > >
> > > bpf_sk_release() will do the right thing to check the sk2 is refcounted or not
> > > before calling sock_gen_put().
> > >
> > > The bug here is the helper forgot to call sock_gen_put(sk) while
> > > the verifier only tracks the sk2, so I think the 'if (unlikely...) { WARN_ONCE(...); }'
> > > can be saved.
> >
> > I was mainly thinking given in sk_lookup() we have the check around `sk && !refcounted &&
> > !sock_flag(sk, SOCK_RCU_FREE)` to check for unreferenced non-SOCK_RCU_FREE socket, and
> > given sk_to_full_sk() can return inet_reqsk(sk)->rsk_listener we don't have a similar
> > assertion there. Given we don't bump any ref on the latter, it must be SOCK_RCU_FREE then
> Ah. got it.  Thanks for the explanation.
>
> Yep, agree.  It is useful to have this check here to ensure
> no need to bump the sk2 refcnt.  A comment may be useful
> here also, /* Ensure there is no need to bump sk2 refcnt */
>

I'll add that comment.

I'll add the SOCK_RCU_FREE check. We are currently testing the new patch
based on Daniel's recommendation. When that is complete I'll resubmit the next
version of the patch including that. It'll probably be a few days.

Regards

Jon

> Thanks!
>
