From patchwork Tue Apr 9 20:52:58 2024
X-Patchwork-Submitter: Zijian Zhang
X-Patchwork-Id: 13623244
X-Patchwork-Delegate: kuba@kernel.org
From: zijianzhang@bytedance.com
To: netdev@vger.kernel.org
Cc: edumazet@google.com, willemdebruijn.kernel@gmail.com,
	davem@davemloft.net, kuba@kernel.org, cong.wang@bytedance.com,
	xiaochun.lu@bytedance.com, Zijian Zhang
Subject: [PATCH net-next 1/3] sock: add MSG_ZEROCOPY_UARG
Date: Tue, 9 Apr 2024 20:52:58 +0000
Message-Id: <20240409205300.1346681-2-zijianzhang@bytedance.com>
X-Mailer: git-send-email 2.20.1
In-Reply-To: <20240409205300.1346681-1-zijianzhang@bytedance.com>
References: <20240409205300.1346681-1-zijianzhang@bytedance.com>

From: Zijian Zhang

The MSG_ZEROCOPY flag enables copy avoidance for socket send calls.
However, zerocopy is not a free lunch. Apart from the management of user
pages, the combination of poll + recvmsg to receive notifications incurs
non-negligible overhead in applications. That overhead can sometimes
exceed the CPU savings from zerocopy.

We try to solve this problem with a new option for TCP and UDP,
MSG_ZEROCOPY_UARG. This new mechanism aims to reduce the overhead of
receiving notifications by copying them directly into a user buffer
whose address is passed in a control message with each sendmsg call.
This significantly reduces the complexity and overhead of managing
notifications. In the ideal pattern, the user keeps calling sendmsg with
the MSG_ZEROCOPY_UARG flag, and notifications are delivered as soon as
possible.
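As a rough illustration of the interface described above, the sketch below shows how user space might attach a notification buffer to a sendmsg call. It is not part of this patch. The EX_-prefixed names and the helper ex_attach_zc_cmsg() are hypothetical; the numeric values and the struct layout are copied from the uapi changes in this patch, and SOL_SOCKET as the cmsg level is an assumption based on the new case living in __sock_cmsg_send().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Values copied from this patch; they are not in released kernel headers. */
#define EX_MSG_ZEROCOPY_UARG        0x2000000
#define EX_SO_ZEROCOPY_NOTIFICATION 78
#define EX_ZC_INFO_MAX              8

/* Local mirrors of struct tx_msg_zcopy_info / tx_usr_zcopy_info. */
struct ex_zcopy_info {
	uint32_t lo;
	uint32_t hi;
	uint8_t zerocopy;
};

struct ex_usr_zcopy_info {
	int length;
	struct ex_zcopy_info info[EX_ZC_INFO_MAX];
};

/* Attach one control message whose payload is a pointer to `out`.
 * `ctrl` must be suitably aligned for struct cmsghdr.
 * Returns 0 on success, -1 if `ctrl` is too small.
 */
static int ex_attach_zc_cmsg(struct msghdr *msg, void *ctrl, size_t ctrl_len,
			     struct ex_usr_zcopy_info *out)
{
	struct cmsghdr *cmsg;

	if (ctrl_len < CMSG_SPACE(sizeof(void *)))
		return -1;

	memset(ctrl, 0, ctrl_len);
	/* The kernel side only writes when notifications are pending,
	 * so start from an empty result.
	 */
	out->length = 0;

	msg->msg_control = ctrl;
	msg->msg_controllen = CMSG_SPACE(sizeof(void *));

	cmsg = CMSG_FIRSTHDR(msg);
	cmsg->cmsg_level = SOL_SOCKET;	/* assumption, see lead-in */
	cmsg->cmsg_type = EX_SO_ZEROCOPY_NOTIFICATION;
	/* The patch rejects any length other than CMSG_LEN(sizeof(void *)). */
	cmsg->cmsg_len = CMSG_LEN(sizeof(void *));
	memcpy(CMSG_DATA(cmsg), &out, sizeof(out));
	return 0;
}
```

On a kernel with this series applied, a caller would presumably enable SO_ZEROCOPY on the socket, call sendmsg(fd, &msg, EX_MSG_ZEROCOPY_UARG), and then read out.length entries from out.info[] once the call returns.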
Signed-off-by: Zijian Zhang
Signed-off-by: Xiaochun Lu
---
 include/linux/skbuff.h                  |   7 +-
 include/linux/socket.h                  |   1 +
 include/linux/tcp.h                     |   3 +
 include/linux/udp.h                     |   3 +
 include/net/sock.h                      |  17 +++
 include/net/udp.h                       |   1 +
 include/uapi/asm-generic/socket.h       |   2 +
 include/uapi/linux/socket.h             |  17 +++
 net/core/skbuff.c                       | 137 ++++++++++++++++++++++--
 net/core/sock.c                         |  50 +++++++++
 net/ipv4/ip_output.c                    |   6 +-
 net/ipv4/tcp.c                          |   7 +-
 net/ipv4/udp.c                          |   9 ++
 net/ipv6/ip6_output.c                   |   5 +-
 net/ipv6/udp.c                          |   9 ++
 net/vmw_vsock/virtio_transport_common.c |   2 +-
 16 files changed, 258 insertions(+), 18 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 03ea36a82cdd..19b94ba01007 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1663,12 +1663,14 @@ static inline void skb_set_end_offset(struct sk_buff *skb, unsigned int offset)
 #endif

 struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
-				       struct ubuf_info *uarg);
+				       struct ubuf_info *uarg, bool user_args_notification);

 void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref);

 void msg_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *uarg,
 			   bool success);
+void msg_zerocopy_uarg_callback(struct sk_buff *skb, struct ubuf_info *uarg,
+				bool success);

 int __zerocopy_sg_from_iter(struct msghdr *msg, struct sock *sk,
 			    struct sk_buff *skb, struct iov_iter *from,
@@ -1763,7 +1765,8 @@ static inline void net_zcopy_put(struct ubuf_info *uarg)
 static inline void net_zcopy_put_abort(struct ubuf_info *uarg, bool have_uref)
 {
 	if (uarg) {
-		if (uarg->callback == msg_zerocopy_callback)
+		if (uarg->callback == msg_zerocopy_callback ||
+		    uarg->callback == msg_zerocopy_uarg_callback)
 			msg_zerocopy_put_abort(uarg, have_uref);
 		else if (have_uref)
 			net_zcopy_put(uarg);
diff --git a/include/linux/socket.h b/include/linux/socket.h
index 139c330ccf2c..de01392344e1 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -326,6 +326,7 @@ struct ucred {
 					  * plain text and require encryption
 					  */

+#define MSG_ZEROCOPY_UARG	0x2000000 /* MSG_ZEROCOPY with UARG notifications */
 #define MSG_ZEROCOPY	0x4000000	/* Use user data in kernel path */
 #define MSG_SPLICE_PAGES 0x8000000	/* Splice the pages from the iterator in sendmsg() */
 #define MSG_FASTOPEN	0x20000000	/* Send data in TCP SYN */
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 55399ee2a57e..e973f4990646 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -501,6 +501,9 @@ struct tcp_sock {
 	 */
 	struct request_sock __rcu *fastopen_rsk;
 	struct saved_syn *saved_syn;
+
+/* TCP MSG_ZEROCOPY_UARG related information */
+	struct tx_msg_zcopy_queue tx_zcopy_queue;
 };

 enum tsq_enum {
diff --git a/include/linux/udp.h b/include/linux/udp.h
index 3748e82b627b..502b393eac67 100644
--- a/include/linux/udp.h
+++ b/include/linux/udp.h
@@ -95,6 +95,9 @@ struct udp_sock {

 	/* Cache friendly copy of sk->sk_peek_off >= 0 */
 	bool		peeking_with_offset;
+
+	/* This field is used by sendmsg zcopy user arg mode notification */
+	struct tx_msg_zcopy_queue tx_zcopy_queue;
 };

 #define udp_test_bit(nr, sk)			\
diff --git a/include/net/sock.h b/include/net/sock.h
index 2253eefe2848..f7c045e98213 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -544,6 +544,23 @@ struct sock {
 	netns_tracker		ns_tracker;
 };

+struct tx_msg_zcopy_node {
+	struct list_head node;
+	struct tx_msg_zcopy_info info;
+	struct sk_buff *skb;
+};
+
+struct tx_msg_zcopy_queue {
+	struct list_head head;
+	spinlock_t lock; /* protects head queue */
+};
+
+static inline void tx_message_zcopy_queue_init(struct tx_msg_zcopy_queue *q)
+{
+	spin_lock_init(&q->lock);
+	INIT_LIST_HEAD(&q->head);
+}
+
 enum sk_pacing {
 	SK_PACING_NONE		= 0,
 	SK_PACING_NEEDED	= 1,
diff --git a/include/net/udp.h b/include/net/udp.h
index 488a6d2babcc..9e4d7b128de4 100644
--- a/include/net/udp.h
+++ b/include/net/udp.h
@@ -182,6 +182,7 @@ static inline void udp_lib_init_sock(struct sock *sk)
 	skb_queue_head_init(&up->reader_queue);
 	up->forward_threshold = sk->sk_rcvbuf >> 2;
 	set_bit(SOCK_CUSTOM_SOCKOPT, &sk->sk_socket->flags);
+	tx_message_zcopy_queue_init(&up->tx_zcopy_queue);
 }

 /* hash routines shared between UDPv4/6 and UDP-Litev4/6 */
diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index 8ce8a39a1e5f..86aa4b5cb7f1 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -135,6 +135,8 @@
 #define SO_PASSPIDFD		76
 #define SO_PEERPIDFD		77

+#define SO_ZEROCOPY_NOTIFICATION 78
+
 #if !defined(__KERNEL__)

 #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__))
diff --git a/include/uapi/linux/socket.h b/include/uapi/linux/socket.h
index d3fcd3b5ec53..469ed8f4e6c8 100644
--- a/include/uapi/linux/socket.h
+++ b/include/uapi/linux/socket.h
@@ -35,4 +35,21 @@ struct __kernel_sockaddr_storage {
 #define SOCK_TXREHASH_DISABLED	0
 #define SOCK_TXREHASH_ENABLED	1

+/*
+ * Given the fact that MSG_ZEROCOPY_UARG tries to copy notifications
+ * back to user as soon as possible, 8 should be sufficient.
+ */
+#define SOCK_USR_ZC_INFO_MAX 8
+
+struct tx_msg_zcopy_info {
+	__u32 lo;
+	__u32 hi;
+	__u8 zerocopy;
+};
+
+struct tx_usr_zcopy_info {
+	int length;
+	struct tx_msg_zcopy_info info[SOCK_USR_ZC_INFO_MAX];
+};
+
 #endif /* _UAPI_LINUX_SOCKET_H */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 2a5ce6667bbb..d939b2c14d55 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1661,6 +1661,16 @@ void mm_unaccount_pinned_pages(struct mmpin *mmp)
 }
 EXPORT_SYMBOL_GPL(mm_unaccount_pinned_pages);

+static void init_ubuf_info_msgzc(struct ubuf_info_msgzc *uarg, struct sock *sk, size_t size)
+{
+	uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
+	uarg->len = 1;
+	uarg->bytelen = size;
+	uarg->zerocopy = 1;
+	uarg->ubuf.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
+	refcount_set(&uarg->ubuf.refcnt, 1);
+}
+
 static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)
 {
 	struct ubuf_info_msgzc *uarg;
@@ -1682,12 +1692,38 @@ static struct ubuf_info *msg_zerocopy_alloc(struct sock *sk, size_t size)
 	}

 	uarg->ubuf.callback = msg_zerocopy_callback;
-	uarg->id = ((u32)atomic_inc_return(&sk->sk_zckey)) - 1;
-	uarg->len = 1;
-	uarg->bytelen = size;
-	uarg->zerocopy = 1;
-	uarg->ubuf.flags = SKBFL_ZEROCOPY_FRAG | SKBFL_DONT_ORPHAN;
-	refcount_set(&uarg->ubuf.refcnt, 1);
+	init_ubuf_info_msgzc(uarg, sk, size);
+	sock_hold(sk);
+
+	return &uarg->ubuf;
+}
+
+static struct ubuf_info *msg_zerocopy_uarg_alloc(struct sock *sk, size_t size)
+{
+	struct sk_buff *skb;
+	struct ubuf_info_msgzc *uarg;
+	struct tx_msg_zcopy_node *zcopy_node_p;
+
+	WARN_ON_ONCE(!in_task());
+
+	skb = sock_omalloc(sk, sizeof(*zcopy_node_p), GFP_KERNEL);
+	if (!skb)
+		return NULL;
+
+	BUILD_BUG_ON(sizeof(*uarg) > sizeof(skb->cb));
+	uarg = (void *)skb->cb;
+	uarg->mmp.user = NULL;
+	zcopy_node_p = (struct tx_msg_zcopy_node *)skb_put(skb, sizeof(*zcopy_node_p));
+
+	if (mm_account_pinned_pages(&uarg->mmp, size)) {
+		kfree_skb(skb);
+		return NULL;
+	}
+
+	INIT_LIST_HEAD(&zcopy_node_p->node);
+	zcopy_node_p->skb = skb;
+	uarg->ubuf.callback = msg_zerocopy_uarg_callback;
+	init_ubuf_info_msgzc(uarg, sk, size);
 	sock_hold(sk);

 	return &uarg->ubuf;
@@ -1699,7 +1735,7 @@ static inline struct sk_buff *skb_from_uarg(struct ubuf_info_msgzc *uarg)
 }

 struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
-				       struct ubuf_info *uarg)
+				       struct ubuf_info *uarg, bool usr_arg_notification)
 {
 	if (uarg) {
 		struct ubuf_info_msgzc *uarg_zc;
@@ -1707,7 +1743,8 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
 		u32 bytelen, next;

 		/* there might be non MSG_ZEROCOPY users */
-		if (uarg->callback != msg_zerocopy_callback)
+		if (uarg->callback != msg_zerocopy_callback &&
+		    uarg->callback != msg_zerocopy_uarg_callback)
 			return NULL;

 		/* realloc only when socket is locked (TCP, UDP cork),
@@ -1744,6 +1781,8 @@ struct ubuf_info *msg_zerocopy_realloc(struct sock *sk, size_t size,
 	}

 new_alloc:
+	if (usr_arg_notification)
+		return msg_zerocopy_uarg_alloc(sk, size);
 	return msg_zerocopy_alloc(sk, size);
 }
 EXPORT_SYMBOL_GPL(msg_zerocopy_realloc);
@@ -1830,6 +1869,86 @@ void msg_zerocopy_callback(struct sk_buff *skb, struct ubuf_info *uarg,
 }
 EXPORT_SYMBOL_GPL(msg_zerocopy_callback);

+static bool skb_zerocopy_uarg_notify_extend(struct tx_msg_zcopy_node *node, u32 lo, u16 len)
+{
+	u32 old_lo, old_hi;
+	u64 sum_len;
+
+	old_lo = node->info.lo;
+	old_hi = node->info.hi;
+	sum_len = old_hi - old_lo + 1ULL + len;
+
+	if (sum_len >= (1ULL << 32))
+		return false;
+
+	if (lo != old_hi + 1)
+		return false;
+
+	node->info.hi += len;
+	return true;
+}
+
+static void __msg_zerocopy_uarg_callback(struct ubuf_info_msgzc *uarg)
+{
+	struct sk_buff *skb = skb_from_uarg(uarg);
+	struct sock *sk = skb->sk;
+	struct tx_msg_zcopy_node *zcopy_node_p, *tail;
+	struct tx_msg_zcopy_queue *zcopy_queue;
+	unsigned long flags;
+	u32 lo, hi;
+	u16 len;
+
+	mm_unaccount_pinned_pages(&uarg->mmp);
+
+	/* if !len, there was only 1 call, and it was aborted
+	 * so do not queue a completion notification
+	 */
+	if (!uarg->len || sock_flag(sk, SOCK_DEAD))
+		goto release;
+
+	/* only support TCP and UDP currently */
+	if (sk_is_tcp(sk)) {
+		zcopy_queue = &tcp_sk(sk)->tx_zcopy_queue;
+	} else if (sk_is_udp(sk)) {
+		zcopy_queue = &udp_sk(sk)->tx_zcopy_queue;
+	} else {
+		pr_warn("MSG_ZEROCOPY_UARG only support TCP && UDP sockets");
+		goto release;
+	}
+
+	len = uarg->len;
+	lo = uarg->id;
+	hi = uarg->id + len - 1;
+
+	zcopy_node_p = (struct tx_msg_zcopy_node *)skb->data;
+	zcopy_node_p->info.lo = lo;
+	zcopy_node_p->info.hi = hi;
+	zcopy_node_p->info.zerocopy = uarg->zerocopy;
+
+	spin_lock_irqsave(&zcopy_queue->lock, flags);
+	tail = list_last_entry(&zcopy_queue->head, struct tx_msg_zcopy_node, node);
+	if (!tail || !skb_zerocopy_uarg_notify_extend(tail, lo, len)) {
+		list_add_tail(&zcopy_node_p->node, &zcopy_queue->head);
+		skb = NULL;
+	}
+	spin_unlock_irqrestore(&zcopy_queue->lock, flags);
+release:
+	consume_skb(skb);
+	sock_put(sk);
+}
+
+void msg_zerocopy_uarg_callback(struct sk_buff *skb, struct ubuf_info *uarg,
+				bool success)
+{
+	struct ubuf_info_msgzc *uarg_zc = uarg_to_msgzc(uarg);
+
+	uarg_zc->zerocopy = uarg_zc->zerocopy & success;
+
+	if (refcount_dec_and_test(&uarg->refcnt))
+		__msg_zerocopy_uarg_callback(uarg_zc);
+}
+EXPORT_SYMBOL_GPL(msg_zerocopy_uarg_callback);
+
 void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref)
 {
 	struct sock *sk = skb_from_uarg(uarg_to_msgzc(uarg))->sk;
@@ -1838,7 +1957,7 @@ void msg_zerocopy_put_abort(struct ubuf_info *uarg, bool have_uref)
 	uarg_to_msgzc(uarg)->len--;

 	if (have_uref)
-		msg_zerocopy_callback(NULL, uarg, true);
+		uarg->callback(NULL, uarg, true);
 }
 EXPORT_SYMBOL_GPL(msg_zerocopy_put_abort);

diff --git a/net/core/sock.c b/net/core/sock.c
index 5ed411231fc7..a00ebd71f6ed 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2843,6 +2843,56 @@ int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
 	case SCM_RIGHTS:
 	case SCM_CREDENTIALS:
 		break;
+	case SO_ZEROCOPY_NOTIFICATION:
+		if (sock_flag(sk, SOCK_ZEROCOPY)) {
+			int i = 0;
+			struct tx_usr_zcopy_info sys_zcopy_info;
+			struct tx_msg_zcopy_node *zcopy_node_p, *tmp;
+			struct tx_msg_zcopy_queue *zcopy_queue;
+			struct tx_msg_zcopy_node *zcopy_node_ps[SOCK_USR_ZC_INFO_MAX];
+			unsigned long flags;
+
+			if (cmsg->cmsg_len != CMSG_LEN(sizeof(void *)))
+				return -EINVAL;
+
+			if (sk_is_tcp(sk))
+				zcopy_queue = &tcp_sk(sk)->tx_zcopy_queue;
+			else if (sk_is_udp(sk))
+				zcopy_queue = &udp_sk(sk)->tx_zcopy_queue;
+			else
+				return -EINVAL;
+
+			spin_lock_irqsave(&zcopy_queue->lock, flags);
+			list_for_each_entry_safe(zcopy_node_p, tmp, &zcopy_queue->head, node) {
+				sys_zcopy_info.info[i].lo = zcopy_node_p->info.lo;
+				sys_zcopy_info.info[i].hi = zcopy_node_p->info.hi;
+				sys_zcopy_info.info[i].zerocopy = zcopy_node_p->info.zerocopy;
+				list_del(&zcopy_node_p->node);
+				zcopy_node_ps[i++] = zcopy_node_p;
+				if (i == SOCK_USR_ZC_INFO_MAX)
+					break;
+			}
+			spin_unlock_irqrestore(&zcopy_queue->lock, flags);
+
+			if (i > 0) {
+				sys_zcopy_info.length = i;
+				if (unlikely(copy_to_user(*(void **)CMSG_DATA(cmsg),
+							  &sys_zcopy_info,
+							  sizeof(sys_zcopy_info))
+					)) {
+					spin_lock_irqsave(&zcopy_queue->lock, flags);
+					while (i > 0)
+						list_add(&zcopy_node_ps[--i]->node,
+							 &zcopy_queue->head);
+					spin_unlock_irqrestore(&zcopy_queue->lock, flags);
+					return -EFAULT;
+				}
+
+				while (i > 0)
+					consume_skb(zcopy_node_ps[--i]->skb);
+			}
+		}
+		break;
 	default:
 		return -EINVAL;
 	}
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 1fe794967211..5adb737c4c01 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1005,7 +1005,7 @@ static int __ip_append_data(struct sock *sk,
 	    (!exthdrlen || (rt->dst.dev->features & NETIF_F_HW_ESP_TX_CSUM)))
 		csummode = CHECKSUM_PARTIAL;

-	if ((flags & MSG_ZEROCOPY) && length) {
+	if (((flags & MSG_ZEROCOPY) || (flags & MSG_ZEROCOPY_UARG)) && length) {
 		struct msghdr *msg = from;

 		if (getfrag == ip_generic_getfrag && msg->msg_ubuf) {
@@ -1022,7 +1022,9 @@ static int __ip_append_data(struct sock *sk,
 				uarg = msg->msg_ubuf;
 			}
 		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
-			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
+			bool user_args = flags & MSG_ZEROCOPY_UARG;
+
+			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb), user_args);
 			if (!uarg)
 				return -ENOBUFS;
 			extra_uref = !skb_zcopy(skb);	/* only ref on new uarg */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index e767721b3a58..6254d0eef3af 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -462,6 +462,8 @@ void tcp_init_sock(struct sock *sk)

 	set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
 	sk_sockets_allocated_inc(sk);
+
+	tx_message_zcopy_queue_init(&tp->tx_zcopy_queue);
 }
 EXPORT_SYMBOL(tcp_init_sock);

@@ -1050,14 +1052,15 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)

 	flags = msg->msg_flags;

-	if ((flags & MSG_ZEROCOPY) && size) {
+	if (((flags & MSG_ZEROCOPY) || (flags & MSG_ZEROCOPY_UARG)) && size) {
 		if (msg->msg_ubuf) {
 			uarg = msg->msg_ubuf;
 			if (sk->sk_route_caps & NETIF_F_SG)
 				zc = MSG_ZEROCOPY;
 		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
+			bool zc_uarg = flags & MSG_ZEROCOPY_UARG;
 			skb = tcp_write_queue_tail(sk);
-			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb));
+			uarg = msg_zerocopy_realloc(sk, size, skb_zcopy(skb), zc_uarg);
 			if (!uarg) {
 				err = -ENOBUFS;
 				goto out_err;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 11460d751e73..6c62aacd74d6 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1126,6 +1126,15 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		if (ipc.opt)
 			free = 1;
 		connected = 0;
+
+		/* If len is zero and flag MSG_ZEROCOPY_UARG is set,
+		 * it means this call just wants to get zcopy notifications
+		 * instead of sending packets. It is useful when users
+		 * finish sending and want to get trailing notifications.
+		 */
+		if ((msg->msg_flags & MSG_ZEROCOPY_UARG) &&
+		    sock_flag(sk, SOCK_ZEROCOPY) && len == 0)
+			return 0;
 	}
 	if (!ipc.opt) {
 		struct ip_options_rcu *inet_opt;
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index b9dd3a66e423..891526ddd74c 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1493,7 +1493,7 @@ static int __ip6_append_data(struct sock *sk,
 	    rt->dst.dev->features & (NETIF_F_IPV6_CSUM | NETIF_F_HW_CSUM))
 		csummode = CHECKSUM_PARTIAL;

-	if ((flags & MSG_ZEROCOPY) && length) {
+	if (((flags & MSG_ZEROCOPY) || (flags & MSG_ZEROCOPY_UARG)) && length) {
 		struct msghdr *msg = from;

 		if (getfrag == ip_generic_getfrag && msg->msg_ubuf) {
@@ -1510,7 +1510,8 @@ static int __ip6_append_data(struct sock *sk,
 				uarg = msg->msg_ubuf;
 			}
 		} else if (sock_flag(sk, SOCK_ZEROCOPY)) {
-			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb));
+			uarg = msg_zerocopy_realloc(sk, length, skb_zcopy(skb),
+						    flags & MSG_ZEROCOPY_UARG);
 			if (!uarg)
 				return -ENOBUFS;
 			extra_uref = !skb_zcopy(skb);	/* only ref on new uarg */
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 2e4dc5e6137b..98f6905c5db9 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -1490,6 +1490,15 @@ int udpv6_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		if (!(opt->opt_nflen|opt->opt_flen))
 			opt = NULL;
 		connected = false;
+
+		/* If len is zero and flag MSG_ZEROCOPY_UARG is set,
+		 * it means this call just wants to get zcopy notifications
+		 * instead of sending packets. It is useful when users
+		 * finish sending and want to get trailing notifications.
+		 */
+		if ((msg->msg_flags & MSG_ZEROCOPY_UARG) &&
+		    sock_flag(sk, SOCK_ZEROCOPY) && len == 0)
+			return 0;
 	}
 	if (!opt) {
 		opt = txopt_get(np);
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 16ff976a86e3..d6e6830f6ffe 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -84,7 +84,7 @@ static int virtio_transport_init_zcopy_skb(struct vsock_sock *vsk,

 		uarg = msg_zerocopy_realloc(sk_vsock(vsk),
 					    iter->count,
-					    NULL);
+					    NULL, false);
 		if (!uarg)
 			return -1;


From patchwork Tue Apr 9 20:52:59 2024
X-Patchwork-Submitter: Zijian Zhang
X-Patchwork-Id: 13623245
X-Patchwork-Delegate: kuba@kernel.org
From: zijianzhang@bytedance.com
To: netdev@vger.kernel.org
Cc: edumazet@google.com, willemdebruijn.kernel@gmail.com,
	davem@davemloft.net, kuba@kernel.org, cong.wang@bytedance.com,
	xiaochun.lu@bytedance.com, Zijian Zhang
Subject: [PATCH net-next 2/3] selftests: fix OOM problem in msg_zerocopy selftest
Date: Tue, 9 Apr 2024 20:52:59 +0000
Message-Id: <20240409205300.1346681-3-zijianzhang@bytedance.com>
X-Mailer: git-send-email 2.20.1
In-Reply-To: <20240409205300.1346681-1-zijianzhang@bytedance.com>
References: <20240409205300.1346681-1-zijianzhang@bytedance.com>

From: Zijian Zhang

In selftests/net/msg_zerocopy.c, a while loop keeps calling sendmsg on a
socket and receives completion notifications only when the socket is not
writable. Typically, it starts receiving after around 30+ sendmsg calls.
However, since commit dfa2f0483360 ("tcp: get rid of
sysctl_tcp_adv_win_scale"), the sender is always writable and never gets
a chance to receive notifications. The selftest always exits with
OUT_OF_MEMORY because the memory used by opt_skb exceeds
core.sysctl_optmem_max.

We introduce "cfg_notification_limit" to force the sender to receive
notifications after a given number of sendmsg calls. In addition, in the
selftest, we need to update skb_orphan_frags_rx to be the same as
skb_orphan_frags. In that case, for some reason, notifications no longer
arrive in order. We introduce "cfg_notification_order_check" to make the
ordering check optional.

Signed-off-by: Zijian Zhang
Signed-off-by: Xiaochun Lu
---
 tools/testing/selftests/net/msg_zerocopy.c | 24 ++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/net/msg_zerocopy.c b/tools/testing/selftests/net/msg_zerocopy.c
index bdc03a2097e8..8e595216a0af 100644
--- a/tools/testing/selftests/net/msg_zerocopy.c
+++ b/tools/testing/selftests/net/msg_zerocopy.c
@@ -85,6 +85,8 @@ static bool cfg_rx;
 static int  cfg_runtime_ms	= 4200;
 static int  cfg_verbose;
 static int  cfg_waittime_ms	= 500;
+static bool cfg_notification_order_check;
+static int  cfg_notification_limit = 32;
 static bool cfg_zerocopy;

 static socklen_t cfg_alen;
@@ -435,7 +437,7 @@ static bool do_recv_completion(int fd, int domain)
 	/* Detect notification gaps. These should not happen often, if at all.
 	 * Gaps can occur due to drops, reordering and retransmissions.
 	 */
-	if (lo != next_completion)
+	if (cfg_notification_order_check && lo != next_completion)
 		fprintf(stderr, "gap: %u..%u does not append to %u\n",
 			lo, hi, next_completion);
 	next_completion = hi + 1;
@@ -489,7 +491,7 @@ static void do_tx(int domain, int type, int protocol)
 		struct iphdr iph;
 	} nh;
 	uint64_t tstop;
-	int fd;
+	int fd, sendmsg_counter = 0;

 	fd = do_setup_tx(domain, type, protocol);

@@ -548,10 +550,18 @@ static void do_tx(int domain, int type, int protocol)
 			do_sendmsg_corked(fd, &msg);
 		else
 			do_sendmsg(fd, &msg, cfg_zerocopy, domain);
+		sendmsg_counter++;
+
+		if (sendmsg_counter == cfg_notification_limit && cfg_zerocopy) {
+			do_recv_completions(fd, domain);
+			sendmsg_counter = 0;
+		}

 		while (!do_poll(fd, POLLOUT)) {
-			if (cfg_zerocopy)
+			if (cfg_zerocopy) {
 				do_recv_completions(fd, domain);
+				sendmsg_counter = 0;
+			}
 		}

 	} while (gettimeofday_ms() < tstop);
@@ -708,7 +718,7 @@ static void parse_opts(int argc, char **argv)

 	cfg_payload_len = max_payload_len;

-	while ((c = getopt(argc, argv, "46c:C:D:i:mp:rs:S:t:vz")) != -1) {
+	while ((c = getopt(argc, argv, "46c:C:D:i:mp:rs:S:t:vzol:")) != -1) {
 		switch (c) {
 		case '4':
 			if (cfg_family != PF_UNSPEC)
@@ -760,6 +770,12 @@ static void parse_opts(int argc, char **argv)
 		case 'z':
 			cfg_zerocopy = true;
 			break;
+		case 'o':
+			cfg_notification_order_check = true;
+			break;
+		case 'l':
+			cfg_notification_limit = strtoul(optarg, NULL, 0);
+			break;
 		}
 	}


From patchwork Tue Apr 9 20:53:00 2024
X-Patchwork-Submitter: Zijian Zhang
X-Patchwork-Id: 13623246
X-Patchwork-Delegate: kuba@kernel.org
From: zijianzhang@bytedance.com
To: netdev@vger.kernel.org
Cc: edumazet@google.com, willemdebruijn.kernel@gmail.com,
	davem@davemloft.net, kuba@kernel.org, cong.wang@bytedance.com,
	xiaochun.lu@bytedance.com, Zijian Zhang
Subject: [PATCH net-next 3/3] selftests: add msg_zerocopy_uarg test
Date: Tue, 9 Apr 2024 20:53:00 +0000
Message-Id: <20240409205300.1346681-4-zijianzhang@bytedance.com>
X-Mailer: git-send-email 2.20.1
In-Reply-To: <20240409205300.1346681-1-zijianzhang@bytedance.com>
References: <20240409205300.1346681-1-zijianzhang@bytedance.com>

From: Zijian Zhang

We update selftests/net/msg_zerocopy.c to accommodate the new flag. The
original selftest retrieves notifications only when the socket is not
writable. To allow a comparison with the new flag, we introduce a new
config, "cfg_notification_limit", which forces the application to
receive notifications after a given number of sendmsg calls finish.

Test results from selftests/net/msg_zerocopy.c follow.

With cfg_notification_limit = 1, an unrealistic setting for MSG_ZEROCOPY
that approximately matches the semantics of MSG_ZEROCOPY_UARG, the new
flag has around 15% CPU savings in TCP and 28% CPU savings in UDP. The
numbers are in units of MB.

+---------------------+---------+---------+---------+---------+
| Test Type / Protocol| TCP v4  | TCP v6  | UDP v4  | UDP v6  |
+---------------------+---------+---------+---------+---------+
| Copy                | 5517    | 5345    | 9158    | 8767    |
+---------------------+---------+---------+---------+---------+
| ZCopy               | 5588    | 5439    | 8538    | 8169    |
+---------------------+---------+---------+---------+---------+
| New ZCopy           | 6517    | 6103    | 11000   | 10839   |
+---------------------+---------+---------+---------+---------+
| ZCopy / Copy        | 101.29% | 101.76% | 93.23%  | 93.18%  |
+---------------------+---------+---------+---------+---------+
| New ZCopy / Copy    | 118.13% | 114.18% | 120.11% | 123.63% |
+---------------------+---------+---------+---------+---------+

With cfg_notification_limit = 8, which means less poll + recvmsg
overhead, the new flag performs 7% better in TCP and 4% better in UDP.
The numbers are in units of MB.

+---------------------+---------+---------+---------+---------+
| Test Type / Protocol| TCP v4  | TCP v6  | UDP v4  | UDP v6  |
+---------------------+---------+---------+---------+---------+
| Copy                | 5328    | 5159    | 8581    | 8457    |
+---------------------+---------+---------+---------+---------+
| ZCopy               | 5877    | 5568    | 10314   | 10091   |
+---------------------+---------+---------+---------+---------+
| New ZCopy           | 6254    | 5901    | 10674   | 10293   |
+---------------------+---------+---------+---------+---------+
| ZCopy / Copy        | 110.30% | 107.93% | 120.20% | 119.32% |
+---------------------+---------+---------+---------+---------+
| New ZCopy / Copy    | 117.38% | 114.38% | 124.39% | 121.71% |
+---------------------+---------+---------+---------+---------+

Signed-off-by: Zijian Zhang
Signed-off-by: Xiaochun Lu
---
 tools/testing/selftests/net/msg_zerocopy.c  | 132 ++++++++++++++++++--
 tools/testing/selftests/net/msg_zerocopy.sh |   1 +
 2 files changed, 122 insertions(+), 11 deletions(-)

diff --git a/tools/testing/selftests/net/msg_zerocopy.c b/tools/testing/selftests/net/msg_zerocopy.c
index 8e595216a0af..0ca5e8509032 100644
--- a/tools/testing/selftests/net/msg_zerocopy.c
+++ b/tools/testing/selftests/net/msg_zerocopy.c
@@ -1,4 +1,5 @@
-/* Evaluate MSG_ZEROCOPY
+// SPDX-License-Identifier: GPL-2.0
+/* Evaluate MSG_ZEROCOPY && MSG_ZEROCOPY_UARG
  *
  * Send traffic between two processes over one of the supported
  * protocols and modes:
@@ -66,14 +67,29 @@
 #define SO_ZEROCOPY 60
 #endif
 
+#ifndef SO_ZEROCOPY_NOTIFICATION
+#define SO_ZEROCOPY_NOTIFICATION 78
+#endif
+
 #ifndef SO_EE_CODE_ZEROCOPY_COPIED
 #define SO_EE_CODE_ZEROCOPY_COPIED 1
 #endif
 
+#ifndef MSG_ZEROCOPY_UARG
+#define MSG_ZEROCOPY_UARG 0x2000000
+#endif
+
 #ifndef MSG_ZEROCOPY
 #define MSG_ZEROCOPY 0x4000000
 #endif
 
+#ifndef SOCK_USR_ZC_INFO_MAX
+#define SOCK_USR_ZC_INFO_MAX 8
+#endif
+
+#define ZEROCOPY_MSGERR_NOTIFICATION 1
+#define ZEROCOPY_USER_ARG_NOTIFICATION 2
+
 static int cfg_cork;
 static bool cfg_cork_mixed;
 static int cfg_cpu = -1; /* default: pin to last cpu */
@@ -87,7 +103,7 @@ static int cfg_verbose;
 static int cfg_waittime_ms = 500;
 static bool cfg_notification_order_check;
 static int cfg_notification_limit = 32;
-static bool cfg_zerocopy;
+static int cfg_zerocopy; /* 1 for MSG_ZEROCOPY, 2 for MSG_ZEROCOPY_UARG */
 
 static socklen_t cfg_alen;
 static struct sockaddr_storage cfg_dst_addr;
@@ -169,6 +185,19 @@ static int do_accept(int fd)
 	return fd;
 }
 
+static void add_zcopy_user_arg(struct msghdr *msg, void *usr_addr)
+{
+	struct cmsghdr *cm;
+
+	if (!msg->msg_control)
+		error(1, errno, "NULL user arg");
+	cm = (void *)msg->msg_control;
+	cm->cmsg_len = CMSG_LEN(sizeof(void *));
+	cm->cmsg_level = SOL_SOCKET;
+	cm->cmsg_type = SO_ZEROCOPY_NOTIFICATION;
+	memcpy(CMSG_DATA(cm), &usr_addr, sizeof(usr_addr));
+}
+
 static void add_zcopy_cookie(struct msghdr *msg, uint32_t cookie)
 {
 	struct cmsghdr *cm;
@@ -182,18 +211,55 @@ static void add_zcopy_cookie(struct msghdr *msg, uint32_t cookie)
 	memcpy(CMSG_DATA(cm), &cookie, sizeof(cookie));
 }
 
-static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy, int domain)
+static void do_recv_completion_user_arg(void *p)
+{
+	int i;
+	__u32 hi, lo, range;
+	__u8 zerocopy;
+	struct tx_usr_zcopy_info *zc_info_p = (struct tx_usr_zcopy_info *)p;
+
+	for (i = 0; i < zc_info_p->length; ++i) {
+		struct tx_msg_zcopy_info elem = zc_info_p->info[i];
+
+		hi = elem.hi;
+		lo = elem.lo;
+		zerocopy = elem.zerocopy;
+		range = hi - lo + 1;
+
+		if (cfg_notification_order_check && lo != next_completion)
+			fprintf(stderr, "gap: %u..%u does not append to %u\n",
+				lo, hi, next_completion);
+		next_completion = hi + 1;
+
+		if (zerocopied == -1)
+			zerocopied = zerocopy;
+		else if (zerocopied != zerocopy) {
+			fprintf(stderr, "serr: inconsistent\n");
+			zerocopied = zerocopy;
+		}
+
+		completions += range;
+
+		if (cfg_verbose >= 2)
+			fprintf(stderr, "completed: %u (h=%u l=%u)\n",
+				range, hi, lo);
+	}
+}
+
+static bool do_sendmsg(int fd, struct msghdr *msg, int do_zerocopy, int domain)
 {
 	int ret, len, i, flags;
 	static uint32_t cookie;
-	char ckbuf[CMSG_SPACE(sizeof(cookie))];
+	/* ckbuf is used to either hold uint32_t cookie or void *pointer */
+	char ckbuf[CMSG_SPACE(sizeof(void *))];
+	struct tx_usr_zcopy_info zc_info;
 
 	len = 0;
 	for (i = 0; i < msg->msg_iovlen; i++)
 		len += msg->msg_iov[i].iov_len;
 
 	flags = MSG_DONTWAIT;
-	if (do_zerocopy) {
+	if (do_zerocopy == ZEROCOPY_MSGERR_NOTIFICATION) {
 		flags |= MSG_ZEROCOPY;
 		if (domain == PF_RDS) {
 			memset(&msg->msg_control, 0, sizeof(msg->msg_control));
@@ -201,6 +267,12 @@ static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy, int domain)
 			msg->msg_control = (struct cmsghdr *)ckbuf;
 			add_zcopy_cookie(msg, ++cookie);
 		}
+	} else if (do_zerocopy == ZEROCOPY_USER_ARG_NOTIFICATION) {
+		flags |= MSG_ZEROCOPY_UARG;
+		memset(&zc_info, 0, sizeof(zc_info));
+		msg->msg_controllen = CMSG_SPACE(sizeof(void *));
+		msg->msg_control = (struct cmsghdr *)ckbuf;
+		add_zcopy_user_arg(msg, &zc_info);
 	}
 
 	ret = sendmsg(fd, msg, flags);
@@ -211,13 +283,16 @@ static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy, int domain)
 	if (cfg_verbose && ret != len)
 		fprintf(stderr, "send: ret=%u != %u\n", ret, len);
 
+	if (do_zerocopy == ZEROCOPY_USER_ARG_NOTIFICATION)
+		do_recv_completion_user_arg(&zc_info);
+
 	if (len) {
 		packets++;
 		bytes += ret;
 		if (do_zerocopy && ret)
 			expected_completions++;
 	}
-	if (do_zerocopy && domain == PF_RDS) {
+	if (msg->msg_control) {
 		msg->msg_control = NULL;
 		msg->msg_controllen = 0;
 	}
@@ -480,6 +555,36 @@ static void do_recv_remaining_completions(int fd, int domain)
 			completions, expected_completions);
 }
 
+static void do_new_recv_remaining_completions(int fd, struct msghdr *msg)
+{
+	int ret, flags;
+	struct tx_usr_zcopy_info zc_info;
+	int64_t tstop = gettimeofday_ms() + cfg_waittime_ms;
+	char ckbuf[CMSG_SPACE(sizeof(void *))];
+
+	flags = MSG_DONTWAIT | MSG_ZEROCOPY_UARG;
+	msg->msg_iovlen = 0;
+	msg->msg_controllen = CMSG_SPACE(sizeof(void *));
+	msg->msg_control = (struct cmsghdr *)ckbuf;
+	add_zcopy_user_arg(msg, &zc_info);
+
+	while (completions < expected_completions &&
+	       gettimeofday_ms() < tstop) {
+		memset(&zc_info, 0, sizeof(zc_info));
+		ret = sendmsg(fd, msg, flags);
+		if (ret == -1 && errno == EAGAIN)
+			return;
+		if (ret == -1)
+			error(1, errno, "send");
+
+		do_recv_completion_user_arg(&zc_info);
+	}
+
+	if (completions < expected_completions)
+		fprintf(stderr, "missing notifications: %lu < %lu\n",
+			completions, expected_completions);
+}
+
 static void do_tx(int domain, int type, int protocol)
 {
 	struct iovec iov[3] = { {0} };
@@ -552,13 +657,14 @@ static void do_tx(int domain, int type, int protocol)
 			do_sendmsg(fd, &msg, cfg_zerocopy, domain);
 		sendmsg_counter++;
 
-		if (sendmsg_counter == cfg_notification_limit && cfg_zerocopy) {
+		if (sendmsg_counter == cfg_notification_limit &&
+		    cfg_zerocopy == ZEROCOPY_MSGERR_NOTIFICATION) {
 			do_recv_completions(fd, domain);
 			sendmsg_counter = 0;
 		}
 
 		while (!do_poll(fd, POLLOUT)) {
-			if (cfg_zerocopy) {
+			if (cfg_zerocopy == ZEROCOPY_MSGERR_NOTIFICATION) {
 				do_recv_completions(fd, domain);
 				sendmsg_counter = 0;
 			}
@@ -566,8 +672,10 @@ static void do_tx(int domain, int type, int protocol)
 
 	} while (gettimeofday_ms() < tstop);
 
-	if (cfg_zerocopy)
+	if (cfg_zerocopy == ZEROCOPY_MSGERR_NOTIFICATION)
 		do_recv_remaining_completions(fd, domain);
+	else if (cfg_zerocopy == ZEROCOPY_USER_ARG_NOTIFICATION)
+		do_new_recv_remaining_completions(fd, &msg);
 
 	if (close(fd))
 		error(1, errno, "close");
@@ -718,7 +826,7 @@ static void parse_opts(int argc, char **argv)
 
 	cfg_payload_len = max_payload_len;
 
-	while ((c = getopt(argc, argv, "46c:C:D:i:mp:rs:S:t:vzol:")) != -1) {
+	while ((c = getopt(argc, argv, "46c:C:D:i:mp:rs:S:t:vzol:n")) != -1) {
 		switch (c) {
 		case '4':
 			if (cfg_family != PF_UNSPEC)
@@ -768,7 +876,7 @@ static void parse_opts(int argc, char **argv)
 			cfg_verbose++;
 			break;
 		case 'z':
-			cfg_zerocopy = true;
+			cfg_zerocopy = ZEROCOPY_MSGERR_NOTIFICATION;
 			break;
 		case 'o':
 			cfg_notification_order_check = true;
@@ -776,6 +884,9 @@ static void parse_opts(int argc, char **argv)
 		case 'l':
 			cfg_notification_limit = strtoul(optarg, NULL, 0);
 			break;
+		case 'n':
+			cfg_zerocopy = ZEROCOPY_USER_ARG_NOTIFICATION;
+			break;
 		}
 	}
 
diff --git a/tools/testing/selftests/net/msg_zerocopy.sh b/tools/testing/selftests/net/msg_zerocopy.sh
index 89c22f5320e0..022a6936d86f 100755
--- a/tools/testing/selftests/net/msg_zerocopy.sh
+++ b/tools/testing/selftests/net/msg_zerocopy.sh
@@ -118,4 +118,5 @@ do_test() {
 
 do_test "${EXTRA_ARGS}"
 do_test "-z ${EXTRA_ARGS}"
+do_test "-n ${EXTRA_ARGS}"
 echo ok
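
Editorial note: the control-message layout that add_zcopy_user_arg() relies on can be checked in userspace without a patched kernel. The sketch below is only an illustration, not part of the patch. It assumes the SO_ZEROCOPY_NOTIFICATION value of 78 proposed by this series (it is not claimed to be an upstream constant), and the helper names pack_user_arg() and unpack_user_arg() are hypothetical. It stores a pointer as the payload of a single SOL_SOCKET cmsg and reads it back using only the standard CMSG_* macros.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Assumption: value proposed by this patch series, not an upstream constant. */
#ifndef SO_ZEROCOPY_NOTIFICATION
#define SO_ZEROCOPY_NOTIFICATION 78
#endif

/*
 * Hypothetical helper: attach buf to msg as its control buffer and store
 * usr_addr as the payload of one SOL_SOCKET control message.
 * Returns 0 on success, -1 if buf is missing or too small (msg is then
 * left untouched).
 */
static int pack_user_arg(struct msghdr *msg, void *buf, size_t buflen,
			 void *usr_addr)
{
	struct cmsghdr *cm;

	if (!buf || buflen < CMSG_SPACE(sizeof(void *)))
		return -1;

	memset(buf, 0, buflen);
	msg->msg_control = buf;
	msg->msg_controllen = CMSG_SPACE(sizeof(void *));

	cm = CMSG_FIRSTHDR(msg);
	assert(cm != NULL);	/* holds because msg_controllen was just set */
	cm->cmsg_len = CMSG_LEN(sizeof(void *));
	cm->cmsg_level = SOL_SOCKET;
	cm->cmsg_type = SO_ZEROCOPY_NOTIFICATION;
	/* Copy the pointer value itself, not the memory it points to. */
	memcpy(CMSG_DATA(cm), &usr_addr, sizeof(usr_addr));
	return 0;
}

/*
 * Hypothetical helper: walk the control buffer and return the pointer
 * stored by pack_user_arg(), or NULL if no matching cmsg is present.
 */
static void *unpack_user_arg(struct msghdr *msg)
{
	struct cmsghdr *cm;
	void *usr_addr = NULL;

	for (cm = CMSG_FIRSTHDR(msg); cm; cm = CMSG_NXTHDR(msg, cm)) {
		if (cm->cmsg_level == SOL_SOCKET &&
		    cm->cmsg_type == SO_ZEROCOPY_NOTIFICATION) {
			memcpy(&usr_addr, CMSG_DATA(cm), sizeof(usr_addr));
			break;
		}
	}
	return usr_addr;
}
```

A caller would typically declare the control buffer inside a union with a struct cmsghdr member so that it is suitably aligned, pack the address of its notification structure before sendmsg(), and clear msg_control afterwards, as do_sendmsg() does in the patch.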