Message ID | 20240730184120.4089835-4-zijianzhang@bytedance.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | net: A lightweight zero-copy notification mechanism for MSG_ZEROCOPY | expand |
zijianzhang@ wrote: > From: Zijian Zhang <zijianzhang@bytedance.com> > > We update selftests/net/msg_zerocopy.c to accommodate the new mechanism, Please make commit messages stand on their own. If someone does a git blame, make the message self explanatory. Replace "the new mechanism" with sendmsg SCM_ZC_NOTIFICATION. In patch 2 or as a separate patch 4, also add a new short section on this API in Documentation/networking/msg_zerocopy.rst. Probably with the same contents as a good explanation of the feature in the commit message of patch 2. > cfg_notification_limit has the same semantics for both methods. Test > results are as follows, we update skb_orphan_frags_rx to the same as > skb_orphan_frags to support zerocopy in the localhost test. > > cfg_notification_limit = 1, both method get notifications after 1 calling > of sendmsg. In this case, the new method has around 17% cpu savings in TCP > and 23% cpu savings in UDP. > +---------------------+---------+---------+---------+---------+ > | Test Type / Protocol| TCP v4 | TCP v6 | UDP v4 | UDP v6 | > +---------------------+---------+---------+---------+---------+ > | ZCopy (MB) | 7523 | 7706 | 7489 | 7304 | > +---------------------+---------+---------+---------+---------+ > | New ZCopy (MB) | 8834 | 8993 | 9053 | 9228 | > +---------------------+---------+---------+---------+---------+ > | New ZCopy / ZCopy | 117.42% | 116.70% | 120.88% | 126.34% | > +---------------------+---------+---------+---------+---------+ > > cfg_notification_limit = 32, both get notifications after 32 calling of > sendmsg, which means more chances to coalesce notifications, and less > overhead of poll + recvmsg for the original method. In this case, the new > method has around 7% cpu savings in TCP and slightly better cpu usage in > UDP. In the env of selftest, notifications of TCP are more likely to be > out of order than UDP, it's easier to coalesce more notifications in UDP. > The original method can get one notification with range of 32 in a recvmsg > most of the time. In TCP, most notifications' range is around 2, so the > original method needs around 16 recvmsgs to get notified in one round. > That's the reason for the "New ZCopy / ZCopy" diff in TCP and UDP here. > +---------------------+---------+---------+---------+---------+ > | Test Type / Protocol| TCP v4 | TCP v6 | UDP v4 | UDP v6 | > +---------------------+---------+---------+---------+---------+ > | ZCopy (MB) | 8842 | 8735 | 10072 | 9380 | > +---------------------+---------+---------+---------+---------+ > | New ZCopy (MB) | 9366 | 9477 | 10108 | 9385 | > +---------------------+---------+---------+---------+---------+ > | New ZCopy / ZCopy | 106.00% | 108.28% | 100.31% | 100.01% | > +---------------------+---------+---------+---------+---------+ > > In conclusion, when notification interval is small or notifications are > hard to be coalesced, the new mechanism is highly recommended. Otherwise, > the performance gain from the new mechanism is very limited. > > Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com> > Signed-off-by: Xiaochun Lu <xiaochun.lu@bytedance.com> > -static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy, int domain) > +static void add_zcopy_info(struct msghdr *msg) > +{ > + struct zc_info *zc_info; > + struct cmsghdr *cm; > + > + if (!msg->msg_control) > + error(1, errno, "NULL user arg"); Don't add precondition checks for code entirely under your control. This is not a user API. > + cm = (struct cmsghdr *)msg->msg_control; > + cm->cmsg_len = CMSG_LEN(ZC_INFO_SIZE); > + cm->cmsg_level = SOL_SOCKET; > + cm->cmsg_type = SCM_ZC_NOTIFICATION; > + > + zc_info = (struct zc_info *)CMSG_DATA(cm); > + zc_info->size = ZC_NOTIFICATION_MAX; > + > + added_zcopy_info = true; Just initialize every time? Is this here to reuse the same msg_control as long as metadata is returned? > +} > + > +static bool do_sendmsg(int fd, struct msghdr *msg, > + enum notification_type do_zerocopy, int domain) > { > int ret, len, i, flags; > static uint32_t cookie; > @@ -200,6 +233,12 @@ static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy, int domain) > msg->msg_controllen = CMSG_SPACE(sizeof(cookie)); > msg->msg_control = (struct cmsghdr *)ckbuf; > add_zcopy_cookie(msg, ++cookie); > + } else if (do_zerocopy == MSG_ZEROCOPY_NOTIFY_SENDMSG && > + sends_since_notify + 1 >= cfg_notification_limit) { > + memset(&msg->msg_control, 0, sizeof(msg->msg_control)); > + msg->msg_controllen = CMSG_SPACE(ZC_INFO_SIZE); > + msg->msg_control = (struct cmsghdr *)zc_ckbuf; > + add_zcopy_info(msg); > } > } > > @@ -218,7 +257,7 @@ static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy, int domain) > if (do_zerocopy && ret) > expected_completions++; > } > - if (do_zerocopy && domain == PF_RDS) { > + if (msg->msg_control) { > msg->msg_control = NULL; > msg->msg_controllen = 0; > } > @@ -466,6 +505,44 @@ static void do_recv_completions(int fd, int domain) > sends_since_notify = 0; > } > > +static void do_recv_completions2(void) functionname2 is very uninformative. do_recv_completions_sendmsg or so. > +{ > + struct cmsghdr *cm = (struct cmsghdr *)zc_ckbuf; > + struct zc_info *zc_info; > + __u32 hi, lo, range; > + __u8 zerocopy; > + int i; > + > + zc_info = (struct zc_info *)CMSG_DATA(cm); > + for (i = 0; i < zc_info->size; i++) { > + hi = zc_info->arr[i].hi; > + lo = zc_info->arr[i].lo; > + zerocopy = zc_info->arr[i].zerocopy; > + range = hi - lo + 1; > + > + if (cfg_verbose && lo != next_completion) > + fprintf(stderr, "gap: %u..%u does not append to %u\n", > + lo, hi, next_completion); > + next_completion = hi + 1; > + > + if (zerocopied == -1) { > + zerocopied = zerocopy; > + } else if (zerocopied != zerocopy) { > + fprintf(stderr, "serr: inconsistent\n"); > + zerocopied = zerocopy; > + } > + > + completions += range; > + sends_since_notify -= range; > + > + if (cfg_verbose >= 2) > + fprintf(stderr, "completed: %u (h=%u l=%u)\n", > + range, hi, lo); > + } > + > + added_zcopy_info = false; > +} > + > /* Wait for all remaining completions on the errqueue */ > static void do_recv_remaining_completions(int fd, int domain) > { > @@ -553,11 +630,16 @@ static void do_tx(int domain, int type, int protocol) > else > do_sendmsg(fd, &msg, cfg_zerocopy, domain); > > - if (cfg_zerocopy && sends_since_notify >= cfg_notification_limit) > + if (cfg_zerocopy == MSG_ZEROCOPY_NOTIFY_ERRQUEUE && > + sends_since_notify >= cfg_notification_limit) > do_recv_completions(fd, domain); > > + if (cfg_zerocopy == MSG_ZEROCOPY_NOTIFY_SENDMSG && > + added_zcopy_info) > + do_recv_completions2(); > + > while (!do_poll(fd, POLLOUT)) { > - if (cfg_zerocopy) > + if (cfg_zerocopy == MSG_ZEROCOPY_NOTIFY_ERRQUEUE) > do_recv_completions(fd, domain); > } > > @@ -715,7 +797,7 @@ static void parse_opts(int argc, char **argv) > > cfg_payload_len = max_payload_len; > > - while ((c = getopt(argc, argv, "46c:C:D:i:l:mp:rs:S:t:vz")) != -1) { > + while ((c = getopt(argc, argv, "46c:C:D:i:l:mnp:rs:S:t:vz")) != -1) { > switch (c) { > case '4': > if (cfg_family != PF_UNSPEC) > @@ -749,6 +831,9 @@ static void parse_opts(int argc, char **argv) > case 'm': > cfg_cork_mixed = true; > break; > + case 'n': > + cfg_zerocopy = MSG_ZEROCOPY_NOTIFY_SENDMSG; > + break; How about -Z to make clear that this is still MSG_ZEROCOPY, just with a different notification mechanism. And perhaps add a testcase that exercises both this mechanism and existing recvmsg MSG_ERRQUEUE. As they should work in parallel and concurrently in a multithreaded environment.
On 7/31/24 3:32 PM, Willem de Bruijn wrote: > zijianzhang@ wrote: >> From: Zijian Zhang <zijianzhang@bytedance.com> >> >> We update selftests/net/msg_zerocopy.c to accommodate the new mechanism, First of all, thanks for the detailed suggestions! > > Please make commit messages stand on their own. If someone does a git > blame, make the message self explanatory. Replace "the new mechanism" > with sendmsg SCM_ZC_NOTIFICATION. > > In patch 2 or as a separate patch 4, also add a new short section on > this API in Documentation/networking/msg_zerocopy.rst. Probably with > the same contents as a good explanation of the feature in the commit > message of patch 2. > Agreed. >> cfg_notification_limit has the same semantics for both methods. Test >> results are as follows, we update skb_orphan_frags_rx to the same as >> skb_orphan_frags to support zerocopy in the localhost test. >> >> cfg_notification_limit = 1, both method get notifications after 1 calling >> of sendmsg. In this case, the new method has around 17% cpu savings in TCP >> and 23% cpu savings in UDP. >> +---------------------+---------+---------+---------+---------+ >> | Test Type / Protocol| TCP v4 | TCP v6 | UDP v4 | UDP v6 | >> +---------------------+---------+---------+---------+---------+ >> | ZCopy (MB) | 7523 | 7706 | 7489 | 7304 | >> +---------------------+---------+---------+---------+---------+ >> | New ZCopy (MB) | 8834 | 8993 | 9053 | 9228 | >> +---------------------+---------+---------+---------+---------+ >> | New ZCopy / ZCopy | 117.42% | 116.70% | 120.88% | 126.34% | >> +---------------------+---------+---------+---------+---------+ >> >> cfg_notification_limit = 32, both get notifications after 32 calling of >> sendmsg, which means more chances to coalesce notifications, and less >> overhead of poll + recvmsg for the original method. In this case, the new >> method has around 7% cpu savings in TCP and slightly better cpu usage in >> UDP. In the env of selftest, notifications of TCP are more likely to be >> out of order than UDP, it's easier to coalesce more notifications in UDP. >> The original method can get one notification with range of 32 in a recvmsg >> most of the time. In TCP, most notifications' range is around 2, so the >> original method needs around 16 recvmsgs to get notified in one round. >> That's the reason for the "New ZCopy / ZCopy" diff in TCP and UDP here. >> +---------------------+---------+---------+---------+---------+ >> | Test Type / Protocol| TCP v4 | TCP v6 | UDP v4 | UDP v6 | >> +---------------------+---------+---------+---------+---------+ >> | ZCopy (MB) | 8842 | 8735 | 10072 | 9380 | >> +---------------------+---------+---------+---------+---------+ >> | New ZCopy (MB) | 9366 | 9477 | 10108 | 9385 | >> +---------------------+---------+---------+---------+---------+ >> | New ZCopy / ZCopy | 106.00% | 108.28% | 100.31% | 100.01% | >> +---------------------+---------+---------+---------+---------+ >> >> In conclusion, when notification interval is small or notifications are >> hard to be coalesced, the new mechanism is highly recommended. Otherwise, >> the performance gain from the new mechanism is very limited. >> >> Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com> >> Signed-off-by: Xiaochun Lu <xiaochun.lu@bytedance.com> > >> -static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy, int domain) >> +static void add_zcopy_info(struct msghdr *msg) >> +{ >> + struct zc_info *zc_info; >> + struct cmsghdr *cm; >> + >> + if (!msg->msg_control) >> + error(1, errno, "NULL user arg"); > > Don't add precondition checks for code entirely under your control. > This is not a user API. > Ack. >> + cm = (struct cmsghdr *)msg->msg_control; >> + cm->cmsg_len = CMSG_LEN(ZC_INFO_SIZE); >> + cm->cmsg_level = SOL_SOCKET; >> + cm->cmsg_type = SCM_ZC_NOTIFICATION; >> + >> + zc_info = (struct zc_info *)CMSG_DATA(cm); >> + zc_info->size = ZC_NOTIFICATION_MAX; >> + >> + added_zcopy_info = true; > > Just initialize every time? Is this here to reuse the same msg_control > as long as metadata is returned? > Yes, the same msg_control will be reused. The overall paradiagm is, start: sendmsg(..) sendmsg(..) ... sends_since_notify sendmsgs in total add_zcopy_info(..) sendmsg(.., msg_control) do_recv_completions_sendmsg(..) goto start; if (sends_since_notify + 1 >= cfg_notification_limit), add_zcopy_info will be invoked, and the right next sendmsg will have the msg_control passed in. If (added_zcopy_info), do_recv_completions_sendmsg will be invoked, and added_zcopy_info will be set to false in it. >> +} >> + >> +static bool do_sendmsg(int fd, struct msghdr *msg, >> + enum notification_type do_zerocopy, int domain) >> { >> int ret, len, i, flags; >> static uint32_t cookie; >> @@ -200,6 +233,12 @@ static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy, int domain) >> msg->msg_controllen = CMSG_SPACE(sizeof(cookie)); >> msg->msg_control = (struct cmsghdr *)ckbuf; >> add_zcopy_cookie(msg, ++cookie); >> + } else if (do_zerocopy == MSG_ZEROCOPY_NOTIFY_SENDMSG && >> + sends_since_notify + 1 >= cfg_notification_limit) { >> + memset(&msg->msg_control, 0, sizeof(msg->msg_control)); >> + msg->msg_controllen = CMSG_SPACE(ZC_INFO_SIZE); >> + msg->msg_control = (struct cmsghdr *)zc_ckbuf; >> + add_zcopy_info(msg); >> } >> } >> >> @@ -218,7 +257,7 @@ static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy, int domain) >> if (do_zerocopy && ret) >> expected_completions++; >> } >> - if (do_zerocopy && domain == PF_RDS) { >> + if (msg->msg_control) { >> msg->msg_control = NULL; >> msg->msg_controllen = 0; >> } >> @@ -466,6 +505,44 @@ static void do_recv_completions(int fd, int domain) >> sends_since_notify = 0; >> } >> >> +static void do_recv_completions2(void) > > functionname2 is very uninformative. > > do_recv_completions_sendmsg or so. > Ack. >> +{ >> + struct cmsghdr *cm = (struct cmsghdr *)zc_ckbuf; >> + struct zc_info *zc_info; >> + __u32 hi, lo, range; >> + __u8 zerocopy; >> + int i; >> + >> + zc_info = (struct zc_info *)CMSG_DATA(cm); >> + for (i = 0; i < zc_info->size; i++) { >> + hi = zc_info->arr[i].hi; >> + lo = zc_info->arr[i].lo; >> + zerocopy = zc_info->arr[i].zerocopy; >> + range = hi - lo + 1; >> + >> + if (cfg_verbose && lo != next_completion) >> + fprintf(stderr, "gap: %u..%u does not append to %u\n", >> + lo, hi, next_completion); >> + next_completion = hi + 1; >> + >> + if (zerocopied == -1) { >> + zerocopied = zerocopy; >> + } else if (zerocopied != zerocopy) { >> + fprintf(stderr, "serr: inconsistent\n"); >> + zerocopied = zerocopy; >> + } >> + >> + completions += range; >> + sends_since_notify -= range; >> + >> + if (cfg_verbose >= 2) >> + fprintf(stderr, "completed: %u (h=%u l=%u)\n", >> + range, hi, lo); >> + } >> + >> + added_zcopy_info = false; >> +} >> + >> /* Wait for all remaining completions on the errqueue */ >> static void do_recv_remaining_completions(int fd, int domain) >> { >> @@ -553,11 +630,16 @@ static void do_tx(int domain, int type, int protocol) >> else >> do_sendmsg(fd, &msg, cfg_zerocopy, domain); >> >> - if (cfg_zerocopy && sends_since_notify >= cfg_notification_limit) >> + if (cfg_zerocopy == MSG_ZEROCOPY_NOTIFY_ERRQUEUE && >> + sends_since_notify >= cfg_notification_limit) >> do_recv_completions(fd, domain); >> >> + if (cfg_zerocopy == MSG_ZEROCOPY_NOTIFY_SENDMSG && >> + added_zcopy_info) >> + do_recv_completions2(); >> + >> while (!do_poll(fd, POLLOUT)) { >> - if (cfg_zerocopy) >> + if (cfg_zerocopy == MSG_ZEROCOPY_NOTIFY_ERRQUEUE) >> do_recv_completions(fd, domain); >> } >> >> @@ -715,7 +797,7 @@ static void parse_opts(int argc, char **argv) >> >> cfg_payload_len = max_payload_len; >> >> - while ((c = getopt(argc, argv, "46c:C:D:i:l:mp:rs:S:t:vz")) != -1) { >> + while ((c = getopt(argc, argv, "46c:C:D:i:l:mnp:rs:S:t:vz")) != -1) { >> switch (c) { >> case '4': >> if (cfg_family != PF_UNSPEC) >> @@ -749,6 +831,9 @@ static void parse_opts(int argc, char **argv) >> case 'm': >> cfg_cork_mixed = true; >> break; >> + case 'n': >> + cfg_zerocopy = MSG_ZEROCOPY_NOTIFY_SENDMSG; >> + break; > > How about -Z to make clear that this is still MSG_ZEROCOPY, just with > a different notification mechanism. > > And perhaps add a testcase that exercises both this mechanism and > existing recvmsg MSG_ERRQUEUE. As they should work in parallel and > concurrently in a multithreaded environment. > -Z is more clear, and the hybrid testcase will be helpful. Btw, before I put some efforts to solve the current issues, I think I should wait for comments about api change from linux-api@vger.kernel.org?
On Thu, Aug 1, 2024 at 1:30 PM Zijian Zhang <zijianzhang@bytedance.com> wrote: > > On 7/31/24 3:32 PM, Willem de Bruijn wrote: > > zijianzhang@ wrote: > >> From: Zijian Zhang <zijianzhang@bytedance.com> > >> > >> We update selftests/net/msg_zerocopy.c to accommodate the new mechanism, > > First of all, thanks for the detailed suggestions! > > > > > Please make commit messages stand on their own. If someone does a git > > blame, make the message self explanatory. Replace "the new mechanism" > > with sendmsg SCM_ZC_NOTIFICATION. > > > > In patch 2 or as a separate patch 4, also add a new short section on > > this API in Documentation/networking/msg_zerocopy.rst. Probably with > > the same contents as a good explanation of the feature in the commit > > message of patch 2. > > > > Agreed. > > >> cfg_notification_limit has the same semantics for both methods. Test > >> results are as follows, we update skb_orphan_frags_rx to the same as > >> skb_orphan_frags to support zerocopy in the localhost test. > >> > >> cfg_notification_limit = 1, both method get notifications after 1 calling > >> of sendmsg. In this case, the new method has around 17% cpu savings in TCP > >> and 23% cpu savings in UDP. > >> +---------------------+---------+---------+---------+---------+ > >> | Test Type / Protocol| TCP v4 | TCP v6 | UDP v4 | UDP v6 | > >> +---------------------+---------+---------+---------+---------+ > >> | ZCopy (MB) | 7523 | 7706 | 7489 | 7304 | > >> +---------------------+---------+---------+---------+---------+ > >> | New ZCopy (MB) | 8834 | 8993 | 9053 | 9228 | > >> +---------------------+---------+---------+---------+---------+ > >> | New ZCopy / ZCopy | 117.42% | 116.70% | 120.88% | 126.34% | > >> +---------------------+---------+---------+---------+---------+ > >> > >> cfg_notification_limit = 32, both get notifications after 32 calling of > >> sendmsg, which means more chances to coalesce notifications, and less > >> overhead of poll + recvmsg for the original method. In this case, the new > >> method has around 7% cpu savings in TCP and slightly better cpu usage in > >> UDP. In the env of selftest, notifications of TCP are more likely to be > >> out of order than UDP, it's easier to coalesce more notifications in UDP. > >> The original method can get one notification with range of 32 in a recvmsg > >> most of the time. In TCP, most notifications' range is around 2, so the > >> original method needs around 16 recvmsgs to get notified in one round. > >> That's the reason for the "New ZCopy / ZCopy" diff in TCP and UDP here. > >> +---------------------+---------+---------+---------+---------+ > >> | Test Type / Protocol| TCP v4 | TCP v6 | UDP v4 | UDP v6 | > >> +---------------------+---------+---------+---------+---------+ > >> | ZCopy (MB) | 8842 | 8735 | 10072 | 9380 | > >> +---------------------+---------+---------+---------+---------+ > >> | New ZCopy (MB) | 9366 | 9477 | 10108 | 9385 | > >> +---------------------+---------+---------+---------+---------+ > >> | New ZCopy / ZCopy | 106.00% | 108.28% | 100.31% | 100.01% | > >> +---------------------+---------+---------+---------+---------+ > >> > >> In conclusion, when notification interval is small or notifications are > >> hard to be coalesced, the new mechanism is highly recommended. Otherwise, > >> the performance gain from the new mechanism is very limited. > >> > >> Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com> > >> Signed-off-by: Xiaochun Lu <xiaochun.lu@bytedance.com> > > > >> -static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy, int domain) > >> +static void add_zcopy_info(struct msghdr *msg) > >> +{ > >> + struct zc_info *zc_info; > >> + struct cmsghdr *cm; > >> + > >> + if (!msg->msg_control) > >> + error(1, errno, "NULL user arg"); > > > > Don't add precondition checks for code entirely under your control. > > This is not a user API. > > > > Ack. > > >> + cm = (struct cmsghdr *)msg->msg_control; > >> + cm->cmsg_len = CMSG_LEN(ZC_INFO_SIZE); > >> + cm->cmsg_level = SOL_SOCKET; > >> + cm->cmsg_type = SCM_ZC_NOTIFICATION; > >> + > >> + zc_info = (struct zc_info *)CMSG_DATA(cm); > >> + zc_info->size = ZC_NOTIFICATION_MAX; > >> + > >> + added_zcopy_info = true; > > > > Just initialize every time? Is this here to reuse the same msg_control > > as long as metadata is returned? > > > > Yes, the same msg_control will be reused. > > The overall paradiagm is, > start: > sendmsg(..) > sendmsg(..) > ... sends_since_notify sendmsgs in total > > add_zcopy_info(..) > sendmsg(.., msg_control) > do_recv_completions_sendmsg(..) > goto start; > > if (sends_since_notify + 1 >= cfg_notification_limit), add_zcopy_info > will be invoked, and the right next sendmsg will have the msg_control > passed in. > > If (added_zcopy_info), do_recv_completions_sendmsg will be invoked, > and added_zcopy_info will be set to false in it. This does not seem like it would need a global variable? > >> +} > >> + > >> +static bool do_sendmsg(int fd, struct msghdr *msg, > >> + enum notification_type do_zerocopy, int domain) > >> { > >> int ret, len, i, flags; > >> static uint32_t cookie; > >> @@ -200,6 +233,12 @@ static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy, int domain) > >> msg->msg_controllen = CMSG_SPACE(sizeof(cookie)); > >> msg->msg_control = (struct cmsghdr *)ckbuf; > >> add_zcopy_cookie(msg, ++cookie); > >> + } else if (do_zerocopy == MSG_ZEROCOPY_NOTIFY_SENDMSG && > >> + sends_since_notify + 1 >= cfg_notification_limit) { > >> + memset(&msg->msg_control, 0, sizeof(msg->msg_control)); > >> + msg->msg_controllen = CMSG_SPACE(ZC_INFO_SIZE); > >> + msg->msg_control = (struct cmsghdr *)zc_ckbuf; > >> + add_zcopy_info(msg); > >> } > >> } > >> > >> @@ -218,7 +257,7 @@ static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy, int domain) > >> if (do_zerocopy && ret) > >> expected_completions++; > >> } > >> - if (do_zerocopy && domain == PF_RDS) { > >> + if (msg->msg_control) { > >> msg->msg_control = NULL; > >> msg->msg_controllen = 0; > >> } > >> @@ -466,6 +505,44 @@ static void do_recv_completions(int fd, int domain) > >> sends_since_notify = 0; > >> } > >> > >> +static void do_recv_completions2(void) > > > > functionname2 is very uninformative. > > > > do_recv_completions_sendmsg or so. > > > > Ack. > > >> +{ > >> + struct cmsghdr *cm = (struct cmsghdr *)zc_ckbuf; > >> + struct zc_info *zc_info; > >> + __u32 hi, lo, range; > >> + __u8 zerocopy; > >> + int i; > >> + > >> + zc_info = (struct zc_info *)CMSG_DATA(cm); > >> + for (i = 0; i < zc_info->size; i++) { > >> + hi = zc_info->arr[i].hi; > >> + lo = zc_info->arr[i].lo; > >> + zerocopy = zc_info->arr[i].zerocopy; > >> + range = hi - lo + 1; > >> + > >> + if (cfg_verbose && lo != next_completion) > >> + fprintf(stderr, "gap: %u..%u does not append to %u\n", > >> + lo, hi, next_completion); > >> + next_completion = hi + 1; > >> + > >> + if (zerocopied == -1) { > >> + zerocopied = zerocopy; > >> + } else if (zerocopied != zerocopy) { > >> + fprintf(stderr, "serr: inconsistent\n"); > >> + zerocopied = zerocopy; > >> + } > >> + > >> + completions += range; > >> + sends_since_notify -= range; > >> + > >> + if (cfg_verbose >= 2) > >> + fprintf(stderr, "completed: %u (h=%u l=%u)\n", > >> + range, hi, lo); > >> + } > >> + > >> + added_zcopy_info = false; > >> +} > >> + > >> /* Wait for all remaining completions on the errqueue */ > >> static void do_recv_remaining_completions(int fd, int domain) > >> { > >> @@ -553,11 +630,16 @@ static void do_tx(int domain, int type, int protocol) > >> else > >> do_sendmsg(fd, &msg, cfg_zerocopy, domain); > >> > >> - if (cfg_zerocopy && sends_since_notify >= cfg_notification_limit) > >> + if (cfg_zerocopy == MSG_ZEROCOPY_NOTIFY_ERRQUEUE && > >> + sends_since_notify >= cfg_notification_limit) > >> do_recv_completions(fd, domain); > >> > >> + if (cfg_zerocopy == MSG_ZEROCOPY_NOTIFY_SENDMSG && > >> + added_zcopy_info) > >> + do_recv_completions2(); > >> + > >> while (!do_poll(fd, POLLOUT)) { > >> - if (cfg_zerocopy) > >> + if (cfg_zerocopy == MSG_ZEROCOPY_NOTIFY_ERRQUEUE) > >> do_recv_completions(fd, domain); > >> } > >> > >> @@ -715,7 +797,7 @@ static void parse_opts(int argc, char **argv) > >> > >> cfg_payload_len = max_payload_len; > >> > >> - while ((c = getopt(argc, argv, "46c:C:D:i:l:mp:rs:S:t:vz")) != -1) { > >> + while ((c = getopt(argc, argv, "46c:C:D:i:l:mnp:rs:S:t:vz")) != -1) { > >> switch (c) { > >> case '4': > >> if (cfg_family != PF_UNSPEC) > >> @@ -749,6 +831,9 @@ static void parse_opts(int argc, char **argv) > >> case 'm': > >> cfg_cork_mixed = true; > >> break; > >> + case 'n': > >> + cfg_zerocopy = MSG_ZEROCOPY_NOTIFY_SENDMSG; > >> + break; > > > > How about -Z to make clear that this is still MSG_ZEROCOPY, just with > > a different notification mechanism. > > > > And perhaps add a testcase that exercises both this mechanism and > > existing recvmsg MSG_ERRQUEUE. As they should work in parallel and > > concurrently in a multithreaded environment. > > > > -Z is more clear, and the hybrid testcase will be helpful. > > Btw, before I put some efforts to solve the current issues, I think > I should wait for comments about api change from linux-api@vger.kernel.org? I'm not sure whether anyone on that list will give feedback. I would continue with revisions at a normal schedule, as long as that stays in the Cc.
On 8/1/24 10:36 AM, Willem de Bruijn wrote: > On Thu, Aug 1, 2024 at 1:30 PM Zijian Zhang <zijianzhang@bytedance.com> wrote: >>> >>>> -static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy, int domain) >>>> +static void add_zcopy_info(struct msghdr *msg) >>>> +{ >>>> + struct zc_info *zc_info; >>>> + struct cmsghdr *cm; >>>> + >>>> + if (!msg->msg_control) >>>> + error(1, errno, "NULL user arg"); >>> >>> Don't add precondition checks for code entirely under your control. >>> This is not a user API. >>> >> >> Ack. >> >>>> + cm = (struct cmsghdr *)msg->msg_control; >>>> + cm->cmsg_len = CMSG_LEN(ZC_INFO_SIZE); >>>> + cm->cmsg_level = SOL_SOCKET; >>>> + cm->cmsg_type = SCM_ZC_NOTIFICATION; >>>> + >>>> + zc_info = (struct zc_info *)CMSG_DATA(cm); >>>> + zc_info->size = ZC_NOTIFICATION_MAX; >>>> + >>>> + added_zcopy_info = true; >>> >>> Just initialize every time? Is this here to reuse the same msg_control >>> as long as metadata is returned? >>> >> >> Yes, the same msg_control will be reused. >> >> The overall paradiagm is, >> start: >> sendmsg(..) >> sendmsg(..) >> ... sends_since_notify sendmsgs in total >> >> add_zcopy_info(..) >> sendmsg(.., msg_control) >> do_recv_completions_sendmsg(..) >> goto start; >> >> if (sends_since_notify + 1 >= cfg_notification_limit), add_zcopy_info >> will be invoked, and the right next sendmsg will have the msg_control >> passed in. >> >> If (added_zcopy_info), do_recv_completions_sendmsg will be invoked, >> and added_zcopy_info will be set to false in it. > > This does not seem like it would need a global variable? > Agreed, maybe I can use sends_since_notify to check whether we need to do_recv_completions_sendmsg, then we get rid of added_zcopy_info. >> Btw, before I put some efforts to solve the current issues, I think >> I should wait for comments about api change from linux-api@vger.kernel.org? > > I'm not sure whether anyone on that list will give feedback. > > I would continue with revisions at a normal schedule, as long as that > stays in the Cc. Got it, thanks
diff --git a/tools/testing/selftests/net/msg_zerocopy.c b/tools/testing/selftests/net/msg_zerocopy.c index 7ea5fb28c93d..cf227f0011b5 100644 --- a/tools/testing/selftests/net/msg_zerocopy.c +++ b/tools/testing/selftests/net/msg_zerocopy.c @@ -66,6 +66,10 @@ #define SO_ZEROCOPY 60 #endif +#ifndef SCM_ZC_NOTIFICATION +#define SCM_ZC_NOTIFICATION 78 +#endif + #ifndef SO_EE_CODE_ZEROCOPY_COPIED #define SO_EE_CODE_ZEROCOPY_COPIED 1 #endif @@ -74,6 +78,14 @@ #define MSG_ZEROCOPY 0x4000000 #endif +#define ZC_INFO_ARR_SIZE (ZC_NOTIFICATION_MAX * sizeof(struct zc_info_elem)) +#define ZC_INFO_SIZE (sizeof(struct zc_info) + ZC_INFO_ARR_SIZE) + +enum notification_type { + MSG_ZEROCOPY_NOTIFY_ERRQUEUE = 1, + MSG_ZEROCOPY_NOTIFY_SENDMSG = 2, +}; + static int cfg_cork; static bool cfg_cork_mixed; static int cfg_cpu = -1; /* default: pin to last cpu */ @@ -86,7 +98,7 @@ static int cfg_runtime_ms = 4200; static int cfg_verbose; static int cfg_waittime_ms = 500; static int cfg_notification_limit = 32; -static bool cfg_zerocopy; +static enum notification_type cfg_zerocopy; static socklen_t cfg_alen; static struct sockaddr_storage cfg_dst_addr; @@ -97,6 +109,8 @@ static long packets, bytes, completions, expected_completions; static int zerocopied = -1; static uint32_t next_completion; static uint32_t sends_since_notify; +static char zc_ckbuf[CMSG_SPACE(ZC_INFO_SIZE)]; +static bool added_zcopy_info; static unsigned long gettimeofday_ms(void) { @@ -182,7 +196,26 @@ static void add_zcopy_cookie(struct msghdr *msg, uint32_t cookie) memcpy(CMSG_DATA(cm), &cookie, sizeof(cookie)); } -static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy, int domain) +static void add_zcopy_info(struct msghdr *msg) +{ + struct zc_info *zc_info; + struct cmsghdr *cm; + + if (!msg->msg_control) + error(1, errno, "NULL user arg"); + cm = (struct cmsghdr *)msg->msg_control; + cm->cmsg_len = CMSG_LEN(ZC_INFO_SIZE); + cm->cmsg_level = SOL_SOCKET; + cm->cmsg_type = SCM_ZC_NOTIFICATION; + + zc_info = (struct zc_info *)CMSG_DATA(cm); + zc_info->size = ZC_NOTIFICATION_MAX; + + added_zcopy_info = true; +} + +static bool do_sendmsg(int fd, struct msghdr *msg, + enum notification_type do_zerocopy, int domain) { int ret, len, i, flags; static uint32_t cookie; @@ -200,6 +233,12 @@ static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy, int domain) msg->msg_controllen = CMSG_SPACE(sizeof(cookie)); msg->msg_control = (struct cmsghdr *)ckbuf; add_zcopy_cookie(msg, ++cookie); + } else if (do_zerocopy == MSG_ZEROCOPY_NOTIFY_SENDMSG && + sends_since_notify + 1 >= cfg_notification_limit) { + memset(&msg->msg_control, 0, sizeof(msg->msg_control)); + msg->msg_controllen = CMSG_SPACE(ZC_INFO_SIZE); + msg->msg_control = (struct cmsghdr *)zc_ckbuf; + add_zcopy_info(msg); } } @@ -218,7 +257,7 @@ static bool do_sendmsg(int fd, struct msghdr *msg, bool do_zerocopy, int domain) if (do_zerocopy && ret) expected_completions++; } - if (do_zerocopy && domain == PF_RDS) { + if (msg->msg_control) { msg->msg_control = NULL; msg->msg_controllen = 0; } @@ -466,6 +505,44 @@ static void do_recv_completions(int fd, int domain) sends_since_notify = 0; } +static void do_recv_completions2(void) +{ + struct cmsghdr *cm = (struct cmsghdr *)zc_ckbuf; + struct zc_info *zc_info; + __u32 hi, lo, range; + __u8 zerocopy; + int i; + + zc_info = (struct zc_info *)CMSG_DATA(cm); + for (i = 0; i < zc_info->size; i++) { + hi = zc_info->arr[i].hi; + lo = zc_info->arr[i].lo; + zerocopy = zc_info->arr[i].zerocopy; + range = hi - lo + 1; + + if (cfg_verbose && lo != next_completion) + fprintf(stderr, "gap: %u..%u does not append to %u\n", + lo, hi, next_completion); + next_completion = hi + 1; + + if (zerocopied == -1) { + zerocopied = zerocopy; + } else if (zerocopied != zerocopy) { + fprintf(stderr, "serr: inconsistent\n"); + zerocopied = zerocopy; + } + + completions += range; + sends_since_notify -= range; + + if (cfg_verbose >= 2) + fprintf(stderr, "completed: %u (h=%u l=%u)\n", + range, hi, lo); + } + + added_zcopy_info = false; +} + /* Wait for all remaining completions on the errqueue */ static void do_recv_remaining_completions(int fd, int domain) { @@ -553,11 +630,16 @@ static void do_tx(int domain, int type, int protocol) else do_sendmsg(fd, &msg, cfg_zerocopy, domain); - if (cfg_zerocopy && sends_since_notify >= cfg_notification_limit) + if (cfg_zerocopy == MSG_ZEROCOPY_NOTIFY_ERRQUEUE && + sends_since_notify >= cfg_notification_limit) do_recv_completions(fd, domain); + if (cfg_zerocopy == MSG_ZEROCOPY_NOTIFY_SENDMSG && + added_zcopy_info) + do_recv_completions2(); + while (!do_poll(fd, POLLOUT)) { - if (cfg_zerocopy) + if (cfg_zerocopy == MSG_ZEROCOPY_NOTIFY_ERRQUEUE) do_recv_completions(fd, domain); } @@ -715,7 +797,7 @@ static void parse_opts(int argc, char **argv) cfg_payload_len = max_payload_len; - while ((c = getopt(argc, argv, "46c:C:D:i:l:mp:rs:S:t:vz")) != -1) { + while ((c = getopt(argc, argv, "46c:C:D:i:l:mnp:rs:S:t:vz")) != -1) { switch (c) { case '4': if (cfg_family != PF_UNSPEC) @@ -749,6 +831,9 @@ static void parse_opts(int argc, char **argv) case 'm': cfg_cork_mixed = true; break; + case 'n': + cfg_zerocopy = MSG_ZEROCOPY_NOTIFY_SENDMSG; + break; case 'p': cfg_port = strtoul(optarg, NULL, 0); break; @@ -768,7 +853,7 @@ static void parse_opts(int argc, char **argv) cfg_verbose++; break; case 'z': - cfg_zerocopy = true; + cfg_zerocopy = MSG_ZEROCOPY_NOTIFY_ERRQUEUE; break; } } @@ -779,6 +864,8 @@ static void parse_opts(int argc, char **argv) error(1, 0, "-D <server addr> required for PF_RDS\n"); if (!cfg_rx && !saddr) error(1, 0, "-S <client addr> required for PF_RDS\n"); + if (cfg_zerocopy == MSG_ZEROCOPY_NOTIFY_SENDMSG) + error(1, 0, "PF_RDS does not support ZC_NOTIF_SENDMSG"); } setup_sockaddr(cfg_family, daddr, &cfg_dst_addr); setup_sockaddr(cfg_family, saddr, &cfg_src_addr); diff --git a/tools/testing/selftests/net/msg_zerocopy.sh b/tools/testing/selftests/net/msg_zerocopy.sh index 89c22f5320e0..022a6936d86f 100755 --- a/tools/testing/selftests/net/msg_zerocopy.sh +++ b/tools/testing/selftests/net/msg_zerocopy.sh @@ -118,4 +118,5 @@ do_test() { do_test "${EXTRA_ARGS}" do_test "-z ${EXTRA_ARGS}" +do_test "-n ${EXTRA_ARGS}" echo ok