Message ID | 20230310043812.3087672-3-kuifeng@meta.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | BPF |
Series | Transit between BPF TCP congestion controls. |
On Thu, 9 Mar 2023 20:38:07 -0800 Kui-Feng Lee <kuifeng@meta.com> wrote:

> This feature lets you immediately transition to another congestion
> control algorithm or implementation with the same name. Once a name
> is updated, new connections will apply this new algorithm.
>
> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>

What is the use case and userspace API for this?
The congestion control algorithm normally doesn't allow this because
algorithm specific variables (current state of connection) may not
work with another algorithm.

Seems like you are opening Pandora's box here.

On 3/10/23 08:47, Stephen Hemminger wrote:
> On Thu, 9 Mar 2023 20:38:07 -0800
> Kui-Feng Lee <kuifeng@meta.com> wrote:
>
>> This feature lets you immediately transition to another congestion
>> control algorithm or implementation with the same name. Once a name
>> is updated, new connections will apply this new algorithm.
>>
>> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
>
> What is the use case and userspace API for this?
> The congestion control algorithm normally doesn't allow this because
> algorithm specific variables (current state of connection) may not
> work with another algorithm.

Only new connections will apply the new algorithm, while existing
connections keep using the algorithm they already applied. So it
shouldn't have the per-connection state/variable issue you mentioned.

It will be used to upgrade an existing algorithm to a new version.
The userspace API is used in the 8th patch of this patchset.
One of the examples in the testcase is

    link = bpf_map__attach_struct_ops(skel->maps.ca_update_1);
    .......
    err = bpf_link__update_map(link, skel->maps.ca_update_2);

Calling bpf_link__update_map(...) will register ca_update_2 and
unregister ca_update_1 with the same name in one call. However, the
existing connections that have applied ca_update_1 keep using that
algorithm unless someone calls setsockopt(TCP_CONGESTION, ...) on them.

> Seems like you are opening Pandora's box here.

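The semantics described in the reply above (new connections resolve the name to the new implementation, existing connections keep what they bound to) can be sketched in plain userspace C. This is only a toy model under stated assumptions: `cc_ops`, `cc_register`, `cc_update`, `conn` and `conn_open` are hypothetical names invented for illustration, not kernel or libbpf APIs, and there is no locking here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct cc_ops {
	const char *name;
	int version;
};

#define MAX_CC 8

/* Name -> implementation registry (stand-in for the kernel's list). */
static const struct cc_ops *registry[MAX_CC];

static const struct cc_ops *cc_find(const char *name)
{
	for (size_t i = 0; i < MAX_CC; i++)
		if (registry[i] && strcmp(registry[i]->name, name) == 0)
			return registry[i];
	return NULL;
}

static int cc_register(const struct cc_ops *ops)
{
	if (cc_find(ops->name))
		return -1;		/* name already taken */
	for (size_t i = 0; i < MAX_CC; i++) {
		if (!registry[i]) {
			registry[i] = ops;
			return 0;
		}
	}
	return -1;			/* registry full */
}

/* Replace old_ops with new_ops under the same name in a single step,
 * so the name never becomes unavailable to new connections.
 */
static int cc_update(const struct cc_ops *new_ops, const struct cc_ops *old_ops)
{
	if (strcmp(new_ops->name, old_ops->name) != 0)
		return -1;
	for (size_t i = 0; i < MAX_CC; i++) {
		if (registry[i] == old_ops) {
			registry[i] = new_ops;
			return 0;
		}
	}
	return -1;			/* old_ops was not registered */
}

struct conn {
	const struct cc_ops *cc;	/* bound once, at creation */
};

/* A connection binds to whatever the name resolves to when it is opened. */
static int conn_open(struct conn *c, const char *name)
{
	c->cc = cc_find(name);
	return c->cc ? 0 : -1;
}
```

In this model, a connection opened before `cc_update()` keeps pointing at version 1 while one opened afterwards sees version 2, which is the behaviour claimed for the real feature.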
On 3/13/23 08:46, Kui-Feng Lee wrote:
> [...]
> It will be used to upgrade an existing algorithm to a new version.
> The userspace API is used in the 8th patch of this patchset.
> [...]

FYI! The thread head of the patchset is
https://lore.kernel.org/all/20230310043812.3087672-1-kuifeng@meta.com/

On 3/9/23 8:38 PM, Kui-Feng Lee wrote:
> This feature lets you immediately transition to another congestion
> control algorithm or implementation with the same name. Once a name
> is updated, new connections will apply this new algorithm.

The commit message needs to explain why the change is needed and some
major details on how the patch is doing it. In this case, why a later
bpf patch needs it and what major changes are made to tcp_cong.c.

For example,

A later bpf patch will allow attaching a bpf_struct_ops (TCP Congestion
Control implemented in bpf) to bpf_link. The later bpf patch will also
use the existing bpf_link API to replace a bpf_struct_ops (ie. to
replace an old tcp-cc with a new tcp-cc under the same name). This
requires a helper function to replace a tcp-cc under a
tcp_cong_list_lock. Thus, this patch adds a
tcp_update_congestion_control() to replace the "old_ca" with a new "ca".

This patch also takes this chance to refactor the ca validation into the
new tcp_validate_congestion_control() function.

>
> Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
> ---
>  include/linux/bpf.h            |  1 +
>  include/net/tcp.h              |  2 ++
>  net/bpf/bpf_dummy_struct_ops.c |  6 ++++
>  net/ipv4/bpf_tcp_ca.c          |  6 ++++
>  net/ipv4/tcp_cong.c            | 60 ++++++++++++++++++++++++++++++----
>  5 files changed, 68 insertions(+), 7 deletions(-)
>
> [...]
> diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
> index 13fc0c185cd9..66ce5fadfe42 100644
> --- a/net/ipv4/bpf_tcp_ca.c
> +++ b/net/ipv4/bpf_tcp_ca.c
> @@ -266,10 +266,16 @@ static void bpf_tcp_ca_unreg(void *kdata)
>  	tcp_unregister_congestion_control(kdata);
>  }
>
> +static int bpf_tcp_ca_update(void *kdata, void *old_kdata)
> +{
> +	return tcp_update_congestion_control(kdata, old_kdata);
> +}
> +
>  struct bpf_struct_ops bpf_tcp_congestion_ops = {
>  	.verifier_ops = &bpf_tcp_ca_verifier_ops,
>  	.reg = bpf_tcp_ca_reg,
>  	.unreg = bpf_tcp_ca_unreg,
> +	.update = bpf_tcp_ca_update,

In v5, a comment was given to move the ".update" related changes to
patch 5 such that patch 2 will only have netdev change in tcp_cong.c for
review purpose.

Please ensure the earlier review comment is addressed in the next
revision or reply if the earlier review comment does not make sense.
This will save time for the reviewer not to have to repeat the same
comment again.

>  	.check_member = bpf_tcp_ca_check_member,
>  	.init_member = bpf_tcp_ca_init_member,
>  	.init = bpf_tcp_ca_init,

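The validation refactor mentioned in the review can be pictured with a small userspace sketch: the mandatory-callback check moves into its own function so that both the register path and the update path can call it. The struct below is a hypothetical, trimmed-down stand-in for `struct tcp_congestion_ops` (callback signatures are simplified to take no arguments), and `cc_validate`, `cc_register_model` and `cc_update_model` are illustrative names, not kernel functions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct tcp_congestion_ops; signatures
 * are simplified for illustration.
 */
struct cc_ops_model {
	unsigned int (*ssthresh)(void);
	unsigned int (*undo_cwnd)(void);
	void (*cong_avoid)(void);
	void (*cong_control)(void);
};

/* Shared check: ssthresh and undo_cwnd are mandatory, plus at least
 * one of cong_avoid or cong_control.
 */
static int cc_validate(const struct cc_ops_model *ca)
{
	if (!ca->ssthresh || !ca->undo_cwnd ||
	    !(ca->cong_avoid || ca->cong_control))
		return -EINVAL;
	return 0;
}

/* Both entry points reuse the same validation before touching any
 * shared state (list handling is elided in this sketch).
 */
static int cc_register_model(const struct cc_ops_model *ca)
{
	int ret = cc_validate(ca);

	if (ret)
		return ret;
	return 0;
}

static int cc_update_model(const struct cc_ops_model *ca,
			   const struct cc_ops_model *old_ca)
{
	int ret = cc_validate(ca);

	(void)old_ca;	/* replacement logic elided in this sketch */
	if (ret)
		return ret;
	return 0;
}

/* Dummy callbacks so the example can be exercised. */
static unsigned int dummy_u(void) { return 0; }
static void dummy_v(void) { }
```

Keeping the check in one place means a new entry point such as the update path cannot accidentally skip it.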
On 3/13/23 17:28, Martin KaFai Lau wrote:
> On 3/9/23 8:38 PM, Kui-Feng Lee wrote:
>> This feature lets you immediately transition to another congestion
>> control algorithm or implementation with the same name. Once a name
>> is updated, new connections will apply this new algorithm.
>
> The commit message needs to explain why the change is needed and some
> major details on how the patch is doing it. In this case, why a later
> bpf patch needs it and what major changes are made to tcp_cong.c.
>
> For example,
>
> A later bpf patch will allow attaching a bpf_struct_ops (TCP Congestion
> Control implemented in bpf) to bpf_link. The later bpf patch will also
> use the existing bpf_link API to replace a bpf_struct_ops (ie. to
> replace an old tcp-cc with a new tcp-cc under the same name). This
> requires a helper function to replace a tcp-cc under a
> tcp_cong_list_lock. Thus, this patch adds a
> tcp_update_congestion_control() to replace the "old_ca" with a new "ca".
>
> This patch also takes this chance to refactor the ca validation into the
> new tcp_validate_congestion_control() function.

Sure!

>> [...]
>>  struct bpf_struct_ops bpf_tcp_congestion_ops = {
>>  	.verifier_ops = &bpf_tcp_ca_verifier_ops,
>>  	.reg = bpf_tcp_ca_reg,
>>  	.unreg = bpf_tcp_ca_unreg,
>> +	.update = bpf_tcp_ca_update,
>
> In v5, a comment was given to move the ".update" related changes to
> patch 5 such that patch 2 will only have netdev change in tcp_cong.c for
> review purpose.
>
> Please ensure the earlier review comment is addressed in the next
> revision or reply if the earlier review comment does not make sense.
> This will save time for the reviewer not to have to repeat the same
> comment again.

Sorry about this. I only addressed .validate and missed .update.
Will fix this.

>>  	.check_member = bpf_tcp_ca_check_member,
>>  	.init_member = bpf_tcp_ca_init_member,
>>  	.init = bpf_tcp_ca_init,

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 00ca92ea6f2e..0f84925d66db 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1511,6 +1511,7 @@ struct bpf_struct_ops {
 			   void *kdata, const void *udata);
 	int (*reg)(void *kdata);
 	void (*unreg)(void *kdata);
+	int (*update)(void *kdata, void *old_kdata);
 	const struct btf_type *type;
 	const struct btf_type *value_type;
 	const char *name;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index db9f828e9d1e..239cc0e2639c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1117,6 +1117,8 @@ struct tcp_congestion_ops {
 
 int tcp_register_congestion_control(struct tcp_congestion_ops *type);
 void tcp_unregister_congestion_control(struct tcp_congestion_ops *type);
+int tcp_update_congestion_control(struct tcp_congestion_ops *type,
+				  struct tcp_congestion_ops *old_type);
 
 void tcp_assign_congestion_control(struct sock *sk);
 void tcp_init_congestion_control(struct sock *sk);
diff --git a/net/bpf/bpf_dummy_struct_ops.c b/net/bpf/bpf_dummy_struct_ops.c
index ff4f89a2b02a..158f14e240d0 100644
--- a/net/bpf/bpf_dummy_struct_ops.c
+++ b/net/bpf/bpf_dummy_struct_ops.c
@@ -222,12 +222,18 @@ static void bpf_dummy_unreg(void *kdata)
 {
 }
 
+static int bpf_dummy_update(void *kdata, void *old_kdata)
+{
+	return -EOPNOTSUPP;
+}
+
 struct bpf_struct_ops bpf_bpf_dummy_ops = {
 	.verifier_ops = &bpf_dummy_verifier_ops,
 	.init = bpf_dummy_init,
 	.check_member = bpf_dummy_ops_check_member,
 	.init_member = bpf_dummy_init_member,
 	.reg = bpf_dummy_reg,
+	.update = bpf_dummy_update,
 	.unreg = bpf_dummy_unreg,
 	.name = "bpf_dummy_ops",
 };
diff --git a/net/ipv4/bpf_tcp_ca.c b/net/ipv4/bpf_tcp_ca.c
index 13fc0c185cd9..66ce5fadfe42 100644
--- a/net/ipv4/bpf_tcp_ca.c
+++ b/net/ipv4/bpf_tcp_ca.c
@@ -266,10 +266,16 @@ static void bpf_tcp_ca_unreg(void *kdata)
 	tcp_unregister_congestion_control(kdata);
 }
 
+static int bpf_tcp_ca_update(void *kdata, void *old_kdata)
+{
+	return tcp_update_congestion_control(kdata, old_kdata);
+}
+
 struct bpf_struct_ops bpf_tcp_congestion_ops = {
 	.verifier_ops = &bpf_tcp_ca_verifier_ops,
 	.reg = bpf_tcp_ca_reg,
 	.unreg = bpf_tcp_ca_unreg,
+	.update = bpf_tcp_ca_update,
 	.check_member = bpf_tcp_ca_check_member,
 	.init_member = bpf_tcp_ca_init_member,
 	.init = bpf_tcp_ca_init,
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index db8b4b488c31..c90791ae8389 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -75,14 +75,8 @@ struct tcp_congestion_ops *tcp_ca_find_key(u32 key)
 	return NULL;
 }
 
-/*
- * Attach new congestion control algorithm to the list
- * of available options.
- */
-int tcp_register_congestion_control(struct tcp_congestion_ops *ca)
+int tcp_validate_congestion_control(struct tcp_congestion_ops *ca)
 {
-	int ret = 0;
-
 	/* all algorithms must implement these */
 	if (!ca->ssthresh || !ca->undo_cwnd ||
 	    !(ca->cong_avoid || ca->cong_control)) {
@@ -90,6 +84,20 @@ int tcp_register_congestion_control(struct tcp_congestion_ops *ca)
 		return -EINVAL;
 	}
 
+	return 0;
+}
+
+/* Attach new congestion control algorithm to the list
+ * of available options.
+ */
+int tcp_register_congestion_control(struct tcp_congestion_ops *ca)
+{
+	int ret;
+
+	ret = tcp_validate_congestion_control(ca);
+	if (ret)
+		return ret;
+
 	ca->key = jhash(ca->name, sizeof(ca->name), strlen(ca->name));
 
 	spin_lock(&tcp_cong_list_lock);
@@ -130,6 +138,44 @@ void tcp_unregister_congestion_control(struct tcp_congestion_ops *ca)
 }
 EXPORT_SYMBOL_GPL(tcp_unregister_congestion_control);
 
+/* Replace a registered old ca with a new one.
+ *
+ * The new ca must have the same name as the old one, that has been
+ * registered.
+ */
+int tcp_update_congestion_control(struct tcp_congestion_ops *ca, struct tcp_congestion_ops *old_ca)
+{
+	struct tcp_congestion_ops *existing;
+	int ret;
+
+	ret = tcp_validate_congestion_control(ca);
+	if (ret)
+		return ret;
+
+	ca->key = jhash(ca->name, sizeof(ca->name), strlen(ca->name));
+
+	spin_lock(&tcp_cong_list_lock);
+	existing = tcp_ca_find_key(old_ca->key);
+	if (ca->key == TCP_CA_UNSPEC || !existing || strcmp(existing->name, ca->name)) {
+		pr_notice("%s not registered or non-unique key\n",
+			  ca->name);
+		ret = -EINVAL;
+	} else if (existing != old_ca) {
+		pr_notice("invalid old congestion control algorithm to replace\n");
+		ret = -EINVAL;
+	} else {
+		/* Add the new one before removing the old one to keep
+		 * one implementation available all the time.
+		 */
+		list_add_tail_rcu(&ca->list, &tcp_cong_list);
+		list_del_rcu(&existing->list);
+		pr_debug("%s updated\n", ca->name);
+	}
+	spin_unlock(&tcp_cong_list_lock);
+
+	return ret;
+}
+
 u32 tcp_ca_get_key_by_name(struct net *net, const char *name, bool *ecn_ca)
 {
 	const struct tcp_congestion_ops *ca;

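The decision logic of tcp_update_congestion_control() in the diff above can be walked through with a userspace model: look up whatever is registered under the old key, reject the call if nothing matches the name or if the caller's "old" pointer is stale, and otherwise link the new entry before unlinking the old one. Everything below is an assumption made for illustration: `ca_model`, `name_key`, `ca_register_model` and `ca_update_model` are hypothetical, a djb2-style hash stands in for the kernel's jhash(), a plain singly linked list stands in for the RCU list, and locking is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a registered congestion control entry. */
struct ca_model {
	const char *name;
	unsigned int key;
	struct ca_model *next;
};

static struct ca_model *ca_list;

static struct ca_model *find_key(unsigned int key)
{
	for (struct ca_model *e = ca_list; e; e = e->next)
		if (e->key == key)
			return e;
	return NULL;
}

/* Toy djb2-style hash, standing in for the kernel's jhash(). */
static unsigned int name_key(const char *name)
{
	unsigned int h = 5381;

	while (*name)
		h = h * 33 + (unsigned char)*name++;
	return h;
}

static int ca_register_model(struct ca_model *ca)
{
	ca->key = name_key(ca->name);
	if (find_key(ca->key))
		return -EEXIST;
	ca->next = ca_list;
	ca_list = ca;
	return 0;
}

static int ca_update_model(struct ca_model *ca, struct ca_model *old_ca)
{
	struct ca_model *existing, **pp;

	ca->key = name_key(ca->name);
	existing = find_key(old_ca->key);

	/* Nothing registered under the old key, or the names differ. */
	if (!existing || strcmp(existing->name, ca->name))
		return -EINVAL;
	/* The caller's "old" entry is not the one currently registered. */
	if (existing != old_ca)
		return -EINVAL;

	/* Link the new entry first, then unlink the old one, so the
	 * name resolves to some implementation at every step.
	 */
	ca->next = ca_list;
	ca_list = ca;
	for (pp = &ca_list; *pp; pp = &(*pp)->next) {
		if (*pp == old_ca) {
			*pp = old_ca->next;
			break;
		}
	}
	old_ca->next = NULL;
	return 0;
}
```

A second update that still passes the already-replaced entry as `old_ca` fails in this model, because the lookup now returns the newer entry and the pointer comparison rejects it.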
This feature lets you immediately transition to another congestion
control algorithm or implementation with the same name. Once a name
is updated, new connections will apply this new algorithm.

Signed-off-by: Kui-Feng Lee <kuifeng@meta.com>
---
 include/linux/bpf.h            |  1 +
 include/net/tcp.h              |  2 ++
 net/bpf/bpf_dummy_struct_ops.c |  6 ++++
 net/ipv4/bpf_tcp_ca.c          |  6 ++++
 net/ipv4/tcp_cong.c            | 60 ++++++++++++++++++++++++++++++----
 5 files changed, 68 insertions(+), 7 deletions(-)