diff mbox series

[RFC,bpf-next,3/8] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.

Message ID 20201117094023.3685-4-kuniyu@amazon.co.jp (mailing list archive)
State RFC
Delegated to: BPF
Series Socket migration for SO_REUSEPORT.

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for bpf-next
netdev/subject_prefix success Link
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 3809 this patch: 3809
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch warning CHECK: extern prototypes should be avoided in .h files WARNING: line length of 100 exceeds 80 columns WARNING: line length of 88 exceeds 80 columns WARNING: line length of 90 exceeds 80 columns WARNING: line length of 96 exceeds 80 columns
netdev/build_allmodconfig_warn success Errors and warnings before: 4178 this patch: 4178
netdev/header_inline success Link
netdev/stable success Stable not CCed

Commit Message

Iwashima, Kuniyuki Nov. 17, 2020, 9:40 a.m. UTC
This patch lets reuseport_detach_sock() return a pointer to a struct sock,
which is used only by inet_unhash(). If it is not NULL,
inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
sockets from the closing listener to the selected one.

Listening sockets hold incoming connections as a linked list of struct
request_sock in the accept queue, and each request has a reference to a
full socket and its listener. In inet_csk_reqsk_queue_migrate(), we unlink
the requests from the closing listener's queue and relink them to the head
of the new listener's queue. We do not process each request individually,
so the migration completes in O(1) time. However, TCP_SYN_RECV sockets
need special handling, which the next commit addresses.

By default, we select the last element of socks[] as the new listener.
This behaviour is based on how the kernel moves sockets in socks[].

For example, we call listen() for four sockets (A, B, C, D) and close the
first two in turn. The sockets move in socks[] as shown below. (See also [1])

  socks[0] : A <-.      socks[0] : D          socks[0] : D
  socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
  socks[2] : C   |      socks[2] : C --'
  socks[3] : D --'

Then, if C and D have newer settings than A and B, and each socket has a
request (a, b, c, d) in its accept queue, we can redistribute the old
requests evenly to the new listeners.

  socks[0] : A (a) <-.      socks[0] : D (a + d)      socks[0] : D (a + d)
  socks[1] : B (b)   |  =>  socks[1] : B (b) <-.  =>  socks[1] : C (b + c)
  socks[2] : C (c)   |      socks[2] : C (c) --'
  socks[3] : D (d) --'

Here, (A, D) or (B, C) can have different application settings, but they
MUST have the same settings at the socket API level; otherwise, unexpected
errors may occur. For instance, if only the new listeners have TCP_SAVE_SYN
enabled, the migrated requests carry no SYN data, so the application will
see inconsistent behaviour and may fail.

Therefore, if the group contains sockets with different settings, we must
disable this feature or use an eBPF program, described in later commits.
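
As a minimal userspace sketch of the generational replacement above (the
port number is arbitrary, error handling is omitted, and in practice C and
D would be created by the restarted process rather than the same one):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static int new_listener(void)
{
	int one = 1;
	struct sockaddr_in addr;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	/* SO_REUSEPORT must be set before bind() to join the group. */
	setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_ANY);
	addr.sin_port = htons(8080);
	bind(fd, (struct sockaddr *)&addr, sizeof(addr));
	listen(fd, 128);

	return fd;
}

int main(void)
{
	int fds[4], i;

	/* A, B, C, D join the same reuseport group. */
	for (i = 0; i < 4; i++)
		fds[i] = new_listener();

	/* Close A and B. Without this series, the requests queued on them
	 * are aborted; with it, they are migrated to C and D.
	 */
	close(fds[0]);
	close(fds[1]);

	/* ... accept() on fds[2] / fds[3] ... */
	return 0;
}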

Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Link: https://lore.kernel.org/netdev/CAEfhGiyG8Y_amDZ2C8dQoQqjZJMHjTY76b=KBkTKcBtA=dhdGQ@mail.gmail.com/
---
 include/net/inet_connection_sock.h |  1 +
 include/net/sock_reuseport.h       |  2 +-
 net/core/sock_reuseport.c          |  8 +++++++-
 net/ipv4/inet_connection_sock.c    | 30 ++++++++++++++++++++++++++++++
 net/ipv4/inet_hashtables.c         |  9 +++++++--
 5 files changed, 46 insertions(+), 4 deletions(-)

Comments

Martin KaFai Lau Nov. 18, 2020, 11:50 p.m. UTC | #1
On Tue, Nov 17, 2020 at 06:40:18PM +0900, Kuniyuki Iwashima wrote:
> This patch lets reuseport_detach_sock() return a pointer of struct sock,
> which is used only by inet_unhash(). If it is not NULL,
> inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> sockets from the closing listener to the selected one.
> 
> Listening sockets hold incoming connections as a linked list of struct
> request_sock in the accept queue, and each request has reference to a full
> socket and its listener. In inet_csk_reqsk_queue_migrate(), we unlink the
> requests from the closing listener's queue and relink them to the head of
> the new listener's queue. We do not process each request, so the migration
> completes in O(1) time complexity. However, in the case of TCP_SYN_RECV
> sockets, we will take special care in the next commit.
> 
> By default, we select the last element of socks[] as the new listener.
> This behaviour is based on how the kernel moves sockets in socks[].
> 
> For example, we call listen() for four sockets (A, B, C, D), and close the
> first two by turns. The sockets move in socks[] like below. (See also [1])
> 
>   socks[0] : A <-.      socks[0] : D          socks[0] : D
>   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
>   socks[2] : C   |      socks[2] : C --'
>   socks[3] : D --'
> 
> Then, if C and D have newer settings than A and B, and each socket has a
> request (a, b, c, d) in their accept queue, we can redistribute old
> requests evenly to new listeners.
I don't think it should emphasize/claim there is a specific way that
the kernel-pick here can redistribute the requests evenly.  It depends on
how the application closes/listens.  The userspace cannot expect the
ordering of socks[] to behave in a certain way.

The primary redistribution policy has to depend on BPF, which is the
policy defined by the user based on its application logic (e.g. how
its binary restart works).  The application (and bpf) knows which one
is a dying process and can avoid distributing to it.

The kernel-pick could be an optional fallback but not a must.  If a bpf
prog is attached, I would even go further and call bpf to redistribute
regardless of the sysctl, so I think the sysctl is not necessary.

> 
>   socks[0] : A (a) <-.      socks[0] : D (a + d)      socks[0] : D (a + d)
>   socks[1] : B (b)   |  =>  socks[1] : B (b) <-.  =>  socks[1] : C (b + c)
>   socks[2] : C (c)   |      socks[2] : C (c) --'
>   socks[3] : D (d) --'
>
Iwashima, Kuniyuki Nov. 19, 2020, 10:09 p.m. UTC | #2
From: Martin KaFai Lau <kafai@fb.com>
Date: Wed, 18 Nov 2020 15:50:17 -0800
> On Tue, Nov 17, 2020 at 06:40:18PM +0900, Kuniyuki Iwashima wrote:
> > This patch lets reuseport_detach_sock() return a pointer of struct sock,
> > which is used only by inet_unhash(). If it is not NULL,
> > inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> > sockets from the closing listener to the selected one.
> > 
> > Listening sockets hold incoming connections as a linked list of struct
> > request_sock in the accept queue, and each request has reference to a full
> > socket and its listener. In inet_csk_reqsk_queue_migrate(), we unlink the
> > requests from the closing listener's queue and relink them to the head of
> > the new listener's queue. We do not process each request, so the migration
> > completes in O(1) time complexity. However, in the case of TCP_SYN_RECV
> > sockets, we will take special care in the next commit.
> > 
> > By default, we select the last element of socks[] as the new listener.
> > This behaviour is based on how the kernel moves sockets in socks[].
> > 
> > For example, we call listen() for four sockets (A, B, C, D), and close the
> > first two by turns. The sockets move in socks[] like below. (See also [1])
> > 
> >   socks[0] : A <-.      socks[0] : D          socks[0] : D
> >   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
> >   socks[2] : C   |      socks[2] : C --'
> >   socks[3] : D --'
> > 
> > Then, if C and D have newer settings than A and B, and each socket has a
> > request (a, b, c, d) in their accept queue, we can redistribute old
> > requests evenly to new listeners.
> I don't think it should emphasize/claim there is a specific way that
> the kernel-pick here can redistribute the requests evenly.  It depends on
> how the application close/listen.  The userspace can not expect the
> ordering of socks[] will behave in a certain way.

I had expected replacing listeners by generations to be a general use case.
But exactly, users should not rely on undocumented kernel internals.


> The primary redistribution policy has to depend on BPF which is the
> policy defined by the user based on its application logic (e.g. how
> its binary restart work).  The application (and bpf) knows which one
> is a dying process and can avoid distributing to it.
> 
> The kernel-pick could be an optional fallback but not a must.  If the bpf
> prog is attached, I would even go further to call bpf to redistribute
> regardless of the sysctl, so I think the sysctl is not necessary.

I also think it is just an optional fallback, but to pick a different
listener every time, choosing the moved socket was reasonable. So the even
redistribution for a specific use case is a side effect of that socket
selection.

But users should decide which way to use:
  (1) let the kernel select a new listener randomly
  (2) select a particular listener by eBPF

I will update the commit message like this:
The kernel selects a new listener randomly, but as a side effect, it can
redistribute packets evenly for a specific case where an application
replaces listeners by generations.
Martin KaFai Lau Nov. 20, 2020, 1:53 a.m. UTC | #3
On Fri, Nov 20, 2020 at 07:09:22AM +0900, Kuniyuki Iwashima wrote:
> From: Martin KaFai Lau <kafai@fb.com>
> Date: Wed, 18 Nov 2020 15:50:17 -0800
> > On Tue, Nov 17, 2020 at 06:40:18PM +0900, Kuniyuki Iwashima wrote:
> > > This patch lets reuseport_detach_sock() return a pointer of struct sock,
> > > which is used only by inet_unhash(). If it is not NULL,
> > > inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> > > sockets from the closing listener to the selected one.
> > > 
> > > Listening sockets hold incoming connections as a linked list of struct
> > > request_sock in the accept queue, and each request has reference to a full
> > > socket and its listener. In inet_csk_reqsk_queue_migrate(), we unlink the
> > > requests from the closing listener's queue and relink them to the head of
> > > the new listener's queue. We do not process each request, so the migration
> > > completes in O(1) time complexity. However, in the case of TCP_SYN_RECV
> > > sockets, we will take special care in the next commit.
> > > 
> > > By default, we select the last element of socks[] as the new listener.
> > > This behaviour is based on how the kernel moves sockets in socks[].
> > > 
> > > For example, we call listen() for four sockets (A, B, C, D), and close the
> > > first two by turns. The sockets move in socks[] like below. (See also [1])
> > > 
> > >   socks[0] : A <-.      socks[0] : D          socks[0] : D
> > >   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
> > >   socks[2] : C   |      socks[2] : C --'
> > >   socks[3] : D --'
> > > 
> > > Then, if C and D have newer settings than A and B, and each socket has a
> > > request (a, b, c, d) in their accept queue, we can redistribute old
> > > requests evenly to new listeners.
> > I don't think it should emphasize/claim there is a specific way that
> > the kernel-pick here can redistribute the requests evenly.  It depends on
> > how the application close/listen.  The userspace can not expect the
> > ordering of socks[] will behave in a certain way.
> 
> I've expected replacing listeners by generations as a general use case.
> But exactly. Users should not expect the undocumented kernel internal.
> 
> 
> > The primary redistribution policy has to depend on BPF which is the
> > policy defined by the user based on its application logic (e.g. how
> > its binary restart work).  The application (and bpf) knows which one
> > is a dying process and can avoid distributing to it.
> > 
> > The kernel-pick could be an optional fallback but not a must.  If the bpf
> > prog is attached, I would even go further to call bpf to redistribute
> > regardless of the sysctl, so I think the sysctl is not necessary.
> 
> I also think it is just an optional fallback, but to pick out a different
> listener everytime, choosing the moved socket was reasonable. So the even
> redistribution for a specific use case is a side effect of such socket
> selection.
> 
> But, users should decide to use either way:
>   (1) let the kernel select a new listener randomly
>   (2) select a particular listener by eBPF
> 
> I will update the commit message like:
> The kernel selects a new listener randomly, but as the side effect, it can
> redistribute packets evenly for a specific case where an application
> replaces listeners by generations.
Since there was no feedback on the sysctl, maybe something was missed
in the lines above.

I don't think this migration logic should depend on a sysctl.
At least not when a bpf prog capable of doing migration is attached;
it is too fragile to ask the user to remember to turn on
the sysctl before attaching the bpf prog.

Is your use case primarily based on a bpf prog doing the pick, or only
on the kernel doing a random pick?

Also, IIUC, this sysctl setting sticks at "*reuse"; there is no way to
change it until all the listening sockets are closed, which is exactly
the service disruption problem this series is trying to solve.
Iwashima, Kuniyuki Nov. 21, 2020, 10:13 a.m. UTC | #4
From:   Martin KaFai Lau <kafai@fb.com>
Date:   Thu, 19 Nov 2020 17:53:46 -0800
> On Fri, Nov 20, 2020 at 07:09:22AM +0900, Kuniyuki Iwashima wrote:
> > From: Martin KaFai Lau <kafai@fb.com>
> > Date: Wed, 18 Nov 2020 15:50:17 -0800
> > > On Tue, Nov 17, 2020 at 06:40:18PM +0900, Kuniyuki Iwashima wrote:
> > > > This patch lets reuseport_detach_sock() return a pointer of struct sock,
> > > > which is used only by inet_unhash(). If it is not NULL,
> > > > inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> > > > sockets from the closing listener to the selected one.
> > > > 
> > > > Listening sockets hold incoming connections as a linked list of struct
> > > > request_sock in the accept queue, and each request has reference to a full
> > > > socket and its listener. In inet_csk_reqsk_queue_migrate(), we unlink the
> > > > requests from the closing listener's queue and relink them to the head of
> > > > the new listener's queue. We do not process each request, so the migration
> > > > completes in O(1) time complexity. However, in the case of TCP_SYN_RECV
> > > > sockets, we will take special care in the next commit.
> > > > 
> > > > By default, we select the last element of socks[] as the new listener.
> > > > This behaviour is based on how the kernel moves sockets in socks[].
> > > > 
> > > > For example, we call listen() for four sockets (A, B, C, D), and close the
> > > > first two by turns. The sockets move in socks[] like below. (See also [1])
> > > > 
> > > >   socks[0] : A <-.      socks[0] : D          socks[0] : D
> > > >   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
> > > >   socks[2] : C   |      socks[2] : C --'
> > > >   socks[3] : D --'
> > > > 
> > > > Then, if C and D have newer settings than A and B, and each socket has a
> > > > request (a, b, c, d) in their accept queue, we can redistribute old
> > > > requests evenly to new listeners.
> > > I don't think it should emphasize/claim there is a specific way that
> > > the kernel-pick here can redistribute the requests evenly.  It depends on
> > > how the application close/listen.  The userspace can not expect the
> > > ordering of socks[] will behave in a certain way.
> > 
> > I've expected replacing listeners by generations as a general use case.
> > But exactly. Users should not expect the undocumented kernel internal.
> > 
> > 
> > > The primary redistribution policy has to depend on BPF which is the
> > > policy defined by the user based on its application logic (e.g. how
> > > its binary restart work).  The application (and bpf) knows which one
> > > is a dying process and can avoid distributing to it.
> > > 
> > > The kernel-pick could be an optional fallback but not a must.  If the bpf
> > > prog is attached, I would even go further to call bpf to redistribute
> > > regardless of the sysctl, so I think the sysctl is not necessary.
> > 
> > I also think it is just an optional fallback, but to pick out a different
> > listener everytime, choosing the moved socket was reasonable. So the even
> > redistribution for a specific use case is a side effect of such socket
> > selection.
> > 
> > But, users should decide to use either way:
> >   (1) let the kernel select a new listener randomly
> >   (2) select a particular listener by eBPF
> > 
> > I will update the commit message like:
> > The kernel selects a new listener randomly, but as the side effect, it can
> > redistribute packets evenly for a specific case where an application
> > replaces listeners by generations.
> Since there is no feedback on sysctl, so may be something missed
> in the lines.

I'm sorry, I have missed this point while thinking about each reply...


> I don't think this migration logic should depend on a sysctl.
> At least not when a bpf prog is attached that is capable of doing
> migration, it is too fragile to ask user to remember to turn on
> the sysctl before attaching the bpf prog.
> 
> Your use case is to primarily based on bpf prog to pick or only based
> on kernel to do a random pick?

I think we have to care about both cases.

I think we can always enable the migration feature if no eBPF prog is
attached. On the other hand, if a BPF_PROG_TYPE_SK_REUSEPORT prog is
attached to select a listener by some rules, then after updating the
kernel, redistributing requests against the user's intention can break
the application. So, something is needed to confirm the user's intention,
at least if an eBPF prog is attached.

But honestly, I believe such eBPF users can follow this change and
implement a migration eBPF prog if we introduce such a breaking change.


> Also, IIUC, this sysctl setting sticks at "*reuse", there is no way to
> change it until all the listening sockets are closed which is exactly
> the service disruption problem this series is trying to solve here.

Oh, exactly...
If we apply this series by live patching, we cannot enable the feature
without service disruption.

To enable the migration feature dynamically, how about this logic?
In this logic, we do not save the sysctl value but check it each time:

  1. no eBPF prog attached -> ON
  2. eBPF prog attached and sysctl is 0 -> OFF
  3. eBPF prog attached and sysctl is 1 -> ON

So, we replace

  if (reuse->migrate_req)

with

  if (!reuse->prog || net->ipv4.sysctl_tcp_migrate_req)
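
As a sketch of where that per-detach check could sit, based on the
reuseport_detach_sock() hunk in this patch (sysctl_tcp_migrate_req is the
knob proposed by this series, and reuse->prog is the RCU-managed pointer
to the attached prog):

	/* Sketch only: evaluate the sysctl at detach time instead of
	 * latching it into reuse->migrate_req when the group is created.
	 */
	if (!rcu_access_pointer(reuse->prog) ||
	    sock_net(sk)->ipv4.sysctl_tcp_migrate_req) {
		if (reuse->num_socks)
			nsk = i == reuse->num_socks ? reuse->socks[i - 1]
						    : reuse->socks[i];

		reuse->num_closed_socks++;
		reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk;
	}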
Martin KaFai Lau Nov. 23, 2020, 12:40 a.m. UTC | #5
On Sat, Nov 21, 2020 at 07:13:22PM +0900, Kuniyuki Iwashima wrote:
> From:   Martin KaFai Lau <kafai@fb.com>
> Date:   Thu, 19 Nov 2020 17:53:46 -0800
> > On Fri, Nov 20, 2020 at 07:09:22AM +0900, Kuniyuki Iwashima wrote:
> > > From: Martin KaFai Lau <kafai@fb.com>
> > > Date: Wed, 18 Nov 2020 15:50:17 -0800
> > > > On Tue, Nov 17, 2020 at 06:40:18PM +0900, Kuniyuki Iwashima wrote:
> > > > > This patch lets reuseport_detach_sock() return a pointer of struct sock,
> > > > > which is used only by inet_unhash(). If it is not NULL,
> > > > > inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> > > > > sockets from the closing listener to the selected one.
> > > > > 
> > > > > Listening sockets hold incoming connections as a linked list of struct
> > > > > request_sock in the accept queue, and each request has reference to a full
> > > > > socket and its listener. In inet_csk_reqsk_queue_migrate(), we unlink the
> > > > > requests from the closing listener's queue and relink them to the head of
> > > > > the new listener's queue. We do not process each request, so the migration
> > > > > completes in O(1) time complexity. However, in the case of TCP_SYN_RECV
> > > > > sockets, we will take special care in the next commit.
> > > > > 
> > > > > By default, we select the last element of socks[] as the new listener.
> > > > > This behaviour is based on how the kernel moves sockets in socks[].
> > > > > 
> > > > > For example, we call listen() for four sockets (A, B, C, D), and close the
> > > > > first two by turns. The sockets move in socks[] like below. (See also [1])
> > > > > 
> > > > >   socks[0] : A <-.      socks[0] : D          socks[0] : D
> > > > >   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
> > > > >   socks[2] : C   |      socks[2] : C --'
> > > > >   socks[3] : D --'
> > > > > 
> > > > > Then, if C and D have newer settings than A and B, and each socket has a
> > > > > request (a, b, c, d) in their accept queue, we can redistribute old
> > > > > requests evenly to new listeners.
> > > > I don't think it should emphasize/claim there is a specific way that
> > > > the kernel-pick here can redistribute the requests evenly.  It depends on
> > > > how the application close/listen.  The userspace can not expect the
> > > > ordering of socks[] will behave in a certain way.
> > > 
> > > I've expected replacing listeners by generations as a general use case.
> > > But exactly. Users should not expect the undocumented kernel internal.
> > > 
> > > 
> > > > The primary redistribution policy has to depend on BPF which is the
> > > > policy defined by the user based on its application logic (e.g. how
> > > > its binary restart work).  The application (and bpf) knows which one
> > > > is a dying process and can avoid distributing to it.
> > > > 
> > > > The kernel-pick could be an optional fallback but not a must.  If the bpf
> > > > prog is attached, I would even go further to call bpf to redistribute
> > > > regardless of the sysctl, so I think the sysctl is not necessary.
> > > 
> > > I also think it is just an optional fallback, but to pick out a different
> > > listener everytime, choosing the moved socket was reasonable. So the even
> > > redistribution for a specific use case is a side effect of such socket
> > > selection.
> > > 
> > > But, users should decide to use either way:
> > >   (1) let the kernel select a new listener randomly
> > >   (2) select a particular listener by eBPF
> > > 
> > > I will update the commit message like:
> > > The kernel selects a new listener randomly, but as the side effect, it can
> > > redistribute packets evenly for a specific case where an application
> > > replaces listeners by generations.
> > Since there is no feedback on sysctl, so may be something missed
> > in the lines.
> 
> I'm sorry, I have missed this point while thinking about each reply...
> 
> 
> > I don't think this migration logic should depend on a sysctl.
> > At least not when a bpf prog is attached that is capable of doing
> > migration, it is too fragile to ask user to remember to turn on
> > the sysctl before attaching the bpf prog.
> > 
> > Your use case is to primarily based on bpf prog to pick or only based
> > on kernel to do a random pick?
Again, what is your primary use case?

> 
> I think we have to care about both cases.
> 
> I think we can always enable the migration feature if eBPF prog is not
> attached. On the other hand, if BPF_PROG_TYPE_SK_REUSEPORT prog is attached
> to select a listener by some rules, along updating the kernel,
> redistributing requests without user intention can break the application.
> So, there is something needed to confirm user intension at least if eBPF
> prog is attached.
Right, something that can tell whether the bpf prog can do migration
would confirm the user's intention here.  However, this will not be a
sysctl.

A new bpf_attach_type "BPF_SK_REUSEPORT_SELECT_OR_MIGRATE" can be added.
"prog->expected_attach_type == BPF_SK_REUSEPORT_SELECT_OR_MIGRATE"
can be used to decide if migration can be done by the bpf prog.
Although the prog->expected_attach_type has not been checked for
BPF_PROG_TYPE_SK_REUSEPORT, there was an earlier discussion
that the risk of breaking is very small and is acceptable.

Instead of depending on !reuse_md->data to decide if it
is doing migration or not, a clearer signal should be given
to the bpf prog.  A "u8 migration" can be added to "struct sk_reuseport_kern"
(and to "struct sk_reuseport_md" accordingly).  It can tell
the bpf prog that it is doing migration.  It should also tell if it is
migrating a list of established sk(s) or an individual req_sk.
Accessing "reuse_md->migration" should only be allowed for
BPF_SK_REUSEPORT_SELECT_OR_MIGRATE during is_valid_access().

During migration, if an skb is not available, an empty skb can be used.
Migration is a slow path and does not happen very often, so it will
be fine even if it has to create a temp skb (or maybe a static const skb
can be used; not sure, but this is an implementation detail).
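
As a sketch of what such a prog could look like, using the names proposed
above (BPF_SK_REUSEPORT_SELECT_OR_MIGRATE, reuse_md->migration and the
"sk_reuseport/migrate" section name do not exist at this point, and the
map layout and key choice are only for illustration):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
	__uint(max_entries, 16);
	__type(key, __u32);
	__type(value, __u64);
} reuseport_map SEC(".maps");

SEC("sk_reuseport/migrate")
int select_or_migrate(struct sk_reuseport_md *reuse_md)
{
	__u32 key;

	if (reuse_md->migration) {
		/* Migrating from a closing listener: pick a slot that
		 * userspace keeps pointed at a new-generation listener.
		 */
		key = 0;
	} else {
		/* Normal selection for an incoming packet. */
		key = reuse_md->hash % 16;
	}

	/* On failure, drop: no migration and no kernel fallback. */
	if (bpf_sk_select_reuseport(reuse_md, &reuseport_map, &key, 0))
		return SK_DROP;

	return SK_PASS;
}

char _license[] SEC("license") = "GPL";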

> 
> But honestly, I believe such eBPF users can follow this change and
> implement migration eBPF prog if we introduce such a breaking change.
> 
> 
> > Also, IIUC, this sysctl setting sticks at "*reuse", there is no way to
> > change it until all the listening sockets are closed which is exactly
> > the service disruption problem this series is trying to solve here.
> 
> Oh, exactly...
> If we apply this series by live patching, we cannot enable the feature
> without service disruption.
> 
> To enable the migration feature dynamically, how about this logic?
> In this logic, we do not save the sysctl value and check it at each time.
> 
>   1. no eBPF prog attached -> ON
>   2. eBPF prog attached and sysctl is 0 -> OFF
No.  When a bpf prog is attached and clearly signals (via
expected_attach_type here) that it can do migration, it should not depend
on anything else.  That would be very confusing to use.  When a prog is
successfully loaded, verified and attached, it is expected to run.

This sysctl essentially only disables the bpf prog with
type == BPF_PROG_TYPE_SK_REUSEPORT from running at a particular point.
This is going down a path of having another sysctl in the future
to disable another bpf prog type.  If there were a need to disable
bpf progs on a type-by-type basis, it would need a more
generic solution on the bpf side, done in a consistent way
for all prog types.  That needs a separate and longer discussion.

All behaviors of the BPF_SK_REUSEPORT_SELECT_OR_MIGRATE bpf prog
should not depend on this sysctl at all.

/* Pseudo code to show the idea only.
 * Actual implementation should try to fit
 * better into the current code and should look
 * quite different from here.
 */

if ((prog && prog->expected_attach_type == BPF_SK_REUSEPORT_SELECT_OR_MIGRATE)) {
	/* call bpf to migrate */
	action = BPF_PROG_RUN(prog, &reuse_kern);

	if (action == SK_PASS) {
		if (!reuse_kern.selected_sk)
			/* fallback to kernel random pick */
		else
			/* migrate to reuse_kern.selected_sk */
	} else {
		/* action == SK_DROP. don't do migration at all and
		 * don't fallback to kernel random pick.
		 */ 
	}
}

Going back to the sysctl: with BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
do you still need to add sysctl_tcp_migrate_req?
Regardless, if there is still a need,
the documentation for sysctl_tcp_migrate_req should be something like:
"the kernel will do a random pick when there is no bpf prog
 attached to the reuseport group...."

[ ps, my reply will be slow this week. ]

>   3. eBPF prog attached and sysctl is 1 -> ON
> 
> So, replacing 
> 
>   if (reuse->migrate_req)
> 
> to 
> 
>   if (!reuse->prog || net->ipv4.sysctl_tcp_migrate_req)
Iwashima, Kuniyuki Nov. 24, 2020, 9:24 a.m. UTC | #6
From:   Martin KaFai Lau <kafai@fb.com>
Date:   Sun, 22 Nov 2020 16:40:20 -0800
> On Sat, Nov 21, 2020 at 07:13:22PM +0900, Kuniyuki Iwashima wrote:
> > From:   Martin KaFai Lau <kafai@fb.com>
> > Date:   Thu, 19 Nov 2020 17:53:46 -0800
> > > On Fri, Nov 20, 2020 at 07:09:22AM +0900, Kuniyuki Iwashima wrote:
> > > > From: Martin KaFai Lau <kafai@fb.com>
> > > > Date: Wed, 18 Nov 2020 15:50:17 -0800
> > > > > On Tue, Nov 17, 2020 at 06:40:18PM +0900, Kuniyuki Iwashima wrote:
> > > > > > This patch lets reuseport_detach_sock() return a pointer of struct sock,
> > > > > > which is used only by inet_unhash(). If it is not NULL,
> > > > > > inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV
> > > > > > sockets from the closing listener to the selected one.
> > > > > > 
> > > > > > Listening sockets hold incoming connections as a linked list of struct
> > > > > > request_sock in the accept queue, and each request has reference to a full
> > > > > > socket and its listener. In inet_csk_reqsk_queue_migrate(), we unlink the
> > > > > > requests from the closing listener's queue and relink them to the head of
> > > > > > the new listener's queue. We do not process each request, so the migration
> > > > > > completes in O(1) time complexity. However, in the case of TCP_SYN_RECV
> > > > > > sockets, we will take special care in the next commit.
> > > > > > 
> > > > > > By default, we select the last element of socks[] as the new listener.
> > > > > > This behaviour is based on how the kernel moves sockets in socks[].
> > > > > > 
> > > > > > For example, we call listen() for four sockets (A, B, C, D), and close the
> > > > > > first two by turns. The sockets move in socks[] like below. (See also [1])
> > > > > > 
> > > > > >   socks[0] : A <-.      socks[0] : D          socks[0] : D
> > > > > >   socks[1] : B   |  =>  socks[1] : B <-.  =>  socks[1] : C
> > > > > >   socks[2] : C   |      socks[2] : C --'
> > > > > >   socks[3] : D --'
> > > > > > 
> > > > > > Then, if C and D have newer settings than A and B, and each socket has a
> > > > > > request (a, b, c, d) in their accept queue, we can redistribute old
> > > > > > requests evenly to new listeners.
> > > > > I don't think it should emphasize/claim there is a specific way that
> > > > > the kernel-pick here can redistribute the requests evenly.  It depends on
> > > > > how the application close/listen.  The userspace can not expect the
> > > > > ordering of socks[] will behave in a certain way.
> > > > 
> > > > I've expected replacing listeners by generations as a general use case.
> > > > But exactly. Users should not expect the undocumented kernel internal.
> > > > 
> > > > 
> > > > > The primary redistribution policy has to depend on BPF which is the
> > > > > policy defined by the user based on its application logic (e.g. how
> > > > > its binary restart work).  The application (and bpf) knows which one
> > > > > is a dying process and can avoid distributing to it.
> > > > > 
> > > > > The kernel-pick could be an optional fallback but not a must.  If the bpf
> > > > > prog is attached, I would even go further to call bpf to redistribute
> > > > > regardless of the sysctl, so I think the sysctl is not necessary.
> > > > 
> > > > I also think it is just an optional fallback, but to pick out a different
> > > > listener everytime, choosing the moved socket was reasonable. So the even
> > > > redistribution for a specific use case is a side effect of such socket
> > > > selection.
> > > > 
> > > > But, users should decide to use either way:
> > > >   (1) let the kernel select a new listener randomly
> > > >   (2) select a particular listener by eBPF
> > > > 
> > > > I will update the commit message like:
> > > > The kernel selects a new listener randomly, but as the side effect, it can
> > > > redistribute packets evenly for a specific case where an application
> > > > replaces listeners by generations.
> > > Since there is no feedback on sysctl, so may be something missed
> > > in the lines.
> > 
> > I'm sorry, I have missed this point while thinking about each reply...
> > 
> > 
> > > I don't think this migration logic should depend on a sysctl.
> > > At least not when a bpf prog is attached that is capable of doing
> > > migration, it is too fragile to ask user to remember to turn on
> > > the sysctl before attaching the bpf prog.
> > > 
> > > Your use case is to primarily based on bpf prog to pick or only based
> > > on kernel to do a random pick?
> Again, what is your primarily use case?

We have so many services and components that I cannot grasp all of their
implementations, but I started this series because a service component
that relies on the kernel's random pick suffered from this issue.


> > I think we have to care about both cases.
> > 
> > I think we can always enable the migration feature if eBPF prog is not
> > attached. On the other hand, if BPF_PROG_TYPE_SK_REUSEPORT prog is attached
> > to select a listener by some rules, along updating the kernel,
> > redistributing requests without user intention can break the application.
> > So, there is something needed to confirm user intension at least if eBPF
> > prog is attached.
> Right, something being able to tell if the bpf prog can do migration
> can confirm the user intention here.  However, this will not be a
> sysctl.
> 
> A new bpf_attach_type "BPF_SK_REUSEPORT_SELECT_OR_MIGRATE" can be added.
> "prog->expected_attach_type == BPF_SK_REUSEPORT_SELECT_OR_MIGRATE"
> can be used to decide if migration can be done by the bpf prog.
> Although the prog->expected_attach_type has not been checked for
> BPF_PROG_TYPE_SK_REUSEPORT, there was an earlier discussion
> that the risk of breaking is very small and is acceptable.
> 
> Instead of depending on !reuse_md->data to decide if it
> is doing migration or not, a clearer signal should be given
> to the bpf prog.  A "u8 migration" can be added to "struct sk_reuseport_kern"
> (and to "struct sk_reuseport_md" accordingly).  It can tell
> the bpf prog that it is doing migration.  It should also tell if it is
> migrating a list of established sk(s) or an individual req_sk.
> Accessing "reuse_md->migration" should only be allowed for
> BPF_SK_REUSEPORT_SELECT_OR_MIGRATE during is_valid_access().
> 
> During migration, if skb is not available, an empty skb can be used.
> Migration is a slow path and does not happen very often, so it will
> be fine even it has to create a temp skb (or may be a static const skb
> can be used, not sure but this is implementation details).

I greatly appreciate your detailed idea and explanation!
I will try to implement this.


> > But honestly, I believe such eBPF users can follow this change and
> > implement migration eBPF prog if we introduce such a breaking change.
> > 
> > 
> > > Also, IIUC, this sysctl setting sticks at "*reuse", there is no way to
> > > change it until all the listening sockets are closed which is exactly
> > > the service disruption problem this series is trying to solve here.
> > 
> > Oh, exactly...
> > If we apply this series by live patching, we cannot enable the feature
> > without service disruption.
> > 
> > To enable the migration feature dynamically, how about this logic?
> > In this logic, we do not save the sysctl value and check it at each time.
> > 
> >   1. no eBPF prog attached -> ON
> >   2. eBPF prog attached and sysctl is 0 -> OFF
> No.  When bpf prog is attached and it clearly signals (expected_attach_type
> here) it can do migration, it should not depend on anything else.  It is very
> confusing to use.  When a prog is successfully loaded, verified
> and attached, it is expected to run.
> 
> This sysctl essentially only disables the bpf prog with
> type == BPF_PROG_TYPE_SK_REUSEPORT running at a particular point.
> This is going down a path that having another sysctl in the future
> to disable another bpf prog type.  If there would be a need to disable
> bpf prog on a type-by-type bases, it would need a more
> generic solution on the bpf side and do it in a consistent way
> for all prog types.  It needs a separate and longer discussion.
> 
> All behaviors of the BPF_SK_REUSEPORT_SELECT_OR_MIGRATE bpf prog
> should not depend on this sysctl at all .
> 
> /* Pseudo code to show the idea only.
>  * Actual implementation should try to fit
>  * better into the current code and should look
>  * quite different from here.
>  */
> 
> if ((prog && prog->expected_attach_type == BPF_SK_REUSEPORT_SELECT_OR_MIGRATE)) {
> 	/* call bpf to migrate */
> 	action = BPF_PROG_RUN(prog, &reuse_kern);
> 
> 	if (action == SK_PASS) {
> 		if (!reuse_kern.selected_sk)
> 			/* fallback to kernel random pick */
> 		else
> 			/* migrate to reuse_kern.selected_sk */
> 	} else {
> 		/* action == SK_DROP. don't do migration at all and
> 		 * don't fallback to kernel random pick.
> 		 */ 
> 	}
> }
> 
> Going back to the sysctl, with BPF_SK_REUSEPORT_SELECT_OR_MIGRATE,
> do you still have a need on adding sysctl_tcp_migrate_req?

No, I no longer think the option should be a sysctl.
It will be BPF_SK_REUSEPORT_SELECT_OR_MIGRATE in the next series.
Thank you!


> Regardless, if there is still a need,
> the document for sysctl_tcp_migrate_req should be something like:
> "the kernel will do a random pick when there is no bpf prog
>  attached to the reuseport group...."
> 
> [ ps, my reply will be slow in this week. ]

Patch

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 7338b3865a2a..2ea2d743f8fc 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -260,6 +260,7 @@  struct dst_entry *inet_csk_route_child_sock(const struct sock *sk,
 struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
 				      struct request_sock *req,
 				      struct sock *child);
+void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk);
 void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
 				   unsigned long timeout);
 struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h
index ade3af55c91f..ece1c70ca907 100644
--- a/include/net/sock_reuseport.h
+++ b/include/net/sock_reuseport.h
@@ -32,7 +32,7 @@  struct sock_reuseport {
 extern int reuseport_alloc(struct sock *sk, bool bind_inany);
 extern int reuseport_add_sock(struct sock *sk, struct sock *sk2,
 			      bool bind_inany);
-extern void reuseport_detach_sock(struct sock *sk);
+extern struct sock *reuseport_detach_sock(struct sock *sk);
 extern struct sock *reuseport_select_sock(struct sock *sk,
 					  u32 hash,
 					  struct sk_buff *skb,
diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c
index 01a8b4ba39d7..74a46197854b 100644
--- a/net/core/sock_reuseport.c
+++ b/net/core/sock_reuseport.c
@@ -220,9 +220,10 @@  int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany)
 }
 EXPORT_SYMBOL(reuseport_add_sock);
 
-void reuseport_detach_sock(struct sock *sk)
+struct sock *reuseport_detach_sock(struct sock *sk)
 {
 	struct sock_reuseport *reuse;
+	struct sock *nsk = NULL;
 	int i;
 
 	spin_lock_bh(&reuseport_lock);
@@ -248,6 +249,9 @@  void reuseport_detach_sock(struct sock *sk)
 		reuse->socks[i] = reuse->socks[reuse->num_socks];
 
 		if (reuse->migrate_req) {
+			if (reuse->num_socks)
+				nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i];
+
 			reuse->num_closed_socks++;
 			reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk;
 		} else {
@@ -268,6 +272,8 @@  void reuseport_detach_sock(struct sock *sk)
 		call_rcu(&reuse->rcu, reuseport_free_rcu);
 out:
 	spin_unlock_bh(&reuseport_lock);
+
+	return nsk;
 }
 EXPORT_SYMBOL(reuseport_detach_sock);
 
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index be8cda5b664f..583db7e2b1da 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -992,6 +992,36 @@  struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
 }
 EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
 
+void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk)
+{
+	struct request_sock_queue *old_accept_queue, *new_accept_queue;
+
+	old_accept_queue = &inet_csk(sk)->icsk_accept_queue;
+	new_accept_queue = &inet_csk(nsk)->icsk_accept_queue;
+
+	spin_lock(&old_accept_queue->rskq_lock);
+	spin_lock(&new_accept_queue->rskq_lock);
+
+	if (old_accept_queue->rskq_accept_head) {
+		if (new_accept_queue->rskq_accept_head)
+			old_accept_queue->rskq_accept_tail->dl_next =
+				new_accept_queue->rskq_accept_head;
+		else
+			new_accept_queue->rskq_accept_tail = old_accept_queue->rskq_accept_tail;
+
+		new_accept_queue->rskq_accept_head = old_accept_queue->rskq_accept_head;
+		old_accept_queue->rskq_accept_head = NULL;
+		old_accept_queue->rskq_accept_tail = NULL;
+
+		WRITE_ONCE(nsk->sk_ack_backlog, nsk->sk_ack_backlog + sk->sk_ack_backlog);
+		WRITE_ONCE(sk->sk_ack_backlog, 0);
+	}
+
+	spin_unlock(&new_accept_queue->rskq_lock);
+	spin_unlock(&old_accept_queue->rskq_lock);
+}
+EXPORT_SYMBOL(inet_csk_reqsk_queue_migrate);
+
 struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
 					 struct request_sock *req, bool own_req)
 {
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 8cbe74313f38..f35c76cf3365 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -629,6 +629,7 @@  void inet_unhash(struct sock *sk)
 {
 	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
 	struct inet_listen_hashbucket *ilb = NULL;
+	struct sock *nsk;
 	spinlock_t *lock;
 
 	if (sk_unhashed(sk))
@@ -644,8 +645,12 @@  void inet_unhash(struct sock *sk)
 	if (sk_unhashed(sk))
 		goto unlock;
 
-	if (rcu_access_pointer(sk->sk_reuseport_cb))
-		reuseport_detach_sock(sk);
+	if (rcu_access_pointer(sk->sk_reuseport_cb)) {
+		nsk = reuseport_detach_sock(sk);
+		if (nsk)
+			inet_csk_reqsk_queue_migrate(sk, nsk);
+	}
+
 	if (ilb) {
 		inet_unhash2(hashinfo, sk);
 		ilb->count--;