Message ID: 20230327175446.98151-4-john.fastabend@gmail.com (mailing list archive)
State: Superseded
Delegated to: BPF
Series: bpf sockmap fixes
On Mon, Mar 27, 2023 at 10:54 AM -07, John Fastabend wrote:
> We noticed some rare sk_buffs were stepping past the queue when the
> system was under memory pressure. The general theory is to skip
> enqueueing sk_buffs when it's not necessary, which is the normal case
> for a system that is properly provisioned for the task: no memory
> pressure and enough CPU assigned.
>
> But, if we can't allocate memory due to an ENOMEM error when enqueueing
> the sk_buff into the sockmap receive queue, we push it onto a delayed
> workqueue to retry later. When a new sk_buff is received we then check
> whether that queue is empty. However, there is a problem with simply
> checking the queue length. When a sk_buff is being processed from the
> ingress queue, but is not yet on the sockmap msg receive queue, it is
> possible to also receive a sk_buff through the normal path. That path
> will see the ingress queue is empty and skip ahead of the packet still
> being processed.
>
> Previously we used the sock lock from both contexts, which made the
> problem harder to hit, but not impossible.
>
> To fix this, also check the 'state' variable where we cache a partially
> processed sk_buff. This catches the majority of cases. But we also need
> to take the mutex lock around this check, because we can't have both
> code paths running and still check sensibly. We could perhaps do this
> with atomic bit checks, but we are already here due to memory pressure,
> so slowing things down a bit seems OK, and it is simpler to just grab a
> lock.
>
> To reproduce the issue we run the NGINX compliance test with sockmap
> enabled and observe some flakes in our testing that we attributed to
> this issue.
>
> Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()")
> Tested-by: William Findlay <will@isovalent.com>
> Signed-off-by: John Fastabend <john.fastabend@gmail.com>
> ---

I've got an idea to try, but it'd be a bigger change.

skb_dequeue is lock, skb_peek, skb_unlink, unlock, right?

What if we split up the skb_dequeue in sk_psock_backlog to publish the
change to the ingress_skb queue only once an skb has been processed?

static void sk_psock_backlog(struct work_struct *work)
{
	...
	while ((skb = skb_peek_locked(&psock->ingress_skb))) {
		...
		skb_unlink(skb, &psock->ingress_skb);
	}
	...
}

Even more - if we hold off the unlinking until an skb has been fully
processed, that perhaps opens up the way to get rid of keeping state in
sk_psock_work_state. We could just skb_pull the processed data instead.

It's just an idea and I don't want to block a tested fix that LGTM, so
consider this:

Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
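[For illustration, a slightly more filled-in sketch of that peek-then-unlink
idea. This is a sketch only, not the eventual patch: skb_peek_locked() in the
message above is not an existing helper, so this version uses a plain
skb_peek() and assumes the backlog worker, serialized by work_mutex, is the
only consumer of ingress_skb while producers only append at the tail;
partial-progress handling inside sk_psock_handle_skb() is elided.]

static void sk_psock_backlog(struct work_struct *work)
{
	struct sk_psock *psock = container_of(work, struct sk_psock, work);
	struct sk_buff *skb;
	int ret;

	mutex_lock(&psock->work_mutex);
	/* Peek, don't dequeue: the skb stays visible on ingress_skb while
	 * it is processed, so a verdict-path skb_queue_empty() test cannot
	 * slip past it.
	 */
	while ((skb = skb_peek(&psock->ingress_skb))) {
		u32 off = 0, len = skb->len;
		bool ingress = skb_bpf_ingress(skb);

		if (skb_bpf_strparser(skb)) {
			struct strp_msg *stm = strp_msg(skb);

			off = stm->offset;
			len = stm->full_len;
		}
		ret = sk_psock_handle_skb(psock, skb, off, len, ingress);
		if (ret <= 0)
			break;	/* e.g. -ENOMEM: leave the skb queued,
				 * reschedule and retry later */
		/* Fully processed: only now publish the removal. */
		skb = skb_dequeue(&psock->ingress_skb);
		kfree_skb(skb);
	}
	mutex_unlock(&psock->work_mutex);
}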
Jakub Sitnicki wrote:
> On Mon, Mar 27, 2023 at 10:54 AM -07, John Fastabend wrote:

[...]

> I've got an idea to try, but it'd be a bigger change.
>
> skb_dequeue is lock, skb_peek, skb_unlink, unlock, right?
>
> What if we split up the skb_dequeue in sk_psock_backlog to publish the
> change to the ingress_skb queue only once an skb has been processed?

I think this works now. Early on we tried to run the backlog without
locking, but given that it is now serialized by the mutex I think it
works out. We have a few places that work on ingress_skb, but those all
enqueue at the tail, so it should be fine to peek here.

> static void sk_psock_backlog(struct work_struct *work)
> {
> 	...
> 	while ((skb = skb_peek_locked(&psock->ingress_skb))) {
> 		...
> 		skb_unlink(skb, &psock->ingress_skb);
> 	}
> 	...
> }
>
> Even more - if we hold off the unlinking until an skb has been fully
> processed, that perhaps opens up the way to get rid of keeping state in
> sk_psock_work_state. We could just skb_pull the processed data instead.

Yep.

> It's just an idea and I don't want to block a tested fix that LGTM, so
> consider this:

Did you get a chance to try this? Otherwise I can also give the idea a
try next week.

> Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
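[If it helps, a minimal sketch of the skb_pull() variant being agreed to
here. psock_send_one() is a hypothetical stand-in for the real per-skb send
step, assumed to return bytes consumed or a negative error; the point is
that progress lives in the skb itself, so nothing needs to be cached in
sk_psock_work_state across worker runs.]

/* Hypothetical helper: forward data from the head of @skb, returning
 * bytes consumed or a negative error such as -ENOMEM.
 */
static int psock_send_one(struct sk_psock *psock, struct sk_buff *skb);

static int sk_psock_backlog_one(struct sk_psock *psock, struct sk_buff *skb)
{
	int sent;

	while (skb->len) {
		sent = psock_send_one(psock, skb);
		if (sent <= 0)
			/* Leave the skb on ingress_skb; skb->data already
			 * records how far we got, replacing state->off/len.
			 */
			return sent ? sent : -EAGAIN;
		/* Consume what was sent; progress is published in the skb. */
		skb_pull(skb, sent);
	}
	return 0;	/* fully consumed: caller can now unlink and free it */
}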
On Fri, Mar 31, 2023 at 05:59 PM -07, John Fastabend wrote:
> Jakub Sitnicki wrote:

[...]

>> It's just an idea and I don't want to block a tested fix that LGTM, so
>> consider this:
>
> Did you get a chance to try this? Otherwise I can also give the idea
> a try next week.

No. By all means, feel free. You caught me at a busy time. That's also
why the review has been going so slow.
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 96a6a3a74a67..34de0605694e 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -985,6 +985,7 @@ EXPORT_SYMBOL_GPL(sk_psock_tls_strp_read);
 static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
 				  int verdict)
 {
+	struct sk_psock_work_state *state;
 	struct sock *sk_other;
 	int err = 0;
 	u32 len, off;
@@ -1001,13 +1002,28 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
 
 		skb_bpf_set_ingress(skb);
 
+		/* We need to grab mutex here because in-flight skb is in one of
+		 * the following states: either on ingress_skb, in psock->state
+		 * or being processed by backlog and neither in state->skb and
+		 * ingress_skb may be also empty. The troublesome case is when
+		 * the skb has been dequeued from ingress_skb list or taken from
+		 * state->skb because we can not easily test this case. Maybe we
+		 * could be clever with flags and resolve this but being clever
+		 * got us here in the first place and we note this is done under
+		 * sock lock and backlog conditions mean we are already running
+		 * into ENOMEM or other performance hindering cases so lets do
+		 * the obvious thing and grab the mutex.
+		 */
+		mutex_lock(&psock->work_mutex);
+		state = &psock->work_state;
+
 		/* If the queue is empty then we can submit directly
 		 * into the msg queue. If its not empty we have to
 		 * queue work otherwise we may get OOO data. Otherwise,
 		 * if sk_psock_skb_ingress errors will be handled by
 		 * retrying later from workqueue.
 		 */
-		if (skb_queue_empty(&psock->ingress_skb)) {
+		if (skb_queue_empty(&psock->ingress_skb) && likely(!state->skb)) {
 			len = skb->len;
 			off = 0;
 			if (skb_bpf_strparser(skb)) {
@@ -1028,9 +1044,11 @@ static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
 			spin_unlock_bh(&psock->ingress_lock);
 			if (err < 0) {
 				skb_bpf_redirect_clear(skb);
+				mutex_unlock(&psock->work_mutex);
 				goto out_free;
 			}
 		}
+		mutex_unlock(&psock->work_mutex);
 		break;
 	case __SK_REDIRECT:
 		err = sk_psock_skb_redirect(psock, skb);