diff mbox series

[net,1/1] netfilter: conntrack: Check offload bit on table dump

Message ID 20210128074052.777999-1-roid@nvidia.com (mailing list archive)
State Awaiting Upstream
Delegated to: Netdev Maintainers
Headers show
Series [net,1/1] netfilter: conntrack: Check offload bit on table dump | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Clearly marked for net
netdev/subject_prefix success Link
netdev/cc_maintainers warning 6 maintainers not CCed: kadlec@netfilter.org fw@strlen.de coreteam@netfilter.org netfilter-devel@vger.kernel.org davem@davemloft.net kuba@kernel.org
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 155 this patch: 155
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 8 lines checked
netdev/build_allmodconfig_warn success Errors and warnings before: 155 this patch: 155
netdev/header_inline success Link
netdev/stable success Stable not CCed

Commit Message

Roi Dayan Jan. 28, 2021, 7:40 a.m. UTC
Currently, offloaded flows might be deleted when executing conntrack -L
or cat /proc/net/nf_conntrack while rules being offloaded.
Ct timeout is not maintained for offloaded flows as aging
of offloaded flows are managed by the flow table offload infrastructure.

Don't do garbage collection for offloaded flows when dumping the
entries.

Fixes: 90964016e5d3 ("netfilter: nf_conntrack: add IPS_OFFLOAD status bit")
Signed-off-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
---
 include/net/netfilter/nf_conntrack.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Pablo Neira Ayuso Jan. 30, 2021, 12:01 p.m. UTC | #1
Hi Roi,

On Thu, Jan 28, 2021 at 09:40:52AM +0200, Roi Dayan wrote:
> Currently, offloaded flows might be deleted when executing conntrack -L
> or cat /proc/net/nf_conntrack while rules being offloaded.
> Ct timeout is not maintained for offloaded flows as aging
> of offloaded flows are managed by the flow table offload infrastructure.
> 
> Don't do garbage collection for offloaded flows when dumping the
> entries.
> 
> Fixes: 90964016e5d3 ("netfilter: nf_conntrack: add IPS_OFFLOAD status bit")
> Signed-off-by: Roi Dayan <roid@nvidia.com>
> Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
> ---
>  include/net/netfilter/nf_conntrack.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
> index 439379ca9ffa..87c85109946a 100644
> --- a/include/net/netfilter/nf_conntrack.h
> +++ b/include/net/netfilter/nf_conntrack.h
> @@ -276,7 +276,7 @@ static inline bool nf_ct_is_expired(const struct nf_conn *ct)
>  static inline bool nf_ct_should_gc(const struct nf_conn *ct)
>  {
>  	return nf_ct_is_expired(ct) && nf_ct_is_confirmed(ct) &&
> -	       !nf_ct_is_dying(ct);
> +	       !nf_ct_is_dying(ct) && !test_bit(IPS_OFFLOAD_BIT, &ct->status);

The gc_worker() calls nf_ct_offload_timeout() if the flow if
offloaded, so it extends the timeout to skip the garbage collection.

Could you update ctnetlink_dump_table() and ct_seq_show() to extend
the timeout if the flow is offloaded?

Thanks.
Pablo Neira Ayuso Feb. 1, 2021, 3:08 a.m. UTC | #2
Hi Roi,

On Sun, Jan 31, 2021 at 03:18:34PM +0200, Roi Dayan wrote:
[...]
> Hi Pablo,
> 
> We did more tests with just updating the timeout in the 2 callers
> and it's not enough. We reproduce the issue of rules being timed
> out just now frim different place.

Thanks for giving it a try to my suggestion, it was not correct.

> There is a 3rd caller nf_ct_gc_expired() which being called by 3
> other callers:
> ____nf_conntrack_find()
> nf_conntrack_tuple_taken()
> early_drop_list()

Hm. I'm not sure yet what path is triggering this bug.

Florian came up with the idea of setting a very large timeout for
offloaded flows (that are refreshed by the garbage collector) to avoid
the extra check from the packet path, so those 3 functions above never
hit the garbage collection path. This also applies for the ctnetlink
(conntrack -L) and the /proc/net/nf_conntrack sysctl paths that the
patch describes, those should not ever see an offloaded flow with a
small timeout.

nf_ct_offload_timeout() is called from:

#1 flow_offload_add() to set a very large timer.
#2 the garbage collector path, to refresh the timeout the very large
   offload timer.

Probably there is a race between setting the IPS_OFFLOAD and when
flow_offload_add() is called? Garbage collector gets in between and
zaps the connection. Is a newly offloaded connection that you observed
that is being removed?

> only early_drop_list() has a check to skip conns with offload bit
> but without extending the timeout.
> I didnt do a dump but the issue could be from the other 2 calls.
> 
> With current commit as is I didn't need to check more callers as I made
> sure all callers will skip the non-offload gc.
>
> Instead of updating more callers and there might be more callers
> later why current commit is not enough?
> We skip offloaded flows and soon gc_worker() will hit and will update
> the timeout anyway.

Another possibility would be to check for the offload bit from
nf_ct_is_expired(), which is coming slighty before nf_ct_should_gc().
But this is also in the ____nf_conntrack_find() path.

Florian?
Roi Dayan Feb. 1, 2021, 7:53 a.m. UTC | #3
On 2021-02-01 5:08 AM, Pablo Neira Ayuso wrote:
> Hi Roi,
> 
> On Sun, Jan 31, 2021 at 03:18:34PM +0200, Roi Dayan wrote:
> [...]
>> Hi Pablo,
>>
>> We did more tests with just updating the timeout in the 2 callers
>> and it's not enough. We reproduce the issue of rules being timed
>> out just now frim different place.
> 
> Thanks for giving it a try to my suggestion, it was not correct.
> 
>> There is a 3rd caller nf_ct_gc_expired() which being called by 3
>> other callers:
>> ____nf_conntrack_find()
>> nf_conntrack_tuple_taken()
>> early_drop_list()
> 
> Hm. I'm not sure yet what path is triggering this bug.
> 
> Florian came up with the idea of setting a very large timeout for
> offloaded flows (that are refreshed by the garbage collector) to avoid
> the extra check from the packet path, so those 3 functions above never
> hit the garbage collection path. This also applies for the ctnetlink
> (conntrack -L) and the /proc/net/nf_conntrack sysctl paths that the
> patch describes, those should not ever see an offloaded flow with a
> small timeout.
> 
> nf_ct_offload_timeout() is called from:
> 
> #1 flow_offload_add() to set a very large timer.
> #2 the garbage collector path, to refresh the timeout the very large
>     offload timer.
> 
> Probably there is a race between setting the IPS_OFFLOAD and when
> flow_offload_add() is called? Garbage collector gets in between and
> zaps the connection. Is a newly offloaded connection that you observed
> that is being removed?
> 

yes. the flows being removed are newly offloaded connections.
I don't think the race is between setting IPS_OFFLOAD and calling
flow_offload_add().
In act_ct.c tcf_ct_flow_table_add() we first set the bit
and then call flow_offload_add().
I think the race is more of getting into the other flows and gc still
didn't kick in.
I'm sure of this because we added trace dump in nf_ct_delete()
and while creating connections we also ran in a loop conntrack -L.
In the same way instead of conntrack we did the dump from
/proc/net/nf_conntrack and we also saw the trace from there.
In this scenario if we stopped the loop of the dump we didn't crash
and then in more tests we failed again after I tried to moving the bit
check to your suggestion.
But leaving the bit check as in this commit we didn't reproduce the
issue at all.

>> only early_drop_list() has a check to skip conns with offload bit
>> but without extending the timeout.
>> I didnt do a dump but the issue could be from the other 2 calls.
>>
>> With current commit as is I didn't need to check more callers as I made
>> sure all callers will skip the non-offload gc.
>>
>> Instead of updating more callers and there might be more callers
>> later why current commit is not enough?
>> We skip offloaded flows and soon gc_worker() will hit and will update
>> the timeout anyway.
> 
> Another possibility would be to check for the offload bit from
> nf_ct_is_expired(), which is coming slighty before nf_ct_should_gc().
> But this is also in the ____nf_conntrack_find() path.
> > Florian?
> 

Right. beside the dumps paths that just call nf_ct_should_gc() which
calls nf_ct_is_expired() again.
Moving the bit check into nf_ct_is_expired() should work the same as
this commit.
I'll wait for a final decision if you prefer the bit check in
nf_ct_is_expired() and we will test again.
Florian Westphal Feb. 1, 2021, 11:50 a.m. UTC | #4
Roi Dayan <roid@nvidia.com> wrote:
> > > There is a 3rd caller nf_ct_gc_expired() which being called by 3
> > > other callers:
> > > ____nf_conntrack_find()
> > > nf_conntrack_tuple_taken()
> > > early_drop_list()
> > 
> > Hm. I'm not sure yet what path is triggering this bug.
> > 
> > Florian came up with the idea of setting a very large timeout for
> > offloaded flows (that are refreshed by the garbage collector) to avoid
> > the extra check from the packet path, so those 3 functions above never
> > hit the garbage collection path. This also applies for the ctnetlink
> > (conntrack -L) and the /proc/net/nf_conntrack sysctl paths that the
> > patch describes, those should not ever see an offloaded flow with a
> > small timeout.
> > 
> > nf_ct_offload_timeout() is called from:
> > 
> > #1 flow_offload_add() to set a very large timer.
> > #2 the garbage collector path, to refresh the timeout the very large
> >     offload timer.
> > 
> > Probably there is a race between setting the IPS_OFFLOAD and when
> > flow_offload_add() is called? Garbage collector gets in between and
> > zaps the connection. Is a newly offloaded connection that you observed
> > that is being removed?
> > 
> 
> yes. the flows being removed are newly offloaded connections.

If they are new, how can they be timed out already?

TCP initial timeout is one minute, UDP 30 seconds.
That should surely be enough to do flow_offload_add (which extends
the timeout)?

Maybe something is doing flow_offload_add() for unconfirmed conntrack?

In unconfirmed conntrack case, ct->timeout is absolute timeout value, e.g. for
tcp it will be set to 60 * HZ.

conntrack confirmation adds jiffies32 to it to make it relative
to current time (this is before insertion into the conntrack table,
so GC isn't supposed to happen before this).

In any case adding test for the offload bit seems to be papering over
invalid/broken ct->timeout value.
Roi Dayan Feb. 1, 2021, 3:04 p.m. UTC | #5
On 2021-02-01 1:50 PM, Florian Westphal wrote:
> Roi Dayan <roid@nvidia.com> wrote:
>>>> There is a 3rd caller nf_ct_gc_expired() which being called by 3
>>>> other callers:
>>>> ____nf_conntrack_find()
>>>> nf_conntrack_tuple_taken()
>>>> early_drop_list()
>>>
>>> Hm. I'm not sure yet what path is triggering this bug.
>>>
>>> Florian came up with the idea of setting a very large timeout for
>>> offloaded flows (that are refreshed by the garbage collector) to avoid
>>> the extra check from the packet path, so those 3 functions above never
>>> hit the garbage collection path. This also applies for the ctnetlink
>>> (conntrack -L) and the /proc/net/nf_conntrack sysctl paths that the
>>> patch describes, those should not ever see an offloaded flow with a
>>> small timeout.
>>>
>>> nf_ct_offload_timeout() is called from:
>>>
>>> #1 flow_offload_add() to set a very large timer.
>>> #2 the garbage collector path, to refresh the timeout the very large
>>>      offload timer.
>>>
>>> Probably there is a race between setting the IPS_OFFLOAD and when
>>> flow_offload_add() is called? Garbage collector gets in between and
>>> zaps the connection. Is a newly offloaded connection that you observed
>>> that is being removed?
>>>
>>
>> yes. the flows being removed are newly offloaded connections.
> 
> If they are new, how can they be timed out already?
> 
> TCP initial timeout is one minute, UDP 30 seconds.
> That should surely be enough to do flow_offload_add (which extends
> the timeout)?

Yes, flow_offload_add() extends the timeout. but it needs to finish.

> 
> Maybe something is doing flow_offload_add() for unconfirmed conntrack?
> 
> In unconfirmed conntrack case, ct->timeout is absolute timeout value, e.g. for
> tcp it will be set to 60 * HZ.

When I hit the issue I printed jiffies and ct->timeout and saw they are
the same or very close but not an absolute number.

> 
> conntrack confirmation adds jiffies32 to it to make it relative
> to current time (this is before insertion into the conntrack table,
> so GC isn't supposed to happen before this).
> 

We hit this issue before more easily and pushed this fix

4203b19c2796 netfilter: flowtable: Set offload timeout when adding flow

That commit changed flow_offload_add() to extend ct timeout because on
we noticed on busy systems GC didn't finish a full iteration on all
conns and conns were cleaned.
I think we might have the same issue.

tcf_ct_flow_table_add() set the offload bit and calls flow_offload_add()

We do know the offload bit is set when conn it deleted, so we hit the
issue where timeout being tested after tcf_ct_flow_table_add() was 
called but before ct timeout was fixed. so flow_offload_add() didn't
finish and GC didn't start, or did start but did not finish full
iteration.

> In any case adding test for the offload bit seems to be papering over
> invalid/broken ct->timeout value.
>
Florian Westphal Feb. 1, 2021, 3:25 p.m. UTC | #6
Roi Dayan <roid@nvidia.com> wrote:
> > TCP initial timeout is one minute, UDP 30 seconds.
> > That should surely be enough to do flow_offload_add (which extends
> > the timeout)?
> 
> Yes, flow_offload_add() extends the timeout. but it needs to finish.
> 
> > 
> > Maybe something is doing flow_offload_add() for unconfirmed conntrack?
> > 
> > In unconfirmed conntrack case, ct->timeout is absolute timeout value, e.g. for
> > tcp it will be set to 60 * HZ.
> 
> When I hit the issue I printed jiffies and ct->timeout and saw they are
> the same or very close but not an absolute number.

Thats strange, for new flows they should not be close at all.
UDP sets a 30 second timeout, TCP sets a 60 second initial timeout.

Do you think rhashtable_insert_fast() in flow_offload_add() blocks for
dozens of seconds?

Thats about the only thing I can see between 'offload bit gets set'
and 'timeout is extended' in flow_offload_add() that could at least
spend *some* time.

> We hit this issue before more easily and pushed this fix
>
> 4203b19c2796 netfilter: flowtable: Set offload timeout when adding flow

This fix makes sense to me.
Roi Dayan Feb. 2, 2021, 5:08 p.m. UTC | #7
On 2021-02-01 5:25 PM, Florian Westphal wrote:
> Roi Dayan <roid@nvidia.com> wrote:
>>> TCP initial timeout is one minute, UDP 30 seconds.
>>> That should surely be enough to do flow_offload_add (which extends
>>> the timeout)?
>>
>> Yes, flow_offload_add() extends the timeout. but it needs to finish.
>>
>>>
>>> Maybe something is doing flow_offload_add() for unconfirmed conntrack?
>>>
>>> In unconfirmed conntrack case, ct->timeout is absolute timeout value, e.g. for
>>> tcp it will be set to 60 * HZ.
>>
>> When I hit the issue I printed jiffies and ct->timeout and saw they are
>> the same or very close but not an absolute number.
> 
> Thats strange, for new flows they should not be close at all.
> UDP sets a 30 second timeout, TCP sets a 60 second initial timeout.
> 
> Do you think rhashtable_insert_fast() in flow_offload_add() blocks for
> dozens of seconds?

I'm not sure. but its not only that but also the time to be in
established state as only then we offload.

> 
> Thats about the only thing I can see between 'offload bit gets set'
> and 'timeout is extended' in flow_offload_add() that could at least
> spend *some* time.
> 
>> We hit this issue before more easily and pushed this fix
>>
>> 4203b19c2796 netfilter: flowtable: Set offload timeout when adding flow
> 
> This fix makes sense to me.
> 


I just noted we didn't test correctly Pablo's suggestion instead of
to check the bit and extend the timeout in ctnetlink_dump_table() and
ct_seq_show() like GC does.
We replaced nf_conntrack module but not nf_conntrack_netlink and tested
with conntrack -L.
We redo the test and replaced correct modules and it looks ok now.
We still need to run the long test to be sure but is that still
a good option as a fix?
Florian Westphal Feb. 3, 2021, 12:50 p.m. UTC | #8
Roi Dayan <roid@nvidia.com> wrote:
> > Do you think rhashtable_insert_fast() in flow_offload_add() blocks for
> > dozens of seconds?
> 
> I'm not sure. but its not only that but also the time to be in
> established state as only then we offload.

That makes it even more weird.  Timeout for established is even larger.
In case of TCP, its days... so I don't understand at all :/

> > Thats about the only thing I can see between 'offload bit gets set'
> > and 'timeout is extended' in flow_offload_add() that could at least
> > spend *some* time.
> > 
> > > We hit this issue before more easily and pushed this fix
> > > 
> > > 4203b19c2796 netfilter: flowtable: Set offload timeout when adding flow
> > 
> > This fix makes sense to me.
> 
> I just noted we didn't test correctly Pablo's suggestion instead of
> to check the bit and extend the timeout in ctnetlink_dump_table() and
> ct_seq_show() like GC does.

Ok.  Extending it there makes sense, but I still don't understand
why newly offloaded flows hit this problem.

Flow offload should never see a 'about to expire' ct entry.
The 'extend timeout from gc' is more to make sure GC doesn't reap
long-lived entries that have been offloaded aeons ago, not 'prevent
new flows from getting zapped...'
Roi Dayan Feb. 7, 2021, 8:38 a.m. UTC | #9
On 2021-02-03 2:50 PM, Florian Westphal wrote:
> Roi Dayan <roid@nvidia.com> wrote:
>>> Do you think rhashtable_insert_fast() in flow_offload_add() blocks for
>>> dozens of seconds?
>>
>> I'm not sure. but its not only that but also the time to be in
>> established state as only then we offload.
> 
> That makes it even more weird.  Timeout for established is even larger.
> In case of TCP, its days... so I don't understand at all :/
> 
>>> Thats about the only thing I can see between 'offload bit gets set'
>>> and 'timeout is extended' in flow_offload_add() that could at least
>>> spend *some* time.
>>>
>>>> We hit this issue before more easily and pushed this fix
>>>>
>>>> 4203b19c2796 netfilter: flowtable: Set offload timeout when adding flow
>>>
>>> This fix makes sense to me.
>>
>> I just noted we didn't test correctly Pablo's suggestion instead of
>> to check the bit and extend the timeout in ctnetlink_dump_table() and
>> ct_seq_show() like GC does.
> 
> Ok.  Extending it there makes sense, but I still don't understand
> why newly offloaded flows hit this problem.
> 
> Flow offload should never see a 'about to expire' ct entry.
> The 'extend timeout from gc' is more to make sure GC doesn't reap
> long-lived entries that have been offloaded aeons ago, not 'prevent
> new flows from getting zapped...'
> 

I'll add that an extended timeout from gc is if gc actually iterated
this conn since it was created.
I'll investigate this more and try to do more tests.
thanks for the info
diff mbox series

Patch

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index 439379ca9ffa..87c85109946a 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -276,7 +276,7 @@  static inline bool nf_ct_is_expired(const struct nf_conn *ct)
 static inline bool nf_ct_should_gc(const struct nf_conn *ct)
 {
 	return nf_ct_is_expired(ct) && nf_ct_is_confirmed(ct) &&
-	       !nf_ct_is_dying(ct);
+	       !nf_ct_is_dying(ct) && !test_bit(IPS_OFFLOAD_BIT, &ct->status);
 }
 
 #define	NF_CT_DAY	(86400 * HZ)