[v2] dhcp: fix overflow causing retries to stop

Message ID	20240111133331.1423853-1-prestwoj@gmail.com (mailing list archive)
State	New
Headers	show Received: from mail-qk1-f181.google.com (mail-qk1-f181.google.com [209.85.222.181]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 12DAD3B195 for <ell@lists.linux.dev>; Thu, 11 Jan 2024 13:33:36 +0000 (UTC) From: James Prestwood <prestwoj@gmail.com> To: ell@lists.linux.dev Cc: James Prestwood <prestwoj@gmail.com> Subject: [PATCH v2] dhcp: fix overflow causing retries to stop Date: Thu, 11 Jan 2024 05:33:31 -0800 Message-Id: <20240111133331.1423853-1-prestwoj@gmail.com> Precedence: bulk MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	[v2] dhcp: fix overflow causing retries to stop \| expand [v2] dhcp: fix overflow causing retries to stop

Context	Check	Description
tedd_an/pre-ci_am	success	Success
prestwoj/iwd-ci-makedistcheck	success	Make Distcheck
prestwoj/iwd-ci-build	success	Build - Configure
prestwoj/iwd-ci-makecheckvalgrind	success	Make Check w/Valgrind
prestwoj/iwd-ci-makecheck	success	Make Check
prestwoj/iwd-ci-clang	success	clang PASS
prestwoj/iwd-ci-testrunner	success	test-runner PASS

James Prestwood Jan. 11, 2024, 1:33 p.m. UTC

If DHCP is in a SELECTING/REQUESTING state and the number of attempts
reached a value where 2 << attempts overflowed an unsigned int the
next timeout would become zero, causing DHCP to never retry without
any event notification.

Roughly 2 minutes has passed when we reach >5 attempts so it makes
little sense to retry indefinitely, at least without notifying the
upper layers (which could decided to retry themselves).

Add a new event L_DHCP_CLIENT_EVENT_MAX_RETRIES_REACHED to notify
netconfig of the situation. Netconfig will then send a failure event
and the consumer can decide how to proceed.

Fixes: f130c448 ("dhcp: Introduce timeout fuzzing")
---
 ell/dhcp.c      | 19 ++++++++++++++++---
 ell/dhcp.h      |  1 +
 ell/netconfig.c |  6 ++++++
 3 files changed, 23 insertions(+), 3 deletions(-)

v2:
 * Added new event instead of retry forever.

Denis Kenzior Jan. 12, 2024, 5:11 p.m. UTC | #1

Hi James,

On 1/11/24 07:33, James Prestwood wrote:
> If DHCP is in a SELECTING/REQUESTING state and the number of attempts
> reached a value where 2 << attempts overflowed an unsigned int the
> next timeout would become zero, causing DHCP to never retry without
> any event notification.
> 
> Roughly 2 minutes has passed when we reach >5 attempts so it makes
> little sense to retry indefinitely, at least without notifying the
> upper layers (which could decided to retry themselves).

Okay, but '5' is just a magic number in the code with no context.  Lets avoid that.

Can we make this limit configurable?  Say between 3..30?  For iwd, even 2 
minutes might be too long.

Regards,
-Denis

James Prestwood Jan. 14, 2024, 4:12 p.m. UTC | #2

Hi Denis,

On 1/12/24 9:11 AM, Denis Kenzior wrote:
> Hi James,
>
> On 1/11/24 07:33, James Prestwood wrote:
>> If DHCP is in a SELECTING/REQUESTING state and the number of attempts
>> reached a value where 2 << attempts overflowed an unsigned int the
>> next timeout would become zero, causing DHCP to never retry without
>> any event notification.
>>
>> Roughly 2 minutes has passed when we reach >5 attempts so it makes
>> little sense to retry indefinitely, at least without notifying the
>> upper layers (which could decided to retry themselves).
>
> Okay, but '5' is just a magic number in the code with no context. Lets 
> avoid that.
>
> Can we make this limit configurable?  Say between 3..30?  For iwd, 
> even 2 minutes might be too long.

I'm thinking it may be better to leave the default unlimited retries but 
add a configurable time limit versus retry limit. Setting based on 
retries isn't intuitive a) because 99% of people have no idea what the 
backoff algorithm is, and b) time limits are fuzzed so its not an exact 
amount of time.

I say this because if we want to include netconfig into IWD's Connect() 
DBus method return we kinda need a way to define a time limit as opposed 
to guessing how long 2-3 retries will actually take. We, potentially, 
could iterate through several BSS's and fail extending the time to 
connect, then start DHCP. We would need a way to limit netconfig to 
within the DBus method timeout (25 seconds IIRC), including how long all 
the connect attempts took prior. Maybe I'm overthinking it but 
theoretically we could hit the method timeout limit so we may want to 
add handling for it, which would be easier if we did the above with a 
timeout versus retries.

Thanks,

James

>
> Regards,
> -Denis

Denis Kenzior Jan. 15, 2024, 2:30 p.m. UTC | #3

Hi James,

On 1/14/24 10:12, James Prestwood wrote:
> Hi Denis,
> 
> On 1/12/24 9:11 AM, Denis Kenzior wrote:
>> Hi James,
>>
>> On 1/11/24 07:33, James Prestwood wrote:
>>> If DHCP is in a SELECTING/REQUESTING state and the number of attempts
>>> reached a value where 2 << attempts overflowed an unsigned int the
>>> next timeout would become zero, causing DHCP to never retry without
>>> any event notification.
>>>
>>> Roughly 2 minutes has passed when we reach >5 attempts so it makes
>>> little sense to retry indefinitely, at least without notifying the
>>> upper layers (which could decided to retry themselves).
>>
>> Okay, but '5' is just a magic number in the code with no context. Lets avoid 
>> that.
>>
>> Can we make this limit configurable?  Say between 3..30?  For iwd, even 2 
>> minutes might be too long.
> 
> I'm thinking it may be better to leave the default unlimited retries but add a 
> configurable time limit versus retry limit. Setting based on retries isn't 
> intuitive a) because 99% of people have no idea what the backoff algorithm is, 
> and b) time limits are fuzzed so its not an exact amount of time.

Hopefully we know enough to tune the retries :)

> 
> I say this because if we want to include netconfig into IWD's Connect() DBus 
> method return we kinda need a way to define a time limit as opposed to guessing 
> how long 2-3 retries will actually take. We, potentially, could iterate through 
> several BSS's and fail extending the time to connect, then start DHCP. We would 
> need a way to limit netconfig to within the DBus method timeout (25 seconds 

DBus method timeouts don't exist on the protocol level.  ell doesn't even let 
you specify a method timeout for that reason.  Some libraries use a locally 
generated timeout by default (such as libdbus-1), but such libraries usually 
provide a way to specify an arbitrary timeout, or even disable it.

The problem with such timeouts is that when a library-local timeout is hit, 
creating a fake dbus method reply, there's no way to tell the callee.  In the 
case of iwd, it will just continue its Connect() operation since nothing 
happened as far as it is concerned.  Any dbus method timeouts are up to the 
application to deal with correctly...

Also, having dhcp use its own timeout incurs a cost of an extra file-descriptor. 
  Not worth it.

> IIRC), including how long all the connect attempts took prior. Maybe I'm 
> overthinking it but theoretically we could hit the method timeout limit so we 
> may want to add handling for it, which would be easier if we did the above with 
> a timeout versus retries.

All we can do is document that Connect might be a long-running operation (should 
be obvious) for the reasons you outline.

Regards,
-Denis

James Prestwood Jan. 16, 2024, 2:11 p.m. UTC | #4

Hi Denis,

On 1/15/24 6:30 AM, Denis Kenzior wrote:
> Hi James,
>
> On 1/14/24 10:12, James Prestwood wrote:
>> Hi Denis,
>>
>> On 1/12/24 9:11 AM, Denis Kenzior wrote:
>>> Hi James,
>>>
>>> On 1/11/24 07:33, James Prestwood wrote:
>>>> If DHCP is in a SELECTING/REQUESTING state and the number of attempts
>>>> reached a value where 2 << attempts overflowed an unsigned int the
>>>> next timeout would become zero, causing DHCP to never retry without
>>>> any event notification.
>>>>
>>>> Roughly 2 minutes has passed when we reach >5 attempts so it makes
>>>> little sense to retry indefinitely, at least without notifying the
>>>> upper layers (which could decided to retry themselves).
>>>
>>> Okay, but '5' is just a magic number in the code with no context. 
>>> Lets avoid that.
>>>
>>> Can we make this limit configurable?  Say between 3..30?  For iwd, 
>>> even 2 minutes might be too long.
>>
>> I'm thinking it may be better to leave the default unlimited retries 
>> but add a configurable time limit versus retry limit. Setting based 
>> on retries isn't intuitive a) because 99% of people have no idea what 
>> the backoff algorithm is, and b) time limits are fuzzed so its not an 
>> exact amount of time.
>
> Hopefully we know enough to tune the retries :)
I was more saying in the context of someone using the ELL API, not 
specifically for IWD.
>
>>
>> I say this because if we want to include netconfig into IWD's 
>> Connect() DBus method return we kinda need a way to define a time 
>> limit as opposed to guessing how long 2-3 retries will actually take. 
>> We, potentially, could iterate through several BSS's and fail 
>> extending the time to connect, then start DHCP. We would need a way 
>> to limit netconfig to within the DBus method timeout (25 seconds 
>
> DBus method timeouts don't exist on the protocol level.  ell doesn't 
> even let you specify a method timeout for that reason. Some libraries 
> use a locally generated timeout by default (such as libdbus-1), but 
> such libraries usually provide a way to specify an arbitrary timeout, 
> or even disable it.
>
> The problem with such timeouts is that when a library-local timeout is 
> hit, creating a fake dbus method reply, there's no way to tell the 
> callee.  In the case of iwd, it will just continue its Connect() 
> operation since nothing happened as far as it is concerned.  Any dbus 
> method timeouts are up to the application to deal with correctly...
>
> Also, having dhcp use its own timeout incurs a cost of an extra 
> file-descriptor.  Not worth it.
>
>> IIRC), including how long all the connect attempts took prior. Maybe 
>> I'm overthinking it but theoretically we could hit the method timeout 
>> limit so we may want to add handling for it, which would be easier if 
>> we did the above with a timeout versus retries.
>
> All we can do is document that Connect might be a long-running 
> operation (should be obvious) for the reasons you outline.

So from an ELL perspective I'm fine adding a retries setter (though 
timeout makes more intuitive sense IMO) but I do want to explain some 
issues I see with utilizing this in IWD, and doing netconfig prior to 
the DBus method return.

Adding netconfig into the mix would greatly increase the likelihood of a 
DBus method timeout in "semi-normal" cases depending on the retry limit 
chosen. I understand the dilemma you explained above but I do think IWD 
should try its best to stay within the default DBus method timeout of 25 
seconds. Trying to do this while also doing netconfig seems difficult 
unless we really cut down the allowed retries, which has is drawbacks as 
poor RF could easily drop a few frames and we could quickly get up to 25 
seconds with the nature of the DHCP timeouts and if the wifi association 
took a while (multiple BSS retries).

I'm also not sure if a DBus method timeout is going to break or cause 
weird behavior with Connman/NetworkManager etc. And even if its handled, 
what exactly should they do in this case? Will DHCP complete? should 
they re-issue connect which could return -EBUSY? What happens when IWD 
does reply and a timeout has already occurred? It really opens up a lot 
of possibilities here.

This would have to be an IWD 3.0 change, but adding netconfig as a DBus 
state may be the better decision here. Then netconfig could take much 
longer (maybe user configurable), and we avoid/lessen the risk of method 
timeouts. To me this is similar to how Scan() works where the method 
returns quickly, but the caller then needs to wait for the scanning 
property before getting results.

Thanks,

James

> Regards,
> -Denis

Denis Kenzior Jan. 16, 2024, 4:13 p.m. UTC | #5

Hi James,

> 
> So from an ELL perspective I'm fine adding a retries setter (though timeout 
> makes more intuitive sense IMO) but I do want to explain some issues I see with 
> utilizing this in IWD, and doing netconfig prior to the DBus method return.

I disagree here.  For the dhcp_client, max retries is a much clearer 'lever' to 
adjust than the overall timeout.  As you said earlier, with fuzzing the number 
of retries in a given period can vary.  If the upper layers want to enforce a 
given timeout, they can do so anyway.

> 
> Adding netconfig into the mix would greatly increase the likelihood of a DBus 
> method timeout in "semi-normal" cases depending on the retry limit chosen. I 
> understand the dilemma you explained above but I do think IWD should try its 
> best to stay within the default DBus method timeout of 25 seconds. Trying to do 

You'll have to explain why you make this assertion?  If DBus bindings being used 
don't allow setting an infinite timeout, get better bindings.  I don't see any 
need to force iwd into fitting into this arbitrary 25 second limit.  And anyway, 
Connect() already can invoke Agent methods, which can take arbitrarily long 
anyway.  So netconfig is the least of your problems.

> this while also doing netconfig seems difficult unless we really cut down the 
> allowed retries, which has is drawbacks as poor RF could easily drop a few 
> frames and we could quickly get up to 25 seconds with the nature of the DHCP 
> timeouts and if the wifi association took a while (multiple BSS retries).

Do you have any data / examples to back up this assertion?

> 
> I'm also not sure if a DBus method timeout is going to break or cause weird 
> behavior with Connman/NetworkManager etc. And even if its handled, what exactly 

Not if they're written properly.

> should they do in this case? Will DHCP complete? should they re-issue connect 
> which could return -EBUSY? What happens when IWD does reply and a timeout has 
> already occurred? It really opens up a lot of possibilities here.

That is why I said that timeout handling has to be handled by the application. 
Most I've seen in real life do this poorly since there is usually a fundamental 
misunderstanding of how D-Bus works under the hood.

In practice, locally generated timeouts like the one used by default in 
libdbus-1 are just a bad idea.  I have no clue why they were added originally. 
If you have a misbehaving service on your bus which leaks messages, you will end 
up clogging D-Bus at some point and only a service restart will fix that.  Your 
service always has to reply to the message.  If you trust your running-as-root 
services to do the right thing, then adding arbitrary timeouts just gets in the way.

The fact that the timeout happened is unknown to the dbus peer.  It will 
continue as if nothing happened, and if well-behaved will generate a reply 
message eventually.  This has to happen for proper functioning of D-Bus.  The 
dbus-daemon will track that reply and forward it to the caller.  The caller's 
dbus implementation will just drop it on the floor since the reply-to serial 
number is no longer on the pending list.

So your iwd client has only two choices:
1. Call Connect() with no timeout and trust iwd to do the right thing.
2. Handle timeouts by trying to abort the connection somehow.  Perhaps via 
Station.Disconnect().  But that has its own side-effects.
3. If a timeout happens, assume iwd service is dead and restart it.

Hint: Option 1 or 3 is preferred

Regards,
-Denis

James Prestwood Jan. 16, 2024, 5:06 p.m. UTC | #6

Hi Denis,

On 1/16/24 8:13 AM, Denis Kenzior wrote:
> Hi James,
>
>>
>> So from an ELL perspective I'm fine adding a retries setter (though 
>> timeout makes more intuitive sense IMO) but I do want to explain some 
>> issues I see with utilizing this in IWD, and doing netconfig prior to 
>> the DBus method return.
>
> I disagree here.  For the dhcp_client, max retries is a much clearer 
> 'lever' to adjust than the overall timeout.  As you said earlier, with 
> fuzzing the number of retries in a given period can vary.  If the 
> upper layers want to enforce a given timeout, they can do so anyway.
>
>>
>> Adding netconfig into the mix would greatly increase the likelihood 
>> of a DBus method timeout in "semi-normal" cases depending on the 
>> retry limit chosen. I understand the dilemma you explained above but 
>> I do think IWD should try its best to stay within the default DBus 
>> method timeout of 25 seconds. Trying to do 
>
> You'll have to explain why you make this assertion?  If DBus bindings 
> being used don't allow setting an infinite timeout, get better 
> bindings.  I don't see any need to force iwd into fitting into this 
> arbitrary 25 second limit.  And anyway, Connect() already can invoke 
> Agent methods, which can take arbitrarily long anyway.  So netconfig 
> is the least of your problems.
Just based on the bindings we use for test-runner and the bindings I've 
been using recently (dbus_next), both have a 25 second method timeout. 
It appears that "dbus" (what we use in test-runner) might let you 
specify a timeout, not sure about dbus_next. I just wanted to 
communicate that a timeout is common across these bindings, for better 
or worse. If we don't want to conform to that, that's fine.
>
>> this while also doing netconfig seems difficult unless we really cut 
>> down the allowed retries, which has is drawbacks as poor RF could 
>> easily drop a few frames and we could quickly get up to 25 seconds 
>> with the nature of the DHCP timeouts and if the wifi association took 
>> a while (multiple BSS retries).
>
> Do you have any data / examples to back up this assertion?
We have seen DHCP take 10-15 seconds due to dropped frames. All I'm 
saying is that just a few dropped frames can drastically increase the 
time. If we don't care about the arbitrary 25 second timeout, then this 
is fine, I just was pointing that out.
>
>>
>> I'm also not sure if a DBus method timeout is going to break or cause 
>> weird behavior with Connman/NetworkManager etc. And even if its 
>> handled, what exactly 
>
> Not if they're written properly.
>
>> should they do in this case? Will DHCP complete? should they re-issue 
>> connect which could return -EBUSY? What happens when IWD does reply 
>> and a timeout has already occurred? It really opens up a lot of 
>> possibilities here.
>
> That is why I said that timeout handling has to be handled by the 
> application. Most I've seen in real life do this poorly since there is 
> usually a fundamental misunderstanding of how D-Bus works under the hood.
Ok, I'll definitely add handling for this, hopefully everything else 
using Connect() already handles it.
>
> In practice, locally generated timeouts like the one used by default 
> in libdbus-1 are just a bad idea.  I have no clue why they were added 
> originally. If you have a misbehaving service on your bus which leaks 
> messages, you will end up clogging D-Bus at some point and only a 
> service restart will fix that.  Your service always has to reply to 
> the message.  If you trust your running-as-root services to do the 
> right thing, then adding arbitrary timeouts just gets in the way.
>
> The fact that the timeout happened is unknown to the dbus peer. It 
> will continue as if nothing happened, and if well-behaved will 
> generate a reply message eventually.  This has to happen for proper 
> functioning of D-Bus.  The dbus-daemon will track that reply and 
> forward it to the caller.  The caller's dbus implementation will just 
> drop it on the floor since the reply-to serial number is no longer on 
> the pending list.
>
> So your iwd client has only two choices:
> 1. Call Connect() with no timeout and trust iwd to do the right thing.
> 2. Handle timeouts by trying to abort the connection somehow. Perhaps 
> via Station.Disconnect().  But that has its own side-effects.
> 3. If a timeout happens, assume iwd service is dead and restart it.
>
> Hint: Option 1 or 3 is preferred
>
> Regards,
> -Denis

[v2] dhcp: fix overflow causing retries to stop

Checks

Commit Message

Comments

Patch