Message ID | d840ddcf-07a6-a838-abf8-b1d85446138e@bluematt.me (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Series | [net-next] Reduce IP_FRAG_TIME fragment-reassembly timeout to 1s, from 30s |
Context | Check | Description |
---|---|---|
netdev/apply | fail | Patch does not apply to net-next |
netdev/tree_selection | success | Clearly marked for net-next |
On Wed, Apr 28, 2021 at 4:29 AM Matt Corallo <netdev-list@mattcorallo.com> wrote:
>
> The default IP reassembly timeout of 30 seconds predates git
> history (and cursory web searches turn up nothing related to it).
> The only relevant source cited in net/ipv4/ip_fragment.c is RFC
> 791 defining IPv4 in 1981. RFC 791 suggests allowing the timer to
> increase on the receipt of each fragment (which Linux deliberately
> does not do), with a default timeout for each fragment of 15
> seconds. It suggests 15s to cap a 10Kb/s flow to a 150Kb buffer of
> fragments.
>
> When Linux receives a fragment, if the total memory used for the
> fragment reassembly buffer (across all senders) exceeds
> net.ipv4.ipfrag_high_thresh (or the equivalent for IPv6), it
> silently drops all future fragments until the timers on
> the original expire.
>
> All the way in 2021, these numbers feel almost comical. The default
> buffer size for fragmentation reassembly is hard-coded at 4MiB as
> `net->ipv4.fqdir->high_thresh = 4 * 1024 * 1024;` capping a host at
> 1.06Mb/s of lost fragments before all fragments received on the
> host are dropped (with independent limits for IPv6).
>
> At 1.06Mb/s of lost fragments, we move from DoS attack threshold to
> real-world scenarios - at moderate loss rates on consumer networks
> today it's fairly easy to hit this, causing end hosts with their MTU
> (mis-)configured to fragment to have nearly 10-20 second blocks of
> 100% packet loss.
>
> Reducing the default fragment timeout to 1sec gives us 32Mb/s of
> fragments before we drop all fragments, which is certainly more in
> line with today's network speeds than 1.06Mb/s, though an optimal
> value may be still lower. Sadly, reducing it further requires a
> change to the sysctl interface, as net.ipv4.ipfrag_time is only
> specified in seconds.
> ---
>  include/net/ip.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/net/ip.h b/include/net/ip.h
> index 2d6b985d11cc..f1473ac5a27c 100644
> --- a/include/net/ip.h
> +++ b/include/net/ip.h
> @@ -135,7 +135,7 @@ struct ip_ra_chain {
>  #define IP_MF 0x2000 /* Flag: "More Fragments" */
>  #define IP_OFFSET 0x1FFF /* "Fragment Offset" part */
>
> -#define IP_FRAG_TIME (30 * HZ) /* fragment lifetime */
> +#define IP_FRAG_TIME (1 * HZ) /* fragment lifetime */
>
>  struct msghdr;
>  struct net_device;
> --
> 2.30.2

This is going to break many use cases.

I can certainly say that in many cases, we need more than 1 second to
complete reassembly.
Some Internet users share satellite links with 600 ms RTT, not
everybody has fiber links in 2021.

There is a sysctl, exactly for the cases where admins can decide to
make the value smaller.

You can laugh all you want, the sad thing with IP frags is that really
some applications still want to use them.

Also, admins willing to use 400 MB of memory instead of 4MB can just
change a sysctl.

Again, nothing will prevent reassembly units to be DDOS targets.

At Google, we use 100 MB for /proc/sys/net/ipv4/ipfrag_high_thresh and
/proc/sys/net/ipv6/ip6frag_high_thresh,
no kernel patch is needed.

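The throughput figures in the patch description follow from dividing the reassembly buffer size by the fragment lifetime: a buffer of `high_thresh` bytes whose entries are each held for the full timeout fills up once never-completed fragments arrive faster than `high_thresh / timeout`. A minimal sketch of that arithmetic is below. The helper names are hypothetical (this is not kernel code), and the 1.06 and 32 figures are reproduced only if "Mb" is read as 2^20 bits.

```c
#include <stdint.h>

/*
 * Sustained rate of lost (never-completed) fragments, in bits per second,
 * that is enough to keep a reassembly buffer of high_thresh_bytes full
 * when every incomplete entry is held for timeout_secs.
 * Hypothetical helper for illustration, not part of the kernel.
 */
static double frag_saturation_bps(uint64_t high_thresh_bytes,
                                  unsigned int timeout_secs)
{
    if (timeout_secs == 0)
        return 0.0; /* no meaningful rate without a timeout */
    return (double)high_thresh_bytes * 8.0 / (double)timeout_secs;
}

/* Same rate expressed in units of 2^20 bits per second. */
static double frag_saturation_mibps(uint64_t high_thresh_bytes,
                                    unsigned int timeout_secs)
{
    return frag_saturation_bps(high_thresh_bytes, timeout_secs)
           / (1024.0 * 1024.0);
}
```

With the default 4 MiB buffer, `frag_saturation_mibps(4ULL * 1024 * 1024, 30)` comes to roughly 1.07 (the description truncates it to 1.06) and `frag_saturation_mibps(4ULL * 1024 * 1024, 1)` to 32. Assuming the 100 MB setting mentioned in the reply means 100 MiB, the same 30 second timeout gives roughly 26.7.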
On 4/28/21 08:20, Eric Dumazet wrote:
> This is going to break many use cases.
>
> I can certainly say that in many cases, we need more than 1 second to
> complete reassembly.
> Some Internet users share satellite links with 600 ms RTT, not
> everybody has fiber links in 2021.

I'm curious what RTT has to do with it? Frags aren't resent, so there's no
RTT you need to wait for, the question is more your available bandwidth and
how much packet reordering you see, which even for many sat links isn't zero
anymore (better, in-flow packet reordering is becoming more and more rare!).

Even given some material reordering, 30 seconds on a 100Kb is a lot!

> There is a sysctl, exactly for the cases where admins can decide to
> make the value smaller.

Sadly this doesn't actually solve it in many cases. Because Linux reassembles fragments by default any time conntrack is loaded (disabling this is very nontrivial), anyone with a Linux box in between two hosts ends up breaking flows with any material loss of frags.

More broadly, just because there is a sysctl, doesn't mean the default shouldn't be sensible for most users. As you note, there's a sysctl, if someone is on a 1Kbps sat link with fragments sent out of order, they can change it :). This constant hasn't been touched since pre-git!

> You can laugh all you want, the sad thing with IP frags is that really
> some applications still want to use them.

Yes, including my application, which breaks any time the flow *transits* a Linux box (i.e. not just my end host(s), but any box in between that happens to have conntrack loaded).

> Also, admins willing to use 400 MB of memory instead of 4MB can just
> change a sysctl.
>
> Again, nothing will prevent reassembly units to be DDOS targets.

Yep, not claiming any differently. As noted in a previous thread you really have to crank up the limits to prevent DDOS.

> At Google, we use 100 MB for /proc/sys/net/ipv4/ipfrag_high_thresh and
> /proc/sys/net/ipv6/ip6frag_high_thresh,
> no kernel patch is needed.
>

On Wed, Apr 28, 2021 at 10:09:00AM -0400, Matt Corallo wrote:
>
> On 4/28/21 08:20, Eric Dumazet wrote:
> > This is going to break many use cases.
> >
> > I can certainly say that in many cases, we need more than 1 second to
> > complete reassembly.
> > Some Internet users share satellite links with 600 ms RTT, not
> > everybody has fiber links in 2021.
>
> I'm curious what RTT has to do with it? Frags aren't resent, so there's no
> RTT you need to wait for, the question is more your available bandwidth and
> how much packet reordering you see, which even for many sat links isn't zero
> anymore (better, in-flow packet reordering is becoming more and more rare!).

Regardless of retransmits, large RTTs are often an indication of buffer bloat
on the path, and this can take some fragments apart, even worse when you mix
this with multi-path routing where some fragments may take a short path and
others can take a congested one. In this case you'll note that the excessive
buffer time can become a non-negligible part of the observed RTT, hence the
indirect relation between the two.

Willy

On 4/28/21 10:13, Willy Tarreau wrote:
> On Wed, Apr 28, 2021 at 10:09:00AM -0400, Matt Corallo wrote:
> Regardless of retransmits, large RTTs are often an indication of buffer bloat
> on the path, and this can take some fragments apart, even worse when you mix
> this with multi-path routing where some fragments may take a short path and
> others can take a congested one. In this case you'll note that the excessive
> buffer time can become a non-negligible part of the observed RTT, hence the
> indirect relation between the two.

Right, buffer bloat is definitely a concern. Would it make more sense to reduce the default to somewhere closer to 3s?

More generally, I find this a rather interesting case - obviously breaking *deployed* use-cases of Linux is Really Bad,
but at the same time, the internet has changed around us and suddenly other reasonable use-cases of Linux (ie as a
router processing real-world consumer flows - in my case a stupid DOCSIS modem dropping 1Mbps from its measly 20Mbps
limit) have slowly broken instead.

Matt

On Wed, Apr 28, 2021 at 4:28 PM Matt Corallo <netdev-list@mattcorallo.com> wrote:
>
> On 4/28/21 10:13, Willy Tarreau wrote:
> > On Wed, Apr 28, 2021 at 10:09:00AM -0400, Matt Corallo wrote:
> > Regardless of retransmits, large RTTs are often an indication of buffer bloat
> > on the path, and this can take some fragments apart, even worse when you mix
> > this with multi-path routing where some fragments may take a short path and
> > others can take a congested one. In this case you'll note that the excessive
> > buffer time can become a non-negligible part of the observed RTT, hence the
> > indirect relation between the two.
>
> Right, buffer bloat is definitely a concern. Would it make more sense to reduce the default to somewhere closer to 3s?
>
> More generally, I find this a rather interesting case - obviously breaking *deployed* use-cases of Linux is Really Bad,
> but at the same time, the internet has changed around us and suddenly other reasonable use-cases of Linux (ie as a
> router processing real-world consumer flows - in my case a stupid DOCSIS modem dropping 1Mbps from its measly 20Mbps
> limit) have slowly broken instead.
>
> Matt

I have been working in wifi environments (linux conferences) where RTT
could reach 20 sec, and even 30 seconds, and this was in some very
rich cities in the USA.

Obviously, when a network is under provisioned by 50x factor, you
_need_ more time to complete fragments.

For some reason, the crazy IP reassembly stuff comes every couple of
years, and it is now a FAQ.

The Internet has changed for the lucky ones, but some deployments are
using 4Mbps satellite connectivity, shared by hundreds of people.

I urge application designers to _not_ rely on doomed frags, even in
controlled networks.

On 4/28/21 11:38, Eric Dumazet wrote:
> On Wed, Apr 28, 2021 at 4:28 PM Matt Corallo
> <netdev-list@mattcorallo.com> wrote:
> I have been working in wifi environments (linux conferences) where RTT
> could reach 20 sec, and even 30 seconds, and this was in some very
> rich cities in the USA.
>
> Obviously, when a network is under provisioned by 50x factor, you
> _need_ more time to complete fragments.

It's also a trade-off - if you're in a hugely under-provisioned environment with bufferbloat issues you may have some
fragments that need more time for reassembly if they've gotten horribly reordered (though just having 20 second RTT
doesn't imply that fragments are going to be re-ordered by 20 seconds, more likely you might see a small fraction of
it), but you're also likely to have more *lost* fragments, which can trigger the black-holing behavior here.

If you have some loss in the flow, it's very easy to hit 1Mbps of lost fragments and suddenly instead of just giving more
time to reassemble, you're just black-holing instead. I'm not claiming I have the right trade-off here, I'd love more
input, but at least in my experience trying to just occasionally send fragments on a pretty standard DOCSIS modem, 30s
is way off.

> For some reason, the crazy IP reassembly stuff comes every couple of
> years, and it is now a FAQ.
>
> The Internet has changed for the lucky ones, but some deployments are
> using 4Mbps satellite connectivity, shared by hundreds of people.

I'd think this is a great example of a case where you precisely *don't* want such a low threshold for dropping all
fragments. Note that in my specific deployment (standard DOCSIS), we're talking about the same speed and network as was
available ten years ago, this isn't exactly an uncommon or particularly fancy deployment. The real issue is applications
which happily send 8MB of fragments within a few seconds and suddenly find themselves completely black-holed for 30
seconds, but this isn't going to just go away.

> I urge application designers to _not_ rely on doomed frags, even in
> controlled networks.

I'd love to, but we're talking about a default value for fragment reassembly. At least in my subjective experience here,
the conservative 30s time takes things from "more time" to "completely blackhole", which feels like the wrong tradeoff.
At the end of the day, you can't expect fragments to work super well, indeed, and you assume some amount of loss, the
goal is to minimize the loss you see from them.

Even if you have some reordering, you're unlikely to see every fragment reordered (I guess you could imagine a horribly
broken qdisc, does such a thing exist in practice?) such that you always need 30s to reassemble. Taking some loss to
avoid making it so easy to completely blackhole fragments seems like a reasonable tradeoff.

Matt

On Fri, Apr 30, 2021 at 5:52 PM Matt Corallo <netdev-list@mattcorallo.com> wrote:
>
> Following up - is there a way forward here?
>

Tune the sysctls to meet your goals ?

I did the needed work so that you can absolutely decide to use 256GB
of ram per host for frags if you want.
(Although I have not tested with crazy values like that, maybe some
kind of bottleneck will be hit)

> I think the current ease of hitting the black-hole-ing behavior is unacceptable (and often not something that can be
> changed even with the sysctl knobs due to intermediate hosts), and am happy to do some work to fix it.
>
> Someone mentioned in a previous thread randomly evicting fragments instead of dropping all new fragments when we reach
> saturation, which may be an option. We could also do something in between 1s and 30s, preserving behavior for hosts
> which see fragments delivered out-of-order by seconds while still reducing the ease of accidentally just black-holing
> all fragments entirely in more standard internet access deployments.
>

Give me one implementation, I will give you a DDOS program to defeat it.
linux code is public, attackers will simply change their attacks.

There is no generic solution, they are all bad.

If you evict randomly, it will also fail. So why bother ?

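For readers who want to check what a given host is currently using before tuning anything, the two IPv4 knobs named in this thread can be read straight from procfs. The sketch below is only an illustration: the struct and function names are hypothetical, the paths are the ones quoted above (`net.ipv4.ipfrag_time`, in seconds, and `/proc/sys/net/ipv4/ipfrag_high_thresh`, in bytes), and the values are those of the network namespace the process runs in.

```c
#include <stdio.h>

/* Hypothetical container for the two IPv4 reassembly limits. */
struct frag_limits {
    long time_secs;   /* net.ipv4.ipfrag_time, seconds */
    long high_thresh; /* net.ipv4.ipfrag_high_thresh, bytes */
};

/*
 * Read one integer from a sysctl file. Returns 0 on success and -1 if the
 * file cannot be opened or does not start with an integer; *out is left
 * untouched when the file cannot be opened.
 */
static int read_sysctl_long(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Fill *lim from procfs. Returns 0 on success, -1 if either read fails. */
static int read_ipv4_frag_limits(struct frag_limits *lim)
{
    if (read_sysctl_long("/proc/sys/net/ipv4/ipfrag_time",
                         &lim->time_secs) != 0)
        return -1;
    return read_sysctl_long("/proc/sys/net/ipv4/ipfrag_high_thresh",
                            &lim->high_thresh);
}
```

A caller would declare a `struct frag_limits`, pass its address to `read_ipv4_frag_limits()`, and treat a return of -1 as "not available here" (for example on a non-Linux system). As the thread points out, this only covers a host you control; it says nothing about Linux routers on the path.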
On 4/30/21 13:09, Eric Dumazet wrote:
> On Fri, Apr 30, 2021 at 5:52 PM Matt Corallo
> <netdev-list@mattcorallo.com> wrote:
>>
>> Following up - is there a way forward here?
>>
>
> Tune the sysctls to meet your goals ?
>
> I did the needed work so that you can absolutely decide to use 256GB
> of ram per host for frags if you want.
> (Although I have not tested with crazy values like that, maybe some
> kind of bottleneck will be hit)

Again, this is not a solution universally because this issue appears when transiting a Linux router. This isn't only
about end-hosts (or I wouldn't have even bothered with any of this). Sometimes packets flow over a Linux router that you
don't have control over, which is true in my case.

>> I think the current ease of hitting the black-hole-ing behavior is unacceptable (and often not something that can be
>> changed even with the sysctl knobs due to intermediate hosts), and am happy to do some work to fix it.
>>
>> Someone mentioned in a previous thread randomly evicting fragments instead of dropping all new fragments when we reach
>> saturation, which may be an option. We could also do something in between 1s and 30s, preserving behavior for hosts
>> which see fragments delivered out-of-order by seconds while still reducing the ease of accidentally just black-holing
>> all fragments entirely in more standard internet access deployments.
>>
>
> Give me one implementation, I will give you a DDOS program to defeat it.
> linux code is public, attackers will simply change their attacks.
>
> There is no generic solution, they are all bad.
>
> If you evict randomly, it will also fail. So why bother ?

This was never about DDoS attacks - as noted several times this is about it being trivial to have all your fragments
blackholed for 30 seconds at a time just because you have some normal run-of-the-mill packet loss.

I agree with you wholeheartedly that there isn't a solution to the DDoS attack issue, I'm not trying to address it. On
the other hand, in the face of no attacks or otherwise malicious behavior, I'd expect Linux to not exhibit the complete
blackholing of fragments that it does today.

Matt

On Fri, Apr 30, 2021 at 7:42 PM Matt Corallo <netdev-list@mattcorallo.com> wrote:
>
> On 4/30/21 13:09, Eric Dumazet wrote:
> > On Fri, Apr 30, 2021 at 5:52 PM Matt Corallo
> > <netdev-list@mattcorallo.com> wrote:
> >>
> >> Following up - is there a way forward here?
> >>
> >
> > Tune the sysctls to meet your goals ?
> >
> > I did the needed work so that you can absolutely decide to use 256GB
> > of ram per host for frags if you want.
> > (Although I have not tested with crazy values like that, maybe some
> > kind of bottleneck will be hit)
>
> Again, this is not a solution universally because this issue appears when transiting a Linux router. This isn't only
> about end-hosts (or I wouldn't have even bothered with any of this). Sometimes packets flow over a Linux router that you
> don't have control over, which is true in my case.
>
> >> I think the current ease of hitting the black-hole-ing behavior is unacceptable (and often not something that can be
> >> changed even with the sysctl knobs due to intermediate hosts), and am happy to do some work to fix it.
> >>
> >> Someone mentioned in a previous thread randomly evicting fragments instead of dropping all new fragments when we reach
> >> saturation, which may be an option. We could also do something in between 1s and 30s, preserving behavior for hosts
> >> which see fragments delivered out-of-order by seconds while still reducing the ease of accidentally just black-holing
> >> all fragments entirely in more standard internet access deployments.
> >>
> >
> > Give me one implementation, I will give you a DDOS program to defeat it.
> > linux code is public, attackers will simply change their attacks.
> >
> > There is no generic solution, they are all bad.
> >
> > If you evict randomly, it will also fail. So why bother ?
>
> This was never about DDoS attacks - as noted several times this is about it being trivial to have all your fragments
> blackholed for 30 seconds at a time just because you have some normal run-of-the-mill packet loss.

Again, it will be trivial to have a use case where valid fragments are dropped.

Random can be considered as the worst strategy in some cases.

Queue management can tail drop, head drop, random drop, there is no
magical choice.

>
> I agree with you wholeheartedly that there isn't a solution to the DDoS attack issue, I'm not trying to address it. On
> the other hand, in the face of no attacks or otherwise malicious behavior, I'd expect Linux to not exhibit the complete
> blackholing of fragments that it does today.

Your expectations are unfortunately not something that linux can
satisfy _automatically_,
you have to tweak sysctls to tune _your_ workload.

On 4/30/21 13:49, Eric Dumazet wrote:
> On Fri, Apr 30, 2021 at 7:42 PM Matt Corallo
> <netdev-list@mattcorallo.com> wrote:
>> This was never about DDoS attacks - as noted several times this is about it being trivial to have all your fragments
>> blackholed for 30 seconds at a time just because you have some normal run-of-the-mill packet loss.
>
> Again, it will be trivial to have a use case where valid fragments are dropped.
>
> Random can be considered as the worst strategy in some cases.
>
> Queue management can tail drop, head drop, random drop, there is no
> magical choice.

Glad we're on the same page :).

>>
>> I agree with you wholeheartedly that there isn't a solution to the DDoS attack issue, I'm not trying to address it. On
>> the other hand, in the face of no attacks or otherwise malicious behavior, I'd expect Linux to not exhibit the complete
>> blackholing of fragments that it does today.
>
> Your expectations are unfortunately not something that linux can
> satisfy _automatically_,
> you have to tweak sysctls to tune _your_ workload.

Yep, totally agree, it's an optimization question. We just have to decide on what the most reasonable use-case is that can be supported at low cost. I'm still a little dubious that a constant picked some twenty years ago is still the best selection for an optimization question that is a function of real-world networks.

Buffer bloat exists, but so do networks that will happily drop 1Mbps of packets. The first has always been true, the
second only more recently has become more and more common (both due to network speed and application behavior).

Thanks again for your time and consideration,
Matt

On 4/30/21 13:53, Matt Corallo wrote:
>
> Buffer bloat exists, but so do networks that will happily drop 1Mbps of packets. The first has always been true, the
> second only more recently has become more and more common (both due to network speed and application behavior).

It may be worth noting, to further highlight the tradeoffs made here - that, given a constant amount of memory allocated
for fragment reassembly, *under* estimating the timeout will result in only loss of some % of packets which were
reordered in excess of the timeout, whereas *over* estimating the timeout results in complete blackhole for up to the
timeout in the face of material packet loss.

This asymmetry is why I suggested possibly random eviction could be useful as a different set of trade-offs, but I'm
certainly not qualified to make that determination.

Thanks again for your time and consideration,
Matt

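The "over-estimating" half of that asymmetry can be illustrated with a toy model. This is a sketch under stated assumptions, not the kernel's algorithm: time advances in one-second ticks, a fixed number of bytes of never-completed fragments arrives each tick, each admitted batch is held for exactly the timeout, and once usage exceeds the threshold everything arriving in that tick is dropped, as the patch description puts it. The function name and the 64-second cap are hypothetical. The model does not cover the "under-estimating" half (fragments reordered by more than the timeout).

```c
#include <stdint.h>

#define MAX_TIMEOUT_SECS 64 /* arbitrary cap for this sketch */

/*
 * Count the seconds, out of duration_secs, in which the reassembly buffer
 * is over its limit and therefore refuses every arriving fragment.
 * Returns 0 for an unsupported timeout (0 or above MAX_TIMEOUT_SECS).
 */
static unsigned int blackholed_seconds(uint64_t high_thresh,
                                       uint64_t lost_bytes_per_sec,
                                       unsigned int timeout_secs,
                                       unsigned int duration_secs)
{
    uint64_t admitted[MAX_TIMEOUT_SECS] = {0}; /* bytes admitted per slot */
    uint64_t in_use = 0;
    unsigned int blackholed = 0;

    if (timeout_secs == 0 || timeout_secs > MAX_TIMEOUT_SECS)
        return 0;

    for (unsigned int t = 0; t < duration_secs; t++) {
        unsigned int slot = t % timeout_secs;

        /* Whatever was admitted timeout_secs ago expires now. */
        in_use -= admitted[slot];
        admitted[slot] = 0;

        if (in_use > high_thresh) {
            /* Over the limit: this second's fragments are all dropped. */
            blackholed++;
            continue;
        }

        /* Admit this second's doomed fragments; they occupy memory. */
        admitted[slot] = lost_bytes_per_sec;
        in_use += lost_bytes_per_sec;
    }
    return blackholed;
}
```

In this model, a 4 MiB limit with 250000 bytes per second (2 Mb/s) of lost fragments never blackholes with a 1 second timeout, while a 30 second timeout blackholes 13 seconds out of every 30: `blackholed_seconds(4194304, 250000, 30, 300)` returns 130. The numbers are properties of the toy model only and are not measurements of a real host.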
At the risk of being obnoxious here - that's a "no" to reconsidering the tradeoffs picked 20 years ago?

I don't want to waste time if the answer is a complete "no", but if it isn't I'm happy to try to figure out what exactly the right tradeoffs are here, and spend time implementing things.

Thanks,
Matt

diff --git a/include/net/ip.h b/include/net/ip.h
index 2d6b985d11cc..f1473ac5a27c 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -135,7 +135,7 @@ struct ip_ra_chain {
 #define IP_MF 0x2000 /* Flag: "More Fragments" */
 #define IP_OFFSET 0x1FFF /* "Fragment Offset" part */
 
-#define IP_FRAG_TIME (30 * HZ) /* fragment lifetime */
+#define IP_FRAG_TIME (1 * HZ) /* fragment lifetime */
 
 struct msghdr;
 struct net_device;