| Field | Value |
|---|---|
| Message ID | 20210421135747.312095-1-i.maximets@ovn.org (mailing list archive) |
| State | Accepted |
| Commit | 7d742b509dd773f6ae2f32ffe3d2c0f3ea598a6d |
| Delegated to | Netdev Maintainers |
| Series | [net] openvswitch: meter: remove rate from the bucket size calculation |
| Context | Check | Description |
|---|---|---|
| netdev/cover_letter | success | Link |
| netdev/fixes_present | success | Link |
| netdev/patch_count | success | Link |
| netdev/tree_selection | success | Clearly marked for net |
| netdev/subject_prefix | success | Link |
| netdev/cc_maintainers | success | CCed 6 of 6 maintainers |
| netdev/source_inline | success | Was 0 now: 0 |
| netdev/verify_signedoff | success | Link |
| netdev/module_param | success | Was 0 now: 0 |
| netdev/build_32bit | success | Errors and warnings before: 0 this patch: 0 |
| netdev/kdoc | success | Errors and warnings before: 0 this patch: 0 |
| netdev/verify_fixes | success | Link |
| netdev/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 16 lines checked |
| netdev/build_allmodconfig_warn | success | Errors and warnings before: 0 this patch: 0 |
| netdev/header_inline | success | Link |
Hello:

This patch was applied to netdev/net.git (refs/heads/master):

On Wed, 21 Apr 2021 15:57:47 +0200 you wrote:
> The implementation of meters is supposed to be a classic token bucket
> with two typical parameters: rate and burst size.
>
> Burst size in this scheme is the maximum number of bytes/packets that
> can pass without being rate limited.
>
> Recent changes to the userspace datapath brought the meter
> implementation in line with the kernel one, and this uncovered
> several issues.
>
> [...]

Here is the summary with links:
  - [net] openvswitch: meter: remove rate from the bucket size calculation
    https://git.kernel.org/netdev/net/c/7d742b509dd7

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
On Wed, Apr 21, 2021 at 9:57 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> The implementation of meters is supposed to be a classic token bucket
> with two typical parameters: rate and burst size.
>
> Burst size in this scheme is the maximum number of bytes/packets that
> can pass without being rate limited.
>
> Recent changes to the userspace datapath brought the meter
> implementation in line with the kernel one, and this uncovered
> several issues.
>
> The main problem is that the maximum bucket size, for no clear
> reason, accounts for not only the burst size but also the numerical
> value of the rate. This creates a lot of confusion around the
> behavior of meters.
>
> For example, if the rate is configured as 1000 pps and the burst size
> is set to 1, this should mean that the meter will tolerate bursts of
> 1 packet at most, i.e. not a single packet above the rate should pass
> the meter. However, the current implementation calculates the maximum
> bucket size as (rate + burst size), so the effective bucket size will
> be 1001. This means that the first 1000 packets will not be rate
> limited and the average rate might be twice as high as configured.
> This also makes it practically impossible to configure a meter with a
> burst size lower than the rate, which might be a desirable
> configuration if the rate is high.
>
> The inability to configure low burst sizes, and the overall inability
> of a user to predict the maximum and average rates from a meter's
> configured parameters without looking at the OVS and kernel code,
> might also be classified as a security issue, because drop meters are
> frequently used as a way of protection from DoS attacks.
>
> This change removes the rate from the calculation of the bucket size,
> bringing it in line with the classic token bucket algorithm and
> essentially making the rate and burst tolerance predictable from a
> user's perspective.
>
> The same change is proposed for the userspace implementation.

Hi Ilya

If we set the burst size too small, OVS meters don't work. For example:

  ovs-ofctl -O OpenFlow13 add-meter br-int "meter=1 kbps stats burst bands=type=drop rate=10000 burst_size=12"
  ovs-ofctl -O OpenFlow13 add-flow br-int "in_port=$P0 action=meter=1,output:$P1"

but the rate on port P1 was 5.61 Mbit/s. Or:

  ovs-ofctl -O OpenFlow13 add-meter br-int "meter=1 kbps stats burst bands=type=drop rate=10000 burst_size=1"

but the rate on port P1 was 0. The length of the packets is 1400 B.

I think we should check whether band->burst_size >= band->rate?
I didn't test the userspace meters.

> Fixes: 96fbc13d7e77 ("openvswitch: Add meter infrastructure")
> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> ---
>
> The same patch for the userspace datapath:
> https://patchwork.ozlabs.org/project/openvswitch/patch/20210421134816.311584-1-i.maximets@ovn.org/
>
>  net/openvswitch/meter.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/openvswitch/meter.c b/net/openvswitch/meter.c
> index 15424d26e85d..96b524ceabca 100644
> --- a/net/openvswitch/meter.c
> +++ b/net/openvswitch/meter.c
> @@ -392,7 +392,7 @@ static struct dp_meter *dp_meter_create(struct nlattr **a)
>  	 *
>  	 * Start with a full bucket.
>  	 */
> -	band->bucket = (band->burst_size + band->rate) * 1000ULL;
> +	band->bucket = band->burst_size * 1000ULL;
>  	band_max_delta_t = div_u64(band->bucket, band->rate);
>  	if (band_max_delta_t > meter->max_delta_t)
>  		meter->max_delta_t = band_max_delta_t;
> @@ -641,7 +641,7 @@ bool ovs_meter_execute(struct datapath *dp, struct sk_buff *skb,
>  	long long int max_bucket_size;
>
>  	band = &meter->bands[i];
> -	max_bucket_size = (band->burst_size + band->rate) * 1000LL;
> +	max_bucket_size = band->burst_size * 1000LL;
>
>  	band->bucket += delta_ms * band->rate;
>  	if (band->bucket > max_bucket_size)
> --
> 2.26.3
>
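For illustration, here is a minimal standalone sketch of the kind of configuration-time check suggested above. It is hypothetical, not part of the patch or of net/openvswitch/meter.c, and it validates the burst against the maximum packet size rather than against the rate, since allowing a burst size below the rate is precisely the point of the patch. The struct only mirrors the relevant fields of the kernel's dp_meter_band:

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical mirror of the relevant dp_meter_band fields
 * (net/openvswitch/meter.h): for a kbps meter, rate is in kbps and
 * burst_size is in kbits. */
struct band_cfg {
	uint32_t rate;
	uint32_t burst_size;
};

/* Reject a drop band whose bucket cannot hold even one full-size
 * packet: such a meter would drop all traffic, as in the
 * burst_size=1 example above. */
static int band_config_check(const struct band_cfg *b,
			     uint32_t max_pkt_bytes)
{
	uint32_t min_burst_kbits = (max_pkt_bytes * 8 + 999) / 1000;

	/* For 1400 B packets this requires burst_size >= 12 kbits,
	 * matching the borderline case observed above. */
	if (b->burst_size < min_burst_kbits)
		return -EINVAL;
	return 0;
}
```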
On Wed, Apr 28, 2021 at 02:24:10PM +0800, Tonghao Zhang wrote:
> Hi Ilya
> If we set the burst size too small, OVS meters don't work.

Most likely, you need to set the burst size larger. A quick Google on finding a good burst size:
https://www.juniper.net/documentation/us/en/software/junos/routing-policy/topics/concept/policer-mx-m120-m320-burstsize-determining.html

Now, the interesting question: is the behaviour of OVS different from a standard token bucket, such as a kernel policer?

Here is how to set up a kernel policer:
----------------------------------------------------------
# Create a dummy classful discipline to attach filter
tc qdisc del dev eth6 root
tc qdisc add dev eth6 root handle 1: prio bands 2 priomap 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
tc qdisc add dev eth6 parent 1:1 handle 10: pfifo limit 1000
tc qdisc add dev eth6 parent 1:2 handle 20: pfifo limit 1000
tc -s qdisc show dev eth6
tc -s class show dev eth6

# Filter to do hard rate limiting
tc filter del dev eth6 parent 1: protocol all prio 1 handle 800::100 u32
tc filter add dev eth6 parent 1: protocol all prio 1 handle 800::100 u32 match u32 0 0 police rate 200mbit burst 20K mtu 10000 drop
tc -s filter show dev eth6
tc filter change dev eth6 parent 1: protocol all prio 1 handle 800::100 u32 match u32 0 0 police rate 200mbit burst 50K mtu 10000 drop
----------------------------------------------------------

Regards,

Jean
On 4/28/21 8:45 AM, Jean Tourrilhes wrote:
> On Wed, Apr 28, 2021 at 02:24:10PM +0800, Tonghao Zhang wrote:
>> Hi Ilya
>> If we set the burst size too small, OVS meters don't work.
>
> Most likely, you need to set the burst size larger. A quick Google
> on finding a good burst size:
> https://www.juniper.net/documentation/us/en/software/junos/routing-policy/topics/concept/policer-mx-m120-m320-burstsize-determining.html

+1.
Tonghao, if you configure the burst size too low, the meter will not pass packets. That's expected behavior. In your example with 1400 B packets and a 1500 B (12 kbit) burst size, there is a very high probability that a lot of packets will be dropped and will not pass the meter unless you're sending them at very precise points in time. I don't think anyone would recommend setting the burst size so close to the MTU. The article above suggests using 10x the MTU value, but I don't know if that will be enough with high-speed devices.

> Now, the interesting question: is the behaviour of OVS different
> from a standard token bucket, such as a kernel policer?

I didn't test it, but I looked at the implementations in net/sched/act_police.c and net/sched/sch_tbf.c, and they should work in the same way as this patch, i.e. a classic token bucket where burst is a burst and nothing else. These implementations use a burst in nanoseconds instead of bytes, but that doesn't matter (the nanoseconds are calculated from the rate and the burst in bytes specified by the user). For example, net/sched/act_police.c works like this:

    toks = min_t(s64, now - police->tcfp_t_c, p->tcfp_burst);
                  ^---- calculating how many tokens need to be added
    toks += police->tcfp_toks;    <-- also adding all existing tokens
    if (toks > p->tcfp_burst)
            toks = p->tcfp_burst; <-- hard limit of tokens by the burst size
    toks -= (s64)psched_l2t_ns(&p->rate, qdisc_pkt_len(skb));
                  ^-- spending tokens to pass the packet
    if (toks >= 0) {              <-- Did we have enough tokens?
            /* Packet passed. */
            police->tcfp_t_c = now;
            police->tcfp_toks = toks;
    }

net/sched/sch_tbf.c works in almost exactly the same way. So, there is *no algorithmic difference* here.

---

There is one difference, though. I said that it doesn't matter that tc uses time instead of bytes as a measure for tokens, but it actually does matter, because the time is calculated based on the configured rate but applied to the actual rate. Let me explain:

Assume the configuration "rate 200mbit burst 20K" as in the example below. iproute2 will calculate the burst using the tc_calc_xmittime function:
https://github.com/shemminger/iproute2/blob/9f366536edb5158343152604e82b968be46dbf26/tc/tc_core.c#L60

So the burst configuration passed to the kernel will be:

    TIME_UNITS_PER_SEC(1000000) * (20 * 1024) / (200 * 1024*1024/8) = 781 usec
              10^-6                  bytes          bytes/sec

That means that the burst is not 20K bytes as configured, but any number of bytes within a 781 usec window, regardless of the line rate. For example, if traffic goes from a 10 Gbps interface, the effective burst size will be 10^9 / 8 * 781 * 10^-6 = 97K, which is almost 5 times higher than the configured value. And the difference scales linearly with the increase of the line rate speed. For a 100G interface it will be 970K.

It might be much more noticeable with a lower configured rate. For "rate 10mbit burst 20K", the real burst interval will be 15.6 msec, which translates into a 1.9M burst size for a 10G line rate, which is almost 100 times larger than the configured 20K. And it will be 19M for a 100 Gbps interface, making the average rate three times as high as configured for a policer.

All in all, this looks more like an issue of the TC and iproute implementation. IMHO, the tc command should not allow configuring the burst in bytes, because it cannot configure that in the kernel and therefore cannot guarantee that behavior. The configuration should be in micro/nanoseconds instead.

CC: Cong, Davide
Maybe someone from the TC side can comment on that?

We can try to mimic this behavior in OVS, but I'm not sure if it's correct. The current OVS implementation, unlike TC, guarantees the burst size in bytes. And this is also a completely different kind of difference with OVS meters, so it is unrelated to the current patch.

Best regards, Ilya Maximets.

> Here is how to set up a kernel policer:
> ----------------------------------------------------------
> # Create a dummy classful discipline to attach filter
> tc qdisc del dev eth6 root
> tc qdisc add dev eth6 root handle 1: prio bands 2 priomap 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> tc qdisc add dev eth6 parent 1:1 handle 10: pfifo limit 1000
> tc qdisc add dev eth6 parent 1:2 handle 20: pfifo limit 1000
> tc -s qdisc show dev eth6
> tc -s class show dev eth6
>
> # Filter to do hard rate limiting
> tc filter del dev eth6 parent 1: protocol all prio 1 handle 800::100 u32
> tc filter add dev eth6 parent 1: protocol all prio 1 handle 800::100 u32 match u32 0 0 police rate 200mbit burst 20K mtu 10000 drop
> tc -s filter show dev eth6
> tc filter change dev eth6 parent 1: protocol all prio 1 handle 800::100 u32 match u32 0 0 police rate 200mbit burst 50K mtu 10000 drop
> ----------------------------------------------------------
>
> Regards,
>
> Jean
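A small standalone C sketch (not iproute2 code; the constants follow the "rate 200mbit burst 20K" example) verifying the burst-window arithmetic above. The line-rate extrapolation drawn from that window is revisited later in the thread:

```c
#include <stdio.h>

int main(void)
{
	/* "rate 200mbit burst 20K" in iproute2's units */
	double burst_bytes = 20 * 1024;             /* 20480 bytes  */
	double rate_Bps = 200.0 * 1024 * 1024 / 8;  /* 26214400 B/s */

	/* tc_calc_xmittime(): convert the byte burst into the time it
	 * takes to send that many bytes at the configured rate. */
	double burst_us = 1e6 * burst_bytes / rate_Bps;

	printf("burst window: %.2f usec\n", burst_us); /* ~781 usec */
	return 0;
}
```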
On Wed, Apr 28, 2021 at 01:22:12PM +0200, Ilya Maximets wrote:
>
> I didn't test it, but I looked at the implementations in
> net/sched/act_police.c and net/sched/sch_tbf.c, and they should work
> in the same way as this patch, i.e. a classic token bucket where
> burst is a burst and nothing else.

Actually, act_police.c and sch_tbf.c will behave completely differently, even though they are both based on the token bucket algorithm.

The reason is that sch_tbf.c is applied to a queue, and the queue will smooth out traffic and avoid drops. The token bucket is used to dequeue the queue; this is sometimes called a leaky bucket. I've personally used sch_tbf.c with a burst size barely bigger than the MTU, and it works fine.

This is why I was suggesting comparing to act_police.c, which does not have a queue to smooth out traffic and can only drop packets.

I believe OVS meters are similar to policers, so that's why they are surprising to people used to queues such as TBF and HTB.

Regards,

Jean
On 4/28/21 6:31 PM, Jean Tourrilhes wrote:
> On Wed, Apr 28, 2021 at 01:22:12PM +0200, Ilya Maximets wrote:
>>
>> I didn't test it, but I looked at the implementations in
>> net/sched/act_police.c and net/sched/sch_tbf.c, and they should work
>> in the same way as this patch, i.e. a classic token bucket where
>> burst is a burst and nothing else.
>
> Actually, act_police.c and sch_tbf.c will behave completely
> differently, even though they are both based on the token bucket
> algorithm.
>
> The reason is that sch_tbf.c is applied to a queue, and the queue
> will smooth out traffic and avoid drops. The token bucket is used to
> dequeue the queue; this is sometimes called a leaky bucket. I've
> personally used sch_tbf.c with a burst size barely bigger than the
> MTU, and it works fine.

Makes sense. Thanks for the clarification!

> This is why I was suggesting comparing to act_police.c, which does
> not have a queue to smooth out traffic and can only drop packets.

I see. Unfortunately, since act_police.c uses time instead of bytes as a measure for tokens, we will still see a difference in behavior. Probably not so big, but it will be there, and it will depend on the line rate.

> I believe OVS meters are similar to policers, so that's why they are
> surprising to people used to queues such as TBF and HTB.
>
> Regards,
>
> Jean
On 4/28/21 1:22 PM, Ilya Maximets wrote:
> On 4/28/21 8:45 AM, Jean Tourrilhes wrote:
>> On Wed, Apr 28, 2021 at 02:24:10PM +0800, Tonghao Zhang wrote:
>>> Hi Ilya
>>> If we set the burst size too small, OVS meters don't work.
>>
>> Most likely, you need to set the burst size larger. A quick Google
>> on finding a good burst size:
>> https://www.juniper.net/documentation/us/en/software/junos/routing-policy/topics/concept/policer-mx-m120-m320-burstsize-determining.html
>
> +1.
> Tonghao, if you configure the burst size too low, the meter will not
> pass packets. That's expected behavior. In your example with 1400 B
> packets and a 1500 B (12 kbit) burst size, there is a very high
> probability that a lot of packets will be dropped and will not pass
> the meter unless you're sending them at very precise points in time.
> I don't think anyone would recommend setting the burst size so close
> to the MTU. The article above suggests using 10x the MTU value, but
> I don't know if that will be enough with high-speed devices.
>
>> Now, the interesting question: is the behaviour of OVS different
>> from a standard token bucket, such as a kernel policer?
>
> I didn't test it, but I looked at the implementations in
> net/sched/act_police.c and net/sched/sch_tbf.c, and they should work
> in the same way as this patch, i.e. a classic token bucket where
> burst is a burst and nothing else. These implementations use a burst
> in nanoseconds instead of bytes, but that doesn't matter (the
> nanoseconds are calculated from the rate and the burst in bytes
> specified by the user). For example, net/sched/act_police.c works
> like this:
>
>     toks = min_t(s64, now - police->tcfp_t_c, p->tcfp_burst);
>                   ^---- calculating how many tokens need to be added
>     toks += police->tcfp_toks;    <-- also adding all existing tokens
>     if (toks > p->tcfp_burst)
>             toks = p->tcfp_burst; <-- hard limit of tokens by the burst size
>     toks -= (s64)psched_l2t_ns(&p->rate, qdisc_pkt_len(skb));
>                   ^-- spending tokens to pass the packet
>     if (toks >= 0) {              <-- Did we have enough tokens?
>             /* Packet passed. */
>             police->tcfp_t_c = now;
>             police->tcfp_toks = toks;
>     }
>
> net/sched/sch_tbf.c works in almost exactly the same way. So, there
> is *no algorithmic difference* here.
>
> ---
>
> There is one difference, though. I said that it doesn't matter that
> tc uses time instead of bytes as a measure for tokens, but it
> actually does matter, because the time is calculated based on the
> configured rate but applied to the actual rate. Let me explain:
>
> Assume the configuration "rate 200mbit burst 20K" as in the example
> below. iproute2 will calculate the burst using the tc_calc_xmittime
> function:
> https://github.com/shemminger/iproute2/blob/9f366536edb5158343152604e82b968be46dbf26/tc/tc_core.c#L60
>
> So the burst configuration passed to the kernel will be:
>
>     TIME_UNITS_PER_SEC(1000000) * (20 * 1024) / (200 * 1024*1024/8) = 781 usec
>               10^-6                  bytes          bytes/sec
>
> That means that the burst is not 20K bytes as configured, but any
> number of bytes within a 781 usec window, regardless of the line
> rate.

OK, I found my mistake here. Even though the burst size is in units of time, it doesn't matter, because when tokens are consumed, the algorithm subtracts the time needed to pass a packet at the configured rate (see the psched_l2t_ns() function). This evens out the difference.

So, everything is perfectly fine here. :) Sorry for the noise.

> For example, if traffic goes from a 10 Gbps interface, the effective
> burst size will be 10^9 / 8 * 781 * 10^-6 = 97K, which is almost
> 5 times higher than the configured value. And the difference scales
> linearly with the increase of the line rate speed. For a 100G
> interface it will be 970K.
>
> It might be much more noticeable with a lower configured rate.
> For "rate 10mbit burst 20K", the real burst interval will be
> 15.6 msec, which translates into a 1.9M burst size for a 10G line
> rate, which is almost 100 times larger than the configured 20K. And
> it will be 19M for a 100 Gbps interface, making the average rate
> three times as high as configured for a policer.
>
> All in all, this looks more like an issue of the TC and iproute
> implementation. IMHO, the tc command should not allow configuring the
> burst in bytes, because it cannot configure that in the kernel and
> therefore cannot guarantee that behavior. The configuration should be
> in micro/nanoseconds instead.
>
> CC: Cong, Davide
> Maybe someone from the TC side can comment on that?
>
> We can try to mimic this behavior in OVS, but I'm not sure if it's
> correct. The current OVS implementation, unlike TC, guarantees the
> burst size in bytes. And this is also a completely different kind of
> difference with OVS meters, so it is unrelated to the current patch.
>
> Best regards, Ilya Maximets.
>
>> Here is how to set up a kernel policer:
>> ----------------------------------------------------------
>> # Create a dummy classful discipline to attach filter
>> tc qdisc del dev eth6 root
>> tc qdisc add dev eth6 root handle 1: prio bands 2 priomap 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
>> tc qdisc add dev eth6 parent 1:1 handle 10: pfifo limit 1000
>> tc qdisc add dev eth6 parent 1:2 handle 20: pfifo limit 1000
>> tc -s qdisc show dev eth6
>> tc -s class show dev eth6
>>
>> # Filter to do hard rate limiting
>> tc filter del dev eth6 parent 1: protocol all prio 1 handle 800::100 u32
>> tc filter add dev eth6 parent 1: protocol all prio 1 handle 800::100 u32 match u32 0 0 police rate 200mbit burst 20K mtu 10000 drop
>> tc -s filter show dev eth6
>> tc filter change dev eth6 parent 1: protocol all prio 1 handle 800::100 u32 match u32 0 0 police rate 200mbit burst 50K mtu 10000 drop
>> ----------------------------------------------------------
>>
>> Regards,
>>
>> Jean
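To illustrate the correction above, here is a minimal standalone sketch modeled on the act_police.c logic quoted earlier (simplified, with constants from the "rate 200mbit burst 20K" example): tokens are nanoseconds, and each packet costs its length converted to time at the configured rate, roughly what psched_l2t_ns() computes. Even if packets arrive back-to-back at an arbitrarily high line rate, only the configured byte burst passes:

```c
#include <stdio.h>
#include <stdint.h>

/* Configured: ~200 mbit -> 25e6 bytes/s; ~20 KB of burst expressed
 * as a time budget of ~781250 ns (computed earlier in the thread). */
#define RATE_BPS  25000000.0
#define BURST_NS  781250LL

static int64_t toks = BURST_NS; /* start with a full bucket */
static int64_t last_ns;

/* Simplified act_police.c decision: refill tokens by the elapsed
 * time, cap at the burst, then charge the packet's send time at the
 * configured rate. */
static int police(int64_t now_ns, int pkt_len)
{
	int64_t t = toks + (now_ns - last_ns);

	if (t > BURST_NS)
		t = BURST_NS;
	t -= (int64_t)(pkt_len * 1e9 / RATE_BPS);
	if (t < 0)
		return 0; /* drop */
	toks = t;
	last_ns = now_ns;
	return 1; /* pass */
}

int main(void)
{
	int passed = 0, i;

	/* 100 back-to-back 1500 B packets, all at t=0 ("infinite"
	 * line rate): only ~20 KB worth should pass. */
	for (i = 0; i < 100; i++)
		passed += police(0, 1500);
	printf("%d packets, %d bytes passed\n", passed, passed * 1500);
	return 0;
}
```

Running this passes 13 packets (19500 bytes), i.e. the ~20 KB burst, independent of how fast the packets actually arrive.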
On 4/28/21 8:59 PM, Ilya Maximets wrote:
> On 4/28/21 6:31 PM, Jean Tourrilhes wrote:
>> On Wed, Apr 28, 2021 at 01:22:12PM +0200, Ilya Maximets wrote:
>>>
>>> I didn't test it, but I looked at the implementations in
>>> net/sched/act_police.c and net/sched/sch_tbf.c, and they should
>>> work in the same way as this patch, i.e. a classic token bucket
>>> where burst is a burst and nothing else.
>>
>> Actually, act_police.c and sch_tbf.c will behave completely
>> differently, even though they are both based on the token bucket
>> algorithm.
>>
>> The reason is that sch_tbf.c is applied to a queue, and the queue
>> will smooth out traffic and avoid drops. The token bucket is used
>> to dequeue the queue; this is sometimes called a leaky bucket. I've
>> personally used sch_tbf.c with a burst size barely bigger than the
>> MTU, and it works fine.
>
> Makes sense. Thanks for the clarification!
>
>> This is why I was suggesting comparing to act_police.c, which does
>> not have a queue to smooth out traffic and can only drop packets.
>
> I see. Unfortunately, since act_police.c uses time instead of bytes
> as a measure for tokens, we will still see a difference in behavior.
> Probably not so big, but it will be there, and it will depend on the
> line rate.

I found my mistake in the calculations (see my other reply). So there should not be a significant difference with OVS meters, indeed.

>> I believe OVS meters are similar to policers, so that's why they
>> are surprising to people used to queues such as TBF and HTB.
>>
>> Regards,
>>
>> Jean
diff --git a/net/openvswitch/meter.c b/net/openvswitch/meter.c
index 15424d26e85d..96b524ceabca 100644
--- a/net/openvswitch/meter.c
+++ b/net/openvswitch/meter.c
@@ -392,7 +392,7 @@ static struct dp_meter *dp_meter_create(struct nlattr **a)
 	 *
 	 * Start with a full bucket.
 	 */
-	band->bucket = (band->burst_size + band->rate) * 1000ULL;
+	band->bucket = band->burst_size * 1000ULL;
 	band_max_delta_t = div_u64(band->bucket, band->rate);
 	if (band_max_delta_t > meter->max_delta_t)
 		meter->max_delta_t = band_max_delta_t;
@@ -641,7 +641,7 @@ bool ovs_meter_execute(struct datapath *dp, struct sk_buff *skb,
 	long long int max_bucket_size;
 
 	band = &meter->bands[i];
-	max_bucket_size = (band->burst_size + band->rate) * 1000LL;
+	max_bucket_size = band->burst_size * 1000LL;
 
 	band->bucket += delta_ms * band->rate;
 	if (band->bucket > max_bucket_size)
The implementation of meters is supposed to be a classic token bucket with two typical parameters: rate and burst size.

Burst size in this scheme is the maximum number of bytes/packets that can pass without being rate limited.

Recent changes to the userspace datapath brought the meter implementation in line with the kernel one, and this uncovered several issues.

The main problem is that the maximum bucket size, for no clear reason, accounts for not only the burst size but also the numerical value of the rate. This creates a lot of confusion around the behavior of meters.

For example, if the rate is configured as 1000 pps and the burst size is set to 1, this should mean that the meter will tolerate bursts of 1 packet at most, i.e. not a single packet above the rate should pass the meter. However, the current implementation calculates the maximum bucket size as (rate + burst size), so the effective bucket size will be 1001. This means that the first 1000 packets will not be rate limited and the average rate might be twice as high as configured. This also makes it practically impossible to configure a meter with a burst size lower than the rate, which might be a desirable configuration if the rate is high.

The inability to configure low burst sizes, and the overall inability of a user to predict the maximum and average rates from a meter's configured parameters without looking at the OVS and kernel code, might also be classified as a security issue, because drop meters are frequently used as a way of protection from DoS attacks.

This change removes the rate from the calculation of the bucket size, bringing it in line with the classic token bucket algorithm and essentially making the rate and burst tolerance predictable from a user's perspective.

The same change is proposed for the userspace implementation.

Fixes: 96fbc13d7e77 ("openvswitch: Add meter infrastructure")
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---

The same patch for the userspace datapath:
https://patchwork.ozlabs.org/project/openvswitch/patch/20210421134816.311584-1-i.maximets@ovn.org/

 net/openvswitch/meter.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
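To make the example from the commit message concrete, here is a tiny standalone sketch (not kernel code; it only mirrors the fixed-point scaling in net/openvswitch/meter.c, where bucket values are kept in 1/1000ths of a unit):

```c
#include <stdio.h>

int main(void)
{
	long long rate = 1000;    /* 1000 pps                     */
	long long burst_size = 1; /* tolerate 1 packet at most    */

	long long old_max = (burst_size + rate) * 1000LL; /* before */
	long long new_max = burst_size * 1000LL;          /* after  */

	/* Before the patch the bucket holds 1001 packets' worth of
	 * tokens, so the first 1000 packets above the rate pass
	 * unthrottled; after the patch, exactly 1 does. */
	printf("old bucket: %lld packets\n", old_max / 1000);
	printf("new bucket: %lld packets\n", new_max / 1000);
	return 0;
}
```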