Message ID: 552EA2BC.5000707@eu.citrix.com (mailing list archive)
On Wed, 2015-04-15 at 18:41 +0100, George Dunlap wrote:
> So you'd be OK with a patch like this? (With perhaps a better changelog?)
>
>  -George
>
> ---
> TSQ: Raise default static TSQ limit
>
> A new dynamic TSQ limit was introduced in c/s 605ad7f18 based on the
> size of actual packets and the amount of data being transmitted.
> Raise the default static limit to allow that new limit to actually
> come into effect.
>
> This fixes a regression where NICs with large transmit completion
> times (such as xennet) had a 30% hit unless the user manually tweaked
> the value in /proc.
>
> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 1db253e..8ad7cdf 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -50,8 +50,8 @@ int sysctl_tcp_retrans_collapse __read_mostly = 1;
>   */
>  int sysctl_tcp_workaround_signed_windows __read_mostly = 0;
>
> -/* Default TSQ limit of two TSO segments */
> -int sysctl_tcp_limit_output_bytes __read_mostly = 131072;
> +/* Static TSQ limit. A more dynamic limit is calculated in tcp_write_xmit. */
> +int sysctl_tcp_limit_output_bytes __read_mostly = 1048576;
>
>  /* This limits the percentage of the congestion window which we
>   * will allow a single TSO frame to consume. Building TSO frames

Have you tested this patch on a NIC without GSO/TSO?

This would allow more than 500 packets for a single flow. Hello bufferbloat.

So my answer to this patch is a no.
> > Have you tested this patch on a NIC without GSO/TSO?
> >
> > This would allow more than 500 packets for a single flow.
> >
> > Hello bufferbloat.

Wouldn't the fq_codel qdisc on that interface address that problem?

rick
On 04/15/2015 06:52 PM, Eric Dumazet wrote:
> On Wed, 2015-04-15 at 18:41 +0100, George Dunlap wrote:
>
>> So you'd be OK with a patch like this? (With perhaps a better changelog?)
>>
>>  -George
>>
>> ---
>> TSQ: Raise default static TSQ limit
>>
>> A new dynamic TSQ limit was introduced in c/s 605ad7f18 based on the
>> size of actual packets and the amount of data being transmitted.
>> Raise the default static limit to allow that new limit to actually
>> come into effect.
>>
>> This fixes a regression where NICs with large transmit completion
>> times (such as xennet) had a 30% hit unless the user manually tweaked
>> the value in /proc.
>>
>> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
>>
>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>> index 1db253e..8ad7cdf 100644
>> --- a/net/ipv4/tcp_output.c
>> +++ b/net/ipv4/tcp_output.c
>> @@ -50,8 +50,8 @@ int sysctl_tcp_retrans_collapse __read_mostly = 1;
>>   */
>>  int sysctl_tcp_workaround_signed_windows __read_mostly = 0;
>>
>> -/* Default TSQ limit of two TSO segments */
>> -int sysctl_tcp_limit_output_bytes __read_mostly = 131072;
>> +/* Static TSQ limit. A more dynamic limit is calculated in tcp_write_xmit. */
>> +int sysctl_tcp_limit_output_bytes __read_mostly = 1048576;
>>
>>  /* This limits the percentage of the congestion window which we
>>   * will allow a single TSO frame to consume. Building TSO frames
>
> Have you tested this patch on a NIC without GSO/TSO?
>
> This would allow more than 500 packets for a single flow.
>
> Hello bufferbloat.
>
> So my answer to this patch is a no.

You said:

"I asked you guys to make a test by increasing sysctl_tcp_limit_output_bytes

You have no need to explain me the code I wrote, thank you."

Which implies to me that you think you've already pointed us to the
answer you want and we're just not getting it.

Maybe you should stop wasting all of our time and just tell us what
you're thinking.

 -George
On Wed, 2015-04-15 at 10:55 -0700, Rick Jones wrote:
> > Have you tested this patch on a NIC without GSO/TSO?
> >
> > This would allow more than 500 packets for a single flow.
> >
> > Hello bufferbloat.
>
> Wouldn't the fq_codel qdisc on that interface address that problem?

Last time I checked, the default qdisc was pfifo_fast.

These guys do not want to change a sysctl; how will pfifo_fast magically
become fq_codel?
On 04/15/2015 11:08 AM, Eric Dumazet wrote:
> On Wed, 2015-04-15 at 10:55 -0700, Rick Jones wrote:
>>>
>>> Have you tested this patch on a NIC without GSO/TSO?
>>>
>>> This would allow more than 500 packets for a single flow.
>>>
>>> Hello bufferbloat.
>>
>> Wouldn't the fq_codel qdisc on that interface address that problem?
>
> Last time I checked, the default qdisc was pfifo_fast.

Bummer.

> These guys do not want to change a sysctl; how will pfifo_fast magically
> become fq_codel?

Well, I'm not sure that it is George and Jonathan themselves who don't
want to change a sysctl, but the customers who would have to tweak that
in their VMs?

rick
On Wed, 2015-04-15 at 19:04 +0100, George Dunlap wrote:
> Maybe you should stop wasting all of our time and just tell us what
> you're thinking.

I think you are making me waste my time.

I already gave all the hints in prior discussions.

Rome was not built in one day.
On Wed, 2015-04-15 at 11:19 -0700, Rick Jones wrote:
> Well, I'm not sure that it is George and Jonathan themselves who don't
> want to change a sysctl, but the customers who would have to tweak that
> in their VMs?

Keep in mind that some VM users install custom qdiscs, or even custom TCP
sysctls.
On 04/15/2015 11:32 AM, Eric Dumazet wrote:
> On Wed, 2015-04-15 at 11:19 -0700, Rick Jones wrote:
>
>> Well, I'm not sure that it is George and Jonathan themselves who don't
>> want to change a sysctl, but the customers who would have to tweak that
>> in their VMs?
>
> Keep in mind that some VM users install custom qdiscs, or even custom TCP
> sysctls.

That could very well be, though I confess I've not seen that happening in
my little corner of the cloud. They tend to want to launch the VM and go.
Some of the more advanced/sophisticated ones might tweak a few things, but
my (admittedly limited) experience has been that they are few in number.
They just expect it to work "out of the box" (to the extent one can still
use that phrase).

It's kind of ironic: go back to the (early) 1990s, when NICs generated a
completion interrupt for every individual tx completion (and incoming
packet), and all anyone wanted to do was coalesce/avoid interrupts. I
guess that has gone rather far, and today, to fight bufferbloat, TCP gets
tweaked to favor quick tx completions. Call it cycles, or pendulums, or
whatever, I guess.

I wonder just how consistent tx completion timings are for a VM, such
that a virtio_net or whatnot in the VM could pick a per-device setting to
advertise to TCP? Hopefully full NIC emulation is no longer a thing and
VMs "universally" use a virtual NIC interface. At least in my little
corner of the cloud, emulated NICs are gone, and good riddance.

rick
On 04/15/2015 07:19 PM, Eric Dumazet wrote:
> On Wed, 2015-04-15 at 19:04 +0100, George Dunlap wrote:
>
>> Maybe you should stop wasting all of our time and just tell us what
>> you're thinking.
>
> I think you are making me waste my time.
>
> I already gave all the hints in prior discussions.

Right, and I suggested these two options:

"Obviously one solution would be to allow the drivers themselves to set
the tcp_limit_output_bytes, but that seems like a maintenance nightmare.

"Another simple solution would be to allow drivers to indicate whether
they have a high transmit latency, and have the kernel use a higher
value by default when that's the case." [1]

Neither of which you commented on. Instead you pointed me to a comment
that only partially described what the limitations were. (I.e., it
described the "two packets or 1ms", but not how they related, nor how
they related to the "max of 2 64k packets outstanding" of the default
tcp_limit_output_bytes setting.)

 -George

[1] http://marc.info/?i=<CAFLBxZYt7-v29ysm=f+5QMOw64_QhESjzj98udba+1cS-PfObA@mail.gmail.com>
On 04/16/2015 10:56 AM, George Dunlap wrote:
> On 04/15/2015 07:19 PM, Eric Dumazet wrote:
>> On Wed, 2015-04-15 at 19:04 +0100, George Dunlap wrote:
>>
>>> Maybe you should stop wasting all of our time and just tell us what
>>> you're thinking.
>>
>> I think you are making me waste my time.
>>
>> I already gave all the hints in prior discussions.
>
> Right, and I suggested these two options:

So mid-term, it would be much more beneficial if you attempted to fix the
underlying driver issues that actually cause high tx completion delays,
instead of reintroducing bufferbloat, so that we can all move forward and
not backwards in time.

> "Obviously one solution would be to allow the drivers themselves to set
> the tcp_limit_output_bytes, but that seems like a maintenance
> nightmare.

Possible, but very hacky, as you penalize globally.

> "Another simple solution would be to allow drivers to indicate whether
> they have a high transmit latency, and have the kernel use a higher
> value by default when that's the case." [1]
>
> Neither of which you commented on. Instead you pointed me to a comment

What Eric described to you was that you introduce a new netdev member
like netdev->needs_bufferbloat, set that indication from the driver side,
and cache it in the socket that binds to it, so you can adjust the test
in tcp_xmit_size_goal(). It should merely be seen as a hint/indication
for such devices. Hmm?

> that only partially described what the limitations were. (I.e., it
> described the "two packets or 1ms", but not how they related, nor how
> they related to the "max of 2 64k packets outstanding" of the default
> tcp_limit_output_bytes setting.)
>
>  -George
>
> [1] http://marc.info/?i=<CAFLBxZYt7-v29ysm=f+5QMOw64_QhESjzj98udba+1cS-PfObA@mail.gmail.com>
From: George Dunlap
> Sent: 16 April 2015 09:56
> On 04/15/2015 07:19 PM, Eric Dumazet wrote:
> > On Wed, 2015-04-15 at 19:04 +0100, George Dunlap wrote:
> >
> >> Maybe you should stop wasting all of our time and just tell us what
> >> you're thinking.
> >
> > I think you are making me waste my time.
> >
> > I already gave all the hints in prior discussions.
>
> Right, and I suggested these two options:
>
> "Obviously one solution would be to allow the drivers themselves to set
> the tcp_limit_output_bytes, but that seems like a maintenance
> nightmare.
>
> "Another simple solution would be to allow drivers to indicate whether
> they have a high transmit latency, and have the kernel use a higher
> value by default when that's the case." [1]
>
> Neither of which you commented on. Instead you pointed me to a comment
> that only partially described what the limitations were. (I.e., it
> described the "two packets or 1ms", but not how they related, nor how
> they related to the "max of 2 64k packets outstanding" of the default
> tcp_limit_output_bytes setting.)

ISTM that you are changing the wrong knob. You need to change something
that affects the global amount of pending tx data, not the amount that
can be buffered by a single connection.

If you change tcp_limit_output_bytes and then have 1000 connections
trying to send data, you'll suffer 'bufferbloat'.

If you call skb_orphan() in the tx setup path then the total number of
buffers is limited, but a single connection can (and will) fill the tx
ring, leading to incorrect RTT calculations and additional latency for
other connections. This will give high single-connection throughput but
isn't ideal.

One possibility might be to call skb_orphan() when enough time has
elapsed since the packet was queued for transmit that it is very likely
to have actually been transmitted, even though 'transmit done' has not
yet been signalled. Not at all sure how this would fit in though...

	David
On 04/16/2015 10:20 AM, Daniel Borkmann wrote:
> So mid-term, it would be much more beneficial if you attempted to fix the
> underlying driver issues that actually cause high tx completion delays,
> instead of reintroducing bufferbloat, so that we can all move forward and
> not backwards in time.

Yes, we definitely see the need for this. We certainly agree that
bufferbloat needs to be reduced, and minimizing the data we need "in the
pipe" for full performance on xennet is an important part of that.

It should be said, however, that any virtual device is always going to
have higher latency than a physical device. Hopefully we'll be able to
get the latency of xennet down to something more "reasonable", but it may
just not be possible. And in any case, if we're going to be cranking down
these limits to just barely within the tolerance of physical NICs,
virtual devices (either xennet or virtio_net) are never going to be able
to catch up. (Without cheating, that is.)

> What Eric described to you was that you introduce a new netdev member
> like netdev->needs_bufferbloat, set that indication from the driver side,
> and cache it in the socket that binds to it, so you can adjust the test
> in tcp_xmit_size_goal(). It should merely be seen as a hint/indication
> for such devices. Hmm?

He suggested that after he'd been prodded by 4 more e-mails in which two
of us guessed what he was trying to get at. That's what I was complaining
about.

Having a per-device "long transmit latency" hint sounds like a sensible
short-term solution to me.

 -George
On Thu, Apr 16, 2015 at 10:22 AM, David Laight <David.Laight@aculab.com> wrote:
> ISTM that you are changing the wrong knob.
> You need to change something that affects the global amount of pending tx data,
> not the amount that can be buffered by a single connection.

Well, it seems like the problem is that the global amount of pending tx
data is high enough, but that the per-stream amount is too low for only a
single stream.

> If you change tcp_limit_output_bytes and then have 1000 connections trying
> to send data you'll suffer 'bufferbloat'.

Right -- so are you worried about the buffers in the local device here,
or are you worried about buffers elsewhere in the network?

If you're worried about buffers on the local device, don't you have a
similar problem for physical NICs? I.e., if a NIC has a big buffer that
you're trying to keep mostly empty, limiting a single TCP stream may keep
that buffer empty, but if you have 1000 connections, 1000*limit will
still fill up the buffer.

Or am I missing something?

> If you call skb_orphan() in the tx setup path then the total number of
> buffers is limited, but a single connection can (and will) fill the tx
> ring, leading to incorrect RTT calculations and additional latency for
> other connections.
> This will give high single-connection throughput but isn't ideal.
>
> One possibility might be to call skb_orphan() when enough time has
> elapsed since the packet was queued for transmit that it is very likely
> to have actually been transmitted, even though 'transmit done' has not
> yet been signalled.
> Not at all sure how this would fit in though...

Right -- so it sounds like the problem with skb_orphan() is making sure
that the tx ring is shared properly between different streams. That would
mean that ideally we wouldn't call it until the tx ring actually had
space to add more packets onto it.
The Xen project is having a sort of developer meeting in a few weeks; if
we can get a good picture of all the constraints, maybe we can hash out a
solution that works for everyone.

 -George
On Thu, 2015-04-16 at 11:01 +0100, George Dunlap wrote:
> He suggested that after he'd been prodded by 4 more e-mails in which two
> of us guessed what he was trying to get at. That's what I was
> complaining about.

My big complaint is that I suggested testing a doubling of the sysctl,
which gave good results.

Then you provided a patch using an 8x factor. How does that sound?

Next time I ask for a raise, I should try an 8x factor as well; who
knows, it might be accepted.
On Thu, Apr 16, 2015 at 1:42 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2015-04-16 at 11:01 +0100, George Dunlap wrote:
>
>> He suggested that after he'd been prodded by 4 more e-mails in which two
>> of us guessed what he was trying to get at. That's what I was
>> complaining about.
>
> My big complaint is that I suggested testing a doubling of the sysctl,
> which gave good results.
>
> Then you provided a patch using an 8x factor. How does that sound?
>
> Next time I ask for a raise, I should try an 8x factor as well; who
> knows, it might be accepted.

I see. I chose the value that Stefano had determined completely
eliminated the overhead. Doubling the value reduces the overhead to 8%,
which should be fine for a short-term fix while we get a proper
mid/long-term fix.

 -George
Hi Eric,

Sorry for coming late to the discussion.

On Thu, Apr 16, 2015 at 05:42:16AM -0700, Eric Dumazet wrote:
> On Thu, 2015-04-16 at 11:01 +0100, George Dunlap wrote:
>
> > He suggested that after he'd been prodded by 4 more e-mails in which two
> > of us guessed what he was trying to get at. That's what I was
> > complaining about.
>
> My big complaint is that I suggested testing a doubling of the sysctl,
> which gave good results.

Do I understand correctly that it's acceptable to you to double the size
of the buffer? If so, I will send a patch to do that.

Wei.

> Then you provided a patch using an 8x factor. How does that sound?
>
> Next time I ask for a raise, I should try an 8x factor as well; who
> knows, it might be accepted.
On Tue, 2015-06-02 at 10:52 +0100, Wei Liu wrote:
> Hi Eric,
>
> Sorry for coming late to the discussion.
>
> On Thu, Apr 16, 2015 at 05:42:16AM -0700, Eric Dumazet wrote:
> > On Thu, 2015-04-16 at 11:01 +0100, George Dunlap wrote:
> >
> > > He suggested that after he'd been prodded by 4 more e-mails in which two
> > > of us guessed what he was trying to get at. That's what I was
> > > complaining about.
> >
> > My big complaint is that I suggested testing a doubling of the sysctl,
> > which gave good results.
>
> Do I understand correctly that it's acceptable to you to double the size
> of the buffer? If so, I will send a patch to do that.

Absolutely.
TSQ: Raise default static TSQ limit

A new dynamic TSQ limit was introduced in c/s 605ad7f18 based on the
size of actual packets and the amount of data being transmitted.
Raise the default static limit to allow that new limit to actually
come into effect.

This fixes a regression where NICs with large transmit completion
times (such as xennet) had a 30% hit unless the user manually tweaked
the value in /proc.

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1db253e..8ad7cdf 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -50,8 +50,8 @@ int sysctl_tcp_retrans_collapse __read_mostly = 1;
  */
 int sysctl_tcp_workaround_signed_windows __read_mostly = 0;
 
-/* Default TSQ limit of two TSO segments */
-int sysctl_tcp_limit_output_bytes __read_mostly = 131072;
+/* Static TSQ limit. A more dynamic limit is calculated in tcp_write_xmit. */
+int sysctl_tcp_limit_output_bytes __read_mostly = 1048576;
 
 /* This limits the percentage of the congestion window which we
  * will allow a single TSO frame to consume. Building TSO frames