
[Xen-devel] "tcp: refine TSO autosizing" causes performance regression on Xen

Message ID 552EA2BC.5000707@eu.citrix.com (mailing list archive)
State New, archived

Commit Message

George Dunlap April 15, 2015, 5:41 p.m. UTC
On 04/15/2015 06:29 PM, Eric Dumazet wrote:
> On Wed, 2015-04-15 at 18:23 +0100, George Dunlap wrote:
>> On 04/15/2015 05:38 PM, Eric Dumazet wrote:
>>> My thought is that instead of these long talks you guys should read the
>>> code:
>>>
>>>                 /* TCP Small Queues :
>>>                  * Control number of packets in qdisc/devices to two packets / or ~1 ms.
>>>                  * This allows for :
>>>                  *  - better RTT estimation and ACK scheduling
>>>                  *  - faster recovery
>>>                  *  - high rates
>>>                  * Alas, some drivers / subsystems require a fair amount
>>>                  * of queued bytes to ensure line rate.
>>>                  * One example is wifi aggregation (802.11 AMPDU)
>>>                  */
>>>                 limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
>>>                 limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
>>>
>>>
>>> Then you'll see that most of your questions are already answered.
>>>
>>> Feel free to try to improve the behavior, if it does not hurt critical workloads
>>> like TCP_RR, where we send very small messages, millions of times per second.
>>
>> First of all, with regard to critical workloads, once this patch gets
>> into distros, *normal TCP streams* on every VM running on Amazon,
>> Rackspace, Linode, &c will get a 30% hit in performance *by default*.
>> Normal TCP streams on xennet *are* a critical workload, and deserve the
>> same kind of accommodation as TCP_RR (if not more).  The same goes for
>> virtio_net.
>>
>> Secondly, according to Stefano's and Jonathan's tests,
>> tcp_limit_output_bytes completely fixes the problem for Xen.
>>
>> Which means that max(2*skb->truesize, sk->sk_pacing_rate >>10) is
>> *already* larger for Xen; that calculation mentioned in the comment is
>> *already* doing the right thing.
>>
>> As Jonathan pointed out, sysctl_tcp_limit_output_bytes is overriding an
>> automatic TSQ calculation which is actually choosing an effective value
>> for xennet.
>>
>> It certainly makes sense for sysctl_tcp_limit_output_bytes to be an
>> actual maximum limit.  I went back and looked at the original patch
>> which introduced it (46d3ceabd), and it looks to me like it was designed
>> to be a rough, quick estimate of "two packets outstanding" (by choosing
>> the maximum size of the packet, 64k, and multiplying it by two).
>>
>> Now that you have a better algorithm -- the size of 2 actual packets or
>> the amount transmitted in 1ms -- it seems like the default
>> sysctl_tcp_limit_output_bytes should be higher, and let the automatic
>> TSQ you have on the first line throttle things down when necessary.
> 
> 
> I asked you guys to make a test by increasing
> sysctl_tcp_limit_output_bytes

So you'd be OK with a patch like this?  (With perhaps a better changelog?)

 -George

---
TSQ: Raise default static TSQ limit

A new dynamic TSQ limit was introduced in c/s 605ad7f18 based on the
size of actual packets and the amount of data being transmitted.
Raise the default static limit to allow that new limit to actually
come into effect.

This fixes a regression where NICs with large transmit completion
times (such as xennet) had a 30% hit unless the user manually tweaked
the value in /proc.

Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>


Comments

Eric Dumazet April 15, 2015, 5:52 p.m. UTC | #1
On Wed, 2015-04-15 at 18:41 +0100, George Dunlap wrote:

> So you'd be OK with a patch like this?  (With perhaps a better changelog?)
> 
>  -George
> 
> ---
> TSQ: Raise default static TSQ limit
> 
> A new dynamic TSQ limit was introduced in c/s 605ad7f18 based on the
> size of actual packets and the amount of data being transmitted.
> Raise the default static limit to allow that new limit to actually
> come into effect.
> 
> This fixes a regression where NICs with large transmit completion
> times (such as xennet) had a 30% hit unless the user manually tweaked
> the value in /proc.
> 
> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 1db253e..8ad7cdf 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -50,8 +50,8 @@ int sysctl_tcp_retrans_collapse __read_mostly = 1;
>   */
>  int sysctl_tcp_workaround_signed_windows __read_mostly = 0;
> 
> -/* Default TSQ limit of two TSO segments */
> -int sysctl_tcp_limit_output_bytes __read_mostly = 131072;
> +/* Static TSQ limit.  A more dynamic limit is calculated in tcp_write_xmit. */
> +int sysctl_tcp_limit_output_bytes __read_mostly = 1048576;
> 
>  /* This limits the percentage of the congestion window which we
>   * will allow a single TSO frame to consume.  Building TSO frames
> 

Have you tested this patch on a NIC without GSO/TSO ?

This would allow more than 500 packets for a single flow.

Hello bufferbloat.

So my answer to this patch is a no.
Rick Jones April 15, 2015, 5:55 p.m. UTC | #2
>
> Have you tested this patch on a NIC without GSO/TSO ?
>
> This would allow more than 500 packets for a single flow.
>
> Hello bufferbloat.

Wouldn't the fq_codel qdisc on that interface address that problem?

rick
George Dunlap April 15, 2015, 6:04 p.m. UTC | #3
On 04/15/2015 06:52 PM, Eric Dumazet wrote:
> On Wed, 2015-04-15 at 18:41 +0100, George Dunlap wrote:
> 
>> So you'd be OK with a patch like this?  (With perhaps a better changelog?)
>>
>>  -George
>>
>> ---
>> TSQ: Raise default static TSQ limit
>>
>> A new dynamic TSQ limit was introduced in c/s 605ad7f18 based on the
>> size of actual packets and the amount of data being transmitted.
>> Raise the default static limit to allow that new limit to actually
>> come into effect.
>>
>> This fixes a regression where NICs with large transmit completion
>> times (such as xennet) had a 30% hit unless the user manually tweaked
>> the value in /proc.
>>
>> Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>
>>
>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>> index 1db253e..8ad7cdf 100644
>> --- a/net/ipv4/tcp_output.c
>> +++ b/net/ipv4/tcp_output.c
>> @@ -50,8 +50,8 @@ int sysctl_tcp_retrans_collapse __read_mostly = 1;
>>   */
>>  int sysctl_tcp_workaround_signed_windows __read_mostly = 0;
>>
>> -/* Default TSQ limit of two TSO segments */
>> -int sysctl_tcp_limit_output_bytes __read_mostly = 131072;
>> +/* Static TSQ limit.  A more dynamic limit is calculated in tcp_write_xmit. */
>> +int sysctl_tcp_limit_output_bytes __read_mostly = 1048576;
>>
>>  /* This limits the percentage of the congestion window which we
>>   * will allow a single TSO frame to consume.  Building TSO frames
>>
> 
> Have you tested this patch on a NIC without GSO/TSO ?
> 
> This would allow more than 500 packets for a single flow.
> 
> Hello bufferbloat.
> 
> So my answer to this patch is a no.

You said:

"I asked you guys to make a test by increasing
sysctl_tcp_limit_output_bytes  You have no need to explain me the code I
wrote, thank you."

Which implies to me that you think you've already pointed us to the
answer you want and we're just not getting it.

Maybe you should stop wasting all of our time and just tell us what
you're thinking.

 -George
Eric Dumazet April 15, 2015, 6:08 p.m. UTC | #4
On Wed, 2015-04-15 at 10:55 -0700, Rick Jones wrote:
> >
> > Have you tested this patch on a NIC without GSO/TSO ?
> >
> > This would allow more than 500 packets for a single flow.
> >
> > Hello bufferbloat.
> 
> Wouldn't the fq_codel qdisc on that interface address that problem?

Last time I checked, the default qdisc was pfifo_fast.

These guys do not want to change a sysctl; how will pfifo_fast magically
become fq_codel?
Rick Jones April 15, 2015, 6:19 p.m. UTC | #5
On 04/15/2015 11:08 AM, Eric Dumazet wrote:
> On Wed, 2015-04-15 at 10:55 -0700, Rick Jones wrote:
>>>
>>> Have you tested this patch on a NIC without GSO/TSO ?
>>>
>>> This would allow more than 500 packets for a single flow.
>>>
>>> Hello bufferbloat.
>>
>> Wouldn't the fq_codel qdisc on that interface address that problem?
>
> Last time I checked, the default qdisc was pfifo_fast.

Bummer.

> These guys do not want to change a sysctl; how will pfifo_fast magically
> become fq_codel?

Well, I'm not sure that it is George and Jonathan themselves who don't 
want to change a sysctl, but the customers who would have to tweak that 
in their VMs?

rick
Eric Dumazet April 15, 2015, 6:19 p.m. UTC | #6
On Wed, 2015-04-15 at 19:04 +0100, George Dunlap wrote:

> Maybe you should stop wasting all of our time and just tell us what
> you're thinking.

I think you are making me waste my time.

I already gave all the hints in prior discussions.

Rome was not built in one day.
Eric Dumazet April 15, 2015, 6:32 p.m. UTC | #7
On Wed, 2015-04-15 at 11:19 -0700, Rick Jones wrote:

> Well, I'm not sure that it is George and Jonathan themselves who don't 
> want to change a sysctl, but the customers who would have to tweak that 
> in their VMs?

Keep in mind some VM users install custom qdisc, or even custom TCP
sysctls.
Rick Jones April 15, 2015, 8:08 p.m. UTC | #8
On 04/15/2015 11:32 AM, Eric Dumazet wrote:
> On Wed, 2015-04-15 at 11:19 -0700, Rick Jones wrote:
>
>> Well, I'm not sure that it is George and Jonathan themselves who don't
>> want to change a sysctl, but the customers who would have to tweak that
>> in their VMs?
>
> Keep in mind some VM users install custom qdisc, or even custom TCP
> sysctls.

That could very well be, though I confess I've not seen that happening 
in my little corner of the cloud.  They tend to want to launch the VM 
and go.  Some of the more advanced/sophisticated ones might tweak a few 
things but my (admittedly limited) experience has been they are few in 
number.  They just expect it to work "out of the box" (to the extent one 
can use that phrase still).

It's kind of ironic - go back to the (early) 1990s when NICs generated a 
completion interrupt for every individual tx completion (and incoming 
packet) and all anyone wanted to do was coalesce/avoid interrupts.  I 
guess that has gone rather far.  And today to fight bufferbloat TCP gets 
tweaked to favor quick tx completions.  Call it cycles, or pendulums or 
whatever I guess.

I wonder just how consistent tx completion timings are for a VM, so that a 
virtio_net or whatnot in the VM could pick a per-device setting to 
advertise to TCP?  Hopefully, full NIC emulation is no longer a thing 
and VMs "universally" use a virtual NIC interface. At least in my little 
corner of the cloud, emulated NICs are gone, and good riddance.

rick
George Dunlap April 16, 2015, 8:56 a.m. UTC | #9
On 04/15/2015 07:19 PM, Eric Dumazet wrote:
> On Wed, 2015-04-15 at 19:04 +0100, George Dunlap wrote:
> 
>> Maybe you should stop wasting all of our time and just tell us what
>> you're thinking.
> 
> I think you are making me waste my time.
> 
> I already gave all the hints in prior discussions.

Right, and I suggested these two options:

"Obviously one solution would be to allow the drivers themselves to set
the tcp_limit_output_bytes, but that seems like a maintenance
nightmare.

"Another simple solution would be to allow drivers to indicate whether
they have a high transmit latency, and have the kernel use a higher
value by default when that's the case." [1]

Neither of which you commented on.  Instead you pointed me to a comment
that only partially described what the limitations were. (I.e., it
described the "two packets or 1ms", but not how they related, nor how
they related to the "max of 2 64k packets outstanding" of the default
tcp_limit_output_bytes setting.)

 -George


[1]
http://marc.info/?i=<CAFLBxZYt7-v29ysm=f+5QMOw64_QhESjzj98udba+1cS-PfObA@mail.gmail.com>
Daniel Borkmann April 16, 2015, 9:20 a.m. UTC | #10
On 04/16/2015 10:56 AM, George Dunlap wrote:
> On 04/15/2015 07:19 PM, Eric Dumazet wrote:
>> On Wed, 2015-04-15 at 19:04 +0100, George Dunlap wrote:
>>
>>> Maybe you should stop wasting all of our time and just tell us what
>>> you're thinking.
>>
>> I think you are making me waste my time.
>>
>> I already gave all the hints in prior discussions.
>
> Right, and I suggested these two options:

So mid-term, it would be much more beneficial if you attempted to fix the
underlying driver issues that actually cause high tx completion delays,
instead of reintroducing bufferbloat, so that we can all move forward
and not backwards in time.

> "Obviously one solution would be to allow the drivers themselves to set
> the tcp_limit_output_bytes, but that seems like a maintenance
> nightmare.

Possible, but very hacky, as you penalize globally.

> "Another simple solution would be to allow drivers to indicate whether
> they have a high transmit latency, and have the kernel use a higher
> value by default when that's the case." [1]
>
> Neither of which you commented on.  Instead you pointed me to a comment

What Eric described to you was that you introduce a new netdev member
like netdev->needs_bufferbloat, set that indication from the driver side,
and cache that in the socket that binds to it, so you can adjust the
test in tcp_xmit_size_goal(). It should merely be seen as a hint/indication
for such devices. Hmm?
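
Roughly, the sketch below shows one way such a hint could be wired up.
The flag and socket field names are only illustrative guesses, not
existing kernel API, and the 8x relaxation is just a placeholder:

	/* Driver probe (e.g. xen-netfront), hypothetical flag: the device
	 * has a large tx completion latency and needs more bytes in flight.
	 */
	netdev->needs_bufferbloat = 1;

	/* In tcp_write_xmit(): relax only the static cap for such devices.
	 * sk_tx_high_latency would be a hypothetical per-socket copy of the
	 * hint, cached when the socket binds to the device.
	 */
	limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
	if (sk->sk_tx_high_latency)
		limit = min_t(u32, limit, 8 * sysctl_tcp_limit_output_bytes);
	else
		limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);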

> that only partially described what the limitations were. (I.e., it
> described the "two packets or 1ms", but not how they related, nor how
> they related to the "max of 2 64k packets outstanding" of the default
> tcp_limit_output_bytes setting.)
>
>   -George
>
> [1]
> http://marc.info/?i=<CAFLBxZYt7-v29ysm=f+5QMOw64_QhESjzj98udba+1cS-PfObA@mail.gmail.com>
David Laight April 16, 2015, 9:22 a.m. UTC | #11
From: George Dunlap
> Sent: 16 April 2015 09:56
> On 04/15/2015 07:19 PM, Eric Dumazet wrote:
> > On Wed, 2015-04-15 at 19:04 +0100, George Dunlap wrote:
> >
> >> Maybe you should stop wasting all of our time and just tell us what
> >> you're thinking.
> >
> > I think you are making me waste my time.
> >
> > I already gave all the hints in prior discussions.
> 
> Right, and I suggested these two options:
> 
> "Obviously one solution would be to allow the drivers themselves to set
> the tcp_limit_output_bytes, but that seems like a maintenance
> nightmare.
> 
> "Another simple solution would be to allow drivers to indicate whether
> they have a high transmit latency, and have the kernel use a higher
> value by default when that's the case." [1]
> 
> Neither of which you commented on.  Instead you pointed me to a comment
> that only partially described what the limitations were. (I.e., it
> described the "two packets or 1ms", but not how they related, nor how
> they related to the "max of 2 64k packets outstanding" of the default
> tcp_limit_output_bytes setting.)

ISTM that you are changing the wrong knob.
You need to change something that affects the global amount of pending tx data,
not the amount that can be buffered by a single connection.

If you change tcp_limit_output_bytes and then have 1000 connections trying
to send data you'll suffer 'bufferbloat'.

If you call skb_orphan() in the tx setup path then the total number of
buffers is limited, but a single connection can (and will) fill the tx
ring, leading to incorrect RTT calculations and additional latency for
other connections.
This will give high single connection throughput but isn't ideal.

One possibility might be to call skb_orphan() when enough time has
elapsed since the packet was queued for transmit that it is very likely
to have actually been transmitted - even though 'transmit done' has
not yet been signalled.
Not at all sure how this would fit in though...
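
Something along these lines, maybe; the ring/slot structures below are
entirely made up, and the 1 ms threshold is only a guess:

	/* Periodically orphan skbs that have sat on the tx ring long enough
	 * that they have almost certainly gone out on the wire, so TSQ stops
	 * charging them to the sending socket.
	 */
	static void tx_orphan_stale(struct fake_tx_ring *ring)
	{
		unsigned long cutoff = jiffies - msecs_to_jiffies(1);
		unsigned int i;

		for (i = ring->cons; i != ring->prod; i++) {
			struct fake_tx_slot *slot = &ring->slot[i % FAKE_RING_SIZE];

			if (!slot->skb)
				continue;
			if (!time_before(slot->queued, cutoff))
				break;		/* slots are in queue order */
			skb_orphan(slot->skb);	/* drop socket accounting early */
		}
	}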

	David
George Dunlap April 16, 2015, 10:01 a.m. UTC | #12
On 04/16/2015 10:20 AM, Daniel Borkmann wrote:
> So mid term, it would be much more beneficial if you attempt fix the
> underlying driver issues that actually cause high tx completion delays,
> instead of reintroducing bufferbloat. So that we all can move forward
> and not backwards in time.

Yes, I think we definitely see the need for this.  I think we certainly
agree that bufferbloat needs to be reduced, and minimizing the data we
need "in the pipe" for full performance on xennet is an important part
of that.

It should be said, however, that any virtual device is always going to
have higher latency than a physical device.  Hopefully we'll be able to
get the latency of xennet down to something that's more "reasonable",
but it may just not be possible.  And in any case, if we're going to be
cranking down these limits to just barely within the tolerance of
physical NICs, virtual devices (either xennet or virtio_net) are never
going to be able to catch up.  (Without cheating that is.)

> What Eric described to you was that you introduce a new netdev member
> like netdev->needs_bufferbloat, set that indication from driver site,
> and cache that in the socket that binds to it, so you can adjust the
> test in tcp_xmit_size_goal(). It should merely be seen as a hint/indication
> for such devices. Hmm?

He suggested that after he'd been prodded by 4 more e-mails in which two
of us guessed what he was trying to get at.  That's what I was
complaining about.

Having a per-device "long transmit latency" hint sounds like a sensible
short-term solution to me.

 -George
George Dunlap April 16, 2015, 10:57 a.m. UTC | #13
On Thu, Apr 16, 2015 at 10:22 AM, David Laight <David.Laight@aculab.com> wrote:
> ISTM that you are changing the wrong knob.
> You need to change something that affects the global amount of pending tx data,
> not the amount that can be buffered by a single connection.

Well, it seems like the global amount of pending tx data is high
enough; it's the per-stream amount that is too low when there is only
a single stream.

> If you change tcp_limit_output_bytes and then have 1000 connections trying
> to send data you'll suffer 'bufferbloat'.

Right -- so are you worried about the buffers in the local device
here, or are you worried about buffers elsewhere in the network?

If you're worried about buffers on the local device, don't you have a
similar problem for physical NICs?  i.e., if a NIC has a big buffer
that you're trying to keep mostly empty, limiting a single TCP stream
may keep that buffer empty, but if you have 1000 connections,
1000*limit will still fill up the buffer.

Or am I missing something?

> If you call skb_orphan() in the tx setup path then the total number of
> buffers is limited, but a single connection can (and will) fill the tx
> ring, leading to incorrect RTT calculations and additional latency for
> other connections.
> This will give high single connection throughput but isn't ideal.
>
> One possibility might be to call skb_orphan() when enough time has
> elapsed since the packet was queued for transmit that it is very likely
> to have actually been transmitted - even though 'transmit done' has
> not yet been signalled.
> Not at all sure how this would fit in though...

Right -- so it sounds like the problem with skb_orphan() is making
sure that the tx ring is shared properly between different streams.
That would mean that ideally we wouldn't call it until the tx ring
actually had space to add more packets onto it.

The Xen project is having a sort of developer meeting in a few weeks;
if we can get a good picture of all the constraints, maybe we can hash
out a solution that works for everyone.

 -George
Eric Dumazet April 16, 2015, 12:42 p.m. UTC | #14
On Thu, 2015-04-16 at 11:01 +0100, George Dunlap wrote:

> He suggested that after he'd been prodded by 4 more e-mails in which two
> of us guessed what he was trying to get at.  That's what I was
> complaining about.

My big complaint is that I suggested testing a doubling of the sysctl,
which gave good results.

Then you provided a patch using an 8x factor. How does that sound?

Next time I ask for a raise, I should try an 8x factor as well; who
knows, it might be accepted.
George Dunlap April 20, 2015, 11:03 a.m. UTC | #15
On Thu, Apr 16, 2015 at 1:42 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2015-04-16 at 11:01 +0100, George Dunlap wrote:
>
>> He suggested that after he'd been prodded by 4 more e-mails in which two
>> of us guessed what he was trying to get at.  That's what I was
>> complaining about.
>
> My big complaint is that I suggested testing a doubling of the sysctl,
> which gave good results.
>
> Then you provided a patch using an 8x factor. How does that sound?
>
> Next time I ask for a raise, I should try an 8x factor as well; who
> knows, it might be accepted.

I see.  I chose the value that Stefano had determined had completely
eliminated the overhead.  Doubling the value reduces the overhead to
8%, which should be fine for a short-term fix while we get a proper
mid/long-term fix.

 -George
Wei Liu June 2, 2015, 9:52 a.m. UTC | #16
Hi Eric

Sorry for coming late to the discussion.

On Thu, Apr 16, 2015 at 05:42:16AM -0700, Eric Dumazet wrote:
> On Thu, 2015-04-16 at 11:01 +0100, George Dunlap wrote:
> 
> > He suggested that after he'd been prodded by 4 more e-mails in which two
> > of us guessed what he was trying to get at.  That's what I was
> > complaining about.
> 
> > My big complaint is that I suggested testing a doubling of the sysctl,
> > which gave good results.
> 

Do I understand correctly that it's acceptable to you to double the size
of the buffer? If so I will send a patch to do that.

Wei.

> Then you provided a patch using an 8x factor. How does that sound?
> 
> Next time I ask for a raise, I should try an 8x factor as well; who
> knows, it might be accepted.
> 
>
Eric Dumazet June 2, 2015, 4:16 p.m. UTC | #17
On Tue, 2015-06-02 at 10:52 +0100, Wei Liu wrote:
> Hi Eric
> 
> Sorry for coming late to the discussion.
> 
> On Thu, Apr 16, 2015 at 05:42:16AM -0700, Eric Dumazet wrote:
> > On Thu, 2015-04-16 at 11:01 +0100, George Dunlap wrote:
> > 
> > > He suggested that after he'd been prodded by 4 more e-mails in which two
> > > of us guessed what he was trying to get at.  That's what I was
> > > complaining about.
> > 
> > My big complaint is that I suggested testing a doubling of the sysctl,
> > which gave good results.
> > 
> 
> Do I understand correctly that it's acceptable to you to double the size
> of the buffer? If so I will send a patch to do that.

Absolutely.

Patch

TSQ: Raise default static TSQ limit
    
A new dynamic TSQ limit was introduced in c/s 605ad7f18 based on the
size of actual packets and the amount of data being transmitted.
Raise the default static limit to allow that new limit to actually
come into effect.
 
This fixes a regression where NICs with large transmit completion
times (such as xennet) had a 30% hit unless the user manually tweaked
the value in /proc.
    
Signed-off-by: George Dunlap <george.dunlap@eu.citrix.com>

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1db253e..8ad7cdf 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -50,8 +50,8 @@  int sysctl_tcp_retrans_collapse __read_mostly = 1;
  */
 int sysctl_tcp_workaround_signed_windows __read_mostly = 0;
 
-/* Default TSQ limit of two TSO segments */
-int sysctl_tcp_limit_output_bytes __read_mostly = 131072;
+/* Static TSQ limit.  A more dynamic limit is calculated in tcp_write_xmit. */
+int sysctl_tcp_limit_output_bytes __read_mostly = 1048576;
 
 /* This limits the percentage of the congestion window which we
  * will allow a single TSO frame to consume.  Building TSO frames
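
For a rough sense of why the old 131072-byte cap was the binding term for a
device like xennet, the stand-alone program below replays the two-line limit
calculation quoted earlier in the thread.  The truesize and pacing-rate
figures are assumed examples, not measurements:

	#include <stdio.h>
	#include <stdint.h>

	/* Mirror of the TSQ limit computed in tcp_write_xmit():
	 * max(2 * truesize, pacing_rate >> 10), clamped by
	 * sysctl_tcp_limit_output_bytes.
	 */
	static uint64_t tsq_limit(uint64_t truesize, uint64_t pacing_rate,
				  uint64_t sysctl_limit)
	{
		uint64_t limit = 2 * truesize;

		if ((pacing_rate >> 10) > limit)
			limit = pacing_rate >> 10;	/* ~1 ms worth of data */
		if (limit > sysctl_limit)
			limit = sysctl_limit;		/* static cap */
		return limit;
	}

	int main(void)
	{
		uint64_t truesize = 66176;		/* ~64KB TSO skb, assumed */
		uint64_t pacing_rate = 1250000000;	/* ~10 Gbit/s, assumed */

		printf("old cap: %llu bytes\n",
		       (unsigned long long)tsq_limit(truesize, pacing_rate, 131072));
		printf("new cap: %llu bytes\n",
		       (unsigned long long)tsq_limit(truesize, pacing_rate, 1048576));
		return 0;
	}

With these assumed numbers the pacing term comes to roughly 1.2MB per
millisecond, so the old 128KB cap is what actually limits the flow; raising
the cap to 1MB lets the dynamic calculation come much closer to taking
effect, which is the point of the patch.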