TCP small packets throughput and multiqueue virtio-net

Message ID 5139840A.6020204@redhat.com
State New, archived

Commit Message

Jason Wang March 8, 2013, 6:24 a.m. UTC
Hello all:

I hit an issue when testing multiqueue virtio-net. When testing guest
small-packet stream sending performance with netperf, I found a
regression with multiqueue. When I run 2 sessions of the TCP_STREAM test
with 1024-byte messages from guest to local host, I get the following
result:

1q result: 3457.64
2q result: 7781.45

Statistics show that, compared with one queue, multiqueue tends to send
many more but smaller packets. Tcpdump shows that single queue is much
more likely to produce a 64K GSO packet than multiqueue. More but
smaller packets cause more vmexits and interrupts, which leads to a
degradation in throughput.

The problem only exists for small-packet sending. When I test with
larger sizes, multiqueue gradually outperforms single queue. Multiqueue
also outperforms single queue in both the TCP_RR and pktgen tests (even
with small packets). The problem disappears when I turn off both GSO and
TSO.

I'm not sure whether it's a bug or expected behavior, since we do get an
improvement in latency anyway. If it's a bug, I suspect it is related to
the TCP GSO batching algorithm, which tends to batch less in this
situation. (Jiri Pirko suspects a defect in the virtio-net driver, but I
didn't find any obvious clue for this.) After some experiments, I found
that it may be related to tcp_tso_should_defer(). After
1) changing tcp_tso_win_divisor to 1
2) applying the following change
the throughput was almost the same as (but still a little worse than)
single queue:

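diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index fd0cea1..dedd2a6 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1777,10 +1777,12 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
 
        limit = min(send_win, cong_win);
 
+#if 0
        /* If a full-sized TSO skb can be sent, do it. */
        if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
                           sk->sk_gso_max_segs * tp->mss_cache))
                goto send_now;
+#endif
 
        /* Middle in queue won't get any more data, full sendable already? */
        if ((skb != tcp_write_queue_tail(sk)) && (limit >= skb->len))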

Git history shows this check was added for both westwood and fasttcp.
I'm not familiar with TCP, but it looks like we can easily hit this
check when multiqueue is enabled for virtio-net. Maybe I'm wrong, but I
wonder whether we can still do some batching here.
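
For readers who want to play with the numbers, here is a rough user-space
sketch of the decision discussed above. It is not the kernel code: struct
tso_model, model_should_defer() and the full_size_check switch are
hypothetical names made up for illustration. The first two checks follow
the patch context; the tcp_tso_win_divisor test at the end is an
assumption about how the divisor is applied, and the FIN, congestion-state
and timer checks of the real function are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the state the real function
 * reads from struct sock / struct tcp_sock. */
struct tso_model {
	uint32_t send_win;     /* bytes the send window still allows */
	uint32_t snd_wnd;      /* peer's advertised window, in bytes */
	uint32_t snd_cwnd;     /* congestion window, in packets */
	uint32_t in_flight;    /* packets currently in flight */
	uint32_t mss;          /* stands in for tp->mss_cache */
	uint32_t gso_max_size; /* stands in for sk->sk_gso_max_size */
	uint32_t gso_max_segs; /* stands in for sk->sk_gso_max_segs */
	uint32_t win_divisor;  /* tcp_tso_win_divisor, must be nonzero here */
};

static uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/* Returns true when the model would defer, false when it would send now. */
static bool model_should_defer(const struct tso_model *m, uint32_t skb_len,
			       bool skb_is_tail, bool full_size_check)
{
	assert(m->snd_cwnd > m->in_flight);
	assert(m->win_divisor > 0);

	uint32_t cong_win = (m->snd_cwnd - m->in_flight) * m->mss;
	uint32_t limit = min_u32(m->send_win, cong_win);

	/* Check 1: a full-sized TSO skb can be sent (the part the patch
	 * wraps in #if 0; full_size_check == false mimics the patch). */
	if (full_size_check &&
	    limit >= min_u32(m->gso_max_size, m->gso_max_segs * m->mss))
		return false;

	/* Check 2: an skb in the middle of the queue that already fits. */
	if (!skb_is_tail && limit >= skb_len)
		return false;

	/* Assumed divisor test: send if at least 1/win_divisor of the
	 * window is available. */
	uint32_t chunk = min_u32(m->snd_wnd, m->snd_cwnd * m->mss) /
			 m->win_divisor;
	if (limit >= chunk)
		return false;

	return true;
}
```

With send_win = snd_wnd = 1000000, snd_cwnd = 100, in_flight = 10,
mss = 1448, gso_max_size = 65536 and win_divisor = 1, the model sends a
tail skb immediately while check 1 is active and defers it once check 1
is skipped.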

Any comments or thoughts are welcome.

Thanks

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Dumazet March 8, 2013, 3:05 p.m. UTC | #1
On Fri, 2013-03-08 at 14:24 +0800, Jason Wang wrote:
> Hello all:
> 
> I meet an issue when testing multiqueue virtio-net. When I testing guest
> small packets stream sending performance with netperf. I find an
> regression of multiqueue. When I run 2 sessions of TCP_STREAM test with
> 1024 byte from guest to local host, I get following result:
> 
> 1q result: 3457.64
> 2q result: 7781.45
> 
> Statistics shows that: compared with one queue, multiqueue tends to send
> much more but smaller packets. Tcpdump shows single queue has a much
> higher possibility to produce a 64K gso packet compared to multiqueue.
> More but smaller packets will cause more vmexits and interrupts which
> lead a degradation on throughput.
> 
> Then problem only exist for small packets sending. When I test with
> larger size, multiqueue will gradually outperfrom single queue. And
> multiqueue also outperfrom in both TCP_RR and pktgen test (even with
> small packets). The problem disappear when I turn off both gso and tso.
> 

This makes little sense to me: a TCP_RR workload (assuming a one-byte
payload) cannot use GSO or TSO anyway. Same for pktgen, as it uses UDP.

> I'm not sure whether it's a bug or expected since anyway we get
> improvement on latency. And if it's a bug, I suspect it was related to
> TCP GSO batching algorithm who tends to batch less in this situation. (
> Jiri Pirko suspect it was the defect of virtio-net driver, but I didn't
> find any obvious clue on this). After some experiments, I find the it
> maybe related to tcp_tso_should_defer(), after
> 1) change the tcp_tso_win_divisor to 1
> 2) the following changes
> the throughput were almost the same (but still a little worse) as single
> queue:
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index fd0cea1..dedd2a6 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1777,10 +1777,12 @@ static bool tcp_tso_should_defer(struct sock
> *sk, struct sk_buff *skb)
>  
>         limit = min(send_win, cong_win);
>  
> +#if 0
>         /* If a full-sized TSO skb can be sent, do it. */
>         if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
>                            sk->sk_gso_max_segs * tp->mss_cache))
>                 goto send_now;
> +#endif
>  
>         /* Middle in queue won't get any more data, full sendable
> already? */
>         if ((skb != tcp_write_queue_tail(sk)) && (limit >= skb->len))
> 
> Git history shows this check were added for both westwood and fasttcp,
> I'm not familiar with tcp but looks like we can easily hit this check
> under when multiqueue is enabled for virtio-net. Maybe I was wrong but I
> wonder whether we can still do some batching here.
> 
> Any comments, thoughts are welcomed.
> 

Well, the point is: if your app does write(1024) bytes, that's probably
because it wants small packets from the very beginning. (See the TCP
PUSH flag?)

If the transport is slow, the TCP stack will automatically collapse
several writes into single skbs (assuming TSO or GSO is on), and you'll
see big GSO packets with tcpdump [1]. So TCP will help you get less
overhead in this case.

But if the transport is fast, you'll see small packets, and that's good
for latency.

So my opinion is: it's behaving exactly as expected.

If you want bigger packets, either:
- Make the application do big write()s
- Slow the vmexit ;)

[1] GSO fools tcpdump: the actual packets sent on the wire are not 'big
packets', but they hit dev_hard_start_xmit() as GSO packets.
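
A toy model of the collapsing described above, for illustration only. It
is not kernel code. It assumes one write() of write_bytes per tick and one
transmit opportunity every drain_ticks ticks, with everything queued in
between merged into a single skb capped at gso_max bytes. The names
simulate_coalescing, coalesce_stats and drain_ticks are made up.

```c
#include <assert.h>

/* Totals collected by the toy simulation. */
struct coalesce_stats {
	unsigned long skbs;  /* number of skbs handed to the transport */
	unsigned long bytes; /* total payload bytes in those skbs */
};

/* One write of write_bytes arrives per tick; the transport accepts one
 * skb every drain_ticks ticks. Pending data is merged into that skb,
 * capped at gso_max bytes. */
static struct coalesce_stats simulate_coalescing(unsigned int ticks,
						 unsigned int write_bytes,
						 unsigned int drain_ticks,
						 unsigned int gso_max)
{
	struct coalesce_stats st = { 0, 0 };
	unsigned long pending = 0;

	assert(drain_ticks > 0);

	for (unsigned int t = 1; t <= ticks; t++) {
		pending += write_bytes; /* write lands in the tail skb */

		if (t % drain_ticks == 0) { /* transmit opportunity */
			unsigned long len = pending < gso_max ? pending : gso_max;

			st.skbs++;
			st.bytes += len;
			pending -= len;
		}
	}
	return st;
}
```

Over 64 ticks of 1024-byte writes, drain_ticks = 1 (fast transport) gives
64 skbs of 1024 bytes, while drain_ticks = 16 (slow transport) gives 4
skbs of 16384 bytes for the same 65536 bytes of payload.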



Rick Jones March 8, 2013, 5:26 p.m. UTC | #2
>
> Well, the point is : if your app does write(1024) bytes, thats probably
> because it wants small packets from the very beginning. (See the TCP
> PUSH flag ?)

I think that raises the question of whether or not Jason was setting the 
test-specific -D option on his TCP_STREAM tests, to have netperf make a 
setsockopt(TCP_NODELAY) call.
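
For reference, a minimal sketch of what such a call looks like on a TCP
socket. This is not netperf's code; the helper names set_tcp_nodelay() and
get_tcp_nodelay() are made up for illustration, and only the standard
setsockopt()/getsockopt() calls are used.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable Nagle on a TCP socket.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
static int set_tcp_nodelay(int fd)
{
	int one = 1;

	return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}

/* Read the option back.
 * Returns 1 if TCP_NODELAY is set, 0 if it is clear, -1 on error. */
static int get_tcp_nodelay(int fd)
{
	int val = 0;
	socklen_t len = sizeof(val);

	if (getsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &val, &len) < 0)
		return -1;
	return val != 0;
}
```

Reading the option back with getsockopt() is a cheap way to confirm in a
test harness that the setting actually took effect.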

happy benchmarking,

rick jones

Jason Wang March 11, 2013, 6:17 a.m. UTC | #3
On 03/08/2013 11:05 PM, Eric Dumazet wrote:
> Well, the point is : if your app does write(1024) bytes, thats probably
> because it wants small packets from the very beginning. (See the TCP
> PUSH flag ?)

I didn't fully understand the question, but according to tcpdump, the
TCP PUSH flag was seen in very few packets.

> If the transport is slow, TCP stack will automatically collapse several
> write into single skbs (assuming TSO or GSO is on), and you'll see big
> GSO packets with tcpdump [1]. So TCP will help you to get less overhead
> in this case.
>
> But if transport is fast, you'll see small packets, and thats good for
> latency.
>
> So my opinion is : its exactly behaving as expected.
>
> If you want bigger packets either :
> - Make the application doing big write()
> - Slow the vmexit ;)

Good to know this, thanks for the explanation.
Jason Wang March 11, 2013, 6:21 a.m. UTC | #4
On 03/09/2013 01:26 AM, Rick Jones wrote:
>
>>
>> Well, the point is : if your app does write(1024) bytes, thats probably
>> because it wants small packets from the very beginning. (See the TCP
>> PUSH flag ?)
>
> I think that raises the question of whether or not Jason was setting
> the test-specific -D option on his TCP_STREAM tests, to have netperf
> make a setsockopt(TCP_NODELAY) call.

I didn't set the -D option to disable Nagle, but I get almost the same
result with -D (1024-byte TCP_STREAM from guest to local host).
Michael S. Tsirkin March 11, 2013, 11:29 a.m. UTC | #5
On Fri, Mar 08, 2013 at 07:05:15AM -0800, Eric Dumazet wrote:
> On Fri, 2013-03-08 at 14:24 +0800, Jason Wang wrote:
> > I'm not sure whether it's a bug or expected since anyway we get
> > improvement on latency. And if it's a bug, I suspect it was related to
> > TCP GSO batching algorithm who tends to batch less in this situation. (
> > Jiri Pirko suspect it was the defect of virtio-net driver, but I didn't
> > find any obvious clue on this). After some experiments, I find the it
> > maybe related to tcp_tso_should_defer(), after
> > 1) change the tcp_tso_win_divisor to 1
> > 2) the following changes
> > the throughput were almost the same (but still a little worse) as single
> > queue:
> > 
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index fd0cea1..dedd2a6 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -1777,10 +1777,12 @@ static bool tcp_tso_should_defer(struct sock
> > *sk, struct sk_buff *skb)
> >  
> >         limit = min(send_win, cong_win);
> >  
> > +#if 0
> >         /* If a full-sized TSO skb can be sent, do it. */
> >         if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
> >                            sk->sk_gso_max_segs * tp->mss_cache))
> >                 goto send_now;
> > +#endif
> >  
> >         /* Middle in queue won't get any more data, full sendable
> > already? */
> >         if ((skb != tcp_write_queue_tail(sk)) && (limit >= skb->len))
> > 
> > Git history shows this check were added for both westwood and fasttcp,

Hmm yes,
http://www.mail-archive.com/netdev@vger.kernel.org/msg08738.html
How about we disable it for cubic and reno then?

> > I'm not familiar with tcp but looks like we can easily hit this check
> > under when multiqueue is enabled for virtio-net. Maybe I was wrong but I
> > wonder whether we can still do some batching here.
> > 
> > Any comments, thoughts are welcomed.
> > 
> 
> Well, the point is : if your app does write(1024) bytes, thats probably
> because it wants small packets from the very beginning. (See the TCP
> PUSH flag ?)

In that case the application typically won't have packets
in flight (as with TCP_RR), so TSO won't trigger at all, no?

It would seem that packets in flight might rather indicate
that the application is trying to keep the socket buffer full
by giving the kernel data as fast as it becomes available.
At least, this is exactly what benchmark tools seem to be doing, right?

> If the transport is slow, TCP stack will automatically collapse several
> write into single skbs (assuming TSO or GSO is on), and you'll see big
> GSO packets with tcpdump [1]. So TCP will help you to get less overhead
> in this case.
> 
> But if transport is fast, you'll see small packets, and thats good for
> latency.

But is a large CWND really a good indicator of a low-latency link? Can't
CWND grow (depending on the protocol) as long as there's no packet loss,
even if the latency is high?

So if a VM is using a 10G backend link in the host, it seems that (due
to vmexit overhead) the latency is high, so we are not gaining much from
sending the packet a bit earlier. OTOH, the bandwidth is high, so the
per-packet overhead becomes measurable. Virt workloads are probably not
the only ones like that; it's just that it's easier for people to
observe the overhead.
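
To put rough numbers on the per-packet overhead, here is a
back-of-the-envelope sketch. It assumes a fully used 10 Gbit/s link and
counts payload bytes only (headers and framing ignored); the helper name
packets_per_second() is made up for illustration.

```c
/* Packets per second needed to fill a link of link_bps bits per second
 * when each packet carries payload_bytes of data. Protocol headers and
 * framing overhead are ignored. */
static double packets_per_second(double link_bps, double payload_bytes)
{
	return link_bps / (payload_bytes * 8.0);
}
```

Under these assumptions, 1024-byte packets need about 1.22 million
packets per second to fill the link, while 64K GSO packets need about
19 thousand, a factor of 64 fewer per-packet events such as vmexits.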


Patch

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index fd0cea1..dedd2a6 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1777,10 +1777,12 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
 
        limit = min(send_win, cong_win);
 
+#if 0
        /* If a full-sized TSO skb can be sent, do it. */
        if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
                           sk->sk_gso_max_segs * tp->mss_cache))
                goto send_now;
+#endif
 
         /* Middle in queue won't get any more data, full sendable already? */
         if ((skb != tcp_write_queue_tail(sk)) && (limit >= skb->len))