Message ID | 20190523210651.80902-2-fklassen@appneta.com (mailing list archive)
---|---
State | New
Series | Allow TX timestamp with UDP GSO
On Thu, May 23, 2019 at 5:09 PM Fred Klassen <fklassen@appneta.com> wrote:
>
> Fixes an issue where TX Timestamps are not arriving on the error queue
> when UDP_SEGMENT CMSG type is combined with CMSG type SO_TIMESTAMPING.
> This can be illustrated with an updated udpgso_bench_tx program which
> includes the '-T' option to test for this condition.
>
> ./udpgso_bench_tx -4ucTPv -S 1472 -l2 -D 172.16.120.18
> poll timeout
> udp tx: 0 MB/s 1 calls/s 1 msg/s
>
> The "poll timeout" message above indicates that the TX timestamp never
> arrived.
>
> It also appears that other TX CMSG types cause similar issues, for
> example trying to set SOL_IP/IP_TOS.
>
> ./udpgso_bench_tx -4ucPv -S 1472 -q 182 -l2 -D 172.16.120.18
> poll timeout
> udp tx: 0 MB/s 1 calls/s 1 msg/s
>
> This patch preserves tx_flags for the first UDP GSO segment. This
> mirrors the stack's behaviour for IPv4 fragments.
>
> Fixes: ee80d1ebe5ba ("udp: add udp gso")
> Signed-off-by: Fred Klassen <fklassen@appneta.com>
> ---
>  net/ipv4/udp_offload.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
> index 065334b41d57..33de347695ae 100644
> --- a/net/ipv4/udp_offload.c
> +++ b/net/ipv4/udp_offload.c
> @@ -228,6 +228,10 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
>  	seg = segs;
>  	uh = udp_hdr(seg);
>
> +	/* preserve TX timestamp and zero-copy info for first segment */
> +	skb_shinfo(seg)->tskey = skb_shinfo(gso_skb)->tskey;
> +	skb_shinfo(seg)->tx_flags = skb_shinfo(gso_skb)->tx_flags;
> +

Thanks for the report.

Zerocopy notification reference count is managed in skb_segment. That
should work.

Support for timestamping with the new GSO feature is indeed an
oversight. The solution is similar to how TCP associates the timestamp
with the right segment in tcp_gso_tstamp.

Only, I think we want to transfer the timestamp request to the last
datagram, not the first. For send timestamp, the final byte leaving
the host is usually more interesting.
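For reference, the TCP helper mentioned here walks the segment list and attaches the timestamp request to the segment containing the tracked byte. Lightly abridged from net/ipv4/tcp_offload.c in kernels of this era:

```c
/* Attach the timestamp request to the segment that contains
 * byte ts_seq of the original GSO skb.
 */
static void tcp_gso_tstamp(struct sk_buff *skb, unsigned int ts_seq,
			   unsigned int seq, unsigned int mss)
{
	while (skb) {
		if (before(ts_seq, seq + mss)) {
			skb_shinfo(skb)->tx_flags |= SKBTX_SW_TSTAMP;
			skb_shinfo(skb)->tskey = ts_seq;
			return;
		}

		skb = skb->next;
		seq += mss;
	}
}
```

Because TCP sets tskey to the sequence number of the last byte of the sendmsg payload, the request lands on the last segment, which is the behavior debated below.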
On Thu, May 23, 2019 at 5:09 PM Fred Klassen <fklassen@appneta.com> wrote:
> [...]
> It also appears that other TX CMSG types cause similar issues, for
> example trying to set SOL_IP/IP_TOS.
>
> ./udpgso_bench_tx -4ucPv -S 1472 -q 182 -l2 -D 172.16.120.18
> poll timeout
> udp tx: 0 MB/s 1 calls/s 1 msg/s

What exactly is the issue with IP_TOS?

If I understand correctly, the issue here is that the new 'P' option
that polls on the error queue times out. This is unrelated to
specifying TOS bits? Without zerocopy or timestamps, no message is
expected on the error queue.
> Thanks for the report.
>
> Zerocopy notification reference count is managed in skb_segment. That
> should work.
>
> Support for timestamping with the new GSO feature is indeed an
> oversight. The solution is similar to how TCP associates the timestamp
> with the right segment in tcp_gso_tstamp.
>
> Only, I think we want to transfer the timestamp request to the last
> datagram, not the first. For send timestamp, the final byte leaving
> the host is usually more interesting.

TX timestamping the last packet of a datagram is something that would
work poorly for our application. We need to measure the time from when
the first bit is sent until the first bit of the last packet is
received. Timestamping the last packet of a burst seems somewhat random
to me and would not be useful. Essentially we would be timestamping a
random byte in a UDP GSO buffer.

I believe there is a precedent for timestamping the first packet. With
IPv4 packets, the first packet is timestamped and the remaining
fragments are not.
On Thu, May 23, 2019 at 9:38 PM Fred Klassen <fklassen@appneta.com> wrote:
> [...]
> TX timestamping the last packet of a datagram is something that would
> work poorly for our application. We need to measure the time from when
> the first bit is sent until the first bit of the last packet is
> received.
> [...]
> I believe there is a precedent for timestamping the first packet. With
> IPv4 packets, the first packet is timestamped and the remaining
> fragments are not.

Interesting. TCP timestamping takes the opposite choice and does
timestamp the last byte in the sendmsg request.

It sounds like it depends on the workload. Perhaps this then needs to
be configurable with an SOF_.. flag.

Another option would be to return a timestamp for every segment. But
they would all return the same tskey. And it causes different behavior
with and without hardware offload.
> Interesting. TCP timestamping takes the opposite choice and does
> timestamp the last byte in the sendmsg request.

I have a difficult time with the philosophy of TX timestamping the last
segment. The actual timestamp occurs just before the last segment is
sent. This is neither the start nor the end of a GSO packet, which to
me seems somewhat arbitrary. It is even more arbitrary when using
software TX timestamping. These timestamps represent the time that the
packet is queued onto the NIC's buffer, not the actual time leaving the
wire. Queuing to a ring buffer is usually much faster than wire rate.
Therefore the timestamp of the last 1500 byte segment of a 64K GSO
packet may in reality represent a time about half way through the
burst.

Since the timestamp of a TX packet occurs just before any data is sent,
I have found it most valuable to timestamp just before the first byte
of the packet or burst. Conversely, I find it most valuable to get an
RX timestamp after the last byte arrives.

> It sounds like it depends on the workload. Perhaps this then needs to
> be configurable with an SOF_.. flag.

It would be interesting if a practical case can be made for
timestamping the last segment. In my mind, I don't see how that would
be valuable.

> Another option would be to return a timestamp for every segment. But
> they would all return the same tskey. And it causes different behavior
> with and without hardware offload.

When it comes to RX packets, getting per-packet (or per-segment)
timestamps is invaluable. They represent actual wire times. However my
previous research into TX timestamping has led me to conclude that
there is no practical value in timestamping every packet of a
back-to-back burst.

When using software TX timestamping, the inter-packet timestamps are
typically much faster than line rate. Even though you may be sending on
a GigE link, you may measure 20 Gbps. At higher rates, I have found
that the overhead of per-packet software timestamping can produce gaps
in packets.

When using hardware timestamping, I think you will find that nearly all
adapters only allow one timestamp at a time. Therefore only one packet
in a burst would get timestamped. There are exceptions, for example I
am playing with a 100G Mellanox adapter that has per-packet TX
timestamping. However, I suspect that when I am done testing, all I
will see is timestamps representing wire rate (e.g. 123 nsec per 1500
byte packet).

Beyond testing the accuracy of a NIC's timestamping capabilities, I see
very little value in doing per-segment timestamping.
On Fri, May 24, 2019 at 12:34 PM Fred Klassen <fklassen@appneta.com> wrote:
> I have a difficult time with the philosophy of TX timestamping the last
> segment. The actual timestamp occurs just before the last segment is
> sent. This is neither the start nor the end of a GSO packet, which to
> me seems somewhat arbitrary. [...]

It is the last moment that a timestamp can be generated for the last
byte, I don't see how that is "neither the start nor the end of a GSO
packet".

> Since the timestamp of a TX packet occurs just before any data is sent,
> I have found it most valuable to timestamp just before the first byte
> of the packet or burst. Conversely, I find it most valuable to get an
> RX timestamp after the last byte arrives.
>
> It would be interesting if a practical case can be made for
> timestamping the last segment. In my mind, I don't see how that would
> be valuable.

It depends whether you are interested in measuring network latency or
host transmit path latency.

For the latter, knowing the time from the start of the sendmsg call to
the moment the last byte hits the wire is most relevant. Or in absence
of (well defined) hardware support, the last byte being queued to the
device is the next best thing.

It would make sense for this software implementation to follow
established hardware behavior. But as far as I know, the exact time a
hardware timestamp is taken is not consistent across devices, either.

For fine grained timestamped data, perhaps GSO is simply not a good
mechanism. That said, it still has to queue a timestamp if requested.

> When using hardware timestamping, I think you will find that nearly all
> adapters only allow one timestamp at a time. Therefore only one packet
> in a burst would get timestamped.

Can you elaborate? When the host queues N packets all with hardware
timestamps requested, all N completions will have a timestamp? Or is
that not guaranteed?

> There are exceptions, for example I am playing with a 100G Mellanox
> adapter that has per-packet TX timestamping. However, I suspect that
> when I am done testing, all I will see is timestamps representing wire
> rate (e.g. 123 nsec per 1500 byte packet).
>
> Beyond testing the accuracy of a NIC's timestamping capabilities, I
> see very little value in doing per-segment timestamping.

Ack. Great detailed argument, thanks.
> On May 24, 2019, at 12:29 PM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
>
> It is the last moment that a timestamp can be generated for the last
> byte, I don't see how that is "neither the start nor the end of a GSO
> packet".

My misunderstanding. I thought TCP did last-segment timestamping, not
last-byte. In that case, your statements make sense.

>> It would be interesting if a practical case can be made for
>> timestamping the last segment. In my mind, I don't see how that would
>> be valuable.
>
> It depends whether you are interested in measuring network latency or
> host transmit path latency.
>
> For the latter, knowing the time from the start of the sendmsg call to
> the moment the last byte hits the wire is most relevant. Or in absence
> of (well defined) hardware support, the last byte being queued to the
> device is the next best thing.
>
> It would make sense for this software implementation to follow
> established hardware behavior. But as far as I know, the exact time a
> hardware timestamp is taken is not consistent across devices, either.
>
> For fine grained timestamped data, perhaps GSO is simply not a good
> mechanism. That said, it still has to queue a timestamp if requested.

I see your point. Makes sense to me.

>> When using hardware timestamping, I think you will find that nearly
>> all adapters only allow one timestamp at a time. Therefore only one
>> packet in a burst would get timestamped.
>
> Can you elaborate? When the host queues N packets all with hardware
> timestamps requested, all N completions will have a timestamp? Or is
> that not guaranteed?

It is not guaranteed. The best example is in ixgbe_main.c; search for
'SKBTX_HW_TSTAMP'. If there is a PTP TX timestamp in progress,
'__IXGBE_PTP_TX_IN_PROGRESS' is set and no other timestamps are
possible. The flag is cleared after the transmit softirq, and only then
can another TX timestamp be taken.

>> There are exceptions, for example I am playing with a 100G Mellanox
>> adapter that has per-packet TX timestamping. However, I suspect that
>> when I am done testing, all I will see is timestamps representing
>> wire rate (e.g. 123 nsec per 1500 byte packet).
>>
>> Beyond testing the accuracy of a NIC's timestamping capabilities, I
>> see very little value in doing per-segment timestamping.
>
> Ack. Great detailed argument, thanks.

Thanks. I'm a timestamping nerd and have learned lots with this
discussion.
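The ixgbe behavior described here looks roughly like the following (paraphrased and abridged from ixgbe_xmit_frame_ring() in drivers/net/ethernet/intel/ixgbe/ixgbe_main.c; not a verbatim copy):

```c
/* Only one hardware TX timestamp can be in flight at a time; a request
 * that arrives while another is pending is simply not timestamped.
 */
if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) &&
    adapter->ptp_clock) {
	if (!test_and_set_bit_lock(__IXGBE_PTP_TX_IN_PROGRESS,
				   &adapter->state)) {
		skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
		tx_flags |= IXGBE_TX_FLAGS_TSTAMP;

		/* hold the skb until the timestamp is read back; the
		 * in-progress bit is released only on completion */
		adapter->ptp_tx_skb = skb_get(skb);
		adapter->ptp_tx_start = jiffies;
		schedule_work(&adapter->ptp_tx_work);
	} else {
		adapter->tx_hwtstamp_skipped++;
	}
}
```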
On Fri, May 24, 2019 at 6:01 PM Fred Klassen <fklassen@appneta.com> wrote:
> [...]
> > It depends whether you are interested in measuring network latency or
> > host transmit path latency.
> >
> > For the latter, knowing the time from the start of the sendmsg call
> > to the moment the last byte hits the wire is most relevant. Or in
> > absence of (well defined) hardware support, the last byte being
> > queued to the device is the next best thing.

Sounds to me like both cases have a legitimate use case, and we want to
support both.

Implementation constraints are that storage for this timestamp
information is scarce and we cannot add new cold cacheline accesses in
the datapath.

The simplest approach would be to unconditionally timestamp both the
first and last segment. With the same ID. Not terribly elegant. But it
works.

If conditional, tx_flags has only one bit left. I think we can harvest
some, as not all defined bits are in use at the same stages in the
datapath, but that is not a trivial change. Some might also better be
set in the skb, instead of skb_shinfo. Which would also avoid touching
that cacheline. We could possibly repurpose bits from u32 tskey.

All that can come later. Initially, unless we can come up with
something more elegant, I would suggest that UDP follows the rule
established by TCP and timestamps the last byte. And we add an explicit
SOF_TIMESTAMPING_OPT_FIRSTBYTE that is initially only supported for
UDP, sets a new SKBTX_TX_FB_TSTAMP bit in __sock_tx_timestamp and is
interpreted in __udp_gso_segment.

> It is not guaranteed. The best example is in ixgbe_main.c; search for
> 'SKBTX_HW_TSTAMP'. If there is a PTP TX timestamp in progress,
> '__IXGBE_PTP_TX_IN_PROGRESS' is set and no other timestamps are
> possible. The flag is cleared after the transmit softirq, and only
> then can another TX timestamp be taken.

Interesting, thanks. I had no idea.
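To make the proposal concrete, here is a minimal sketch of what the transfer at segmentation time could look like. This is hypothetical code: the helper name, the SKBTX_TX_FB_TSTAMP bit, and the first/last selection are all assumptions drawn from this discussion, not anything merged:

```c
/* Hypothetical sketch: move the timestamp request from the original
 * gso_skb to the first or last segment. SKBTX_TX_FB_TSTAMP is the
 * proposed (not existing) "first byte" flag; SKBTX_ANY_TSTAMP is the
 * existing mask of hardware/software/sched timestamp flags.
 */
static void __udp_gso_tstamp(struct sk_buff *gso_skb, struct sk_buff *segs)
{
	u8 flags = skb_shinfo(gso_skb)->tx_flags;
	struct sk_buff *seg = segs;

	if (!(flags & SKBTX_ANY_TSTAMP))
		return;

	if (!(flags & SKBTX_TX_FB_TSTAMP)) {
		/* default: last segment, following TCP's convention */
		while (seg->next)
			seg = seg->next;
	}

	skb_shinfo(seg)->tx_flags |= flags & SKBTX_ANY_TSTAMP;
	skb_shinfo(seg)->tskey = skb_shinfo(gso_skb)->tskey;
}
```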
> On May 25, 2019, at 8:20 AM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> [...]
> All that can come later. Initially, unless we can come up with
> something more elegant, I would suggest that UDP follows the rule
> established by TCP and timestamps the last byte. And we add an
> explicit SOF_TIMESTAMPING_OPT_FIRSTBYTE that is initially only
> supported for UDP, sets a new SKBTX_TX_FB_TSTAMP bit in
> __sock_tx_timestamp and is interpreted in __udp_gso_segment.

I don't see how to practically TX timestamp the last byte of any packet
(UDP GSO or otherwise). The best we could do is timestamp the last
segment, or rather the time that the last segment is queued. Let me
attempt to explain.

First let's look at software TX timestamps, which are generated by
skb_tx_timestamp() in nearly every network driver's xmit routine. Its
documentation states:

—————————— cut ————————————
 * Ethernet MAC Drivers should call this function in their hard_xmit()
 * function immediately before giving the sk_buff to the MAC hardware.
—————————— cut ————————————

That means that the sk_buff will get timestamped just before rather
than just after it is sent. To truly capture the timestamp of the last
byte, this routine would have to be called a second time, right after
sending to MAC hardware. Then the user program would have to sort out
the two timestamps. My guess is that this isn't something that NIC
vendors would be willing to implement in their drivers.

So, the best we can do is timestamp just before the last segment.
Suppose UDP GSO sends 3000 bytes to a 1500 byte MTU adapter. If we set
the SKBTX_HW_TSTAMP flag on the last segment, the timestamp occurs half
way through the burst. But it may not be exactly half way because the
segments may get queued much faster than wire rate. Therefore the time
between segment 1 and segment 2 may be much smaller than their spacing
on the wire. I would not find this useful.

I propose that we stick with the method used for IP fragments, which is
timestamping just before the first byte is sent. Put another way, I
propose that we start the clock in an automobile race just before the
front of the first car crosses the start line rather than when the
front of the last car crosses the start line.

TX timestamping in hardware has even more limitations. For the most
part, we can only do one timestamp per packet or burst. If we requested
a timestamp of only the last segment of a packet, we would have to work
backwards to calculate the start time of the packet, but that would
only be a best guess. For extremely time sensitive applications (such
as the one we develop), this would not be practical.

We could still consider setting a flag that would allow timestamping
the last segment rather than the first. However since we cannot truly
measure the timestamp of the last byte, I would question the value in
doing so.
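For context, the call site being described sits at the tail of a driver's ndo_start_xmit handler. A minimal sketch, with placeholder foo_* names standing in for a real driver (skb_tx_timestamp() itself is the real kernel helper):

```c
/* Sketch of a typical transmit handler. skb_tx_timestamp() records a
 * software TX timestamp, if one was requested, immediately before the
 * skb is handed to the hardware -- so it reflects queueing to the NIC,
 * not the time the frame leaves the wire.
 */
static netdev_tx_t foo_start_xmit(struct sk_buff *skb,
				  struct net_device *dev)
{
	struct foo_priv *priv = netdev_priv(dev);

	/* ... map buffers and fill the TX descriptor ... */

	skb_tx_timestamp(skb);

	foo_ring_doorbell(priv);
	return NETDEV_TX_OK;
}
```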
> On May 23, 2019, at 2:59 PM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
>
> what exactly is the issue with IP_TOS?
>
> If I understand correctly, the issue here is that the new 'P' option
> that polls on the error queue times out. This is unrelated to
> specifying TOS bits? Without zerocopy or timestamps, no message is
> expected on the error queue.

I was not able to get to the root cause, but I noticed that the IP_TOS
CMSG was lost until I applied this fix. I also found it confusing as to
why that may be the case.
> On May 23, 2019, at 2:39 PM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
>
> Zerocopy notification reference count is managed in skb_segment. That
> should work.

I'm trying to understand the context of reference counting in
skb_segment. I assume that there is an opportunity to optimize the
count of outstanding zerocopy buffers, but I can't see it. Please
clarify.
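For reference, my reading of the mechanism being pointed to (worth double-checking against the source): skb_segment() calls skb_zerocopy_clone() for each segment it produces, so every segment shares the parent's ubuf_info and holds a reference on it, and the zerocopy completion notification fires only once all segments have been freed. A heavily abridged paraphrase, with error handling and the segment-already-zerocopy case omitted:

```c
/* Paraphrased from net/core/skbuff.c (not verbatim): a new segment
 * inherits the parent's ubuf_info and takes a reference on it.
 */
static int skb_zerocopy_clone(struct sk_buff *nskb, struct sk_buff *orig,
			      gfp_t gfp_mask)
{
	if (skb_zcopy(orig)) {
		/* share the parent's ubuf_info and take a reference */
		skb_shinfo(nskb)->destructor_arg = skb_uarg(orig);
		sock_zerocopy_get(skb_uarg(orig));
		skb_shinfo(nskb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
	}
	return 0;
}
```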
> On May 23, 2019, at 2:59 PM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
>
> what exactly is the issue with IP_TOS?
>
> If I understand correctly, the issue here is that the new 'P' option
> that polls on the error queue times out. This is unrelated to
> specifying TOS bits? Without zerocopy or timestamps, no message is
> expected on the error queue.

Please disregard the last message. I think I was chasing a non-issue
with the TOS bits. I will remove all references to TOS.
On Sat, May 25, 2019 at 1:47 PM Fred Klassen <fklassen@appneta.com> wrote:
> [...]
> So, the best we can do is timestamp just before the last segment.
> Suppose UDP GSO sends 3000 bytes to a 1500 byte MTU adapter. If we set
> the SKBTX_HW_TSTAMP flag on the last segment, the timestamp occurs
> half way through the burst. But it may not be exactly half way because
> the segments may get queued much faster than wire rate. Therefore the
> time between segment 1 and segment 2 may be much smaller than their
> spacing on the wire. I would not find this useful.

For measuring host queueing latency, a timestamp at the existing
skb_tx_timestamp() for the last segment is perfectly informative.

> I propose that we stick with the method used for IP fragments, which
> is timestamping just before the first byte is sent.

I understand that this addresses your workload. It simply ignores the
other use case identified earlier in this thread.

> TX timestamping in hardware has even more limitations. For the most
> part, we can only do one timestamp per packet or burst. If we
> requested a timestamp of only the last segment of a packet, we would
> have to work backwards to calculate the start time of the packet, but
> that would only be a best guess. For extremely time sensitive
> applications (such as the one we develop), this would not be
> practical.

Note that for any particularly sensitive measurements, a segment can
always be sent separately.
On Sun, May 26, 2019 at 8:30 PM Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> On Sat, May 25, 2019 at 1:47 PM Fred Klassen <fklassen@appneta.com> wrote:
> > [...]
> > Suppose UDP GSO sends 3000 bytes to a 1500 byte MTU adapter. If we
> > set the SKBTX_HW_TSTAMP flag on the last segment, the timestamp
> > occurs half way through the burst. [...] I would not find this
> > useful.
>
> For measuring host queueing latency, a timestamp at the existing
> skb_tx_timestamp() for the last segment is perfectly informative.

In most cases all segments will be sent in a single xmit_more train. In
which case the device doorbell is rung when the last segment is queued.

A device may also pause in the middle of a train, causing the rest of
the list to be requeued and resent after a tx completion frees up
descriptors and wakes the device. This seems like a relevant exception
to be able to measure.

That said, I am not opposed to the first segment, if we have to make a
binary choice for a default. Either option has cons.

See more specific revision requests in the v2 patch.
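The xmit_more train referred to here is the mechanism by which the stack tells a driver that more packets are already queued behind the current one, letting the driver defer ringing the doorbell until the end of the batch. A rough sketch with placeholder foo_* names (contemporary drivers read the hint via netdev_xmit_more(); older ones read skb->xmit_more):

```c
/* Sketch: post descriptors but ring the doorbell only at the end of
 * an xmit_more train, or when the queue is being stopped.
 */
static netdev_tx_t foo_start_xmit(struct sk_buff *skb,
				  struct net_device *dev)
{
	struct netdev_queue *txq = netdev_get_tx_queue(dev, skb->queue_mapping);
	struct foo_priv *priv = netdev_priv(dev);

	foo_post_descriptor(priv, skb);
	skb_tx_timestamp(skb);

	if (!netdev_xmit_more() || netif_xmit_stopped(txq))
		foo_ring_doorbell(priv);

	return NETDEV_TX_OK;
}
```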
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 065334b41d57..33de347695ae 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -228,6 +228,10 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 	seg = segs;
 	uh = udp_hdr(seg);
 
+	/* preserve TX timestamp and zero-copy info for first segment */
+	skb_shinfo(seg)->tskey = skb_shinfo(gso_skb)->tskey;
+	skb_shinfo(seg)->tx_flags = skb_shinfo(gso_skb)->tx_flags;
+
 	/* compute checksum adjustment based on old length versus new */
 	newlen = htons(sizeof(*uh) + mss);
 	check = csum16_add(csum16_sub(uh->check, uh->len), newlen);
Fixes an issue where TX Timestamps are not arriving on the error queue
when UDP_SEGMENT CMSG type is combined with CMSG type SO_TIMESTAMPING.
This can be illustrated with an updated udpgso_bench_tx program which
includes the '-T' option to test for this condition.

./udpgso_bench_tx -4ucTPv -S 1472 -l2 -D 172.16.120.18
poll timeout
udp tx: 0 MB/s 1 calls/s 1 msg/s

The "poll timeout" message above indicates that the TX timestamp never
arrived.

It also appears that other TX CMSG types cause similar issues, for
example trying to set SOL_IP/IP_TOS.

./udpgso_bench_tx -4ucPv -S 1472 -q 182 -l2 -D 172.16.120.18
poll timeout
udp tx: 0 MB/s 1 calls/s 1 msg/s

This patch preserves tx_flags for the first UDP GSO segment. This
mirrors the stack's behaviour for IPv4 fragments.

Fixes: ee80d1ebe5ba ("udp: add udp gso")
Signed-off-by: Fred Klassen <fklassen@appneta.com>
---
 net/ipv4/udp_offload.c | 4 ++++
 1 file changed, 4 insertions(+)
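For readers unfamiliar with the failing combination, a minimal userspace sketch of what udpgso_bench_tx does in this mode (abridged and simplified; this is not the actual selftest source): request software TX timestamps on the socket, then send one large buffer with a UDP_SEGMENT cmsg.

```c
/* Sketch (not the selftest source): one sendmsg with UDP_SEGMENT while
 * software TX timestamping is enabled; the completion is then read
 * from the error queue.
 */
#include <linux/net_tstamp.h>
#include <netinet/in.h>
#include <netinet/udp.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103
#endif

static void send_gso_with_tstamp(int fd, char *buf, size_t len,
				 struct sockaddr_in *dst, uint16_t gso_size)
{
	char control[CMSG_SPACE(sizeof(uint16_t))] = {0};
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg = {0};
	struct cmsghdr *cm;
	int flags;

	flags = SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_TX_SOFTWARE |
		SOF_TIMESTAMPING_OPT_ID;
	setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));

	msg.msg_name = dst;
	msg.msg_namelen = sizeof(*dst);
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = control;
	msg.msg_controllen = sizeof(control);

	/* ask the kernel to segment this buffer into gso_size datagrams */
	cm = CMSG_FIRSTHDR(&msg);
	cm->cmsg_level = SOL_UDP;
	cm->cmsg_type = UDP_SEGMENT;
	cm->cmsg_len = CMSG_LEN(sizeof(uint16_t));
	memcpy(CMSG_DATA(cm), &gso_size, sizeof(gso_size));

	sendmsg(fd, &msg, 0);
	/* ... then poll() for POLLERR and recvmsg(fd, ..., MSG_ERRQUEUE)
	 * to collect the timestamp -- the step that times out before
	 * this patch. */
}
```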