Message ID | 20190523210651.80902-2-fklassen@appneta.com (mailing list archive)
---|---
State | New
Series | Allow TX timestamp with UDP GSO
On Thu, May 23, 2019 at 5:09 PM Fred Klassen <fklassen@appneta.com> wrote:
>
> Fixes an issue where TX Timestamps are not arriving on the error queue
> when UDP_SEGMENT CMSG type is combined with CMSG type SO_TIMESTAMPING.
> This can be illustrated with an updated udpgso_bench_tx program which
> includes the '-T' option to test for this condition.
>
> ./udpgso_bench_tx -4ucTPv -S 1472 -l2 -D 172.16.120.18
> poll timeout
> udp tx: 0 MB/s 1 calls/s 1 msg/s
>
> The "poll timeout" message above indicates that the TX timestamp never
> arrived.
>
> It also appears that other TX CMSG types cause similar issues, for
> example trying to set SOL_IP/IP_TOS.
>
> ./udpgso_bench_tx -4ucPv -S 1472 -q 182 -l2 -D 172.16.120.18
> poll timeout
> udp tx: 0 MB/s 1 calls/s 1 msg/s
>
> This patch preserves tx_flags for the first UDP GSO segment. This
> mirrors the stack's behaviour for IPv4 fragments.
>
> Fixes: ee80d1ebe5ba ("udp: add udp gso")
> Signed-off-by: Fred Klassen <fklassen@appneta.com>
> ---
>  net/ipv4/udp_offload.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
> index 065334b41d57..33de347695ae 100644
> --- a/net/ipv4/udp_offload.c
> +++ b/net/ipv4/udp_offload.c
> @@ -228,6 +228,10 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
>  	seg = segs;
>  	uh = udp_hdr(seg);
>
> +	/* preserve TX timestamp and zero-copy info for first segment */
> +	skb_shinfo(seg)->tskey = skb_shinfo(gso_skb)->tskey;
> +	skb_shinfo(seg)->tx_flags = skb_shinfo(gso_skb)->tx_flags;
> +

Thanks for the report.

Zerocopy notification reference count is managed in skb_segment. That
should work.

Support for timestamping with the new GSO feature is indeed an
oversight. The solution is similar to how TCP associates the timestamp
with the right segment in tcp_gso_tstamp.

Only, I think we want to transfer the timestamp request to the last
datagram, not the first. For send timestamp, the final byte leaving
the host is usually more interesting.
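For reference, the TCP helper mentioned here walks the segment list and attaches the timestamp request to the segment containing the tracked byte. Lightly abridged from net/ipv4/tcp_offload.c in kernels of this era:

```c
/* Attach the timestamp request to the segment that contains
 * byte ts_seq of the original GSO skb.
 */
static void tcp_gso_tstamp(struct sk_buff *skb, unsigned int ts_seq,
			   unsigned int seq, unsigned int mss)
{
	while (skb) {
		if (before(ts_seq, seq + mss)) {
			skb_shinfo(skb)->tx_flags |= SKBTX_SW_TSTAMP;
			skb_shinfo(skb)->tskey = ts_seq;
			return;
		}

		skb = skb->next;
		seq += mss;
	}
}
```

Because TCP sets tskey to the sequence number of the last byte of the sendmsg payload, the request lands on the last segment, which is the behavior debated below.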
On Thu, May 23, 2019 at 5:09 PM Fred Klassen <fklassen@appneta.com> wrote:
> [...]
> It also appears that other TX CMSG types cause similar issues, for
> example trying to set SOL_IP/IP_TOS.
>
> ./udpgso_bench_tx -4ucPv -S 1472 -q 182 -l2 -D 172.16.120.18
> poll timeout
> udp tx: 0 MB/s 1 calls/s 1 msg/s

What exactly is the issue with IP_TOS?

If I understand correctly, the issue here is that the new 'P' option
that polls on the error queue times out. This is unrelated to
specifying TOS bits? Without zerocopy or timestamps, no message is
expected on the error queue.
> Thanks for the report.
>
> Zerocopy notification reference count is managed in skb_segment. That
> should work.
>
> Support for timestamping with the new GSO feature is indeed an
> oversight. The solution is similar to how TCP associates the timestamp
> with the right segment in tcp_gso_tstamp.
>
> Only, I think we want to transfer the timestamp request to the last
> datagram, not the first. For send timestamp, the final byte leaving
> the host is usually more interesting.

TX timestamping the last packet of a datagram is something that would
work poorly for our application. We need to measure the time from when
the first bit is sent until the first bit of the last packet is
received. Timestamping the last packet of a burst seems somewhat random
to me and would not be useful. Essentially we would be timestamping a
random byte in a UDP GSO buffer.

I believe there is a precedent for timestamping the first packet. With
IPv4 packets, the first packet is timestamped and the remaining
fragments are not.
On Thu, May 23, 2019 at 9:38 PM Fred Klassen <fklassen@appneta.com> wrote:
> [...]
> TX timestamping the last packet of a datagram is something that would
> work poorly for our application. We need to measure the time from when
> the first bit is sent until the first bit of the last packet is
> received.
> [...]
> I believe there is a precedent for timestamping the first packet. With
> IPv4 packets, the first packet is timestamped and the remaining
> fragments are not.

Interesting. TCP timestamping takes the opposite choice and does
timestamp the last byte in the sendmsg request.

It sounds like it depends on the workload. Perhaps this then needs to
be configurable with an SOF_.. flag.

Another option would be to return a timestamp for every segment. But
they would all return the same tskey. And it causes different behavior
with and without hardware offload.
> Interesting. TCP timestamping takes the opposite choice and does
> timestamp the last byte in the sendmsg request.

I have a difficult time with the philosophy of TX timestamping the last
segment. The actual timestamp occurs just before the last segment is
sent. This is neither the start nor the end of a GSO packet, which to
me seems somewhat arbitrary. It is even more arbitrary when using
software TX timestamping. These timestamps represent the time that the
packet is queued onto the NIC's buffer, not the actual time leaving the
wire. Queuing to a ring buffer is usually much faster than wire rate.
Therefore the timestamp of the last 1500 byte segment of a 64K GSO
packet may in reality represent a time about half way through the
burst.

Since the timestamp of a TX packet occurs just before any data is sent,
I have found it most valuable to timestamp just before the first byte
of the packet or burst. Conversely, I find it most valuable to get an
RX timestamp after the last byte arrives.

> It sounds like it depends on the workload. Perhaps this then needs to
> be configurable with an SOF_.. flag.

It would be interesting if a practical case can be made for
timestamping the last segment. In my mind, I don't see how that would
be valuable.

> Another option would be to return a timestamp for every segment. But
> they would all return the same tskey. And it causes different behavior
> with and without hardware offload.

When it comes to RX packets, getting per-packet (or per-segment)
timestamps is invaluable. They represent actual wire times. However my
previous research into TX timestamping has led me to conclude that
there is no practical value in timestamping every packet of a
back-to-back burst.

When using software TX timestamping, the inter-packet timestamps are
typically much faster than line rate. Even though you may be sending on
a GigE link, you may measure 20 Gbps. At higher rates, I have found
that the overhead of per-packet software timestamping can produce gaps
in packets.

When using hardware timestamping, I think you will find that nearly all
adapters only allow one timestamp at a time. Therefore only one packet
in a burst would get timestamped. There are exceptions, for example I
am playing with a 100G Mellanox adapter that has per-packet TX
timestamping. However, I suspect that when I am done testing, all I
will see is timestamps representing wire rate (e.g. 123 nsec per 1500
byte packet).

Beyond testing the accuracy of a NIC's timestamping capabilities, I see
very little value in doing per-segment timestamping.
On Fri, May 24, 2019 at 12:34 PM Fred Klassen <fklassen@appneta.com> wrote:
> I have a difficult time with the philosophy of TX timestamping the last
> segment. The actual timestamp occurs just before the last segment is
> sent. This is neither the start nor the end of a GSO packet, which to
> me seems somewhat arbitrary. [...]

It is the last moment that a timestamp can be generated for the last
byte, I don't see how that is "neither the start nor the end of a GSO
packet".

> Since the timestamp of a TX packet occurs just before any data is sent,
> I have found it most valuable to timestamp just before the first byte
> of the packet or burst. Conversely, I find it most valuable to get an
> RX timestamp after the last byte arrives.
>
> It would be interesting if a practical case can be made for
> timestamping the last segment. In my mind, I don't see how that would
> be valuable.

It depends whether you are interested in measuring network latency or
host transmit path latency.

For the latter, knowing the time from the start of the sendmsg call to
the moment the last byte hits the wire is most relevant. Or in absence
of (well defined) hardware support, the last byte being queued to the
device is the next best thing.

It would make sense for this software implementation to follow
established hardware behavior. But as far as I know, the exact time a
hardware timestamp is taken is not consistent across devices, either.

For fine grained timestamped data, perhaps GSO is simply not a good
mechanism. That said, it still has to queue a timestamp if requested.

> When using hardware timestamping, I think you will find that nearly all
> adapters only allow one timestamp at a time. Therefore only one packet
> in a burst would get timestamped.

Can you elaborate? When the host queues N packets all with hardware
timestamps requested, all N completions will have a timestamp? Or is
that not guaranteed?

> There are exceptions, for example I am playing with a 100G Mellanox
> adapter that has per-packet TX timestamping. However, I suspect that
> when I am done testing, all I will see is timestamps representing wire
> rate (e.g. 123 nsec per 1500 byte packet).
>
> Beyond testing the accuracy of a NIC's timestamping capabilities, I
> see very little value in doing per-segment timestamping.

Ack. Great detailed argument, thanks.
> On May 24, 2019, at 12:29 PM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
>
> It is the last moment that a timestamp can be generated for the last
> byte, I don't see how that is "neither the start nor the end of a GSO
> packet".

My misunderstanding. I thought TCP did last-segment timestamping, not
last-byte. In that case, your statements make sense.

>> It would be interesting if a practical case can be made for
>> timestamping the last segment. In my mind, I don't see how that would
>> be valuable.
>
> It depends whether you are interested in measuring network latency or
> host transmit path latency.
>
> For the latter, knowing the time from the start of the sendmsg call to
> the moment the last byte hits the wire is most relevant. Or in absence
> of (well defined) hardware support, the last byte being queued to the
> device is the next best thing.
>
> It would make sense for this software implementation to follow
> established hardware behavior. But as far as I know, the exact time a
> hardware timestamp is taken is not consistent across devices, either.
>
> For fine grained timestamped data, perhaps GSO is simply not a good
> mechanism. That said, it still has to queue a timestamp if requested.

I see your point. Makes sense to me.

>> When using hardware timestamping, I think you will find that nearly
>> all adapters only allow one timestamp at a time. Therefore only one
>> packet in a burst would get timestamped.
>
> Can you elaborate? When the host queues N packets all with hardware
> timestamps requested, all N completions will have a timestamp? Or is
> that not guaranteed?

It is not guaranteed. The best example is in ixgbe_main.c; search for
'SKBTX_HW_TSTAMP'. If there is a PTP TX timestamp in progress,
'__IXGBE_PTP_TX_IN_PROGRESS' is set and no other timestamps are
possible. The flag is cleared after the transmit softirq, and only then
can another TX timestamp be taken.

>> There are exceptions, for example I am playing with a 100G Mellanox
>> adapter that has per-packet TX timestamping. However, I suspect that
>> when I am done testing, all I will see is timestamps representing
>> wire rate (e.g. 123 nsec per 1500 byte packet).
>>
>> Beyond testing the accuracy of a NIC's timestamping capabilities, I
>> see very little value in doing per-segment timestamping.
>
> Ack. Great detailed argument, thanks.

Thanks. I'm a timestamping nerd and have learned lots with this
discussion.
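The ixgbe behavior described here looks roughly like the following (paraphrased and abridged from ixgbe_xmit_frame_ring() in drivers/net/ethernet/intel/ixgbe/ixgbe_main.c; not a verbatim copy):

```c
/* Only one hardware TX timestamp can be in flight at a time; a request
 * that arrives while another is pending is simply not timestamped.
 */
if (unlikely(skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) &&
    adapter->ptp_clock) {
	if (!test_and_set_bit_lock(__IXGBE_PTP_TX_IN_PROGRESS,
				   &adapter->state)) {
		skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS;
		tx_flags |= IXGBE_TX_FLAGS_TSTAMP;

		/* hold the skb until the timestamp is read back; the
		 * in-progress bit is released only on completion */
		adapter->ptp_tx_skb = skb_get(skb);
		adapter->ptp_tx_start = jiffies;
		schedule_work(&adapter->ptp_tx_work);
	} else {
		adapter->tx_hwtstamp_skipped++;
	}
}
```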
On Fri, May 24, 2019 at 6:01 PM Fred Klassen <fklassen@appneta.com> wrote:
> [...]
> > It depends whether you are interested in measuring network latency or
> > host transmit path latency.
> >
> > For the latter, knowing the time from the start of the sendmsg call
> > to the moment the last byte hits the wire is most relevant. Or in
> > absence of (well defined) hardware support, the last byte being
> > queued to the device is the next best thing.

Sounds to me like both cases have a legitimate use case, and we want to
support both.

Implementation constraints are that storage for this timestamp
information is scarce and we cannot add new cold cacheline accesses in
the datapath.

The simplest approach would be to unconditionally timestamp both the
first and last segment. With the same ID. Not terribly elegant. But it
works.

If conditional, tx_flags has only one bit left. I think we can harvest
some, as not all defined bits are in use at the same stages in the
datapath, but that is not a trivial change. Some might also better be
set in the skb, instead of skb_shinfo. Which would also avoid touching
that cacheline. We could possibly repurpose bits from u32 tskey.

All that can come later. Initially, unless we can come up with
something more elegant, I would suggest that UDP follows the rule
established by TCP and timestamps the last byte. And we add an explicit
SOF_TIMESTAMPING_OPT_FIRSTBYTE that is initially only supported for
UDP, sets a new SKBTX_TX_FB_TSTAMP bit in __sock_tx_timestamp and is
interpreted in __udp_gso_segment.

> It is not guaranteed. The best example is in ixgbe_main.c; search for
> 'SKBTX_HW_TSTAMP'. If there is a PTP TX timestamp in progress,
> '__IXGBE_PTP_TX_IN_PROGRESS' is set and no other timestamps are
> possible. The flag is cleared after the transmit softirq, and only
> then can another TX timestamp be taken.

Interesting, thanks. I had no idea.
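To make the proposal concrete, here is a minimal sketch of what the transfer at segmentation time could look like. This is hypothetical code: the helper name, the SKBTX_TX_FB_TSTAMP bit, and the first/last selection are all assumptions drawn from this discussion, not anything merged:

```c
/* Hypothetical sketch: move the timestamp request from the original
 * gso_skb to the first or last segment. SKBTX_TX_FB_TSTAMP is the
 * proposed (not existing) "first byte" flag; SKBTX_ANY_TSTAMP is the
 * existing mask of hardware/software/sched timestamp flags.
 */
static void __udp_gso_tstamp(struct sk_buff *gso_skb, struct sk_buff *segs)
{
	u8 flags = skb_shinfo(gso_skb)->tx_flags;
	struct sk_buff *seg = segs;

	if (!(flags & SKBTX_ANY_TSTAMP))
		return;

	if (!(flags & SKBTX_TX_FB_TSTAMP)) {
		/* default: last segment, following TCP's convention */
		while (seg->next)
			seg = seg->next;
	}

	skb_shinfo(seg)->tx_flags |= flags & SKBTX_ANY_TSTAMP;
	skb_shinfo(seg)->tskey = skb_shinfo(gso_skb)->tskey;
}
```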
> On May 25, 2019, at 8:20 AM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> [...]
> All that can come later. Initially, unless we can come up with
> something more elegant, I would suggest that UDP follows the rule
> established by TCP and timestamps the last byte. And we add an
> explicit SOF_TIMESTAMPING_OPT_FIRSTBYTE that is initially only
> supported for UDP, sets a new SKBTX_TX_FB_TSTAMP bit in
> __sock_tx_timestamp and is interpreted in __udp_gso_segment.

I don't see how to practically TX timestamp the last byte of any packet
(UDP GSO or otherwise). The best we could do is timestamp the last
segment, or rather the time that the last segment is queued. Let me
attempt to explain.

First let's look at software TX timestamps, which are generated by
skb_tx_timestamp() in nearly every network driver's xmit routine. Its
documentation states:

—————————— cut ————————————
 * Ethernet MAC Drivers should call this function in their hard_xmit()
 * function immediately before giving the sk_buff to the MAC hardware.
—————————— cut ————————————

That means that the sk_buff will get timestamped just before rather
than just after it is sent. To truly capture the timestamp of the last
byte, this routine would have to be called a second time, right after
sending to MAC hardware. Then the user program would have to sort out
the two timestamps. My guess is that this isn't something that NIC
vendors would be willing to implement in their drivers.

So, the best we can do is timestamp just before the last segment.
Suppose UDP GSO sends 3000 bytes to a 1500 byte MTU adapter. If we set
the SKBTX_HW_TSTAMP flag on the last segment, the timestamp occurs half
way through the burst. But it may not be exactly half way because the
segments may get queued much faster than wire rate. Therefore the time
between segment 1 and segment 2 may be much smaller than their spacing
on the wire. I would not find this useful.

I propose that we stick with the method used for IP fragments, which is
timestamping just before the first byte is sent. Put another way, I
propose that we start the clock in an automobile race just before the
front of the first car crosses the start line rather than when the
front of the last car crosses the start line.

TX timestamping in hardware has even more limitations. For the most
part, we can only do one timestamp per packet or burst. If we requested
a timestamp of only the last segment of a packet, we would have to work
backwards to calculate the start time of the packet, but that would
only be a best guess. For extremely time sensitive applications (such
as the one we develop), this would not be practical.

We could still consider setting a flag that would allow timestamping
the last segment rather than the first. However since we cannot truly
measure the timestamp of the last byte, I would question the value in
doing so.
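For context, the call site being described sits at the tail of a driver's ndo_start_xmit handler. A minimal sketch, with placeholder foo_* names standing in for a real driver (skb_tx_timestamp() itself is the real kernel helper):

```c
/* Sketch of a typical transmit handler. skb_tx_timestamp() records a
 * software TX timestamp, if one was requested, immediately before the
 * skb is handed to the hardware -- so it reflects queueing to the NIC,
 * not the time the frame leaves the wire.
 */
static netdev_tx_t foo_start_xmit(struct sk_buff *skb,
				  struct net_device *dev)
{
	struct foo_priv *priv = netdev_priv(dev);

	/* ... map buffers and fill the TX descriptor ... */

	skb_tx_timestamp(skb);

	foo_ring_doorbell(priv);
	return NETDEV_TX_OK;
}
```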
> On May 23, 2019, at 2:59 PM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
>
> what exactly is the issue with IP_TOS?
>
> If I understand correctly, the issue here is that the new 'P' option
> that polls on the error queue times out. This is unrelated to
> specifying TOS bits? Without zerocopy or timestamps, no message is
> expected on the error queue.

I was not able to get to the root cause, but I noticed that the IP_TOS
CMSG was lost until I applied this fix. I also found it confusing as to
why that may be the case.
> On May 23, 2019, at 2:39 PM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
>
> Zerocopy notification reference count is managed in skb_segment. That
> should work.

I'm trying to understand the context of reference counting in
skb_segment. I assume that there is an opportunity to optimize the
count of outstanding zerocopy buffers, but I can't see it. Please
clarify.
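For reference, my reading of the mechanism being pointed to (worth double-checking against the source): skb_segment() calls skb_zerocopy_clone() for each segment it produces, so every segment shares the parent's ubuf_info and holds a reference on it, and the zerocopy completion notification fires only once all segments have been freed. A heavily abridged paraphrase, with error handling and the segment-already-zerocopy case omitted:

```c
/* Paraphrased from net/core/skbuff.c (not verbatim): a new segment
 * inherits the parent's ubuf_info and takes a reference on it.
 */
static int skb_zerocopy_clone(struct sk_buff *nskb, struct sk_buff *orig,
			      gfp_t gfp_mask)
{
	if (skb_zcopy(orig)) {
		/* share the parent's ubuf_info and take a reference */
		skb_shinfo(nskb)->destructor_arg = skb_uarg(orig);
		sock_zerocopy_get(skb_uarg(orig));
		skb_shinfo(nskb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
	}
	return 0;
}
```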
> On May 23, 2019, at 2:59 PM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
>
> what exactly is the issue with IP_TOS?
>
> If I understand correctly, the issue here is that the new 'P' option
> that polls on the error queue times out. This is unrelated to
> specifying TOS bits? Without zerocopy or timestamps, no message is
> expected on the error queue.

Please disregard the last message. I think I was chasing a non-issue
with the TOS bits. I will remove all references to TOS.
On Sat, May 25, 2019 at 1:47 PM Fred Klassen <fklassen@appneta.com> wrote:
> [...]
> So, the best we can do is timestamp just before the last segment.
> Suppose UDP GSO sends 3000 bytes to a 1500 byte MTU adapter. If we set
> the SKBTX_HW_TSTAMP flag on the last segment, the timestamp occurs
> half way through the burst. But it may not be exactly half way because
> the segments may get queued much faster than wire rate. Therefore the
> time between segment 1 and segment 2 may be much smaller than their
> spacing on the wire. I would not find this useful.

For measuring host queueing latency, a timestamp at the existing
skb_tx_timestamp() for the last segment is perfectly informative.

> I propose that we stick with the method used for IP fragments, which
> is timestamping just before the first byte is sent.

I understand that this addresses your workload. It simply ignores the
other use case identified earlier in this thread.

> TX timestamping in hardware has even more limitations. For the most
> part, we can only do one timestamp per packet or burst. If we
> requested a timestamp of only the last segment of a packet, we would
> have to work backwards to calculate the start time of the packet, but
> that would only be a best guess. For extremely time sensitive
> applications (such as the one we develop), this would not be
> practical.

Note that for any particularly sensitive measurements, a segment can
always be sent separately.
On Sun, May 26, 2019 at 8:30 PM Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> On Sat, May 25, 2019 at 1:47 PM Fred Klassen <fklassen@appneta.com> wrote:
> > [...]
> > Suppose UDP GSO sends 3000 bytes to a 1500 byte MTU adapter. If we
> > set the SKBTX_HW_TSTAMP flag on the last segment, the timestamp
> > occurs half way through the burst. [...] I would not find this
> > useful.
>
> For measuring host queueing latency, a timestamp at the existing
> skb_tx_timestamp() for the last segment is perfectly informative.

In most cases all segments will be sent in a single xmit_more train. In
which case the device doorbell is rung when the last segment is queued.

A device may also pause in the middle of a train, causing the rest of
the list to be requeued and resent after a tx completion frees up
descriptors and wakes the device. This seems like a relevant exception
to be able to measure.

That said, I am not opposed to the first segment, if we have to make a
binary choice for a default. Either option has cons.

See more specific revision requests in the v2 patch.
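The xmit_more train referred to here is the mechanism by which the stack tells a driver that more packets are already queued behind the current one, letting the driver defer ringing the doorbell until the end of the batch. A rough sketch with placeholder foo_* names (contemporary drivers read the hint via netdev_xmit_more(); older ones read skb->xmit_more):

```c
/* Sketch: post descriptors but ring the doorbell only at the end of
 * an xmit_more train, or when the queue is being stopped.
 */
static netdev_tx_t foo_start_xmit(struct sk_buff *skb,
				  struct net_device *dev)
{
	struct netdev_queue *txq = netdev_get_tx_queue(dev, skb->queue_mapping);
	struct foo_priv *priv = netdev_priv(dev);

	foo_post_descriptor(priv, skb);
	skb_tx_timestamp(skb);

	if (!netdev_xmit_more() || netif_xmit_stopped(txq))
		foo_ring_doorbell(priv);

	return NETDEV_TX_OK;
}
```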
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 065334b41d57..33de347695ae 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -228,6 +228,10 @@ struct sk_buff *__udp_gso_segment(struct sk_buff *gso_skb,
 	seg = segs;
 	uh = udp_hdr(seg);
 
+	/* preserve TX timestamp and zero-copy info for first segment */
+	skb_shinfo(seg)->tskey = skb_shinfo(gso_skb)->tskey;
+	skb_shinfo(seg)->tx_flags = skb_shinfo(gso_skb)->tx_flags;
+
 	/* compute checksum adjustment based on old length versus new */
 	newlen = htons(sizeof(*uh) + mss);
 	check = csum16_add(csum16_sub(uh->check, uh->len), newlen);
Fixes an issue where TX Timestamps are not arriving on the error queue
when UDP_SEGMENT CMSG type is combined with CMSG type SO_TIMESTAMPING.
This can be illustrated with an updated udpgso_bench_tx program which
includes the '-T' option to test for this condition.

./udpgso_bench_tx -4ucTPv -S 1472 -l2 -D 172.16.120.18
poll timeout
udp tx: 0 MB/s 1 calls/s 1 msg/s

The "poll timeout" message above indicates that the TX timestamp never
arrived.

It also appears that other TX CMSG types cause similar issues, for
example trying to set SOL_IP/IP_TOS.

./udpgso_bench_tx -4ucPv -S 1472 -q 182 -l2 -D 172.16.120.18
poll timeout
udp tx: 0 MB/s 1 calls/s 1 msg/s

This patch preserves tx_flags for the first UDP GSO segment. This
mirrors the stack's behaviour for IPv4 fragments.

Fixes: ee80d1ebe5ba ("udp: add udp gso")
Signed-off-by: Fred Klassen <fklassen@appneta.com>
---
 net/ipv4/udp_offload.c | 4 ++++
 1 file changed, 4 insertions(+)
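For readers unfamiliar with the failing combination, a minimal userspace sketch of what udpgso_bench_tx does in this mode (abridged and simplified; this is not the actual selftest source): request software TX timestamps on the socket, then send one large buffer with a UDP_SEGMENT cmsg.

```c
/* Sketch (not the selftest source): one sendmsg with UDP_SEGMENT while
 * software TX timestamping is enabled; the completion is then read
 * from the error queue.
 */
#include <linux/net_tstamp.h>
#include <netinet/in.h>
#include <netinet/udp.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103
#endif

static void send_gso_with_tstamp(int fd, char *buf, size_t len,
				 struct sockaddr_in *dst, uint16_t gso_size)
{
	char control[CMSG_SPACE(sizeof(uint16_t))] = {0};
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	struct msghdr msg = {0};
	struct cmsghdr *cm;
	int flags;

	flags = SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_TX_SOFTWARE |
		SOF_TIMESTAMPING_OPT_ID;
	setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &flags, sizeof(flags));

	msg.msg_name = dst;
	msg.msg_namelen = sizeof(*dst);
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = control;
	msg.msg_controllen = sizeof(control);

	/* ask the kernel to segment this buffer into gso_size datagrams */
	cm = CMSG_FIRSTHDR(&msg);
	cm->cmsg_level = SOL_UDP;
	cm->cmsg_type = UDP_SEGMENT;
	cm->cmsg_len = CMSG_LEN(sizeof(uint16_t));
	memcpy(CMSG_DATA(cm), &gso_size, sizeof(gso_size));

	sendmsg(fd, &msg, 0);
	/* ... then poll() for POLLERR and recvmsg(fd, ..., MSG_ERRQUEUE)
	 * to collect the timestamp -- the step that times out before
	 * this patch. */
}
```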