diff mbox series

[RFC,bpf-next,v1,3/3] net: Add additional bit to support userspace timestamp type

Message ID 20240409210547.3815806-4-quic_abchauha@quicinc.com (mailing list archive)
State RFC
Delegated to: BPF
Headers show
Series Rename mono_delivery_time to | expand

Checks

Context Check Description
netdev/tree_selection success Clearly marked for bpf-next
netdev/apply fail Patch does not apply to bpf-next-0
bpf/vmtest-bpf-next-PR success PR summary
bpf/vmtest-bpf-next-VM_Test-3 success Logs for Validate matrix.py
bpf/vmtest-bpf-next-VM_Test-2 success Logs for Unittests
bpf/vmtest-bpf-next-VM_Test-0 success Logs for Lint
bpf/vmtest-bpf-next-VM_Test-4 success Logs for aarch64-gcc / build / build for aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-5 success Logs for aarch64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-1 success Logs for ShellCheck
bpf/vmtest-bpf-next-VM_Test-6 success Logs for aarch64-gcc / test (test_maps, false, 360) / test_maps on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-9 success Logs for aarch64-gcc / test (test_verifier, false, 360) / test_verifier on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-7 success Logs for aarch64-gcc / test (test_progs, false, 360) / test_progs on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-10 success Logs for aarch64-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-8 success Logs for aarch64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on aarch64 with gcc
bpf/vmtest-bpf-next-VM_Test-11 success Logs for s390x-gcc / build / build for s390x with gcc
bpf/vmtest-bpf-next-VM_Test-12 success Logs for s390x-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-14 success Logs for s390x-gcc / test (test_progs, false, 360) / test_progs on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-13 success Logs for s390x-gcc / test (test_maps, false, 360) / test_maps on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-18 success Logs for set-matrix
bpf/vmtest-bpf-next-VM_Test-17 success Logs for s390x-gcc / veristat
bpf/vmtest-bpf-next-VM_Test-20 success Logs for x86_64-gcc / build-release
bpf/vmtest-bpf-next-VM_Test-21 success Logs for x86_64-gcc / test (test_maps, false, 360) / test_maps on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-15 success Logs for s390x-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-16 success Logs for s390x-gcc / test (test_verifier, false, 360) / test_verifier on s390x with gcc
bpf/vmtest-bpf-next-VM_Test-19 success Logs for x86_64-gcc / build / build for x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-26 success Logs for x86_64-gcc / test (test_verifier, false, 360) / test_verifier on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-24 success Logs for x86_64-gcc / test (test_progs_no_alu32_parallel, true, 30) / test_progs_no_alu32_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-22 success Logs for x86_64-gcc / test (test_progs, false, 360) / test_progs on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-27 success Logs for x86_64-gcc / veristat / veristat on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-25 success Logs for x86_64-gcc / test (test_progs_parallel, true, 30) / test_progs_parallel on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-23 success Logs for x86_64-gcc / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with gcc
bpf/vmtest-bpf-next-VM_Test-28 success Logs for x86_64-llvm-17 / build / build for x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-29 success Logs for x86_64-llvm-17 / build-release / build for x86_64 with llvm-17 and -O2 optimization
bpf/vmtest-bpf-next-VM_Test-30 success Logs for x86_64-llvm-17 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-31 success Logs for x86_64-llvm-17 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-32 success Logs for x86_64-llvm-17 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-35 success Logs for x86_64-llvm-18 / build / build for x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-33 success Logs for x86_64-llvm-17 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-17
bpf/vmtest-bpf-next-VM_Test-34 success Logs for x86_64-llvm-17 / veristat
bpf/vmtest-bpf-next-VM_Test-36 success Logs for x86_64-llvm-18 / build-release / build for x86_64 with llvm-18 and -O2 optimization
bpf/vmtest-bpf-next-VM_Test-38 success Logs for x86_64-llvm-18 / test (test_progs, false, 360) / test_progs on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-37 success Logs for x86_64-llvm-18 / test (test_maps, false, 360) / test_maps on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-39 success Logs for x86_64-llvm-18 / test (test_progs_cpuv4, false, 360) / test_progs_cpuv4 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-41 success Logs for x86_64-llvm-18 / test (test_verifier, false, 360) / test_verifier on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-40 success Logs for x86_64-llvm-18 / test (test_progs_no_alu32, false, 360) / test_progs_no_alu32 on x86_64 with llvm-18
bpf/vmtest-bpf-next-VM_Test-42 success Logs for x86_64-llvm-18 / veristat

Commit Message

Abhishek Chauhan (ABC) April 9, 2024, 9:05 p.m. UTC
tstamp_type can be real, mono or userspace timestamp.

This commit adds userspace timestamp and sets it if there is
valid transmit_time available in socket coming from userspace.

To make the design scalable for future needs this commit bring in
the change to extend the tstamp_type:1 to tstamp_type:2 to support
userspace timestamp.

Link: https://lore.kernel.org/netdev/bc037db4-58bb-4861-ac31-a361a93841d3@linux.dev/
Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com>
---
 include/linux/skbuff.h | 19 +++++++++++++++++--
 net/ipv4/ip_output.c   |  2 +-
 net/ipv4/raw.c         |  2 +-
 net/ipv6/ip6_output.c  |  2 +-
 net/ipv6/raw.c         |  2 +-
 net/packet/af_packet.c |  6 +++---
 6 files changed, 24 insertions(+), 9 deletions(-)

Comments

Willem de Bruijn April 10, 2024, 3:42 p.m. UTC | #1
Abhishek Chauhan wrote:
> tstamp_type can be real, mono or userspace timestamp.
> 
> This commit adds userspace timestamp and sets it if there is
> valid transmit_time available in socket coming from userspace.
> 
> To make the design scalable for future needs this commit bring in
> the change to extend the tstamp_type:1 to tstamp_type:2 to support
> userspace timestamp.
> 
> Link: https://lore.kernel.org/netdev/bc037db4-58bb-4861-ac31-a361a93841d3@linux.dev/
> Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com>
> ---
>  include/linux/skbuff.h | 19 +++++++++++++++++--
>  net/ipv4/ip_output.c   |  2 +-
>  net/ipv4/raw.c         |  2 +-
>  net/ipv6/ip6_output.c  |  2 +-
>  net/ipv6/raw.c         |  2 +-
>  net/packet/af_packet.c |  6 +++---
>  6 files changed, 24 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 6160185f0fe0..2f91a8a2157a 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -705,6 +705,9 @@ typedef unsigned char *sk_buff_data_t;
>  enum skb_tstamp_type {
>  	SKB_TSTAMP_TYPE_RX_REAL = 0,    /* A RX (receive) time in real */
>  	SKB_TSTAMP_TYPE_TX_MONO = 1,    /* A TX (delivery) time in mono */
> +	SKB_TSTAMP_TYPE_TX_USER = 2,    /* A TX (delivery) time and its clock
> +									 * is in skb->sk->sk_clockid.
> +									 */

Weird indentation?

More fundamentally: instead of defining a type TX_USER, can we use a
real clockid (e.g., CLOCK_TAI) based on skb->sk->sk_clockid? Rather
than store an id that means "go look at sk_clockid".

>  };
>  
>  /**
> @@ -830,6 +833,9 @@ enum skb_tstamp_type {
>   *		delivery_time in mono clock base (i.e. EDT).  Otherwise, the
>   *		skb->tstamp has the (rcv) timestamp at ingress and
>   *		delivery_time at egress.
> + *		delivery_time in mono clock base (i.e., EDT) or a clock base chosen
> + *		by SO_TXTIME. If zero, skb->tstamp has the (rcv) timestamp at
> + *		ingress.
>   *	@napi_id: id of the NAPI struct this skb came from
>   *	@sender_cpu: (aka @napi_id) source CPU in XPS
>   *	@alloc_cpu: CPU which did the skb allocation.
> @@ -960,7 +966,7 @@ struct sk_buff {
>  	/* private: */
>  	__u8			__mono_tc_offset[0];
>  	/* public: */
> -	__u8			tstamp_type:1;	/* See SKB_MONO_DELIVERY_TIME_MASK */
> +	__u8			tstamp_type:2;	/* See SKB_MONO_DELIVERY_TIME_MASK */
>  #ifdef CONFIG_NET_XGRESS
>  	__u8			tc_at_ingress:1;	/* See TC_AT_INGRESS_MASK */
>  	__u8			tc_skip_classify:1;

With pahole, does this have an effect on sk_buff layout?

> @@ -4274,7 +4280,16 @@ static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt,
>  					enum skb_tstamp_type tstamp_type)
>  {
>  	skb->tstamp = kt;
> -	skb->tstamp_type = kt && tstamp_type;
> +
> +	if (skb->tstamp_type)
> +		return;
> +

Why bail if a type is already set? And what if
skb->tstamp_type != tstamp_type? Should skb->tstamp then not be
written to (i.e., the test moved up), and perhaps a rate limited
warning.

> +	if (kt && tstamp_type == SKB_TSTAMP_TYPE_TX_MONO)
> +		skb->tstamp_type = SKB_TSTAMP_TYPE_TX_MONO;
> +
> +	if (kt && tstamp_type == SKB_TSTAMP_TYPE_TX_USER)
> +		skb->tstamp_type = SKB_TSTAMP_TYPE_TX_USER;

Simpler

    if (kt)
        skb->tstamp_type = tstamp_type;
Abhishek Chauhan (ABC) April 10, 2024, 8:25 p.m. UTC | #2
On 4/10/2024 8:42 AM, Willem de Bruijn wrote:
> Abhishek Chauhan wrote:
>> tstamp_type can be real, mono or userspace timestamp.
>>
>> This commit adds userspace timestamp and sets it if there is
>> valid transmit_time available in socket coming from userspace.
>>
>> To make the design scalable for future needs this commit bring in
>> the change to extend the tstamp_type:1 to tstamp_type:2 to support
>> userspace timestamp.
>>
>> Link: https://lore.kernel.org/netdev/bc037db4-58bb-4861-ac31-a361a93841d3@linux.dev/
>> Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com>
>> ---
>>  include/linux/skbuff.h | 19 +++++++++++++++++--
>>  net/ipv4/ip_output.c   |  2 +-
>>  net/ipv4/raw.c         |  2 +-
>>  net/ipv6/ip6_output.c  |  2 +-
>>  net/ipv6/raw.c         |  2 +-
>>  net/packet/af_packet.c |  6 +++---
>>  6 files changed, 24 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index 6160185f0fe0..2f91a8a2157a 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -705,6 +705,9 @@ typedef unsigned char *sk_buff_data_t;
>>  enum skb_tstamp_type {
>>  	SKB_TSTAMP_TYPE_RX_REAL = 0,    /* A RX (receive) time in real */
>>  	SKB_TSTAMP_TYPE_TX_MONO = 1,    /* A TX (delivery) time in mono */
>> +	SKB_TSTAMP_TYPE_TX_USER = 2,    /* A TX (delivery) time and its clock
>> +									 * is in skb->sk->sk_clockid.
>> +									 */
> 
> Weird indentation?
> 
I will correct it. 

> More fundamentally: instead of defining a type TX_USER, can we use a
> real clockid (e.g., CLOCK_TAI) based on skb->sk->sk_clockid? Rather
> than store an id that means "go look at sk_clockid".
> 
>>  };
>>  
>>  /**
>> @@ -830,6 +833,9 @@ enum skb_tstamp_type {
>>   *		delivery_time in mono clock base (i.e. EDT).  Otherwise, the
>>   *		skb->tstamp has the (rcv) timestamp at ingress and
>>   *		delivery_time at egress.
>> + *		delivery_time in mono clock base (i.e., EDT) or a clock base chosen
>> + *		by SO_TXTIME. If zero, skb->tstamp has the (rcv) timestamp at
>> + *		ingress.
>>   *	@napi_id: id of the NAPI struct this skb came from
>>   *	@sender_cpu: (aka @napi_id) source CPU in XPS
>>   *	@alloc_cpu: CPU which did the skb allocation.
>> @@ -960,7 +966,7 @@ struct sk_buff {
>>  	/* private: */
>>  	__u8			__mono_tc_offset[0];
>>  	/* public: */
>> -	__u8			tstamp_type:1;	/* See SKB_MONO_DELIVERY_TIME_MASK */
>> +	__u8			tstamp_type:2;	/* See SKB_MONO_DELIVERY_TIME_MASK */
>>  #ifdef CONFIG_NET_XGRESS
>>  	__u8			tc_at_ingress:1;	/* See TC_AT_INGRESS_MASK */
>>  	__u8			tc_skip_classify:1;
> 
> With pahole, does this have an effect on sk_buff layout?
> 
I think it does and it also impacts BPF testing. Hence in my cover letter i have mentioned that these
changes will impact BPF. My level of expertise is very limited to BPF hence the reason for RFC. 
That being said i am actually trying to understand/learn BPF instructions to know things better. 
I think we need to also change the offset SKB_MONO_DELIVERY_TIME_MASK and TC_AT_INGRESS_MASK


#ifdef __BIG_ENDIAN_BITFIELD
#define SKB_MONO_DELIVERY_TIME_MASK	(1 << 7) //Suspecting changes here too
#define TC_AT_INGRESS_MASK		(1 << 6) // and here 
#else
#define SKB_MONO_DELIVERY_TIME_MASK	(1 << 0)
#define TC_AT_INGRESS_MASK		(1 << 1) (this might have to change to 1<<2 )
#endif
#define SKB_BF_MONO_TC_OFFSET		offsetof(struct sk_buff, __mono_tc_offset)

Also i suspect i change in /selftests/bpf/prog_tests/ctx_rewrite.c 
I am trying to figure out what this part of the code is doing.
https://lore.kernel.org/all/20230321014115.997841-4-kuba@kernel.org/

Please correct me if i am wrong here. 

>> @@ -4274,7 +4280,16 @@ static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt,
>>  					enum skb_tstamp_type tstamp_type)
>>  {
>>  	skb->tstamp = kt;
>> -	skb->tstamp_type = kt && tstamp_type;
>> +
>> +	if (skb->tstamp_type)
>> +		return;
>> +
> 
I can put a warn on here incase if both MONO and TAI are set. 
OR 
Rather make it simple as you have mentioned below. 
> Why bail if a type is already set? And what if
> skb->tstamp_type != tstamp_type? Should skb->tstamp then not be
> written to (i.e., the test moved up), and perhaps a rate limited
> warning.
> 
>> +	if (kt && tstamp_type == SKB_TSTAMP_TYPE_TX_MONO)
>> +		skb->tstamp_type = SKB_TSTAMP_TYPE_TX_MONO;
>> +
>> +	if (kt && tstamp_type == SKB_TSTAMP_TYPE_TX_USER)
>> +		skb->tstamp_type = SKB_TSTAMP_TYPE_TX_USER;
> 
> Simpler
> 
>     if (kt)
>         skb->tstamp_type = tstamp_type;
Willem de Bruijn April 10, 2024, 10:58 p.m. UTC | #3
Abhishek Chauhan (ABC) wrote:
> 
> 
> On 4/10/2024 8:42 AM, Willem de Bruijn wrote:
> > Abhishek Chauhan wrote:
> >> tstamp_type can be real, mono or userspace timestamp.
> >>
> >> This commit adds userspace timestamp and sets it if there is
> >> valid transmit_time available in socket coming from userspace.
> >>
> >> To make the design scalable for future needs this commit bring in
> >> the change to extend the tstamp_type:1 to tstamp_type:2 to support
> >> userspace timestamp.
> >>
> >> Link: https://lore.kernel.org/netdev/bc037db4-58bb-4861-ac31-a361a93841d3@linux.dev/
> >> Signed-off-by: Abhishek Chauhan <quic_abchauha@quicinc.com>

> > With pahole, does this have an effect on sk_buff layout?
> > 
> I think it does and it also impacts BPF testing. Hence in my cover letter i have mentioned that these
> changes will impact BPF. My level of expertise is very limited to BPF hence the reason for RFC. 
> That being said i am actually trying to understand/learn BPF instructions to know things better. 
> I think we need to also change the offset SKB_MONO_DELIVERY_TIME_MASK and TC_AT_INGRESS_MASK
> 
> 
> #ifdef __BIG_ENDIAN_BITFIELD
> #define SKB_MONO_DELIVERY_TIME_MASK	(1 << 7) //Suspecting changes here too
> #define TC_AT_INGRESS_MASK		(1 << 6) // and here 
> #else
> #define SKB_MONO_DELIVERY_TIME_MASK	(1 << 0)
> #define TC_AT_INGRESS_MASK		(1 << 1) (this might have to change to 1<<2 )
> #endif
> #define SKB_BF_MONO_TC_OFFSET		offsetof(struct sk_buff, __mono_tc_offset)
> 
> Also i suspect i change in /selftests/bpf/prog_tests/ctx_rewrite.c 
> I am trying to figure out what this part of the code is doing.
> https://lore.kernel.org/all/20230321014115.997841-4-kuba@kernel.org/
> 
> Please correct me if i am wrong here.

I broadly agree. We should convert all references to
SKB_MONO_DELIVERY_TIME_MASK to an skb_tstamp_type equivalent.

> 
> >> @@ -4274,7 +4280,16 @@ static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt,
> >>  					enum skb_tstamp_type tstamp_type)
> >>  {
> >>  	skb->tstamp = kt;
> >> -	skb->tstamp_type = kt && tstamp_type;
> >> +
> >> +	if (skb->tstamp_type)
> >> +		return;
> >> +
> > 
> I can put a warn on here incase if both MONO and TAI are set. 
> OR 
> Rather make it simple as you have mentioned below.

When might skb->tstamp_type already be non-zero when
skb_set_deliver_type is called?

In most cases, this is called for a fresh skb.

In br_ip6_fragment, it is called with a previous value. But this is
the value of skb->tstamp_type. It just clears it if kt is 0.

If skb->tstamp_type != tstamp_type is not a condition that can be
forced by an unprivileged user, then we can warn.
Martin KaFai Lau April 10, 2024, 11:25 p.m. UTC | #4
On 4/10/24 1:25 PM, Abhishek Chauhan (ABC) wrote:
>>> @@ -830,6 +833,9 @@ enum skb_tstamp_type {
>>>    *		delivery_time in mono clock base (i.e. EDT).  Otherwise, the
>>>    *		skb->tstamp has the (rcv) timestamp at ingress and
>>>    *		delivery_time at egress.
>>> + *		delivery_time in mono clock base (i.e., EDT) or a clock base chosen
>>> + *		by SO_TXTIME. If zero, skb->tstamp has the (rcv) timestamp at
>>> + *		ingress.
>>>    *	@napi_id: id of the NAPI struct this skb came from
>>>    *	@sender_cpu: (aka @napi_id) source CPU in XPS
>>>    *	@alloc_cpu: CPU which did the skb allocation.
>>> @@ -960,7 +966,7 @@ struct sk_buff {
>>>   	/* private: */
>>>   	__u8			__mono_tc_offset[0];
>>>   	/* public: */
>>> -	__u8			tstamp_type:1;	/* See SKB_MONO_DELIVERY_TIME_MASK */
>>> +	__u8			tstamp_type:2;	/* See SKB_MONO_DELIVERY_TIME_MASK */
>>>   #ifdef CONFIG_NET_XGRESS
>>>   	__u8			tc_at_ingress:1;	/* See TC_AT_INGRESS_MASK */

The above "tstamp_type:2" change shifted the tc_at_ingress bit.
TC_AT_INGRESS_MASK needs to be adjusted.

>>>   	__u8			tc_skip_classify:1;
>>
>> With pahole, does this have an effect on sk_buff layout?
>>
> I think it does and it also impacts BPF testing. Hence in my cover letter i have mentioned that these
> changes will impact BPF. My level of expertise is very limited to BPF hence the reason for RFC.
> That being said i am actually trying to understand/learn BPF instructions to know things better.
> I think we need to also change the offset SKB_MONO_DELIVERY_TIME_MASK and TC_AT_INGRESS_MASK
> 
> 
> #ifdef __BIG_ENDIAN_BITFIELD
> #define SKB_MONO_DELIVERY_TIME_MASK	(1 << 7) //Suspecting changes here too
> #define TC_AT_INGRESS_MASK		(1 << 6) // and here
> #else
> #define SKB_MONO_DELIVERY_TIME_MASK	(1 << 0)
> #define TC_AT_INGRESS_MASK		(1 << 1) (this might have to change to 1<<2 )

This should be (1 << 2) now. Similar adjustment for the big endian.

> #endif
> #define SKB_BF_MONO_TC_OFFSET		offsetof(struct sk_buff, __mono_tc_offset)
> 
> Also i suspect i change in /selftests/bpf/prog_tests/ctx_rewrite.c

ctx_rewrite.c tests the bpf ctx rewrite code. In this particular case, it tests
the bpf_convert_tstamp_read() and bpf_convert_tstamp_write() generate the
correct bpf instructions.
e.g. "w11 &= 3;" is testing the following in bpf_convert_tstamp_read():
		*insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
	 				TC_AT_INGRESS_MASK | SKB_MONO_DELIVERY_TIME_MASK);

The existing "TC_AT_INGRESS_MASK | SKB_MONO_DELIVERY_TIME_MASK" is 0x3
and it should become 0x5 if my hand counts correctly.

The patch set cannot be applied to the bpf-next:
https://patchwork.kernel.org/project/netdevbpf/patch/20240409210547.3815806-4-quic_abchauha@quicinc.com/
, so bpf CI cannot run to reproduce the issue.
Abhishek Chauhan (ABC) April 10, 2024, 11:39 p.m. UTC | #5
On 4/10/2024 4:25 PM, Martin KaFai Lau wrote:
> On 4/10/24 1:25 PM, Abhishek Chauhan (ABC) wrote:
>>>> @@ -830,6 +833,9 @@ enum skb_tstamp_type {
>>>>    *        delivery_time in mono clock base (i.e. EDT).  Otherwise, the
>>>>    *        skb->tstamp has the (rcv) timestamp at ingress and
>>>>    *        delivery_time at egress.
>>>> + *        delivery_time in mono clock base (i.e., EDT) or a clock base chosen
>>>> + *        by SO_TXTIME. If zero, skb->tstamp has the (rcv) timestamp at
>>>> + *        ingress.
>>>>    *    @napi_id: id of the NAPI struct this skb came from
>>>>    *    @sender_cpu: (aka @napi_id) source CPU in XPS
>>>>    *    @alloc_cpu: CPU which did the skb allocation.
>>>> @@ -960,7 +966,7 @@ struct sk_buff {
>>>>       /* private: */
>>>>       __u8            __mono_tc_offset[0];
>>>>       /* public: */
>>>> -    __u8            tstamp_type:1;    /* See SKB_MONO_DELIVERY_TIME_MASK */
>>>> +    __u8            tstamp_type:2;    /* See SKB_MONO_DELIVERY_TIME_MASK */
>>>>   #ifdef CONFIG_NET_XGRESS
>>>>       __u8            tc_at_ingress:1;    /* See TC_AT_INGRESS_MASK */
> 
> The above "tstamp_type:2" change shifted the tc_at_ingress bit.
> TC_AT_INGRESS_MASK needs to be adjusted.
> 
>>>>       __u8            tc_skip_classify:1;
>>>
>>> With pahole, does this have an effect on sk_buff layout?
>>>
>> I think it does and it also impacts BPF testing. Hence in my cover letter i have mentioned that these
>> changes will impact BPF. My level of expertise is very limited to BPF hence the reason for RFC.
>> That being said i am actually trying to understand/learn BPF instructions to know things better.
>> I think we need to also change the offset SKB_MONO_DELIVERY_TIME_MASK and TC_AT_INGRESS_MASK
>>
>>
>> #ifdef __BIG_ENDIAN_BITFIELD
>> #define SKB_MONO_DELIVERY_TIME_MASK    (1 << 7) //Suspecting changes here too
>> #define TC_AT_INGRESS_MASK        (1 << 6) // and here
>> #else
>> #define SKB_MONO_DELIVERY_TIME_MASK    (1 << 0)
>> #define TC_AT_INGRESS_MASK        (1 << 1) (this might have to change to 1<<2 )
> 
> This should be (1 << 2) now. Similar adjustment for the big endian.
> 
>> #endif
>> #define SKB_BF_MONO_TC_OFFSET        offsetof(struct sk_buff, __mono_tc_offset)
>>
>> Also i suspect i change in /selftests/bpf/prog_tests/ctx_rewrite.c
> 
> ctx_rewrite.c tests the bpf ctx rewrite code. In this particular case, it tests
> the bpf_convert_tstamp_read() and bpf_convert_tstamp_write() generate the
> correct bpf instructions.
> e.g. "w11 &= 3;" is testing the following in bpf_convert_tstamp_read():
>         *insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
>                      TC_AT_INGRESS_MASK | SKB_MONO_DELIVERY_TIME_MASK);
> 
> The existing "TC_AT_INGRESS_MASK | SKB_MONO_DELIVERY_TIME_MASK" is 0x3
> and it should become 0x5 if my hand counts correctly.
> 

so the changes will be as follows (Martin correct me if am wrong)

		//w11 is checked againt 0x5 (Binary = 101)
		N(SCHED_CLS, struct __sk_buff, tstamp),
		.read  = "r11 = *(u8 *)($ctx + sk_buff::__mono_tc_offset);"
			 "w11 &= 5;" <== here 
			 "if w11 != 0x5  goto pc+2;" <==here
			 "$dst = 0;"
			 "goto pc+1;"
			 "$dst = *(u64 *)($ctx + sk_buff::tstamp);",

		//w11 is checked againt 0x4 (100) 
		.write = "r11 = *(u8 *)($ctx + sk_buff::__mono_tc_offset);"
			 "if w11 & 0x4 goto pc+1;" <== here
			 "goto pc+2;"
			 "w11 &= -4;" <==here
			 "*(u8 *)($ctx + sk_buff::__mono_tc_offset) = r11;"
			 "*(u64 *)($ctx + sk_buff::tstamp) = $src;",


> The patch set cannot be applied to the bpf-next:
> https://patchwork.kernel.org/project/netdevbpf/patch/20240409210547.3815806-4-quic_abchauha@quicinc.com/
> , so bpf CI cannot run to reproduce the issue.
>
Abhishek Chauhan (ABC) April 12, 2024, 9:14 p.m. UTC | #6
On 4/10/2024 4:39 PM, Abhishek Chauhan (ABC) wrote:
> 
> 
> On 4/10/2024 4:25 PM, Martin KaFai Lau wrote:
>> On 4/10/24 1:25 PM, Abhishek Chauhan (ABC) wrote:
>>>>> @@ -830,6 +833,9 @@ enum skb_tstamp_type {
>>>>>    *        delivery_time in mono clock base (i.e. EDT).  Otherwise, the
>>>>>    *        skb->tstamp has the (rcv) timestamp at ingress and
>>>>>    *        delivery_time at egress.
>>>>> + *        delivery_time in mono clock base (i.e., EDT) or a clock base chosen
>>>>> + *        by SO_TXTIME. If zero, skb->tstamp has the (rcv) timestamp at
>>>>> + *        ingress.
>>>>>    *    @napi_id: id of the NAPI struct this skb came from
>>>>>    *    @sender_cpu: (aka @napi_id) source CPU in XPS
>>>>>    *    @alloc_cpu: CPU which did the skb allocation.
>>>>> @@ -960,7 +966,7 @@ struct sk_buff {
>>>>>       /* private: */
>>>>>       __u8            __mono_tc_offset[0];
>>>>>       /* public: */
>>>>> -    __u8            tstamp_type:1;    /* See SKB_MONO_DELIVERY_TIME_MASK */
>>>>> +    __u8            tstamp_type:2;    /* See SKB_MONO_DELIVERY_TIME_MASK */
>>>>>   #ifdef CONFIG_NET_XGRESS
>>>>>       __u8            tc_at_ingress:1;    /* See TC_AT_INGRESS_MASK */
>>
>> The above "tstamp_type:2" change shifted the tc_at_ingress bit.
>> TC_AT_INGRESS_MASK needs to be adjusted.
>>
>>>>>       __u8            tc_skip_classify:1;
>>>>
>>>> With pahole, does this have an effect on sk_buff layout?
>>>>
>>> I think it does and it also impacts BPF testing. Hence in my cover letter i have mentioned that these
>>> changes will impact BPF. My level of expertise is very limited to BPF hence the reason for RFC.
>>> That being said i am actually trying to understand/learn BPF instructions to know things better.
>>> I think we need to also change the offset SKB_MONO_DELIVERY_TIME_MASK and TC_AT_INGRESS_MASK
>>>
>>>
>>> #ifdef __BIG_ENDIAN_BITFIELD
>>> #define SKB_MONO_DELIVERY_TIME_MASK    (1 << 7) //Suspecting changes here too
>>> #define TC_AT_INGRESS_MASK        (1 << 6) // and here
>>> #else
>>> #define SKB_MONO_DELIVERY_TIME_MASK    (1 << 0)
>>> #define TC_AT_INGRESS_MASK        (1 << 1) (this might have to change to 1<<2 )
>>
>> This should be (1 << 2) now. Similar adjustment for the big endian.
>>
>>> #endif
>>> #define SKB_BF_MONO_TC_OFFSET        offsetof(struct sk_buff, __mono_tc_offset)
>>>
>>> Also i suspect i change in /selftests/bpf/prog_tests/ctx_rewrite.c
>>
>> ctx_rewrite.c tests the bpf ctx rewrite code. In this particular case, it tests
>> the bpf_convert_tstamp_read() and bpf_convert_tstamp_write() generate the
>> correct bpf instructions.
>> e.g. "w11 &= 3;" is testing the following in bpf_convert_tstamp_read():
>>         *insn++ = BPF_ALU32_IMM(BPF_AND, tmp_reg,
>>                      TC_AT_INGRESS_MASK | SKB_MONO_DELIVERY_TIME_MASK);
>>
>> The existing "TC_AT_INGRESS_MASK | SKB_MONO_DELIVERY_TIME_MASK" is 0x3
>> and it should become 0x5 if my hand counts correctly.
>>
> 
> so the changes will be as follows (Martin correct me if am wrong)
> 
> 		//w11 is checked againt 0x5 (Binary = 101)
> 		N(SCHED_CLS, struct __sk_buff, tstamp),
> 		.read  = "r11 = *(u8 *)($ctx + sk_buff::__mono_tc_offset);"
> 			 "w11 &= 5;" <== here 
> 			 "if w11 != 0x5  goto pc+2;" <==here
> 			 "$dst = 0;"
> 			 "goto pc+1;"
> 			 "$dst = *(u64 *)($ctx + sk_buff::tstamp);",
> 
> 		//w11 is checked againt 0x4 (100) 
> 		.write = "r11 = *(u8 *)($ctx + sk_buff::__mono_tc_offset);"
> 			 "if w11 & 0x4 goto pc+1;" <== here
> 			 "goto pc+2;"
> 			 "w11 &= -4;" <==here
> 			 "*(u8 *)($ctx + sk_buff::__mono_tc_offset) = r11;"
> 			 "*(u64 *)($ctx + sk_buff::tstamp) = $src;",
> 
>
Martin and Willem,
After the above changes, patchset v3 of these changes passed BPF test cases . Looks like we are good to go with final review now. If you have any further comments
Thank you for all the comments and design discussion that we had as part of this patch set series. 

Testing :- 
1. https://patchwork.kernel.org/project/netdevbpf/patch/20240412210125.1780574-2-quic_abchauha@quicinc.com/
2. https://patchwork.kernel.org/project/netdevbpf/patch/20240412210125.1780574-3-quic_abchauha@quicinc.com/
 
>> The patch set cannot be applied to the bpf-next:
>> https://patchwork.kernel.org/project/netdevbpf/patch/20240409210547.3815806-4-quic_abchauha@quicinc.com/
>> , so bpf CI cannot run to reproduce the issue.
>>
diff mbox series

Patch

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 6160185f0fe0..2f91a8a2157a 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -705,6 +705,9 @@  typedef unsigned char *sk_buff_data_t;
 enum skb_tstamp_type {
 	SKB_TSTAMP_TYPE_RX_REAL = 0,    /* A RX (receive) time in real */
 	SKB_TSTAMP_TYPE_TX_MONO = 1,    /* A TX (delivery) time in mono */
+	SKB_TSTAMP_TYPE_TX_USER = 2,    /* A TX (delivery) time and its clock
+									 * is in skb->sk->sk_clockid.
+									 */
 };
 
 /**
@@ -830,6 +833,9 @@  enum skb_tstamp_type {
  *		delivery_time in mono clock base (i.e. EDT).  Otherwise, the
  *		skb->tstamp has the (rcv) timestamp at ingress and
  *		delivery_time at egress.
+ *		delivery_time in mono clock base (i.e., EDT) or a clock base chosen
+ *		by SO_TXTIME. If zero, skb->tstamp has the (rcv) timestamp at
+ *		ingress.
  *	@napi_id: id of the NAPI struct this skb came from
  *	@sender_cpu: (aka @napi_id) source CPU in XPS
  *	@alloc_cpu: CPU which did the skb allocation.
@@ -960,7 +966,7 @@  struct sk_buff {
 	/* private: */
 	__u8			__mono_tc_offset[0];
 	/* public: */
-	__u8			tstamp_type:1;	/* See SKB_MONO_DELIVERY_TIME_MASK */
+	__u8			tstamp_type:2;	/* See SKB_MONO_DELIVERY_TIME_MASK */
 #ifdef CONFIG_NET_XGRESS
 	__u8			tc_at_ingress:1;	/* See TC_AT_INGRESS_MASK */
 	__u8			tc_skip_classify:1;
@@ -4274,7 +4280,16 @@  static inline void skb_set_delivery_time(struct sk_buff *skb, ktime_t kt,
 					enum skb_tstamp_type tstamp_type)
 {
 	skb->tstamp = kt;
-	skb->tstamp_type = kt && tstamp_type;
+
+	if (skb->tstamp_type)
+		return;
+
+	if (kt && tstamp_type == SKB_TSTAMP_TYPE_TX_MONO)
+		skb->tstamp_type = SKB_TSTAMP_TYPE_TX_MONO;
+
+	if (kt && tstamp_type == SKB_TSTAMP_TYPE_TX_USER)
+		skb->tstamp_type = SKB_TSTAMP_TYPE_TX_USER;
+
 }
 
 DECLARE_STATIC_KEY_FALSE(netstamp_needed_key);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 62e457f7c02c..9aea6e810f52 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1457,7 +1457,7 @@  struct sk_buff *__ip_make_skb(struct sock *sk,
 
 	skb->priority = (cork->tos != -1) ? cork->priority: READ_ONCE(sk->sk_priority);
 	skb->mark = cork->mark;
-	skb->tstamp = cork->transmit_time;
+	skb_set_delivery_time(skb, cork->transmit_time, SKB_TSTAMP_TYPE_TX_USER);
 	/*
 	 * Steal rt from cork.dst to avoid a pair of atomic_inc/atomic_dec
 	 * on dst refcount
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index dcb11f22cbf2..d8f52bc06ed3 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -360,7 +360,7 @@  static int raw_send_hdrinc(struct sock *sk, struct flowi4 *fl4,
 	skb->protocol = htons(ETH_P_IP);
 	skb->priority = READ_ONCE(sk->sk_priority);
 	skb->mark = sockc->mark;
-	skb->tstamp = sockc->transmit_time;
+	skb_set_delivery_time(skb, sockc->transmit_time, SKB_TSTAMP_TYPE_TX_USER);
 	skb_dst_set(skb, &rt->dst);
 	*rtp = NULL;
 
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index a9e819115622..2beb9fc8c0b1 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1924,7 +1924,7 @@  struct sk_buff *__ip6_make_skb(struct sock *sk,
 
 	skb->priority = READ_ONCE(sk->sk_priority);
 	skb->mark = cork->base.mark;
-	skb->tstamp = cork->base.transmit_time;
+	skb_set_delivery_time(skb, cork->base.transmit_time, SKB_TSTAMP_TYPE_TX_USER);
 
 	ip6_cork_steal_dst(skb, cork);
 	IP6_INC_STATS(net, rt->rt6i_idev, IPSTATS_MIB_OUTREQUESTS);
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 0d896ca7b589..3a68ca80bf83 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -621,7 +621,7 @@  static int rawv6_send_hdrinc(struct sock *sk, struct msghdr *msg, int length,
 	skb->protocol = htons(ETH_P_IPV6);
 	skb->priority = READ_ONCE(sk->sk_priority);
 	skb->mark = sockc->mark;
-	skb->tstamp = sockc->transmit_time;
+	skb_set_delivery_time(skb, sockc->transmit_time, SKB_TSTAMP_TYPE_TX_USER);
 
 	skb_put(skb, length);
 	skb_reset_network_header(skb);
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 18f616f487ea..27ea972dfc56 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2056,7 +2056,7 @@  static int packet_sendmsg_spkt(struct socket *sock, struct msghdr *msg,
 	skb->dev = dev;
 	skb->priority = READ_ONCE(sk->sk_priority);
 	skb->mark = READ_ONCE(sk->sk_mark);
-	skb->tstamp = sockc.transmit_time;
+	skb_set_delivery_time(skb, sockc.transmit_time, SKB_TSTAMP_TYPE_TX_USER);
 
 	skb_setup_tx_timestamp(skb, sockc.tsflags);
 
@@ -2585,7 +2585,7 @@  static int tpacket_fill_skb(struct packet_sock *po, struct sk_buff *skb,
 	skb->dev = dev;
 	skb->priority = READ_ONCE(po->sk.sk_priority);
 	skb->mark = READ_ONCE(po->sk.sk_mark);
-	skb->tstamp = sockc->transmit_time;
+	skb_set_delivery_time(skb, sockc->transmit_time, SKB_TSTAMP_TYPE_TX_USER);
 	skb_setup_tx_timestamp(skb, sockc->tsflags);
 	skb_zcopy_set_nouarg(skb, ph.raw);
 
@@ -3063,7 +3063,7 @@  static int packet_snd(struct socket *sock, struct msghdr *msg, size_t len)
 	skb->dev = dev;
 	skb->priority = READ_ONCE(sk->sk_priority);
 	skb->mark = sockc.mark;
-	skb->tstamp = sockc.transmit_time;
+	skb_set_delivery_time(skb, sockc.transmit_time, SKB_TSTAMP_TYPE_TX_USER);
 
 	if (unlikely(extra_len == 4))
 		skb->no_fcs = 1;