diff mbox series

[ipsec-next,v8,10/16] xfrm: iptfs: add fragmenting of larger than MTU user packets

Message ID 20240804203346.3654426-11-chopps@chopps.org (mailing list archive)
State Superseded
Delegated to: Netdev Maintainers
Headers show
Series Add IP-TFS mode to xfrm | expand

Checks

Context Check Description
netdev/series_format fail Series longer than 15 patches
netdev/tree_selection success Guessed tree name to be net-next
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 29 this patch: 29
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers warning 4 maintainers not CCed: herbert@gondor.apana.org.au edumazet@google.com pabeni@redhat.com kuba@kernel.org
netdev/build_clang success Errors and warnings before: 29 this patch: 29
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 29 this patch: 29
netdev/checkpatch warning WARNING: Do not crash the kernel unless it is absolutely unavoidable--use WARN_ON_ONCE() plus recovery code (if feasible) instead of BUG() or variants WARNING: line length of 81 exceeds 80 columns WARNING: line length of 82 exceeds 80 columns WARNING: line length of 83 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 1 this patch: 0
netdev/source_inline success Was 0 now: 0

Commit Message

Christian Hopps Aug. 4, 2024, 8:33 p.m. UTC
From: Christian Hopps <chopps@labn.net>

Add support for tunneling user (inner) packets that are larger than the
tunnel's path MTU (outer) using IP-TFS fragmentation.

Signed-off-by: Christian Hopps <chopps@labn.net>
---
 net/xfrm/xfrm_iptfs.c | 407 +++++++++++++++++++++++++++++++++++++++---
 1 file changed, 381 insertions(+), 26 deletions(-)

Comments

Sabrina Dubroca Aug. 4, 2024, 10:25 p.m. UTC | #1
Please CC the reviewers from previous versions of the patchset. It's
really hard to keep track of discussions and reposts otherwise.


2024-08-04, 16:33:39 -0400, Christian Hopps wrote:
> From: Christian Hopps <chopps@labn.net>
> 
> Add support for tunneling user (inner) packets that are larger than the
> tunnel's path MTU (outer) using IP-TFS fragmentation.
> 
> Signed-off-by: Christian Hopps <chopps@labn.net>
> ---
>  net/xfrm/xfrm_iptfs.c | 407 +++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 381 insertions(+), 26 deletions(-)
> 
> diff --git a/net/xfrm/xfrm_iptfs.c b/net/xfrm/xfrm_iptfs.c
> index 20c19894720e..38735e2d64c3 100644
> --- a/net/xfrm/xfrm_iptfs.c
> +++ b/net/xfrm/xfrm_iptfs.c
> @@ -46,12 +46,23 @@
>   */
>  #define IPTFS_DEFAULT_MAX_QUEUE_SIZE	(1024 * 10240)
>  
> +/* 1) skb->head should be cache aligned.
> + * 2) when resv is for L2 headers (i.e., ethernet) we want the cacheline to
> + * start -16 from data.
> + * 3) when resv is for L3+L2 headers IOW skb->data points at the IPTFS payload
> + * we want data to be cache line aligned so all the pushed headers will be in
> + * another cacheline.
> + */
> +#define XFRM_IPTFS_MIN_L3HEADROOM 128
> +#define XFRM_IPTFS_MIN_L2HEADROOM (64 + 16)

How did you pick those values?



> +static struct sk_buff *iptfs_alloc_skb(struct sk_buff *tpl, u32 len,
> +				       bool l3resv)
> +{
> +	struct sk_buff *skb;
> +	u32 resv;
> +
> +	if (!l3resv) {
> +		resv = XFRM_IPTFS_MIN_L2HEADROOM;
> +	} else {
> +		resv = skb_headroom(tpl);
> +		if (resv < XFRM_IPTFS_MIN_L3HEADROOM)
> +			resv = XFRM_IPTFS_MIN_L3HEADROOM;
> +	}
> +
> +	skb = alloc_skb(len + resv, GFP_ATOMIC);
> +	if (!skb) {
> +		XFRM_INC_STATS(dev_net(tpl->dev), LINUX_MIB_XFRMNOSKBERROR);

Hmpf, so we've gone from incrementing the wrong counter to
incrementing a new counter that doesn't have a precise meaning.

> +		return NULL;
> +	}
> +
> +	skb_reserve(skb, resv);
> +
> +	/* We do not want any of the tpl->headers copied over, so we do
> +	 * not use `skb_copy_header()`.
> +	 */

This is a bit of a bad sign for the implementation. It also worries
me, as this may not be updated when changes are made to
__copy_skb_header().
(c/p'd from v1 review since this was still not answered)


> +/**
> + * skb_copy_bits_seq - copy bits from a skb_seq_state to kernel buffer
> + * @st: source skb_seq_state
> + * @offset: offset in source
> + * @to: destination buffer
> + * @len: number of bytes to copy
> + *
> + * Copy @len bytes from @offset bytes into the source @st to the destination
> + * buffer @to. `offset` should increase (or be unchanged) with each subsequent
> + * call to this function. If offset needs to decrease from the previous use `st`
> + * should be reset first.
> + *
> + * Return: 0 on success or a negative error code on failure
> + */
> +static int skb_copy_bits_seq(struct skb_seq_state *st, int offset, void *to,
> +			     int len)

Probably belongs in net/core/skbuff.c, although I'm really not
convinced copying data around is the right way to implement the type
of packet splitting IPTFS does (which sounds a bit like a kind of
GSO). And there are helpers in net/core/skbuff.c (such as
pskb_carve/pskb_extract) that seem to do similar things to what you
need here, without as much data copying.


> +static int iptfs_first_skb(struct sk_buff **skbp, struct xfrm_iptfs_data *xtfs,
> +			   u32 mtu)
> +{
> +	struct sk_buff *skb = *skbp;
> +	int err;
> +
> +	/* Classic ESP skips the don't fragment ICMP error if DF is clear on
> +	 * the inner packet or ignore_df is set. Otherwise it will send an ICMP
> + * or local error if the inner packet won't fit its MTU.
> +	 *
> +	 * With IPTFS we do not care about the inner packet DF bit. If the
> +	 * tunnel is configured to "don't fragment" we error back if things
> +	 * don't fit in our max packet size. Otherwise we iptfs-fragment as
> +	 * normal.
> +	 */
> +
> +	/* The opportunity for HW offload has ended */
> +	if (skb->ip_summed == CHECKSUM_PARTIAL) {
> +		err = skb_checksum_help(skb);
> +		if (err)
> +			return err;
> +	}
> +
> +	/* We've split these up before queuing */
> +	BUG_ON(skb_is_gso(skb));

As I've said previously, I don't think that's a valid reason to
crash. BUG_ON should be used very rarely:

https://elixir.bootlin.com/linux/v6.10/source/Documentation/process/coding-style.rst#L1230

Dropping a bogus packet is an easy way to recover from this situation,
so we should not crash here (and probably in all of IPTFS).
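For illustration, a minimal sketch of the drop-and-recover pattern being
suggested here in place of the BUG_ON(); the error value is an assumption,
not something from the posted patch:

        /* We've split these up before queuing; a GSO skb reaching this
         * point is a bug, but warn once and drop it rather than crash.
         */
        if (WARN_ON_ONCE(skb_is_gso(skb))) {
                kfree_skb(skb);
                return -EINVAL;
        }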
Christian Hopps Aug. 5, 2024, 2:33 a.m. UTC | #2
Sabrina Dubroca <sd@queasysnail.net> writes:

> Please CC the reviewers from previous versions of the patchset. It's
> really hard to keep track of discussions and reposts otherwise.

Wasn't aware of this requirement; will try to add all the reviewers in the future.

> 2024-08-04, 16:33:39 -0400, Christian Hopps wrote:
>> From: Christian Hopps <chopps@labn.net>
>>
>> Add support for tunneling user (inner) packets that are larger than the
>> tunnel's path MTU (outer) using IP-TFS fragmentation.
>>
>> Signed-off-by: Christian Hopps <chopps@labn.net>
>> ---
>>  net/xfrm/xfrm_iptfs.c | 407 +++++++++++++++++++++++++++++++++++++++---
>>  1 file changed, 381 insertions(+), 26 deletions(-)
>>
>> diff --git a/net/xfrm/xfrm_iptfs.c b/net/xfrm/xfrm_iptfs.c
>> index 20c19894720e..38735e2d64c3 100644
>> --- a/net/xfrm/xfrm_iptfs.c
>> +++ b/net/xfrm/xfrm_iptfs.c
>> @@ -46,12 +46,23 @@
>>   */
>>  #define IPTFS_DEFAULT_MAX_QUEUE_SIZE	(1024 * 10240)
>>
>> +/* 1) skb->head should be cache aligned.
>> + * 2) when resv is for L2 headers (i.e., ethernet) we want the cacheline to
>> + * start -16 from data.
>> + * 3) when resv is for L3+L2 headers IOW skb->data points at the IPTFS payload
>> + * we want data to be cache line aligned so all the pushed headers will be in
>> + * another cacheline.
>> + */
>> +#define XFRM_IPTFS_MIN_L3HEADROOM 128
>> +#define XFRM_IPTFS_MIN_L2HEADROOM (64 + 16)
>
> How did you pick those values?

That's what the comment is talking about. When reserving space for L2 headers we pick 64 + 16 (a 2^(<=6) cacheline + 16 bytes) so that the cacheline starts -16 from where skb->data will point. For L3 we reserve double the power of 2 space we reserved for L2 only.

We have to reserve some amount of space for pushed headers, so the above made sense to me for good performance/cache locality.
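For illustration, a worked sketch of the intended layout, assuming skb->head
is cache aligned and a 64-byte cacheline (the numbers are illustrative only):

        skb = alloc_skb(len + 64 + 16, GFP_ATOMIC);
        skb_reserve(skb, 64 + 16);      /* skb->data is at head + 80 */
        /* The cacheline holding skb->data starts at head + 64, i.e., at
         * skb->data - 16, so a pushed 14-byte ethernet header (bytes
         * -14..-1 relative to skb->data) lands in that same cacheline.
         */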

>> +static struct sk_buff *iptfs_alloc_skb(struct sk_buff *tpl, u32 len,
>> +				       bool l3resv)
>> +{
>> +	struct sk_buff *skb;
>> +	u32 resv;
>> +
>> +	if (!l3resv) {
>> +		resv = XFRM_IPTFS_MIN_L2HEADROOM;
>> +	} else {
>> +		resv = skb_headroom(tpl);
>> +		if (resv < XFRM_IPTFS_MIN_L3HEADROOM)
>> +			resv = XFRM_IPTFS_MIN_L3HEADROOM;
>> +	}
>> +
>> +	skb = alloc_skb(len + resv, GFP_ATOMIC);
>> +	if (!skb) {
>> +		XFRM_INC_STATS(dev_net(tpl->dev), LINUX_MIB_XFRMNOSKBERROR);
>
> Hmpf, so we've gone from incrementing the wrong counter to
> incrementing a new counter that doesn't have a precise meaning.

The new "No SKB" counter is supposed to mean "couldn't get an SKB", given plenty of other errors are logged under "OutErr" or "InErr" i'm not sure what level of precision you're looking for here. :)

>> +		return NULL;
>> +	}
>> +
>> +	skb_reserve(skb, resv);
>> +
>> +	/* We do not want any of the tpl->headers copied over, so we do
>> +	 * not use `skb_copy_header()`.
>> +	 */
>
> This is a bit of a bad sign for the implementation. It also worries
> me, as this may not be updated when changes are made to
> __copy_skb_header().
> (c/p'd from v1 review since this was still not answered)

I don't agree that this is a bad design at all; I'm curious what you think a good design would be.

I did specifically state why we are not re-using skb_copy_header(). The functionality is different. We are not trying to make a copy of an skb; we are using an skb as a template for new skbs.

>> +/**
>> + * skb_copy_bits_seq - copy bits from a skb_seq_state to kernel buffer
>> + * @st: source skb_seq_state
>> + * @offset: offset in source
>> + * @to: destination buffer
>> + * @len: number of bytes to copy
>> + *
>> + * Copy @len bytes from @offset bytes into the source @st to the destination
>> + * buffer @to. `offset` should increase (or be unchanged) with each subsequent
>> + * call to this function. If offset needs to decrease from the previous use `st`
>> + * should be reset first.
>> + *
>> + * Return: 0 on success or a negative error code on failure
>> + */
>> +static int skb_copy_bits_seq(struct skb_seq_state *st, int offset, void *to,
>> +			     int len)
>
> Probably belongs in net/core/skbuff.c, although I'm really not
> convinced copying data around is the right way to implement the type
> of packet splitting IPTFS does (which sounds a bit like a kind of
> GSO). And there are helpers in net/core/skbuff.c (such as
> pskb_carve/pskb_extract) that seem to do similar things to what you
> need here, without as much data copying.

I don't have an issue with moving more general skb functionality into skbuff.c; however, I do not want to gate IP-TFS on this change to the general net infra; it is appropriate for a patchset of its own.

Re copying: Let's be clear here, we are not always copying data; there are sharing code paths as well. However, there are times when it is the best (and even fastest) way to accomplish things (e.g., b/c the packet is small or the data is arranged in skbs in a way that makes sharing ridiculously complex and thus slow).

This is in fact a very elegant and efficient sequential walk solution to a complex problem.
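For reference, a minimal sketch of what such a sequential copy helper can
look like on top of the existing skb_seq_read() API; this is an
approximation, not the exact code from the patch:

static int skb_copy_bits_seq(struct skb_seq_state *st, int offset, void *to,
                             int len)
{
        const u8 *data;
        u32 sqlen;

        for (;;) {
                /* skb_seq_read() returns the next chunk of data at
                 * `offset` within the (possibly nonlinear) skb chain.
                 */
                sqlen = skb_seq_read(offset, &data, st);
                if (sqlen == 0)
                        return -ENOMEM;
                if (sqlen >= len) {
                        memcpy(to, data, len);
                        return 0;
                }
                memcpy(to, data, sqlen);
                to += sqlen;
                offset += sqlen;
                len -= sqlen;
        }
}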

>> +static int iptfs_first_skb(struct sk_buff **skbp, struct xfrm_iptfs_data *xtfs,
>> +			   u32 mtu)
>> +{
>> +	struct sk_buff *skb = *skbp;
>> +	int err;
>> +
>> +	/* Classic ESP skips the don't fragment ICMP error if DF is clear on
>> +	 * the inner packet or ignore_df is set. Otherwise it will send an ICMP
>> + * or local error if the inner packet won't fit its MTU.
>> +	 *
>> +	 * With IPTFS we do not care about the inner packet DF bit. If the
>> +	 * tunnel is configured to "don't fragment" we error back if things
>> +	 * don't fit in our max packet size. Otherwise we iptfs-fragment as
>> +	 * normal.
>> +	 */
>> +
>> +	/* The opportunity for HW offload has ended */
>> +	if (skb->ip_summed == CHECKSUM_PARTIAL) {
>> +		err = skb_checksum_help(skb);
>> +		if (err)
>> +			return err;
>> +	}
>> +
>> +	/* We've split these up before queuing */
>> +	BUG_ON(skb_is_gso(skb));
>
> As I've said previously, I don't think that's a valid reason to
> crash. BUG_ON should be used very rarely:
>
> https://elixir.bootlin.com/linux/v6.10/source/Documentation/process/coding-style.rst#L1230
>
> Dropping a bogus packet is an easy way to recover from this situation,
> so we should not crash here (and probably in all of IPTFS).

This is basically following a style of coding that aims to simplify overall code by eliminating multiple checks for the same condition over and over in code. It does this by arranging for a single invariant at the beginning of an operation and then counting on that from then on in the code. Asserts are the way to document this; if there is no assert then nothing, b/c using a conditional is exactly against the design principle.

An existing example of this in the kernel is `assert_spin_locked()`.

Anyway, I will just remove it if this is going to block adoption of the patch.

Thanks,
Chris.
Christian Hopps Aug. 5, 2024, 4:19 a.m. UTC | #3
Christian Hopps <chopps@chopps.org> writes:

> Sabrina Dubroca <sd@queasysnail.net> writes:

>>> +	/* The opportunity for HW offload has ended */
>>> +	if (skb->ip_summed == CHECKSUM_PARTIAL) {
>>> +		err = skb_checksum_help(skb);
>>> +		if (err)
>>> +			return err;
>>> +	}
>>> +
>>> +	/* We've split these up before queuing */
>>> +	BUG_ON(skb_is_gso(skb));
>>
>> As I've said previously, I don't think that's a valid reason to
>> crash. BUG_ON should be used very rarely:
>>
>> https://elixir.bootlin.com/linux/v6.10/source/Documentation/process/coding-style.rst#L1230
>>
>> Dropping a bogus packet is an easy way to recover from this situation,
>> so we should not crash here (and probably in all of IPTFS).
>
> This is basically following a style of coding that aims to simplify overall code
> by eliminating multiple checks for the same condition over and over in code. It
> does this by arranging for a single invariant at the beginning of an operation and
> then counting on that from then on in the code. Asserts are the way to document
> this, if no assert then nothing b/c using a conditional is exactly against the
> design principle.
>
> An existing example of this in the kernel is `assert_spin_locked()`.
>
> Anyway, I will just remove it if this is going to block adoption of the patch.

Actually, I'll just convert all BUG_ON() to WARN_ON().

Thanks,
Chris.

> Thanks,
> Chris.
Sabrina Dubroca Aug. 6, 2024, 8:47 a.m. UTC | #4
2024-08-04, 22:33:05 -0400, Christian Hopps wrote:
> > > +/* 1) skb->head should be cache aligned.
> > > + * 2) when resv is for L2 headers (i.e., ethernet) we want the cacheline to
> > > + * start -16 from data.
> > > + * 3) when resv is for L3+L2 headers IOW skb->data points at the IPTFS payload
> > > + * we want data to be cache line aligned so all the pushed headers will be in
> > > + * another cacheline.
> > > + */
> > > +#define XFRM_IPTFS_MIN_L3HEADROOM 128
> > > +#define XFRM_IPTFS_MIN_L2HEADROOM (64 + 16)
> > 
> > How did you pick those values?
> 
> That's what the comment is talking about. When reserving space for L2 headers we pick 64 + 16 (a 2^(<=6) cacheline + 16 bytes) so that the cacheline starts -16 from where skb->data will point.

Hard-coding the x86 cacheline size is not a good idea. And what's the
16B for? You don't know that it's enough for the actual L2 headers.

> For L3 we reserve double the power of 2 space we reserved for L2 only.

But that's the core of my question. Why is that correct/enough?

> 
> We have to reserve some amount of space for pushed headers, so the above made sense to me for good performance/cache locality.
> 
> > > +static struct sk_buff *iptfs_alloc_skb(struct sk_buff *tpl, u32 len,
> > > +				       bool l3resv)
> > > +{
> > > +	struct sk_buff *skb;
> > > +	u32 resv;
> > > +
> > > +	if (!l3resv) {
> > > +		resv = XFRM_IPTFS_MIN_L2HEADROOM;
> > > +	} else {
> > > +		resv = skb_headroom(tpl);
> > > +		if (resv < XFRM_IPTFS_MIN_L3HEADROOM)
> > > +			resv = XFRM_IPTFS_MIN_L3HEADROOM;
> > > +	}
> > > +
> > > +	skb = alloc_skb(len + resv, GFP_ATOMIC);
> > > +	if (!skb) {
> > > +		XFRM_INC_STATS(dev_net(tpl->dev), LINUX_MIB_XFRMNOSKBERROR);
> > 
> > Hmpf, so we've gone from incrementing the wrong counter to
> > incrementing a new counter that doesn't have a precise meaning.
> 
> The new "No SKB" counter is supposed to mean "couldn't get an SKB",
> given plenty of other errors are logged under "OutErr" or "InErr"
> I'm not sure what level of precision you're looking for here. :)

OutErr and InErr would be better than that new counter IMO.


> > > +		return NULL;
> > > +	}
> > > +
> > > +	skb_reserve(skb, resv);
> > > +
> > > +	/* We do not want any of the tpl->headers copied over, so we do
> > > +	 * not use `skb_copy_header()`.
> > > +	 */
> > 
> > This is a bit of a bad sign for the implementation. It also worries
> > me, as this may not be updated when changes are made to
> > __copy_skb_header().
> > (c/p'd from v1 review since this was still not answered)
> 
> I don't agree that this is a bad design at all, I'm curious what you think a good design to be.

Strange skb manipulations hiding in a protocol module are not good
design.

c/p bits of core code into a module (where they will never get fixed
up when the core code gets updated) is always a bad idea.


> I did specifically state why we are not re-using
> skb_copy_header(). The functionality is different. We are not trying
> to make a copy of an skb we are using an skb as a template for new
> skbs.

I saw that. That doesn't mean it's a good thing to do.

> > > +/**
> > > + * skb_copy_bits_seq - copy bits from a skb_seq_state to kernel buffer
> > > + * @st: source skb_seq_state
> > > + * @offset: offset in source
> > > + * @to: destination buffer
> > > + * @len: number of bytes to copy
> > > + *
> > > + * Copy @len bytes from @offset bytes into the source @st to the destination
> > > + * buffer @to. `offset` should increase (or be unchanged) with each subsequent
> > > + * call to this function. If offset needs to decrease from the previous use `st`
> > > + * should be reset first.
> > > + *
> > > + * Return: 0 on success or a negative error code on failure
> > > + */
> > > +static int skb_copy_bits_seq(struct skb_seq_state *st, int offset, void *to,
> > > +			     int len)
> > 
> > Probably belongs in net/core/skbuff.c, although I'm really not
> > convinced copying data around is the right way to implement the type
> > of packet splitting IPTFS does (which sounds a bit like a kind of
> > GSO). And there are helpers in net/core/skbuff.c (such as
> > pskb_carve/pskb_extract) that seem to do similar things to what you
> > need here, without as much data copying.
> 
> I don't have an issue with moving more general skb functionality
> into skbuff.c; however, I do not want to gate IP-TFS on this change
> to the general net infra; it is appropriate for a patchset of its
> own.

If you need helpers that don't exist, it's part of your job to make
the core changes that are required to implement the functionality.

> Re copying: Let's be clear here, we are not always copying data,
> there are sharing code paths as well; however, there are times when
> it is the best (and even fastest) way to accomplish things (e.g.,
> b/c the packet is small or the data is arranged in skbs in a way to
> make sharing ridiculously complex and thus slow).

I'm not finding the sharing code. You mean iptfs_first_should_copy
returning false?
Christian Hopps Aug. 6, 2024, 8:54 a.m. UTC | #5
Sabrina Dubroca <sd@queasysnail.net> writes:

> 2024-08-04, 22:33:05 -0400, Christian Hopps wrote:
>> > > +/* 1) skb->head should be cache aligned.
>> > > + * 2) when resv is for L2 headers (i.e., ethernet) we want the cacheline to
>> > > + * start -16 from data.
>> > > + * 3) when resv is for L3+L2 headers IOW skb->data points at the IPTFS payload
>> > > + * we want data to be cache line aligned so all the pushed headers will be in
>> > > + * another cacheline.
>> > > + */
>> > > +#define XFRM_IPTFS_MIN_L3HEADROOM 128
>> > > +#define XFRM_IPTFS_MIN_L2HEADROOM (64 + 16)
>> >
>> > How did you pick those values?
>>
>> That's what the comment is talking about. When reserving space for L2 headers we
>> pick 64 + 16 (a 2^(<=6) cacheline + 16 bytes) so that the cacheline starts
>> -16 from where skb->data will point.
>
> Hard-coding the x86 cacheline size is not a good idea. And what's the
> 16B for? You don't know that it's enough for the actual L2 headers.

I am not hard-coding the x86 cacheline. I am picking 64 as the largest cacheline that this is optimized for; it also works for smaller cachelines.

16B is to allow for the incredibly common 14B L2 header to fit.


>> For L3 we reserve double the power of 2 space we reserved for L2 only.
>
> But that's the core of my question. Why is that correct/enough?

I have to pick a value. There is no magically perfect number that I can pick. I've given you technical reasons and justifications for the numbers I have chosen -- not sure what else I can say. Do you have better suggestions for the sizes which would be more optimal on more architectures? If not then let's use the numbers that I have given technical reasons for choosing.

Put this another way. I could just pick 128 b/c it's 2 cachelines and fits lots of different headers and would be "good enough". That's plenty justification too. I think you're looking for too much here -- this isn't a precision thing, it's a "Good Enough" thing.

>> We have to reserve some amount of space for pushed headers, so the above made sense to me for good performance/cache locality.
>>
>> > > +static struct sk_buff *iptfs_alloc_skb(struct sk_buff *tpl, u32 len,
>> > > +				       bool l3resv)
>> > > +{
>> > > +	struct sk_buff *skb;
>> > > +	u32 resv;
>> > > +
>> > > +	if (!l3resv) {
>> > > +		resv = XFRM_IPTFS_MIN_L2HEADROOM;
>> > > +	} else {
>> > > +		resv = skb_headroom(tpl);
>> > > +		if (resv < XFRM_IPTFS_MIN_L3HEADROOM)
>> > > +			resv = XFRM_IPTFS_MIN_L3HEADROOM;
>> > > +	}
>> > > +
>> > > +	skb = alloc_skb(len + resv, GFP_ATOMIC);
>> > > +	if (!skb) {
>> > > +		XFRM_INC_STATS(dev_net(tpl->dev), LINUX_MIB_XFRMNOSKBERROR);
>> >
>> > Hmpf, so we've gone from incrementing the wrong counter to
>> > incrementing a new counter that doesn't have a precise meaning.
>>
>> The new "No SKB" counter is supposed to mean "couldn't get an SKB",
>> given plenty of other errors are logged under "OutErr" or "InErr"
>> I'm not sure what level of precision you're looking for here. :)
>
> OutErr and InErr would be better than that new counter IMO.

Why?

My counter tracks the SKB depletion failure that is actually happening. Would you have me now pass in the direction argument just so I can tick the correct overly general MIB counter that provides less value to the user in identifying the actual problem? How is that good design?

I'm inclined to just delete the thing altogether rather than block on this thing that will almost never happen.

>> > > +		return NULL;
>> > > +	}
>> > > +
>> > > +	skb_reserve(skb, resv);
>> > > +
>> > > +	/* We do not want any of the tpl->headers copied over, so we do
>> > > +	 * not use `skb_copy_header()`.
>> > > +	 */
>> >
>> > This is a bit of a bad sign for the implementation. It also worries
>> > me, as this may not be updated when changes are made to
>> > __copy_skb_header().
>> > (c/p'd from v1 review since this was still not answered)
>>
>> I don't agree that this is a bad design at all, I'm curious what you think a good design to be.
>
> Strange skb manipulations hiding in a protocol module is not good
> design.

It's a fragmentation and aggregation protocol; it needs to work with skbs by design. It's literally the function of the protocol to manipulate packet content.

I would appreciate it if you could provide technical reasons to justify referring to things as "bad" or "strange" -- it's not helpful otherwise.

> c/p bits of core code into a module (where they will never get fixed
> up when the core code gets updated) is always a bad idea.

I need some values from the SKB, so I copy them -- it's that simple.

>> I did specifically state why we are not re-using
>> skb_copy_header(). The functionality is different. We are not trying
>> to make a copy of an skb we are using an skb as a template for new
>> skbs.
>
> I saw that. That doesn't mean it's a good thing to do.

Please suggest an alternative.

>> > > +/**
>> > > + * skb_copy_bits_seq - copy bits from a skb_seq_state to kernel buffer
>> > > + * @st: source skb_seq_state
>> > > + * @offset: offset in source
>> > > + * @to: destination buffer
>> > > + * @len: number of bytes to copy
>> > > + *
>> > > + * Copy @len bytes from @offset bytes into the source @st to the destination
>> > > + * buffer @to. `offset` should increase (or be unchanged) with each subsequent
>> > > + * call to this function. If offset needs to decrease from the previous use `st`
>> > > + * should be reset first.
>> > > + *
>> > > + * Return: 0 on success or a negative error code on failure
>> > > + */
>> > > +static int skb_copy_bits_seq(struct skb_seq_state *st, int offset, void *to,
>> > > +			     int len)
>> >
>> > Probably belongs in net/core/skbuff.c, although I'm really not
>> > convinced copying data around is the right way to implement the type
>> > of packet splitting IPTFS does (which sounds a bit like a kind of
>> > GSO). And there are helpers in net/core/skbuff.c (such as
>> > pskb_carve/pskb_extract) that seem to do similar things to what you
>> > need here, without as much data copying.
>>
>> I don't have an issue with moving more general skb functionality
>> into skbuff.c; however, I do not want to gate IP-TFS on this change
>> to the general net infra; it is appropriate for a patchset of its
>> own.
>
> If you need helpers that don't exist, it's part of your job to make
> the core changes that are required to implement the functionality.

This is part of a new protocol and feature addition, and it's a single use. Another patchset can present this code to the general network community to see if they think it *also* has value outside of IPTFS. There is *no* reason to delay IPTFS on general network infrastructure improvements. Please don't do this.

>> Re copying: Let's be clear here, we are not always copying data,
>> there are sharing code paths as well; however, there are times when
>> it is the best (and even fastest) way to accomplish things (e.g.,
>> b/c the packet is small or the data is arranged in skbs in a way to
>> make sharing ridiculously complex and thus slow).
>
> I'm not finding the sharing code. You mean iptfs_first_should_copy
> returning false?


        /* Try share then copy. */
        if (fragwalk && skb_can_add_frags(newskb, fragwalk, data, copylen)) {
        ...
                leftover = skb_add_frags(newskb, fragwalk, data, copylen);
        } else {
                /* copy fragment data into newskb */
                if (skb_copy_bits_seq(st, data, skb_put(newskb, copylen),
                ...
        }

Thanks,
Chris.
Florian Westphal Aug. 6, 2024, 10:03 a.m. UTC | #6
Christian Hopps <chopps@chopps.org> wrote:
> > > > > +	if (!l3resv) {
> > > > > +		resv = XFRM_IPTFS_MIN_L2HEADROOM;
> > > > > +	} else {
> > > > > +		resv = skb_headroom(tpl);
> > > > > +		if (resv < XFRM_IPTFS_MIN_L3HEADROOM)
> > > > > +			resv = XFRM_IPTFS_MIN_L3HEADROOM;
> > > > > +	}
> > > > > +
> > > > > +	skb = alloc_skb(len + resv, GFP_ATOMIC);
> > > > > +	if (!skb) {
> > > > > +		XFRM_INC_STATS(dev_net(tpl->dev), LINUX_MIB_XFRMNOSKBERROR);
> > > >
> > > > Hmpf, so we've gone from incrementing the wrong counter to
> > > > incrementing a new counter that doesn't have a precise meaning.
> > > 
> > > The new "No SKB" counter is supposed to mean "couldn't get an SKB",
> > > given plenty of other errors are logged under "OutErr" or "InErr"
> > I'm not sure what level of precision you're looking for here. :)
> > 
> > OutErr and InErr would be better than that new counter IMO.
> 
> Why?
> 
> My counter tracks the SKB depletion failure that is actually happening. Would you have me now pass in the direction argument just so I can tick the correct overly general MIB counter that provides less value to the user in identifying the actual problem? How is that good design?
> 
> I'm inclined to just delete the thing altogether rather than block on this thing that will almost never happen.

Makes sense to me; skb allocation failure is transient anyway, and there is
no action that could be taken if this error counter is incrementing.

You might want to pass GFP_ATOMIC | __GFP_NOWARN to alloc_skb() to avoid
any splats given this is a high-volume allocation.
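Concretely, a sketch of the suggested change; the surrounding code is as in
the patch, only the allocation flags differ:

        skb = alloc_skb(len + resv, GFP_ATOMIC | __GFP_NOWARN);
        if (!skb)
                return NULL;    /* transient failure; drop without a splat */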
Christian Hopps Aug. 6, 2024, 10:05 a.m. UTC | #7
Florian Westphal <fw@strlen.de> writes:

> Christian Hopps <chopps@chopps.org> wrote:
>> > > > > +	if (!l3resv) {
>> > > > > +		resv = XFRM_IPTFS_MIN_L2HEADROOM;
>> > > > > +	} else {
>> > > > > +		resv = skb_headroom(tpl);
>> > > > > +		if (resv < XFRM_IPTFS_MIN_L3HEADROOM)
>> > > > > +			resv = XFRM_IPTFS_MIN_L3HEADROOM;
>> > > > > +	}
>> > > > > +
>> > > > > +	skb = alloc_skb(len + resv, GFP_ATOMIC);
>> > > > > +	if (!skb) {
>> > > > > +		XFRM_INC_STATS(dev_net(tpl->dev), LINUX_MIB_XFRMNOSKBERROR);
>> > > >
>> > > > Hmpf, so we've gone from incrementing the wrong counter to
>> > > > incrementing a new counter that doesn't have a precise meaning.
>> > >
>> > > The new "No SKB" counter is supposed to mean "couldn't get an SKB",
>> > > given plenty of other errors are logged under "OutErr" or "InErr"
>> > I'm not sure what level of precision you're looking for here. :)
>> >
>> > OutErr and InErr would be better than that new counter IMO.
>>
>> Why?
>>
>> My counter tracks the SKB depletion failure that is actually happening. Would
>> you have me now pass in the direction argument just so I can tick the correct
>> overly general MIB counter that provides less value to the user in identifying
>> the actual problem? How is that good design?
>>
>> I'm inclined to just delete the thing altogether rather than block on this thing that will almost never happen.
>
> Makes sense to me, skb allocation failure is transient anyway, there is
> no action that could be taken if this error counter is incrementing.
>
> You might want to pass GFP_ATOMIC | __GFP_NOWARN to alloc_skb() to avoid
> any splats given this is a high-volume allocation.

Will do.

Thanks,
Chris.
Sabrina Dubroca Aug. 6, 2024, 11:05 a.m. UTC | #8
2024-08-06, 04:54:53 -0400, Christian Hopps wrote:
> 
> Sabrina Dubroca <sd@queasysnail.net> writes:
> 
> > 2024-08-04, 22:33:05 -0400, Christian Hopps wrote:
> > > > > +/* 1) skb->head should be cache aligned.
> > > > > + * 2) when resv is for L2 headers (i.e., ethernet) we want the cacheline to
> > > > > + * start -16 from data.
> > > > > + * 3) when resv is for L3+L2 headers IOW skb->data points at the IPTFS payload
> > > > > + * we want data to be cache line aligned so all the pushed headers will be in
> > > > > + * another cacheline.
> > > > > + */
> > > > > +#define XFRM_IPTFS_MIN_L3HEADROOM 128
> > > > > +#define XFRM_IPTFS_MIN_L2HEADROOM (64 + 16)
> > > >
> > > > How did you pick those values?
> > > 
> > > That's what the comment is talking about. When reserving space for L2 headers we
> > > pick 64 + 16 (a 2^(<=6) cacheline + 16 bytes) so that the cacheline starts
> > > -16 from where skb->data will point.
> > 
> > Hard-coding the x86 cacheline size is not a good idea. And what's the
> > 16B for? You don't know that it's enough for the actual L2 headers.
> 
> I am not hard coding the x86 cacheline. I am picking 64 as the largest cacheline that this is optimized for, it also works for smaller cachelines.

At least use SMP_CACHE_BYTES then?

> 16B is to allow for the incredibly common 14B L2 header to fit.

Why not use skb->dev->needed_headroom, like a bunch of tunnels are
already doing? No guessing required. Ethernet is the most common, but
there's no reason to penalize other protocols when the information is
available.
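For example, a sketch of what that could look like in iptfs_alloc_skb(),
assuming tpl->dev is the egress device; max_t() keeps the existing constant
as a floor:

        if (!l3resv)
                resv = max_t(u32, XFRM_IPTFS_MIN_L2HEADROOM,
                             tpl->dev->needed_headroom);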

> > > For L3 we reserve double the power of 2 space we reserved for L2 only.
> > 
> > But that's the core of my question. Why is that correct/enough?
> 
> I have to pick a value. There is no magically perfect number that I can pick. I've given you technical reasons and justifications for the numbers I have chosen -- not sure what else I can say. Do you have better suggestions for the sizes which would be more optimal on more architectures? If not then let's use the numbers that I have given technical reasons for choosing.

Yes, now you've spelled it out, and we can evaluate your choices.

> Put this another way. I could just pick 128 b/c it's 2 cachelines
> and fits lots of different headers and would be "good
> enough". That's plenty justification too. I think you looking for
> too much here -- this isn't a precision thing, it's a "Good Enough"
> thing.

I'm asking questions. That's kind of the reviewer's job, understanding
how the thing they're reviewing works. ¯\_(ツ)_/¯

> > > We have to reserve some amount of space for pushed headers, so the above made sense to me for good performance/cache locality.
> > > 
> > > > > +static struct sk_buff *iptfs_alloc_skb(struct sk_buff *tpl, u32 len,
> > > > > +				       bool l3resv)
> > > > > +{
> > > > > +	struct sk_buff *skb;
> > > > > +	u32 resv;
> > > > > +
> > > > > +	if (!l3resv) {
> > > > > +		resv = XFRM_IPTFS_MIN_L2HEADROOM;
> > > > > +	} else {
> > > > > +		resv = skb_headroom(tpl);
> > > > > +		if (resv < XFRM_IPTFS_MIN_L3HEADROOM)
> > > > > +			resv = XFRM_IPTFS_MIN_L3HEADROOM;
> > > > > +	}
> > > > > +
> > > > > +	skb = alloc_skb(len + resv, GFP_ATOMIC);
> > > > > +	if (!skb) {
> > > > > +		XFRM_INC_STATS(dev_net(tpl->dev), LINUX_MIB_XFRMNOSKBERROR);
> > > >
> > > > Hmpf, so we've gone from incrementing the wrong counter to
> > > > incrementing a new counter that doesn't have a precise meaning.
> > > 
> > > The new "No SKB" counter is supposed to mean "couldn't get an SKB",
> > > given plenty of other errors are logged under "OutErr" or "InErr"
> > > I'm not sure what level of precision you're looking for here. :)
> > 
> > OutErr and InErr would be better than that new counter IMO.
> 
> Why?
> 
> My counter tracks the SKB depletion failure that is actually happening. Would you have me now pass in the direction argument just so I can tick the correct overly general MIB counter that provides less value to the user in identifying the actual problem? How is that good design?
> 
> I'm inclined to just delete the thing altogether rather than block on this thing that will almost never happen.

Fine.

> > > > > +		return NULL;
> > > > > +	}
> > > > > +
> > > > > +	skb_reserve(skb, resv);
> > > > > +
> > > > > +	/* We do not want any of the tpl->headers copied over, so we do
> > > > > +	 * not use `skb_copy_header()`.
> > > > > +	 */
> > > >
> > > > This is a bit of a bad sign for the implementation. It also worries
> > > > me, as this may not be updated when changes are made to
> > > > __copy_skb_header().
> > > > (c/p'd from v1 review since this was still not answered)
> > > 
> > > I don't agree that this is a bad design at all, I'm curious what you think a good design to be.
> > 
> > Strange skb manipulations hiding in a protocol module is not good
> > design.
> 
> It's a fragmentation and aggregation protocol; it needs to work with skbs by design. It's literally the function of the protocol to manipulate packet content.

packet content != cherry-picked parts of sk_buff

> I would appreciate it if you could provide technical reasons to justify referring to things as "bad" or "strange" -- it's not helpful otherwise.

I did say it's a bad sign, not a blocking issue on its own. But that
bad sign, combined with the unusual use of skb_seq and a lot of
copying data around, indicates that this is not the right way to
implement this part of the protocol.

> > c/p bits of core code into a module (where they will never get fixed
> > up when the core code gets updated) is always a bad idea.
> 
> I need some values from the SKB, so I copy them -- it's that simple.
> 
> > > I did specifically state why we are not re-using
> > > skb_copy_header(). The functionality is different. We are not trying
> > > to make a copy of an skb we are using an skb as a template for new
> > > skbs.
> > 
> > I saw that. That doesn't mean it's a good thing to do.
> 
> Please suggest an alternative.

A common helper in a location where people are going to know that they
need to fix it up when they modify things about sk_buff would be a
good start.

> > > > > +/**
> > > > > + * skb_copy_bits_seq - copy bits from a skb_seq_state to kernel buffer
> > > > > + * @st: source skb_seq_state
> > > > > + * @offset: offset in source
> > > > > + * @to: destination buffer
> > > > > + * @len: number of bytes to copy
> > > > > + *
> > > > > + * Copy @len bytes from @offset bytes into the source @st to the destination
> > > > > + * buffer @to. `offset` should increase (or be unchanged) with each subsequent
> > > > > + * call to this function. If offset needs to decrease from the previous use `st`
> > > > > + * should be reset first.
> > > > > + *
> > > > > + * Return: 0 on success or a negative error code on failure
> > > > > + */
> > > > > +static int skb_copy_bits_seq(struct skb_seq_state *st, int offset, void *to,
> > > > > +			     int len)
> > > >
> > > > Probably belongs in net/core/skbuff.c, although I'm really not
> > > > convinced copying data around is the right way to implement the type
> > > > of packet splitting IPTFS does (which sounds a bit like a kind of
> > > > GSO). And there are helpers in net/core/skbuff.c (such as
> > > > pskb_carve/pskb_extract) that seem to do similar things to what you
> > > > need here, without as much data copying.
> > > 
> > > I don't have an issue with moving more general skb functionality
> > > into skbuff.c; however, I do not want to gate IP-TFS on this change
> > > to the general net infra; it is appropriate for a patchset of its
> > > own.
> > 
> > If you need helpers that don't exist, it's part of your job to make
> > the core changes that are required to implement the functionality.
> 
> This is part of a new protocol and feature addition, and it's a single use.

Of course the helper would be single use when it's introduced. You
don't know if it will remain single use. And pskb_extract is single
use, it's fine.

> Another patchset can present this code to the general network
> community to see if they think it *also* has value outside of
> IPTFS. There is *no* reason to delay IPTFS on general network
> infrastructure improvements. Please don't do this.

Sorry, I don't think that's how it works.

> > > Re copying: Let's be clear here, we are not always copying data,
> > > there are sharing code paths as well; however, there are times when
> > > it is the best (and even fastest) way to accomplish things (e.g.,
> > > b/c the packet is small or the data is arranged in skbs in a way to
> > > make sharing ridiculously complex and thus slow).
> > 
> > I'm not finding the sharing code. You mean iptfs_first_should_copy
> > returning false?
> 
> 
>        /* Try share then copy. */
>        if (fragwalk && skb_can_add_frags(newskb, fragwalk, data, copylen)) {
>        ...
>                leftover = skb_add_frags(newskb, fragwalk, data, copylen);
>        } else {
>                /* copy fragment data into newskb */
>                if (skb_copy_bits_seq(st, data, skb_put(newskb, copylen),
>                ...
>        }

You're talking about reassembly now. This patch is fragmentation/TX.
Christian Hopps Aug. 6, 2024, 11:07 a.m. UTC | #9
Sabrina Dubroca <sd@queasysnail.net> writes:

> 2024-08-06, 04:54:53 -0400, Christian Hopps wrote:
>>
>> Sabrina Dubroca <sd@queasysnail.net> writes:
>>
>> > 2024-08-04, 22:33:05 -0400, Christian Hopps wrote:
>> > > > > +/* 1) skb->head should be cache aligned.
>> > > > > + * 2) when resv is for L2 headers (i.e., ethernet) we want the cacheline to
>> > > > > + * start -16 from data.
>> > > > > + * 3) when resv is for L3+L2 headers IOW skb->data points at the IPTFS payload
>> > > > > + * we want data to be cache line aligned so all the pushed headers will be in
>> > > > > + * another cacheline.
>> > > > > + */
>> > > > > +#define XFRM_IPTFS_MIN_L3HEADROOM 128
>> > > > > +#define XFRM_IPTFS_MIN_L2HEADROOM (64 + 16)
>> > > >
>> > > > How did you pick those values?
>> > >
>> > > That's what the comment is talking about. When reserving space for L2 headers we
>> > > pick 64 + 16 (a 2^(<=6) cacheline + 16 bytes) so that the cacheline starts
>> > > -16 from where skb->data will point.
>> >
>> > Hard-coding the x86 cacheline size is not a good idea. And what's the
>> > 16B for? You don't know that it's enough for the actual L2 headers.
>>
>> I am not hard coding the x86 cacheline. I am picking 64 as the largest cacheline that this is optimized for, it also works for smaller cachelines.
>
> At least use SMP_CACHE_BYTES then?

Ok.

>> 16B is to allow for the incredibly common 14B L2 header to fit.
>
> Why not use skb->dev->needed_headroom, like a bunch of tunnels are
> already doing? No guessing required. ethernet is the most common, but
> there's no reason to penalize other protocols when the information is
> available.

>> > > For L3 we reserve double the power of 2 space we reserved for L2 only.
>> >
>> > But that's the core of my question. Why is that correct/enough?
>>
>> I have to pick a value. There is no magically perfect number that I can pick.
>> I've given you technical reasons and justifications for the numbers I have
>> chosen -- not sure what else I can say. Do you have better suggestions for the
>> sizes which would be more optimal on more architectures? If not then let's use
>> the numbers that I have given technical reasons for choosing.
>
> Yes, now you've spelled it out, and we can evaluate your choices.
>
>> Put this another way. I could just pick 128 b/c it's 2 cachelines
>> and fits lots of different headers and would be "good
>> enough". That's plenty justification too. I think you looking for
>> too much here -- this isn't a precision thing, it's a "Good Enough"
>> thing.
>
> I'm asking questions. That's kind of the reviewer's job, understanding
> how the thing they're reviewing works. ¯\_(ツ)_/¯
>
>> > > We have to reserve some amount of space for pushed headers, so the above made sense to me for good performance/cache locality.
>> > >
>> > > > > +static struct sk_buff *iptfs_alloc_skb(struct sk_buff *tpl, u32 len,
>> > > > > +				       bool l3resv)
>> > > > > +{
>> > > > > +	struct sk_buff *skb;
>> > > > > +	u32 resv;
>> > > > > +
>> > > > > +	if (!l3resv) {
>> > > > > +		resv = XFRM_IPTFS_MIN_L2HEADROOM;
>> > > > > +	} else {
>> > > > > +		resv = skb_headroom(tpl);
>> > > > > +		if (resv < XFRM_IPTFS_MIN_L3HEADROOM)
>> > > > > +			resv = XFRM_IPTFS_MIN_L3HEADROOM;
>> > > > > +	}
>> > > > > +
>> > > > > +	skb = alloc_skb(len + resv, GFP_ATOMIC);
>> > > > > +	if (!skb) {
>> > > > > +		XFRM_INC_STATS(dev_net(tpl->dev), LINUX_MIB_XFRMNOSKBERROR);
>> > > >
>> > > > Hmpf, so we've gone from incrementing the wrong counter to
>> > > > incrementing a new counter that doesn't have a precise meaning.
>> > >
>> > > The new "No SKB" counter is supposed to mean "couldn't get an SKB",
>> > > given plenty of other errors are logged under "OutErr" or "InErr"
>> > > I'm not sure what level of precision you're looking for here. :)
>> >
>> > OutErr and InErr would be better than that new counter IMO.
>>
>> Why?
>>
>> My counter tracks the SKB depletion failure that is actually happening. Would
>> you have me now pass in the direction argument just so I can tick the correct
>> overly general MIB counter that provides less value to the user in identifying
>> the actual problem? How is that good design?
>>
>> I'm inclined to just delete the thing altogether rather than block on this thing that will almost never happen.
>
> Fine.
>
>> > > > > +		return NULL;
>> > > > > +	}
>> > > > > +
>> > > > > +	skb_reserve(skb, resv);
>> > > > > +
>> > > > > +	/* We do not want any of the tpl->headers copied over, so we do
>> > > > > +	 * not use `skb_copy_header()`.
>> > > > > +	 */
>> > > >
>> > > > This is a bit of a bad sign for the implementation. It also worries
>> > > > me, as this may not be updated when changes are made to
>> > > > __copy_skb_header().
>> > > > (c/p'd from v1 review since this was still not answered)
>> > >
>> > > I don't agree that this is a bad design at all, I'm curious what you think a good design to be.
>> >
>> > Strange skb manipulations hiding in a protocol module is not good
>> > design.
>>
>> It's a fragmentation and aggregation protocol; it needs to work with skbs by design. It's literally the function of the protocol to manipulate packet content.
>
> packet content != cherry-picked parts of sk_buff
>
>> I would appreciate it if you could provide technical reasons to justify referring to things as "bad" or "strange" -- it's not helpful otherwise.
>
> I did say it's a bad sign, not a blocking issue on its own. But that
> bad sign, combined with the unusual use of skb_seq and a lot of
> copying data around, indicates that this is not the right way to
> implement this part of the protocol.

The seq walk is not a bad design; I am literally using an existing walk API to walk the possibly complex nested chain of skbs.

All I've added is a useful utility function using that API to copy chunks of data from the chain. It does this using a single seq walk through the chain. It's the most obvious clean solution I can imagine to extract the fragments -- hardly bad design, the opposite really.

>> > c/p bits of core code into a module (where they will never get fixed
>> > up when the core code gets updated) is always a bad idea.
>>
>> I need some values from the SKB, so I copy them -- it's that simple.
>>
>> > > I did specifically state why we are not re-using
>> > > skb_copy_header(). The functionality is different. We are not trying
>> > > to make a copy of an skb we are using an skb as a template for new
>> > > skbs.
>> >
>> > I saw that. That doesn't mean it's a good thing to do.
>>
>> Please suggest an alternative.
>
> A common helper in a location where people are going to know that they
> need to fix it up when they modify things about sk_buff would be a
> good start.

What I am copying is specific to the IP-TFS use case; that's what I am trying to convey here. I am copying some data from the SKB; it is not a generalized cloning operation that should be shared by anyone.

Is the advice that I should move this IPTFS functionality into skbuff.c?

>> > > > > +/**
>> > > > > + * skb_copy_bits_seq - copy bits from a skb_seq_state to kernel buffer
>> > > > > + * @st: source skb_seq_state
>> > > > > + * @offset: offset in source
>> > > > > + * @to: destination buffer
>> > > > > + * @len: number of bytes to copy
>> > > > > + *
>> > > > > + * Copy @len bytes from @offset bytes into the source @st to the destination
>> > > > > + * buffer @to. `offset` should increase (or be unchanged) with each subsequent
>> > > > > + * call to this function. If offset needs to decrease from the previous use `st`
>> > > > > + * should be reset first.
>> > > > > + *
>> > > > > + * Return: 0 on success or a negative error code on failure
>> > > > > + */
>> > > > > +static int skb_copy_bits_seq(struct skb_seq_state *st, int offset, void *to,
>> > > > > +			     int len)
>> > > >
>> > > > Probably belongs in net/core/skbuff.c, although I'm really not
>> > > > convinced copying data around is the right way to implement the type
>> > > > of packet splitting IPTFS does (which sounds a bit like a kind of
>> > > > GSO). And there are helpers in net/core/skbuff.c (such as
>> > > > pskb_carve/pskb_extract) that seem to do similar things to what you
>> > > > need here, without as much data copying.
>> > >
>> > > I don't have an issue with moving more general skb functionality
>> > > into skbuff.c; however, I do not want to gate IP-TFS on this change
>> > > to the general net infra; it is appropriate for a patchset of its
>> > > own.
>> >
>> > If you need helpers that don't exist, it's part of your job to make
>> > the core changes that are required to implement the functionality.
>>
>> This is part of a new protocol and feature addition, and it's a single use.
>
> Of course the helper would be single use when it's introduced. You
> don't know if it will remain single use. And pskb_extract is single
> use, it's fine.

That is what a nice targeted patch and review process on netdev would reveal.

>> Another patchset can present this code to the general network
>> community to see if they think it *also* has value outside of
>> IPTFS. There is *no* reason to delay IPTFS on general network
>> infrastructure improvements. Please don't do this.
>
> Sorry, I don't think that's how it works.

I guess I disagree. This feels like trying to boil the ocean. Let's introduce this major new feature, IPTFS. That's enough to handle for this patchset and review.

>> > > Re copying: Let's be clear here, we are not always copying data,
>> > > there are sharing code paths as well; however, there are times when
>> > > it is the best (and even fastest) way to accomplish things (e.g.,
>> > > b/c the packet is small or the data is arranged in skbs in a way to
>> > > make sharing ridiculously complex and thus slow).
>> >
>> > I'm not finding the sharing code. You mean iptfs_first_should_copy
>> > returning false?
>>
>>
>>        /* Try share then copy. */
>>        if (fragwalk && skb_can_add_frags(newskb, fragwalk, data, copylen)) {
>>        ...
>>                leftover = skb_add_frags(newskb, fragwalk, data, copylen);
>>        } else {
>>                /* copy fragment data into newskb */
>>                if (skb_copy_bits_seq(st, data, skb_put(newskb, copylen),
>>                ...
>>        }
>
> You're talking about reassembly now. This patch is fragmentation/TX.

Correct.

This code has been tested and performs quite well, especially for an initial implementation. It keeps up with existing ESP in the general IPmix case, and even outperforms the existing IPsec/ESP for small-packet flows when aggregation kicks in. We also gain all the other benefits of IPTFS framing.

Adding to the code path you're referring to is a possible enhancement task; it will be complex, and as it is not required to achieve on-par or even better performance than the existing code, it should not block or be required for the initial IPTFS implementation.

Thanks,
Chris.
Steffen Klassert Aug. 6, 2024, 11:07 a.m. UTC | #10
On Tue, Aug 06, 2024 at 04:54:53AM -0400, Christian Hopps wrote:
> 
> Sabrina Dubroca <sd@queasysnail.net> writes:
> 
> > 2024-08-04, 22:33:05 -0400, Christian Hopps wrote:
> > > > > +/* 1) skb->head should be cache aligned.
> > > > > + * 2) when resv is for L2 headers (i.e., ethernet) we want the cacheline to
> > > > > + * start -16 from data.
> > > > > + * 3) when resv is for L3+L2 headers IOW skb->data points at the IPTFS payload
> > > > > + * we want data to be cache line aligned so all the pushed headers will be in
> > > > > + * another cacheline.
> > > > > + */
> > > > > +#define XFRM_IPTFS_MIN_L3HEADROOM 128
> > > > > +#define XFRM_IPTFS_MIN_L2HEADROOM (64 + 16)
> > > >
> > > > How did you pick those values?
> > > 
> > > That's what the comment is talking about. When reserving space for L2 headers we
> > > pick 64 + 16 (a 2^(<=6) cacheline + 16 bytes) so that the cacheline starts
> > > -16 from where skb->data will point.
> > 
> > Hard-coding the x86 cacheline size is not a good idea. And what's the
> > 16B for? You don't know that it's enough for the actual L2 headers.
> 
> I am not hard coding the x86 cacheline. I am picking 64 as the largest cacheline that this is optimized for, it also works for smaller cachelines.

Maybe use L1_CACHE_BYTES instead of 64? This will give you
the actual size of the cacheline.

> > > > > +
> > > > > +	skb_reserve(skb, resv);
> > > > > +
> > > > > +	/* We do not want any of the tpl->headers copied over, so we do
> > > > > +	 * not use `skb_copy_header()`.
> > > > > +	 */
> > > >
> > > > This is a bit of a bad sign for the implementation. It also worries
> > > > me, as this may not be updated when changes are made to
> > > > __copy_skb_header().
> > > > (c/p'd from v1 review since this was still not answered)
> > > 
> > > I don't agree that this is a bad design at all, I'm curious what you think a good design to be.
> > 
> > Strange skb manipulations hiding in a protocol module is not good
> > design.
> 
> It's a fragmentation and aggregation protocol; it needs to work with skbs by design. It's literally the function of the protocol to manipulate packet content.
> 
> I would appreciate it if you could provide technical reasons to justify referring to things as "bad" or "strange" -- it's not helpful otherwise.
> 
> > c/p bits of core code into a module (where they will never get fixed
> > up when the core code gets updated) is always a bad idea.
> 
> I need some values from the SKB, so I copy them -- it's that simple.
> 
> > > I did specifically state why we are not re-using
> > > skb_copy_header(). The functionality is different. We are not trying
> > > to make a copy of an skb we are using an skb as a template for new
> > > skbs.
> > 
> > I saw that. That doesn't mean it's a good thing to do.
> 
> Please suggest an alternative.

Maybe create a helper like this:

void ___copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
{
        new->tstamp             = old->tstamp;
        /* We do not copy old->sk */
        new->dev                = old->dev;
        memcpy(new->cb, old->cb, sizeof(old->cb)); 
        skb_dst_copy(new, old);  
        __skb_ext_copy(new, old);
        __nf_copy(new, old, false);
}

and change __copy_skb_header() to use this too. That way it gets
updated whenever something changes here.
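Roughly, __copy_skb_header() would then start with a call to the new helper,
so the shared fields live in exactly one place (sketch only):

static void __copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
{
        ___copy_skb_header(new, old);
        /* ... the remaining, copy-only field assignments stay here ... */
}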

It also might make sense to split out the generic infrastructure changes
into a separate patchset with netdev maintainers Cced on. That would make
the changes more visible.
Steffen Klassert Aug. 6, 2024, 11:32 a.m. UTC | #11
On Mon, Aug 05, 2024 at 12:25:57AM +0200, Sabrina Dubroca wrote:
> 
> > +/**
> > + * skb_copy_bits_seq - copy bits from a skb_seq_state to kernel buffer
> > + * @st: source skb_seq_state
> > + * @offset: offset in source
> > + * @to: destination buffer
> > + * @len: number of bytes to copy
> > + *
> > + * Copy @len bytes from @offset bytes into the source @st to the destination
> > + * buffer @to. `offset` should increase (or be unchanged) with each subsequent
> > + * call to this function. If offset needs to decrease from the previous use `st`
> > + * should be reset first.
> > + *
> > + * Return: 0 on success or a negative error code on failure
> > + */
> > +static int skb_copy_bits_seq(struct skb_seq_state *st, int offset, void *to,
> > +			     int len)
> 
> Probably belongs in net/core/skbuff.c, although I'm really not
> convinced copying data around is the right way to implement the type
> of packet splitting IPTFS does (which sounds a bit like a kind of
> GSO).

I tried to come up with a 'GSO like' variant of this when I did the
initial review last year at the IPsec workshop. But it turned out
that things will get even more complicated as they are now.
We did some performance tests and it was quite compareable to
tunnel mode, so for a first implementation I'd be ok with the
copy variant.


> And there are helpers in net/core/skbuff.c (such as
> pskb_carve/pskb_extract) that seem to do similar things to what you
> need here, without as much data copying.

In case we have helpers that will fit here, we should use them of
course.
Christian Hopps Aug. 7, 2024, 4:23 p.m. UTC | #12
Steffen Klassert <steffen.klassert@secunet.com> writes:

> On Tue, Aug 06, 2024 at 04:54:53AM -0400, Christian Hopps wrote:
>>
>> Sabrina Dubroca <sd@queasysnail.net> writes:
>>
>> > 2024-08-04, 22:33:05 -0400, Christian Hopps wrote:
>> > > > > +/* 1) skb->head should be cache aligned.
>> > > > > + * 2) when resv is for L2 headers (i.e., ethernet) we want the cacheline to
>> > > > > + * start -16 from data.
>> > > > > + * 3) when resv is for L3+L2 headers IOW skb->data points at the IPTFS payload
>> > > > > + * we want data to be cache line aligned so all the pushed headers will be in
>> > > > > + * another cacheline.
>> > > > > + */
>> > > > > +#define XFRM_IPTFS_MIN_L3HEADROOM 128
>> > > > > +#define XFRM_IPTFS_MIN_L2HEADROOM (64 + 16)
>> > > >
>> > > > How did you pick those values?
>> > >
>> > > That's what the comment is talking about. When reserving space for L2 headers we
>> > > pick 64 + 16 (a 2^(<=6) cacheline + 16 bytes) so the cacheline should start
>> > > -16 from where skb->data will point.
>> >
>> > Hard-coding the x86 cacheline size is not a good idea. And what's the
>> > 16B for? You don't know that it's enough for the actual L2 headers.
>>
>> I am not hard coding the x86 cacheline. I am picking 64 as the largest cacheline that this is optimized for, it also works for smaller cachelines.
>
> Maybe use L1_CACHE_BYTES instead of 64? This will give you
> the actual size of the cacheline.

Yes, although a bit more than just a swap:

#define XFRM_IPTFS_MIN_L2HEADROOM (L1_CACHE_BYTES > 64 ? 64 : 64 + 16)

Here's the new comment text which explains this:

/*
 * L2 Header resv: Arrange for cacheline to start at skb->data - 16 to keep the
 * to-be-pushed L2 header in the same cacheline as the resulting `skb->data` (i.e.,
 * the L3 header). If cacheline size is > 64 then skb->data + pushed L2 will all
 * be in a single cacheline if we simply reserve 64 bytes.
 */

I'm simply protecting against some new arch that decides to have a 256-byte cacheline, since we do not need to reserve 256 bytes for L2 headers.
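
As a sanity check of that layout, here is a tiny userspace toy (it
assumes a 64-byte cacheline and a cacheline-aligned skb head; it is
not kernel code):

#include <stdio.h>

int main(void)
{
        unsigned int resv = 64 + 16;     /* XFRM_IPTFS_MIN_L2HEADROOM */
        unsigned int data = resv;        /* skb->data offset after skb_reserve() */
        unsigned int line = data & ~63u; /* start of the cacheline holding skb->data */

        /* 16 bytes of headroom share skb->data's cacheline: enough
         * for the common 14-byte ethernet header. */
        printf("data=%u line=%u shared headroom=%u\n",
               data, line, data - line);
        return 0;
}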

>> > > > > +	skb_reserve(skb, resv);
>> > > > > +
>> > > > > +	/* We do not want any of the tpl->headers copied over, so we do
>> > > > > +	 * not use `skb_copy_header()`.
>> > > > > +	 */
>> > > >
>> > > > This is a bit of a bad sign for the implementation. It also worries
>> > > > me, as this may not be updated when changes are made to
>> > > > __copy_skb_header().
>> > > > (c/p'd from v1 review since this was still not answered)
>> > >
>> > > I don't agree that this is a bad design at all; I'm curious what you think a good design would be.
>> >
>> > Strange skb manipulations hiding in a protocol module is not good
>> > design.
>>
>> It's a fragmentation and aggregation protocol; it needs to work with skbs by design. It's literally the function of the protocol to manipulate packet content.
>>
>> I would appreciate it if you could provide technical reasons to justify referring to things as "bad" or "strange" -- it's not helpful otherwise.
>>
>> > c/p bits of core code into a module (where they will never get fixed
>> > up when the core code gets updated) is always a bad idea.
>>
>> I need some values from the SKB, so I copy them -- it's that simple.
>>
>> > > I did specifically state why we are not re-using
>> > > skb_copy_header(). The functionality is different. We are not trying
>> > > to make a copy of an skb we are using an skb as a template for new
>> > > skbs.
>> >
>> > I saw that. That doesn't mean it's a good thing to do.
>>
>> Please suggest an alternative.
>
> Maybe create a helper like this:
>
> void ___copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
> {
>         new->tstamp             = old->tstamp;
>         /* We do not copy old->sk */
>         new->dev                = old->dev;
>         memcpy(new->cb, old->cb, sizeof(old->cb));
>         skb_dst_copy(new, old);
>         __skb_ext_copy(new, old);
>         __nf_copy(new, old, false);
> }
>
> and change __copy_skb_header() to use this too. That way it gets
> updated whenever something changes here.

Ok.

Thanks,
Chris.

> It also might make sense to split out the generic infrastructure changes
> into a separate patchset with netdev maintainers CCed on. That would make
> the changes more visible.
Christian Hopps Aug. 7, 2024, 7:40 p.m. UTC | #13
Steffen Klassert <steffen.klassert@secunet.com> writes:

> On Mon, Aug 05, 2024 at 12:25:57AM +0200, Sabrina Dubroca wrote:
>>
>> > +/**
>> > + * skb_copy_bits_seq - copy bits from a skb_seq_state to kernel buffer
>> > + * @st: source skb_seq_state
>> > + * @offset: offset in source
>> > + * @to: destination buffer
>> > + * @len: number of bytes to copy
>> > + *
>> > + * Copy @len bytes from @offset bytes into the source @st to the destination
>> > + * buffer @to. `offset` should increase (or be unchanged) with each subsequent
>> > + * call to this function. If offset needs to decrease from the previous use, `st`
>> > + * should be reset first.
>> > + *
>> > + * Return: 0 on success or a negative error code on failure
>> > + */
>> > +static int skb_copy_bits_seq(struct skb_seq_state *st, int offset, void *to,
>> > +			     int len)
>>
>> Probably belongs in net/core/skbuff.c, although I'm really not
>> convinced copying data around is the right way to implement the type
>> of packet splitting IPTFS does (which sounds a bit like a kind of
>> GSO).
>
> I tried to come up with a 'GSO like' variant of this when I did the
> initial review last year at the IPsec workshop. But it turned out
> that things would get even more complicated than they are now.
> We did some performance tests and it was quite comparable to
> tunnel mode, so for a first implementation I'd be ok with the
> copy variant.
>
>
>> And there are helpers in net/core/skbuff.c (such as
>> pskb_carve/pskb_extract) that seem to do similar things to what you
>> need here, without as much data copying.
>
> In case we have helpers that will fit here, we should use them of
> course.

FWIW, the reason I used the simple iptfs_copy_create_frag() rather than pskb_extract() is that pskb_extract() uses skb_clone() on the original skb and then pskb_carve() to narrow the (copied) data pointers to a subset of the original. The new skb data is read-only, which does not work for us.

Each of these new skbs is an IP-TFS tunnel packet, and as such we need to push and write IPTFS+ESP+IP+Ethernet headers onto them. To make pskb_extract()'s skbs writable we would have to allocate new buffer space and copy the data, turning them into writable skb buffers, and now we're doing something more complex and more CPU intensive to arrive back at what iptfs_copy_create_frag() did simply and straightforwardly to begin with.
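
To make that trade-off concrete, a hedged sketch of what pushing the
outer headers onto a carved skb would entail (push_outer_headers() and
`needed` are made up; skb_cow_head()/__skb_push() are the existing
core skb API):

static int push_outer_headers(struct sk_buff *skb, unsigned int needed)
{
        /* A carved skb has skb->head == skb->data and shares its data,
         * so skb_cow_head() must reallocate-and-copy the head before
         * any push -- i.e., exactly the copy we set out to avoid.
         */
        if (skb_cow_head(skb, needed))
                return -ENOMEM;

        __skb_push(skb, needed); /* room for IPTFS+ESP+IP+eth */
        return 0;
}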

Thanks,
Chris.
Sabrina Dubroca Aug. 8, 2024, 9:26 a.m. UTC | #14
2024-08-07, 15:40:14 -0400, Christian Hopps wrote:
> 
> Steffen Klassert <steffen.klassert@secunet.com> writes:
> 
> > On Mon, Aug 05, 2024 at 12:25:57AM +0200, Sabrina Dubroca wrote:
> > > 
> > > > +/**
> > > > + * skb_copy_bits_seq - copy bits from a skb_seq_state to kernel buffer
> > > > + * @st: source skb_seq_state
> > > > + * @offset: offset in source
> > > > + * @to: destination buffer
> > > > + * @len: number of bytes to copy
> > > > + *
> > > > + * Copy @len bytes from @offset bytes into the source @st to the destination
> > > > + * buffer @to. `offset` should increase (or be unchanged) with each subsequent
> > > > + * call to this function. If offset needs to decrease from the previous use, `st`
> > > > + * should be reset first.
> > > > + *
> > > > + * Return: 0 on success or a negative error code on failure
> > > > + */
> > > > +static int skb_copy_bits_seq(struct skb_seq_state *st, int offset, void *to,
> > > > +			     int len)
> > > 
> > > Probably belongs in net/core/skbuff.c, although I'm really not
> > > convinced copying data around is the right way to implement the type
> > > of packet splitting IPTFS does (which sounds a bit like a kind of
> > > GSO).
> > 
> > I tried to come up with a 'GSO like' variant of this when I did the
> > initial review last year at the IPsec workshop. But it turned out
> > that things would get even more complicated than they are now.
> > We did some performance tests and it was quite comparable to
> > tunnel mode, so for a first implementation I'd be ok with the
> > copy variant.

Ok.

> > > And there are helpers in net/core/skbuff.c (such as
> > > pskb_carve/pskb_extract) that seem to do similar things to what you
> > > need here, without as much data copying.
> > 
> > In case we have helpers that will fit here, we should use them of
> > course.
> 
> FWIW, the reason I used the simple iptfs_copy_create_frag() rather
> than pskb_extract() is that pskb_extract() uses skb_clone() on the
> original skb and then pskb_carve() to narrow the (copied) data
> pointers to a subset of the original. The new skb data is read-only,
> which does not work for us.
> 
> Each of these new skbs is an IP-TFS tunnel packet, and as such we
> need to push and write IPTFS+ESP+IP+Ethernet headers onto them. To
> make pskb_extract()'s skbs writable we would have to allocate new
> buffer space and copy the data, turning them into writable skb
> buffers, and now we're doing something more complex and more CPU
> intensive to arrive back at what iptfs_copy_create_frag() did simply
> and straightforwardly to begin with.

That only requires the header to be writeable, not the full packet,
right? I doubt it would actually be more cpu/memory intensive.
Christian Hopps Aug. 8, 2024, 11:23 a.m. UTC | #15
Sabrina Dubroca <sd@queasysnail.net> writes:

> 2024-08-07, 15:40:14 -0400, Christian Hopps wrote:
>>
>> Steffen Klassert <steffen.klassert@secunet.com> writes:
>>
>> > On Mon, Aug 05, 2024 at 12:25:57AM +0200, Sabrina Dubroca wrote:
>> > >
>> > > > +/**
>> > > > + * skb_copy_bits_seq - copy bits from a skb_seq_state to kernel buffer
>> > > > + * @st: source skb_seq_state
>> > > > + * @offset: offset in source
>> > > > + * @to: destination buffer
>> > > > + * @len: number of bytes to copy
>> > > > + *
>> > > > + * Copy @len bytes from @offset bytes into the source @st to the destination
>> > > > + * buffer @to. `offset` should increase (or be unchanged) with each subsequent
>> > > > + * call to this function. If offset needs to decrease from the previous use, `st`
>> > > > + * should be reset first.
>> > > > + *
>> > > > + * Return: 0 on success or a negative error code on failure
>> > > > + */
>> > > > +static int skb_copy_bits_seq(struct skb_seq_state *st, int offset, void *to,
>> > > > +			     int len)
>> > >
>> > > Probably belongs in net/core/skbuff.c, although I'm really not
>> > > convinced copying data around is the right way to implement the type
>> > > of packet splitting IPTFS does (which sounds a bit like a kind of
>> > > GSO).
>> >
>> > I tried to come up with a 'GSO like' variant of this when I did the
>> > initial review last year at the IPsec workshop. But it turned out
>> > that things would get even more complicated than they are now.
>> > We did some performance tests and it was quite comparable to
>> > tunnel mode, so for a first implementation I'd be ok with the
>> > copy variant.
>
> Ok.
>
>> > > And there are helpers in net/core/skbuff.c (such as
>> > > pskb_carve/pskb_extract) that seem to do similar things to what you
>> > > need here, without as much data copying.
>> >
>> > In case we have helpers that will fit here, we should use them of
>> > course.
>>
>> FWIW, the reason I used the simple iptfs_copy_create_frag() rather
>> than pskb_extract() is that pskb_extract() uses skb_clone() on the
>> original skb and then pskb_carve() to narrow the (copied) data
>> pointers to a subset of the original. The new skb data is read-only,
>> which does not work for us.
>>
>> Each of these new skbs is an IP-TFS tunnel packet, and as such we
>> need to push and write IPTFS+ESP+IP+Ethernet headers onto them. To
>> make pskb_extract()'s skbs writable we would have to allocate new
>> buffer space and copy the data, turning them into writable skb
>> buffers, and now we're doing something more complex and more CPU
>> intensive to arrive back at what iptfs_copy_create_frag() did simply
>> and straightforwardly to begin with.
>
> That only requires the header to be writeable, not the full packet,
> right? I doubt it would actually be more cpu/memory intensive.

The pskb_extract() function leaves `skb->head == skb->data`, so there is no room left to push the headers that we need to push.

FWIW, pskb_extract() is used in one place for TCP; it's not suited for IPTFS.

Thanks,
Chris.
Christian Hopps Aug. 8, 2024, 11:30 a.m. UTC | #16
Sabrina Dubroca <sd@queasysnail.net> writes:

> 2024-08-06, 04:54:53 -0400, Christian Hopps wrote:
>>
>> Sabrina Dubroca <sd@queasysnail.net> writes:
>>
>> > 2024-08-04, 22:33:05 -0400, Christian Hopps wrote:
>> > > > > +/* 1) skb->head should be cache aligned.
>> > > > > + * 2) when resv is for L2 headers (i.e., ethernet) we want the cacheline to
>> > > > > + * start -16 from data.
>> > > > > + * 3) when resv is for L3+L2 headers IOW skb->data points at the IPTFS payload
>> > > > > + * we want data to be cache line aligned so all the pushed headers will be in
>> > > > > + * another cacheline.
>> > > > > + */
>> > > > > +#define XFRM_IPTFS_MIN_L3HEADROOM 128
>> > > > > +#define XFRM_IPTFS_MIN_L2HEADROOM (64 + 16)
>> > > >
>> > > > How did you pick those values?
>> > >
>> > > That's what the comment is talking about. When reserving space for L2 headers we
>> > > pick 64 + 16 (a 2^(<=6) cacheline + 16 bytes) so the cacheline should start
>> > > -16 from where skb->data will point.
>> >
>> > Hard-coding the x86 cacheline size is not a good idea. And what's the
>> > 16B for? You don't know that it's enough for the actual L2 headers.
>>
>> I am not hard coding the x86 cacheline. I am picking 64 as the largest cacheline that this is optimized for, it also works for smaller cachelines.
>
> At least use SMP_CACHE_BYTES then?

Right, I have changed this to work with the L1_CACHE_BYTES value.

>> 16B is to allow for the incredibly common 14B L2 header to fit.
>
> Why not use skb->dev->needed_headroom, like a bunch of tunnels are
> already doing? No guessing required. ethernet is the most common, but
> there's no reason to penalize other protocols when the information is
> available.

We can't use `skb->dev->needed_headroom`, b/c `skb->dev` is not correct for the new packets. `skb->dev` is from the received IPTFS tunnel packet. The skbs being created here are the inner user packets leaving the tunnel, so they have an L3 header (which is why we are only making room for an L2 header). They are being handed to GRO receive and still have to be routed to their correct destination interface/dev.

16 handles the general common case of an ethernet device being the destination; if it's not correct after routing, nothing is broken, it just means that we may or may not achieve this maximal cache locality (but we still might, e.g., if it's destined to a GRE tunnel then we are looking at a bunch more headers, so they and the existing L3 header will occupy 2 cachelines anyway). Again, this is a best-effort thing.

Thanks,
Chris.


>
>> > > For L3 we reserve double the power of 2 space we reserved for L2 only.
>> >
>> > But that's the core of my question. Why is that correct/enough?
>>
>> I have to pick a value. There is no magically perfect number that I can pick.
>> I've given you technical reasons and justifications for the numbers I have
>> chosen -- not sure what else I can say. Do you have better suggestions for the
>> sizes which would be more optimal on more architectures? If not then let's use
>> the numbers that I have given technical reasons for choosing.
>
> Yes, now you've spelled it out, and we can evaluate your choices.
>
>> Put this another way. I could just pick 128 b/c it's 2 cachelines
>> and fits lots of different headers and would be "good
>> enough". That's plenty justification too. I think you looking for
>> too much here -- this isn't a precision thing, it's a "Good Enough"
>> thing.
>
> I'm asking questions. That's kind of the reviewer's job, understanding
> how the thing they're reviewing works. ¯\_(ツ)_/¯
>
>> > > We have to reserve some amount of space for pushed headers, so the above made sense to me for good performance/cache locality.
>> > >
>> > > > > +static struct sk_buff *iptfs_alloc_skb(struct sk_buff *tpl, u32 len,
>> > > > > +				       bool l3resv)
>> > > > > +{
>> > > > > +	struct sk_buff *skb;
>> > > > > +	u32 resv;
>> > > > > +
>> > > > > +	if (!l3resv) {
>> > > > > +		resv = XFRM_IPTFS_MIN_L2HEADROOM;
>> > > > > +	} else {
>> > > > > +		resv = skb_headroom(tpl);
>> > > > > +		if (resv < XFRM_IPTFS_MIN_L3HEADROOM)
>> > > > > +			resv = XFRM_IPTFS_MIN_L3HEADROOM;
>> > > > > +	}
>> > > > > +
>> > > > > +	skb = alloc_skb(len + resv, GFP_ATOMIC);
>> > > > > +	if (!skb) {
>> > > > > +		XFRM_INC_STATS(dev_net(tpl->dev), LINUX_MIB_XFRMNOSKBERROR);
>> > > >
>> > > > Hmpf, so we've gone from incrementing the wrong counter to
>> > > > incrementing a new counter that doesn't have a precise meaning.
>> > >
>> > > The new "No SKB" counter is supposed to mean "couldn't get an SKB",
>> > > given plenty of other errors are logged under "OutErr" or "InErr",
>> > > I'm not sure what level of precision you're looking for here. :)
>> >
>> > OutErr and InErr would be better than that new counter IMO.
>>
>> Why?
>>
>> My counter tracks the SKB depletion failure that is actually happening. Would
>> you have me now pass in the direction argument just so I can tick the correct
>> overly general MIB counter that provides less value to the user in identifying
>> the actual problem? How is that good design?
>>
>> I'm inclined to just delete the thing altogether rather than block on this thing that will almost never happen.
>
> Fine.
>
>> > > > > +		return NULL;
>> > > > > +	}
>> > > > > +
>> > > > > +	skb_reserve(skb, resv);
>> > > > > +
>> > > > > +	/* We do not want any of the tpl->headers copied over, so we do
>> > > > > +	 * not use `skb_copy_header()`.
>> > > > > +	 */
>> > > >
>> > > > This is a bit of a bad sign for the implementation. It also worries
>> > > > me, as this may not be updated when changes are made to
>> > > > __copy_skb_header().
>> > > > (c/p'd from v1 review since this was still not answered)
>> > >
>> > > I don't agree that this is a bad design at all; I'm curious what you think a good design would be.
>> >
>> > Strange skb manipulations hiding in a protocol module is not good
>> > design.
>>
>> It's a fragmentation and aggregation protocol; it needs to work with skbs by design. It's literally the function of the protocol to manipulate packet content.
>
> packet content != cherry-picked parts of sk_buff
>
>> I would appreciate it if you could provide technical reasons to justify referring to things as "bad" or "strange" -- it's not helpful otherwise.
>
> I did say it's a bad sign, not a blocking issue on its own. But that
> bad sign, combined with the unusual use of skb_seq and a lot of
> copying data around, indicates that this is not the right way to
> implement this part of the protocol.
>
>> > c/p bits of core code into a module (where they will never get fixed
>> > up when the core code gets updated) is always a bad idea.
>>
>> I need some values from the SKB, so I copy them -- it's that simple.
>>
>> > > I did specifically state why we are not re-using
>> > > skb_copy_header(). The functionality is different. We are not trying
>> > > to make a copy of an skb we are using an skb as a template for new
>> > > skbs.
>> >
>> > I saw that. That doesn't mean it's a good thing to do.
>>
>> Please suggest an alternative.
>
> A common helper in a location where people are going to know that they
> need to fix it up when they modify things about sk_buff would be a
> good start.
>
>> > > > > +/**
>> > > > > + * skb_copy_bits_seq - copy bits from a skb_seq_state to kernel buffer
>> > > > > + * @st: source skb_seq_state
>> > > > > + * @offset: offset in source
>> > > > > + * @to: destination buffer
>> > > > > + * @len: number of bytes to copy
>> > > > > + *
>> > > > > + * Copy @len bytes from @offset bytes into the source @st to the destination
>> > > > > + * buffer @to. `offset` should increase (or be unchanged) with each subsequent
>> > > > > + * call to this function. If offset needs to decrease from the previous use, `st`
>> > > > > + * should be reset first.
>> > > > > + *
>> > > > > + * Return: 0 on success or a negative error code on failure
>> > > > > + */
>> > > > > +static int skb_copy_bits_seq(struct skb_seq_state *st, int offset, void *to,
>> > > > > +			     int len)
>> > > >
>> > > > Probably belongs in net/core/skbuff.c, although I'm really not
>> > > > convinced copying data around is the right way to implement the type
>> > > > of packet splitting IPTFS does (which sounds a bit like a kind of
>> > > > GSO). And there are helpers in net/core/skbuff.c (such as
>> > > > pskb_carve/pskb_extract) that seem to do similar things to what you
>> > > > need here, without as much data copying.
>> > >
>> > > I don't have an issue with moving more general skb functionality
>> > > into skbuff.c; however, I do not want to gate IP-TFS on this change
>> > > to the general net infra, it is appropriate for a patchset of it's
>> > > own.
>> >
>> > If you need helpers that don't exist, it's part of your job to make
>> > the core changes that are required to implement the functionality.
>>
>> This is part of a new protocol and feature addition, and it's a single use.
>
> Of course the helper would be single use when it's introduced. You
> don't know if it will remain single use. And pskb_extract is single
> use, it's fine.
>
>> Another patchset can present this code to the general network
>> community to see if they think it *also* has value outside of
>> IPTFS. There is *no* reason to delay IPTFS on general network
>> infrastructure improvements. Please don't do this.
>
> Sorry, I don't think that's how it works.
>
>> > > Re copying: Let's be clear here, we are not always copying data,
>> > > there are sharing code paths as well; however, there are times when
>> > > it is the best (and even fastest) way to accomplish things (e.g.,
>> > > b/c the packet is small or the data is arranged in skbs in a way to
>> > > make sharing ridiculously complex and thus slow).
>> >
>> > I'm not finding the sharing code. You mean iptfs_first_should_copy
>> > returning false?
>>
>>
>>        /* Try share then copy. */
>>        if (fragwalk && skb_can_add_frags(newskb, fragwalk, data, copylen)) {
>>        ...
>>                leftover = skb_add_frags(newskb, fragwalk, data, copylen);
>>        } else {
>>                /* copy fragment data into newskb */
>>                if (skb_copy_bits_seq(st, data, skb_put(newskb, copylen),
>>                ...
>>        }
>
> You're talking about reassembly now. This patch is fragmentation/TX.
Sabrina Dubroca Aug. 8, 2024, 1:28 p.m. UTC | #17
2024-08-08, 07:30:13 -0400, Christian Hopps wrote:
> 
> Sabrina Dubroca <sd@queasysnail.net> writes:
> 
> > 2024-08-06, 04:54:53 -0400, Christian Hopps wrote:
> > > 
> > > Sabrina Dubroca <sd@queasysnail.net> writes:
> > > 
> > > > 2024-08-04, 22:33:05 -0400, Christian Hopps wrote:
> > > > > > > +/* 1) skb->head should be cache aligned.
> > > > > > > + * 2) when resv is for L2 headers (i.e., ethernet) we want the cacheline to
> > > > > > > + * start -16 from data.
> > > > > > > + * 3) when resv is for L3+L2 headers IOW skb->data points at the IPTFS payload
> > > > > > > + * we want data to be cache line aligned so all the pushed headers will be in
> > > > > > > + * another cacheline.
> > > > > > > + */
> > > > > > > +#define XFRM_IPTFS_MIN_L3HEADROOM 128
> > > > > > > +#define XFRM_IPTFS_MIN_L2HEADROOM (64 + 16)
> > > > > >
> > > > > > How did you pick those values?
> > > > >
> > > > > That's what the comment is talking about. When reserving space for L2 headers we
> > > > > pick 64 + 16 (a 2^(<=6) cacheline + 16 bytes) so the cacheline should start
> > > > > -16 from where skb->data will point.
> > > >
> > > > Hard-coding the x86 cacheline size is not a good idea. And what's the
> > > > 16B for? You don't know that it's enough for the actual L2 headers.
> > > 
> > > I am not hard coding the x86 cacheline. I am picking 64 as the largest cacheline that this is optimized for, it also works for smaller cachelines.
> > 
> > At least use SMP_CACHE_BYTES then?
> 
> Right, I have changed this to work with the L1_CACHE_BYTES value.
> 
> > > 16B is to allow for the incredibly common 14B L2 header to fit.
> > 
> > Why not use skb->dev->needed_headroom, like a bunch of tunnels are
> > already doing? No guessing required. ethernet is the most common, but
> > there's no reason to penalize other protocols when the information is
> > available.
> 
> We can't use `skb->dev->needed_headroom`, b/c `skb->dev` is not
> correct for the new packets. `skb->dev` is from the received IPTFS
> tunnel packet. The skbs being created here are the inner user packets
> leaving the tunnel, so they have an L3 header (which is why we are only
> making room for an L2 header). They are being handed to GRO receive and
> still have to be routed to their correct destination interface/dev.

You're talking about RX now. You're assuming the main use-case is an
IPsec GW that's going to send the decapped packets out on another
ethernet interface? (or at least, that's a use-case worth
optimizing for)

What about TX? Is skb->dev->needed_headroom also incorrect there?

Is iptfs_alloc_skb's l3resv argument equivalent to a RX/TX switch?

> 16 handles the general common case of an ethernet device being the
> destination; if it's not correct after routing, nothing is broken,
> it just means that we may or may not achieve this maximal cache
> locality (but we still might, e.g., if it's destined to a GRE tunnel
> then we are looking at a bunch more headers, so they and the existing
> L3 header will occupy 2 cachelines anyway). Again, this is a
> best-effort thing.

Ok.
Christian Hopps Aug. 8, 2024, 1:35 p.m. UTC | #18
> On Aug 8, 2024, at 09:28, Sabrina Dubroca <sd@queasysnail.net> wrote:
> 
> 2024-08-08, 07:30:13 -0400, Christian Hopps wrote:
>> 
>> Sabrina Dubroca <sd@queasysnail.net> writes:
>> 
>>> 2024-08-06, 04:54:53 -0400, Christian Hopps wrote:
>>>> 
>>>> Sabrina Dubroca <sd@queasysnail.net> writes:
>>>> 
>>>>> 2024-08-04, 22:33:05 -0400, Christian Hopps wrote:
>>>>>>>> +/* 1) skb->head should be cache aligned.
>>>>>>>> + * 2) when resv is for L2 headers (i.e., ethernet) we want the cacheline to
>>>>>>>> + * start -16 from data.
>>>>>>>> + * 3) when resv is for L3+L2 headers IOW skb->data points at the IPTFS payload
>>>>>>>> + * we want data to be cache line aligned so all the pushed headers will be in
>>>>>>>> + * another cacheline.
>>>>>>>> + */
>>>>>>>> +#define XFRM_IPTFS_MIN_L3HEADROOM 128
>>>>>>>> +#define XFRM_IPTFS_MIN_L2HEADROOM (64 + 16)
>>>>>>> 
>>>>>>> How did you pick those values?
>>>>>> 
>>>>>> That's what the comment is talking about. When reserving space for L2 headers we
>>>>>> pick 64 + 16 (a 2^(<=6) cacheline + 16 bytes) so the cacheline should start
>>>>>> -16 from where skb->data will point.
>>>>> 
>>>>> Hard-coding the x86 cacheline size is not a good idea. And what's the
>>>>> 16B for? You don't know that it's enough for the actual L2 headers.
>>>> 
>>>> I am not hard coding the x86 cacheline. I am picking 64 as the largest cacheline that this is optimized for, it also works for smaller cachelines.
>>> 
>>> At least use SMP_CACHE_BYTES then?
>> 
>> Right, I have changed this to work with the L1_CACHE_BYTES value.
>> 
>>>> 16B is to allow for the incredibly common 14B L2 header to fit.
>>> 
>>> Why not use skb->dev->needed_headroom, like a bunch of tunnels are
>>> already doing? No guessing required. ethernet is the most common, but
>>> there's no reason to penalize other protocols when the information is
>>> available.
>> 
>> We can't use `skb->dev->needed_headroom`, b/c `skb->dev` is not
>> correct for the new packets. `skb->dev` is from the received IPTFS
>> tunnel packet. The skbs being created here are the inner user packets
>> leaving the tunnel, so they have an L3 header (which is why we are only
>> making room for an L2 header). They are being handed to GRO receive and
>> still have to be routed to their correct destination interface/dev.
> 
> You're talking about RX now. You're assuming the main use-case is an
> IPsec GW that's going to send the decapped packets out on another
> ethernet interface? (or at least, that's a use-case worth
> optimizing for)
> 
> What about TX? Is skb->dev->needed_headroom also incorrect there?
> 
> Is iptfs_alloc_skb's l3resv argument equivalent to a RX/TX switch?

Exactly right. When we are generating IPTFS tunnel packets we need to add all the L3+L2 headers, and in that case we pass l3resv = true.
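
In call-site terms (the TX call is from this patch; the RX call is
only a sketch with a made-up `tunnel_skb`, since that side lands later
in the series):

/* TX: new outer fragment; leave room to push eth + IP + ESP + IPTFS */
skb = iptfs_alloc_skb(st->root_skb, copy_len, true);

/* RX: head skb for a decapped inner packet; only an L2 header will
 * ever be pushed before it is handed up the stack. */
skb = iptfs_alloc_skb(tunnel_skb, 0, false);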

Thanks,
Chris.

> 
>> 16 handles the general common case of an ethernet device being the
>> destination; if it's not correct after routing, nothing is broken,
>> it just means that we may or may not achieve this maximal cache
>> locality (but we still might, e.g., if it's destined to a GRE tunnel
>> then we are looking at a bunch more headers, so they and the existing
>> L3 header will occupy 2 cachelines anyway). Again, this is a
>> best-effort thing.
> 
> Ok.
> 
> -- 
> Sabrina
Sabrina Dubroca Aug. 8, 2024, 2:01 p.m. UTC | #19
2024-08-08, 09:35:04 -0400, Christian Hopps wrote:
> 
> 
> > On Aug 8, 2024, at 09:28, Sabrina Dubroca <sd@queasysnail.net> wrote:
> > 
> > 2024-08-08, 07:30:13 -0400, Christian Hopps wrote:
> >> 
> >> Sabrina Dubroca <sd@queasysnail.net> writes:
> >> 
> >>> 2024-08-06, 04:54:53 -0400, Christian Hopps wrote:
> >>>> 
> >>>> Sabrina Dubroca <sd@queasysnail.net> writes:
> >>>> 
> >>>>> 2024-08-04, 22:33:05 -0400, Christian Hopps wrote:
> >>>>>>>> +/* 1) skb->head should be cache aligned.
> >>>>>>>> + * 2) when resv is for L2 headers (i.e., ethernet) we want the cacheline to
> >>>>>>>> + * start -16 from data.
> >>>>>>>> + * 3) when resv is for L3+L2 headers IOW skb->data points at the IPTFS payload
> >>>>>>>> + * we want data to be cache line aligned so all the pushed headers will be in
> >>>>>>>> + * another cacheline.
> >>>>>>>> + */
> >>>>>>>> +#define XFRM_IPTFS_MIN_L3HEADROOM 128
> >>>>>>>> +#define XFRM_IPTFS_MIN_L2HEADROOM (64 + 16)
> >>>>>>> 
> >>>>>>> How did you pick those values?
> >>>>>> 
> >>>>>> That's what the comment is talking about. When reserving space for L2 headers we
> >>>>>> pick 64 + 16 (a 2^(<=6) cacheline + 16 bytes) so the cacheline should start
> >>>>>> -16 from where skb->data will point.
> >>>>> 
> >>>>> Hard-coding the x86 cacheline size is not a good idea. And what's the
> >>>>> 16B for? You don't know that it's enough for the actual L2 headers.
> >>>> 
> >>>> I am not hard coding the x86 cacheline. I am picking 64 as the largest cacheline that this is optimized for, it also works for smaller cachelines.
> >>> 
> >>> At least use SMP_CACHE_BYTES then?
> >> 
> >> Right, I have changed this to work with the L1_CACHE_BYTES value.
> >> 
> >>>> 16B is to allow for the incredibly common 14B L2 header to fit.
> >>> 
> >>> Why not use skb->dev->needed_headroom, like a bunch of tunnels are
> >>> already doing? No guessing required. ethernet is the most common, but
> >>> there's no reason to penalize other protocols when the information is
> >>> available.
> >> 
> >> We can't use `skb->dev->needed_headroom`, b/c `skb->dev` is not
> >> correct for the new packets. `skb->dev` is from the received IPTFS
> >> tunnel packet. The skbs being created here are the inner user packets
> >> leaving the tunnel, so they have an L3 header (which is why we are only
> >> making room for an L2 header). They are being handed to GRO receive and
> >> still have to be routed to their correct destination interface/dev.
> > 
> > You're talking about RX now. You're assuming the main use-case is an
> > IPsec GW that's going to send the decapped packets out on another
> > ethernet interface? (or at least, that's a use-case worth
> > optimizing for)
> > 
> > What about TX? Is skb->dev->needed_headroom also incorrect there?
> > 
> > Is iptfs_alloc_skb's l3resv argument equivalent to a RX/TX switch?
> 
> Exactly right. When we are generating IPTFS tunnel packets we need
> to add all the L3+L2 headers, and in that case we pass l3resv =
> true.

Could you add a little comment alongside iptfs_alloc_skb? It would
help make sense of the sizes you're choosing and how they fit the use
of those skbs (something like "l3resv=true is used on TX, because we
need to reserve L2+L3 headers. On RX, we only need L2 headers because
[reason why we need L2 headers].").

And if skb->dev->needed_headroom is correct in the TX case, I'd still
prefer (skb->dev->needed_headroom + <some space for l3>) to a fixed 128.
Christian Hopps Aug. 8, 2024, 9:42 p.m. UTC | #20
> On Aug 8, 2024, at 10:01, Sabrina Dubroca <sd@queasysnail.net> wrote:
> 
> 2024-08-08, 09:35:04 -0400, Christian Hopps wrote:
>> 
>> 
>>> On Aug 8, 2024, at 09:28, Sabrina Dubroca <sd@queasysnail.net> wrote:
>>> 
>>> 2024-08-08, 07:30:13 -0400, Christian Hopps wrote:
>>>> 
>>>> Sabrina Dubroca <sd@queasysnail.net> writes:
>>>> 
>>>>> 2024-08-06, 04:54:53 -0400, Christian Hopps wrote:
>>>>>> 
>>>>>> Sabrina Dubroca <sd@queasysnail.net> writes:
>>>>>> 
>>>>>>> 2024-08-04, 22:33:05 -0400, Christian Hopps wrote:
>>>>>>>>>> +/* 1) skb->head should be cache aligned.
>>>>>>>>>> + * 2) when resv is for L2 headers (i.e., ethernet) we want the cacheline to
>>>>>>>>>> + * start -16 from data.
>>>>>>>>>> + * 3) when resv is for L3+L2 headers IOW skb->data points at the IPTFS payload
>>>>>>>>>> + * we want data to be cache line aligned so all the pushed headers will be in
>>>>>>>>>> + * another cacheline.
>>>>>>>>>> + */
>>>>>>>>>> +#define XFRM_IPTFS_MIN_L3HEADROOM 128
>>>>>>>>>> +#define XFRM_IPTFS_MIN_L2HEADROOM (64 + 16)
>>>>>>>>> 
>>>>>>>>> How did you pick those values?
>>>>>>>> 
>>>>>>>> That's what the comment is talking about. When reserving space for L2 headers we
>>>>>>>> pick 64 + 16 (a 2^(<=6) cacheline + 16 bytes) so the cacheline should start
>>>>>>>> -16 from where skb->data will point.
>>>>>>> 
>>>>>>> Hard-coding the x86 cacheline size is not a good idea. And what's the
>>>>>>> 16B for? You don't know that it's enough for the actual L2 headers.
>>>>>> 
>>>>>> I am not hard coding the x86 cacheline. I am picking 64 as the largest cacheline that this is optimized for, it also works for smaller cachelines.
>>>>> 
>>>>> At least use SMP_CACHE_BYTES then?
>>>> 
>>>> Right, I have changed this to work with the L1_CACHE_BYTES value.
>>>> 
>>>>>> 16B is to allow for the incredibly common 14B L2 header to fit.
>>>>> 
>>>>> Why not use skb->dev->needed_headroom, like a bunch of tunnels are
>>>>> already doing? No guessing required. ethernet is the most common, but
>>>>> there's no reason to penalize other protocols when the information is
>>>>> available.
>>>> 
>>>> We can't use `skb->dev->needed_headroom`, b/c `skb->dev` is not
>>>> correct for the new packets. `skb->dev` is from the received IPTFS
>>>> tunnel packet. The skbs being created here are the inner user packets
>>>> leaving the tunnel, so they have an L3 header (which is why we are only
>>>> making room for an L2 header). They are being handed to GRO receive and
>>>> still have to be routed to their correct destination interface/dev.
>>> 
>>> You're talking about RX now. You're assuming the main use-case is an
>>> IPsec GW that's going to send the decapped packets out on another
>>> ethernet interface? (or at least, that's a use-case worth
>>> optimizing for)
>>> 
>>> What about TX? Is skb->dev->needed_headroom also incorrect there?
>>> 
>>> Is iptfs_alloc_skb's l3resv argument equivalent to a RX/TX switch?
>> 
>> Exactly right. When we are generating IPTFS tunnel packets we need
>> to add all the L3+L2 headers, and in that case we pass l3resv =
>> true.
> 
> Could you add a little comment alongside iptfs_alloc_skb? It would
> help make sense of the sizes you're choosing and how they fit the use
> of those skbs (something like "l3resv=true is used on TX, because we
> need to reserve L2+L3 headers. On RX, we only need L2 headers because
> [reason why we need L2 headers].").

Sure.

> And if skb->dev->needed_headroom is correct in the TX case, I'd still
> prefer (skb->dev->needed_headroom + <some space for l3>) to a fixed 128.

So dev->needed_headroom is defined as possible extra needed headroom. For ethernet it is 12. It's not what we want.

The actual MAC header size in this case is dev->hard_header_len, which is 14.

(gdb) p st->root_skb
$2 = (struct sk_buff *) 0xffff888012969a00
(gdb) p (struct dst_entry *)$2->_skb_refdst
$3 = (struct dst_entry *) 0xffff888012ef7180
(gdb) p $3->dev
$5 = (struct net_device *) 0xffff88800dca8000
(gdb) p $5->needed_headroom
$6 = 12
(gdb) p $5->hard_header_len
$7 = 14
(gdb) p $5->min_header_len
$8 = 14 '\016'

We also have access to the IPTFS-required header space by looking in the tpl->dst...

# L3 header space requirement for xfrm
(gdb) p $3->header_len
$4 = 40

which in this case is 40 == IP (20) + ESP (8) + GCM-IV (8) + IPTFS (4)

So what I think you're looking for is this:

struct dst_entry *dst = skb_dst(tpl);

resv = LL_RESERVED_SPACE(dst->dev) + dst->header_len;
resv = L1_CACHE_ALIGN(resv);

FWIW (not that much) this is 128 in the ethernet dev case :)

# LL_RESERVED_SPACE()
(gdb) p (14 + 12 + 16) & ~15
$9 = 32

# above + dst->header_len
(gdb) p 40 + 32
$10 = 72

# aligned to L1_CACHE_BYTES
(gdb) p (72 + 64) & ~63
$12 = 128
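
Putting that together, a sketch of how the reservation logic in
iptfs_alloc_skb() might look with this change (a sketch against the
patch below, not the final code):

static struct sk_buff *iptfs_alloc_skb(struct sk_buff *tpl, u32 len,
                                       bool l3resv)
{
        struct sk_buff *skb;
        u32 resv;

        if (!l3resv) {
                /* RX: only an L2 header will be pushed */
                resv = XFRM_IPTFS_MIN_L2HEADROOM;
        } else {
                /* TX: size L2 + outer L3 (IP/ESP/IPTFS) space from the
                 * template's dst rather than using a fixed 128.
                 */
                struct dst_entry *dst = skb_dst(tpl);

                resv = LL_RESERVED_SPACE(dst->dev) + dst->header_len;
                resv = L1_CACHE_ALIGN(resv);
        }

        skb = alloc_skb(len + resv, GFP_ATOMIC);
        if (!skb)
                return NULL;

        skb_reserve(skb, resv);
        /* ... meta-data copy from tpl as in the patch ... */
        return skb;
}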

Thanks,
Chris.

> 
> -- 
> Sabrina
diff mbox series

Patch

diff --git a/net/xfrm/xfrm_iptfs.c b/net/xfrm/xfrm_iptfs.c
index 20c19894720e..38735e2d64c3 100644
--- a/net/xfrm/xfrm_iptfs.c
+++ b/net/xfrm/xfrm_iptfs.c
@@ -46,12 +46,23 @@ 
  */
 #define IPTFS_DEFAULT_MAX_QUEUE_SIZE	(1024 * 10240)
 
+/* 1) skb->head should be cache aligned.
+ * 2) when resv is for L2 headers (i.e., ethernet) we want the cacheline to
+ * start -16 from data.
+ * 3) when resv is for L3+L2 headers IOW skb->data points at the IPTFS payload
+ * we want data to be cache line aligned so all the pushed headers will be in
+ * another cacheline.
+ */
+#define XFRM_IPTFS_MIN_L3HEADROOM 128
+#define XFRM_IPTFS_MIN_L2HEADROOM (64 + 16)
+#define IPTFS_FRAG_COPY_MAX 256 /* max for copying to create iptfs frags */
 #define NSECS_IN_USEC 1000
 
 #define IPTFS_HRTIMER_MODE HRTIMER_MODE_REL_SOFT
 
 /**
  * struct xfrm_iptfs_config - configuration for the IPTFS tunnel.
+ * @dont_frag: true to inhibit fragmenting across IPTFS outer packets.
  * @pkt_size: size of the outer IP packet. 0 to use interface and MTU discovery,
  *	otherwise the user specified value.
  * @max_queue_size: The maximum number of octets allowed to be queued to be sent
@@ -59,6 +70,7 @@ 
  *	packets enqueued.
  */
 struct xfrm_iptfs_config {
+	bool dont_frag : 1;
 	u32 pkt_size;	    /* outer_packet_size or 0 */
 	u32 max_queue_size; /* octets */
 };
@@ -88,13 +100,71 @@  struct xfrm_iptfs_data {
 	u32 payload_mtu;	    /* max payload size */
 };
 
-static u32 iptfs_get_inner_mtu(struct xfrm_state *x, int outer_mtu);
+static u32 __iptfs_get_inner_mtu(struct xfrm_state *x, int outer_mtu);
 static enum hrtimer_restart iptfs_delay_timer(struct hrtimer *me);
 
 /* ================= */
 /* SK_BUFF Functions */
 /* ================= */
 
+/**
+ * iptfs_alloc_skb() - Allocate a new `skb` using a meta-data template.
+ * @tpl: the template to copy the new `skb`s meta-data from.
+ * @len: the linear length of the head data, zero is fine.
+ * @l3resv: true if reserve needs to support pushing L3 headers
+ *
+ * A new `skb` is allocated and its meta-data is initialized from `tpl`, the
+ * head data is sized to `len` + reserved space set according to the @l3resv
+ * boolean. When @l3resv is false, resv is XFRM_IPTFS_MIN_L2HEADROOM which
+ * arranges for `skb->data - 16` (etherhdr space) to be the start of a cacheline.
+ * Otherwise, @l3resv is true and resv is either the size of headroom from `tpl` or
+ * XFRM_IPTFS_MIN_L3HEADROOM, whichever is greater, which tries to align
+ * skb->data to a cacheline as all headers will be pushed on the previous
+ * cacheline bytes.
+ *
+ * When copying meta-data from the @tpl, the sk_buff->headers are not copied.
+ *
+ * Zero length skbs are allocated when we only need a head skb to hold new
+ * packet headers (basically the mac header) that sit on top of existing shared
+ * packet data.
+ *
+ * Return: the new skb or NULL.
+ */
+static struct sk_buff *iptfs_alloc_skb(struct sk_buff *tpl, u32 len,
+				       bool l3resv)
+{
+	struct sk_buff *skb;
+	u32 resv;
+
+	if (!l3resv) {
+		resv = XFRM_IPTFS_MIN_L2HEADROOM;
+	} else {
+		resv = skb_headroom(tpl);
+		if (resv < XFRM_IPTFS_MIN_L3HEADROOM)
+			resv = XFRM_IPTFS_MIN_L3HEADROOM;
+	}
+
+	skb = alloc_skb(len + resv, GFP_ATOMIC);
+	if (!skb) {
+		XFRM_INC_STATS(dev_net(tpl->dev), LINUX_MIB_XFRMNOSKBERROR);
+		return NULL;
+	}
+
+	skb_reserve(skb, resv);
+
+	/* We do not want any of the tpl->headers copied over, so we do
+	 * not use `skb_copy_header()`.
+	 */
+	skb->tstamp = tpl->tstamp;
+	skb->dev = tpl->dev;
+	memcpy(skb->cb, tpl->cb, sizeof(skb->cb));
+	skb_dst_copy(skb, tpl);
+	__skb_ext_copy(skb, tpl);
+	__nf_copy(skb, tpl, false);
+
+	return skb;
+}
+
 /**
  * skb_head_to_frag() - initialize a skb_frag_t based on skb head data
  * @skb: skb with the head data
@@ -109,6 +179,41 @@  static void skb_head_to_frag(const struct sk_buff *skb, skb_frag_t *frag)
 	skb_frag_fill_page_desc(frag, page, skb->data - addr, skb_headlen(skb));
 }
 
+/**
+ * skb_copy_bits_seq - copy bits from a skb_seq_state to kernel buffer
+ * @st: source skb_seq_state
+ * @offset: offset in source
+ * @to: destination buffer
+ * @len: number of bytes to copy
+ *
+ * Copy @len bytes from @offset bytes into the source @st to the destination
+ * buffer @to. `offset` should increase (or be unchanged) with each subsequent
+ * call to this function. If offset needs to decrease from the previous use, `st`
+ * should be reset first.
+ *
+ * Return: 0 on success or a negative error code on failure
+ */
+static int skb_copy_bits_seq(struct skb_seq_state *st, int offset, void *to,
+			     int len)
+{
+	const u8 *data;
+	u32 sqlen;
+
+	for (;;) {
+		sqlen = skb_seq_read(offset, &data, st);
+		if (sqlen == 0)
+			return -ENOMEM;
+		if (sqlen >= len) {
+			memcpy(to, data, len);
+			return 0;
+		}
+		memcpy(to, data, sqlen);
+		to += sqlen;
+		offset += sqlen;
+		len -= sqlen;
+	}
+}
+
 /* ================================= */
 /* IPTFS Sending (ingress) Functions */
 /* ================================= */
@@ -153,7 +258,7 @@  static int iptfs_get_cur_pmtu(struct xfrm_state *x,
 {
 	struct xfrm_dst *xdst = (struct xfrm_dst *)skb_dst(skb);
 	u32 payload_mtu = xtfs->payload_mtu;
-	u32 pmtu = iptfs_get_inner_mtu(x, xdst->child_mtu_cached);
+	u32 pmtu = __iptfs_get_inner_mtu(x, xdst->child_mtu_cached);
 
 	if (payload_mtu && payload_mtu < pmtu)
 		pmtu = payload_mtu;
@@ -216,7 +321,8 @@  static int iptfs_output_collect(struct net *net, struct sock *sk,
 
 	BUG_ON(!xtfs);
 
-	pmtu = iptfs_get_cur_pmtu(x, xtfs, skb);
+	if (xtfs->cfg.dont_frag)
+		pmtu = iptfs_get_cur_pmtu(x, xtfs, skb);
 
 	/* Break apart GSO skbs. If the queue is nearing full then we want the
 	 * accounting and queuing to be based on the individual packets not on the
@@ -256,8 +362,10 @@  static int iptfs_output_collect(struct net *net, struct sock *sk,
 			continue;
 		}
 
-		/* Fragmenting handled in following commits. */
-		if (iptfs_is_too_big(sk, skb, pmtu)) {
+		/* If the user indicated no iptfs fragmenting, check before
+		 * enqueue.
+		 */
+		if (xtfs->cfg.dont_frag && iptfs_is_too_big(sk, skb, pmtu)) {
 			kfree_skb_reason(skb, SKB_DROP_REASON_PKT_TOO_BIG);
 			continue;
 		}
@@ -301,6 +409,219 @@  static void iptfs_output_prepare_skb(struct sk_buff *skb, u32 blkoff)
 	IPCB(skb)->flags |= IPSKB_XFRM_TUNNEL_SIZE;
 }
 
+/**
+ * iptfs_copy_create_frag() - create an inner fragment skb.
+ * @st: The source packet data.
+ * @offset: offset in @st of the new fragment data.
+ * @copy_len: the amount of data to copy from @st.
+ *
+ * Create a new skb holding a single IPTFS inner packet fragment. @copy_len must
+ * not be greater than the max fragment size.
+ *
+ * Return: the new fragment skb or an ERR_PTR().
+ */
+static struct sk_buff *iptfs_copy_create_frag(struct skb_seq_state *st,
+					      u32 offset, u32 copy_len)
+{
+	struct sk_buff *src = st->root_skb;
+	struct sk_buff *skb;
+	int err;
+
+	skb = iptfs_alloc_skb(src, copy_len, true);
+	if (!skb)
+		return ERR_PTR(-ENOMEM);
+
+	/* Now copy `copy_len` data from src */
+	err = skb_copy_bits_seq(st, offset, skb_put(skb, copy_len), copy_len);
+	if (err) {
+		kfree_skb(skb);
+		return ERR_PTR(err);
+	}
+
+	return skb;
+}
+
+/**
+ * iptfs_copy_create_frags() - create and send N-1 fragments of a larger skb.
+ * @skbp: the source packet skb (IN), skb holding the last fragment in
+ *        the fragment stream (OUT).
+ * @xtfs: IPTFS SA state.
+ * @mtu: the max IPTFS fragment size.
+ *
+ * This function is responsible for fragmenting a larger inner packet into a
+ * sequence of IPTFS payload packets. The last fragment is returned rather than
+ * being sent so that the caller can append more inner packets (aggregation) if
+ * there is room.
+ *
+ * Return: 0 on success or a negative error code on failure
+ */
+static int iptfs_copy_create_frags(struct sk_buff **skbp,
+				   struct xfrm_iptfs_data *xtfs, u32 mtu)
+{
+	struct skb_seq_state skbseq;
+	struct list_head sublist;
+	struct sk_buff *skb = *skbp;
+	struct sk_buff *nskb = *skbp;
+	u32 copy_len, offset;
+	u32 to_copy = skb->len - mtu;
+	int err = 0;
+
+	INIT_LIST_HEAD(&sublist);
+
+	BUG_ON(skb->len <= mtu);
+	skb_prepare_seq_read(skb, 0, skb->len, &skbseq);
+
+	/* A trimmed `skb` will be sent as the first fragment, later. */
+	offset = mtu;
+	to_copy = skb->len - offset;
+	while (to_copy) {
+		/* Send all but last fragment to allow agg. append */
+		list_add_tail(&nskb->list, &sublist);
+
+		/* FUTURE: if the packet has an odd/non-aligning length we could
+		 * send less data in the penultimate fragment so that the last
+		 * fragment then ends on an aligned boundary.
+		 */
+		copy_len = min(to_copy, mtu);
+		nskb = iptfs_copy_create_frag(&skbseq, offset, copy_len);
+		if (IS_ERR(nskb)) {
+			XFRM_INC_STATS(xs_net(xtfs->x),
+				       LINUX_MIB_XFRMOUTERROR);
+			skb_abort_seq_read(&skbseq);
+			err = PTR_ERR(nskb);
+			nskb = NULL;
+			break;
+		}
+		iptfs_output_prepare_skb(nskb, to_copy);
+		offset += copy_len;
+		to_copy -= copy_len;
+	}
+	skb_abort_seq_read(&skbseq);
+
+	/* return last fragment that will be unsent (or NULL) */
+	*skbp = nskb;
+
+	/* trim the original skb to MTU */
+	if (!err)
+		err = pskb_trim(skb, mtu);
+
+	if (err) {
+		/* Free all frags. Don't bother sending a partial packet we will
+		 * never complete.
+		 */
+		kfree_skb(nskb);
+		list_for_each_entry_safe(skb, nskb, &sublist, list) {
+			skb_list_del_init(skb);
+			kfree_skb(skb);
+		}
+		return err;
+	}
+
+	/* prepare the initial fragment with an iptfs header */
+	iptfs_output_prepare_skb(skb, 0);
+
+	/* Send all but last fragment, if we fail to send a fragment then free
+	 * the rest -- no point in sending a packet that can't be reassembled.
+	 */
+	list_for_each_entry_safe(skb, nskb, &sublist, list) {
+		skb_list_del_init(skb);
+		if (!err)
+			err = xfrm_output(NULL, skb);
+		else
+			kfree_skb(skb);
+	}
+	if (err)
+		kfree_skb(*skbp);
+	return err;
+}
+
+/**
+ * iptfs_first_should_copy() - determine if we should copy packet data.
+ * @first_skb: the first skb in the packet
+ * @mtu: the MTU.
+ *
+ * Determine if we should create subsequent skbs to hold the remaining data from
+ * a large inner packet by copying the packet data, or cloning the original skb
+ * and adjusting the offsets.
+ *
+ * Return: true if we should copy the data out of the skb.
+ */
+static bool iptfs_first_should_copy(struct sk_buff *first_skb, u32 mtu)
+{
+	u32 frag_copy_max;
+
+	/* If we have less than frag_copy_max for remaining packet we copy
+	 * those tail bytes as it is more efficient.
+	 */
+	frag_copy_max = min(mtu, IPTFS_FRAG_COPY_MAX);
+	if ((int)first_skb->len - (int)mtu < (int)frag_copy_max)
+		return true;
+
+	/* If we have non-linear skb just use copy */
+	if (skb_is_nonlinear(first_skb))
+		return true;
+
+	/* So we have a simple linear skb, easy to clone and share */
+	return false;
+}
+
+/**
+ * iptfs_first_skb() - handle the first dequeued inner packet for output
+ * @skbp: the source packet skb (IN), skb holding the last fragment in
+ *        the fragment stream (OUT).
+ * @xtfs: IPTFS SA state.
+ * @mtu: the max IPTFS fragment size.
+ *
+ * This function is responsible for fragmenting a larger inner packet into a
+ * sequence of IPTFS payload packets. If it needs to fragment into subsequent
+ * skb's, it will either do so by copying or cloning.
+ *
+ * The last fragment is returned rather than being sent so that the caller can
+ * append more inner packets (aggregation) if there is room.
+ *
+ * Return: 0 on success or a negative error code on failure
+ */
+static int iptfs_first_skb(struct sk_buff **skbp, struct xfrm_iptfs_data *xtfs,
+			   u32 mtu)
+{
+	struct sk_buff *skb = *skbp;
+	int err;
+
+	/* Classic ESP skips the don't fragment ICMP error if DF is clear on
+	 * the inner packet or ignore_df is set. Otherwise it will send an ICMP
+	 * or local error if the inner packet won't fit its MTU.
+	 *
+	 * With IPTFS we do not care about the inner packet DF bit. If the
+	 * tunnel is configured to "don't fragment" we error back if things
+	 * don't fit in our max packet size. Otherwise we iptfs-fragment as
+	 * normal.
+	 */
+
+	/* The opportunity for HW offload has ended */
+	if (skb->ip_summed == CHECKSUM_PARTIAL) {
+		err = skb_checksum_help(skb);
+		if (err)
+			return err;
+	}
+
+	/* We've split these up before queuing */
+	BUG_ON(skb_is_gso(skb));
+
+	/* Simple case -- it fits. `mtu` accounted for all the overhead
+	 * including the basic IPTFS header.
+	 */
+	if (skb->len <= mtu) {
+		iptfs_output_prepare_skb(skb, 0);
+		return 0;
+	}
+
+	if (iptfs_first_should_copy(skb, mtu))
+		return iptfs_copy_create_frags(skbp, xtfs, mtu);
+
+	/* For now we always copy */
+	return iptfs_copy_create_frags(skbp, xtfs, mtu);
+}
+
 static struct sk_buff **iptfs_rehome_fraglist(struct sk_buff **nextp,
 					      struct sk_buff *child)
 {
@@ -360,6 +681,15 @@  static void iptfs_output_queued(struct xfrm_state *x, struct sk_buff_head *list)
 	struct sk_buff *skb, *skb2, **nextp;
 	struct skb_shared_info *shi, *shi2;
 
+	/* If we are fragmenting due to a large inner packet we will output all
+	 * the outer IPTFS packets required to contain the fragments of the
+	 * single large inner packet. These outer packets need to be sent
+	 * consecutively (ESP seq-wise). Since this output function is always
+	 * running from a timer we do not need a lock to provide this guarantee.
+	 * We will output our packets consecutively before the timer is allowed
+	 * to run again on some other CPU.
+	 */
+
 	while ((skb = __skb_dequeue(list))) {
 		u32 mtu = iptfs_get_cur_pmtu(x, xtfs, skb);
 		bool share_ok = true;
@@ -370,7 +700,7 @@  static void iptfs_output_queued(struct xfrm_state *x, struct sk_buff_head *list)
 					htons(ETH_P_IP) :
 					htons(ETH_P_IPV6);
 
-		if (skb->len > mtu) {
+		if (skb->len > mtu && xtfs->cfg.dont_frag) {
 			/* We handle this case before enqueueing so we are only
 			 * here b/c MTU changed after we enqueued before we
 			 * dequeued, just drop these.
@@ -381,26 +711,22 @@  static void iptfs_output_queued(struct xfrm_state *x, struct sk_buff_head *list)
 			continue;
 		}
 
-		/* If we don't have a cksum in the packet we need to add one
-		 * before encapsulation.
+		/* Convert first inner packet into an outer IPTFS packet,
+		 * dealing with any fragmentation into multiple outer packets
+		 * if necessary.
 		 */
-		if (skb->ip_summed == CHECKSUM_PARTIAL) {
-			if (skb_checksum_help(skb)) {
-				XFRM_INC_STATS(dev_net(skb_dst(skb)->dev),
-					       LINUX_MIB_XFRMOUTERROR);
-				kfree_skb(skb);
-				continue;
-			}
-		}
-
-		/* Convert first inner packet into an outer IPTFS packet */
-		iptfs_output_prepare_skb(skb, 0);
+		if (iptfs_first_skb(&skb, xtfs, mtu))
+			continue;
 
-		/* The space remaining to send more inner packet data is `mtu` -
-		 * (skb->len - sizeof iptfs header). This is b/c the `mtu` value
-		 * has the basic IPTFS header len accounted for, and we added
-		 * that header to the skb so it is a part of skb->len, thus we
-		 * subtract it from the skb length.
+		/* If fragmentation was required the returned skb is the last
+		 * IPTFS fragment in the chain, and its IPTFS header blkoff has
+		 * been set just past the end of the fragment data.
+		 *
+		 * In either case the space remaining to send more inner packet
+		 * data is `mtu` - (skb->len - sizeof iptfs header). This is b/c
+		 * the `mtu` value has the basic IPTFS header len accounted for,
+		 * and we added that header to the skb so it is a part of
+		 * skb->len, thus we subtract it from the skb length.
 		 */
 		remaining = mtu - (skb->len - sizeof(struct ip_iptfs_hdr));
 
@@ -641,11 +967,13 @@  static int iptfs_prepare_output(struct xfrm_state *x, struct sk_buff *skb)
 /* ========================== */
 
 /**
- * iptfs_get_inner_mtu() - return inner MTU with no fragmentation.
+ * __iptfs_get_inner_mtu() - return inner MTU with no fragmentation.
  * @x: xfrm state.
  * @outer_mtu: the outer mtu
+ *
+ * Return: Correct MTU taking into account the encap overhead.
  */
-static u32 iptfs_get_inner_mtu(struct xfrm_state *x, int outer_mtu)
+static u32 __iptfs_get_inner_mtu(struct xfrm_state *x, int outer_mtu)
 {
 	struct crypto_aead *aead;
 	u32 blksize;
@@ -656,6 +984,23 @@  static u32 iptfs_get_inner_mtu(struct xfrm_state *x, int outer_mtu)
 		~(blksize - 1)) - 2;
 }
 
+/**
+ * iptfs_get_inner_mtu() - return the inner MTU for an IPTFS xfrm.
+ * @x: xfrm state.
+ * @outer_mtu: Outer MTU for the encapsulated packet.
+ *
+ * Return: Correct MTU taking into account the encap overhead.
+ */
+static u32 iptfs_get_inner_mtu(struct xfrm_state *x, int outer_mtu)
+{
+	struct xfrm_iptfs_data *xtfs = x->mode_data;
+
+	/* If not dont-frag we have no MTU */
+	if (!xtfs->cfg.dont_frag)
+		return x->outer_mode.family == AF_INET ? IP_MAX_MTU : IP6_MAX_MTU;
+	return __iptfs_get_inner_mtu(x, outer_mtu);
+}
+
 /**
  * iptfs_user_init() - initialize the SA with IPTFS options from netlink.
  * @net: the net data
@@ -677,6 +1022,8 @@  static int iptfs_user_init(struct net *net, struct xfrm_state *x,
 	xc->max_queue_size = IPTFS_DEFAULT_MAX_QUEUE_SIZE;
 	xtfs->init_delay_ns = IPTFS_DEFAULT_INIT_DELAY_USECS * NSECS_IN_USEC;
 
+	if (attrs[XFRMA_IPTFS_DONT_FRAG])
+		xc->dont_frag = true;
 	if (attrs[XFRMA_IPTFS_PKT_SIZE]) {
 		xc->pkt_size = nla_get_u32(attrs[XFRMA_IPTFS_PKT_SIZE]);
 		if (!xc->pkt_size) {
@@ -710,6 +1057,8 @@  static unsigned int iptfs_sa_len(const struct xfrm_state *x)
 	unsigned int l = 0;
 
 	if (x->dir == XFRM_SA_DIR_OUT) {
+		if (xc->dont_frag)
+			l += nla_total_size(0);	  /* dont-frag flag */
 		l += nla_total_size(sizeof(u32)); /* init delay usec */
 		l += nla_total_size(sizeof(xc->max_queue_size));
 		l += nla_total_size(sizeof(xc->pkt_size));
@@ -726,6 +1075,12 @@  static int iptfs_copy_to_user(struct xfrm_state *x, struct sk_buff *skb)
 	u64 q;
 
 	if (x->dir == XFRM_SA_DIR_OUT) {
+		if (xc->dont_frag) {
+			ret = nla_put_flag(skb, XFRMA_IPTFS_DONT_FRAG);
+			if (ret)
+				return ret;
+		}
+
 		q = xtfs->init_delay_ns;
 		(void)do_div(q, NSECS_IN_USEC);
 		ret = nla_put_u32(skb, XFRMA_IPTFS_INIT_DELAY, q);