
[net-next,v4,4/4] net: gro: move L3 flush checks to tcp_gro_receive

Message ID 20240325182543.87683-5-richardbgobert@gmail.com (mailing list archive)
State New
Headers show
Series net: gro: encapsulation bug fix and flush checks improvements | expand

Commit Message

Richard Gobert March 25, 2024, 6:25 p.m. UTC
{inet,ipv6}_gro_receive functions perform flush checks (ttl, flags,
iph->id, ...) against all packets in a loop. These flush checks are used
currently only in tcp flows in GRO.

These checks need to be done only once in tcp_gro_receive and only against
the found p skb, since they only affect flush and not same_flow.

Leveraging the previous commit in the series, in which correct network
header offsets are saved for both outer and inner network headers,
these checks can now be done only once, in tcp_gro_receive. As a
result, NAPI_GRO_CB(p)->flush is not used at all. In addition - flush_id
checks are more declarative and contained in inet_gro_flush, thus removing
the need for flush_id in napi_gro_cb.

This results in less parsing code for UDP flows and non-loop flush tests
for TCP flows.

For example, running 40 IP/UDP netperf connections:
./super_netperf.sh 40 -H 1.1.1.2 -t UDP_STREAM -l 120

Running perf top for 90s we can see that relatively less time is spent
on inet_gro_receive when GRO is not coalescing UDP:

net-next:
   1.26%  [kernel]  [k] inet_gro_receive

patch applied:
   0.85%  [kernel]  [k] inet_gro_receive

udpgro_bench.sh single connection GRO improvement:
net-next:
   0.76%  [kernel]  [k] inet_gro_receive

patch applied:
   0.61%  [kernel]  [k] inet_gro_receive

Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
---
 include/net/gro.h      |  9 ++----
 net/core/gro.c         |  3 --
 net/ipv4/af_inet.c     | 36 ---------------------
 net/ipv4/tcp_offload.c | 72 ++++++++++++++++++++++++++++++++++--------
 net/ipv6/ip6_offload.c | 11 -------
 5 files changed, 62 insertions(+), 69 deletions(-)

Comments

Willem de Bruijn March 25, 2024, 6:53 p.m. UTC | #1
Richard Gobert wrote:
> {inet,ipv6}_gro_receive functions perform flush checks (ttl, flags,
> iph->id, ...) against all packets in a loop. These flush checks are used
> currently only in tcp flows in GRO.
> 
> These checks need to be done only once in tcp_gro_receive and only against
> the found p skb, since they only affect flush and not same_flow.
> 
> Leveraging the previous commit in the series, in which correct network
> header offsets are saved for both outer and inner network headers,
> these checks can now be done only once, in tcp_gro_receive. As a
> result, NAPI_GRO_CB(p)->flush is not used at all. In addition - flush_id
> checks are more declarative and contained in inet_gro_flush, thus removing
> the need for flush_id in napi_gro_cb.
> 
> This results in less parsing code for UDP flows and non-loop flush tests
> for TCP flows.
> 
> For example, running 40 IP/UDP netperf connections:
> ./super_netperf.sh 40 -H 1.1.1.2 -t UDP_STREAM -l 120
> 
> Running perf top for 90s we can see that relatively less time is spent
> on inet_gro_receive when GRO is not coalescing UDP:
> 
> net-next:
>    1.26%  [kernel]  [k] inet_gro_receive
> 
> patch applied:
>    0.85%  [kernel]  [k] inet_gro_receive
> 
> udpgro_bench.sh single connection GRO improvement:
> net-next:
>    0.76%  [kernel]  [k] inet_gro_receive
> 
> patch applied:
>    0.61%  [kernel]  [k] inet_gro_receive
> 
> Signed-off-by: Richard Gobert <richardbgobert@gmail.com>

In v3 we discussed how the flush on network layer differences (like
TTL or ToS) currently only affect the TCP GRO path, but should apply
more broadly.

We agreed that it is fine to leave that to a separate patch series.

But seeing this patch, it introduces a lot of churn, but also makes
it harder to address that issue for UDP, as it now moves network
layer checks directly to the TCP code.
Jakub Kicinski March 26, 2024, 2:33 a.m. UTC | #2
On Mon, 25 Mar 2024 19:25:43 +0100 Richard Gobert wrote:
> +	const u32 id = ntohl(*(__be32 *)&iph->id);
> +	const u32 id2 = ntohl(*(__be32 *)&iph2->id);
> +	const int flush_id = ntohs(id >> 16) - ntohs(id2 >> 16);

The endian conversions don't match types here. sparse is unhappy.
If id is in host endian shouldn't it be htons(id >> 16) ?
Also if you cast to a bitwise type you need __force
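For illustration, a sparse-clean sketch of that comparison (hypothetical
helper, not code posted in this thread; it assumes the same iph/iph2
pointers the patch already has):

	/* Sketch only: compute the IP ID delta without mixing byte orders. */
	static int ip_id_delta(const struct iphdr *iph, const struct iphdr *iph2)
	{
		const u32 id = ntohl(*(const __be32 *)&iph->id);
		const u32 id2 = ntohl(*(const __be32 *)&iph2->id);

		/* After ntohl() the values are host-order and the IP ID sits in
		 * the upper 16 bits, so no further ntohs()/htons() is needed.
		 */
		return (int)(id >> 16) - (int)(id2 >> 16);
	}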
Richard Gobert March 26, 2024, 12:35 p.m. UTC | #3
Willem de Bruijn wrote:
> In v3 we discussed how the flush on network layer differences (like
> TTL or ToS) currently only affect the TCP GRO path, but should apply
> more broadly.
> 
> We agreed that it is fine to leave that to a separate patch series.
> 
> But seeing this patch, it introduces a lot of churn, but also makes
> it harder to address that issue for UDP, as it now moves network
> layer checks directly to the TCP code.
Currently the logic of flush_id is scattered in tcp_gro_receive and
{inet,ipv6}_gro_receive with conditionals rewriting ->flush and ->flush_id,
so IMO the code should be more concise when it's in one place - in addition
to not doing checks against non-relevant packets.

With this patch, the fix will probably be simple, most likely just calling
gro_network_flush from skb_gro_receive or from the relevant flow in
udp_gro_receive_segment. Since this bug fix should be simple and is
not relevant to the optimization, I'd like to solve it in another series
and properly test that new flow. Do you agree?
Willem de Bruijn March 26, 2024, 1:40 p.m. UTC | #4
Richard Gobert wrote:
> Willem de Bruijn wrote:
> > In v3 we discussed how the flush on network layer differences (like
> > TTL or ToS) currently only affect the TCP GRO path, but should apply
> > more broadly.
> > 
> > We agreed that it is fine to leave that to a separate patch series.
> > 
> > But seeing this patch, it introduces a lot of churn, but also makes
> > it harder to address that issue for UDP, as it now moves network
> > layer checks directly to the TCP code.
> Currently the logic of flush_id is scattered in tcp_gro_receive and
> {inet,ipv6}_gro_receive with conditionals rewriting ->flush and ->flush_id,
> so IMO the code should be more concise when it's in one place - in addition
> to not doing checks against non relevant packets.
> 
> With this patch, the fix will probably be simple, most likely just calling
> gro_network_flush from skb_gro_receive or from the relevant flow in
> udp_gro_receive_segment. Since this bug fix should be simple and it being
> not relevant to the optimization, I'd like to solve it in another series
> and properly test that new flow. Do you agree?

My main concern is moving this code to tcp_offload.c, if it likely
soon will be moved elsewhere again.
Richard Gobert March 26, 2024, 2:09 p.m. UTC | #5
Willem de Bruijn wrote:
> My main concern is moving this code to tcp_offload.c, if it likely
> soon will be moved elsewhere again.

Got it. I'll move these functions to gro.c and post a v5 with Jakub's
requests as well.
Eric Dumazet March 26, 2024, 2:14 p.m. UTC | #6
On Mon, Mar 25, 2024 at 7:27 PM Richard Gobert <richardbgobert@gmail.com> wrote:
>
> {inet,ipv6}_gro_receive functions perform flush checks (ttl, flags,
> iph->id, ...) against all packets in a loop. These flush checks are used
> currently only in tcp flows in GRO.

I think this is a bug.

GRO should not aggregate packets if their ttl/tos fields do not match.
Richard Gobert March 26, 2024, 2:43 p.m. UTC | #7
Eric Dumazet wrote:
> On Mon, Mar 25, 2024 at 7:27 PM Richard Gobert <richardbgobert@gmail.com> wrote:
>>
>> {inet,ipv6}_gro_receive functions perform flush checks (ttl, flags,
>> iph->id, ...) against all packets in a loop. These flush checks are used
>> currently only in tcp flows in GRO.
> 
> I think this is a bug.
> 
> GRO should not aggregate packets if their ttl/tos fields do not match.

AFAIU, the only UDP flow where ttl/flush_id need to be checked is when
udp_gro_receive_segment calls skb_gro_receive - could you confirm / point
out if there are any other flows to which these flush checks may be
relevant?

As I've discussed with Willem in v3 I prefer to fix this bug in a separate
series.
Eric Dumazet March 26, 2024, 2:46 p.m. UTC | #8
On Tue, Mar 26, 2024 at 3:43 PM Richard Gobert <richardbgobert@gmail.com> wrote:
>
> Eric Dumazet wrote:
> > On Mon, Mar 25, 2024 at 7:27 PM Richard Gobert <richardbgobert@gmail.com> wrote:
> >>
> >> {inet,ipv6}_gro_receive functions perform flush checks (ttl, flags,
> >> iph->id, ...) against all packets in a loop. These flush checks are used
> >> currently only in tcp flows in GRO.
> >
> > I think this is a bug.
> >
> > GRO should not aggregate packets if their ttl/tos fields do not match.
>
> AFAIU, the only UDP flow where ttl/flush_id need to be checked is when
> udp_gro_receive_segment calls skb_gro_receive - could you confirm / point
> out if there are any other flows to which these flush checks may be
> relevant?
>
> As I've discussed with Willem in v3 I prefer to fix this bug in a separate
> series.

I do not understand this patch 4/4 then.

Why bother moving stuff in net/ipv4/tcp_offload.c if we plan to move
it back to where it belongs ?
Richard Gobert March 26, 2024, 3:02 p.m. UTC | #9
Eric Dumazet wrote:
> 
> I do not understand this patch 4/4 then.
> 
> Why bother moving stuff in net/ipv4/tcp_offload.c if we plan to move
> it back to where it belongs ?

Willem also pointed that out, and I agree. I'll post a v5 and move this
functionality to gro.c. Currently, gro_network_flush will be called from
tcp_gro_receive and in a separate series I'll fix the bug by calling
gro_network_flush in skb_gro_receive or adding it to
udp_gro_receive_segment - whichever is better.

This patch is meaningful by itself - removing checks against non-relevant
packets and making the flush/flush_id checks in a single place.
Willem de Bruijn March 26, 2024, 3:10 p.m. UTC | #10
Richard Gobert wrote:
> Eric Dumazet wrote:
> > 
> > I do not understand this patch 4/4 then.
> > 
> > Why bother moving stuff in net/ipv4/tcp_offload.c if we plan to move
> > it back to where it belongs ?
> 
> Willem also pointed that out, and I agree. I'll post a v5 and move this
> functionality to gro.c. Currently, gro_network_flush will be called from
> tcp_gro_receive and in a separate series I'll fix the bug by calling
> gro_network_flush in skb_gro_receive or adding it to
> udp_gro_receive_segment - whichever is better.
> 
> This patch is meaningful by itself - removing checks against non-relevant
> packets and making the flush/flush_id checks in a single place.

One issue: if we do this move in net-next, then a later fix that
relies on it cannot be backported to older stable kernels.
Richard Gobert March 26, 2024, 3:26 p.m. UTC | #11
Willem de Bruijn wrote:
> One issue: if we do this move in net-next, then a later fix that
> relies on it cannot be backported to older stable kernels.
> 

I understand. I can either add a first commit to this series which fixes
the bug by checking ->flush and ->flush_id in udp_gro_receive_segment,
or directly fix the bug in the 4th commit of this series in addition to
the optimization - which can be backported as a whole.

What do you think?
Paolo Abeni March 26, 2024, 4:14 p.m. UTC | #12
Hi,

On Tue, 2024-03-26 at 16:02 +0100, Richard Gobert wrote:
> This patch is meaningful by itself - removing checks against non-relevant
> packets and making the flush/flush_id checks in a single place.

I'm personally not sure this patch is a win. The code churn is
significant. I understand this is for performance's sake, but I don't
see the benefit??? 

The changelog shows that perf reports slightly lower figures for
inet_gro_receive(). That is expected, as this patch moves code out of
such function. What about inet_gro_flush()/tcp_gro_receive() where such
code is moved?

Additionally, the reported deltas are within the noise level according to my
personal experience with similar tests.

I think we are better off without this patch.

Paolo
Richard Gobert March 26, 2024, 5:25 p.m. UTC | #13
Paolo Abeni wrote:
> Hi,
> 
> On Tue, 2024-03-26 at 16:02 +0100, Richard Gobert wrote:
>> This patch is meaningful by itself - removing checks against non-relevant
>> packets and making the flush/flush_id checks in a single place.
> 
> I'm personally not sure this patch is a win. The code churn is
> significant. I understand this is for performance's sake, but I don't
> see the benefit??? 
> 

Could you clarify what you mean by code churn?

This patch removes all use of p->flush and flush_id from the
CB. The entire logic for L3 flush_id is scattered in tcp_gro_receive
and {inet,ipv6}_gro_receive with conditionals rewriting ->flush,
->flush_id and ->is_atomic. Moving it to one place (gro_network_flush)
should be more readable. (Personally, it took me a lot of time to
understand the current logic of flush + flush_id + is_atomic)

> The changelog shows that perf reports slightly lower figures for
> inet_gro_receive(). That is expected, as this patch moves code out of
> such function. What about inet_gro_flush()/tcp_gro_receive() where such
> code is moved?
> 

Please consider the following 2 common scenarios:

1) Multiple packets in the GRO bucket - the common case with multiple
   packets in the bucket (i.e. running super_netperf TCP_STREAM) - each layer
   executes a for loop - going over each packet in the bucket. Specifically,
   L3 gro_receive loops over the bucket making flush,flush_id,is_atomic
   checks. For most packets in the bucket, these checks are not
   relevant. (possibly also dirtying cache lines with non-relevant p
   packets). Removing code in the for loop for this case is significant.

2) UDP/TCP streams which do not coalesce in GRO. This is the common case
   for regular UDP connections (i.e. running netperf UDP_STREAM). In this
   case, GRO is just overhead. Removing any code from these layers
   is good (shown in the first measurement of the commit message).

In the case of a single TCP connection - the amount of checks should be
the same overall not causing any noticeable difference.

> Additionally the reported deltas is within noise level according to my
> personal experience with similar tests.
> 

I've tested the difference between net-next and this patch repeatedly,
which showed stable results each time. Is there any specific test you
think would be helpful to show the result?

Thanks
Paolo Abeni March 26, 2024, 6:29 p.m. UTC | #14
On Tue, 2024-03-26 at 18:25 +0100, Richard Gobert wrote:
> Paolo Abeni wrote:
> > Hi,
> > 
> > On Tue, 2024-03-26 at 16:02 +0100, Richard Gobert wrote:
> > > This patch is meaningful by itself - removing checks against non-relevant
> > > packets and making the flush/flush_id checks in a single place.
> > 
> > I'm personally not sure this patch is a win. The code churn is
> > significant. I understand this is for performance's sake, but I don't
> > see the benefit??? 
> > 
> 
> Could you clarify what do you mean by code churn?

The diffstat of this patch is not negligible and touches very sensitive
areas.

> > The changelog shows that perf reports slightly lower figures for
> > inet_gro_receive(). That is expected, as this patch moves code out of
> > such function. What about inet_gro_flush()/tcp_gro_receive() where such
> > code is moved?
> > 
> 
> Please consider the following 2 common scenarios:
> 
> 1) Multiple packets in the GRO bucket - the common case with multiple
>    packets in the bucket (i.e. running super_netperf TCP_STREAM) - each layer
>    executes a for loop - going over each packet in the bucket. Specifically,
>    L3 gro_receive loops over the bucket making flush,flush_id,is_atomic
>    checks. 

Only for packets with the same rx hash. 

> For most packets in the bucket, these checks are not
>    relevant. (possibly also dirtying cache lines with non-relevant p
>    packets). Removing code in the for loop for this case is significant.
> 
> 2) UDP/TCP streams which do not coalesce in GRO. This is the common case
>    for regular UDP connections (i.e. running netperf UDP_STREAM). In this
>    case, GRO is just overhead. Removing any code from these layers
>    is good (shown in the first measurement of the commit message).

If UDP GRO is not enabled, there are no UDP packets staged in the UDP
GRO engine; the bucket list is empty.

> > Additionally the reported deltas is within noise level according to my
> > personal experience with similar tests.
> > 
> 
> I've tested the difference between net-next and this patch repetitively,
> which showed stable results each time. Is there any specific test you
> think would be helpful to show the result?

Anything that shows a measurable gain.

Reporting the CPU utilization in the inet_gro_receive() function alone
is not enough, as part of the load has been moved into
gro_network_flush()/tcp_gro_receive().

Cheers,

Paolo
Richard Gobert March 27, 2024, 4:07 p.m. UTC | #15
Paolo Abeni wrote:
> On Tue, 2024-03-26 at 18:25 +0100, Richard Gobert wrote:
>> Paolo Abeni wrote:
>>> Hi,
>>>
>>> On Tue, 2024-03-26 at 16:02 +0100, Richard Gobert wrote:
>>>> This patch is meaningful by itself - removing checks against non-relevant
>>>> packets and making the flush/flush_id checks in a single place.
>>>
>>> I'm personally not sure this patch is a win. The code churn is
>>> significant. I understand this is for performance's sake, but I don't
>>> see the benefit??? 
>>>
>>
>> Could you clarify what do you mean by code churn?
> 
> The diffstat of this patch is not negligible and touches very sensitive
> areas.
> 

The diff mainly touches flush/flush_id/is_atomic; the new code should be
less complex. I agree this is sensitive as it is part of core GRO -
I checked all relevant flows manually, but I can also create more
tests to ensure that the logic remains the same.

>>> The changelog shows that perf reports slightly lower figures for
>>> inet_gro_receive(). That is expected, as this patch moves code out of
>>> such function. What about inet_gro_flush()/tcp_gro_receive() where such
>>> code is moved?
>>>
>>
>> Please consider the following 2 common scenarios:
>>
>> 1) Multiple packets in the GRO bucket - the common case with multiple
>>    packets in the bucket (i.e. running super_netperf TCP_STREAM) - each layer
>>    executes a for loop - going over each packet in the bucket. Specifically,
>>    L3 gro_receive loops over the bucket making flush,flush_id,is_atomic
>>    checks. 
> 
> Only for packets with the same rx hash. 
> 

Right, but there are only 8 GRO buckets, so a collision can still happen
on multiple concurrent streams.

>> For most packets in the bucket, these checks are not
>>    relevant. (possibly also dirtying cache lines with non-relevant p
>>    packets). Removing code in the for loop for this case is significant.
>>
>> 2) UDP/TCP streams which do not coalesce in GRO. This is the common case
>>    for regular UDP connections (i.e. running netperf UDP_STREAM). In this
>>    case, GRO is just overhead. Removing any code from these layers
>>    is good (shown in the first measurement of the commit message).
> 
> If UDP GRO is not enabled, there are no UDP packet staging in the UDP
> gro engine, the bucket list is empty.
> 
>>> Additionally the reported deltas is within noise level according to my
>>> personal experience with similar tests.
>>>
>>
>> I've tested the difference between net-next and this patch repetitively,
>> which showed stable results each time. Is there any specific test you
>> think would be helpful to show the result?
> 
> Anything that show measurable gain. 
> 
> Reporting the CPU utilization in the inet_gro_receive() function alone
> is not enough, as part of the load has been moved into
> gro_network_flush()/tcp_gro_receive().
> 

Got it, the numbers I reported were only relevant to UDP flows (so
measuring perf top with the -g flag showed the same improvement). I'll post
numbers relevant to TCP as well in v5.
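For reference, a TCP-side measurement could mirror the UDP command from the
commit message (a guess at the invocation; the exact v5 benchmark is not
specified in this thread):

./super_netperf.sh 40 -H 1.1.1.2 -t TCP_STREAM -l 120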

Thanks

Patch

diff --git a/include/net/gro.h b/include/net/gro.h
index a1cc8e8c2ebd..73e266194528 100644
--- a/include/net/gro.h
+++ b/include/net/gro.h
@@ -36,15 +36,15 @@  struct napi_gro_cb {
 	/* This is non-zero if the packet cannot be merged with the new skb. */
 	u16	flush;
 
-	/* Save the IP ID here and check when we get to the transport layer */
-	u16	flush_id;
-
 	/* Number of segments aggregated. */
 	u16	count;
 
 	/* Used in ipv6_gro_receive() and foo-over-udp and esp-in-udp */
 	u16	proto;
 
+	/* used to support CHECKSUM_COMPLETE for tunneling protocols */
+	__wsum	csum;
+
 /* Used in napi_gro_cb::free */
 #define NAPI_GRO_FREE             1
 #define NAPI_GRO_FREE_STOLEN_HEAD 2
@@ -85,9 +85,6 @@  struct napi_gro_cb {
 		u8	is_flist:1;
 	);
 
-	/* used to support CHECKSUM_COMPLETE for tunneling protocols */
-	__wsum	csum;
-
 	/* L3 offsets */
 	union {
 		struct {
diff --git a/net/core/gro.c b/net/core/gro.c
index 7c468ed805f4..f8cbc08197f7 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -329,8 +329,6 @@  static void gro_list_prepare(const struct list_head *head,
 	list_for_each_entry(p, head, list) {
 		unsigned long diffs;
 
-		NAPI_GRO_CB(p)->flush = 0;
-
 		if (hash != skb_get_hash_raw(p)) {
 			NAPI_GRO_CB(p)->same_flow = 0;
 			continue;
@@ -470,7 +468,6 @@  static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
 					sizeof(u32))); /* Avoid slow unaligned acc */
 	*(u32 *)&NAPI_GRO_CB(skb)->zeroed = 0;
 	NAPI_GRO_CB(skb)->flush = skb_has_frag_list(skb);
-	NAPI_GRO_CB(skb)->is_atomic = 1;
 	NAPI_GRO_CB(skb)->count = 1;
 	if (unlikely(skb_is_gso(skb))) {
 		NAPI_GRO_CB(skb)->count = skb_shinfo(skb)->gso_segs;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 6546bf376b24..9d7fd79ee915 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1513,7 +1513,6 @@  struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
 
 	list_for_each_entry(p, head, list) {
 		struct iphdr *iph2;
-		u16 flush_id;
 
 		if (!NAPI_GRO_CB(p)->same_flow)
 			continue;
@@ -1530,43 +1529,8 @@  struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
 			NAPI_GRO_CB(p)->same_flow = 0;
 			continue;
 		}
-
-		/* All fields must match except length and checksum. */
-		NAPI_GRO_CB(p)->flush |=
-			(iph->ttl ^ iph2->ttl) |
-			(iph->tos ^ iph2->tos) |
-			((iph->frag_off ^ iph2->frag_off) & htons(IP_DF));
-
-		NAPI_GRO_CB(p)->flush |= flush;
-
-		/* We need to store of the IP ID check to be included later
-		 * when we can verify that this packet does in fact belong
-		 * to a given flow.
-		 */
-		flush_id = (u16)(id - ntohs(iph2->id));
-
-		/* This bit of code makes it much easier for us to identify
-		 * the cases where we are doing atomic vs non-atomic IP ID
-		 * checks.  Specifically an atomic check can return IP ID
-		 * values 0 - 0xFFFF, while a non-atomic check can only
-		 * return 0 or 0xFFFF.
-		 */
-		if (!NAPI_GRO_CB(p)->is_atomic ||
-		    !(iph->frag_off & htons(IP_DF))) {
-			flush_id ^= NAPI_GRO_CB(p)->count;
-			flush_id = flush_id ? 0xFFFF : 0;
-		}
-
-		/* If the previous IP ID value was based on an atomic
-		 * datagram we can overwrite the value and ignore it.
-		 */
-		if (NAPI_GRO_CB(skb)->is_atomic)
-			NAPI_GRO_CB(p)->flush_id = flush_id;
-		else
-			NAPI_GRO_CB(p)->flush_id |= flush_id;
 	}
 
-	NAPI_GRO_CB(skb)->is_atomic = !!(iph->frag_off & htons(IP_DF));
 	NAPI_GRO_CB(skb)->flush |= flush;
 
 	/* Note : No need to call skb_gro_postpull_rcsum() here,
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index a63af9a6b0f7..ed9947bb903d 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -178,6 +178,55 @@  struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
 	return segs;
 }
 
+static int inet_gro_flush(const struct iphdr *iph, const struct iphdr *iph2,
+			  struct sk_buff *p, u32 outer)
+{
+	const u32 id = ntohl(*(__be32 *)&iph->id);
+	const u32 id2 = ntohl(*(__be32 *)&iph2->id);
+	const int flush_id = ntohs(id >> 16) - ntohs(id2 >> 16);
+	const u16 count = NAPI_GRO_CB(p)->count;
+	const u32 df = id & IP_DF;
+	u32 is_atomic;
+	int flush;
+
+	/* All fields must match except length and checksum. */
+	flush = (iph->ttl ^ iph2->ttl) | (iph->tos ^ iph2->tos) | (df ^ (id2 & IP_DF));
+
+	/* When we receive our second frame we can make a decision on if we
+	 * continue this flow as an atomic flow with a fixed ID or if we use
+	 * an incrementing ID.
+	 */
+	if (count == 1) {
+		is_atomic = df && flush_id == 0;
+		NAPI_GRO_CB(p)->is_atomic = is_atomic;
+	} else {
+		is_atomic = df && NAPI_GRO_CB(p)->is_atomic;
+	}
+
+	/* Ignore outer IP ID value if based on atomic datagram. */
+	outer = (outer && df) - 1;
+	is_atomic--;
+
+	return flush | ((flush_id ^ (count & is_atomic)) & outer);
+}
+
+static int ipv6_gro_flush(const struct ipv6hdr *iph, const struct ipv6hdr *iph2)
+{
+	/* <Version:4><Traffic_Class:8><Flow_Label:20> */
+	__be32 first_word = *(__be32 *)iph ^ *(__be32 *)iph2;
+
+	/* Flush if Traffic Class fields are different. */
+	return (first_word & htonl(0x0FF00000)) |
+		(__force __be32)(iph->hop_limit ^ iph2->hop_limit);
+}
+
+static int gro_network_flush(const void *nh, const void *nh2,
+			     struct sk_buff *p, u32 outer)
+{
+	return (((struct iphdr *)nh)->version == 6) ? ipv6_gro_flush(nh, nh2) :
+		inet_gro_flush(nh, nh2, p, outer);
+}
+
 struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb)
 {
 	struct sk_buff *pp = NULL;
@@ -190,6 +239,7 @@  struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb)
 	unsigned int mss = 1;
 	unsigned int hlen;
 	unsigned int off;
+	bool encap_mark;
 	int flush = 1;
 	int i;
 
@@ -232,9 +282,7 @@  struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb)
 	goto out_check_final;
 
 found:
-	/* Include the IP ID check below from the inner most IP hdr */
-	flush = NAPI_GRO_CB(p)->flush;
-	flush |= (__force int)(flags & TCP_FLAG_CWR);
+	flush = (__force int)(flags & TCP_FLAG_CWR);
 	flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
 		  ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
 	flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
@@ -242,16 +290,14 @@  struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb)
 		flush |= *(u32 *)((u8 *)th + i) ^
 			 *(u32 *)((u8 *)th2 + i);
 
-	/* When we receive our second frame we can made a decision on if we
-	 * continue this flow as an atomic flow with a fixed ID or if we use
-	 * an incrementing ID.
-	 */
-	if (NAPI_GRO_CB(p)->flush_id != 1 ||
-	    NAPI_GRO_CB(p)->count != 1 ||
-	    !NAPI_GRO_CB(p)->is_atomic)
-		flush |= NAPI_GRO_CB(p)->flush_id;
-	else
-		NAPI_GRO_CB(p)->is_atomic = false;
+	encap_mark = NAPI_GRO_CB(skb)->encap_mark;
+	for (i = 0; i <= encap_mark; i++) {
+		const u16 diff = off - NAPI_GRO_CB(skb)->network_offsets[i];
+
+		flush |= gro_network_flush((void *)th - diff,
+					   (void *)th2 - diff,
+					   p, i != encap_mark);
+	}
 
 	mss = skb_shinfo(p)->gso_size;
 
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index ba41939537f2..c9a6bc1afc9a 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -288,19 +288,8 @@  INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
 				   nlen - sizeof(struct ipv6hdr)))
 				goto not_same_flow;
 		}
-		/* flush if Traffic Class fields are different */
-		NAPI_GRO_CB(p)->flush |= !!((first_word & htonl(0x0FF00000)) |
-			(__force __be32)(iph->hop_limit ^ iph2->hop_limit));
-		NAPI_GRO_CB(p)->flush |= flush;
-
-		/* If the previous IP ID value was based on an atomic
-		 * datagram we can overwrite the value and ignore it.
-		 */
-		if (NAPI_GRO_CB(skb)->is_atomic)
-			NAPI_GRO_CB(p)->flush_id = 0;
 	}
 
-	NAPI_GRO_CB(skb)->is_atomic = true;
 	NAPI_GRO_CB(skb)->flush |= flush;
 
 	skb_gro_postpull_rcsum(skb, iph, nlen);