
[net-next,v6,5/6] net: gro: move L3 flush checks to tcp_gro_receive and udp_gro_receive_segment

Message ID 20240410153423.107381-6-richardbgobert@gmail.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Series net: gro: encapsulation bug fix and flush checks improvements

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net-next, async
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present success Fixes tag not required for -next series
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 970 this patch: 969
netdev/build_tools success Errors and warnings before: 0 this patch: 0
netdev/cc_maintainers success CCed 5 of 5 maintainers
netdev/build_clang success Errors and warnings before: 954 this patch: 954
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 983 this patch: 982
netdev/checkpatch warning WARNING: line length of 106 exceeds 80 columns WARNING: line length of 83 exceeds 80 columns WARNING: line length of 87 exceeds 80 columns WARNING: line length of 96 exceeds 80 columns
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/source_inline success Was 0 now: 0
netdev/contest fail net-next-2024-04-11--09-00 (tests: 961)

Commit Message

Richard Gobert April 10, 2024, 3:34 p.m. UTC
{inet,ipv6}_gro_receive functions perform flush checks (ttl, flags,
iph->id, ...) against all packets in a loop. These flush checks are used
currently only in tcp flows in GRO.

These checks need to be done only once in tcp_gro_receive and only against
the found p skb, since they only affect flush and not same_flow.

This patch leverages the previous commit in the series, in which correct
network header offsets are saved for both outer and inner network headers,
allowing these checks to be done only once, in tcp_gro_receive. As a
result, NAPI_GRO_CB(p)->flush is not used at all. In addition, flush_id
checks are more declarative and contained in inet_gro_flush, thus removing
the need for flush_id in napi_gro_cb.

This results in less parsing code for UDP flows and non-loop flush tests
for TCP flows.

To make sure the results are not within the noise range, I made netfilter
drop all TCP packets and measured CPU performance in GRO (in this case GRO
is responsible for about 50% of the CPU utilization). gro_network_flush is
compiled inline into tcp_gro_receive.

perf top while replaying 64 parallel IP/TCP streams merging in GRO
net-next:
        6.94% [kernel] [k] inet_gro_receive
        3.02% [kernel] [k] tcp_gro_receive

patch applied:
        4.27% [kernel] [k] tcp_gro_receive
        4.22% [kernel] [k] inet_gro_receive

perf top while replaying 64 parallel IP/IP/TCP streams merging in GRO (same
results for any encapsulation; in this case inet_gro_receive is the top
offender in net-next)
net-next:
        10.09% [kernel] [k] inet_gro_receive
        2.08% [kernel] [k] tcp_gro_receive

patch applied:
        6.97% [kernel] [k] inet_gro_receive
        3.68% [kernel] [k] tcp_gro_receive

perf top -g while running 64 IP/UDP netperf connections NOT merging in GRO
(udp_gro_receive is included because of -g, in this case GRO is just
overhead)
net-next:
	1.26% [kernel] [k] inet_gro_receive

patch applied:
	0.85% [kernel] [k] inet_gro_receive

Signed-off-by: Richard Gobert <richardbgobert@gmail.com>
---
 include/net/gro.h      | 66 ++++++++++++++++++++++++++++++++++++++----
 net/core/gro.c         |  4 ---
 net/ipv4/af_inet.c     | 41 +-------------------------
 net/ipv4/tcp_offload.c | 15 ++--------
 net/ipv4/udp_offload.c | 16 +++-------
 net/ipv6/ip6_offload.c | 11 -------
 6 files changed, 67 insertions(+), 86 deletions(-)
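
For orientation, the moved checks now run once from the transport callbacks,
against the single matched p. A condensed view of the new call flow
(simplified from the full patch at the bottom of this page, not the verbatim
diff):

	/* tcp_gro_receive()/udp_gro_receive_segment(), once a matching p
	 * has been found:
	 */
	flush |= gro_network_flush(th, th2, p, off);

	/* gro_network_flush() walks the outer and (if present) inner
	 * network headers using the offsets saved by the previous patch
	 * in the series, dispatching on the IP version of each header:
	 */
	for (i = 0; i <= encap_mark; i++) {
		const u16 diff = off - NAPI_GRO_CB(p)->network_offsets[i];
		const void *nh = th - diff;
		const void *nh2 = th2 - diff;

		if (((struct iphdr *)nh)->version == 6)
			flush |= ipv6_gro_flush(nh, nh2);
		else
			flush |= inet_gro_flush(nh, nh2, p, i != encap_mark);
	}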

Comments

Willem de Bruijn April 11, 2024, 2:45 a.m. UTC | #1
Richard Gobert wrote:
> {inet,ipv6}_gro_receive functions perform flush checks (ttl, flags,
> iph->id, ...) against all packets in a loop. These flush checks are used
> currently only in tcp flows in GRO.
> 
> These checks need to be done only once in tcp_gro_receive and only against
> the found p skb, since they only affect flush and not same_flow.

I don't quite understand where the performance improvements arise.
As inet_gro_receive will skip any p that does not match:

      if (!NAPI_GRO_CB(p)->same_flow)
              continue;

      iph2 = (struct iphdr *)(p->data + off);
      /* The above works because, with the exception of the top
       * (inner most) layer, we only aggregate pkts with the same
       * hdr length so all the hdrs we'll need to verify will start
       * at the same offset.
       */
      if ((iph->protocol ^ iph2->protocol) |
          ((__force u32)iph->saddr ^ (__force u32)iph2->saddr) |
          ((__force u32)iph->daddr ^ (__force u32)iph2->daddr)) {
              NAPI_GRO_CB(p)->same_flow = 0;
              continue;
      }

So these checks are already only performed against a p that matches.
 
> Leveraging the previous commit in the series, in which correct network
> header offsets are saved for both outer and inner network headers -
> allowing these checks to be done only once, in tcp_gro_receive. As a

Comments should be updated to reflect both TCP and L4 UDP. Can
generalize to transport callbacks.

> result, NAPI_GRO_CB(p)->flush is not used at all. In addition, flush_id
> checks are more declarative and contained in inet_gro_flush, thus removing
> the need for flush_id in napi_gro_cb.
> 
> This results in less parsing code for UDP flows and non-loop flush tests
> for TCP flows.

This moves network layer tests out of the network layer callbacks into
helpers called from the transport layer callback. And then the helper
has to look up the network layer header and demultiplex the protocol
again:

    +		if (((struct iphdr *)nh)->version == 6)
    +			flush |= ipv6_gro_flush(nh, nh2);
    +		else
    +			flush |= inet_gro_flush(nh, nh2, p, i != encap_mark);

That just seems a bit roundabout.
Richard Gobert April 11, 2024, 4:07 p.m. UTC | #2
Willem de Bruijn wrote:
> Richard Gobert wrote:
>> {inet,ipv6}_gro_receive functions perform flush checks (ttl, flags,
>> iph->id, ...) against all packets in a loop. These flush checks are used
>> currently only in tcp flows in GRO.
>>
>> These checks need to be done only once in tcp_gro_receive and only against
>> the found p skb, since they only affect flush and not same_flow.
> 
> I don't quite understand where the performance improvements arise.
> As inet_gro_receive will skip any p that does not match:
> 
>       if (!NAPI_GRO_CB(p)->same_flow)
>               continue;
> 
>       iph2 = (struct iphdr *)(p->data + off);
>       /* The above works because, with the exception of the top
>        * (inner most) layer, we only aggregate pkts with the same
>        * hdr length so all the hdrs we'll need to verify will start
>        * at the same offset.
>        */
>       if ((iph->protocol ^ iph2->protocol) |
>           ((__force u32)iph->saddr ^ (__force u32)iph2->saddr) |
>           ((__force u32)iph->daddr ^ (__force u32)iph2->daddr)) {
>               NAPI_GRO_CB(p)->same_flow = 0;
>               continue;
>       }
> 
> So these checks are already only performed against a p that matches.
>  


Thanks for the review!

flush/flush_id is calculated for all other p with same_flow = 1 (which is
not always determined to be 0 before inet_gro_receive) and same src/dst
addr in the bucket. Moving it to udp_gro_receive_segment/tcp_gro_receive
will make it run only once when a matching p is found.

In addition, UDP flows where skb_gro_receive_list is called -
flush/flush_id is not relevant and does not need to be calculated. In these
cases total CPU time in GRO should drop. I could post perf numbers for
this flow as well.
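
To make the per-candidate cost concrete, here is a small self-contained
userspace model of the IPv4 part of that decision (simplified from
inet_gro_flush in the patch; the struct and function names below are made up
for the example, and this is illustrative code, not the kernel
implementation). Before the patch, the equivalent logic runs inside the
inet_gro_receive loop for every p in the bucket that still matches; after
it, only for the single matched p:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct ip_fields {
	uint8_t  ttl;
	uint8_t  tos;
	uint16_t id;	/* IP ID, host order for simplicity */
	bool     df;	/* Don't Fragment bit */
};

/* Non-zero return means the new segment must not be merged into the flow.
 * @count is the number of segments already merged into the flow head,
 * @is_atomic tracks whether the flow keeps a fixed IP ID.
 */
static int ipv4_flush_check(const struct ip_fields *iph,   /* new segment */
			    const struct ip_fields *iph2,  /* flow head p */
			    uint16_t count, bool *is_atomic)
{
	uint16_t id_delta = iph->id - iph2->id;
	int flush;

	/* All fields must match except length and checksum. */
	flush = (iph->ttl ^ iph2->ttl) | (iph->tos ^ iph2->tos) |
		(iph->df ^ iph2->df);

	/* On the second frame, decide whether the flow uses a fixed
	 * (atomic) IP ID or an incrementing one.
	 */
	if (count == 1 && iph->df && id_delta == 0)
		*is_atomic = true;

	if (iph->df && *is_atomic)
		return flush | id_delta;		/* fixed ID must not change  */

	return flush | (uint16_t)(id_delta ^ count);	/* ID must advance by @count */
}

int main(void)
{
	struct ip_fields head = { .ttl = 64, .tos = 0, .id = 100, .df = false };
	struct ip_fields next = { .ttl = 64, .tos = 0, .id = 101, .df = false };
	bool atomic = false;

	/* Second segment of a non-atomic flow, ID advanced by one: no flush. */
	printf("flush = %d\n", ipv4_flush_check(&next, &head, 1, &atomic));
	return 0;
}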


>> Leveraging the previous commit in the series, in which correct network
>> header offsets are saved for both outer and inner network headers -
>> allowing these checks to be done only once, in tcp_gro_receive. As a
> 
> Comments should be updated to reflect both TCP and L4 UDP. Can
> generalize to transport callbacks.
> 
>> result, NAPI_GRO_CB(p)->flush is not used at all. In addition, flush_id
>> checks are more declarative and contained in inet_gro_flush, thus removing
>> the need for flush_id in napi_gro_cb.
>>
>> This results in less parsing code for UDP flows and non-loop flush tests
>> for TCP flows.
> 
> This moves network layer tests out of the network layer callbacks into
> helpers called from the transport layer callback. And then the helper
> has to look up the network layer header and demultiplex the protocol
> again:
> 
>     +		if (((struct iphdr *)nh)->version == 6)
>     +			flush |= ipv6_gro_flush(nh, nh2);
>     +		else
>     +			flush |= inet_gro_flush(nh, nh2, p, i != encap_mark);
> 
> That just seems a bit roundabout.

IMO this commit could be a part of a larger change, where all
loops in gro_list_prepare, inet_gro_receive and ipv6_gro_receive can be
removed, and the logic for finding a matching p will be moved to L4.  This
means that when p is found, the rest of the gro_list would not need to be
traversed and thus would not even dirty cache lines at all. I can provide a
code snippet which would explain it better.
Willem de Bruijn April 11, 2024, 9:35 p.m. UTC | #3
Richard Gobert wrote:
> Willem de Bruijn wrote:
> > Richard Gobert wrote:
> >> {inet,ipv6}_gro_receive functions perform flush checks (ttl, flags,
> >> iph->id, ...) against all packets in a loop. These flush checks are used
> >> currently only in tcp flows in GRO.
> >>
> >> These checks need to be done only once in tcp_gro_receive and only against
> >> the found p skb, since they only affect flush and not same_flow.
> > 
> > I don't quite understand where the performance improvements arise.
> > As inet_gro_receive will skip any p that does not match:
> > 
> >       if (!NAPI_GRO_CB(p)->same_flow)
> >               continue;
> > 
> >       iph2 = (struct iphdr *)(p->data + off);
> >       /* The above works because, with the exception of the top
> >        * (inner most) layer, we only aggregate pkts with the same
> >        * hdr length so all the hdrs we'll need to verify will start
> >        * at the same offset.
> >        */
> >       if ((iph->protocol ^ iph2->protocol) |
> >           ((__force u32)iph->saddr ^ (__force u32)iph2->saddr) |
> >           ((__force u32)iph->daddr ^ (__force u32)iph2->daddr)) {
> >               NAPI_GRO_CB(p)->same_flow = 0;
> >               continue;
> >       }
> > 
> > So these checks are already only performed against a p that matches.
> >  
> 
> 
> Thanks for the review!
> 
> flush/flush_id is calculated for all other p with same_flow = 1 (which is
> not always determined to be 0 before inet_gro_receive) and same src/dst
> addr in the bucket. Moving it to udp_gro_receive_segment/tcp_gro_receive
> will make it run only once when a matching p is found.

So this optimization is for flows that are the same up to having the
same saddr/daddr. Aside from stress tests, it seems rare to have many
concurrent flows between the same pair of machines?

> 
> In addition, UDP flows where skb_gro_receive_list is called -
> flush/flush_id is not relevant and does not need to be calculated. 

That makes sense

> In these
> cases total CPU time in GRO should drop. I could post perf numbers for
> this flow as well.
> 
> 
> >> Leveraging the previous commit in the series, in which correct network
> >> header offsets are saved for both outer and inner network headers -
> >> allowing these checks to be done only once, in tcp_gro_receive. As a
> > 
> > Comments should be updated to reflect both TCP and L4 UDP. Can
> > generalize to transport callbacks.
> > 
> >> result, NAPI_GRO_CB(p)->flush is not used at all. In addition, flush_id
> >> checks are more declarative and contained in inet_gro_flush, thus removing
> >> the need for flush_id in napi_gro_cb.
> >>
> >> This results in less parsing code for UDP flows and non-loop flush tests
> >> for TCP flows.
> > 
> > This moves network layer tests out of the network layer callbacks into
> > helpers called from the transport layer callback. And then the helper
> > has to look up the network layer header and demultiplex the protocol
> > again:
> > 
> >     +		if (((struct iphdr *)nh)->version == 6)
> >     +			flush |= ipv6_gro_flush(nh, nh2);
> >     +		else
> >     +			flush |= inet_gro_flush(nh, nh2, p, i != encap_mark);
> > 
> > That just seems a bit roundabout.
> 
> IMO this commit could be a part of a larger change, where all
> loops in gro_list_prepare, inet_gro_receive and ipv6_gro_receive can be
> removed, and the logic for finding a matching p will be moved to L4.  This
> means that when p is found, the rest of the gro_list would not need to be
> traversed and thus would not even dirty cache lines at all. I can provide a
> code snippet which would explain it better.

These loops are exactly the mechanism to find a matching p. Though
with all the callbacks perhaps not the most efficient model. The
hashtable should have solved much of that.

Yes, please share a snippet to understand how you would replace this.

In the meantime, I do suggest sending the first two patches to net,
as they have Fixes tags. And then follow up with this for net-next
separately.
Richard Gobert April 12, 2024, 3:37 p.m. UTC | #4
Willem de Bruijn wrote:
> Richard Gobert wrote:
>> Willem de Bruijn wrote:
>>> Richard Gobert wrote:
>>>> {inet,ipv6}_gro_receive functions perform flush checks (ttl, flags,
>>>> iph->id, ...) against all packets in a loop. These flush checks are used
>>>> currently only in tcp flows in GRO.
>>>>
>>>> These checks need to be done only once in tcp_gro_receive and only against
>>>> the found p skb, since they only affect flush and not same_flow.
>>>
>>> I don't quite understand where the performance improvements arise.
>>> As inet_gro_receive will skip any p that does not match:
>>>
>>>       if (!NAPI_GRO_CB(p)->same_flow)
>>>               continue;
>>>
>>>       iph2 = (struct iphdr *)(p->data + off);
>>>       /* The above works because, with the exception of the top
>>>        * (inner most) layer, we only aggregate pkts with the same
>>>        * hdr length so all the hdrs we'll need to verify will start
>>>        * at the same offset.
>>>        */
>>>       if ((iph->protocol ^ iph2->protocol) |
>>>           ((__force u32)iph->saddr ^ (__force u32)iph2->saddr) |
>>>           ((__force u32)iph->daddr ^ (__force u32)iph2->daddr)) {
>>>               NAPI_GRO_CB(p)->same_flow = 0;
>>>               continue;
>>>       }
>>>
>>> So these checks are already only performed against a p that matches.
>>>  
>>
>>
>> Thanks for the review!
>>
>> flush/flush_id is calculated for all other p with same_flow = 1 (which is
>> not always determined to be 0 before inet_gro_receive) and same src/dst
>> addr in the bucket. Moving it to udp_gro_receive_segment/tcp_gro_receive
>> will make it run only once when a matching p is found.
> 
> So this optimization is for flows that are the same up to having the
> same saddr/daddr. Aside from stress tests, it seems rare to have many
> concurrent flows between the same pair of machines?
> 

Yes exactly, sorry if I wasn't clear enough earlier. Multiple connections
with the same srcaddr+dstaddr are not necessarily rare (e.g. devices behind
a large NAT network all connecting to the same servers, thus sharing the
same IP srcaddr).

>>
>> In addition, UDP flows where skb_gro_receive_list is called -
>> flush/flush_id is not relevant and does not need to be calculated. 
> 
> That makes sense
> 

I ran a UDP forwarding benchmark similar to the one in the commit message.
GRO's flush/flush_id values were not relevant, as all packets reached
skb_gro_receive_list instead of skb_gro_receive.

These numbers show CPU utilization under the same load (64 concurrent
IP/UDP connections):

net-next:
        3.03%  [kernel]  [k] inet_gro_receive

patch applied:
        2.78%  [kernel]  [k] inet_gro_receive

And under encapsulated load, 64 concurrent IP/IP/UDP connections:
net-next:
        10.50%  [kernel]  [k] inet_gro_receive

patch applied:
        8.19%  [kernel]  [k] inet_gro_receive

Total time spent in GRO was reduced significantly due to fewer opcodes and
branches in inet_gro_receive :)

>> In these
>> cases total CPU time in GRO should drop. I could post perf numbers for
>> this flow as well.
>>
>>
>>>> Leveraging the previous commit in the series, in which correct network
>>>> header offsets are saved for both outer and inner network headers -
>>>> allowing these checks to be done only once, in tcp_gro_receive. As a
>>>
>>> Comments should be updated to reflect both TCP and L4 UDP. Can
>>> generalize to transport callbacks.
>>>
>>>> result, NAPI_GRO_CB(p)->flush is not used at all. In addition, flush_id
>>>> checks are more declarative and contained in inet_gro_flush, thus removing
>>>> the need for flush_id in napi_gro_cb.
>>>>
>>>> This results in less parsing code for UDP flows and non-loop flush tests
>>>> for TCP flows.
>>>
>>> This moves network layer tests out of the network layer callbacks into
>>> helpers called from the transport layer callback. And then the helper
>>> has to look up the network layer header and demultiplex the protocol
>>> again:
>>>
>>>     +		if (((struct iphdr *)nh)->version == 6)
>>>     +			flush |= ipv6_gro_flush(nh, nh2);
>>>     +		else
>>>     +			flush |= inet_gro_flush(nh, nh2, p, i != encap_mark);
>>>
>>> That just seems a bit roundabout.
>>
>> IMO this commit could be a part of a larger change, where all
>> loops in gro_list_prepare, inet_gro_receive and ipv6_gro_receive can be
>> removed, and the logic for finding a matching p will be moved to L4.  This
>> means that when p is found, the rest of the gro_list would not need to be
>> traversed and thus would not even dirty cache lines at all. I can provide a
>> code snippet which would explain it better.
> 
> These loops are exactly the mechanism to find a matching p. Though
> with all the callbacks perhaps not the most efficient model. The
> hashtable should have solved much of that.
> 
> Yes, please share a snippet to understand how you would replace this.
> 

This is still a rough idea, and I still think the current patch is
significant by itself due to its performance gains and more readable code.

The idea is that an skb from gro_list is loaded into cache only when GRO is
checking whether it matches the current skb, avoiding the multiple list
traversals GRO currently does for IP-TCP.

For a simple IP-TCP packet, gro_list_prepare will not be called from
dev_gro_receive anymore:

@@ -450,8 +489,6 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
        if (netif_elide_gro(skb->dev))
                goto normal;
 
-       gro_list_prepare(&gro_list->list, skb);

The matching loop from inet_gro_receive is removed as well:

@@ -1507,27 +1506,6 @@ struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
 
        NAPI_GRO_CB(skb)->proto = proto;
        flush = (u16)((ntohl(*(__be32 *)iph) ^ skb_gro_len(skb)) | (ntohl(*(__be32 *)&iph->id) & ~IP_DF));
-
-       list_for_each_entry(p, head, list) {
-               struct iphdr *iph2;
-
-               if (!NAPI_GRO_CB(p)->same_flow)
-                       continue;
-
-               iph2 = (struct iphdr *)(p->data + off);
-               /* The above works because, with the exception of the top
-                * (inner most) layer, we only aggregate pkts with the same
-                * hdr length so all the hdrs we'll need to verify will start
-                * at the same offset.
-                */
-               if ((iph->protocol ^ iph2->protocol) |
-                   ((__force u32)iph->saddr ^ (__force u32)iph2->saddr) |
-                   ((__force u32)iph->daddr ^ (__force u32)iph2->daddr)) {
-                       NAPI_GRO_CB(p)->same_flow = 0;
-                       continue;
-               }
-       }
-

and change tcp_gro_receive such that p is retrieved from gro_list using a new function:

--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -215,24 +215,12 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb)
        len = skb_gro_len(skb);
        flags = tcp_flag_word(th);
 
-       list_for_each_entry(p, head, list) {
-               if (!NAPI_GRO_CB(p)->same_flow)
-                       continue;
-
-               th2 = tcp_hdr(p);
-
-               if (*(u32 *)&th->source ^ *(u32 *)&th2->source) {
-                       NAPI_GRO_CB(p)->same_flow = 0;
-                       continue;
-               }
+       if (!(p = gro_list_find(head, skb, &flush)))
+               goto out_check_final;
 
-               goto found;
-       }
-       p = NULL;
-       goto out_check_final;
+       th2 = tcp_hdr(p);
 
-found:
-       flush = (__force int)(flags & TCP_FLAG_CWR);
+       flush |= (__force int)(flags & TCP_FLAG_CWR);
        flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
                  ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
        flush |= (__force int)(th->ack_seq ^ th2->ack_seq);


The first time gro_list is traversed is in tcp_gro_receive.  gro_list_find
is very similar to gro_list_prepare, but it also matches L3 & L4 (UDP and
TCP port offsets are identical), and it returns the L3 flush value as well.

The main difference between the current design and what I propose is that
once a matching p is found, there is no need to keep traversing the rest of
gro_list (as opposed to gro_list_prepare and inet_gro_receive today), which
is why certain checks are saved, especially under load.  This way, less data
is loaded into cache overall for every skb entering dev_gro_receive.
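
A rough sketch of what such a helper could look like (hypothetical code, not
part of this series; gro_flow_match() is an invented placeholder for the
combined header comparison, and the sketch assumes the transport header can
be made available via skb_gro_header()):

static struct sk_buff *gro_list_find(struct list_head *head,
				     struct sk_buff *skb, int *flush)
{
	int thoff = skb_gro_offset(skb);
	u32 hash = skb_get_hash_raw(skb);
	struct sk_buff *p;

	/* Ports sit in the first 4 bytes for both TCP and UDP; a UDP
	 * header's worth of data is enough to compare them.
	 */
	const void *th = skb_gro_header(skb, thoff + sizeof(struct udphdr), thoff);

	if (unlikely(!th))
		return NULL;

	list_for_each_entry(p, head, list) {
		if (hash != skb_get_hash_raw(p))
			continue;

		/* Compare the MAC header, the outer/inner network headers
		 * and the source/destination port word -- roughly the union
		 * of what gro_list_prepare(), {inet,ipv6}_gro_receive() and
		 * the transport callbacks check today.  gro_flow_match() is
		 * a hypothetical placeholder for that comparison.
		 */
		if (!gro_flow_match(p, skb, thoff))
			continue;

		/* Only the single matched p pays for the L3 flush checks,
		 * and the rest of the bucket is never touched.
		 */
		*flush = gro_network_flush(th, p->data + thoff, p, thoff);
		return p;
	}

	return NULL;
}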

> In the meantime, I do suggest sending the first two patches to net,
> as they have Fixes tags. And then follow up with this for net-next
> separately.

Just submitted the first two commits to net.
Thanks.

Patch

diff --git a/include/net/gro.h b/include/net/gro.h
index a1cc8e8c2ebd..c1f80f1156d6 100644
--- a/include/net/gro.h
+++ b/include/net/gro.h
@@ -36,15 +36,15 @@  struct napi_gro_cb {
 	/* This is non-zero if the packet cannot be merged with the new skb. */
 	u16	flush;
 
-	/* Save the IP ID here and check when we get to the transport layer */
-	u16	flush_id;
-
 	/* Number of segments aggregated. */
 	u16	count;
 
 	/* Used in ipv6_gro_receive() and foo-over-udp and esp-in-udp */
 	u16	proto;
 
+	/* used to support CHECKSUM_COMPLETE for tunneling protocols */
+	__wsum	csum;
+
 /* Used in napi_gro_cb::free */
 #define NAPI_GRO_FREE             1
 #define NAPI_GRO_FREE_STOLEN_HEAD 2
@@ -85,9 +85,6 @@  struct napi_gro_cb {
 		u8	is_flist:1;
 	);
 
-	/* used to support CHECKSUM_COMPLETE for tunneling protocols */
-	__wsum	csum;
-
 	/* L3 offsets */
 	union {
 		struct {
@@ -443,6 +440,63 @@  static inline __wsum ip6_gro_compute_pseudo(const struct sk_buff *skb,
 					    skb_gro_len(skb), proto, 0));
 }
 
+static inline int inet_gro_flush(const struct iphdr *iph, const struct iphdr *iph2,
+				 struct sk_buff *p, bool outer)
+{
+	const u32 id = ntohl(*(__be32 *)&iph->id);
+	const u32 id2 = ntohl(*(__be32 *)&iph2->id);
+	const u16 flush_id = (id >> 16) - (id2 >> 16);
+	const u16 count = NAPI_GRO_CB(p)->count;
+	const u32 df = id & IP_DF;
+	u32 is_atomic;
+	int flush;
+
+	/* All fields must match except length and checksum. */
+	flush = (iph->ttl ^ iph2->ttl) | (iph->tos ^ iph2->tos) | (df ^ (id2 & IP_DF));
+
+	if (outer && df)
+		return flush;
+
+	/* When we receive our second frame we can make a decision on if we
+	 * continue this flow as an atomic flow with a fixed ID or if we use
+	 * an incrementing ID.
+	 */
+	NAPI_GRO_CB(p)->is_atomic |= (count == 1 && df && flush_id == 0);
+	is_atomic = (df && NAPI_GRO_CB(p)->is_atomic) - 1;
+
+	return flush | (flush_id ^ (count & is_atomic));
+}
+
+static inline int ipv6_gro_flush(const struct ipv6hdr *iph, const struct ipv6hdr *iph2)
+{
+	/* <Version:4><Traffic_Class:8><Flow_Label:20> */
+	__be32 first_word = *(__be32 *)iph ^ *(__be32 *)iph2;
+
+	/* Flush if Traffic Class fields are different. */
+	return !!((first_word & htonl(0x0FF00000)) |
+		(__force __be32)(iph->hop_limit ^ iph2->hop_limit));
+}
+
+static inline int gro_network_flush(const void *th, const void *th2, struct sk_buff *p, int off)
+{
+	const bool encap_mark = NAPI_GRO_CB(p)->encap_mark;
+	int flush = 0;
+	int i;
+
+	for (i = 0; i <= encap_mark; i++) {
+		const u16 diff = off - NAPI_GRO_CB(p)->network_offsets[i];
+		const void *nh = th - diff;
+		const void *nh2 = th2 - diff;
+
+		if (((struct iphdr *)nh)->version == 6)
+			flush |= ipv6_gro_flush(nh, nh2);
+		else
+			flush |= inet_gro_flush(nh, nh2, p, i != encap_mark);
+	}
+
+	return flush;
+}
+
 int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb);
 
 /* Pass the currently batched GRO_NORMAL SKBs up to the stack. */
diff --git a/net/core/gro.c b/net/core/gro.c
index b2156e6cc4ad..3bfdfefe4a24 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -89,7 +89,6 @@  void dev_remove_offload(struct packet_offload *po)
 }
 EXPORT_SYMBOL(dev_remove_offload);
 
-
 int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
 {
 	struct skb_shared_info *pinfo, *skbinfo = skb_shinfo(skb);
@@ -330,8 +329,6 @@  static void gro_list_prepare(const struct list_head *head,
 	list_for_each_entry(p, head, list) {
 		unsigned long diffs;
 
-		NAPI_GRO_CB(p)->flush = 0;
-
 		if (hash != skb_get_hash_raw(p)) {
 			NAPI_GRO_CB(p)->same_flow = 0;
 			continue;
@@ -471,7 +468,6 @@  static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
 					sizeof(u32))); /* Avoid slow unaligned acc */
 	*(u32 *)&NAPI_GRO_CB(skb)->zeroed = 0;
 	NAPI_GRO_CB(skb)->flush = skb_has_frag_list(skb);
-	NAPI_GRO_CB(skb)->is_atomic = 1;
 	NAPI_GRO_CB(skb)->count = 1;
 	if (unlikely(skb_is_gso(skb))) {
 		NAPI_GRO_CB(skb)->count = skb_shinfo(skb)->gso_segs;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 6546bf376b24..af094aecf38c 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1481,7 +1481,6 @@  struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
 	struct sk_buff *p;
 	unsigned int hlen;
 	unsigned int off;
-	unsigned int id;
 	int flush = 1;
 	int proto;
 
@@ -1507,13 +1506,10 @@  struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
 		goto out;
 
 	NAPI_GRO_CB(skb)->proto = proto;
-	id = ntohl(*(__be32 *)&iph->id);
-	flush = (u16)((ntohl(*(__be32 *)iph) ^ skb_gro_len(skb)) | (id & ~IP_DF));
-	id >>= 16;
+	flush = (u16)((ntohl(*(__be32 *)iph) ^ skb_gro_len(skb)) | (ntohl(*(__be32 *)&iph->id) & ~IP_DF));
 
 	list_for_each_entry(p, head, list) {
 		struct iphdr *iph2;
-		u16 flush_id;
 
 		if (!NAPI_GRO_CB(p)->same_flow)
 			continue;
@@ -1530,43 +1526,8 @@  struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
 			NAPI_GRO_CB(p)->same_flow = 0;
 			continue;
 		}
-
-		/* All fields must match except length and checksum. */
-		NAPI_GRO_CB(p)->flush |=
-			(iph->ttl ^ iph2->ttl) |
-			(iph->tos ^ iph2->tos) |
-			((iph->frag_off ^ iph2->frag_off) & htons(IP_DF));
-
-		NAPI_GRO_CB(p)->flush |= flush;
-
-		/* We need to store of the IP ID check to be included later
-		 * when we can verify that this packet does in fact belong
-		 * to a given flow.
-		 */
-		flush_id = (u16)(id - ntohs(iph2->id));
-
-		/* This bit of code makes it much easier for us to identify
-		 * the cases where we are doing atomic vs non-atomic IP ID
-		 * checks.  Specifically an atomic check can return IP ID
-		 * values 0 - 0xFFFF, while a non-atomic check can only
-		 * return 0 or 0xFFFF.
-		 */
-		if (!NAPI_GRO_CB(p)->is_atomic ||
-		    !(iph->frag_off & htons(IP_DF))) {
-			flush_id ^= NAPI_GRO_CB(p)->count;
-			flush_id = flush_id ? 0xFFFF : 0;
-		}
-
-		/* If the previous IP ID value was based on an atomic
-		 * datagram we can overwrite the value and ignore it.
-		 */
-		if (NAPI_GRO_CB(skb)->is_atomic)
-			NAPI_GRO_CB(p)->flush_id = flush_id;
-		else
-			NAPI_GRO_CB(p)->flush_id |= flush_id;
 	}
 
-	NAPI_GRO_CB(skb)->is_atomic = !!(iph->frag_off & htons(IP_DF));
 	NAPI_GRO_CB(skb)->flush |= flush;
 
 	/* Note : No need to call skb_gro_postpull_rcsum() here,
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 7f045b881dd4..1b10ab3b0f6a 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -232,9 +232,7 @@  struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb)
 	goto out_check_final;
 
 found:
-	/* Include the IP ID check below from the inner most IP hdr */
-	flush = NAPI_GRO_CB(p)->flush;
-	flush |= (__force int)(flags & TCP_FLAG_CWR);
+	flush = (__force int)(flags & TCP_FLAG_CWR);
 	flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
 		  ~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
 	flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
@@ -242,16 +240,7 @@  struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb)
 		flush |= *(u32 *)((u8 *)th + i) ^
 			 *(u32 *)((u8 *)th2 + i);
 
-	/* When we receive our second frame we can made a decision on if we
-	 * continue this flow as an atomic flow with a fixed ID or if we use
-	 * an incrementing ID.
-	 */
-	if (NAPI_GRO_CB(p)->flush_id != 1 ||
-	    NAPI_GRO_CB(p)->count != 1 ||
-	    !NAPI_GRO_CB(p)->is_atomic)
-		flush |= NAPI_GRO_CB(p)->flush_id;
-	else
-		NAPI_GRO_CB(p)->is_atomic = false;
+	flush |= gro_network_flush(th, th2, p, off);
 
 	mss = skb_shinfo(p)->gso_size;
 
diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index ad4c88fe7d15..c5a5155904cf 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -466,12 +466,12 @@  static struct sk_buff *udp_gro_receive_segment(struct list_head *head,
 					       struct sk_buff *skb)
 {
 	struct udphdr *uh = udp_gro_udphdr(skb);
+	int off = skb_gro_offset(skb);
 	struct sk_buff *pp = NULL;
 	struct udphdr *uh2;
 	struct sk_buff *p;
 	unsigned int ulen;
 	int ret = 0;
-	int flush;
 
 	/* requires non zero csum, for symmetry with GSO */
 	if (!uh->check) {
@@ -529,17 +529,9 @@  static struct sk_buff *udp_gro_receive_segment(struct list_head *head,
 				skb_gro_postpull_rcsum(skb, uh,
 						       sizeof(struct udphdr));
 
-				flush = NAPI_GRO_CB(p)->flush;
-
-				if (NAPI_GRO_CB(p)->flush_id != 1 ||
-				    NAPI_GRO_CB(p)->count != 1 ||
-				    !NAPI_GRO_CB(p)->is_atomic)
-					flush |= NAPI_GRO_CB(p)->flush_id;
-				else
-					NAPI_GRO_CB(p)->is_atomic = false;
-
-				if (flush || skb_gro_receive(p, skb))
-					ret = 1;
+				ret = gro_network_flush(uh, uh2, p, off);
+				if (!ret)
+					ret = skb_gro_receive(p, skb);
 			}
 		}
 
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index ba41939537f2..c9a6bc1afc9a 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -288,19 +288,8 @@  INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
 				   nlen - sizeof(struct ipv6hdr)))
 				goto not_same_flow;
 		}
-		/* flush if Traffic Class fields are different */
-		NAPI_GRO_CB(p)->flush |= !!((first_word & htonl(0x0FF00000)) |
-			(__force __be32)(iph->hop_limit ^ iph2->hop_limit));
-		NAPI_GRO_CB(p)->flush |= flush;
-
-		/* If the previous IP ID value was based on an atomic
-		 * datagram we can overwrite the value and ignore it.
-		 */
-		if (NAPI_GRO_CB(skb)->is_atomic)
-			NAPI_GRO_CB(p)->flush_id = 0;
 	}
 
-	NAPI_GRO_CB(skb)->is_atomic = true;
 	NAPI_GRO_CB(skb)->flush |= flush;
 
 	skb_gro_postpull_rcsum(skb, iph, nlen);