diff mbox series

[net,1/1] xfrm: fix source address in icmp error generation from IPsec gateway

Message ID 20ea2ab0472ecf2d1625dadb7ca0df39cf4fe0f5.1712226175.git.antony.antony@secunet.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series fix icmp error source with ICMP reverse lookup | expand

Checks

Context Check Description
netdev/series_format success Posting correctly formatted
netdev/tree_selection success Clearly marked for net
netdev/ynl success Generated files up to date; no warnings/errors; no diff in generated;
netdev/fixes_present fail Series targets non-next tree, but doesn't contain any Fixes tags
netdev/header_inline success No static functions without inline keyword in header files
netdev/build_32bit success Errors and warnings before: 947 this patch: 947
netdev/build_tools success No tools touched, skip
netdev/cc_maintainers success CCed 5 of 5 maintainers
netdev/build_clang success Errors and warnings before: 955 this patch: 955
netdev/verify_signedoff success Signed-off-by tag matches author and committer
netdev/deprecated_api success None detected
netdev/check_selftest success No net selftest shell script
netdev/verify_fixes success No Fixes tag
netdev/build_allmodconfig_warn success Errors and warnings before: 958 this patch: 958
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 7 lines checked
netdev/build_clang_rust success No Rust files in patch. Skipping build
netdev/kdoc success Errors and warnings before: 1 this patch: 1
netdev/source_inline success Was 0 now: 0
netdev/contest success net-next-2024-04-07--00-00 (tests: 956)

Commit Message

Antony Antony April 4, 2024, 10:31 a.m. UTC
When enabling support for XFRM lookup using reverse ICMP payload, we
have identified an issue where the source address of the IPv4, e.g.,
"Destination Host Unreachable" message, is incorrect. The error message
received by the sender appears to originate from a 
non-existing/unreachable host.  IPv6 seems to behave correctly; respond 
with an existing source address from the gateway.

Here is example of incorrect source address for ICMP error response.
When sending a ping to an unreachable host, the sender would receive an
ICMP unreachable response with a fake source address. Rather the address
of the host that generated ICMP Unreachable message. This is confusing
and incorrect.

Example:
ping -W 9 -w 5 -c 1 10.1.4.3
PING 10.1.4.3 (10.1.4.3) 56(84) bytes of data.
From 10.1.4.3 icmp_seq=1 Destination Host Unreachable

Notice : packet has the source address of the ICMP "Unreachable host!"

This issue can be traced back to commit
415b3334a21a ("icmp: Fix regression in nexthop resolution during replies.")
which introduced a change that copied the source address from the ICMP
payload.

This commit would force to use source address from the gatway/host.
The ICMP error message source address correctly set from the host.

After fixing:
ping -W 5 -c 1 10.1.4.3
PING 10.1.4.3 (10.1.4.3) 56(84) bytes of data.
From 10.1.3.2 icmp_seq=1 Destination Host Unreachable

Here is an snippt to reporduce the issue.

export AB="10.1"
for i in 1 2 3 4 5; do
        h="host${i}"
        ip netns add ${h}
        ip -netns ${h} link set lo up
        ip netns exec ${h} sysctl -wq net.ipv4.ip_forward=1
        if [ $i -lt 5 ]; then
                ip -netns ${h} link add eth0 type veth peer name eth10${i}
                ip -netns ${h} addr add "${AB}.${i}.1/24" dev eth0
                ip -netns ${h} link set up dev eth0
        fi
done

for i in 1 2 3 4 5; do
        h="host${i}"
        p=$((i - 1))
        ph="host${p}"
        # connect to previous host
        if [ $i -gt 1 ]; then
                ip -netns ${ph} link set eth10${p} netns ${h}
                ip -netns ${h} link set eth10${p} name eth1
                ip -netns ${h} link set up dev eth1
                ip -netns ${h} addr add "${AB}.${p}.2/24" dev eth1
        fi
        # add forward routes
        for k in $(seq ${i} $((5 - 1))); do
                ip -netns ${h} route 2>/dev/null | (grep "${AB}.${k}.0" 2>/dev/null) || \
                ip -netns ${h} route add "${AB}.${k}.0/24" via "${AB}.${i}.2" 2>/dev/nul
        done

        # add reverse routes
        for k in $(seq 1 $((i - 2))); do
                ip -netns ${h} route 2>/dev/null | grep "${AB}.${k}.0" 2>/dev/null || \
                ip -netns ${h} route add "${AB}.${k}.0/24" via "${AB}.${p}.1" 2>/dev/nul
        done
done

ip netns exec host1 ping -q -W 2 -w 1 -c 1 10.1.4.2 2>&1>/dev/null && echo "success 10.1.4.2 reachable" || echo "ERROR"
ip netns exec host1 ping -W 9 -w 5 -c 1 10.1.4.3 || echo  "note the source address of unreachble of gateway"
ip -netns host1 route flush cache

ip netns exec host3 nft add table inet filter
ip netns exec host3 nft add chain inet filter FORWARD { type filter hook forward priority filter\; policy drop \; }
ip netns exec host3 nft add rule inet filter FORWARD counter ip protocol icmp drop
ip netns exec host3 nft add rule inet filter FORWARD counter ip protocol esp accept
ip netns exec host3 nft add rule inet filter FORWARD counter drop

ip -netns host2 xfrm policy add src 10.1.1.0/24 dst 10.1.4.0/24 dir out \
        flag icmp tmpl src 10.1.2.1 dst 10.1.3.2 proto esp reqid 1 mode tunnel

ip -netns host2 xfrm policy add src 10.1.4.0/24 dst 10.1.1.0/24 dir in \
        tmpl src 10.1.3.2 dst 10.1.2.1 proto esp reqid 2 mode tunnel

ip -netns host2 xfrm policy add src 10.1.4.0/24 dst 10.1.1.0/24 dir fwd \
        flag icmp tmpl src 10.1.3.2 dst 10.1.2.1 proto esp reqid 2 mode tunnel

ip -netns host2 xfrm state add src 10.1.2.1 dst 10.1.3.2 proto esp spi 1 \
        reqid 1 replay-window 1  mode tunnel aead 'rfc4106(gcm(aes))' \
        0x1111111111111111111111111111111111111111 96 \
        sel src 10.1.1.0/24 dst 10.1.4.0/24

ip -netns host2 xfrm state add src 10.1.3.2 dst 10.1.2.1 proto esp spi 2 \
        flag icmp reqid 2 replay-window 10 mode tunnel aead 'rfc4106(gcm(aes))' \
        0x2222222222222222222222222222222222222222 96

ip -netns host4 xfrm policy add src 10.1.4.0/24 dst 10.1.1.0/24 dir out \
        flag icmp tmpl src 10.1.3.2 dst 10.1.2.1 proto esp reqid 1 mode tunnel

ip -netns host4 xfrm policy add src 10.1.1.0/24 dst 10.1.4.0/24 dir in \
        tmpl src 10.1.2.1 dst 10.1.3.2 proto esp reqid 2  mode tunnel

ip -netns host4 xfrm policy add src 10.1.1.0/24 dst 10.1.4.0/24 dir fwd \
                flag icmp tmpl src 10.1.2.1 dst 10.1.3.2 proto esp reqid 2 mode tunnel

ip -netns host4 xfrm state add src 10.1.3.2 dst 10.1.2.1 proto esp spi 2 \
        reqid 1 replay-window 1 mode tunnel aead 'rfc4106(gcm(aes))' \
        0x2222222222222222222222222222222222222222 96

ip -netns host4 xfrm state add src 10.1.2.1 dst 10.1.3.2 proto esp spi 1 \
        reqid 2 replay-window 20 flag icmp  mode tunnel aead 'rfc4106(gcm(aes))' \
        0x1111111111111111111111111111111111111111 96 \
        sel src 10.1.1.0/24 dst 10.1.4.0/24

ip netns exec host1 ping -W 5 -c 1 10.1.4.2 2>&1 > /dev/null && echo ""
ip netns exec host1 ping -W 5 -c 1 10.1.4.3 || echo "note source address of gateway 10.1.3.2"

Again before the fix
ping -W 5 -c 1 10.1.4.3
From 10.1.4.3 icmp_seq=1 Destination Host Unreachable

After the fix
From 10.1.3.2 icmp_seq=1 Destination Host Unreachable

Signed-off-by: Antony Antony <antony.antony@secunet.com>
---
 net/ipv4/icmp.c | 1 -
 1 file changed, 1 deletion(-)

--
2.30.2

Comments

Michael Richardson April 4, 2024, 11:38 a.m. UTC | #1
Antony Antony via Devel <devel@linux-ipsec.org> wrote:
    > This commit would force to use source address from the gatway/host.
    > The ICMP error message source address correctly set from the host.

While that seems more correct, since that host is generating, it might not
fit into the IPsec tunnel, and therefore might go the right place or
anywhere.   Perhaps you could pick the internal IP of the gateway, but in
more complex policies, the gateway itself might not be part of the VPN.

    > Again before the fix ping -W 5 -c 1 10.1.4.3 From 10.1.4.3 icmp_seq=1
    > Destination Host Unreachable

    > After the fix From 10.1.3.2 icmp_seq=1 Destination Host Unreachable

ip -netns host2 xfrm policy add src 10.1.1.0/24 dst 10.1.4.0/24 dir out \
        flag icmp tmpl src 10.1.2.1 dst 10.1.3.2 proto esp reqid 1 mode tunnel

As far as I can see, 10.1.3.2 does not fit into this policy.
You appear to be selecting the outside ("WAN") interface of the gateway.
It would be less confusing if you had used 172.16.0.0/24 for the outside of
the gateways in your example.
How will the WAN interface manage to talk to the internal sender of the
packet except via the tunnel?

--
]               Never tell me the odds!                 | ipv6 mesh networks [
]   Michael Richardson, Sandelman Software Works        |    IoT architect   [
]     mcr@sandelman.ca  http://www.sandelman.ca/        |   ruby on rails    [
Antony Antony April 4, 2024, 12:16 p.m. UTC | #2
Hi Michael,

On Thu, Apr 04, 2024 at 07:38:04AM -0400, Michael Richardson via Devel 
wrote:
> 
> Antony Antony via Devel <devel@linux-ipsec.org> wrote:
>     > This commit would force to use source address from the gatway/host.
>     > The ICMP error message source address correctly set from the host.
> 
> While that seems more correct, since that host is generating, it might not
> fit into the IPsec tunnel, and therefore might go the right place or
> anywhere.   Perhaps you could pick the internal IP of the gateway, but in
> more complex policies, the gateway itself might not be part of the VPN.
> 
>     > Again before the fix ping -W 5 -c 1 10.1.4.3 From 10.1.4.3 icmp_seq=1
>     > Destination Host Unreachable
> 
>     > After the fix From 10.1.3.2 icmp_seq=1 Destination Host Unreachable
> 
> ip -netns host2 xfrm policy add src 10.1.1.0/24 dst 10.1.4.0/24 dir out \
>         flag icmp tmpl src 10.1.2.1 dst 10.1.3.2 proto esp reqid 1 mode tunnel

> 
> As far as I can see, 10.1.3.2 does not fit into this policy.


Indeed, 10.1.3.2 does not match the policy. However, notice the 
"flag icmp" in the above line. That means the policy lookup will use the inner
payload for policy lookup as specified in RFC 4301, Section 6, which
will match. The inner payload 10.1.4.1 <=> 10.1.4.3  will match the policy.

> You appear to be selecting the outside ("WAN") interface of the gateway.
> It would be less confusing if you had used 172.16.0.0/24 for the outside of
> the gateways in your example.

With this fix, I am leaving the source address selection of the error
response to the stack instead of unconditionally copying from the packet
that was dropped. So, in a simple case, the outgoing interface will be
the end of the tunnel, i.e., 10.1.3.2. Also that is the end point of the 
tunnel. 

> How will the WAN interface manage to talk to the internal sender of the
> packet except via the tunnel?

WAN is geneerating an ESP packet? In this case ESP tunnel mode with  
10.1.3.2. Note: the reciver has also the icmp enabled.

-antony
Tobias Brunner April 4, 2024, 12:35 p.m. UTC | #3
On 04.04.24 12:31, Antony Antony wrote:
> When enabling support for XFRM lookup using reverse ICMP payload, we
> have identified an issue where the source address of the IPv4, e.g.,
> "Destination Host Unreachable" message, is incorrect. The error message
> received by the sender appears to originate from a 
> non-existing/unreachable host.  IPv6 seems to behave correctly; respond 
> with an existing source address from the gateway.
> 
> Here is example of incorrect source address for ICMP error response.
> When sending a ping to an unreachable host, the sender would receive an
> ICMP unreachable response with a fake source address. Rather the address
> of the host that generated ICMP Unreachable message. This is confusing
> and incorrect.
> 
> Example:
> ping -W 9 -w 5 -c 1 10.1.4.3
> PING 10.1.4.3 (10.1.4.3) 56(84) bytes of data.
> From 10.1.4.3 icmp_seq=1 Destination Host Unreachable
> 
> Notice : packet has the source address of the ICMP "Unreachable host!"
> 
> This issue can be traced back to commit
> 415b3334a21a ("icmp: Fix regression in nexthop resolution during replies.")
> which introduced a change that copied the source address from the ICMP
> payload.
> 
> This commit would force to use source address from the gatway/host.
> The ICMP error message source address correctly set from the host.
> 
> After fixing:
> ping -W 5 -c 1 10.1.4.3
> PING 10.1.4.3 (10.1.4.3) 56(84) bytes of data.
> From 10.1.3.2 icmp_seq=1 Destination Host Unreachable
> 
> Here is an snippt to reporduce the issue.
> 
> export AB="10.1"
> for i in 1 2 3 4 5; do
>         h="host${i}"
>         ip netns add ${h}
>         ip -netns ${h} link set lo up
>         ip netns exec ${h} sysctl -wq net.ipv4.ip_forward=1
>         if [ $i -lt 5 ]; then
>                 ip -netns ${h} link add eth0 type veth peer name eth10${i}
>                 ip -netns ${h} addr add "${AB}.${i}.1/24" dev eth0
>                 ip -netns ${h} link set up dev eth0
>         fi
> done
> 
> for i in 1 2 3 4 5; do
>         h="host${i}"
>         p=$((i - 1))
>         ph="host${p}"
>         # connect to previous host
>         if [ $i -gt 1 ]; then
>                 ip -netns ${ph} link set eth10${p} netns ${h}
>                 ip -netns ${h} link set eth10${p} name eth1
>                 ip -netns ${h} link set up dev eth1
>                 ip -netns ${h} addr add "${AB}.${p}.2/24" dev eth1
>         fi
>         # add forward routes
>         for k in $(seq ${i} $((5 - 1))); do
>                 ip -netns ${h} route 2>/dev/null | (grep "${AB}.${k}.0" 2>/dev/null) || \
>                 ip -netns ${h} route add "${AB}.${k}.0/24" via "${AB}.${i}.2" 2>/dev/nul
>         done
> 
>         # add reverse routes
>         for k in $(seq 1 $((i - 2))); do
>                 ip -netns ${h} route 2>/dev/null | grep "${AB}.${k}.0" 2>/dev/null || \
>                 ip -netns ${h} route add "${AB}.${k}.0/24" via "${AB}.${p}.1" 2>/dev/nul
>         done
> done
> 
> ip netns exec host1 ping -q -W 2 -w 1 -c 1 10.1.4.2 2>&1>/dev/null && echo "success 10.1.4.2 reachable" || echo "ERROR"
> ip netns exec host1 ping -W 9 -w 5 -c 1 10.1.4.3 || echo  "note the source address of unreachble of gateway"
> ip -netns host1 route flush cache
> 
> ip netns exec host3 nft add table inet filter
> ip netns exec host3 nft add chain inet filter FORWARD { type filter hook forward priority filter\; policy drop \; }
> ip netns exec host3 nft add rule inet filter FORWARD counter ip protocol icmp drop
> ip netns exec host3 nft add rule inet filter FORWARD counter ip protocol esp accept
> ip netns exec host3 nft add rule inet filter FORWARD counter drop
> 
> ip -netns host2 xfrm policy add src 10.1.1.0/24 dst 10.1.4.0/24 dir out \
>         flag icmp tmpl src 10.1.2.1 dst 10.1.3.2 proto esp reqid 1 mode tunnel
> 
> ip -netns host2 xfrm policy add src 10.1.4.0/24 dst 10.1.1.0/24 dir in \
>         tmpl src 10.1.3.2 dst 10.1.2.1 proto esp reqid 2 mode tunnel
> 
> ip -netns host2 xfrm policy add src 10.1.4.0/24 dst 10.1.1.0/24 dir fwd \
>         flag icmp tmpl src 10.1.3.2 dst 10.1.2.1 proto esp reqid 2 mode tunnel
> 
> ip -netns host2 xfrm state add src 10.1.2.1 dst 10.1.3.2 proto esp spi 1 \
>         reqid 1 replay-window 1  mode tunnel aead 'rfc4106(gcm(aes))' \
>         0x1111111111111111111111111111111111111111 96 \
>         sel src 10.1.1.0/24 dst 10.1.4.0/24
> 
> ip -netns host2 xfrm state add src 10.1.3.2 dst 10.1.2.1 proto esp spi 2 \
>         flag icmp reqid 2 replay-window 10 mode tunnel aead 'rfc4106(gcm(aes))' \
>         0x2222222222222222222222222222222222222222 96
> 
> ip -netns host4 xfrm policy add src 10.1.4.0/24 dst 10.1.1.0/24 dir out \
>         flag icmp tmpl src 10.1.3.2 dst 10.1.2.1 proto esp reqid 1 mode tunnel
> 
> ip -netns host4 xfrm policy add src 10.1.1.0/24 dst 10.1.4.0/24 dir in \
>         tmpl src 10.1.2.1 dst 10.1.3.2 proto esp reqid 2  mode tunnel
> 
> ip -netns host4 xfrm policy add src 10.1.1.0/24 dst 10.1.4.0/24 dir fwd \
>                 flag icmp tmpl src 10.1.2.1 dst 10.1.3.2 proto esp reqid 2 mode tunnel
> 
> ip -netns host4 xfrm state add src 10.1.3.2 dst 10.1.2.1 proto esp spi 2 \
>         reqid 1 replay-window 1 mode tunnel aead 'rfc4106(gcm(aes))' \
>         0x2222222222222222222222222222222222222222 96
> 
> ip -netns host4 xfrm state add src 10.1.2.1 dst 10.1.3.2 proto esp spi 1 \
>         reqid 2 replay-window 20 flag icmp  mode tunnel aead 'rfc4106(gcm(aes))' \
>         0x1111111111111111111111111111111111111111 96 \
>         sel src 10.1.1.0/24 dst 10.1.4.0/24
> 
> ip netns exec host1 ping -W 5 -c 1 10.1.4.2 2>&1 > /dev/null && echo ""
> ip netns exec host1 ping -W 5 -c 1 10.1.4.3 || echo "note source address of gateway 10.1.3.2"
> 
> Again before the fix
> ping -W 5 -c 1 10.1.4.3
> From 10.1.4.3 icmp_seq=1 Destination Host Unreachable
> 
> After the fix
> From 10.1.3.2 icmp_seq=1 Destination Host Unreachable
> 
> Signed-off-by: Antony Antony <antony.antony@secunet.com>
> ---
>  net/ipv4/icmp.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
> index e63a3bf99617..bec234637122 100644
> --- a/net/ipv4/icmp.c
> +++ b/net/ipv4/icmp.c
> @@ -555,7 +555,6 @@ static struct rtable *icmp_route_lookup(struct net *net,
>  					    XFRM_LOOKUP_ICMP);
>  	if (!IS_ERR(rt2)) {
>  		dst_release(&rt->dst);
> -		memcpy(fl4, &fl4_dec, sizeof(*fl4));
>  		rt = rt2;
>  	} else if (PTR_ERR(rt2) == -EPERM) {
>  		if (rt)

Acked-by: Tobias Brunner <tobias@strongswan.org>
Michael Richardson April 4, 2024, 2:39 p.m. UTC | #4
Antony Antony <antony@phenome.org> wrote:
    > Indeed, 10.1.3.2 does not match the policy. However, notice the "flag
    > icmp" in the above line. That means the policy lookup will use the
    > inner payload for policy lookup as specified in RFC 4301, Section 6,
    > which will match. The inner payload 10.1.4.1 <=> 10.1.4.3 will match
    > the policy.

How is "flag icmp" communicated via IKEv2?
Won't the other gateway just drop this packet?
Antony Antony April 4, 2024, 3:23 p.m. UTC | #5
Hi Michael,

On Thu, Apr 04, 2024 at 10:39:17AM -0400, Michael Richardson wrote:
> 
> Antony Antony <antony@phenome.org> wrote:
>     > Indeed, 10.1.3.2 does not match the policy. However, notice the "flag
>     > icmp" in the above line. That means the policy lookup will use the
>     > inner payload for policy lookup as specified in RFC 4301, Section 6,
>     > which will match. The inner payload 10.1.4.1 <=> 10.1.4.3 will match
>     > the policy.
> 
> How is "flag icmp" communicated via IKEv2?

As far as I'm aware, it isn't communicated via IKEv2. I believe it's 
considered a local policy, and possibly specified in BCP.

However, how is communicating it over IKEv2 relevant to this kernel patch? I 
don't see any connection! If there is one, please elaborate. Without a clear 
link, the netdev maintainers might reject this patch.

> Won't the other gateway just drop this packet?

That's would be a local choice, fate of an ICMP message:), akin to ICMP 
errors elsewhere. Let's not dive into filtering choices and PMTU for now:)

Just thinking out loud, I haven't seen forwarding ICMP error messages 
negotiated in other tunneling protocols like MPLS or pptp...., if I recall 
correctly, QUIC does indeed have it specified.

-antony
Michael Richardson April 4, 2024, 3:35 p.m. UTC | #6
Antony Antony <antony@phenome.org> wrote:
    > On Thu, Apr 04, 2024 at 10:39:17AM -0400, Michael Richardson wrote:
    >>
    >> Antony Antony <antony@phenome.org> wrote: > Indeed, 10.1.3.2 does not
    >> match the policy. However, notice the "flag > icmp" in the above
    >> line. That means the policy lookup will use the > inner payload for
    >> policy lookup as specified in RFC 4301, Section 6, > which will
    >> match. The inner payload 10.1.4.1 <=> 10.1.4.3 will match > the
    >> policy.
    >>
    >> How is "flag icmp" communicated via IKEv2?

    > As far as I'm aware, it isn't communicated via IKEv2. I believe it's
    > considered a local policy, and possibly specified in BCP.

    > However, how is communicating it over IKEv2 relevant to this kernel
    > patch? I don't see any connection! If there is one, please
    > elaborate. Without a clear link, the netdev maintainers might reject
    > this patch.

because, we are using a custom Linux kernel flag to get the ICMP back into
the tunnel, then the other end might not accept the packet if it doesn't have
a similiar configuration.

    >> Won't the other gateway just drop this packet?

    > That's would be a local choice, fate of an ICMP message:), akin to ICMP
    > errors elsewhere. Let's not dive into filtering choices and PMTU for
    > now:)

No, it's not. It's up to IKEv2 to configure those flags.
Your choice requires extra flags.  The previous behaviour was rather
ingenious because it guaranteed that the packet always fit into the tunnel.
(the bug was that it didn't do it for IPv6 as well)

    > Just thinking out loud, I haven't seen forwarding ICMP error messages
    > negotiated in other tunneling protocols like MPLS or pptp...., if I
    > recall correctly, QUIC does indeed have it specified.

So, what?  They aren't L3 tunnel protocols are they?
MPLS is L2.5, pptp is L2 and QUIC is L4.

--
]               Never tell me the odds!                 | ipv6 mesh networks [
]   Michael Richardson, Sandelman Software Works        |    IoT architect   [
]     mcr@sandelman.ca  http://www.sandelman.ca/        |   ruby on rails    [
Tero Kivinen April 5, 2024, 12:21 p.m. UTC | #7
Michael Richardson via Devel writes:
> 
> Antony Antony <antony@phenome.org> wrote:
>     > Indeed, 10.1.3.2 does not match the policy. However, notice the "flag
>     > icmp" in the above line. That means the policy lookup will use the
>     > inner payload for policy lookup as specified in RFC 4301, Section 6,
>     > which will match. The inner payload 10.1.4.1 <=> 10.1.4.3 will match
>     > the policy.
> 
> How is "flag icmp" communicated via IKEv2?
> Won't the other gateway just drop this packet?

It is not specified in IKE, it is mandated by the RFC4301 section 6.2: 
----------------------------------------------------------------------
6.2.  Processing Protected, Transit ICMP Error Messages
...
   If no SA exists that would carry the outbound ICMP message in
   question, and if no SPD entry would allow carriage of this outbound
   ICMP error message, then an IPsec implementation MUST map the message
   to the SA that would carry the return traffic associated with the
   packet that triggered the ICMP error message.  This requires an IPsec
   implementation to detect outbound ICMP error messages that map to no
   extant SA or SPD entry, and treat them specially with regard to SA
   creation and lookup.  The implementation extracts the header for the
   packet that triggered the error (from the ICMP message payload),
   reverses the source and destination IP address fields, extracts the
   protocol field, and reverses the port fields (if accessible).  It
   then uses this extracted information to locate an appropriate, active
   outbound SA, and transmits the error message via this SA.  If no such
   SA exists, no SA will be created, and this is an auditable event.

   If an IPsec implementation receives an inbound ICMP error message on
   an SA, and the IP and ICMP headers of the message do not match the
   traffic selectors for the SA, the receiver MUST process the received
   message in a special fashion.  Specifically, the receiver must
   extract the header of the triggering packet from the ICMP payload,
   and reverse fields as described above to determine if the packet is
   consistent with the selectors for the SA via which the ICMP error
   message was received.  If the packet fails this check, the IPsec
   implementation MUST NOT forwarded the ICMP message to the
   destination.  This is an auditable event.
Antony Antony April 5, 2024, 12:27 p.m. UTC | #8
Hi Michael,

thanks for your clarifications.

On Thu, Apr 04, 2024 at 11:35:57AM -0400, Michael Richardson wrote:
> 
> Antony Antony <antony@phenome.org> wrote:
>     > On Thu, Apr 04, 2024 at 10:39:17AM -0400, Michael Richardson wrote:
>     >>
>     >> Antony Antony <antony@phenome.org> wrote: > Indeed, 10.1.3.2 does not
>     >> match the policy. However, notice the "flag > icmp" in the above
>     >> line. That means the policy lookup will use the > inner payload for
>     >> policy lookup as specified in RFC 4301, Section 6, > which will
>     >> match. The inner payload 10.1.4.1 <=> 10.1.4.3 will match > the
>     >> policy.
>     >>
>     >> How is "flag icmp" communicated via IKEv2?
> 
>     > As far as I'm aware, it isn't communicated via IKEv2. I believe it's
>     > considered a local policy, and possibly specified in BCP.
> 
>     > However, how is communicating it over IKEv2 relevant to this kernel
>     > patch? I don't see any connection! If there is one, please
>     > elaborate. Without a clear link, the netdev maintainers might reject
>     > this patch.
> 
> because, we are using a custom Linux kernel flag to get the ICMP back into
> the tunnel, then the other end might not accept the packet if it doesn't have
> a similiar configuration.

The use of "flags icmp" option here for ICMP handling within tunnels is 
directly aligned with the standards outlined in IPsec RFC 4301, section 6.2.  
This section mandates that both sender and receiver must be configurable to 
verify inner payload header information against the SA it traverses. It 
explicitly states, "an IPsec implementation MUST be configurable to check 
that this payload header information is consistent with the SA via which it 
arrives." Furthermore, it details the required handling of inbound ICMP 
error messages that don't match the SA's traffic selectors, emphasizing the 
necessity for special processing in such cases.
Therefore, the need for IKEv2 in this context does not arise, as Linux 
kernel approach is fully compliant with the specified standards.

> 
>     >> Won't the other gateway just drop this packet?
> 
>     > That's would be a local choice, fate of an ICMP message:), akin to ICMP
>     > errors elsewhere. Let's not dive into filtering choices and PMTU for
>     > now:)
> 
> No, it's not. It's up to IKEv2 to configure those flags.
> Your choice requires extra flags.  The previous behaviour was rather
> ingenious because it guaranteed that the packet always fit into the tunnel.
> (the bug was that it didn't do it for IPv6 as well)

While the current behavior might seem beneficial, it has a major flaw. 
The generated response matches the IPsec policy only because it uses 
incorrect source addresses. This is address spoofing by the IPsec peer.
This should be considered a bug, not a feature. In contrast, the correct 
approach, as IPv6 demonstrates, avoids such tactics. Ensuring IPv4 also 
adheres to standards is crucial. Thus, applying this fix is an improvment.

I think the patch should be applied, and leave it to the kernel community's 
insights and judgment.

> 
>     > Just thinking out loud, I haven't seen forwarding ICMP error messages
>     > negotiated in other tunneling protocols like MPLS or pptp...., if I
>     > recall correctly, QUIC does indeed have it specified.
> 
> So, what?  They aren't L3 tunnel protocols are they?

Based on RFC 4301, section 6—as referenced in the quote above—it explicitly 
mandates that implementing a locally configurable option/
Therefore, negotiation over IKEv2 seems unnecessary.

> MPLS is L2.5, pptp is L2 and QUIC is L4.

-antony
Jakub Kicinski April 9, 2024, 2:15 a.m. UTC | #9
On Thu, 4 Apr 2024 12:31:56 +0200 Antony Antony wrote:
> export AB="10.1"
> for i in 1 2 3 4 5; do
>         h="host${i}"
>         ip netns add ${h}
>         ip -netns ${h} link set lo up
>         ip netns exec ${h} sysctl -wq net.ipv4.ip_forward=1
>         if [ $i -lt 5 ]; then
>                 ip -netns ${h} link add eth0 type veth peer name eth10${i}
>                 ip -netns ${h} addr add "${AB}.${i}.1/24" dev eth0
>                 ip -netns ${h} link set up dev eth0
>         fi
> done
> 
> for i in 1 2 3 4 5; do
>         h="host${i}"
>         p=$((i - 1))
>         ph="host${p}"
>         # connect to previous host
>         if [ $i -gt 1 ]; then
>                 ip -netns ${ph} link set eth10${p} netns ${h}
>                 ip -netns ${h} link set eth10${p} name eth1
>                 ip -netns ${h} link set up dev eth1
>                 ip -netns ${h} addr add "${AB}.${p}.2/24" dev eth1
>         fi
>         # add forward routes
>         for k in $(seq ${i} $((5 - 1))); do
>                 ip -netns ${h} route 2>/dev/null | (grep "${AB}.${k}.0" 2>/dev/null) || \
>                 ip -netns ${h} route add "${AB}.${k}.0/24" via "${AB}.${i}.2" 2>/dev/nul
>         done
> 
>         # add reverse routes
>         for k in $(seq 1 $((i - 2))); do
>                 ip -netns ${h} route 2>/dev/null | grep "${AB}.${k}.0" 2>/dev/null || \
>                 ip -netns ${h} route add "${AB}.${k}.0/24" via "${AB}.${p}.1" 2>/dev/nul
>         done
> done
> 
> ip netns exec host1 ping -q -W 2 -w 1 -c 1 10.1.4.2 2>&1>/dev/null && echo "success 10.1.4.2 reachable" || echo "ERROR"
> ip netns exec host1 ping -W 9 -w 5 -c 1 10.1.4.3 || echo  "note the source address of unreachble of gateway"
> ip -netns host1 route flush cache
> 
> ip netns exec host3 nft add table inet filter
> ip netns exec host3 nft add chain inet filter FORWARD { type filter hook forward priority filter\; policy drop \; }
> ip netns exec host3 nft add rule inet filter FORWARD counter ip protocol icmp drop
> ip netns exec host3 nft add rule inet filter FORWARD counter ip protocol esp accept
> ip netns exec host3 nft add rule inet filter FORWARD counter drop
> 
> ip -netns host2 xfrm policy add src 10.1.1.0/24 dst 10.1.4.0/24 dir out \
>         flag icmp tmpl src 10.1.2.1 dst 10.1.3.2 proto esp reqid 1 mode tunnel
> 
> ip -netns host2 xfrm policy add src 10.1.4.0/24 dst 10.1.1.0/24 dir in \
>         tmpl src 10.1.3.2 dst 10.1.2.1 proto esp reqid 2 mode tunnel
> 
> ip -netns host2 xfrm policy add src 10.1.4.0/24 dst 10.1.1.0/24 dir fwd \
>         flag icmp tmpl src 10.1.3.2 dst 10.1.2.1 proto esp reqid 2 mode tunnel
> 
> ip -netns host2 xfrm state add src 10.1.2.1 dst 10.1.3.2 proto esp spi 1 \
>         reqid 1 replay-window 1  mode tunnel aead 'rfc4106(gcm(aes))' \
>         0x1111111111111111111111111111111111111111 96 \
>         sel src 10.1.1.0/24 dst 10.1.4.0/24
> 
> ip -netns host2 xfrm state add src 10.1.3.2 dst 10.1.2.1 proto esp spi 2 \
>         flag icmp reqid 2 replay-window 10 mode tunnel aead 'rfc4106(gcm(aes))' \
>         0x2222222222222222222222222222222222222222 96
> 
> ip -netns host4 xfrm policy add src 10.1.4.0/24 dst 10.1.1.0/24 dir out \
>         flag icmp tmpl src 10.1.3.2 dst 10.1.2.1 proto esp reqid 1 mode tunnel
> 
> ip -netns host4 xfrm policy add src 10.1.1.0/24 dst 10.1.4.0/24 dir in \
>         tmpl src 10.1.2.1 dst 10.1.3.2 proto esp reqid 2  mode tunnel
> 
> ip -netns host4 xfrm policy add src 10.1.1.0/24 dst 10.1.4.0/24 dir fwd \
>                 flag icmp tmpl src 10.1.2.1 dst 10.1.3.2 proto esp reqid 2 mode tunnel
> 
> ip -netns host4 xfrm state add src 10.1.3.2 dst 10.1.2.1 proto esp spi 2 \
>         reqid 1 replay-window 1 mode tunnel aead 'rfc4106(gcm(aes))' \
>         0x2222222222222222222222222222222222222222 96
> 
> ip -netns host4 xfrm state add src 10.1.2.1 dst 10.1.3.2 proto esp spi 1 \
>         reqid 2 replay-window 20 flag icmp  mode tunnel aead 'rfc4106(gcm(aes))' \
>         0x1111111111111111111111111111111111111111 96 \
>         sel src 10.1.1.0/24 dst 10.1.4.0/24
> 
> ip netns exec host1 ping -W 5 -c 1 10.1.4.2 2>&1 > /dev/null && echo ""
> ip netns exec host1 ping -W 5 -c 1 10.1.4.3 || echo "note source address of gateway 10.1.3.2"

Could you turn this into a selftest?
Antony Antony April 10, 2024, 5:48 p.m. UTC | #10
On Mon, Apr 08, 2024 at 19:15:34 -0700, Jakub Kicinski wrote:
> On Thu, 4 Apr 2024 12:31:56 +0200 Antony Antony wrote:
> > export AB="10.1"
> > for i in 1 2 3 4 5; do
> >         h="host${i}"
> >         ip netns add ${h}
> >         ip -netns ${h} link set lo up
> >         ip netns exec ${h} sysctl -wq net.ipv4.ip_forward=1
> >         if [ $i -lt 5 ]; then
> >                 ip -netns ${h} link add eth0 type veth peer name eth10${i}
> >                 ip -netns ${h} addr add "${AB}.${i}.1/24" dev eth0
> >                 ip -netns ${h} link set up dev eth0
> >         fi
> > done
> > 
> > for i in 1 2 3 4 5; do
> >         h="host${i}"
> >         p=$((i - 1))
> >         ph="host${p}"
> >         # connect to previous host
> >         if [ $i -gt 1 ]; then
> >                 ip -netns ${ph} link set eth10${p} netns ${h}
> >                 ip -netns ${h} link set eth10${p} name eth1
> >                 ip -netns ${h} link set up dev eth1
> >                 ip -netns ${h} addr add "${AB}.${p}.2/24" dev eth1
> >         fi
> >         # add forward routes
> >         for k in $(seq ${i} $((5 - 1))); do
> >                 ip -netns ${h} route 2>/dev/null | (grep "${AB}.${k}.0" 2>/dev/null) || \
> >                 ip -netns ${h} route add "${AB}.${k}.0/24" via "${AB}.${i}.2" 2>/dev/nul
> >         done
> > 
> >         # add reverse routes
> >         for k in $(seq 1 $((i - 2))); do
> >                 ip -netns ${h} route 2>/dev/null | grep "${AB}.${k}.0" 2>/dev/null || \
> >                 ip -netns ${h} route add "${AB}.${k}.0/24" via "${AB}.${p}.1" 2>/dev/nul
> >         done
> > done
> > 
> > ip netns exec host1 ping -q -W 2 -w 1 -c 1 10.1.4.2 2>&1>/dev/null && echo "success 10.1.4.2 reachable" || echo "ERROR"
> > ip netns exec host1 ping -W 9 -w 5 -c 1 10.1.4.3 || echo  "note the source address of unreachble of gateway"
> > ip -netns host1 route flush cache
> > 
> > ip netns exec host3 nft add table inet filter
> > ip netns exec host3 nft add chain inet filter FORWARD { type filter hook forward priority filter\; policy drop \; }
> > ip netns exec host3 nft add rule inet filter FORWARD counter ip protocol icmp drop
> > ip netns exec host3 nft add rule inet filter FORWARD counter ip protocol esp accept
> > ip netns exec host3 nft add rule inet filter FORWARD counter drop
> > 
> > ip -netns host2 xfrm policy add src 10.1.1.0/24 dst 10.1.4.0/24 dir out \
> >         flag icmp tmpl src 10.1.2.1 dst 10.1.3.2 proto esp reqid 1 mode tunnel
> > 
> > ip -netns host2 xfrm policy add src 10.1.4.0/24 dst 10.1.1.0/24 dir in \
> >         tmpl src 10.1.3.2 dst 10.1.2.1 proto esp reqid 2 mode tunnel
> > 
> > ip -netns host2 xfrm policy add src 10.1.4.0/24 dst 10.1.1.0/24 dir fwd \
> >         flag icmp tmpl src 10.1.3.2 dst 10.1.2.1 proto esp reqid 2 mode tunnel
> > 
> > ip -netns host2 xfrm state add src 10.1.2.1 dst 10.1.3.2 proto esp spi 1 \
> >         reqid 1 replay-window 1  mode tunnel aead 'rfc4106(gcm(aes))' \
> >         0x1111111111111111111111111111111111111111 96 \
> >         sel src 10.1.1.0/24 dst 10.1.4.0/24
> > 
> > ip -netns host2 xfrm state add src 10.1.3.2 dst 10.1.2.1 proto esp spi 2 \
> >         flag icmp reqid 2 replay-window 10 mode tunnel aead 'rfc4106(gcm(aes))' \
> >         0x2222222222222222222222222222222222222222 96
> > 
> > ip -netns host4 xfrm policy add src 10.1.4.0/24 dst 10.1.1.0/24 dir out \
> >         flag icmp tmpl src 10.1.3.2 dst 10.1.2.1 proto esp reqid 1 mode tunnel
> > 
> > ip -netns host4 xfrm policy add src 10.1.1.0/24 dst 10.1.4.0/24 dir in \
> >         tmpl src 10.1.2.1 dst 10.1.3.2 proto esp reqid 2  mode tunnel
> > 
> > ip -netns host4 xfrm policy add src 10.1.1.0/24 dst 10.1.4.0/24 dir fwd \
> >                 flag icmp tmpl src 10.1.2.1 dst 10.1.3.2 proto esp reqid 2 mode tunnel
> > 
> > ip -netns host4 xfrm state add src 10.1.3.2 dst 10.1.2.1 proto esp spi 2 \
> >         reqid 1 replay-window 1 mode tunnel aead 'rfc4106(gcm(aes))' \
> >         0x2222222222222222222222222222222222222222 96
> > 
> > ip -netns host4 xfrm state add src 10.1.2.1 dst 10.1.3.2 proto esp spi 1 \
> >         reqid 2 replay-window 20 flag icmp  mode tunnel aead 'rfc4106(gcm(aes))' \
> >         0x1111111111111111111111111111111111111111 96 \
> >         sel src 10.1.1.0/24 dst 10.1.4.0/24
> > 
> > ip netns exec host1 ping -W 5 -c 1 10.1.4.2 2>&1 > /dev/null && echo ""
> > ip netns exec host1 ping -W 5 -c 1 10.1.4.3 || echo "note source address of gateway 10.1.3.2"
> 
> Could you turn this into a selftest?

I thought about it, and I didn't find any selftest file that match this
test. This test need a topology with 4, ideally 5, namespaces connected in a line, and ip xfrm.

git/linux/tools/testing/selftests/net/pmtu.sh  is probably the easiest I
can think off.

git/linux/tools/testing/selftests/net/xfrm_policy.sh seems to be a bit
more complex to a extra tests to.

Do you have any preference? which file to add?

-antonz
Jakub Kicinski April 11, 2024, 12:49 a.m. UTC | #11
On Wed, 10 Apr 2024 19:48:46 +0200 Antony Antony wrote:
> > Could you turn this into a selftest?  
> 
> I thought about it, and I didn't find any selftest file that match this
> test. This test need a topology with 4, ideally 5, namespaces connected in a line, and ip xfrm.
> 
> git/linux/tools/testing/selftests/net/pmtu.sh  is probably the easiest I
> can think off.
> 
> git/linux/tools/testing/selftests/net/xfrm_policy.sh seems to be a bit
> more complex to a extra tests to.
> 
> Do you have any preference? which file to add?

Whatever's easiest, I don't have much experience with either of those.
You can also create a new file, it's perfectly fine, just make sure
you add it to the makefile so kselftest knows about it
diff mbox series

Patch

diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index e63a3bf99617..bec234637122 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -555,7 +555,6 @@  static struct rtable *icmp_route_lookup(struct net *net,
 					    XFRM_LOOKUP_ICMP);
 	if (!IS_ERR(rt2)) {
 		dst_release(&rt->dst);
-		memcpy(fl4, &fl4_dec, sizeof(*fl4));
 		rt = rt2;
 	} else if (PTR_ERR(rt2) == -EPERM) {
 		if (rt)