
[v3] icmp: support rfc5837

Message ID 20210317221959.4410-1-ishaangandhi@gmail.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers

Checks

Context Check Description
netdev/tree_selection success Guessing tree name failed - patch did not apply

Commit Message

Ishaan Gandhi March 17, 2021, 10:19 p.m. UTC
From: ishaan <ishaangandhi@gmail.com>

This patch identifies the interface a packet arrived on when sending
ICMP time exceeded, destination unreachable, and parameter problem
messages, in accordance with RFC 5837.

It was tested by pinging a machine with a ttl of 1, and observing the
response in Wireshark.
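
For readers unfamiliar with RFC 5837: the arrival interface is described by an
Interface Information Object (Class-Num 2) whose C-Type byte carries the
interface role and flags for the sub-objects that follow. The userspace sketch
below composes that byte, assuming the bit layout in RFC 5837 Section 4.1. The
macro and function names are illustrative and are not taken from this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative names; the patch under review may use different ones. */
#define IIO_CLASS_NUM      2    /* Interface Information Object */

/* The interface role occupies the two most significant bits of C-Type. */
#define IIO_ROLE_INCOMING  0u   /* interface the datagram arrived on */
#define IIO_ROLE_SUB_IP    1u   /* sub-IP component of the arrival interface */
#define IIO_ROLE_OUTGOING  2u   /* interface it would have been forwarded on */
#define IIO_ROLE_NEXT_HOP  3u   /* IP next hop it would have been sent to */

/* Presence flags occupy the four least significant bits. */
#define IIO_HAS_IFINDEX    0x08u
#define IIO_HAS_IPADDR     0x04u
#define IIO_HAS_NAME       0x02u
#define IIO_HAS_MTU        0x01u

/* Build the C-Type byte from a role and a set of presence flags. */
static inline uint8_t iio_ctype(unsigned int role, unsigned int flags)
{
	assert(role <= IIO_ROLE_NEXT_HOP);
	assert((flags & ~0x0Fu) == 0);
	return (uint8_t)((role << 6) | flags);
}
```

An object describing the arrival interface by ifIndex and MTU only would use
iio_ctype(IIO_ROLE_INCOMING, IIO_HAS_IFINDEX | IIO_HAS_MTU).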

Changes since v1:
- Add sysctls, feature is disabled by default
- Device name is always less than 63, so don't check this
- MTU is always included in net_device, so don't check its presence
- Support IPv6 as first class citizen
- Increment lengths via sizeof operator as opposed to int literals
- Initialize more local variables with defaults

Changes since v2:
- Remove check for device name
- Get first entry instead of last in IPv6 addr list
- Use ALIGN macro for alignment
- Remove ip_version verification, since the function only gets called
  with ip_version 4 or 6.
Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
- Use /* */ style comments instead of // style
Reported-by: Stephen Hemminger <stephen@networkplumber.org>
- Use proc_dointvec_minmax to constrain sysctl values
- Release dev with dev_put once finished
- Simplify logic for padding the end of the original datagram
- Fix off-by-one error in where the length of the original datagram
  is written into the ICMP header (6th and 5th bytes for ICMPv4 and v6
  respectively are accessed at icmph[5] and icmph[4] respectively,
  not icmph[6] and icmph[5].)
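
The byte offsets in the last item, together with the alignment step, can be
sketched in userspace as follows. This assumes the RFC 4884 layout, where the
length is counted in 32-bit words for ICMPv4 and in 64-bit words for ICMPv6.
ALIGN_UP and set_orig_len are hypothetical stand-ins, not the kernel's ALIGN
macro or code from this patch, and the 128-octet minimum that RFC 4884
requires for the original datagram is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's ALIGN(); a must be a power of two. */
#define ALIGN_UP(x, a)  (((x) + ((a) - 1)) & ~((size_t)(a) - 1))

/*
 * Record the padded length of the original datagram in an ICMP header.
 * The length lives in the 6th byte for ICMPv4 (index 5, units of 32-bit
 * words) and in the 5th byte for ICMPv6 (index 4, units of 64-bit words).
 * Returns the padded length in bytes.
 */
static inline size_t set_orig_len(uint8_t *icmph, int ip_version,
				  size_t orig_len)
{
	size_t unit = (ip_version == 4) ? 4 : 8;
	size_t padded = ALIGN_UP(orig_len, unit);

	assert(icmph != NULL);
	assert(ip_version == 4 || ip_version == 6);
	assert(padded / unit <= UINT8_MAX);

	icmph[ip_version == 4 ? 5 : 4] = (uint8_t)(padded / unit);
	return padded;
}
```

For a 130-byte original datagram this pads to 132 bytes (33 words) for ICMPv4
and to 136 bytes (17 double words) for ICMPv6.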

Signed-off-by: Ishaan Gandhi <ishaangandhi@gmail.com>

---
 Documentation/networking/ip-sysctl.rst |   9 ++
 include/linux/icmp.h                   |   3 +
 include/net/netns/ipv4.h               |   1 +
 include/net/netns/ipv6.h               |   1 +
 include/uapi/linux/icmp.h              |  26 ++++
 net/ipv4/icmp.c                        | 157 +++++++++++++++++++++++++
 net/ipv4/sysctl_net_ipv4.c             |   9 ++
 net/ipv6/af_inet6.c                    |   1 +
 net/ipv6/icmp.c                        |  17 +++
 9 files changed, 224 insertions(+)

Comments

David Ahern March 19, 2021, 2:55 p.m. UTC | #1
On 3/17/21 4:19 PM, ishaangandhi wrote:
> +void icmp_identify_arrival_interface(struct sk_buff *skb, struct net *net, int room,
> +				     char *icmph, int ip_version)
> +{
> +	unsigned int ext_len, orig_len, word_aligned_orig_len, offset, extra_space_needed,
> +		     if_index, mtu = 0, name_len = 0, name_subobj_len = 0;
> +	struct interface_ipv4_addr_sub_obj ip4_addr_subobj = {.addr = 0};
> +	struct interface_ipv6_addr_sub_obj ip6_addr_subobj;
> +	struct icmp_extobj_hdr *iio_hdr;
> +	struct inet6_ifaddr ip6_ifaddr;
> +	struct inet6_dev *dev6 = NULL;
> +	struct icmp_ext_hdr *ext_hdr;
> +	char *name = NULL, ctype;
> +	struct net_device *dev;
> +	void *subobj_offset;
> +
> +	skb_linearize(skb);
> +	if_index = inet_iif(skb);

inet_iif is an IPv4 helper; it should not be used for v6 skb's.
Ishaan Gandhi March 19, 2021, 4:11 p.m. UTC | #2
Thank you. Would it be better to do instead:

+	if_index = skb->skb_iif;

or

+	if_index = ip_version == 4 ? inet_iif(skb) : skb->skb_iif;

David Ahern March 19, 2021, 11:54 p.m. UTC | #3
On 3/19/21 10:11 AM, Ishaan Gandhi wrote:
> Thank you. Would it be better to do instead:
> 
> +	if_index = skb->skb_iif;
> 
> or
> 
> +	if_index = ip_version == 4 ? inet_iif(skb) : skb->skb_iif;
> 

If the packet comes in via an interface assigned to a VRF, skb_iif is
most likely the VRF index which is not what you want.

The general problem of relying on skb_iif was discussed on v1 and v2 of
your patch. Returning an iif that is a VRF, as an example, leaks
information about the networking configuration of the device, which from
a quick reading of the RFC is not the intention.

Further, the Security Considerations section recommends controls on what
information can be returned, whereas you have added a single sysctl that
determines whether all information or none is returned. It is also not a
per-device control but a global one that applies to all net devices,
though multiple entries per netdevice have a noticeable cost in memory at
scale.

In the end it seems to me the cost benefit is not there for a feature
like this.
Willem de Bruijn March 20, 2021, 12:53 a.m. UTC | #4
On Fri, Mar 19, 2021 at 7:54 PM David Ahern <dsahern@gmail.com> wrote:
>
> On 3/19/21 10:11 AM, Ishaan Gandhi wrote:
> > Thank you. Would it be better to do instead:
> >
> > +     if_index = skb->skb_iif;
> >
> > or
> >
> > +     if_index = ip_version == 4 ? inet_iif(skb) : skb->skb_iif;
> >
>
> If the packet comes in via an interface assigned to a VRF, skb_iif is
> most likely the VRF index which is not what you want.
>
> The general problem of relying on skb_iif was discussed on v1 and v2 of
> your patch. Returning an iif that is a VRF, as an example, leaks
> information about the networking configuration of the device which from
> a quick reading of the RFC is not the intention.
>
> Further, the Security Considerations section recommends controls on what
> information can be returned where you have added a single sysctl that
> determines if all information or none is returned. Further, it is not a
> a per-device control but a global one that applies to all net devices -
> though multiple entries per netdevice has a noticeable cost in memory at
> scale.
>
> In the end it seems to me the cost benefit is not there for a feature
> like this.

The sysctl was my suggestion. The detailed filtering suggested in the
RFC would add more complexity, not helping that cost benefit analysis.
I cared mostly about being able to disable this feature outright as it has
obvious risks.

But perhaps that is overly simplistic. The RFC suggests deciding trusted
recipients based on destination address. With a sysctl this feature can
only be deployed when all possible recipients are trusted, i.e., on an
isolated network. That is highly limiting.

Perhaps a per-netns trusted subnet prefix?

The root admin should always be able to override and disable this outright.
David Ahern March 20, 2021, 4:24 a.m. UTC | #5
On 3/19/21 6:53 PM, Willem de Bruijn wrote:
> On Fri, Mar 19, 2021 at 7:54 PM David Ahern <dsahern@gmail.com> wrote:
>>
>> On 3/19/21 10:11 AM, Ishaan Gandhi wrote:
>>> Thank you. Would it be better to do instead:
>>>
>>> +     if_index = skb->skb_iif;
>>>
>>> or
>>>
>>> +     if_index = ip_version == 4 ? inet_iif(skb) : skb->skb_iif;
>>>
>>
>> If the packet comes in via an interface assigned to a VRF, skb_iif is
>> most likely the VRF index which is not what you want.
>>
>> The general problem of relying on skb_iif was discussed on v1 and v2 of
>> your patch. Returning an iif that is a VRF, as an example, leaks
>> information about the networking configuration of the device which from
>> a quick reading of the RFC is not the intention.
>>
>> Further, the Security Considerations section recommends controls on what
>> information can be returned where you have added a single sysctl that
>> determines if all information or none is returned. Further, it is not a
>> a per-device control but a global one that applies to all net devices -
>> though multiple entries per netdevice has a noticeable cost in memory at
>> scale.
>>
>> In the end it seems to me the cost benefit is not there for a feature
>> like this.
> 
> The sysctl was my suggestion. The detailed filtering suggested in the
> RFC would add more complexity, not helping that cost benefit analysis.
> I cared mostly about being able to disable this feature outright as it has
> obvious risks.
> 
> But perhaps that is overly simplistic. The RFC suggests deciding trusted
> recipients based on destination address. With a sysctl this feature can be
> only deployed when all possible recipients are trusted, i.e., on an isolated
> network. That is highly limiting.
> 
> Perhaps a per-netns trusted subnet prefix?
> 
> The root admin should always be able to override and disable this outright.
> 

Sure, a sysctl is definitely required for a feature like this.

From my perspective, to be useful the control needs to be per interface
(e.g., management interface vs dataplane devices), and that has a higher
cost. Add in the amount of information returned and we know from other
examples that some users will want to limit which data is returned, and
that increases the number of sysctls per device.

On top of that there is the logic of resolving what is the right device
and its information to return. There are multiple layers (nic port,
bond, vlan, bridge, vrf, macvlan), each of which might be relevant. The
RFC referenced unnumbered devices as the ingress device. It seems like a
means for leaking information, which comes back to the sysctl for proper
controls.

At the end of the day, what is the value of this feature vs the other
ICMP probing set?
David Ahern March 20, 2021, 8:35 p.m. UTC | #6
On 3/19/21 10:24 PM, David Ahern wrote:
> At the end of the day, what is the value of this feature vs the other
> ICMP probing set?

Merging the conversations about both of these RFCs since my comments and
questions are the same for both.

What is the motivation for adding support for these RFCs? Is the push
from a company or academia (e.g., a CS project)?

Realistically, who is expected to use this feature, and why, given the
information it leaks about the networking configuration of the node? Why
is this tool expected to be more useful than a network operator using
existing protocols like lldp, collecting that data across nodes and
analyzing, or using tools like suzieq[1]?

RFC 5837 has been out for 11 years. Do any operating systems support it,
e.g., those from networking vendors like Cisco, Juniper, etc.? If not, why
not? This one seems to me the most dubious at this point in time.

Similarly for RFC 8335, what is the current support for it?

Linux does not need to support an RFC just because it exists. I am
really questioning the value of both of them.

[1] https://github.com/netenglabs/suzieq
Ishaan Gandhi March 22, 2021, 1:50 a.m. UTC | #7
> What is the motivation for adding support for these RFCs? Is the push
> from a company or academia (e.g., a CS project)?

Yes, these patches (RFC 8335 and 5837) were produced as a result of a
collaboration between Juniper Networks and Harvey Mudd College.

Let me loop in our advisor, Zach Dodds, and Juniper Networks engineer
Ron Bonica. I believe Ron has more context on the potential usage and
existing support for these two features.

David Ahern March 25, 2021, 3:19 a.m. UTC | #8
On 3/23/21 10:39 AM, Ron Bonica wrote:
> Hi Folks,
>
> The rationale for RFC 8335 can be found in Section 5.0 of that document.
> Currently, ICMP ECHO and ECHO RESPONSE messages can be used to determine
> the liveness of some interfaces. However, they cannot determine the
> liveness of:
>
>   * An unnumbered IPv4 interface
>   * An IPv6 interface that has only a link-local address
>
> A router can have hundreds, or even thousands of interfaces that fall
> into these categories.
>
> The rationale for RFC 5837 can be found in the Introduction to that
> document. When a node sends an ICMP TTL Expired message, the node
> reports that a packet has expired on it. However, the source address of
> the ICMP TTL Expired message doesn’t necessarily identify the interface
> upon which the packet arrived. So, TRACEROUTE can be relied upon to
> identify the nodes that a packet traverses along its delivery path. But
> it cannot be relied upon to identify the interfaces that a packet
> traversed along its delivery path.

It's not a question of the rationale; the question is why add this
support to Linux now? RFC 5837 is 11 years old. Why has no one cared to
add support before now? What tooling supports it? What other NOSes
support it to really make the feature meaningful? E.g., do you know what
Juniper products support RFC 5837 today?

More than likely Linux is the end node of the traceroute chain, not the
transit or path nodes. With Linux, the ingress interface can get lost in
the layers (NIC port, vlan, bond, bridge, vrf, macvlan), and to properly
support either RFC you need to return information about the right one.
Unnumbered interfaces can make that more of a challenge.
Ron Bonica March 29, 2021, 2:49 p.m. UTC | #9
David,

Juniper Networks is motivated to promote RFC 5837 now, as opposed to eleven years ago, because the deployment of parallel links between routers is becoming more common.

Large network operators frequently require more than 400 Gbps connectivity between their backbone routers. However, the largest interfaces available can only handle 400 Gbps. So, parallel links are required. Moreover, it is frequently cheaper to deploy four 100 Gbps interfaces than a single 400 Gbps interface. So, it is not uncommon to see two routers connected by many parallel 100 Gbps links. RFC 5837 allows a network operator to trace a packet interface to interface, as opposed to node to node.

I think that you are correct in saying that:

- LINUX is more likely to be implemented on a host than a router
- Therefore, LINUX hosts will not send RFC 5837 ICMP extensions often

However, LINUX hosts are frequently used in network management stations. Therefore, it would be very useful if LINUX hosts could parse and display incoming RFC 5837 extensions, just as they display RFC 4950 ICMP extensions.
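
On the receive side, decoding such an extension could be sketched in userspace
as below. The layout follows RFC 4884 (extension header, object headers) and
RFC 5837 (sub-object order: ifIndex, IP address, name, MTU). struct iio_info,
rd32 and parse_iio are illustrative names, the extension checksum is not
verified, and the IP address sub-object is skipped instead of decoded.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fields a traceroute-like tool might display; names are illustrative. */
struct iio_info {
	uint8_t  role;          /* 0 = arrival interface, see RFC 5837 */
	int      has_ifindex, has_mtu;
	uint32_t ifindex, mtu;
	char     name[64];      /* empty string if no name sub-object */
};

static inline uint32_t rd32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | p[3];
}

/*
 * Scan an extension structure (ext points at its 4-byte header) for the
 * first Interface Information Object (Class-Num 2).
 * Returns 0 on success, -1 if none is found or the data is malformed.
 */
static inline int parse_iio(const uint8_t *ext, size_t len,
			    struct iio_info *out)
{
	size_t off = 4;

	assert(ext != NULL && out != NULL);
	if (len < 4 || (ext[0] >> 4) != 2)      /* extension version 2 */
		return -1;

	while (off + 4 <= len) {
		size_t olen = ((size_t)ext[off] << 8) | ext[off + 1];
		uint8_t class_num = ext[off + 2], ctype = ext[off + 3];
		const uint8_t *p = ext + off + 4, *end;

		if (olen < 4 || olen > len - off)
			return -1;
		end = ext + off + olen;
		off += olen;
		if (class_num != 2)             /* not an IIO, keep looking */
			continue;

		memset(out, 0, sizeof(*out));
		out->role = ctype >> 6;
		if (ctype & 0x08) {             /* ifIndex */
			if (end - p < 4)
				return -1;
			out->ifindex = rd32(p);
			out->has_ifindex = 1;
			p += 4;
		}
		if (ctype & 0x04) {             /* AFI, reserved, address */
			unsigned int afi;
			size_t alen;

			if (end - p < 4)
				return -1;
			afi = ((unsigned int)p[0] << 8) | p[1];
			if (afi != 1 && afi != 2)
				return -1;
			alen = (afi == 1) ? 4 : 16;
			if ((size_t)(end - p) < 4 + alen)
				return -1;
			p += 4 + alen;          /* skipped, not decoded */
		}
		if (ctype & 0x02) {             /* length octet, then name */
			size_t nlen;

			if (end - p < 1)
				return -1;
			nlen = p[0];
			if (nlen < 1 || nlen > 64 || nlen > (size_t)(end - p))
				return -1;
			memcpy(out->name, p + 1, nlen - 1);
			out->name[nlen - 1] = '\0';
			p += nlen;
		}
		if (ctype & 0x01) {             /* MTU */
			if (end - p < 4)
				return -1;
			out->mtu = rd32(p);
			out->has_mtu = 1;
		}
		return 0;
	}
	return -1;
}
```

A caller would first locate the extension structure behind the padded original
datagram using the RFC 4884 length field, then hand that region to parse_iio.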

Juniper Networks plans to support RFC 5837 on one platform in an upcoming release and on other platforms soon after that.

                                                                                 Ron




Ron Bonica March 29, 2021, 7:39 p.m. UTC | #10
Folks,

Andreas reminds me that you may have the same questions regarding RFC 8335.

The practice of assigning globally reachable IP addresses to infrastructure interfaces costs network operators money. Normally, they number an interface from an IPv4 /30. Currently, a /30 costs 80 USD and the price is only expected to rise. Furthermore, most IP Address Management (IPAM) systems license by the address block. The more globally reachable addresses you use, the more you pay.

They would prefer to use:

- IPv4 unnumbered interfaces
- IPv6 interfaces that have only link-local addresses

                                                                    Ron



Willem de Bruijn March 31, 2021, 2:04 p.m. UTC | #11
On Mon, Mar 29, 2021 at 3:40 PM Ron Bonica <rbonica@juniper.net> wrote:
>
> Folks,
>
> Andreas reminds me that you may have the same questions regarding RFC 8335.....
>
> The practice of assigning globally reachable IP addresses to infrastructure interfaces cost network operators money. Normally, they number an interface from a IPv4  /30. Currently, a /30 costs 80 USD and the price is only expected to rise. Furthermore, most IP Address Management (IPAM) systems license by the address block. The more globally reachable addresses you use, the more you pay.
>
> They would prefer to use:
>
> - IPv4 unnumbered interfaces
> - IPv6 interfaces that have only link-local addresses
>
>                                                                     Ron

Thanks for the context, Ron.

That sounds reasonable to me. Andreas's patch series has also been
merged by now.


>
>
>
> Juniper Business Use Only
>
> -----Original Message-----
> From: Ron Bonica
> Sent: Monday, March 29, 2021 10:50 AM
> To: David Ahern <dsahern@gmail.com>; Zachary Dodds <zdodds@gmail.com>; Ishaan Gandhi <ishaangandhi@gmail.com>
> Cc: Andreas Roeseler <andreas.a.roeseler@gmail.com>; David Miller <davem@davemloft.net>; Network Development <netdev@vger.kernel.org>; Stephen Hemminger <stephen@networkplumber.org>; Willem de Bruijn <willemdebruijn.kernel@gmail.com>; junipeross20@cs.hmc.edu
> Subject: RE: rfc5837 and rfc8335
>
> David,
>
> Juniper networks is motivated to promote RFC 5837 now, as opposed to eleven years ago, because the deployment of parallel links between routers is becoming more common
>
> Large network operators frequently require more than 400 Gbps connectivity between their backbone routers. However, the largest interfaces available can only handle 400 Gbps. So, parallel links are required. Moreover, it is frequently cheaper to deploy 4 100 Gbps interfaces than a single 400 Gbps interface. So, it is not uncommon to see two routers connected by many, parallel 100 Gbps links. RFC 5837 allows a network operator to trace a packet interface to interface, as opposed to node to node.
>
> I think that you are correct in saying that:
>
> - LINUX is more likely to be implemented on a host than a router
> - Therefore, LINUX hosts will  not send RFC 5837 ICMP extensions often
>
> However, LINUX hosts are frequently used in network management stations. Therefore, it would be very useful if LINUX hosts could parse and display incoming RFC 5837 extensions, just as they display RFC 4950 ICMP extensions.

But the patch series under review adds support to generate such packets.


Ron Bonica March 31, 2021, 5:56 p.m. UTC | #12
Thanks!


Juniper Business Use Only

-----Original Message-----
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com> 
Sent: Wednesday, March 31, 2021 10:05 AM
To: Ron Bonica <rbonica@juniper.net>
Cc: David Ahern <dsahern@gmail.com>; Zachary Dodds <zdodds@gmail.com>; Ishaan Gandhi <ishaangandhi@gmail.com>; Andreas Roeseler <andreas.a.roeseler@gmail.com>; David Miller <davem@davemloft.net>; Network Development <netdev@vger.kernel.org>; Stephen Hemminger <stephen@networkplumber.org>; junipeross20@cs.hmc.edu
Subject: Re: rfc5837 and rfc8335

[External Email. Be cautious of content]


On Mon, Mar 29, 2021 at 3:40 PM Ron Bonica <rbonica@juniper.net> wrote:
>
> Folks,
>
> Andreas reminds me that you may have the same questions regarding RFC 8335.....
>
> The practice of assigning globally reachable IP addresses to infrastructure interfaces cost network operators money. Normally, they number an interface from a IPv4  /30. Currently, a /30 costs 80 USD and the price is only expected to rise. Furthermore, most IP Address Management (IPAM) systems license by the address block. The more globally reachable addresses you use, the more you pay.
>
> They would prefer to use:
>
> - IPv4 unnumbered interfaces
> - IPv6 interfaces that have only link-local addresses
>
>                                                                     
> Ron

Thanks for the context, Ron.

That sounds reasonable to me. Andreas's patch series has also been merged by now.


>
>
>
> Juniper Business Use Only
>
> -----Original Message-----
> From: Ron Bonica
> Sent: Monday, March 29, 2021 10:50 AM
> To: David Ahern <dsahern@gmail.com>; Zachary Dodds <zdodds@gmail.com>; 
> Ishaan Gandhi <ishaangandhi@gmail.com>
> Cc: Andreas Roeseler <andreas.a.roeseler@gmail.com>; David Miller 
> <davem@davemloft.net>; Network Development <netdev@vger.kernel.org>; 
> Stephen Hemminger <stephen@networkplumber.org>; Willem de Bruijn 
> <willemdebruijn.kernel@gmail.com>; junipeross20@cs.hmc.edu
> Subject: RE: rfc5837 and rfc8335
>
> David,
>
> Juniper networks is motivated to promote RFC 5837 now, as opposed to 
> eleven years ago, because the deployment of parallel links between 
> routers is becoming more common
>
> Large network operators frequently require more than 400 Gbps connectivity between their backbone routers. However, the largest interfaces available can only handle 400 Gbps. So, parallel links are required. Moreover, it is frequently cheaper to deploy 4 100 Gbps interfaces than a single 400 Gbps interface. So, it is not uncommon to see two routers connected by many, parallel 100 Gbps links. RFC 5837 allows a network operator to trace a packet interface to interface, as opposed to node to node.
>
> I think that you are correct in saying that:
>
> - LINUX is more likely to be implemented on a host than a router
> - Therefore, LINUX hosts will  not send RFC 5837 ICMP extensions often
>
> However, LINUX hosts are frequently used in network management stations. Therefore, it would be very useful if LINUX hosts could parse and display incoming RFC 5837 extensions, just as they display RFC 4950 ICMP extensions.

But the patch series under review adds support to generate such packets.


> Juniper networks plans to support RFC 5837 on one platform in an upcoming release and on other platforms soon after that.
>
>                                                                                  
> Ron
>
>
>
>
> Juniper Business Use Only
>
> -----Original Message-----
> From: David Ahern <dsahern@gmail.com>
> Sent: Wednesday, March 24, 2021 11:19 PM
> To: Ron Bonica <rbonica@juniper.net>; Zachary Dodds 
> <zdodds@gmail.com>; Ishaan Gandhi <ishaangandhi@gmail.com>
> Cc: Andreas Roeseler <andreas.a.roeseler@gmail.com>; David Miller 
> <davem@davemloft.net>; Network Development <netdev@vger.kernel.org>; 
> Stephen Hemminger <stephen@networkplumber.org>; Willem de Bruijn 
> <willemdebruijn.kernel@gmail.com>; junipeross20@cs.hmc.edu
> Subject: Re: rfc5837 and rfc8335
>
> On 3/23/21 10:39 AM, Ron Bonica wrote:
> > Hi Folks,
> >
> >
> >
> > The rationale for RFC 8335 can be found in Section 5.0 of that document.
> > Currently, ICMP ECHO and ECHO RESPONSE messages can be used to 
> > determine the liveness of some interfaces. However, they cannot 
> > determine the liveness of:
> >
> >
> >
> >   * An unnumbered IPv4 interface
> >   * An IPv6 interface that has only a link-local address
> >
> >
> >
> > A router can have hundreds, or even thousands of interfaces that 
> > fall into these categories.
> >
> >
> >
> > The rationale for RFC 5837 can be found in the Introduction to that
> > document. When a node sends an ICMP TTL Expired message, the node 
> > reports that a packet has expired on it. However, the source address 
> > of the ICMP TTL Expired message doesn't necessarily identify the 
> > interface upon which the packet arrived. So, TRACEROUTE can be 
> > relied upon to identify the nodes that a packet traverses along its 
> > delivery path. But it cannot be relied upon to identify the 
> > interfaces that a packet traversed along its delivery path.
> >
> >
>
> It's not a question of the rationale; the question is why add this support to Linux now? RFC 5837 is 11 years old. Why has no one cared to add support before now? What tooling supports it? What other NOS'es support it to really make the feature meaningful? e.g., Do you know what Juniper products support RFC 5837 today?
>
> More than likely Linux is the end node of the traceroute chain, not the transit or path nodes. With Linux, the ingress interface can get lost in the layers (NIC port, vlan, bond, bridge, vrf, macvlan), and to properly support either you need to return information about the right one.
> Unnumbered interfaces can make that more of a challenge.
Ishaan Gandhi April 8, 2021, 10:03 p.m. UTC | #13
> But the patch series under review adds support to generate such packets.

Willem, yes, good point. Let me add some more context.

This quote might be misleading:

>> Therefore, LINUX hosts will not send RFC 5837 ICMP extensions often

Linux _hosts_ won’t send 5837 extensions often, but Linux is often
used in routers, which may indeed send those extensions often.

(Ron previously added some great context why routers may want to send those
extensions _now_ as opposed to 11 years ago.)

Juniper is adding RFC 5837 support to some of its router lines right now, and
some of its OSes are based on Linux.

Contributing these changes to the upstream kernel would be valuable
to Juniper for maintainability. (No re-applying the patch every time Juniper
upgrades to a new kernel version.)

Furthermore, Juniper is not the only router vendor based on Linux.

OpenWrt and DD-WRT are two popular router distributions, both based on Linux.
There are 16 Linux-based router distributions mentioned on
https://en.wikipedia.org/wiki/List_of_router_and_firewall_distributions alone.

Re: some previous questions:

>> What tooling supports it? 

Currently, Wireshark. There are also open merge requests for traceroute in iputils
and for tcpdump.

>> With Linux, the ingress interface can get lost in the layers (NIC port, vlan, bond, bridge, vrf, macvlan), and to properly support either you need to return information about the right one.

either return the information, or?

Many thanks!
- Ishaan

> On Mar 31, 2021, at 7:04 AM, Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote:
> 
> On Mon, Mar 29, 2021 at 3:40 PM Ron Bonica <rbonica@juniper.net> wrote:
>> 
>> Folks,
>> 
>> Andreas reminds me that you may have the same questions regarding RFC 8335.....
>> 
>> The practice of assigning globally reachable IP addresses to infrastructure interfaces cost network operators money. Normally, they number an interface from a IPv4  /30. Currently, a /30 costs 80 USD and the price is only expected to rise. Furthermore, most IP Address Management (IPAM) systems license by the address block. The more globally reachable addresses you use, the more you pay.
>> 
>> They would prefer to use:
>> 
>> - IPv4 unnumbered interfaces
>> - IPv6 interfaces that have only link-local addresses
>> 
>>                                                                    Ron
> 
> Thanks for the context, Ron.
> 
> That sounds reasonable to me. Andreas's patch series has also been
> merged by now.
> 
> 
Ishaan Gandhi May 3, 2021, 1:41 a.m. UTC | #14
Gently bumping this. Any thoughts?


Patch

diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 837d51f9e1fa..55d38539a72a 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -1204,6 +1204,15 @@  icmp_errors_use_inbound_ifaddr - BOOLEAN
 
 	Default: 0
 
+icmp_errors_identify_if - BOOLEAN
+
+	If 1, then the kernel will append an extension object identifying
+	the interface on which the packet that caused the error arrived.
+	The object will contain the ifIndex, interface name, interface IP
+	address, and/or MTU, in accordance with RFC 5837.
+
+	Default: 0
+
 igmp_max_memberships - INTEGER
 	Change the maximum number of multicast groups we can subscribe to.
 	Default: 20
diff --git a/include/linux/icmp.h b/include/linux/icmp.h
index 8fc38a34cb20..db1a17dbc338 100644
--- a/include/linux/icmp.h
+++ b/include/linux/icmp.h
@@ -39,4 +39,7 @@  static inline bool icmp_is_err(int type)
 void ip_icmp_error_rfc4884(const struct sk_buff *skb,
 			   struct sock_ee_data_rfc4884 *out);
 
+void icmp_identify_arrival_interface(struct sk_buff *skb, struct net *net, int room,
+				     char *icmph, int ip_version);
+
 #endif	/* _LINUX_ICMP_H */
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 9e36738c1fe1..fd68a47f1130 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -90,6 +90,7 @@  struct netns_ipv4 {
 	int sysctl_icmp_ratelimit;
 	int sysctl_icmp_ratemask;
 	int sysctl_icmp_errors_use_inbound_ifaddr;
+	int sysctl_icmp_errors_identify_if;
 
 	struct local_ports ip_local_ports;
 
diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index 5ec054473d81..9608e0a82401 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -36,6 +36,7 @@  struct netns_sysctl_ipv6 {
 	int icmpv6_echo_ignore_all;
 	int icmpv6_echo_ignore_multicast;
 	int icmpv6_echo_ignore_anycast;
+	int icmpv6_errors_identify_if;
 	DECLARE_BITMAP(icmpv6_ratemask, ICMPV6_MSG_MAX + 1);
 	unsigned long *icmpv6_ratemask_ptr;
 	int anycast_src_echo_reply;
diff --git a/include/uapi/linux/icmp.h b/include/uapi/linux/icmp.h
index fb169a50895e..f43ecd3c3677 100644
--- a/include/uapi/linux/icmp.h
+++ b/include/uapi/linux/icmp.h
@@ -118,4 +118,30 @@  struct icmp_extobj_hdr {
 	__u8		class_type;
 };
 
+/* RFC 5837 Bitmasks */
+#define ICMP_5837_MTU_CTYPE		(1 << 0)
+#define ICMP_5837_NAME_CTYPE		(1 << 1)
+#define ICMP_5837_IP_ADDR_CTYPE		(1 << 2)
+#define ICMP_5837_IF_INDEX_CTYPE	(1 << 3)
+
+#define ICMP_5837_ARRIVAL_ROLE_CTYPE	(0 << 6)
+#define ICMP_5837_SUB_IP_ROLE_CTYPE	(1 << 6)
+#define ICMP_5837_FORWARD_ROLE_CTYPE	(2 << 6)
+#define ICMP_5837_NEXT_HOP_ROLE_CTYPE	(3 << 6)
+
+#define ICMP_5837_MIN_ORIG_LEN 128
+
+/* RFC 5837 Interface IP Address sub-object */
+struct interface_ipv4_addr_sub_obj {
+	__be16 afi;
+	__be16 reserved;
+	__be32 addr;
+};
+
+struct interface_ipv6_addr_sub_obj {
+	__be16 afi;
+	__be16 reserved;
+	__be32 addr[4];
+};
+
 #endif /* _UAPI_LINUX_ICMP_H */
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 793aebf07c2a..c203758471c7 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -87,6 +87,7 @@ 
 #include <linux/timer.h>
 #include <linux/init.h>
 #include <linux/uaccess.h>
+#include <net/addrconf.h>
 #include <net/checksum.h>
 #include <net/xfrm.h>
 #include <net/inet_common.h>
@@ -555,6 +556,156 @@  static struct rtable *icmp_route_lookup(struct net *net,
 	return ERR_PTR(err);
 }
 
+/*  Appends interface identification object to ICMP packet to identify
+ *  the interface on which the original datagram arrived, per RFC 5837.
+ *
+ *  Should only be called on the following messages
+ *  - ICMPv4 Time Exceeded
+ *  - ICMPv4 Destination Unreachable
+ *  - ICMPv4 Parameter Problem
+ *  - ICMPv6 Time Exceeded
+ *  - ICMPv6 Destination Unreachable
+ */
+
+void icmp_identify_arrival_interface(struct sk_buff *skb, struct net *net, int room,
+				     char *icmph, int ip_version)
+{
+	unsigned int ext_len, orig_len, word_aligned_orig_len, offset, extra_space_needed,
+		     if_index, mtu = 0, name_len = 0, name_subobj_len = 0;
+	struct interface_ipv4_addr_sub_obj ip4_addr_subobj = {.addr = 0};
+	struct interface_ipv6_addr_sub_obj ip6_addr_subobj;
+	struct icmp_extobj_hdr *iio_hdr;
+	struct inet6_ifaddr ip6_ifaddr;
+	struct inet6_dev *dev6 = NULL;
+	struct icmp_ext_hdr *ext_hdr;
+	char *name = NULL, ctype;
+	struct net_device *dev;
+	void *subobj_offset;
+
+	skb_linearize(skb);
+	if_index = inet_iif(skb);
+	orig_len = skb->len - skb_network_offset(skb);
+
+	/* IPv4 messages MUST be 32-bit aligned, IPv6 64-bit aligned.
+	 * Both must be padded to 128 bytes. */
+	if (ip_version == 4) {
+		word_aligned_orig_len = max_t(unsigned int, ALIGN(orig_len, 4), 128);
+		icmph[5] = word_aligned_orig_len / 4;
+	} else {
+		word_aligned_orig_len = max_t(unsigned int, ALIGN(orig_len, 8), 128);
+		icmph[4] = word_aligned_orig_len / 8;
+	}
+
+	ctype = ICMP_5837_ARRIVAL_ROLE_CTYPE;
+	ext_len = sizeof(struct icmp_ext_hdr) + sizeof(struct icmp_extobj_hdr);
+
+	/* Always add if_index to the IIO */
+	ext_len += sizeof(__be32);
+	ctype |= ICMP_5837_IF_INDEX_CTYPE;
+
+	dev = dev_get_by_index(net, if_index);
+	/* Try to append IP address, name, and MTU */
+	if (dev) {
+		if (ip_version == 4) {
+			ip4_addr_subobj.addr = inet_select_addr(dev, 0, RT_SCOPE_UNIVERSE);
+			if (ip4_addr_subobj.addr) {
+				ip4_addr_subobj.afi = htons(1);
+				ip4_addr_subobj.reserved = 0;
+				ctype |= ICMP_5837_IP_ADDR_CTYPE;
+				ext_len += sizeof(ip4_addr_subobj);
+			}
+		}
+		if (ip_version == 6) {
+			dev6 = in6_dev_get(dev);
+			if (dev6) {
+				ip6_ifaddr = *list_first_entry(&dev6->addr_list,
+							      struct inet6_ifaddr, if_list);
+				memcpy(ip6_addr_subobj.addr, ip6_ifaddr.addr.s6_addr32,
+				       sizeof(ip6_addr_subobj.addr));
+				ip6_addr_subobj.afi = htons(2);
+				ip6_addr_subobj.reserved = 0;
+				ctype |= ICMP_5837_IP_ADDR_CTYPE;
+				ext_len += sizeof(ip6_addr_subobj);
+			}
+		}
+
+		name = dev->name;
+		name_len = strlen(name);
+		/* Subobj has 1 extra byte for the length field, and is 32 bit-aligned */
+		name_subobj_len = ALIGN(name_len + 1, 4);
+		ctype |= ICMP_5837_NAME_CTYPE;
+		ext_len += name_subobj_len;
+
+		mtu = dev->mtu;
+		ctype |= ICMP_5837_MTU_CTYPE;
+		ext_len += sizeof(__be32);
+
+		dev_put(dev);
+	}
+
+	if (word_aligned_orig_len + ext_len > room) {
+		offset = room - ext_len;
+		extra_space_needed = room - orig_len;
+	} else {
+		/* There is enough room to just add to the end of the packet.
+		 * We need (word_aligned_orig_len - orig_len bytes) for padding
+		 * and ext_len bytes for the extension. */
+		offset = word_aligned_orig_len;
+		extra_space_needed = ext_len + (word_aligned_orig_len - orig_len);
+	}
+
+	if (skb_tailroom(skb) < extra_space_needed) {
+		if (pskb_expand_head(skb, 0, extra_space_needed - skb_tailroom(skb), GFP_ATOMIC))
+			return;
+	}
+	skb_put(skb, extra_space_needed);
+
+	/* Zero-pad from the end of the original message to the start of the extension header */
+	memset(skb_network_header(skb) + orig_len, 0, word_aligned_orig_len - orig_len);
+
+	ext_hdr = (struct icmp_ext_hdr *)(skb_network_header(skb) + offset);
+	iio_hdr = (struct icmp_extobj_hdr *)(ext_hdr + 1);
+	subobj_offset = (void *)(iio_hdr + 1);
+
+	ext_hdr->reserved1 = 0;
+	ext_hdr->reserved2 = 0;
+	ext_hdr->version = 2;
+	ext_hdr->checksum = 0;
+
+	iio_hdr->length = htons(ext_len - sizeof(struct icmp_ext_hdr));
+	iio_hdr->class_num = 2;
+	iio_hdr->class_type = ctype;
+
+	*(__be32 *)subobj_offset = htonl(if_index);
+	subobj_offset += sizeof(__be32);
+
+	if (ip4_addr_subobj.addr) {
+		*(struct interface_ipv4_addr_sub_obj *)subobj_offset = ip4_addr_subobj;
+		subobj_offset += sizeof(ip4_addr_subobj);
+	}
+
+	if (dev6) {
+		*(struct interface_ipv6_addr_sub_obj *)subobj_offset = ip6_addr_subobj;
+		subobj_offset += sizeof(ip6_addr_subobj);
+	}
+
+	if (name) {
+		*(__u8 *)subobj_offset = name_subobj_len;
+		subobj_offset += sizeof(__u8);
+		memcpy(subobj_offset, name, name_len);
+		memset(subobj_offset + name_len, 0, name_subobj_len - (name_len + sizeof(__u8)));
+		subobj_offset += name_subobj_len - sizeof(__u8);
+	}
+
+	if (mtu) {
+		*(__be32 *)subobj_offset = htonl(mtu);
+		subobj_offset += sizeof(__be32);
+	}
+
+	ext_hdr->checksum =
+		csum_fold(skb_checksum(skb, skb_network_offset(skb) + offset, ext_len, 0));
+}
+
 /*
  *	Send an ICMP message in response to a situation
  *
@@ -731,6 +882,11 @@  void __icmp_send(struct sk_buff *skb_in, int type, int code, __be32 info,
 		room = 576;
 	room -= sizeof(struct iphdr) + icmp_param.replyopts.opt.opt.optlen;
 	room -= sizeof(struct icmphdr);
+	if (net->ipv4.sysctl_icmp_errors_identify_if && (type == ICMP_DEST_UNREACH ||
+	    type == ICMP_TIME_EXCEEDED || type == ICMP_PARAMETERPROB)) {
+		icmp_identify_arrival_interface(skb_in, net, room,
+						(char *)&icmp_param.data.icmph, 4);
+	}
 
 	icmp_param.data_len = skb_in->len - icmp_param.offset;
 	if (icmp_param.data_len > room)
@@ -1349,6 +1505,7 @@  static int __net_init icmp_sk_init(struct net *net)
 	net->ipv4.sysctl_icmp_ratelimit = 1 * HZ;
 	net->ipv4.sysctl_icmp_ratemask = 0x1818;
 	net->ipv4.sysctl_icmp_errors_use_inbound_ifaddr = 0;
+	net->ipv4.sysctl_icmp_errors_identify_if = 0;
 
 	return 0;
 
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 5653e3b011bf..c54881136497 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -628,6 +628,15 @@  static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+	{
+		.procname	= "icmp_errors_identify_if",
+		.data		= &init_net.ipv4.sysctl_icmp_errors_identify_if,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 	{
 		.procname	= "icmp_ratelimit",
 		.data		= &init_net.ipv4.sysctl_icmp_ratelimit,
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index b304b882e031..3901c2a99be4 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -939,6 +939,7 @@  static int __net_init inet6_net_init(struct net *net)
 	net->ipv6.sysctl.icmpv6_echo_ignore_all = 0;
 	net->ipv6.sysctl.icmpv6_echo_ignore_multicast = 0;
 	net->ipv6.sysctl.icmpv6_echo_ignore_anycast = 0;
+	net->ipv6.sysctl.icmpv6_errors_identify_if = 0;
 
 	/* By default, rate limit error messages.
 	 * Except for pmtu discovery, it would break it.
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index 9df8737ae0d3..ad4c986b806b 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -595,6 +595,13 @@  static void icmp6_send(struct sk_buff *skb, u8 type, u8 code, __u32 info,
 	msg.offset = skb_network_offset(skb);
 	msg.type = type;
 
+	if (net->ipv6.sysctl.icmpv6_errors_identify_if &&
+	    (type == ICMPV6_DEST_UNREACH || type == ICMPV6_TIME_EXCEED)) {
+		icmp_identify_arrival_interface(skb, net,
+			IPV6_MIN_MTU - sizeof(struct ipv6hdr) - sizeof(struct icmp6hdr),
+			(char *)&tmp_hdr, 6);
+	}
+
 	len = skb->len - msg.offset;
 	len = min_t(unsigned int, len, IPV6_MIN_MTU - sizeof(struct ipv6hdr) - sizeof(struct icmp6hdr));
 	if (len < 0) {
@@ -1184,6 +1191,15 @@  static struct ctl_table ipv6_icmp_table_template[] = {
 		.mode		= 0644,
 		.proc_handler = proc_do_large_bitmap,
 	},
+	{
+		.procname	= "errors_identify_if",
+		.data		= &init_net.ipv6.sysctl.icmpv6_errors_identify_if,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler = proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 	{ },
 };
 
@@ -1201,6 +1217,7 @@  struct ctl_table * __net_init ipv6_icmp_sysctl_init(struct net *net)
 		table[2].data = &net->ipv6.sysctl.icmpv6_echo_ignore_multicast;
 		table[3].data = &net->ipv6.sysctl.icmpv6_echo_ignore_anycast;
 		table[4].data = &net->ipv6.sysctl.icmpv6_ratemask_ptr;
+		table[5].data = &net->ipv6.sysctl.icmpv6_errors_identify_if;
 	}
 	return table;
 }