mbox series

[v1,net-next,00/13] ipv6: No RTNL for IPv6 routing table.

Message ID 20250321040131.21057-1-kuniyu@amazon.com (mailing list archive)
Headers show
Series ipv6: No RTNL for IPv6 routing table. | expand

Message

Kuniyuki Iwashima March 21, 2025, 4 a.m. UTC
IPv6 routing tables are protected by each table's lock and work in
the interrupt context, which means we basically don't need RTNL to
modify an IPv6 routing table itself.

Currently, the control paths require RTNL because we may need to
perform device and nexthop lookups; we must prevent dev/nexthop from
going away from the netns.

This, however, can be achieved by RCU as well.

If we are in the RCU critical section while adding an IPv6 route,
synchronize_net() in netif_change_net_namespace() and
unregister_netdevice_many_notify() guarantee that the dev will not be
moved to another netns or removed.

Also, nexthop is guaranteed not to be freed during the RCU grace period.

If we care about a race between nexthop removal and IPv6 route addition,
we can get rid of RTNL from the control paths.

Patch 1 moves a validation for RTA_MULTIPATH earlier.
Patch 2 removes RTNL for SIOCDELRT and RTM_DELROUTE.
Patch 3 ~ 10 move validation and memory allocation earlier.
Patch 11 prevents a race between two requests for the same table.
Patch 12 prevents the race mentioned above.
Patch 13 removes RTNL for SIOCADDRT and RTM_NEWROUTE.


Test:

The script [0] lets each CPU-X create 100000 routes on table-X in a
batch.

On c7a.metal-48xl EC2 instance with 192 CPUs,

With this series:

  $ sudo ./route_test.sh
  start adding routes
  added 19200000 routes (100000 routes * 192 tables).
  Time elapsed: 189154 milliseconds.

Without series:

  $ sudo ./route_test.sh
  start adding routes
  added 19200000 routes (100000 routes * 192 tables).
  Time elapsed: 62531 milliseconds.

I changed the number of routes (1000 ~ 100000 per CPU/table) and
constantly saw it complete 3x faster with this series.


[0]
#!/bin/bash

mkdir tmp

NS="test"
ip netns add $NS
ip -n $NS link add veth0 type veth peer veth1
ip -n $NS link set veth0 up
ip -n $NS link set veth1 up

TABLES=()
for i in $(seq $(nproc)); do
    TABLES+=("$i")
done

ROUTES=()
for i in {1..100}; do
    for j in {1..1000}; do
	ROUTES+=("2001:$i:$j::/64")
    done
done

for TABLE in "${TABLES[@]}"; do
    FILE="./tmp/batch-table-$TABLE.txt"
    > $FILE
    for ROUTE in "${ROUTES[@]}"; do
        echo "route add $ROUTE dev veth0 table $TABLE" >> $FILE
    done
done

echo "start adding routes"

START_TIME=$(date +%s%3N)
for TABLE in "${TABLES[@]}"; do
    ip -n $NS -6 -batch "./tmp/batch-table-$TABLE.txt" &
done

wait
END_TIME=$(date +%s%3N)
ELAPSED_TIME=$((END_TIME - START_TIME))

echo "added $((${#ROUTES[@]} * ${#TABLES[@]})) routes (${#ROUTES[@]} routes * ${#TABLES[@]} tables)."
echo "Time elapsed: ${ELAPSED_TIME} milliseconds."
echo $(ip -n $NS -6 route show table all | wc -l)  # Just for debug

ip netns del $NS
rm -fr ./tmp/


Kuniyuki Iwashima (13):
  ipv6: Validate RTA_GATEWAY of RTA_MULTIPATH in rtm_to_fib6_config().
  ipv6: Get rid of RTNL for SIOCDELRT and RTM_DELROUTE.
  ipv6: Move some validation from ip6_route_info_create() to
    rtm_to_fib6_config().
  ipv6: Check GATEWAY in rtm_to_fib6_multipath_config().
  ipv6: Move nexthop_find_by_id() after fib6_info_alloc().
  ipv6: Split ip6_route_info_create().
  ipv6: Preallocate rt->fib6_nh->rt6i_pcpu in ip6_route_info_create().
  ipv6: Preallocate nhc_pcpu_rth_output in ip6_route_info_create().
  ipv6: Don't pass net to ip6_route_info_append().
  ipv6: Factorise ip6_route_multipath_add().
  ipv6: Protect fib6_link_table() with spinlock.
  ipv6: Protect nh->f6i_list with spinlock and flag.
  ipv6: Get rid of RTNL for SIOCADDRT and RTM_NEWROUTE.

 include/net/netns/ipv6.h |   1 +
 include/net/nexthop.h    |   2 +
 net/ipv4/fib_semantics.c |  10 +-
 net/ipv4/nexthop.c       |  24 +-
 net/ipv6/ip6_fib.c       |  51 +++-
 net/ipv6/route.c         | 555 ++++++++++++++++++++++++---------------
 6 files changed, 415 insertions(+), 228 deletions(-)

Comments

Stanislav Fomichev March 21, 2025, 2:07 p.m. UTC | #1
On 03/20, Kuniyuki Iwashima wrote:
> IPv6 routing tables are protected by each table's lock and work in
> the interrupt context, which means we basically don't need RTNL to
> modify an IPv6 routing table itself.
> 
> Currently, the control paths require RTNL because we may need to
> perform device and nexthop lookups; we must prevent dev/nexthop from
> going away from the netns.
> 
> This, however, can be achieved by RCU as well.
> 
> If we are in the RCU critical section while adding an IPv6 route,
> synchronize_net() in netif_change_net_namespace() and
> unregister_netdevice_many_notify() guarantee that the dev will not be
> moved to another netns or removed.
> 
> Also, nexthop is guaranteed not to be freed during the RCU grace period.
> 
> If we care about a race between nexthop removal and IPv6 route addition,
> we can get rid of RTNL from the control paths.
> 
> Patch 1 moves a validation for RTA_MULTIPATH earlier.
> Patch 2 removes RTNL for SIOCDELRT and RTM_DELROUTE.
> Patch 3 ~ 10 move validation and memory allocation earlier.
> Patch 11 prevents a race between two requests for the same table.
> Patch 12 prevents the race mentioned above.
> Patch 13 removes RTNL for SIOCADDRT and RTM_NEWROUTE.
> 
> 
> Test:
> 
> The script [0] lets each CPU-X create 100000 routes on table-X in a
> batch.
> 
> On c7a.metal-48xl EC2 instance with 192 CPUs,
> 
> With this series:
> 
>   $ sudo ./route_test.sh
>   start adding routes
>   added 19200000 routes (100000 routes * 192 tables).
>   Time elapsed: 189154 milliseconds.
> 
> Without series:
> 
>   $ sudo ./route_test.sh
>   start adding routes
>   added 19200000 routes (100000 routes * 192 tables).
>   Time elapsed: 62531 milliseconds.
> 
> I changed the number of routes (1000 ~ 100000 per CPU/table) and
> constantly saw it complete 3x faster with this series.
> 
> 
> [0]
> #!/bin/bash
> 
> mkdir tmp
> 
> NS="test"
> ip netns add $NS
> ip -n $NS link add veth0 type veth peer veth1
> ip -n $NS link set veth0 up
> ip -n $NS link set veth1 up
> 
> TABLES=()
> for i in $(seq $(nproc)); do
>     TABLES+=("$i")
> done
> 
> ROUTES=()
> for i in {1..100}; do
>     for j in {1..1000}; do
> 	ROUTES+=("2001:$i:$j::/64")
>     done
> done
> 
> for TABLE in "${TABLES[@]}"; do
>     FILE="./tmp/batch-table-$TABLE.txt"
>     > $FILE
>     for ROUTE in "${ROUTES[@]}"; do
>         echo "route add $ROUTE dev veth0 table $TABLE" >> $FILE
>     done
> done
> 
> echo "start adding routes"
> 
> START_TIME=$(date +%s%3N)
> for TABLE in "${TABLES[@]}"; do
>     ip -n $NS -6 -batch "./tmp/batch-table-$TABLE.txt" &
> done
> 
> wait
> END_TIME=$(date +%s%3N)
> ELAPSED_TIME=$((END_TIME - START_TIME))
> 
> echo "added $((${#ROUTES[@]} * ${#TABLES[@]})) routes (${#ROUTES[@]} routes * ${#TABLES[@]} tables)."
> echo "Time elapsed: ${ELAPSED_TIME} milliseconds."
> echo $(ip -n $NS -6 route show table all | wc -l)  # Just for debug
> 
> ip netns del $NS
> rm -fr ./tmp/

Lockdep is not supper happy about some patch:
https://netdev-3.bots.linux.dev/vmksft-forwarding-dbg/results/42463/38-gre-multipath-nh-res-sh/stderr

---
pw-bot: cr
Kuniyuki Iwashima March 21, 2025, 4:50 p.m. UTC | #2
From: Stanislav Fomichev <stfomichev@gmail.com>
Date: Fri, 21 Mar 2025 07:07:15 -0700
> On 03/20, Kuniyuki Iwashima wrote:
> > IPv6 routing tables are protected by each table's lock and work in
> > the interrupt context, which means we basically don't need RTNL to
> > modify an IPv6 routing table itself.
> > 
> > Currently, the control paths require RTNL because we may need to
> > perform device and nexthop lookups; we must prevent dev/nexthop from
> > going away from the netns.
> > 
> > This, however, can be achieved by RCU as well.
> > 
> > If we are in the RCU critical section while adding an IPv6 route,
> > synchronize_net() in netif_change_net_namespace() and
> > unregister_netdevice_many_notify() guarantee that the dev will not be
> > moved to another netns or removed.
> > 
> > Also, nexthop is guaranteed not to be freed during the RCU grace period.
> > 
> > If we care about a race between nexthop removal and IPv6 route addition,
> > we can get rid of RTNL from the control paths.
> > 
> > Patch 1 moves a validation for RTA_MULTIPATH earlier.
> > Patch 2 removes RTNL for SIOCDELRT and RTM_DELROUTE.
> > Patch 3 ~ 10 move validation and memory allocation earlier.
> > Patch 11 prevents a race between two requests for the same table.
> > Patch 12 prevents the race mentioned above.
> > Patch 13 removes RTNL for SIOCADDRT and RTM_NEWROUTE.
> > 
> > 
> > Test:
> > 
> > The script [0] lets each CPU-X create 100000 routes on table-X in a
> > batch.
> > 
> > On c7a.metal-48xl EC2 instance with 192 CPUs,
> > 
> > With this series:
> > 
> >   $ sudo ./route_test.sh
> >   start adding routes
> >   added 19200000 routes (100000 routes * 192 tables).
> >   Time elapsed: 189154 milliseconds.
> > 
> > Without series:
> > 
> >   $ sudo ./route_test.sh
> >   start adding routes
> >   added 19200000 routes (100000 routes * 192 tables).
> >   Time elapsed: 62531 milliseconds.
> > 
> > I changed the number of routes (1000 ~ 100000 per CPU/table) and
> > constantly saw it complete 3x faster with this series.
> > 
> > 
> > [0]
> > #!/bin/bash
> > 
> > mkdir tmp
> > 
> > NS="test"
> > ip netns add $NS
> > ip -n $NS link add veth0 type veth peer veth1
> > ip -n $NS link set veth0 up
> > ip -n $NS link set veth1 up
> > 
> > TABLES=()
> > for i in $(seq $(nproc)); do
> >     TABLES+=("$i")
> > done
> > 
> > ROUTES=()
> > for i in {1..100}; do
> >     for j in {1..1000}; do
> > 	ROUTES+=("2001:$i:$j::/64")
> >     done
> > done
> > 
> > for TABLE in "${TABLES[@]}"; do
> >     FILE="./tmp/batch-table-$TABLE.txt"
> >     > $FILE
> >     for ROUTE in "${ROUTES[@]}"; do
> >         echo "route add $ROUTE dev veth0 table $TABLE" >> $FILE
> >     done
> > done
> > 
> > echo "start adding routes"
> > 
> > START_TIME=$(date +%s%3N)
> > for TABLE in "${TABLES[@]}"; do
> >     ip -n $NS -6 -batch "./tmp/batch-table-$TABLE.txt" &
> > done
> > 
> > wait
> > END_TIME=$(date +%s%3N)
> > ELAPSED_TIME=$((END_TIME - START_TIME))
> > 
> > echo "added $((${#ROUTES[@]} * ${#TABLES[@]})) routes (${#ROUTES[@]} routes * ${#TABLES[@]} tables)."
> > echo "Time elapsed: ${ELAPSED_TIME} milliseconds."
> > echo $(ip -n $NS -6 route show table all | wc -l)  # Just for debug
> > 
> > ip netns del $NS
> > rm -fr ./tmp/
> 
> Lockdep is not supper happy about some patch:
> https://netdev-3.bots.linux.dev/vmksft-forwarding-dbg/results/42463/38-gre-multipath-nh-res-sh/stderr

Looks like I need to extend the RCU critical section in
ip6_route_del() in patch 2:

---8<---
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index d1d60415d1aa..b6434532858f 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -4104,9 +4104,9 @@ static int ip6_route_del(struct fib6_config *cfg,
 			if (rt->nh) {
 				if (!fib6_info_hold_safe(rt))
 					continue;
-				rcu_read_unlock();
 
-				return __ip6_del_rt(rt, &cfg->fc_nlinfo);
+				err =  __ip6_del_rt(rt, &cfg->fc_nlinfo);
+				break;
 			}
 			if (cfg->fc_nh_id)
 				continue;
@@ -4121,13 +4121,13 @@ static int ip6_route_del(struct fib6_config *cfg,
 				continue;
 			if (!fib6_info_hold_safe(rt))
 				continue;
-			rcu_read_unlock();
 
 			/* if gateway was specified only delete the one hop */
 			if (cfg->fc_flags & RTF_GATEWAY)
-				return __ip6_del_rt(rt, &cfg->fc_nlinfo);
-
-			return __ip6_del_rt_siblings(rt, cfg);
+				err = __ip6_del_rt(rt, &cfg->fc_nlinfo);
+			else
+				err = __ip6_del_rt_siblings(rt, cfg);
+			break;
 		}
 	}
 	rcu_read_unlock();
---8<---

Thanks!