Message ID | 20241220140516563WDQ_X40bt0ZOch3Qte1YO@zte.com.cn (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [linux,next] net:dsa:fix the dsa_ptr null pointer dereference | expand |
… > +++ b/net/dsa/dsa.c > @@ -1561,6 +1561,17 @@ void dsa_unregister_switch(struct dsa_switch *ds) > } > EXPORT_SYMBOL_GPL(dsa_unregister_switch); > > +static void dsa_conduit_ethtool_shutdown(struct net_device *dev) > +{ > + struct dsa_port *cpu_dp = dev->dsa_ptr; I suggest to assign this local variable only after the subsequent check. > + > + if (netif_is_lag_master(dev)) > + return; > + > + dev->ethtool_ops = cpu_dp->orig_ethtool_ops; > + cpu_dp->orig_ethtool_ops = NULL; > +} … Regards, Markus
Hello Kun Jiang, On Fri, Dec 20, 2024 at 02:05:16PM +0800, jiang.kun2@zte.com.cn wrote: > From: Peilin He<he.peilin@zte.com.cn> > > Issue > ===== > Repeatedly accessing the DSA Ethernet controller via the ethtool command, > followed by a system reboot, may trigger a DSA null pointer dereference, > causing a kernel panic and preventing the system from rebooting properly. > This can lead to data loss or denial-of-service, resulting in serious > consequences. > > The original problem occurred in the Linux kernel version 5.4.19. > The following is the panic log: > > [ 172.523467] Unable to handle kernel NULL pointer dereference at virtual > address 0000000000000020 > [ 172.532455] Mem abort info: > [ 172.535313] printk: console [ttyS0]: printing thread stopped > [ 172.536352] ESR = 0x0000000096000006 > [ 172.544926] EC = 0x25: DABT (current EL), IL = 32 bits > [ 172.550321] SET = 0, FnV = 0 > [ 172.553427] EA = 0, S1PTW = 0 > [ 172.556646] FSC = 0x06: level 2 translation fault > [ 172.561604] Data abort info: > [ 172.564563] ISV = 0, ISS = 0x00000006 > [ 172.568466] CM = 0, WnR = 0 > [ 172.571502] user pgtable: 4k pages, 48-bit VAs, pgdp=00000020a4b34000 > [ 172.578058] [0000000000000020] pgd=08000020a4ce6003, p4d=08000020a4ce6003, > pud=08000020a4b4d003, pmd=0000000000000000 > [ 172.588785] Internal error: Oops: 96000006 [#1] PREEMPT_RT SMP > [ 172.594641] Modules linked in: r8168(O) bcmdhd(O) ossmod(O) tipc(O) > [ 172.600933] CPU: 1 PID: 548 Comm: lldpd Tainted: G O > [ 172.610795] Hardware name: LS1028A RDB Board (DT) > [ 172.615508] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--) > [ 172.622492] pc : dsa_master_get_sset_count+0x24/0xa4 > [ 172.627475] lr : ethtool_get_drvinfo+0x8c/0x210 > [ 172.632020] sp : ffff80000c233a90 > [ 172.635338] x29: ffff80000c233a90 x28: ffff67ad21e45a00 x27: 0000000000000000 > [ 172.642498] x26: 0000000000000000 x25: 0000ffffd1102110 x24: 0000000000000000 > [ 172.649657] x23: 00020100001149a9 x22: 0000ffffd1102110 x21: 0000000000000000 > [ 172.656816] x20: 0000000000000000 x19: ffff67ad00bbe000 x18: 0000000000000000 > [ 172.663974] x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffffd1102110 > [ 172.671132] x14: ffffffffffffffff x13: 30322e344230342e x12: 33302e37564c4547 > [ 172.678290] x11: 0000000000000020 x10: 0101010101010101 x9 : ffffd837fcebe6fc > [ 172.685448] x8 : 0101010101010101 x7 : 6374656e655f6c73 x6 : 74656e655f6c7366 > [ 172.692606] x5 : ffff80000c233b01 x4 : ffffd837fdae0251 x3 : 0000000000000063 > [ 172.699764] x2 : ffffd837fd076da0 x1 : 0000000000000000 x0 : ffff67ad00bbe000 > [ 172.706923] Call trace: > [ 172.709371] dsa_master_get_sset_count+0x24/0xa4 > [ 172.714000] ethtool_get_drvinfo+0x8c/0x210 > [ 172.718193] dev_ethtool+0x780/0x2120 > [ 172.721863] dev_ioctl+0x1b0/0x580 > [ 172.725273] sock_do_ioctl+0xc0/0x100 > [ 172.728944] sock_ioctl+0x130/0x3c0 > [ 172.732440] __arm64_sys_ioctl+0xb4/0x100 > [ 172.736460] invoke_syscall+0x50/0x120 > [ 172.740219] el0_svc_common.constprop.0+0x4c/0xf4 > [ 172.744936] do_el0_svc+0x2c/0xa0 > [ 172.748257] el0_svc+0x20/0x60 > [ 172.751318] el0t_64_sync_handler+0xe8/0x114 > [ 172.755599] el0t_64_sync+0x180/0x184 > [ 172.759271] Code: a90153f3 2a0103f4 a9025bf5 f9418015 (f94012b6) > [ 172.765383] ---[ end trace 0000000000000002 ]--- > > Root Cause > ========== > Analysis of linux-next-6.13.0-rc3 reveals that the > dsa_conduit_get_sset_count() function accesses members of > a structure pointed to by cpu_dp without checking > if cpu_dp is a null pointer. This can lead to a kernel panic > if cpu_dp is NULL. > > static int dsa_conduit_get_sset_count(struct net_device *dev, > int sset) > { > struct dsa_port *cpu_dp = dev->dsa_ptr; > const struct ethtool_ops *ops = cpu_dp->orig_ethtool_ops; > struct dsa_switch *ds = cpu_dp->ds; > ... > } > > dev->dsa_ptr is set to NULL in both the dsa_switch_shutdown and > dsa_conduit_teardown functions. When the DSA module unloads, > dsa_conduit_ethtool_teardown(dev) restores the original copy of the DSA > device's ethtool_ops using "dev->ethtool_ops = cpu_dp->orig_ethtool_ops;" > before setting dev->dsa_ptr to NULL. This ensures that ethtool_ops > remains accessible after DSA unloading. However, dsa_switch_shutdown does > not restore the original copy of the DSA device's ethtool_ops, potentially > leading to a null pointer dereference of dsa_ptr and subsequently a system > panic. > > Solution > ======== > In the kernel's dsa_switch_shutdown function, before dp->conduit->dsa_ptr > is set to NULL, the dsa_conduit_ethtool_shutdown function is called to > restore the DSA master's ethtool_ops pointer to its original value. > This prevents the kernel from entering the DSA ethtool_ops flow even if > the user executes ethtool, thus avoiding the null pointer dereference issue > with dsa_ptr. > > Signed-off-by: Peilin He<he.peilin@zte.com.cn> > Co-developed-by: xu xin <xu.xin16@zte.com.cn> > Signed-off-by: xu xin <xu.xin16@zte.com.cn> > Signed-off-by: Kun Jiang <jiang.kun2@zte.com.cn> > Cc: Fan Yu <fan.yu9@zte.com.cn> > Cc: Yutan Qiu <qiu.yutan@zte.com.cn> > Cc: Yaxin Wang <wang.yaxin@zte.com.cn> > Cc: tuqiang <tu.qiang35@zte.com.cn> > Cc: Yang Yang <yang.yang29@zte.com.cn> > Cc: ye xingchen <ye.xingchen@zte.com.cn> > Cc: Yunkai Zhang <zhang.yunkai@zte.com.cn> > > --- Thank you for the patch. There are many process problems with it however. The most glaring one is that you are examining a crash from kernel 5.4 but patching linux-next, without having apparently also tested linux-next. It appears that you just made a static analysis which may result in incorrect conclusions. When submitting patches upstream you always have to test on the latest version and understand afterwards what is missing and needs to be backported in the particular stable version you are using. In particular here, dsa_switch_shutdown() now has this: dsa_switch_for_each_user_port(dp, ds) { conduit = dsa_port_to_conduit(dp); user_dev = dp->user; netif_device_detach(user_dev); ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ netdev_upper_dev_unlink(conduit, user_dev); } After netif_device_detach() is called, my expectation is that ethnl_ops_begin() sees that netif_device_present() is false, so it returns -ENODEV and does not proceed further to call into the device's ethtool ops. So that eliminates the premise for the crash. Secondly, linux-next is not a kernel tree that accepts patches, it is just for integration. For netdev, we have net.git for bug fixes and net-next.git for new features. You have to target your patch to net.git by using "[PATCH net v1]". If the problem does not exist in net.git but exists in stable kernels, you have to identify which patches are missing, adapt them if necessary, and then send them to stable@vger.kernel.org, with netdev and the other maintainers also CCed, and with a subject prefix along the lines of "[PATCH stable 5.4]". Generally, backporting patches manually to stable is rarely needed, so if that needs to happen, please use the space under the "---" marker (this is discarded when applying the patch in git) to explain to maintainers why (what conflicted, if it simply appears to have been missed, etc). There are other things to be aware of in Documentation/process/, I just summarized to you what I considered most relevant here.
diff --git a/net/dsa/dsa.c b/net/dsa/dsa.c index 5a7c0e565a89..5eee0c436848 100644 --- a/net/dsa/dsa.c +++ b/net/dsa/dsa.c @@ -1561,6 +1561,17 @@ void dsa_unregister_switch(struct dsa_switch *ds) } EXPORT_SYMBOL_GPL(dsa_unregister_switch); +static void dsa_conduit_ethtool_shutdown(struct net_device *dev) +{ + struct dsa_port *cpu_dp = dev->dsa_ptr; + + if (netif_is_lag_master(dev)) + return; + + dev->ethtool_ops = cpu_dp->orig_ethtool_ops; + cpu_dp->orig_ethtool_ops = NULL; +} + /* If the DSA conduit chooses to unregister its net_device on .shutdown, DSA is * blocking that operation from completion, due to the dev_hold taken inside * netdev_upper_dev_link. Unlink the DSA user interfaces from being uppers of @@ -1595,8 +1606,10 @@ void dsa_switch_shutdown(struct dsa_switch *ds) /* Disconnect from further netdevice notifiers on the conduit, * since netdev_uses_dsa() will now return false. */ - dsa_switch_for_each_cpu_port(dp, ds) + dsa_switch_for_each_cpu_port(dp, ds) { + dsa_conduit_ethtool_shutdown(dp->conduit); dp->conduit->dsa_ptr = NULL; + } rtnl_unlock(); out: