Message ID | 20250227083717.4307-2-liuhangbin@gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | bond: fix xfrm offload issues | expand |
On 2/27/25 10:37, Hangbin Liu wrote: > The fixed commit placed mutex_lock() inside spin_lock_bh(), which triggers > a warning: > > BUG: sleeping function called from invalid context at... > > Fix this by moving the IPsec deletion operation to bond_ipsec_free_sa, > which is not held by spin_lock_bh(). > > Additionally, delete the IPsec list in bond_ipsec_del_sa_all() when the > XFRM state is DEAD to prevent xdo_dev_state_free() from being triggered > again in bond_ipsec_free_sa(). > > For bond_ipsec_free_sa(), there are now three conditions: > > 1. if (!slave): When no active device exists. > 2. if (!xs->xso.real_dev): When xdo_dev_state_add() fails. > 3. if (xs->xso.real_dev != real_dev): When an xs has already been freed > by bond_ipsec_del_sa_all() due to migration, and the active slave has > changed to a new device. At the same time, the xs is marked as DEAD > due to the XFRM entry is removed, triggering xfrm_state_gc_task() and > bond_ipsec_free_sa(). > > In all three cases, xdo_dev_state_free() should not be called, only xs > should be removed from bond->ipsec list. > > Fixes: 2aeeef906d5a ("bonding: change ipsec_lock from spin lock to mutex") > Reported-by: Jakub Kicinski <kuba@kernel.org> > Closes: https://lore.kernel.org/netdev/20241212062734.182a0164@kernel.org > Suggested-by: Cosmin Ratiu <cratiu@nvidia.com> > Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> > --- > drivers/net/bonding/bond_main.c | 34 ++++++++++++++++++++++----------- > 1 file changed, 23 insertions(+), 11 deletions(-) > > diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c > index e45bba240cbc..683bf1221caf 100644 > --- a/drivers/net/bonding/bond_main.c > +++ b/drivers/net/bonding/bond_main.c > @@ -537,6 +537,10 @@ static void bond_ipsec_add_sa_all(struct bonding *bond) > } > > list_for_each_entry(ipsec, &bond->ipsec_list, list) { > + /* Skip dead xfrm states, they'll be freed later. */ > + if (ipsec->xs->km.state == XFRM_STATE_DEAD) > + continue; > + > /* If new state is added before ipsec_lock acquired */ > if (ipsec->xs->xso.real_dev == real_dev) > continue; > @@ -560,7 +564,6 @@ static void bond_ipsec_del_sa(struct xfrm_state *xs) > struct net_device *bond_dev = xs->xso.dev; > struct net_device *real_dev; > netdevice_tracker tracker; > - struct bond_ipsec *ipsec; > struct bonding *bond; > struct slave *slave; > > @@ -592,15 +595,6 @@ static void bond_ipsec_del_sa(struct xfrm_state *xs) > real_dev->xfrmdev_ops->xdo_dev_state_delete(xs); > out: > netdev_put(real_dev, &tracker); > - mutex_lock(&bond->ipsec_lock); > - list_for_each_entry(ipsec, &bond->ipsec_list, list) { > - if (ipsec->xs == xs) { > - list_del(&ipsec->list); > - kfree(ipsec); > - break; > - } > - } > - mutex_unlock(&bond->ipsec_lock); > } > > static void bond_ipsec_del_sa_all(struct bonding *bond) > @@ -617,6 +611,12 @@ static void bond_ipsec_del_sa_all(struct bonding *bond) > > mutex_lock(&bond->ipsec_lock); > list_for_each_entry(ipsec, &bond->ipsec_list, list) { > + if (ipsec->xs->km.state == XFRM_STATE_DEAD) { > + list_del(&ipsec->list); To be able to do this here, you'll have to use list_for_each_entry_safe(). > + kfree(ipsec); > + continue; > + } > + > if (!ipsec->xs->xso.real_dev) > continue; > > @@ -640,6 +640,7 @@ static void bond_ipsec_free_sa(struct xfrm_state *xs) > struct net_device *bond_dev = xs->xso.dev; > struct net_device *real_dev; > netdevice_tracker tracker; > + struct bond_ipsec *ipsec; > struct bonding *bond; > struct slave *slave; > > @@ -659,13 +660,24 @@ static void bond_ipsec_free_sa(struct xfrm_state *xs) > if (!xs->xso.real_dev) > goto out; > > - WARN_ON(xs->xso.real_dev != real_dev); > + if (xs->xso.real_dev != real_dev) > + goto out; > > if (real_dev && real_dev->xfrmdev_ops && > real_dev->xfrmdev_ops->xdo_dev_state_free) > real_dev->xfrmdev_ops->xdo_dev_state_free(xs); > out: > netdev_put(real_dev, &tracker); > + > + mutex_lock(&bond->ipsec_lock); > + list_for_each_entry(ipsec, &bond->ipsec_list, list) { > + if (ipsec->xs == xs) { > + list_del(&ipsec->list); > + kfree(ipsec); > + break; > + } > + } > + mutex_unlock(&bond->ipsec_lock); > } > > /**
On 2/27/25 10:50, Nikolay Aleksandrov wrote: > On 2/27/25 10:37, Hangbin Liu wrote: >> The fixed commit placed mutex_lock() inside spin_lock_bh(), which triggers >> a warning: >> >> BUG: sleeping function called from invalid context at... >> >> Fix this by moving the IPsec deletion operation to bond_ipsec_free_sa, >> which is not held by spin_lock_bh(). >> >> Additionally, delete the IPsec list in bond_ipsec_del_sa_all() when the >> XFRM state is DEAD to prevent xdo_dev_state_free() from being triggered >> again in bond_ipsec_free_sa(). >> >> For bond_ipsec_free_sa(), there are now three conditions: >> >> 1. if (!slave): When no active device exists. >> 2. if (!xs->xso.real_dev): When xdo_dev_state_add() fails. >> 3. if (xs->xso.real_dev != real_dev): When an xs has already been freed >> by bond_ipsec_del_sa_all() due to migration, and the active slave has >> changed to a new device. At the same time, the xs is marked as DEAD >> due to the XFRM entry is removed, triggering xfrm_state_gc_task() and >> bond_ipsec_free_sa(). >> >> In all three cases, xdo_dev_state_free() should not be called, only xs >> should be removed from bond->ipsec list. >> >> Fixes: 2aeeef906d5a ("bonding: change ipsec_lock from spin lock to mutex") >> Reported-by: Jakub Kicinski <kuba@kernel.org> >> Closes: https://lore.kernel.org/netdev/20241212062734.182a0164@kernel.org >> Suggested-by: Cosmin Ratiu <cratiu@nvidia.com> >> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> >> --- >> drivers/net/bonding/bond_main.c | 34 ++++++++++++++++++++++----------- >> 1 file changed, 23 insertions(+), 11 deletions(-) >> >> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >> index e45bba240cbc..683bf1221caf 100644 >> --- a/drivers/net/bonding/bond_main.c >> +++ b/drivers/net/bonding/bond_main.c >> @@ -537,6 +537,10 @@ static void bond_ipsec_add_sa_all(struct bonding *bond) >> } >> >> list_for_each_entry(ipsec, &bond->ipsec_list, list) { >> + /* Skip dead xfrm states, they'll be freed later. */ >> + if (ipsec->xs->km.state == XFRM_STATE_DEAD) >> + continue; >> + >> /* If new state is added before ipsec_lock acquired */ >> if (ipsec->xs->xso.real_dev == real_dev) >> continue; >> @@ -560,7 +564,6 @@ static void bond_ipsec_del_sa(struct xfrm_state *xs) >> struct net_device *bond_dev = xs->xso.dev; >> struct net_device *real_dev; >> netdevice_tracker tracker; >> - struct bond_ipsec *ipsec; >> struct bonding *bond; >> struct slave *slave; >> >> @@ -592,15 +595,6 @@ static void bond_ipsec_del_sa(struct xfrm_state *xs) >> real_dev->xfrmdev_ops->xdo_dev_state_delete(xs); >> out: >> netdev_put(real_dev, &tracker); >> - mutex_lock(&bond->ipsec_lock); >> - list_for_each_entry(ipsec, &bond->ipsec_list, list) { >> - if (ipsec->xs == xs) { >> - list_del(&ipsec->list); >> - kfree(ipsec); >> - break; >> - } >> - } >> - mutex_unlock(&bond->ipsec_lock); >> } >> >> static void bond_ipsec_del_sa_all(struct bonding *bond) >> @@ -617,6 +611,12 @@ static void bond_ipsec_del_sa_all(struct bonding *bond) >> >> mutex_lock(&bond->ipsec_lock); >> list_for_each_entry(ipsec, &bond->ipsec_list, list) { >> + if (ipsec->xs->km.state == XFRM_STATE_DEAD) { >> + list_del(&ipsec->list); > > To be able to do this here, you'll have to use list_for_each_entry_safe(). > One more thing - note I'm not an xfrm expert by far but it seems to me here you have to also call xdo_dev_state_free() with the old active slave dev otherwise that will never get called with the original real_dev after the switch to a new active slave (or more accurately it might if the GC runs between the switching but it is a race), care must be taken wrt sequence of events because the XFRM GC may be running in parallel which probably means that in bond_ipsec_free_sa() you'll have to take the mutex before calling xdo_dev_state_free() and check if the entry is still linked in the bond's ipsec list before calling the free_sa callback, if it isn't then del_sa_all got to it before the GC and there's nothing to do if it also called the dev's free_sa callback. The check for real_dev doesn't seem enough to protect against this race. Cheers, Nik >> + kfree(ipsec); >> + continue; >> + } >> + >> if (!ipsec->xs->xso.real_dev) >> continue; >> >> @@ -640,6 +640,7 @@ static void bond_ipsec_free_sa(struct xfrm_state *xs) >> struct net_device *bond_dev = xs->xso.dev; >> struct net_device *real_dev; >> netdevice_tracker tracker; >> + struct bond_ipsec *ipsec; >> struct bonding *bond; >> struct slave *slave; >> >> @@ -659,13 +660,24 @@ static void bond_ipsec_free_sa(struct xfrm_state *xs) >> if (!xs->xso.real_dev) >> goto out; >> >> - WARN_ON(xs->xso.real_dev != real_dev); >> + if (xs->xso.real_dev != real_dev) >> + goto out; >> >> if (real_dev && real_dev->xfrmdev_ops && >> real_dev->xfrmdev_ops->xdo_dev_state_free) >> real_dev->xfrmdev_ops->xdo_dev_state_free(xs); >> out: >> netdev_put(real_dev, &tracker); >> + >> + mutex_lock(&bond->ipsec_lock); >> + list_for_each_entry(ipsec, &bond->ipsec_list, list) { >> + if (ipsec->xs == xs) { >> + list_del(&ipsec->list); >> + kfree(ipsec); >> + break; >> + } >> + } >> + mutex_unlock(&bond->ipsec_lock); >> } >> >> /** >
On Thu, Feb 27, 2025 at 11:21:51AM +0200, Nikolay Aleksandrov wrote: > >> @@ -617,6 +611,12 @@ static void bond_ipsec_del_sa_all(struct bonding *bond) > >> > >> mutex_lock(&bond->ipsec_lock); > >> list_for_each_entry(ipsec, &bond->ipsec_list, list) { > >> + if (ipsec->xs->km.state == XFRM_STATE_DEAD) { > >> + list_del(&ipsec->list); > > > > To be able to do this here, you'll have to use list_for_each_entry_safe(). > > > > One more thing - note I'm not an xfrm expert by far but it seems to me here you have > to also call xdo_dev_state_free() with the old active slave dev otherwise that will > never get called with the original real_dev after the switch to a new > active slave (or more accurately it might if the GC runs between the switching > but it is a race), care must be taken wrt sequence of events because the XFRM Can we just call xs->xso.real_dev->xfrmdev_ops->xdo_dev_state_free(xs) no matter xs->xso.real_dev == real_dev or not? I'm afraid calling xdo_dev_state_free() every where may make us lot more easily. > GC may be running in parallel which probably means that in bond_ipsec_free_sa() > you'll have to take the mutex before calling xdo_dev_state_free() and check > if the entry is still linked in the bond's ipsec list before calling the free_sa > callback, if it isn't then del_sa_all got to it before the GC and there's nothing > to do if it also called the dev's free_sa callback. The check for real_dev doesn't > seem enough to protect against this race. I agree that we need to take the mutex before calling xdo_dev_state_free() in bond_ipsec_free_sa(). Do you think if this is enough? I'm a bit lot here. Thanks Hangbin
On 2/27/25 15:21, Hangbin Liu wrote: > On Thu, Feb 27, 2025 at 11:21:51AM +0200, Nikolay Aleksandrov wrote: >>>> @@ -617,6 +611,12 @@ static void bond_ipsec_del_sa_all(struct bonding *bond) >>>> >>>> mutex_lock(&bond->ipsec_lock); >>>> list_for_each_entry(ipsec, &bond->ipsec_list, list) { >>>> + if (ipsec->xs->km.state == XFRM_STATE_DEAD) { >>>> + list_del(&ipsec->list); >>> >>> To be able to do this here, you'll have to use list_for_each_entry_safe(). >>> >> >> One more thing - note I'm not an xfrm expert by far but it seems to me here you have >> to also call xdo_dev_state_free() with the old active slave dev otherwise that will >> never get called with the original real_dev after the switch to a new >> active slave (or more accurately it might if the GC runs between the switching >> but it is a race), care must be taken wrt sequence of events because the XFRM > > Can we just call xs->xso.real_dev->xfrmdev_ops->xdo_dev_state_free(xs) > no matter xs->xso.real_dev == real_dev or not? I'm afraid calling > xdo_dev_state_free() every where may make us lot more easily. > You'd have to check all drivers that implement the callback to answer that and even then I'd stick to the canonical way of how it's done in xfrm and make the bond just passthrough. Any other games become dangerous and new code will have to be carefully reviewed every time, calling another device's free_sa when it wasn't added before doesn't sound good. >> GC may be running in parallel which probably means that in bond_ipsec_free_sa() >> you'll have to take the mutex before calling xdo_dev_state_free() and check >> if the entry is still linked in the bond's ipsec list before calling the free_sa >> callback, if it isn't then del_sa_all got to it before the GC and there's nothing >> to do if it also called the dev's free_sa callback. The check for real_dev doesn't >> seem enough to protect against this race. > > I agree that we need to take the mutex before calling xdo_dev_state_free() > in bond_ipsec_free_sa(). Do you think if this is enough? I'm a bit lot here. > > Thanks > Hangbin Well, the race is between the xfrm GC and del_sa_all, in bond's free_sa if you walk the list under the mutex before calling real_dev's free callback and don't find the current element that's being freed in free_sa then it was cleaned up by del_sa_all, otherwise del_sa_all is waiting to walk that list and clean the entries. I think it should be fine as long as free_sa was called once with the proper device.
On Thu, Feb 27, 2025 at 03:31:01PM +0200, Nikolay Aleksandrov wrote: > >> One more thing - note I'm not an xfrm expert by far but it seems to me here you have > >> to also call xdo_dev_state_free() with the old active slave dev otherwise that will > >> never get called with the original real_dev after the switch to a new > >> active slave (or more accurately it might if the GC runs between the switching > >> but it is a race), care must be taken wrt sequence of events because the XFRM > > > > Can we just call xs->xso.real_dev->xfrmdev_ops->xdo_dev_state_free(xs) > > no matter xs->xso.real_dev == real_dev or not? I'm afraid calling > > xdo_dev_state_free() every where may make us lot more easily. > > > > You'd have to check all drivers that implement the callback to answer that and even then > I'd stick to the canonical way of how it's done in xfrm and make the bond just passthrough. > Any other games become dangerous and new code will have to be carefully reviewed every > time, calling another device's free_sa when it wasn't added before doesn't sound good. > > >> GC may be running in parallel which probably means that in bond_ipsec_free_sa() > >> you'll have to take the mutex before calling xdo_dev_state_free() and check > >> if the entry is still linked in the bond's ipsec list before calling the free_sa > >> callback, if it isn't then del_sa_all got to it before the GC and there's nothing > >> to do if it also called the dev's free_sa callback. The check for real_dev doesn't > >> seem enough to protect against this race. > > > > I agree that we need to take the mutex before calling xdo_dev_state_free() > > in bond_ipsec_free_sa(). Do you think if this is enough? I'm a bit lot here. > > > > Thanks > > Hangbin > > Well, the race is between the xfrm GC and del_sa_all, in bond's free_sa if you > walk the list under the mutex before calling real_dev's free callback and > don't find the current element that's being freed in free_sa then it was > cleaned up by del_sa_all, otherwise del_sa_all is waiting to walk that > list and clean the entries. I think it should be fine as long as free_sa > was called once with the proper device. OK, so the free will be called either in del_sa_all() or free_sa(). Something like this? static void bond_ipsec_del_sa_all(struct bonding *bond) @@ -620,6 +614,16 @@ static void bond_ipsec_del_sa_all(struct bonding *bond) if (!ipsec->xs->xso.real_dev) continue; + if (ipsec->xs->km.state == XFRM_STATE_DEAD) { + /* already dead no need to delete again */ + if (real_dev->xfrmdev_ops->xdo_dev_state_free) + real_dev->xfrmdev_ops->xdo_dev_state_free(ipsec->xs); + list_del(&ipsec->list); + kfree(ipsec); + continue; + } + if (!real_dev->xfrmdev_ops || !real_dev->xfrmdev_ops->xdo_dev_state_delete || netif_is_bond_master(real_dev)) { @@ -659,11 +664,22 @@ static void bond_ipsec_free_sa(struct xfrm_state *xs) if (!xs->xso.real_dev) goto out; - WARN_ON(xs->xso.real_dev != real_dev); + mutex_lock(&bond->ipsec_lock); + list_for_each_entry(ipsec, &bond->ipsec_list, list) { + if (ipsec->xs == xs) { + if (real_dev && xs->xso.real_dev == real_dev && ^^ looks we don't need this xs->xso.real_dev == real_dev checking if there is no race, do we? Or just keep the WARN_ON() in case of any race. + real_dev->xfrmdev_ops && + real_dev->xfrmdev_ops->xdo_dev_state_free) + real_dev->xfrmdev_ops->xdo_dev_state_free(xs); + list_del(&ipsec->list); + kfree(ipsec); + break; + } + } + mutex_unlock(&bond->ipsec_lock); - if (real_dev && real_dev->xfrmdev_ops && - real_dev->xfrmdev_ops->xdo_dev_state_free) - real_dev->xfrmdev_ops->xdo_dev_state_free(xs); out: netdev_put(real_dev, &tracker); }
On Fri, 2025-02-28 at 02:20 +0000, Hangbin Liu wrote: > On Thu, Feb 27, 2025 at 03:31:01PM +0200, Nikolay Aleksandrov wrote: > > > > One more thing - note I'm not an xfrm expert by far but it > > > > seems to me here you have > > > > to also call xdo_dev_state_free() with the old active slave > > > > dev otherwise that will > > > > never get called with the original real_dev after the switch to > > > > a new > > > > active slave (or more accurately it might if the GC runs > > > > between the switching > > > > but it is a race), care must be taken wrt sequence of events > > > > because the XFRM > > > > > > Can we just call xs->xso.real_dev->xfrmdev_ops- > > > >xdo_dev_state_free(xs) > > > no matter xs->xso.real_dev == real_dev or not? I'm afraid calling > > > xdo_dev_state_free() every where may make us lot more easily. > > > > > > > You'd have to check all drivers that implement the callback to > > answer that and even then > > I'd stick to the canonical way of how it's done in xfrm and make > > the bond just passthrough. > > Any other games become dangerous and new code will have to be > > carefully reviewed every > > time, calling another device's free_sa when it wasn't added before > > doesn't sound good. > > > > > > GC may be running in parallel which probably means that in > > > > bond_ipsec_free_sa() > > > > you'll have to take the mutex before calling > > > > xdo_dev_state_free() and check > > > > if the entry is still linked in the bond's ipsec list before > > > > calling the free_sa > > > > callback, if it isn't then del_sa_all got to it before the GC > > > > and there's nothing > > > > to do if it also called the dev's free_sa callback. The check > > > > for real_dev doesn't > > > > seem enough to protect against this race. > > > > > > I agree that we need to take the mutex before calling > > > xdo_dev_state_free() > > > in bond_ipsec_free_sa(). Do you think if this is enough? I'm a > > > bit lot here. > > > > > > Thanks > > > Hangbin > > > > Well, the race is between the xfrm GC and del_sa_all, in bond's > > free_sa if you > > walk the list under the mutex before calling real_dev's free > > callback and > > don't find the current element that's being freed in free_sa then > > it was > > cleaned up by del_sa_all, otherwise del_sa_all is waiting to walk > > that > > list and clean the entries. I think it should be fine as long as > > free_sa > > was called once with the proper device. > > OK, so the free will be called either in del_sa_all() or free_sa(). > Something like this? > [...] Unfortunately, after applying these changes and reasoning about them for a bit, I don't think this will work. There are still races left. For example: 1. An xs is marked DEAD (in __xfrm_state_delete, with x->lock held) and before .xdo_dev_state_delete() is called on it, bond_ipsec_del_sa_all is called in parallel, doesn't call delete on xs (because it's dead), then calls free (incorrect without delete first), then removes the list entry. Later, xdo_dev_state_delete( == bond_ipsec_del_sa) is called, and calls delete (incorrect, out of order with free). Finally, bond_ipsec_free_sa is called, which fortunately doesn't do anything silly in the new proposed form because xs is no longer in the list. 2. A more sinister form of the above race can happen when bond_ipsec_del_sa_all() calls delete on real_dev, then in parallel and immediately after __xfrm_state_delete marks xs as DEAD and calls bond_ipsec_del_sa() which happily calls delete on real_dev again. In order to fix these races (and others like it), I think bond_ipsec_del_sa_all and bond_ipsec_add_sa_all *need* to acquire x- >lock for each xs being processed. This would prevent xfrm from concurrently initiating add/delete operations on the managed states. Cosmin.
On 2/28/25 12:31, Cosmin Ratiu wrote: > On Fri, 2025-02-28 at 02:20 +0000, Hangbin Liu wrote: >> On Thu, Feb 27, 2025 at 03:31:01PM +0200, Nikolay Aleksandrov wrote: >>>>> One more thing - note I'm not an xfrm expert by far but it >>>>> seems to me here you have >>>>> to also call xdo_dev_state_free() with the old active slave >>>>> dev otherwise that will >>>>> never get called with the original real_dev after the switch to >>>>> a new >>>>> active slave (or more accurately it might if the GC runs >>>>> between the switching >>>>> but it is a race), care must be taken wrt sequence of events >>>>> because the XFRM >>>> >>>> Can we just call xs->xso.real_dev->xfrmdev_ops- >>>>> xdo_dev_state_free(xs) >>>> no matter xs->xso.real_dev == real_dev or not? I'm afraid calling >>>> xdo_dev_state_free() every where may make us lot more easily. >>>> >>> >>> You'd have to check all drivers that implement the callback to >>> answer that and even then >>> I'd stick to the canonical way of how it's done in xfrm and make >>> the bond just passthrough. >>> Any other games become dangerous and new code will have to be >>> carefully reviewed every >>> time, calling another device's free_sa when it wasn't added before >>> doesn't sound good. >>> >>>>> GC may be running in parallel which probably means that in >>>>> bond_ipsec_free_sa() >>>>> you'll have to take the mutex before calling >>>>> xdo_dev_state_free() and check >>>>> if the entry is still linked in the bond's ipsec list before >>>>> calling the free_sa >>>>> callback, if it isn't then del_sa_all got to it before the GC >>>>> and there's nothing >>>>> to do if it also called the dev's free_sa callback. The check >>>>> for real_dev doesn't >>>>> seem enough to protect against this race. >>>> >>>> I agree that we need to take the mutex before calling >>>> xdo_dev_state_free() >>>> in bond_ipsec_free_sa(). Do you think if this is enough? I'm a >>>> bit lot here. >>>> >>>> Thanks >>>> Hangbin >>> >>> Well, the race is between the xfrm GC and del_sa_all, in bond's >>> free_sa if you >>> walk the list under the mutex before calling real_dev's free >>> callback and >>> don't find the current element that's being freed in free_sa then >>> it was >>> cleaned up by del_sa_all, otherwise del_sa_all is waiting to walk >>> that >>> list and clean the entries. I think it should be fine as long as >>> free_sa >>> was called once with the proper device. >> >> OK, so the free will be called either in del_sa_all() or free_sa(). >> Something like this? >> > [...] > > Unfortunately, after applying these changes and reasoning about them > for a bit, I don't think this will work. There are still races left. > For example: > 1. An xs is marked DEAD (in __xfrm_state_delete, with x->lock held) and > before .xdo_dev_state_delete() is called on it, bond_ipsec_del_sa_all > is called in parallel, doesn't call delete on xs (because it's dead), > then calls free (incorrect without delete first), then removes the list > entry. Later, xdo_dev_state_delete( == bond_ipsec_del_sa) is called, > and calls delete (incorrect, out of order with free). Finally, > bond_ipsec_free_sa is called, which fortunately doesn't do anything > silly in the new proposed form because xs is no longer in the list. > > 2. A more sinister form of the above race can happen when > bond_ipsec_del_sa_all() calls delete on real_dev, then in parallel and > immediately after __xfrm_state_delete marks xs as DEAD and calls > bond_ipsec_del_sa() which happily calls delete on real_dev again. > > In order to fix these races (and others like it), I think > bond_ipsec_del_sa_all and bond_ipsec_add_sa_all *need* to acquire x- >> lock for each xs being processed. This would prevent xfrm from > concurrently initiating add/delete operations on the managed states. > > Cosmin. Duh, right you are. The state is protected by x->lock and cannot be trusted outside of it. If you take x->lock inside the list walk with the mutex held you can deadlock. Cheers, Nik
On 2/28/25 13:07, Nikolay Aleksandrov wrote: > On 2/28/25 12:31, Cosmin Ratiu wrote: >> On Fri, 2025-02-28 at 02:20 +0000, Hangbin Liu wrote: >>> On Thu, Feb 27, 2025 at 03:31:01PM +0200, Nikolay Aleksandrov wrote: >>>>>> One more thing - note I'm not an xfrm expert by far but it >>>>>> seems to me here you have >>>>>> to also call xdo_dev_state_free() with the old active slave >>>>>> dev otherwise that will >>>>>> never get called with the original real_dev after the switch to >>>>>> a new >>>>>> active slave (or more accurately it might if the GC runs >>>>>> between the switching >>>>>> but it is a race), care must be taken wrt sequence of events >>>>>> because the XFRM >>>>> >>>>> Can we just call xs->xso.real_dev->xfrmdev_ops- >>>>>> xdo_dev_state_free(xs) >>>>> no matter xs->xso.real_dev == real_dev or not? I'm afraid calling >>>>> xdo_dev_state_free() every where may make us lot more easily. >>>>> >>>> >>>> You'd have to check all drivers that implement the callback to >>>> answer that and even then >>>> I'd stick to the canonical way of how it's done in xfrm and make >>>> the bond just passthrough. >>>> Any other games become dangerous and new code will have to be >>>> carefully reviewed every >>>> time, calling another device's free_sa when it wasn't added before >>>> doesn't sound good. >>>> >>>>>> GC may be running in parallel which probably means that in >>>>>> bond_ipsec_free_sa() >>>>>> you'll have to take the mutex before calling >>>>>> xdo_dev_state_free() and check >>>>>> if the entry is still linked in the bond's ipsec list before >>>>>> calling the free_sa >>>>>> callback, if it isn't then del_sa_all got to it before the GC >>>>>> and there's nothing >>>>>> to do if it also called the dev's free_sa callback. The check >>>>>> for real_dev doesn't >>>>>> seem enough to protect against this race. >>>>> >>>>> I agree that we need to take the mutex before calling >>>>> xdo_dev_state_free() >>>>> in bond_ipsec_free_sa(). Do you think if this is enough? I'm a >>>>> bit lot here. >>>>> >>>>> Thanks >>>>> Hangbin >>>> >>>> Well, the race is between the xfrm GC and del_sa_all, in bond's >>>> free_sa if you >>>> walk the list under the mutex before calling real_dev's free >>>> callback and >>>> don't find the current element that's being freed in free_sa then >>>> it was >>>> cleaned up by del_sa_all, otherwise del_sa_all is waiting to walk >>>> that >>>> list and clean the entries. I think it should be fine as long as >>>> free_sa >>>> was called once with the proper device. >>> >>> OK, so the free will be called either in del_sa_all() or free_sa(). >>> Something like this? >>> >> [...] >> >> Unfortunately, after applying these changes and reasoning about them >> for a bit, I don't think this will work. There are still races left. >> For example: >> 1. An xs is marked DEAD (in __xfrm_state_delete, with x->lock held) and >> before .xdo_dev_state_delete() is called on it, bond_ipsec_del_sa_all >> is called in parallel, doesn't call delete on xs (because it's dead), >> then calls free (incorrect without delete first), then removes the list >> entry. Later, xdo_dev_state_delete( == bond_ipsec_del_sa) is called, >> and calls delete (incorrect, out of order with free). Finally, >> bond_ipsec_free_sa is called, which fortunately doesn't do anything >> silly in the new proposed form because xs is no longer in the list. >> >> 2. A more sinister form of the above race can happen when >> bond_ipsec_del_sa_all() calls delete on real_dev, then in parallel and >> immediately after __xfrm_state_delete marks xs as DEAD and calls >> bond_ipsec_del_sa() which happily calls delete on real_dev again. >> >> In order to fix these races (and others like it), I think >> bond_ipsec_del_sa_all and bond_ipsec_add_sa_all *need* to acquire x- >>> lock for each xs being processed. This would prevent xfrm from >> concurrently initiating add/delete operations on the managed states. >> >> Cosmin. > > Duh, right you are. The state is protected by x->lock and cannot be trusted > outside of it. If you take x->lock inside the list walk with the mutex held > you can deadlock. > > Cheers, > Nik > Correction - actually took a closer look at the xfrm code and it should be fine. The x->lock is taken only in the delete path and if the mutex is not acquired by bond's del_sa callback it should be ok. Though this must be very well documented.
Hi Cosmin, On Fri, Feb 28, 2025 at 10:31:58AM +0000, Cosmin Ratiu wrote: > On Fri, 2025-02-28 at 02:20 +0000, Hangbin Liu wrote: > > On Thu, Feb 27, 2025 at 03:31:01PM +0200, Nikolay Aleksandrov wrote: > > > > > One more thing - note I'm not an xfrm expert by far but it > > > > > seems to me here you have > > > > > to also call xdo_dev_state_free() with the old active slave > > > > > dev otherwise that will > > > > > never get called with the original real_dev after the switch to > > > > > a new > > > > > active slave (or more accurately it might if the GC runs > > > > > between the switching > > > > > but it is a race), care must be taken wrt sequence of events > > > > > because the XFRM > > > > > > > > Can we just call xs->xso.real_dev->xfrmdev_ops- > > > > >xdo_dev_state_free(xs) > > > > no matter xs->xso.real_dev == real_dev or not? I'm afraid calling > > > > xdo_dev_state_free() every where may make us lot more easily. > > > > > > > > > > You'd have to check all drivers that implement the callback to > > > answer that and even then > > > I'd stick to the canonical way of how it's done in xfrm and make > > > the bond just passthrough. > > > Any other games become dangerous and new code will have to be > > > carefully reviewed every > > > time, calling another device's free_sa when it wasn't added before > > > doesn't sound good. > > > > > > > > GC may be running in parallel which probably means that in > > > > > bond_ipsec_free_sa() > > > > > you'll have to take the mutex before calling > > > > > xdo_dev_state_free() and check > > > > > if the entry is still linked in the bond's ipsec list before > > > > > calling the free_sa > > > > > callback, if it isn't then del_sa_all got to it before the GC > > > > > and there's nothing > > > > > to do if it also called the dev's free_sa callback. The check > > > > > for real_dev doesn't > > > > > seem enough to protect against this race. > > > > > > > > I agree that we need to take the mutex before calling > > > > xdo_dev_state_free() > > > > in bond_ipsec_free_sa(). Do you think if this is enough? I'm a > > > > bit lot here. > > > > > > > > Thanks > > > > Hangbin > > > > > > Well, the race is between the xfrm GC and del_sa_all, in bond's > > > free_sa if you > > > walk the list under the mutex before calling real_dev's free > > > callback and > > > don't find the current element that's being freed in free_sa then > > > it was > > > cleaned up by del_sa_all, otherwise del_sa_all is waiting to walk > > > that > > > list and clean the entries. I think it should be fine as long as > > > free_sa > > > was called once with the proper device. > > > > OK, so the free will be called either in del_sa_all() or free_sa(). > > Something like this? > > > [...] > > Unfortunately, after applying these changes and reasoning about them > for a bit, I don't think this will work. There are still races left. > For example: > 1. An xs is marked DEAD (in __xfrm_state_delete, with x->lock held) and > before .xdo_dev_state_delete() is called on it, bond_ipsec_del_sa_all > is called in parallel, doesn't call delete on xs (because it's dead), > then calls free (incorrect without delete first), then removes the list > entry. Later, xdo_dev_state_delete( == bond_ipsec_del_sa) is called, > and calls delete (incorrect, out of order with free). Finally, > bond_ipsec_free_sa is called, which fortunately doesn't do anything > silly in the new proposed form because xs is no longer in the list. > > 2. A more sinister form of the above race can happen when > bond_ipsec_del_sa_all() calls delete on real_dev, then in parallel and > immediately after __xfrm_state_delete marks xs as DEAD and calls > bond_ipsec_del_sa() which happily calls delete on real_dev again. > > In order to fix these races (and others like it), I think > bond_ipsec_del_sa_all and bond_ipsec_add_sa_all *need* to acquire x- > >lock for each xs being processed. This would prevent xfrm from > concurrently initiating add/delete operations on the managed states. > Thanks a lot for the careful checking. I will add the x->lock in del/add_sa_all. Regards Hangbin
Hi Cosmin, On Fri, Feb 28, 2025 at 10:31:58AM +0000, Cosmin Ratiu wrote: > On Fri, 2025-02-28 at 02:20 +0000, Hangbin Liu wrote: > > On Thu, Feb 27, 2025 at 03:31:01PM +0200, Nikolay Aleksandrov wrote: > > > > > One more thing - note I'm not an xfrm expert by far but it > > > > > seems to me here you have > > > > > to also call xdo_dev_state_free() with the old active slave > > > > > dev otherwise that will > > > > > never get called with the original real_dev after the switch to > > > > > a new > > > > > active slave (or more accurately it might if the GC runs > > > > > between the switching > > > > > but it is a race), care must be taken wrt sequence of events > > > > > because the XFRM > > > > > > > > Can we just call xs->xso.real_dev->xfrmdev_ops- > > > > >xdo_dev_state_free(xs) > > > > no matter xs->xso.real_dev == real_dev or not? I'm afraid calling > > > > xdo_dev_state_free() every where may make us lot more easily. > > > > > > > > > > You'd have to check all drivers that implement the callback to > > > answer that and even then > > > I'd stick to the canonical way of how it's done in xfrm and make > > > the bond just passthrough. > > > Any other games become dangerous and new code will have to be > > > carefully reviewed every > > > time, calling another device's free_sa when it wasn't added before > > > doesn't sound good. > > > > > > > > GC may be running in parallel which probably means that in > > > > > bond_ipsec_free_sa() > > > > > you'll have to take the mutex before calling > > > > > xdo_dev_state_free() and check > > > > > if the entry is still linked in the bond's ipsec list before > > > > > calling the free_sa > > > > > callback, if it isn't then del_sa_all got to it before the GC > > > > > and there's nothing > > > > > to do if it also called the dev's free_sa callback. The check > > > > > for real_dev doesn't > > > > > seem enough to protect against this race. > > > > > > > > I agree that we need to take the mutex before calling > > > > xdo_dev_state_free() > > > > in bond_ipsec_free_sa(). Do you think if this is enough? I'm a > > > > bit lot here. > > > > > > > > Thanks > > > > Hangbin > > > > > > Well, the race is between the xfrm GC and del_sa_all, in bond's > > > free_sa if you > > > walk the list under the mutex before calling real_dev's free > > > callback and > > > don't find the current element that's being freed in free_sa then > > > it was > > > cleaned up by del_sa_all, otherwise del_sa_all is waiting to walk > > > that > > > list and clean the entries. I think it should be fine as long as > > > free_sa > > > was called once with the proper device. > > > > OK, so the free will be called either in del_sa_all() or free_sa(). > > Something like this? > > > [...] > > Unfortunately, after applying these changes and reasoning about them > for a bit, I don't think this will work. There are still races left. > For example: > 1. An xs is marked DEAD (in __xfrm_state_delete, with x->lock held) and > before .xdo_dev_state_delete() is called on it, bond_ipsec_del_sa_all > is called in parallel, doesn't call delete on xs (because it's dead), > then calls free (incorrect without delete first), then removes the list > entry. Later, xdo_dev_state_delete( == bond_ipsec_del_sa) is called, > and calls delete (incorrect, out of order with free). Finally, > bond_ipsec_free_sa is called, which fortunately doesn't do anything > silly in the new proposed form because xs is no longer in the list. > > 2. A more sinister form of the above race can happen when > bond_ipsec_del_sa_all() calls delete on real_dev, then in parallel and > immediately after __xfrm_state_delete marks xs as DEAD and calls > bond_ipsec_del_sa() which happily calls delete on real_dev again. > > In order to fix these races (and others like it), I think > bond_ipsec_del_sa_all and bond_ipsec_add_sa_all *need* to acquire x- > >lock for each xs being processed. This would prevent xfrm from > concurrently initiating add/delete operations on the managed states. > Just to make sure I added the lock in correct place, would you please help confirm. diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index e85878b12376..c59ad3a5cf43 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -537,19 +537,25 @@ static void bond_ipsec_add_sa_all(struct bonding *bond) } list_for_each_entry(ipsec, &bond->ipsec_list, list) { + spin_lock_bh(&ipsec->xs->lock); /* Skip dead xfrm states, they'll be freed later. */ - if (ipsec->xs->km.state == XFRM_STATE_DEAD) + if (ipsec->xs->km.state == XFRM_STATE_DEAD) { + spin_unlock_bh(&ipsec->xs->lock); continue; + } /* If new state is added before ipsec_lock acquired */ - if (ipsec->xs->xso.real_dev == real_dev) + if (ipsec->xs->xso.real_dev == real_dev) { + spin_unlock_bh(&ipsec->xs->lock); continue; + } ipsec->xs->xso.real_dev = real_dev; if (real_dev->xfrmdev_ops->xdo_dev_state_add(ipsec->xs, NULL)) { slave_warn(bond_dev, real_dev, "%s: failed to add SA\n", __func__); ipsec->xs->xso.real_dev = NULL; } + spin_unlock_bh(&ipsec->xs->lock); } out: mutex_unlock(&bond->ipsec_lock); @@ -614,6 +620,7 @@ static void bond_ipsec_del_sa_all(struct bonding *bond) if (!ipsec->xs->xso.real_dev) continue; + spin_lock_bh(&ipsec->xs->lock); if (ipsec->xs->km.state == XFRM_STATE_DEAD) { /* already dead no need to delete again */ if (ipsec->xs->xso.real_dev == real_dev && @@ -621,6 +628,7 @@ static void bond_ipsec_del_sa_all(struct bonding *bond) real_dev->xfrmdev_ops->xdo_dev_state_free(ipsec->xs); list_del(&ipsec->list); kfree(ipsec); + spin_unlock_bh(&ipsec->xs->lock); continue; } @@ -635,6 +643,7 @@ static void bond_ipsec_del_sa_all(struct bonding *bond) if (real_dev->xfrmdev_ops->xdo_dev_state_free) real_dev->xfrmdev_ops->xdo_dev_state_free(ipsec->xs); } + spin_unlock_bh(&ipsec->xs->lock); } mutex_unlock(&bond->ipsec_lock); } Thanks Hangbin
On Tue, 2025-03-04 at 09:18 +0000, Hangbin Liu wrote: > > Just to make sure I added the lock in correct place, would you please > help > confirm. > > diff --git a/drivers/net/bonding/bond_main.c > b/drivers/net/bonding/bond_main.c > index e85878b12376..c59ad3a5cf43 100644 > --- a/drivers/net/bonding/bond_main.c > +++ b/drivers/net/bonding/bond_main.c > @@ -537,19 +537,25 @@ static void bond_ipsec_add_sa_all(struct > bonding *bond) > } > > list_for_each_entry(ipsec, &bond->ipsec_list, list) { > + spin_lock_bh(&ipsec->xs->lock); > /* Skip dead xfrm states, they'll be freed later. */ > - if (ipsec->xs->km.state == XFRM_STATE_DEAD) > + if (ipsec->xs->km.state == XFRM_STATE_DEAD) { > + spin_unlock_bh(&ipsec->xs->lock); Instead of unlocking on every branch, I recommend adding a "next:" tag before the unlock at the end of the loop and switching the "continue" statements with "goto next". > continue; > + } > > /* If new state is added before ipsec_lock acquired > */ > - if (ipsec->xs->xso.real_dev == real_dev) > + if (ipsec->xs->xso.real_dev == real_dev) { > + spin_unlock_bh(&ipsec->xs->lock); > continue; > + } > > ipsec->xs->xso.real_dev = real_dev; > if (real_dev->xfrmdev_ops->xdo_dev_state_add(ipsec- > >xs, NULL)) { > slave_warn(bond_dev, real_dev, "%s: failed > to add SA\n", __func__); > ipsec->xs->xso.real_dev = NULL; > } Add the "next:" tag here. > + spin_unlock_bh(&ipsec->xs->lock); > } > out: > mutex_unlock(&bond->ipsec_lock); > @@ -614,6 +620,7 @@ static void bond_ipsec_del_sa_all(struct bonding > *bond) > if (!ipsec->xs->xso.real_dev) > continue; The above if should be in the critical section as well. > > + spin_lock_bh(&ipsec->xs->lock); > if (ipsec->xs->km.state == XFRM_STATE_DEAD) { > /* already dead no need to delete again */ > if (ipsec->xs->xso.real_dev == real_dev && > @@ -621,6 +628,7 @@ static void bond_ipsec_del_sa_all(struct bonding > *bond) > real_dev->xfrmdev_ops- > >xdo_dev_state_free(ipsec->xs); > list_del(&ipsec->list); > kfree(ipsec); > + spin_unlock_bh(&ipsec->xs->lock); And I recommend the same thing with "goto next" here, jumping at the end of the loop, before the unlock. > continue; > } > > @@ -635,6 +643,7 @@ static void bond_ipsec_del_sa_all(struct bonding > *bond) > if (real_dev->xfrmdev_ops- > >xdo_dev_state_free) > real_dev->xfrmdev_ops- > >xdo_dev_state_free(ipsec->xs); > } > + spin_unlock_bh(&ipsec->xs->lock); > } > mutex_unlock(&bond->ipsec_lock); > } > > Thanks > Hangbin
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index e45bba240cbc..683bf1221caf 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -537,6 +537,10 @@ static void bond_ipsec_add_sa_all(struct bonding *bond) } list_for_each_entry(ipsec, &bond->ipsec_list, list) { + /* Skip dead xfrm states, they'll be freed later. */ + if (ipsec->xs->km.state == XFRM_STATE_DEAD) + continue; + /* If new state is added before ipsec_lock acquired */ if (ipsec->xs->xso.real_dev == real_dev) continue; @@ -560,7 +564,6 @@ static void bond_ipsec_del_sa(struct xfrm_state *xs) struct net_device *bond_dev = xs->xso.dev; struct net_device *real_dev; netdevice_tracker tracker; - struct bond_ipsec *ipsec; struct bonding *bond; struct slave *slave; @@ -592,15 +595,6 @@ static void bond_ipsec_del_sa(struct xfrm_state *xs) real_dev->xfrmdev_ops->xdo_dev_state_delete(xs); out: netdev_put(real_dev, &tracker); - mutex_lock(&bond->ipsec_lock); - list_for_each_entry(ipsec, &bond->ipsec_list, list) { - if (ipsec->xs == xs) { - list_del(&ipsec->list); - kfree(ipsec); - break; - } - } - mutex_unlock(&bond->ipsec_lock); } static void bond_ipsec_del_sa_all(struct bonding *bond) @@ -617,6 +611,12 @@ static void bond_ipsec_del_sa_all(struct bonding *bond) mutex_lock(&bond->ipsec_lock); list_for_each_entry(ipsec, &bond->ipsec_list, list) { + if (ipsec->xs->km.state == XFRM_STATE_DEAD) { + list_del(&ipsec->list); + kfree(ipsec); + continue; + } + if (!ipsec->xs->xso.real_dev) continue; @@ -640,6 +640,7 @@ static void bond_ipsec_free_sa(struct xfrm_state *xs) struct net_device *bond_dev = xs->xso.dev; struct net_device *real_dev; netdevice_tracker tracker; + struct bond_ipsec *ipsec; struct bonding *bond; struct slave *slave; @@ -659,13 +660,24 @@ static void bond_ipsec_free_sa(struct xfrm_state *xs) if (!xs->xso.real_dev) goto out; - WARN_ON(xs->xso.real_dev != real_dev); + if (xs->xso.real_dev != real_dev) + goto out; if (real_dev && real_dev->xfrmdev_ops && real_dev->xfrmdev_ops->xdo_dev_state_free) real_dev->xfrmdev_ops->xdo_dev_state_free(xs); out: netdev_put(real_dev, &tracker); + + mutex_lock(&bond->ipsec_lock); + list_for_each_entry(ipsec, &bond->ipsec_list, list) { + if (ipsec->xs == xs) { + list_del(&ipsec->list); + kfree(ipsec); + break; + } + } + mutex_unlock(&bond->ipsec_lock); } /**
The fixed commit placed mutex_lock() inside spin_lock_bh(), which triggers a warning: BUG: sleeping function called from invalid context at... Fix this by moving the IPsec deletion operation to bond_ipsec_free_sa, which is not held by spin_lock_bh(). Additionally, delete the IPsec list in bond_ipsec_del_sa_all() when the XFRM state is DEAD to prevent xdo_dev_state_free() from being triggered again in bond_ipsec_free_sa(). For bond_ipsec_free_sa(), there are now three conditions: 1. if (!slave): When no active device exists. 2. if (!xs->xso.real_dev): When xdo_dev_state_add() fails. 3. if (xs->xso.real_dev != real_dev): When an xs has already been freed by bond_ipsec_del_sa_all() due to migration, and the active slave has changed to a new device. At the same time, the xs is marked as DEAD due to the XFRM entry is removed, triggering xfrm_state_gc_task() and bond_ipsec_free_sa(). In all three cases, xdo_dev_state_free() should not be called, only xs should be removed from bond->ipsec list. Fixes: 2aeeef906d5a ("bonding: change ipsec_lock from spin lock to mutex") Reported-by: Jakub Kicinski <kuba@kernel.org> Closes: https://lore.kernel.org/netdev/20241212062734.182a0164@kernel.org Suggested-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> --- drivers/net/bonding/bond_main.c | 34 ++++++++++++++++++++++----------- 1 file changed, 23 insertions(+), 11 deletions(-)