Message ID | 20210604021420.49972-1-zhudi21@huawei.com (mailing list archive) |
---|---|
State | Changes Requested |
Delegated to: | Netdev Maintainers |
Headers | show |
Series | [v2] bonding: 3ad: fix a crash in agg_device_up() | expand |
Context | Check | Description |
---|---|---|
netdev/cover_letter | success | Link |
netdev/fixes_present | success | Link |
netdev/patch_count | success | Link |
netdev/tree_selection | success | Guessed tree name to be net-next |
netdev/subject_prefix | warning | Target tree name not specified in the subject |
netdev/cc_maintainers | warning | 1 maintainers not CCed: andy@greyhouse.net |
netdev/source_inline | success | Was 0 now: 0 |
netdev/verify_signedoff | success | Link |
netdev/module_param | success | Was 0 now: 0 |
netdev/build_32bit | success | Errors and warnings before: 0 this patch: 0 |
netdev/kdoc | success | Errors and warnings before: 0 this patch: 0 |
netdev/verify_fixes | success | Link |
netdev/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 19 lines checked |
netdev/build_allmodconfig_warn | success | Errors and warnings before: 0 this patch: 0 |
netdev/header_inline | success | Link |
zhudi <zhudi21@huawei.com> wrote: >From: Di Zhu <zhudi21@huawei.com> > >When doing the test of restarting the network card, the system is >broken because the port->slave is null pointer in agg_device_up(). >After in-depth investigation, we found the real cause: in >bond_3ad_unbind_slave() if there are no active ports in the >aggregator to be deleted, the ad_clear_agg() will be called to >set "aggregator->lag_ports = NULL", but the ports originally >belonging to the aggregator are still linked together. Presumably by "no active ports" you refer to the following block near the end of bond_3ad_unbind_slave() (where a port being deleted is removed from an aggregator): if (prev_port) prev_port->next_port_in_aggregator = temp_port->next_port_in_aggregator; else temp_aggregator->lag_ports = temp_port->next_port_in_aggregator; temp_aggregator->num_of_ports--; if (__agg_active_ports(temp_aggregator) == 0) { select_new_active_agg = temp_aggregator->is_active; ad_clear_agg(temp_aggregator); if (select_new_active_agg) { slave_info(bond->dev, slave->dev, "Removing an active aggregator\n"); /* select new active aggregator */ ad_agg_selection_logic(__get_first_agg(port), &dummy_slave_update); by what mechanism are you causing the ports within a particular aggregator to all be inactive (port->is_enabled == false)? Am I correct in speculating that you're removing carrier from "port1" and "port3" while simultaneously removing "port2" from the bond? I believe this would all need to occur within a single gap between passes of the state machine logic. The logic in bond_3ad_handle_link_change() that sets port->is_enabled = false happens on link state change, and is immediately followed (under the same acquisition of bond->mode_lock) by a call to ad_agg_selection_logic(). So, in principle, when the port is set to inactive (is_enabled == false), it should be reselected to a different aggregator (likely as an individual port) on the next run of the state machine. >Before bond_3ad_unbind_slave(): > aggregator4->lag_ports = port1->port2->port3 >After bond_3ad_unbind_slave(): > aggregator4->lag_ports = NULL > port1->port2->port3 > >After the port2 is deleted, the port is still remain in the linked >list: because the port does not belong to any agg, so unbind do >nothing for this port. > >After a while, bond_3ad_state_machine_handler() will run and >traverse each existing port, trying to bind each port to the >newly selected agg, such as: > if (!found) { > if (free_aggregator) { > ... > port->aggregator->lag_ports = port; > ... > } > } >After this operation, the link list looks like this: > aggregator1->lag_ports = port1->port2(has been deleted)->port3 > >After that, just traverse the linked list of agg1 and access the >port2->slave, the crash will happen. > >The easiest way to fix it is: if a port does not belong to any agg, delete >it from the list and wait for the state machine to select the agg again. As I understand the above, the bug causes a deleted (freed) object to be left on the linked list (port1 -> port2 -> port3 in the above). I'm not quite sure how the code block above allows this to happen, as the port being deleted is unlinked (at the top of the code block). In any event, assuming that I'm missing something and it does leave the freed port in the list, this fix may be the easiest, but I don't believe it's correct, as it leaves opportunity for something else to traverse the linked list and dereference the freed element. The code sample I included above occurs immediately after removing a port from the aggregator's list of ports, and, in light of your analysis (and assuming I've got the correct code block above), I believe that the "__agg_active_ports(temp_aggregator) == 0" test should really test to determine if there are no ports at all in the aggregator (i.e., temp_aggregator->num_of_ports == 0), not only test for active ports. This way, if any ports remain in the aggregator, then ad_clear_agg() would not be called, and the linked list would not end up detached. Comments? -J >Fixes: 0622cab0341c ("bonding: fix 802.3ad aggregator reselection") >Signed-off-by: Di Zhu <zhudi21@huawei.com> >--- >/* v2 */ >-David Miller <davem@davemloft.net> > -add Fixes: tag >--- > drivers/net/bonding/bond_3ad.c | 7 +++++++ > 1 file changed, 7 insertions(+) > >diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c >index 6908822d9773..1d6ff4e1ed28 100644 >--- a/drivers/net/bonding/bond_3ad.c >+++ b/drivers/net/bonding/bond_3ad.c >@@ -1793,6 +1793,8 @@ static void ad_agg_selection_logic(struct aggregator *agg, > static void ad_clear_agg(struct aggregator *aggregator) > { > if (aggregator) { >+ struct port *port, *next; >+ > aggregator->is_individual = false; > aggregator->actor_admin_aggregator_key = 0; > aggregator->actor_oper_aggregator_key = 0; >@@ -1801,6 +1803,11 @@ static void ad_clear_agg(struct aggregator *aggregator) > aggregator->partner_oper_aggregator_key = 0; > aggregator->receive_state = 0; > aggregator->transmit_state = 0; >+ for (port = aggregator->lag_ports; port; port = next) { >+ next = port->next_port_in_aggregator; >+ if (port->aggregator == aggregator) >+ port->next_port_in_aggregator = NULL; >+ } > aggregator->lag_ports = NULL; > aggregator->is_active = 0; > aggregator->num_of_ports = 0; >-- >2.23.0 > --- -Jay Vosburgh, jay.vosburgh@canonical.com
diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c index 6908822d9773..1d6ff4e1ed28 100644 --- a/drivers/net/bonding/bond_3ad.c +++ b/drivers/net/bonding/bond_3ad.c @@ -1793,6 +1793,8 @@ static void ad_agg_selection_logic(struct aggregator *agg, static void ad_clear_agg(struct aggregator *aggregator) { if (aggregator) { + struct port *port, *next; + aggregator->is_individual = false; aggregator->actor_admin_aggregator_key = 0; aggregator->actor_oper_aggregator_key = 0; @@ -1801,6 +1803,11 @@ static void ad_clear_agg(struct aggregator *aggregator) aggregator->partner_oper_aggregator_key = 0; aggregator->receive_state = 0; aggregator->transmit_state = 0; + for (port = aggregator->lag_ports; port; port = next) { + next = port->next_port_in_aggregator; + if (port->aggregator == aggregator) + port->next_port_in_aggregator = NULL; + } aggregator->lag_ports = NULL; aggregator->is_active = 0; aggregator->num_of_ports = 0;