diff mbox series

[v2] bonding: 3ad: fix a crash in agg_device_up()

Message ID 20210604021420.49972-1-zhudi21@huawei.com (mailing list archive)
State Changes Requested
Delegated to: Netdev Maintainers
Headers show
Series [v2] bonding: 3ad: fix a crash in agg_device_up() | expand

Checks

Context Check Description
netdev/cover_letter success Link
netdev/fixes_present success Link
netdev/patch_count success Link
netdev/tree_selection success Guessed tree name to be net-next
netdev/subject_prefix warning Target tree name not specified in the subject
netdev/cc_maintainers warning 1 maintainers not CCed: andy@greyhouse.net
netdev/source_inline success Was 0 now: 0
netdev/verify_signedoff success Link
netdev/module_param success Was 0 now: 0
netdev/build_32bit success Errors and warnings before: 0 this patch: 0
netdev/kdoc success Errors and warnings before: 0 this patch: 0
netdev/verify_fixes success Link
netdev/checkpatch success total: 0 errors, 0 warnings, 0 checks, 19 lines checked
netdev/build_allmodconfig_warn success Errors and warnings before: 0 this patch: 0
netdev/header_inline success Link

Commit Message

zhudi (J) June 4, 2021, 2:14 a.m. UTC
From: Di Zhu <zhudi21@huawei.com>

When doing the test of restarting the network card, the system is
broken because the port->slave is null pointer in agg_device_up().
After in-depth investigation, we found the real cause: in
bond_3ad_unbind_slave()  if there are no active ports in the
aggregator to be deleted, the ad_clear_agg() will be called to
set "aggregator->lag_ports = NULL", but the ports originally
belonging to the aggregator are still linked together.

Before bond_3ad_unbind_slave():
	aggregator4->lag_ports = port1->port2->port3
After bond_3ad_unbind_slave():
	aggregator4->lag_ports = NULL
	port1->port2->port3

After the port2 is deleted, the port is still  remain in the linked
list: because the port does not belong to any agg, so unbind do
nothing for this port.

After a while, bond_3ad_state_machine_handler() will run and
traverse each existing port, trying to bind each port to the
newly selected agg, such as:
	if (!found) {
		if (free_aggregator) {
			...
			port->aggregator->lag_ports = port;
			...
		}
	}
After this operation, the link list looks like this:
	 aggregator1->lag_ports = port1->port2(has been deleted)->port3

After that, just traverse the linked list of agg1 and access the
port2->slave, the crash will happen.

The easiest way to fix it is: if a port does not belong to any agg, delete
it from the list and wait for the state machine to select the agg again.

Fixes: 0622cab0341c ("bonding: fix 802.3ad aggregator reselection")
Signed-off-by: Di Zhu <zhudi21@huawei.com>
---
/* v2 */
-David Miller <davem@davemloft.net>
 -add Fixes: tag
---
 drivers/net/bonding/bond_3ad.c | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

Jay Vosburgh June 4, 2021, 4:47 a.m. UTC | #1
zhudi <zhudi21@huawei.com> wrote:

>From: Di Zhu <zhudi21@huawei.com>
>
>When doing the test of restarting the network card, the system is
>broken because the port->slave is null pointer in agg_device_up().
>After in-depth investigation, we found the real cause: in
>bond_3ad_unbind_slave()  if there are no active ports in the
>aggregator to be deleted, the ad_clear_agg() will be called to
>set "aggregator->lag_ports = NULL", but the ports originally
>belonging to the aggregator are still linked together.

	Presumably by "no active ports" you refer to the following block
near the end of bond_3ad_unbind_slave() (where a port being deleted is
removed from an aggregator):

				if (prev_port)
					prev_port->next_port_in_aggregator = temp_port->next_port_in_aggregator;
				else
					temp_aggregator->lag_ports = temp_port->next_port_in_aggregator;
				temp_aggregator->num_of_ports--;
				if (__agg_active_ports(temp_aggregator) == 0) {
					select_new_active_agg = temp_aggregator->is_active;
					ad_clear_agg(temp_aggregator);
					if (select_new_active_agg) {
						slave_info(bond->dev, slave->dev, "Removing an active aggregator\n");
						/* select new active aggregator */
						ad_agg_selection_logic(__get_first_agg(port),
							               &dummy_slave_update);


	by what mechanism are you causing the ports within a particular
aggregator to all be inactive (port->is_enabled == false)?  Am I correct
in speculating that you're removing carrier from "port1" and "port3"
while simultaneously removing "port2" from the bond?  I believe this
would all need to occur within a single gap between passes of the state
machine logic.

	The logic in bond_3ad_handle_link_change() that sets
port->is_enabled = false happens on link state change, and is
immediately followed (under the same acquisition of bond->mode_lock) by
a call to ad_agg_selection_logic().  So, in principle, when the port is
set to inactive (is_enabled == false), it should be reselected to a
different aggregator (likely as an individual port) on the next run of
the state machine.

>Before bond_3ad_unbind_slave():
>	aggregator4->lag_ports = port1->port2->port3
>After bond_3ad_unbind_slave():
>	aggregator4->lag_ports = NULL
>	port1->port2->port3
>
>After the port2 is deleted, the port is still  remain in the linked
>list: because the port does not belong to any agg, so unbind do
>nothing for this port.
>
>After a while, bond_3ad_state_machine_handler() will run and
>traverse each existing port, trying to bind each port to the
>newly selected agg, such as:
>	if (!found) {
>		if (free_aggregator) {
>			...
>			port->aggregator->lag_ports = port;
>			...
>		}
>	}
>After this operation, the link list looks like this:
>	 aggregator1->lag_ports = port1->port2(has been deleted)->port3
>
>After that, just traverse the linked list of agg1 and access the
>port2->slave, the crash will happen.
>
>The easiest way to fix it is: if a port does not belong to any agg, delete
>it from the list and wait for the state machine to select the agg again.

	As I understand the above, the bug causes a deleted (freed)
object to be left on the linked list (port1 -> port2 -> port3 in the
above).  I'm not quite sure how the code block above allows this to
happen, as the port being deleted is unlinked (at the top of the code
block).

	In any event, assuming that I'm missing something and it does
leave the freed port in the list, this fix may be the easiest, but I
don't believe it's correct, as it leaves opportunity for something else
to traverse the linked list and dereference the freed element.

	The code sample I included above occurs immediately after
removing a port from the aggregator's list of ports, and, in light of
your analysis (and assuming I've got the correct code block above), I
believe that the "__agg_active_ports(temp_aggregator) == 0" test should
really test to determine if there are no ports at all in the aggregator
(i.e., temp_aggregator->num_of_ports == 0), not only test for active
ports.  This way, if any ports remain in the aggregator, then
ad_clear_agg() would not be called, and the linked list would not end up
detached.

	Comments?

	-J

>Fixes: 0622cab0341c ("bonding: fix 802.3ad aggregator reselection")
>Signed-off-by: Di Zhu <zhudi21@huawei.com>
>---
>/* v2 */
>-David Miller <davem@davemloft.net>
> -add Fixes: tag
>---
> drivers/net/bonding/bond_3ad.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
>diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
>index 6908822d9773..1d6ff4e1ed28 100644
>--- a/drivers/net/bonding/bond_3ad.c
>+++ b/drivers/net/bonding/bond_3ad.c
>@@ -1793,6 +1793,8 @@ static void ad_agg_selection_logic(struct aggregator *agg,
> static void ad_clear_agg(struct aggregator *aggregator)
> {
> 	if (aggregator) {
>+		struct port *port, *next;
>+
> 		aggregator->is_individual = false;
> 		aggregator->actor_admin_aggregator_key = 0;
> 		aggregator->actor_oper_aggregator_key = 0;
>@@ -1801,6 +1803,11 @@ static void ad_clear_agg(struct aggregator *aggregator)
> 		aggregator->partner_oper_aggregator_key = 0;
> 		aggregator->receive_state = 0;
> 		aggregator->transmit_state = 0;
>+		for (port = aggregator->lag_ports; port; port = next) {
>+			next = port->next_port_in_aggregator;
>+			if (port->aggregator == aggregator)
>+				port->next_port_in_aggregator = NULL;
>+		}
> 		aggregator->lag_ports = NULL;
> 		aggregator->is_active = 0;
> 		aggregator->num_of_ports = 0;
>-- 
>2.23.0
>

---
	-Jay Vosburgh, jay.vosburgh@canonical.com
diff mbox series

Patch

diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
index 6908822d9773..1d6ff4e1ed28 100644
--- a/drivers/net/bonding/bond_3ad.c
+++ b/drivers/net/bonding/bond_3ad.c
@@ -1793,6 +1793,8 @@  static void ad_agg_selection_logic(struct aggregator *agg,
 static void ad_clear_agg(struct aggregator *aggregator)
 {
 	if (aggregator) {
+		struct port *port, *next;
+
 		aggregator->is_individual = false;
 		aggregator->actor_admin_aggregator_key = 0;
 		aggregator->actor_oper_aggregator_key = 0;
@@ -1801,6 +1803,11 @@  static void ad_clear_agg(struct aggregator *aggregator)
 		aggregator->partner_oper_aggregator_key = 0;
 		aggregator->receive_state = 0;
 		aggregator->transmit_state = 0;
+		for (port = aggregator->lag_ports; port; port = next) {
+			next = port->next_port_in_aggregator;
+			if (port->aggregator == aggregator)
+				port->next_port_in_aggregator = NULL;
+		}
 		aggregator->lag_ports = NULL;
 		aggregator->is_active = 0;
 		aggregator->num_of_ports = 0;