Message ID | 57419853.9050701@gmx.de (mailing list archive) |
---|---|
State | New, archived |
Lino Sanfilippo <LinoSanfilippo@gmx.de> :
[...]
> --- a/drivers/net/ethernet/arc/emac_main.c
> +++ b/drivers/net/ethernet/arc/emac_main.c
> @@ -159,12 +159,22 @@ static void arc_emac_tx_clean(struct net_device *ndev)
>  		unsigned int *txbd_dirty = &priv->txbd_dirty;
>  		struct arc_emac_bd *txbd = &priv->txbd[*txbd_dirty];
>  		struct buffer_state *tx_buff = &priv->tx_buff[*txbd_dirty];
> -		struct sk_buff *skb = tx_buff->skb;
>  		unsigned int info = le32_to_cpu(txbd->info);
> +		struct sk_buff *skb;
>

Insert a smp_rmb() here to close one window for an outdated txbd_dirty value
(the "arc_emac_tx_clean wrote txbd_curr and issued smp_wmb" one).

Actually, insert smp_rmb() at the start of arc_emac_tx_clean() as it does
not need to be performed within the loop and would penalize it.

Given the implicit smp barriers in the non-driver code, I consider
"arc_emac_tx_clean on one CPU does not read latest txbd_dirty value written
by previous arc_emac_tx_clean on different CPU" as utter onanism, but
issuing smp_rmb() at the start of arc_emac_tx_clean() nails it as well.

> -		if ((info & FOR_EMAC) || !txbd->data || !skb)
> +		if (*txbd_dirty == priv->txbd_curr)
>  			break;

Ok, it's just the "while (priv->txbd_dirty != priv->txbd_curr) {" loop
in disguise.

>
> +		/* Make sure curr pointer is consistent with info */
> +		rmb();
> +
> +		info = le32_to_cpu(txbd->info);
> +
> +		if (info & FOR_EMAC)
> +			break;

With proper ordering + barrier in arc_emac_tx() you can relax it to smp_rmb().

> +
> +		skb = tx_buff->skb;
> +
>  		if (unlikely(info & (DROP | DEFR | LTCL | UFLO))) {
>  			stats->tx_errors++;
>  			stats->tx_dropped++;
> @@ -195,8 +205,8 @@ static void arc_emac_tx_clean(struct net_device *ndev)
>  		*txbd_dirty = (*txbd_dirty + 1) % TX_BD_NUM;
>  	}
>
> -	/* Ensure that txbd_dirty is visible to tx() before checking
> -	 * for queue stopped.
> +	/* Ensure that txbd_dirty is visible to tx() and we see the most recent
> +	 * value for txbd_curr.
>  	 */
>  	smp_mb();
>
> @@ -680,35 +690,29 @@ static int arc_emac_tx(struct sk_buff *skb, struct net_device *ndev)
>  	dma_unmap_len_set(&priv->tx_buff[*txbd_curr], len, len);
>
>  	priv->txbd[*txbd_curr].data = cpu_to_le32(addr);
> -
> -	/* Make sure pointer to data buffer is set */
> -	wmb();
> +	priv->tx_buff[*txbd_curr].skb = skb;
>
>  	skb_tx_timestamp(skb);
>
>  	*info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len);

No.

You need dma_wmb() after skb_tx_timestamp() to commit skb_tx_timestamp() [*]
and data = cpu_to_le32(addr).

[*] I doubt anyone wants a dma_sync_single_...() here.

>
> -	/* Make sure info word is set */
> +	/* 1. Make sure that with respect to tx_clean everything is set up
> +	 * properly before we advance txbd_curr.
> +	 * 2. Make sure writes to DMA descriptors are completed before we inform
> +	 * the hardware.
> +	 */
>  	wmb();

Yes, either wmb() or smp_wmb() + dma_wmb().

>
> -	priv->tx_buff[*txbd_curr].skb = skb;
> -
>  	/* Increment index to point to the next BD */
>  	*txbd_curr = (*txbd_curr + 1) % TX_BD_NUM;
>
> -	/* Ensure that tx_clean() sees the new txbd_curr before
> -	 * checking the queue status. This prevents an unneeded wake
> -	 * of the queue in tx_clean().
> +	/* Ensure we see the most recent value of txbd_dirty and tx_clean() sees
> +	 * the updated value of txbd_curr.
>  	 */
>  	smp_mb();

Nit: s/the most/a/

"a" as in "arc_emac_tx_clean() _is_ racing with arc_emac_tx"

>
> -	if (!arc_emac_tx_avail(priv)) {
> +	if (!arc_emac_tx_avail(priv))
>  		netif_stop_queue(ndev);
> -		/* Refresh tx_dirty */
> -		smp_mb();
> -		if (arc_emac_tx_avail(priv))
> -			netif_start_queue(ndev);
> -	}

No.

I may sound like an old record but the revalidation part must be kept.

txbd_dirty may change in the arc_emac_tx_avail.. netif_stop_queue window
(the race requires a different CPU as arc_emac_tx() runs in locally
disabled BH context).
On 23.05.2016 00:36, Francois Romieu wrote:
> Lino Sanfilippo <LinoSanfilippo@gmx.de> :
> [...]
>> --- a/drivers/net/ethernet/arc/emac_main.c
>> +++ b/drivers/net/ethernet/arc/emac_main.c
>> @@ -159,12 +159,22 @@ static void arc_emac_tx_clean(struct net_device *ndev)
>>  		unsigned int *txbd_dirty = &priv->txbd_dirty;
>>  		struct arc_emac_bd *txbd = &priv->txbd[*txbd_dirty];
>>  		struct buffer_state *tx_buff = &priv->tx_buff[*txbd_dirty];
>> -		struct sk_buff *skb = tx_buff->skb;
>>  		unsigned int info = le32_to_cpu(txbd->info);
>> +		struct sk_buff *skb;
>>
>
> Insert a smp_rmb() here to close one window for an outdated txbd_dirty value
> (the "arc_emac_tx_clean wrote txbd_curr and issued smp_wmb" one).
>
> Actually, insert smp_rmb() at the start of arc_emac_tx_clean() as it does
> not need to be performed within the loop and would penalize it.

I agree, we should place the barrier before the loop. I also think it is
indeed more appropriate to use the SMP versions for both barriers.

>
> Given the implicit smp barriers in the non-driver code, I consider
> "arc_emac_tx_clean on one CPU does not read latest txbd_dirty value written
> by previous arc_emac_tx_clean on different CPU" as utter onanism, but
> issuing smp_rmb() at the start of arc_emac_tx_clean() nails it as well.
>
>> -		if ((info & FOR_EMAC) || !txbd->data || !skb)
>> +		if (*txbd_dirty == priv->txbd_curr)
>>  			break;
>
> Ok, it's just the "while (priv->txbd_dirty != priv->txbd_curr) {" loop
> in disguise.

I can't deny that :)

>
>>
>> +		/* Make sure curr pointer is consistent with info */
>> +		rmb();
>> +
>> +		info = le32_to_cpu(txbd->info);
>> +
>> +		if (info & FOR_EMAC)
>> +			break;
>
> With proper ordering + barrier in arc_emac_tx() you can relax it to smp_rmb().

Ok.

>> +
>> +		skb = tx_buff->skb;
>> +
>>  		if (unlikely(info & (DROP | DEFR | LTCL | UFLO))) {
>>  			stats->tx_errors++;
>>  			stats->tx_dropped++;
>> @@ -195,8 +205,8 @@ static void arc_emac_tx_clean(struct net_device *ndev)
>>  		*txbd_dirty = (*txbd_dirty + 1) % TX_BD_NUM;
>>  	}
>>
>> -	/* Ensure that txbd_dirty is visible to tx() before checking
>> -	 * for queue stopped.
>> +	/* Ensure that txbd_dirty is visible to tx() and we see the most recent
>> +	 * value for txbd_curr.
>>  	 */
>>  	smp_mb();
>>
>> @@ -680,35 +690,29 @@ static int arc_emac_tx(struct sk_buff *skb, struct net_device *ndev)
>>  	dma_unmap_len_set(&priv->tx_buff[*txbd_curr], len, len);
>>
>>  	priv->txbd[*txbd_curr].data = cpu_to_le32(addr);
>> -
>> -	/* Make sure pointer to data buffer is set */
>> -	wmb();
>> +	priv->tx_buff[*txbd_curr].skb = skb;
>>
>>  	skb_tx_timestamp(skb);
>>
>>  	*info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len);
>
> No.
>
> You need dma_wmb() after skb_tx_timestamp() to commit skb_tx_timestamp() [*]
> and data = cpu_to_le32(addr).

I don't agree here. A dma_wmb() would have an effect on "data" and "info",
yes, but it would have absolutely no effect on skb_tx_timestamp(), since
there is no DMA access involved. In fact skb_tx_timestamp() could probably
even be reordered to appear after the dma_wmb().

Anyway, there is the wmb() directly after the assignment to "info". _This_
barrier should ensure that skb_tx_timestamp() (along with a flush of data
and info to DMA) is executed before "txbd_curr" is advanced. This means
that the corresponding skb can't be freed prematurely by tx_clean().

>
> [*] I doubt anyone wants a dma_sync_single_...() here.
>
>>
>> -	/* Make sure info word is set */
>> +	/* 1. Make sure that with respect to tx_clean everything is set up
>> +	 * properly before we advance txbd_curr.
>> +	 * 2. Make sure writes to DMA descriptors are completed before we inform
>> +	 * the hardware.
>> +	 */
>>  	wmb();
>
> Yes, either wmb() or smp_wmb() + dma_wmb().

I really prefer one generic barrier over a combination of two or more
special barriers.

>>
>> -	priv->tx_buff[*txbd_curr].skb = skb;
>> -
>>  	/* Increment index to point to the next BD */
>>  	*txbd_curr = (*txbd_curr + 1) % TX_BD_NUM;
>>
>> -	/* Ensure that tx_clean() sees the new txbd_curr before
>> -	 * checking the queue status. This prevents an unneeded wake
>> -	 * of the queue in tx_clean().
>> +	/* Ensure we see the most recent value of txbd_dirty and tx_clean() sees
>> +	 * the updated value of txbd_curr.
>>  	 */
>>  	smp_mb();
>
> Nit: s/the most/a/
>
> "a" as in "arc_emac_tx_clean() _is_ racing with arc_emac_tx"
>
>>
>> -	if (!arc_emac_tx_avail(priv)) {
>> +	if (!arc_emac_tx_avail(priv))
>>  		netif_stop_queue(ndev);
>> -		/* Refresh tx_dirty */
>> -		smp_mb();
>> -		if (arc_emac_tx_avail(priv))
>> -			netif_start_queue(ndev);
>> -	}
>
> No.
>
> I may sound like an old record but the revalidation part must be kept.
>
> txbd_dirty may change in the arc_emac_tx_avail.. netif_stop_queue window
> (the race requires a different CPU as arc_emac_tx() runs in locally
> disabled BH context).
>

Ok, I can see the race now. So yes, this should be kept.

I will prepare a patch with the discussed changes tomorrow.

Regards,
Lino
Lino Sanfilippo <LinoSanfilippo@gmx.de> :
[...]
> I don't agree here. A dma_wmb() would have an effect on "data" and "info",
> yes, but it would have absolutely no effect on skb_tx_timestamp(), since
> there is no DMA access involved. In fact skb_tx_timestamp() could probably
> even be reordered to appear after the dma_wmb().
>
> Anyway, there is the wmb() directly after the assignment to "info". _This_
> barrier should ensure that skb_tx_timestamp() (along with a flush of data
> and info to DMA) is executed before "txbd_curr" is advanced. This means
> that the corresponding skb can't be freed prematurely by tx_clean().

The concern here is about sending adequate PTP payload on the network.

skb_tx_timestamp() is used for network clock synchronization. Some extra
information must be transmitted. Be it through direct payload change or
through indirect control, it _is_ related to DMA.

Several (most ?) skb_tx_timestamp() misuses blur the picture: CPU vs device
sync is of course way below the radar when the driver crashes because of a
plain use-after-free skb_tx_timestamp() :o/
--- a/drivers/net/ethernet/arc/emac_main.c
+++ b/drivers/net/ethernet/arc/emac_main.c
@@ -159,12 +159,22 @@ static void arc_emac_tx_clean(struct net_device *ndev)
 		unsigned int *txbd_dirty = &priv->txbd_dirty;
 		struct arc_emac_bd *txbd = &priv->txbd[*txbd_dirty];
 		struct buffer_state *tx_buff = &priv->tx_buff[*txbd_dirty];
-		struct sk_buff *skb = tx_buff->skb;
 		unsigned int info = le32_to_cpu(txbd->info);
+		struct sk_buff *skb;
 
-		if ((info & FOR_EMAC) || !txbd->data || !skb)
+		if (*txbd_dirty == priv->txbd_curr)
 			break;
 
+		/* Make sure curr pointer is consistent with info */
+		rmb();
+
+		info = le32_to_cpu(txbd->info);
+
+		if (info & FOR_EMAC)
+			break;
+
+		skb = tx_buff->skb;
+
 		if (unlikely(info & (DROP | DEFR | LTCL | UFLO))) {
 			stats->tx_errors++;
 			stats->tx_dropped++;
@@ -195,8 +205,8 @@ static void arc_emac_tx_clean(struct net_device *ndev)
 		*txbd_dirty = (*txbd_dirty + 1) % TX_BD_NUM;
 	}
 
-	/* Ensure that txbd_dirty is visible to tx() before checking
-	 * for queue stopped.
+	/* Ensure that txbd_dirty is visible to tx() and we see the most recent
+	 * value for txbd_curr.
 	 */
 	smp_mb();
 
@@ -680,35 +690,29 @@ static int arc_emac_tx(struct sk_buff *skb, struct net_device *ndev)
 	dma_unmap_len_set(&priv->tx_buff[*txbd_curr], len, len);
 
 	priv->txbd[*txbd_curr].data = cpu_to_le32(addr);
-
-	/* Make sure pointer to data buffer is set */
-	wmb();
+	priv->tx_buff[*txbd_curr].skb = skb;
 
 	skb_tx_timestamp(skb);
 
 	*info = cpu_to_le32(FOR_EMAC | FIRST_OR_LAST_MASK | len);
 
-	/* Make sure info word is set */
+	/* 1. Make sure that with respect to tx_clean everything is set up
+	 * properly before we advance txbd_curr.
+	 * 2. Make sure writes to DMA descriptors are completed before we inform
+	 * the hardware.
+	 */
 	wmb();
 
-	priv->tx_buff[*txbd_curr].skb = skb;
-
 	/* Increment index to point to the next BD */
 	*txbd_curr = (*txbd_curr + 1) % TX_BD_NUM;
 
-	/* Ensure that tx_clean() sees the new txbd_curr before
-	 * checking the queue status. This prevents an unneeded wake
-	 * of the queue in tx_clean().
+	/* Ensure we see the most recent value of txbd_dirty and tx_clean() sees
+	 * the updated value of txbd_curr.
 	 */
 	smp_mb();
 
-	if (!arc_emac_tx_avail(priv)) {
+	if (!arc_emac_tx_avail(priv))
 		netif_stop_queue(ndev);
-		/* Refresh tx_dirty */
-		smp_mb();
-		if (arc_emac_tx_avail(priv))
-			netif_start_queue(ndev);
-	}
 
 	arc_reg_set(priv, R_STATUS, TXPL_MASK);