[v2,1/1] net/mlx5: Added cond_resched() to crdump collection

Message ID 20240829213856.77619-2-mkhalfella@purestorage.com (mailing list archive)
State Not Applicable

Commit Message

Mohamed Khalfella Aug. 29, 2024, 9:38 p.m. UTC
Collecting crdump involves reading vsc registers from the pci config
space of the mlx device, which can take a long time to complete. This
might result in starving other threads waiting to run on the cpu.

Numbers I got from testing ConnectX-5 Ex MCX516A-CDAT in the lab:

- mlx5_vsc_gw_read_block_fast() was called with length = 1310716.
- mlx5_vsc_gw_read_fast() reads 4 bytes at a time. It was not used to
  read the entire 1310716 bytes. It was called 53813 times because
  there are jumps in read_addr.
- On average mlx5_vsc_gw_read_fast() took 35284.4ns.
- In total mlx5_vsc_wait_on_flag() called vsc_read() 54707 times.
  The average time for each call was 17548.3ns. In some instances
  vsc_read() was called more than one time when the flag was not set.
  As expected the thread released the cpu after 16 iterations in
  mlx5_vsc_wait_on_flag().
- Total time to read crdump was 35284.4ns * 53813 ~= 1.898s.

It was seen in the field that crdump can take more than 5 seconds to
complete. During that time mlx5_vsc_wait_on_flag() did not release the
cpu because it did not complete 16 iterations. It is believed that pci
config reads were slow. This change adds a conditional reschedule call
every 128 register reads to release the cpu if needed.

Reviewed-by: Yuanyuan Zhong <yzhong@purestorage.com>
Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Alexander Lobakin Aug. 30, 2024, 1:07 p.m. UTC | #1
From: Mohamed Khalfella <mkhalfella@purestorage.com>
Date: Thu, 29 Aug 2024 15:38:56 -0600

> Collecting crdump involves reading vsc registers from pci config space
> of mlx device, which can take long time to complete. This might result
> in starving other threads waiting to run on the cpu.
> 
> Numbers I got from testing ConnectX-5 Ex MCX516A-CDAT in the lab:
> 
> - mlx5_vsc_gw_read_block_fast() was called with length = 1310716.
> - mlx5_vsc_gw_read_fast() reads 4 bytes at a time. It was not used to
>   read the entire 1310716 bytes. It was called 53813 times because
>   there are jumps in read_addr.
> - On average mlx5_vsc_gw_read_fast() took 35284.4ns.
> - In total mlx5_vsc_wait_on_flag() called vsc_read() 54707 times.
>   The average time for each call was 17548.3ns. In some instances
>   vsc_read() was called more than one time when the flag was not set.
>   As expected the thread released the cpu after 16 iterations in
>   mlx5_vsc_wait_on_flag().
> - Total time to read crdump was 35284.4ns * 53813 ~= 1.898s.
> 
> It was seen in the field that crdump can take more than 5 seconds to
> complete. During that time mlx5_vsc_wait_on_flag() did not release the
> cpu because it did not complete 16 iterations. It is believed that pci
> config reads were slow. This change adds conditional reschedule call
> every 128 register read to release the cpu if needed.
> 
> Reviewed-by: Yuanyuan Zhong <yzhong@purestorage.com>
> Signed-off-by: Mohamed Khalfella <mkhalfella@purestorage.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
> index 6b774e0c2766..bc6c38a68702 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
> @@ -269,6 +269,7 @@ int mlx5_vsc_gw_read_block_fast(struct mlx5_core_dev *dev, u32 *data,
>  {
>  	unsigned int next_read_addr = 0;
>  	unsigned int read_addr = 0;
> +	unsigned int count = 0;
>  
>  	while (read_addr < length) {
>  		if (mlx5_vsc_gw_read_fast(dev, read_addr, &next_read_addr,
> @@ -276,6 +277,9 @@ int mlx5_vsc_gw_read_block_fast(struct mlx5_core_dev *dev, u32 *data,
>  			return read_addr;
>  
>  		read_addr = next_read_addr;
> +		/* Yield the cpu every 128 register read */
> +		if ((++count & 0x7f) == 0)
> +			cond_resched();

Why & 0x7f, could it be written more clearly?

		if (++count == 128) {
			cond_resched();
			count = 0;
		}

Also, I'd make this open-coded value a #define somewhere at the
beginning of the file with a comment with a short explanation.

BTW, why 128? Not 64, not 256 etc? You just picked it, I don't see any
explanation in the commitmsg or here in the code why exactly 128. Have
you tried different values?

>  	}
>  	return length;
>  }

Thanks,
Olek
Mohamed Khalfella Aug. 30, 2024, 6:01 p.m. UTC | #2
On 2024-08-30 15:07:45 +0200, Alexander Lobakin wrote:
> From: Mohamed Khalfella <mkhalfella@purestorage.com>
> Date: Thu, 29 Aug 2024 15:38:56 -0600
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
> > index 6b774e0c2766..bc6c38a68702 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
> > @@ -269,6 +269,7 @@ int mlx5_vsc_gw_read_block_fast(struct mlx5_core_dev *dev, u32 *data,
> >  {
> >  	unsigned int next_read_addr = 0;
> >  	unsigned int read_addr = 0;
> > +	unsigned int count = 0;
> >  
> >  	while (read_addr < length) {
> >  		if (mlx5_vsc_gw_read_fast(dev, read_addr, &next_read_addr,
> > @@ -276,6 +277,9 @@ int mlx5_vsc_gw_read_block_fast(struct mlx5_core_dev *dev, u32 *data,
> >  			return read_addr;
> >  
> >  		read_addr = next_read_addr;
> > +		/* Yield the cpu every 128 register read */
> > +		if ((++count & 0x7f) == 0)
> > +			cond_resched();
> 
> Why & 0x7f, could it be written more clearly?
> 
> 		if (++count == 128) {
> 			cond_resched();
> 			count = 0;
> 		}
> 
> Also, I'd make this open-coded value a #define somewhere at the
> beginning of the file with a comment with a short explanation.

What you are suggesting should work also. I copied the style from
mlx5_vsc_wait_on_flag() to keep the code consistent. The comment above
the line should make it clear.

> 
> BTW, why 128? Not 64, not 256 etc? You just picked it, I don't see any
> explanation in the commitmsg or here in the code why exactly 128. Have
> you tried different values?

This is mostly subjective. For the numbers I saw in the lab, this will
release the cpu after ~4.51ms. If crdump takes ~5s, the code should
release the cpu after ~18.0ms. These numbers look reasonable to me.
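
The lab figures can be checked with a few lines of arithmetic. The helpers below are a sketch for illustration only and are not part of the patch; the function and macro names are made up here, and the inputs are the numbers quoted in the commit message. The last helper assumes, as a simplification not stated in the thread, that a slow dump performs the same 53813 reads spread evenly over its total duration.

```c
/* Figures quoted in the commit message (ConnectX-5 Ex lab run). */
#define LAB_READ_NS      35284.4	/* average mlx5_vsc_gw_read_fast() latency */
#define LAB_READ_CALLS   53813u		/* number of mlx5_vsc_gw_read_fast() calls */
#define RESCHED_INTERVAL 128u		/* reads between cond_resched() calls */

/* Total dump time in seconds for a given average per-read latency. */
static double total_dump_seconds(double read_ns, unsigned int calls)
{
	return read_ns * calls / 1e9;
}

/* Time in milliseconds the loop runs between two yield points. */
static double yield_interval_ms(double read_ns, unsigned int interval)
{
	return read_ns * interval / 1e6;
}

/*
 * Average per-read latency in nanoseconds if the same number of reads
 * takes total_s seconds overall (uniform-scaling assumption).
 */
static double scaled_read_ns(double total_s, unsigned int calls)
{
	return total_s * 1e9 / calls;
}
```

With these inputs, total_dump_seconds(LAB_READ_NS, LAB_READ_CALLS) comes out at roughly 1.899 s and yield_interval_ms(LAB_READ_NS, RESCHED_INTERVAL) at roughly 4.52 ms, in line with the lab numbers above. Under the uniform-scaling assumption, a 5 s dump gives about 11.9 ms between yield points; the ~18.0ms figure quoted above presumably rests on field timings that are not shown in the thread.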
Alexander Lobakin Sept. 3, 2024, 12:14 p.m. UTC | #3
From: Mohamed Khalfella <mkhalfella@purestorage.com>
Date: Fri, 30 Aug 2024 11:01:19 -0700

> On 2024-08-30 15:07:45 +0200, Alexander Lobakin wrote:
>> From: Mohamed Khalfella <mkhalfella@purestorage.com>
>> Date: Thu, 29 Aug 2024 15:38:56 -0600
>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
>>> index 6b774e0c2766..bc6c38a68702 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
>>> @@ -269,6 +269,7 @@ int mlx5_vsc_gw_read_block_fast(struct mlx5_core_dev *dev, u32 *data,
>>>  {
>>>  	unsigned int next_read_addr = 0;
>>>  	unsigned int read_addr = 0;
>>> +	unsigned int count = 0;
>>>  
>>>  	while (read_addr < length) {
>>>  		if (mlx5_vsc_gw_read_fast(dev, read_addr, &next_read_addr,
>>> @@ -276,6 +277,9 @@ int mlx5_vsc_gw_read_block_fast(struct mlx5_core_dev *dev, u32 *data,
>>>  			return read_addr;
>>>  
>>>  		read_addr = next_read_addr;
>>> +		/* Yield the cpu every 128 register read */
>>> +		if ((++count & 0x7f) == 0)
>>> +			cond_resched();
>>
>> Why & 0x7f, could it be written more clearly?
>>
>> 		if (++count == 128) {
>> 			cond_resched();
>> 			count = 0;
>> 		}
>>
>> Also, I'd make this open-coded value a #define somewhere at the
>> beginning of the file with a comment with a short explanation.

This is still valid.

> 
> What you are suggesting should work also. I copied the style from
> mlx5_vsc_wait_on_flag() to keep the code consistent. The comment above
> the line should make it clear.

I just don't see a reason to make the code less readable.

> 
>>
>> BTW, why 128? Not 64, not 256 etc? You just picked it, I don't see any
>> explanation in the commitmsg or here in the code why exactly 128. Have
>> you tried different values?
> 
> This mostly subjective. For the numbers I saw in the lab, this will
> release the cpu after ~4.51ms. If crdump takes ~5s, the code should
> release the cpu after ~18.0ms. These numbers look reasonable to me.

So just mention in the commit message that you tried different values
and 128 gave you the best results.

Thanks,
Olek
Mohamed Khalfella Sept. 5, 2024, 3:36 a.m. UTC | #4
On 2024-09-03 14:14:58 +0200, Alexander Lobakin wrote:
> From: Mohamed Khalfella <mkhalfella@purestorage.com>
> Date: Fri, 30 Aug 2024 11:01:19 -0700
> 
> > On 2024-08-30 15:07:45 +0200, Alexander Lobakin wrote:
> >> From: Mohamed Khalfella <mkhalfella@purestorage.com>
> >> Date: Thu, 29 Aug 2024 15:38:56 -0600
> >>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
> >>> index 6b774e0c2766..bc6c38a68702 100644
> >>> --- a/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
> >>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
> >>> @@ -269,6 +269,7 @@ int mlx5_vsc_gw_read_block_fast(struct mlx5_core_dev *dev, u32 *data,
> >>>  {
> >>>  	unsigned int next_read_addr = 0;
> >>>  	unsigned int read_addr = 0;
> >>> +	unsigned int count = 0;
> >>>  
> >>>  	while (read_addr < length) {
> >>>  		if (mlx5_vsc_gw_read_fast(dev, read_addr, &next_read_addr,
> >>> @@ -276,6 +277,9 @@ int mlx5_vsc_gw_read_block_fast(struct mlx5_core_dev *dev, u32 *data,
> >>>  			return read_addr;
> >>>  
> >>>  		read_addr = next_read_addr;
> >>> +		/* Yield the cpu every 128 register read */
> >>> +		if ((++count & 0x7f) == 0)
> >>> +			cond_resched();
> >>
> >> Why & 0x7f, could it be written more clearly?
> >>
> >> 		if (++count == 128) {
> >> 			cond_resched();
> >> 			count = 0;
> >> 		}
> >>
> >> Also, I'd make this open-coded value a #define somewhere at the
> >> beginning of the file with a comment with a short explanation.
> 
> This is still valid.

Done. See <1>.

> 
> > 
> > What you are suggesting should work also. I copied the style from
> > mlx5_vsc_wait_on_flag() to keep the code consistent. The comment above
> > the line should make it clear.
> 
> I just don't see a reason to make the code less readable.

<1> Now that I am looking at mlx5_vsc_wait_on_flag() again, I realize
the code does not reset retries to 0 because it needs to check when it
reaches VSC_MAX_RETRIES. This is not the case here. I will update the
code as suggested.
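
To illustrate the point about the retry counter, here is a user-space sketch of the pattern described, not the driver code itself. The names wait_on_flag_sketch(), flag_after_countdown(), SKETCH_MAX_RETRIES and SKETCH_YIELD_MASK are hypothetical, the bound of 2048 is an assumed value, and a yield is modelled as a counter increment rather than a real sleep.

```c
#include <errno.h>
#include <stdbool.h>

#define SKETCH_MAX_RETRIES 2048u	/* assumed bound, stands in for VSC_MAX_RETRIES */
#define SKETCH_YIELD_MASK  0xfu		/* low 4 bits zero, i.e. every 16 retries */

/*
 * Poll flag_is_set() until it returns true or the retry budget runs out.
 * "retries" does two jobs: the mask picks the yield points and the same
 * counter bounds the loop, so it has to keep counting and cannot be reset.
 * Returns 0 on success and -EBUSY on timeout; *yields reports the number
 * of yield points that were hit.
 */
static int wait_on_flag_sketch(bool (*flag_is_set)(void *ctx), void *ctx,
			       unsigned int *yields)
{
	unsigned int retries = 0;

	*yields = 0;
	while (!flag_is_set(ctx)) {
		if (++retries > SKETCH_MAX_RETRIES)
			return -EBUSY;
		if ((retries & SKETCH_YIELD_MASK) == 0)
			(*yields)++;	/* the driver would release the cpu here */
	}
	return 0;
}

/* Stand-in for the register poll: reports "set" once the countdown hits 0. */
static bool flag_after_countdown(void *ctx)
{
	unsigned int *remaining = ctx;

	if (*remaining == 0)
		return true;
	(*remaining)--;
	return false;
}
```

A flag that becomes set after 40 polls returns 0 with two yield points (at retries 16 and 32). A flag that never becomes set returns -EBUSY after the budget is spent. In mlx5_vsc_gw_read_block_fast() there is no such bound on count, so resetting it to 0 after each yield loses nothing.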

> 
> > 
> >>
> >> BTW, why 128? Not 64, not 256 etc? You just picked it, I don't see any
> >> explanation in the commitmsg or here in the code why exactly 128. Have
> >> you tried different values?
> > 
> > This mostly subjective. For the numbers I saw in the lab, this will
> > release the cpu after ~4.51ms. If crdump takes ~5s, the code should
> > release the cpu after ~18.0ms. These numbers look reasonable to me.
> 
> So just mention in the commit message that you tried different values
> and 128 gave you the best results.

I will update the commit message in v3.
Patch

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c b/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
index 6b774e0c2766..bc6c38a68702 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
@@ -269,6 +269,7 @@  int mlx5_vsc_gw_read_block_fast(struct mlx5_core_dev *dev, u32 *data,
 {
 	unsigned int next_read_addr = 0;
 	unsigned int read_addr = 0;
+	unsigned int count = 0;
 
 	while (read_addr < length) {
 		if (mlx5_vsc_gw_read_fast(dev, read_addr, &next_read_addr,
@@ -276,6 +277,9 @@  int mlx5_vsc_gw_read_block_fast(struct mlx5_core_dev *dev, u32 *data,
 			return read_addr;
 
 		read_addr = next_read_addr;
+		/* Yield the cpu every 128 register read */
+		if ((++count & 0x7f) == 0)
+			cond_resched();
 	}
 	return length;
 }
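
The patch above is v2 as posted. The review asked for a named constant and a compare-and-reset counter instead of the 0x7f mask. The following user-space sketch compares the two counting styles; it is not the v3 patch. SKETCH_RESCHED_INTERVAL and both function names are hypothetical, and a call to cond_resched() is modelled as a counter increment.

```c
/* Hypothetical name; the review only asks for "a #define" with a comment. */
#define SKETCH_RESCHED_INTERVAL 128u	/* register reads between yield points */

/* Style of the posted patch: free-running counter tested against a mask. */
static unsigned int yields_with_mask(unsigned int reads)
{
	unsigned int count = 0, yields = 0, i;

	for (i = 0; i < reads; i++) {
		if ((++count & (SKETCH_RESCHED_INTERVAL - 1)) == 0)
			yields++;	/* cond_resched() in the kernel */
	}
	return yields;
}

/* Style suggested in review: compare against the constant, then reset. */
static unsigned int yields_with_reset(unsigned int reads)
{
	unsigned int count = 0, yields = 0, i;

	for (i = 0; i < reads; i++) {
		if (++count == SKETCH_RESCHED_INTERVAL) {
			yields++;	/* cond_resched() in the kernel */
			count = 0;
		}
	}
	return yields;
}
```

For the 53813 reads measured in the lab, both forms hit 420 yield points, so the choice is about readability rather than behaviour. The mask form only works when the interval is a power of two; the compare-and-reset form works for any interval.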