diff mbox

[v1,2/8] platform/mellanox: mlxreg-hotplug: Improve mechanism of ASIC health discovery

Message ID 1531060324-116592-3-git-send-email-vadimp@mellanox.com (mailing list archive)
State Changes Requested, archived
Delegated to: Darren Hart
Headers show

Commit Message

Vadim Pasternak July 8, 2018, 2:31 p.m. UTC
Extend the logic of ASIC health discovery.
ASIC device can indicate its health state as a good, booting or dormant.
During ASIC reset the device is dropped to dormant state and should get to
the stable good health state through the intermediate booting state.
The sequence for getting to the steady state health after reset is:
dormant -> booting -> [booting -> good -> dormant -> booting] -> good.
The sequence in brackets can be repeated several times.
Initial implementation assumes that the this sequence is always repeated
by the constant number. This patch allows to perform health steady state
discovery independently of the repetitions number.

Signed-off-by: Vadim Pasternak <vadimp@mellanox.com>
---
 drivers/platform/mellanox/mlxreg-hotplug.c | 38 +++++++++++++++++++++++++++---
 1 file changed, 35 insertions(+), 3 deletions(-)

Comments

Darren Hart July 25, 2018, 5:28 p.m. UTC | #1
On Sun, Jul 08, 2018 at 02:31:58PM +0000, Vadim Pasternak wrote:
> Extend the logic of ASIC health discovery.
> ASIC device can indicate its health state as a good, booting or dormant.
> During ASIC reset the device is dropped to dormant state and should get to
> the stable good health state through the intermediate booting state.
> The sequence for getting to the steady state health after reset is:
> dormant -> booting -> [booting -> good -> dormant -> booting] -> good.
> The sequence in brackets can be repeated several times.
> Initial implementation assumes that the this sequence is always repeated
> by the constant number. This patch allows to perform health steady state
> discovery independently of the repetitions number.
> 
> Signed-off-by: Vadim Pasternak <vadimp@mellanox.com>
> ---
>  drivers/platform/mellanox/mlxreg-hotplug.c | 38 +++++++++++++++++++++++++++---
>  1 file changed, 35 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/platform/mellanox/mlxreg-hotplug.c b/drivers/platform/mellanox/mlxreg-hotplug.c
> index ac97aa0..f32637e 100644
> --- a/drivers/platform/mellanox/mlxreg-hotplug.c
> +++ b/drivers/platform/mellanox/mlxreg-hotplug.c
> @@ -51,8 +51,9 @@
>  #define MLXREG_HOTPLUG_AGGR_MASK_OFF	1
>  
>  /* ASIC health parameters. */
> +#define MLXREG_HOTPLUG_DOWN_MASK	0x00
>  #define MLXREG_HOTPLUG_HEALTH_MASK	0x02
> -#define MLXREG_HOTPLUG_RST_CNTR		3
> +#define MLXREG_HOTPLUG_RST_CNTR		2
>  
>  #define MLXREG_HOTPLUG_ATTRS_MAX	24
>  #define MLXREG_HOTPLUG_NOT_ASSERT	3
> @@ -325,20 +326,51 @@ mlxreg_hotplug_health_work_helper(struct mlxreg_hotplug_priv_data *priv,
>  			goto out;
>  
>  		regval &= data->mask;
> -		item->cache = regval;
> +		/*
> +		 * ASIC health indication is provided through two bits. Bits
> +		 * value 0x2 indicates that ASIC reached the good health, value
> +		 * 0x0 indicates ASIC the bad health or dormant state and value
> +		 * 0x2 indicates the booting state. During ASIC reset it should

A bit odd to have specific values in paragraph form in comments,
generally I'd expect to see these as defines, but as this is a counter
and we aren't using these specific values in the logic... OK - but
perhaps something more obvious like an itemized list:

		  0: Dormant
		  1: Booting
		  2: Good

Note that above the text states that 0x2 is both "good" and "booting",
we'll need to at least correct that. Since it is a counter, and not a
bitmask, the 0x prefix isn't needed in my opinion.

> +		 * pass the following states: dormant -> booting -> good.
> +		 * The transition from dormant to booting state and from
> +		 * booting to good state are indicated by ASIC twice, so actual
> +		 * sequence for getting to the steady state after reset is:
> +		 * dormant -> booting -> booting -> good -> good. It is
> +		 * possible that due to some hardware noise, the transition
> +		 * sequence will look like: dormant -> booting -> [ booting ->
> +		 * good -> dormant -> booting ] -> good -> good.
> +		 */

What does it mean to go from good -> good ?? Is that an intention transition to
the same state? It seems the above state transition diagram could be simplified
considerably like:

(dormant -> booting -> good)+

Where the + indicates this occurs 1 or more times. If this isn't
accurate - what am I missing?

>  		if (regval == MLXREG_HOTPLUG_HEALTH_MASK) {
> -			if ((data->health_cntr++ == MLXREG_HOTPLUG_RST_CNTR) ||
> +			if ((++data->health_cntr == MLXREG_HOTPLUG_RST_CNTR) ||

This is fairly confusing to read, 2 indicates good (I think, see above),
but we're comparing it to "RST_CNTR" - what does that mean? Using
another define which is more explicit would help with legibility.

Hrm... both HEALTH_MASK and RST_CNTR are defined as 2.... which of these
is testing for the good state?

>  			    !priv->after_probe) {
> +				/*
> +				 * ASIC is in steady state. Connect associated
> +				 * device, if configured.
> +				 */
>  				mlxreg_hotplug_device_create(priv, data);
>  				data->attached = true;
>  			}
>  		} else {
>  			if (data->attached) {
> +				/*
> +				 * ASIC health is dropped after ASIC has been

What does "health is dropped" mean? Is this a state transition? good to
dormant? If so, using the same terminology would help legibility.

> +				 * in steady state. Disconnect associated
> +				 * device, if it has been connected.
> +				 */
>  				mlxreg_hotplug_device_destroy(data);
>  				data->attached = false;
>  				data->health_cntr = 0;
> +			} else if (regval == MLXREG_HOTPLUG_DOWN_MASK &&
> +				   item->cache == MLXREG_HOTPLUG_HEALTH_MASK) {
> +				/*
> +				 * Decrease counter, if health has been dropped
> +				 * before ASIC reaches the steady state, like:
> +				 * good -> dormant -> booting.
> +				 */
> +				data->health_cntr--;
>  			}
>  		}
> +		item->cache = regval;

I found this to be pretty confusing. The description of the states above
don't seem to map readily to the logic above.

>  
>  		/* Acknowledge event. */
>  		ret = regmap_write(priv->regmap, data->reg +
> -- 
> 2.1.4
> 
>
Vadim Pasternak July 26, 2018, 6:31 p.m. UTC | #2
> -----Original Message-----
> From: Darren Hart [mailto:dvhart@infradead.org]
> Sent: Wednesday, July 25, 2018 8:28 PM
> To: Vadim Pasternak <vadimp@mellanox.com>
> Cc: andy.shevchenko@gmail.com; gregkh@linuxfoundation.org; platform-
> driver-x86@vger.kernel.org; jiri@resnulli.us; Michael Shych
> <michaelsh@mellanox.com>; ivecera@redhat.com
> Subject: Re: [PATCH v1 2/8] platform/mellanox: mlxreg-hotplug: Improve
> mechanism of ASIC health discovery
> 
> On Sun, Jul 08, 2018 at 02:31:58PM +0000, Vadim Pasternak wrote:
> > Extend the logic of ASIC health discovery.
> > ASIC device can indicate its health state as a good, booting or dormant.
> > During ASIC reset the device is dropped to dormant state and should
> > get to the stable good health state through the intermediate booting state.
> > The sequence for getting to the steady state health after reset is:
> > dormant -> booting -> [booting -> good -> dormant -> booting] -> good.
> > The sequence in brackets can be repeated several times.
> > Initial implementation assumes that the this sequence is always
> > repeated by the constant number. This patch allows to perform health
> > steady state discovery independently of the repetitions number.
> >
> > Signed-off-by: Vadim Pasternak <vadimp@mellanox.com>
> > ---
> >  drivers/platform/mellanox/mlxreg-hotplug.c | 38
> > +++++++++++++++++++++++++++---
> >  1 file changed, 35 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/platform/mellanox/mlxreg-hotplug.c
> > b/drivers/platform/mellanox/mlxreg-hotplug.c
> > index ac97aa0..f32637e 100644
> > --- a/drivers/platform/mellanox/mlxreg-hotplug.c
> > +++ b/drivers/platform/mellanox/mlxreg-hotplug.c
> > @@ -51,8 +51,9 @@
> >  #define MLXREG_HOTPLUG_AGGR_MASK_OFF	1
> >
> >  /* ASIC health parameters. */
> > +#define MLXREG_HOTPLUG_DOWN_MASK	0x00
> >  #define MLXREG_HOTPLUG_HEALTH_MASK	0x02
> > -#define MLXREG_HOTPLUG_RST_CNTR		3
> > +#define MLXREG_HOTPLUG_RST_CNTR		2
> >
> >  #define MLXREG_HOTPLUG_ATTRS_MAX	24
> >  #define MLXREG_HOTPLUG_NOT_ASSERT	3
> > @@ -325,20 +326,51 @@ mlxreg_hotplug_health_work_helper(struct
> mlxreg_hotplug_priv_data *priv,
> >  			goto out;
> >
> >  		regval &= data->mask;
> > -		item->cache = regval;
> > +		/*
> > +		 * ASIC health indication is provided through two bits. Bits
> > +		 * value 0x2 indicates that ASIC reached the good health, value
> > +		 * 0x0 indicates ASIC the bad health or dormant state and value
> > +		 * 0x2 indicates the booting state. During ASIC reset it should
> 
> A bit odd to have specific values in paragraph form in comments, generally I'd
> expect to see these as defines, but as this is a counter and we aren't using these
> specific values in the logic... OK - but perhaps something more obvious like an
> itemized list:

Hi Darren,

Thank you for review.

My mistake in comment. States from HW are:
	b00: Dormant
	b01: N/A
	b10: Good
	b11: Booting

These are bitmask actually. I'll fix the comment.

> 
> 		  0: Dormant
> 		  1: Booting
> 		  2: Good
> 
> Note that above the text states that 0x2 is both "good" and "booting", we'll
> need to at least correct that. Since it is a counter, and not a bitmask, the 0x
> prefix isn't needed in my opinion.
> 
> > +		 * pass the following states: dormant -> booting -> good.
> > +		 * The transition from dormant to booting state and from
> > +		 * booting to good state are indicated by ASIC twice, so actual
> > +		 * sequence for getting to the steady state after reset is:
> > +		 * dormant -> booting -> booting -> good -> good. It is
> > +		 * possible that due to some hardware noise, the transition
> > +		 * sequence will look like: dormant -> booting -> [ booting ->
> > +		 * good -> dormant -> booting ] -> good -> good.
> > +		 */
> 
> What does it mean to go from good -> good ?? Is that an intention transition to
> the same state? It seems the above state transition diagram could be simplified
> considerably like:
> 
> (dormant -> booting -> good)+

OK. I'll fix it.

Regarding booting - booting - good - good transition.
I retested it again with scope and this is just some noise on line and I can ignore
same -> same transition.
I'll modify this patch and re-send.
It also means that I can drop MLXREG_HOTPLUG_RST_CNTR at all.

> 
> Where the + indicates this occurs 1 or more times. If this isn't accurate - what
> am I missing?
> 
> >  		if (regval == MLXREG_HOTPLUG_HEALTH_MASK) {
> > -			if ((data->health_cntr++ ==
> MLXREG_HOTPLUG_RST_CNTR) ||
> > +			if ((++data->health_cntr ==
> MLXREG_HOTPLUG_RST_CNTR) ||
> 
> This is fairly confusing to read, 2 indicates good (I think, see above), but we're
> comparing it to "RST_CNTR" - what does that mean? Using another define which
> is more explicit would help with legibility.
> 
> Hrm... both HEALTH_MASK and RST_CNTR are defined as 2.... which of these is
> testing for the good state?
> 
> >  			    !priv->after_probe) {
> > +				/*
> > +				 * ASIC is in steady state. Connect associated
> > +				 * device, if configured.
> > +				 */
> >  				mlxreg_hotplug_device_create(priv, data);
> >  				data->attached = true;
> >  			}
> >  		} else {
> >  			if (data->attached) {
> > +				/*
> > +				 * ASIC health is dropped after ASIC has been
> 
> What does "health is dropped" mean? Is this a state transition? good to
> dormant? If so, using the same terminology would help legibility.
> 
> > +				 * in steady state. Disconnect associated
> > +				 * device, if it has been connected.
> > +				 */
> >  				mlxreg_hotplug_device_destroy(data);
> >  				data->attached = false;
> >  				data->health_cntr = 0;
> > +			} else if (regval == MLXREG_HOTPLUG_DOWN_MASK
> &&
> > +				   item->cache ==
> MLXREG_HOTPLUG_HEALTH_MASK) {
> > +				/*
> > +				 * Decrease counter, if health has been
> dropped
> > +				 * before ASIC reaches the steady state, like:
> > +				 * good -> dormant -> booting.
> > +				 */
> > +				data->health_cntr--;
> >  			}
> >  		}
> > +		item->cache = regval;
> 
> I found this to be pretty confusing. The description of the states above don't
> seem to map readily to the logic above.
> 
> >
> >  		/* Acknowledge event. */
> >  		ret = regmap_write(priv->regmap, data->reg +
> > --
> > 2.1.4
> >
> >
> 
> --
> Darren Hart
> VMware Open Source Technology Center
diff mbox

Patch

diff --git a/drivers/platform/mellanox/mlxreg-hotplug.c b/drivers/platform/mellanox/mlxreg-hotplug.c
index ac97aa0..f32637e 100644
--- a/drivers/platform/mellanox/mlxreg-hotplug.c
+++ b/drivers/platform/mellanox/mlxreg-hotplug.c
@@ -51,8 +51,9 @@ 
 #define MLXREG_HOTPLUG_AGGR_MASK_OFF	1
 
 /* ASIC health parameters. */
+#define MLXREG_HOTPLUG_DOWN_MASK	0x00
 #define MLXREG_HOTPLUG_HEALTH_MASK	0x02
-#define MLXREG_HOTPLUG_RST_CNTR		3
+#define MLXREG_HOTPLUG_RST_CNTR		2
 
 #define MLXREG_HOTPLUG_ATTRS_MAX	24
 #define MLXREG_HOTPLUG_NOT_ASSERT	3
@@ -325,20 +326,51 @@  mlxreg_hotplug_health_work_helper(struct mlxreg_hotplug_priv_data *priv,
 			goto out;
 
 		regval &= data->mask;
-		item->cache = regval;
+		/*
+		 * ASIC health indication is provided through two bits. Bits
+		 * value 0x2 indicates that ASIC reached the good health, value
+		 * 0x0 indicates ASIC the bad health or dormant state and value
+		 * 0x2 indicates the booting state. During ASIC reset it should
+		 * pass the following states: dormant -> booting -> good.
+		 * The transition from dormant to booting state and from
+		 * booting to good state are indicated by ASIC twice, so actual
+		 * sequence for getting to the steady state after reset is:
+		 * dormant -> booting -> booting -> good -> good. It is
+		 * possible that due to some hardware noise, the transition
+		 * sequence will look like: dormant -> booting -> [ booting ->
+		 * good -> dormant -> booting ] -> good -> good.
+		 */
 		if (regval == MLXREG_HOTPLUG_HEALTH_MASK) {
-			if ((data->health_cntr++ == MLXREG_HOTPLUG_RST_CNTR) ||
+			if ((++data->health_cntr == MLXREG_HOTPLUG_RST_CNTR) ||
 			    !priv->after_probe) {
+				/*
+				 * ASIC is in steady state. Connect associated
+				 * device, if configured.
+				 */
 				mlxreg_hotplug_device_create(priv, data);
 				data->attached = true;
 			}
 		} else {
 			if (data->attached) {
+				/*
+				 * ASIC health is dropped after ASIC has been
+				 * in steady state. Disconnect associated
+				 * device, if it has been connected.
+				 */
 				mlxreg_hotplug_device_destroy(data);
 				data->attached = false;
 				data->health_cntr = 0;
+			} else if (regval == MLXREG_HOTPLUG_DOWN_MASK &&
+				   item->cache == MLXREG_HOTPLUG_HEALTH_MASK) {
+				/*
+				 * Decrease counter, if health has been dropped
+				 * before ASIC reaches the steady state, like:
+				 * good -> dormant -> booting.
+				 */
+				data->health_cntr--;
 			}
 		}
+		item->cache = regval;
 
 		/* Acknowledge event. */
 		ret = regmap_write(priv->regmap, data->reg +