platform/chrome: cros_ec_spi: Transfer messages at high priority

Message ID 20190402224445.64823-1-dianders@chromium.org
State New, archived
Series platform/chrome: cros_ec_spi: Transfer messages at high priority

Commit Message

Doug Anderson April 2, 2019, 10:44 p.m. UTC
The software running on the Chrome OS Embedded Controller (cros_ec)
handles SPI transfers in a bit of a wonky way.  Specifically, if the EC
sees too long a delay in a SPI transfer it will give up and the
transfer will be counted as failed.  Unfortunately the timeout is
fairly short, though the actual number may be different for different
EC codebases.

We can end up tripping the timeout pretty easily if we happen to
preempt the task running the SPI transfer and don't get back to it for
a little while.

Historically this hasn't been a _huge_ deal because:
1. On old devices Chrome OS used to run PREEMPT_VOLUNTARY.  That meant
   we were pretty unlikely to take a big break from the transfer.
2. On recent devices we had faster / more processors.
3. Recent devices didn't use "cros-ec-spi-pre-delay".  Using that
   delay makes us more likely to trip this use case.
4. For whatever reason (I didn't dig), old kernels seem to be less
   likely to trip this.
5. For the most part it's kinda OK if a few transfers to the EC fail.
   Mostly we're just polling the battery or doing some other task
   where we'll try again.

Even with all of the above, this issue has reared its ugly head
periodically.  We could solve this in a nice way by adding reliable
retries to the EC protocol [1] or by re-designing the code in the EC
codebase to allow it to wait longer, but that code doesn't ever seem
to get changed.  ...and even if it did, it wouldn't help old devices.

It's now time to finally take a crack at making this a little better.
This patch isn't guaranteed to make every cros_ec SPI transfer
perfect, but it should improve things by a few orders of magnitude.
Specifically you can try this on a rk3288-veyron Chromebook (which is
slower and also _does_ need "cros-ec-spi-pre-delay"):
  md5sum /dev/zero &
  md5sum /dev/zero &
  md5sum /dev/zero &
  md5sum /dev/zero &
  while true; do
    cat /sys/class/power_supply/sbs-20-000b/charge_now > /dev/null;
  done
...before this patch you'll see boatloads of errors.  After this patch I
don't see any in the testing I did.

The way this patch works is by effectively boosting the priority of
the cros_ec transfers.  As far as I know there is no simple way to
just boost the priority of the current process temporarily, so the way
we accomplish this is by creating a "WQ_HIGHPRI" workqueue and doing
the transfers there.
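
For reference, the core of the trick is just this (a trimmed-down
sketch; the full version, including devres-managed cleanup of the
workqueue, is in the patch below):

  /* Workers of a WQ_HIGHPRI workqueue run at nice -20, so work queued
   * here is much less likely to be preempted by ordinary tasks.
   */
  ec_spi->high_pri_wq = alloc_workqueue("cros_ec_spi", WQ_HIGHPRI, 1);

  /* Bounce each transfer through that workqueue; flush_workqueue()
   * waits until the queued work (and thus the transfer) has finished.
   */
  queue_work(ec_spi->high_pri_wq, &params.work);
  flush_workqueue(ec_spi->high_pri_wq);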

NOTE: this patch relies on the fact that the SPI framework attempts to
push the messages out on the calling context (which is the one that is
boosted to high priority).  As I understand from earlier (long ago)
discussions with Mark Brown this should be a fine assumption.  Even if
that isn't always true, this patch still won't make things worse.

[1] https://crbug.com/678675

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---

 drivers/platform/chrome/cros_ec_spi.c | 107 ++++++++++++++++++++++++--
 1 file changed, 101 insertions(+), 6 deletions(-)

Comments

Matthias Kaehlcke April 2, 2019, 11:19 p.m. UTC | #1
Hi Doug,

On Tue, Apr 02, 2019 at 03:44:44PM -0700, Douglas Anderson wrote:
> [...]
>
> diff --git a/drivers/platform/chrome/cros_ec_spi.c b/drivers/platform/chrome/cros_ec_spi.c
> index ffc38f9d4829..101f2deb7d3c 100644
> --- a/drivers/platform/chrome/cros_ec_spi.c
> +++ b/drivers/platform/chrome/cros_ec_spi.c
>
> ...
>
> +static int cros_ec_pkt_xfer_spi(struct cros_ec_device *ec_dev,
> +				struct cros_ec_command *ec_msg)
> +{
> +	struct cros_ec_spi *ec_spi = ec_dev->priv;
> +	struct cros_ec_xfer_work_params params;
> +
> +	INIT_WORK(&params.work, cros_ec_pkt_xfer_spi_work);
> +	params.ec_dev = ec_dev;
> +	params.ec_msg = ec_msg;
> +
> +	queue_work(ec_spi->high_pri_wq, &params.work);
> +	flush_workqueue(ec_spi->high_pri_wq);

IIRC dedicated workqueues should be avoided unless they are needed. In
this case it seems you could use system_highpri_wq + a
completion. This would add a few extra lines to deal with the
completion; in exchange, the code to create the workqueue could be
removed.
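
An untested sketch of what I have in mind (the params struct mirrors
yours, with a completion added; names are just for illustration):

  struct cros_ec_xfer_work_params {
  	struct work_struct work;
  	struct completion completion;
  	struct cros_ec_device *ec_dev;
  	struct cros_ec_command *ec_msg;
  	int ret;
  };

  static void cros_ec_pkt_xfer_spi_work(struct work_struct *work)
  {
  	struct cros_ec_xfer_work_params *params =
  		container_of(work, struct cros_ec_xfer_work_params, work);

  	params->ret = do_cros_ec_pkt_xfer_spi(params->ec_dev,
  					      params->ec_msg);
  	complete(&params->completion);
  }

  static int cros_ec_pkt_xfer_spi(struct cros_ec_device *ec_dev,
  				struct cros_ec_command *ec_msg)
  {
  	struct cros_ec_xfer_work_params params = {
  		.ec_dev = ec_dev,
  		.ec_msg = ec_msg,
  	};

  	init_completion(&params.completion);
  	INIT_WORK_ONSTACK(&params.work, cros_ec_pkt_xfer_spi_work);

  	/* No dedicated workqueue: run on the system high-prio pool
  	 * and wait for the result via the completion.
  	 */
  	queue_work(system_highpri_wq, &params.work);
  	wait_for_completion(&params.completion);
  	destroy_work_on_stack(&params.work);

  	return params.ret;
  }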

> +	return params.ret;
> +}
> +
> +static void cros_ec_cmd_xfer_spi_work(struct work_struct *work)
> +{
> +	struct cros_ec_xfer_work_params *params;
> +
> +	params = container_of(work, struct cros_ec_xfer_work_params, work);
> +	params->ret = do_cros_ec_cmd_xfer_spi(params->ec_dev, params->ec_msg);
> +}
> +
> +static int cros_ec_cmd_xfer_spi(struct cros_ec_device *ec_dev,
> +				struct cros_ec_command *ec_msg)
> +{
> +	struct cros_ec_spi *ec_spi = ec_dev->priv;
> +	struct cros_ec_xfer_work_params params;
> +
> +	INIT_WORK(&params.work, cros_ec_cmd_xfer_spi_work);
> +	params.ec_dev = ec_dev;
> +	params.ec_msg = ec_msg;
> +
> +	queue_work(ec_spi->high_pri_wq, &params.work);
> +	flush_workqueue(ec_spi->high_pri_wq);
> +
> +	return params.ret;
> +}

This is essentially a copy of cros_ec_pkt_xfer_spi() above. You
could add a wrapper that receives the work function to avoid the
duplicate code.
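
Something like this (untested, the helper name is just a suggestion):

  static int cros_ec_xfer_high_pri(struct cros_ec_device *ec_dev,
  				 struct cros_ec_command *ec_msg,
  				 work_func_t fn)
  {
  	struct cros_ec_spi *ec_spi = ec_dev->priv;
  	struct cros_ec_xfer_work_params params;

  	INIT_WORK(&params.work, fn);
  	params.ec_dev = ec_dev;
  	params.ec_msg = ec_msg;

  	queue_work(ec_spi->high_pri_wq, &params.work);
  	flush_workqueue(ec_spi->high_pri_wq);

  	return params.ret;
  }

  static int cros_ec_pkt_xfer_spi(struct cros_ec_device *ec_dev,
  				struct cros_ec_command *ec_msg)
  {
  	return cros_ec_xfer_high_pri(ec_dev, ec_msg,
  				     cros_ec_pkt_xfer_spi_work);
  }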

Cheers

Matthias
Doug Anderson April 2, 2019, 11:38 p.m. UTC | #2
Hi,

On Tue, Apr 2, 2019 at 4:19 PM Matthias Kaehlcke <mka@chromium.org> wrote:
>
> Hi Doug,
>
> On Tue, Apr 02, 2019 at 03:44:44PM -0700, Douglas Anderson wrote:
> > [...]
> >
> > +static int cros_ec_pkt_xfer_spi(struct cros_ec_device *ec_dev,
> > +                             struct cros_ec_command *ec_msg)
> > +{
> > +     struct cros_ec_spi *ec_spi = ec_dev->priv;
> > +     struct cros_ec_xfer_work_params params;
> > +
> > +     INIT_WORK(&params.work, cros_ec_pkt_xfer_spi_work);
> > +     params.ec_dev = ec_dev;
> > +     params.ec_msg = ec_msg;
> > +
> > +     queue_work(ec_spi->high_pri_wq, &params.work);
> > +     flush_workqueue(ec_spi->high_pri_wq);
>
> IIRC dedicated workqueues should be avoided unless they are needed. In
> this case it seems you could use system_highpri_wq + a
> completion. This would add a few extra lines to deal with the
> completion, in exchange the code to create the workqueue could be
> removed.

I'm not convinced using the "system_highpri_wq" is a great idea here.
Using flush_workqueue() on the "system_highpri_wq" seems like a recipe
for deadlock but I need to flush to get the result back.  See the
comments in flush_scheduled_work() for some discussion here.

I guess you're suggesting using a completion instead of the flush but
I think the deadlock potentials are the same.  If we're currently
running on the "system_highpri_wq" (because one of our callers
happened to be on it) or there are some shared resources between
another user of the "system_highpri_wq" and us, then we'll just sit
waiting for the completion, won't we?

I would bet that currently nobody actually ends up in this situation
because there aren't lots of users of the "system_highpri_wq", but it
still doesn't seem like a good design.  Is it really that expensive to
have our own workqueue?


> > +     return params.ret;
> > +}
> > +
> > +static void cros_ec_cmd_xfer_spi_work(struct work_struct *work)
> > +{
> > +     struct cros_ec_xfer_work_params *params;
> > +
> > +     params = container_of(work, struct cros_ec_xfer_work_params, work);
> > +     params->ret = do_cros_ec_cmd_xfer_spi(params->ec_dev, params->ec_msg);
> > +}
> > +
> > +static int cros_ec_cmd_xfer_spi(struct cros_ec_device *ec_dev,
> > +                             struct cros_ec_command *ec_msg)
> > +{
> > +     struct cros_ec_spi *ec_spi = ec_dev->priv;
> > +     struct cros_ec_xfer_work_params params;
> > +
> > +     INIT_WORK(&params.work, cros_ec_cmd_xfer_spi_work);
> > +     params.ec_dev = ec_dev;
> > +     params.ec_msg = ec_msg;
> > +
> > +     queue_work(ec_spi->high_pri_wq, &params.work);
> > +     flush_workqueue(ec_spi->high_pri_wq);
> > +
> > +     return params.ret;
> > +}
>
> This is essentially a copy of cros_ec_pkt_xfer_spi() above. You
> could add a wrapper that receives the work function to avoid the
> duplicate code.

Good point.  I'll wait a day or two for more feedback and then post a
new version with that change.

-Doug
Matthias Kaehlcke April 3, 2019, 12:21 a.m. UTC | #3
On Tue, Apr 02, 2019 at 04:38:29PM -0700, Doug Anderson wrote:
> Hi,
> 
> On Tue, Apr 2, 2019 at 4:19 PM Matthias Kaehlcke <mka@chromium.org> wrote:
> >
> > Hi Doug,
> >
> > On Tue, Apr 02, 2019 at 03:44:44PM -0700, Douglas Anderson wrote:
> > > [...]
> > >
> > > +static int cros_ec_pkt_xfer_spi(struct cros_ec_device *ec_dev,
> > > +                             struct cros_ec_command *ec_msg)
> > > +{
> > > +     struct cros_ec_spi *ec_spi = ec_dev->priv;
> > > +     struct cros_ec_xfer_work_params params;
> > > +
> > > +     INIT_WORK(&params.work, cros_ec_pkt_xfer_spi_work);
> > > +     params.ec_dev = ec_dev;
> > > +     params.ec_msg = ec_msg;
> > > +
> > > +     queue_work(ec_spi->high_pri_wq, &params.work);
> > > +     flush_workqueue(ec_spi->high_pri_wq);
> >
> > IIRC dedicated workqueues should be avoided unless they are needed. In
> > this case it seems you could use system_highpri_wq + a
> > completion. This would add a few extra lines to deal with the
> > completion, in exchange the code to create the workqueue could be
> > removed.
> 
> I'm not convinced using the "system_highpri_wq" is a great idea here.
> Using flush_workqueue() on the "system_highpri_wq" seems like a recipe
> for deadlock but I need to flush to get the result back.  See the
> comments in flush_scheduled_work() for some discussion here.
> 
> I guess you're suggesting using a completion instead of the flush but
> I think the deadlock potentials are the same.  If we're currently
> running on the "system_highpri_wq" (because one of our callers
> happened to be on it) or there are some shared resources between
> another user of the "system_highpri_wq" and us, then we'll just sit
> waiting for the completion, won't we?

I'm no workqueue expert, but I think the deadlock potential isn't the same:

With flush_workqueue() the deadlock would occur when running as work
item of the same workqueue, i.e. the work is waiting for itself.

If we are running on the "system_highpri_wq", schedule a new work on
this workqueue and wait for it, the Concurrency Managed Workqueue
(cmwq) machinery will launch a worker for our work; that worker can
run while we are waiting, and we are woken up when the work is done.

(https://www.kernel.org/doc/html/v5.0/core-api/workqueue.html)

Other users of "system_highpri_wq" shouldn't cause long delays, unless
they are CPU hogs, which could/should be considered a bug.

> I would bet that currently nobody actually ends up in this situation
> because there aren't lots of users of the "system_highpri_wq", but it
> still doesn't seem like a good design.  Is it really that expensive to
> have our own workqueue?

I don't think it's excessively expensive, but why use the extra
resources and lifetime management code if it doesn't provide any
significant advantages? In terms of deadlocks I even have the
impression the wq + completion is a more robust solution.
Guenter Roeck April 3, 2019, 3:17 a.m. UTC | #4
On Tue, Apr 2, 2019 at 4:38 PM Doug Anderson <dianders@chromium.org> wrote:
>
> Hi,
>
> On Tue, Apr 2, 2019 at 4:19 PM Matthias Kaehlcke <mka@chromium.org> wrote:
> >
> > Hi Doug,
> >
> > On Tue, Apr 02, 2019 at 03:44:44PM -0700, Douglas Anderson wrote:
> > > [...]
> > >
> > > +static int cros_ec_pkt_xfer_spi(struct cros_ec_device *ec_dev,
> > > +                             struct cros_ec_command *ec_msg)
> > > +{
> > > +     struct cros_ec_spi *ec_spi = ec_dev->priv;
> > > +     struct cros_ec_xfer_work_params params;
> > > +
> > > +     INIT_WORK(&params.work, cros_ec_pkt_xfer_spi_work);
> > > +     params.ec_dev = ec_dev;
> > > +     params.ec_msg = ec_msg;
> > > +
> > > +     queue_work(ec_spi->high_pri_wq, &params.work);
> > > +     flush_workqueue(ec_spi->high_pri_wq);
> >
> > IIRC dedicated workqueues should be avoided unless they are needed. In
> > this case it seems you could use system_highpri_wq + a
> > completion. This would add a few extra lines to deal with the
> > completion, in exchange the code to create the workqueue could be
> > removed.
>
> I'm not convinced using the "system_highpri_wq" is a great idea here.
> Using flush_workqueue() on the "system_highpri_wq" seems like a recipe
> for deadlock but I need to flush to get the result back.  See the
> comments in flush_scheduled_work() for some discussion here.
>

Given that high priority workqueues are used all over the place, and
system_highpri_wq is only rarely used, hijacking the latter doesn't
seem to be such a good idea to me either. I also recall that we had to
drop using system workqueues at a previous company and replace them
with local workqueues because we got into timing trouble when using
system workqueues.

Having said that, the combination of queue_work() immediately followed
by flush_workqueue() seems odd and appears to violate the whole idea of
work _queues_. I wonder if there is a better solution available to
solve problems like this. I also wonder if we solve a problem here, or
if we work around its symptoms. AFAICS, the delay translates into a
udelay(), meaning an active wait loop. Does anyone understand why
this translates into an SPI transfer error, and how using a workqueue
(which I guess may offload the active wait to another CPU) solves that
problem? Also, are we sure that this isn't a problem with the SPI
driver?

Guenter
Doug Anderson April 3, 2019, 4:15 a.m. UTC | #5
Hi,

On Tue, Apr 2, 2019 at 8:17 PM Guenter Roeck <groeck@google.com> wrote:
>
> On Tue, Apr 2, 2019 at 4:38 PM Doug Anderson <dianders@chromium.org> wrote:
> >
> > Hi,
> >
> > On Tue, Apr 2, 2019 at 4:19 PM Matthias Kaehlcke <mka@chromium.org> wrote:
> > >
> > > Hi Doug,
> > >
> > > On Tue, Apr 02, 2019 at 03:44:44PM -0700, Douglas Anderson wrote:
> > > > [...]
> > >
> > > IIRC dedicated workqueues should be avoided unless they are needed. In
> > > this case it seems you could use system_highpri_wq + a
> > > completion. This would add a few extra lines to deal with the
> > > completion, in exchange the code to create the workqueue could be
> > > removed.
> >
> > I'm not convinced using the "system_highpri_wq" is a great idea here.
> > Using flush_workqueue() on the "system_highpri_wq" seems like a recipe
> > for deadlock but I need to flush to get the result back.  See the
> > comments in flush_scheduled_work() for some discussion here.
> >
>
> Given that high priority workqueues are used all over the place, and
> system_highpri_wq is only rarely used, hijacking the latter doesn't
> seem to be such a good idea to me either. I also recall that we had to
> drop using system workqueues at a previous company and replace them
> with local workqueues because we got into timing trouble when using
> system workqueues.

I think I need to dig into workqueues more after reading Matthias's
comments.  Presumably I'm misunderstanding something about them.  I
think I was expecting that there was just one worker, or one worker per
CPU, and thus we'd be blocking others, but I guess I was just confused.


> Having said that, the combination of queue_work() immediately followed
> by flush_workqueue() seems odd and appears to violate the whole idea of
> work _queues_.

Yeah, this is talked about a little in the commit message where I say:
"The way this patch works is by effectively boosting the priority of
the cros_ec transfers.  As far as I know there is no simple way to
just boost the priority of the current process temporarily so the way
we accomplish this is by creating a "WQ_HIGHPRI" workqueue and doing
the transfers there."

...if there is a better way to boost the priority of the transfer then
I'm all ears, but last I checked it wasn't considered OK to just mess
with/restore the current task's priority.


> I wonder if there is a better solution available to
> solve problems like this. I also wonder if we solve a problem here, or
> if we work around its symptoms. AFAICS, the delay translates into a
> udelay(), meaning an active wait loop. Does anyone understand why
> this translates into an SPI transfer error, and how using a workqueue
> (which I guess may offload the active wait to another CPU) solves that
> problem?

People have done digging on this in the past and I can try to find all
the research, but basically what was found out was that the task
running the SPI transfer can get preempted.  If the CPU is busy it
might be a while before that task gets the CPU again.  ...and the
longer we spend in the transfer (AKA if we have cros-ec-spi-pre-delay)
the bigger the chance of this happening.
it's known that the EC is very sensitive to delays like this and will
give up on the transfer.  Yes, the EC ought to be more lenient but it
just isn't and we definitely can't change the old ECs.

I know for sure people did logic analyzer traces to show this.  They
saw these big delays and the only explanation was this preemption.


> Also, are we sure that this isn't a problem with the SPI
> driver?

I'm 99.9% sure that this is not a problem with the SPI driver.  The
cros_ec protocol is just weird and to do a transfer the cros_ec SPI
code has to string together several separate SPI transfers.  Each of
these separate transfers is something where the CPU needs to get
involved and take the next step.
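
From memory, the flow looks roughly like this (a heavily simplified
sketch; saw_frame_start() and timed_out() are made-up helpers here --
see do_cros_ec_pkt_xfer_spi() in the patch for the real logic):

  /* One cros_ec transaction is several back-to-back spi_sync() calls,
   * each of which the CPU must issue promptly or the EC gives up:
   */
  spi_sync(ec_spi->spi, &msg_out);	/* send the request */
  do {
  	spi_sync(ec_spi->spi, &msg_in);	/* poll for the frame-start byte */
  } while (!saw_frame_start(rx_buf) && !timed_out());
  spi_sync(ec_spi->spi, &msg_resp);	/* read the response itself */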


-Doug
Matthias Kaehlcke April 3, 2019, 5:03 p.m. UTC | #6
On Tue, Apr 02, 2019 at 08:17:03PM -0700, Guenter Roeck wrote:
> On Tue, Apr 2, 2019 at 4:38 PM Doug Anderson <dianders@chromium.org> wrote:
> >
> > Hi,
> >
> > On Tue, Apr 2, 2019 at 4:19 PM Matthias Kaehlcke <mka@chromium.org> wrote:
> > >
> > > Hi Doug,
> > >
> > > On Tue, Apr 02, 2019 at 03:44:44PM -0700, Douglas Anderson wrote:
> > > > [...]
> > >
> > > IIRC dedicated workqueues should be avoided unless they are needed. In
> > > this case it seems you could use system_highpri_wq + a
> > > completion. This would add a few extra lines to deal with the
> > > completion, in exchange the code to create the workqueue could be
> > > removed.
> >
> > I'm not convinced using the "system_highpri_wq" is a great idea here.
> > Using flush_workqueue() on the "system_highpri_wq" seems like a recipe
> > for deadlock but I need to flush to get the result back.  See the
> > comments in flush_scheduled_work() for some discussion here.
> >
> 
> Given that high priority workqueues are used all over the place, and
> system_highpri_wq is only rarely used, hijacking the latter doesn't
> seem to be such a good idea to me either.

It *might* be that system_highpri_wq is used less often because it was
introduced later than high priority workqueues in general; also, many
folks might still think of workqueues as 'one thread per CPU' (like
myself until I was recently pointed to cmwq) and prefer to have their
own wq for this reason.

> I also recall that we had to drop using system workqueues at a
> previous company and replace them with local workqueues because we
> got into timing trouble when using system workqueues.

IIUC this shouldn't be an issue with cmwq (which has worker pools),
unless a work is a CPU hog, in which case it probably shouldn't use
a system workqueue in the first place.

Not saying that the problem you mentioned didn't exist, just sharing
what is my current understanding; maybe somebody will point out that
I'm plain wrong and I/we learn something new about wqs ;-)

Matthias

Patch

diff --git a/drivers/platform/chrome/cros_ec_spi.c b/drivers/platform/chrome/cros_ec_spi.c
index ffc38f9d4829..101f2deb7d3c 100644
--- a/drivers/platform/chrome/cros_ec_spi.c
+++ b/drivers/platform/chrome/cros_ec_spi.c
@@ -67,12 +67,30 @@ 
  *      is sent when we want to turn on CS at the start of a transaction.
  * @end_of_msg_delay: used to set the delay_usecs on the spi_transfer that
  *      is sent when we want to turn off CS at the end of a transaction.
+ * @high_pri_wq: We'll do our transfers here to reduce preemption problems.
  */
 struct cros_ec_spi {
 	struct spi_device *spi;
 	s64 last_transfer_ns;
 	unsigned int start_of_msg_delay;
 	unsigned int end_of_msg_delay;
+	struct workqueue_struct *high_pri_wq;
+};
+
+/**
+ * struct cros_ec_xfer_work_params - params for our high priority workers
+ *
+ * @work: The work_struct needed to queue work
+ * @ec_dev: ChromeOS EC device
+ * @ec_msg: Message to transfer
+ * @ret: The return value of the function
+ */
+
+struct cros_ec_xfer_work_params {
+	struct work_struct work;
+	struct cros_ec_device *ec_dev;
+	struct cros_ec_command *ec_msg;
+	int ret;
 };
 
 static void debug_packet(struct device *dev, const char *name, u8 *ptr,
@@ -350,13 +368,13 @@  static int cros_ec_spi_receive_response(struct cros_ec_device *ec_dev,
 }
 
 /**
- * cros_ec_pkt_xfer_spi - Transfer a packet over SPI and receive the reply
+ * do_cros_ec_pkt_xfer_spi - Transfer a packet over SPI and receive the reply
  *
  * @ec_dev: ChromeOS EC device
  * @ec_msg: Message to transfer
  */
-static int cros_ec_pkt_xfer_spi(struct cros_ec_device *ec_dev,
-				struct cros_ec_command *ec_msg)
+static int do_cros_ec_pkt_xfer_spi(struct cros_ec_device *ec_dev,
+				   struct cros_ec_command *ec_msg)
 {
 	struct ec_host_response *response;
 	struct cros_ec_spi *ec_spi = ec_dev->priv;
@@ -493,13 +511,13 @@  static int cros_ec_pkt_xfer_spi(struct cros_ec_device *ec_dev,
 }
 
 /**
- * cros_ec_cmd_xfer_spi - Transfer a message over SPI and receive the reply
+ * do_cros_ec_cmd_xfer_spi - Transfer a message over SPI and receive the reply
  *
  * @ec_dev: ChromeOS EC device
  * @ec_msg: Message to transfer
  */
-static int cros_ec_cmd_xfer_spi(struct cros_ec_device *ec_dev,
-				struct cros_ec_command *ec_msg)
+static int do_cros_ec_cmd_xfer_spi(struct cros_ec_device *ec_dev,
+				   struct cros_ec_command *ec_msg)
 {
 	struct cros_ec_spi *ec_spi = ec_dev->priv;
 	struct spi_transfer trans;
@@ -611,6 +629,54 @@  static int cros_ec_cmd_xfer_spi(struct cros_ec_device *ec_dev,
 	return ret;
 }
 
+static void cros_ec_pkt_xfer_spi_work(struct work_struct *work)
+{
+	struct cros_ec_xfer_work_params *params;
+
+	params = container_of(work, struct cros_ec_xfer_work_params, work);
+	params->ret = do_cros_ec_pkt_xfer_spi(params->ec_dev, params->ec_msg);
+}
+
+static int cros_ec_pkt_xfer_spi(struct cros_ec_device *ec_dev,
+				struct cros_ec_command *ec_msg)
+{
+	struct cros_ec_spi *ec_spi = ec_dev->priv;
+	struct cros_ec_xfer_work_params params;
+
+	INIT_WORK(&params.work, cros_ec_pkt_xfer_spi_work);
+	params.ec_dev = ec_dev;
+	params.ec_msg = ec_msg;
+
+	queue_work(ec_spi->high_pri_wq, &params.work);
+	flush_workqueue(ec_spi->high_pri_wq);
+
+	return params.ret;
+}
+
+static void cros_ec_cmd_xfer_spi_work(struct work_struct *work)
+{
+	struct cros_ec_xfer_work_params *params;
+
+	params = container_of(work, struct cros_ec_xfer_work_params, work);
+	params->ret = do_cros_ec_cmd_xfer_spi(params->ec_dev, params->ec_msg);
+}
+
+static int cros_ec_cmd_xfer_spi(struct cros_ec_device *ec_dev,
+				struct cros_ec_command *ec_msg)
+{
+	struct cros_ec_spi *ec_spi = ec_dev->priv;
+	struct cros_ec_xfer_work_params params;
+
+	INIT_WORK(&params.work, cros_ec_cmd_xfer_spi_work);
+	params.ec_dev = ec_dev;
+	params.ec_msg = ec_msg;
+
+	queue_work(ec_spi->high_pri_wq, &params.work);
+	flush_workqueue(ec_spi->high_pri_wq);
+
+	return params.ret;
+}
+
 static void cros_ec_spi_dt_probe(struct cros_ec_spi *ec_spi, struct device *dev)
 {
 	struct device_node *np = dev->of_node;
@@ -626,6 +692,31 @@  static void cros_ec_spi_dt_probe(struct cros_ec_spi *ec_spi, struct device *dev)
 		ec_spi->end_of_msg_delay = val;
 }
 
+static void cros_ec_spi_workqueue_release(struct device *dev, void *res)
+{
+	destroy_workqueue(*(struct workqueue_struct **)res);
+}
+
+static struct workqueue_struct *cros_ec_spi_workqueue_alloc(struct device *dev)
+{
+	struct workqueue_struct **ptr;
+
+	ptr = devres_alloc(cros_ec_spi_workqueue_release, sizeof(*ptr),
+			   GFP_KERNEL);
+	if (!ptr)
+		return NULL;
+
+	*ptr = alloc_workqueue("cros_ec_spi", WQ_HIGHPRI, 1);
+	if (*ptr == NULL) {
+		devres_free(ptr);
+		return NULL;
+	}
+
+	devres_add(dev, ptr);
+
+	return *ptr;
+}
+
 static int cros_ec_spi_probe(struct spi_device *spi)
 {
 	struct device *dev = &spi->dev;
@@ -664,6 +755,10 @@  static int cros_ec_spi_probe(struct spi_device *spi)
 
 	ec_spi->last_transfer_ns = ktime_get_ns();
 
+	ec_spi->high_pri_wq = cros_ec_spi_workqueue_alloc(dev);
+	if (!ec_spi->high_pri_wq)
+		return -ENOMEM;
+
 	err = cros_ec_register(ec_dev);
 	if (err) {
 		dev_err(dev, "cannot register EC\n");