diff mbox series

[v2] tpm: Allow system suspend to continue when TPM suspend fails

Message ID 20230106030156.3258307-1-Jason@zx2c4.com (mailing list archive)
State New, archived
Headers show
Series [v2] tpm: Allow system suspend to continue when TPM suspend fails | expand

Commit Message

Jason A. Donenfeld Jan. 6, 2023, 3:01 a.m. UTC
TPM 1 is sometimes broken across system suspends, due to races or
locking issues or something else that haven't been diagnosed or fixed
yet, most likely having to do with concurrent reads from the TPM's
hardware random number generator driver. These issues prevent the system
from actually suspending, with errors like:

  tpm tpm0: A TPM error (28) occurred continue selftest
  ...
  tpm tpm0: A TPM error (28) occurred attempting get random
  ...
  tpm tpm0: Error (28) sending savestate before suspend
  tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
  tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28
  tpm_tis 00:08: PM: failed to suspend: error 28
  PM: Some devices failed to suspend, or early wake event detected

This issue was partially fixed by 23393c646142 ("char: tpm: Protect
tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took
directly because the TPM maintainers weren't available. However, it
seems like this just addresses the most common cases of the bug, rather
than addressing it entirely. So there are more things to fix still,
apparently.

In lieu of actually fixing the underlying bug, just allow system suspend
to continue, so that laptops still go to sleep fine. Later, this can be
reverted when the real bug is fixed.

Link: https://lore.kernel.org/lkml/7cbe96cf-e0b5-ba63-d1b4-f63d2e826efa@suse.cz/
Cc: stable@vger.kernel.org # 6.1+
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
---
This is basically untested and I haven't worked out if there are any
awful implications of letting the system sleep when TPM suspend fails.
Maybe some PCRs get cleared and that will make everything explode on
resume? Maybe it doesn't matter? Somebody well versed in TPMology should
probably [n]ack this approach.

 drivers/char/tpm/tpm-interface.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Jason A. Donenfeld Jan. 6, 2023, 4:01 p.m. UTC | #1
Hi Todd & ChromeOS folks,

On Fri, Jan 06, 2023 at 04:01:56AM +0100, Jason A. Donenfeld wrote:
> TPM 1 is sometimes broken across system suspends, due to races or
> locking issues or something else that haven't been diagnosed or fixed
> yet, most likely having to do with concurrent reads from the TPM's
> hardware random number generator driver. These issues prevent the system
> from actually suspending, with errors like:
> 
>   tpm tpm0: A TPM error (28) occurred continue selftest
>   ...
>   tpm tpm0: A TPM error (28) occurred attempting get random
>   ...
>   tpm tpm0: Error (28) sending savestate before suspend
>   tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
>   tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28
>   tpm_tis 00:08: PM: failed to suspend: error 28
>   PM: Some devices failed to suspend, or early wake event detected
> 
> This issue was partially fixed by 23393c646142 ("char: tpm: Protect
> tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took
> directly because the TPM maintainers weren't available. However, it
> seems like this just addresses the most common cases of the bug, rather
> than addressing it entirely. So there are more things to fix still,
> apparently.
> 
> In lieu of actually fixing the underlying bug, just allow system suspend
> to continue, so that laptops still go to sleep fine. Later, this can be
> reverted when the real bug is fixed.
> 
> Link: https://lore.kernel.org/lkml/7cbe96cf-e0b5-ba63-d1b4-f63d2e826efa@suse.cz/
> Cc: stable@vger.kernel.org # 6.1+
> Reported-by: Vlastimil Babka <vbabka@suse.cz>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---
> This is basically untested and I haven't worked out if there are any
> awful implications of letting the system sleep when TPM suspend fails.
> Maybe some PCRs get cleared and that will make everything explode on
> resume? Maybe it doesn't matter? Somebody well versed in TPMology should
> probably [n]ack this approach.

When idling scrolling on my telephone to try to see what the
implications of skipping TPM_ORD_SAVESTATE could be, I stumbled across
some ChromeOS commits related to it, and realized that, ah-hah, finally
there's an obvious group of stakeholders who make heavy use of the TPM
and have likely amassed some expertise on it.

So I was wondering if you'd take a look at this patch briefly to make
sure it won't break ChromeOS laptops.

Jason
Jason A. Donenfeld Jan. 6, 2023, 5:16 p.m. UTC | #2
On Fri, Jan 6, 2023 at 6:04 PM Luigi Semenzato <semenzato@chromium.org> wrote:
>
> I worked a fair amount on TPM 1.0 about 10 years ago and I even vaguely remember suspend-related problems.  I'd be happy to take a look.  The linked thread shows that Peter Huewe was copied.  I know Peter well, his opinion can be trusted.  Unfortunately I don't immediately see a link to a patch, can you help?

Sorry, I should have included that:
https://lore.kernel.org/lkml/20230106030156.3258307-1-Jason@zx2c4.com/
Instead of blocking system suspend when TPM_ORD_SAVESTATE fails, it
just lets the system sleep anyway. This means that presumably the
system might sleep without having called TPM_ORD_SAVESTATE. Trying to
figure out how bad that is.

And yes, Peter Huewe certainly knows about TPMs, especially as he
maintains the code in Linux, but the maintainers haven't been so much
available, unfortunately. This bug happens to intersect with something
mostly related that I work on (the rng), so I'm motivated to at least
prevent the worst of the breakage, but I otherwise don't know anything
about the Linux TPM driver.

Jason
Linus Torvalds Jan. 6, 2023, 6:59 p.m. UTC | #3
On Thu, Jan 5, 2023 at 7:02 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
>
> In lieu of actually fixing the underlying bug, just allow system suspend
> to continue, so that laptops still go to sleep fine. Later, this can be
> reverted when the real bug is fixed.

So the patch looks fine to me, but since there's still the ChromeOS
discussion pending I'll wait for that to finish.

Perhaps re-send or at least remind me if/when it does?

Also, a query about the printout:

> +       if (rc)
> +               pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
> +                      chip->dev_num, rc);

so I suspect that 99% of the time the dev_num isn't actually all that
useful, but what *might* be useful is which tpm driver it is.

Just comparing the error dmesg output you had:

  ..
  tpm tpm0: Error (28) sending savestate before suspend
  tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
  ..

that "tpm tpm0" output is kind of useless compared to the "tpm_tis 00:08" one.

So I think "dev_err(dev, ...)" would be more useful here.

Finally - and maybe I'm just being difficult here, I will note here
again that TPM2 devices don't have this issue, because the TPM2 path
for suspend doesn't do any of this at all.

It just does

        tpm_transmit_cmd(..);

with a TPM2_CC_SHUTDOWN TPM_SU_STATE command, and doesn't even check
the return value. In fact, the tpm2 code *used* to have this comment:

        /* In places where shutdown command is sent there's no much we can do
         * except print the error code on a system failure.
         */
        if (rc < 0 && rc != -EPIPE)
                dev_warn(&chip->dev, "transmit returned %d while
stopping the TPM",
                         rc);

but it was summarily removed when doing some re-organization around
buffer handling.

So just by looking at what tpm2 does, I'm not 100% convinced that tpm1
should do this dance at all.

But having a dev_err() is probably a good idea at least as a transitional thing.

                  Linus
Luigi Semenzato Jan. 6, 2023, 8:04 p.m. UTC | #4
I think it's fine to go ahead with your change, for multiple reasons.

1. I doubt that any of the ChromeOS devices using TPM 1.2 are still
being updated.
2. If the SAVESTATE command fails, it is probably better to continue
the transition to S3, and fail at resume, than to block the suspend.
The suspend is often triggered by closing the lid, so users would not
see what's going on and might put their running laptop in a backpack,
where it could overheat.
3. I don't recall bugs due to failures of TPM suspend, and I didn't
find any such bug in our database.  Many (most?) ChromeOS devices left
the TPM powered on in S3, so didn't use the suspend/resume path.

Thank you for asking!


On Fri, Jan 6, 2023 at 11:00 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Thu, Jan 5, 2023 at 7:02 PM Jason A. Donenfeld <Jason@zx2c4.com> wrote:
> >
> > In lieu of actually fixing the underlying bug, just allow system suspend
> > to continue, so that laptops still go to sleep fine. Later, this can be
> > reverted when the real bug is fixed.
>
> So the patch looks fine to me, but since there's still the ChromeOS
> discussion pending I'll wait for that to finish.
>
> Perhaps re-send or at least remind me if/when it does?
>
> Also, a query about the printout:
>
> > +       if (rc)
> > +               pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
> > +                      chip->dev_num, rc);
>
> so I suspect that 99% of the time the dev_num isn't actually all that
> useful, but what *might* be useful is which tpm driver it is.
>
> Just comparing the error dmesg output you had:
>
>   ..
>   tpm tpm0: Error (28) sending savestate before suspend
>   tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
>   ..
>
> that "tpm tpm0" output is kind of useless compared to the "tpm_tis 00:08" one.
>
> So I think "dev_err(dev, ...)" would be more useful here.
>
> Finally - and maybe I'm just being difficult here, I will note here
> again that TPM2 devices don't have this issue, because the TPM2 path
> for suspend doesn't do any of this at all.
>
> It just does
>
>         tpm_transmit_cmd(..);
>
> with a TPM2_CC_SHUTDOWN TPM_SU_STATE command, and doesn't even check
> the return value. In fact, the tpm2 code *used* to have this comment:
>
>         /* In places where shutdown command is sent there's no much we can do
>          * except print the error code on a system failure.
>          */
>         if (rc < 0 && rc != -EPIPE)
>                 dev_warn(&chip->dev, "transmit returned %d while
> stopping the TPM",
>                          rc);
>
> but it was summarily removed when doing some re-organization around
> buffer handling.
>
> So just by looking at what tpm2 does, I'm not 100% convinced that tpm1
> should do this dance at all.
>
> But having a dev_err() is probably a good idea at least as a transitional thing.
>
>                   Linus
Linus Torvalds Jan. 6, 2023, 10:28 p.m. UTC | #5
On Fri, Jan 6, 2023 at 12:04 PM Luigi Semenzato <semenzato@chromium.org> wrote:
>
> I think it's fine to go ahead with your change, for multiple reasons.

Ok, I've applied the patch (although I did end up editing it to use
dev_err() before doing that just to make myself happier about the
printout).

            Linus
Jason A. Donenfeld Jan. 9, 2023, 4:05 p.m. UTC | #6
On Fri, Jan 06, 2023 at 02:28:00PM -0800, Linus Torvalds wrote:
> On Fri, Jan 6, 2023 at 12:04 PM Luigi Semenzato <semenzato@chromium.org> wrote:
> >
> > I think it's fine to go ahead with your change, for multiple reasons.
> 
> Ok, I've applied the patch (although I did end up editing it to use
> dev_err() before doing that just to make myself happier about the
> printout).

Thanks for fixing it up to use dev_err(). Final commit looks good to me.

Jason
Jarkko Sakkinen Jan. 16, 2023, 8:12 a.m. UTC | #7
On Fri, Jan 06, 2023 at 04:01:56AM +0100, Jason A. Donenfeld wrote:
> TPM 1 is sometimes broken across system suspends, due to races or
> locking issues or something else that haven't been diagnosed or fixed
> yet, most likely having to do with concurrent reads from the TPM's
> hardware random number generator driver. These issues prevent the system
> from actually suspending, with errors like:
> 
>   tpm tpm0: A TPM error (28) occurred continue selftest
>   ...
>   tpm tpm0: A TPM error (28) occurred attempting get random
>   ...
>   tpm tpm0: Error (28) sending savestate before suspend
>   tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
>   tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28
>   tpm_tis 00:08: PM: failed to suspend: error 28
>   PM: Some devices failed to suspend, or early wake event detected
> 
> This issue was partially fixed by 23393c646142 ("char: tpm: Protect
> tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took
> directly because the TPM maintainers weren't available. However, it
> seems like this just addresses the most common cases of the bug, rather
> than addressing it entirely. So there are more things to fix still,
> apparently.
> 
> In lieu of actually fixing the underlying bug, just allow system suspend
> to continue, so that laptops still go to sleep fine. Later, this can be
> reverted when the real bug is fixed.
> 
> Link: https://lore.kernel.org/lkml/7cbe96cf-e0b5-ba63-d1b4-f63d2e826efa@suse.cz/
> Cc: stable@vger.kernel.org # 6.1+
> Reported-by: Vlastimil Babka <vbabka@suse.cz>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---
> This is basically untested and I haven't worked out if there are any
> awful implications of letting the system sleep when TPM suspend fails.
> Maybe some PCRs get cleared and that will make everything explode on
> resume? Maybe it doesn't matter? Somebody well versed in TPMology should
> probably [n]ack this approach.
> 
>  drivers/char/tpm/tpm-interface.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
> index d69905233aff..6df9067ef7f9 100644
> --- a/drivers/char/tpm/tpm-interface.c
> +++ b/drivers/char/tpm/tpm-interface.c
> @@ -412,7 +412,10 @@ int tpm_pm_suspend(struct device *dev)
>  	}
>  
>  suspended:
> -	return rc;
> +	if (rc)
> +		pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
> +		       chip->dev_num, rc);
> +	return 0;
>  }
>  EXPORT_SYMBOL_GPL(tpm_pm_suspend);
>  
> -- 
> 2.39.0
> 

Let me read all the threads through starting from the original report. I've
had emails piling up because of getting sick before holiday, and holiday
season after that.

This looks sane

Apologies for the lack of response.

BR, Jarkko
Jarkko Sakkinen Jan. 16, 2023, 11:44 a.m. UTC | #8
On Fri, Jan 06, 2023 at 04:01:56AM +0100, Jason A. Donenfeld wrote:
> TPM 1 is sometimes broken across system suspends, due to races or
> locking issues or something else that haven't been diagnosed or fixed
> yet, most likely having to do with concurrent reads from the TPM's
> hardware random number generator driver. These issues prevent the system
> from actually suspending, with errors like:
> 
>   tpm tpm0: A TPM error (28) occurred continue selftest
>   ...

<REMOVE>

>   tpm tpm0: A TPM error (28) occurred attempting get random
>   ...
>   tpm tpm0: Error (28) sending savestate before suspend
>   tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
>   tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28
>   tpm_tis 00:08: PM: failed to suspend: error 28
>   PM: Some devices failed to suspend, or early wake event detected

</REMOVE>

Unrelated to thix particular fix.

> This issue was partially fixed by 23393c646142 ("char: tpm: Protect
> tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took
> directly because the TPM maintainers weren't available. However, it
> seems like this just addresses the most common cases of the bug, rather
> than addressing it entirely. So there are more things to fix still,
> apparently.
> 
> In lieu of actually fixing the underlying bug, just allow system suspend
> to continue, so that laptops still go to sleep fine. Later, this can be
> reverted when the real bug is fixed.
> 
> Link: https://lore.kernel.org/lkml/7cbe96cf-e0b5-ba63-d1b4-f63d2e826efa@suse.cz/
> Cc: stable@vger.kernel.org # 6.1+
> Reported-by: Vlastimil Babka <vbabka@suse.cz>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> ---
> This is basically untested and I haven't worked out if there are any
> awful implications of letting the system sleep when TPM suspend fails.
> Maybe some PCRs get cleared and that will make everything explode on
> resume? Maybe it doesn't matter? Somebody well versed in TPMology should
> probably [n]ack this approach.
> 
>  drivers/char/tpm/tpm-interface.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
> index d69905233aff..6df9067ef7f9 100644
> --- a/drivers/char/tpm/tpm-interface.c
> +++ b/drivers/char/tpm/tpm-interface.c
> @@ -412,7 +412,10 @@ int tpm_pm_suspend(struct device *dev)
>  	}
>  
>  suspended:
> -	return rc;
> +	if (rc)
> +		pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
> +		       chip->dev_num, rc);
> +	return 0;
>  }
>  EXPORT_SYMBOL_GPL(tpm_pm_suspend);
>  
> -- 
> 2.39.0
> 

This tpm_tis local issue, nothing to do with tpm_pm_suspend(). Executing
the selftest as part of wake up, is TPM 1.2 dTPM specific requirement, and
the call is located in tpm_tis_resume() [*].

[*] https://lore.kernel.org/lkml/Y8U1QxA4GYvPWDky@kernel.org/

BR, Jarkko
Vlastimil Babka Jan. 16, 2023, 2 p.m. UTC | #9
On 1/16/23 12:44, Jarkko Sakkinen wrote:
> On Fri, Jan 06, 2023 at 04:01:56AM +0100, Jason A. Donenfeld wrote:
>> TPM 1 is sometimes broken across system suspends, due to races or
>> locking issues or something else that haven't been diagnosed or fixed
>> yet, most likely having to do with concurrent reads from the TPM's
>> hardware random number generator driver. These issues prevent the system
>> from actually suspending, with errors like:
>> 
>>   tpm tpm0: A TPM error (28) occurred continue selftest
>>   ...
> 
> <REMOVE>
> 
>>   tpm tpm0: A TPM error (28) occurred attempting get random
>>   ...
>>   tpm tpm0: Error (28) sending savestate before suspend
>>   tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
>>   tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28
>>   tpm_tis 00:08: PM: failed to suspend: error 28
>>   PM: Some devices failed to suspend, or early wake event detected
> 
> </REMOVE>
> 
> Unrelated to thix particular fix.

Not sure I understand.
AFAIK this is not a proper fix, but a workaround for when laptop suspend no
longer works because TPM fails to suspend. The error messages quoted above
are very much related to the problem of suspend not working, and this patch
did work as advertised at least for me. I see errors but they don't prevent
suspend anymore:

https://lore.kernel.org/all/58d7a42c-9e6b-ab2a-617f-d5e373bf63cb@suse.cz/

>> This issue was partially fixed by 23393c646142 ("char: tpm: Protect
>> tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took
>> directly because the TPM maintainers weren't available. However, it
>> seems like this just addresses the most common cases of the bug, rather
>> than addressing it entirely. So there are more things to fix still,
>> apparently.
>> 
>> In lieu of actually fixing the underlying bug, just allow system suspend
>> to continue, so that laptops still go to sleep fine. Later, this can be
>> reverted when the real bug is fixed.
>> 
>> Link: https://lore.kernel.org/lkml/7cbe96cf-e0b5-ba63-d1b4-f63d2e826efa@suse.cz/
>> Cc: stable@vger.kernel.org # 6.1+
>> Reported-by: Vlastimil Babka <vbabka@suse.cz>
>> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
>> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
>> ---
>> This is basically untested and I haven't worked out if there are any
>> awful implications of letting the system sleep when TPM suspend fails.
>> Maybe some PCRs get cleared and that will make everything explode on
>> resume? Maybe it doesn't matter? Somebody well versed in TPMology should
>> probably [n]ack this approach.
>> 
>>  drivers/char/tpm/tpm-interface.c | 5 ++++-
>>  1 file changed, 4 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
>> index d69905233aff..6df9067ef7f9 100644
>> --- a/drivers/char/tpm/tpm-interface.c
>> +++ b/drivers/char/tpm/tpm-interface.c
>> @@ -412,7 +412,10 @@ int tpm_pm_suspend(struct device *dev)
>>  	}
>>  
>>  suspended:
>> -	return rc;
>> +	if (rc)
>> +		pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
>> +		       chip->dev_num, rc);
>> +	return 0;
>>  }
>>  EXPORT_SYMBOL_GPL(tpm_pm_suspend);
>>  
>> -- 
>> 2.39.0
>> 
> 
> This tpm_tis local issue, nothing to do with tpm_pm_suspend(). Executing
> the selftest as part of wake up, is TPM 1.2 dTPM specific requirement, and
> the call is located in tpm_tis_resume() [*].
> 
> [*] https://lore.kernel.org/lkml/Y8U1QxA4GYvPWDky@kernel.org/

Yes the changelog at the top does say "due to races or locking issues or
something else that haven't been diagnosed or fixed yet"

I don't know what causes the TPM to start returning error 28 on resume and
never recover from it. But it didn't happen before hwrng started using the
TPM. Before that, it was probably just the selftest ever doing anything with
the TPM, and on its own I don't recall it ever (before 6.1) failing and
preventing further suspend/resume.

> BR, Jarkko
Jason A. Donenfeld Jan. 16, 2023, 2:03 p.m. UTC | #10
Hi Jarkko,

On Mon, Jan 16, 2023 at 9:12 AM Jarkko Sakkinen <jarkko@kernel.org> wrote:
> > diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
> > index d69905233aff..6df9067ef7f9 100644
> > --- a/drivers/char/tpm/tpm-interface.c
> > +++ b/drivers/char/tpm/tpm-interface.c
> > @@ -412,7 +412,10 @@ int tpm_pm_suspend(struct device *dev)
> >       }
> >
> >  suspended:
> > -     return rc;
> > +     if (rc)
> > +             pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
> > +                    chip->dev_num, rc);
> > +     return 0;
> >  }
> >  EXPORT_SYMBOL_GPL(tpm_pm_suspend);
> >
> > --
> > 2.39.0
> >
>
> Let me read all the threads through starting from the original report. I've
> had emails piling up because of getting sick before holiday, and holiday
> season after that.
>
> This looks sane

No, not really. I mean, it was sane under the circumstances of, "I'm
not going to spend time fixing this for real if the maintainers aren't
around," and it fixed the suspend issue. But it doesn't actually fix
any real tpm issue. The real issue, AFAICT, is there's some sort of
race between the tpm rng read command and either suspend or wakeup or
selftest. One of these is missing some locking. And then commands step
on each other and the tpm gets upset. This is probably something that
should be fixed. I assume the "Fixes: ..." tag will actually go quite
far back, with recent things only unearthing a somewhat old bug. But
just a hunch.

Jason
Jarkko Sakkinen Jan. 21, 2023, 12:03 a.m. UTC | #11
On Mon, Jan 16, 2023 at 03:00:03PM +0100, Vlastimil Babka wrote:
> On 1/16/23 12:44, Jarkko Sakkinen wrote:
> > On Fri, Jan 06, 2023 at 04:01:56AM +0100, Jason A. Donenfeld wrote:
> >> TPM 1 is sometimes broken across system suspends, due to races or
> >> locking issues or something else that haven't been diagnosed or fixed
> >> yet, most likely having to do with concurrent reads from the TPM's
> >> hardware random number generator driver. These issues prevent the system
> >> from actually suspending, with errors like:
> >> 
> >>   tpm tpm0: A TPM error (28) occurred continue selftest
> >>   ...
> > 
> > <REMOVE>
> > 
> >>   tpm tpm0: A TPM error (28) occurred attempting get random
> >>   ...
> >>   tpm tpm0: Error (28) sending savestate before suspend
> >>   tpm_tis 00:08: PM: __pnp_bus_suspend(): tpm_pm_suspend+0x0/0x80 returns 28
> >>   tpm_tis 00:08: PM: dpm_run_callback(): pnp_bus_suspend+0x0/0x10 returns 28
> >>   tpm_tis 00:08: PM: failed to suspend: error 28
> >>   PM: Some devices failed to suspend, or early wake event detected
> > 
> > </REMOVE>
> > 
> > Unrelated to thix particular fix.
> 
> Not sure I understand.
> AFAIK this is not a proper fix, but a workaround for when laptop suspend no
> longer works because TPM fails to suspend. The error messages quoted above
> are very much related to the problem of suspend not working, and this patch
> did work as advertised at least for me. I see errors but they don't prevent
> suspend anymore:
> 
> https://lore.kernel.org/all/58d7a42c-9e6b-ab2a-617f-d5e373bf63cb@suse.cz/
> 
> >> This issue was partially fixed by 23393c646142 ("char: tpm: Protect
> >> tpm_pm_suspend with locks"), in a last minute 6.1 commit that Linus took
> >> directly because the TPM maintainers weren't available. However, it
> >> seems like this just addresses the most common cases of the bug, rather
> >> than addressing it entirely. So there are more things to fix still,
> >> apparently.
> >> 
> >> In lieu of actually fixing the underlying bug, just allow system suspend
> >> to continue, so that laptops still go to sleep fine. Later, this can be
> >> reverted when the real bug is fixed.
> >> 
> >> Link: https://lore.kernel.org/lkml/7cbe96cf-e0b5-ba63-d1b4-f63d2e826efa@suse.cz/
> >> Cc: stable@vger.kernel.org # 6.1+
> >> Reported-by: Vlastimil Babka <vbabka@suse.cz>
> >> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> >> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> >> ---
> >> This is basically untested and I haven't worked out if there are any
> >> awful implications of letting the system sleep when TPM suspend fails.
> >> Maybe some PCRs get cleared and that will make everything explode on
> >> resume? Maybe it doesn't matter? Somebody well versed in TPMology should
> >> probably [n]ack this approach.
> >> 
> >>  drivers/char/tpm/tpm-interface.c | 5 ++++-
> >>  1 file changed, 4 insertions(+), 1 deletion(-)
> >> 
> >> diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
> >> index d69905233aff..6df9067ef7f9 100644
> >> --- a/drivers/char/tpm/tpm-interface.c
> >> +++ b/drivers/char/tpm/tpm-interface.c
> >> @@ -412,7 +412,10 @@ int tpm_pm_suspend(struct device *dev)
> >>  	}
> >>  
> >>  suspended:
> >> -	return rc;
> >> +	if (rc)
> >> +		pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
> >> +		       chip->dev_num, rc);
> >> +	return 0;
> >>  }
> >>  EXPORT_SYMBOL_GPL(tpm_pm_suspend);
> >>  
> >> -- 
> >> 2.39.0
> >> 
> > 
> > This tpm_tis local issue, nothing to do with tpm_pm_suspend(). Executing
> > the selftest as part of wake up, is TPM 1.2 dTPM specific requirement, and
> > the call is located in tpm_tis_resume() [*].
> > 
> > [*] https://lore.kernel.org/lkml/Y8U1QxA4GYvPWDky@kernel.org/
> 
> Yes the changelog at the top does say "due to races or locking issues or
> something else that haven't been diagnosed or fixed yet"
> 
> I don't know what causes the TPM to start returning error 28 on resume and
> never recover from it. But it didn't happen before hwrng started using the
> TPM. Before that, it was probably just the selftest ever doing anything with
> the TPM, and on its own I don't recall it ever (before 6.1) failing and
> preventing further suspend/resume.

Would it be possible to test this theory by commenting out tpm_add_hwrng()
call from tpm_chip_register()?

Since they are called sequentially any sort of concurrency issue can be
probably ruled out.

One thing that I noticed is that probably it would be more safe-play to
move tpm_add_hwrng() call after creating the character device, as there's
no need to do it before anything else.

BR, Jarkko
Jarkko Sakkinen Jan. 21, 2023, 12:07 a.m. UTC | #12
On Mon, Jan 16, 2023 at 03:03:17PM +0100, Jason A. Donenfeld wrote:
> Hi Jarkko,
> 
> On Mon, Jan 16, 2023 at 9:12 AM Jarkko Sakkinen <jarkko@kernel.org> wrote:
> > > diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
> > > index d69905233aff..6df9067ef7f9 100644
> > > --- a/drivers/char/tpm/tpm-interface.c
> > > +++ b/drivers/char/tpm/tpm-interface.c
> > > @@ -412,7 +412,10 @@ int tpm_pm_suspend(struct device *dev)
> > >       }
> > >
> > >  suspended:
> > > -     return rc;
> > > +     if (rc)
> > > +             pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
> > > +                    chip->dev_num, rc);
> > > +     return 0;
> > >  }
> > >  EXPORT_SYMBOL_GPL(tpm_pm_suspend);
> > >
> > > --
> > > 2.39.0
> > >
> >
> > Let me read all the threads through starting from the original report. I've
> > had emails piling up because of getting sick before holiday, and holiday
> > season after that.
> >
> > This looks sane
> 
> No, not really. I mean, it was sane under the circumstances of, "I'm
> not going to spend time fixing this for real if the maintainers aren't
> around," and it fixed the suspend issue. But it doesn't actually fix
> any real tpm issue. The real issue, AFAICT, is there's some sort of
> race between the tpm rng read command and either suspend or wakeup or
> selftest. One of these is missing some locking. And then commands step
> on each other and the tpm gets upset. This is probably something that
> should be fixed. I assume the "Fixes: ..." tag will actually go quite
> far back, with recent things only unearthing a somewhat old bug. But
> just a hunch.
> 
> Jason

See my response to Vlastimil:

https://lore.kernel.org/linux-integrity/Y8sr7YJ8e8eSpPFv@kernel.org/

Can you try what happens if you do not call tpm_add_hwrng()?

BR, Jarkko
diff mbox series

Patch

diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
index d69905233aff..6df9067ef7f9 100644
--- a/drivers/char/tpm/tpm-interface.c
+++ b/drivers/char/tpm/tpm-interface.c
@@ -412,7 +412,10 @@  int tpm_pm_suspend(struct device *dev)
 	}
 
 suspended:
-	return rc;
+	if (rc)
+		pr_err("Unable to suspend tpm-%d (error %d), but continuing system suspend\n",
+		       chip->dev_num, rc);
+	return 0;
 }
 EXPORT_SYMBOL_GPL(tpm_pm_suspend);