diff mbox series

[v2,1/6] hwrng: core: Freeze khwrng thread during suspend

Message ID 20190716224518.62556-2-swboyd@chromium.org (mailing list archive)
State New, archived
Headers show
Series tpm: Add driver for cr50 | expand

Commit Message

Stephen Boyd July 16, 2019, 10:45 p.m. UTC
The hwrng_fill() function can run while devices are suspending and
resuming. If the hwrng is behind a bus such as i2c or SPI and that bus
is suspended, the hwrng may hang the bus while attempting to add some
randomness. It's been observed on ChromeOS devices with suspend-to-idle
(s2idle) and an i2c based hwrng that this kthread may run and ask the
hwrng device for randomness before the i2c bus has been resumed.

Let's make this kthread freezable so that we don't try to touch the
hwrng during suspend/resume. This ensures that we can't cause the hwrng
backing driver to get into a bad state because the device is guaranteed
to be resumed before the hwrng kthread is thawed.

Cc: Andrey Pronin <apronin@chromium.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Duncan Laurie <dlaurie@chromium.org>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
---
 drivers/char/hw_random/core.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Comments

Herbert Xu July 17, 2019, 1:43 a.m. UTC | #1
On Tue, Jul 16, 2019 at 03:45:13PM -0700, Stephen Boyd wrote:
> The hwrng_fill() function can run while devices are suspending and
> resuming. If the hwrng is behind a bus such as i2c or SPI and that bus
> is suspended, the hwrng may hang the bus while attempting to add some
> randomness. It's been observed on ChromeOS devices with suspend-to-idle
> (s2idle) and an i2c based hwrng that this kthread may run and ask the
> hwrng device for randomness before the i2c bus has been resumed.
> 
> Let's make this kthread freezable so that we don't try to touch the
> hwrng during suspend/resume. This ensures that we can't cause the hwrng
> backing driver to get into a bad state because the device is guaranteed
> to be resumed before the hwrng kthread is thawed.
> 
> Cc: Andrey Pronin <apronin@chromium.org>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: Duncan Laurie <dlaurie@chromium.org>
> Signed-off-by: Stephen Boyd <swboyd@chromium.org>
> ---
>  drivers/char/hw_random/core.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)

Do you want this to go through the crypto tree? If so you need
to cc the linux-crypto mailing list.

Thanks,
Jason Gunthorpe July 17, 2019, 11:39 a.m. UTC | #2
On Tue, Jul 16, 2019 at 03:45:13PM -0700, Stephen Boyd wrote:
> The hwrng_fill() function can run while devices are suspending and
> resuming. If the hwrng is behind a bus such as i2c or SPI and that bus
> is suspended, the hwrng may hang the bus while attempting to add some
> randomness. It's been observed on ChromeOS devices with suspend-to-idle
> (s2idle) and an i2c based hwrng that this kthread may run and ask the
> hwrng device for randomness before the i2c bus has been resumed.

You mean the TPM here right?

Should we be more careful in the TPM code to check if the TPM is
suspended before trying to use it, rather than muck up callers?

Jason
Stephen Boyd July 17, 2019, 4:36 p.m. UTC | #3
Quoting Herbert Xu (2019-07-16 18:43:59)
> On Tue, Jul 16, 2019 at 03:45:13PM -0700, Stephen Boyd wrote:
> > The hwrng_fill() function can run while devices are suspending and
> > resuming. If the hwrng is behind a bus such as i2c or SPI and that bus
> > is suspended, the hwrng may hang the bus while attempting to add some
> > randomness. It's been observed on ChromeOS devices with suspend-to-idle
> > (s2idle) and an i2c based hwrng that this kthread may run and ask the
> > hwrng device for randomness before the i2c bus has been resumed.
> > 
> > Let's make this kthread freezable so that we don't try to touch the
> > hwrng during suspend/resume. This ensures that we can't cause the hwrng
> > backing driver to get into a bad state because the device is guaranteed
> > to be resumed before the hwrng kthread is thawed.
> > 
> > Cc: Andrey Pronin <apronin@chromium.org>
> > Cc: Herbert Xu <herbert@gondor.apana.org.au>
> > Cc: Duncan Laurie <dlaurie@chromium.org>
> > Signed-off-by: Stephen Boyd <swboyd@chromium.org>
> > ---
> >  drivers/char/hw_random/core.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> Do you want this to go through the crypto tree? If so you need
> to cc the linux-crypto mailing list.
> 

Sure. I'll resend just this one Cced to the linux-crypto list and to
you.
Stephen Boyd July 17, 2019, 4:42 p.m. UTC | #4
Quoting Jason Gunthorpe (2019-07-17 04:39:56)
> On Tue, Jul 16, 2019 at 03:45:13PM -0700, Stephen Boyd wrote:
> > The hwrng_fill() function can run while devices are suspending and
> > resuming. If the hwrng is behind a bus such as i2c or SPI and that bus
> > is suspended, the hwrng may hang the bus while attempting to add some
> > randomness. It's been observed on ChromeOS devices with suspend-to-idle
> > (s2idle) and an i2c based hwrng that this kthread may run and ask the
> > hwrng device for randomness before the i2c bus has been resumed.
> 
> You mean the TPM here right?

In my case yes, but in general it isn't the TPM.

> 
> Should we be more careful in the TPM code to check if the TPM is
> suspended before trying to use it, rather than muck up callers?
> 

Given that it's not just a TPM issue I don't see how checking in the TPM
is going to fix this problem. It's better to not try to get random bytes
from the hwrng when the device could be suspended.
Jason Gunthorpe July 17, 2019, 4:50 p.m. UTC | #5
On Wed, Jul 17, 2019 at 09:42:32AM -0700, Stephen Boyd wrote:
> Quoting Jason Gunthorpe (2019-07-17 04:39:56)
> > On Tue, Jul 16, 2019 at 03:45:13PM -0700, Stephen Boyd wrote:
> > > The hwrng_fill() function can run while devices are suspending and
> > > resuming. If the hwrng is behind a bus such as i2c or SPI and that bus
> > > is suspended, the hwrng may hang the bus while attempting to add some
> > > randomness. It's been observed on ChromeOS devices with suspend-to-idle
> > > (s2idle) and an i2c based hwrng that this kthread may run and ask the
> > > hwrng device for randomness before the i2c bus has been resumed.
> > 
> > You mean the TPM here right?
> 
> In my case yes, but in general it isn't the TPM.
> 
> > 
> > Should we be more careful in the TPM code to check if the TPM is
> > suspended before trying to use it, rather than muck up callers?
> > 
> 
> Given that it's not just a TPM issue I don't see how checking in the TPM
> is going to fix this problem. It's better to not try to get random bytes
> from the hwrng when the device could be suspended.

I think the same comment would apply to all the other suspend capable
hwrngs...

It just seems weird to do this, what about all the other tpm API
users? Do they have a racy problem with suspend too?

Jason
Stephen Boyd July 17, 2019, 5:03 p.m. UTC | #6
Quoting Jason Gunthorpe (2019-07-17 09:50:11)
> On Wed, Jul 17, 2019 at 09:42:32AM -0700, Stephen Boyd wrote:
> > Quoting Jason Gunthorpe (2019-07-17 04:39:56)
> > > On Tue, Jul 16, 2019 at 03:45:13PM -0700, Stephen Boyd wrote:
> > > > The hwrng_fill() function can run while devices are suspending and
> > > > resuming. If the hwrng is behind a bus such as i2c or SPI and that bus
> > > > is suspended, the hwrng may hang the bus while attempting to add some
> > > > randomness. It's been observed on ChromeOS devices with suspend-to-idle
> > > > (s2idle) and an i2c based hwrng that this kthread may run and ask the
> > > > hwrng device for randomness before the i2c bus has been resumed.
> > > 
> > > You mean the TPM here right?
> > 
> > In my case yes, but in general it isn't the TPM.
> > 
> > > 
> > > Should we be more careful in the TPM code to check if the TPM is
> > > suspended before trying to use it, rather than muck up callers?
> > > 
> > 
> > Given that it's not just a TPM issue I don't see how checking in the TPM
> > is going to fix this problem. It's better to not try to get random bytes
> > from the hwrng when the device could be suspended.
> 
> I think the same comment would apply to all the other suspend capable
> hwrngs...

Yes. That's exactly my point. A hwrng that's suspended will fail here
and it's better to just not try until it's guaranteed to have resumed.

> 
> It just seems weird to do this, what about all the other tpm API
> users? Do they have a racy problem with suspend too?

I haven't looked at them. Are they being called from suspend/resume
paths? I don't think anything for the userspace API can be a problem
because those tasks are all frozen. The only problem would be some
kernel internal API that TPM API exposes. I did a quick grep and I see
things like IMA or the trusted keys APIs that might need a closer look.

Either way, trying to hold off a TPM operation from the TPM API when
we're suspended isn't really possible. If something like IMA needs to
get TPM data from deep suspend path and it fails because the device is
suspended, all we can do is return an error from TPM APIs and hope the
caller can recover. The fix is probably going to be to change the code
to not call into the TPM API until the hardware has resumed by avoiding
doing anything with the TPM until resume is over. So we're at best able
to make same sort of band-aid here in the TPM API where all we can do is
say -EAGAIN but we can't tell the caller when to try again.
Jarkko Sakkinen Aug. 2, 2019, 8:22 p.m. UTC | #7
On Tue, Jul 16, 2019 at 03:45:13PM -0700, Stephen Boyd wrote:
> The hwrng_fill() function can run while devices are suspending and
> resuming. If the hwrng is behind a bus such as i2c or SPI and that bus
> is suspended, the hwrng may hang the bus while attempting to add some
> randomness. It's been observed on ChromeOS devices with suspend-to-idle
> (s2idle) and an i2c based hwrng that this kthread may run and ask the
> hwrng device for randomness before the i2c bus has been resumed.
> 
> Let's make this kthread freezable so that we don't try to touch the
> hwrng during suspend/resume. This ensures that we can't cause the hwrng
> backing driver to get into a bad state because the device is guaranteed
> to be resumed before the hwrng kthread is thawed.
> 
> Cc: Andrey Pronin <apronin@chromium.org>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: Duncan Laurie <dlaurie@chromium.org>
> Signed-off-by: Stephen Boyd <swboyd@chromium.org>

This does not need a fixes tag?

/Jarkko
Stephen Boyd Aug. 2, 2019, 10:50 p.m. UTC | #8
Quoting Stephen Boyd (2019-07-17 10:03:22)
> Quoting Jason Gunthorpe (2019-07-17 09:50:11)
> > On Wed, Jul 17, 2019 at 09:42:32AM -0700, Stephen Boyd wrote:
> 
> Yes. That's exactly my point. A hwrng that's suspended will fail here
> and it's better to just not try until it's guaranteed to have resumed.
> 
> > 
> > It just seems weird to do this, what about all the other tpm API
> > users? Do they have a racy problem with suspend too?
> 
> I haven't looked at them. Are they being called from suspend/resume
> paths? I don't think anything for the userspace API can be a problem
> because those tasks are all frozen. The only problem would be some
> kernel internal API that TPM API exposes. I did a quick grep and I see
> things like IMA or the trusted keys APIs that might need a closer look.
> 
> Either way, trying to hold off a TPM operation from the TPM API when
> we're suspended isn't really possible. If something like IMA needs to
> get TPM data from deep suspend path and it fails because the device is
> suspended, all we can do is return an error from TPM APIs and hope the
> caller can recover. The fix is probably going to be to change the code
> to not call into the TPM API until the hardware has resumed by avoiding
> doing anything with the TPM until resume is over. So we're at best able
> to make same sort of band-aid here in the TPM API where all we can do is
> say -EAGAIN but we can't tell the caller when to try again.
> 

Andrey talked to me a little about this today. Andrey would prefer we
don't just let the TPM go into a wonky state if it's used during
suspend/resume so that it can stay resilient to errors. Sounds OK to me,
but my point still stands that we need to fix the callers.

I'll resurrect the IS_SUSPENDED flag and make it set generically by the
tpm_pm_suspend() and tpm_pm_resume() functions and then spit out a big
WARN_ON() and return an error value like -EAGAIN if the TPM functions
are called when the TPM is suspended. I hope we don't hit the warning
message, but if we do then at least we can track it down rather quickly
and figure out how to fix the caller instead of just silently returning
-EAGAIN and hoping for that to be visible to the user.

This patch will still be required to avoid the WARN message, so I'll
resend with the Cc to crypto list so it can be picked up.
Stephen Boyd Aug. 2, 2019, 11:18 p.m. UTC | #9
Quoting Jarkko Sakkinen (2019-08-02 13:22:09)
> On Tue, Jul 16, 2019 at 03:45:13PM -0700, Stephen Boyd wrote:
> > The hwrng_fill() function can run while devices are suspending and
> > resuming. If the hwrng is behind a bus such as i2c or SPI and that bus
> > is suspended, the hwrng may hang the bus while attempting to add some
> > randomness. It's been observed on ChromeOS devices with suspend-to-idle
> > (s2idle) and an i2c based hwrng that this kthread may run and ask the
> > hwrng device for randomness before the i2c bus has been resumed.
> > 
> > Let's make this kthread freezable so that we don't try to touch the
> > hwrng during suspend/resume. This ensures that we can't cause the hwrng
> > backing driver to get into a bad state because the device is guaranteed
> > to be resumed before the hwrng kthread is thawed.
> > 
> > Cc: Andrey Pronin <apronin@chromium.org>
> > Cc: Herbert Xu <herbert@gondor.apana.org.au>
> > Cc: Duncan Laurie <dlaurie@chromium.org>
> > Signed-off-by: Stephen Boyd <swboyd@chromium.org>
> 
> This does not need a fixes tag?
> 

I'll add Fixes: be4000bc4644 ("hwrng: create filler thread")
Alexander Steffen Aug. 16, 2019, 3:56 p.m. UTC | #10
On 03.08.2019 00:50, Stephen Boyd wrote:
> Quoting Stephen Boyd (2019-07-17 10:03:22)
>> Quoting Jason Gunthorpe (2019-07-17 09:50:11)
>>> On Wed, Jul 17, 2019 at 09:42:32AM -0700, Stephen Boyd wrote:
>>
>> Yes. That's exactly my point. A hwrng that's suspended will fail here
>> and it's better to just not try until it's guaranteed to have resumed.
>>
>>>
>>> It just seems weird to do this, what about all the other tpm API
>>> users? Do they have a racy problem with suspend too?
>>
>> I haven't looked at them. Are they being called from suspend/resume
>> paths? I don't think anything for the userspace API can be a problem
>> because those tasks are all frozen. The only problem would be some
>> kernel internal API that TPM API exposes. I did a quick grep and I see
>> things like IMA or the trusted keys APIs that might need a closer look.
>>
>> Either way, trying to hold off a TPM operation from the TPM API when
>> we're suspended isn't really possible. If something like IMA needs to
>> get TPM data from deep suspend path and it fails because the device is
>> suspended, all we can do is return an error from TPM APIs and hope the
>> caller can recover. The fix is probably going to be to change the code
>> to not call into the TPM API until the hardware has resumed by avoiding
>> doing anything with the TPM until resume is over. So we're at best able
>> to make same sort of band-aid here in the TPM API where all we can do is
>> say -EAGAIN but we can't tell the caller when to try again.
>>
> 
> Andrey talked to me a little about this today. Andrey would prefer we
> don't just let the TPM go into a wonky state if it's used during
> suspend/resume so that it can stay resilient to errors. Sounds OK to me,
> but my point still stands that we need to fix the callers.
> 
> I'll resurrect the IS_SUSPENDED flag and make it set generically by the
> tpm_pm_suspend() and tpm_pm_resume() functions and then spit out a big
> WARN_ON() and return an error value like -EAGAIN if the TPM functions
> are called when the TPM is suspended. I hope we don't hit the warning
> message, but if we do then at least we can track it down rather quickly
> and figure out how to fix the caller instead of just silently returning
> -EAGAIN and hoping for that to be visible to the user.

There is another use case I see for this functionality: There are ways 
for user space to upgrade the TPM's firmware via /dev/tpm0 (using e.g. 
TPM2_FieldUpgradeStart/TPM2_FieldUpgradeData). While upgrading, the 
normal TPM functionality might not be available (commands return 
TPM_RC_UPGRADE or other error codes). Even after the upgrade is 
finished, the TPM might continue to refuse command execution (e.g. with 
TPM_RC_REBOOT).

I'm not sure whether all in-kernel users are prepared to deal correctly 
with those error codes. But even if they are, it might be better to 
block them from sending commands in the first place, to not interfere 
with the upgrade process.

What would you think about a way for a user space upgrade tool to also 
set this flag, to make the TPM unavailable for everything but the 
upgrade process?

Alexander
Jarkko Sakkinen Aug. 16, 2019, 11:55 p.m. UTC | #11
On Fri, Aug 16, 2019 at 05:56:12PM +0200, Alexander Steffen wrote:
> > Andrey talked to me a little about this today. Andrey would prefer we
> > don't just let the TPM go into a wonky state if it's used during
> > suspend/resume so that it can stay resilient to errors. Sounds OK to me,
> > but my point still stands that we need to fix the callers.
> > 
> > I'll resurrect the IS_SUSPENDED flag and make it set generically by the
> > tpm_pm_suspend() and tpm_pm_resume() functions and then spit out a big
> > WARN_ON() and return an error value like -EAGAIN if the TPM functions
> > are called when the TPM is suspended. I hope we don't hit the warning
> > message, but if we do then at least we can track it down rather quickly
> > and figure out how to fix the caller instead of just silently returning
> > -EAGAIN and hoping for that to be visible to the user.
> 
> There is another use case I see for this functionality: There are ways for
> user space to upgrade the TPM's firmware via /dev/tpm0 (using e.g.
> TPM2_FieldUpgradeStart/TPM2_FieldUpgradeData). While upgrading, the normal
> TPM functionality might not be available (commands return TPM_RC_UPGRADE or
> other error codes). Even after the upgrade is finished, the TPM might
> continue to refuse command execution (e.g. with TPM_RC_REBOOT).
> 
> I'm not sure whether all in-kernel users are prepared to deal correctly with
> those error codes. But even if they are, it might be better to block them
> from sending commands in the first place, to not interfere with the upgrade
> process.
> 
> What would you think about a way for a user space upgrade tool to also set
> this flag, to make the TPM unavailable for everything but the upgrade
> process?
> 
> Alexander

NOTE: Just commenting the FW use case.

I don't like it because it contains variable amount of reserved time
for a hardware resource.

Right now a user thread gets a lease of one TPM command for /dev/tpm0
and that is how I would like to keep it.

/Jarkko
diff mbox series

Patch

diff --git a/drivers/char/hw_random/core.c b/drivers/char/hw_random/core.c
index 95be7228f327..3b88af3149a7 100644
--- a/drivers/char/hw_random/core.c
+++ b/drivers/char/hw_random/core.c
@@ -13,6 +13,7 @@ 
 #include <linux/delay.h>
 #include <linux/device.h>
 #include <linux/err.h>
+#include <linux/freezer.h>
 #include <linux/fs.h>
 #include <linux/hw_random.h>
 #include <linux/kernel.h>
@@ -421,7 +422,9 @@  static int hwrng_fillfn(void *unused)
 {
 	long rc;
 
-	while (!kthread_should_stop()) {
+	set_freezable();
+
+	while (!kthread_freezable_should_stop(NULL)) {
 		struct hwrng *rng;
 
 		rng = get_current_rng();