diff mbox

tpm.h: increase poll timings to fix tpm_tis regression

Message ID 1531329074.3260.9.camel@HansenPartnership.com (mailing list archive)
State New, archived
Headers show

Commit Message

James Bottomley July 11, 2018, 5:11 p.m. UTC
tpm_tis regressed recently to the point where the TPM being driven by
it falls off the bus and cannot be contacted after some hours of use.
This is the failure trace:

jejb@jarvis:~> dmesg|grep tpm
[    3.282605] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE, rev-id 2)
[14566.626614] tpm tpm0: Operation Timed out
[14566.626621] tpm tpm0: tpm2_load_context: failed with a system error -62
[14568.626607] tpm tpm0: tpm_try_transmit: tpm_send: error -62
[14570.626594] tpm tpm0: tpm_try_transmit: tpm_send: error -62
[14570.626605] tpm tpm0: tpm2_load_context: failed with a system error -62
[14572.626526] tpm tpm0: tpm_try_transmit: tpm_send: error -62
[14577.710441] tpm tpm0: tpm_try_transmit: tpm_send: error -62
...

The problem is caused by a change that caused us to poke the TPM far
more often to see if it's ready.  Apparently something about the bus
its on and the TPM means that it crashes or falls off the bus if you
poke it too often and once this happens, only a reboot will recover
it.

The fix I've come up with is to adjust the timings so the TPM no
longer falls of the bus.  Obviously, this fix works for my Nuvoton
NPCT6xxx but that's the only TPM I've tested it with.

Fixes: 424eaf910c32 tpm: reduce polling time to usecs for even finer granularity
Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com>
---
 drivers/char/tpm/tpm.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Jason Gunthorpe July 11, 2018, 6:21 p.m. UTC | #1
On Wed, Jul 11, 2018 at 10:11:14AM -0700, James Bottomley wrote:
> tpm_tis regressed recently to the point where the TPM being driven by
> it falls off the bus and cannot be contacted after some hours of use.
> This is the failure trace:
> 
> jejb@jarvis:~> dmesg|grep tpm
> [    3.282605] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE, rev-id 2)
> [14566.626614] tpm tpm0: Operation Timed out
> [14566.626621] tpm tpm0: tpm2_load_context: failed with a system error -62
> [14568.626607] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> [14570.626594] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> [14570.626605] tpm tpm0: tpm2_load_context: failed with a system error -62
> [14572.626526] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> [14577.710441] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> ...
> 
> The problem is caused by a change that caused us to poke the TPM far
> more often to see if it's ready.  Apparently something about the bus
> its on and the TPM means that it crashes or falls off the bus if you
> poke it too often and once this happens, only a reboot will recover
> it.

I wonder if something about triggering ETIME even once breaks the
driver so it can't talk to the chip at all thereafter..

Ie it doesn't abort the command properly and becomes desynced with the
TIS execution protocol.

I would be very surprised if polling the TIS status register effects
the firmware running inside the chip..

BTW, no interrupt in your latop setup? I'm surprised by that..

Jason
James Bottomley July 11, 2018, 7:08 p.m. UTC | #2
On Wed, 2018-07-11 at 12:21 -0600, Jason Gunthorpe wrote:
> On Wed, Jul 11, 2018 at 10:11:14AM -0700, James Bottomley wrote:
> > tpm_tis regressed recently to the point where the TPM being driven
> > by
> > it falls off the bus and cannot be contacted after some hours of
> > use.
> > This is the failure trace:
> > 
> > jejb@jarvis:~> dmesg|grep tpm
> > [    3.282605] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE, rev-id 
> > 2)
> > [14566.626614] tpm tpm0: Operation Timed out
> > [14566.626621] tpm tpm0: tpm2_load_context: failed with a system
> > error -62
> > [14568.626607] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> > [14570.626594] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> > [14570.626605] tpm tpm0: tpm2_load_context: failed with a system
> > error -62
> > [14572.626526] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> > [14577.710441] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> > ...
> > 
> > The problem is caused by a change that caused us to poke the TPM
> > far
> > more often to see if it's ready.  Apparently something about the
> > bus
> > its on and the TPM means that it crashes or falls off the bus if
> > you
> > poke it too often and once this happens, only a reboot will recover
> > it.
> 
> I wonder if something about triggering ETIME even once breaks the
> driver so it can't talk to the chip at all thereafter..
> 
> Ie it doesn't abort the command properly and becomes desynced with
> the TIS execution protocol.

Yes, I wondered about this, but I don't understand the bus protocol
well enough.  The tpm-interface:tpm_try_transmit() which throws the
first ETIME says after we get that we send chip->ops->cancel() which
tpm_tis simply translates to tpm_tis_ready() which also times out.  Is
there a bigger hammer I can hit it with?

> I would be very surprised if polling the TIS status register effects.

Well, I was surprised too, but given the patch that's causing the
problem, it was the only explanation I could come up with.

> the firmware running inside the chip..
> 
> 
> BTW, no interrupt in your latop setup? I'm surprised by that..

Heh, I hear rumours that most laptops are delivered like this and a
large software company might be annoyed enough about the problem to
mandate interrupts the next time it produces a hardware requirements
spec.

James
Jason Gunthorpe July 11, 2018, 8:01 p.m. UTC | #3
On Wed, Jul 11, 2018 at 12:08:53PM -0700, James Bottomley wrote:
> On Wed, 2018-07-11 at 12:21 -0600, Jason Gunthorpe wrote:
> > On Wed, Jul 11, 2018 at 10:11:14AM -0700, James Bottomley wrote:
> > > tpm_tis regressed recently to the point where the TPM being driven
> > > by
> > > it falls off the bus and cannot be contacted after some hours of
> > > use.
> > > This is the failure trace:
> > > 
> > > jejb@jarvis:~> dmesg|grep tpm
> > > [    3.282605] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE, rev-id 
> > > 2)
> > > [14566.626614] tpm tpm0: Operation Timed out
> > > [14566.626621] tpm tpm0: tpm2_load_context: failed with a system
> > > error -62
> > > [14568.626607] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> > > [14570.626594] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> > > [14570.626605] tpm tpm0: tpm2_load_context: failed with a system
> > > error -62
> > > [14572.626526] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> > > [14577.710441] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> > > ...
> > > 
> > > The problem is caused by a change that caused us to poke the TPM
> > > far
> > > more often to see if it's ready.  Apparently something about the
> > > bus
> > > its on and the TPM means that it crashes or falls off the bus if
> > > you
> > > poke it too often and once this happens, only a reboot will recover
> > > it.
> > 
> > I wonder if something about triggering ETIME even once breaks the
> > driver so it can't talk to the chip at all thereafter..
> > 
> > Ie it doesn't abort the command properly and becomes desynced with
> > the TIS execution protocol.
> 
> Yes, I wondered about this, but I don't understand the bus protocol
> well enough.  The tpm-interface:tpm_try_transmit() which throws the
> first ETIME says after we get that we send chip->ops->cancel() which
> tpm_tis simply translates to tpm_tis_ready() which also times out.  Is
> there a bigger hammer I can hit it with?

I don't remember off hand.. But this is, IMHO, a better guess than the
firmware crashes from reading the status register..

Perhaps reducing the timeout to force a ETIME would prove the theory?

Jason
James Bottomley July 11, 2018, 8:39 p.m. UTC | #4
On Wed, 2018-07-11 at 14:01 -0600, Jason Gunthorpe wrote:
> On Wed, Jul 11, 2018 at 12:08:53PM -0700, James Bottomley wrote:
> > On Wed, 2018-07-11 at 12:21 -0600, Jason Gunthorpe wrote:
> > > On Wed, Jul 11, 2018 at 10:11:14AM -0700, James Bottomley wrote:
> > > > tpm_tis regressed recently to the point where the TPM being
> > > > driven
> > > > by
> > > > it falls off the bus and cannot be contacted after some hours
> > > > of
> > > > use.
> > > > This is the failure trace:
> > > > 
> > > > jejb@jarvis:~> dmesg|grep tpm
> > > > [    3.282605] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE,
> > > > rev-id 
> > > > 2)
> > > > [14566.626614] tpm tpm0: Operation Timed out
> > > > [14566.626621] tpm tpm0: tpm2_load_context: failed with a
> > > > system
> > > > error -62
> > > > [14568.626607] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> > > > [14570.626594] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> > > > [14570.626605] tpm tpm0: tpm2_load_context: failed with a
> > > > system
> > > > error -62
> > > > [14572.626526] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> > > > [14577.710441] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> > > > ...
> > > > 
> > > > The problem is caused by a change that caused us to poke the
> > > > TPM
> > > > far
> > > > more often to see if it's ready.  Apparently something about
> > > > the
> > > > bus
> > > > its on and the TPM means that it crashes or falls off the bus
> > > > if
> > > > you
> > > > poke it too often and once this happens, only a reboot will
> > > > recover
> > > > it.
> > > 
> > > I wonder if something about triggering ETIME even once breaks the
> > > driver so it can't talk to the chip at all thereafter..
> > > 
> > > Ie it doesn't abort the command properly and becomes desynced
> > > with the TIS execution protocol.
> > 
> > Yes, I wondered about this, but I don't understand the bus protocol
> > well enough.  The tpm-interface:tpm_try_transmit() which throws the
> > first ETIME says after we get that we send chip->ops->cancel()
> > which tpm_tis simply translates to tpm_tis_ready() which also times
> > out.  Is there a bigger hammer I can hit it with?
> 
> I don't remember off hand.. But this is, IMHO, a better guess than
> the firmware crashes from reading the status register..

Oh, actually, I think the bus crashes or wedges, not the TPM.  I just
don't have any tools to probe the LPC.

James


> Perhaps reducing the timeout to force a ETIME would prove the theory?
Peter Huewe July 11, 2018, 8:51 p.m. UTC | #5
Am 11. Juli 2018 22:39:05 MESZ schrieb James Bottomley <James.Bottomley@HansenPartnership.com>:
>On Wed, 2018-07-11 at 14:01 -0600, Jason Gunthorpe wrote:
>> On Wed, Jul 11, 2018 at 12:08:53PM -0700, James Bottomley wrote:
>> > On Wed, 2018-07-11 at 12:21 -0600, Jason Gunthorpe wrote:
>> > > On Wed, Jul 11, 2018 at 10:11:14AM -0700, James Bottomley wrote:
>> > > > tpm_tis regressed recently to the point where the TPM being
>> > > > driven
>> > > > by
>> > > > it falls off the bus and cannot be contacted after some hours
>> > > > of
>> > > > use.
>> > > > This is the failure trace:
>> > > > 
>> > > > jejb@jarvis:~> dmesg|grep tpm
>> > > > [    3.282605] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE,
>> > > > rev-id 
>> > > > 2)
>> > > > [14566.626614] tpm tpm0: Operation Timed out
>> > > > [14566.626621] tpm tpm0: tpm2_load_context: failed with a
>> > > > system
>> > > > error -62
>> > > > [14568.626607] tpm tpm0: tpm_try_transmit: tpm_send: error -62
>> > > > [14570.626594] tpm tpm0: tpm_try_transmit: tpm_send: error -62
>> > > > [14570.626605] tpm tpm0: tpm2_load_context: failed with a
>> > > > system
>> > > > error -62
>> > > > [14572.626526] tpm tpm0: tpm_try_transmit: tpm_send: error -62
>> > > > [14577.710441] tpm tpm0: tpm_try_transmit: tpm_send: error -62
>> > > > ...
>> > > > 
>> > > > The problem is caused by a change that caused us to poke the
>> > > > TPM
>> > > > far
>> > > > more often to see if it's ready.  Apparently something about
>> > > > the
>> > > > bus
>> > > > its on and the TPM means that it crashes or falls off the bus
>> > > > if
>> > > > you
>> > > > poke it too often and once this happens, only a reboot will
>> > > > recover
>> > > > it.
>> > > 
>> > > I wonder if something about triggering ETIME even once breaks the
>> > > driver so it can't talk to the chip at all thereafter..
>> > > 
>> > > Ie it doesn't abort the command properly and becomes desynced
>> > > with the TIS execution protocol.
>> > 
>> > Yes, I wondered about this, but I don't understand the bus protocol
>> > well enough.  The tpm-interface:tpm_try_transmit() which throws the
>> > first ETIME says after we get that we send chip->ops->cancel()
>> > which tpm_tis simply translates to tpm_tis_ready() which also times
>> > out.  Is there a bigger hammer I can hit it with?
>> 
>> I don't remember off hand.. But this is, IMHO, a better guess than
>> the firmware crashes from reading the status register..
>
>Oh, actually, I think the bus crashes or wedges, not the TPM.  I just
>don't have any tools to probe the LPC.
>

I doubt that your fTPM is actually attached to LPC.
And usually if lpc  wedges it takes down your pc with it (from my experience)

I wonder what the contents of the acc and sts registers actually are.


If you want i can send you a small tis 'debug' tool tomorrow.

>James
>
>
>> Perhaps reducing the timeout to force a ETIME would prove the theory?
James Bottomley July 11, 2018, 8:54 p.m. UTC | #6
On Wed, 2018-07-11 at 22:51 +0200, Peter Huewe wrote:
> 
> Am 11. Juli 2018 22:39:05 MESZ schrieb James Bottomley <James.Bottoml
> ey@HansenPartnership.com>:
> > On Wed, 2018-07-11 at 14:01 -0600, Jason Gunthorpe wrote:
> > > On Wed, Jul 11, 2018 at 12:08:53PM -0700, James Bottomley wrote:
> > > > On Wed, 2018-07-11 at 12:21 -0600, Jason Gunthorpe wrote:
[...]
> > > > > I wonder if something about triggering ETIME even once breaks
> > > > > the driver so it can't talk to the chip at all thereafter..
> > > > > 
> > > > > Ie it doesn't abort the command properly and becomes desynced
> > > > > with the TIS execution protocol.
> > > > 
> > > > Yes, I wondered about this, but I don't understand the bus
> > > > protocol well enough.  The tpm-interface:tpm_try_transmit()
> > > > which throws the first ETIME says after we get that we send
> > > > chip->ops->cancel() which tpm_tis simply translates to
> > > > tpm_tis_ready() which also times out.  Is there a bigger hammer
> > > > I can hit it with?
> > > 
> > > I don't remember off hand.. But this is, IMHO, a better guess
> > > than the firmware crashes from reading the status register..
> > 
> > Oh, actually, I think the bus crashes or wedges, not the TPM.  I
> > just don't have any tools to probe the LPC.
> > 
> 
> I doubt that your fTPM is actually attached to LPC.
> And usually if lpc  wedges it takes down your pc with it (from my
> experience)

How do I tell what it is attached to?  Sysfs doesn't seem to know but I
assume the ACPI information should contain something relevant.

> I wonder what the contents of the acc and sts registers actually are.
> 
> 
> If you want i can send you a small tis 'debug' tool tomorrow.

Sure, that would help ... I'm just leafing through the TIS manual now
to see if there's anything I can build into the driver.

James
Jason Gunthorpe July 11, 2018, 9:06 p.m. UTC | #7
On Wed, Jul 11, 2018 at 10:51:40PM +0200, Peter Huewe wrote:
> 
> 
> Am 11. Juli 2018 22:39:05 MESZ schrieb James Bottomley <James.Bottomley@HansenPartnership.com>:
> >On Wed, 2018-07-11 at 14:01 -0600, Jason Gunthorpe wrote:
> >> On Wed, Jul 11, 2018 at 12:08:53PM -0700, James Bottomley wrote:
> >> > On Wed, 2018-07-11 at 12:21 -0600, Jason Gunthorpe wrote:
> >> > > On Wed, Jul 11, 2018 at 10:11:14AM -0700, James Bottomley wrote:
> >> > > > tpm_tis regressed recently to the point where the TPM being
> >> > > > driven
> >> > > > by
> >> > > > it falls off the bus and cannot be contacted after some hours
> >> > > > of
> >> > > > use.
> >> > > > This is the failure trace:
> >> > > > 
> >> > > > jejb@jarvis:~> dmesg|grep tpm
> >> > > > [    3.282605] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE,
> >> > > > rev-id 
> >> > > > 2)
> >> > > > [14566.626614] tpm tpm0: Operation Timed out
> >> > > > [14566.626621] tpm tpm0: tpm2_load_context: failed with a
> >> > > > system
> >> > > > error -62
> >> > > > [14568.626607] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> >> > > > [14570.626594] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> >> > > > [14570.626605] tpm tpm0: tpm2_load_context: failed with a
> >> > > > system
> >> > > > error -62
> >> > > > [14572.626526] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> >> > > > [14577.710441] tpm tpm0: tpm_try_transmit: tpm_send: error -62
> >> > > > ...
> >> > > > 
> >> > > > The problem is caused by a change that caused us to poke the
> >> > > > TPM
> >> > > > far
> >> > > > more often to see if it's ready.  Apparently something about
> >> > > > the
> >> > > > bus
> >> > > > its on and the TPM means that it crashes or falls off the bus
> >> > > > if
> >> > > > you
> >> > > > poke it too often and once this happens, only a reboot will
> >> > > > recover
> >> > > > it.
> >> > > 
> >> > > I wonder if something about triggering ETIME even once breaks the
> >> > > driver so it can't talk to the chip at all thereafter..
> >> > > 
> >> > > Ie it doesn't abort the command properly and becomes desynced
> >> > > with the TIS execution protocol.
> >> > 
> >> > Yes, I wondered about this, but I don't understand the bus protocol
> >> > well enough.  The tpm-interface:tpm_try_transmit() which throws the
> >> > first ETIME says after we get that we send chip->ops->cancel()
> >> > which tpm_tis simply translates to tpm_tis_ready() which also times
> >> > out.  Is there a bigger hammer I can hit it with?
> >> 
> >> I don't remember off hand.. But this is, IMHO, a better guess than
> >> the firmware crashes from reading the status register..
> >
> >Oh, actually, I think the bus crashes or wedges, not the TPM.  I just
> >don't have any tools to probe the LPC.
> >
> 
> I doubt that your fTPM is actually attached to LPC.
> And usually if lpc  wedges it takes down your pc with it (from my experience)

Yes, agree. Very, very unlikely LPC crashed or wedged.

You could try reading the ID numbers register (eg the source of this
data 'device-id 0xFE rev-id 2').

That should be on the HW side of the FIFO bridge and if it returs the
same values as boot the bus is not broken, no matter what the bus is.

Jason
Jarkko Sakkinen July 16, 2018, 8:16 p.m. UTC | #8
On Wed, Jul 11, 2018 at 12:08:53PM -0700, James Bottomley wrote:
> Yes, I wondered about this, but I don't understand the bus protocol
> well enough.  The tpm-interface:tpm_try_transmit() which throws the
> first ETIME says after we get that we send chip->ops->cancel() which
> tpm_tis simply translates to tpm_tis_ready() which also times out.  Is
> there a bigger hammer I can hit it with?

tpm_tis_ready() gets called twice and it is unsynced (no wait). Maybe
this confuses the chip somehow?

/Jarkko
diff mbox

Patch

diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h
index 8114ea88449f..5d147a8dcca8 100644
--- a/drivers/char/tpm/tpm.h
+++ b/drivers/char/tpm/tpm.h
@@ -58,8 +58,8 @@  enum tpm_timeout {
 	TPM_TIMEOUT_RETRY = 100, /* msecs */
 	TPM_TIMEOUT_RANGE_US = 300,	/* usecs */
 	TPM_TIMEOUT_POLL = 1,	/* msecs */
-	TPM_TIMEOUT_USECS_MIN = 100,      /* usecs */
-	TPM_TIMEOUT_USECS_MAX = 500      /* usecs */
+	TPM_TIMEOUT_USECS_MIN = 750,      /* usecs */
+	TPM_TIMEOUT_USECS_MAX = 1000,      /* usecs */
 };
 
 /* TPM addresses */