Message ID | 1531329074.3260.9.camel@HansenPartnership.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
On Wed, Jul 11, 2018 at 10:11:14AM -0700, James Bottomley wrote: > tpm_tis regressed recently to the point where the TPM being driven by > it falls off the bus and cannot be contacted after some hours of use. > This is the failure trace: > > jejb@jarvis:~> dmesg|grep tpm > [ 3.282605] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE, rev-id 2) > [14566.626614] tpm tpm0: Operation Timed out > [14566.626621] tpm tpm0: tpm2_load_context: failed with a system error -62 > [14568.626607] tpm tpm0: tpm_try_transmit: tpm_send: error -62 > [14570.626594] tpm tpm0: tpm_try_transmit: tpm_send: error -62 > [14570.626605] tpm tpm0: tpm2_load_context: failed with a system error -62 > [14572.626526] tpm tpm0: tpm_try_transmit: tpm_send: error -62 > [14577.710441] tpm tpm0: tpm_try_transmit: tpm_send: error -62 > ... > > The problem is caused by a change that caused us to poke the TPM far > more often to see if it's ready. Apparently something about the bus > its on and the TPM means that it crashes or falls off the bus if you > poke it too often and once this happens, only a reboot will recover > it. I wonder if something about triggering ETIME even once breaks the driver so it can't talk to the chip at all thereafter.. Ie it doesn't abort the command properly and becomes desynced with the TIS execution protocol. I would be very surprised if polling the TIS status register effects the firmware running inside the chip.. BTW, no interrupt in your latop setup? I'm surprised by that.. Jason
On Wed, 2018-07-11 at 12:21 -0600, Jason Gunthorpe wrote: > On Wed, Jul 11, 2018 at 10:11:14AM -0700, James Bottomley wrote: > > tpm_tis regressed recently to the point where the TPM being driven > > by > > it falls off the bus and cannot be contacted after some hours of > > use. > > This is the failure trace: > > > > jejb@jarvis:~> dmesg|grep tpm > > [ 3.282605] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE, rev-id > > 2) > > [14566.626614] tpm tpm0: Operation Timed out > > [14566.626621] tpm tpm0: tpm2_load_context: failed with a system > > error -62 > > [14568.626607] tpm tpm0: tpm_try_transmit: tpm_send: error -62 > > [14570.626594] tpm tpm0: tpm_try_transmit: tpm_send: error -62 > > [14570.626605] tpm tpm0: tpm2_load_context: failed with a system > > error -62 > > [14572.626526] tpm tpm0: tpm_try_transmit: tpm_send: error -62 > > [14577.710441] tpm tpm0: tpm_try_transmit: tpm_send: error -62 > > ... > > > > The problem is caused by a change that caused us to poke the TPM > > far > > more often to see if it's ready. Apparently something about the > > bus > > its on and the TPM means that it crashes or falls off the bus if > > you > > poke it too often and once this happens, only a reboot will recover > > it. > > I wonder if something about triggering ETIME even once breaks the > driver so it can't talk to the chip at all thereafter.. > > Ie it doesn't abort the command properly and becomes desynced with > the TIS execution protocol. Yes, I wondered about this, but I don't understand the bus protocol well enough. The tpm-interface:tpm_try_transmit() which throws the first ETIME says after we get that we send chip->ops->cancel() which tpm_tis simply translates to tpm_tis_ready() which also times out. Is there a bigger hammer I can hit it with? > I would be very surprised if polling the TIS status register effects. Well, I was surprised too, but given the patch that's causing the problem, it was the only explanation I could come up with. > the firmware running inside the chip.. > > > BTW, no interrupt in your latop setup? I'm surprised by that.. Heh, I hear rumours that most laptops are delivered like this and a large software company might be annoyed enough about the problem to mandate interrupts the next time it produces a hardware requirements spec. James
On Wed, Jul 11, 2018 at 12:08:53PM -0700, James Bottomley wrote: > On Wed, 2018-07-11 at 12:21 -0600, Jason Gunthorpe wrote: > > On Wed, Jul 11, 2018 at 10:11:14AM -0700, James Bottomley wrote: > > > tpm_tis regressed recently to the point where the TPM being driven > > > by > > > it falls off the bus and cannot be contacted after some hours of > > > use. > > > This is the failure trace: > > > > > > jejb@jarvis:~> dmesg|grep tpm > > > [ 3.282605] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE, rev-id > > > 2) > > > [14566.626614] tpm tpm0: Operation Timed out > > > [14566.626621] tpm tpm0: tpm2_load_context: failed with a system > > > error -62 > > > [14568.626607] tpm tpm0: tpm_try_transmit: tpm_send: error -62 > > > [14570.626594] tpm tpm0: tpm_try_transmit: tpm_send: error -62 > > > [14570.626605] tpm tpm0: tpm2_load_context: failed with a system > > > error -62 > > > [14572.626526] tpm tpm0: tpm_try_transmit: tpm_send: error -62 > > > [14577.710441] tpm tpm0: tpm_try_transmit: tpm_send: error -62 > > > ... > > > > > > The problem is caused by a change that caused us to poke the TPM > > > far > > > more often to see if it's ready. Apparently something about the > > > bus > > > its on and the TPM means that it crashes or falls off the bus if > > > you > > > poke it too often and once this happens, only a reboot will recover > > > it. > > > > I wonder if something about triggering ETIME even once breaks the > > driver so it can't talk to the chip at all thereafter.. > > > > Ie it doesn't abort the command properly and becomes desynced with > > the TIS execution protocol. > > Yes, I wondered about this, but I don't understand the bus protocol > well enough. The tpm-interface:tpm_try_transmit() which throws the > first ETIME says after we get that we send chip->ops->cancel() which > tpm_tis simply translates to tpm_tis_ready() which also times out. Is > there a bigger hammer I can hit it with? I don't remember off hand.. But this is, IMHO, a better guess than the firmware crashes from reading the status register.. Perhaps reducing the timeout to force a ETIME would prove the theory? Jason
On Wed, 2018-07-11 at 14:01 -0600, Jason Gunthorpe wrote: > On Wed, Jul 11, 2018 at 12:08:53PM -0700, James Bottomley wrote: > > On Wed, 2018-07-11 at 12:21 -0600, Jason Gunthorpe wrote: > > > On Wed, Jul 11, 2018 at 10:11:14AM -0700, James Bottomley wrote: > > > > tpm_tis regressed recently to the point where the TPM being > > > > driven > > > > by > > > > it falls off the bus and cannot be contacted after some hours > > > > of > > > > use. > > > > This is the failure trace: > > > > > > > > jejb@jarvis:~> dmesg|grep tpm > > > > [ 3.282605] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE, > > > > rev-id > > > > 2) > > > > [14566.626614] tpm tpm0: Operation Timed out > > > > [14566.626621] tpm tpm0: tpm2_load_context: failed with a > > > > system > > > > error -62 > > > > [14568.626607] tpm tpm0: tpm_try_transmit: tpm_send: error -62 > > > > [14570.626594] tpm tpm0: tpm_try_transmit: tpm_send: error -62 > > > > [14570.626605] tpm tpm0: tpm2_load_context: failed with a > > > > system > > > > error -62 > > > > [14572.626526] tpm tpm0: tpm_try_transmit: tpm_send: error -62 > > > > [14577.710441] tpm tpm0: tpm_try_transmit: tpm_send: error -62 > > > > ... > > > > > > > > The problem is caused by a change that caused us to poke the > > > > TPM > > > > far > > > > more often to see if it's ready. Apparently something about > > > > the > > > > bus > > > > its on and the TPM means that it crashes or falls off the bus > > > > if > > > > you > > > > poke it too often and once this happens, only a reboot will > > > > recover > > > > it. > > > > > > I wonder if something about triggering ETIME even once breaks the > > > driver so it can't talk to the chip at all thereafter.. > > > > > > Ie it doesn't abort the command properly and becomes desynced > > > with the TIS execution protocol. > > > > Yes, I wondered about this, but I don't understand the bus protocol > > well enough. The tpm-interface:tpm_try_transmit() which throws the > > first ETIME says after we get that we send chip->ops->cancel() > > which tpm_tis simply translates to tpm_tis_ready() which also times > > out. Is there a bigger hammer I can hit it with? > > I don't remember off hand.. But this is, IMHO, a better guess than > the firmware crashes from reading the status register.. Oh, actually, I think the bus crashes or wedges, not the TPM. I just don't have any tools to probe the LPC. James > Perhaps reducing the timeout to force a ETIME would prove the theory?
Am 11. Juli 2018 22:39:05 MESZ schrieb James Bottomley <James.Bottomley@HansenPartnership.com>: >On Wed, 2018-07-11 at 14:01 -0600, Jason Gunthorpe wrote: >> On Wed, Jul 11, 2018 at 12:08:53PM -0700, James Bottomley wrote: >> > On Wed, 2018-07-11 at 12:21 -0600, Jason Gunthorpe wrote: >> > > On Wed, Jul 11, 2018 at 10:11:14AM -0700, James Bottomley wrote: >> > > > tpm_tis regressed recently to the point where the TPM being >> > > > driven >> > > > by >> > > > it falls off the bus and cannot be contacted after some hours >> > > > of >> > > > use. >> > > > This is the failure trace: >> > > > >> > > > jejb@jarvis:~> dmesg|grep tpm >> > > > [ 3.282605] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE, >> > > > rev-id >> > > > 2) >> > > > [14566.626614] tpm tpm0: Operation Timed out >> > > > [14566.626621] tpm tpm0: tpm2_load_context: failed with a >> > > > system >> > > > error -62 >> > > > [14568.626607] tpm tpm0: tpm_try_transmit: tpm_send: error -62 >> > > > [14570.626594] tpm tpm0: tpm_try_transmit: tpm_send: error -62 >> > > > [14570.626605] tpm tpm0: tpm2_load_context: failed with a >> > > > system >> > > > error -62 >> > > > [14572.626526] tpm tpm0: tpm_try_transmit: tpm_send: error -62 >> > > > [14577.710441] tpm tpm0: tpm_try_transmit: tpm_send: error -62 >> > > > ... >> > > > >> > > > The problem is caused by a change that caused us to poke the >> > > > TPM >> > > > far >> > > > more often to see if it's ready. Apparently something about >> > > > the >> > > > bus >> > > > its on and the TPM means that it crashes or falls off the bus >> > > > if >> > > > you >> > > > poke it too often and once this happens, only a reboot will >> > > > recover >> > > > it. >> > > >> > > I wonder if something about triggering ETIME even once breaks the >> > > driver so it can't talk to the chip at all thereafter.. >> > > >> > > Ie it doesn't abort the command properly and becomes desynced >> > > with the TIS execution protocol. >> > >> > Yes, I wondered about this, but I don't understand the bus protocol >> > well enough. The tpm-interface:tpm_try_transmit() which throws the >> > first ETIME says after we get that we send chip->ops->cancel() >> > which tpm_tis simply translates to tpm_tis_ready() which also times >> > out. Is there a bigger hammer I can hit it with? >> >> I don't remember off hand.. But this is, IMHO, a better guess than >> the firmware crashes from reading the status register.. > >Oh, actually, I think the bus crashes or wedges, not the TPM. I just >don't have any tools to probe the LPC. > I doubt that your fTPM is actually attached to LPC. And usually if lpc wedges it takes down your pc with it (from my experience) I wonder what the contents of the acc and sts registers actually are. If you want i can send you a small tis 'debug' tool tomorrow. >James > > >> Perhaps reducing the timeout to force a ETIME would prove the theory?
On Wed, 2018-07-11 at 22:51 +0200, Peter Huewe wrote: > > Am 11. Juli 2018 22:39:05 MESZ schrieb James Bottomley <James.Bottoml > ey@HansenPartnership.com>: > > On Wed, 2018-07-11 at 14:01 -0600, Jason Gunthorpe wrote: > > > On Wed, Jul 11, 2018 at 12:08:53PM -0700, James Bottomley wrote: > > > > On Wed, 2018-07-11 at 12:21 -0600, Jason Gunthorpe wrote: [...] > > > > > I wonder if something about triggering ETIME even once breaks > > > > > the driver so it can't talk to the chip at all thereafter.. > > > > > > > > > > Ie it doesn't abort the command properly and becomes desynced > > > > > with the TIS execution protocol. > > > > > > > > Yes, I wondered about this, but I don't understand the bus > > > > protocol well enough. The tpm-interface:tpm_try_transmit() > > > > which throws the first ETIME says after we get that we send > > > > chip->ops->cancel() which tpm_tis simply translates to > > > > tpm_tis_ready() which also times out. Is there a bigger hammer > > > > I can hit it with? > > > > > > I don't remember off hand.. But this is, IMHO, a better guess > > > than the firmware crashes from reading the status register.. > > > > Oh, actually, I think the bus crashes or wedges, not the TPM. I > > just don't have any tools to probe the LPC. > > > > I doubt that your fTPM is actually attached to LPC. > And usually if lpc wedges it takes down your pc with it (from my > experience) How do I tell what it is attached to? Sysfs doesn't seem to know but I assume the ACPI information should contain something relevant. > I wonder what the contents of the acc and sts registers actually are. > > > If you want i can send you a small tis 'debug' tool tomorrow. Sure, that would help ... I'm just leafing through the TIS manual now to see if there's anything I can build into the driver. James
On Wed, Jul 11, 2018 at 10:51:40PM +0200, Peter Huewe wrote: > > > Am 11. Juli 2018 22:39:05 MESZ schrieb James Bottomley <James.Bottomley@HansenPartnership.com>: > >On Wed, 2018-07-11 at 14:01 -0600, Jason Gunthorpe wrote: > >> On Wed, Jul 11, 2018 at 12:08:53PM -0700, James Bottomley wrote: > >> > On Wed, 2018-07-11 at 12:21 -0600, Jason Gunthorpe wrote: > >> > > On Wed, Jul 11, 2018 at 10:11:14AM -0700, James Bottomley wrote: > >> > > > tpm_tis regressed recently to the point where the TPM being > >> > > > driven > >> > > > by > >> > > > it falls off the bus and cannot be contacted after some hours > >> > > > of > >> > > > use. > >> > > > This is the failure trace: > >> > > > > >> > > > jejb@jarvis:~> dmesg|grep tpm > >> > > > [ 3.282605] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE, > >> > > > rev-id > >> > > > 2) > >> > > > [14566.626614] tpm tpm0: Operation Timed out > >> > > > [14566.626621] tpm tpm0: tpm2_load_context: failed with a > >> > > > system > >> > > > error -62 > >> > > > [14568.626607] tpm tpm0: tpm_try_transmit: tpm_send: error -62 > >> > > > [14570.626594] tpm tpm0: tpm_try_transmit: tpm_send: error -62 > >> > > > [14570.626605] tpm tpm0: tpm2_load_context: failed with a > >> > > > system > >> > > > error -62 > >> > > > [14572.626526] tpm tpm0: tpm_try_transmit: tpm_send: error -62 > >> > > > [14577.710441] tpm tpm0: tpm_try_transmit: tpm_send: error -62 > >> > > > ... > >> > > > > >> > > > The problem is caused by a change that caused us to poke the > >> > > > TPM > >> > > > far > >> > > > more often to see if it's ready. Apparently something about > >> > > > the > >> > > > bus > >> > > > its on and the TPM means that it crashes or falls off the bus > >> > > > if > >> > > > you > >> > > > poke it too often and once this happens, only a reboot will > >> > > > recover > >> > > > it. > >> > > > >> > > I wonder if something about triggering ETIME even once breaks the > >> > > driver so it can't talk to the chip at all thereafter.. > >> > > > >> > > Ie it doesn't abort the command properly and becomes desynced > >> > > with the TIS execution protocol. > >> > > >> > Yes, I wondered about this, but I don't understand the bus protocol > >> > well enough. The tpm-interface:tpm_try_transmit() which throws the > >> > first ETIME says after we get that we send chip->ops->cancel() > >> > which tpm_tis simply translates to tpm_tis_ready() which also times > >> > out. Is there a bigger hammer I can hit it with? > >> > >> I don't remember off hand.. But this is, IMHO, a better guess than > >> the firmware crashes from reading the status register.. > > > >Oh, actually, I think the bus crashes or wedges, not the TPM. I just > >don't have any tools to probe the LPC. > > > > I doubt that your fTPM is actually attached to LPC. > And usually if lpc wedges it takes down your pc with it (from my experience) Yes, agree. Very, very unlikely LPC crashed or wedged. You could try reading the ID numbers register (eg the source of this data 'device-id 0xFE rev-id 2'). That should be on the HW side of the FIFO bridge and if it returs the same values as boot the bus is not broken, no matter what the bus is. Jason
On Wed, Jul 11, 2018 at 12:08:53PM -0700, James Bottomley wrote: > Yes, I wondered about this, but I don't understand the bus protocol > well enough. The tpm-interface:tpm_try_transmit() which throws the > first ETIME says after we get that we send chip->ops->cancel() which > tpm_tis simply translates to tpm_tis_ready() which also times out. Is > there a bigger hammer I can hit it with? tpm_tis_ready() gets called twice and it is unsynced (no wait). Maybe this confuses the chip somehow? /Jarkko
diff --git a/drivers/char/tpm/tpm.h b/drivers/char/tpm/tpm.h index 8114ea88449f..5d147a8dcca8 100644 --- a/drivers/char/tpm/tpm.h +++ b/drivers/char/tpm/tpm.h @@ -58,8 +58,8 @@ enum tpm_timeout { TPM_TIMEOUT_RETRY = 100, /* msecs */ TPM_TIMEOUT_RANGE_US = 300, /* usecs */ TPM_TIMEOUT_POLL = 1, /* msecs */ - TPM_TIMEOUT_USECS_MIN = 100, /* usecs */ - TPM_TIMEOUT_USECS_MAX = 500 /* usecs */ + TPM_TIMEOUT_USECS_MIN = 750, /* usecs */ + TPM_TIMEOUT_USECS_MAX = 1000, /* usecs */ }; /* TPM addresses */
tpm_tis regressed recently to the point where the TPM being driven by it falls off the bus and cannot be contacted after some hours of use. This is the failure trace: jejb@jarvis:~> dmesg|grep tpm [ 3.282605] tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE, rev-id 2) [14566.626614] tpm tpm0: Operation Timed out [14566.626621] tpm tpm0: tpm2_load_context: failed with a system error -62 [14568.626607] tpm tpm0: tpm_try_transmit: tpm_send: error -62 [14570.626594] tpm tpm0: tpm_try_transmit: tpm_send: error -62 [14570.626605] tpm tpm0: tpm2_load_context: failed with a system error -62 [14572.626526] tpm tpm0: tpm_try_transmit: tpm_send: error -62 [14577.710441] tpm tpm0: tpm_try_transmit: tpm_send: error -62 ... The problem is caused by a change that caused us to poke the TPM far more often to see if it's ready. Apparently something about the bus its on and the TPM means that it crashes or falls off the bus if you poke it too often and once this happens, only a reboot will recover it. The fix I've come up with is to adjust the timings so the TPM no longer falls of the bus. Obviously, this fix works for my Nuvoton NPCT6xxx but that's the only TPM I've tested it with. Fixes: 424eaf910c32 tpm: reduce polling time to usecs for even finer granularity Signed-off-by: James Bottomley <James.Bottomley@HansenPartnership.com> --- drivers/char/tpm/tpm.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-)