Message ID | 20201001180925.13808-1-James.Bottomley@HansenPartnership.com (mailing list archive) |
---|---|
Headers | show |
Series | tpm_tis: fix interrupts (again) | expand |
James Bottomley @ 2020-10-01 11:09 MST: > The current state of the TIS TPM is that interrupts have been globally > disabled by various changes. The problems we got reported the last > time they were enabled was interrupt storms. With my own TIS TPM, > I've found that this is caused because my TPM doesn't do legacy > cycles, The TIS spec (chapter 6.1 "Locality Usage Per Register") > requires any TIS TPM without legacy cycles not to act on any write to > an interrupt register unless the locality is enabled. This means if > an interrupt fires after we relinquish the locality, the TPM_EOI in > the interrupt routine is ineffective meaning the same interrupt > triggers over and over again. This problem also means we can have > trouble setting up interrupts on TIS TPMs because the current init > code does the setup before the locality is claimed for the first time. > > James > I tested initially with the following commits reverted: aa4a63dd9816 tpm: Revert "tpm_tis_core: Turn on the TPM before probing IRQ's" | 2020-01-06 | (Stefan Berger) dda8b2af395b tpm: Revert "tpm_tis_core: Set TPM_CHIP_FLAG_IRQ before probing for interrupts" | 2020-01-06 | (Stefan Berger) The laptop doesn't become unusable, but I lose the trackpad and get the following: [ 3.070501] irq 31: nobody cared (try booting with the "irqpoll" option) [ 3.070504] CPU: 2 PID: 251 Comm: rngd Not tainted 5.8.13-201.local.fc32.x86_64 #1 [ 3.070504] Hardware name: LENOVO 20NYS7K90F/20NYS7K90F, BIOS N2JET83W (1.61 ) 11/22/2019 [ 3.070505] Call Trace: [ 3.070511] dump_stack+0x6b/0x88 [ 3.070514] __report_bad_irq+0x35/0xa7 [ 3.070516] note_interrupt.cold+0xb/0x6a [ 3.070518] handle_irq_event+0x88/0x8a [ 3.070519] handle_fasteoi_irq+0x78/0x1c0 [ 3.070522] common_interrupt+0x68/0x140 [ 3.070524] ? asm_common_interrupt+0x8/0x40 [ 3.070525] asm_common_interrupt+0x1e/0x40 [ 3.070527] RIP: 0033:0x7f2eea3249b5 [ 3.070529] Code: e0 48 c1 e8 3c 83 e0 01 48 31 45 f0 48 8b 45 e0 48 c1 e8 37 83 e0 01 48 31 45 f0 48 8b 45 e0 48 c1 e8 1e 83 e0 01 48 31 45 f0 <48> 8b 45 e0 48 c1 e8 1b 83 e0 01 48 31 45 f0 48 8b 45 e0 48 c1 e8 [ 3.070529] RSP: 002b:00007f2ee3ffebc0 EFLAGS: 00000202 [ 3.070530] RAX: 0000000000000001 RBX: 000000000000000a RCX: 000000000000002e [ 3.070531] RDX: 0463400000000000 RSI: 0000000000000000 RDI: 0000000000000001 [ 3.070532] RBP: 00007f2ee3ffec10 R08: 0000000000000000 R09: 0000000000000035 [ 3.070532] R10: 00007ffde716f080 R11: 00007ffde716f080 R12: 00007f2edc004c00 [ 3.070533] R13: 00007ffde700a54f R14: 00007f2ee3ffed10 R15: 00005598a41e3680 [ 3.070534] handlers: [ 3.070537] [<000000007b6f3232>] tis_int_handler [ 3.070538] Disabling IRQ #31 ... [ 14.956342] irq 16: nobody cared (try booting with the "irqpoll" option) [ 14.956344] CPU: 0 PID: 1013 Comm: rngd Not tainted 5.8.13-201.local.fc32.x86_64 #1 [ 14.956344] Hardware name: LENOVO 20NYS7K90F/20NYS7K90F, BIOS N2JET83W (1.61 ) 11/22/2019 [ 14.956345] Call Trace: [ 14.956350] dump_stack+0x6b/0x88 [ 14.956353] __report_bad_irq+0x35/0xa7 [ 14.956354] note_interrupt.cold+0xb/0x6a [ 14.956355] handle_irq_event+0x88/0x8a [ 14.956356] handle_fasteoi_irq+0x78/0x1c0 [ 14.956358] common_interrupt+0x68/0x140 [ 14.956360] ? asm_common_interrupt+0x8/0x40 [ 14.956361] asm_common_interrupt+0x1e/0x40 [ 14.956362] RIP: 0033:0x7fcaff4399d3 [ 14.956363] Code: e0 48 c1 e8 1e 83 e0 01 48 31 45 f0 48 8b 45 e0 48 c1 e8 1b 83 e0 01 48 31 45 f0 48 8b 45 e0 48 c1 e8 16 83 e0 01 48 31 45 f0 <48> d1 65 e0 48 8b 45 f0 48 31 45 e0 83 45 d4 01 83 7d d4 40 0f 86 [ 14.956364] RSP: 002b:00007fcafe544bc0 EFLAGS: 00000246 [ 14.956364] RAX: 0000000000000000 RBX: 000000000000000a RCX: 0000000000000026 [ 14.956365] RDX: 000df20000000000 RSI: 0000000000000000 RDI: 0000000000000001 [ 14.956365] RBP: 00007fcafe544c10 R08: 0000000000000000 R09: 0000000000000035 [ 14.956366] R10: 00007ffc30bd2080 R11: 00007ffc30bd2080 R12: 00007fcaf0004c00 [ 14.956366] R13: 00007fcaf0004c00 R14: 00007fcaf0000b60 R15: 0000560216fcf0f2 [ 14.956367] handlers: [ 14.956370] [<0000000024c0571e>] i801_isr [i2c_i801] [ 14.956371] Disabling IRQ #16 /proc/interrupts shows: 16: 100000 0 0 0 0 0 0 0 IR-IO-APIC 16-fasteoi i801_smbus 31: 0 0 100000 0 0 0 0 0 IR-IO-APIC 31-fasteoi tpm0 I also get this behavior with your patchset applied. If I boot with tpm_tis.interrupts=0 it behaves. The thought from Hans is to look at the dmi info and key off the vendor being Lenovo and bios date to decide if interrupts should be disabled. Since Dan ran also into this with something internal at Intel I don't know if that will be sufficient. I hope to look over this patchset tomorrow. Regards, Jerry
On Thu, Oct 01, 2020 at 11:09:20AM -0700, James Bottomley wrote: > The current state of the TIS TPM is that interrupts have been globally > disabled by various changes. The problems we got reported the last > time they were enabled was interrupt storms. With my own TIS TPM, > I've found that this is caused because my TPM doesn't do legacy > cycles, The TIS spec (chapter 6.1 "Locality Usage Per Register") > requires any TIS TPM without legacy cycles not to act on any write to > an interrupt register unless the locality is enabled. This means if > an interrupt fires after we relinquish the locality, the TPM_EOI in > the interrupt routine is ineffective meaning the same interrupt > triggers over and over again. This problem also means we can have > trouble setting up interrupts on TIS TPMs because the current init > code does the setup before the locality is claimed for the first time. > > James You should consider expanding the audience. Jerry, once you have some bandwidth (no rush, does not land before rc2), it would be great that if you could try this. I'm emphasizing this just because of the intersection. I think it would also make senset to get tested-by from Nayna. Speaking of the changelog, I almost never have encounter a patch set that does not have a changelog in the cover letter. And I'm not able to interpret the process text in a way that it would ask to scatter changelogs to the patches, and not put them to the cover letter [*]. Thus, I trust what I see as commodity. Not only that but it also helps a lot with to see quickly what has changed, especially if the patches are such that you have to take them in eventually. In my own patch sets, I've recently started to xref associated lore korg discussions in the change log entries, when it applies. I started to do this with the SGX series when I had missed to address a number of Boris' comments. It has helped a lot with my own tracking and I assume it is also helpful for reviewers and maintainers to get the context, when they need it. The last paragraph is not something I demand. I'm merely just mentioning it because I think it is a good practice for any patch set. [*] https://lore.kernel.org/linux-integrity/14edea1f5092c2b8442165756b2ee32e56bed1eb.camel@HansenPartnership.com/ /Jarkko
On Sun, Oct 11, 2020 at 10:39:07PM -0700, Jerry Snitselaar wrote: > > James Bottomley @ 2020-10-01 11:09 MST: > > > The current state of the TIS TPM is that interrupts have been globally > > disabled by various changes. The problems we got reported the last > > time they were enabled was interrupt storms. With my own TIS TPM, > > I've found that this is caused because my TPM doesn't do legacy > > cycles, The TIS spec (chapter 6.1 "Locality Usage Per Register") > > requires any TIS TPM without legacy cycles not to act on any write to > > an interrupt register unless the locality is enabled. This means if > > an interrupt fires after we relinquish the locality, the TPM_EOI in > > the interrupt routine is ineffective meaning the same interrupt > > triggers over and over again. This problem also means we can have > > trouble setting up interrupts on TIS TPMs because the current init > > code does the setup before the locality is claimed for the first time. > > > > James > > > > > I tested initially with the following commits reverted: > > aa4a63dd9816 tpm: Revert "tpm_tis_core: Turn on the TPM before probing IRQ's" | 2020-01-06 | (Stefan Berger) > dda8b2af395b tpm: Revert "tpm_tis_core: Set TPM_CHIP_FLAG_IRQ before probing for interrupts" | 2020-01-06 | (Stefan Berger) > > The laptop doesn't become unusable, but I lose the trackpad and get the following: > > [ 3.070501] irq 31: nobody cared (try booting with the "irqpoll" option) > [ 3.070504] CPU: 2 PID: 251 Comm: rngd Not tainted 5.8.13-201.local.fc32.x86_64 #1 > [ 3.070504] Hardware name: LENOVO 20NYS7K90F/20NYS7K90F, BIOS N2JET83W (1.61 ) 11/22/2019 > [ 3.070505] Call Trace: > [ 3.070511] dump_stack+0x6b/0x88 > [ 3.070514] __report_bad_irq+0x35/0xa7 > [ 3.070516] note_interrupt.cold+0xb/0x6a > [ 3.070518] handle_irq_event+0x88/0x8a > [ 3.070519] handle_fasteoi_irq+0x78/0x1c0 > [ 3.070522] common_interrupt+0x68/0x140 > [ 3.070524] ? asm_common_interrupt+0x8/0x40 > [ 3.070525] asm_common_interrupt+0x1e/0x40 > [ 3.070527] RIP: 0033:0x7f2eea3249b5 > [ 3.070529] Code: e0 48 c1 e8 3c 83 e0 01 48 31 45 f0 48 8b 45 e0 48 c1 e8 37 83 e0 01 48 31 45 f0 48 8b 45 e0 48 c1 e8 1e 83 e0 01 48 31 45 f0 <48> 8b 45 e0 48 c1 e8 1b 83 e0 01 48 31 45 f0 48 8b 45 e0 48 c1 e8 > [ 3.070529] RSP: 002b:00007f2ee3ffebc0 EFLAGS: 00000202 > [ 3.070530] RAX: 0000000000000001 RBX: 000000000000000a RCX: 000000000000002e > [ 3.070531] RDX: 0463400000000000 RSI: 0000000000000000 RDI: 0000000000000001 > [ 3.070532] RBP: 00007f2ee3ffec10 R08: 0000000000000000 R09: 0000000000000035 > [ 3.070532] R10: 00007ffde716f080 R11: 00007ffde716f080 R12: 00007f2edc004c00 > [ 3.070533] R13: 00007ffde700a54f R14: 00007f2ee3ffed10 R15: 00005598a41e3680 > [ 3.070534] handlers: > [ 3.070537] [<000000007b6f3232>] tis_int_handler > [ 3.070538] Disabling IRQ #31 > ... > [ 14.956342] irq 16: nobody cared (try booting with the "irqpoll" option) > [ 14.956344] CPU: 0 PID: 1013 Comm: rngd Not tainted 5.8.13-201.local.fc32.x86_64 #1 > [ 14.956344] Hardware name: LENOVO 20NYS7K90F/20NYS7K90F, BIOS N2JET83W (1.61 ) 11/22/2019 > [ 14.956345] Call Trace: > [ 14.956350] dump_stack+0x6b/0x88 > [ 14.956353] __report_bad_irq+0x35/0xa7 > [ 14.956354] note_interrupt.cold+0xb/0x6a > [ 14.956355] handle_irq_event+0x88/0x8a > [ 14.956356] handle_fasteoi_irq+0x78/0x1c0 > [ 14.956358] common_interrupt+0x68/0x140 > [ 14.956360] ? asm_common_interrupt+0x8/0x40 > [ 14.956361] asm_common_interrupt+0x1e/0x40 > [ 14.956362] RIP: 0033:0x7fcaff4399d3 > [ 14.956363] Code: e0 48 c1 e8 1e 83 e0 01 48 31 45 f0 48 8b 45 e0 48 c1 e8 1b 83 e0 01 48 31 45 f0 48 8b 45 e0 48 c1 e8 16 83 e0 01 48 31 45 f0 <48> d1 65 e0 48 8b 45 f0 48 31 45 e0 83 45 d4 01 83 7d d4 40 0f 86 > [ 14.956364] RSP: 002b:00007fcafe544bc0 EFLAGS: 00000246 > [ 14.956364] RAX: 0000000000000000 RBX: 000000000000000a RCX: 0000000000000026 > [ 14.956365] RDX: 000df20000000000 RSI: 0000000000000000 RDI: 0000000000000001 > [ 14.956365] RBP: 00007fcafe544c10 R08: 0000000000000000 R09: 0000000000000035 > [ 14.956366] R10: 00007ffc30bd2080 R11: 00007ffc30bd2080 R12: 00007fcaf0004c00 > [ 14.956366] R13: 00007fcaf0004c00 R14: 00007fcaf0000b60 R15: 0000560216fcf0f2 > [ 14.956367] handlers: > [ 14.956370] [<0000000024c0571e>] i801_isr [i2c_i801] > [ 14.956371] Disabling IRQ #16 > > /proc/interrupts shows: > > 16: 100000 0 0 0 0 0 0 0 IR-IO-APIC 16-fasteoi i801_smbus > 31: 0 0 100000 0 0 0 0 0 IR-IO-APIC 31-fasteoi tpm0 > > > I also get this behavior with your patchset applied. If I boot with > tpm_tis.interrupts=0 it behaves. The thought from Hans is to look at > the dmi info and key off the vendor being Lenovo and bios date to > decide if interrupts should be disabled. Since Dan ran also into this > with something internal at Intel I don't know if that will be > sufficient. > > I hope to look over this patchset tomorrow. > > Regards, > Jerry I'm sorry, I received this email only after pulling the latest with fdm. /Jarkko
Jarkko Sakkinen @ 2020-10-12 18:17 MST: > On Thu, Oct 01, 2020 at 11:09:20AM -0700, James Bottomley wrote: >> The current state of the TIS TPM is that interrupts have been globally >> disabled by various changes. The problems we got reported the last >> time they were enabled was interrupt storms. With my own TIS TPM, >> I've found that this is caused because my TPM doesn't do legacy >> cycles, The TIS spec (chapter 6.1 "Locality Usage Per Register") >> requires any TIS TPM without legacy cycles not to act on any write to >> an interrupt register unless the locality is enabled. This means if >> an interrupt fires after we relinquish the locality, the TPM_EOI in >> the interrupt routine is ineffective meaning the same interrupt >> triggers over and over again. This problem also means we can have >> trouble setting up interrupts on TIS TPMs because the current init >> code does the setup before the locality is claimed for the first time. >> >> James > > You should consider expanding the audience. Jerry, once you have some > bandwidth (no rush, does not land before rc2), it would be great that if > you could try this. I'm emphasizing this just because of the > intersection. I think it would also make senset to get tested-by from > Nayna. I will run some tests on some other systems I have access to. As noted in the other email I did a quick test with a t490s with an older bios that exhibits the problem originally reported when Stefan's patch enabled interrupts. > > Speaking of the changelog, I almost never have encounter a patch set > that does not have a changelog in the cover letter. And I'm not able to > interpret the process text in a way that it would ask to scatter > changelogs to the patches, and not put them to the cover letter [*]. > > Thus, I trust what I see as commodity. Not only that but it also helps a > lot with to see quickly what has changed, especially if the patches are > such that you have to take them in eventually. > > In my own patch sets, I've recently started to xref associated lore korg > discussions in the change log entries, when it applies. I started to do > this with the SGX series when I had missed to address a number of Boris' > comments. It has helped a lot with my own tracking and I assume it is > also helpful for reviewers and maintainers to get the context, when they > need it. > > The last paragraph is not something I demand. I'm merely just mentioning > it because I think it is a good practice for any patch set. > > [*] https://lore.kernel.org/linux-integrity/14edea1f5092c2b8442165756b2ee32e56bed1eb.camel@HansenPartnership.com/ > > /Jarkko
On Tue, 2020-10-13 at 08:15 -0700, Jerry Snitselaar wrote: > Jarkko Sakkinen @ 2020-10-12 18:17 MST: > > > On Thu, Oct 01, 2020 at 11:09:20AM -0700, James Bottomley wrote: > > > The current state of the TIS TPM is that interrupts have been > > > globally disabled by various changes. The problems we got > > > reported the last time they were enabled was interrupt > > > storms. With my own TIS TPM, I've found that this is caused > > > because my TPM doesn't do legacy cycles, The TIS spec (chapter > > > 6.1 "Locality Usage Per Register") requires any TIS TPM without > > > legacy cycles not to act on any write to an interrupt register > > > unless the locality is enabled. This means if an interrupt fires > > > after we relinquish the locality, the TPM_EOI in the interrupt > > > routine is ineffective meaning the same interrupt triggers over > > > and over again. This problem also means we can have trouble > > > setting up interrupts on TIS TPMs because the current init > > > code does the setup before the locality is claimed for the first > > > time. > > > > > > James > > > > You should consider expanding the audience. Well, most people interested in testing this sort of thing are already on the integrity list. > > Jerry, once you have some bandwidth (no rush, does not land before > > rc2), it would be great that if you could try this. I'm emphasizing > > this just because of the intersection. I think it would also make > > senset to get tested-by from Nayna. > > I will run some tests on some other systems I have access to. As > noted in the other email I did a quick test with a t490s with an > older bios that exhibits the problem originally reported when > Stefan's patch enabled interrupts. Well, it means there's still some other problem. I was hoping that because the rainbow pass system originally exhibited the same symptoms (interrupt storm) fixing it would also fix the t490 and the ineffective EOI bug looked like a great candidate for being the root cause. How amenable are you to debugging this? I originally figured out the problem with the rainbow pass by putting ratelimited printks in the interrupt routine and most of the TIS transmission ones, but it's somewhat labour intensive doing it this way. James
James Bottomley @ 2020-10-13 08:24 MST: > On Tue, 2020-10-13 at 08:15 -0700, Jerry Snitselaar wrote: >> Jarkko Sakkinen @ 2020-10-12 18:17 MST: >> >> > On Thu, Oct 01, 2020 at 11:09:20AM -0700, James Bottomley wrote: >> > > The current state of the TIS TPM is that interrupts have been >> > > globally disabled by various changes. The problems we got >> > > reported the last time they were enabled was interrupt >> > > storms. With my own TIS TPM, I've found that this is caused >> > > because my TPM doesn't do legacy cycles, The TIS spec (chapter >> > > 6.1 "Locality Usage Per Register") requires any TIS TPM without >> > > legacy cycles not to act on any write to an interrupt register >> > > unless the locality is enabled. This means if an interrupt fires >> > > after we relinquish the locality, the TPM_EOI in the interrupt >> > > routine is ineffective meaning the same interrupt triggers over >> > > and over again. This problem also means we can have trouble >> > > setting up interrupts on TIS TPMs because the current init >> > > code does the setup before the locality is claimed for the first >> > > time. >> > > >> > > James >> > >> > You should consider expanding the audience. > > Well, most people interested in testing this sort of thing are already > on the integrity list. > >> > Jerry, once you have some bandwidth (no rush, does not land before >> > rc2), it would be great that if you could try this. I'm emphasizing >> > this just because of the intersection. I think it would also make >> > senset to get tested-by from Nayna. >> >> I will run some tests on some other systems I have access to. As >> noted in the other email I did a quick test with a t490s with an >> older bios that exhibits the problem originally reported when >> Stefan's patch enabled interrupts. > > Well, it means there's still some other problem. I was hoping that > because the rainbow pass system originally exhibited the same symptoms > (interrupt storm) fixing it would also fix the t490 and the ineffective > EOI bug looked like a great candidate for being the root cause. > Adding Hans to the list. IIUC in the t490s case the problem lies with the hardware itself. Hans, is that correct? > How amenable are you to debugging this? I originally figured out the > problem with the rainbow pass by putting ratelimited printks in the > interrupt routine and most of the TIS transmission ones, but it's > somewhat labour intensive doing it this way. > > James
Hi, On 10/13/20 6:05 PM, Jerry Snitselaar wrote: > > James Bottomley @ 2020-10-13 08:24 MST: > >> On Tue, 2020-10-13 at 08:15 -0700, Jerry Snitselaar wrote: >>> Jarkko Sakkinen @ 2020-10-12 18:17 MST: >>> >>>> On Thu, Oct 01, 2020 at 11:09:20AM -0700, James Bottomley wrote: >>>>> The current state of the TIS TPM is that interrupts have been >>>>> globally disabled by various changes. The problems we got >>>>> reported the last time they were enabled was interrupt >>>>> storms. With my own TIS TPM, I've found that this is caused >>>>> because my TPM doesn't do legacy cycles, The TIS spec (chapter >>>>> 6.1 "Locality Usage Per Register") requires any TIS TPM without >>>>> legacy cycles not to act on any write to an interrupt register >>>>> unless the locality is enabled. This means if an interrupt fires >>>>> after we relinquish the locality, the TPM_EOI in the interrupt >>>>> routine is ineffective meaning the same interrupt triggers over >>>>> and over again. This problem also means we can have trouble >>>>> setting up interrupts on TIS TPMs because the current init >>>>> code does the setup before the locality is claimed for the first >>>>> time. >>>>> >>>>> James >>>> >>>> You should consider expanding the audience. >> >> Well, most people interested in testing this sort of thing are already >> on the integrity list. >> >>>> Jerry, once you have some bandwidth (no rush, does not land before >>>> rc2), it would be great that if you could try this. I'm emphasizing >>>> this just because of the intersection. I think it would also make >>>> senset to get tested-by from Nayna. >>> >>> I will run some tests on some other systems I have access to. As >>> noted in the other email I did a quick test with a t490s with an >>> older bios that exhibits the problem originally reported when >>> Stefan's patch enabled interrupts. >> >> Well, it means there's still some other problem. I was hoping that >> because the rainbow pass system originally exhibited the same symptoms >> (interrupt storm) fixing it would also fix the t490 and the ineffective >> EOI bug looked like a great candidate for being the root cause. >> > > Adding Hans to the list. > > IIUC in the t490s case the problem lies with the hardware itself. Hans, > is that correct? More or less. AFAIK / have been told by Lenovo it is an issue with the configuration of the inerrupt-type of the GPIO pin used for the IRQ, which is a firmware issue which could be fixed by a BIOS update (the pin is setup as a direct-irq pin for the APIC, so the OS has no control of the IRQ type since with APIC irqs this is all supposed to be setup properly before hand). But it is a model specific issue, if we denylist IRQ usage on this Lenovo model (and probably a few others) then we should be able to restore the IRQ code to normal functionality for all other device models which declare an IRQ in their resource tables. Regards, Hans
On Wed, 2020-10-14 at 17:03 +0200, Hans de Goede wrote: > On 10/13/20 6:05 PM, Jerry Snitselaar wrote: > > James Bottomley @ 2020-10-13 08:24 MST: > > > On Tue, 2020-10-13 at 08:15 -0700, Jerry Snitselaar wrote: > > > > Jarkko Sakkinen @ 2020-10-12 18:17 MST: [...] > > > > > Jerry, once you have some bandwidth (no rush, does not land > > > > > before rc2), it would be great that if you could try this. > > > > > I'm emphasizing this just because of the intersection. I > > > > > think it would also make senset to get tested-by from Nayna. > > > > > > > > I will run some tests on some other systems I have access to. > > > > As noted in the other email I did a quick test with a t490s > > > > with an older bios that exhibits the problem originally > > > > reported when Stefan's patch enabled interrupts. > > > > > > Well, it means there's still some other problem. I was hoping > > > that because the rainbow pass system originally exhibited the > > > same symptoms (interrupt storm) fixing it would also fix the t490 > > > and the ineffective EOI bug looked like a great candidate for > > > being the root cause. > > > > > > > Adding Hans to the list. > > > > IIUC in the t490s case the problem lies with the hardware itself. > > Hans, is that correct? > > More or less. AFAIK / have been told by Lenovo it is an issue with > the configuration of the inerrupt-type of the GPIO pin used for the > IRQ, which is a firmware issue which could be fixed by a BIOS update > (the pin is setup as a direct-irq pin for the APIC, so the OS has no > control of the IRQ type since with APIC irqs this is all supposed to > be setup properly before hand). > > But it is a model specific issue, if we denylist IRQ usage on this > Lenovo model (and probably a few others) then we should be able to > restore the IRQ code to normal functionality for all other device > models which declare an IRQ in their resource tables. I can do that with a quirk, but how do I identify the device? TPM manufacturer and version? or do I have to use something like the ACPI bios version? James
Hi, On 10/14/20 5:23 PM, James Bottomley wrote: > On Wed, 2020-10-14 at 17:03 +0200, Hans de Goede wrote: >> On 10/13/20 6:05 PM, Jerry Snitselaar wrote: >>> James Bottomley @ 2020-10-13 08:24 MST: >>>> On Tue, 2020-10-13 at 08:15 -0700, Jerry Snitselaar wrote: >>>>> Jarkko Sakkinen @ 2020-10-12 18:17 MST: > [...] >>>>>> Jerry, once you have some bandwidth (no rush, does not land >>>>>> before rc2), it would be great that if you could try this. >>>>>> I'm emphasizing this just because of the intersection. I >>>>>> think it would also make senset to get tested-by from Nayna. >>>>> >>>>> I will run some tests on some other systems I have access to. >>>>> As noted in the other email I did a quick test with a t490s >>>>> with an older bios that exhibits the problem originally >>>>> reported when Stefan's patch enabled interrupts. >>>> >>>> Well, it means there's still some other problem. I was hoping >>>> that because the rainbow pass system originally exhibited the >>>> same symptoms (interrupt storm) fixing it would also fix the t490 >>>> and the ineffective EOI bug looked like a great candidate for >>>> being the root cause. >>>> >>> >>> Adding Hans to the list. >>> >>> IIUC in the t490s case the problem lies with the hardware itself. >>> Hans, is that correct? >> >> More or less. AFAIK / have been told by Lenovo it is an issue with >> the configuration of the inerrupt-type of the GPIO pin used for the >> IRQ, which is a firmware issue which could be fixed by a BIOS update >> (the pin is setup as a direct-irq pin for the APIC, so the OS has no >> control of the IRQ type since with APIC irqs this is all supposed to >> be setup properly before hand). >> >> But it is a model specific issue, if we denylist IRQ usage on this >> Lenovo model (and probably a few others) then we should be able to >> restore the IRQ code to normal functionality for all other device >> models which declare an IRQ in their resource tables. > > I can do that with a quirk, but how do I identify the device? TPM > manufacturer and version? or do I have to use something like the ACPI > bios version? I'm not sure if the TPM ids are unique to one model/series of laptops. So my idea for this was to match on DMI strings, specifically use a DMI match on the DMI_SYS_VENDOR and DMI_PRODUCT_VERSION strings (normally one would use DMI_PRODUCT_NAME but for Lenovo devices the string which you expect to be in DMI_PRODUCT_NAME is actually in DMI_PRODUCT_VERSION). You can easily get the strings for your device by doing: cat /sys/class/dmi/id/sys_vendor cat /sys/class/dmi/id/product_version Regards, Hans
Hans de Goede @ 2020-10-14 09:04 MST: > Hi, > > On 10/14/20 5:23 PM, James Bottomley wrote: >> On Wed, 2020-10-14 at 17:03 +0200, Hans de Goede wrote: >>> On 10/13/20 6:05 PM, Jerry Snitselaar wrote: >>>> James Bottomley @ 2020-10-13 08:24 MST: >>>>> On Tue, 2020-10-13 at 08:15 -0700, Jerry Snitselaar wrote: >>>>>> Jarkko Sakkinen @ 2020-10-12 18:17 MST: >> [...] >>>>>>> Jerry, once you have some bandwidth (no rush, does not land >>>>>>> before rc2), it would be great that if you could try this. >>>>>>> I'm emphasizing this just because of the intersection. I >>>>>>> think it would also make senset to get tested-by from Nayna. >>>>>> >>>>>> I will run some tests on some other systems I have access to. >>>>>> As noted in the other email I did a quick test with a t490s >>>>>> with an older bios that exhibits the problem originally >>>>>> reported when Stefan's patch enabled interrupts. >>>>> >>>>> Well, it means there's still some other problem. I was hoping >>>>> that because the rainbow pass system originally exhibited the >>>>> same symptoms (interrupt storm) fixing it would also fix the t490 >>>>> and the ineffective EOI bug looked like a great candidate for >>>>> being the root cause. >>>>> >>>> >>>> Adding Hans to the list. >>>> >>>> IIUC in the t490s case the problem lies with the hardware itself. >>>> Hans, is that correct? >>> >>> More or less. AFAIK / have been told by Lenovo it is an issue with >>> the configuration of the inerrupt-type of the GPIO pin used for the >>> IRQ, which is a firmware issue which could be fixed by a BIOS update >>> (the pin is setup as a direct-irq pin for the APIC, so the OS has no >>> control of the IRQ type since with APIC irqs this is all supposed to >>> be setup properly before hand). >>> >>> But it is a model specific issue, if we denylist IRQ usage on this >>> Lenovo model (and probably a few others) then we should be able to >>> restore the IRQ code to normal functionality for all other device >>> models which declare an IRQ in their resource tables. >> I can do that with a quirk, but how do I identify the device? TPM >> manufacturer and version? or do I have to use something like the ACPI >> bios version? > > I'm not sure if the TPM ids are unique to one model/series of laptops. > > So my idea for this was to match on DMI strings, specifically > use a DMI match on the DMI_SYS_VENDOR and DMI_PRODUCT_VERSION > strings (normally one would use DMI_PRODUCT_NAME but for Lenovo > devices the string which you expect to be in DMI_PRODUCT_NAME > is actually in DMI_PRODUCT_VERSION). > > You can easily get the strings for your device by doing: > > cat /sys/class/dmi/id/sys_vendor > cat /sys/class/dmi/id/product_version > > Regards, > > Hans Plus use dmi_get_date(DMI_BIOS_DATE,...) to check if the bios is older than the fixed bios? Has Lenovo released the fixed bios?
Hi, On 10/14/20 6:34 PM, Jerry Snitselaar wrote: > > Hans de Goede @ 2020-10-14 09:04 MST: > >> Hi, >> >> On 10/14/20 5:23 PM, James Bottomley wrote: >>> On Wed, 2020-10-14 at 17:03 +0200, Hans de Goede wrote: >>>> On 10/13/20 6:05 PM, Jerry Snitselaar wrote: >>>>> James Bottomley @ 2020-10-13 08:24 MST: >>>>>> On Tue, 2020-10-13 at 08:15 -0700, Jerry Snitselaar wrote: >>>>>>> Jarkko Sakkinen @ 2020-10-12 18:17 MST: >>> [...] >>>>>>>> Jerry, once you have some bandwidth (no rush, does not land >>>>>>>> before rc2), it would be great that if you could try this. >>>>>>>> I'm emphasizing this just because of the intersection. I >>>>>>>> think it would also make senset to get tested-by from Nayna. >>>>>>> >>>>>>> I will run some tests on some other systems I have access to. >>>>>>> As noted in the other email I did a quick test with a t490s >>>>>>> with an older bios that exhibits the problem originally >>>>>>> reported when Stefan's patch enabled interrupts. >>>>>> >>>>>> Well, it means there's still some other problem. I was hoping >>>>>> that because the rainbow pass system originally exhibited the >>>>>> same symptoms (interrupt storm) fixing it would also fix the t490 >>>>>> and the ineffective EOI bug looked like a great candidate for >>>>>> being the root cause. >>>>>> >>>>> >>>>> Adding Hans to the list. >>>>> >>>>> IIUC in the t490s case the problem lies with the hardware itself. >>>>> Hans, is that correct? >>>> >>>> More or less. AFAIK / have been told by Lenovo it is an issue with >>>> the configuration of the inerrupt-type of the GPIO pin used for the >>>> IRQ, which is a firmware issue which could be fixed by a BIOS update >>>> (the pin is setup as a direct-irq pin for the APIC, so the OS has no >>>> control of the IRQ type since with APIC irqs this is all supposed to >>>> be setup properly before hand). >>>> >>>> But it is a model specific issue, if we denylist IRQ usage on this >>>> Lenovo model (and probably a few others) then we should be able to >>>> restore the IRQ code to normal functionality for all other device >>>> models which declare an IRQ in their resource tables. >>> I can do that with a quirk, but how do I identify the device? TPM >>> manufacturer and version? or do I have to use something like the ACPI >>> bios version? >> >> I'm not sure if the TPM ids are unique to one model/series of laptops. >> >> So my idea for this was to match on DMI strings, specifically >> use a DMI match on the DMI_SYS_VENDOR and DMI_PRODUCT_VERSION >> strings (normally one would use DMI_PRODUCT_NAME but for Lenovo >> devices the string which you expect to be in DMI_PRODUCT_NAME >> is actually in DMI_PRODUCT_VERSION). >> >> You can easily get the strings for your device by doing: >> >> cat /sys/class/dmi/id/sys_vendor >> cat /sys/class/dmi/id/product_version >> >> Regards, >> >> Hans > > Plus use dmi_get_date(DMI_BIOS_DATE,...) to check > if the bios is older than the fixed bios? Has Lenovo > released the fixed bios? Maybe, the fixed BIOS-es which I have seen (for the X1C8, broken BIOS was a pre-production BIOS) "fixed" this by no longer listing an IRQ in the ACPI resources for the TPM. Which means that the new BIOS still being on the deny list does not matter since the IRQ support won't work anyways as we no longer get an IRQ assigned. So I don't think this is necessary and it will just complicate things unnecessarily. This whole saga has already taken way too long to fix. So IMHO the simplest fix where we just deny list the broken models independent of BIOS versions and move on seems best. Regards, Hans
On Wed, 2020-10-14 at 09:34 -0700, Jerry Snitselaar wrote: > Hans de Goede @ 2020-10-14 09:04 MST: > > > Hi, > > > > On 10/14/20 5:23 PM, James Bottomley wrote: > > > On Wed, 2020-10-14 at 17:03 +0200, Hans de Goede wrote: > > > > On 10/13/20 6:05 PM, Jerry Snitselaar wrote: > > > > > James Bottomley @ 2020-10-13 08:24 MST: > > > > > > On Tue, 2020-10-13 at 08:15 -0700, Jerry Snitselaar wrote: > > > > > > > Jarkko Sakkinen @ 2020-10-12 18:17 MST: > > > [...] > > > > > > > > Jerry, once you have some bandwidth (no rush, does > > > > > > > > not land > > > > > > > > before rc2), it would be great that if you could try > > > > > > > > this. > > > > > > > > I'm emphasizing this just because of the intersection. > > > > > > > > I > > > > > > > > think it would also make senset to get tested-by from > > > > > > > > Nayna. > > > > > > > > > > > > > > I will run some tests on some other systems I have access > > > > > > > to. > > > > > > > As noted in the other email I did a quick test with a > > > > > > > t490s > > > > > > > with an older bios that exhibits the problem originally > > > > > > > reported when Stefan's patch enabled interrupts. > > > > > > > > > > > > Well, it means there's still some other problem. I was > > > > > > hoping > > > > > > that because the rainbow pass system originally exhibited > > > > > > the > > > > > > same symptoms (interrupt storm) fixing it would also fix > > > > > > the t490 > > > > > > and the ineffective EOI bug looked like a great candidate > > > > > > for > > > > > > being the root cause. > > > > > > > > > > > > > > > > Adding Hans to the list. > > > > > > > > > > IIUC in the t490s case the problem lies with the hardware > > > > > itself. > > > > > Hans, is that correct? > > > > > > > > More or less. AFAIK / have been told by Lenovo it is an issue > > > > with > > > > the configuration of the inerrupt-type of the GPIO pin used for > > > > the > > > > IRQ, which is a firmware issue which could be fixed by a BIOS > > > > update > > > > (the pin is setup as a direct-irq pin for the APIC, so the OS > > > > has no > > > > control of the IRQ type since with APIC irqs this is all > > > > supposed to > > > > be setup properly before hand). > > > > > > > > But it is a model specific issue, if we denylist IRQ usage on > > > > this > > > > Lenovo model (and probably a few others) then we should be able > > > > to > > > > restore the IRQ code to normal functionality for all other > > > > device > > > > models which declare an IRQ in their resource tables. > > > I can do that with a quirk, but how do I identify the > > > device? TPM > > > manufacturer and version? or do I have to use something like the > > > ACPI > > > bios version? > > > > I'm not sure if the TPM ids are unique to one model/series of > > laptops. > > > > So my idea for this was to match on DMI strings, specifically > > use a DMI match on the DMI_SYS_VENDOR and DMI_PRODUCT_VERSION > > strings (normally one would use DMI_PRODUCT_NAME but for Lenovo > > devices the string which you expect to be in DMI_PRODUCT_NAME > > is actually in DMI_PRODUCT_VERSION). > > > > You can easily get the strings for your device by doing: > > > > cat /sys/class/dmi/id/sys_vendor > > cat /sys/class/dmi/id/product_version > > > > Regards, > > > > Hans > > Plus use dmi_get_date(DMI_BIOS_DATE,...) to check > if the bios is older than the fixed bios? Has Lenovo > released the fixed bios? Yes, get all the DMI before and after information, but ideally also the TIS before and after information. That's what you get from the tpm_tis line in dmesg: tpm_tis MSFT0101:00: 2.0 TPM (device-id 0xFE, rev-id 2) What I was hoping is rev-id might change. However, if the bug is specific to the Lenovo firmware and not the TPM firmware, that is unlikely. James
Hans de Goede @ 2020-10-14 09:46 MST: > Hi, > > On 10/14/20 6:34 PM, Jerry Snitselaar wrote: >> Hans de Goede @ 2020-10-14 09:04 MST: >> >>> Hi, >>> >>> On 10/14/20 5:23 PM, James Bottomley wrote: >>>> On Wed, 2020-10-14 at 17:03 +0200, Hans de Goede wrote: >>>>> On 10/13/20 6:05 PM, Jerry Snitselaar wrote: >>>>>> James Bottomley @ 2020-10-13 08:24 MST: >>>>>>> On Tue, 2020-10-13 at 08:15 -0700, Jerry Snitselaar wrote: >>>>>>>> Jarkko Sakkinen @ 2020-10-12 18:17 MST: >>>> [...] >>>>>>>>> Jerry, once you have some bandwidth (no rush, does not land >>>>>>>>> before rc2), it would be great that if you could try this. >>>>>>>>> I'm emphasizing this just because of the intersection. I >>>>>>>>> think it would also make senset to get tested-by from Nayna. >>>>>>>> >>>>>>>> I will run some tests on some other systems I have access to. >>>>>>>> As noted in the other email I did a quick test with a t490s >>>>>>>> with an older bios that exhibits the problem originally >>>>>>>> reported when Stefan's patch enabled interrupts. >>>>>>> >>>>>>> Well, it means there's still some other problem. I was hoping >>>>>>> that because the rainbow pass system originally exhibited the >>>>>>> same symptoms (interrupt storm) fixing it would also fix the t490 >>>>>>> and the ineffective EOI bug looked like a great candidate for >>>>>>> being the root cause. >>>>>>> >>>>>> >>>>>> Adding Hans to the list. >>>>>> >>>>>> IIUC in the t490s case the problem lies with the hardware itself. >>>>>> Hans, is that correct? >>>>> >>>>> More or less. AFAIK / have been told by Lenovo it is an issue with >>>>> the configuration of the inerrupt-type of the GPIO pin used for the >>>>> IRQ, which is a firmware issue which could be fixed by a BIOS update >>>>> (the pin is setup as a direct-irq pin for the APIC, so the OS has no >>>>> control of the IRQ type since with APIC irqs this is all supposed to >>>>> be setup properly before hand). >>>>> >>>>> But it is a model specific issue, if we denylist IRQ usage on this >>>>> Lenovo model (and probably a few others) then we should be able to >>>>> restore the IRQ code to normal functionality for all other device >>>>> models which declare an IRQ in their resource tables. >>>> I can do that with a quirk, but how do I identify the device? TPM >>>> manufacturer and version? or do I have to use something like the ACPI >>>> bios version? >>> >>> I'm not sure if the TPM ids are unique to one model/series of laptops. >>> >>> So my idea for this was to match on DMI strings, specifically >>> use a DMI match on the DMI_SYS_VENDOR and DMI_PRODUCT_VERSION >>> strings (normally one would use DMI_PRODUCT_NAME but for Lenovo >>> devices the string which you expect to be in DMI_PRODUCT_NAME >>> is actually in DMI_PRODUCT_VERSION). >>> >>> You can easily get the strings for your device by doing: >>> >>> cat /sys/class/dmi/id/sys_vendor >>> cat /sys/class/dmi/id/product_version >>> >>> Regards, >>> >>> Hans >> Plus use dmi_get_date(DMI_BIOS_DATE,...) to check >> if the bios is older than the fixed bios? Has Lenovo >> released the fixed bios? > > Maybe, the fixed BIOS-es which I have seen (for the X1C8, > broken BIOS was a pre-production BIOS) "fixed" this by > no longer listing an IRQ in the ACPI resources for the TPM. > > Which means that the new BIOS still being on the deny list > does not matter since the IRQ support won't work anyways as > we no longer get an IRQ assigned. > > So I don't think this is necessary and it will just complicate > things unnecessarily. This whole saga has already taken way > too long to fix. So IMHO the simplest fix where we just deny > list the broken models independent of BIOS versions and move > on seems best. > > Regards, > > Hans Yea, probably just best to disable for the model and be done with it. Regards, Jerry
Jerry Snitselaar @ 2020-10-14 10:01 MST: > Hans de Goede @ 2020-10-14 09:46 MST: > >> Hi, >> >> On 10/14/20 6:34 PM, Jerry Snitselaar wrote: >>> Hans de Goede @ 2020-10-14 09:04 MST: >>> >>>> Hi, >>>> >>>> On 10/14/20 5:23 PM, James Bottomley wrote: >>>>> On Wed, 2020-10-14 at 17:03 +0200, Hans de Goede wrote: >>>>>> On 10/13/20 6:05 PM, Jerry Snitselaar wrote: >>>>>>> James Bottomley @ 2020-10-13 08:24 MST: >>>>>>>> On Tue, 2020-10-13 at 08:15 -0700, Jerry Snitselaar wrote: >>>>>>>>> Jarkko Sakkinen @ 2020-10-12 18:17 MST: >>>>> [...] >>>>>>>>>> Jerry, once you have some bandwidth (no rush, does not land >>>>>>>>>> before rc2), it would be great that if you could try this. >>>>>>>>>> I'm emphasizing this just because of the intersection. I >>>>>>>>>> think it would also make senset to get tested-by from Nayna. >>>>>>>>> >>>>>>>>> I will run some tests on some other systems I have access to. >>>>>>>>> As noted in the other email I did a quick test with a t490s >>>>>>>>> with an older bios that exhibits the problem originally >>>>>>>>> reported when Stefan's patch enabled interrupts. >>>>>>>> >>>>>>>> Well, it means there's still some other problem. I was hoping >>>>>>>> that because the rainbow pass system originally exhibited the >>>>>>>> same symptoms (interrupt storm) fixing it would also fix the t490 >>>>>>>> and the ineffective EOI bug looked like a great candidate for >>>>>>>> being the root cause. >>>>>>>> >>>>>>> >>>>>>> Adding Hans to the list. >>>>>>> >>>>>>> IIUC in the t490s case the problem lies with the hardware itself. >>>>>>> Hans, is that correct? >>>>>> >>>>>> More or less. AFAIK / have been told by Lenovo it is an issue with >>>>>> the configuration of the inerrupt-type of the GPIO pin used for the >>>>>> IRQ, which is a firmware issue which could be fixed by a BIOS update >>>>>> (the pin is setup as a direct-irq pin for the APIC, so the OS has no >>>>>> control of the IRQ type since with APIC irqs this is all supposed to >>>>>> be setup properly before hand). >>>>>> >>>>>> But it is a model specific issue, if we denylist IRQ usage on this >>>>>> Lenovo model (and probably a few others) then we should be able to >>>>>> restore the IRQ code to normal functionality for all other device >>>>>> models which declare an IRQ in their resource tables. >>>>> I can do that with a quirk, but how do I identify the device? TPM >>>>> manufacturer and version? or do I have to use something like the ACPI >>>>> bios version? >>>> >>>> I'm not sure if the TPM ids are unique to one model/series of laptops. >>>> >>>> So my idea for this was to match on DMI strings, specifically >>>> use a DMI match on the DMI_SYS_VENDOR and DMI_PRODUCT_VERSION >>>> strings (normally one would use DMI_PRODUCT_NAME but for Lenovo >>>> devices the string which you expect to be in DMI_PRODUCT_NAME >>>> is actually in DMI_PRODUCT_VERSION). >>>> >>>> You can easily get the strings for your device by doing: >>>> >>>> cat /sys/class/dmi/id/sys_vendor >>>> cat /sys/class/dmi/id/product_version >>>> >>>> Regards, >>>> >>>> Hans >>> Plus use dmi_get_date(DMI_BIOS_DATE,...) to check >>> if the bios is older than the fixed bios? Has Lenovo >>> released the fixed bios? >> >> Maybe, the fixed BIOS-es which I have seen (for the X1C8, >> broken BIOS was a pre-production BIOS) "fixed" this by >> no longer listing an IRQ in the ACPI resources for the TPM. >> >> Which means that the new BIOS still being on the deny list >> does not matter since the IRQ support won't work anyways as >> we no longer get an IRQ assigned. >> >> So I don't think this is necessary and it will just complicate >> things unnecessarily. This whole saga has already taken way >> too long to fix. So IMHO the simplest fix where we just deny >> list the broken models independent of BIOS versions and move >> on seems best. >> >> Regards, >> >> Hans > > Yea, probably just best to disable for the model and be done > with it. > > Regards, > Jerry # cat /sys/class/dmi/id/sys_vendor LENOVO # cat /sys/class/dmi/id/product_version ThinkPad T490s
Hans de Goede @ 2020-10-14 09:46 MST: > Hi, > > On 10/14/20 6:34 PM, Jerry Snitselaar wrote: >> Hans de Goede @ 2020-10-14 09:04 MST: >> >>> Hi, >>> >>> On 10/14/20 5:23 PM, James Bottomley wrote: >>>> On Wed, 2020-10-14 at 17:03 +0200, Hans de Goede wrote: >>>>> On 10/13/20 6:05 PM, Jerry Snitselaar wrote: >>>>>> James Bottomley @ 2020-10-13 08:24 MST: >>>>>>> On Tue, 2020-10-13 at 08:15 -0700, Jerry Snitselaar wrote: >>>>>>>> Jarkko Sakkinen @ 2020-10-12 18:17 MST: >>>> [...] >>>>>>>>> Jerry, once you have some bandwidth (no rush, does not land >>>>>>>>> before rc2), it would be great that if you could try this. >>>>>>>>> I'm emphasizing this just because of the intersection. I >>>>>>>>> think it would also make senset to get tested-by from Nayna. >>>>>>>> >>>>>>>> I will run some tests on some other systems I have access to. >>>>>>>> As noted in the other email I did a quick test with a t490s >>>>>>>> with an older bios that exhibits the problem originally >>>>>>>> reported when Stefan's patch enabled interrupts. >>>>>>> >>>>>>> Well, it means there's still some other problem. I was hoping >>>>>>> that because the rainbow pass system originally exhibited the >>>>>>> same symptoms (interrupt storm) fixing it would also fix the t490 >>>>>>> and the ineffective EOI bug looked like a great candidate for >>>>>>> being the root cause. >>>>>>> >>>>>> >>>>>> Adding Hans to the list. >>>>>> >>>>>> IIUC in the t490s case the problem lies with the hardware itself. >>>>>> Hans, is that correct? >>>>> >>>>> More or less. AFAIK / have been told by Lenovo it is an issue with >>>>> the configuration of the inerrupt-type of the GPIO pin used for the >>>>> IRQ, which is a firmware issue which could be fixed by a BIOS update >>>>> (the pin is setup as a direct-irq pin for the APIC, so the OS has no >>>>> control of the IRQ type since with APIC irqs this is all supposed to >>>>> be setup properly before hand). >>>>> >>>>> But it is a model specific issue, if we denylist IRQ usage on this >>>>> Lenovo model (and probably a few others) then we should be able to >>>>> restore the IRQ code to normal functionality for all other device >>>>> models which declare an IRQ in their resource tables. >>>> I can do that with a quirk, but how do I identify the device? TPM >>>> manufacturer and version? or do I have to use something like the ACPI >>>> bios version? >>> >>> I'm not sure if the TPM ids are unique to one model/series of laptops. >>> >>> So my idea for this was to match on DMI strings, specifically >>> use a DMI match on the DMI_SYS_VENDOR and DMI_PRODUCT_VERSION >>> strings (normally one would use DMI_PRODUCT_NAME but for Lenovo >>> devices the string which you expect to be in DMI_PRODUCT_NAME >>> is actually in DMI_PRODUCT_VERSION). >>> >>> You can easily get the strings for your device by doing: >>> >>> cat /sys/class/dmi/id/sys_vendor >>> cat /sys/class/dmi/id/product_version >>> >>> Regards, >>> >>> Hans >> Plus use dmi_get_date(DMI_BIOS_DATE,...) to check >> if the bios is older than the fixed bios? Has Lenovo >> released the fixed bios? > > Maybe, the fixed BIOS-es which I have seen (for the X1C8, > broken BIOS was a pre-production BIOS) "fixed" this by > no longer listing an IRQ in the ACPI resources for the TPM. > > Which means that the new BIOS still being on the deny list > does not matter since the IRQ support won't work anyways as > we no longer get an IRQ assigned. > > So I don't think this is necessary and it will just complicate > things unnecessarily. This whole saga has already taken way > too long to fix. So IMHO the simplest fix where we just deny > list the broken models independent of BIOS versions and move > on seems best. > > Regards, > > Hans This worked for me: diff --git a/drivers/char/tpm/tpm_tis.c b/drivers/char/tpm/tpm_tis.c index 0b214963539d..abe674d1de6d 100644 --- a/drivers/char/tpm/tpm_tis.c +++ b/drivers/char/tpm/tpm_tis.c @@ -27,6 +27,7 @@ #include <linux/of.h> #include <linux/of_device.h> #include <linux/kernel.h> +#include <linux/dmi.h> #include "tpm.h" #include "tpm_tis_core.h" @@ -63,6 +64,26 @@ module_param(force, bool, 0444); MODULE_PARM_DESC(force, "Force device probe rather than using ACPI entry"); #endif +static int tpm_tis_disable_irq(const struct dmi_system_id *d) +{ + pr_notice("tpm_tis: %s detected: disabling interrupts.\n", d->ident); + interrupts = false; + + return 0; +} + +static const struct dmi_system_id tpm_tis_dmi_table[] = { + { + .callback = tpm_tis_disable_irq, + .ident = "ThinkPad T490s", + .matches = { + DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"), + DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad T490s"), + }, + }, + {} +}; + #if defined(CONFIG_PNP) && defined(CONFIG_ACPI) static int has_hid(struct acpi_device *dev, const char *hid) { @@ -192,6 +213,8 @@ static int tpm_tis_init(struct device *dev, struct tpm_info *tpm_info) int irq = -1; int rc; + dmi_check_system(tpm_tis_dmi_table); + rc = check_acpi_tpm2(dev); if (rc) return rc;
Hi, On 10/14/20 10:58 PM, Jerry Snitselaar wrote: > > Hans de Goede @ 2020-10-14 09:46 MST: > >> Hi, >> >> On 10/14/20 6:34 PM, Jerry Snitselaar wrote: >>> Hans de Goede @ 2020-10-14 09:04 MST: >>> >>>> Hi, >>>> >>>> On 10/14/20 5:23 PM, James Bottomley wrote: >>>>> On Wed, 2020-10-14 at 17:03 +0200, Hans de Goede wrote: >>>>>> On 10/13/20 6:05 PM, Jerry Snitselaar wrote: >>>>>>> James Bottomley @ 2020-10-13 08:24 MST: >>>>>>>> On Tue, 2020-10-13 at 08:15 -0700, Jerry Snitselaar wrote: >>>>>>>>> Jarkko Sakkinen @ 2020-10-12 18:17 MST: >>>>> [...] >>>>>>>>>> Jerry, once you have some bandwidth (no rush, does not land >>>>>>>>>> before rc2), it would be great that if you could try this. >>>>>>>>>> I'm emphasizing this just because of the intersection. I >>>>>>>>>> think it would also make senset to get tested-by from Nayna. >>>>>>>>> >>>>>>>>> I will run some tests on some other systems I have access to. >>>>>>>>> As noted in the other email I did a quick test with a t490s >>>>>>>>> with an older bios that exhibits the problem originally >>>>>>>>> reported when Stefan's patch enabled interrupts. >>>>>>>> >>>>>>>> Well, it means there's still some other problem. I was hoping >>>>>>>> that because the rainbow pass system originally exhibited the >>>>>>>> same symptoms (interrupt storm) fixing it would also fix the t490 >>>>>>>> and the ineffective EOI bug looked like a great candidate for >>>>>>>> being the root cause. >>>>>>>> >>>>>>> >>>>>>> Adding Hans to the list. >>>>>>> >>>>>>> IIUC in the t490s case the problem lies with the hardware itself. >>>>>>> Hans, is that correct? >>>>>> >>>>>> More or less. AFAIK / have been told by Lenovo it is an issue with >>>>>> the configuration of the inerrupt-type of the GPIO pin used for the >>>>>> IRQ, which is a firmware issue which could be fixed by a BIOS update >>>>>> (the pin is setup as a direct-irq pin for the APIC, so the OS has no >>>>>> control of the IRQ type since with APIC irqs this is all supposed to >>>>>> be setup properly before hand). >>>>>> >>>>>> But it is a model specific issue, if we denylist IRQ usage on this >>>>>> Lenovo model (and probably a few others) then we should be able to >>>>>> restore the IRQ code to normal functionality for all other device >>>>>> models which declare an IRQ in their resource tables. >>>>> I can do that with a quirk, but how do I identify the device? TPM >>>>> manufacturer and version? or do I have to use something like the ACPI >>>>> bios version? >>>> >>>> I'm not sure if the TPM ids are unique to one model/series of laptops. >>>> >>>> So my idea for this was to match on DMI strings, specifically >>>> use a DMI match on the DMI_SYS_VENDOR and DMI_PRODUCT_VERSION >>>> strings (normally one would use DMI_PRODUCT_NAME but for Lenovo >>>> devices the string which you expect to be in DMI_PRODUCT_NAME >>>> is actually in DMI_PRODUCT_VERSION). >>>> >>>> You can easily get the strings for your device by doing: >>>> >>>> cat /sys/class/dmi/id/sys_vendor >>>> cat /sys/class/dmi/id/product_version >>>> >>>> Regards, >>>> >>>> Hans >>> Plus use dmi_get_date(DMI_BIOS_DATE,...) to check >>> if the bios is older than the fixed bios? Has Lenovo >>> released the fixed bios? >> >> Maybe, the fixed BIOS-es which I have seen (for the X1C8, >> broken BIOS was a pre-production BIOS) "fixed" this by >> no longer listing an IRQ in the ACPI resources for the TPM. >> >> Which means that the new BIOS still being on the deny list >> does not matter since the IRQ support won't work anyways as >> we no longer get an IRQ assigned. >> >> So I don't think this is necessary and it will just complicate >> things unnecessarily. This whole saga has already taken way >> too long to fix. So IMHO the simplest fix where we just deny >> list the broken models independent of BIOS versions and move >> on seems best. >> >> Regards, >> >> Hans > > This worked for me: That looks good to me, can you submit this upstream please ? If you Cc me I'll give it my Reviewed-by. Regards, Hans > > diff --git a/drivers/char/tpm/tpm_tis.c b/drivers/char/tpm/tpm_tis.c > index 0b214963539d..abe674d1de6d 100644 > --- a/drivers/char/tpm/tpm_tis.c > +++ b/drivers/char/tpm/tpm_tis.c > @@ -27,6 +27,7 @@ > #include <linux/of.h> > #include <linux/of_device.h> > #include <linux/kernel.h> > +#include <linux/dmi.h> > #include "tpm.h" > #include "tpm_tis_core.h" > > @@ -63,6 +64,26 @@ module_param(force, bool, 0444); > MODULE_PARM_DESC(force, "Force device probe rather than using ACPI entry"); > #endif > > +static int tpm_tis_disable_irq(const struct dmi_system_id *d) > +{ > + pr_notice("tpm_tis: %s detected: disabling interrupts.\n", d->ident); > + interrupts = false; > + > + return 0; > +} > + > +static const struct dmi_system_id tpm_tis_dmi_table[] = { > + { > + .callback = tpm_tis_disable_irq, > + .ident = "ThinkPad T490s", > + .matches = { > + DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"), > + DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad T490s"), > + }, > + }, > + {} > +}; > + > #if defined(CONFIG_PNP) && defined(CONFIG_ACPI) > static int has_hid(struct acpi_device *dev, const char *hid) > { > @@ -192,6 +213,8 @@ static int tpm_tis_init(struct device *dev, struct tpm_info *tpm_info) > int irq = -1; > int rc; > > + dmi_check_system(tpm_tis_dmi_table); > + > rc = check_acpi_tpm2(dev); > if (rc) > return rc; >
On Wed, 2020-10-14 at 13:58 -0700, Jerry Snitselaar wrote: > Hans de Goede @ 2020-10-14 09:46 MST: > > > Hi, > > > > On 10/14/20 6:34 PM, Jerry Snitselaar wrote: > > > Hans de Goede @ 2020-10-14 09:04 MST: > > > > > > > Hi, > > > > > > > > On 10/14/20 5:23 PM, James Bottomley wrote: > > > > > On Wed, 2020-10-14 at 17:03 +0200, Hans de Goede wrote: > > > > > > On 10/13/20 6:05 PM, Jerry Snitselaar wrote: > > > > > > > James Bottomley @ 2020-10-13 08:24 MST: > > > > > > > > On Tue, 2020-10-13 at 08:15 -0700, Jerry Snitselaar > > > > > > > > wrote: > > > > > > > > > Jarkko Sakkinen @ 2020-10-12 18:17 MST: > > > > > [...] > > > > > > > > > > Jerry, once you have some bandwidth (no rush, > > > > > > > > > > does not land > > > > > > > > > > before rc2), it would be great that if you could > > > > > > > > > > try this. > > > > > > > > > > I'm emphasizing this just because of the > > > > > > > > > > intersection. I > > > > > > > > > > think it would also make senset to get tested-by > > > > > > > > > > from Nayna. > > > > > > > > > > > > > > > > > > I will run some tests on some other systems I have > > > > > > > > > access to. > > > > > > > > > As noted in the other email I did a quick test with a > > > > > > > > > t490s > > > > > > > > > with an older bios that exhibits the problem > > > > > > > > > originally > > > > > > > > > reported when Stefan's patch enabled interrupts. > > > > > > > > > > > > > > > > Well, it means there's still some other problem. I was > > > > > > > > hoping > > > > > > > > that because the rainbow pass system originally > > > > > > > > exhibited the > > > > > > > > same symptoms (interrupt storm) fixing it would also > > > > > > > > fix the t490 > > > > > > > > and the ineffective EOI bug looked like a great > > > > > > > > candidate for > > > > > > > > being the root cause. > > > > > > > > > > > > > > > > > > > > > > Adding Hans to the list. > > > > > > > > > > > > > > IIUC in the t490s case the problem lies with the hardware > > > > > > > itself. > > > > > > > Hans, is that correct? > > > > > > > > > > > > More or less. AFAIK / have been told by Lenovo it is an > > > > > > issue with > > > > > > the configuration of the inerrupt-type of the GPIO pin used > > > > > > for the > > > > > > IRQ, which is a firmware issue which could be fixed by a > > > > > > BIOS update > > > > > > (the pin is setup as a direct-irq pin for the APIC, so the > > > > > > OS has no > > > > > > control of the IRQ type since with APIC irqs this is all > > > > > > supposed to > > > > > > be setup properly before hand). > > > > > > > > > > > > But it is a model specific issue, if we denylist IRQ usage > > > > > > on this > > > > > > Lenovo model (and probably a few others) then we should be > > > > > > able to > > > > > > restore the IRQ code to normal functionality for all other > > > > > > device > > > > > > models which declare an IRQ in their resource tables. > > > > > I can do that with a quirk, but how do I identify the > > > > > device? TPM > > > > > manufacturer and version? or do I have to use something like > > > > > the ACPI > > > > > bios version? > > > > > > > > I'm not sure if the TPM ids are unique to one model/series of > > > > laptops. > > > > > > > > So my idea for this was to match on DMI strings, specifically > > > > use a DMI match on the DMI_SYS_VENDOR and DMI_PRODUCT_VERSION > > > > strings (normally one would use DMI_PRODUCT_NAME but for Lenovo > > > > devices the string which you expect to be in DMI_PRODUCT_NAME > > > > is actually in DMI_PRODUCT_VERSION). > > > > > > > > You can easily get the strings for your device by doing: > > > > > > > > cat /sys/class/dmi/id/sys_vendor > > > > cat /sys/class/dmi/id/product_version > > > > > > > > Regards, > > > > > > > > Hans > > > Plus use dmi_get_date(DMI_BIOS_DATE,...) to check > > > if the bios is older than the fixed bios? Has Lenovo > > > released the fixed bios? > > > > Maybe, the fixed BIOS-es which I have seen (for the X1C8, > > broken BIOS was a pre-production BIOS) "fixed" this by > > no longer listing an IRQ in the ACPI resources for the TPM. > > > > Which means that the new BIOS still being on the deny list > > does not matter since the IRQ support won't work anyways as > > we no longer get an IRQ assigned. > > > > So I don't think this is necessary and it will just complicate > > things unnecessarily. This whole saga has already taken way > > too long to fix. So IMHO the simplest fix where we just deny > > list the broken models independent of BIOS versions and move > > on seems best. > > > > Regards, > > > > Hans > > This worked for me: > > diff --git a/drivers/char/tpm/tpm_tis.c b/drivers/char/tpm/tpm_tis.c > index 0b214963539d..abe674d1de6d 100644 > --- a/drivers/char/tpm/tpm_tis.c > +++ b/drivers/char/tpm/tpm_tis.c > @@ -27,6 +27,7 @@ > #include <linux/of.h> > #include <linux/of_device.h> > #include <linux/kernel.h> > +#include <linux/dmi.h> > #include "tpm.h" > #include "tpm_tis_core.h" > > @@ -63,6 +64,26 @@ module_param(force, bool, 0444); > MODULE_PARM_DESC(force, "Force device probe rather than using ACPI > entry"); > #endif > > +static int tpm_tis_disable_irq(const struct dmi_system_id *d) > +{ > + pr_notice("tpm_tis: %s detected: disabling interrupts.\n", d- > >ident); > + interrupts = false; > + > + return 0; > +} > + > +static const struct dmi_system_id tpm_tis_dmi_table[] = { > + { > + .callback = tpm_tis_disable_irq, > + .ident = "ThinkPad T490s", > + .matches = { > + DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"), > + DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad > T490s"), > + }, > + }, > + {} > +}; > + > #if defined(CONFIG_PNP) && defined(CONFIG_ACPI) > static int has_hid(struct acpi_device *dev, const char *hid) > { > @@ -192,6 +213,8 @@ static int tpm_tis_init(struct device *dev, > struct tpm_info *tpm_info) > int irq = -1; > int rc; > > + dmi_check_system(tpm_tis_dmi_table); > + > rc = check_acpi_tpm2(dev); > if (rc) > return rc; This looks OK to me with the caveat that anyone on one of these systems has no way to enable interrupts again if they think they have a fixed bios. What about making interrupts a tristate with the default value -1? Then in the dmi check, if we see -1 we set it to 0 but if we see 1 (the user has specified interrupts=1 on the module insert line) we leave it? James
James Bottomley @ 2020-10-15 08:36 MST: > On Wed, 2020-10-14 at 13:58 -0700, Jerry Snitselaar wrote: >> Hans de Goede @ 2020-10-14 09:46 MST: >> >> > Hi, >> > >> > On 10/14/20 6:34 PM, Jerry Snitselaar wrote: >> > > Hans de Goede @ 2020-10-14 09:04 MST: >> > > >> > > > Hi, >> > > > >> > > > On 10/14/20 5:23 PM, James Bottomley wrote: >> > > > > On Wed, 2020-10-14 at 17:03 +0200, Hans de Goede wrote: >> > > > > > On 10/13/20 6:05 PM, Jerry Snitselaar wrote: >> > > > > > > James Bottomley @ 2020-10-13 08:24 MST: >> > > > > > > > On Tue, 2020-10-13 at 08:15 -0700, Jerry Snitselaar >> > > > > > > > wrote: >> > > > > > > > > Jarkko Sakkinen @ 2020-10-12 18:17 MST: >> > > > > [...] >> > > > > > > > > > Jerry, once you have some bandwidth (no rush, >> > > > > > > > > > does not land >> > > > > > > > > > before rc2), it would be great that if you could >> > > > > > > > > > try this. >> > > > > > > > > > I'm emphasizing this just because of the >> > > > > > > > > > intersection. I >> > > > > > > > > > think it would also make senset to get tested-by >> > > > > > > > > > from Nayna. >> > > > > > > > > >> > > > > > > > > I will run some tests on some other systems I have >> > > > > > > > > access to. >> > > > > > > > > As noted in the other email I did a quick test with a >> > > > > > > > > t490s >> > > > > > > > > with an older bios that exhibits the problem >> > > > > > > > > originally >> > > > > > > > > reported when Stefan's patch enabled interrupts. >> > > > > > > > >> > > > > > > > Well, it means there's still some other problem. I was >> > > > > > > > hoping >> > > > > > > > that because the rainbow pass system originally >> > > > > > > > exhibited the >> > > > > > > > same symptoms (interrupt storm) fixing it would also >> > > > > > > > fix the t490 >> > > > > > > > and the ineffective EOI bug looked like a great >> > > > > > > > candidate for >> > > > > > > > being the root cause. >> > > > > > > > >> > > > > > > >> > > > > > > Adding Hans to the list. >> > > > > > > >> > > > > > > IIUC in the t490s case the problem lies with the hardware >> > > > > > > itself. >> > > > > > > Hans, is that correct? >> > > > > > >> > > > > > More or less. AFAIK / have been told by Lenovo it is an >> > > > > > issue with >> > > > > > the configuration of the inerrupt-type of the GPIO pin used >> > > > > > for the >> > > > > > IRQ, which is a firmware issue which could be fixed by a >> > > > > > BIOS update >> > > > > > (the pin is setup as a direct-irq pin for the APIC, so the >> > > > > > OS has no >> > > > > > control of the IRQ type since with APIC irqs this is all >> > > > > > supposed to >> > > > > > be setup properly before hand). >> > > > > > >> > > > > > But it is a model specific issue, if we denylist IRQ usage >> > > > > > on this >> > > > > > Lenovo model (and probably a few others) then we should be >> > > > > > able to >> > > > > > restore the IRQ code to normal functionality for all other >> > > > > > device >> > > > > > models which declare an IRQ in their resource tables. >> > > > > I can do that with a quirk, but how do I identify the >> > > > > device? TPM >> > > > > manufacturer and version? or do I have to use something like >> > > > > the ACPI >> > > > > bios version? >> > > > >> > > > I'm not sure if the TPM ids are unique to one model/series of >> > > > laptops. >> > > > >> > > > So my idea for this was to match on DMI strings, specifically >> > > > use a DMI match on the DMI_SYS_VENDOR and DMI_PRODUCT_VERSION >> > > > strings (normally one would use DMI_PRODUCT_NAME but for Lenovo >> > > > devices the string which you expect to be in DMI_PRODUCT_NAME >> > > > is actually in DMI_PRODUCT_VERSION). >> > > > >> > > > You can easily get the strings for your device by doing: >> > > > >> > > > cat /sys/class/dmi/id/sys_vendor >> > > > cat /sys/class/dmi/id/product_version >> > > > >> > > > Regards, >> > > > >> > > > Hans >> > > Plus use dmi_get_date(DMI_BIOS_DATE,...) to check >> > > if the bios is older than the fixed bios? Has Lenovo >> > > released the fixed bios? >> > >> > Maybe, the fixed BIOS-es which I have seen (for the X1C8, >> > broken BIOS was a pre-production BIOS) "fixed" this by >> > no longer listing an IRQ in the ACPI resources for the TPM. >> > >> > Which means that the new BIOS still being on the deny list >> > does not matter since the IRQ support won't work anyways as >> > we no longer get an IRQ assigned. >> > >> > So I don't think this is necessary and it will just complicate >> > things unnecessarily. This whole saga has already taken way >> > too long to fix. So IMHO the simplest fix where we just deny >> > list the broken models independent of BIOS versions and move >> > on seems best. >> > >> > Regards, >> > >> > Hans >> >> This worked for me: >> >> diff --git a/drivers/char/tpm/tpm_tis.c b/drivers/char/tpm/tpm_tis.c >> index 0b214963539d..abe674d1de6d 100644 >> --- a/drivers/char/tpm/tpm_tis.c >> +++ b/drivers/char/tpm/tpm_tis.c >> @@ -27,6 +27,7 @@ >> #include <linux/of.h> >> #include <linux/of_device.h> >> #include <linux/kernel.h> >> +#include <linux/dmi.h> >> #include "tpm.h" >> #include "tpm_tis_core.h" >> >> @@ -63,6 +64,26 @@ module_param(force, bool, 0444); >> MODULE_PARM_DESC(force, "Force device probe rather than using ACPI >> entry"); >> #endif >> >> +static int tpm_tis_disable_irq(const struct dmi_system_id *d) >> +{ >> + pr_notice("tpm_tis: %s detected: disabling interrupts.\n", d- >> >ident); >> + interrupts = false; >> + >> + return 0; >> +} >> + >> +static const struct dmi_system_id tpm_tis_dmi_table[] = { >> + { >> + .callback = tpm_tis_disable_irq, >> + .ident = "ThinkPad T490s", >> + .matches = { >> + DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"), >> + DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad >> T490s"), >> + }, >> + }, >> + {} >> +}; >> + >> #if defined(CONFIG_PNP) && defined(CONFIG_ACPI) >> static int has_hid(struct acpi_device *dev, const char *hid) >> { >> @@ -192,6 +213,8 @@ static int tpm_tis_init(struct device *dev, >> struct tpm_info *tpm_info) >> int irq = -1; >> int rc; >> >> + dmi_check_system(tpm_tis_dmi_table); >> + >> rc = check_acpi_tpm2(dev); >> if (rc) >> return rc; > > This looks OK to me with the caveat that anyone on one of these systems > has no way to enable interrupts again if they think they have a fixed > bios. What about making interrupts a tristate with the default value > -1? Then in the dmi check, if we see -1 we set it to 0 but if we see 1 > (the user has specified interrupts=1 on the module insert line) we > leave it? > > James like this? diff --git a/drivers/char/tpm/tpm_tis.c b/drivers/char/tpm/tpm_tis.c index 0b214963539d..10c46cb26c5a 100644 --- a/drivers/char/tpm/tpm_tis.c +++ b/drivers/char/tpm/tpm_tis.c @@ -27,6 +27,7 @@ #include <linux/of.h> #include <linux/of_device.h> #include <linux/kernel.h> +#include <linux/dmi.h> #include "tpm.h" #include "tpm_tis_core.h" @@ -49,8 +50,8 @@ static inline struct tpm_tis_tcg_phy *to_tpm_tis_tcg_phy(struct tpm_tis_data *da return container_of(data, struct tpm_tis_tcg_phy, priv); } -static bool interrupts = true; -module_param(interrupts, bool, 0444); +static int interrupts = -1; +module_param(interrupts, int, 0444); MODULE_PARM_DESC(interrupts, "Enable interrupts"); static bool itpm; @@ -63,6 +64,27 @@ module_param(force, bool, 0444); MODULE_PARM_DESC(force, "Force device probe rather than using ACPI entry"); #endif +static int tpm_tis_disable_irq(const struct dmi_system_id *d) +{ + pr_notice("tpm_tis: %s detected: disabling interrupts.\n", d->ident); + if (interrupts == -1) + interrupts = 0; + + return 0; +} + +static const struct dmi_system_id tpm_tis_dmi_table[] = { + { + .callback = tpm_tis_disable_irq, + .ident = "ThinkPad T490s", + .matches = { + DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"), + DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad T490s"), + }, + }, + {} +}; + #if defined(CONFIG_PNP) && defined(CONFIG_ACPI) static int has_hid(struct acpi_device *dev, const char *hid) { @@ -192,6 +214,8 @@ static int tpm_tis_init(struct device *dev, struct tpm_info *tpm_info) int irq = -1; int rc; + dmi_check_system(tpm_tis_dmi_table); + rc = check_acpi_tpm2(dev); if (rc) return rc;
On Thu, 2020-10-15 at 11:48 -0700, Jerry Snitselaar wrote: > James Bottomley @ 2020-10-15 08:36 MST: > > > On Wed, 2020-10-14 at 13:58 -0700, Jerry Snitselaar wrote: > > > Hans de Goede @ 2020-10-14 09:46 MST: > > > > > > > Hi, > > > > > > > > On 10/14/20 6:34 PM, Jerry Snitselaar wrote: > > > > > Hans de Goede @ 2020-10-14 09:04 MST: > > > > > > > > > > > Hi, > > > > > > > > > > > > On 10/14/20 5:23 PM, James Bottomley wrote: > > > > > > > On Wed, 2020-10-14 at 17:03 +0200, Hans de Goede wrote: > > > > > > > > On 10/13/20 6:05 PM, Jerry Snitselaar wrote: > > > > > > > > > James Bottomley @ 2020-10-13 08:24 MST: > > > > > > > > > > On Tue, 2020-10-13 at 08:15 -0700, Jerry Snitselaar > > > > > > > > > > wrote: > > > > > > > > > > > Jarkko Sakkinen @ 2020-10-12 18:17 MST: > > > > > > > [...] > > > > > > > > > > > > Jerry, once you have some bandwidth (no > > > > > > > > > > > > rush, > > > > > > > > > > > > does not land > > > > > > > > > > > > before rc2), it would be great that if you > > > > > > > > > > > > could > > > > > > > > > > > > try this. > > > > > > > > > > > > I'm emphasizing this just because of the > > > > > > > > > > > > intersection. I > > > > > > > > > > > > think it would also make senset to get tested- > > > > > > > > > > > > by > > > > > > > > > > > > from Nayna. > > > > > > > > > > > > > > > > > > > > > > I will run some tests on some other systems I > > > > > > > > > > > have > > > > > > > > > > > access to. > > > > > > > > > > > As noted in the other email I did a quick test > > > > > > > > > > > with a > > > > > > > > > > > t490s > > > > > > > > > > > with an older bios that exhibits the problem > > > > > > > > > > > originally > > > > > > > > > > > reported when Stefan's patch enabled interrupts. > > > > > > > > > > > > > > > > > > > > Well, it means there's still some other problem. I > > > > > > > > > > was > > > > > > > > > > hoping > > > > > > > > > > that because the rainbow pass system originally > > > > > > > > > > exhibited the > > > > > > > > > > same symptoms (interrupt storm) fixing it would > > > > > > > > > > also > > > > > > > > > > fix the t490 > > > > > > > > > > and the ineffective EOI bug looked like a great > > > > > > > > > > candidate for > > > > > > > > > > being the root cause. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Adding Hans to the list. > > > > > > > > > > > > > > > > > > IIUC in the t490s case the problem lies with the > > > > > > > > > hardware > > > > > > > > > itself. > > > > > > > > > Hans, is that correct? > > > > > > > > > > > > > > > > More or less. AFAIK / have been told by Lenovo it is an > > > > > > > > issue with > > > > > > > > the configuration of the inerrupt-type of the GPIO pin > > > > > > > > used > > > > > > > > for the > > > > > > > > IRQ, which is a firmware issue which could be fixed by > > > > > > > > a > > > > > > > > BIOS update > > > > > > > > (the pin is setup as a direct-irq pin for the APIC, so > > > > > > > > the > > > > > > > > OS has no > > > > > > > > control of the IRQ type since with APIC irqs this is > > > > > > > > all > > > > > > > > supposed to > > > > > > > > be setup properly before hand). > > > > > > > > > > > > > > > > But it is a model specific issue, if we denylist IRQ > > > > > > > > usage > > > > > > > > on this > > > > > > > > Lenovo model (and probably a few others) then we should > > > > > > > > be > > > > > > > > able to > > > > > > > > restore the IRQ code to normal functionality for all > > > > > > > > other > > > > > > > > device > > > > > > > > models which declare an IRQ in their resource tables. > > > > > > > I can do that with a quirk, but how do I identify the > > > > > > > device? TPM > > > > > > > manufacturer and version? or do I have to use something > > > > > > > like > > > > > > > the ACPI > > > > > > > bios version? > > > > > > > > > > > > I'm not sure if the TPM ids are unique to one model/series > > > > > > of > > > > > > laptops. > > > > > > > > > > > > So my idea for this was to match on DMI strings, > > > > > > specifically > > > > > > use a DMI match on the DMI_SYS_VENDOR and > > > > > > DMI_PRODUCT_VERSION > > > > > > strings (normally one would use DMI_PRODUCT_NAME but for > > > > > > Lenovo > > > > > > devices the string which you expect to be in > > > > > > DMI_PRODUCT_NAME > > > > > > is actually in DMI_PRODUCT_VERSION). > > > > > > > > > > > > You can easily get the strings for your device by doing: > > > > > > > > > > > > cat /sys/class/dmi/id/sys_vendor > > > > > > cat /sys/class/dmi/id/product_version > > > > > > > > > > > > Regards, > > > > > > > > > > > > Hans > > > > > Plus use dmi_get_date(DMI_BIOS_DATE,...) to check > > > > > if the bios is older than the fixed bios? Has Lenovo > > > > > released the fixed bios? > > > > > > > > Maybe, the fixed BIOS-es which I have seen (for the X1C8, > > > > broken BIOS was a pre-production BIOS) "fixed" this by > > > > no longer listing an IRQ in the ACPI resources for the TPM. > > > > > > > > Which means that the new BIOS still being on the deny list > > > > does not matter since the IRQ support won't work anyways as > > > > we no longer get an IRQ assigned. > > > > > > > > So I don't think this is necessary and it will just complicate > > > > things unnecessarily. This whole saga has already taken way > > > > too long to fix. So IMHO the simplest fix where we just deny > > > > list the broken models independent of BIOS versions and move > > > > on seems best. > > > > > > > > Regards, > > > > > > > > Hans > > > > > > This worked for me: > > > > > > diff --git a/drivers/char/tpm/tpm_tis.c > > > b/drivers/char/tpm/tpm_tis.c > > > index 0b214963539d..abe674d1de6d 100644 > > > --- a/drivers/char/tpm/tpm_tis.c > > > +++ b/drivers/char/tpm/tpm_tis.c > > > @@ -27,6 +27,7 @@ > > > #include <linux/of.h> > > > #include <linux/of_device.h> > > > #include <linux/kernel.h> > > > +#include <linux/dmi.h> > > > #include "tpm.h" > > > #include "tpm_tis_core.h" > > > > > > @@ -63,6 +64,26 @@ module_param(force, bool, 0444); > > > MODULE_PARM_DESC(force, "Force device probe rather than using > > > ACPI > > > entry"); > > > #endif > > > > > > +static int tpm_tis_disable_irq(const struct dmi_system_id *d) > > > +{ > > > + pr_notice("tpm_tis: %s detected: disabling > > > interrupts.\n", d- > > > > ident); > > > + interrupts = false; > > > + > > > + return 0; > > > +} > > > + > > > +static const struct dmi_system_id tpm_tis_dmi_table[] = { > > > + { > > > + .callback = tpm_tis_disable_irq, > > > + .ident = "ThinkPad T490s", > > > + .matches = { > > > + DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"), > > > + DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad > > > T490s"), > > > + }, > > > + }, > > > + {} > > > +}; > > > + > > > #if defined(CONFIG_PNP) && defined(CONFIG_ACPI) > > > static int has_hid(struct acpi_device *dev, const char *hid) > > > { > > > @@ -192,6 +213,8 @@ static int tpm_tis_init(struct device *dev, > > > struct tpm_info *tpm_info) > > > int irq = -1; > > > int rc; > > > > > > + dmi_check_system(tpm_tis_dmi_table); > > > + > > > rc = check_acpi_tpm2(dev); > > > if (rc) > > > return rc; > > > > This looks OK to me with the caveat that anyone on one of these > > systems > > has no way to enable interrupts again if they think they have a > > fixed > > bios. What about making interrupts a tristate with the default > > value > > -1? Then in the dmi check, if we see -1 we set it to 0 but if we > > see 1 > > (the user has specified interrupts=1 on the module insert line) we > > leave it? > > > > James > > like this? Yes, that works, I think. Reviewed-by: James Bottomley <James.Bottomley@HansenPartnership.com> James > diff --git a/drivers/char/tpm/tpm_tis.c b/drivers/char/tpm/tpm_tis.c > index 0b214963539d..10c46cb26c5a 100644 > --- a/drivers/char/tpm/tpm_tis.c > +++ b/drivers/char/tpm/tpm_tis.c > @@ -27,6 +27,7 @@ > #include <linux/of.h> > #include <linux/of_device.h> > #include <linux/kernel.h> > +#include <linux/dmi.h> > #include "tpm.h" > #include "tpm_tis_core.h" > > @@ -49,8 +50,8 @@ static inline struct tpm_tis_tcg_phy > *to_tpm_tis_tcg_phy(struct tpm_tis_data *da > return container_of(data, struct tpm_tis_tcg_phy, priv); > } > > -static bool interrupts = true; > -module_param(interrupts, bool, 0444); > +static int interrupts = -1; > +module_param(interrupts, int, 0444); > MODULE_PARM_DESC(interrupts, "Enable interrupts"); > > static bool itpm; > @@ -63,6 +64,27 @@ module_param(force, bool, 0444); > MODULE_PARM_DESC(force, "Force device probe rather than using ACPI > entry"); > #endif > > +static int tpm_tis_disable_irq(const struct dmi_system_id *d) > +{ > + pr_notice("tpm_tis: %s detected: disabling interrupts.\n", d- > >ident); > + if (interrupts == -1) > + interrupts = 0; > + > + return 0; > +} > + > +static const struct dmi_system_id tpm_tis_dmi_table[] = { > + { > + .callback = tpm_tis_disable_irq, > + .ident = "ThinkPad T490s", > + .matches = { > + DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"), > + DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad > T490s"), > + }, > + }, > + {} > +}; > + > #if defined(CONFIG_PNP) && defined(CONFIG_ACPI) > static int has_hid(struct acpi_device *dev, const char *hid) > { > @@ -192,6 +214,8 @@ static int tpm_tis_init(struct device *dev, > struct tpm_info *tpm_info) > int irq = -1; > int rc; > > + dmi_check_system(tpm_tis_dmi_table); > + > rc = check_acpi_tpm2(dev); > if (rc) > return rc; >
James Bottomley @ 2020-10-15 11:57 MST: > On Thu, 2020-10-15 at 11:48 -0700, Jerry Snitselaar wrote: >> James Bottomley @ 2020-10-15 08:36 MST: >> >> > On Wed, 2020-10-14 at 13:58 -0700, Jerry Snitselaar wrote: >> > > Hans de Goede @ 2020-10-14 09:46 MST: >> > > >> > > > Hi, >> > > > >> > > > On 10/14/20 6:34 PM, Jerry Snitselaar wrote: >> > > > > Hans de Goede @ 2020-10-14 09:04 MST: >> > > > > >> > > > > > Hi, >> > > > > > >> > > > > > On 10/14/20 5:23 PM, James Bottomley wrote: >> > > > > > > On Wed, 2020-10-14 at 17:03 +0200, Hans de Goede wrote: >> > > > > > > > On 10/13/20 6:05 PM, Jerry Snitselaar wrote: >> > > > > > > > > James Bottomley @ 2020-10-13 08:24 MST: >> > > > > > > > > > On Tue, 2020-10-13 at 08:15 -0700, Jerry Snitselaar >> > > > > > > > > > wrote: >> > > > > > > > > > > Jarkko Sakkinen @ 2020-10-12 18:17 MST: >> > > > > > > [...] >> > > > > > > > > > > > Jerry, once you have some bandwidth (no >> > > > > > > > > > > > rush, >> > > > > > > > > > > > does not land >> > > > > > > > > > > > before rc2), it would be great that if you >> > > > > > > > > > > > could >> > > > > > > > > > > > try this. >> > > > > > > > > > > > I'm emphasizing this just because of the >> > > > > > > > > > > > intersection. I >> > > > > > > > > > > > think it would also make senset to get tested- >> > > > > > > > > > > > by >> > > > > > > > > > > > from Nayna. >> > > > > > > > > > > >> > > > > > > > > > > I will run some tests on some other systems I >> > > > > > > > > > > have >> > > > > > > > > > > access to. >> > > > > > > > > > > As noted in the other email I did a quick test >> > > > > > > > > > > with a >> > > > > > > > > > > t490s >> > > > > > > > > > > with an older bios that exhibits the problem >> > > > > > > > > > > originally >> > > > > > > > > > > reported when Stefan's patch enabled interrupts. >> > > > > > > > > > >> > > > > > > > > > Well, it means there's still some other problem. I >> > > > > > > > > > was >> > > > > > > > > > hoping >> > > > > > > > > > that because the rainbow pass system originally >> > > > > > > > > > exhibited the >> > > > > > > > > > same symptoms (interrupt storm) fixing it would >> > > > > > > > > > also >> > > > > > > > > > fix the t490 >> > > > > > > > > > and the ineffective EOI bug looked like a great >> > > > > > > > > > candidate for >> > > > > > > > > > being the root cause. >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > > Adding Hans to the list. >> > > > > > > > > >> > > > > > > > > IIUC in the t490s case the problem lies with the >> > > > > > > > > hardware >> > > > > > > > > itself. >> > > > > > > > > Hans, is that correct? >> > > > > > > > >> > > > > > > > More or less. AFAIK / have been told by Lenovo it is an >> > > > > > > > issue with >> > > > > > > > the configuration of the inerrupt-type of the GPIO pin >> > > > > > > > used >> > > > > > > > for the >> > > > > > > > IRQ, which is a firmware issue which could be fixed by >> > > > > > > > a >> > > > > > > > BIOS update >> > > > > > > > (the pin is setup as a direct-irq pin for the APIC, so >> > > > > > > > the >> > > > > > > > OS has no >> > > > > > > > control of the IRQ type since with APIC irqs this is >> > > > > > > > all >> > > > > > > > supposed to >> > > > > > > > be setup properly before hand). >> > > > > > > > >> > > > > > > > But it is a model specific issue, if we denylist IRQ >> > > > > > > > usage >> > > > > > > > on this >> > > > > > > > Lenovo model (and probably a few others) then we should >> > > > > > > > be >> > > > > > > > able to >> > > > > > > > restore the IRQ code to normal functionality for all >> > > > > > > > other >> > > > > > > > device >> > > > > > > > models which declare an IRQ in their resource tables. >> > > > > > > I can do that with a quirk, but how do I identify the >> > > > > > > device? TPM >> > > > > > > manufacturer and version? or do I have to use something >> > > > > > > like >> > > > > > > the ACPI >> > > > > > > bios version? >> > > > > > >> > > > > > I'm not sure if the TPM ids are unique to one model/series >> > > > > > of >> > > > > > laptops. >> > > > > > >> > > > > > So my idea for this was to match on DMI strings, >> > > > > > specifically >> > > > > > use a DMI match on the DMI_SYS_VENDOR and >> > > > > > DMI_PRODUCT_VERSION >> > > > > > strings (normally one would use DMI_PRODUCT_NAME but for >> > > > > > Lenovo >> > > > > > devices the string which you expect to be in >> > > > > > DMI_PRODUCT_NAME >> > > > > > is actually in DMI_PRODUCT_VERSION). >> > > > > > >> > > > > > You can easily get the strings for your device by doing: >> > > > > > >> > > > > > cat /sys/class/dmi/id/sys_vendor >> > > > > > cat /sys/class/dmi/id/product_version >> > > > > > >> > > > > > Regards, >> > > > > > >> > > > > > Hans >> > > > > Plus use dmi_get_date(DMI_BIOS_DATE,...) to check >> > > > > if the bios is older than the fixed bios? Has Lenovo >> > > > > released the fixed bios? >> > > > >> > > > Maybe, the fixed BIOS-es which I have seen (for the X1C8, >> > > > broken BIOS was a pre-production BIOS) "fixed" this by >> > > > no longer listing an IRQ in the ACPI resources for the TPM. >> > > > >> > > > Which means that the new BIOS still being on the deny list >> > > > does not matter since the IRQ support won't work anyways as >> > > > we no longer get an IRQ assigned. >> > > > >> > > > So I don't think this is necessary and it will just complicate >> > > > things unnecessarily. This whole saga has already taken way >> > > > too long to fix. So IMHO the simplest fix where we just deny >> > > > list the broken models independent of BIOS versions and move >> > > > on seems best. >> > > > >> > > > Regards, >> > > > >> > > > Hans >> > > >> > > This worked for me: >> > > >> > > diff --git a/drivers/char/tpm/tpm_tis.c >> > > b/drivers/char/tpm/tpm_tis.c >> > > index 0b214963539d..abe674d1de6d 100644 >> > > --- a/drivers/char/tpm/tpm_tis.c >> > > +++ b/drivers/char/tpm/tpm_tis.c >> > > @@ -27,6 +27,7 @@ >> > > #include <linux/of.h> >> > > #include <linux/of_device.h> >> > > #include <linux/kernel.h> >> > > +#include <linux/dmi.h> >> > > #include "tpm.h" >> > > #include "tpm_tis_core.h" >> > > >> > > @@ -63,6 +64,26 @@ module_param(force, bool, 0444); >> > > MODULE_PARM_DESC(force, "Force device probe rather than using >> > > ACPI >> > > entry"); >> > > #endif >> > > >> > > +static int tpm_tis_disable_irq(const struct dmi_system_id *d) >> > > +{ >> > > + pr_notice("tpm_tis: %s detected: disabling >> > > interrupts.\n", d- >> > > > ident); >> > > + interrupts = false; >> > > + >> > > + return 0; >> > > +} >> > > + >> > > +static const struct dmi_system_id tpm_tis_dmi_table[] = { >> > > + { >> > > + .callback = tpm_tis_disable_irq, >> > > + .ident = "ThinkPad T490s", >> > > + .matches = { >> > > + DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"), >> > > + DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad >> > > T490s"), >> > > + }, >> > > + }, >> > > + {} >> > > +}; >> > > + >> > > #if defined(CONFIG_PNP) && defined(CONFIG_ACPI) >> > > static int has_hid(struct acpi_device *dev, const char *hid) >> > > { >> > > @@ -192,6 +213,8 @@ static int tpm_tis_init(struct device *dev, >> > > struct tpm_info *tpm_info) >> > > int irq = -1; >> > > int rc; >> > > >> > > + dmi_check_system(tpm_tis_dmi_table); >> > > + >> > > rc = check_acpi_tpm2(dev); >> > > if (rc) >> > > return rc; >> > >> > This looks OK to me with the caveat that anyone on one of these >> > systems >> > has no way to enable interrupts again if they think they have a >> > fixed >> > bios. What about making interrupts a tristate with the default >> > value >> > -1? Then in the dmi check, if we see -1 we set it to 0 but if we >> > see 1 >> > (the user has specified interrupts=1 on the module insert line) we >> > leave it? >> > >> > James >> >> like this? > > Yes, that works, I think. > > Reviewed-by: James Bottomley <James.Bottomley@HansenPartnership.com> > > James > I'll send a real patch in a just a bit. The pr_notice needs to move into the if block, otherwise it will log that it is disabling interrupts when the user has forced enabling. Regards, Jerry > >> diff --git a/drivers/char/tpm/tpm_tis.c b/drivers/char/tpm/tpm_tis.c >> index 0b214963539d..10c46cb26c5a 100644 >> --- a/drivers/char/tpm/tpm_tis.c >> +++ b/drivers/char/tpm/tpm_tis.c >> @@ -27,6 +27,7 @@ >> #include <linux/of.h> >> #include <linux/of_device.h> >> #include <linux/kernel.h> >> +#include <linux/dmi.h> >> #include "tpm.h" >> #include "tpm_tis_core.h" >> >> @@ -49,8 +50,8 @@ static inline struct tpm_tis_tcg_phy >> *to_tpm_tis_tcg_phy(struct tpm_tis_data *da >> return container_of(data, struct tpm_tis_tcg_phy, priv); >> } >> >> -static bool interrupts = true; >> -module_param(interrupts, bool, 0444); >> +static int interrupts = -1; >> +module_param(interrupts, int, 0444); >> MODULE_PARM_DESC(interrupts, "Enable interrupts"); >> >> static bool itpm; >> @@ -63,6 +64,27 @@ module_param(force, bool, 0444); >> MODULE_PARM_DESC(force, "Force device probe rather than using ACPI >> entry"); >> #endif >> >> +static int tpm_tis_disable_irq(const struct dmi_system_id *d) >> +{ >> + pr_notice("tpm_tis: %s detected: disabling interrupts.\n", d- >> >ident); >> + if (interrupts == -1) >> + interrupts = 0; >> + >> + return 0; >> +} >> + >> +static const struct dmi_system_id tpm_tis_dmi_table[] = { >> + { >> + .callback = tpm_tis_disable_irq, >> + .ident = "ThinkPad T490s", >> + .matches = { >> + DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"), >> + DMI_MATCH(DMI_PRODUCT_VERSION, "ThinkPad >> T490s"), >> + }, >> + }, >> + {} >> +}; >> + >> #if defined(CONFIG_PNP) && defined(CONFIG_ACPI) >> static int has_hid(struct acpi_device *dev, const char *hid) >> { >> @@ -192,6 +214,8 @@ static int tpm_tis_init(struct device *dev, >> struct tpm_info *tpm_info) >> int irq = -1; >> int rc; >> >> + dmi_check_system(tpm_tis_dmi_table); >> + >> rc = check_acpi_tpm2(dev); >> if (rc) >> return rc; >>
On Tue, Oct 13, 2020 at 04:23:54AM +0300, Jarkko Sakkinen wrote: > On Sun, Oct 11, 2020 at 10:39:07PM -0700, Jerry Snitselaar wrote: > > > > James Bottomley @ 2020-10-01 11:09 MST: > > > > > The current state of the TIS TPM is that interrupts have been globally > > > disabled by various changes. The problems we got reported the last > > > time they were enabled was interrupt storms. With my own TIS TPM, > > > I've found that this is caused because my TPM doesn't do legacy > > > cycles, The TIS spec (chapter 6.1 "Locality Usage Per Register") > > > requires any TIS TPM without legacy cycles not to act on any write to > > > an interrupt register unless the locality is enabled. This means if > > > an interrupt fires after we relinquish the locality, the TPM_EOI in > > > the interrupt routine is ineffective meaning the same interrupt > > > triggers over and over again. This problem also means we can have > > > trouble setting up interrupts on TIS TPMs because the current init > > > code does the setup before the locality is claimed for the first time. > > > > > > James > > > > > > > > > I tested initially with the following commits reverted: > > > > aa4a63dd9816 tpm: Revert "tpm_tis_core: Turn on the TPM before probing IRQ's" | 2020-01-06 | (Stefan Berger) > > dda8b2af395b tpm: Revert "tpm_tis_core: Set TPM_CHIP_FLAG_IRQ before probing for interrupts" | 2020-01-06 | (Stefan Berger) > > > > The laptop doesn't become unusable, but I lose the trackpad and get the following: > > > > [ 3.070501] irq 31: nobody cared (try booting with the "irqpoll" option) > > [ 3.070504] CPU: 2 PID: 251 Comm: rngd Not tainted 5.8.13-201.local.fc32.x86_64 #1 > > [ 3.070504] Hardware name: LENOVO 20NYS7K90F/20NYS7K90F, BIOS N2JET83W (1.61 ) 11/22/2019 > > [ 3.070505] Call Trace: > > [ 3.070511] dump_stack+0x6b/0x88 > > [ 3.070514] __report_bad_irq+0x35/0xa7 > > [ 3.070516] note_interrupt.cold+0xb/0x6a > > [ 3.070518] handle_irq_event+0x88/0x8a > > [ 3.070519] handle_fasteoi_irq+0x78/0x1c0 > > [ 3.070522] common_interrupt+0x68/0x140 > > [ 3.070524] ? asm_common_interrupt+0x8/0x40 > > [ 3.070525] asm_common_interrupt+0x1e/0x40 > > [ 3.070527] RIP: 0033:0x7f2eea3249b5 > > [ 3.070529] Code: e0 48 c1 e8 3c 83 e0 01 48 31 45 f0 48 8b 45 e0 48 c1 e8 37 83 e0 01 48 31 45 f0 48 8b 45 e0 48 c1 e8 1e 83 e0 01 48 31 45 f0 <48> 8b 45 e0 48 c1 e8 1b 83 e0 01 48 31 45 f0 48 8b 45 e0 48 c1 e8 > > [ 3.070529] RSP: 002b:00007f2ee3ffebc0 EFLAGS: 00000202 > > [ 3.070530] RAX: 0000000000000001 RBX: 000000000000000a RCX: 000000000000002e > > [ 3.070531] RDX: 0463400000000000 RSI: 0000000000000000 RDI: 0000000000000001 > > [ 3.070532] RBP: 00007f2ee3ffec10 R08: 0000000000000000 R09: 0000000000000035 > > [ 3.070532] R10: 00007ffde716f080 R11: 00007ffde716f080 R12: 00007f2edc004c00 > > [ 3.070533] R13: 00007ffde700a54f R14: 00007f2ee3ffed10 R15: 00005598a41e3680 > > [ 3.070534] handlers: > > [ 3.070537] [<000000007b6f3232>] tis_int_handler > > [ 3.070538] Disabling IRQ #31 > > ... > > [ 14.956342] irq 16: nobody cared (try booting with the "irqpoll" option) > > [ 14.956344] CPU: 0 PID: 1013 Comm: rngd Not tainted 5.8.13-201.local.fc32.x86_64 #1 > > [ 14.956344] Hardware name: LENOVO 20NYS7K90F/20NYS7K90F, BIOS N2JET83W (1.61 ) 11/22/2019 > > [ 14.956345] Call Trace: > > [ 14.956350] dump_stack+0x6b/0x88 > > [ 14.956353] __report_bad_irq+0x35/0xa7 > > [ 14.956354] note_interrupt.cold+0xb/0x6a > > [ 14.956355] handle_irq_event+0x88/0x8a > > [ 14.956356] handle_fasteoi_irq+0x78/0x1c0 > > [ 14.956358] common_interrupt+0x68/0x140 > > [ 14.956360] ? asm_common_interrupt+0x8/0x40 > > [ 14.956361] asm_common_interrupt+0x1e/0x40 > > [ 14.956362] RIP: 0033:0x7fcaff4399d3 > > [ 14.956363] Code: e0 48 c1 e8 1e 83 e0 01 48 31 45 f0 48 8b 45 e0 48 c1 e8 1b 83 e0 01 48 31 45 f0 48 8b 45 e0 48 c1 e8 16 83 e0 01 48 31 45 f0 <48> d1 65 e0 48 8b 45 f0 48 31 45 e0 83 45 d4 01 83 7d d4 40 0f 86 > > [ 14.956364] RSP: 002b:00007fcafe544bc0 EFLAGS: 00000246 > > [ 14.956364] RAX: 0000000000000000 RBX: 000000000000000a RCX: 0000000000000026 > > [ 14.956365] RDX: 000df20000000000 RSI: 0000000000000000 RDI: 0000000000000001 > > [ 14.956365] RBP: 00007fcafe544c10 R08: 0000000000000000 R09: 0000000000000035 > > [ 14.956366] R10: 00007ffc30bd2080 R11: 00007ffc30bd2080 R12: 00007fcaf0004c00 > > [ 14.956366] R13: 00007fcaf0004c00 R14: 00007fcaf0000b60 R15: 0000560216fcf0f2 > > [ 14.956367] handlers: > > [ 14.956370] [<0000000024c0571e>] i801_isr [i2c_i801] > > [ 14.956371] Disabling IRQ #16 > > > > /proc/interrupts shows: > > > > 16: 100000 0 0 0 0 0 0 0 IR-IO-APIC 16-fasteoi i801_smbus > > 31: 0 0 100000 0 0 0 0 0 IR-IO-APIC 31-fasteoi tpm0 > > > > > > I also get this behavior with your patchset applied. If I boot with > > tpm_tis.interrupts=0 it behaves. The thought from Hans is to look at > > the dmi info and key off the vendor being Lenovo and bios date to > > decide if interrupts should be disabled. Since Dan ran also into this > > with something internal at Intel I don't know if that will be > > sufficient. > > > > I hope to look over this patchset tomorrow. > > > > Regards, > > Jerry > > I'm sorry, I received this email only after pulling the latest with fdm. So, this is the main blocker to take these patches. /Jarkko
On Tue, Oct 13, 2020 at 08:15:36AM -0700, Jerry Snitselaar wrote: > > Jarkko Sakkinen @ 2020-10-12 18:17 MST: > > > On Thu, Oct 01, 2020 at 11:09:20AM -0700, James Bottomley wrote: > >> The current state of the TIS TPM is that interrupts have been globally > >> disabled by various changes. The problems we got reported the last > >> time they were enabled was interrupt storms. With my own TIS TPM, > >> I've found that this is caused because my TPM doesn't do legacy > >> cycles, The TIS spec (chapter 6.1 "Locality Usage Per Register") > >> requires any TIS TPM without legacy cycles not to act on any write to > >> an interrupt register unless the locality is enabled. This means if > >> an interrupt fires after we relinquish the locality, the TPM_EOI in > >> the interrupt routine is ineffective meaning the same interrupt > >> triggers over and over again. This problem also means we can have > >> trouble setting up interrupts on TIS TPMs because the current init > >> code does the setup before the locality is claimed for the first time. > >> > >> James > > > > You should consider expanding the audience. Jerry, once you have some > > bandwidth (no rush, does not land before rc2), it would be great that if > > you could try this. I'm emphasizing this just because of the > > intersection. I think it would also make senset to get tested-by from > > Nayna. > > I will run some tests on some other systems I have access to. As noted > in the other email I did a quick test with a t490s with an older bios > that exhibits the problem originally reported when Stefan's patch > enabled interrupts. Thank you. As said, I can make a pull request to rc2 or even rc3, if needed. /Jarkko
On Thu, Oct 15, 2020 at 09:38:24AM +0200, Hans de Goede wrote: > Hi, > > On 10/14/20 10:58 PM, Jerry Snitselaar wrote: > > > > Hans de Goede @ 2020-10-14 09:46 MST: > > > > > Hi, > > > > > > On 10/14/20 6:34 PM, Jerry Snitselaar wrote: > > > > Hans de Goede @ 2020-10-14 09:04 MST: > > > > > > > > > Hi, > > > > > > > > > > On 10/14/20 5:23 PM, James Bottomley wrote: > > > > > > On Wed, 2020-10-14 at 17:03 +0200, Hans de Goede wrote: > > > > > > > On 10/13/20 6:05 PM, Jerry Snitselaar wrote: > > > > > > > > James Bottomley @ 2020-10-13 08:24 MST: > > > > > > > > > On Tue, 2020-10-13 at 08:15 -0700, Jerry Snitselaar wrote: > > > > > > > > > > Jarkko Sakkinen @ 2020-10-12 18:17 MST: > > > > > > [...] > > > > > > > > > > > Jerry, once you have some bandwidth (no rush, does not land > > > > > > > > > > > before rc2), it would be great that if you could try this. > > > > > > > > > > > I'm emphasizing this just because of the intersection. I > > > > > > > > > > > think it would also make senset to get tested-by from Nayna. > > > > > > > > > > > > > > > > > > > > I will run some tests on some other systems I have access to. > > > > > > > > > > As noted in the other email I did a quick test with a t490s > > > > > > > > > > with an older bios that exhibits the problem originally > > > > > > > > > > reported when Stefan's patch enabled interrupts. > > > > > > > > > > > > > > > > > > Well, it means there's still some other problem. I was hoping > > > > > > > > > that because the rainbow pass system originally exhibited the > > > > > > > > > same symptoms (interrupt storm) fixing it would also fix the t490 > > > > > > > > > and the ineffective EOI bug looked like a great candidate for > > > > > > > > > being the root cause. > > > > > > > > > > > > > > > > > > > > > > > > > Adding Hans to the list. > > > > > > > > > > > > > > > > IIUC in the t490s case the problem lies with the hardware itself. > > > > > > > > Hans, is that correct? > > > > > > > > > > > > > > More or less. AFAIK / have been told by Lenovo it is an issue with > > > > > > > the configuration of the inerrupt-type of the GPIO pin used for the > > > > > > > IRQ, which is a firmware issue which could be fixed by a BIOS update > > > > > > > (the pin is setup as a direct-irq pin for the APIC, so the OS has no > > > > > > > control of the IRQ type since with APIC irqs this is all supposed to > > > > > > > be setup properly before hand). > > > > > > > > > > > > > > But it is a model specific issue, if we denylist IRQ usage on this > > > > > > > Lenovo model (and probably a few others) then we should be able to > > > > > > > restore the IRQ code to normal functionality for all other device > > > > > > > models which declare an IRQ in their resource tables. > > > > > > I can do that with a quirk, but how do I identify the device? TPM > > > > > > manufacturer and version? or do I have to use something like the ACPI > > > > > > bios version? > > > > > > > > > > I'm not sure if the TPM ids are unique to one model/series of laptops. > > > > > > > > > > So my idea for this was to match on DMI strings, specifically > > > > > use a DMI match on the DMI_SYS_VENDOR and DMI_PRODUCT_VERSION > > > > > strings (normally one would use DMI_PRODUCT_NAME but for Lenovo > > > > > devices the string which you expect to be in DMI_PRODUCT_NAME > > > > > is actually in DMI_PRODUCT_VERSION). > > > > > > > > > > You can easily get the strings for your device by doing: > > > > > > > > > > cat /sys/class/dmi/id/sys_vendor > > > > > cat /sys/class/dmi/id/product_version > > > > > > > > > > Regards, > > > > > > > > > > Hans > > > > Plus use dmi_get_date(DMI_BIOS_DATE,...) to check > > > > if the bios is older than the fixed bios? Has Lenovo > > > > released the fixed bios? > > > > > > Maybe, the fixed BIOS-es which I have seen (for the X1C8, > > > broken BIOS was a pre-production BIOS) "fixed" this by > > > no longer listing an IRQ in the ACPI resources for the TPM. > > > > > > Which means that the new BIOS still being on the deny list > > > does not matter since the IRQ support won't work anyways as > > > we no longer get an IRQ assigned. > > > > > > So I don't think this is necessary and it will just complicate > > > things unnecessarily. This whole saga has already taken way > > > too long to fix. So IMHO the simplest fix where we just deny > > > list the broken models independent of BIOS versions and move > > > on seems best. > > > > > > Regards, > > > > > > Hans > > > > This worked for me: > > That looks good to me, can you submit this upstream please ? > > If you Cc me I'll give it my Reviewed-by. > > Regards, > > Hans Yes, this does look like sustainable patch. PS. Email has been migration ongoing. That's the reason for inconsistent responses. /Jarkko
Jarkko Sakkinen @ 2020-10-18 14:05 MST: > On Tue, Oct 13, 2020 at 08:15:36AM -0700, Jerry Snitselaar wrote: >> >> Jarkko Sakkinen @ 2020-10-12 18:17 MST: >> >> > On Thu, Oct 01, 2020 at 11:09:20AM -0700, James Bottomley wrote: >> >> The current state of the TIS TPM is that interrupts have been globally >> >> disabled by various changes. The problems we got reported the last >> >> time they were enabled was interrupt storms. With my own TIS TPM, >> >> I've found that this is caused because my TPM doesn't do legacy >> >> cycles, The TIS spec (chapter 6.1 "Locality Usage Per Register") >> >> requires any TIS TPM without legacy cycles not to act on any write to >> >> an interrupt register unless the locality is enabled. This means if >> >> an interrupt fires after we relinquish the locality, the TPM_EOI in >> >> the interrupt routine is ineffective meaning the same interrupt >> >> triggers over and over again. This problem also means we can have >> >> trouble setting up interrupts on TIS TPMs because the current init >> >> code does the setup before the locality is claimed for the first time. >> >> >> >> James >> > >> > You should consider expanding the audience. Jerry, once you have some >> > bandwidth (no rush, does not land before rc2), it would be great that if >> > you could try this. I'm emphasizing this just because of the >> > intersection. I think it would also make senset to get tested-by from >> > Nayna. >> >> I will run some tests on some other systems I have access to. As noted >> in the other email I did a quick test with a t490s with an older bios >> that exhibits the problem originally reported when Stefan's patch >> enabled interrupts. > > Thank you. As said, I can make a pull request to rc2 or even rc3, if > needed. > > /Jarkko So outside of the t490s I have access to, it looks like the nuc5 with tpm2.0 device, and and older lenovo D30 with a tpm1.2 device both are not using interrupts. I'm digging around to see if I can find some other systems that I can test interrupts on. Regards, Jerry
On Tue, Oct 20, 2020 at 04:10:42PM -0700, Jerry Snitselaar wrote: > > Jarkko Sakkinen @ 2020-10-18 14:05 MST: > > > On Tue, Oct 13, 2020 at 08:15:36AM -0700, Jerry Snitselaar wrote: > >> > >> Jarkko Sakkinen @ 2020-10-12 18:17 MST: > >> > >> > On Thu, Oct 01, 2020 at 11:09:20AM -0700, James Bottomley wrote: > >> >> The current state of the TIS TPM is that interrupts have been globally > >> >> disabled by various changes. The problems we got reported the last > >> >> time they were enabled was interrupt storms. With my own TIS TPM, > >> >> I've found that this is caused because my TPM doesn't do legacy > >> >> cycles, The TIS spec (chapter 6.1 "Locality Usage Per Register") > >> >> requires any TIS TPM without legacy cycles not to act on any write to > >> >> an interrupt register unless the locality is enabled. This means if > >> >> an interrupt fires after we relinquish the locality, the TPM_EOI in > >> >> the interrupt routine is ineffective meaning the same interrupt > >> >> triggers over and over again. This problem also means we can have > >> >> trouble setting up interrupts on TIS TPMs because the current init > >> >> code does the setup before the locality is claimed for the first time. > >> >> > >> >> James > >> > > >> > You should consider expanding the audience. Jerry, once you have some > >> > bandwidth (no rush, does not land before rc2), it would be great that if > >> > you could try this. I'm emphasizing this just because of the > >> > intersection. I think it would also make senset to get tested-by from > >> > Nayna. > >> > >> I will run some tests on some other systems I have access to. As noted > >> in the other email I did a quick test with a t490s with an older bios > >> that exhibits the problem originally reported when Stefan's patch > >> enabled interrupts. > > > > Thank you. As said, I can make a pull request to rc2 or even rc3, if > > needed. > > > > /Jarkko > > So outside of the t490s I have access to, it looks like the nuc5 with > tpm2.0 device, and and older lenovo D30 with a tpm1.2 device both are > not using interrupts. I'm digging around to see if I can find some > other systems that I can test interrupts on. > > Regards, > Jerry I'm going to test with this C720P chromebook, which has a long standing bug in the kernel bugzilla. I got it a while ago but it's stuck in the boot. If that doesn't boot, I'll pick up old Ivylake NUC from the office, which has TPM 1.2 and bit newer TPM 2.0 NUC. Should anyway pick them, have not used them for testing for a while because of pandemia. /Jarkko
Jarkko Sakkinen @ 2020-10-24 05:20 MST: > On Tue, Oct 20, 2020 at 04:10:42PM -0700, Jerry Snitselaar wrote: >> >> Jarkko Sakkinen @ 2020-10-18 14:05 MST: >> >> > On Tue, Oct 13, 2020 at 08:15:36AM -0700, Jerry Snitselaar wrote: >> >> >> >> Jarkko Sakkinen @ 2020-10-12 18:17 MST: >> >> >> >> > On Thu, Oct 01, 2020 at 11:09:20AM -0700, James Bottomley wrote: >> >> >> The current state of the TIS TPM is that interrupts have been globally >> >> >> disabled by various changes. The problems we got reported the last >> >> >> time they were enabled was interrupt storms. With my own TIS TPM, >> >> >> I've found that this is caused because my TPM doesn't do legacy >> >> >> cycles, The TIS spec (chapter 6.1 "Locality Usage Per Register") >> >> >> requires any TIS TPM without legacy cycles not to act on any write to >> >> >> an interrupt register unless the locality is enabled. This means if >> >> >> an interrupt fires after we relinquish the locality, the TPM_EOI in >> >> >> the interrupt routine is ineffective meaning the same interrupt >> >> >> triggers over and over again. This problem also means we can have >> >> >> trouble setting up interrupts on TIS TPMs because the current init >> >> >> code does the setup before the locality is claimed for the first time. >> >> >> >> >> >> James >> >> > >> >> > You should consider expanding the audience. Jerry, once you have some >> >> > bandwidth (no rush, does not land before rc2), it would be great that if >> >> > you could try this. I'm emphasizing this just because of the >> >> > intersection. I think it would also make senset to get tested-by from >> >> > Nayna. >> >> >> >> I will run some tests on some other systems I have access to. As noted >> >> in the other email I did a quick test with a t490s with an older bios >> >> that exhibits the problem originally reported when Stefan's patch >> >> enabled interrupts. >> > >> > Thank you. As said, I can make a pull request to rc2 or even rc3, if >> > needed. >> > >> > /Jarkko >> >> So outside of the t490s I have access to, it looks like the nuc5 with >> tpm2.0 device, and and older lenovo D30 with a tpm1.2 device both are >> not using interrupts. I'm digging around to see if I can find some >> other systems that I can test interrupts on. >> >> Regards, >> Jerry > > I'm going to test with this C720P chromebook, which has a long standing > bug in the kernel bugzilla. I got it a while ago but it's stuck in the > boot. > > If that doesn't boot, I'll pick up old Ivylake NUC from the office, > which has TPM 1.2 and bit newer TPM 2.0 NUC. Should anyway pick them, > have not used them for testing for a while because of pandemia. > > /Jarkko My search continues through the systems in our labs. Running into the same issue with the Dells. Looking at the TCPA table it looks like they aren't set up for interrupts as well. I have some tpm2.0 systems to still try. Jerry
On Mon, Oct 26, 2020 at 11:29:46AM -0700, Jerry Snitselaar wrote: > > Jarkko Sakkinen @ 2020-10-24 05:20 MST: > > > On Tue, Oct 20, 2020 at 04:10:42PM -0700, Jerry Snitselaar wrote: > >> > >> Jarkko Sakkinen @ 2020-10-18 14:05 MST: > >> > >> > On Tue, Oct 13, 2020 at 08:15:36AM -0700, Jerry Snitselaar wrote: > >> >> > >> >> Jarkko Sakkinen @ 2020-10-12 18:17 MST: > >> >> > >> >> > On Thu, Oct 01, 2020 at 11:09:20AM -0700, James Bottomley wrote: > >> >> >> The current state of the TIS TPM is that interrupts have been globally > >> >> >> disabled by various changes. The problems we got reported the last > >> >> >> time they were enabled was interrupt storms. With my own TIS TPM, > >> >> >> I've found that this is caused because my TPM doesn't do legacy > >> >> >> cycles, The TIS spec (chapter 6.1 "Locality Usage Per Register") > >> >> >> requires any TIS TPM without legacy cycles not to act on any write to > >> >> >> an interrupt register unless the locality is enabled. This means if > >> >> >> an interrupt fires after we relinquish the locality, the TPM_EOI in > >> >> >> the interrupt routine is ineffective meaning the same interrupt > >> >> >> triggers over and over again. This problem also means we can have > >> >> >> trouble setting up interrupts on TIS TPMs because the current init > >> >> >> code does the setup before the locality is claimed for the first time. > >> >> >> > >> >> >> James > >> >> > > >> >> > You should consider expanding the audience. Jerry, once you have some > >> >> > bandwidth (no rush, does not land before rc2), it would be great that if > >> >> > you could try this. I'm emphasizing this just because of the > >> >> > intersection. I think it would also make senset to get tested-by from > >> >> > Nayna. > >> >> > >> >> I will run some tests on some other systems I have access to. As noted > >> >> in the other email I did a quick test with a t490s with an older bios > >> >> that exhibits the problem originally reported when Stefan's patch > >> >> enabled interrupts. > >> > > >> > Thank you. As said, I can make a pull request to rc2 or even rc3, if > >> > needed. > >> > > >> > /Jarkko > >> > >> So outside of the t490s I have access to, it looks like the nuc5 with > >> tpm2.0 device, and and older lenovo D30 with a tpm1.2 device both are > >> not using interrupts. I'm digging around to see if I can find some > >> other systems that I can test interrupts on. > >> > >> Regards, > >> Jerry > > > > I'm going to test with this C720P chromebook, which has a long standing > > bug in the kernel bugzilla. I got it a while ago but it's stuck in the > > boot. > > > > If that doesn't boot, I'll pick up old Ivylake NUC from the office, > > which has TPM 1.2 and bit newer TPM 2.0 NUC. Should anyway pick them, > > have not used them for testing for a while because of pandemia. > > > > /Jarkko > > My search continues through the systems in our labs. Running into the > same issue with the Dells. Looking at the TCPA table it looks like > they aren't set up for interrupts as well. I have some tpm2.0 systems > to still try. The chromebook that I have feels like bricked, at least for the window for testing this bug. I'll pick up the nucs tomorrow. > Jerry /Jarkko