Message ID | 20200526183213.20720-1-mario.limonciello@dell.com (mailing list archive) |
---|---|
State | Rejected, archived |
Headers | show |
Series | tpm: Revert "tpm: fix invalid locking in NONBLOCKING mode" | expand |
On Tue, 2020-05-26 at 13:32 -0500, Mario Limonciello wrote: > This reverts commit d23d12484307b40eea549b8a858f5fffad913897. > > This commit has caused regressions for the XPS 9560 containing > a Nuvoton TPM. Presumably this is using the tis driver? > As mentioned by the reporter all TPM2 commands are failing with: > ERROR:tcti:src/tss2-tcti/tcti-device.c:290:tcti_device_receive() > Failed to read response from fd 3, got errno 1: Operation not > permitted > > The reporter bisected this issue back to this commit which was > backported to stable as commit 4d6ebc4. I think the problem is request_locality ... for some inexplicable reason a failure there returns -1, which is EPERM to user space. That seems to be a bug in the async code since everything else gives a ESPIPE error if tpm_try_get_ops fails ... at least no-one assumes it gives back a sensible return code. What I think is happening is that with the patch the TPM goes through a quick sequence of request, relinquish, request, relinquish and it's the third request which is failing (likely timing out). Without the patch, the patch there's only one request,relinquish cycle because the ops are held while the async work is executed. I have a vague recollection that there is a problem with too many locality request in quick succession, but I'll defer to Jason, who I think understands the intricacies of localities better than I do. If that's the problem, the solution looks simple enough: just move the ops get down because the priv state is already protected by the buffer mutex James --- diff --git a/drivers/char/tpm/tpm-dev-common.c b/drivers/char/tpm/tpm-dev-common.c index 87f449340202..1784530b8387 100644 --- a/drivers/char/tpm/tpm-dev-common.c +++ b/drivers/char/tpm/tpm-dev-common.c @@ -189,15 +189,6 @@ ssize_t tpm_common_write(struct file *file, const char __user *buf, goto out; } - /* atomic tpm command send and result receive. We only hold the ops - * lock during this period so that the tpm can be unregistered even if - * the char dev is held open. - */ - if (tpm_try_get_ops(priv->chip)) { - ret = -EPIPE; - goto out; - } - priv->response_length = 0; priv->response_read = false; *off = 0; @@ -211,11 +202,19 @@ ssize_t tpm_common_write(struct file *file, const char __user *buf, if (file->f_flags & O_NONBLOCK) { priv->command_enqueued = true; queue_work(tpm_dev_wq, &priv->async_work); - tpm_put_ops(priv->chip); mutex_unlock(&priv->buffer_mutex); return size; } + /* atomic tpm command send and result receive. We only hold the ops + * lock during this period so that the tpm can be unregistered even if + * the char dev is held open. + */ + if (tpm_try_get_ops(priv->chip)) { + ret = -EPIPE; + goto out; + } + ret = tpm_dev_transmit(priv->chip, priv->space, priv->data_buffer, sizeof(priv->data_buffer)); tpm_put_ops(priv->chip);
> On Tue, 2020-05-26 at 13:32 -0500, Mario Limonciello wrote: > > This reverts commit d23d12484307b40eea549b8a858f5fffad913897. > > > > This commit has caused regressions for the XPS 9560 containing > > a Nuvoton TPM. > > Presumably this is using the tis driver? Correct. > > > As mentioned by the reporter all TPM2 commands are failing with: > > ERROR:tcti:src/tss2-tcti/tcti-device.c:290:tcti_device_receive() > > Failed to read response from fd 3, got errno 1: Operation not > > permitted > > > > The reporter bisected this issue back to this commit which was > > backported to stable as commit 4d6ebc4. > > I think the problem is request_locality ... for some inexplicable > reason a failure there returns -1, which is EPERM to user space. > > That seems to be a bug in the async code since everything else gives a > ESPIPE error if tpm_try_get_ops fails ... at least no-one assumes it > gives back a sensible return code. > > What I think is happening is that with the patch the TPM goes through a > quick sequence of request, relinquish, request, relinquish and it's the > third request which is failing (likely timing out). Without the patch, > the patch there's only one request,relinquish cycle because the ops are > held while the async work is executed. I have a vague recollection > that there is a problem with too many locality request in quick > succession, but I'll defer to Jason, who I think understands the > intricacies of localities better than I do. Thanks, I don't pretend to understand the nuances of this particular code, but I was hoping that the request to revert got some attention since Alex's kernel Bugzilla and message a few months ago to linux integrity weren't. > > If that's the problem, the solution looks simple enough: just move the > ops get down because the priv state is already protected by the buffer > mutex Yeah, if that works for Alex's situation it certainly sounds like a better solution than reverting this patch as this patch actually does fix a problem reported by Jeffrin originally. Could you propose a specific patch that Alex and Jeffrin can perhaps both try? > > James > > --- > > diff --git a/drivers/char/tpm/tpm-dev-common.c b/drivers/char/tpm/tpm-dev- > common.c > index 87f449340202..1784530b8387 100644 > --- a/drivers/char/tpm/tpm-dev-common.c > +++ b/drivers/char/tpm/tpm-dev-common.c > @@ -189,15 +189,6 @@ ssize_t tpm_common_write(struct file *file, const char > __user *buf, > goto out; > } > > - /* atomic tpm command send and result receive. We only hold the ops > - * lock during this period so that the tpm can be unregistered even if > - * the char dev is held open. > - */ > - if (tpm_try_get_ops(priv->chip)) { > - ret = -EPIPE; > - goto out; > - } > - > priv->response_length = 0; > priv->response_read = false; > *off = 0; > @@ -211,11 +202,19 @@ ssize_t tpm_common_write(struct file *file, const char > __user *buf, > if (file->f_flags & O_NONBLOCK) { > priv->command_enqueued = true; > queue_work(tpm_dev_wq, &priv->async_work); > - tpm_put_ops(priv->chip); > mutex_unlock(&priv->buffer_mutex); > return size; > } > > + /* atomic tpm command send and result receive. We only hold the ops > + * lock during this period so that the tpm can be unregistered even if > + * the char dev is held open. > + */ > + if (tpm_try_get_ops(priv->chip)) { > + ret = -EPIPE; > + goto out; > + } > + > ret = tpm_dev_transmit(priv->chip, priv->space, priv->data_buffer, > sizeof(priv->data_buffer)); > tpm_put_ops(priv->chip);
On Tue, 2020-05-26 at 19:23 +0000, Mario.Limonciello@dell.com wrote: > > On Tue, 2020-05-26 at 13:32 -0500, Mario Limonciello wrote: > > > This reverts commit d23d12484307b40eea549b8a858f5fffad913897. > > > > > > This commit has caused regressions for the XPS 9560 containing > > > a Nuvoton TPM. > > > > Presumably this is using the tis driver? > > Correct. > > > > > > As mentioned by the reporter all TPM2 commands are failing with: > > > ERROR:tcti:src/tss2-tcti/tcti- > > > device.c:290:tcti_device_receive() > > > Failed to read response from fd 3, got errno 1: Operation not > > > permitted > > > > > > The reporter bisected this issue back to this commit which was > > > backported to stable as commit 4d6ebc4. > > > > I think the problem is request_locality ... for some inexplicable > > reason a failure there returns -1, which is EPERM to user space. > > > > That seems to be a bug in the async code since everything else > > gives a ESPIPE error if tpm_try_get_ops fails ... at least no-one > > assumes it gives back a sensible return code. > > > > What I think is happening is that with the patch the TPM goes > > through a quick sequence of request, relinquish, request, > > relinquish and it's the third request which is failing (likely > > timing out). Without the patch, the patch there's only one > > request,relinquish cycle because the ops are held while the async > > work is executed. I have a vague recollection that there is a > > problem with too many locality request in quick succession, but > > I'll defer to Jason, who I think understands the intricacies of > > localities better than I do. > > Thanks, I don't pretend to understand the nuances of this particular > code, but I was hoping that the request to revert got some attention > since Alex's kernel Bugzilla and message a few months ago to linux > integrity weren't. > > > > > If that's the problem, the solution looks simple enough: just move > > the ops get down because the priv state is already protected by the > > buffer mutex > > Yeah, if that works for Alex's situation it certainly sounds like a > better solution than reverting this patch as this patch actually does > fix a problem reported by Jeffrin originally. > > Could you propose a specific patch that Alex and Jeffrin can perhaps > both try? Um, what's wrong with the one I originally attached and which you quote below? It's only compile tested, but I think it will work, if the theory is correct. James > > > > James > > > > --- > > > > diff --git a/drivers/char/tpm/tpm-dev-common.c > > b/drivers/char/tpm/tpm-dev- > > common.c > > index 87f449340202..1784530b8387 100644 > > --- a/drivers/char/tpm/tpm-dev-common.c > > +++ b/drivers/char/tpm/tpm-dev-common.c > > @@ -189,15 +189,6 @@ ssize_t tpm_common_write(struct file *file, > > const char > > __user *buf, > > goto out; > > } > > > > - /* atomic tpm command send and result receive. We only > > hold the ops > > - * lock during this period so that the tpm can be > > unregistered even if > > - * the char dev is held open. > > - */ > > - if (tpm_try_get_ops(priv->chip)) { > > - ret = -EPIPE; > > - goto out; > > - } > > - > > priv->response_length = 0; > > priv->response_read = false; > > *off = 0; > > @@ -211,11 +202,19 @@ ssize_t tpm_common_write(struct file *file, > > const char > > __user *buf, > > if (file->f_flags & O_NONBLOCK) { > > priv->command_enqueued = true; > > queue_work(tpm_dev_wq, &priv->async_work); > > - tpm_put_ops(priv->chip); > > mutex_unlock(&priv->buffer_mutex); > > return size; > > } > > > > + /* atomic tpm command send and result receive. We only > > hold the ops > > + * lock during this period so that the tpm can be > > unregistered even if > > + * the char dev is held open. > > + */ > > + if (tpm_try_get_ops(priv->chip)) { > > + ret = -EPIPE; > > + goto out; > > + } > > + > > ret = tpm_dev_transmit(priv->chip, priv->space, priv- > > > data_buffer, > > > > sizeof(priv->data_buffer)); > > tpm_put_ops(priv->chip);
On 5/26/20 12:14 PM, James Bottomley wrote: > + /* atomic tpm command send and result receive. We only hold the ops > + * lock during this period so that the tpm can be unregistered even if > + * the char dev is held open. > + */ > + if (tpm_try_get_ops(priv->chip)) { > + ret = -EPIPE; > + goto out; > + } > + Hi James, This won't help if the message is read by an async tcti. If the problem lies in the chip get locality code, perhaps this could help to debug the root-cause instead of masking it out in the upper layer code: diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c index 2435216bd10a..da5ecd0376bf 100644 --- a/drivers/char/tpm/tpm_tis_core.c +++ b/drivers/char/tpm/tpm_tis_core.c @@ -202,20 +202,22 @@ static int request_locality(struct tpm_chip *chip, int l) return rc; stop = jiffies + chip->timeout_a; + timeout = stop - jiffies; if (chip->flags & TPM_CHIP_FLAG_IRQ) { again: timeout = stop - jiffies; if ((long)timeout <= 0) - return -1; + goto out; + rc = wait_event_interruptible_timeout(priv->int_queue, - (check_locality - (chip, l)), + check_locality(chip, l), timeout); if (rc > 0) return l; if (rc == -ERESTARTSYS && freezing(current)) { clear_thread_flag(TIF_SIGPENDING); + timeout = stop - jiffies; goto again; } } else { @@ -226,6 +228,10 @@ static int request_locality(struct tpm_chip *chip, int l) tpm_msleep(TPM_TIMEOUT); } while (time_before(jiffies, stop)); } +out: + dev_warn(&chip->dev, "%s: failed to request locality %d after %lu ms\n", + __func__, l, timeout * HZ / 1000); + return -1; }
On Tue, 2020-05-26 at 12:39 -0700, Tadeusz Struk wrote: > On 5/26/20 12:14 PM, James Bottomley wrote: > > + /* atomic tpm command send and result receive. We only > > hold the ops > > + * lock during this period so that the tpm can be > > unregistered even if > > + * the char dev is held open. > > + */ > > + if (tpm_try_get_ops(priv->chip)) { > > + ret = -EPIPE; > > + goto out; > > + } > > + > > Hi James, > This won't help if the message is read by an async tcti. Why not? It moves the ops get underneath the async path, so it's now all done in the direct read or the async read. That seems to be more efficient. > If the problem lies in the chip get locality code, perhaps this > could help to debug the root-cause instead of masking it out in the > upper layer code: I don't think there is a root cause other than a TIS TPM is getting annoyed by us cycling localities too rapidly because we don't do an actual TPM operation between request and relinquish. Since the first request/relinquish seems unnecessary for the async case, moving the ops get eliminates the problem. James
On 5/26/20 1:00 PM, James Bottomley wrote: > I don't think there is a root cause other than a TIS TPM is getting > annoyed by us cycling localities too rapidly because we don't do an > actual TPM operation between request and relinquish. Since the first > request/relinquish seems unnecessary for the async case, moving the ops > get eliminates the problem. Could be, so maybe we could try both patches. More debug info on the error path won't hurt. Thanks, Tadeusz
On Tue, 2020-05-26 at 12:38 -0700, James Bottomley wrote: > On Tue, 2020-05-26 at 19:23 +0000, Mario.Limonciello@dell.com wrote: > > > On Tue, 2020-05-26 at 13:32 -0500, Mario Limonciello wrote: > > > > This reverts commit d23d12484307b40eea549b8a858f5fffad913897. > > > > > > > > This commit has caused regressions for the XPS 9560 containing > > > > a Nuvoton TPM. > > > > > > Presumably this is using the tis driver? > > > > Correct. > > > > > > As mentioned by the reporter all TPM2 commands are failing > > > > with: > > > > ERROR:tcti:src/tss2-tcti/tcti- > > > > device.c:290:tcti_device_receive() > > > > Failed to read response from fd 3, got errno 1: Operation not > > > > permitted > > > > > > > > The reporter bisected this issue back to this commit which was > > > > backported to stable as commit 4d6ebc4. > > > > > > I think the problem is request_locality ... for some inexplicable > > > reason a failure there returns -1, which is EPERM to user space. > > > > > > That seems to be a bug in the async code since everything else > > > gives a ESPIPE error if tpm_try_get_ops fails ... at least no-one > > > assumes it gives back a sensible return code. > > > > > > What I think is happening is that with the patch the TPM goes > > > through a quick sequence of request, relinquish, request, > > > relinquish and it's the third request which is failing (likely > > > timing out). Without the patch, the patch there's only one > > > request,relinquish cycle because the ops are held while the async > > > work is executed. I have a vague recollection that there is a > > > problem with too many locality request in quick succession, but > > > I'll defer to Jason, who I think understands the intricacies of > > > localities better than I do. > > > > Thanks, I don't pretend to understand the nuances of this > > particular > > code, but I was hoping that the request to revert got some > > attention > > since Alex's kernel Bugzilla and message a few months ago to linux > > integrity weren't. > > > > > If that's the problem, the solution looks simple enough: just > > > move > > > the ops get down because the priv state is already protected by > > > the > > > buffer mutex > > > > Yeah, if that works for Alex's situation it certainly sounds like a > > better solution than reverting this patch as this patch actually > > does > > fix a problem reported by Jeffrin originally. > > > > Could you propose a specific patch that Alex and Jeffrin can > > perhaps > > both try? > > Um, what's wrong with the one I originally attached and which you > quote > below? It's only compile tested, but I think it will work, if the > theory is correct. > > James > > > > James > > > > > > --- > > > > > > diff --git a/drivers/char/tpm/tpm-dev-common.c > > > b/drivers/char/tpm/tpm-dev- > > > common.c > > > index 87f449340202..1784530b8387 100644 > > > --- a/drivers/char/tpm/tpm-dev-common.c > > > +++ b/drivers/char/tpm/tpm-dev-common.c > > > @@ -189,15 +189,6 @@ ssize_t tpm_common_write(struct file *file, > > > const char > > > __user *buf, > > > goto out; > > > } > > > > > > - /* atomic tpm command send and result receive. We only > > > hold the ops > > > - * lock during this period so that the tpm can be > > > unregistered even if > > > - * the char dev is held open. > > > - */ > > > - if (tpm_try_get_ops(priv->chip)) { > > > - ret = -EPIPE; > > > - goto out; > > > - } > > > - > > > priv->response_length = 0; > > > priv->response_read = false; > > > *off = 0; > > > @@ -211,11 +202,19 @@ ssize_t tpm_common_write(struct file *file, > > > const char > > > __user *buf, > > > if (file->f_flags & O_NONBLOCK) { > > > priv->command_enqueued = true; > > > queue_work(tpm_dev_wq, &priv->async_work); > > > - tpm_put_ops(priv->chip); > > > mutex_unlock(&priv->buffer_mutex); > > > return size; > > > } > > > > > > + /* atomic tpm command send and result receive. We only > > > hold the ops > > > + * lock during this period so that the tpm can be > > > unregistered even if > > > + * the char dev is held open. > > > + */ > > > + if (tpm_try_get_ops(priv->chip)) { > > > + ret = -EPIPE; > > > + goto out; > > > + } > > > + > > > ret = tpm_dev_transmit(priv->chip, priv->space, priv- > > > > data_buffer, > > > > > > sizeof(priv->data_buffer)); > > > tpm_put_ops(priv->chip); When using your patch, I get a hang when trying to use tpm2_getcap, and dmesg shows some info. [ 570.913779] BUG: unable to handle page fault for address: ffffb20001247000 [ 570.913782] #PF: supervisor write access in kernel mode [ 570.913783] #PF: error_code(0x0002) - not-present page [ 570.913784] PGD 0 P4D 0 [ 570.913785] Oops: 0002 [#3] SMP PTI [ 570.913787] CPU: 6 PID: 24744 Comm: tpm2_getcap Tainted: G UD 5.7.0-rc7+ #31 [ 570.913788] Hardware name: Dell Inc. XPS 15 9560/05FFDN, BIOS 1.18.0 11/17/2019 [ 570.913791] RIP: 0010:iowrite8+0x9/0x50 [ 570.913792] Code: 48 c7 c2 40 43 9f 99 48 89 04 24 e8 14 a7 90 ff 0f 0b 48 8b 04 24 48 83 c4 08 c3 66 0f 1f 44 00 00 48 81 fe ff ff 03 00 76 04 <40> 88 3e c3 48 81 fe 00 00 01 00 76 07 0f b7 d6 89 f8 ee c3 8b 05 [ 570.913793] RSP: 0018:ffffb1ff049d7db0 EFLAGS: 00010292 [ 570.913794] RAX: ffffffff981bf520 RBX: ffffb1ff049d7df9 RCX: ffffb1ff049d7df8 [ 570.913795] RDX: 0000000000000001 RSI: ffffb20001247000 RDI: 0000000000000020 [ 570.913796] RBP: ffffb1ff049d7df9 R08: 0000000000000000 R09: ffff8b80de5370f0 [ 570.913797] R10: 0000000000b71b00 R11: 000000000000028f R12: ffff8b80b148cda8 [ 570.913797] R13: 00000000fffff000 R14: ffff8b80b148cda8 R15: ffff8b80cb44a0ba [ 570.913799] FS: 00007f78f7cd0d80(0000) GS:ffff8b80de500000(0000) knlGS:0000000000000000 [ 570.913799] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 570.913800] CR2: ffffb20001247000 CR3: 0000000795618001 CR4: 00000000003606e0 [ 570.913801] Call Trace: [ 570.913803] tpm_tcg_write_bytes+0x2f/0x40 [ 570.913805] release_locality+0x49/0x220 [ 570.913807] tpm_relinquish_locality+0x1f/0x40 [ 570.913808] tpm_chip_stop+0x21/0x40 [ 570.913810] tpm_put_ops+0x9/0x30 [ 570.913811] tpm_common_write+0x179/0x190 [ 570.913813] vfs_write+0xb1/0x1a0 [ 570.913815] ksys_write+0x5a/0xd0 [ 570.913816] do_syscall_64+0x43/0x130 [ 570.913819] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 570.913820] RIP: 0033:0x7f78f7e00123 [ 570.913821] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 48 89 54 24 18 [ 570.913822] RSP: 002b:00007fff724e8c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 570.913823] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f78f7e00123 [ 570.913824] RDX: 0000000000000016 RSI: 0000564cf24a7220 RDI: 0000000000000003 [ 570.913825] RBP: 0000000000000016 R08: 00007f78f7ccc785 R09: 00007f78f7ccca40 [ 570.913826] R10: 00007fff724e8b10 R11: 0000000000000246 R12: 0000564cf24a7220 [ 570.913826] R13: 0000000000000000 R14: 0000000000000016 R15: 00007f78f7ccc890 [ 570.913827] Modules linked in: squashfs rtsx_pci_sdmmc x86_pkg_temp_thermal coretemp rtsx_pci mfd_core [ 570.913831] CR2: ffffb20001247000 [ 570.913832] ---[ end trace c84437b00f0d01a0 ]--- [ 570.913833] RIP: 0010:iowrite8+0x9/0x50 [ 570.913834] Code: 48 c7 c2 40 43 9f 99 48 89 04 24 e8 14 a7 90 ff 0f 0b 48 8b 04 24 48 83 c4 08 c3 66 0f 1f 44 00 00 48 81 fe ff ff 03 00 76 04 <40> 88 3e c3 48 81 fe 00 00 01 00 76 07 0f b7 d6 89 f8 ee c3 8b 05 [ 570.913835] RSP: 0018:ffffb1ff030b7db0 EFLAGS: 00010292 [ 570.913836] RAX: ffffffff981bf520 RBX: ffffb1ff030b7df9 RCX: ffffb1ff030b7df8 [ 570.913837] RDX: 0000000000000001 RSI: ffffb20001247000 RDI: 0000000000000020 [ 570.913837] RBP: ffffb1ff030b7df9 R08: 0000000000000000 R09: ffff8b80de2370f0 [ 570.913838] R10: 0000000000b71b00 R11: 000000000000019c R12: ffff8b80b148cda8 [ 570.913839] R13: 00000000fffff000 R14: ffff8b80b148cda8 R15: ffff8b80c4cfc0ba [ 570.913840] FS: 00007f78f7cd0d80(0000) GS:ffff8b80de500000(0000) knlGS:0000000000000000 [ 570.913840] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 570.913841] CR2: ffffb20001247000 CR3: 0000000795618001 CR4: 00000000003606e0
On Tue, 2020-05-26 at 14:33 -0700, Tadeusz Struk wrote: > On 5/26/20 1:00 PM, James Bottomley wrote: > > I don't think there is a root cause other than a TIS TPM is getting > > annoyed by us cycling localities too rapidly because we don't do an > > actual TPM operation between request and relinquish. Since the > > first > > request/relinquish seems unnecessary for the async case, moving the > > ops > > get eliminates the problem. > > Could be, so maybe we could try both patches. > More debug info on the error path won't hurt. > Thanks, > Tadeusz With your logging patch, I consistently see this message in dmesg when tpm2_getcap fails: tpm tpm0: request_locality: failed to request locality 0 after 750 ms
On Tue, 2020-05-26 at 15:19 -0700, Alex Guzman wrote: [...] > When using your patch, I get a hang when trying to use tpm2_getcap, > and dmesg shows some info. Are you sure it's all applied? This > [ 570.913803] tpm_tcg_write_bytes+0x2f/0x40 > [ 570.913805] release_locality+0x49/0x220 > [ 570.913807] tpm_relinquish_locality+0x1f/0x40 > [ 570.913808] tpm_chip_stop+0x21/0x40 > [ 570.913810] tpm_put_ops+0x9/0x30 > [ 570.913811] tpm_common_write+0x179/0x190 > [ 570.913813] vfs_write+0xb1/0x1a0 Implies an unmatched tpm_put_ops() in the async write path, as though this hunk: > @@ -211,11 +202,19 @@ ssize_t tpm_common_write(struct file *file, > const char __user *buf, > if (file->f_flags & O_NONBLOCK) { > priv->command_enqueued = true; > queue_work(tpm_dev_wq, &priv->async_work); > - tpm_put_ops(priv->chip); > mutex_unlock(&priv->buffer_mutex); > return size; > } Is missing. I actually booted the patch in my TPM based VM and it all seems to work OK when I execute tpm2_getcap (I verified it's using O_NONBLOCK) and tssgetcapability in sync mode. James
On Tue, May 26, 2020 at 4:06 PM James Bottomley <James.Bottomley@hansenpartnership.com> wrote: > > On Tue, 2020-05-26 at 15:19 -0700, Alex Guzman wrote: > [...] > > When using your patch, I get a hang when trying to use tpm2_getcap, > > and dmesg shows some info. > > Are you sure it's all applied? This > > > [ 570.913803] tpm_tcg_write_bytes+0x2f/0x40 > > [ 570.913805] release_locality+0x49/0x220 > > [ 570.913807] tpm_relinquish_locality+0x1f/0x40 > > [ 570.913808] tpm_chip_stop+0x21/0x40 > > [ 570.913810] tpm_put_ops+0x9/0x30 > > [ 570.913811] tpm_common_write+0x179/0x190 > > [ 570.913813] vfs_write+0xb1/0x1a0 > > Implies an unmatched tpm_put_ops() in the async write path, as though > this hunk: > > > @@ -211,11 +202,19 @@ ssize_t tpm_common_write(struct file *file, > > const char __user *buf, > > if (file->f_flags & O_NONBLOCK) { > > priv->command_enqueued = true; > > queue_work(tpm_dev_wq, &priv->async_work); > > - tpm_put_ops(priv->chip); > > mutex_unlock(&priv->buffer_mutex); > > return size; > > } > > Is missing. I actually booted the patch in my TPM based VM and it all > seems to work OK when I execute tpm2_getcap (I verified it's using > O_NONBLOCK) and tssgetcapability in sync mode. > > James > Oh, I did miss that bit. The patch had issues applying for some reason and I missed the single-line removal when I was looking at the diff. I gave it a spin on my machine again. getcap seems to work correctly with and without having the async config flag set for tpm2-tss. The pkcs11 plugin seems to work correctly again too. :)
On Tue, 2020-05-26 at 16:31 -0700, Alex Guzman wrote: > On Tue, May 26, 2020 at 4:06 PM James Bottomley > <James.Bottomley@hansenpartnership.com> wrote: > > > > On Tue, 2020-05-26 at 15:19 -0700, Alex Guzman wrote: > > [...] > > > When using your patch, I get a hang when trying to use > > > tpm2_getcap, and dmesg shows some info. > > > > Are you sure it's all applied? This > > > > > [ 570.913803] tpm_tcg_write_bytes+0x2f/0x40 > > > [ 570.913805] release_locality+0x49/0x220 > > > [ 570.913807] tpm_relinquish_locality+0x1f/0x40 > > > [ 570.913808] tpm_chip_stop+0x21/0x40 > > > [ 570.913810] tpm_put_ops+0x9/0x30 > > > [ 570.913811] tpm_common_write+0x179/0x190 > > > [ 570.913813] vfs_write+0xb1/0x1a0 > > > > Implies an unmatched tpm_put_ops() in the async write path, as > > though this hunk: > > > > > @@ -211,11 +202,19 @@ ssize_t tpm_common_write(struct file *file, > > > const char __user *buf, > > > if (file->f_flags & O_NONBLOCK) { > > > priv->command_enqueued = true; > > > queue_work(tpm_dev_wq, &priv->async_work); > > > - tpm_put_ops(priv->chip); > > > mutex_unlock(&priv->buffer_mutex); > > > return size; > > > } > > > > Is missing. I actually booted the patch in my TPM based VM and it > > all seems to work OK when I execute tpm2_getcap (I verified it's > > using O_NONBLOCK) and tssgetcapability in sync mode. > > > > James > > > > Oh, I did miss that bit. The patch had issues applying for some > reason and I missed the single-line removal when I was looking at the > diff. Sorry, that's likely my fault: I did it on top of my current TPM tree. I'll prepare a version against the vanilla kernel with a real changelog. > I gave it a spin on my machine again. getcap seems to work correctly > with and without having the async config flag set for tpm2-tss. The > pkcs11 plugin seems to work correctly again too. :) Great, thanks! I'll add your tested-by to the above. James
On Tue, 2020-05-26 at 12:38 -0700, James Bottomley wrote: > On Tue, 2020-05-26 at 19:23 +0000, Mario.Limonciello@dell.com wrote: > > > On Tue, 2020-05-26 at 13:32 -0500, Mario Limonciello wrote: > > > > This reverts commit d23d12484307b40eea549b8a858f5fffad913897. > > > > > > > > This commit has caused regressions for the XPS 9560 containing > > > > a Nuvoton TPM. > > > > > > Presumably this is using the tis driver? > > > > Correct. > > > > > > As mentioned by the reporter all TPM2 commands are failing with: > > > > ERROR:tcti:src/tss2-tcti/tcti- > > > > device.c:290:tcti_device_receive() > > > > Failed to read response from fd 3, got errno 1: Operation not > > > > permitted > > > > > > > > The reporter bisected this issue back to this commit which was > > > > backported to stable as commit 4d6ebc4. > > > > > > I think the problem is request_locality ... for some inexplicable > > > reason a failure there returns -1, which is EPERM to user space. > > > > > > That seems to be a bug in the async code since everything else > > > gives a ESPIPE error if tpm_try_get_ops fails ... at least no-one > > > assumes it gives back a sensible return code. > > > > > > What I think is happening is that with the patch the TPM goes > > > through a quick sequence of request, relinquish, request, > > > relinquish and it's the third request which is failing (likely > > > timing out). Without the patch, the patch there's only one > > > request,relinquish cycle because the ops are held while the async > > > work is executed. I have a vague recollection that there is a > > > problem with too many locality request in quick succession, but > > > I'll defer to Jason, who I think understands the intricacies of > > > localities better than I do. > > > > Thanks, I don't pretend to understand the nuances of this particular > > code, but I was hoping that the request to revert got some attention > > since Alex's kernel Bugzilla and message a few months ago to linux > > integrity weren't. > > > > > If that's the problem, the solution looks simple enough: just move > > > the ops get down because the priv state is already protected by the > > > buffer mutex > > > > Yeah, if that works for Alex's situation it certainly sounds like a > > better solution than reverting this patch as this patch actually does > > fix a problem reported by Jeffrin originally. > > > > Could you propose a specific patch that Alex and Jeffrin can perhaps > > both try? > > Um, what's wrong with the one I originally attached and which you quote > below? It's only compile tested, but I think it will work, if the > theory is correct. Please send a legit patch, thanks. /Jarkko
On Tue, 2020-05-26 at 19:23 +0000, Mario.Limonciello@dell.com wrote: > Thanks, I don't pretend to understand the nuances of this particular code, > but I was hoping that the request to revert got some attention since Alex's > kernel Bugzilla and message a few months ago to linux integrity weren't. Removing linux-kernel from CC since this subsystem internal discussion. Seeing the whole thing first time today. Bugzilla is the first thing to ignore when busy. It is good as place holder for bugs, but all discussions should happen only in LKML. There's no official requirement to proactively use Bugzilla for anything. That said I'm happy that people put stuff there so that it gets logged. For follow-up's use only LKML if it is important to you. Those will get processed. As far as this goes, if nothing is heard from me, check that you put me as CC to the original email. Otherwise, I might have missed it (by mistake, not by purpose). Honestly, I'm not sure what point was this patch when there was time to wait for months without response. Why the passivity for all this time? /Jarkko
> -----Original Message----- > From: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> > Sent: Wednesday, May 27, 2020 3:09 PM > To: James Bottomley; Limonciello, Mario; peterhuewe@gmx.de; jgg@ziepe.ca > Cc: arnd@arndb.de; gregkh@linuxfoundation.org; linux-integrity@vger.kernel.org; > linux-kernel@vger.kernel.org; jeffrin@rajagiritech.edu.in; alex@guzman.io > Subject: Re: [PATCH] tpm: Revert "tpm: fix invalid locking in NONBLOCKING mode" > > > [EXTERNAL EMAIL] > > On Tue, 2020-05-26 at 12:38 -0700, James Bottomley wrote: > > On Tue, 2020-05-26 at 19:23 +0000, Mario.Limonciello@dell.com wrote: > > > > On Tue, 2020-05-26 at 13:32 -0500, Mario Limonciello wrote: > > > > > This reverts commit d23d12484307b40eea549b8a858f5fffad913897. > > > > > > > > > > This commit has caused regressions for the XPS 9560 containing > > > > > a Nuvoton TPM. > > > > > > > > Presumably this is using the tis driver? > > > > > > Correct. > > > > > > > > As mentioned by the reporter all TPM2 commands are failing with: > > > > > ERROR:tcti:src/tss2-tcti/tcti- > > > > > device.c:290:tcti_device_receive() > > > > > Failed to read response from fd 3, got errno 1: Operation not > > > > > permitted > > > > > > > > > > The reporter bisected this issue back to this commit which was > > > > > backported to stable as commit 4d6ebc4. > > > > > > > > I think the problem is request_locality ... for some inexplicable > > > > reason a failure there returns -1, which is EPERM to user space. > > > > > > > > That seems to be a bug in the async code since everything else > > > > gives a ESPIPE error if tpm_try_get_ops fails ... at least no-one > > > > assumes it gives back a sensible return code. > > > > > > > > What I think is happening is that with the patch the TPM goes > > > > through a quick sequence of request, relinquish, request, > > > > relinquish and it's the third request which is failing (likely > > > > timing out). Without the patch, the patch there's only one > > > > request,relinquish cycle because the ops are held while the async > > > > work is executed. I have a vague recollection that there is a > > > > problem with too many locality request in quick succession, but > > > > I'll defer to Jason, who I think understands the intricacies of > > > > localities better than I do. > > > > > > Thanks, I don't pretend to understand the nuances of this particular > > > code, but I was hoping that the request to revert got some attention > > > since Alex's kernel Bugzilla and message a few months ago to linux > > > integrity weren't. > > > > > > > If that's the problem, the solution looks simple enough: just move > > > > the ops get down because the priv state is already protected by the > > > > buffer mutex > > > > > > Yeah, if that works for Alex's situation it certainly sounds like a > > > better solution than reverting this patch as this patch actually does > > > fix a problem reported by Jeffrin originally. > > > > > > Could you propose a specific patch that Alex and Jeffrin can perhaps > > > both try? > > > > Um, what's wrong with the one I originally attached and which you quote > > below? It's only compile tested, but I think it will work, if the > > theory is correct. > > Please send a legit patch, thanks. > > /Jarkko Jarkko, After the confirmation from Alex that this patch attached to the end of the thread worked, James did send a proper patch that can be accessed here: https://lore.kernel.org/linux-integrity/20200527155800.ya43xm2ltuwduwjg@cantor/T/#t Thanks,
On Tue, May 26, 2020 at 12:39:37PM -0700, Tadeusz Struk wrote: > On 5/26/20 12:14 PM, James Bottomley wrote: > > + /* atomic tpm command send and result receive. We only hold the ops > > + * lock during this period so that the tpm can be unregistered even if > > + * the char dev is held open. > > + */ > > + if (tpm_try_get_ops(priv->chip)) { > > + ret = -EPIPE; > > + goto out; > > + } > > + > Hi James, > This won't help if the message is read by an async tcti. If the problem lies > in the chip get locality code, perhaps this could help to debug the root-cause > instead of masking it out in the upper layer code: What is TCTI and async TCTI? Not following. > diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c > index 2435216bd10a..da5ecd0376bf 100644 > --- a/drivers/char/tpm/tpm_tis_core.c > +++ b/drivers/char/tpm/tpm_tis_core.c > @@ -202,20 +202,22 @@ static int request_locality(struct tpm_chip *chip, int l) > return rc; > > stop = jiffies + chip->timeout_a; > + timeout = stop - jiffies; > > if (chip->flags & TPM_CHIP_FLAG_IRQ) { > again: > timeout = stop - jiffies; > if ((long)timeout <= 0) > - return -1; > + goto out; > + > rc = wait_event_interruptible_timeout(priv->int_queue, > - (check_locality > - (chip, l)), > + check_locality(chip, l), > timeout); > if (rc > 0) > return l; > if (rc == -ERESTARTSYS && freezing(current)) { > clear_thread_flag(TIF_SIGPENDING); > + timeout = stop - jiffies; > goto again; > } > } else { > @@ -226,6 +228,10 @@ static int request_locality(struct tpm_chip *chip, int l) > tpm_msleep(TPM_TIMEOUT); > } while (time_before(jiffies, stop)); > } > +out: > + dev_warn(&chip->dev, "%s: failed to request locality %d after %lu ms\n", > + __func__, l, timeout * HZ / 1000); > + > return -1; > } > /Jarkko
On Wed, May 27, 2020 at 08:18:56PM +0000, Mario.Limonciello@dell.com wrote: > > -----Original Message----- > > From: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com> > > Sent: Wednesday, May 27, 2020 3:09 PM > > To: James Bottomley; Limonciello, Mario; peterhuewe@gmx.de; jgg@ziepe.ca > > Cc: arnd@arndb.de; gregkh@linuxfoundation.org; linux-integrity@vger.kernel.org; > > linux-kernel@vger.kernel.org; jeffrin@rajagiritech.edu.in; alex@guzman.io > > Subject: Re: [PATCH] tpm: Revert "tpm: fix invalid locking in NONBLOCKING mode" > > > > > > [EXTERNAL EMAIL] What is this? > > On Tue, 2020-05-26 at 12:38 -0700, James Bottomley wrote: > > > On Tue, 2020-05-26 at 19:23 +0000, Mario.Limonciello@dell.com wrote: > > > > > On Tue, 2020-05-26 at 13:32 -0500, Mario Limonciello wrote: > > > > > > This reverts commit d23d12484307b40eea549b8a858f5fffad913897. > > > > > > > > > > > > This commit has caused regressions for the XPS 9560 containing > > > > > > a Nuvoton TPM. > > > > > > > > > > Presumably this is using the tis driver? > > > > > > > > Correct. > > > > > > > > > > As mentioned by the reporter all TPM2 commands are failing with: > > > > > > ERROR:tcti:src/tss2-tcti/tcti- > > > > > > device.c:290:tcti_device_receive() > > > > > > Failed to read response from fd 3, got errno 1: Operation not > > > > > > permitted > > > > > > > > > > > > The reporter bisected this issue back to this commit which was > > > > > > backported to stable as commit 4d6ebc4. > > > > > > > > > > I think the problem is request_locality ... for some inexplicable > > > > > reason a failure there returns -1, which is EPERM to user space. > > > > > > > > > > That seems to be a bug in the async code since everything else > > > > > gives a ESPIPE error if tpm_try_get_ops fails ... at least no-one > > > > > assumes it gives back a sensible return code. > > > > > > > > > > What I think is happening is that with the patch the TPM goes > > > > > through a quick sequence of request, relinquish, request, > > > > > relinquish and it's the third request which is failing (likely > > > > > timing out). Without the patch, the patch there's only one > > > > > request,relinquish cycle because the ops are held while the async > > > > > work is executed. I have a vague recollection that there is a > > > > > problem with too many locality request in quick succession, but > > > > > I'll defer to Jason, who I think understands the intricacies of > > > > > localities better than I do. > > > > > > > > Thanks, I don't pretend to understand the nuances of this particular > > > > code, but I was hoping that the request to revert got some attention > > > > since Alex's kernel Bugzilla and message a few months ago to linux > > > > integrity weren't. > > > > > > > > > If that's the problem, the solution looks simple enough: just move > > > > > the ops get down because the priv state is already protected by the > > > > > buffer mutex > > > > > > > > Yeah, if that works for Alex's situation it certainly sounds like a > > > > better solution than reverting this patch as this patch actually does > > > > fix a problem reported by Jeffrin originally. > > > > > > > > Could you propose a specific patch that Alex and Jeffrin can perhaps > > > > both try? > > > > > > Um, what's wrong with the one I originally attached and which you quote > > > below? It's only compile tested, but I think it will work, if the > > > theory is correct. > > > > Please send a legit patch, thanks. > > > > /Jarkko > > Jarkko, > > After the confirmation from Alex that this patch attached to the end of the thread > worked, James did send a proper patch that can be accessed here: > https://lore.kernel.org/linux-integrity/20200527155800.ya43xm2ltuwduwjg@cantor/T/#t > > Thanks, Hi thanks a lot! I did read the full discussions and agree with the conclusions as I get a patch in proper form. Please ping next time a bit earlier. It's not that I don't want to deal with the issues quickly as possible. It's probably just that I've forgot something or missed. /Jarkko
> > > [EXTERNAL EMAIL] > > What is this? Something my employer's mail system automatically tags in external email. My mistakes in forgetting to remove it on the response. > > > > On Tue, 2020-05-26 at 12:38 -0700, James Bottomley wrote: > > > > On Tue, 2020-05-26 at 19:23 +0000, Mario.Limonciello@dell.com wrote: > > > > > > On Tue, 2020-05-26 at 13:32 -0500, Mario Limonciello wrote: > > > > > > > This reverts commit d23d12484307b40eea549b8a858f5fffad913897. > > > > > > > > > > > > > > This commit has caused regressions for the XPS 9560 containing > > > > > > > a Nuvoton TPM. > > > > > > > > > > > > Presumably this is using the tis driver? > > > > > > > > > > Correct. > > > > > > > > > > > > As mentioned by the reporter all TPM2 commands are failing with: > > > > > > > ERROR:tcti:src/tss2-tcti/tcti- > > > > > > > device.c:290:tcti_device_receive() > > > > > > > Failed to read response from fd 3, got errno 1: Operation not > > > > > > > permitted > > > > > > > > > > > > > > The reporter bisected this issue back to this commit which was > > > > > > > backported to stable as commit 4d6ebc4. > > > > > > > > > > > > I think the problem is request_locality ... for some inexplicable > > > > > > reason a failure there returns -1, which is EPERM to user space. > > > > > > > > > > > > That seems to be a bug in the async code since everything else > > > > > > gives a ESPIPE error if tpm_try_get_ops fails ... at least no-one > > > > > > assumes it gives back a sensible return code. > > > > > > > > > > > > What I think is happening is that with the patch the TPM goes > > > > > > through a quick sequence of request, relinquish, request, > > > > > > relinquish and it's the third request which is failing (likely > > > > > > timing out). Without the patch, the patch there's only one > > > > > > request,relinquish cycle because the ops are held while the async > > > > > > work is executed. I have a vague recollection that there is a > > > > > > problem with too many locality request in quick succession, but > > > > > > I'll defer to Jason, who I think understands the intricacies of > > > > > > localities better than I do. > > > > > > > > > > Thanks, I don't pretend to understand the nuances of this particular > > > > > code, but I was hoping that the request to revert got some attention > > > > > since Alex's kernel Bugzilla and message a few months ago to linux > > > > > integrity weren't. > > > > > > > > > > > If that's the problem, the solution looks simple enough: just move > > > > > > the ops get down because the priv state is already protected by the > > > > > > buffer mutex > > > > > > > > > > Yeah, if that works for Alex's situation it certainly sounds like a > > > > > better solution than reverting this patch as this patch actually does > > > > > fix a problem reported by Jeffrin originally. > > > > > > > > > > Could you propose a specific patch that Alex and Jeffrin can perhaps > > > > > both try? > > > > > > > > Um, what's wrong with the one I originally attached and which you quote > > > > below? It's only compile tested, but I think it will work, if the > > > > theory is correct. > > > > > > Please send a legit patch, thanks. > > > > > > /Jarkko > > > > Jarkko, > > > > After the confirmation from Alex that this patch attached to the end of the > thread > > worked, James did send a proper patch that can be accessed here: > > https://lore.kernel.org/linux- > integrity/20200527155800.ya43xm2ltuwduwjg@cantor/T/#t > > > > Thanks, > > Hi thanks a lot! I did read the full discussions and agree with the > conclusions as I get a patch in proper form. > > Please ping next time a bit earlier. It's not that I don't want to deal > with the issues quickly as possible. It's probably just that I've forgot > something or missed. > > /Jarkko Thanks! I completely forgot about it too, it was mentioned to me right after holidays and I forgot to follow up and see that it got sorted.
On Wed, 2020-05-27 at 23:15 +0300, Jarkko Sakkinen wrote: > On Tue, 2020-05-26 at 19:23 +0000, Mario.Limonciello@dell.com wrote: > > Thanks, I don't pretend to understand the nuances of this > > particular code, > > but I was hoping that the request to revert got some attention > > since Alex's > > kernel Bugzilla and message a few months ago to linux integrity > > weren't. > > Removing linux-kernel from CC since this subsystem internal > discussion. > > Seeing the whole thing first time today. > > Bugzilla is the first thing to ignore when busy. It is good as place > holder for bugs, but all discussions should happen only in LKML. > There's > no official requirement to proactively use Bugzilla for anything. > > That said I'm happy that people put stuff there so that it gets > logged. > > For follow-up's use only LKML if it is important to you. Those will > get > processed. > > As far as this goes, if nothing is heard from me, check that you put > me > as CC to the original email. Otherwise, I might have missed it (by > mistake, > not by purpose). > > Honestly, I'm not sure what point was this patch when there was time > to > wait for months without response. Why the passivity for all this > time? > > /Jarkko > It largely went quiet because I didn't raise the issue in the mailing list again. I pinged back in February ( https://lore.kernel.org/linux-integrity/CAJ7-PMbujee92N1f9xVF8vtXgS49qpe7qHkeWh1Z0R-Rk-Jkaw@mail.gmail.com/ ) but the conversation died out and I was content to simply use the last working kernel version and see if the bug was resolved on its own. I raised the issue again on the bugtracker a few days ago, leading to this follow up here. :) - Alex
On 5/27/20 5:30 PM, Jarkko Sakkinen wrote: >> This won't help if the message is read by an async tcti. If the problem lies >> in the chip get locality code, perhaps this could help to debug the root-cause >> instead of masking it out in the upper layer code: > What is TCTI and async TCTI? Not following. TPM Command Transmission Interface (TCTI) as defined by TCG in https://trustedcomputinggroup.org/resource/tss-tcti-specification/ the reason we added the O_NONBLOCK mode was to satisfy the TCG spec for async TCTI. Thanks, Tadeusz
On Thu, May 28, 2020 at 12:59:59AM +0000, Mario.Limonciello@dell.com wrote: > > > > [EXTERNAL EMAIL] > > > > What is this? > > Something my employer's mail system automatically tags in external email. > > My mistakes in forgetting to remove it on the response. NP, just asking :-) > > > > On Tue, 2020-05-26 at 12:38 -0700, James Bottomley wrote: > > > > > On Tue, 2020-05-26 at 19:23 +0000, Mario.Limonciello@dell.com wrote: > > > > > > > On Tue, 2020-05-26 at 13:32 -0500, Mario Limonciello wrote: > > > > > > > > This reverts commit d23d12484307b40eea549b8a858f5fffad913897. > > > > > > > > > > > > > > > > This commit has caused regressions for the XPS 9560 containing > > > > > > > > a Nuvoton TPM. > > > > > > > > > > > > > > Presumably this is using the tis driver? > > > > > > > > > > > > Correct. > > > > > > > > > > > > > > As mentioned by the reporter all TPM2 commands are failing with: > > > > > > > > ERROR:tcti:src/tss2-tcti/tcti- > > > > > > > > device.c:290:tcti_device_receive() > > > > > > > > Failed to read response from fd 3, got errno 1: Operation not > > > > > > > > permitted > > > > > > > > > > > > > > > > The reporter bisected this issue back to this commit which was > > > > > > > > backported to stable as commit 4d6ebc4. > > > > > > > > > > > > > > I think the problem is request_locality ... for some inexplicable > > > > > > > reason a failure there returns -1, which is EPERM to user space. > > > > > > > > > > > > > > That seems to be a bug in the async code since everything else > > > > > > > gives a ESPIPE error if tpm_try_get_ops fails ... at least no-one > > > > > > > assumes it gives back a sensible return code. > > > > > > > > > > > > > > What I think is happening is that with the patch the TPM goes > > > > > > > through a quick sequence of request, relinquish, request, > > > > > > > relinquish and it's the third request which is failing (likely > > > > > > > timing out). Without the patch, the patch there's only one > > > > > > > request,relinquish cycle because the ops are held while the async > > > > > > > work is executed. I have a vague recollection that there is a > > > > > > > problem with too many locality request in quick succession, but > > > > > > > I'll defer to Jason, who I think understands the intricacies of > > > > > > > localities better than I do. > > > > > > > > > > > > Thanks, I don't pretend to understand the nuances of this particular > > > > > > code, but I was hoping that the request to revert got some attention > > > > > > since Alex's kernel Bugzilla and message a few months ago to linux > > > > > > integrity weren't. > > > > > > > > > > > > > If that's the problem, the solution looks simple enough: just move > > > > > > > the ops get down because the priv state is already protected by the > > > > > > > buffer mutex > > > > > > > > > > > > Yeah, if that works for Alex's situation it certainly sounds like a > > > > > > better solution than reverting this patch as this patch actually does > > > > > > fix a problem reported by Jeffrin originally. > > > > > > > > > > > > Could you propose a specific patch that Alex and Jeffrin can perhaps > > > > > > both try? > > > > > > > > > > Um, what's wrong with the one I originally attached and which you quote > > > > > below? It's only compile tested, but I think it will work, if the > > > > > theory is correct. > > > > > > > > Please send a legit patch, thanks. > > > > > > > > /Jarkko > > > > > > Jarkko, > > > > > > After the confirmation from Alex that this patch attached to the end of the > > thread > > > worked, James did send a proper patch that can be accessed here: > > > https://lore.kernel.org/linux- > > integrity/20200527155800.ya43xm2ltuwduwjg@cantor/T/#t > > > > > > Thanks, > > > > Hi thanks a lot! I did read the full discussions and agree with the > > conclusions as I get a patch in proper form. > > > > Please ping next time a bit earlier. It's not that I don't want to deal > > with the issues quickly as possible. It's probably just that I've forgot > > something or missed. > > > > /Jarkko > > Thanks! > > I completely forgot about it too, it was mentioned to me right after holidays > and I forgot to follow up and see that it got sorted. Yeah, sure, lets try to get a fix landed asap :-) /Jarkko
On Wed, May 27, 2020 at 06:10:02PM -0700, Alex Guzman wrote: > On Wed, 2020-05-27 at 23:15 +0300, Jarkko Sakkinen wrote: > > On Tue, 2020-05-26 at 19:23 +0000, Mario.Limonciello@dell.com wrote: > > > Thanks, I don't pretend to understand the nuances of this > > > particular code, > > > but I was hoping that the request to revert got some attention > > > since Alex's > > > kernel Bugzilla and message a few months ago to linux integrity > > > weren't. > > > > Removing linux-kernel from CC since this subsystem internal > > discussion. > > > > Seeing the whole thing first time today. > > > > Bugzilla is the first thing to ignore when busy. It is good as place > > holder for bugs, but all discussions should happen only in LKML. > > There's > > no official requirement to proactively use Bugzilla for anything. > > > > That said I'm happy that people put stuff there so that it gets > > logged. > > > > For follow-up's use only LKML if it is important to you. Those will > > get > > processed. > > > > As far as this goes, if nothing is heard from me, check that you put > > me > > as CC to the original email. Otherwise, I might have missed it (by > > mistake, > > not by purpose). > > > > Honestly, I'm not sure what point was this patch when there was time > > to > > wait for months without response. Why the passivity for all this > > time? > > > > /Jarkko > > > > It largely went quiet because I didn't raise the issue in the mailing > list again. I pinged back in February ( > https://lore.kernel.org/linux-integrity/CAJ7-PMbujee92N1f9xVF8vtXgS49qpe7qHkeWh1Z0R-Rk-Jkaw@mail.gmail.com/ > ) but the conversation died out and I was content to simply use the > last working kernel version and see if the bug was resolved on its own. > I raised the issue again on the bugtracker a few days ago, leading to > this follow up here. :) OK cool and thanks a lot for reporting this! /Jarkko
On Wed, May 27, 2020 at 09:40:25PM -0700, Tadeusz Struk wrote: > On 5/27/20 5:30 PM, Jarkko Sakkinen wrote: > >> This won't help if the message is read by an async tcti. If the problem lies > >> in the chip get locality code, perhaps this could help to debug the root-cause > >> instead of masking it out in the upper layer code: > > What is TCTI and async TCTI? Not following. > > TPM Command Transmission Interface (TCTI) as defined by TCG in > https://trustedcomputinggroup.org/resource/tss-tcti-specification/ > > the reason we added the O_NONBLOCK mode was to satisfy the TCG spec for async TCTI. > > Thanks, > Tadeusz OK, thanks recalling. /Jarkko
On Wed, 2020-05-27 at 18:10 -0700, Alex Guzman wrote: > On Wed, 2020-05-27 at 23:15 +0300, Jarkko Sakkinen wrote: > > On Tue, 2020-05-26 at 19:23 +0000, Mario.Limonciello@dell.com > > wrote: > > > Thanks, I don't pretend to understand the nuances of this > > > particular code, but I was hoping that the request to revert got > > > some attention since Alex's kernel Bugzilla and message a few > > > months ago to linux integrity weren't. > > > > Removing linux-kernel from CC since this subsystem internal > > discussion. > > > > Seeing the whole thing first time today. > > > > Bugzilla is the first thing to ignore when busy. It is good as > > place holder for bugs, but all discussions should happen only in > > LKML. There's no official requirement to proactively use Bugzilla > > for anything. > > > > That said I'm happy that people put stuff there so that it gets > > logged. > > > > For follow-up's use only LKML if it is important to you. Those will > > get processed. > > > > As far as this goes, if nothing is heard from me, check that you > > put me as CC to the original email. Otherwise, I might have missed > > it (by mistake, not by purpose). > > > > Honestly, I'm not sure what point was this patch when there was > > time to wait for months without response. Why the passivity for all > > this time? > > > > /Jarkko > > > > It largely went quiet because I didn't raise the issue in the mailing > list again. I pinged back in February ( > https://lore.kernel.org/linux-integrity/CAJ7- > PMbujee92N1f9xVF8vtXgS49qpe7qHkeWh1Z0R-Rk-Jkaw@mail.gmail.com/) but > the conversation died out and I was content to simply use the > last working kernel version and see if the bug was resolved on its > own. I think its just a state of knowledge problem: back in February I didn't know how unusual EPERM errors are in the TPM so the issue just flew by as a "this is a curious issue with an O_NONBLOCK path" thing, but thanks to some key stuff I've been doing I now do. So this time your EPERM struck me as "that's impossible surely" which is why I dug into the code to find out where it was coming from ... and sure enough, it was impossible: it was an untranslated failure return, but at least it accidentally told me exactly what the real error was. So the upshot is you got lucky this time around ... James > I raised the issue again on the bugtracker a few days ago, leading to > this follow up here. :)
diff --git a/drivers/char/tpm/tpm-dev-common.c b/drivers/char/tpm/tpm-dev-common.c index 87f449340202..bc9d7c7ddc01 100644 --- a/drivers/char/tpm/tpm-dev-common.c +++ b/drivers/char/tpm/tpm-dev-common.c @@ -61,12 +61,6 @@ static void tpm_dev_async_work(struct work_struct *work) mutex_lock(&priv->buffer_mutex); priv->command_enqueued = false; - ret = tpm_try_get_ops(priv->chip); - if (ret) { - priv->response_length = ret; - goto out; - } - ret = tpm_dev_transmit(priv->chip, priv->space, priv->data_buffer, sizeof(priv->data_buffer)); tpm_put_ops(priv->chip); @@ -74,7 +68,6 @@ static void tpm_dev_async_work(struct work_struct *work) priv->response_length = ret; mod_timer(&priv->user_read_timer, jiffies + (120 * HZ)); } -out: mutex_unlock(&priv->buffer_mutex); wake_up_interruptible(&priv->async_wait); } @@ -211,7 +204,6 @@ ssize_t tpm_common_write(struct file *file, const char __user *buf, if (file->f_flags & O_NONBLOCK) { priv->command_enqueued = true; queue_work(tpm_dev_wq, &priv->async_work); - tpm_put_ops(priv->chip); mutex_unlock(&priv->buffer_mutex); return size; }
This reverts commit d23d12484307b40eea549b8a858f5fffad913897. This commit has caused regressions for the XPS 9560 containing a Nuvoton TPM. As mentioned by the reporter all TPM2 commands are failing with: ERROR:tcti:src/tss2-tcti/tcti-device.c:290:tcti_device_receive() Failed to read response from fd 3, got errno 1: Operation not permitted The reporter bisected this issue back to this commit which was backported to stable as commit 4d6ebc4. Cc: Jeffrin Jose T <jeffrin@rajagiritech.edu.in> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=206275 Fixes: d23d124 ("tpm: fix invalid locking in NONBLOCKING mode") Reported-by: Alex Guzman <alex@guzman.io> Signed-off-by: Mario Limonciello <mario.limonciello@dell.com> --- drivers/char/tpm/tpm-dev-common.c | 8 -------- 1 file changed, 8 deletions(-)