tpm: Revert "tpm: fix invalid locking in NONBLOCKING mode"

Message ID	20200526183213.20720-1-mario.limonciello@dell.com (mailing list archive)
State	Rejected, archived
Headers	show Return-Path: <SRS0=pAxC=7I=vger.kernel.org=linux-integrity-owner@kernel.org> From: Mario Limonciello <mario.limonciello@dell.com> To: Peter Huewe <peterhuewe@gmx.de>, Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>, Jason Gunthorpe <jgg@ziepe.ca> Cc: Arnd Bergmann <arnd@arndb.de>, Greg Kroah-Hartman <gregkh@linuxfoundation.org>, linux-integrity@vger.kernel.org, linux-kernel@vger.kernel.org, Mario Limonciello <mario.limonciello@dell.com>, Jeffrin Jose T <jeffrin@rajagiritech.edu.in>, Alex Guzman <alex@guzman.io> Subject: [PATCH] tpm: Revert "tpm: fix invalid locking in NONBLOCKING mode" Date: Tue, 26 May 2020 13:32:13 -0500 Message-Id: <20200526183213.20720-1-mario.limonciello@dell.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: linux-integrity-owner@vger.kernel.org Precedence: bulk
Series	tpm: Revert "tpm: fix invalid locking in NONBLOCKING mode" \| expand tpm: Revert "tpm: fix invalid locking in NONBLOCKING mode"

Limonciello, Mario May 26, 2020, 6:32 p.m. UTC

This reverts commit d23d12484307b40eea549b8a858f5fffad913897.

This commit has caused regressions for the XPS 9560 containing
a Nuvoton TPM.

As mentioned by the reporter all TPM2 commands are failing with:
  ERROR:tcti:src/tss2-tcti/tcti-device.c:290:tcti_device_receive()
  Failed to read response from fd 3, got errno 1: Operation not permitted

The reporter bisected this issue back to this commit which was
backported to stable as commit 4d6ebc4.

Cc: Jeffrin Jose T <jeffrin@rajagiritech.edu.in>
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=206275
Fixes: d23d124 ("tpm: fix invalid locking in NONBLOCKING mode")
Reported-by: Alex Guzman <alex@guzman.io>
Signed-off-by: Mario Limonciello <mario.limonciello@dell.com>
---
 drivers/char/tpm/tpm-dev-common.c | 8 --------
 1 file changed, 8 deletions(-)

James Bottomley May 26, 2020, 7:14 p.m. UTC | #1

On Tue, 2020-05-26 at 13:32 -0500, Mario Limonciello wrote:
> This reverts commit d23d12484307b40eea549b8a858f5fffad913897.
> 
> This commit has caused regressions for the XPS 9560 containing
> a Nuvoton TPM.

Presumably this is using the tis driver?

> As mentioned by the reporter all TPM2 commands are failing with:
>   ERROR:tcti:src/tss2-tcti/tcti-device.c:290:tcti_device_receive()
>   Failed to read response from fd 3, got errno 1: Operation not
> permitted
> 
> The reporter bisected this issue back to this commit which was
> backported to stable as commit 4d6ebc4.

I think the problem is request_locality ... for some inexplicable
reason a failure there returns -1, which is EPERM to user space.

That seems to be a bug in the async code since everything else gives a
ESPIPE error if tpm_try_get_ops fails ... at least no-one assumes it
gives back a sensible return code.

What I think is happening is that with the patch the TPM goes through a
quick sequence of request, relinquish, request, relinquish and it's the
third request which is failing (likely timing out).  Without the patch,
the patch there's only one request,relinquish cycle because the ops are
held while the async work is executed.  I have a vague recollection
that there is a problem with too many locality request in quick
succession, but I'll defer to Jason, who I think understands the
intricacies of localities better than I do.

If that's the problem, the solution looks simple enough: just move the
ops get down because the priv state is already protected by the buffer
mutex

James

---

diff --git a/drivers/char/tpm/tpm-dev-common.c b/drivers/char/tpm/tpm-dev-common.c
index 87f449340202..1784530b8387 100644
--- a/drivers/char/tpm/tpm-dev-common.c
+++ b/drivers/char/tpm/tpm-dev-common.c
@@ -189,15 +189,6 @@ ssize_t tpm_common_write(struct file *file, const char __user *buf,
 		goto out;
 	}

-	/* atomic tpm command send and result receive. We only hold the ops
-	 * lock during this period so that the tpm can be unregistered even if
-	 * the char dev is held open.
-	 */
-	if (tpm_try_get_ops(priv->chip)) {
-		ret = -EPIPE;
-		goto out;
-	}
-
 	priv->response_length = 0;
 	priv->response_read = false;
 	*off = 0;
@@ -211,11 +202,19 @@ ssize_t tpm_common_write(struct file *file, const char __user *buf,
 	if (file->f_flags & O_NONBLOCK) {
 		priv->command_enqueued = true;
 		queue_work(tpm_dev_wq, &priv->async_work);
-		tpm_put_ops(priv->chip);
 		mutex_unlock(&priv->buffer_mutex);
 		return size;
 	}

+	/* atomic tpm command send and result receive. We only hold the ops
+	 * lock during this period so that the tpm can be unregistered even if
+	 * the char dev is held open.
+	 */
+	if (tpm_try_get_ops(priv->chip)) {
+		ret = -EPIPE;
+		goto out;
+	}
+
 	ret = tpm_dev_transmit(priv->chip, priv->space, priv->data_buffer,
 			       sizeof(priv->data_buffer));
 	tpm_put_ops(priv->chip);

Limonciello, Mario May 26, 2020, 7:23 p.m. UTC | #2

> On Tue, 2020-05-26 at 13:32 -0500, Mario Limonciello wrote:
> > This reverts commit d23d12484307b40eea549b8a858f5fffad913897.
> >
> > This commit has caused regressions for the XPS 9560 containing
> > a Nuvoton TPM.
> 
> Presumably this is using the tis driver?

Correct.

> 
> > As mentioned by the reporter all TPM2 commands are failing with:
> >   ERROR:tcti:src/tss2-tcti/tcti-device.c:290:tcti_device_receive()
> >   Failed to read response from fd 3, got errno 1: Operation not
> > permitted
> >
> > The reporter bisected this issue back to this commit which was
> > backported to stable as commit 4d6ebc4.
> 
> I think the problem is request_locality ... for some inexplicable
> reason a failure there returns -1, which is EPERM to user space.
> 
> That seems to be a bug in the async code since everything else gives a
> ESPIPE error if tpm_try_get_ops fails ... at least no-one assumes it
> gives back a sensible return code.
> 
> What I think is happening is that with the patch the TPM goes through a
> quick sequence of request, relinquish, request, relinquish and it's the
> third request which is failing (likely timing out).  Without the patch,
> the patch there's only one request,relinquish cycle because the ops are
> held while the async work is executed.  I have a vague recollection
> that there is a problem with too many locality request in quick
> succession, but I'll defer to Jason, who I think understands the
> intricacies of localities better than I do.

Thanks, I don't pretend to understand the nuances of this particular code,
but I was hoping that the request to revert got some attention since Alex's
kernel Bugzilla and message a few months ago to linux integrity weren't.

> 
> If that's the problem, the solution looks simple enough: just move the
> ops get down because the priv state is already protected by the buffer
> mutex

Yeah, if that works for Alex's situation it certainly sounds like a better
solution than reverting this patch as this patch actually does fix a problem
reported by Jeffrin originally.

Could you propose a specific patch that Alex and Jeffrin can perhaps both try?

> 
> James
> 
> ---
> 
> diff --git a/drivers/char/tpm/tpm-dev-common.c b/drivers/char/tpm/tpm-dev-
> common.c
> index 87f449340202..1784530b8387 100644
> --- a/drivers/char/tpm/tpm-dev-common.c
> +++ b/drivers/char/tpm/tpm-dev-common.c
> @@ -189,15 +189,6 @@ ssize_t tpm_common_write(struct file *file, const char
> __user *buf,
>  		goto out;
>  	}
> 
> -	/* atomic tpm command send and result receive. We only hold the ops
> -	 * lock during this period so that the tpm can be unregistered even if
> -	 * the char dev is held open.
> -	 */
> -	if (tpm_try_get_ops(priv->chip)) {
> -		ret = -EPIPE;
> -		goto out;
> -	}
> -
>  	priv->response_length = 0;
>  	priv->response_read = false;
>  	*off = 0;
> @@ -211,11 +202,19 @@ ssize_t tpm_common_write(struct file *file, const char
> __user *buf,
>  	if (file->f_flags & O_NONBLOCK) {
>  		priv->command_enqueued = true;
>  		queue_work(tpm_dev_wq, &priv->async_work);
> -		tpm_put_ops(priv->chip);
>  		mutex_unlock(&priv->buffer_mutex);
>  		return size;
>  	}
> 
> +	/* atomic tpm command send and result receive. We only hold the ops
> +	 * lock during this period so that the tpm can be unregistered even if
> +	 * the char dev is held open.
> +	 */
> +	if (tpm_try_get_ops(priv->chip)) {
> +		ret = -EPIPE;
> +		goto out;
> +	}
> +
>  	ret = tpm_dev_transmit(priv->chip, priv->space, priv->data_buffer,
>  			       sizeof(priv->data_buffer));
>  	tpm_put_ops(priv->chip);

James Bottomley May 26, 2020, 7:38 p.m. UTC | #3

On Tue, 2020-05-26 at 19:23 +0000, Mario.Limonciello@dell.com wrote:
> > On Tue, 2020-05-26 at 13:32 -0500, Mario Limonciello wrote:
> > > This reverts commit d23d12484307b40eea549b8a858f5fffad913897.
> > > 
> > > This commit has caused regressions for the XPS 9560 containing
> > > a Nuvoton TPM.
> > 
> > Presumably this is using the tis driver?
> 
> Correct.
> 
> > 
> > > As mentioned by the reporter all TPM2 commands are failing with:
> > >   ERROR:tcti:src/tss2-tcti/tcti-
> > > device.c:290:tcti_device_receive()
> > >   Failed to read response from fd 3, got errno 1: Operation not
> > > permitted
> > > 
> > > The reporter bisected this issue back to this commit which was
> > > backported to stable as commit 4d6ebc4.
> > 
> > I think the problem is request_locality ... for some inexplicable
> > reason a failure there returns -1, which is EPERM to user space.
> > 
> > That seems to be a bug in the async code since everything else
> > gives a ESPIPE error if tpm_try_get_ops fails ... at least no-one
> > assumes it gives back a sensible return code.
> > 
> > What I think is happening is that with the patch the TPM goes
> > through a quick sequence of request, relinquish, request,
> > relinquish and it's the third request which is failing (likely
> > timing out).  Without the patch, the patch there's only one
> > request,relinquish cycle because the ops are held while the async
> > work is executed.  I have a vague recollection that there is a
> > problem with too many locality request in quick succession, but
> > I'll defer to Jason, who I think understands the intricacies of
> > localities better than I do.
> 
> Thanks, I don't pretend to understand the nuances of this particular
> code, but I was hoping that the request to revert got some attention
> since Alex's kernel Bugzilla and message a few months ago to linux
> integrity weren't.
> 
> > 
> > If that's the problem, the solution looks simple enough: just move
> > the ops get down because the priv state is already protected by the
> > buffer mutex
> 
> Yeah, if that works for Alex's situation it certainly sounds like a
> better solution than reverting this patch as this patch actually does
> fix a problem reported by Jeffrin originally.
> 
> Could you propose a specific patch that Alex and Jeffrin can perhaps
> both try?

Um, what's wrong with the one I originally attached and which you quote
below?  It's only compile tested, but I think it will work, if the
theory is correct.

James

> > 
> > James
> > 
> > ---
> > 
> > diff --git a/drivers/char/tpm/tpm-dev-common.c
> > b/drivers/char/tpm/tpm-dev-
> > common.c
> > index 87f449340202..1784530b8387 100644
> > --- a/drivers/char/tpm/tpm-dev-common.c
> > +++ b/drivers/char/tpm/tpm-dev-common.c
> > @@ -189,15 +189,6 @@ ssize_t tpm_common_write(struct file *file,
> > const char
> > __user *buf,
> >  		goto out;
> >  	}
> > 
> > -	/* atomic tpm command send and result receive. We only
> > hold the ops
> > -	 * lock during this period so that the tpm can be
> > unregistered even if
> > -	 * the char dev is held open.
> > -	 */
> > -	if (tpm_try_get_ops(priv->chip)) {
> > -		ret = -EPIPE;
> > -		goto out;
> > -	}
> > -
> >  	priv->response_length = 0;
> >  	priv->response_read = false;
> >  	*off = 0;
> > @@ -211,11 +202,19 @@ ssize_t tpm_common_write(struct file *file,
> > const char
> > __user *buf,
> >  	if (file->f_flags & O_NONBLOCK) {
> >  		priv->command_enqueued = true;
> >  		queue_work(tpm_dev_wq, &priv->async_work);
> > -		tpm_put_ops(priv->chip);
> >  		mutex_unlock(&priv->buffer_mutex);
> >  		return size;
> >  	}
> > 
> > +	/* atomic tpm command send and result receive. We only
> > hold the ops
> > +	 * lock during this period so that the tpm can be
> > unregistered even if
> > +	 * the char dev is held open.
> > +	 */
> > +	if (tpm_try_get_ops(priv->chip)) {
> > +		ret = -EPIPE;
> > +		goto out;
> > +	}
> > +
> >  	ret = tpm_dev_transmit(priv->chip, priv->space, priv-
> > > data_buffer,
> > 
> >  			       sizeof(priv->data_buffer));
> >  	tpm_put_ops(priv->chip);

Tadeusz Struk May 26, 2020, 7:39 p.m. UTC | #4

On 5/26/20 12:14 PM, James Bottomley wrote:
> +	/* atomic tpm command send and result receive. We only hold the ops
> +	 * lock during this period so that the tpm can be unregistered even if
> +	 * the char dev is held open.
> +	 */
> +	if (tpm_try_get_ops(priv->chip)) {
> +		ret = -EPIPE;
> +		goto out;
> +	}
> +
Hi James,
This won't help if the message is read by an async tcti. If the problem lies
in the chip get locality code, perhaps this could help to debug the root-cause
instead of masking it out in the upper layer code:

diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
index 2435216bd10a..da5ecd0376bf 100644
--- a/drivers/char/tpm/tpm_tis_core.c
+++ b/drivers/char/tpm/tpm_tis_core.c
@@ -202,20 +202,22 @@ static int request_locality(struct tpm_chip *chip, int l)
 		return rc;
 
 	stop = jiffies + chip->timeout_a;
+	timeout = stop - jiffies;
 
 	if (chip->flags & TPM_CHIP_FLAG_IRQ) {
 again:
 		timeout = stop - jiffies;
 		if ((long)timeout <= 0)
-			return -1;
+			goto out;
+
 		rc = wait_event_interruptible_timeout(priv->int_queue,
-						      (check_locality
-						       (chip, l)),
+						      check_locality(chip, l),
 						      timeout);
 		if (rc > 0)
 			return l;
 		if (rc == -ERESTARTSYS && freezing(current)) {
 			clear_thread_flag(TIF_SIGPENDING);
+			timeout = stop - jiffies;
 			goto again;
 		}
 	} else {
@@ -226,6 +228,10 @@ static int request_locality(struct tpm_chip *chip, int l)
 			tpm_msleep(TPM_TIMEOUT);
 		} while (time_before(jiffies, stop));
 	}
+out:
+	dev_warn(&chip->dev, "%s: failed to request locality %d after %lu ms\n",
+			 __func__, l, timeout * HZ / 1000);
+
 	return -1;
 }

James Bottomley May 26, 2020, 8 p.m. UTC | #5

On Tue, 2020-05-26 at 12:39 -0700, Tadeusz Struk wrote:
> On 5/26/20 12:14 PM, James Bottomley wrote:
> > +	/* atomic tpm command send and result receive. We only
> > hold the ops
> > +	 * lock during this period so that the tpm can be
> > unregistered even if
> > +	 * the char dev is held open.
> > +	 */
> > +	if (tpm_try_get_ops(priv->chip)) {
> > +		ret = -EPIPE;
> > +		goto out;
> > +	}
> > +
> 
> Hi James,
> This won't help if the message is read by an async tcti.

Why not?  It moves the ops get underneath the async path, so it's now
all done in the direct read or the async read.  That seems to be more
efficient.

>  If the problem lies in the chip get locality code, perhaps this
> could help to debug the root-cause instead of masking it out in the
> upper layer code:

I don't think there is a root cause other than a TIS TPM is getting
annoyed by us cycling localities too rapidly because we don't do an
actual TPM operation between request and relinquish.  Since the first
request/relinquish seems unnecessary for the async case, moving the ops
get eliminates the problem.

James

Tadeusz Struk May 26, 2020, 9:33 p.m. UTC | #6

On 5/26/20 1:00 PM, James Bottomley wrote:
> I don't think there is a root cause other than a TIS TPM is getting
> annoyed by us cycling localities too rapidly because we don't do an
> actual TPM operation between request and relinquish.  Since the first
> request/relinquish seems unnecessary for the async case, moving the ops
> get eliminates the problem.

Could be, so maybe we could try both patches.
More debug info on the error path won't hurt.
Thanks,
Tadeusz

Alex Guzman May 26, 2020, 10:19 p.m. UTC | #7

On Tue, 2020-05-26 at 12:38 -0700, James Bottomley wrote:
> On Tue, 2020-05-26 at 19:23 +0000, Mario.Limonciello@dell.com wrote:
> > > On Tue, 2020-05-26 at 13:32 -0500, Mario Limonciello wrote:
> > > > This reverts commit d23d12484307b40eea549b8a858f5fffad913897.
> > > > 
> > > > This commit has caused regressions for the XPS 9560 containing
> > > > a Nuvoton TPM.
> > > 
> > > Presumably this is using the tis driver?
> > 
> > Correct.
> > 
> > > > As mentioned by the reporter all TPM2 commands are failing
> > > > with:
> > > >   ERROR:tcti:src/tss2-tcti/tcti-
> > > > device.c:290:tcti_device_receive()
> > > >   Failed to read response from fd 3, got errno 1: Operation not
> > > > permitted
> > > > 
> > > > The reporter bisected this issue back to this commit which was
> > > > backported to stable as commit 4d6ebc4.
> > > 
> > > I think the problem is request_locality ... for some inexplicable
> > > reason a failure there returns -1, which is EPERM to user space.
> > > 
> > > That seems to be a bug in the async code since everything else
> > > gives a ESPIPE error if tpm_try_get_ops fails ... at least no-one
> > > assumes it gives back a sensible return code.
> > > 
> > > What I think is happening is that with the patch the TPM goes
> > > through a quick sequence of request, relinquish, request,
> > > relinquish and it's the third request which is failing (likely
> > > timing out).  Without the patch, the patch there's only one
> > > request,relinquish cycle because the ops are held while the async
> > > work is executed.  I have a vague recollection that there is a
> > > problem with too many locality request in quick succession, but
> > > I'll defer to Jason, who I think understands the intricacies of
> > > localities better than I do.
> > 
> > Thanks, I don't pretend to understand the nuances of this
> > particular
> > code, but I was hoping that the request to revert got some
> > attention
> > since Alex's kernel Bugzilla and message a few months ago to linux
> > integrity weren't.
> > 
> > > If that's the problem, the solution looks simple enough: just
> > > move
> > > the ops get down because the priv state is already protected by
> > > the
> > > buffer mutex
> > 
> > Yeah, if that works for Alex's situation it certainly sounds like a
> > better solution than reverting this patch as this patch actually
> > does
> > fix a problem reported by Jeffrin originally.
> > 
> > Could you propose a specific patch that Alex and Jeffrin can
> > perhaps
> > both try?
> 
> Um, what's wrong with the one I originally attached and which you
> quote
> below?  It's only compile tested, but I think it will work, if the
> theory is correct.
> 
> James
> 
> > > James
> > > 
> > > ---
> > > 
> > > diff --git a/drivers/char/tpm/tpm-dev-common.c
> > > b/drivers/char/tpm/tpm-dev-
> > > common.c
> > > index 87f449340202..1784530b8387 100644
> > > --- a/drivers/char/tpm/tpm-dev-common.c
> > > +++ b/drivers/char/tpm/tpm-dev-common.c
> > > @@ -189,15 +189,6 @@ ssize_t tpm_common_write(struct file *file,
> > > const char
> > > __user *buf,
> > >  		goto out;
> > >  	}
> > > 
> > > -	/* atomic tpm command send and result receive. We only
> > > hold the ops
> > > -	 * lock during this period so that the tpm can be
> > > unregistered even if
> > > -	 * the char dev is held open.
> > > -	 */
> > > -	if (tpm_try_get_ops(priv->chip)) {
> > > -		ret = -EPIPE;
> > > -		goto out;
> > > -	}
> > > -
> > >  	priv->response_length = 0;
> > >  	priv->response_read = false;
> > >  	*off = 0;
> > > @@ -211,11 +202,19 @@ ssize_t tpm_common_write(struct file *file,
> > > const char
> > > __user *buf,
> > >  	if (file->f_flags & O_NONBLOCK) {
> > >  		priv->command_enqueued = true;
> > >  		queue_work(tpm_dev_wq, &priv->async_work);
> > > -		tpm_put_ops(priv->chip);
> > >  		mutex_unlock(&priv->buffer_mutex);
> > >  		return size;
> > >  	}
> > > 
> > > +	/* atomic tpm command send and result receive. We only
> > > hold the ops
> > > +	 * lock during this period so that the tpm can be
> > > unregistered even if
> > > +	 * the char dev is held open.
> > > +	 */
> > > +	if (tpm_try_get_ops(priv->chip)) {
> > > +		ret = -EPIPE;
> > > +		goto out;
> > > +	}
> > > +
> > >  	ret = tpm_dev_transmit(priv->chip, priv->space, priv-
> > > > data_buffer,
> > > 
> > >  			       sizeof(priv->data_buffer));
> > >  	tpm_put_ops(priv->chip);

When using your patch, I get a hang when trying to use tpm2_getcap, and
dmesg shows some info.
[  570.913779] BUG: unable to handle page fault for address: ffffb20001247000
[  570.913782] #PF: supervisor write access in kernel mode
[  570.913783] #PF: error_code(0x0002) - not-present page
[  570.913784] PGD 0 P4D 0 
[  570.913785] Oops: 0002 [#3] SMP PTI
[  570.913787] CPU: 6 PID: 24744 Comm: tpm2_getcap Tainted: G     UD           5.7.0-rc7+ #31
[  570.913788] Hardware name: Dell Inc. XPS 15 9560/05FFDN, BIOS 1.18.0 11/17/2019
[  570.913791] RIP: 0010:iowrite8+0x9/0x50
[  570.913792] Code: 48 c7 c2 40 43 9f 99 48 89 04 24 e8 14 a7 90 ff 0f 0b 48 8b 04 24 48 83 c4 08 c3 66 0f 1f 44 00 00 48 81 fe ff ff 03 00 76 04 <40> 88 3e c3 48 81 fe 00 00 01 00 76 07 0f b7 d6 89 f8 ee c3 8b 05
[  570.913793] RSP: 0018:ffffb1ff049d7db0 EFLAGS: 00010292
[  570.913794] RAX: ffffffff981bf520 RBX: ffffb1ff049d7df9 RCX: ffffb1ff049d7df8
[  570.913795] RDX: 0000000000000001 RSI: ffffb20001247000 RDI: 0000000000000020
[  570.913796] RBP: ffffb1ff049d7df9 R08: 0000000000000000 R09: ffff8b80de5370f0
[  570.913797] R10: 0000000000b71b00 R11: 000000000000028f R12: ffff8b80b148cda8
[  570.913797] R13: 00000000fffff000 R14: ffff8b80b148cda8 R15: ffff8b80cb44a0ba
[  570.913799] FS:  00007f78f7cd0d80(0000) GS:ffff8b80de500000(0000) knlGS:0000000000000000
[  570.913799] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  570.913800] CR2: ffffb20001247000 CR3: 0000000795618001 CR4: 00000000003606e0
[  570.913801] Call Trace:
[  570.913803]  tpm_tcg_write_bytes+0x2f/0x40
[  570.913805]  release_locality+0x49/0x220
[  570.913807]  tpm_relinquish_locality+0x1f/0x40
[  570.913808]  tpm_chip_stop+0x21/0x40
[  570.913810]  tpm_put_ops+0x9/0x30
[  570.913811]  tpm_common_write+0x179/0x190
[  570.913813]  vfs_write+0xb1/0x1a0
[  570.913815]  ksys_write+0x5a/0xd0
[  570.913816]  do_syscall_64+0x43/0x130
[  570.913819]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  570.913820] RIP: 0033:0x7f78f7e00123
[  570.913821] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 64 8b 04 25 18 00 00 00 85 c0 75 14 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 55 c3 0f 1f 40 00 48 83 ec 28 48 89 54 24 18
[  570.913822] RSP: 002b:00007fff724e8c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  570.913823] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f78f7e00123
[  570.913824] RDX: 0000000000000016 RSI: 0000564cf24a7220 RDI: 0000000000000003
[  570.913825] RBP: 0000000000000016 R08: 00007f78f7ccc785 R09: 00007f78f7ccca40
[  570.913826] R10: 00007fff724e8b10 R11: 0000000000000246 R12: 0000564cf24a7220
[  570.913826] R13: 0000000000000000 R14: 0000000000000016 R15: 00007f78f7ccc890
[  570.913827] Modules linked in: squashfs rtsx_pci_sdmmc x86_pkg_temp_thermal coretemp rtsx_pci mfd_core
[  570.913831] CR2: ffffb20001247000
[  570.913832] ---[ end trace c84437b00f0d01a0 ]---
[  570.913833] RIP: 0010:iowrite8+0x9/0x50
[  570.913834] Code: 48 c7 c2 40 43 9f 99 48 89 04 24 e8 14 a7 90 ff 0f 0b 48 8b 04 24 48 83 c4 08 c3 66 0f 1f 44 00 00 48 81 fe ff ff 03 00 76 04 <40> 88 3e c3 48 81 fe 00 00 01 00 76 07 0f b7 d6 89 f8 ee c3 8b 05
[  570.913835] RSP: 0018:ffffb1ff030b7db0 EFLAGS: 00010292
[  570.913836] RAX: ffffffff981bf520 RBX: ffffb1ff030b7df9 RCX: ffffb1ff030b7df8
[  570.913837] RDX: 0000000000000001 RSI: ffffb20001247000 RDI: 0000000000000020
[  570.913837] RBP: ffffb1ff030b7df9 R08: 0000000000000000 R09: ffff8b80de2370f0
[  570.913838] R10: 0000000000b71b00 R11: 000000000000019c R12: ffff8b80b148cda8
[  570.913839] R13: 00000000fffff000 R14: ffff8b80b148cda8 R15: ffff8b80c4cfc0ba
[  570.913840] FS:  00007f78f7cd0d80(0000) GS:ffff8b80de500000(0000) knlGS:0000000000000000
[  570.913840] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  570.913841] CR2: ffffb20001247000 CR3: 0000000795618001 CR4: 00000000003606e0

Alex Guzman May 26, 2020, 10:34 p.m. UTC | #8

On Tue, 2020-05-26 at 14:33 -0700, Tadeusz Struk wrote:
> On 5/26/20 1:00 PM, James Bottomley wrote:
> > I don't think there is a root cause other than a TIS TPM is getting
> > annoyed by us cycling localities too rapidly because we don't do an
> > actual TPM operation between request and relinquish.  Since the
> > first
> > request/relinquish seems unnecessary for the async case, moving the
> > ops
> > get eliminates the problem.
> 
> Could be, so maybe we could try both patches.
> More debug info on the error path won't hurt.
> Thanks,
> Tadeusz

With your logging patch, I consistently see this message in dmesg when
tpm2_getcap fails:

tpm tpm0: request_locality: failed to request locality 0 after 750 ms

James Bottomley May 26, 2020, 11:06 p.m. UTC | #9

On Tue, 2020-05-26 at 15:19 -0700, Alex Guzman wrote:
[...]
> When using your patch, I get a hang when trying to use tpm2_getcap,
> and dmesg shows some info.

Are you sure it's all applied?  This

> [  570.913803]  tpm_tcg_write_bytes+0x2f/0x40
> [  570.913805]  release_locality+0x49/0x220
> [  570.913807]  tpm_relinquish_locality+0x1f/0x40
> [  570.913808]  tpm_chip_stop+0x21/0x40
> [  570.913810]  tpm_put_ops+0x9/0x30
> [  570.913811]  tpm_common_write+0x179/0x190
> [  570.913813]  vfs_write+0xb1/0x1a0

Implies an unmatched tpm_put_ops() in the async write path, as though
this hunk:

> @@ -211,11 +202,19 @@ ssize_t tpm_common_write(struct file *file,
> const char __user *buf,
>         if (file->f_flags & O_NONBLOCK) {
>                 priv->command_enqueued = true;
>                 queue_work(tpm_dev_wq, &priv->async_work);
> -               tpm_put_ops(priv->chip);
>                 mutex_unlock(&priv->buffer_mutex);
>                 return size;
>         }

Is missing.  I actually booted the patch in my TPM based VM and it all
seems to work OK when I execute tpm2_getcap (I verified it's using
O_NONBLOCK) and tssgetcapability in sync mode.

James

Alex Guzman May 26, 2020, 11:31 p.m. UTC | #10

On Tue, May 26, 2020 at 4:06 PM James Bottomley
<James.Bottomley@hansenpartnership.com> wrote:
>
> On Tue, 2020-05-26 at 15:19 -0700, Alex Guzman wrote:
> [...]
> > When using your patch, I get a hang when trying to use tpm2_getcap,
> > and dmesg shows some info.
>
> Are you sure it's all applied?  This
>
> > [  570.913803]  tpm_tcg_write_bytes+0x2f/0x40
> > [  570.913805]  release_locality+0x49/0x220
> > [  570.913807]  tpm_relinquish_locality+0x1f/0x40
> > [  570.913808]  tpm_chip_stop+0x21/0x40
> > [  570.913810]  tpm_put_ops+0x9/0x30
> > [  570.913811]  tpm_common_write+0x179/0x190
> > [  570.913813]  vfs_write+0xb1/0x1a0
>
> Implies an unmatched tpm_put_ops() in the async write path, as though
> this hunk:
>
> > @@ -211,11 +202,19 @@ ssize_t tpm_common_write(struct file *file,
> > const char __user *buf,
> >         if (file->f_flags & O_NONBLOCK) {
> >                 priv->command_enqueued = true;
> >                 queue_work(tpm_dev_wq, &priv->async_work);
> > -               tpm_put_ops(priv->chip);
> >                 mutex_unlock(&priv->buffer_mutex);
> >                 return size;
> >         }
>
> Is missing.  I actually booted the patch in my TPM based VM and it all
> seems to work OK when I execute tpm2_getcap (I verified it's using
> O_NONBLOCK) and tssgetcapability in sync mode.
>
> James
>

Oh, I did miss that bit. The patch had issues applying for some reason
and I missed the single-line removal when I was looking at the diff.

I gave it a spin on my machine again. getcap seems to work correctly
with and without having the async config flag set for tpm2-tss. The
pkcs11 plugin seems to work correctly again too. :)

James Bottomley May 27, 2020, 12:18 a.m. UTC | #11

On Tue, 2020-05-26 at 16:31 -0700, Alex Guzman wrote:
> On Tue, May 26, 2020 at 4:06 PM James Bottomley
> <James.Bottomley@hansenpartnership.com> wrote:
> > 
> > On Tue, 2020-05-26 at 15:19 -0700, Alex Guzman wrote:
> > [...]
> > > When using your patch, I get a hang when trying to use
> > > tpm2_getcap, and dmesg shows some info.
> > 
> > Are you sure it's all applied?  This
> > 
> > > [  570.913803]  tpm_tcg_write_bytes+0x2f/0x40
> > > [  570.913805]  release_locality+0x49/0x220
> > > [  570.913807]  tpm_relinquish_locality+0x1f/0x40
> > > [  570.913808]  tpm_chip_stop+0x21/0x40
> > > [  570.913810]  tpm_put_ops+0x9/0x30
> > > [  570.913811]  tpm_common_write+0x179/0x190
> > > [  570.913813]  vfs_write+0xb1/0x1a0
> > 
> > Implies an unmatched tpm_put_ops() in the async write path, as
> > though this hunk:
> > 
> > > @@ -211,11 +202,19 @@ ssize_t tpm_common_write(struct file *file,
> > > const char __user *buf,
> > >         if (file->f_flags & O_NONBLOCK) {
> > >                 priv->command_enqueued = true;
> > >                 queue_work(tpm_dev_wq, &priv->async_work);
> > > -               tpm_put_ops(priv->chip);
> > >                 mutex_unlock(&priv->buffer_mutex);
> > >                 return size;
> > >         }
> > 
> > Is missing.  I actually booted the patch in my TPM based VM and it
> > all seems to work OK when I execute tpm2_getcap (I verified it's
> > using O_NONBLOCK) and tssgetcapability in sync mode.
> > 
> > James
> > 
> 
> Oh, I did miss that bit. The patch had issues applying for some
> reason and I missed the single-line removal when I was looking at the
> diff.

Sorry, that's likely my fault: I did it on top of my current TPM tree.
I'll prepare a version against the vanilla kernel with a real
changelog.

> I gave it a spin on my machine again. getcap seems to work correctly
> with and without having the async config flag set for tpm2-tss. The
> pkcs11 plugin seems to work correctly again too. :)

Great, thanks!  I'll add your tested-by to the above.

James

Jarkko Sakkinen May 27, 2020, 8:09 p.m. UTC | #12

On Tue, 2020-05-26 at 12:38 -0700, James Bottomley wrote:
> On Tue, 2020-05-26 at 19:23 +0000, Mario.Limonciello@dell.com wrote:
> > > On Tue, 2020-05-26 at 13:32 -0500, Mario Limonciello wrote:
> > > > This reverts commit d23d12484307b40eea549b8a858f5fffad913897.
> > > > 
> > > > This commit has caused regressions for the XPS 9560 containing
> > > > a Nuvoton TPM.
> > > 
> > > Presumably this is using the tis driver?
> > 
> > Correct.
> > 
> > > > As mentioned by the reporter all TPM2 commands are failing with:
> > > >   ERROR:tcti:src/tss2-tcti/tcti-
> > > > device.c:290:tcti_device_receive()
> > > >   Failed to read response from fd 3, got errno 1: Operation not
> > > > permitted
> > > > 
> > > > The reporter bisected this issue back to this commit which was
> > > > backported to stable as commit 4d6ebc4.
> > > 
> > > I think the problem is request_locality ... for some inexplicable
> > > reason a failure there returns -1, which is EPERM to user space.
> > > 
> > > That seems to be a bug in the async code since everything else
> > > gives a ESPIPE error if tpm_try_get_ops fails ... at least no-one
> > > assumes it gives back a sensible return code.
> > > 
> > > What I think is happening is that with the patch the TPM goes
> > > through a quick sequence of request, relinquish, request,
> > > relinquish and it's the third request which is failing (likely
> > > timing out).  Without the patch, the patch there's only one
> > > request,relinquish cycle because the ops are held while the async
> > > work is executed.  I have a vague recollection that there is a
> > > problem with too many locality request in quick succession, but
> > > I'll defer to Jason, who I think understands the intricacies of
> > > localities better than I do.
> > 
> > Thanks, I don't pretend to understand the nuances of this particular
> > code, but I was hoping that the request to revert got some attention
> > since Alex's kernel Bugzilla and message a few months ago to linux
> > integrity weren't.
> > 
> > > If that's the problem, the solution looks simple enough: just move
> > > the ops get down because the priv state is already protected by the
> > > buffer mutex
> > 
> > Yeah, if that works for Alex's situation it certainly sounds like a
> > better solution than reverting this patch as this patch actually does
> > fix a problem reported by Jeffrin originally.
> > 
> > Could you propose a specific patch that Alex and Jeffrin can perhaps
> > both try?
> 
> Um, what's wrong with the one I originally attached and which you quote
> below?  It's only compile tested, but I think it will work, if the
> theory is correct.

Please send a legit patch, thanks.

/Jarkko

Jarkko Sakkinen May 27, 2020, 8:15 p.m. UTC | #13

On Tue, 2020-05-26 at 19:23 +0000, Mario.Limonciello@dell.com wrote:
> Thanks, I don't pretend to understand the nuances of this particular code,
> but I was hoping that the request to revert got some attention since Alex's
> kernel Bugzilla and message a few months ago to linux integrity weren't.

Removing linux-kernel from CC since this subsystem internal discussion.

Seeing the whole thing first time today.

Bugzilla is the first thing to ignore when busy. It is good as place
holder for bugs, but all discussions should happen only in LKML. There's
no official requirement to proactively use Bugzilla for anything.

That said I'm happy that people put stuff there so that it gets logged.

For follow-up's use only LKML if it is important to you. Those will get
processed.

As far as this goes, if nothing is heard from me, check that you put me
as CC to the original email. Otherwise, I might have missed it (by mistake,
not by purpose).

Honestly, I'm not sure what point was this patch when there was time to
wait for months without response. Why the passivity for all this time?

/Jarkko

Limonciello, Mario May 27, 2020, 8:18 p.m. UTC | #14

> -----Original Message-----
> From: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> Sent: Wednesday, May 27, 2020 3:09 PM
> To: James Bottomley; Limonciello, Mario; peterhuewe@gmx.de; jgg@ziepe.ca
> Cc: arnd@arndb.de; gregkh@linuxfoundation.org; linux-integrity@vger.kernel.org;
> linux-kernel@vger.kernel.org; jeffrin@rajagiritech.edu.in; alex@guzman.io
> Subject: Re: [PATCH] tpm: Revert "tpm: fix invalid locking in NONBLOCKING mode"
> 
> 
> [EXTERNAL EMAIL]
> 
> On Tue, 2020-05-26 at 12:38 -0700, James Bottomley wrote:
> > On Tue, 2020-05-26 at 19:23 +0000, Mario.Limonciello@dell.com wrote:
> > > > On Tue, 2020-05-26 at 13:32 -0500, Mario Limonciello wrote:
> > > > > This reverts commit d23d12484307b40eea549b8a858f5fffad913897.
> > > > >
> > > > > This commit has caused regressions for the XPS 9560 containing
> > > > > a Nuvoton TPM.
> > > >
> > > > Presumably this is using the tis driver?
> > >
> > > Correct.
> > >
> > > > > As mentioned by the reporter all TPM2 commands are failing with:
> > > > >   ERROR:tcti:src/tss2-tcti/tcti-
> > > > > device.c:290:tcti_device_receive()
> > > > >   Failed to read response from fd 3, got errno 1: Operation not
> > > > > permitted
> > > > >
> > > > > The reporter bisected this issue back to this commit which was
> > > > > backported to stable as commit 4d6ebc4.
> > > >
> > > > I think the problem is request_locality ... for some inexplicable
> > > > reason a failure there returns -1, which is EPERM to user space.
> > > >
> > > > That seems to be a bug in the async code since everything else
> > > > gives a ESPIPE error if tpm_try_get_ops fails ... at least no-one
> > > > assumes it gives back a sensible return code.
> > > >
> > > > What I think is happening is that with the patch the TPM goes
> > > > through a quick sequence of request, relinquish, request,
> > > > relinquish and it's the third request which is failing (likely
> > > > timing out).  Without the patch, the patch there's only one
> > > > request,relinquish cycle because the ops are held while the async
> > > > work is executed.  I have a vague recollection that there is a
> > > > problem with too many locality request in quick succession, but
> > > > I'll defer to Jason, who I think understands the intricacies of
> > > > localities better than I do.
> > >
> > > Thanks, I don't pretend to understand the nuances of this particular
> > > code, but I was hoping that the request to revert got some attention
> > > since Alex's kernel Bugzilla and message a few months ago to linux
> > > integrity weren't.
> > >
> > > > If that's the problem, the solution looks simple enough: just move
> > > > the ops get down because the priv state is already protected by the
> > > > buffer mutex
> > >
> > > Yeah, if that works for Alex's situation it certainly sounds like a
> > > better solution than reverting this patch as this patch actually does
> > > fix a problem reported by Jeffrin originally.
> > >
> > > Could you propose a specific patch that Alex and Jeffrin can perhaps
> > > both try?
> >
> > Um, what's wrong with the one I originally attached and which you quote
> > below?  It's only compile tested, but I think it will work, if the
> > theory is correct.
> 
> Please send a legit patch, thanks.
> 
> /Jarkko

Jarkko,

After the confirmation from Alex that this patch attached to the end of the thread
worked, James did send a proper patch that can be accessed here:
https://lore.kernel.org/linux-integrity/20200527155800.ya43xm2ltuwduwjg@cantor/T/#t

Thanks,

Jarkko Sakkinen May 28, 2020, 12:30 a.m. UTC | #15

On Tue, May 26, 2020 at 12:39:37PM -0700, Tadeusz Struk wrote:
> On 5/26/20 12:14 PM, James Bottomley wrote:
> > +	/* atomic tpm command send and result receive. We only hold the ops
> > +	 * lock during this period so that the tpm can be unregistered even if
> > +	 * the char dev is held open.
> > +	 */
> > +	if (tpm_try_get_ops(priv->chip)) {
> > +		ret = -EPIPE;
> > +		goto out;
> > +	}
> > +
> Hi James,
> This won't help if the message is read by an async tcti. If the problem lies
> in the chip get locality code, perhaps this could help to debug the root-cause
> instead of masking it out in the upper layer code:

What is TCTI and async TCTI? Not following.

> diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
> index 2435216bd10a..da5ecd0376bf 100644
> --- a/drivers/char/tpm/tpm_tis_core.c
> +++ b/drivers/char/tpm/tpm_tis_core.c
> @@ -202,20 +202,22 @@ static int request_locality(struct tpm_chip *chip, int l)
>  		return rc;
>  
>  	stop = jiffies + chip->timeout_a;
> +	timeout = stop - jiffies;
>  
>  	if (chip->flags & TPM_CHIP_FLAG_IRQ) {
>  again:
>  		timeout = stop - jiffies;
>  		if ((long)timeout <= 0)
> -			return -1;
> +			goto out;
> +
>  		rc = wait_event_interruptible_timeout(priv->int_queue,
> -						      (check_locality
> -						       (chip, l)),
> +						      check_locality(chip, l),
>  						      timeout);
>  		if (rc > 0)
>  			return l;
>  		if (rc == -ERESTARTSYS && freezing(current)) {
>  			clear_thread_flag(TIF_SIGPENDING);
> +			timeout = stop - jiffies;
>  			goto again;
>  		}
>  	} else {
> @@ -226,6 +228,10 @@ static int request_locality(struct tpm_chip *chip, int l)
>  			tpm_msleep(TPM_TIMEOUT);
>  		} while (time_before(jiffies, stop));
>  	}
> +out:
> +	dev_warn(&chip->dev, "%s: failed to request locality %d after %lu ms\n",
> +			 __func__, l, timeout * HZ / 1000);
> +
>  	return -1;
>  }
>  

/Jarkko

Jarkko Sakkinen May 28, 2020, 12:43 a.m. UTC | #16

On Wed, May 27, 2020 at 08:18:56PM +0000, Mario.Limonciello@dell.com wrote:
> > -----Original Message-----
> > From: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
> > Sent: Wednesday, May 27, 2020 3:09 PM
> > To: James Bottomley; Limonciello, Mario; peterhuewe@gmx.de; jgg@ziepe.ca
> > Cc: arnd@arndb.de; gregkh@linuxfoundation.org; linux-integrity@vger.kernel.org;
> > linux-kernel@vger.kernel.org; jeffrin@rajagiritech.edu.in; alex@guzman.io
> > Subject: Re: [PATCH] tpm: Revert "tpm: fix invalid locking in NONBLOCKING mode"
> > 
> > 
> > [EXTERNAL EMAIL]

What is this?

> > On Tue, 2020-05-26 at 12:38 -0700, James Bottomley wrote:
> > > On Tue, 2020-05-26 at 19:23 +0000, Mario.Limonciello@dell.com wrote:
> > > > > On Tue, 2020-05-26 at 13:32 -0500, Mario Limonciello wrote:
> > > > > > This reverts commit d23d12484307b40eea549b8a858f5fffad913897.
> > > > > >
> > > > > > This commit has caused regressions for the XPS 9560 containing
> > > > > > a Nuvoton TPM.
> > > > >
> > > > > Presumably this is using the tis driver?
> > > >
> > > > Correct.
> > > >
> > > > > > As mentioned by the reporter all TPM2 commands are failing with:
> > > > > >   ERROR:tcti:src/tss2-tcti/tcti-
> > > > > > device.c:290:tcti_device_receive()
> > > > > >   Failed to read response from fd 3, got errno 1: Operation not
> > > > > > permitted
> > > > > >
> > > > > > The reporter bisected this issue back to this commit which was
> > > > > > backported to stable as commit 4d6ebc4.
> > > > >
> > > > > I think the problem is request_locality ... for some inexplicable
> > > > > reason a failure there returns -1, which is EPERM to user space.
> > > > >
> > > > > That seems to be a bug in the async code since everything else
> > > > > gives a ESPIPE error if tpm_try_get_ops fails ... at least no-one
> > > > > assumes it gives back a sensible return code.
> > > > >
> > > > > What I think is happening is that with the patch the TPM goes
> > > > > through a quick sequence of request, relinquish, request,
> > > > > relinquish and it's the third request which is failing (likely
> > > > > timing out).  Without the patch, the patch there's only one
> > > > > request,relinquish cycle because the ops are held while the async
> > > > > work is executed.  I have a vague recollection that there is a
> > > > > problem with too many locality request in quick succession, but
> > > > > I'll defer to Jason, who I think understands the intricacies of
> > > > > localities better than I do.
> > > >
> > > > Thanks, I don't pretend to understand the nuances of this particular
> > > > code, but I was hoping that the request to revert got some attention
> > > > since Alex's kernel Bugzilla and message a few months ago to linux
> > > > integrity weren't.
> > > >
> > > > > If that's the problem, the solution looks simple enough: just move
> > > > > the ops get down because the priv state is already protected by the
> > > > > buffer mutex
> > > >
> > > > Yeah, if that works for Alex's situation it certainly sounds like a
> > > > better solution than reverting this patch as this patch actually does
> > > > fix a problem reported by Jeffrin originally.
> > > >
> > > > Could you propose a specific patch that Alex and Jeffrin can perhaps
> > > > both try?
> > >
> > > Um, what's wrong with the one I originally attached and which you quote
> > > below?  It's only compile tested, but I think it will work, if the
> > > theory is correct.
> > 
> > Please send a legit patch, thanks.
> > 
> > /Jarkko
> 
> Jarkko,
> 
> After the confirmation from Alex that this patch attached to the end of the thread
> worked, James did send a proper patch that can be accessed here:
> https://lore.kernel.org/linux-integrity/20200527155800.ya43xm2ltuwduwjg@cantor/T/#t
> 
> Thanks,

Hi thanks a lot! I did read the full discussions and agree with the
conclusions as I get a patch in proper form.

Please ping next time a bit earlier. It's not that I don't want to deal
with the issues quickly as possible. It's probably just that I've forgot
something or missed.

/Jarkko

Limonciello, Mario May 28, 2020, 12:59 a.m. UTC | #17

> > > [EXTERNAL EMAIL]
> 
> What is this?

Something my employer's mail system automatically tags in external email.

My mistakes in forgetting to remove it on the response.

> 
> > > On Tue, 2020-05-26 at 12:38 -0700, James Bottomley wrote:
> > > > On Tue, 2020-05-26 at 19:23 +0000, Mario.Limonciello@dell.com wrote:
> > > > > > On Tue, 2020-05-26 at 13:32 -0500, Mario Limonciello wrote:
> > > > > > > This reverts commit d23d12484307b40eea549b8a858f5fffad913897.
> > > > > > >
> > > > > > > This commit has caused regressions for the XPS 9560 containing
> > > > > > > a Nuvoton TPM.
> > > > > >
> > > > > > Presumably this is using the tis driver?
> > > > >
> > > > > Correct.
> > > > >
> > > > > > > As mentioned by the reporter all TPM2 commands are failing with:
> > > > > > >   ERROR:tcti:src/tss2-tcti/tcti-
> > > > > > > device.c:290:tcti_device_receive()
> > > > > > >   Failed to read response from fd 3, got errno 1: Operation not
> > > > > > > permitted
> > > > > > >
> > > > > > > The reporter bisected this issue back to this commit which was
> > > > > > > backported to stable as commit 4d6ebc4.
> > > > > >
> > > > > > I think the problem is request_locality ... for some inexplicable
> > > > > > reason a failure there returns -1, which is EPERM to user space.
> > > > > >
> > > > > > That seems to be a bug in the async code since everything else
> > > > > > gives a ESPIPE error if tpm_try_get_ops fails ... at least no-one
> > > > > > assumes it gives back a sensible return code.
> > > > > >
> > > > > > What I think is happening is that with the patch the TPM goes
> > > > > > through a quick sequence of request, relinquish, request,
> > > > > > relinquish and it's the third request which is failing (likely
> > > > > > timing out).  Without the patch, the patch there's only one
> > > > > > request,relinquish cycle because the ops are held while the async
> > > > > > work is executed.  I have a vague recollection that there is a
> > > > > > problem with too many locality request in quick succession, but
> > > > > > I'll defer to Jason, who I think understands the intricacies of
> > > > > > localities better than I do.
> > > > >
> > > > > Thanks, I don't pretend to understand the nuances of this particular
> > > > > code, but I was hoping that the request to revert got some attention
> > > > > since Alex's kernel Bugzilla and message a few months ago to linux
> > > > > integrity weren't.
> > > > >
> > > > > > If that's the problem, the solution looks simple enough: just move
> > > > > > the ops get down because the priv state is already protected by the
> > > > > > buffer mutex
> > > > >
> > > > > Yeah, if that works for Alex's situation it certainly sounds like a
> > > > > better solution than reverting this patch as this patch actually does
> > > > > fix a problem reported by Jeffrin originally.
> > > > >
> > > > > Could you propose a specific patch that Alex and Jeffrin can perhaps
> > > > > both try?
> > > >
> > > > Um, what's wrong with the one I originally attached and which you quote
> > > > below?  It's only compile tested, but I think it will work, if the
> > > > theory is correct.
> > >
> > > Please send a legit patch, thanks.
> > >
> > > /Jarkko
> >
> > Jarkko,
> >
> > After the confirmation from Alex that this patch attached to the end of the
> thread
> > worked, James did send a proper patch that can be accessed here:
> > https://lore.kernel.org/linux-
> integrity/20200527155800.ya43xm2ltuwduwjg@cantor/T/#t
> >
> > Thanks,
> 
> Hi thanks a lot! I did read the full discussions and agree with the
> conclusions as I get a patch in proper form.
> 
> Please ping next time a bit earlier. It's not that I don't want to deal
> with the issues quickly as possible. It's probably just that I've forgot
> something or missed.
> 
> /Jarkko

Thanks!

I completely forgot about it too, it was mentioned to me right after holidays
and I forgot to follow up and see that it got sorted.

Alex Guzman May 28, 2020, 1:10 a.m. UTC | #18

On Wed, 2020-05-27 at 23:15 +0300, Jarkko Sakkinen wrote:
> On Tue, 2020-05-26 at 19:23 +0000, Mario.Limonciello@dell.com wrote:
> > Thanks, I don't pretend to understand the nuances of this
> > particular code,
> > but I was hoping that the request to revert got some attention
> > since Alex's
> > kernel Bugzilla and message a few months ago to linux integrity
> > weren't.
> 
> Removing linux-kernel from CC since this subsystem internal
> discussion.
> 
> Seeing the whole thing first time today.
> 
> Bugzilla is the first thing to ignore when busy. It is good as place
> holder for bugs, but all discussions should happen only in LKML.
> There's
> no official requirement to proactively use Bugzilla for anything.
> 
> That said I'm happy that people put stuff there so that it gets
> logged.
> 
> For follow-up's use only LKML if it is important to you. Those will
> get
> processed.
> 
> As far as this goes, if nothing is heard from me, check that you put
> me
> as CC to the original email. Otherwise, I might have missed it (by
> mistake,
> not by purpose).
> 
> Honestly, I'm not sure what point was this patch when there was time
> to
> wait for months without response. Why the passivity for all this
> time?
> 
> /Jarkko
> 

It largely went quiet because I didn't raise the issue in the mailing
list again. I pinged back in February (
https://lore.kernel.org/linux-integrity/CAJ7-PMbujee92N1f9xVF8vtXgS49qpe7qHkeWh1Z0R-Rk-Jkaw@mail.gmail.com/
) but the conversation died out and I was content to simply use the
last working kernel version and see if the bug was resolved on its own.
I raised the issue again on the bugtracker a few days ago, leading to
this follow up here. :)

- Alex

Tadeusz Struk May 28, 2020, 4:40 a.m. UTC | #19

On 5/27/20 5:30 PM, Jarkko Sakkinen wrote:
>> This won't help if the message is read by an async tcti. If the problem lies
>> in the chip get locality code, perhaps this could help to debug the root-cause
>> instead of masking it out in the upper layer code:
> What is TCTI and async TCTI? Not following.

TPM Command Transmission Interface (TCTI) as defined by TCG in
https://trustedcomputinggroup.org/resource/tss-tcti-specification/

the reason we added the O_NONBLOCK mode was to satisfy the TCG spec for async TCTI.

Thanks,
Tadeusz

Jarkko Sakkinen May 28, 2020, 6:53 a.m. UTC | #20

On Thu, May 28, 2020 at 12:59:59AM +0000, Mario.Limonciello@dell.com wrote:
> > > > [EXTERNAL EMAIL]
> > 
> > What is this?
> 
> Something my employer's mail system automatically tags in external email.
> 
> My mistakes in forgetting to remove it on the response.

NP, just asking :-)

> > > > On Tue, 2020-05-26 at 12:38 -0700, James Bottomley wrote:
> > > > > On Tue, 2020-05-26 at 19:23 +0000, Mario.Limonciello@dell.com wrote:
> > > > > > > On Tue, 2020-05-26 at 13:32 -0500, Mario Limonciello wrote:
> > > > > > > > This reverts commit d23d12484307b40eea549b8a858f5fffad913897.
> > > > > > > >
> > > > > > > > This commit has caused regressions for the XPS 9560 containing
> > > > > > > > a Nuvoton TPM.
> > > > > > >
> > > > > > > Presumably this is using the tis driver?
> > > > > >
> > > > > > Correct.
> > > > > >
> > > > > > > > As mentioned by the reporter all TPM2 commands are failing with:
> > > > > > > >   ERROR:tcti:src/tss2-tcti/tcti-
> > > > > > > > device.c:290:tcti_device_receive()
> > > > > > > >   Failed to read response from fd 3, got errno 1: Operation not
> > > > > > > > permitted
> > > > > > > >
> > > > > > > > The reporter bisected this issue back to this commit which was
> > > > > > > > backported to stable as commit 4d6ebc4.
> > > > > > >
> > > > > > > I think the problem is request_locality ... for some inexplicable
> > > > > > > reason a failure there returns -1, which is EPERM to user space.
> > > > > > >
> > > > > > > That seems to be a bug in the async code since everything else
> > > > > > > gives a ESPIPE error if tpm_try_get_ops fails ... at least no-one
> > > > > > > assumes it gives back a sensible return code.
> > > > > > >
> > > > > > > What I think is happening is that with the patch the TPM goes
> > > > > > > through a quick sequence of request, relinquish, request,
> > > > > > > relinquish and it's the third request which is failing (likely
> > > > > > > timing out).  Without the patch, the patch there's only one
> > > > > > > request,relinquish cycle because the ops are held while the async
> > > > > > > work is executed.  I have a vague recollection that there is a
> > > > > > > problem with too many locality request in quick succession, but
> > > > > > > I'll defer to Jason, who I think understands the intricacies of
> > > > > > > localities better than I do.
> > > > > >
> > > > > > Thanks, I don't pretend to understand the nuances of this particular
> > > > > > code, but I was hoping that the request to revert got some attention
> > > > > > since Alex's kernel Bugzilla and message a few months ago to linux
> > > > > > integrity weren't.
> > > > > >
> > > > > > > If that's the problem, the solution looks simple enough: just move
> > > > > > > the ops get down because the priv state is already protected by the
> > > > > > > buffer mutex
> > > > > >
> > > > > > Yeah, if that works for Alex's situation it certainly sounds like a
> > > > > > better solution than reverting this patch as this patch actually does
> > > > > > fix a problem reported by Jeffrin originally.
> > > > > >
> > > > > > Could you propose a specific patch that Alex and Jeffrin can perhaps
> > > > > > both try?
> > > > >
> > > > > Um, what's wrong with the one I originally attached and which you quote
> > > > > below?  It's only compile tested, but I think it will work, if the
> > > > > theory is correct.
> > > >
> > > > Please send a legit patch, thanks.
> > > >
> > > > /Jarkko
> > >
> > > Jarkko,
> > >
> > > After the confirmation from Alex that this patch attached to the end of the
> > thread
> > > worked, James did send a proper patch that can be accessed here:
> > > https://lore.kernel.org/linux-
> > integrity/20200527155800.ya43xm2ltuwduwjg@cantor/T/#t
> > >
> > > Thanks,
> > 
> > Hi thanks a lot! I did read the full discussions and agree with the
> > conclusions as I get a patch in proper form.
> > 
> > Please ping next time a bit earlier. It's not that I don't want to deal
> > with the issues quickly as possible. It's probably just that I've forgot
> > something or missed.
> > 
> > /Jarkko
> 
> Thanks!
> 
> I completely forgot about it too, it was mentioned to me right after holidays
> and I forgot to follow up and see that it got sorted.

Yeah, sure, lets try to get a fix landed asap :-)

/Jarkko

Jarkko Sakkinen May 28, 2020, 6:54 a.m. UTC | #21

On Wed, May 27, 2020 at 06:10:02PM -0700, Alex Guzman wrote:
> On Wed, 2020-05-27 at 23:15 +0300, Jarkko Sakkinen wrote:
> > On Tue, 2020-05-26 at 19:23 +0000, Mario.Limonciello@dell.com wrote:
> > > Thanks, I don't pretend to understand the nuances of this
> > > particular code,
> > > but I was hoping that the request to revert got some attention
> > > since Alex's
> > > kernel Bugzilla and message a few months ago to linux integrity
> > > weren't.
> > 
> > Removing linux-kernel from CC since this subsystem internal
> > discussion.
> > 
> > Seeing the whole thing first time today.
> > 
> > Bugzilla is the first thing to ignore when busy. It is good as place
> > holder for bugs, but all discussions should happen only in LKML.
> > There's
> > no official requirement to proactively use Bugzilla for anything.
> > 
> > That said I'm happy that people put stuff there so that it gets
> > logged.
> > 
> > For follow-up's use only LKML if it is important to you. Those will
> > get
> > processed.
> > 
> > As far as this goes, if nothing is heard from me, check that you put
> > me
> > as CC to the original email. Otherwise, I might have missed it (by
> > mistake,
> > not by purpose).
> > 
> > Honestly, I'm not sure what point was this patch when there was time
> > to
> > wait for months without response. Why the passivity for all this
> > time?
> > 
> > /Jarkko
> > 
> 
> It largely went quiet because I didn't raise the issue in the mailing
> list again. I pinged back in February (
> https://lore.kernel.org/linux-integrity/CAJ7-PMbujee92N1f9xVF8vtXgS49qpe7qHkeWh1Z0R-Rk-Jkaw@mail.gmail.com/
> ) but the conversation died out and I was content to simply use the
> last working kernel version and see if the bug was resolved on its own.
> I raised the issue again on the bugtracker a few days ago, leading to
> this follow up here. :)

OK  cool and thanks a lot for reporting this!

/Jarkko

Jarkko Sakkinen May 28, 2020, 6:55 a.m. UTC | #22

On Wed, May 27, 2020 at 09:40:25PM -0700, Tadeusz Struk wrote:
> On 5/27/20 5:30 PM, Jarkko Sakkinen wrote:
> >> This won't help if the message is read by an async tcti. If the problem lies
> >> in the chip get locality code, perhaps this could help to debug the root-cause
> >> instead of masking it out in the upper layer code:
> > What is TCTI and async TCTI? Not following.
> 
> TPM Command Transmission Interface (TCTI) as defined by TCG in
> https://trustedcomputinggroup.org/resource/tss-tcti-specification/
> 
> the reason we added the O_NONBLOCK mode was to satisfy the TCG spec for async TCTI.
> 
> Thanks,
> Tadeusz

OK, thanks recalling.

/Jarkko

James Bottomley May 28, 2020, 10:33 p.m. UTC | #23

On Wed, 2020-05-27 at 18:10 -0700, Alex Guzman wrote:
> On Wed, 2020-05-27 at 23:15 +0300, Jarkko Sakkinen wrote:
> > On Tue, 2020-05-26 at 19:23 +0000, Mario.Limonciello@dell.com
> > wrote:
> > > Thanks, I don't pretend to understand the nuances of this
> > > particular code, but I was hoping that the request to revert got
> > > some attention since Alex's kernel Bugzilla and message a few
> > > months ago to linux integrity weren't.
> > 
> > Removing linux-kernel from CC since this subsystem internal
> > discussion.
> > 
> > Seeing the whole thing first time today.
> > 
> > Bugzilla is the first thing to ignore when busy. It is good as
> > place holder for bugs, but all discussions should happen only in
> > LKML. There's no official requirement to proactively use Bugzilla
> > for anything.
> > 
> > That said I'm happy that people put stuff there so that it gets
> > logged.
> > 
> > For follow-up's use only LKML if it is important to you. Those will
> > get processed.
> > 
> > As far as this goes, if nothing is heard from me, check that you
> > put me as CC to the original email. Otherwise, I might have missed
> > it (by mistake, not by purpose).
> > 
> > Honestly, I'm not sure what point was this patch when there was
> > time to wait for months without response. Why the passivity for all
> > this time?
> > 
> > /Jarkko
> > 
> 
> It largely went quiet because I didn't raise the issue in the mailing
> list again. I pinged back in February (
> https://lore.kernel.org/linux-integrity/CAJ7-
> PMbujee92N1f9xVF8vtXgS49qpe7qHkeWh1Z0R-Rk-Jkaw@mail.gmail.com/) but
> the conversation died out and I was content to simply use the
> last working kernel version and see if the bug was resolved on its
> own.

I think its just a state of knowledge problem: back in February I
didn't know how unusual EPERM errors are in the TPM so the issue just
flew by as a "this is a curious issue with an O_NONBLOCK path" thing,
but thanks to some key stuff I've been doing I now do.  So this time
your EPERM struck me as "that's impossible surely" which is why I dug
into the code to find out where it was coming from ... and sure enough,
it was impossible: it was an untranslated failure return, but at least
it accidentally told me exactly what the real error was.

So the upshot is you got lucky this time around ...

James

> I raised the issue again on the bugtracker a few days ago, leading to
> this follow up here. :)

tpm: Revert "tpm: fix invalid locking in NONBLOCKING mode"

Commit Message

Comments

Patch