
crypto: AF_ALG AIO - lock context IV

Message ID 11410454.xUfCo2NHDh@positron.chronox.de (mailing list archive)
State Superseded
Delegated to: Herbert Xu

Commit Message

Stephan Mueller Jan. 30, 2018, 8:27 a.m. UTC
Hi Harsh,

may I ask you to try the attached patch on your Chelsio driver? Please invoke 
the following command from libkcapi:

kcapi  -d 2 -x 9  -e -c "cbc(aes)" -k 
8d7dd9b0170ce0b5f2f8e1aa768e01e91da8bfc67fd486d081b28254c99eb423 -i 
7fbc02ebf5b93322329df9bfccb635af -p 48981da18e4bb9ef7e2e3162d16b1910

The result should now be a fully block-chained result.

Note, the patch goes on top of the inline IV patch.

Thanks

---8<---

In case of AIO where multiple IOCBs are sent to the kernel and inline IV
is not used, the ctx->iv is used for each IOCB. The ctx->iv is read for
the crypto operation and written back after the crypto operation. Thus,
processing of multiple IOCBs must be serialized.

As the AF_ALG interface is used by user space, a mutex provides the
serialization; it puts the caller to sleep in case the processing of a
previous IOCB is not yet finished.

If multiple IOCBs arrive and all of them block, the mutex's FIFO handling
of waiters ensures that the blocked IOCBs are unblocked in the right order.

Signed-off-by: Stephan Mueller <smueller@chronox.de>
---
 crypto/af_alg.c         | 22 ++++++++++++++++++++++
 crypto/algif_aead.c     |  2 ++
 crypto/algif_skcipher.c |  2 ++
 include/crypto/if_alg.h |  3 +++
 4 files changed, 29 insertions(+)

--

Comments

Stephan Mueller Jan. 30, 2018, 2:04 p.m. UTC | #1
On Tuesday, 30 January 2018, 09:27:04 CET, Stephan Müller wrote:

Hi,

> +/**
> + * af_alg_put_iv - release lock on IV in case CTX IV is used
> + *
> + * @sk [in] AF_ALG socket
> + */
> +void af_alg_put_iv(struct sock *sk)
> +{
> +	struct alg_sock *ask = alg_sk(sk);
> +	struct af_alg_ctx *ctx = ask->private;
> +
> +	if (!ctx->iiv || !ctx->ivlen)
> +		mutex_unlock(&ctx->ivlock);
> +}
> +EXPORT_SYMBOL_GPL(af_alg_put_iv);

Having this function implies that ctx->iiv must be set once at the beginning 
and cannot be toggled for a CTX thereafter as otherwise the release of the 
mutex is racy. This implies that the inline IV patch needs a slight revision 
to prevent toggling the ctx->iiv value.

I will send a new revision of the inline IV and the lock context IV patch 
covering this issue.

Ciao
Stephan
Jonathan Cameron Jan. 30, 2018, 3:51 p.m. UTC | #2
On Tue, 30 Jan 2018 09:27:04 +0100
Stephan Müller <smueller@chronox.de> wrote:

> Hi Harsh,
> 
> may I as you to try the attached patch on your Chelsio driver? Please invoke 
> the following command from libkcapi:
> 
> kcapi  -d 2 -x 9  -e -c "cbc(aes)" -k 
> 8d7dd9b0170ce0b5f2f8e1aa768e01e91da8bfc67fd486d081b28254c99eb423 -i 
> 7fbc02ebf5b93322329df9bfccb635af -p 48981da18e4bb9ef7e2e3162d16b1910
> 
> The result should now be a fully block-chained result.
> 
> Note, the patch goes on top of the inline IV patch.
> 
> Thanks
> 
> ---8<---
> 
> In case of AIO where multiple IOCBs are sent to the kernel and inline IV
> is not used, the ctx->iv is used for each IOCB. The ctx->iv is read for
> the crypto operation and written back after the crypto operation. Thus,
> processing of multiple IOCBs must be serialized.
> 
> As the AF_ALG interface is used by user space, a mutex provides the
> serialization which puts the caller to sleep in case a previous IOCB
> processing is not yet finished.
> 
> If multiple IOCBs arrive that all are blocked, the mutex' FIFO operation
> of processing arriving requests ensures that the blocked IOCBs are
> unblocked in the right order.
> 
> Signed-off-by: Stephan Mueller <smueller@chronox.de>

As a reference point, holding requests back outside the kernel (at least in
the case of our hardware with 8K CBC) drops us down to around 30%
of the performance we get with separate IVs.  Putting a queue in the kernel
(and hence allowing most setup of descriptors / DMA etc. to proceed) gets us
up to 50% of raw non-chained performance.

So whilst this will work in principle I suspect anyone who
cares about the cost will be looking at adding their own
software queuing and dependency handling in the driver.  How
much that matters will depend on just how quick the hardware
is versus the overheads.

Anyhow, just posted the driver as an RFC. There are a few corners
that need resolving, and the outcome of this thread affects whether
we need to play games with IVs or not.

https://marc.info/?l=linux-crypto-vger&m=151732626428206&w=2
grep for softqueue.

We have a number of processing units that will grab requests
from the HW queues in parallel.  So 'lots' of requests
from a given HW queue may be in flight at the same time (the problem
case).

Logic is fairly simple (a rough code sketch follows the list).
1. Only create a softqueue if the encryption mode requires chaining.
   Currently in this driver that is CBC and CTR modes only.
   Otherwise put everything straight in the hardware monitored queues.
2. If we have an IV embedded within the request it will have a different
   address to the one used previously (guaranteed if that one is still
   in flight and it doesn't matter if it isn't).  If that occurs we
   can safely add it to the hardware queue. To avoid DoS issues against
   an earlier set of messages using a chained IV, we actually make
   sure nothing is in the software queue (not needed for correctness).
3. If IV matches previous IV and there are existing elements in SW or HW
   queues we need to hold it in the SW queue.
4. On receiving a response from the hardware and if the hardware queue
   is empty, we can release an element from the software queue into the
   hardware queue.
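
To make steps 1-4 concrete, here is a minimal sketch of that decision logic
in kernel-style C. All identifiers (sec_queue, sec_request, sec_hw_enqueue,
sec_hw_queue_empty, ...) are made up for illustration, locking is omitted,
and the sketch is not taken verbatim from the RFC driver.

```c
#include <linux/list.h>
#include <linux/types.h>

/* Illustrative types only; the RFC driver's structures differ. */
struct sec_request {
	struct list_head list;
	void *iv;			/* IV buffer supplied with this request */
};

struct sec_queue {
	struct list_head softqueue;	/* requests waiting on an IV update */
	void *last_iv;			/* IV buffer of the last HW submission */
	bool chained_mode;		/* CBC/CTR: IV chaining required */
};

/* Hypothetical helpers for the hardware-monitored queue. */
int sec_hw_enqueue(struct sec_queue *q, struct sec_request *req);
bool sec_hw_queue_empty(struct sec_queue *q);

static int sec_queue_send(struct sec_queue *q, struct sec_request *req)
{
	/* 1. No chaining mode: everything goes straight to hardware. */
	if (!q->chained_mode)
		return sec_hw_enqueue(q, req);

	/*
	 * 2./3. A request whose IV buffer matches the previous submission
	 * depends on that submission updating the IV, so it must wait while
	 * anything is still in flight.  A request with a different IV buffer
	 * may go straight to hardware, but only once the software queue has
	 * drained (the DoS-avoidance point in step 2).
	 */
	if (!list_empty(&q->softqueue) ||
	    (req->iv == q->last_iv && !sec_hw_queue_empty(q))) {
		list_add_tail(&req->list, &q->softqueue);
		return 0;
	}

	q->last_iv = req->iv;
	return sec_hw_enqueue(q, req);
}

/* 4. Completion path: release the next waiter once the HW queue is empty. */
static void sec_queue_complete_one(struct sec_queue *q)
{
	struct sec_request *next;

	if (!sec_hw_queue_empty(q) || list_empty(&q->softqueue))
		return;

	next = list_first_entry(&q->softqueue, struct sec_request, list);
	list_del(&next->list);
	q->last_iv = next->iv;
	sec_hw_enqueue(q, next);
}
```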

This maintains chaining when needed and gets a good chunk of the lost
performance back.  I believe (though I have yet to verify this) the remaining
lost performance is mostly down to the fact that we can't coalesce interrupts
when chaining.

Note the driver is deliberately a touch 'simplistic', so there may be other
optimization opportunities to get some of the lost performance back, or the
loss may be fundamental because the raw processing can't run in parallel.

> ---
>  crypto/af_alg.c         | 22 ++++++++++++++++++++++
>  crypto/algif_aead.c     |  2 ++
>  crypto/algif_skcipher.c |  2 ++
>  include/crypto/if_alg.h |  3 +++
>  4 files changed, 29 insertions(+)
> 
> diff --git a/crypto/af_alg.c b/crypto/af_alg.c
> index 87394dc90ac0..3007c9d899da 100644
> --- a/crypto/af_alg.c
> +++ b/crypto/af_alg.c
> @@ -1059,6 +1059,8 @@ void af_alg_async_cb(struct crypto_async_request *_req, 
> int err)
>  	struct kiocb *iocb = areq->iocb;
>  	unsigned int resultlen;
>  
> +	af_alg_put_iv(sk);
> +
>  	/* Buffer size written by crypto operation. */
>  	resultlen = areq->outlen;
>  
> @@ -1227,6 +1229,11 @@ int af_alg_get_iv(struct sock *sk, struct 
> af_alg_async_req *areq)
>  
>  	/* No inline IV or cipher has no IV, use ctx IV buffer. */
>  	if (!ctx->iiv || !ctx->ivlen) {
> +		int ret = mutex_lock_interruptible(&ctx->ivlock);
> +
> +		if (ret)
> +			return ret;
> +
>  		areq->iv = ctx->iv;
>  		areq->ivlen = 0;	// areq->iv will not be freed
>  		return 0;
> @@ -1252,6 +1259,21 @@ int af_alg_get_iv(struct sock *sk, struct 
> af_alg_async_req *areq)
>  }
>  EXPORT_SYMBOL_GPL(af_alg_get_iv);
>  
> +/**
> + * af_alg_put_iv - release lock on IV in case CTX IV is used
> + *
> + * @sk [in] AF_ALG socket
> + */
> +void af_alg_put_iv(struct sock *sk)
> +{
> +	struct alg_sock *ask = alg_sk(sk);
> +	struct af_alg_ctx *ctx = ask->private;
> +
> +	if (!ctx->iiv || !ctx->ivlen)
> +		mutex_unlock(&ctx->ivlock);
> +}
> +EXPORT_SYMBOL_GPL(af_alg_put_iv);
> +
>  static int __init af_alg_init(void)
>  {
>  	int err = proto_register(&alg_proto, 0);
> diff --git a/crypto/algif_aead.c b/crypto/algif_aead.c
> index 7eb7cb132c09..165b2ca82e51 100644
> --- a/crypto/algif_aead.c
> +++ b/crypto/algif_aead.c
> @@ -309,6 +309,7 @@ static int _aead_recvmsg(struct socket *sock, struct 
> msghdr *msg,
>  				&ctx->wait);
>  	}
>  
> +	af_alg_put_iv(sk);
>  
>  free:
>  	af_alg_free_resources(areq);
> @@ -555,6 +556,7 @@ static int aead_accept_parent_nokey(void *private, struct 
> sock *sk)
>  		return -ENOMEM;
>  	}
>  	memset(ctx->iv, 0, ctx->ivlen);
> +	mutex_init(&ctx->ivlock);
>  
>  	INIT_LIST_HEAD(&ctx->tsgl_list);
>  	ctx->len = len;
> diff --git a/crypto/algif_skcipher.c b/crypto/algif_skcipher.c
> index d40e1d6797d8..a759cec446b4 100644
> --- a/crypto/algif_skcipher.c
> +++ b/crypto/algif_skcipher.c
> @@ -151,6 +151,7 @@ static int _skcipher_recvmsg(struct socket *sock, struct 
> msghdr *msg,
>  						 &ctx->wait);
>  	}
> +	af_alg_put_iv(sk);
>  
>  free:
>  	af_alg_free_resources(areq);
> @@ -358,6 +359,7 @@ static int skcipher_accept_parent_nokey(void *private, 
> struct sock *sk)
>  	}
>  	memset(ctx->iv, 0, ctx->ivlen);
> +	mutex_init(&ctx->ivlock);
>  
>  	INIT_LIST_HEAD(&ctx->tsgl_list);
>  	ctx->len = len;
> diff --git a/include/crypto/if_alg.h b/include/crypto/if_alg.h
> index ebc651ceb54a..666be8ac683c 100644
> --- a/include/crypto/if_alg.h
> +++ b/include/crypto/if_alg.h
> @@ -19,6 +19,7 @@
>  #include <linux/scatterlist.h>
>  #include <linux/types.h>
>  #include <linux/atomic.h>
> +#include <linux/mutex.h>
>  #include <net/sock.h>
>  
>  #include <crypto/aead.h>
> @@ -160,6 +161,7 @@ struct af_alg_ctx {
>  
>  	void *iv;
>  	unsigned int ivlen;
> +	struct mutex ivlock;
>  	size_t aead_assoclen;
>  
>  	struct crypto_wait wait;
> @@ -268,6 +270,7 @@ int af_alg_get_rsgl(struct sock *sk, struct msghdr *msg, 
> int flags,
>  		    struct af_alg_async_req *areq, size_t maxsize,
>  		    size_t *outlen);
>  int af_alg_get_iv(struct sock *sk, struct af_alg_async_req *areq);
> +void af_alg_put_iv(struct sock *sk);
>  struct scatterlist *af_alg_get_tsgl_sg(struct af_alg_ctx *ctx);
>  
>  #endif	/* _CRYPTO_IF_ALG_H */
Jonathan Cameron Jan. 31, 2018, 12:29 p.m. UTC | #3
On Tue, 30 Jan 2018 15:51:40 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:

> On Tue, 30 Jan 2018 09:27:04 +0100
> Stephan Müller <smueller@chronox.de> wrote:

A few clarifications from me after discussions with Stephan this morning.

> 
> > Hi Harsh,
> > 
> > may I as you to try the attached patch on your Chelsio driver? Please invoke 
> > the following command from libkcapi:
> > 
> > kcapi  -d 2 -x 9  -e -c "cbc(aes)" -k 
> > 8d7dd9b0170ce0b5f2f8e1aa768e01e91da8bfc67fd486d081b28254c99eb423 -i 
> > 7fbc02ebf5b93322329df9bfccb635af -p 48981da18e4bb9ef7e2e3162d16b1910
> > 
> > The result should now be a fully block-chained result.
> > 
> > Note, the patch goes on top of the inline IV patch.
> > 
> > Thanks
> > 
> > ---8<---
> > 
> > In case of AIO where multiple IOCBs are sent to the kernel and inline IV
> > is not used, the ctx->iv is used for each IOCB. The ctx->iv is read for
> > the crypto operation and written back after the crypto operation. Thus,
> > processing of multiple IOCBs must be serialized.
> > 
> > As the AF_ALG interface is used by user space, a mutex provides the
> > serialization which puts the caller to sleep in case a previous IOCB
> > processing is not yet finished.
> > 
> > If multiple IOCBs arrive that all are blocked, the mutex' FIFO operation
> > of processing arriving requests ensures that the blocked IOCBs are
> > unblocked in the right order.
> > 
> > Signed-off-by: Stephan Mueller <smueller@chronox.de> 

Firstly, as far as I can tell (I have not tested it yet) the patch does the
job and is about the best we can easily do in the AF_ALG code.

I'd suggest that this (or v2 anyway) is stable material as well
(which, as Stephan observed, will require reordering of the two patches).

> 
> As a reference point, holding outside the kernel (in at least
> the case of our hardware with 8K CBC) drops us down to around 30%
> of performance with separate IVs.  Putting a queue in kernel
> (and hence allowing most setup of descriptors / DMA etc) gets us
> up to 50% of raw non chained performance.
> 
> So whilst this will work in principle I suspect anyone who
> cares about the cost will be looking at adding their own
> software queuing and dependency handling in driver.  How
> much that matters will depend on just how quick the hardware
> is vs overheads.
> 
> Anyhow, just posted the driver as an RFC. There are a few corners
> that need resolving and the outcome of this thread effects whether
> we need to play games with IVs or not.
> 
> https://marc.info/?l=linux-crypto-vger&m=151732626428206&w=2
> grep for softqueue.
> 
> We have a number of processing units that will grab requests
> from the HW queues in parallel.  So 'lots' of requests
> from a given HW queue may be in flight at the same time (the problem
> case).
> 
> Logic is fairly simple.
> 1. Only create a softqueue if the encryption mode requires chaining.
>    Currently in this driver that is CBC and CTR modes only.
>    Otherwise put everything straight in the hardware monitored queues.
> 2. If we have an IV embedded within the request it will have a different
>    address to the one used previously (guaranteed if that one is still
>    in flight and it doesn't matter if it isn't).  If that occurs we
>    can safely add it to the hardware queue. To avoid DoS issues against
>    and earlier set of messages using the a chained IV we actually make
>    sure nothing is in the software queue (not needed for correctness)
> 3. If IV matches previous IV and there are existing elements in SW or HW
>    queues we need to hold it in the SW queue.
> 4. On receiving a response from the hardware and if the hardware queue
>    is empty, we can release an element from the software queue into the
>    hardware queue.

The differences are better shown with pictures...
To compare the two approaches, if we break up the data flow the
alternatives are:

1) Mutex causes queuing in AF_ALG code

The key thing is that the [Build / Map HW Descs] step for small packets,
up to perhaps 32K, is a significant task we can't avoid.
At 8k it looks like it takes perhaps 20-30% of the time
(though I only have end-to-end performance numbers so far).


[REQUEST 1 from userspace]   
           |                  [REQUEST 2 from userspace] 
    [AF_ALG/SOCKET]                       |
           |                       [AF_ALG/SOCKET]
    NOTHING BLOCKING (lock mut)           |
           |                        Queued on Mutex
 [BUILD / MAP HW DESCS]                   | 
           |                              |
   [Place in HW Queue]                    |
           |                              |
       [Process]                          |
           |                              |
      [Interrupt]                         | 
           |                              |
 [Respond and update IV]     Nothing Blocking (lock mut)        
           |                    [BUILD / MAP HW DESCS] * AFTER SERIALIZATION *     
 Don't care beyond here.                  |
                                  [Place in HW Queue]
                                          |
                                      [Process] 
                                          |
                                     [Interrupt]
                                          |
                                [Respond and update IV]
                                          |
                                Don't care beyond here.   


2) The queuing approach I did in the driver moves that serialization
to after the BUILD / MAP HW DESCs stage.

[REQUEST 1 from userspace]   
           |                  [REQUEST 2 from userspace] 
    [AF_ALG/SOCKET]                       |
           |                       [AF_ALG/SOCKET]
 [BUILD / MAP HW DESCS]                   |
           |                     [BUILD / MAP HW DESCS] * BEFORE SERIALIZATION *
 NOTHING BLOCKING                         | 
           |                   IV in flight (enqueue sw queue)
   [Place in HW Queue]                    |
           |                              |
       [Process]                          |
           |                              |
      [Interrupt]                         | 
           |                              |
 [Respond and update IV]     Nothing Blocking (dequeue sw queue)        
           |                      [Place in HW Queue]      
 Don't care beyond here.                  |
                                      [Process]
                                          |
                                     [Interrupt]
                                          |
                                [Respond and update IV]
                                          |
                                Don't care beyond here. 

> 
> This maintains chaining when needed and gets a good chunk of the lost
> performance back.  I believe (though yet to verify) the remaining lost
> performance is mostly down to the fact we can't coalesce interrupts if
> chaining.
> 
> Note the driver is deliberately a touch 'simplistic' so there may be other
> optimization opportunities to get some of the lost performance back or
> it may be fundamental due to the fact the raw processing can't be in
> parallel.
> 

Quoting a private email from Stephan (at Stephan's suggestion)

> What I however could fathom is that we introduce a flag where a driver 
> implements its own queuing mechanism. In this case, the mutex handling would 
> be disregarded.  

Which works well for the sort of optimization I did and for hardware that
can do IV dependency tracking itself. If hardware dependency tracking were
available, you would be able to queue up requests with a chained IV
without taking any special precautions at all. The hardware would
handle the IV update dependencies.

So in conclusion, Stephan's approach should work and give us a nice
small patch suitable for stable inclusion.

However, if people know that their setup overhead can be put in parallel
with previous requests (even when the IV is not yet updated) then they will
probably want to do something inside their own driver and set the flag
that Stephan is proposing adding to bypass the mutex approach.
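
Purely as an illustration of what such a bypass could look like inside
af_alg_get_iv(), here is a hedged sketch. The ctx->drv_serializes_iv field is
hypothetical (imagined to be filled in at accept() time from a capability flag
exported by the driver); the surrounding logic is an excerpt of the posted
patch, not a complete function.

```c
/* Sketch only: ctx->drv_serializes_iv is a hypothetical field, not part of
 * the posted patches. */
int af_alg_get_iv(struct sock *sk, struct af_alg_async_req *areq)
{
	struct alg_sock *ask = alg_sk(sk);
	struct af_alg_ctx *ctx = ask->private;

	/* No inline IV or cipher has no IV, use ctx IV buffer. */
	if (!ctx->iiv || !ctx->ivlen) {
		if (!ctx->drv_serializes_iv) {
			/* Driver gave no guarantee, serialize in AF_ALG. */
			int ret = mutex_lock_interruptible(&ctx->ivlock);

			if (ret)
				return ret;
		}

		areq->iv = ctx->iv;
		areq->ivlen = 0;	/* areq->iv will not be freed */
		return 0;
	}

	/* Inline IV handling would continue here as in the posted patch. */
	return 0;
}
```

af_alg_put_iv() would of course need the matching check before unlocking.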

Jonathan
> > ---
> >  crypto/af_alg.c         | 22 ++++++++++++++++++++++
> >  crypto/algif_aead.c     |  2 ++
> >  crypto/algif_skcipher.c |  2 ++
> >  include/crypto/if_alg.h |  3 +++
> >  4 files changed, 29 insertions(+)
> > 
> > diff --git a/crypto/af_alg.c b/crypto/af_alg.c
> > index 87394dc90ac0..3007c9d899da 100644
> > --- a/crypto/af_alg.c
> > +++ b/crypto/af_alg.c
> > @@ -1059,6 +1059,8 @@ void af_alg_async_cb(struct crypto_async_request *_req, 
> > int err)
> >  	struct kiocb *iocb = areq->iocb;
> >  	unsigned int resultlen;
> >  
> > +	af_alg_put_iv(sk);
> > +
> >  	/* Buffer size written by crypto operation. */
> >  	resultlen = areq->outlen;
> >  
> > @@ -1227,6 +1229,11 @@ int af_alg_get_iv(struct sock *sk, struct 
> > af_alg_async_req *areq)
> >  
> >  	/* No inline IV or cipher has no IV, use ctx IV buffer. */
> >  	if (!ctx->iiv || !ctx->ivlen) {
> > +		int ret = mutex_lock_interruptible(&ctx->ivlock);
> > +
> > +		if (ret)
> > +			return ret;
> > +
> >  		areq->iv = ctx->iv;
> >  		areq->ivlen = 0;	// areq->iv will not be freed
> >  		return 0;
> > @@ -1252,6 +1259,21 @@ int af_alg_get_iv(struct sock *sk, struct 
> > af_alg_async_req *areq)
> >  }
> >  EXPORT_SYMBOL_GPL(af_alg_get_iv);
> >  
> > +/**
> > + * af_alg_put_iv - release lock on IV in case CTX IV is used
> > + *
> > + * @sk [in] AF_ALG socket
> > + */
> > +void af_alg_put_iv(struct sock *sk)
> > +{
> > +	struct alg_sock *ask = alg_sk(sk);
> > +	struct af_alg_ctx *ctx = ask->private;
> > +
> > +	if (!ctx->iiv || !ctx->ivlen)
> > +		mutex_unlock(&ctx->ivlock);
> > +}
> > +EXPORT_SYMBOL_GPL(af_alg_put_iv);
> > +
> >  static int __init af_alg_init(void)
> >  {
> >  	int err = proto_register(&alg_proto, 0);
> > diff --git a/crypto/algif_aead.c b/crypto/algif_aead.c
> > index 7eb7cb132c09..165b2ca82e51 100644
> > --- a/crypto/algif_aead.c
> > +++ b/crypto/algif_aead.c
> > @@ -309,6 +309,7 @@ static int _aead_recvmsg(struct socket *sock, struct 
> > msghdr *msg,
> >  				&ctx->wait);
> >  	}
> >  
> > +	af_alg_put_iv(sk);
> >  
> >  free:
> >  	af_alg_free_resources(areq);
> > @@ -555,6 +556,7 @@ static int aead_accept_parent_nokey(void *private, struct 
> > sock *sk)
> >  		return -ENOMEM;
> >  	}
> >  	memset(ctx->iv, 0, ctx->ivlen);
> > +	mutex_init(&ctx->ivlock);
> >  
> >  	INIT_LIST_HEAD(&ctx->tsgl_list);
> >  	ctx->len = len;
> > diff --git a/crypto/algif_skcipher.c b/crypto/algif_skcipher.c
> > index d40e1d6797d8..a759cec446b4 100644
> > --- a/crypto/algif_skcipher.c
> > +++ b/crypto/algif_skcipher.c
> > @@ -151,6 +151,7 @@ static int _skcipher_recvmsg(struct socket *sock, struct 
> > msghdr *msg,
> >  						 &ctx->wait);
> >  	}
> > +	af_alg_put_iv(sk);
> >  
> >  free:
> >  	af_alg_free_resources(areq);
> > @@ -358,6 +359,7 @@ static int skcipher_accept_parent_nokey(void *private, 
> > struct sock *sk)
> >  	}
> >  	memset(ctx->iv, 0, ctx->ivlen);
> > +	mutex_init(&ctx->ivlock);
> >  
> >  	INIT_LIST_HEAD(&ctx->tsgl_list);
> >  	ctx->len = len;
> > diff --git a/include/crypto/if_alg.h b/include/crypto/if_alg.h
> > index ebc651ceb54a..666be8ac683c 100644
> > --- a/include/crypto/if_alg.h
> > +++ b/include/crypto/if_alg.h
> > @@ -19,6 +19,7 @@
> >  #include <linux/scatterlist.h>
> >  #include <linux/types.h>
> >  #include <linux/atomic.h>
> > +#include <linux/mutex.h>
> >  #include <net/sock.h>
> >  
> >  #include <crypto/aead.h>
> > @@ -160,6 +161,7 @@ struct af_alg_ctx {
> >  
> >  	void *iv;
> >  	unsigned int ivlen;
> > +	struct mutex ivlock;
> >  	size_t aead_assoclen;
> >  
> >  	struct crypto_wait wait;
> > @@ -268,6 +270,7 @@ int af_alg_get_rsgl(struct sock *sk, struct msghdr *msg, 
> > int flags,
> >  		    struct af_alg_async_req *areq, size_t maxsize,
> >  		    size_t *outlen);
> >  int af_alg_get_iv(struct sock *sk, struct af_alg_async_req *areq);
> > +void af_alg_put_iv(struct sock *sk);
> >  struct scatterlist *af_alg_get_tsgl_sg(struct af_alg_ctx *ctx);
> >  
> >  #endif	/* _CRYPTO_IF_ALG_H */  
>
Gilad Ben-Yossef Feb. 1, 2018, 9:35 a.m. UTC | #4
Hi,

On Wed, Jan 31, 2018 at 2:29 PM, Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
> On Tue, 30 Jan 2018 15:51:40 +0000
> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
>
>> On Tue, 30 Jan 2018 09:27:04 +0100
>> Stephan Müller <smueller@chronox.de> wrote:
>
> A few clarifications from me after discussions with Stephan this morning.
>
>>
>> > Hi Harsh,
>> >
>> > may I as you to try the attached patch on your Chelsio driver? Please invoke
>> > the following command from libkcapi:
>> >
>> > kcapi  -d 2 -x 9  -e -c "cbc(aes)" -k
>> > 8d7dd9b0170ce0b5f2f8e1aa768e01e91da8bfc67fd486d081b28254c99eb423 -i
>> > 7fbc02ebf5b93322329df9bfccb635af -p 48981da18e4bb9ef7e2e3162d16b1910
>> >
>> > The result should now be a fully block-chained result.
>> >
>> > Note, the patch goes on top of the inline IV patch.
>> >
>> > Thanks
>> >
>> > ---8<---
>> >
>> > In case of AIO where multiple IOCBs are sent to the kernel and inline IV
>> > is not used, the ctx->iv is used for each IOCB. The ctx->iv is read for
>> > the crypto operation and written back after the crypto operation. Thus,
>> > processing of multiple IOCBs must be serialized.
>> >
>> > As the AF_ALG interface is used by user space, a mutex provides the
>> > serialization which puts the caller to sleep in case a previous IOCB
>> > processing is not yet finished.
>> >
>> > If multiple IOCBs arrive that all are blocked, the mutex' FIFO operation
>> > of processing arriving requests ensures that the blocked IOCBs are
>> > unblocked in the right order.
>> >
>> > Signed-off-by: Stephan Mueller <smueller@chronox.de>
>
> Firstly, as far as I can tell (not tested it yet) the patch does the
> job and is about the best we can easily do in the AF_ALG code.
>
> I'd suggest that this (or v2 anyway) is stable material as well
> (which, as Stephan observed, will require reordering of the two patches).
>
>>
>> As a reference point, holding outside the kernel (in at least
>> the case of our hardware with 8K CBC) drops us down to around 30%
>> of performance with separate IVs.  Putting a queue in kernel
>> (and hence allowing most setup of descriptors / DMA etc) gets us
>> up to 50% of raw non chained performance.
>>
>> So whilst this will work in principle I suspect anyone who
>> cares about the cost will be looking at adding their own
>> software queuing and dependency handling in driver.  How
>> much that matters will depend on just how quick the hardware
>> is vs overheads.
>>

<snip>

>
> The differences are better shown with pictures...
> To compare the two approaches. If we break up the data flow the
> alternatives are
>
> 1) Mutex causes queuing in AF ALG code
>
> The key thing is the [Build / Map HW Descs] for small packets,
> up to perhaps 32K, is a significant task we can't avoid.
> At 8k it looks like it takes perhaps 20-30% of the time
> (though I only have end to end performance numbers so far)
>
>
> [REQUEST 1 from userspace]
>            |                  [REQUEST 2 from userspace]
>     [AF_ALG/SOCKET]                       |
>            |                       [AF_ALG/SOCKET]
>     NOTHING BLOCKING (lock mut)           |
>            |                        Queued on Mutex
>  [BUILD / MAP HW DESCS]                   |
>            |                              |
>    [Place in HW Queue]                    |
>            |                              |
>        [Process]                          |
>            |                              |
>       [Interrupt]                         |
>            |                              |
>  [Respond and update IV]     Nothing Blocking (lock mut)
>            |                    [BUILD / MAP HW DESCS] * AFTER SERIALIZATION *
>  Don't care beyond here.                  |
>                                   [Place in HW Queue]
>                                           |
>                                       [Process]
>                                           |
>                                      [Interrupt]
>                                           |
>                                 [Respond and update IV]
>                                           |
>                                 Don't care beyond here.
>
>
> 2) The queuing approach I did in the driver moves that serialization
> to after the BUILD / MAP HW DESCs stage.
>
> [REQUEST 1 from userspace]
>            |                  [REQUEST 2 from userspace]
>     [AF_ALG/SOCKET]                       |
>            |                       [AF_ALG/SOCKET]
>  [BUILD / MAP HW DESCS]                   |
>            |                     [BUILD / MAP HW DESCS] * BEFORE SERIALIZATION *
>  NOTHING BLOCKING                         |
>            |                   IV in flight (enqueue sw queue)
>    [Place in HW Queue]                    |
>            |                              |
>        [Process]                          |
>            |                              |
>       [Interrupt]                         |
>            |                              |
>  [Respond and update IV]     Nothing Blocking (dequeue sw queue)
>            |                      [Place in HW Queue]
>  Don't care beyond here.                  |
>                                       [Process]
>                                           |
>                                      [Interrupt]
>                                           |
>                                 [Respond and update IV]
>                                           |
>                                 Don't care beyond here.
>
>>
>> This maintains chaining when needed and gets a good chunk of the lost
>> performance back.  I believe (though yet to verify) the remaining lost
>> performance is mostly down to the fact we can't coalesce interrupts if
>> chaining.
>>
>> Note the driver is deliberately a touch 'simplistic' so there may be other
>> optimization opportunities to get some of the lost performance back or
>> it may be fundamental due to the fact the raw processing can't be in
>> parallel.
>>
>
> Quoting a private email from Stephan (at Stephan's suggestion)
>
>> What I however could fathom is that we introduce a flag where a driver
>> implements its own queuing mechanism. In this case, the mutex handling would
>> be disregarded.
>
> Which works well for the sort of optimization I did and for hardware that
> can do iv dependency tracking itself. If hardware dependency tracking was
> avilable, you would be able to queue up requests with a chained IV
> without taking any special precautions at all. The hardware would
> handle the IV update dependencies.
>
> So in conclusion, Stephan's approach should work and give us a nice
> small patch suitable for stable inclusion.
>
> However, if people know that their setup overhead can be put in parallel
> with previous requests (even when the IV is not yet updated) then they will
> probably want to do something inside their own driver and set the flag
> that Stephan is proposing adding to bypass the mutex approach.
>

The patches from Stephan look good to me, but I think we can do better
for the long term approach you are discussing.

Consider that the issue we are dealing with is in no way unique to the AF_ALG
use case; it just happens to expose it.

The grander issue is, I believe, that we are missing an API for chained crypto
operations. One that allows submitting multiple concurrent but
dependent requests
to tfm providers.

Trying to second-guess whether or not there is a dependency between calls
based on the address of the IV is not a clean solution. Why don't we make it explicit?

For example, a flag that can be set on a tfm that states that the subsequent
series of requests are interdependent. If you need a different stream,
simply allocate another tfm object.
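
A rough sketch of what that could look like from the caller's side. The flag
name CRYPTO_TFM_REQ_IV_SERIALIZE is made up to illustrate the idea; only the
allocation, crypto_skcipher_set_flags() and free calls are existing API.

```c
#include <crypto/skcipher.h>
#include <linux/err.h>

/* Hypothetical flag: declares that requests on this tfm are interdependent. */
#define CRYPTO_TFM_REQ_IV_SERIALIZE 0x00001000

static int example_setup(struct crypto_skcipher **chained,
			 struct crypto_skcipher **independent)
{
	/* One tfm per dependency stream. */
	*chained = crypto_alloc_skcipher("cbc(aes)", 0, 0);
	if (IS_ERR(*chained))
		return PTR_ERR(*chained);

	/* Declare: requests on this tfm form one IV-chained stream. */
	crypto_skcipher_set_flags(*chained, CRYPTO_TFM_REQ_IV_SERIALIZE);

	/* A second, unrelated stream simply gets its own tfm. */
	*independent = crypto_alloc_skcipher("cbc(aes)", 0, 0);
	if (IS_ERR(*independent)) {
		crypto_free_skcipher(*chained);
		return PTR_ERR(*independent);
	}

	return 0;
}
```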

This will let each driver do its best, be it a simple mutex, software
queuing or hardware dependency tracking, as the case may be.

Then of course, AF_ALG code (or any other user for that matter) will
not need to handle interdependence; it just sets the right flag.

Do you think this makes sense?

Thanks,
Gilad
Stephan Mueller Feb. 1, 2018, 9:46 a.m. UTC | #5
On Thursday, 1 February 2018, 10:35:07 CET, Gilad Ben-Yossef wrote:

Hi Gilad,

> Hi,
> 
> On Wed, Jan 31, 2018 at 2:29 PM, Jonathan Cameron
> 
> <Jonathan.Cameron@huawei.com> wrote:
> > On Tue, 30 Jan 2018 15:51:40 +0000
> > 
> > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> >> On Tue, 30 Jan 2018 09:27:04 +0100
> > 
> >> Stephan Müller <smueller@chronox.de> wrote:
> > A few clarifications from me after discussions with Stephan this morning.
> > 
> >> > Hi Harsh,
> >> > 
> >> > may I as you to try the attached patch on your Chelsio driver? Please
> >> > invoke the following command from libkcapi:
> >> > 
> >> > kcapi  -d 2 -x 9  -e -c "cbc(aes)" -k
> >> > 8d7dd9b0170ce0b5f2f8e1aa768e01e91da8bfc67fd486d081b28254c99eb423 -i
> >> > 7fbc02ebf5b93322329df9bfccb635af -p 48981da18e4bb9ef7e2e3162d16b1910
> >> > 
> >> > The result should now be a fully block-chained result.
> >> > 
> >> > Note, the patch goes on top of the inline IV patch.
> >> > 
> >> > Thanks
> >> > 
> >> > ---8<---
> >> > 
> >> > In case of AIO where multiple IOCBs are sent to the kernel and inline
> >> > IV
> >> > is not used, the ctx->iv is used for each IOCB. The ctx->iv is read for
> >> > the crypto operation and written back after the crypto operation. Thus,
> >> > processing of multiple IOCBs must be serialized.
> >> > 
> >> > As the AF_ALG interface is used by user space, a mutex provides the
> >> > serialization which puts the caller to sleep in case a previous IOCB
> >> > processing is not yet finished.
> >> > 
> >> > If multiple IOCBs arrive that all are blocked, the mutex' FIFO
> >> > operation
> >> > of processing arriving requests ensures that the blocked IOCBs are
> >> > unblocked in the right order.
> >> > 
> >> > Signed-off-by: Stephan Mueller <smueller@chronox.de>
> > 
> > Firstly, as far as I can tell (not tested it yet) the patch does the
> > job and is about the best we can easily do in the AF_ALG code.
> > 
> > I'd suggest that this (or v2 anyway) is stable material as well
> > (which, as Stephan observed, will require reordering of the two patches).
> > 
> >> As a reference point, holding outside the kernel (in at least
> >> the case of our hardware with 8K CBC) drops us down to around 30%
> >> of performance with separate IVs.  Putting a queue in kernel
> >> (and hence allowing most setup of descriptors / DMA etc) gets us
> >> up to 50% of raw non chained performance.
> >> 
> >> So whilst this will work in principle I suspect anyone who
> >> cares about the cost will be looking at adding their own
> >> software queuing and dependency handling in driver.  How
> >> much that matters will depend on just how quick the hardware
> >> is vs overheads.
> 
> <snip>
> 
> > The differences are better shown with pictures...
> > To compare the two approaches. If we break up the data flow the
> > alternatives are
> > 
> > 1) Mutex causes queuing in AF ALG code
> > 
> > The key thing is the [Build / Map HW Descs] for small packets,
> > up to perhaps 32K, is a significant task we can't avoid.
> > At 8k it looks like it takes perhaps 20-30% of the time
> > (though I only have end to end performance numbers so far)
> > 
> > 
> > [REQUEST 1 from userspace]
> >            |                  [REQUEST 2 from userspace]
> >     [AF_ALG/SOCKET]                       |
> >            |                       [AF_ALG/SOCKET]
> >     NOTHING BLOCKING (lock mut)           |
> >            |                        Queued on Mutex
> >  [BUILD / MAP HW DESCS]                   |
> >            |                              |
> >    [Place in HW Queue]                    |
> >            |                              |
> >        [Process]                          |
> >            |                              |
> >       [Interrupt]                         |
> >            |                              |
> >  [Respond and update IV]     Nothing Blocking (lock mut)
> >            |                    [BUILD / MAP HW DESCS] * AFTER SERIALIZATION *
> >  Don't care beyond here.                  |
> >                                   [Place in HW Queue]
> >                                           |
> >                                       [Process]
> >                                           |
> >                                      [Interrupt]
> >                                           |
> >                                 [Respond and update IV]
> >                                           |
> >                                 Don't care beyond here.
> >
> > 2) The queuing approach I did in the driver moves that serialization
> > to after the BUILD / MAP HW DESCs stage.
> >
> > [REQUEST 1 from userspace]
> >            |                  [REQUEST 2 from userspace]
> >     [AF_ALG/SOCKET]                       |
> >            |                       [AF_ALG/SOCKET]
> >  [BUILD / MAP HW DESCS]                   |
> >            |                     [BUILD / MAP HW DESCS] * BEFORE SERIALIZATION *
> >  NOTHING BLOCKING                         |
> >            |                   IV in flight (enqueue sw queue)
> >    [Place in HW Queue]                    |
> >            |                              |
> >        [Process]                          |
> >            |                              |
> >       [Interrupt]                         |
> >            |                              |
> >  [Respond and update IV]     Nothing Blocking (dequeue sw queue)
> >            |                      [Place in HW Queue]
> >  Don't care beyond here.                  |
> >                                       [Process]
> >                                           |
> >                                      [Interrupt]
> >                                           |
> >                                 [Respond and update IV]
> >                                           |
> >                                 Don't care beyond here.
> >> 
> >> This maintains chaining when needed and gets a good chunk of the lost
> >> performance back.  I believe (though yet to verify) the remaining lost
> >> performance is mostly down to the fact we can't coalesce interrupts if
> >> chaining.
> >> 
> >> Note the driver is deliberately a touch 'simplistic' so there may be
> >> other
> >> optimization opportunities to get some of the lost performance back or
> >> it may be fundamental due to the fact the raw processing can't be in
> >> parallel.
> > 
> > Quoting a private email from Stephan (at Stephan's suggestion)
> > 
> >> What I however could fathom is that we introduce a flag where a driver
> >> implements its own queuing mechanism. In this case, the mutex handling
> >> would be disregarded.
> > 
> > Which works well for the sort of optimization I did and for hardware that
> > can do iv dependency tracking itself. If hardware dependency tracking was
> > avilable, you would be able to queue up requests with a chained IV
> > without taking any special precautions at all. The hardware would
> > handle the IV update dependencies.
> > 
> > So in conclusion, Stephan's approach should work and give us a nice
> > small patch suitable for stable inclusion.
> > 
> > However, if people know that their setup overhead can be put in parallel
> > with previous requests (even when the IV is not yet updated) then they
> > will
> > probably want to do something inside their own driver and set the flag
> > that Stephan is proposing adding to bypass the mutex approach.
> 
> The patches from Stephan looks good to me, but I think we can do better
> for the long term approach you are discussing.
> 
> Consider that the issue we are dealing with is in no way unique to the
> AF_ALG use case, it just happens to expose it.
> 
> The grander issue is, I believe, that we are missing an API for chained
> crypto operations. One that allows submitting multiple concurrent but
> dependent requests
> to tfm providers.
> 
> Trying to second guess whether or not there is a dependency between calls
> from the address of the IV is not a clean solution. Why don't we make it
> explicit?
> 
> For example, a flag that can be set on a tfm that states that the subsequent
> series of requests are interdependent. If you need a different stream,
> simply allocated anotehr
> tfm object.

But this is already implemented, is it not, considering the discussed patches? 
Again, here are the use cases I see for AIO:

- multiple IOCBs which are totally separate and have no interdependencies: the 
inline IV approach

- multiple IOCBs which are interdependent: the current API with the patches 
that we are discussing here (note, a new patch set is coming after the merge 
window)

- if you have several independent invocations of multiple interdependent 
IOCBs: call accept() again on the TFM-FD to get another operations FD. This is 
followed by either option one or two above. Note, this is the idea behind the 
kcapi_handle_reinit() call just added to libkcapi. So, you have multiple 
operation-FDs on which you do either interdependent operations or inline IV 
handling (a minimal userspace sketch of this follows below).
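
To make the multi-FD idea concrete, here is a minimal userspace sketch: one
tfm socket, and one accept()ed operation FD per independent stream. Error
handling is omitted, the key is a placeholder, and the AIO submission itself
is elided.

```c
#include <unistd.h>
#include <sys/socket.h>
#include <linux/if_alg.h>

#ifndef SOL_ALG
#define SOL_ALG 279
#endif

int main(void)
{
	struct sockaddr_alg sa = {
		.salg_family = AF_ALG,
		.salg_type   = "skcipher",
		.salg_name   = "cbc(aes)",
	};
	unsigned char key[16] = { 0 };	/* placeholder key */
	int tfmfd, opfd1, opfd2;

	tfmfd = socket(AF_ALG, SOCK_SEQPACKET, 0);
	bind(tfmfd, (struct sockaddr *)&sa, sizeof(sa));
	setsockopt(tfmfd, SOL_ALG, ALG_SET_KEY, key, sizeof(key));

	/*
	 * Each accept() yields an operation FD with its own context IV:
	 * interdependent IOCBs are submitted to the same FD, while each
	 * independent stream gets its own FD.
	 */
	opfd1 = accept(tfmfd, NULL, 0);
	opfd2 = accept(tfmfd, NULL, 0);

	/* ... submit AIO requests per operation FD, e.g. via io_submit() ... */

	close(opfd1);
	close(opfd2);
	close(tfmfd);
	return 0;
}
```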
> 
> This will let each driver do it's best, be it a simple mutex, software
> queuing or hardware
> dependency tracking as the case m may be.

Again, I would argue that with the discussed patch set we have this 
functionality already.
> 
> Than of course, AF_ALG code (or any other user for that matter) will
> not need to handle
> interdependence, just set the right flag.

The issue is that some drivers simply do not handle this interdependency at 
all -- see the bug report from Harsh.

Thus, the current patch set (which, again, will be updated after the merge 
window) contains:

1. adding a mutex to AF_ALG to ensure serialization of interdependent calls 
(option 2 from above) irrespective of whether the driver implements support 
for it or not.

2. add the inline IV handling (to serve option 1)

3. add a flag that can be set by the crypto implementation. If this flag is 
set, then the mutex of AF_ALG is *not* taken, assuming that the crypto driver 
handles the serialization.

Note, option 3 from above is implemented in the proper usage of the AF_ALG 
interface.
> 
> Do you think this makes sense?
> 
> Thanks,
> Gilad



Ciao
Stephan
Stephan Mueller Feb. 1, 2018, 10:04 a.m. UTC | #6
On Thursday, 1 February 2018, 10:35:07 CET, Gilad Ben-Yossef wrote:

Hi Gilad,

> > 
> > Which works well for the sort of optimization I did and for hardware that
> > can do iv dependency tracking itself. If hardware dependency tracking was
> > avilable, you would be able to queue up requests with a chained IV
> > without taking any special precautions at all. The hardware would
> > handle the IV update dependencies.
> > 
> > So in conclusion, Stephan's approach should work and give us a nice
> > small patch suitable for stable inclusion.
> > 
> > However, if people know that their setup overhead can be put in parallel
> > with previous requests (even when the IV is not yet updated) then they
> > will
> > probably want to do something inside their own driver and set the flag
> > that Stephan is proposing adding to bypass the mutex approach.
> 
> The patches from Stephan looks good to me, but I think we can do better
> for the long term approach you are discussing.

What you made me think of is the following: shouldn't we relay the inline IV 
flag on to the crypto drivers?

The reason is to allow a clean implementation of the enabling or disabling of 
the dependency handling in the driver. Jonathan's driver, for example, decides 
based on the pointer value of the IV buffer whether it is the same buffer and 
thus dependency handling is to be applied. This is fragile.

As AF_ALG knows whether the inline IV with separate IVs per request or the 
serialization with one IV buffer for all requests is requested, it should 
relay this state on to the drivers. Thus, for example, Jonathan's driver can 
be changed to rely on this flag instead of on the buffer pointer value to 
decide whether to enable its dependency handling.
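
As a hedged illustration of what that change could look like on the driver
side: the flag name below is invented (reusing the hypothetical name from the
earlier sketch, not an existing definition); only the skcipher_request layout
is real.

```c
#include <crypto/skcipher.h>

/* Hypothetical flag that AF_ALG would relay; not an existing definition. */
#define CRYPTO_TFM_REQ_IV_SERIALIZE 0x00001000

static bool drv_needs_dependency_tracking(const struct skcipher_request *req)
{
	/*
	 * Instead of comparing req->iv against the previously seen IV buffer
	 * address (the fragile pointer trick), test an explicit flag that
	 * AF_ALG would set when the context IV is shared across requests.
	 */
	return req->base.flags & CRYPTO_TFM_REQ_IV_SERIALIZE;
}
```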

Ciao
Stephan
Gilad Ben-Yossef Feb. 1, 2018, 10:06 a.m. UTC | #7
On Thu, Feb 1, 2018 at 11:46 AM, Stephan Mueller <smueller@chronox.de> wrote:
> Am Donnerstag, 1. Februar 2018, 10:35:07 CET schrieb Gilad Ben-Yossef:
>
> Hi Gilad,
>
>> Hi,
>>
<snip>
>> >
>> > Quoting a private email from Stephan (at Stephan's suggestion)
>> >
>> >> What I however could fathom is that we introduce a flag where a driver
>> >> implements its own queuing mechanism. In this case, the mutex handling
>> >> would be disregarded.
>> >
>> > Which works well for the sort of optimization I did and for hardware that
>> > can do iv dependency tracking itself. If hardware dependency tracking was
>> > avilable, you would be able to queue up requests with a chained IV
>> > without taking any special precautions at all. The hardware would
>> > handle the IV update dependencies.
>> >
>> > So in conclusion, Stephan's approach should work and give us a nice
>> > small patch suitable for stable inclusion.
>> >
>> > However, if people know that their setup overhead can be put in parallel
>> > with previous requests (even when the IV is not yet updated) then they
>> > will
>> > probably want to do something inside their own driver and set the flag
>> > that Stephan is proposing adding to bypass the mutex approach.
>>
>> The patches from Stephan looks good to me, but I think we can do better
>> for the long term approach you are discussing.
>>
>> Consider that the issue we are dealing with is in no way unique to the
>> AF_ALG use case, it just happens to expose it.
>>
>> The grander issue is, I believe, that we are missing an API for chained
>> crypto operations. One that allows submitting multiple concurrent but
>> dependent requests
>> to tfm providers.
>>
>> Trying to second guess whether or not there is a dependency between calls
>> from the address of the IV is not a clean solution. Why don't we make it
>> explicit?
>>
>> For example, a flag that can be set on a tfm that states that the subsequent
>> series of requests are interdependent. If you need a different stream,
>> simply allocated anotehr
>> tfm object.
>
> But this is already implemented, is it not considering the discussed patches?
>
> Again, here are the use cases I see for AIO:
>
> - multiple IOCBs which are totally separate and have no interdependencies: the
> inline IV approach
>
> - multiple IOCBs which are interdependent: the current API with the patches
> that we are discussing here (note, a new patch set is coming after the merge
> window)
>
> - if you have several independent invocations of multiple interdependent
> IOCBs: call accept() again on the TFM-FD to get another operations FD. This is
> followed either option one or two above. Note, this is the idea behind the
> kcapi_handle_reinit() call just added to libkcapi. So, you have multiple
> operation-FDs on which you do either interdependent operation or inline IV
> handling.
>>
>> This will let each driver do it's best, be it a simple mutex, software
>> queuing or hardware
>> dependency tracking as the case m may be.
>
> Again, I am wondering that with the discussed patch set we have this
> functionality already.
>>
>> Than of course, AF_ALG code (or any other user for that matter) will
>> not need to handle
>> interdependence, just set the right flag.
>
> The issue is that some drivers simply do not handle this interdependency at
> all -- see the bug report from Harsh.
>
> Thus, the current patch set (again which will be updated after the merge
> window) contains:
>
> 1. adding a mutex to AF_ALG to ensure serialization of interdependent calls
> (option 2 from above) irrespective whether the driver implements support for
> it or not.
>
> 2. add the inline IV handling (to serve option 1)
>
> 3. add a flag that can be set by the crypto implementation. If this flag is
> set, then the mutex of AF_ALG is *not* set assuming that the crypto driver
> handles the serialization.
>
> Note, option 3 from above is implemented in the proper usage of the AF_ALG
> interface.

Sorry for not communicating clearly enough. I was not objecting at
all. Let me try again.

What I was trying to say is, more succinctly hopefully this time:

1. Advocating for that 3rd option. I did not understand that the plan
included adding it.

2. Pointing out that the problem solved (and rightfully so) by the mutex in the
AF_ALG AIO implementation must exist elsewhere as well - for example in IPsec -
and is probably solved there too; so if we add the flag as suggested, it can be
used there as well to gain benefits similar to what Jonathan is suggesting.

Thanks,
Gilad
Gilad Ben-Yossef Feb. 1, 2018, 10:07 a.m. UTC | #8
On Thu, Feb 1, 2018 at 12:04 PM, Stephan Mueller <smueller@chronox.de> wrote:
> Am Donnerstag, 1. Februar 2018, 10:35:07 CET schrieb Gilad Ben-Yossef:
>
> Hi Gilad,
>
>> >
>> > Which works well for the sort of optimization I did and for hardware that
>> > can do iv dependency tracking itself. If hardware dependency tracking was
>> > avilable, you would be able to queue up requests with a chained IV
>> > without taking any special precautions at all. The hardware would
>> > handle the IV update dependencies.
>> >
>> > So in conclusion, Stephan's approach should work and give us a nice
>> > small patch suitable for stable inclusion.
>> >
>> > However, if people know that their setup overhead can be put in parallel
>> > with previous requests (even when the IV is not yet updated) then they
>> > will
>> > probably want to do something inside their own driver and set the flag
>> > that Stephan is proposing adding to bypass the mutex approach.
>>
>> The patches from Stephan looks good to me, but I think we can do better
>> for the long term approach you are discussing.
>
> What you made me think of is the following: shouldn't we relay the inline IV
> flag on to the crypto drivers?
>
> The reason is to allow a clean implementation of the enabling or disabling of
> the dependency handling in the driver. Jonathan's driver, for example, decides
> based on the pointer value of the IV buffer whether it is the same buffer and
> thus dependency handling is to be applied. This is fragile.
>
> As AF_ALG knows whether the inline IV with separate IVs per request or the
> serialization with one IV buffer for all requests is requested, it should
> relay this state on to the drivers. Thus, for example, Jonathan's driver can
> be changed to rely on this flag instead on the buffer pointer value to decide
> whether to enable its dependency handling.

Yes, that is exactly what I was trying to point out :-)

Thanks,
Gilad
Stephan Mueller Feb. 1, 2018, 10:15 a.m. UTC | #9
On Thursday, 1 February 2018, 11:06:30 CET, Gilad Ben-Yossef wrote:

Hi Gilad,

> 2. Pointing out that the problem solved (and rightfully so) by mutex in
> AF_ALG AIO implementation must exists elsewhere as well - for example
> IPsec, and is probably solved there too, so if we add the flag as
> suggested, it can be used there as well to gain similar benefits to what
> Jonathan is suggesting.

You are quite right. This patch could speed things up for IPsec and the disk 
encryption logic too. So I think we should also CC the IPsec folks and the 
disk crypto folks to review the patch.

Ciao
Stephan
Jonathan Cameron Feb. 1, 2018, 10:25 a.m. UTC | #10
On Thu, 1 Feb 2018 12:07:21 +0200
Gilad Ben-Yossef <gilad@benyossef.com> wrote:

> On Thu, Feb 1, 2018 at 12:04 PM, Stephan Mueller <smueller@chronox.de> wrote:
> > Am Donnerstag, 1. Februar 2018, 10:35:07 CET schrieb Gilad Ben-Yossef:
> >
> > Hi Gilad,
> >  
> >> >
> >> > Which works well for the sort of optimization I did and for hardware that
> >> > can do iv dependency tracking itself. If hardware dependency tracking was
> >> > avilable, you would be able to queue up requests with a chained IV
> >> > without taking any special precautions at all. The hardware would
> >> > handle the IV update dependencies.
> >> >
> >> > So in conclusion, Stephan's approach should work and give us a nice
> >> > small patch suitable for stable inclusion.
> >> >
> >> > However, if people know that their setup overhead can be put in parallel
> >> > with previous requests (even when the IV is not yet updated) then they
> >> > will
> >> > probably want to do something inside their own driver and set the flag
> >> > that Stephan is proposing adding to bypass the mutex approach.  
> >>
> >> The patches from Stephan looks good to me, but I think we can do better
> >> for the long term approach you are discussing.  
> >
> > What you made me think of is the following: shouldn't we relay the inline IV
> > flag on to the crypto drivers?
> >
> > The reason is to allow a clean implementation of the enabling or disabling of
> > the dependency handling in the driver. Jonathan's driver, for example, decides
> > based on the pointer value of the IV buffer whether it is the same buffer and
> > thus dependency handling is to be applied. This is fragile.

I agree it's inelegant and a flag would be better than pointer tricks (though
they are safe currently - we never know what might change in the future).
It was really a minimal example rather than a suggestion of the ultimate
solution ;)  I was planning on suggesting a flag myself once the basic
discussion concluded the approach was worthwhile.

> >
> > As AF_ALG knows whether the inline IV with separate IVs per request or the
> > serialization with one IV buffer for all requests is requested, it should
> > relay this state on to the drivers. Thus, for example, Jonathan's driver can
> > be changed to rely on this flag instead on the buffer pointer value to decide
> > whether to enable its dependency handling.  
> 
> Yes, that is exactly what I was trying to point out :-)

Agreed - make things explicit rather than basically relying on knowing how
the above layers are working.

Thanks,

Jonathan
Harsh Jain Feb. 1, 2018, 10:55 a.m. UTC | #11
On 01-02-2018 15:55, Jonathan Cameron wrote:
> On Thu, 1 Feb 2018 12:07:21 +0200
> Gilad Ben-Yossef <gilad@benyossef.com> wrote:
>
>> On Thu, Feb 1, 2018 at 12:04 PM, Stephan Mueller <smueller@chronox.de> wrote:
>>> Am Donnerstag, 1. Februar 2018, 10:35:07 CET schrieb Gilad Ben-Yossef:
>>>
>>> Hi Gilad,
>>>  
>>>>> Which works well for the sort of optimization I did and for hardware that
>>>>> can do iv dependency tracking itself. If hardware dependency tracking was
>>>>> avilable, you would be able to queue up requests with a chained IV
>>>>> without taking any special precautions at all. The hardware would
>>>>> handle the IV update dependencies.
>>>>>
>>>>> So in conclusion, Stephan's approach should work and give us a nice
>>>>> small patch suitable for stable inclusion.
>>>>>
>>>>> However, if people know that their setup overhead can be put in parallel
>>>>> with previous requests (even when the IV is not yet updated) then they
>>>>> will
>>>>> probably want to do something inside their own driver and set the flag
>>>>> that Stephan is proposing adding to bypass the mutex approach.  
>>>> The patches from Stephan looks good to me, but I think we can do better
>>>> for the long term approach you are discussing.  
>>> What you made me think of is the following: shouldn't we relay the inline IV
>>> flag on to the crypto drivers?
>>>
>>> The reason is to allow a clean implementation of the enabling or disabling of
>>> the dependency handling in the driver. Jonathan's driver, for example, decides
>>> based on the pointer value of the IV buffer whether it is the same buffer and
>>> thus dependency handling is to be applied. This is fragile.
> I agree it's inelegant and a flag would be better than pointer tricks (though
> they are safe currently - we never know what might change in future)
> It was really a minimal example rather than a suggestion of the ultimate
> solution ;)  I was planning on suggesting a flag myself once the basic
> discussion concluded the approach was worthwhile.
>
>>> As AF_ALG knows whether the inline IV with separate IVs per request or the
>>> serialization with one IV buffer for all requests is requested, it should
>>> relay this state on to the drivers. Thus, for example, Jonathan's driver can
>>> be changed to rely on this flag instead on the buffer pointer value to decide
>>> whether to enable its dependency handling.  
>> Yes, that is exactly what I was trying to point out :-)
> Agreed - make things explicit rather than basically relying on knowing how
> the above layers are working.
The IPsec layer may not benefit from this because it sends the complete sg list in a single request only. It doesn't need partial mode support.
>
> Thanks,
>
> Jonathan
>
Stephan Mueller Feb. 7, 2018, 7:42 a.m. UTC | #12
Hi Herbert,

Herbert, patch 1 is meant for stable. However, this patch as-is
only applies to the new AF_ALG interface implementation, though
the issue goes back to the first implementation of AIO support.
Shall I try to prepare a patch for the old AF_ALG implementation
as well?

Changes from v1:

* integrate the inline IV and locking patch into one patch set

* reverse the order of lock context IV patch and inline IV patch --
  the reason is to allow the first patch to be back-ported to stable

* mark the first patch (locking of the context IV) as applicable to
  stable as there is an existing inconsistency which was demonstrated
  by Harsh with the Chelsio driver vs the AES-NI driver

* modify the inline IV patch to have proper unlocking of the mutex
  in case of errors

* prevent locking if no IV is defined by cipher

* add a patch to allow crypto drivers to report whether they support
  serialization -- in this case the locking in AF_ALG shall be
  disabled

* add a patch to inform the crypto drivers that their serialization
  support should actually be enabled and used because AF_ALG does not
  serialize the interdependent parallel AIO requests

* streamline the code in patch 1 and 2 slightly

I would like to ask the folks with real AIO hardware (Harsh, Jonathan)
to test the patches. Especially the locking patch should be tested
by Harsh, as you have seen the issue with your hardware.

Thanks.

Stephan Mueller (4):
  crypto: AF_ALG AIO - lock context IV
  crypto: AF_ALG - inline IV support
  crypto: AF_ALG - allow driver to serialize IV access
  crypto: add CRYPTO_TFM_REQ_PARALLEL flag

 crypto/af_alg.c             | 119 +++++++++++++++++++++++++++++++++++++++++++-
 crypto/algif_aead.c         |  86 +++++++++++++++++---------------
 crypto/algif_skcipher.c     |  38 ++++++++++----
 include/crypto/if_alg.h     |  37 ++++++++++++++
 include/linux/crypto.h      |  16 ++++++
 include/uapi/linux/if_alg.h |   6 ++-
 6 files changed, 249 insertions(+), 53 deletions(-)
Harsh Jain Feb. 7, 2018, 8:52 a.m. UTC | #13
On 07-02-2018 13:12, Stephan Müller wrote:
> Hi Herbert,
>
> Herbert, the patch 1 is meant for stable. However, this patch as is
> only applies to the new AF_ALG interface implementation. Though,
> the issue goes back to the first implementation of AIO support.
> Shall I try prepare a patch for the old AF_ALG implementation
> as well?
>
> Changes from v1:
>
> * integrate the inline IV and locking patch into one patch set
>
> * reverse the order of lock context IV patch and inline IV patch --
>   the reason is to allow the first patch to be back-ported to stable
>
> * mark the first patch (locking of the context IV) as applicable to
>   stable as there is an existing inconsistency which was demonstrated
>   by Harsh with the Chelsio driver vs the AES-NI driver
>
> * modify the inline IV patch to have proper unlocking of the mutex
>   in case of errors
>
> * prevent locking if no IV is defined by cipher
>
> * add a patch to allow crypto drivers to report whether they support
>   serialization -- in this case the locking in AF_ALG shall be
>   disabled
>
> * add a patch to inform the crypto drivers that their serialization
>   support should actually be enabled and used because AF_ALG does not
>   serialize the interdependent parallel AIO requests
>
> * streamline the code in patch 1 and 2 slightly
>
> I would like to ask the folks with real AIO hardware (Harsh, Jonathan)
> to test the patches. Especially, is the locking patch should be tested
> by Harsh as you have seen the issue with your hardware.
Sure, I will test the patch and let you know.
>
> Thanks.
>
> Stephan Mueller (4):
>   crypto: AF_ALG AIO - lock context IV
>   crypto: AF_ALG - inline IV support
>   crypto: AF_ALG - allow driver to serialize IV access
>   crypto: add CRYPTO_TFM_REQ_PARALLEL flag
>
>  crypto/af_alg.c             | 119 +++++++++++++++++++++++++++++++++++++++++++-
>  crypto/algif_aead.c         |  86 +++++++++++++++++---------------
>  crypto/algif_skcipher.c     |  38 ++++++++++----
>  include/crypto/if_alg.h     |  37 ++++++++++++++
>  include/linux/crypto.h      |  16 ++++++
>  include/uapi/linux/if_alg.h |   6 ++-
>  6 files changed, 249 insertions(+), 53 deletions(-)
>
Jonathan Cameron Feb. 7, 2018, 3:37 p.m. UTC | #14
On Wed, 7 Feb 2018 08:42:59 +0100
Stephan Müller <smueller@chronox.de> wrote:

> Hi Herbert,
> 
> Herbert, the patch 1 is meant for stable. However, this patch as is
> only applies to the new AF_ALG interface implementation. Though,
> the issue goes back to the first implementation of AIO support.
> Shall I try prepare a patch for the old AF_ALG implementation
> as well?
> 
> Changes from v1:
> 
> * integrate the inline IV and locking patch into one patch set
> 
> * reverse the order of lock context IV patch and inline IV patch --
>   the reason is to allow the first patch to be back-ported to stable
> 
> * mark the first patch (locking of the context IV) as applicable to
>   stable as there is an existing inconsistency which was demonstrated
>   by Harsh with the Chelsio driver vs the AES-NI driver
> 
> * modify the inline IV patch to have proper unlocking of the mutex
>   in case of errors
> 
> * prevent locking if no IV is defined by cipher
> 
> * add a patch to allow crypto drivers to report whether they support
>   serialization -- in this case the locking in AF_ALG shall be
>   disabled
> 
> * add a patch to inform the crypto drivers that their serialization
>   support should actually be enabled and used because AF_ALG does not
>   serialize the interdependent parallel AIO requests
> 
> * streamline the code in patch 1 and 2 slightly
> 
> I would like to ask the folks with real AIO hardware (Harsh, Jonathan)
> to test the patches. Especially, is the locking patch should be tested
> by Harsh as you have seen the issue with your hardware.
> 

I've updated my driver to take advantage of this and it is working
pretty much the same as it was with the dirty hacks I was running before.

So Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> with
that minor change regarding the sensitivity to having IIV set multiple times.

Hmm. Naming of the parallel request probably needs some thought though.
Will reply to that patch.


> Thanks.
> 
> Stephan Mueller (4):
>   crypto: AF_ALG AIO - lock context IV
>   crypto: AF_ALG - inline IV support
>   crypto: AF_ALG - allow driver to serialize IV access
>   crypto: add CRYPTO_TFM_REQ_PARALLEL flag
> 
>  crypto/af_alg.c             | 119 +++++++++++++++++++++++++++++++++++++++++++-
>  crypto/algif_aead.c         |  86 +++++++++++++++++---------------
>  crypto/algif_skcipher.c     |  38 ++++++++++----
>  include/crypto/if_alg.h     |  37 ++++++++++++++
>  include/linux/crypto.h      |  16 ++++++
>  include/uapi/linux/if_alg.h |   6 ++-
>  6 files changed, 249 insertions(+), 53 deletions(-)
>
Stephan Mueller Feb. 9, 2018, 10:02 p.m. UTC | #15
Hi,

Herbert, patch 1 is meant for stable. However, this patch as is
only applies to the new AF_ALG interface implementation, though
the issue goes back to the first implementation of AIO support.
Shall I try to prepare a patch for the old AF_ALG implementation
as well?

Changes from v2:

* rename ALG_IIV flags into ALG_IV_...

* rename CRYPTO_TFM_REQ_PARALLEL into CRYPTO_TFM_REQ_IV_SERIALIZE

* fix branch in patch 4 to add CRYPTO_TFM_REQ_IV_SERIALIZE flag when
  ctx->iiv == ALG_IV_SERIAL_PROCESSING

* fix patch description of patch 4

Stephan Mueller (4):
  crypto: AF_ALG AIO - lock context IV
  crypto: AF_ALG - inline IV support
  crypto: AF_ALG - allow driver to serialize IV access
  crypto: add CRYPTO_TFM_REQ_IV_SERIALIZE flag

 crypto/af_alg.c             | 119 +++++++++++++++++++++++++++++++++++++++++++-
 crypto/algif_aead.c         |  86 +++++++++++++++++---------------
 crypto/algif_skcipher.c     |  38 ++++++++++----
 include/crypto/if_alg.h     |  37 ++++++++++++++
 include/linux/crypto.h      |  16 ++++++
 include/uapi/linux/if_alg.h |   6 ++-
 6 files changed, 249 insertions(+), 53 deletions(-)
Herbert Xu Feb. 22, 2018, 1:06 p.m. UTC | #16
On Fri, Feb 09, 2018 at 11:02:27PM +0100, Stephan Müller wrote:
> Hi,
> 
> Herbert, the patch 1 is meant for stable. However, this patch as is
> only applies to the new AF_ALG interface implementation. Though,
> the issue goes back to the first implementation of AIO support.
> Shall I try prepare a patch for the old AF_ALG implementation
> as well?

I think this is overcomplicated.  We simply need to duplicate
the IV for async operations.  That is, if you're doing an async
recvmsg then the IV must be duplicated and stored in the request
instead of the global context.

Remember, you must not start the sendmsg for the next request
until the recvmsg system call for the previous request has completed.

Cheers,
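
As an illustration of the per-request IV duplication Herbert describes, a
minimal sketch could look as follows. This is a hypothetical helper, not part
of the posted series; it reuses the areq->iv/areq->ivlen bookkeeping seen in
this thread, where a non-zero areq->ivlen marks the buffer as owned (and
later freed) by the request:

#include <linux/slab.h>
#include <crypto/if_alg.h>

/*
 * Hypothetical sketch: give each async request its own copy of the
 * context IV so that recvmsg() no longer has to serialize on ctx->iv.
 */
static int af_alg_dup_iv(struct sock *sk, struct af_alg_async_req *areq)
{
	struct alg_sock *ask = alg_sk(sk);
	struct af_alg_ctx *ctx = ask->private;

	if (!ctx->ivlen) {
		areq->iv = NULL;
		areq->ivlen = 0;
		return 0;
	}

	areq->iv = kmemdup(ctx->iv, ctx->ivlen, GFP_KERNEL);
	if (!areq->iv)
		return -ENOMEM;

	/* non-zero ivlen: the copy is freed together with the request */
	areq->ivlen = ctx->ivlen;
	return 0;
}

With such a copy, recvmsg() stays fully asynchronous, but as noted above the
chaining between requests then has to be handled by user space via the
sendmsg/recvmsg ordering.
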
Stephan Mueller Feb. 23, 2018, 8:33 a.m. UTC | #17
Am Donnerstag, 22. Februar 2018, 14:06:00 CET schrieb Herbert Xu:

Hi Herbert,

> On Fri, Feb 09, 2018 at 11:02:27PM +0100, Stephan Müller wrote:
> > Hi,
> > 
> > Herbert, the patch 1 is meant for stable. However, this patch as is
> > only applies to the new AF_ALG interface implementation. Though,
> > the issue goes back to the first implementation of AIO support.
> > Shall I try prepare a patch for the old AF_ALG implementation
> > as well?
> 
> I think this is overcomplicated.  We simply need to duplicate
> the IV for async operations.  That is, if you're doing an async
> recvmsg then the IV must be duplicated and stored in the request
> instead of the global context.

A simple copy operation, however, will imply that in one AIO recvmsg request, 
only *one* IOCB can be set and processed.

If multiple IOCBs are processed, then each IOCB would get the same IV with the 
same key, just different plain/ciphertext. With this approach, I think we 
neither support a fully parallel execution of independent cipher operations 
(as suggested with the inline IV patch that requires the caller to provide a 
separate IV for each recvmsg call) nor a serialized operation of multiple 
IOCBs where the IV-based block chaining links the different dependent IOCBs.

Therefore, I would think that between each recvmsg call with one IOCB another 
sendmsg must be made to set the IV in order to support either of the 
aforementioned scenarios (parallel and serialized). IMHO this seems to be a 
waste of resources.
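
To make that overhead concrete, the following userspace sketch shows the
sendmsg() that would have to precede every single-IOCB recvmsg() in such a
scheme (illustration only; opfd is assumed to be an already accept()ed
"cbc(aes)" op socket, and the IV is assumed to be at most 16 bytes):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <linux/if_alg.h>

#ifndef SOL_ALG
#define SOL_ALG 279
#endif

/* One sendmsg() carrying the operation type, the IV and the plaintext. */
static ssize_t send_one_request(int opfd, const unsigned char *iv,
				size_t ivlen, const void *pt, size_t ptlen)
{
	char cbuf[CMSG_SPACE(sizeof(__u32)) +
		  CMSG_SPACE(sizeof(struct af_alg_iv) + 16)] = { 0 };
	struct iovec iov = { .iov_base = (void *)pt, .iov_len = ptlen };
	struct msghdr msg = {
		.msg_control = cbuf,
		.msg_controllen = sizeof(cbuf),
		.msg_iov = &iov,
		.msg_iovlen = 1,
	};
	struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
	struct af_alg_iv *alg_iv;

	cmsg->cmsg_level = SOL_ALG;
	cmsg->cmsg_type = ALG_SET_OP;
	cmsg->cmsg_len = CMSG_LEN(sizeof(__u32));
	*(__u32 *)CMSG_DATA(cmsg) = ALG_OP_ENCRYPT;

	cmsg = CMSG_NXTHDR(&msg, cmsg);
	cmsg->cmsg_level = SOL_ALG;
	cmsg->cmsg_type = ALG_SET_IV;
	cmsg->cmsg_len = CMSG_LEN(sizeof(*alg_iv) + ivlen);
	alg_iv = (struct af_alg_iv *)CMSG_DATA(cmsg);
	alg_iv->ivlen = ivlen;
	memcpy(alg_iv->iv, iv, ivlen);

	return sendmsg(opfd, &msg, 0);
}

Each dependent IOCB would pay for this additional syscall and copy before its
recvmsg() can be issued.
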
> 
> Remember, you must not start the sendmsg for the next request
> until the recvmsg system call for the previous request has completed.

I understand that, but one AIO recvmsg can process multiple cipher operations 
all at once (one IOCB defines one operation). Otherwise we do not utilize the 
true potential of the AIO support. We would use the AIO support to mimic 
synchronous behavior.
> 
> Cheers,


Ciao
Stephan
Herbert Xu Feb. 23, 2018, noon UTC | #18
On Fri, Feb 23, 2018 at 09:33:33AM +0100, Stephan Müller wrote:
>
> A simple copy operation, however, will imply that in one AIO recvmsg request, 
> only *one* IOCB can be set and processed.

Sure, but the recvmsg will return as soon as the crypto API encrypt
or decrypt function returns.  It's still fully async.  It's just
that the setup part needs to be done with sendmsg/recvmsg.

Even if we wanted to do what you stated, just inlining the IV isn't
enough.  You'd also need to inline the assoclen, and probably the
optype in case you want to mix encrypt/decrypt too.

However, I must say that I don't see the point of going all the way
to support such a bulk submission interface (e.g., lio_listio).

Remember, the algif interface due to its inherent overhead is meant
for bulk data.  That is, the processing time for each request is
dominated by the actual processing, not the submission process.

If you're instead processing lots of tiny requests, do NOT use
algif because it's not designed for that.

Therefore spending too much time to optimise the submission overhead
seems pointless to me.

Cheers,
Stephan Mueller Feb. 27, 2018, 2:08 p.m. UTC | #19
Am Freitag, 23. Februar 2018, 13:00:26 CET schrieb Herbert Xu:

Hi Herbert,

> On Fri, Feb 23, 2018 at 09:33:33AM +0100, Stephan Müller wrote:
> > A simple copy operation, however, will imply that in one AIO recvmsg
> > request, only *one* IOCB can be set and processed.
> 
> Sure, but the recvmsg will return as soon as the crypto API encrypt
> or decrypt function returns.  It's still fully async.  It's just
> that the setup part needs to be done with sendmsg/recvmsg.

Wouldn't a copy of the ctx->iv into a per-request buffer change the behaviour 
of the AF_ALG interface significantly?

Today, if multiple IOCBs are submitted, most cipher implementations would 
serialize the requests (e.g. all implementations that are synchronous in 
nature, like all software implementations).

Thus, when copying the ctx->iv into a separate per-request buffer, suddenly 
all block-chained cipher operations are not block chained any more.
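
For CBC the dependency is easy to see: with C_0 = IV and
C_i = E_K(P_i XOR C_{i-1}), a message split across two IOCBs requires the
second IOCB to start from the last ciphertext block C_n produced by the first.
If every request only receives a snapshot of ctx->iv taken at submission time,
both IOCBs start from C_0, and the combined result is no longer the CBC
encryption of the whole message.
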
> 
> Even if we wanted to do what you stated, just inlining the IV isn't
> enough.  You'd also need to inline the assoclen, and probably the
> optype in case you want to mix encrypt/decrypt too.

Maybe that is what we have to do.
> 
> However, I must say that I don't see the point of going all the way
> to support such a bulk submission interface (e.g., lio_listio).

IMHO, the point is that AF_ALG is the only interface to allow userspace to 
utilize hardware crypto implementations. For example, on a small chip with 
hardware crypto support, your user space code can offload crypto to that 
hardware to free CPU time.

How else would somebody access its crypto accelerators?
> 
> Remember, the algif interface due to its inherent overhead is meant
> for bulk data.  That is, the processing time for each request is
> dominated by the actual processing, not the submission process.

I see that. And for smaller chips with crypto support, this would be the case 
IMHO. Especially if we streamline the AF_ALG overhead such that we reduce the 
number of syscalls and user/kernel space roundtrips.
> 
> If you're instead processing lots of tiny requests, do NOT use
> algif because it's not designed for that.

The only issue in this case is that it makes the operation slower. 
> 
> Therefore spending too much time to optimise the submission overhead
> seems pointless to me.
> 
> Cheers,


Ciao
Stephan
Jonathan Cameron March 2, 2018, 11:33 a.m. UTC | #20
On Tue, 27 Feb 2018 15:08:58 +0100
Stephan Müller <smueller@chronox.de> wrote:

> Am Freitag, 23. Februar 2018, 13:00:26 CET schrieb Herbert Xu:
> 
> Hi Herbert,

Hi Stephan / Herbert,

> 
> > On Fri, Feb 23, 2018 at 09:33:33AM +0100, Stephan Müller wrote:  
> > > A simple copy operation, however, will imply that in one AIO recvmsg
> > > request, only *one* IOCB can be set and processed.  
> > 
> > Sure, but the recvmsg will return as soon as the crypto API encrypt
> > or decrypt function returns.  It's still fully async.  It's just
> > that the setup part needs to be done with sendmsg/recvmsg.  
> 
> Wouldn't a copy of the ctx->iv into a per-request buffer change the behavoir 
> of the AF_ALG interface significantly?
> 
> Today, if multiple IOCBs are submitted, most cipher implementations would 
> serialize the requests (e.g. all implementations that behave synchronous in 
> nature like all software implementations).
> 
> Thus, when copying the ctx->iv into a separate per-request buffer, suddenly 
> all block-chained cipher operations are not block chained any more.

Agreed - specific handling would be needed to ensure the IV is written to
each copy to maintain the chain.  Not nice at all.

> > 
> > Even if we wanted to do what you stated, just inlining the IV isn't
> > enough.  You'd also need to inline the assoclen, and probably the
> > optype in case you want to mix encrypt/decrypt too.  
> 
> Maybe that is what we have to do.

The one element I could do with more clarity on here is use cases,
as it feels like the discussion is a little unfocused (it helps with
performance runs, but is it really useful?).

When do we want to have separate IVs per request but a shared key?

I think this is relevant for ctr modes in particular where userspace
can provide the relevant ctrs but the key is shared.
Storage encryption modes such as XTS can also benefit.

My own knowledge is too abstract to give good answers to these.
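
As one concrete storage illustration (using the common plain64 convention,
not anything defined by this series): each AIO request could cover one sector
encrypted with the shared XTS key, with its per-request IV/tweak derived from
the sector number, so the requests are independent and can run in parallel.

#include <stdint.h>
#include <string.h>

/* plain64-style tweak: 64-bit sector number, little-endian, zero padded */
static void sector_to_iv(uint64_t sector, unsigned char iv[16])
{
	int i;

	memset(iv, 0, 16);
	for (i = 0; i < 8; i++)
		iv[i] = (unsigned char)(sector >> (8 * i));
}

With inline IVs, each such sector can then be submitted as its own IOCB.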

> > 
> > However, I must say that I don't see the point of going all the way
> > to support such a bulk submission interface (e.g., lio_listio).  
> 
> IMHO, the point is that AF_ALG is the only interface to allow userspace to 
> utilize hardware crypto implementations. For example, on a small chip with 
> hardware crypto support, your user space code can offload crypto to that 
> hardware to free CPU time.
> 
> How else would somebody access its crypto accelerators?

This is also useful at the high end where we may well be throwing this
bulk submission at a set of crypto units (hidden behind a queue) to
parallelize when possible.

Just because we have lots of cpu power doesn't mean it makes sense
to use it for crypto :)

We 'could' just do it all in userspace via vfio, but there are the usual
disadvantages in that approach in terms of generality etc.

> > 
> > Remember, the algif interface due to its inherent overhead is meant
> > for bulk data.  That is, the processing time for each request is
> > dominated by the actual processing, not the submission process.  
> 
> I see that. And for smaller chips with crypto support, this would be the case 
> IMHO. Especially if we streamline the AF_ALG overhead such that we reduce the 
> number of syscalls and user/kernel space roundtrips.

For larger devices the ability to run large numbers of requests and 'know'
that they don't need to be chained is useful (because they have separate IVs).
This allows you to let the hardware handle them in parallel (either because
the hardware handles dependency tracking, or because we have done it in the
driver.)  Applies just as well for large blocks with lower overhead.

You could do this by opening lots of separate sockets and simply providing
them all with the same key.  However, this assumes the hardware / driver can
handle very large numbers of contexts (ours can though we only implement a
subset of this functionality in the current driver to keep things simple).

If we 'fake' such support in the driver then there is inherent nastiness
(around having to let the hardware queues drain before you can change the IV),
and it assumes that the overhead of operating such a pool of sockets in your
program isn't significant.

Managing such a pool of sockets would also be a significant complexity
overhead in the user space code.

> > 
> > If you're instead processing lots of tiny requests, do NOT use
> > algif because it's not designed for that.  
> 
> The only issue in this case is that it makes the operation slower. 
> > 
> > Therefore spending too much time to optimise the submission overhead
> > seems pointless to me.
> > 
> > Cheers,  
> 
> 
> Ciao
> Stephan
> 
> 
Thanks,

Jonathan
Herbert Xu March 7, 2018, 1:41 p.m. UTC | #21
On Tue, Feb 27, 2018 at 03:08:58PM +0100, Stephan Müller wrote:
> Am Freitag, 23. Februar 2018, 13:00:26 CET schrieb Herbert Xu:
> 
> Hi Herbert,
> 
> > On Fri, Feb 23, 2018 at 09:33:33AM +0100, Stephan Müller wrote:
> > > A simple copy operation, however, will imply that in one AIO recvmsg
> > > request, only *one* IOCB can be set and processed.
> > 
> > Sure, but the recvmsg will return as soon as the crypto API encrypt
> > or decrypt function returns.  It's still fully async.  It's just
> > that the setup part needs to be done with sendmsg/recvmsg.
> 
> Wouldn't a copy of the ctx->iv into a per-request buffer change the behavoir 
> of the AF_ALG interface significantly?
> 
> Today, if multiple IOCBs are submitted, most cipher implementations would 
> serialize the requests (e.g. all implementations that behave synchronous in 
> nature like all software implementations).

No there is no such guarantee.  In fact I'm pretty sure such
users would be totally broken if cryptd was used.

Cheers,
diff mbox

Patch

diff --git a/crypto/af_alg.c b/crypto/af_alg.c
index 87394dc90ac0..3007c9d899da 100644
--- a/crypto/af_alg.c
+++ b/crypto/af_alg.c
@@ -1059,6 +1059,8 @@ void af_alg_async_cb(struct crypto_async_request *_req, int err)
 	struct kiocb *iocb = areq->iocb;
 	unsigned int resultlen;
 
+	af_alg_put_iv(sk);
+
 	/* Buffer size written by crypto operation. */
 	resultlen = areq->outlen;
 
@@ -1227,6 +1229,11 @@ int af_alg_get_iv(struct sock *sk, struct af_alg_async_req *areq)
 
 	/* No inline IV or cipher has no IV, use ctx IV buffer. */
 	if (!ctx->iiv || !ctx->ivlen) {
+		int ret = mutex_lock_interruptible(&ctx->ivlock);
+
+		if (ret)
+			return ret;
+
 		areq->iv = ctx->iv;
 		areq->ivlen = 0;	// areq->iv will not be freed
 		return 0;
@@ -1252,6 +1259,21 @@ int af_alg_get_iv(struct sock *sk, struct af_alg_async_req *areq)
 }
 EXPORT_SYMBOL_GPL(af_alg_get_iv);
 
+/**
+ * af_alg_put_iv - release lock on IV in case CTX IV is used
+ *
+ * @sk [in] AF_ALG socket
+ */
+void af_alg_put_iv(struct sock *sk)
+{
+	struct alg_sock *ask = alg_sk(sk);
+	struct af_alg_ctx *ctx = ask->private;
+
+	if (!ctx->iiv || !ctx->ivlen)
+		mutex_unlock(&ctx->ivlock);
+}
+EXPORT_SYMBOL_GPL(af_alg_put_iv);
+
 static int __init af_alg_init(void)
 {
 	int err = proto_register(&alg_proto, 0);
diff --git a/crypto/algif_aead.c b/crypto/algif_aead.c
index 7eb7cb132c09..165b2ca82e51 100644
--- a/crypto/algif_aead.c
+++ b/crypto/algif_aead.c
@@ -309,6 +309,7 @@ static int _aead_recvmsg(struct socket *sock, struct msghdr *msg,
 				&ctx->wait);
 	}
 
+	af_alg_put_iv(sk);
 
 free:
 	af_alg_free_resources(areq);
@@ -555,6 +556,7 @@ static int aead_accept_parent_nokey(void *private, struct sock *sk)
 		return -ENOMEM;
 	}
 	memset(ctx->iv, 0, ctx->ivlen);
+	mutex_init(&ctx->ivlock);
 
 	INIT_LIST_HEAD(&ctx->tsgl_list);
 	ctx->len = len;
diff --git a/crypto/algif_skcipher.c b/crypto/algif_skcipher.c
index d40e1d6797d8..a759cec446b4 100644
--- a/crypto/algif_skcipher.c
+++ b/crypto/algif_skcipher.c
@@ -151,6 +151,7 @@ static int _skcipher_recvmsg(struct socket *sock, struct msghdr *msg,
 						 &ctx->wait);
 	}
+	af_alg_put_iv(sk);
 
 free:
 	af_alg_free_resources(areq);
@@ -358,6 +359,7 @@ static int skcipher_accept_parent_nokey(void *private, struct sock *sk)
 	}
 	memset(ctx->iv, 0, ctx->ivlen);
+	mutex_init(&ctx->ivlock);
 
 	INIT_LIST_HEAD(&ctx->tsgl_list);
 	ctx->len = len;
diff --git a/include/crypto/if_alg.h b/include/crypto/if_alg.h
index ebc651ceb54a..666be8ac683c 100644
--- a/include/crypto/if_alg.h
+++ b/include/crypto/if_alg.h
@@ -19,6 +19,7 @@ 
 #include <linux/scatterlist.h>
 #include <linux/types.h>
 #include <linux/atomic.h>
+#include <linux/mutex.h>
 #include <net/sock.h>
 
 #include <crypto/aead.h>
@@ -160,6 +161,7 @@  struct af_alg_ctx {
 
 	void *iv;
 	unsigned int ivlen;
+	struct mutex ivlock;
 	size_t aead_assoclen;
 
 	struct crypto_wait wait;
@@ -268,6 +270,7 @@ int af_alg_get_rsgl(struct sock *sk, struct msghdr *msg, int flags,
 		    struct af_alg_async_req *areq, size_t maxsize,
 		    size_t *outlen);
 int af_alg_get_iv(struct sock *sk, struct af_alg_async_req *areq);
+void af_alg_put_iv(struct sock *sk);
 struct scatterlist *af_alg_get_tsgl_sg(struct af_alg_ctx *ctx);