OMAP: AES: Don't idle/start AES device between Encrypt operations

Message ID 1368293024-6654-1-git-send-email-joelagnel@ti.com (mailing list archive)
State New, archived

Commit Message

Fernandes, Joel A May 11, 2013, 5:23 p.m. UTC
Calling the runtime PM API for every block causes a serious performance
hit for crypto operations on long buffers. Because crypto is performed
one page at a time, encrypting a large buffer is split into a series of
per-page operations, and the runtime PM API is called once for each of
them.

Call pm_runtime_get_sync() only at the beginning of the session
(cra_init) and pm_runtime_put_sync() at the end (cra_exit). This results
in up to a 50% speedup, as shown below:

Before:
root@beagleboard:~# time -v openssl speed -evp aes-128-cbc
Doing aes-128-cbc for 3s on 16 size blocks: 13310 aes-128-cbc's in 0.01s
Doing aes-128-cbc for 3s on 64 size blocks: 13040 aes-128-cbc's in 0.04s
Doing aes-128-cbc for 3s on 256 size blocks: 9134 aes-128-cbc's in 0.03s
Doing aes-128-cbc for 3s on 1024 size blocks: 8939 aes-128-cbc's in 0.01s
Doing aes-128-cbc for 3s on 8192 size blocks: 4299 aes-128-cbc's in 0.00s

After:
root@beagleboard:~# time -v openssl speed -evp aes-128-cbc
Doing aes-128-cbc for 3s on 16 size blocks: 18911 aes-128-cbc's in 0.02s
Doing aes-128-cbc for 3s on 64 size blocks: 18878 aes-128-cbc's in 0.02s
Doing aes-128-cbc for 3s on 256 size blocks: 11878 aes-128-cbc's in 0.10s
Doing aes-128-cbc for 3s on 1024 size blocks: 11538 aes-128-cbc's in 0.05s
Doing aes-128-cbc for 3s on 8192 size blocks: 4857 aes-128-cbc's in 0.03s

While at it, also drop the enter/exit pr_debug() calls in the related
code. Tracers exist for exactly that purpose.

Tested on a Beaglebone (AM335x SoC) board.

Signed-off-by: Joel A Fernandes <joelagnel@ti.com>
---
 drivers/crypto/omap-aes.c |   23 +++++++++++++++++++----
 1 files changed, 19 insertions(+), 4 deletions(-)

Comments

Kevin Hilman May 13, 2013, 4:35 p.m. UTC | #1
Joel A Fernandes <joelagnel@ti.com> writes:

> Calling runtime PM API for every block causes serious perf hit to
> crypto operations that are done on a long buffer.
> As crypto is performed on a page boundary, encrypting large buffers can
> cause a series of crypto operations divided by page. The runtime PM API
> is also called those many times.
>
> We call runtime_pm_get_sync only at beginning of the session (cra_init)
> and runtime_pm_put at the end. This result in upto a 50% speedup as below:
>
> Before:
> root@beagleboard:~# time -v openssl speed -evp aes-128-cbc
> Doing aes-128-cbc for 3s on 16 size blocks: 13310 aes-128-cbc's in 0.01s
> Doing aes-128-cbc for 3s on 64 size blocks: 13040 aes-128-cbc's in 0.04s
> Doing aes-128-cbc for 3s on 256 size blocks: 9134 aes-128-cbc's in 0.03s
> Doing aes-128-cbc for 3s on 1024 size blocks: 8939 aes-128-cbc's in 0.01s
> Doing aes-128-cbc for 3s on 8192 size blocks: 4299 aes-128-cbc's in 0.00s
>
> After:
> root@beagleboard:~# time -v openssl speed -evp aes-128-cbc
> Doing aes-128-cbc for 3s on 16 size blocks: 18911 aes-128-cbc's in 0.02s
> Doing aes-128-cbc for 3s on 64 size blocks: 18878 aes-128-cbc's in 0.02s
> Doing aes-128-cbc for 3s on 256 size blocks: 11878 aes-128-cbc's in 0.10s
> Doing aes-128-cbc for 3s on 1024 size blocks: 11538 aes-128-cbc's in 0.05s
> Doing aes-128-cbc for 3s on 8192 size blocks: 4857 aes-128-cbc's in 0.03s
>
> While at it, also drop enter and exit pr_debugs, in related code. tracers
> are exactly used for that.
>
> Tested on a Beaglebone (AM335x SoC) board.
>
> Signed-off-by: Joel A Fernandes <joelagnel@ti.com>

Did you explore using runtime PM autosuspend timeouts for this instead?
They are intended for exactly this kind of thing, and the timeouts can
have sane defaults, but can be configured from userspace to allow a
power/performance trade-off.

[...]

>  static void omap_aes_cra_exit(struct crypto_tfm *tfm)
>  {
> -	pr_debug("enter\n");
> +	struct omap_aes_dev *dd = NULL;
> +
> +	/* Find AES device, currently picks the first device */
> +	spin_lock_bh(&list_lock);
> +	list_for_each_entry(dd, &dev_list, list) {
> +		break;
> +	}
> +	spin_unlock_bh(&list_lock);
> +
> +	pm_runtime_put_sync(dd->dev);

nit: Why use the synchronous call here?  The original was async.

Kevin
--
To unsubscribe from this list: send the line "unsubscribe linux-omap" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Fernandes, Joel A May 13, 2013, 7:39 p.m. UTC | #2
Hi Kevin,
Thanks for your review.

> -----Original Message-----
> From: Kevin Hilman [mailto:khilman@linaro.org]
> Sent: Monday, May 13, 2013 11:36 AM
> To: Fernandes, Joel A
> Cc: linux-crypto@vger.kernel.org; linux-omap@vger.kernel.org; Mark A. Greer
> Subject: Re: [PATCH] OMAP: AES: Don't idle/start AES device between Encrypt
> operations
> 
> Joel A Fernandes <joelagnel@ti.com> writes:
> 
> > Calling runtime PM API for every block causes serious perf hit to
> > crypto operations that are done on a long buffer.
> > As crypto is performed on a page boundary, encrypting large buffers
> > can cause a series of crypto operations divided by page. The runtime
> > PM API is also called those many times.
> >
> > We call runtime_pm_get_sync only at beginning of the session
> > (cra_init) and runtime_pm_put at the end. This result in upto a 50%
> > speedup as below:
> >
> > Before:
> > root@beagleboard:~# time -v openssl speed -evp aes-128-cbc
> > Doing aes-128-cbc for 3s on 16 size blocks: 13310 aes-128-cbc's in 0.01s
> > Doing aes-128-cbc for 3s on 64 size blocks: 13040 aes-128-cbc's in 0.04s
> > Doing aes-128-cbc for 3s on 256 size blocks: 9134 aes-128-cbc's in 0.03s
> > Doing aes-128-cbc for 3s on 1024 size blocks: 8939 aes-128-cbc's in 0.01s
> > Doing aes-128-cbc for 3s on 8192 size blocks: 4299 aes-128-cbc's in 0.00s
> >
> > After:
> > root@beagleboard:~# time -v openssl speed -evp aes-128-cbc
> > Doing aes-128-cbc for 3s on 16 size blocks: 18911 aes-128-cbc's in 0.02s
> > Doing aes-128-cbc for 3s on 64 size blocks: 18878 aes-128-cbc's in 0.02s
> > Doing aes-128-cbc for 3s on 256 size blocks: 11878 aes-128-cbc's in 0.10s
> > Doing aes-128-cbc for 3s on 1024 size blocks: 11538 aes-128-cbc's in 0.05s
> > Doing aes-128-cbc for 3s on 8192 size blocks: 4857 aes-128-cbc's in 0.03s
> >
> > While at it, also drop enter and exit pr_debugs, in related code.
> > tracers are exactly used for that.
> >
> > Tested on a Beaglebone (AM335x SoC) board.
> >
> > Signed-off-by: Joel A Fernandes <joelagnel@ti.com>
> 
> Did you explore using runtime PM autosuspend timeouts for this instead?
> They are intended for exactly this kind of thing, and the timeouts can have sane
> defaults, but can be configured from userspace to allow a power/performance
> trade-off.
[Joel] Actually, I feel there is no real benefit in calling the runtime PM API
so many times between crypto operations. The patch just moves the runtime PM
calls to the beginning and end of a crypto session, which has to be created
anyway. Imagine encrypting a 20 MB buffer: with 4 KB pages, the runtime PM API
is called 20 * 1024 / 4 = 5120 (roughly 5000) times. In my opinion the slowdown
isn't worth it. What is your opinion about this?
I can explore runtime PM autosuspend timeouts and post numbers comparing the
speedup with my patch against the speedup with timeouts.

> [...]
> 
> > >  static void omap_aes_cra_exit(struct crypto_tfm *tfm)
> > >  {
> > -	pr_debug("enter\n");
> > +	struct omap_aes_dev *dd = NULL;
> > +
> > +	/* Find AES device, currently picks the first device */
> > +	spin_lock_bh(&list_lock);
> > +	list_for_each_entry(dd, &dev_list, list) {
> > +		break;
> > +	}
> > +	spin_unlock_bh(&list_lock);
> > +
> > +	pm_runtime_put_sync(dd->dev);
> 
> nit: Why use the synchronous call here?  The original was async.
[Joel] The async call was required there because that code runs in interrupt
context. It was originally sync, but was changed to async in a separate patch.

Thanks,
Joel
Kevin Hilman May 13, 2013, 8:45 p.m. UTC | #3
Hi Joel,

"Fernandes, Joel A" <joelagnel@ti.com> writes:

> Hi Kevin,
> Thanks for your review.
>
>> -----Original Message-----
>> From: Kevin Hilman [mailto:khilman@linaro.org]
>> Sent: Monday, May 13, 2013 11:36 AM
>> To: Fernandes, Joel A
>> Cc: linux-crypto@vger.kernel.org; linux-omap@vger.kernel.org; Mark A. Greer
>> Subject: Re: [PATCH] OMAP: AES: Don't idle/start AES device between Encrypt
>> operations
>> 
>> Joel A Fernandes <joelagnel@ti.com> writes:
>> 
>> > Calling runtime PM API for every block causes serious perf hit to
>> > crypto operations that are done on a long buffer.
>> > As crypto is performed on a page boundary, encrypting large buffers
>> > can cause a series of crypto operations divided by page. The runtime
>> > PM API is also called those many times.
>> >
>> > We call runtime_pm_get_sync only at beginning of the session
>> > (cra_init) and runtime_pm_put at the end. This result in upto a 50%
>> > speedup as below:
>> >
>> > Before:
>> > root@beagleboard:~# time -v openssl speed -evp aes-128-cbc
>> > Doing aes-128-cbc for 3s on 16 size blocks: 13310 aes-128-cbc's in 0.01s
>> > Doing aes-128-cbc for 3s on 64 size blocks: 13040 aes-128-cbc's in 0.04s
>> > Doing aes-128-cbc for 3s on 256 size blocks: 9134 aes-128-cbc's in 0.03s
>> > Doing aes-128-cbc for 3s on 1024 size blocks: 8939 aes-128-cbc's in 0.01s
>> > Doing aes-128-cbc for 3s on 8192 size blocks: 4299 aes-128-cbc's in 0.00s
>> >
>> > After:
>> > root@beagleboard:~# time -v openssl speed -evp aes-128-cbc
>> > Doing aes-128-cbc for 3s on 16 size blocks: 18911 aes-128-cbc's in 0.02s
>> > Doing aes-128-cbc for 3s on 64 size blocks: 18878 aes-128-cbc's in 0.02s
>> > Doing aes-128-cbc for 3s on 256 size blocks: 11878 aes-128-cbc's in 0.10s
>> > Doing aes-128-cbc for 3s on 1024 size blocks: 11538 aes-128-cbc's in 0.05s
>> > Doing aes-128-cbc for 3s on 8192 size blocks: 4857 aes-128-cbc's in 0.03s
>> >
>> > While at it, also drop enter and exit pr_debugs, in related code.
>> > tracers are exactly used for that.
>> >
>> > Tested on a Beaglebone (AM335x SoC) board.
>> >
>> > Signed-off-by: Joel A Fernandes <joelagnel@ti.com>
>> 
>> Did you explore using runtime PM autosuspend timeouts for this instead?
>> They are intended for exactly this kind of thing, and the timeouts can have sane
>> defaults, but can be configured from userspace to allow a power/performance
>> trade-off.
> [Joel] Actually, I feel there is no real benefit in calling runtime PM api so many
> times in between crypto operations. The patch just moves the runtime pm usage
> to the beginning and end of a crypto session which will have to be created anyway.
> Imagine encrypting a 20M block- this means runtime PM API is called
> 20 * 1024 / 4 =~ 5000 times. The slow down in my opinion doesn't make it worth it.
> What is your opinion about this?

OK, I'm not terribly familiar with the crypto API, so I was assuming
that the init/exit calls you're instrumenting were happening at driver
probe/remove time.  Based on your clarifications, that doesn't seem to
be the case.

My main concern is that drivers don't simply use 'get' on driver probe
and 'put' on driver remove and force the system awake as long as the
driver is present.  I've seen that plenty of times, and I was assuming
that's what was going on here.  Sorry for the confusion.  

> I can explore runtime-pm timeouts and propose the numbers to describe what would
> the speedup w/ my patch and w/ timeouts.

Probably not needed.  

How about just adding a few more details to the changelog summarizing
how/when the init/exit calls happen, to make it a bit clearer?

Thanks,

Kevin
Fernandes, Joel A May 14, 2013, 2:49 a.m. UTC | #4
Hi Kevin,

> > have to be created anyway.
> > Imagine encrypting a 20M block- this means runtime PM API is called
> > 20 * 1024 / 4 =~ 5000 times. The slow down in my opinion doesn't make it
> > worth it.
> > What is your opinion about this?
> 
> OK, I'm not terribly familiar with the crypto API, so I was assuming that the
> init/exit calls you're instrumenting were happening at driver probe/remove
> time.  Based on your clarifications, that doesn't seem to be the case.
> 
> My main concern is that drivers don't simply use 'get' on driver probe and 'put'
> on driver remove and force the system awake as long as the driver is present.
> I've seen that plenty of times, and I was assuming that's what was going on
> here.  Sorry for the confusion.

[Joel] No problem, thanks. Right, the driver doesn't do get/put in the
probe/remove functions, only when it has work to do.

> How about just add a few more details to the changelog summarizing
> how/when the init/exit calls happen to make it a bit more clear.

[Joel] Sure, I will make this more clear. Sorry for not doing so earlier.

Thanks,
Joel

Patch

diff --git a/drivers/crypto/omap-aes.c b/drivers/crypto/omap-aes.c
index 6aa425f..e6474eb 100644
--- a/drivers/crypto/omap-aes.c
+++ b/drivers/crypto/omap-aes.c
@@ -208,7 +208,6 @@  static int omap_aes_hw_init(struct omap_aes_dev *dd)
 	 * It may be long delays between requests.
 	 * Device might go to off mode to save power.
 	 */
-	pm_runtime_get_sync(dd->dev);
 
 	if (!(dd->flags & FLAGS_INIT)) {
 		dd->flags |= FLAGS_INIT;
@@ -636,7 +635,6 @@  static void omap_aes_finish_req(struct omap_aes_dev *dd, int err)
 
 	pr_debug("err: %d\n", err);
 
-	pm_runtime_put(dd->dev);
 	dd->flags &= ~FLAGS_BUSY;
 
 	req->base.complete(&req->base, err);
@@ -837,8 +835,16 @@  static int omap_aes_ctr_decrypt(struct ablkcipher_request *req)
 
 static int omap_aes_cra_init(struct crypto_tfm *tfm)
 {
-	pr_debug("enter\n");
+	struct omap_aes_dev *dd = NULL;
+
+	/* Find AES device, currently picks the first device */
+	spin_lock_bh(&list_lock);
+	list_for_each_entry(dd, &dev_list, list) {
+		break;
+	}
+	spin_unlock_bh(&list_lock);
 
+	pm_runtime_get_sync(dd->dev);
 	tfm->crt_ablkcipher.reqsize = sizeof(struct omap_aes_reqctx);
 
 	return 0;
@@ -846,7 +852,16 @@  static int omap_aes_cra_init(struct crypto_tfm *tfm)
 
 static void omap_aes_cra_exit(struct crypto_tfm *tfm)
 {
-	pr_debug("enter\n");
+	struct omap_aes_dev *dd = NULL;
+
+	/* Find AES device, currently picks the first device */
+	spin_lock_bh(&list_lock);
+	list_for_each_entry(dd, &dev_list, list) {
+		break;
+	}
+	spin_unlock_bh(&list_lock);
+
+	pm_runtime_put_sync(dd->dev);
 }
 
 /* ********************** ALGS ************************************ */