Message ID: 20190322062740.nrwfx2rvmt7lzotj@gondor.apana.org.au (mailing list archive)
Series: Add zinc using existing algorithm implementations
Hey Herbert,

On Fri, Mar 22, 2019 at 12:34 AM Herbert Xu <herbert@gondor.apana.org.au> wrote:
> This should allow the wireguard code itself to proceed. At the
> same time we can also move forward with replacing the existing
> implementations of chacha20 and/or poly1305 where appropriate.

I'll have a long flight tomorrow to review this, but also, I should
have v9 of the patchset submitted soon, which will address a lot of the
good issues people brought up several months ago for v8. So please sit
tight while I prepare that.

Jason
On Fri, 22 Mar 2019 at 07:28, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> Hi:
>
> In the interest of moving forward with wireguard, this patch
> series adds the zinc interface as submitted in
>
> https://patchwork.kernel.org/project/linux-crypto/list/?series=32507&state=*
>
> with the change that existing implementations of chacha20 and
> poly1305 are used instead of the new ones. The use of the existing
> chacha20/poly1305 implementations does not involve any use of the
> crypto API or indirect function calls. Only direct function calls
> are made into the underlying implementation.
>
> This should allow the wireguard code itself to proceed. At the
> same time we can also move forward with replacing the existing
> implementations of chacha20 and/or poly1305 where appropriate.
>

Hi Herbert,

Let me reiterate some of my concerns with Zinc, which aren't really
addressed by your patches afaict.

- The way WireGuard uses crypto in the kernel is essentially a layering
  violation - while we already have an asynchronous API that implements
  ChaCha20Poly1305 in a way that WireGuard /could/ benefit from, and
  while we even have support already for async accelerators that
  implement it, the reluctance of Jason to work with the community to
  fix the issues that he identified in the crypto API means that
  WireGuard will not be able to use these, and is restricted to
  synchronous software implementations. Saying accelerators will not
  matter is a bit like saying there are no American soldiers in Iraq.
  (Note that adding async interfaces to Zinc is not the right way to
  deal with this IMO)

- I think it is fine to have a 'blessed' set of software crypto in the
  kernel that we standardize on for internal use cases, but WireGuard
  is not one of them. Having two separate s/w crypto stacks is a
  problem, and I don't think it helps to have a cute name either.
- I don't think Zinc changes should go through Greg's tree and have
  separate maintainers. In fact, I am a bit concerned about the fact
  that, after the last Zinc/WG submission in October, Jason has not
  really interacted with the linux-crypto community at all, while a lot
  of work was being done (especially by Eric) to address issues that he
  helped identify. So letting Jason (and Samuel, who has not chimed in
  at all, IIRC) maintain something that is relevant to crypto in the
  kernel, but without a community and without involvement of the
  linux-crypto maintainer, is not acceptable to me. (In general, people
  tend to join a community before being volunteered to maintain its
  codebase)

I am among the people who really want to see WireGuard merged. But the
whole Zinc thing is an unnecessary distraction from getting the
existing crypto API fixed in places where it fails to support
WireGuard's use case, and that is a loss for WireGuard users as well as
the linux-crypto community.
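The distinction Herbert's cover letter draws above - direct function calls into a known implementation versus the crypto API's runtime algorithm lookup and indirect call - can be sketched in a few lines. This is purely an illustrative model: the registry, the algorithm name, and `toy_cipher` are all invented for demonstration (the xor "cipher" is not real crypto), and none of this is the actual kernel interface.

```python
# Illustrative model only: toy_cipher is a plain repeating-key xor
# standing in for chacha20, NOT real crypto.
def toy_cipher(key: bytes, data: bytes) -> bytes:
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

# crypto API style: resolve the algorithm by name at runtime, then call
# through the returned handle - an indirect call the compiler can't inline.
_registry = {"chacha20-generic": toy_cipher}

def crypto_alloc(name: str):
    return _registry[name]          # runtime lookup by string

def encrypt_indirect(name: str, key: bytes, data: bytes) -> bytes:
    tfm = crypto_alloc(name)
    return tfm(key, data)

# Zinc style: the caller names the function directly - no lookup, no
# function pointer, one ordinary (inlinable) call.
def encrypt_direct(key: bytes, data: bytes) -> bytes:
    return toy_cipher(key, data)
```

Both paths compute the same thing; the argument in this thread is about whether the lookup-and-dispatch layer buys you anything when the caller already knows exactly which cipher it wants.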
On Fri, Mar 22, 2019 at 1:56 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> that implement it, the reluctance of Jason to work with the community
> to fix the issues that he identified in the crypto API means that
> WireGuard will not be able to use these, and is restricted to
> synchronous software implementations.

There's no reluctance to work with the community. I'm pretty deeply
committed to this, as evidenced by the multitudes of patch submissions,
discussions, and popping around from conference to conference
discussing with folks face to face.

> - The way WireGuard uses crypto in the kernel is essentially a
> layering violation - while we already have an asynchronous API that
> implements ChaCha20Poly1305 in a way that WireGuard /could/ benefit
> from, and while we even have support already for async accelerators
> Saying accelerators will not
> matter is a bit like saying there are no American soldiers in Iraq.
> (Note that adding async interfaces to Zinc is not the right way to
> deal with this IMO)

Geopolitics aside, as explained over and over, the point of Zinc is to
just get down to simple software function calls, for instances where
async accelerators aren't available. I consider smushing it all
together to be the "layering violation".

> - I don't think Zinc changes should go through Greg's tree and have
> separate maintainers

As I've mentioned before, I'm fine with merges going whichever route is
best for merges. If Herbert thinks it'd be useful to send things in
that direction and would reduce clashes and such, then that's fine with
me.

> in fact, I am a bit concerned about the fact
> that, after the last Zinc/WG submission in October, Jason has not
> really interacted

Such as... https://www.youtube.com/watch?v=CejbCQ5wS7Q ? November is
after October. Anyway, I thought it good advice to not post v9 too
swiftly after Plumbers, letting things calm a bit and focusing on other
improvements in Zinc in the meanwhile. I've certainly been around.
Looks like Herbert posting this might erupt into another contentious
thread of bikeshedding and argument, what a bummer.
On Fri, Mar 22, 2019 at 12:56 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>
> - The way WireGuard uses crypto in the kernel is essentially a
> layering violation

What? No.

That's just wrong.

It's only a layering violation if you accept and buy into the crypto/
model. And Jason obviously doesn't.

And honestly, I'm 1000% with Jason on this. The crypto/ model is hard
to use, inefficient, and completely pointless when you know what your
cipher or hash algorithm is, and your CPU just does it well directly.

> we even have support already for async accelerators that implement it,

Afaik, none of the async accelerator code has ever been worth anything
on real hardware and on any sane and real loads. The cost of going
outside the CPU is *so* expensive that you'll always lose, unless the
algorithm has been explicitly designed to be insanely hard on a regular
CPU.

(Corollary: "insanely hard on a regular CPU" is also easy to do by
making the CPU be weak and bad. Which is not a case we should optimize
for).

The whole "external accelerator" model is odd. It was wrong. It only
makes sense if the accelerator does *everything* (ie it's the network
card), and then you wouldn't use the wireguard thing on the CPU at all,
you'd have all those things on the accelerator (ie a "network card that
does WG").

One of the (best or worst, depending on your hangups) arguments for
external accelerators has been "but I trust the external hardware with
the key, but not my own code", aka the TPM or Disney argument. I don't
think that's at all relevant to the discussion either.

The whole model of async accelerators is completely bogus. The only
crypto or hash accelerator that is worth it are the ones integrated on
the CPU cores, which have the direct access to caches.
And if the accelerator is some tightly coupled thing that has direct
access to your caches, and doesn't need interrupt overhead or address
translation etc (at which point it can be worth using), then you might
as well just consider it an odd version of the above. You'd want to
poll for the result anyway, because not polling is too expensive.

Just a single interrupt would completely undo all the advantages you
got from using specialized hardware - both power and performance.

And that kind of model would work just fine with zinc.

So an accelerator ends up being useful in two cases:

 - it's entirely external and part of the network card, so that there's
   no extra data transfer overhead

 - it's tightly coupled enough (either CPU instructions or some on-die
   cache coherent engine) that you can and will just use it
   synchronously anyway.

In the first case, you wouldn't run wireguard on the CPU anyway - you
have a network card that just implements the VPN.

And in the second case, the zinc model is the right one.

So no. I don't believe "layering violation" is the issue here at all.
The only main issue as far as I'm concerned is how to deal with the
fact that we have duplicate code and effort.

Linus
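Linus's interrupt-overhead argument above can be put in back-of-envelope form: offload pays a fixed per-request cost (submission, completion interrupt, cache refill) that synchronous CPU code never pays, so it only wins past some break-even size. The model and every number below are hypothetical placeholders for illustration, not measurements of any real hardware.

```python
def offload_is_faster(nbytes: int,
                      cpu_ns_per_byte: float,
                      engine_ns_per_byte: float,
                      fixed_ns: float) -> bool:
    """Crude latency model: the engine pays a fixed per-request overhead
    plus per-byte time; the CPU pays only per-byte time."""
    return fixed_ns + engine_ns_per_byte * nbytes < cpu_ns_per_byte * nbytes
```

With made-up numbers - CPU at 2 ns/byte, engine at 0.5 ns/byte, 5 µs of fixed per-request overhead - break-even sits above 3000 bytes, so a typical 1500-byte network packet still loses on the engine even though the engine's per-byte rate is 4x better. That is the shape of the argument: per-packet workloads amortize the fixed cost poorly.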
As someone who has been working on crypto acceleration hardware for the
better part of the past 20 years, I feel compelled to respond to this,
in defence of the crypto API (which we're really happy with ...).

> And honestly, I'm 1000% with Jason on this. The crypto/ model is hard to use,
> inefficient, and completely pointless when you know what your cipher or
> hash algorithm is, and your CPU just does it well directly.
>
> > we even have support already for async accelerators that implement it,
>
> Afaik, none of the async accelerator code has ever been worth anything on
> real hardware and on any sane and real loads. The cost of going outside the
> CPU is *so* expensive that you'll always lose, unless the algorithm has been
> explicitly designed to be insanely hard on a regular CPU.
>

The days of designing them *specifically* to be hard on a CPU are over,
but nevertheless, due to required security properties, *good* crypto is
usually still hard work for a CPU. Especially *asymmetric* crypto,
which can easily take milliseconds per operation even on a high-end
CPU. You can do a lot of interrupt processing in a millisecond.

Now some symmetric algorithms (AES, SHA, GHASH) have actually made it
*into* the CPU, somewhat changing the landscape, but:

a) That's really only a small subset of the crypto being used out
   there. There's much more to crypto than just AES, SHA and GHASH.
b) This only applies to high-end CPUs. The large majority of embedded
   CPUs do not have such embedded crypto instructions. This may change,
   but I don't really expect that to happen soon.
c) You're still keeping a big, power-hungry CPU busy for a lot of
   cycles doing some fairly trivial work.

> (Corollary: "insanely hard on a regular CPU" is also easy to do by making the
> CPU be weak and bad. Which is not a case we should optimize for).
>

Linux is being used a lot for embedded systems which usually DO have
fairly weak CPUs.
I would argue that's exactly the case you SHOULD optimize for, as
that's where you can *really* still make the difference as a
programmer.

> The whole "external accelerator" model is odd. It was wrong. It only makes
> sense if the accelerator does *everything* (ie it's the network card), and
> then you wouldn't use the wireguard thing on the CPU at all, you'd have all
> those things on the accelerator (ie a "network card that does WG").
>
> One of the (best or worst, depending on your hangups) arguments for
> external accelerators has been "but I trust the external hardware with the
> key, but not my own code", aka the TPM or Disney argument. I don't think
> that's at all relevant to the discussion either.
>
> The whole model of async accelerators is completely bogus. The only crypto
> or hash accelerator that is worth it are the ones integrated on the CPU cores,
> which have the direct access to caches.
>

NOT true. We wouldn't still be in business after 20 years if that were
true.

> And if the accelerator is some tightly coupled thing that has direct access to
> your caches, and doesn't need interrupt overhead or address translation etc
> (at which point it can be worth using) then you might as well just consider it
> an odd version of the above. You'd want to poll for the result anyway,
> because not polling is too expensive.
>

It's HARD to get the interfacing right to take advantage of the
acceleration. And that's exactly why you MUST have an async interface
for that: to cope with the large latency of external acceleration,
going through the memory subsystem and external buses, as - as you
already pointed out - you cannot access the CPU cache directly (though
you can be fully coherent with it). So to cope with that latency, you
will need to batch, queue and pipeline your processing. This is not
unique to crypto acceleration; the same principles apply to e.g. your
GPU as well. Or any HW worth anything, for that matter.
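The "batch, queue and pipeline" discipline described above is exactly what an async interface exists to express: submission returns immediately, and one completion event drains many requests, amortizing the fixed overhead across the batch. Here is a toy Python model of that shape - the class and method names are invented for illustration and bear no relation to the actual crypto API's request objects and completion callbacks.

```python
from collections import deque

class OffloadQueue:
    """Toy model of an async engine interface: submit() queues work and
    returns immediately, and results are delivered later via per-request
    callbacks when a batch completes - rather than each call blocking
    for the engine's full round-trip latency."""

    def __init__(self, transform):
        self._transform = transform   # stand-in for the engine's cipher
        self._pending = deque()

    def submit(self, data: bytes, done) -> str:
        self._pending.append((data, done))
        return "EINPROGRESS"          # caller keeps going; no busy-wait

    def complete_batch(self) -> int:
        # One completion "interrupt" drains every queued request,
        # amortizing the fixed per-event cost across the whole batch.
        completed = 0
        while self._pending:
            data, done = self._pending.popleft()
            done(self._transform(data))
            completed += 1
        return completed
```

A caller submits a burst of packets, carries on with other work, and handles all the completions at once - the pattern a synchronous function-call interface cannot express.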
What that DOES mean is that external crypto accelerators are indeed
useless for doing the occasional crypto operation; it really only makes
sense for streaming and/or bulk use cases, such as network packet or
disk encryption. And for operations that really take significant time,
such as asymmetric crypto.

For the occasional symmetric crypto operation, by all means, do that on
the CPU using a very thin API layer for efficiency and simplicity. This
is where Zinc makes a lot of sense - I'm not against Zinc at all. But
DO realise that when you go that route, you forfeit any chance of
benefitting from acceleration.

Ironically, for doing what WireGuard is doing - bulk packet encryption
- the async crypto API makes a lot more sense than Zinc. In Jason's
defense, as far as I know, there is no Poly/ChaCha HW acceleration out
there yet, but I can assure you that that situation is going to change
soon :-)

Still, I would really recommend running WireGuard on top of the crypto
API. How much performance would you really lose by doing that? If
there's anything wrong with the efficiency of the crypto API
implementation of P/C, then just fix that.

> Just a single interrupt would completely undo all the advantages you got
> from using specialized hardware - both power and performance.
>

We have done plenty of measurements, on both power and performance, to
prove you wrong there. Typically our HW needs *at least* a full order
of magnitude less power to do the actual work. The CPU load for
handling the interrupts etc. tends to be around 20%. So assuming the
CPU goes to sleep for the other 80% of the time, the combined solution
would need about 1/3rd of the power of a CPU-only solution. It's one of
our biggest selling points.

As for performance: in the embedded space you can normally expect the
crypto accelerator to be *much* faster than the CPU subsystem as well.
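Pascal's power arithmetic above can be checked with a one-line model: the CPU is active only a fraction of the time (interrupt handling etc.) and the engine draws some fraction of the CPU's power while doing the work. The function and its parameters are a hypothetical modelling sketch, not his measurement methodology.

```python
def combined_power_fraction(cpu_active_frac: float,
                            engine_power_ratio: float,
                            cpu_sleep_ratio: float = 0.0) -> float:
    """Power of (CPU + offload engine) relative to a CPU-only solution.
    cpu_active_frac: fraction of time the CPU is busy (interrupts etc.);
    engine_power_ratio: engine power as a fraction of CPU active power;
    cpu_sleep_ratio: CPU sleep power as a fraction of its active power."""
    cpu = cpu_active_frac + (1 - cpu_active_frac) * cpu_sleep_ratio
    return cpu + engine_power_ratio
```

Plugging in the figures from the message - 20% CPU load and an engine needing an order of magnitude less power - gives 0.2 + 0.1 = 0.3, i.e. "about 1/3rd of the power of a CPU-only solution" (assuming negligible sleep power).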
So yes, big multi-core multi-GHz Xeons can do AES-GCM extremely fast
and we'd have a hard time competing with that, but that's not the
relevant use case here.

> And that kind of model would work just fine with zinc.
>
> So an accelerator ends up being useful in two cases:
>
> - it's entirely external and part of the network card, so that there's no extra
> data transfer overhead
>

There are inherent efficiency advantages to integrating the crypto
there, but it's also very inflexible, as you're stuck with a
particular, usually very limited, implementation. And it doesn't fly
very well with current trends of Software Defined Networking and
Network Function Virtualization. As for data transfer overhead: as long
as the SoC memory subsystem was designed to cope with this, it's really
irrelevant, as you shouldn't notice it.

> - it's tightly coupled enough (either CPU instructions or some on-die cache
> coherent engine) that you can and will just use it synchronously anyway.
>
> In the first case, you wouldn't run wireguard on the CPU anyway - you have a
> network card that just implements the VPN.
>

The irony here is that such a network card would probably be an
embedded system running Linux on an embedded CPU, offloading the crypto
operations to our accelerator ... Most smart NICs are architected that
way.

Pascal van Leeuwen,
Silicon IP Architect @ Inside Secure
On Fri, 22 Mar 2019 at 18:48, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Fri, Mar 22, 2019 at 12:56 AM Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
> >
> > - The way WireGuard uses crypto in the kernel is essentially a
> > layering violation
>
> What? No.
>
> That's just wrong.
>
> It's only a layering violation if you accept and buy into the crypto/ model.
>
> And Jason obviously doesn't.
>
> And honestly, I'm 1000% with Jason on this. The crypto/ model is hard
> to use, inefficient, and completely pointless when you know what your
> cipher or hash algorithm is, and your CPU just does it well directly.
>

Well, it is true that the dynamic dispatch is annoying and we need to
fix that. And I have not given up hope that someone like Jason, with
his level of understanding of crypto, and an enticing use case such as
WireGuard, will choose to work with the linux-crypto community to
improve it rather than build something from scratch on the side. But
that is orthogonal to the sync vs async debate.

> > we even have support already for async accelerators that implement it,
>
> Afaik, none of the async accelerator code has ever been worth anything
> on real hardware and on any sane and real loads. The cost of going
> outside the CPU is *so* expensive that you'll always lose, unless the
> algorithm has been explicitly designed to be insanely hard on a
> regular CPU.
>
> (Corollary: "insanely hard on a regular CPU" is also easy to do by
> making the CPU be weak and bad. Which is not a case we should optimize
> for).
>
> The whole "external accelerator" model is odd. It was wrong. It only
> makes sense if the accelerator does *everything* (ie it's the network
> card), and then you wouldn't use the wireguard thing on the CPU at
> all, you'd have all those things on the accelerator (ie a "network
> card that does WG").
>
> One of the (best or worst, depending on your hangups) arguments for
> external accelerators has been "but I trust the external hardware with
> the key, but not my own code", aka the TPM or Disney argument. I
> don't think that's at all relevant to the discussion either.
>
> The whole model of async accelerators is completely bogus. The only
> crypto or hash accelerator that is worth it are the ones integrated on
> the CPU cores, which have the direct access to caches.
>
> And if the accelerator is some tightly coupled thing that has direct
> access to your caches, and doesn't need interrupt overhead or address
> translation etc (at which point it can be worth using) then you might
> as well just consider it an odd version of the above. You'd want to
> poll for the result anyway, because not polling is too expensive.
>
> Just a single interrupt would completely undo all the advantages you
> got from using specialized hardware - both power and performance.
>
> And that kind of model would work just fine with zinc.
>
> So an accelerator ends up being useful in two cases:
>
> - it's entirely external and part of the network card, so that
> there's no extra data transfer overhead
>
> - it's tightly coupled enough (either CPU instructions or some on-die
> cache coherent engine) that you can and will just use it synchronously
> anyway.
>
> In the first case, you wouldn't run wireguard on the CPU anyway - you
> have a network card that just implements the VPN.
>
> And in the second case, the zinc model is the right one.
>
> So no. I don't believe "layering violation" is the issue here at all.
>

If the consensus is that no accelerator is likely to ever exist that
outperforms CPU crypto by any measure (latency, throughput, power
efficiency), then I don't have any objections to bolting this straight
onto a synchronous interface.
My concern is that we will end up with Zinc patches 12 months from now
that implement async interfaces to support some particular piece of
hardware, while other hardware is only supported by the crypto/ API,
even though the algorithm they implement is the same.

> The only main issue as far as I'm concerned is how to deal with the
> fact that we have duplicate code and effort.
>
> Linus
On Mon, Mar 25, 2019 at 11:43 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> onto a synchronous interface. My concern is that we will end up with
> Zinc patches 12 months from now that implement async interfaces to
> support some particular piece of hardware, while other hardware is
> only supported by the crypto/ API, even though the algorithm they
> implement is the same.

Allow me to allay that concern: Zinc is for simple synchronous software
functions.
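The "simple synchronous software functions" under discussion are things like the ChaCha20 block function: pure computation in, 64 bytes of keystream out, no dispatch and no completion machinery. As a self-contained illustration, here it is in Python following RFC 8439 - for demonstration only (a plain Python transcription, nowhere near the performance or the constant-time properties of the kernel's implementations).

```python
import struct

def _rotl32(v: int, n: int) -> int:
    return ((v << n) & 0xffffffff) | (v >> (32 - n))

def _quarter_round(s: list, a: int, b: int, c: int, d: int) -> None:
    s[a] = (s[a] + s[b]) & 0xffffffff; s[d] = _rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & 0xffffffff; s[b] = _rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & 0xffffffff; s[d] = _rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & 0xffffffff; s[b] = _rotl32(s[b] ^ s[c], 7)

def chacha20_block(key: bytes, counter: int, nonce: bytes) -> bytes:
    """One 64-byte keystream block per RFC 8439 (32-byte key, 12-byte nonce)."""
    state = (list(struct.unpack("<4I", b"expand 32-byte k"))
             + list(struct.unpack("<8I", key))
             + [counter]
             + list(struct.unpack("<3I", nonce)))
    working = state[:]
    for _ in range(10):                       # 20 rounds = 10 double rounds
        _quarter_round(working, 0, 4, 8, 12)  # column rounds
        _quarter_round(working, 1, 5, 9, 13)
        _quarter_round(working, 2, 6, 10, 14)
        _quarter_round(working, 3, 7, 11, 15)
        _quarter_round(working, 0, 5, 10, 15) # diagonal rounds
        _quarter_round(working, 1, 6, 11, 12)
        _quarter_round(working, 2, 7, 8, 13)
        _quarter_round(working, 3, 4, 9, 14)
    out = [(w + s) & 0xffffffff for w, s in zip(working, state)]
    return struct.pack("<16I", *out)
```

A synchronous API around this is one direct call per block; the entire sync-vs-async debate in this thread is about what, if anything, should sit between the caller and a function of this shape.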
On Mon, Mar 25, 2019 at 09:10:08AM +0000, Pascal Van Leeuwen wrote:
> As someone who has been working on crypto acceleration hardware for the better
> part of the past 20 years, I feel compelled to respond to this, in defence of
> the crypto API (which we're really happy with ...).

Thanks for joining in.

> We have done plenty of measurements, on both power and performance, to prove
> you wrong there. Typically our HW needs *at least* a full order of a magnitude
> less power to do the actual work. The CPU load for handing the interrupts etc.
> tends to be around 20%. So assuming the CPU goes to sleep for the other 80%
> of the time, the combined solution would need about 1/3rd of the power of a
> CPU only solution. It's one of our biggest selling points.

I actually have the kind of (relatively) low-power system with your
crypto accelerator IP. But it doesn't matter how great performance or
energy savings I would get from using crypto offloading. I will
continue to use a CPU-only solution for the foreseeable future, because
of one tiny problem: the firmware files for the eip197(?) crypto
accelerator are secrets we are not allowed to download from anywhere.

> Pascal van Leeuwen,
> Silicon IP Architect @ Inside Secure

Riku
> The firmware files for eip197(?) crypto accelerator are secrets we are
> not allowed download anywhere from.
>

Riku,

The driver update I'm working on will fix that issue for you ;-) I'm
planning to upstream my changes but I still need to find the right
approach for doing so.

Regards,
Pascal