Message ID: 20190322062740.nrwfx2rvmt7lzotj@gondor.apana.org.au (mailing list archive)
Series: Add zinc using existing algorithm implementations
Hey Herbert,

On Fri, Mar 22, 2019 at 12:34 AM Herbert Xu <herbert@gondor.apana.org.au> wrote:
> This should allow the wireguard code itself to proceed. At the
> same time we can also move forward with replacing the existing
> implementations of chacha20 and/or poly1305 where appropriate.

I'll have a long flight tomorrow to review this, but also, I should
have v9 of the patchset submitted soon, which will address a lot of the
good issues people brought up several months ago for v8. So please sit
tight while I prepare that.

Jason
On Fri, 22 Mar 2019 at 07:28, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> Hi:
>
> In the interest of moving forward with wireguard, this patch
> series adds the zinc interface as submitted in
>
> https://patchwork.kernel.org/project/linux-crypto/list/?series=32507&state=*
>
> with the change that existing implementations of chacha20 and
> poly1305 are used instead of the new ones. The use of the existing
> chacha20/poly1305 implementations does not involve any use of the
> crypto API or indirect function calls. Only direct function calls
> are made into the underlying implementation.
>
> This should allow the wireguard code itself to proceed. At the
> same time we can also move forward with replacing the existing
> implementations of chacha20 and/or poly1305 where appropriate.
>

Hi Herbert,

Let me reiterate some of my concerns with Zinc, which aren't really
addressed by your patches afaict.

- The way WireGuard uses crypto in the kernel is essentially a layering
  violation - while we already have an asynchronous API that implements
  ChaCha20Poly1305 in a way that WireGuard /could/ benefit from, and
  while we even have support already for async accelerators that
  implement it, the reluctance of Jason to work with the community to
  fix the issues that he identified in the crypto API means that
  WireGuard will not be able to use these, and is restricted to
  synchronous software implementations. Saying accelerators will not
  matter is a bit like saying there are no American soldiers in Iraq.
  (Note that adding async interfaces to Zinc is not the right way to
  deal with this IMO)

- I think it is fine to have a 'blessed' set of software crypto in the
  kernel that we standardize on for internal use cases, but WireGuard
  is not one of them. Having two separate s/w crypto stacks is a
  problem, and I don't think it helps to have a cute name either.
- I don't think Zinc changes should go through Greg's tree and have
  separate maintainers. In fact, I am a bit concerned about the fact
  that, after the last Zinc/WG submission in October, Jason has not
  really interacted with the linux-crypto community at all, while a lot
  of work was being done (especially by Eric) to address issues that he
  helped identify. So letting Jason (and Samuel, who has not chimed in
  at all, IIRC) maintain something that is relevant to crypto in the
  kernel, but without a community and without involvement of the
  linux-crypto maintainer, is not acceptable to me. (In general, people
  tend to join a community before being volunteered to maintain its
  codebase)

I am among the people who really want to see WireGuard merged. But the
whole Zinc thing is an unnecessary distraction from getting the
existing crypto API fixed in places where it fails to support
WireGuard's use case, and that is a loss for WireGuard users as well as
the linux-crypto community.
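The distinction Herbert's cover letter draws above - direct function calls into a known implementation versus the crypto API's runtime algorithm lookup and indirect call - can be sketched in a few lines. This is purely an illustrative model: the registry, the algorithm name, and `toy_cipher` are all invented for demonstration (the xor "cipher" is not real crypto), and none of this is the actual kernel interface.

```python
# Illustrative model only: toy_cipher is a plain repeating-key xor
# standing in for chacha20, NOT real crypto.
def toy_cipher(key: bytes, data: bytes) -> bytes:
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

# crypto API style: resolve the algorithm by name at runtime, then call
# through the returned handle - an indirect call the compiler can't inline.
_registry = {"chacha20-generic": toy_cipher}

def crypto_alloc(name: str):
    return _registry[name]          # runtime lookup by string

def encrypt_indirect(name: str, key: bytes, data: bytes) -> bytes:
    tfm = crypto_alloc(name)
    return tfm(key, data)

# Zinc style: the caller names the function directly - no lookup, no
# function pointer, one ordinary (inlinable) call.
def encrypt_direct(key: bytes, data: bytes) -> bytes:
    return toy_cipher(key, data)
```

Both paths compute the same thing; the argument in this thread is about whether the lookup-and-dispatch layer buys you anything when the caller already knows exactly which cipher it wants.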
On Fri, Mar 22, 2019 at 1:56 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> that implement it, the reluctance of Jason to work with the community
> to fix the issues that he identified in the crypto API means that
> WireGuard will not be able to use these, and is restricted to
> synchronous software implementations.

There's no reluctance to work with the community. I'm pretty deeply
committed to this, as evidenced by the multitudes of patch submissions,
discussions, and popping around from conference to conference
discussing with folks face to face.

> - The way WireGuard uses crypto in the kernel is essentially a
> layering violation - while we already have an asynchronous API that
> implements ChaCha20Poly1305 in a way that WireGuard /could/ benefit
> from, and while we even have support already for async accelerators
> Saying accelerators will not
> matter is a bit like saying there are no American soldiers in Iraq.
> (Note that adding async interfaces to Zinc is not the right way to
> deal with this IMO)

Geopolitics aside, as explained over and over, the point of Zinc is to
just get down to simple software function calls, for instances where
async accelerators aren't available. I consider smushing it all
together to be the "layering violation".

> - I don't think Zinc changes should go through Greg's tree and have
> separate maintainers

As I've mentioned before, I'm fine with merges going whichever route is
best for merges. If Herbert thinks it'd be useful to send things in
that direction and would reduce clashes and such, then that's fine with
me.

> in fact, I am a bit concerned about the fact
> that, after the last Zinc/WG submission in October, Jason has not
> really interacted

Such as... https://www.youtube.com/watch?v=CejbCQ5wS7Q ? November is
after October. Anyway, I thought it good advice to not post v9 too
swiftly after Plumbers, letting things calm a bit and focusing on other
improvements in Zinc in the meanwhile. I've certainly been around.
Looks like Herbert posting this might erupt into another contentious
thread of bikeshedding and argument, what a bummer.
On Fri, Mar 22, 2019 at 12:56 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>
> - The way WireGuard uses crypto in the kernel is essentially a
> layering violation

What? No.

That's just wrong.

It's only a layering violation if you accept and buy into the crypto/
model. And Jason obviously doesn't.

And honestly, I'm 1000% with Jason on this. The crypto/ model is hard
to use, inefficient, and completely pointless when you know what your
cipher or hash algorithm is, and your CPU just does it well directly.

> we even have support already for async accelerators that implement it,

Afaik, none of the async accelerator code has ever been worth anything
on real hardware and on any sane and real loads. The cost of going
outside the CPU is *so* expensive that you'll always lose, unless the
algorithm has been explicitly designed to be insanely hard on a regular
CPU.

(Corollary: "insanely hard on a regular CPU" is also easy to do by
making the CPU be weak and bad. Which is not a case we should optimize
for).

The whole "external accelerator" model is odd. It was wrong. It only
makes sense if the accelerator does *everything* (ie it's the network
card), and then you wouldn't use the wireguard thing on the CPU at all,
you'd have all those things on the accelerator (ie a "network card that
does WG").

One of the (best or worst, depending on your hangups) arguments for
external accelerators has been "but I trust the external hardware with
the key, but not my own code", aka the TPM or Disney argument. I don't
think that's at all relevant to the discussion either.

The whole model of async accelerators is completely bogus. The only
crypto or hash accelerator that is worth it are the ones integrated on
the CPU cores, which have the direct access to caches.
And if the accelerator is some tightly coupled thing that has direct
access to your caches, and doesn't need interrupt overhead or address
translation etc (at which point it can be worth using), then you might
as well just consider it an odd version of the above. You'd want to
poll for the result anyway, because not polling is too expensive.

Just a single interrupt would completely undo all the advantages you
got from using specialized hardware - both power and performance.

And that kind of model would work just fine with zinc.

So an accelerator ends up being useful in two cases:

 - it's entirely external and part of the network card, so that there's
   no extra data transfer overhead

 - it's tightly coupled enough (either CPU instructions or some on-die
   cache coherent engine) that you can and will just use it
   synchronously anyway.

In the first case, you wouldn't run wireguard on the CPU anyway - you
have a network card that just implements the VPN.

And in the second case, the zinc model is the right one.

So no. I don't believe "layering violation" is the issue here at all.
The only main issue as far as I'm concerned is how to deal with the
fact that we have duplicate code and effort.

Linus
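Linus's interrupt-overhead argument above can be put in back-of-envelope form: offload pays a fixed per-request cost (submission, completion interrupt, cache refill) that synchronous CPU code never pays, so it only wins past some break-even size. The model and every number below are hypothetical placeholders for illustration, not measurements of any real hardware.

```python
def offload_is_faster(nbytes: int,
                      cpu_ns_per_byte: float,
                      engine_ns_per_byte: float,
                      fixed_ns: float) -> bool:
    """Crude latency model: the engine pays a fixed per-request overhead
    plus per-byte time; the CPU pays only per-byte time."""
    return fixed_ns + engine_ns_per_byte * nbytes < cpu_ns_per_byte * nbytes
```

With made-up numbers - CPU at 2 ns/byte, engine at 0.5 ns/byte, 5 µs of fixed per-request overhead - break-even sits above 3000 bytes, so a typical 1500-byte network packet still loses on the engine even though the engine's per-byte rate is 4x better. That is the shape of the argument: per-packet workloads amortize the fixed cost poorly.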
As someone who has been working on crypto acceleration hardware for the
better part of the past 20 years, I feel compelled to respond to this,
in defence of the crypto API (which we're really happy with ...).

> And honestly, I'm 1000% with Jason on this. The crypto/ model is hard to use,
> inefficient, and completely pointless when you know what your cipher or
> hash algorithm is, and your CPU just does it well directly.
>
> > we even have support already for async accelerators that implement it,
>
> Afaik, none of the async accelerator code has ever been worth anything on
> real hardware and on any sane and real loads. The cost of going outside the
> CPU is *so* expensive that you'll always lose, unless the algorithm has been
> explicitly designed to be insanely hard on a regular CPU.
>

The days of designing them *specifically* to be hard on a CPU are over,
but nevertheless, due to required security properties, *good* crypto is
usually still hard work for a CPU. Especially *asymmetric* crypto,
which can easily take milliseconds per operation even on a high-end
CPU. You can do a lot of interrupt processing in a millisecond.

Now some symmetric algorithms (AES, SHA, GHASH) have actually made it
*into* the CPU, somewhat changing the landscape, but:

a) That's really only a small subset of the crypto being used out
   there. There's much more to crypto than just AES, SHA and GHASH.
b) This only applies to high-end CPUs. The large majority of embedded
   CPUs do not have such embedded crypto instructions. This may change,
   but I don't really expect that to happen soon.
c) You're still keeping a big, power-hungry CPU busy for a lot of
   cycles doing some fairly trivial work.

> (Corollary: "insanely hard on a regular CPU" is also easy to do by making the
> CPU be weak and bad. Which is not a case we should optimize for).
>

Linux is being used a lot for embedded systems which usually DO have
fairly weak CPUs.
I would argue that's exactly the case you SHOULD optimize for, as
that's where you can *really* still make the difference as a
programmer.

> The whole "external accelerator" model is odd. It was wrong. It only makes
> sense if the accelerator does *everything* (ie it's the network card), and
> then you wouldn't use the wireguard thing on the CPU at all, you'd have all
> those things on the accelerator (ie a "network card that does WG").
>
> One of the (best or worst, depending on your hangups) arguments for
> external accelerators has been "but I trust the external hardware with the
> key, but not my own code", aka the TPM or Disney argument. I don't think
> that's at all relevant to the discussion either.
>
> The whole model of async accelerators is completely bogus. The only crypto
> or hash accelerator that is worth it are the ones integrated on the CPU cores,
> which have the direct access to caches.
>

NOT true. We wouldn't still be in business after 20 years if that were
true.

> And if the accelerator is some tightly coupled thing that has direct access to
> your caches, and doesn't need interrupt overhead or address translation etc
> (at which point it can be worth using) then you might as well just consider it
> an odd version of the above. You'd want to poll for the result anyway,
> because not polling is too expensive.
>

It's HARD to get the interfacing right to take advantage of the
acceleration. And that's exactly why you MUST have an async interface
for that: to cope with the large latency of external acceleration,
going through the memory subsystem and external buses, as - as you
already pointed out - you cannot access the CPU cache directly (though
you can be fully coherent with it). So to cope with that latency, you
will need to batch, queue and pipeline your processing. This is not
unique to crypto acceleration; the same principles apply to e.g. your
GPU as well. Or any HW worth anything, for that matter.
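The "batch, queue and pipeline" discipline described above is exactly what an async interface exists to express: submission returns immediately, and one completion event drains many requests, amortizing the fixed overhead across the batch. Here is a toy Python model of that shape - the class and method names are invented for illustration and bear no relation to the actual crypto API's request objects and completion callbacks.

```python
from collections import deque

class OffloadQueue:
    """Toy model of an async engine interface: submit() queues work and
    returns immediately, and results are delivered later via per-request
    callbacks when a batch completes - rather than each call blocking
    for the engine's full round-trip latency."""

    def __init__(self, transform):
        self._transform = transform   # stand-in for the engine's cipher
        self._pending = deque()

    def submit(self, data: bytes, done) -> str:
        self._pending.append((data, done))
        return "EINPROGRESS"          # caller keeps going; no busy-wait

    def complete_batch(self) -> int:
        # One completion "interrupt" drains every queued request,
        # amortizing the fixed per-event cost across the whole batch.
        completed = 0
        while self._pending:
            data, done = self._pending.popleft()
            done(self._transform(data))
            completed += 1
        return completed
```

A caller submits a burst of packets, carries on with other work, and handles all the completions at once - the pattern a synchronous function-call interface cannot express.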
What that DOES mean is that external crypto accelerators are indeed
useless for doing the occasional crypto operation; it really only makes
sense for streaming and/or bulk use cases, such as network packet or
disk encryption. And for operations that really take significant time,
such as asymmetric crypto.

For the occasional symmetric crypto operation, by all means, do that on
the CPU using a very thin API layer for efficiency and simplicity. This
is where Zinc makes a lot of sense - I'm not against Zinc at all. But
DO realise that when you go that route, you forfeit any chance of
benefitting from acceleration.

Ironically, for doing what WireGuard is doing - bulk packet encryption
- the async crypto API makes a lot more sense than Zinc. In Jason's
defense, as far as I know, there is no Poly/ChaCha HW acceleration out
there yet, but I can assure you that that situation is going to change
soon :-)

Still, I would really recommend running WireGuard on top of the crypto
API. How much performance would you really lose by doing that? If
there's anything wrong with the efficiency of the crypto API
implementation of P/C, then just fix that.

> Just a single interrupt would completely undo all the advantages you got
> from using specialized hardware - both power and performance.
>

We have done plenty of measurements, on both power and performance, to
prove you wrong there. Typically our HW needs *at least* a full order
of magnitude less power to do the actual work. The CPU load for
handling the interrupts etc. tends to be around 20%. So assuming the
CPU goes to sleep for the other 80% of the time, the combined solution
would need about 1/3rd of the power of a CPU-only solution. It's one of
our biggest selling points.

As for performance: in the embedded space you can normally expect the
crypto accelerator to be *much* faster than the CPU subsystem as well.
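Pascal's power arithmetic above can be checked with a one-line model: the CPU is active only a fraction of the time (interrupt handling etc.) and the engine draws some fraction of the CPU's power while doing the work. The function and its parameters are a hypothetical modelling sketch, not his measurement methodology.

```python
def combined_power_fraction(cpu_active_frac: float,
                            engine_power_ratio: float,
                            cpu_sleep_ratio: float = 0.0) -> float:
    """Power of (CPU + offload engine) relative to a CPU-only solution.
    cpu_active_frac: fraction of time the CPU is busy (interrupts etc.);
    engine_power_ratio: engine power as a fraction of CPU active power;
    cpu_sleep_ratio: CPU sleep power as a fraction of its active power."""
    cpu = cpu_active_frac + (1 - cpu_active_frac) * cpu_sleep_ratio
    return cpu + engine_power_ratio
```

Plugging in the figures from the message - 20% CPU load and an engine needing an order of magnitude less power - gives 0.2 + 0.1 = 0.3, i.e. "about 1/3rd of the power of a CPU-only solution" (assuming negligible sleep power).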
So yes, big multi-core multi-GHz Xeons can do AES-GCM extremely fast
and we'd have a hard time competing with that, but that's not the
relevant use case here.

> And that kind of model would work just fine with zinc.
>
> So an accelerator ends up being useful in two cases:
>
> - it's entirely external and part of the network card, so that there's no extra
> data transfer overhead
>

There are inherent efficiency advantages to integrating the crypto
there, but it's also very inflexible, as you're stuck with a
particular, usually very limited, implementation. And it doesn't fly
very well with current trends of Software Defined Networking and
Network Function Virtualization. As for data transfer overhead: as long
as the SoC memory subsystem was designed to cope with this, it's really
irrelevant, as you shouldn't notice it.

> - it's tightly coupled enough (either CPU instructions or some on-die cache
> coherent engine) that you can and will just use it synchronously anyway.
>
> In the first case, you wouldn't run wireguard on the CPU anyway - you have a
> network card that just implements the VPN.
>

The irony here is that such a network card would probably be an
embedded system running Linux on an embedded CPU, offloading the crypto
operations to our accelerator ... Most smart NICs are architected that
way.

Pascal van Leeuwen,
Silicon IP Architect @ Inside Secure
On Fri, 22 Mar 2019 at 18:48, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Fri, Mar 22, 2019 at 12:56 AM Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
> >
> > - The way WireGuard uses crypto in the kernel is essentially a
> > layering violation
>
> What? No.
>
> That's just wrong.
>
> It's only a layering violation if you accept and buy into the crypto/ model.
>
> And Jason obviously doesn't.
>
> And honestly, I'm 1000% with Jason on this. The crypto/ model is hard
> to use, inefficient, and completely pointless when you know what your
> cipher or hash algorithm is, and your CPU just does it well directly.
>

Well, it is true that the dynamic dispatch is annoying and we need to
fix that. And I have not given up hope that someone like Jason, with
his level of understanding of crypto, and an enticing use case such as
WireGuard, will choose to work with the linux-crypto community to
improve it rather than build something from scratch on the side. But
that is orthogonal to the sync vs async debate.

> > we even have support already for async accelerators that implement it,
>
> Afaik, none of the async accelerator code has ever been worth anything
> on real hardware and on any sane and real loads. The cost of going
> outside the CPU is *so* expensive that you'll always lose, unless the
> algorithm has been explicitly designed to be insanely hard on a
> regular CPU.
>
> (Corollary: "insanely hard on a regular CPU" is also easy to do by
> making the CPU be weak and bad. Which is not a case we should optimize
> for).
>
> The whole "external accelerator" model is odd. It was wrong. It only
> makes sense if the accelerator does *everything* (ie it's the network
> card), and then you wouldn't use the wireguard thing on the CPU at
> all, you'd have all those things on the accelerator (ie a "network
> card that does WG").
>
> One of the (best or worst, depending on your hangups) arguments for
> external accelerators has been "but I trust the external hardware with
> the key, but not my own code", aka the TPM or Disney argument. I
> don't think that's at all relevant to the discussion either.
>
> The whole model of async accelerators is completely bogus. The only
> crypto or hash accelerator that is worth it are the ones integrated on
> the CPU cores, which have the direct access to caches.
>
> And if the accelerator is some tightly coupled thing that has direct
> access to your caches, and doesn't need interrupt overhead or address
> translation etc (at which point it can be worth using) then you might
> as well just consider it an odd version of the above. You'd want to
> poll for the result anyway, because not polling is too expensive.
>
> Just a single interrupt would completely undo all the advantages you
> got from using specialized hardware - both power and performance.
>
> And that kind of model would work just fine with zinc.
>
> So an accelerator ends up being useful in two cases:
>
> - it's entirely external and part of the network card, so that
> there's no extra data transfer overhead
>
> - it's tightly coupled enough (either CPU instructions or some on-die
> cache coherent engine) that you can and will just use it synchronously
> anyway.
>
> In the first case, you wouldn't run wireguard on the CPU anyway - you
> have a network card that just implements the VPN.
>
> And in the second case, the zinc model is the right one.
>
> So no. I don't believe "layering violation" is the issue here at all.
>

If the consensus is that no accelerator is likely to ever exist that
outperforms CPU crypto by any measure (latency, throughput, power
efficiency), then I don't have any objections to bolting this straight
onto a synchronous interface.
My concern is that we will end up with Zinc patches 12 months from now
that implement async interfaces to support some particular piece of
hardware, while other hardware is only supported by the crypto/ API,
even though the algorithm they implement is the same.

> The only main issue as far as I'm concerned is how to deal with the
> fact that we have duplicate code and effort.
>
> Linus
On Mon, Mar 25, 2019 at 11:43 AM Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> onto a synchronous interface. My concern is that we will end up with
> Zinc patches 12 months from now that implement async interfaces to
> support some particular piece of hardware, while other hardware is
> only supported by the crypto/ API, even though the algorithm they
> implement is the same.

Allow me to allay that concern: Zinc is for simple synchronous software
functions.
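The "simple synchronous software functions" under discussion are things like the ChaCha20 block function: pure computation in, 64 bytes of keystream out, no dispatch and no completion machinery. As a self-contained illustration, here it is in Python following RFC 8439 - for demonstration only (a plain Python transcription, nowhere near the performance or the constant-time properties of the kernel's implementations).

```python
import struct

def _rotl32(v: int, n: int) -> int:
    return ((v << n) & 0xffffffff) | (v >> (32 - n))

def _quarter_round(s: list, a: int, b: int, c: int, d: int) -> None:
    s[a] = (s[a] + s[b]) & 0xffffffff; s[d] = _rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & 0xffffffff; s[b] = _rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & 0xffffffff; s[d] = _rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & 0xffffffff; s[b] = _rotl32(s[b] ^ s[c], 7)

def chacha20_block(key: bytes, counter: int, nonce: bytes) -> bytes:
    """One 64-byte keystream block per RFC 8439 (32-byte key, 12-byte nonce)."""
    state = (list(struct.unpack("<4I", b"expand 32-byte k"))
             + list(struct.unpack("<8I", key))
             + [counter]
             + list(struct.unpack("<3I", nonce)))
    working = state[:]
    for _ in range(10):                       # 20 rounds = 10 double rounds
        _quarter_round(working, 0, 4, 8, 12)  # column rounds
        _quarter_round(working, 1, 5, 9, 13)
        _quarter_round(working, 2, 6, 10, 14)
        _quarter_round(working, 3, 7, 11, 15)
        _quarter_round(working, 0, 5, 10, 15) # diagonal rounds
        _quarter_round(working, 1, 6, 11, 12)
        _quarter_round(working, 2, 7, 8, 13)
        _quarter_round(working, 3, 4, 9, 14)
    out = [(w + s) & 0xffffffff for w, s in zip(working, state)]
    return struct.pack("<16I", *out)
```

A synchronous API around this is one direct call per block; the entire sync-vs-async debate in this thread is about what, if anything, should sit between the caller and a function of this shape.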
On Mon, Mar 25, 2019 at 09:10:08AM +0000, Pascal Van Leeuwen wrote:
> As someone who has been working on crypto acceleration hardware for the better
> part of the past 20 years, I feel compelled to respond to this, in defence of
> the crypto API (which we're really happy with ...).

Thanks for joining in.

> We have done plenty of measurements, on both power and performance, to prove
> you wrong there. Typically our HW needs *at least* a full order of a magnitude
> less power to do the actual work. The CPU load for handing the interrupts etc.
> tends to be around 20%. So assuming the CPU goes to sleep for the other 80%
> of the time, the combined solution would need about 1/3rd of the power of a
> CPU only solution. It's one of our biggest selling points.

I actually have the kind of (relatively) low-power system with your
crypto accelerator IP. But it doesn't matter how great performance or
energy savings I would get from using crypto offloading. I will
continue to use a CPU-only solution for the foreseeable future, because
of one tiny problem: the firmware files for the eip197(?) crypto
accelerator are secrets we are not allowed to download from anywhere.

> Pascal van Leeuwen,
> Silicon IP Architect @ Inside Secure

Riku
> The firmware files for eip197(?) crypto accelerator are secrets we are
> not allowed download anywhere from.
>

Riku,

The driver update I'm working on will fix that issue for you ;-) I'm
planning to upstream my changes but I still need to find the right
approach for doing so.

Regards,
Pascal