Message ID | 20161214035927.30004-3-Jason@zx2c4.com (mailing list archive) |
---|---|
State | New, archived |
Headers | show |
Hi David, On Wed, Dec 14, 2016 at 10:51 AM, David Laight <David.Laight@aculab.com> wrote: > From: Jason A. Donenfeld >> Sent: 14 December 2016 00:17 >> This gives a clear speed and security improvement. Rather than manually >> filling MD5 buffers, we simply create a layout by a simple anonymous >> struct, for which gcc generates rather efficient code. > ... >> + const struct { >> + struct in6_addr saddr; >> + struct in6_addr daddr; >> + __be16 sport; >> + __be16 dport; >> + } __packed combined = { >> + .saddr = *(struct in6_addr *)saddr, >> + .daddr = *(struct in6_addr *)daddr, >> + .sport = sport, >> + .dport = dport >> + }; > > You need to look at the effect of marking this (and the other) > structures 'packed' on architectures like sparc64. In all current uses of __packed in the code, I think the impact is precisely zero, because all structures have members in descending order of size, with each member being a perfect multiple of the one below it. The __packed is therefore just there for safety, in case somebody comes in and screws everything up by sticking a u8 in between. In that case, it wouldn't be desirable to hash the structure padding bits. In the worst case, I don't believe the impact would be worse than a byte-by-byte memcpy, which is what the old code did. But anyway, these structures are already naturally packed anyway, so the present impact is nil. Jason
On 14.12.2016 13:53, Jason A. Donenfeld wrote: > Hi David, > > On Wed, Dec 14, 2016 at 10:51 AM, David Laight <David.Laight@aculab.com> wrote: >> From: Jason A. Donenfeld >>> Sent: 14 December 2016 00:17 >>> This gives a clear speed and security improvement. Rather than manually >>> filling MD5 buffers, we simply create a layout by a simple anonymous >>> struct, for which gcc generates rather efficient code. >> ... >>> + const struct { >>> + struct in6_addr saddr; >>> + struct in6_addr daddr; >>> + __be16 sport; >>> + __be16 dport; >>> + } __packed combined = { >>> + .saddr = *(struct in6_addr *)saddr, >>> + .daddr = *(struct in6_addr *)daddr, >>> + .sport = sport, >>> + .dport = dport >>> + }; >> >> You need to look at the effect of marking this (and the other) >> structures 'packed' on architectures like sparc64. > > In all current uses of __packed in the code, I think the impact is > precisely zero, because all structures have members in descending > order of size, with each member being a perfect multiple of the one > below it. The __packed is therefore just there for safety, in case > somebody comes in and screws everything up by sticking a u8 in > between. In that case, it wouldn't be desirable to hash the structure > padding bits. In the worst case, I don't believe the impact would be > worse than a byte-by-byte memcpy, which is what the old code did. But > anyway, these structures are already naturally packed anyway, so the > present impact is nil. __packed not only removes all padding of the struct but also changes the alignment assumptions for the whole struct itself. The rule, the struct is aligned by its maximum alignment of a member is no longer true. That said, the code accessing this struct will change (not on archs that can deal efficiently with unaligned access, but on others). A proper test for not introducing padding is to use something with BUILD_BUG_ON. Also gcc also clears the padding of the struct, so padding shouldn't be that bad in there (it is better than byte access on mips). Btw. I think gcc7 gets support for store merging optimization. Bye, Hannes
Hi Hannes, Thanks for the feedback. > __packed not only removes all padding of the struct but also changes the > alignment assumptions for the whole struct itself. The rule, the struct > is aligned by its maximum alignment of a member is no longer true. That > said, the code accessing this struct will change (not on archs that can > deal efficiently with unaligned access, but on others). That's interesting. There currently aren't any alignment requirements in siphash because we use the unaligned helper functions, but as David pointed out in another thread, maybe that too should change. In that case, we'd have an aligned-only version of the function that requires 8-byte aligned input. Perhaps the best way to go about that would be to just mark the struct as __packed __aligned(8). Or, I guess, since 64-bit accesses gets split into two on 32-bit, that'd be best descried as __packed __aligned(sizeof(long)). Would that be an acceptable solution? Jason
From: Jason A. Donenfeld > Sent: 14 December 2016 13:44 > To: Hannes Frederic Sowa > > __packed not only removes all padding of the struct but also changes the > > alignment assumptions for the whole struct itself. The rule, the struct > > is aligned by its maximum alignment of a member is no longer true. That > > said, the code accessing this struct will change (not on archs that can > > deal efficiently with unaligned access, but on others). > > That's interesting. There currently aren't any alignment requirements > in siphash because we use the unaligned helper functions, but as David > pointed out in another thread, maybe that too should change. In that > case, we'd have an aligned-only version of the function that requires > 8-byte aligned input. Perhaps the best way to go about that would be > to just mark the struct as __packed __aligned(8). Or, I guess, since > 64-bit accesses gets split into two on 32-bit, that'd be best descried > as __packed __aligned(sizeof(long)). Would that be an acceptable > solution? Just remove the __packed and ensure that the structure is 'nice'. This includes ensuring there is no 'tail padding'. In some cases you'll need to put the port number into a 32bit field. I'd also require that the key be aligned. It probably ought to be a named structure type with two 64bit members (or with an array member that has two elements). David
On Wed, Dec 14, 2016 at 3:47 PM, David Laight <David.Laight@aculab.com> wrote: > Just remove the __packed and ensure that the structure is 'nice'. > This includes ensuring there is no 'tail padding'. > In some cases you'll need to put the port number into a 32bit field. I'd rather not. There's no point in wasting extra cycles on hashing useless data, just for some particular syntactic improvement. __packed __aligned(8) will do what we want perfectly, I think. > I'd also require that the key be aligned. Yep, I'll certainly do this for the siphash24_aligned version in the v3.
From: "Jason A. Donenfeld" <Jason@zx2c4.com> Date: Wed, 14 Dec 2016 13:53:10 +0100 > In all current uses of __packed in the code, I think the impact is > precisely zero, because all structures have members in descending > order of size, with each member being a perfect multiple of the one > below it. The __packed is therefore just there for safety, in case > somebody comes in and screws everything up by sticking a u8 in > between. Just marking the structure __packed, whether necessary or not, makes the compiler assume that the members are not aligned and causes byte-by-byte accesses to be performed for words. Never, _ever_, use __packed unless absolutely necessary, it pessimizes the code on cpus that require proper alignment of types.
Hi David, On Wed, Dec 14, 2016 at 6:56 PM, David Miller <davem@davemloft.net> wrote: > Just marking the structure __packed, whether necessary or not, makes > the compiler assume that the members are not aligned and causes > byte-by-byte accesses to be performed for words. > Never, _ever_, use __packed unless absolutely necessary, it pessimizes > the code on cpus that require proper alignment of types. Oh, jimminy cricket, I did not realize that it made assignments byte-by-byte *always*. So what options am I left with? What immediately comes to mind are: 1) struct { u64 a; u32 b; u32 c; u16 d; u8 end[]; } a = { .a = a, .b = b, .c = c, .d = d }; siphash24(&a, offsetof(typeof(a), end), key); 2) u8 bytes[sizeof(u64) + sizeof(u32) * 2 + sizeof(u16)]; *(u64 *)&bytes[0] = a; *(u32 *)&bytes[sizeof(u64)] = b; *(u32 *)&bytes[sizeof(u64) + sizeof(u32)] = c; *(u16 *)&bytes[sizeof(u64) + sizeof(u32) * 2] = d; siphash24(bytes, sizeof(bytes), key); Personally I find (1) a bit neater than (2). What's your opinion? Jason
On 14.12.2016 19:06, Jason A. Donenfeld wrote: > Hi David, > > On Wed, Dec 14, 2016 at 6:56 PM, David Miller <davem@davemloft.net> wrote: >> Just marking the structure __packed, whether necessary or not, makes >> the compiler assume that the members are not aligned and causes >> byte-by-byte accesses to be performed for words. >> Never, _ever_, use __packed unless absolutely necessary, it pessimizes >> the code on cpus that require proper alignment of types. > > Oh, jimminy cricket, I did not realize that it made assignments > byte-by-byte *always*. So what options am I left with? What > immediately comes to mind are: > > 1) > > struct { > u64 a; > u32 b; > u32 c; > u16 d; > u8 end[]; I don't think this helps. Did you test it? I don't see reason why padding could be left out between `d' and `end' because of the flexible array member? Bye, Hannes
Hi Hannes, On Wed, Dec 14, 2016 at 8:22 PM, Hannes Frederic Sowa <hannes@stressinduktion.org> wrote: > I don't think this helps. Did you test it? I don't see reason why > padding could be left out between `d' and `end' because of the flexible > array member? Because the type u8 doesn't require any alignment requirements, it can nestle right up there cozy with the u16: zx2c4@thinkpad ~ $ cat a.c #include <stdint.h> #include <stdio.h> #include <stddef.h> int main() { struct { uint64_t a; uint32_t b; uint32_t c; uint16_t d; char x[]; } a; printf("%zu\n", sizeof(a)); printf("%zu\n", offsetof(typeof(a), x)); return 0; } zx2c4@thinkpad ~ $ gcc a.c zx2c4@thinkpad ~ $ ./a.out 24 18 Jason
On Wed, Dec 14, 2016 at 4:53 AM, Jason A. Donenfeld <Jason@zx2c4.com> wrote: > Hi David, > > On Wed, Dec 14, 2016 at 10:51 AM, David Laight <David.Laight@aculab.com> wrote: >> From: Jason A. Donenfeld >>> Sent: 14 December 2016 00:17 >>> This gives a clear speed and security improvement. Rather than manually >>> filling MD5 buffers, we simply create a layout by a simple anonymous >>> struct, for which gcc generates rather efficient code. >> ... >>> + const struct { >>> + struct in6_addr saddr; >>> + struct in6_addr daddr; >>> + __be16 sport; >>> + __be16 dport; >>> + } __packed combined = { >>> + .saddr = *(struct in6_addr *)saddr, >>> + .daddr = *(struct in6_addr *)daddr, >>> + .sport = sport, >>> + .dport = dport >>> + }; >> >> You need to look at the effect of marking this (and the other) >> structures 'packed' on architectures like sparc64. > > In all current uses of __packed in the code, I think the impact is > precisely zero, because all structures have members in descending > order of size, with each member being a perfect multiple of the one > below it. The __packed is therefore just there for safety, in case > somebody comes in and screws everything up by sticking a u8 in > between. In that case, it wouldn't be desirable to hash the structure > padding bits. In the worst case, I don't believe the impact would be > worse than a byte-by-byte memcpy, which is what the old code did. But > anyway, these structures are already naturally packed anyway, so the > present impact is nil. > If you pad the data structure to 64 bits then we can call the version of siphash that only deals in 64 bit words. Writing a zero in the padding will be cheaper than dealing with odd lengths in siphash24. Tom > Jason
Hey Jason, On 14.12.2016 20:38, Jason A. Donenfeld wrote: > On Wed, Dec 14, 2016 at 8:22 PM, Hannes Frederic Sowa > <hannes@stressinduktion.org> wrote: >> I don't think this helps. Did you test it? I don't see reason why >> padding could be left out between `d' and `end' because of the flexible >> array member? > > Because the type u8 doesn't require any alignment requirements, it can > nestle right up there cozy with the u16: > > zx2c4@thinkpad ~ $ cat a.c > #include <stdint.h> > #include <stdio.h> > #include <stddef.h> > int main() > { > struct { > uint64_t a; > uint32_t b; > uint32_t c; > uint16_t d; > char x[]; > } a; > printf("%zu\n", sizeof(a)); > printf("%zu\n", offsetof(typeof(a), x)); > return 0; > } > zx2c4@thinkpad ~ $ gcc a.c > zx2c4@thinkpad ~ $ ./a.out > 24 > 18 Sorry, I misread the patch. You are using offsetof. In this case remove the char x[] and just use offsetofend because it is misleading otherwise. Should work like that though. What I don't really understand is that the addition of this complexity actually reduces the performance, as you have to take the "if (left)" branch during hashing and causes you to make a load_unaligned_zeropad. Bye, Hannes
On Wed, Dec 14, 2016 at 9:12 PM, Tom Herbert <tom@herbertland.com> wrote: > If you pad the data structure to 64 bits then we can call the version > of siphash that only deals in 64 bit words. Writing a zero in the > padding will be cheaper than dealing with odd lengths in siphash24. On Wed, Dec 14, 2016 at 9:27 PM, Hannes Frederic Sowa <hannes@stressinduktion.org> wrote: > What I don't really understand is that the addition of this complexity > actually reduces the performance, as you have to take the "if (left)" > branch during hashing and causes you to make a load_unaligned_zeropad. Oh, duh, you guys are right. Fixed in my repo [1]. I'll submit the next version in a day or so to let some other comments come in. Thanks again for your reviews. Jason [1] https://git.zx2c4.com/linux-dev/log/?h=siphash
diff --git a/net/core/secure_seq.c b/net/core/secure_seq.c index 88a8e429fc3e..abadc79cd5d3 100644 --- a/net/core/secure_seq.c +++ b/net/core/secure_seq.c @@ -1,3 +1,5 @@ +/* Copyright (C) 2016 Jason A. Donenfeld <Jason@zx2c4.com>. All Rights Reserved. */ + #include <linux/kernel.h> #include <linux/init.h> #include <linux/cryptohash.h> @@ -8,14 +10,14 @@ #include <linux/ktime.h> #include <linux/string.h> #include <linux/net.h> - +#include <linux/siphash.h> #include <net/secure_seq.h> #if IS_ENABLED(CONFIG_IPV6) || IS_ENABLED(CONFIG_INET) +#include <linux/in6.h> #include <net/tcp.h> -#define NET_SECRET_SIZE (MD5_MESSAGE_BYTES / 4) -static u32 net_secret[NET_SECRET_SIZE] ____cacheline_aligned; +static u8 net_secret[SIPHASH24_KEY_LEN]; static __always_inline void net_secret_init(void) { @@ -44,44 +46,39 @@ static u32 seq_scale(u32 seq) u32 secure_tcpv6_sequence_number(const __be32 *saddr, const __be32 *daddr, __be16 sport, __be16 dport, u32 *tsoff) { - u32 secret[MD5_MESSAGE_BYTES / 4]; - u32 hash[MD5_DIGEST_WORDS]; - u32 i; - + const struct { + struct in6_addr saddr; + struct in6_addr daddr; + __be16 sport; + __be16 dport; + } __packed combined = { + .saddr = *(struct in6_addr *)saddr, + .daddr = *(struct in6_addr *)daddr, + .sport = sport, + .dport = dport + }; + u64 hash; net_secret_init(); - memcpy(hash, saddr, 16); - for (i = 0; i < 4; i++) - secret[i] = net_secret[i] + (__force u32)daddr[i]; - secret[4] = net_secret[4] + - (((__force u16)sport << 16) + (__force u16)dport); - for (i = 5; i < MD5_MESSAGE_BYTES / 4; i++) - secret[i] = net_secret[i]; - - md5_transform(hash, secret); - - *tsoff = sysctl_tcp_timestamps == 1 ? hash[1] : 0; - return seq_scale(hash[0]); + hash = siphash24((const u8 *)&combined, sizeof(combined), net_secret); + *tsoff = sysctl_tcp_timestamps == 1 ? (hash >> 32) : 0; + return seq_scale(hash); } EXPORT_SYMBOL(secure_tcpv6_sequence_number); u32 secure_ipv6_port_ephemeral(const __be32 *saddr, const __be32 *daddr, __be16 dport) { - u32 secret[MD5_MESSAGE_BYTES / 4]; - u32 hash[MD5_DIGEST_WORDS]; - u32 i; - + const struct { + struct in6_addr saddr; + struct in6_addr daddr; + __be16 dport; + } __packed combined = { + .saddr = *(struct in6_addr *)saddr, + .daddr = *(struct in6_addr *)daddr, + .dport = dport + }; net_secret_init(); - memcpy(hash, saddr, 16); - for (i = 0; i < 4; i++) - secret[i] = net_secret[i] + (__force u32) daddr[i]; - secret[4] = net_secret[4] + (__force u32)dport; - for (i = 5; i < MD5_MESSAGE_BYTES / 4; i++) - secret[i] = net_secret[i]; - - md5_transform(hash, secret); - - return hash[0]; + return siphash24((const u8 *)&combined, sizeof(combined), net_secret); } EXPORT_SYMBOL(secure_ipv6_port_ephemeral); #endif @@ -91,33 +88,37 @@ EXPORT_SYMBOL(secure_ipv6_port_ephemeral); u32 secure_tcp_sequence_number(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport, u32 *tsoff) { - u32 hash[MD5_DIGEST_WORDS]; - + const struct { + __be32 saddr; + __be32 daddr; + __be16 sport; + __be16 dport; + } __packed combined = { + .saddr = saddr, + .daddr = daddr, + .sport = sport, + .dport = dport + }; + u64 hash; net_secret_init(); - hash[0] = (__force u32)saddr; - hash[1] = (__force u32)daddr; - hash[2] = ((__force u16)sport << 16) + (__force u16)dport; - hash[3] = net_secret[15]; - - md5_transform(hash, net_secret); - - *tsoff = sysctl_tcp_timestamps == 1 ? hash[1] : 0; - return seq_scale(hash[0]); + hash = siphash24((const u8 *)&combined, sizeof(combined), net_secret); + *tsoff = sysctl_tcp_timestamps == 1 ? (hash >> 32) : 0; + return seq_scale(hash); } u32 secure_ipv4_port_ephemeral(__be32 saddr, __be32 daddr, __be16 dport) { - u32 hash[MD5_DIGEST_WORDS]; - + const struct { + __be32 saddr; + __be32 daddr; + __be16 dport; + } __packed combined = { + .saddr = saddr, + .daddr = daddr, + .dport = dport + }; net_secret_init(); - hash[0] = (__force u32)saddr; - hash[1] = (__force u32)daddr; - hash[2] = (__force u32)dport ^ net_secret[14]; - hash[3] = net_secret[15]; - - md5_transform(hash, net_secret); - - return hash[0]; + return seq_scale(siphash24((const u8 *)&combined, sizeof(combined), net_secret)); } EXPORT_SYMBOL_GPL(secure_ipv4_port_ephemeral); #endif @@ -126,21 +127,22 @@ EXPORT_SYMBOL_GPL(secure_ipv4_port_ephemeral); u64 secure_dccp_sequence_number(__be32 saddr, __be32 daddr, __be16 sport, __be16 dport) { - u32 hash[MD5_DIGEST_WORDS]; + const struct { + __be32 saddr; + __be32 daddr; + __be16 sport; + __be16 dport; + } __packed combined = { + .saddr = saddr, + .daddr = daddr, + .sport = sport, + .dport = dport + }; u64 seq; - net_secret_init(); - hash[0] = (__force u32)saddr; - hash[1] = (__force u32)daddr; - hash[2] = ((__force u16)sport << 16) + (__force u16)dport; - hash[3] = net_secret[15]; - - md5_transform(hash, net_secret); - - seq = hash[0] | (((u64)hash[1]) << 32); + seq = siphash24((const u8 *)&combined, sizeof(combined), net_secret); seq += ktime_get_real_ns(); seq &= (1ull << 48) - 1; - return seq; } EXPORT_SYMBOL(secure_dccp_sequence_number); @@ -149,26 +151,22 @@ EXPORT_SYMBOL(secure_dccp_sequence_number); u64 secure_dccpv6_sequence_number(__be32 *saddr, __be32 *daddr, __be16 sport, __be16 dport) { - u32 secret[MD5_MESSAGE_BYTES / 4]; - u32 hash[MD5_DIGEST_WORDS]; + const struct { + struct in6_addr saddr; + struct in6_addr daddr; + __be16 sport; + __be16 dport; + } __packed combined = { + .saddr = *(struct in6_addr *)saddr, + .daddr = *(struct in6_addr *)daddr, + .sport = sport, + .dport = dport + }; u64 seq; - u32 i; - net_secret_init(); - memcpy(hash, saddr, 16); - for (i = 0; i < 4; i++) - secret[i] = net_secret[i] + (__force u32)daddr[i]; - secret[4] = net_secret[4] + - (((__force u16)sport << 16) + (__force u16)dport); - for (i = 5; i < MD5_MESSAGE_BYTES / 4; i++) - secret[i] = net_secret[i]; - - md5_transform(hash, secret); - - seq = hash[0] | (((u64)hash[1]) << 32); + seq = siphash24((const u8 *)&combined, sizeof(combined), net_secret); seq += ktime_get_real_ns(); seq &= (1ull << 48) - 1; - return seq; } EXPORT_SYMBOL(secure_dccpv6_sequence_number);
This gives a clear speed and security improvement. Siphash is both faster and is more solid crypto than the aging MD5. Rather than manually filling MD5 buffers, we simply create a layout by a simple anonymous struct, for which gcc generates rather efficient code. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Andi Kleen <ak@linux.intel.com> --- Changes from v1->v2: - Rebased on the latest 4.10, and now uses top 32-bits of siphash for the optional ts value. net/core/secure_seq.c | 160 +++++++++++++++++++++++++------------------------- 1 file changed, 79 insertions(+), 81 deletions(-)