Message ID | 78172b8-74bc-1177-6ac7-7a7e7a44d18@redhat.com (mailing list archive)
State | New, archived
Series | block: use the __packed attribute only on architectures where it is efficient
On Tue, Feb 06, 2024 at 12:14:14PM +0100, Mikulas Patocka wrote:
> The __packed macro (expanding to __attribute__((__packed__))) specifies
> that the structure has an alignment of 1. Therefore, it may be arbitrarily
> misaligned. On architectures that don't have hardware support for
> unaligned accesses, gcc generates very inefficient code that accesses the
> structure fields byte-by-byte and assembles the result using shifts and
> ors.
>
> For example, on PA-RISC, this function is compiled to 23 instructions with
> the __packed attribute and only 2 instructions without the __packed
> attribute.

Can you share the user-visible effects, such as the IOPS or CPU
utilization impact when running a typical workload on null_blk or NVMe?

The CPU is supposed to be very fast if the data is in a single L1 cache
line, but removing '__packed' may introduce one extra L1 cache line load
per bio.

thanks,
Ming
On Tue, 6 Feb 2024, Ming Lei wrote:

> On Tue, Feb 06, 2024 at 12:14:14PM +0100, Mikulas Patocka wrote:
> > The __packed macro (expanding to __attribute__((__packed__))) specifies
> > that the structure has an alignment of 1. Therefore, it may be arbitrarily
> > misaligned. On architectures that don't have hardware support for
> > unaligned accesses, gcc generates very inefficient code that accesses the
> > structure fields byte-by-byte and assembles the result using shifts and
> > ors.
> >
> > For example, on PA-RISC, this function is compiled to 23 instructions with
> > the __packed attribute and only 2 instructions without the __packed
> > attribute.
>
> Can you share the user-visible effects, such as the IOPS or CPU
> utilization impact when running a typical workload on null_blk or NVMe?

The patch reduces the total kernel size by 4096 bytes. The parisc machine
doesn't have PCIe, so I can't test it with NVMe :)

> The CPU is supposed to be very fast if the data is in a single L1 cache
> line, but removing '__packed' may introduce one extra L1 cache line load
> per bio.

Saving the instruction cache is also important. Removing the __packed
keyword increases the size of the bio structure by 8 bytes - that is, L1
data cache consumption increases with probability 8/64. And it reduces L1
instruction cache consumption by 84 bytes - that is, one or two cache
lines.

Mikulas
On Tue, Feb 06, 2024 at 07:31:26PM +0100, Mikulas Patocka wrote:
>
> On Tue, 6 Feb 2024, Ming Lei wrote:
>
> > On Tue, Feb 06, 2024 at 12:14:14PM +0100, Mikulas Patocka wrote:
> > > The __packed macro (expanding to __attribute__((__packed__))) specifies
> > > that the structure has an alignment of 1. Therefore, it may be arbitrarily
> > > misaligned. On architectures that don't have hardware support for
> > > unaligned accesses, gcc generates very inefficient code that accesses the
> > > structure fields byte-by-byte and assembles the result using shifts and
> > > ors.
> > >
> > > For example, on PA-RISC, this function is compiled to 23 instructions with
> > > the __packed attribute and only 2 instructions without the __packed
> > > attribute.
> >
> > Can you share the user-visible effects, such as the IOPS or CPU
> > utilization impact when running a typical workload on null_blk or NVMe?
>
> The patch reduces the total kernel size by 4096 bytes. The parisc machine
> doesn't have PCIe, so I can't test it with NVMe :)

You can run the test over null_blk, which is enough to cover this report
and this change. Given that this patch is marked with "Fixes:", we need
to understand what it fixes.

> > The CPU is supposed to be very fast if the data is in a single L1 cache
> > line, but removing '__packed' may introduce one extra L1 cache line load
> > per bio.
>
> Saving the instruction cache is also important. Removing the __packed
> keyword increases the size of the bio structure by 8 bytes - that is, L1
> data cache consumption increases with probability 8/64. And it reduces L1
> instruction cache consumption by 84 bytes - that is, one or two cache
> lines.

Yes. But the two kinds of caches have different properties, such as:

- the instruction cache has a lower miss rate
- the instruction cache is read-only

so I'd suggest providing at least a null_blk test result.

Thanks,
Ming
Index: linux-2.6/include/linux/bvec.h
===================================================================
--- linux-2.6.orig/include/linux/bvec.h	2023-09-05 14:46:02.000000000 +0200
+++ linux-2.6/include/linux/bvec.h	2024-02-06 11:49:56.000000000 +0100
@@ -83,7 +83,11 @@ struct bvec_iter {
 
 	unsigned int		bi_bvec_done;	/* number of bytes completed in
 						   current bvec */
-} __packed;
+}
+#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+__packed
+#endif
+;
 
 struct bvec_iter_all {
 	struct bio_vec	bv;
The __packed macro (expanding to __attribute__((__packed__))) specifies
that the structure has an alignment of 1. Therefore, it may be
arbitrarily misaligned. On architectures that don't have hardware
support for unaligned accesses, gcc generates very inefficient code that
accesses the structure fields byte-by-byte and assembles the result
using shifts and ors.

For example, on PA-RISC, the function below is compiled to 23
instructions with the __packed attribute and only 2 instructions without
the __packed attribute.

This commit changes the definition of 'struct bvec_iter', so that it
only uses the __packed attribute on architectures where unaligned access
is efficient.

sector_t get_sector(struct bvec_iter *iter)
{
	return iter->bi_sector;
}

with __packed:

0000000000000000 <get_sector>:
   0:	0f 40 10 1f 	ldb 0(r26),r31
   4:	0f 42 10 1c 	ldb 1(r26),ret0
   8:	f3 ff 0b 18 	depd,z,* r31,7,8,r31
   c:	f3 9c 0a 10 	depd,z,* ret0,15,16,ret0
  10:	0b fc 02 5c 	or ret0,r31,ret0
  14:	0f 44 10 15 	ldb 2(r26),r21
  18:	0f 46 10 14 	ldb 3(r26),r20
  1c:	f2 b5 09 08 	depd,z,* r21,23,24,r21
  20:	0f 48 10 13 	ldb 4(r26),r19
  24:	0b 95 02 55 	or r21,ret0,r21
  28:	0f 4a 10 1f 	ldb 5(r26),r31
  2c:	f2 94 08 00 	depd,z,* r20,31,32,r20
  30:	0f 4c 10 1c 	ldb 6(r26),ret0
  34:	0f 4e 10 16 	ldb 7(r26),r22
  38:	0a b4 02 54 	or r20,r21,r20
  3c:	f2 73 13 18 	depd,z,* r19,39,40,r19
  40:	f3 ff 12 10 	depd,z,* r31,47,48,r31
  44:	0a 93 02 53 	or r19,r20,r19
  48:	f3 9c 11 08 	depd,z,* ret0,55,56,ret0
  4c:	0a 7f 02 5f 	or r31,r19,r31
  50:	0b fc 02 5c 	or ret0,r31,ret0
  54:	e8 40 d0 00 	bve (rp)
  58:	0b 96 02 5c 	or r22,ret0,ret0
  5c:	00 00 00 00 	break 0,0

without __packed:

0000000000000000 <get_sector>:
   0:	e8 40 d0 00 	bve (rp)
   4:	0f 40 10 dc 	ldd 0(r26),ret0

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Fixes: 19416123ab3e ("block: define 'struct bvec_iter' as packed")
Cc: stable@vger.kernel.org # v5.16+
---
 include/linux/bvec.h | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)