Message ID | 20220610232130.2865479-1-lucas.demarchi@intel.com (mailing list archive) |
---|---|
State | New, archived |
Series | [1/3] iosys-map: Add per-word read |
On 11.06.22 at 01:21, Lucas De Marchi wrote:
> Instead of always falling back to memcpy_fromio() for any size, prefer
> using read{b,w,l}(). When reading struct members it's common to read
> integer variables individually. Going through memcpy_fromio()
> for each of them poses a high penalty.
>
> Employ a trick similar to __seqprop() by using _Generic() to generate
> only the specific call based on a type-compatible variable.
>
> For a particular i915 workload producing GPU context switches,
> __get_engine_usage_record() is particularly hot since the engine usage
> is read from device local memory with dgfx, possibly multiple times
> since it's racy. Test execution time for this test shows a ~12.5%
> improvement with DG2:
>
> Before:
>	nrepeats = 1000; min = 7.63243e+06; max = 1.01817e+07;
>	median = 9.52548e+06; var = 526149;
> After:
>	nrepeats = 1000; min = 7.03402e+06; max = 8.8832e+06;
>	median = 8.33955e+06; var = 333113;
>
> Other things attempted that didn't prove very useful:
> 1) Change the _Generic() on x86 to just dereference the memory address
> 2) Change __get_engine_usage_record() to do just 1 read per loop,
>    comparing with the previous value read
> 3) Change __get_engine_usage_record() to access the fields directly as it
>    was before the conversion to iosys-map
>
> (3) did give a small improvement (~3%), but doesn't seem to scale well
> to other similar cases in the driver.
>
> Additional test by Chris Wilson using gem_create from igt with some
> changes to track object creation time. This happens to accidentally
> stress this code path:
>
> Pre iosys_map conversion of engine busyness:
>	lmem0: Creating 262144 4KiB objects took 59274.2ms
>
> Unpatched:
>	lmem0: Creating 262144 4KiB objects took 108830.2ms
>
> With readl (this patch):
>	lmem0: Creating 262144 4KiB objects took 61348.6ms
>
> s/readl/READ_ONCE/
>	lmem0: Creating 262144 4KiB objects took 61333.2ms
>
> So we do take a little bit more time than before the conversion, but
> that is due to other factors: bringing the READ_ONCE back would be as
> good as just doing this conversion.
>
> Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>

Reviewed-by: Christian König <christian.koenig@amd.com> for the entire series.

> ---
>  include/linux/iosys-map.h | 26 ++++++++++++++++++++++----
>  1 file changed, 22 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/iosys-map.h b/include/linux/iosys-map.h
> index e69a002d5aa4..cd28c7a1b79c 100644
> --- a/include/linux/iosys-map.h
> +++ b/include/linux/iosys-map.h
> @@ -333,6 +333,20 @@ static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
>  		memset(dst->vaddr + offset, value, len);
>  }
>
> +#ifdef CONFIG_64BIT
> +#define __iosys_map_rd_io_u64_case(val_, vaddr_iomem_)	\
> +	u64: val_ = readq(vaddr_iomem_),
> +#else
> +#define __iosys_map_rd_io_u64_case(val_, vaddr_iomem_)
> +#endif
> +
> +#define __iosys_map_rd_io(val__, vaddr_iomem__, type__) _Generic(val__,	\
> +	u8: val__ = readb(vaddr_iomem__),	\
> +	u16: val__ = readw(vaddr_iomem__),	\
> +	u32: val__ = readl(vaddr_iomem__),	\
> +	__iosys_map_rd_io_u64_case(val__, vaddr_iomem__)	\
> +	default: memcpy_fromio(&(val__), vaddr_iomem__, sizeof(val__)))
> +
>  /**
>   * iosys_map_rd - Read a C-type value from the iosys_map
>   *
> @@ -346,10 +360,14 @@ static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
>   * Returns:
>   * The value read from the mapping.
>   */
> -#define iosys_map_rd(map__, offset__, type__) ({	\
> -	type__ val;	\
> -	iosys_map_memcpy_from(&val, map__, offset__, sizeof(val));	\
> -	val;	\
> +#define iosys_map_rd(map__, offset__, type__) ({	\
> +	type__ val;	\
> +	if ((map__)->is_iomem) {	\
> +		__iosys_map_rd_io(val, (map__)->vaddr_iomem + offset__, type__);\
> +	} else {	\
> +		memcpy(&val, (map__)->vaddr + offset__, sizeof(val));	\
> +	}	\
> +	val;	\
>  })
>
>  /**
diff --git a/include/linux/iosys-map.h b/include/linux/iosys-map.h
index e69a002d5aa4..cd28c7a1b79c 100644
--- a/include/linux/iosys-map.h
+++ b/include/linux/iosys-map.h
@@ -333,6 +333,20 @@ static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
 		memset(dst->vaddr + offset, value, len);
 }
 
+#ifdef CONFIG_64BIT
+#define __iosys_map_rd_io_u64_case(val_, vaddr_iomem_)	\
+	u64: val_ = readq(vaddr_iomem_),
+#else
+#define __iosys_map_rd_io_u64_case(val_, vaddr_iomem_)
+#endif
+
+#define __iosys_map_rd_io(val__, vaddr_iomem__, type__) _Generic(val__,	\
+	u8: val__ = readb(vaddr_iomem__),	\
+	u16: val__ = readw(vaddr_iomem__),	\
+	u32: val__ = readl(vaddr_iomem__),	\
+	__iosys_map_rd_io_u64_case(val__, vaddr_iomem__)	\
+	default: memcpy_fromio(&(val__), vaddr_iomem__, sizeof(val__)))
+
 /**
  * iosys_map_rd - Read a C-type value from the iosys_map
  *
@@ -346,10 +360,14 @@ static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
  * Returns:
  * The value read from the mapping.
  */
-#define iosys_map_rd(map__, offset__, type__) ({	\
-	type__ val;	\
-	iosys_map_memcpy_from(&val, map__, offset__, sizeof(val));	\
-	val;	\
+#define iosys_map_rd(map__, offset__, type__) ({	\
+	type__ val;	\
+	if ((map__)->is_iomem) {	\
+		__iosys_map_rd_io(val, (map__)->vaddr_iomem + offset__, type__);\
+	} else {	\
+		memcpy(&val, (map__)->vaddr + offset__, sizeof(val));	\
+	}	\
+	val;	\
 })
 
 /**
Instead of always falling back to memcpy_fromio() for any size, prefer
using read{b,w,l}(). When reading struct members it's common to read
integer variables individually. Going through memcpy_fromio()
for each of them poses a high penalty.

Employ a trick similar to __seqprop() by using _Generic() to generate
only the specific call based on a type-compatible variable.

For a particular i915 workload producing GPU context switches,
__get_engine_usage_record() is particularly hot since the engine usage
is read from device local memory with dgfx, possibly multiple times
since it's racy. Test execution time for this test shows a ~12.5%
improvement with DG2:

Before:
	nrepeats = 1000; min = 7.63243e+06; max = 1.01817e+07;
	median = 9.52548e+06; var = 526149;
After:
	nrepeats = 1000; min = 7.03402e+06; max = 8.8832e+06;
	median = 8.33955e+06; var = 333113;

Other things attempted that didn't prove very useful:
1) Change the _Generic() on x86 to just dereference the memory address
2) Change __get_engine_usage_record() to do just 1 read per loop,
   comparing with the previous value read
3) Change __get_engine_usage_record() to access the fields directly as it
   was before the conversion to iosys-map

(3) did give a small improvement (~3%), but doesn't seem to scale well
to other similar cases in the driver.

Additional test by Chris Wilson using gem_create from igt with some
changes to track object creation time. This happens to accidentally
stress this code path:

Pre iosys_map conversion of engine busyness:
	lmem0: Creating 262144 4KiB objects took 59274.2ms

Unpatched:
	lmem0: Creating 262144 4KiB objects took 108830.2ms

With readl (this patch):
	lmem0: Creating 262144 4KiB objects took 61348.6ms

s/readl/READ_ONCE/
	lmem0: Creating 262144 4KiB objects took 61333.2ms

So we do take a little bit more time than before the conversion, but
that is due to other factors: bringing the READ_ONCE back would be as
good as just doing this conversion.

Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
---
 include/linux/iosys-map.h | 26 ++++++++++++++++++++++----
 1 file changed, 22 insertions(+), 4 deletions(-)
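
For anyone who wants to play with the _Generic() dispatch outside the kernel, here is a minimal userspace sketch, assuming a C11 compiler. Every mock_* name is a hypothetical stand-in invented for illustration (plain volatile loads instead of read{b,w,l}(), a byte loop instead of memcpy_fromio(), a trimmed-down struct instead of struct iosys_map), so it only shows the shape of the selection, not real MMIO behavior. The u64 case is left out on purpose, which mirrors what a !CONFIG_64BIT build would see: a u64 read lands in the default branch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Counts trips through the slow fallback so the dispatch can be observed. */
static int mock_fallback_calls;

/* Hypothetical userspace stand-ins for readb()/readw()/readl(). */
static inline uint8_t mock_readb(const volatile void *addr)
{
	return *(const volatile uint8_t *)addr;
}

static inline uint16_t mock_readw(const volatile void *addr)
{
	return *(const volatile uint16_t *)addr;
}

static inline uint32_t mock_readl(const volatile void *addr)
{
	return *(const volatile uint32_t *)addr;
}

/* Hypothetical stand-in for memcpy_fromio(): byte-wise volatile copy. */
static inline void mock_memcpy_fromio(void *dst, const volatile void *src,
				      size_t len)
{
	unsigned char *d = dst;
	const volatile unsigned char *s = src;

	mock_fallback_calls++;
	while (len--)
		*d++ = *s++;
}

/* Simplified stand-in for struct iosys_map (one pointer, no union). */
struct mock_map {
	void *vaddr;
	bool is_iomem;
};

/* Fake register block used as the "device" memory in this sketch. */
struct mock_regs {
	uint32_t a;
	uint32_t pad;
	uint64_t b;
};

/* Same shape as __iosys_map_rd_io(), minus the u64 case. */
#define mock_rd_io(val__, addr__) _Generic(val__,			\
	uint8_t:  val__ = mock_readb(addr__),				\
	uint16_t: val__ = mock_readw(addr__),				\
	uint32_t: val__ = mock_readl(addr__),				\
	default:  mock_memcpy_fromio(&(val__), addr__, sizeof(val__)))

static inline uint32_t mock_map_rd_u32(const struct mock_map *map, size_t offset)
{
	uint32_t val;

	if (map->is_iomem)	/* selects mock_readl() at compile time */
		mock_rd_io(val, (const volatile char *)map->vaddr + offset);
	else
		memcpy(&val, (const char *)map->vaddr + offset, sizeof(val));
	return val;
}

static inline uint64_t mock_map_rd_u64(const struct mock_map *map, size_t offset)
{
	uint64_t val;

	if (map->is_iomem)	/* no u64 case here: selects the default */
		mock_rd_io(val, (const volatile char *)map->vaddr + offset);
	else
		memcpy(&val, (const char *)map->vaddr + offset, sizeof(val));
	return val;
}

/* Exercise all paths; returns how many reads went through the fallback. */
static int mock_demo(void)
{
	struct mock_regs regs = { .a = 0xdeadbeefu, .pad = 0,
				  .b = 0x0123456789abcdefULL };
	struct mock_map io = { .vaddr = &regs, .is_iomem = true };
	struct mock_map sys = { .vaddr = &regs, .is_iomem = false };
	uint32_t r32_io, r32_sys;
	uint64_t r64_io;

	mock_fallback_calls = 0;
	r32_io = mock_map_rd_u32(&io, offsetof(struct mock_regs, a));
	r64_io = mock_map_rd_u64(&io, offsetof(struct mock_regs, b));
	r32_sys = mock_map_rd_u32(&sys, offsetof(struct mock_regs, a));

	assert(r32_io == regs.a && r32_sys == regs.a && r64_io == regs.b);
	return mock_fallback_calls;
}
```

Calling mock_demo() from a main() exercises one direct 32-bit read, one fallback copy for the 64-bit value and one plain memcpy() for the system-memory map. One thing the sketch makes visible: every association in _Generic() is still type-checked even when it is not selected, so the value has to be assignable from an integer, which holds for the integer struct members this kind of accessor is meant for.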