[CI,1/2] iosys-map: Add per-word read

Message ID 20220627224751.3627465-1-lucas.demarchi@intel.com (mailing list archive)
State New, archived

Commit Message

Lucas De Marchi June 27, 2022, 10:47 p.m. UTC
Instead of always falling back to memcpy_fromio() for any size, prefer
using read{b,w,l}(). When reading struct members it's common to read
individual integer variables individually. Going through memcpy_fromio()
for each of them poses a high penalty.

Employ a similar trick as __seqprop() by using _Generic() to generate
only the specific call based on a type-compatible variable.
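
As a rough userspace sketch of this kind of _Generic() dispatch (not the
kernel code: the sketch_* names and struct sketch_record are hypothetical,
and plain volatile loads stand in for readb()/readw()/readl()/readq()).
Unlike the patch, which places the assignment inside each association, the
sketch selects a function and then calls it:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins for readb()/readw()/readl()/readq(). */
static inline uint8_t sketch_readb(const volatile void *addr)
{
	return *(const volatile uint8_t *)addr;
}

static inline uint16_t sketch_readw(const volatile void *addr)
{
	return *(const volatile uint16_t *)addr;
}

static inline uint32_t sketch_readl(const volatile void *addr)
{
	return *(const volatile uint32_t *)addr;
}

static inline uint64_t sketch_readq(const volatile void *addr)
{
	return *(const volatile uint64_t *)addr;
}

/*
 * Pick exactly one accessor at compile time from the type of val_.
 * val_ is only inspected for its type and is never evaluated. There is
 * no default association, so an unsupported type fails to compile.
 */
#define sketch_rd(val_, addr_) _Generic((val_),	\
	uint8_t:  sketch_readb,			\
	uint16_t: sketch_readw,			\
	uint32_t: sketch_readl,			\
	uint64_t: sketch_readq)(addr_)

/* Hypothetical record layout, used only for this illustration. */
struct sketch_record {
	uint32_t id;
	uint16_t flags;
	uint8_t  state;
	uint8_t  pad;
	uint64_t ticks;
};

/* The sketch assumes naturally aligned fields, as the patch does. */
static_assert(offsetof(struct sketch_record, ticks) % sizeof(uint64_t) == 0,
	      "ticks must be naturally aligned");

/* Read one field with a single 64-bit load instead of a byte-wise copy. */
static uint64_t sketch_read_ticks(const struct sketch_record *rec)
{
	const unsigned char *base = (const unsigned char *)rec;
	uint64_t val;

	val = sketch_rd(val, base + offsetof(struct sketch_record, ticks));
	return val;
}
```

Calling sketch_rd() with, say, a struct-typed variable would be rejected by
the compiler, since no association matches.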

For a particular i915 workload producing GPU context switches,
__get_engine_usage_record() is particularly hot since the engine usage
is read from device local memory with dgfx, possibly multiple times
since it's racy. Test execution time for this test shows a ~12.5%
improvement with DG2:

Before:
	nrepeats = 1000; min = 7.63243e+06; max = 1.01817e+07;
	median = 9.52548e+06; var = 526149;
After:
	nrepeats = 1000; min = 7.03402e+06; max = 8.8832e+06;
	median = 8.33955e+06; var = 333113;

Other things attempted that didn't prove very useful:
1) Change the _Generic() on x86 to just dereference the memory address
2) Change __get_engine_usage_record() to do just 1 read per loop,
   comparing with the previous value read
3) Change __get_engine_usage_record() to access the fields directly as it
   was before the conversion to iosys-map

(3) did give a small improvement (~3%), but it doesn't seem to scale well
to other similar cases in the driver.

Additional test by Chris Wilson using gem_create from igt with some
changes to track object creation time. This happens to accidentally
stress this code path:

	Pre iosys_map conversion of engine busyness:
	lmem0: Creating    262144 4KiB objects took 59274.2ms

	Unpatched:
	lmem0: Creating    262144 4KiB objects took 108830.2ms

	With readl (this patch):
	lmem0: Creating    262144 4KiB objects took 61348.6ms

	s/readl/READ_ONCE/
	lmem0: Creating    262144 4KiB objects took 61333.2ms

So we do take a little bit more time than before the conversion, but
that is due to other factors: bringing the READ_ONCE back would be as
good as just doing this conversion.
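
To put the gem_create timings above in relative terms, a small sketch (the
function and variable names are made up for this illustration; the numbers
are copied from the results above):

```c
#include <stdio.h>

/* Timings in milliseconds, copied from the gem_create results above. */
static const double pre_conversion_ms = 59274.2;
static const double unpatched_ms = 108830.2;
static const double patched_ms = 61348.6;

/* Fractional change of `after` relative to `before`; negative is faster. */
static double relative_change(double before, double after)
{
	return (after - before) / before;
}

/* Print both comparisons as signed percentages. */
static void print_summary(void)
{
	printf("patched vs unpatched:      %+.1f%%\n",
	       100.0 * relative_change(unpatched_ms, patched_ms));
	printf("patched vs pre-conversion: %+.1f%%\n",
	       100.0 * relative_change(pre_conversion_ms, patched_ms));
}
```

That is, the patched run takes about 43.6% less time than the unpatched one
and stays about 3.5% above the pre-conversion baseline.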

v2:
- Remove default from _Generic() - callers wanting to read more
  than u64 should use iosys_map_memcpy_from()
- Add READ_ONCE() cases dereferencing the pointer when using system
  memory

Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Christian König <christian.koenig@amd.com> # v1
Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
---
 include/linux/iosys-map.h | 45 +++++++++++++++++++++++++++++++--------
 1 file changed, 36 insertions(+), 9 deletions(-)
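
The system-memory path mentioned in the v2 notes can be approximated in
userspace as follows. This is only a sketch: SKETCH_READ_ONCE(),
sketch_rd_sys() and sketch_sys_rd_u32() are hypothetical names, a volatile
load stands in for READ_ONCE(), and C11 static_assert stands in for
compiletime_assert():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical approximation of READ_ONCE(): a single volatile load. */
#define SKETCH_READ_ONCE(type_, addr_) (*(const volatile type_ *)(addr_))

/*
 * Read a value of type_ from ordinary memory. Types wider than 64 bits
 * are rejected at compile time, mirroring the size check in the patch.
 */
#define sketch_rd_sys(val_, addr_, type_) do {				\
	static_assert(sizeof(type_) <= sizeof(uint64_t),		\
		      "Unsupported access size for sketch_rd_sys()");	\
	(val_) = SKETCH_READ_ONCE(type_, addr_);			\
} while (0)

/* Read the 32-bit word at index idx of a word-aligned buffer. */
static uint32_t sketch_sys_rd_u32(const uint32_t *buf, size_t idx)
{
	const unsigned char *base = (const unsigned char *)buf;
	uint32_t val;

	sketch_rd_sys(val, base + idx * sizeof(*buf), uint32_t);
	return val;
}
```

Passing a type larger than 8 bytes to sketch_rd_sys() trips the static
assertion, which is the point at which a caller would switch to a memcpy
style helper instead.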

Comments

Christian König July 4, 2022, 6:30 a.m. UTC | #1
Am 28.06.22 um 00:47 schrieb Lucas De Marchi:
> Instead of always falling back to memcpy_fromio() for any size, prefer
> using read{b,w,l}(). When reading struct members it's common to read
> individual integer variables individually. Going through memcpy_fromio()
> for each of them poses a high penalty.
>
> Employ a similar trick as __seqprop() by using _Generic() to generate
> only the specific call based on a type-compatible variable.
>
> For a particular i915 workload producing GPU context switches,
> __get_engine_usage_record() is particularly hot since the engine usage
> is read from device local memory with dgfx, possibly multiple times
> since it's racy. Test execution time for this test shows a ~12.5%
> improvement with DG2:
>
> Before:
> 	nrepeats = 1000; min = 7.63243e+06; max = 1.01817e+07;
> 	median = 9.52548e+06; var = 526149;
> After:
> 	nrepeats = 1000; min = 7.03402e+06; max = 8.8832e+06;
> 	median = 8.33955e+06; var = 333113;
>
> Other things attempted that didn't prove very useful:
> 1) Change the _Generic() on x86 to just dereference the memory address
> 2) Change __get_engine_usage_record() to do just 1 read per loop,
>     comparing with the previous value read
> 3) Change __get_engine_usage_record() to access the fields directly as it
>     was before the conversion to iosys-map
>
> (3) did give a small improvement (~3%), but it doesn't seem to scale well
> to other similar cases in the driver.
>
> Additional test by Chris Wilson using gem_create from igt with some
> changes to track object creation time. This happens to accidentally
> stress this code path:
>
> 	Pre iosys_map conversion of engine busyness:
> 	lmem0: Creating    262144 4KiB objects took 59274.2ms
>
> 	Unpatched:
> 	lmem0: Creating    262144 4KiB objects took 108830.2ms
>
> 	With readl (this patch):
> 	lmem0: Creating    262144 4KiB objects took 61348.6ms
>
> 	s/readl/READ_ONCE/
> 	lmem0: Creating    262144 4KiB objects took 61333.2ms
>
> So we do take a little bit more time than before the conversion, but
> that is due to other factors: bringing the READ_ONCE back would be as
> good as just doing this conversion.
>
> v2:
> - Remove default from _Generic() - callers wanting to read more
>    than u64 should use iosys_map_memcpy_from()
> - Add READ_ONCE() cases dereferencing the pointer when using system
>    memory
>
> Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
> Reviewed-by: Christian König <christian.koenig@amd.com> # v1
> Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>

Feel free to update my rb to v2 as well.

Apart from that do you have commit rights to drm-misc-next? If not 
should we push this?

Thanks,
Christian.

Lucas De Marchi July 4, 2022, 5:14 p.m. UTC | #2
On Mon, Jul 04, 2022 at 08:30:08AM +0200, Christian König wrote:
>Am 28.06.22 um 00:47 schrieb Lucas De Marchi:
>>Instead of always falling back to memcpy_fromio() for any size, prefer
>>using read{b,w,l}(). When reading struct members it's common to read
>>individual integer variables individually. Going through memcpy_fromio()
>>for each of them poses a high penalty.
>>
>>Employ a similar trick as __seqprop() by using _Generic() to generate
>>only the specific call based on a type-compatible variable.
>>
>>For a particular i915 workload producing GPU context switches,
>>__get_engine_usage_record() is particularly hot since the engine usage
>>is read from device local memory with dgfx, possibly multiple times
>>since it's racy. Test execution time for this test shows a ~12.5%
>>improvement with DG2:
>>
>>Before:
>>	nrepeats = 1000; min = 7.63243e+06; max = 1.01817e+07;
>>	median = 9.52548e+06; var = 526149;
>>After:
>>	nrepeats = 1000; min = 7.03402e+06; max = 8.8832e+06;
>>	median = 8.33955e+06; var = 333113;
>>
>>Other things attempted that didn't prove very useful:
>>1) Change the _Generic() on x86 to just dereference the memory address
>>2) Change __get_engine_usage_record() to do just 1 read per loop,
>>    comparing with the previous value read
>>3) Change __get_engine_usage_record() to access the fields directly as it
>>    was before the conversion to iosys-map
>>
>>(3) did give a small improvement (~3%), but it doesn't seem to scale well
>>to other similar cases in the driver.
>>
>>Additional test by Chris Wilson using gem_create from igt with some
>>changes to track object creation time. This happens to accidentally
>>stress this code path:
>>
>>	Pre iosys_map conversion of engine busyness:
>>	lmem0: Creating    262144 4KiB objects took 59274.2ms
>>
>>	Unpatched:
>>	lmem0: Creating    262144 4KiB objects took 108830.2ms
>>
>>	With readl (this patch):
>>	lmem0: Creating    262144 4KiB objects took 61348.6ms
>>
>>	s/readl/READ_ONCE/
>>	lmem0: Creating    262144 4KiB objects took 61333.2ms
>>
>>So we do take a little bit more time than before the conversion, but
>>that is due to other factors: bringing the READ_ONCE back would be as
>>good as just doing this conversion.
>>
>>v2:
>>- Remove default from _Generic() - callers wanting to read more
>>   than u64 should use iosys_map_memcpy_from()
>>- Add READ_ONCE() cases dereferencing the pointer when using system
>>   memory
>>
>>Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
>>Reviewed-by: Christian König <christian.koenig@amd.com> # v1
>>Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
>
>Feel free to update my rb to v2 as well.
>
>Apart from that do you have commit rights to drm-misc-next? If not 
>should we push this?

I do have commit rights. I pushed this last week to drm-misc-next.

6fb5ee7cec06 ("iosys-map: Add per-word write")
5f278dbd540b ("iosys-map: Add per-word read")

thanks
Lucas De Marchi

Patch

diff --git a/include/linux/iosys-map.h b/include/linux/iosys-map.h
index 4b8406ee8bc4..ec81ed995c59 100644
--- a/include/linux/iosys-map.h
+++ b/include/linux/iosys-map.h
@@ -6,6 +6,7 @@ 
 #ifndef __IOSYS_MAP_H__
 #define __IOSYS_MAP_H__
 
+#include <linux/compiler_types.h>
 #include <linux/io.h>
 #include <linux/string.h>
 
@@ -333,6 +334,26 @@  static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
 		memset(dst->vaddr + offset, value, len);
 }
 
+#ifdef CONFIG_64BIT
+#define __iosys_map_rd_io_u64_case(val_, vaddr_iomem_)				\
+	u64: val_ = readq(vaddr_iomem_)
+#else
+#define __iosys_map_rd_io_u64_case(val_, vaddr_iomem_)				\
+	u64: memcpy_fromio(&(val_), vaddr_iomem_, sizeof(u64))
+#endif
+
+#define __iosys_map_rd_io(val__, vaddr_iomem__, type__) _Generic(val__,		\
+	u8: val__ = readb(vaddr_iomem__),					\
+	u16: val__ = readw(vaddr_iomem__),					\
+	u32: val__ = readl(vaddr_iomem__),					\
+	__iosys_map_rd_io_u64_case(val__, vaddr_iomem__))
+
+#define __iosys_map_rd_sys(val__, vaddr__, type__) ({				\
+	compiletime_assert(sizeof(type__) <= sizeof(u64),			\
+			   "Unsupported access size for __iosys_map_rd_sys()");	\
+	val__ = READ_ONCE(*((type__ *)vaddr__));				\
+})
+
 /**
  * iosys_map_rd - Read a C-type value from the iosys_map
  *
@@ -340,16 +361,21 @@  static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
  * @offset__:	The offset from which to read
  * @type__:	Type of the value being read
  *
- * Read a C type value from iosys_map, handling possible un-aligned accesses to
- * the mapping.
+ * Read a C type value (u8, u16, u32 and u64) from iosys_map. For other types or
+ * if pointer may be unaligned (and problematic for the architecture supported),
+ * use iosys_map_memcpy_from().
  *
  * Returns:
  * The value read from the mapping.
  */
-#define iosys_map_rd(map__, offset__, type__) ({			\
-	type__ val;							\
-	iosys_map_memcpy_from(&val, map__, offset__, sizeof(val));	\
-	val;								\
+#define iosys_map_rd(map__, offset__, type__) ({				\
+	type__ val;								\
+	if ((map__)->is_iomem) {						\
+		__iosys_map_rd_io(val, (map__)->vaddr_iomem + (offset__), type__);\
+	} else {								\
+		__iosys_map_rd_sys(val, (map__)->vaddr + (offset__), type__);	\
+	}									\
+	val;									\
 })
 
 /**
@@ -379,9 +405,10 @@  static inline void iosys_map_memset(struct iosys_map *dst, size_t offset,
  *
  * Read a value from iosys_map considering its layout is described by a C struct
  * starting at @struct_offset__. The field offset and size is calculated and its
- * value read handling possible un-aligned memory accesses. For example: suppose
- * there is a @struct foo defined as below and the value ``foo.field2.inner2``
- * needs to be read from the iosys_map:
+ * value read. If the field access would incur in un-aligned access, then either
+ * iosys_map_memcpy_from() needs to be used or the architecture must support it.
+ * For example: suppose there is a @struct foo defined as below and the value
+ * ``foo.field2.inner2`` needs to be read from the iosys_map:
  *
  * .. code-block:: c
  *