Message ID: 20180603144221.959741603@twibright.com (mailing list archive)
State: New, archived
On Tue, Jun 05, 2018 at 10:08:02AM +0000, Alexey Brodkin wrote:
> Hi Mikulas,
>
> On Sun, 2018-06-03 at 16:41 +0200, Mikulas Patocka wrote:
> > Modern processors can detect linear memory accesses and prefetch data
> > automatically, so there's no need to use prefetch.
>
> Not each and every CPU that's capable of running Linux has prefetch
> functionality :)
>
> Still, read on...
>
> > Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> >
> > ---
> >  drivers/gpu/drm/udl/udl_transfer.c | 7 -------
> >  1 file changed, 7 deletions(-)
> >
> > Index: linux-4.16.12/drivers/gpu/drm/udl/udl_transfer.c
> > ===================================================================
> > --- linux-4.16.12.orig/drivers/gpu/drm/udl/udl_transfer.c	2018-05-31 14:48:12.000000000 +0200
> > +++ linux-4.16.12/drivers/gpu/drm/udl/udl_transfer.c	2018-05-31 14:48:12.000000000 +0200
> > @@ -13,7 +13,6 @@
> >  #include <linux/module.h>
> >  #include <linux/slab.h>
> >  #include <linux/fb.h>
> > -#include <linux/prefetch.h>
> >  #include <asm/unaligned.h>
> >
> >  #include <drm/drmP.h>
> > @@ -51,9 +50,6 @@ static int udl_trim_hline(const u8 *bbac
> >  	int start = width;
> >  	int end = width;
> >
> > -	prefetch((void *) front);
> > -	prefetch((void *) back);
> > -
>
> AFAIK the prefetcher fetches new data according to known history, i.e.
> based on a previously seen pattern it tries to get the next batch of data.
>
> But the code above is at the very beginning of the data processing
> routine, where the prefetcher doesn't yet have any history to know what
> and where to prefetch.
>
> So I'd say this particular usage is good.
> At least those prefetches shouldn't hurt: typically each one is just
> 1 instruction if the CPU has one, or nothing if the CPU/compiler doesn't
> support it.
>
> >  	for (j = 0; j < width; j++) {
> >  		if (back[j] != front[j]) {
> >  			start = j;
> > @@ -140,8 +136,6 @@ static void udl_compress_hline16(
> >  	const u8 *cmd_pixel_start, *cmd_pixel_end = NULL;
> >  	uint16_t pixel_val16;
> >
> > -	prefetchw((void *) cmd); /* pull in one cache line at least */
> > -
>
> Pretty much the same here. Obviously on the first iteration the prefetcher
> doesn't know where to get "cmd" from. On the next iterations it might do
> better, but given the amount of work that happens further down in the loop
> (and in the inner loops) I wouldn't be completely sure that each and every
> prefetcher will still keep track of "cmd".
>
> >  	*cmd++ = 0xaf;
> >  	*cmd++ = 0x6b;
> >  	*cmd++ = (uint8_t) ((dev_addr >> 16) & 0xFF);
> > @@ -158,7 +152,6 @@ static void udl_compress_hline16(
> >  			(unsigned long)(pixel_end - pixel) >> log_bpp,
> >  			(unsigned long)(cmd_buffer_end - 1 - cmd) / 2) << log_bpp);
> >
> > -		prefetch_range((void *) pixel, cmd_pixel_end - pixel);
>
> Again I'm not sure what we gain by removing that code, compared with the
> possible performance degradation on simpler CPUs.
>
> And essentially all the same is applicable to the UDLFB patch.

Tested this one on an AT91SAM9G20 SoC, but couldn't get any measurable
difference. Part of the problem is probably that full-speed USB is the
bottleneck here. However, the same applies on an OMAP3630 based board with
high-speed USB.

As a side note, I didn't experience any problem those patches are fixing,
so perhaps the test cases could be described briefly, preferably with some
numbers (I'm not running a console on udlfb, but I'll give it a try next
week).
Thank you,
	ladis

> -Alexey
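(For context on the "1 instruction if those exist or nothing" point: that is
indeed how include/linux/prefetch.h is built. A paraphrased sketch of its
fallback logic, not the verbatim kernel header:

--------------------------->8-------------------------
/* Architectures with a usable prefetch instruction define
 * ARCH_HAS_PREFETCH and supply their own implementation; everyone else
 * gets the compiler builtin, which GCC lowers to a prefetch instruction
 * where the target has one and to no code at all where it doesn't. */
#ifndef ARCH_HAS_PREFETCH
#define prefetch(x)	__builtin_prefetch(x)
#endif

#ifndef ARCH_HAS_PREFETCHW
#define prefetchw(x)	__builtin_prefetch(x, 1)	/* 1 = prefetch for write */
#endif

/* prefetch_range() walks [addr, addr + len) issuing one prefetch() per
 * PREFETCH_STRIDE bytes, and compiles away entirely when
 * ARCH_HAS_PREFETCH is not defined. */
--------------------------->8-------------------------

So the zero-cost argument holds on CPUs without prefetch; the dispute below
is about CPUs that do have it.)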
On Tue, 5 Jun 2018, Alexey Brodkin wrote:

> Hi Mikulas,
>
> On Sun, 2018-06-03 at 16:41 +0200, Mikulas Patocka wrote:
> > Modern processors can detect linear memory accesses and prefetch data
> > automatically, so there's no need to use prefetch.
>
> Not each and every CPU that's capable of running Linux has prefetch
> functionality :)
>
> [... patch quoted above snipped ...]
>
> AFAIK the prefetcher fetches new data according to known history, i.e.
> based on a previously seen pattern it tries to get the next batch of data.
>
> But the code above is at the very beginning of the data processing
> routine, where the prefetcher doesn't yet have any history to know what
> and where to prefetch.
>
> So I'd say this particular usage is good.
> At least those prefetches shouldn't hurt: typically each one is just
> 1 instruction if the CPU has one, or nothing if the CPU/compiler doesn't
> support it.

See this post https://lwn.net/Articles/444336/ where they measured that
prefetch hurts performance. Prefetch shouldn't be used unless you have
proof that it improves performance.

The problem is that the prefetch instruction stalls the pipeline when it
encounters a TLB miss, while the automatic prefetcher doesn't.

Mikulas
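(The effect is machine-dependent and easy to check in userspace. A minimal,
illustrative harness along these lines -- the buffer size, the
one-access-per-cache-line stride, and the 256-byte prefetch distance are all
arbitrary choices, not something from this thread -- compares a plain linear
walk against one with explicit software prefetches:

--------------------------->8-------------------------
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_SIZE (64 * 1024 * 1024)	/* large enough to defeat the caches */

static double walk(const uint8_t *buf, size_t len, int use_prefetch)
{
	struct timespec t0, t1;
	volatile uint64_t sum = 0;	/* volatile so the loop isn't elided */

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (size_t i = 0; i < len; i += 64) {	/* one access per cache line */
		if (use_prefetch)
			__builtin_prefetch(buf + i + 256);	/* 4 lines ahead */
		sum += buf[i];
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	(void)sum;
	return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
	uint8_t *buf = malloc(BUF_SIZE);

	if (!buf)
		return 1;
	for (size_t i = 0; i < BUF_SIZE; i++)
		buf[i] = (uint8_t)i;	/* fault in every page up front */

	printf("no prefetch: %.3f s\n", walk(buf, BUF_SIZE, 0));
	printf("   prefetch: %.3f s\n", walk(buf, BUF_SIZE, 1));
	free(buf);
	return 0;
}
--------------------------->8-------------------------

On a CPU with a working hardware prefetcher the two times would typically
come out about equal on such a purely linear walk, which is the premise of
the patch.)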
On Wed, 6 Jun 2018, Alexey Brodkin wrote:

> Hi Mikulas,
>
> On Tue, 2018-06-05 at 11:30 -0400, Mikulas Patocka wrote:
> > On Tue, 5 Jun 2018, Alexey Brodkin wrote:
> > > [...]
> > > At least those prefetches shouldn't hurt: typically each one is just
> > > 1 instruction if the CPU has one, or nothing if the CPU/compiler
> > > doesn't support it.
> >
> > See this post https://lwn.net/Articles/444336/ where they measured that
> > prefetch hurts performance. Prefetch shouldn't be used unless you have
> > proof that it improves performance.
> >
> > The problem is that the prefetch instruction stalls the pipeline when it
> > encounters a TLB miss, while the automatic prefetcher doesn't.
>
> Wow, thanks for the link.
> I didn't know about that subtle issue with prefetch instructions on ARM
> and x86.
>
> So OK, in the case of UDL these prefetches don't make much sense anyway,
> I guess, and there's something worse still -- see what I've got from a
> WandBoard Quad running the kmscube [1] application, with the help of the
> perf utility:
>
> --------------------------->8-------------------------
> # Overhead  Command  Shared Object      Symbol
> # ........  .......  .................  ..............................
> #
>     92.93%  kmscube  [kernel.kallsyms]  [k] udl_render_hline
>      2.51%  kmscube  [kernel.kallsyms]  [k] __divsi3
>      0.33%  kmscube  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
>      0.22%  kmscube  [kernel.kallsyms]  [k] lock_acquire
>      0.19%  kmscube  [kernel.kallsyms]  [k] _raw_spin_unlock_irq
>      0.17%  kmscube  [kernel.kallsyms]  [k] udl_handle_damage
>      0.12%  kmscube  [kernel.kallsyms]  [k] v7_dma_clean_range
>      0.11%  kmscube  [kernel.kallsyms]  [k] l2c210_clean_range
>      0.06%  kmscube  [kernel.kallsyms]  [k] __memzero
> --------------------------->8-------------------------
>
> That said, it's not even USB 2.0 that is the bottleneck but the
> computations in udl_render_hline().
>
> [1] https://cgit.freedesktop.org/mesa/kmscube/
>
> -Alexey

Try this patch:
http://people.redhat.com/~mpatocka/patches/kernel/udl/udlkms-avoid-division.patch

It is doing a lot of divisions - and the WandBoard has a Cortex-A9, which
doesn't have a division instruction.

BTW the framebuffer UDL driver (udlfb, not the modesetting driver) has
performance counters in sysfs. Their location depends on the system; you
can find them with
	find /sys -name "*metrics*"
The file "metrics_reset" resets the counters, so you can measure whether
the prefetch instructions improve performance or not.

Mikulas
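(In outline, what the avoid-division patch does -- a sketch assuming it
replaces divisions by bytes-per-pixel with shifts, as the log_bpp parameter
in the diff above suggests; the function names here are hypothetical, not
taken from the patch:

--------------------------->8-------------------------
#include <stdint.h>

/* Before: a division per call. On cores without a hardware divider,
 * such as the Cortex-A9, the compiler emits a call to __divsi3 --
 * exactly the libgcc symbol visible in the perf profile above. */
static inline uint32_t avail_pixels_div(uint32_t bytes, uint32_t bpp)
{
	return bytes / bpp;
}

/* After: bytes per pixel is always a power of two here (2 or 4), so a
 * log2 precomputed once turns the division into a single shift. */
static inline uint32_t avail_pixels_shift(uint32_t bytes, uint32_t log_bpp)
{
	return bytes >> log_bpp;	/* log_bpp = 1 for 16bpp, 2 for 32bpp */
}
--------------------------->8-------------------------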
Hi Mikulas,

On Wed, 2018-06-06 at 11:46 -0400, Mikulas Patocka wrote:
> On Wed, 6 Jun 2018, Alexey Brodkin wrote:
> > [...]
> > That said, it's not even USB 2.0 that is the bottleneck but the
> > computations in udl_render_hline().
>
> Try this patch:
> http://people.redhat.com/~mpatocka/patches/kernel/udl/udlkms-avoid-division.patch
>
> It is doing a lot of divisions - and the WandBoard has a Cortex-A9, which
> doesn't have a division instruction.

Looks like that patch doesn't make that much of a difference. Below are the
perf results.

Without the patch:
----------------------------->8-------------------------------------------
    91.46%  kmscube  [kernel.kallsyms]  [k] udl_render_hline
     2.15%  kmscube  [kernel.kallsyms]  [k] __divsi3
     1.09%  kmscube  [kernel.kallsyms]  [k] mmioset
     0.48%  kmscube  [kernel.kallsyms]  [k] _raw_spin_unlock_irq
     0.48%  kmscube  [kernel.kallsyms]  [k] v7_dma_clean_range
     0.34%  kmscube  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     0.28%  kmscube  [kernel.kallsyms]  [k] l2c210_clean_range
     0.25%  kmscube  [kernel.kallsyms]  [k] shmem_getpage_gfp.constprop.4
     0.18%  kmscube  [kernel.kallsyms]  [k] lock_acquire
----------------------------->8-------------------------------------------

With the patch:
----------------------------->8-------------------------------------------
    94.81%  kmscube  [kernel.kallsyms]  [k] udl_render_hline
     0.88%  kmscube  [kernel.kallsyms]  [k] mmioset
     0.44%  kmscube  [kernel.kallsyms]  [k] _raw_spin_unlock_irq
     0.38%  kmscube  [kernel.kallsyms]  [k] v7_dma_clean_range
     0.33%  kmscube  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
     0.22%  kmscube  [kernel.kallsyms]  [k] l2c210_clean_range
     0.16%  kmscube  [kernel.kallsyms]  [k] shmem_getpage_gfp.constprop.4
     0.16%  kmscube  [kernel.kallsyms]  [k] udl_handle_damage
     0.16%  kmscube  [kernel.kallsyms]  [k] lock_acquire
----------------------------->8-------------------------------------------

There is no more __divsi3 function in the perf results, so the patch works.
But that function was only ~2% of the overall overhead, so removing it can
improve things by at most about that much.

-Alexey
Index: linux-4.16.12/drivers/gpu/drm/udl/udl_transfer.c
===================================================================
--- linux-4.16.12.orig/drivers/gpu/drm/udl/udl_transfer.c	2018-05-31 14:48:12.000000000 +0200
+++ linux-4.16.12/drivers/gpu/drm/udl/udl_transfer.c	2018-05-31 14:48:12.000000000 +0200
@@ -13,7 +13,6 @@
 #include <linux/module.h>
 #include <linux/slab.h>
 #include <linux/fb.h>
-#include <linux/prefetch.h>
 #include <asm/unaligned.h>
 
 #include <drm/drmP.h>
@@ -51,9 +50,6 @@ static int udl_trim_hline(const u8 *bbac
 	int start = width;
 	int end = width;
 
-	prefetch((void *) front);
-	prefetch((void *) back);
-
 	for (j = 0; j < width; j++) {
 		if (back[j] != front[j]) {
 			start = j;
@@ -140,8 +136,6 @@ static void udl_compress_hline16(
 	const u8 *cmd_pixel_start, *cmd_pixel_end = NULL;
 	uint16_t pixel_val16;
 
-	prefetchw((void *) cmd); /* pull in one cache line at least */
-
 	*cmd++ = 0xaf;
 	*cmd++ = 0x6b;
 	*cmd++ = (uint8_t) ((dev_addr >> 16) & 0xFF);
@@ -158,7 +152,6 @@ static void udl_compress_hline16(
 			(unsigned long)(pixel_end - pixel) >> log_bpp,
 			(unsigned long)(cmd_buffer_end - 1 - cmd) / 2) << log_bpp);
 
-		prefetch_range((void *) pixel, cmd_pixel_end - pixel);
 		pixel_val16 = get_pixel_val16(pixel, log_bpp);
 
 		while (pixel < cmd_pixel_end) {
Modern processors can detect linear memory accesses and prefetch data
automatically, so there's no need to use prefetch.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 drivers/gpu/drm/udl/udl_transfer.c | 7 -------
 1 file changed, 7 deletions(-)