| Message ID | 20240201-yuv-v1-2-3ca376f27632@bootlin.com (mailing list archive) |
|---|---|
| State | New, archived |
| Series | Better support for complex pixel formats |
On Thu, 01 Feb 2024 18:31:32 +0100
Louis Chauvet <louis.chauvet@bootlin.com> wrote:

> Change the composition algorithm to iterate over pixels instead of lines.
> It allows a simpler management of rotation and pixel access for complex formats.
>
> This new algorithm allows read_pixel function to have access to x/y
> coordinates and make it possible to read the correct thing in a block
> when block_w and block_h are not 1.
> The iteration pixel-by-pixel in the same method also allows a simpler
> management of rotation with drm_rect_* helpers. This way it's not needed
> anymore to have mysterious switch-case distributed in multiple places.

Hi,

there was a very good reason to write this code using lines:
performance. Before lines, it was indeed operating on individual pixels.

Please include performance measurements before and after this series
to quantify the impact on the previously already supported pixel
formats, particularly the 32-bit-per-pixel RGB variants.

VKMS will be used more and more in CI for userspace projects, and
performance actually matters there.

I'm worried that the performance degradation here is significant. I
believe it is possible to keep blending with lines, if you add new line
getters for reading from rotated, sub-sampled etc. images. That way you
don't have to regress the most common formats' performance.
Thanks, pq > > This patch is tested with IGT: > - kms_plane@pixel_format > - kms_plane@pixel-format-source-clamping > - kms_plane@plane-position-covered > - kms_plane@plane-position-hole > - kms_plane@plane-position-hole-dpms > - kms_plane@plane-panning-top-left > - kms_plane@plane-panning-bottom-right > - kms_plane@test-odd-size-with-yuv - See [1] > - kms_cursor_crc@cursor-size-change > - kms_cursor_crc@cursor-alpha-opaque > - kms_cursor_crc@cursor-alpha-transparent > - kms_cursor_crc@cursor-dpms > - kms_cursor_crc@cursor-onscreen-* > - kms_cursor_crc@cursor-offscreen-* > - kms_cursor_crc@cursor-sliding-* > - kms_cursor_crc@cursor-random-* > - kms_cursor_crc@cursor-rapid-movement-* - FAIL, but with Overflow of > CRC buffer > - kms_rotation_crc@primary-rotation-* - See [1] > - kms_rotation_crc@sprite-rotation-* - See [1] > - kms_rotation_crc@cursor-rotation-* - See [1] > - kms_rotation_crc@sprite-rotation-90-pos-100-0 - See [1] > - kms_rotation_crc@multiplane-rotation - See [1] > - kms_rotation_crc@multiplane-rotation-cropping-* - See [1] > > [1]: https://lore.kernel.org/igt-dev/20240201-kms_tests-v1-0-bc34c5d28b3f@bootlin.com/T/#t > > Signed-off-by: Louis Chauvet <louis.chauvet@bootlin.com> > --- > drivers/gpu/drm/vkms/vkms_composer.c | 97 +++++++++----- > drivers/gpu/drm/vkms/vkms_drv.h | 21 ++- > drivers/gpu/drm/vkms/vkms_formats.c | 250 ++++++++++++++++++----------------- > drivers/gpu/drm/vkms/vkms_plane.c | 13 +- > 4 files changed, 221 insertions(+), 160 deletions(-) > > diff --git a/drivers/gpu/drm/vkms/vkms_composer.c b/drivers/gpu/drm/vkms/vkms_composer.c > index 3c99fb8b54e2..4c5439209907 100644 > --- a/drivers/gpu/drm/vkms/vkms_composer.c > +++ b/drivers/gpu/drm/vkms/vkms_composer.c > @@ -43,7 +43,7 @@ static void pre_mul_alpha_blend(struct vkms_frame_info *frame_info, > { > int x_dst = frame_info->dst.x1; > struct pixel_argb_u16 *out = output_buffer->pixels + x_dst; > - struct pixel_argb_u16 *in = stage_buffer->pixels; > + struct pixel_argb_u16 *in = 
stage_buffer->pixels + x_dst; > int x_limit = min_t(size_t, drm_rect_width(&frame_info->dst), > stage_buffer->n_pixels); > > @@ -55,33 +55,6 @@ static void pre_mul_alpha_blend(struct vkms_frame_info *frame_info, > } > } > > -static int get_y_pos(struct vkms_frame_info *frame_info, int y) > -{ > - if (frame_info->rotation & DRM_MODE_REFLECT_Y) > - return drm_rect_height(&frame_info->rotated) - y - 1; > - > - switch (frame_info->rotation & DRM_MODE_ROTATE_MASK) { > - case DRM_MODE_ROTATE_90: > - return frame_info->rotated.x2 - y - 1; > - case DRM_MODE_ROTATE_270: > - return y + frame_info->rotated.x1; > - default: > - return y; > - } > -} > - > -static bool check_limit(struct vkms_frame_info *frame_info, int pos) > -{ > - if (drm_rotation_90_or_270(frame_info->rotation)) { > - if (pos >= 0 && pos < drm_rect_width(&frame_info->rotated)) > - return true; > - } else { > - if (pos >= frame_info->rotated.y1 && pos < frame_info->rotated.y2) > - return true; > - } > - > - return false; > -} > > static void fill_background(const struct pixel_argb_u16 *background_color, > struct line_buffer *output_buffer) > @@ -182,18 +155,74 @@ static void blend(struct vkms_writeback_job *wb, > const struct pixel_argb_u16 background_color = { .a = 0xffff }; > > size_t crtc_y_limit = crtc_state->base.crtc->mode.vdisplay; > + size_t crtc_x_limit = crtc_state->base.crtc->mode.hdisplay; > > + /* > + * Iterating over each pixel to simplify the handling of rotations by using drm_rect_* > + * helpers. > + * Iteration per pixel also allosw a simple management of complex pixel formats. > + * > + * If needed, this triple loop might be a good place for optimizations. > + */ > for (size_t y = 0; y < crtc_y_limit; y++) { > fill_background(&background_color, output_buffer); > > /* The active planes are composed associatively in z-order. 
*/ > for (size_t i = 0; i < n_active_planes; i++) { > - y_pos = get_y_pos(plane[i]->frame_info, y); > - > - if (!check_limit(plane[i]->frame_info, y_pos)) > - continue; > - > - vkms_compose_row(stage_buffer, plane[i], y_pos); > + for (size_t x = 0; x < crtc_x_limit; x++) { > + /* > + * @x and @y are the coordinate in the output crtc. > + * @dst_px is used to easily check if a pixel must be blended. > + * @src_px is used to handle rotation. Just before blending, it > + * will contain the coordinate of the wanted source pixel in > + * the source buffer. > + * @tmp_src is the conversion of src rectangle to integer. > + */ > + > + struct drm_rect dst_px = DRM_RECT_INIT(x, y, 1, 1); > + struct drm_rect src_px = DRM_RECT_INIT(x, y, 1, 1); > + struct drm_rect tmp_src; > + > + drm_rect_fp_to_int(&tmp_src, &plane[i]->frame_info->src); > + > + /* > + * Check that dst_px is inside the plane output > + * Note: This is destructive for dst_px, if you need this > + * rectangle you have to do a copy > + */ > + if (!drm_rect_intersect(&dst_px, &plane[i]->frame_info->dst)) > + continue; > + > + /* > + * Transform the coordinate x/y from the crtc to coordinates into > + * coordinates for the src buffer. > + * > + * First step is to cancel the offset of the dst buffer. > + * Then t will have to invert the rotation. This assumes that > + * dst = drm_rect_rotate(src, rotation) (dst and src have the > + * space size, but can be rotated). > + * And then, apply the offset of the source rectangle to the > + * coordinate. 
> + */ > + drm_rect_translate(&src_px, -plane[i]->frame_info->dst.x1, > + -plane[i]->frame_info->dst.y1); > + drm_rect_rotate_inv(&src_px, > + drm_rect_width(&tmp_src), > + drm_rect_height(&tmp_src), > + plane[i]->frame_info->rotation); > + drm_rect_translate(&src_px, tmp_src.x1, tmp_src.y1); > + > + /* > + * Now, as the src_px contains the correct position, we can use > + * it to convert the pixel color > + */ > + plane[i]->pixel_read(plane[i]->frame_info, > + src_px.x1, > + src_px.y1, > + &stage_buffer->pixels[x], > + plane[i]->base.base.color_encoding, > + plane[i]->base.base.color_range); > + } > pre_mul_alpha_blend(plane[i]->frame_info, stage_buffer, > output_buffer); > } > diff --git a/drivers/gpu/drm/vkms/vkms_drv.h b/drivers/gpu/drm/vkms/vkms_drv.h > index cb20bab26cae..ba3c063adc5f 100644 > --- a/drivers/gpu/drm/vkms/vkms_drv.h > +++ b/drivers/gpu/drm/vkms/vkms_drv.h > @@ -25,10 +25,20 @@ > > #define VKMS_LUT_SIZE 256 > > +/** > + * struct vkms_frame_info - structure to store the state of a frame > + * > + * @fb: backing drm framebuffer > + * @src: source rectangle of this frame in the source framebuffer > + * @dst: destination rectangle in the crtc buffer > + * @map: see drm_shadow_plane_state@data > + * @rotation: rotation applied to the source. > + * > + * @src and @dst should have the same size modulo the rotation. > + */ > struct vkms_frame_info { > struct drm_framebuffer *fb; > struct drm_rect src, dst; > - struct drm_rect rotated; > struct iosys_map map[DRM_FORMAT_MAX_PLANES]; > unsigned int rotation; > }; > @@ -51,14 +61,15 @@ struct vkms_writeback_job { > /** > * typedef pixel_read_t - These functions are used to read the pixels in the source frame, convert > * them to argb16 and write them to out_pixel. 
> - * It assumes that src_pixels point to a valid pixel (not a block, or a block of 1x1 pixel) > * > - * @src_pixels: Source pointer to a pixel > + * @frame_info: Frame used as source for the pixel value > + * @x: X (width) coordinate in the source buffer > + * @y: Y (height) coordinate in the source buffer > * @out_pixel: Pointer where to write the pixel value > * @encoding: Color encoding to use (mainly used for YUV formats) > * @range: Color range to use (mainly used for YUV formats) > */ > -typedef void (*pixel_read_t)(u8 **src_pixels, int y, > +typedef void (*pixel_read_t)(struct vkms_frame_info *frame_info, int x, int y, > struct pixel_argb_u16 *out_pixel, enum drm_color_encoding enconding, > enum drm_color_range range); > > @@ -66,6 +77,8 @@ typedef void (*pixel_read_t)(u8 **src_pixels, int y, > * vkms_plane_state - Driver specific plane state > * @base: base plane state > * @frame_info: data required for composing computation > + * @pixel_read: function to read a pixel in this plane. The creator of a vkms_plane_state must > + * ensure that this pointer is valid > */ > struct vkms_plane_state { > struct drm_shadow_plane_state base; > diff --git a/drivers/gpu/drm/vkms/vkms_formats.c b/drivers/gpu/drm/vkms/vkms_formats.c > index c6376db58d38..8ff85ffda94f 100644 > --- a/drivers/gpu/drm/vkms/vkms_formats.c > +++ b/drivers/gpu/drm/vkms/vkms_formats.c > @@ -11,79 +11,88 @@ > > #include "vkms_formats.h" > > -static size_t pixel_offset(const struct vkms_frame_info *frame_info, int x, int y, size_t index) > + > +/** > + * packed_pixels_offset() - Get the offset of the block containing the pixel at coordinates x/y > + * > + * @frame_info: Buffer metadata > + * @x: The x coordinate of the wanted pixel in the buffer > + * @y: The y coordinate of the wanted pixel in the buffer > + * @plane: The index of the plane to use > + * > + * The caller must be aware that this offset is not always a pointer to a pixel. 
If individual > + * pixel values are needed, they have to be extracted from the block. > + */ > +static size_t packed_pixels_offset(const struct vkms_frame_info *frame_info, int x, int y, size_t plane) > { > struct drm_framebuffer *fb = frame_info->fb; > - > - return fb->offsets[index] + (y * fb->pitches[index]) > - + (x * fb->format->cpp[index]); > + const struct drm_format_info *format = frame_info->fb->format; > + /* Directly using x and y to multiply pitches and format->ccp is not sufficient because > + * in some formats a block can represent multiple pixels. > + * > + * Dividing x and y by the block size allows to extract the correct offset of the block > + * containing the pixel. > + */ > + return fb->offsets[plane] + > + (y / drm_format_info_block_width(format, plane)) * fb->pitches[plane] + > + (x / drm_format_info_block_height(format, plane)) * format->char_per_block[plane]; > } > > /* > - * packed_pixels_addr - Get the pointer to pixel of a given pair of coordinates > + * packed_pixels_addr - Get the pointer to the block containing the pixel at the given coordinate > * > * @frame_info: Buffer metadata > - * @x: The x(width) coordinate of the 2D buffer > - * @y: The y(Heigth) coordinate of the 2D buffer > + * @x: The x(width) coordinate inside the 2D buffer > + * @y: The y(Heigth) coordinate inside the 2D buffer > * @index: The index of the plane on the 2D buffer > * > - * Takes the information stored in the frame_info, a pair of coordinates, and > - * returns the address of the first color channel on the desired index. The > - * format's specific subsample is applied. > + * Takes the information stored in the frame_info, a pair of coordinates, and returns the address > + * of the block containing this pixel. > + * The caller must be aware that this pointer is sometimes not directly a pixel, it needs some > + * additional work to extract pixel color from this block. 
> */ > static void *packed_pixels_addr(const struct vkms_frame_info *frame_info, > int x, int y, size_t index) > { > - int vsub = index == 0 ? 1 : frame_info->fb->format->vsub; > - int hsub = index == 0 ? 1 : frame_info->fb->format->hsub; > - size_t offset = pixel_offset(frame_info, x / hsub, y / vsub, index); > - > - return (u8 *)frame_info->map[0].vaddr + offset; > + return (u8 *)frame_info->map[0].vaddr + packed_pixels_offset(frame_info, x, y, index); > } > > -static void *get_packed_src_addr(const struct vkms_frame_info *frame_info, int y, size_t index) > +static void ARGB8888_to_argb_u16(struct vkms_frame_info *frame_info, int x, int y, > + struct pixel_argb_u16 *out_pixel, enum drm_color_encoding encoding, > + enum drm_color_range range) > { > - int x_src = frame_info->src.x1 >> 16; > - int y_src = y - frame_info->rotated.y1 + (frame_info->src.y1 >> 16); > - > - return packed_pixels_addr(frame_info, x_src, y_src, index); > -} > + u8 *src_pixel = packed_pixels_addr(frame_info, x, y, 0); > > -static int get_x_position(const struct vkms_frame_info *frame_info, int limit, int x) > -{ > - if (frame_info->rotation & (DRM_MODE_REFLECT_X | DRM_MODE_ROTATE_270)) > - return limit - x - 1; > - return x; > -} > - > -static void ARGB8888_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel, > - enum drm_color_encoding encoding, enum drm_color_range range) > -{ > /* > * The 257 is the "conversion ratio". This number is obtained by the > * (2^16 - 1) / (2^8 - 1) division. Which, in this case, tries to get > * the best color value in a pixel format with more possibilities. > * A similar idea applies to others RGB color conversions. 
> */ > - out_pixel->a = (u16)src_pixels[0][3] * 257; > - out_pixel->r = (u16)src_pixels[0][2] * 257; > - out_pixel->g = (u16)src_pixels[0][1] * 257; > - out_pixel->b = (u16)src_pixels[0][0] * 257; > + out_pixel->a = (u16)src_pixel[3] * 257; > + out_pixel->r = (u16)src_pixel[2] * 257; > + out_pixel->g = (u16)src_pixel[1] * 257; > + out_pixel->b = (u16)src_pixel[0] * 257; > } > > -static void XRGB8888_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel, > - enum drm_color_encoding encoding, enum drm_color_range range) > +static void XRGB8888_to_argb_u16(struct vkms_frame_info *frame_info, int x, int y, > + struct pixel_argb_u16 *out_pixel, enum drm_color_encoding encoding, > + enum drm_color_range range) > { > + u8 *src_pixel = packed_pixels_addr(frame_info, x, y, 0); > out_pixel->a = (u16)0xffff; > - out_pixel->r = (u16)src_pixels[0][2] * 257; > - out_pixel->g = (u16)src_pixels[0][1] * 257; > - out_pixel->b = (u16)src_pixels[0][0] * 257; > + out_pixel->r = (u16)src_pixel[2] * 257; > + out_pixel->g = (u16)src_pixel[1] * 257; > + out_pixel->b = (u16)src_pixel[0] * 257; > } > > -static void ARGB16161616_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel, > - enum drm_color_encoding encoding, enum drm_color_range range) > +static void ARGB16161616_to_argb_u16(struct vkms_frame_info *frame_info, int x, int y, > + struct pixel_argb_u16 *out_pixel, > + enum drm_color_encoding encoding, > + enum drm_color_range range) > { > - u16 *pixels = (u16 *)src_pixels[0]; > + u8 *src_pixel = packed_pixels_addr(frame_info, x, y, 0); > + u16 *pixels = (u16 *)src_pixel; > > out_pixel->a = le16_to_cpu(pixels[3]); > out_pixel->r = le16_to_cpu(pixels[2]); > @@ -91,10 +100,13 @@ static void ARGB16161616_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out > out_pixel->b = le16_to_cpu(pixels[0]); > } > > -static void XRGB16161616_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel, > - enum drm_color_encoding encoding, enum drm_color_range range) > +static 
void XRGB16161616_to_argb_u16(struct vkms_frame_info *frame_info, int x, int y, > + struct pixel_argb_u16 *out_pixel, > + enum drm_color_encoding encoding, > + enum drm_color_range range) > { > - u16 *pixels = (u16 *)src_pixels[0]; > + u8 *src_pixel = packed_pixels_addr(frame_info, x, y, 0); > + u16 *pixels = (u16 *)src_pixel; > > out_pixel->a = (u16)0xffff; > out_pixel->r = le16_to_cpu(pixels[2]); > @@ -102,10 +114,12 @@ static void XRGB16161616_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out > out_pixel->b = le16_to_cpu(pixels[0]); > } > > -static void RGB565_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel, > - enum drm_color_encoding encoding, enum drm_color_range range) > +static void RGB565_to_argb_u16(struct vkms_frame_info *frame_info, int x, int y, > + struct pixel_argb_u16 *out_pixel, enum drm_color_encoding encoding, > + enum drm_color_range range) > { > - u16 *pixels = (u16 *)src_pixels[0]; > + u8 *src_pixel = packed_pixels_addr(frame_info, x, y, 0); > + u16 *pixels = (u16 *)src_pixel; > > s64 fp_rb_ratio = drm_fixp_div(drm_int2fixp(65535), drm_int2fixp(31)); > s64 fp_g_ratio = drm_fixp_div(drm_int2fixp(65535), drm_int2fixp(63)); > @@ -121,12 +135,19 @@ static void RGB565_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel > out_pixel->b = drm_fixp2int_round(drm_fixp_mul(fp_b, fp_rb_ratio)); > } > > +/** > + * ycbcr2rgb() - helper to convert ycbcr 8 bits to rbg 16 bits > + * > + * @m: conversion matrix to use > + * @y, @cb, @cr: component of the ycbcr pixel > + * @r, @g, @b: pointers to write the rbg pixel > + */ > static void ycbcr2rgb(const s16 m[3][3], u8 y, u8 cb, u8 cr, u8 y_offset, u8 *r, u8 *g, u8 *b) > { > s32 y_16, cb_16, cr_16; > s32 r_16, g_16, b_16; > > - y_16 = y - y_offset; > + y_16 = y - y_offset; > cb_16 = cb - 128; > cr_16 = cr - 128; > > @@ -139,6 +160,14 @@ static void ycbcr2rgb(const s16 m[3][3], u8 y, u8 cb, u8 cr, u8 y_offset, u8 *r, > *b = clamp(b_16, 0, 0xffff) >> 8; > } > > +/** > + * 
yuv_u8_to_argb_u16() - helper to convert yuv 8 bits to argb 16 bits > + * > + * @argb_u16: pointer to write the converted pixel > + * @yuv_u8: pointer to the source yuv pixel > + * @encoding: encoding to use > + * @range: range to use > + */ > VISIBLE_IF_KUNIT void yuv_u8_to_argb_u16(struct pixel_argb_u16 *argb_u16, > const struct pixel_yuv_u8 *yuv_u8, > enum drm_color_encoding encoding, > @@ -205,104 +234,89 @@ VISIBLE_IF_KUNIT void yuv_u8_to_argb_u16(struct pixel_argb_u16 *argb_u16, > } > EXPORT_SYMBOL_IF_KUNIT(yuv_u8_to_argb_u16); > > -static void semi_planar_yuv_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel, > +static void semi_planar_yuv_to_argb_u16(struct vkms_frame_info *frame_info, int x, > + int y, struct pixel_argb_u16 *out_pixel, > enum drm_color_encoding encoding, > enum drm_color_range range) > { > struct pixel_yuv_u8 yuv_u8; > > - yuv_u8.y = src_pixels[0][0]; > - yuv_u8.u = src_pixels[1][0]; > - yuv_u8.v = src_pixels[1][1]; > + /* Subsampling must be applied for semi-planar yuv formats */ > + int vsub = frame_info->fb->format->vsub; > + int hsub = frame_info->fb->format->hsub; > + > + u8 *src_y = packed_pixels_addr(frame_info, x, y, 0); > + u8 *src_uv = packed_pixels_addr(frame_info, x / hsub, y / vsub, 1); > + > + yuv_u8.y = src_y[0]; > + yuv_u8.u = src_uv[0]; > + yuv_u8.v = src_uv[1]; > > yuv_u8_to_argb_u16(out_pixel, &yuv_u8, encoding, range); > } > > -static void semi_planar_yvu_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel, > +static void semi_planar_yvu_to_argb_u16(struct vkms_frame_info *frame_info, int x, > + int y, struct pixel_argb_u16 *out_pixel, > enum drm_color_encoding encoding, > enum drm_color_range range) > { > struct pixel_yuv_u8 yuv_u8; > > - yuv_u8.y = src_pixels[0][0]; > - yuv_u8.v = src_pixels[1][0]; > - yuv_u8.u = src_pixels[1][1]; > + /* Subsampling must be applied for semi-planar yuv formats */ > + int vsub = frame_info->fb->format->vsub ? 
frame_info->fb->format->vsub : 1; > + int hsub = frame_info->fb->format->hsub ? frame_info->fb->format->hsub : 1; > + u8 *src_y = packed_pixels_addr(frame_info, x, y, 0); > + u8 *src_vu = packed_pixels_addr(frame_info, x / hsub, y / vsub, 1); > + > + yuv_u8.y = src_y[0]; > + yuv_u8.v = src_vu[0]; > + yuv_u8.u = src_vu[1]; > > yuv_u8_to_argb_u16(out_pixel, &yuv_u8, encoding, range); > } > > -static void planar_yuv_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel, > - enum drm_color_encoding encoding, enum drm_color_range range) > +static void planar_yuv_to_argb_u16(struct vkms_frame_info *frame_info, int x, int y, > + struct pixel_argb_u16 *out_pixel, > + enum drm_color_encoding encoding, > + enum drm_color_range range) > { > struct pixel_yuv_u8 yuv_u8; > > - yuv_u8.y = src_pixels[0][0]; > - yuv_u8.u = src_pixels[1][0]; > - yuv_u8.v = src_pixels[2][0]; > + /* Subsampling must be applied for planar yuv formats */ > + int vsub = frame_info->fb->format->vsub ? frame_info->fb->format->vsub : 1; > + int hsub = frame_info->fb->format->hsub ? 
frame_info->fb->format->hsub : 1; > + > + u8 *src_y = packed_pixels_addr(frame_info, x, y, 0); > + u8 *src_u = packed_pixels_addr(frame_info, x / hsub, y / vsub, 1); > + u8 *src_v = packed_pixels_addr(frame_info, x / hsub, y / vsub, 2); > + > + yuv_u8.y = src_y[0]; > + yuv_u8.u = src_u[0]; > + yuv_u8.v = src_v[0]; > > yuv_u8_to_argb_u16(out_pixel, &yuv_u8, encoding, range); > } > > -static void planar_yvu_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel, > - enum drm_color_encoding encoding, enum drm_color_range range) > +static void planar_yvu_to_argb_u16(struct vkms_frame_info *frame_info, int x, int y, > + struct pixel_argb_u16 *out_pixel, > + enum drm_color_encoding encoding, > + enum drm_color_range range) > { > struct pixel_yuv_u8 yuv_u8; > > - yuv_u8.y = src_pixels[0][0]; > - yuv_u8.v = src_pixels[1][0]; > - yuv_u8.u = src_pixels[2][0]; > + /* Subsampling must be applied for semi-planar yuv formats */ > + int vsub = frame_info->fb->format->vsub ? frame_info->fb->format->vsub : 1; > + int hsub = frame_info->fb->format->hsub ? frame_info->fb->format->hsub : 1; > > - yuv_u8_to_argb_u16(out_pixel, &yuv_u8, encoding, range); > -} > + u8 *src_y = packed_pixels_addr(frame_info, x, y, 0); > + u8 *src_u = packed_pixels_addr(frame_info, x / hsub, y / vsub, 1); > + u8 *src_v = packed_pixels_addr(frame_info, x / hsub, y / vsub, 2); > > -/** > - * vkms_compose_row - compose a single row of a plane > - * @stage_buffer: output line with the composed pixels > - * @plane: state of the plane that is being composed > - * @y: y coordinate of the row > - * > - * This function composes a single row of a plane. It gets the source pixels > - * through the y coordinate (see get_packed_src_addr()) and goes linearly > - * through the source pixel, reading the pixels and converting it to > - * ARGB16161616 (see the pixel_read() callback). For rotate-90 and rotate-270, > - * the source pixels are not traversed linearly. 
The source pixels are queried > - * on each iteration in order to traverse the pixels vertically. > - */ > -void vkms_compose_row(struct line_buffer *stage_buffer, struct vkms_plane_state *plane, int y) > -{ > - struct pixel_argb_u16 *out_pixels = stage_buffer->pixels; > - struct vkms_frame_info *frame_info = plane->frame_info; > - const struct drm_format_info *frame_format = frame_info->fb->format; > - int limit = min_t(size_t, drm_rect_width(&frame_info->dst), stage_buffer->n_pixels); > - u8 *src_pixels[DRM_FORMAT_MAX_PLANES]; > - > - enum drm_color_encoding encoding = plane->base.base.color_encoding; > - enum drm_color_range range = plane->base.base.color_range; > - > - for (size_t i = 0; i < frame_format->num_planes; i++) > - src_pixels[i] = get_packed_src_addr(frame_info, y, i); > - > - for (size_t x = 0; x < limit; x++) { > - int x_pos = get_x_position(frame_info, limit, x); > - > - bool shoud_inc = !((x + 1) % frame_format->num_planes); > - > - if (drm_rotation_90_or_270(frame_info->rotation)) { > - for (size_t i = 0; i < frame_format->num_planes; i++) { > - src_pixels[i] = get_packed_src_addr(frame_info, > - x + frame_info->rotated.y1, i); > - if (!i || shoud_inc) > - src_pixels[i] += frame_format->cpp[i] * y; > - } > - } > - > - plane->pixel_read(src_pixels, &out_pixels[x_pos], encoding, range); > - > - for (size_t i = 0; i < frame_format->num_planes; i++) { > - if (!i || shoud_inc) > - src_pixels[i] += frame_format->cpp[i]; > - } > - } > + yuv_u8.y = src_y[0]; > + yuv_u8.v = src_u[0]; > + yuv_u8.u = src_v[0]; > + > + yuv_u8_to_argb_u16(out_pixel, &yuv_u8, encoding, range); > } > > /* > diff --git a/drivers/gpu/drm/vkms/vkms_plane.c b/drivers/gpu/drm/vkms/vkms_plane.c > index 932736fc3ee9..d818ed246a46 100644 > --- a/drivers/gpu/drm/vkms/vkms_plane.c > +++ b/drivers/gpu/drm/vkms/vkms_plane.c > @@ -118,13 +118,20 @@ static void vkms_plane_atomic_update(struct drm_plane *plane, > return; > > fmt = fb->format->format; > + pixel_read_t pixel_read = 
get_pixel_conversion_function(fmt); > + > + if (!pixel_read) { > + DRM_WARN("The requested pixel format is not supported by VKMS. The new state is " > + "not applied.\n"); > + return; > + } > + > vkms_plane_state = to_vkms_plane_state(new_state); > shadow_plane_state = &vkms_plane_state->base; > > frame_info = vkms_plane_state->frame_info; > memcpy(&frame_info->src, &new_state->src, sizeof(struct drm_rect)); > memcpy(&frame_info->dst, &new_state->dst, sizeof(struct drm_rect)); > - memcpy(&frame_info->rotated, &new_state->dst, sizeof(struct drm_rect)); > frame_info->fb = fb; > memcpy(&frame_info->map, &shadow_plane_state->data, sizeof(frame_info->map)); > drm_framebuffer_get(frame_info->fb); > @@ -134,10 +141,8 @@ static void vkms_plane_atomic_update(struct drm_plane *plane, > DRM_MODE_REFLECT_X | > DRM_MODE_REFLECT_Y); > > - drm_rect_rotate(&frame_info->rotated, drm_rect_width(&frame_info->rotated), > - drm_rect_height(&frame_info->rotated), frame_info->rotation); > > - vkms_plane_state->pixel_read = get_pixel_conversion_function(fmt); > + vkms_plane_state->pixel_read = pixel_read; > } > > static int vkms_plane_atomic_check(struct drm_plane *plane, >
Hi Pekka,

pekka.paalanen@haloniitty.fi wrote on Fri, 2 Feb 2024 10:55:22 +0200:

> On Thu, 01 Feb 2024 18:31:32 +0100
> Louis Chauvet <louis.chauvet@bootlin.com> wrote:
>
> > Change the composition algorithm to iterate over pixels instead of lines.
> > It allows a simpler management of rotation and pixel access for complex formats.
> >
> > This new algorithm allows read_pixel function to have access to x/y
> > coordinates and make it possible to read the correct thing in a block
> > when block_w and block_h are not 1.
> > The iteration pixel-by-pixel in the same method also allows a simpler
> > management of rotation with drm_rect_* helpers. This way it's not needed
> > anymore to have mysterious switch-case distributed in multiple places.
>
> Hi,
>
> there was a very good reason to write this code using lines:
> performance. Before lines, it was indeed operating on individual pixels.
>
> Please include performance measurements before and after this series
> to quantify the impact on the previously already supported pixel
> formats, particularly the 32-bit-per-pixel RGB variants.
>
> VKMS will be used more and more in CI for userspace projects, and
> performance actually matters there.
>
> I'm worried that the performance degradation here is significant. I
> believe it is possible to keep blending with lines, if you add new line
> getters for reading from rotated, sub-sampled etc. images. That way you
> don't have to regress the most common formats' performance.

While I understand performance is important and should be taken
seriously, I cannot understand how broken testing could be considered
better. Fast but inaccurate will always be significantly less attractive
to my eyes.

I am in favor of making this work first, and then improving the code
for faster results. Maybe the line-driven approach can be dedicated to
"simpler" formats where more complex corner cases do not happen.
But for now I don't see the point in comparing performance between
broken and (hopefully) non-broken implementations.

Thanks,
Miquèl
Hi Miquel,

On 2/2/24 06:26, Miquel Raynal wrote:
> Hi Pekka,
>
> pekka.paalanen@haloniitty.fi wrote on Fri, 2 Feb 2024 10:55:22 +0200:
>
>> On Thu, 01 Feb 2024 18:31:32 +0100
>> Louis Chauvet <louis.chauvet@bootlin.com> wrote:
>>
>>> Change the composition algorithm to iterate over pixels instead of lines.
>>> It allows a simpler management of rotation and pixel access for complex formats.
>>>
>>> This new algorithm allows read_pixel function to have access to x/y
>>> coordinates and make it possible to read the correct thing in a block
>>> when block_w and block_h are not 1.
>>> The iteration pixel-by-pixel in the same method also allows a simpler
>>> management of rotation with drm_rect_* helpers. This way it's not needed
>>> anymore to have mysterious switch-case distributed in multiple places.
>>
>> Hi,
>>
>> there was a very good reason to write this code using lines:
>> performance. Before lines, it was indeed operating on individual pixels.
>>
>> Please include performance measurements before and after this series
>> to quantify the impact on the previously already supported pixel
>> formats, particularly the 32-bit-per-pixel RGB variants.
>>
>> VKMS will be used more and more in CI for userspace projects, and
>> performance actually matters there.
>>
>> I'm worried that the performance degradation here is significant. I
>> believe it is possible to keep blending with lines, if you add new line
>> getters for reading from rotated, sub-sampled etc. images. That way you
>> don't have to regress the most common formats' performance.
>
> While I understand performance is important and should be taken
> seriously, I cannot understand how broken testing could be considered
> better. Fast but inaccurate will always be significantly less attractive
> to my eyes.
>
> I am in favor of making this work first, and then improving the code
> for faster results.
> Maybe the line-driven approach can be dedicated to
> "simpler" formats where more complex corner cases do not happen. But
> for now I don't see the point in comparing performance between broken
> and (hopefully) non-broken implementations.

We still haven't landed the YUV series for VKMS. Therefore, the code
available in the repository is not broken.

Performance is crucial for VKMS, as Pekka mentioned. First, because it
will be used more and more in CI. Second, because I wouldn't like to see
IGT tests timing out in VKMS.

Best Regards,
- Maíra

>
> Thanks,
> Miquèl
Hi Miquel, On Fri, Feb 02, 2024 at 10:26:01AM +0100, Miquel Raynal wrote: > pekka.paalanen@haloniitty.fi wrote on Fri, 2 Feb 2024 10:55:22 +0200: > > > On Thu, 01 Feb 2024 18:31:32 +0100 > > Louis Chauvet <louis.chauvet@bootlin.com> wrote: > > > > > Change the composition algorithm to iterate over pixels instead of lines. > > > It allows a simpler management of rotation and pixel access for complex formats. > > > > > > This new algorithm allows read_pixel function to have access to x/y > > > coordinates and make it possible to read the correct thing in a block > > > when block_w and block_h are not 1. > > > The iteration pixel-by-pixel in the same method also allows a simpler > > > management of rotation with drm_rect_* helpers. This way it's not needed > > > anymore to have misterious switch-case distributed in multiple places. > > > > Hi, > > > > there was a very good reason to write this code using lines: > > performance. Before lines, it was indeed operating on individual pixels. > > > > Please, include performance measurements before and after this series > > to quantify the impact on the previously already supported pixel > > formats, particularly the 32-bit-per-pixel RGB variants. > > > > VKMS will be used more and more in CI for userspace projects, and > > performance actually matters there. > > > > I'm worrying that this performance degradation here is significant. I > > believe it is possible to keep blending with lines, if you add new line > > getters for reading from rotated, sub-sampled etc. images. That way you > > don't have to regress the most common formats' performance. > > While I understand performance is important and should be taken into > account seriously, I cannot understand how broken testing could be > considered better. Fast but inaccurate will always be significantly > less attractive to my eyes. 
AFAIK, neither the cover letter nor the commit log claimed it was
fixing something broken, just that it was "better" (according to what
criteria?). If something is truly broken, it must be stated what
exactly is broken so we can all come up with a solution that will
satisfy everyone.

> I am in favor of making this working first, and then improving the code
> for faster results. Maybe the line-driven approach can be dedicated to
> "simpler" formats where more complex corner cases do not happen.

Which ones?

Maxime
On Fri, 2 Feb 2024 10:26:01 +0100 Miquel Raynal <miquel.raynal@bootlin.com> wrote: > Hi Pekka, > > pekka.paalanen@haloniitty.fi wrote on Fri, 2 Feb 2024 10:55:22 +0200: > > > On Thu, 01 Feb 2024 18:31:32 +0100 > > Louis Chauvet <louis.chauvet@bootlin.com> wrote: > > > > > Change the composition algorithm to iterate over pixels instead of lines. > > > It allows a simpler management of rotation and pixel access for complex formats. > > > > > > This new algorithm allows read_pixel function to have access to x/y > > > coordinates and make it possible to read the correct thing in a block > > > when block_w and block_h are not 1. > > > The iteration pixel-by-pixel in the same method also allows a simpler > > > management of rotation with drm_rect_* helpers. This way it's not needed > > > anymore to have misterious switch-case distributed in multiple places. > > > > Hi, > > > > there was a very good reason to write this code using lines: > > performance. Before lines, it was indeed operating on individual pixels. > > > > Please, include performance measurements before and after this series > > to quantify the impact on the previously already supported pixel > > formats, particularly the 32-bit-per-pixel RGB variants. > > > > VKMS will be used more and more in CI for userspace projects, and > > performance actually matters there. > > > > I'm worrying that this performance degradation here is significant. I > > believe it is possible to keep blending with lines, if you add new line > > getters for reading from rotated, sub-sampled etc. images. That way you > > don't have to regress the most common formats' performance. > > While I understand performance is important and should be taken into > account seriously, I cannot understand how broken testing could be > considered better. Fast but inaccurate will always be significantly > less attractive to my eyes. > > I am in favor of making this working first, and then improving the code > for faster results. 
> Maybe the line-driven approach can be dedicated to "simpler" formats
> where more complex corner cases do not happen. But for now I don't see
> the point in comparing performances between broken and (hopefully) non
> broken implementations.

Hi,

usually that is a good idea, but in this case you are changing the
fundamental design of the algorithm. That makes it easy to add new
things (e.g. YUV support) that will be even harder to make faster
later. If you then later improve performance, that is another
re-design. You would be making maintainers review two rewrites instead
of one or less.

I suspect it might be less effort for the author as well not to ditch
the line-based algorithm as the first step.

Thanks,
pq
Hello Maxime, + Arthur mripard@kernel.org wrote on Fri, 2 Feb 2024 10:53:37 +0100: > Hi Miquel, > > On Fri, Feb 02, 2024 at 10:26:01AM +0100, Miquel Raynal wrote: > > pekka.paalanen@haloniitty.fi wrote on Fri, 2 Feb 2024 10:55:22 +0200: > > > > > On Thu, 01 Feb 2024 18:31:32 +0100 > > > Louis Chauvet <louis.chauvet@bootlin.com> wrote: > > > > > > > Change the composition algorithm to iterate over pixels instead of lines. > > > > It allows a simpler management of rotation and pixel access for complex formats. > > > > > > > > This new algorithm allows read_pixel function to have access to x/y > > > > coordinates and make it possible to read the correct thing in a block > > > > when block_w and block_h are not 1. > > > > The iteration pixel-by-pixel in the same method also allows a simpler > > > > management of rotation with drm_rect_* helpers. This way it's not needed > > > > anymore to have misterious switch-case distributed in multiple places. > > > > > > Hi, > > > > > > there was a very good reason to write this code using lines: > > > performance. Before lines, it was indeed operating on individual pixels. > > > > > > Please, include performance measurements before and after this series > > > to quantify the impact on the previously already supported pixel > > > formats, particularly the 32-bit-per-pixel RGB variants. > > > > > > VKMS will be used more and more in CI for userspace projects, and > > > performance actually matters there. > > > > > > I'm worrying that this performance degradation here is significant. I > > > believe it is possible to keep blending with lines, if you add new line > > > getters for reading from rotated, sub-sampled etc. images. That way you > > > don't have to regress the most common formats' performance. > > > > While I understand performance is important and should be taken into > > account seriously, I cannot understand how broken testing could be > > considered better. 
> > Fast but inaccurate will always be significantly less attractive to
> > my eyes.
>
> AFAIK, neither the cover letter nor the commit log claimed it was fixing
> something broken, just that it was "better" (according to what
> criteria?).

Better is probably too vague, and I agree the "fixing" part is not
clearly explained in the commit log. The cover-letter however states:

> Patch 2/2: This patch is more complex. My main target was to solve issues
> I found in [1], but as it was very complex to do it "in place", I choose
> to rework the composition function.
...
> [1]: https://lore.kernel.org/dri-devel/20240110-vkms-yuv-v2-0-952fcaa5a193@riseup.net/

If you follow this link you will find all the feedback and especially
the "broken" parts. Just to be clear, writing bugs is totally expected
and review/testing is supposed to help in this regard. I am not blaming
the author in any way, just focusing on getting this code into a more
readable shape and hopefully reinforcing the testing procedure.

> If something is truly broken, it must be stated what exactly is so we
> can all come up with a solution that will satisfy everyone.

Maybe going through the series pointed above will give more context,
but AFAIU the YUV composition is not totally right (and the tests used
to validate it need to be more complex as well in order to fail).

Here is a proposal.

Today's RGB implementation is only optimized in the line-by-line case
when there is no rotation. The logic is a bit convoluted and could
possibly be slightly clarified with a per-format read_line()
implementation, at a very light performance cost. Such an improvement
would definitely benefit the clarity of the code, especially when
transformations (rotations in particular) come into play, because they
would be clearly handled differently instead of being "hidden" in the
optimized logic. Performance would not change much, as this path is not
optimized today anyway (the pixel-oriented logic is already used in the
rotation case).

Arthur's YUV implementation is indeed well optimized, but the added
complexity probably led to small mistakes in the logic. The per-format
read_line() implementation mentioned above could be extended to the
YUV formats as well, which would leverage Arthur's proposal by re-using
his optimized version. Louis will help in this regard. However, for
more complex cases such as when there is a rotation, it will be easier
(and not sub-optimized compared to the RGB case) to also fall back to
pixel-oriented processing.

Would this approach make sense?

Thanks,
Miquèl
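[Editor's note: the per-format read_line() idea discussed above can be
sketched roughly as below. This is only an illustration of the design,
not actual VKMS code — the names (read_line_t, get_read_line(), the
ARGB16 struct layout) are hypothetical. The point is that the
format-specific conversion is selected once per plane, outside any
loop, and the conversion itself runs as a tight loop with no per-pixel
indirect call.]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Common intermediate pixel format, 16 bits per channel. */
struct pixel_argb_u16 { uint16_t a, r, g, b; };

/* One row of source pixels converted per call. */
typedef void (*read_line_t)(const uint8_t *src, int count,
			    struct pixel_argb_u16 *out);

static void XRGB8888_read_line(const uint8_t *src, int count,
			       struct pixel_argb_u16 *out)
{
	/* Tight inner loop: the per-pixel cost is just the arithmetic,
	 * not a function-pointer call.  257 == 0x101 expands a u8
	 * channel to u16 (0xff -> 0xffff). */
	for (int i = 0; i < count; i++) {
		out[i].a = 0xffff;
		out[i].r = src[4 * i + 2] * 257;
		out[i].g = src[4 * i + 1] * 257;
		out[i].b = src[4 * i + 0] * 257;
	}
}

static read_line_t get_read_line(uint32_t fourcc)
{
	/* Format dispatch happens once per plane, not per pixel. */
	switch (fourcc) {
	case 0x34325258: /* DRM_FORMAT_XRGB8888, 'XR24' */
		return XRGB8888_read_line;
	default:
		return NULL;
	}
}
```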
On Fri, 2 Feb 2024 13:13:22 +0100 Miquel Raynal <miquel.raynal@bootlin.com> wrote: > Hello Maxime, > > + Arthur > > mripard@kernel.org wrote on Fri, 2 Feb 2024 10:53:37 +0100: > > > Hi Miquel, > > > > On Fri, Feb 02, 2024 at 10:26:01AM +0100, Miquel Raynal wrote: > > > pekka.paalanen@haloniitty.fi wrote on Fri, 2 Feb 2024 10:55:22 +0200: > > > > > > > On Thu, 01 Feb 2024 18:31:32 +0100 > > > > Louis Chauvet <louis.chauvet@bootlin.com> wrote: > > > > > > > > > Change the composition algorithm to iterate over pixels instead of lines. > > > > > It allows a simpler management of rotation and pixel access for complex formats. > > > > > > > > > > This new algorithm allows read_pixel function to have access to x/y > > > > > coordinates and make it possible to read the correct thing in a block > > > > > when block_w and block_h are not 1. > > > > > The iteration pixel-by-pixel in the same method also allows a simpler > > > > > management of rotation with drm_rect_* helpers. This way it's not needed > > > > > anymore to have misterious switch-case distributed in multiple places. > > > > > > > > Hi, > > > > > > > > there was a very good reason to write this code using lines: > > > > performance. Before lines, it was indeed operating on individual pixels. > > > > > > > > Please, include performance measurements before and after this series > > > > to quantify the impact on the previously already supported pixel > > > > formats, particularly the 32-bit-per-pixel RGB variants. > > > > > > > > VKMS will be used more and more in CI for userspace projects, and > > > > performance actually matters there. > > > > > > > > I'm worrying that this performance degradation here is significant. I > > > > believe it is possible to keep blending with lines, if you add new line > > > > getters for reading from rotated, sub-sampled etc. images. That way you > > > > don't have to regress the most common formats' performance. 
> > > > > > While I understand performance is important and should be taken into > > > account seriously, I cannot understand how broken testing could be > > > considered better. Fast but inaccurate will always be significantly > > > less attractive to my eyes. > > > > AFAIK, neither the cover letter nor the commit log claimed it was fixing > > something broken, just that it was "better" (according to what > > criteria?). > > Better is probably too vague and I agree the "fixing" part is not > clearly explained in the commit log. The cover-letter however states: > > > Patch 2/2: This patch is more complex. My main target was to solve issues > > I found in [1], but as it was very complex to do it "in place", I choose > > to rework the composition function. > ... > > [1]: https://lore.kernel.org/dri-devel/20240110-vkms-yuv-v2-0-952fcaa5a193@riseup.net/ > > If you follow this link you will find all the feedback and especially > the "broken" parts. Just to be clear, writing bugs is totally expected > and review/testing is supposed to help on this regard. I am not blaming > the author in any way, just focusing on getting this code in a more > readable shape and hopefully reinforce the testing procedure. > > > If something is truly broken, it must be stated what exactly is so we > > can all come up with a solution that will satisfy everyone. > > Maybe going through the series pointed above will give more context > but AFAIU: the YUV composition is not totally right (and the tests used > to validate it need to be more complex as well in order to fail). > > Here is a proposal. > > Today's RGB implementation is only optimized in the line-by-line case > when there is no rotation. The logic is bit convoluted and may possibly > be slightly clarified with a per-format read_line() implementation, > at a very light performance cost. 
Such an improvement would definitely > benefit to the clarity of the code, especially when transformations > (especially the rotations) come into play because they would be clearly > handled differently instead of being "hidden" in the optimized logic. > Performances would not change much as this path is not optimized today > anyway (the pixel-oriented logic is already used in the rotation case). > > Arthur's YUV implementation is indeed well optimized but the added > complexity probably lead to small mistakes in the logic. The > per-format read_line() implementation mentioned above could be > extended to the YUV format as well, which would leverage Arthur's > proposal by re-using his optimized version. Louis will help on this > regard. However, for more complex cases such as when there is a > rotation, it will be easier (and not sub-optimized compared to the RGB > case) to also fallback to a pixel-oriented processing. > > Would this approach make sense? Hi, I think it would, if I understand what you mean. Ever since I proposed a line-by-line algorithm to improve the performance, I was thinking of per-format read_line() functions that would be selected outside of any loops. Extending that to support YUV is only natural. I can imagine rotation complicates things, and I won't oppose that resulting in a much heavier read_line() implementation used in those cases. They might perhaps call the original read_line() implementations pixel-by-pixel or plane-by-plane (i.e. YUV planes) per pixel. Chroma-siting complicates things even further. That way one could compose any rotation-format-siting combination by chaining function pointers. I haven't looked at VKMS in a long time, and I am disappointed to find that vkms_compose_row() is calling plane->pixel_read() pixel-by-pixel. The reading vfunc should be called with many pixels at a time when the source FB layout allows it. 
The whole point of the line-based functions was that they repeat the
innermost loop in every function body to make the per-pixel overhead as
small as possible. The VKMS implementations benchmarked before and
after the original line-based algorithm showed that calling a function
pointer per pixel is relatively very expensive. Or maybe it was a
switch-case.

Sorry, I didn't realize the optimization had already been lost.

Btw. I'd suggest renaming vkms_compose_row() to vkms_fetch_row() since
it's not composing anything and the name misled me.

I think if you inspect the compositing code as of revision
8356b97906503a02125c8d03c9b88a61ea46a05a you'll get a better feeling of
what it was supposed to be.

Thanks,
pq
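[Editor's note: the "chaining function pointers" idea above — a heavier
read_line() for rotated planes that fetches pixels one at a time
through the format's own fast reader — could look roughly like this.
All names are hypothetical; this is a sketch of the design under
discussion, not VKMS code.]

```c
#include <assert.h>
#include <stdint.h>

struct pixel_argb_u16 { uint16_t a, r, g, b; };

typedef void (*read_line_t)(const uint8_t *src, int count,
			    struct pixel_argb_u16 *out);

/* Plain, unrotated XRGB8888 reader with its tight inner loop. */
static void XRGB8888_read_line(const uint8_t *src, int count,
			       struct pixel_argb_u16 *out)
{
	for (int i = 0; i < count; i++) {
		out[i].a = 0xffff;
		out[i].r = src[4 * i + 2] * 257;
		out[i].g = src[4 * i + 1] * 257;
		out[i].b = src[4 * i + 0] * 257;
	}
}

/* For a 90-degree rotation, one output line corresponds to one source
 * column: walk the column and call the format reader with count == 1.
 * Slower, but only paid on the rotated path. */
static void read_line_rotated_90(read_line_t fmt_read, const uint8_t *fb,
				 int pitch, int cpp, int src_x, int count,
				 struct pixel_argb_u16 *out)
{
	for (int y = 0; y < count; y++)
		fmt_read(fb + y * pitch + src_x * cpp, 1, &out[y]);
}
```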
Hi Pekka, pekka.paalanen@haloniitty.fi wrote on Fri, 2 Feb 2024 17:49:13 +0200: > On Fri, 2 Feb 2024 13:13:22 +0100 > Miquel Raynal <miquel.raynal@bootlin.com> wrote: > > > Hello Maxime, > > > > + Arthur > > > > mripard@kernel.org wrote on Fri, 2 Feb 2024 10:53:37 +0100: > > > > > Hi Miquel, > > > > > > On Fri, Feb 02, 2024 at 10:26:01AM +0100, Miquel Raynal wrote: > > > > pekka.paalanen@haloniitty.fi wrote on Fri, 2 Feb 2024 10:55:22 +0200: > > > > > > > > > On Thu, 01 Feb 2024 18:31:32 +0100 > > > > > Louis Chauvet <louis.chauvet@bootlin.com> wrote: > > > > > > > > > > > Change the composition algorithm to iterate over pixels instead of lines. > > > > > > It allows a simpler management of rotation and pixel access for complex formats. > > > > > > > > > > > > This new algorithm allows read_pixel function to have access to x/y > > > > > > coordinates and make it possible to read the correct thing in a block > > > > > > when block_w and block_h are not 1. > > > > > > The iteration pixel-by-pixel in the same method also allows a simpler > > > > > > management of rotation with drm_rect_* helpers. This way it's not needed > > > > > > anymore to have misterious switch-case distributed in multiple places. > > > > > > > > > > Hi, > > > > > > > > > > there was a very good reason to write this code using lines: > > > > > performance. Before lines, it was indeed operating on individual pixels. > > > > > > > > > > Please, include performance measurements before and after this series > > > > > to quantify the impact on the previously already supported pixel > > > > > formats, particularly the 32-bit-per-pixel RGB variants. > > > > > > > > > > VKMS will be used more and more in CI for userspace projects, and > > > > > performance actually matters there. > > > > > > > > > > I'm worrying that this performance degradation here is significant. 
I > > > > > believe it is possible to keep blending with lines, if you add new line > > > > > getters for reading from rotated, sub-sampled etc. images. That way you > > > > > don't have to regress the most common formats' performance. > > > > > > > > While I understand performance is important and should be taken into > > > > account seriously, I cannot understand how broken testing could be > > > > considered better. Fast but inaccurate will always be significantly > > > > less attractive to my eyes. > > > > > > AFAIK, neither the cover letter nor the commit log claimed it was fixing > > > something broken, just that it was "better" (according to what > > > criteria?). > > > > Better is probably too vague and I agree the "fixing" part is not > > clearly explained in the commit log. The cover-letter however states: > > > > > Patch 2/2: This patch is more complex. My main target was to solve issues > > > I found in [1], but as it was very complex to do it "in place", I choose > > > to rework the composition function. > > ... > > > [1]: https://lore.kernel.org/dri-devel/20240110-vkms-yuv-v2-0-952fcaa5a193@riseup.net/ > > > > If you follow this link you will find all the feedback and especially > > the "broken" parts. Just to be clear, writing bugs is totally expected > > and review/testing is supposed to help on this regard. I am not blaming > > the author in any way, just focusing on getting this code in a more > > readable shape and hopefully reinforce the testing procedure. > > > > > If something is truly broken, it must be stated what exactly is so we > > > can all come up with a solution that will satisfy everyone. > > > > Maybe going through the series pointed above will give more context > > but AFAIU: the YUV composition is not totally right (and the tests used > > to validate it need to be more complex as well in order to fail). > > > > Here is a proposal. 
> > > > Today's RGB implementation is only optimized in the line-by-line case > > when there is no rotation. The logic is bit convoluted and may possibly > > be slightly clarified with a per-format read_line() implementation, > > at a very light performance cost. Such an improvement would definitely > > benefit to the clarity of the code, especially when transformations > > (especially the rotations) come into play because they would be clearly > > handled differently instead of being "hidden" in the optimized logic. > > Performances would not change much as this path is not optimized today > > anyway (the pixel-oriented logic is already used in the rotation case). > > > > Arthur's YUV implementation is indeed well optimized but the added > > complexity probably lead to small mistakes in the logic. The > > per-format read_line() implementation mentioned above could be > > extended to the YUV format as well, which would leverage Arthur's > > proposal by re-using his optimized version. Louis will help on this > > regard. However, for more complex cases such as when there is a > > rotation, it will be easier (and not sub-optimized compared to the RGB > > case) to also fallback to a pixel-oriented processing. > > > > Would this approach make sense? > > Hi, > > I think it would, if I understand what you mean. Ever since I proposed > a line-by-line algorithm to improve the performance, I was thinking of > per-format read_line() functions that would be selected outside of any > loops. Extending that to support YUV is only natural. I can imagine > rotation complicates things, and I won't oppose that resulting in a > much heavier read_line() implementation used in those cases. They might > perhaps call the original read_line() implementations pixel-by-pixel or > plane-by-plane (i.e. YUV planes) per pixel. Chroma-siting complicates > things even further. That way one could compose any > rotation-format-siting combination by chaining function pointers. 
I'll let Louis also validate but on my side I feel like I totally agree with your feedback. > I haven't looked at VKMS in a long time, and I am disappointed to find > that vkms_compose_row() is calling plane->pixel_read() pixel-by-pixel. > The reading vfunc should be called with many pixels at a time when the > source FB layout allows it. The whole point of the line-based functions > was that they repeat the innermost loop in every function body to make > the per-pixel overhead as small as possible. The VKMS implementations > benchmarked before and after the original line-based algorithm showed > that calling a function pointer per-pixel is relatively very expensive. > Or maybe it was a switch-case. Indeed, since your initial feedback Louis made a couple of comparisons and the time penalty is between +5% and +60% depending on the case, AFAIR. > Sorry, I didn't realize the optimization had already been lost. No problem, actually I also lost myself in my first answer as I initially thought the (mainline) RGB logic was also broken in edge cases, which it was not, only the YUV logic suffered from some limitations. > Btw. I'd suggest renaming vkms_compose_row() to vkms_fetch_row() since > it's not composing anything and the name mislead me. Makes sense. > I think if you inspect the compositing code as of revision > 8356b97906503a02125c8d03c9b88a61ea46a05a you'll get a better feeling of > what it was supposed to be. Excellent, thanks a lot! Miquèl
On Fri, 2 Feb 2024 17:07:34 +0100
Miquel Raynal <miquel.raynal@bootlin.com> wrote:

> Hi Pekka,

Hi Miquel,

I'm happy to see no hard feelings below. I know I may be blunt at
times.

Another thing coming to my mind is that I suppose there is no agreed
standard benchmark for VKMS performance, is there? I recall people
running selected IGT tests in a loop in the past, and I worry that the
IGT start-up overhead and small plane dimensions might skew the
results.

Would it be possible to have a standardised benchmark specifically for
performance rather than correctness, in IGT or wherever it would make
sense? Then it would be simple to tell contributors to run this and
report the numbers before and after.

I would propose this kind of KMS layout:

- CRTC size 3841 x 2161
- primary plane, XRGB8888, 3639 x 2161 @ 101,0
- overlay A, XBGR2101010, 3033 x 1777 @ 201,199
- overlay B, ARGB8888, 1507 x 1400 @ 1800,250

The sizes and positions are deliberately odd to try to avoid happy
alignment accidents. The planes are big, which should let the pixel
operations easily dominate the performance measurement. There are
different pixel formats, both opaque and semi-transparent. There is
lots of plane overlap. The planes also do not cover the whole CRTC,
leaving the background visible a bit.

There should be two FBs per plane, flipped alternately each frame.
Writeback should be active.

Run this for a number of frames, say, 100, and measure the kernel CPU
time taken. It's supposed to take at least several seconds in total.

I think something like this should be the base benchmark. One can add
more to it, like rotated planes, YUV planes, etc. or switch settings on
the existing planes. Maybe even FB_DAMAGE_CLIPS. Maybe one more overlay
that is very tall and thin.

Just an idea, what do you all think?
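[Editor's note: to give a feel for the workload size, the proposed
layout can be written down as data. Summing the plane fetches and the
CRTC writeback gives about 23.7 million pixel operations per frame, so
a 100-frame run touches a couple of billion pixels. The struct below is
purely illustrative — it is not an IGT or VKMS interface.]

```c
#include <assert.h>
#include <stdint.h>

/* The benchmark layout from the mail above, as data. */
struct bench_plane {
	const char *format;
	int w, h;   /* plane size */
	int x, y;   /* position on the CRTC */
};

static const struct bench_plane planes[] = {
	{ "XRGB8888",    3639, 2161,  101,   0 }, /* primary   */
	{ "XBGR2101010", 3033, 1777,  201, 199 }, /* overlay A */
	{ "ARGB8888",    1507, 1400, 1800, 250 }, /* overlay B */
};

/* Rough per-frame workload: every plane pixel is fetched and blended,
 * and every CRTC pixel is written back (3841 x 2161 mode). */
static long long bench_pixels_per_frame(void)
{
	long long total = 3841LL * 2161; /* writeback */

	for (unsigned int i = 0; i < sizeof(planes) / sizeof(planes[0]); i++)
		total += (long long)planes[i].w * planes[i].h;
	return total;
}
```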
Thanks, pq > pekka.paalanen@haloniitty.fi wrote on Fri, 2 Feb 2024 17:49:13 +0200: > > > On Fri, 2 Feb 2024 13:13:22 +0100 > > Miquel Raynal <miquel.raynal@bootlin.com> wrote: > > > > > Hello Maxime, > > > > > > + Arthur > > > > > > mripard@kernel.org wrote on Fri, 2 Feb 2024 10:53:37 +0100: > > > > > > > Hi Miquel, > > > > > > > > On Fri, Feb 02, 2024 at 10:26:01AM +0100, Miquel Raynal wrote: > > > > > pekka.paalanen@haloniitty.fi wrote on Fri, 2 Feb 2024 10:55:22 +0200: > > > > > > > > > > > On Thu, 01 Feb 2024 18:31:32 +0100 > > > > > > Louis Chauvet <louis.chauvet@bootlin.com> wrote: > > > > > > > > > > > > > Change the composition algorithm to iterate over pixels instead of lines. > > > > > > > It allows a simpler management of rotation and pixel access for complex formats. > > > > > > > > > > > > > > This new algorithm allows read_pixel function to have access to x/y > > > > > > > coordinates and make it possible to read the correct thing in a block > > > > > > > when block_w and block_h are not 1. > > > > > > > The iteration pixel-by-pixel in the same method also allows a simpler > > > > > > > management of rotation with drm_rect_* helpers. This way it's not needed > > > > > > > anymore to have misterious switch-case distributed in multiple places. > > > > > > > > > > > > Hi, > > > > > > > > > > > > there was a very good reason to write this code using lines: > > > > > > performance. Before lines, it was indeed operating on individual pixels. > > > > > > > > > > > > Please, include performance measurements before and after this series > > > > > > to quantify the impact on the previously already supported pixel > > > > > > formats, particularly the 32-bit-per-pixel RGB variants. > > > > > > > > > > > > VKMS will be used more and more in CI for userspace projects, and > > > > > > performance actually matters there. > > > > > > > > > > > > I'm worrying that this performance degradation here is significant. 
I > > > > > > believe it is possible to keep blending with lines, if you add new line > > > > > > getters for reading from rotated, sub-sampled etc. images. That way you > > > > > > don't have to regress the most common formats' performance. > > > > > > > > > > While I understand performance is important and should be taken into > > > > > account seriously, I cannot understand how broken testing could be > > > > > considered better. Fast but inaccurate will always be significantly > > > > > less attractive to my eyes. > > > > > > > > AFAIK, neither the cover letter nor the commit log claimed it was fixing > > > > something broken, just that it was "better" (according to what > > > > criteria?). > > > > > > Better is probably too vague and I agree the "fixing" part is not > > > clearly explained in the commit log. The cover-letter however states: > > > > > > > Patch 2/2: This patch is more complex. My main target was to solve issues > > > > I found in [1], but as it was very complex to do it "in place", I choose > > > > to rework the composition function. > > > ... > > > > [1]: https://lore.kernel.org/dri-devel/20240110-vkms-yuv-v2-0-952fcaa5a193@riseup.net/ > > > > > > If you follow this link you will find all the feedback and especially > > > the "broken" parts. Just to be clear, writing bugs is totally expected > > > and review/testing is supposed to help on this regard. I am not blaming > > > the author in any way, just focusing on getting this code in a more > > > readable shape and hopefully reinforce the testing procedure. > > > > > > > If something is truly broken, it must be stated what exactly is so we > > > > can all come up with a solution that will satisfy everyone. > > > > > > Maybe going through the series pointed above will give more context > > > but AFAIU: the YUV composition is not totally right (and the tests used > > > to validate it need to be more complex as well in order to fail). > > > > > > Here is a proposal. 
> > > > > > Today's RGB implementation is only optimized in the line-by-line case > > > when there is no rotation. The logic is bit convoluted and may possibly > > > be slightly clarified with a per-format read_line() implementation, > > > at a very light performance cost. Such an improvement would definitely > > > benefit to the clarity of the code, especially when transformations > > > (especially the rotations) come into play because they would be clearly > > > handled differently instead of being "hidden" in the optimized logic. > > > Performances would not change much as this path is not optimized today > > > anyway (the pixel-oriented logic is already used in the rotation case). > > > > > > Arthur's YUV implementation is indeed well optimized but the added > > > complexity probably lead to small mistakes in the logic. The > > > per-format read_line() implementation mentioned above could be > > > extended to the YUV format as well, which would leverage Arthur's > > > proposal by re-using his optimized version. Louis will help on this > > > regard. However, for more complex cases such as when there is a > > > rotation, it will be easier (and not sub-optimized compared to the RGB > > > case) to also fallback to a pixel-oriented processing. > > > > > > Would this approach make sense? > > > > Hi, > > > > I think it would, if I understand what you mean. Ever since I proposed > > a line-by-line algorithm to improve the performance, I was thinking of > > per-format read_line() functions that would be selected outside of any > > loops. Extending that to support YUV is only natural. I can imagine > > rotation complicates things, and I won't oppose that resulting in a > > much heavier read_line() implementation used in those cases. They might > > perhaps call the original read_line() implementations pixel-by-pixel or > > plane-by-plane (i.e. YUV planes) per pixel. Chroma-siting complicates > > things even further. 
That way one could compose any > > rotation-format-siting combination by chaining function pointers. > > I'll let Louis also validate but on my side I feel like I totally > agree with your feedback. > > > I haven't looked at VKMS in a long time, and I am disappointed to find > > that vkms_compose_row() is calling plane->pixel_read() pixel-by-pixel. > > The reading vfunc should be called with many pixels at a time when the > > source FB layout allows it. The whole point of the line-based functions > > was that they repeat the innermost loop in every function body to make > > the per-pixel overhead as small as possible. The VKMS implementations > > benchmarked before and after the original line-based algorithm showed > > that calling a function pointer per-pixel is relatively very expensive. > > Or maybe it was a switch-case. > > Indeed, since your initial feedback Louis made a couple of comparisons > and the time penalty is between +5% and +60% depending on the case, > AFAIR. > > > Sorry, I didn't realize the optimization had already been lost. > > No problem, actually I also lost myself in my first answer as I > initially thought the (mainline) RGB logic was also broken in edge > cases, which it was not, only the YUV logic suffered from some > limitations. > > > Btw. I'd suggest renaming vkms_compose_row() to vkms_fetch_row() since > > it's not composing anything and the name mislead me. > > Makes sense. > > > I think if you inspect the compositing code as of revision > > 8356b97906503a02125c8d03c9b88a61ea46a05a you'll get a better feeling of > > what it was supposed to be. > > Excellent, thanks a lot! > > Miquèl
On 02/02/24 12:49, Pekka Paalanen wrote: > On Fri, 2 Feb 2024 13:13:22 +0100 > Miquel Raynal <miquel.raynal@bootlin.com> wrote: > >> Hello Maxime, >> >> + Arthur >> >> mripard@kernel.org wrote on Fri, 2 Feb 2024 10:53:37 +0100: >> >>> Hi Miquel, >>> >>> On Fri, Feb 02, 2024 at 10:26:01AM +0100, Miquel Raynal wrote: >>>> pekka.paalanen@haloniitty.fi wrote on Fri, 2 Feb 2024 10:55:22 +0200: >>>> >>>>> On Thu, 01 Feb 2024 18:31:32 +0100 >>>>> Louis Chauvet <louis.chauvet@bootlin.com> wrote: >>>>> >>>>>> Change the composition algorithm to iterate over pixels instead of lines. >>>>>> It allows a simpler management of rotation and pixel access for complex formats. >>>>>> >>>>>> This new algorithm allows read_pixel function to have access to x/y >>>>>> coordinates and make it possible to read the correct thing in a block >>>>>> when block_w and block_h are not 1. >>>>>> The iteration pixel-by-pixel in the same method also allows a simpler >>>>>> management of rotation with drm_rect_* helpers. This way it's not needed >>>>>> anymore to have misterious switch-case distributed in multiple places. >>>>> >>>>> Hi, >>>>> >>>>> there was a very good reason to write this code using lines: >>>>> performance. Before lines, it was indeed operating on individual pixels. >>>>> >>>>> Please, include performance measurements before and after this series >>>>> to quantify the impact on the previously already supported pixel >>>>> formats, particularly the 32-bit-per-pixel RGB variants. >>>>> >>>>> VKMS will be used more and more in CI for userspace projects, and >>>>> performance actually matters there. >>>>> >>>>> I'm worrying that this performance degradation here is significant. I >>>>> believe it is possible to keep blending with lines, if you add new line >>>>> getters for reading from rotated, sub-sampled etc. images. That way you >>>>> don't have to regress the most common formats' performance. 
>>>> >>>> While I understand performance is important and should be taken into >>>> account seriously, I cannot understand how broken testing could be >>>> considered better. Fast but inaccurate will always be significantly >>>> less attractive to my eyes. >>> >>> AFAIK, neither the cover letter nor the commit log claimed it was fixing >>> something broken, just that it was "better" (according to what >>> criteria?). >> >> Better is probably too vague and I agree the "fixing" part is not >> clearly explained in the commit log. The cover-letter however states: >> >>> Patch 2/2: This patch is more complex. My main target was to solve issues >>> I found in [1], but as it was very complex to do it "in place", I choose >>> to rework the composition function. >> ... >>> [1]: https://lore.kernel.org/dri-devel/20240110-vkms-yuv-v2-0-952fcaa5a193@riseup.net/ >> >> If you follow this link you will find all the feedback and especially >> the "broken" parts. Just to be clear, writing bugs is totally expected >> and review/testing is supposed to help on this regard. I am not blaming >> the author in any way, just focusing on getting this code in a more >> readable shape and hopefully reinforce the testing procedure. >> >>> If something is truly broken, it must be stated what exactly is so we >>> can all come up with a solution that will satisfy everyone. >> >> Maybe going through the series pointed above will give more context >> but AFAIU: the YUV composition is not totally right (and the tests used >> to validate it need to be more complex as well in order to fail). >> >> Here is a proposal. >> >> Today's RGB implementation is only optimized in the line-by-line case >> when there is no rotation. The logic is bit convoluted and may possibly >> be slightly clarified with a per-format read_line() implementation, >> at a very light performance cost. 
Such an improvement would definitely >> benefit to the clarity of the code, especially when transformations >> (especially the rotations) come into play because they would be clearly >> handled differently instead of being "hidden" in the optimized logic. >> Performances would not change much as this path is not optimized today >> anyway (the pixel-oriented logic is already used in the rotation case). >> >> Arthur's YUV implementation is indeed well optimized but the added >> complexity probably lead to small mistakes in the logic. The >> per-format read_line() implementation mentioned above could be >> extended to the YUV format as well, which would leverage Arthur's >> proposal by re-using his optimized version. Louis will help on this >> regard. However, for more complex cases such as when there is a >> rotation, it will be easier (and not sub-optimized compared to the RGB >> case) to also fallback to a pixel-oriented processing. >> >> Would this approach make sense? > > Hi, > > I think it would, if I understand what you mean. Ever since I proposed > a line-by-line algorithm to improve the performance, I was thinking of > per-format read_line() functions that would be selected outside of any > loops. Extending that to support YUV is only natural. I can imagine > rotation complicates things, and I won't oppose that resulting in a > much heavier read_line() implementation used in those cases. They might > perhaps call the original read_line() implementations pixel-by-pixel or > plane-by-plane (i.e. YUV planes) per pixel. Chroma-siting complicates > things even further. That way one could compose any > rotation-format-siting combination by chaining function pointers. > > I haven't looked at VKMS in a long time, and I am disappointed to find > that vkms_compose_row() is calling plane->pixel_read() pixel-by-pixel. > The reading vfunc should be called with many pixels at a time when the > source FB layout allows it. 
The whole point of the line-based functions > was that they repeat the innermost loop in every function body to make > the per-pixel overhead as small as possible. The VKMS implementations > benchmarked before and after the original line-based algorithm showed > that calling a function pointer per-pixel is relatively very expensive. > Or maybe it was a switch-case. Hi, I think I'm the culprit for that, as stated in [1]. My intention with the suggestion was to remove some code repetition and to facilitate the rotation support implementation. Going back, I think I was too high on DRY at the time and didn't worry about optimization, which was a mistake. But I agree with Miquel that the rotation logic is easier to implement in a pixel-based way. So going pixel-by-pixel only when rotation occurs would be great. Best Regards, ~Arthur Grillo [1]: https://lore.kernel.org/dri-devel/20230418130525.128733-2-mcanal@igalia.com/ > > Sorry, I didn't realize the optimization had already been lost. > > Btw. I'd suggest renaming vkms_compose_row() to vkms_fetch_row() since > it's not composing anything and the name mislead me. > > I think if you inspect the compositing code as of revision > 8356b97906503a02125c8d03c9b88a61ea46a05a you'll get a better feeling of > what it was supposed to be. > > > Thanks, > pq
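[Editor's note: the "pixel-by-pixel only when rotation occurs" idea can be sketched as a wrapper line-getter that stays inside the line-based framework. Unrotated planes would call the format reader once per row; the rotated path below calls it per pixel at the transformed coordinate. This is a hypothetical user-space illustration assuming a 4-bytes-per-pixel format and one possible clockwise 90° mapping, not the VKMS rotation code.]

```c
#include <assert.h>
#include <stdint.h>

struct pixel_argb_u16 { uint16_t a, r, g, b; };

typedef void (*read_line_t)(const uint8_t *src, int count,
			    struct pixel_argb_u16 *out);

static void XRGB8888_read_line(const uint8_t *src, int count,
			       struct pixel_argb_u16 *out)
{
	for (int x = 0; x < count; x++) {
		out[x].a = 0xffff;
		out[x].r = src[4 * x + 2] * 257;
		out[x].g = src[4 * x + 1] * 257;
		out[x].b = src[4 * x + 0] * 257;
	}
}

/* Plug-in getter for a 90°-rotated plane: one output row maps to a
 * source column, so the format reader is invoked per pixel. Unrotated
 * planes skip this wrapper and read whole rows at line speed. */
static void read_line_rotated_90(const uint8_t *fb, int pitch, int src_h,
				 int y, int count, read_line_t fmt_read,
				 struct pixel_argb_u16 *out)
{
	for (int x = 0; x < count; x++) {
		/* Clockwise 90°: out(x, y) samples src row (src_h - 1 - x),
		 * column y; "* 4" assumes 4 bytes per pixel. */
		const uint8_t *px = fb + (src_h - 1 - x) * pitch + y * 4;

		fmt_read(px, 1, &out[x]);
	}
}
```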
On Fri, 2 Feb 2024 17:02:20 -0300 Arthur Grillo <arthurgrillo@riseup.net> wrote: > On 02/02/24 12:49, Pekka Paalanen wrote: > > On Fri, 2 Feb 2024 13:13:22 +0100 > > Miquel Raynal <miquel.raynal@bootlin.com> wrote: > > > >> Hello Maxime, > >> > >> + Arthur > >> > >> mripard@kernel.org wrote on Fri, 2 Feb 2024 10:53:37 +0100: > >> > >>> Hi Miquel, > >>> > >>> On Fri, Feb 02, 2024 at 10:26:01AM +0100, Miquel Raynal wrote: > >>>> pekka.paalanen@haloniitty.fi wrote on Fri, 2 Feb 2024 10:55:22 +0200: > >>>> > >>>>> On Thu, 01 Feb 2024 18:31:32 +0100 > >>>>> Louis Chauvet <louis.chauvet@bootlin.com> wrote: > >>>>> > >>>>>> Change the composition algorithm to iterate over pixels instead of lines. > >>>>>> It allows a simpler management of rotation and pixel access for complex formats. > >>>>>> > >>>>>> This new algorithm allows read_pixel function to have access to x/y > >>>>>> coordinates and make it possible to read the correct thing in a block > >>>>>> when block_w and block_h are not 1. > >>>>>> The iteration pixel-by-pixel in the same method also allows a simpler > >>>>>> management of rotation with drm_rect_* helpers. This way it's not needed > >>>>>> anymore to have misterious switch-case distributed in multiple places. > >>>>> > >>>>> Hi, > >>>>> > >>>>> there was a very good reason to write this code using lines: > >>>>> performance. Before lines, it was indeed operating on individual pixels. > >>>>> > >>>>> Please, include performance measurements before and after this series > >>>>> to quantify the impact on the previously already supported pixel > >>>>> formats, particularly the 32-bit-per-pixel RGB variants. > >>>>> > >>>>> VKMS will be used more and more in CI for userspace projects, and > >>>>> performance actually matters there. > >>>>> > >>>>> I'm worrying that this performance degradation here is significant. I > >>>>> believe it is possible to keep blending with lines, if you add new line > >>>>> getters for reading from rotated, sub-sampled etc. 
images. That way you > >>>>> don't have to regress the most common formats' performance. > >>>> > >>>> While I understand performance is important and should be taken into > >>>> account seriously, I cannot understand how broken testing could be > >>>> considered better. Fast but inaccurate will always be significantly > >>>> less attractive to my eyes. > >>> > >>> AFAIK, neither the cover letter nor the commit log claimed it was fixing > >>> something broken, just that it was "better" (according to what > >>> criteria?). > >> > >> Better is probably too vague and I agree the "fixing" part is not > >> clearly explained in the commit log. The cover-letter however states: > >> > >>> Patch 2/2: This patch is more complex. My main target was to solve issues > >>> I found in [1], but as it was very complex to do it "in place", I choose > >>> to rework the composition function. > >> ... > >>> [1]: https://lore.kernel.org/dri-devel/20240110-vkms-yuv-v2-0-952fcaa5a193@riseup.net/ > >> > >> If you follow this link you will find all the feedback and especially > >> the "broken" parts. Just to be clear, writing bugs is totally expected > >> and review/testing is supposed to help on this regard. I am not blaming > >> the author in any way, just focusing on getting this code in a more > >> readable shape and hopefully reinforce the testing procedure. > >> > >>> If something is truly broken, it must be stated what exactly is so we > >>> can all come up with a solution that will satisfy everyone. > >> > >> Maybe going through the series pointed above will give more context > >> but AFAIU: the YUV composition is not totally right (and the tests used > >> to validate it need to be more complex as well in order to fail). > >> > >> Here is a proposal. > >> > >> Today's RGB implementation is only optimized in the line-by-line case > >> when there is no rotation. 
The logic is bit convoluted and may possibly > >> be slightly clarified with a per-format read_line() implementation, > >> at a very light performance cost. Such an improvement would definitely > >> benefit to the clarity of the code, especially when transformations > >> (especially the rotations) come into play because they would be clearly > >> handled differently instead of being "hidden" in the optimized logic. > >> Performances would not change much as this path is not optimized today > >> anyway (the pixel-oriented logic is already used in the rotation case). > >> > >> Arthur's YUV implementation is indeed well optimized but the added > >> complexity probably lead to small mistakes in the logic. The > >> per-format read_line() implementation mentioned above could be > >> extended to the YUV format as well, which would leverage Arthur's > >> proposal by re-using his optimized version. Louis will help on this > >> regard. However, for more complex cases such as when there is a > >> rotation, it will be easier (and not sub-optimized compared to the RGB > >> case) to also fallback to a pixel-oriented processing. > >> > >> Would this approach make sense? > > > > Hi, > > > > I think it would, if I understand what you mean. Ever since I proposed > > a line-by-line algorithm to improve the performance, I was thinking of > > per-format read_line() functions that would be selected outside of any > > loops. Extending that to support YUV is only natural. I can imagine > > rotation complicates things, and I won't oppose that resulting in a > > much heavier read_line() implementation used in those cases. They might > > perhaps call the original read_line() implementations pixel-by-pixel or > > plane-by-plane (i.e. YUV planes) per pixel. Chroma-siting complicates > > things even further. That way one could compose any > > rotation-format-siting combination by chaining function pointers. 
> > > > I haven't looked at VKMS in a long time, and I am disappointed to find > > that vkms_compose_row() is calling plane->pixel_read() pixel-by-pixel. > > The reading vfunc should be called with many pixels at a time when the > > source FB layout allows it. The whole point of the line-based functions > > was that they repeat the innermost loop in every function body to make > > the per-pixel overhead as small as possible. The VKMS implementations > > benchmarked before and after the original line-based algorithm showed > > that calling a function pointer per-pixel is relatively very expensive. > > Or maybe it was a switch-case. > > Hi, > > I think I'm the culprit for that, as stated on [1]. My intention with > the suggestion was to remove some code repetition and too facilitate the > rotation support implementation. Going back, I think I was to high on > DRY at the time and didn't worry about optimization, which was a > mistake. Hi Arthur, no worries. We can also blame reviewers for not minding benchmarks. ;-) > But, I agree with Miquel that the rotation logic is easier to implement > in a pixel-based way. So going pixel-by-pixel only when rotation occurs > would be great. Yes, and I think that can very well still be done in the line-based framework that existed in the old days before any rotation support was added. Essentially a plug-in line-getter function that then calls a format-specific line-getter pixel-by-pixel while applying the rotation.
Line-getters working on planar YUV FBs might delegate Y, U, V, or UV/VU reading to R8 and/or RG88 line or pixel reader functions. They might also return block-height lines instead of one line at a time. However, I wouldn't commit to any approach without benchmarking alternatives. The performance comparison must justify the code complexity, like it was seen with the initial line-based implementation. A good benchmark is key, IMO. Thanks, pq > > Best Regards, > ~Arthur Grillo > > [1]: https://lore.kernel.org/dri-devel/20230418130525.128733-2-mcanal@igalia.com/ > > > > > Sorry, I didn't realize the optimization had already been lost. > > > > Btw. I'd suggest renaming vkms_compose_row() to vkms_fetch_row() since > > it's not composing anything and the name mislead me. > > > > I think if you inspect the compositing code as of revision > > 8356b97906503a02125c8d03c9b88a61ea46a05a you'll get a better feeling of > > what it was supposed to be. > > > > > > Thanks, > > pq
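[Editor's note: the delegation idea for planar YUV framebuffers can be sketched as below: a hypothetical NV12 line-getter hands the Y plane to an R8 sub-reader and walks the interleaved, 2x2-subsampled UV plane itself. Names and layout handling are illustrative assumptions; no YUV-to-RGB conversion is attempted here, and benchmarking would still be needed as the message says.]

```c
#include <assert.h>
#include <stdint.h>

struct pixel_yuv_u8 { uint8_t y, u, v; };

/* R8 sub-reader: one byte per pixel (reused here for the NV12 Y plane). */
static void R8_read_line(const uint8_t *src, int count, uint8_t *out)
{
	for (int x = 0; x < count; x++)
		out[x] = src[x];
}

/* Hypothetical NV12 line-getter: luma comes from the R8 reader, chroma
 * from the half-resolution interleaved UV plane. Conversion to RGB
 * would follow in a later stage and is omitted here. */
static void NV12_read_line(const uint8_t *y_plane, int y_pitch,
			   const uint8_t *uv_plane, int uv_pitch,
			   int y0, int count, struct pixel_yuv_u8 *out)
{
	uint8_t luma[1024];	/* enough for this sketch's line widths */

	R8_read_line(y_plane + y0 * y_pitch, count, luma);
	for (int x = 0; x < count; x++) {
		/* 2x2 subsampling: one UV pair covers a 2x2 luma block. */
		const uint8_t *uv = uv_plane + (y0 / 2) * uv_pitch + (x / 2) * 2;

		out[x].y = luma[x];
		out[x].u = uv[0];
		out[x].v = uv[1];
	}
}
```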
On Mon, 5 Feb 2024 12:12:04 +0200 Pekka Paalanen <pekka.paalanen@haloniitty.fi> wrote: > On Fri, 2 Feb 2024 17:02:20 -0300 > Arthur Grillo <arthurgrillo@riseup.net> wrote: > > > On 02/02/24 12:49, Pekka Paalanen wrote: > > > On Fri, 2 Feb 2024 13:13:22 +0100 > > > Miquel Raynal <miquel.raynal@bootlin.com> wrote: > > > > > >> Hello Maxime, > > >> > > >> + Arthur > > >> > > >> mripard@kernel.org wrote on Fri, 2 Feb 2024 10:53:37 +0100: > > >> > > >>> Hi Miquel, > > >>> > > >>> On Fri, Feb 02, 2024 at 10:26:01AM +0100, Miquel Raynal wrote: > > >>>> pekka.paalanen@haloniitty.fi wrote on Fri, 2 Feb 2024 10:55:22 +0200: > > >>>> > > >>>>> On Thu, 01 Feb 2024 18:31:32 +0100 > > >>>>> Louis Chauvet <louis.chauvet@bootlin.com> wrote: > > >>>>> > > >>>>>> Change the composition algorithm to iterate over pixels instead of lines. > > >>>>>> It allows a simpler management of rotation and pixel access for complex formats. > > >>>>>> > > >>>>>> This new algorithm allows read_pixel function to have access to x/y > > >>>>>> coordinates and make it possible to read the correct thing in a block > > >>>>>> when block_w and block_h are not 1. > > >>>>>> The iteration pixel-by-pixel in the same method also allows a simpler > > >>>>>> management of rotation with drm_rect_* helpers. This way it's not needed > > >>>>>> anymore to have misterious switch-case distributed in multiple places. > > >>>>> > > >>>>> Hi, > > >>>>> > > >>>>> there was a very good reason to write this code using lines: > > >>>>> performance. Before lines, it was indeed operating on individual pixels. > > >>>>> > > >>>>> Please, include performance measurements before and after this series > > >>>>> to quantify the impact on the previously already supported pixel > > >>>>> formats, particularly the 32-bit-per-pixel RGB variants. > > >>>>> > > >>>>> VKMS will be used more and more in CI for userspace projects, and > > >>>>> performance actually matters there. 
> > >>>>> > > >>>>> I'm worrying that this performance degradation here is significant. I > > >>>>> believe it is possible to keep blending with lines, if you add new line > > >>>>> getters for reading from rotated, sub-sampled etc. images. That way you > > >>>>> don't have to regress the most common formats' performance. > > >>>> > > >>>> While I understand performance is important and should be taken into > > >>>> account seriously, I cannot understand how broken testing could be > > >>>> considered better. Fast but inaccurate will always be significantly > > >>>> less attractive to my eyes. > > >>> > > >>> AFAIK, neither the cover letter nor the commit log claimed it was fixing > > >>> something broken, just that it was "better" (according to what > > >>> criteria?). > > >> > > >> Better is probably too vague and I agree the "fixing" part is not > > >> clearly explained in the commit log. The cover-letter however states: > > >> > > >>> Patch 2/2: This patch is more complex. My main target was to solve issues > > >>> I found in [1], but as it was very complex to do it "in place", I choose > > >>> to rework the composition function. > > >> ... > > >>> [1]: https://lore.kernel.org/dri-devel/20240110-vkms-yuv-v2-0-952fcaa5a193@riseup.net/ > > >> > > >> If you follow this link you will find all the feedback and especially > > >> the "broken" parts. Just to be clear, writing bugs is totally expected > > >> and review/testing is supposed to help on this regard. I am not blaming > > >> the author in any way, just focusing on getting this code in a more > > >> readable shape and hopefully reinforce the testing procedure. > > >> > > >>> If something is truly broken, it must be stated what exactly is so we > > >>> can all come up with a solution that will satisfy everyone. 
> > >> > > >> Maybe going through the series pointed above will give more context > > >> but AFAIU: the YUV composition is not totally right (and the tests used > > >> to validate it need to be more complex as well in order to fail). > > >> > > >> Here is a proposal. > > >> > > >> Today's RGB implementation is only optimized in the line-by-line case > > >> when there is no rotation. The logic is bit convoluted and may possibly > > >> be slightly clarified with a per-format read_line() implementation, > > >> at a very light performance cost. Such an improvement would definitely > > >> benefit to the clarity of the code, especially when transformations > > >> (especially the rotations) come into play because they would be clearly > > >> handled differently instead of being "hidden" in the optimized logic. > > >> Performances would not change much as this path is not optimized today > > >> anyway (the pixel-oriented logic is already used in the rotation case). > > >> > > >> Arthur's YUV implementation is indeed well optimized but the added > > >> complexity probably lead to small mistakes in the logic. The > > >> per-format read_line() implementation mentioned above could be > > >> extended to the YUV format as well, which would leverage Arthur's > > >> proposal by re-using his optimized version. Louis will help on this > > >> regard. However, for more complex cases such as when there is a > > >> rotation, it will be easier (and not sub-optimized compared to the RGB > > >> case) to also fallback to a pixel-oriented processing. > > >> > > >> Would this approach make sense? > > > > > > Hi, > > > > > > I think it would, if I understand what you mean. Ever since I proposed > > > a line-by-line algorithm to improve the performance, I was thinking of > > > per-format read_line() functions that would be selected outside of any > > > loops. Extending that to support YUV is only natural. 
I can imagine > > > rotation complicates things, and I won't oppose that resulting in a > > > much heavier read_line() implementation used in those cases. They might > > > perhaps call the original read_line() implementations pixel-by-pixel or > > > plane-by-plane (i.e. YUV planes) per pixel. Chroma-siting complicates > > > things even further. That way one could compose any > > > rotation-format-siting combination by chaining function pointers. > > > > > > I haven't looked at VKMS in a long time, and I am disappointed to find > > > that vkms_compose_row() is calling plane->pixel_read() pixel-by-pixel. > > > The reading vfunc should be called with many pixels at a time when the > > > source FB layout allows it. The whole point of the line-based functions > > > was that they repeat the innermost loop in every function body to make > > > the per-pixel overhead as small as possible. The VKMS implementations > > > benchmarked before and after the original line-based algorithm showed > > > that calling a function pointer per-pixel is relatively very expensive. > > > Or maybe it was a switch-case. > > > > Hi, > > > > I think I'm the culprit for that, as stated on [1]. My intention with > > the suggestion was to remove some code repetition and too facilitate the > > rotation support implementation. Going back, I think I was to high on > > DRY at the time and didn't worry about optimization, which was a > > mistake. > > Hi Arthur, > > no worries. We can also blame reviewers for not minding benchmarks. ;-) > > > But, I agree with Miquel that the rotation logic is easier to implement > > in a pixel-based way. So going pixel-by-pixel only when rotation occurs > > would be great. > > Yes, and I think that can very well be done in the line-based framework > still that existed in the old days before any rotation support was > added. Essentially a plug-in line-getter function that then calls a > format-specific line-getter pixel-by-pixel while applying the rotation. 
> It would be simple, it would leave unrotated performance unharmed (use > format-specific line-getter directly with lines), but it might be > somewhat less performant for rotated KMS planes. I suspect that might > be a good compromise. > > Format-specific line-getters could also be parameterized by > pixel-to-pixel offset in bytes. Then they could directly traverse FB > rows forward and backward, and even FB columns. It may or may not have > a penalty compared to the original line-getters, so it would have to > be benchmarked. Oh, actually, since the byte offset depends on format, it might be better to parametrize by direction and compute the offset in the format-specific line-getter before the loop. Thanks, pq > Line-getters working on planar YUV FBs might delegate Y, U, V, or UV/VU > reading to R8 and/or RG88 line or pixel reader functions. They might > also return block-height lines instead of one line at a time. However, > I wouldn't commit to any approach without benchmarking alternatives. > The performance comparison must justify the code complexity, like it > was seen with the initial line-based implementation. > > A good benchmark is key, IMO. > > > Thanks, > pq > > > > > Best Regards, > > ~Arthur Grillo > > > > [1]: https://lore.kernel.org/dri-devel/20230418130525.128733-2-mcanal@igalia.com/ > > > > > > > > Sorry, I didn't realize the optimization had already been lost. > > > > > > Btw. I'd suggest renaming vkms_compose_row() to vkms_fetch_row() since > > > it's not composing anything and the name mislead me. > > > > > > I think if you inspect the compositing code as of revision > > > 8356b97906503a02125c8d03c9b88a61ea46a05a you'll get a better feeling of > > > what it was supposed to be. > > > > > > > > > Thanks, > > > pq >
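[Editor's note: the direction parameterization suggested above can be sketched like this. The format-specific byte step (4 bytes for XRGB8888) is resolved once before the loop, so the same reader traverses a row forward or backward without per-pixel branching. Illustrative user-space code with hypothetical names.]

```c
#include <assert.h>
#include <stdint.h>

struct pixel_argb_u16 { uint16_t a, r, g, b; };

enum read_dir { READ_RIGHT, READ_LEFT };	/* could grow UP/DOWN for columns */

/* Direction-parameterized getter: the per-format offset is computed
 * before the loop, so reversed rows still run at line speed. `src`
 * points at the first pixel to read, i.e. the row's right edge when
 * reading leftwards. */
static void XRGB8888_read_line_dir(const uint8_t *src, int count,
				   enum read_dir dir,
				   struct pixel_argb_u16 *out)
{
	int step = (dir == READ_LEFT) ? -4 : 4;

	for (int x = 0; x < count; x++, src += step) {
		out[x].a = 0xffff;
		out[x].r = src[2] * 257;
		out[x].g = src[1] * 257;
		out[x].b = src[0] * 257;
	}
}
```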
On 02/02/24 16:45, Pekka Paalanen wrote: > On Fri, 2 Feb 2024 17:07:34 +0100 > Miquel Raynal <miquel.raynal@bootlin.com> wrote: > >> Hi Pekka, > > Hi Miquel, > > I'm happy to see no hard feelings below. I know I may be blunt at > times. > > Another thing coming to my mind is that I suppose there is no > agreed standard benchmark for VKMS performance, is there? > > I recall people running selected IGT tests in a loop in the past, > and I worry that the IGT start-up overhead and small plane > dimensions might skew the results. > > Would it be possible to have a standardised benchmark specifically > for performance rather than correctness, in IGT or where-ever it > would make sense? Then it would be simple to tell contributors to > run this and report the numbers before and after. > > I would propose this kind of KMS layout: > > - CRTC size 3841 x 2161 > - primary plane, XRGB8888, 3639 x 2161 @ 101,0 > - overlay A, XBGR2101010, 3033 x 1777 @ 201,199 > - overlay B, ARGB8888, 1507 x 1400 @ 1800,250 > > The sizes and positions are deliberately odd to try to avoid happy > alignment accidents. The planes are big, which should let the pixel > operations easily dominate performance measurement. There are > different pixel formats, both opaque and semi-transparent. There is > lots of plane overlap. The planes also do not cover the whole CRTC > leaving the background visible a bit. > > There should be two FBs per each plane, flipped alternatingly each > frame. Writeback should be active. Run this a number of frames, say, > 100, and measure the kernel CPU time taken. It's supposed to take at > least several seconds in total. > > I think something like this should be the base benchmark. One can > add more to it, like rotated planes, YUV planes, etc. or switch > settings on the existing planes. Maybe even FB_DAMAGE_CLIPS. Maybe > one more overlay that is very tall and thin. > > Just an idea, what do you all think? Hi Pekka, I just finished writing this proposal using IGT. 
I got pretty interesting results: The mentioned commit 8356b97906503a02125c8d03c9b88a61ea46a05a took around 13 seconds, while drm-misc/drm-misc-next took 36 seconds. I'm currently bisecting to be certain that the change to pixel-by-pixel composition is the culprit, but I don't see why it wouldn't be. I just need to do some final touches on the benchmark code and it will be ready for revision. Best Regards, ~Arthur Grillo > > > Thanks, > pq > >> pekka.paalanen@haloniitty.fi wrote on Fri, 2 Feb 2024 17:49:13 +0200: >> >>> On Fri, 2 Feb 2024 13:13:22 +0100 >>> Miquel Raynal <miquel.raynal@bootlin.com> wrote: >>> >>>> Hello Maxime, >>>> >>>> + Arthur >>>> >>>> mripard@kernel.org wrote on Fri, 2 Feb 2024 10:53:37 +0100: >>>> >>>>> Hi Miquel, >>>>> >>>>> On Fri, Feb 02, 2024 at 10:26:01AM +0100, Miquel Raynal wrote: >>>>>> pekka.paalanen@haloniitty.fi wrote on Fri, 2 Feb 2024 10:55:22 +0200: >>>>>> >>>>>>> On Thu, 01 Feb 2024 18:31:32 +0100 >>>>>>> Louis Chauvet <louis.chauvet@bootlin.com> wrote: >>>>>>> >>>>>>>> Change the composition algorithm to iterate over pixels instead of lines. >>>>>>>> It allows a simpler management of rotation and pixel access for complex formats. >>>>>>>> >>>>>>>> This new algorithm allows read_pixel function to have access to x/y >>>>>>>> coordinates and make it possible to read the correct thing in a block >>>>>>>> when block_w and block_h are not 1. >>>>>>>> The iteration pixel-by-pixel in the same method also allows a simpler >>>>>>>> management of rotation with drm_rect_* helpers. This way it's not needed >>>>>>>> anymore to have misterious switch-case distributed in multiple places. >>>>>>> >>>>>>> Hi, >>>>>>> >>>>>>> there was a very good reason to write this code using lines: >>>>>>> performance. Before lines, it was indeed operating on individual pixels. 
>>>>>>> >>>>>>> Please, include performance measurements before and after this series >>>>>>> to quantify the impact on the previously already supported pixel >>>>>>> formats, particularly the 32-bit-per-pixel RGB variants. >>>>>>> >>>>>>> VKMS will be used more and more in CI for userspace projects, and >>>>>>> performance actually matters there. >>>>>>> >>>>>>> I'm worrying that this performance degradation here is significant. I >>>>>>> believe it is possible to keep blending with lines, if you add new line >>>>>>> getters for reading from rotated, sub-sampled etc. images. That way you >>>>>>> don't have to regress the most common formats' performance. >>>>>> >>>>>> While I understand performance is important and should be taken into >>>>>> account seriously, I cannot understand how broken testing could be >>>>>> considered better. Fast but inaccurate will always be significantly >>>>>> less attractive to my eyes. >>>>> >>>>> AFAIK, neither the cover letter nor the commit log claimed it was fixing >>>>> something broken, just that it was "better" (according to what >>>>> criteria?). >>>> >>>> Better is probably too vague and I agree the "fixing" part is not >>>> clearly explained in the commit log. The cover-letter however states: >>>> >>>>> Patch 2/2: This patch is more complex. My main target was to solve issues >>>>> I found in [1], but as it was very complex to do it "in place", I choose >>>>> to rework the composition function. >>>> ... >>>>> [1]: https://lore.kernel.org/dri-devel/20240110-vkms-yuv-v2-0-952fcaa5a193@riseup.net/ >>>> >>>> If you follow this link you will find all the feedback and especially >>>> the "broken" parts. Just to be clear, writing bugs is totally expected >>>> and review/testing is supposed to help on this regard. I am not blaming >>>> the author in any way, just focusing on getting this code in a more >>>> readable shape and hopefully reinforce the testing procedure. 
>>>> >>>>> If something is truly broken, it must be stated what exactly is so we >>>>> can all come up with a solution that will satisfy everyone. >>>> >>>> Maybe going through the series pointed above will give more context >>>> but AFAIU: the YUV composition is not totally right (and the tests used >>>> to validate it need to be more complex as well in order to fail). >>>> >>>> Here is a proposal. >>>> >>>> Today's RGB implementation is only optimized in the line-by-line case >>>> when there is no rotation. The logic is bit convoluted and may possibly >>>> be slightly clarified with a per-format read_line() implementation, >>>> at a very light performance cost. Such an improvement would definitely >>>> benefit to the clarity of the code, especially when transformations >>>> (especially the rotations) come into play because they would be clearly >>>> handled differently instead of being "hidden" in the optimized logic. >>>> Performances would not change much as this path is not optimized today >>>> anyway (the pixel-oriented logic is already used in the rotation case). >>>> >>>> Arthur's YUV implementation is indeed well optimized but the added >>>> complexity probably lead to small mistakes in the logic. The >>>> per-format read_line() implementation mentioned above could be >>>> extended to the YUV format as well, which would leverage Arthur's >>>> proposal by re-using his optimized version. Louis will help on this >>>> regard. However, for more complex cases such as when there is a >>>> rotation, it will be easier (and not sub-optimized compared to the RGB >>>> case) to also fallback to a pixel-oriented processing. >>>> >>>> Would this approach make sense? >>> >>> Hi, >>> >>> I think it would, if I understand what you mean. Ever since I proposed >>> a line-by-line algorithm to improve the performance, I was thinking of >>> per-format read_line() functions that would be selected outside of any >>> loops. Extending that to support YUV is only natural. 
I can imagine >>> rotation complicates things, and I won't oppose that resulting in a >>> much heavier read_line() implementation used in those cases. They might >>> perhaps call the original read_line() implementations pixel-by-pixel or >>> plane-by-plane (i.e. YUV planes) per pixel. Chroma-siting complicates >>> things even further. That way one could compose any >>> rotation-format-siting combination by chaining function pointers. >> >> I'll let Louis also validate but on my side I feel like I totally >> agree with your feedback. >> >>> I haven't looked at VKMS in a long time, and I am disappointed to find >>> that vkms_compose_row() is calling plane->pixel_read() pixel-by-pixel. >>> The reading vfunc should be called with many pixels at a time when the >>> source FB layout allows it. The whole point of the line-based functions >>> was that they repeat the innermost loop in every function body to make >>> the per-pixel overhead as small as possible. The VKMS implementations >>> benchmarked before and after the original line-based algorithm showed >>> that calling a function pointer per-pixel is relatively very expensive. >>> Or maybe it was a switch-case. >> >> Indeed, since your initial feedback Louis made a couple of comparisons >> and the time penalty is between +5% and +60% depending on the case, >> AFAIR. >> >>> Sorry, I didn't realize the optimization had already been lost. >> >> No problem, actually I also lost myself in my first answer as I >> initially thought the (mainline) RGB logic was also broken in edge >> cases, which it was not, only the YUV logic suffered from some >> limitations. >> >>> Btw. I'd suggest renaming vkms_compose_row() to vkms_fetch_row() since >>> it's not composing anything and the name mislead me. >> >> Makes sense. >> >>> I think if you inspect the compositing code as of revision >>> 8356b97906503a02125c8d03c9b88a61ea46a05a you'll get a better feeling of >>> what it was supposed to be. >> >> Excellent, thanks a lot! 
>> >> Miquèl > >
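The per-pixel function-pointer cost discussed above is easy to demonstrate outside the kernel. Below is a minimal, self-contained C sketch of the two call granularities; the struct and function names only mimic the VKMS ones (they are not the actual driver API), and the XRGB8888-to-16-bit conversion is a simplified illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the VKMS intermediate pixel type, for illustration only. */
struct pixel_argb_u16 {
	uint16_t a, r, g, b;
};

/* Per-pixel reader: the composer pays one indirect call per pixel. */
static void XRGB8888_to_argb_u16(const uint8_t *src, struct pixel_argb_u16 *out)
{
	out->a = 0xffff;
	out->r = (uint16_t)(src[2] * 257);
	out->g = (uint16_t)(src[1] * 257);
	out->b = (uint16_t)(src[0] * 257);
}

/*
 * Per-line reader: the innermost loop is repeated inside the function
 * body, so the indirect-call overhead is paid once per row instead of
 * once per pixel.
 */
static void XRGB8888_read_line(const uint8_t *src, struct pixel_argb_u16 *out,
			       size_t count)
{
	for (size_t x = 0; x < count; x++, src += 4, out++) {
		out->a = 0xffff;
		out->r = (uint16_t)(src[2] * 257);
		out->g = (uint16_t)(src[1] * 257);
		out->b = (uint16_t)(src[0] * 257);
	}
}
```

Both produce identical output; only the call granularity differs. Timing the two over a large buffer (e.g. with clock_gettime()) is essentially the measurement the thread asks contributors to report.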
On Tue, 6 Feb 2024 14:57:48 -0300 Arthur Grillo <arthurgrillo@riseup.net> wrote: > On 02/02/24 16:45, Pekka Paalanen wrote: ... > > Would it be possible to have a standardised benchmark specifically > > for performance rather than correctness, in IGT or where-ever it > > would make sense? Then it would be simple to tell contributors to > > run this and report the numbers before and after. > > > > I would propose this kind of KMS layout: > > > > - CRTC size 3841 x 2161 > > - primary plane, XRGB8888, 3639 x 2161 @ 101,0 > > - overlay A, XBGR2101010, 3033 x 1777 @ 201,199 > > - overlay B, ARGB8888, 1507 x 1400 @ 1800,250 > > > > The sizes and positions are deliberately odd to try to avoid happy > > alignment accidents. The planes are big, which should let the pixel > > operations easily dominate performance measurement. There are > > different pixel formats, both opaque and semi-transparent. There is > > lots of plane overlap. The planes also do not cover the whole CRTC > > leaving the background visible a bit. > > > > There should be two FBs per each plane, flipped alternatingly each > > frame. Writeback should be active. Run this a number of frames, say, > > 100, and measure the kernel CPU time taken. It's supposed to take at > > least several seconds in total. > > > > I think something like this should be the base benchmark. One can > > add more to it, like rotated planes, YUV planes, etc. or switch > > settings on the existing planes. Maybe even FB_DAMAGE_CLIPS. Maybe > > one more overlay that is very tall and thin. > > > > Just an idea, what do you all think? > > Hi Pekka, > > I just finished writing this proposal using IGT. > > I got pretty interesting results: > > The mentioned commit 8356b97906503a02125c8d03c9b88a61ea46a05a took > around 13 seconds. While drm-misc/drm-misc-next took 36 seconds. > > I'm currently bisecting to be certain that the change to the > pixel-by-pixel is the culprit, but I don't see why it wouldn't be. 
> > I just need to do some final touches on the benchmark code and it > will be ready for revision. Awesome, thank you very much for doing that! pq
Hello Pekka, Arthur, Maxime, > > > >>>>>> Change the composition algorithm to iterate over pixels instead of lines. > > > >>>>>> It allows a simpler management of rotation and pixel access for complex formats. > > > >>>>>> > > > >>>>>> This new algorithm allows read_pixel function to have access to x/y > > > >>>>>> coordinates and make it possible to read the correct thing in a block > > > >>>>>> when block_w and block_h are not 1. > > > >>>>>> The iteration pixel-by-pixel in the same method also allows a simpler > > > >>>>>> management of rotation with drm_rect_* helpers. This way it's not needed > > > >>>>>> anymore to have misterious switch-case distributed in multiple places. > > > >>>>> > > > >>>>> Hi, > > > >>>>> > > > >>>>> there was a very good reason to write this code using lines: > > > >>>>> performance. Before lines, it was indeed operating on individual pixels. > > > >>>>> > > > >>>>> Please, include performance measurements before and after this series > > > >>>>> to quantify the impact on the previously already supported pixel > > > >>>>> formats, particularly the 32-bit-per-pixel RGB variants. > > > >>>>> > > > >>>>> VKMS will be used more and more in CI for userspace projects, and > > > >>>>> performance actually matters there. > > > >>>>> > > > >>>>> I'm worrying that this performance degradation here is significant. I > > > >>>>> believe it is possible to keep blending with lines, if you add new line > > > >>>>> getters for reading from rotated, sub-sampled etc. images. That way you > > > >>>>> don't have to regress the most common formats' performance. I tested, and yes, it's significant for most of the tests. None of them timed out on my machine, but I agree that I have to improve this. Do you know which tests are the more "heavy"? > > > >>>> While I understand performance is important and should be taken into > > > >>>> account seriously, I cannot understand how broken testing could be > > > >>>> considered better. 
Fast but inaccurate will always be significantly > > > >>>> less attractive to my eyes. > > > >>> > > > >>> AFAIK, neither the cover letter nor the commit log claimed it was fixing > > > >>> something broken, just that it was "better" (according to what > > > >>> criteria?). Sorry Maxime for this little misunderstanding, I will improve the commit message and cover letter for the v2. > > > >> Today's RGB implementation is only optimized in the line-by-line case > > > >> when there is no rotation. The logic is bit convoluted and may possibly > > > >> be slightly clarified with a per-format read_line() implementation, > > > >> at a very light performance cost. Such an improvement would definitely > > > >> benefit to the clarity of the code, especially when transformations > > > >> (especially the rotations) come into play because they would be clearly > > > >> handled differently instead of being "hidden" in the optimized logic. > > > >> Performances would not change much as this path is not optimized today > > > >> anyway (the pixel-oriented logic is already used in the rotation case). [...] > > > > I think it would, if I understand what you mean. Ever since I proposed > > > > a line-by-line algorithm to improve the performance, I was thinking of > > > > per-format read_line() functions that would be selected outside of any > > > > loops. [...] > > > > I haven't looked at VKMS in a long time, and I am disappointed to find > > > > that vkms_compose_row() is calling plane->pixel_read() pixel-by-pixel. > > > > The reading vfunc should be called with many pixels at a time when the > > > > source FB layout allows it. The whole point of the line-based functions > > > > was that they repeat the innermost loop in every function body to make > > > > the per-pixel overhead as small as possible. The VKMS implementations > > > > benchmarked before and after the original line-based algorithm showed > > > > that calling a function pointer per-pixel is relatively very expensive. 
> > > > Or maybe it was a switch-case. [...] > > > But, I agree with Miquel that the rotation logic is easier to implement > > > in a pixel-based way. So going pixel-by-pixel only when rotation occurs > > > would be great. > > > > Yes, and I think that can very well be done in the line-based framework > > still that existed in the old days before any rotation support was > > added. Essentially a plug-in line-getter function that then calls a > > format-specific line-getter pixel-by-pixel while applying the rotation. > > It would be simple, it would leave unrotated performance unharmed (use > > format-specific line-getter directly with lines), but it might be > > somewhat less performant for rotated KMS planes. I suspect that might > > be a good compromise. > > > > Format-specific line-getters could also be parameterized by > > pixel-to-pixel offset in bytes. Then they could directly traverse FB > > rows forward and backward, and even FB columns. It may or may not have > > a penalty compared to the original line-getters, so it would have to > > be benchmarked. > > Oh, actually, since the byte offset depends on format, it might be > better to parametrize by direction and compute the offset in the > format-specific line-getter before the loop.

I'm currently working on this implementation. The algorithm would look like:

void blend(...)
{
	for (int y = 0; y < height; y++) {
		for (int plane = 0; plane < nb_planes; plane++) {
			if (planes[plane].read_line && planes[plane].rotation == DRM_ROTATION_0) {
				[...] /* Small common logic to manage REFLECT_X/Y and translations */
				planes[plane].read_line(....);
			} else {
				[...] /* v1 of my patch, pixel-by-pixel read */
			}
		}
	}
}

where read_line is:

void read_line(frame_info *src, int y, int x_start, int x_stop, pixel_argb16 *dst[])

- y is the line to read (so the caller needs to compute the correct offset)
- x_start/x_stop are the start and stop columns, but they may not be ordered (i.e. DRM_REFLECT_X will make x_start greater than x_stop)
- src/dst are the source and destination buffers

This way:
- It's simple to read for the general case (usage of drm_rect_* helpers instead of manually rewriting the logic)
- Each pixel format can be quickly implemented with "pixel-by-pixel" methods
- Performance should be good for formats with a read_line implementation when no rotation is involved

I also created some helpers for conversions between formats to avoid code duplication between the pixel and line algorithms (and also between the argb and xrgb variants).

The only flaw with this is that most of the read_line functions will look like:

void read_line(...)
{
	int increment = x_start < x_stop ? 1 : -1;

	while (x_start != x_stop) {
		out += 1;
		[...] /* color conversion */
		x_start += increment;
	}
}

But as Pekka explained, it's probably the most efficient way to do it.

Is there a way to save the output of vkms to a folder (something like "one image per frame")? It's a bit complex to debug graphics without seeing them.

I have something which (I think) should work, but some tests are failing and I can't find why in my code (I can't see the reason they are failing, and the hexdump I added for debugging looks normal).

I think my issue is a simple mistake in the "translation management" or maybe just a wrong offset, but I don't see my error in the code. I think a quick look at the final image would help me a lot.

[...]

Have a nice day,
Louis Chauvet
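The bidirectional loop sketched above can be made concrete as follows, assuming a packed single-plane XRGB8888 layout; frame_info and the pixel struct are simplified stand-ins for the real VKMS types, and the x_stop column is excluded, mirroring the while (x_start != x_stop) condition:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pixel_argb_u16 {
	uint16_t a, r, g, b;
};

/* Simplified single-plane frame description, for illustration only. */
struct frame_info {
	const uint8_t *map; /* start of the framebuffer mapping */
	size_t pitch;       /* bytes per row */
};

/*
 * Read pixels of row y from column x_start toward x_stop (excluded).
 * When x_start > x_stop the row is traversed right-to-left, which is
 * how the caller can express DRM_MODE_REFLECT_X.
 */
static void XRGB8888_read_line(const struct frame_info *src, int y,
			       int x_start, int x_stop,
			       struct pixel_argb_u16 *out)
{
	const uint8_t *row = src->map + (size_t)y * src->pitch;
	int increment = x_start < x_stop ? 1 : -1;

	for (int x = x_start; x != x_stop; x += increment, out++) {
		const uint8_t *px = row + (size_t)x * 4; /* cpp == 4 */

		out->a = 0xffff;
		out->r = (uint16_t)(px[2] * 257);
		out->g = (uint16_t)(px[1] * 257);
		out->b = (uint16_t)(px[0] * 257);
	}
}
```

Note that reading columns 3..0 is then read_line(&fi, y, 3, -1, out): the same function body serves both directions, at the cost of one comparison per call.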
Hello Pekka, Arthur, [...] > > > Would it be possible to have a standardised benchmark specifically > > > for performance rather than correctness, in IGT or where-ever it > > > would make sense? Then it would be simple to tell contributors to > > > run this and report the numbers before and after. > > > > > > I would propose this kind of KMS layout: > > > > > > - CRTC size 3841 x 2161 > > > - primary plane, XRGB8888, 3639 x 2161 @ 101,0 > > > - overlay A, XBGR2101010, 3033 x 1777 @ 201,199 > > > - overlay B, ARGB8888, 1507 x 1400 @ 1800,250 > > > > > > The sizes and positions are deliberately odd to try to avoid happy > > > alignment accidents. The planes are big, which should let the pixel > > > operations easily dominate performance measurement. There are > > > different pixel formats, both opaque and semi-transparent. There is > > > lots of plane overlap. The planes also do not cover the whole CRTC > > > leaving the background visible a bit. > > > > > > There should be two FBs per each plane, flipped alternatingly each > > > frame. Writeback should be active. Run this a number of frames, say, > > > 100, and measure the kernel CPU time taken. It's supposed to take at > > > least several seconds in total. > > > > > > I think something like this should be the base benchmark. One can > > > add more to it, like rotated planes, YUV planes, etc. or switch > > > settings on the existing planes. Maybe even FB_DAMAGE_CLIPS. Maybe > > > one more overlay that is very tall and thin. > > > > > > Just an idea, what do you all think? > > > > Hi Pekka, > > > > I just finished writing this proposal using IGT. > > > > I got pretty interesting results: > > > > The mentioned commit 8356b97906503a02125c8d03c9b88a61ea46a05a took > > around 13 seconds. While drm-misc/drm-misc-next took 36 seconds. > > > > I'm currently bisecting to be certain that the change to the > > pixel-by-pixel is the culprit, but I don't see why it wouldn't be. 
> > I just need to do some final touches on the benchmark code and it > > will be ready for revision. > > Awesome, thank you very much for doing that! > pq

I also think it's a good benchmark for classic configurations. The odd sizes are a very nice idea to verify the corner cases of line-by-line algorithms.

When this is ready, please share the test so I can check if my patch is as performant as before.

Thank you for this work.

Have a nice day,
Louis Chauvet
On 07/02/24 13:03, Louis Chauvet wrote: > Hello Pekka, Arthur, > > [...] > >>>> Would it be possible to have a standardised benchmark specifically >>>> for performance rather than correctness, in IGT or where-ever it >>>> would make sense? Then it would be simple to tell contributors to >>>> run this and report the numbers before and after. >>>> >>>> I would propose this kind of KMS layout: >>>> >>>> - CRTC size 3841 x 2161 >>>> - primary plane, XRGB8888, 3639 x 2161 @ 101,0 >>>> - overlay A, XBGR2101010, 3033 x 1777 @ 201,199 >>>> - overlay B, ARGB8888, 1507 x 1400 @ 1800,250 >>>> >>>> The sizes and positions are deliberately odd to try to avoid happy >>>> alignment accidents. The planes are big, which should let the pixel >>>> operations easily dominate performance measurement. There are >>>> different pixel formats, both opaque and semi-transparent. There is >>>> lots of plane overlap. The planes also do not cover the whole CRTC >>>> leaving the background visible a bit. >>>> >>>> There should be two FBs per each plane, flipped alternatingly each >>>> frame. Writeback should be active. Run this a number of frames, say, >>>> 100, and measure the kernel CPU time taken. It's supposed to take at >>>> least several seconds in total. >>>> >>>> I think something like this should be the base benchmark. One can >>>> add more to it, like rotated planes, YUV planes, etc. or switch >>>> settings on the existing planes. Maybe even FB_DAMAGE_CLIPS. Maybe >>>> one more overlay that is very tall and thin. >>>> >>>> Just an idea, what do you all think? >>> >>> Hi Pekka, >>> >>> I just finished writing this proposal using IGT. >>> >>> I got pretty interesting results: >>> >>> The mentioned commit 8356b97906503a02125c8d03c9b88a61ea46a05a took >>> around 13 seconds. While drm-misc/drm-misc-next took 36 seconds. >>> >>> I'm currently bisecting to be certain that the change to the >>> pixel-by-pixel is the culprit, but I don't see why it wouldn't be. 
>>> >>> I just need to do some final touches on the benchmark code and it >>> will be ready for revision. >> >> Awesome, thank you very much for doing that! >> pq > > I also think it's a good benchmarks for classic configurations. The odd > size is a very nice idea to verify the corner cases of line-by-line > algorithms. > > When this is ready, please share the test, so I can check if my patch is > as performant as before. > > Thank you for this work. > > Have a nice day, > Louis Chauvet > Just sent the benchmark for revision: https://lore.kernel.org/r/20240207-bench-v1-1-7135ad426860@riseup.net
On Wed, 7 Feb 2024 16:49:56 +0100 Louis Chauvet <louis.chauvet@bootlin.com> wrote: > Hello Pekka, Arthur, Maxime, Hi all > > > > >>>>>> Change the composition algorithm to iterate over pixels instead of lines. > > > > >>>>>> It allows a simpler management of rotation and pixel access for complex formats. > > > > >>>>>> > > > > >>>>>> This new algorithm allows read_pixel function to have access to x/y > > > > >>>>>> coordinates and make it possible to read the correct thing in a block > > > > >>>>>> when block_w and block_h are not 1. > > > > >>>>>> The iteration pixel-by-pixel in the same method also allows a simpler > > > > >>>>>> management of rotation with drm_rect_* helpers. This way it's not needed > > > > >>>>>> anymore to have misterious switch-case distributed in multiple places. > > > > >>>>> > > > > >>>>> Hi, > > > > >>>>> > > > > >>>>> there was a very good reason to write this code using lines: > > > > >>>>> performance. Before lines, it was indeed operating on individual pixels. > > > > >>>>> > > > > >>>>> Please, include performance measurements before and after this series > > > > >>>>> to quantify the impact on the previously already supported pixel > > > > >>>>> formats, particularly the 32-bit-per-pixel RGB variants. > > > > >>>>> > > > > >>>>> VKMS will be used more and more in CI for userspace projects, and > > > > >>>>> performance actually matters there. > > > > >>>>> > > > > >>>>> I'm worrying that this performance degradation here is significant. I > > > > >>>>> believe it is possible to keep blending with lines, if you add new line > > > > >>>>> getters for reading from rotated, sub-sampled etc. images. That way you > > > > >>>>> don't have to regress the most common formats' performance. > > I tested, and yes, it's significant for most of the tests. None of them > timed out on my machine, but I agree that I have to improve this. Do you > know which tests are the more "heavy"? I don't, but considering that various userspace projects (e.g. 
Wayland compositors) want to use VKMS more and more in their own CI, looking only at IGT is not enough. Every second saved per run is a tiny bit of data center energy saved, or developers waiting less for results. I do have some expectations that for each KMS property, Wayland compositors tend to use the "normal" property value more than any other value. So if you test different pixel formats, you probably set rotation to normal, since it's completely orthogonal in userspace. And then you would test different rotations with just one pixel format. At least I would personally leave it to IGT to test all the possible combinations of pixel formats + rotations + odd sizes + odd positions. Wayland compositor CI wants to test the compositor internals, not VKMS internals. > > > > >>>> While I understand performance is important and should be taken into > > > > >>>> account seriously, I cannot understand how broken testing could be > > > > >>>> considered better. Fast but inaccurate will always be significantly > > > > >>>> less attractive to my eyes. > > > > >>> > > > > >>> AFAIK, neither the cover letter nor the commit log claimed it was fixing > > > > >>> something broken, just that it was "better" (according to what > > > > >>> criteria?). > > Sorry Maxime for this little missunderstanding, I will improve the commit > message and cover letter for the v2. > > > > > >> Today's RGB implementation is only optimized in the line-by-line case > > > > >> when there is no rotation. The logic is bit convoluted and may possibly > > > > >> be slightly clarified with a per-format read_line() implementation, > > > > >> at a very light performance cost. Such an improvement would definitely > > > > >> benefit to the clarity of the code, especially when transformations > > > > >> (especially the rotations) come into play because they would be clearly > > > > >> handled differently instead of being "hidden" in the optimized logic. 
> > > > >> Performances would not change much as this path is not optimized today > > > > >> anyway (the pixel-oriented logic is already used in the rotation case). > > [...] > > > > > > I think it would, if I understand what you mean. Ever since I proposed > > > > > a line-by-line algorithm to improve the performance, I was thinking of > > > > > per-format read_line() functions that would be selected outside of any > > > > > loops. > > [...] > > > > > > I haven't looked at VKMS in a long time, and I am disappointed to find > > > > > that vkms_compose_row() is calling plane->pixel_read() pixel-by-pixel. > > > > > The reading vfunc should be called with many pixels at a time when the > > > > > source FB layout allows it. The whole point of the line-based functions > > > > > was that they repeat the innermost loop in every function body to make > > > > > the per-pixel overhead as small as possible. The VKMS implementations > > > > > benchmarked before and after the original line-based algorithm showed > > > > > that calling a function pointer per-pixel is relatively very expensive. > > > > > Or maybe it was a switch-case. > > [...] > > > > > But, I agree with Miquel that the rotation logic is easier to implement > > > > in a pixel-based way. So going pixel-by-pixel only when rotation occurs > > > > would be great. > > > > > > Yes, and I think that can very well be done in the line-based framework > > > still that existed in the old days before any rotation support was > > > added. Essentially a plug-in line-getter function that then calls a > > > format-specific line-getter pixel-by-pixel while applying the rotation. > > > It would be simple, it would leave unrotated performance unharmed (use > > > format-specific line-getter directly with lines), but it might be > > > somewhat less performant for rotated KMS planes. I suspect that might > > > be a good compromise. 
> > > > > > Format-specific line-getters could also be parameterized by > > > pixel-to-pixel offset in bytes. Then they could directly traverse FB > > > rows forward and backward, and even FB columns. It may or may not have > > > a penalty compared to the original line-getters, so it would have to > > > be benchmarked. > > > > Oh, actually, since the byte offset depends on format, it might be > > better to parametrize by direction and compute the offset in the > > format-specific line-getter before the loop. > > > > I'm currently working on this implementation. The algorithm would look > like: > > void blend(...) { > for(int y = 0; y < height; y++) { > for(int plane = 0; plane < nb_planes; plane++) { > if(planes[plane].read_line && planes[plane].rotation == DRM_ROTATION_0) { I would try to drop the rotation check here completely. Instead, when choosing the function pointer to call here, outside of *all* loops, you would check the rotation property. If rotation is a no-op, pick the read_line function directly. If rotation/reflection is needed, pick a rotation function that will then call read_line function pixel-by-pixel. So planes[plane] would have two vfuncs, one with a plain read_line that assumes normal orientation and can return a line of arbitrary length from arbitrary x,y position, and another vfunc that this loop here will call which is either some rotation handling function or just the same function as the first vfunc. The two function pointers might well need different signatures, meaning you need a simple wrapper for the rotation=normal case too. I believe that could result in cleaner code. > [...] /* Small common logic to manage REFLECT_X/Y and translations */ > planes[plane].read_line(....); > } else { > [...] 
/* v1 of my patch, pixel by pixel read */ > } > } > } > } > > where read_line is: > void read_line(frame_info *src, int y, int x_start, int x_stop, pixel_argb16 *dts[]) > - y is the line to read (so the caller need to compute the correct offset) > - x_start/x_stop are the start and stop column, but they may be not > ordered properly (i.e DRM_REFLECT_X will make x_start greater than > x_stop) > - src/dst are source and destination buffers This sounds ok. An alternative would be something like enum direction { RIGHT, LEFT, UP, DOWN, }; void read_line(frame_info *src, int start_x, int start_y, enum direction dir, int count_pixels, pixel_argb16 *dst); Based on dir, before the inner loop this function would compute the byte offset between the pixels to be read. If the format is multiplanar YUV, it can compute the offset per plane. And the starting pointers per pixel plane, of course, and one end pointer for the loop stop condition maybe from dst. This might make all the other directions than RIGHT much faster than calling read_line one pixel at a time to achieve the same. Would need to benchmark if this is significantly slower than your suggestion for dir=RIGHT, though. If it's roughly the same, then it would probably be worth it. > This way: > - It's simple to read for the general case (usage of drm_rect_* instead of > manually rewriting the logic) > - Each pixel format can be quickly implemented with "pixel-by-pixel" > methods > - The performances should be good if no rotation is implied for some > formats > > I also created some helpers for conversions between formats to avoid code > duplication between pixel and line algorithms (and also between argb and > xrgb variants). > > The only flaw with this, is that most of the read_line functions will > look like: > > void read_line(...) { > int increment = x_start < x_stop ? 1: -1; > while(x_start != x_stop) { > out += 1; > [...] 
/* color conversion */ > x_start += increment; > } > } > > But as Pekka explained, it's probably the most efficient way to do it. Yes, I expect them to look roughly like that. It's necessary for moving as much of the setup computations and real function calls out of the inner-most loop as possible. The middle (over KMS planes) and outer (over y) loops are less sensitive to wasted cycles on redundant computations. > Is there a way to save the output of vkms to a folder (something like > "one image per frame")? It's a bit complex to debug graphics without > seeing them. > > I have something which (I think) should work, but some tests are failing > and I don't find why in my code (I don't find the reason why the they are > failing and the hexdump I added to debug seems normal). > > I think my issue is a simple mistake in the "translation managment" or > maybe just a wrong offset, but I don't see my error in the code. I think a > quick look on the final image would help me a lot. I don't know anything about the kernel unit testing frameworks, maybe they could help? But if you drive the test through UAPI from userspace, use a writeback connector to fetch the resulting image. I'm sure IGT has code doing that somewhere, although many IGT tests rely on CRC instead because CRC is more widely available in hardware. Arthur's new benchmark seems to be using writeback, you just need to make it save to file: https://lore.kernel.org/all/20240207-bench-v1-1-7135ad426860@riseup.net/ Thanks, pq
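The direction-parameterized alternative Pekka describes can be sketched in the same spirit: the per-sample byte offset is computed once before the loop, so LEFT/UP/DOWN traversals stay in a tight loop instead of calling a line getter one pixel at a time. Types and names below are illustrative rather than the actual VKMS API, and a packed single-plane XRGB8888 layout is assumed:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct pixel_argb_u16 {
	uint16_t a, r, g, b;
};

enum direction { RIGHT, LEFT, UP, DOWN };

/* Simplified single-plane frame description, for illustration only. */
struct frame_info {
	const uint8_t *map;
	ptrdiff_t pitch; /* bytes per row */
	ptrdiff_t cpp;   /* bytes per pixel, 4 for XRGB8888 */
};

static void XRGB8888_read_line(const struct frame_info *src,
			       int start_x, int start_y, enum direction dir,
			       int count_pixels, struct pixel_argb_u16 *dst)
{
	const uint8_t *px = src->map + start_y * src->pitch + start_x * src->cpp;
	ptrdiff_t step = src->cpp; /* RIGHT */

	/* One branch per line, not per pixel. */
	switch (dir) {
	case LEFT:  step = -src->cpp;   break;
	case DOWN:  step = src->pitch;  break;
	case UP:    step = -src->pitch; break;
	case RIGHT: break;
	}

	for (int i = 0; i < count_pixels; i++, px += step, dst++) {
		dst->a = 0xffff;
		dst->r = (uint16_t)(px[2] * 257);
		dst->g = (uint16_t)(px[1] * 257);
		dst->b = (uint16_t)(px[0] * 257);
	}
}
```

For multi-planar YUV the same idea applies per plane: compute one step per plane before the loop. A 90/270-degree rotation then maps to a column direction plus start coordinates chosen outside all loops, as suggested earlier in the thread.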
On 08/02/24 06:39, Pekka Paalanen wrote: > On Wed, 7 Feb 2024 16:49:56 +0100 > Louis Chauvet <louis.chauvet@bootlin.com> wrote: > >> Hello Pekka, Arthur, Maxime, > > Hi all > >>>>>>>>>>> Change the composition algorithm to iterate over pixels instead of lines. >>>>>>>>>>> It allows a simpler management of rotation and pixel access for complex formats. >>>>>>>>>>> >>>>>>>>>>> This new algorithm allows read_pixel function to have access to x/y >>>>>>>>>>> coordinates and make it possible to read the correct thing in a block >>>>>>>>>>> when block_w and block_h are not 1. >>>>>>>>>>> The iteration pixel-by-pixel in the same method also allows a simpler >>>>>>>>>>> management of rotation with drm_rect_* helpers. This way it's not needed >>>>>>>>>>> anymore to have misterious switch-case distributed in multiple places. >>>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> there was a very good reason to write this code using lines: >>>>>>>>>> performance. Before lines, it was indeed operating on individual pixels. >>>>>>>>>> >>>>>>>>>> Please, include performance measurements before and after this series >>>>>>>>>> to quantify the impact on the previously already supported pixel >>>>>>>>>> formats, particularly the 32-bit-per-pixel RGB variants. >>>>>>>>>> >>>>>>>>>> VKMS will be used more and more in CI for userspace projects, and >>>>>>>>>> performance actually matters there. >>>>>>>>>> >>>>>>>>>> I'm worrying that this performance degradation here is significant. I >>>>>>>>>> believe it is possible to keep blending with lines, if you add new line >>>>>>>>>> getters for reading from rotated, sub-sampled etc. images. That way you >>>>>>>>>> don't have to regress the most common formats' performance. >> >> I tested, and yes, it's significant for most of the tests. None of them >> timed out on my machine, but I agree that I have to improve this. Do you >> know which tests are the more "heavy"? > > I don't, but considering that various userspace projects (e.g. 
Wayland > compositors) want to use VKMS more and more in their own CI, looking > only at IGT is not enough. Every second saved per run is a tiny bit of > data center energy saved, or developers waiting less for results. > > I do have some expectations that for each KMS property, Wayland > compositors tend to use the "normal" property value more than any other > value. So if you test different pixel formats, you probably set > rotation to normal, since it's completely orthogonal in userspace. And > then you would test different rotations with just one pixel format. > > At least I would personally leave it to IGT to test all the possible > combinations of pixel formats + rotations + odd sizes + odd positions. > Wayland compositor CI wants to test the compositor internals, not VKMS > internals. > >>>>>>>>> While I understand performance is important and should be taken into >>>>>>>>> account seriously, I cannot understand how broken testing could be >>>>>>>>> considered better. Fast but inaccurate will always be significantly >>>>>>>>> less attractive to my eyes. >>>>>>>> >>>>>>>> AFAIK, neither the cover letter nor the commit log claimed it was fixing >>>>>>>> something broken, just that it was "better" (according to what >>>>>>>> criteria?). >> >> Sorry Maxime for this little missunderstanding, I will improve the commit >> message and cover letter for the v2. >> >>>>>>> Today's RGB implementation is only optimized in the line-by-line case >>>>>>> when there is no rotation. The logic is bit convoluted and may possibly >>>>>>> be slightly clarified with a per-format read_line() implementation, >>>>>>> at a very light performance cost. Such an improvement would definitely >>>>>>> benefit to the clarity of the code, especially when transformations >>>>>>> (especially the rotations) come into play because they would be clearly >>>>>>> handled differently instead of being "hidden" in the optimized logic. 
>>>>>>> Performance would not change much as this path is not optimized
>>>>>>> today anyway (the pixel-oriented logic is already used in the
>>>>>>> rotation case).
>>
>> [...]
>>
>>>>>> I think it would, if I understand what you mean. Ever since I
>>>>>> proposed a line-by-line algorithm to improve the performance, I was
>>>>>> thinking of per-format read_line() functions that would be selected
>>>>>> outside of any loops.
>>
>> [...]
>>
>>>>>> I haven't looked at VKMS in a long time, and I am disappointed to
>>>>>> find that vkms_compose_row() is calling plane->pixel_read()
>>>>>> pixel-by-pixel. The reading vfunc should be called with many pixels
>>>>>> at a time when the source FB layout allows it. The whole point of
>>>>>> the line-based functions was that they repeat the innermost loop in
>>>>>> every function body to make the per-pixel overhead as small as
>>>>>> possible. The VKMS implementations benchmarked before and after the
>>>>>> original line-based algorithm showed that calling a function
>>>>>> pointer per pixel is relatively very expensive. Or maybe it was a
>>>>>> switch-case.
>>
>> [...]
>>
>>>>> But, I agree with Miquel that the rotation logic is easier to
>>>>> implement in a pixel-based way. So going pixel-by-pixel only when
>>>>> rotation occurs would be great.
>>>>
>>>> Yes, and I think that can very well be done in the line-based
>>>> framework that existed in the old days before any rotation support
>>>> was added. Essentially a plug-in line-getter function that then calls
>>>> a format-specific line-getter pixel-by-pixel while applying the
>>>> rotation. It would be simple, it would leave unrotated performance
>>>> unharmed (use the format-specific line-getter directly with lines),
>>>> but it might be somewhat less performant for rotated KMS planes. I
>>>> suspect that might be a good compromise.
>>>>
>>>> Format-specific line-getters could also be parameterized by
>>>> pixel-to-pixel offset in bytes. Then they could directly traverse FB
>>>> rows forward and backward, and even FB columns. It may or may not
>>>> have a penalty compared to the original line-getters, so it would
>>>> have to be benchmarked.
>>>
>>> Oh, actually, since the byte offset depends on format, it might be
>>> better to parametrize by direction and compute the offset in the
>>> format-specific line-getter before the loop.
>>>
>>
>> I'm currently working on this implementation. The algorithm would look
>> like:
>>
>> void blend(...) {
>>     for (int y = 0; y < height; y++) {
>>         for (int plane = 0; plane < nb_planes; plane++) {
>>             if (planes[plane].read_line && planes[plane].rotation == DRM_ROTATION_0) {
>
> I would try to drop the rotation check here completely. Instead, when
> choosing the function pointer to call here, outside of *all* loops, you
> would check the rotation property. If rotation is a no-op, pick the
> read_line function directly. If rotation/reflection is needed, pick a
> rotation function that will then call the read_line function
> pixel-by-pixel.
>
> So planes[plane] would have two vfuncs: one with a plain read_line that
> assumes normal orientation and can return a line of arbitrary length
> from an arbitrary x,y position, and another vfunc that this loop here
> will call, which is either some rotation handling function or just the
> same function as the first vfunc.
>
> The two function pointers might well need different signatures, meaning
> you need a simple wrapper for the rotation=normal case too.
>
> I believe that could result in cleaner code.
>
>>                 [...] /* Small common logic to manage REFLECT_X/Y and translations */
>>                 planes[plane].read_line(....);
>>             } else {
>>                 [...] /* v1 of my patch, pixel by pixel read */
>>             }
>>         }
>>     }
>> }
>>
>> where read_line is:
>> void read_line(frame_info *src, int y, int x_start, int x_stop, pixel_argb16 *dst)
>> - y is the line to read (so the caller needs to compute the correct offset)
>> - x_start/x_stop are the start and stop columns, but they may not be
>>   ordered properly (i.e. DRM_REFLECT_X will make x_start greater than
>>   x_stop)
>> - src/dst are the source and destination buffers
>
> This sounds ok. An alternative would be something like
>
> enum direction {
>     RIGHT,
>     LEFT,
>     UP,
>     DOWN,
> };
>
> void read_line(frame_info *src, int start_x, int start_y, enum direction dir,
>                int count_pixels, pixel_argb16 *dst);
>
> Based on dir, before the inner loop this function would compute the
> byte offset between the pixels to be read. If the format is
> multi-planar YUV, it can compute the offset per plane. And the starting
> pointers per pixel plane, of course, and one end pointer for the loop
> stop condition, maybe from dst.
>
> This might make all directions other than RIGHT much faster than
> calling read_line one pixel at a time to achieve the same.
>
> Would need to benchmark whether this is significantly slower than your
> suggestion for dir=RIGHT, though. If it's roughly the same, then it
> would probably be worth it.
>
>
>> This way:
>> - It's simple to read for the general case (usage of drm_rect_*
>>   instead of manually rewriting the logic)
>> - Each pixel format can be quickly implemented with "pixel-by-pixel"
>>   methods
>> - The performance should be good if no rotation is involved for some
>>   formats
>>
>> I also created some helpers for conversions between formats to avoid
>> code duplication between pixel and line algorithms (and also between
>> argb and xrgb variants).
>>
>> The only flaw with this is that most of the read_line functions will
>> look like:
>>
>> void read_line(...) {
>>     int increment = x_start < x_stop ? 1 : -1;
>>     while (x_start != x_stop) {
>>         out += 1;
>>         [...] /* color conversion */
>>         x_start += increment;
>>     }
>> }
>>
>> But as Pekka explained, it's probably the most efficient way to do it.
>
> Yes, I expect them to look roughly like that. It's necessary for moving
> as much of the setup computations and real function calls out of the
> innermost loop as possible. The middle (over KMS planes) and outer
> (over y) loops are less sensitive to wasted cycles on redundant
> computations.
>
>> Is there a way to save the output of vkms to a folder (something like
>> "one image per frame")? It's a bit complex to debug graphics without
>> seeing them.
>>
>> I have something which (I think) should work, but some tests are
>> failing and I can't find why in my code (I don't see the reason why
>> they are failing, and the hexdump I added to debug seems normal).
>>
>> I think my issue is a simple mistake in the "translation management"
>> or maybe just a wrong offset, but I don't see my error in the code. I
>> think a quick look at the final image would help me a lot.
>
> I don't know anything about the kernel unit testing frameworks, maybe
> they could help?
>
> But if you drive the test through UAPI from userspace, use a writeback
> connector to fetch the resulting image. I'm sure IGT has code doing
> that somewhere, although many IGT tests rely on CRC instead because
> CRC is more widely available in hardware.
>
> Arthur's new benchmark seems to be using writeback, you just need to
> make it save to file:
> https://lore.kernel.org/all/20240207-bench-v1-1-7135ad426860@riseup.net/

Hi, I just found that IGT's test kms_writeback has a flag '--dump' that
does that; maybe you can use it or copy the logic to the benchmark.

Best Regards,
~Arthur Grillo

>
>
> Thanks,
> pq
diff --git a/drivers/gpu/drm/vkms/vkms_composer.c b/drivers/gpu/drm/vkms/vkms_composer.c
index 3c99fb8b54e2..4c5439209907 100644
--- a/drivers/gpu/drm/vkms/vkms_composer.c
+++ b/drivers/gpu/drm/vkms/vkms_composer.c
@@ -43,7 +43,7 @@ static void pre_mul_alpha_blend(struct vkms_frame_info *frame_info,
 {
 	int x_dst = frame_info->dst.x1;
 	struct pixel_argb_u16 *out = output_buffer->pixels + x_dst;
-	struct pixel_argb_u16 *in = stage_buffer->pixels;
+	struct pixel_argb_u16 *in = stage_buffer->pixels + x_dst;
 	int x_limit = min_t(size_t, drm_rect_width(&frame_info->dst),
 			    stage_buffer->n_pixels);
 
@@ -55,33 +55,6 @@ static void pre_mul_alpha_blend(struct vkms_frame_info *frame_info,
 	}
 }
 
-static int get_y_pos(struct vkms_frame_info *frame_info, int y)
-{
-	if (frame_info->rotation & DRM_MODE_REFLECT_Y)
-		return drm_rect_height(&frame_info->rotated) - y - 1;
-
-	switch (frame_info->rotation & DRM_MODE_ROTATE_MASK) {
-	case DRM_MODE_ROTATE_90:
-		return frame_info->rotated.x2 - y - 1;
-	case DRM_MODE_ROTATE_270:
-		return y + frame_info->rotated.x1;
-	default:
-		return y;
-	}
-}
-
-static bool check_limit(struct vkms_frame_info *frame_info, int pos)
-{
-	if (drm_rotation_90_or_270(frame_info->rotation)) {
-		if (pos >= 0 && pos < drm_rect_width(&frame_info->rotated))
-			return true;
-	} else {
-		if (pos >= frame_info->rotated.y1 && pos < frame_info->rotated.y2)
-			return true;
-	}
-
-	return false;
-}
 
 static void fill_background(const struct pixel_argb_u16 *background_color,
 			    struct line_buffer *output_buffer)
@@ -182,18 +155,74 @@ static void blend(struct vkms_writeback_job *wb,
 	const struct pixel_argb_u16 background_color = { .a = 0xffff };
 
 	size_t crtc_y_limit = crtc_state->base.crtc->mode.vdisplay;
+	size_t crtc_x_limit = crtc_state->base.crtc->mode.hdisplay;
 
+	/*
+	 * Iterating over each pixel to simplify the handling of rotations by using drm_rect_*
+	 * helpers.
+	 * Iteration per pixel also allosw a simple management of complex pixel formats.
+	 *
+	 * If needed, this triple loop might be a good place for optimizations.
+	 */
 	for (size_t y = 0; y < crtc_y_limit; y++) {
 		fill_background(&background_color, output_buffer);
 
 		/* The active planes are composed associatively in z-order. */
 		for (size_t i = 0; i < n_active_planes; i++) {
-			y_pos = get_y_pos(plane[i]->frame_info, y);
-
-			if (!check_limit(plane[i]->frame_info, y_pos))
-				continue;
-
-			vkms_compose_row(stage_buffer, plane[i], y_pos);
+			for (size_t x = 0; x < crtc_x_limit; x++) {
+				/*
+				 * @x and @y are the coordinate in the output crtc.
+				 * @dst_px is used to easily check if a pixel must be blended.
+				 * @src_px is used to handle rotation. Just before blending, it
+				 * will contain the coordinate of the wanted source pixel in
+				 * the source buffer.
+				 * @tmp_src is the conversion of src rectangle to integer.
+				 */
+
+				struct drm_rect dst_px = DRM_RECT_INIT(x, y, 1, 1);
+				struct drm_rect src_px = DRM_RECT_INIT(x, y, 1, 1);
+				struct drm_rect tmp_src;
+
+				drm_rect_fp_to_int(&tmp_src, &plane[i]->frame_info->src);
+
+				/*
+				 * Check that dst_px is inside the plane output
+				 * Note: This is destructive for dst_px, if you need this
+				 * rectangle you have to do a copy
+				 */
+				if (!drm_rect_intersect(&dst_px, &plane[i]->frame_info->dst))
+					continue;
+
+				/*
+				 * Transform the coordinate x/y from the crtc to coordinates into
+				 * coordinates for the src buffer.
+				 *
+				 * First step is to cancel the offset of the dst buffer.
+				 * Then t will have to invert the rotation. This assumes that
+				 * dst = drm_rect_rotate(src, rotation) (dst and src have the
+				 * space size, but can be rotated).
+				 * And then, apply the offset of the source rectangle to the
+				 * coordinate.
+				 */
+				drm_rect_translate(&src_px, -plane[i]->frame_info->dst.x1,
+						   -plane[i]->frame_info->dst.y1);
+				drm_rect_rotate_inv(&src_px,
+						    drm_rect_width(&tmp_src),
+						    drm_rect_height(&tmp_src),
+						    plane[i]->frame_info->rotation);
+				drm_rect_translate(&src_px, tmp_src.x1, tmp_src.y1);
+
+				/*
+				 * Now, as the src_px contains the correct position, we can use
+				 * it to convert the pixel color
+				 */
+				plane[i]->pixel_read(plane[i]->frame_info,
+						     src_px.x1,
+						     src_px.y1,
+						     &stage_buffer->pixels[x],
+						     plane[i]->base.base.color_encoding,
+						     plane[i]->base.base.color_range);
+			}
 
 			pre_mul_alpha_blend(plane[i]->frame_info, stage_buffer,
 					    output_buffer);
 		}
diff --git a/drivers/gpu/drm/vkms/vkms_drv.h b/drivers/gpu/drm/vkms/vkms_drv.h
index cb20bab26cae..ba3c063adc5f 100644
--- a/drivers/gpu/drm/vkms/vkms_drv.h
+++ b/drivers/gpu/drm/vkms/vkms_drv.h
@@ -25,10 +25,20 @@
 
 #define VKMS_LUT_SIZE 256
 
+/**
+ * struct vkms_frame_info - structure to store the state of a frame
+ *
+ * @fb: backing drm framebuffer
+ * @src: source rectangle of this frame in the source framebuffer
+ * @dst: destination rectangle in the crtc buffer
+ * @map: see drm_shadow_plane_state@data
+ * @rotation: rotation applied to the source.
+ *
+ * @src and @dst should have the same size modulo the rotation.
+ */
 struct vkms_frame_info {
 	struct drm_framebuffer *fb;
 	struct drm_rect src, dst;
-	struct drm_rect rotated;
 	struct iosys_map map[DRM_FORMAT_MAX_PLANES];
 	unsigned int rotation;
 };
@@ -51,14 +61,15 @@ struct vkms_writeback_job {
 /**
  * typedef pixel_read_t - These functions are used to read the pixels in the source frame, convert
  * them to argb16 and write them to out_pixel.
- * It assumes that src_pixels point to a valid pixel (not a block, or a block of 1x1 pixel)
  *
- * @src_pixels: Source pointer to a pixel
+ * @frame_info: Frame used as source for the pixel value
+ * @x: X (width) coordinate in the source buffer
+ * @y: Y (height) coordinate in the source buffer
  * @out_pixel: Pointer where to write the pixel value
  * @encoding: Color encoding to use (mainly used for YUV formats)
  * @range: Color range to use (mainly used for YUV formats)
  */
-typedef void (*pixel_read_t)(u8 **src_pixels, int y,
+typedef void (*pixel_read_t)(struct vkms_frame_info *frame_info, int x, int y,
 			     struct pixel_argb_u16 *out_pixel,
 			     enum drm_color_encoding enconding,
 			     enum drm_color_range range);
@@ -66,6 +77,8 @@ typedef void (*pixel_read_t)(u8 **src_pixels, int y,
  * vkms_plane_state - Driver specific plane state
  * @base: base plane state
  * @frame_info: data required for composing computation
+ * @pixel_read: function to read a pixel in this plane. The creator of a vkms_plane_state must
+ *		ensure that this pointer is valid
  */
 struct vkms_plane_state {
 	struct drm_shadow_plane_state base;
diff --git a/drivers/gpu/drm/vkms/vkms_formats.c b/drivers/gpu/drm/vkms/vkms_formats.c
index c6376db58d38..8ff85ffda94f 100644
--- a/drivers/gpu/drm/vkms/vkms_formats.c
+++ b/drivers/gpu/drm/vkms/vkms_formats.c
@@ -11,79 +11,88 @@
 
 #include "vkms_formats.h"
 
-static size_t pixel_offset(const struct vkms_frame_info *frame_info, int x, int y, size_t index)
+
+/**
+ * packed_pixels_offset() - Get the offset of the block containing the pixel at coordinates x/y
+ *
+ * @frame_info: Buffer metadata
+ * @x: The x coordinate of the wanted pixel in the buffer
+ * @y: The y coordinate of the wanted pixel in the buffer
+ * @plane: The index of the plane to use
+ *
+ * The caller must be aware that this offset is not always a pointer to a pixel. If individual
+ * pixel values are needed, they have to be extracted from the block.
+ */
+static size_t packed_pixels_offset(const struct vkms_frame_info *frame_info, int x, int y, size_t plane)
 {
 	struct drm_framebuffer *fb = frame_info->fb;
-
-	return fb->offsets[index] + (y * fb->pitches[index])
-	       + (x * fb->format->cpp[index]);
+	const struct drm_format_info *format = frame_info->fb->format;
+	/* Directly using x and y to multiply pitches and format->ccp is not sufficient because
+	 * in some formats a block can represent multiple pixels.
+	 *
+	 * Dividing x and y by the block size allows to extract the correct offset of the block
+	 * containing the pixel.
+	 */
+	return fb->offsets[plane] +
+	       (y / drm_format_info_block_width(format, plane)) * fb->pitches[plane] +
+	       (x / drm_format_info_block_height(format, plane)) * format->char_per_block[plane];
 }
 
 /*
- * packed_pixels_addr - Get the pointer to pixel of a given pair of coordinates
+ * packed_pixels_addr - Get the pointer to the block containing the pixel at the given coordinate
  *
  * @frame_info: Buffer metadata
- * @x: The x(width) coordinate of the 2D buffer
- * @y: The y(Heigth) coordinate of the 2D buffer
+ * @x: The x(width) coordinate inside the 2D buffer
+ * @y: The y(Heigth) coordinate inside the 2D buffer
  * @index: The index of the plane on the 2D buffer
  *
- * Takes the information stored in the frame_info, a pair of coordinates, and
- * returns the address of the first color channel on the desired index. The
- * format's specific subsample is applied.
+ * Takes the information stored in the frame_info, a pair of coordinates, and returns the address
+ * of the block containing this pixel.
+ * The caller must be aware that this pointer is sometimes not directly a pixel, it needs some
+ * additional work to extract pixel color from this block.
 */
 static void *packed_pixels_addr(const struct vkms_frame_info *frame_info,
 				int x, int y, size_t index)
 {
-	int vsub = index == 0 ? 1 : frame_info->fb->format->vsub;
-	int hsub = index == 0 ? 1 : frame_info->fb->format->hsub;
-	size_t offset = pixel_offset(frame_info, x / hsub, y / vsub, index);
-
-	return (u8 *)frame_info->map[0].vaddr + offset;
+	return (u8 *)frame_info->map[0].vaddr + packed_pixels_offset(frame_info, x, y, index);
 }
 
-static void *get_packed_src_addr(const struct vkms_frame_info *frame_info, int y, size_t index)
+static void ARGB8888_to_argb_u16(struct vkms_frame_info *frame_info, int x, int y,
+				 struct pixel_argb_u16 *out_pixel, enum drm_color_encoding encoding,
+				 enum drm_color_range range)
 {
-	int x_src = frame_info->src.x1 >> 16;
-	int y_src = y - frame_info->rotated.y1 + (frame_info->src.y1 >> 16);
-
-	return packed_pixels_addr(frame_info, x_src, y_src, index);
-}
+	u8 *src_pixel = packed_pixels_addr(frame_info, x, y, 0);
 
-static int get_x_position(const struct vkms_frame_info *frame_info, int limit, int x)
-{
-	if (frame_info->rotation & (DRM_MODE_REFLECT_X | DRM_MODE_ROTATE_270))
-		return limit - x - 1;
-	return x;
-}
-
-static void ARGB8888_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
-				 enum drm_color_encoding encoding, enum drm_color_range range)
-{
 	/*
 	 * The 257 is the "conversion ratio". This number is obtained by the
 	 * (2^16 - 1) / (2^8 - 1) division. Which, in this case, tries to get
 	 * the best color value in a pixel format with more possibilities.
	 * A similar idea applies to others RGB color conversions.
 	 */
-	out_pixel->a = (u16)src_pixels[0][3] * 257;
-	out_pixel->r = (u16)src_pixels[0][2] * 257;
-	out_pixel->g = (u16)src_pixels[0][1] * 257;
-	out_pixel->b = (u16)src_pixels[0][0] * 257;
+	out_pixel->a = (u16)src_pixel[3] * 257;
+	out_pixel->r = (u16)src_pixel[2] * 257;
+	out_pixel->g = (u16)src_pixel[1] * 257;
+	out_pixel->b = (u16)src_pixel[0] * 257;
 }
 
-static void XRGB8888_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
-				 enum drm_color_encoding encoding, enum drm_color_range range)
+static void XRGB8888_to_argb_u16(struct vkms_frame_info *frame_info, int x, int y,
+				 struct pixel_argb_u16 *out_pixel, enum drm_color_encoding encoding,
+				 enum drm_color_range range)
 {
+	u8 *src_pixel = packed_pixels_addr(frame_info, x, y, 0);
 	out_pixel->a = (u16)0xffff;
-	out_pixel->r = (u16)src_pixels[0][2] * 257;
-	out_pixel->g = (u16)src_pixels[0][1] * 257;
-	out_pixel->b = (u16)src_pixels[0][0] * 257;
+	out_pixel->r = (u16)src_pixel[2] * 257;
+	out_pixel->g = (u16)src_pixel[1] * 257;
+	out_pixel->b = (u16)src_pixel[0] * 257;
 }
 
-static void ARGB16161616_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
-				     enum drm_color_encoding encoding, enum drm_color_range range)
+static void ARGB16161616_to_argb_u16(struct vkms_frame_info *frame_info, int x, int y,
+				     struct pixel_argb_u16 *out_pixel,
+				     enum drm_color_encoding encoding,
+				     enum drm_color_range range)
 {
-	u16 *pixels = (u16 *)src_pixels[0];
+	u8 *src_pixel = packed_pixels_addr(frame_info, x, y, 0);
+	u16 *pixels = (u16 *)src_pixel;
 
 	out_pixel->a = le16_to_cpu(pixels[3]);
 	out_pixel->r = le16_to_cpu(pixels[2]);
@@ -91,10 +100,13 @@ static void ARGB16161616_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out
 	out_pixel->b = le16_to_cpu(pixels[0]);
 }
 
-static void XRGB16161616_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
-				     enum drm_color_encoding encoding, enum drm_color_range range)
+static void XRGB16161616_to_argb_u16(struct vkms_frame_info *frame_info, int x, int y,
+				     struct pixel_argb_u16 *out_pixel,
+				     enum drm_color_encoding encoding,
+				     enum drm_color_range range)
 {
-	u16 *pixels = (u16 *)src_pixels[0];
+	u8 *src_pixel = packed_pixels_addr(frame_info, x, y, 0);
+	u16 *pixels = (u16 *)src_pixel;
 
 	out_pixel->a = (u16)0xffff;
 	out_pixel->r = le16_to_cpu(pixels[2]);
@@ -102,10 +114,12 @@ static void XRGB16161616_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out
 	out_pixel->b = le16_to_cpu(pixels[0]);
 }
 
-static void RGB565_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
-			       enum drm_color_encoding encoding, enum drm_color_range range)
+static void RGB565_to_argb_u16(struct vkms_frame_info *frame_info, int x, int y,
+			       struct pixel_argb_u16 *out_pixel, enum drm_color_encoding encoding,
+			       enum drm_color_range range)
 {
-	u16 *pixels = (u16 *)src_pixels[0];
+	u8 *src_pixel = packed_pixels_addr(frame_info, x, y, 0);
+	u16 *pixels = (u16 *)src_pixel;
 
 	s64 fp_rb_ratio = drm_fixp_div(drm_int2fixp(65535), drm_int2fixp(31));
 	s64 fp_g_ratio = drm_fixp_div(drm_int2fixp(65535), drm_int2fixp(63));
@@ -121,12 +135,19 @@ static void RGB565_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel
 	out_pixel->b = drm_fixp2int_round(drm_fixp_mul(fp_b, fp_rb_ratio));
 }
 
+/**
+ * ycbcr2rgb() - helper to convert ycbcr 8 bits to rbg 16 bits
+ *
+ * @m: conversion matrix to use
+ * @y, @cb, @cr: component of the ycbcr pixel
+ * @r, @g, @b: pointers to write the rbg pixel
+ */
 static void ycbcr2rgb(const s16 m[3][3], u8 y, u8 cb, u8 cr, u8 y_offset,
		      u8 *r, u8 *g, u8 *b)
 {
 	s32 y_16, cb_16, cr_16;
 	s32 r_16, g_16, b_16;
 
-	y_16 = y - y_offset;
+	y_16  = y - y_offset;
 	cb_16 = cb - 128;
 	cr_16 = cr - 128;
 
@@ -139,6 +160,14 @@ static void ycbcr2rgb(const s16 m[3][3], u8 y, u8 cb, u8 cr, u8 y_offset, u8 *r,
 	*b = clamp(b_16, 0, 0xffff) >> 8;
 }
 
+/**
+ * yuv_u8_to_argb_u16() - helper to convert yuv 8 bits to argb 16 bits
+ *
+ * @argb_u16: pointer to write the converted pixel
+ * @yuv_u8: pointer to the source yuv pixel
+ * @encoding: encoding to use
+ * @range: range to use
+ */
 VISIBLE_IF_KUNIT void yuv_u8_to_argb_u16(struct pixel_argb_u16 *argb_u16,
					 const struct pixel_yuv_u8 *yuv_u8,
					 enum drm_color_encoding encoding,
@@ -205,104 +234,89 @@ VISIBLE_IF_KUNIT void yuv_u8_to_argb_u16(struct pixel_argb_u16 *argb_u16,
 }
 EXPORT_SYMBOL_IF_KUNIT(yuv_u8_to_argb_u16);
 
-static void semi_planar_yuv_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
+static void semi_planar_yuv_to_argb_u16(struct vkms_frame_info *frame_info, int x,
+					int y, struct pixel_argb_u16 *out_pixel,
 					enum drm_color_encoding encoding,
 					enum drm_color_range range)
 {
 	struct pixel_yuv_u8 yuv_u8;
 
-	yuv_u8.y = src_pixels[0][0];
-	yuv_u8.u = src_pixels[1][0];
-	yuv_u8.v = src_pixels[1][1];
+	/* Subsampling must be applied for semi-planar yuv formats */
+	int vsub = frame_info->fb->format->vsub;
+	int hsub = frame_info->fb->format->hsub;
+
+	u8 *src_y = packed_pixels_addr(frame_info, x, y, 0);
+	u8 *src_uv = packed_pixels_addr(frame_info, x / hsub, y / vsub, 1);
+
+	yuv_u8.y = src_y[0];
+	yuv_u8.u = src_uv[0];
+	yuv_u8.v = src_uv[1];
 
 	yuv_u8_to_argb_u16(out_pixel, &yuv_u8, encoding, range);
 }
 
-static void semi_planar_yvu_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
+static void semi_planar_yvu_to_argb_u16(struct vkms_frame_info *frame_info, int x,
+					int y, struct pixel_argb_u16 *out_pixel,
 					enum drm_color_encoding encoding,
 					enum drm_color_range range)
 {
 	struct pixel_yuv_u8 yuv_u8;
 
-	yuv_u8.y = src_pixels[0][0];
-	yuv_u8.v = src_pixels[1][0];
-	yuv_u8.u = src_pixels[1][1];
+	/* Subsampling must be applied for semi-planar yuv formats */
+	int vsub = frame_info->fb->format->vsub ? frame_info->fb->format->vsub : 1;
+	int hsub = frame_info->fb->format->hsub ? frame_info->fb->format->hsub : 1;
 
+	u8 *src_y = packed_pixels_addr(frame_info, x, y, 0);
+	u8 *src_vu = packed_pixels_addr(frame_info, x / hsub, y / vsub, 1);
+
+	yuv_u8.y = src_y[0];
+	yuv_u8.v = src_vu[0];
+	yuv_u8.u = src_vu[1];
 
 	yuv_u8_to_argb_u16(out_pixel, &yuv_u8, encoding, range);
 }
 
-static void planar_yuv_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
-				   enum drm_color_encoding encoding, enum drm_color_range range)
+static void planar_yuv_to_argb_u16(struct vkms_frame_info *frame_info, int x, int y,
+				   struct pixel_argb_u16 *out_pixel,
+				   enum drm_color_encoding encoding,
+				   enum drm_color_range range)
 {
 	struct pixel_yuv_u8 yuv_u8;
 
-	yuv_u8.y = src_pixels[0][0];
-	yuv_u8.u = src_pixels[1][0];
-	yuv_u8.v = src_pixels[2][0];
+	/* Subsampling must be applied for planar yuv formats */
+	int vsub = frame_info->fb->format->vsub ? frame_info->fb->format->vsub : 1;
+	int hsub = frame_info->fb->format->hsub ? frame_info->fb->format->hsub : 1;
+
+	u8 *src_y = packed_pixels_addr(frame_info, x, y, 0);
+	u8 *src_u = packed_pixels_addr(frame_info, x / hsub, y / vsub, 1);
+	u8 *src_v = packed_pixels_addr(frame_info, x / hsub, y / vsub, 2);
+
+	yuv_u8.y = src_y[0];
+	yuv_u8.u = src_u[0];
+	yuv_u8.v = src_v[0];
 
 	yuv_u8_to_argb_u16(out_pixel, &yuv_u8, encoding, range);
 }
 
-static void planar_yvu_to_argb_u16(u8 **src_pixels, struct pixel_argb_u16 *out_pixel,
-				   enum drm_color_encoding encoding, enum drm_color_range range)
+static void planar_yvu_to_argb_u16(struct vkms_frame_info *frame_info, int x, int y,
+				   struct pixel_argb_u16 *out_pixel,
+				   enum drm_color_encoding encoding,
+				   enum drm_color_range range)
 {
 	struct pixel_yuv_u8 yuv_u8;
 
-	yuv_u8.y = src_pixels[0][0];
-	yuv_u8.v = src_pixels[1][0];
-	yuv_u8.u = src_pixels[2][0];
+	/* Subsampling must be applied for semi-planar yuv formats */
+	int vsub = frame_info->fb->format->vsub ? frame_info->fb->format->vsub : 1;
+	int hsub = frame_info->fb->format->hsub ? frame_info->fb->format->hsub : 1;
 
-	yuv_u8_to_argb_u16(out_pixel, &yuv_u8, encoding, range);
-}
+	u8 *src_y = packed_pixels_addr(frame_info, x, y, 0);
+	u8 *src_u = packed_pixels_addr(frame_info, x / hsub, y / vsub, 1);
+	u8 *src_v = packed_pixels_addr(frame_info, x / hsub, y / vsub, 2);
 
-/**
- * vkms_compose_row - compose a single row of a plane
- * @stage_buffer: output line with the composed pixels
- * @plane: state of the plane that is being composed
- * @y: y coordinate of the row
- *
- * This function composes a single row of a plane. It gets the source pixels
- * through the y coordinate (see get_packed_src_addr()) and goes linearly
- * through the source pixel, reading the pixels and converting it to
- * ARGB16161616 (see the pixel_read() callback). For rotate-90 and rotate-270,
- * the source pixels are not traversed linearly. The source pixels are queried
- * on each iteration in order to traverse the pixels vertically.
- */
-void vkms_compose_row(struct line_buffer *stage_buffer, struct vkms_plane_state *plane, int y)
-{
-	struct pixel_argb_u16 *out_pixels = stage_buffer->pixels;
-	struct vkms_frame_info *frame_info = plane->frame_info;
-	const struct drm_format_info *frame_format = frame_info->fb->format;
-	int limit = min_t(size_t, drm_rect_width(&frame_info->dst), stage_buffer->n_pixels);
-	u8 *src_pixels[DRM_FORMAT_MAX_PLANES];
-
-	enum drm_color_encoding encoding = plane->base.base.color_encoding;
-	enum drm_color_range range = plane->base.base.color_range;
-
-	for (size_t i = 0; i < frame_format->num_planes; i++)
-		src_pixels[i] = get_packed_src_addr(frame_info, y, i);
-
-	for (size_t x = 0; x < limit; x++) {
-		int x_pos = get_x_position(frame_info, limit, x);
-
-		bool shoud_inc = !((x + 1) % frame_format->num_planes);
-
-		if (drm_rotation_90_or_270(frame_info->rotation)) {
-			for (size_t i = 0; i < frame_format->num_planes; i++) {
-				src_pixels[i] = get_packed_src_addr(frame_info,
-								    x + frame_info->rotated.y1, i);
-				if (!i || shoud_inc)
-					src_pixels[i] += frame_format->cpp[i] * y;
-			}
-		}
-
-		plane->pixel_read(src_pixels, &out_pixels[x_pos], encoding, range);
-
-		for (size_t i = 0; i < frame_format->num_planes; i++) {
-			if (!i || shoud_inc)
-				src_pixels[i] += frame_format->cpp[i];
-		}
-	}
+	yuv_u8.y = src_y[0];
+	yuv_u8.v = src_u[0];
+	yuv_u8.u = src_v[0];
+
+	yuv_u8_to_argb_u16(out_pixel, &yuv_u8, encoding, range);
 }
 
 /*
diff --git a/drivers/gpu/drm/vkms/vkms_plane.c b/drivers/gpu/drm/vkms/vkms_plane.c
index 932736fc3ee9..d818ed246a46 100644
--- a/drivers/gpu/drm/vkms/vkms_plane.c
+++ b/drivers/gpu/drm/vkms/vkms_plane.c
@@ -118,13 +118,20 @@ static void vkms_plane_atomic_update(struct drm_plane *plane,
 		return;
 
 	fmt = fb->format->format;
+	pixel_read_t pixel_read = get_pixel_conversion_function(fmt);
+
+	if (!pixel_read) {
+		DRM_WARN("The requested pixel format is not supported by VKMS. The new state is "
+			 "not applied.\n");
+		return;
+	}
+
 	vkms_plane_state = to_vkms_plane_state(new_state);
 	shadow_plane_state = &vkms_plane_state->base;
 
 	frame_info = vkms_plane_state->frame_info;
 	memcpy(&frame_info->src, &new_state->src, sizeof(struct drm_rect));
 	memcpy(&frame_info->dst, &new_state->dst, sizeof(struct drm_rect));
-	memcpy(&frame_info->rotated, &new_state->dst, sizeof(struct drm_rect));
 	frame_info->fb = fb;
 	memcpy(&frame_info->map, &shadow_plane_state->data, sizeof(frame_info->map));
 	drm_framebuffer_get(frame_info->fb);
@@ -134,10 +141,8 @@ static void vkms_plane_atomic_update(struct drm_plane *plane,
 							  DRM_MODE_REFLECT_X |
 							  DRM_MODE_REFLECT_Y);
 
-	drm_rect_rotate(&frame_info->rotated, drm_rect_width(&frame_info->rotated),
-			drm_rect_height(&frame_info->rotated), frame_info->rotation);
-
-	vkms_plane_state->pixel_read = get_pixel_conversion_function(fmt);
+	vkms_plane_state->pixel_read = pixel_read;
 }
 
 static int vkms_plane_atomic_check(struct drm_plane *plane,
Change the composition algorithm to iterate over pixels instead of lines.
This allows simpler handling of rotation and of pixel access for complex
formats.

The new algorithm gives the read_pixel functions access to the x/y
coordinates, which makes it possible to read the correct values in a
block when block_w and block_h are not 1.
Iterating pixel-by-pixel in a single function also simplifies the
handling of rotation with the drm_rect_* helpers. This way there is no
longer a need for mysterious switch-cases scattered across multiple
places.

This patch is tested with IGT:
- kms_plane@pixel_format
- kms_plane@pixel-format-source-clamping
- kms_plane@plane-position-covered
- kms_plane@plane-position-hole
- kms_plane@plane-position-hole-dpms
- kms_plane@plane-panning-top-left
- kms_plane@plane-panning-bottom-right
- kms_plane@test-odd-size-with-yuv - See [1]
- kms_cursor_crc@cursor-size-change
- kms_cursor_crc@cursor-alpha-opaque
- kms_cursor_crc@cursor-alpha-transparent
- kms_cursor_crc@cursor-dpms
- kms_cursor_crc@cursor-onscreen-*
- kms_cursor_crc@cursor-offscreen-*
- kms_cursor_crc@cursor-sliding-*
- kms_cursor_crc@cursor-random-*
- kms_cursor_crc@cursor-rapid-movement-* - FAIL, but with Overflow of
  CRC buffer
- kms_rotation_crc@primary-rotation-* - See [1]
- kms_rotation_crc@sprite-rotation-* - See [1]
- kms_rotation_crc@cursor-rotation-* - See [1]
- kms_rotation_crc@sprite-rotation-90-pos-100-0 - See [1]
- kms_rotation_crc@multiplane-rotation - See [1]
- kms_rotation_crc@multiplane-rotation-cropping-* - See [1]

[1]: https://lore.kernel.org/igt-dev/20240201-kms_tests-v1-0-bc34c5d28b3f@bootlin.com/T/#t

Signed-off-by: Louis Chauvet <louis.chauvet@bootlin.com>
---
 drivers/gpu/drm/vkms/vkms_composer.c |  97 +++++++++-----
 drivers/gpu/drm/vkms/vkms_drv.h      |  21 ++-
 drivers/gpu/drm/vkms/vkms_formats.c  | 250 ++++++++++++++++++-----------------
 drivers/gpu/drm/vkms/vkms_plane.c    |  13 +-
 4 files changed, 221 insertions(+), 160 deletions(-)
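The block handling the commit message refers to (reading the correct values when block_w and block_h are not 1) comes down to an offset computation that can be sketched as follows. This is an illustrative userspace model only, not the kernel code: `struct fmt_info` and `block_pixel_offset` are made-up names that loosely mirror drm_format_info's block_w/block_h/char_per_block fields:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified format descriptor for one pixel plane. */
struct fmt_info {
	int block_w;        /* pixels per block, horizontally */
	int block_h;        /* pixels per block, vertically */
	int char_per_block; /* bytes per block */
};

static size_t block_pixel_offset(const struct fmt_info *fmt, size_t plane_offset,
				 size_t pitch, int x, int y)
{
	/* A block covers block_w x block_h pixels, so the pixel coordinates
	 * are first converted to block coordinates, then scaled by the
	 * block's byte footprint (pitch per block row, char_per_block per
	 * block column). For 1x1-block formats this degenerates to the
	 * classic y * pitch + x * cpp. */
	return plane_offset +
	       (size_t)(y / fmt->block_h) * pitch +
	       (size_t)(x / fmt->block_w) * fmt->char_per_block;
}
```

The returned offset points at the containing block, not necessarily at the pixel itself; for true block formats the caller still has to extract the wanted pixel from within the block.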