I.MX6 HDMI support in v4.2

Message ID	m3bnd52y4s.fsf@t19.piap.pl (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org> From: khalasa@piap.pl (Krzysztof =?utf-8?Q?Ha=C5=82asa?=) To: linux-arm-kernel <linux-arm-kernel@lists.infradead.org> Subject: Re: I.MX6 HDMI support in v4.2 References: <m3bnde5whx.fsf@t19.piap.pl> <20150907112555.GS21084@n2100.arm.linux.org.uk> <m337yp5gvv.fsf@t19.piap.pl> <1441709778.13536.8.camel@pengutronix.de> Date: Mon, 14 Sep 2015 10:39:15 +0200 In-Reply-To: <1441709778.13536.8.camel@pengutronix.de> (Lucas Stach's message of "Tue, 08 Sep 2015 12:56:18 +0200") Message-ID: <m3bnd52y4s.fsf@t19.piap.pl> MIME-Version: 1.0 bases: 2015/09/14 02:23:00 #6478098 Precedence: list Cc: Russell King - ARM Linux <linux@arm.linux.org.uk>, Lucas Stach <l.stach@pengutronix.de> Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Sender: "linux-arm-kernel" <linux-arm-kernel-bounces@lists.infradead.org> Errors-To: linux-arm-kernel-bounces+patchwork-linux-arm=patchwork.kernel.org@lists.infradead.org

Krzysztof Hałasa Sept. 14, 2015, 8:39 a.m. UTC

Another round of tests, I noticed the new git versions :-)
Testing Linux v4.2 + PLL5 DTS patch (for HDMI output with enabled LVDS).
Using mplayer with YUV420 (DRM Xvideo would probably work with packed
UYVY-alike formats but I need YUV420 because H.264 decoder produces it).
The driver is git://ftp.arm.linux.org.uk/~rmk/xf86-video-armada.git,
branch unstable-devel, and it uses
git://ftp.arm.linux.org.uk/~rmk/libdrm-armada.git/.

	IMX DRM Xvideo output:

Only unscaled video: no color (luminance is good but the color
components are green). The driver doesn't use color information.
I hope this is easily fixable.


	rmk/drm-etnaviv-devel:

With unscaled video, the only visible problem is tearing in the middle
of the screen (unability to sync with screen refresh).

With scaling, I'm getting horizontal lines (mostly visible when the
scene changes) and some sort of stalls - sometimes two frames are
alternating for few seconds then it goes forward.


	pengutronix/etnaviv-for-upstream:
No etnaviv Xvideo:
(==) armada(0): Backing store enabled
(==) armada(0): Silken mouse enabled
(EE) armada(0): etnaviv: unable to open: Not a directory
(WW) armada(0): [drm] Vivante initialization failed, running unaccelerated

I assume I need a newer something.


	pengutronix/v4.2/topic/etnaviv-for-rmk:
It requires:


Likewise, no etnaviv XVideo:
(--) armada(0): Vivante GC320 GPU revision 5007 (etnaviv) 2d PE2.0
(EE) armada(0): etnaviv: unable to create context: (null)
(WW) armada(0): [drm] Vivante initialization failed, running unaccelerated


Comments?

Krzysztof Hałasa Sept. 15, 2015, 8:24 a.m. UTC | #1

> Testing Linux v4.2 + PLL5 DTS patch (for HDMI output with enabled LVDS).
> Using mplayer with YUV420 (DRM Xvideo would probably work with packed
> UYVY-alike formats but I need YUV420 because H.264 decoder produces it).
> The driver is git://ftp.arm.linux.org.uk/~rmk/xf86-video-armada.git,
> branch unstable-devel, and it uses
> git://ftp.arm.linux.org.uk/~rmk/libdrm-armada.git/.
>
> 	IMX DRM Xvideo output:
>
> Only unscaled video: no color (luminance is good but the color
> components are green). The driver doesn't use color information.

1024 x 768 for now. I noticed a strange thing: XvShmPutImage() usually
takes (much) less than 4 milliseconds, but every 315 frames, it takes
much longer (when started, it takes 1259 frames). The following is with
constant image, i.e., not altered between frames. Source attached (one
may need to change xv_port variable).

Now what?

XVideo: using adaptor #0, port 85
Image ID 30323449 1024x768 data_size 1179648 num_planes 3 pitches 1024 512 512 data 0x76b86000
Frame#  frame count from the last such event
  vvvv   vvvv
  1259   1259: Took 0.665640667 s, avg 0.000530765 s
  1574    315: Took 0.501529334 s, avg 0.000743380 s
  1889    315: Took 0.496656667 s, avg 0.000882432 s
  2204    315: Took 0.496726667 s, avg 0.000981782 s
  2519    315: Took 0.496672333 s, avg 0.001056277 s
  2834    315: Took 0.496910000 s, avg 0.001114301 s
  3149    315: Took 0.497260667 s, avg 0.001160832 s
  3464    315: Took 0.499651334 s, avg 0.001199600 s
  3779    315: Took 0.495202334 s, avg 0.001230718 s
  4094    315: Took 0.499449667 s, avg 0.001258081 s
  4409    315: Took 0.525910333 s, avg 0.001287535 s
  4724    315: Took 0.510400000 s, avg 0.001309786 s
  5039    315: Took 0.541753334 s, avg 0.001335467 s
  5354    315: Took 0.537526666 s, avg 0.001357350 s
  5669    315: Took 0.542006000 s, avg 0.001377594 s
  5984    315: Took 0.539530333 s, avg 0.001395288 s
  6299    315: Took 0.541836000 s, avg 0.001411578 s
  6614    315: Took 0.541556333 s, avg 0.001426275 s
  6929    315: Took 0.543209666 s, avg 0.001439874 s
  7244    315: Took 0.540764000 s, avg 0.001451956 s
  7559    315: Took 0.512616666 s, avg 0.001459303 s
  7874    315: Took 0.541380000 s, avg 0.001469720 s
  8189    315: Took 0.535251333 s, avg 0.001478584 s
  8504    315: Took 0.536773334 s, avg 0.001486975 s
  8819    315: Took 0.521379000 s, avg 0.001493018 s
  9134    315: Took 0.492112667 s, avg 0.001495438 s
  9449    315: Took 0.616766000 s, avg 0.001510886 s
  9764    315: Took 0.492945666 s, avg 0.001512657 s
 10079    315: Took 0.492615000 s, avg 0.001514286 s
 10394    315: Took 0.493068000 s, avg 0.001515858 s
 10709    315: Took 0.496838000 s, avg 0.001517693 s
 11024    315: Took 0.516352333 s, avg 0.001521193 s
 11339    315: Took 0.500899333 s, avg 0.001523138 s
 11654    315: Took 0.494034334 s, avg 0.001524386 s
 11969    315: Took 0.492401000 s, avg 0.001525430 s
 12284    315: Took 0.505440000 s, avg 0.001527485 s
 12599    315: Took 0.511242666 s, avg 0.001529898 s
 12914    315: Took 0.493489333 s, avg 0.001530816 s

With etnaviv Xvideo, the effect is similar but much worse:

> 	rmk/drm-etnaviv-devel:
>
> With unscaled video, the only visible problem is tearing in the middle
> of the screen (unability to sync with screen refresh).

XVideo: using adaptor #1, port 90
Image ID 30323449 1024x768 data_size 1179648 num_planes 3 pitches 1024 512 512 data 0x76b8d000
  1259   1259: Took 18.864969669 s, avg 0.014974047 s
  1574    315: Took 20.758470002 s, avg 0.025159601 s
  1889    315: Took 20.720458336 s, avg 0.031929827 s
  2204    315: Took 20.831572669 s, avg 0.036815999 s
  2519    315: Took 20.807073336 s, avg 0.040470906 s
  2834    315: Took 20.789431003 s, avg 0.043307485 s
  3149    315: Took 20.809190003 s, avg 0.045582932 s
  3464    315: Took 20.803870003 s, avg 0.047443189 s
  3779    315: Took 20.798422002 s, avg 0.048991905 s
  4094    315: Took 20.809182336 s, avg 0.050304985 s
  4409    315: Took 20.812659336 s, avg 0.051431326 s
  4724    315: Took 20.816828669 s, avg 0.052408363 s
  5039    315: Took 20.814890002 s, avg 0.053262894 s
  5354    315: Took 20.803809003 s, avg 0.054014822 s
  5669    315: Took 20.798233669 s, avg 0.054682176 s
  5984    315: Took 20.809503336 s, avg 0.055281168 s
  6299    315: Took 20.814339002 s, avg 0.055821065 s
  6614    315: Took 20.812699003 s, avg 0.056309294 s
  6929    315: Took 20.800363670 s, avg 0.056751361 s
  7244    315: Took 20.814518669 s, avg 0.057156941 s
  7559    315: Took 20.814902002 s, avg 0.057528773 s
  7874    315: Took 20.814914336 s, avg 0.057870859 s
  8189    315: Took 20.803919669 s, avg 0.058185288 s
  8504    315: Took 20.807010336 s, avg 0.058476760 s
  8819    315: Took 20.855880669 s, avg 0.058752983 s
  9134    315: Took 20.809301669 s, avg 0.059005029 s
  9449    315: Took 26.004653003 s, avg 0.059790071 s
  9764    315: Took 20.796989002 s, avg 0.059991163 s
 10079    315: Took 20.798379670 s, avg 0.060179803 s
 10394    315: Took 20.809240335 s, avg 0.060358055 s
 10709    315: Took 20.814655003 s, avg 0.060526351 s
 11024    315: Took 20.803836002 s, avg 0.060684046 s

Russell King - ARM Linux Sept. 15, 2015, 10:12 a.m. UTC | #2

On Tue, Sep 15, 2015 at 10:24:55AM +0200, Krzysztof Ha?asa wrote:
> 1024 x 768 for now. I noticed a strange thing: XvShmPutImage() usually
> takes (much) less than 4 milliseconds, but every 315 frames, it takes
> much longer (when started, it takes 1259 frames). The following is with
> constant image, i.e., not altered between frames. Source attached (one
> may need to change xv_port variable).

Testing on Dove and Armada DRM, I've had to make several changes to your
program:

1. Don't hard code the Xv port ID (worth mentioning somewhere so people
   testing it don't spend something like a quarter of an hour trying to
   debug why it doesn't work.)

2. Added _GNU_SOURCE at the top to get various structs and functions
   that the program uses.

3. Added a line detailing the GCC flags to be used

4. A description of what the program is doing would be nice.  For the
   record, it appears to try and call XvShmPutImage() as quickly as
   possible, recording the absolute time between calls, and reporting
   when it is more than 4ms.

5. Added a call to XSync(display, 1) after the call to XvShmPutImage()
   to ensure that the XvShmPutImage() gets pushed to the X server in a
   timely manner, doesn't sit in the applications X queue, and ensure
   that we wait for the X server to respond.

6. Extended the 4ms timeout to something sane, like 17ms.  If we're
   talking about synchronising to the vertical sync, then the period
   when we want to complain is when it takes much longer than that.

   If the display is synchronised to the vertical sync to avoid tearing,
   then you can only update the display once per vertical sync, and at
   60Hz, that's about 17ms.  At 50Hz, that's 20ms.

Now, running on Dove overlay, I see about 17ms being the average time,
which is slightly longer than the vsync.

Further analysis with strace of the X server (which pushes things up to
20ms average time):
- The X server takes about 1ms between select() indicating that there's
  an event available, and reading the event.
- It takes 14ms from reading the event to the DDX calling the DRM ioctl
  to display the image - mostly spent memcpy()'ing the image.
- 900us to return the response to the X client
- The X server then takes about 4ms to get back to calling select().

The big chunk of that is the memcpy() - that's something which can't be
eliminated for several reasons:
- With XvShm, the image buffer is shared between the X client and the
  X server.  The X server has no control over how the X client manipulates
  that buffer after the X server has returned from its call - however the
  displayed image must remain stable and free from manipulation until the
  next PutImage call.
- The X shm buffer can't be passed directly into DRM - DRM needs the
  image to be displayed to be in DRM buffer objects, which are then
  wrapped as frame buffers (a frame buffer can contain several buffer
  objects, one for each plane), and lastly the overlay plane updated
  with the new framebuffer.

Let's not forget that Dove hardware is slower than iMX6, so I think that
iMX6 should manage to comfortably achieve the 16.6ms required for
frame-by-frame display from your program.

Now, as for using the GPU, X server analysis:
- Again about 1ms between select() and read() of the event
- 1ms to the first WAIT_VBLANK ioctl on the display adapter to read the
  last VSYNC time
- 200us later to map the user pointer
- 300us to the WAIT_VBLANK ioctl to wait for the vblank (which takes in this
  instance 3.5ms to fire)
- 130us to first attempt to submit the GPU queue, which returns -EAGAIN
  after 6.4ms
- second attempt takes 9.5ms to complete successfully
- 1ms (with intervening SIGALRM handler) to wait for the GPU to finish,
  which takes 11.5ms
- 450us to ask DRM to release the user pointer buffer, which takes 9.4ms
- 1ms (with intervening SIGALRM handler) to report completion of the operation
- The X server then takes about 4ms to get back to calling select().

So, you can see that using the GPU is a much heavier and complex operation
all round.  The long delays in mapping and releasing the buffer are of
course the DMA mapping/unmapping of the buffer object to ensure that the
GPU can see the data in those buffers.

What's also notable is that it takes the GPU on the order of 10ms to do the
operation - it's actually two operations: a vertical filter blit followed
by a horizontal filter blit.  None of the Vivante GPUs that I've access to
support a single-pass filter blit scaling in both directions (the feature
bit for that is clear on both Dove's GC600 and iMX6 GC320 GPUs.)  I
suspect a Vivante GPU supporting that would take around half the time,
or better.

Of course, the time each of these operations take scales with the image
size, so an image half the width and height (which will be a quarter of
the size) will take a quarter of the time, and will be comfortably
within 17ms.

Krzysztof Hałasa Sept. 15, 2015, 11:01 a.m. UTC | #3

Russell King - ARM Linux <linux@arm.linux.org.uk> writes:

>> 1024 x 768 for now. I noticed a strange thing: XvShmPutImage() usually
>> takes (much) less than 4 milliseconds, but every 315 frames, it takes
>> much longer (when started, it takes 1259 frames). The following is with
>> constant image, i.e., not altered between frames. Source attached (one
>> may need to change xv_port variable).
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

> 1. Don't hard code the Xv port ID (worth mentioning somewhere so people
>    testing it don't spend something like a quarter of an hour trying to
>    debug why it doesn't work.)

Well, I actually mentioned it.


So this means the delay was only "grouped" by the X protocol, and it
won't be magically faster.

> Now, running on Dove overlay, I see about 17ms being the average time,
> which is slightly longer than the vsync.

> Let's not forget that Dove hardware is slower than iMX6, so I think that
> iMX6 should manage to comfortably achieve the 16.6ms required for
> frame-by-frame display from your program.

Interestingly, i.MX6 seems to achive about 3.2 ms, which is much faster.
If not for the missing color planes (and scaling), it would be perfect.
(and I only need 30 FPS).

> Now, as for using the GPU, X server analysis:
> - Again about 1ms between select() and read() of the event
> - 1ms to the first WAIT_VBLANK ioctl on the display adapter to read the
>   last VSYNC time
> - 200us later to map the user pointer
> - 300us to the WAIT_VBLANK ioctl to wait for the vblank (which takes in this
>   instance 3.5ms to fire)
> - 130us to first attempt to submit the GPU queue, which returns -EAGAIN
>   after 6.4ms
> - second attempt takes 9.5ms to complete successfully
> - 1ms (with intervening SIGALRM handler) to wait for the GPU to finish,
>   which takes 11.5ms
> - 450us to ask DRM to release the user pointer buffer, which takes 9.4ms
> - 1ms (with intervening SIGALRM handler) to report completion of the operation
> - The X server then takes about 4ms to get back to calling select().

Sure, I know it has to be slower.

So this takes (on Dove) about 50 ms (at 1024x768), thus is way too slow
to display video at this (or higher) resolution and frame rate.
On i.MX6q, it takes about 66 - 74 ms, depending on actual clock speed
(frequency scaling, up to 1 GHz).

Thanks for testing.

Russell King - ARM Linux Sept. 15, 2015, 2:29 p.m. UTC | #4

On Tue, Sep 15, 2015 at 01:01:25PM +0200, Krzysztof Ha?asa wrote:
> Russell King - ARM Linux <linux@arm.linux.org.uk> writes:
> > Let's not forget that Dove hardware is slower than iMX6, so I think that
> > iMX6 should manage to comfortably achieve the 16.6ms required for
> > frame-by-frame display from your program.
> 
> Interestingly, i.MX6 seems to achive about 3.2 ms, which is much faster.
> If not for the missing color planes (and scaling), it would be perfect.
> (and I only need 30 FPS).

That means iMX6 isn't synchronising to the vertical sync, so would be
subject to tearing.

> 
> > Now, as for using the GPU, X server analysis:
> > - Again about 1ms between select() and read() of the event
> > - 1ms to the first WAIT_VBLANK ioctl on the display adapter to read the
> >   last VSYNC time
> > - 200us later to map the user pointer
> > - 300us to the WAIT_VBLANK ioctl to wait for the vblank (which takes in this
> >   instance 3.5ms to fire)
> > - 130us to first attempt to submit the GPU queue, which returns -EAGAIN
> >   after 6.4ms
> > - second attempt takes 9.5ms to complete successfully
> > - 1ms (with intervening SIGALRM handler) to wait for the GPU to finish,
> >   which takes 11.5ms
> > - 450us to ask DRM to release the user pointer buffer, which takes 9.4ms
> > - 1ms (with intervening SIGALRM handler) to report completion of the operation
> > - The X server then takes about 4ms to get back to calling select().
> 
> Sure, I know it has to be slower.
> 
> So this takes (on Dove) about 50 ms (at 1024x768), thus is way too slow
> to display video at this (or higher) resolution and frame rate.
> On i.MX6q, it takes about 66 - 74 ms, depending on actual clock speed
> (frequency scaling, up to 1 GHz).

It's interesting that iMX6 takes longer - without analysing what's going
on, I'd guess that will be because the GC320's memory bandwidth is
throttled compared to Dove.

iMX6 seems to have a problem with the bus arbitration in that each master
which can access DDR is given a fixed timeslot, whether or not there are
any other masters wanting access.  Performance measurement with 64-bit
DDR has shown that out of iMX6Q,D,DL,S the iMX6DL performs the best for
raw IO as there are fewer masters (fewer CPUs and fewer peripherals
wanting DDR access), so the timeslots are bigger.

I've found that it is very difficult to get iMX6's GPUs to out-perform
neon optimised libpixman in terms of performance measurements - and I
think the reason for this is because both Neon optimised libpixman and
the GPUs are running up against the same bottleneck - the available
DDR bandwidth.  However, what you do get is CPU offload of that effort
- filling one of those non-CPU DDR access timeslots with useful work,
freeing up the CPU DDR timeslot to do other tasks.

I have no idea how much truth there is to this, this is all based upon
performance measurement, and drawing conclusions from those measurements.

Lucas Stach Sept. 15, 2015, 3:53 p.m. UTC | #5

Am Montag, den 14.09.2015, 10:39 +0200 schrieb Krzysztof Ha?asa:
> Another round of tests, I noticed the new git versions :-)
> Testing Linux v4.2 + PLL5 DTS patch (for HDMI output with enabled
> LVDS).
> Using mplayer with YUV420 (DRM Xvideo would probably work with packed
> UYVY-alike formats but I need YUV420 because H.264 decoder produces
> it).
> The driver is git://ftp.arm.linux.org.uk/~rmk/xf86-video-armada.git,
> branch unstable-devel, and it uses
> git://ftp.arm.linux.org.uk/~rmk/libdrm-armada.git/.
> 
> 	IMX DRM Xvideo output:
> 
> Only unscaled video: no color (luminance is good but the color
> components are green). The driver doesn't use color information.
> I hope this is easily fixable.
> 
There were patches for that some weeks ago already, I would hope they
got applied to some tree destinied for upstream. Will look at that and
ping people if necessary.

> 
> 	rmk/drm-etnaviv-devel:
> 
> With unscaled video, the only visible problem is tearing in the
> middle
> of the screen (unability to sync with screen refresh).
> 
> With scaling, I'm getting horizontal lines (mostly visible when the
> scene changes) and some sort of stalls - sometimes two frames are
> alternating for few seconds then it goes forward.
> 
> 
> 	pengutronix/etnaviv-for-upstream:
That is the right branch, containing the changes I sent out for review.

It includes a new kernel<->userspace interface, so needs a modified
xf86-video-armada to go with it. You can pick that up here:
git://git.pengutronix.de/git/lst/xf86-video-armada.git for-rmk

Caution: the armada code is WIP and needs further cleanups. It should
work though.

> No etnaviv Xvideo:
> (==) armada(0): Backing store enabled
> (==) armada(0): Silken mouse enabled
> (EE) armada(0): etnaviv: unable to open: Not a directory
> (WW) armada(0): [drm] Vivante initialization failed, running
> unaccelerated
> 
> I assume I need a newer something.
> 
> 
> 	pengutronix/v4.2/topic/etnaviv-for-rmk:

Please don't use that branch. It just there to keep stability for
people that already know about it.

Regards,
Lucas

Russell King - ARM Linux Sept. 15, 2015, 4:36 p.m. UTC | #6

On Tue, Sep 15, 2015 at 05:53:45PM +0200, Lucas Stach wrote:
> It includes a new kernel<->userspace interface, so needs a modified
> xf86-video-armada to go with it. You can pick that up here:
> git://git.pengutronix.de/git/lst/xf86-video-armada.git for-rmk
> 
> Caution: the armada code is WIP and needs further cleanups. It should
> work though.

And it needs updating to the latest version - you seem to be missing
a whole load of changes, including the changes which update the
kernel API.

So... you seem to be talking about the kernel API having changed from
the version which you picked up years ago, rather than the current
version.  It's really unclear what you're talking about.

Krzysztof Hałasa Sept. 15, 2015, 4:53 p.m. UTC | #7

Russell King - ARM Linux <linux@arm.linux.org.uk> writes:

>> Interestingly, i.MX6 seems to achive about 3.2 ms, which is much faster.
>> If not for the missing color planes (and scaling), it would be perfect.
>> (and I only need 30 FPS).
>
> That means iMX6 isn't synchronising to the vertical sync, so would be
> subject to tearing.

Right. However the bigger problem ATM is the lack of color, or actually
wrong colors (such as solid green or magenta etc). I'm looking if the
support can be added easily, this whole IPU stuff is a bit complicated.

> It's interesting that iMX6 takes longer - without analysing what's going
> on, I'd guess that will be because the GC320's memory bandwidth is
> throttled compared to Dove.

Well, one would think the bandwidths should be an order of magnitude
larger than needed here. The boards I have here (Ventana GW5xxx) are
using e.g. 4 * MT41K128M16JT-125 DDR3 chips, for a total of (if
configured correctly) 12 GB/s. Now, a single burst read (or write) at
1024x768 32 bit requires 3 MB, for 30 FPS it means 90 MB/s.
I realize the buffers may be copied many times, and the timeslots you
describe don't make it easier. 66 ms for a frame is rather long, though.

I guess, in this case, it would be better to convert YUV420 planar ->
YUV422 packed (perhaps with NEON - the packed mode(s) work correctly),
and use the more traditional overlay instead of the textured adapter.
Or maybe I will make this YUV420 mode work.

Lucas Stach Sept. 15, 2015, 4:53 p.m. UTC | #8

Hi Russell,

Am Dienstag, den 15.09.2015, 17:36 +0100 schrieb Russell King - ARM
Linux:
> On Tue, Sep 15, 2015 at 05:53:45PM +0200, Lucas Stach wrote:
> > It includes a new kernel<->userspace interface, so needs a modified
> > xf86-video-armada to go with it. You can pick that up here:
> > git://git.pengutronix.de/git/lst/xf86-video-armada.git for-rmk
> > 
> > Caution: the armada code is WIP and needs further cleanups. It
> > should
> > work though.
> 
> And it needs updating to the latest version - you seem to be missing
> a whole load of changes, including the changes which update the
> kernel API.
> 
> So... you seem to be talking about the kernel API having changed from
> the version which you picked up years ago, rather than the current
> version.  It's really unclear what you're talking about.
> 
What do you mean with missing a lot of stuff? The "for-rmk" branch in
that repo is your upstream "unstable-devel" branch minus the few XvBO
commits with my changes on top.

Regards,
Lucas

Krzysztof Hałasa Sept. 15, 2015, 4:57 p.m. UTC | #9

Lucas Stach <l.stach@pengutronix.de> writes:

>> Only unscaled video: no color (luminance is good but the color
>> components are green). The driver doesn't use color information.
>> I hope this is easily fixable.

> There were patches for that some weeks ago already, I would hope they
> got applied to some tree destinied for upstream. Will look at that and
> ping people if necessary.

Good, this probably resolves one of main problems.

Russell King - ARM Linux Sept. 15, 2015, 5:04 p.m. UTC | #10

On Tue, Sep 15, 2015 at 12:53:54PM -0400, Lucas Stach wrote:
> Hi Russell,
> 
> Am Dienstag, den 15.09.2015, 17:36 +0100 schrieb Russell King - ARM
> Linux:
> > On Tue, Sep 15, 2015 at 05:53:45PM +0200, Lucas Stach wrote:
> > > It includes a new kernel<->userspace interface, so needs a modified
> > > xf86-video-armada to go with it. You can pick that up here:
> > > git://git.pengutronix.de/git/lst/xf86-video-armada.git for-rmk
> > > 
> > > Caution: the armada code is WIP and needs further cleanups. It
> > > should
> > > work though.
> > 
> > And it needs updating to the latest version - you seem to be missing
> > a whole load of changes, including the changes which update the
> > kernel API.
> > 
> > So... you seem to be talking about the kernel API having changed from
> > the version which you picked up years ago, rather than the current
> > version.  It's really unclear what you're talking about.
> > 
> What do you mean with missing a lot of stuff? The "for-rmk" branch in
> that repo is your upstream "unstable-devel" branch minus the few XvBO
> commits with my changes on top.

Ah, probably because I haven't pushed the stuff out :)

Anyway, I've taken two of your patches, reworking the second slightly:

1fec728 etnaviv: fix compilation without DRI2 support
1663382 make sure that system strndup doesn't collide with X.Org server version

which are _both_ fixes, and would have been nice for them to have been
sent in a timely manner.

Looking at "etnaviv: fix XV resize for UYVY source format", this is wrong
to the documentation for the GC320.  Documentation which was around for
the GC320 indicates that the _only_ YUV format it supports as a target
is YUY2 with the vertical filter blit.  It's very specifically worded to
indicate that this is the only YUV target format which is supported.
What testing have you done to prove uyvy support, and on which GPUs?

"etnaviv: align vximage buffer size to pagesize" I have issues with as
well - mapping more memory than was passed to the X server is really not
permissible.  If we need the image size to be larger, we need to report
a larger buffer size - and in fact, that's already done.
QueryImageAttributes indicates that the image size is to be rounded
up to a page size.  The X server will reject smaller buffers at higher
levels.  So, this patch isn't required.

The last patch is your final one, I'll look at that at some point,
but for now I'm pushing out the tree as it now stands.

Lucas Stach Sept. 15, 2015, 7:01 p.m. UTC | #11

Am Dienstag, den 15.09.2015, 18:04 +0100 schrieb Russell King - ARM
Linux:
> On Tue, Sep 15, 2015 at 12:53:54PM -0400, Lucas Stach wrote:
> > Hi Russell,
> > 
> > Am Dienstag, den 15.09.2015, 17:36 +0100 schrieb Russell King - ARM
> > Linux:
> > > On Tue, Sep 15, 2015 at 05:53:45PM +0200, Lucas Stach wrote:
> > > > It includes a new kernel<->userspace interface, so needs a
> > > > modified
> > > > xf86-video-armada to go with it. You can pick that up here:
> > > > git://git.pengutronix.de/git/lst/xf86-video-armada.git for-rmk
> > > > 
> > > > Caution: the armada code is WIP and needs further cleanups. It
> > > > should
> > > > work though.
> > > 
> > > And it needs updating to the latest version - you seem to be
> > > missing
> > > a whole load of changes, including the changes which update the
> > > kernel API.
> > > 
> > > So... you seem to be talking about the kernel API having changed
> > > from
> > > the version which you picked up years ago, rather than the
> > > current
> > > version.  It's really unclear what you're talking about.
> > > 
> > What do you mean with missing a lot of stuff? The "for-rmk" branch
> > in
> > that repo is your upstream "unstable-devel" branch minus the few
> > XvBO
> > commits with my changes on top.
> 
> Ah, probably because I haven't pushed the stuff out :)
> 
> Anyway, I've taken two of your patches, reworking the second
> slightly:
> 
> 1fec728 etnaviv: fix compilation without DRI2 support
> 1663382 make sure that system strndup doesn't collide with X.Org
> server version
> 
> which are _both_ fixes, and would have been nice for them to have
> been
> sent in a timely manner.
> 
Sorry about that. I'll try to single out the fixes more quickly going
forward.

> Looking at "etnaviv: fix XV resize for UYVY source format", this is
> wrong
> to the documentation for the GC320.  Documentation which was around
> for
> the GC320 indicates that the _only_ YUV format it supports as a
> target
> is YUY2 with the vertical filter blit.  It's very specifically worded
> to
> indicate that this is the only YUV target format which is supported.
> What testing have you done to prove uyvy support, and on which GPUs?
> 
I've tested this on i.MX6Q hardware with some GStreamer pipelines using
the videotestsrc. Without this commit the GPU will write out only zerosin the first step, so the video image will show up as fully green after CSC.

I don't think I have that documentation you are talking about. Is this
doc under NDA?

Chapter "31.4.1.6 Filter BLT" of the i.MX6 RM seems to agree with you,
so this commit depends on unspecified behavior. But it fixes a real bug
where the GPU writes out only zeros to the intermediate buffer if the
source is UYVY, so videos show up as fully green after CSC. So this
needs more investigation.

> "etnaviv: align vximage buffer size to pagesize" I have issues with
> as
> well - mapping more memory than was passed to the X server is really
> not
> permissible.  If we need the image size to be larger, we need to
> report
> a larger buffer size - and in fact, that's already done.
> QueryImageAttributes indicates that the image size is to be rounded
> up to a page size.  The X server will reject smaller buffers at
> higher
> levels.  So, this patch isn't required.
> 
This does not seem to be be working. I've certainly seen GStreamer
pushing non-aligned buffers which cause the GEM import to fail.
Ignoring the real size may well be a GStreamer bug (I'll look into that
code), but the server doesn't seem to reject that.

> The last patch is your final one, I'll look at that at some point,
> but for now I'm pushing out the tree as it now stands.
> 
If you're not going to look at this too soonish I can rebase that on
top of what you just pushed out and mybe clean it up some more. But
certainly not this week, as I'm physically away from any test hardware.

Regards,
Lucas

Krzysztof Hałasa Sept. 16, 2015, 7:57 a.m. UTC | #12

Lucas Stach <l.stach@pengutronix.de> writes:

>> 	IMX DRM Xvideo output:
>> 
>> Only unscaled video: no color (luminance is good but the color
>> components are green). The driver doesn't use color information.
>> I hope this is easily fixable.
>> 
> There were patches for that some weeks ago already, I would hope they
> got applied to some tree destinied for upstream. Will look at that and
> ping people if necessary.

BTW: any pointer to the patches?

Lucas Stach Sept. 16, 2015, 3:52 p.m. UTC | #13

Am Mittwoch, den 16.09.2015, 09:57 +0200 schrieb Krzysztof Ha?asa:
> Lucas Stach <l.stach@pengutronix.de> writes:
> 
> > > 	IMX DRM Xvideo output:
> > > 
> > > Only unscaled video: no color (luminance is good but the color
> > > components are green). The driver doesn't use color information.
> > > I hope this is easily fixable.
> > > 
> > There were patches for that some weeks ago already, I would hope
> > they
> > got applied to some tree destinied for upstream. Will look at that
> > and
> > ping people if necessary.
> 
> BTW: any pointer to the patches?

I just digged a bit and it seems they didn't make it out. I've pinged
Philipp about that.

Regards,
Lucas

Lucas Stach Sept. 28, 2015, 2:48 p.m. UTC | #14

Am Dienstag, den 15.09.2015, 18:04 +0100 schrieb Russell King - ARM
Linux:
> On Tue, Sep 15, 2015 at 12:53:54PM -0400, Lucas Stach wrote:
> > Hi Russell,
> > 
> > Am Dienstag, den 15.09.2015, 17:36 +0100 schrieb Russell King - ARM
> > Linux:
> > > On Tue, Sep 15, 2015 at 05:53:45PM +0200, Lucas Stach wrote:
> > > > It includes a new kernel<->userspace interface, so needs a modified
> > > > xf86-video-armada to go with it. You can pick that up here:
> > > > git://git.pengutronix.de/git/lst/xf86-video-armada.git for-rmk
> > > > 
> > > > Caution: the armada code is WIP and needs further cleanups. It
> > > > should
> > > > work though.
> > > 
> > > And it needs updating to the latest version - you seem to be missing
> > > a whole load of changes, including the changes which update the
> > > kernel API.
> > > 
> > > So... you seem to be talking about the kernel API having changed from
> > > the version which you picked up years ago, rather than the current
> > > version.  It's really unclear what you're talking about.
> > > 
> > What do you mean with missing a lot of stuff? The "for-rmk" branch in
> > that repo is your upstream "unstable-devel" branch minus the few XvBO
> > commits with my changes on top.
> 
> Ah, probably because I haven't pushed the stuff out :)
> 
> Anyway, I've taken two of your patches, reworking the second slightly:
> 
> 1fec728 etnaviv: fix compilation without DRI2 support
> 1663382 make sure that system strndup doesn't collide with X.Org server version
> 
> which are _both_ fixes, and would have been nice for them to have been
> sent in a timely manner.
> 
> Looking at "etnaviv: fix XV resize for UYVY source format", this is wrong
> to the documentation for the GC320.  Documentation which was around for
> the GC320 indicates that the _only_ YUV format it supports as a target
> is YUY2 with the vertical filter blit.  It's very specifically worded to
> indicate that this is the only YUV target format which is supported.
> What testing have you done to prove uyvy support, and on which GPUs?
> 
> "etnaviv: align vximage buffer size to pagesize" I have issues with as
> well - mapping more memory than was passed to the X server is really not
> permissible.  If we need the image size to be larger, we need to report
> a larger buffer size - and in fact, that's already done.
> QueryImageAttributes indicates that the image size is to be rounded
> up to a page size.  The X server will reject smaller buffers at higher
> levels.  So, this patch isn't required.
> 
I've looked at that again and the issue here seems to be that GStreamer
is handing us a pointer to a buffer that isn't aligned to a page. The
buffers size is properly rounded up to a pagesize, as requested by
QueryImageAttributes, but if the buffer start pointer isn't aligned to a
page boundary we end up with an non-mappable buffer anyway.

Unfortunately there is no obvious way for a driver to request a minimum
alignment for the buffer. The only possible fix is for the client to
always align the buffer to a page boundary in hopes that this is enough
for the hardware to map it directly and allow to skip any unwanted
copying.

Regards,
Lucas

Russell King - ARM Linux Sept. 28, 2015, 3:24 p.m. UTC | #15

On Mon, Sep 28, 2015 at 04:48:08PM +0200, Lucas Stach wrote:
> I've looked at that again and the issue here seems to be that GStreamer
> is handing us a pointer to a buffer that isn't aligned to a page. The
> buffers size is properly rounded up to a pagesize, as requested by
> QueryImageAttributes, but if the buffer start pointer isn't aligned to a
> page boundary we end up with an non-mappable buffer anyway.
> 
> Unfortunately there is no obvious way for a driver to request a minimum
> alignment for the buffer. The only possible fix is for the client to
> always align the buffer to a page boundary in hopes that this is enough
> for the hardware to map it directly and allow to skip any unwanted
> copying.

We can't just "round up" the size.  We've no idea whether the buffer
came from shmem (for XvShmPutImage), or whether it's part of an internal
X buffer (for XvPutImage).  In the latter case, we've no idea whether
data in the remainder of the page will be read or written by the CPU
when, eg, a signal occurs - and X does use signals.

Lucas Stach Sept. 28, 2015, 3:40 p.m. UTC | #16

Am Montag, den 28.09.2015, 16:24 +0100 schrieb Russell King - ARM Linux:
> On Mon, Sep 28, 2015 at 04:48:08PM +0200, Lucas Stach wrote:
> > I've looked at that again and the issue here seems to be that GStreamer
> > is handing us a pointer to a buffer that isn't aligned to a page. The
> > buffers size is properly rounded up to a pagesize, as requested by
> > QueryImageAttributes, but if the buffer start pointer isn't aligned to a
> > page boundary we end up with an non-mappable buffer anyway.
> > 
> > Unfortunately there is no obvious way for a driver to request a minimum
> > alignment for the buffer. The only possible fix is for the client to
> > always align the buffer to a page boundary in hopes that this is enough
> > for the hardware to map it directly and allow to skip any unwanted
> > copying.
> 
> We can't just "round up" the size.  We've no idea whether the buffer
> came from shmem (for XvShmPutImage), or whether it's part of an internal
> X buffer (for XvPutImage).  In the latter case, we've no idea whether
> data in the remainder of the page will be read or written by the CPU
> when, eg, a signal occurs - and X does use signals.
> 
I'm not talking about the driver rounding up the size. That is obviously
(contrary to what my patch did) the wrong thing to do.

I was talking of the client (as in VLC, GStreamer, whatever) aligning
the buffer to a page boundary. As there is no way for the client to
query the alignment restrictions, we may still need a fallback path in
the driver, so that if an unaligned buffer comes in we don't do a
buffer_from_userptr but actually copy client memory to a new buffer.

Regards,
Lucas

Russell King - ARM Linux Sept. 28, 2015, 4:50 p.m. UTC | #17

On Mon, Sep 28, 2015 at 05:40:13PM +0200, Lucas Stach wrote:
> Am Montag, den 28.09.2015, 16:24 +0100 schrieb Russell King - ARM Linux:
> > On Mon, Sep 28, 2015 at 04:48:08PM +0200, Lucas Stach wrote:
> > > I've looked at that again and the issue here seems to be that GStreamer
> > > is handing us a pointer to a buffer that isn't aligned to a page. The
> > > buffers size is properly rounded up to a pagesize, as requested by
> > > QueryImageAttributes, but if the buffer start pointer isn't aligned to a
> > > page boundary we end up with an non-mappable buffer anyway.
> > > 
> > > Unfortunately there is no obvious way for a driver to request a minimum
> > > alignment for the buffer. The only possible fix is for the client to
> > > always align the buffer to a page boundary in hopes that this is enough
> > > for the hardware to map it directly and allow to skip any unwanted
> > > copying.
> > 
> > We can't just "round up" the size.  We've no idea whether the buffer
> > came from shmem (for XvShmPutImage), or whether it's part of an internal
> > X buffer (for XvPutImage).  In the latter case, we've no idea whether
> > data in the remainder of the page will be read or written by the CPU
> > when, eg, a signal occurs - and X does use signals.
> > 
> I'm not talking about the driver rounding up the size. That is obviously
> (contrary to what my patch did) the wrong thing to do.
> 
> I was talking of the client (as in VLC, GStreamer, whatever) aligning
> the buffer to a page boundary. As there is no way for the client to
> query the alignment restrictions, we may still need a fallback path in
> the driver, so that if an unaligned buffer comes in we don't do a
> buffer_from_userptr but actually copy client memory to a new buffer.

You really do _not_ want to do be copying image data.  With large images
(1080p) that will consume lots of CPU, and tie up the X server doing not
much other than copying data.  You might as well manually convert and
copy the data to the screen at that point.

Lucas Stach Sept. 29, 2015, 8:28 a.m. UTC | #18

Am Montag, den 28.09.2015, 17:50 +0100 schrieb Russell King - ARM Linux:
> On Mon, Sep 28, 2015 at 05:40:13PM +0200, Lucas Stach wrote:
> > Am Montag, den 28.09.2015, 16:24 +0100 schrieb Russell King - ARM Linux:
> > > On Mon, Sep 28, 2015 at 04:48:08PM +0200, Lucas Stach wrote:
> > > > I've looked at that again and the issue here seems to be that GStreamer
> > > > is handing us a pointer to a buffer that isn't aligned to a page. The
> > > > buffers size is properly rounded up to a pagesize, as requested by
> > > > QueryImageAttributes, but if the buffer start pointer isn't aligned to a
> > > > page boundary we end up with an non-mappable buffer anyway.
> > > > 
> > > > Unfortunately there is no obvious way for a driver to request a minimum
> > > > alignment for the buffer. The only possible fix is for the client to
> > > > always align the buffer to a page boundary in hopes that this is enough
> > > > for the hardware to map it directly and allow to skip any unwanted
> > > > copying.
> > > 
> > > We can't just "round up" the size.  We've no idea whether the buffer
> > > came from shmem (for XvShmPutImage), or whether it's part of an internal
> > > X buffer (for XvPutImage).  In the latter case, we've no idea whether
> > > data in the remainder of the page will be read or written by the CPU
> > > when, eg, a signal occurs - and X does use signals.
> > > 
> > I'm not talking about the driver rounding up the size. That is obviously
> > (contrary to what my patch did) the wrong thing to do.
> > 
> > I was talking of the client (as in VLC, GStreamer, whatever) aligning
> > the buffer to a page boundary. As there is no way for the client to
> > query the alignment restrictions, we may still need a fallback path in
> > the driver, so that if an unaligned buffer comes in we don't do a
> > buffer_from_userptr but actually copy client memory to a new buffer.
> 
> You really do _not_ want to do be copying image data.  With large images
> (1080p) that will consume lots of CPU, and tie up the X server doing not
> much other than copying data.  You might as well manually convert and
> copy the data to the screen at that point.
> 
So what other possibilities do we have if the client hands us a non page
aligned buffer, other than failing the PutImage? Converting the image
manually and possibly doubling the amount of bytes to write out
(YUV->RGB) will tie up the X server even longer.

This all doesn't seem to be a problem as long as the client allocates
from XShm, as this seems to hand out page aligned memory. Only if the
client mallocs the image buffer we might end up with unaligned buffers.

Regards,
Lucas

Russell King - ARM Linux Sept. 29, 2015, 8:41 a.m. UTC | #19

On Tue, Sep 29, 2015 at 10:28:19AM +0200, Lucas Stach wrote:
> Am Montag, den 28.09.2015, 17:50 +0100 schrieb Russell King - ARM Linux:
> > You really do _not_ want to do be copying image data.  With large images
> > (1080p) that will consume lots of CPU, and tie up the X server doing not
> > much other than copying data.  You might as well manually convert and
> > copy the data to the screen at that point.
> > 
> So what other possibilities do we have if the client hands us a non page
> aligned buffer, other than failing the PutImage? Converting the image
> manually and possibly doubling the amount of bytes to write out
> (YUV->RGB) will tie up the X server even longer.

Are you sure about that?

If you memcpy() the buffer, you have to read about 6MB of data, write
6MB of data, flush the caches of that, then ask the GPU to perform two
blits (which on my measurements just about fit in one field scan), wait
for the blit to finish before then returning to the client.

I can tell you that VLC uses memcpy() on images internally, and you can't
get away with it for 1080p video.

> This all doesn't seem to be a problem as long as the client allocates
> from XShm, as this seems to hand out page aligned memory. Only if the
> client mallocs the image buffer we might end up with unaligned buffers.

I don't see that.  How does a malloc()'d buffer get to the X server,
other than throuh XvPutImage?  If you're using XvPutImage(), you'll
be incurring an even _larger_ overhead because the kernel will have
to copy the image data as well.  AFAIK, you can't pass arbitary malloc()d
buffers to sysvipc and therefore via XShm.

Lucas Stach Sept. 29, 2015, 9:01 a.m. UTC | #20

Am Dienstag, den 29.09.2015, 09:41 +0100 schrieb Russell King - ARM
Linux:
> On Tue, Sep 29, 2015 at 10:28:19AM +0200, Lucas Stach wrote:
> > Am Montag, den 28.09.2015, 17:50 +0100 schrieb Russell King - ARM Linux:
> > > You really do _not_ want to do be copying image data.  With large images
> > > (1080p) that will consume lots of CPU, and tie up the X server doing not
> > > much other than copying data.  You might as well manually convert and
> > > copy the data to the screen at that point.
> > > 
> > So what other possibilities do we have if the client hands us a non page
> > aligned buffer, other than failing the PutImage? Converting the image
> > manually and possibly doubling the amount of bytes to write out
> > (YUV->RGB) will tie up the X server even longer.
> 
> Are you sure about that?
> 
> If you memcpy() the buffer, you have to read about 6MB of data, write
> 6MB of data, flush the caches of that, then ask the GPU to perform two
> blits (which on my measurements just about fit in one field scan), wait
> for the blit to finish before then returning to the client.
> 
The cache flush can easily be avoided if you copy into a WC buffer.

And I don't see why we would need to wait for the blit to finish before
returning to the client. If we copy the userdata there is no need to
block the client until the GPU is done with the data.

Or is there something in the spec I'm not aware of which would mandate
such blocking?

> I can tell you that VLC uses memcpy() on images internally, and you can't
> get away with it for 1080p video.
> 
I'm not disputing that it's a really bad idea if you care about
performance.

> > This all doesn't seem to be a problem as long as the client allocates
> > from XShm, as this seems to hand out page aligned memory. Only if the
> > client mallocs the image buffer we might end up with unaligned buffers.
> 
> I don't see that.  How does a malloc()'d buffer get to the X server,
> other than throuh XvPutImage?  If you're using XvPutImage(), you'll
> be incurring an even _larger_ overhead because the kernel will have
> to copy the image data as well.  AFAIK, you can't pass arbitary malloc()d
> buffers to sysvipc and therefore via XShm.
> 
Right I'm talking about XvPutImage. I'm not saying it is a good way to
hand off your image data to the X server, all I'm saying is that this is
a path that applications _might_ use.

We can just choose to ignore this for the time being, as it's a
non-issue if the client is using XShm.

Regards,
Lucas

I.MX6 HDMI support in v4.2

Commit Message

Comments

Patch