diff mbox series

[v4] media: docs-rst: Document m2m stateless video decoder interface

Message ID 20190306080019.159676-1-acourbot@chromium.org (mailing list archive)
State New, archived
Headers show
Series [v4] media: docs-rst: Document m2m stateless video decoder interface | expand

Commit Message

Alexandre Courbot March 6, 2019, 8 a.m. UTC
Documents the protocol that user-space should follow when
communicating with stateless video decoders.

The stateless video decoding API makes use of the new request and tags
APIs. While it has been implemented with the Cedrus driver so far, it
should probably still be considered staging for a short while.

Signed-off-by: Alexandre Courbot <acourbot@chromium.org>
---
Changes since v3:

* Rephrased the conditions under which reference buffers must be queued
  back (hopefully) more accurately.

 Documentation/media/uapi/v4l/dev-mem2mem.rst  |   5 +
 .../media/uapi/v4l/dev-stateless-decoder.rst  | 386 ++++++++++++++++++
 2 files changed, 391 insertions(+)
 create mode 100644 Documentation/media/uapi/v4l/dev-stateless-decoder.rst

Comments

Nicolas Dufresne April 12, 2019, 8:47 p.m. UTC | #1
Le mercredi 06 mars 2019 à 17:00 +0900, Alexandre Courbot a écrit :
> Documents the protocol that user-space should follow when
> communicating with stateless video decoders.
> 
> The stateless video decoding API makes use of the new request and tags
> APIs. While it has been implemented with the Cedrus driver so far, it
> should probably still be considered staging for a short while.
> 
> Signed-off-by: Alexandre Courbot <acourbot@chromium.org>
> ---
> Changes since v3:
> 
> * Rephrased the conditions under which reference buffers must be queued
>   back (hopefully) more accurately.
> 
>  Documentation/media/uapi/v4l/dev-mem2mem.rst  |   5 +
>  .../media/uapi/v4l/dev-stateless-decoder.rst  | 386 ++++++++++++++++++
>  2 files changed, 391 insertions(+)
>  create mode 100644 Documentation/media/uapi/v4l/dev-stateless-decoder.rst
> 
> diff --git a/Documentation/media/uapi/v4l/dev-mem2mem.rst b/Documentation/media/uapi/v4l/dev-mem2mem.rst
> index 67a980818dc8..db6f4efc458d 100644
> --- a/Documentation/media/uapi/v4l/dev-mem2mem.rst
> +++ b/Documentation/media/uapi/v4l/dev-mem2mem.rst
> @@ -13,6 +13,11 @@
>  Video Memory-To-Memory Interface
>  ********************************
>  
> +.. toctree::
> +    :maxdepth: 1
> +
> +    dev-stateless-decoder
> +
>  A V4L2 memory-to-memory device can compress, decompress, transform, or
>  otherwise convert video data from one format into another format, in memory.
>  Such memory-to-memory devices set the ``V4L2_CAP_VIDEO_M2M`` or
> diff --git a/Documentation/media/uapi/v4l/dev-stateless-decoder.rst b/Documentation/media/uapi/v4l/dev-stateless-decoder.rst
> new file mode 100644
> index 000000000000..861fd2662886
> --- /dev/null
> +++ b/Documentation/media/uapi/v4l/dev-stateless-decoder.rst
> @@ -0,0 +1,386 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +.. _stateless_decoder:
> +
> +**************************************************
> +Memory-to-memory Stateless Video Decoder Interface
> +**************************************************
> +
> +A stateless decoder is a decoder that works without retaining any kind of state
> +between processed frames. This means that each frame is decoded independently
> +of any previous and future frames, and that the client is responsible for
> +maintaining the decoding state and providing it to the decoder with each
> +decoding request. This is in contrast to the stateful video decoder interface,
> +where the hardware and driver maintain the decoding state and all the client
> +has to do is to provide the raw encoded stream and dequeue decoded frames in
> +display order.
> +
> +This section describes how user-space ("the client") is expected to communicate
> +with stateless decoders in order to successfully decode an encoded stream.
> +Compared to stateful codecs, the decoder/client sequence is simpler, but the
> +cost of this simplicity is extra complexity in the client which must maintain a
> +consistent decoding state.
> +
> +Stateless decoders make use of the request API. A stateless decoder must expose
> +the ``V4L2_BUF_CAP_SUPPORTS_REQUESTS`` capability on its ``OUTPUT`` queue when
> +:c:func:`VIDIOC_REQBUFS` or :c:func:`VIDIOC_CREATE_BUFS` are invoked.
> +
> +Querying capabilities
> +=====================
> +
> +1. To enumerate the set of coded formats supported by the decoder, the client
> +   calls :c:func:`VIDIOC_ENUM_FMT` on the ``OUTPUT`` queue.
> +
> +   * The driver must always return the full set of supported ``OUTPUT`` formats,
> +     irrespective of the format currently set on the ``CAPTURE`` queue.
> +
> +   * Simultaneously, the driver must restrain the set of values returned by
> +     codec-specific capability controls (such as H.264 profiles) to the set
> +     actually supported by the hardware.
> +
> +2. To enumerate the set of supported raw formats, the client calls
> +   :c:func:`VIDIOC_ENUM_FMT` on the ``CAPTURE`` queue.
> +
> +   * The driver must return only the formats supported for the format currently
> +     active on the ``OUTPUT`` queue.
> +
> +   * Depending on the currently set ``OUTPUT`` format, the set of supported raw
> +     formats may depend on the value of some controls (e.g. parsed format
> +     headers) which are codec-dependent. The client is responsible for making
> +     sure that these controls are set before querying the ``CAPTURE`` queue.
> +     Failure to do so will result in the default values for these controls being
> +     used, and a returned set of formats that may not be usable for the media
> +     the client is trying to decode.
> +
> +3. The client may use :c:func:`VIDIOC_ENUM_FRAMESIZES` to detect supported
> +   resolutions for a given format, passing desired pixel format in
> +   :c:type:`v4l2_frmsizeenum`'s ``pixel_format``.
> +
> +4. Supported profiles and levels for the current ``OUTPUT`` format, if
> +   applicable, may be queried using their respective controls via
> +   :c:func:`VIDIOC_QUERYCTRL`.
> +
> +Initialization
> +==============
> +
> +1. Set the coded format on the ``OUTPUT`` queue via :c:func:`VIDIOC_S_FMT`.
> +
> +   * **Required fields:**
> +
> +     ``type``
> +         a ``V4L2_BUF_TYPE_*`` enum appropriate for ``OUTPUT``.
> +
> +     ``pixelformat``
> +         a coded pixel format.
> +
> +     ``width``, ``height``
> +         coded width and height parsed from the stream.
> +
> +     other fields
> +         follow standard semantics.
> +
> +   .. note::
> +
> +      Changing the ``OUTPUT`` format may change the currently set ``CAPTURE``
> +      format. The driver will derive a new ``CAPTURE`` format from the
> +      ``OUTPUT`` format being set, including resolution, colorimetry
> +      parameters, etc. If the client needs a specific ``CAPTURE`` format,
> +      it must adjust it afterwards.
> +
> +2. Call :c:func:`VIDIOC_S_EXT_CTRLS` to set all the controls (parsed headers,
> +   etc.) required by the ``OUTPUT`` format to enumerate the ``CAPTURE`` formats.
> +
> +3. Call :c:func:`VIDIOC_G_FMT` for ``CAPTURE`` queue to get the format for the
> +   destination buffers parsed/decoded from the bitstream.
> +
> +   * **Required fields:**
> +
> +     ``type``
> +         a ``V4L2_BUF_TYPE_*`` enum appropriate for ``CAPTURE``.
> +
> +   * **Returned fields:**
> +
> +     ``width``, ``height``
> +         frame buffer resolution for the decoded frames.
> +
> +     ``pixelformat``
> +         pixel format for decoded frames.
> +
> +     ``num_planes`` (for _MPLANE ``type`` only)
> +         number of planes for pixelformat.
> +
> +     ``sizeimage``, ``bytesperline``
> +         as per standard semantics; matching frame buffer format.
> +
> +   .. note::
> +
> +      The value of ``pixelformat`` may be any pixel format supported for the
> +      ``OUTPUT`` format, based on the hardware capabilities. It is suggested
> +      that driver chooses the preferred/optimal format for the current
> +      configuration. For example, a YUV format may be preferred over an RGB
> +      format, if an additional conversion step would be required for RGB.
> +
> +4. *[optional]* Enumerate ``CAPTURE`` formats via :c:func:`VIDIOC_ENUM_FMT` on
> +   the ``CAPTURE`` queue. The client may use this ioctl to discover which
> +   alternative raw formats are supported for the current ``OUTPUT`` format and
> +   select one of them via :c:func:`VIDIOC_S_FMT`.
> +
> +   .. note::
> +
> +      The driver will return only formats supported for the currently selected
> +      ``OUTPUT`` format, even if more formats may be supported by the decoder in
> +      general.
> +
> +      For example, a decoder may support YUV and RGB formats for
> +      resolutions 1920x1088 and lower, but only YUV for higher resolutions (due
> +      to hardware limitations). After setting a resolution of 1920x1088 or lower
> +      as the ``OUTPUT`` format, :c:func:`VIDIOC_ENUM_FMT` may return a set of
> +      YUV and RGB pixel formats, but after setting a resolution higher than
> +      1920x1088, the driver will not return RGB pixel formats, since they are
> +      unsupported for this resolution.
> +
> +5. *[optional]* Choose a different ``CAPTURE`` format than suggested via
> +   :c:func:`VIDIOC_S_FMT` on ``CAPTURE`` queue. It is possible for the client to
> +   choose a different format than selected/suggested by the driver in
> +   :c:func:`VIDIOC_G_FMT`.
> +
> +    * **Required fields:**
> +
> +      ``type``
> +          a ``V4L2_BUF_TYPE_*`` enum appropriate for ``CAPTURE``.
> +
> +      ``pixelformat``
> +          a raw pixel format.
> +
> +6. Allocate source (bitstream) buffers via :c:func:`VIDIOC_REQBUFS` on
> +   ``OUTPUT`` queue.
> +
> +    * **Required fields:**
> +
> +      ``count``
> +          requested number of buffers to allocate; greater than zero.
> +
> +      ``type``
> +          a ``V4L2_BUF_TYPE_*`` enum appropriate for ``OUTPUT``.
> +
> +      ``memory``
> +          follows standard semantics.
> +
> +    * **Return fields:**
> +
> +      ``count``
> +          actual number of buffers allocated.
> +
> +    * If required, the driver will adjust ``count`` to be equal or bigger to the
> +      minimum of required number of ``OUTPUT`` buffers for the given format and
> +      requested count. The client must check this value after the ioctl returns
> +      to get the actual number of buffers allocated.
> +
> +7. Allocate destination (raw format) buffers via :c:func:`VIDIOC_REQBUFS` on the
> +   ``CAPTURE`` queue.
> +
> +    * **Required fields:**
> +
> +      ``count``
> +          requested number of buffers to allocate; greater than zero. The client
> +          is responsible for deducing the minimum number of buffers required
> +          for the stream to be properly decoded (taking e.g. reference frames
> +          into account) and pass an equal or bigger number.
> +
> +      ``type``
> +          a ``V4L2_BUF_TYPE_*`` enum appropriate for ``CAPTURE``.
> +
> +      ``memory``
> +          follows standard semantics. ``V4L2_MEMORY_USERPTR`` is not supported
> +          for ``CAPTURE`` buffers.
> +
> +    * **Return fields:**
> +
> +      ``count``
> +          adjusted to allocated number of buffers, in case the codec requires
> +          more buffers than requested.
> +
> +    * The driver must adjust count to the minimum of required number of
> +      ``CAPTURE`` buffers for the current format, stream configuration and
> +      requested count. The client must check this value after the ioctl
> +      returns to get the number of buffers allocated.
> +
> +8. Allocate requests (likely one per ``OUTPUT`` buffer) via
> +    :c:func:`MEDIA_IOC_REQUEST_ALLOC` on the media device.
> +
> +9. Start streaming on both ``OUTPUT`` and ``CAPTURE`` queues via
> +    :c:func:`VIDIOC_STREAMON`.
> +
> +Decoding
> +========
> +
> +For each frame, the client is responsible for submitting at least one request to
> +which the following is attached:
> +
> +* The amount of encoded data expected by the codec for its current
> +  configuration, as a buffer submitted to the ``OUTPUT`` queue. Typically, this
> +  corresponds to one frame worth of encoded data, but some formats may allow (or
> +  require) different amounts per unit.
> +* All the metadata needed to decode the submitted encoded data, in the form of
> +  controls relevant to the format being decoded.
> +
> +The amount and contents of the source ``OUTPUT`` buffer, as well as the controls
> +that must be set on the request, depend on the active coded pixel format and
> +might be affected by codec-specific extended controls, as stated in
> +documentation of each format.

From an IRC discussion with Paul and some more digging, I have found a
design problem in the decoding process.

In H264 and HEVC you can have multiple decoding unit per frames
(slices). This type of encoding is increasingly popular, specially for
low latency streaming use cases. The wording of this spec does allow
for the notion of decoding unit, and in practice it has been proven to
work through some RFC FFMPEG patches and the Cedrus driver. But
something important to know is that the FFMPEG RFC implements decoding
in lock steps. Which means:

  1. It queues a single free capture buffer
  2. It queues an output buffer, set controls, queue the request
  3. It waits for a capture buffer to reach state done
  4. It dequeues that capture buffer, and queue it back again
  5. And then it runs step 2,4,3 again with following slices, until we 
     have a complete frame. After what, it restart at step 1

So the implementation makes no use of the queues. There is no batch
processing, so we might not be able to reach the maximum hardware
throughput.

So the optimal method would look like the following, but there comes
the design issue.

  1. Queue a single free capture buffer
  2. Queue output buffer for slice 1, set controls, queue the request
  3. Queue output buffer for slice 2, set controls, queue the request
  4. Wait for completion

The problem is in step 4. Completion means that the capture buffer done
decoding a single unit. So assuming the driver supports matching the
timestamp against the queued buffer, instead of waiting for a new
buffer, the driver would have to mark twice the same buffer to done
state, which is just not working to inform userspace that all slices
are decoded into the one capture buffer they share.

To me, multi slice encoded stream are just too common, and they will
also exist for AV1. So we really need a solution to this that does not
require operating in lock steps. Specially that some HW can decode
multiple slices in parallel (multi core), we would not want to prevent
that HW from being used efficiently. On top of this, we need a solution
so that we can also keep queuing slice of the following frames if they
arrive before decoding is done.

I don't have a solution yet myself, but it would be nice to come up
with something before we freeze this API. By the way, if we could queue
twice the same buffer, that would in principal work, but internally
there is only one state per buffer. If you do external allocation, then
in theory you could workaround that, but then it's ugly, because you'll
have two buffers with the same timestamp.

An argument that was made early was that we don't need to support this
right away because userspace can combine all the slices into one
buffer. But for H264_SLICE_RAW format it's inconvenient, you'd need an
extra control to tell the driver the offset to each slices, because the
raw H264 does not have enough information to be parsed. RAW slice are
also I believe de-emulated, which means the code use to prevent having
pattern looking like a start code has been removed, so you cannot just
prepend start codes. De-emulation seems better placed in userspace if
the HW does not take care.

I also very dislike the idea that we would enforce merging all slice
into the same buffer. The entire purpose of slices and the reason they
are used in practice is that you can start decoding slices before you
have all slices of a frame. This reduce drastically the latency for
streaming use cases, like video conferencing. So forcing the merging of
slices is basically like pretending slices have no benefits.

I have just exposed the problem I see for now, to see what comes up.
But I hope we be able to propose solution too in the short term (in no
one beats me at it).

> +
> +A typical frame would thus be decoded using the following sequence:
> +
> +1. Queue an ``OUTPUT`` buffer containing one unit of encoded bitstream data for
> +   the decoding request, using :c:func:`VIDIOC_QBUF`.
> +
> +    * **Required fields:**
> +
> +      ``index``
> +          index of the buffer being queued.
> +
> +      ``type``
> +          type of the buffer.
> +
> +      ``bytesused``
> +          number of bytes taken by the encoded data frame in the buffer.
> +
> +      ``flags``
> +          the ``V4L2_BUF_FLAG_REQUEST_FD`` flag must be set.
> +
> +      ``request_fd``
> +          must be set to the file descriptor of the decoding request.
> +
> +      ``timestamp``
> +          must be set to a unique value per frame. This value will be propagated
> +          into the decoded frame's buffer and can also be used to use this frame
> +          as the reference of another.
> +
> +2. Set the codec-specific controls for the decoding request, using
> +   :c:func:`VIDIOC_S_EXT_CTRLS`.
> +
> +    * **Required fields:**
> +
> +      ``which``
> +          must be ``V4L2_CTRL_WHICH_REQUEST_VAL``.
> +
> +      ``request_fd``
> +          must be set to the file descriptor of the decoding request.
> +
> +      other fields
> +          other fields are set as usual when setting controls. The ``controls``
> +          array must contain all the codec-specific controls required to decode
> +          a frame.
> +
> +   .. note::
> +
> +      It is possible to specify the controls in different invocations of
> +      :c:func:`VIDIOC_S_EXT_CTRLS`, or to overwrite a previously set control, as
> +      long as ``request_fd`` and ``which`` are properly set. The controls state
> +      at the moment of request submission is the one that will be considered.
> +
> +   .. note::
> +
> +      The order in which steps 1 and 2 take place is interchangeable.
> +
> +3. Submit the request by invoking :c:func:`MEDIA_REQUEST_IOC_QUEUE` on the
> +   request FD.
> +
> +    If the request is submitted without an ``OUTPUT`` buffer, or if some of the
> +    required controls are missing from the request, then
> +    :c:func:`MEDIA_REQUEST_IOC_QUEUE` will return ``-ENOENT``. If more than one
> +    ``OUTPUT`` buffer is queued, then it will return ``-EINVAL``.
> +    :c:func:`MEDIA_REQUEST_IOC_QUEUE` returning non-zero means that no
> +    ``CAPTURE`` buffer will be produced for this request.
> +
> +``CAPTURE`` buffers must not be part of the request, and are queued
> +independently. They are returned in decode order (i.e. the same order as coded
> +frames were submitted to the ``OUTPUT`` queue).
> +
> +Runtime decoding errors are signaled by the dequeued ``CAPTURE`` buffers
> +carrying the ``V4L2_BUF_FLAG_ERROR`` flag. If a decoded reference frame has an
> +error, then all following decoded frames that refer to it also have the
> +``V4L2_BUF_FLAG_ERROR`` flag set, although the decoder will still try to
> +produce a (likely corrupted) frame.
> +
> +Buffer management while decoding
> +================================
> +Contrary to stateful decoders, a stateless decoder does not perform any kind of
> +buffer management: it only guarantees that dequeued ``CAPTURE`` buffers can be
> +used by the client for as long as they are not queued again. "Used" here
> +encompasses using the buffer for compositing or display.
> +
> +A dequeued capture buffer can also be used as the reference frame of another
> +buffer.
> +
> +A frame is specified as reference by converting its timestamp into nanoseconds,
> +and storing it into the relevant member of a codec-dependent control structure.
> +The :c:func:`v4l2_timeval_to_ns` function must be used to perform that
> +conversion. The timestamp of a frame can be used to reference it as soon as all
> +its units of encoded data are successfully submitted to the ``OUTPUT`` queue.
> +
> +A decoded buffer containing a reference frame must not be reused as a decoding
> +target until all the frames referencing it have been decoded. The safest way to
> +achieve this is to refrain from queueing a reference buffer until all the
> +decoded frames referencing it have been dequeued. However, if the driver can
> +guarantee that buffer queued to the ``CAPTURE`` queue will be used in queued
> +order, then user-space can take advantage of this guarantee and queue a
> +reference buffer when the following conditions are met:
> +
> +1. All the requests for frames affected by the reference frame have been
> +   queued, and
> +
> +2. A sufficient number of ``CAPTURE`` buffers to cover all the decoded
> +   referencing frames have been queued.
> +
> +When queuing a decoding request, the driver will increase the reference count of
> +all the resources associated with reference frames. This means that the client
> +can e.g. close the DMABUF file descriptors of reference frame buffers if it
> +won't need them afterwards.
> +
> +Seeking
> +=======
> +In order to seek, the client just needs to submit requests using input buffers
> +corresponding to the new stream position. It must however be aware that
> +resolution may have changed and follow the dynamic resolution change sequence in
> +that case. Also depending on the codec used, picture parameters (e.g. SPS/PPS
> +for H.264) may have changed and the client is responsible for making sure that a
> +valid state is sent to the decoder.
> +
> +The client is then free to ignore any returned ``CAPTURE`` buffer that comes
> +from the pre-seek position.
> +
> +Pause
> +=====
> +
> +In order to pause, the client can just cease queuing buffers onto the ``OUTPUT``
> +queue. Without source bitstream data, there is no data to process and the codec
> +will remain idle.
> +
> +Dynamic resolution change
> +=========================
> +
> +If the client detects a resolution change in the stream, it will need to perform
> +the initialization sequence again with the new resolution:
> +
> +1. Wait until all submitted requests have completed and dequeue the
> +   corresponding output buffers.
> +
> +2. Call :c:func:`VIDIOC_STREAMOFF` on both the ``OUTPUT`` and ``CAPTURE``
> +   queues.
> +
> +3. Free all ``CAPTURE`` buffers by calling :c:func:`VIDIOC_REQBUFS` on the
> +   ``CAPTURE`` queue with a buffer count of zero.
> +
> +4. Perform the initialization sequence again (minus the allocation of
> +   ``OUTPUT`` buffers), with the new resolution set on the ``OUTPUT`` queue.
> +   Note that due to resolution constraints, a different format may need to be
> +   picked on the ``CAPTURE`` queue.
> +
> +Drain
> +=====
> +
> +In order to drain the stream on a stateless decoder, the client just needs to
> +wait until all the submitted requests are completed. There is no need to send a
> +``V4L2_DEC_CMD_STOP`` command since requests are processed sequentially by the
> +decoder.
Paul Kocialkowski April 14, 2019, 4:41 p.m. UTC | #2
Hi,

Le vendredi 12 avril 2019 à 16:47 -0400, Nicolas Dufresne a écrit :
> Le mercredi 06 mars 2019 à 17:00 +0900, Alexandre Courbot a écrit :
> > Documents the protocol that user-space should follow when
> > communicating with stateless video decoders.
> > 
> > The stateless video decoding API makes use of the new request and tags
> > APIs. While it has been implemented with the Cedrus driver so far, it
> > should probably still be considered staging for a short while.

[...]

> From an IRC discussion with Paul and some more digging, I have found a
> design problem in the decoding process.
> 
> In H264 and HEVC you can have multiple decoding unit per frames
> (slices). This type of encoding is increasingly popular, specially for
> low latency streaming use cases. The wording of this spec does allow
> for the notion of decoding unit, and in practice it has been proven to
> work through some RFC FFMPEG patches and the Cedrus driver. But
> something important to know is that the FFMPEG RFC implements decoding
> in lock steps. Which means:
> 
>   1. It queues a single free capture buffer
>   2. It queues an output buffer, set controls, queue the request
>   3. It waits for a capture buffer to reach state done
>   4. It dequeues that capture buffer, and queue it back again
>   5. And then it runs step 2,4,3 again with following slices, until we 
>      have a complete frame. After what, it restart at step 1
> 
> So the implementation makes no use of the queues. There is no batch
> processing, so we might not be able to reach the maximum hardware
> throughput.
> 
> So the optimal method would look like the following, but there comes
> the design issue.
> 
>   1. Queue a single free capture buffer
>   2. Queue output buffer for slice 1, set controls, queue the request
>   3. Queue output buffer for slice 2, set controls, queue the request
>   4. Wait for completion
> 
> The problem is in step 4. Completion means that the capture buffer done
> decoding a single unit. So assuming the driver supports matching the
> timestamp against the queued buffer, instead of waiting for a new
> buffer, the driver would have to mark twice the same buffer to done
> state, which is just not working to inform userspace that all slices
> are decoded into the one capture buffer they share.

Interestingly, I'm experiencing the exact same problem dealing with a
2D graphics blitter that has limited ouput scaling abilities which
imply handlnig a large scaling operation as multiple clipped smaller
scaling operations. The issue is basically that multiple jobs have to
be submitted to complete a single frame and relying on an indication
from the destination buffer (such as a fence) doesn't work to indicate
that all the operations were completed, since we get the indication at
each step instead of at the end of the batch.

One idea I see to solve this is to have a notion of batch in the driver
(for our situation, that would be in v4l2) and provide means to get a
done indication for that entity.

I think we could extend the request API to allow this. We already
represent requests as individual file descriptors, we could totally
group requests in batches and get a sync fd for the batch to poll on
when we need to return the frames. It would be good if we could expose
this in a way that makes it work with DRM as an in fence for display.
Then we can pretty much schedule our flip + decoding together (which is
quite nice to have when we're running late on the decoding side).

What do you think?

It feels to me like the request API was designed to open up the way for
these kinds of improvements, so I'm sure we can find an agreeable
solution that extends the API.

> To me, multi slice encoded stream are just too common, and they will
> also exist for AV1. So we really need a solution to this that does not
> require operating in lock steps. Specially that some HW can decode
> multiple slices in parallel (multi core), we would not want to prevent
> that HW from being used efficiently. On top of this, we need a solution
> so that we can also keep queuing slice of the following frames if they
> arrive before decoding is done.

Agreed.

> I don't have a solution yet myself, but it would be nice to come up
> with something before we freeze this API.

I think it's rather independent from the codec used and this is
something that should be handled at the request API level. 

I'm not sure we can always expect the hardware to be able to operate on
a per-slice basis. I think it would be useful to reflect this in the
pixel format, so that we also have a possibility for a gathered slice
buffer (as the spec currently mentions for mpeg-2) for legacy decoder
hardware that will need to decode one frame in one go from a contiguous
buffer with all the slice data appended.

This updates my pixel format proposition from IRC to the following:
- V4L2_PIXFMT_H264_SLICE_APPENDED: single buffer for all the slices
(appended buffer), slice params as v4l2 control (legacy);

- V4L2_PIXFMT_H264_SLICE: one buffer/slice, slice params as control;

- V4L2_PIXFMT_H264_SLICE_ANNEX_B: one buffer/slice in annex-b format, 
slice params encoded in the buffer;

- V4L2_PIXFMT_H264_SLICE_MIXED: one buffer/slice in annex-b format,
slice params encoded in the buffer and in slice params control;

Also, we need to make sure to have a per-slice bit offset to the
encoded data in the slice params control so that the same slice buffer
can be used with any of the _SLICE formats (e.g. ffmpeg would only have
an annex-b slice and use any of the formats with it).

For the legacy format, we need to specify that the appended slices
don't repeat the annex-b start code and NAL header.

What do you think?

>  By the way, if we could queue
> twice the same buffer, that would in principal work, but internally
> there is only one state per buffer. If you do external allocation, then
> in theory you could workaround that, but then it's ugly, because you'll
> have two buffers with the same timestamp.

One advantage of the request API is that buffers are actually queued
when the request is processed, so this might not be too problematic.

I think what we need boils down to:
- Being able to queue the same output buffer to multiple requests,
which the request API should already allow;
- Being able to grab the right capture buffer based on the output
timestamp so that the different requests for the slices are rendered to
the same destination buffer.

For the second point, I don't really have a clear idea of whether we
can already expect v4l2 to allow picking a buffer that was marked done
but was not de-queued by userspace yet. It might already be allowed and
we could just implement something to lookup the buffer to grab by
timestamp.

> An argument that was made early was that we don't need to support this
> right away because userspace can combine all the slices into one
> buffer. But for H264_SLICE_RAW format it's inconvenient, you'd need an
> extra control to tell the driver the offset to each slices, because the
> raw H264 does not have enough information to be parsed. RAW slice are
> also I believe de-emulated, which means the code use to prevent having
> pattern looking like a start code has been removed, so you cannot just
> prepend start codes. De-emulation seems better placed in userspace if
> the HW does not take care.

Mhh I'd like to avoid having having to specify the offset to each slice
for the legacy case. Just appending the encoded data (excluding slice
header and start code) works for cedrus and I think it makes sense more
generally. The idea is to only expose a single slice params and act as
if it was just one big slice buffer.

Come to think of it, maybe we need annex-b and mixed fashions of that
legacy pixfmt too...

> I also very dislike the idea that we would enforce merging all slice
> into the same buffer. The entire purpose of slices and the reason they
> are used in practice is that you can start decoding slices before you
> have all slices of a frame. This reduce drastically the latency for
> streaming use cases, like video conferencing. So forcing the merging of
> slices is basically like pretending slices have no benefits.

Of course, we don't want things to stay like this and this rework is
definitely needed to get serious performance and latency going.

One thing you should also be aware of: we're currently using a
workqueue between the job done irq and scheduling the next frame (in
v4l2 m2m).

Maybe we could manage to fit that into an atomic path to schedule the
next request in the previous job done irq context.

> I have just exposed the problem I see for now, to see what comes up.
> But I hope we be able to propose solution too in the short term (in no
> one beats me at it).

Seems that we have good grounds for a discussion!

Cheers,

Paul

> > +
> > +A typical frame would thus be decoded using the following sequence:
> > +
> > +1. Queue an ``OUTPUT`` buffer containing one unit of encoded bitstream data for
> > +   the decoding request, using :c:func:`VIDIOC_QBUF`.
> > +
> > +    * **Required fields:**
> > +
> > +      ``index``
> > +          index of the buffer being queued.
> > +
> > +      ``type``
> > +          type of the buffer.
> > +
> > +      ``bytesused``
> > +          number of bytes taken by the encoded data frame in the buffer.
> > +
> > +      ``flags``
> > +          the ``V4L2_BUF_FLAG_REQUEST_FD`` flag must be set.
> > +
> > +      ``request_fd``
> > +          must be set to the file descriptor of the decoding request.
> > +
> > +      ``timestamp``
> > +          must be set to a unique value per frame. This value will be propagated
> > +          into the decoded frame's buffer and can also be used to use this frame
> > +          as the reference of another.
> > +
> > +2. Set the codec-specific controls for the decoding request, using
> > +   :c:func:`VIDIOC_S_EXT_CTRLS`.
> > +
> > +    * **Required fields:**
> > +
> > +      ``which``
> > +          must be ``V4L2_CTRL_WHICH_REQUEST_VAL``.
> > +
> > +      ``request_fd``
> > +          must be set to the file descriptor of the decoding request.
> > +
> > +      other fields
> > +          other fields are set as usual when setting controls. The ``controls``
> > +          array must contain all the codec-specific controls required to decode
> > +          a frame.
> > +
> > +   .. note::
> > +
> > +      It is possible to specify the controls in different invocations of
> > +      :c:func:`VIDIOC_S_EXT_CTRLS`, or to overwrite a previously set control, as
> > +      long as ``request_fd`` and ``which`` are properly set. The controls state
> > +      at the moment of request submission is the one that will be considered.
> > +
> > +   .. note::
> > +
> > +      The order in which steps 1 and 2 take place is interchangeable.
> > +
> > +3. Submit the request by invoking :c:func:`MEDIA_REQUEST_IOC_QUEUE` on the
> > +   request FD.
> > +
> > +    If the request is submitted without an ``OUTPUT`` buffer, or if some of the
> > +    required controls are missing from the request, then
> > +    :c:func:`MEDIA_REQUEST_IOC_QUEUE` will return ``-ENOENT``. If more than one
> > +    ``OUTPUT`` buffer is queued, then it will return ``-EINVAL``.
> > +    :c:func:`MEDIA_REQUEST_IOC_QUEUE` returning non-zero means that no
> > +    ``CAPTURE`` buffer will be produced for this request.
> > +
> > +``CAPTURE`` buffers must not be part of the request, and are queued
> > +independently. They are returned in decode order (i.e. the same order as coded
> > +frames were submitted to the ``OUTPUT`` queue).
> > +
> > +Runtime decoding errors are signaled by the dequeued ``CAPTURE`` buffers
> > +carrying the ``V4L2_BUF_FLAG_ERROR`` flag. If a decoded reference frame has an
> > +error, then all following decoded frames that refer to it also have the
> > +``V4L2_BUF_FLAG_ERROR`` flag set, although the decoder will still try to
> > +produce a (likely corrupted) frame.
> > +
> > +Buffer management while decoding
> > +================================
> > +Contrary to stateful decoders, a stateless decoder does not perform any kind of
> > +buffer management: it only guarantees that dequeued ``CAPTURE`` buffers can be
> > +used by the client for as long as they are not queued again. "Used" here
> > +encompasses using the buffer for compositing or display.
> > +
> > +A dequeued capture buffer can also be used as the reference frame of another
> > +buffer.
> > +
> > +A frame is specified as reference by converting its timestamp into nanoseconds,
> > +and storing it into the relevant member of a codec-dependent control structure.
> > +The :c:func:`v4l2_timeval_to_ns` function must be used to perform that
> > +conversion. The timestamp of a frame can be used to reference it as soon as all
> > +its units of encoded data are successfully submitted to the ``OUTPUT`` queue.
> > +
> > +A decoded buffer containing a reference frame must not be reused as a decoding
> > +target until all the frames referencing it have been decoded. The safest way to
> > +achieve this is to refrain from queueing a reference buffer until all the
> > +decoded frames referencing it have been dequeued. However, if the driver can
> > +guarantee that buffer queued to the ``CAPTURE`` queue will be used in queued
> > +order, then user-space can take advantage of this guarantee and queue a
> > +reference buffer when the following conditions are met:
> > +
> > +1. All the requests for frames affected by the reference frame have been
> > +   queued, and
> > +
> > +2. A sufficient number of ``CAPTURE`` buffers to cover all the decoded
> > +   referencing frames have been queued.
> > +
> > +When queuing a decoding request, the driver will increase the reference count of
> > +all the resources associated with reference frames. This means that the client
> > +can e.g. close the DMABUF file descriptors of reference frame buffers if it
> > +won't need them afterwards.
> > +
> > +Seeking
> > +=======
> > +In order to seek, the client just needs to submit requests using input buffers
> > +corresponding to the new stream position. It must however be aware that
> > +resolution may have changed and follow the dynamic resolution change sequence in
> > +that case. Also depending on the codec used, picture parameters (e.g. SPS/PPS
> > +for H.264) may have changed and the client is responsible for making sure that a
> > +valid state is sent to the decoder.
> > +
> > +The client is then free to ignore any returned ``CAPTURE`` buffer that comes
> > +from the pre-seek position.
> > +
> > +Pause
> > +=====
> > +
> > +In order to pause, the client can just cease queuing buffers onto the ``OUTPUT``
> > +queue. Without source bitstream data, there is no data to process and the codec
> > +will remain idle.
> > +
> > +Dynamic resolution change
> > +=========================
> > +
> > +If the client detects a resolution change in the stream, it will need to perform
> > +the initialization sequence again with the new resolution:
> > +
> > +1. Wait until all submitted requests have completed and dequeue the
> > +   corresponding output buffers.
> > +
> > +2. Call :c:func:`VIDIOC_STREAMOFF` on both the ``OUTPUT`` and ``CAPTURE``
> > +   queues.
> > +
> > +3. Free all ``CAPTURE`` buffers by calling :c:func:`VIDIOC_REQBUFS` on the
> > +   ``CAPTURE`` queue with a buffer count of zero.
> > +
> > +4. Perform the initialization sequence again (minus the allocation of
> > +   ``OUTPUT`` buffers), with the new resolution set on the ``OUTPUT`` queue.
> > +   Note that due to resolution constraints, a different format may need to be
> > +   picked on the ``CAPTURE`` queue.
> > +
> > +Drain
> > +=====
> > +
> > +In order to drain the stream on a stateless decoder, the client just needs to
> > +wait until all the submitted requests are completed. There is no need to send a
> > +``V4L2_DEC_CMD_STOP`` command since requests are processed sequentially by the
> > +decoder.
Nicolas Dufresne April 14, 2019, 10:38 p.m. UTC | #3
Le dimanche 14 avril 2019 à 18:41 +0200, Paul Kocialkowski a écrit :
> Hi,
> 
> Le vendredi 12 avril 2019 à 16:47 -0400, Nicolas Dufresne a écrit :
> > Le mercredi 06 mars 2019 à 17:00 +0900, Alexandre Courbot a écrit :
> > > Documents the protocol that user-space should follow when
> > > communicating with stateless video decoders.
> > > 
> > > The stateless video decoding API makes use of the new request and tags
> > > APIs. While it has been implemented with the Cedrus driver so far, it
> > > should probably still be considered staging for a short while.
> 
> [...]
> 
> > From an IRC discussion with Paul and some more digging, I have found a
> > design problem in the decoding process.
> > 
> > In H264 and HEVC you can have multiple decoding unit per frames
> > (slices). This type of encoding is increasingly popular, specially for
> > low latency streaming use cases. The wording of this spec does allow
> > for the notion of decoding unit, and in practice it has been proven to
> > work through some RFC FFMPEG patches and the Cedrus driver. But
> > something important to know is that the FFMPEG RFC implements decoding
> > in lock steps. Which means:
> > 
> >   1. It queues a single free capture buffer
> >   2. It queues an output buffer, set controls, queue the request
> >   3. It waits for a capture buffer to reach state done
> >   4. It dequeues that capture buffer, and queue it back again
> >   5. And then it runs step 2,4,3 again with following slices, until we 
> >      have a complete frame. After what, it restart at step 1
> > 
> > So the implementation makes no use of the queues. There is no batch
> > processing, so we might not be able to reach the maximum hardware
> > throughput.
> > 
> > So the optimal method would look like the following, but there comes
> > the design issue.
> > 
> >   1. Queue a single free capture buffer
> >   2. Queue output buffer for slice 1, set controls, queue the request
> >   3. Queue output buffer for slice 2, set controls, queue the request
> >   4. Wait for completion
> > 
> > The problem is in step 4. Completion means that the capture buffer done
> > decoding a single unit. So assuming the driver supports matching the
> > timestamp against the queued buffer, instead of waiting for a new
> > buffer, the driver would have to mark twice the same buffer to done
> > state, which is just not working to inform userspace that all slices
> > are decoded into the one capture buffer they share.
> 
> Interestingly, I'm experiencing the exact same problem dealing with a
> 2D graphics blitter that has limited ouput scaling abilities which
> imply handlnig a large scaling operation as multiple clipped smaller
> scaling operations. The issue is basically that multiple jobs have to
> be submitted to complete a single frame and relying on an indication
> from the destination buffer (such as a fence) doesn't work to indicate
> that all the operations were completed, since we get the indication at
> each step instead of at the end of the batch.
> 
> One idea I see to solve this is to have a notion of batch in the driver
> (for our situation, that would be in v4l2) and provide means to get a
> done indication for that entity.
> 
> I think we could extend the request API to allow this. We already
> represent requests as individual file descriptors, we could totally
> group requests in batches and get a sync fd for the batch to poll on
> when we need to return the frames. It would be good if we could expose
> this in a way that makes it work with DRM as an in fence for display.
> Then we can pretty much schedule our flip + decoding together (which is
> quite nice to have when we're running late on the decoding side).
> 
> What do you think?
> 
> It feels to me like the request API was designed to open up the way for
> these kinds of improvements, so I'm sure we can find an agreeable
> solution that extends the API.
> 
> > To me, multi slice encoded stream are just too common, and they will
> > also exist for AV1. So we really need a solution to this that does not
> > require operating in lock steps. Specially that some HW can decode
> > multiple slices in parallel (multi core), we would not want to prevent
> > that HW from being used efficiently. On top of this, we need a solution
> > so that we can also keep queuing slice of the following frames if they
> > arrive before decoding is done.
> 
> Agreed.
> 
> > I don't have a solution yet myself, but it would be nice to come up
> > with something before we freeze this API.
> 
> I think it's rather independent from the codec used and this is
> something that should be handled at the request API level. 
> 
> I'm not sure we can always expect the hardware to be able to operate on
> a per-slice basis. I think it would be useful to reflect this in the
> pixel format, so that we also have a possibility for a gathered slice
> buffer (as the spec currently mentions for mpeg-2) for legacy decoder
> hardware that will need to decode one frame in one go from a contiguous
> buffer with all the slice data appended.
> 
> This updates my pixel format proposition from IRC to the following:
> - V4L2_PIXFMT_H264_SLICE_APPENDED: single buffer for all the slices
> (appended buffer), slice params as v4l2 control (legacy);

SLICE_RAW_APPENDED ? Or FRAMED_SLICES_RAW ? You would need a new
control for the NAL index, as there is no way to figure-out otherwise.
I would not add this format unless a specific HW need it.

> 
> - V4L2_PIXFMT_H264_SLICE: one buffer/slice, slice params as control;
> 
> - V4L2_PIXFMT_H264_SLICE_ANNEX_B: one buffer/slice in annex-b format, 
> slice params encoded in the buffer;

We are still working on this one, this format will be used by Rockchip
driver for sure, but this needs clarification and maybe a rename if
it's not just one slice per buffer.

> 
> - V4L2_PIXFMT_H264_SLICE_MIXED: one buffer/slice in annex-b format,
> slice params encoded in the buffer and in slice params control;
> 
> Also, we need to make sure to have a per-slice bit offset to the
> encoded data in the slice params control so that the same slice buffer
> can be used with any of the _SLICE formats (e.g. ffmpeg would only have
> an annex-b slice and use any of the formats with it).

Ah, I we are saying the same.

> 
> For the legacy format, we need to specify that the appended slices
> don't repeat the annex-b start code and NAL header.

I'm not sure this one make sense. the NAL header for each slices of one
frames are not always identical.

> 
> What do you think?
> 
> >  By the way, if we could queue
> > twice the same buffer, that would in principal work, but internally
> > there is only one state per buffer. If you do external allocation, then
> > in theory you could workaround that, but then it's ugly, because you'll
> > have two buffers with the same timestamp.
> 
> One advantage of the request API is that buffers are actually queued
> when the request is processed, so this might not be too problematic.
> 
> I think what we need boils down to:
> - Being able to queue the same output buffer to multiple requests,
> which the request API should already allow;
> - Being able to grab the right capture buffer based on the output
> timestamp so that the different requests for the slices are rendered to
> the same destination buffer.
> 
> For the second point, I don't really have a clear idea of whether we
> can already expect v4l2 to allow picking a buffer that was marked done
> but was not de-queued by userspace yet. It might already be allowed and
> we could just implement something to lookup the buffer to grab by
> timestamp.

An entirely difference solution that came to my mind in the last few
days would be to add a new buffer flag that would mean END_OF_FRAME (or
reused the generic LAST flag). This flag would be passed on the last
slice (if it is known that we are handling the last one) or in an empty
buffer if it is found through parsing the next following NAL. This is
inspired by the optional OMX END_OF_FRAME flag and RTP marker bit.
Though, if we make this flag mandatory, the driver could avoid marking
the frame done until all slices has been decoded. The cons is that
userpace is not informed when a specific slice is done decoding. This
is quite niche, but you can use this information along with the list of
macroblocks from the slice header so signal which portion of the image
is now ready for an hypothetical video processing. The pros is that
this solution can be per format, so this would not be needed for VP8 as
an example.

A third approach could be to use the encoded buffer state to track the
progress decoding that slice. Many driver will mark the buffer done as
soon as it is transferred to the accelerator, it does not always match
the moment that slice has been decoded. But has use said, we would need
to study if it make sense to let a driver pick by timestamp a buffer
that might already have reached done state. Other cons, is that polling
for buffer states on the capture queue won't mean anything anymore. But
combined with the FLAG, it would fix the cons of the FLAG solution.

> 
> > An argument that was made early was that we don't need to support this
> > right away because userspace can combine all the slices into one
> > buffer. But for H264_SLICE_RAW format it's inconvenient, you'd need an
> > extra control to tell the driver the offset to each slices, because the
> > raw H264 does not have enough information to be parsed. RAW slice are
> > also I believe de-emulated, which means the code use to prevent having
> > pattern looking like a start code has been removed, so you cannot just
> > prepend start codes. De-emulation seems better placed in userspace if
> > the HW does not take care.
> 
> Mhh I'd like to avoid having having to specify the offset to each slice
> for the legacy case. Just appending the encoded data (excluding slice
> header and start code) works for cedrus and I think it makes sense more
> generally. The idea is to only expose a single slice params and act as
> if it was just one big slice buffer.
> 
> Come to think of it, maybe we need annex-b and mixed fashions of that
> legacy pixfmt too...
> 
> > I also very dislike the idea that we would enforce merging all slice
> > into the same buffer. The entire purpose of slices and the reason they
> > are used in practice is that you can start decoding slices before you
> > have all slices of a frame. This reduce drastically the latency for
> > streaming use cases, like video conferencing. So forcing the merging of
> > slices is basically like pretending slices have no benefits.
> 
> Of course, we don't want things to stay like this and this rework is
> definitely needed to get serious performance and latency going.
> 
> One thing you should also be aware of: we're currently using a
> workqueue between the job done irq and scheduling the next frame (in
> v4l2 m2m).
> 
> Maybe we could manage to fit that into an atomic path to schedule the
> next request in the previous job done irq context.
> 
> > I have just exposed the problem I see for now, to see what comes up.
> > But I hope we be able to propose solution too in the short term (in no
> > one beats me at it).
> 
> Seems that we have good grounds for a discussion!
> 
> Cheers,
> 
> Paul
> 
> > > +
> > > +A typical frame would thus be decoded using the following sequence:
> > > +
> > > +1. Queue an ``OUTPUT`` buffer containing one unit of encoded bitstream data for
> > > +   the decoding request, using :c:func:`VIDIOC_QBUF`.
> > > +
> > > +    * **Required fields:**
> > > +
> > > +      ``index``
> > > +          index of the buffer being queued.
> > > +
> > > +      ``type``
> > > +          type of the buffer.
> > > +
> > > +      ``bytesused``
> > > +          number of bytes taken by the encoded data frame in the buffer.
> > > +
> > > +      ``flags``
> > > +          the ``V4L2_BUF_FLAG_REQUEST_FD`` flag must be set.
> > > +
> > > +      ``request_fd``
> > > +          must be set to the file descriptor of the decoding request.
> > > +
> > > +      ``timestamp``
> > > +          must be set to a unique value per frame. This value will be propagated
> > > +          into the decoded frame's buffer and can also be used to use this frame
> > > +          as the reference of another.
> > > +
> > > +2. Set the codec-specific controls for the decoding request, using
> > > +   :c:func:`VIDIOC_S_EXT_CTRLS`.
> > > +
> > > +    * **Required fields:**
> > > +
> > > +      ``which``
> > > +          must be ``V4L2_CTRL_WHICH_REQUEST_VAL``.
> > > +
> > > +      ``request_fd``
> > > +          must be set to the file descriptor of the decoding request.
> > > +
> > > +      other fields
> > > +          other fields are set as usual when setting controls. The ``controls``
> > > +          array must contain all the codec-specific controls required to decode
> > > +          a frame.
> > > +
> > > +   .. note::
> > > +
> > > +      It is possible to specify the controls in different invocations of
> > > +      :c:func:`VIDIOC_S_EXT_CTRLS`, or to overwrite a previously set control, as
> > > +      long as ``request_fd`` and ``which`` are properly set. The controls state
> > > +      at the moment of request submission is the one that will be considered.
> > > +
> > > +   .. note::
> > > +
> > > +      The order in which steps 1 and 2 take place is interchangeable.
> > > +
> > > +3. Submit the request by invoking :c:func:`MEDIA_REQUEST_IOC_QUEUE` on the
> > > +   request FD.
> > > +
> > > +    If the request is submitted without an ``OUTPUT`` buffer, or if some of the
> > > +    required controls are missing from the request, then
> > > +    :c:func:`MEDIA_REQUEST_IOC_QUEUE` will return ``-ENOENT``. If more than one
> > > +    ``OUTPUT`` buffer is queued, then it will return ``-EINVAL``.
> > > +    :c:func:`MEDIA_REQUEST_IOC_QUEUE` returning non-zero means that no
> > > +    ``CAPTURE`` buffer will be produced for this request.
> > > +
> > > +``CAPTURE`` buffers must not be part of the request, and are queued
> > > +independently. They are returned in decode order (i.e. the same order as coded
> > > +frames were submitted to the ``OUTPUT`` queue).
> > > +
> > > +Runtime decoding errors are signaled by the dequeued ``CAPTURE`` buffers
> > > +carrying the ``V4L2_BUF_FLAG_ERROR`` flag. If a decoded reference frame has an
> > > +error, then all following decoded frames that refer to it also have the
> > > +``V4L2_BUF_FLAG_ERROR`` flag set, although the decoder will still try to
> > > +produce a (likely corrupted) frame.
> > > +
> > > +Buffer management while decoding
> > > +================================
> > > +Contrary to stateful decoders, a stateless decoder does not perform any kind of
> > > +buffer management: it only guarantees that dequeued ``CAPTURE`` buffers can be
> > > +used by the client for as long as they are not queued again. "Used" here
> > > +encompasses using the buffer for compositing or display.
> > > +
> > > +A dequeued capture buffer can also be used as the reference frame of another
> > > +buffer.
> > > +
> > > +A frame is specified as reference by converting its timestamp into nanoseconds,
> > > +and storing it into the relevant member of a codec-dependent control structure.
> > > +The :c:func:`v4l2_timeval_to_ns` function must be used to perform that
> > > +conversion. The timestamp of a frame can be used to reference it as soon as all
> > > +its units of encoded data are successfully submitted to the ``OUTPUT`` queue.
> > > +
> > > +A decoded buffer containing a reference frame must not be reused as a decoding
> > > +target until all the frames referencing it have been decoded. The safest way to
> > > +achieve this is to refrain from queueing a reference buffer until all the
> > > +decoded frames referencing it have been dequeued. However, if the driver can
> > > +guarantee that buffer queued to the ``CAPTURE`` queue will be used in queued
> > > +order, then user-space can take advantage of this guarantee and queue a
> > > +reference buffer when the following conditions are met:
> > > +
> > > +1. All the requests for frames affected by the reference frame have been
> > > +   queued, and
> > > +
> > > +2. A sufficient number of ``CAPTURE`` buffers to cover all the decoded
> > > +   referencing frames have been queued.
> > > +
> > > +When queuing a decoding request, the driver will increase the reference count of
> > > +all the resources associated with reference frames. This means that the client
> > > +can e.g. close the DMABUF file descriptors of reference frame buffers if it
> > > +won't need them afterwards.
> > > +
> > > +Seeking
> > > +=======
> > > +In order to seek, the client just needs to submit requests using input buffers
> > > +corresponding to the new stream position. It must however be aware that
> > > +resolution may have changed and follow the dynamic resolution change sequence in
> > > +that case. Also depending on the codec used, picture parameters (e.g. SPS/PPS
> > > +for H.264) may have changed and the client is responsible for making sure that a
> > > +valid state is sent to the decoder.
> > > +
> > > +The client is then free to ignore any returned ``CAPTURE`` buffer that comes
> > > +from the pre-seek position.
> > > +
> > > +Pause
> > > +=====
> > > +
> > > +In order to pause, the client can just cease queuing buffers onto the ``OUTPUT``
> > > +queue. Without source bitstream data, there is no data to process and the codec
> > > +will remain idle.
> > > +
> > > +Dynamic resolution change
> > > +=========================
> > > +
> > > +If the client detects a resolution change in the stream, it will need to perform
> > > +the initialization sequence again with the new resolution:
> > > +
> > > +1. Wait until all submitted requests have completed and dequeue the
> > > +   corresponding output buffers.
> > > +
> > > +2. Call :c:func:`VIDIOC_STREAMOFF` on both the ``OUTPUT`` and ``CAPTURE``
> > > +   queues.
> > > +
> > > +3. Free all ``CAPTURE`` buffers by calling :c:func:`VIDIOC_REQBUFS` on the
> > > +   ``CAPTURE`` queue with a buffer count of zero.
> > > +
> > > +4. Perform the initialization sequence again (minus the allocation of
> > > +   ``OUTPUT`` buffers), with the new resolution set on the ``OUTPUT`` queue.
> > > +   Note that due to resolution constraints, a different format may need to be
> > > +   picked on the ``CAPTURE`` queue.
> > > +
> > > +Drain
> > > +=====
> > > +
> > > +In order to drain the stream on a stateless decoder, the client just needs to
> > > +wait until all the submitted requests are completed. There is no need to send a
> > > +``V4L2_DEC_CMD_STOP`` command since requests are processed sequentially by the
> > > +decoder.
Paul Kocialkowski April 15, 2019, 7:58 a.m. UTC | #4
Hi,

On Sun, 2019-04-14 at 18:38 -0400, Nicolas Dufresne wrote:
> Le dimanche 14 avril 2019 à 18:41 +0200, Paul Kocialkowski a écrit :
> > Hi,
> > 
> > Le vendredi 12 avril 2019 à 16:47 -0400, Nicolas Dufresne a écrit :
> > > Le mercredi 06 mars 2019 à 17:00 +0900, Alexandre Courbot a écrit :
> > > > Documents the protocol that user-space should follow when
> > > > communicating with stateless video decoders.
> > > > 
> > > > The stateless video decoding API makes use of the new request and tags
> > > > APIs. While it has been implemented with the Cedrus driver so far, it
> > > > should probably still be considered staging for a short while.
> > 
> > [...]
> > 
> > > From an IRC discussion with Paul and some more digging, I have found a
> > > design problem in the decoding process.
> > > 
> > > In H264 and HEVC you can have multiple decoding unit per frames
> > > (slices). This type of encoding is increasingly popular, specially for
> > > low latency streaming use cases. The wording of this spec does allow
> > > for the notion of decoding unit, and in practice it has been proven to
> > > work through some RFC FFMPEG patches and the Cedrus driver. But
> > > something important to know is that the FFMPEG RFC implements decoding
> > > in lock steps. Which means:
> > > 
> > >   1. It queues a single free capture buffer
> > >   2. It queues an output buffer, set controls, queue the request
> > >   3. It waits for a capture buffer to reach state done
> > >   4. It dequeues that capture buffer, and queue it back again
> > >   5. And then it runs step 2,4,3 again with following slices, until we 
> > >      have a complete frame. After what, it restart at step 1
> > > 
> > > So the implementation makes no use of the queues. There is no batch
> > > processing, so we might not be able to reach the maximum hardware
> > > throughput.
> > > 
> > > So the optimal method would look like the following, but there comes
> > > the design issue.
> > > 
> > >   1. Queue a single free capture buffer
> > >   2. Queue output buffer for slice 1, set controls, queue the request
> > >   3. Queue output buffer for slice 2, set controls, queue the request
> > >   4. Wait for completion
> > > 
> > > The problem is in step 4. Completion means that the capture buffer done
> > > decoding a single unit. So assuming the driver supports matching the
> > > timestamp against the queued buffer, instead of waiting for a new
> > > buffer, the driver would have to mark twice the same buffer to done
> > > state, which is just not working to inform userspace that all slices
> > > are decoded into the one capture buffer they share.
> > 
> > Interestingly, I'm experiencing the exact same problem dealing with a
> > 2D graphics blitter that has limited ouput scaling abilities which
> > imply handlnig a large scaling operation as multiple clipped smaller
> > scaling operations. The issue is basically that multiple jobs have to
> > be submitted to complete a single frame and relying on an indication
> > from the destination buffer (such as a fence) doesn't work to indicate
> > that all the operations were completed, since we get the indication at
> > each step instead of at the end of the batch.
> > 
> > One idea I see to solve this is to have a notion of batch in the driver
> > (for our situation, that would be in v4l2) and provide means to get a
> > done indication for that entity.
> > 
> > I think we could extend the request API to allow this. We already
> > represent requests as individual file descriptors, we could totally
> > group requests in batches and get a sync fd for the batch to poll on
> > when we need to return the frames. It would be good if we could expose
> > this in a way that makes it work with DRM as an in fence for display.
> > Then we can pretty much schedule our flip + decoding together (which is
> > quite nice to have when we're running late on the decoding side).
> > 
> > What do you think?
> > 
> > It feels to me like the request API was designed to open up the way for
> > these kinds of improvements, so I'm sure we can find an agreeable
> > solution that extends the API.
> > 
> > > To me, multi slice encoded stream are just too common, and they will
> > > also exist for AV1. So we really need a solution to this that does not
> > > require operating in lock steps. Specially that some HW can decode
> > > multiple slices in parallel (multi core), we would not want to prevent
> > > that HW from being used efficiently. On top of this, we need a solution
> > > so that we can also keep queuing slice of the following frames if they
> > > arrive before decoding is done.
> > 
> > Agreed.
> > 
> > > I don't have a solution yet myself, but it would be nice to come up
> > > with something before we freeze this API.
> > 
> > I think it's rather independent from the codec used and this is
> > something that should be handled at the request API level. 
> > 
> > I'm not sure we can always expect the hardware to be able to operate on
> > a per-slice basis. I think it would be useful to reflect this in the
> > pixel format, so that we also have a possibility for a gathered slice
> > buffer (as the spec currently mentions for mpeg-2) for legacy decoder
> > hardware that will need to decode one frame in one go from a contiguous
> > buffer with all the slice data appended.
> > 
> > This updates my pixel format proposition from IRC to the following:
> > - V4L2_PIXFMT_H264_SLICE_APPENDED: single buffer for all the slices
> > (appended buffer), slice params as v4l2 control (legacy);
> 
> SLICE_RAW_APPENDED ? Or FRAMED_SLICES_RAW ? You would need a new
> control for the NAL index, as there is no way to figure-out otherwise.
> I would not add this format unless a specific HW need it.

I don't really like using "raw" as a distinguisher: I don't think it's
explicit enough. The idea here is to reflect that there is only one
slice exposed, which is the appended result of all the frame slices
with a single v4l2 control.

> > - V4L2_PIXFMT_H264_SLICE: one buffer/slice, slice params as control;
> > 
> > - V4L2_PIXFMT_H264_SLICE_ANNEX_B: one buffer/slice in annex-b format, 
> > slice params encoded in the buffer;
> 
> We are still working on this one, this format will be used by Rockchip
> driver for sure, but this needs clarification and maybe a rename if
> it's not just one slice per buffer.

I thought the decoder also needed the parse slice data? At least IIRC
for Tegra, we need Annex-B format and a parsed slice header (so the
next one).

> > - V4L2_PIXFMT_H264_SLICE_MIXED: one buffer/slice in annex-b format,
> > slice params encoded in the buffer and in slice params control;
> > 
> > Also, we need to make sure to have a per-slice bit offset to the
> > encoded data in the slice params control so that the same slice buffer
> > can be used with any of the _SLICE formats (e.g. ffmpeg would only have
> > an annex-b slice and use any of the formats with it).
> 
> Ah, I we are saying the same.
> 
> > For the legacy format, we need to specify that the appended slices
> > don't repeat the annex-b start code and NAL header.
> 
> I'm not sure this one make sense. the NAL header for each slices of one
> frames are not always identical.

Yes but that's pretty much the point of the legacy format: to only
expose a single slice buffer and slice header (even in cases where the
bitstream codes them in multiple distinct ones).

We can't expect this to work in every case, that's why it's a legacy
format. It seems to work pretty well for cedrus so far.

We could also decide to ditch the legacy idea altogether and only
specify formats that operate on a per-slice basis, but I'm afraid we'll
find decoders that can only take a single slice per buffer.

When decoding a multi-slice frame in that setup, I think we'll be
better off with an appended buffer containing all the slices for the
frame instead of passing only a the first slice.

> > What do you think?
> > 
> > >  By the way, if we could queue
> > > twice the same buffer, that would in principal work, but internally
> > > there is only one state per buffer. If you do external allocation, then
> > > in theory you could workaround that, but then it's ugly, because you'll
> > > have two buffers with the same timestamp.
> > 
> > One advantage of the request API is that buffers are actually queued
> > when the request is processed, so this might not be too problematic.
> > 
> > I think what we need boils down to:
> > - Being able to queue the same output buffer to multiple requests,
> > which the request API should already allow;
> > - Being able to grab the right capture buffer based on the output
> > timestamp so that the different requests for the slices are rendered to
> > the same destination buffer.
> > 
> > For the second point, I don't really have a clear idea of whether we
> > can already expect v4l2 to allow picking a buffer that was marked done
> > but was not de-queued by userspace yet. It might already be allowed and
> > we could just implement something to lookup the buffer to grab by
> > timestamp.
> 
> An entirely difference solution that came to my mind in the last few
> days would be to add a new buffer flag that would mean END_OF_FRAME (or
> reused the generic LAST flag). This flag would be passed on the last
> slice (if it is known that we are handling the last one) or in an empty
> buffer if it is found through parsing the next following NAL. This is
> inspired by the optional OMX END_OF_FRAME flag and RTP marker bit.
> Though, if we make this flag mandatory, the driver could avoid marking
> the frame done until all slices has been decoded. The cons is that
> userpace is not informed when a specific slice is done decoding. This
> is quite niche, but you can use this information along with the list of
> macroblocks from the slice header so signal which portion of the image
> is now ready for an hypothetical video processing. The pros is that
> this solution can be per format, so this would not be needed for VP8 as
> an example.

Mhh, I don't really like the idea of setting an explicit order when
there is really none. I guess the slices for a given frame can be
decoded in whatever order, so I would like it better if we could just
submit the batch of requests and be told when the batch is done,
instead of specifying an explicit order and waiting for the last buffer
to be marked done.

And I think this batch idea could apply to other things than video
decoding, so it feels good to have it as the highest level we can in
media/v4l2.

> A third approach could be to use the encoded buffer state to track the
> progress decoding that slice. Many driver will mark the buffer done as
> soon as it is transferred to the accelerator, it does not always match
> the moment that slice has been decoded. But has use said, we would need
> to study if it make sense to let a driver pick by timestamp a buffer
> that might already have reached done state. Other cons, is that polling
> for buffer states on the capture queue won't mean anything anymore. But
> combined with the FLAG, it would fix the cons of the FLAG solution.

Well, I think we should keep the done operation as-is, only give it a
different interpretation depending on whether the request is handled
individually or as part of a batch. I really think we shouldn't rely on
any buffer-level indication for completion when handling a batch, but
rather have something about the batch entity itself.

Cheers,

Paul

> > > An argument that was made early was that we don't need to support this
> > > right away because userspace can combine all the slices into one
> > > buffer. But for H264_SLICE_RAW format it's inconvenient, you'd need an
> > > extra control to tell the driver the offset to each slices, because the
> > > raw H264 does not have enough information to be parsed. RAW slice are
> > > also I believe de-emulated, which means the code use to prevent having
> > > pattern looking like a start code has been removed, so you cannot just
> > > prepend start codes. De-emulation seems better placed in userspace if
> > > the HW does not take care.
> > 
> > Mhh I'd like to avoid having having to specify the offset to each slice
> > for the legacy case. Just appending the encoded data (excluding slice
> > header and start code) works for cedrus and I think it makes sense more
> > generally. The idea is to only expose a single slice params and act as
> > if it was just one big slice buffer.
> > 
> > Come to think of it, maybe we need annex-b and mixed fashions of that
> > legacy pixfmt too...
> > 
> > > I also very dislike the idea that we would enforce merging all slice
> > > into the same buffer. The entire purpose of slices and the reason they
> > > are used in practice is that you can start decoding slices before you
> > > have all slices of a frame. This reduce drastically the latency for
> > > streaming use cases, like video conferencing. So forcing the merging of
> > > slices is basically like pretending slices have no benefits.
> > 
> > Of course, we don't want things to stay like this and this rework is
> > definitely needed to get serious performance and latency going.
> > 
> > One thing you should also be aware of: we're currently using a
> > workqueue between the job done irq and scheduling the next frame (in
> > v4l2 m2m).
> > 
> > Maybe we could manage to fit that into an atomic path to schedule the
> > next request in the previous job done irq context.
> > 
> > > I have just exposed the problem I see for now, to see what comes up.
> > > But I hope we be able to propose solution too in the short term (in no
> > > one beats me at it).
> > 
> > Seems that we have good grounds for a discussion!
> > 
> > Cheers,
> > 
> > Paul
> > 
> > > > +
> > > > +A typical frame would thus be decoded using the following sequence:
> > > > +
> > > > +1. Queue an ``OUTPUT`` buffer containing one unit of encoded bitstream data for
> > > > +   the decoding request, using :c:func:`VIDIOC_QBUF`.
> > > > +
> > > > +    * **Required fields:**
> > > > +
> > > > +      ``index``
> > > > +          index of the buffer being queued.
> > > > +
> > > > +      ``type``
> > > > +          type of the buffer.
> > > > +
> > > > +      ``bytesused``
> > > > +          number of bytes taken by the encoded data frame in the buffer.
> > > > +
> > > > +      ``flags``
> > > > +          the ``V4L2_BUF_FLAG_REQUEST_FD`` flag must be set.
> > > > +
> > > > +      ``request_fd``
> > > > +          must be set to the file descriptor of the decoding request.
> > > > +
> > > > +      ``timestamp``
> > > > +          must be set to a unique value per frame. This value will be propagated
> > > > +          into the decoded frame's buffer and can also be used to use this frame
> > > > +          as the reference of another.
> > > > +
> > > > +2. Set the codec-specific controls for the decoding request, using
> > > > +   :c:func:`VIDIOC_S_EXT_CTRLS`.
> > > > +
> > > > +    * **Required fields:**
> > > > +
> > > > +      ``which``
> > > > +          must be ``V4L2_CTRL_WHICH_REQUEST_VAL``.
> > > > +
> > > > +      ``request_fd``
> > > > +          must be set to the file descriptor of the decoding request.
> > > > +
> > > > +      other fields
> > > > +          other fields are set as usual when setting controls. The ``controls``
> > > > +          array must contain all the codec-specific controls required to decode
> > > > +          a frame.
> > > > +
> > > > +   .. note::
> > > > +
> > > > +      It is possible to specify the controls in different invocations of
> > > > +      :c:func:`VIDIOC_S_EXT_CTRLS`, or to overwrite a previously set control, as
> > > > +      long as ``request_fd`` and ``which`` are properly set. The controls state
> > > > +      at the moment of request submission is the one that will be considered.
> > > > +
> > > > +   .. note::
> > > > +
> > > > +      The order in which steps 1 and 2 take place is interchangeable.
> > > > +
> > > > +3. Submit the request by invoking :c:func:`MEDIA_REQUEST_IOC_QUEUE` on the
> > > > +   request FD.
> > > > +
> > > > +    If the request is submitted without an ``OUTPUT`` buffer, or if some of the
> > > > +    required controls are missing from the request, then
> > > > +    :c:func:`MEDIA_REQUEST_IOC_QUEUE` will return ``-ENOENT``. If more than one
> > > > +    ``OUTPUT`` buffer is queued, then it will return ``-EINVAL``.
> > > > +    :c:func:`MEDIA_REQUEST_IOC_QUEUE` returning non-zero means that no
> > > > +    ``CAPTURE`` buffer will be produced for this request.
> > > > +
> > > > +``CAPTURE`` buffers must not be part of the request, and are queued
> > > > +independently. They are returned in decode order (i.e. the same order as coded
> > > > +frames were submitted to the ``OUTPUT`` queue).
> > > > +
> > > > +Runtime decoding errors are signaled by the dequeued ``CAPTURE`` buffers
> > > > +carrying the ``V4L2_BUF_FLAG_ERROR`` flag. If a decoded reference frame has an
> > > > +error, then all following decoded frames that refer to it also have the
> > > > +``V4L2_BUF_FLAG_ERROR`` flag set, although the decoder will still try to
> > > > +produce a (likely corrupted) frame.
> > > > +
> > > > +Buffer management while decoding
> > > > +================================
> > > > +Contrary to stateful decoders, a stateless decoder does not perform any kind of
> > > > +buffer management: it only guarantees that dequeued ``CAPTURE`` buffers can be
> > > > +used by the client for as long as they are not queued again. "Used" here
> > > > +encompasses using the buffer for compositing or display.
> > > > +
> > > > +A dequeued capture buffer can also be used as the reference frame of another
> > > > +buffer.
> > > > +
> > > > +A frame is specified as reference by converting its timestamp into nanoseconds,
> > > > +and storing it into the relevant member of a codec-dependent control structure.
> > > > +The :c:func:`v4l2_timeval_to_ns` function must be used to perform that
> > > > +conversion. The timestamp of a frame can be used to reference it as soon as all
> > > > +its units of encoded data are successfully submitted to the ``OUTPUT`` queue.
> > > > +
> > > > +A decoded buffer containing a reference frame must not be reused as a decoding
> > > > +target until all the frames referencing it have been decoded. The safest way to
> > > > +achieve this is to refrain from queueing a reference buffer until all the
> > > > +decoded frames referencing it have been dequeued. However, if the driver can
> > > > +guarantee that buffer queued to the ``CAPTURE`` queue will be used in queued
> > > > +order, then user-space can take advantage of this guarantee and queue a
> > > > +reference buffer when the following conditions are met:
> > > > +
> > > > +1. All the requests for frames affected by the reference frame have been
> > > > +   queued, and
> > > > +
> > > > +2. A sufficient number of ``CAPTURE`` buffers to cover all the decoded
> > > > +   referencing frames have been queued.
> > > > +
> > > > +When queuing a decoding request, the driver will increase the reference count of
> > > > +all the resources associated with reference frames. This means that the client
> > > > +can e.g. close the DMABUF file descriptors of reference frame buffers if it
> > > > +won't need them afterwards.
> > > > +
> > > > +Seeking
> > > > +=======
> > > > +In order to seek, the client just needs to submit requests using input buffers
> > > > +corresponding to the new stream position. It must however be aware that
> > > > +resolution may have changed and follow the dynamic resolution change sequence in
> > > > +that case. Also depending on the codec used, picture parameters (e.g. SPS/PPS
> > > > +for H.264) may have changed and the client is responsible for making sure that a
> > > > +valid state is sent to the decoder.
> > > > +
> > > > +The client is then free to ignore any returned ``CAPTURE`` buffer that comes
> > > > +from the pre-seek position.
> > > > +
> > > > +Pause
> > > > +=====
> > > > +
> > > > +In order to pause, the client can just cease queuing buffers onto the ``OUTPUT``
> > > > +queue. Without source bitstream data, there is no data to process and the codec
> > > > +will remain idle.
> > > > +
> > > > +Dynamic resolution change
> > > > +=========================
> > > > +
> > > > +If the client detects a resolution change in the stream, it will need to perform
> > > > +the initialization sequence again with the new resolution:
> > > > +
> > > > +1. Wait until all submitted requests have completed and dequeue the
> > > > +   corresponding output buffers.
> > > > +
> > > > +2. Call :c:func:`VIDIOC_STREAMOFF` on both the ``OUTPUT`` and ``CAPTURE``
> > > > +   queues.
> > > > +
> > > > +3. Free all ``CAPTURE`` buffers by calling :c:func:`VIDIOC_REQBUFS` on the
> > > > +   ``CAPTURE`` queue with a buffer count of zero.
> > > > +
> > > > +4. Perform the initialization sequence again (minus the allocation of
> > > > +   ``OUTPUT`` buffers), with the new resolution set on the ``OUTPUT`` queue.
> > > > +   Note that due to resolution constraints, a different format may need to be
> > > > +   picked on the ``CAPTURE`` queue.
> > > > +
> > > > +Drain
> > > > +=====
> > > > +
> > > > +In order to drain the stream on a stateless decoder, the client just needs to
> > > > +wait until all the submitted requests are completed. There is no need to send a
> > > > +``V4L2_DEC_CMD_STOP`` command since requests are processed sequentially by the
> > > > +decoder.
Nicolas Dufresne April 15, 2019, 12:24 p.m. UTC | #5
Le lundi 15 avril 2019 à 09:58 +0200, Paul Kocialkowski a écrit :
> Hi,
> 
> On Sun, 2019-04-14 at 18:38 -0400, Nicolas Dufresne wrote:
> > Le dimanche 14 avril 2019 à 18:41 +0200, Paul Kocialkowski a écrit :
> > > Hi,
> > > 
> > > Le vendredi 12 avril 2019 à 16:47 -0400, Nicolas Dufresne a écrit :
> > > > Le mercredi 06 mars 2019 à 17:00 +0900, Alexandre Courbot a écrit :
> > > > > Documents the protocol that user-space should follow when
> > > > > communicating with stateless video decoders.
> > > > > 
> > > > > The stateless video decoding API makes use of the new request and tags
> > > > > APIs. While it has been implemented with the Cedrus driver so far, it
> > > > > should probably still be considered staging for a short while.
> > > 
> > > [...]
> > > 
> > > > From an IRC discussion with Paul and some more digging, I have found a
> > > > design problem in the decoding process.
> > > > 
> > > > In H264 and HEVC you can have multiple decoding unit per frames
> > > > (slices). This type of encoding is increasingly popular, specially for
> > > > low latency streaming use cases. The wording of this spec does allow
> > > > for the notion of decoding unit, and in practice it has been proven to
> > > > work through some RFC FFMPEG patches and the Cedrus driver. But
> > > > something important to know is that the FFMPEG RFC implements decoding
> > > > in lock steps. Which means:
> > > > 
> > > >   1. It queues a single free capture buffer
> > > >   2. It queues an output buffer, set controls, queue the request
> > > >   3. It waits for a capture buffer to reach state done
> > > >   4. It dequeues that capture buffer, and queue it back again
> > > >   5. And then it runs step 2,4,3 again with following slices, until we 
> > > >      have a complete frame. After what, it restart at step 1
> > > > 
> > > > So the implementation makes no use of the queues. There is no batch
> > > > processing, so we might not be able to reach the maximum hardware
> > > > throughput.
> > > > 
> > > > So the optimal method would look like the following, but there comes
> > > > the design issue.
> > > > 
> > > >   1. Queue a single free capture buffer
> > > >   2. Queue output buffer for slice 1, set controls, queue the request
> > > >   3. Queue output buffer for slice 2, set controls, queue the request
> > > >   4. Wait for completion
> > > > 
> > > > The problem is in step 4. Completion means that the capture buffer done
> > > > decoding a single unit. So assuming the driver supports matching the
> > > > timestamp against the queued buffer, instead of waiting for a new
> > > > buffer, the driver would have to mark twice the same buffer to done
> > > > state, which is just not working to inform userspace that all slices
> > > > are decoded into the one capture buffer they share.
> > > 
> > > Interestingly, I'm experiencing the exact same problem dealing with a
> > > 2D graphics blitter that has limited ouput scaling abilities which
> > > imply handlnig a large scaling operation as multiple clipped smaller
> > > scaling operations. The issue is basically that multiple jobs have to
> > > be submitted to complete a single frame and relying on an indication
> > > from the destination buffer (such as a fence) doesn't work to indicate
> > > that all the operations were completed, since we get the indication at
> > > each step instead of at the end of the batch.
> > > 
> > > One idea I see to solve this is to have a notion of batch in the driver
> > > (for our situation, that would be in v4l2) and provide means to get a
> > > done indication for that entity.
> > > 
> > > I think we could extend the request API to allow this. We already
> > > represent requests as individual file descriptors, we could totally
> > > group requests in batches and get a sync fd for the batch to poll on
> > > when we need to return the frames. It would be good if we could expose
> > > this in a way that makes it work with DRM as an in fence for display.
> > > Then we can pretty much schedule our flip + decoding together (which is
> > > quite nice to have when we're running late on the decoding side).
> > > 
> > > What do you think?
> > > 
> > > It feels to me like the request API was designed to open up the way for
> > > these kinds of improvements, so I'm sure we can find an agreeable
> > > solution that extends the API.
> > > 
> > > > To me, multi slice encoded stream are just too common, and they will
> > > > also exist for AV1. So we really need a solution to this that does not
> > > > require operating in lock steps. Specially that some HW can decode
> > > > multiple slices in parallel (multi core), we would not want to prevent
> > > > that HW from being used efficiently. On top of this, we need a solution
> > > > so that we can also keep queuing slice of the following frames if they
> > > > arrive before decoding is done.
> > > 
> > > Agreed.
> > > 
> > > > I don't have a solution yet myself, but it would be nice to come up
> > > > with something before we freeze this API.
> > > 
> > > I think it's rather independent from the codec used and this is
> > > something that should be handled at the request API level. 
> > > 
> > > I'm not sure we can always expect the hardware to be able to operate on
> > > a per-slice basis. I think it would be useful to reflect this in the
> > > pixel format, so that we also have a possibility for a gathered slice
> > > buffer (as the spec currently mentions for mpeg-2) for legacy decoder
> > > hardware that will need to decode one frame in one go from a contiguous
> > > buffer with all the slice data appended.
> > > 
> > > This updates my pixel format proposition from IRC to the following:
> > > - V4L2_PIXFMT_H264_SLICE_APPENDED: single buffer for all the slices
> > > (appended buffer), slice params as v4l2 control (legacy);
> > 
> > SLICE_RAW_APPENDED ? Or FRAMED_SLICES_RAW ? You would need a new
> > control for the NAL index, as there is no way to figure-out otherwise.
> > I would not add this format unless a specific HW need it.
> 
> I don't really like using "raw" as a distinguisher: I don't think it's
> explicit enough. The idea here is to reflect that there is only one
> slice exposed, which is the appended result of all the frame slices
> with a single v4l2 control.

RAW in this context was suggested to reflect the fact there is no
header, no slice header and that emulation prevention bytes has been
removed and replaces by the real values.  Just SLICE alone was much
worst. There is to many properties to this type of H264 buffer to
encode everything into the name, so what will really matter in the end
if the documentation. Feel free to propose a better name.

> 
> > > - V4L2_PIXFMT_H264_SLICE: one buffer/slice, slice params as control;
> > > 
> > > - V4L2_PIXFMT_H264_SLICE_ANNEX_B: one buffer/slice in annex-b format, 
> > > slice params encoded in the buffer;
> > 
> > We are still working on this one, this format will be used by Rockchip
> > driver for sure, but this needs clarification and maybe a rename if
> > it's not just one slice per buffer.
> 
> I thought the decoder also needed the parse slice data? At least IIRC
> for Tegra, we need Annex-B format and a parsed slice header (so the
> next one).

Yes, in every cases, the HW will parse the slice data. It's possible
that Tegra have a matching format as Rockchip, someone would need to do
a proper integration to verify. But the driver does not need the
following one, that is specific to ANNEX-B parsing.

> 
> > > - V4L2_PIXFMT_H264_SLICE_MIXED: one buffer/slice in annex-b format,
> > > slice params encoded in the buffer and in slice params control;
> > > 
> > > Also, we need to make sure to have a per-slice bit offset to the
> > > encoded data in the slice params control so that the same slice buffer
> > > can be used with any of the _SLICE formats (e.g. ffmpeg would only have
> > > an annex-b slice and use any of the formats with it).
> > 
> > Ah, I we are saying the same.
> > 
> > > For the legacy format, we need to specify that the appended slices
> > > don't repeat the annex-b start code and NAL header.
> > 
> > I'm not sure this one make sense. the NAL header for each slices of one
> > frames are not always identical.
> 
> Yes but that's pretty much the point of the legacy format: to only
> expose a single slice buffer and slice header (even in cases where the
> bitstream codes them in multiple distinct ones).
> 
> We can't expect this to work in every case, that's why it's a legacy
> format. It seems to work pretty well for cedrus so far.

I'm not sure I follow you, what Cedrus does should be changed to
whatever we decide as a final API, we should not maintain two formats.
Also, what works for Cedrus is that a each buffers must have a single
slice regardless how many slices per frame. And this is what I expect
from most stateless HW. This is how it works in VAAPI and VDPAU as an
example. Just for the reference, the API in VAAPI is (pseudo code, I
can't remember the exact name):

   - beginPicture()
   - decodeSlice() *
   - endPicture()

So the accelerator is told explicitly when a frame start/end, but also
it's told explicitly in which buffer to decode the frame to.

> 
> We could also decide to ditch the legacy idea altogether and only
> specify formats that operate on a per-slice basis, but I'm afraid we'll
> find decoders that can only take a single slice per buffer.

It's impossible for a compliant decoder to only support 1 slice per
frame, so I don't follow you on this one. Also, I don't understand what
difference you see between per-slice basis and single slice per buffer.

> 
> When decoding a multi-slice frame in that setup, I think we'll be
> better off with an appended buffer containing all the slices for the
> frame instead of passing only a the first slice.

Appended slices requires extra controls, but also introduce a lot more
decoding latency. As soon as we add the missing frame boundary
signalling, it should be really trivial for a driver to wait until it
received all slices before starting the decoding if that is a HW
requirement.

> 
> > > What do you think?
> > > 
> > > >  By the way, if we could queue
> > > > twice the same buffer, that would in principal work, but internally
> > > > there is only one state per buffer. If you do external allocation, then
> > > > in theory you could workaround that, but then it's ugly, because you'll
> > > > have two buffers with the same timestamp.
> > > 
> > > One advantage of the request API is that buffers are actually queued
> > > when the request is processed, so this might not be too problematic.
> > > 
> > > I think what we need boils down to:
> > > - Being able to queue the same output buffer to multiple requests,
> > > which the request API should already allow;
> > > - Being able to grab the right capture buffer based on the output
> > > timestamp so that the different requests for the slices are rendered to
> > > the same destination buffer.
> > > 
> > > For the second point, I don't really have a clear idea of whether we
> > > can already expect v4l2 to allow picking a buffer that was marked done
> > > but was not de-queued by userspace yet. It might already be allowed and
> > > we could just implement something to lookup the buffer to grab by
> > > timestamp.
> > 
> > An entirely difference solution that came to my mind in the last few
> > days would be to add a new buffer flag that would mean END_OF_FRAME (or
> > reused the generic LAST flag). This flag would be passed on the last
> > slice (if it is known that we are handling the last one) or in an empty
> > buffer if it is found through parsing the next following NAL. This is
> > inspired by the optional OMX END_OF_FRAME flag and RTP marker bit.
> > Though, if we make this flag mandatory, the driver could avoid marking
> > the frame done until all slices has been decoded. The cons is that
> > userpace is not informed when a specific slice is done decoding. This
> > is quite niche, but you can use this information along with the list of
> > macroblocks from the slice header so signal which portion of the image
> > is now ready for an hypothetical video processing. The pros is that
> > this solution can be per format, so this would not be needed for VP8 as
> > an example.
> 
> Mhh, I don't really like the idea of setting an explicit order when
> there is really none. I guess the slices for a given frame can be
> decoded in whatever order, so I would like it better if we could just
> submit the batch of requests and be told when the batch is done,
> instead of specifying an explicit order and waiting for the last buffer
> to be marked done.
> 
> And I think this batch idea could apply to other things than video
> decoding, so it feels good to have it as the highest level we can in
> media/v4l2.

I haven't said anything about order. I believe you can decode slice
out-of-order in H264 but it is likely not true for all formats. You are
again missing the point of decoding latency.

In live stream, the slices are transmitted over some serial link. If
you wait until you have all slice before you start decoding, you delay
further the moment the frame will be ready. A lot of vendors make use
of this to reduce latency, and libWebRTC also makes use of this. So
being able to pass slices as part of a specific frame is rather
important. Otherwise vendor will keep doing their own stuff as the
Linux kernel API won't allow reaching their customers expectation.

The batching capabilities should be used for the case the multiple
slices of a frame (or multiple slices of many frame is supported by the
HW) have been queued before the previous batch had completed.

> 
> > A third approach could be to use the encoded buffer state to track the
> > progress decoding that slice. Many driver will mark the buffer done as
> > soon as it is transferred to the accelerator, it does not always match
> > the moment that slice has been decoded. But has use said, we would need
> > to study if it make sense to let a driver pick by timestamp a buffer
> > that might already have reached done state. Other cons, is that polling
> > for buffer states on the capture queue won't mean anything anymore. But
> > combined with the FLAG, it would fix the cons of the FLAG solution.
> 
> Well, I think we should keep the done operation as-is, only give it a
> different interpretation depending on whether the request is handled
> individually or as part of a batch. I really think we shouldn't rely on
> any buffer-level indication for completion when handling a batch, but
> rather have something about the batch entity itself.

But then there is no way to know when the frame is decoded anymore,
because as soon as the first slice is decoded, the capture is done and
stay done. So what's your idea here, how to do you know your decoding
is complete if you haven't dequeue/queue the frame in between slices ?

> 
> Cheers,
> 
> Paul
> 
> > > > An argument that was made early was that we don't need to support this
> > > > right away because userspace can combine all the slices into one
> > > > buffer. But for H264_SLICE_RAW format it's inconvenient, you'd need an
> > > > extra control to tell the driver the offset to each slices, because the
> > > > raw H264 does not have enough information to be parsed. RAW slice are
> > > > also I believe de-emulated, which means the code use to prevent having
> > > > pattern looking like a start code has been removed, so you cannot just
> > > > prepend start codes. De-emulation seems better placed in userspace if
> > > > the HW does not take care.
> > > 
> > > Mhh I'd like to avoid having having to specify the offset to each slice
> > > for the legacy case. Just appending the encoded data (excluding slice
> > > header and start code) works for cedrus and I think it makes sense more
> > > generally. The idea is to only expose a single slice params and act as
> > > if it was just one big slice buffer.
> > > 
> > > Come to think of it, maybe we need annex-b and mixed fashions of that
> > > legacy pixfmt too...
> > > 
> > > > I also very dislike the idea that we would enforce merging all slice
> > > > into the same buffer. The entire purpose of slices and the reason they
> > > > are used in practice is that you can start decoding slices before you
> > > > have all slices of a frame. This reduce drastically the latency for
> > > > streaming use cases, like video conferencing. So forcing the merging of
> > > > slices is basically like pretending slices have no benefits.
> > > 
> > > Of course, we don't want things to stay like this and this rework is
> > > definitely needed to get serious performance and latency going.
> > > 
> > > One thing you should also be aware of: we're currently using a
> > > workqueue between the job done irq and scheduling the next frame (in
> > > v4l2 m2m).
> > > 
> > > Maybe we could manage to fit that into an atomic path to schedule the
> > > next request in the previous job done irq context.
> > > 
> > > > I have just exposed the problem I see for now, to see what comes up.
> > > > But I hope we be able to propose solution too in the short term (in no
> > > > one beats me at it).
> > > 
> > > Seems that we have good grounds for a discussion!
> > > 
> > > Cheers,
> > > 
> > > Paul
> > > 
> > > > > +
> > > > > +A typical frame would thus be decoded using the following sequence:
> > > > > +
> > > > > +1. Queue an ``OUTPUT`` buffer containing one unit of encoded bitstream data for
> > > > > +   the decoding request, using :c:func:`VIDIOC_QBUF`.
> > > > > +
> > > > > +    * **Required fields:**
> > > > > +
> > > > > +      ``index``
> > > > > +          index of the buffer being queued.
> > > > > +
> > > > > +      ``type``
> > > > > +          type of the buffer.
> > > > > +
> > > > > +      ``bytesused``
> > > > > +          number of bytes taken by the encoded data frame in the buffer.
> > > > > +
> > > > > +      ``flags``
> > > > > +          the ``V4L2_BUF_FLAG_REQUEST_FD`` flag must be set.
> > > > > +
> > > > > +      ``request_fd``
> > > > > +          must be set to the file descriptor of the decoding request.
> > > > > +
> > > > > +      ``timestamp``
> > > > > +          must be set to a unique value per frame. This value will be propagated
> > > > > +          into the decoded frame's buffer and can also be used to use this frame
> > > > > +          as the reference of another.
> > > > > +
> > > > > +2. Set the codec-specific controls for the decoding request, using
> > > > > +   :c:func:`VIDIOC_S_EXT_CTRLS`.
> > > > > +
> > > > > +    * **Required fields:**
> > > > > +
> > > > > +      ``which``
> > > > > +          must be ``V4L2_CTRL_WHICH_REQUEST_VAL``.
> > > > > +
> > > > > +      ``request_fd``
> > > > > +          must be set to the file descriptor of the decoding request.
> > > > > +
> > > > > +      other fields
> > > > > +          other fields are set as usual when setting controls. The ``controls``
> > > > > +          array must contain all the codec-specific controls required to decode
> > > > > +          a frame.
> > > > > +
> > > > > +   .. note::
> > > > > +
> > > > > +      It is possible to specify the controls in different invocations of
> > > > > +      :c:func:`VIDIOC_S_EXT_CTRLS`, or to overwrite a previously set control, as
> > > > > +      long as ``request_fd`` and ``which`` are properly set. The controls state
> > > > > +      at the moment of request submission is the one that will be considered.
> > > > > +
> > > > > +   .. note::
> > > > > +
> > > > > +      The order in which steps 1 and 2 take place is interchangeable.
> > > > > +
> > > > > +3. Submit the request by invoking :c:func:`MEDIA_REQUEST_IOC_QUEUE` on the
> > > > > +   request FD.
> > > > > +
> > > > > +    If the request is submitted without an ``OUTPUT`` buffer, or if some of the
> > > > > +    required controls are missing from the request, then
> > > > > +    :c:func:`MEDIA_REQUEST_IOC_QUEUE` will return ``-ENOENT``. If more than one
> > > > > +    ``OUTPUT`` buffer is queued, then it will return ``-EINVAL``.
> > > > > +    :c:func:`MEDIA_REQUEST_IOC_QUEUE` returning non-zero means that no
> > > > > +    ``CAPTURE`` buffer will be produced for this request.
> > > > > +
> > > > > +``CAPTURE`` buffers must not be part of the request, and are queued
> > > > > +independently. They are returned in decode order (i.e. the same order as coded
> > > > > +frames were submitted to the ``OUTPUT`` queue).
> > > > > +
> > > > > +Runtime decoding errors are signaled by the dequeued ``CAPTURE`` buffers
> > > > > +carrying the ``V4L2_BUF_FLAG_ERROR`` flag. If a decoded reference frame has an
> > > > > +error, then all following decoded frames that refer to it also have the
> > > > > +``V4L2_BUF_FLAG_ERROR`` flag set, although the decoder will still try to
> > > > > +produce a (likely corrupted) frame.
> > > > > +
> > > > > +Buffer management while decoding
> > > > > +================================
> > > > > +Contrary to stateful decoders, a stateless decoder does not perform any kind of
> > > > > +buffer management: it only guarantees that dequeued ``CAPTURE`` buffers can be
> > > > > +used by the client for as long as they are not queued again. "Used" here
> > > > > +encompasses using the buffer for compositing or display.
> > > > > +
> > > > > +A dequeued capture buffer can also be used as the reference frame of another
> > > > > +buffer.
> > > > > +
> > > > > +A frame is specified as reference by converting its timestamp into nanoseconds,
> > > > > +and storing it into the relevant member of a codec-dependent control structure.
> > > > > +The :c:func:`v4l2_timeval_to_ns` function must be used to perform that
> > > > > +conversion. The timestamp of a frame can be used to reference it as soon as all
> > > > > +its units of encoded data are successfully submitted to the ``OUTPUT`` queue.
> > > > > +
> > > > > +A decoded buffer containing a reference frame must not be reused as a decoding
> > > > > +target until all the frames referencing it have been decoded. The safest way to
> > > > > +achieve this is to refrain from queueing a reference buffer until all the
> > > > > +decoded frames referencing it have been dequeued. However, if the driver can
> > > > > +guarantee that buffer queued to the ``CAPTURE`` queue will be used in queued
> > > > > +order, then user-space can take advantage of this guarantee and queue a
> > > > > +reference buffer when the following conditions are met:
> > > > > +
> > > > > +1. All the requests for frames affected by the reference frame have been
> > > > > +   queued, and
> > > > > +
> > > > > +2. A sufficient number of ``CAPTURE`` buffers to cover all the decoded
> > > > > +   referencing frames have been queued.
> > > > > +
> > > > > +When queuing a decoding request, the driver will increase the reference count of
> > > > > +all the resources associated with reference frames. This means that the client
> > > > > +can e.g. close the DMABUF file descriptors of reference frame buffers if it
> > > > > +won't need them afterwards.
> > > > > +
> > > > > +Seeking
> > > > > +=======
> > > > > +In order to seek, the client just needs to submit requests using input buffers
> > > > > +corresponding to the new stream position. It must however be aware that
> > > > > +resolution may have changed and follow the dynamic resolution change sequence in
> > > > > +that case. Also depending on the codec used, picture parameters (e.g. SPS/PPS
> > > > > +for H.264) may have changed and the client is responsible for making sure that a
> > > > > +valid state is sent to the decoder.
> > > > > +
> > > > > +The client is then free to ignore any returned ``CAPTURE`` buffer that comes
> > > > > +from the pre-seek position.
> > > > > +
> > > > > +Pause
> > > > > +=====
> > > > > +
> > > > > +In order to pause, the client can just cease queuing buffers onto the ``OUTPUT``
> > > > > +queue. Without source bitstream data, there is no data to process and the codec
> > > > > +will remain idle.
> > > > > +
> > > > > +Dynamic resolution change
> > > > > +=========================
> > > > > +
> > > > > +If the client detects a resolution change in the stream, it will need to perform
> > > > > +the initialization sequence again with the new resolution:
> > > > > +
> > > > > +1. Wait until all submitted requests have completed and dequeue the
> > > > > +   corresponding output buffers.
> > > > > +
> > > > > +2. Call :c:func:`VIDIOC_STREAMOFF` on both the ``OUTPUT`` and ``CAPTURE``
> > > > > +   queues.
> > > > > +
> > > > > +3. Free all ``CAPTURE`` buffers by calling :c:func:`VIDIOC_REQBUFS` on the
> > > > > +   ``CAPTURE`` queue with a buffer count of zero.
> > > > > +
> > > > > +4. Perform the initialization sequence again (minus the allocation of
> > > > > +   ``OUTPUT`` buffers), with the new resolution set on the ``OUTPUT`` queue.
> > > > > +   Note that due to resolution constraints, a different format may need to be
> > > > > +   picked on the ``CAPTURE`` queue.
> > > > > +
> > > > > +Drain
> > > > > +=====
> > > > > +
> > > > > +In order to drain the stream on a stateless decoder, the client just needs to
> > > > > +wait until all the submitted requests are completed. There is no need to send a
> > > > > +``V4L2_DEC_CMD_STOP`` command since requests are processed sequentially by the
> > > > > +decoder.
Paul Kocialkowski April 15, 2019, 1:26 p.m. UTC | #6
Hi,

On Mon, 2019-04-15 at 08:24 -0400, Nicolas Dufresne wrote:
> Le lundi 15 avril 2019 à 09:58 +0200, Paul Kocialkowski a écrit :
> > Hi,
> > 
> > On Sun, 2019-04-14 at 18:38 -0400, Nicolas Dufresne wrote:
> > > Le dimanche 14 avril 2019 à 18:41 +0200, Paul Kocialkowski a écrit :
> > > > Hi,
> > > > 
> > > > Le vendredi 12 avril 2019 à 16:47 -0400, Nicolas Dufresne a écrit :
> > > > > Le mercredi 06 mars 2019 à 17:00 +0900, Alexandre Courbot a écrit :
> > > > > > Documents the protocol that user-space should follow when
> > > > > > communicating with stateless video decoders.
> > > > > > 
> > > > > > The stateless video decoding API makes use of the new request and tags
> > > > > > APIs. While it has been implemented with the Cedrus driver so far, it
> > > > > > should probably still be considered staging for a short while.
> > > > 
> > > > [...]
> > > > 
> > > > > From an IRC discussion with Paul and some more digging, I have found a
> > > > > design problem in the decoding process.
> > > > > 
> > > > > In H264 and HEVC you can have multiple decoding unit per frames
> > > > > (slices). This type of encoding is increasingly popular, specially for
> > > > > low latency streaming use cases. The wording of this spec does allow
> > > > > for the notion of decoding unit, and in practice it has been proven to
> > > > > work through some RFC FFMPEG patches and the Cedrus driver. But
> > > > > something important to know is that the FFMPEG RFC implements decoding
> > > > > in lock steps. Which means:
> > > > > 
> > > > >   1. It queues a single free capture buffer
> > > > >   2. It queues an output buffer, set controls, queue the request
> > > > >   3. It waits for a capture buffer to reach state done
> > > > >   4. It dequeues that capture buffer, and queue it back again
> > > > >   5. And then it runs step 2,4,3 again with following slices, until we 
> > > > >      have a complete frame. After what, it restart at step 1
> > > > > 
> > > > > So the implementation makes no use of the queues. There is no batch
> > > > > processing, so we might not be able to reach the maximum hardware
> > > > > throughput.
> > > > > 
> > > > > So the optimal method would look like the following, but there comes
> > > > > the design issue.
> > > > > 
> > > > >   1. Queue a single free capture buffer
> > > > >   2. Queue output buffer for slice 1, set controls, queue the request
> > > > >   3. Queue output buffer for slice 2, set controls, queue the request
> > > > >   4. Wait for completion
> > > > > 
> > > > > The problem is in step 4. Completion means that the capture buffer done
> > > > > decoding a single unit. So assuming the driver supports matching the
> > > > > timestamp against the queued buffer, instead of waiting for a new
> > > > > buffer, the driver would have to mark twice the same buffer to done
> > > > > state, which is just not working to inform userspace that all slices
> > > > > are decoded into the one capture buffer they share.
> > > > 
> > > > Interestingly, I'm experiencing the exact same problem dealing with a
> > > > 2D graphics blitter that has limited ouput scaling abilities which
> > > > imply handlnig a large scaling operation as multiple clipped smaller
> > > > scaling operations. The issue is basically that multiple jobs have to
> > > > be submitted to complete a single frame and relying on an indication
> > > > from the destination buffer (such as a fence) doesn't work to indicate
> > > > that all the operations were completed, since we get the indication at
> > > > each step instead of at the end of the batch.
> > > > 
> > > > One idea I see to solve this is to have a notion of batch in the driver
> > > > (for our situation, that would be in v4l2) and provide means to get a
> > > > done indication for that entity.
> > > > 
> > > > I think we could extend the request API to allow this. We already
> > > > represent requests as individual file descriptors, we could totally
> > > > group requests in batches and get a sync fd for the batch to poll on
> > > > when we need to return the frames. It would be good if we could expose
> > > > this in a way that makes it work with DRM as an in fence for display.
> > > > Then we can pretty much schedule our flip + decoding together (which is
> > > > quite nice to have when we're running late on the decoding side).
> > > > 
> > > > What do you think?
> > > > 
> > > > It feels to me like the request API was designed to open up the way for
> > > > these kinds of improvements, so I'm sure we can find an agreeable
> > > > solution that extends the API.
> > > > 
> > > > > To me, multi slice encoded stream are just too common, and they will
> > > > > also exist for AV1. So we really need a solution to this that does not
> > > > > require operating in lock steps. Specially that some HW can decode
> > > > > multiple slices in parallel (multi core), we would not want to prevent
> > > > > that HW from being used efficiently. On top of this, we need a solution
> > > > > so that we can also keep queuing slice of the following frames if they
> > > > > arrive before decoding is done.
> > > > 
> > > > Agreed.
> > > > 
> > > > > I don't have a solution yet myself, but it would be nice to come up
> > > > > with something before we freeze this API.
> > > > 
> > > > I think it's rather independent from the codec used and this is
> > > > something that should be handled at the request API level. 
> > > > 
> > > > I'm not sure we can always expect the hardware to be able to operate on
> > > > a per-slice basis. I think it would be useful to reflect this in the
> > > > pixel format, so that we also have a possibility for a gathered slice
> > > > buffer (as the spec currently mentions for mpeg-2) for legacy decoder
> > > > hardware that will need to decode one frame in one go from a contiguous
> > > > buffer with all the slice data appended.
> > > > 
> > > > This updates my pixel format proposition from IRC to the following:
> > > > - V4L2_PIXFMT_H264_SLICE_APPENDED: single buffer for all the slices
> > > > (appended buffer), slice params as v4l2 control (legacy);
> > > 
> > > SLICE_RAW_APPENDED ? Or FRAMED_SLICES_RAW ? You would need a new
> > > control for the NAL index, as there is no way to figure-out otherwise.
> > > I would not add this format unless a specific HW need it.
> > 
> > I don't really like using "raw" as a distinguisher: I don't think it's
> > explicit enough. The idea here is to reflect that there is only one
> > slice exposed, which is the appended result of all the frame slices
> > with a single v4l2 control.
> 
> RAW in this context was suggested to reflect the fact there is no
> header, no slice header and that emulation prevention bytes has been
> removed and replaces by the real values.

That could also be understood as "slice params coded raw", which is the
opposite of what it describes, hence my reluctance.

> Just SLICE alone was much worst.

Keep in mind that we already have a MPEG2_SLICE format in the public
API. We should probably decide what it should become based on the
outcome of this discussion.

>  There is to many properties to this type of H264 buffer to
> encode everything into the name, so what will really matter in the end
> if the documentation. Feel free to propose a better name.

Agreed, it's a side point. I always find it hard to find naming good,
as well as finding good naming (my suggestions aren't really top-notch
either).

Here is another proposition:
- SLICE_PARSED
- SLICE_ANNEX_B
- SLICE_PARSED_ANNEX_B

> > > > - V4L2_PIXFMT_H264_SLICE: one buffer/slice, slice params as control;
> > > > 
> > > > - V4L2_PIXFMT_H264_SLICE_ANNEX_B: one buffer/slice in annex-b format, 
> > > > slice params encoded in the buffer;
> > > 
> > > We are still working on this one, this format will be used by Rockchip
> > > driver for sure, but this needs clarification and maybe a rename if
> > > it's not just one slice per buffer.
> > 
> > I thought the decoder also needed the parse slice data? At least IIRC
> > for Tegra, we need Annex-B format and a parsed slice header (so the
> > next one).
> 
> Yes, in every cases, the HW will parse the slice data.

Ah sorry, I meant "need the parsed slice data" (missed the d), as in,
some decoders will need annex-b format but won't parse the slice header
on their own, so they also need the parsed slice header control.
Don't ask why...

In my proposition from above, that would be: SLICE_PARSED_ANNEX_B.

>  It's possible
> that Tegra have a matching format as Rockchip, someone would need to do
> a proper integration to verify. But the driver does not need the
> following one, that is specific to ANNEX-B parsing.
> 
> > > > - V4L2_PIXFMT_H264_SLICE_MIXED: one buffer/slice in annex-b format,
> > > > slice params encoded in the buffer and in slice params control;
> > > > 
> > > > Also, we need to make sure to have a per-slice bit offset to the
> > > > encoded data in the slice params control so that the same slice buffer
> > > > can be used with any of the _SLICE formats (e.g. ffmpeg would only have
> > > > an annex-b slice and use any of the formats with it).
> > > 
> > > Ah, I we are saying the same.
> > > 
> > > > For the legacy format, we need to specify that the appended slices
> > > > don't repeat the annex-b start code and NAL header.
> > > 
> > > I'm not sure this one make sense. the NAL header for each slices of one
> > > frames are not always identical.
> > 
> > Yes but that's pretty much the point of the legacy format: to only
> > expose a single slice buffer and slice header (even in cases where the
> > bitstream codes them in multiple distinct ones).
> > 
> > We can't expect this to work in every case, that's why it's a legacy
> > format. It seems to work pretty well for cedrus so far.
> 
> I'm not sure I follow you, what Cedrus does should be changed to
> whatever we decide as a final API, we should not maintain two formats.

That point has me hesitating. It depends on whether we can expect to
see hardware implementations with no support whatsoever for multi-slice 
per frame and just expect an aggregated buffer of slice compressed
data. This is one operation mode that the Allwinner VPU supports.

The point is not to use it in Cedrus since our VPU can operate per-
slice, but to allow supporting hardware decoders that can't do that in
the future.

I'm not sure it's healthy to make it a hard requirement for H.264
decoding to operate per-slice. Does that seem too far-fetched from your
perspective? I seem to recall from a discussion that some legacy
hardware only handles single-slices frames, but I may be wrong.

> Also, what works for Cedrus is that a each buffers must have a single
> slice regardless how many slices per frame. And this is what I expect
> from most stateless HW.

Currently, we append all the slices into one buffer and decode it in
one go with a slightly hacked slice params to reflect that. But of
course, we should be operating per-slice.

>  This is how it works in VAAPI and VDPAU as an
> example. Just for the reference, the API in VAAPI is (pseudo code, I
> can't remember the exact name):
> 
>    - beginPicture()
>    - decodeSlice() *
>    - endPicture()
> 
> So the accelerator is told explicitly when a frame start/end, but also
> it's told explicitly in which buffer to decode the frame to.

Yes definitely. We're also given all the parsed bitstream elements in
the right order so that we could already start queuing requests when
each slice is passed, and just wait for completion at endPicture.

> > We could also decide to ditch the legacy idea altogether and only
> > specify formats that operate on a per-slice basis, but I'm afraid we'll
> > find decoders that can only take a single slice per buffer.
> 
> It's impossible for a compliant decoder to only support 1 slice per
> frame, so I don't follow you on this one. Also, I don't understand what
> difference you see between per-slice basis and single slice per buffer.

Okay that's exactly what I wanted to know: whether it makes any sense
to build a decoder that only operates per-frame and not per-slice.
If you are confident we won't see that in the wild, we can make it an
API requirement to operate per-slice.

> > When decoding a multi-slice frame in that setup, I think we'll be
> > better off with an appended buffer containing all the slices for the
> > frame instead of passing only a the first slice.
> 
> Appended slices requires extra controls, but also introduce a lot more
> decoding latency. As soon as we add the missing frame boundary
> signalling, it should be really trivial for a driver to wait until it
> received all slices before starting the decoding if that is a HW
> requirement.

Well, I don't really like the idea of the driver being aware of any of
that (IMO the logic should be in the media core, not the driver).

If a driver can't do multiple slices, it shouldn't be up to the driver
to gather them together. But anyway, if you think we won't ever see
this kind of hardware, we can just drop the whole idea.

> > > > What do you think?
> > > > 
> > > > >  By the way, if we could queue
> > > > > twice the same buffer, that would in principal work, but internally
> > > > > there is only one state per buffer. If you do external allocation, then
> > > > > in theory you could workaround that, but then it's ugly, because you'll
> > > > > have two buffers with the same timestamp.
> > > > 
> > > > One advantage of the request API is that buffers are actually queued
> > > > when the request is processed, so this might not be too problematic.
> > > > 
> > > > I think what we need boils down to:
> > > > - Being able to queue the same output buffer to multiple requests,
> > > > which the request API should already allow;
> > > > - Being able to grab the right capture buffer based on the output
> > > > timestamp so that the different requests for the slices are rendered to
> > > > the same destination buffer.
> > > > 
> > > > For the second point, I don't really have a clear idea of whether we
> > > > can already expect v4l2 to allow picking a buffer that was marked done
> > > > but was not de-queued by userspace yet. It might already be allowed and
> > > > we could just implement something to lookup the buffer to grab by
> > > > timestamp.
> > > 
> > > An entirely difference solution that came to my mind in the last few
> > > days would be to add a new buffer flag that would mean END_OF_FRAME (or
> > > reused the generic LAST flag). This flag would be passed on the last
> > > slice (if it is known that we are handling the last one) or in an empty
> > > buffer if it is found through parsing the next following NAL. This is
> > > inspired by the optional OMX END_OF_FRAME flag and RTP marker bit.
> > > Though, if we make this flag mandatory, the driver could avoid marking
> > > the frame done until all slices has been decoded. The cons is that
> > > userpace is not informed when a specific slice is done decoding. This
> > > is quite niche, but you can use this information along with the list of
> > > macroblocks from the slice header so signal which portion of the image
> > > is now ready for an hypothetical video processing. The pros is that
> > > this solution can be per format, so this would not be needed for VP8 as
> > > an example.
> > 
> > Mhh, I don't really like the idea of setting an explicit order when
> > there is really none. I guess the slices for a given frame can be
> > decoded in whatever order, so I would like it better if we could just
> > submit the batch of requests and be told when the batch is done,
> > instead of specifying an explicit order and waiting for the last buffer
> > to be marked done.
> > 
> > And I think this batch idea could apply to other things than video
> > decoding, so it feels good to have it as the highest level we can in
> > media/v4l2.
> 
> I haven't said anything about order. I believe you can decode slice
> out-of-order in H264 but it is likely not true for all formats. You are
> again missing the point of decoding latency.

Well, having an END_OF_FRAME flag on one of the slices pretty much
implicitly defines an order (at least regarding this slice vs the
others).

> In live stream, the slices are transmitted over some serial link. If
> you wait until you have all slice before you start decoding, you delay
> further the moment the frame will be ready.

So that means we need some ability to add requests to a batch while the
batch is being handled. Seems a bit exotic but definitely legit, and it
can probably be done. Userspace would know when it has submitted all
the slices and move on to displaying the frame.

>  A lot of vendors make use
> of this to reduce latency, and libWebRTC also makes use of this. So
> being able to pass slices as part of a specific frame is rather
> important. Otherwise vendor will keep doing their own stuff as the
> Linux kernel API won't allow reaching their customers expectation.

I fully agree we need to prepare for all these low-latency
improvements. My goal is definitely to have something that can beat
vendor-specific implementations in upstream, not just a proof of
concept for half-baked decoding.

> The batching capabilities should be used for the case the multiple
> slices of a frame (or multiple slices of many frame is supported by the
> HW) have been queued before the previous batch had completed.
> 
> > > A third approach could be to use the encoded buffer state to track the
> > > progress decoding that slice. Many driver will mark the buffer done as
> > > soon as it is transferred to the accelerator, it does not always match
> > > the moment that slice has been decoded. But has use said, we would need
> > > to study if it make sense to let a driver pick by timestamp a buffer
> > > that might already have reached done state. Other cons, is that polling
> > > for buffer states on the capture queue won't mean anything anymore. But
> > > combined with the FLAG, it would fix the cons of the FLAG solution.
> > 
> > Well, I think we should keep the done operation as-is, only give it a
> > different interpretation depending on whether the request is handled
> > individually or as part of a batch. I really think we shouldn't rely on
> > any buffer-level indication for completion when handling a batch, but
> > rather have something about the batch entity itself.
> 
> But then there is no way to know when the frame is decoded anymore,
> because as soon as the first slice is decoded, the capture is done and
> stay done. So what's your idea here, how to do you know your decoding
> is complete if you haven't dequeue/queue the frame in between slices ?

Yes, I was thinking of exposing a sync object as a fd that can be
polled on, which would be associated with the request batch and
signalled when completed. That's pretty much a standalone fence that's
not backed by a buffer.

I would like that to be importable as a DRM input fence (if possible at
all), so we can schedule decoding and schedule a page flip for the
capture buffer at the same time. Then the capture buffer can be
displayed as soon as the decoding batch is done.

Cheers,

Paul

> > > > > An argument that was made early was that we don't need to support this
> > > > > right away because userspace can combine all the slices into one
> > > > > buffer. But for H264_SLICE_RAW format it's inconvenient, you'd need an
> > > > > extra control to tell the driver the offset to each slices, because the
> > > > > raw H264 does not have enough information to be parsed. RAW slice are
> > > > > also I believe de-emulated, which means the code use to prevent having
> > > > > pattern looking like a start code has been removed, so you cannot just
> > > > > prepend start codes. De-emulation seems better placed in userspace if
> > > > > the HW does not take care.
> > > > 
> > > > Mhh I'd like to avoid having having to specify the offset to each slice
> > > > for the legacy case. Just appending the encoded data (excluding slice
> > > > header and start code) works for cedrus and I think it makes sense more
> > > > generally. The idea is to only expose a single slice params and act as
> > > > if it was just one big slice buffer.
> > > > 
> > > > Come to think of it, maybe we need annex-b and mixed fashions of that
> > > > legacy pixfmt too...
> > > > 
> > > > > I also very dislike the idea that we would enforce merging all slice
> > > > > into the same buffer. The entire purpose of slices and the reason they
> > > > > are used in practice is that you can start decoding slices before you
> > > > > have all slices of a frame. This reduce drastically the latency for
> > > > > streaming use cases, like video conferencing. So forcing the merging of
> > > > > slices is basically like pretending slices have no benefits.
> > > > 
> > > > Of course, we don't want things to stay like this and this rework is
> > > > definitely needed to get serious performance and latency going.
> > > > 
> > > > One thing you should also be aware of: we're currently using a
> > > > workqueue between the job done irq and scheduling the next frame (in
> > > > v4l2 m2m).
> > > > 
> > > > Maybe we could manage to fit that into an atomic path to schedule the
> > > > next request in the previous job done irq context.
> > > > 
> > > > > I have just exposed the problem I see for now, to see what comes up.
> > > > > But I hope we be able to propose solution too in the short term (in no
> > > > > one beats me at it).
> > > > 
> > > > Seems that we have good grounds for a discussion!
> > > > 
> > > > Cheers,
> > > > 
> > > > Paul
> > > > 
> > > > > > +
> > > > > > +A typical frame would thus be decoded using the following sequence:
> > > > > > +
> > > > > > +1. Queue an ``OUTPUT`` buffer containing one unit of encoded bitstream data for
> > > > > > +   the decoding request, using :c:func:`VIDIOC_QBUF`.
> > > > > > +
> > > > > > +    * **Required fields:**
> > > > > > +
> > > > > > +      ``index``
> > > > > > +          index of the buffer being queued.
> > > > > > +
> > > > > > +      ``type``
> > > > > > +          type of the buffer.
> > > > > > +
> > > > > > +      ``bytesused``
> > > > > > +          number of bytes taken by the encoded data frame in the buffer.
> > > > > > +
> > > > > > +      ``flags``
> > > > > > +          the ``V4L2_BUF_FLAG_REQUEST_FD`` flag must be set.
> > > > > > +
> > > > > > +      ``request_fd``
> > > > > > +          must be set to the file descriptor of the decoding request.
> > > > > > +
> > > > > > +      ``timestamp``
> > > > > > +          must be set to a unique value per frame. This value will be propagated
> > > > > > +          into the decoded frame's buffer and can also be used to use this frame
> > > > > > +          as the reference of another.
> > > > > > +
> > > > > > +2. Set the codec-specific controls for the decoding request, using
> > > > > > +   :c:func:`VIDIOC_S_EXT_CTRLS`.
> > > > > > +
> > > > > > +    * **Required fields:**
> > > > > > +
> > > > > > +      ``which``
> > > > > > +          must be ``V4L2_CTRL_WHICH_REQUEST_VAL``.
> > > > > > +
> > > > > > +      ``request_fd``
> > > > > > +          must be set to the file descriptor of the decoding request.
> > > > > > +
> > > > > > +      other fields
> > > > > > +          other fields are set as usual when setting controls. The ``controls``
> > > > > > +          array must contain all the codec-specific controls required to decode
> > > > > > +          a frame.
> > > > > > +
> > > > > > +   .. note::
> > > > > > +
> > > > > > +      It is possible to specify the controls in different invocations of
> > > > > > +      :c:func:`VIDIOC_S_EXT_CTRLS`, or to overwrite a previously set control, as
> > > > > > +      long as ``request_fd`` and ``which`` are properly set. The controls state
> > > > > > +      at the moment of request submission is the one that will be considered.
> > > > > > +
> > > > > > +   .. note::
> > > > > > +
> > > > > > +      The order in which steps 1 and 2 take place is interchangeable.
> > > > > > +
> > > > > > +3. Submit the request by invoking :c:func:`MEDIA_REQUEST_IOC_QUEUE` on the
> > > > > > +   request FD.
> > > > > > +
> > > > > > +    If the request is submitted without an ``OUTPUT`` buffer, or if some of the
> > > > > > +    required controls are missing from the request, then
> > > > > > +    :c:func:`MEDIA_REQUEST_IOC_QUEUE` will return ``-ENOENT``. If more than one
> > > > > > +    ``OUTPUT`` buffer is queued, then it will return ``-EINVAL``.
> > > > > > +    :c:func:`MEDIA_REQUEST_IOC_QUEUE` returning non-zero means that no
> > > > > > +    ``CAPTURE`` buffer will be produced for this request.
> > > > > > +
> > > > > > +``CAPTURE`` buffers must not be part of the request, and are queued
> > > > > > +independently. They are returned in decode order (i.e. the same order as coded
> > > > > > +frames were submitted to the ``OUTPUT`` queue).
> > > > > > +
> > > > > > +Runtime decoding errors are signaled by the dequeued ``CAPTURE`` buffers
> > > > > > +carrying the ``V4L2_BUF_FLAG_ERROR`` flag. If a decoded reference frame has an
> > > > > > +error, then all following decoded frames that refer to it also have the
> > > > > > +``V4L2_BUF_FLAG_ERROR`` flag set, although the decoder will still try to
> > > > > > +produce a (likely corrupted) frame.
> > > > > > +
> > > > > > +Buffer management while decoding
> > > > > > +================================
> > > > > > +Contrary to stateful decoders, a stateless decoder does not perform any kind of
> > > > > > +buffer management: it only guarantees that dequeued ``CAPTURE`` buffers can be
> > > > > > +used by the client for as long as they are not queued again. "Used" here
> > > > > > +encompasses using the buffer for compositing or display.
> > > > > > +
> > > > > > +A dequeued capture buffer can also be used as the reference frame of another
> > > > > > +buffer.
> > > > > > +
> > > > > > +A frame is specified as reference by converting its timestamp into nanoseconds,
> > > > > > +and storing it into the relevant member of a codec-dependent control structure.
> > > > > > +The :c:func:`v4l2_timeval_to_ns` function must be used to perform that
> > > > > > +conversion. The timestamp of a frame can be used to reference it as soon as all
> > > > > > +its units of encoded data are successfully submitted to the ``OUTPUT`` queue.
> > > > > > +
> > > > > > +A decoded buffer containing a reference frame must not be reused as a decoding
> > > > > > +target until all the frames referencing it have been decoded. The safest way to
> > > > > > +achieve this is to refrain from queueing a reference buffer until all the
> > > > > > +decoded frames referencing it have been dequeued. However, if the driver can
> > > > > > +guarantee that buffer queued to the ``CAPTURE`` queue will be used in queued
> > > > > > +order, then user-space can take advantage of this guarantee and queue a
> > > > > > +reference buffer when the following conditions are met:
> > > > > > +
> > > > > > +1. All the requests for frames affected by the reference frame have been
> > > > > > +   queued, and
> > > > > > +
> > > > > > +2. A sufficient number of ``CAPTURE`` buffers to cover all the decoded
> > > > > > +   referencing frames have been queued.
> > > > > > +
> > > > > > +When queuing a decoding request, the driver will increase the reference count of
> > > > > > +all the resources associated with reference frames. This means that the client
> > > > > > +can e.g. close the DMABUF file descriptors of reference frame buffers if it
> > > > > > +won't need them afterwards.
> > > > > > +
> > > > > > +Seeking
> > > > > > +=======
> > > > > > +In order to seek, the client just needs to submit requests using input buffers
> > > > > > +corresponding to the new stream position. It must however be aware that
> > > > > > +resolution may have changed and follow the dynamic resolution change sequence in
> > > > > > +that case. Also depending on the codec used, picture parameters (e.g. SPS/PPS
> > > > > > +for H.264) may have changed and the client is responsible for making sure that a
> > > > > > +valid state is sent to the decoder.
> > > > > > +
> > > > > > +The client is then free to ignore any returned ``CAPTURE`` buffer that comes
> > > > > > +from the pre-seek position.
> > > > > > +
> > > > > > +Pause
> > > > > > +=====
> > > > > > +
> > > > > > +In order to pause, the client can just cease queuing buffers onto the ``OUTPUT``
> > > > > > +queue. Without source bitstream data, there is no data to process and the codec
> > > > > > +will remain idle.
> > > > > > +
> > > > > > +Dynamic resolution change
> > > > > > +=========================
> > > > > > +
> > > > > > +If the client detects a resolution change in the stream, it will need to perform
> > > > > > +the initialization sequence again with the new resolution:
> > > > > > +
> > > > > > +1. Wait until all submitted requests have completed and dequeue the
> > > > > > +   corresponding output buffers.
> > > > > > +
> > > > > > +2. Call :c:func:`VIDIOC_STREAMOFF` on both the ``OUTPUT`` and ``CAPTURE``
> > > > > > +   queues.
> > > > > > +
> > > > > > +3. Free all ``CAPTURE`` buffers by calling :c:func:`VIDIOC_REQBUFS` on the
> > > > > > +   ``CAPTURE`` queue with a buffer count of zero.
> > > > > > +
> > > > > > +4. Perform the initialization sequence again (minus the allocation of
> > > > > > +   ``OUTPUT`` buffers), with the new resolution set on the ``OUTPUT`` queue.
> > > > > > +   Note that due to resolution constraints, a different format may need to be
> > > > > > +   picked on the ``CAPTURE`` queue.
> > > > > > +
> > > > > > +Drain
> > > > > > +=====
> > > > > > +
> > > > > > +In order to drain the stream on a stateless decoder, the client just needs to
> > > > > > +wait until all the submitted requests are completed. There is no need to send a
> > > > > > +``V4L2_DEC_CMD_STOP`` command since requests are processed sequentially by the
> > > > > > +decoder.
Nicolas Dufresne April 15, 2019, 3:30 p.m. UTC | #7
Le lundi 15 avril 2019 à 15:26 +0200, Paul Kocialkowski a écrit :
> Hi,
> 
> On Mon, 2019-04-15 at 08:24 -0400, Nicolas Dufresne wrote:
> > Le lundi 15 avril 2019 à 09:58 +0200, Paul Kocialkowski a écrit :
> > > Hi,
> > > 
> > > On Sun, 2019-04-14 at 18:38 -0400, Nicolas Dufresne wrote:
> > > > Le dimanche 14 avril 2019 à 18:41 +0200, Paul Kocialkowski a écrit :
> > > > > Hi,
> > > > > 
> > > > > Le vendredi 12 avril 2019 à 16:47 -0400, Nicolas Dufresne a écrit :
> > > > > > Le mercredi 06 mars 2019 à 17:00 +0900, Alexandre Courbot a écrit :
> > > > > > > Documents the protocol that user-space should follow when
> > > > > > > communicating with stateless video decoders.
> > > > > > > 
> > > > > > > The stateless video decoding API makes use of the new request and tags
> > > > > > > APIs. While it has been implemented with the Cedrus driver so far, it
> > > > > > > should probably still be considered staging for a short while.
> > > > > 
> > > > > [...]
> > > > > 
> > > > > > From an IRC discussion with Paul and some more digging, I have found a
> > > > > > design problem in the decoding process.
> > > > > > 
> > > > > > In H264 and HEVC you can have multiple decoding unit per frames
> > > > > > (slices). This type of encoding is increasingly popular, specially for
> > > > > > low latency streaming use cases. The wording of this spec does allow
> > > > > > for the notion of decoding unit, and in practice it has been proven to
> > > > > > work through some RFC FFMPEG patches and the Cedrus driver. But
> > > > > > something important to know is that the FFMPEG RFC implements decoding
> > > > > > in lock steps. Which means:
> > > > > > 
> > > > > >   1. It queues a single free capture buffer
> > > > > >   2. It queues an output buffer, set controls, queue the request
> > > > > >   3. It waits for a capture buffer to reach state done
> > > > > >   4. It dequeues that capture buffer, and queue it back again
> > > > > >   5. And then it runs step 2,4,3 again with following slices, until we 
> > > > > >      have a complete frame. After what, it restart at step 1
> > > > > > 
> > > > > > So the implementation makes no use of the queues. There is no batch
> > > > > > processing, so we might not be able to reach the maximum hardware
> > > > > > throughput.
> > > > > > 
> > > > > > So the optimal method would look like the following, but there comes
> > > > > > the design issue.
> > > > > > 
> > > > > >   1. Queue a single free capture buffer
> > > > > >   2. Queue output buffer for slice 1, set controls, queue the request
> > > > > >   3. Queue output buffer for slice 2, set controls, queue the request
> > > > > >   4. Wait for completion
> > > > > > 
> > > > > > The problem is in step 4. Completion means that the capture buffer done
> > > > > > decoding a single unit. So assuming the driver supports matching the
> > > > > > timestamp against the queued buffer, instead of waiting for a new
> > > > > > buffer, the driver would have to mark twice the same buffer to done
> > > > > > state, which is just not working to inform userspace that all slices
> > > > > > are decoded into the one capture buffer they share.
> > > > > 
> > > > > Interestingly, I'm experiencing the exact same problem dealing with a
> > > > > 2D graphics blitter that has limited ouput scaling abilities which
> > > > > imply handlnig a large scaling operation as multiple clipped smaller
> > > > > scaling operations. The issue is basically that multiple jobs have to
> > > > > be submitted to complete a single frame and relying on an indication
> > > > > from the destination buffer (such as a fence) doesn't work to indicate
> > > > > that all the operations were completed, since we get the indication at
> > > > > each step instead of at the end of the batch.
> > > > > 
> > > > > One idea I see to solve this is to have a notion of batch in the driver
> > > > > (for our situation, that would be in v4l2) and provide means to get a
> > > > > done indication for that entity.
> > > > > 
> > > > > I think we could extend the request API to allow this. We already
> > > > > represent requests as individual file descriptors, we could totally
> > > > > group requests in batches and get a sync fd for the batch to poll on
> > > > > when we need to return the frames. It would be good if we could expose
> > > > > this in a way that makes it work with DRM as an in fence for display.
> > > > > Then we can pretty much schedule our flip + decoding together (which is
> > > > > quite nice to have when we're running late on the decoding side).
> > > > > 
> > > > > What do you think?
> > > > > 
> > > > > It feels to me like the request API was designed to open up the way for
> > > > > these kinds of improvements, so I'm sure we can find an agreeable
> > > > > solution that extends the API.
> > > > > 
> > > > > > To me, multi slice encoded stream are just too common, and they will
> > > > > > also exist for AV1. So we really need a solution to this that does not
> > > > > > require operating in lock steps. Specially that some HW can decode
> > > > > > multiple slices in parallel (multi core), we would not want to prevent
> > > > > > that HW from being used efficiently. On top of this, we need a solution
> > > > > > so that we can also keep queuing slice of the following frames if they
> > > > > > arrive before decoding is done.
> > > > > 
> > > > > Agreed.
> > > > > 
> > > > > > I don't have a solution yet myself, but it would be nice to come up
> > > > > > with something before we freeze this API.
> > > > > 
> > > > > I think it's rather independent from the codec used and this is
> > > > > something that should be handled at the request API level. 
> > > > > 
> > > > > I'm not sure we can always expect the hardware to be able to operate on
> > > > > a per-slice basis. I think it would be useful to reflect this in the
> > > > > pixel format, so that we also have a possibility for a gathered slice
> > > > > buffer (as the spec currently mentions for mpeg-2) for legacy decoder
> > > > > hardware that will need to decode one frame in one go from a contiguous
> > > > > buffer with all the slice data appended.
> > > > > 
> > > > > This updates my pixel format proposition from IRC to the following:
> > > > > - V4L2_PIXFMT_H264_SLICE_APPENDED: single buffer for all the slices
> > > > > (appended buffer), slice params as v4l2 control (legacy);
> > > > 
> > > > SLICE_RAW_APPENDED ? Or FRAMED_SLICES_RAW ? You would need a new
> > > > control for the NAL index, as there is no way to figure-out otherwise.
> > > > I would not add this format unless a specific HW need it.
> > > 
> > > I don't really like using "raw" as a distinguisher: I don't think it's
> > > explicit enough. The idea here is to reflect that there is only one
> > > slice exposed, which is the appended result of all the frame slices
> > > with a single v4l2 control.
> > 
> > RAW in this context was suggested to reflect the fact there is no
> > header, no slice header and that emulation prevention bytes has been
> > removed and replaces by the real values.
> 
> That could also be understood as "slice params coded raw", which is the
> opposite of what it describes, hence my reluctance.
> 
> > Just SLICE alone was much worst.
> 
> Keep in mind that we already have a MPEG2_SLICE format in the public
> API. We should probably decide what it should become based on the
> outcome of this discussion.
> 
> >  There is to many properties to this type of H264 buffer to
> > encode everything into the name, so what will really matter in the end
> > if the documentation. Feel free to propose a better name.
> 
> Agreed, it's a side point. I always find it hard to find naming good,
> as well as finding good naming (my suggestions aren't really top-notch
> either).
> 
> Here is another proposition:
> - SLICE_PARSED
> - SLICE_ANNEX_B
> - SLICE_PARSED_ANNEX_B

Ok, we'll keep working on that then, naming is hard. I guess by PARSED
you meant that the slice headers are passed as controls, and that
indeed make sense. But I really thought all stateless decoder would
required that. A hard bet obviously.

> 
> > > > > - V4L2_PIXFMT_H264_SLICE: one buffer/slice, slice params as control;
> > > > > 
> > > > > - V4L2_PIXFMT_H264_SLICE_ANNEX_B: one buffer/slice in annex-b format, 
> > > > > slice params encoded in the buffer;
> > > > 
> > > > We are still working on this one, this format will be used by Rockchip
> > > > driver for sure, but this needs clarification and maybe a rename if
> > > > it's not just one slice per buffer.
> > > 
> > > I thought the decoder also needed the parse slice data? At least IIRC
> > > for Tegra, we need Annex-B format and a parsed slice header (so the
> > > next one).
> > 
> > Yes, in every cases, the HW will parse the slice data.
> 
> Ah sorry, I meant "need the parsed slice data" (missed the d), as in,
> some decoders will need annex-b format but won't parse the slice header
> on their own, so they also need the parsed slice header control.
> Don't ask why...
> 
> In my proposition from above, that would be: SLICE_PARSED_ANNEX_B.
> 
> >  It's possible
> > that Tegra have a matching format as Rockchip, someone would need to do
> > a proper integration to verify. But the driver does not need the
> > following one, that is specific to ANNEX-B parsing.
> > 
> > > > > - V4L2_PIXFMT_H264_SLICE_MIXED: one buffer/slice in annex-b format,
> > > > > slice params encoded in the buffer and in slice params control;
> > > > > 
> > > > > Also, we need to make sure to have a per-slice bit offset to the
> > > > > encoded data in the slice params control so that the same slice buffer
> > > > > can be used with any of the _SLICE formats (e.g. ffmpeg would only have
> > > > > an annex-b slice and use any of the formats with it).
> > > > 
> > > > Ah, I we are saying the same.
> > > > 
> > > > > For the legacy format, we need to specify that the appended slices
> > > > > don't repeat the annex-b start code and NAL header.
> > > > 
> > > > I'm not sure this one make sense. the NAL header for each slices of one
> > > > frames are not always identical.
> > > 
> > > Yes but that's pretty much the point of the legacy format: to only
> > > expose a single slice buffer and slice header (even in cases where the
> > > bitstream codes them in multiple distinct ones).
> > > 
> > > We can't expect this to work in every case, that's why it's a legacy
> > > format. It seems to work pretty well for cedrus so far.
> > 
> > I'm not sure I follow you, what Cedrus does should be changed to
> > whatever we decide as a final API, we should not maintain two formats.
> 
> That point has me hesitating. It depends on whether we can expect to
> see hardware implementations with no support whatsoever for multi-slice 
> per frame and just expect an aggregated buffer of slice compressed
> data. This is one operation mode that the Allwinner VPU supports.
> 
> The point is not to use it in Cedrus since our VPU can operate per-
> slice, but to allow supporting hardware decoders that can't do that in
> the future.
> 
> I'm not sure it's healthy to make it a hard requirement for H.264
> decoding to operate per-slice. Does that seem too far-fetched from your
> perspective? I seem to recall from a discussion that some legacy
> hardware only handles single-slices frames, but I may be wrong.
> 
> > Also, what works for Cedrus is that a each buffers must have a single
> > slice regardless how many slices per frame. And this is what I expect
> > from most stateless HW.
> 
> Currently, we append all the slices into one buffer and decode it in
> one go with a slightly hacked slice params to reflect that. But of
> course, we should be operating per-slice.
> 
> >  This is how it works in VAAPI and VDPAU as an
> > example. Just for the reference, the API in VAAPI is (pseudo code, I
> > can't remember the exact name):
> > 
> >    - beginPicture()
> >    - decodeSlice() *
> >    - endPicture()
> > 
> > So the accelerator is told explicitly when a frame start/end, but also
> > it's told explicitly in which buffer to decode the frame to.
> 
> Yes definitely. We're also given all the parsed bitstream elements in
> the right order so that we could already start queuing requests when
> each slice is passed, and just wait for completion at endPicture.
> 
> > > We could also decide to ditch the legacy idea altogether and only
> > > specify formats that operate on a per-slice basis, but I'm afraid we'll
> > > find decoders that can only take a single slice per buffer.
> > 
> > It's impossible for a compliant decoder to only support 1 slice per
> > frame, so I don't follow you on this one. Also, I don't understand what
> > difference you see between per-slice basis and single slice per buffer.
> 
> Okay that's exactly what I wanted to know: whether it makes any sense
> to build a decoder that only operates per-frame and not per-slice.
> If you are confident we won't see that in the wild, we can make it an
> API requirement to operate per-slice.

There is probably a small distinction to make between supporting
multiple slices per frame and operating per slice. It's nice to know
that Cedrus support both. As we discussed today on IRC, if we introduce
a flag that tells the driver when the last slice of a frame is passed,
it would be relatively simple for the driver to do decide what to do.
Of course if the HW have a limitation of one allocation, it might not
be fully optimal as it would have to copy.

But as this is stateless decoder, I'm more inclined in introducing a
format that means just that, leaving it to userspace to do that right
packing.

> 
> > > When decoding a multi-slice frame in that setup, I think we'll be
> > > better off with an appended buffer containing all the slices for the
> > > frame instead of passing only a the first slice.
> > 
> > Appended slices requires extra controls, but also introduce a lot more
> > decoding latency. As soon as we add the missing frame boundary
> > signalling, it should be really trivial for a driver to wait until it
> > received all slices before starting the decoding if that is a HW
> > requirement.
> 
> Well, I don't really like the idea of the driver being aware of any of
> that (IMO the logic should be in the media core, not the driver).
> 
> If a driver can't do multiple slices, it shouldn't be up to the driver
> to gather them together. But anyway, if you think we won't ever see
> this kind of hardware, we can just drop the whole idea.

A compliant HW will support multiple slices per frame, that's not
really optional. But it may require all slices to be packed in a single
allocation, in which case it could copy, or we can just have a
dedicated format for this behaviour.

> 
> > > > > What do you think?
> > > > > 
> > > > > >  By the way, if we could queue
> > > > > > twice the same buffer, that would in principal work, but internally
> > > > > > there is only one state per buffer. If you do external allocation, then
> > > > > > in theory you could workaround that, but then it's ugly, because you'll
> > > > > > have two buffers with the same timestamp.
> > > > > 
> > > > > One advantage of the request API is that buffers are actually queued
> > > > > when the request is processed, so this might not be too problematic.
> > > > > 
> > > > > I think what we need boils down to:
> > > > > - Being able to queue the same output buffer to multiple requests,
> > > > > which the request API should already allow;
> > > > > - Being able to grab the right capture buffer based on the output
> > > > > timestamp so that the different requests for the slices are rendered to
> > > > > the same destination buffer.
> > > > > 
> > > > > For the second point, I don't really have a clear idea of whether we
> > > > > can already expect v4l2 to allow picking a buffer that was marked done
> > > > > but was not de-queued by userspace yet. It might already be allowed and
> > > > > we could just implement something to lookup the buffer to grab by
> > > > > timestamp.
> > > > 
> > > > An entirely difference solution that came to my mind in the last few
> > > > days would be to add a new buffer flag that would mean END_OF_FRAME (or
> > > > reused the generic LAST flag). This flag would be passed on the last
> > > > slice (if it is known that we are handling the last one) or in an empty
> > > > buffer if it is found through parsing the next following NAL. This is
> > > > inspired by the optional OMX END_OF_FRAME flag and RTP marker bit.
> > > > Though, if we make this flag mandatory, the driver could avoid marking
> > > > the frame done until all slices has been decoded. The cons is that
> > > > userpace is not informed when a specific slice is done decoding. This
> > > > is quite niche, but you can use this information along with the list of
> > > > macroblocks from the slice header so signal which portion of the image
> > > > is now ready for an hypothetical video processing. The pros is that
> > > > this solution can be per format, so this would not be needed for VP8 as
> > > > an example.
> > > 
> > > Mhh, I don't really like the idea of setting an explicit order when
> > > there is really none. I guess the slices for a given frame can be
> > > decoded in whatever order, so I would like it better if we could just
> > > submit the batch of requests and be told when the batch is done,
> > > instead of specifying an explicit order and waiting for the last buffer
> > > to be marked done.
> > > 
> > > And I think this batch idea could apply to other things than video
> > > decoding, so it feels good to have it as the highest level we can in
> > > media/v4l2.
> > 
> > I haven't said anything about order. I believe you can decode slice
> > out-of-order in H264 but it is likely not true for all formats. You are
> > again missing the point of decoding latency.
> 
> Well, having an END_OF_FRAME flag on one of the slices pretty much
> implicitly defines an order (at least regarding this slice vs the
> others).

No, the flag simply means that any following request will be on another
frame. It's more like "closing" the decoded frame. I believe you have a
good understanding of this proposal now after our IRC discussion.

> 
> > In live stream, the slices are transmitted over some serial link. If
> > you wait until you have all slice before you start decoding, you delay
> > further the moment the frame will be ready.
> 
> So that means we need some ability to add requests to a batch while the
> batch is being handled. Seems a bit exotic but definitely legit, and it
> can probably be done. Userspace would know when it has submitted all
> the slices and move on to displaying the frame.
> 
> >  A lot of vendors make use
> > of this to reduce latency, and libWebRTC also makes use of this. So
> > being able to pass slices as part of a specific frame is rather
> > important. Otherwise vendor will keep doing their own stuff as the
> > Linux kernel API won't allow reaching their customers expectation.
> 
> I fully agree we need to prepare for all these low-latency
> improvements. My goal is definitely to have something that can beat
> vendor-specific implementations in upstream, not just a proof of
> concept for half-baked decoding.

Great.

> 
> > The batching capabilities should be used for the case the multiple
> > slices of a frame (or multiple slices of many frame is supported by the
> > HW) have been queued before the previous batch had completed.
> > 
> > > > A third approach could be to use the encoded buffer state to track the
> > > > progress decoding that slice. Many driver will mark the buffer done as
> > > > soon as it is transferred to the accelerator, it does not always match
> > > > the moment that slice has been decoded. But has use said, we would need
> > > > to study if it make sense to let a driver pick by timestamp a buffer
> > > > that might already have reached done state. Other cons, is that polling
> > > > for buffer states on the capture queue won't mean anything anymore. But
> > > > combined with the FLAG, it would fix the cons of the FLAG solution.
> > > 
> > > Well, I think we should keep the done operation as-is, only give it a
> > > different interpretation depending on whether the request is handled
> > > individually or as part of a batch. I really think we shouldn't rely on
> > > any buffer-level indication for completion when handling a batch, but
> > > rather have something about the batch entity itself.
> > 
> > But then there is no way to know when the frame is decoded anymore,
> > because as soon as the first slice is decoded, the capture is done and
> > stay done. So what's your idea here, how to do you know your decoding
> > is complete if you haven't dequeue/queue the frame in between slices ?
> 
> Yes, I was thinking of exposing a sync object as a fd that can be
> polled on, which would be associated with the request batch and
> signalled when completed. That's pretty much a standalone fence that's
> not backed by a buffer.
> 
> I would like that to be importable as a DRM input fence (if possible at
> all), so we can schedule decoding and schedule a page flip for the
> capture buffer at the same time. Then the capture buffer can be
> displayed as soon as the decoding batch is done.

We'll have to bring that one is a specific topic. Right now there is
many peaces missing on DRM side as you already know. Also, fences would
be delivered out-of-order (decoding order), we'd need strong doc to
help user-space figure-out how to use this.

> 
> Cheers,
> 
> Paul
> 
> > > > > > An argument that was made early was that we don't need to support this
> > > > > > right away because userspace can combine all the slices into one
> > > > > > buffer. But for H264_SLICE_RAW format it's inconvenient, you'd need an
> > > > > > extra control to tell the driver the offset to each slices, because the
> > > > > > raw H264 does not have enough information to be parsed. RAW slice are
> > > > > > also I believe de-emulated, which means the code use to prevent having
> > > > > > pattern looking like a start code has been removed, so you cannot just
> > > > > > prepend start codes. De-emulation seems better placed in userspace if
> > > > > > the HW does not take care.
> > > > > 
> > > > > Mhh I'd like to avoid having having to specify the offset to each slice
> > > > > for the legacy case. Just appending the encoded data (excluding slice
> > > > > header and start code) works for cedrus and I think it makes sense more
> > > > > generally. The idea is to only expose a single slice params and act as
> > > > > if it was just one big slice buffer.
> > > > > 
> > > > > Come to think of it, maybe we need annex-b and mixed fashions of that
> > > > > legacy pixfmt too...
> > > > > 
> > > > > > I also very dislike the idea that we would enforce merging all slice
> > > > > > into the same buffer. The entire purpose of slices and the reason they
> > > > > > are used in practice is that you can start decoding slices before you
> > > > > > have all slices of a frame. This reduce drastically the latency for
> > > > > > streaming use cases, like video conferencing. So forcing the merging of
> > > > > > slices is basically like pretending slices have no benefits.
> > > > > 
> > > > > Of course, we don't want things to stay like this and this rework is
> > > > > definitely needed to get serious performance and latency going.
> > > > > 
> > > > > One thing you should also be aware of: we're currently using a
> > > > > workqueue between the job done irq and scheduling the next frame (in
> > > > > v4l2 m2m).
> > > > > 
> > > > > Maybe we could manage to fit that into an atomic path to schedule the
> > > > > next request in the previous job done irq context.
> > > > > 
> > > > > > I have just exposed the problem I see for now, to see what comes up.
> > > > > > But I hope we be able to propose solution too in the short term (in no
> > > > > > one beats me at it).
> > > > > 
> > > > > Seems that we have good grounds for a discussion!
> > > > > 
> > > > > Cheers,
> > > > > 
> > > > > Paul
> > > > > 
> > > > > > > +
> > > > > > > +A typical frame would thus be decoded using the following sequence:
> > > > > > > +
> > > > > > > +1. Queue an ``OUTPUT`` buffer containing one unit of encoded bitstream data for
> > > > > > > +   the decoding request, using :c:func:`VIDIOC_QBUF`.
> > > > > > > +
> > > > > > > +    * **Required fields:**
> > > > > > > +
> > > > > > > +      ``index``
> > > > > > > +          index of the buffer being queued.
> > > > > > > +
> > > > > > > +      ``type``
> > > > > > > +          type of the buffer.
> > > > > > > +
> > > > > > > +      ``bytesused``
> > > > > > > +          number of bytes taken by the encoded data frame in the buffer.
> > > > > > > +
> > > > > > > +      ``flags``
> > > > > > > +          the ``V4L2_BUF_FLAG_REQUEST_FD`` flag must be set.
> > > > > > > +
> > > > > > > +      ``request_fd``
> > > > > > > +          must be set to the file descriptor of the decoding request.
> > > > > > > +
> > > > > > > +      ``timestamp``
> > > > > > > +          must be set to a unique value per frame. This value will be propagated
> > > > > > > +          into the decoded frame's buffer and can also be used to use this frame
> > > > > > > +          as the reference of another.
> > > > > > > +
> > > > > > > +2. Set the codec-specific controls for the decoding request, using
> > > > > > > +   :c:func:`VIDIOC_S_EXT_CTRLS`.
> > > > > > > +
> > > > > > > +    * **Required fields:**
> > > > > > > +
> > > > > > > +      ``which``
> > > > > > > +          must be ``V4L2_CTRL_WHICH_REQUEST_VAL``.
> > > > > > > +
> > > > > > > +      ``request_fd``
> > > > > > > +          must be set to the file descriptor of the decoding request.
> > > > > > > +
> > > > > > > +      other fields
> > > > > > > +          other fields are set as usual when setting controls. The ``controls``
> > > > > > > +          array must contain all the codec-specific controls required to decode
> > > > > > > +          a frame.
> > > > > > > +
> > > > > > > +   .. note::
> > > > > > > +
> > > > > > > +      It is possible to specify the controls in different invocations of
> > > > > > > +      :c:func:`VIDIOC_S_EXT_CTRLS`, or to overwrite a previously set control, as
> > > > > > > +      long as ``request_fd`` and ``which`` are properly set. The controls state
> > > > > > > +      at the moment of request submission is the one that will be considered.
> > > > > > > +
> > > > > > > +   .. note::
> > > > > > > +
> > > > > > > +      The order in which steps 1 and 2 take place is interchangeable.
> > > > > > > +
> > > > > > > +3. Submit the request by invoking :c:func:`MEDIA_REQUEST_IOC_QUEUE` on the
> > > > > > > +   request FD.
> > > > > > > +
> > > > > > > +    If the request is submitted without an ``OUTPUT`` buffer, or if some of the
> > > > > > > +    required controls are missing from the request, then
> > > > > > > +    :c:func:`MEDIA_REQUEST_IOC_QUEUE` will return ``-ENOENT``. If more than one
> > > > > > > +    ``OUTPUT`` buffer is queued, then it will return ``-EINVAL``.
> > > > > > > +    :c:func:`MEDIA_REQUEST_IOC_QUEUE` returning non-zero means that no
> > > > > > > +    ``CAPTURE`` buffer will be produced for this request.
> > > > > > > +
> > > > > > > +``CAPTURE`` buffers must not be part of the request, and are queued
> > > > > > > +independently. They are returned in decode order (i.e. the same order as coded
> > > > > > > +frames were submitted to the ``OUTPUT`` queue).
> > > > > > > +
> > > > > > > +Runtime decoding errors are signaled by the dequeued ``CAPTURE`` buffers
> > > > > > > +carrying the ``V4L2_BUF_FLAG_ERROR`` flag. If a decoded reference frame has an
> > > > > > > +error, then all following decoded frames that refer to it also have the
> > > > > > > +``V4L2_BUF_FLAG_ERROR`` flag set, although the decoder will still try to
> > > > > > > +produce a (likely corrupted) frame.
> > > > > > > +
> > > > > > > +Buffer management while decoding
> > > > > > > +================================
> > > > > > > +Contrary to stateful decoders, a stateless decoder does not perform any kind of
> > > > > > > +buffer management: it only guarantees that dequeued ``CAPTURE`` buffers can be
> > > > > > > +used by the client for as long as they are not queued again. "Used" here
> > > > > > > +encompasses using the buffer for compositing or display.
> > > > > > > +
> > > > > > > +A dequeued capture buffer can also be used as the reference frame of another
> > > > > > > +buffer.
> > > > > > > +
> > > > > > > +A frame is specified as reference by converting its timestamp into nanoseconds,
> > > > > > > +and storing it into the relevant member of a codec-dependent control structure.
> > > > > > > +The :c:func:`v4l2_timeval_to_ns` function must be used to perform that
> > > > > > > +conversion. The timestamp of a frame can be used to reference it as soon as all
> > > > > > > +its units of encoded data are successfully submitted to the ``OUTPUT`` queue.
> > > > > > > +
> > > > > > > +A decoded buffer containing a reference frame must not be reused as a decoding
> > > > > > > +target until all the frames referencing it have been decoded. The safest way to
> > > > > > > +achieve this is to refrain from queueing a reference buffer until all the
> > > > > > > +decoded frames referencing it have been dequeued. However, if the driver can
> > > > > > > +guarantee that buffer queued to the ``CAPTURE`` queue will be used in queued
> > > > > > > +order, then user-space can take advantage of this guarantee and queue a
> > > > > > > +reference buffer when the following conditions are met:
> > > > > > > +
> > > > > > > +1. All the requests for frames affected by the reference frame have been
> > > > > > > +   queued, and
> > > > > > > +
> > > > > > > +2. A sufficient number of ``CAPTURE`` buffers to cover all the decoded
> > > > > > > +   referencing frames have been queued.
> > > > > > > +
> > > > > > > +When queuing a decoding request, the driver will increase the reference count of
> > > > > > > +all the resources associated with reference frames. This means that the client
> > > > > > > +can e.g. close the DMABUF file descriptors of reference frame buffers if it
> > > > > > > +won't need them afterwards.
> > > > > > > +
> > > > > > > +Seeking
> > > > > > > +=======
> > > > > > > +In order to seek, the client just needs to submit requests using input buffers
> > > > > > > +corresponding to the new stream position. It must however be aware that
> > > > > > > +resolution may have changed and follow the dynamic resolution change sequence in
> > > > > > > +that case. Also depending on the codec used, picture parameters (e.g. SPS/PPS
> > > > > > > +for H.264) may have changed and the client is responsible for making sure that a
> > > > > > > +valid state is sent to the decoder.
> > > > > > > +
> > > > > > > +The client is then free to ignore any returned ``CAPTURE`` buffer that comes
> > > > > > > +from the pre-seek position.
> > > > > > > +
> > > > > > > +Pause
> > > > > > > +=====
> > > > > > > +
> > > > > > > +In order to pause, the client can just cease queuing buffers onto the ``OUTPUT``
> > > > > > > +queue. Without source bitstream data, there is no data to process and the codec
> > > > > > > +will remain idle.
> > > > > > > +
> > > > > > > +Dynamic resolution change
> > > > > > > +=========================
> > > > > > > +
> > > > > > > +If the client detects a resolution change in the stream, it will need to perform
> > > > > > > +the initialization sequence again with the new resolution:
> > > > > > > +
> > > > > > > +1. Wait until all submitted requests have completed and dequeue the
> > > > > > > +   corresponding output buffers.
> > > > > > > +
> > > > > > > +2. Call :c:func:`VIDIOC_STREAMOFF` on both the ``OUTPUT`` and ``CAPTURE``
> > > > > > > +   queues.
> > > > > > > +
> > > > > > > +3. Free all ``CAPTURE`` buffers by calling :c:func:`VIDIOC_REQBUFS` on the
> > > > > > > +   ``CAPTURE`` queue with a buffer count of zero.
> > > > > > > +
> > > > > > > +4. Perform the initialization sequence again (minus the allocation of
> > > > > > > +   ``OUTPUT`` buffers), with the new resolution set on the ``OUTPUT`` queue.
> > > > > > > +   Note that due to resolution constraints, a different format may need to be
> > > > > > > +   picked on the ``CAPTURE`` queue.
> > > > > > > +
> > > > > > > +Drain
> > > > > > > +=====
> > > > > > > +
> > > > > > > +In order to drain the stream on a stateless decoder, the client just needs to
> > > > > > > +wait until all the submitted requests are completed. There is no need to send a
> > > > > > > +``V4L2_DEC_CMD_STOP`` command since requests are processed sequentially by the
> > > > > > > +decoder.
Alexandre Courbot April 16, 2019, 7:22 a.m. UTC | #8
On Tue, Apr 16, 2019 at 12:30 AM Nicolas Dufresne <nicolas@ndufresne.ca> wrote:
>
> Le lundi 15 avril 2019 à 15:26 +0200, Paul Kocialkowski a écrit :
> > Hi,
> >
> > On Mon, 2019-04-15 at 08:24 -0400, Nicolas Dufresne wrote:
> > > Le lundi 15 avril 2019 à 09:58 +0200, Paul Kocialkowski a écrit :
> > > > Hi,
> > > >
> > > > On Sun, 2019-04-14 at 18:38 -0400, Nicolas Dufresne wrote:
> > > > > Le dimanche 14 avril 2019 à 18:41 +0200, Paul Kocialkowski a écrit :
> > > > > > Hi,
> > > > > >
> > > > > > Le vendredi 12 avril 2019 à 16:47 -0400, Nicolas Dufresne a écrit :
> > > > > > > Le mercredi 06 mars 2019 à 17:00 +0900, Alexandre Courbot a écrit :
> > > > > > > > Documents the protocol that user-space should follow when
> > > > > > > > communicating with stateless video decoders.
> > > > > > > >
> > > > > > > > The stateless video decoding API makes use of the new request and tags
> > > > > > > > APIs. While it has been implemented with the Cedrus driver so far, it
> > > > > > > > should probably still be considered staging for a short while.
> > > > > >
> > > > > > [...]
> > > > > >
> > > > > > > From an IRC discussion with Paul and some more digging, I have found a
> > > > > > > design problem in the decoding process.
> > > > > > >
> > > > > > > In H264 and HEVC you can have multiple decoding unit per frames
> > > > > > > (slices). This type of encoding is increasingly popular, specially for
> > > > > > > low latency streaming use cases. The wording of this spec does allow
> > > > > > > for the notion of decoding unit, and in practice it has been proven to
> > > > > > > work through some RFC FFMPEG patches and the Cedrus driver. But
> > > > > > > something important to know is that the FFMPEG RFC implements decoding
> > > > > > > in lock steps. Which means:
> > > > > > >
> > > > > > >   1. It queues a single free capture buffer
> > > > > > >   2. It queues an output buffer, set controls, queue the request
> > > > > > >   3. It waits for a capture buffer to reach state done
> > > > > > >   4. It dequeues that capture buffer, and queue it back again
> > > > > > >   5. And then it runs step 2,4,3 again with following slices, until we
> > > > > > >      have a complete frame. After what, it restart at step 1
> > > > > > >
> > > > > > > So the implementation makes no use of the queues. There is no batch
> > > > > > > processing, so we might not be able to reach the maximum hardware
> > > > > > > throughput.
> > > > > > >
> > > > > > > So the optimal method would look like the following, but there comes
> > > > > > > the design issue.
> > > > > > >
> > > > > > >   1. Queue a single free capture buffer
> > > > > > >   2. Queue output buffer for slice 1, set controls, queue the request
> > > > > > >   3. Queue output buffer for slice 2, set controls, queue the request
> > > > > > >   4. Wait for completion
> > > > > > >
> > > > > > > The problem is in step 4. Completion means that the capture buffer done
> > > > > > > decoding a single unit. So assuming the driver supports matching the
> > > > > > > timestamp against the queued buffer, instead of waiting for a new
> > > > > > > buffer, the driver would have to mark twice the same buffer to done
> > > > > > > state, which is just not working to inform userspace that all slices
> > > > > > > are decoded into the one capture buffer they share.
> > > > > >
> > > > > > Interestingly, I'm experiencing the exact same problem dealing with a
> > > > > > 2D graphics blitter that has limited ouput scaling abilities which
> > > > > > imply handlnig a large scaling operation as multiple clipped smaller
> > > > > > scaling operations. The issue is basically that multiple jobs have to
> > > > > > be submitted to complete a single frame and relying on an indication
> > > > > > from the destination buffer (such as a fence) doesn't work to indicate
> > > > > > that all the operations were completed, since we get the indication at
> > > > > > each step instead of at the end of the batch.
> > > > > >
> > > > > > One idea I see to solve this is to have a notion of batch in the driver
> > > > > > (for our situation, that would be in v4l2) and provide means to get a
> > > > > > done indication for that entity.
> > > > > >
> > > > > > I think we could extend the request API to allow this. We already
> > > > > > represent requests as individual file descriptors, we could totally
> > > > > > group requests in batches and get a sync fd for the batch to poll on
> > > > > > when we need to return the frames. It would be good if we could expose
> > > > > > this in a way that makes it work with DRM as an in fence for display.
> > > > > > Then we can pretty much schedule our flip + decoding together (which is
> > > > > > quite nice to have when we're running late on the decoding side).
> > > > > >
> > > > > > What do you think?
> > > > > >
> > > > > > It feels to me like the request API was designed to open up the way for
> > > > > > these kinds of improvements, so I'm sure we can find an agreeable
> > > > > > solution that extends the API.
> > > > > >
> > > > > > > To me, multi slice encoded stream are just too common, and they will
> > > > > > > also exist for AV1. So we really need a solution to this that does not
> > > > > > > require operating in lock steps. Specially that some HW can decode
> > > > > > > multiple slices in parallel (multi core), we would not want to prevent
> > > > > > > that HW from being used efficiently. On top of this, we need a solution
> > > > > > > so that we can also keep queuing slice of the following frames if they
> > > > > > > arrive before decoding is done.
> > > > > >
> > > > > > Agreed.
> > > > > >
> > > > > > > I don't have a solution yet myself, but it would be nice to come up
> > > > > > > with something before we freeze this API.
> > > > > >
> > > > > > I think it's rather independent from the codec used and this is
> > > > > > something that should be handled at the request API level.
> > > > > >
> > > > > > I'm not sure we can always expect the hardware to be able to operate on
> > > > > > a per-slice basis. I think it would be useful to reflect this in the
> > > > > > pixel format, so that we also have a possibility for a gathered slice
> > > > > > buffer (as the spec currently mentions for mpeg-2) for legacy decoder
> > > > > > hardware that will need to decode one frame in one go from a contiguous
> > > > > > buffer with all the slice data appended.
> > > > > >
> > > > > > This updates my pixel format proposition from IRC to the following:
> > > > > > - V4L2_PIXFMT_H264_SLICE_APPENDED: single buffer for all the slices
> > > > > > (appended buffer), slice params as v4l2 control (legacy);
> > > > >
> > > > > SLICE_RAW_APPENDED ? Or FRAMED_SLICES_RAW ? You would need a new
> > > > > control for the NAL index, as there is no way to figure-out otherwise.
> > > > > I would not add this format unless a specific HW need it.
> > > >
> > > > I don't really like using "raw" as a distinguisher: I don't think it's
> > > > explicit enough. The idea here is to reflect that there is only one
> > > > slice exposed, which is the appended result of all the frame slices
> > > > with a single v4l2 control.
> > >
> > > RAW in this context was suggested to reflect the fact there is no
> > > header, no slice header and that emulation prevention bytes has been
> > > removed and replaces by the real values.
> >
> > That could also be understood as "slice params coded raw", which is the
> > opposite of what it describes, hence my reluctance.
> >
> > > Just SLICE alone was much worst.
> >
> > Keep in mind that we already have a MPEG2_SLICE format in the public
> > API. We should probably decide what it should become based on the
> > outcome of this discussion.
> >
> > >  There is to many properties to this type of H264 buffer to
> > > encode everything into the name, so what will really matter in the end
> > > if the documentation. Feel free to propose a better name.
> >
> > Agreed, it's a side point. I always find it hard to find naming good,
> > as well as finding good naming (my suggestions aren't really top-notch
> > either).
> >
> > Here is another proposition:
> > - SLICE_PARSED
> > - SLICE_ANNEX_B
> > - SLICE_PARSED_ANNEX_B
>
> Ok, we'll keep working on that then, naming is hard. I guess by PARSED
> you meant that the slice headers are passed as controls, and that
> indeed make sense. But I really thought all stateless decoder would
> required that. A hard bet obviously.
>
> >
> > > > > > - V4L2_PIXFMT_H264_SLICE: one buffer/slice, slice params as control;
> > > > > >
> > > > > > - V4L2_PIXFMT_H264_SLICE_ANNEX_B: one buffer/slice in annex-b format,
> > > > > > slice params encoded in the buffer;
> > > > >
> > > > > We are still working on this one, this format will be used by Rockchip
> > > > > driver for sure, but this needs clarification and maybe a rename if
> > > > > it's not just one slice per buffer.
> > > >
> > > > I thought the decoder also needed the parse slice data? At least IIRC
> > > > for Tegra, we need Annex-B format and a parsed slice header (so the
> > > > next one).
> > >
> > > Yes, in every cases, the HW will parse the slice data.
> >
> > Ah sorry, I meant "need the parsed slice data" (missed the d), as in,
> > some decoders will need annex-b format but won't parse the slice header
> > on their own, so they also need the parsed slice header control.
> > Don't ask why...
> >
> > In my proposition from above, that would be: SLICE_PARSED_ANNEX_B.
> >
> > >  It's possible
> > > that Tegra have a matching format as Rockchip, someone would need to do
> > > a proper integration to verify. But the driver does not need the
> > > following one, that is specific to ANNEX-B parsing.
> > >
> > > > > > - V4L2_PIXFMT_H264_SLICE_MIXED: one buffer/slice in annex-b format,
> > > > > > slice params encoded in the buffer and in slice params control;
> > > > > >
> > > > > > Also, we need to make sure to have a per-slice bit offset to the
> > > > > > encoded data in the slice params control so that the same slice buffer
> > > > > > can be used with any of the _SLICE formats (e.g. ffmpeg would only have
> > > > > > an annex-b slice and use any of the formats with it).
> > > > >
> > > > > Ah, I we are saying the same.
> > > > >
> > > > > > For the legacy format, we need to specify that the appended slices
> > > > > > don't repeat the annex-b start code and NAL header.
> > > > >
> > > > > I'm not sure this one make sense. the NAL header for each slices of one
> > > > > frames are not always identical.
> > > >
> > > > Yes but that's pretty much the point of the legacy format: to only
> > > > expose a single slice buffer and slice header (even in cases where the
> > > > bitstream codes them in multiple distinct ones).
> > > >
> > > > We can't expect this to work in every case, that's why it's a legacy
> > > > format. It seems to work pretty well for cedrus so far.
> > >
> > > I'm not sure I follow you, what Cedrus does should be changed to
> > > whatever we decide as a final API, we should not maintain two formats.
> >
> > That point has me hesitating. It depends on whether we can expect to
> > see hardware implementations with no support whatsoever for multi-slice
> > per frame and just expect an aggregated buffer of slice compressed
> > data. This is one operation mode that the Allwinner VPU supports.
> >
> > The point is not to use it in Cedrus since our VPU can operate per-
> > slice, but to allow supporting hardware decoders that can't do that in
> > the future.
> >
> > I'm not sure it's healthy to make it a hard requirement for H.264
> > decoding to operate per-slice. Does that seem too far-fetched from your
> > perspective? I seem to recall from a discussion that some legacy
> > hardware only handles single-slices frames, but I may be wrong.
> >
> > > Also, what works for Cedrus is that a each buffers must have a single
> > > slice regardless how many slices per frame. And this is what I expect
> > > from most stateless HW.
> >
> > Currently, we append all the slices into one buffer and decode it in
> > one go with a slightly hacked slice params to reflect that. But of
> > course, we should be operating per-slice.
> >
> > >  This is how it works in VAAPI and VDPAU as an
> > > example. Just for the reference, the API in VAAPI is (pseudo code, I
> > > can't remember the exact name):
> > >
> > >    - beginPicture()
> > >    - decodeSlice() *
> > >    - endPicture()
> > >
> > > So the accelerator is told explicitly when a frame start/end, but also
> > > it's told explicitly in which buffer to decode the frame to.
> >
> > Yes definitely. We're also given all the parsed bitstream elements in
> > the right order so that we could already start queuing requests when
> > each slice is passed, and just wait for completion at endPicture.
> >
> > > > We could also decide to ditch the legacy idea altogether and only
> > > > specify formats that operate on a per-slice basis, but I'm afraid we'll
> > > > find decoders that can only take a single slice per buffer.
> > >
> > > It's impossible for a compliant decoder to only support 1 slice per
> > > frame, so I don't follow you on this one. Also, I don't understand what
> > > difference you see between per-slice basis and single slice per buffer.
> >
> > Okay that's exactly what I wanted to know: whether it makes any sense
> > to build a decoder that only operates per-frame and not per-slice.
> > If you are confident we won't see that in the wild, we can make it an
> > API requirement to operate per-slice.
>
> There is probably a small distinction to make between supporting
> multiple slices per frame and operating per slice. It's nice to know
> that Cedrus support both. As we discussed today on IRC, if we introduce
> a flag that tells the driver when the last slice of a frame is passed,
> it would be relatively simple for the driver to do decide what to do.
> Of course if the HW have a limitation of one allocation, it might not
> be fully optimal as it would have to copy.
>
> But as this is stateless decoder, I'm more inclined in introducing a
> format that means just that, leaving it to userspace to do that right
> packing.
>
> >
> > > > When decoding a multi-slice frame in that setup, I think we'll be
> > > > better off with an appended buffer containing all the slices for the
> > > > frame instead of passing only a the first slice.
> > >
> > > Appended slices requires extra controls, but also introduce a lot more
> > > decoding latency. As soon as we add the missing frame boundary
> > > signalling, it should be really trivial for a driver to wait until it
> > > received all slices before starting the decoding if that is a HW
> > > requirement.
> >
> > Well, I don't really like the idea of the driver being aware of any of
> > that (IMO the logic should be in the media core, not the driver).
> >
> > If a driver can't do multiple slices, it shouldn't be up to the driver
> > to gather them together. But anyway, if you think we won't ever see
> > this kind of hardware, we can just drop the whole idea.
>
> A compliant HW will support multiple slices per frame, that's not
> really optional. But it may require all slices to be packed in a single
> allocation, in which case it could copy, or we can just have a
> dedicated format for this behaviour.
>
> >
> > > > > > What do you think?
> > > > > >
> > > > > > >  By the way, if we could queue
> > > > > > > twice the same buffer, that would in principal work, but internally
> > > > > > > there is only one state per buffer. If you do external allocation, then
> > > > > > > in theory you could workaround that, but then it's ugly, because you'll
> > > > > > > have two buffers with the same timestamp.
> > > > > >
> > > > > > One advantage of the request API is that buffers are actually queued
> > > > > > when the request is processed, so this might not be too problematic.
> > > > > >
> > > > > > I think what we need boils down to:
> > > > > > - Being able to queue the same output buffer to multiple requests,
> > > > > > which the request API should already allow;
> > > > > > - Being able to grab the right capture buffer based on the output
> > > > > > timestamp so that the different requests for the slices are rendered to
> > > > > > the same destination buffer.
> > > > > >
> > > > > > For the second point, I don't really have a clear idea of whether we
> > > > > > can already expect v4l2 to allow picking a buffer that was marked done
> > > > > > but was not de-queued by userspace yet. It might already be allowed and
> > > > > > we could just implement something to lookup the buffer to grab by
> > > > > > timestamp.
> > > > >
> > > > > An entirely difference solution that came to my mind in the last few
> > > > > days would be to add a new buffer flag that would mean END_OF_FRAME (or
> > > > > reused the generic LAST flag). This flag would be passed on the last
> > > > > slice (if it is known that we are handling the last one) or in an empty
> > > > > buffer if it is found through parsing the next following NAL. This is
> > > > > inspired by the optional OMX END_OF_FRAME flag and RTP marker bit.
> > > > > Though, if we make this flag mandatory, the driver could avoid marking
> > > > > the frame done until all slices has been decoded. The cons is that
> > > > > userpace is not informed when a specific slice is done decoding. This
> > > > > is quite niche, but you can use this information along with the list of
> > > > > macroblocks from the slice header so signal which portion of the image
> > > > > is now ready for an hypothetical video processing. The pros is that
> > > > > this solution can be per format, so this would not be needed for VP8 as
> > > > > an example.
> > > >
> > > > Mhh, I don't really like the idea of setting an explicit order when
> > > > there is really none. I guess the slices for a given frame can be
> > > > decoded in whatever order, so I would like it better if we could just
> > > > submit the batch of requests and be told when the batch is done,
> > > > instead of specifying an explicit order and waiting for the last buffer
> > > > to be marked done.
> > > >
> > > > And I think this batch idea could apply to other things than video
> > > > decoding, so it feels good to have it as the highest level we can in
> > > > media/v4l2.
> > >
> > > I haven't said anything about order. I believe you can decode slice
> > > out-of-order in H264 but it is likely not true for all formats. You are
> > > again missing the point of decoding latency.
> >
> > Well, having an END_OF_FRAME flag on one of the slices pretty much
> > implicitly defines an order (at least regarding this slice vs the
> > others).
>
> No, the flag simply means that any following request will be on another
> frame. It's more like "closing" the decoded frame. I believe you have a
> good understanding of this proposal now after our IRC discussion.
>
> >
> > > In live stream, the slices are transmitted over some serial link. If
> > > you wait until you have all slice before you start decoding, you delay
> > > further the moment the frame will be ready.
> >
> > So that means we need some ability to add requests to a batch while the
> > batch is being handled. Seems a bit exotic but definitely legit, and it
> > can probably be done. Userspace would know when it has submitted all
> > the slices and move on to displaying the frame.
> >
> > >  A lot of vendors make use
> > > of this to reduce latency, and libWebRTC also makes use of this. So
> > > being able to pass slices as part of a specific frame is rather
> > > important. Otherwise vendor will keep doing their own stuff as the
> > > Linux kernel API won't allow reaching their customers expectation.
> >
> > I fully agree we need to prepare for all these low-latency
> > improvements. My goal is definitely to have something that can beat
> > vendor-specific implementations in upstream, not just a proof of
> > concept for half-baked decoding.

Thanks for this great discussion. Let me try to summarize the status
of this thread + the IRC discussion and add my own thoughts:

Proper support for multiple decoding units (e.g. H.264 slices) per
frame should not be an afterthought ; compliance to encoded formats
depend on it, and the benefit of lower latency is a significant
consideration for vendors.

m2m, which we use for all stateless codecs, has a strong assumption
that one OUTPUT buffer consumed results in one CAPTURE buffer being
produced. This assumption can however be overruled: at least the venus
driver does it to implement the stateful specification.

So we need a way to specify frame boundaries when submitting encoded
content to the driver. One request should contain a single OUTPUT
buffer, containing a single decoding unit, but we need a way to
specify whether the driver should directly produce a CAPTURE buffer
from this request, or keep using the same CAPTURE buffer with
subsequent requests.

I can think of 2 ways this can be expressed:
1) We keep the current m2m behavior as the default (a CAPTURE buffer
is produced), and add a flag to ask the driver to change that behavior
and hold on the CAPTURE buffer and reuse it with the next request(s) ;
2) We specify that no CAPTURE buffer is produced by default, unless a
flag asking so is specified.

The flag could be specified in one of two ways:
a) As a new v4l2_buffer.flag for the OUTPUT buffer ;
b) As a dedicated control, either format-specific or more common to all codecs.

I tend to favor 2) and b) for this, for the reason that with H.264 at
least, user-space does not know whether a slice is the last slice of a
frame until it starts parsing the next one, and we don't know when we
will receive it. If we use a control to ask that a CAPTURE buffer be
produced, we can always submit another request with only that control
set once it is clear that the frame is complete (and not delay
decoding meanwhile). In practice I am not that familiar with
latency-sensitive streaming ; maybe a smart streamer would just append
an AUD NAL unit at the end of every frame and we can thus submit the
flag it with the last slice without further delay?

An extra constraint to enforce would be that each decoding unit
belonging to the same frame must be submitted with the same timestamp,
otherwise the request submission would fail. We really need a
framework to enforce all this at a higher level than individual
drivers, once we reach an agreement I will start working on this.

Formats that do not support multiple decoding units per frame would
reject any request that does not carry the end-of-frame information.

Anything missing / any further comment?
Paul Kocialkowski April 16, 2019, 7:37 a.m. UTC | #9
Hi,

Le lundi 15 avril 2019 à 11:30 -0400, Nicolas Dufresne a écrit :
> Le lundi 15 avril 2019 à 15:26 +0200, Paul Kocialkowski a écrit :
> > Hi,
> > 
> > On Mon, 2019-04-15 at 08:24 -0400, Nicolas Dufresne wrote:
> > > Le lundi 15 avril 2019 à 09:58 +0200, Paul Kocialkowski a écrit :
> > > > Hi,
> > > > 
> > > > On Sun, 2019-04-14 at 18:38 -0400, Nicolas Dufresne wrote:
> > > > > Le dimanche 14 avril 2019 à 18:41 +0200, Paul Kocialkowski a écrit :
> > > > > > Hi,
> > > > > > 
> > > > > > Le vendredi 12 avril 2019 à 16:47 -0400, Nicolas Dufresne a écrit :
> > > > > > > Le mercredi 06 mars 2019 à 17:00 +0900, Alexandre Courbot a écrit :
> > > > > > > > Documents the protocol that user-space should follow when
> > > > > > > > communicating with stateless video decoders.
> > > > > > > > 
> > > > > > > > The stateless video decoding API makes use of the new request and tags
> > > > > > > > APIs. While it has been implemented with the Cedrus driver so far, it
> > > > > > > > should probably still be considered staging for a short while.
> > > > > > 
> > > > > > [...]
> > > > > > 
> > > > > > > From an IRC discussion with Paul and some more digging, I have found a
> > > > > > > design problem in the decoding process.
> > > > > > > 
> > > > > > > In H264 and HEVC you can have multiple decoding unit per frames
> > > > > > > (slices). This type of encoding is increasingly popular, specially for
> > > > > > > low latency streaming use cases. The wording of this spec does allow
> > > > > > > for the notion of decoding unit, and in practice it has been proven to
> > > > > > > work through some RFC FFMPEG patches and the Cedrus driver. But
> > > > > > > something important to know is that the FFMPEG RFC implements decoding
> > > > > > > in lock steps. Which means:
> > > > > > > 
> > > > > > >   1. It queues a single free capture buffer
> > > > > > >   2. It queues an output buffer, set controls, queue the request
> > > > > > >   3. It waits for a capture buffer to reach state done
> > > > > > >   4. It dequeues that capture buffer, and queue it back again
> > > > > > >   5. And then it runs step 2,4,3 again with following slices, until we 
> > > > > > >      have a complete frame. After what, it restart at step 1
> > > > > > > 
> > > > > > > So the implementation makes no use of the queues. There is no batch
> > > > > > > processing, so we might not be able to reach the maximum hardware
> > > > > > > throughput.
> > > > > > > 
> > > > > > > So the optimal method would look like the following, but there comes
> > > > > > > the design issue.
> > > > > > > 
> > > > > > >   1. Queue a single free capture buffer
> > > > > > >   2. Queue output buffer for slice 1, set controls, queue the request
> > > > > > >   3. Queue output buffer for slice 2, set controls, queue the request
> > > > > > >   4. Wait for completion
> > > > > > > 
> > > > > > > The problem is in step 4. Completion means that the capture buffer done
> > > > > > > decoding a single unit. So assuming the driver supports matching the
> > > > > > > timestamp against the queued buffer, instead of waiting for a new
> > > > > > > buffer, the driver would have to mark twice the same buffer to done
> > > > > > > state, which is just not working to inform userspace that all slices
> > > > > > > are decoded into the one capture buffer they share.
> > > > > > 
> > > > > > Interestingly, I'm experiencing the exact same problem dealing with a
> > > > > > 2D graphics blitter that has limited ouput scaling abilities which
> > > > > > imply handlnig a large scaling operation as multiple clipped smaller
> > > > > > scaling operations. The issue is basically that multiple jobs have to
> > > > > > be submitted to complete a single frame and relying on an indication
> > > > > > from the destination buffer (such as a fence) doesn't work to indicate
> > > > > > that all the operations were completed, since we get the indication at
> > > > > > each step instead of at the end of the batch.
> > > > > > 
> > > > > > One idea I see to solve this is to have a notion of batch in the driver
> > > > > > (for our situation, that would be in v4l2) and provide means to get a
> > > > > > done indication for that entity.
> > > > > > 
> > > > > > I think we could extend the request API to allow this. We already
> > > > > > represent requests as individual file descriptors, we could totally
> > > > > > group requests in batches and get a sync fd for the batch to poll on
> > > > > > when we need to return the frames. It would be good if we could expose
> > > > > > this in a way that makes it work with DRM as an in fence for display.
> > > > > > Then we can pretty much schedule our flip + decoding together (which is
> > > > > > quite nice to have when we're running late on the decoding side).
> > > > > > 
> > > > > > What do you think?
> > > > > > 
> > > > > > It feels to me like the request API was designed to open up the way for
> > > > > > these kinds of improvements, so I'm sure we can find an agreeable
> > > > > > solution that extends the API.
> > > > > > 
> > > > > > > To me, multi slice encoded stream are just too common, and they will
> > > > > > > also exist for AV1. So we really need a solution to this that does not
> > > > > > > require operating in lock steps. Specially that some HW can decode
> > > > > > > multiple slices in parallel (multi core), we would not want to prevent
> > > > > > > that HW from being used efficiently. On top of this, we need a solution
> > > > > > > so that we can also keep queuing slice of the following frames if they
> > > > > > > arrive before decoding is done.
> > > > > > 
> > > > > > Agreed.
> > > > > > 
> > > > > > > I don't have a solution yet myself, but it would be nice to come up
> > > > > > > with something before we freeze this API.
> > > > > > 
> > > > > > I think it's rather independent from the codec used and this is
> > > > > > something that should be handled at the request API level. 
> > > > > > 
> > > > > > I'm not sure we can always expect the hardware to be able to operate on
> > > > > > a per-slice basis. I think it would be useful to reflect this in the
> > > > > > pixel format, so that we also have a possibility for a gathered slice
> > > > > > buffer (as the spec currently mentions for mpeg-2) for legacy decoder
> > > > > > hardware that will need to decode one frame in one go from a contiguous
> > > > > > buffer with all the slice data appended.
> > > > > > 
> > > > > > This updates my pixel format proposition from IRC to the following:
> > > > > > - V4L2_PIXFMT_H264_SLICE_APPENDED: single buffer for all the slices
> > > > > > (appended buffer), slice params as v4l2 control (legacy);
> > > > > 
> > > > > SLICE_RAW_APPENDED ? Or FRAMED_SLICES_RAW ? You would need a new
> > > > > control for the NAL index, as there is no way to figure-out otherwise.
> > > > > I would not add this format unless a specific HW need it.
> > > > 
> > > > I don't really like using "raw" as a distinguisher: I don't think it's
> > > > explicit enough. The idea here is to reflect that there is only one
> > > > slice exposed, which is the appended result of all the frame slices
> > > > with a single v4l2 control.
> > > 
> > > RAW in this context was suggested to reflect the fact there is no
> > > header, no slice header and that emulation prevention bytes has been
> > > removed and replaces by the real values.
> > 
> > That could also be understood as "slice params coded raw", which is the
> > opposite of what it describes, hence my reluctance.
> > 
> > > Just SLICE alone was much worst.
> > 
> > Keep in mind that we already have a MPEG2_SLICE format in the public
> > API. We should probably decide what it should become based on the
> > outcome of this discussion.
> > 
> > >  There is to many properties to this type of H264 buffer to
> > > encode everything into the name, so what will really matter in the end
> > > if the documentation. Feel free to propose a better name.
> > 
> > Agreed, it's a side point. I always find it hard to find naming good,
> > as well as finding good naming (my suggestions aren't really top-notch
> > either).
> > 
> > Here is another proposition:
> > - SLICE_PARSED
> > - SLICE_ANNEX_B
> > - SLICE_PARSED_ANNEX_B
> 
> Ok, we'll keep working on that then, naming is hard. I guess by PARSED
> you meant that the slice headers are passed as controls, and that
> indeed make sense.

Yep, that's right.

> But I really thought all stateless decoder would required that. A
> hard bet obviously.

Well, I think that's a debate on its own. A strict interpretation of
stateless could be that the decoder does not internally keep track of
the reference frames and any frame can be decoded at any time (at the
condition that the reference frames data is around). This doesn't have
to be correlated with whether the decoder will take the slice header in
raw or parsed format, after all.

Note that in cedrus, our decoder still has some state, but we get to
decide where that state is stored, which makes the whole thing
stateless  since we can bring the state we need dynamically without
involving the hardware.

> > > > > > - V4L2_PIXFMT_H264_SLICE: one buffer/slice, slice params as control;
> > > > > > 
> > > > > > - V4L2_PIXFMT_H264_SLICE_ANNEX_B: one buffer/slice in annex-b format, 
> > > > > > slice params encoded in the buffer;
> > > > > 
> > > > > We are still working on this one, this format will be used by Rockchip
> > > > > driver for sure, but this needs clarification and maybe a rename if
> > > > > it's not just one slice per buffer.
> > > > 
> > > > I thought the decoder also needed the parse slice data? At least IIRC
> > > > for Tegra, we need Annex-B format and a parsed slice header (so the
> > > > next one).
> > > 
> > > Yes, in every cases, the HW will parse the slice data.
> > 
> > Ah sorry, I meant "need the parsed slice data" (missed the d), as in,
> > some decoders will need annex-b format but won't parse the slice header
> > on their own, so they also need the parsed slice header control.
> > Don't ask why...
> > 
> > In my proposition from above, that would be: SLICE_PARSED_ANNEX_B.
> > 
> > >  It's possible
> > > that Tegra have a matching format as Rockchip, someone would need to do
> > > a proper integration to verify. But the driver does not need the
> > > following one, that is specific to ANNEX-B parsing.
> > > 
> > > > > > - V4L2_PIXFMT_H264_SLICE_MIXED: one buffer/slice in annex-b format,
> > > > > > slice params encoded in the buffer and in slice params control;
> > > > > > 
> > > > > > Also, we need to make sure to have a per-slice bit offset to the
> > > > > > encoded data in the slice params control so that the same slice buffer
> > > > > > can be used with any of the _SLICE formats (e.g. ffmpeg would only have
> > > > > > an annex-b slice and use any of the formats with it).
> > > > > 
> > > > > Ah, I we are saying the same.
> > > > > 
> > > > > > For the legacy format, we need to specify that the appended slices
> > > > > > don't repeat the annex-b start code and NAL header.
> > > > > 
> > > > > I'm not sure this one make sense. the NAL header for each slices of one
> > > > > frames are not always identical.
> > > > 
> > > > Yes but that's pretty much the point of the legacy format: to only
> > > > expose a single slice buffer and slice header (even in cases where the
> > > > bitstream codes them in multiple distinct ones).
> > > > 
> > > > We can't expect this to work in every case, that's why it's a legacy
> > > > format. It seems to work pretty well for cedrus so far.
> > > 
> > > I'm not sure I follow you, what Cedrus does should be changed to
> > > whatever we decide as a final API, we should not maintain two formats.
> > 
> > That point has me hesitating. It depends on whether we can expect to
> > see hardware implementations with no support whatsoever for multi-slice 
> > per frame and just expect an aggregated buffer of slice compressed
> > data. This is one operation mode that the Allwinner VPU supports.
> > 
> > The point is not to use it in Cedrus since our VPU can operate per-
> > slice, but to allow supporting hardware decoders that can't do that in
> > the future.
> > 
> > I'm not sure it's healthy to make it a hard requirement for H.264
> > decoding to operate per-slice. Does that seem too far-fetched from your
> > perspective? I seem to recall from a discussion that some legacy
> > hardware only handles single-slices frames, but I may be wrong.
> > 
> > > Also, what works for Cedrus is that a each buffers must have a single
> > > slice regardless how many slices per frame. And this is what I expect
> > > from most stateless HW.
> > 
> > Currently, we append all the slices into one buffer and decode it in
> > one go with a slightly hacked slice params to reflect that. But of
> > course, we should be operating per-slice.
> > 
> > >  This is how it works in VAAPI and VDPAU as an
> > > example. Just for the reference, the API in VAAPI is (pseudo code, I
> > > can't remember the exact name):
> > > 
> > >    - beginPicture()
> > >    - decodeSlice() *
> > >    - endPicture()
> > > 
> > > So the accelerator is told explicitly when a frame start/end, but also
> > > it's told explicitly in which buffer to decode the frame to.
> > 
> > Yes definitely. We're also given all the parsed bitstream elements in
> > the right order so that we could already start queuing requests when
> > each slice is passed, and just wait for completion at endPicture.
> > 
> > > > We could also decide to ditch the legacy idea altogether and only
> > > > specify formats that operate on a per-slice basis, but I'm afraid we'll
> > > > find decoders that can only take a single slice per buffer.
> > > 
> > > It's impossible for a compliant decoder to only support 1 slice per
> > > frame, so I don't follow you on this one. Also, I don't understand what
> > > difference you see between per-slice basis and single slice per buffer.
> > 
> > Okay that's exactly what I wanted to know: whether it makes any sense
> > to build a decoder that only operates per-frame and not per-slice.
> > If you are confident we won't see that in the wild, we can make it an
> > API requirement to operate per-slice.
> 
> There is probably a small distinction to make between supporting
> multiple slices per frame and operating per slice. It's nice to know
> that Cedrus support both. As we discussed today on IRC, if we introduce
> a flag that tells the driver when the last slice of a frame is passed,
> it would be relatively simple for the driver to do decide what to do.
> Of course if the HW have a limitation of one allocation, it might not
> be fully optimal as it would have to copy.
> 
> But as this is stateless decoder, I'm more inclined in introducing a
> format that means just that, leaving it to userspace to do that right
> packing.
> 
> > > > When decoding a multi-slice frame in that setup, I think we'll be
> > > > better off with an appended buffer containing all the slices for the
> > > > frame instead of passing only a the first slice.
> > > 
> > > Appended slices requires extra controls, but also introduce a lot more
> > > decoding latency. As soon as we add the missing frame boundary
> > > signalling, it should be really trivial for a driver to wait until it
> > > received all slices before starting the decoding if that is a HW
> > > requirement.
> > 
> > Well, I don't really like the idea of the driver being aware of any of
> > that (IMO the logic should be in the media core, not the driver).
> > 
> > If a driver can't do multiple slices, it shouldn't be up to the driver
> > to gather them together. But anyway, if you think we won't ever see
> > this kind of hardware, we can just drop the whole idea.
> 
> A compliant HW will support multiple slices per frame, that's not
> really optional. But it may require all slices to be packed in a single
> allocation, in which case it could copy, or we can just have a
> dedicated format for this behaviour.

I was thinking of allowing this by default with bit offsets to the
slice. The main issue I see here will be trying to add new slice data
to a buffer that was already queued (and might already be undergoing
decoding), when submitting a new request to the batch (that may already
have started decoding). We'd need a notion of "buffer partitions" or
so.

> > > > > > What do you think?
> > > > > > 
> > > > > > >  By the way, if we could queue
> > > > > > > twice the same buffer, that would in principal work, but internally
> > > > > > > there is only one state per buffer. If you do external allocation, then
> > > > > > > in theory you could workaround that, but then it's ugly, because you'll
> > > > > > > have two buffers with the same timestamp.
> > > > > > 
> > > > > > One advantage of the request API is that buffers are actually queued
> > > > > > when the request is processed, so this might not be too problematic.
> > > > > > 
> > > > > > I think what we need boils down to:
> > > > > > - Being able to queue the same output buffer to multiple requests,
> > > > > > which the request API should already allow;
> > > > > > - Being able to grab the right capture buffer based on the output
> > > > > > timestamp so that the different requests for the slices are rendered to
> > > > > > the same destination buffer.
> > > > > > 
> > > > > > For the second point, I don't really have a clear idea of whether we
> > > > > > can already expect v4l2 to allow picking a buffer that was marked done
> > > > > > but was not de-queued by userspace yet. It might already be allowed and
> > > > > > we could just implement something to lookup the buffer to grab by
> > > > > > timestamp.
> > > > > 
> > > > > An entirely difference solution that came to my mind in the last few
> > > > > days would be to add a new buffer flag that would mean END_OF_FRAME (or
> > > > > reused the generic LAST flag). This flag would be passed on the last
> > > > > slice (if it is known that we are handling the last one) or in an empty
> > > > > buffer if it is found through parsing the next following NAL. This is
> > > > > inspired by the optional OMX END_OF_FRAME flag and RTP marker bit.
> > > > > Though, if we make this flag mandatory, the driver could avoid marking
> > > > > the frame done until all slices has been decoded. The cons is that
> > > > > userpace is not informed when a specific slice is done decoding. This
> > > > > is quite niche, but you can use this information along with the list of
> > > > > macroblocks from the slice header so signal which portion of the image
> > > > > is now ready for an hypothetical video processing. The pros is that
> > > > > this solution can be per format, so this would not be needed for VP8 as
> > > > > an example.
> > > > 
> > > > Mhh, I don't really like the idea of setting an explicit order when
> > > > there is really none. I guess the slices for a given frame can be
> > > > decoded in whatever order, so I would like it better if we could just
> > > > submit the batch of requests and be told when the batch is done,
> > > > instead of specifying an explicit order and waiting for the last buffer
> > > > to be marked done.
> > > > 
> > > > And I think this batch idea could apply to other things than video
> > > > decoding, so it feels good to have it as the highest level we can in
> > > > media/v4l2.
> > > 
> > > I haven't said anything about order. I believe you can decode slice
> > > out-of-order in H264 but it is likely not true for all formats. You are
> > > again missing the point of decoding latency.
> > 
> > Well, having an END_OF_FRAME flag on one of the slices pretty much
> > implicitly defines an order (at least regarding this slice vs the
> > others).
> 
> No, the flag simply means that any following request will be on another
> frame. It's more like "closing" the decoded frame. I believe you have a
> good understanding of this proposal now after our IRC discussion.

Yes and I think what you are suggesting makes good sense.
For the record, we certainly need to take care that the end frame *and
the previous ones* are finished before marking the buffers done, so
that we can handle parallelized pipelines that may finish decoding in a
different order than request submission order.

> > > In live stream, the slices are transmitted over some serial link. If
> > > you wait until you have all slice before you start decoding, you delay
> > > further the moment the frame will be ready.
> > 
> > So that means we need some ability to add requests to a batch while the
> > batch is being handled. Seems a bit exotic but definitely legit, and it
> > can probably be done. Userspace would know when it has submitted all
> > the slices and move on to displaying the frame.
> > 
> > >  A lot of vendors make use
> > > of this to reduce latency, and libWebRTC also makes use of this. So
> > > being able to pass slices as part of a specific frame is rather
> > > important. Otherwise vendor will keep doing their own stuff as the
> > > Linux kernel API won't allow reaching their customers expectation.
> > 
> > I fully agree we need to prepare for all these low-latency
> > improvements. My goal is definitely to have something that can beat
> > vendor-specific implementations in upstream, not just a proof of
> > concept for half-baked decoding.
> 
> Great.
> 
> > > The batching capabilities should be used for the case the multiple
> > > slices of a frame (or multiple slices of many frame is supported by the
> > > HW) have been queued before the previous batch had completed.
> > > 
> > > > > A third approach could be to use the encoded buffer state to track the
> > > > > progress decoding that slice. Many driver will mark the buffer done as
> > > > > soon as it is transferred to the accelerator, it does not always match
> > > > > the moment that slice has been decoded. But has use said, we would need
> > > > > to study if it make sense to let a driver pick by timestamp a buffer
> > > > > that might already have reached done state. Other cons, is that polling
> > > > > for buffer states on the capture queue won't mean anything anymore. But
> > > > > combined with the FLAG, it would fix the cons of the FLAG solution.
> > > > 
> > > > Well, I think we should keep the done operation as-is, only give it a
> > > > different interpretation depending on whether the request is handled
> > > > individually or as part of a batch. I really think we shouldn't rely on
> > > > any buffer-level indication for completion when handling a batch, but
> > > > rather have something about the batch entity itself.
> > > 
> > > But then there is no way to know when the frame is decoded anymore,
> > > because as soon as the first slice is decoded, the capture is done and
> > > stay done. So what's your idea here, how to do you know your decoding
> > > is complete if you haven't dequeue/queue the frame in between slices ?
> > 
> > Yes, I was thinking of exposing a sync object as a fd that can be
> > polled on, which would be associated with the request batch and
> > signalled when completed. That's pretty much a standalone fence that's
> > not backed by a buffer.
> > 
> > I would like that to be importable as a DRM input fence (if possible at
> > all), so we can schedule decoding and schedule a page flip for the
> > capture buffer at the same time. Then the capture buffer can be
> > displayed as soon as the decoding batch is done.
> 
> We'll have to bring that one is a specific topic. Right now there is
> many peaces missing on DRM side as you already know. Also, fences would
> be delivered out-of-order (decoding order), we'd need strong doc to
> help user-space figure-out how to use this.

I think DRM is quite relevant nowadays when it comes to sync (but
that's something missing from V4L2 as of now). Yes a big part of the
logic is left to userspace since as you mention, decoding order !=
display order.

That means that fences are only relevant when we are sure that the
decoded frame is to be displayed next and that we have already missed
the target vblank where its flip should have been scheduled. As far as
I can see, the fence can reduce the latency by one frame by scheduling
the flip at the current vblank target still (instead of the next one
where the frame is decoded), and letting it take effect "as soon as
possible" (which might be before or after the next vblank).

Cheers,

Paul

> > > > > > > An argument that was made early was that we don't need to support this
> > > > > > > right away because userspace can combine all the slices into one
> > > > > > > buffer. But for H264_SLICE_RAW format it's inconvenient, you'd need an
> > > > > > > extra control to tell the driver the offset to each slices, because the
> > > > > > > raw H264 does not have enough information to be parsed. RAW slice are
> > > > > > > also I believe de-emulated, which means the code use to prevent having
> > > > > > > pattern looking like a start code has been removed, so you cannot just
> > > > > > > prepend start codes. De-emulation seems better placed in userspace if
> > > > > > > the HW does not take care.
> > > > > > 
> > > > > > Mhh I'd like to avoid having having to specify the offset to each slice
> > > > > > for the legacy case. Just appending the encoded data (excluding slice
> > > > > > header and start code) works for cedrus and I think it makes sense more
> > > > > > generally. The idea is to only expose a single slice params and act as
> > > > > > if it was just one big slice buffer.
> > > > > > 
> > > > > > Come to think of it, maybe we need annex-b and mixed fashions of that
> > > > > > legacy pixfmt too...
> > > > > > 
> > > > > > > I also very dislike the idea that we would enforce merging all slice
> > > > > > > into the same buffer. The entire purpose of slices and the reason they
> > > > > > > are used in practice is that you can start decoding slices before you
> > > > > > > have all slices of a frame. This reduce drastically the latency for
> > > > > > > streaming use cases, like video conferencing. So forcing the merging of
> > > > > > > slices is basically like pretending slices have no benefits.
> > > > > > 
> > > > > > Of course, we don't want things to stay like this and this rework is
> > > > > > definitely needed to get serious performance and latency going.
> > > > > > 
> > > > > > One thing you should also be aware of: we're currently using a
> > > > > > workqueue between the job done irq and scheduling the next frame (in
> > > > > > v4l2 m2m).
> > > > > > 
> > > > > > Maybe we could manage to fit that into an atomic path to schedule the
> > > > > > next request in the previous job done irq context.
> > > > > > 
> > > > > > > I have just exposed the problem I see for now, to see what comes up.
> > > > > > > But I hope we be able to propose solution too in the short term (in no
> > > > > > > one beats me at it).
> > > > > > 
> > > > > > Seems that we have good grounds for a discussion!
> > > > > > 
> > > > > > Cheers,
> > > > > > 
> > > > > > Paul
> > > > > > 
> > > > > > > > +
> > > > > > > > +A typical frame would thus be decoded using the following sequence:
> > > > > > > > +
> > > > > > > > +1. Queue an ``OUTPUT`` buffer containing one unit of encoded bitstream data for
> > > > > > > > +   the decoding request, using :c:func:`VIDIOC_QBUF`.
> > > > > > > > +
> > > > > > > > +    * **Required fields:**
> > > > > > > > +
> > > > > > > > +      ``index``
> > > > > > > > +          index of the buffer being queued.
> > > > > > > > +
> > > > > > > > +      ``type``
> > > > > > > > +          type of the buffer.
> > > > > > > > +
> > > > > > > > +      ``bytesused``
> > > > > > > > +          number of bytes taken by the encoded data frame in the buffer.
> > > > > > > > +
> > > > > > > > +      ``flags``
> > > > > > > > +          the ``V4L2_BUF_FLAG_REQUEST_FD`` flag must be set.
> > > > > > > > +
> > > > > > > > +      ``request_fd``
> > > > > > > > +          must be set to the file descriptor of the decoding request.
> > > > > > > > +
> > > > > > > > +      ``timestamp``
> > > > > > > > +          must be set to a unique value per frame. This value will be propagated
> > > > > > > > +          into the decoded frame's buffer and can also be used to use this frame
> > > > > > > > +          as the reference of another.
> > > > > > > > +
> > > > > > > > +2. Set the codec-specific controls for the decoding request, using
> > > > > > > > +   :c:func:`VIDIOC_S_EXT_CTRLS`.
> > > > > > > > +
> > > > > > > > +    * **Required fields:**
> > > > > > > > +
> > > > > > > > +      ``which``
> > > > > > > > +          must be ``V4L2_CTRL_WHICH_REQUEST_VAL``.
> > > > > > > > +
> > > > > > > > +      ``request_fd``
> > > > > > > > +          must be set to the file descriptor of the decoding request.
> > > > > > > > +
> > > > > > > > +      other fields
> > > > > > > > +          other fields are set as usual when setting controls. The ``controls``
> > > > > > > > +          array must contain all the codec-specific controls required to decode
> > > > > > > > +          a frame.
> > > > > > > > +
> > > > > > > > +   .. note::
> > > > > > > > +
> > > > > > > > +      It is possible to specify the controls in different invocations of
> > > > > > > > +      :c:func:`VIDIOC_S_EXT_CTRLS`, or to overwrite a previously set control, as
> > > > > > > > +      long as ``request_fd`` and ``which`` are properly set. The controls state
> > > > > > > > +      at the moment of request submission is the one that will be considered.
> > > > > > > > +
> > > > > > > > +   .. note::
> > > > > > > > +
> > > > > > > > +      The order in which steps 1 and 2 take place is interchangeable.
> > > > > > > > +
> > > > > > > > +3. Submit the request by invoking :c:func:`MEDIA_REQUEST_IOC_QUEUE` on the
> > > > > > > > +   request FD.
> > > > > > > > +
> > > > > > > > +    If the request is submitted without an ``OUTPUT`` buffer, or if some of the
> > > > > > > > +    required controls are missing from the request, then
> > > > > > > > +    :c:func:`MEDIA_REQUEST_IOC_QUEUE` will return ``-ENOENT``. If more than one
> > > > > > > > +    ``OUTPUT`` buffer is queued, then it will return ``-EINVAL``.
> > > > > > > > +    :c:func:`MEDIA_REQUEST_IOC_QUEUE` returning non-zero means that no
> > > > > > > > +    ``CAPTURE`` buffer will be produced for this request.
> > > > > > > > +
> > > > > > > > +``CAPTURE`` buffers must not be part of the request, and are queued
> > > > > > > > +independently. They are returned in decode order (i.e. the same order as coded
> > > > > > > > +frames were submitted to the ``OUTPUT`` queue).
> > > > > > > > +
> > > > > > > > +Runtime decoding errors are signaled by the dequeued ``CAPTURE`` buffers
> > > > > > > > +carrying the ``V4L2_BUF_FLAG_ERROR`` flag. If a decoded reference frame has an
> > > > > > > > +error, then all following decoded frames that refer to it also have the
> > > > > > > > +``V4L2_BUF_FLAG_ERROR`` flag set, although the decoder will still try to
> > > > > > > > +produce a (likely corrupted) frame.
> > > > > > > > +
> > > > > > > > +Buffer management while decoding
> > > > > > > > +================================
> > > > > > > > +Contrary to stateful decoders, a stateless decoder does not perform any kind of
> > > > > > > > +buffer management: it only guarantees that dequeued ``CAPTURE`` buffers can be
> > > > > > > > +used by the client for as long as they are not queued again. "Used" here
> > > > > > > > +encompasses using the buffer for compositing or display.
> > > > > > > > +
> > > > > > > > +A dequeued capture buffer can also be used as the reference frame of another
> > > > > > > > +buffer.
> > > > > > > > +
> > > > > > > > +A frame is specified as reference by converting its timestamp into nanoseconds,
> > > > > > > > +and storing it into the relevant member of a codec-dependent control structure.
> > > > > > > > +The :c:func:`v4l2_timeval_to_ns` function must be used to perform that
> > > > > > > > +conversion. The timestamp of a frame can be used to reference it as soon as all
> > > > > > > > +its units of encoded data are successfully submitted to the ``OUTPUT`` queue.
> > > > > > > > +
> > > > > > > > +A decoded buffer containing a reference frame must not be reused as a decoding
> > > > > > > > +target until all the frames referencing it have been decoded. The safest way to
> > > > > > > > +achieve this is to refrain from queueing a reference buffer until all the
> > > > > > > > +decoded frames referencing it have been dequeued. However, if the driver can
> > > > > > > > +guarantee that buffer queued to the ``CAPTURE`` queue will be used in queued
> > > > > > > > +order, then user-space can take advantage of this guarantee and queue a
> > > > > > > > +reference buffer when the following conditions are met:
> > > > > > > > +
> > > > > > > > +1. All the requests for frames affected by the reference frame have been
> > > > > > > > +   queued, and
> > > > > > > > +
> > > > > > > > +2. A sufficient number of ``CAPTURE`` buffers to cover all the decoded
> > > > > > > > +   referencing frames have been queued.
> > > > > > > > +
> > > > > > > > +When queuing a decoding request, the driver will increase the reference count of
> > > > > > > > +all the resources associated with reference frames. This means that the client
> > > > > > > > +can e.g. close the DMABUF file descriptors of reference frame buffers if it
> > > > > > > > +won't need them afterwards.
> > > > > > > > +
> > > > > > > > +Seeking
> > > > > > > > +=======
> > > > > > > > +In order to seek, the client just needs to submit requests using input buffers
> > > > > > > > +corresponding to the new stream position. It must however be aware that
> > > > > > > > +resolution may have changed and follow the dynamic resolution change sequence in
> > > > > > > > +that case. Also depending on the codec used, picture parameters (e.g. SPS/PPS
> > > > > > > > +for H.264) may have changed and the client is responsible for making sure that a
> > > > > > > > +valid state is sent to the decoder.
> > > > > > > > +
> > > > > > > > +The client is then free to ignore any returned ``CAPTURE`` buffer that comes
> > > > > > > > +from the pre-seek position.
> > > > > > > > +
> > > > > > > > +Pause
> > > > > > > > +=====
> > > > > > > > +
> > > > > > > > +In order to pause, the client can just cease queuing buffers onto the ``OUTPUT``
> > > > > > > > +queue. Without source bitstream data, there is no data to process and the codec
> > > > > > > > +will remain idle.
> > > > > > > > +
> > > > > > > > +Dynamic resolution change
> > > > > > > > +=========================
> > > > > > > > +
> > > > > > > > +If the client detects a resolution change in the stream, it will need to perform
> > > > > > > > +the initialization sequence again with the new resolution:
> > > > > > > > +
> > > > > > > > +1. Wait until all submitted requests have completed and dequeue the
> > > > > > > > +   corresponding output buffers.
> > > > > > > > +
> > > > > > > > +2. Call :c:func:`VIDIOC_STREAMOFF` on both the ``OUTPUT`` and ``CAPTURE``
> > > > > > > > +   queues.
> > > > > > > > +
> > > > > > > > +3. Free all ``CAPTURE`` buffers by calling :c:func:`VIDIOC_REQBUFS` on the
> > > > > > > > +   ``CAPTURE`` queue with a buffer count of zero.
> > > > > > > > +
> > > > > > > > +4. Perform the initialization sequence again (minus the allocation of
> > > > > > > > +   ``OUTPUT`` buffers), with the new resolution set on the ``OUTPUT`` queue.
> > > > > > > > +   Note that due to resolution constraints, a different format may need to be
> > > > > > > > +   picked on the ``CAPTURE`` queue.
> > > > > > > > +
> > > > > > > > +Drain
> > > > > > > > +=====
> > > > > > > > +
> > > > > > > > +In order to drain the stream on a stateless decoder, the client just needs to
> > > > > > > > +wait until all the submitted requests are completed. There is no need to send a
> > > > > > > > +``V4L2_DEC_CMD_STOP`` command since requests are processed sequentially by the
> > > > > > > > +decoder.
Paul Kocialkowski April 16, 2019, 7:55 a.m. UTC | #10
Hi,

Le mardi 16 avril 2019 à 16:22 +0900, Alexandre Courbot a écrit :

[...]

> Thanks for this great discussion. Let me try to summarize the status
> of this thread + the IRC discussion and add my own thoughts:
> 
> Proper support for multiple decoding units (e.g. H.264 slices) per
> frame should not be an afterthought ; compliance to encoded formats
> depend on it, and the benefit of lower latency is a significant
> consideration for vendors.
>
> m2m, which we use for all stateless codecs, has a strong assumption
> that one OUTPUT buffer consumed results in one CAPTURE buffer being
> produced. This assumption can however be overruled: at least the venus
> driver does it to implement the stateful specification.
> 
> So we need a way to specify frame boundaries when submitting encoded
> content to the driver. One request should contain a single OUTPUT
> buffer, containing a single decoding unit, but we need a way to
> specify whether the driver should directly produce a CAPTURE buffer
> from this request, or keep using the same CAPTURE buffer with
> subsequent requests.
> 
> I can think of 2 ways this can be expressed:
> 1) We keep the current m2m behavior as the default (a CAPTURE buffer
> is produced), and add a flag to ask the driver to change that behavior
> and hold on the CAPTURE buffer and reuse it with the next request(s) ;

That would kind of break the stateless idea. I think we need requests
to be fully independent of eachother and have some entity that
coordinates requests for this kind of things.

> 2) We specify that no CAPTURE buffer is produced by default, unless a
> flag asking so is specified.
> 
> The flag could be specified in one of two ways:
> a) As a new v4l2_buffer.flag for the OUTPUT buffer ;
> b) As a dedicated control, either format-specific or more common to all codecs.

I think we must aim for a generic solution that would be at least
common to all codecs, and if possible common to requests regardless of
whether they concern video decoding or not.

I really like the idea of introducing a requests batch/group/queue,
which groups requests together and allows marking them done when the
whole group is done being decoded. For that, we explicitly mark one of
the requests as the final one, so that we can continue adding requests
to the batch even when it's already being processed. When all the
requests are done being decoded, we can mark them done.

With that, we also need some tweaking in the core to look for an
available capture buffer that matches the output buffer's timestamp
before trying to dequeue the next available capture buffer. This way,
the first request of the batch will get any queued capture buffer, but
subsequent requests will find the matchung capture buffer by timestamp.

I think that's basically all we need to handle that and the two aspects
(picking by timestamp and requests groups) are rather independent and
the latter could probably be used in other situations than video
decoding.

What do you think?

Cheers,

Paul

> I tend to favor 2) and b) for this, for the reason that with H.264 at
> least, user-space does not know whether a slice is the last slice of a
> frame until it starts parsing the next one, and we don't know when we
> will receive it. If we use a control to ask that a CAPTURE buffer be
> produced, we can always submit another request with only that control
> set once it is clear that the frame is complete (and not delay
> decoding meanwhile). In practice I am not that familiar with
> latency-sensitive streaming ; maybe a smart streamer would just append
> an AUD NAL unit at the end of every frame and we can thus submit the
> flag it with the last slice without further delay?
> 
> An extra constraint to enforce would be that each decoding unit
> belonging to the same frame must be submitted with the same timestamp,
> otherwise the request submission would fail. We really need a
> framework to enforce all this at a higher level than individual
> drivers, once we reach an agreement I will start working on this.
> 
> Formats that do not support multiple decoding units per frame would
> reject any request that does not carry the end-of-frame information.
> 
> Anything missing / any further comment?
Alexandre Courbot April 17, 2019, 5:39 a.m. UTC | #11
Hi Paul,

On Tue, Apr 16, 2019 at 4:55 PM Paul Kocialkowski
<paul.kocialkowski@bootlin.com> wrote:
>
> Hi,
>
> Le mardi 16 avril 2019 à 16:22 +0900, Alexandre Courbot a écrit :
>
> [...]
>
> > Thanks for this great discussion. Let me try to summarize the status
> > of this thread + the IRC discussion and add my own thoughts:
> >
> > Proper support for multiple decoding units (e.g. H.264 slices) per
> > frame should not be an afterthought ; compliance to encoded formats
> > depend on it, and the benefit of lower latency is a significant
> > consideration for vendors.
> >
> > m2m, which we use for all stateless codecs, has a strong assumption
> > that one OUTPUT buffer consumed results in one CAPTURE buffer being
> > produced. This assumption can however be overruled: at least the venus
> > driver does it to implement the stateful specification.
> >
> > So we need a way to specify frame boundaries when submitting encoded
> > content to the driver. One request should contain a single OUTPUT
> > buffer, containing a single decoding unit, but we need a way to
> > specify whether the driver should directly produce a CAPTURE buffer
> > from this request, or keep using the same CAPTURE buffer with
> > subsequent requests.
> >
> > I can think of 2 ways this can be expressed:
> > 1) We keep the current m2m behavior as the default (a CAPTURE buffer
> > is produced), and add a flag to ask the driver to change that behavior
> > and hold on the CAPTURE buffer and reuse it with the next request(s) ;
>
> That would kind of break the stateless idea. I think we need requests
> to be fully independent of eachother and have some entity that
> coordinates requests for this kind of things.

Side note: the idea that stateless decoders are entirely stateless is
not completely accurate anyway. When we specify a resolution on the
OUTPUT queue, we already store some state. What matters IIUC is that
the *hardware* behaves in a stateless manner. I don't think we should
refrain from storing some internal driver state if it makes sense.

Back to the topic: the effect of this flag would just be that the
first buffer is the CAPTURE queue is not removed, i.e. the next
request will work on the same buffer. It doesn't really preserve any
state - if the next request is the beginning of a different frame,
then the previous work will be discarded and the driver will behave as
it should, not considering any previous state.

>
> > 2) We specify that no CAPTURE buffer is produced by default, unless a
> > flag asking so is specified.
> >
> > The flag could be specified in one of two ways:
> > a) As a new v4l2_buffer.flag for the OUTPUT buffer ;
> > b) As a dedicated control, either format-specific or more common to all codecs.
>
> I think we must aim for a generic solution that would be at least
> common to all codecs, and if possible common to requests regardless of
> whether they concern video decoding or not.
>
> I really like the idea of introducing a requests batch/group/queue,
> which groups requests together and allows marking them done when the
> whole group is done being decoded. For that, we explicitly mark one of
> the requests as the final one, so that we can continue adding requests
> to the batch even when it's already being processed. When all the
> requests are done being decoded, we can mark them done.

I'd need to see this idea more developed (with maybe an example of the
sequence of IOCTLs) to form an opinion about it. Also would need to be
given a few examples of where this could be used outside of stateless
codecs. Then we will have to address what this means for requests:
your argument against using a "release CAPTURE buffer" flag was that
requests won't be fully independent from each other anymore, but I
don't see that situation changing with batches. Then, does the end of
a batch only means that a CAPTURE buffer should be released, or are
other actions required for non-codec use-cases? There are lots and
lots of questions like this one lurking.

>
> With that, we also need some tweaking in the core to look for an
> available capture buffer that matches the output buffer's timestamp
> before trying to dequeue the next available capture buffer

I don't think that would be strictly necessary, unless we want to be
able to decode slices from different frames before the first one is
completed?

> This way,
> the first request of the batch will get any queued capture buffer, but
> subsequent requests will find the matchung capture buffer by timestamp.
>
> I think that's basically all we need to handle that and the two aspects
> (picking by timestamp and requests groups) are rather independent and
> the latter could probably be used in other situations than video
> decoding.
>
> What do you think?

At the current point I'd like to avoid over-engineering things.
Introducing a request batch mechanism would mean more months spent
before we can set the stateless codec API in stone, and at some point
we need to settle and release something that people can use. We don't
even have clear idea of what batches would look like and in which
cases they would be used. The idea of an extra flag is simple and
AFAICT would do the job nicely, so why not proceed with this for the
time being?

Cheers,
Alex.
Nicolas Dufresne April 17, 2019, 3:30 p.m. UTC | #12
Le dimanche 14 avril 2019 à 18:41 +0200, Paul Kocialkowski a écrit :
> Hi,
> 
> Le vendredi 12 avril 2019 à 16:47 -0400, Nicolas Dufresne a écrit :
> > Le mercredi 06 mars 2019 à 17:00 +0900, Alexandre Courbot a écrit :
> > > Documents the protocol that user-space should follow when
> > > communicating with stateless video decoders.
> > > 
> > > The stateless video decoding API makes use of the new request and tags
> > > APIs. While it has been implemented with the Cedrus driver so far, it
> > > should probably still be considered staging for a short while.
> 
> [...]
> 
> > From an IRC discussion with Paul and some more digging, I have found a
> > design problem in the decoding process.
> > 
> > In H264 and HEVC you can have multiple decoding unit per frames
> > (slices). This type of encoding is increasingly popular, specially for
> > low latency streaming use cases. The wording of this spec does allow
> > for the notion of decoding unit, and in practice it has been proven to
> > work through some RFC FFMPEG patches and the Cedrus driver. But
> > something important to know is that the FFMPEG RFC implements decoding
> > in lock steps. Which means:
> > 
> >   1. It queues a single free capture buffer
> >   2. It queues an output buffer, set controls, queue the request
> >   3. It waits for a capture buffer to reach state done
> >   4. It dequeues that capture buffer, and queue it back again
> >   5. And then it runs step 2,4,3 again with following slices, until we 
> >      have a complete frame. After what, it restart at step 1
> > 
> > So the implementation makes no use of the queues. There is no batch
> > processing, so we might not be able to reach the maximum hardware
> > throughput.
> > 
> > So the optimal method would look like the following, but there comes
> > the design issue.
> > 
> >   1. Queue a single free capture buffer
> >   2. Queue output buffer for slice 1, set controls, queue the request
> >   3. Queue output buffer for slice 2, set controls, queue the request
> >   4. Wait for completion
> > 
> > The problem is in step 4. Completion means that the capture buffer done
> > decoding a single unit. So assuming the driver supports matching the
> > timestamp against the queued buffer, instead of waiting for a new
> > buffer, the driver would have to mark twice the same buffer to done
> > state, which is just not working to inform userspace that all slices
> > are decoded into the one capture buffer they share.
> 
> Interestingly, I'm experiencing the exact same problem dealing with a
> 2D graphics blitter that has limited ouput scaling abilities which
> imply handlnig a large scaling operation as multiple clipped smaller
> scaling operations. The issue is basically that multiple jobs have to
> be submitted to complete a single frame and relying on an indication
> from the destination buffer (such as a fence) doesn't work to indicate
> that all the operations were completed, since we get the indication at
> each step instead of at the end of the batch.

That looks similar to the IMX.6 IPU m2m driver. It splits the image in
tiles of 1024x1024 and process each tile separately. This driver has
been around for a long time, so I guess they have a solution to that.
They don't need requests, because there is nothing to be bundled with
the input image. I know that Renesas folks have started working on a
de-interlacer. Again, this kind of driver may process and reuse input
buffers for motion compensation, but I don't think they need special
userspace API for that.

> 
> One idea I see to solve this is to have a notion of batch in the driver
> (for our situation, that would be in v4l2) and provide means to get a
> done indication for that entity.

Can't you just make this part of your driver state machine ?

> 
> I think we could extend the request API to allow this. We already
> represent requests as individual file descriptors, we could totally
> group requests in batches and get a sync fd for the batch to poll on
> when we need to return the frames. It would be good if we could expose
> this in a way that makes it work with DRM as an in fence for display.
> Then we can pretty much schedule our flip + decoding together (which is
> quite nice to have when we're running late on the decoding side).
> 
> What do you think?

I'm not sure why this specific thing needs a userspace exposition.

> 
> It feels to me like the request API was designed to open up the way for
> these kinds of improvements, so I'm sure we can find an agreeable
> solution that extends the API.
> 
> > To me, multi slice encoded stream are just too common, and they will
> > also exist for AV1. So we really need a solution to this that does not
> > require operating in lock steps. Specially that some HW can decode
> > multiple slices in parallel (multi core), we would not want to prevent
> > that HW from being used efficiently. On top of this, we need a solution
> > so that we can also keep queuing slice of the following frames if they
> > arrive before decoding is done.
> 
> Agreed.
> 
> > I don't have a solution yet myself, but it would be nice to come up
> > with something before we freeze this API.
> 
> I think it's rather independent from the codec used and this is
> something that should be handled at the request API level. 
> 
> I'm not sure we can always expect the hardware to be able to operate on
> a per-slice basis. I think it would be useful to reflect this in the
> pixel format, so that we also have a possibility for a gathered slice
> buffer (as the spec currently mentions for mpeg-2) for legacy decoder
> hardware that will need to decode one frame in one go from a contiguous
> buffer with all the slice data appended.
> 
> This updates my pixel format proposition from IRC to the following:
> - V4L2_PIXFMT_H264_SLICE_APPENDED: single buffer for all the slices
> (appended buffer), slice params as v4l2 control (legacy);
> 
> - V4L2_PIXFMT_H264_SLICE: one buffer/slice, slice params as control;
> 
> - V4L2_PIXFMT_H264_SLICE_ANNEX_B: one buffer/slice in annex-b format, 
> slice params encoded in the buffer;
> 
> - V4L2_PIXFMT_H264_SLICE_MIXED: one buffer/slice in annex-b format,
> slice params encoded in the buffer and in slice params control;
> 
> Also, we need to make sure to have a per-slice bit offset to the
> encoded data in the slice params control so that the same slice buffer
> can be used with any of the _SLICE formats (e.g. ffmpeg would only have
> an annex-b slice and use any of the formats with it).
> 
> For the legacy format, we need to specify that the appended slices
> don't repeat the annex-b start code and NAL header.
> 
> What do you think?
> 
> >  By the way, if we could queue
> > twice the same buffer, that would in principal work, but internally
> > there is only one state per buffer. If you do external allocation, then
> > in theory you could workaround that, but then it's ugly, because you'll
> > have two buffers with the same timestamp.
> 
> One advantage of the request API is that buffers are actually queued
> when the request is processed, so this might not be too problematic.
> 
> I think what we need boils down to:
> - Being able to queue the same output buffer to multiple requests,
> which the request API should already allow;
> - Being able to grab the right capture buffer based on the output
> timestamp so that the different requests for the slices are rendered to
> the same destination buffer.
> 
> For the second point, I don't really have a clear idea of whether we
> can already expect v4l2 to allow picking a buffer that was marked done
> but was not de-queued by userspace yet. It might already be allowed and
> we could just implement something to lookup the buffer to grab by
> timestamp.
> 
> > An argument that was made early was that we don't need to support this
> > right away because userspace can combine all the slices into one
> > buffer. But for H264_SLICE_RAW format it's inconvenient, you'd need an
> > extra control to tell the driver the offset to each slices, because the
> > raw H264 does not have enough information to be parsed. RAW slice are
> > also I believe de-emulated, which means the code use to prevent having
> > pattern looking like a start code has been removed, so you cannot just
> > prepend start codes. De-emulation seems better placed in userspace if
> > the HW does not take care.
> 
> Mhh I'd like to avoid having having to specify the offset to each slice
> for the legacy case. Just appending the encoded data (excluding slice
> header and start code) works for cedrus and I think it makes sense more
> generally. The idea is to only expose a single slice params and act as
> if it was just one big slice buffer.
> 
> Come to think of it, maybe we need annex-b and mixed fashions of that
> legacy pixfmt too...
> 
> > I also very dislike the idea that we would enforce merging all slice
> > into the same buffer. The entire purpose of slices and the reason they
> > are used in practice is that you can start decoding slices before you
> > have all slices of a frame. This reduce drastically the latency for
> > streaming use cases, like video conferencing. So forcing the merging of
> > slices is basically like pretending slices have no benefits.
> 
> Of course, we don't want things to stay like this and this rework is
> definitely needed to get serious performance and latency going.
> 
> One thing you should also be aware of: we're currently using a
> workqueue between the job done irq and scheduling the next frame (in
> v4l2 m2m).
> 
> Maybe we could manage to fit that into an atomic path to schedule the
> next request in the previous job done irq context.
> 
> > I have just exposed the problem I see for now, to see what comes up.
> > But I hope we be able to propose solution too in the short term (in no
> > one beats me at it).
> 
> Seems that we have good grounds for a discussion!
> 
> Cheers,
> 
> Paul
> 
> > > +
> > > +A typical frame would thus be decoded using the following sequence:
> > > +
> > > +1. Queue an ``OUTPUT`` buffer containing one unit of encoded bitstream data for
> > > +   the decoding request, using :c:func:`VIDIOC_QBUF`.
> > > +
> > > +    * **Required fields:**
> > > +
> > > +      ``index``
> > > +          index of the buffer being queued.
> > > +
> > > +      ``type``
> > > +          type of the buffer.
> > > +
> > > +      ``bytesused``
> > > +          number of bytes taken by the encoded data frame in the buffer.
> > > +
> > > +      ``flags``
> > > +          the ``V4L2_BUF_FLAG_REQUEST_FD`` flag must be set.
> > > +
> > > +      ``request_fd``
> > > +          must be set to the file descriptor of the decoding request.
> > > +
> > > +      ``timestamp``
> > > +          must be set to a unique value per frame. This value will be propagated
> > > +          into the decoded frame's buffer and can also be used to use this frame
> > > +          as the reference of another.
> > > +
> > > +2. Set the codec-specific controls for the decoding request, using
> > > +   :c:func:`VIDIOC_S_EXT_CTRLS`.
> > > +
> > > +    * **Required fields:**
> > > +
> > > +      ``which``
> > > +          must be ``V4L2_CTRL_WHICH_REQUEST_VAL``.
> > > +
> > > +      ``request_fd``
> > > +          must be set to the file descriptor of the decoding request.
> > > +
> > > +      other fields
> > > +          other fields are set as usual when setting controls. The ``controls``
> > > +          array must contain all the codec-specific controls required to decode
> > > +          a frame.
> > > +
> > > +   .. note::
> > > +
> > > +      It is possible to specify the controls in different invocations of
> > > +      :c:func:`VIDIOC_S_EXT_CTRLS`, or to overwrite a previously set control, as
> > > +      long as ``request_fd`` and ``which`` are properly set. The controls state
> > > +      at the moment of request submission is the one that will be considered.
> > > +
> > > +   .. note::
> > > +
> > > +      The order in which steps 1 and 2 take place is interchangeable.
> > > +
> > > +3. Submit the request by invoking :c:func:`MEDIA_REQUEST_IOC_QUEUE` on the
> > > +   request FD.
> > > +
> > > +    If the request is submitted without an ``OUTPUT`` buffer, or if some of the
> > > +    required controls are missing from the request, then
> > > +    :c:func:`MEDIA_REQUEST_IOC_QUEUE` will return ``-ENOENT``. If more than one
> > > +    ``OUTPUT`` buffer is queued, then it will return ``-EINVAL``.
> > > +    :c:func:`MEDIA_REQUEST_IOC_QUEUE` returning non-zero means that no
> > > +    ``CAPTURE`` buffer will be produced for this request.
> > > +
> > > +``CAPTURE`` buffers must not be part of the request, and are queued
> > > +independently. They are returned in decode order (i.e. the same order as coded
> > > +frames were submitted to the ``OUTPUT`` queue).
> > > +
> > > +Runtime decoding errors are signaled by the dequeued ``CAPTURE`` buffers
> > > +carrying the ``V4L2_BUF_FLAG_ERROR`` flag. If a decoded reference frame has an
> > > +error, then all following decoded frames that refer to it also have the
> > > +``V4L2_BUF_FLAG_ERROR`` flag set, although the decoder will still try to
> > > +produce a (likely corrupted) frame.
> > > +
> > > +Buffer management while decoding
> > > +================================
> > > +Contrary to stateful decoders, a stateless decoder does not perform any kind of
> > > +buffer management: it only guarantees that dequeued ``CAPTURE`` buffers can be
> > > +used by the client for as long as they are not queued again. "Used" here
> > > +encompasses using the buffer for compositing or display.
> > > +
> > > +A dequeued capture buffer can also be used as the reference frame of another
> > > +buffer.
> > > +
> > > +A frame is specified as reference by converting its timestamp into nanoseconds,
> > > +and storing it into the relevant member of a codec-dependent control structure.
> > > +The :c:func:`v4l2_timeval_to_ns` function must be used to perform that
> > > +conversion. The timestamp of a frame can be used to reference it as soon as all
> > > +its units of encoded data are successfully submitted to the ``OUTPUT`` queue.
> > > +
> > > +A decoded buffer containing a reference frame must not be reused as a decoding
> > > +target until all the frames referencing it have been decoded. The safest way to
> > > +achieve this is to refrain from queueing a reference buffer until all the
> > > +decoded frames referencing it have been dequeued. However, if the driver can
> > > +guarantee that buffer queued to the ``CAPTURE`` queue will be used in queued
> > > +order, then user-space can take advantage of this guarantee and queue a
> > > +reference buffer when the following conditions are met:
> > > +
> > > +1. All the requests for frames affected by the reference frame have been
> > > +   queued, and
> > > +
> > > +2. A sufficient number of ``CAPTURE`` buffers to cover all the decoded
> > > +   referencing frames have been queued.
> > > +
> > > +When queuing a decoding request, the driver will increase the reference count of
> > > +all the resources associated with reference frames. This means that the client
> > > +can e.g. close the DMABUF file descriptors of reference frame buffers if it
> > > +won't need them afterwards.
> > > +
> > > +Seeking
> > > +=======
> > > +In order to seek, the client just needs to submit requests using input buffers
> > > +corresponding to the new stream position. It must however be aware that
> > > +resolution may have changed and follow the dynamic resolution change sequence in
> > > +that case. Also depending on the codec used, picture parameters (e.g. SPS/PPS
> > > +for H.264) may have changed and the client is responsible for making sure that a
> > > +valid state is sent to the decoder.
> > > +
> > > +The client is then free to ignore any returned ``CAPTURE`` buffer that comes
> > > +from the pre-seek position.
> > > +
> > > +Pause
> > > +=====
> > > +
> > > +In order to pause, the client can just cease queuing buffers onto the ``OUTPUT``
> > > +queue. Without source bitstream data, there is no data to process and the codec
> > > +will remain idle.
> > > +
> > > +Dynamic resolution change
> > > +=========================
> > > +
> > > +If the client detects a resolution change in the stream, it will need to perform
> > > +the initialization sequence again with the new resolution:
> > > +
> > > +1. Wait until all submitted requests have completed and dequeue the
> > > +   corresponding output buffers.
> > > +
> > > +2. Call :c:func:`VIDIOC_STREAMOFF` on both the ``OUTPUT`` and ``CAPTURE``
> > > +   queues.
> > > +
> > > +3. Free all ``CAPTURE`` buffers by calling :c:func:`VIDIOC_REQBUFS` on the
> > > +   ``CAPTURE`` queue with a buffer count of zero.
> > > +
> > > +4. Perform the initialization sequence again (minus the allocation of
> > > +   ``OUTPUT`` buffers), with the new resolution set on the ``OUTPUT`` queue.
> > > +   Note that due to resolution constraints, a different format may need to be
> > > +   picked on the ``CAPTURE`` queue.
> > > +
> > > +Drain
> > > +=====
> > > +
> > > +In order to drain the stream on a stateless decoder, the client just needs to
> > > +wait until all the submitted requests are completed. There is no need to send a
> > > +``V4L2_DEC_CMD_STOP`` command since requests are processed sequentially by the
> > > +decoder.
Paul Kocialkowski April 17, 2019, 3:40 p.m. UTC | #13
Hi,

On Wed, 2019-04-17 at 11:30 -0400, Nicolas Dufresne wrote:
> Le dimanche 14 avril 2019 à 18:41 +0200, Paul Kocialkowski a écrit :
> > Hi,
> > 
> > Le vendredi 12 avril 2019 à 16:47 -0400, Nicolas Dufresne a écrit :
> > > Le mercredi 06 mars 2019 à 17:00 +0900, Alexandre Courbot a écrit :
> > > > Documents the protocol that user-space should follow when
> > > > communicating with stateless video decoders.
> > > > 
> > > > The stateless video decoding API makes use of the new request and tags
> > > > APIs. While it has been implemented with the Cedrus driver so far, it
> > > > should probably still be considered staging for a short while.
> > 
> > [...]
> > 
> > > From an IRC discussion with Paul and some more digging, I have found a
> > > design problem in the decoding process.
> > > 
> > > In H264 and HEVC you can have multiple decoding unit per frames
> > > (slices). This type of encoding is increasingly popular, specially for
> > > low latency streaming use cases. The wording of this spec does allow
> > > for the notion of decoding unit, and in practice it has been proven to
> > > work through some RFC FFMPEG patches and the Cedrus driver. But
> > > something important to know is that the FFMPEG RFC implements decoding
> > > in lock steps. Which means:
> > > 
> > >   1. It queues a single free capture buffer
> > >   2. It queues an output buffer, set controls, queue the request
> > >   3. It waits for a capture buffer to reach state done
> > >   4. It dequeues that capture buffer, and queue it back again
> > >   5. And then it runs step 2,4,3 again with following slices, until we 
> > >      have a complete frame. After what, it restart at step 1
> > > 
> > > So the implementation makes no use of the queues. There is no batch
> > > processing, so we might not be able to reach the maximum hardware
> > > throughput.
> > > 
> > > So the optimal method would look like the following, but there comes
> > > the design issue.
> > > 
> > >   1. Queue a single free capture buffer
> > >   2. Queue output buffer for slice 1, set controls, queue the request
> > >   3. Queue output buffer for slice 2, set controls, queue the request
> > >   4. Wait for completion
> > > 
> > > The problem is in step 4. Completion means that the capture buffer done
> > > decoding a single unit. So assuming the driver supports matching the
> > > timestamp against the queued buffer, instead of waiting for a new
> > > buffer, the driver would have to mark twice the same buffer to done
> > > state, which is just not working to inform userspace that all slices
> > > are decoded into the one capture buffer they share.
> > 
> > Interestingly, I'm experiencing the exact same problem dealing with a
> > 2D graphics blitter that has limited ouput scaling abilities which
> > imply handlnig a large scaling operation as multiple clipped smaller
> > scaling operations. The issue is basically that multiple jobs have to
> > be submitted to complete a single frame and relying on an indication
> > from the destination buffer (such as a fence) doesn't work to indicate
> > that all the operations were completed, since we get the indication at
> > each step instead of at the end of the batch.
> 
> That looks similar to the IMX.6 IPU m2m driver. It splits the image in
> tiles of 1024x1024 and process each tile separately. This driver has
> been around for a long time, so I guess they have a solution to that.
> They don't need requests, because there is nothing to be bundled with
> the input image. I know that Renesas folks have started working on a
> de-interlacer. Again, this kind of driver may process and reuse input
> buffers for motion compensation, but I don't think they need special
> userspace API for that.

Thanks for the reference! I hope it's not a blitter that was
contributed as a V4L2 driver instead of DRM, as it probably would be
more useful in DRM (but that's way beside the point).

> > One idea I see to solve this is to have a notion of batch in the driver
> > (for our situation, that would be in v4l2) and provide means to get a
> > done indication for that entity.
> 
> Can't you just make this part of your driver state machine ?

Yes definitely, and I forgot to mention that's in DRM, not V4L2.

Anyway from that point on, I was back to talking about our codec
situation, not my 2D blitter anymore :)

> > I think we could extend the request API to allow this. We already
> > represent requests as individual file descriptors, we could totally
> > group requests in batches and get a sync fd for the batch to poll on
> > when we need to return the frames. It would be good if we could expose
> > this in a way that makes it work with DRM as an in fence for display.
> > Then we can pretty much schedule our flip + decoding together (which is
> > quite nice to have when we're running late on the decoding side).
> > 
> > What do you think?
> 
> I'm not sure why this specific thing needs a userspace exposition.

Indeed, I'll handle it in my 2D driver.

Cheers,

Paul

> > It feels to me like the request API was designed to open up the way for
> > these kinds of improvements, so I'm sure we can find an agreeable
> > solution that extends the API.
> > 
> > > To me, multi slice encoded stream are just too common, and they will
> > > also exist for AV1. So we really need a solution to this that does not
> > > require operating in lock steps. Specially that some HW can decode
> > > multiple slices in parallel (multi core), we would not want to prevent
> > > that HW from being used efficiently. On top of this, we need a solution
> > > so that we can also keep queuing slice of the following frames if they
> > > arrive before decoding is done.
> > 
> > Agreed.
> > 
> > > I don't have a solution yet myself, but it would be nice to come up
> > > with something before we freeze this API.
> > 
> > I think it's rather independent from the codec used and this is
> > something that should be handled at the request API level. 
> > 
> > I'm not sure we can always expect the hardware to be able to operate on
> > a per-slice basis. I think it would be useful to reflect this in the
> > pixel format, so that we also have a possibility for a gathered slice
> > buffer (as the spec currently mentions for mpeg-2) for legacy decoder
> > hardware that will need to decode one frame in one go from a contiguous
> > buffer with all the slice data appended.
> > 
> > This updates my pixel format proposition from IRC to the following:
> > - V4L2_PIXFMT_H264_SLICE_APPENDED: single buffer for all the slices
> > (appended buffer), slice params as v4l2 control (legacy);
> > 
> > - V4L2_PIXFMT_H264_SLICE: one buffer/slice, slice params as control;
> > 
> > - V4L2_PIXFMT_H264_SLICE_ANNEX_B: one buffer/slice in annex-b format, 
> > slice params encoded in the buffer;
> > 
> > - V4L2_PIXFMT_H264_SLICE_MIXED: one buffer/slice in annex-b format,
> > slice params encoded in the buffer and in slice params control;
> > 
> > Also, we need to make sure to have a per-slice bit offset to the
> > encoded data in the slice params control so that the same slice buffer
> > can be used with any of the _SLICE formats (e.g. ffmpeg would only have
> > an annex-b slice and use any of the formats with it).
> > 
> > For the legacy format, we need to specify that the appended slices
> > don't repeat the annex-b start code and NAL header.
> > 
> > What do you think?
> > 
> > >  By the way, if we could queue
> > > twice the same buffer, that would in principal work, but internally
> > > there is only one state per buffer. If you do external allocation, then
> > > in theory you could workaround that, but then it's ugly, because you'll
> > > have two buffers with the same timestamp.
> > 
> > One advantage of the request API is that buffers are actually queued
> > when the request is processed, so this might not be too problematic.
> > 
> > I think what we need boils down to:
> > - Being able to queue the same output buffer to multiple requests,
> > which the request API should already allow;
> > - Being able to grab the right capture buffer based on the output
> > timestamp so that the different requests for the slices are rendered to
> > the same destination buffer.
> > 
> > For the second point, I don't really have a clear idea of whether we
> > can already expect v4l2 to allow picking a buffer that was marked done
> > but was not de-queued by userspace yet. It might already be allowed and
> > we could just implement something to lookup the buffer to grab by
> > timestamp.
> > 
> > > An argument that was made early was that we don't need to support this
> > > right away because userspace can combine all the slices into one
> > > buffer. But for H264_SLICE_RAW format it's inconvenient, you'd need an
> > > extra control to tell the driver the offset to each slices, because the
> > > raw H264 does not have enough information to be parsed. RAW slice are
> > > also I believe de-emulated, which means the code use to prevent having
> > > pattern looking like a start code has been removed, so you cannot just
> > > prepend start codes. De-emulation seems better placed in userspace if
> > > the HW does not take care.
> > 
> > Mhh I'd like to avoid having having to specify the offset to each slice
> > for the legacy case. Just appending the encoded data (excluding slice
> > header and start code) works for cedrus and I think it makes sense more
> > generally. The idea is to only expose a single slice params and act as
> > if it was just one big slice buffer.
> > 
> > Come to think of it, maybe we need annex-b and mixed fashions of that
> > legacy pixfmt too...
> > 
> > > I also very dislike the idea that we would enforce merging all slice
> > > into the same buffer. The entire purpose of slices and the reason they
> > > are used in practice is that you can start decoding slices before you
> > > have all slices of a frame. This reduce drastically the latency for
> > > streaming use cases, like video conferencing. So forcing the merging of
> > > slices is basically like pretending slices have no benefits.
> > 
> > Of course, we don't want things to stay like this and this rework is
> > definitely needed to get serious performance and latency going.
> > 
> > One thing you should also be aware of: we're currently using a
> > workqueue between the job done irq and scheduling the next frame (in
> > v4l2 m2m).
> > 
> > Maybe we could manage to fit that into an atomic path to schedule the
> > next request in the previous job done irq context.
> > 
> > > I have just exposed the problem I see for now, to see what comes up.
> > > But I hope we be able to propose solution too in the short term (in no
> > > one beats me at it).
> > 
> > Seems that we have good grounds for a discussion!
> > 
> > Cheers,
> > 
> > Paul
> > 
> > > > +
> > > > +A typical frame would thus be decoded using the following sequence:
> > > > +
> > > > +1. Queue an ``OUTPUT`` buffer containing one unit of encoded bitstream data for
> > > > +   the decoding request, using :c:func:`VIDIOC_QBUF`.
> > > > +
> > > > +    * **Required fields:**
> > > > +
> > > > +      ``index``
> > > > +          index of the buffer being queued.
> > > > +
> > > > +      ``type``
> > > > +          type of the buffer.
> > > > +
> > > > +      ``bytesused``
> > > > +          number of bytes taken by the encoded data frame in the buffer.
> > > > +
> > > > +      ``flags``
> > > > +          the ``V4L2_BUF_FLAG_REQUEST_FD`` flag must be set.
> > > > +
> > > > +      ``request_fd``
> > > > +          must be set to the file descriptor of the decoding request.
> > > > +
> > > > +      ``timestamp``
> > > > +          must be set to a unique value per frame. This value will be propagated
> > > > +          into the decoded frame's buffer and can also be used to use this frame
> > > > +          as the reference of another.
> > > > +
> > > > +2. Set the codec-specific controls for the decoding request, using
> > > > +   :c:func:`VIDIOC_S_EXT_CTRLS`.
> > > > +
> > > > +    * **Required fields:**
> > > > +
> > > > +      ``which``
> > > > +          must be ``V4L2_CTRL_WHICH_REQUEST_VAL``.
> > > > +
> > > > +      ``request_fd``
> > > > +          must be set to the file descriptor of the decoding request.
> > > > +
> > > > +      other fields
> > > > +          other fields are set as usual when setting controls. The ``controls``
> > > > +          array must contain all the codec-specific controls required to decode
> > > > +          a frame.
> > > > +
> > > > +   .. note::
> > > > +
> > > > +      It is possible to specify the controls in different invocations of
> > > > +      :c:func:`VIDIOC_S_EXT_CTRLS`, or to overwrite a previously set control, as
> > > > +      long as ``request_fd`` and ``which`` are properly set. The controls state
> > > > +      at the moment of request submission is the one that will be considered.
> > > > +
> > > > +   .. note::
> > > > +
> > > > +      The order in which steps 1 and 2 take place is interchangeable.
> > > > +
> > > > +3. Submit the request by invoking :c:func:`MEDIA_REQUEST_IOC_QUEUE` on the
> > > > +   request FD.
> > > > +
> > > > +    If the request is submitted without an ``OUTPUT`` buffer, or if some of the
> > > > +    required controls are missing from the request, then
> > > > +    :c:func:`MEDIA_REQUEST_IOC_QUEUE` will return ``-ENOENT``. If more than one
> > > > +    ``OUTPUT`` buffer is queued, then it will return ``-EINVAL``.
> > > > +    :c:func:`MEDIA_REQUEST_IOC_QUEUE` returning non-zero means that no
> > > > +    ``CAPTURE`` buffer will be produced for this request.
> > > > +
> > > > +``CAPTURE`` buffers must not be part of the request, and are queued
> > > > +independently. They are returned in decode order (i.e. the same order as coded
> > > > +frames were submitted to the ``OUTPUT`` queue).
> > > > +
> > > > +Runtime decoding errors are signaled by the dequeued ``CAPTURE`` buffers
> > > > +carrying the ``V4L2_BUF_FLAG_ERROR`` flag. If a decoded reference frame has an
> > > > +error, then all following decoded frames that refer to it also have the
> > > > +``V4L2_BUF_FLAG_ERROR`` flag set, although the decoder will still try to
> > > > +produce a (likely corrupted) frame.
> > > > +
> > > > +Buffer management while decoding
> > > > +================================
> > > > +Contrary to stateful decoders, a stateless decoder does not perform any kind of
> > > > +buffer management: it only guarantees that dequeued ``CAPTURE`` buffers can be
> > > > +used by the client for as long as they are not queued again. "Used" here
> > > > +encompasses using the buffer for compositing or display.
> > > > +
> > > > +A dequeued capture buffer can also be used as the reference frame of another
> > > > +buffer.
> > > > +
> > > > +A frame is specified as reference by converting its timestamp into nanoseconds,
> > > > +and storing it into the relevant member of a codec-dependent control structure.
> > > > +The :c:func:`v4l2_timeval_to_ns` function must be used to perform that
> > > > +conversion. The timestamp of a frame can be used to reference it as soon as all
> > > > +its units of encoded data are successfully submitted to the ``OUTPUT`` queue.
> > > > +
> > > > +A decoded buffer containing a reference frame must not be reused as a decoding
> > > > +target until all the frames referencing it have been decoded. The safest way to
> > > > +achieve this is to refrain from queueing a reference buffer until all the
> > > > +decoded frames referencing it have been dequeued. However, if the driver can
> > > > +guarantee that buffer queued to the ``CAPTURE`` queue will be used in queued
> > > > +order, then user-space can take advantage of this guarantee and queue a
> > > > +reference buffer when the following conditions are met:
> > > > +
> > > > +1. All the requests for frames affected by the reference frame have been
> > > > +   queued, and
> > > > +
> > > > +2. A sufficient number of ``CAPTURE`` buffers to cover all the decoded
> > > > +   referencing frames have been queued.
> > > > +
> > > > +When queuing a decoding request, the driver will increase the reference count of
> > > > +all the resources associated with reference frames. This means that the client
> > > > +can e.g. close the DMABUF file descriptors of reference frame buffers if it
> > > > +won't need them afterwards.
> > > > +
> > > > +Seeking
> > > > +=======
> > > > +In order to seek, the client just needs to submit requests using input buffers
> > > > +corresponding to the new stream position. It must however be aware that
> > > > +resolution may have changed and follow the dynamic resolution change sequence in
> > > > +that case. Also depending on the codec used, picture parameters (e.g. SPS/PPS
> > > > +for H.264) may have changed and the client is responsible for making sure that a
> > > > +valid state is sent to the decoder.
> > > > +
> > > > +The client is then free to ignore any returned ``CAPTURE`` buffer that comes
> > > > +from the pre-seek position.
> > > > +
> > > > +Pause
> > > > +=====
> > > > +
> > > > +In order to pause, the client can just cease queuing buffers onto the ``OUTPUT``
> > > > +queue. Without source bitstream data, there is no data to process and the codec
> > > > +will remain idle.
> > > > +
> > > > +Dynamic resolution change
> > > > +=========================
> > > > +
> > > > +If the client detects a resolution change in the stream, it will need to perform
> > > > +the initialization sequence again with the new resolution:
> > > > +
> > > > +1. Wait until all submitted requests have completed and dequeue the
> > > > +   corresponding output buffers.
> > > > +
> > > > +2. Call :c:func:`VIDIOC_STREAMOFF` on both the ``OUTPUT`` and ``CAPTURE``
> > > > +   queues.
> > > > +
> > > > +3. Free all ``CAPTURE`` buffers by calling :c:func:`VIDIOC_REQBUFS` on the
> > > > +   ``CAPTURE`` queue with a buffer count of zero.
> > > > +
> > > > +4. Perform the initialization sequence again (minus the allocation of
> > > > +   ``OUTPUT`` buffers), with the new resolution set on the ``OUTPUT`` queue.
> > > > +   Note that due to resolution constraints, a different format may need to be
> > > > +   picked on the ``CAPTURE`` queue.
> > > > +
> > > > +Drain
> > > > +=====
> > > > +
> > > > +In order to drain the stream on a stateless decoder, the client just needs to
> > > > +wait until all the submitted requests are completed. There is no need to send a
> > > > +``V4L2_DEC_CMD_STOP`` command since requests are processed sequentially by the
> > > > +decoder.
Nicolas Dufresne April 17, 2019, 4:06 p.m. UTC | #14
Le mardi 16 avril 2019 à 16:22 +0900, Alexandre Courbot a écrit :
> On Tue, Apr 16, 2019 at 12:30 AM Nicolas Dufresne <nicolas@ndufresne.ca> wrote:
> > Le lundi 15 avril 2019 à 15:26 +0200, Paul Kocialkowski a écrit :
> > > Hi,
> > > 
> > > On Mon, 2019-04-15 at 08:24 -0400, Nicolas Dufresne wrote:
> > > > Le lundi 15 avril 2019 à 09:58 +0200, Paul Kocialkowski a écrit :
> > > > > Hi,
> > > > > 
> > > > > On Sun, 2019-04-14 at 18:38 -0400, Nicolas Dufresne wrote:
> > > > > > Le dimanche 14 avril 2019 à 18:41 +0200, Paul Kocialkowski a écrit :
> > > > > > > Hi,
> > > > > > > 
> > > > > > > Le vendredi 12 avril 2019 à 16:47 -0400, Nicolas Dufresne a écrit :
> > > > > > > > Le mercredi 06 mars 2019 à 17:00 +0900, Alexandre Courbot a écrit :
> > > > > > > > > Documents the protocol that user-space should follow when
> > > > > > > > > communicating with stateless video decoders.
> > > > > > > > > 
> > > > > > > > > The stateless video decoding API makes use of the new request and tags
> > > > > > > > > APIs. While it has been implemented with the Cedrus driver so far, it
> > > > > > > > > should probably still be considered staging for a short while.
> > > > > > > 
> > > > > > > [...]
> > > > > > > 
> > > > > > > > From an IRC discussion with Paul and some more digging, I have found a
> > > > > > > > design problem in the decoding process.
> > > > > > > > 
> > > > > > > > In H264 and HEVC you can have multiple decoding unit per frames
> > > > > > > > (slices). This type of encoding is increasingly popular, specially for
> > > > > > > > low latency streaming use cases. The wording of this spec does allow
> > > > > > > > for the notion of decoding unit, and in practice it has been proven to
> > > > > > > > work through some RFC FFMPEG patches and the Cedrus driver. But
> > > > > > > > something important to know is that the FFMPEG RFC implements decoding
> > > > > > > > in lock steps. Which means:
> > > > > > > > 
> > > > > > > >   1. It queues a single free capture buffer
> > > > > > > >   2. It queues an output buffer, set controls, queue the request
> > > > > > > >   3. It waits for a capture buffer to reach state done
> > > > > > > >   4. It dequeues that capture buffer, and queue it back again
> > > > > > > >   5. And then it runs step 2,4,3 again with following slices, until we
> > > > > > > >      have a complete frame. After what, it restart at step 1
> > > > > > > > 
> > > > > > > > So the implementation makes no use of the queues. There is no batch
> > > > > > > > processing, so we might not be able to reach the maximum hardware
> > > > > > > > throughput.
> > > > > > > > 
> > > > > > > > So the optimal method would look like the following, but there comes
> > > > > > > > the design issue.
> > > > > > > > 
> > > > > > > >   1. Queue a single free capture buffer
> > > > > > > >   2. Queue output buffer for slice 1, set controls, queue the request
> > > > > > > >   3. Queue output buffer for slice 2, set controls, queue the request
> > > > > > > >   4. Wait for completion
> > > > > > > > 
> > > > > > > > The problem is in step 4. Completion means that the capture buffer done
> > > > > > > > decoding a single unit. So assuming the driver supports matching the
> > > > > > > > timestamp against the queued buffer, instead of waiting for a new
> > > > > > > > buffer, the driver would have to mark twice the same buffer to done
> > > > > > > > state, which is just not working to inform userspace that all slices
> > > > > > > > are decoded into the one capture buffer they share.
> > > > > > > 
> > > > > > > Interestingly, I'm experiencing the exact same problem dealing with a
> > > > > > > 2D graphics blitter that has limited ouput scaling abilities which
> > > > > > > imply handlnig a large scaling operation as multiple clipped smaller
> > > > > > > scaling operations. The issue is basically that multiple jobs have to
> > > > > > > be submitted to complete a single frame and relying on an indication
> > > > > > > from the destination buffer (such as a fence) doesn't work to indicate
> > > > > > > that all the operations were completed, since we get the indication at
> > > > > > > each step instead of at the end of the batch.
> > > > > > > 
> > > > > > > One idea I see to solve this is to have a notion of batch in the driver
> > > > > > > (for our situation, that would be in v4l2) and provide means to get a
> > > > > > > done indication for that entity.
> > > > > > > 
> > > > > > > I think we could extend the request API to allow this. We already
> > > > > > > represent requests as individual file descriptors, we could totally
> > > > > > > group requests in batches and get a sync fd for the batch to poll on
> > > > > > > when we need to return the frames. It would be good if we could expose
> > > > > > > this in a way that makes it work with DRM as an in fence for display.
> > > > > > > Then we can pretty much schedule our flip + decoding together (which is
> > > > > > > quite nice to have when we're running late on the decoding side).
> > > > > > > 
> > > > > > > What do you think?
> > > > > > > 
> > > > > > > It feels to me like the request API was designed to open up the way for
> > > > > > > these kinds of improvements, so I'm sure we can find an agreeable
> > > > > > > solution that extends the API.
> > > > > > > 
> > > > > > > > To me, multi slice encoded stream are just too common, and they will
> > > > > > > > also exist for AV1. So we really need a solution to this that does not
> > > > > > > > require operating in lock steps. Specially that some HW can decode
> > > > > > > > multiple slices in parallel (multi core), we would not want to prevent
> > > > > > > > that HW from being used efficiently. On top of this, we need a solution
> > > > > > > > so that we can also keep queuing slice of the following frames if they
> > > > > > > > arrive before decoding is done.
> > > > > > > 
> > > > > > > Agreed.
> > > > > > > 
> > > > > > > > I don't have a solution yet myself, but it would be nice to come up
> > > > > > > > with something before we freeze this API.
> > > > > > > 
> > > > > > > I think it's rather independent from the codec used and this is
> > > > > > > something that should be handled at the request API level.
> > > > > > > 
> > > > > > > I'm not sure we can always expect the hardware to be able to operate on
> > > > > > > a per-slice basis. I think it would be useful to reflect this in the
> > > > > > > pixel format, so that we also have a possibility for a gathered slice
> > > > > > > buffer (as the spec currently mentions for mpeg-2) for legacy decoder
> > > > > > > hardware that will need to decode one frame in one go from a contiguous
> > > > > > > buffer with all the slice data appended.
> > > > > > > 
> > > > > > > This updates my pixel format proposition from IRC to the following:
> > > > > > > - V4L2_PIXFMT_H264_SLICE_APPENDED: single buffer for all the slices
> > > > > > > (appended buffer), slice params as v4l2 control (legacy);
> > > > > > 
> > > > > > SLICE_RAW_APPENDED ? Or FRAMED_SLICES_RAW ? You would need a new
> > > > > > control for the NAL index, as there is no way to figure-out otherwise.
> > > > > > I would not add this format unless a specific HW need it.
> > > > > 
> > > > > I don't really like using "raw" as a distinguisher: I don't think it's
> > > > > explicit enough. The idea here is to reflect that there is only one
> > > > > slice exposed, which is the appended result of all the frame slices
> > > > > with a single v4l2 control.
> > > > 
> > > > RAW in this context was suggested to reflect the fact there is no
> > > > header, no slice header and that emulation prevention bytes has been
> > > > removed and replaces by the real values.
> > > 
> > > That could also be understood as "slice params coded raw", which is the
> > > opposite of what it describes, hence my reluctance.
> > > 
> > > > Just SLICE alone was much worst.
> > > 
> > > Keep in mind that we already have a MPEG2_SLICE format in the public
> > > API. We should probably decide what it should become based on the
> > > outcome of this discussion.
> > > 
> > > >  There is to many properties to this type of H264 buffer to
> > > > encode everything into the name, so what will really matter in the end
> > > > if the documentation. Feel free to propose a better name.
> > > 
> > > Agreed, it's a side point. I always find it hard to find naming good,
> > > as well as finding good naming (my suggestions aren't really top-notch
> > > either).
> > > 
> > > Here is another proposition:
> > > - SLICE_PARSED
> > > - SLICE_ANNEX_B
> > > - SLICE_PARSED_ANNEX_B
> > 
> > Ok, we'll keep working on that then, naming is hard. I guess by PARSED
> > you meant that the slice headers are passed as controls, and that
> > indeed make sense. But I really thought all stateless decoder would
> > required that. A hard bet obviously.
> > 
> > > > > > > - V4L2_PIXFMT_H264_SLICE: one buffer/slice, slice params as control;
> > > > > > > 
> > > > > > > - V4L2_PIXFMT_H264_SLICE_ANNEX_B: one buffer/slice in annex-b format,
> > > > > > > slice params encoded in the buffer;
> > > > > > 
> > > > > > We are still working on this one, this format will be used by Rockchip
> > > > > > driver for sure, but this needs clarification and maybe a rename if
> > > > > > it's not just one slice per buffer.
> > > > > 
> > > > > I thought the decoder also needed the parse slice data? At least IIRC
> > > > > for Tegra, we need Annex-B format and a parsed slice header (so the
> > > > > next one).
> > > > 
> > > > Yes, in every cases, the HW will parse the slice data.
> > > 
> > > Ah sorry, I meant "need the parsed slice data" (missed the d), as in,
> > > some decoders will need annex-b format but won't parse the slice header
> > > on their own, so they also need the parsed slice header control.
> > > Don't ask why...
> > > 
> > > In my proposition from above, that would be: SLICE_PARSED_ANNEX_B.
> > > 
> > > >  It's possible
> > > > that Tegra have a matching format as Rockchip, someone would need to do
> > > > a proper integration to verify. But the driver does not need the
> > > > following one, that is specific to ANNEX-B parsing.
> > > > 
> > > > > > > - V4L2_PIXFMT_H264_SLICE_MIXED: one buffer/slice in annex-b format,
> > > > > > > slice params encoded in the buffer and in slice params control;
> > > > > > > 
> > > > > > > Also, we need to make sure to have a per-slice bit offset to the
> > > > > > > encoded data in the slice params control so that the same slice buffer
> > > > > > > can be used with any of the _SLICE formats (e.g. ffmpeg would only have
> > > > > > > an annex-b slice and use any of the formats with it).
> > > > > > 
> > > > > > Ah, I we are saying the same.
> > > > > > 
> > > > > > > For the legacy format, we need to specify that the appended slices
> > > > > > > don't repeat the annex-b start code and NAL header.
> > > > > > 
> > > > > > I'm not sure this one make sense. the NAL header for each slices of one
> > > > > > frames are not always identical.
> > > > > 
> > > > > Yes but that's pretty much the point of the legacy format: to only
> > > > > expose a single slice buffer and slice header (even in cases where the
> > > > > bitstream codes them in multiple distinct ones).
> > > > > 
> > > > > We can't expect this to work in every case, that's why it's a legacy
> > > > > format. It seems to work pretty well for cedrus so far.
> > > > 
> > > > I'm not sure I follow you, what Cedrus does should be changed to
> > > > whatever we decide as a final API, we should not maintain two formats.
> > > 
> > > That point has me hesitating. It depends on whether we can expect to
> > > see hardware implementations with no support whatsoever for multi-slice
> > > per frame and just expect an aggregated buffer of slice compressed
> > > data. This is one operation mode that the Allwinner VPU supports.
> > > 
> > > The point is not to use it in Cedrus since our VPU can operate per-
> > > slice, but to allow supporting hardware decoders that can't do that in
> > > the future.
> > > 
> > > I'm not sure it's healthy to make it a hard requirement for H.264
> > > decoding to operate per-slice. Does that seem too far-fetched from your
> > > perspective? I seem to recall from a discussion that some legacy
> > > hardware only handles single-slices frames, but I may be wrong.
> > > 
> > > > Also, what works for Cedrus is that a each buffers must have a single
> > > > slice regardless how many slices per frame. And this is what I expect
> > > > from most stateless HW.
> > > 
> > > Currently, we append all the slices into one buffer and decode it in
> > > one go with a slightly hacked slice params to reflect that. But of
> > > course, we should be operating per-slice.
> > > 
> > > >  This is how it works in VAAPI and VDPAU as an
> > > > example. Just for the reference, the API in VAAPI is (pseudo code, I
> > > > can't remember the exact name):
> > > > 
> > > >    - beginPicture()
> > > >    - decodeSlice() *
> > > >    - endPicture()
> > > > 
> > > > So the accelerator is told explicitly when a frame start/end, but also
> > > > it's told explicitly in which buffer to decode the frame to.
> > > 
> > > Yes definitely. We're also given all the parsed bitstream elements in
> > > the right order so that we could already start queuing requests when
> > > each slice is passed, and just wait for completion at endPicture.
> > > 
> > > > > We could also decide to ditch the legacy idea altogether and only
> > > > > specify formats that operate on a per-slice basis, but I'm afraid we'll
> > > > > find decoders that can only take a single slice per buffer.
> > > > 
> > > > It's impossible for a compliant decoder to only support 1 slice per
> > > > frame, so I don't follow you on this one. Also, I don't understand what
> > > > difference you see between per-slice basis and single slice per buffer.
> > > 
> > > Okay that's exactly what I wanted to know: whether it makes any sense
> > > to build a decoder that only operates per-frame and not per-slice.
> > > If you are confident we won't see that in the wild, we can make it an
> > > API requirement to operate per-slice.
> > 
> > There is probably a small distinction to make between supporting
> > multiple slices per frame and operating per slice. It's nice to know
> > that Cedrus support both. As we discussed today on IRC, if we introduce
> > a flag that tells the driver when the last slice of a frame is passed,
> > it would be relatively simple for the driver to do decide what to do.
> > Of course if the HW have a limitation of one allocation, it might not
> > be fully optimal as it would have to copy.
> > 
> > But as this is stateless decoder, I'm more inclined in introducing a
> > format that means just that, leaving it to userspace to do that right
> > packing.
> > 
> > > > > When decoding a multi-slice frame in that setup, I think we'll be
> > > > > better off with an appended buffer containing all the slices for the
> > > > > frame instead of passing only a the first slice.
> > > > 
> > > > Appended slices requires extra controls, but also introduce a lot more
> > > > decoding latency. As soon as we add the missing frame boundary
> > > > signalling, it should be really trivial for a driver to wait until it
> > > > received all slices before starting the decoding if that is a HW
> > > > requirement.
> > > 
> > > Well, I don't really like the idea of the driver being aware of any of
> > > that (IMO the logic should be in the media core, not the driver).
> > > 
> > > If a driver can't do multiple slices, it shouldn't be up to the driver
> > > to gather them together. But anyway, if you think we won't ever see
> > > this kind of hardware, we can just drop the whole idea.
> > 
> > A compliant HW will support multiple slices per frame, that's not
> > really optional. But it may require all slices to be packed in a single
> > allocation, in which case it could copy, or we can just have a
> > dedicated format for this behaviour.
> > 
> > > > > > > What do you think?
> > > > > > > 
> > > > > > > >  By the way, if we could queue
> > > > > > > > twice the same buffer, that would in principal work, but internally
> > > > > > > > there is only one state per buffer. If you do external allocation, then
> > > > > > > > in theory you could workaround that, but then it's ugly, because you'll
> > > > > > > > have two buffers with the same timestamp.
> > > > > > > 
> > > > > > > One advantage of the request API is that buffers are actually queued
> > > > > > > when the request is processed, so this might not be too problematic.
> > > > > > > 
> > > > > > > I think what we need boils down to:
> > > > > > > - Being able to queue the same output buffer to multiple requests,
> > > > > > > which the request API should already allow;
> > > > > > > - Being able to grab the right capture buffer based on the output
> > > > > > > timestamp so that the different requests for the slices are rendered to
> > > > > > > the same destination buffer.
> > > > > > > 
> > > > > > > For the second point, I don't really have a clear idea of whether we
> > > > > > > can already expect v4l2 to allow picking a buffer that was marked done
> > > > > > > but was not de-queued by userspace yet. It might already be allowed and
> > > > > > > we could just implement something to lookup the buffer to grab by
> > > > > > > timestamp.
> > > > > > 
> > > > > > An entirely difference solution that came to my mind in the last few
> > > > > > days would be to add a new buffer flag that would mean END_OF_FRAME (or
> > > > > > reused the generic LAST flag). This flag would be passed on the last
> > > > > > slice (if it is known that we are handling the last one) or in an empty
> > > > > > buffer if it is found through parsing the next following NAL. This is
> > > > > > inspired by the optional OMX END_OF_FRAME flag and RTP marker bit.
> > > > > > Though, if we make this flag mandatory, the driver could avoid marking
> > > > > > the frame done until all slices has been decoded. The cons is that
> > > > > > userpace is not informed when a specific slice is done decoding. This
> > > > > > is quite niche, but you can use this information along with the list of
> > > > > > macroblocks from the slice header so signal which portion of the image
> > > > > > is now ready for an hypothetical video processing. The pros is that
> > > > > > this solution can be per format, so this would not be needed for VP8 as
> > > > > > an example.
> > > > > 
> > > > > Mhh, I don't really like the idea of setting an explicit order when
> > > > > there is really none. I guess the slices for a given frame can be
> > > > > decoded in whatever order, so I would like it better if we could just
> > > > > submit the batch of requests and be told when the batch is done,
> > > > > instead of specifying an explicit order and waiting for the last buffer
> > > > > to be marked done.
> > > > > 
> > > > > And I think this batch idea could apply to other things than video
> > > > > decoding, so it feels good to have it as the highest level we can in
> > > > > media/v4l2.
> > > > 
> > > > I haven't said anything about order. I believe you can decode slice
> > > > out-of-order in H264 but it is likely not true for all formats. You are
> > > > again missing the point of decoding latency.
> > > 
> > > Well, having an END_OF_FRAME flag on one of the slices pretty much
> > > implicitly defines an order (at least regarding this slice vs the
> > > others).
> > 
> > No, the flag simply means that any following request will be on another
> > frame. It's more like "closing" the decoded frame. I believe you have a
> > good understanding of this proposal now after our IRC discussion.
> > 
> > > > In live stream, the slices are transmitted over some serial link. If
> > > > you wait until you have all slice before you start decoding, you delay
> > > > further the moment the frame will be ready.
> > > 
> > > So that means we need some ability to add requests to a batch while the
> > > batch is being handled. Seems a bit exotic but definitely legit, and it
> > > can probably be done. Userspace would know when it has submitted all
> > > the slices and move on to displaying the frame.
> > > 
> > > >  A lot of vendors make use
> > > > of this to reduce latency, and libWebRTC also makes use of this. So
> > > > being able to pass slices as part of a specific frame is rather
> > > > important. Otherwise vendor will keep doing their own stuff as the
> > > > Linux kernel API won't allow reaching their customers expectation.
> > > 
> > > I fully agree we need to prepare for all these low-latency
> > > improvements. My goal is definitely to have something that can beat
> > > vendor-specific implementations in upstream, not just a proof of
> > > concept for half-baked decoding.
> 
> Thanks for this great discussion. Let me try to summarize the status
> of this thread + the IRC discussion and add my own thoughts:
> 
> Proper support for multiple decoding units (e.g. H.264 slices) per
> frame should not be an afterthought ; compliance to encoded formats
> depend on it, and the benefit of lower latency is a significant
> consideration for vendors.
> 
> m2m, which we use for all stateless codecs, has a strong assumption
> that one OUTPUT buffer consumed results in one CAPTURE buffer being
> produced. This assumption can however be overruled: at least the venus
> driver does it to implement the stateful specification.

The m2m framework code, which is quite minimal, has this limitation,
but it has nothing to do with the userspace M2M interface. In
userspace, M2M are just two asynchronous queues. New input data is
queued on the OUTPUT queue, and results is taken from the CAPTURE
queue. There is nothing in the API or the spec that limits how many
input data (OUTPUT queue) will be used to produce a number of results
(CAPTURE queue).

> 
> So we need a way to specify frame boundaries when submitting encoded
> content to the driver. One request should contain a single OUTPUT
> buffer, containing a single decoding unit, but we need a way to
> specify whether the driver should directly produce a CAPTURE buffer
> from this request, or keep using the same CAPTURE buffer with
> subsequent requests.

Yes, that's a good recap, we need a way. Just a clarification, we need
a way for formats similar to H264/H265 for which the frame boundary is
often only discovered by parsing the following NAL or signalled through
a container.

> 
> I can think of 2 ways this can be expressed:
> 1) We keep the current m2m behavior as the default (a CAPTURE buffer
> is produced), and add a flag to ask the driver to change that behavior
> and hold on the CAPTURE buffer and reuse it with the next request(s) ;
> 2) We specify that no CAPTURE buffer is produced by default, unless a
> flag asking so is specified.

I don't think 1) is really a valid option. A buffer has one state. In
current implementation of Cedrus, when 1 unit is decoded (1 slice) the
capture buffer is marked as DONE. That signals any userspace polling
for capture buffer being ready to DQ. Now, if you drive the OUTPUT and
CAPTURE queue from separate thread, you end up with a race where
userspace thinks the buffer is ready but a new slice comes in, so the
state has been cleared between the poll returning and the call to DQ
buf. User-space will unexpectedly endup doing a blocking DQBuf which is
likely unwanted. Then if we leave is in DONE state, it's much worst,
since there is no way to signal that the buffer is ready (the decoding
the unit has completed).

As this API does not exist yet, introducing 2) is possible and is much
saner to handle from userspace. The benefit  is that you have no
special case. The driver just hold on marking the buffer DONE until it
has processed all unit up to one that had a frame completion flag on
it.

> 
> The flag could be specified in one of two ways:
> a) As a new v4l2_buffer.flag for the OUTPUT buffer ;
> b) As a dedicated control, either format-specific or more common to all codecs.
> 
> I tend to favor 2) and b) for this, for the reason that with H.264 at
> least, user-space does not know whether a slice is the last slice of a
> frame until it starts parsing the next one, and we don't know when we
> will receive it. If we use a control to ask that a CAPTURE buffer be
> produced, we can always submit another request with only that control
> set once it is clear that the frame is complete (and not delay
> decoding meanwhile). In practice I am not that familiar with
> latency-sensitive streaming ; maybe a smart streamer would just append
> an AUD NAL unit at the end of every frame and we can thus submit the
> flag it with the last slice without further delay?

AUD NAL, when present, are the first NAL of a frame, so latency wise it
is useless. So what we do is that we rely on the encoder to tell us. So
encoders will set a flag to signal the last slice of a frame. If you
are doing RTP, this flags is converted into a marker bit (RTP
specific). This marker bit is then received on the other side and
passed to the decoder. The decoder will process the slice and when this
is done will immediately deliver the resulting frame (if reordering
allow). If it's not present, it will wait for the next slice in order
to determine if the decoded frame can be delivered or not. So without
the marker, we effectively have 1 extra frame latency in the worst
case.

What I like of the b) proposal is that we can invert the logic and
effectively abstract this completely for formats that don't have slices
(or equivalent) while having this implemented generically.

What I had in mind was a) because I was thinking that we could reuse
the flag for stateful encoder/decoder in order to support the RTP
marker bit usecase and slice level streaming. Right now, we only do
full frame streaming, but it's limiting. the ZyncMP firmware that
Micheal is integrating does support low latency with slice processing,
so to match the vendor driver capacity we'll need that flag anyway.

But in stateless, it's easier, because not setting it at all simply
introduce more latency, while for accelerators we would like to make
the closing of a frame mandatory. So I'm totally fine with a different
mechanism. Again, this is handled in VAAPI and other similar API by
having begin/end function for frames, and then a number of
decode_slice() calls in the middle. So there is an extra context for
frames on top of slices in these API.

> 
> An extra constraint to enforce would be that each decoding unit
> belonging to the same frame must be submitted with the same timestamp,
> otherwise the request submission would fail. We really need a
> framework to enforce all this at a higher level than individual
> drivers, once we reach an agreement I will start working on this.

I agree with that. And adding checks for this would be really welcome
to catch errors.

> 
> Formats that do not support multiple decoding units per frame would
> reject any request that does not carry the end-of-frame information.

Again, we *could* also reverse the logic, so that by default all OUTPUT
buffer would be considered complete frames. So far I only know 3
formats that have this feature, H264, H265 and AV1. I'm not sure for
VP9, I would need to look. But clearly JPEG, VP8, H263, raw format and
more don't seem to have this. We could also have a generic control/flag
and make it mandatory for specific formats if that is simpler.

> 
> Anything missing / any further comment?
Nicolas Dufresne April 17, 2019, 4:09 p.m. UTC | #15
Le mercredi 17 avril 2019 à 14:39 +0900, Alexandre Courbot a écrit :
> Hi Paul,
> 
> On Tue, Apr 16, 2019 at 4:55 PM Paul Kocialkowski
> <paul.kocialkowski@bootlin.com> wrote:
> > Hi,
> > 
> > Le mardi 16 avril 2019 à 16:22 +0900, Alexandre Courbot a écrit :
> > 
> > [...]
> > 
> > > Thanks for this great discussion. Let me try to summarize the status
> > > of this thread + the IRC discussion and add my own thoughts:
> > > 
> > > Proper support for multiple decoding units (e.g. H.264 slices) per
> > > frame should not be an afterthought ; compliance to encoded formats
> > > depend on it, and the benefit of lower latency is a significant
> > > consideration for vendors.
> > > 
> > > m2m, which we use for all stateless codecs, has a strong assumption
> > > that one OUTPUT buffer consumed results in one CAPTURE buffer being
> > > produced. This assumption can however be overruled: at least the venus
> > > driver does it to implement the stateful specification.
> > > 
> > > So we need a way to specify frame boundaries when submitting encoded
> > > content to the driver. One request should contain a single OUTPUT
> > > buffer, containing a single decoding unit, but we need a way to
> > > specify whether the driver should directly produce a CAPTURE buffer
> > > from this request, or keep using the same CAPTURE buffer with
> > > subsequent requests.
> > > 
> > > I can think of 2 ways this can be expressed:
> > > 1) We keep the current m2m behavior as the default (a CAPTURE buffer
> > > is produced), and add a flag to ask the driver to change that behavior
> > > and hold on the CAPTURE buffer and reuse it with the next request(s) ;
> > 
> > That would kind of break the stateless idea. I think we need requests
> > to be fully independent of eachother and have some entity that
> > coordinates requests for this kind of things.
> 
> Side note: the idea that stateless decoders are entirely stateless is
> not completely accurate anyway. When we specify a resolution on the
> OUTPUT queue, we already store some state. What matters IIUC is that
> the *hardware* behaves in a stateless manner. I don't think we should
> refrain from storing some internal driver state if it makes sense.
> 
> Back to the topic: the effect of this flag would just be that the
> first buffer is the CAPTURE queue is not removed, i.e. the next
> request will work on the same buffer. It doesn't really preserve any
> state - if the next request is the beginning of a different frame,
> then the previous work will be discarded and the driver will behave as
> it should, not considering any previous state.
> 
> > > 2) We specify that no CAPTURE buffer is produced by default, unless a
> > > flag asking so is specified.
> > > 
> > > The flag could be specified in one of two ways:
> > > a) As a new v4l2_buffer.flag for the OUTPUT buffer ;
> > > b) As a dedicated control, either format-specific or more common to all codecs.
> > 
> > I think we must aim for a generic solution that would be at least
> > common to all codecs, and if possible common to requests regardless of
> > whether they concern video decoding or not.
> > 
> > I really like the idea of introducing a requests batch/group/queue,
> > which groups requests together and allows marking them done when the
> > whole group is done being decoded. For that, we explicitly mark one of
> > the requests as the final one, so that we can continue adding requests
> > to the batch even when it's already being processed. When all the
> > requests are done being decoded, we can mark them done.
> 
> I'd need to see this idea more developed (with maybe an example of the
> sequence of IOCTLs) to form an opinion about it. Also would need to be
> given a few examples of where this could be used outside of stateless
> codecs. Then we will have to address what this means for requests:
> your argument against using a "release CAPTURE buffer" flag was that
> requests won't be fully independent from each other anymore, but I
> don't see that situation changing with batches. Then, does the end of
> a batch only means that a CAPTURE buffer should be released, or are
> other actions required for non-codec use-cases? There are lots and
> lots of questions like this one lurking.
> 
> > With that, we also need some tweaking in the core to look for an
> > available capture buffer that matches the output buffer's timestamp
> > before trying to dequeue the next available capture buffer
> 
> I don't think that would be strictly necessary, unless we want to be
> able to decode slices from different frames before the first one is
> completed?
> 
> > This way,
> > the first request of the batch will get any queued capture buffer, but
> > subsequent requests will find the matchung capture buffer by timestamp.
> > 
> > I think that's basically all we need to handle that and the two aspects
> > (picking by timestamp and requests groups) are rather independent and
> > the latter could probably be used in other situations than video
> > decoding.
> > 
> > What do you think?
> 
> At the current point I'd like to avoid over-engineering things.
> Introducing a request batch mechanism would mean more months spent
> before we can set the stateless codec API in stone, and at some point
> we need to settle and release something that people can use. We don't
> even have clear idea of what batches would look like and in which
> cases they would be used. The idea of an extra flag is simple and
> AFAICT would do the job nicely, so why not proceed with this for the
> time being?

I also share this feeling that this might be a bit over-engineered for
what we want to solve. But I also don't fully understand Paul's
proposal.

> 
> Cheers,
> Alex.
Nicolas Dufresne April 17, 2019, 4:17 p.m. UTC | #16
Le mardi 16 avril 2019 à 09:37 +0200, Paul Kocialkowski a écrit :
> Hi,
> 
> Le lundi 15 avril 2019 à 11:30 -0400, Nicolas Dufresne a écrit :
> > Le lundi 15 avril 2019 à 15:26 +0200, Paul Kocialkowski a écrit :
> > > Hi,
> > > 
> > > On Mon, 2019-04-15 at 08:24 -0400, Nicolas Dufresne wrote:
> > > > Le lundi 15 avril 2019 à 09:58 +0200, Paul Kocialkowski a écrit :
> > > > > Hi,
> > > > > 
> > > > > On Sun, 2019-04-14 at 18:38 -0400, Nicolas Dufresne wrote:
> > > > > > Le dimanche 14 avril 2019 à 18:41 +0200, Paul Kocialkowski a écrit :
> > > > > > > Hi,
> > > > > > > 
> > > > > > > Le vendredi 12 avril 2019 à 16:47 -0400, Nicolas Dufresne a écrit :
> > > > > > > > Le mercredi 06 mars 2019 à 17:00 +0900, Alexandre Courbot a écrit :
> > > > > > > > > Documents the protocol that user-space should follow when
> > > > > > > > > communicating with stateless video decoders.
> > > > > > > > > 
> > > > > > > > > The stateless video decoding API makes use of the new request and tags
> > > > > > > > > APIs. While it has been implemented with the Cedrus driver so far, it
> > > > > > > > > should probably still be considered staging for a short while.
> > > > > > > 
> > > > > > > [...]
> > > > > > > 
> > > > > > > > From an IRC discussion with Paul and some more digging, I have found a
> > > > > > > > design problem in the decoding process.
> > > > > > > > 
> > > > > > > > In H264 and HEVC you can have multiple decoding unit per frames
> > > > > > > > (slices). This type of encoding is increasingly popular, specially for
> > > > > > > > low latency streaming use cases. The wording of this spec does allow
> > > > > > > > for the notion of decoding unit, and in practice it has been proven to
> > > > > > > > work through some RFC FFMPEG patches and the Cedrus driver. But
> > > > > > > > something important to know is that the FFMPEG RFC implements decoding
> > > > > > > > in lock steps. Which means:
> > > > > > > > 
> > > > > > > >   1. It queues a single free capture buffer
> > > > > > > >   2. It queues an output buffer, set controls, queue the request
> > > > > > > >   3. It waits for a capture buffer to reach state done
> > > > > > > >   4. It dequeues that capture buffer, and queue it back again
> > > > > > > >   5. And then it runs step 2,4,3 again with following slices, until we 
> > > > > > > >      have a complete frame. After what, it restart at step 1
> > > > > > > > 
> > > > > > > > So the implementation makes no use of the queues. There is no batch
> > > > > > > > processing, so we might not be able to reach the maximum hardware
> > > > > > > > throughput.
> > > > > > > > 
> > > > > > > > So the optimal method would look like the following, but there comes
> > > > > > > > the design issue.
> > > > > > > > 
> > > > > > > >   1. Queue a single free capture buffer
> > > > > > > >   2. Queue output buffer for slice 1, set controls, queue the request
> > > > > > > >   3. Queue output buffer for slice 2, set controls, queue the request
> > > > > > > >   4. Wait for completion
> > > > > > > > 
> > > > > > > > The problem is in step 4. Completion means that the capture buffer done
> > > > > > > > decoding a single unit. So assuming the driver supports matching the
> > > > > > > > timestamp against the queued buffer, instead of waiting for a new
> > > > > > > > buffer, the driver would have to mark twice the same buffer to done
> > > > > > > > state, which is just not working to inform userspace that all slices
> > > > > > > > are decoded into the one capture buffer they share.
> > > > > > > 
> > > > > > > Interestingly, I'm experiencing the exact same problem dealing with a
> > > > > > > 2D graphics blitter that has limited ouput scaling abilities which
> > > > > > > imply handlnig a large scaling operation as multiple clipped smaller
> > > > > > > scaling operations. The issue is basically that multiple jobs have to
> > > > > > > be submitted to complete a single frame and relying on an indication
> > > > > > > from the destination buffer (such as a fence) doesn't work to indicate
> > > > > > > that all the operations were completed, since we get the indication at
> > > > > > > each step instead of at the end of the batch.
> > > > > > > 
> > > > > > > One idea I see to solve this is to have a notion of batch in the driver
> > > > > > > (for our situation, that would be in v4l2) and provide means to get a
> > > > > > > done indication for that entity.
> > > > > > > 
> > > > > > > I think we could extend the request API to allow this. We already
> > > > > > > represent requests as individual file descriptors, we could totally
> > > > > > > group requests in batches and get a sync fd for the batch to poll on
> > > > > > > when we need to return the frames. It would be good if we could expose
> > > > > > > this in a way that makes it work with DRM as an in fence for display.
> > > > > > > Then we can pretty much schedule our flip + decoding together (which is
> > > > > > > quite nice to have when we're running late on the decoding side).
> > > > > > > 
> > > > > > > What do you think?
> > > > > > > 
> > > > > > > It feels to me like the request API was designed to open up the way for
> > > > > > > these kinds of improvements, so I'm sure we can find an agreeable
> > > > > > > solution that extends the API.
> > > > > > > 
> > > > > > > > To me, multi slice encoded stream are just too common, and they will
> > > > > > > > also exist for AV1. So we really need a solution to this that does not
> > > > > > > > require operating in lock steps. Specially that some HW can decode
> > > > > > > > multiple slices in parallel (multi core), we would not want to prevent
> > > > > > > > that HW from being used efficiently. On top of this, we need a solution
> > > > > > > > so that we can also keep queuing slice of the following frames if they
> > > > > > > > arrive before decoding is done.
> > > > > > > 
> > > > > > > Agreed.
> > > > > > > 
> > > > > > > > I don't have a solution yet myself, but it would be nice to come up
> > > > > > > > with something before we freeze this API.
> > > > > > > 
> > > > > > > I think it's rather independent from the codec used and this is
> > > > > > > something that should be handled at the request API level. 
> > > > > > > 
> > > > > > > I'm not sure we can always expect the hardware to be able to operate on
> > > > > > > a per-slice basis. I think it would be useful to reflect this in the
> > > > > > > pixel format, so that we also have a possibility for a gathered slice
> > > > > > > buffer (as the spec currently mentions for mpeg-2) for legacy decoder
> > > > > > > hardware that will need to decode one frame in one go from a contiguous
> > > > > > > buffer with all the slice data appended.
> > > > > > > 
> > > > > > > This updates my pixel format proposition from IRC to the following:
> > > > > > > - V4L2_PIXFMT_H264_SLICE_APPENDED: single buffer for all the slices
> > > > > > > (appended buffer), slice params as v4l2 control (legacy);
> > > > > > 
> > > > > > SLICE_RAW_APPENDED ? Or FRAMED_SLICES_RAW ? You would need a new
> > > > > > control for the NAL index, as there is no way to figure-out otherwise.
> > > > > > I would not add this format unless a specific HW need it.
> > > > > 
> > > > > I don't really like using "raw" as a distinguisher: I don't think it's
> > > > > explicit enough. The idea here is to reflect that there is only one
> > > > > slice exposed, which is the appended result of all the frame slices
> > > > > with a single v4l2 control.
> > > > 
> > > > RAW in this context was suggested to reflect the fact there is no
> > > > header, no slice header and that emulation prevention bytes has been
> > > > removed and replaces by the real values.
> > > 
> > > That could also be understood as "slice params coded raw", which is the
> > > opposite of what it describes, hence my reluctance.
> > > 
> > > > Just SLICE alone was much worst.
> > > 
> > > Keep in mind that we already have a MPEG2_SLICE format in the public
> > > API. We should probably decide what it should become based on the
> > > outcome of this discussion.
> > > 
> > > >  There is to many properties to this type of H264 buffer to
> > > > encode everything into the name, so what will really matter in the end
> > > > if the documentation. Feel free to propose a better name.
> > > 
> > > Agreed, it's a side point. I always find it hard to find naming good,
> > > as well as finding good naming (my suggestions aren't really top-notch
> > > either).
> > > 
> > > Here is another proposition:
> > > - SLICE_PARSED
> > > - SLICE_ANNEX_B
> > > - SLICE_PARSED_ANNEX_B
> > 
> > Ok, we'll keep working on that then, naming is hard. I guess by PARSED
> > you meant that the slice headers are passed as controls, and that
> > indeed make sense.
> 
> Yep, that's right.
> 
> > But I really thought all stateless decoder would required that. A
> > hard bet obviously.
> 
> Well, I think that's a debate on its own. A strict interpretation of
> stateless could be that the decoder does not internally keep track of
> the reference frames and any frame can be decoded at any time (at the
> condition that the reference frames data is around). This doesn't have
> to be correlated with whether the decoder will take the slice header in
> raw or parsed format, after all.
> 
> Note that in cedrus, our decoder still has some state, but we get to
> decide where that state is stored, which makes the whole thing
> stateless  since we can bring the state we need dynamically without
> involving the hardware.

In general, we say stateless from a HW point of view. It simply means
that the HW (the accelerator) can be multiplexed to process several
independent streams. While with stateful firmware, you generally can't
save the state, and ends up with a specific number of concurrent stream
(scheduling happens in the firmware). There is exception to that of
course, the newest Amlogic/Meson video decoder allow for saving the
decoder state. The registers are undocumented, since they are filled by
the HW parser, but it's separated in a way that we could multiplex.

Of course we do have a state in our drivers. Each time you open an m2m
device, you create an new instance which will keep track of done jobs,
pending jobs, active format, allocated memory, etc.

> 
> > > > > > > - V4L2_PIXFMT_H264_SLICE: one buffer/slice, slice params as control;
> > > > > > > 
> > > > > > > - V4L2_PIXFMT_H264_SLICE_ANNEX_B: one buffer/slice in annex-b format, 
> > > > > > > slice params encoded in the buffer;
> > > > > > 
> > > > > > We are still working on this one, this format will be used by Rockchip
> > > > > > driver for sure, but this needs clarification and maybe a rename if
> > > > > > it's not just one slice per buffer.
> > > > > 
> > > > > I thought the decoder also needed the parse slice data? At least IIRC
> > > > > for Tegra, we need Annex-B format and a parsed slice header (so the
> > > > > next one).
> > > > 
> > > > Yes, in every cases, the HW will parse the slice data.
> > > 
> > > Ah sorry, I meant "need the parsed slice data" (missed the d), as in,
> > > some decoders will need annex-b format but won't parse the slice header
> > > on their own, so they also need the parsed slice header control.
> > > Don't ask why...
> > > 
> > > In my proposition from above, that would be: SLICE_PARSED_ANNEX_B.
> > > 
> > > >  It's possible
> > > > that Tegra have a matching format as Rockchip, someone would need to do
> > > > a proper integration to verify. But the driver does not need the
> > > > following one, that is specific to ANNEX-B parsing.
> > > > 
> > > > > > > - V4L2_PIXFMT_H264_SLICE_MIXED: one buffer/slice in annex-b format,
> > > > > > > slice params encoded in the buffer and in slice params control;
> > > > > > > 
> > > > > > > Also, we need to make sure to have a per-slice bit offset to the
> > > > > > > encoded data in the slice params control so that the same slice buffer
> > > > > > > can be used with any of the _SLICE formats (e.g. ffmpeg would only have
> > > > > > > an annex-b slice and use any of the formats with it).
> > > > > > 
> > > > > > Ah, I we are saying the same.
> > > > > > 
> > > > > > > For the legacy format, we need to specify that the appended slices
> > > > > > > don't repeat the annex-b start code and NAL header.
> > > > > > 
> > > > > > I'm not sure this one make sense. the NAL header for each slices of one
> > > > > > frames are not always identical.
> > > > > 
> > > > > Yes but that's pretty much the point of the legacy format: to only
> > > > > expose a single slice buffer and slice header (even in cases where the
> > > > > bitstream codes them in multiple distinct ones).
> > > > > 
> > > > > We can't expect this to work in every case, that's why it's a legacy
> > > > > format. It seems to work pretty well for cedrus so far.
> > > > 
> > > > I'm not sure I follow you, what Cedrus does should be changed to
> > > > whatever we decide as a final API, we should not maintain two formats.
> > > 
> > > That point has me hesitating. It depends on whether we can expect to
> > > see hardware implementations with no support whatsoever for multi-slice 
> > > per frame and just expect an aggregated buffer of slice compressed
> > > data. This is one operation mode that the Allwinner VPU supports.
> > > 
> > > The point is not to use it in Cedrus since our VPU can operate per-
> > > slice, but to allow supporting hardware decoders that can't do that in
> > > the future.
> > > 
> > > I'm not sure it's healthy to make it a hard requirement for H.264
> > > decoding to operate per-slice. Does that seem too far-fetched from your
> > > perspective? I seem to recall from a discussion that some legacy
> > > hardware only handles single-slices frames, but I may be wrong.
> > > 
> > > > Also, what works for Cedrus is that a each buffers must have a single
> > > > slice regardless how many slices per frame. And this is what I expect
> > > > from most stateless HW.
> > > 
> > > Currently, we append all the slices into one buffer and decode it in
> > > one go with a slightly hacked slice params to reflect that. But of
> > > course, we should be operating per-slice.
> > > 
> > > >  This is how it works in VAAPI and VDPAU as an
> > > > example. Just for the reference, the API in VAAPI is (pseudo code, I
> > > > can't remember the exact name):
> > > > 
> > > >    - beginPicture()
> > > >    - decodeSlice() *
> > > >    - endPicture()
> > > > 
> > > > So the accelerator is told explicitly when a frame start/end, but also
> > > > it's told explicitly in which buffer to decode the frame to.
> > > 
> > > Yes definitely. We're also given all the parsed bitstream elements in
> > > the right order so that we could already start queuing requests when
> > > each slice is passed, and just wait for completion at endPicture.
> > > 
> > > > > We could also decide to ditch the legacy idea altogether and only
> > > > > specify formats that operate on a per-slice basis, but I'm afraid we'll
> > > > > find decoders that can only take a single slice per buffer.
> > > > 
> > > > It's impossible for a compliant decoder to only support 1 slice per
> > > > frame, so I don't follow you on this one. Also, I don't understand what
> > > > difference you see between per-slice basis and single slice per buffer.
> > > 
> > > Okay that's exactly what I wanted to know: whether it makes any sense
> > > to build a decoder that only operates per-frame and not per-slice.
> > > If you are confident we won't see that in the wild, we can make it an
> > > API requirement to operate per-slice.
> > 
> > There is probably a small distinction to make between supporting
> > multiple slices per frame and operating per slice. It's nice to know
> > that Cedrus support both. As we discussed today on IRC, if we introduce
> > a flag that tells the driver when the last slice of a frame is passed,
> > it would be relatively simple for the driver to do decide what to do.
> > Of course if the HW have a limitation of one allocation, it might not
> > be fully optimal as it would have to copy.
> > 
> > But as this is stateless decoder, I'm more inclined in introducing a
> > format that means just that, leaving it to userspace to do that right
> > packing.
> > 
> > > > > When decoding a multi-slice frame in that setup, I think we'll be
> > > > > better off with an appended buffer containing all the slices for the
> > > > > frame instead of passing only a the first slice.
> > > > 
> > > > Appended slices requires extra controls, but also introduce a lot more
> > > > decoding latency. As soon as we add the missing frame boundary
> > > > signalling, it should be really trivial for a driver to wait until it
> > > > received all slices before starting the decoding if that is a HW
> > > > requirement.
> > > 
> > > Well, I don't really like the idea of the driver being aware of any of
> > > that (IMO the logic should be in the media core, not the driver).
> > > 
> > > If a driver can't do multiple slices, it shouldn't be up to the driver
> > > to gather them together. But anyway, if you think we won't ever see
> > > this kind of hardware, we can just drop the whole idea.
> > 
> > A compliant HW will support multiple slices per frame, that's not
> > really optional. But it may require all slices to be packed in a single
> > allocation, in which case it could copy, or we can just have a
> > dedicated format for this behaviour.
> 
> I was thinking of allowing this by default with bit offsets to the
> slice. The main issue I see here will be trying to add new slice data
> to a buffer that was already queued (and might already be undergoing
> decoding), when submitting a new request to the batch (that may already
> have started decoding). We'd need a notion of "buffer partitions" or
> so.
> 
> > > > > > > What do you think?
> > > > > > > 
> > > > > > > >  By the way, if we could queue
> > > > > > > > twice the same buffer, that would in principal work, but internally
> > > > > > > > there is only one state per buffer. If you do external allocation, then
> > > > > > > > in theory you could workaround that, but then it's ugly, because you'll
> > > > > > > > have two buffers with the same timestamp.
> > > > > > > 
> > > > > > > One advantage of the request API is that buffers are actually queued
> > > > > > > when the request is processed, so this might not be too problematic.
> > > > > > > 
> > > > > > > I think what we need boils down to:
> > > > > > > - Being able to queue the same output buffer to multiple requests,
> > > > > > > which the request API should already allow;
> > > > > > > - Being able to grab the right capture buffer based on the output
> > > > > > > timestamp so that the different requests for the slices are rendered to
> > > > > > > the same destination buffer.
> > > > > > > 
> > > > > > > For the second point, I don't really have a clear idea of whether we
> > > > > > > can already expect v4l2 to allow picking a buffer that was marked done
> > > > > > > but was not de-queued by userspace yet. It might already be allowed and
> > > > > > > we could just implement something to lookup the buffer to grab by
> > > > > > > timestamp.
> > > > > > 
> > > > > > An entirely difference solution that came to my mind in the last few
> > > > > > days would be to add a new buffer flag that would mean END_OF_FRAME (or
> > > > > > reused the generic LAST flag). This flag would be passed on the last
> > > > > > slice (if it is known that we are handling the last one) or in an empty
> > > > > > buffer if it is found through parsing the next following NAL. This is
> > > > > > inspired by the optional OMX END_OF_FRAME flag and RTP marker bit.
> > > > > > Though, if we make this flag mandatory, the driver could avoid marking
> > > > > > the frame done until all slices has been decoded. The cons is that
> > > > > > userpace is not informed when a specific slice is done decoding. This
> > > > > > is quite niche, but you can use this information along with the list of
> > > > > > macroblocks from the slice header so signal which portion of the image
> > > > > > is now ready for an hypothetical video processing. The pros is that
> > > > > > this solution can be per format, so this would not be needed for VP8 as
> > > > > > an example.
> > > > > 
> > > > > Mhh, I don't really like the idea of setting an explicit order when
> > > > > there is really none. I guess the slices for a given frame can be
> > > > > decoded in whatever order, so I would like it better if we could just
> > > > > submit the batch of requests and be told when the batch is done,
> > > > > instead of specifying an explicit order and waiting for the last buffer
> > > > > to be marked done.
> > > > > 
> > > > > And I think this batch idea could apply to other things than video
> > > > > decoding, so it feels good to have it as the highest level we can in
> > > > > media/v4l2.
> > > > 
> > > > I haven't said anything about order. I believe you can decode slice
> > > > out-of-order in H264 but it is likely not true for all formats. You are
> > > > again missing the point of decoding latency.
> > > 
> > > Well, having an END_OF_FRAME flag on one of the slices pretty much
> > > implicitly defines an order (at least regarding this slice vs the
> > > others).
> > 
> > No, the flag simply means that any following request will be on another
> > frame. It's more like "closing" the decoded frame. I believe you have a
> > good understanding of this proposal now after our IRC discussion.
> 
> Yes and I think what you are suggesting makes good sense.
> For the record, we certainly need to take care that the end frame *and
> the previous ones* are finished before marking the buffers done, so
> that we can handle parallelized pipelines that may finish decoding in a
> different order than request submission order.
> 
> > > > In live stream, the slices are transmitted over some serial link. If
> > > > you wait until you have all slice before you start decoding, you delay
> > > > further the moment the frame will be ready.
> > > 
> > > So that means we need some ability to add requests to a batch while the
> > > batch is being handled. Seems a bit exotic but definitely legit, and it
> > > can probably be done. Userspace would know when it has submitted all
> > > the slices and move on to displaying the frame.
> > > 
> > > >  A lot of vendors make use
> > > > of this to reduce latency, and libWebRTC also makes use of this. So
> > > > being able to pass slices as part of a specific frame is rather
> > > > important. Otherwise vendor will keep doing their own stuff as the
> > > > Linux kernel API won't allow reaching their customers expectation.
> > > 
> > > I fully agree we need to prepare for all these low-latency
> > > improvements. My goal is definitely to have something that can beat
> > > vendor-specific implementations in upstream, not just a proof of
> > > concept for half-baked decoding.
> > 
> > Great.
> > 
> > > > The batching capabilities should be used for the case the multiple
> > > > slices of a frame (or multiple slices of many frame is supported by the
> > > > HW) have been queued before the previous batch had completed.
> > > > 
> > > > > > A third approach could be to use the encoded buffer state to track the
> > > > > > progress decoding that slice. Many driver will mark the buffer done as
> > > > > > soon as it is transferred to the accelerator, it does not always match
> > > > > > the moment that slice has been decoded. But has use said, we would need
> > > > > > to study if it make sense to let a driver pick by timestamp a buffer
> > > > > > that might already have reached done state. Other cons, is that polling
> > > > > > for buffer states on the capture queue won't mean anything anymore. But
> > > > > > combined with the FLAG, it would fix the cons of the FLAG solution.
> > > > > 
> > > > > Well, I think we should keep the done operation as-is, only give it a
> > > > > different interpretation depending on whether the request is handled
> > > > > individually or as part of a batch. I really think we shouldn't rely on
> > > > > any buffer-level indication for completion when handling a batch, but
> > > > > rather have something about the batch entity itself.
> > > > 
> > > > But then there is no way to know when the frame is decoded anymore,
> > > > because as soon as the first slice is decoded, the capture is done and
> > > > stay done. So what's your idea here, how to do you know your decoding
> > > > is complete if you haven't dequeue/queue the frame in between slices ?
> > > 
> > > Yes, I was thinking of exposing a sync object as a fd that can be
> > > polled on, which would be associated with the request batch and
> > > signalled when completed. That's pretty much a standalone fence that's
> > > not backed by a buffer.
> > > 
> > > I would like that to be importable as a DRM input fence (if possible at
> > > all), so we can schedule decoding and schedule a page flip for the
> > > capture buffer at the same time. Then the capture buffer can be
> > > displayed as soon as the decoding batch is done.
> > 
> > We'll have to bring that one is a specific topic. Right now there is
> > many peaces missing on DRM side as you already know. Also, fences would
> > be delivered out-of-order (decoding order), we'd need strong doc to
> > help user-space figure-out how to use this.
> 
> I think DRM is quite relevant nowadays when it comes to sync (but
> that's something missing from V4L2 as of now). Yes a big part of the
> logic is left to userspace since as you mention, decoding order !=
> display order.
> 
> That means that fences are only relevant when we are sure that the
> decoded frame is to be displayed next and that we have already missed
> the target vblank where its flip should have been scheduled. As far as
> I can see, the fence can reduce the latency by one frame by scheduling
> the flip at the current vblank target still (instead of the next one
> where the frame is decoded), and letting it take effect "as soon as
> possible" (which might be before or after the next vblank).
> 
> Cheers,
> 
> Paul
> 
> > > > > > > > An argument that was made early was that we don't need to support this
> > > > > > > > right away because userspace can combine all the slices into one
> > > > > > > > buffer. But for H264_SLICE_RAW format it's inconvenient, you'd need an
> > > > > > > > extra control to tell the driver the offset to each slices, because the
> > > > > > > > raw H264 does not have enough information to be parsed. RAW slice are
> > > > > > > > also I believe de-emulated, which means the code use to prevent having
> > > > > > > > pattern looking like a start code has been removed, so you cannot just
> > > > > > > > prepend start codes. De-emulation seems better placed in userspace if
> > > > > > > > the HW does not take care.
> > > > > > > 
> > > > > > > Mhh I'd like to avoid having having to specify the offset to each slice
> > > > > > > for the legacy case. Just appending the encoded data (excluding slice
> > > > > > > header and start code) works for cedrus and I think it makes sense more
> > > > > > > generally. The idea is to only expose a single slice params and act as
> > > > > > > if it was just one big slice buffer.
> > > > > > > 
> > > > > > > Come to think of it, maybe we need annex-b and mixed fashions of that
> > > > > > > legacy pixfmt too...
> > > > > > > 
> > > > > > > > I also very dislike the idea that we would enforce merging all slice
> > > > > > > > into the same buffer. The entire purpose of slices and the reason they
> > > > > > > > are used in practice is that you can start decoding slices before you
> > > > > > > > have all slices of a frame. This reduce drastically the latency for
> > > > > > > > streaming use cases, like video conferencing. So forcing the merging of
> > > > > > > > slices is basically like pretending slices have no benefits.
> > > > > > > 
> > > > > > > Of course, we don't want things to stay like this and this rework is
> > > > > > > definitely needed to get serious performance and latency going.
> > > > > > > 
> > > > > > > One thing you should also be aware of: we're currently using a
> > > > > > > workqueue between the job done irq and scheduling the next frame (in
> > > > > > > v4l2 m2m).
> > > > > > > 
> > > > > > > Maybe we could manage to fit that into an atomic path to schedule the
> > > > > > > next request in the previous job done irq context.
> > > > > > > 
> > > > > > > > I have just exposed the problem I see for now, to see what comes up.
> > > > > > > > But I hope we be able to propose solution too in the short term (in no
> > > > > > > > one beats me at it).
> > > > > > > 
> > > > > > > Seems that we have good grounds for a discussion!
> > > > > > > 
> > > > > > > Cheers,
> > > > > > > 
> > > > > > > Paul
> > > > > > > 
> > > > > > > > > +
> > > > > > > > > +A typical frame would thus be decoded using the following sequence:
> > > > > > > > > +
> > > > > > > > > +1. Queue an ``OUTPUT`` buffer containing one unit of encoded bitstream data for
> > > > > > > > > +   the decoding request, using :c:func:`VIDIOC_QBUF`.
> > > > > > > > > +
> > > > > > > > > +    * **Required fields:**
> > > > > > > > > +
> > > > > > > > > +      ``index``
> > > > > > > > > +          index of the buffer being queued.
> > > > > > > > > +
> > > > > > > > > +      ``type``
> > > > > > > > > +          type of the buffer.
> > > > > > > > > +
> > > > > > > > > +      ``bytesused``
> > > > > > > > > +          number of bytes taken by the encoded data frame in the buffer.
> > > > > > > > > +
> > > > > > > > > +      ``flags``
> > > > > > > > > +          the ``V4L2_BUF_FLAG_REQUEST_FD`` flag must be set.
> > > > > > > > > +
> > > > > > > > > +      ``request_fd``
> > > > > > > > > +          must be set to the file descriptor of the decoding request.
> > > > > > > > > +
> > > > > > > > > +      ``timestamp``
> > > > > > > > > +          must be set to a unique value per frame. This value will be propagated
> > > > > > > > > +          into the decoded frame's buffer and can also be used to use this frame
> > > > > > > > > +          as the reference of another.
> > > > > > > > > +
> > > > > > > > > +2. Set the codec-specific controls for the decoding request, using
> > > > > > > > > +   :c:func:`VIDIOC_S_EXT_CTRLS`.
> > > > > > > > > +
> > > > > > > > > +    * **Required fields:**
> > > > > > > > > +
> > > > > > > > > +      ``which``
> > > > > > > > > +          must be ``V4L2_CTRL_WHICH_REQUEST_VAL``.
> > > > > > > > > +
> > > > > > > > > +      ``request_fd``
> > > > > > > > > +          must be set to the file descriptor of the decoding request.
> > > > > > > > > +
> > > > > > > > > +      other fields
> > > > > > > > > +          other fields are set as usual when setting controls. The ``controls``
> > > > > > > > > +          array must contain all the codec-specific controls required to decode
> > > > > > > > > +          a frame.
> > > > > > > > > +
> > > > > > > > > +   .. note::
> > > > > > > > > +
> > > > > > > > > +      It is possible to specify the controls in different invocations of
> > > > > > > > > +      :c:func:`VIDIOC_S_EXT_CTRLS`, or to overwrite a previously set control, as
> > > > > > > > > +      long as ``request_fd`` and ``which`` are properly set. The controls state
> > > > > > > > > +      at the moment of request submission is the one that will be considered.
> > > > > > > > > +
> > > > > > > > > +   .. note::
> > > > > > > > > +
> > > > > > > > > +      The order in which steps 1 and 2 take place is interchangeable.
> > > > > > > > > +
> > > > > > > > > +3. Submit the request by invoking :c:func:`MEDIA_REQUEST_IOC_QUEUE` on the
> > > > > > > > > +   request FD.
> > > > > > > > > +
> > > > > > > > > +    If the request is submitted without an ``OUTPUT`` buffer, or if some of the
> > > > > > > > > +    required controls are missing from the request, then
> > > > > > > > > +    :c:func:`MEDIA_REQUEST_IOC_QUEUE` will return ``-ENOENT``. If more than one
> > > > > > > > > +    ``OUTPUT`` buffer is queued, then it will return ``-EINVAL``.
> > > > > > > > > +    :c:func:`MEDIA_REQUEST_IOC_QUEUE` returning non-zero means that no
> > > > > > > > > +    ``CAPTURE`` buffer will be produced for this request.
> > > > > > > > > +
> > > > > > > > > +``CAPTURE`` buffers must not be part of the request, and are queued
> > > > > > > > > +independently. They are returned in decode order (i.e. the same order as coded
> > > > > > > > > +frames were submitted to the ``OUTPUT`` queue).
> > > > > > > > > +
> > > > > > > > > +Runtime decoding errors are signaled by the dequeued ``CAPTURE`` buffers
> > > > > > > > > +carrying the ``V4L2_BUF_FLAG_ERROR`` flag. If a decoded reference frame has an
> > > > > > > > > +error, then all following decoded frames that refer to it also have the
> > > > > > > > > +``V4L2_BUF_FLAG_ERROR`` flag set, although the decoder will still try to
> > > > > > > > > +produce a (likely corrupted) frame.
> > > > > > > > > +
> > > > > > > > > +Buffer management while decoding
> > > > > > > > > +================================
> > > > > > > > > +Contrary to stateful decoders, a stateless decoder does not perform any kind of
> > > > > > > > > +buffer management: it only guarantees that dequeued ``CAPTURE`` buffers can be
> > > > > > > > > +used by the client for as long as they are not queued again. "Used" here
> > > > > > > > > +encompasses using the buffer for compositing or display.
> > > > > > > > > +
> > > > > > > > > +A dequeued capture buffer can also be used as the reference frame of another
> > > > > > > > > +buffer.
> > > > > > > > > +
> > > > > > > > > +A frame is specified as reference by converting its timestamp into nanoseconds,
> > > > > > > > > +and storing it into the relevant member of a codec-dependent control structure.
> > > > > > > > > +The :c:func:`v4l2_timeval_to_ns` function must be used to perform that
> > > > > > > > > +conversion. The timestamp of a frame can be used to reference it as soon as all
> > > > > > > > > +its units of encoded data are successfully submitted to the ``OUTPUT`` queue.
> > > > > > > > > +
> > > > > > > > > +A decoded buffer containing a reference frame must not be reused as a decoding
> > > > > > > > > +target until all the frames referencing it have been decoded. The safest way to
> > > > > > > > > +achieve this is to refrain from queueing a reference buffer until all the
> > > > > > > > > +decoded frames referencing it have been dequeued. However, if the driver can
> > > > > > > > > +guarantee that buffer queued to the ``CAPTURE`` queue will be used in queued
> > > > > > > > > +order, then user-space can take advantage of this guarantee and queue a
> > > > > > > > > +reference buffer when the following conditions are met:
> > > > > > > > > +
> > > > > > > > > +1. All the requests for frames affected by the reference frame have been
> > > > > > > > > +   queued, and
> > > > > > > > > +
> > > > > > > > > +2. A sufficient number of ``CAPTURE`` buffers to cover all the decoded
> > > > > > > > > +   referencing frames have been queued.
> > > > > > > > > +
> > > > > > > > > +When queuing a decoding request, the driver will increase the reference count of
> > > > > > > > > +all the resources associated with reference frames. This means that the client
> > > > > > > > > +can e.g. close the DMABUF file descriptors of reference frame buffers if it
> > > > > > > > > +won't need them afterwards.
> > > > > > > > > +
> > > > > > > > > +Seeking
> > > > > > > > > +=======
> > > > > > > > > +In order to seek, the client just needs to submit requests using input buffers
> > > > > > > > > +corresponding to the new stream position. It must however be aware that
> > > > > > > > > +resolution may have changed and follow the dynamic resolution change sequence in
> > > > > > > > > +that case. Also depending on the codec used, picture parameters (e.g. SPS/PPS
> > > > > > > > > +for H.264) may have changed and the client is responsible for making sure that a
> > > > > > > > > +valid state is sent to the decoder.
> > > > > > > > > +
> > > > > > > > > +The client is then free to ignore any returned ``CAPTURE`` buffer that comes
> > > > > > > > > +from the pre-seek position.
> > > > > > > > > +
> > > > > > > > > +Pause
> > > > > > > > > +=====
> > > > > > > > > +
> > > > > > > > > +In order to pause, the client can just cease queuing buffers onto the ``OUTPUT``
> > > > > > > > > +queue. Without source bitstream data, there is no data to process and the codec
> > > > > > > > > +will remain idle.
> > > > > > > > > +
> > > > > > > > > +Dynamic resolution change
> > > > > > > > > +=========================
> > > > > > > > > +
> > > > > > > > > +If the client detects a resolution change in the stream, it will need to perform
> > > > > > > > > +the initialization sequence again with the new resolution:
> > > > > > > > > +
> > > > > > > > > +1. Wait until all submitted requests have completed and dequeue the
> > > > > > > > > +   corresponding output buffers.
> > > > > > > > > +
> > > > > > > > > +2. Call :c:func:`VIDIOC_STREAMOFF` on both the ``OUTPUT`` and ``CAPTURE``
> > > > > > > > > +   queues.
> > > > > > > > > +
> > > > > > > > > +3. Free all ``CAPTURE`` buffers by calling :c:func:`VIDIOC_REQBUFS` on the
> > > > > > > > > +   ``CAPTURE`` queue with a buffer count of zero.
> > > > > > > > > +
> > > > > > > > > +4. Perform the initialization sequence again (minus the allocation of
> > > > > > > > > +   ``OUTPUT`` buffers), with the new resolution set on the ``OUTPUT`` queue.
> > > > > > > > > +   Note that due to resolution constraints, a different format may need to be
> > > > > > > > > +   picked on the ``CAPTURE`` queue.
> > > > > > > > > +
> > > > > > > > > +Drain
> > > > > > > > > +=====
> > > > > > > > > +
> > > > > > > > > +In order to drain the stream on a stateless decoder, the client just needs to
> > > > > > > > > +wait until all the submitted requests are completed. There is no need to send a
> > > > > > > > > +``V4L2_DEC_CMD_STOP`` command since requests are processed sequentially by the
> > > > > > > > > +decoder.
Nicolas Dufresne April 17, 2019, 4:22 p.m. UTC | #17
Le mercredi 17 avril 2019 à 17:40 +0200, Paul Kocialkowski a écrit :
> Hi,
> 
> On Wed, 2019-04-17 at 11:30 -0400, Nicolas Dufresne wrote:
> > Le dimanche 14 avril 2019 à 18:41 +0200, Paul Kocialkowski a écrit :
> > > Hi,
> > > 
> > > Le vendredi 12 avril 2019 à 16:47 -0400, Nicolas Dufresne a écrit :
> > > > Le mercredi 06 mars 2019 à 17:00 +0900, Alexandre Courbot a écrit :
> > > > > Documents the protocol that user-space should follow when
> > > > > communicating with stateless video decoders.
> > > > > 
> > > > > The stateless video decoding API makes use of the new request and tags
> > > > > APIs. While it has been implemented with the Cedrus driver so far, it
> > > > > should probably still be considered staging for a short while.
> > > 
> > > [...]
> > > 
> > > > From an IRC discussion with Paul and some more digging, I have found a
> > > > design problem in the decoding process.
> > > > 
> > > > In H264 and HEVC you can have multiple decoding unit per frames
> > > > (slices). This type of encoding is increasingly popular, specially for
> > > > low latency streaming use cases. The wording of this spec does allow
> > > > for the notion of decoding unit, and in practice it has been proven to
> > > > work through some RFC FFMPEG patches and the Cedrus driver. But
> > > > something important to know is that the FFMPEG RFC implements decoding
> > > > in lock steps. Which means:
> > > > 
> > > >   1. It queues a single free capture buffer
> > > >   2. It queues an output buffer, set controls, queue the request
> > > >   3. It waits for a capture buffer to reach state done
> > > >   4. It dequeues that capture buffer, and queue it back again
> > > >   5. And then it runs step 2,4,3 again with following slices, until we 
> > > >      have a complete frame. After what, it restart at step 1
> > > > 
> > > > So the implementation makes no use of the queues. There is no batch
> > > > processing, so we might not be able to reach the maximum hardware
> > > > throughput.
> > > > 
> > > > So the optimal method would look like the following, but there comes
> > > > the design issue.
> > > > 
> > > >   1. Queue a single free capture buffer
> > > >   2. Queue output buffer for slice 1, set controls, queue the request
> > > >   3. Queue output buffer for slice 2, set controls, queue the request
> > > >   4. Wait for completion
> > > > 
> > > > The problem is in step 4. Completion means that the capture buffer done
> > > > decoding a single unit. So assuming the driver supports matching the
> > > > timestamp against the queued buffer, instead of waiting for a new
> > > > buffer, the driver would have to mark twice the same buffer to done
> > > > state, which is just not working to inform userspace that all slices
> > > > are decoded into the one capture buffer they share.
> > > 
> > > Interestingly, I'm experiencing the exact same problem dealing with a
> > > 2D graphics blitter that has limited ouput scaling abilities which
> > > imply handlnig a large scaling operation as multiple clipped smaller
> > > scaling operations. The issue is basically that multiple jobs have to
> > > be submitted to complete a single frame and relying on an indication
> > > from the destination buffer (such as a fence) doesn't work to indicate
> > > that all the operations were completed, since we get the indication at
> > > each step instead of at the end of the batch.
> > 
> > That looks similar to the IMX.6 IPU m2m driver. It splits the image in
> > tiles of 1024x1024 and process each tile separately. This driver has
> > been around for a long time, so I guess they have a solution to that.
> > They don't need requests, because there is nothing to be bundled with
> > the input image. I know that Renesas folks have started working on a
> > de-interlacer. Again, this kind of driver may process and reuse input
> > buffers for motion compensation, but I don't think they need special
> > userspace API for that.
> 
> Thanks for the reference! I hope it's not a blitter that was
> contributed as a V4L2 driver instead of DRM, as it probably would be
> more useful in DRM (but that's way beside the point).

DRM does not offer a generic and discoverable interface for these
accelerators. Note that these drivers have most of the time started as
DRM driver and their DRM side where dropped. That was the case for
Exynos drivers at least.

The thing is that DRM is great if you do immediate display stuff, while
V4L2 is nice if you do streaming, where you expect filling queued, and
popping buffers from queues.

In the end, this is just an interface, nothing prevents you from making
an internal driver (like the Meson Canvas) and simply letting multiple
sub-system expose it. Specially that some of these IP will often
support both signal and memory processing, so they equally fit into a
media controller ISP, a v4l2 m2m or a DRM driver.

Another driver you might want to look is Rockchip RGA driver (which is
a multi function IP, including blitting).

> 
> > > One idea I see to solve this is to have a notion of batch in the driver
> > > (for our situation, that would be in v4l2) and provide means to get a
> > > done indication for that entity.
> > 
> > Can't you just make this part of your driver state machine ?
> 
> Yes definitely, and I forgot to mention that's in DRM, not V4L2.
> 
> Anyway from that point on, I was back to talking about our codec
> situation, not my 2D blitter anymore :)
> 
> > > I think we could extend the request API to allow this. We already
> > > represent requests as individual file descriptors, we could totally
> > > group requests in batches and get a sync fd for the batch to poll on
> > > when we need to return the frames. It would be good if we could expose
> > > this in a way that makes it work with DRM as an in fence for display.
> > > Then we can pretty much schedule our flip + decoding together (which is
> > > quite nice to have when we're running late on the decoding side).
> > > 
> > > What do you think?
> > 
> > I'm not sure why this specific thing needs a userspace exposition.
> 
> Indeed, I'll handle it in my 2D driver.
> 
> Cheers,
> 
> Paul
> 
> > > It feels to me like the request API was designed to open up the way for
> > > these kinds of improvements, so I'm sure we can find an agreeable
> > > solution that extends the API.
> > > 
> > > > To me, multi slice encoded stream are just too common, and they will
> > > > also exist for AV1. So we really need a solution to this that does not
> > > > require operating in lock steps. Specially that some HW can decode
> > > > multiple slices in parallel (multi core), we would not want to prevent
> > > > that HW from being used efficiently. On top of this, we need a solution
> > > > so that we can also keep queuing slice of the following frames if they
> > > > arrive before decoding is done.
> > > 
> > > Agreed.
> > > 
> > > > I don't have a solution yet myself, but it would be nice to come up
> > > > with something before we freeze this API.
> > > 
> > > I think it's rather independent from the codec used and this is
> > > something that should be handled at the request API level. 
> > > 
> > > I'm not sure we can always expect the hardware to be able to operate on
> > > a per-slice basis. I think it would be useful to reflect this in the
> > > pixel format, so that we also have a possibility for a gathered slice
> > > buffer (as the spec currently mentions for mpeg-2) for legacy decoder
> > > hardware that will need to decode one frame in one go from a contiguous
> > > buffer with all the slice data appended.
> > > 
> > > This updates my pixel format proposition from IRC to the following:
> > > - V4L2_PIXFMT_H264_SLICE_APPENDED: single buffer for all the slices
> > > (appended buffer), slice params as v4l2 control (legacy);
> > > 
> > > - V4L2_PIXFMT_H264_SLICE: one buffer/slice, slice params as control;
> > > 
> > > - V4L2_PIXFMT_H264_SLICE_ANNEX_B: one buffer/slice in annex-b format, 
> > > slice params encoded in the buffer;
> > > 
> > > - V4L2_PIXFMT_H264_SLICE_MIXED: one buffer/slice in annex-b format,
> > > slice params encoded in the buffer and in slice params control;
> > > 
> > > Also, we need to make sure to have a per-slice bit offset to the
> > > encoded data in the slice params control so that the same slice buffer
> > > can be used with any of the _SLICE formats (e.g. ffmpeg would only have
> > > an annex-b slice and use any of the formats with it).
> > > 
> > > For the legacy format, we need to specify that the appended slices
> > > don't repeat the annex-b start code and NAL header.
> > > 
> > > What do you think?
> > > 
> > > >  By the way, if we could queue
> > > > twice the same buffer, that would in principal work, but internally
> > > > there is only one state per buffer. If you do external allocation, then
> > > > in theory you could workaround that, but then it's ugly, because you'll
> > > > have two buffers with the same timestamp.
> > > 
> > > One advantage of the request API is that buffers are actually queued
> > > when the request is processed, so this might not be too problematic.
> > > 
> > > I think what we need boils down to:
> > > - Being able to queue the same output buffer to multiple requests,
> > > which the request API should already allow;
> > > - Being able to grab the right capture buffer based on the output
> > > timestamp so that the different requests for the slices are rendered to
> > > the same destination buffer.
> > > 
> > > For the second point, I don't really have a clear idea of whether we
> > > can already expect v4l2 to allow picking a buffer that was marked done
> > > but was not de-queued by userspace yet. It might already be allowed and
> > > we could just implement something to lookup the buffer to grab by
> > > timestamp.
> > > 
> > > > An argument that was made early was that we don't need to support this
> > > > right away because userspace can combine all the slices into one
> > > > buffer. But for H264_SLICE_RAW format it's inconvenient, you'd need an
> > > > extra control to tell the driver the offset to each slices, because the
> > > > raw H264 does not have enough information to be parsed. RAW slice are
> > > > also I believe de-emulated, which means the code use to prevent having
> > > > pattern looking like a start code has been removed, so you cannot just
> > > > prepend start codes. De-emulation seems better placed in userspace if
> > > > the HW does not take care.
> > > 
> > > Mhh I'd like to avoid having having to specify the offset to each slice
> > > for the legacy case. Just appending the encoded data (excluding slice
> > > header and start code) works for cedrus and I think it makes sense more
> > > generally. The idea is to only expose a single slice params and act as
> > > if it was just one big slice buffer.
> > > 
> > > Come to think of it, maybe we need annex-b and mixed fashions of that
> > > legacy pixfmt too...
> > > 
> > > > I also very dislike the idea that we would enforce merging all slice
> > > > into the same buffer. The entire purpose of slices and the reason they
> > > > are used in practice is that you can start decoding slices before you
> > > > have all slices of a frame. This reduce drastically the latency for
> > > > streaming use cases, like video conferencing. So forcing the merging of
> > > > slices is basically like pretending slices have no benefits.
> > > 
> > > Of course, we don't want things to stay like this and this rework is
> > > definitely needed to get serious performance and latency going.
> > > 
> > > One thing you should also be aware of: we're currently using a
> > > workqueue between the job done irq and scheduling the next frame (in
> > > v4l2 m2m).
> > > 
> > > Maybe we could manage to fit that into an atomic path to schedule the
> > > next request in the previous job done irq context.
> > > 
> > > > I have just exposed the problem I see for now, to see what comes up.
> > > > But I hope we be able to propose solution too in the short term (in no
> > > > one beats me at it).
> > > 
> > > Seems that we have good grounds for a discussion!
> > > 
> > > Cheers,
> > > 
> > > Paul
> > > 
> > > > > +
> > > > > +A typical frame would thus be decoded using the following sequence:
> > > > > +
> > > > > +1. Queue an ``OUTPUT`` buffer containing one unit of encoded bitstream data for
> > > > > +   the decoding request, using :c:func:`VIDIOC_QBUF`.
> > > > > +
> > > > > +    * **Required fields:**
> > > > > +
> > > > > +      ``index``
> > > > > +          index of the buffer being queued.
> > > > > +
> > > > > +      ``type``
> > > > > +          type of the buffer.
> > > > > +
> > > > > +      ``bytesused``
> > > > > +          number of bytes taken by the encoded data frame in the buffer.
> > > > > +
> > > > > +      ``flags``
> > > > > +          the ``V4L2_BUF_FLAG_REQUEST_FD`` flag must be set.
> > > > > +
> > > > > +      ``request_fd``
> > > > > +          must be set to the file descriptor of the decoding request.
> > > > > +
> > > > > +      ``timestamp``
> > > > > +          must be set to a unique value per frame. This value will be propagated
> > > > > +          into the decoded frame's buffer and can also be used to use this frame
> > > > > +          as the reference of another.
> > > > > +
> > > > > +2. Set the codec-specific controls for the decoding request, using
> > > > > +   :c:func:`VIDIOC_S_EXT_CTRLS`.
> > > > > +
> > > > > +    * **Required fields:**
> > > > > +
> > > > > +      ``which``
> > > > > +          must be ``V4L2_CTRL_WHICH_REQUEST_VAL``.
> > > > > +
> > > > > +      ``request_fd``
> > > > > +          must be set to the file descriptor of the decoding request.
> > > > > +
> > > > > +      other fields
> > > > > +          other fields are set as usual when setting controls. The ``controls``
> > > > > +          array must contain all the codec-specific controls required to decode
> > > > > +          a frame.
> > > > > +
> > > > > +   .. note::
> > > > > +
> > > > > +      It is possible to specify the controls in different invocations of
> > > > > +      :c:func:`VIDIOC_S_EXT_CTRLS`, or to overwrite a previously set control, as
> > > > > +      long as ``request_fd`` and ``which`` are properly set. The controls state
> > > > > +      at the moment of request submission is the one that will be considered.
> > > > > +
> > > > > +   .. note::
> > > > > +
> > > > > +      The order in which steps 1 and 2 take place is interchangeable.
> > > > > +
> > > > > +3. Submit the request by invoking :c:func:`MEDIA_REQUEST_IOC_QUEUE` on the
> > > > > +   request FD.
> > > > > +
> > > > > +    If the request is submitted without an ``OUTPUT`` buffer, or if some of the
> > > > > +    required controls are missing from the request, then
> > > > > +    :c:func:`MEDIA_REQUEST_IOC_QUEUE` will return ``-ENOENT``. If more than one
> > > > > +    ``OUTPUT`` buffer is queued, then it will return ``-EINVAL``.
> > > > > +    :c:func:`MEDIA_REQUEST_IOC_QUEUE` returning non-zero means that no
> > > > > +    ``CAPTURE`` buffer will be produced for this request.
> > > > > +
> > > > > +``CAPTURE`` buffers must not be part of the request, and are queued
> > > > > +independently. They are returned in decode order (i.e. the same order as coded
> > > > > +frames were submitted to the ``OUTPUT`` queue).
> > > > > +
> > > > > +Runtime decoding errors are signaled by the dequeued ``CAPTURE`` buffers
> > > > > +carrying the ``V4L2_BUF_FLAG_ERROR`` flag. If a decoded reference frame has an
> > > > > +error, then all following decoded frames that refer to it also have the
> > > > > +``V4L2_BUF_FLAG_ERROR`` flag set, although the decoder will still try to
> > > > > +produce a (likely corrupted) frame.
> > > > > +
> > > > > +Buffer management while decoding
> > > > > +================================
> > > > > +Contrary to stateful decoders, a stateless decoder does not perform any kind of
> > > > > +buffer management: it only guarantees that dequeued ``CAPTURE`` buffers can be
> > > > > +used by the client for as long as they are not queued again. "Used" here
> > > > > +encompasses using the buffer for compositing or display.
> > > > > +
> > > > > +A dequeued capture buffer can also be used as the reference frame of another
> > > > > +buffer.
> > > > > +
> > > > > +A frame is specified as reference by converting its timestamp into nanoseconds,
> > > > > +and storing it into the relevant member of a codec-dependent control structure.
> > > > > +The :c:func:`v4l2_timeval_to_ns` function must be used to perform that
> > > > > +conversion. The timestamp of a frame can be used to reference it as soon as all
> > > > > +its units of encoded data are successfully submitted to the ``OUTPUT`` queue.
> > > > > +
> > > > > +A decoded buffer containing a reference frame must not be reused as a decoding
> > > > > +target until all the frames referencing it have been decoded. The safest way to
> > > > > +achieve this is to refrain from queueing a reference buffer until all the
> > > > > +decoded frames referencing it have been dequeued. However, if the driver can
> > > > > +guarantee that buffer queued to the ``CAPTURE`` queue will be used in queued
> > > > > +order, then user-space can take advantage of this guarantee and queue a
> > > > > +reference buffer when the following conditions are met:
> > > > > +
> > > > > +1. All the requests for frames affected by the reference frame have been
> > > > > +   queued, and
> > > > > +
> > > > > +2. A sufficient number of ``CAPTURE`` buffers to cover all the decoded
> > > > > +   referencing frames have been queued.
> > > > > +
> > > > > +When queuing a decoding request, the driver will increase the reference count of
> > > > > +all the resources associated with reference frames. This means that the client
> > > > > +can e.g. close the DMABUF file descriptors of reference frame buffers if it
> > > > > +won't need them afterwards.
> > > > > +
> > > > > +Seeking
> > > > > +=======
> > > > > +In order to seek, the client just needs to submit requests using input buffers
> > > > > +corresponding to the new stream position. It must however be aware that
> > > > > +resolution may have changed and follow the dynamic resolution change sequence in
> > > > > +that case. Also depending on the codec used, picture parameters (e.g. SPS/PPS
> > > > > +for H.264) may have changed and the client is responsible for making sure that a
> > > > > +valid state is sent to the decoder.
> > > > > +
> > > > > +The client is then free to ignore any returned ``CAPTURE`` buffer that comes
> > > > > +from the pre-seek position.
> > > > > +
> > > > > +Pause
> > > > > +=====
> > > > > +
> > > > > +In order to pause, the client can just cease queuing buffers onto the ``OUTPUT``
> > > > > +queue. Without source bitstream data, there is no data to process and the codec
> > > > > +will remain idle.
> > > > > +
> > > > > +Dynamic resolution change
> > > > > +=========================
> > > > > +
> > > > > +If the client detects a resolution change in the stream, it will need to perform
> > > > > +the initialization sequence again with the new resolution:
> > > > > +
> > > > > +1. Wait until all submitted requests have completed and dequeue the
> > > > > +   corresponding output buffers.
> > > > > +
> > > > > +2. Call :c:func:`VIDIOC_STREAMOFF` on both the ``OUTPUT`` and ``CAPTURE``
> > > > > +   queues.
> > > > > +
> > > > > +3. Free all ``CAPTURE`` buffers by calling :c:func:`VIDIOC_REQBUFS` on the
> > > > > +   ``CAPTURE`` queue with a buffer count of zero.
> > > > > +
> > > > > +4. Perform the initialization sequence again (minus the allocation of
> > > > > +   ``OUTPUT`` buffers), with the new resolution set on the ``OUTPUT`` queue.
> > > > > +   Note that due to resolution constraints, a different format may need to be
> > > > > +   picked on the ``CAPTURE`` queue.
> > > > > +
> > > > > +Drain
> > > > > +=====
> > > > > +
> > > > > +In order to drain the stream on a stateless decoder, the client just needs to
> > > > > +wait until all the submitted requests are completed. There is no need to send a
> > > > > +``V4L2_DEC_CMD_STOP`` command since requests are processed sequentially by the
> > > > > +decoder.
Paul Kocialkowski April 17, 2019, 5:18 p.m. UTC | #18
Hi,

Le mercredi 17 avril 2019 à 12:06 -0400, Nicolas Dufresne a écrit :
> Le mardi 16 avril 2019 à 16:22 +0900, Alexandre Courbot a écrit :
> > On Tue, Apr 16, 2019 at 12:30 AM Nicolas Dufresne <nicolas@ndufresne.ca> wrote:
> > > Le lundi 15 avril 2019 à 15:26 +0200, Paul Kocialkowski a écrit :

[...]

> > Thanks for this great discussion. Let me try to summarize the status
> > of this thread + the IRC discussion and add my own thoughts:
> > 
> > Proper support for multiple decoding units (e.g. H.264 slices) per
> > frame should not be an afterthought ; compliance to encoded formats
> > depend on it, and the benefit of lower latency is a significant
> > consideration for vendors.
> > 
> > m2m, which we use for all stateless codecs, has a strong assumption
> > that one OUTPUT buffer consumed results in one CAPTURE buffer being
> > produced. This assumption can however be overruled: at least the venus
> > driver does it to implement the stateful specification.
> 
> The m2m framework code, which is quite minimal, has this limitation,
> but it has nothing to do with the userspace M2M interface. In
> userspace, M2M are just two asynchronous queues. New input data is
> queued on the OUTPUT queue, and results is taken from the CAPTURE
> queue. There is nothing in the API or the spec that limits how many
> input data (OUTPUT queue) will be used to produce a number of results
> (CAPTURE queue).
> 
> > So we need a way to specify frame boundaries when submitting encoded
> > content to the driver. One request should contain a single OUTPUT
> > buffer, containing a single decoding unit, but we need a way to
> > specify whether the driver should directly produce a CAPTURE buffer
> > from this request, or keep using the same CAPTURE buffer with
> > subsequent requests.
> 
> Yes, that's a good recap, we need a way. Just a clarification, we need
> a way for formats similar to H264/H265 for which the frame boundary is
> often only discovered by parsing the following NAL or signalled through
> a container.
> 
> > I can think of 2 ways this can be expressed:
> > 1) We keep the current m2m behavior as the default (a CAPTURE buffer
> > is produced), and add a flag to ask the driver to change that behavior
> > and hold on the CAPTURE buffer and reuse it with the next request(s) ;
> > 2) We specify that no CAPTURE buffer is produced by default, unless a
> > flag asking so is specified.
> 
> I don't think 1) is really a valid option. A buffer has one state. In
> current implementation of Cedrus, when 1 unit is decoded (1 slice) the
> capture buffer is marked as DONE. That signals any userspace polling
> for capture buffer being ready to DQ. Now, if you drive the OUTPUT and
> CAPTURE queue from separate thread, you end up with a race where
> userspace thinks the buffer is ready but a new slice comes in, so the
> state has been cleared between the poll returning and the call to DQ
> buf. User-space will unexpectedly endup doing a blocking DQBuf which is
> likely unwanted. Then if we leave is in DONE state, it's much worst,
> since there is no way to signal that the buffer is ready (the decoding
> the unit has completed).
> 
> As this API does not exist yet, introducing 2) is possible and is much
> saner to handle from userspace. The benefit  is that you have no
> special case. The driver just hold on marking the buffer DONE until it
> has processed all unit up to one that had a frame completion flag on
> it.

Mhh, without explicitly marking the requests as part of the same group,
we would basically lose the ability to function stateless with this
idea. By that, I mean that you can no longer schedule a request that's
not part of the current batch of slices, since the kernel-side would
refrain from marking the capture buffer done for that request or mark
the capture buffer for the slices along with the completion of that
request.

One way or another, I think we need to explicitly mark the requests for
the each slice of the frame as being part of the same group. And while
at it, we might want to make that possible for any use case that might
require it, not just video decoding.

The request API is quite complex on its own and I guess we're dealing
with complex topics that involve significantly changing the behavior of
v4l2. Not to say that we shouldn't try to keep things simple, we
definitely should, but it might be one of the cases where the long-term 
solution is not going to be that easy to carry out.

The way I see it, the media userspace interface was enriched with the
request API to deal with a specific case (associating source data and
associated meta-data) and managed to bring a generic and flexible
solution that also covers other use cases. So I feel like solving this
issue would be best done at the media level and in a way that's not too
centered on our specific use case.

> > The flag could be specified in one of two ways:
> > a) As a new v4l2_buffer.flag for the OUTPUT buffer ;
> > b) As a dedicated control, either format-specific or more common to all codecs.
> > 
> > I tend to favor 2) and b) for this, for the reason that with H.264 at
> > least, user-space does not know whether a slice is the last slice of a
> > frame until it starts parsing the next one, and we don't know when we
> > will receive it. If we use a control to ask that a CAPTURE buffer be
> > produced, we can always submit another request with only that control
> > set once it is clear that the frame is complete (and not delay
> > decoding meanwhile). In practice I am not that familiar with
> > latency-sensitive streaming ; maybe a smart streamer would just append
> > an AUD NAL unit at the end of every frame and we can thus submit the
> > flag it with the last slice without further delay?
> 
> AUD NAL, when present, are the first NAL of a frame, so latency wise it
> is useless. So what we do is that we rely on the encoder to tell us. So
> encoders will set a flag to signal the last slice of a frame. If you
> are doing RTP, this flags is converted into a marker bit (RTP
> specific). This marker bit is then received on the other side and
> passed to the decoder. The decoder will process the slice and when this
> is done will immediately deliver the resulting frame (if reordering
> allow). If it's not present, it will wait for the next slice in order
> to determine if the decoded frame can be delivered or not. So without
> the marker, we effectively have 1 extra frame latency in the worst
> case.
> 
> What I like of the b) proposal is that we can invert the logic and
> effectively abstract this completely for formats that don't have slices
> (or equivalent) while having this implemented generically.
> 
> What I had in mind was a) because I was thinking that we could reuse
> the flag for stateful encoder/decoder in order to support the RTP
> marker bit usecase and slice level streaming. Right now, we only do
> full frame streaming, but it's limiting. the ZyncMP firmware that
> Micheal is integrating does support low latency with slice processing,
> so to match the vendor driver capacity we'll need that flag anyway.
> 
> But in stateless, it's easier, because not setting it at all simply
> introduce more latency, while for accelerators we would like to make
> the closing of a frame mandatory. So I'm totally fine with a different
> mechanism. Again, this is handled in VAAPI and other similar API by
> having begin/end function for frames, and then a number of
> decode_slice() calls in the middle. So there is an extra context for
> frames on top of slices in these API.
> 
> > An extra constraint to enforce would be that each decoding unit
> > belonging to the same frame must be submitted with the same timestamp,
> > otherwise the request submission would fail. We really need a
> > framework to enforce all this at a higher level than individual
> > drivers, once we reach an agreement I will start working on this.
> 
> I agree with that. And adding checks for this would be really welcome
> to catch errors.

Agreed, this feels like a sane thing to do.

> > Formats that do not support multiple decoding units per frame would
> > reject any request that does not carry the end-of-frame information.
> 
> Again, we *could* also reverse the logic, so that by default all OUTPUT
> buffer would be considered complete frames. So far I only know 3
> formats that have this feature, H264, H265 and AV1. I'm not sure for
> VP9, I would need to look. But clearly JPEG, VP8, H263, raw format and
> more don't seem to have this. We could also have a generic control/flag
> and make it mandatory for specific formats if that is simpler.

Well, now I am pretty much convinced that what's in the OUTPUT buffer
should be one entity of the decoding unit, which could be a slice or
coded data for one frame depending on the format. I think adding a
control or a flag for that would bring-in too much boilerplate and
different cases to handle in the driver, which does not feel that good.

I'm still open to introducing specific formats for submitting enough
slice data for a whole frame, but not sure it is worth the effort.

Cheers,

Paul
Paul Kocialkowski April 17, 2019, 5:21 p.m. UTC | #19
Hi,

Le mercredi 17 avril 2019 à 12:17 -0400, Nicolas Dufresne a écrit :
> In general, we say stateless from a HW point of view. It simply means
> that the HW (the accelerator) can be multiplexed to process several
> independent streams. While with stateful firmware, you generally can't
> save the state, and ends up with a specific number of concurrent stream
> (scheduling happens in the firmware). There is exception to that of
> course, the newest Amlogic/Meson video decoder allow for saving the
> decoder state. The registers are undocumented, since they are filled by
> the HW parser, but it's separated in a way that we could multiplex.

That's a nice summary! I think it constitutes a good definition of what
we should call a stateless decoder. Perhaps we could write it down in
the spec somewhere?

> Of course we do have a state in our drivers. Each time you open an m2m
> device, you create an new instance which will keep track of done jobs,
> pending jobs, active format, allocated memory, etc.

Yes, definitely.

Cheers,

Paul
Hans Verkuil April 26, 2019, 2:18 p.m. UTC | #20
On 4/16/19 9:22 AM, Alexandre Courbot wrote:

<snip>

> Thanks for this great discussion. Let me try to summarize the status
> of this thread + the IRC discussion and add my own thoughts:
> 
> Proper support for multiple decoding units (e.g. H.264 slices) per
> frame should not be an afterthought ; compliance to encoded formats
> depend on it, and the benefit of lower latency is a significant
> consideration for vendors.
> 
> m2m, which we use for all stateless codecs, has a strong assumption
> that one OUTPUT buffer consumed results in one CAPTURE buffer being
> produced. This assumption can however be overruled: at least the venus
> driver does it to implement the stateful specification.
> 
> So we need a way to specify frame boundaries when submitting encoded
> content to the driver. One request should contain a single OUTPUT
> buffer, containing a single decoding unit, but we need a way to
> specify whether the driver should directly produce a CAPTURE buffer
> from this request, or keep using the same CAPTURE buffer with
> subsequent requests.
> 
> I can think of 2 ways this can be expressed:
> 1) We keep the current m2m behavior as the default (a CAPTURE buffer
> is produced), and add a flag to ask the driver to change that behavior
> and hold on the CAPTURE buffer and reuse it with the next request(s) ;
> 2) We specify that no CAPTURE buffer is produced by default, unless a
> flag asking so is specified.
> 
> The flag could be specified in one of two ways:
> a) As a new v4l2_buffer.flag for the OUTPUT buffer ;
> b) As a dedicated control, either format-specific or more common to all codecs.
> 
> I tend to favor 2) and b) for this, for the reason that with H.264 at
> least, user-space does not know whether a slice is the last slice of a
> frame until it starts parsing the next one, and we don't know when we
> will receive it. If we use a control to ask that a CAPTURE buffer be
> produced, we can always submit another request with only that control
> set once it is clear that the frame is complete (and not delay
> decoding meanwhile). In practice I am not that familiar with
> latency-sensitive streaming ; maybe a smart streamer would just append
> an AUD NAL unit at the end of every frame and we can thus submit the
> flag it with the last slice without further delay?
> 
> An extra constraint to enforce would be that each decoding unit
> belonging to the same frame must be submitted with the same timestamp,
> otherwise the request submission would fail. We really need a
> framework to enforce all this at a higher level than individual
> drivers, once we reach an agreement I will start working on this.
> 
> Formats that do not support multiple decoding units per frame would
> reject any request that does not carry the end-of-frame information.
> 
> Anything missing / any further comment?
> 

After reading through this thread and a further irc discussion I now
understand the problem. I think there are several ways this can be
solved, but I think this is the easiest:

Introduce a new V4L2_BUF_FLAG_HOLD_CAPTURE_BUFFER flag.

If set in the OUTPUT buffer, then don't mark the CAPTURE buffer as
done after processing the OUTPUT buffer.

If an OUTPUT buffer was queued with a different timestamp than was
used for the currently held CAPTURE buffer, then mark that CAPTURE
buffer as done before starting processing this OUTPUT buffer.

In other words, for slicing you can just always set this flag and
group the slices by the OUTPUT timestamp. If you know that you
reached the last slice of a frame, then you can optionally clear the
flag to ensure the CAPTURE buffer is marked done without having to wait
for the first slice of the next frame to arrive.

Potential disadvantage of this approach is that this relies on the
OUTPUT timestamp to be the same for all slices of the same frame.

Which sounds reasonable to me.

In addition add a V4L2_BUF_CAP_SUPPORTS_HOLD_CAPTURE_BUFFER
capability to signal support for this flag.

I think this can be fairly easily implemented in v4l2-mem2mem.c.

In addition, this approach is not specific to codecs, it can be
used elsewhere as well (composing multiple output buffers into one
capture buffer is one use-case that comes to mind).

Comments? Other ideas?

Regards,

	Hans
Paul Kocialkowski April 26, 2019, 4:28 p.m. UTC | #21
Hi,

Le vendredi 26 avril 2019 à 16:18 +0200, Hans Verkuil a écrit :
> On 4/16/19 9:22 AM, Alexandre Courbot wrote:
> 
> <snip>
> 
> > Thanks for this great discussion. Let me try to summarize the status
> > of this thread + the IRC discussion and add my own thoughts:
> > 
> > Proper support for multiple decoding units (e.g. H.264 slices) per
> > frame should not be an afterthought ; compliance to encoded formats
> > depend on it, and the benefit of lower latency is a significant
> > consideration for vendors.
> > 
> > m2m, which we use for all stateless codecs, has a strong assumption
> > that one OUTPUT buffer consumed results in one CAPTURE buffer being
> > produced. This assumption can however be overruled: at least the venus
> > driver does it to implement the stateful specification.
> > 
> > So we need a way to specify frame boundaries when submitting encoded
> > content to the driver. One request should contain a single OUTPUT
> > buffer, containing a single decoding unit, but we need a way to
> > specify whether the driver should directly produce a CAPTURE buffer
> > from this request, or keep using the same CAPTURE buffer with
> > subsequent requests.
> > 
> > I can think of 2 ways this can be expressed:
> > 1) We keep the current m2m behavior as the default (a CAPTURE buffer
> > is produced), and add a flag to ask the driver to change that behavior
> > and hold on the CAPTURE buffer and reuse it with the next request(s) ;
> > 2) We specify that no CAPTURE buffer is produced by default, unless a
> > flag asking so is specified.
> > 
> > The flag could be specified in one of two ways:
> > a) As a new v4l2_buffer.flag for the OUTPUT buffer ;
> > b) As a dedicated control, either format-specific or more common to all codecs.
> > 
> > I tend to favor 2) and b) for this, for the reason that with H.264 at
> > least, user-space does not know whether a slice is the last slice of a
> > frame until it starts parsing the next one, and we don't know when we
> > will receive it. If we use a control to ask that a CAPTURE buffer be
> > produced, we can always submit another request with only that control
> > set once it is clear that the frame is complete (and not delay
> > decoding meanwhile). In practice I am not that familiar with
> > latency-sensitive streaming ; maybe a smart streamer would just append
> > an AUD NAL unit at the end of every frame and we can thus submit the
> > flag it with the last slice without further delay?
> > 
> > An extra constraint to enforce would be that each decoding unit
> > belonging to the same frame must be submitted with the same timestamp,
> > otherwise the request submission would fail. We really need a
> > framework to enforce all this at a higher level than individual
> > drivers, once we reach an agreement I will start working on this.
> > 
> > Formats that do not support multiple decoding units per frame would
> > reject any request that does not carry the end-of-frame information.
> > 
> > Anything missing / any further comment?
> > 
> 
> After reading through this thread and a further irc discussion I now
> understand the problem. I think there are several ways this can be
> solved, but I think this is the easiest:
> 
> Introduce a new V4L2_BUF_FLAG_HOLD_CAPTURE_BUFFER flag.
> 
> If set in the OUTPUT buffer, then don't mark the CAPTURE buffer as
> done after processing the OUTPUT buffer.
> 
> If an OUTPUT buffer was queued with a different timestamp than was
> used for the currently held CAPTURE buffer, then mark that CAPTURE
> buffer as done before starting processing this OUTPUT buffer.
> 
> In other words, for slicing you can just always set this flag and
> group the slices by the OUTPUT timestamp. If you know that you
> reached the last slice of a frame, then you can optionally clear the
> flag to ensure the CAPTURE buffer is marked done without having to wait
> for the first slice of the next frame to arrive.
> 
> Potential disadvantage of this approach is that this relies on the
> OUTPUT timestamp to be the same for all slices of the same frame.
> 
> Which sounds reasonable to me.
> 
> In addition add a V4L2_BUF_CAP_SUPPORTS_HOLD_CAPTURE_BUFFER
> capability to signal support for this flag.
> 
> I think this can be fairly easily implemented in v4l2-mem2mem.c.
> 
> In addition, this approach is not specific to codecs, it can be
> used elsewhere as well (composing multiple output buffers into one
> capture buffer is one use-case that comes to mind).
> 
> Comments? Other ideas?

One remark I have: this implies that the order in which requests are
decoded will match the order in which they are submitted (in order to
be rely on the last slice being marked as such).

In the future, we might want to be ablle to add support for parallel
decoders that could handle multiple slices concurrently. I don't think
the M2M internal API is ready for that currently, but it could
certainly be extended to allow that eventually. In that case, we can't
rely on the order in which slices will complete their decoding and the
one slice that was marking the end of the frame may be decoded sooner
than other slices scheduled at the same time. In this case, we end up
having to wait for a new frame in order to mark the destination buffer
as done, which introduces a major latency issue for the frame.

So I think the problem we should be trying to resolve should be
formulated in terms of marking the end of a group of requests as done. 

For that, my proposal is to solve the issue at the media API level, by
introducing an entity representing a group of requests that share the
same destination buffer. The idea would be that requests are added to
that entity when they are submitted, and a field in the submit ioctl
would indicate whether the request is the last one of the batch.

The destination buffer gets picked up when the first request is
processed and all subsequent requests grouped in the entity use the
same one. Then, we can have the media core ensure that all the requests
of that entity are completed and that the last element of the batch was
submitted before marking the destination buffer as done.

This presents a few advantages:
- Userspace has a straightforward interface to group the completion of
requests, which is independent from both our use case and v4l2;
- This mechanism can then be used in other situations where grouping
the completion of different requests is desirable: for instance, it
could be used to sync two source feeds that need to be displayed
synchronized;
- Userspace can poll on a single file descriptor representing the
entity, instead of having to do the book keeping of checking that each
request was completed before dealing with the decoded buffer.

What do you think?

Cheers,

Paul
Nicolas Dufresne April 27, 2019, 12:06 p.m. UTC | #22
Le vendredi 26 avril 2019 à 16:18 +0200, Hans Verkuil a écrit :
> On 4/16/19 9:22 AM, Alexandre Courbot wrote:
> 
> <snip>
> 
> > Thanks for this great discussion. Let me try to summarize the status
> > of this thread + the IRC discussion and add my own thoughts:
> > 
> > Proper support for multiple decoding units (e.g. H.264 slices) per
> > frame should not be an afterthought ; compliance to encoded formats
> > depend on it, and the benefit of lower latency is a significant
> > consideration for vendors.
> > 
> > m2m, which we use for all stateless codecs, has a strong assumption
> > that one OUTPUT buffer consumed results in one CAPTURE buffer being
> > produced. This assumption can however be overruled: at least the venus
> > driver does it to implement the stateful specification.
> > 
> > So we need a way to specify frame boundaries when submitting encoded
> > content to the driver. One request should contain a single OUTPUT
> > buffer, containing a single decoding unit, but we need a way to
> > specify whether the driver should directly produce a CAPTURE buffer
> > from this request, or keep using the same CAPTURE buffer with
> > subsequent requests.
> > 
> > I can think of 2 ways this can be expressed:
> > 1) We keep the current m2m behavior as the default (a CAPTURE buffer
> > is produced), and add a flag to ask the driver to change that behavior
> > and hold on the CAPTURE buffer and reuse it with the next request(s) ;
> > 2) We specify that no CAPTURE buffer is produced by default, unless a
> > flag asking so is specified.
> > 
> > The flag could be specified in one of two ways:
> > a) As a new v4l2_buffer.flag for the OUTPUT buffer ;
> > b) As a dedicated control, either format-specific or more common to all codecs.
> > 
> > I tend to favor 2) and b) for this, for the reason that with H.264 at
> > least, user-space does not know whether a slice is the last slice of a
> > frame until it starts parsing the next one, and we don't know when we
> > will receive it. If we use a control to ask that a CAPTURE buffer be
> > produced, we can always submit another request with only that control
> > set once it is clear that the frame is complete (and not delay
> > decoding meanwhile). In practice I am not that familiar with
> > latency-sensitive streaming ; maybe a smart streamer would just append
> > an AUD NAL unit at the end of every frame and we can thus submit the
> > flag it with the last slice without further delay?
> > 
> > An extra constraint to enforce would be that each decoding unit
> > belonging to the same frame must be submitted with the same timestamp,
> > otherwise the request submission would fail. We really need a
> > framework to enforce all this at a higher level than individual
> > drivers, once we reach an agreement I will start working on this.
> > 
> > Formats that do not support multiple decoding units per frame would
> > reject any request that does not carry the end-of-frame information.
> > 
> > Anything missing / any further comment?
> > 
> 
> After reading through this thread and a further irc discussion I now
> understand the problem. I think there are several ways this can be
> solved, but I think this is the easiest:
> 
> Introduce a new V4L2_BUF_FLAG_HOLD_CAPTURE_BUFFER flag.
> 
> If set in the OUTPUT buffer, then don't mark the CAPTURE buffer as
> done after processing the OUTPUT buffer.
> 
> If an OUTPUT buffer was queued with a different timestamp than was
> used for the currently held CAPTURE buffer, then mark that CAPTURE
> buffer as done before starting processing this OUTPUT buffer.

Just a curiosity, can you extend on how this would be handled. If there
is a number of capture buffer, these should have "no-timestamp". So I
suspect we need the condition to differentiate no-timestamp from
previous timestamp. What I'm unclear is to what does it mean "no-
timestamp". We already stated the timestamp 0 cannot be reserved as
being an unset timestamp.

> 
> In other words, for slicing you can just always set this flag and
> group the slices by the OUTPUT timestamp. If you know that you
> reached the last slice of a frame, then you can optionally clear the
> flag to ensure the CAPTURE buffer is marked done without having to wait
> for the first slice of the next frame to arrive.
> 
> Potential disadvantage of this approach is that this relies on the
> OUTPUT timestamp to be the same for all slices of the same frame.
> 
> Which sounds reasonable to me.
> 
> In addition add a V4L2_BUF_CAP_SUPPORTS_HOLD_CAPTURE_BUFFER
> capability to signal support for this flag.
> 
> I think this can be fairly easily implemented in v4l2-mem2mem.c.
> 
> In addition, this approach is not specific to codecs, it can be
> used elsewhere as well (composing multiple output buffers into one
> capture buffer is one use-case that comes to mind).
> 
> Comments? Other ideas?

Sounds reasonable to me. I'll read through Paul's comment now and
comment if needed.

> 
> Regards,
> 
> 	Hans
Nicolas Dufresne April 27, 2019, 12:23 p.m. UTC | #23
Le vendredi 26 avril 2019 à 18:28 +0200, Paul Kocialkowski a écrit :
> Hi,
> 
> Le vendredi 26 avril 2019 à 16:18 +0200, Hans Verkuil a écrit :
> > On 4/16/19 9:22 AM, Alexandre Courbot wrote:
> > 
> > <snip>
> > 
> > > Thanks for this great discussion. Let me try to summarize the status
> > > of this thread + the IRC discussion and add my own thoughts:
> > > 
> > > Proper support for multiple decoding units (e.g. H.264 slices) per
> > > frame should not be an afterthought ; compliance to encoded formats
> > > depend on it, and the benefit of lower latency is a significant
> > > consideration for vendors.
> > > 
> > > m2m, which we use for all stateless codecs, has a strong assumption
> > > that one OUTPUT buffer consumed results in one CAPTURE buffer being
> > > produced. This assumption can however be overruled: at least the venus
> > > driver does it to implement the stateful specification.
> > > 
> > > So we need a way to specify frame boundaries when submitting encoded
> > > content to the driver. One request should contain a single OUTPUT
> > > buffer, containing a single decoding unit, but we need a way to
> > > specify whether the driver should directly produce a CAPTURE buffer
> > > from this request, or keep using the same CAPTURE buffer with
> > > subsequent requests.
> > > 
> > > I can think of 2 ways this can be expressed:
> > > 1) We keep the current m2m behavior as the default (a CAPTURE buffer
> > > is produced), and add a flag to ask the driver to change that behavior
> > > and hold on the CAPTURE buffer and reuse it with the next request(s) ;
> > > 2) We specify that no CAPTURE buffer is produced by default, unless a
> > > flag asking so is specified.
> > > 
> > > The flag could be specified in one of two ways:
> > > a) As a new v4l2_buffer.flag for the OUTPUT buffer ;
> > > b) As a dedicated control, either format-specific or more common to all codecs.
> > > 
> > > I tend to favor 2) and b) for this, for the reason that with H.264 at
> > > least, user-space does not know whether a slice is the last slice of a
> > > frame until it starts parsing the next one, and we don't know when we
> > > will receive it. If we use a control to ask that a CAPTURE buffer be
> > > produced, we can always submit another request with only that control
> > > set once it is clear that the frame is complete (and not delay
> > > decoding meanwhile). In practice I am not that familiar with
> > > latency-sensitive streaming ; maybe a smart streamer would just append
> > > an AUD NAL unit at the end of every frame and we can thus submit the
> > > flag it with the last slice without further delay?
> > > 
> > > An extra constraint to enforce would be that each decoding unit
> > > belonging to the same frame must be submitted with the same timestamp,
> > > otherwise the request submission would fail. We really need a
> > > framework to enforce all this at a higher level than individual
> > > drivers, once we reach an agreement I will start working on this.
> > > 
> > > Formats that do not support multiple decoding units per frame would
> > > reject any request that does not carry the end-of-frame information.
> > > 
> > > Anything missing / any further comment?
> > > 
> > 
> > After reading through this thread and a further irc discussion I now
> > understand the problem. I think there are several ways this can be
> > solved, but I think this is the easiest:
> > 
> > Introduce a new V4L2_BUF_FLAG_HOLD_CAPTURE_BUFFER flag.
> > 
> > If set in the OUTPUT buffer, then don't mark the CAPTURE buffer as
> > done after processing the OUTPUT buffer.
> > 
> > If an OUTPUT buffer was queued with a different timestamp than was
> > used for the currently held CAPTURE buffer, then mark that CAPTURE
> > buffer as done before starting processing this OUTPUT buffer.
> > 
> > In other words, for slicing you can just always set this flag and
> > group the slices by the OUTPUT timestamp. If you know that you
> > reached the last slice of a frame, then you can optionally clear the
> > flag to ensure the CAPTURE buffer is marked done without having to wait
> > for the first slice of the next frame to arrive.
> > 
> > Potential disadvantage of this approach is that this relies on the
> > OUTPUT timestamp to be the same for all slices of the same frame.
> > 
> > Which sounds reasonable to me.
> > 
> > In addition add a V4L2_BUF_CAP_SUPPORTS_HOLD_CAPTURE_BUFFER
> > capability to signal support for this flag.
> > 
> > I think this can be fairly easily implemented in v4l2-mem2mem.c.
> > 
> > In addition, this approach is not specific to codecs, it can be
> > used elsewhere as well (composing multiple output buffers into one
> > capture buffer is one use-case that comes to mind).
> > 
> > Comments? Other ideas?
> 
> One remark I have: this implies that the order in which requests are
> decoded will match the order in which they are submitted (in order to
> be rely on the last slice being marked as such).

Unlike my initially suggested approach, this won't mark the last slice.
Instead it marks all slices that are not the last one. If userspace is
aware of the last one, it will unmark it, otherwise the boundary will
be detected through a timestamp change.

It basically allow to run in both per-slice and per-frame, as not
setting the flag on a slice will result in the buffer being marked done
when the slice is decoded. It will likely fit more use cases outside of
CODEC and is not request specific.

> 
> In the future, we might want to be ablle to add support for parallel
> decoders that could handle multiple slices concurrently. I don't think
> the M2M internal API is ready for that currently, but it could
> certainly be extended to allow that eventually. In that case, we can't
> rely on the order in which slices will complete their decoding and the
> one slice that was marking the end of the frame may be decoded sooner
> than other slices scheduled at the same time. In this case, we end up
> having to wait for a new frame in order to mark the destination buffer
> as done, which introduces a major latency issue for the frame.

I don't see how this relates to the current design. The driver needs to
keep track of the active jobs anyway. Jobs just need to be matched
against a specific capture buffer. So initially you'd be receiving jobs
for a capture buffer. The job queue will be "open", so that if these
jobs finishes, it won't mark done the capture buffer. Then whenever one
of the two condition (no flag or new timestamp) is met, the capture is
then marked as "can be done". That might happen right away (if all
pending jobs for the capture buffer are completed) or later after the
last job have effectively completed. Regardless of the order they
completed. The order does not really matter if you have a good data
structure.

For a codec, one m2m instance will never be able to decode to multiple
capture buffer at the same time since it will likely used as reference
in the following frames. So parallelism should happen between m2m
instance. I believe we should look into the m2m scaler and color
converters for inspiration on how multi-plexing of m2m instance is
done.

> 
> So I think the problem we should be trying to resolve should be
> formulated in terms of marking the end of a group of requests as done. 
> 
> For that, my proposal is to solve the issue at the media API level, by
> introducing an entity representing a group of requests that share the
> same destination buffer. The idea would be that requests are added to
> that entity when they are submitted, and a field in the submit ioctl
> would indicate whether the request is the last one of the batch.
> 
> The destination buffer gets picked up when the first request is
> processed and all subsequent requests grouped in the entity use the
> same one. Then, we can have the media core ensure that all the requests
> of that entity are completed and that the last element of the batch was
> submitted before marking the destination buffer as done.
> 
> This presents a few advantages:
> - Userspace has a straightforward interface to group the completion of
> requests, which is independent from both our use case and v4l2;
> - This mechanism can then be used in other situations where grouping
> the completion of different requests is desirable: for instance, it
> could be used to sync two source feeds that need to be displayed
> synchronized;
> - Userspace can poll on a single file descriptor representing the
> entity, instead of having to do the book keeping of checking that each
> request was completed before dealing with the decoded buffer.
> 
> What do you think?
> 
> Cheers,
> 
> Paul
>
Paul Kocialkowski April 27, 2019, 12:48 p.m. UTC | #24
Hi,

Le samedi 27 avril 2019 à 08:23 -0400, Nicolas Dufresne a écrit :
> Le vendredi 26 avril 2019 à 18:28 +0200, Paul Kocialkowski a écrit :
> > Hi,
> > 
> > Le vendredi 26 avril 2019 à 16:18 +0200, Hans Verkuil a écrit :
> > > On 4/16/19 9:22 AM, Alexandre Courbot wrote:
> > > 
> > > <snip>
> > > 
> > > > Thanks for this great discussion. Let me try to summarize the status
> > > > of this thread + the IRC discussion and add my own thoughts:
> > > > 
> > > > Proper support for multiple decoding units (e.g. H.264 slices) per
> > > > frame should not be an afterthought ; compliance to encoded formats
> > > > depend on it, and the benefit of lower latency is a significant
> > > > consideration for vendors.
> > > > 
> > > > m2m, which we use for all stateless codecs, has a strong assumption
> > > > that one OUTPUT buffer consumed results in one CAPTURE buffer being
> > > > produced. This assumption can however be overruled: at least the venus
> > > > driver does it to implement the stateful specification.
> > > > 
> > > > So we need a way to specify frame boundaries when submitting encoded
> > > > content to the driver. One request should contain a single OUTPUT
> > > > buffer, containing a single decoding unit, but we need a way to
> > > > specify whether the driver should directly produce a CAPTURE buffer
> > > > from this request, or keep using the same CAPTURE buffer with
> > > > subsequent requests.
> > > > 
> > > > I can think of 2 ways this can be expressed:
> > > > 1) We keep the current m2m behavior as the default (a CAPTURE buffer
> > > > is produced), and add a flag to ask the driver to change that behavior
> > > > and hold on the CAPTURE buffer and reuse it with the next request(s) ;
> > > > 2) We specify that no CAPTURE buffer is produced by default, unless a
> > > > flag asking so is specified.
> > > > 
> > > > The flag could be specified in one of two ways:
> > > > a) As a new v4l2_buffer.flag for the OUTPUT buffer ;
> > > > b) As a dedicated control, either format-specific or more common to all codecs.
> > > > 
> > > > I tend to favor 2) and b) for this, for the reason that with H.264 at
> > > > least, user-space does not know whether a slice is the last slice of a
> > > > frame until it starts parsing the next one, and we don't know when we
> > > > will receive it. If we use a control to ask that a CAPTURE buffer be
> > > > produced, we can always submit another request with only that control
> > > > set once it is clear that the frame is complete (and not delay
> > > > decoding meanwhile). In practice I am not that familiar with
> > > > latency-sensitive streaming ; maybe a smart streamer would just append
> > > > an AUD NAL unit at the end of every frame and we can thus submit the
> > > > flag it with the last slice without further delay?
> > > > 
> > > > An extra constraint to enforce would be that each decoding unit
> > > > belonging to the same frame must be submitted with the same timestamp,
> > > > otherwise the request submission would fail. We really need a
> > > > framework to enforce all this at a higher level than individual
> > > > drivers, once we reach an agreement I will start working on this.
> > > > 
> > > > Formats that do not support multiple decoding units per frame would
> > > > reject any request that does not carry the end-of-frame information.
> > > > 
> > > > Anything missing / any further comment?
> > > > 
> > > 
> > > After reading through this thread and a further irc discussion I now
> > > understand the problem. I think there are several ways this can be
> > > solved, but I think this is the easiest:
> > > 
> > > Introduce a new V4L2_BUF_FLAG_HOLD_CAPTURE_BUFFER flag.
> > > 
> > > If set in the OUTPUT buffer, then don't mark the CAPTURE buffer as
> > > done after processing the OUTPUT buffer.
> > > 
> > > If an OUTPUT buffer was queued with a different timestamp than was
> > > used for the currently held CAPTURE buffer, then mark that CAPTURE
> > > buffer as done before starting processing this OUTPUT buffer.
> > > 
> > > In other words, for slicing you can just always set this flag and
> > > group the slices by the OUTPUT timestamp. If you know that you
> > > reached the last slice of a frame, then you can optionally clear the
> > > flag to ensure the CAPTURE buffer is marked done without having to wait
> > > for the first slice of the next frame to arrive.
> > > 
> > > Potential disadvantage of this approach is that this relies on the
> > > OUTPUT timestamp to be the same for all slices of the same frame.
> > > 
> > > Which sounds reasonable to me.
> > > 
> > > In addition add a V4L2_BUF_CAP_SUPPORTS_HOLD_CAPTURE_BUFFER
> > > capability to signal support for this flag.
> > > 
> > > I think this can be fairly easily implemented in v4l2-mem2mem.c.
> > > 
> > > In addition, this approach is not specific to codecs, it can be
> > > used elsewhere as well (composing multiple output buffers into one
> > > capture buffer is one use-case that comes to mind).
> > > 
> > > Comments? Other ideas?
> > 
> > One remark I have: this implies that the order in which requests are
> > decoded will match the order in which they are submitted (in order to
> > be rely on the last slice being marked as such).
> 
> Unlike my initially suggested approach, this won't mark the last slice.
> Instead it marks all slices that are not the last one. If userspace is
> aware of the last one, it will unmark it, otherwise the boundary will
> be detected through a timestamp change.

Oh you're right, the proposal definitely makes good sense to me then.
Maybe we should consider doing that marking at media request level
(instead of source buffer). As for polling, I guess that polling any of
the requests will work fine.

So no further objections on the design then!

Cheers,

Paul

> It basically allow to run in both per-slice and per-frame, as not
> setting the flag on a slice will result in the buffer being marked done
> when the slice is decoded. It will likely fit more use cases outside of
> CODEC and is not request specific.
> 
> > In the future, we might want to be ablle to add support for parallel
> > decoders that could handle multiple slices concurrently. I don't think
> > the M2M internal API is ready for that currently, but it could
> > certainly be extended to allow that eventually. In that case, we can't
> > rely on the order in which slices will complete their decoding and the
> > one slice that was marking the end of the frame may be decoded sooner
> > than other slices scheduled at the same time. In this case, we end up
> > having to wait for a new frame in order to mark the destination buffer
> > as done, which introduces a major latency issue for the frame.
> 
> I don't see how this relates to the current design. The driver needs to
> keep track of the active jobs anyway. Jobs just need to be matched
> against a specific capture buffer. So initially you'd be receiving jobs
> for a capture buffer. The job queue will be "open", so that if these
> jobs finishes, it won't mark done the capture buffer. Then whenever one
> of the two condition (no flag or new timestamp) is met, the capture is
> then marked as "can be done". That might happen right away (if all
> pending jobs for the capture buffer are completed) or later after the
> last job have effectively completed. Regardless of the order they
> completed. The order does not really matter if you have a good data
> structure.
> 
> For a codec, one m2m instance will never be able to decode to multiple
> capture buffer at the same time since it will likely used as reference
> in the following frames. So parallelism should happen between m2m
> instance. I believe we should look into the m2m scaler and color
> converters for inspiration on how multi-plexing of m2m instance is
> done.
> 
> > So I think the problem we should be trying to resolve should be
> > formulated in terms of marking the end of a group of requests as done. 
> > 
> > For that, my proposal is to solve the issue at the media API level, by
> > introducing an entity representing a group of requests that share the
> > same destination buffer. The idea would be that requests are added to
> > that entity when they are submitted, and a field in the submit ioctl
> > would indicate whether the request is the last one of the batch.
> > 
> > The destination buffer gets picked up when the first request is
> > processed and all subsequent requests grouped in the entity use the
> > same one. Then, we can have the media core ensure that all the requests
> > of that entity are completed and that the last element of the batch was
> > submitted before marking the destination buffer as done.
> > 
> > This presents a few advantages:
> > - Userspace has a straightforward interface to group the completion of
> > requests, which is independent from both our use case and v4l2;
> > - This mechanism can then be used in other situations where grouping
> > the completion of different requests is desirable: for instance, it
> > could be used to sync two source feeds that need to be displayed
> > synchronized;
> > - Userspace can poll on a single file descriptor representing the
> > entity, instead of having to do the book keeping of checking that each
> > request was completed before dealing with the decoded buffer.
> > 
> > What do you think?
> > 
> > Cheers,
> > 
> > Paul
> >
Hans Verkuil April 29, 2019, 8:41 a.m. UTC | #25
On 4/27/19 2:06 PM, Nicolas Dufresne wrote:
> Le vendredi 26 avril 2019 à 16:18 +0200, Hans Verkuil a écrit :
>> On 4/16/19 9:22 AM, Alexandre Courbot wrote:
>>
>> <snip>
>>
>>> Thanks for this great discussion. Let me try to summarize the status
>>> of this thread + the IRC discussion and add my own thoughts:
>>>
>>> Proper support for multiple decoding units (e.g. H.264 slices) per
>>> frame should not be an afterthought ; compliance to encoded formats
>>> depend on it, and the benefit of lower latency is a significant
>>> consideration for vendors.
>>>
>>> m2m, which we use for all stateless codecs, has a strong assumption
>>> that one OUTPUT buffer consumed results in one CAPTURE buffer being
>>> produced. This assumption can however be overruled: at least the venus
>>> driver does it to implement the stateful specification.
>>>
>>> So we need a way to specify frame boundaries when submitting encoded
>>> content to the driver. One request should contain a single OUTPUT
>>> buffer, containing a single decoding unit, but we need a way to
>>> specify whether the driver should directly produce a CAPTURE buffer
>>> from this request, or keep using the same CAPTURE buffer with
>>> subsequent requests.
>>>
>>> I can think of 2 ways this can be expressed:
>>> 1) We keep the current m2m behavior as the default (a CAPTURE buffer
>>> is produced), and add a flag to ask the driver to change that behavior
>>> and hold on the CAPTURE buffer and reuse it with the next request(s) ;
>>> 2) We specify that no CAPTURE buffer is produced by default, unless a
>>> flag asking so is specified.
>>>
>>> The flag could be specified in one of two ways:
>>> a) As a new v4l2_buffer.flag for the OUTPUT buffer ;
>>> b) As a dedicated control, either format-specific or more common to all codecs.
>>>
>>> I tend to favor 2) and b) for this, for the reason that with H.264 at
>>> least, user-space does not know whether a slice is the last slice of a
>>> frame until it starts parsing the next one, and we don't know when we
>>> will receive it. If we use a control to ask that a CAPTURE buffer be
>>> produced, we can always submit another request with only that control
>>> set once it is clear that the frame is complete (and not delay
>>> decoding meanwhile). In practice I am not that familiar with
>>> latency-sensitive streaming ; maybe a smart streamer would just append
>>> an AUD NAL unit at the end of every frame and we can thus submit the
>>> flag it with the last slice without further delay?
>>>
>>> An extra constraint to enforce would be that each decoding unit
>>> belonging to the same frame must be submitted with the same timestamp,
>>> otherwise the request submission would fail. We really need a
>>> framework to enforce all this at a higher level than individual
>>> drivers, once we reach an agreement I will start working on this.
>>>
>>> Formats that do not support multiple decoding units per frame would
>>> reject any request that does not carry the end-of-frame information.
>>>
>>> Anything missing / any further comment?
>>>
>>
>> After reading through this thread and a further irc discussion I now
>> understand the problem. I think there are several ways this can be
>> solved, but I think this is the easiest:
>>
>> Introduce a new V4L2_BUF_FLAG_HOLD_CAPTURE_BUFFER flag.
>>
>> If set in the OUTPUT buffer, then don't mark the CAPTURE buffer as
>> done after processing the OUTPUT buffer.
>>
>> If an OUTPUT buffer was queued with a different timestamp than was
>> used for the currently held CAPTURE buffer, then mark that CAPTURE
>> buffer as done before starting processing this OUTPUT buffer.
> 
> Just a curiosity, can you extend on how this would be handled. If there
> is a number of capture buffer, these should have "no-timestamp". So I
> suspect we need the condition to differentiate no-timestamp from
> previous timestamp. What I'm unclear is to what does it mean "no-
> timestamp". We already stated the timestamp 0 cannot be reserved as
> being an unset timestamp.

For OUTPUT buffers there is no such thing as 'no timestamp'. They always
have a timestamp (which may be 0). The currently active CAPTURE buffer
also always has a timestamp as that was copied from the first OUTPUT buffer
for that CAPTURE buffer.

> 
>>
>> In other words, for slicing you can just always set this flag and
>> group the slices by the OUTPUT timestamp. If you know that you
>> reached the last slice of a frame, then you can optionally clear the
>> flag to ensure the CAPTURE buffer is marked done without having to wait
>> for the first slice of the next frame to arrive.
>>
>> Potential disadvantage of this approach is that this relies on the
>> OUTPUT timestamp to be the same for all slices of the same frame.
>>
>> Which sounds reasonable to me.
>>
>> In addition add a V4L2_BUF_CAP_SUPPORTS_HOLD_CAPTURE_BUFFER
>> capability to signal support for this flag.
>>
>> I think this can be fairly easily implemented in v4l2-mem2mem.c.
>>
>> In addition, this approach is not specific to codecs, it can be
>> used elsewhere as well (composing multiple output buffers into one
>> capture buffer is one use-case that comes to mind).
>>
>> Comments? Other ideas?
> 
> Sounds reasonable to me. I'll read through Paul's comment now and
> comment if needed.

Paul's OK with it as well. The only thing I am not 100% happy with is
the name of the flag. It's a very low-level name: i.e. it does what it
says, but it doesn't say for what purpose.

Does anyone have any better suggestions?

Also, who will implement this in v4l2-mem2mem? Paul, where you planning to do that?

Regards,

	Hans
Paul Kocialkowski April 29, 2019, 8:48 a.m. UTC | #26
Hi,

On Mon, 2019-04-29 at 10:41 +0200, Hans Verkuil wrote:
> On 4/27/19 2:06 PM, Nicolas Dufresne wrote:
> > Le vendredi 26 avril 2019 à 16:18 +0200, Hans Verkuil a écrit :
> > > On 4/16/19 9:22 AM, Alexandre Courbot wrote:
> > > 
> > > <snip>
> > > 
> > > > Thanks for this great discussion. Let me try to summarize the status
> > > > of this thread + the IRC discussion and add my own thoughts:
> > > > 
> > > > Proper support for multiple decoding units (e.g. H.264 slices) per
> > > > frame should not be an afterthought ; compliance to encoded formats
> > > > depend on it, and the benefit of lower latency is a significant
> > > > consideration for vendors.
> > > > 
> > > > m2m, which we use for all stateless codecs, has a strong assumption
> > > > that one OUTPUT buffer consumed results in one CAPTURE buffer being
> > > > produced. This assumption can however be overruled: at least the venus
> > > > driver does it to implement the stateful specification.
> > > > 
> > > > So we need a way to specify frame boundaries when submitting encoded
> > > > content to the driver. One request should contain a single OUTPUT
> > > > buffer, containing a single decoding unit, but we need a way to
> > > > specify whether the driver should directly produce a CAPTURE buffer
> > > > from this request, or keep using the same CAPTURE buffer with
> > > > subsequent requests.
> > > > 
> > > > I can think of 2 ways this can be expressed:
> > > > 1) We keep the current m2m behavior as the default (a CAPTURE buffer
> > > > is produced), and add a flag to ask the driver to change that behavior
> > > > and hold on the CAPTURE buffer and reuse it with the next request(s) ;
> > > > 2) We specify that no CAPTURE buffer is produced by default, unless a
> > > > flag asking so is specified.
> > > > 
> > > > The flag could be specified in one of two ways:
> > > > a) As a new v4l2_buffer.flag for the OUTPUT buffer ;
> > > > b) As a dedicated control, either format-specific or more common to all codecs.
> > > > 
> > > > I tend to favor 2) and b) for this, for the reason that with H.264 at
> > > > least, user-space does not know whether a slice is the last slice of a
> > > > frame until it starts parsing the next one, and we don't know when we
> > > > will receive it. If we use a control to ask that a CAPTURE buffer be
> > > > produced, we can always submit another request with only that control
> > > > set once it is clear that the frame is complete (and not delay
> > > > decoding meanwhile). In practice I am not that familiar with
> > > > latency-sensitive streaming ; maybe a smart streamer would just append
> > > > an AUD NAL unit at the end of every frame and we can thus submit the
> > > > flag it with the last slice without further delay?
> > > > 
> > > > An extra constraint to enforce would be that each decoding unit
> > > > belonging to the same frame must be submitted with the same timestamp,
> > > > otherwise the request submission would fail. We really need a
> > > > framework to enforce all this at a higher level than individual
> > > > drivers, once we reach an agreement I will start working on this.
> > > > 
> > > > Formats that do not support multiple decoding units per frame would
> > > > reject any request that does not carry the end-of-frame information.
> > > > 
> > > > Anything missing / any further comment?
> > > > 
> > > 
> > > After reading through this thread and a further irc discussion I now
> > > understand the problem. I think there are several ways this can be
> > > solved, but I think this is the easiest:
> > > 
> > > Introduce a new V4L2_BUF_FLAG_HOLD_CAPTURE_BUFFER flag.
> > > 
> > > If set in the OUTPUT buffer, then don't mark the CAPTURE buffer as
> > > done after processing the OUTPUT buffer.
> > > 
> > > If an OUTPUT buffer was queued with a different timestamp than was
> > > used for the currently held CAPTURE buffer, then mark that CAPTURE
> > > buffer as done before starting processing this OUTPUT buffer.
> > 
> > Just a curiosity, can you extend on how this would be handled. If there
> > is a number of capture buffer, these should have "no-timestamp". So I
> > suspect we need the condition to differentiate no-timestamp from
> > previous timestamp. What I'm unclear is to what does it mean "no-
> > timestamp". We already stated the timestamp 0 cannot be reserved as
> > being an unset timestamp.
> 
> For OUTPUT buffers there is no such thing as 'no timestamp'. They always
> have a timestamp (which may be 0). The currently active CAPTURE buffer
> also always has a timestamp as that was copied from the first OUTPUT buffer
> for that CAPTURE buffer.
> 
> > > In other words, for slicing you can just always set this flag and
> > > group the slices by the OUTPUT timestamp. If you know that you
> > > reached the last slice of a frame, then you can optionally clear the
> > > flag to ensure the CAPTURE buffer is marked done without having to wait
> > > for the first slice of the next frame to arrive.
> > > 
> > > Potential disadvantage of this approach is that this relies on the
> > > OUTPUT timestamp to be the same for all slices of the same frame.
> > > 
> > > Which sounds reasonable to me.
> > > 
> > > In addition add a V4L2_BUF_CAP_SUPPORTS_HOLD_CAPTURE_BUFFER
> > > capability to signal support for this flag.
> > > 
> > > I think this can be fairly easily implemented in v4l2-mem2mem.c.
> > > 
> > > In addition, this approach is not specific to codecs, it can be
> > > used elsewhere as well (composing multiple output buffers into one
> > > capture buffer is one use-case that comes to mind).
> > > 
> > > Comments? Other ideas?
> > 
> > Sounds reasonable to me. I'll read through Paul's comment now and
> > comment if needed.
> 
> Paul's OK with it as well. The only thing I am not 100% happy with is
> the name of the flag. It's a very low-level name: i.e. it does what it
> says, but it doesn't say for what purpose.
> 
> Does anyone have any better suggestions?

Good naming is always so hard to find... I don't have anything better
to suggest off the top of my head, but will definitely keep thinking
about it.

> Also, who will implement this in v4l2-mem2mem? Paul, where you planning to do that?

Well, I no longer have time chunks allocated to the VPU topic at work,
so that means I'll have to do it on spare time and it may take me a
while to get there.

So if either one of you would like to pick it up to get it over with
faster, feel free to do that!

Cheers,

Paul
Hans Verkuil April 29, 2019, 8:49 a.m. UTC | #27
On 4/29/19 10:48 AM, Paul Kocialkowski wrote:
> Hi,
> 
> On Mon, 2019-04-29 at 10:41 +0200, Hans Verkuil wrote:
>> On 4/27/19 2:06 PM, Nicolas Dufresne wrote:
>>> Le vendredi 26 avril 2019 à 16:18 +0200, Hans Verkuil a écrit :
>>>> On 4/16/19 9:22 AM, Alexandre Courbot wrote:
>>>>
>>>> <snip>
>>>>
>>>>> Thanks for this great discussion. Let me try to summarize the status
>>>>> of this thread + the IRC discussion and add my own thoughts:
>>>>>
>>>>> Proper support for multiple decoding units (e.g. H.264 slices) per
>>>>> frame should not be an afterthought ; compliance to encoded formats
>>>>> depend on it, and the benefit of lower latency is a significant
>>>>> consideration for vendors.
>>>>>
>>>>> m2m, which we use for all stateless codecs, has a strong assumption
>>>>> that one OUTPUT buffer consumed results in one CAPTURE buffer being
>>>>> produced. This assumption can however be overruled: at least the venus
>>>>> driver does it to implement the stateful specification.
>>>>>
>>>>> So we need a way to specify frame boundaries when submitting encoded
>>>>> content to the driver. One request should contain a single OUTPUT
>>>>> buffer, containing a single decoding unit, but we need a way to
>>>>> specify whether the driver should directly produce a CAPTURE buffer
>>>>> from this request, or keep using the same CAPTURE buffer with
>>>>> subsequent requests.
>>>>>
>>>>> I can think of 2 ways this can be expressed:
>>>>> 1) We keep the current m2m behavior as the default (a CAPTURE buffer
>>>>> is produced), and add a flag to ask the driver to change that behavior
>>>>> and hold on the CAPTURE buffer and reuse it with the next request(s) ;
>>>>> 2) We specify that no CAPTURE buffer is produced by default, unless a
>>>>> flag asking so is specified.
>>>>>
>>>>> The flag could be specified in one of two ways:
>>>>> a) As a new v4l2_buffer.flag for the OUTPUT buffer ;
>>>>> b) As a dedicated control, either format-specific or more common to all codecs.
>>>>>
>>>>> I tend to favor 2) and b) for this, for the reason that with H.264 at
>>>>> least, user-space does not know whether a slice is the last slice of a
>>>>> frame until it starts parsing the next one, and we don't know when we
>>>>> will receive it. If we use a control to ask that a CAPTURE buffer be
>>>>> produced, we can always submit another request with only that control
>>>>> set once it is clear that the frame is complete (and not delay
>>>>> decoding meanwhile). In practice I am not that familiar with
>>>>> latency-sensitive streaming ; maybe a smart streamer would just append
>>>>> an AUD NAL unit at the end of every frame and we can thus submit the
>>>>> flag it with the last slice without further delay?
>>>>>
>>>>> An extra constraint to enforce would be that each decoding unit
>>>>> belonging to the same frame must be submitted with the same timestamp,
>>>>> otherwise the request submission would fail. We really need a
>>>>> framework to enforce all this at a higher level than individual
>>>>> drivers, once we reach an agreement I will start working on this.
>>>>>
>>>>> Formats that do not support multiple decoding units per frame would
>>>>> reject any request that does not carry the end-of-frame information.
>>>>>
>>>>> Anything missing / any further comment?
>>>>>
>>>>
>>>> After reading through this thread and a further irc discussion I now
>>>> understand the problem. I think there are several ways this can be
>>>> solved, but I think this is the easiest:
>>>>
>>>> Introduce a new V4L2_BUF_FLAG_HOLD_CAPTURE_BUFFER flag.
>>>>
>>>> If set in the OUTPUT buffer, then don't mark the CAPTURE buffer as
>>>> done after processing the OUTPUT buffer.
>>>>
>>>> If an OUTPUT buffer was queued with a different timestamp than was
>>>> used for the currently held CAPTURE buffer, then mark that CAPTURE
>>>> buffer as done before starting processing this OUTPUT buffer.
>>>
>>> Just a curiosity, can you extend on how this would be handled. If there
>>> is a number of capture buffer, these should have "no-timestamp". So I
>>> suspect we need the condition to differentiate no-timestamp from
>>> previous timestamp. What I'm unclear is to what does it mean "no-
>>> timestamp". We already stated the timestamp 0 cannot be reserved as
>>> being an unset timestamp.
>>
>> For OUTPUT buffers there is no such thing as 'no timestamp'. They always
>> have a timestamp (which may be 0). The currently active CAPTURE buffer
>> also always has a timestamp as that was copied from the first OUTPUT buffer
>> for that CAPTURE buffer.
>>
>>>> In other words, for slicing you can just always set this flag and
>>>> group the slices by the OUTPUT timestamp. If you know that you
>>>> reached the last slice of a frame, then you can optionally clear the
>>>> flag to ensure the CAPTURE buffer is marked done without having to wait
>>>> for the first slice of the next frame to arrive.
>>>>
>>>> Potential disadvantage of this approach is that this relies on the
>>>> OUTPUT timestamp to be the same for all slices of the same frame.
>>>>
>>>> Which sounds reasonable to me.
>>>>
>>>> In addition add a V4L2_BUF_CAP_SUPPORTS_HOLD_CAPTURE_BUFFER
>>>> capability to signal support for this flag.
>>>>
>>>> I think this can be fairly easily implemented in v4l2-mem2mem.c.
>>>>
>>>> In addition, this approach is not specific to codecs, it can be
>>>> used elsewhere as well (composing multiple output buffers into one
>>>> capture buffer is one use-case that comes to mind).
>>>>
>>>> Comments? Other ideas?
>>>
>>> Sounds reasonable to me. I'll read through Paul's comment now and
>>> comment if needed.
>>
>> Paul's OK with it as well. The only thing I am not 100% happy with is
>> the name of the flag. It's a very low-level name: i.e. it does what it
>> says, but it doesn't say for what purpose.
>>
>> Does anyone have any better suggestions?
> 
> Good naming is always so hard to find... I don't have anything better
> to suggest off the top of my head, but will definitely keep thinking
> about it.
> 
>> Also, who will implement this in v4l2-mem2mem? Paul, where you planning to do that?
> 
> Well, I no longer have time chunks allocated to the VPU topic at work,
> so that means I'll have to do it on spare time and it may take me a
> while to get there.
> 
> So if either one of you would like to pick it up to get it over with
> faster, feel free to do that!

OK, then I'll try to come up with something this week or next week.

Regards,

	Hans
Paul Kocialkowski April 29, 2019, 8:50 a.m. UTC | #28
On Mon, 2019-04-29 at 10:49 +0200, Hans Verkuil wrote:
> On 4/29/19 10:48 AM, Paul Kocialkowski wrote:
> > Hi,
> > 
> > On Mon, 2019-04-29 at 10:41 +0200, Hans Verkuil wrote:
> > > On 4/27/19 2:06 PM, Nicolas Dufresne wrote:
> > > > Le vendredi 26 avril 2019 à 16:18 +0200, Hans Verkuil a écrit :
> > > > > On 4/16/19 9:22 AM, Alexandre Courbot wrote:
> > > > > 
> > > > > <snip>
> > > > > 
> > > > > > Thanks for this great discussion. Let me try to summarize the status
> > > > > > of this thread + the IRC discussion and add my own thoughts:
> > > > > > 
> > > > > > Proper support for multiple decoding units (e.g. H.264 slices) per
> > > > > > frame should not be an afterthought ; compliance to encoded formats
> > > > > > depend on it, and the benefit of lower latency is a significant
> > > > > > consideration for vendors.
> > > > > > 
> > > > > > m2m, which we use for all stateless codecs, has a strong assumption
> > > > > > that one OUTPUT buffer consumed results in one CAPTURE buffer being
> > > > > > produced. This assumption can however be overruled: at least the venus
> > > > > > driver does it to implement the stateful specification.
> > > > > > 
> > > > > > So we need a way to specify frame boundaries when submitting encoded
> > > > > > content to the driver. One request should contain a single OUTPUT
> > > > > > buffer, containing a single decoding unit, but we need a way to
> > > > > > specify whether the driver should directly produce a CAPTURE buffer
> > > > > > from this request, or keep using the same CAPTURE buffer with
> > > > > > subsequent requests.
> > > > > > 
> > > > > > I can think of 2 ways this can be expressed:
> > > > > > 1) We keep the current m2m behavior as the default (a CAPTURE buffer
> > > > > > is produced), and add a flag to ask the driver to change that behavior
> > > > > > and hold on the CAPTURE buffer and reuse it with the next request(s) ;
> > > > > > 2) We specify that no CAPTURE buffer is produced by default, unless a
> > > > > > flag asking so is specified.
> > > > > > 
> > > > > > The flag could be specified in one of two ways:
> > > > > > a) As a new v4l2_buffer.flag for the OUTPUT buffer ;
> > > > > > b) As a dedicated control, either format-specific or more common to all codecs.
> > > > > > 
> > > > > > I tend to favor 2) and b) for this, for the reason that with H.264 at
> > > > > > least, user-space does not know whether a slice is the last slice of a
> > > > > > frame until it starts parsing the next one, and we don't know when we
> > > > > > will receive it. If we use a control to ask that a CAPTURE buffer be
> > > > > > produced, we can always submit another request with only that control
> > > > > > set once it is clear that the frame is complete (and not delay
> > > > > > decoding meanwhile). In practice I am not that familiar with
> > > > > > latency-sensitive streaming ; maybe a smart streamer would just append
> > > > > > an AUD NAL unit at the end of every frame and we can thus submit the
> > > > > > flag it with the last slice without further delay?
> > > > > > 
> > > > > > An extra constraint to enforce would be that each decoding unit
> > > > > > belonging to the same frame must be submitted with the same timestamp,
> > > > > > otherwise the request submission would fail. We really need a
> > > > > > framework to enforce all this at a higher level than individual
> > > > > > drivers, once we reach an agreement I will start working on this.
> > > > > > 
> > > > > > Formats that do not support multiple decoding units per frame would
> > > > > > reject any request that does not carry the end-of-frame information.
> > > > > > 
> > > > > > Anything missing / any further comment?
> > > > > > 
> > > > > 
> > > > > After reading through this thread and a further irc discussion I now
> > > > > understand the problem. I think there are several ways this can be
> > > > > solved, but I think this is the easiest:
> > > > > 
> > > > > Introduce a new V4L2_BUF_FLAG_HOLD_CAPTURE_BUFFER flag.
> > > > > 
> > > > > If set in the OUTPUT buffer, then don't mark the CAPTURE buffer as
> > > > > done after processing the OUTPUT buffer.
> > > > > 
> > > > > If an OUTPUT buffer was queued with a different timestamp than was
> > > > > used for the currently held CAPTURE buffer, then mark that CAPTURE
> > > > > buffer as done before starting processing this OUTPUT buffer.
> > > > 
> > > > Just a curiosity, can you extend on how this would be handled. If there
> > > > is a number of capture buffer, these should have "no-timestamp". So I
> > > > suspect we need the condition to differentiate no-timestamp from
> > > > previous timestamp. What I'm unclear is to what does it mean "no-
> > > > timestamp". We already stated the timestamp 0 cannot be reserved as
> > > > being an unset timestamp.
> > > 
> > > For OUTPUT buffers there is no such thing as 'no timestamp'. They always
> > > have a timestamp (which may be 0). The currently active CAPTURE buffer
> > > also always has a timestamp as that was copied from the first OUTPUT buffer
> > > for that CAPTURE buffer.
> > > 
> > > > > In other words, for slicing you can just always set this flag and
> > > > > group the slices by the OUTPUT timestamp. If you know that you
> > > > > reached the last slice of a frame, then you can optionally clear the
> > > > > flag to ensure the CAPTURE buffer is marked done without having to wait
> > > > > for the first slice of the next frame to arrive.
> > > > > 
> > > > > Potential disadvantage of this approach is that this relies on the
> > > > > OUTPUT timestamp to be the same for all slices of the same frame.
> > > > > 
> > > > > Which sounds reasonable to me.
> > > > > 
> > > > > In addition add a V4L2_BUF_CAP_SUPPORTS_HOLD_CAPTURE_BUFFER
> > > > > capability to signal support for this flag.
> > > > > 
> > > > > I think this can be fairly easily implemented in v4l2-mem2mem.c.
> > > > > 
> > > > > In addition, this approach is not specific to codecs, it can be
> > > > > used elsewhere as well (composing multiple output buffers into one
> > > > > capture buffer is one use-case that comes to mind).
> > > > > 
> > > > > Comments? Other ideas?
> > > > 
> > > > Sounds reasonable to me. I'll read through Paul's comment now and
> > > > comment if needed.
> > > 
> > > Paul's OK with it as well. The only thing I am not 100% happy with is
> > > the name of the flag. It's a very low-level name: i.e. it does what it
> > > says, but it doesn't say for what purpose.
> > > 
> > > Does anyone have any better suggestions?
> > 
> > Good naming is always so hard to find... I don't have anything better
> > to suggest off the top of my head, but will definitely keep thinking
> > about it.
> > 
> > > Also, who will implement this in v4l2-mem2mem? Paul, where you planning to do that?
> > 
> > Well, I no longer have time chunks allocated to the VPU topic at work,
> > so that means I'll have to do it on spare time and it may take me a
> > while to get there.
> > 
> > So if either one of you would like to pick it up to get it over with
> > faster, feel free to do that!
> 
> OK, then I'll try to come up with something this week or next week.

Awesome, thanks!

Cheers,

Paul

> Regards,
> 
> 	Hans
Nicolas Dufresne April 29, 2019, 6:27 p.m. UTC | #29
Le lundi 29 avril 2019 à 10:48 +0200, Paul Kocialkowski a écrit :
> Hi,
> 
> On Mon, 2019-04-29 at 10:41 +0200, Hans Verkuil wrote:
> > On 4/27/19 2:06 PM, Nicolas Dufresne wrote:
> > > Le vendredi 26 avril 2019 à 16:18 +0200, Hans Verkuil a écrit :
> > > > On 4/16/19 9:22 AM, Alexandre Courbot wrote:
> > > > 
> > > > <snip>
> > > > 
> > > > > Thanks for this great discussion. Let me try to summarize the status
> > > > > of this thread + the IRC discussion and add my own thoughts:
> > > > > 
> > > > > Proper support for multiple decoding units (e.g. H.264 slices) per
> > > > > frame should not be an afterthought ; compliance to encoded formats
> > > > > depend on it, and the benefit of lower latency is a significant
> > > > > consideration for vendors.
> > > > > 
> > > > > m2m, which we use for all stateless codecs, has a strong assumption
> > > > > that one OUTPUT buffer consumed results in one CAPTURE buffer being
> > > > > produced. This assumption can however be overruled: at least the venus
> > > > > driver does it to implement the stateful specification.
> > > > > 
> > > > > So we need a way to specify frame boundaries when submitting encoded
> > > > > content to the driver. One request should contain a single OUTPUT
> > > > > buffer, containing a single decoding unit, but we need a way to
> > > > > specify whether the driver should directly produce a CAPTURE buffer
> > > > > from this request, or keep using the same CAPTURE buffer with
> > > > > subsequent requests.
> > > > > 
> > > > > I can think of 2 ways this can be expressed:
> > > > > 1) We keep the current m2m behavior as the default (a CAPTURE buffer
> > > > > is produced), and add a flag to ask the driver to change that behavior
> > > > > and hold on the CAPTURE buffer and reuse it with the next request(s) ;
> > > > > 2) We specify that no CAPTURE buffer is produced by default, unless a
> > > > > flag asking so is specified.
> > > > > 
> > > > > The flag could be specified in one of two ways:
> > > > > a) As a new v4l2_buffer.flag for the OUTPUT buffer ;
> > > > > b) As a dedicated control, either format-specific or more common to all codecs.
> > > > > 
> > > > > I tend to favor 2) and b) for this, for the reason that with H.264 at
> > > > > least, user-space does not know whether a slice is the last slice of a
> > > > > frame until it starts parsing the next one, and we don't know when we
> > > > > will receive it. If we use a control to ask that a CAPTURE buffer be
> > > > > produced, we can always submit another request with only that control
> > > > > set once it is clear that the frame is complete (and not delay
> > > > > decoding meanwhile). In practice I am not that familiar with
> > > > > latency-sensitive streaming ; maybe a smart streamer would just append
> > > > > an AUD NAL unit at the end of every frame and we can thus submit the
> > > > > flag it with the last slice without further delay?
> > > > > 
> > > > > An extra constraint to enforce would be that each decoding unit
> > > > > belonging to the same frame must be submitted with the same timestamp,
> > > > > otherwise the request submission would fail. We really need a
> > > > > framework to enforce all this at a higher level than individual
> > > > > drivers, once we reach an agreement I will start working on this.
> > > > > 
> > > > > Formats that do not support multiple decoding units per frame would
> > > > > reject any request that does not carry the end-of-frame information.
> > > > > 
> > > > > Anything missing / any further comment?
> > > > > 
> > > > 
> > > > After reading through this thread and a further irc discussion I now
> > > > understand the problem. I think there are several ways this can be
> > > > solved, but I think this is the easiest:
> > > > 
> > > > Introduce a new V4L2_BUF_FLAG_HOLD_CAPTURE_BUFFER flag.
> > > > 
> > > > If set in the OUTPUT buffer, then don't mark the CAPTURE buffer as
> > > > done after processing the OUTPUT buffer.
> > > > 
> > > > If an OUTPUT buffer was queued with a different timestamp than was
> > > > used for the currently held CAPTURE buffer, then mark that CAPTURE
> > > > buffer as done before starting processing this OUTPUT buffer.
> > > 
> > > Just a curiosity, can you extend on how this would be handled. If there
> > > is a number of capture buffer, these should have "no-timestamp". So I
> > > suspect we need the condition to differentiate no-timestamp from
> > > previous timestamp. What I'm unclear is to what does it mean "no-
> > > timestamp". We already stated the timestamp 0 cannot be reserved as
> > > being an unset timestamp.
> > 
> > For OUTPUT buffers there is no such thing as 'no timestamp'. They always
> > have a timestamp (which may be 0). The currently active CAPTURE buffer
> > also always has a timestamp as that was copied from the first OUTPUT buffer
> > for that CAPTURE buffer.
> > 
> > > > In other words, for slicing you can just always set this flag and
> > > > group the slices by the OUTPUT timestamp. If you know that you
> > > > reached the last slice of a frame, then you can optionally clear the
> > > > flag to ensure the CAPTURE buffer is marked done without having to wait
> > > > for the first slice of the next frame to arrive.
> > > > 
> > > > Potential disadvantage of this approach is that this relies on the
> > > > OUTPUT timestamp to be the same for all slices of the same frame.
> > > > 
> > > > Which sounds reasonable to me.
> > > > 
> > > > In addition add a V4L2_BUF_CAP_SUPPORTS_HOLD_CAPTURE_BUFFER
> > > > capability to signal support for this flag.
> > > > 
> > > > I think this can be fairly easily implemented in v4l2-mem2mem.c.
> > > > 
> > > > In addition, this approach is not specific to codecs, it can be
> > > > used elsewhere as well (composing multiple output buffers into one
> > > > capture buffer is one use-case that comes to mind).
> > > > 
> > > > Comments? Other ideas?
> > > 
> > > Sounds reasonable to me. I'll read through Paul's comment now and
> > > comment if needed.
> > 
> > Paul's OK with it as well. The only thing I am not 100% happy with is
> > the name of the flag. It's a very low-level name: i.e. it does what it
> > says, but it doesn't say for what purpose.
> > 
> > Does anyone have any better suggestions?
> 
> Good naming is always so hard to find... I don't have anything better
> to suggest off the top of my head, but will definitely keep thinking
> about it.
> 
> > Also, who will implement this in v4l2-mem2mem? Paul, where you planning to do that?
> 
> Well, I no longer have time chunks allocated to the VPU topic at work,
> so that means I'll have to do it on spare time and it may take me a
> while to get there.
> 
> So if either one of you would like to pick it up to get it over with
> faster, feel free to do that!

Adding Boris in CC. Boris, do you think that could possibly fit into
your todo while working on the H264 accelerator on RK ? If needed I can
generate test streams, there is couple of lines of code to remove / add
in FFMPEG backend if you want to test this properly, though I'm not
able to run this code atm (it requires a working DRM, and I'm having
issues with my board in this regard).

> 
> Cheers,
> 
> Paul
>
Paul Kocialkowski April 29, 2019, 8:32 p.m. UTC | #30
Hi,

Adding Thierry, Jonas and Jernej to the thread.

For context: this thread (was initially about the v4l2 m2m stateless
video decoder interface) is about defining a way to allow per-slice
video decoding to achieve a low latency. The main issue is that slices
are submitted one-by-one and share the same CAPTURE buffer, which must
only be marked as done once all the slices have finished decoding.

The proposed solution to do that is to pass a flag associated with the
OUTPUT buffer to indicate that the CAPTURE buffer must be held for now.
Once a buffer is passed without the flag set, the matching CAPTURE
buffer is released with the completion of the decoding of that slice.
(When adding support for parallel decoders in the future, we will need
to make sure that all the previously-submitted jobs are done in
addition to the one that should release thr CAPTURE buffer.)

Le lundi 29 avril 2019 à 14:27 -0400, Nicolas Dufresne a écrit :
> Le lundi 29 avril 2019 à 10:48 +0200, Paul Kocialkowski a écrit :
> > Hi,
> > 
> > On Mon, 2019-04-29 at 10:41 +0200, Hans Verkuil wrote:
> > > On 4/27/19 2:06 PM, Nicolas Dufresne wrote:
> > > > Le vendredi 26 avril 2019 à 16:18 +0200, Hans Verkuil a écrit :
> > > > > On 4/16/19 9:22 AM, Alexandre Courbot wrote:
> > > > > 
> > > > > <snip>
> > > > > 
> > > > > > Thanks for this great discussion. Let me try to summarize the status
> > > > > > of this thread + the IRC discussion and add my own thoughts:
> > > > > > 
> > > > > > Proper support for multiple decoding units (e.g. H.264 slices) per
> > > > > > frame should not be an afterthought ; compliance to encoded formats
> > > > > > depend on it, and the benefit of lower latency is a significant
> > > > > > consideration for vendors.
> > > > > > 
> > > > > > m2m, which we use for all stateless codecs, has a strong assumption
> > > > > > that one OUTPUT buffer consumed results in one CAPTURE buffer being
> > > > > > produced. This assumption can however be overruled: at least the venus
> > > > > > driver does it to implement the stateful specification.
> > > > > > 
> > > > > > So we need a way to specify frame boundaries when submitting encoded
> > > > > > content to the driver. One request should contain a single OUTPUT
> > > > > > buffer, containing a single decoding unit, but we need a way to
> > > > > > specify whether the driver should directly produce a CAPTURE buffer
> > > > > > from this request, or keep using the same CAPTURE buffer with
> > > > > > subsequent requests.
> > > > > > 
> > > > > > I can think of 2 ways this can be expressed:
> > > > > > 1) We keep the current m2m behavior as the default (a CAPTURE buffer
> > > > > > is produced), and add a flag to ask the driver to change that behavior
> > > > > > and hold on the CAPTURE buffer and reuse it with the next request(s) ;
> > > > > > 2) We specify that no CAPTURE buffer is produced by default, unless a
> > > > > > flag asking so is specified.
> > > > > > 
> > > > > > The flag could be specified in one of two ways:
> > > > > > a) As a new v4l2_buffer.flag for the OUTPUT buffer ;
> > > > > > b) As a dedicated control, either format-specific or more common to all codecs.
> > > > > > 
> > > > > > I tend to favor 2) and b) for this, for the reason that with H.264 at
> > > > > > least, user-space does not know whether a slice is the last slice of a
> > > > > > frame until it starts parsing the next one, and we don't know when we
> > > > > > will receive it. If we use a control to ask that a CAPTURE buffer be
> > > > > > produced, we can always submit another request with only that control
> > > > > > set once it is clear that the frame is complete (and not delay
> > > > > > decoding meanwhile). In practice I am not that familiar with
> > > > > > latency-sensitive streaming ; maybe a smart streamer would just append
> > > > > > an AUD NAL unit at the end of every frame and we can thus submit the
> > > > > > flag it with the last slice without further delay?
> > > > > > 
> > > > > > An extra constraint to enforce would be that each decoding unit
> > > > > > belonging to the same frame must be submitted with the same timestamp,
> > > > > > otherwise the request submission would fail. We really need a
> > > > > > framework to enforce all this at a higher level than individual
> > > > > > drivers, once we reach an agreement I will start working on this.
> > > > > > 
> > > > > > Formats that do not support multiple decoding units per frame would
> > > > > > reject any request that does not carry the end-of-frame information.
> > > > > > 
> > > > > > Anything missing / any further comment?
> > > > > > 
> > > > > 
> > > > > After reading through this thread and a further irc discussion I now
> > > > > understand the problem. I think there are several ways this can be
> > > > > solved, but I think this is the easiest:
> > > > > 
> > > > > Introduce a new V4L2_BUF_FLAG_HOLD_CAPTURE_BUFFER flag.
> > > > > 
> > > > > If set in the OUTPUT buffer, then don't mark the CAPTURE buffer as
> > > > > done after processing the OUTPUT buffer.
> > > > > 
> > > > > If an OUTPUT buffer was queued with a different timestamp than was
> > > > > used for the currently held CAPTURE buffer, then mark that CAPTURE
> > > > > buffer as done before starting processing this OUTPUT buffer.
> > > > 
> > > > Just a curiosity, can you extend on how this would be handled. If there
> > > > is a number of capture buffer, these should have "no-timestamp". So I
> > > > suspect we need the condition to differentiate no-timestamp from
> > > > previous timestamp. What I'm unclear is to what does it mean "no-
> > > > timestamp". We already stated the timestamp 0 cannot be reserved as
> > > > being an unset timestamp.
> > > 
> > > For OUTPUT buffers there is no such thing as 'no timestamp'. They always
> > > have a timestamp (which may be 0). The currently active CAPTURE buffer
> > > also always has a timestamp as that was copied from the first OUTPUT buffer
> > > for that CAPTURE buffer.
> > > 
> > > > > In other words, for slicing you can just always set this flag and
> > > > > group the slices by the OUTPUT timestamp. If you know that you
> > > > > reached the last slice of a frame, then you can optionally clear the
> > > > > flag to ensure the CAPTURE buffer is marked done without having to wait
> > > > > for the first slice of the next frame to arrive.
> > > > > 
> > > > > Potential disadvantage of this approach is that this relies on the
> > > > > OUTPUT timestamp to be the same for all slices of the same frame.
> > > > > 
> > > > > Which sounds reasonable to me.
> > > > > 
> > > > > In addition add a V4L2_BUF_CAP_SUPPORTS_HOLD_CAPTURE_BUFFER
> > > > > capability to signal support for this flag.
> > > > > 
> > > > > I think this can be fairly easily implemented in v4l2-mem2mem.c.
> > > > > 
> > > > > In addition, this approach is not specific to codecs, it can be
> > > > > used elsewhere as well (composing multiple output buffers into one
> > > > > capture buffer is one use-case that comes to mind).
> > > > > 
> > > > > Comments? Other ideas?
> > > > 
> > > > Sounds reasonable to me. I'll read through Paul's comment now and
> > > > comment if needed.
> > > 
> > > Paul's OK with it as well. The only thing I am not 100% happy with is
> > > the name of the flag. It's a very low-level name: i.e. it does what it
> > > says, but it doesn't say for what purpose.
> > > 
> > > Does anyone have any better suggestions?
> > 
> > Good naming is always so hard to find... I don't have anything better
> > to suggest off the top of my head, but will definitely keep thinking
> > about it.
> > 
> > > Also, who will implement this in v4l2-mem2mem? Paul, where you planning to do that?
> > 
> > Well, I no longer have time chunks allocated to the VPU topic at work,
> > so that means I'll have to do it on spare time and it may take me a
> > while to get there.
> > 
> > So if either one of you would like to pick it up to get it over with
> > faster, feel free to do that!
> 
> Adding Boris in CC. Boris, do you think that could possibly fit into
> your todo while working on the H264 accelerator on RK ? If needed I can
> generate test streams, there is couple of lines of code to remove / add
> in FFMPEG backend if you want to test this properly, though I'm not
> able to run this code atm (it requires a working DRM, and I'm having
> issues with my board in this regard).

Well, that seems like a task that requires in-depth knowledge about how
the v4l2 m2m core and the request API work and some familiary with
it. My feeling is that Boris is pretty new to all of this, so perhaps
it would be best for him to focus on the rockchip driver alone, which
is already a significant piece of work on its own.

It looks like Hans has proposed to come up with something soon, so
things are looking good for us. Once we have that, I think the next
area we need to look into is how we need to rework and refine the
controls. I think it would be good to define common guidelines for
adapting bitstream descriptions into controls with what the hardware
needs to know about precisely.

In spite of that, I would be very interested in knowing what the
rockchip MPEG-2 and H.264 decoders expect precisely. I'm also
interested in learning about Tegra decoders and there are also docs
about the Hantro G1 (MPEG-2 to H.264) and Hantro G2 (H.265) which are
well documented in the i.MX8M docs. It's also used on some Atmel
platforms apparently. So feedback regarding the current controls that
Maxime and I came up with would be welcome.

Cheers,

Paul
Nicolas Dufresne April 30, 2019, 12:47 a.m. UTC | #31
Le lundi 29 avril 2019 à 22:32 +0200, Paul Kocialkowski a écrit :
> > Adding Boris in CC. Boris, do you think that could possibly fit into
> > your todo while working on the H264 accelerator on RK ? If needed I can
> > generate test streams, there is couple of lines of code to remove / add
> > in FFMPEG backend if you want to test this properly, though I'm not
> > able to run this code atm (it requires a working DRM, and I'm having
> > issues with my board in this regard).
> 
> Well, that seems like a task that requires in-depth knowledge about how
> the v4l2 m2m core and the request API work and some familiary with
> it. My feeling is that Boris is pretty new to all of this, so perhaps
> it would be best for him to focus on the rockchip driver alone, which
> is already a significant piece of work on its own.
> 
> It looks like Hans has proposed to come up with something soon, so
> things are looking good for us. Once we have that, I think the next
> area we need to look into is how we need to rework and refine the
> controls. I think it would be good to define common guidelines for
> adapting bitstream descriptions into controls with what the hardware
> needs to know about precisely.
> 
> In spite of that, I would be very interested in knowing what the
> rockchip MPEG-2 and H.264 decoders expect precisely. I'm also

We are still working on that. For now, we believe that the list (and
traces from real stream matches) are according to the standard
"Initialization process" section 8.2.4.2. But they run both P and B
initialization regardless of the type, hence the 3 lists. But the
modification (section 8.4.2.3) are not applied. They also program the 3
lists regardless of the current picture type. This is quite strange.
Tomorrow I'll mark all b0/b1 value as invalid on P slice and the
opposite for B slice to see if that still decodes fine. If that was the
case, it would mean that the current list are complete, but not in the
expected order.

What I'm wondering is if it would be fine to add more information to
the DPB entry so that we could simply implement 8.2.4.2 to re-create
the pre-modification order. It's more doable then trying to reverse the
modifications and would offer a better uAPI in exchange for a very
small overhead.

> interested in learning about Tegra decoders and there are also docs
> about the Hantro G1 (MPEG-2 to H.264) and Hantro G2 (H.265) which are
> well documented in the i.MX8M docs. It's also used on some Atmel
> platforms apparently. So feedback regarding the current controls that
> Maxime and I came up with would be welcome.
diff mbox series

Patch

diff --git a/Documentation/media/uapi/v4l/dev-mem2mem.rst b/Documentation/media/uapi/v4l/dev-mem2mem.rst
index 67a980818dc8..db6f4efc458d 100644
--- a/Documentation/media/uapi/v4l/dev-mem2mem.rst
+++ b/Documentation/media/uapi/v4l/dev-mem2mem.rst
@@ -13,6 +13,11 @@ 
 Video Memory-To-Memory Interface
 ********************************
 
+.. toctree::
+    :maxdepth: 1
+
+    dev-stateless-decoder
+
 A V4L2 memory-to-memory device can compress, decompress, transform, or
 otherwise convert video data from one format into another format, in memory.
 Such memory-to-memory devices set the ``V4L2_CAP_VIDEO_M2M`` or
diff --git a/Documentation/media/uapi/v4l/dev-stateless-decoder.rst b/Documentation/media/uapi/v4l/dev-stateless-decoder.rst
new file mode 100644
index 000000000000..861fd2662886
--- /dev/null
+++ b/Documentation/media/uapi/v4l/dev-stateless-decoder.rst
@@ -0,0 +1,386 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+.. _stateless_decoder:
+
+**************************************************
+Memory-to-memory Stateless Video Decoder Interface
+**************************************************
+
+A stateless decoder is a decoder that works without retaining any kind of state
+between processed frames. This means that each frame is decoded independently
+of any previous and future frames, and that the client is responsible for
+maintaining the decoding state and providing it to the decoder with each
+decoding request. This is in contrast to the stateful video decoder interface,
+where the hardware and driver maintain the decoding state and all the client
+has to do is to provide the raw encoded stream and dequeue decoded frames in
+display order.
+
+This section describes how user-space ("the client") is expected to communicate
+with stateless decoders in order to successfully decode an encoded stream.
+Compared to stateful codecs, the decoder/client sequence is simpler, but the
+cost of this simplicity is extra complexity in the client which must maintain a
+consistent decoding state.
+
+Stateless decoders make use of the request API. A stateless decoder must expose
+the ``V4L2_BUF_CAP_SUPPORTS_REQUESTS`` capability on its ``OUTPUT`` queue when
+:c:func:`VIDIOC_REQBUFS` or :c:func:`VIDIOC_CREATE_BUFS` are invoked.
+
+Querying capabilities
+=====================
+
+1. To enumerate the set of coded formats supported by the decoder, the client
+   calls :c:func:`VIDIOC_ENUM_FMT` on the ``OUTPUT`` queue.
+
+   * The driver must always return the full set of supported ``OUTPUT`` formats,
+     irrespective of the format currently set on the ``CAPTURE`` queue.
+
+   * Simultaneously, the driver must restrain the set of values returned by
+     codec-specific capability controls (such as H.264 profiles) to the set
+     actually supported by the hardware.
+
+2. To enumerate the set of supported raw formats, the client calls
+   :c:func:`VIDIOC_ENUM_FMT` on the ``CAPTURE`` queue.
+
+   * The driver must return only the formats supported for the format currently
+     active on the ``OUTPUT`` queue.
+
+   * Depending on the currently set ``OUTPUT`` format, the set of supported raw
+     formats may depend on the value of some controls (e.g. parsed format
+     headers) which are codec-dependent. The client is responsible for making
+     sure that these controls are set before querying the ``CAPTURE`` queue.
+     Failure to do so will result in the default values for these controls being
+     used, and a returned set of formats that may not be usable for the media
+     the client is trying to decode.
+
+3. The client may use :c:func:`VIDIOC_ENUM_FRAMESIZES` to detect supported
+   resolutions for a given format, passing desired pixel format in
+   :c:type:`v4l2_frmsizeenum`'s ``pixel_format``.
+
+4. Supported profiles and levels for the current ``OUTPUT`` format, if
+   applicable, may be queried using their respective controls via
+   :c:func:`VIDIOC_QUERYCTRL`.
+
+Initialization
+==============
+
+1. Set the coded format on the ``OUTPUT`` queue via :c:func:`VIDIOC_S_FMT`.
+
+   * **Required fields:**
+
+     ``type``
+         a ``V4L2_BUF_TYPE_*`` enum appropriate for ``OUTPUT``.
+
+     ``pixelformat``
+         a coded pixel format.
+
+     ``width``, ``height``
+         coded width and height parsed from the stream.
+
+     other fields
+         follow standard semantics.
+
+   .. note::
+
+      Changing the ``OUTPUT`` format may change the currently set ``CAPTURE``
+      format. The driver will derive a new ``CAPTURE`` format from the
+      ``OUTPUT`` format being set, including resolution, colorimetry
+      parameters, etc. If the client needs a specific ``CAPTURE`` format,
+      it must adjust it afterwards.
+
+2. Call :c:func:`VIDIOC_S_EXT_CTRLS` to set all the controls (parsed headers,
+   etc.) required by the ``OUTPUT`` format to enumerate the ``CAPTURE`` formats.
+
+3. Call :c:func:`VIDIOC_G_FMT` for ``CAPTURE`` queue to get the format for the
+   destination buffers parsed/decoded from the bitstream.
+
+   * **Required fields:**
+
+     ``type``
+         a ``V4L2_BUF_TYPE_*`` enum appropriate for ``CAPTURE``.
+
+   * **Returned fields:**
+
+     ``width``, ``height``
+         frame buffer resolution for the decoded frames.
+
+     ``pixelformat``
+         pixel format for decoded frames.
+
+     ``num_planes`` (for _MPLANE ``type`` only)
+         number of planes for pixelformat.
+
+     ``sizeimage``, ``bytesperline``
+         as per standard semantics; matching frame buffer format.
+
+   .. note::
+
+      The value of ``pixelformat`` may be any pixel format supported for the
+      ``OUTPUT`` format, based on the hardware capabilities. It is suggested
+      that driver chooses the preferred/optimal format for the current
+      configuration. For example, a YUV format may be preferred over an RGB
+      format, if an additional conversion step would be required for RGB.
+
+4. *[optional]* Enumerate ``CAPTURE`` formats via :c:func:`VIDIOC_ENUM_FMT` on
+   the ``CAPTURE`` queue. The client may use this ioctl to discover which
+   alternative raw formats are supported for the current ``OUTPUT`` format and
+   select one of them via :c:func:`VIDIOC_S_FMT`.
+
+   .. note::
+
+      The driver will return only formats supported for the currently selected
+      ``OUTPUT`` format, even if more formats may be supported by the decoder in
+      general.
+
+      For example, a decoder may support YUV and RGB formats for
+      resolutions 1920x1088 and lower, but only YUV for higher resolutions (due
+      to hardware limitations). After setting a resolution of 1920x1088 or lower
+      as the ``OUTPUT`` format, :c:func:`VIDIOC_ENUM_FMT` may return a set of
+      YUV and RGB pixel formats, but after setting a resolution higher than
+      1920x1088, the driver will not return RGB pixel formats, since they are
+      unsupported for this resolution.
+
+5. *[optional]* Choose a different ``CAPTURE`` format than suggested via
+   :c:func:`VIDIOC_S_FMT` on ``CAPTURE`` queue. It is possible for the client to
+   choose a different format than selected/suggested by the driver in
+   :c:func:`VIDIOC_G_FMT`.
+
+    * **Required fields:**
+
+      ``type``
+          a ``V4L2_BUF_TYPE_*`` enum appropriate for ``CAPTURE``.
+
+      ``pixelformat``
+          a raw pixel format.
+
+6. Allocate source (bitstream) buffers via :c:func:`VIDIOC_REQBUFS` on
+   ``OUTPUT`` queue.
+
+    * **Required fields:**
+
+      ``count``
+          requested number of buffers to allocate; greater than zero.
+
+      ``type``
+          a ``V4L2_BUF_TYPE_*`` enum appropriate for ``OUTPUT``.
+
+      ``memory``
+          follows standard semantics.
+
+    * **Return fields:**
+
+      ``count``
+          actual number of buffers allocated.
+
+    * If required, the driver will adjust ``count`` to be equal or bigger to the
+      minimum of required number of ``OUTPUT`` buffers for the given format and
+      requested count. The client must check this value after the ioctl returns
+      to get the actual number of buffers allocated.
+
+7. Allocate destination (raw format) buffers via :c:func:`VIDIOC_REQBUFS` on the
+   ``CAPTURE`` queue.
+
+    * **Required fields:**
+
+      ``count``
+          requested number of buffers to allocate; greater than zero. The client
+          is responsible for deducing the minimum number of buffers required
+          for the stream to be properly decoded (taking e.g. reference frames
+          into account) and pass an equal or bigger number.
+
+      ``type``
+          a ``V4L2_BUF_TYPE_*`` enum appropriate for ``CAPTURE``.
+
+      ``memory``
+          follows standard semantics. ``V4L2_MEMORY_USERPTR`` is not supported
+          for ``CAPTURE`` buffers.
+
+    * **Return fields:**
+
+      ``count``
+          adjusted to allocated number of buffers, in case the codec requires
+          more buffers than requested.
+
+    * The driver must adjust count to the minimum of required number of
+      ``CAPTURE`` buffers for the current format, stream configuration and
+      requested count. The client must check this value after the ioctl
+      returns to get the number of buffers allocated.
+
+8. Allocate requests (likely one per ``OUTPUT`` buffer) via
+    :c:func:`MEDIA_IOC_REQUEST_ALLOC` on the media device.
+
+9. Start streaming on both ``OUTPUT`` and ``CAPTURE`` queues via
+    :c:func:`VIDIOC_STREAMON`.
+
+Decoding
+========
+
+For each frame, the client is responsible for submitting at least one request to
+which the following is attached:
+
+* The amount of encoded data expected by the codec for its current
+  configuration, as a buffer submitted to the ``OUTPUT`` queue. Typically, this
+  corresponds to one frame worth of encoded data, but some formats may allow (or
+  require) different amounts per unit.
+* All the metadata needed to decode the submitted encoded data, in the form of
+  controls relevant to the format being decoded.
+
+The amount and contents of the source ``OUTPUT`` buffer, as well as the controls
+that must be set on the request, depend on the active coded pixel format and
+might be affected by codec-specific extended controls, as stated in
+documentation of each format.
+
+A typical frame would thus be decoded using the following sequence:
+
+1. Queue an ``OUTPUT`` buffer containing one unit of encoded bitstream data for
+   the decoding request, using :c:func:`VIDIOC_QBUF`.
+
+    * **Required fields:**
+
+      ``index``
+          index of the buffer being queued.
+
+      ``type``
+          type of the buffer.
+
+      ``bytesused``
+          number of bytes taken by the encoded data frame in the buffer.
+
+      ``flags``
+          the ``V4L2_BUF_FLAG_REQUEST_FD`` flag must be set.
+
+      ``request_fd``
+          must be set to the file descriptor of the decoding request.
+
+      ``timestamp``
+          must be set to a unique value per frame. This value will be propagated
+          into the decoded frame's buffer and can also be used to use this frame
+          as the reference of another.
+
+2. Set the codec-specific controls for the decoding request, using
+   :c:func:`VIDIOC_S_EXT_CTRLS`.
+
+    * **Required fields:**
+
+      ``which``
+          must be ``V4L2_CTRL_WHICH_REQUEST_VAL``.
+
+      ``request_fd``
+          must be set to the file descriptor of the decoding request.
+
+      other fields
+          other fields are set as usual when setting controls. The ``controls``
+          array must contain all the codec-specific controls required to decode
+          a frame.
+
+   .. note::
+
+      It is possible to specify the controls in different invocations of
+      :c:func:`VIDIOC_S_EXT_CTRLS`, or to overwrite a previously set control, as
+      long as ``request_fd`` and ``which`` are properly set. The controls state
+      at the moment of request submission is the one that will be considered.
+
+   .. note::
+
+      The order in which steps 1 and 2 take place is interchangeable.
+
+3. Submit the request by invoking :c:func:`MEDIA_REQUEST_IOC_QUEUE` on the
+   request FD.
+
+    If the request is submitted without an ``OUTPUT`` buffer, or if some of the
+    required controls are missing from the request, then
+    :c:func:`MEDIA_REQUEST_IOC_QUEUE` will return ``-ENOENT``. If more than one
+    ``OUTPUT`` buffer is queued, then it will return ``-EINVAL``.
+    :c:func:`MEDIA_REQUEST_IOC_QUEUE` returning non-zero means that no
+    ``CAPTURE`` buffer will be produced for this request.
+
+``CAPTURE`` buffers must not be part of the request, and are queued
+independently. They are returned in decode order (i.e. the same order as coded
+frames were submitted to the ``OUTPUT`` queue).
+
+Runtime decoding errors are signaled by the dequeued ``CAPTURE`` buffers
+carrying the ``V4L2_BUF_FLAG_ERROR`` flag. If a decoded reference frame has an
+error, then all following decoded frames that refer to it also have the
+``V4L2_BUF_FLAG_ERROR`` flag set, although the decoder will still try to
+produce a (likely corrupted) frame.
+
+Buffer management while decoding
+================================
+Contrary to stateful decoders, a stateless decoder does not perform any kind of
+buffer management: it only guarantees that dequeued ``CAPTURE`` buffers can be
+used by the client for as long as they are not queued again. "Used" here
+encompasses using the buffer for compositing or display.
+
+A dequeued capture buffer can also be used as the reference frame of another
+buffer.
+
+A frame is specified as reference by converting its timestamp into nanoseconds,
+and storing it into the relevant member of a codec-dependent control structure.
+The :c:func:`v4l2_timeval_to_ns` function must be used to perform that
+conversion. The timestamp of a frame can be used to reference it as soon as all
+its units of encoded data are successfully submitted to the ``OUTPUT`` queue.
+
+A decoded buffer containing a reference frame must not be reused as a decoding
+target until all the frames referencing it have been decoded. The safest way to
+achieve this is to refrain from queueing a reference buffer until all the
+decoded frames referencing it have been dequeued. However, if the driver can
+guarantee that buffer queued to the ``CAPTURE`` queue will be used in queued
+order, then user-space can take advantage of this guarantee and queue a
+reference buffer when the following conditions are met:
+
+1. All the requests for frames affected by the reference frame have been
+   queued, and
+
+2. A sufficient number of ``CAPTURE`` buffers to cover all the decoded
+   referencing frames have been queued.
+
+When queuing a decoding request, the driver will increase the reference count of
+all the resources associated with reference frames. This means that the client
+can e.g. close the DMABUF file descriptors of reference frame buffers if it
+won't need them afterwards.
+
+Seeking
+=======
+In order to seek, the client just needs to submit requests using input buffers
+corresponding to the new stream position. It must however be aware that
+resolution may have changed and follow the dynamic resolution change sequence in
+that case. Also depending on the codec used, picture parameters (e.g. SPS/PPS
+for H.264) may have changed and the client is responsible for making sure that a
+valid state is sent to the decoder.
+
+The client is then free to ignore any returned ``CAPTURE`` buffer that comes
+from the pre-seek position.
+
+Pause
+=====
+
+In order to pause, the client can just cease queuing buffers onto the ``OUTPUT``
+queue. Without source bitstream data, there is no data to process and the codec
+will remain idle.
+
+Dynamic resolution change
+=========================
+
+If the client detects a resolution change in the stream, it will need to perform
+the initialization sequence again with the new resolution:
+
+1. Wait until all submitted requests have completed and dequeue the
+   corresponding output buffers.
+
+2. Call :c:func:`VIDIOC_STREAMOFF` on both the ``OUTPUT`` and ``CAPTURE``
+   queues.
+
+3. Free all ``CAPTURE`` buffers by calling :c:func:`VIDIOC_REQBUFS` on the
+   ``CAPTURE`` queue with a buffer count of zero.
+
+4. Perform the initialization sequence again (minus the allocation of
+   ``OUTPUT`` buffers), with the new resolution set on the ``OUTPUT`` queue.
+   Note that due to resolution constraints, a different format may need to be
+   picked on the ``CAPTURE`` queue.
+
+Drain
+=====
+
+In order to drain the stream on a stateless decoder, the client just needs to
+wait until all the submitted requests are completed. There is no need to send a
+``V4L2_DEC_CMD_STOP`` command since requests are processed sequentially by the
+decoder.