mbox series

[v2,00/41] drm: Analog TV Improvements

Message ID 20220728-rpi-analog-tv-properties-v2-0-459522d653a7@cerno.tech (mailing list archive)
Headers show
Series drm: Analog TV Improvements | expand

Message

Maxime Ripard Aug. 29, 2022, 1:11 p.m. UTC
Hi,

Here's a series aiming at improving the command line named modes support,
and more importantly how we deal with all the analog TV variants.

The named modes support were initially introduced to allow to specify the
analog TV mode to be used.

However, this was causing multiple issues:

  * The mode name parsed on the command line was passed directly to the
    driver, which had to figure out which mode it was suppose to match;

  * Figuring that out wasn't really easy, since the video= argument or what
    the userspace might not even have a name in the first place, but
    instead could have passed a mode with the same timings;

  * The fallback to matching on the timings was mostly working as long as
    we were supporting one 525 lines (most likely NSTC) and one 625 lines
    (PAL), but couldn't differentiate between two modes with the same
    timings (NTSC vs PAL-M vs NSTC-J for example);

  * There was also some overlap with the tv mode property registered by
    drm_mode_create_tv_properties(), but named modes weren't interacting
    with that property at all.

  * Even though that property was generic, its possible values were
    specific to each drivers, which made some generic support difficult.

Thus, I chose to tackle in multiple steps:

  * A new TV norm property was introduced, with generic values, each driver
    reporting through a bitmask what standard it supports to the userspace;

  * This option was added to the command line parsing code to be able to
    specify it on the kernel command line, and new atomic_check and reset
    helpers were created to integrate properly into atomic KMS;

  * The named mode parsing code is now creating a proper display mode for
    the given named mode, and the TV standard will thus be part of the
    connector state;

  * Two drivers were converted and tested for now (vc4 and sun4i), with
    some backward compatibility code to translate the old TV mode to the
    new TV mode;

Unit tests were created along the way.

One can switch from NTSC to PAL now using (on vc4)

modetest -M vc4  -s 53:720x480i -w 53:'tv norm':0

modetest -M vc4 -s 53:720x480i -w 53:'tv norm':4

Let me know what you think,
Maxime

Changes from v1 (https://lore.kernel.org/dri-devel/20220728-rpi-analog-tv-properties-v1-0-3d53ae722097@cerno.tech/):
  - Kept the older TV mode property as legacy so we can keep the old drivers functional
  - Renamed the tv_norm property to tv_mode
  - Added a function to create PAL and NTSC compatible display modes
  - Added some helpers to instantiate a mock DRM device in Kunit
  - More Kunit tests
  - Removed the HD analog TV modes
  - Renamed some of the tests
  - Renamed some of the named modes
  - Fixed typos in commit logs
  - Added the various tags

Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Mateusz Kwiatkowski <kfyatek+publicgit@gmail.com>
Cc: "Noralf Trønnes" <noralf@tronnes.org>
Cc: Dave Stevenson <dave.stevenson@raspberrypi.com>
Cc: Dom Cobley <dom@raspberrypi.com>
Cc: Phil Elwell <phil@raspberrypi.com>
Cc: <dri-devel@lists.freedesktop.org>

---
Geert Uytterhoeven (1):
      drm/modes: parse_cmdline: Add support for named modes containing dashes

Mateusz Kwiatkowski (5):
      drm/vc4: vec: Refactor VEC TV mode setting
      drm/vc4: vec: Remove redundant atomic_mode_set
      drm/vc4: vec: Fix timings for VEC modes
      drm/vc4: vec: Fix definition of PAL-M mode
      drm/vc4: vec: Add support for more analog TV standards

Maxime Ripard (35):
      drm/tests: Order Kunit tests in Makefile
      drm/tests: Add Kunit Helpers
      drm/atomic-helper: Rename drm_atomic_helper_connector_tv_reset to avoid ambiguity
      drm/connector: Rename subconnector state variable
      drm/atomic: Add TV subconnector property to get/set_property
      drm/connector: Rename legacy TV property
      drm/connector: Only register TV mode property if present
      drm/connector: Rename drm_mode_create_tv_properties
      drm/connector: Add TV standard property
      drm/modes: Add a function to generate analog display modes
      drm/modes: Only consider bpp and refresh before options
      drm/client: Add some tests for drm_connector_pick_cmdline_mode()
      drm/modes: Move named modes parsing to a separate function
      drm/modes: Switch to named mode descriptors
      drm/modes: Fill drm_cmdline mode from named modes
      drm/connector: Add pixel clock to cmdline mode
      drm/connector: Add a function to lookup a TV mode by its name
      drm/modes: Introduce the tv_mode property as a command-line option
      drm/modes: Properly generate a drm_display_mode from a named mode
      drm/modes: Introduce more named modes
      drm/atomic-helper: Add a TV properties reset helper
      drm/atomic-helper: Add an analog TV atomic_check implementation
      drm/vc4: vec: Remove empty mode_fixup
      drm/vc4: vec: Convert to atomic helpers
      drm/vc4: vec: Switch for common modes
      drm/vc4: vec: Use TV Reset implementation
      drm/vc4: vec: Convert to the new TV mode property
      drm/sun4i: tv: Remove unused mode_valid
      drm/sun4i: tv: Convert to atomic hooks
      drm/sun4i: tv: Merge mode_set into atomic_enable
      drm/sun4i: tv: Remove useless function
      drm/sun4i: tv: Remove useless destroy function
      drm/sun4i: tv: Rename error label
      drm/sun4i: tv: Add missing reset assertion
      drm/sun4i: tv: Convert to the new TV mode property

 drivers/gpu/drm/drm_atomic_state_helper.c       | 115 ++++-
 drivers/gpu/drm/drm_atomic_uapi.c               |   8 +
 drivers/gpu/drm/drm_client_modeset.c            |   4 +
 drivers/gpu/drm/drm_connector.c                 | 119 ++++-
 drivers/gpu/drm/drm_modes.c                     | 638 +++++++++++++++++++++++-
 drivers/gpu/drm/gud/gud_connector.c             |   8 +-
 drivers/gpu/drm/i2c/ch7006_drv.c                |   6 +-
 drivers/gpu/drm/i915/display/intel_tv.c         |   2 +-
 drivers/gpu/drm/nouveau/dispnv04/tvnv17.c       |   6 +-
 drivers/gpu/drm/sun4i/sun4i_tv.c                | 198 +++-----
 drivers/gpu/drm/tests/Makefile                  |  16 +-
 drivers/gpu/drm/tests/drm_client_modeset_test.c | 239 +++++++++
 drivers/gpu/drm/tests/drm_cmdline_parser_test.c | 216 ++++++++
 drivers/gpu/drm/tests/drm_kunit_helpers.c       |  54 ++
 drivers/gpu/drm/tests/drm_kunit_helpers.h       |   9 +
 drivers/gpu/drm/tests/drm_modes_test.c          | 131 +++++
 drivers/gpu/drm/vc4/vc4_hdmi.c                  |   2 +-
 drivers/gpu/drm/vc4/vc4_vec.c                   | 422 ++++++++++------
 include/drm/drm_atomic_state_helper.h           |   4 +
 include/drm/drm_connector.h                     | 165 +++++-
 include/drm/drm_mode_config.h                   |  12 +-
 include/drm/drm_modes.h                         |  17 +
 22 files changed, 2057 insertions(+), 334 deletions(-)
---
base-commit: 8869fa666a9e6782c3c896c1fa57d65adca23249
change-id: 20220728-rpi-analog-tv-properties-0914dfcee460

Best regards,

Comments

Noralf Trønnes Sept. 1, 2022, 7:35 p.m. UTC | #1
Den 29.08.2022 15.11, skrev Maxime Ripard:
> Hi,
> 
> 
> 
> Here's a series aiming at improving the command line named modes support,
> 
> and more importantly how we deal with all the analog TV variants.
> 
> 
> 
> The named modes support were initially introduced to allow to specify the
> 
> analog TV mode to be used.
> 
> 
> 
> However, this was causing multiple issues:
> 
> 
> 
>   * The mode name parsed on the command line was passed directly to the
> 
>     driver, which had to figure out which mode it was suppose to match;
> 
> 
> 
>   * Figuring that out wasn't really easy, since the video= argument or what
> 
>     the userspace might not even have a name in the first place, but
> 
>     instead could have passed a mode with the same timings;
> 
> 
> 
>   * The fallback to matching on the timings was mostly working as long as
> 
>     we were supporting one 525 lines (most likely NSTC) and one 625 lines
> 
>     (PAL), but couldn't differentiate between two modes with the same
> 
>     timings (NTSC vs PAL-M vs NSTC-J for example);
> 
> 
> 
>   * There was also some overlap with the tv mode property registered by
> 
>     drm_mode_create_tv_properties(), but named modes weren't interacting
> 
>     with that property at all.
> 
> 
> 
>   * Even though that property was generic, its possible values were
> 
>     specific to each drivers, which made some generic support difficult.
> 
> 
> 
> Thus, I chose to tackle in multiple steps:
> 
> 
> 
>   * A new TV norm property was introduced, with generic values, each driver
> 
>     reporting through a bitmask what standard it supports to the userspace;
> 
> 
> 
>   * This option was added to the command line parsing code to be able to
> 
>     specify it on the kernel command line, and new atomic_check and reset
> 
>     helpers were created to integrate properly into atomic KMS;
> 
> 
> 
>   * The named mode parsing code is now creating a proper display mode for
> 
>     the given named mode, and the TV standard will thus be part of the
> 
>     connector state;
> 
> 
> 
>   * Two drivers were converted and tested for now (vc4 and sun4i), with
> 
>     some backward compatibility code to translate the old TV mode to the
> 
>     new TV mode;
> 
> 
> 
> Unit tests were created along the way.
> 
> 
> 
> One can switch from NTSC to PAL now using (on vc4)
> 
> 
> 
> modetest -M vc4  -s 53:720x480i -w 53:'tv norm':0
> 
> 
> 
> modetest -M vc4 -s 53:720x480i -w 53:'tv norm':4
> 

The property name has changed, this gives me PAL:

$ modetest -M vc4 -s 45:720x576i -w 45:'TV mode':4


I have finally found a workaround for my kernel hangs.

Dom had a look at my kernel and found that the VideoCore was fine, and
he said this:

> That suggests cause of lockup was on arm side rather than VC side.
>
> But it's hard to diagnose further. Once you've had a peripheral not
> respond, the AXI bus locks up and no further operations are possible.
> Usual causes of this are required clocks being stopped or domains
> disabled and then trying to access the hardware.
>

So when I got this on my 64-bit build:

[  166.702171] SError Interrupt on CPU1, code 0x00000000bf000002 -- SError
[  166.702187] CPU: 1 PID: 8 Comm: kworker/u8:0 Tainted: G        W
    5.19.0-rc6-00096-gba7973977976-dirty #1
[  166.702200] Hardware name: Raspberry Pi 4 Model B Rev 1.1 (DT)
[  166.702206] Workqueue: events_freezable_power_ thermal_zone_device_check
[  166.702231] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS
BTYPE=--)
[  166.702242] pc : regmap_mmio_read32le+0x10/0x28
[  166.702261] lr : regmap_mmio_read+0x44/0x70
...
[  166.702606]  bcm2711_get_temp+0x58/0xb0 [bcm2711_thermal]

I wondered if that reg read was stalled due to a clock being stopped.

Lo and behold, disabling runtime pm and keeping the vec clock running
all the time fixed it[1].

I don't know what the problem is, but at least I can now test this patchset.

[1] https://gist.github.com/notro/23b984e7fa05cfbda2db50a421cac065

Noralf.
Noralf Trønnes Sept. 2, 2022, 11:28 a.m. UTC | #2
Den 01.09.2022 21.35, skrev Noralf Trønnes:
> 
> 
> I have finally found a workaround for my kernel hangs.
> 
> Dom had a look at my kernel and found that the VideoCore was fine, and
> he said this:
> 
>> That suggests cause of lockup was on arm side rather than VC side.
>>
>> But it's hard to diagnose further. Once you've had a peripheral not
>> respond, the AXI bus locks up and no further operations are possible.
>> Usual causes of this are required clocks being stopped or domains
>> disabled and then trying to access the hardware.
>>
> 
> So when I got this on my 64-bit build:
> 
> [  166.702171] SError Interrupt on CPU1, code 0x00000000bf000002 -- SError
> [  166.702187] CPU: 1 PID: 8 Comm: kworker/u8:0 Tainted: G        W
>     5.19.0-rc6-00096-gba7973977976-dirty #1
> [  166.702200] Hardware name: Raspberry Pi 4 Model B Rev 1.1 (DT)
> [  166.702206] Workqueue: events_freezable_power_ thermal_zone_device_check
> [  166.702231] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS
> BTYPE=--)
> [  166.702242] pc : regmap_mmio_read32le+0x10/0x28
> [  166.702261] lr : regmap_mmio_read+0x44/0x70
> ...
> [  166.702606]  bcm2711_get_temp+0x58/0xb0 [bcm2711_thermal]
> 
> I wondered if that reg read was stalled due to a clock being stopped.
> 
> Lo and behold, disabling runtime pm and keeping the vec clock running
> all the time fixed it[1].
> 
> I don't know what the problem is, but at least I can now test this patchset.
> 
> [1] https://gist.github.com/notro/23b984e7fa05cfbda2db50a421cac065
> 

It turns out I didn't have to disable runtime pm:
https://gist.github.com/notro/0adcfcb12460b54e54458afe11dc8ea2

Noralf.
Maxime Ripard Sept. 5, 2022, 2:57 p.m. UTC | #3
On Fri, Sep 02, 2022 at 01:28:16PM +0200, Noralf Trønnes wrote:
> 
> 
> Den 01.09.2022 21.35, skrev Noralf Trønnes:
> > 
> > 
> > I have finally found a workaround for my kernel hangs.
> > 
> > Dom had a look at my kernel and found that the VideoCore was fine, and
> > he said this:
> > 
> >> That suggests cause of lockup was on arm side rather than VC side.
> >>
> >> But it's hard to diagnose further. Once you've had a peripheral not
> >> respond, the AXI bus locks up and no further operations are possible.
> >> Usual causes of this are required clocks being stopped or domains
> >> disabled and then trying to access the hardware.
> >>
> > 
> > So when I got this on my 64-bit build:
> > 
> > [  166.702171] SError Interrupt on CPU1, code 0x00000000bf000002 -- SError
> > [  166.702187] CPU: 1 PID: 8 Comm: kworker/u8:0 Tainted: G        W
> >     5.19.0-rc6-00096-gba7973977976-dirty #1
> > [  166.702200] Hardware name: Raspberry Pi 4 Model B Rev 1.1 (DT)
> > [  166.702206] Workqueue: events_freezable_power_ thermal_zone_device_check
> > [  166.702231] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS
> > BTYPE=--)
> > [  166.702242] pc : regmap_mmio_read32le+0x10/0x28
> > [  166.702261] lr : regmap_mmio_read+0x44/0x70
> > ...
> > [  166.702606]  bcm2711_get_temp+0x58/0xb0 [bcm2711_thermal]
> > 
> > I wondered if that reg read was stalled due to a clock being stopped.
> > 
> > Lo and behold, disabling runtime pm and keeping the vec clock running
> > all the time fixed it[1].
> > 
> > I don't know what the problem is, but at least I can now test this patchset.
> > 
> > [1] https://gist.github.com/notro/23b984e7fa05cfbda2db50a421cac065
> > 
> 
> It turns out I didn't have to disable runtime pm:
> https://gist.github.com/notro/0adcfcb12460b54e54458afe11dc8ea2

If the bcm2711_thermal IP needs that clock to be enabled, it should grab
a reference itself, but it looks like even the device tree binding
doesn't ask for one.

Maxime
Noralf Trønnes Sept. 5, 2022, 3:17 p.m. UTC | #4
Den 05.09.2022 16.57, skrev Maxime Ripard:
> On Fri, Sep 02, 2022 at 01:28:16PM +0200, Noralf Trønnes wrote:
>>
>>
>> Den 01.09.2022 21.35, skrev Noralf Trønnes:
>>>
>>>
>>> I have finally found a workaround for my kernel hangs.
>>>
>>> Dom had a look at my kernel and found that the VideoCore was fine, and
>>> he said this:
>>>
>>>> That suggests cause of lockup was on arm side rather than VC side.
>>>>
>>>> But it's hard to diagnose further. Once you've had a peripheral not
>>>> respond, the AXI bus locks up and no further operations are possible.
>>>> Usual causes of this are required clocks being stopped or domains
>>>> disabled and then trying to access the hardware.
>>>>
>>>
>>> So when I got this on my 64-bit build:
>>>
>>> [  166.702171] SError Interrupt on CPU1, code 0x00000000bf000002 -- SError
>>> [  166.702187] CPU: 1 PID: 8 Comm: kworker/u8:0 Tainted: G        W
>>>     5.19.0-rc6-00096-gba7973977976-dirty #1
>>> [  166.702200] Hardware name: Raspberry Pi 4 Model B Rev 1.1 (DT)
>>> [  166.702206] Workqueue: events_freezable_power_ thermal_zone_device_check
>>> [  166.702231] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS
>>> BTYPE=--)
>>> [  166.702242] pc : regmap_mmio_read32le+0x10/0x28
>>> [  166.702261] lr : regmap_mmio_read+0x44/0x70
>>> ...
>>> [  166.702606]  bcm2711_get_temp+0x58/0xb0 [bcm2711_thermal]
>>>
>>> I wondered if that reg read was stalled due to a clock being stopped.
>>>
>>> Lo and behold, disabling runtime pm and keeping the vec clock running
>>> all the time fixed it[1].
>>>
>>> I don't know what the problem is, but at least I can now test this patchset.
>>>
>>> [1] https://gist.github.com/notro/23b984e7fa05cfbda2db50a421cac065
>>>
>>
>> It turns out I didn't have to disable runtime pm:
>> https://gist.github.com/notro/0adcfcb12460b54e54458afe11dc8ea2
> 
> If the bcm2711_thermal IP needs that clock to be enabled, it should grab
> a reference itself, but it looks like even the device tree binding
> doesn't ask for one.
> 

The first thing I tried was to unload the bcm2711_thermal module before
running modeset and it still hung, so I don't think that's the problem.

Noralf.
Maxime Ripard Sept. 7, 2022, 9:58 a.m. UTC | #5
On Mon, Sep 05, 2022 at 05:17:18PM +0200, Noralf Trønnes wrote:
> Den 05.09.2022 16.57, skrev Maxime Ripard:
> > On Fri, Sep 02, 2022 at 01:28:16PM +0200, Noralf Trønnes wrote:
> >>
> >>
> >> Den 01.09.2022 21.35, skrev Noralf Trønnes:
> >>>
> >>>
> >>> I have finally found a workaround for my kernel hangs.
> >>>
> >>> Dom had a look at my kernel and found that the VideoCore was fine, and
> >>> he said this:
> >>>
> >>>> That suggests cause of lockup was on arm side rather than VC side.
> >>>>
> >>>> But it's hard to diagnose further. Once you've had a peripheral not
> >>>> respond, the AXI bus locks up and no further operations are possible.
> >>>> Usual causes of this are required clocks being stopped or domains
> >>>> disabled and then trying to access the hardware.
> >>>>
> >>>
> >>> So when I got this on my 64-bit build:
> >>>
> >>> [  166.702171] SError Interrupt on CPU1, code 0x00000000bf000002 -- SError
> >>> [  166.702187] CPU: 1 PID: 8 Comm: kworker/u8:0 Tainted: G        W
> >>>     5.19.0-rc6-00096-gba7973977976-dirty #1
> >>> [  166.702200] Hardware name: Raspberry Pi 4 Model B Rev 1.1 (DT)
> >>> [  166.702206] Workqueue: events_freezable_power_ thermal_zone_device_check
> >>> [  166.702231] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS
> >>> BTYPE=--)
> >>> [  166.702242] pc : regmap_mmio_read32le+0x10/0x28
> >>> [  166.702261] lr : regmap_mmio_read+0x44/0x70
> >>> ...
> >>> [  166.702606]  bcm2711_get_temp+0x58/0xb0 [bcm2711_thermal]
> >>>
> >>> I wondered if that reg read was stalled due to a clock being stopped.
> >>>
> >>> Lo and behold, disabling runtime pm and keeping the vec clock running
> >>> all the time fixed it[1].
> >>>
> >>> I don't know what the problem is, but at least I can now test this patchset.
> >>>
> >>> [1] https://gist.github.com/notro/23b984e7fa05cfbda2db50a421cac065
> >>>
> >>
> >> It turns out I didn't have to disable runtime pm:
> >> https://gist.github.com/notro/0adcfcb12460b54e54458afe11dc8ea2
> > 
> > If the bcm2711_thermal IP needs that clock to be enabled, it should grab
> > a reference itself, but it looks like even the device tree binding
> > doesn't ask for one.
> > 
> 
> The first thing I tried was to unload the bcm2711_thermal module before
> running modeset and it still hung, so I don't think that's the problem.

Ack. Just to confirm, is this happening on mainline or on the downstream tree?

Maxime
Stefan Wahren Sept. 7, 2022, 10:36 a.m. UTC | #6
Hi Maxime,

Am 05.09.22 um 16:57 schrieb Maxime Ripard:
> On Fri, Sep 02, 2022 at 01:28:16PM +0200, Noralf Trønnes wrote:
>>
>> Den 01.09.2022 21.35, skrev Noralf Trønnes:
>>>
>>> I have finally found a workaround for my kernel hangs.
>>>
>>> Dom had a look at my kernel and found that the VideoCore was fine, and
>>> he said this:
>>>
>>>> That suggests cause of lockup was on arm side rather than VC side.
>>>>
>>>> But it's hard to diagnose further. Once you've had a peripheral not
>>>> respond, the AXI bus locks up and no further operations are possible.
>>>> Usual causes of this are required clocks being stopped or domains
>>>> disabled and then trying to access the hardware.
>>>>
>>> So when I got this on my 64-bit build:
>>>
>>> [  166.702171] SError Interrupt on CPU1, code 0x00000000bf000002 -- SError
>>> [  166.702187] CPU: 1 PID: 8 Comm: kworker/u8:0 Tainted: G        W
>>>      5.19.0-rc6-00096-gba7973977976-dirty #1
>>> [  166.702200] Hardware name: Raspberry Pi 4 Model B Rev 1.1 (DT)
>>> [  166.702206] Workqueue: events_freezable_power_ thermal_zone_device_check
>>> [  166.702231] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS
>>> BTYPE=--)
>>> [  166.702242] pc : regmap_mmio_read32le+0x10/0x28
>>> [  166.702261] lr : regmap_mmio_read+0x44/0x70
>>> ...
>>> [  166.702606]  bcm2711_get_temp+0x58/0xb0 [bcm2711_thermal]
>>>
>>> I wondered if that reg read was stalled due to a clock being stopped.
>>>
>>> Lo and behold, disabling runtime pm and keeping the vec clock running
>>> all the time fixed it[1].
>>>
>>> I don't know what the problem is, but at least I can now test this patchset.
>>>
>>> [1] https://gist.github.com/notro/23b984e7fa05cfbda2db50a421cac065
>>>
>> It turns out I didn't have to disable runtime pm:
>> https://gist.github.com/notro/0adcfcb12460b54e54458afe11dc8ea2
> If the bcm2711_thermal IP needs that clock to be enabled, it should grab
> a reference itself, but it looks like even the device tree binding
> doesn't ask for one.
The missing clock in the device tree binding is expected, because 
despite of the code there is not much information about the BCM2711 
clock tree. But i'm skeptical that the AVS IP actually needs the VEC 
clock, i think it's more likely that the VEC clock parent is changed and 
that cause this issue. I could take care of the bcm2711 binding & driver 
if i know which clock is really necessary.
>
> Maxime
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Noralf Trønnes Sept. 7, 2022, 10:56 a.m. UTC | #7
Den 07.09.2022 11.58, skrev Maxime Ripard:
> On Mon, Sep 05, 2022 at 05:17:18PM +0200, Noralf Trønnes wrote:
>> Den 05.09.2022 16.57, skrev Maxime Ripard:
>>> On Fri, Sep 02, 2022 at 01:28:16PM +0200, Noralf Trønnes wrote:
>>>>
>>>>
>>>> Den 01.09.2022 21.35, skrev Noralf Trønnes:
>>>>>
>>>>>
>>>>> I have finally found a workaround for my kernel hangs.
>>>>>
>>>>> Dom had a look at my kernel and found that the VideoCore was fine, and
>>>>> he said this:
>>>>>
>>>>>> That suggests cause of lockup was on arm side rather than VC side.
>>>>>>
>>>>>> But it's hard to diagnose further. Once you've had a peripheral not
>>>>>> respond, the AXI bus locks up and no further operations are possible.
>>>>>> Usual causes of this are required clocks being stopped or domains
>>>>>> disabled and then trying to access the hardware.
>>>>>>
>>>>>
>>>>> So when I got this on my 64-bit build:
>>>>>
>>>>> [  166.702171] SError Interrupt on CPU1, code 0x00000000bf000002 -- SError
>>>>> [  166.702187] CPU: 1 PID: 8 Comm: kworker/u8:0 Tainted: G        W
>>>>>     5.19.0-rc6-00096-gba7973977976-dirty #1
>>>>> [  166.702200] Hardware name: Raspberry Pi 4 Model B Rev 1.1 (DT)
>>>>> [  166.702206] Workqueue: events_freezable_power_ thermal_zone_device_check
>>>>> [  166.702231] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS
>>>>> BTYPE=--)
>>>>> [  166.702242] pc : regmap_mmio_read32le+0x10/0x28
>>>>> [  166.702261] lr : regmap_mmio_read+0x44/0x70
>>>>> ...
>>>>> [  166.702606]  bcm2711_get_temp+0x58/0xb0 [bcm2711_thermal]
>>>>>
>>>>> I wondered if that reg read was stalled due to a clock being stopped.
>>>>>
>>>>> Lo and behold, disabling runtime pm and keeping the vec clock running
>>>>> all the time fixed it[1].
>>>>>
>>>>> I don't know what the problem is, but at least I can now test this patchset.
>>>>>
>>>>> [1] https://gist.github.com/notro/23b984e7fa05cfbda2db50a421cac065
>>>>>
>>>>
>>>> It turns out I didn't have to disable runtime pm:
>>>> https://gist.github.com/notro/0adcfcb12460b54e54458afe11dc8ea2
>>>
>>> If the bcm2711_thermal IP needs that clock to be enabled, it should grab
>>> a reference itself, but it looks like even the device tree binding
>>> doesn't ask for one.
>>>
>>
>> The first thing I tried was to unload the bcm2711_thermal module before
>> running modeset and it still hung, so I don't think that's the problem.
> 
> Ack. Just to confirm, is this happening on mainline or on the downstream tree?
> 

It's mainline.

Noralf.
Noralf Trønnes Sept. 7, 2022, 4:44 p.m. UTC | #8
Den 07.09.2022 12.36, skrev Stefan Wahren:
> Hi Maxime,
> 
> Am 05.09.22 um 16:57 schrieb Maxime Ripard:
>> On Fri, Sep 02, 2022 at 01:28:16PM +0200, Noralf Trønnes wrote:
>>>
>>> Den 01.09.2022 21.35, skrev Noralf Trønnes:
>>>>
>>>> I have finally found a workaround for my kernel hangs.
>>>>
>>>> Dom had a look at my kernel and found that the VideoCore was fine, and
>>>> he said this:
>>>>
>>>>> That suggests cause of lockup was on arm side rather than VC side.
>>>>>
>>>>> But it's hard to diagnose further. Once you've had a peripheral not
>>>>> respond, the AXI bus locks up and no further operations are possible.
>>>>> Usual causes of this are required clocks being stopped or domains
>>>>> disabled and then trying to access the hardware.
>>>>>
>>>> So when I got this on my 64-bit build:
>>>>
>>>> [  166.702171] SError Interrupt on CPU1, code 0x00000000bf000002 --
>>>> SError
>>>> [  166.702187] CPU: 1 PID: 8 Comm: kworker/u8:0 Tainted: G        W
>>>>      5.19.0-rc6-00096-gba7973977976-dirty #1
>>>> [  166.702200] Hardware name: Raspberry Pi 4 Model B Rev 1.1 (DT)
>>>> [  166.702206] Workqueue: events_freezable_power_
>>>> thermal_zone_device_check
>>>> [  166.702231] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS
>>>> BTYPE=--)
>>>> [  166.702242] pc : regmap_mmio_read32le+0x10/0x28
>>>> [  166.702261] lr : regmap_mmio_read+0x44/0x70
>>>> ...
>>>> [  166.702606]  bcm2711_get_temp+0x58/0xb0 [bcm2711_thermal]
>>>>
>>>> I wondered if that reg read was stalled due to a clock being stopped.
>>>>
>>>> Lo and behold, disabling runtime pm and keeping the vec clock running
>>>> all the time fixed it[1].
>>>>
>>>> I don't know what the problem is, but at least I can now test this
>>>> patchset.
>>>>
>>>> [1] https://gist.github.com/notro/23b984e7fa05cfbda2db50a421cac065
>>>>
>>> It turns out I didn't have to disable runtime pm:
>>> https://gist.github.com/notro/0adcfcb12460b54e54458afe11dc8ea2
>> If the bcm2711_thermal IP needs that clock to be enabled, it should grab
>> a reference itself, but it looks like even the device tree binding
>> doesn't ask for one.
> The missing clock in the device tree binding is expected, because
> despite of the code there is not much information about the BCM2711
> clock tree. But i'm skeptical that the AVS IP actually needs the VEC
> clock, i think it's more likely that the VEC clock parent is changed and
> that cause this issue. I could take care of the bcm2711 binding & driver
> if i know which clock is really necessary.

Seems you're right, keeping the parent always enabled is enough:

	clk_prepare_enable(clk_get_parent(vec->clock)); // pllc_per

I tried enabling just the grandparent clock as well, but that didn't help.

Without the clock hack it seems the hang occurs when switching between
NTSC and PAL, at most I've been able to do that 4-5 times before it hangs.

For a while it looked like fbdev/fbcon had a play in this, but then I
realised that it just gave me a NTSC mode to start from and to go back
to when qutting modetest.

Noralf.
Noralf Trønnes Sept. 10, 2022, 3:34 p.m. UTC | #9
Den 07.09.2022 18.44, skrev Noralf Trønnes:
> 
> 
> Den 07.09.2022 12.36, skrev Stefan Wahren:
>> Hi Maxime,
>>
>> Am 05.09.22 um 16:57 schrieb Maxime Ripard:
>>> On Fri, Sep 02, 2022 at 01:28:16PM +0200, Noralf Trønnes wrote:
>>>>
>>>> Den 01.09.2022 21.35, skrev Noralf Trønnes:
>>>>>
>>>>> I have finally found a workaround for my kernel hangs.
>>>>>
>>>>> Dom had a look at my kernel and found that the VideoCore was fine, and
>>>>> he said this:
>>>>>
>>>>>> That suggests cause of lockup was on arm side rather than VC side.
>>>>>>
>>>>>> But it's hard to diagnose further. Once you've had a peripheral not
>>>>>> respond, the AXI bus locks up and no further operations are possible.
>>>>>> Usual causes of this are required clocks being stopped or domains
>>>>>> disabled and then trying to access the hardware.
>>>>>>
>>>>> So when I got this on my 64-bit build:
>>>>>
>>>>> [  166.702171] SError Interrupt on CPU1, code 0x00000000bf000002 --
>>>>> SError
>>>>> [  166.702187] CPU: 1 PID: 8 Comm: kworker/u8:0 Tainted: G        W
>>>>>      5.19.0-rc6-00096-gba7973977976-dirty #1
>>>>> [  166.702200] Hardware name: Raspberry Pi 4 Model B Rev 1.1 (DT)
>>>>> [  166.702206] Workqueue: events_freezable_power_
>>>>> thermal_zone_device_check
>>>>> [  166.702231] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS
>>>>> BTYPE=--)
>>>>> [  166.702242] pc : regmap_mmio_read32le+0x10/0x28
>>>>> [  166.702261] lr : regmap_mmio_read+0x44/0x70
>>>>> ...
>>>>> [  166.702606]  bcm2711_get_temp+0x58/0xb0 [bcm2711_thermal]
>>>>>
>>>>> I wondered if that reg read was stalled due to a clock being stopped.
>>>>>
>>>>> Lo and behold, disabling runtime pm and keeping the vec clock running
>>>>> all the time fixed it[1].
>>>>>
>>>>> I don't know what the problem is, but at least I can now test this
>>>>> patchset.
>>>>>
>>>>> [1] https://gist.github.com/notro/23b984e7fa05cfbda2db50a421cac065
>>>>>
>>>> It turns out I didn't have to disable runtime pm:
>>>> https://gist.github.com/notro/0adcfcb12460b54e54458afe11dc8ea2
>>> If the bcm2711_thermal IP needs that clock to be enabled, it should grab
>>> a reference itself, but it looks like even the device tree binding
>>> doesn't ask for one.
>> The missing clock in the device tree binding is expected, because
>> despite of the code there is not much information about the BCM2711
>> clock tree. But i'm skeptical that the AVS IP actually needs the VEC
>> clock, i think it's more likely that the VEC clock parent is changed and
>> that cause this issue. I could take care of the bcm2711 binding & driver
>> if i know which clock is really necessary.
> 
> Seems you're right, keeping the parent always enabled is enough:
> 
> 	clk_prepare_enable(clk_get_parent(vec->clock)); // pllc_per
> 
> I tried enabling just the grandparent clock as well, but that didn't help.
> 
> Without the clock hack it seems the hang occurs when switching between
> NTSC and PAL, at most I've been able to do that 4-5 times before it hangs.
> 
> For a while it looked like fbdev/fbcon had a play in this, but then I
> realised that it just gave me a NTSC mode to start from and to go back
> to when qutting modetest.
> 

I've looked some more into this problem and I see that downstream is
using a firmware clock for vec:

clk: Move vec clock to clk-raspberrypi
https://github.com/raspberrypi/linux/pull/4639

If I do the same my problem goes away.

It's interesting to note that on downstream 5.10.103-v7l+ #1530,
pllc_per is enabled even if tvout is not enabled:

$ sudo cat /sys/kernel/debug/clk/pllc_per/regdump
cm = 0x00000000
a2w = 0x00000004 (disable bit(8) is not set)

It's when mainline vc4_vec disables this vec parent clock that the crash
occurs.

Sidenote: Another downstream fw clock change with a vec reference[1]:


Another issue not related to the clock crash problem:

I assumed that unloading the vc4 module would release the clocks, but
this didn't happen.

When I looked at it I remembered that there's a catch in the DRM unplug
machinery when it comes to unloading a driver and the DRM disable hooks.

static void vc4_drm_unbind(struct device *dev)
{
	struct drm_device *drm = dev_get_drvdata(dev);

	drm_dev_unplug(drm);
	drm_atomic_helper_shutdown(drm);
}

Here the drm_device is first marked as unplugged and then the pipeline
is disabled. Since vc4_vec_encoder_disable() is protected by
drm_dev_enter() the VEC is not disabled, clocks are not released and PM
is left on.

In the drivers that I have written where the hardware is not expected to
have gone away on device unbind (SPI), I've just left out the
drm_dev_enter() check in the disable hook.

Noralf.

[1] https://github.com/raspberrypi/linux/pull/4706
Maxime Ripard Sept. 21, 2022, 2:03 p.m. UTC | #10
On Wed, Sep 07, 2022 at 06:44:53PM +0200, Noralf Trønnes wrote:
> 
> 
> Den 07.09.2022 12.36, skrev Stefan Wahren:
> > Hi Maxime,
> > 
> > Am 05.09.22 um 16:57 schrieb Maxime Ripard:
> >> On Fri, Sep 02, 2022 at 01:28:16PM +0200, Noralf Trønnes wrote:
> >>>
> >>> Den 01.09.2022 21.35, skrev Noralf Trønnes:
> >>>>
> >>>> I have finally found a workaround for my kernel hangs.
> >>>>
> >>>> Dom had a look at my kernel and found that the VideoCore was fine, and
> >>>> he said this:
> >>>>
> >>>>> That suggests cause of lockup was on arm side rather than VC side.
> >>>>>
> >>>>> But it's hard to diagnose further. Once you've had a peripheral not
> >>>>> respond, the AXI bus locks up and no further operations are possible.
> >>>>> Usual causes of this are required clocks being stopped or domains
> >>>>> disabled and then trying to access the hardware.
> >>>>>
> >>>> So when I got this on my 64-bit build:
> >>>>
> >>>> [  166.702171] SError Interrupt on CPU1, code 0x00000000bf000002 --
> >>>> SError
> >>>> [  166.702187] CPU: 1 PID: 8 Comm: kworker/u8:0 Tainted: G        W
> >>>>      5.19.0-rc6-00096-gba7973977976-dirty #1
> >>>> [  166.702200] Hardware name: Raspberry Pi 4 Model B Rev 1.1 (DT)
> >>>> [  166.702206] Workqueue: events_freezable_power_
> >>>> thermal_zone_device_check
> >>>> [  166.702231] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS
> >>>> BTYPE=--)
> >>>> [  166.702242] pc : regmap_mmio_read32le+0x10/0x28
> >>>> [  166.702261] lr : regmap_mmio_read+0x44/0x70
> >>>> ...
> >>>> [  166.702606]  bcm2711_get_temp+0x58/0xb0 [bcm2711_thermal]
> >>>>
> >>>> I wondered if that reg read was stalled due to a clock being stopped.
> >>>>
> >>>> Lo and behold, disabling runtime pm and keeping the vec clock running
> >>>> all the time fixed it[1].
> >>>>
> >>>> I don't know what the problem is, but at least I can now test this
> >>>> patchset.
> >>>>
> >>>> [1] https://gist.github.com/notro/23b984e7fa05cfbda2db50a421cac065
> >>>>
> >>> It turns out I didn't have to disable runtime pm:
> >>> https://gist.github.com/notro/0adcfcb12460b54e54458afe11dc8ea2
> >> If the bcm2711_thermal IP needs that clock to be enabled, it should grab
> >> a reference itself, but it looks like even the device tree binding
> >> doesn't ask for one.
> > The missing clock in the device tree binding is expected, because
> > despite of the code there is not much information about the BCM2711
> > clock tree. But i'm skeptical that the AVS IP actually needs the VEC
> > clock, i think it's more likely that the VEC clock parent is changed and
> > that cause this issue. I could take care of the bcm2711 binding & driver
> > if i know which clock is really necessary.
> 
> Seems you're right, keeping the parent always enabled is enough:
> 
> 	clk_prepare_enable(clk_get_parent(vec->clock)); // pllc_per
> 
> I tried enabling just the grandparent clock as well, but that didn't help.

Yeah, adding tracing to the clock framework shows that it indeed
disables PLLC_PER. So there's probably some other device that depends on
it but doesn't take a reference to it.

I had a look, but it's not really obvious what that might be.

This patch makes sure that the PLL*_PER are never disabled, could you
test it? It seems to work for me.


diff --git a/drivers/clk/bcm/clk-bcm2835.c b/drivers/clk/bcm/clk-bcm2835.c
index 48a1eb9f2d55..3839261ee419 100644
--- a/drivers/clk/bcm/clk-bcm2835.c
+++ b/drivers/clk/bcm/clk-bcm2835.c
@@ -1675,7 +1675,7 @@ static const struct bcm2835_clk_desc clk_desc_array[] = {
 		.load_mask = CM_PLLA_LOADPER,
 		.hold_mask = CM_PLLA_HOLDPER,
 		.fixed_divider = 1,
-		.flags = CLK_SET_RATE_PARENT),
+		.flags = CLK_IS_CRITICAL | CLK_SET_RATE_PARENT),
 	[BCM2835_PLLA_DSI0]	= REGISTER_PLL_DIV(
 		SOC_ALL,
 		.name = "plla_dsi0",
@@ -1784,7 +1784,7 @@ static const struct bcm2835_clk_desc clk_desc_array[] = {
 		.load_mask = CM_PLLC_LOADPER,
 		.hold_mask = CM_PLLC_HOLDPER,
 		.fixed_divider = 1,
-		.flags = CLK_SET_RATE_PARENT),
+		.flags = CLK_IS_CRITICAL | CLK_SET_RATE_PARENT),

 	/*
 	 * PLLD is the display PLL, used to drive DSI display panels.
@@ -1891,7 +1891,7 @@ static const struct bcm2835_clk_desc clk_desc_array[] = {
 		.load_mask = CM_PLLH_LOADAUX,
 		.hold_mask = 0,
 		.fixed_divider = 1,
-		.flags = CLK_SET_RATE_PARENT),
+		.flags = CLK_IS_CRITICAL | CLK_SET_RATE_PARENT),
 	[BCM2835_PLLH_PIX]	= REGISTER_PLL_DIV(
 		SOC_BCM2835,
 		.name = "pllh_pix",


Maxime
Noralf Trønnes Sept. 24, 2022, 3:33 p.m. UTC | #11
Den 21.09.2022 16.03, skrev Maxime Ripard:
> On Wed, Sep 07, 2022 at 06:44:53PM +0200, Noralf Trønnes wrote:
>>
>>
>> Den 07.09.2022 12.36, skrev Stefan Wahren:
>>> Hi Maxime,
>>>
>>> Am 05.09.22 um 16:57 schrieb Maxime Ripard:
>>>> On Fri, Sep 02, 2022 at 01:28:16PM +0200, Noralf Trønnes wrote:
>>>>>
>>>>> Den 01.09.2022 21.35, skrev Noralf Trønnes:
>>>>>>
>>>>>> I have finally found a workaround for my kernel hangs.
>>>>>>
>>>>>> Dom had a look at my kernel and found that the VideoCore was fine, and
>>>>>> he said this:
>>>>>>
>>>>>>> That suggests cause of lockup was on arm side rather than VC side.
>>>>>>>
>>>>>>> But it's hard to diagnose further. Once you've had a peripheral not
>>>>>>> respond, the AXI bus locks up and no further operations are possible.
>>>>>>> Usual causes of this are required clocks being stopped or domains
>>>>>>> disabled and then trying to access the hardware.
>>>>>>>
>>>>>> So when I got this on my 64-bit build:
>>>>>>
>>>>>> [  166.702171] SError Interrupt on CPU1, code 0x00000000bf000002 --
>>>>>> SError
>>>>>> [  166.702187] CPU: 1 PID: 8 Comm: kworker/u8:0 Tainted: G        W
>>>>>>      5.19.0-rc6-00096-gba7973977976-dirty #1
>>>>>> [  166.702200] Hardware name: Raspberry Pi 4 Model B Rev 1.1 (DT)
>>>>>> [  166.702206] Workqueue: events_freezable_power_
>>>>>> thermal_zone_device_check
>>>>>> [  166.702231] pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO -DIT -SSBS
>>>>>> BTYPE=--)
>>>>>> [  166.702242] pc : regmap_mmio_read32le+0x10/0x28
>>>>>> [  166.702261] lr : regmap_mmio_read+0x44/0x70
>>>>>> ...
>>>>>> [  166.702606]  bcm2711_get_temp+0x58/0xb0 [bcm2711_thermal]
>>>>>>
>>>>>> I wondered if that reg read was stalled due to a clock being stopped.
>>>>>>
>>>>>> Lo and behold, disabling runtime pm and keeping the vec clock running
>>>>>> all the time fixed it[1].
>>>>>>
>>>>>> I don't know what the problem is, but at least I can now test this
>>>>>> patchset.
>>>>>>
>>>>>> [1] https://gist.github.com/notro/23b984e7fa05cfbda2db50a421cac065
>>>>>>
>>>>> It turns out I didn't have to disable runtime pm:
>>>>> https://gist.github.com/notro/0adcfcb12460b54e54458afe11dc8ea2
>>>> If the bcm2711_thermal IP needs that clock to be enabled, it should grab
>>>> a reference itself, but it looks like even the device tree binding
>>>> doesn't ask for one.
>>> The missing clock in the device tree binding is expected, because
>>> despite of the code there is not much information about the BCM2711
>>> clock tree. But i'm skeptical that the AVS IP actually needs the VEC
>>> clock, i think it's more likely that the VEC clock parent is changed and
>>> that cause this issue. I could take care of the bcm2711 binding & driver
>>> if i know which clock is really necessary.
>>
>> Seems you're right, keeping the parent always enabled is enough:
>>
>> 	clk_prepare_enable(clk_get_parent(vec->clock)); // pllc_per
>>
>> I tried enabling just the grandparent clock as well, but that didn't help.
> 
> Yeah, adding tracing to the clock framework shows that it indeed
> disables PLLC_PER. So there's probably some other device that depends on
> it but doesn't take a reference to it.
> 
> I had a look, but it's not really obvious what that might be.
> 
> This patch makes sure that the PLL*_PER are never disabled, could you
> test it? It seems to work for me.
> 
> 
> diff --git a/drivers/clk/bcm/clk-bcm2835.c b/drivers/clk/bcm/clk-bcm2835.c
> index 48a1eb9f2d55..3839261ee419 100644
> --- a/drivers/clk/bcm/clk-bcm2835.c
> +++ b/drivers/clk/bcm/clk-bcm2835.c
> @@ -1675,7 +1675,7 @@ static const struct bcm2835_clk_desc clk_desc_array[] = {
>  		.load_mask = CM_PLLA_LOADPER,
>  		.hold_mask = CM_PLLA_HOLDPER,
>  		.fixed_divider = 1,
> -		.flags = CLK_SET_RATE_PARENT),
> +		.flags = CLK_IS_CRITICAL | CLK_SET_RATE_PARENT),
>  	[BCM2835_PLLA_DSI0]	= REGISTER_PLL_DIV(
>  		SOC_ALL,
>  		.name = "plla_dsi0",
> @@ -1784,7 +1784,7 @@ static const struct bcm2835_clk_desc clk_desc_array[] = {
>  		.load_mask = CM_PLLC_LOADPER,
>  		.hold_mask = CM_PLLC_HOLDPER,
>  		.fixed_divider = 1,
> -		.flags = CLK_SET_RATE_PARENT),
> +		.flags = CLK_IS_CRITICAL | CLK_SET_RATE_PARENT),
> 

Yes, this worked, but it's enough to mark pllc_per as critical.

Noralf.

>  	/*
>  	 * PLLD is the display PLL, used to drive DSI display panels.
> @@ -1891,7 +1891,7 @@ static const struct bcm2835_clk_desc clk_desc_array[] = {
>  		.load_mask = CM_PLLH_LOADAUX,
>  		.hold_mask = 0,
>  		.fixed_divider = 1,
> -		.flags = CLK_SET_RATE_PARENT),
> +		.flags = CLK_IS_CRITICAL | CLK_SET_RATE_PARENT),
>  	[BCM2835_PLLH_PIX]	= REGISTER_PLL_DIV(
>  		SOC_BCM2835,
>  		.name = "pllh_pix",
> 
> 
> Maxime