[v7,00/10] drm/i915: ttm for stolen

Message ID	20220620213340.3199605-1-bob.beckett@collabora.com (mailing list archive)
Headers	show Return-Path: <intel-gfx-bounces@lists.freedesktop.org> sender: bbeckett) by madras.collabora.co.uk (Postfix) with ESMTPSA id 0F13E6601703; Mon, 20 Jun 2022 22:34:10 +0100 (BST) From: Robert Beckett <bob.beckett@collabora.com> To: dri-devel@lists.freedesktop.org, intel-gfx@lists.freedesktop.org Date: Mon, 20 Jun 2022 21:33:30 +0000 Message-Id: <20220620213340.3199605-1-bob.beckett@collabora.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Subject: [Intel-gfx] [PATCH v7 00/10] drm/i915: ttm for stolen Precedence: list Cc: =?utf-8?q?Thomas_Hellstr=C3=B6m?= <thomas.hellstrom@linux.intel.com>, kernel@collabora.com, Matthew Auld <matthew.auld@intel.com> Errors-To: intel-gfx-bounces@lists.freedesktop.org Sender: "Intel-gfx" <intel-gfx-bounces@lists.freedesktop.org>
Series	drm/i915: ttm for stolen \| expand [v7,00/10] drm/i915: ttm for stolen [v7,01/10] drm/i915/ttm: dont trample cache_level overrides during ttm move [v7,02/10] drm/i915: limit ttm to dma32 for i965G[M] [v7,03/10] drm/i915/ttm: only trust snooping for dgfx when deciding default cache_level [v7,04/10] drm/i915/gem: selftest should not attempt mmap of private regions [v7,05/10] drm/i915: instantiate ttm ranger manager for stolen memory [v7,06/10] drm/i915: sanitize mem_flags for stolen buffers [v7,07/10] drm/i915: ttm move/clear logic fix [v7,08/10] drm/i915: allow memory region creators to alloc and free the region [v7,09/10] drm/i915/ttm: add buffer pin on alloc flag [v7,10/10] drm/i915: stolen memory use ttm backend

Bob Beckett June 20, 2022, 9:33 p.m. UTC

This series refactors i915's stolen memory region to use ttm.

v2:	handle disabled stolen similar to legacy version.
	relying on ttm to fail allocs works fine, but is dmesg noisy and causes testing
	dmesg warning regressions.

v3:	rebase to latest drm-tip.
	fix v2 code refactor which could leave a buffer pinned.
	locally passes fftl again now.

v4:	- Allow memory regions creators to do allocation. Allows stolen region to track
	  it's own reservations.
	- Pre-reserve first page of stolen mem (add back WaSkipStolenMemoryFirstPage:bdw+)
	- Improve commit descritpion for "drm/i915: sanitize mem_flags for stolen buffers"
	- replace i915_gem_object_pin_pages_unlocked() call with manual locking and pinning.
	  this avoids ww ctx class reuse during context creation -> ring vma obj alloc.

v5:	- detect both types of stolen as stolen buffers in
	  "drm/i915: sanitize mem_flags for stolen buffers"
	- in stolen_object_init limit page size to mem region minimum.
	  The range allocator expects the page_size to define the
	  alignment

v6:	- Share first 4 patches from ttm for internal series as generic
	  i915 ttm fixes
	- Drop patch 4 from v5. We don't need separate object ops just
	  to satisfy test interfaces. The tests have now been fixed via
	  checking whether the memory region is private to decide
	  whether to mmap
	- Add new buffer pin alloc flag to allow creation of buffers in
	  their final ttm placement instead of deferring until
	  get_pages. This fixes legacy fallback paths for buffer
	  allocations during stolen memory pressure.

v7: 	- fix mock_region_get_pages() to correctly handle I915_BO_INVALID_OFFSET

Robert Beckett (10):
  drm/i915/ttm: dont trample cache_level overrides during ttm move
  drm/i915: limit ttm to dma32 for i965G[M]
  drm/i915/ttm: only trust snooping for dgfx when deciding default
    cache_level
  drm/i915/gem: selftest should not attempt mmap of private regions
  drm/i915: instantiate ttm ranger manager for stolen memory
  drm/i915: sanitize mem_flags for stolen buffers
  drm/i915: ttm move/clear logic fix
  drm/i915: allow memory region creators to alloc and free the region
  drm/i915/ttm: add buffer pin on alloc flag
  drm/i915: stolen memory use ttm backend

 drivers/gpu/drm/i915/display/intel_fbc.c      |  78 ++--
 drivers/gpu/drm/i915/gem/i915_gem_object.c    |   1 +
 .../gpu/drm/i915/gem/i915_gem_object_types.h  |  16 +-
 drivers/gpu/drm/i915/gem/i915_gem_stolen.c    | 440 +++++++-----------
 drivers/gpu/drm/i915/gem/i915_gem_stolen.h    |  21 +-
 drivers/gpu/drm/i915/gem/i915_gem_ttm.c       |  29 +-
 drivers/gpu/drm/i915/gem/i915_gem_ttm.h       |   7 +
 drivers/gpu/drm/i915/gem/i915_gem_ttm_move.c  |  47 +-
 .../drm/i915/gem/selftests/i915_gem_mman.c    |   3 +
 drivers/gpu/drm/i915/gt/intel_rc6.c           |   4 +-
 drivers/gpu/drm/i915/gt/selftest_reset.c      |  16 +-
 drivers/gpu/drm/i915/i915_debugfs.c           |   7 +-
 drivers/gpu/drm/i915/i915_drv.h               |   5 -
 drivers/gpu/drm/i915/intel_memory_region.c    |  16 +-
 drivers/gpu/drm/i915/intel_memory_region.h    |   2 +
 drivers/gpu/drm/i915/intel_region_ttm.c       |  80 +++-
 drivers/gpu/drm/i915/intel_region_ttm.h       |   8 +-
 drivers/gpu/drm/i915/selftests/mock_region.c  |  12 +-
 18 files changed, 423 insertions(+), 369 deletions(-)

Bob Beckett June 21, 2022, 7:11 p.m. UTC | #1

On 21/06/2022 18:37, Patchwork wrote:
> *Patch Details*
> *Series:*	drm/i915: ttm for stolen (rev5)
> *URL:*	https://patchwork.freedesktop.org/series/101396/ 
> <https://patchwork.freedesktop.org/series/101396/>
> *State:*	failure
> *Details:* 
> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html 
> <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html>
> 
> 
>   CI Bug Log - changes from CI_DRM_11790 -> Patchwork_101396v5
> 
> 
>     Summary
> 
> *FAILURE*
> 
> Serious unknown changes coming with Patchwork_101396v5 absolutely need to be
> verified manually.
> 
> If you think the reported changes have nothing to do with the changes
> introduced in Patchwork_101396v5, please notify your bug team to allow them
> to document this new failure mode, which will reduce false positives in CI.
> 
> External URL: 
> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html
> 
> 
>     Participating hosts (40 -> 41)
> 
> Additional (2): fi-icl-u2 bat-dg2-9
> Missing (1): fi-bdw-samus
> 
> 
>     Possible new issues
> 
> Here are the unknown changes that may have been introduced in 
> Patchwork_101396v5:
> 
> 
>       IGT changes
> 
> 
>         Possible regressions
> 
>   * igt@i915_selftest@live@reset:
>       o bat-adlp-4: PASS
>         <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11790/bat-adlp-4/igt@i915_selftest@live@reset.html>
>         -> DMESG-FAIL
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/bat-adlp-4/igt@i915_selftest@live@reset.html>
> 

I keep hitting clobbered pages during engine resets on bat-adlp-4.
It seems to happen most of the time on that machine and occasionally on 
bat-adlp-6.

Should bat-adlp-4 be considered an unreliable machine like bat-adlp-6 is 
for now?

Alternatively, seeing the history of this in

commit 3da3c5c1c9825c24168f27b021339e90af37e969 "drm/i915: Exclude low 
pages (128KiB) of stolen from use"

could this be an indication that maybe the original issue is worse on 
adlp machines?
I have only ever seen page page 135 or 136 clobbered across many runs 
via trybot, so it looks fairly consistent.
Though excluding the use of over 540K of stolen might be too severe.


> 
>         Suppressed
> 
> The following results come from untrusted machines, tests, or statuses.
> They do not affect the overall result.
> 
>   * igt@kms_busy@basic@flip:
>       o {bat-dg2-9}: NOTRUN -> DMESG-WARN
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/bat-dg2-9/igt@kms_busy@basic@flip.html>
> 
> 
>     Known issues
> 
> Here are the changes found in Patchwork_101396v5 that come from known 
> issues:
> 
> 
>       IGT changes
> 
> 
>         Issues hit
> 
>   *
> 
>     igt@gem_huc_copy@huc-copy:
> 
>       o fi-icl-u2: NOTRUN -> SKIP
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/fi-icl-u2/igt@gem_huc_copy@huc-copy.html>
>         (i915#2190 <https://gitlab.freedesktop.org/drm/intel/issues/2190>)
>   *
> 
>     igt@gem_lmem_swapping@random-engines:
> 
>       o fi-icl-u2: NOTRUN -> SKIP
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/fi-icl-u2/igt@gem_lmem_swapping@random-engines.html>
>         (i915#4613
>         <https://gitlab.freedesktop.org/drm/intel/issues/4613>) +3
>         similar issues
>   *
> 
>     igt@i915_pm_rpm@module-reload:
> 
>       o bat-adlp-4: PASS
>         <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11790/bat-adlp-4/igt@i915_pm_rpm@module-reload.html>
>         -> DMESG-WARN
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/bat-adlp-4/igt@i915_pm_rpm@module-reload.html>
>         (i915#1888
>         <https://gitlab.freedesktop.org/drm/intel/issues/1888> /
>         i915#3576 <https://gitlab.freedesktop.org/drm/intel/issues/3576>)
>   *
> 
>     igt@i915_selftest@live@hangcheck:
> 
>       o bat-dg1-6: PASS
>         <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11790/bat-dg1-6/igt@i915_selftest@live@hangcheck.html>
>         -> DMESG-FAIL
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/bat-dg1-6/igt@i915_selftest@live@hangcheck.html>
>         (i915#4494
>         <https://gitlab.freedesktop.org/drm/intel/issues/4494> /
>         i915#4957 <https://gitlab.freedesktop.org/drm/intel/issues/4957>)
>   *
> 
>     igt@i915_suspend@basic-s3-without-i915:
> 
>       o fi-icl-u2: NOTRUN -> SKIP
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/fi-icl-u2/igt@i915_suspend@basic-s3-without-i915.html>
>         (i915#5903 <https://gitlab.freedesktop.org/drm/intel/issues/5903>)
>   *
> 
>     igt@kms_busy@basic@flip:
> 
>       o bat-adlp-4: PASS
>         <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11790/bat-adlp-4/igt@kms_busy@basic@flip.html>
>         -> DMESG-WARN
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/bat-adlp-4/igt@kms_busy@basic@flip.html>
>         (i915#3576 <https://gitlab.freedesktop.org/drm/intel/issues/3576>)
>   *
> 
>     igt@kms_chamelium@common-hpd-after-suspend:
> 
>       o
> 
>         fi-hsw-g3258: NOTRUN -> SKIP
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/fi-hsw-g3258/igt@kms_chamelium@common-hpd-after-suspend.html>
>         (fdo#109271
>         <https://bugs.freedesktop.org/show_bug.cgi?id=109271> /
>         fdo#111827 <https://bugs.freedesktop.org/show_bug.cgi?id=111827>)
> 
>       o
> 
>         fi-hsw-4770: NOTRUN -> SKIP
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/fi-hsw-4770/igt@kms_chamelium@common-hpd-after-suspend.html>
>         (fdo#109271
>         <https://bugs.freedesktop.org/show_bug.cgi?id=109271> /
>         fdo#111827 <https://bugs.freedesktop.org/show_bug.cgi?id=111827>)
> 
>       o
> 
>         fi-blb-e6850: NOTRUN -> SKIP
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/fi-blb-e6850/igt@kms_chamelium@common-hpd-after-suspend.html>
>         (fdo#109271 <https://bugs.freedesktop.org/show_bug.cgi?id=109271>)
> 
>       o
> 
>         fi-pnv-d510: NOTRUN -> SKIP
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/fi-pnv-d510/igt@kms_chamelium@common-hpd-after-suspend.html>
>         (fdo#109271 <https://bugs.freedesktop.org/show_bug.cgi?id=109271>)
> 
>   *
> 
>     igt@kms_chamelium@hdmi-hpd-fast:
> 
>       o fi-icl-u2: NOTRUN -> SKIP
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/fi-icl-u2/igt@kms_chamelium@hdmi-hpd-fast.html>
>         (fdo#111827
>         <https://bugs.freedesktop.org/show_bug.cgi?id=111827>) +8
>         similar issues
>   *
> 
>     igt@kms_cursor_legacy@basic-busy-flip-before-cursor-legacy:
> 
>       o fi-icl-u2: NOTRUN -> SKIP
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/fi-icl-u2/igt@kms_cursor_legacy@basic-busy-flip-before-cursor-legacy.html>
>         (fdo#109278
>         <https://bugs.freedesktop.org/show_bug.cgi?id=109278> /
>         i915#4103
>         <https://gitlab.freedesktop.org/drm/intel/issues/4103>) +1
>         similar issue
>   *
> 
>     igt@kms_flip@basic-flip-vs-modeset@a-edp1:
> 
>       o fi-tgl-u2: PASS
>         <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11790/fi-tgl-u2/igt@kms_flip@basic-flip-vs-modeset@a-edp1.html>
>         -> DMESG-WARN
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/fi-tgl-u2/igt@kms_flip@basic-flip-vs-modeset@a-edp1.html>
>         (i915#402 <https://gitlab.freedesktop.org/drm/intel/issues/402>)
>         +1 similar issue
>   *
> 
>     igt@kms_force_connector_basic@force-connector-state:
> 
>       o fi-icl-u2: NOTRUN -> WARN
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/fi-icl-u2/igt@kms_force_connector_basic@force-connector-state.html>
>         (i915#6008 <https://gitlab.freedesktop.org/drm/intel/issues/6008>)
>   *
> 
>     igt@kms_force_connector_basic@force-load-detect:
> 
>       o fi-icl-u2: NOTRUN -> SKIP
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/fi-icl-u2/igt@kms_force_connector_basic@force-load-detect.html>
>         (fdo#109285 <https://bugs.freedesktop.org/show_bug.cgi?id=109285>)
>   *
> 
>     igt@kms_pipe_crc_basic@compare-crc-sanitycheck-pipe-d:
> 
>       o fi-icl-u2: NOTRUN -> SKIP
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/fi-icl-u2/igt@kms_pipe_crc_basic@compare-crc-sanitycheck-pipe-d.html>
>         (fdo#109278 <https://bugs.freedesktop.org/show_bug.cgi?id=109278>)
>   *
> 
>     igt@kms_setmode@basic-clone-single-crtc:
> 
>       o fi-icl-u2: NOTRUN -> SKIP
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/fi-icl-u2/igt@kms_setmode@basic-clone-single-crtc.html>
>         (i915#3555 <https://gitlab.freedesktop.org/drm/intel/issues/3555>)
>   *
> 
>     igt@prime_vgem@basic-userptr:
> 
>       o fi-icl-u2: NOTRUN -> SKIP
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/fi-icl-u2/igt@prime_vgem@basic-userptr.html>
>         (fdo#109295
>         <https://bugs.freedesktop.org/show_bug.cgi?id=109295> /
>         i915#3301 <https://gitlab.freedesktop.org/drm/intel/issues/3301>)
>   *
> 
>     igt@runner@aborted:
> 
>       o fi-cfl-guc: NOTRUN -> FAIL
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/fi-cfl-guc/igt@runner@aborted.html>
>         (i915#4312 <https://gitlab.freedesktop.org/drm/intel/issues/4312>)
> 
> 
>         Possible fixes
> 
>   *
> 
>     igt@i915_selftest@live@hangcheck:
> 
>       o
> 
>         fi-hsw-4770: INCOMPLETE
>         <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11790/fi-hsw-4770/igt@i915_selftest@live@hangcheck.html>
>         (i915#3303
>         <https://gitlab.freedesktop.org/drm/intel/issues/3303> /
>         i915#4785
>         <https://gitlab.freedesktop.org/drm/intel/issues/4785>) -> PASS
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/fi-hsw-4770/igt@i915_selftest@live@hangcheck.html>
> 
>       o
> 
>         fi-hsw-g3258: INCOMPLETE
>         <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11790/fi-hsw-g3258/igt@i915_selftest@live@hangcheck.html>
>         (i915#3303
>         <https://gitlab.freedesktop.org/drm/intel/issues/3303> /
>         i915#4785
>         <https://gitlab.freedesktop.org/drm/intel/issues/4785>) -> PASS
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/fi-hsw-g3258/igt@i915_selftest@live@hangcheck.html>
> 
>   *
> 
>     igt@i915_selftest@live@requests:
> 
>       o
> 
>         fi-blb-e6850: DMESG-FAIL
>         <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11790/fi-blb-e6850/igt@i915_selftest@live@requests.html>
>         (i915#4528
>         <https://gitlab.freedesktop.org/drm/intel/issues/4528>) -> PASS
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/fi-blb-e6850/igt@i915_selftest@live@requests.html>
> 
>       o
> 
>         fi-pnv-d510: DMESG-FAIL
>         <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11790/fi-pnv-d510/igt@i915_selftest@live@requests.html>
>         (i915#4528
>         <https://gitlab.freedesktop.org/drm/intel/issues/4528>) -> PASS
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/fi-pnv-d510/igt@i915_selftest@live@requests.html>
> 
>   *
> 
>     igt@kms_flip@basic-flip-vs-modeset@a-edp1:
> 
>       o bat-adlp-4: DMESG-WARN
>         <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11790/bat-adlp-4/igt@kms_flip@basic-flip-vs-modeset@a-edp1.html>
>         (i915#3576
>         <https://gitlab.freedesktop.org/drm/intel/issues/3576>) -> PASS
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/bat-adlp-4/igt@kms_flip@basic-flip-vs-modeset@a-edp1.html>
>   *
> 
>     igt@kms_flip@basic-plain-flip@a-edp1:
> 
>       o {bat-adlp-6}: DMESG-WARN
>         <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11790/bat-adlp-6/igt@kms_flip@basic-plain-flip@a-edp1.html>
>         (i915#3576
>         <https://gitlab.freedesktop.org/drm/intel/issues/3576>) -> PASS
>         <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/bat-adlp-6/igt@kms_flip@basic-plain-flip@a-edp1.html>
> 
> {name}: This element is suppressed. This means it is ignored when computing
> the status of the difference (SUCCESS, WARNING, or FAILURE).
> 
> 
>     Build changes
> 
>   * Linux: CI_DRM_11790 -> Patchwork_101396v5
> 
> CI-20190529: 20190529
> CI_DRM_11790: a838c72d1e41413a5ace5e3cf2ada93c2e81f66b @ 
> git://anongit.freedesktop.org/gfx-ci/linux
> IGT_6537: 331658a8475c8b0c0f7ffe5268a7318ef83da34e @ 
> https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
> Patchwork_101396v5: a838c72d1e41413a5ace5e3cf2ada93c2e81f66b @ 
> git://anongit.freedesktop.org/gfx-ci/linux
> 
> 
>       Linux commits
> 
> cb4854cf1605 drm/i915: stolen memory use ttm backend
> 25a96de09969 drm/i915/ttm: add buffer pin on alloc flag
> ad58b349349b drm/i915: allow memory region creators to alloc and free 
> the region
> 785ff03c0542 drm/i915: ttm move/clear logic fix
> d0902500e074 drm/i915: sanitize mem_flags for stolen buffers
> 8a7630243d04 drm/i915: instantiate ttm ranger manager for stolen memory
> 055ef33c2ac8 drm/i915/gem: selftest should not attempt mmap of private 
> regions
> eb22d2a146a5 drm/i915/ttm: only trust snooping for dgfx when deciding 
> default cache_level
> f02995012a97 drm/i915: limit ttm to dma32 for i965G[M]
> 1770347378a1 drm/i915/ttm: dont trample cache_level overrides during ttm 
> move
>

Tvrtko Ursulin June 22, 2022, 9:05 a.m. UTC | #2

On 21/06/2022 20:11, Robert Beckett wrote:
> 
> 
> On 21/06/2022 18:37, Patchwork wrote:
>> *Patch Details*
>> *Series:*    drm/i915: ttm for stolen (rev5)
>> *URL:*    https://patchwork.freedesktop.org/series/101396/ 
>> <https://patchwork.freedesktop.org/series/101396/>
>> *State:*    failure
>> *Details:* 
>> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html 
>> <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html>
>>
>>
>>   CI Bug Log - changes from CI_DRM_11790 -> Patchwork_101396v5
>>
>>
>>     Summary
>>
>> *FAILURE*
>>
>> Serious unknown changes coming with Patchwork_101396v5 absolutely need 
>> to be
>> verified manually.
>>
>> If you think the reported changes have nothing to do with the changes
>> introduced in Patchwork_101396v5, please notify your bug team to allow 
>> them
>> to document this new failure mode, which will reduce false positives 
>> in CI.
>>
>> External URL: 
>> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html
>>
>>
>>     Participating hosts (40 -> 41)
>>
>> Additional (2): fi-icl-u2 bat-dg2-9
>> Missing (1): fi-bdw-samus
>>
>>
>>     Possible new issues
>>
>> Here are the unknown changes that may have been introduced in 
>> Patchwork_101396v5:
>>
>>
>>       IGT changes
>>
>>
>>         Possible regressions
>>
>>   * igt@i915_selftest@live@reset:
>>       o bat-adlp-4: PASS
>>         
>> <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11790/bat-adlp-4/igt@i915_selftest@live@reset.html> 
>>
>>         -> DMESG-FAIL
>>         
>> <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/bat-adlp-4/igt@i915_selftest@live@reset.html> 
>>
>>
> 
> I keep hitting clobbered pages during engine resets on bat-adlp-4.
> It seems to happen most of the time on that machine and occasionally on 
> bat-adlp-6.
> 
> Should bat-adlp-4 be considered an unreliable machine like bat-adlp-6 is 
> for now?
> 
> Alternatively, seeing the history of this in
> 
> commit 3da3c5c1c9825c24168f27b021339e90af37e969 "drm/i915: Exclude low 
> pages (128KiB) of stolen from use"
> 
> could this be an indication that maybe the original issue is worse on 
> adlp machines?
> I have only ever seen page page 135 or 136 clobbered across many runs 
> via trybot, so it looks fairly consistent.
> Though excluding the use of over 540K of stolen might be too severe.

Don't know but I see that on the latest version you even hit pages 165/166.

Any history of hitting this in CI without your series? If not, are there 
some other changes which could explain it? Are you touching the selftest 
itself?

Hexdump of the clobbered page looks quite complex. Especially 
POISON_FREE. Any idea how that ends up there?

Btw what is the benefit of converting stolen to start with? It's not 
much of a backend since it just uses the drm range manager. So quite 
thin and uneventful. Diffstats for the series also do not look like you 
end up with much code reduction?

Regards,

Tvrtko

Bob Beckett June 27, 2022, 5:08 p.m. UTC | #3

On 22/06/2022 10:05, Tvrtko Ursulin wrote:
> 
> On 21/06/2022 20:11, Robert Beckett wrote:
>>
>>
>> On 21/06/2022 18:37, Patchwork wrote:
>>> *Patch Details*
>>> *Series:*    drm/i915: ttm for stolen (rev5)
>>> *URL:*    https://patchwork.freedesktop.org/series/101396/ 
>>> <https://patchwork.freedesktop.org/series/101396/>
>>> *State:*    failure
>>> *Details:* 
>>> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html> 
>>>
>>>
>>>
>>>   CI Bug Log - changes from CI_DRM_11790 -> Patchwork_101396v5
>>>
>>>
>>>     Summary
>>>
>>> *FAILURE*
>>>
>>> Serious unknown changes coming with Patchwork_101396v5 absolutely 
>>> need to be
>>> verified manually.
>>>
>>> If you think the reported changes have nothing to do with the changes
>>> introduced in Patchwork_101396v5, please notify your bug team to 
>>> allow them
>>> to document this new failure mode, which will reduce false positives 
>>> in CI.
>>>
>>> External URL: 
>>> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html
>>>
>>>
>>>     Participating hosts (40 -> 41)
>>>
>>> Additional (2): fi-icl-u2 bat-dg2-9
>>> Missing (1): fi-bdw-samus
>>>
>>>
>>>     Possible new issues
>>>
>>> Here are the unknown changes that may have been introduced in 
>>> Patchwork_101396v5:
>>>
>>>
>>>       IGT changes
>>>
>>>
>>>         Possible regressions
>>>
>>>   * igt@i915_selftest@live@reset:
>>>       o bat-adlp-4: PASS
>>> <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11790/bat-adlp-4/igt@i915_selftest@live@reset.html> 
>>>
>>>         -> DMESG-FAIL
>>> <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/bat-adlp-4/igt@i915_selftest@live@reset.html> 
>>>
>>>
>>
>> I keep hitting clobbered pages during engine resets on bat-adlp-4.
>> It seems to happen most of the time on that machine and occasionally 
>> on bat-adlp-6.
>>
>> Should bat-adlp-4 be considered an unreliable machine like bat-adlp-6 
>> is for now?
>>
>> Alternatively, seeing the history of this in
>>
>> commit 3da3c5c1c9825c24168f27b021339e90af37e969 "drm/i915: Exclude low 
>> pages (128KiB) of stolen from use"
>>
>> could this be an indication that maybe the original issue is worse on 
>> adlp machines?
>> I have only ever seen page page 135 or 136 clobbered across many runs 
>> via trybot, so it looks fairly consistent.
>> Though excluding the use of over 540K of stolen might be too severe.
> 
> Don't know but I see that on the latest version you even hit pages 165/166.
> 
> Any history of hitting this in CI without your series? If not, are there 
> some other changes which could explain it? Are you touching the selftest 
> itself?
> 
> Hexdump of the clobbered page looks quite complex. Especially 
> POISON_FREE. Any idea how that ends up there?

(see 
https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_105517v4/fi-rkl-guc/igt@i915_selftest@live@reset.html#dmesg-warnings702)

after lots of slow debug via CI, it looks like the issue is that a ring 
buffer was allocated and taking up that page during the initial crc 
capture in the test, but by the time it came to check for corruption, it 
had been freed from that page.

The test has a number of weaknesses:

1. the busy check is done twice, without taking in to account any change 
in between. I assume previously this could be relied on never to occur, 
but now it can for some reason (more on that later)

2. the engine reset returns early with an error for guc submission 
engines, but it is silently ignored in the test. Perhaps it should 
ignore guc submission engines as it is a largely useless test for those 
situations.

A quick obvious fix is to have a busy bitmask that remembers each page's 
busy state initially and only check for corruption if it was busy during 
both checks.

However, the main question is why this is occurring now with my changes.
I have added more debug to check where the stolen memory is being freed, 
but the first run last night didn't hit the issue for once.
I am running again now, will report back if I figure out where it is 
being freed.

I am pretty sure the "corruption" (which isn't actually corruption) is 
from a ring buffer.
The POISON_FREE is the only difference between the captured before and 
after dumps:

[0040] 00000000 02800000 6b6b6b6b 6b6b6b6b 6b6b6b6b 6b6b6b6b 6b6b6b6b 
6b6b6b6b

with the 2nd dword being the MI_ARB_CHECK used for the spinner.
I think this is the request poisoning from i915_request_retire()

The bit I don't know yet is why a ring buffer was freed between the 
initial crc capture and the corruption check. The spinner should be 
active across the entire test, maintaining a ref on the context and it's 
ring.

hopefully my latest debug will give more answers.

> 
> Btw what is the benefit of converting stolen to start with? It's not 
> much of a backend since it just uses the drm range manager. So quite 
> thin and uneventful. Diffstats for the series also do not look like you 
> end up with much code reduction?
> 
> Regards,
> 
> Tvrtko

Tvrtko Ursulin June 28, 2022, 8:46 a.m. UTC | #4

On 27/06/2022 18:08, Robert Beckett wrote:
> 
> 
> On 22/06/2022 10:05, Tvrtko Ursulin wrote:
>>
>> On 21/06/2022 20:11, Robert Beckett wrote:
>>>
>>>
>>> On 21/06/2022 18:37, Patchwork wrote:
>>>> *Patch Details*
>>>> *Series:*    drm/i915: ttm for stolen (rev5)
>>>> *URL:*    https://patchwork.freedesktop.org/series/101396/ 
>>>> <https://patchwork.freedesktop.org/series/101396/>
>>>> *State:*    failure
>>>> *Details:* 
>>>> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html 
>>>> <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html> 
>>>>
>>>>
>>>>
>>>>   CI Bug Log - changes from CI_DRM_11790 -> Patchwork_101396v5
>>>>
>>>>
>>>>     Summary
>>>>
>>>> *FAILURE*
>>>>
>>>> Serious unknown changes coming with Patchwork_101396v5 absolutely 
>>>> need to be
>>>> verified manually.
>>>>
>>>> If you think the reported changes have nothing to do with the changes
>>>> introduced in Patchwork_101396v5, please notify your bug team to 
>>>> allow them
>>>> to document this new failure mode, which will reduce false positives 
>>>> in CI.
>>>>
>>>> External URL: 
>>>> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html
>>>>
>>>>
>>>>     Participating hosts (40 -> 41)
>>>>
>>>> Additional (2): fi-icl-u2 bat-dg2-9
>>>> Missing (1): fi-bdw-samus
>>>>
>>>>
>>>>     Possible new issues
>>>>
>>>> Here are the unknown changes that may have been introduced in 
>>>> Patchwork_101396v5:
>>>>
>>>>
>>>>       IGT changes
>>>>
>>>>
>>>>         Possible regressions
>>>>
>>>>   * igt@i915_selftest@live@reset:
>>>>       o bat-adlp-4: PASS
>>>> <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11790/bat-adlp-4/igt@i915_selftest@live@reset.html> 
>>>>
>>>>         -> DMESG-FAIL
>>>> <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/bat-adlp-4/igt@i915_selftest@live@reset.html> 
>>>>
>>>>
>>>
>>> I keep hitting clobbered pages during engine resets on bat-adlp-4.
>>> It seems to happen most of the time on that machine and occasionally 
>>> on bat-adlp-6.
>>>
>>> Should bat-adlp-4 be considered an unreliable machine like bat-adlp-6 
>>> is for now?
>>>
>>> Alternatively, seeing the history of this in
>>>
>>> commit 3da3c5c1c9825c24168f27b021339e90af37e969 "drm/i915: Exclude 
>>> low pages (128KiB) of stolen from use"
>>>
>>> could this be an indication that maybe the original issue is worse on 
>>> adlp machines?
>>> I have only ever seen page page 135 or 136 clobbered across many runs 
>>> via trybot, so it looks fairly consistent.
>>> Though excluding the use of over 540K of stolen might be too severe.
>>
>> Don't know but I see that on the latest version you even hit pages 
>> 165/166.
>>
>> Any history of hitting this in CI without your series? If not, are 
>> there some other changes which could explain it? Are you touching the 
>> selftest itself?
>>
>> Hexdump of the clobbered page looks quite complex. Especially 
>> POISON_FREE. Any idea how that ends up there?
> 
> 
> (see 
> https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_105517v4/fi-rkl-guc/igt@i915_selftest@live@reset.html#dmesg-warnings702) 
> 
> 
> after lots of slow debug via CI, it looks like the issue is that a ring 
> buffer was allocated and taking up that page during the initial crc 
> capture in the test, but by the time it came to check for corruption, it 
> had been freed from that page.
> 
> The test has a number of weaknesses:
> 
> 1. the busy check is done twice, without taking in to account any change 
> in between. I assume previously this could be relied on never to occur, 
> but now it can for some reason (more on that later)

You mean the stolen page used/unused test? Probably the premise is that 
the test controls the driver completely ie. is the sole user and the two 
checks are run at the time where nothing else could have changed the state.

With the nerfed request (as with GuC) this actually should hold. In the 
generic case I am less sure, my working knowledge faded a bit, but 
perhaps there was something guaranteeing the spinner couldn't have been 
retired yet at the time of the second check. Would need clarifying at 
least in comments.
> 
> 2. the engine reset returns early with an error for guc submission 
> engines, but it is silently ignored in the test. Perhaps it should 
> ignore guc submission engines as it is a largely useless test for those 
> situations.

Yes looks dodgy indeed. You will need to summon the owners of the GuC 
backend to comment on this.

However even if the test should be skipped with GuC it is extremely 
interesting that you are hitting this so I suspect there is a more 
serious issue at play.

> A quick obvious fix is to have a busy bitmask that remembers each page's 
> busy state initially and only check for corruption if it was busy during 
> both checks.
> 
> However, the main question is why this is occurring now with my changes.
> I have added more debug to check where the stolen memory is being freed, 
> but the first run last night didn't hit the issue for once.
> I am running again now, will report back if I figure out where it is 
> being freed.
> 
> I am pretty sure the "corruption" (which isn't actually corruption) is 
> from a ring buffer.
> The POISON_FREE is the only difference between the captured before and 
> after dumps:
> 
> [0040] 00000000 02800000 6b6b6b6b 6b6b6b6b 6b6b6b6b 6b6b6b6b 6b6b6b6b 
> 6b6b6b6b
> 
> with the 2nd dword being the MI_ARB_CHECK used for the spinner.
> I think this is the request poisoning from i915_request_retire()
> 
> The bit I don't know yet is why a ring buffer was freed between the 
> initial crc capture and the corruption check. The spinner should be 
> active across the entire test, maintaining a ref on the context and it's 
> ring.
> 
> hopefully my latest debug will give more answers.

Yeah if you can figure our whether the a) spinner is still active during 
the 2nd check (as I think it should be), and b) is the corruption 
detected in the same pages which were used in the 1st pass that would be 
interesting.

Regards,

Tvrtko

> 
> 
>>
>> Btw what is the benefit of converting stolen to start with? It's not 
>> much of a backend since it just uses the drm range manager. So quite 
>> thin and uneventful. Diffstats for the series also do not look like 
>> you end up with much code reduction?
>>
>> Regards,
>>
>> Tvrtko

Bob Beckett June 28, 2022, 4:22 p.m. UTC | #5

On 28/06/2022 09:46, Tvrtko Ursulin wrote:
> 
> On 27/06/2022 18:08, Robert Beckett wrote:
>>
>>
>> On 22/06/2022 10:05, Tvrtko Ursulin wrote:
>>>
>>> On 21/06/2022 20:11, Robert Beckett wrote:
>>>>
>>>>
>>>> On 21/06/2022 18:37, Patchwork wrote:
>>>>> *Patch Details*
>>>>> *Series:*    drm/i915: ttm for stolen (rev5)
>>>>> *URL:*    https://patchwork.freedesktop.org/series/101396/ 
>>>>> <https://patchwork.freedesktop.org/series/101396/>
>>>>> *State:*    failure
>>>>> *Details:* 
>>>>> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html 
>>>>> <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html> 
>>>>>
>>>>>
>>>>>
>>>>>   CI Bug Log - changes from CI_DRM_11790 -> Patchwork_101396v5
>>>>>
>>>>>
>>>>>     Summary
>>>>>
>>>>> *FAILURE*
>>>>>
>>>>> Serious unknown changes coming with Patchwork_101396v5 absolutely 
>>>>> need to be
>>>>> verified manually.
>>>>>
>>>>> If you think the reported changes have nothing to do with the changes
>>>>> introduced in Patchwork_101396v5, please notify your bug team to 
>>>>> allow them
>>>>> to document this new failure mode, which will reduce false 
>>>>> positives in CI.
>>>>>
>>>>> External URL: 
>>>>> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html
>>>>>
>>>>>
>>>>>     Participating hosts (40 -> 41)
>>>>>
>>>>> Additional (2): fi-icl-u2 bat-dg2-9
>>>>> Missing (1): fi-bdw-samus
>>>>>
>>>>>
>>>>>     Possible new issues
>>>>>
>>>>> Here are the unknown changes that may have been introduced in 
>>>>> Patchwork_101396v5:
>>>>>
>>>>>
>>>>>       IGT changes
>>>>>
>>>>>
>>>>>         Possible regressions
>>>>>
>>>>>   * igt@i915_selftest@live@reset:
>>>>>       o bat-adlp-4: PASS
>>>>> <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11790/bat-adlp-4/igt@i915_selftest@live@reset.html> 
>>>>>
>>>>>         -> DMESG-FAIL
>>>>> <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/bat-adlp-4/igt@i915_selftest@live@reset.html> 
>>>>>
>>>>>
>>>>
>>>> I keep hitting clobbered pages during engine resets on bat-adlp-4.
>>>> It seems to happen most of the time on that machine and occasionally 
>>>> on bat-adlp-6.
>>>>
>>>> Should bat-adlp-4 be considered an unreliable machine like 
>>>> bat-adlp-6 is for now?
>>>>
>>>> Alternatively, seeing the history of this in
>>>>
>>>> commit 3da3c5c1c9825c24168f27b021339e90af37e969 "drm/i915: Exclude 
>>>> low pages (128KiB) of stolen from use"
>>>>
>>>> could this be an indication that maybe the original issue is worse 
>>>> on adlp machines?
>>>> I have only ever seen page page 135 or 136 clobbered across many 
>>>> runs via trybot, so it looks fairly consistent.
>>>> Though excluding the use of over 540K of stolen might be too severe.
>>>
>>> Don't know but I see that on the latest version you even hit pages 
>>> 165/166.
>>>
>>> Any history of hitting this in CI without your series? If not, are 
>>> there some other changes which could explain it? Are you touching the 
>>> selftest itself?
>>>
>>> Hexdump of the clobbered page looks quite complex. Especially 
>>> POISON_FREE. Any idea how that ends up there?
>>
>>
>> (see 
>> https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_105517v4/fi-rkl-guc/igt@i915_selftest@live@reset.html#dmesg-warnings702) 
>>
>>
>> after lots of slow debug via CI, it looks like the issue is that a 
>> ring buffer was allocated and taking up that page during the initial 
>> crc capture in the test, but by the time it came to check for 
>> corruption, it had been freed from that page.
>>
>> The test has a number of weaknesses:
>>
>> 1. the busy check is done twice, without taking in to account any 
>> change in between. I assume previously this could be relied on never 
>> to occur, but now it can for some reason (more on that later)
> 
> You mean the stolen page used/unused test? Probably the premise is that 
> the test controls the driver completely ie. is the sole user and the two 
> checks are run at the time where nothing else could have changed the state.
> 
> With the nerfed request (as with GuC) this actually should hold. In the 
> generic case I am less sure, my working knowledge faded a bit, but 
> perhaps there was something guaranteeing the spinner couldn't have been 
> retired yet at the time of the second check. Would need clarifying at 
> least in comments.
>>
>> 2. the engine reset returns early with an error for guc submission 
>> engines, but it is silently ignored in the test. Perhaps it should 
>> ignore guc submission engines as it is a largely useless test for 
>> those situations.
> 
> Yes looks dodgy indeed. You will need to summon the owners of the GuC 
> backend to comment on this.
> 
> However even if the test should be skipped with GuC it is extremely 
> interesting that you are hitting this so I suspect there is a more 
> serious issue at play.

indeed. That's why I am keen to get to the root cause instead of just 
slapping in a fix.

> 
>> A quick obvious fix is to have a busy bitmask that remembers each 
>> page's busy state initially and only check for corruption if it was 
>> busy during both checks.
>>
>> However, the main question is why this is occurring now with my changes.
>> I have added more debug to check where the stolen memory is being 
>> freed, but the first run last night didn't hit the issue for once.
>> I am running again now, will report back if I figure out where it is 
>> being freed.
>>
>> I am pretty sure the "corruption" (which isn't actually corruption) is 
>> from a ring buffer.
>> The POISON_FREE is the only difference between the captured before and 
>> after dumps:
>>
>> [0040] 00000000 02800000 6b6b6b6b 6b6b6b6b 6b6b6b6b 6b6b6b6b 6b6b6b6b 
>> 6b6b6b6b
>>
>> with the 2nd dword being the MI_ARB_CHECK used for the spinner.
>> I think this is the request poisoning from i915_request_retire()
>>
>> The bit I don't know yet is why a ring buffer was freed between the 
>> initial crc capture and the corruption check. The spinner should be 
>> active across the entire test, maintaining a ref on the context and 
>> it's ring.
>>
>> hopefully my latest debug will give more answers.
> 
> Yeah if you can figure our whether the a) spinner is still active during 
> the 2nd check (as I think it should be), and b) is the corruption 
> detected in the same pages which were used in the 1st pass that would be 
> interesting.

yep. The latest run is still stuck in the CI queue after 27 hours.
I think I have enough debug in there to catch it now.
Hopefully I can get a root cause once it gets chance to run.


> 
> Regards,
> 
> Tvrtko
> 
>>
>>
>>>
>>> Btw what is the benefit of converting stolen to start with? It's not 
>>> much of a backend since it just uses the drm range manager. So quite 
>>> thin and uneventful. Diffstats for the series also do not look like 
>>> you end up with much code reduction?
>>>
>>> Regards,
>>>
>>> Tvrtko

Bob Beckett June 29, 2022, 12:51 p.m. UTC | #6

On 28/06/2022 17:22, Robert Beckett wrote:
> 
> 
> On 28/06/2022 09:46, Tvrtko Ursulin wrote:
>>
>> On 27/06/2022 18:08, Robert Beckett wrote:
>>>
>>>
>>> On 22/06/2022 10:05, Tvrtko Ursulin wrote:
>>>>
>>>> On 21/06/2022 20:11, Robert Beckett wrote:
>>>>>
>>>>>
>>>>> On 21/06/2022 18:37, Patchwork wrote:
>>>>>> *Patch Details*
>>>>>> *Series:*    drm/i915: ttm for stolen (rev5)
>>>>>> *URL:*    https://patchwork.freedesktop.org/series/101396/ 
>>>>>> <https://patchwork.freedesktop.org/series/101396/>
>>>>>> *State:*    failure
>>>>>> *Details:* 
>>>>>> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html 
>>>>>> <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html> 
>>>>>>
>>>>>>
>>>>>>
>>>>>>   CI Bug Log - changes from CI_DRM_11790 -> Patchwork_101396v5
>>>>>>
>>>>>>
>>>>>>     Summary
>>>>>>
>>>>>> *FAILURE*
>>>>>>
>>>>>> Serious unknown changes coming with Patchwork_101396v5 absolutely 
>>>>>> need to be
>>>>>> verified manually.
>>>>>>
>>>>>> If you think the reported changes have nothing to do with the changes
>>>>>> introduced in Patchwork_101396v5, please notify your bug team to 
>>>>>> allow them
>>>>>> to document this new failure mode, which will reduce false 
>>>>>> positives in CI.
>>>>>>
>>>>>> External URL: 
>>>>>> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html 
>>>>>>
>>>>>>
>>>>>>
>>>>>>     Participating hosts (40 -> 41)
>>>>>>
>>>>>> Additional (2): fi-icl-u2 bat-dg2-9
>>>>>> Missing (1): fi-bdw-samus
>>>>>>
>>>>>>
>>>>>>     Possible new issues
>>>>>>
>>>>>> Here are the unknown changes that may have been introduced in 
>>>>>> Patchwork_101396v5:
>>>>>>
>>>>>>
>>>>>>       IGT changes
>>>>>>
>>>>>>
>>>>>>         Possible regressions
>>>>>>
>>>>>>   * igt@i915_selftest@live@reset:
>>>>>>       o bat-adlp-4: PASS
>>>>>> <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11790/bat-adlp-4/igt@i915_selftest@live@reset.html> 
>>>>>>
>>>>>>         -> DMESG-FAIL
>>>>>> <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/bat-adlp-4/igt@i915_selftest@live@reset.html> 
>>>>>>
>>>>>>
>>>>>
>>>>> I keep hitting clobbered pages during engine resets on bat-adlp-4.
>>>>> It seems to happen most of the time on that machine and 
>>>>> occasionally on bat-adlp-6.
>>>>>
>>>>> Should bat-adlp-4 be considered an unreliable machine like 
>>>>> bat-adlp-6 is for now?
>>>>>
>>>>> Alternatively, seeing the history of this in
>>>>>
>>>>> commit 3da3c5c1c9825c24168f27b021339e90af37e969 "drm/i915: Exclude 
>>>>> low pages (128KiB) of stolen from use"
>>>>>
>>>>> could this be an indication that maybe the original issue is worse 
>>>>> on adlp machines?
>>>>> I have only ever seen page page 135 or 136 clobbered across many 
>>>>> runs via trybot, so it looks fairly consistent.
>>>>> Though excluding the use of over 540K of stolen might be too severe.
>>>>
>>>> Don't know but I see that on the latest version you even hit pages 
>>>> 165/166.
>>>>
>>>> Any history of hitting this in CI without your series? If not, are 
>>>> there some other changes which could explain it? Are you touching 
>>>> the selftest itself?
>>>>
>>>> Hexdump of the clobbered page looks quite complex. Especially 
>>>> POISON_FREE. Any idea how that ends up there?
>>>
>>>
>>> (see 
>>> https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_105517v4/fi-rkl-guc/igt@i915_selftest@live@reset.html#dmesg-warnings702) 
>>>
>>>
>>> after lots of slow debug via CI, it looks like the issue is that a 
>>> ring buffer was allocated and taking up that page during the initial 
>>> crc capture in the test, but by the time it came to check for 
>>> corruption, it had been freed from that page.
>>>
>>> The test has a number of weaknesses:
>>>
>>> 1. the busy check is done twice, without taking in to account any 
>>> change in between. I assume previously this could be relied on never 
>>> to occur, but now it can for some reason (more on that later)
>>
>> You mean the stolen page used/unused test? Probably the premise is 
>> that the test controls the driver completely ie. is the sole user and 
>> the two checks are run at the time where nothing else could have 
>> changed the state.
>>
>> With the nerfed request (as with GuC) this actually should hold. In 
>> the generic case I am less sure, my working knowledge faded a bit, but 
>> perhaps there was something guaranteeing the spinner couldn't have 
>> been retired yet at the time of the second check. Would need 
>> clarifying at least in comments.
>>>
>>> 2. the engine reset returns early with an error for guc submission 
>>> engines, but it is silently ignored in the test. Perhaps it should 
>>> ignore guc submission engines as it is a largely useless test for 
>>> those situations.
>>
>> Yes looks dodgy indeed. You will need to summon the owners of the GuC 
>> backend to comment on this.
>>
>> However even if the test should be skipped with GuC it is extremely 
>> interesting that you are hitting this so I suspect there is a more 
>> serious issue at play.
> 
> indeed. That's why I am keen to get to the root cause instead of just 
> slapping in a fix.
> 
>>
>>> A quick obvious fix is to have a busy bitmask that remembers each 
>>> page's busy state initially and only check for corruption if it was 
>>> busy during both checks.
>>>
>>> However, the main question is why this is occurring now with my changes.
>>> I have added more debug to check where the stolen memory is being 
>>> freed, but the first run last night didn't hit the issue for once.
>>> I am running again now, will report back if I figure out where it is 
>>> being freed.
>>>
>>> I am pretty sure the "corruption" (which isn't actually corruption) 
>>> is from a ring buffer.
>>> The POISON_FREE is the only difference between the captured before 
>>> and after dumps:
>>>
>>> [0040] 00000000 02800000 6b6b6b6b 6b6b6b6b 6b6b6b6b 6b6b6b6b 6b6b6b6b 
>>> 6b6b6b6b
>>>
>>> with the 2nd dword being the MI_ARB_CHECK used for the spinner.
>>> I think this is the request poisoning from i915_request_retire()
>>>
>>> The bit I don't know yet is why a ring buffer was freed between the 
>>> initial crc capture and the corruption check. The spinner should be 
>>> active across the entire test, maintaining a ref on the context and 
>>> it's ring.
>>>
>>> hopefully my latest debug will give more answers.
>>
>> Yeah if you can figure our whether the a) spinner is still active 
>> during the 2nd check (as I think it should be), and b) is the 
>> corruption detected in the same pages which were used in the 1st pass 
>> that would be interesting.
> 
> yep. The latest run is still stuck in the CI queue after 27 hours.
> I think I have enough debug in there to catch it now.
> Hopefully I can get a root cause once it gets chance to run.
> 

https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_105517v7/fi-adl-ddr5/igt@i915_selftest@live@reset.html#dmesg-warnings496

well, the run finally happened.
And it shows that the freed resource happens from a workqueue. Not helpful.

I'll now add a saved stack traces to all objects that saves where it is 
allocated and freed/queued for free.


> 
>>
>> Regards,
>>
>> Tvrtko
>>
>>>
>>>
>>>>
>>>> Btw what is the benefit of converting stolen to start with? It's not 
>>>> much of a backend since it just uses the drm range manager. So quite 
>>>> thin and uneventful. Diffstats for the series also do not look like 
>>>> you end up with much code reduction?
>>>>
>>>> Regards,
>>>>
>>>> Tvrtko

Bob Beckett June 30, 2022, 2:20 p.m. UTC | #7

On 29/06/2022 13:51, Robert Beckett wrote:
> 
> 
> On 28/06/2022 17:22, Robert Beckett wrote:
>>
>>
>> On 28/06/2022 09:46, Tvrtko Ursulin wrote:
>>>
>>> On 27/06/2022 18:08, Robert Beckett wrote:
>>>>
>>>>
>>>> On 22/06/2022 10:05, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 21/06/2022 20:11, Robert Beckett wrote:
>>>>>>
>>>>>>
>>>>>> On 21/06/2022 18:37, Patchwork wrote:
>>>>>>> *Patch Details*
>>>>>>> *Series:*    drm/i915: ttm for stolen (rev5)
>>>>>>> *URL:*    https://patchwork.freedesktop.org/series/101396/ 
>>>>>>> <https://patchwork.freedesktop.org/series/101396/>
>>>>>>> *State:*    failure
>>>>>>> *Details:* 
>>>>>>> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html 
>>>>>>> <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html> 
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>   CI Bug Log - changes from CI_DRM_11790 -> Patchwork_101396v5
>>>>>>>
>>>>>>>
>>>>>>>     Summary
>>>>>>>
>>>>>>> *FAILURE*
>>>>>>>
>>>>>>> Serious unknown changes coming with Patchwork_101396v5 absolutely 
>>>>>>> need to be
>>>>>>> verified manually.
>>>>>>>
>>>>>>> If you think the reported changes have nothing to do with the 
>>>>>>> changes
>>>>>>> introduced in Patchwork_101396v5, please notify your bug team to 
>>>>>>> allow them
>>>>>>> to document this new failure mode, which will reduce false 
>>>>>>> positives in CI.
>>>>>>>
>>>>>>> External URL: 
>>>>>>> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html 
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>     Participating hosts (40 -> 41)
>>>>>>>
>>>>>>> Additional (2): fi-icl-u2 bat-dg2-9
>>>>>>> Missing (1): fi-bdw-samus
>>>>>>>
>>>>>>>
>>>>>>>     Possible new issues
>>>>>>>
>>>>>>> Here are the unknown changes that may have been introduced in 
>>>>>>> Patchwork_101396v5:
>>>>>>>
>>>>>>>
>>>>>>>       IGT changes
>>>>>>>
>>>>>>>
>>>>>>>         Possible regressions
>>>>>>>
>>>>>>>   * igt@i915_selftest@live@reset:
>>>>>>>       o bat-adlp-4: PASS
>>>>>>> <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11790/bat-adlp-4/igt@i915_selftest@live@reset.html> 
>>>>>>>
>>>>>>>         -> DMESG-FAIL
>>>>>>> <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/bat-adlp-4/igt@i915_selftest@live@reset.html> 
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> I keep hitting clobbered pages during engine resets on bat-adlp-4.
>>>>>> It seems to happen most of the time on that machine and 
>>>>>> occasionally on bat-adlp-6.
>>>>>>
>>>>>> Should bat-adlp-4 be considered an unreliable machine like 
>>>>>> bat-adlp-6 is for now?
>>>>>>
>>>>>> Alternatively, seeing the history of this in
>>>>>>
>>>>>> commit 3da3c5c1c9825c24168f27b021339e90af37e969 "drm/i915: Exclude 
>>>>>> low pages (128KiB) of stolen from use"
>>>>>>
>>>>>> could this be an indication that maybe the original issue is worse 
>>>>>> on adlp machines?
>>>>>> I have only ever seen page page 135 or 136 clobbered across many 
>>>>>> runs via trybot, so it looks fairly consistent.
>>>>>> Though excluding the use of over 540K of stolen might be too severe.
>>>>>
>>>>> Don't know but I see that on the latest version you even hit pages 
>>>>> 165/166.
>>>>>
>>>>> Any history of hitting this in CI without your series? If not, are 
>>>>> there some other changes which could explain it? Are you touching 
>>>>> the selftest itself?
>>>>>
>>>>> Hexdump of the clobbered page looks quite complex. Especially 
>>>>> POISON_FREE. Any idea how that ends up there?
>>>>
>>>>
>>>> (see 
>>>> https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_105517v4/fi-rkl-guc/igt@i915_selftest@live@reset.html#dmesg-warnings702) 
>>>>
>>>>
>>>> after lots of slow debug via CI, it looks like the issue is that a 
>>>> ring buffer was allocated and taking up that page during the initial 
>>>> crc capture in the test, but by the time it came to check for 
>>>> corruption, it had been freed from that page.
>>>>
>>>> The test has a number of weaknesses:
>>>>
>>>> 1. the busy check is done twice, without taking in to account any 
>>>> change in between. I assume previously this could be relied on never 
>>>> to occur, but now it can for some reason (more on that later)
>>>
>>> You mean the stolen page used/unused test? Probably the premise is 
>>> that the test controls the driver completely ie. is the sole user and 
>>> the two checks are run at the time where nothing else could have 
>>> changed the state.
>>>
>>> With the nerfed request (as with GuC) this actually should hold. In 
>>> the generic case I am less sure, my working knowledge faded a bit, 
>>> but perhaps there was something guaranteeing the spinner couldn't 
>>> have been retired yet at the time of the second check. Would need 
>>> clarifying at least in comments.
>>>>
>>>> 2. the engine reset returns early with an error for guc submission 
>>>> engines, but it is silently ignored in the test. Perhaps it should 
>>>> ignore guc submission engines as it is a largely useless test for 
>>>> those situations.
>>>
>>> Yes looks dodgy indeed. You will need to summon the owners of the GuC 
>>> backend to comment on this.
>>>
>>> However even if the test should be skipped with GuC it is extremely 
>>> interesting that you are hitting this so I suspect there is a more 
>>> serious issue at play.
>>
>> indeed. That's why I am keen to get to the root cause instead of just 
>> slapping in a fix.
>>
>>>
>>>> A quick obvious fix is to have a busy bitmask that remembers each 
>>>> page's busy state initially and only check for corruption if it was 
>>>> busy during both checks.
>>>>
>>>> However, the main question is why this is occurring now with my 
>>>> changes.
>>>> I have added more debug to check where the stolen memory is being 
>>>> freed, but the first run last night didn't hit the issue for once.
>>>> I am running again now, will report back if I figure out where it is 
>>>> being freed.
>>>>
>>>> I am pretty sure the "corruption" (which isn't actually corruption) 
>>>> is from a ring buffer.
>>>> The POISON_FREE is the only difference between the captured before 
>>>> and after dumps:
>>>>
>>>> [0040] 00000000 02800000 6b6b6b6b 6b6b6b6b 6b6b6b6b 6b6b6b6b 
>>>> 6b6b6b6b 6b6b6b6b
>>>>
>>>> with the 2nd dword being the MI_ARB_CHECK used for the spinner.
>>>> I think this is the request poisoning from i915_request_retire()
>>>>
>>>> The bit I don't know yet is why a ring buffer was freed between the 
>>>> initial crc capture and the corruption check. The spinner should be 
>>>> active across the entire test, maintaining a ref on the context and 
>>>> it's ring.
>>>>
>>>> hopefully my latest debug will give more answers.
>>>
>>> Yeah if you can figure our whether the a) spinner is still active 
>>> during the 2nd check (as I think it should be), and b) is the 
>>> corruption detected in the same pages which were used in the 1st pass 
>>> that would be interesting.
>>
>> yep. The latest run is still stuck in the CI queue after 27 hours.
>> I think I have enough debug in there to catch it now.
>> Hopefully I can get a root cause once it gets chance to run.
>>
> 
> https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_105517v7/fi-adl-ddr5/igt@i915_selftest@live@reset.html#dmesg-warnings496 
> 
> 
> well, the run finally happened.
> And it shows that the freed resource happens from a workqueue. Not helpful.
> 
> I'll now add a saved stack traces to all objects that saves where it is 
> allocated and freed/queued for free.
> 

https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_105517v8/fi-rkl-guc/igt@i915_selftest@live@reset.html#dmesg-warnings419

I'm pretty sure I know what is going on now.

igt_reset_engines_stolen() loops around each engine and calls 
__igt_reset_stolen() for that engine.


__igt_reset_stolen() does
intel_context_create()

igt_spinner_create_request()->intel_context_create_request()->__i915_request_create()->intel_context_get()

intel_context_put()

leaving the request as the remaining holder of the context.

it then does the reset, which does nothing on GuC setups, does the 
comparisons, then ends the spinner via igt_spinner_fini()->igt_spinner_end()
which lets the spinner request finish.

once the request is retired, intel_context_put() is eventually called, 
which starts the GuC teardown of the context as the request was the last 
holder of the context.

This GuC teardown is asynchronous via ct transactions.
By the time the ct_process_request() sees the 
INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE message, the test has already 
looped to the next engine and has already checked the original status of 
the page that the destroying context used for its ring buffer, so the 
test sees it being freed from the previous loop while testing the next 
engine. It considers this a corruption of the stolen memory due to the 
previously highlighted double checking of busy state for each page.

I think for now, we should simply not test GuC submission engines in 
line with the reset call returning an error.
If at some point we want to enable this test for GuC setups, then 
flushing and waiting for context cleanup would need to be added to the test.

Anyone know why per engine reset is not allowed for GuC submission setup?
looking at commit "eb5e7da736f3 drm/i915/guc: Reset implementation for 
new GuC interface" doesn't really detail why per engine resets are not 
allowed.
Maybe it just never got implemented? or are there reasons to not allow 
the host to request specific engine resets?



> 
>>
>>>
>>> Regards,
>>>
>>> Tvrtko
>>>
>>>>
>>>>
>>>>>
>>>>> Btw what is the benefit of converting stolen to start with? It's 
>>>>> not much of a backend since it just uses the drm range manager. So 
>>>>> quite thin and uneventful. Diffstats for the series also do not 
>>>>> look like you end up with much code reduction?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tvrtko

Hellstrom, Thomas June 30, 2022, 2:52 p.m. UTC | #8

Hi!

On Thu, 2022-06-30 at 15:20 +0100, Robert Beckett wrote:
> 
> 
> On 29/06/2022 13:51, Robert Beckett wrote:
> > 
> > 
> > On 28/06/2022 17:22, Robert Beckett wrote:
> > > 
> > > 
> > > On 28/06/2022 09:46, Tvrtko Ursulin wrote:
> > > > 
> > > > On 27/06/2022 18:08, Robert Beckett wrote:
> > > > > 
> > > > > 
> > > > > On 22/06/2022 10:05, Tvrtko Ursulin wrote:
> > > > > > 
> > > > > > On 21/06/2022 20:11, Robert Beckett wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > On 21/06/2022 18:37, Patchwork wrote:
> > > > > > > > *Patch Details*
> > > > > > > > *Series:*    drm/i915: ttm for stolen (rev5)
> > > > > > > > *URL:*   
> > > > > > > > https://patchwork.freedesktop.org/series/101396/ 
> > > > > > > > <https://patchwork.freedesktop.org/series/101396/>
> > > > > > > > *State:*    failure
> > > > > > > > *Details:* 
> > > > > > > > https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html
> > > > > > > >  
> > > > > > > > <
> > > > > > > > https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html
> > > > > > > > >
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > >   CI Bug Log - changes from CI_DRM_11790 ->
> > > > > > > > Patchwork_101396v5
> > > > > > > > 
> > > > > > > > 
> > > > > > > >     Summary
> > > > > > > > 
> > > > > > > > *FAILURE*
> > > > > > > > 
> > > > > > > > Serious unknown changes coming with Patchwork_101396v5
> > > > > > > > absolutely 
> > > > > > > > need to be
> > > > > > > > verified manually.
> > > > > > > > 
> > > > > > > > If you think the reported changes have nothing to do
> > > > > > > > with the 
> > > > > > > > changes
> > > > > > > > introduced in Patchwork_101396v5, please notify your
> > > > > > > > bug team to 
> > > > > > > > allow them
> > > > > > > > to document this new failure mode, which will reduce
> > > > > > > > false 
> > > > > > > > positives in CI.
> > > > > > > > 
> > > > > > > > External URL: 
> > > > > > > > https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html
> > > > > > > >  
> > > > > > > > 
> > > > > > > > 
> > > > > > > > 
> > > > > > > >     Participating hosts (40 -> 41)
> > > > > > > > 
> > > > > > > > Additional (2): fi-icl-u2 bat-dg2-9
> > > > > > > > Missing (1): fi-bdw-samus
> > > > > > > > 
> > > > > > > > 
> > > > > > > >     Possible new issues
> > > > > > > > 
> > > > > > > > Here are the unknown changes that may have been
> > > > > > > > introduced in 
> > > > > > > > Patchwork_101396v5:
> > > > > > > > 
> > > > > > > > 
> > > > > > > >       IGT changes
> > > > > > > > 
> > > > > > > > 
> > > > > > > >         Possible regressions
> > > > > > > > 
> > > > > > > >   * igt@i915_selftest@live@reset:
> > > > > > > >       o bat-adlp-4: PASS
> > > > > > > > <
> > > > > > > > https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11790/bat-adlp-4/igt@i915_selftest@live@reset.html
> > > > > > > > >
> > > > > > > > 
> > > > > > > >         -> DMESG-FAIL
> > > > > > > > <
> > > > > > > > https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/bat-adlp-4/igt@i915_selftest@live@reset.html
> > > > > > > > >
> > > > > > > > 
> > > > > > > > 
> > > > > > > 
> > > > > > > I keep hitting clobbered pages during engine resets on
> > > > > > > bat-adlp-4.
> > > > > > > It seems to happen most of the time on that machine and 
> > > > > > > occasionally on bat-adlp-6.
> > > > > > > 
> > > > > > > Should bat-adlp-4 be considered an unreliable machine
> > > > > > > like 
> > > > > > > bat-adlp-6 is for now?
> > > > > > > 
> > > > > > > Alternatively, seeing the history of this in
> > > > > > > 
> > > > > > > commit 3da3c5c1c9825c24168f27b021339e90af37e969
> > > > > > > "drm/i915: Exclude 
> > > > > > > low pages (128KiB) of stolen from use"
> > > > > > > 
> > > > > > > could this be an indication that maybe the original issue
> > > > > > > is worse 
> > > > > > > on adlp machines?
> > > > > > > I have only ever seen page page 135 or 136 clobbered
> > > > > > > across many 
> > > > > > > runs via trybot, so it looks fairly consistent.
> > > > > > > Though excluding the use of over 540K of stolen might be
> > > > > > > too severe.
> > > > > > 
> > > > > > Don't know but I see that on the latest version you even
> > > > > > hit pages 
> > > > > > 165/166.
> > > > > > 
> > > > > > Any history of hitting this in CI without your series? If
> > > > > > not, are 
> > > > > > there some other changes which could explain it? Are you
> > > > > > touching 
> > > > > > the selftest itself?
> > > > > > 
> > > > > > Hexdump of the clobbered page looks quite complex.
> > > > > > Especially 
> > > > > > POISON_FREE. Any idea how that ends up there?
> > > > > 
> > > > > 
> > > > > (see 
> > > > > https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_105517v4/fi-rkl-guc/igt@i915_selftest@live@reset.html#dmesg-warnings702
> > > > > )
> > > > > 
> > > > > 
> > > > > after lots of slow debug via CI, it looks like the issue is
> > > > > that a 
> > > > > ring buffer was allocated and taking up that page during the
> > > > > initial 
> > > > > crc capture in the test, but by the time it came to check for
> > > > > corruption, it had been freed from that page.
> > > > > 
> > > > > The test has a number of weaknesses:
> > > > > 
> > > > > 1. the busy check is done twice, without taking in to account
> > > > > any 
> > > > > change in between. I assume previously this could be relied
> > > > > on never 
> > > > > to occur, but now it can for some reason (more on that later)
> > > > 
> > > > You mean the stolen page used/unused test? Probably the premise
> > > > is 
> > > > that the test controls the driver completely ie. is the sole
> > > > user and 
> > > > the two checks are run at the time where nothing else could
> > > > have 
> > > > changed the state.
> > > > 
> > > > With the nerfed request (as with GuC) this actually should
> > > > hold. In 
> > > > the generic case I am less sure, my working knowledge faded a
> > > > bit, 
> > > > but perhaps there was something guaranteeing the spinner
> > > > couldn't 
> > > > have been retired yet at the time of the second check. Would
> > > > need 
> > > > clarifying at least in comments.
> > > > > 
> > > > > 2. the engine reset returns early with an error for guc
> > > > > submission 
> > > > > engines, but it is silently ignored in the test. Perhaps it
> > > > > should 
> > > > > ignore guc submission engines as it is a largely useless test
> > > > > for 
> > > > > those situations.
> > > > 
> > > > Yes looks dodgy indeed. You will need to summon the owners of
> > > > the GuC 
> > > > backend to comment on this.
> > > > 
> > > > However even if the test should be skipped with GuC it is
> > > > extremely 
> > > > interesting that you are hitting this so I suspect there is a
> > > > more 
> > > > serious issue at play.
> > > 
> > > indeed. That's why I am keen to get to the root cause instead of
> > > just 
> > > slapping in a fix.
> > > 
> > > > 
> > > > > A quick obvious fix is to have a busy bitmask that remembers
> > > > > each 
> > > > > page's busy state initially and only check for corruption if
> > > > > it was 
> > > > > busy during both checks.
> > > > > 
> > > > > However, the main question is why this is occurring now with
> > > > > my 
> > > > > changes.
> > > > > I have added more debug to check where the stolen memory is
> > > > > being 
> > > > > freed, but the first run last night didn't hit the issue for
> > > > > once.
> > > > > I am running again now, will report back if I figure out
> > > > > where it is 
> > > > > being freed.
> > > > > 
> > > > > I am pretty sure the "corruption" (which isn't actually
> > > > > corruption) 
> > > > > is from a ring buffer.
> > > > > The POISON_FREE is the only difference between the captured
> > > > > before 
> > > > > and after dumps:
> > > > > 
> > > > > [0040] 00000000 02800000 6b6b6b6b 6b6b6b6b 6b6b6b6b 6b6b6b6b 
> > > > > 6b6b6b6b 6b6b6b6b
> > > > > 
> > > > > with the 2nd dword being the MI_ARB_CHECK used for the
> > > > > spinner.
> > > > > I think this is the request poisoning from
> > > > > i915_request_retire()
> > > > > 
> > > > > The bit I don't know yet is why a ring buffer was freed
> > > > > between the 
> > > > > initial crc capture and the corruption check. The spinner
> > > > > should be 
> > > > > active across the entire test, maintaining a ref on the
> > > > > context and 
> > > > > it's ring.
> > > > > 
> > > > > hopefully my latest debug will give more answers.
> > > > 
> > > > Yeah if you can figure our whether the a) spinner is still
> > > > active 
> > > > during the 2nd check (as I think it should be), and b) is the 
> > > > corruption detected in the same pages which were used in the
> > > > 1st pass 
> > > > that would be interesting.
> > > 
> > > yep. The latest run is still stuck in the CI queue after 27
> > > hours.
> > > I think I have enough debug in there to catch it now.
> > > Hopefully I can get a root cause once it gets chance to run.
> > > 
> > 
> > https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_105517v7/fi-adl-ddr5/igt@i915_selftest@live@reset.html#dmesg-warnings496
> >  
> > 
> > 
> > well, the run finally happened.
> > And it shows that the freed resource happens from a workqueue. Not
> > helpful.
> > 
> > I'll now add a saved stack traces to all objects that saves where
> > it is 
> > allocated and freed/queued for free.
> > 
> 
> https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_105517v8/fi-rkl-guc/igt@i915_selftest@live@reset.html#dmesg-warnings419
> 
> I'm pretty sure I know what is going on now.
> 
> igt_reset_engines_stolen() loops around each engine and calls 
> __igt_reset_stolen() for that engine.
> 
> 
> __igt_reset_stolen() does
> intel_context_create()
> 
> igt_spinner_create_request()->intel_context_create_request()-
> >__i915_request_create()->intel_context_get()
> 
> intel_context_put()
> 
> leaving the request as the remaining holder of the context.
> 
> it then does the reset, which does nothing on GuC setups, does the 
> comparisons, then ends the spinner via igt_spinner_fini()-
> >igt_spinner_end()
> which lets the spinner request finish.
> 
> once the request is retired, intel_context_put() is eventually
> called, 
> which starts the GuC teardown of the context as the request was the
> last 
> holder of the context.
> 
> This GuC teardown is asynchronous via ct transactions.
> By the time the ct_process_request() sees the 
> INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE message, the test has
> already 
> looped to the next engine and has already checked the original status
> of 
> the page that the destroying context used for its ring buffer, so the
> test sees it being freed from the previous loop while testing the
> next 
> engine. It considers this a corruption of the stolen memory due to
> the 
> previously highlighted double checking of busy state for each page.
> 
> I think for now, we should simply not test GuC submission engines in 
> line with the reset call returning an error.
> If at some point we want to enable this test for GuC setups, then 
> flushing and waiting for context cleanup would need to be added to
> the test.
> 
> Anyone know why per engine reset is not allowed for GuC submission
> setup?
> looking at commit "eb5e7da736f3 drm/i915/guc: Reset implementation
> for 
> new GuC interface" doesn't really detail why per engine resets are
> not 
> allowed.
> Maybe it just never got implemented? or are there reasons to not
> allow 
> the host to request specific engine resets?
> 

IIRC, the GuC by design decides for itself when it needs to do a per-
engine reset, hence the host can't trigger those.

/Thomas

Bob Beckett June 30, 2022, 3:24 p.m. UTC | #9

On 30/06/2022 15:52, Hellstrom, Thomas wrote:
> Hi!
> 
> On Thu, 2022-06-30 at 15:20 +0100, Robert Beckett wrote:
>>
>>
>> On 29/06/2022 13:51, Robert Beckett wrote:
>>>
>>>
>>> On 28/06/2022 17:22, Robert Beckett wrote:
>>>>
>>>>
>>>> On 28/06/2022 09:46, Tvrtko Ursulin wrote:
>>>>>
>>>>> On 27/06/2022 18:08, Robert Beckett wrote:
>>>>>>
>>>>>>
>>>>>> On 22/06/2022 10:05, Tvrtko Ursulin wrote:
>>>>>>>
>>>>>>> On 21/06/2022 20:11, Robert Beckett wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 21/06/2022 18:37, Patchwork wrote:
>>>>>>>>> *Patch Details*
>>>>>>>>> *Series:*    drm/i915: ttm for stolen (rev5)
>>>>>>>>> *URL:*
>>>>>>>>> https://patchwork.freedesktop.org/series/101396/
>>>>>>>>> <https://patchwork.freedesktop.org/series/101396/>
>>>>>>>>> *State:*    failure
>>>>>>>>> *Details:*
>>>>>>>>> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html
>>>>>>>>>   
>>>>>>>>> <
>>>>>>>>> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    CI Bug Log - changes from CI_DRM_11790 ->
>>>>>>>>> Patchwork_101396v5
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>      Summary
>>>>>>>>>
>>>>>>>>> *FAILURE*
>>>>>>>>>
>>>>>>>>> Serious unknown changes coming with Patchwork_101396v5
>>>>>>>>> absolutely
>>>>>>>>> need to be
>>>>>>>>> verified manually.
>>>>>>>>>
>>>>>>>>> If you think the reported changes have nothing to do
>>>>>>>>> with the
>>>>>>>>> changes
>>>>>>>>> introduced in Patchwork_101396v5, please notify your
>>>>>>>>> bug team to
>>>>>>>>> allow them
>>>>>>>>> to document this new failure mode, which will reduce
>>>>>>>>> false
>>>>>>>>> positives in CI.
>>>>>>>>>
>>>>>>>>> External URL:
>>>>>>>>> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html
>>>>>>>>>   
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>      Participating hosts (40 -> 41)
>>>>>>>>>
>>>>>>>>> Additional (2): fi-icl-u2 bat-dg2-9
>>>>>>>>> Missing (1): fi-bdw-samus
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>      Possible new issues
>>>>>>>>>
>>>>>>>>> Here are the unknown changes that may have been
>>>>>>>>> introduced in
>>>>>>>>> Patchwork_101396v5:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>        IGT changes
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>          Possible regressions
>>>>>>>>>
>>>>>>>>>    * igt@i915_selftest@live@reset:
>>>>>>>>>        o bat-adlp-4: PASS
>>>>>>>>> <
>>>>>>>>> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11790/bat-adlp-4/igt@i915_selftest@live@reset.html
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>          -> DMESG-FAIL
>>>>>>>>> <
>>>>>>>>> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/bat-adlp-4/igt@i915_selftest@live@reset.html
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> I keep hitting clobbered pages during engine resets on
>>>>>>>> bat-adlp-4.
>>>>>>>> It seems to happen most of the time on that machine and
>>>>>>>> occasionally on bat-adlp-6.
>>>>>>>>
>>>>>>>> Should bat-adlp-4 be considered an unreliable machine
>>>>>>>> like
>>>>>>>> bat-adlp-6 is for now?
>>>>>>>>
>>>>>>>> Alternatively, seeing the history of this in
>>>>>>>>
>>>>>>>> commit 3da3c5c1c9825c24168f27b021339e90af37e969
>>>>>>>> "drm/i915: Exclude
>>>>>>>> low pages (128KiB) of stolen from use"
>>>>>>>>
>>>>>>>> could this be an indication that maybe the original issue
>>>>>>>> is worse
>>>>>>>> on adlp machines?
>>>>>>>> I have only ever seen page page 135 or 136 clobbered
>>>>>>>> across many
>>>>>>>> runs via trybot, so it looks fairly consistent.
>>>>>>>> Though excluding the use of over 540K of stolen might be
>>>>>>>> too severe.
>>>>>>>
>>>>>>> Don't know but I see that on the latest version you even
>>>>>>> hit pages
>>>>>>> 165/166.
>>>>>>>
>>>>>>> Any history of hitting this in CI without your series? If
>>>>>>> not, are
>>>>>>> there some other changes which could explain it? Are you
>>>>>>> touching
>>>>>>> the selftest itself?
>>>>>>>
>>>>>>> Hexdump of the clobbered page looks quite complex.
>>>>>>> Especially
>>>>>>> POISON_FREE. Any idea how that ends up there?
>>>>>>
>>>>>>
>>>>>> (see
>>>>>> https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_105517v4/fi-rkl-guc/igt@i915_selftest@live@reset.html#dmesg-warnings702
>>>>>> )
>>>>>>
>>>>>>
>>>>>> after lots of slow debug via CI, it looks like the issue is
>>>>>> that a
>>>>>> ring buffer was allocated and taking up that page during the
>>>>>> initial
>>>>>> crc capture in the test, but by the time it came to check for
>>>>>> corruption, it had been freed from that page.
>>>>>>
>>>>>> The test has a number of weaknesses:
>>>>>>
>>>>>> 1. the busy check is done twice, without taking in to account
>>>>>> any
>>>>>> change in between. I assume previously this could be relied
>>>>>> on never
>>>>>> to occur, but now it can for some reason (more on that later)
>>>>>
>>>>> You mean the stolen page used/unused test? Probably the premise
>>>>> is
>>>>> that the test controls the driver completely ie. is the sole
>>>>> user and
>>>>> the two checks are run at the time where nothing else could
>>>>> have
>>>>> changed the state.
>>>>>
>>>>> With the nerfed request (as with GuC) this actually should
>>>>> hold. In
>>>>> the generic case I am less sure, my working knowledge faded a
>>>>> bit,
>>>>> but perhaps there was something guaranteeing the spinner
>>>>> couldn't
>>>>> have been retired yet at the time of the second check. Would
>>>>> need
>>>>> clarifying at least in comments.
>>>>>>
>>>>>> 2. the engine reset returns early with an error for guc
>>>>>> submission
>>>>>> engines, but it is silently ignored in the test. Perhaps it
>>>>>> should
>>>>>> ignore guc submission engines as it is a largely useless test
>>>>>> for
>>>>>> those situations.
>>>>>
>>>>> Yes looks dodgy indeed. You will need to summon the owners of
>>>>> the GuC
>>>>> backend to comment on this.
>>>>>
>>>>> However even if the test should be skipped with GuC it is
>>>>> extremely
>>>>> interesting that you are hitting this so I suspect there is a
>>>>> more
>>>>> serious issue at play.
>>>>
>>>> indeed. That's why I am keen to get to the root cause instead of
>>>> just
>>>> slapping in a fix.
>>>>
>>>>>
>>>>>> A quick obvious fix is to have a busy bitmask that remembers
>>>>>> each
>>>>>> page's busy state initially and only check for corruption if
>>>>>> it was
>>>>>> busy during both checks.
>>>>>>
>>>>>> However, the main question is why this is occurring now with
>>>>>> my
>>>>>> changes.
>>>>>> I have added more debug to check where the stolen memory is
>>>>>> being
>>>>>> freed, but the first run last night didn't hit the issue for
>>>>>> once.
>>>>>> I am running again now, will report back if I figure out
>>>>>> where it is
>>>>>> being freed.
>>>>>>
>>>>>> I am pretty sure the "corruption" (which isn't actually
>>>>>> corruption)
>>>>>> is from a ring buffer.
>>>>>> The POISON_FREE is the only difference between the captured
>>>>>> before
>>>>>> and after dumps:
>>>>>>
>>>>>> [0040] 00000000 02800000 6b6b6b6b 6b6b6b6b 6b6b6b6b 6b6b6b6b
>>>>>> 6b6b6b6b 6b6b6b6b
>>>>>>
>>>>>> with the 2nd dword being the MI_ARB_CHECK used for the
>>>>>> spinner.
>>>>>> I think this is the request poisoning from
>>>>>> i915_request_retire()
>>>>>>
>>>>>> The bit I don't know yet is why a ring buffer was freed
>>>>>> between the
>>>>>> initial crc capture and the corruption check. The spinner
>>>>>> should be
>>>>>> active across the entire test, maintaining a ref on the
>>>>>> context and
>>>>>> it's ring.
>>>>>>
>>>>>> hopefully my latest debug will give more answers.
>>>>>
>>>>> Yeah if you can figure our whether the a) spinner is still
>>>>> active
>>>>> during the 2nd check (as I think it should be), and b) is the
>>>>> corruption detected in the same pages which were used in the
>>>>> 1st pass
>>>>> that would be interesting.
>>>>
>>>> yep. The latest run is still stuck in the CI queue after 27
>>>> hours.
>>>> I think I have enough debug in there to catch it now.
>>>> Hopefully I can get a root cause once it gets chance to run.
>>>>
>>>
>>> https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_105517v7/fi-adl-ddr5/igt@i915_selftest@live@reset.html#dmesg-warnings496
>>>   
>>>
>>>
>>> well, the run finally happened.
>>> And it shows that the freed resource happens from a workqueue. Not
>>> helpful.
>>>
>>> I'll now add a saved stack traces to all objects that saves where
>>> it is
>>> allocated and freed/queued for free.
>>>
>>
>> https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_105517v8/fi-rkl-guc/igt@i915_selftest@live@reset.html#dmesg-warnings419
>>
>> I'm pretty sure I know what is going on now.
>>
>> igt_reset_engines_stolen() loops around each engine and calls
>> __igt_reset_stolen() for that engine.
>>
>>
>> __igt_reset_stolen() does
>> intel_context_create()
>>
>> igt_spinner_create_request()->intel_context_create_request()-
>>> __i915_request_create()->intel_context_get()
>>
>> intel_context_put()
>>
>> leaving the request as the remaining holder of the context.
>>
>> it then does the reset, which does nothing on GuC setups, does the
>> comparisons, then ends the spinner via igt_spinner_fini()-
>>> igt_spinner_end()
>> which lets the spinner request finish.
>>
>> once the request is retired, intel_context_put() is eventually
>> called,
>> which starts the GuC teardown of the context as the request was the
>> last
>> holder of the context.
>>
>> This GuC teardown is asynchronous via ct transactions.
>> By the time the ct_process_request() sees the
>> INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE message, the test has
>> already
>> looped to the next engine and has already checked the original status
>> of
>> the page that the destroying context used for its ring buffer, so the
>> test sees it being freed from the previous loop while testing the
>> next
>> engine. It considers this a corruption of the stolen memory due to
>> the
>> previously highlighted double checking of busy state for each page.
>>
>> I think for now, we should simply not test GuC submission engines in
>> line with the reset call returning an error.
>> If at some point we want to enable this test for GuC setups, then
>> flushing and waiting for context cleanup would need to be added to
>> the test.
>>
>> Anyone know why per engine reset is not allowed for GuC submission
>> setup?
>> looking at commit "eb5e7da736f3 drm/i915/guc: Reset implementation
>> for
>> new GuC interface" doesn't really detail why per engine resets are
>> not
>> allowed.
>> Maybe it just never got implemented? or are there reasons to not
>> allow
>> the host to request specific engine resets?
>>
> 
> IIRC, the GuC by design decides for itself when it needs to do a per-
> engine reset, hence the host can't trigger those.
> 
> /Thomas
> 

okay, understood.
I'll include a fix to the test in the next version to not test if GuC 
submissions are active.

The only curious bit is that the conversion to ttm makes this issue 
occur more often. In theory this should have been happening with the old 
code too.
I suspect it is just a timing thing. Maybe the ttm code has sped up or 
slowed down the process somewhere.

Tvrtko Ursulin July 1, 2022, 7:34 a.m. UTC | #10

On 30/06/2022 15:20, Robert Beckett wrote:
> 
> 
> On 29/06/2022 13:51, Robert Beckett wrote:
>>
>>
>> On 28/06/2022 17:22, Robert Beckett wrote:
>>>
>>>
>>> On 28/06/2022 09:46, Tvrtko Ursulin wrote:
>>>>
>>>> On 27/06/2022 18:08, Robert Beckett wrote:
>>>>>
>>>>>
>>>>> On 22/06/2022 10:05, Tvrtko Ursulin wrote:
>>>>>>
>>>>>> On 21/06/2022 20:11, Robert Beckett wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 21/06/2022 18:37, Patchwork wrote:
>>>>>>>> *Patch Details*
>>>>>>>> *Series:*    drm/i915: ttm for stolen (rev5)
>>>>>>>> *URL:*    https://patchwork.freedesktop.org/series/101396/ 
>>>>>>>> <https://patchwork.freedesktop.org/series/101396/>
>>>>>>>> *State:*    failure
>>>>>>>> *Details:* 
>>>>>>>> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html 
>>>>>>>> <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html> 
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>   CI Bug Log - changes from CI_DRM_11790 -> Patchwork_101396v5
>>>>>>>>
>>>>>>>>
>>>>>>>>     Summary
>>>>>>>>
>>>>>>>> *FAILURE*
>>>>>>>>
>>>>>>>> Serious unknown changes coming with Patchwork_101396v5 
>>>>>>>> absolutely need to be
>>>>>>>> verified manually.
>>>>>>>>
>>>>>>>> If you think the reported changes have nothing to do with the 
>>>>>>>> changes
>>>>>>>> introduced in Patchwork_101396v5, please notify your bug team to 
>>>>>>>> allow them
>>>>>>>> to document this new failure mode, which will reduce false 
>>>>>>>> positives in CI.
>>>>>>>>
>>>>>>>> External URL: 
>>>>>>>> https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/index.html 
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>     Participating hosts (40 -> 41)
>>>>>>>>
>>>>>>>> Additional (2): fi-icl-u2 bat-dg2-9
>>>>>>>> Missing (1): fi-bdw-samus
>>>>>>>>
>>>>>>>>
>>>>>>>>     Possible new issues
>>>>>>>>
>>>>>>>> Here are the unknown changes that may have been introduced in 
>>>>>>>> Patchwork_101396v5:
>>>>>>>>
>>>>>>>>
>>>>>>>>       IGT changes
>>>>>>>>
>>>>>>>>
>>>>>>>>         Possible regressions
>>>>>>>>
>>>>>>>>   * igt@i915_selftest@live@reset:
>>>>>>>>       o bat-adlp-4: PASS
>>>>>>>> <https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_11790/bat-adlp-4/igt@i915_selftest@live@reset.html> 
>>>>>>>>
>>>>>>>>         -> DMESG-FAIL
>>>>>>>> <https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_101396v5/bat-adlp-4/igt@i915_selftest@live@reset.html> 
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> I keep hitting clobbered pages during engine resets on bat-adlp-4.
>>>>>>> It seems to happen most of the time on that machine and 
>>>>>>> occasionally on bat-adlp-6.
>>>>>>>
>>>>>>> Should bat-adlp-4 be considered an unreliable machine like 
>>>>>>> bat-adlp-6 is for now?
>>>>>>>
>>>>>>> Alternatively, seeing the history of this in
>>>>>>>
>>>>>>> commit 3da3c5c1c9825c24168f27b021339e90af37e969 "drm/i915: 
>>>>>>> Exclude low pages (128KiB) of stolen from use"
>>>>>>>
>>>>>>> could this be an indication that maybe the original issue is 
>>>>>>> worse on adlp machines?
>>>>>>> I have only ever seen page page 135 or 136 clobbered across many 
>>>>>>> runs via trybot, so it looks fairly consistent.
>>>>>>> Though excluding the use of over 540K of stolen might be too severe.
>>>>>>
>>>>>> Don't know but I see that on the latest version you even hit pages 
>>>>>> 165/166.
>>>>>>
>>>>>> Any history of hitting this in CI without your series? If not, are 
>>>>>> there some other changes which could explain it? Are you touching 
>>>>>> the selftest itself?
>>>>>>
>>>>>> Hexdump of the clobbered page looks quite complex. Especially 
>>>>>> POISON_FREE. Any idea how that ends up there?
>>>>>
>>>>>
>>>>> (see 
>>>>> https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_105517v4/fi-rkl-guc/igt@i915_selftest@live@reset.html#dmesg-warnings702) 
>>>>>
>>>>>
>>>>> after lots of slow debug via CI, it looks like the issue is that a 
>>>>> ring buffer was allocated and taking up that page during the 
>>>>> initial crc capture in the test, but by the time it came to check 
>>>>> for corruption, it had been freed from that page.
>>>>>
>>>>> The test has a number of weaknesses:
>>>>>
>>>>> 1. the busy check is done twice, without taking in to account any 
>>>>> change in between. I assume previously this could be relied on 
>>>>> never to occur, but now it can for some reason (more on that later)
>>>>
>>>> You mean the stolen page used/unused test? Probably the premise is 
>>>> that the test controls the driver completely ie. is the sole user 
>>>> and the two checks are run at the time where nothing else could have 
>>>> changed the state.
>>>>
>>>> With the nerfed request (as with GuC) this actually should hold. In 
>>>> the generic case I am less sure, my working knowledge faded a bit, 
>>>> but perhaps there was something guaranteeing the spinner couldn't 
>>>> have been retired yet at the time of the second check. Would need 
>>>> clarifying at least in comments.
>>>>>
>>>>> 2. the engine reset returns early with an error for guc submission 
>>>>> engines, but it is silently ignored in the test. Perhaps it should 
>>>>> ignore guc submission engines as it is a largely useless test for 
>>>>> those situations.
>>>>
>>>> Yes looks dodgy indeed. You will need to summon the owners of the 
>>>> GuC backend to comment on this.
>>>>
>>>> However even if the test should be skipped with GuC it is extremely 
>>>> interesting that you are hitting this so I suspect there is a more 
>>>> serious issue at play.
>>>
>>> indeed. That's why I am keen to get to the root cause instead of just 
>>> slapping in a fix.
>>>
>>>>
>>>>> A quick obvious fix is to have a busy bitmask that remembers each 
>>>>> page's busy state initially and only check for corruption if it was 
>>>>> busy during both checks.
>>>>>
>>>>> However, the main question is why this is occurring now with my 
>>>>> changes.
>>>>> I have added more debug to check where the stolen memory is being 
>>>>> freed, but the first run last night didn't hit the issue for once.
>>>>> I am running again now, will report back if I figure out where it 
>>>>> is being freed.
>>>>>
>>>>> I am pretty sure the "corruption" (which isn't actually corruption) 
>>>>> is from a ring buffer.
>>>>> The POISON_FREE is the only difference between the captured before 
>>>>> and after dumps:
>>>>>
>>>>> [0040] 00000000 02800000 6b6b6b6b 6b6b6b6b 6b6b6b6b 6b6b6b6b 
>>>>> 6b6b6b6b 6b6b6b6b
>>>>>
>>>>> with the 2nd dword being the MI_ARB_CHECK used for the spinner.
>>>>> I think this is the request poisoning from i915_request_retire()
>>>>>
>>>>> The bit I don't know yet is why a ring buffer was freed between the 
>>>>> initial crc capture and the corruption check. The spinner should be 
>>>>> active across the entire test, maintaining a ref on the context and 
>>>>> it's ring.
>>>>>
>>>>> hopefully my latest debug will give more answers.
>>>>
>>>> Yeah if you can figure our whether the a) spinner is still active 
>>>> during the 2nd check (as I think it should be), and b) is the 
>>>> corruption detected in the same pages which were used in the 1st 
>>>> pass that would be interesting.
>>>
>>> yep. The latest run is still stuck in the CI queue after 27 hours.
>>> I think I have enough debug in there to catch it now.
>>> Hopefully I can get a root cause once it gets chance to run.
>>>
>>
>> https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_105517v7/fi-adl-ddr5/igt@i915_selftest@live@reset.html#dmesg-warnings496 
>>
>>
>> well, the run finally happened.
>> And it shows that the freed resource happens from a workqueue. Not 
>> helpful.
>>
>> I'll now add a saved stack traces to all objects that saves where it 
>> is allocated and freed/queued for free.
>>
> 
> https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_105517v8/fi-rkl-guc/igt@i915_selftest@live@reset.html#dmesg-warnings419 
> 
> 
> I'm pretty sure I know what is going on now.
> 
> igt_reset_engines_stolen() loops around each engine and calls 
> __igt_reset_stolen() for that engine.
> 
> 
> __igt_reset_stolen() does
> intel_context_create()
> 
> igt_spinner_create_request()->intel_context_create_request()->__i915_request_create()->intel_context_get() 
> 
> 
> intel_context_put()
> 
> leaving the request as the remaining holder of the context.
> 
> it then does the reset, which does nothing on GuC setups, does the 
> comparisons, then ends the spinner via 
> igt_spinner_fini()->igt_spinner_end()
> which lets the spinner request finish.
> 
> once the request is retired, intel_context_put() is eventually called, 
> which starts the GuC teardown of the context as the request was the last 
> holder of the context.
> 
> This GuC teardown is asynchronous via ct transactions.
> By the time the ct_process_request() sees the 
> INTEL_GUC_ACTION_DEREGISTER_CONTEXT_DONE message, the test has already 
> looped to the next engine and has already checked the original status of 
> the page that the destroying context used for its ring buffer, so the 
> test sees it being freed from the previous loop while testing the next 
> engine. It considers this a corruption of the stolen memory due to the 
> previously highlighted double checking of busy state for each page.

Alright, makes sense. Test kind of depends on implementation details and 
perhaps ideally it should do some explicit flushing before moving to the 
next engine, instead of assuming engine reset leaves everything idle and 
flushed. I don't know from the top of my head what kind of flushing 
would that be. In theory all possible delayed workers that we have. 
Maybe. But never mind for now.

> I think for now, we should simply not test GuC submission engines in 
> line with the reset call returning an error.
> If at some point we want to enable this test for GuC setups, then 
> flushing and waiting for context cleanup would need to be added to the 
> test.

Yeah that's okay. Alternative could be to provoke for instance the 
preempt timeout and hit the reset in that way. I am pretty sure some 
tests fiddle with it to enable rapid execution time. (Remove 
MI_ARB_CEHCK from the spinner and send an enging pulse, then wait for 
reset.)

> Anyone know why per engine reset is not allowed for GuC submission setup?
> looking at commit "eb5e7da736f3 drm/i915/guc: Reset implementation for 
> new GuC interface" doesn't really detail why per engine resets are not 
> allowed.
> Maybe it just never got implemented? or are there reasons to not allow 
> the host to request specific engine resets?

As Thomas has said - thinking seems to be there must be no explicit 
external mechanism to make GuC trigger the engine reset. Maybe it would 
be useful for testing purposes, or maybe indirect route as above is enough.

Regards,

Tvrtko

[v7,00/10] drm/i915: ttm for stolen

Message

Comments