Message ID | 20220906174609.23494-1-ville.syrjala@linux.intel.com (mailing list archive)
---|---
State | New, archived
Series | drm/i915: Kick rcu harder to free objects
On 06/09/2022 18:46, Ville Syrjala wrote:
> From: Ville Syrjälä <ville.syrjala@linux.intel.com>
>
> On gen3 the selftests are pretty much always tripping this:
> <4> [383.822424] pci 0000:00:02.0: drm_WARN_ON(dev_priv->mm.shrink_count)
> <4> [383.822546] WARNING: CPU: 2 PID: 3560 at drivers/gpu/drm/i915/i915_gem.c:1223 i915_gem_cleanup_early+0x96/0xb0 [i915]
>
> Looks to be due to the status page object lingering on the
> purge_list. Call synchronize_rcu() ahead of it to make more
> sure all objects have been freed.
>
> Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> ---
>  drivers/gpu/drm/i915/i915_gem.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
> index 0f49ec9d494a..5b61f7ad6473 100644
> --- a/drivers/gpu/drm/i915/i915_gem.c
> +++ b/drivers/gpu/drm/i915/i915_gem.c
> @@ -1098,6 +1098,7 @@ void i915_gem_drain_freed_objects(struct drm_i915_private *i915)
>  		flush_delayed_work(&i915->bdev.wq);
>  		rcu_barrier();
>  	}
> +	synchronize_rcu();

Looks a bit suspicious that a loop would not free all but one last rcu
grace would. Definitely fixes the issue in your testing?

Perhaps the fact there is a cond_resched in __i915_gem_free_objects, but
then again free count should reflect the state and keep it looping in
here..

Regards,

Tvrtko
On Thu, Sep 08, 2022 at 01:23:50PM +0100, Tvrtko Ursulin wrote:
> On 06/09/2022 18:46, Ville Syrjala wrote:
> > From: Ville Syrjälä <ville.syrjala@linux.intel.com>
> >
> > [snip patch]
>
> Looks a bit suspicious that a loop would not free all but one last rcu
> grace would. Definitely fixes the issue in your testing?

Definite is a bit hard to say with fuzzy stuff like this. But yes, so
far didn't see the warn triggering anymore. CI results show the same.

> Perhaps the fact there is a cond_resched in __i915_gem_free_objects, but
> then again free count should reflect the state and keep it looping in
> here..
>
> Regards,
>
> Tvrtko
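To make the two RCU waits concrete, here is a rough reconstruction of the patched function, pieced together from the diff context above and Tvrtko's remark about the free count keeping it looping; the loop condition and the flush_work() call are assumptions, not verbatim source.

```c
void i915_gem_drain_freed_objects(struct drm_i915_private *i915)
{
	/* Loop while deferred frees are still outstanding (assumed). */
	while (atomic_read(&i915->mm.free_count)) {
		flush_work(&i915->mm.free_work);    /* assumed: run the free worker */
		flush_delayed_work(&i915->bdev.wq); /* from the diff context */
		rcu_barrier(); /* waits only for call_rcu() callbacks queued so far */
	}

	/*
	 * The fix: wait out one more full grace period after the loop.
	 * rcu_barrier() does not cover a free deferred at just the wrong
	 * moment, so this "kicks rcu harder" rather than proving a root
	 * cause -- which is exactly what Tvrtko's question above probes.
	 */
	synchronize_rcu();
}
```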
Hi Ville,

I fixed a similar issue in DII but I couldn't reproduce it in drm:

http://intel-gfx-pw.fi.intel.com/patch/228850/?series=15910&rev=2

I wonder if that fixes the problem you are facing, then I can send that
to drm.

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 7809be3a6840..5438e9277924 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1213,7 +1213,7 @@ void i915_gem_init_early(struct drm_i915_private *dev_priv)
 
 void i915_gem_cleanup_early(struct drm_i915_private *dev_priv)
 {
-	i915_gem_drain_freed_objects(dev_priv);
+	i915_gem_drain_workqueue(dev_priv);
 	GEM_BUG_ON(!llist_empty(&dev_priv->mm.free_list));
 	GEM_BUG_ON(atomic_read(&dev_priv->mm.free_count));
 	drm_WARN_ON(&dev_priv->drm, dev_priv->mm.shrink_count);

Regards,

Nirmoy

On 9/6/2022 7:46 PM, Ville Syrjala wrote:
> From: Ville Syrjälä <ville.syrjala@linux.intel.com>
>
> [snip patch]
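The alternative here relies on i915_gem_drain_workqueue() doing several drain passes instead of one. A minimal sketch of that multi-pass idea follows; the pass count comes from Tvrtko's "count to three" remark in the follow-up, and the workqueue field and structure are assumptions, not the actual i915 source.

```c
/* Sketch of a multi-pass drain; the pass count and i915->wq are assumed. */
static void drain_workqueue_sketch(struct drm_i915_private *i915)
{
	int pass = 3;

	do {
		/*
		 * An RCU callback that frees an object can queue more work,
		 * which can in turn defer more frees, so a single pass may
		 * not settle everything; repeat a few times.
		 */
		rcu_barrier();
		flush_workqueue(i915->wq); /* assumed workqueue field */
	} while (--pass);

	i915_gem_drain_freed_objects(i915);
}
```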
On 08/09/2022 15:32, Das, Nirmoy wrote:
> Hi Ville,
>
> I fixed a similar issue in DII but I couldn't reproduce it in drm:
> http://intel-gfx-pw.fi.intel.com/patch/228850/?series=15910&rev=2
>
> I wonder if that fixes the problem you are facing, then I can send
> that to drm.
>
> [snip diff header]
>
>  void i915_gem_cleanup_early(struct drm_i915_private *dev_priv)
>  {
> -	i915_gem_drain_freed_objects(dev_priv);
> +	i915_gem_drain_workqueue(dev_priv);
>  	GEM_BUG_ON(!llist_empty(&dev_priv->mm.free_list));
>  	GEM_BUG_ON(atomic_read(&dev_priv->mm.free_count));
>  	drm_WARN_ON(&dev_priv->drm, dev_priv->mm.shrink_count);

Yes why not, more black magic (count to three) but if it works... :)

I also spy the general area has been a bit neglected. Like:

i915_gem_driver_remove:
    ...
    i915_gem_drain_workqueue
    i915_gem_drain_freed_objects

While i915_gem_drain_workqueue:
    ...
    i915_gem_drain_freed_objects

So i915_gem_drain_freed_objects in i915_gem_driver_remove is already
redundant.

Should i915_gem_drain_freed_objects be unexported and all callers made
to just call i915_gem_drain_workqueue after your patch? Or, if "drain
freed objects" is considered more self-descriptive, it could be made an
alias to i915_gem_drain_workqueue.

Regards,

Tvrtko

> Regards,
>
> Nirmoy
>
> On 9/6/2022 7:46 PM, Ville Syrjala wrote:
>> [snip patch]
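For concreteness, the aliasing option floated above could look something like this hypothetical wrapper; note it assumes i915_gem_drain_workqueue() would no longer call back into i915_gem_drain_freed_objects(), otherwise the pair would recurse.

```c
/* Hypothetical alias: keep the self-descriptive name, but route it to
 * the stronger drain. */
static inline void i915_gem_drain_freed_objects(struct drm_i915_private *i915)
{
	i915_gem_drain_workqueue(i915);
}
```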
On Thu, Sep 08, 2022 at 04:32:56PM +0200, Das, Nirmoy wrote:
> Hi Ville,
>
> I fixed a similar issue in DII but I couldn't reproduce it in drm:
> http://intel-gfx-pw.fi.intel.com/patch/228850/?series=15910&rev=2
>
> I wonder if that fixes the problem you are facing, then I can send
> that to drm.

CI can tell you. It has been complaining about this for ages without
anyone doing anything about it.

> [snip diff and quoted patch]
On 9/8/2022 4:55 PM, Tvrtko Ursulin wrote:
> On 08/09/2022 15:32, Das, Nirmoy wrote:
>> [snip]
>
> Yes why not, more black magic (count to three) but if it works... :)
> I also spy the general area has been a bit neglected. Like:

Not sure what the correct solution should be here. I wonder if we might
have to change this because of https://lwn.net/Articles/906975/ ?

> i915_gem_driver_remove:
>     ...
>     i915_gem_drain_workqueue
>     i915_gem_drain_freed_objects
>
> While i915_gem_drain_workqueue:
>     ...
>     i915_gem_drain_freed_objects
>
> So i915_gem_drain_freed_objects in i915_gem_driver_remove is already
> redundant.
>
> Should i915_gem_drain_freed_objects be unexported and all callers made
> to just call i915_gem_drain_workqueue after your patch? Or, if "drain
> freed objects" is considered more self-descriptive, it could be made an
> alias to i915_gem_drain_workqueue.

We are using i915_gem_drain_freed_objects() in many places and replacing
that with i915_gem_drain_workqueue() might have performance implications.

Nirmoy

> Regards,
>
> Tvrtko
>
> [snip quoted patch]
On 9/8/2022 5:11 PM, Ville Syrjälä wrote:
> On Thu, Sep 08, 2022 at 04:32:56PM +0200, Das, Nirmoy wrote:
>> Hi Ville,
>>
>> I fixed a similar issue in DII but I couldn't reproduce it in drm:
>> http://intel-gfx-pw.fi.intel.com/patch/228850/?series=15910&rev=2
>>
>> I wonder if that fixes the problem you are facing, then I can send
>> that to drm.
>
> CI can tell you. It has been complaining about this for ages without
> anyone doing anything about it.

Could you please share a URL/failed test name? I must be searching the
wrong hw/test (https://intel-gfx-ci.01.org/tree/drm-tip/fi-ivb-3770.html).

Thanks,

Nirmoy

> [snip diff and quoted patch]
On Thu, Sep 08, 2022 at 09:34:04PM +0200, Das, Nirmoy wrote:
> On 9/8/2022 5:11 PM, Ville Syrjälä wrote:
>> On Thu, Sep 08, 2022 at 04:32:56PM +0200, Das, Nirmoy wrote:
>>> [snip]
>>
>> CI can tell you. It has been complaining about this for ages
>
> Could you please share a URL/failed test name? I must be searching the
> wrong hw/test (https://intel-gfx-ci.01.org/tree/drm-tip/fi-ivb-3770.html).

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12102/fi-pnv-d510/igt@i915_selftest@live@requests.html

gen3 hits it nearly 100% of the time in the live selftests. IIRC it
also looked like shard-snb has hit it more sporadically in some tests.

There's also at least one bug about it:
https://gitlab.freedesktop.org/drm/intel/-/issues/4528
which I guess is why CI is able to ignore this, and that then has
enabled all the humans to ignore it as well.
On Fri, Sep 09, 2022 at 10:29:55AM +0300, Ville Syrjälä wrote:
> On Thu, Sep 08, 2022 at 09:34:04PM +0200, Das, Nirmoy wrote:
>> [snip]
>
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12102/fi-pnv-d510/igt@i915_selftest@live@requests.html
>
> gen3 hits it nearly 100% of the time in the live selftests. IIRC it
> also looked like shard-snb has hit it more sporadically in some tests.
>
> There's also at least one bug about it:
> https://gitlab.freedesktop.org/drm/intel/-/issues/4528
> which I guess is why CI is able to ignore this, and that then has
> enabled all the humans to ignore it as well.

Did we get anywhere with this? I'm getting tired of seeing the red in CI.
On 9/21/2022 9:56 AM, Ville Syrjälä wrote:
> On Fri, Sep 09, 2022 at 10:29:55AM +0300, Ville Syrjälä wrote:
>> [snip]
>
> Did we get anywhere with this? I'm getting tired of seeing the red in
> CI.

Please review:
https://patchwork.freedesktop.org/patch/501991/?series=108314&rev=1

Sorry, I should've Cc'd you. It seems to fix the issue:
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_108314v1/fi-pnv-d510/igt@i915_selftest@live@requests.html

Regards,

Nirmoy
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 0f49ec9d494a..5b61f7ad6473 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1098,6 +1098,7 @@ void i915_gem_drain_freed_objects(struct drm_i915_private *i915)
 		flush_delayed_work(&i915->bdev.wq);
 		rcu_barrier();
 	}
+	synchronize_rcu();
 }
 
 /*