diff mbox series

nouveau: explicitly wait on the fence in nouveau_bo_move_m2mf

Message ID 20220819200928.401416-1-kherbst@redhat.com (mailing list archive)
State New, archived
Headers show
Series nouveau: explicitly wait on the fence in nouveau_bo_move_m2mf | expand

Commit Message

Karol Herbst Aug. 19, 2022, 8:09 p.m. UTC
It is a bit unlcear to us why that's helping, but it does and unbreaks
suspend/resume on a lot of GPUs without any known drawbacks.

Cc: stable@vger.kernel.org # v5.15+
Closes: https://gitlab.freedesktop.org/drm/nouveau/-/issues/156
Signed-off-by: Karol Herbst <kherbst@redhat.com>
---
 drivers/gpu/drm/nouveau/nouveau_bo.c | 9 +++++++++
 1 file changed, 9 insertions(+)

Comments

Lyude Paul Aug. 22, 2022, 9:15 p.m. UTC | #1
Reviewed-by: Lyude Paul <lyude@redhat.com>

On Fri, 2022-08-19 at 22:09 +0200, Karol Herbst wrote:
> It is a bit unlcear to us why that's helping, but it does and unbreaks
> suspend/resume on a lot of GPUs without any known drawbacks.
> 
> Cc: stable@vger.kernel.org # v5.15+
> Closes: https://gitlab.freedesktop.org/drm/nouveau/-/issues/156
> Signed-off-by: Karol Herbst <kherbst@redhat.com>
> ---
>  drivers/gpu/drm/nouveau/nouveau_bo.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
> index 35bb0bb3fe61..126b3c6e12f9 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> @@ -822,6 +822,15 @@ nouveau_bo_move_m2mf(struct ttm_buffer_object *bo, int evict,
>  		if (ret == 0) {
>  			ret = nouveau_fence_new(chan, false, &fence);
>  			if (ret == 0) {
> +				/* TODO: figure out a better solution here
> +				 *
> +				 * wait on the fence here explicitly as going through
> +				 * ttm_bo_move_accel_cleanup somehow doesn't seem to do it.
> +				 *
> +				 * Without this the operation can timeout and we'll fallback to a
> +				 * software copy, which might take several minutes to finish.
> +				 */
> +				nouveau_fence_wait(fence, false, false);
>  				ret = ttm_bo_move_accel_cleanup(bo,
>  								&fence->base,
>  								evict, false,
Salvatore Bonaccorso Sept. 20, 2022, 10:42 a.m. UTC | #2
Hi,

On Fri, Aug 19, 2022 at 10:09:28PM +0200, Karol Herbst wrote:
> It is a bit unlcear to us why that's helping, but it does and unbreaks
> suspend/resume on a lot of GPUs without any known drawbacks.
> 
> Cc: stable@vger.kernel.org # v5.15+
> Closes: https://gitlab.freedesktop.org/drm/nouveau/-/issues/156
> Signed-off-by: Karol Herbst <kherbst@redhat.com>
> ---
>  drivers/gpu/drm/nouveau/nouveau_bo.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
> index 35bb0bb3fe61..126b3c6e12f9 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> @@ -822,6 +822,15 @@ nouveau_bo_move_m2mf(struct ttm_buffer_object *bo, int evict,
>  		if (ret == 0) {
>  			ret = nouveau_fence_new(chan, false, &fence);
>  			if (ret == 0) {
> +				/* TODO: figure out a better solution here
> +				 *
> +				 * wait on the fence here explicitly as going through
> +				 * ttm_bo_move_accel_cleanup somehow doesn't seem to do it.
> +				 *
> +				 * Without this the operation can timeout and we'll fallback to a
> +				 * software copy, which might take several minutes to finish.
> +				 */
> +				nouveau_fence_wait(fence, false, false);
>  				ret = ttm_bo_move_accel_cleanup(bo,
>  								&fence->base,
>  								evict, false,
> -- 
> 2.37.1
> 
> 

While this is marked for 5.15+ only, a user in Debian was seeing the
suspend issue as well on 5.10.y and did confirm the commit fixes the
issue as well in the 5.10.y series:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=989705#69

Karol, Lyude, should that as well be picked for 5.10.y?

Regards,
Salvatore
Karol Herbst Sept. 20, 2022, 11:36 a.m. UTC | #3
On Tue, Sep 20, 2022 at 12:42 PM Salvatore Bonaccorso <carnil@debian.org> wrote:
>
> Hi,
>
> On Fri, Aug 19, 2022 at 10:09:28PM +0200, Karol Herbst wrote:
> > It is a bit unlcear to us why that's helping, but it does and unbreaks
> > suspend/resume on a lot of GPUs without any known drawbacks.
> >
> > Cc: stable@vger.kernel.org # v5.15+
> > Closes: https://gitlab.freedesktop.org/drm/nouveau/-/issues/156
> > Signed-off-by: Karol Herbst <kherbst@redhat.com>
> > ---
> >  drivers/gpu/drm/nouveau/nouveau_bo.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
> > index 35bb0bb3fe61..126b3c6e12f9 100644
> > --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> > +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> > @@ -822,6 +822,15 @@ nouveau_bo_move_m2mf(struct ttm_buffer_object *bo, int evict,
> >               if (ret == 0) {
> >                       ret = nouveau_fence_new(chan, false, &fence);
> >                       if (ret == 0) {
> > +                             /* TODO: figure out a better solution here
> > +                              *
> > +                              * wait on the fence here explicitly as going through
> > +                              * ttm_bo_move_accel_cleanup somehow doesn't seem to do it.
> > +                              *
> > +                              * Without this the operation can timeout and we'll fallback to a
> > +                              * software copy, which might take several minutes to finish.
> > +                              */
> > +                             nouveau_fence_wait(fence, false, false);
> >                               ret = ttm_bo_move_accel_cleanup(bo,
> >                                                               &fence->base,
> >                                                               evict, false,
> > --
> > 2.37.1
> >
> >
>
> While this is marked for 5.15+ only, a user in Debian was seeing the
> suspend issue as well on 5.10.y and did confirm the commit fixes the
> issue as well in the 5.10.y series:
>
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=989705#69
>
> Karol, Lyude, should that as well be picked for 5.10.y?
>

mhh from the original report 5.10 was fine, but maybe something got
backported and it broke it? I'll try to do some testing on my machine
and see what I can figure out, but it could also be a debian only
issue at this point.

> Regards,
> Salvatore
>
Salvatore Bonaccorso Sept. 20, 2022, 11:59 a.m. UTC | #4
Hi,

On Tue, Sep 20, 2022 at 01:36:32PM +0200, Karol Herbst wrote:
> On Tue, Sep 20, 2022 at 12:42 PM Salvatore Bonaccorso <carnil@debian.org> wrote:
> >
> > Hi,
> >
> > On Fri, Aug 19, 2022 at 10:09:28PM +0200, Karol Herbst wrote:
> > > It is a bit unlcear to us why that's helping, but it does and unbreaks
> > > suspend/resume on a lot of GPUs without any known drawbacks.
> > >
> > > Cc: stable@vger.kernel.org # v5.15+
> > > Closes: https://gitlab.freedesktop.org/drm/nouveau/-/issues/156
> > > Signed-off-by: Karol Herbst <kherbst@redhat.com>
> > > ---
> > >  drivers/gpu/drm/nouveau/nouveau_bo.c | 9 +++++++++
> > >  1 file changed, 9 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
> > > index 35bb0bb3fe61..126b3c6e12f9 100644
> > > --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> > > +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> > > @@ -822,6 +822,15 @@ nouveau_bo_move_m2mf(struct ttm_buffer_object *bo, int evict,
> > >               if (ret == 0) {
> > >                       ret = nouveau_fence_new(chan, false, &fence);
> > >                       if (ret == 0) {
> > > +                             /* TODO: figure out a better solution here
> > > +                              *
> > > +                              * wait on the fence here explicitly as going through
> > > +                              * ttm_bo_move_accel_cleanup somehow doesn't seem to do it.
> > > +                              *
> > > +                              * Without this the operation can timeout and we'll fallback to a
> > > +                              * software copy, which might take several minutes to finish.
> > > +                              */
> > > +                             nouveau_fence_wait(fence, false, false);
> > >                               ret = ttm_bo_move_accel_cleanup(bo,
> > >                                                               &fence->base,
> > >                                                               evict, false,
> > > --
> > > 2.37.1
> > >
> > >
> >
> > While this is marked for 5.15+ only, a user in Debian was seeing the
> > suspend issue as well on 5.10.y and did confirm the commit fixes the
> > issue as well in the 5.10.y series:
> >
> > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=989705#69
> >
> > Karol, Lyude, should that as well be picked for 5.10.y?
> >
> 
> mhh from the original report 5.10 was fine, but maybe something got
> backported and it broke it? I'll try to do some testing on my machine
> and see what I can figure out, but it could also be a debian only
> issue at this point.

Right, this is a possiblity, thanks for looking into it!

Computer Enthusiastic, can you verify the problem as well in a
non-Debian patched upstream kernel directly from the 5.10.y series
(latest 5.10.144) and verify the fix there?

Regards,
Salvatore
Computer Enthusiastic Sept. 30, 2022, 9:09 p.m. UTC | #5
Hello,

Il giorno mar 20 set 2022 alle ore 13:59 Salvatore Bonaccorso
<carnil@debian.org> ha scritto:
[..]
> Computer Enthusiastic, can you verify the problem as well in a
> non-Debian patched upstream kernel directly from the 5.10.y series
> (latest 5.10.144) and verify the fix there?
>
> Regards,
> Salvatore

I've tested the vanilla kernel 5.10.145 (it was the latest one week
ago) without Debian kernel patches, but using the kernel config file
from the latest kernel for Debian Stable:
- without the Karol's patch: it always fails both suspend to ram and
hibernate to disk with the usual behavior (a very long time to suspend
or hibernate, then it fails on resume with a garbled screen)
- with the Karol's patch: it succeeds both suspend and hibernate and
it correctly resumes afterwards.

The kernel was tested using the following graphic adapter:
Graphics:  Device-1: NVIDIA G96CM [GeForce 9600M GT] driver: nouveau v: kernel
          Device-2: Suyin Acer HD Crystal Eye webcam type: USB driver:
uvcvideo
          Display: x11 server: X.Org 1.20.11 driver: loaded:
modesetting unloaded: fbdev,vesa
          resolution: 1280x800~60Hz
          OpenGL: renderer: NV96 v: 3.3 Mesa 20.3.5

Therefore, 5.10.y series of the kernel need to be patched to work
correctly at least with the aforementioned graphic card.

The script I used to compile the kernel are attached for further
reference and verification.

Hope that helps.
Computer Enthusiastic Nov. 19, 2022, 5:20 a.m. UTC | #6
Hello,

Il giorno ven 19 ago 2022 alle ore 22:09 Karol Herbst
<kherbst@redhat.com> ha scritto:
>
> It is a bit unlcear to us why that's helping, but it does and unbreaks
> suspend/resume on a lot of GPUs without any known drawbacks.
>
> Cc: stable@vger.kernel.org # v5.15+
> Closes: https://gitlab.freedesktop.org/drm/nouveau/-/issues/156
> Signed-off-by: Karol Herbst <kherbst@redhat.com>
> ---
>  drivers/gpu/drm/nouveau/nouveau_bo.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>
> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
> index 35bb0bb3fe61..126b3c6e12f9 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> @@ -822,6 +822,15 @@ nouveau_bo_move_m2mf(struct ttm_buffer_object *bo, int evict,
>                 if (ret == 0) {
>                         ret = nouveau_fence_new(chan, false, &fence);
>                         if (ret == 0) {
> +                               /* TODO: figure out a better solution here
> +                                *
> +                                * wait on the fence here explicitly as going through
> +                                * ttm_bo_move_accel_cleanup somehow doesn't seem to do it.
> +                                *
> +                                * Without this the operation can timeout and we'll fallback to a
> +                                * software copy, which might take several minutes to finish.
> +                                */
> +                               nouveau_fence_wait(fence, false, false);
>                                 ret = ttm_bo_move_accel_cleanup(bo,
>                                                                 &fence->base,
>                                                                 evict, false,
> --
> 2.37.1
>

Could it be possible to make land the aforementioned patch to the
5.10.x kernel version ? It is currently for >= 5.15.x kernel version
only.

Thanks.
diff mbox series

Patch

diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
index 35bb0bb3fe61..126b3c6e12f9 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -822,6 +822,15 @@  nouveau_bo_move_m2mf(struct ttm_buffer_object *bo, int evict,
 		if (ret == 0) {
 			ret = nouveau_fence_new(chan, false, &fence);
 			if (ret == 0) {
+				/* TODO: figure out a better solution here
+				 *
+				 * wait on the fence here explicitly as going through
+				 * ttm_bo_move_accel_cleanup somehow doesn't seem to do it.
+				 *
+				 * Without this the operation can timeout and we'll fallback to a
+				 * software copy, which might take several minutes to finish.
+				 */
+				nouveau_fence_wait(fence, false, false);
 				ret = ttm_bo_move_accel_cleanup(bo,
 								&fence->base,
 								evict, false,