diff mbox

Do not loop on ERESTARTSYS using interruptible waits

Message ID 163000764.1114319.1432029294390.JavaMail.zimbra@redhat.com (mailing list archive)
State New, archived
Headers show

Commit Message

Frediano Ziglio May 19, 2015, 9:54 a.m. UTC
This problem happens using KMS surfaces and QXL driver.
To easy reproduce use KDE Plasma (which use surfaces a lot) and assure
you are using KMS surfaces (QXL driver on Fedora/RedHat has a patch to
stop using them). Open some complex application like LibreOffice and
after a while your machine get stuck using 100% CPU on Xorg.
The problem occurs as creating new surfaces not interruptible wait
are used however instead of returning ERESTARTSYS back to userspace
you try to loop but wait routines always keep returning ERESTARTSYS
once the signal is marked.
On out of memory conditions TTM module try to move objects to system
memory and QXL assure surface is updated before the move.
The fix handle differently this case using no interruptible wait so
wait functions will wait instead of returning ERESTARTSYS.
Note the when the loop occurs driver will send a lot of update requests
causing more CPU usage on Qemu side too.

Signed-off-by: Frediano Ziglio <fziglio@redhat.com>
---
 qxl/qxl_cmd.c   | 12 +++---------
 qxl/qxl_drv.h   |  2 +-
 qxl/qxl_ioctl.c |  2 +-
 3 files changed, 5 insertions(+), 11 deletions(-)

Comments

Christophe Fergeau May 22, 2015, 11:58 a.m. UTC | #1
Hey,

On Tue, May 19, 2015 at 05:54:54AM -0400, Frediano Ziglio wrote:
> This problem happens using KMS surfaces and QXL driver.
> To easy reproduce use KDE Plasma (which use surfaces a lot) and assure
> you are using KMS surfaces (QXL driver on Fedora/RedHat has a patch to
> stop using them). Open some complex application like LibreOffice and
> after a while your machine get stuck using 100% CPU on Xorg.
> The problem occurs as creating new surfaces not interruptible wait
> are used however instead of returning ERESTARTSYS back to userspace
> you try to loop but wait routines always keep returning ERESTARTSYS
> once the signal is marked.
> On out of memory conditions TTM module try to move objects to system
> memory and QXL assure surface is updated before the move.
> The fix handle differently this case using no interruptible wait so
> wait functions will wait instead of returning ERESTARTSYS.
> Note the when the loop occurs driver will send a lot of update requests
> causing more CPU usage on Qemu side too.
> 
> Signed-off-by: Frediano Ziglio <fziglio@redhat.com>
> ---
>  qxl/qxl_cmd.c   | 12 +++---------
>  qxl/qxl_drv.h   |  2 +-
>  qxl/qxl_ioctl.c |  2 +-
>  3 files changed, 5 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/gpu/drm/drivers/gpu/drm/qxl/qxl_cmd.c b/qxl/qxl_cmd.c
> index 9782364..bd5404e 100644
> --- a/drivers/gpu/drm/qxl/qxl_cmd.c
> +++ b/drivers/gpu/drm/qxl/qxl_cmd.c
> @@ -317,14 +317,11 @@ static void wait_for_io_cmd(struct qxl_device *qdev, uint8_t val, long port)
>  {
>  	int ret;
>  
> -restart:
>  	ret = wait_for_io_cmd_user(qdev, val, port, false);
> -	if (ret == -ERESTARTSYS)
> -		goto restart;

I think this one is not directly related to the fix, but can be removed
because wait_for_io_cmd_user(qdev, val, port, false); will call
wait_event_timeout() which cannot return ERESTARTSYS? Or was this loop
causing issues too?

>  }
>  
>  int qxl_io_update_area(struct qxl_device *qdev, struct qxl_bo *surf,
> -			const struct qxl_rect *area)
> +			const struct qxl_rect *area, bool intr)
>  {
>  	int surface_id;
>  	uint32_t surface_width, surface_height;
> @@ -350,7 +347,7 @@ int qxl_io_update_area(struct qxl_device *qdev, struct qxl_bo *surf,
>  	mutex_lock(&qdev->update_area_mutex);
>  	qdev->ram_header->update_area = *area;
>  	qdev->ram_header->update_surface = surface_id;
> -	ret = wait_for_io_cmd_user(qdev, 0, QXL_IO_UPDATE_AREA_ASYNC, true);
> +	ret = wait_for_io_cmd_user(qdev, 0, QXL_IO_UPDATE_AREA_ASYNC, intr);
>  	mutex_unlock(&qdev->update_area_mutex);
>  	return ret;
>  }
> @@ -588,10 +585,7 @@ int qxl_update_surface(struct qxl_device *qdev, struct qxl_bo *surf)
>  	rect.right = surf->surf.width;
>  	rect.top = 0;
>  	rect.bottom = surf->surf.height;
> -retry:
> -	ret = qxl_io_update_area(qdev, surf, &rect);
> -	if (ret == -ERESTARTSYS)
> -		goto retry;
> +	ret = qxl_io_update_area(qdev, surf, &rect, false);

My understanding is that the fix is this hunk? If so, this could be made
more obvious with an intermediate commit adding the 'bool intr' arg to
qxl_io_update_area and only calling it with 'true' in the appropriate
places.
This code path is only triggered from qxl_surface_evict() which I assume
is not necessarily easily interruptible, so this change makes sense to
me. However it would be much better to get a review from Dave Airlie ;)

Christophe
Frediano Ziglio May 22, 2015, 1:58 p.m. UTC | #2
> 
> Hey,
> 
> On Tue, May 19, 2015 at 05:54:54AM -0400, Frediano Ziglio wrote:
> > This problem happens using KMS surfaces and QXL driver.
> > To easy reproduce use KDE Plasma (which use surfaces a lot) and assure
> > you are using KMS surfaces (QXL driver on Fedora/RedHat has a patch to
> > stop using them). Open some complex application like LibreOffice and
> > after a while your machine get stuck using 100% CPU on Xorg.
> > The problem occurs as creating new surfaces not interruptible wait
> > are used however instead of returning ERESTARTSYS back to userspace
> > you try to loop but wait routines always keep returning ERESTARTSYS
> > once the signal is marked.
> > On out of memory conditions TTM module try to move objects to system
> > memory and QXL assure surface is updated before the move.
> > The fix handle differently this case using no interruptible wait so
> > wait functions will wait instead of returning ERESTARTSYS.
> > Note the when the loop occurs driver will send a lot of update requests
> > causing more CPU usage on Qemu side too.
> > 
> > Signed-off-by: Frediano Ziglio <fziglio@redhat.com>
> > ---
> >  qxl/qxl_cmd.c   | 12 +++---------
> >  qxl/qxl_drv.h   |  2 +-
> >  qxl/qxl_ioctl.c |  2 +-
> >  3 files changed, 5 insertions(+), 11 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/drivers/gpu/drm/qxl/qxl_cmd.c b/qxl/qxl_cmd.c
> > index 9782364..bd5404e 100644
> > --- a/drivers/gpu/drm/qxl/qxl_cmd.c
> > +++ b/drivers/gpu/drm/qxl/qxl_cmd.c
> > @@ -317,14 +317,11 @@ static void wait_for_io_cmd(struct qxl_device *qdev,
> > uint8_t val, long port)
> >  {
> >  	int ret;
> >  
> > -restart:
> >  	ret = wait_for_io_cmd_user(qdev, val, port, false);
> > -	if (ret == -ERESTARTSYS)
> > -		goto restart;
> 
> I think this one is not directly related to the fix, but can be removed
> because wait_for_io_cmd_user(qdev, val, port, false); will call
> wait_event_timeout() which cannot return ERESTARTSYS? Or was this loop
> causing issues too?
> 

Yes, but it has the same issue. Try till ERESTARTSYS are gone.
Currently perhaps not broken but prone to have same issue.

> >  }
> >  
> >  int qxl_io_update_area(struct qxl_device *qdev, struct qxl_bo *surf,
> > -			const struct qxl_rect *area)
> > +			const struct qxl_rect *area, bool intr)
> >  {
> >  	int surface_id;
> >  	uint32_t surface_width, surface_height;
> > @@ -350,7 +347,7 @@ int qxl_io_update_area(struct qxl_device *qdev, struct
> > qxl_bo *surf,
> >  	mutex_lock(&qdev->update_area_mutex);
> >  	qdev->ram_header->update_area = *area;
> >  	qdev->ram_header->update_surface = surface_id;
> > -	ret = wait_for_io_cmd_user(qdev, 0, QXL_IO_UPDATE_AREA_ASYNC, true);
> > +	ret = wait_for_io_cmd_user(qdev, 0, QXL_IO_UPDATE_AREA_ASYNC, intr);
> >  	mutex_unlock(&qdev->update_area_mutex);
> >  	return ret;
> >  }
> > @@ -588,10 +585,7 @@ int qxl_update_surface(struct qxl_device *qdev, struct
> > qxl_bo *surf)
> >  	rect.right = surf->surf.width;
> >  	rect.top = 0;
> >  	rect.bottom = surf->surf.height;
> > -retry:
> > -	ret = qxl_io_update_area(qdev, surf, &rect);
> > -	if (ret == -ERESTARTSYS)
> > -		goto retry;
> > +	ret = qxl_io_update_area(qdev, surf, &rect, false);
> 
> My understanding is that the fix is this hunk? If so, this could be made
> more obvious with an intermediate commit adding the 'bool intr' arg to
> qxl_io_update_area and only calling it with 'true' in the appropriate
> places.
> This code path is only triggered from qxl_surface_evict() which I assume
> is not necessarily easily interruptible, so this change makes sense to
> me. However it would be much better to get a review from Dave Airlie ;)
> 
> Christophe
> 

Are you asking if just removing the loop would fix the issue?
So you are proposing a first patch that add the argument always passing true and another that change some calls to false? It make sense but still the loop should be removed.

Frediano
Christophe Fergeau May 26, 2015, 8:20 a.m. UTC | #3
Hey,

On Fri, May 22, 2015 at 09:58:10AM -0400, Frediano Ziglio wrote:
> > >  }
> > >  
> > >  int qxl_io_update_area(struct qxl_device *qdev, struct qxl_bo *surf,
> > > -			const struct qxl_rect *area)
> > > +			const struct qxl_rect *area, bool intr)
> > >  {
> > >  	int surface_id;
> > >  	uint32_t surface_width, surface_height;
> > > @@ -350,7 +347,7 @@ int qxl_io_update_area(struct qxl_device *qdev, struct
> > > qxl_bo *surf,
> > >  	mutex_lock(&qdev->update_area_mutex);
> > >  	qdev->ram_header->update_area = *area;
> > >  	qdev->ram_header->update_surface = surface_id;
> > > -	ret = wait_for_io_cmd_user(qdev, 0, QXL_IO_UPDATE_AREA_ASYNC, true);
> > > +	ret = wait_for_io_cmd_user(qdev, 0, QXL_IO_UPDATE_AREA_ASYNC, intr);
> > >  	mutex_unlock(&qdev->update_area_mutex);
> > >  	return ret;
> > >  }
> > > @@ -588,10 +585,7 @@ int qxl_update_surface(struct qxl_device *qdev, struct
> > > qxl_bo *surf)
> > >  	rect.right = surf->surf.width;
> > >  	rect.top = 0;
> > >  	rect.bottom = surf->surf.height;
> > > -retry:
> > > -	ret = qxl_io_update_area(qdev, surf, &rect);
> > > -	if (ret == -ERESTARTSYS)
> > > -		goto retry;
> > > +	ret = qxl_io_update_area(qdev, surf, &rect, false);
> > 
> > My understanding is that the fix is this hunk? If so, this could be made
> > more obvious with an intermediate commit adding the 'bool intr' arg to
> > qxl_io_update_area and only calling it with 'true' in the appropriate
> > places.
> > This code path is only triggered from qxl_surface_evict() which I assume
> > is not necessarily easily interruptible, so this change makes sense to
> > me. However it would be much better to get a review from Dave Airlie ;)
> > 
> > Christophe
> > 
> 
> Are you asking if just removing the loop would fix the issue?
> So you are proposing a first patch that add the argument always
> passing true and another that change some calls to false? It make
> sense but still the loop should be removed.

Sorry, I was not clear, removing the loop is fine, I was not suggesting
to keep it.

Christophe
Dave Airlie May 28, 2015, 3:38 a.m. UTC | #4
On 19 May 2015 at 19:54, Frediano Ziglio <fziglio@redhat.com> wrote:
> This problem happens using KMS surfaces and QXL driver.
> To easy reproduce use KDE Plasma (which use surfaces a lot) and assure
> you are using KMS surfaces (QXL driver on Fedora/RedHat has a patch to
> stop using them). Open some complex application like LibreOffice and
> after a while your machine get stuck using 100% CPU on Xorg.
> The problem occurs as creating new surfaces not interruptible wait
> are used however instead of returning ERESTARTSYS back to userspace
> you try to loop but wait routines always keep returning ERESTARTSYS
> once the signal is marked.
> On out of memory conditions TTM module try to move objects to system
> memory and QXL assure surface is updated before the move.
> The fix handle differently this case using no interruptible wait so
> wait functions will wait instead of returning ERESTARTSYS.
> Note the when the loop occurs driver will send a lot of update requests
> causing more CPU usage on Qemu side too.

I actually don't think we should be enabling surfaces upstream, I
don't mind fixing
the kernel driver to not be crap, but I really don't think surfaces
really help the
SPICE protocol.

I should h ave pushed the disable surface in all cases upstream, feel free to do
so, they were a bad experiment, and nobody ever showed they were faster, or
at least when they hit eviction paths they didn't plummet down the
side of a massive
cliff.

The reason this loops in -ERESTARTSYS is that the hw craps itself if you try
and redo an operation, so you can't go back out to userspace and re-enter the
kernel, just one of many bad design points in the QXL hw.

you should probably drop wait_for_io_cmd completely.

Dave.

>
> Signed-off-by: Frediano Ziglio <fziglio@redhat.com>
> ---
>  qxl/qxl_cmd.c   | 12 +++---------
>  qxl/qxl_drv.h   |  2 +-
>  qxl/qxl_ioctl.c |  2 +-
>  3 files changed, 5 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/gpu/drm/drivers/gpu/drm/qxl/qxl_cmd.c b/qxl/qxl_cmd.c
> index 9782364..bd5404e 100644
> --- a/drivers/gpu/drm/qxl/qxl_cmd.c
> +++ b/drivers/gpu/drm/qxl/qxl_cmd.c
> @@ -317,14 +317,11 @@ static void wait_for_io_cmd(struct qxl_device *qdev, uint8_t val, long port)
>  {
>         int ret;
>
> -restart:
>         ret = wait_for_io_cmd_user(qdev, val, port, false);
> -       if (ret == -ERESTARTSYS)
> -               goto restart;
>  }
>
>  int qxl_io_update_area(struct qxl_ddevice *qdev, struct qxl_bo *surf,
> -                       const struct qxl_rect *area)
> +                       const struct qxl_rect *area, bool intr)
>  {
>         int surface_id;
>         uint32_t surface_width, surface_height;
> @@ -350,7 +347,7 @@ int qxl_io_update_area(struct qxl_device *qdev, struct qxl_bo *surf,
>         mutex_lock(&qdev->update_area_mutex);
>         qdev->ram_header->update_area = *area;
>         qdev->ram_header->update_surface = surface_id;
> -       ret = wait_for_io_cmd_user(qdev, 0, QXL_IO_UPDATE_AREA_ASYNC, true);
> +       ret = wait_for_io_cmd_user(qdev, 0, QXL_IO_UPDATE_AREA_ASYNC, intr);
>         mutex_unlock(&qdev->update_area_mutex);
>         return ret;
>  }
> @@ -588,10 +585,7 @@ int qxl_update_surface(struct qxl_device *qdev, struct qxl_bo *surf)
>         rect.right = surf->surf.width;
>         rect.top = 0;
>         rect.bottom = surf->surf.height;
> -retry:
> -       ret = qxl_io_update_area(qdev, surf, &rect);
> -       if (ret == -ERESTARTSYS)
> -               goto retry;
> +       ret = qxl_io_update_area(qdev, surf, &rect, false);
>         return ret;
>  }
>
> diff --git a/drivers/gpu/drm/drivers/gpu/drm/qxl/qxl_drv.h b/qxl/qxl_drv.h
> index 7c6cafe..6745c44 100644
> --- a/drivers/gpu/drm/qxl/qxl_drv.h
> +++ b/drivers/gpu/drm/qxl/qxl_drv.h
> @@ -462,7 +462,7 @@ void qxl_io_memslot_add(struct qxl_device *qdev, uint8_t id);
>  void qxl_io_notify_oom(struct qxl_device *qdev);
>
>  int qxl_io_update_area(struct qxl_device *qdev, struct qxl_bo *surf,
> -                      const struct qxl_rect *area);
> +                      const struct qxl_rect *area, bool intr);
>
>  void qxl_io_reset(struct qxl_device *qdev);
>  void qxl_io_monitors_config(struct qxl_device *qdev);
> diff --git a/drivers/gpu/drm/qxl/qxl_ioctl.c b/qxl/qxl_ioctl.c
> index b110883..afd7297 100644
> --- a/drivers/gpu/drm/qxl/qxl_ioctl.c
> +++ b/drivers/gpu/drm/qxl/qxl_ioctl.c
> @@ -348,7 +348,7 @@ static int qxl_update_area_ioctl(struct drm_device *dev, void *data,
>                 goto out2;
>         if (!qobj->surface_id)
>                 DRM_ERROR("got update area for surface with no id %d\n", update_area->handle);
> -       ret = qxl_io_update_area(qdev, qobj, &area);
> +       ret = qxl_io_update_area(qdev, qobj, &area, true);
>
>  out2:
>         qxl_bo_unreserve(qobj);
> --
> 2.1.0
> _______________________________________________
> Spice-devel mailing list
> Spice-devel@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/spice-devel
diff mbox

Patch

diff --git a/drivers/gpu/drm/drivers/gpu/drm/qxl/qxl_cmd.c b/qxl/qxl_cmd.c
index 9782364..bd5404e 100644
--- a/drivers/gpu/drm/qxl/qxl_cmd.c
+++ b/drivers/gpu/drm/qxl/qxl_cmd.c
@@ -317,14 +317,11 @@  static void wait_for_io_cmd(struct qxl_device *qdev, uint8_t val, long port)
 {
 	int ret;
 
-restart:
 	ret = wait_for_io_cmd_user(qdev, val, port, false);
-	if (ret == -ERESTARTSYS)
-		goto restart;
 }
 
 int qxl_io_update_area(struct qxl_device *qdev, struct qxl_bo *surf,
-			const struct qxl_rect *area)
+			const struct qxl_rect *area, bool intr)
 {
 	int surface_id;
 	uint32_t surface_width, surface_height;
@@ -350,7 +347,7 @@  int qxl_io_update_area(struct qxl_device *qdev, struct qxl_bo *surf,
 	mutex_lock(&qdev->update_area_mutex);
 	qdev->ram_header->update_area = *area;
 	qdev->ram_header->update_surface = surface_id;
-	ret = wait_for_io_cmd_user(qdev, 0, QXL_IO_UPDATE_AREA_ASYNC, true);
+	ret = wait_for_io_cmd_user(qdev, 0, QXL_IO_UPDATE_AREA_ASYNC, intr);
 	mutex_unlock(&qdev->update_area_mutex);
 	return ret;
 }
@@ -588,10 +585,7 @@  int qxl_update_surface(struct qxl_device *qdev, struct qxl_bo *surf)
 	rect.right = surf->surf.width;
 	rect.top = 0;
 	rect.bottom = surf->surf.height;
-retry:
-	ret = qxl_io_update_area(qdev, surf, &rect);
-	if (ret == -ERESTARTSYS)
-		goto retry;
+	ret = qxl_io_update_area(qdev, surf, &rect, false);
 	return ret;
 }
 
diff --git a/drivers/gpu/drm/drivers/gpu/drm/qxl/qxl_drv.h b/qxl/qxl_drv.h
index 7c6cafe..6745c44 100644
--- a/drivers/gpu/drm/qxl/qxl_drv.h
+++ b/drivers/gpu/drm/qxl/qxl_drv.h
@@ -462,7 +462,7 @@  void qxl_io_memslot_add(struct qxl_device *qdev, uint8_t id);
 void qxl_io_notify_oom(struct qxl_device *qdev);
 
 int qxl_io_update_area(struct qxl_device *qdev, struct qxl_bo *surf,
-		       const struct qxl_rect *area);
+		       const struct qxl_rect *area, bool intr);
 
 void qxl_io_reset(struct qxl_device *qdev);
 void qxl_io_monitors_config(struct qxl_device *qdev);
diff --git a/drivers/gpu/drm/qxl/qxl_ioctl.c b/qxl/qxl_ioctl.c
index b110883..afd7297 100644
--- a/drivers/gpu/drm/qxl/qxl_ioctl.c
+++ b/drivers/gpu/drm/qxl/qxl_ioctl.c
@@ -348,7 +348,7 @@  static int qxl_update_area_ioctl(struct drm_device *dev, void *data,
 		goto out2;
 	if (!qobj->surface_id)
 		DRM_ERROR("got update area for surface with no id %d\n", update_area->handle);
-	ret = qxl_io_update_area(qdev, qobj, &area);
+	ret = qxl_io_update_area(qdev, qobj, &area, true);
 
 out2:
 	qxl_bo_unreserve(qobj);