diff mbox series

[11/11] drm/amdgpu: stop removing BOs from the LRU during CS

Message ID 20190514123127.1650-11-christian.koenig@amd.com (mailing list archive)
State New, archived
Headers show
Series [01/11] drm/ttm: Make LRU removal optional. | expand

Commit Message

Christian König May 14, 2019, 12:31 p.m. UTC
This avoids OOM situations when we have lots of threads
submitting at the same time.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Chunming Zhou May 14, 2019, 1:12 p.m. UTC | #1
my only concern is how to fresh LRU when bo is from bo list.

-David

-------- Original Message --------
Subject: [PATCH 11/11] drm/amdgpu: stop removing BOs from the LRU during CS
From: Christian König
To: "Olsak, Marek" ,"Zhou, David(ChunMing)" ,"Liang, Prike" ,dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org
CC:

[CAUTION: External Email]

This avoids OOM situations when we have lots of threads
submitting at the same time.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index fff558cf385b..f9240a94217b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -648,7 +648,7 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
        }

        r = ttm_eu_reserve_buffers(&p->ticket, &p->validated, true,
-                                  &duplicates, true);
+                                  &duplicates, false);
        if (unlikely(r != 0)) {
                if (r != -ERESTARTSYS)
                        DRM_ERROR("ttm_eu_reserve_buffers failed.\n");
--
2.17.1
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="Generator" content="Microsoft Exchange Server">
<!-- converted from text --><style><!-- .EmailQuote { margin-left: 1pt; padding-left: 4pt; border-left: #800000 2px solid; } --></style>
</head>
<body>
<div>my only concern is how to fresh LRU when bo is from bo list.<br>
<br>
-David<br>
<br>
-------- Original Message --------<br>
Subject: [PATCH 11/11] drm/amdgpu: stop removing BOs from the LRU during CS<br>
From: Christian König <br>
To: &quot;Olsak, Marek&quot; ,&quot;Zhou, David(ChunMing)&quot; ,&quot;Liang, Prike&quot; ,dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org<br>
CC: <br>
<br>
</div>
<font size="2"><span style="font-size:11pt;">
<div class="PlainText">[CAUTION: External Email]<br>
<br>
This avoids OOM situations when we have lots of threads<br>
submitting at the same time.<br>
<br>
Signed-off-by: Christian König &lt;christian.koenig@amd.com&gt;<br>
---<br>
&nbsp;drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 &#43;-<br>
&nbsp;1 file changed, 1 insertion(&#43;), 1 deletion(-)<br>
<br>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c<br>
index fff558cf385b..f9240a94217b 100644<br>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c<br>
&#43;&#43;&#43; b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c<br>
@@ -648,7 &#43;648,7 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br>
<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; r = ttm_eu_reserve_buffers(&amp;p-&gt;ticket, &amp;p-&gt;validated, true,<br>
-&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &amp;duplicates, true);<br>
&#43;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &amp;duplicates, false);<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (unlikely(r != 0)) {<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (r != -ERESTARTSYS)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; DRM_ERROR(&quot;ttm_eu_reserve_buffers failed.\n&quot;);<br>
--<br>
2.17.1<br>
<br>
</div>
</span></font>
</body>
</html>
Christian König May 14, 2019, 1:47 p.m. UTC | #2
Hui? What do you mean with that?

Christian.

Am 14.05.19 um 15:12 schrieb Zhou, David(ChunMing):
> my only concern is how to fresh LRU when bo is from bo list.
>
> -David
>
> -------- Original Message --------
> Subject: [PATCH 11/11] drm/amdgpu: stop removing BOs from the LRU 
> during CS
> From: Christian König
> To: "Olsak, Marek" ,"Zhou, David(ChunMing)" ,"Liang, Prike" 
> ,dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org
> CC:
>
> [CAUTION: External Email]
>
> This avoids OOM situations when we have lots of threads
> submitting at the same time.
>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index fff558cf385b..f9240a94217b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -648,7 +648,7 @@ static int amdgpu_cs_parser_bos(struct 
> amdgpu_cs_parser *p,
>         }
>
>         r = ttm_eu_reserve_buffers(&p->ticket, &p->validated, true,
> -                                  &duplicates, true);
> +                                  &duplicates, false);
>         if (unlikely(r != 0)) {
>                 if (r != -ERESTARTSYS)
>                         DRM_ERROR("ttm_eu_reserve_buffers failed.\n");
> --
> 2.17.1
>
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html;
      charset=windows-1252">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <div class="moz-cite-prefix">Hui? What do you mean with that?<br>
      <br>
      Christian.<br>
      <br>
      Am 14.05.19 um 15:12 schrieb Zhou, David(ChunMing):<br>
    </div>
    <blockquote type="cite"
cite="mid:-vf7xt3-qgf5mz-veq8ih-okgxtz-9ehg3tx8dyemoidihe-fwj066fntvvx-x3y4nh-bn07hl-82anfo4oofx-4di7gg-3nkfhtbcgh58-yj9ws0-pthytc-oq9qcxd40s4g-249dv8-x6wbfujry6xi-mu2nvl.1557839540398@email.android.com">
      <meta http-equiv="Content-Type" content="text/html;
        charset=windows-1252">
      <meta name="Generator" content="Microsoft Exchange Server">
      <!-- converted from text -->
      <style><!-- .EmailQuote { margin-left: 1pt; padding-left: 4pt; border-left: #800000 2px solid; } --></style>
      <div>my only concern is how to fresh LRU when bo is from bo list.<br>
        <br>
        -David<br>
        <br>
        -------- Original Message --------<br>
        Subject: [PATCH 11/11] drm/amdgpu: stop removing BOs from the
        LRU during CS<br>
        From: Christian König <br>
        To: "Olsak, Marek" ,"Zhou, David(ChunMing)" ,"Liang, Prike"
        ,<a class="moz-txt-link-abbreviated" href="mailto:dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org">dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org</a><br>
        CC: <br>
        <br>
      </div>
      <font size="2"><span style="font-size:11pt;">
          <div class="PlainText">[CAUTION: External Email]<br>
            <br>
            This avoids OOM situations when we have lots of threads<br>
            submitting at the same time.<br>
            <br>
            Signed-off-by: Christian König
            <a class="moz-txt-link-rfc2396E" href="mailto:christian.koenig@amd.com">&lt;christian.koenig@amd.com&gt;</a><br>
            ---<br>
             drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +-<br>
             1 file changed, 1 insertion(+), 1 deletion(-)<br>
            <br>
            diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
            b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c<br>
            index fff558cf385b..f9240a94217b 100644<br>
            --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c<br>
            +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c<br>
            @@ -648,7 +648,7 @@ static int amdgpu_cs_parser_bos(struct
            amdgpu_cs_parser *p,<br>
                    }<br>
            <br>
                    r = ttm_eu_reserve_buffers(&amp;p-&gt;ticket,
            &amp;p-&gt;validated, true,<br>
            -                                  &amp;duplicates, true);<br>
            +                                  &amp;duplicates, false);<br>
                    if (unlikely(r != 0)) {<br>
                            if (r != -ERESTARTSYS)<br>
                                    DRM_ERROR("ttm_eu_reserve_buffers
            failed.\n");<br>
            --<br>
            2.17.1<br>
            <br>
          </div>
        </span></font>
    </blockquote>
    <br>
  </body>
</html>
Chunming Zhou May 14, 2019, 2:31 p.m. UTC | #3
how to refresh LRU to keep the order align with bo list passed from user space?

you can verify it by some games, performance could be different much between multiple runnings.

-David

-------- Original Message --------
Subject: Re: [PATCH 11/11] drm/amdgpu: stop removing BOs from the LRU during CS
From: Christian König
To: "Zhou, David(ChunMing)" ,"Olsak, Marek" ,"Liang, Prike" ,dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org
CC:

[CAUTION: External Email]
Hui? What do you mean with that?

Christian.

Am 14.05.19 um 15:12 schrieb Zhou, David(ChunMing):
my only concern is how to fresh LRU when bo is from bo list.

-David

-------- Original Message --------
Subject: [PATCH 11/11] drm/amdgpu: stop removing BOs from the LRU during CS
From: Christian König
To: "Olsak, Marek" ,"Zhou, David(ChunMing)" ,"Liang, Prike" ,dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org<mailto:dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org>
CC:

[CAUTION: External Email]

This avoids OOM situations when we have lots of threads
submitting at the same time.

Signed-off-by: Christian König <christian.koenig@amd.com><mailto:christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index fff558cf385b..f9240a94217b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -648,7 +648,7 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
        }

        r = ttm_eu_reserve_buffers(&p->ticket, &p->validated, true,
-                                  &duplicates, true);
+                                  &duplicates, false);
        if (unlikely(r != 0)) {
                if (r != -ERESTARTSYS)
                        DRM_ERROR("ttm_eu_reserve_buffers failed.\n");
--
2.17.1
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
<meta content="text/html; charset=Windows-1252">
</head>
<body bgcolor="#FFFFFF">
how to refresh LRU to keep the order align with bo list passed from user space?<br>
<br>
you can verify it by some games, performance could be different much between multiple runnings.<br>
<br>
-David<br>
<br>
-------- Original Message --------<br>
Subject: Re: [PATCH 11/11] drm/amdgpu: stop removing BOs from the LRU during CS<br>
From: Christian König <br>
To: &quot;Zhou, David(ChunMing)&quot; ,&quot;Olsak, Marek&quot; ,&quot;Liang, Prike&quot; ,dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org<br>
CC: <br>
<br>
<div>[CAUTION: External Email]
<div>
<div class="moz-cite-prefix">Hui? What do you mean with that?<br>
<br>
Christian.<br>
<br>
Am 14.05.19 um 15:12 schrieb Zhou, David(ChunMing):<br>
</div>
<blockquote type="cite">
<meta name="Generator" content="Microsoft Exchange Server">
<style>
<!--
.EmailQuote
	{margin-left:1pt;
	padding-left:4pt;
	border-left:#800000 2px solid}
-->
</style>
<div>my only concern is how to fresh LRU when bo is from bo list.<br>
<br>
-David<br>
<br>
-------- Original Message --------<br>
Subject: [PATCH 11/11] drm/amdgpu: stop removing BOs from the LRU during CS<br>
From: Christian König <br>
To: &quot;Olsak, Marek&quot; ,&quot;Zhou, David(ChunMing)&quot; ,&quot;Liang, Prike&quot; ,<a class="moz-txt-link-abbreviated" href="mailto:dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org">dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org</a><br>
CC: <br>
<br>
</div>
<font size="2"><span style="font-size:11pt">
<div class="PlainText">[CAUTION: External Email]<br>
<br>
This avoids OOM situations when we have lots of threads<br>
submitting at the same time.<br>
<br>
Signed-off-by: Christian König <a class="moz-txt-link-rfc2396E" href="mailto:christian.koenig@amd.com">
&lt;christian.koenig@amd.com&gt;</a><br>
---<br>
&nbsp;drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 &#43;-<br>
&nbsp;1 file changed, 1 insertion(&#43;), 1 deletion(-)<br>
<br>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c<br>
index fff558cf385b..f9240a94217b 100644<br>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c<br>
&#43;&#43;&#43; b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c<br>
@@ -648,7 &#43;648,7 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; }<br>
<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; r = ttm_eu_reserve_buffers(&amp;p-&gt;ticket, &amp;p-&gt;validated, true,<br>
-&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &amp;duplicates, true);<br>
&#43;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; &amp;duplicates, false);<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (unlikely(r != 0)) {<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; if (r != -ERESTARTSYS)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; DRM_ERROR(&quot;ttm_eu_reserve_buffers failed.\n&quot;);<br>
--<br>
2.17.1<br>
<br>
</div>
</span></font></blockquote>
<br>
</div>
</div>
</body>
</html>
Marek Olšák May 14, 2019, 7:33 p.m. UTC | #4
This series fixes the OOM errors. However, if I torture the kernel driver
more, I can get it to deadlock and end up with unkillable processes. I can
also get an OOM error. I just ran the test 5 times:

AMD_DEBUG=testgdsmm glxgears & AMD_DEBUG=testgdsmm glxgears &
AMD_DEBUG=testgdsmm glxgears & AMD_DEBUG=testgdsmm glxgears &
AMD_DEBUG=testgdsmm glxgears

Marek

On Tue, May 14, 2019 at 8:31 AM Christian König <
ckoenig.leichtzumerken@gmail.com> wrote:

> This avoids OOM situations when we have lots of threads
> submitting at the same time.
>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index fff558cf385b..f9240a94217b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -648,7 +648,7 @@ static int amdgpu_cs_parser_bos(struct
> amdgpu_cs_parser *p,
>         }
>
>         r = ttm_eu_reserve_buffers(&p->ticket, &p->validated, true,
> -                                  &duplicates, true);
> +                                  &duplicates, false);
>         if (unlikely(r != 0)) {
>                 if (r != -ERESTARTSYS)
>                         DRM_ERROR("ttm_eu_reserve_buffers failed.\n");
> --
> 2.17.1
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
<div dir="ltr"><div dir="ltr"><div>This series fixes the OOM errors. However, if I torture the kernel driver more, I can get it to deadlock and end up with unkillable processes. I can also get an OOM error. I just ran the test 5 times:</div><div><br></div><div>AMD_DEBUG=testgdsmm glxgears &amp; AMD_DEBUG=testgdsmm glxgears &amp; AMD_DEBUG=testgdsmm glxgears &amp; AMD_DEBUG=testgdsmm glxgears &amp; AMD_DEBUG=testgdsmm glxgears</div><div></div><br><div>Marek<br></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, May 14, 2019 at 8:31 AM Christian König &lt;<a href="mailto:ckoenig.leichtzumerken@gmail.com">ckoenig.leichtzumerken@gmail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">This avoids OOM situations when we have lots of threads<br>
submitting at the same time.<br>
<br>
Signed-off-by: Christian König &lt;<a href="mailto:christian.koenig@amd.com" target="_blank">christian.koenig@amd.com</a>&gt;<br>
---<br>
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +-<br>
 1 file changed, 1 insertion(+), 1 deletion(-)<br>
<br>
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c<br>
index fff558cf385b..f9240a94217b 100644<br>
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c<br>
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c<br>
@@ -648,7 +648,7 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,<br>
        }<br>
<br>
        r = ttm_eu_reserve_buffers(&amp;p-&gt;ticket, &amp;p-&gt;validated, true,<br>
-                                  &amp;duplicates, true);<br>
+                                  &amp;duplicates, false);<br>
        if (unlikely(r != 0)) {<br>
                if (r != -ERESTARTSYS)<br>
                        DRM_ERROR(&quot;ttm_eu_reserve_buffers failed.\n&quot;);<br>
-- <br>
2.17.1<br>
<br>
_______________________________________________<br>
amd-gfx mailing list<br>
<a href="mailto:amd-gfx@lists.freedesktop.org" target="_blank">amd-gfx@lists.freedesktop.org</a><br>
<a href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx" rel="noreferrer" target="_blank">https://lists.freedesktop.org/mailman/listinfo/amd-gfx</a></blockquote></div>
Liang, Prike May 15, 2019, 2 a.m. UTC | #5
Hi Christian ,

I just wonder when encounter ENOMEM error during pin amdgpu BOs can we retry validate again as below.
With the following simply patch the Abaqus pinned issue not observed.

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 11cbf63..72a32f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -902,11 +902,15 @@ int amdgpu_bo_pin_restricted(struct amdgpu_bo *bo, u32 domain,
                        bo->placements[i].lpfn = lpfn;
                bo->placements[i].flags |= TTM_PL_FLAG_NO_EVICT;
        }
-
+retry:
        r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
        if (unlikely(r)) {
-               dev_err(adev->dev, "%p pin failed\n", bo);
-               goto error;
+                if (r == -ENOMEM){
+                        goto retry;
+                } else {
+                       dev_err(adev->dev, "%p pin failed\n", bo);
+                       goto error;
+                }
        }

        bo->pin_count = 1;


Thanks,
Prike

From: Marek Olšák <maraeo@gmail.com>
Sent: Wednesday, May 15, 2019 3:33 AM
To: Christian König <ckoenig.leichtzumerken@gmail.com>
Cc: Zhou, David(ChunMing) <David1.Zhou@amd.com>; Liang, Prike <Prike.Liang@amd.com>; dri-devel <dri-devel@lists.freedesktop.org>; amd-gfx mailing list <amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH 11/11] drm/amdgpu: stop removing BOs from the LRU during CS

[CAUTION: External Email]
This series fixes the OOM errors. However, if I torture the kernel driver more, I can get it to deadlock and end up with unkillable processes. I can also get an OOM error. I just ran the test 5 times:

AMD_DEBUG=testgdsmm glxgears & AMD_DEBUG=testgdsmm glxgears & AMD_DEBUG=testgdsmm glxgears & AMD_DEBUG=testgdsmm glxgears & AMD_DEBUG=testgdsmm glxgears

Marek

On Tue, May 14, 2019 at 8:31 AM Christian König <ckoenig.leichtzumerken@gmail.com<mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
This avoids OOM situations when we have lots of threads
submitting at the same time.

Signed-off-by: Christian König <christian.koenig@amd.com<mailto:christian.koenig@amd.com>>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index fff558cf385b..f9240a94217b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -648,7 +648,7 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
        }

        r = ttm_eu_reserve_buffers(&p->ticket, &p->validated, true,
-                                  &duplicates, true);
+                                  &duplicates, false);
        if (unlikely(r != 0)) {
                if (r != -ERESTARTSYS)
                        DRM_ERROR("ttm_eu_reserve_buffers failed.\n");
--
2.17.1
Christian König May 15, 2019, 7:04 a.m. UTC | #6
Hi Prike,

no, that can lead to massive problems in a real OOM situation and is not 
something we can do here.

Christian.

Am 15.05.19 um 04:00 schrieb Liang, Prike:
>
> Hi Christian ,
>
> I just wonder when encounter ENOMEM error during pin amdgpu BOs can we 
> retry validate again as below.
>
> With the following simply patch the Abaqus pinned issue not observed.
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>
> index 11cbf63..72a32f5 100644
>
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
>
> @@ -902,11 +902,15 @@ int amdgpu_bo_pin_restricted(struct amdgpu_bo 
> *bo, u32 domain,
>
> bo->placements[i].lpfn = lpfn;
>
>                 bo->placements[i].flags |= TTM_PL_FLAG_NO_EVICT;
>
>         }
>
> -
>
> +retry:
>
>         r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
>
>         if (unlikely(r)) {
>
> -               dev_err(adev->dev, "%p pin failed\n", bo);
>
> -               goto error;
>
> +                if (r == -ENOMEM){
>
> +                        goto retry;
>
> +                } else {
>
> + dev_err(adev->dev, "%p pin failed\n", bo);
>
> +                       goto error;
>
> +                }
>
>         }
>
>         bo->pin_count = 1;
>
> Thanks,
>
> Prike
>
> *From:* Marek Olšák <maraeo@gmail.com>
> *Sent:* Wednesday, May 15, 2019 3:33 AM
> *To:* Christian König <ckoenig.leichtzumerken@gmail.com>
> *Cc:* Zhou, David(ChunMing) <David1.Zhou@amd.com>; Liang, Prike 
> <Prike.Liang@amd.com>; dri-devel <dri-devel@lists.freedesktop.org>; 
> amd-gfx mailing list <amd-gfx@lists.freedesktop.org>
> *Subject:* Re: [PATCH 11/11] drm/amdgpu: stop removing BOs from the 
> LRU during CS
>
> [CAUTION: External Email]
>
> This series fixes the OOM errors. However, if I torture the kernel 
> driver more, I can get it to deadlock and end up with unkillable 
> processes. I can also get an OOM error. I just ran the test 5 times:
>
> AMD_DEBUG=testgdsmm glxgears & AMD_DEBUG=testgdsmm glxgears & 
> AMD_DEBUG=testgdsmm glxgears & AMD_DEBUG=testgdsmm glxgears & 
> AMD_DEBUG=testgdsmm glxgears
>
> Marek
>
> On Tue, May 14, 2019 at 8:31 AM Christian König 
> <ckoenig.leichtzumerken@gmail.com 
> <mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
>
>     This avoids OOM situations when we have lots of threads
>     submitting at the same time.
>
>     Signed-off-by: Christian König <christian.koenig@amd.com
>     <mailto:christian.koenig@amd.com>>
>     ---
>      drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +-
>      1 file changed, 1 insertion(+), 1 deletion(-)
>
>     diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>     b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>     index fff558cf385b..f9240a94217b 100644
>     --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>     +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>     @@ -648,7 +648,7 @@ static int amdgpu_cs_parser_bos(struct
>     amdgpu_cs_parser *p,
>             }
>
>             r = ttm_eu_reserve_buffers(&p->ticket, &p->validated, true,
>     -                                  &duplicates, true);
>     +                                  &duplicates, false);
>             if (unlikely(r != 0)) {
>                     if (r != -ERESTARTSYS)
>     DRM_ERROR("ttm_eu_reserve_buffers failed.\n");
>     -- 
>     2.17.1
>
>     _______________________________________________
>     amd-gfx mailing list
>     amd-gfx@lists.freedesktop.org <mailto:amd-gfx@lists.freedesktop.org>
>     https://lists.freedesktop.org/mailman/listinfo/amd-gfx
>
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <div class="moz-cite-prefix">Hi Prike,<br>
      <br>
      no, that can lead to massive problems in a real OOM situation and
      is not something we can do here.<br>
      <br>
      Christian.<br>
      <br>
      Am 15.05.19 um 04:00 schrieb Liang, Prike:<br>
    </div>
    <blockquote type="cite"
cite="mid:BYAPR12MB35256D8A0583B5DD019C2925FB090@BYAPR12MB3525.namprd12.prod.outlook.com">
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      <meta name="Generator" content="Microsoft Word 15 (filtered
        medium)">
      <style><!--
/* Font Definitions */
@font-face
	{font-family:"Cambria Math";
	panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
	{font-family:DengXian;
	panose-1:2 1 6 0 3 1 1 1 1 1;}
@font-face
	{font-family:Calibri;
	panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
	{font-family:"\@等线";
	panose-1:2 1 6 0 3 1 1 1 1 1;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
	{margin:0in;
	margin-bottom:.0001pt;
	font-size:11.0pt;
	font-family:"Calibri",sans-serif;}
a:link, span.MsoHyperlink
	{mso-style-priority:99;
	color:blue;
	text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
	{mso-style-priority:99;
	color:purple;
	text-decoration:underline;}
p.msonormal0, li.msonormal0, div.msonormal0
	{mso-style-name:msonormal;
	mso-margin-top-alt:auto;
	margin-right:0in;
	mso-margin-bottom-alt:auto;
	margin-left:0in;
	font-size:11.0pt;
	font-family:"Calibri",sans-serif;}
span.EmailStyle18
	{mso-style-type:personal-reply;
	font-family:"Calibri",sans-serif;
	color:windowtext;}
.MsoChpDefault
	{mso-style-type:export-only;
	font-size:10.0pt;}
@page WordSection1
	{size:8.5in 11.0in;
	margin:1.0in 1.25in 1.0in 1.25in;}
div.WordSection1
	{page:WordSection1;}
--></style><!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
      <div class="WordSection1">
        <p class="MsoNormal">Hi Christian ,<o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">I just wonder when encounter ENOMEM error
          during pin amdgpu BOs can we retry validate again as below.<o:p></o:p></p>
        <p class="MsoNormal">With the following simply patch the Abaqus
          pinned issue not observed.
          <o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">diff --git
          a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
          b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c<o:p></o:p></p>
        <p class="MsoNormal">index 11cbf63..72a32f5 100644<o:p></o:p></p>
        <p class="MsoNormal">---
          a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c<o:p></o:p></p>
        <p class="MsoNormal">+++
          b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c<o:p></o:p></p>
        <p class="MsoNormal">@@ -902,11 +902,15 @@ int
          amdgpu_bo_pin_restricted(struct amdgpu_bo *bo, u32 domain,<o:p></o:p></p>
        <p class="MsoNormal">                       
          bo-&gt;placements[i].lpfn = lpfn;<o:p></o:p></p>
        <p class="MsoNormal">                bo-&gt;placements[i].flags
          |= TTM_PL_FLAG_NO_EVICT;<o:p></o:p></p>
        <p class="MsoNormal">        }<o:p></o:p></p>
        <p class="MsoNormal">-<o:p></o:p></p>
        <p class="MsoNormal">+retry:<o:p></o:p></p>
        <p class="MsoNormal">        r =
          ttm_bo_validate(&amp;bo-&gt;tbo, &amp;bo-&gt;placement,
          &amp;ctx);<o:p></o:p></p>
        <p class="MsoNormal">        if (unlikely(r)) {<o:p></o:p></p>
        <p class="MsoNormal">-               dev_err(adev-&gt;dev, "%p
          pin failed\n", bo);<o:p></o:p></p>
        <p class="MsoNormal">-               goto error;<o:p></o:p></p>
        <p class="MsoNormal">+                if (r == -ENOMEM){<o:p></o:p></p>
        <p class="MsoNormal">+                        goto retry;<o:p></o:p></p>
        <p class="MsoNormal">+                } else {<o:p></o:p></p>
        <p class="MsoNormal">+                      
          dev_err(adev-&gt;dev, "%p pin failed\n", bo);<o:p></o:p></p>
        <p class="MsoNormal">+                       goto error;<o:p></o:p></p>
        <p class="MsoNormal">+                }<o:p></o:p></p>
        <p class="MsoNormal">        }<o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">        bo-&gt;pin_count = 1;<o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">Thanks,<o:p></o:p></p>
        <p class="MsoNormal">Prike<o:p></o:p></p>
        <p class="MsoNormal"><o:p> </o:p></p>
        <div>
          <div style="border:none;border-top:solid #E1E1E1
            1.0pt;padding:3.0pt 0in 0in 0in">
            <p class="MsoNormal"><b>From:</b> Marek Olšák
              <a class="moz-txt-link-rfc2396E" href="mailto:maraeo@gmail.com">&lt;maraeo@gmail.com&gt;</a> <br>
              <b>Sent:</b> Wednesday, May 15, 2019 3:33 AM<br>
              <b>To:</b> Christian König
              <a class="moz-txt-link-rfc2396E" href="mailto:ckoenig.leichtzumerken@gmail.com">&lt;ckoenig.leichtzumerken@gmail.com&gt;</a><br>
              <b>Cc:</b> Zhou, David(ChunMing)
              <a class="moz-txt-link-rfc2396E" href="mailto:David1.Zhou@amd.com">&lt;David1.Zhou@amd.com&gt;</a>; Liang, Prike
              <a class="moz-txt-link-rfc2396E" href="mailto:Prike.Liang@amd.com">&lt;Prike.Liang@amd.com&gt;</a>; dri-devel
              <a class="moz-txt-link-rfc2396E" href="mailto:dri-devel@lists.freedesktop.org">&lt;dri-devel@lists.freedesktop.org&gt;</a>; amd-gfx mailing
              list <a class="moz-txt-link-rfc2396E" href="mailto:amd-gfx@lists.freedesktop.org">&lt;amd-gfx@lists.freedesktop.org&gt;</a><br>
              <b>Subject:</b> Re: [PATCH 11/11] drm/amdgpu: stop
              removing BOs from the LRU during CS<o:p></o:p></p>
          </div>
        </div>
        <p class="MsoNormal"><o:p> </o:p></p>
        <p class="MsoNormal">[CAUTION: External Email] <o:p></o:p></p>
        <div>
          <div>
            <div>
              <div>
                <p class="MsoNormal">This series fixes the OOM errors.
                  However, if I torture the kernel driver more, I can
                  get it to deadlock and end up with unkillable
                  processes. I can also get an OOM error. I just ran the
                  test 5 times:<o:p></o:p></p>
              </div>
              <div>
                <p class="MsoNormal"><o:p> </o:p></p>
              </div>
              <div>
                <p class="MsoNormal">AMD_DEBUG=testgdsmm glxgears &amp;
                  AMD_DEBUG=testgdsmm glxgears &amp; AMD_DEBUG=testgdsmm
                  glxgears &amp; AMD_DEBUG=testgdsmm glxgears &amp;
                  AMD_DEBUG=testgdsmm glxgears<o:p></o:p></p>
              </div>
              <p class="MsoNormal"><o:p> </o:p></p>
              <div>
                <p class="MsoNormal">Marek<o:p></o:p></p>
              </div>
            </div>
          </div>
          <p class="MsoNormal"><o:p> </o:p></p>
          <div>
            <div>
              <p class="MsoNormal">On Tue, May 14, 2019 at 8:31 AM
                Christian König &lt;<a
                  href="mailto:ckoenig.leichtzumerken@gmail.com"
                  moz-do-not-send="true">ckoenig.leichtzumerken@gmail.com</a>&gt;
                wrote:<o:p></o:p></p>
            </div>
            <blockquote style="border:none;border-left:solid #CCCCCC
              1.0pt;padding:0in 0in 0in
              6.0pt;margin-left:4.8pt;margin-right:0in">
              <p class="MsoNormal">This avoids OOM situations when we
                have lots of threads<br>
                submitting at the same time.<br>
                <br>
                Signed-off-by: Christian König &lt;<a
                  href="mailto:christian.koenig@amd.com" target="_blank"
                  moz-do-not-send="true">christian.koenig@amd.com</a>&gt;<br>
                ---<br>
                 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +-<br>
                 1 file changed, 1 insertion(+), 1 deletion(-)<br>
                <br>
                diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
                b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c<br>
                index fff558cf385b..f9240a94217b 100644<br>
                --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c<br>
                +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c<br>
                @@ -648,7 +648,7 @@ static int
                amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,<br>
                        }<br>
                <br>
                        r = ttm_eu_reserve_buffers(&amp;p-&gt;ticket,
                &amp;p-&gt;validated, true,<br>
                -                                  &amp;duplicates,
                true);<br>
                +                                  &amp;duplicates,
                false);<br>
                        if (unlikely(r != 0)) {<br>
                                if (r != -ERESTARTSYS)<br>
                                       
                DRM_ERROR("ttm_eu_reserve_buffers failed.\n");<br>
                -- <br>
                2.17.1<br>
                <br>
                _______________________________________________<br>
                amd-gfx mailing list<br>
                <a href="mailto:amd-gfx@lists.freedesktop.org"
                  target="_blank" moz-do-not-send="true">amd-gfx@lists.freedesktop.org</a><br>
                <a
                  href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx"
                  target="_blank" moz-do-not-send="true">https://lists.freedesktop.org/mailman/listinfo/amd-gfx</a><o:p></o:p></p>
            </blockquote>
          </div>
        </div>
      </div>
    </blockquote>
    <br>
  </body>
</html>
Christian König May 15, 2019, 2:16 p.m. UTC | #7
That is a good point, but actually not a problem in practice.

See the change to ttm_eu_fence_buffer_objects:
> -               ttm_bo_add_to_lru(bo);
> +               if (list_empty(&bo->lru))
> +                       ttm_bo_add_to_lru(bo);
> +               else
> +                       ttm_bo_move_to_lru_tail(bo, NULL);

We still move the BOs to the end of the LRU in the same order we have 
before, we just don't remove them when they are reserved.

Regards,
Christian.

Am 14.05.19 um 16:31 schrieb Zhou, David(ChunMing):
> how to refresh LRU to keep the order align with bo list passed from 
> user space?
>
> you can verify it by some games, performance could be different much 
> between multiple runnings.
>
> -David
>
> -------- Original Message --------
> Subject: Re: [PATCH 11/11] drm/amdgpu: stop removing BOs from the LRU 
> during CS
> From: Christian König
> To: "Zhou, David(ChunMing)" ,"Olsak, Marek" ,"Liang, Prike" 
> ,dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org
> CC:
>
> [CAUTION: External Email]
> Hui? What do you mean with that?
>
> Christian.
>
> Am 14.05.19 um 15:12 schrieb Zhou, David(ChunMing):
>> my only concern is how to fresh LRU when bo is from bo list.
>>
>> -David
>>
>> -------- Original Message --------
>> Subject: [PATCH 11/11] drm/amdgpu: stop removing BOs from the LRU 
>> during CS
>> From: Christian König
>> To: "Olsak, Marek" ,"Zhou, David(ChunMing)" ,"Liang, Prike" 
>> ,dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org
>> CC:
>>
>> [CAUTION: External Email]
>>
>> This avoids OOM situations when we have lots of threads
>> submitting at the same time.
>>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> index fff558cf385b..f9240a94217b 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
>> @@ -648,7 +648,7 @@ static int amdgpu_cs_parser_bos(struct 
>> amdgpu_cs_parser *p,
>>         }
>>
>>         r = ttm_eu_reserve_buffers(&p->ticket, &p->validated, true,
>> -                                  &duplicates, true);
>> +                                  &duplicates, false);
>>         if (unlikely(r != 0)) {
>>                 if (r != -ERESTARTSYS)
>> DRM_ERROR("ttm_eu_reserve_buffers failed.\n");
>> --
>> 2.17.1
>>
>
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html;
      charset=windows-1252">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <div class="moz-cite-prefix">That is a good point, but actually not
      a problem in practice.<br>
      <br>
      See the change to ttm_eu_fence_buffer_objects:<br>
      <blockquote type="cite">-               ttm_bo_add_to_lru(bo);<br>
        +               if (list_empty(&amp;bo-&gt;lru))<br>
        +                       ttm_bo_add_to_lru(bo);<br>
        +               else<br>
        +                       ttm_bo_move_to_lru_tail(bo, NULL);<br>
      </blockquote>
      <br>
      We still move the BOs to the end of the LRU in the same order we
      have before, we just don't remove them when they are reserved.<br>
      <br>
      Regards,<br>
      Christian.<br>
      <br>
      Am 14.05.19 um 16:31 schrieb Zhou, David(ChunMing):<br>
    </div>
    <blockquote type="cite"
cite="mid:-wsx1tz-kxfbz1yns7x33sra134gl11xhlux4lx3izissqr2httt4mb1vleyxgj8i7k6-q6ze8ub3ff8c4o0fxmx7niu76yg4-ybakue-3v14jw-ed5ol8ybh6o9-1ze886-hbstfi448pvq3pwhkj.1557844282594@email.android.com">
      <meta http-equiv="Content-Type" content="text/html;
        charset=windows-1252">
      <meta content="text/html; charset=Windows-1252">
      how to refresh LRU to keep the order align with bo list passed
      from user space?<br>
      <br>
      you can verify it by some games, performance could be different
      much between multiple runnings.<br>
      <br>
      -David<br>
      <br>
      -------- Original Message --------<br>
      Subject: Re: [PATCH 11/11] drm/amdgpu: stop removing BOs from the
      LRU during CS<br>
      From: Christian König <br>
      To: "Zhou, David(ChunMing)" ,"Olsak, Marek" ,"Liang, Prike"
      ,<a class="moz-txt-link-abbreviated" href="mailto:dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org">dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org</a><br>
      CC: <br>
      <br>
      <div>[CAUTION: External Email]
        <div>
          <div class="moz-cite-prefix">Hui? What do you mean with that?<br>
            <br>
            Christian.<br>
            <br>
            Am 14.05.19 um 15:12 schrieb Zhou, David(ChunMing):<br>
          </div>
          <blockquote type="cite">
            <meta name="Generator" content="Microsoft Exchange Server">
            <style>
<!--
.EmailQuote
	{margin-left:1pt;
	padding-left:4pt;
	border-left:#800000 2px solid}
-->
</style>
            <div>my only concern is how to fresh LRU when bo is from bo
              list.<br>
              <br>
              -David<br>
              <br>
              -------- Original Message --------<br>
              Subject: [PATCH 11/11] drm/amdgpu: stop removing BOs from
              the LRU during CS<br>
              From: Christian König <br>
              To: "Olsak, Marek" ,"Zhou, David(ChunMing)" ,"Liang,
              Prike" ,<a class="moz-txt-link-abbreviated"
href="mailto:dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org"
                moz-do-not-send="true">dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org</a><br>
              CC: <br>
              <br>
            </div>
            <font size="2"><span style="font-size:11pt">
                <div class="PlainText">[CAUTION: External Email]<br>
                  <br>
                  This avoids OOM situations when we have lots of
                  threads<br>
                  submitting at the same time.<br>
                  <br>
                  Signed-off-by: Christian König <a
                    class="moz-txt-link-rfc2396E"
                    href="mailto:christian.koenig@amd.com"
                    moz-do-not-send="true">
                    &lt;christian.koenig@amd.com&gt;</a><br>
                  ---<br>
                   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +-<br>
                   1 file changed, 1 insertion(+), 1 deletion(-)<br>
                  <br>
                  diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
                  b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c<br>
                  index fff558cf385b..f9240a94217b 100644<br>
                  --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c<br>
                  +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c<br>
                  @@ -648,7 +648,7 @@ static int
                  amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,<br>
                          }<br>
                  <br>
                          r = ttm_eu_reserve_buffers(&amp;p-&gt;ticket,
                  &amp;p-&gt;validated, true,<br>
                  -                                  &amp;duplicates,
                  true);<br>
                  +                                  &amp;duplicates,
                  false);<br>
                          if (unlikely(r != 0)) {<br>
                                  if (r != -ERESTARTSYS)<br>
                                         
                  DRM_ERROR("ttm_eu_reserve_buffers failed.\n");<br>
                  --<br>
                  2.17.1<br>
                  <br>
                </div>
              </span></font></blockquote>
          <br>
        </div>
      </div>
      <br>
      <fieldset class="mimeAttachmentHeader"></fieldset>
      <pre class="moz-quote-pre" wrap="">_______________________________________________
amd-gfx mailing list
<a class="moz-txt-link-abbreviated" href="mailto:amd-gfx@lists.freedesktop.org">amd-gfx@lists.freedesktop.org</a>
<a class="moz-txt-link-freetext" href="https://lists.freedesktop.org/mailman/listinfo/amd-gfx">https://lists.freedesktop.org/mailman/listinfo/amd-gfx</a></pre>
    </blockquote>
    <br>
  </body>
</html>
Chunming Zhou May 15, 2019, 2:21 p.m. UTC | #8
Isn't this patch trying to stop removing for all BOs  from bo list?

-David

-------- Original Message --------
Subject: Re: [PATCH 11/11] drm/amdgpu: stop removing BOs from the LRU during CS
From: Christian König
To: "Zhou, David(ChunMing)" ,"Koenig, Christian" ,"Olsak, Marek" ,"Liang, Prike" ,dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org
CC:

[CAUTION: External Email]
That is a good point, but actually not a problem in practice.

See the change to ttm_eu_fence_buffer_objects:
-               ttm_bo_add_to_lru(bo);
+               if (list_empty(&bo->lru))
+                       ttm_bo_add_to_lru(bo);
+               else
+                       ttm_bo_move_to_lru_tail(bo, NULL);

We still move the BOs to the end of the LRU in the same order we have before, we just don't remove them when they are reserved.

Regards,
Christian.

Am 14.05.19 um 16:31 schrieb Zhou, David(ChunMing):
how to refresh LRU to keep the order align with bo list passed from user space?

you can verify it by some games, performance could be different much between multiple runnings.

-David

-------- Original Message --------
Subject: Re: [PATCH 11/11] drm/amdgpu: stop removing BOs from the LRU during CS
From: Christian König
To: "Zhou, David(ChunMing)" ,"Olsak, Marek" ,"Liang, Prike" ,dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org<mailto:dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org>
CC:

[CAUTION: External Email]
Hui? What do you mean with that?

Christian.

Am 14.05.19 um 15:12 schrieb Zhou, David(ChunMing):
my only concern is how to fresh LRU when bo is from bo list.

-David

-------- Original Message --------
Subject: [PATCH 11/11] drm/amdgpu: stop removing BOs from the LRU during CS
From: Christian König
To: "Olsak, Marek" ,"Zhou, David(ChunMing)" ,"Liang, Prike" ,dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org<mailto:dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org>
CC:

[CAUTION: External Email]

This avoids OOM situations when we have lots of threads
submitting at the same time.

Signed-off-by: Christian König <christian.koenig@amd.com><mailto:christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index fff558cf385b..f9240a94217b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -648,7 +648,7 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
        }

        r = ttm_eu_reserve_buffers(&p->ticket, &p->validated, true,
-                                  &duplicates, true);
+                                  &duplicates, false);
        if (unlikely(r != 0)) {
                if (r != -ERESTARTSYS)
                        DRM_ERROR("ttm_eu_reserve_buffers failed.\n");
--
2.17.1
Christian König May 15, 2019, 2:22 p.m. UTC | #9
BO list? No, we stop removing them from the LRU.

But we still move them to the end of the LRU before releasing them.

Christian.

Am 15.05.19 um 16:21 schrieb Zhou, David(ChunMing):
Isn't this patch trying to stop removing for all BOs  from bo list?

-David

-------- Original Message --------
Subject: Re: [PATCH 11/11] drm/amdgpu: stop removing BOs from the LRU during CS
From: Christian König
To: "Zhou, David(ChunMing)" ,"Koenig, Christian" ,"Olsak, Marek" ,"Liang, Prike" ,dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org<mailto:dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org>
CC:

[CAUTION: External Email]
That is a good point, but actually not a problem in practice.

See the change to ttm_eu_fence_buffer_objects:
-               ttm_bo_add_to_lru(bo);
+               if (list_empty(&bo->lru))
+                       ttm_bo_add_to_lru(bo);
+               else
+                       ttm_bo_move_to_lru_tail(bo, NULL);

We still move the BOs to the end of the LRU in the same order we have before, we just don't remove them when they are reserved.

Regards,
Christian.

Am 14.05.19 um 16:31 schrieb Zhou, David(ChunMing):
how to refresh LRU to keep the order align with bo list passed from user space?

you can verify it by some games, performance could be different much between multiple runnings.

-David

-------- Original Message --------
Subject: Re: [PATCH 11/11] drm/amdgpu: stop removing BOs from the LRU during CS
From: Christian König
To: "Zhou, David(ChunMing)" ,"Olsak, Marek" ,"Liang, Prike" ,dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org<mailto:dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org>
CC:

[CAUTION: External Email]
Hui? What do you mean with that?

Christian.

Am 14.05.19 um 15:12 schrieb Zhou, David(ChunMing):
my only concern is how to fresh LRU when bo is from bo list.

-David

-------- Original Message --------
Subject: [PATCH 11/11] drm/amdgpu: stop removing BOs from the LRU during CS
From: Christian König
To: "Olsak, Marek" ,"Zhou, David(ChunMing)" ,"Liang, Prike" ,dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org<mailto:dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org>
CC:

[CAUTION: External Email]

This avoids OOM situations when we have lots of threads
submitting at the same time.

Signed-off-by: Christian König <christian.koenig@amd.com><mailto:christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index fff558cf385b..f9240a94217b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -648,7 +648,7 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
        }

        r = ttm_eu_reserve_buffers(&p->ticket, &p->validated, true,
-                                  &duplicates, true);
+                                  &duplicates, false);
        if (unlikely(r != 0)) {
                if (r != -ERESTARTSYS)
                        DRM_ERROR("ttm_eu_reserve_buffers failed.\n");
--
2.17.1
Chunming Zhou May 15, 2019, 2:27 p.m. UTC | #10
Ah, sorry, I missed  "+                      ttm_bo_move_to_lru_tail(bo, NULL);".

Right, moving them to end before releasing is fixing my concern.

Sorry for noise.
-David


-------- Original Message --------
Subject: Re: [PATCH 11/11] drm/amdgpu: stop removing BOs from the LRU during CS
From: "Koenig, Christian"
To: "Zhou, David(ChunMing)" ,"Olsak, Marek" ,"Liang, Prike" ,dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org
CC:

[CAUTION: External Email]
BO list? No, we stop removing them from the LRU.

But we still move them to the end of the LRU before releasing them.

Christian.

Am 15.05.19 um 16:21 schrieb Zhou, David(ChunMing):
Isn't this patch trying to stop removing for all BOs  from bo list?

-David

-------- Original Message --------
Subject: Re: [PATCH 11/11] drm/amdgpu: stop removing BOs from the LRU during CS
From: Christian König
To: "Zhou, David(ChunMing)" ,"Koenig, Christian" ,"Olsak, Marek" ,"Liang, Prike" ,dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org<mailto:dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org>
CC:

[CAUTION: External Email]
That is a good point, but actually not a problem in practice.

See the change to ttm_eu_fence_buffer_objects:
-               ttm_bo_add_to_lru(bo);
+               if (list_empty(&bo->lru))
+                       ttm_bo_add_to_lru(bo);
+               else
+                       ttm_bo_move_to_lru_tail(bo, NULL);

We still move the BOs to the end of the LRU in the same order we have before, we just don't remove them when they are reserved.

Regards,
Christian.

Am 14.05.19 um 16:31 schrieb Zhou, David(ChunMing):
how to refresh LRU to keep the order align with bo list passed from user space?

you can verify it by some games, performance could be different much between multiple runnings.

-David

-------- Original Message --------
Subject: Re: [PATCH 11/11] drm/amdgpu: stop removing BOs from the LRU during CS
From: Christian König
To: "Zhou, David(ChunMing)" ,"Olsak, Marek" ,"Liang, Prike" ,dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org<mailto:dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org>
CC:

[CAUTION: External Email]
Hui? What do you mean with that?

Christian.

Am 14.05.19 um 15:12 schrieb Zhou, David(ChunMing):
my only concern is how to fresh LRU when bo is from bo list.

-David

-------- Original Message --------
Subject: [PATCH 11/11] drm/amdgpu: stop removing BOs from the LRU during CS
From: Christian König
To: "Olsak, Marek" ,"Zhou, David(ChunMing)" ,"Liang, Prike" ,dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org<mailto:dri-devel@lists.freedesktop.org,amd-gfx@lists.freedesktop.org>
CC:

[CAUTION: External Email]

This avoids OOM situations when we have lots of threads
submitting at the same time.

Signed-off-by: Christian König <christian.koenig@amd.com><mailto:christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index fff558cf385b..f9240a94217b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -648,7 +648,7 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
        }

        r = ttm_eu_reserve_buffers(&p->ticket, &p->validated, true,
-                                  &duplicates, true);
+                                  &duplicates, false);
        if (unlikely(r != 0)) {
                if (r != -ERESTARTSYS)
                        DRM_ERROR("ttm_eu_reserve_buffers failed.\n");
--
2.17.1
Liang, Prike May 17, 2019, 8:16 a.m. UTC | #11
Hi Christian,

With the series patch set , amdgpu_vm_validate_pt_bos occasionally evicted amdgpu BOs failed and can’t
find the valid first busy bo . Another problem is that  during the first BOs get lock period will run into deadlock .

/* check if other user occupy memory too long time */
                if (!first_bo || !request_resv || !request_resv->lock.ctx) {
                        if (first_bo)
                                ttm_bo_put(first_bo);
                        return -EBUSY;
                }
                if (first_bo->resv == request_resv) {
                        ttm_bo_put(first_bo);
                        return -EBUSY;
                }
                if (ctx->interruptible)
                        ret = ww_mutex_lock_interruptible(&first_bo->resv->lock,
                                                          request_resv->lock.ctx);
                else
                        ret = ww_mutex_lock(&first_bo->resv->lock, request_resv->lock.ctx);
                if (ret) {
                        ttm_bo_put(first_bo);
                        if (ret == -EDEADLK) {
                                ret = -EAGAIN;
                        }

                        return ret;
                }

Thanks
Prike

From: Christian König <ckoenig.leichtzumerken@gmail.com>
Sent: Wednesday, May 15, 2019 3:05 PM
To: Liang, Prike <Prike.Liang@amd.com>; Marek Olšák <maraeo@gmail.com>
Cc: Zhou, David(ChunMing) <David1.Zhou@amd.com>; dri-devel <dri-devel@lists.freedesktop.org>; amd-gfx mailing list <amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH 11/11] drm/amdgpu: stop removing BOs from the LRU during CS

[CAUTION: External Email]
Hi Prike,

no, that can lead to massive problems in a real OOM situation and is not something we can do here.

Christian.

Am 15.05.19 um 04:00 schrieb Liang, Prike:
Hi Christian ,

I just wonder when encounter ENOMEM error during pin amdgpu BOs can we retry validate again as below.
With the following simply patch the Abaqus pinned issue not observed.

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 11cbf63..72a32f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -902,11 +902,15 @@ int amdgpu_bo_pin_restricted(struct amdgpu_bo *bo, u32 domain,
                        bo->placements[i].lpfn = lpfn;
                bo->placements[i].flags |= TTM_PL_FLAG_NO_EVICT;
        }
-
+retry:
        r = ttm_bo_validate(&bo->tbo, &bo->placement, &ctx);
        if (unlikely(r)) {
-               dev_err(adev->dev, "%p pin failed\n", bo);
-               goto error;
+                if (r == -ENOMEM){
+                        goto retry;
+                } else {
+                       dev_err(adev->dev, "%p pin failed\n", bo);
+                       goto error;
+                }
        }

        bo->pin_count = 1;


Thanks,
Prike

From: Marek Olšák <maraeo@gmail.com><mailto:maraeo@gmail.com>
Sent: Wednesday, May 15, 2019 3:33 AM
To: Christian König <ckoenig.leichtzumerken@gmail.com><mailto:ckoenig.leichtzumerken@gmail.com>
Cc: Zhou, David(ChunMing) <David1.Zhou@amd.com><mailto:David1.Zhou@amd.com>; Liang, Prike <Prike.Liang@amd.com><mailto:Prike.Liang@amd.com>; dri-devel <dri-devel@lists.freedesktop.org><mailto:dri-devel@lists.freedesktop.org>; amd-gfx mailing list <amd-gfx@lists.freedesktop.org><mailto:amd-gfx@lists.freedesktop.org>
Subject: Re: [PATCH 11/11] drm/amdgpu: stop removing BOs from the LRU during CS

[CAUTION: External Email]
This series fixes the OOM errors. However, if I torture the kernel driver more, I can get it to deadlock and end up with unkillable processes. I can also get an OOM error. I just ran the test 5 times:

AMD_DEBUG=testgdsmm glxgears & AMD_DEBUG=testgdsmm glxgears & AMD_DEBUG=testgdsmm glxgears & AMD_DEBUG=testgdsmm glxgears & AMD_DEBUG=testgdsmm glxgears

Marek

On Tue, May 14, 2019 at 8:31 AM Christian König <ckoenig.leichtzumerken@gmail.com<mailto:ckoenig.leichtzumerken@gmail.com>> wrote:
This avoids OOM situations when we have lots of threads
submitting at the same time.

Signed-off-by: Christian König <christian.koenig@amd.com<mailto:christian.koenig@amd.com>>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index fff558cf385b..f9240a94217b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -648,7 +648,7 @@ static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
        }

        r = ttm_eu_reserve_buffers(&p->ticket, &p->validated, true,
-                                  &duplicates, true);
+                                  &duplicates, false);
        if (unlikely(r != 0)) {
                if (r != -ERESTARTSYS)
                        DRM_ERROR("ttm_eu_reserve_buffers failed.\n");
--
2.17.1
diff mbox series

Patch

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index fff558cf385b..f9240a94217b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -648,7 +648,7 @@  static int amdgpu_cs_parser_bos(struct amdgpu_cs_parser *p,
 	}
 
 	r = ttm_eu_reserve_buffers(&p->ticket, &p->validated, true,
-				   &duplicates, true);
+				   &duplicates, false);
 	if (unlikely(r != 0)) {
 		if (r != -ERESTARTSYS)
 			DRM_ERROR("ttm_eu_reserve_buffers failed.\n");