diff mbox series

[-fixes,5/5] drm/vmwgfx: Fix a buffer object eviction regression

Message ID 20180913115837.7746-5-thellstrom@vmware.com (mailing list archive)
State New, archived
Headers show
Series [-fixes,1/5] drm/vmwgfx: don't check for old_crtc_state enable status | expand

Commit Message

Thomas Hellstrom Sept. 13, 2018, 11:58 a.m. UTC
Commit 4eb085e42fde ("drm/vmwgfx: Convert to new IDA API") indroduced
an incorrect return value from the function vmw_gmrid_man_get_node(),
when we run out if integer ids. Instead of returning 0 (meaning
non-fatal error) we forward the ida_simple_get error code -ENOSPC.
This causes TTM not to retry allocation after buffer eviction and
instead return -ENOSPC to user-space.

Fix this by returning 0 when ida_simple_get() returns -ENOSPC.

Tested using glretrace.

Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
Reviewed-by: Charmaine Lee <charmainel@vmware.com>
Reviewed-by: Deepak Rawat <drawat@vmware.com>
---
 drivers/gpu/drm/vmwgfx/vmwgfx_gmrid_manager.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Matthew Wilcox Sept. 13, 2018, 2:10 p.m. UTC | #1
On Thu, Sep 13, 2018 at 01:58:37PM +0200, Thomas Hellstrom wrote:
> Commit 4eb085e42fde ("drm/vmwgfx: Convert to new IDA API") indroduced
> an incorrect return value from the function vmw_gmrid_man_get_node(),
> when we run out if integer ids. Instead of returning 0 (meaning
> non-fatal error) we forward the ida_simple_get error code -ENOSPC.
> This causes TTM not to retry allocation after buffer eviction and
> instead return -ENOSPC to user-space.
> 
> Fix this by returning 0 when ida_simple_get() returns -ENOSPC.

Thanks.  I got confused by the convoluted code that was there before ;-(

I think this could be better though ... if ida_alloc() ever starts
returning a different errno in the future, you'll hit the same problem,
right?  So how about this ...

 	id = ida_alloc_max(&gman->gmr_ida, gman->max_gmr_ids - 1, GFP_KERNEL);
+	if (id == -ENOMEM)
+		return -ENOMEM;
+	if (id < 0)
+		return 0;
  
 	spin_lock(&gman->lock);

But I wonder ... why is -ENOMEM seen as a fatal error?  If you free up
some memory, you'll free up an ID, so the next time around you should
be able to allocate an ID.  So shouldn't this function just have
been doing this all along?

 	id = ida_alloc_max(&gman->gmr_ida, gman->max_gmr_ids - 1, GFP_KERNEL);
+	if (id < 0)
+		return 0;
Thomas Hellstrom Sept. 13, 2018, 2:56 p.m. UTC | #2
Hi,

On 09/13/2018 04:10 PM, Matthew Wilcox wrote:
> On Thu, Sep 13, 2018 at 01:58:37PM +0200, Thomas Hellstrom wrote:
>> Commit 4eb085e42fde ("drm/vmwgfx: Convert to new IDA API") indroduced
>> an incorrect return value from the function vmw_gmrid_man_get_node(),
>> when we run out if integer ids. Instead of returning 0 (meaning
>> non-fatal error) we forward the ida_simple_get error code -ENOSPC.
>> This causes TTM not to retry allocation after buffer eviction and
>> instead return -ENOSPC to user-space.
>>
>> Fix this by returning 0 when ida_simple_get() returns -ENOSPC.
> Thanks.  I got confused by the convoluted code that was there before ;-(
>
> I think this could be better though ... if ida_alloc() ever starts
> returning a different errno in the future, you'll hit the same problem,
> right?  So how about this ...
>
>   	id = ida_alloc_max(&gman->gmr_ida, gman->max_gmr_ids - 1, GFP_KERNEL);
> +	if (id == -ENOMEM)
> +		return -ENOMEM;
> +	if (id < 0)
> +		return 0;
>    
>   	spin_lock(&gman->lock);
>
> But I wonder ... why is -ENOMEM seen as a fatal error?  If you free up
> some memory, you'll free up an ID, so the next time around you should
> be able to allocate an ID.  So shouldn't this function just have
> been doing this all along?
>
>   	id = ida_alloc_max(&gman->gmr_ida, gman->max_gmr_ids - 1, GFP_KERNEL);
> +	if (id < 0)
> +		return 0;
>
Non-fatal errors are errors that can be remedied by GPU buffer eviction, 
and buffer eviction will free up IDA space, so basically we need to 
target only the error code that indicates we've run out of IDA space.

If we're worried that ida_alloc_max() will change return value, I guess 
we will have to increase the IDA space and detect the error ourselves:  
error if (id >= gman->max_gmr_ids)

/Thomas
Matthew Wilcox Sept. 13, 2018, 3:28 p.m. UTC | #3
On Thu, Sep 13, 2018 at 04:56:53PM +0200, Thomas Hellstrom wrote:
> Hi,
> 
> On 09/13/2018 04:10 PM, Matthew Wilcox wrote:
> > On Thu, Sep 13, 2018 at 01:58:37PM +0200, Thomas Hellstrom wrote:
> > > Commit 4eb085e42fde ("drm/vmwgfx: Convert to new IDA API") indroduced
> > > an incorrect return value from the function vmw_gmrid_man_get_node(),
> > > when we run out if integer ids. Instead of returning 0 (meaning
> > > non-fatal error) we forward the ida_simple_get error code -ENOSPC.
> > > This causes TTM not to retry allocation after buffer eviction and
> > > instead return -ENOSPC to user-space.
> > > 
> > > Fix this by returning 0 when ida_simple_get() returns -ENOSPC.
> > Thanks.  I got confused by the convoluted code that was there before ;-(
> > 
> > I think this could be better though ... if ida_alloc() ever starts
> > returning a different errno in the future, you'll hit the same problem,
> > right?  So how about this ...
> > 
> >   	id = ida_alloc_max(&gman->gmr_ida, gman->max_gmr_ids - 1, GFP_KERNEL);
> > +	if (id == -ENOMEM)
> > +		return -ENOMEM;
> > +	if (id < 0)
> > +		return 0;
> >   	spin_lock(&gman->lock);
> > 
> > But I wonder ... why is -ENOMEM seen as a fatal error?  If you free up
> > some memory, you'll free up an ID, so the next time around you should
> > be able to allocate an ID.  So shouldn't this function just have
> > been doing this all along?
> > 
> >   	id = ida_alloc_max(&gman->gmr_ida, gman->max_gmr_ids - 1, GFP_KERNEL);
> > +	if (id < 0)
> > +		return 0;
> > 
> Non-fatal errors are errors that can be remedied by GPU buffer eviction, and
> buffer eviction will free up IDA space, so basically we need to target only
> the error code that indicates we've run out of IDA space.

Yes, but the following situation can happen:

 - Allocate 1024 IDs
 - Run very low on memory
 - Allocating ID 1025 will fail (very very unlikely)
 - ida_alloc_max() returns -ENOMEM

In this situation, we want ttm_mem_evict_first() to be called which will
free up one of the 1024 existing IDs and then we can allocate that ID for
our new node.

I'm assuming we're analysing the behaviour of ttm_bo_mem_force_space()
here.

> If we're worried that ida_alloc_max() will change return value, I guess we
> will have to increase the IDA space and detect the error ourselves:  error
> if (id >= gman->max_gmr_ids)

My point was that your solution (detect the one error which should be
deemed as non-fatal) was not as robust as its inverse (detect the one
error which the previous code deemed as fatal).  But I now believe no
error from the IDA should be seen as fatal.
Thomas Hellstrom Sept. 13, 2018, 4:52 p.m. UTC | #4
On 09/13/2018 05:28 PM, Matthew Wilcox wrote:
> On Thu, Sep 13, 2018 at 04:56:53PM +0200, Thomas Hellstrom wrote:
>> Hi,
>>
>> On 09/13/2018 04:10 PM, Matthew Wilcox wrote:
>>> On Thu, Sep 13, 2018 at 01:58:37PM +0200, Thomas Hellstrom wrote:
>>>> Commit 4eb085e42fde ("drm/vmwgfx: Convert to new IDA API") indroduced
>>>> an incorrect return value from the function vmw_gmrid_man_get_node(),
>>>> when we run out if integer ids. Instead of returning 0 (meaning
>>>> non-fatal error) we forward the ida_simple_get error code -ENOSPC.
>>>> This causes TTM not to retry allocation after buffer eviction and
>>>> instead return -ENOSPC to user-space.
>>>>
>>>> Fix this by returning 0 when ida_simple_get() returns -ENOSPC.
>>> Thanks.  I got confused by the convoluted code that was there before ;-(
>>>
>>> I think this could be better though ... if ida_alloc() ever starts
>>> returning a different errno in the future, you'll hit the same problem,
>>> right?  So how about this ...
>>>
>>>    	id = ida_alloc_max(&gman->gmr_ida, gman->max_gmr_ids - 1, GFP_KERNEL);
>>> +	if (id == -ENOMEM)
>>> +		return -ENOMEM;
>>> +	if (id < 0)
>>> +		return 0;
>>>    	spin_lock(&gman->lock);
>>>
>>> But I wonder ... why is -ENOMEM seen as a fatal error?  If you free up
>>> some memory, you'll free up an ID, so the next time around you should
>>> be able to allocate an ID.  So shouldn't this function just have
>>> been doing this all along?
>>>
>>>    	id = ida_alloc_max(&gman->gmr_ida, gman->max_gmr_ids - 1, GFP_KERNEL);
>>> +	if (id < 0)
>>> +		return 0;
>>>
>> Non-fatal errors are errors that can be remedied by GPU buffer eviction, and
>> buffer eviction will free up IDA space, so basically we need to target only
>> the error code that indicates we've run out of IDA space.
> Yes, but the following situation can happen:
>
>   - Allocate 1024 IDs
>   - Run very low on memory
>   - Allocating ID 1025 will fail (very very unlikely)
>   - ida_alloc_max() returns -ENOMEM
>
> In this situation, we want ttm_mem_evict_first() to be called which will
> free up one of the 1024 existing IDs and then we can allocate that ID for
> our new node.
>
> I'm assuming we're analysing the behaviour of ttm_bo_mem_force_space()
> here.

Well, that's true, but that situation depends I guess very much on the 
radix tree implementation of IDA? Also I would expect the eviction paths 
to try to allocate more memory here and there, so to me the preferred 
option when -ENOMEM happens, is really to back off as soon as possible 
to avoid interfering with shrinker work going on etc.

>> If we're worried that ida_alloc_max() will change return value, I guess we
>> will have to increase the IDA space and detect the error ourselves:  error
>> if (id >= gman->max_gmr_ids)
> My point was that your solution (detect the one error which should be
> deemed as non-fatal) was not as robust as its inverse (detect the one
> error which the previous code deemed as fatal).  But I now believe no
> error from the IDA should be seen as fatal.

If you insist, I can test on -ENOMEM instead of -ENOSPC to mimic the 
pre-change behaviour. We should really focus on the IDA api changes 
here, and defer changing -ENOMEM to non-fatal to a follow-up patch if 
needed.

Thanks,

Thomas
Matthew Wilcox Sept. 13, 2018, 5:38 p.m. UTC | #5
On Thu, Sep 13, 2018 at 06:52:43PM +0200, Thomas Hellstrom wrote:
> On 09/13/2018 05:28 PM, Matthew Wilcox wrote:
> > On Thu, Sep 13, 2018 at 04:56:53PM +0200, Thomas Hellstrom wrote:
> > > On 09/13/2018 04:10 PM, Matthew Wilcox wrote:
> > > > I think this could be better though ... if ida_alloc() ever starts
> > > > returning a different errno in the future, you'll hit the same problem,
> > > > right?  So how about this ...
> > > > 
> > > >    	id = ida_alloc_max(&gman->gmr_ida, gman->max_gmr_ids - 1, GFP_KERNEL);
> > > > +	if (id == -ENOMEM)
> > > > +		return -ENOMEM;
> > > > +	if (id < 0)
> > > > +		return 0;
> > > >    	spin_lock(&gman->lock);
> > > > 
> > > > But I wonder ... why is -ENOMEM seen as a fatal error?  If you free up
> > > > some memory, you'll free up an ID, so the next time around you should
> > > > be able to allocate an ID.  So shouldn't this function just have
> > > > been doing this all along?
> > > > 
> > > >    	id = ida_alloc_max(&gman->gmr_ida, gman->max_gmr_ids - 1, GFP_KERNEL);
> > > > +	if (id < 0)
> > > > +		return 0;
> > > > 
> > > Non-fatal errors are errors that can be remedied by GPU buffer eviction, and
> > > buffer eviction will free up IDA space, so basically we need to target only
> > > the error code that indicates we've run out of IDA space.
> > Yes, but the following situation can happen:
> > 
> >   - Allocate 1024 IDs
> >   - Run very low on memory
> >   - Allocating ID 1025 will fail (very very unlikely)
> >   - ida_alloc_max() returns -ENOMEM
> > 
> > In this situation, we want ttm_mem_evict_first() to be called which will
> > free up one of the 1024 existing IDs and then we can allocate that ID for
> > our new node.
> > 
> > I'm assuming we're analysing the behaviour of ttm_bo_mem_force_space()
> > here.
> 
> Well, that's true, but that situation depends I guess very much on the radix
> tree implementation of IDA?

The specific number 1024 depends on the current implementation, but
generally speaking at some point, the IDA has to allocate memory to store
one extra ID.  Because the IDA cannot allocate memory when freeing an
ID, there is no more compact representation of the allocated IDs smaller
than a bitmap.

> Also I would expect the eviction paths to try to
> allocate more memory here and there, so to me the preferred option when
> -ENOMEM happens, is really to back off as soon as possible to avoid
> interfering with shrinker work going on etc.

I would be surprised if freeing resources needs memory to be allocated.
That's not supposed to happen; the filesystems go to great lengths to
pre-allocate enough memory that they can always write at least one dirty
page back to storage without allocating any memory, for example.  Maybe
the DRM subsystem is different; I'm not an expert in your subsystem.

> > My point was that your solution (detect the one error which should be
> > deemed as non-fatal) was not as robust as its inverse (detect the one
> > error which the previous code deemed as fatal).  But I now believe no
> > error from the IDA should be seen as fatal.
> 
> If you insist, I can test on -ENOMEM instead of -ENOSPC to mimic the
> pre-change behaviour. We should really focus on the IDA api changes here,
> and defer changing -ENOMEM to non-fatal to a follow-up patch if needed.

I'd be comfortable with that solution for now.
Thomas Hellstrom Sept. 13, 2018, 6:17 p.m. UTC | #6
On 09/13/2018 07:38 PM, Matthew Wilcox wrote:
> On Thu, Sep 13, 2018 at 06:52:43PM +0200, Thomas Hellstrom wrote:
>> On 09/13/2018 05:28 PM, Matthew Wilcox wrote:
>>> On Thu, Sep 13, 2018 at 04:56:53PM +0200, Thomas Hellstrom wrote:
>>>> On 09/13/2018 04:10 PM, Matthew Wilcox wrote:
>>>>> I think this could be better though ... if ida_alloc() ever starts
>>>>> returning a different errno in the future, you'll hit the same problem,
>>>>> right?  So how about this ...
>>>>>
>>>>>     	id = ida_alloc_max(&gman->gmr_ida, gman->max_gmr_ids - 1, GFP_KERNEL);
>>>>> +	if (id == -ENOMEM)
>>>>> +		return -ENOMEM;
>>>>> +	if (id < 0)
>>>>> +		return 0;
>>>>>     	spin_lock(&gman->lock);
>>>>>
>>>>> But I wonder ... why is -ENOMEM seen as a fatal error?  If you free up
>>>>> some memory, you'll free up an ID, so the next time around you should
>>>>> be able to allocate an ID.  So shouldn't this function just have
>>>>> been doing this all along?
>>>>>
>>>>>     	id = ida_alloc_max(&gman->gmr_ida, gman->max_gmr_ids - 1, GFP_KERNEL);
>>>>> +	if (id < 0)
>>>>> +		return 0;
>>>>>
>>>> Non-fatal errors are errors that can be remedied by GPU buffer eviction, and
>>>> buffer eviction will free up IDA space, so basically we need to target only
>>>> the error code that indicates we've run out of IDA space.
>>> Yes, but the following situation can happen:
>>>
>>>    - Allocate 1024 IDs
>>>    - Run very low on memory
>>>    - Allocating ID 1025 will fail (very very unlikely)
>>>    - ida_alloc_max() returns -ENOMEM
>>>
>>> In this situation, we want ttm_mem_evict_first() to be called which will
>>> free up one of the 1024 existing IDs and then we can allocate that ID for
>>> our new node.
>>>
>>> I'm assuming we're analysing the behaviour of ttm_bo_mem_force_space()
>>> here.
>> Well, that's true, but that situation depends I guess very much on the radix
>> tree implementation of IDA?
> The specific number 1024 depends on the current implementation, but
> generally speaking at some point, the IDA has to allocate memory to store
> one extra ID.  Because the IDA cannot allocate memory when freeing an
> ID, there is no more compact representation of the allocated IDs smaller
> than a bitmap.
>
>> Also I would expect the eviction paths to try to
>> allocate more memory here and there, so to me the preferred option when
>> -ENOMEM happens, is really to back off as soon as possible to avoid
>> interfering with shrinker work going on etc.
> I would be surprised if freeing resources needs memory to be allocated.
> That's not supposed to happen; the filesystems go to great lengths to
> pre-allocate enough memory that they can always write at least one dirty
> page back to storage without allocating any memory, for example.  Maybe
> the DRM subsystem is different; I'm not an expert in your subsystem.

It's different. In particular when evicting a large buffer from VRAM 
(which is on-card memory) to system memory, the subsystem may allocate a 
huge amount of memory. But this particular case is not evicting from 
VRAM (although it may lead to it).

>>> My point was that your solution (detect the one error which should be
>>> deemed as non-fatal) was not as robust as its inverse (detect the one
>>> error which the previous code deemed as fatal).  But I now believe no
>>> error from the IDA should be seen as fatal.
>> If you insist, I can test on -ENOMEM instead of -ENOSPC to mimic the
>> pre-change behaviour. We should really focus on the IDA api changes here,
>> and defer changing -ENOMEM to non-fatal to a follow-up patch if needed.
> I'd be comfortable with that solution for now.

OK, I'll respin and check for -ENOMEM instead.

Thanks,
Thomas
diff mbox series

Patch

diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_gmrid_manager.c b/drivers/gpu/drm/vmwgfx/vmwgfx_gmrid_manager.c
index b93c558dd86e..a38a0c3777f7 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_gmrid_manager.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_gmrid_manager.c
@@ -57,7 +57,7 @@  static int vmw_gmrid_man_get_node(struct ttm_mem_type_manager *man,
 
 	id = ida_alloc_max(&gman->gmr_ida, gman->max_gmr_ids - 1, GFP_KERNEL);
 	if (id < 0)
-		return id;
+		return (id == -ENOSPC ? 0 : id);
 
 	spin_lock(&gman->lock);