[4/4] drm/ttm: optimize ttm pool shrinker a bit

Message ID	20201218175538.1364-4-christian.koenig@amd.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <SRS0=dvvf=FW=lists.freedesktop.org=dri-devel-bounces@kernel.org> DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 8C1AA23B51 From: " =?utf-8?q?Christian_K=C3=B6nig?= " <ckoenig.leichtzumerken@gmail.com> To: dri-devel@lists.freedesktop.org Subject: [PATCH 4/4] drm/ttm: optimize ttm pool shrinker a bit Date: Fri, 18 Dec 2020 18:55:38 +0100 Message-Id: <20201218175538.1364-4-christian.koenig@amd.com> In-Reply-To: <20201218175538.1364-1-christian.koenig@amd.com> References: <20201218175538.1364-1-christian.koenig@amd.com> MIME-Version: 1.0 Precedence: list Cc: ray.huang@amd.com Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: dri-devel-bounces@lists.freedesktop.org Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>
Series	[1/4] drm/ttm: add debugfs directory v2 \| expand [1/4] drm/ttm: add debugfs directory v2 [2/4] drm/ttm: add a debugfs file for the global page pools [3/4] drm/ttm: add debugfs entry to test pool shrinker [4/4] drm/ttm: optimize ttm pool shrinker a bit

Christian König Dec. 18, 2020, 5:55 p.m. UTC

Only initialize the DMA coherent pools if they are used.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/ttm/ttm_pool.c | 23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

Daniel Vetter Dec. 22, 2020, 1:51 p.m. UTC | #1

On Fri, Dec 18, 2020 at 06:55:38PM +0100, Christian König wrote:
> Only initialize the DMA coherent pools if they are used.
> 
> Signed-off-by: Christian König <christian.koenig@amd.com>

Ah, just realized the answer to my question on patch 2: The pools are
per-device, due to dma_alloc_coherent being per-device (but really mostly
it isn't, but that's what we have to deal with fighting the dma-api
abstraction).

I think this would make a lot more sense if the shrinkers are per-pool
(and also most of the debugfs files), since as-is in a multi-gpu system
the first gpu's pool gets preferrentially thrashed. Which isn't a nice
design. Splitting that into per gpu shrinkers means we get equal shrinking
without having to maintain a global lru. This is how xfs seems to set up
their shrinkers, and in general xfs people have a solid understanding of
this stuff.

Aside: I think it also would make tons of sense to split up your new ttm
bo shrinker up into a per-device lru, and throw the global system memory
lru out the window completely :-) Assuming we can indeed get rid of it,
and vmwgfx doesn't need it somewhere still.

Aside from this lgtm, but I guess will change a bit with that shuffling.
-Daniel

> ---
>  drivers/gpu/drm/ttm/ttm_pool.c | 23 ++++++++++++++++-------
>  1 file changed, 16 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> index 1cdacd58753a..f09e34614226 100644
> --- a/drivers/gpu/drm/ttm/ttm_pool.c
> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> @@ -504,10 +504,12 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>  	pool->use_dma_alloc = use_dma_alloc;
>  	pool->use_dma32 = use_dma32;
>  
> -	for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
> -		for (j = 0; j < MAX_ORDER; ++j)
> -			ttm_pool_type_init(&pool->caching[i].orders[j],
> -					   pool, i, j);
> +	if (use_dma_alloc) {
> +		for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
> +			for (j = 0; j < MAX_ORDER; ++j)
> +				ttm_pool_type_init(&pool->caching[i].orders[j],
> +						   pool, i, j);
> +	}
>  }
>  EXPORT_SYMBOL(ttm_pool_init);
>  
> @@ -523,9 +525,11 @@ void ttm_pool_fini(struct ttm_pool *pool)
>  {
>  	unsigned int i, j;
>  
> -	for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
> -		for (j = 0; j < MAX_ORDER; ++j)
> -			ttm_pool_type_fini(&pool->caching[i].orders[j]);
> +	if (pool->use_dma_alloc) {
> +		for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
> +			for (j = 0; j < MAX_ORDER; ++j)
> +				ttm_pool_type_fini(&pool->caching[i].orders[j]);
> +	}
>  }
>  EXPORT_SYMBOL(ttm_pool_fini);
>  
> @@ -630,6 +634,11 @@ int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m)
>  {
>  	unsigned int i;
>  
> +	if (!pool->use_dma_alloc) {
> +		seq_puts(m, "unused\n");
> +		return 0;
> +	}
> +
>  	ttm_pool_debugfs_header(m);
>  
>  	spin_lock(&shrinker_lock);
> -- 
> 2.25.1
>

Christian König Jan. 7, 2021, 12:49 p.m. UTC | #2

Am 22.12.20 um 14:51 schrieb Daniel Vetter:
> On Fri, Dec 18, 2020 at 06:55:38PM +0100, Christian König wrote:
>> Only initialize the DMA coherent pools if they are used.
>>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
> Ah, just realized the answer to my question on patch 2: The pools are
> per-device, due to dma_alloc_coherent being per-device (but really mostly
> it isn't, but that's what we have to deal with fighting the dma-api
> abstraction).
>
> I think this would make a lot more sense if the shrinkers are per-pool
> (and also most of the debugfs files), since as-is in a multi-gpu system
> the first gpu's pool gets preferrentially thrashed. Which isn't a nice
> design. Splitting that into per gpu shrinkers means we get equal shrinking
> without having to maintain a global lru. This is how xfs seems to set up
> their shrinkers, and in general xfs people have a solid understanding of
> this stuff.

Well fairness and not trashing the first GPUs pool is the reason why I 
implemented just one shrinker plus a global LRU.

In other words shrink_slab() just uses list_for_each_entry() on all 
shrinkers.

In the pool shrinker callback shrink one pool and move it to the end of 
the shrinker list.

>
> Aside: I think it also would make tons of sense to split up your new ttm
> bo shrinker up into a per-device lru, and throw the global system memory
> lru out the window completely :-) Assuming we can indeed get rid of it,
> and vmwgfx doesn't need it somewhere still.

Yeah, I already have that as a patch set here, but I have this dependent 
on a larger rename of the device structures.

> Aside from this lgtm, but I guess will change a bit with that shuffling.

Thanks for the review, going to send out a new version with the 
fs_reclaim_acquire/release added in a minute.

Christian.

> -Daniel
>
>> ---
>>   drivers/gpu/drm/ttm/ttm_pool.c | 23 ++++++++++++++++-------
>>   1 file changed, 16 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
>> index 1cdacd58753a..f09e34614226 100644
>> --- a/drivers/gpu/drm/ttm/ttm_pool.c
>> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
>> @@ -504,10 +504,12 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>>   	pool->use_dma_alloc = use_dma_alloc;
>>   	pool->use_dma32 = use_dma32;
>>   
>> -	for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
>> -		for (j = 0; j < MAX_ORDER; ++j)
>> -			ttm_pool_type_init(&pool->caching[i].orders[j],
>> -					   pool, i, j);
>> +	if (use_dma_alloc) {
>> +		for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
>> +			for (j = 0; j < MAX_ORDER; ++j)
>> +				ttm_pool_type_init(&pool->caching[i].orders[j],
>> +						   pool, i, j);
>> +	}
>>   }
>>   EXPORT_SYMBOL(ttm_pool_init);
>>   
>> @@ -523,9 +525,11 @@ void ttm_pool_fini(struct ttm_pool *pool)
>>   {
>>   	unsigned int i, j;
>>   
>> -	for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
>> -		for (j = 0; j < MAX_ORDER; ++j)
>> -			ttm_pool_type_fini(&pool->caching[i].orders[j]);
>> +	if (pool->use_dma_alloc) {
>> +		for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
>> +			for (j = 0; j < MAX_ORDER; ++j)
>> +				ttm_pool_type_fini(&pool->caching[i].orders[j]);
>> +	}
>>   }
>>   EXPORT_SYMBOL(ttm_pool_fini);
>>   
>> @@ -630,6 +634,11 @@ int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m)
>>   {
>>   	unsigned int i;
>>   
>> +	if (!pool->use_dma_alloc) {
>> +		seq_puts(m, "unused\n");
>> +		return 0;
>> +	}
>> +
>>   	ttm_pool_debugfs_header(m);
>>   
>>   	spin_lock(&shrinker_lock);
>> -- 
>> 2.25.1
>>

Daniel Vetter Jan. 7, 2021, 4:38 p.m. UTC | #3

On Thu, Jan 07, 2021 at 01:49:45PM +0100, Christian König wrote:
> Am 22.12.20 um 14:51 schrieb Daniel Vetter:
> > On Fri, Dec 18, 2020 at 06:55:38PM +0100, Christian König wrote:
> > > Only initialize the DMA coherent pools if they are used.
> > > 
> > > Signed-off-by: Christian König <christian.koenig@amd.com>
> > Ah, just realized the answer to my question on patch 2: The pools are
> > per-device, due to dma_alloc_coherent being per-device (but really mostly
> > it isn't, but that's what we have to deal with fighting the dma-api
> > abstraction).
> > 
> > I think this would make a lot more sense if the shrinkers are per-pool
> > (and also most of the debugfs files), since as-is in a multi-gpu system
> > the first gpu's pool gets preferrentially thrashed. Which isn't a nice
> > design. Splitting that into per gpu shrinkers means we get equal shrinking
> > without having to maintain a global lru. This is how xfs seems to set up
> > their shrinkers, and in general xfs people have a solid understanding of
> > this stuff.
> 
> Well fairness and not trashing the first GPUs pool is the reason why I
> implemented just one shrinker plus a global LRU.

That's kinda defeating the point of how the core mm works. At least of how
I'm understanding how it works. Imo we shouldn't try to re-implement this
kind of balancing across different pools in our callback, since core mm
tries pretty hard to equally shrink already (it only shrinks each shrinker
a little bit each round, but does a lot of rounds under memory pressure).

Also maintaining your own global lru means global locking for the usual
case of none-to-little memory contention, unecessarily wasting the fast
path.

> In other words shrink_slab() just uses list_for_each_entry() on all
> shrinkers.
> 
> In the pool shrinker callback shrink one pool and move it to the end of the
> shrinker list.
> 
> > 
> > Aside: I think it also would make tons of sense to split up your new ttm
> > bo shrinker up into a per-device lru, and throw the global system memory
> > lru out the window completely :-) Assuming we can indeed get rid of it,
> > and vmwgfx doesn't need it somewhere still.
> 
> Yeah, I already have that as a patch set here, but I have this dependent on
> a larger rename of the device structures.

Hm maybe include that in the next round, just for the bigger picture?
Don't have to merge it all in one go, just want to make sure we agree on
where we're going.

> > Aside from this lgtm, but I guess will change a bit with that shuffling.
> 
> Thanks for the review, going to send out a new version with the
> fs_reclaim_acquire/release added in a minute.

Cool.

Cheers, Daniel

> 
> Christian.
> 
> > -Daniel
> > 
> > > ---
> > >   drivers/gpu/drm/ttm/ttm_pool.c | 23 ++++++++++++++++-------
> > >   1 file changed, 16 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> > > index 1cdacd58753a..f09e34614226 100644
> > > --- a/drivers/gpu/drm/ttm/ttm_pool.c
> > > +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> > > @@ -504,10 +504,12 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
> > >   	pool->use_dma_alloc = use_dma_alloc;
> > >   	pool->use_dma32 = use_dma32;
> > > -	for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
> > > -		for (j = 0; j < MAX_ORDER; ++j)
> > > -			ttm_pool_type_init(&pool->caching[i].orders[j],
> > > -					   pool, i, j);
> > > +	if (use_dma_alloc) {
> > > +		for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
> > > +			for (j = 0; j < MAX_ORDER; ++j)
> > > +				ttm_pool_type_init(&pool->caching[i].orders[j],
> > > +						   pool, i, j);
> > > +	}
> > >   }
> > >   EXPORT_SYMBOL(ttm_pool_init);
> > > @@ -523,9 +525,11 @@ void ttm_pool_fini(struct ttm_pool *pool)
> > >   {
> > >   	unsigned int i, j;
> > > -	for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
> > > -		for (j = 0; j < MAX_ORDER; ++j)
> > > -			ttm_pool_type_fini(&pool->caching[i].orders[j]);
> > > +	if (pool->use_dma_alloc) {
> > > +		for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
> > > +			for (j = 0; j < MAX_ORDER; ++j)
> > > +				ttm_pool_type_fini(&pool->caching[i].orders[j]);
> > > +	}
> > >   }
> > >   EXPORT_SYMBOL(ttm_pool_fini);
> > > @@ -630,6 +634,11 @@ int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m)
> > >   {
> > >   	unsigned int i;
> > > +	if (!pool->use_dma_alloc) {
> > > +		seq_puts(m, "unused\n");
> > > +		return 0;
> > > +	}
> > > +
> > >   	ttm_pool_debugfs_header(m);
> > >   	spin_lock(&shrinker_lock);
> > > -- 
> > > 2.25.1
> > > 
>

Christian König Jan. 19, 2021, 12:11 p.m. UTC | #4

Am 07.01.21 um 17:38 schrieb Daniel Vetter:
> On Thu, Jan 07, 2021 at 01:49:45PM +0100, Christian König wrote:
>> Am 22.12.20 um 14:51 schrieb Daniel Vetter:
>>> On Fri, Dec 18, 2020 at 06:55:38PM +0100, Christian König wrote:
>>>> Only initialize the DMA coherent pools if they are used.
>>>>
>>>> Signed-off-by: Christian König <christian.koenig@amd.com>
>>> Ah, just realized the answer to my question on patch 2: The pools are
>>> per-device, due to dma_alloc_coherent being per-device (but really mostly
>>> it isn't, but that's what we have to deal with fighting the dma-api
>>> abstraction).
>>>
>>> I think this would make a lot more sense if the shrinkers are per-pool
>>> (and also most of the debugfs files), since as-is in a multi-gpu system
>>> the first gpu's pool gets preferrentially thrashed. Which isn't a nice
>>> design. Splitting that into per gpu shrinkers means we get equal shrinking
>>> without having to maintain a global lru. This is how xfs seems to set up
>>> their shrinkers, and in general xfs people have a solid understanding of
>>> this stuff.
>> Well fairness and not trashing the first GPUs pool is the reason why I
>> implemented just one shrinker plus a global LRU.
> That's kinda defeating the point of how the core mm works. At least of how
> I'm understanding how it works. Imo we shouldn't try to re-implement this
> kind of balancing across different pools in our callback, since core mm
> tries pretty hard to equally shrink already (it only shrinks each shrinker
> a little bit each round, but does a lot of rounds under memory pressure).

Correct, see the problem is that we don't want to shrink from each pool 
on each round.

E.g. we have something like 48 global pools and 36 for each device which 
needs a DMA coherent pool.

On each round we want to shrink only one cached item from one pool and 
not 48.

> Also maintaining your own global lru means global locking for the usual
> case of none-to-little memory contention, unecessarily wasting the fast
> path.

No, the fast path doesn't need to take the global LRU lock.

I've optimized this quite a bit by looking into the pools only once for 
each power of two.

>> In other words shrink_slab() just uses list_for_each_entry() on all
>> shrinkers.
>>
>> In the pool shrinker callback shrink one pool and move it to the end of the
>> shrinker list.
>>
>>> Aside: I think it also would make tons of sense to split up your new ttm
>>> bo shrinker up into a per-device lru, and throw the global system memory
>>> lru out the window completely :-) Assuming we can indeed get rid of it,
>>> and vmwgfx doesn't need it somewhere still.
>> Yeah, I already have that as a patch set here, but I have this dependent on
>> a larger rename of the device structures.
> Hm maybe include that in the next round, just for the bigger picture?
> Don't have to merge it all in one go, just want to make sure we agree on
> where we're going.

I need to clean this set up quite a bit. Let's push this one here 
upstream first.

>>> Aside from this lgtm, but I guess will change a bit with that shuffling.
>> Thanks for the review, going to send out a new version with the
>> fs_reclaim_acquire/release added in a minute.
> Cool.
>
> Cheers, Daniel

Got distracted by bug fixes in the last two weeks, but really going to 
send that out now :)

Christian.

>
>> Christian.
>>
>>> -Daniel
>>>
>>>> ---
>>>>    drivers/gpu/drm/ttm/ttm_pool.c | 23 ++++++++++++++++-------
>>>>    1 file changed, 16 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
>>>> index 1cdacd58753a..f09e34614226 100644
>>>> --- a/drivers/gpu/drm/ttm/ttm_pool.c
>>>> +++ b/drivers/gpu/drm/ttm/ttm_pool.c
>>>> @@ -504,10 +504,12 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
>>>>    	pool->use_dma_alloc = use_dma_alloc;
>>>>    	pool->use_dma32 = use_dma32;
>>>> -	for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
>>>> -		for (j = 0; j < MAX_ORDER; ++j)
>>>> -			ttm_pool_type_init(&pool->caching[i].orders[j],
>>>> -					   pool, i, j);
>>>> +	if (use_dma_alloc) {
>>>> +		for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
>>>> +			for (j = 0; j < MAX_ORDER; ++j)
>>>> +				ttm_pool_type_init(&pool->caching[i].orders[j],
>>>> +						   pool, i, j);
>>>> +	}
>>>>    }
>>>>    EXPORT_SYMBOL(ttm_pool_init);
>>>> @@ -523,9 +525,11 @@ void ttm_pool_fini(struct ttm_pool *pool)
>>>>    {
>>>>    	unsigned int i, j;
>>>> -	for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
>>>> -		for (j = 0; j < MAX_ORDER; ++j)
>>>> -			ttm_pool_type_fini(&pool->caching[i].orders[j]);
>>>> +	if (pool->use_dma_alloc) {
>>>> +		for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
>>>> +			for (j = 0; j < MAX_ORDER; ++j)
>>>> +				ttm_pool_type_fini(&pool->caching[i].orders[j]);
>>>> +	}
>>>>    }
>>>>    EXPORT_SYMBOL(ttm_pool_fini);
>>>> @@ -630,6 +634,11 @@ int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m)
>>>>    {
>>>>    	unsigned int i;
>>>> +	if (!pool->use_dma_alloc) {
>>>> +		seq_puts(m, "unused\n");
>>>> +		return 0;
>>>> +	}
>>>> +
>>>>    	ttm_pool_debugfs_header(m);
>>>>    	spin_lock(&shrinker_lock);
>>>> -- 
>>>> 2.25.1
>>>>

Daniel Vetter Jan. 19, 2021, 1:22 p.m. UTC | #5

On Tue, Jan 19, 2021 at 01:11:36PM +0100, Christian König wrote:
> Am 07.01.21 um 17:38 schrieb Daniel Vetter:
> > On Thu, Jan 07, 2021 at 01:49:45PM +0100, Christian König wrote:
> > > Am 22.12.20 um 14:51 schrieb Daniel Vetter:
> > > > On Fri, Dec 18, 2020 at 06:55:38PM +0100, Christian König wrote:
> > > > > Only initialize the DMA coherent pools if they are used.
> > > > > 
> > > > > Signed-off-by: Christian König <christian.koenig@amd.com>
> > > > Ah, just realized the answer to my question on patch 2: The pools are
> > > > per-device, due to dma_alloc_coherent being per-device (but really mostly
> > > > it isn't, but that's what we have to deal with fighting the dma-api
> > > > abstraction).
> > > > 
> > > > I think this would make a lot more sense if the shrinkers are per-pool
> > > > (and also most of the debugfs files), since as-is in a multi-gpu system
> > > > the first gpu's pool gets preferrentially thrashed. Which isn't a nice
> > > > design. Splitting that into per gpu shrinkers means we get equal shrinking
> > > > without having to maintain a global lru. This is how xfs seems to set up
> > > > their shrinkers, and in general xfs people have a solid understanding of
> > > > this stuff.
> > > Well fairness and not trashing the first GPUs pool is the reason why I
> > > implemented just one shrinker plus a global LRU.
> > That's kinda defeating the point of how the core mm works. At least of how
> > I'm understanding how it works. Imo we shouldn't try to re-implement this
> > kind of balancing across different pools in our callback, since core mm
> > tries pretty hard to equally shrink already (it only shrinks each shrinker
> > a little bit each round, but does a lot of rounds under memory pressure).
> 
> Correct, see the problem is that we don't want to shrink from each pool on
> each round.
> 
> E.g. we have something like 48 global pools and 36 for each device which
> needs a DMA coherent pool.
> 
> On each round we want to shrink only one cached item from one pool and not
> 48.

Hm if the pool is that small, then this feels like we're caching at the
wrong level, and probably we should cache at the dma-api level. Or well,
below that even.

Either way that kind of design stuff should be captured in an overview
DOC: kerneldoc imo.

Also the point of shrinkers is that they really should be all sized
equally, so if there's very little stuff in them, they shouldn't get
shrunk on first round.

Otoh if they're huge, they will be shrunk big time. So aside from fringe
effects of rounding slightly different since it's all integers, did you
actually measure a benefit here? Or is this more conjecture about how you
think shrinkers work or don't work?

> > Also maintaining your own global lru means global locking for the usual
> > case of none-to-little memory contention, unecessarily wasting the fast
> > path.
> 
> No, the fast path doesn't need to take the global LRU lock.
> 
> I've optimized this quite a bit by looking into the pools only once for each
> power of two.
> 
> > > In other words shrink_slab() just uses list_for_each_entry() on all
> > > shrinkers.
> > > 
> > > In the pool shrinker callback shrink one pool and move it to the end of the
> > > shrinker list.
> > > 
> > > > Aside: I think it also would make tons of sense to split up your new ttm
> > > > bo shrinker up into a per-device lru, and throw the global system memory
> > > > lru out the window completely :-) Assuming we can indeed get rid of it,
> > > > and vmwgfx doesn't need it somewhere still.
> > > Yeah, I already have that as a patch set here, but I have this dependent on
> > > a larger rename of the device structures.
> > Hm maybe include that in the next round, just for the bigger picture?
> > Don't have to merge it all in one go, just want to make sure we agree on
> > where we're going.
> 
> I need to clean this set up quite a bit. Let's push this one here upstream
> first.

Hm yeah I guess we need to get somewhere first, but this feels a bit
murky. I'll try and review more details for the next round at least.
-Daniel

> 
> > > > Aside from this lgtm, but I guess will change a bit with that shuffling.
> > > Thanks for the review, going to send out a new version with the
> > > fs_reclaim_acquire/release added in a minute.
> > Cool.
> > 
> > Cheers, Daniel
> 
> Got distracted by bug fixes in the last two weeks, but really going to send
> that out now :)
> 
> Christian.
> 
> > 
> > > Christian.
> > > 
> > > > -Daniel
> > > > 
> > > > > ---
> > > > >    drivers/gpu/drm/ttm/ttm_pool.c | 23 ++++++++++++++++-------
> > > > >    1 file changed, 16 insertions(+), 7 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/ttm/ttm_pool.c b/drivers/gpu/drm/ttm/ttm_pool.c
> > > > > index 1cdacd58753a..f09e34614226 100644
> > > > > --- a/drivers/gpu/drm/ttm/ttm_pool.c
> > > > > +++ b/drivers/gpu/drm/ttm/ttm_pool.c
> > > > > @@ -504,10 +504,12 @@ void ttm_pool_init(struct ttm_pool *pool, struct device *dev,
> > > > >    	pool->use_dma_alloc = use_dma_alloc;
> > > > >    	pool->use_dma32 = use_dma32;
> > > > > -	for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
> > > > > -		for (j = 0; j < MAX_ORDER; ++j)
> > > > > -			ttm_pool_type_init(&pool->caching[i].orders[j],
> > > > > -					   pool, i, j);
> > > > > +	if (use_dma_alloc) {
> > > > > +		for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
> > > > > +			for (j = 0; j < MAX_ORDER; ++j)
> > > > > +				ttm_pool_type_init(&pool->caching[i].orders[j],
> > > > > +						   pool, i, j);
> > > > > +	}
> > > > >    }
> > > > >    EXPORT_SYMBOL(ttm_pool_init);
> > > > > @@ -523,9 +525,11 @@ void ttm_pool_fini(struct ttm_pool *pool)
> > > > >    {
> > > > >    	unsigned int i, j;
> > > > > -	for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
> > > > > -		for (j = 0; j < MAX_ORDER; ++j)
> > > > > -			ttm_pool_type_fini(&pool->caching[i].orders[j]);
> > > > > +	if (pool->use_dma_alloc) {
> > > > > +		for (i = 0; i < TTM_NUM_CACHING_TYPES; ++i)
> > > > > +			for (j = 0; j < MAX_ORDER; ++j)
> > > > > +				ttm_pool_type_fini(&pool->caching[i].orders[j]);
> > > > > +	}
> > > > >    }
> > > > >    EXPORT_SYMBOL(ttm_pool_fini);
> > > > > @@ -630,6 +634,11 @@ int ttm_pool_debugfs(struct ttm_pool *pool, struct seq_file *m)
> > > > >    {
> > > > >    	unsigned int i;
> > > > > +	if (!pool->use_dma_alloc) {
> > > > > +		seq_puts(m, "unused\n");
> > > > > +		return 0;
> > > > > +	}
> > > > > +
> > > > >    	ttm_pool_debugfs_header(m);
> > > > >    	spin_lock(&shrinker_lock);
> > > > > -- 
> > > > > 2.25.1
> > > > > 
>

[4/4] drm/ttm: optimize ttm pool shrinker a bit

Commit Message

Comments

Patch