mm/page_alloc: use ac->high_zoneidx for classzone_idx

Message ID 1525408246-14768-1-git-send-email-iamjoonsoo.kim@lge.com
State New

Commit Message

Joonsoo Kim May 4, 2018, 4:30 a.m. UTC
From: Joonsoo Kim <iamjoonsoo.kim@lge.com>

Currently, we use the zone index of preferred_zone, which represents
the best matching zone for the allocation, as classzone_idx. This causes
a problem on NUMA systems with ZONE_MOVABLE.

On a NUMA system, each node can have a different set of populated
zones. For example, node 0 could have DMA/DMA32/NORMAL/MOVABLE zones and
node 1 could have only a NORMAL zone. In this setup, an allocation request
initiated on node 0 and one initiated on node 1 would have different
classzone_idx values, 3 and 2 respectively, since their preferred_zones are
different. If each request is handled only by its own node, there is no
problem. However, if requests are sometimes handled by the remote node,
a problem arises.

In the following setup, an allocation initiated on node 1 takes
precedence over an allocation initiated on node 0 when the former is
processed on node 0 because node 1 does not have enough memory. The two
allocations see different lowmem reserves due to their different
classzone_idx values, so their watermark bars also differ.

root@ubuntu:/sys/devices/system/memory# cat /proc/zoneinfo
Node 0, zone      DMA
  per-node stats
...
  pages free     3965
        min      5
        low      8
        high     11
        spanned  4095
        present  3998
        managed  3977
        protection: (0, 2961, 4928, 5440)
...
Node 0, zone    DMA32
  pages free     757955
        min      1129
        low      1887
        high     2645
        spanned  1044480
        present  782303
        managed  758116
        protection: (0, 0, 1967, 2479)
...
Node 0, zone   Normal
  pages free     459806
        min      750
        low      1253
        high     1756
        spanned  524288
        present  524288
        managed  503620
        protection: (0, 0, 0, 4096)
...
Node 0, zone  Movable
  pages free     130759
        min      195
        low      326
        high     457
        spanned  1966079
        present  131072
        managed  131072
        protection: (0, 0, 0, 0)
...
Node 1, zone      DMA
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 1006, 1006)
Node 1, zone    DMA32
  pages free     0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        protection: (0, 0, 1006, 1006)
Node 1, zone   Normal
  per-node stats
...
  pages free     233277
        min      383
        low      640
        high     897
        spanned  262144
        present  262144
        managed  257744
        protection: (0, 0, 0, 0)
...
Node 1, zone  Movable
  pages free     0
        min      0
        low      0
        high     0
        spanned  262144
        present  0
        managed  0
        protection: (0, 0, 0, 0)

min watermark for NORMAL zone on node 0
allocation initiated on node 0: 750 + 4096 = 4846
allocation initiated on node 1: 750 + 0 = 750
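
The arithmetic above can be written out as a minimal C sketch. It assumes the effective floor is the zone's min watermark plus the lowmem_reserve ("protection") entry selected by classzone_idx; `struct zone_sketch` and `effective_min()` are hypothetical names used only for illustration, not kernel definitions.

```c
/* Zone indices matching the /proc/zoneinfo layout shown above. */
enum { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE, NR_ZONES };

/* Hypothetical, simplified stand-in for the kernel's struct zone. */
struct zone_sketch {
	long min_wmark;
	long lowmem_reserve[NR_ZONES];	/* the "protection" array */
};

/* Node 0, zone Normal: min 750, protection (0, 0, 0, 4096). */
static const struct zone_sketch node0_normal = { 750, { 0, 0, 0, 4096 } };

/* Free pages the zone must keep for a request with this classzone_idx. */
static long effective_min(const struct zone_sketch *z, int classzone_idx)
{
	return z->min_wmark + z->lowmem_reserve[classzone_idx];
}
```

With classzone_idx 3 (request initiated on node 0) the floor is 4846 pages; with classzone_idx 2 (request initiated on node 1) it is only 750.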

This watermark difference could cause too many numa_miss allocations
in some situations, degrading performance.

Recently, there was a regression report about this problem on the CMA
patches, since those patches place CMA memory in ZONE_MOVABLE. I checked
that the problem disappears with this fix, which uses high_zoneidx
for classzone_idx.

http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop

Using high_zoneidx for classzone_idx is more consistent than the previous
approach because the system's memory layout does not affect it.
With this patch, classzone_idx will be 3 for both allocations in the
example above, so they will have the same min watermark.

allocation initiated on node 0: 750 + 4096 = 4846
allocation initiated on node 1: 750 + 4096 = 4846
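
The difference between the two ways of choosing classzone_idx can be sketched as follows. This is an approximation under stated assumptions: the old behaviour is modelled as "highest populated zone on the preferred node that does not exceed high_zoneidx", ignoring zonelist fallback to other nodes and nodemasks. `struct node_sketch` and both function names are hypothetical.

```c
#include <stdbool.h>

enum { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE, NR_ZONES };

/* Hypothetical per-node description: which zones have managed pages. */
struct node_sketch {
	bool populated[NR_ZONES];
};

/* Example layout from the commit message. */
static const struct node_sketch node0 = { { true, true, true, true } };
static const struct node_sketch node1 = { { false, false, true, false } };

/*
 * Old behaviour (approximation): index of the highest populated zone on
 * the preferred node that is not above high_zoneidx; -1 if there is none.
 */
static int classzone_idx_old(const struct node_sketch *preferred,
			     int high_zoneidx)
{
	for (int idx = high_zoneidx; idx >= 0; idx--)
		if (preferred->populated[idx])
			return idx;
	return -1;
}

/* New behaviour: depends only on the request, not on the memory layout. */
static int classzone_idx_new(const struct node_sketch *preferred,
			     int high_zoneidx)
{
	(void)preferred;
	return high_zoneidx;
}
```

For a request that may use ZONE_MOVABLE, the old rule yields 3 on node 0 and 2 on node 1, while the new rule yields 3 on both.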

Reported-by: Ye Xiaolong <xiaolong.ye@intel.com>
Tested-by: Ye Xiaolong <xiaolong.ye@intel.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/internal.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Vlastimil Babka May 4, 2018, 7:03 a.m. UTC | #1
On 05/04/2018 06:30 AM, js1304@gmail.com wrote:
> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> 
> Currently, we use the zone index of preferred_zone which represents
> the best matching zone for allocation, as classzone_idx. It has a problem
> on NUMA system with ZONE_MOVABLE.
> 
> In NUMA system, it can be possible that each node has different populated
> zones. For example, node 0 could have DMA/DMA32/NORMAL/MOVABLE zone and
> node 1 could have only NORMAL zone. In this setup, allocation request
> initiated on node 0 and the one on node 1 would have different
> classzone_idx, 3 and 2, respectively, since their preferred_zones are
> different. If they are handled by only their own node, there is no problem.
> However, if they are sometimes handled by the remote node, the problem
> would happen.
> 
> In the following setup, allocation initiated on node 1 will have some
> precedence than allocation initiated on node 0 when former allocation is
> processed on node 0 due to not enough memory on node 1. They will have
> different lowmem reserve due to their different classzone_idx thus
> an watermark bars are also different.
> 
...

> 
> min watermark for NORMAL zone on node 0
> allocation initiated on node 0: 750 + 4096 = 4846
> allocation initiated on node 1: 750 + 0 = 750
> 
> This watermark difference could cause too many numa_miss allocation
> in some situation and then performance could be downgraded.
> 
> Recently, there was a regression report about this problem on CMA patches
> since CMA memory are placed in ZONE_MOVABLE by those patches. I checked
> that problem is disappeared with this fix that uses high_zoneidx
> for classzone_idx.
> 
> http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop
> 
> Using high_zoneidx for classzone_idx is more consistent way than previous
> approach because system's memory layout doesn't affect anything to it.

So to summarize;
- ac->high_zoneidx is computed via the arcane gfp_zone(gfp_mask) and
represents the highest zone the allocation can use
- classzone_idx was supposed to be the highest zone that the allocation
can use, that is actually available in the system. Somehow that became
the highest zone that is available on the preferred node (in the default
node-order zonelist), which causes the watermark inconsistencies you
mention.

I don't see a problem with your change. I would be worried about
inflated reserves when e.g. ZONE_MOVABLE doesn't exist, but that doesn't
seem to be the case. My laptop has empty ZONE_MOVABLE and the
ZONE_NORMAL protection for movable is 0.

But there had to be some reason for classzone_idx to be like this and
not simple high_zoneidx. Maybe Mel remembers? Maybe it was important
then, but is not anymore? Sigh, it seems to be pre-git.

> With this patch, both classzone_idx on above example will be 3 so will
> have the same min watermark.
> 
> allocation initiated on node 0: 750 + 4096 = 4846
> allocation initiated on node 1: 750 + 4096 = 4846
> 
> Reported-by: Ye Xiaolong <xiaolong.ye@intel.com>
> Tested-by: Ye Xiaolong <xiaolong.ye@intel.com>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> ---
>  mm/internal.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 228dd66..e1d7376 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -123,7 +123,7 @@ struct alloc_context {
>  	bool spread_dirty_pages;
>  };
>  
> -#define ac_classzone_idx(ac) zonelist_zone_idx(ac->preferred_zoneref)
> +#define ac_classzone_idx(ac) (ac->high_zoneidx)
>  
>  /*
>   * Locate the struct page for both the matching buddy in our
>
Joonsoo Kim May 4, 2018, 7:31 a.m. UTC | #2
2018-05-04 16:03 GMT+09:00 Vlastimil Babka <vbabka@suse.cz>:
> On 05/04/2018 06:30 AM, js1304@gmail.com wrote:
>> From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>>
>> Currently, we use the zone index of preferred_zone which represents
>> the best matching zone for allocation, as classzone_idx. It has a problem
>> on NUMA system with ZONE_MOVABLE.
>>
>> In NUMA system, it can be possible that each node has different populated
>> zones. For example, node 0 could have DMA/DMA32/NORMAL/MOVABLE zone and
>> node 1 could have only NORMAL zone. In this setup, allocation request
>> initiated on node 0 and the one on node 1 would have different
>> classzone_idx, 3 and 2, respectively, since their preferred_zones are
>> different. If they are handled by only their own node, there is no problem.
>> However, if they are sometimes handled by the remote node, the problem
>> would happen.
>>
>> In the following setup, allocation initiated on node 1 will have some
>> precedence than allocation initiated on node 0 when former allocation is
>> processed on node 0 due to not enough memory on node 1. They will have
>> different lowmem reserve due to their different classzone_idx thus
>> an watermark bars are also different.
>>
> ...
>
>>
>> min watermark for NORMAL zone on node 0
>> allocation initiated on node 0: 750 + 4096 = 4846
>> allocation initiated on node 1: 750 + 0 = 750
>>
>> This watermark difference could cause too many numa_miss allocation
>> in some situation and then performance could be downgraded.
>>
>> Recently, there was a regression report about this problem on CMA patches
>> since CMA memory are placed in ZONE_MOVABLE by those patches. I checked
>> that problem is disappeared with this fix that uses high_zoneidx
>> for classzone_idx.
>>
>> http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop
>>
>> Using high_zoneidx for classzone_idx is more consistent way than previous
>> approach because system's memory layout doesn't affect anything to it.
>
> So to summarize;
> - ac->high_zoneidx is computed via the arcane gfp_zone(gfp_mask) and
> represents the highest zone the allocation can use
> - classzone_idx was supposed to be the highest zone that the allocation
> can use, that is actually available in the system. Somehow that became
> the highest zone that is available on the preferred node (in the default
> node-order zonelist), which causes the watermark inconsistencies you
> mention.

Yes! Thanks for the summary!

> I don't see a problem with your change. I would be worried about
> inflated reserves when e.g. ZONE_MOVABLE doesn't exist, but that doesn't
> seem to be the case. My laptop has empty ZONE_MOVABLE and the
> ZONE_NORMAL protection for movable is 0.

Yes! The protection number is calculated from the number of managed pages
in the upper zones. If there is no memory in the upper zones, the protection
will be 0.
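
As a rough sketch of that calculation: the reserve a zone holds against requests targeting a higher zone j is the sum of managed pages in the zones above it, up to j, divided by a per-zone ratio. The ratios 256, 256 and 32 below are an assumption; they are not stated in this thread, but they reproduce the /proc/zoneinfo protection values quoted in the commit message. `calc_protection()` is a hypothetical helper, not a kernel function.

```c
enum { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE, NR_ZONES };

/* Assumed reserve ratios for DMA, DMA32 and Normal (see lead-in). */
static const long reserve_ratio[NR_ZONES - 1] = { 256, 256, 32 };

/*
 * Fill protection[] for zone `idx` of one node. For each higher zone j,
 * the reserve is the managed pages in zones idx+1..j divided by the
 * ratio of zone idx. Empty upper zones contribute nothing.
 */
static void calc_protection(const long managed[NR_ZONES], int idx,
			    long protection[NR_ZONES])
{
	long upper_pages = 0;

	for (int j = 0; j < NR_ZONES; j++)
		protection[j] = 0;

	for (int j = idx + 1; j < NR_ZONES; j++) {
		upper_pages += managed[j];
		protection[j] = upper_pages / reserve_ratio[idx];
	}
}
```

Feeding in node 0's managed page counts (3977, 758116, 503620, 131072) gives 4096 for Normal against Movable, and node 1's counts (0, 0, 257744, 0) give 0, which matches the observation that an empty upper zone adds no protection.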

> But there had to be some reason for classzone_idx to be like this and
> not simple high_zoneidx. Maybe Mel remembers? Maybe it was important
> then, but is not anymore? Sigh, it seems to be pre-git.

Based on my code inspection, changing the classzone_idx implementation as
this patch does would not cause a problem. I also tried to find the reason
for the current classzone_idx implementation by searching the git history,
but I couldn't. As you said, it seems to be pre-git. It would be really
helpful if someone who remembers the reason for the current classzone_idx
implementation could explain it.

Thanks.
Mel Gorman May 4, 2018, 10:33 a.m. UTC | #3
On Fri, May 04, 2018 at 09:03:02AM +0200, Vlastimil Babka wrote:
> > min watermark for NORMAL zone on node 0
> > allocation initiated on node 0: 750 + 4096 = 4846
> > allocation initiated on node 1: 750 + 0 = 750
> > 
> > This watermark difference could cause too many numa_miss allocation
> > in some situation and then performance could be downgraded.
> > 
> > Recently, there was a regression report about this problem on CMA patches
> > since CMA memory are placed in ZONE_MOVABLE by those patches. I checked
> > that problem is disappeared with this fix that uses high_zoneidx
> > for classzone_idx.
> > 
> > http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop
> > 
> > Using high_zoneidx for classzone_idx is more consistent way than previous
> > approach because system's memory layout doesn't affect anything to it.
> 
> So to summarize;
> - ac->high_zoneidx is computed via the arcane gfp_zone(gfp_mask) and
> represents the highest zone the allocation can use

It's arcane but it was simply a fast-path calculation. A much older
definition would be easier to understand but it was slower.
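
To illustrate what an easier-to-read, branch-based mapping from GFP flags to a zone index could look like, here is a sketch for a configuration with DMA, DMA32, Normal and Movable zones and no highmem. It is not the kernel's code, old or current: the `SK_GFP_*` bit values and `gfp_zone_sketch()` are hypothetical and exist only for this example.

```c
enum zone_sketch { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE };

/* Hypothetical flag bits for this sketch; not the kernel's definitions. */
#define SK_GFP_DMA	0x01u
#define SK_GFP_HIGHMEM	0x02u
#define SK_GFP_DMA32	0x04u
#define SK_GFP_MOVABLE	0x08u

/* Highest zone a request with these flags may use (its high_zoneidx). */
static enum zone_sketch gfp_zone_sketch(unsigned int flags)
{
	if (flags & SK_GFP_DMA)
		return ZONE_DMA;
	if (flags & SK_GFP_DMA32)
		return ZONE_DMA32;
	/* Movable is only eligible when highmem is also allowed. */
	if ((flags & (SK_GFP_HIGHMEM | SK_GFP_MOVABLE)) ==
	    (SK_GFP_HIGHMEM | SK_GFP_MOVABLE))
		return ZONE_MOVABLE;
	return ZONE_NORMAL;
}
```

A chain of branches like this is simple to follow but costs several conditional tests per allocation, which is the kind of trade-off a table-driven fast path avoids.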

> - classzone_idx was supposed to be the highest zone that the allocation
> can use, that is actually available in the system. Somehow that became
> the highest zone that is available on the preferred node (in the default
> node-order zonelist), which causes the watermark inconsistencies you
> mention.
> 

I think it *always* was the index of the first preferred zone of a
zonelist. The treatment of classzone has changed a lot over the years and
I didn't do a historical check but the general intent was always "protect
some pages in lower zones". This was particularly important for 32-bit
and highmem albeit that is less of a concern today. When it transferred to
NUMA, I don't think it ever was seriously considered if it should change
as the critical node was likely to be node 0 with all the zones and the
remote nodes all used the highest zone. CMA/MOVABLE changed that slightly
by allowing the possibility of node0 having a "higher" zone than every
other node. When MOVABLE was introduced, it wasn't much of a problem as
the purpose of MOVABLE was for systems that dynamically needed to allocate
hugetlbfs later in the runtime but for CMA, it was a lot more critical
for ordinary usage so this is primarily a CMA thing.

> I don't see a problem with your change. I would be worried about
> inflated reserves when e.g. ZONE_MOVABLE doesn't exist, but that doesn't
> seem to be the case. My laptop has empty ZONE_MOVABLE and the
> ZONE_NORMAL protection for movable is 0.
> 
> But there had to be some reason for classzone_idx to be like this and
> not simple high_zoneidx. Maybe Mel remembers? Maybe it was important
> then, but is not anymore? Sigh, it seems to be pre-git.
> 

classzone predates my involvement with Linux but I would be less concerned
about what the original intent was and instead ensure that classzone index
is consistent, sane and potentially renamed while preserving the intent of
"reserve pages in lower zones when an allocation request can use higher
zones". While historically the critical intent was to preserve Normal and
to a lesser extent DMA on 32-bit systems, there still should be some care
of DMA32 so we should not lose that.

With the patch, the allocator looks like it would be fine as just
reservations change. I think it's unlikely that CMA usage will result
in lowmem starvation.  Compaction becomes a bit weird as classzone index
has no special meaning versus highmem and I think it'll be very easy to
forget. Similarly, vmscan can reclaim pages from remote nodes and zones
that are higher than the original request. That is not likely to be a
problem but it's a change in behaviour and easy to miss.

Fundamentally, I find it extremely weird we now have two variables that are
essentially the same thing. They should be collapsed into one variable,
renamed and documented on what the index means for page allocator,
compaction, vmscan and the special casing around CMA.
Joonsoo Kim May 8, 2018, 1 a.m. UTC | #4
Hello, Mel.

Thanks for the valuable input!

2018-05-04 19:33 GMT+09:00 Mel Gorman <mgorman@suse.de>:
> On Fri, May 04, 2018 at 09:03:02AM +0200, Vlastimil Babka wrote:
>> > min watermark for NORMAL zone on node 0
>> > allocation initiated on node 0: 750 + 4096 = 4846
>> > allocation initiated on node 1: 750 + 0 = 750
>> >
>> > This watermark difference could cause too many numa_miss allocation
>> > in some situation and then performance could be downgraded.
>> >
>> > Recently, there was a regression report about this problem on CMA patches
>> > since CMA memory are placed in ZONE_MOVABLE by those patches. I checked
>> > that problem is disappeared with this fix that uses high_zoneidx
>> > for classzone_idx.
>> >
>> > http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop
>> >
>> > Using high_zoneidx for classzone_idx is more consistent way than previous
>> > approach because system's memory layout doesn't affect anything to it.
>>
>> So to summarize;
>> - ac->high_zoneidx is computed via the arcane gfp_zone(gfp_mask) and
>> represents the highest zone the allocation can use
>
> It's arcane but it was simply a fast-path calculation. A much older
> definition would be easier to understand but it was slower.
>
>> - classzone_idx was supposed to be the highest zone that the allocation
>> can use, that is actually available in the system. Somehow that became
>> the highest zone that is available on the preferred node (in the default
>> node-order zonelist), which causes the watermark inconsistencies you
>> mention.
>>
>
> I think it *always* was the index of the first preferred zone of a
> zonelist. The treatment of classzone has changed a lot over the years and
> I didn't do a historical check but the general intent was always "protect
> some pages in lower zones". This was particularly important for 32-bit
> and highmem albeit that is less of a concern today. When it transferred to
> NUMA, I don't think it ever was seriously considered if it should change
> as the critical node was likely to be node 0 with all the zones and the
> remote nodes all used the highest zone. CMA/MOVABLE changed that slightly
> by allowing the possibility of node0 having a "higher" zone than every

I think this problem is related not only to protection of lowmem (that is,
zones lower than Normal) but also to node balance.

In fact, the problem reported by zeroday-bot is caused by node1 having a
"higher" zone. In this case, node0's lowmem is well protected, but the
node balance of allocations is broken since node1's normal memory cannot
be protected from allocations initiated on a remote node.

> other node. When MOVABLE was introduced, it wasn't much of a problem as
> the purpose of MOVABLE was for systems that dynamically needed to allocate
> hugetlbfs later in the runtime but for CMA, it was a lot more critical
> for ordinary usage so this is primarily a CMA thing.

I'm not sure that it's primarily a CMA thing. There is another critical setup
for this problem: memory hotplug. If someone hotplugs new memory into
the MOVABLE zone, a "higher" zone will be created on a specific node and
this problem occurs. I have checked this with QEMU.

>> I don't see a problem with your change. I would be worried about
>> inflated reserves when e.g. ZONE_MOVABLE doesn't exist, but that doesn't
>> seem to be the case. My laptop has empty ZONE_MOVABLE and the
>> ZONE_NORMAL protection for movable is 0.
>>
>> But there had to be some reason for classzone_idx to be like this and
>> not simple high_zoneidx. Maybe Mel remembers? Maybe it was important
>> then, but is not anymore? Sigh, it seems to be pre-git.
>>
>
> classzone predates my involvement with Linux but I would be less concerned
> about what the original intent was and instead ensure that classzone index
> is consistent, sane and potentially renamed while preserving the intent of
> "reserve pages in lower zones when an allocation request can use higher
> zones". While historically the critical intent was to preserve Normal and
> to a lesser extent DMA on 32-bit systems, there still should be some care
> of DMA32 so we should not lose that.

Agreed!

> With the patch, the allocator looks like it would be fine as just
> reservations change. I think it's unlikely that CMA usage will result
> in lowmem starvation.  Compaction becomes a bit weird as classzone index
> has no special meaning versus highmem and I think it'll be very easy to
> forget. Similarly, vmscan can reclaim pages from remote nodes and zones
> that are higher than the original request. That is not likely to be a
> problem but it's a change in behaviour and easy to miss.
>
> Fundamentally, I find it extremely weird we now have two variables that are
> essentially the same thing. They should be collapsed into one variable,
> renamed and documented on what the index means for page allocator,
> compaction, vmscan and the special casing around CMA.

Agreed!
I will update this patch to reflect your comments. If anyone has an idea
for renaming this variable, please let me know.

Thanks.
Andrew Morton May 8, 2018, 11:13 p.m. UTC | #5
On Fri, 4 May 2018 11:33:22 +0100 Mel Gorman <mgorman@suse.de> wrote:

> On Fri, May 04, 2018 at 09:03:02AM +0200, Vlastimil Babka wrote:
> > > min watermark for NORMAL zone on node 0
> > > allocation initiated on node 0: 750 + 4096 = 4846
> > > allocation initiated on node 1: 750 + 0 = 750
> > > 
> > > This watermark difference could cause too many numa_miss allocation
> > > in some situation and then performance could be downgraded.
> > > 
> > > Recently, there was a regression report about this problem on CMA patches
> > > since CMA memory are placed in ZONE_MOVABLE by those patches. I checked
> > > that problem is disappeared with this fix that uses high_zoneidx
> > > for classzone_idx.
> > > 
> > > http://lkml.kernel.org/r/20180102063528.GG30397@yexl-desktop
> > > 
> > > Using high_zoneidx for classzone_idx is more consistent way than previous
> > > approach because system's memory layout doesn't affect anything to it.
> > 
> > So to summarize;
> > - ac->high_zoneidx is computed via the arcane gfp_zone(gfp_mask) and
> > represents the highest zone the allocation can use
> 
> It's arcane but it was simply a fast-path calculation. A much older
> definition would be easier to understand but it was slower.
> 
> > - classzone_idx was supposed to be the highest zone that the allocation
> > can use, that is actually available in the system. Somehow that became
> > the highest zone that is available on the preferred node (in the default
> > node-order zonelist), which causes the watermark inconsistencies you
> > mention.
> > 
> 
> I think it *always* was the index of the first preferred zone of a
> zonelist. The treatment of classzone has changed a lot over the years and
> I didn't do a historical check but the general intent was always "protect
> some pages in lower zones". This was particularly important for 32-bit
> and highmem albeit that is less of a concern today. When it transferred to
> NUMA, I don't think it ever was seriously considered if it should change
> as the critical node was likely to be node 0 with all the zones and the
> remote nodes all used the highest zone. CMA/MOVABLE changed that slightly
> by allowing the possibility of node0 having a "higher" zone than every
> other node. When MOVABLE was introduced, it wasn't much of a problem as
> the purpose of MOVABLE was for systems that dynamically needed to allocate
> hugetlbfs later in the runtime but for CMA, it was a lot more critical
> for ordinary usage so this is primarily a CMA thing.
> 
> > I don't see a problem with your change. I would be worried about
> > inflated reserves when e.g. ZONE_MOVABLE doesn't exist, but that doesn't
> > seem to be the case. My laptop has empty ZONE_MOVABLE and the
> > ZONE_NORMAL protection for movable is 0.
> > 
> > But there had to be some reason for classzone_idx to be like this and
> > not simple high_zoneidx. Maybe Mel remembers? Maybe it was important
> > then, but is not anymore? Sigh, it seems to be pre-git.
> > 
> 
> classzone predates my involvement with Linux but I would be less concerned
> about what the original intent was and instead ensure that classzone index
> is consistent, sane and potentially renamed while preserving the intent of
> "reserve pages in lower zones when an allocation request can use higher
> zones". While historically the critical intent was to preserve Normal and
> to a lesser extent DMA on 32-bit systems, there still should be some care
> of DMA32 so we should not lose that.
> 
> With the patch, the allocator looks like it would be fine as just
> reservations change. I think it's unlikely that CMA usage will result
> in lowmem starvation.  Compaction becomes a bit weird as classzone index
> has no special meaning versus highmem and I think it'll be very easy to
> forget. Similarly, vmscan can reclaim pages from remote nodes and zones
> that are higher than the original request. That is not likely to be a
> problem but it's a change in behaviour and easy to miss.
> 
> Fundamentally, I find it extremely weird we now have two variables that are
> essentially the same thing. They should be collapsed into one variable,
> renamed and documented on what the index means for page allocator,
> compaction, vmscan and the special casing around CMA.

You're all so young ;)

classzone was Andrea.  Perhaps he can shed some light upon the
questions which have been raised?
Vlastimil Babka May 16, 2018, 9:35 a.m. UTC | #6
On 05/08/2018 03:00 AM, Joonsoo Kim wrote:
>> classzone predates my involvement with Linux but I would be less concerned
>> about what the original intent was and instead ensure that classzone index
>> is consistent, sane and potentially renamed while preserving the intent of
>> "reserve pages in lower zones when an allocation request can use higher
>> zones". While historically the critical intent was to preserve Normal and
>> to a lesser extent DMA on 32-bit systems, there still should be some care
>> of DMA32 so we should not lose that.
> 
> Agreed!
> 
>> With the patch, the allocator looks like it would be fine as just
>> reservations change. I think it's unlikely that CMA usage will result
>> in lowmem starvation.  Compaction becomes a bit weird as classzone index
>> has no special meaning versus highmem and I think it'll be very easy to
>> forget.

I don't understand this point, what do you mean about highmem here? I've
checked and compaction seems to use classzone_idx 1) to pass it to
watermark checks as part of compaction suitability checks, i.e. the
usual lowmem protection, and 2) to limit compaction of higher zones in
kcompactd if the direct compactor can't use them anyway - seems this
part has currently the same zone imbalance problem as reclaim.

>> Similarly, vmscan can reclaim pages from remote nodes and zones
>> that are higher than the original request. That is not likely to be a
>> problem but it's a change in behaviour and easy to miss.
>>
>> Fundamentally, I find it extremely weird we now have two variables that are
>> essentially the same thing. They should be collapsed into one variable,
>> renamed and documented on what the index means for page allocator,
>> compaction, vmscan and the special casing around CMA.
> 
> Agreed!
> I will update this patch to reflect your comment. If someone have an idea
> on renaming this variable, please let me know.

Perhaps max_zone_idx? Seems a bit more clear than "high_zoneidx". And I
have no idea what was actually meant by "class".

> Thanks.
>
Mel Gorman May 16, 2018, 10:28 a.m. UTC | #7
On Wed, May 16, 2018 at 11:35:55AM +0200, Vlastimil Babka wrote:
> On 05/08/2018 03:00 AM, Joonsoo Kim wrote:
> >> classzone predates my involvement with Linux but I would be less concerned
> >> about what the original intent was and instead ensure that classzone index
> >> is consistent, sane and potentially renamed while preserving the intent of
> >> "reserve pages in lower zones when an allocation request can use higher
> >> zones". While historically the critical intent was to preserve Normal and
> >> to a lesser extent DMA on 32-bit systems, there still should be some care
> >> of DMA32 so we should not lose that.
> > 
> > Agreed!
> > 
> >> With the patch, the allocator looks like it would be fine as just
> >> reservations change. I think it's unlikely that CMA usage will result
> >> in lowmem starvation.  Compaction becomes a bit weird as classzone index
> >> has no special meaning versus highmem and I think it'll be very easy to
> >> forget.
> 
> I don't understand this point, what do you mean about highmem here?

I mean it has no special meaning as compaction is not primarily concerned
with lowmem protections as it compacts within a zone. It preserves watermarks
but it does not have the same degree of criticality as the page allocator
and reclaim is concerned with.

> I've
> checked and compaction seems to use classzone_idx 1) to pass it to
> watermark checks as part of compaction suitability checks, i.e. the
> usual lowmem protection, and 2) to limit compaction of higher zones in
> kcompactd if the direct compactor can't use them anyway - seems this
> part has currently the same zone imbalance problem as reclaim.
> 

Originally the watermark check for compaction was primarily about not
depleting a single zone but the checks were duplicated anyway. It's not
actually super critical for it to preserve lowmem zones as any memory
usage by compaction is transient.

> > Agreed!
> > I will update this patch to reflect your comment. If someone have an idea
> > on renaming this variable, please let me know.
> 
> Perhaps max_zone_idx? Seems a bit more clear than "high_zoneidx". And I
> have no idea what was actually meant by "class".
> 

I don't have a better suggestion.

Patch

diff --git a/mm/internal.h b/mm/internal.h
index 228dd66..e1d7376 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -123,7 +123,7 @@  struct alloc_context {
 	bool spread_dirty_pages;
 };
 
-#define ac_classzone_idx(ac) zonelist_zone_idx(ac->preferred_zoneref)
+#define ac_classzone_idx(ac) (ac->high_zoneidx)
 
 /*
  * Locate the struct page for both the matching buddy in our