diff mbox series

[v5,9/9] mm/demotion: Update node_is_toptier to work with memory tiers

Message ID 20220603134237.131362-10-aneesh.kumar@linux.ibm.com (mailing list archive)
State New
Headers show
Series mm/demotion: Memory tiers and demotion | expand

Commit Message

Aneesh Kumar K.V June 3, 2022, 1:42 p.m. UTC
With memory tiers support we can have memory on NUMA nodes
in the top tier from which we want to avoid promotion tracking NUMA
faults. Update node_is_toptier to work with memory tiers. To
avoid taking locks, a nodemask is maintained for all demotion
targets. All NUMA nodes are by default top tier nodes and as
we add new lower memory tiers NUMA nodes get added to the
demotion targets thereby moving them out of the top tier.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h | 16 ++++++++++++++++
 include/linux/node.h         |  5 -----
 mm/huge_memory.c             |  1 +
 mm/memory-tiers.c            | 10 ++++++++++
 mm/migrate.c                 |  1 +
 mm/mprotect.c                |  1 +
 6 files changed, 29 insertions(+), 5 deletions(-)

Comments

Huang, Ying June 6, 2022, 3:11 a.m. UTC | #1
On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
> With memory tiers support we can have memory on NUMA nodes
> in the top tier from which we want to avoid promotion tracking NUMA
> faults. Update node_is_toptier to work with memory tiers. To
> avoid taking locks, a nodemask is maintained for all demotion
> targets. All NUMA nodes are by default top tier nodes and as
> we add new lower memory tiers NUMA nodes get added to the
> demotion targets thereby moving them out of the top tier.

Check the usage of node_is_toptier(),

- migrate_misplaced_page()
  node_is_toptier() is used to check whether migration is a promotion. 
We can avoid to use it.  Just compare the rank of the nodes.

- change_pte_range() and change_huge_pmd()
  node_is_toptier() is used to avoid scanning fast memory (DRAM) pages
for promotion.  So I think we should change the name to node_is_fast()
as follows,

static inline bool node_is_fast(int node)
{
	return NODE_DATA(node)->mt_rank >= MEMORY_RANK_DRAM;
}

And, as above, I suggest to add memory tier ID and rank to struct
pglist_data directly.

Best Regards,
Huang, Ying
Aneesh Kumar K.V June 6, 2022, 3:52 a.m. UTC | #2
On 6/6/22 8:41 AM, Ying Huang wrote:
> On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
>> With memory tiers support we can have memory on NUMA nodes
>> in the top tier from which we want to avoid promotion tracking NUMA
>> faults. Update node_is_toptier to work with memory tiers. To
>> avoid taking locks, a nodemask is maintained for all demotion
>> targets. All NUMA nodes are by default top tier nodes and as
>> we add new lower memory tiers NUMA nodes get added to the
>> demotion targets thereby moving them out of the top tier.
> 
> Check the usage of node_is_toptier(),
> 
> - migrate_misplaced_page()
>    node_is_toptier() is used to check whether migration is a promotion.
> We can avoid to use it.  Just compare the rank of the nodes.
> 
> - change_pte_range() and change_huge_pmd()
>    node_is_toptier() is used to avoid scanning fast memory (DRAM) pages
> for promotion.  So I think we should change the name to node_is_fast()
> as follows,
> 
> static inline bool node_is_fast(int node)
> {
> 	return NODE_DATA(node)->mt_rank >= MEMORY_RANK_DRAM;
> }
> 

But that gives special meaning to MEMORY_RANK_DRAM. As detailed in other 
patches, absolute value of rank doesn't carry any meaning. It is only
the relative value w.r.t other memory tiers that decide whether it is 
fast or not. Agreed by default memory tiers get built with 
MEMORY_RANK_DRAM. But userspace can change the rank value of 'memtier1' 
Hence to determine a node is consisting of fast memory is essentially 
figuring out whether node is the top most tier in memory hierarchy and 
not just the memory tier rank value is >= MEMORY_RANK_DRAM?

-aneesh
Huang, Ying June 6, 2022, 7:24 a.m. UTC | #3
On Mon, 2022-06-06 at 09:22 +0530, Aneesh Kumar K V wrote:
> On 6/6/22 8:41 AM, Ying Huang wrote:
> > On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
> > > With memory tiers support we can have memory on NUMA nodes
> > > in the top tier from which we want to avoid promotion tracking NUMA
> > > faults. Update node_is_toptier to work with memory tiers. To
> > > avoid taking locks, a nodemask is maintained for all demotion
> > > targets. All NUMA nodes are by default top tier nodes and as
> > > we add new lower memory tiers NUMA nodes get added to the
> > > demotion targets thereby moving them out of the top tier.
> > 
> > Check the usage of node_is_toptier(),
> > 
> > - migrate_misplaced_page()
> >    node_is_toptier() is used to check whether migration is a promotion.
> > We can avoid to use it.  Just compare the rank of the nodes.
> > 
> > - change_pte_range() and change_huge_pmd()
> >    node_is_toptier() is used to avoid scanning fast memory (DRAM) pages
> > for promotion.  So I think we should change the name to node_is_fast()
> > as follows,
> > 
> > static inline bool node_is_fast(int node)
> > {
> > 	return NODE_DATA(node)->mt_rank >= MEMORY_RANK_DRAM;
> > }
> > 
> 
> But that gives special meaning to MEMORY_RANK_DRAM. As detailed in other 
> patches, absolute value of rank doesn't carry any meaning. It is only
> the relative value w.r.t other memory tiers that decide whether it is 
> fast or not. Agreed by default memory tiers get built with 
> MEMORY_RANK_DRAM. But userspace can change the rank value of 'memtier1' 
> Hence to determine a node is consisting of fast memory is essentially 
> figuring out whether node is the top most tier in memory hierarchy and 
> not just the memory tier rank value is >= MEMORY_RANK_DRAM?

In a system with 3 tiers,

HBM	0
DRAM	1
PMEM	2

In your implementation, only HBM will be considered fast.  But what we
need is to consider both HBM and DRAM fast.  Because we use NUMA
balancing to promote PMEM pages to DRAM.  It's unnecessary to scan HBM
and DRAM pages for that.  And there're no requirements to promote DRAM
pages to HBM with NUMA balancing.

I can understand that the memory tiers are more dynamic now.  For
requirements of NUMA balancing, we need the lowest memory tier (rank)
where there's at least one node with CPU.  The nodes in it and the
higher tiers will be considered fast. 


Best Regards,
Huang, Ying
Aneesh Kumar K.V June 6, 2022, 8:33 a.m. UTC | #4
On 6/6/22 12:54 PM, Ying Huang wrote:
> On Mon, 2022-06-06 at 09:22 +0530, Aneesh Kumar K V wrote:
>> On 6/6/22 8:41 AM, Ying Huang wrote:
>>> On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
>>>> With memory tiers support we can have memory on NUMA nodes
>>>> in the top tier from which we want to avoid promotion tracking NUMA
>>>> faults. Update node_is_toptier to work with memory tiers. To
>>>> avoid taking locks, a nodemask is maintained for all demotion
>>>> targets. All NUMA nodes are by default top tier nodes and as
>>>> we add new lower memory tiers NUMA nodes get added to the
>>>> demotion targets thereby moving them out of the top tier.
>>>
>>> Check the usage of node_is_toptier(),
>>>
>>> - migrate_misplaced_page()
>>>     node_is_toptier() is used to check whether migration is a promotion.
>>> We can avoid to use it.  Just compare the rank of the nodes.
>>>
>>> - change_pte_range() and change_huge_pmd()
>>>     node_is_toptier() is used to avoid scanning fast memory (DRAM) pages
>>> for promotion.  So I think we should change the name to node_is_fast()
>>> as follows,
>>>
>>> static inline bool node_is_fast(int node)
>>> {
>>> 	return NODE_DATA(node)->mt_rank >= MEMORY_RANK_DRAM;
>>> }
>>>
>>
>> But that gives special meaning to MEMORY_RANK_DRAM. As detailed in other
>> patches, absolute value of rank doesn't carry any meaning. It is only
>> the relative value w.r.t other memory tiers that decide whether it is
>> fast or not. Agreed by default memory tiers get built with
>> MEMORY_RANK_DRAM. But userspace can change the rank value of 'memtier1'
>> Hence to determine a node is consisting of fast memory is essentially
>> figuring out whether node is the top most tier in memory hierarchy and
>> not just the memory tier rank value is >= MEMORY_RANK_DRAM?
> 
> In a system with 3 tiers,
> 
> HBM	0
> DRAM	1
> PMEM	2
> 
> In your implementation, only HBM will be considered fast.  But what we
> need is to consider both HBM and DRAM fast.  Because we use NUMA
> balancing to promote PMEM pages to DRAM.  It's unnecessary to scan HBM
> and DRAM pages for that.  And there're no requirements to promote DRAM
> pages to HBM with NUMA balancing.
> 
> I can understand that the memory tiers are more dynamic now.  For
> requirements of NUMA balancing, we need the lowest memory tier (rank)
> where there's at least one node with CPU.  The nodes in it and the
> higher tiers will be considered fast.
> 

is this good (not tested)?
/*
  * build the allowed promotion mask. Promotion is allowed
  * from higher memory tier to lower memory tier only if
  * lower memory tier doesn't include compute. We want to
  * skip promotion from a memory tier, if any node which is
  * part of that memory tier have CPUs. Once we detect such
  * a memory tier, we consider that tier as top tier from
  * which promotion is not allowed.
  */
list_for_each_entry_reverse(memtier, &memory_tiers, list) {
	nodes_and(allowed, node_state[N_CPU], memtier->nodelist);
	if (nodes_empty(allowed))
		nodes_or(promotion_mask, promotion_mask, allowed);
	else
		break;
}

and then

static inline bool node_is_toptier(int node)
{

	return !node_isset(node, promotion_mask);
}
Huang, Ying June 8, 2022, 7:26 a.m. UTC | #5
On Mon, 2022-06-06 at 14:03 +0530, Aneesh Kumar K V wrote:
> On 6/6/22 12:54 PM, Ying Huang wrote:
> > On Mon, 2022-06-06 at 09:22 +0530, Aneesh Kumar K V wrote:
> > > On 6/6/22 8:41 AM, Ying Huang wrote:
> > > > On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
> > > > > With memory tiers support we can have memory on NUMA nodes
> > > > > in the top tier from which we want to avoid promotion tracking NUMA
> > > > > faults. Update node_is_toptier to work with memory tiers. To
> > > > > avoid taking locks, a nodemask is maintained for all demotion
> > > > > targets. All NUMA nodes are by default top tier nodes and as
> > > > > we add new lower memory tiers NUMA nodes get added to the
> > > > > demotion targets thereby moving them out of the top tier.
> > > > 
> > > > Check the usage of node_is_toptier(),
> > > > 
> > > > - migrate_misplaced_page()
> > > >     node_is_toptier() is used to check whether migration is a promotion.
> > > > We can avoid to use it.  Just compare the rank of the nodes.
> > > > 
> > > > - change_pte_range() and change_huge_pmd()
> > > >     node_is_toptier() is used to avoid scanning fast memory (DRAM) pages
> > > > for promotion.  So I think we should change the name to node_is_fast()
> > > > as follows,
> > > > 
> > > > static inline bool node_is_fast(int node)
> > > > {
> > > > 	return NODE_DATA(node)->mt_rank >= MEMORY_RANK_DRAM;
> > > > }
> > > > 
> > > 
> > > But that gives special meaning to MEMORY_RANK_DRAM. As detailed in other
> > > patches, absolute value of rank doesn't carry any meaning. It is only
> > > the relative value w.r.t other memory tiers that decide whether it is
> > > fast or not. Agreed by default memory tiers get built with
> > > MEMORY_RANK_DRAM. But userspace can change the rank value of 'memtier1'
> > > Hence to determine a node is consisting of fast memory is essentially
> > > figuring out whether node is the top most tier in memory hierarchy and
> > > not just the memory tier rank value is >= MEMORY_RANK_DRAM?
> > 
> > In a system with 3 tiers,
> > 
> > HBM	0
> > DRAM	1
> > PMEM	2
> > 
> > In your implementation, only HBM will be considered fast.  But what we
> > need is to consider both HBM and DRAM fast.  Because we use NUMA
> > balancing to promote PMEM pages to DRAM.  It's unnecessary to scan HBM
> > and DRAM pages for that.  And there're no requirements to promote DRAM
> > pages to HBM with NUMA balancing.
> > 
> > I can understand that the memory tiers are more dynamic now.  For
> > requirements of NUMA balancing, we need the lowest memory tier (rank)
> > where there's at least one node with CPU.  The nodes in it and the
> > higher tiers will be considered fast.
> > 
> 
> is this good (not tested)?
> /*
>   * build the allowed promotion mask. Promotion is allowed
>   * from higher memory tier to lower memory tier only if
>   * lower memory tier doesn't include compute. We want to
>   * skip promotion from a memory tier, if any node which is
>   * part of that memory tier have CPUs. Once we detect such
>   * a memory tier, we consider that tier as top tier from
>   * which promotion is not allowed.
>   */
> list_for_each_entry_reverse(memtier, &memory_tiers, list) {
> 	nodes_and(allowed, node_state[N_CPU], memtier->nodelist);
> 	if (nodes_empty(allowed))
> 		nodes_or(promotion_mask, promotion_mask, allowed);
> 	else
> 		break;
> }
> 
> and then
> 
> static inline bool node_is_toptier(int node)
> {
> 
> 	return !node_isset(node, promotion_mask);
> }
> 

This should work.  But it appears unnatural.  So, I don't think we
should avoid to add more and more node masks to mitigate the design
decision that we cannot access memory tier information directly.  All
these becomes simple and natural, if we can access memory tier
information directly.

Best Regards,
Huang, Ying
Aneesh Kumar K.V June 8, 2022, 8:28 a.m. UTC | #6
On 6/8/22 12:56 PM, Ying Huang wrote:
> On Mon, 2022-06-06 at 14:03 +0530, Aneesh Kumar K V wrote:
>> On 6/6/22 12:54 PM, Ying Huang wrote:
>>> On Mon, 2022-06-06 at 09:22 +0530, Aneesh Kumar K V wrote:
>>>> On 6/6/22 8:41 AM, Ying Huang wrote:
>>>>> On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
>>>>>> With memory tiers support we can have memory on NUMA nodes
>>>>>> in the top tier from which we want to avoid promotion tracking NUMA
>>>>>> faults. Update node_is_toptier to work with memory tiers. To
>>>>>> avoid taking locks, a nodemask is maintained for all demotion
>>>>>> targets. All NUMA nodes are by default top tier nodes and as
>>>>>> we add new lower memory tiers NUMA nodes get added to the
>>>>>> demotion targets thereby moving them out of the top tier.
>>>>>
>>>>> Check the usage of node_is_toptier(),
>>>>>
>>>>> - migrate_misplaced_page()
>>>>>      node_is_toptier() is used to check whether migration is a promotion.
>>>>> We can avoid to use it.  Just compare the rank of the nodes.
>>>>>
>>>>> - change_pte_range() and change_huge_pmd()
>>>>>      node_is_toptier() is used to avoid scanning fast memory (DRAM) pages
>>>>> for promotion.  So I think we should change the name to node_is_fast()
>>>>> as follows,
>>>>>
>>>>> static inline bool node_is_fast(int node)
>>>>> {
>>>>> 	return NODE_DATA(node)->mt_rank >= MEMORY_RANK_DRAM;
>>>>> }
>>>>>
>>>>
>>>> But that gives special meaning to MEMORY_RANK_DRAM. As detailed in other
>>>> patches, absolute value of rank doesn't carry any meaning. It is only
>>>> the relative value w.r.t other memory tiers that decide whether it is
>>>> fast or not. Agreed by default memory tiers get built with
>>>> MEMORY_RANK_DRAM. But userspace can change the rank value of 'memtier1'
>>>> Hence to determine a node is consisting of fast memory is essentially
>>>> figuring out whether node is the top most tier in memory hierarchy and
>>>> not just the memory tier rank value is >= MEMORY_RANK_DRAM?
>>>
>>> In a system with 3 tiers,
>>>
>>> HBM	0
>>> DRAM	1
>>> PMEM	2
>>>
>>> In your implementation, only HBM will be considered fast.  But what we
>>> need is to consider both HBM and DRAM fast.  Because we use NUMA
>>> balancing to promote PMEM pages to DRAM.  It's unnecessary to scan HBM
>>> and DRAM pages for that.  And there're no requirements to promote DRAM
>>> pages to HBM with NUMA balancing.
>>>
>>> I can understand that the memory tiers are more dynamic now.  For
>>> requirements of NUMA balancing, we need the lowest memory tier (rank)
>>> where there's at least one node with CPU.  The nodes in it and the
>>> higher tiers will be considered fast.
>>>
>>
>> is this good (not tested)?
>> /*
>>    * build the allowed promotion mask. Promotion is allowed
>>    * from higher memory tier to lower memory tier only if
>>    * lower memory tier doesn't include compute. We want to
>>    * skip promotion from a memory tier, if any node which is
>>    * part of that memory tier have CPUs. Once we detect such
>>    * a memory tier, we consider that tier as top tier from
>>    * which promotion is not allowed.
>>    */
>> list_for_each_entry_reverse(memtier, &memory_tiers, list) {
>> 	nodes_and(allowed, node_state[N_CPU], memtier->nodelist);
>> 	if (nodes_empty(allowed))
>> 		nodes_or(promotion_mask, promotion_mask, allowed);
>> 	else
>> 		break;
>> }
>>
>> and then
>>
>> static inline bool node_is_toptier(int node)
>> {
>>
>> 	return !node_isset(node, promotion_mask);
>> }
>>
> 
> This should work.  But it appears unnatural.  So, I don't think we
> should avoid to add more and more node masks to mitigate the design
> decision that we cannot access memory tier information directly.  All
> these becomes simple and natural, if we can access memory tier
> information directly.
> 

how do you derive whether node is toptier details if we have memtier 
details in pgdat?

-aneesh
Huang, Ying June 8, 2022, 8:32 a.m. UTC | #7
On Wed, 2022-06-08 at 13:58 +0530, Aneesh Kumar K V wrote:
> On 6/8/22 12:56 PM, Ying Huang wrote:
> > On Mon, 2022-06-06 at 14:03 +0530, Aneesh Kumar K V wrote:
> > > On 6/6/22 12:54 PM, Ying Huang wrote:
> > > > On Mon, 2022-06-06 at 09:22 +0530, Aneesh Kumar K V wrote:
> > > > > On 6/6/22 8:41 AM, Ying Huang wrote:
> > > > > > On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
> > > > > > > With memory tiers support we can have memory on NUMA nodes
> > > > > > > in the top tier from which we want to avoid promotion tracking NUMA
> > > > > > > faults. Update node_is_toptier to work with memory tiers. To
> > > > > > > avoid taking locks, a nodemask is maintained for all demotion
> > > > > > > targets. All NUMA nodes are by default top tier nodes and as
> > > > > > > we add new lower memory tiers NUMA nodes get added to the
> > > > > > > demotion targets thereby moving them out of the top tier.
> > > > > > 
> > > > > > Check the usage of node_is_toptier(),
> > > > > > 
> > > > > > - migrate_misplaced_page()
> > > > > >      node_is_toptier() is used to check whether migration is a promotion.
> > > > > > We can avoid to use it.  Just compare the rank of the nodes.
> > > > > > 
> > > > > > - change_pte_range() and change_huge_pmd()
> > > > > >      node_is_toptier() is used to avoid scanning fast memory (DRAM) pages
> > > > > > for promotion.  So I think we should change the name to node_is_fast()
> > > > > > as follows,
> > > > > > 
> > > > > > static inline bool node_is_fast(int node)
> > > > > > {
> > > > > > 	return NODE_DATA(node)->mt_rank >= MEMORY_RANK_DRAM;
> > > > > > }
> > > > > > 
> > > > > 
> > > > > But that gives special meaning to MEMORY_RANK_DRAM. As detailed in other
> > > > > patches, absolute value of rank doesn't carry any meaning. It is only
> > > > > the relative value w.r.t other memory tiers that decide whether it is
> > > > > fast or not. Agreed by default memory tiers get built with
> > > > > MEMORY_RANK_DRAM. But userspace can change the rank value of 'memtier1'
> > > > > Hence to determine a node is consisting of fast memory is essentially
> > > > > figuring out whether node is the top most tier in memory hierarchy and
> > > > > not just the memory tier rank value is >= MEMORY_RANK_DRAM?
> > > > 
> > > > In a system with 3 tiers,
> > > > 
> > > > HBM	0
> > > > DRAM	1
> > > > PMEM	2
> > > > 
> > > > In your implementation, only HBM will be considered fast.  But what we
> > > > need is to consider both HBM and DRAM fast.  Because we use NUMA
> > > > balancing to promote PMEM pages to DRAM.  It's unnecessary to scan HBM
> > > > and DRAM pages for that.  And there're no requirements to promote DRAM
> > > > pages to HBM with NUMA balancing.
> > > > 
> > > > I can understand that the memory tiers are more dynamic now.  For
> > > > requirements of NUMA balancing, we need the lowest memory tier (rank)
> > > > where there's at least one node with CPU.  The nodes in it and the
> > > > higher tiers will be considered fast.
> > > > 
> > > 
> > > is this good (not tested)?
> > > /*
> > >    * build the allowed promotion mask. Promotion is allowed
> > >    * from higher memory tier to lower memory tier only if
> > >    * lower memory tier doesn't include compute. We want to
> > >    * skip promotion from a memory tier, if any node which is
> > >    * part of that memory tier have CPUs. Once we detect such
> > >    * a memory tier, we consider that tier as top tier from
> > >    * which promotion is not allowed.
> > >    */
> > > list_for_each_entry_reverse(memtier, &memory_tiers, list) {
> > > 	nodes_and(allowed, node_state[N_CPU], memtier->nodelist);
> > > 	if (nodes_empty(allowed))
> > > 		nodes_or(promotion_mask, promotion_mask, allowed);
> > > 	else
> > > 		break;
> > > }
> > > 
> > > and then
> > > 
> > > static inline bool node_is_toptier(int node)
> > > {
> > > 
> > > 	return !node_isset(node, promotion_mask);
> > > }
> > > 
> > 
> > This should work.  But it appears unnatural.  So, I don't think we
> > should avoid to add more and more node masks to mitigate the design
> > decision that we cannot access memory tier information directly.  All
> > these becomes simple and natural, if we can access memory tier
> > information directly.
> > 
> 
> how do you derive whether node is toptier details if we have memtier 
> details in pgdat?

pgdat -> memory tier -> rank

Then we can compare this rank with the fast memory rank.  The fast
memory rank can be calculated dynamically at appropriate places.

Best Regards,
Huang, Ying
Aneesh Kumar K.V June 8, 2022, 2:37 p.m. UTC | #8
Ying Huang <ying.huang@intel.com> writes:

....

> > > 
>> > > is this good (not tested)?
>> > > /*
>> > >    * build the allowed promotion mask. Promotion is allowed
>> > >    * from higher memory tier to lower memory tier only if
>> > >    * lower memory tier doesn't include compute. We want to
>> > >    * skip promotion from a memory tier, if any node which is
>> > >    * part of that memory tier have CPUs. Once we detect such
>> > >    * a memory tier, we consider that tier as top tier from
>> > >    * which promotion is not allowed.
>> > >    */
>> > > list_for_each_entry_reverse(memtier, &memory_tiers, list) {
>> > > 	nodes_and(allowed, node_state[N_CPU], memtier->nodelist);
>> > > 	if (nodes_empty(allowed))
>> > > 		nodes_or(promotion_mask, promotion_mask, allowed);
>> > > 	else
>> > > 		break;
>> > > }
>> > > 
>> > > and then
>> > > 
>> > > static inline bool node_is_toptier(int node)
>> > > {
>> > > 
>> > > 	return !node_isset(node, promotion_mask);
>> > > }
>> > > 
>> > 
>> > This should work.  But it appears unnatural.  So, I don't think we
>> > should avoid to add more and more node masks to mitigate the design
>> > decision that we cannot access memory tier information directly.  All
>> > these becomes simple and natural, if we can access memory tier
>> > information directly.
>> > 
>> 
>> how do you derive whether node is toptier details if we have memtier 
>> details in pgdat?
>
> pgdat -> memory tier -> rank
>
> Then we can compare this rank with the fast memory rank.  The fast
> memory rank can be calculated dynamically at appropriate places.

This is what I am testing now. We still need to closely audit that lock
free access to the NODE_DATA()->memtier. For v6 I will keep this as a
separate patch and once we all agree that it is safe, I will fold it
back.

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index a388a806b61a..3e733de1a8a0 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -17,7 +17,6 @@
 #define MAX_MEMORY_TIERS  (MAX_STATIC_MEMORY_TIERS + 2)
 
 extern bool numa_demotion_enabled;
-extern nodemask_t promotion_mask;
 int node_create_and_set_memory_tier(int node, int tier);
 int next_demotion_node(int node);
 int node_set_memory_tier(int node, int tier);
@@ -25,15 +24,7 @@ int node_get_memory_tier_id(int node);
 int node_reset_memory_tier(int node, int tier);
 void node_remove_from_memory_tier(int node);
 void node_get_allowed_targets(int node, nodemask_t *targets);
-
-/*
- * By default all nodes are top tiper. As we create new memory tiers
- * we below top tiers we add them to NON_TOP_TIER state.
- */
-static inline bool node_is_toptier(int node)
-{
-	return !node_isset(node, promotion_mask);
-}
+bool node_is_toptier(int node);
 
 #else
 #define numa_demotion_enabled	false
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index aab70355d64f..c4fcfd2b9980 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -928,6 +928,9 @@ typedef struct pglist_data {
 	/* Per-node vmstats */
 	struct per_cpu_nodestat __percpu *per_cpu_nodestats;
 	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
+#ifdef CONFIG_TIERED_MEMORY
+	struct memory_tier *memtier;
+#endif
 } pg_data_t;
 
 #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 29a038bb38b0..31ef0fab5f19 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -7,6 +7,7 @@
 #include <linux/random.h>
 #include <linux/memory.h>
 #include <linux/idr.h>
+#include <linux/rcupdate.h>
 
 #include "internal.h"
 
@@ -26,7 +27,7 @@ struct demotion_nodes {
 static void establish_migration_targets(void);
 static DEFINE_MUTEX(memory_tier_lock);
 static LIST_HEAD(memory_tiers);
-nodemask_t promotion_mask;
+static int top_tier_rank;
 /*
  * node_demotion[] examples:
  *
@@ -135,7 +136,7 @@ static void memory_tier_device_release(struct device *dev)
 	if (tier->dev.id >= MAX_STATIC_MEMORY_TIERS)
 		ida_free(&memtier_dev_id, tier->dev.id);
 
-	kfree(tier);
+	kfree_rcu(tier);
 }
 
 /*
@@ -233,6 +234,70 @@ static struct memory_tier *__get_memory_tier_from_id(int id)
 	return NULL;
 }
 
+/*
+ * Called with memory_tier_lock. Hence the device references cannot
+ * be dropped during this function.
+ */
+static void memtier_node_clear(int node, struct memory_tier *memtier)
+{
+	pg_data_t *pgdat;
+
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return;
+
+	rcu_assign_pointer(pgdat->memtier, NULL);
+	/*
+	 * Make sure read side see the NULL value before we clear the node
+	 * from the nodelist.
+	 */
+	synchronize_rcu();
+	node_clear(node, memtier->nodelist);
+}
+
+static void memtier_node_set(int node, struct memory_tier *memtier)
+{
+	pg_data_t *pgdat;
+
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return;
+	/*
+	 * Make sure we mark the memtier NULL before we assign the new memory tier
+	 * to the NUMA node. This make sure that anybody looking at NODE_DATA
+	 * finds a NULL memtier or the one which is still valid.
+	 */
+	rcu_assign_pointer(pgdat->memtier, NULL);
+	synchronize_rcu();
+	node_set(node, memtier->nodelist);
+	rcu_assign_pointer(pgdat->memtier, memtier);
+}
+
+bool node_is_toptier(int node)
+{
+	bool toptier;
+	pg_data_t *pgdat;
+	struct memory_tier *memtier;
+
+	pgdat = NODE_DATA(node);
+	if (!pgdat)
+		return false;
+
+	rcu_read_lock();
+	memtier = rcu_dereference(pgdat->memtier);
+	if (!memtier) {
+		toptier = true;
+		goto out;
+	}
+	if (memtier->rank >= top_tier_rank)
+		toptier = true;
+	else
+		toptier = false;
+out:
+	rcu_read_unlock();
+	return toptier;
+}
+
 static int __node_create_and_set_memory_tier(int node, int tier)
 {
 	int ret = 0;
@@ -253,7 +318,7 @@ static int __node_create_and_set_memory_tier(int node, int tier)
 			goto out;
 		}
 	}
-	node_set(node, memtier->nodelist);
+	memtier_node_set(node, memtier);
 out:
 	return ret;
 }
@@ -275,12 +340,12 @@ int node_create_and_set_memory_tier(int node, int tier)
 	if (current_tier->dev.id == tier)
 		goto out;
 
-	node_clear(node, current_tier->nodelist);
+	memtier_node_clear(node, current_tier);
 
 	ret = __node_create_and_set_memory_tier(node, tier);
 	if (ret) {
 		/* reset it back to older tier */
-		node_set(node, current_tier->nodelist);
+		memtier_node_set(node, current_tier);
 		goto out;
 	}
 
@@ -305,7 +370,7 @@ static int __node_set_memory_tier(int node, int tier)
 		ret = -EINVAL;
 		goto out;
 	}
-	node_set(node, memtier->nodelist);
+	memtier_node_set(node, memtier);
 out:
 	return ret;
 }
@@ -374,12 +439,12 @@ int node_reset_memory_tier(int node, int tier)
 	if (current_tier->dev.id == tier)
 		goto out;
 
-	node_clear(node, current_tier->nodelist);
+	memtier_node_clear(node, current_tier);
 
 	ret = __node_set_memory_tier(node, tier);
 	if (ret) {
 		/* reset it back to older tier */
-		node_set(node, current_tier->nodelist);
+		memtier_node_set(node, current_tier);
 		goto out;
 	}
 
@@ -407,7 +472,7 @@ void node_remove_from_memory_tier(int node)
 	 * empty then unregister it to make it invisible
 	 * in sysfs.
 	 */
-	node_clear(node, memtier->nodelist);
+	memtier_node_clear(node, memtier);
 	if (nodes_empty(memtier->nodelist))
 		unregister_memory_tier(memtier);
 
@@ -570,15 +635,13 @@ static void establish_migration_targets(void)
 	 * a memory tier, we consider that tier as top tiper from
 	 * which promotion is not allowed.
 	 */
-	promotion_mask = NODE_MASK_NONE;
 	list_for_each_entry_reverse(memtier, &memory_tiers, list) {
 		nodes_and(allowed, node_states[N_CPU], memtier->nodelist);
-		if (nodes_empty(allowed))
-			nodes_or(promotion_mask, promotion_mask, memtier->nodelist);
-		else
+		if (!nodes_empty(allowed)) {
+			top_tier_rank = memtier->rank;
 			break;
+		}
 	}
-
 	pr_emerg("top tier rank is %d\n", top_tier_rank);
 	allowed = NODE_MASK_NONE;
 	/*
@@ -748,7 +811,7 @@ static const struct attribute_group *memory_tier_attr_groups[] = {
 
 static int __init memory_tier_init(void)
 {
-	int ret;
+	int ret, node;
 	struct memory_tier *memtier;
 
 	ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
@@ -766,7 +829,13 @@ static int __init memory_tier_init(void)
 		panic("%s() failed to register memory tier: %d\n", __func__, ret);
 
 	/* CPU only nodes are not part of memory tiers. */
-	memtier->nodelist = node_states[N_MEMORY];
+	for_each_node_state(node, N_MEMORY) {
+		/*
+		 * Should be safe to do this early in the boot.
+		 */
+		NODE_DATA(node)->memtier = memtier;
+		node_set(node, memtier->nodelist);
+	}
 	migrate_on_reclaim_init();
 
 	return 0;
Tim Chen June 8, 2022, 8:14 p.m. UTC | #9
On Wed, 2022-06-08 at 20:07 +0530, Aneesh Kumar K.V wrote:
> 
> 
> This is what I am testing now. We still need to closely audit that lock
> free access to the NODE_DATA()->memtier. 

You're refering to this or something else?  This is a write so seems okay.

> +	for_each_node_state(node, N_MEMORY) {
> +		/*
> +		 * Should be safe to do this early in the boot.
> +		 */
> +		NODE_DATA(node)->memtier = memtier;
> +		node_set(node, memtier->nodelist);
> +	}
>  	migrate_on_reclaim_init();


> For v6 I will keep this as a
> separate patch and once we all agree that it is safe, I will fold it
> back.

Please update code that uses __node_get_memory_tier(node) to use
NODE_DATA(node)->memtier;

Otherwise the code looks okay at a first glance.

Tim

> 
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> index a388a806b61a..3e733de1a8a0 100644
> --- a/include/linux/memory-tiers.h
> +++ b/include/linux/memory-tiers.h
> @@ -17,7 +17,6 @@
>  #define MAX_MEMORY_TIERS  (MAX_STATIC_MEMORY_TIERS + 2)
>  
>  extern bool numa_demotion_enabled;
> -extern nodemask_t promotion_mask;
>  int node_create_and_set_memory_tier(int node, int tier);
>  int next_demotion_node(int node);
>  int node_set_memory_tier(int node, int tier);
> @@ -25,15 +24,7 @@ int node_get_memory_tier_id(int node);
>  int node_reset_memory_tier(int node, int tier);
>  void node_remove_from_memory_tier(int node);
>  void node_get_allowed_targets(int node, nodemask_t *targets);
> -
> -/*
> - * By default all nodes are top tiper. As we create new memory tiers
> - * we below top tiers we add them to NON_TOP_TIER state.
> - */
> -static inline bool node_is_toptier(int node)
> -{
> -	return !node_isset(node, promotion_mask);
> -}
> +bool node_is_toptier(int node);
>  
>  #else
>  #define numa_demotion_enabled	false
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index aab70355d64f..c4fcfd2b9980 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -928,6 +928,9 @@ typedef struct pglist_data {
>  	/* Per-node vmstats */
>  	struct per_cpu_nodestat __percpu *per_cpu_nodestats;
>  	atomic_long_t		vm_stat[NR_VM_NODE_STAT_ITEMS];
> +#ifdef CONFIG_TIERED_MEMORY
> +	struct memory_tier *memtier;
> +#endif
>  } pg_data_t;
>  
>  #define node_present_pages(nid)	(NODE_DATA(nid)->node_present_pages)
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> index 29a038bb38b0..31ef0fab5f19 100644
> --- a/mm/memory-tiers.c
> +++ b/mm/memory-tiers.c
> @@ -7,6 +7,7 @@
>  #include <linux/random.h>
>  #include <linux/memory.h>
>  #include <linux/idr.h>
> +#include <linux/rcupdate.h>
>  
>  #include "internal.h"
>  
> @@ -26,7 +27,7 @@ struct demotion_nodes {
>  static void establish_migration_targets(void);
>  static DEFINE_MUTEX(memory_tier_lock);
>  static LIST_HEAD(memory_tiers);
> -nodemask_t promotion_mask;
> +static int top_tier_rank;
>  /*
>   * node_demotion[] examples:
>   *
> @@ -135,7 +136,7 @@ static void memory_tier_device_release(struct device *dev)
>  	if (tier->dev.id >= MAX_STATIC_MEMORY_TIERS)
>  		ida_free(&memtier_dev_id, tier->dev.id);
>  
> -	kfree(tier);
> +	kfree_rcu(tier);
>  }
>  
>  /*
> @@ -233,6 +234,70 @@ static struct memory_tier *__get_memory_tier_from_id(int id)
>  	return NULL;
>  }
>  
> +/*
> + * Called with memory_tier_lock. Hence the device references cannot
> + * be dropped during this function.
> + */
> +static void memtier_node_clear(int node, struct memory_tier *memtier)
> +{
> +	pg_data_t *pgdat;
> +
> +	pgdat = NODE_DATA(node);
> +	if (!pgdat)
> +		return;
> +
> +	rcu_assign_pointer(pgdat->memtier, NULL);
> +	/*
> +	 * Make sure read side see the NULL value before we clear the node
> +	 * from the nodelist.
> +	 */
> +	synchronize_rcu();
> +	node_clear(node, memtier->nodelist);
> +}
> +
> +static void memtier_node_set(int node, struct memory_tier *memtier)
> +{
> +	pg_data_t *pgdat;
> +
> +	pgdat = NODE_DATA(node);
> +	if (!pgdat)
> +		return;
> +	/*
> +	 * Make sure we mark the memtier NULL before we assign the new memory tier
> +	 * to the NUMA node. This make sure that anybody looking at NODE_DATA
> +	 * finds a NULL memtier or the one which is still valid.
> +	 */
> +	rcu_assign_pointer(pgdat->memtier, NULL);
> +	synchronize_rcu();
> +	node_set(node, memtier->nodelist);
> +	rcu_assign_pointer(pgdat->memtier, memtier);
> +}
> +
> +bool node_is_toptier(int node)
> +{
> +	bool toptier;
> +	pg_data_t *pgdat;
> +	struct memory_tier *memtier;
> +
> +	pgdat = NODE_DATA(node);
> +	if (!pgdat)
> +		return false;
> +
> +	rcu_read_lock();
> +	memtier = rcu_dereference(pgdat->memtier);
> +	if (!memtier) {
> +		toptier = true;
> +		goto out;
> +	}
> +	if (memtier->rank >= top_tier_rank)
> +		toptier = true;
> +	else
> +		toptier = false;
> +out:
> +	rcu_read_unlock();
> +	return toptier;
> +}
> +
>  static int __node_create_and_set_memory_tier(int node, int tier)
>  {
>  	int ret = 0;
> @@ -253,7 +318,7 @@ static int __node_create_and_set_memory_tier(int node, int tier)
>  			goto out;
>  		}
>  	}
> -	node_set(node, memtier->nodelist);
> +	memtier_node_set(node, memtier);
>  out:
>  	return ret;
>  }
> @@ -275,12 +340,12 @@ int node_create_and_set_memory_tier(int node, int tier)
>  	if (current_tier->dev.id == tier)
>  		goto out;
>  
> -	node_clear(node, current_tier->nodelist);
> +	memtier_node_clear(node, current_tier);
>  
>  	ret = __node_create_and_set_memory_tier(node, tier);
>  	if (ret) {
>  		/* reset it back to older tier */
> -		node_set(node, current_tier->nodelist);
> +		memtier_node_set(node, current_tier);
>  		goto out;
>  	}
>  
> @@ -305,7 +370,7 @@ static int __node_set_memory_tier(int node, int tier)
>  		ret = -EINVAL;
>  		goto out;
>  	}
> -	node_set(node, memtier->nodelist);
> +	memtier_node_set(node, memtier);
>  out:
>  	return ret;
>  }
> @@ -374,12 +439,12 @@ int node_reset_memory_tier(int node, int tier)
>  	if (current_tier->dev.id == tier)
>  		goto out;
>  
> -	node_clear(node, current_tier->nodelist);
> +	memtier_node_clear(node, current_tier);
>  
>  	ret = __node_set_memory_tier(node, tier);
>  	if (ret) {
>  		/* reset it back to older tier */
> -		node_set(node, current_tier->nodelist);
> +		memtier_node_set(node, current_tier);
>  		goto out;
>  	}
>  
> @@ -407,7 +472,7 @@ void node_remove_from_memory_tier(int node)
>  	 * empty then unregister it to make it invisible
>  	 * in sysfs.
>  	 */
> -	node_clear(node, memtier->nodelist);
> +	memtier_node_clear(node, memtier);
>  	if (nodes_empty(memtier->nodelist))
>  		unregister_memory_tier(memtier);
>  
> @@ -570,15 +635,13 @@ static void establish_migration_targets(void)
>  	 * a memory tier, we consider that tier as top tiper from
>  	 * which promotion is not allowed.
>  	 */
> -	promotion_mask = NODE_MASK_NONE;
>  	list_for_each_entry_reverse(memtier, &memory_tiers, list) {
>  		nodes_and(allowed, node_states[N_CPU], memtier->nodelist);
> -		if (nodes_empty(allowed))
> -			nodes_or(promotion_mask, promotion_mask, memtier->nodelist);
> -		else
> +		if (!nodes_empty(allowed)) {
> +			top_tier_rank = memtier->rank;
>  			break;
> +		}
>  	}
> -
>  	pr_emerg("top tier rank is %d\n", top_tier_rank);
>  	allowed = NODE_MASK_NONE;
>  	/*
> @@ -748,7 +811,7 @@ static const struct attribute_group *memory_tier_attr_groups[] = {
>  
>  static int __init memory_tier_init(void)
>  {
> -	int ret;
> +	int ret, node;
>  	struct memory_tier *memtier;
>  
>  	ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
> @@ -766,7 +829,13 @@ static int __init memory_tier_init(void)
>  		panic("%s() failed to register memory tier: %d\n", __func__, ret);
>  
>  	/* CPU only nodes are not part of memory tiers. */
> -	memtier->nodelist = node_states[N_MEMORY];
> +	for_each_node_state(node, N_MEMORY) {
> +		/*
> +		 * Should be safe to do this early in the boot.
> +		 */
> +		NODE_DATA(node)->memtier = memtier;
> +		node_set(node, memtier->nodelist);
> +	}
>  	migrate_on_reclaim_init();
>  
>  	return 0;
Huang, Ying June 10, 2022, 6:04 a.m. UTC | #10
On Wed, 2022-06-08 at 20:07 +0530, Aneesh Kumar K.V wrote:
> Ying Huang <ying.huang@intel.com> writes:
> 
> ....
> 
> > > > 
> > > > > is this good (not tested)?
> > > > > /*
> > > > >    * build the allowed promotion mask. Promotion is allowed
> > > > >    * from higher memory tier to lower memory tier only if
> > > > >    * lower memory tier doesn't include compute. We want to
> > > > >    * skip promotion from a memory tier, if any node which is
> > > > >    * part of that memory tier have CPUs. Once we detect such
> > > > >    * a memory tier, we consider that tier as top tier from
> > > > >    * which promotion is not allowed.
> > > > >    */
> > > > > list_for_each_entry_reverse(memtier, &memory_tiers, list) {
> > > > > 	nodes_and(allowed, node_state[N_CPU], memtier->nodelist);
> > > > > 	if (nodes_empty(allowed))
> > > > > 		nodes_or(promotion_mask, promotion_mask, allowed);
> > > > > 	else
> > > > > 		break;
> > > > > }
> > > > > 
> > > > > and then
> > > > > 
> > > > > static inline bool node_is_toptier(int node)
> > > > > {
> > > > > 
> > > > > 	return !node_isset(node, promotion_mask);
> > > > > }
> > > > > 
> > > > 
> > > > This should work.  But it appears unnatural.  So, I don't think we
> > > > should avoid to add more and more node masks to mitigate the design
> > > > decision that we cannot access memory tier information directly.  All
> > > > these becomes simple and natural, if we can access memory tier
> > > > information directly.
> > > > 
> > > 
> > > how do you derive whether node is toptier details if we have memtier 
> > > details in pgdat?
> > 
> > pgdat -> memory tier -> rank
> > 
> > Then we can compare this rank with the fast memory rank.  The fast
> > memory rank can be calculated dynamically at appropriate places.
> 
> This is what I am testing now. We still need to closely audit that lock
> free access to the NODE_DATA()->memtier. For v6 I will keep this as a
> separate patch and once we all agree that it is safe, I will fold it
> back.

Thanks for doing this.  We finally have a way to access memory_tier in
hot path.

[snip]

> +/*
> + * Called with memory_tier_lock. Hence the device references cannot
> + * be dropped during this function.
> + */
> +static void memtier_node_clear(int node, struct memory_tier *memtier)
> +{
> +	pg_data_t *pgdat;
> +
> +	pgdat = NODE_DATA(node);
> +	if (!pgdat)
> +		return;
> +
> +	rcu_assign_pointer(pgdat->memtier, NULL);
> +	/*
> +	 * Make sure read side see the NULL value before we clear the node
> +	 * from the nodelist.
> +	 */
> +	synchronize_rcu();
> +	node_clear(node, memtier->nodelist);
> +}
> +
> +static void memtier_node_set(int node, struct memory_tier *memtier)
> +{
> +	pg_data_t *pgdat;
> +
> +	pgdat = NODE_DATA(node);
> +	if (!pgdat)
> +		return;
> +	/*
> +	 * Make sure we mark the memtier NULL before we assign the new memory tier
> +	 * to the NUMA node. This make sure that anybody looking at NODE_DATA
> +	 * finds a NULL memtier or the one which is still valid.
> +	 */
> 
> +	rcu_assign_pointer(pgdat->memtier, NULL);
> +	synchronize_rcu();

Per my understanding, in your code, when we change pgdat->memtier, we
will call synchronize_rcu() twice.  IMHO, once should be OK.  That is,
something like below,

	rcu_assign_pointer(pgdat->memtier, NULL);
	node_clear(node, memtier->nodelist);
	synchronize_rcu();
	node_set(node, new_memtier->nodelist);
	rcu_assign_pointer(pgdat->memtier, new_memtier);

In this way, there will be 3 states,

1. prev

pgdat->memtier == old_memtier
node_isset(node, old_memtier->node_list)
!node_isset(node, new_memtier->node_list)

2. transitioning

pgdat->memtier == NULL
!node_isset(node, old_memtier->node_list)
!node_isset(node, new_memtier->node_list)

3. after

pgdat->memtier == new_memtier
!node_isset(node, old_memtier->node_list)
node_isset(node, new_memtier->node_list)

The real state may be one of 1, 2, 3, 1+2, or 2+3.  But it will not be
1+3.  I think that this satisfied our requirements.

[snip]

>  static int __node_create_and_set_memory_tier(int node, int tier)
>  {
>  	int ret = 0;
> @@ -253,7 +318,7 @@ static int __node_create_and_set_memory_tier(int node, int tier)
>  			goto out;
>  		}
>  	}
> -	node_set(node, memtier->nodelist);
> +	memtier_node_set(node, memtier);
>  out:
>  	return ret;
>  }
> @@ -275,12 +340,12 @@ int node_create_and_set_memory_tier(int node, int tier)
>  	if (current_tier->dev.id == tier)
>  		goto out;
> -	node_clear(node, current_tier->nodelist);
> +	memtier_node_clear(node, current_tier);+	node_set(node, memtier->nodelist);
> +	rcu_assign_pointer(pgdat->memtier, memtier);
> +}
> +
> +bool node_is_toptier(int node)
> +{
> +	bool toptier;
> +	pg_data_t *pgdat;
> +	struct memory_tier *memtier;
> +
> +	pgdat = NODE_DATA(node);
> +	if (!pgdat)
> +		return false;
> +
> +	rcu_read_lock();
> +	memtier = rcu_dereference(pgdat->memtier);
> +	if (!memtier) {
> +		toptier = true;
> +		goto out;
> +	}
> +	if (memtier->rank >= top_tier_rank)
> +		toptier = true;
> +	else
> +		toptier = false;
> +out:
> +	rcu_read_unlock();
> +	return toptier;
> +}
> +
> 
>   	ret = __node_create_and_set_memory_tier(node, tier);
> 
>  	if (ret) {
>  		/* reset it back to older tier */
> -		node_set(node, current_tier->nodelist);
> +		memtier_node_set(node, current_tier);
>  		goto out;
>  	}
>  
> 

[snip]

>  static int __init memory_tier_init(void)
>  {
> -	int ret;
> +	int ret, node;
>  	struct memory_tier *memtier;
>
>  	ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
> 
> @@ -766,7 +829,13 @@ static int __init memory_tier_init(void)
> 
>  		panic("%s() failed to register memory tier: %d\n", __func__, ret);
>
>  	/* CPU only nodes are not part of memory tiers. */
> 
> -	memtier->nodelist = node_states[N_MEMORY];
> +	for_each_node_state(node, N_MEMORY) {
> +		/*
> +		 * Should be safe to do this early in the boot.
> +		 */
> +		NODE_DATA(node)->memtier = memtier;

No required absoluately.  But IMHO it's more consistent to use
rcu_assign_pointer() here.

> +		node_set(node, memtier->nodelist);
> +	}
>  	migrate_on_reclaim_init();
>  
> > 	return 0;

Best Regareds,
Huang, Ying
diff mbox series

Patch

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
index cd6e71f702ad..32e0e6fabf02 100644
--- a/include/linux/memory-tiers.h
+++ b/include/linux/memory-tiers.h
@@ -16,12 +16,23 @@ 
 #define MAX_MEMORY_TIERS  3
 
 extern bool numa_demotion_enabled;
+extern nodemask_t demotion_target_mask;
 int next_demotion_node(int node);
 void node_remove_from_memory_tier(int node);
 int node_get_memory_tier_id(int node);
 int node_set_memory_tier(int node, int tier);
 int node_reset_memory_tier(int node, int tier);
 void node_get_allowed_targets(int node, nodemask_t *targets);
+
+/*
+ * By default all nodes are top tiper. As we create new memory tiers
+ * we below top tiers we add them to NON_TOP_TIER state.
+ */
+static inline bool node_is_toptier(int node)
+{
+	return !node_isset(node, demotion_target_mask);
+}
+
 #else
 #define numa_demotion_enabled	false
 static inline int next_demotion_node(int node)
@@ -33,6 +44,11 @@  static inline void node_get_allowed_targets(int node, nodemask_t *targets)
 {
 	*targets = NODE_MASK_NONE;
 }
+
+static inline bool node_is_toptier(int node)
+{
+	return true;
+}
 #endif	/* CONFIG_TIERED_MEMORY */
 
 #endif
diff --git a/include/linux/node.h b/include/linux/node.h
index 40d641a8bfb0..9ec680dd607f 100644
--- a/include/linux/node.h
+++ b/include/linux/node.h
@@ -185,9 +185,4 @@  static inline void register_hugetlbfs_with_node(node_registration_func_t reg,
 
 #define to_node(device) container_of(device, struct node, dev)
 
-static inline bool node_is_toptier(int node)
-{
-	return node_state(node, N_CPU);
-}
-
 #endif /* _LINUX_NODE_H_ */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a77c78a2b6b5..294873d4be2b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -35,6 +35,7 @@ 
 #include <linux/numa.h>
 #include <linux/page_owner.h>
 #include <linux/sched/sysctl.h>
+#include <linux/memory-tiers.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index 592d939ec28d..df8e9910165a 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -31,6 +31,7 @@  static struct bus_type memory_tier_subsys = {
 static void establish_migration_targets(void);
 static DEFINE_MUTEX(memory_tier_lock);
 static LIST_HEAD(memory_tiers);
+nodemask_t demotion_target_mask;
 
 /*
  * node_demotion[] examples:
@@ -541,6 +542,15 @@  static void establish_migration_targets(void)
 	 */
 	list_for_each_entry(memtier, &memory_tiers, list)
 		nodes_or(allowed, allowed, memtier->nodelist);
+	/*
+	 * Add nodes to demotion target mask so that we can find
+	 * top tier easily.
+	 */
+	memtier = list_first_entry(&memory_tiers, struct memory_tier, list);
+	if (memtier)
+		nodes_andnot(demotion_target_mask, allowed, memtier->nodelist);
+	else
+		demotion_target_mask = NODE_MASK_NONE;
 	/*
 	 * Removes nodes not yet in N_MEMORY.
 	 */
diff --git a/mm/migrate.c b/mm/migrate.c
index 0b554625a219..78615c48fc0f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -50,6 +50,7 @@ 
 #include <linux/memory.h>
 #include <linux/random.h>
 #include <linux/sched/sysctl.h>
+#include <linux/memory-tiers.h>
 
 #include <asm/tlbflush.h>
 
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ba5592655ee3..92a2fc0fa88b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -31,6 +31,7 @@ 
 #include <linux/pgtable.h>
 #include <linux/sched/sysctl.h>
 #include <linux/userfaultfd_k.h>
+#include <linux/memory-tiers.h>
 #include <asm/cacheflush.h>
 #include <asm/mmu_context.h>
 #include <asm/tlbflush.h>