diff mbox series

[RFC,02/14] mm/hms: heterogeneous memory system (HMS) documentation

Message ID 20181203233509.20671-3-jglisse@redhat.com (mailing list archive)
State New, archived
Headers show
Series Heterogeneous Memory System (HMS) and hbind() | expand

Commit Message

Jerome Glisse Dec. 3, 2018, 11:34 p.m. UTC
From: Jérôme Glisse <jglisse@redhat.com>

Add documentation describing what HMS is and what it is for (see patch content).

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Haggai Eran <haggaie@mellanox.com>
Cc: Balbir Singh <balbirs@au1.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Philip Yang <Philip.Yang@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Paul Blinzer <Paul.Blinzer@amd.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Vivek Kini <vkini@nvidia.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
---
 Documentation/vm/hms.rst | 275 ++++++++++++++++++++++++++++++++++-----
 1 file changed, 246 insertions(+), 29 deletions(-)

Comments

Andi Kleen Dec. 4, 2018, 5:06 p.m. UTC | #1
jglisse@redhat.com writes:

> +
> +To help with forward compatibility each object as a version value and
> +it is mandatory for user space to only use target or initiator with
> +version supported by the user space. For instance if user space only
> +knows about what version 1 means and sees a target with version 2 then
> +the user space must ignore that target as if it does not exist.

So once v2 is introduced all applications that only support v1 break.

That seems very un-Linux and will break Linus' "do not break existing
applications" rule.

The standard approach when you add something incompatible is to
add a new field, but keep the old ones.

> +2) hbind() bind range of virtual address to heterogeneous memory
> +================================================================
> +
> +So instead of using a bitmap, hbind() take an array of uid and each uid
> +is a unique memory target inside the new memory topology description.

You didn't define what an uid is?

user id?

Please use sensible terminology that doesn't conflict with existing
usages.

I assume it's some kind of number that identifies a node in your
graph. 

> +User space also provide an array of modifiers. Modifier can be seen as
> +the flags parameter of mbind() but here we use an array so that user
> +space can not only supply a modifier but also value with it. This should
> +allow the API to grow more features in the future. Kernel should return
> +-EINVAL if it is provided with an unkown modifier and just ignore the
> +call all together, forcing the user space to restrict itself to modifier
> +supported by the kernel it is running on (i know i am dreaming about well
> +behave user space).

It sounds like you're trying to define a system call with built in
ioctl? Is that really a good idea?

If you need ioctl you know where to find it.

Please don't over design APIs like this.

> +3) Tracking and applying heterogeneous memory policies
> +======================================================
> +
> +Current memory policy infrastructure is node oriented, instead of
> +changing that and risking breakage and regression HMS adds a new
> +heterogeneous policy tracking infra-structure. The expectation is
> +that existing application can keep using mbind() and all existing
> +infrastructure under-disturb and unaffected, while new application
> +will use the new API and should avoid mix and matching both (as they
> +can achieve the same thing with the new API).

I think we need a stronger motivation to define a completely
parallel and somewhat redundant infrastructure. What breakage
are you worried about?

The obvious alternative would of course be to add some extra
enumeration to the existing nodes.

It's a strange document. It goes from very high level to low level
with nothing in between. I think you need a lot more details
in the middle, in particular how these new interfaces
should be used. For example how should an application
know how to look for a specific type of device?
How is an automated tool supposed to use the enumeration?
etc.

-Andi
Jerome Glisse Dec. 4, 2018, 6:24 p.m. UTC | #2
On Tue, Dec 04, 2018 at 09:06:59AM -0800, Andi Kleen wrote:
> jglisse@redhat.com writes:
> 
> > +
> > +To help with forward compatibility each object as a version value and
> > +it is mandatory for user space to only use target or initiator with
> > +version supported by the user space. For instance if user space only
> > +knows about what version 1 means and sees a target with version 2 then
> > +the user space must ignore that target as if it does not exist.
> 
> So once v2 is introduced all applications that only support v1 break.
> 
> That seems very un-Linux and will break Linus' "do not break existing
> applications" rule.
> 
> The standard approach that if you add something incompatible is to
> add new field, but keep the old ones.

No, that's not how it is supposed to work. Say it is 2018 and you
have v1 memory (your regular main DDR memory, for instance); it will
always be exposed as v1 memory.

Fast forward to 2020: you have a new type of memory that is not cache
coherent and you want to expose it to userspace through HMS. You write
a kernel patch that introduces the v2 type for targets and defines a
set of new sysfs files to describe what v2 is. On this new computer you
report your usual main memory as v1 and your new memory as v2.

So an application that only knew about v1 will keep using any v1 memory
on your new platform but will not touch the new v2 memory, which is
exactly what you want to happen. You do not break existing applications
while still being able to add new types of memory.


Sorry if that was unclear. I will reformulate the documentation and add
an example along these lines; a sketch of how userspace could honor the
rule is below.
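
For illustration, a minimal userspace sketch of that rule could look like
the code below. The sysfs location and the "v<version>-target<uid>"
directory naming are assumptions made up for this example, not the exact
layout defined by the patch series:

    /* Sketch: enumerate HMS targets, skipping any whose version is newer
     * than what this program understands.  Path and naming scheme are
     * hypothetical. */
    #include <dirent.h>
    #include <stdio.h>

    #define KNOWN_TARGET_VERSION 1

    int main(void)
    {
        const char *path = "/sys/bus/hms/devices";   /* assumed location */
        struct dirent *ent;
        DIR *dir = opendir(path);

        if (!dir)
            return 1;

        while ((ent = readdir(dir)) != NULL) {
            unsigned int version, uid;

            /* assumed naming convention for this sketch */
            if (sscanf(ent->d_name, "v%u-target%u", &version, &uid) != 2)
                continue;
            if (version > KNOWN_TARGET_VERSION)
                continue;   /* unknown version: act as if it does not exist */
            printf("usable target uid %u (version %u)\n", uid, version);
        }
        closedir(dir);
        return 0;
    }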


> > +2) hbind() bind range of virtual address to heterogeneous memory
> > +================================================================
> > +
> > +So instead of using a bitmap, hbind() take an array of uid and each uid
> > +is a unique memory target inside the new memory topology description.
> 
> You didn't define what an uid is?
> 
> user id ?
> 
> Please use sensible terminology that doesn't conflict with existing
> usages.
> 
> I assume it's some kind of number that identifies a node in your
> graph. 

Correct, a uid is a unique id given to each node in the graph. I will
clarify that.


> > +User space also provide an array of modifiers. Modifier can be seen as
> > +the flags parameter of mbind() but here we use an array so that user
> > +space can not only supply a modifier but also value with it. This should
> > +allow the API to grow more features in the future. Kernel should return
> > +-EINVAL if it is provided with an unkown modifier and just ignore the
> > +call all together, forcing the user space to restrict itself to modifier
> > +supported by the kernel it is running on (i know i am dreaming about well
> > +behave user space).
> 
> It sounds like you're trying to define a system call with built in
> ioctl? Is that really a good idea?
> 
> If you need ioctl you know where to find it.

Well, I would like to get things running in the wild with some guinea pig
users to get feedback from end users. It would be easier if I could do this
with an upstream kernel rather than some random branch in my private repo,
and while doing that I would like to avoid committing to a syscall upstream.
The way I see around this is a driver under staging with an ioctl, which
would be turned into a syscall once some confidence in the API has been
gained.

If you think I should do a syscall right away, I am not against doing that.

> 
> Please don't over design APIs like this.

There are two approaches here. I can define two syscalls, one for migration
and one for policy. Migration and policy are two different things from every
existing user's point of view, and by defining two syscalls I can cut each
one down to do one thing only and make it as simple and lean as possible.

In the present version I took the other approach of defining just one
API that can grow to do more things. I know the Unix way is one simple
tool for one simple job; I can switch to one simple call per action (see
the sketch below).
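
Purely as an illustration of that "one call per action" option, the two
hypothetical prototypes below show how migration and policy could be
split; neither the names nor the signatures are something this patch
series defines:

    /* Hypothetical prototypes, for discussion only. */
    #include <stddef.h>
    #include <stdint.h>

    /* Migrate [addr, addr + len) to the first listed target that has
     * room, falling back to the next one, and so on. */
    int hms_migrate(void *addr, size_t len,
                    const uint32_t *targets, unsigned int ntargets);

    /* Set a placement policy on [addr, addr + len): future allocations
     * in the range should come from the listed targets, in order. */
    int hms_policy(void *addr, size_t len,
                   const uint32_t *targets, unsigned int ntargets);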


> > +3) Tracking and applying heterogeneous memory policies
> > +======================================================
> > +
> > +Current memory policy infrastructure is node oriented, instead of
> > +changing that and risking breakage and regression HMS adds a new
> > +heterogeneous policy tracking infra-structure. The expectation is
> > +that existing application can keep using mbind() and all existing
> > +infrastructure under-disturb and unaffected, while new application
> > +will use the new API and should avoid mix and matching both (as they
> > +can achieve the same thing with the new API).
> 
> I think we need a stronger motivation to define a completely
> parallel and somewhat redundant infrastructure. What breakage
> are you worried about?

Some memory exposed through HMS is not allocated by the regular memory
allocator. For instance, GPU memory is managed by the GPU driver, so when
you want to use GPU memory (either as a policy or by migrating to it)
you need the GPU allocator to allocate that memory. HMS adds a set of
callbacks to the target structure so that a device driver can expose a
generic allocation API to the core kernel.

I could change existing code paths to use the target structure as an
intermediary for allocation, but that means changing hot code paths and
I doubt it would be welcome today. Eventually I think we will want that
to happen, and we can work on minimizing the cost for users that do not
use things like GPUs.

The transition will take time (a couple of years) and I would like
to avoid disturbing existing workloads while we migrate GPU users to
this new API.


> The obvious alternative would of course be to add some extra
> enumeration to the existing nodes.

We cannot extend NUMA nodes to expose GPU memory. GPU memory on
current AMD and Intel platforms is not cache coherent and thus
should not be used for random memory allocations. It should really
stay something users have to explicitly select. Note that the
usage we have here is that when you use GPU memory it is as if
the range of virtual addresses were swapped out from the CPU's point
of view, but the GPU can access it.

> It's a strange document. It goes from very high level to low level
> with nothing inbetween. I think you need a lot more details
> in the middle, in particularly how these new interfaces
> should be used. For example how should an application
> know how to look for a specific type of device?
> How is an automated tool supposed to use the enumeration?
> etc.

Today users rely on dedicated APIs (OpenCL, ROCm, CUDA, ...), and those
high level APIs all have, in one form or another, the API I present here.
So I want to move this high level API, which is actively used by programs
today, into the kernel. The end game is to create common infrastructure
for various accelerator hardware (GPU, FPGA, ...) to manage memory.

This is something asked for by end users for one simple reason. Today
users have to mix and match multiple APIs in their application, and
when they want to exchange data between one device that uses one API
and another device that uses another API they have to do an explicit
copy and rebuild their data structure inside the new memory. When you
move over things like trees or any complex data structure you have to
rebuild it, i.e. redo the pointer links between the nodes of your
data structure.

This is highly error prone, complex and wasteful (you have to burn
CPU cycles to do it). If instead you can use the same address space
as all the other memory allocations in your program and move data
around from one device to another with a common API that works on
all the various devices, you eliminate that complex step and make
the end user's life much easier.

So I am doing this to help existing users by addressing an issue
that is becoming harder and harder to solve in userspace. My end
game is to blur the boundary between the CPU and devices like GPUs,
FPGAs, ...


Thank you for taking the time to read this proposal and for your
feedback. Much appreciated. I will try to incorporate your comments in
my v2.

Cheers,
Jérôme
Dan Williams Dec. 4, 2018, 6:31 p.m. UTC | #3
On Tue, Dec 4, 2018 at 10:24 AM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Tue, Dec 04, 2018 at 09:06:59AM -0800, Andi Kleen wrote:
> > jglisse@redhat.com writes:
> >
> > > +
> > > +To help with forward compatibility each object as a version value and
> > > +it is mandatory for user space to only use target or initiator with
> > > +version supported by the user space. For instance if user space only
> > > +knows about what version 1 means and sees a target with version 2 then
> > > +the user space must ignore that target as if it does not exist.
> >
> > So once v2 is introduced all applications that only support v1 break.
> >
> > That seems very un-Linux and will break Linus' "do not break existing
> > applications" rule.
> >
> > The standard approach that if you add something incompatible is to
> > add new field, but keep the old ones.
>
> No that's not how it is suppose to work. So let says it is 2018 and you
> have v1 memory (like your regular main DDR memory for instance) then it
> will always be expose a v1 memory.
>
> Fast forward 2020 and you have this new type of memory that is not cache
> coherent and you want to expose this to userspace through HMS. What you
> do is a kernel patch that introduce the v2 type for target and define a
> set of new sysfs file to describe what v2 is. On this new computer you
> report your usual main memory as v1 and your new memory as v2.
>
> So the application that only knew about v1 will keep using any v1 memory
> on your new platform but it will not use any of the new memory v2 which
> is what you want to happen. You do not have to break existing application
> while allowing to add new type of memory.

That sounds needlessly restrictive. Let the kernel arbitrate what
memory an application gets, don't design a system where applications
are hard coded to a memory type. Applications can hint, or optionally
specify an override and the kernel can react accordingly.
Jerome Glisse Dec. 4, 2018, 6:57 p.m. UTC | #4
On Tue, Dec 04, 2018 at 10:31:17AM -0800, Dan Williams wrote:
> On Tue, Dec 4, 2018 at 10:24 AM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Tue, Dec 04, 2018 at 09:06:59AM -0800, Andi Kleen wrote:
> > > jglisse@redhat.com writes:
> > >
> > > > +
> > > > +To help with forward compatibility each object as a version value and
> > > > +it is mandatory for user space to only use target or initiator with
> > > > +version supported by the user space. For instance if user space only
> > > > +knows about what version 1 means and sees a target with version 2 then
> > > > +the user space must ignore that target as if it does not exist.
> > >
> > > So once v2 is introduced all applications that only support v1 break.
> > >
> > > That seems very un-Linux and will break Linus' "do not break existing
> > > applications" rule.
> > >
> > > The standard approach that if you add something incompatible is to
> > > add new field, but keep the old ones.
> >
> > No that's not how it is suppose to work. So let says it is 2018 and you
> > have v1 memory (like your regular main DDR memory for instance) then it
> > will always be expose a v1 memory.
> >
> > Fast forward 2020 and you have this new type of memory that is not cache
> > coherent and you want to expose this to userspace through HMS. What you
> > do is a kernel patch that introduce the v2 type for target and define a
> > set of new sysfs file to describe what v2 is. On this new computer you
> > report your usual main memory as v1 and your new memory as v2.
> >
> > So the application that only knew about v1 will keep using any v1 memory
> > on your new platform but it will not use any of the new memory v2 which
> > is what you want to happen. You do not have to break existing application
> > while allowing to add new type of memory.
> 
> That sounds needlessly restrictive. Let the kernel arbitrate what
> memory an application gets, don't design a system where applications
> are hard coded to a memory type. Applications can hint, or optionally
> specify an override and the kernel can react accordingly.

You do not want to randomly use non-cache-coherent memory inside your
application :) That is not going to go well with C++ or atomics :) Yes,
there are legitimate use cases where an application decides to give up
cache coherency temporarily for a range of virtual addresses, but the
application needs to understand what it is doing and opt in knowing full
well the consequences. The version scheme allows for scenarios like that.
You do not have to define a new version with every new type of memory: if
your new memory has all the properties of v1, then you expose it as v1 and
old applications on the new platform will use your new memory type, none
the wiser.

The version scheme is really there to keep users from using something they
do not want to use without understanding the consequences of doing so.

Cheers,
Jérôme
Logan Gunthorpe Dec. 4, 2018, 7:11 p.m. UTC | #5
On 2018-12-04 11:57 a.m., Jerome Glisse wrote:
>> That sounds needlessly restrictive. Let the kernel arbitrate what
>> memory an application gets, don't design a system where applications
>> are hard coded to a memory type. Applications can hint, or optionally
>> specify an override and the kernel can react accordingly.
> 
> You do not want to randomly use non cache coherent memory inside your
> application :) This is not gonna go well with C++ or atomic :) Yes they
> are legitimate use case where application can decide to give up cache
> coherency temporarily for a range of virtual address. But the application
> needs to understand what it is doing and opt in to do that knowing full
> well that. The version thing allows for scenario like. You do not have
> to define a new version with every new type of memory. If your new memory
> has all the properties of v1 than you expose it as v1 and old application
> on the new platform will use your new memory type being non the wiser.

I agree with Dan and the general idea that this version thing is really
ugly. Define some standard attributes so the application can say "I want
cache-coherent, high bandwidth memory". If there's some future
new-memory attribute, then the application needs to know about it to
request it.

Also, in the same vein, I think it's wrong to have the API enumerate all
the different memory available in the system. The API should simply
allow userspace to say it wants memory that can be accessed by a set of
initiators with a certain set of attributes and the bind call tries to
fulfill that or fallback on system memory/hmm migration/whatever.

Logan
Dan Williams Dec. 4, 2018, 7:19 p.m. UTC | #6
On Tue, Dec 4, 2018 at 10:58 AM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Tue, Dec 04, 2018 at 10:31:17AM -0800, Dan Williams wrote:
> > On Tue, Dec 4, 2018 at 10:24 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > >
> > > On Tue, Dec 04, 2018 at 09:06:59AM -0800, Andi Kleen wrote:
> > > > jglisse@redhat.com writes:
> > > >
> > > > > +
> > > > > +To help with forward compatibility each object as a version value and
> > > > > +it is mandatory for user space to only use target or initiator with
> > > > > +version supported by the user space. For instance if user space only
> > > > > +knows about what version 1 means and sees a target with version 2 then
> > > > > +the user space must ignore that target as if it does not exist.
> > > >
> > > > So once v2 is introduced all applications that only support v1 break.
> > > >
> > > > That seems very un-Linux and will break Linus' "do not break existing
> > > > applications" rule.
> > > >
> > > > The standard approach that if you add something incompatible is to
> > > > add new field, but keep the old ones.
> > >
> > > No that's not how it is suppose to work. So let says it is 2018 and you
> > > have v1 memory (like your regular main DDR memory for instance) then it
> > > will always be expose a v1 memory.
> > >
> > > Fast forward 2020 and you have this new type of memory that is not cache
> > > coherent and you want to expose this to userspace through HMS. What you
> > > do is a kernel patch that introduce the v2 type for target and define a
> > > set of new sysfs file to describe what v2 is. On this new computer you
> > > report your usual main memory as v1 and your new memory as v2.
> > >
> > > So the application that only knew about v1 will keep using any v1 memory
> > > on your new platform but it will not use any of the new memory v2 which
> > > is what you want to happen. You do not have to break existing application
> > > while allowing to add new type of memory.
> >
> > That sounds needlessly restrictive. Let the kernel arbitrate what
> > memory an application gets, don't design a system where applications
> > are hard coded to a memory type. Applications can hint, or optionally
> > specify an override and the kernel can react accordingly.
>
> You do not want to randomly use non cache coherent memory inside your
> application :)

The kernel arbitrates memory, it's a bug if it hands out something
that exotic to an unaware application.
Jerome Glisse Dec. 4, 2018, 7:22 p.m. UTC | #7
On Tue, Dec 04, 2018 at 12:11:42PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-04 11:57 a.m., Jerome Glisse wrote:
> >> That sounds needlessly restrictive. Let the kernel arbitrate what
> >> memory an application gets, don't design a system where applications
> >> are hard coded to a memory type. Applications can hint, or optionally
> >> specify an override and the kernel can react accordingly.
> > 
> > You do not want to randomly use non cache coherent memory inside your
> > application :) This is not gonna go well with C++ or atomic :) Yes they
> > are legitimate use case where application can decide to give up cache
> > coherency temporarily for a range of virtual address. But the application
> > needs to understand what it is doing and opt in to do that knowing full
> > well that. The version thing allows for scenario like. You do not have
> > to define a new version with every new type of memory. If your new memory
> > has all the properties of v1 than you expose it as v1 and old application
> > on the new platform will use your new memory type being non the wiser.
> 
> I agree with Dan and the general idea that this version thing is really
> ugly. Define some standard attributes so the application can say "I want
> cache-coherent, high bandwidth memory". If there's some future
> new-memory attribute, then the application needs to know about it to
> request it.

So version is a bad prefix; what about type, prefixing each target with a
type id, so that applications that are looking for a certain type of
memory (which has a set of defined properties) can select it? Having
a type file inside the directory and hoping applications will read
that sysfs file is a recipe for failure from my point of view, while
having it in the directory name makes sure that the application
has some idea of what it is doing.

> 
> Also, in the same vein, I think it's wrong to have the API enumerate all
> the different memory available in the system. The API should simply
> allow userspace to say it wants memory that can be accessed by a set of
> initiators with a certain set of attributes and the bind call tries to
> fulfill that or fallback on system memory/hmm migration/whatever.

We have existing applications that use topology today to partition their
workload and do load balancing. Those applications leverage the fact that
they only run on a small set of known platforms with known topology. Here
I want to provide a common API so that topology can be queried in a
standard way by applications.

Yes, basic applications will not leverage all this information and will
be happy enough with "give me memory that will be fast for initiators A
and B". That can easily be implemented inside a userspace library which
dumbs down the topology on behalf of the application.

I believe that a proposal for new infrastructure should allow for maximum
expressiveness. The HMS API in this proposal can express any kind
of directed graph, hence I do not see any limitation going forward. At
the same time a userspace library can easily dumb this down for the
average Joe/Jane application.

Cheers,
Jérôme
Jerome Glisse Dec. 4, 2018, 7:32 p.m. UTC | #8
On Tue, Dec 04, 2018 at 11:19:23AM -0800, Dan Williams wrote:
> On Tue, Dec 4, 2018 at 10:58 AM Jerome Glisse <jglisse@redhat.com> wrote:
> >
> > On Tue, Dec 04, 2018 at 10:31:17AM -0800, Dan Williams wrote:
> > > On Tue, Dec 4, 2018 at 10:24 AM Jerome Glisse <jglisse@redhat.com> wrote:
> > > >
> > > > On Tue, Dec 04, 2018 at 09:06:59AM -0800, Andi Kleen wrote:
> > > > > jglisse@redhat.com writes:
> > > > >
> > > > > > +
> > > > > > +To help with forward compatibility each object as a version value and
> > > > > > +it is mandatory for user space to only use target or initiator with
> > > > > > +version supported by the user space. For instance if user space only
> > > > > > +knows about what version 1 means and sees a target with version 2 then
> > > > > > +the user space must ignore that target as if it does not exist.
> > > > >
> > > > > So once v2 is introduced all applications that only support v1 break.
> > > > >
> > > > > That seems very un-Linux and will break Linus' "do not break existing
> > > > > applications" rule.
> > > > >
> > > > > The standard approach that if you add something incompatible is to
> > > > > add new field, but keep the old ones.
> > > >
> > > > No that's not how it is suppose to work. So let says it is 2018 and you
> > > > have v1 memory (like your regular main DDR memory for instance) then it
> > > > will always be expose a v1 memory.
> > > >
> > > > Fast forward 2020 and you have this new type of memory that is not cache
> > > > coherent and you want to expose this to userspace through HMS. What you
> > > > do is a kernel patch that introduce the v2 type for target and define a
> > > > set of new sysfs file to describe what v2 is. On this new computer you
> > > > report your usual main memory as v1 and your new memory as v2.
> > > >
> > > > So the application that only knew about v1 will keep using any v1 memory
> > > > on your new platform but it will not use any of the new memory v2 which
> > > > is what you want to happen. You do not have to break existing application
> > > > while allowing to add new type of memory.
> > >
> > > That sounds needlessly restrictive. Let the kernel arbitrate what
> > > memory an application gets, don't design a system where applications
> > > are hard coded to a memory type. Applications can hint, or optionally
> > > specify an override and the kernel can react accordingly.
> >
> > You do not want to randomly use non cache coherent memory inside your
> > application :)
> 
> The kernel arbitrates memory, it's a bug if it hands out something
> that exotic to an unaware application.

In some cases, and for some period of time, some applications would like
to use exotic memory for performance reasons. This does exist today:
graphics APIs routinely expose uncached memory to applications and have
been doing so for many years.

Some compute folks would like some of the benefits of that too. The idea
is that you malloc() some memory in your application and do stuff on the
CPU, business as usual; then you are going to use that memory on some
exotic device, and for that device it would be best if you migrated that
memory to uncached/non-coherent memory. If the application knows it is
safe to do so then it can decide to pick such memory with HMS and migrate
its malloced data there.

This does not only happen in applications; it can happen inside a
library that the application uses, and the application might be totally
unaware of the library doing so. This is very common today in AI/ML
workloads where the various libraries in your AI/ML stack do things
to the memory you hand them. It is all part of the library API contract.

So there are legitimate use cases for this, hence why I would like to
be able to expose exotic memory to userspace so that it can migrate
regular allocations there when that makes sense.

Cheers,
Jérôme
Logan Gunthorpe Dec. 4, 2018, 7:41 p.m. UTC | #9
On 2018-12-04 12:22 p.m., Jerome Glisse wrote:
> So version is a bad prefix, what about type, prefixing target with a
> type id. So that application that are looking for a certain type of
> memory (which has a set of define properties) can select them. Having
> a type file inside the directory and hopping application will read
> that sysfs file is a recipies for failure from my point of view. While
> having it in the directory name is making sure that the application
> has some idea of what it is doing.

Well I don't think it can be a prefix. It has to be a mask. It might be
things like cache coherency, persistence, bandwidth and none of those
things are mutually exclusive.

>> Also, in the same vein, I think it's wrong to have the API enumerate all
>> the different memory available in the system. The API should simply
>> allow userspace to say it wants memory that can be accessed by a set of
>> initiators with a certain set of attributes and the bind call tries to
>> fulfill that or fallback on system memory/hmm migration/whatever.
> 
> We have existing application that use topology today to partition their
> workload and do load balancing. Those application leverage the fact that
> they are only running on a small set of known platform with known topology
> here i want to provide a common API so that topology can be queried in a
> standard by application.

Existing applications are not a valid excuse for poor API design.
Remember, once this API is introduced and has real users, it has to be
maintained *forever*, so we need to get it right. Providing users with
more information than they need makes it exponentially harder to get
right and support.

Logan
Andi Kleen Dec. 4, 2018, 8:12 p.m. UTC | #10
On Tue, Dec 04, 2018 at 01:24:22PM -0500, Jerome Glisse wrote:
> Fast forward 2020 and you have this new type of memory that is not cache
> coherent and you want to expose this to userspace through HMS. What you
> do is a kernel patch that introduce the v2 type for target and define a
> set of new sysfs file to describe what v2 is. On this new computer you
> report your usual main memory as v1 and your new memory as v2.
> 
> So the application that only knew about v1 will keep using any v1 memory
> on your new platform but it will not use any of the new memory v2 which
> is what you want to happen. You do not have to break existing application
> while allowing to add new type of memory.

That seems entirely like the wrong model. We don't want to rewrite every
application to add a new memory type.

Rather there needs to be an abstract way to query memory with a specific
behavior: e.g. cache coherent, size >= xGB, fastest or lowest latency, or similar.

Sure there can be a name somewhere, but it should only be used
for identification purposes, not to hard code in applications.

Really you need to define some use cases and describe how your API
handles them.

> > 
> > It sounds like you're trying to define a system call with built in
> > ioctl? Is that really a good idea?
> > 
> > If you need ioctl you know where to find it.
> 
> Well i would like to get thing running in the wild with some guinea pig
> user to get feedback from end user. It would be easier if i can do this
> with upstream kernel and not some random branch in my private repo. While
> doing that i would like to avoid commiting to a syscall upstream. So the
> way i see around this is doing a driver under staging with an ioctl which
> will be turn into a syscall once some confidence into the API is gain.

Ok that's fine I guess.

But it should be a clearly defined ioctl, not an ioctl with redefinable
parameters (but perhaps I misunderstood your description).

> In the present version i took the other approach of defining just one
> API that can grow to do more thing. I know the unix way is one simple
> tool for one simple job. I can switch to the simple call for one action.

Simple calls are better.

> > > +Current memory policy infrastructure is node oriented, instead of
> > > +changing that and risking breakage and regression HMS adds a new
> > > +heterogeneous policy tracking infra-structure. The expectation is
> > > +that existing application can keep using mbind() and all existing
> > > +infrastructure under-disturb and unaffected, while new application
> > > +will use the new API and should avoid mix and matching both (as they
> > > +can achieve the same thing with the new API).
> > 
> > I think we need a stronger motivation to define a completely
> > parallel and somewhat redundant infrastructure. What breakage
> > are you worried about?
> 
> Some memory expose through HMS is not allocated by regular memory
> allocator. For instance GPU memory is manage by GPU driver, so when
> you want to use GPU memory (either as a policy or by migrating to it)
> you need to use the GPU allocator to allocate that memory. HMS adds
> a bunch of callback to target structure so that device driver can
> expose a generic API to core kernel to do such allocation.

We already have nodes without memory.
We can also take nodes out of the normal fallback lists.
We also have nodes with special memory (e.g. DMA32).

Nothing you describe here cannot be handled with the existing nodes.

> > The obvious alternative would of course be to add some extra
> > enumeration to the existing nodes.
> 
> We can not extend NUMA node to expose GPU memory. GPU memory on
> current AMD and Intel platform is not cache coherent and thus
> should not be use for random memory allocation. It should really

Sure, you don't expose it as normal memory, but it can still be
tied to a node. In fact you have to for the existing topology
interface to work.

> copy and rebuild their data structure inside the new memory. When
> you move over thing like tree or any complex data structure you have
> to rebuilt it ie redo the pointers link between the nodes of your
> data structure.
> 
> This is highly error prone complex and wasteful (you have to burn
> CPU cycles to do that). Now if you can use the same address space
> as all the other memory allocation in your program and move data
> around from one device to another with a common API that works on
> all the various devices, you are eliminating that complex step and
> making the end user life much easier.
> 
> So i am doing this to help existing users by addressing an issues
> that is becoming harder and harder to solve for userspace. My end
> game is to blur the boundary between CPU and device like GPU, FPGA,

This is just high level rationale. You already had that ...

What I was looking for is how applications actually use the 
API.

e.g. 

1. Compute application is looking for fast cache coherent memory 
for CPU usage.

What does it query and how does it decide and how does it allocate?

2. Allocator in OpenCL application is looking for memory to share
with OpenCL. How does it find memory?

3. Storage application is looking for larger but slower memory
for CPU usage. 

4. ...

Please work out some use cases like this.

-Andi
Jerome Glisse Dec. 4, 2018, 8:13 p.m. UTC | #11
On Tue, Dec 04, 2018 at 12:41:39PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-04 12:22 p.m., Jerome Glisse wrote:
> > So version is a bad prefix, what about type, prefixing target with a
> > type id. So that application that are looking for a certain type of
> > memory (which has a set of define properties) can select them. Having
> > a type file inside the directory and hopping application will read
> > that sysfs file is a recipies for failure from my point of view. While
> > having it in the directory name is making sure that the application
> > has some idea of what it is doing.
> 
> Well I don't think it can be a prefix. It has to be a mask. It might be
> things like cache coherency, persistence, bandwidth and none of those
> things are mutually exclusive.

You are right, many of them are not exclusive. It is just my feeling that
a mask in a file inside the target directory might be overlooked by
applications, which might then start using things they should not. At the
same time, I guess if I write the userspace library that abstracts this
kernel API then I can make applications select things properly.

I will use a mask in v2.

> 
> >> Also, in the same vein, I think it's wrong to have the API enumerate all
> >> the different memory available in the system. The API should simply
> >> allow userspace to say it wants memory that can be accessed by a set of
> >> initiators with a certain set of attributes and the bind call tries to
> >> fulfill that or fallback on system memory/hmm migration/whatever.
> > 
> > We have existing application that use topology today to partition their
> > workload and do load balancing. Those application leverage the fact that
> > they are only running on a small set of known platform with known topology
> > here i want to provide a common API so that topology can be queried in a
> > standard by application.
> 
> Existing applications are not a valid excuse for poor API design.
> Remember, once this API is introduced and has real users, it has to be
> maintained *forever*, so we need to get it right. Providing users with
> more information than they need makes it exponentially harder to get
> right and support.

I am not disagreeing about the pain of maintaining an API forever, but
the fact remains that there are existing users, and without a standard way
of exposing this it is impossible to say whether we will see more users
for that information or whether only the existing users will leverage it.

I do not think there is a way to answer that question. I am siding with
the view that this API can be dumbed down in userspace by a common
library. So let us expose the topology and let userspace dumb it down.

If we dumb it down in the kernel I see a few pitfalls:
    - the kernel dumbing it down badly
    - the kernel dumbing-down code growing out of control with
      per-platform gotchas
    - it is still harder to fix the kernel than userspace in commercial
      distributions (the whole RHEL business of slow-moving, long-
      supported kernels), so there being able to fix things in
      userspace sounds pretty enticing

Cheers,
Jérôme
Andi Kleen Dec. 4, 2018, 8:14 p.m. UTC | #12
> Also, in the same vein, I think it's wrong to have the API enumerate all
> the different memory available in the system. The API should simply

We need an enumeration API too, just to display to the user what they
have, and possibly for applications to size their buffers 
(all we do with existing NUMA nodes)

But yes the default usage should be to query for necessary attributes

-Andi
Logan Gunthorpe Dec. 4, 2018, 8:30 p.m. UTC | #13
On 2018-12-04 1:13 p.m., Jerome Glisse wrote:
> You are right many are non exclusive. It is just my feeling that having
> a mask as a file inside the target directory might be overlook by the
> application which might start using things it should not. At same time
> i guess if i write the userspace library that abstract this kernel API
> then i can enforce application to properly select thing.

I think this is just evidence that this is not a good API. If the user
has the option to just ignore things or do it wrong that's a problem
with the API. Using a prefix for the name doesn't change that fact.

> I do not think there is a way to answer that question. I am siding on the
> side of this API can be dumb down in userspace by a common library. So let
> expose the topology and let userspace dumb it down.

I fundamentally disagree with this approach to designing APIs. Saying
"we'll give you the kitchen sink, add another layer to deal with the
complexity" is actually just eschewing API design and makes it harder
for kernel folks to know what userspace actually requires because they
are multiple layers away.

> If we dumb it down in the kernel i see few pitfalls:
>     - kernel dumbing it down badly
>     - kernel dumbing down code can grow out of control with gotcha
>       for platform

This is just a matter of designing the APIs well. Don't do it badly.

>     - it is still harder to fix kernel than userspace in commercial
>       user space (the whole RHEL business of slow moving and long
>       supported kernel). So on those being able to fix thing in
>       userspace sounds pretty enticing

I hear this argument a lot and it's not compelling to me. I don't think
we should make decisions in upstream code to allow RHEL to bypass the
kernel simply because it would be easier for them to distribute code
changes.

Logan
Jerome Glisse Dec. 4, 2018, 8:41 p.m. UTC | #14
On Tue, Dec 04, 2018 at 12:12:26PM -0800, Andi Kleen wrote:
> On Tue, Dec 04, 2018 at 01:24:22PM -0500, Jerome Glisse wrote:
> > Fast forward 2020 and you have this new type of memory that is not cache
> > coherent and you want to expose this to userspace through HMS. What you
> > do is a kernel patch that introduce the v2 type for target and define a
> > set of new sysfs file to describe what v2 is. On this new computer you
> > report your usual main memory as v1 and your new memory as v2.
> > 
> > So the application that only knew about v1 will keep using any v1 memory
> > on your new platform but it will not use any of the new memory v2 which
> > is what you want to happen. You do not have to break existing application
> > while allowing to add new type of memory.
> 
> That seems entirely like the wrong model. We don't want to rewrite every
> application for adding a new memory type.
> 
> Rather there needs to be an abstract way to query memory of specific
> behavior: e.g. cache coherent, size >= xGB, fastest or lowest latency or similar
> 
> Sure there can be a name somewhere, but it should only be used
> for identification purposes, not to hard code in applications.

Discussion with Logan convinced me to use a mask for properties like
(see the sketch after these lists):
    - cache coherent
    - persistent
    ...

Then files for other properties like:
    - bandwidth (bytes/s)
    - latency
    - granularity (size of individual accesses or bus width)
    ...
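
As a sketch of how that split could look once a helper library has parsed
the sysfs files, something like the structure below could be handed to
applications; every name and bit value here is an assumption made up for
illustration, not a definition from this series:

    #include <stdint.h>

    /* boolean properties carried in the mask */
    #define HMS_TARGET_CACHE_COHERENT  (1ULL << 0)
    #define HMS_TARGET_PERSISTENT      (1ULL << 1)

    struct hms_target_info {
        uint32_t uid;          /* unique id of the target node */
        uint64_t properties;   /* mask of HMS_TARGET_* bits */
        uint64_t bandwidth;    /* bytes per second */
        uint64_t latency;      /* access latency, e.g. nanoseconds */
        uint64_t granularity;  /* size of individual accesses / bus width */
        uint64_t size;         /* total target memory in bytes */
    };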

> 
> Really you need to define some use cases and describe how your API
> handles them.

I have given examples of how applications look today and how they
transform with HMS in my email exchange with Dave Hansen. I will
add them to the documentation and to the cover letter in my next
posting.

> > > 
> > > It sounds like you're trying to define a system call with built in
> > > ioctl? Is that really a good idea?
> > > 
> > > If you need ioctl you know where to find it.
> > 
> > Well i would like to get thing running in the wild with some guinea pig
> > user to get feedback from end user. It would be easier if i can do this
> > with upstream kernel and not some random branch in my private repo. While
> > doing that i would like to avoid commiting to a syscall upstream. So the
> > way i see around this is doing a driver under staging with an ioctl which
> > will be turn into a syscall once some confidence into the API is gain.
> 
> Ok that's fine I guess.
> 
> But should be a clearly defined ioctl, not an ioctl with redefinable parameters
> (but perhaps I misunderstood your description)
>
> > In the present version i took the other approach of defining just one
> > API that can grow to do more thing. I know the unix way is one simple
> > tool for one simple job. I can switch to the simple call for one action.
> 
> Simple calls are better.

I will switch to one simple call for each individual action (policy and
migration).

> > > > +Current memory policy infrastructure is node oriented, instead of
> > > > +changing that and risking breakage and regression HMS adds a new
> > > > +heterogeneous policy tracking infra-structure. The expectation is
> > > > +that existing application can keep using mbind() and all existing
> > > > +infrastructure under-disturb and unaffected, while new application
> > > > +will use the new API and should avoid mix and matching both (as they
> > > > +can achieve the same thing with the new API).
> > > 
> > > I think we need a stronger motivation to define a completely
> > > parallel and somewhat redundant infrastructure. What breakage
> > > are you worried about?
> > 
> > Some memory expose through HMS is not allocated by regular memory
> > allocator. For instance GPU memory is manage by GPU driver, so when
> > you want to use GPU memory (either as a policy or by migrating to it)
> > you need to use the GPU allocator to allocate that memory. HMS adds
> > a bunch of callback to target structure so that device driver can
> > expose a generic API to core kernel to do such allocation.
> 
> We already have nodes without memory.
> We can also take out nodes out of the normal fall back lists.
> We also have nodes with special memory (e.g. DMA32)
> 
> Nothing you describe here cannot be handled with the existing nodes.

There have been patchsets in the past to exclude nodes from allocation;
last time I checked they were all rejected and people felt it was not a
good thing to do.

Also, IIRC, adding more nodes might be problematic as I think we do not
have many bits left inside the flags field of struct page. Right now
I do not believe in exposing device memory as generic nodes inside the
Linux kernel because for many folks that would just be a waste: people
only doing desktop work and not using their GPU for compute will never
get good usage out of it. Graphics memory allocation is wildly different
from compute allocation, which is more like CPU allocation.

So converting graphics drivers to register their memory as nodes does
not seem like a good idea at this time. I doubt the GPU folks upstream
would accept that (with my GPU hat on, I would not).


> > > The obvious alternative would of course be to add some extra
> > > enumeration to the existing nodes.
> > 
> > We can not extend NUMA node to expose GPU memory. GPU memory on
> > current AMD and Intel platform is not cache coherent and thus
> > should not be use for random memory allocation. It should really
> 
> Sure you don't expose it as normal memory, but it can be still
> tied to a node. In fact you have to for the existing topology
> interface to work.

The existing topology interface is not used today for that memory
and people in the GPU world do not see it as an interface that can be
used. See the discussion above about GPU memory. This is the raison
d'être of this proposal: a new way to expose heterogeneous memory
to userspace.


> > copy and rebuild their data structure inside the new memory. When
> > you move over thing like tree or any complex data structure you have
> > to rebuilt it ie redo the pointers link between the nodes of your
> > data structure.
> > 
> > This is highly error prone complex and wasteful (you have to burn
> > CPU cycles to do that). Now if you can use the same address space
> > as all the other memory allocation in your program and move data
> > around from one device to another with a common API that works on
> > all the various devices, you are eliminating that complex step and
> > making the end user life much easier.
> > 
> > So i am doing this to help existing users by addressing an issues
> > that is becoming harder and harder to solve for userspace. My end
> > game is to blur the boundary between CPU and device like GPU, FPGA,
> 
> This is just high level rationale. You already had that ...
> 
> What I was looking for is how applications actually use the 
> API.
> 
> e.g. 
> 
> 1. Compute application is looking for fast cache coherent memory 
> for CPU usage.
> 
> What does it query and how does it decide and how does it allocate?

The application has an OpenCL context; from the context it gets the
device's initiator unique id; from the initiator unique id it looks at
all the links and bridges the initiator is connected to. That gives it
a list of links, which it orders by bandwidth first and latency second
(i.e. two links with the same bandwidth are ordered with the lowest
latency first). It goes over that list from best to worst and for each
link it looks at which targets are also connected to that link. From
that it builds an ordered list of targets, keeping only cache-coherent
memory in the list.

It then uses this ordered list of targets to set a policy or migrate
its buffer to the best memory. The kernel will first try to use the
first target; if it runs out of that memory it will use the next
target, and so on.
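
A sketch of that ordering step is below, written against a hypothetical
helper library that has already parsed the sysfs topology into an array;
the struct layout and helper names are assumptions for illustration only:

    #include <stdlib.h>

    struct target {
        unsigned int uid;             /* unique id of the target */
        int cache_coherent;           /* boolean property */
        unsigned long long bandwidth; /* bytes/s, higher is better */
        unsigned long long latency;   /* lower is better */
    };

    /* Order by bandwidth first (descending), then latency (ascending). */
    static int cmp_targets(const void *a, const void *b)
    {
        const struct target *ta = a, *tb = b;

        if (ta->bandwidth != tb->bandwidth)
            return ta->bandwidth > tb->bandwidth ? -1 : 1;
        if (ta->latency != tb->latency)
            return ta->latency < tb->latency ? -1 : 1;
        return 0;
    }

    /* Keep only cache-coherent targets and sort them best to worst.
     * Returns how many usable targets remain at the front of the array. */
    static unsigned int build_target_list(struct target *targets,
                                          unsigned int n)
    {
        unsigned int i, kept = 0;

        for (i = 0; i < n; i++) {
            if (targets[i].cache_coherent)
                targets[kept++] = targets[i];
        }
        qsort(targets, kept, sizeof(*targets), cmp_targets);
        return kept;
    }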


This can all be done inside a common userspace helper library for
ease of use. More advanced applications will do finer allocation;
for instance they will partition their dataset by access
frequency. The most accessed data in the application will use
the fastest memory (which is likely to be somewhat small, i.e. a
few gigabytes), while data that is more sparsely accessed
will be pushed to slower memory (of which there is more).


> 2. Allocator in OpenCL application is looking for memory to share
> with OpenCL. How does it find memory?

Same process as above: start from the initiator id, build the list of
links, then build the list of all targets that initiator can access.
Then order that list according to the property of interest to the
application (bandwidth, latency, ...). Once it has the target
list it can use either policy or migration: policy if it is
for a new allocation, migration if it is to move an existing
buffer to memory that is more appropriate for the OpenCL device
in use.


> 3. Storage application is looking for larger but slower memory
> for CPU usage.

The application builds a list of initiators corresponding to the CPUs
it is using (and is bound to). From that list of initiators it builds
a list of links (considering bridges too). From the list of links
it builds a list of targets (connected to those links).

Then it orders the list of targets by size (not by latency or
bandwidth). Once it has an ordered list of targets it uses
either the policy or migrate API on the range of virtual
addresses it wants to affect.


> 
> 4. ...
> 
> Please work out some use cases like this.

Note that all the list building in userspace above is intended
to be done by a helper library, as this is really boilerplate
code. The last patch in my series has userspace helpers to
parse the sysfs; I will grow that into a mini library with examples
to showcase it.

More examples from other parts of this email thread:

A high level overview of how one application looks today:

    1) Application gets some dataset from some source (disk, network,
       sensors, ...)
    2) Application allocates memory on device A and copies over the dataset
    3) Application runs some CPU code to format the copy of the dataset
       inside device A memory (rebuilding pointers inside the dataset;
       this can represent millions and millions of operations)
    4) Application runs code on device A that uses the dataset
    5) Application allocates memory on device B and copies over the result
       from device A
    6) Application runs some CPU code to format the copy of the dataset
       inside device B (rebuilding pointers inside the dataset;
       this can represent millions and millions of operations)
    7) Application runs code on device B that uses the dataset
    8) Application copies the result back from device B and keeps doing its
       thing

How it looks with HMS:
    1) Application gets some dataset from some source (disk, network,
       sensors, ...)
    2-3) Application calls HMS to migrate to device A memory
    4) Application runs code on device A that uses the dataset
    5-6) Application calls HMS to migrate to device B memory
    7) Application runs code on device B that uses the dataset
    8) Application calls HMS to migrate the result to main memory

So we now avoid the explicit copies and having to rebuild the data
structures inside each device's address space.
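
A rough sketch of that HMS flow, using the hypothetical hms_migrate()
prototype sketched earlier in this thread (device target ids and the
run_on_device_*() helpers are placeholders standing in for the
application's own code):

    #include <stddef.h>
    #include <stdint.h>

    /* placeholders for the application's own device kernels */
    void run_on_device_a(void *data, size_t len);
    void run_on_device_b(void *data, size_t len);
    /* hypothetical migration call, see the sketch earlier in the thread */
    int hms_migrate(void *addr, size_t len,
                    const uint32_t *targets, unsigned int ntargets);

    void process(void *dataset, size_t len, uint32_t target_a,
                 uint32_t target_b, uint32_t target_main)
    {
        /* 2-3) migrate the dataset to device A memory; pointers stay valid */
        hms_migrate(dataset, len, &target_a, 1);
        /* 4) run code on device A that uses the dataset */
        run_on_device_a(dataset, len);

        /* 5-6) migrate to device B memory, no explicit copy or rebuild */
        hms_migrate(dataset, len, &target_b, 1);
        /* 7) run code on device B that uses the dataset */
        run_on_device_b(dataset, len);

        /* 8) bring the result back to main memory for the CPU */
        hms_migrate(dataset, len, &target_main, 1);
    }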


The example above is for migration. Here is an example of how topology
is used today:

    The application knows that the platform it is running on has 16
    GPUs split into 2 groups of 8 GPUs each. GPUs in each group can
    access each other's memory over dedicated mesh links between
    them, at full speed with no traffic bottleneck.

    The application splits its GPU computation in 2 so that each
    partition runs on a group of interconnected GPUs, allowing
    them to share the dataset.

With HMS:
    The application can query the kernel to discover the topology of
    the system it is running on and use it to partition and balance
    its workload accordingly. The same application should then be able
    to run on a new platform without having to be adapted to it.

Cheers,
Jérôme
Logan Gunthorpe Dec. 4, 2018, 8:47 p.m. UTC | #15
On 2018-12-04 1:14 p.m., Andi Kleen wrote:
>> Also, in the same vein, I think it's wrong to have the API enumerate all
>> the different memory available in the system. The API should simply

> We need an enumeration API too, just to display to the user what they
> have, and possibly for applications to size their buffers 
> (all we do with existing NUMA nodes)

Yes, but I think my main concern is the conflation of the enumeration
API and the binding API. An application doesn't want to walk through all
the possible memory and types in the system just to get some memory that
will work with a couple initiators (which it somehow has to map to
actual resources, like fds). We also don't want userspace to police
itself on which memory works with which initiator.

Enumeration is definitely not the common use case. And if we create a
new enumeration API now, it may make it difficult or impossible to unify
these types of memory with the existing NUMA node hierarchies if/when
this gets more integrated with the mm core.

Logan
Jerome Glisse Dec. 4, 2018, 8:59 p.m. UTC | #16
On Tue, Dec 04, 2018 at 01:30:01PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-04 1:13 p.m., Jerome Glisse wrote:
> > You are right many are non exclusive. It is just my feeling that having
> > a mask as a file inside the target directory might be overlook by the
> > application which might start using things it should not. At same time
> > i guess if i write the userspace library that abstract this kernel API
> > then i can enforce application to properly select thing.
> 
> I think this is just evidence that this is not a good API. If the user
> has the option to just ignore things or do it wrong that's a problem
> with the API. Using a prefix for the name doesn't change that fact.

How do we expose harmful memory to userspace then? How can I expose
non-cache-coherent memory? Because yes, there are applications out there
that use it today and would like to be able to migrate to and from
that memory dynamically during the lifetime of the application as the
data set progresses through the application's processing pipeline.

These are kinds of memory that violate the memory model you expect from
the architecture. This memory is still useful nonetheless and it has
other enticing properties (like bandwidth or latency). The whole point of
my proposal is to expose this memory in a generic way so that
applications that today rely on a gazillion device specific APIs can
move over to a common kernel API and consolidate their memory management
on top of a common kernel layer.

The dilemma I am facing is exposing this memory while keeping unaware
applications from accidentally using it just because it is there, without
understanding the implications that come with it.

If you have any idea how to expose this to userspace in a common
API I would happily take any suggestion :) My idea is this patchset,
and I agree there are many things to improve; I have already taken
many of the suggestions given so far.


> 
> > I do not think there is a way to answer that question. I am siding on the
> > side of this API can be dumb down in userspace by a common library. So let
> > expose the topology and let userspace dumb it down.
> 
> I fundamentally disagree with this approach to designing APIs. Saying
> "we'll give you the kitchen sink, add another layer to deal with the
> complexity" is actually just eschewing API design and makes it harder
> for kernel folks to know what userspace actually requires because they
> are multiple layers away.

Note that I do not expose things like physical addresses, nor do I
split the memory in a node into individual devices; in fact I expose
less information than the existing NUMA nodes do (no zone, phys index,
...), as I do not think those have any value to userspace. What matters
to userspace is where this memory sits in the topology, so it can look
at all the initiator nodes that are close by, or the reverse: given a
set of initiators, what is the set of targets closest to all of them.

I feel this is simple enough for anyone to understand. It allows any
topology to be described; a libhms can dumb it down for the average
application and more advanced applications can use the full
description. There are examples of such applications today. I argue
that if we provide a common API we might see more applications, but I
won't pretend I know that for a fact. I am just making an assumption
here.


> 
> > If we dumb it down in the kernel i see few pitfalls:
> >     - kernel dumbing it down badly
> >     - kernel dumbing down code can grow out of control with gotcha
> >       for platform
> 
> This is just a matter of designing the APIs well. Don't do it badly.

I am talking about the inevitable fact that at some point some system
firmware will misrepresent its platform. System firmware writers
usually copy and paste things with little regard for what has changed
from one platform to the next. So there will inevitably be workarounds,
and I would rather see those piling up inside a userspace library than
inside the kernel.

Note that I expect such errors won't be fatal but will be more along
the lines of reporting wrong values for bandwidth, latency, ... So the
kernel will most likely be unaffected by system firmware errors, but
they will affect the performance of applications that are given
inaccurate information.


> >     - it is still harder to fix kernel than userspace in commercial
> >       user space (the whole RHEL business of slow moving and long
> >       supported kernel). So on those being able to fix thing in
> >       userspace sounds pretty enticing
> 
> I hear this argument a lot and it's not compelling to me. I don't think
> we should make decisions in upstream code to allow RHEL to bypass the
> kernel simply because it would be easier for them to distribute code
> changes.

Ok, I will not bring it up; I have suffered enough on that front so I
have some trauma about this ;)

Cheers,
Jérôme
Jerome Glisse Dec. 4, 2018, 9:15 p.m. UTC | #17
On Tue, Dec 04, 2018 at 01:47:17PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-04 1:14 p.m., Andi Kleen wrote:
> >> Also, in the same vein, I think it's wrong to have the API enumerate all
> >> the different memory available in the system. The API should simply
> 
> > We need an enumeration API too, just to display to the user what they
> > have, and possibly for applications to size their buffers 
> > (all we do with existing NUMA nodes)
> 
> Yes, but I think my main concern is the conflation of the enumeration
> API and the binding API. An application doesn't want to walk through all
> the possible memory and types in the system just to get some memory that
> will work with a couple initiators (which it somehow has to map to
> actual resources, like fds). We also don't want userspace to police
> itself on which memory works with which initiator.

How would an application police itself? The API I am proposing is best
effort, and as such the kernel can fully ignore a userspace request, as
it already does sometimes with mbind(). So the kernel always has the
last call and can always override the application's decision.

A device driver can also decide to override; anything on the kernel
side really has more power than userspace does. So while we give trust
to userspace we do not abdicate control. That is not the intention
here.
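
For reference, this is the kind of best-effort behaviour mbind()
already has today; a minimal userspace sketch (the node number and
buffer size are only examples):

    #include <numaif.h>         /* mbind(), link with -lnuma */
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
            size_t len = 4UL << 20;
            unsigned long nodemask = 1UL << 1;  /* prefer node 1 */
            void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (buf == MAP_FAILED)
                    return 1;

            /*
             * MPOL_PREFERRED is advisory: the kernel tries node 1
             * first but is free to fall back to any other node.
             */
            if (mbind(buf, len, MPOL_PREFERRED, &nodemask,
                      sizeof(nodemask) * 8, 0))
                    perror("mbind");

            return 0;
    }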


> Enumeration is definitely not the common use case. And if we create a
> new enumeration API now, it may make it difficult or impossible to unify
> these types of memory with the existing NUMA node hierarchies if/when
> this gets more integrated with the mm core.

The point I am trying to make is that this memory can not be integrated
as regular NUMA nodes inside the mm core, but rather that the mm core
can grow to encompass non-NUMA-node memory. I explained why in another
part of this thread, but roughly:

- Device drivers need to be in control of device memory allocation
  for backward compatibility reasons and to keep fulfilling
  constraints of the graphics APIs (OpenGL, Vulkan, X, ...).

- Adding a new node type is problematic inside mm as we are running
  out of bits in struct page.

- Excluding nodes from the regular allocation path was rejected by
  upstream previously (IBM posted a patchset for that IIRC).

I feel it is a safer path to avoid a one-model-fits-all approach here
and to accept that device memory will be represented and managed
differently from other memory. I believe the persistent memory folks
feel the same on that front.

Nonetheless I do want to expose this device memory in a standard way so
that we can consolidate and improve the user experience on that front.
Eventually I hope that more of the device memory management can be
turned into common device memory management inside core mm, but I do
not want to enforce that at first as it is likely to fail (building a
moonbase before you have a moon rocket). I would rather grow
organically from a high level API that will get used right away (it is
a matter of converting existing users to it: s/computeAPIBind/HMSBind).

Cheers,
Jérôme
Logan Gunthorpe Dec. 4, 2018, 9:19 p.m. UTC | #18
On 2018-12-04 1:59 p.m., Jerome Glisse wrote:
> How to expose harmful memory to userspace then ? How can i expose
> non cache coherent memory because yes they are application out there
> that use that today and would like to be able to migrate to and from
> that memory dynamicly during lifetime of the application as the data
> set progress through the application processing pipeline.

I'm not arguing against the purpose or use cases. I'm being critical of
the API choices.

> Note that i do not expose things like physical address or even splits
> memory in a node into individual device, in fact in expose less
> information that the existing NUMA (no zone, phys index, ...). As i do
> not think those have any value to userspace. What matter to userspace
> is where is this memory is in my topology so i can look at all the
> initiators node that are close by. Or the reverse, i have a set of
> initiators what is the set of closest targets to all those initiators.

No, what matters to applications is getting memory that will work for
the initiators/resources they need it to work for. The specific topology
might be of interest to administrators but it is not what applications
need. And it should be relatively easy to flesh out the existing sysfs
device tree to provide the topology information administrators need.

> I am talking about the inevitable fact that at some point some system
> firmware will miss-represent their platform. System firmware writer
> usualy copy and paste thing with little regards to what have change
> from one platform to the new. So their will be inevitable workaround
> and i would rather see those piling up inside a userspace library than
> inside the kernel.

It's *absolutely* the kernel's responsibility to patch issues caused by
broken firmware. We have quirks all over the place for this. That's
never something userspace should be responsible for. Really, this is the
raison d'etre of the kernel: to provide userspace with a uniform
execution environment -- if every application had to deal with broken
firmware it would be a nightmare.

Logan
Jerome Glisse Dec. 4, 2018, 9:51 p.m. UTC | #19
On Tue, Dec 04, 2018 at 02:19:09PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-04 1:59 p.m., Jerome Glisse wrote:
> > How to expose harmful memory to userspace then ? How can i expose
> > non cache coherent memory because yes they are application out there
> > that use that today and would like to be able to migrate to and from
> > that memory dynamicly during lifetime of the application as the data
> > set progress through the application processing pipeline.
> 
> I'm not arguing against the purpose or use cases. I'm being critical of
> the API choices.
> 
> > Note that i do not expose things like physical address or even splits
> > memory in a node into individual device, in fact in expose less
> > information that the existing NUMA (no zone, phys index, ...). As i do
> > not think those have any value to userspace. What matter to userspace
> > is where is this memory is in my topology so i can look at all the
> > initiators node that are close by. Or the reverse, i have a set of
> > initiators what is the set of closest targets to all those initiators.
> 
> No, what matters to applications is getting memory that will work for
> the initiators/resources they need it to work for. The specific topology
> might be of interest to administrators but it is not what applications
> need. And it should be relatively easy to flesh out the existing sysfs
> device tree to provide the topology information administrators need.

Existing users would disagree: in my cover letter I have given pointers
to an existing library and papers from HPC folks who do leverage system
topology (among the few who do). So there are applications _today_ that
use topology information to adapt their workload and maximize
performance on the platform they run on.

There are also some new platforms with a much more complex topology
that definitely can not be represented as a tree like today's sysfs
(I believe that even some of the HPC folks have topologies _today_
that are not tree-like).

So existing users plus arbitrary graph topologies becoming more common
led me to the choices I made in this API. I believe a graph is
something that can easily be understood by people. I am not inventing
some weird new data structure, it is just a graph, and for the naming I
have used the ACPI convention, but I am more than open to using memory
for target and differentiating CPU and device instead of using
initiator as a name. I do not have strong feelings on that. I do,
however, want to be able to represent any topology and to be able to
use device memory that is not managed by core mm, for the reasons I
explained previously.

Note that if it turns out to be a bad idea the kernel can decide to
dumb things down in a future version for new platforms. It could give a
flat graph to userspace; there is nothing precluding that.


> > I am talking about the inevitable fact that at some point some system
> > firmware will miss-represent their platform. System firmware writer
> > usualy copy and paste thing with little regards to what have change
> > from one platform to the new. So their will be inevitable workaround
> > and i would rather see those piling up inside a userspace library than
> > inside the kernel.
> 
> It's *absolutely* the kernel's responsibility to patch issues caused by
> broken firmware. We have quirks all over the place for this. That's
> never something userspace should be responsible for. Really, this is the
> raison d'etre of the kernel: to provide userspace with a uniform
> execution environment -- if every application had to deal with broken
> firmware it would be a nightmare.

You cut the other paragraph, which explained why they are unlikely to
be broken badly enough to break the kernel.

Anyway, we can fix up the topology in the kernel too ... that is fine
with me.

Cheers,
Jérôme
Logan Gunthorpe Dec. 4, 2018, 10:16 p.m. UTC | #20
On 2018-12-04 2:51 p.m., Jerome Glisse wrote:
> Existing user would disagree in my cover letter i have given pointer
> to existing library and paper from HPC folks that do leverage system
> topology (among the few who are). So they are application _today_ that
> do use topology information to adapt their workload to maximize the
> performance for the platform they run on.

Well, we need to give them what they actually need, not what they want
to shoot themselves in the foot with. And I imagine much of what they
actually do right now belongs firmly in the kernel. Like I said,
existing applications are not justifications for bad API design or
layering violations.

You've even mentioned we'd need a simplified "libhms" interface for
applications. We should really just figure out what that needs to be and
make that the kernel interface.

> They are also some new platform that have much more complex topology
> that definitly can not be represented as a tree like today sysfs we
> have (i believe that even some of the HPC folks have _today_ topology
> that are not tree-like).

The sysfs tree already allows for a complex graph that describes
existing hardware very well. If there is hardware it cannot describe
then we should work to improve it and not just carve off a whole new
area for a special API. -- In fact, you are already using sysfs, just
under your own virtual/non-existent bus.

> Note that if it turn out to be a bad idea kernel can decide to dumb
> down thing in future version for new platform. So it could give a
> flat graph to userspace, there is nothing precluding that.

Uh... if it turns out to be a bad idea we are screwed because we have an
API existing applications are using. It's much easier to add features to
a simple (your word: "dumb") interface than it is to take options away
from one that is too broad.

> 
>>> I am talking about the inevitable fact that at some point some system
>>> firmware will miss-represent their platform. System firmware writer
>>> usualy copy and paste thing with little regards to what have change
>>> from one platform to the new. So their will be inevitable workaround
>>> and i would rather see those piling up inside a userspace library than
>>> inside the kernel.
>>
>> It's *absolutely* the kernel's responsibility to patch issues caused by
>> broken firmware. We have quirks all over the place for this. That's
>> never something userspace should be responsible for. Really, this is the
>> raison d'etre of the kernel: to provide userspace with a uniform
>> execution environment -- if every application had to deal with broken
>> firmware it would be a nightmare.
> 
> You cuted the other paragraph that explained why they will unlikely
> to be broken badly enough to break the kernel.

That was entirely beside the point. Just because it doesn't break the
kernel itself doesn't make it any less necessary for it to be fixed
inside the kernel. It must be done in a common place so every
application doesn't have to maintain a table of hardware quirks.

Logan
Jerome Glisse Dec. 4, 2018, 11:56 p.m. UTC | #21
On Tue, Dec 04, 2018 at 03:16:54PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-04 2:51 p.m., Jerome Glisse wrote:
> > Existing user would disagree in my cover letter i have given pointer
> > to existing library and paper from HPC folks that do leverage system
> > topology (among the few who are). So they are application _today_ that
> > do use topology information to adapt their workload to maximize the
> > performance for the platform they run on.
> 
> Well we need to give them what they actually need, not what they want to
> shoot their foot with. And I imagine, much of what they actually do
> right now belongs firmly in the kernel. Like I said, existing
> applications are not justifications for bad API design or layering
> violations.

One example I have is 4 nodes (CPU sockets), each node with 8 GPUs, and
pairs of those 8-GPU nodes connected to each other with a fast mesh (ie
each GPU can do peer to peer with every other GPU in the pair at the
same bandwidth). Then these 2 blocks are connected to each other
through a shared link.

So it looks like:
    SOCKET0----SOCKET1-----SOCKET2----SOCKET3
    |          |           |          |
    S0-GPU0====S1-GPU0     S2-GPU0====S3-GPU0
    ||     \\//            ||     \\//
    ||     //\\            ||     //\\
    ...    ====...    -----...    ====...
    ||     \\//            ||     \\//
    ||     //\\            ||     //\\
    S0-GPU7====S1-GPU7     S2-GPU7====S3-GPU7

The application partitions its workload in 2, ie it allocates its
dataset twice, once for each group of 16 GPUs. Each of the 2 partitions
is then further split in two for some of the buffers in the dataset,
but not all.

So AFAICT they are using all the topology information. They see that
there are 4 groups of GPUs, that those 4 groups form 2 pairs with a
better interconnect inside each pair, and that there is a shared,
slower interconnect between the 2 pairs.

From the HMS point of view this looks like (ignoring CPUs):
link0: S0-GPU0 ... S0-GPU7
link1: S1-GPU0 ... S1-GPU7
link2: S2-GPU0 ... S2-GPU7
link3: S3-GPU0 ... S3-GPU7

link4: S0-GPU0 ... S0-GPU7 S1-GPU0 ... S1-GPU7
link5: S2-GPU0 ... S2-GPU7 S3-GPU0 ... S3-GPU7

link6: S0-GPU0 ... S0-GPU7 S1-GPU0 ... S1-GPU7
       S2-GPU0 ... S2-GPU7 S3-GPU0 ... S3-GPU7

Dumb it down any further and they lose information they want. On top
of that there are also the CPU NUMA nodes (which are more symmetric).

I do not see how this can be expressed in the sysfs we have today, but
maybe there is a way to shoehorn it in.

I expect more complex topologies to show up with a mix of different
devices (like GPUs and FPGAs).

> 
> You've even mentioned we'd need a simplified "libhms" interface for
> applications. We should really just figure out what that needs to be and
> make that the kernel interface.

No, I said that a libhms that dumbs things down would totally make
sense for the average application. I do not expect all applications to
use the full extent of the information. One simple reason is the
desktop: there I don't expect the topology to grow too complex, and
thus desktop applications will not care about it (your blender,
libreoffice, ... which are using the GPU today).

But for people creating applications that will run on big servers, yes,
I expect some of them will use that information, if only the existing
people who already do use that information.


> > They are also some new platform that have much more complex topology
> > that definitly can not be represented as a tree like today sysfs we
> > have (i believe that even some of the HPC folks have _today_ topology
> > that are not tree-like).
> 
> The sysfs tree already allows for a complex graph that describes
> existing hardware very well. If there is hardware it cannot describe
> then we should work to improve it and not just carve off a whole new
> area for a special API. -- In fact, you are already using sysfs, just
> under your own virtual/non-existent bus.

How would the above example look? I fail to see how to do it inside
current sysfs. Maybe by creating multiple virtual devices, one for each
interconnect? So something like:

link0 -> device:00 which itself has S0-GPU0 ... S0-GPU7 as children
link1 -> device:01 which itself has S1-GPU0 ... S1-GPU7 as children
link2 -> device:02 which itself has S2-GPU0 ... S2-GPU7 as children
link3 -> device:03 which itself has S3-GPU0 ... S3-GPU7 as children

Then for link4, link5 and link6 we would need symlinks to the GPU
devices. So it sounds like creating virtual devices just for the sake
of staying within the existing framework. Userspace would then have to
learn about these virtual devices, identify them as nodes of the
topology graph, and differentiate them from non-node devices. This
sounds much more complex to me.

Also, if we create nodes for those, we would need CPU-less and
memory-less NUMA nodes, as the GPU memory is not usable by the CPU ...
I am not sure we want to get there. If that's what people want, fine,
but I personally don't think this is the right solution.


> > Note that if it turn out to be a bad idea kernel can decide to dumb
> > down thing in future version for new platform. So it could give a
> > flat graph to userspace, there is nothing precluding that.
> 
> Uh... if it turns out to be a bad idea we are screwed because we have an
> API existing applications are using. It's much easier to add features to
> a simple (your word: "dumb") interface than it is to take options away
> from one that is too broad.

We all have fears that what we do will not get used, but I do not want
to stop making progress because of that. Like I said, I am doing all
this under staging to get the ball rolling, to test it with guinea
pigs, and to gain some level of confidence that it is actually useful.
So I am providing evidence today (see all the research in HPC on memory
management, topology, placement, ... for which I have given some links)
and I want to gather more evidence before committing to this.

I hope this sounds like a reasonable plan. What would you like me to do
differently? Like I said, I feel this is a chicken-and-egg problem:
today there is no standard way to get the topology, so there is no way
to know how many applications would use such information. We know that
a few applications in special cases use topology information. How do we
test whether more applications would use that same information without
providing some kind of standard API for them to get it?

It is also a matter of system availability: right now there are very
few systems with such a complex topology, but we are seeing more and
more GPUs, TPUs and FPGAs in more and more environments. I want to be
proactive here and provide an API that would help people experimenting
with those new systems leverage them.

My proposal is to put HMS behind staging for a while and also avoid any
disruption to existing code paths, and to see whether people living on
the bleeding edge get interested in that information. If not, then I
can strip my work down to the bare minimum, which is about device
memory.


> >>> I am talking about the inevitable fact that at some point some system
> >>> firmware will miss-represent their platform. System firmware writer
> >>> usualy copy and paste thing with little regards to what have change
> >>> from one platform to the new. So their will be inevitable workaround
> >>> and i would rather see those piling up inside a userspace library than
> >>> inside the kernel.
> >>
> >> It's *absolutely* the kernel's responsibility to patch issues caused by
> >> broken firmware. We have quirks all over the place for this. That's
> >> never something userspace should be responsible for. Really, this is the
> >> raison d'etre of the kernel: to provide userspace with a uniform
> >> execution environment -- if every application had to deal with broken
> >> firmware it would be a nightmare.
> > 
> > You cuted the other paragraph that explained why they will unlikely
> > to be broken badly enough to break the kernel.
> 
> That was entirely beside the point. Just because it doesn't break the
> kernel itself doesn't make it any less necessary for it to be fixed
> inside the kernel. It must be done in a common place so every
> application doesn't have to maintain a table of hardware quirks.

Fine with quirks in the kernel. It was just a personal taste thing ...
pure kernel vs ugly userspace :)

Cheers,
Jérôme
Felix Kuehling Dec. 5, 2018, 12:54 a.m. UTC | #22
On 2018-12-04 2:11 p.m., Logan Gunthorpe wrote:
> Also, in the same vein, I think it's wrong to have the API enumerate all
> the different memory available in the system. The API should simply
> allow userspace to say it wants memory that can be accessed by a set of
> initiators with a certain set of attributes and the bind call tries to
> fulfill that or fallback on system memory/hmm migration/whatever.

That gets pretty complex when you also have to take into account
contention on links and bridges when multiple initiators are accessing
multiple targets simultaneously. If you want the kernel to make sane
decisions, it needs a lot more information about the expected memory
access patterns.

Highly optimized algorithms that use multiple GPUs and collective
communications between them want to be able to place their memory
objects in the right location to avoid such contention. You don't want
such an algorithm to guess about opaque policy decisions in the kernel.
If the policy changes, you have to re-optimize the algorithm.

Regards,
  Felix


>
> Logan
Logan Gunthorpe Dec. 5, 2018, 1:15 a.m. UTC | #23
On 2018-12-04 4:56 p.m., Jerome Glisse wrote:
> One example i have is 4 nodes (CPU socket) each nodes with 8 GPUs and
> two 8 GPUs node connected through each other with fast mesh (ie each
> GPU can peer to peer to each other at the same bandwidth). Then this
> 2 blocks are connected to the other block through a share link.
> 
> So it looks like:
>     SOCKET0----SOCKET1-----SOCKET2----SOCKET3
>     |          |           |          |
>     S0-GPU0====S1-GPU0     S2-GPU0====S1-GPU0
>     ||     \\//            ||     \\//
>     ||     //\\            ||     //\\
>     ...    ====...    -----...    ====...
>     ||     \\//            ||     \\//
>     ||     //\\            ||     //\\
>     S0-GPU7====S1-GPU7     S2-GPU7====S3-GPU7

Well the existing NUMA node stuff tells userspace which GPU belongs to
which socket (every device in sysfs already has a numa_node attribute).
And if that's not good enough we should work to improve how that works
for all devices. This problem isn't specific to GPUs or devices with
memory and seems rather orthogonal to an API to bind to device memory.
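
For what it is worth, that attribute is already trivially consumable
from userspace; a minimal sketch (the PCI address below is only an
example):

    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/sys/bus/pci/devices/0000:3b:00.0/numa_node",
                            "r");
            int node = -1;

            if (!f)
                    return 1;
            if (fscanf(f, "%d", &node) == 1)
                    printf("device is attached to NUMA node %d\n", node);
            fclose(f);
            return 0;
    }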

> How the above example would looks like ? I fail to see how to do it
> inside current sysfs. Maybe by creating multiple virtual device for
> each of the inter-connect ? So something like
> 
> link0 -> device:00 which itself has S0-GPU0 ... S0-GPU7 has child
> link1 -> device:01 which itself has S1-GPU0 ... S1-GPU7 has child
> link2 -> device:02 which itself has S2-GPU0 ... S2-GPU7 has child
> link3 -> device:03 which itself has S3-GPU0 ... S3-GPU7 has child

I think the "links" between GPUs themselves would be a bus. In the same
way a NUMA node is a bus. Each device in sysfs would then need a
directory or something to describe what "link bus(es)" they are a part
of. Though there are other ways to do this: a GPU driver could simply
create symlinks to other GPUs inside a "neighbours" directory under the
device path or something like that.
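
A rough sketch of that last idea, using only the stock kobject/sysfs
helpers; the "neighbours" name and the notion of the GPU driver calling
this are assumptions, not an existing interface:

    #include <linux/device.h>
    #include <linux/errno.h>
    #include <linux/kobject.h>
    #include <linux/sysfs.h>

    /*
     * Sketch only: a real driver would create the "neighbours"
     * directory once at probe time and keep the kobject around.
     */
    static int gpu_expose_neighbour(struct device *gpu, struct device *peer)
    {
            struct kobject *dir;

            dir = kobject_create_and_add("neighbours", &gpu->kobj);
            if (!dir)
                    return -ENOMEM;

            /* /sys/.../<gpu>/neighbours/<peer> */
            return sysfs_create_link(dir, &peer->kobj, dev_name(peer));
    }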

The point is that this seems like it is specific to GPUs and could
easily be solved in the GPU community without any new universal concepts
or big APIs.

And for applications that need topology information, a lot of it is
already there, we just need to fill in the gaps with small changes that
would be much less controversial. Then if you want to create a libhms
(or whatever) to help applications parse this information out of
existing sysfs that would make sense.

> My proposal is to do HMS behind staging for a while and also avoid
> any disruption to existing code path. See with people living on the
> bleeding edge if they get interested in that informations. If not then
> i can strip down my thing to the bare minimum which is about device
> memory.

This isn't my area or decision to make, but it seemed to me like this is
not what staging is for. Staging is for introducing *drivers* that
aren't up to the Kernel's quality level and they all reside under the
drivers/staging path. It's not meant to introduce experimental APIs
around the kernel that might be revoked at anytime.

DAX introduced itself by marking the config option as EXPERIMENTAL and
printing warnings to dmesg when someone tries to use it. But, to my
knowledge, DAX also wasn't creating APIs with the intention of changing
or revoking them -- it was introducing features using largely existing
APIs that had many broken corner cases.

Do you know of any precedents where big APIs were introduced and then
later revoked or radically changed like you are proposing to do?

Logan
Jerome Glisse Dec. 5, 2018, 2:31 a.m. UTC | #24
On Tue, Dec 04, 2018 at 06:15:08PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-04 4:56 p.m., Jerome Glisse wrote:
> > One example i have is 4 nodes (CPU socket) each nodes with 8 GPUs and
> > two 8 GPUs node connected through each other with fast mesh (ie each
> > GPU can peer to peer to each other at the same bandwidth). Then this
> > 2 blocks are connected to the other block through a share link.
> > 
> > So it looks like:
> >     SOCKET0----SOCKET1-----SOCKET2----SOCKET3
> >     |          |           |          |
> >     S0-GPU0====S1-GPU0     S2-GPU0====S1-GPU0
> >     ||     \\//            ||     \\//
> >     ||     //\\            ||     //\\
> >     ...    ====...    -----...    ====...
> >     ||     \\//            ||     \\//
> >     ||     //\\            ||     //\\
> >     S0-GPU7====S1-GPU7     S2-GPU7====S3-GPU7
> 
> Well the existing NUMA node stuff tells userspace which GPU belongs to
> which socket (every device in sysfs already has a numa_node attribute).
> And if that's not good enough we should work to improve how that works
> for all devices. This problem isn't specific to GPUS or devices with
> memory and seems rather orthogonal to an API to bind to device memory.

HMS is generic and not for GPUs only; I use GPUs as the example because
they are the first devices introducing this complexity. I believe some
of the FPGA folks are working on the same thing, and I have heard that
more TPU-like hardware might also grow such complexity.

What you are proposing just seems to me like redoing HMS under the node
directory in sysfs, which has the potential of confusing existing
applications while providing no benefit (at least I fail to see any).

> > How the above example would looks like ? I fail to see how to do it
> > inside current sysfs. Maybe by creating multiple virtual device for
> > each of the inter-connect ? So something like
> > 
> > link0 -> device:00 which itself has S0-GPU0 ... S0-GPU7 has child
> > link1 -> device:01 which itself has S1-GPU0 ... S1-GPU7 has child
> > link2 -> device:02 which itself has S2-GPU0 ... S2-GPU7 has child
> > link3 -> device:03 which itself has S3-GPU0 ... S3-GPU7 has child
> 
> I think the "links" between GPUs themselves would be a bus. In the same
> way a NUMA node is a bus. Each device in sysfs would then need a
> directory or something to describe what "link bus(es)" they are a part
> of. Though there are other ways to do this: a GPU driver could simply
> create symlinks to other GPUs inside a "neighbours" directory under the
> device path or something like that.
> 
> The point is that this seems like it is specific to GPUs and could
> easily be solved in the GPU community without any new universal concepts
> or big APIs.

So it would be sprinkling all this information over various
sub-directories. To me this makes userspace's life harder. HMS has only
one directory hierarchy that userspace needs to parse to extract the
information. From my point of view that is much better, but this might
be a matter of taste.

> 
> And for applications that need topology information, a lot of it is
> already there, we just need to fill in the gaps with small changes that
> would be much less controversial. Then if you want to create a libhms
> (or whatever) to help applications parse this information out of
> existing sysfs that would make sense.

How can I express multiple links, or memory that is only accessible by
a subset of the devices/CPUs? In today's model there are baked-in
assumptions, like everyone being able to access every node, which do
not hold for what I am trying to do.

Yes, I can do it by adding a list of invalid peer nodes inside each
node, but that is all more complex from my point of view: highly
confusing for existing applications, and with the potential to break
existing applications on new platforms with such weird nodes.


> > My proposal is to do HMS behind staging for a while and also avoid
> > any disruption to existing code path. See with people living on the
> > bleeding edge if they get interested in that informations. If not then
> > i can strip down my thing to the bare minimum which is about device
> > memory.
> 
> This isn't my area or decision to make, but it seemed to me like this is
> not what staging is for. Staging is for introducing *drivers* that
> aren't up to the Kernel's quality level and they all reside under the
> drivers/staging path. It's not meant to introduce experimental APIs
> around the kernel that might be revoked at anytime.
> 
> DAX introduced itself by marking the config option as EXPERIMENTAL and
> printing warnings to dmesg when someone tries to use it. But, to my
> knowledge, DAX also wasn't creating APIs with the intention of changing
> or revoking them -- it was introducing features using largely existing
> APIs that had many broken corner cases.
> 
> Do you know of any precedents where big APIs were introduced and then
> later revoked or radically changed like you are proposing to do?

Yeah, it is kind of an issue. I can go the experimental way; ideally
what I would like is a kernel config option that enables it, plus a
kernel boot parameter as an extra gatekeeper, so I can distribute a
kernel with that feature inside some distribution and then provide
simple instructions for people to test (it is much easier to give
people a kernel boot parameter than to have them rebuild a kernel).

I am open to any suggestion on the best guidelines for experimenting
with an API. The issue is that the changes to userspace are big and
take time (months of work). So if I have to have everything lined up
and ready (userspace and kernel) in just one go then it is going to be
painful. My pain I guess, so others don't care ... :)

Cheers,
Jérôme
Dan Williams Dec. 5, 2018, 2:34 a.m. UTC | #25
On Tue, Dec 4, 2018 at 5:15 PM Logan Gunthorpe <logang@deltatee.com> wrote:
>
>
>
> On 2018-12-04 4:56 p.m., Jerome Glisse wrote:
> > One example i have is 4 nodes (CPU socket) each nodes with 8 GPUs and
> > two 8 GPUs node connected through each other with fast mesh (ie each
> > GPU can peer to peer to each other at the same bandwidth). Then this
> > 2 blocks are connected to the other block through a share link.
> >
> > So it looks like:
> >     SOCKET0----SOCKET1-----SOCKET2----SOCKET3
> >     |          |           |          |
> >     S0-GPU0====S1-GPU0     S2-GPU0====S1-GPU0
> >     ||     \\//            ||     \\//
> >     ||     //\\            ||     //\\
> >     ...    ====...    -----...    ====...
> >     ||     \\//            ||     \\//
> >     ||     //\\            ||     //\\
> >     S0-GPU7====S1-GPU7     S2-GPU7====S3-GPU7
>
> Well the existing NUMA node stuff tells userspace which GPU belongs to
> which socket (every device in sysfs already has a numa_node attribute).
> And if that's not good enough we should work to improve how that works
> for all devices. This problem isn't specific to GPUS or devices with
> memory and seems rather orthogonal to an API to bind to device memory.
>
> > How the above example would looks like ? I fail to see how to do it
> > inside current sysfs. Maybe by creating multiple virtual device for
> > each of the inter-connect ? So something like
> >
> > link0 -> device:00 which itself has S0-GPU0 ... S0-GPU7 has child
> > link1 -> device:01 which itself has S1-GPU0 ... S1-GPU7 has child
> > link2 -> device:02 which itself has S2-GPU0 ... S2-GPU7 has child
> > link3 -> device:03 which itself has S3-GPU0 ... S3-GPU7 has child
>
> I think the "links" between GPUs themselves would be a bus. In the same
> way a NUMA node is a bus. Each device in sysfs would then need a
> directory or something to describe what "link bus(es)" they are a part
> of. Though there are other ways to do this: a GPU driver could simply
> create symlinks to other GPUs inside a "neighbours" directory under the
> device path or something like that.
>
> The point is that this seems like it is specific to GPUs and could
> easily be solved in the GPU community without any new universal concepts
> or big APIs.
>
> And for applications that need topology information, a lot of it is
> already there, we just need to fill in the gaps with small changes that
> would be much less controversial. Then if you want to create a libhms
> (or whatever) to help applications parse this information out of
> existing sysfs that would make sense.
>
> > My proposal is to do HMS behind staging for a while and also avoid
> > any disruption to existing code path. See with people living on the
> > bleeding edge if they get interested in that informations. If not then
> > i can strip down my thing to the bare minimum which is about device
> > memory.
>
> This isn't my area or decision to make, but it seemed to me like this is
> not what staging is for. Staging is for introducing *drivers* that
> aren't up to the Kernel's quality level and they all reside under the
> drivers/staging path. It's not meant to introduce experimental APIs
> around the kernel that might be revoked at anytime.
>
> DAX introduced itself by marking the config option as EXPERIMENTAL and
> printing warnings to dmesg when someone tries to use it. But, to my
> knowledge, DAX also wasn't creating APIs with the intention of changing
> or revoking them -- it was introducing features using largely existing
> APIs that had many broken corner cases.
>
> Do you know of any precedents where big APIs were introduced and then
> later revoked or radically changed like you are proposing to do?

This came up before for APIs even better defined than HMS and more
limited in scope, i.e. experimental ABI availability only for -rc
kernels. Linus said this:

"There are no loopholes. No "but it's been only one release". No, no,
no. The whole point is that users are supposed to be able to *trust*
the kernel. If we do something, we keep on doing it.

And if it makes it harder to add new user-visible interfaces, then
that's a *good* thing." [1]

The takeaway being don't land work-in-progress ABIs in the kernel.
Once an application depends on it, there are no more incompatible
changes possible regardless of the warnings, experimental notices, or
"staging" designation. DAX is experimental because there are cases
where it currently does not work with respect to another kernel
feature like xfs-reflink, RDMA. The plan is to fix those, not continue
to hide behind an experimental designation, and fix them in a way that
preserves the user visible behavior that has already been exposed,
i.e. no regressions.

[1]: https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2017-August/004742.html
Jerome Glisse Dec. 5, 2018, 2:37 a.m. UTC | #26
On Tue, Dec 04, 2018 at 06:34:37PM -0800, Dan Williams wrote:
> On Tue, Dec 4, 2018 at 5:15 PM Logan Gunthorpe <logang@deltatee.com> wrote:
> >
> >
> >
> > On 2018-12-04 4:56 p.m., Jerome Glisse wrote:
> > > One example i have is 4 nodes (CPU socket) each nodes with 8 GPUs and
> > > two 8 GPUs node connected through each other with fast mesh (ie each
> > > GPU can peer to peer to each other at the same bandwidth). Then this
> > > 2 blocks are connected to the other block through a share link.
> > >
> > > So it looks like:
> > >     SOCKET0----SOCKET1-----SOCKET2----SOCKET3
> > >     |          |           |          |
> > >     S0-GPU0====S1-GPU0     S2-GPU0====S1-GPU0
> > >     ||     \\//            ||     \\//
> > >     ||     //\\            ||     //\\
> > >     ...    ====...    -----...    ====...
> > >     ||     \\//            ||     \\//
> > >     ||     //\\            ||     //\\
> > >     S0-GPU7====S1-GPU7     S2-GPU7====S3-GPU7
> >
> > Well the existing NUMA node stuff tells userspace which GPU belongs to
> > which socket (every device in sysfs already has a numa_node attribute).
> > And if that's not good enough we should work to improve how that works
> > for all devices. This problem isn't specific to GPUS or devices with
> > memory and seems rather orthogonal to an API to bind to device memory.
> >
> > > How the above example would looks like ? I fail to see how to do it
> > > inside current sysfs. Maybe by creating multiple virtual device for
> > > each of the inter-connect ? So something like
> > >
> > > link0 -> device:00 which itself has S0-GPU0 ... S0-GPU7 has child
> > > link1 -> device:01 which itself has S1-GPU0 ... S1-GPU7 has child
> > > link2 -> device:02 which itself has S2-GPU0 ... S2-GPU7 has child
> > > link3 -> device:03 which itself has S3-GPU0 ... S3-GPU7 has child
> >
> > I think the "links" between GPUs themselves would be a bus. In the same
> > way a NUMA node is a bus. Each device in sysfs would then need a
> > directory or something to describe what "link bus(es)" they are a part
> > of. Though there are other ways to do this: a GPU driver could simply
> > create symlinks to other GPUs inside a "neighbours" directory under the
> > device path or something like that.
> >
> > The point is that this seems like it is specific to GPUs and could
> > easily be solved in the GPU community without any new universal concepts
> > or big APIs.
> >
> > And for applications that need topology information, a lot of it is
> > already there, we just need to fill in the gaps with small changes that
> > would be much less controversial. Then if you want to create a libhms
> > (or whatever) to help applications parse this information out of
> > existing sysfs that would make sense.
> >
> > > My proposal is to do HMS behind staging for a while and also avoid
> > > any disruption to existing code path. See with people living on the
> > > bleeding edge if they get interested in that informations. If not then
> > > i can strip down my thing to the bare minimum which is about device
> > > memory.
> >
> > This isn't my area or decision to make, but it seemed to me like this is
> > not what staging is for. Staging is for introducing *drivers* that
> > aren't up to the Kernel's quality level and they all reside under the
> > drivers/staging path. It's not meant to introduce experimental APIs
> > around the kernel that might be revoked at anytime.
> >
> > DAX introduced itself by marking the config option as EXPERIMENTAL and
> > printing warnings to dmesg when someone tries to use it. But, to my
> > knowledge, DAX also wasn't creating APIs with the intention of changing
> > or revoking them -- it was introducing features using largely existing
> > APIs that had many broken corner cases.
> >
> > Do you know of any precedents where big APIs were introduced and then
> > later revoked or radically changed like you are proposing to do?
> 
> This came up before for apis even better defined than HMS as well as
> more limited scope, i.e. experimental ABI availability only for -rc
> kernels. Linus said this:
> 
> "There are no loopholes. No "but it's been only one release". No, no,
> no. The whole point is that users are supposed to be able to *trust*
> the kernel. If we do something, we keep on doing it.
> 
> And if it makes it harder to add new user-visible interfaces, then
> that's a *good* thing." [1]
> 
> The takeaway being don't land work-in-progress ABIs in the kernel.
> Once an application depends on it, there are no more incompatible
> changes possible regardless of the warnings, experimental notices, or
> "staging" designation. DAX is experimental because there are cases
> where it currently does not work with respect to another kernel
> feature like xfs-reflink, RDMA. The plan is to fix those, not continue
> to hide behind an experimental designation, and fix them in a way that
> preserves the user visible behavior that has already been exposed,
> i.e. no regressions.
> 
> [1]: https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2017-August/004742.html

So I guess I am heading down the vXX road ... such is my life :)

Cheers,
Jérôme
Aneesh Kumar K.V Dec. 5, 2018, 4:36 a.m. UTC | #27
On 12/4/18 11:54 PM, Jerome Glisse wrote:
> On Tue, Dec 04, 2018 at 09:06:59AM -0800, Andi Kleen wrote:
>> jglisse@redhat.com writes:
>>
>>> +
>>> +To help with forward compatibility each object as a version value and
>>> +it is mandatory for user space to only use target or initiator with
>>> +version supported by the user space. For instance if user space only
>>> +knows about what version 1 means and sees a target with version 2 then
>>> +the user space must ignore that target as if it does not exist.
>>
>> So once v2 is introduced all applications that only support v1 break.
>>
>> That seems very un-Linux and will break Linus' "do not break existing
>> applications" rule.
>>
>> The standard approach that if you add something incompatible is to
>> add new field, but keep the old ones.
> 
> No that's not how it is suppose to work. So let says it is 2018 and you
> have v1 memory (like your regular main DDR memory for instance) then it
> will always be expose a v1 memory.
> 
> Fast forward 2020 and you have this new type of memory that is not cache
> coherent and you want to expose this to userspace through HMS. What you
> do is a kernel patch that introduce the v2 type for target and define a
> set of new sysfs file to describe what v2 is. On this new computer you
> report your usual main memory as v1 and your new memory as v2.
> 
> So the application that only knew about v1 will keep using any v1 memory
> on your new platform but it will not use any of the new memory v2 which
> is what you want to happen. You do not have to break existing application
> while allowing to add new type of memory.
> 

So the knowledge that v1 is coherent and v2 is non-coherent is within
the application? That seems really complicated from the application's
point of view. Will the v1 and v2 definitions be arch and system
dependent?

If we want to encode properties of a target and initiator we should do
that as files within these directories. Something like an
'is_cache_coherent' file in the target directory could be used to
identify whether the target is cache coherent or not.
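
Something like the following minimal sketch, where the hms_target
structure and its cache_coherent field are hypothetical stand-ins for
whatever backs the target directory; only the device attribute macros
are existing kernel API:

    #include <linux/device.h>
    #include <linux/sysfs.h>

    struct hms_target {                 /* hypothetical */
            struct device dev;
            bool cache_coherent;
    };
    #define to_hms_target(d) container_of(d, struct hms_target, dev)

    static ssize_t is_cache_coherent_show(struct device *dev,
                                          struct device_attribute *attr,
                                          char *buf)
    {
            struct hms_target *target = to_hms_target(dev);

            return sprintf(buf, "%d\n", target->cache_coherent);
    }
    /* would be added to the target's sysfs attribute group */
    static DEVICE_ATTR_RO(is_cache_coherent);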

-aneesh
Jerome Glisse Dec. 5, 2018, 4:41 a.m. UTC | #28
On Wed, Dec 05, 2018 at 10:06:02AM +0530, Aneesh Kumar K.V wrote:
> On 12/4/18 11:54 PM, Jerome Glisse wrote:
> > On Tue, Dec 04, 2018 at 09:06:59AM -0800, Andi Kleen wrote:
> > > jglisse@redhat.com writes:
> > > 
> > > > +
> > > > +To help with forward compatibility each object as a version value and
> > > > +it is mandatory for user space to only use target or initiator with
> > > > +version supported by the user space. For instance if user space only
> > > > +knows about what version 1 means and sees a target with version 2 then
> > > > +the user space must ignore that target as if it does not exist.
> > > 
> > > So once v2 is introduced all applications that only support v1 break.
> > > 
> > > That seems very un-Linux and will break Linus' "do not break existing
> > > applications" rule.
> > > 
> > > The standard approach that if you add something incompatible is to
> > > add new field, but keep the old ones.
> > 
> > No that's not how it is suppose to work. So let says it is 2018 and you
> > have v1 memory (like your regular main DDR memory for instance) then it
> > will always be expose a v1 memory.
> > 
> > Fast forward 2020 and you have this new type of memory that is not cache
> > coherent and you want to expose this to userspace through HMS. What you
> > do is a kernel patch that introduce the v2 type for target and define a
> > set of new sysfs file to describe what v2 is. On this new computer you
> > report your usual main memory as v1 and your new memory as v2.
> > 
> > So the application that only knew about v1 will keep using any v1 memory
> > on your new platform but it will not use any of the new memory v2 which
> > is what you want to happen. You do not have to break existing application
> > while allowing to add new type of memory.
> > 
> 
> So the knowledge that v1 is coherent and v2 is non-coherent is within the
> application? That seems really complicated from application point of view.
> Rill that v1 and v2 definition be arch and system dependent?

No, the idea was that kernel version X, say 4.20, would define what v1
means. Then once v2 is added the kernel would define what that means.
Memory that has the v1 properties would get the v1 prefix and memory
that has the v2 properties would get v2.

An application written at 4.20 time, and thus only aware of v1, would
only look for v1 folders and thus only get memory it understands.

This is kind of a moot discussion as I will switch to a mask file
inside the directory per Logan's advice.

> 
> if we want to encode properties of a target and initiator we should do that
> as files within these directory. Something like 'is_cache_coherent'
> in the target director can be used to identify whether the target is cache
> coherent or not?

My objection and fear is that applications would overlook new
properties that they need to understand in order to safely use a new
type of memory. Thus old applications might start using weird memory on
a new platform and break in unexpected ways. This was the whole
rationale and motivation behind my choice.

I will switch to a set of flags in a file in the target directory and
rely on sane userspace behavior.
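
To illustrate how userspace could stay on the safe side with such a
flags file, here is a sketch; the bit values and the file format are
entirely hypothetical:

    #include <stdio.h>

    /* Hypothetical property bits a target's flags file might expose. */
    #define HMS_TARGET_CACHE_COHERENT   (1ULL << 0)
    #define HMS_TARGET_ATOMIC_OPS       (1ULL << 1)
    #define HMS_KNOWN_FLAGS \
            (HMS_TARGET_CACHE_COHERENT | HMS_TARGET_ATOMIC_OPS)

    /*
     * Sketch only: parse the flags file of a target and refuse to use
     * the target if it advertises any bit this program does not
     * understand -- the "sane userspace behavior" mentioned above.
     */
    int target_is_usable(const char *flags_path)
    {
            unsigned long long flags;
            FILE *f = fopen(flags_path, "r");

            if (!f)
                    return 0;
            if (fscanf(f, "%llx", &flags) != 1) {
                    fclose(f);
                    return 0;
            }
            fclose(f);

            return (flags & ~HMS_KNOWN_FLAGS) == 0;
    }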

Cheers,
Jérôme
Mike Rapoport Dec. 5, 2018, 10:52 a.m. UTC | #29
On Mon, Dec 03, 2018 at 06:34:57PM -0500, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> Add documentation to what is HMS and what it is for (see patch content).
> 
> Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
> Cc: Rafael J. Wysocki <rafael@kernel.org>
> Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Haggai Eran <haggaie@mellanox.com>
> Cc: Balbir Singh <balbirs@au1.ibm.com>
> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Felix Kuehling <felix.kuehling@amd.com>
> Cc: Philip Yang <Philip.Yang@amd.com>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Paul Blinzer <Paul.Blinzer@amd.com>
> Cc: Logan Gunthorpe <logang@deltatee.com>
> Cc: John Hubbard <jhubbard@nvidia.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
> Cc: Mark Hairgrove <mhairgrove@nvidia.com>
> Cc: Vivek Kini <vkini@nvidia.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Dave Airlie <airlied@redhat.com>
> Cc: Ben Skeggs <bskeggs@redhat.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  Documentation/vm/hms.rst | 275 ++++++++++++++++++++++++++++++++++-----
>  1 file changed, 246 insertions(+), 29 deletions(-)

This document describes a userspace API, so it's better to put it into
Documentation/admin-guide/mm.
Documentation/vm is more for descriptions of design and implementation.

I've spotted a couple of typos, but I think it doesn't make sense to
nitpick about them before v10 or so ;-)
 
> diff --git a/Documentation/vm/hms.rst b/Documentation/vm/hms.rst
> index dbf0f71918a9..bd7c9e8e7077 100644
> --- a/Documentation/vm/hms.rst
> +++ b/Documentation/vm/hms.rst
Logan Gunthorpe Dec. 5, 2018, 5:25 p.m. UTC | #30
On 2018-12-04 7:37 p.m., Jerome Glisse wrote:
>>
>> This came up before for apis even better defined than HMS as well as
>> more limited scope, i.e. experimental ABI availability only for -rc
>> kernels. Linus said this:
>>
>> "There are no loopholes. No "but it's been only one release". No, no,
>> no. The whole point is that users are supposed to be able to *trust*
>> the kernel. If we do something, we keep on doing it.
>>
>> And if it makes it harder to add new user-visible interfaces, then
>> that's a *good* thing." [1]
>>
>> The takeaway being don't land work-in-progress ABIs in the kernel.
>> Once an application depends on it, there are no more incompatible
>> changes possible regardless of the warnings, experimental notices, or
>> "staging" designation. DAX is experimental because there are cases
>> where it currently does not work with respect to another kernel
>> feature like xfs-reflink, RDMA. The plan is to fix those, not continue
>> to hide behind an experimental designation, and fix them in a way that
>> preserves the user visible behavior that has already been exposed,
>> i.e. no regressions.
>>
>> [1]: https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2017-August/004742.html
> 
> So i guess i am heading down the vXX road ... such is my life :)

I recommend against it. I really haven't been convinced by any of your
arguments for having a second topology tree. The existing topology tree
in sysfs already better describes the links between hardware right now,
except for the missing GPU links (and those should be addressable within
the GPU community). Plus, maybe, some other enhancements to sockets/numa
node descriptions if there's something missing there.

Then, 'hbind' is another issue but I suspect it would be better
implemented as an ioctl on existing GPU interfaces. I certainly can't
see any benefit in using it myself.

It's better to take an approach that would be less controversial with
the community than to browbeat them with a patch set posted 20+ times
until they take it.

Logan
Logan Gunthorpe Dec. 5, 2018, 5:41 p.m. UTC | #31
On 2018-12-04 7:31 p.m., Jerome Glisse wrote:
> How can i express multiple link, or memory that is only accessible
> by a subset of the devices/CPUs. In today model they are back in
> assumption like everyone can access all the node which do not hold
> in what i am trying to do.

Well multiple links are easy when you have a 'link' bus. Just add
another link device under the bus.

Technically, the accessibility issue is already encoded in sysfs. For
example, through the PCI tree you can determine which ACS bits are set
and determine which devices are behind the same root bridge the same way
we do in the kernel p2pdma subsystem. This is all bus specific which is
fine, but if we want to change that, we should have a common way for
existing buses to describe these attributes in the existing tree. The
new 'link' bus devices would have to have some way to describe cases if
memory isn't accessible in some way across it.

But really, I would say the kernel is responsible for telling you when
memory is accessible to a list of initiators, so it should be part of
the checks in a theoretical hbind api. This is already the approach
p2pdma takes in-kernel: we have functions that tell you if two PCI
devices can talk to each other and we have functions to give you memory
accessible by a set of devices. What we don't have is a special tree
that p2pdma users have to walk through to determine accessibility.
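
For reference, a small kernel-side sketch of the kind of helpers in
question (an untested sketch against the 4.20-era p2pdma interfaces):

    #include <linux/pci.h>
    #include <linux/pci-p2pdma.h>

    /*
     * Sketch, not a tested driver: ask the p2pdma core whether the
     * provider's memory is usable by all clients, then carve a buffer
     * out of it.
     */
    static void *alloc_p2p_buffer(struct pci_dev *provider,
                                  struct device **clients, int num_clients,
                                  size_t size)
    {
            if (!pci_has_p2pmem(provider))
                    return NULL;

            /* Negative distance: at least one client cannot reach it. */
            if (pci_p2pdma_distance_many(provider, clients,
                                         num_clients, true) < 0)
                    return NULL;

            return pci_alloc_p2pmem(provider, size);
    }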

In my eyes, you are just conflating a bunch of different issues that
are better solved independently in the existing frameworks we have. And
if they were tackled individually, you'd have a much easier time getting
them merged one by one.

Logan
Jerome Glisse Dec. 5, 2018, 6:01 p.m. UTC | #32
On Wed, Dec 05, 2018 at 10:25:31AM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-04 7:37 p.m., Jerome Glisse wrote:
> >>
> >> This came up before for apis even better defined than HMS as well as
> >> more limited scope, i.e. experimental ABI availability only for -rc
> >> kernels. Linus said this:
> >>
> >> "There are no loopholes. No "but it's been only one release". No, no,
> >> no. The whole point is that users are supposed to be able to *trust*
> >> the kernel. If we do something, we keep on doing it.
> >>
> >> And if it makes it harder to add new user-visible interfaces, then
> >> that's a *good* thing." [1]
> >>
> >> The takeaway being don't land work-in-progress ABIs in the kernel.
> >> Once an application depends on it, there are no more incompatible
> >> changes possible regardless of the warnings, experimental notices, or
> >> "staging" designation. DAX is experimental because there are cases
> >> where it currently does not work with respect to another kernel
> >> feature like xfs-reflink, RDMA. The plan is to fix those, not continue
> >> to hide behind an experimental designation, and fix them in a way that
> >> preserves the user visible behavior that has already been exposed,
> >> i.e. no regressions.
> >>
> >> [1]: https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2017-August/004742.html
> > 
> > So i guess i am heading down the vXX road ... such is my life :)
> 
> I recommend against it. I really haven't been convinced by any of your
> arguments for having a second topology tree. The existing topology tree
> in sysfs already better describes the links between hardware right now,
> except for the missing GPU links (and those should be addressable within
> the GPU community). Plus, maybe, some other enhancements to sockets/numa
> node descriptions if there's something missing there.
> 
> Then, 'hbind' is another issue but I suspect it would be better
> implemented as an ioctl on existing GPU interfaces. I certainly can't
> see any benefit in using it myself.
> 
> It's better to take an approach that would be less controversial with
> the community than to brow beat them with a patch set 20+ times until
> they take it.

So here is what I am going to do, because I need this code now. I am
going to split the helper code that does policy and hbind out from its
sysfs counterpart and turn it into helpers that each device driver can
use. I will move the sysfs and syscall parts to a patchset of their own
which uses the exact same infrastructure.

This means that I am losing a feature: userspace can no longer provide
a list of multiple device memories to use (which is much more common
than you might think), but at least I can provide something for the
single device case through an ioctl.

I am not giving up on the sysfs or syscall side as this is needed long
term, so I am going to improve it, port existing userspace (OpenCL,
ROCm, ...) to use it (in a branch) and demonstrate how it gets used by
end applications. I will beat on it again and again until either I
convince people through hard evidence or I get bored. I do not get
bored easily :)

Cheers,
Jérôme
Jerome Glisse Dec. 5, 2018, 6:07 p.m. UTC | #33
On Wed, Dec 05, 2018 at 10:41:56AM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-04 7:31 p.m., Jerome Glisse wrote:
> > How can i express multiple link, or memory that is only accessible
> > by a subset of the devices/CPUs. In today model they are back in
> > assumption like everyone can access all the node which do not hold
> > in what i am trying to do.
> 
> Well multiple links are easy when you have a 'link' bus. Just add
> another link device under the bus.

So you are telling me to do what I am doing in this patchset, but not
under the HMS directory?

> 
> Technically, the accessibility issue is already encoded in sysfs. For
> example, through the PCI tree you can determine which ACS bits are set
> and determine which devices are behind the same root bridge the same way
> we do in the kernel p2pdma subsystem. This is all bus specific which is
> fine, but if we want to change that, we should have a common way for
> existing buses to describe these attributes in the existing tree. The
> new 'link' bus devices would have to have some way to describe cases if
> memory isn't accessible in some way across it.

What I am looking at is much more complex than just an access bit. It
is a whole set of properties attached to each path (can it be cache
coherent? can it do atomics? what is the access granularity? what is
the bandwidth? is it a dedicated link? ...)

> 
> But really, I would say the kernel is responsible for telling you when
> memory is accessible to a list of initiators, so it should be part of
> the checks in a theoretical hbind api. This is already the approach
> p2pdma takes in-kernel: we have functions that tell you if two PCI
> devices can talk to each other and we have functions to give you memory
> accessible by a set of devices. What we don't have is a special tree
> that p2pdma users have to walk through to determine accessibility.

You do not need it, but I do need it. There are users out there that
already depend on this information and get it in non standard ways. I
want to provide a standard way for userspace to get it. They are real
users and I believe there would be more of them if we had a standard
way to provide the information. You do not believe in it, fine. I will
do more work in userspace, build more examples, and come back with
more hard evidence until I convince enough people.

> 
> In my eye's, you are just conflating a bunch of different issues that
> are better solved independently in the existing frameworks we have. And
> if they were tackled individually, you'd have a much easier time getting
> them merged one by one.

I don't think I can convince you otherwise. There are users that use
topology information; please look at the links I provided. Those folks
have programs running _today_ that rely on non standard APIs, and they
would like to move toward a standard API because it would improve
their life.

On top of that I argue that more people would use this information if
it were available to them. I agree that I have no hard evidence to back
that up and that it is just a feeling, but you cannot disprove me
either: this is a chicken and egg problem, you cannot prove people will
not use an API if the API is not there to be used.

Cheers,
Jérôme
Logan Gunthorpe Dec. 5, 2018, 6:20 p.m. UTC | #34
On 2018-12-05 11:07 a.m., Jerome Glisse wrote:
>> Well multiple links are easy when you have a 'link' bus. Just add
>> another link device under the bus.
> 
> So you are telling do what i am doing in this patchset but not under
> HMS directory ?

No, it's completely different. I'm talking about creating a bus to
describe only the real hardware that links GPUs. Not creating a new
virtual tree containing a bunch of duplicate bus and device information
that already exists currently in sysfs.

>>
>> Technically, the accessibility issue is already encoded in sysfs. For
>> example, through the PCI tree you can determine which ACS bits are set
>> and determine which devices are behind the same root bridge the same way
>> we do in the kernel p2pdma subsystem. This is all bus specific which is
>> fine, but if we want to change that, we should have a common way for
>> existing buses to describe these attributes in the existing tree. The
>> new 'link' bus devices would have to have some way to describe cases if
>> memory isn't accessible in some way across it.
> 
> What i am looking at is much more complex than just access bit. It
> is a whole set of properties attach to each path (can it be cache
> coherent ? can it do atomic ? what is the access granularity ? what
> is the bandwidth ? is it dedicated link ? ...)

I'm not talking about just an access bit. I'm talking about what you are
describing: standard ways for *existing* buses in the sysfs hierarchy to
describe things like cache coherency, atomics, granularity, etc without
creating a new hierarchy.

> On top of that i argue that more people would use that information if it
> were available to them. I agree that i have no hard evidence to back that
> up and that it is just a feeling but you can not disprove me either as
> this is a chicken and egg problem, you can not prove people will not use
> an API if the API is not there to be use.

And you miss my point that much of this information is already available
to them. And more can be added in the existing framework without
creating any brand new concepts. I haven't said anything about
chicken-and-egg problems -- I've given you a bunch of different
suggestions to split this up into more manageable problems and address
many of them within the APIs and frameworks we have already.

Logan
Jerome Glisse Dec. 5, 2018, 6:33 p.m. UTC | #35
On Wed, Dec 05, 2018 at 11:20:30AM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-05 11:07 a.m., Jerome Glisse wrote:
> >> Well multiple links are easy when you have a 'link' bus. Just add
> >> another link device under the bus.
> > 
> > So you are telling do what i am doing in this patchset but not under
> > HMS directory ?
> 
> No, it's completely different. I'm talking about creating a bus to
> describe only the real hardware that links GPUs. Not creating a new
> virtual tree containing a bunch of duplicate bus and device information
> that already exists currently in sysfs.
> 
> >>
> >> Technically, the accessibility issue is already encoded in sysfs. For
> >> example, through the PCI tree you can determine which ACS bits are set
> >> and determine which devices are behind the same root bridge the same way
> >> we do in the kernel p2pdma subsystem. This is all bus specific which is
> >> fine, but if we want to change that, we should have a common way for
> >> existing buses to describe these attributes in the existing tree. The
> >> new 'link' bus devices would have to have some way to describe cases if
> >> memory isn't accessible in some way across it.
> > 
> > What i am looking at is much more complex than just access bit. It
> > is a whole set of properties attach to each path (can it be cache
> > coherent ? can it do atomic ? what is the access granularity ? what
> > is the bandwidth ? is it dedicated link ? ...)
> 
> I'm not talking about just an access bit. I'm talking about what you are
> describing: standard ways for *existing* buses in the sysfs hierarchy to
> describe things like cache coherency, atomics, granularity, etc without
> creating a new hierarchy.
> 
> > On top of that i argue that more people would use that information if it
> > were available to them. I agree that i have no hard evidence to back that
> > up and that it is just a feeling but you can not disprove me either as
> > this is a chicken and egg problem, you can not prove people will not use
> > an API if the API is not there to be use.
> 
> And you miss my point that much of this information is already available
> to them. And more can be added in the existing framework without
> creating any brand new concepts. I haven't said anything about
> chicken-and-egg problems -- I've given you a bunch of different
> suggestions to split this up into more managable problems and address
> many of them within the APIs and frameworks we have already.

The thing is that what I am considering is not in sysfs, it does not
even have a Linux kernel driver. It is just chips that connect devices
to each other, and there is nothing to do with those chips: it is all
hardware, they do not need a driver. So there is nothing existing that
addresses what I need to represent.

If I add a fake driver for those, what would I do? Under which
sub-system do I register them? How do I express the fact that they
connect devices X, Y and Z with some properties?

This is not PCIE ... you cannot discover bridges and children, it is
not a tree-like structure, it is a random graph (which depends on how
the OEM wires the ports on the chips).


So I have no pre-existing driver, they are not in sysfs today and
they do not need a driver. Hence why I proposed what I proposed: a
sysfs hierarchy where I can add those "virtual" objects and show how
they connect the existing devices for which we already have a sysfs
directory to symlink.


Cheers,
Jérôme
Logan Gunthorpe Dec. 5, 2018, 6:48 p.m. UTC | #36
On 2018-12-05 11:33 a.m., Jerome Glisse wrote:
> If i add a a fake driver for those what would i do ? under which
> sub-system i register them ? How i express the fact that they
> connect device X,Y and Z with some properties ?

Yes this is exactly what I'm suggesting. I wouldn't call it a fake
driver, but a new struct device describing an actual device in the
system. It would be a feature of the GPU subsystem, seeing as this is a
feature of GPUs. Expressing that the new devices connect to a specific
set of GPUs is not a hard problem to solve.

> This is not PCIE ... you can not discover bridges and child, it
> not a tree like structure, it is a random graph (which depends
> on how the OEM wire port on the chips).

You must be able to discover that these links exist and register a
device with the system. Where else do you get the information currently?
The suggestion doesn't change anything to do with how you interact with
hardware, only how you describe the information within the kernel.

> So i have not pre-existing driver, they are not in sysfs today and
> they do not need a driver. Hence why i proposed what i proposed
> a sysfs hierarchy where i can add those "virtual" object and shows
> how they connect existing device for which we have a sysfs directory
> to symlink.

So add a new driver -- that's what I've been suggesting all along.
Having a driver not exist is no reason to not create one. I'd suggest
that if you want them to show up in the sysfs hierarchy then you do need
some kind of driver code to create a struct device. Just because the
kernel doesn't have to interact with them is no reason not to create a
struct device. It's *much* easier to create a new driver subsystem than
a whole new userspace API.

Logan
Jerome Glisse Dec. 5, 2018, 6:55 p.m. UTC | #37
On Wed, Dec 05, 2018 at 11:48:37AM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-05 11:33 a.m., Jerome Glisse wrote:
> > If i add a a fake driver for those what would i do ? under which
> > sub-system i register them ? How i express the fact that they
> > connect device X,Y and Z with some properties ?
> 
> Yes this is exactly what I'm suggesting. I wouldn't call it a fake
> driver, but a new struct device describing an actual device in the
> system. It would be a feature of the GPU subsystem seeing this is a
> feature of GPUs. Expressing that the new devices connect to a specific
> set of GPUs is not a hard problem to solve.
> 
> > This is not PCIE ... you can not discover bridges and child, it
> > not a tree like structure, it is a random graph (which depends
> > on how the OEM wire port on the chips).
> 
> You must be able to discover that these links exist and register a
> device with the system. Where else do you get the information currently?
> The suggestion doesn't change anything to do with how you interact with
> hardware, only how you describe the information within the kernel.
> 
> > So i have not pre-existing driver, they are not in sysfs today and
> > they do not need a driver. Hence why i proposed what i proposed
> > a sysfs hierarchy where i can add those "virtual" object and shows
> > how they connect existing device for which we have a sysfs directory
> > to symlink.
> 
> So add a new driver -- that's what I've been suggesting all along.
> Having a driver not exist is no reason to not create one. I'd suggest
> that if you want them to show up in the sysfs hierarchy then you do need
> some kind of driver code to create a struct device. Just because the
> kernel doesn't have to interact with them is no reason not to create a
> struct device. It's *much* easier to create a new driver subsystem than
> a whole new userspace API.

So now, once the next type of device shows up with the exact same
thing, let's say FPGAs, we have to create a new subsystem for them too.
This also makes userspace's life much, much harder. Now userspace must
go parse PCIE, subsystem1, subsystem2, subsystemN, NUMA, ... and merge
all that different information together to rebuild in userspace the
representation I am putting forward in this patchset.

There is also no telling whether the kernel will be able to provide
quirks and workarounds when some merging is actually illegal on a given
platform (like a link from a subsystem not being accessible through the
PCI connection of one of the devices connected to that link).

So it means userspace will have to grow its own database of quirks and
workarounds, and I am back in the situation I am in today.

Not very convincing to me. What I am proposing here is a new common
description provided by the kernel where we can reconcile such weird
interactions.

But I doubt I can convince you. I will make progress on what I need
today and keep working on sysfs.

Cheers,
Jérôme
Logan Gunthorpe Dec. 5, 2018, 7:10 p.m. UTC | #38
On 2018-12-05 11:55 a.m., Jerome Glisse wrote:
> So now once next type of device shows up with the exact same thing
> let say FPGA, we have to create a new subsystem for them too. Also
> this make the userspace life much much harder. Now userspace must
> go parse PCIE, subsystem1, subsystem2, subsystemN, NUMA, ... and
> merge all that different information together and rebuild the
> representation i am putting forward in this patchset in userspace.

Yes. But seeing as such FPGA links aren't common yet and there isn't
really much in terms of common FPGA infrastructure in the kernel (which
is hard, seeing as the hardware is infinitely customizable), you can
let the people developing FPGA code worry about it and come up with
their own solution. Buses between FPGAs may end up never being common
enough for people to care, or they may end up being so weird that they
need their own description independent of GPUs, or maybe when they
become common they find a way to use the GPU link subsystem -- who
knows. Don't try to design for use cases that don't exist yet.

Yes, userspace will have to know about all the buses it cares to find
links over. Sounds like a perfect thing for libhms to do.

> There is no telling that kernel won't be able to provide quirk and
> workaround because some merging is actually illegal on a given
> platform (like some link from a subsystem is not accessible through
> the PCI connection of one of the device connected to that link).

These are all just different individual problems which need different
solutions, not grand new design concepts.

> So it means userspace will have to grow its own database or work-
> around and quirk and i am back in the situation i am in today.

No, as I've said, quirks are firmly the responsibility of kernels.
Userspace will need to know how to work with the different buses and
CPU/node information but there really isn't that many of these to deal
with and this is a much easier approach than trying to come up with a
new API that can wrap the nuances of all existing and potential future
bus types we may have to deal with.

Logan
Jerome Glisse Dec. 5, 2018, 10:58 p.m. UTC | #39
On Wed, Dec 05, 2018 at 12:10:10PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-05 11:55 a.m., Jerome Glisse wrote:
> > So now once next type of device shows up with the exact same thing
> > let say FPGA, we have to create a new subsystem for them too. Also
> > this make the userspace life much much harder. Now userspace must
> > go parse PCIE, subsystem1, subsystem2, subsystemN, NUMA, ... and
> > merge all that different information together and rebuild the
> > representation i am putting forward in this patchset in userspace.
> 
> Yes. But seeing such FPGA links aren't common yet and there isn't really
> much in terms of common FPGA infrastructure in the kernel (which are
> hard seeing the hardware is infinitely customization) you can let the
> people developing FPGA code worry about it and come up with their own
> solution. Buses between FPGAs may end up never being common enough for
> people to care, or they may end up being so weird that they need their
> own description independent of GPUS, or maybe when they become common
> they find a way to use the GPU link subsystem -- who knows. Don't try to
> design for use cases that don't exist yet.
> 
> Yes, userspace will have to know about all the buses it cares to find
> links over. Sounds like a perfect thing for libhms to do.

So just to be clear, here is how I understand your position:
"A single coherent sysfs hierarchy to describe something is useless,
 let's git rm drivers/base/"

While I am arguing: "hey, the /sys/bus/node/devices/* hierarchy is nice
but it just does not cut it for all these new hardware platforms. If I
add new nodes there for my new memory I will break tons of existing
applications. So what about a new hierarchy that allows describing
those new hardware platforms in a single place, like the node thing
does today"


> 
> > There is no telling that kernel won't be able to provide quirk and
> > workaround because some merging is actually illegal on a given
> > platform (like some link from a subsystem is not accessible through
> > the PCI connection of one of the device connected to that link).
> 
> These are all just different individual problems which need different
> solutions not grand new design concepts.
> 
> > So it means userspace will have to grow its own database or work-
> > around and quirk and i am back in the situation i am in today.
> 
> No, as I've said, quirks are firmly the responsibility of kernels.
> Userspace will need to know how to work with the different buses and
> CPU/node information but there really isn't that many of these to deal
> with and this is a much easier approach than trying to come up with a
> new API that can wrap the nuances of all existing and potential future
> bus types we may have to deal with.

No can do, that is what I am trying to explain. Say I have bus 1 in
sub-system A, and usually that kind of bus can serve as a bridge for
PCIE, ie a CPU can access a device behind it by going through a PCIE
device first. So now the userspace library has this knowledge baked
in. Now if a platform has a bug, for whatever reason, where that does
not hold, the kernel has no way to tell userspace that there is an
exception there. It is up to userspace to have a database of quirks.

In your scheme the kernel sees all those objects in isolation. While in
what I am proposing there is only one place, and any device that
participates in this common place can report any quirks so that a
coherent view is given to user space.

If we have a gazillion places where all this information is spread
around then we have no way to fix weird interactions between any of
them.

Cheers,
Jérôme
Logan Gunthorpe Dec. 5, 2018, 11:09 p.m. UTC | #40
On 2018-12-05 3:58 p.m., Jerome Glisse wrote:
> So just to be clear here is how i understand your position:
> "Single coherent sysfs hierarchy to describe something is useless
>  let's git rm drivers/base/"

I have no idea what you're talking about. I'm saying the existing sysfs
hierarchy *should* be used for this application -- we shouldn't be
creating another hierarchy.

> While i am arguing that "hey the /sys/bus/node/devices/* is nice
> but it just does not cut it for all this new hardware platform
> if i add new nodes there for my new memory i will break tons of
> existing application. So what about a new hierarchy that allow
> to describe those new hardware platform in a single place like
> today node thing"

I'm talking about /sys/bus and all the bus information under there; not
just the node hierarchy. With this information, you can figure out how
any struct device is connected to another struct device. This has little
to do with a hypothetical memory device and what it might expose. You're
conflating memory devices with links between devices (ie. buses).


> No can do that is what i am trying to explain. So if i bus 1 in a
> sub-system A and usualy that kind of bus can serve a bridge for
> PCIE ie a CPU can access device behind it by going through a PCIE
> device first. So now the userspace libary have this knowledge
> bake in. Now if a platform has a bug for whatever reasons where
> that does not hold, the kernel has no way to tell userspace that
> there is an exception there. It is up to userspace to have a data
> base of quirks.

> Kernel see all those objects in isolation in your scheme. While in
> what i am proposing there is only one place and any device that
> participate in this common place can report any quirks so that a
> coherent view is given to user space.

The above makes no sense to me.


> If we have gazillion of places where all this informations is spread
> around than we have no way to fix weird inter-action between any
> of those.

So work to standardize it so that all buses present a consistent view of
what guarantees they provide for bus accesses. Quirks could then adjust
that information for systems that may be broken.

Logan
Jerome Glisse Dec. 5, 2018, 11:20 p.m. UTC | #41
On Wed, Dec 05, 2018 at 04:09:29PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-05 3:58 p.m., Jerome Glisse wrote:
> > So just to be clear here is how i understand your position:
> > "Single coherent sysfs hierarchy to describe something is useless
> >  let's git rm drivers/base/"
> 
> I have no idea what you're talking about. I'm saying the existing sysfs
> hierarchy *should* be used for this application -- we shouldn't be
> creating another hierarchy.
> 
> > While i am arguing that "hey the /sys/bus/node/devices/* is nice
> > but it just does not cut it for all this new hardware platform
> > if i add new nodes there for my new memory i will break tons of
> > existing application. So what about a new hierarchy that allow
> > to describe those new hardware platform in a single place like
> > today node thing"
> 
> I'm talking about /sys/bus and all the bus information under there; not
> just the node hierarchy. With this information, you can figure out how
> any struct device is connected to another struct device. This has little
> to do with a hypothetical memory device and what it might expose. You're
> conflating memory devices with links between devices (ie. buses).


And my proposal is under /sys/bus, and it has symlinks to all the
existing devices it aggregates in there.

For device memory I explained why it does not make sense to expose it
as a node. So how do I expose it? Yes, I can expose it under the
device directory, but then I cannot present the properties of that
memory, which depend on through which bus and through which bridges it
is accessed.

So I need bus and bridge objects so that I can express the properties
that depend on the path between the initiator and the target memory.

I argue it is better to expose all this under the same directory. You
say it is not. We have NUMA as an example that shows everything under
a single hierarchy, so to me you are saying that it is useless and has
no value. I say the NUMA thing has value and I would like something
like it, just with more stuff and with the capability of describing
any kind of graph.


I just do not see how I can achieve my objectives any differently.

I think we are just talking past each other and this is likely a
pointless conversation. I will keep working on this in the meantime.


> > No can do that is what i am trying to explain. So if i bus 1 in a
> > sub-system A and usualy that kind of bus can serve a bridge for
> > PCIE ie a CPU can access device behind it by going through a PCIE
> > device first. So now the userspace libary have this knowledge
> > bake in. Now if a platform has a bug for whatever reasons where
> > that does not hold, the kernel has no way to tell userspace that
> > there is an exception there. It is up to userspace to have a data
> > base of quirks.
> 
> > Kernel see all those objects in isolation in your scheme. While in
> > what i am proposing there is only one place and any device that
> > participate in this common place can report any quirks so that a
> > coherent view is given to user space.
> 
> The above makes no sense to me.
> 
> 
> > If we have gazillion of places where all this informations is spread
> > around than we have no way to fix weird inter-action between any
> > of those.
> 
> So work to standardize it so that all buses present a consistent view of
> what guarantees they provide for bus accesses. Quirks could then adjust
> that information for systems that may be broken.

So you agree with my proposal? A sysfs directory in which there are
all the buses, how they are connected to each other, and what is
connected to each of them (device, CPU, memory).

This is really confusing.

Cheers,
Jérôme
Logan Gunthorpe Dec. 5, 2018, 11:23 p.m. UTC | #42
On 2018-12-05 4:20 p.m., Jerome Glisse wrote:
> And my proposal is under /sys/bus and have symlink to all existing
> device it agregate in there.

That's so not the point. Use the existing buses; don't invent some
virtual tree. I don't know how many times I have to say this or in how
many ways. I'm not responding anymore.

> So you agree with my proposal ? A sysfs directory in which all the
> bus and how they are connected to each other and what is connected
> to each of them (device, CPU, memory).

I'm fine with the motivation. What I'm arguing against is the
implementation and the fact you have to create a whole grand new
userspace API and hierarchy to accomplish it.

Logan
Jerome Glisse Dec. 5, 2018, 11:27 p.m. UTC | #43
On Wed, Dec 05, 2018 at 04:23:42PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-05 4:20 p.m., Jerome Glisse wrote:
> > And my proposal is under /sys/bus and have symlink to all existing
> > device it agregate in there.
> 
> That's so not the point. Use the existing buses don't invent some
> virtual tree. I don't know how many times I have to say this or in how
> many ways. I'm not responding anymore.

And how do I express interactions between different buses? I just do
not see how to do that in the existing scheme. It would be like
teaching each bus about all the other buses, versus having each bus
register itself under a common framework and having all the
interactions between buses mediated through that common framework,
avoiding code duplication across buses.

> 
> > So you agree with my proposal ? A sysfs directory in which all the
> > bus and how they are connected to each other and what is connected
> > to each of them (device, CPU, memory).
> 
> I'm fine with the motivation. What I'm arguing against is the
> implementation and the fact you have to create a whole grand new
> userspace API and hierarchy to accomplish it.
> 
> Logan
Dan Williams Dec. 6, 2018, 12:08 a.m. UTC | #44
On Wed, Dec 5, 2018 at 3:27 PM Jerome Glisse <jglisse@redhat.com> wrote:
>
> On Wed, Dec 05, 2018 at 04:23:42PM -0700, Logan Gunthorpe wrote:
> >
> >
> > On 2018-12-05 4:20 p.m., Jerome Glisse wrote:
> > > And my proposal is under /sys/bus and have symlink to all existing
> > > device it agregate in there.
> >
> > That's so not the point. Use the existing buses don't invent some
> > virtual tree. I don't know how many times I have to say this or in how
> > many ways. I'm not responding anymore.
>
> And how do i express interaction with different buses because i just
> do not see how to do that in the existing scheme. It would be like
> teaching to each bus about all the other bus versus having each bus
> register itself under a common framework and have all the interaction
> between bus mediated through that common framework avoiding code
> duplication accross buses.
>
> >
> > > So you agree with my proposal ? A sysfs directory in which all the
> > > bus and how they are connected to each other and what is connected
> > > to each of them (device, CPU, memory).
> >
> > I'm fine with the motivation. What I'm arguing against is the
> > implementation and the fact you have to create a whole grand new
> > userspace API and hierarchy to accomplish it.

Right, GPUs show up in /sys today. Don't register a whole new
hierarchy as an alias to what already exists, add a new attribute
scheme to the existing hierarchy. This is what the HMAT enabling is
doing, this is what p2pdma is doing.
diff mbox series

Patch

diff --git a/Documentation/vm/hms.rst b/Documentation/vm/hms.rst
index dbf0f71918a9..bd7c9e8e7077 100644
--- a/Documentation/vm/hms.rst
+++ b/Documentation/vm/hms.rst
@@ -4,32 +4,249 @@ 
 Heterogeneous Memory System (HMS)
 =================================
 
-System with complex memory topology needs a more versatile memory topology
-description than just node where a node is a collection of memory and CPU.
-In heterogeneous memory system we consider four types of object::
-   - target: which is any kind of memory
-   - initiator: any kind of device or CPU
-   - inter-connect: any kind of links that connects target and initiator
-   - bridge: a link between two inter-connects
-
-Properties (like bandwidth, latency, bus width, ...) are define per bridge
-and per inter-connect. Property of an inter-connect apply to all initiators
-which are link to that inter-connect. Not all initiators are link to all
-inter-connect and thus not all initiators can access all memory (this apply
-to CPU too ie some CPU might not be able to access all memory).
-
-Bridges allow initiators (that can use the bridge) to access target for
-which they do not have a direct link with (ie they do not share a common
-inter-connect with the target).
-
-Through this four types of object we can describe any kind of system memory
-topology. To expose this to userspace we expose a new sysfs hierarchy (that
-co-exist with the existing one)::
-   - /sys/bus/hms/target* all targets in the system
-   - /sys/bus/hms/initiator* all initiators in the system
-   - /sys/bus/hms/interconnect* all inter-connects in the system
-   - /sys/bus/hms/bridge* all bridges in the system
-
-Inside each bridge or inter-connect directory they are symlinks to targets
-and initiators that are linked to that bridge or inter-connect. Properties
-are defined inside bridge and inter-connect directory.
+Heterogeneous memory systems are becoming the norm. In those systems
+there is not only the main system memory for each node, but also
+device memory and/or a memory hierarchy to consider. Device memory can
+come from a device like a GPU or an FPGA, or from a memory only device
+(persistent memory, or a high density memory device).
+
+A memory hierarchy is when you not only have the main memory but also
+other types of memory, like HBM (High Bandwidth Memory, often stacked
+on the CPU or GPU die), persistent memory or high density memory (ie
+something slower than a regular DDR DIMM but much bigger).
+
+On top of this diversity of memories you also have to account for the
+system bus topology, ie how all CPUs and devices are connected to each
+other. Userspace does not care about the exact physical topology but
+cares about topology from a behavior point of view, ie what are all
+the paths between an initiator (anything that can initiate memory
+access, like a CPU, GPU, FPGA, network controller, ...) and a target
+memory, and what are the properties of each of those paths (bandwidth,
+latency, granularity, ...).
+
+This means that it is no longer sufficient to consider a flat view
+for each node in a system: for maximum performance we need to account
+not only for all of this new memory but also for the system topology.
+This is why this proposal is unlike the HMAT proposal [1], which tries
+to extend the existing NUMA model to new types of memory. Here we are
+tackling a much more profound change that departs from NUMA.
+
+
+One of the reasons for such a radical change is that the advance of
+accelerators like GPUs or FPGAs means that the CPU is no longer the
+only place where computation happens. It is becoming more and more
+common for an application to mix and match different accelerators to
+perform its computation. So we can no longer satisfy ourselves with a
+CPU centric and flat view of a system like NUMA and NUMA distance.
+
+
+HMS tackles these problems through three aspects:
+    1 - Expose complex system topology and the various kinds of memory
+        to user space so that applications have a standard way and a
+        single place to get all the information they care about.
+    2 - A new API for user space to bind or provide hints to the
+        kernel on which memory to use for a range of virtual addresses
+        (a new hbind() syscall, similar in spirit to mbind()).
+    3 - Kernel side changes to the vm policy code to handle the above.
+
+
+The rest of this document is split in 3 sections. The first section
+talks about complex system topology: what it is, how it is used today
+and how to describe it tomorrow. The second section talks about the
+new API to bind or provide hints to the kernel for a range of virtual
+addresses. The third section talks about the new mechanism to track,
+inside the kernel, the bindings and hints provided by user space or by
+device drivers.
+
+
+1) Complex system topology and representing them
+================================================
+
+Inside a node you can have a complex topology of memory. For instance
+you can have multiple HBM memories in a node, each HBM memory tied to
+a set of CPUs (all of which are in the same node). This means that you
+have a hierarchy of memory for the CPUs: the local fast HBM, which is
+expected to be relatively small compared to main memory, and then the
+main memory itself. New memory technologies might also deepen this
+hierarchy with another level of yet slower memory but gigantic in size
+(some persistent memory technologies might fall into that category).
+Another example is device memory, and devices themselves can have a
+hierarchy, like HBM on top of the device cores and main device memory.
+
+On top of that you can have multiple paths to access each memory and
+each path can have different properties (latency, bandwidth, ...).
+Also there is not always symmetry, ie some memory might only be
+accessible by some devices or CPUs, ie not accessible by everyone.
+
+So a flat hierarchy for each node is not capable of representing this
+kind of complexity. To simplify the discussion, and because we do not
+want to single out CPUs from devices, from here on out we will use
+"initiator" to refer to either a CPU or a device. An initiator is any
+kind of CPU or device that can access memory (ie initiate memory
+access).
+
+At this point an example of such a system might help:
+    - 2 nodes and for each node:
+        - 1 CPU per node with 2 complexes of CPU cores per CPU
+        - one HBM memory for each complex of CPU cores (200GB/s)
+        - CPU core complexes are linked to each other (100GB/s)
+        - main memory (90GB/s)
+        - 4 GPUs, each with:
+            - HBM memory for each GPU (1000GB/s) (not CPU accessible)
+            - GDDR memory for each GPU (500GB/s) (CPU accessible)
+            - connected to the CPU root controller (60GB/s)
+            - connected to other GPUs (even GPUs from the second
+              node) with a GPU link (400GB/s)
+
+In this example we restrict ourselves to bandwidth and ignore bus
+width or latency; this is just to simplify the discussion but
+obviously they also factor in.
+
+
+Userspace very much would like to know this information. For
+instance, HPC folks have developed complex libraries to manage this
+and there is wide research on the topic [2] [3] [4] [5]. Today most of
+the work is done by hardcoding things for a specific platform, which
+is somewhat acceptable for HPC folks where the platform stays the same
+for a long period of time.
+
+Roughly speaking I see two broad use cases for topology information.
+The first is virtualization and VMs, where you want to segment your
+hardware properly for each VM (binding together memory, CPUs and GPUs
+that are all close to each other). The second is applications, many of
+which can partition their workload to minimize exchanges between
+partitions, allowing each partition to be bound to a subset of devices
+and CPUs that are close to each other (for maximum locality). Here it
+is much more than just NUMA distance: you can leverage the memory
+hierarchy and the system topology all together (see [2] [3] [4] [5]
+for more references and details).
+
+So this is not about exposing topology just for the sake of cool
+graphs in userspace. There are active users of such information today,
+and if we want to grow and broaden the usage we should provide a
+unified API to standardize how that information is accessed by
+everyone.
+
+
+One proposal so far to handle new types of memory is to use CPU-less
+nodes for them [6]. While the same idea can apply to device memory, it
+is still hard to describe multiple paths with different properties in
+such a scheme. While it is backward compatible and requires minimal
+changes, it simply cannot convey complex topologies (think any kind of
+random graph, not just a tree-like graph).
+
+So HMS uses a new way to expose the system topology to userspace. It
+relies on 4 types of objects:
+    - target: any kind of memory (main memory, HBM, device, ...)
+    - initiator: CPU or device (anything that can access memory)
+    - link: anything that links initiators and targets
+    - bridge: anything that allows a group of initiators to access
+      remote targets (ie targets they are not directly connected to
+      through a link)
+
+Properties like bandwidth, latency, ... are all set per bridge and
+per link. All initiators connected to a link can access any target
+memory also connected to the same link, all with the same link
+properties.
+
+Links do not need to match physical hardware, ie a single physical
+link can be exposed as one or several software links. This allows
+modeling devices connected to the same physical link (like PCIE for
+instance) but not with the same characteristics (like the number of
+lanes or the lane speed in PCIE). The reverse is also true, ie a
+single software exposed link can match multiple physical links.
+
+Bridges allow initiators to reach remote links. A bridge connects two
+links to each other and is also specific to a list of initiators (ie
+not all initiators connected to each of the links can use the bridge).
+Bridges have their own properties (bandwidth, latency, ...) so that
+the effective value of each property along a path is the lowest common
+denominator between the bridge and each of the links (see the sketch
+below).
+
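+As an illustration of that rule, computing the effective bandwidth of
+a path made of two links joined by a bridge simply boils down to
+taking the minimum along the path (this is only a sketch, not an
+actual HMS data structure)::
+
+    struct hms_link_prop   { unsigned int bandwidth; /* MB/s */ };
+    struct hms_bridge_prop { unsigned int bandwidth; /* MB/s */ };
+
+    /* Effective bandwidth of the path: link a -> bridge -> link b */
+    static unsigned int path_bandwidth(const struct hms_link_prop *a,
+                                       const struct hms_bridge_prop *bridge,
+                                       const struct hms_link_prop *b)
+    {
+        unsigned int bw = a->bandwidth;
+
+        if (bridge->bandwidth < bw)
+            bw = bridge->bandwidth;
+        if (b->bandwidth < bw)
+            bw = b->bandwidth;
+        return bw;
+    }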
+
+This model allows describing any kind of directed graph and thus any
+kind of topology we might see in the future. It also makes it easier
+to add new properties to each object type.
+
+Moreover it can be used to expose devices capable of doing peer to
+peer between them. For that, simply have all devices capable of peer
+to peer share a common link, or use the bridge object if the peer to
+peer capability is only one way for instance.
+
+
+HMS uses the above scheme to expose the system topology through sysfs
+under /sys/bus/hms/ with:
+    - /sys/bus/hms/devices/v%version-%id-target/ : a target memory,
+      each has a UID and you find the usual values in that folder
+      (node id, size, ...)
+
+    - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator
+      (CPU or device), each has an HMS UID but also a CPU id for CPUs
+      (which matches the CPU id in /sys/bus/cpu/). For a device you
+      have a path that can be the PCIE bus ID for instance.
+
+    - /sys/bus/hms/devices/v%version-%id-link : a link, each has a
+      UID and a file per property (bandwidth, latency, ...); you also
+      find a symlink to every target and initiator connected to that
+      link.
+
+    - /sys/bus/hms/devices/v%version-%id-bridge : a bridge, each has
+      a UID and a file per property (bandwidth, latency, ...); you
+      also find a symlink to all initiators that can use that bridge.
+
+To help with forward compatibility each object has a version value,
+and it is mandatory for user space to only use targets or initiators
+with a version supported by that user space. For instance, if user
+space only knows what version 1 means and sees a target with version
+2, then it must ignore that target as if it did not exist (a sketch of
+this rule is shown below).
+
+Mandating that allows the addition of new properties that break
+backward compatibility, ie user space must know how a new property
+affects the object to be able to use it safely.
+
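+As a concrete illustration of both the layout and the version rule, a
+user space scanner for this proposed interface could look like the
+sketch below (the directory name pattern follows the description
+above; nothing here is a merged kernel ABI)::
+
+    #include <dirent.h>
+    #include <stdio.h>
+
+    #define HMS_MAX_VERSION 1   /* highest version we understand */
+
+    int main(void)
+    {
+        DIR *dir = opendir("/sys/bus/hms/devices");
+        struct dirent *ent;
+        unsigned int version, uid;
+        char type[32];
+
+        if (!dir)
+            return 1;   /* kernel without HMS support */
+
+        while ((ent = readdir(dir)) != NULL) {
+            /* entries look like v1-42-target, v1-7-link, ... */
+            if (sscanf(ent->d_name, "v%u-%u-%31s",
+                       &version, &uid, type) != 3)
+                continue;
+            if (version > HMS_MAX_VERSION)
+                continue;   /* unknown version: pretend it does not exist */
+            printf("%s uid=%u version=%u\n", type, uid, version);
+        }
+        closedir(dir);
+        return 0;
+    }
+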
+The main memory of each node is exposed under a common target. For
+now device drivers are responsible for registering the memory they
+want to expose through this scheme, but in the future that information
+might come from the system firmware (this is a different discussion).
+
+
+
+2) hbind() bind range of virtual address to heterogeneous memory
+================================================================
+
+Instead of using a node bitmap like mbind(), hbind() takes an array
+of UIDs, where each UID identifies a unique memory target inside the
+new memory topology description. User space also provides an array of
+modifiers. A modifier can be seen as the flags parameter of mbind(),
+but here we use an array so that user space can not only supply a
+modifier but also a value with it. This should allow the API to grow
+more features in the future. The kernel should return -EINVAL and
+ignore the call altogether if it is provided with an unknown modifier,
+forcing user space to restrict itself to the modifiers supported by
+the kernel it is running on (I know, I am dreaming about well behaved
+user space). An illustrative sketch of such a call follows.
+
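+Purely as an illustration of the shape of the call (the actual syscall
+number, structure layout and modifier names are defined by the rest of
+this series and may differ; everything below is a placeholder)::
+
+    /* Hypothetical userspace sketch, not the real ABI. */
+    struct hms_modifier {
+        unsigned int id;        /* which modifier */
+        unsigned long value;    /* value associated with it */
+    };
+
+    /* Place [addr, addr + len) on the GPU HBM target, falling back
+     * to the main memory target, both identified by their HMS UIDs. */
+    unsigned int targets[2] = { gpu_hbm_uid, main_memory_uid };
+    struct hms_modifier mods[1] = { { HMS_MODIFIER_MIGRATE, 0 } };
+
+    int err = hbind(addr, len, targets, 2, mods, 1);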
+
+Note that none of this is exclusive of automatic memory placement like
+autonuma. I also believe that we will see something similar to autonuma
+for device memory.
+
+
+3) Tracking and applying heterogeneous memory policies
+======================================================
+
+The current memory policy infrastructure is node oriented. Instead of
+changing that and risking breakage and regressions, HMS adds a new
+heterogeneous policy tracking infrastructure. The expectation is that
+existing applications can keep using mbind() and all the existing
+infrastructure undisturbed and unaffected, while new applications will
+use the new API and should avoid mixing and matching both (as they can
+achieve the same thing with the new API).
+
+Also the policy is not directly tied to the vma structure, for a few
+reasons:
+    - avoid having to split a vma for a policy that does not cover the
+      full vma
+    - avoid changing too much vma code
+    - avoid growing the vma structure with an extra pointer
+
+The overall design is simple: on an hbind() call, an HMS policy
+structure is created for the supplied range and HMS uses the callback
+associated with the target memory. This callback is provided by the
+device driver for device memory, or by core HMS for regular main
+memory. The callback can decide to migrate the range to the target
+memories or to do nothing (this can be influenced by flags provided to
+hbind() too). A rough sketch of that callback idea follows.
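+
+As an illustrative sketch of that callback idea (not the structure
+actually defined later in this series; names are placeholders)::
+
+    struct hms_target;  /* per-target object registered with HMS */
+
+    struct hms_target_ops {
+        /*
+         * Called when an hbind() policy covering [start, end) in mm
+         * names this target; the driver may migrate pages to its
+         * memory or do nothing, depending on the hbind() flags.
+         */
+        int (*bind)(struct hms_target *target, struct mm_struct *mm,
+                    unsigned long start, unsigned long end,
+                    unsigned long flags);
+    };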