Message ID | 20181203233509.20671-3-jglisse@redhat.com
---|---
State | New, archived
Series | Heterogeneous Memory System (HMS) and hbind()
jglisse@redhat.com writes:

> +To help with forward compatibility each object as a version value and it is mandatory for user space to only use target or initiator with version supported by the user space. For instance if user space only knows about what version 1 means and sees a target with version 2 then the user space must ignore that target as if it does not exist.

So once v2 is introduced all applications that only support v1 break.

That seems very un-Linux and will break Linus' "do not break existing applications" rule.

The standard approach when you add something incompatible is to add new fields, but keep the old ones.

> +2) hbind() bind range of virtual address to heterogeneous memory
> +================================================================
> +
> +So instead of using a bitmap, hbind() take an array of uid and each uid is a unique memory target inside the new memory topology description.

You didn't define what an uid is? User id? Please use sensible terminology that doesn't conflict with existing usages.

I assume it's some kind of number that identifies a node in your graph.

> +User space also provide an array of modifiers. Modifier can be seen as the flags parameter of mbind() but here we use an array so that user space can not only supply a modifier but also value with it. This should allow the API to grow more features in the future. Kernel should return -EINVAL if it is provided with an unkown modifier and just ignore the call all together, forcing the user space to restrict itself to modifier supported by the kernel it is running on (i know i am dreaming about well behave user space).

It sounds like you're trying to define a system call with a built-in ioctl? Is that really a good idea?

If you need ioctl you know where to find it.

Please don't over-design APIs like this.

> +3) Tracking and applying heterogeneous memory policies
> +======================================================
> +
> +Current memory policy infrastructure is node oriented, instead of changing that and risking breakage and regression HMS adds a new heterogeneous policy tracking infra-structure. The expectation is that existing application can keep using mbind() and all existing infrastructure under-disturb and unaffected, while new application will use the new API and should avoid mix and matching both (as they can achieve the same thing with the new API).

I think we need a stronger motivation to define a completely parallel and somewhat redundant infrastructure. What breakage are you worried about?

The obvious alternative would of course be to add some extra enumeration to the existing nodes.

It's a strange document. It goes from very high level to low level with nothing in between. I think you need a lot more details in the middle, in particular how these new interfaces should be used. For example how should an application know how to look for a specific type of device? How is an automated tool supposed to use the enumeration? etc.

-Andi
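To make the rule in the quoted documentation concrete: a user space program scanning the topology is expected to skip any target whose version it does not recognize. A minimal sketch, assuming a hypothetical sysfs location and file names (not the actual layout from the patch series):

    /*
     * Illustration only: scan a hypothetical HMS sysfs tree and skip any
     * target whose version this program does not understand, which is the
     * rule the quoted documentation describes.  The directory path and the
     * "version" file name are assumptions, not the layout from the patch.
     */
    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    #define HMS_DIR           "/sys/bus/hms/devices"  /* assumed location */
    #define MAX_KNOWN_VERSION 1                       /* what we understand */

    static int target_version(const char *name)
    {
        char path[512];
        int version = -1;
        FILE *f;

        snprintf(path, sizeof(path), HMS_DIR "/%s/version", name);
        f = fopen(path, "r");
        if (!f)
            return -1;
        if (fscanf(f, "%d", &version) != 1)
            version = -1;
        fclose(f);
        return version;
    }

    int main(void)
    {
        DIR *d = opendir(HMS_DIR);
        struct dirent *e;

        if (!d)
            return 1;
        while ((e = readdir(d)) != NULL) {
            int v;

            if (strncmp(e->d_name, "target", 6) != 0)
                continue;
            v = target_version(e->d_name);
            if (v < 0 || v > MAX_KNOWN_VERSION)
                continue;   /* unknown version: act as if it does not exist */
            printf("usable target: %s (v%d)\n", e->d_name, v);
        }
        closedir(d);
        return 0;
    }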
On Tue, Dec 04, 2018 at 09:06:59AM -0800, Andi Kleen wrote:
> jglisse@redhat.com writes:
>
> > +To help with forward compatibility each object as a version value and it is mandatory for user space to only use target or initiator with version supported by the user space. For instance if user space only knows about what version 1 means and sees a target with version 2 then the user space must ignore that target as if it does not exist.
>
> So once v2 is introduced all applications that only support v1 break.
>
> That seems very un-Linux and will break Linus' "do not break existing applications" rule.
>
> The standard approach that if you add something incompatible is to add new field, but keep the old ones.

No, that is not how it is supposed to work. Let's say it is 2018 and you have v1 memory (like your regular main DDR memory for instance); then it will always be exposed as v1 memory.

Fast forward to 2020 and you have this new type of memory that is not cache coherent and you want to expose it to userspace through HMS. What you do is a kernel patch that introduces the v2 type for targets and defines a set of new sysfs files to describe what v2 is. On this new computer you report your usual main memory as v1 and your new memory as v2.

So an application that only knew about v1 will keep using any v1 memory on your new platform, but it will not use any of the new v2 memory, which is what you want to happen. You do not have to break existing applications while still being able to add new types of memory. Sorry if it was unclear. I will try to reformulate and give an example like the one above.

> > +2) hbind() bind range of virtual address to heterogeneous memory
> > +================================================================
> > +
> > +So instead of using a bitmap, hbind() take an array of uid and each uid is a unique memory target inside the new memory topology description.
>
> You didn't define what an uid is?
>
> user id ?
>
> Please use sensible terminology that doesn't conflict with existing usages.
>
> I assume it's some kind of number that identifies a node in your graph.

Correct, the uid is a unique id given to each node in the graph. I will clarify that.

> > +User space also provide an array of modifiers. Modifier can be seen as the flags parameter of mbind() but here we use an array so that user space can not only supply a modifier but also value with it. This should allow the API to grow more features in the future. Kernel should return -EINVAL if it is provided with an unkown modifier and just ignore the call all together, forcing the user space to restrict itself to modifier supported by the kernel it is running on (i know i am dreaming about well behave user space).
>
> It sounds like you're trying to define a system call with built in ioctl? Is that really a good idea?
>
> If you need ioctl you know where to find it.

Well, i would like to get things running in the wild with some guinea pig users to get feedback from end users. It would be easier if i could do this with an upstream kernel and not some random branch in my private repo. While doing that i would like to avoid committing to a syscall upstream. So the way i see around this is doing a driver under staging with an ioctl, which would be turned into a syscall once some confidence in the API is gained. If you think i should do a syscall right away i am not against doing that.

> Please don't over design APIs like this.

There are 2 approaches here. I can define 2 syscalls, one for migration and one for policy. Migration and policy are 2 different things from the existing user point of view. By defining 2 syscalls i can cut them down to do one thing and one thing only, and make each as simple and lean as possible. In the present version i took the other approach of defining just one API that can grow to do more things. I know the unix way is one simple tool for one simple job. I can switch to the simple call for one action.

> > +3) Tracking and applying heterogeneous memory policies
> > +======================================================
> > +
> > +Current memory policy infrastructure is node oriented, instead of changing that and risking breakage and regression HMS adds a new heterogeneous policy tracking infra-structure. The expectation is that existing application can keep using mbind() and all existing infrastructure under-disturb and unaffected, while new application will use the new API and should avoid mix and matching both (as they can achieve the same thing with the new API).
>
> I think we need a stronger motivation to define a completely parallel and somewhat redundant infrastructure. What breakage are you worried about?

Some memory exposed through HMS is not allocated by the regular memory allocator. For instance GPU memory is managed by the GPU driver, so when you want to use GPU memory (either as a policy or by migrating to it) you need to use the GPU allocator to allocate that memory. HMS adds a bunch of callbacks to the target structure so that a device driver can expose a generic API to the core kernel to do such allocations.

Now i could change existing code paths to use the target structure as an intermediary for allocation, but that means changing hot code paths and i doubt it would be welcome today. Eventually i think we will want that to happen, and we can work on minimizing the cost for users that do not use things like GPUs. The transition phase will take time (a couple of years) and i would like to avoid disturbing existing workloads while we migrate GPU users to this new API.

> The obvious alternative would of course be to add some extra enumeration to the existing nodes.

We can not extend NUMA nodes to expose GPU memory. GPU memory on current AMD and Intel platforms is not cache coherent and thus should not be used for random memory allocations. It should really stay something users have to explicitly select to use. Note that the usage we have here is that when you use GPU memory it is as if the range of virtual addresses is swapped out from the CPU point of view, but the GPU can access it.

> It's a strange document. It goes from very high level to low level with nothing inbetween. I think you need a lot more details in the middle, in particularly how these new interfaces should be used. For example how should an application know how to look for a specific type of device? How is an automated tool supposed to use the enumeration? etc.

Today users use dedicated APIs (OpenCL, ROCm, CUDA, ...) and those high level APIs all have the API i present here in one form or another. So i want to move this high level API, which is actively used by programs today, into the kernel. The end game is to create common infrastructure for various accelerator hardware (GPU, FPGA, ...) to manage memory. This is something asked for by end users for one simple reason.

Today users have to mix and match multiple APIs in their application, and when they want to exchange data between one device that uses one API and another device that uses another API they have to do an explicit copy and rebuild their data structure inside the new memory. When you move over things like trees or any complex data structure you have to rebuild it, ie redo the pointer links between the nodes of your data structure.

This is highly error prone, complex and wasteful (you have to burn CPU cycles to do that). Now if you can use the same address space as all the other memory allocations in your program, and move data around from one device to another with a common API that works on all the various devices, you are eliminating that complex step and making the end user's life much easier.

So i am doing this to help existing users by addressing an issue that is becoming harder and harder to solve from userspace. My end game is to blur the boundary between the CPU and devices like GPUs, FPGAs, ...

Thank you for taking the time to read this proposal and for your feedback. Much appreciated. I will try to include your comments in my v2.

Cheers,
Jérôme
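For reference, a rough sketch of the single-call shape discussed above, an ordered array of target uids plus an array of (modifier, value) pairs; the names, the types and the entry point are assumptions, since the series deliberately starts with a staging ioctl rather than a committed syscall:

    /*
     * Sketch of the call shape described above, not an existing interface:
     * bind a range of virtual addresses to an ordered list of target uids,
     * plus an array of (modifier, value) pairs so the API can grow.  All
     * names and types here are assumptions.
     */
    #include <stddef.h>
    #include <stdint.h>

    struct hms_modifier {
        uint32_t id;      /* e.g. a hypothetical HMS_MOD_MIGRATE */
        uint64_t value;   /* argument for that modifier */
    };

    /*
     * Would return 0 on success and -1 with errno set; EINVAL for an
     * unknown modifier, as the documentation above requires.
     */
    int hbind(void *addr, size_t len,
              const uint32_t *target_uids, unsigned int ntargets,
              const struct hms_modifier *mods, unsigned int nmods);

    /* Example use: prefer target uid 42, fall back to target uid 7. */
    int bind_buffer(void *buf, size_t len)
    {
        uint32_t targets[] = { 42, 7 };

        return hbind(buf, len, targets, 2, NULL, 0);
    }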
On Tue, Dec 4, 2018 at 10:24 AM Jerome Glisse <jglisse@redhat.com> wrote: > > On Tue, Dec 04, 2018 at 09:06:59AM -0800, Andi Kleen wrote: > > jglisse@redhat.com writes: > > > > > + > > > +To help with forward compatibility each object as a version value and > > > +it is mandatory for user space to only use target or initiator with > > > +version supported by the user space. For instance if user space only > > > +knows about what version 1 means and sees a target with version 2 then > > > +the user space must ignore that target as if it does not exist. > > > > So once v2 is introduced all applications that only support v1 break. > > > > That seems very un-Linux and will break Linus' "do not break existing > > applications" rule. > > > > The standard approach that if you add something incompatible is to > > add new field, but keep the old ones. > > No that's not how it is suppose to work. So let says it is 2018 and you > have v1 memory (like your regular main DDR memory for instance) then it > will always be expose a v1 memory. > > Fast forward 2020 and you have this new type of memory that is not cache > coherent and you want to expose this to userspace through HMS. What you > do is a kernel patch that introduce the v2 type for target and define a > set of new sysfs file to describe what v2 is. On this new computer you > report your usual main memory as v1 and your new memory as v2. > > So the application that only knew about v1 will keep using any v1 memory > on your new platform but it will not use any of the new memory v2 which > is what you want to happen. You do not have to break existing application > while allowing to add new type of memory. That sounds needlessly restrictive. Let the kernel arbitrate what memory an application gets, don't design a system where applications are hard coded to a memory type. Applications can hint, or optionally specify an override and the kernel can react accordingly.
On Tue, Dec 04, 2018 at 10:31:17AM -0800, Dan Williams wrote: > On Tue, Dec 4, 2018 at 10:24 AM Jerome Glisse <jglisse@redhat.com> wrote: > > > > On Tue, Dec 04, 2018 at 09:06:59AM -0800, Andi Kleen wrote: > > > jglisse@redhat.com writes: > > > > > > > + > > > > +To help with forward compatibility each object as a version value and > > > > +it is mandatory for user space to only use target or initiator with > > > > +version supported by the user space. For instance if user space only > > > > +knows about what version 1 means and sees a target with version 2 then > > > > +the user space must ignore that target as if it does not exist. > > > > > > So once v2 is introduced all applications that only support v1 break. > > > > > > That seems very un-Linux and will break Linus' "do not break existing > > > applications" rule. > > > > > > The standard approach that if you add something incompatible is to > > > add new field, but keep the old ones. > > > > No that's not how it is suppose to work. So let says it is 2018 and you > > have v1 memory (like your regular main DDR memory for instance) then it > > will always be expose a v1 memory. > > > > Fast forward 2020 and you have this new type of memory that is not cache > > coherent and you want to expose this to userspace through HMS. What you > > do is a kernel patch that introduce the v2 type for target and define a > > set of new sysfs file to describe what v2 is. On this new computer you > > report your usual main memory as v1 and your new memory as v2. > > > > So the application that only knew about v1 will keep using any v1 memory > > on your new platform but it will not use any of the new memory v2 which > > is what you want to happen. You do not have to break existing application > > while allowing to add new type of memory. > > That sounds needlessly restrictive. Let the kernel arbitrate what > memory an application gets, don't design a system where applications > are hard coded to a memory type. Applications can hint, or optionally > specify an override and the kernel can react accordingly. You do not want to randomly use non cache coherent memory inside your application :) This is not gonna go well with C++ or atomic :) Yes they are legitimate use case where application can decide to give up cache coherency temporarily for a range of virtual address. But the application needs to understand what it is doing and opt in to do that knowing full well that. The version thing allows for scenario like. You do not have to define a new version with every new type of memory. If your new memory has all the properties of v1 than you expose it as v1 and old application on the new platform will use your new memory type being non the wiser. The version thing is really to exclude user from using something they do not want to use without understanding the consequences of doing so. Cheers, Jérôme
On 2018-12-04 11:57 a.m., Jerome Glisse wrote: >> That sounds needlessly restrictive. Let the kernel arbitrate what >> memory an application gets, don't design a system where applications >> are hard coded to a memory type. Applications can hint, or optionally >> specify an override and the kernel can react accordingly. > > You do not want to randomly use non cache coherent memory inside your > application :) This is not gonna go well with C++ or atomic :) Yes they > are legitimate use case where application can decide to give up cache > coherency temporarily for a range of virtual address. But the application > needs to understand what it is doing and opt in to do that knowing full > well that. The version thing allows for scenario like. You do not have > to define a new version with every new type of memory. If your new memory > has all the properties of v1 than you expose it as v1 and old application > on the new platform will use your new memory type being non the wiser. I agree with Dan and the general idea that this version thing is really ugly. Define some standard attributes so the application can say "I want cache-coherent, high bandwidth memory". If there's some future new-memory attribute, then the application needs to know about it to request it. Also, in the same vein, I think it's wrong to have the API enumerate all the different memory available in the system. The API should simply allow userspace to say it wants memory that can be accessed by a set of initiators with a certain set of attributes and the bind call tries to fulfill that or fallback on system memory/hmm migration/whatever. Logan
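A minimal sketch of the attribute-driven request suggested here, where the caller states required properties and the initiators that must reach the memory instead of naming a memory type; every identifier below is hypothetical:

    /*
     * Rough sketch of the attribute-driven request described above: the
     * caller states required properties and the initiators that must be
     * able to access the memory, and the kernel picks (or falls back to
     * system memory).  Every identifier here is hypothetical.
     */
    #include <stddef.h>
    #include <stdint.h>

    #define HMEM_CACHE_COHERENT  (1u << 0)
    #define HMEM_PERSISTENT      (1u << 1)
    #define HMEM_HIGH_BANDWIDTH  (1u << 2)
    #define HMEM_LOW_LATENCY     (1u << 3)

    struct hmem_request {
        uint32_t required;          /* properties the memory must have */
        uint32_t preferred;         /* nice to have, kernel may ignore */
        const int *initiator_fds;   /* devices that must reach the memory */
        unsigned int ninitiators;
    };

    /* Hypothetical call: satisfy the request for [addr, addr + len). */
    int hmem_bind(void *addr, size_t len, const struct hmem_request *req);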
On Tue, Dec 4, 2018 at 10:58 AM Jerome Glisse <jglisse@redhat.com> wrote: > > On Tue, Dec 04, 2018 at 10:31:17AM -0800, Dan Williams wrote: > > On Tue, Dec 4, 2018 at 10:24 AM Jerome Glisse <jglisse@redhat.com> wrote: > > > > > > On Tue, Dec 04, 2018 at 09:06:59AM -0800, Andi Kleen wrote: > > > > jglisse@redhat.com writes: > > > > > > > > > + > > > > > +To help with forward compatibility each object as a version value and > > > > > +it is mandatory for user space to only use target or initiator with > > > > > +version supported by the user space. For instance if user space only > > > > > +knows about what version 1 means and sees a target with version 2 then > > > > > +the user space must ignore that target as if it does not exist. > > > > > > > > So once v2 is introduced all applications that only support v1 break. > > > > > > > > That seems very un-Linux and will break Linus' "do not break existing > > > > applications" rule. > > > > > > > > The standard approach that if you add something incompatible is to > > > > add new field, but keep the old ones. > > > > > > No that's not how it is suppose to work. So let says it is 2018 and you > > > have v1 memory (like your regular main DDR memory for instance) then it > > > will always be expose a v1 memory. > > > > > > Fast forward 2020 and you have this new type of memory that is not cache > > > coherent and you want to expose this to userspace through HMS. What you > > > do is a kernel patch that introduce the v2 type for target and define a > > > set of new sysfs file to describe what v2 is. On this new computer you > > > report your usual main memory as v1 and your new memory as v2. > > > > > > So the application that only knew about v1 will keep using any v1 memory > > > on your new platform but it will not use any of the new memory v2 which > > > is what you want to happen. You do not have to break existing application > > > while allowing to add new type of memory. > > > > That sounds needlessly restrictive. Let the kernel arbitrate what > > memory an application gets, don't design a system where applications > > are hard coded to a memory type. Applications can hint, or optionally > > specify an override and the kernel can react accordingly. > > You do not want to randomly use non cache coherent memory inside your > application :) The kernel arbitrates memory, it's a bug if it hands out something that exotic to an unaware application.
On Tue, Dec 04, 2018 at 12:11:42PM -0700, Logan Gunthorpe wrote:
> On 2018-12-04 11:57 a.m., Jerome Glisse wrote:
> >> That sounds needlessly restrictive. Let the kernel arbitrate what memory an application gets, don't design a system where applications are hard coded to a memory type. Applications can hint, or optionally specify an override and the kernel can react accordingly.
> >
> > You do not want to randomly use non cache coherent memory inside your application :) This is not gonna go well with C++ or atomic :) Yes they are legitimate use case where application can decide to give up cache coherency temporarily for a range of virtual address. But the application needs to understand what it is doing and opt in to do that knowing full well that. The version thing allows for scenario like. You do not have to define a new version with every new type of memory. If your new memory has all the properties of v1 than you expose it as v1 and old application on the new platform will use your new memory type being non the wiser.
>
> I agree with Dan and the general idea that this version thing is really ugly. Define some standard attributes so the application can say "I want cache-coherent, high bandwidth memory". If there's some future new-memory attribute, then the application needs to know about it to request it.

So version is a bad prefix; what about type, ie prefixing targets with a type id? That way applications that are looking for a certain type of memory (which has a set of defined properties) can select them. Having a type file inside the directory and hoping applications will read that sysfs file is a recipe for failure from my point of view, while having it in the directory name makes sure that the application has some idea of what it is doing.

> Also, in the same vein, I think it's wrong to have the API enumerate all the different memory available in the system. The API should simply allow userspace to say it wants memory that can be accessed by a set of initiators with a certain set of attributes and the bind call tries to fulfill that or fallback on system memory/hmm migration/whatever.

We have existing applications that use topology today to partition their workload and do load balancing. Those applications leverage the fact that they are only running on a small set of known platforms with known topology. Here i want to provide a common API so that topology can be queried in a standard way by applications.

Yes, basic applications will not leverage all this information and will be happy enough with "give me memory that will be fast for initiators A and B". That can easily be implemented inside a userspace library which dumbs down the topology on behalf of the application. I believe that proposing a new infrastructure should allow for maximum expressiveness. The HMS API in this proposal allows expressing any kind of directed graph, hence i do not see any limitation going forward. At the same time a userspace library can easily dumb this down for the average Joe/Jane application.

Cheers,
Jérôme
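A sketch of how a user space helper library might model the directed graph described above (initiators, targets and links, with bridges left out); the structure and field names are made up for illustration:

    /*
     * Sketch of how a user space helper library could model the directed
     * graph described above (initiators, targets, links; bridges omitted
     * for brevity).  Structure and field names are made up.
     */
    #include <stdint.h>

    struct hms_target {
        uint32_t uid;
        uint64_t size;              /* bytes of memory behind this target */
        uint32_t properties;        /* cache coherent, persistent, ... */
    };

    struct hms_link {
        uint32_t uid;
        uint64_t bandwidth;         /* bytes/s */
        uint64_t latency;           /* ns */
        uint32_t *initiator_uids;   /* initiators attached to this link */
        unsigned int ninitiators;
        uint32_t *target_uids;      /* targets attached to this link */
        unsigned int ntargets;
    };

    struct hms_topology {
        struct hms_target *targets;
        unsigned int ntargets;
        struct hms_link *links;
        unsigned int nlinks;
    };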
On Tue, Dec 04, 2018 at 11:19:23AM -0800, Dan Williams wrote: > On Tue, Dec 4, 2018 at 10:58 AM Jerome Glisse <jglisse@redhat.com> wrote: > > > > On Tue, Dec 04, 2018 at 10:31:17AM -0800, Dan Williams wrote: > > > On Tue, Dec 4, 2018 at 10:24 AM Jerome Glisse <jglisse@redhat.com> wrote: > > > > > > > > On Tue, Dec 04, 2018 at 09:06:59AM -0800, Andi Kleen wrote: > > > > > jglisse@redhat.com writes: > > > > > > > > > > > + > > > > > > +To help with forward compatibility each object as a version value and > > > > > > +it is mandatory for user space to only use target or initiator with > > > > > > +version supported by the user space. For instance if user space only > > > > > > +knows about what version 1 means and sees a target with version 2 then > > > > > > +the user space must ignore that target as if it does not exist. > > > > > > > > > > So once v2 is introduced all applications that only support v1 break. > > > > > > > > > > That seems very un-Linux and will break Linus' "do not break existing > > > > > applications" rule. > > > > > > > > > > The standard approach that if you add something incompatible is to > > > > > add new field, but keep the old ones. > > > > > > > > No that's not how it is suppose to work. So let says it is 2018 and you > > > > have v1 memory (like your regular main DDR memory for instance) then it > > > > will always be expose a v1 memory. > > > > > > > > Fast forward 2020 and you have this new type of memory that is not cache > > > > coherent and you want to expose this to userspace through HMS. What you > > > > do is a kernel patch that introduce the v2 type for target and define a > > > > set of new sysfs file to describe what v2 is. On this new computer you > > > > report your usual main memory as v1 and your new memory as v2. > > > > > > > > So the application that only knew about v1 will keep using any v1 memory > > > > on your new platform but it will not use any of the new memory v2 which > > > > is what you want to happen. You do not have to break existing application > > > > while allowing to add new type of memory. > > > > > > That sounds needlessly restrictive. Let the kernel arbitrate what > > > memory an application gets, don't design a system where applications > > > are hard coded to a memory type. Applications can hint, or optionally > > > specify an override and the kernel can react accordingly. > > > > You do not want to randomly use non cache coherent memory inside your > > application :) > > The kernel arbitrates memory, it's a bug if it hands out something > that exotic to an unaware application. In some case and for some period of time some application would like to use exotic memory for performance reasons. This does exist today. Graphics API routinely expose uncache memory to application and it has been doing so for many years. Some compute folks would like to have some of the benefit of that sometime. The idea is that you malloc() some memory in your application do stuff on the CPU, business as usual, then you gonna use that memory on some exotic device and for that device it would be best if you migrated that memory to uncache/uncoherent memory. If application knows its safe to do so then it can decide to pick such memory with HMS and migrate its malloced stuff there. This is not only happening in application, it can happen inside a library that the application use and the application might be totaly unaware of the library doing so. 
This is very common today in AI/ML workloads where all the various libraries in your AI/ML stack do things to the memory you handed them. It is all part of the library API contract. So there are legitimate use cases for this, hence why i would like to be able to expose exotic memory to userspace so that it can migrate regular allocations there when that makes sense.

Cheers,
Jérôme
On 2018-12-04 12:22 p.m., Jerome Glisse wrote: > So version is a bad prefix, what about type, prefixing target with a > type id. So that application that are looking for a certain type of > memory (which has a set of define properties) can select them. Having > a type file inside the directory and hopping application will read > that sysfs file is a recipies for failure from my point of view. While > having it in the directory name is making sure that the application > has some idea of what it is doing. Well I don't think it can be a prefix. It has to be a mask. It might be things like cache coherency, persistence, bandwidth and none of those things are mutually exclusive. >> Also, in the same vein, I think it's wrong to have the API enumerate all >> the different memory available in the system. The API should simply >> allow userspace to say it wants memory that can be accessed by a set of >> initiators with a certain set of attributes and the bind call tries to >> fulfill that or fallback on system memory/hmm migration/whatever. > > We have existing application that use topology today to partition their > workload and do load balancing. Those application leverage the fact that > they are only running on a small set of known platform with known topology > here i want to provide a common API so that topology can be queried in a > standard by application. Existing applications are not a valid excuse for poor API design. Remember, once this API is introduced and has real users, it has to be maintained *forever*, so we need to get it right. Providing users with more information than they need makes it exponentially harder to get right and support. Logan
On Tue, Dec 04, 2018 at 01:24:22PM -0500, Jerome Glisse wrote: > Fast forward 2020 and you have this new type of memory that is not cache > coherent and you want to expose this to userspace through HMS. What you > do is a kernel patch that introduce the v2 type for target and define a > set of new sysfs file to describe what v2 is. On this new computer you > report your usual main memory as v1 and your new memory as v2. > > So the application that only knew about v1 will keep using any v1 memory > on your new platform but it will not use any of the new memory v2 which > is what you want to happen. You do not have to break existing application > while allowing to add new type of memory. That seems entirely like the wrong model. We don't want to rewrite every application for adding a new memory type. Rather there needs to be an abstract way to query memory of specific behavior: e.g. cache coherent, size >= xGB, fastest or lowest latency or similar Sure there can be a name somewhere, but it should only be used for identification purposes, not to hard code in applications. Really you need to define some use cases and describe how your API handles them. > > > > It sounds like you're trying to define a system call with built in > > ioctl? Is that really a good idea? > > > > If you need ioctl you know where to find it. > > Well i would like to get thing running in the wild with some guinea pig > user to get feedback from end user. It would be easier if i can do this > with upstream kernel and not some random branch in my private repo. While > doing that i would like to avoid commiting to a syscall upstream. So the > way i see around this is doing a driver under staging with an ioctl which > will be turn into a syscall once some confidence into the API is gain. Ok that's fine I guess. But should be a clearly defined ioctl, not an ioctl with redefinable parameters (but perhaps I misunderstood your description) > In the present version i took the other approach of defining just one > API that can grow to do more thing. I know the unix way is one simple > tool for one simple job. I can switch to the simple call for one action. Simple calls are better. > > > +Current memory policy infrastructure is node oriented, instead of > > > +changing that and risking breakage and regression HMS adds a new > > > +heterogeneous policy tracking infra-structure. The expectation is > > > +that existing application can keep using mbind() and all existing > > > +infrastructure under-disturb and unaffected, while new application > > > +will use the new API and should avoid mix and matching both (as they > > > +can achieve the same thing with the new API). > > > > I think we need a stronger motivation to define a completely > > parallel and somewhat redundant infrastructure. What breakage > > are you worried about? > > Some memory expose through HMS is not allocated by regular memory > allocator. For instance GPU memory is manage by GPU driver, so when > you want to use GPU memory (either as a policy or by migrating to it) > you need to use the GPU allocator to allocate that memory. HMS adds > a bunch of callback to target structure so that device driver can > expose a generic API to core kernel to do such allocation. We already have nodes without memory. We can also take out nodes out of the normal fall back lists. We also have nodes with special memory (e.g. DMA32) Nothing you describe here cannot be handled with the existing nodes. 
> > The obvious alternative would of course be to add some extra > > enumeration to the existing nodes. > > We can not extend NUMA node to expose GPU memory. GPU memory on > current AMD and Intel platform is not cache coherent and thus > should not be use for random memory allocation. It should really Sure you don't expose it as normal memory, but it can be still tied to a node. In fact you have to for the existing topology interface to work. > copy and rebuild their data structure inside the new memory. When > you move over thing like tree or any complex data structure you have > to rebuilt it ie redo the pointers link between the nodes of your > data structure. > > This is highly error prone complex and wasteful (you have to burn > CPU cycles to do that). Now if you can use the same address space > as all the other memory allocation in your program and move data > around from one device to another with a common API that works on > all the various devices, you are eliminating that complex step and > making the end user life much easier. > > So i am doing this to help existing users by addressing an issues > that is becoming harder and harder to solve for userspace. My end > game is to blur the boundary between CPU and device like GPU, FPGA, This is just high level rationale. You already had that ... What I was looking for is how applications actually use the API. e.g. 1. Compute application is looking for fast cache coherent memory for CPU usage. What does it query and how does it decide and how does it allocate? 2. Allocator in OpenCL application is looking for memory to share with OpenCL. How does it find memory? 3. Storage application is looking for larger but slower memory for CPU usage. 4. ... Please work out some use cases like this. -Andi
On Tue, Dec 04, 2018 at 12:41:39PM -0700, Logan Gunthorpe wrote:
> On 2018-12-04 12:22 p.m., Jerome Glisse wrote:
> > So version is a bad prefix, what about type, prefixing target with a type id. So that application that are looking for a certain type of memory (which has a set of define properties) can select them. Having a type file inside the directory and hopping application will read that sysfs file is a recipies for failure from my point of view. While having it in the directory name is making sure that the application has some idea of what it is doing.
>
> Well I don't think it can be a prefix. It has to be a mask. It might be things like cache coherency, persistence, bandwidth and none of those things are mutually exclusive.

You are right, many are not exclusive. It is just my feeling that having a mask as a file inside the target directory might be overlooked by applications, which might start using things they should not. At the same time, i guess if i write the userspace library that abstracts this kernel API then i can force applications to properly select things. I will use a mask in v2.

> >> Also, in the same vein, I think it's wrong to have the API enumerate all the different memory available in the system. The API should simply allow userspace to say it wants memory that can be accessed by a set of initiators with a certain set of attributes and the bind call tries to fulfill that or fallback on system memory/hmm migration/whatever.
> >
> > We have existing application that use topology today to partition their workload and do load balancing. Those application leverage the fact that they are only running on a small set of known platform with known topology here i want to provide a common API so that topology can be queried in a standard by application.
>
> Existing applications are not a valid excuse for poor API design. Remember, once this API is introduced and has real users, it has to be maintained *forever*, so we need to get it right. Providing users with more information than they need makes it exponentially harder to get right and support.

I am not disagreeing on the pain of maintaining an API forever, but the fact remains that there are existing users, and without a standard way of exposing this it is impossible to say whether we will see more users for that information or whether it will just be the existing users that leverage it. I do not think there is a way to answer that question.

I am siding with the view that this API can be dumbed down in userspace by a common library. So let's expose the topology and let userspace dumb it down. If we dumb it down in the kernel i see a few pitfalls:
  - the kernel dumbing it down badly
  - kernel dumbing-down code growing out of control with gotchas per platform
  - it is still harder to fix the kernel than userspace in commercial user space (the whole RHEL business of slow moving and long supported kernels), so being able to fix things in userspace sounds pretty enticing

Cheers,
Jérôme
> Also, in the same vein, I think it's wrong to have the API enumerate all the different memory available in the system. The API should simply

We need an enumeration API too, just to display to the user what they have, and possibly for applications to size their buffers (all we do with existing NUMA nodes).

But yes, the default usage should be to query for necessary attributes.

-Andi
On 2018-12-04 1:13 p.m., Jerome Glisse wrote: > You are right many are non exclusive. It is just my feeling that having > a mask as a file inside the target directory might be overlook by the > application which might start using things it should not. At same time > i guess if i write the userspace library that abstract this kernel API > then i can enforce application to properly select thing. I think this is just evidence that this is not a good API. If the user has the option to just ignore things or do it wrong that's a problem with the API. Using a prefix for the name doesn't change that fact. > I do not think there is a way to answer that question. I am siding on the > side of this API can be dumb down in userspace by a common library. So let > expose the topology and let userspace dumb it down. I fundamentally disagree with this approach to designing APIs. Saying "we'll give you the kitchen sink, add another layer to deal with the complexity" is actually just eschewing API design and makes it harder for kernel folks to know what userspace actually requires because they are multiple layers away. > If we dumb it down in the kernel i see few pitfalls: > - kernel dumbing it down badly > - kernel dumbing down code can grow out of control with gotcha > for platform This is just a matter of designing the APIs well. Don't do it badly. > - it is still harder to fix kernel than userspace in commercial > user space (the whole RHEL business of slow moving and long > supported kernel). So on those being able to fix thing in > userspace sounds pretty enticing I hear this argument a lot and it's not compelling to me. I don't think we should make decisions in upstream code to allow RHEL to bypass the kernel simply because it would be easier for them to distribute code changes. Logan
On Tue, Dec 04, 2018 at 12:12:26PM -0800, Andi Kleen wrote:
> On Tue, Dec 04, 2018 at 01:24:22PM -0500, Jerome Glisse wrote:
> > Fast forward 2020 and you have this new type of memory that is not cache coherent and you want to expose this to userspace through HMS. What you do is a kernel patch that introduce the v2 type for target and define a set of new sysfs file to describe what v2 is. On this new computer you report your usual main memory as v1 and your new memory as v2.
> >
> > So the application that only knew about v1 will keep using any v1 memory on your new platform but it will not use any of the new memory v2 which is what you want to happen. You do not have to break existing application while allowing to add new type of memory.
>
> That seems entirely like the wrong model. We don't want to rewrite every application for adding a new memory type.
>
> Rather there needs to be an abstract way to query memory of specific behavior: e.g. cache coherent, size >= xGB, fastest or lowest latency or similar
>
> Sure there can be a name somewhere, but it should only be used for identification purposes, not to hard code in applications.

Discussion with Logan convinced me to use a mask for properties like:
  - cache coherent
  - persistent
  ...

Then files for other properties like:
  - bandwidth (bytes/s)
  - latency
  - granularity (size of individual access or bus width)
  ...

> Really you need to define some use cases and describe how your API handles them.

I have given examples of how applications look today and how they transform with HMS in my email exchange with Dave Hansen. I will add them to the documentation and to the cover letter in my next posting.

> > > It sounds like you're trying to define a system call with built in ioctl? Is that really a good idea?
> > >
> > > If you need ioctl you know where to find it.
> >
> > Well i would like to get thing running in the wild with some guinea pig user to get feedback from end user. It would be easier if i can do this with upstream kernel and not some random branch in my private repo. While doing that i would like to avoid commiting to a syscall upstream. So the way i see around this is doing a driver under staging with an ioctl which will be turn into a syscall once some confidence into the API is gain.
>
> Ok that's fine I guess.
>
> But should be a clearly defined ioctl, not an ioctl with redefinable parameters (but perhaps I misunderstood your description)
>
> > In the present version i took the other approach of defining just one API that can grow to do more thing. I know the unix way is one simple tool for one simple job. I can switch to the simple call for one action.
>
> Simple calls are better.

I will switch to one simple call for each individual action (policy and migration).

> > > > +Current memory policy infrastructure is node oriented, instead of changing that and risking breakage and regression HMS adds a new heterogeneous policy tracking infra-structure. The expectation is that existing application can keep using mbind() and all existing infrastructure under-disturb and unaffected, while new application will use the new API and should avoid mix and matching both (as they can achieve the same thing with the new API).
> > >
> > > I think we need a stronger motivation to define a completely parallel and somewhat redundant infrastructure. What breakage are you worried about?
> >
> > Some memory expose through HMS is not allocated by regular memory allocator. For instance GPU memory is manage by GPU driver, so when you want to use GPU memory (either as a policy or by migrating to it) you need to use the GPU allocator to allocate that memory. HMS adds a bunch of callback to target structure so that device driver can expose a generic API to core kernel to do such allocation.
>
> We already have nodes without memory.
> We can also take out nodes out of the normal fall back lists.
> We also have nodes with special memory (e.g. DMA32)
>
> Nothing you describe here cannot be handled with the existing nodes.

There have been patchsets in the past to exclude nodes from allocation; last time i checked they were all rejected and people felt it was not a good thing to do. Also, IIRC, adding more node types might be problematic as i think we do not have many bits left inside the flags field of struct page.

Right now i do not believe in exposing device memory as a generic node inside the linux kernel, because for many folks that would just be a waste: people only doing desktop and not using their GPU for compute will never get any good usage from it. Graphics memory allocation is wildly different from compute allocation, which is more like the CPU one. So converting graphics drivers to register their memory as nodes does not seem like a good idea at this time. I doubt the GPU folks upstream would accept that (with my GPU hat on, i would not).

> > > The obvious alternative would of course be to add some extra enumeration to the existing nodes.
> >
> > We can not extend NUMA node to expose GPU memory. GPU memory on current AMD and Intel platform is not cache coherent and thus should not be use for random memory allocation. It should really
>
> Sure you don't expose it as normal memory, but it can be still tied to a node. In fact you have to for the existing topology interface to work.

The existing topology interface is not used today for that memory and people in the GPU world do not see it as an interface that can be used. See the above discussion about GPU memory. This is the raison d'être of this proposal: a new way to expose heterogeneous memory to userspace.

> > copy and rebuild their data structure inside the new memory. When you move over thing like tree or any complex data structure you have to rebuilt it ie redo the pointers link between the nodes of your data structure.
> >
> > This is highly error prone complex and wasteful (you have to burn CPU cycles to do that). Now if you can use the same address space as all the other memory allocation in your program and move data around from one device to another with a common API that works on all the various devices, you are eliminating that complex step and making the end user life much easier.
> >
> > So i am doing this to help existing users by addressing an issues that is becoming harder and harder to solve for userspace. My end game is to blur the boundary between CPU and device like GPU, FPGA,
>
> This is just high level rationale. You already had that ...
>
> What I was looking for is how applications actually use the API.
>
> e.g.
>
> 1. Compute application is looking for fast cache coherent memory for CPU usage.
>
> What does it query and how does it decide and how does it allocate?

The application has an OpenCL context; from the context it gets the device's initiator unique id; from the device initiator unique id it looks at all the links and bridges the initiator is connected to. That gives it a list of links, and it can order that list using bandwidth first and latency second (ie 2 links with the same bandwidth will be ordered with the lower latency one first). It goes over that list from best to worst and for each link it looks at which targets are also connected to that link. From that it builds an ordered list of targets, keeping only cache coherent memory in that list.

It then uses this ordered list of targets to set a policy or migrate its buffer to the best memory. The kernel will first try to use the first target; if it runs out of that memory it will use the next target ... and so on and so forth. This can all be done inside a userspace common helper library for ease of use.

More advanced applications will do finer allocation; for instance they will partition their dataset by access frequency. The most accessed dataset in the application will use the fastest memory (which is likely to be somewhat small, ie a few GigaBytes), while datasets that are accessed more sparsely will be pushed to slower memory (but there is more of it).

> 2. Allocator in OpenCL application is looking for memory to share with OpenCL. How does it find memory?

Same process as above: start from the initiator id, build the list of links, then build the list of all targets that initiator can access. Then order that list according to the property of interest to the application (bandwidth, latency, ...). Once it has the target list it can use either policy or migration: policy if it is for a new allocation, migration if it is to migrate an existing buffer to memory that is more appropriate for the OpenCL device in use.

> 3. Storage application is looking for larger but slower memory for CPU usage.

The application builds a list of initiators corresponding to the CPUs it is using (and bound to). From that list of initiators it builds a list of links (considering bridges too). From the list of links it builds a list of targets (connected to those links). Then it orders the list of targets by size (not by latency or bandwidth). Once it has an ordered list of targets it uses either the policy or migrate API for the range of virtual addresses it wants to affect.

> 4. ...
>
> Please work out some use cases like this.

Note that all the list building in userspace above is intended to be done by a helper library as this is really boilerplate code. The last patch in my series has userspace helpers to parse the sysfs; i will grow that into a mini library with examples to showcase it.

More examples from other parts of this email thread.

High level overview of how one application looks today:
  1) Application gets some dataset from some source (disk, network, sensors, ...)
  2) Application allocates memory on device A and copies over the dataset
  3) Application runs some CPU code to format the copy of the dataset inside device A memory (rebuild pointers inside the dataset, this can represent millions and millions of operations)
  4) Application runs code on device A that uses the dataset
  5) Application allocates memory on device B and copies over the result from device A
  6) Application runs some CPU code to format the copy of the dataset inside device B (rebuild pointers inside the dataset, this can represent millions and millions of operations)
  7) Application runs code on device B that uses the dataset
  8) Application copies the result over from device B and keeps on doing its thing

How it looks with HMS:
  1) Application gets some dataset from some source (disk, network, sensors, ...)
  2-3) Application calls HMS to migrate to device A memory
  4) Application runs code on device A that uses the dataset
  5-6) Application calls HMS to migrate to device B memory
  7) Application runs code on device B that uses the dataset
  8) Application calls HMS to migrate the result to main memory

So we now avoid the explicit copies and having to rebuild the data structures inside each device address space.

The above example is for migrate. Here is an example of how the topology is used today:

The application knows that the platform it is running on has 16 GPUs split into 2 groups of 8 GPUs each. GPUs in each group can access each other's memory over dedicated mesh links, full speed with no traffic bottleneck. The application splits its GPU computation in 2 so that each partition runs on a group of interconnected GPUs, allowing them to share the dataset.

With HMS: the application can query the kernel to discover the topology of the system it is running on and use it to partition and balance its workload accordingly. The same application should now be able to run on a new platform without having to be adapted to it.

Cheers,
Jérôme
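A compact sketch of the list-ordering step walked through above: keep only cache coherent candidates reachable from the initiator and order them by bandwidth first, then latency on ties; structure and field names are hypothetical:

    /*
     * Sketch of the helper-library step walked through above: keep only
     * cache coherent candidates and order them by bandwidth first, then
     * by latency on ties.  The resulting uid order is what would be handed
     * to the bind/migrate call.  Names are hypothetical.
     */
    #include <stdint.h>
    #include <stdlib.h>

    struct candidate {
        uint32_t target_uid;
        uint64_t bandwidth;      /* of the link used to reach the target */
        uint64_t latency;
        int cache_coherent;
    };

    static int cmp_candidate(const void *a, const void *b)
    {
        const struct candidate *x = a, *y = b;

        if (x->bandwidth != y->bandwidth)
            return x->bandwidth > y->bandwidth ? -1 : 1;  /* higher first */
        if (x->latency != y->latency)
            return x->latency < y->latency ? -1 : 1;      /* lower first */
        return 0;
    }

    /* Filter and sort in place, returning how many candidates survive. */
    static unsigned int order_targets(struct candidate *c, unsigned int n)
    {
        unsigned int i, kept = 0;

        for (i = 0; i < n; i++)
            if (c[i].cache_coherent)
                c[kept++] = c[i];
        qsort(c, kept, sizeof(*c), cmp_candidate);
        return kept;
    }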
On 2018-12-04 1:14 p.m., Andi Kleen wrote: >> Also, in the same vein, I think it's wrong to have the API enumerate all >> the different memory available in the system. The API should simply > We need an enumeration API too, just to display to the user what they > have, and possibly for applications to size their buffers > (all we do with existing NUMA nodes) Yes, but I think my main concern is the conflation of the enumeration API and the binding API. An application doesn't want to walk through all the possible memory and types in the system just to get some memory that will work with a couple initiators (which it somehow has to map to actual resources, like fds). We also don't want userspace to police itself on which memory works with which initiator. Enumeration is definitely not the common use case. And if we create a new enumeration API now, it may make it difficult or impossible to unify these types of memory with the existing NUMA node hierarchies if/when this gets more integrated with the mm core. Logan
On Tue, Dec 04, 2018 at 01:30:01PM -0700, Logan Gunthorpe wrote:
> On 2018-12-04 1:13 p.m., Jerome Glisse wrote:
> > You are right many are non exclusive. It is just my feeling that having a mask as a file inside the target directory might be overlook by the application which might start using things it should not. At same time i guess if i write the userspace library that abstract this kernel API then i can enforce application to properly select thing.
>
> I think this is just evidence that this is not a good API. If the user has the option to just ignore things or do it wrong that's a problem with the API. Using a prefix for the name doesn't change that fact.

How do we expose harmful memory to userspace then? How can i expose non cache coherent memory? Because yes, there are applications out there that use that today and would like to be able to migrate to and from that memory dynamically during the lifetime of the application, as the data set progresses through the application processing pipeline. These are kinds of memory that violate the memory model you expect from the architecture. This memory is still useful nonetheless, and it has other enticing properties (like bandwidth or latency).

The whole point of my proposal is to expose this memory in a generic way so that applications that today rely on a gazillion device specific APIs can move over to a common kernel API and consolidate their memory management on top of a common kernel layer. The dilemma i am facing is exposing this memory while avoiding unaware applications accidentally using it just because it is there, without understanding the implications that come with it.

If you have any idea on how to expose this to userspace in a common API i would happily take any suggestion :) My idea is this patchset, and i agree there are many things to improve; i have already taken many of the suggestions given so far.

> > I do not think there is a way to answer that question. I am siding on the side of this API can be dumb down in userspace by a common library. So let expose the topology and let userspace dumb it down.
>
> I fundamentally disagree with this approach to designing APIs. Saying "we'll give you the kitchen sink, add another layer to deal with the complexity" is actually just eschewing API design and makes it harder for kernel folks to know what userspace actually requires because they are multiple layers away.

Note that i do not expose things like physical addresses, nor do i split the memory in a node into individual devices; in fact i expose less information than the existing NUMA interface (no zones, phys index, ...), as i do not think those have any value to userspace. What matters to userspace is where this memory is in the topology, so it can look at all the initiator nodes that are close by. Or the reverse: i have a set of initiators, what is the set of targets closest to all those initiators?

I feel this is simple enough for anyone to understand. It allows describing any topology, a libhms can dumb it down for average applications, and more advanced applications can use the full description. There are examples of such applications today. I argue that if we provide a common API we might see more applications, but i won't pretend that i know that for a fact. I am just making an assumption here.

> > If we dumb it down in the kernel i see few pitfalls:
> >   - kernel dumbing it down badly
> >   - kernel dumbing down code can grow out of control with gotcha for platform
>
> This is just a matter of designing the APIs well. Don't do it badly.

I am talking about the inevitable fact that at some point some system firmware will misrepresent its platform. System firmware writers usually copy and paste things with little regard to what has changed from one platform to the next. So there will be inevitable workarounds, and i would rather see those piling up inside a userspace library than inside the kernel.

Note that i expect the errors won't be fatal but more along the line of reporting wrong values for bandwidth, latency, ... So the kernel will most likely be unaffected by system firmware errors, but those will affect the performance of applications that are told inaccurate information.

> >   - it is still harder to fix kernel than userspace in commercial user space (the whole RHEL business of slow moving and long supported kernel). So on those being able to fix thing in userspace sounds pretty enticing
>
> I hear this argument a lot and it's not compelling to me. I don't think we should make decisions in upstream code to allow RHEL to bypass the kernel simply because it would be easier for them to distribute code changes.

Ok, i will not bring it up; i have suffered enough on that front so i have a trauma about this ;)

Cheers,
Jérôme
On Tue, Dec 04, 2018 at 01:47:17PM -0700, Logan Gunthorpe wrote:
> On 2018-12-04 1:14 p.m., Andi Kleen wrote:
> >> Also, in the same vein, I think it's wrong to have the API enumerate all the different memory available in the system. The API should simply
> >
> > We need an enumeration API too, just to display to the user what they have, and possibly for applications to size their buffers (all we do with existing NUMA nodes)
>
> Yes, but I think my main concern is the conflation of the enumeration API and the binding API. An application doesn't want to walk through all the possible memory and types in the system just to get some memory that will work with a couple initiators (which it somehow has to map to actual resources, like fds). We also don't want userspace to police itself on which memory works with which initiator.

How would an application police itself? The API i am proposing is best effort, and as such the kernel can fully ignore a userspace request, as it already does sometimes with mbind(). So the kernel always has the last call and can always override the application's decision. The device driver can also decide to override; anything on the kernel side really has more power than userspace would have. So while we give trust to userspace we do not abdicate control. That is not the intention here.

> Enumeration is definitely not the common use case. And if we create a new enumeration API now, it may make it difficult or impossible to unify these types of memory with the existing NUMA node hierarchies if/when this gets more integrated with the mm core.

The point i am trying to make is that it can not get integrated as regular NUMA nodes inside the mm core, but rather the mm core can grow to encompass non-NUMA-node memory. I explained why in other parts of this thread, but roughly:
  - Device drivers need to stay in control of device memory allocation, for backward compatibility reasons and to keep fulfilling things like graphics API constraints (OpenGL, Vulkan, X, ...).
  - Adding new node types is problematic inside mm as we are running out of bits in struct page.
  - Excluding nodes from the regular allocation path was rejected by upstream previously (IBM did post a patchset for that IIRC).

I feel it is a safer path to avoid a one-model-fits-all here and to accept that device memory will be represented and managed in a different way from other memory. I believe the persistent memory folks feel the same on that front. Nonetheless i do want to expose this device memory in a standard way so that we can consolidate and improve the user experience on that front.

Eventually i hope that more of the device memory management can be turned into a common device memory management inside core mm, but i do not want to enforce that at first as it is likely to fail (building a moonbase before you have a moon rocket). I would rather grow organically from a high level API that will get used right away (it is a matter of converting existing users to it, s/computeAPIBind/HMSBind).

Cheers,
Jérôme
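To illustrate the first bullet above (the device driver staying in control of its memory), a rough kernel-side sketch of a target object carrying driver callbacks for allocation and migration; these structure names are illustrative only, not the actual definitions from the patch series:

    /*
     * Kernel-side sketch of the idea that the device driver stays in
     * control of its memory: a target object carries driver callbacks that
     * core code would use for allocation and migration.  These names are
     * illustrative only, not the actual structures from the patch series.
     */
    struct mm_struct;   /* normally provided by kernel headers */
    struct page;
    struct hms_target;

    struct hms_target_ops {
        /* allocate device memory backing [start, end) of the given mm */
        int (*alloc)(struct hms_target *target, struct mm_struct *mm,
                     unsigned long start, unsigned long end);
        /* migrate the range into (or out of) the device memory */
        int (*migrate)(struct hms_target *target, struct mm_struct *mm,
                       unsigned long start, unsigned long end);
        void (*free)(struct hms_target *target, struct page *page);
    };

    struct hms_target {
        unsigned int uid;
        unsigned long size;
        const struct hms_target_ops *ops;   /* provided by the driver */
        void *private;                      /* driver data, e.g. GPU instance */
    };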
On 2018-12-04 1:59 p.m., Jerome Glisse wrote: > How to expose harmful memory to userspace then ? How can i expose > non cache coherent memory because yes they are application out there > that use that today and would like to be able to migrate to and from > that memory dynamicly during lifetime of the application as the data > set progress through the application processing pipeline. I'm not arguing against the purpose or use cases. I'm being critical of the API choices. > Note that i do not expose things like physical address or even splits > memory in a node into individual device, in fact in expose less > information that the existing NUMA (no zone, phys index, ...). As i do > not think those have any value to userspace. What matter to userspace > is where is this memory is in my topology so i can look at all the > initiators node that are close by. Or the reverse, i have a set of > initiators what is the set of closest targets to all those initiators. No, what matters to applications is getting memory that will work for the initiators/resources they need it to work for. The specific topology might be of interest to administrators but it is not what applications need. And it should be relatively easy to flesh out the existing sysfs device tree to provide the topology information administrators need. > I am talking about the inevitable fact that at some point some system > firmware will miss-represent their platform. System firmware writer > usualy copy and paste thing with little regards to what have change > from one platform to the new. So their will be inevitable workaround > and i would rather see those piling up inside a userspace library than > inside the kernel. It's *absolutely* the kernel's responsibility to patch issues caused by broken firmware. We have quirks all over the place for this. That's never something userspace should be responsible for. Really, this is the raison d'etre of the kernel: to provide userspace with a uniform execution environment -- if every application had to deal with broken firmware it would be a nightmare. Logan
On Tue, Dec 04, 2018 at 02:19:09PM -0700, Logan Gunthorpe wrote:
> On 2018-12-04 1:59 p.m., Jerome Glisse wrote:
> > How to expose harmful memory to userspace then ? How can i expose
> > non cache coherent memory because yes there are applications out there
> > that use that today and would like to be able to migrate to and from
> > that memory dynamically during the lifetime of the application as the
> > data set progresses through the application processing pipeline.
>
> I'm not arguing against the purpose or use cases. I'm being critical of
> the API choices.
>
> > Note that i do not expose things like physical address or even split
> > memory in a node into individual devices, in fact i expose less
> > information than the existing NUMA (no zone, phys index, ...). As i do
> > not think those have any value to userspace. What matters to userspace
> > is where this memory is in the topology so it can look at all the
> > initiator nodes that are close by. Or the reverse, i have a set of
> > initiators, what is the set of closest targets to all those initiators.
>
> No, what matters to applications is getting memory that will work for
> the initiators/resources they need it to work for. The specific topology
> might be of interest to administrators but it is not what applications
> need. And it should be relatively easy to flesh out the existing sysfs
> device tree to provide the topology information administrators need.

Existing users would disagree. In my cover letter i have given pointers
to an existing library and papers from HPC folks that do leverage system
topology (among the few who are). So there are applications _today_ that
do use topology information to adapt their workload to maximize the
performance of the platform they run on.

There are also some new platforms that have a much more complex topology
that definitely can not be represented as a tree like today's sysfs (i
believe that even some of the HPC folks have _today_ topologies that are
not tree-like). So existing users + random graph topologies becoming
more common led me to the choice i made in this API.

I believe a graph is something that can easily be understood by people.
I am not inventing some weird new data structure, it is just a graph,
and for the names i have used the ACPI naming convention, but i am more
than open to use "memory" for target and to differentiate cpu and device
instead of using "initiator" as a name. I do not have strong feelings on
that. I do however want to be able to represent any topology and to use
device memory that is not managed by core mm, for reasons i explained
previously.

Note that if it turns out to be a bad idea the kernel can decide to dumb
things down in a future version for new platforms. So it could give a
flat graph to userspace, there is nothing precluding that.

> > I am talking about the inevitable fact that at some point some system
> > firmware will mis-represent its platform. System firmware writers
> > usually copy and paste things with little regard to what has changed
> > from one platform to the next. So there will be inevitable workarounds
> > and i would rather see those piling up inside a userspace library than
> > inside the kernel.
>
> It's *absolutely* the kernel's responsibility to patch issues caused by
> broken firmware. We have quirks all over the place for this. That's
> never something userspace should be responsible for. Really, this is the
> raison d'etre of the kernel: to provide userspace with a uniform
> execution environment -- if every application had to deal with broken
> firmware it would be a nightmare.

You cut the other paragraph that explained why they are unlikely to be
broken badly enough to break the kernel. Anyway we can fix the topology
in the kernel too ... that is fine with me.

Cheers,
Jérôme
On 2018-12-04 2:51 p.m., Jerome Glisse wrote: > Existing user would disagree in my cover letter i have given pointer > to existing library and paper from HPC folks that do leverage system > topology (among the few who are). So they are application _today_ that > do use topology information to adapt their workload to maximize the > performance for the platform they run on. Well we need to give them what they actually need, not what they want to shoot their foot with. And I imagine, much of what they actually do right now belongs firmly in the kernel. Like I said, existing applications are not justifications for bad API design or layering violations. You've even mentioned we'd need a simplified "libhms" interface for applications. We should really just figure out what that needs to be and make that the kernel interface. > They are also some new platform that have much more complex topology > that definitly can not be represented as a tree like today sysfs we > have (i believe that even some of the HPC folks have _today_ topology > that are not tree-like). The sysfs tree already allows for a complex graph that describes existing hardware very well. If there is hardware it cannot describe then we should work to improve it and not just carve off a whole new area for a special API. -- In fact, you are already using sysfs, just under your own virtual/non-existent bus. > Note that if it turn out to be a bad idea kernel can decide to dumb > down thing in future version for new platform. So it could give a > flat graph to userspace, there is nothing precluding that. Uh... if it turns out to be a bad idea we are screwed because we have an API existing applications are using. It's much easier to add features to a simple (your word: "dumb") interface than it is to take options away from one that is too broad. > >>> I am talking about the inevitable fact that at some point some system >>> firmware will miss-represent their platform. System firmware writer >>> usualy copy and paste thing with little regards to what have change >>> from one platform to the new. So their will be inevitable workaround >>> and i would rather see those piling up inside a userspace library than >>> inside the kernel. >> >> It's *absolutely* the kernel's responsibility to patch issues caused by >> broken firmware. We have quirks all over the place for this. That's >> never something userspace should be responsible for. Really, this is the >> raison d'etre of the kernel: to provide userspace with a uniform >> execution environment -- if every application had to deal with broken >> firmware it would be a nightmare. > > You cuted the other paragraph that explained why they will unlikely > to be broken badly enough to break the kernel. That was entirely beside the point. Just because it doesn't break the kernel itself doesn't make it any less necessary for it to be fixed inside the kernel. It must be done in a common place so every application doesn't have to maintain a table of hardware quirks. Logan
On Tue, Dec 04, 2018 at 03:16:54PM -0700, Logan Gunthorpe wrote:
> On 2018-12-04 2:51 p.m., Jerome Glisse wrote:
> > Existing users would disagree. In my cover letter i have given pointers
> > to an existing library and papers from HPC folks that do leverage system
> > topology (among the few who are). So there are applications _today_ that
> > do use topology information to adapt their workload to maximize the
> > performance of the platform they run on.
>
> Well we need to give them what they actually need, not what they want to
> shoot their foot with. And I imagine, much of what they actually do
> right now belongs firmly in the kernel. Like I said, existing
> applications are not justifications for bad API design or layering
> violations.

One example i have is 4 nodes (CPU sockets), each node with 8 GPUs, and
two 8-GPU nodes connected to each other with a fast mesh (ie each GPU
can peer-to-peer to each other at the same bandwidth). Then these 2
blocks are connected to the other block through a shared link.

So it looks like:
    SOCKET0----SOCKET1-----SOCKET2----SOCKET3
       |          |           |          |
    S0-GPU0====S1-GPU0     S2-GPU0====S3-GPU0
       ||   \\//              ||   \\//
       ||   //\\              ||   //\\
      ... ====...  -------   ... ====...
       ||   \\//              ||   \\//
       ||   //\\              ||   //\\
    S0-GPU7====S1-GPU7     S2-GPU7====S3-GPU7

The application partitions its workload in 2, ie it allocates the
dataset twice, once for each group of 16 GPUs. Each of the 2 partitions
is then further split in two for some of the buffers in the dataset but
not all. So AFAICT they are using all the topology information. They see
that there are 4 groups of GPUs, that among those 4 groups there are 2
pairs of groups with a better interconnect, and then a shared slower
inter-connect between the 2 pairs.

From the HMS point of view this looks like (ignoring CPUs):
    link0: S0-GPU0 ... S0-GPU7
    link1: S1-GPU0 ... S1-GPU7
    link2: S2-GPU0 ... S2-GPU7
    link3: S3-GPU0 ... S3-GPU7

    link4: S0-GPU0 ... S0-GPU7 S1-GPU0 ... S1-GPU7
    link5: S2-GPU0 ... S2-GPU7 S3-GPU0 ... S3-GPU7

    link6: S0-GPU0 ... S0-GPU7 S1-GPU0 ... S1-GPU7
           S2-GPU0 ... S2-GPU7 S3-GPU0 ... S3-GPU7

Dumb it down more and they lose information they want. On top of that
there are also the NUMA CPU nodes (which are more symmetric). I do not
see how this can be expressed in the current sysfs we have, but maybe
there is a way to shoehorn it. I expect more complex topologies to show
up with a mix of different devices (like GPU and FPGA).

> You've even mentioned we'd need a simplified "libhms" interface for
> applications. We should really just figure out what that needs to be and
> make that the kernel interface.

No, i said that a libhms for the average application would totally make
sense to dumb things down. I do not expect all applications will use the
full extent of the information. One simple reason: desktop. On desktop i
don't expect the topology to grow too complex and thus the desktop
applications will not care about it (your blender, libreoffice, ...
which are using GPUs today). But for people creating applications that
will run on big servers, yes i expect some of them will use that
information, if only the existing people that already do use that
information.

> > There are also some new platforms that have a much more complex topology
> > that definitely can not be represented as a tree like today's sysfs (i
> > believe that even some of the HPC folks have _today_ topologies that are
> > not tree-like).
>
> The sysfs tree already allows for a complex graph that describes
> existing hardware very well. If there is hardware it cannot describe
> then we should work to improve it and not just carve off a whole new
> area for a special API. -- In fact, you are already using sysfs, just
> under your own virtual/non-existent bus.

How would the above example look like ? I fail to see how to do it
inside current sysfs. Maybe by creating multiple virtual devices for
each of the inter-connects ? So something like:
    link0 -> device:00 which itself has S0-GPU0 ... S0-GPU7 as children
    link1 -> device:01 which itself has S1-GPU0 ... S1-GPU7 as children
    link2 -> device:02 which itself has S2-GPU0 ... S2-GPU7 as children
    link3 -> device:03 which itself has S3-GPU0 ... S3-GPU7 as children

Then for link4, link5 and link6 we would need symlinks to the GPU
devices. So it sounds like creating virtual devices for the sake of
doing it in the existing framework. Then userspace would have to learn
about these virtual devices to identify them as nodes of the topology
graph and would have to differentiate them from non-node devices. This
sounds much more complex to me. Also, if doing nodes for those, we would
need CPU-less and memory-less NUMA nodes as the GPU memory is not usable
by the CPU ... I am not sure we want to get there. If that's what people
want, fine, but i personally don't think this is the right solution.

> > Note that if it turns out to be a bad idea the kernel can decide to dumb
> > things down in a future version for new platforms. So it could give a
> > flat graph to userspace, there is nothing precluding that.
>
> Uh... if it turns out to be a bad idea we are screwed because we have an
> API existing applications are using. It's much easier to add features to
> a simple (your word: "dumb") interface than it is to take options away
> from one that is too broad.

We all have fears that what we do will not get used, but i do not want
to stop making progress because of that. Like i said, i am doing all
this under staging to get the ball rolling, to test it with guinea pigs
and to gain some level of confidence it is actually useful. So i am
providing evidence today (see all the research in HPC on memory
management, topology, placement, ... to which i have given some links)
and i want to gather more evidence before committing to this. I hope
this sounds like a reasonable plan. What would you like me to do
differently ?

Like i said, i feel that this is a chicken and egg problem: today there
is no standard way to get the topology so there is no way to know how
many applications would use such information. We know that very few
applications, in special cases, use the topology information. How to
test whether more applications would use that same information without
providing some kind of standard API for them to get it ?

It is also a system availability thing. Right now there are very few
systems with such complex topology, but we are seeing more and more GPU,
TPU, FPGA in more and more environments. I want to be pro-active here
and provide an API that would help leverage those new systems for people
experimenting with them.

My proposal is to do HMS behind staging for a while and also avoid any
disruption to existing code paths. See if people living on the bleeding
edge get interested in that information. If not, then i can strip down
my thing to the bare minimum which is about device memory.

> >>> I am talking about the inevitable fact that at some point some system
> >>> firmware will mis-represent its platform. System firmware writers
> >>> usually copy and paste things with little regard to what has changed
> >>> from one platform to the next. So there will be inevitable workarounds
> >>> and i would rather see those piling up inside a userspace library than
> >>> inside the kernel.
> >>
> >> It's *absolutely* the kernel's responsibility to patch issues caused by
> >> broken firmware. We have quirks all over the place for this. That's
> >> never something userspace should be responsible for. Really, this is the
> >> raison d'etre of the kernel: to provide userspace with a uniform
> >> execution environment -- if every application had to deal with broken
> >> firmware it would be a nightmare.
> >
> > You cut the other paragraph that explained why they are unlikely to be
> > broken badly enough to break the kernel.
>
> That was entirely beside the point. Just because it doesn't break the
> kernel itself doesn't make it any less necessary for it to be fixed
> inside the kernel. It must be done in a common place so every
> application doesn't have to maintain a table of hardware quirks.

Fine with quirks in the kernel. It was just a personal taste thing ...
pure kernel vs ugly userspace :)

Cheers,
Jérôme
On 2018-12-04 2:11 p.m., Logan Gunthorpe wrote: > Also, in the same vein, I think it's wrong to have the API enumerate all > the different memory available in the system. The API should simply > allow userspace to say it wants memory that can be accessed by a set of > initiators with a certain set of attributes and the bind call tries to > fulfill that or fallback on system memory/hmm migration/whatever. That gets pretty complex when you also have take into account contention of links and bridges when multiple initiators are accessing multiple targets simultaneously. If you want the kernel to make sane decisions, it needs a lot more information about the expected memory access patterns. Highly optimized algorithms that use multiple GPUs and collective communications between them want to be able to place their memory objects in the right location to avoid such contention. You don't want such an algorithm to guess about opaque policy decisions in the kernel. If the policy changes, you have to re-optimize the algorithm. Regards, Felix > > Logan
On 2018-12-04 4:56 p.m., Jerome Glisse wrote: > One example i have is 4 nodes (CPU socket) each nodes with 8 GPUs and > two 8 GPUs node connected through each other with fast mesh (ie each > GPU can peer to peer to each other at the same bandwidth). Then this > 2 blocks are connected to the other block through a share link. > > So it looks like: > SOCKET0----SOCKET1-----SOCKET2----SOCKET3 > | | | | > S0-GPU0====S1-GPU0 S2-GPU0====S1-GPU0 > || \\// || \\// > || //\\ || //\\ > ... ====... -----... ====... > || \\// || \\// > || //\\ || //\\ > S0-GPU7====S1-GPU7 S2-GPU7====S3-GPU7 Well the existing NUMA node stuff tells userspace which GPU belongs to which socket (every device in sysfs already has a numa_node attribute). And if that's not good enough we should work to improve how that works for all devices. This problem isn't specific to GPUS or devices with memory and seems rather orthogonal to an API to bind to device memory. > How the above example would looks like ? I fail to see how to do it > inside current sysfs. Maybe by creating multiple virtual device for > each of the inter-connect ? So something like > > link0 -> device:00 which itself has S0-GPU0 ... S0-GPU7 has child > link1 -> device:01 which itself has S1-GPU0 ... S1-GPU7 has child > link2 -> device:02 which itself has S2-GPU0 ... S2-GPU7 has child > link3 -> device:03 which itself has S3-GPU0 ... S3-GPU7 has child I think the "links" between GPUs themselves would be a bus. In the same way a NUMA node is a bus. Each device in sysfs would then need a directory or something to describe what "link bus(es)" they are a part of. Though there are other ways to do this: a GPU driver could simply create symlinks to other GPUs inside a "neighbours" directory under the device path or something like that. The point is that this seems like it is specific to GPUs and could easily be solved in the GPU community without any new universal concepts or big APIs. And for applications that need topology information, a lot of it is already there, we just need to fill in the gaps with small changes that would be much less controversial. Then if you want to create a libhms (or whatever) to help applications parse this information out of existing sysfs that would make sense. > My proposal is to do HMS behind staging for a while and also avoid > any disruption to existing code path. See with people living on the > bleeding edge if they get interested in that informations. If not then > i can strip down my thing to the bare minimum which is about device > memory. This isn't my area or decision to make, but it seemed to me like this is not what staging is for. Staging is for introducing *drivers* that aren't up to the Kernel's quality level and they all reside under the drivers/staging path. It's not meant to introduce experimental APIs around the kernel that might be revoked at anytime. DAX introduced itself by marking the config option as EXPERIMENTAL and printing warnings to dmesg when someone tries to use it. But, to my knowledge, DAX also wasn't creating APIs with the intention of changing or revoking them -- it was introducing features using largely existing APIs that had many broken corner cases. Do you know of any precedents where big APIs were introduced and then later revoked or radically changed like you are proposing to do? Logan
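The "neighbours" symlink idea above could look roughly like the
following driver-side sketch. The function name, the "neighbours"
directory name and the single static kobject are illustrative
assumptions (a real driver would keep one such directory per GPU); only
the kobject/sysfs helpers used are existing kernel APIs:

    /*
     * Sketch of per-GPU neighbour symlinks:
     * /sys/.../<gpu>/neighbours/<peer> -> the peer GPU's device directory.
     */
    #include <linux/device.h>
    #include <linux/kobject.h>
    #include <linux/sysfs.h>

    /* Simplification: a real driver would store this in its per-GPU state. */
    static struct kobject *neigh_dir;

    static int gpu_add_neighbour(struct device *gpu, struct device *peer)
    {
            if (!neigh_dir) {
                    neigh_dir = kobject_create_and_add("neighbours", &gpu->kobj);
                    if (!neigh_dir)
                            return -ENOMEM;
            }
            /* Symlink named after the peer, pointing at the peer's kobject. */
            return sysfs_create_link(neigh_dir, &peer->kobj, dev_name(peer));
    }

This is only meant to make the shape of the suggestion concrete; it says
nothing about how the driver discovers which GPUs are peers in the first
place.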
On Tue, Dec 04, 2018 at 06:15:08PM -0700, Logan Gunthorpe wrote:
> On 2018-12-04 4:56 p.m., Jerome Glisse wrote:
> > One example i have is 4 nodes (CPU sockets), each node with 8 GPUs, and
> > two 8-GPU nodes connected to each other with a fast mesh (ie each GPU
> > can peer-to-peer to each other at the same bandwidth). Then these 2
> > blocks are connected to the other block through a shared link.
> >
> > So it looks like:
> >     SOCKET0----SOCKET1-----SOCKET2----SOCKET3
> >        |          |           |          |
> >     S0-GPU0====S1-GPU0     S2-GPU0====S3-GPU0
> >        ||   \\//              ||   \\//
> >        ||   //\\              ||   //\\
> >       ... ====...  -------   ... ====...
> >        ||   \\//              ||   \\//
> >        ||   //\\              ||   //\\
> >     S0-GPU7====S1-GPU7     S2-GPU7====S3-GPU7
>
> Well the existing NUMA node stuff tells userspace which GPU belongs to
> which socket (every device in sysfs already has a numa_node attribute).
> And if that's not good enough we should work to improve how that works
> for all devices. This problem isn't specific to GPUs or devices with
> memory and seems rather orthogonal to an API to bind to device memory.

HMS is generic and not for GPUs only; i use GPUs as the example as they
are the first devices introducing this complexity. I believe some of the
FPGA folks are working on the same thing. I heard that more TPU-like
hardware might also grow such complexity. What you are proposing just
seems to me like redoing HMS under the node directory in sysfs, which
has the potential of confusing existing applications while providing no
benefits (at least i fail to see any).

> > How would the above example look like ? I fail to see how to do it
> > inside current sysfs. Maybe by creating multiple virtual devices for
> > each of the inter-connects ? So something like:
> >     link0 -> device:00 which itself has S0-GPU0 ... S0-GPU7 as children
> >     link1 -> device:01 which itself has S1-GPU0 ... S1-GPU7 as children
> >     link2 -> device:02 which itself has S2-GPU0 ... S2-GPU7 as children
> >     link3 -> device:03 which itself has S3-GPU0 ... S3-GPU7 as children
>
> I think the "links" between GPUs themselves would be a bus. In the same
> way a NUMA node is a bus. Each device in sysfs would then need a
> directory or something to describe what "link bus(es)" they are a part
> of. Though there are other ways to do this: a GPU driver could simply
> create symlinks to other GPUs inside a "neighbours" directory under the
> device path or something like that.
>
> The point is that this seems like it is specific to GPUs and could
> easily be solved in the GPU community without any new universal concepts
> or big APIs.

So it would be sprinkling all this information over various
sub-directories. To me this is making userspace's life harder. HMS only
has one directory hierarchy that userspace needs to parse to extract the
information. From my point of view it is much better, but this might be
a taste thing.

> And for applications that need topology information, a lot of it is
> already there, we just need to fill in the gaps with small changes that
> would be much less controversial. Then if you want to create a libhms
> (or whatever) to help applications parse this information out of
> existing sysfs that would make sense.

How can i express multiple links, or memory that is only accessible by a
subset of the devices/CPUs ? In today's model there is a baked-in
assumption that everyone can access all the nodes, which does not hold
in what i am trying to do. Yes, i can do it by adding an invalid-peer
node list inside each node, but this is all more complex from my point
of view, highly confusing for existing applications, and with the
potential to break existing applications on new platforms with such
weird nodes.

> > My proposal is to do HMS behind staging for a while and also avoid
> > any disruption to existing code paths. See if people living on the
> > bleeding edge get interested in that information. If not, then i can
> > strip down my thing to the bare minimum which is about device memory.
>
> This isn't my area or decision to make, but it seemed to me like this is
> not what staging is for. Staging is for introducing *drivers* that
> aren't up to the Kernel's quality level and they all reside under the
> drivers/staging path. It's not meant to introduce experimental APIs
> around the kernel that might be revoked at anytime.
>
> DAX introduced itself by marking the config option as EXPERIMENTAL and
> printing warnings to dmesg when someone tries to use it. But, to my
> knowledge, DAX also wasn't creating APIs with the intention of changing
> or revoking them -- it was introducing features using largely existing
> APIs that had many broken corner cases.
>
> Do you know of any precedents where big APIs were introduced and then
> later revoked or radically changed like you are proposing to do?

Yeah, it is kind of an issue. I can go the experimental way; ideally
what i would like is a kernel option that enables it, with a kernel boot
parameter as an extra gatekeeper, so i can distribute a kernel with that
feature inside some distribution and then provide simple instructions
for people to test (much easier to give a kernel boot parameter than to
have people rebuild a kernel). I am open to any suggestion on what would
be the best guideline to experiment with an API.

The issue is that the changes to userspace are big and take time (months
of work). So if i have to have everything lined up and ready (userspace
and kernel) in just one go then it is gonna be painful. My pain i guess,
so others don't care ... :)

Cheers,
Jérôme
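To illustrate the "one directory hierarchy that userspace needs to
parse" argument above, here is a small sketch of such a walk. The
/sys/bus/hms/devices path, the "v1-" version prefix and the
target/initiator/link naming are assumptions for illustration; the
actual patchset may name things differently:

    /*
     * Sketch: enumerate every node of an assumed HMS hierarchy from a
     * single directory, classifying entries by an assumed name prefix.
     */
    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            DIR *d = opendir("/sys/bus/hms/devices");   /* assumed path */
            struct dirent *e;

            if (!d)
                    return 1;
            while ((e = readdir(d)) != NULL) {
                    if (strncmp(e->d_name, "v1-target-", 10) == 0)
                            printf("memory target: %s\n", e->d_name);
                    else if (strncmp(e->d_name, "v1-initiator-", 13) == 0)
                            printf("initiator:     %s\n", e->d_name);
                    else if (strncmp(e->d_name, "v1-link-", 8) == 0)
                            printf("link:          %s\n", e->d_name);
            }
            closedir(d);
            return 0;
    }

The point of the sketch is only the shape of the interface: one
directory, one naming scheme, with the graph edges recovered from
symlinks inside each link directory rather than from several unrelated
subsystems.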
On Tue, Dec 4, 2018 at 5:15 PM Logan Gunthorpe <logang@deltatee.com> wrote: > > > > On 2018-12-04 4:56 p.m., Jerome Glisse wrote: > > One example i have is 4 nodes (CPU socket) each nodes with 8 GPUs and > > two 8 GPUs node connected through each other with fast mesh (ie each > > GPU can peer to peer to each other at the same bandwidth). Then this > > 2 blocks are connected to the other block through a share link. > > > > So it looks like: > > SOCKET0----SOCKET1-----SOCKET2----SOCKET3 > > | | | | > > S0-GPU0====S1-GPU0 S2-GPU0====S1-GPU0 > > || \\// || \\// > > || //\\ || //\\ > > ... ====... -----... ====... > > || \\// || \\// > > || //\\ || //\\ > > S0-GPU7====S1-GPU7 S2-GPU7====S3-GPU7 > > Well the existing NUMA node stuff tells userspace which GPU belongs to > which socket (every device in sysfs already has a numa_node attribute). > And if that's not good enough we should work to improve how that works > for all devices. This problem isn't specific to GPUS or devices with > memory and seems rather orthogonal to an API to bind to device memory. > > > How the above example would looks like ? I fail to see how to do it > > inside current sysfs. Maybe by creating multiple virtual device for > > each of the inter-connect ? So something like > > > > link0 -> device:00 which itself has S0-GPU0 ... S0-GPU7 has child > > link1 -> device:01 which itself has S1-GPU0 ... S1-GPU7 has child > > link2 -> device:02 which itself has S2-GPU0 ... S2-GPU7 has child > > link3 -> device:03 which itself has S3-GPU0 ... S3-GPU7 has child > > I think the "links" between GPUs themselves would be a bus. In the same > way a NUMA node is a bus. Each device in sysfs would then need a > directory or something to describe what "link bus(es)" they are a part > of. Though there are other ways to do this: a GPU driver could simply > create symlinks to other GPUs inside a "neighbours" directory under the > device path or something like that. > > The point is that this seems like it is specific to GPUs and could > easily be solved in the GPU community without any new universal concepts > or big APIs. > > And for applications that need topology information, a lot of it is > already there, we just need to fill in the gaps with small changes that > would be much less controversial. Then if you want to create a libhms > (or whatever) to help applications parse this information out of > existing sysfs that would make sense. > > > My proposal is to do HMS behind staging for a while and also avoid > > any disruption to existing code path. See with people living on the > > bleeding edge if they get interested in that informations. If not then > > i can strip down my thing to the bare minimum which is about device > > memory. > > This isn't my area or decision to make, but it seemed to me like this is > not what staging is for. Staging is for introducing *drivers* that > aren't up to the Kernel's quality level and they all reside under the > drivers/staging path. It's not meant to introduce experimental APIs > around the kernel that might be revoked at anytime. > > DAX introduced itself by marking the config option as EXPERIMENTAL and > printing warnings to dmesg when someone tries to use it. But, to my > knowledge, DAX also wasn't creating APIs with the intention of changing > or revoking them -- it was introducing features using largely existing > APIs that had many broken corner cases. > > Do you know of any precedents where big APIs were introduced and then > later revoked or radically changed like you are proposing to do? 
This came up before for apis even better defined than HMS as well as more limited scope, i.e. experimental ABI availability only for -rc kernels. Linus said this: "There are no loopholes. No "but it's been only one release". No, no, no. The whole point is that users are supposed to be able to *trust* the kernel. If we do something, we keep on doing it. And if it makes it harder to add new user-visible interfaces, then that's a *good* thing." [1] The takeaway being don't land work-in-progress ABIs in the kernel. Once an application depends on it, there are no more incompatible changes possible regardless of the warnings, experimental notices, or "staging" designation. DAX is experimental because there are cases where it currently does not work with respect to another kernel feature like xfs-reflink, RDMA. The plan is to fix those, not continue to hide behind an experimental designation, and fix them in a way that preserves the user visible behavior that has already been exposed, i.e. no regressions. [1]: https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2017-August/004742.html
On Tue, Dec 04, 2018 at 06:34:37PM -0800, Dan Williams wrote: > On Tue, Dec 4, 2018 at 5:15 PM Logan Gunthorpe <logang@deltatee.com> wrote: > > > > > > > > On 2018-12-04 4:56 p.m., Jerome Glisse wrote: > > > One example i have is 4 nodes (CPU socket) each nodes with 8 GPUs and > > > two 8 GPUs node connected through each other with fast mesh (ie each > > > GPU can peer to peer to each other at the same bandwidth). Then this > > > 2 blocks are connected to the other block through a share link. > > > > > > So it looks like: > > > SOCKET0----SOCKET1-----SOCKET2----SOCKET3 > > > | | | | > > > S0-GPU0====S1-GPU0 S2-GPU0====S1-GPU0 > > > || \\// || \\// > > > || //\\ || //\\ > > > ... ====... -----... ====... > > > || \\// || \\// > > > || //\\ || //\\ > > > S0-GPU7====S1-GPU7 S2-GPU7====S3-GPU7 > > > > Well the existing NUMA node stuff tells userspace which GPU belongs to > > which socket (every device in sysfs already has a numa_node attribute). > > And if that's not good enough we should work to improve how that works > > for all devices. This problem isn't specific to GPUS or devices with > > memory and seems rather orthogonal to an API to bind to device memory. > > > > > How the above example would looks like ? I fail to see how to do it > > > inside current sysfs. Maybe by creating multiple virtual device for > > > each of the inter-connect ? So something like > > > > > > link0 -> device:00 which itself has S0-GPU0 ... S0-GPU7 has child > > > link1 -> device:01 which itself has S1-GPU0 ... S1-GPU7 has child > > > link2 -> device:02 which itself has S2-GPU0 ... S2-GPU7 has child > > > link3 -> device:03 which itself has S3-GPU0 ... S3-GPU7 has child > > > > I think the "links" between GPUs themselves would be a bus. In the same > > way a NUMA node is a bus. Each device in sysfs would then need a > > directory or something to describe what "link bus(es)" they are a part > > of. Though there are other ways to do this: a GPU driver could simply > > create symlinks to other GPUs inside a "neighbours" directory under the > > device path or something like that. > > > > The point is that this seems like it is specific to GPUs and could > > easily be solved in the GPU community without any new universal concepts > > or big APIs. > > > > And for applications that need topology information, a lot of it is > > already there, we just need to fill in the gaps with small changes that > > would be much less controversial. Then if you want to create a libhms > > (or whatever) to help applications parse this information out of > > existing sysfs that would make sense. > > > > > My proposal is to do HMS behind staging for a while and also avoid > > > any disruption to existing code path. See with people living on the > > > bleeding edge if they get interested in that informations. If not then > > > i can strip down my thing to the bare minimum which is about device > > > memory. > > > > This isn't my area or decision to make, but it seemed to me like this is > > not what staging is for. Staging is for introducing *drivers* that > > aren't up to the Kernel's quality level and they all reside under the > > drivers/staging path. It's not meant to introduce experimental APIs > > around the kernel that might be revoked at anytime. > > > > DAX introduced itself by marking the config option as EXPERIMENTAL and > > printing warnings to dmesg when someone tries to use it. 
But, to my > > knowledge, DAX also wasn't creating APIs with the intention of changing > > or revoking them -- it was introducing features using largely existing > > APIs that had many broken corner cases. > > > > Do you know of any precedents where big APIs were introduced and then > > later revoked or radically changed like you are proposing to do? > > This came up before for apis even better defined than HMS as well as > more limited scope, i.e. experimental ABI availability only for -rc > kernels. Linus said this: > > "There are no loopholes. No "but it's been only one release". No, no, > no. The whole point is that users are supposed to be able to *trust* > the kernel. If we do something, we keep on doing it. > > And if it makes it harder to add new user-visible interfaces, then > that's a *good* thing." [1] > > The takeaway being don't land work-in-progress ABIs in the kernel. > Once an application depends on it, there are no more incompatible > changes possible regardless of the warnings, experimental notices, or > "staging" designation. DAX is experimental because there are cases > where it currently does not work with respect to another kernel > feature like xfs-reflink, RDMA. The plan is to fix those, not continue > to hide behind an experimental designation, and fix them in a way that > preserves the user visible behavior that has already been exposed, > i.e. no regressions. > > [1]: https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2017-August/004742.html So i guess i am heading down the vXX road ... such is my life :) Cheers, Jérôme
On 12/4/18 11:54 PM, Jerome Glisse wrote:
> On Tue, Dec 04, 2018 at 09:06:59AM -0800, Andi Kleen wrote:
>> jglisse@redhat.com writes:
>>
>>> +
>>> +To help with forward compatibility each object as a version value and
>>> +it is mandatory for user space to only use target or initiator with
>>> +version supported by the user space. For instance if user space only
>>> +knows about what version 1 means and sees a target with version 2 then
>>> +the user space must ignore that target as if it does not exist.
>>
>> So once v2 is introduced all applications that only support v1 break.
>>
>> That seems very un-Linux and will break Linus' "do not break existing
>> applications" rule.
>>
>> The standard approach that if you add something incompatible is to
>> add new field, but keep the old ones.
>
> No that's not how it is supposed to work. So let's say it is 2018 and you
> have v1 memory (like your regular main DDR memory for instance) then it
> will always be exposed as v1 memory.
>
> Fast forward to 2020 and you have this new type of memory that is not
> cache coherent and you want to expose this to userspace through HMS.
> What you do is a kernel patch that introduces the v2 type for target and
> defines a set of new sysfs files to describe what v2 is. On this new
> computer you report your usual main memory as v1 and your new memory as
> v2.
>
> So the application that only knew about v1 will keep using any v1 memory
> on your new platform but it will not use any of the new memory v2 which
> is what you want to happen. You do not have to break existing
> applications while allowing to add new types of memory.

So the knowledge that v1 is coherent and v2 is non-coherent is within
the application? That seems really complicated from the application's
point of view. Will the v1 and v2 definitions be arch and system
dependent?

If we want to encode properties of a target and initiator we should do
that as files within these directories. Something like
'is_cache_coherent' in the target directory can be used to identify
whether the target is cache coherent or not?

-aneesh
On Wed, Dec 05, 2018 at 10:06:02AM +0530, Aneesh Kumar K.V wrote:
> On 12/4/18 11:54 PM, Jerome Glisse wrote:
> > On Tue, Dec 04, 2018 at 09:06:59AM -0800, Andi Kleen wrote:
> > > jglisse@redhat.com writes:
> > >
> > > > +
> > > > +To help with forward compatibility each object as a version value and
> > > > +it is mandatory for user space to only use target or initiator with
> > > > +version supported by the user space. For instance if user space only
> > > > +knows about what version 1 means and sees a target with version 2 then
> > > > +the user space must ignore that target as if it does not exist.
> > >
> > > So once v2 is introduced all applications that only support v1 break.
> > >
> > > That seems very un-Linux and will break Linus' "do not break existing
> > > applications" rule.
> > >
> > > The standard approach that if you add something incompatible is to
> > > add new field, but keep the old ones.
> >
> > No that's not how it is supposed to work. So let's say it is 2018 and you
> > have v1 memory (like your regular main DDR memory for instance) then it
> > will always be exposed as v1 memory.
> >
> > Fast forward to 2020 and you have this new type of memory that is not
> > cache coherent and you want to expose this to userspace through HMS.
> > What you do is a kernel patch that introduces the v2 type for target and
> > defines a set of new sysfs files to describe what v2 is. On this new
> > computer you report your usual main memory as v1 and your new memory as
> > v2.
> >
> > So the application that only knew about v1 will keep using any v1 memory
> > on your new platform but it will not use any of the new memory v2 which
> > is what you want to happen. You do not have to break existing
> > applications while allowing to add new types of memory.
>
> So the knowledge that v1 is coherent and v2 is non-coherent is within
> the application? That seems really complicated from the application's
> point of view. Will the v1 and v2 definitions be arch and system
> dependent?

No, the idea was that kernel version X, like 4.20, would define what v1
means. Then once v2 is added it would define what that means. Memory
that has the v1 properties would get v1 as a prefix and memory that has
the v2 properties would get v2 as a prefix.

An application that was written at 4.20 time, and thus only knew about
v1, would only look for the v1 folders and thus only get memory it does
understand. This is kind of a moot discussion as i will switch to a mask
file inside the directory per Logan's advice.

> If we want to encode properties of a target and initiator we should do
> that as files within these directories. Something like
> 'is_cache_coherent' in the target directory can be used to identify
> whether the target is cache coherent or not?

My objection and fear is that applications would overlook new properties
that the application needs to understand to safely use a new type of
memory. Thus an old application might start using weird memory on a new
platform and break in unexpected ways. This was the whole rationale and
motivation behind my choice. I will switch to a set of flags in a file
in the target directory and rely on sane userspace behavior.

Cheers,
Jérôme
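From the userspace side, the "set of flags in a file" approach sketched
above would boil down to something like the following. The file name,
the flag bit and the hex encoding are all made-up details used only to
show the intended behaviour of a well-behaved application:

    /*
     * Sketch: read a per-target properties mask and refuse to use any
     * target that carries bits this program does not understand.
     */
    #include <stdint.h>
    #include <stdio.h>

    #define HMS_TARGET_CACHE_COHERENT  (1u << 0)   /* illustrative bit */
    #define HMS_TARGET_KNOWN_FLAGS     (HMS_TARGET_CACHE_COHERENT)

    static int target_is_safe(const char *flags_path)
    {
            FILE *f = fopen(flags_path, "r");
            unsigned long flags;

            if (!f)
                    return 0;
            if (fscanf(f, "%lx", &flags) != 1)
                    flags = ~0ul;          /* unreadable: be conservative */
            fclose(f);

            /* Any unknown bit means "do not touch this memory". */
            return (flags & ~HMS_TARGET_KNOWN_FLAGS) == 0;
    }

The versioned-directory scheme and the flags-mask scheme encode the same
rule; the mask simply moves the "ignore what you do not understand"
check from the directory name into a property file.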
On Mon, Dec 03, 2018 at 06:34:57PM -0500, jglisse@redhat.com wrote: > From: Jérôme Glisse <jglisse@redhat.com> > > Add documentation to what is HMS and what it is for (see patch content). > > Signed-off-by: Jérôme Glisse <jglisse@redhat.com> > Cc: Rafael J. Wysocki <rafael@kernel.org> > Cc: Ross Zwisler <ross.zwisler@linux.intel.com> > Cc: Dan Williams <dan.j.williams@intel.com> > Cc: Dave Hansen <dave.hansen@intel.com> > Cc: Haggai Eran <haggaie@mellanox.com> > Cc: Balbir Singh <balbirs@au1.ibm.com> > Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> > Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> > Cc: Felix Kuehling <felix.kuehling@amd.com> > Cc: Philip Yang <Philip.Yang@amd.com> > Cc: Christian König <christian.koenig@amd.com> > Cc: Paul Blinzer <Paul.Blinzer@amd.com> > Cc: Logan Gunthorpe <logang@deltatee.com> > Cc: John Hubbard <jhubbard@nvidia.com> > Cc: Ralph Campbell <rcampbell@nvidia.com> > Cc: Michal Hocko <mhocko@kernel.org> > Cc: Jonathan Cameron <jonathan.cameron@huawei.com> > Cc: Mark Hairgrove <mhairgrove@nvidia.com> > Cc: Vivek Kini <vkini@nvidia.com> > Cc: Mel Gorman <mgorman@techsingularity.net> > Cc: Dave Airlie <airlied@redhat.com> > Cc: Ben Skeggs <bskeggs@redhat.com> > Cc: Andrea Arcangeli <aarcange@redhat.com> > --- > Documentation/vm/hms.rst | 275 ++++++++++++++++++++++++++++++++++----- > 1 file changed, 246 insertions(+), 29 deletions(-) This document describes userspace API and it's better to put it into Documentation/admin-guide/mm. The Documentation/vm is more for description of design and implementation. I've spotted a couple of typos, but I think it doesn't make sense to nitpick about them before v10 or so ;-) > diff --git a/Documentation/vm/hms.rst b/Documentation/vm/hms.rst > index dbf0f71918a9..bd7c9e8e7077 100644 > --- a/Documentation/vm/hms.rst > +++ b/Documentation/vm/hms.rst
On 2018-12-04 7:37 p.m., Jerome Glisse wrote: >> >> This came up before for apis even better defined than HMS as well as >> more limited scope, i.e. experimental ABI availability only for -rc >> kernels. Linus said this: >> >> "There are no loopholes. No "but it's been only one release". No, no, >> no. The whole point is that users are supposed to be able to *trust* >> the kernel. If we do something, we keep on doing it. >> >> And if it makes it harder to add new user-visible interfaces, then >> that's a *good* thing." [1] >> >> The takeaway being don't land work-in-progress ABIs in the kernel. >> Once an application depends on it, there are no more incompatible >> changes possible regardless of the warnings, experimental notices, or >> "staging" designation. DAX is experimental because there are cases >> where it currently does not work with respect to another kernel >> feature like xfs-reflink, RDMA. The plan is to fix those, not continue >> to hide behind an experimental designation, and fix them in a way that >> preserves the user visible behavior that has already been exposed, >> i.e. no regressions. >> >> [1]: https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2017-August/004742.html > > So i guess i am heading down the vXX road ... such is my life :) I recommend against it. I really haven't been convinced by any of your arguments for having a second topology tree. The existing topology tree in sysfs already better describes the links between hardware right now, except for the missing GPU links (and those should be addressable within the GPU community). Plus, maybe, some other enhancements to sockets/numa node descriptions if there's something missing there. Then, 'hbind' is another issue but I suspect it would be better implemented as an ioctl on existing GPU interfaces. I certainly can't see any benefit in using it myself. It's better to take an approach that would be less controversial with the community than to brow beat them with a patch set 20+ times until they take it. Logan
On 2018-12-04 7:31 p.m., Jerome Glisse wrote: > How can i express multiple link, or memory that is only accessible > by a subset of the devices/CPUs. In today model they are back in > assumption like everyone can access all the node which do not hold > in what i am trying to do. Well multiple links are easy when you have a 'link' bus. Just add another link device under the bus. Technically, the accessibility issue is already encoded in sysfs. For example, through the PCI tree you can determine which ACS bits are set and determine which devices are behind the same root bridge the same way we do in the kernel p2pdma subsystem. This is all bus specific which is fine, but if we want to change that, we should have a common way for existing buses to describe these attributes in the existing tree. The new 'link' bus devices would have to have some way to describe cases if memory isn't accessible in some way across it. But really, I would say the kernel is responsible for telling you when memory is accessible to a list of initiators, so it should be part of the checks in a theoretical hbind api. This is already the approach p2pdma takes in-kernel: we have functions that tell you if two PCI devices can talk to each other and we have functions to give you memory accessible by a set of devices. What we don't have is a special tree that p2pdma users have to walk through to determine accessibility. In my eye's, you are just conflating a bunch of different issues that are better solved independently in the existing frameworks we have. And if they were tackled individually, you'd have a much easier time getting them merged one by one. Logan
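For reference, the in-kernel p2pdma pattern described above looks
roughly like the sketch below. Error handling is trimmed and the exact
signatures should be treated as approximate to the p2pdma API of that
era rather than authoritative:

    /*
     * Sketch: check that a provider's p2p memory is reachable by a set
     * of client devices, then allocate from it.
     */
    #include <linux/pci.h>
    #include <linux/pci-p2pdma.h>

    static void *alloc_for_clients(struct pci_dev *provider,
                                   struct device **clients, int nr_clients,
                                   size_t size)
    {
            /* A negative distance means at least one client cannot reach it. */
            if (pci_p2pdma_distance_many(provider, clients, nr_clients, true) < 0)
                    return NULL;

            return pci_alloc_p2pmem(provider, size);
    }

The accessibility check and the allocation are both kernel-side, which
is exactly the contrast being drawn with a userspace-visible topology
tree.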
On Wed, Dec 05, 2018 at 10:25:31AM -0700, Logan Gunthorpe wrote:
> On 2018-12-04 7:37 p.m., Jerome Glisse wrote:
> >> This came up before for apis even better defined than HMS as well as
> >> more limited scope, i.e. experimental ABI availability only for -rc
> >> kernels. Linus said this:
> >>
> >> "There are no loopholes. No "but it's been only one release". No, no,
> >> no. The whole point is that users are supposed to be able to *trust*
> >> the kernel. If we do something, we keep on doing it.
> >>
> >> And if it makes it harder to add new user-visible interfaces, then
> >> that's a *good* thing." [1]
> >>
> >> The takeaway being don't land work-in-progress ABIs in the kernel.
> >> Once an application depends on it, there are no more incompatible
> >> changes possible regardless of the warnings, experimental notices, or
> >> "staging" designation. DAX is experimental because there are cases
> >> where it currently does not work with respect to another kernel
> >> feature like xfs-reflink, RDMA. The plan is to fix those, not continue
> >> to hide behind an experimental designation, and fix them in a way that
> >> preserves the user visible behavior that has already been exposed,
> >> i.e. no regressions.
> >>
> >> [1]: https://lists.linuxfoundation.org/pipermail/ksummit-discuss/2017-August/004742.html
> >
> > So i guess i am heading down the vXX road ... such is my life :)
>
> I recommend against it. I really haven't been convinced by any of your
> arguments for having a second topology tree. The existing topology tree
> in sysfs already better describes the links between hardware right now,
> except for the missing GPU links (and those should be addressable within
> the GPU community). Plus, maybe, some other enhancements to sockets/numa
> node descriptions if there's something missing there.
>
> Then, 'hbind' is another issue but I suspect it would be better
> implemented as an ioctl on existing GPU interfaces. I certainly can't
> see any benefit in using it myself.
>
> It's better to take an approach that would be less controversial with
> the community than to brow beat them with a patch set 20+ times until
> they take it.

So here is what i am gonna do, because i need this code now. I am gonna
split the helper code that does policy and hbind out from its sysfs
peerage, and i am gonna turn it into helpers that each device driver can
use. I will move the sysfs and syscall parts into a patchset of their
own which uses the exact same infrastructure.

This means that i am losing a feature, as userspace can no longer
provide a list of multiple device memories to use (which is much more
common than you might think), but at least i can provide something for
the single device case through an ioctl.

I am not giving up on sysfs or the syscall as these are needed long
term, so i am gonna improve them, port existing userspace (OpenCL, ROCm,
...) to use them (in a branch) and demonstrate how they get used by end
applications. I will beat it again and again until either i convince
people through hard evidence or i get bored. I do not get bored
easily :)

Cheers,
Jérôme
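The "single device case through an ioctl" fallback mentioned above could
look something like this. The ioctl number, the structure and the idea
of issuing it on a GPU file descriptor are purely hypothetical; no such
uapi exists:

    /*
     * Hypothetical per-driver binding ioctl for one device's memory.
     */
    #include <linux/ioctl.h>
    #include <stdint.h>

    struct hypothetical_hbind_args {
            uint64_t addr;      /* start of the virtual address range */
            uint64_t length;    /* length of the range in bytes */
            uint32_t flags;     /* best-effort placement hints */
            uint32_t pad;
    };

    #define HYPOTHETICAL_IOCTL_HBIND \
            _IOW('H', 0x01, struct hypothetical_hbind_args)

    /*
     * Usage would be ioctl(gpu_fd, HYPOTHETICAL_IOCTL_HBIND, &args); the
     * driver may still place the range in main memory, exactly like the
     * syscall variant, so the semantics stay best-effort.
     */

The obvious limitation, which is the feature loss being conceded above,
is that a file descriptor tied to one device cannot express "place this
range on any of these N devices' memories".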
On Wed, Dec 05, 2018 at 10:41:56AM -0700, Logan Gunthorpe wrote:
> On 2018-12-04 7:31 p.m., Jerome Glisse wrote:
> > How can i express multiple links, or memory that is only accessible by
> > a subset of the devices/CPUs ? In today's model there is a baked-in
> > assumption that everyone can access all the nodes, which does not hold
> > in what i am trying to do.
>
> Well multiple links are easy when you have a 'link' bus. Just add
> another link device under the bus.

So you are telling me to do what i am doing in this patchset, just not
under the HMS directory ?

> Technically, the accessibility issue is already encoded in sysfs. For
> example, through the PCI tree you can determine which ACS bits are set
> and determine which devices are behind the same root bridge the same way
> we do in the kernel p2pdma subsystem. This is all bus specific which is
> fine, but if we want to change that, we should have a common way for
> existing buses to describe these attributes in the existing tree. The
> new 'link' bus devices would have to have some way to describe cases if
> memory isn't accessible in some way across it.

What i am looking at is much more complex than just an access bit. It is
a whole set of properties attached to each path (can it be cache
coherent ? can it do atomics ? what is the access granularity ? what is
the bandwidth ? is it a dedicated link ? ...).

> But really, I would say the kernel is responsible for telling you when
> memory is accessible to a list of initiators, so it should be part of
> the checks in a theoretical hbind api. This is already the approach
> p2pdma takes in-kernel: we have functions that tell you if two PCI
> devices can talk to each other and we have functions to give you memory
> accessible by a set of devices. What we don't have is a special tree
> that p2pdma users have to walk through to determine accessibility.

You do not need it, but i do need it. There are users out there that are
already depending on this information and getting it through non
standard ways. I do want to provide a standard way for userspace to get
this. They are real users and i believe there would be more users if we
had a standard way to provide it. You do not believe in it, fine. I will
do more work in userspace and more examples and i will come back with
more hard evidence until i convince enough people.

> In my eye's, you are just conflating a bunch of different issues that
> are better solved independently in the existing frameworks we have. And
> if they were tackled individually, you'd have a much easier time getting
> them merged one by one.

I don't think i can convince you otherwise. There are users that use
topology, please look at the links i provided: those folks have running
programs _today_ that rely on non standard APIs and would like to move
toward a standard API, it would improve their life. On top of that i
argue that more people would use that information if it were available
to them. I agree that i have no hard evidence to back that up and that
it is just a feeling, but you can not disprove me either as this is a
chicken and egg problem: you can not prove people will not use an API if
the API is not there to be used.

Cheers,
Jérôme
On 2018-12-05 11:07 a.m., Jerome Glisse wrote: >> Well multiple links are easy when you have a 'link' bus. Just add >> another link device under the bus. > > So you are telling do what i am doing in this patchset but not under > HMS directory ? No, it's completely different. I'm talking about creating a bus to describe only the real hardware that links GPUs. Not creating a new virtual tree containing a bunch of duplicate bus and device information that already exists currently in sysfs. >> >> Technically, the accessibility issue is already encoded in sysfs. For >> example, through the PCI tree you can determine which ACS bits are set >> and determine which devices are behind the same root bridge the same way >> we do in the kernel p2pdma subsystem. This is all bus specific which is >> fine, but if we want to change that, we should have a common way for >> existing buses to describe these attributes in the existing tree. The >> new 'link' bus devices would have to have some way to describe cases if >> memory isn't accessible in some way across it. > > What i am looking at is much more complex than just access bit. It > is a whole set of properties attach to each path (can it be cache > coherent ? can it do atomic ? what is the access granularity ? what > is the bandwidth ? is it dedicated link ? ...) I'm not talking about just an access bit. I'm talking about what you are describing: standard ways for *existing* buses in the sysfs hierarchy to describe things like cache coherency, atomics, granularity, etc without creating a new hierarchy. > On top of that i argue that more people would use that information if it > were available to them. I agree that i have no hard evidence to back that > up and that it is just a feeling but you can not disprove me either as > this is a chicken and egg problem, you can not prove people will not use > an API if the API is not there to be use. And you miss my point that much of this information is already available to them. And more can be added in the existing framework without creating any brand new concepts. I haven't said anything about chicken-and-egg problems -- I've given you a bunch of different suggestions to split this up into more managable problems and address many of them within the APIs and frameworks we have already. Logan
On Wed, Dec 05, 2018 at 11:20:30AM -0700, Logan Gunthorpe wrote:
> On 2018-12-05 11:07 a.m., Jerome Glisse wrote:
> >> Well multiple links are easy when you have a 'link' bus. Just add
> >> another link device under the bus.
> >
> > So you are telling me to do what i am doing in this patchset, just not
> > under the HMS directory ?
>
> No, it's completely different. I'm talking about creating a bus to
> describe only the real hardware that links GPUs. Not creating a new
> virtual tree containing a bunch of duplicate bus and device information
> that already exists currently in sysfs.
>
> >> Technically, the accessibility issue is already encoded in sysfs. For
> >> example, through the PCI tree you can determine which ACS bits are set
> >> and determine which devices are behind the same root bridge the same way
> >> we do in the kernel p2pdma subsystem. This is all bus specific which is
> >> fine, but if we want to change that, we should have a common way for
> >> existing buses to describe these attributes in the existing tree. The
> >> new 'link' bus devices would have to have some way to describe cases if
> >> memory isn't accessible in some way across it.
> >
> > What i am looking at is much more complex than just an access bit. It is
> > a whole set of properties attached to each path (can it be cache
> > coherent ? can it do atomics ? what is the access granularity ? what is
> > the bandwidth ? is it a dedicated link ? ...).
>
> I'm not talking about just an access bit. I'm talking about what you are
> describing: standard ways for *existing* buses in the sysfs hierarchy to
> describe things like cache coherency, atomics, granularity, etc without
> creating a new hierarchy.
>
> > On top of that i argue that more people would use that information if it
> > were available to them. I agree that i have no hard evidence to back that
> > up and that it is just a feeling, but you can not disprove me either as
> > this is a chicken and egg problem: you can not prove people will not use
> > an API if the API is not there to be used.
>
> And you miss my point that much of this information is already available
> to them. And more can be added in the existing framework without
> creating any brand new concepts. I haven't said anything about
> chicken-and-egg problems -- I've given you a bunch of different
> suggestions to split this up into more managable problems and address
> many of them within the APIs and frameworks we have already.

The thing is that what i am considering is not in sysfs, it does not
even have a linux kernel driver. It is just chips that connect devices
between them, and there is nothing to do with those chips: it is all
hardware, they do not need a driver. So there is nothing existing that
addresses what i need to represent.

If i add a fake driver for those, what would i do ? Under which
sub-system do i register them ? How do i express the fact that they
connect devices X, Y and Z with some properties ? This is not PCIE ...
you can not discover bridges and children, it is not a tree-like
structure, it is a random graph (which depends on how the OEM wires the
ports on the chips).

So i have no pre-existing driver, they are not in sysfs today and they
do not need a driver. Hence why i proposed what i proposed: a sysfs
hierarchy where i can add those "virtual" objects and show how they
connect existing devices, for which we have a sysfs directory to
symlink.

Cheers,
Jérôme
On 2018-12-05 11:33 a.m., Jerome Glisse wrote: > If i add a a fake driver for those what would i do ? under which > sub-system i register them ? How i express the fact that they > connect device X,Y and Z with some properties ? Yes this is exactly what I'm suggesting. I wouldn't call it a fake driver, but a new struct device describing an actual device in the system. It would be a feature of the GPU subsystem seeing this is a feature of GPUs. Expressing that the new devices connect to a specific set of GPUs is not a hard problem to solve. > This is not PCIE ... you can not discover bridges and child, it > not a tree like structure, it is a random graph (which depends > on how the OEM wire port on the chips). You must be able to discover that these links exist and register a device with the system. Where else do you get the information currently? The suggestion doesn't change anything to do with how you interact with hardware, only how you describe the information within the kernel. > So i have not pre-existing driver, they are not in sysfs today and > they do not need a driver. Hence why i proposed what i proposed > a sysfs hierarchy where i can add those "virtual" object and shows > how they connect existing device for which we have a sysfs directory > to symlink. So add a new driver -- that's what I've been suggesting all along. Having a driver not exist is no reason to not create one. I'd suggest that if you want them to show up in the sysfs hierarchy then you do need some kind of driver code to create a struct device. Just because the kernel doesn't have to interact with them is no reason not to create a struct device. It's *much* easier to create a new driver subsystem than a whole new userspace API. Logan
On Wed, Dec 05, 2018 at 11:48:37AM -0700, Logan Gunthorpe wrote: > > > On 2018-12-05 11:33 a.m., Jerome Glisse wrote: > > If i add a a fake driver for those what would i do ? under which > > sub-system i register them ? How i express the fact that they > > connect device X,Y and Z with some properties ? > > Yes this is exactly what I'm suggesting. I wouldn't call it a fake > driver, but a new struct device describing an actual device in the > system. It would be a feature of the GPU subsystem seeing this is a > feature of GPUs. Expressing that the new devices connect to a specific > set of GPUs is not a hard problem to solve. > > > This is not PCIE ... you can not discover bridges and child, it > > not a tree like structure, it is a random graph (which depends > > on how the OEM wire port on the chips). > > You must be able to discover that these links exist and register a > device with the system. Where else do you get the information currently? > The suggestion doesn't change anything to do with how you interact with > hardware, only how you describe the information within the kernel. > > > So i have not pre-existing driver, they are not in sysfs today and > > they do not need a driver. Hence why i proposed what i proposed > > a sysfs hierarchy where i can add those "virtual" object and shows > > how they connect existing device for which we have a sysfs directory > > to symlink. > > So add a new driver -- that's what I've been suggesting all along. > Having a driver not exist is no reason to not create one. I'd suggest > that if you want them to show up in the sysfs hierarchy then you do need > some kind of driver code to create a struct device. Just because the > kernel doesn't have to interact with them is no reason not to create a > struct device. It's *much* easier to create a new driver subsystem than > a whole new userspace API. So now once next type of device shows up with the exact same thing let say FPGA, we have to create a new subsystem for them too. Also this make the userspace life much much harder. Now userspace must go parse PCIE, subsystem1, subsystem2, subsystemN, NUMA, ... and merge all that different information together and rebuild the representation i am putting forward in this patchset in userspace. There is no telling that kernel won't be able to provide quirk and workaround because some merging is actually illegal on a given platform (like some link from a subsystem is not accessible through the PCI connection of one of the device connected to that link). So it means userspace will have to grow its own database or work- around and quirk and i am back in the situation i am in today. Not very convincing to me. What i am proposing here is a new common description provided by the kernel where we can reconciliate weird interaction. But i doubt i can convince you i will make progress on what i need today and keep working on sysfs. Cheers, Jérôme
On 2018-12-05 11:55 a.m., Jerome Glisse wrote: > So now once next type of device shows up with the exact same thing > let say FPGA, we have to create a new subsystem for them too. Also > this make the userspace life much much harder. Now userspace must > go parse PCIE, subsystem1, subsystem2, subsystemN, NUMA, ... and > merge all that different information together and rebuild the > representation i am putting forward in this patchset in userspace. Yes. But seeing such FPGA links aren't common yet and there isn't really much in terms of common FPGA infrastructure in the kernel (which are hard seeing the hardware is infinitely customization) you can let the people developing FPGA code worry about it and come up with their own solution. Buses between FPGAs may end up never being common enough for people to care, or they may end up being so weird that they need their own description independent of GPUS, or maybe when they become common they find a way to use the GPU link subsystem -- who knows. Don't try to design for use cases that don't exist yet. Yes, userspace will have to know about all the buses it cares to find links over. Sounds like a perfect thing for libhms to do. > There is no telling that kernel won't be able to provide quirk and > workaround because some merging is actually illegal on a given > platform (like some link from a subsystem is not accessible through > the PCI connection of one of the device connected to that link). These are all just different individual problems which need different solutions not grand new design concepts. > So it means userspace will have to grow its own database or work- > around and quirk and i am back in the situation i am in today. No, as I've said, quirks are firmly the responsibility of kernels. Userspace will need to know how to work with the different buses and CPU/node information but there really isn't that many of these to deal with and this is a much easier approach than trying to come up with a new API that can wrap the nuances of all existing and potential future bus types we may have to deal with. Logan
On Wed, Dec 05, 2018 at 12:10:10PM -0700, Logan Gunthorpe wrote: > > > On 2018-12-05 11:55 a.m., Jerome Glisse wrote: > > So now once next type of device shows up with the exact same thing > > let say FPGA, we have to create a new subsystem for them too. Also > > this make the userspace life much much harder. Now userspace must > > go parse PCIE, subsystem1, subsystem2, subsystemN, NUMA, ... and > > merge all that different information together and rebuild the > > representation i am putting forward in this patchset in userspace. > > Yes. But seeing such FPGA links aren't common yet and there isn't really > much in terms of common FPGA infrastructure in the kernel (which are > hard seeing the hardware is infinitely customization) you can let the > people developing FPGA code worry about it and come up with their own > solution. Buses between FPGAs may end up never being common enough for > people to care, or they may end up being so weird that they need their > own description independent of GPUS, or maybe when they become common > they find a way to use the GPU link subsystem -- who knows. Don't try to > design for use cases that don't exist yet. > > Yes, userspace will have to know about all the buses it cares to find > links over. Sounds like a perfect thing for libhms to do. So just to be clear here is how i understand your position: "Single coherent sysfs hierarchy to describe something is useless let's git rm drivers/base/" While i am arguing that "hey the /sys/bus/node/devices/* is nice but it just does not cut it for all this new hardware platform if i add new nodes there for my new memory i will break tons of existing application. So what about a new hierarchy that allow to describe those new hardware platform in a single place like today node thing" > > > There is no telling that kernel won't be able to provide quirk and > > workaround because some merging is actually illegal on a given > > platform (like some link from a subsystem is not accessible through > > the PCI connection of one of the device connected to that link). > > These are all just different individual problems which need different > solutions not grand new design concepts. > > > So it means userspace will have to grow its own database or work- > > around and quirk and i am back in the situation i am in today. > > No, as I've said, quirks are firmly the responsibility of kernels. > Userspace will need to know how to work with the different buses and > CPU/node information but there really isn't that many of these to deal > with and this is a much easier approach than trying to come up with a > new API that can wrap the nuances of all existing and potential future > bus types we may have to deal with. No can do that is what i am trying to explain. So if i bus 1 in a sub-system A and usualy that kind of bus can serve a bridge for PCIE ie a CPU can access device behind it by going through a PCIE device first. So now the userspace libary have this knowledge bake in. Now if a platform has a bug for whatever reasons where that does not hold, the kernel has no way to tell userspace that there is an exception there. It is up to userspace to have a data base of quirks. Kernel see all those objects in isolation in your scheme. While in what i am proposing there is only one place and any device that participate in this common place can report any quirks so that a coherent view is given to user space. 
If we have gazillion of places where all this informations is spread around than we have no way to fix weird inter-action between any of those. Cheers, Jérôme
On 2018-12-05 3:58 p.m., Jerome Glisse wrote: > So just to be clear here is how i understand your position: > "Single coherent sysfs hierarchy to describe something is useless > let's git rm drivers/base/" I have no idea what you're talking about. I'm saying the existing sysfs hierarchy *should* be used for this application -- we shouldn't be creating another hierarchy. > While i am arguing that "hey the /sys/bus/node/devices/* is nice > but it just does not cut it for all this new hardware platform > if i add new nodes there for my new memory i will break tons of > existing application. So what about a new hierarchy that allow > to describe those new hardware platform in a single place like > today node thing" I'm talking about /sys/bus and all the bus information under there; not just the node hierarchy. With this information, you can figure out how any struct device is connected to another struct device. This has little to do with a hypothetical memory device and what it might expose. You're conflating memory devices with links between devices (ie. buses). > No can do that is what i am trying to explain. So if i bus 1 in a > sub-system A and usualy that kind of bus can serve a bridge for > PCIE ie a CPU can access device behind it by going through a PCIE > device first. So now the userspace libary have this knowledge > bake in. Now if a platform has a bug for whatever reasons where > that does not hold, the kernel has no way to tell userspace that > there is an exception there. It is up to userspace to have a data > base of quirks. > Kernel see all those objects in isolation in your scheme. While in > what i am proposing there is only one place and any device that > participate in this common place can report any quirks so that a > coherent view is given to user space. The above makes no sense to me. > If we have gazillion of places where all this informations is spread > around than we have no way to fix weird inter-action between any > of those. So work to standardize it so that all buses present a consistent view of what guarantees they provide for bus accesses. Quirks could then adjust that information for systems that may be broken. Logan
On Wed, Dec 05, 2018 at 04:09:29PM -0700, Logan Gunthorpe wrote: > > > On 2018-12-05 3:58 p.m., Jerome Glisse wrote: > > So just to be clear here is how i understand your position: > > "Single coherent sysfs hierarchy to describe something is useless > > let's git rm drivers/base/" > > I have no idea what you're talking about. I'm saying the existing sysfs > hierarchy *should* be used for this application -- we shouldn't be > creating another hierarchy. > > > While i am arguing that "hey the /sys/bus/node/devices/* is nice > > but it just does not cut it for all this new hardware platform > > if i add new nodes there for my new memory i will break tons of > > existing application. So what about a new hierarchy that allow > > to describe those new hardware platform in a single place like > > today node thing" > > I'm talking about /sys/bus and all the bus information under there; not > just the node hierarchy. With this information, you can figure out how > any struct device is connected to another struct device. This has little > to do with a hypothetical memory device and what it might expose. You're > conflating memory devices with links between devices (ie. buses). And my proposal is under /sys/bus and have symlink to all existing device it agregate in there. For device memory i explained why it does not make sense to expose it as node. So now how do i expose it ? Yes i can expose it under the device directory but then i can not present the properties of that memory which depends on through which bus and through which bridges it is accessed. So i need bus and bridge objects so that i can express the properties that depends on the path between the initiator and the target memory. I argue it is better to expose all this under the same directory. You say it is not. We NUMA as an example that shows everything under a single hierarchy so to me you are saying it is useless and has no value. I say the NUMA thing has value and i would like something like it just with more stuff and with the capability of doing any kind of graph. I just do not see how i can achieve my objectives any differently. I think we are just talking past each other and this is likely a pointless conversation. I will keep working on this in the meantime. > > No can do that is what i am trying to explain. So if i bus 1 in a > > sub-system A and usualy that kind of bus can serve a bridge for > > PCIE ie a CPU can access device behind it by going through a PCIE > > device first. So now the userspace libary have this knowledge > > bake in. Now if a platform has a bug for whatever reasons where > > that does not hold, the kernel has no way to tell userspace that > > there is an exception there. It is up to userspace to have a data > > base of quirks. > > > Kernel see all those objects in isolation in your scheme. While in > > what i am proposing there is only one place and any device that > > participate in this common place can report any quirks so that a > > coherent view is given to user space. > > The above makes no sense to me. > > > > If we have gazillion of places where all this informations is spread > > around than we have no way to fix weird inter-action between any > > of those. > > So work to standardize it so that all buses present a consistent view of > what guarantees they provide for bus accesses. Quirks could then adjust > that information for systems that may be broken. So you agree with my proposal ? 
A sysfs directory in which all the bus and how they are connected to each other and what is connected to each of them (device, CPU, memory). THis is really confusing. Cheers, Jérôme
On 2018-12-05 4:20 p.m., Jerome Glisse wrote: > And my proposal is under /sys/bus and have symlink to all existing > device it agregate in there. That's so not the point. Use the existing buses don't invent some virtual tree. I don't know how many times I have to say this or in how many ways. I'm not responding anymore. > So you agree with my proposal ? A sysfs directory in which all the > bus and how they are connected to each other and what is connected > to each of them (device, CPU, memory). I'm fine with the motivation. What I'm arguing against is the implementation and the fact you have to create a whole grand new userspace API and hierarchy to accomplish it. Logan
On Wed, Dec 05, 2018 at 04:23:42PM -0700, Logan Gunthorpe wrote: > > > On 2018-12-05 4:20 p.m., Jerome Glisse wrote: > > And my proposal is under /sys/bus and have symlink to all existing > > device it agregate in there. > > That's so not the point. Use the existing buses don't invent some > virtual tree. I don't know how many times I have to say this or in how > many ways. I'm not responding anymore. And how do i express interaction with different buses because i just do not see how to do that in the existing scheme. It would be like teaching to each bus about all the other bus versus having each bus register itself under a common framework and have all the interaction between bus mediated through that common framework avoiding code duplication accross buses. > > > So you agree with my proposal ? A sysfs directory in which all the > > bus and how they are connected to each other and what is connected > > to each of them (device, CPU, memory). > > I'm fine with the motivation. What I'm arguing against is the > implementation and the fact you have to create a whole grand new > userspace API and hierarchy to accomplish it. > > Logan
On Wed, Dec 5, 2018 at 3:27 PM Jerome Glisse <jglisse@redhat.com> wrote: > > On Wed, Dec 05, 2018 at 04:23:42PM -0700, Logan Gunthorpe wrote: > > > > > > On 2018-12-05 4:20 p.m., Jerome Glisse wrote: > > > And my proposal is under /sys/bus and have symlink to all existing > > > device it agregate in there. > > > > That's so not the point. Use the existing buses don't invent some > > virtual tree. I don't know how many times I have to say this or in how > > many ways. I'm not responding anymore. > > And how do i express interaction with different buses because i just > do not see how to do that in the existing scheme. It would be like > teaching to each bus about all the other bus versus having each bus > register itself under a common framework and have all the interaction > between bus mediated through that common framework avoiding code > duplication accross buses. > > > > > > So you agree with my proposal ? A sysfs directory in which all the > > > bus and how they are connected to each other and what is connected > > > to each of them (device, CPU, memory). > > > > I'm fine with the motivation. What I'm arguing against is the > > implementation and the fact you have to create a whole grand new > > userspace API and hierarchy to accomplish it. Right, GPUs show up in /sys today. Don't register a whole new hierarchy as an alias to what already exists, add a new attribute scheme to the existing hierarchy. This is what the HMAT enabling is doing, this is what p2pdma is doing.
diff --git a/Documentation/vm/hms.rst b/Documentation/vm/hms.rst
index dbf0f71918a9..bd7c9e8e7077 100644
--- a/Documentation/vm/hms.rst
+++ b/Documentation/vm/hms.rst
@@ -4,32 +4,249 @@ Heterogeneous Memory System (HMS)
 =================================
-System with complex memory topology needs a more versatile memory topology
-description than just node where a node is a collection of memory and CPU.
-In heterogeneous memory system we consider four types of object::
- - target: which is any kind of memory
- - initiator: any kind of device or CPU
- - inter-connect: any kind of links that connects target and initiator
- - bridge: a link between two inter-connects
-Properties (like bandwidth, latency, bus width, ...) are define per bridge
-and per inter-connect. Property of an inter-connect apply to all initiators
-which are link to that inter-connect. Not all initiators are link to all
-inter-connect and thus not all initiators can access all memory (this apply
-to CPU too ie some CPU might not be able to access all memory).
-Bridges allow initiators (that can use the bridge) to access target for
-which they do not have a direct link with (ie they do not share a common
-inter-connect with the target).
-Through this four types of object we can describe any kind of system memory
-topology. To expose this to userspace we expose a new sysfs hierarchy (that
-co-exist with the existing one)::
- - /sys/bus/hms/target* all targets in the system
- - /sys/bus/hms/initiator* all initiators in the system
- - /sys/bus/hms/interconnect* all inter-connects in the system
- - /sys/bus/hms/bridge* all bridges in the system
-Inside each bridge or inter-connect directory they are symlinks to targets
-and initiators that are linked to that bridge or inter-connect. Properties
-are defined inside bridge and inter-connect directory.
+Heterogeneous memory systems are becoming the norm. In those systems
+there is not only the main system memory for each node, but also
+device memory and/or a memory hierarchy to consider. Device memory can
+come from a device like a GPU or FPGA, or from a memory only device
+(persistent memory, or a high density memory device).
+
+A memory hierarchy is when you not only have the main memory but also
+other types of memory, like HBM (High Bandwidth Memory, often stacked
+on the CPU or GPU die), persistent memory or high density memory (ie
+something slower than a regular DDR DIMM but much bigger).
+
+On top of this diversity of memories you also have to account for the
+system bus topology, ie how all CPUs and devices are connected to each
+other. Userspace does not care about the exact physical topology but
+cares about the topology from a behavioral point of view: what are all
+the paths between an initiator (anything that can initiate memory
+access, like a CPU, GPU, FPGA, network controller, ...) and a target
+memory, and what are the properties of each of those paths (bandwidth,
+latency, granularity, ...).
+
+This means that it is no longer sufficient to consider a flat view for
+each node in a system; for maximum performance we need to account for
+all of this new memory and also for the system topology. This is why
+this proposal is unlike the HMAT proposal [1], which tries to extend
+the existing NUMA model to new types of memory. Here we are tackling a
+much more profound change that departs from NUMA.
+
+
+One of the reasons for such a radical change is that the advance of
+accelerators like GPUs or FPGAs means that the CPU is no longer the
+only place where computation happens. It is becoming more and more
+common for an application to mix and match different accelerators to
+perform its computation. So we can no longer satisfy ourselves with a
+CPU centric and flat view of a system like NUMA and NUMA distances.
+
+
+HMS tackles these problems through three aspects:
+    1 - Expose complex system topology and the various kinds of memory
+        to user space, so that applications have a standard way and a
+        single place to get all the information they care about.
+    2 - A new API for user space to bind or provide hints to the
+        kernel about which memory to use for a range of virtual
+        addresses (the new hbind() syscall).
+    3 - Kernel side changes to the vm policy code to handle the above.
+
+
+The rest of this document is split into 3 sections. The first section
+talks about complex system topologies: what they are, how they are
+used today and how to describe them tomorrow. The second section talks
+about the new API to bind or provide hints to the kernel for a range
+of virtual addresses. The third section talks about the new mechanism
+that tracks the bindings/hints provided by user space or device
+drivers inside the kernel.
+
+
+1) Complex system topology and how to represent it
+====================================================
+
+Inside a node you can have a complex topology of memory. For instance
+you can have multiple HBM memories in a node, each HBM tied to a set
+of CPUs (all of which are in the same node). This means that you have
+a hierarchy of memory for the CPUs: the local fast HBM, which is
+expected to be relatively small compared to main memory, and then the
+main memory itself. New memory technologies might also deepen this
+hierarchy with another level of yet slower but gigantic memory (some
+persistent memory technologies might fall into that category). Another
+example is device memory; devices themselves can have a hierarchy,
+like HBM on top of the device cores plus main device memory.
+
+On top of that you can have multiple paths to access each memory and
+each path can have different properties (latency, bandwidth, ...).
+Also there is not always symmetry, ie some memory might only be
+accessible by some devices or CPUs and not by everyone.
+
+So a flat hierarchy for each node is not capable of representing this
+kind of complexity. To simplify the discussion, and because we do not
+want to single out CPUs from devices, from here on out we will use
+"initiator" to refer to either a CPU or a device. An initiator is any
+kind of CPU or device that can access memory (ie initiate memory
+access).
+
+At this point an example of such a system might help:
+    - 2 nodes, and for each node:
+      - 1 CPU with 2 complexes of CPU cores
+      - one HBM memory for each complex of CPU cores (200GB/s)
+      - CPU core complexes are linked to each other (100GB/s)
+      - main memory (90GB/s)
+      - 4 GPUs, each with:
+        - HBM memory (1000GB/s) (not CPU accessible)
+        - GDDR memory (500GB/s) (CPU accessible)
+        - a link to the CPU root controller (60GB/s)
+        - links to the other GPUs (even GPUs from the second
+          node) over a GPU link (400GB/s)
+
+In this example we restrict ourselves to bandwidth and ignore bus
+width and latency; this is just to simplify the discussion, but
+obviously they also factor in.
+
+
+Userspace very much would like to know this information. For instance
+HPC folks have developed complex libraries to manage this and there is
+wide research on the topic [2] [3] [4] [5]. Today most of the work is
+done by hardcoding things for a specific platform, which is somewhat
+acceptable for HPC folks where the platform stays the same for a long
+period of time.
+
+Roughly speaking I see two broad use cases for topology information.
+The first is virtualization, where you want to segment your hardware
+properly for each VM (binding memory, CPUs and GPUs that are all close
+to each other). The second is applications, many of which can
+partition their workload to minimize exchanges between partitions,
+allowing each partition to be bound to a subset of devices and CPUs
+that are close to each other (for maximum locality). Here it is much
+more than just NUMA distance: you can leverage the memory hierarchy
+and the system topology all together (see [2] [3] [4] [5] for more
+references and details).
+
+So this is not exposing topology just for the sake of cool graphs in
+userspace. There are active users of such information today, and if we
+want to grow and broaden its usage we should provide a unified API
+that standardizes how that information is accessible to everyone.
+
+
+One proposal so far to handle new types of memory is to use CPU-less
+nodes for them [6]. While the same idea can apply to device memory, it
+is still hard to describe multiple paths with different properties in
+such a scheme. While it is backward compatible and requires minimal
+changes, it simply cannot convey a complex topology (think any kind of
+random graph, not just a tree-like graph).
+
+So HMS uses a new way to expose the system topology to userspace. It
+relies on 4 types of objects:
+    - target: any kind of memory (main memory, HBM, device, ...)
+    - initiator: a CPU or device (anything that can access memory)
+    - link: anything that links initiators and targets
+    - bridge: anything that allows a group of initiators to access a
+      remote target (ie a target they are not connected to directly
+      through a link)
+
+Properties like bandwidth, latency, ... are all set per bridge and per
+link. All initiators connected to a link can access any target memory
+also connected to the same link, all with the same link properties.
+
+Links do not need to match physical hardware, ie a single physical
+link can be exposed as one or multiple software links. This makes it
+possible to model devices connected to the same physical link (PCIE
+for instance) but not with the same characteristics (like the number
+of lanes or the lane speed in PCIE). The reverse is also true, ie a
+single software-exposed link can match multiple physical links.
+
+Bridges allow initiators to access a remote link. A bridge connects
+two links to each other and is also specific to a list of initiators
+(ie not all initiators connected to each of the links can use the
+bridge). Bridges have their own properties (bandwidth, latency, ...),
+and the actual value of each property is the lowest common denominator
+between the bridge and each of the links.
+
+
+This model can describe any kind of directed graph and thus any kind
+of topology we might see in the future. It also makes it easy to add
+new properties to each object type.
+
+Moreover it can be used to expose devices capable of peer to peer
+access between each other. For that, simply have all devices capable
+of peer to peer share a common link, or use the bridge object if the
+peer to peer capability is only one way, for instance.
+
+
+HMS uses the above scheme to expose the system topology through sysfs
+under /sys/bus/hms/ with:
+    - /sys/bus/hms/devices/v%version-%id-target/ : a target memory;
+      each has a UID and you find the usual values in that folder
+      (node id, size, ...)
+
+    - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator
+      (CPU or device); each has an HMS UID, and for a CPU also a CPU
+      id (which matches the CPU id in /sys/bus/cpu/). For a device you
+      have a path, which can be the PCIE bus ID for instance.
+
+    - /sys/bus/hms/devices/v%version-%id-link : a link; each has a UID
+      and a file per property (bandwidth, latency, ...). You also find
+      a symlink to every target and initiator connected to that link.
+
+    - /sys/bus/hms/devices/v%version-%id-bridge : a bridge; each has a
+      UID and a file per property (bandwidth, latency, ...). You also
+      find a symlink to all initiators that can use that bridge.
+
+To help with forward compatibility each object has a version value,
+and it is mandatory for user space to only use targets or initiators
+whose version it supports. For instance if user space only knows what
+version 1 means and sees a target with version 2, then it must ignore
+that target as if it did not exist.
+
+Mandating that allows the addition of new properties that break
+backward compatibility, ie user space must know how the new property
+affects the object to be able to use it safely.
+
+The main memory of each node is exposed under a common target. For now
+device drivers are responsible for registering the memory they want to
+expose through that scheme, but in the future that information might
+come from the system firmware (this is a different discussion).
+
+
+
+2) hbind(): bind a range of virtual addresses to heterogeneous memory
+=======================================================================
+
+So instead of using a bitmap, hbind() takes an array of UIDs, where
+each UID identifies a unique memory target inside the new memory
+topology description. User space also provides an array of modifiers.
+A modifier can be seen as the flags parameter of mbind(), but here we
+use an array so that user space can supply not only a modifier but
+also a value with it. This should allow the API to grow more features
+in the future. The kernel should return -EINVAL if it is provided with
+an unknown modifier and ignore the call altogether, forcing user space
+to restrict itself to the modifiers supported by the kernel it is
+running on (I know I am dreaming about well behaved user space).
+
+
+Note that none of this is exclusive of automatic memory placement like
+autonuma. I also believe that we will see something similar to
+autonuma for device memory.
+
+
+3) Tracking and applying heterogeneous memory policies
+=======================================================
+
+The current memory policy infrastructure is node oriented. Instead of
+changing that and risking breakage and regressions, HMS adds a new
+heterogeneous policy tracking infrastructure. The expectation is that
+existing applications can keep using mbind() and all the existing
+infrastructure undisturbed and unaffected, while new applications will
+use the new API and should avoid mixing and matching both (as they can
+achieve the same thing with the new API).
+
+Also the policy is not directly tied to the vma structure, for a few
+reasons:
+    - avoid having to split a vma for a policy that does not cover the
+      full vma
+    - avoid changing too much vma code
+    - avoid growing the vma structure with an extra pointer
+
+The overall design is simple: on an hbind() call an HMS policy
+structure is created for the supplied range, and HMS uses the callback
+associated with the target memory. This callback is provided by the
+device driver for device memory, or by core HMS for regular main
+memory. The callback can decide to migrate the range to the target
+memories or do nothing (this can be influenced by flags provided to
+hbind() too).
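
To make the four-object model in section 1 of the patch a little more
concrete, here is a minimal kernel-side sketch of the kind of structures
such a directed graph could be built from. It is purely illustrative: the
structure and field names (hms_object, hms_target, hms_link, hms_bridge,
...) are assumptions for this sketch, not taken from the patchset.

/* Illustrative sketch only -- not the HMS patchset's actual code.
 * Every object carries the version + UID described above; links and
 * bridges carry the per-path properties and point at what they connect. */

struct hms_target_ops;			/* see the policy sketch further below */

struct hms_object {
	unsigned int version;		/* userspace ignores unknown versions */
	unsigned int uid;		/* unique id across all HMS objects */
};

struct hms_target {			/* any kind of memory */
	struct hms_object obj;
	unsigned long size;
	const struct hms_target_ops *ops;
};

struct hms_initiator {			/* CPU or device */
	struct hms_object obj;
	int cpu;			/* -1 for a device initiator */
};

struct hms_link {			/* connects initiators and targets */
	struct hms_object obj;
	unsigned long bandwidth, latency, granularity;
	struct hms_initiator **initiators;	/* symlinked in sysfs */
	struct hms_target **targets;
	unsigned int ninitiators, ntargets;
};

struct hms_bridge {		/* one-way connection between two links */
	struct hms_object obj;
	unsigned long bandwidth, latency;
	struct hms_link *from, *to;
	struct hms_initiator **initiators;	/* who may use the bridge */
	unsigned int ninitiators;
};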
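
And a rough userspace sketch of the enumeration and binding flow described
in sections 1 and 2 of the patch: scan /sys/bus/hms/devices, keep only the
targets whose version the program understands, then pass their UIDs to
hbind(). The hbind() prototype and the empty modifier array are assumptions
made for illustration; the real ABI is whatever the syscall patches end up
defining, so the call is stubbed out here.

/* Illustrative userspace sketch -- the hbind() prototype below is an
 * assumption, not the actual ABI proposed by the patchset. */
#include <dirent.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SUPPORTED_VERSION 1

/* Stand-in for the proposed syscall: always fails with ENOSYS here. */
static long hbind(void *addr, size_t len,
		  const uint32_t *targets, unsigned int ntargets,
		  const uint64_t *modifiers, unsigned int nmodifiers)
{
	(void)addr; (void)len; (void)targets; (void)ntargets;
	(void)modifiers; (void)nmodifiers;
	errno = ENOSYS;
	return -1;
}

/* Collect the UIDs of targets whose version we understand, ignoring any
 * object with a newer version as the document mandates. */
static unsigned int find_targets(uint32_t *uids, unsigned int max)
{
	DIR *dir = opendir("/sys/bus/hms/devices");
	struct dirent *d;
	unsigned int n = 0, version, uid;

	if (!dir)
		return 0;
	while (n < max && (d = readdir(dir)) != NULL) {
		/* Directory names look like v%version-%id-target. */
		if (sscanf(d->d_name, "v%u-%u-", &version, &uid) != 2)
			continue;
		if (!strstr(d->d_name, "-target"))
			continue;
		if (version > SUPPORTED_VERSION)
			continue;
		uids[n++] = uid;
	}
	closedir(dir);
	return n;
}

int main(void)
{
	uint32_t uids[16];
	unsigned int n = find_targets(uids, 16);
	size_t len = 2 * 1024 * 1024;
	void *buf = aligned_alloc(4096, len);

	/* Ask the kernel to place this range on the chosen targets,
	 * with an empty modifier array. */
	if (buf && n && hbind(buf, len, uids, n, NULL, 0))
		perror("hbind");
	free(buf);
	return 0;
}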
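
Finally, a sketch of the policy flow from section 3: hbind() records a
policy structure for the supplied range and then invokes the callback
registered for each requested target (provided by the device driver for
device memory, by core HMS for regular main memory). Again, the names
(hms_policy, hms_target_ops, hms_apply_policy) are invented for
illustration and reuse the hypothetical structures from the first sketch.

/* Illustrative kernel-side sketch of the section 3 flow -- not the
 * patchset's actual implementation. */

struct hms_policy {
	unsigned long start, end;	/* virtual address range */
	struct hms_target **targets;	/* requested targets, in order */
	unsigned int ntargets;
	unsigned long flags;		/* derived from hbind() modifiers */
};

struct hms_target_ops {
	/* Decide what to do for the range: migrate it now, or just record
	 * the hint and let later faults or background work act on it. */
	int (*bind)(struct hms_target *target, struct mm_struct *mm,
		    struct hms_policy *policy);
};

/* Called from the hbind() syscall once the policy has been recorded
 * for the range. */
static int hms_apply_policy(struct mm_struct *mm, struct hms_policy *policy)
{
	unsigned int i;
	int ret = 0;

	for (i = 0; i < policy->ntargets && !ret; i++) {
		struct hms_target *target = policy->targets[i];

		/* Device memory: driver callback.  Main memory: core HMS. */
		ret = target->ops->bind(target, mm, policy);
	}
	return ret;
}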