Message ID: 20181203233509.20671-1-jglisse@redhat.com (mailing list archive)
Series: Heterogeneous Memory System (HMS) and hbind()
On 12/4/18 5:04 AM, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
>
> Heterogeneous memory systems are becoming the norm. In those systems there is not only the main system memory for each node, but also device memory and/or a memory hierarchy to consider. Device memory can come from a device like a GPU or FPGA, or from a memory-only device (persistent memory, or a high density memory device).
>
> A memory hierarchy is when you not only have the main memory but also other types of memory, like HBM (High Bandwidth Memory, often stacked on the CPU die or GPU die), persistent memory, or high density memory (i.e. something slower than a regular DDR DIMM but much bigger).
>
> On top of this diversity of memories you also have to account for the system bus topology, i.e. how all CPUs and devices are connected to each other. Userspace does not care about the exact physical topology but does care about topology from a behavioral point of view, i.e. what are all the paths between an initiator (anything that can initiate memory access, like a CPU, GPU, FPGA, network controller, ...) and a target memory, and what are the properties of each of those paths (bandwidth, latency, granularity, ...).
>
> This means that it is no longer sufficient to consider a flat view for each node in a system; for maximum performance we need to account for all of this new memory and also for the system topology. This is why this proposal is unlike the HMAT proposal [1], which tries to extend the existing NUMA model for new types of memory. Here we are tackling a much more profound change that departs from NUMA.
>
> One of the reasons for a radical change is that the advance of accelerators like GPUs or FPGAs means that the CPU is no longer the only place where computation happens. It is becoming more and more common for an application to use a mix of different accelerators to perform its computation. So we can no longer satisfy ourselves with a CPU-centric and flat view of a system like NUMA and NUMA distance.
>
> This patchset is a proposal to tackle these problems through three aspects:
>     1 - Expose complex system topology and the various kinds of memory to
>         user space, so that applications have a standard way and a single
>         place to get all the information they care about.
>     2 - A new API for user space to bind/provide hints to the kernel on
>         which memory to use for a range of virtual addresses (a new
>         mbind()-style syscall).
>     3 - Kernel-side changes to the vm policy code to handle this.
>
> This patchset is not an end to end solution but it provides enough pieces to be useful against nouveau (the upstream open source driver for NVidia GPUs). It is intended as a starting point for discussion so that we can figure out what to do. To avoid having too many topics to discuss I am not considering memory cgroups for now, but it is definitely something we will want to integrate with.
>
> The rest of this email is split into 3 sections. The first section talks about complex system topology: what it is, how it is used today and how to describe it tomorrow. The second section talks about the new API to bind/provide hints to the kernel for a range of virtual addresses. The third section talks about the new mechanism to track bind/hint requests provided by user space or device drivers inside the kernel.
>
> 1) Complex system topology and representing it
> ------------------------------------------------
>
> Inside a node you can have a complex topology of memory. For instance you can have multiple HBM memories in a node, each HBM tied to a set of CPUs (all of which are in the same node). This means that you have a hierarchy of memory for the CPUs: the local fast HBM, which is expected to be relatively small compared to main memory, and then the main memory. New memory technology might also deepen this hierarchy with another level of yet slower but gigantic memory (some persistent memory technology might fall into that category). Another example is device memory, and devices themselves can have a hierarchy, like HBM on top of the device core plus main device memory.
>
> On top of that you can have multiple paths to access each memory, and each path can have different properties (latency, bandwidth, ...). Also there is not always symmetry, i.e. some memory might only be accessible by some devices or CPUs, i.e. not accessible by everyone.
>
> So a flat hierarchy for each node is not capable of representing this kind of complexity. To simplify the discussion, and because we do not want to single out CPUs from devices, from here on out we will use "initiator" to refer to either a CPU or a device. An initiator is any kind of CPU or device that can access memory (i.e. initiate memory access).
>
> At this point an example of such a system might help:
>     - 2 nodes and for each node:
>         - 1 CPU per node with 2 complexes of CPU cores per CPU
>         - one HBM memory for each complex of CPU cores (200GB/s)
>         - CPU core complexes are linked to each other (100GB/s)
>         - main memory is (90GB/s)
>         - 4 GPUs, each with:
>             - HBM memory for each GPU (1000GB/s) (not CPU accessible)
>             - GDDR memory for each GPU (500GB/s) (CPU accessible)
>             - connected to the CPU root controller (60GB/s)
>             - connected to other GPUs (even GPUs from the second node)
>               with a GPU link (400GB/s)
>
> In this example we restrict ourselves to bandwidth and ignore bus width or latency; this is just to simplify the discussion but obviously they also factor in.
>
> Userspace very much would like to know this information. For instance HPC folks have developed complex libraries to manage this and there is wide research on the topic [2] [3] [4] [5]. Today most of the work is done by hardcoding things for a specific platform, which is somewhat acceptable for HPC folks where the platform stays the same for a long period of time. But if we want more ubiquitous support we should aim to provide the information needed through a standard kernel API such as the one presented in this patchset.
>
> Roughly speaking I see two broad use cases for topology information. The first is virtualization and VMs, where you want to segment your hardware properly for each VM (binding memory, CPUs and GPUs that are all close to each other). The second is applications, many of which can partition their workload to minimize exchanges between partitions, allowing each partition to be bound to a subset of devices and CPUs that are close to each other (for maximum locality). Here it is much more than just NUMA distance: you can leverage the memory hierarchy and the system topology all together (see [2] [3] [4] [5] for more references and details).
>
> So this is not exposing topology just for the sake of a cool graph in userspace.
> There are active users of such information today, and if we want to grow and broaden the usage we should provide a unified API to standardize how that information is accessible to everyone.
>
> One proposal so far to handle new types of memory is to use CPU-less nodes for them [6]. While the same idea can apply to device memory, it is still hard to describe multiple paths with different properties in such a scheme. While it is backward compatible and requires minimal changes, it simply can not convey a complex topology (think any kind of random graph, not just a tree-like graph).
>
> Thus far this kind of system has been used through device-specific APIs and relies on all kinds of system-specific quirks. To avoid this getting out of hand and growing into a bigger mess than it already is, this patchset tries to provide a common generic API that should fit various devices (GPU, FPGA, ...).
>
> So this patchset proposes a new way to expose the system topology to userspace. It relies on 4 types of objects:
>     - target: any kind of memory (main memory, HBM, device, ...)
>     - initiator: CPU or device (anything that can access memory)
>     - link: anything that links initiators and targets
>     - bridge: anything that allows a group of initiators to access a
>       remote target (i.e. a target they are not directly connected to
>       through a link)
>
> Properties like bandwidth, latency, ... are all set per bridge and per link. All initiators connected to a link can access any target memory also connected to the same link, and all with the same link properties.
>
> Links do not need to match physical hardware, i.e. a single physical link can match a single or multiple software-exposed links. This allows modelling devices connected to the same physical link (like PCIE for instance) but not with the same characteristics (like number of lanes or lane speed in PCIE). The reverse is also true, i.e. a single software-exposed link can match multiple physical links.
>
> Bridges allow initiators to access a remote link. A bridge connects two links to each other and is also specific to a list of initiators (i.e. not all initiators connected to each of the links can use the bridge). Bridges have their own properties (bandwidth, latency, ...) so that the actual value for each property is the lowest common denominator between the bridge and each of the links.
>
> This model allows describing any kind of directed graph and thus any kind of topology we might see in the future. It also makes it easy to add new properties to each object type.
>
> Moreover it can be used to expose devices capable of doing peer to peer between themselves. For that, simply have all devices capable of peer to peer share a common link, or use the bridge object if the peer to peer capability is only one way, for instance.
>
> This patchset uses the above scheme to expose the system topology through sysfs under /sys/bus/hms/ with:
>     - /sys/bus/hms/devices/v%version-%id-target/ : a target memory,
>       each has a UID and you find the usual values in that folder
>       (node id, size, ...)
>
>     - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator
>       (CPU or device), each has an HMS UID but also a CPU id for CPUs
>       (which matches the CPU id in /sys/bus/cpu/). For devices you have
>       a path that can be the PCIE BUS ID for instance.
>
>     - /sys/bus/hms/devices/v%version-%id-link : a link, each has a
>       UID and a file per property (bandwidth, latency, ...); you also
>       find a symlink to every target and initiator connected to that
>       link.
>
>     - /sys/bus/hms/devices/v%version-%id-bridge : a bridge, each has a
>       UID and a file per property (bandwidth, latency, ...); you also
>       find a symlink to all initiators that can use that bridge.

Is that version tagging really needed? What changes do you envision with versions?

>
> To help with forward compatibility each object has a version value, and it is mandatory for user space to only use targets or initiators with versions supported by that user space. For instance if user space only knows what version 1 means and sees a target with version 2, then the user space must ignore that target as if it did not exist.
>
> Mandating that allows the addition of new properties that break backward compatibility, i.e. user space must know how a new property affects the object to be able to use it safely.
>
> This patchset exposes the main memory of each node under a common target. For now device drivers are responsible for registering the memory they want to expose through that scheme, but in the future that information might come from the system firmware (this is a different discussion).
>
>
> 2) hbind(): bind ranges of virtual addresses to heterogeneous memory
> ---------------------------------------------------------------------
>
> With this new topology description the mbind() API is too limited to express which memory to pick. This is why this patchset introduces a new API: hbind(), for heterogeneous bind. The hbind() API allows binding any kind of target memory (using the HMS target uid); this can be any memory exposed through HMS, i.e. main memory, HBM, device memory, ...
>
> So instead of using a bitmap, hbind() takes an array of uids, and each uid is a unique memory target inside the new memory topology description. User space also provides an array of modifiers. This patchset only defines some modifiers. A modifier can be seen as the flags parameter of mbind(), but here we use an array so that user space can not only supply a modifier but also a value with it. This should allow the API to grow more features in the future. The kernel should return -EINVAL and ignore the call altogether if it is given an unknown modifier, forcing user space to restrict itself to the modifiers supported by the kernel it is running on (I know, I am dreaming about well behaved user space).
>
> Note that none of this is exclusive of automatic memory placement like autonuma. I also believe that we will see something similar to autonuma for device memory. This patchset is just there to provide a new API for processes that wish to have fine control over their memory placement, because the process should know better than the kernel where to place things.
>
> This patchset also adds the necessary bits to the nouveau open source driver for it to expose its memory and to allow processes to bind some ranges to the GPU memory. Note that on x86 the GPU memory is not accessible by the CPU because PCIE does not allow cache coherent access to device memory. Thus when using PCIE device memory on x86 it is mapped as swapped out from the CPU point of view, and any CPU access will trigger a migration back to main memory (this is all part of HMM and nouveau, not in this patchset).
>
> This is all done under staging so that we can experiment with the userspace API for a while before committing to anything.
> Getting this right is hard and it might not happen on the first try, so instead of having to support an API forever I would rather have it live behind staging for people to experiment with, and once we feel confident we have something we can live with, convert it to a syscall.
>
>
> 3) Tracking and applying heterogeneous memory policies
> ------------------------------------------------------
>
> The current memory policy infrastructure is node oriented. Instead of changing that and risking breakage and regression, this patchset adds a new heterogeneous policy tracking infrastructure. The expectation is that existing applications can keep using mbind() and all existing infrastructure undisturbed and unaffected, while new applications will use the new API and should avoid mixing and matching both (as they can achieve the same thing with the new API).
>
> Also the policy is not directly tied to the vma structure, for a few reasons:
>     - avoid having to split vmas for policies that do not cover a full vma
>     - avoid changing too much vma code
>     - avoid growing the vma structure with an extra pointer
> So instead this patchset uses the mmu_notifier API to track vma liveness (munmap(), mremap(), ...).
>
> This patchset is not tied to process memory allocation either (as said at the beginning, this is not an end to end patchset but a starting point). It does however demonstrate how migration to device memory can work under this scheme (using nouveau as a demonstration vehicle).
>
> The overall design is simple: on an hbind() call an hms policy structure is created for the supplied range, and hms uses the callback associated with the target memory. This callback is provided by the device driver for device memory, or by core HMS for regular main memory. The callback can decide to migrate the range to the target memories or do nothing (this can be influenced by flags provided to hbind() too).
>
> Later patches can tie page faults in with HMS policy to direct memory allocation to the right target. For now I would rather postpone that discussion until a consensus is reached on how to move forward on all the topics presented in this email. Start small, grow big ;)

I liked the simplicity of keeping it outside all the existing memory management policy code. But that is also the drawback, isn't it? We now have multiple entities tracking cpu and memory. (This reminded me of how we started with memcg in the early days.)

Once we have these different types of targets, ideally the system should be able to place memory in the ideal location based on the affinity of the access, i.e. we should automatically place the memory such that the initiator can access the target optimally. That is what we try to do in the current system with autonuma. (You did mention that you are not looking at how this patch series will evolve into automatic handling of placement right now.) But I guess we want to see if the framework indeed helps in achieving that goal. Will having HMS outside the core memory handling routines be a road blocker there?

-aneesh
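[To make the sysfs layout discussed above concrete, here is a minimal, illustrative userspace sketch in C. It is not from the patchset; it only assumes the v%version-%id-%type directory naming and the forward-compatibility rule described in the cover letter, i.e. objects with an unknown version are ignored.]

    /* Sketch: enumerate HMS objects under /sys/bus/hms/devices and skip
     * any object whose version is newer than what this program knows.
     * Directory names are assumed to follow the v<version>-<uid>-<type>
     * pattern from the cover letter. */
    #include <dirent.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define HMS_MAX_VERSION 1   /* highest version this program understands */

    int main(void)
    {
        DIR *d = opendir("/sys/bus/hms/devices");
        struct dirent *e;

        if (!d) {
            perror("opendir");
            return EXIT_FAILURE;
        }

        while ((e = readdir(d)) != NULL) {
            unsigned version, uid;
            char type[32];

            if (sscanf(e->d_name, "v%u-%u-%31s", &version, &uid, type) != 3)
                continue;
            if (version > HMS_MAX_VERSION)
                continue;   /* unknown version: pretend it does not exist */

            printf("%-10s uid=%u version=%u\n", type, uid, version);
        }
        closedir(d);
        return EXIT_SUCCESS;
    }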
On Tue, Dec 04, 2018 at 01:14:14PM +0530, Aneesh Kumar K.V wrote: > On 12/4/18 5:04 AM, jglisse@redhat.com wrote: > > From: Jérôme Glisse <jglisse@redhat.com> [...] > > This patchset use the above scheme to expose system topology through > > sysfs under /sys/bus/hms/ with: > > - /sys/bus/hms/devices/v%version-%id-target/ : a target memory, > > each has a UID and you can usual value in that folder (node id, > > size, ...) > > > > - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator > > (CPU or device), each has a HMS UID but also a CPU id for CPU > > (which match CPU id in (/sys/bus/cpu/). For device you have a > > path that can be PCIE BUS ID for instance) > > > > - /sys/bus/hms/devices/v%version-%id-link : an link, each has a > > UID and a file per property (bandwidth, latency, ...) you also > > find a symlink to every target and initiator connected to that > > link. > > > > - /sys/bus/hms/devices/v%version-%id-bridge : a bridge, each has > > a UID and a file per property (bandwidth, latency, ...) you > > also find a symlink to all initiators that can use that bridge. > > is that version tagging really needed? What changes do you envision with > versions? I kind of dislike it myself but this is really to keep userspace from inadvertently using some kind of memory/initiator/link/bridge that it should not be using if it does not understand what are the implication. If it was a file inside the directory there is a big chance that user- space will overlook it. So an old program on a new platform with a new kind of weird memory like un-coherent memory might start using it and get all weird result. If version is in the directory name it kind of force userspace to only look at memory/initiator/link/bridge it does understand and can use safely. So i am doing this in hope that it will protect application when new type of things pops up. We have too many example where we can not evolve something because existing application have bake in assumptions about it. [...] > > 3) Tracking and applying heterogeneous memory policies > > ------------------------------------------------------ > > > > Current memory policy infrastructure is node oriented, instead of > > changing that and risking breakage and regression this patchset add a > > new heterogeneous policy tracking infra-structure. The expectation is > > that existing application can keep using mbind() and all existing > > infrastructure under-disturb and unaffected, while new application > > will use the new API and should avoid mix and matching both (as they > > can achieve the same thing with the new API). > > > > Also the policy is not directly tie to the vma structure for a few > > reasons: > > - avoid having to split vma for policy that do not cover full vma > > - avoid changing too much vma code > > - avoid growing the vma structure with an extra pointer > > So instead this patchset use the mmu_notifier API to track vma liveness > > (munmap(),mremap(),...). > > > > This patchset is not tie to process memory allocation either (like said > > at the begining this is not and end to end patchset but a starting > > point). It does however demonstrate how migration to device memory can > > work under this scheme (using nouveau as a demonstration vehicle). > > > > The overall design is simple, on hbind() call a hms policy structure > > is created for the supplied range and hms use the callback associated > > with the target memory. This callback is provided by device driver > > for device memory or by core HMS for regular main memory. 
The callback > > can decide to migrate the range to the target memories or do nothing > > (this can be influenced by flags provided to hbind() too). > > > > > > Latter patches can tie page fault with HMS policy to direct memory > > allocation to the right target. For now i would rather postpone that > > discussion until a consensus is reach on how to move forward on all > > the topics presented in this email. Start smalls, grow big ;) > > > > > > I liked the simplicity of keeping it outside all the existing memory > management policy code. But that that is also the drawback isn't it? > We now have multiple entities tracking cpu and memory. (This reminded me of > how we started with memcg in the early days). This is a hard choice, the rational is that either application use this new API either it use the old one. So the expectation is that both should not co-exist in a process. Eventualy both can be consolidated into one inside the kernel while maintaining the different userspace API. But i feel that it is better to get to that point slowly while we experiment with the new API. I feel that we need to gain some experience with the new API on real workload to convince ourself that it is something we can leave with. If we reach that point than we can work on consolidating kernel code into one. In the meantime this experiment does not disrupt or regress existing API. I took the cautionary road. > Once we have these different types of targets, ideally the system should > be able to place them in the ideal location based on the affinity of the > access. ie. we should automatically place the memory such that > initiator can access the target optimally. That is what we try to do with > current system with autonuma. (You did mention that you are not looking at > how this patch series will evolve to automatic handling of placement right > now.) But i guess we want to see if the framework indeed help in achieving > that goal. Having HMS outside the core memory > handling routines will be a road blocker there? So evolving autonuma gonna be a thing on its own, the issue is that auto- numa revolve around CPU id and use a handful of bits to try to catch CPU access pattern. With device in the mix it is much harder, first using the page fault trick of autonuma might not be the best idea, second we can get a lot of informations from IOMMU, bridge chipset or device itself on what is accessed by who. So my believe on that front is that its gonna be something different, like tracking range of virtual address and maintaining a data structure for range (not per page). All this is done in core mm code, i am just keeping out of vma struct or other struct to avoid growing them when and wasting thing when thit is not in use. So it is very much inside core handling routines, it is just optional. In any case i believe that explicit placement (where application hbind() thing) will be the first main use case. Once we have that figured out (or at least once we believe we have it figured out :)) then we can look into auto-heterogeneous. Cheers, Jérôme
On 12/3/18 3:34 PM, jglisse@redhat.com wrote:
> This means that it is no longer sufficient to consider a flat view
> for each node in a system; for maximum performance we need to
> account for all of this new memory and also for the system topology.
> This is why this proposal is unlike the HMAT proposal [1], which
> tries to extend the existing NUMA model for new types of memory.
> Here we are tackling a much more profound change that departs from NUMA.

The HMAT and its implications exist, in firmware, whether or not we do *anything* in Linux to support it. Any system with an HMAT inherently reflects the new topology, via proximity domains, whether or not we parse the HMAT table in Linux.

Basically, *ACPI* has decided to extend NUMA. Linux can either fight that or embrace it. Keith's HMAT patches are embracing it. These patches are appearing to fight it. Agree? Disagree?

Also, could you add a simple, example program for how someone might use this? I got lost in all the new sysfs and ioctl gunk. Can you characterize how this would work with the *existing* NUMA interfaces that we have?
On Tue, Dec 04, 2018 at 10:02:55AM -0800, Dave Hansen wrote:
> On 12/3/18 3:34 PM, jglisse@redhat.com wrote:
> > This means that it is no longer sufficient to consider a flat view
> > for each node in a system; for maximum performance we need to
> > account for all of this new memory and also for the system topology.
> > This is why this proposal is unlike the HMAT proposal [1], which
> > tries to extend the existing NUMA model for new types of memory.
> > Here we are tackling a much more profound change that departs from NUMA.
>
> The HMAT and its implications exist, in firmware, whether or not we do
> *anything* in Linux to support it. Any system with an HMAT inherently
> reflects the new topology, via proximity domains, whether or not we
> parse the HMAT table in Linux.
>
> Basically, *ACPI* has decided to extend NUMA. Linux can either fight
> that or embrace it. Keith's HMAT patches are embracing it. These
> patches are appearing to fight it. Agree? Disagree?

Disagree, sorry if it felt that way, that was not my intention. The ACPI HMAT information can be used to populate the HMS file system representation. My intention is not to fight Keith's HMAT patches, they are useful on their own. But I do not see how to evolve NUMA to support device memory, so while Keith is taking a step in the direction I want, I do not see how to get from there to the place I need to be. More on that below.

> Also, could you add a simple, example program for how someone might use
> this? I got lost in all the new sysfs and ioctl gunk. Can you
> characterize how this would work with the *existing* NUMA interfaces
> that we have?

That is the issue: I can not expose device memory as a NUMA node, as device memory is not cache coherent on AMD and Intel platforms today. Moreover, in some cases that memory is not visible to the CPU at all, which is not something you can express in the current NUMA node.

Here is an abbreviated list of the features I need to support:
    - device private memory (not accessible by the CPU or anybody else)
    - non-coherent memory (PCIE is not cache coherent for CPU access)
    - multiple paths to access the same memory, either:
        - multiple _different_ physical addresses aliasing to the same memory
        - device blocks can select which path they take to access some
          memory (it is not inside the page table but in how you program
          the device block)
    - complex topology that is not a tree, where a device link can have
      better characteristics than the CPU inter-connect between the nodes.
      There are existing users today that use topology information to
      partition their workload (HPC folks who have a fixed platform).
    - device memory needs to stay under device driver control, as some
      existing APIs (OpenGL, Vulkan) have a different memory model, and if
      we want the device to be usable for those too then we need to keep
      the device driver in control of the device memory allocation

There is an example userspace program with the last patch in the series. But here is a high level overview of how one application looks today:
    1) Application gets some dataset from some source (disk, network,
       sensors, ...)
    2) Application allocates memory on device A and copies over the dataset
    3) Application runs some CPU code to format the copy of the dataset
       inside device A memory (rebuild pointers inside the dataset; this
       can represent millions and millions of operations)
    4) Application runs code on device A that uses the dataset
    5) Application allocates memory on device B and copies over the result
       from device A
    6) Application runs some CPU code to format the copy of the dataset
       inside device B (rebuild pointers inside the dataset; this can
       represent millions and millions of operations)
    7) Application runs code on device B that uses the dataset
    8) Application copies the result over from device B and keeps on doing
       its thing

How it looks with HMS:
    1) Application gets some dataset from some source (disk, network,
       sensors, ...)
    2-3) Application calls HMS to migrate to device A memory
    4) Application runs code on device A that uses the dataset
    5-6) Application calls HMS to migrate to device B memory
    7) Application runs code on device B that uses the dataset
    8) Application calls HMS to migrate the result to main memory

So we now avoid explicit copies and having to rebuild the data structures inside each device address space.

The above example is for migration. Here is an example of how the topology is used today:
    Application knows that the platform it is running on has 16 GPUs split
    into 2 groups of 8 GPUs each. GPUs in each group can access each
    other's memory through dedicated mesh links between each other. Full
    speed, no traffic bottleneck.

    Application splits its GPU computation in 2 so that each partition
    runs on a group of interconnected GPUs, allowing them to share the
    dataset.

With HMS:
    Application can query the kernel to discover the topology of the
    system it is running on and use it to partition and balance its
    workload accordingly. The same application should now be able to run
    on a new platform without having to be adapted to it. (A sketch of
    such a partitioning step follows this email.)

This is kind of naive, I expect topology to be hard to use, but maybe it is just me being pessimistic. In any case today we have a chicken and egg problem. We do not have a standard way to expose topology, so programs that can leverage topology are only written for HPC, where the platform is standard for a few years. If we had a standard way to expose the topology then maybe we would see more programs using it. At the very least we could convert existing users.

Policy is the same kind of story, this email is long enough now :) But I can write one down if you want.

Cheers,
Jérôme
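[As an illustration of the topology use case above, the following self-contained C sketch groups initiators that share a link, which is how an application could derive the two GPU partitions once it has walked the link directories. It is not from the patchset; the link and initiator uids are made up, and the pair list is assumed to be sorted by link uid.]

    /* Illustrative sketch only: given (link uid, initiator uid) pairs an
     * application has already collected by following the symlinks under
     * each /sys/bus/hms/devices/v*-link/ directory, group the initiators
     * that share a link. */
    #include <stdio.h>

    struct hms_edge {
        unsigned link_uid;       /* uid of the link directory */
        unsigned initiator_uid;  /* uid of an initiator symlinked under it */
    };

    int main(void)
    {
        /* hypothetical system: 2 groups of 4 GPUs, each group on its own
         * mesh link; entries are sorted by link uid */
        static const struct hms_edge edges[] = {
            { 10, 100 }, { 10, 101 }, { 10, 102 }, { 10, 103 },
            { 11, 104 }, { 11, 105 }, { 11, 106 }, { 11, 107 },
        };
        const unsigned nedges = sizeof(edges) / sizeof(edges[0]);

        /* one partition per link: all initiators on the same link can reach
         * the same targets with the same properties, so they can share a
         * dataset */
        for (unsigned i = 0; i < nedges; i++) {
            if (i == 0 || edges[i].link_uid != edges[i - 1].link_uid)
                printf("\npartition (link %u):", edges[i].link_uid);
            printf(" initiator %u", edges[i].initiator_uid);
        }
        printf("\n");
        return 0;
    }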
On 12/4/18 10:49 AM, Jerome Glisse wrote:
> Policy is the same kind of story, this email is long enough now :) But
> I can write one down if you want.

Yes, please. I'd love to see the code.

We'll do the same on the "HMAT" side and we can compare notes.
On Tue, Dec 04, 2018 at 10:54:10AM -0800, Dave Hansen wrote:
> On 12/4/18 10:49 AM, Jerome Glisse wrote:
> > Policy is the same kind of story, this email is long enough now :) But
> > I can write one down if you want.
>
> Yes, please. I'd love to see the code.
>
> We'll do the same on the "HMAT" side and we can compare notes.

Example use cases? Examples of use are:

    Application creates a range of virtual addresses with mmap() for the
    input dataset. Application knows it will use a GPU on it directly, so
    it calls hbind() to set a policy for the range so that any new
    allocation in the range uses GPU memory. Application then streams the
    dataset directly into GPU memory through the virtual address range,
    thanks to the policy.

    Application creates a range of virtual addresses with mmap() to store
    the output of the GPU jobs it is about to launch. It binds the range
    of virtual addresses to GPU memory so that allocations for the range
    use GPU memory.

    Application can also use policy binding as a slow migration path, i.e.
    set a policy to a new target memory so that new allocations are
    directed to this new target.

Or do you want an example userspace program like the one in the last patch of this series?

Cheers,
Jérôme
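[For illustration only, a sketch of the first use case above. The hbind() prototype here is hypothetical and invented for the example; the real interface lives in the staging patches and its exact signature is not quoted in this thread.]

    /* Hypothetical sketch: bind a mmap()ed range to one HMS target uid so
     * that new allocations in the range land in GPU memory.  hbind() below
     * is a stand-in stub, not the real staging interface. */
    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <sys/mman.h>

    /* made-up shape mirroring the description: an array of target uids plus
     * an array of (modifier, value) pairs */
    struct hbind_modifier { uint32_t opcode; uint64_t value; };

    static int hbind(void *addr, size_t len,
                     const uint32_t *target_uids, unsigned ntargets,
                     const struct hbind_modifier *mods, unsigned nmods)
    {
        /* stand-in for the staging interface; always "succeeds" here */
        (void)addr; (void)len; (void)target_uids; (void)ntargets;
        (void)mods; (void)nmods;
        return 0;
    }

    int main(void)
    {
        size_t len = 1 << 20;
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;

        uint32_t gpu_target = 42;   /* hypothetical HMS uid of the GPU memory */

        if (hbind(buf, len, &gpu_target, 1, NULL, 0) == 0)
            printf("range bound to HMS target %u\n", gpu_target);

        /* streaming the dataset through buf would now allocate in GPU memory */
        munmap(buf, len);
        return 0;
    }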
On 12/4/18 10:49 AM, Jerome Glisse wrote: >> Also, could you add a simple, example program for how someone might use >> this? I got lost in all the new sysfs and ioctl gunk. Can you >> characterize how this would work with the *exiting* NUMA interfaces that >> we have? > That is the issue i can not expose device memory as NUMA node as > device memory is not cache coherent on AMD and Intel platform today. > > More over in some case that memory is not visible at all by the CPU > which is not something you can express in the current NUMA node. Yeah, our NUMA mechanisms are for managing memory that the kernel itself manages in the "normal" allocator and supports a full feature set on. That has a bunch of implications, like that the memory is cache coherent and accessible from everywhere. The HMAT patches only comprehend this "normal" memory, which is why we're extending the existing /sys/devices/system/node infrastructure. This series has a much more aggressive goal, which is comprehending the connections of every memory-target to every memory-initiator, no matter who is managing the memory, who can access it, or what it can be used for. Theoretically, HMS could be used for everything that we're doing with /sys/devices/system/node, as long as it's tied back into the existing NUMA infrastructure _somehow_. Right?
On Tue, Dec 04, 2018 at 01:37:56PM -0800, Dave Hansen wrote: > On 12/4/18 10:49 AM, Jerome Glisse wrote: > >> Also, could you add a simple, example program for how someone might use > >> this? I got lost in all the new sysfs and ioctl gunk. Can you > >> characterize how this would work with the *exiting* NUMA interfaces that > >> we have? > > That is the issue i can not expose device memory as NUMA node as > > device memory is not cache coherent on AMD and Intel platform today. > > > > More over in some case that memory is not visible at all by the CPU > > which is not something you can express in the current NUMA node. > > Yeah, our NUMA mechanisms are for managing memory that the kernel itself > manages in the "normal" allocator and supports a full feature set on. > That has a bunch of implications, like that the memory is cache coherent > and accessible from everywhere. > > The HMAT patches only comprehend this "normal" memory, which is why > we're extending the existing /sys/devices/system/node infrastructure. > > This series has a much more aggressive goal, which is comprehending the > connections of every memory-target to every memory-initiator, no matter > who is managing the memory, who can access it, or what it can be used for. > > Theoretically, HMS could be used for everything that we're doing with > /sys/devices/system/node, as long as it's tied back into the existing > NUMA infrastructure _somehow_. > > Right? Fully correct mind if i steal that perfect summary description next time i post ? I am so bad at explaining thing :) Intention is to allow program to do everything they do with mbind() today and tomorrow with the HMAT patchset and on top of that to also be able to do what they do today through API like OpenCL, ROCm, CUDA ... So it is one kernel API to rule them all ;) Also at first i intend to special case vma alloc page when they are HMS policy, long term i would like to merge code path inside the kernel. But i do not want to disrupt existing code path today, i rather grow to that organicaly. Step by step. The mbind() would still work un-affected in the end just the plumbing would be slightly different. Cheers, Jérôme
On 12/3/18 3:34 PM, jglisse@redhat.com wrote: > This patchset use the above scheme to expose system topology through > sysfs under /sys/bus/hms/ with: > - /sys/bus/hms/devices/v%version-%id-target/ : a target memory, > each has a UID and you can usual value in that folder (node id, > size, ...) > > - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator > (CPU or device), each has a HMS UID but also a CPU id for CPU > (which match CPU id in (/sys/bus/cpu/). For device you have a > path that can be PCIE BUS ID for instance) > > - /sys/bus/hms/devices/v%version-%id-link : an link, each has a > UID and a file per property (bandwidth, latency, ...) you also > find a symlink to every target and initiator connected to that > link. > > - /sys/bus/hms/devices/v%version-%id-bridge : a bridge, each has > a UID and a file per property (bandwidth, latency, ...) you > also find a symlink to all initiators that can use that bridge. We support 1024 NUMA nodes on x86. The ACPI HMAT expresses the connections between each node. Let's suppose that each node has some CPUs and some memory. That means we'll have 1024 target directories in sysfs, 1024 initiator directories in sysfs, and 1024*1024 link directories. Or, would the kernel be responsible for "compiling" the firmware-provided information down into a more manageable number of links? Some idiot made the mistake of having one sysfs directory per 128MB of memory way back when, and now we have hundreds of thousands of /sys/devices/system/memory/memoryX directories. That sucks to manage. Isn't this potentially repeating that mistake? Basically, is sysfs the right place to even expose this much data?
On 12/4/18 1:57 PM, Jerome Glisse wrote: > Fully correct mind if i steal that perfect summary description next time > i post ? I am so bad at explaining thing :) Go for it! > Intention is to allow program to do everything they do with mbind() today > and tomorrow with the HMAT patchset and on top of that to also be able to > do what they do today through API like OpenCL, ROCm, CUDA ... So it is one > kernel API to rule them all ;) While I appreciate the exhaustive scope of such a project, I'm really worried that if we decided to use this for our "HMAT" use cases, we'll be bottlenecked behind this project while *it* goes through 25 revisions over 4 or 5 years like HMM did. So, should we just "park" the enhancements to the existing NUMA interfaces and infrastructure (think /sys/devices/system/node) and wait for this to go in? Do we try to develop them in parallel and make them consistent? Or, do we just ignore each other and make Andrew sort it out in a few years? :)
On Tue, Dec 04, 2018 at 03:54:22PM -0800, Dave Hansen wrote:
> On 12/3/18 3:34 PM, jglisse@redhat.com wrote:
> > This patchset uses the above scheme to expose the system topology
> > through sysfs under /sys/bus/hms/ with:
> >     - /sys/bus/hms/devices/v%version-%id-target/ : a target memory,
> >       each has a UID and you find the usual values in that folder
> >       (node id, size, ...)
> >
> >     - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator
> >       (CPU or device), each has an HMS UID but also a CPU id for CPUs
> >       (which matches the CPU id in /sys/bus/cpu/). For devices you
> >       have a path that can be the PCIE BUS ID for instance.
> >
> >     - /sys/bus/hms/devices/v%version-%id-link : a link, each has a
> >       UID and a file per property (bandwidth, latency, ...); you also
> >       find a symlink to every target and initiator connected to that
> >       link.
> >
> >     - /sys/bus/hms/devices/v%version-%id-bridge : a bridge, each has
> >       a UID and a file per property (bandwidth, latency, ...); you
> >       also find a symlink to all initiators that can use that bridge.
>
> We support 1024 NUMA nodes on x86. The ACPI HMAT expresses the
> connections between each node. Let's suppose that each node has some
> CPUs and some memory.
>
> That means we'll have 1024 target directories in sysfs, 1024 initiator
> directories in sysfs, and 1024*1024 link directories. Or, would the
> kernel be responsible for "compiling" the firmware-provided information
> down into a more manageable number of links?
>
> Some idiot made the mistake of having one sysfs directory per 128MB of
> memory way back when, and now we have hundreds of thousands of
> /sys/devices/system/memory/memoryX directories. That sucks to manage.
> Isn't this potentially repeating that mistake?
>
> Basically, is sysfs the right place to even expose this much data?

I definitely want to avoid the memoryX mistake, so I do not want to see one link directory per device. Taking my simple laptop as an example, with 4 CPUs, a wifi and 2 GPUs (one integrated and one discrete):

    link0: cpu0 cpu1 cpu2 cpu3
    link1: wifi (2 pcie lanes)
    link2: gpu0 (unknown number of lanes but I believe it has higher
           bandwidth to main memory)
    link3: gpu1 (16 pcie lanes)
    link4: gpu1 and gpu memory

So there is one link directory per number of pcie lanes your devices have, so that you can differentiate on bandwidth. The main memory is symlinked inside all the link directories except link4. The discrete GPU memory is only in the link4 directory, as it is only accessible by that GPU (we could add it under link3 too, with the non cache coherent property attached to it).

The issue then becomes how to convert the overly verbose HMAT information down into a reasonable layout for HMS. For that I would say: create a link directory for each distinct matrix cell. As an example, say each entry in the matrix has a bandwidth and a latency; then we create one link directory for each combination of bandwidth and latency. On a simple system that should boil down to a handful of combinations, roughly speaking mirroring the example above of one link directory per number of PCIE lanes.

I don't think I have a system with an HMAT table; if you have an HMAT table to provide, I could show the end result.

Note I believe the ACPI HMAT matrix is a bad design for that reason, i.e. there is a lot of commonality in many of the matrix entries and many entries also do not make sense (i.e. an initiator not being able to access all the targets). I feel that link/bridge is much more compact and allows representing any directed graph, including multiple arrows from one node to the same other node.

Cheers,
Jérôme
On Tue, Dec 04, 2018 at 03:58:23PM -0800, Dave Hansen wrote:
> On 12/4/18 1:57 PM, Jerome Glisse wrote:
> > Fully correct, mind if I steal that perfect summary description next
> > time I post? I am so bad at explaining things :)
>
> Go for it!
>
> > The intention is to allow programs to do everything they do with
> > mbind() today and tomorrow with the HMAT patchset, and on top of that
> > to also be able to do what they do today through APIs like OpenCL,
> > ROCm, CUDA ... So it is one kernel API to rule them all ;)
>
> While I appreciate the exhaustive scope of such a project, I'm really
> worried that if we decided to use this for our "HMAT" use cases, we'll
> be bottlenecked behind this project while *it* goes through 25 revisions
> over 4 or 5 years like HMM did.
>
> So, should we just "park" the enhancements to the existing NUMA
> interfaces and infrastructure (think /sys/devices/system/node) and wait
> for this to go in? Do we try to develop them in parallel and make them
> consistent? Or, do we just ignore each other and make Andrew sort it
> out in a few years? :)

Let's have a battle with giant foam q-tips at the next LSF/MM and see who wins ;)

More seriously, I think you should go ahead with Keith's HMAT patchset and make progress there. In the HMAT case you can grow and evolve the NUMA node infrastructure to address your needs, and I believe you are doing it in a sensible way. But I do not see a path for what I am trying to achieve in that framework. If anyone has a good idea I would welcome it.

In the meantime I hope I can make progress with my proposal here under staging. Once I get enough stuff working in userspace and convince guinea pigs (I need to find a better name for those poor people I will coerce into testing this ;)) then I can have some hard evidence of which things in my proposal are useful on concrete cases with an open source stack from top to bottom. It might mean stripping down what I am proposing today to what turns out to be useful. Then we can start a discussion about merging the underlying kernel code into one (while preserving all existing APIs) and getting out of staging with real syscalls we will have to live with.

I know that at the very least the hbind() and hpolicy() syscalls would be successful, as the HPC folks have been dreaming of this. The topology thing is harder to know; there are some users today but I can not say how much more interest it can spark outside of the very small community that is HPC.

Cheers,
Jérôme
On 12/4/18 4:15 PM, Jerome Glisse wrote: > On Tue, Dec 04, 2018 at 03:54:22PM -0800, Dave Hansen wrote: >> Basically, is sysfs the right place to even expose this much data? > > I definitly want to avoid the memoryX mistake. So i do not want to > see one link directory per device. Taking my simple laptop as an > example with 4 CPUs, a wifi and 2 GPU (the integrated one and a > discret one): > > link0: cpu0 cpu1 cpu2 cpu3 > link1: wifi (2 pcie lane) > link2: gpu0 (unknown number of lane but i believe it has higher > bandwidth to main memory) > link3: gpu1 (16 pcie lane) > link4: gpu1 and gpu memory > > So one link directory per number of pcie lane your device have > so that you can differentiate on bandwidth. The main memory is > symlinked inside all the link directory except link4. The GPU > discret memory is only in link4 directory as it is only > accessible by the GPU (we could add it under link3 too with the > non cache coherent property attach to it). I'm actually really interested in how this proposal scales. It's quite easy to represent a laptop, but can this scale to the largest systems that we expect to encounter over the next 20 years that this ABI will live? > The issue then becomes how to convert down the HMAT over verbose > information to populate some reasonable layout for HMS. For that > i would say that create a link directory for each different > matrix cell. As an example let say that each entry in the matrix > has bandwidth and latency then we create a link directory for > each combination of bandwidth and latency. On simple system that > should boils down to a handfull of combination roughly speaking > mirroring the example above of one link directory per number of > PCIE lane for instance. OK, but there are 1024*1024 matrix cells on a systems with 1024 proximity domains (ACPI term for NUMA node). So it sounds like you are proposing a million-directory approach. We also can't simply say that two CPUs with the same connection to two other CPUs (think a 4-socket QPI-connected system) share the same "link" because they share the same combination of bandwidth and latency. We need to know that *each* has its own, unique link and do not share link resources. > I don't think i have a system with an HMAT table if you have one > HMAT table to provide i could show up the end result. It is new enough (ACPI 6.2) that no publicly-available hardware that exists that implements one (that I know of). Keith Busch can probably extract one and send it to you or show you how we're faking them with QEMU. > Note i believe the ACPI HMAT matrix is a bad design for that > reasons ie there is lot of commonality in many of the matrix > entry and many entry also do not make sense (ie initiator not > being able to access all the targets). I feel that link/bridge > is much more compact and allow to represent any directed graph > with multiple arrows from one node to another same node. I don't disagree. But, folks are building systems with them and we need to either deal with it, or make its data manageable. You saw our approach: we cull the data and only expose the bare minimum in sysfs.
On 2018-12-04 4:57 p.m., Jerome Glisse wrote: > On Tue, Dec 04, 2018 at 01:37:56PM -0800, Dave Hansen wrote: >> Yeah, our NUMA mechanisms are for managing memory that the kernel itself >> manages in the "normal" allocator and supports a full feature set on. >> That has a bunch of implications, like that the memory is cache coherent >> and accessible from everywhere. >> >> The HMAT patches only comprehend this "normal" memory, which is why >> we're extending the existing /sys/devices/system/node infrastructure. >> >> This series has a much more aggressive goal, which is comprehending the >> connections of every memory-target to every memory-initiator, no matter >> who is managing the memory, who can access it, or what it can be used for. >> >> Theoretically, HMS could be used for everything that we're doing with >> /sys/devices/system/node, as long as it's tied back into the existing >> NUMA infrastructure _somehow_. >> >> Right? > Fully correct mind if i steal that perfect summary description next time > i post ? I am so bad at explaining thing :) > > Intention is to allow program to do everything they do with mbind() today > and tomorrow with the HMAT patchset and on top of that to also be able to > do what they do today through API like OpenCL, ROCm, CUDA ... So it is one > kernel API to rule them all ;) As for ROCm, I'm looking forward to using hbind in our own APIs. It will save us some time and trouble not having to implement all the low-level policy and tracking of virtual address ranges in our device driver. Going forward, having a common API to manage the topology and memory affinity would also enable sane ways of having accelerators and memory devices from different vendors interact under control of a topology-aware application. Disclaimer: I haven't had a chance to review the patches in detail yet. Got caught up in the documentation and discussion ... Regards, Felix > > Also at first i intend to special case vma alloc page when they are HMS > policy, long term i would like to merge code path inside the kernel. But > i do not want to disrupt existing code path today, i rather grow to that > organicaly. Step by step. The mbind() would still work un-affected in > the end just the plumbing would be slightly different. > > Cheers, > Jérôme
On Tue, Dec 04, 2018 at 05:06:49PM -0800, Dave Hansen wrote:
> On 12/4/18 4:15 PM, Jerome Glisse wrote:
> > On Tue, Dec 04, 2018 at 03:54:22PM -0800, Dave Hansen wrote:
> >> Basically, is sysfs the right place to even expose this much data?
> >
> > I definitely want to avoid the memoryX mistake, so I do not want to
> > see one link directory per device. Taking my simple laptop as an
> > example, with 4 CPUs, a wifi and 2 GPUs (one integrated and one
> > discrete):
> >
> >     link0: cpu0 cpu1 cpu2 cpu3
> >     link1: wifi (2 pcie lanes)
> >     link2: gpu0 (unknown number of lanes but I believe it has higher
> >            bandwidth to main memory)
> >     link3: gpu1 (16 pcie lanes)
> >     link4: gpu1 and gpu memory
> >
> > So there is one link directory per number of pcie lanes your devices
> > have, so that you can differentiate on bandwidth. The main memory is
> > symlinked inside all the link directories except link4. The discrete
> > GPU memory is only in the link4 directory, as it is only accessible
> > by that GPU (we could add it under link3 too, with the non cache
> > coherent property attached to it).
>
> I'm actually really interested in how this proposal scales. It's quite
> easy to represent a laptop, but can this scale to the largest systems
> that we expect to encounter over the next 20 years that this ABI will
> live?
>
> > The issue then becomes how to convert the overly verbose HMAT
> > information down into a reasonable layout for HMS. For that I would
> > say: create a link directory for each distinct matrix cell. As an
> > example, say each entry in the matrix has a bandwidth and a latency;
> > then we create one link directory for each combination of bandwidth
> > and latency. On a simple system that should boil down to a handful
> > of combinations, roughly speaking mirroring the example above of one
> > link directory per number of PCIE lanes.
>
> OK, but there are 1024*1024 matrix cells on a system with 1024
> proximity domains (ACPI term for NUMA node). So it sounds like you are
> proposing a million-directory approach.

No, pseudo code:

    struct list links;

    for (unsigned r = 0; r < nrows; r++) {
        for (unsigned c = 0; c < ncolumns; c++) {
            if (!link_find(links, hmat[r][c].bandwidth,
                           hmat[r][c].latency)) {
                link = link_new(hmat[r][c].bandwidth,
                                hmat[r][c].latency);
                // add the initiator and target corresponding to that
                // row and column to this new link
                list_add(&link, links);
            }
        }
    }

So all cells that have the same properties are under the same link. Do you expect all the cells to always have different properties? On today's platforms that should not be the case. I do expect we will keep seeing many initiator/target pairs that share the same properties as other pairs. But yes, if you have a system where no initiator/target pair has the same properties, then you are in the worst case you are describing. But hey, that is the hardware you have then :)

Note that userspace can parse all this once during its initialization and create pools of targets to use.

> We also can't simply say that two CPUs with the same connection to two
> other CPUs (think a 4-socket QPI-connected system) share the same "link"
> because they share the same combination of bandwidth and latency. We
> need to know that *each* has its own, unique link and do not share link
> resources.

That is the purpose of the bridge object: to inter-connect links. To be more exact, a link is like saying you have 2 arrows with the same properties between every node listed in the link, while a bridge allows defining an arrow in just one direction.
Maybe I should define "arrow" and "node" instead of trying to match some of the ACPI terminology. This might be easier for people to follow than first having to understand that terminology. The fear I have with HMAT culling is that HMAT does not have the information needed to avoid such culling.

> > I don't think I have a system with an HMAT table; if you have an
> > HMAT table to provide, I could show the end result.
>
> It is new enough (ACPI 6.2) that no publicly-available hardware exists
> that implements one (that I know of). Keith Busch can probably extract
> one and send it to you or show you how we're faking them with QEMU.
>
> > Note I believe the ACPI HMAT matrix is a bad design for that reason,
> > i.e. there is a lot of commonality in many of the matrix entries and
> > many entries also do not make sense (i.e. an initiator not being able
> > to access all the targets). I feel that link/bridge is much more
> > compact and allows representing any directed graph, including
> > multiple arrows from one node to the same other node.
>
> I don't disagree. But, folks are building systems with them and we need
> to either deal with it, or make its data manageable. You saw our
> approach: we cull the data and only expose the bare minimum in sysfs.

Yeah, and I intend to cull data inside HMS too.

Cheers,
Jérôme
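[For reference, a self-contained, compilable version of the de-duplication loop sketched above could look like the following. The matrix contents and the fixed-size link table are made up for illustration.]

    /* Sketch of the link de-duplication idea: walk an initiator x target
     * matrix of (bandwidth, latency) cells and create one "link" per
     * distinct property combination. */
    #include <stdio.h>

    struct hms_cell { unsigned bandwidth; unsigned latency; };
    struct hms_link { unsigned bandwidth; unsigned latency; };

    #define NROWS 4   /* initiators */
    #define NCOLS 4   /* targets */

    /* sized to the worst case: one link per cell */
    static struct hms_link links[NROWS * NCOLS];
    static unsigned nlinks;

    /* return an existing link with these properties, or create a new one */
    static unsigned link_find_or_new(unsigned bw, unsigned lat)
    {
        for (unsigned i = 0; i < nlinks; i++)
            if (links[i].bandwidth == bw && links[i].latency == lat)
                return i;
        links[nlinks].bandwidth = bw;
        links[nlinks].latency = lat;
        return nlinks++;
    }

    int main(void)
    {
        /* toy matrix: local accesses are fast, all remote accesses share
         * the same slower properties, so 16 cells collapse into 2 links */
        struct hms_cell hmat[NROWS][NCOLS];

        for (unsigned r = 0; r < NROWS; r++)
            for (unsigned c = 0; c < NCOLS; c++) {
                hmat[r][c].bandwidth = (r == c) ? 200 : 90;
                hmat[r][c].latency   = (r == c) ?  80 : 120;
            }

        for (unsigned r = 0; r < NROWS; r++)
            for (unsigned c = 0; c < NCOLS; c++) {
                unsigned l = link_find_or_new(hmat[r][c].bandwidth,
                                              hmat[r][c].latency);
                /* here initiator r and target c would be symlinked under
                 * the directory of link l */
                (void)l;
            }

        printf("%u distinct links for %u matrix cells\n",
               nlinks, NROWS * NCOLS);
        return 0;
    }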
On 12/5/18 12:19 AM, Jerome Glisse wrote: > Above example is for migrate. Here is an example for how the > topology is use today: > > Application knows that the platform is running on have 16 > GPU split into 2 group of 8 GPUs each. GPU in each group can > access each other memory with dedicated mesh links between > each others. Full speed no traffic bottleneck. > > Application splits its GPU computation in 2 so that each > partition runs on a group of interconnected GPU allowing > them to share the dataset. > > With HMS: > Application can query the kernel to discover the topology of > system it is running on and use it to partition and balance > its workload accordingly. Same application should now be able > to run on new platform without having to adapt it to it. > Will the kernel be ever involved in decision making here? Like the scheduler will we ever want to control how there computation units get scheduled onto GPU groups or GPU? > This is kind of naive i expect topology to be hard to use but maybe > it is just me being pesimistics. In any case today we have a chicken > and egg problem. We do not have a standard way to expose topology so > program that can leverage topology are only done for HPC where the > platform is standard for few years. If we had a standard way to expose > the topology then maybe we would see more program using it. At very > least we could convert existing user. > > I am wondering whether we should consider HMAT as a subset of the ideas mentioned in this thread and see whether we can first achieve HMAT representation with your patch series? -aneesh
On Wed, Dec 05, 2018 at 04:57:17PM +0530, Aneesh Kumar K.V wrote: > On 12/5/18 12:19 AM, Jerome Glisse wrote: > > > Above example is for migrate. Here is an example for how the > > topology is use today: > > > > Application knows that the platform is running on have 16 > > GPU split into 2 group of 8 GPUs each. GPU in each group can > > access each other memory with dedicated mesh links between > > each others. Full speed no traffic bottleneck. > > > > Application splits its GPU computation in 2 so that each > > partition runs on a group of interconnected GPU allowing > > them to share the dataset. > > > > With HMS: > > Application can query the kernel to discover the topology of > > system it is running on and use it to partition and balance > > its workload accordingly. Same application should now be able > > to run on new platform without having to adapt it to it. > > > > Will the kernel be ever involved in decision making here? Like the scheduler > will we ever want to control how there computation units get scheduled onto > GPU groups or GPU? I don;t think you will ever see fine control in software because it would go against what GPU are fundamentaly. GPU have 1000 of cores and usualy 10 times more thread in flight than core (depends on the number of register use by the program or size of their thread local storage). By having many more thread in flight the GPU always have some threads that are not waiting for memory access and thus always have something to schedule next on the core. This scheduling is all done in real time and i do not see that as a good fit for any kernel CPU code. That being said higher level and more coarse directive can be given to the GPU hardware scheduler like giving priorities to group of thread so that they always get schedule first if ready. There is a cgroup proposal that goes into the direction of exposing high level control over GPU resource like that. I think this is a better venue to discuss such topics. > > > This is kind of naive i expect topology to be hard to use but maybe > > it is just me being pesimistics. In any case today we have a chicken > > and egg problem. We do not have a standard way to expose topology so > > program that can leverage topology are only done for HPC where the > > platform is standard for few years. If we had a standard way to expose > > the topology then maybe we would see more program using it. At very > > least we could convert existing user. > > > > > > I am wondering whether we should consider HMAT as a subset of the ideas > mentioned in this thread and see whether we can first achieve HMAT > representation with your patch series? I do not want to block HMAT on that. What i am trying to do really does not fit in the existing NUMA node this is what i have been trying to show even if not everyone is convince by that. Some bulets points of why: - memory i care about is not accessible by everyone (backed in assumption in NUMA node) - memory i care about might not be cache coherent (again backed in assumption in NUMA node) - topology matter so that userspace knows what inter-connect is share and what have dedicated links to memory - their can be multiple path between one device and one target memory and each path have different numa distance (or rather properties like bandwidth, latency, ...) again this is does not fit with the NUMA distance thing - memory is not manage by core kernel for reasons i hav explained - ... 
The HMAT proposal does not deal with such memory; it is much closer to what the current model can describe. Cheers, Jérôme
On 12/4/18 6:13 PM, Jerome Glisse wrote: > On Tue, Dec 04, 2018 at 05:06:49PM -0800, Dave Hansen wrote: >> OK, but there are 1024*1024 matrix cells on a systems with 1024 >> proximity domains (ACPI term for NUMA node). So it sounds like you are >> proposing a million-directory approach. > > No, pseudo code: > struct list links; > > for (unsigned r = 0; r < nrows; r++) { > for (unsigned c = 0; c < ncolumns; c++) { > if (!link_find(links, hmat[r][c].bandwidth, > hmat[r][c].latency)) { > link = link_new(hmat[r][c].bandwidth, > hmat[r][c].latency); > // add initiator and target correspond to that row > // and columns to this new link > list_add(&link, links); > } > } > } > > So all cells that have same property are under the same link. OK, so the "link" here is like a cable. It's like saying, "we have a network and everything is connected with an ethernet cable that can do 1gbit/sec". But, what actually connects an initiator to a target? I assume we still need to know which link is used for each target/initiator pair. Where is that enumerated? I think this just means we need a million symlinks to a "link" instead of a million link directories. Still not great. > Note that userspace can parse all this once during its initialization > and create pools of target to use. It sounds like you're agreeing that there is too much data in this interface for applications to _regularly_ parse it. We need some central thing that parses it all and caches the results.
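To make the pseudo code quoted above concrete, here is a minimal user-space sketch of the same grouping rule: every matrix cell with an identical (bandwidth, latency) pair is folded into a single link object. All of the names (struct hmat_cell, struct hms_link, and so on) are invented for this illustration and are not part of the proposed kernel interface.

#include <stdio.h>
#include <stdlib.h>

struct hmat_cell { unsigned bandwidth; unsigned latency; };

struct hms_link {
	unsigned bandwidth;
	unsigned latency;
	struct hms_link *next;
};

static struct hms_link *link_find(struct hms_link *links, unsigned bw, unsigned lat)
{
	for (; links; links = links->next)
		if (links->bandwidth == bw && links->latency == lat)
			return links;
	return NULL;
}

static struct hms_link *link_new(struct hms_link **links, unsigned bw, unsigned lat)
{
	struct hms_link *link = malloc(sizeof(*link));

	link->bandwidth = bw;
	link->latency = lat;
	link->next = *links;
	*links = link;
	return link;
}

/* Fold an nrows x ncolumns matrix into a list of unique links. */
static struct hms_link *build_links(const struct hmat_cell *hmat, unsigned nrows, unsigned ncolumns)
{
	struct hms_link *links = NULL;

	for (unsigned r = 0; r < nrows; r++) {
		for (unsigned c = 0; c < ncolumns; c++) {
			const struct hmat_cell *cell = &hmat[r * ncolumns + c];

			if (!link_find(links, cell->bandwidth, cell->latency))
				link_new(&links, cell->bandwidth, cell->latency);
			/* the initiator for row r and the target for column c
			 * would also be attached to the matching link here */
		}
	}
	return links;
}

int main(void)
{
	/* 2 initiators x 2 targets, only two distinct property pairs */
	struct hmat_cell hmat[4] = { {100, 10}, {50, 20}, {50, 20}, {100, 10} };
	unsigned n = 0;

	for (struct hms_link *l = build_links(hmat, 2, 2); l; l = l->next)
		n++;
	printf("%u unique links\n", n); /* prints "2 unique links" */
	return 0;
}

With this folding, the number of link directories is bounded by the number of distinct property pairs, not by the size of the matrix.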
On Wed, Dec 05, 2018 at 09:27:09AM -0800, Dave Hansen wrote: > On 12/4/18 6:13 PM, Jerome Glisse wrote: > > On Tue, Dec 04, 2018 at 05:06:49PM -0800, Dave Hansen wrote: > >> OK, but there are 1024*1024 matrix cells on a systems with 1024 > >> proximity domains (ACPI term for NUMA node). So it sounds like you are > >> proposing a million-directory approach. > > > > No, pseudo code: > > struct list links; > > > > for (unsigned r = 0; r < nrows; r++) { > > for (unsigned c = 0; c < ncolumns; c++) { > > if (!link_find(links, hmat[r][c].bandwidth, > > hmat[r][c].latency)) { > > link = link_new(hmat[r][c].bandwidth, > > hmat[r][c].latency); > > // add initiator and target correspond to that row > > // and columns to this new link > > list_add(&link, links); > > } > > } > > } > > > > So all cells that have same property are under the same link. > > OK, so the "link" here is like a cable. It's like saying, "we have a > network and everything is connected with an ethernet cable that can do > 1gbit/sec". > > But, what actually connects an initiator to a target? I assume we still > need to know which link is used for each target/initiator pair. Where > is that enumerated? ls /sys/bus/hms/devices/v0-0-link/ node0 power subsystem uevent uid bandwidth latency v0-1-target v0-15-initiator v0-21-target v0-4-initiator v0-7-initiator v0-10-initiator v0-13-initiator v0-16-initiator v0-2-initiator v0-11-initiator v0-14-initiator v0-17-initiator v0-3-initiator v0-5-initiator v0-8-initiator v0-6-initiator v0-9-initiator v0-12-initiator v0-10-initiator So above is 16 CPUs (initiators*) and 2 targets all connected through a common link. This means that all the initiators connected to this link can access all the target connected to this link. The bandwidth and latency is best case scenario for instance when only one initiator is accessing the target. Initiator can only access target they share a link with or an extended path through a bridge. So if you have an initiator connected to link0 and a target connected to link1 and there is a bridge link0 to link1 then the initiator can access the target memory in link1 but the bandwidth and latency will be min(link0.bandwidth, link1.bandwidth, bridge.bandwidth) min(link0.latency, link1.latency, bridge.latency) You can really match one to one a link with bus in your system. For instance with PCIE if you only have 16lanes PCIE devices you only devince one link directory for all your PCIE devices (ignore the PCIE peer to peer scenario here). You add a bride between your PCIE link to your NUMA node link (the node to which this PCIE root complex belongs), this means that PCIE device can access the local node memory with given bandwidth and latency (best case). > > I think this just means we need a million symlinks to a "link" instead > of a million link directories. Still not great. > > > Note that userspace can parse all this once during its initialization > > and create pools of target to use. > > It sounds like you're agreeing that there is too much data in this > interface for applications to _regularly_ parse it. We need some > central thing that parses it all and caches the results. No so there is 2 kinds of applications: 1) average one: i am using device {1, 3, 9} give me best memory for those devices 2) advance one: what is the topology of this system ? 
Parse the topology and partition its workload accordingly. For case 1 you can pre-parse stuff and this can be done by a helper library, but for case 2 there is no amount of pre-parsing you can do in the kernel; only the application knows its own architecture and thus only the application knows what matters in the topology. Is the application looking for a big chunk of memory even if it is slow ? Is it also looking for fast memory close to X and Y ? ... Each application will care about different things and there is no telling what those are going to be. So what i am saying is that this information is likely to be parsed once by the application during startup, ie the sysfs is not something that is continuously read and parsed by the application (unless the application also cares about hotplug, and then we are talking about the 1% of the 1%). Cheers, Jérôme
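A small sketch of the path rule described above, assuming made-up structures and arbitrary numbers: the bandwidth of an extended path through a bridge is the bottleneck (minimum) of everything crossed. The mail above writes min() for latency as well, while a later mail in the thread states path latency as the sum of the individual latencies; the sum is what this sketch uses.

#include <stdio.h>

struct hms_props { unsigned bandwidth; unsigned latency; };

static unsigned min_u(unsigned a, unsigned b) { return a < b ? a : b; }

/* initiator sits on link0, target on link1, bridge goes from link0 to link1 */
static struct hms_props path_props(struct hms_props link0, struct hms_props bridge, struct hms_props link1)
{
	struct hms_props p;

	p.bandwidth = min_u(link0.bandwidth, min_u(bridge.bandwidth, link1.bandwidth));
	p.latency = link0.latency + bridge.latency + link1.latency;
	return p;
}

int main(void)
{
	struct hms_props pcie = { 60, 500 }, qpi = { 90, 300 }, bridge = { 60, 100 };
	struct hms_props p = path_props(pcie, bridge, qpi);

	printf("path: %u GB/s, %u ns\n", p.bandwidth, p.latency); /* 60 GB/s, 900 ns */
	return 0;
}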
On 12/5/18 9:53 AM, Jerome Glisse wrote: > No so there is 2 kinds of applications: > 1) average one: i am using device {1, 3, 9} give me best memory for > those devices ... > > For case 1 you can pre-parse stuff but this can be done by helper library How would that work? Would each user/container/whatever do this once? Where would they keep the pre-parsed stuff? How do they manage their cache if the topology changes?
On Thu, Dec 06, 2018 at 10:25:08AM -0800, Dave Hansen wrote: > On 12/5/18 9:53 AM, Jerome Glisse wrote: > > No so there is 2 kinds of applications: > > 1) average one: i am using device {1, 3, 9} give me best memory for > > those devices > ... > > > > For case 1 you can pre-parse stuff but this can be done by helper library > > How would that work? Would each user/container/whatever do this once? > Where would they keep the pre-parsed stuff? How do they manage their > cache if the topology changes? Short answer i don't expect a cache, i expect that each program will have a init function that query the topology and update the application codes accordingly. This is what people do today, query all available devices, decide which one to use and how, create context for each selected ones, define a memory migration job/memory policy for each part of the program so that memory is migrated/have proper policy in place when the code that run on some device is executed. Long answer: I can not dictate how user folks do their program saddly :) I expect that many application will do it once during start up. Then you will have all those containers folks or VM folks that will get presure to react to hot- plug. For instance if you upgrade your instance with your cloud provider to have more GPUs or more TPUs ... It is likely to appear as an hotplug from the VM/container point of view and thus as an hotplug from the application point of view. So far demonstration i have seen do that by relaunching the application ... More on that through the live re-patching issues below. Oh and i expect application will crash if you hot-unplug anything it is using (this is what happens i believe now in most API). Again i expect that some pressure from cloud user and provider will force programmer to be a bit more reactive to this kind of event. Live re-patching application code can be difficult i am told. Let say you have: void compute_serious0_stuff(accelerator_t *accelerator, void *inputA, size_t sinputA, void *inputB, size_t sinputB, void *outputA, size_t soutputA) { ... // Migrate the inputA to the accelerator memory api_migrate_memory_to_accelerator(accelerator, inputA, sinputA); // The inputB buffer is fine in its default placement // The output is assume to be empty vma ie no page allocated yet // so set a policy to direct all allocation due to page fault to // use the accelerator memory api_set_memory_policy_to_accelerator(accelerator, outputA, soutputA); ... for_parallel<accelerator> (i = 0; i < THEYAREAMILLIONSITEMS; ++i) { // Do something serious } ... } void serious0_orchestrator(topology topology, void *inputA, void *inputB, void *outputA) { static accelerator_t **selected = NULL; static serious0_job_partition *partition; ... if (selected == NULL) { serious0_select_and_partition(topology, &selected, &partition, inputA, inputB, outputA) } ... for(i = 0; i < nselected; ++) { ... compute_serious0_stuff(selected[i], inputA + partition[i].inputA_offset, partition[i].inputA_size, inputB + partition[i].inputB_offset, partition[i].inputB_size, outputA + partition[i].outputB_offset, partition[i].outputA_size); ... } ... for(i = 0; i < nselected; ++) { accelerator_wait_finish(selected[i]); } ... // outputA is ready to be use by the next function in the program } If you start without a GPU/TPU your for_parallel will use the CPU and with the code the compiler have emitted at built time. 
For GPU/TPU at build time you compile your for_parallel loop to some intermediate representation (a virtual ISA) then at runtime during the application initialization that intermediate representation get lowered down to all the available GPU/TPU on your system and each for_parallel loop is patched to be turn into a call to: void dispatch_accelerator_function(accelerator_t *accelerator, void *function, ...) { } So in the above example the for_parallel loop becomes: dispatch_accelerator_function(accelerator, i_compute_serious_stuff, inputA, inputB, outputA); This hot patching of code is easy to do when no CPU thread is running the code. However when CPU threads are running it can be problematic, i am sure you can do trickery like delay the patching only to the next time the function get call by doing clever thing at build time like prepending each for_parallel section with enough nop that would allow you to replace it to a call to the dispatch function and a jump over the normal CPU code. I think compiler people want to solve the static case first ie during application initializations decide what devices are gonna be use and then update the application accordingly. But i expect it will grow to support hotplug as relaunching the application is not that user friendly even in this day an age where people starts millions of container with one mouse click. Anyway above example is how it looks today and accelerator can turn up to be just regular CPU core if you do not have any devices. The idea is that we would like a common API that cover both CPU thread and device thread. Same for the migration/policy functions if it happens that the accelerator is just plain old CPU then you want to migrate memory to the CPU node and set memory policy to that node too. Cheers, Jérôme
On 12/6/18 11:20 AM, Jerome Glisse wrote: >>> For case 1 you can pre-parse stuff but this can be done by helper library >> How would that work? Would each user/container/whatever do this once? >> Where would they keep the pre-parsed stuff? How do they manage their >> cache if the topology changes? > Short answer i don't expect a cache, i expect that each program will have > a init function that query the topology and update the application codes > accordingly. My concern with having folks do per-program parsing, *and* having a huge amount of data to parse makes it unusable. The largest systems will literally have hundreds of thousands of objects in /sysfs, even in a single directory. That makes readdir() basically impossible, and makes even open() (if you already know the path you want somehow) hard to do fast. I just don't think sysfs (or any filesystem, really) can scale to express large, complicated topologies in a way that any normal program can practically parse it. My suspicion is that we're going to need to have the kernel parse and cache these things. We *might* have the data available in sysfs, but we can't reasonably expect anyone to go parsing it.
On 2018-12-06 12:31 p.m., Dave Hansen wrote: > On 12/6/18 11:20 AM, Jerome Glisse wrote: >>>> For case 1 you can pre-parse stuff but this can be done by helper library >>> How would that work? Would each user/container/whatever do this once? >>> Where would they keep the pre-parsed stuff? How do they manage their >>> cache if the topology changes? >> Short answer i don't expect a cache, i expect that each program will have >> a init function that query the topology and update the application codes >> accordingly. > > My concern with having folks do per-program parsing, *and* having a huge > amount of data to parse makes it unusable. The largest systems will > literally have hundreds of thousands of objects in /sysfs, even in a > single directory. That makes readdir() basically impossible, and makes > even open() (if you already know the path you want somehow) hard to do fast. Is this actually realistic? I find it hard to imagine an actual hardware bus that can have even thousands of devices under a single node, let alone hundreds of thousands. At some point the laws of physics apply. For example, in present hardware, the most ports a single PCI switch can have these days is under one hundred. I'd imagine any such large systems would have a hierarchy of devices (ie. layers of switch-like devices) which implies the existing sysfs bus/devices should have a path through it without navigating a directory with that unreasonable a number of objects in it. HMS, on the other hand, has all possible initiators (etc.) under a single directory. The caveat to this is that, to find an initial starting point in the bus hierarchy, you might have to go through /sys/dev/{block|char} or /sys/class which may have directories with a large number of objects. Though, such a system would necessarily have a similarly large number of objects in /dev which means you will probably never get around the readdir/open bottleneck you mention... and, thus, this doesn't seem overly realistic to me. Logan
On Thu, Dec 06, 2018 at 11:31:21AM -0800, Dave Hansen wrote: > On 12/6/18 11:20 AM, Jerome Glisse wrote: > >>> For case 1 you can pre-parse stuff but this can be done by helper library > >> How would that work? Would each user/container/whatever do this once? > >> Where would they keep the pre-parsed stuff? How do they manage their > >> cache if the topology changes? > > Short answer i don't expect a cache, i expect that each program will have > > a init function that query the topology and update the application codes > > accordingly. > > My concern with having folks do per-program parsing, *and* having a huge > amount of data to parse makes it unusable. The largest systems will > literally have hundreds of thousands of objects in /sysfs, even in a > single directory. That makes readdir() basically impossible, and makes > even open() (if you already know the path you want somehow) hard to do fast. > > I just don't think sysfs (or any filesystem, really) can scale to > express large, complicated topologies in a way that any normal program > can practically parse it. > > My suspicion is that we're going to need to have the kernel parse and > cache these things. We *might* have the data available in sysfs, but we > can't reasonably expect anyone to go parsing it. What i am failing to explain is that kernel can not parse because kernel does not know what the application cares about and every single applications will make different choices and thus select differents devices and memory. It is not even gonna a thing like class A of application will do X and class B will do Y. Every single application in class A might do something different because somes care about the little details. So any kind of pre-parsing in the kernel is defeated by the fact that the kernel does not know what the application is looking for. I do not see anyway to express the application logic in something that can be some kind of automaton or regular expression. The application can litteraly intro-inspect itself and the topology to partition its workload. The topology and device selection is expected to be thousands of line of code in the most advance application. Even worse inside one same application, they might be different device partition and memory selection for different function in the application. I am not scare about the anount of data to parse really, even on big node it is gonna be few dozens of links and bridges, and few dozens of devices. So we are talking hundred directories to parse and read. Maybe an example will help. Let say we have an application with the following pipeline: inA -> functionA -> outA -> functionB -> outB -> functionC -> result - inA 8 gigabytes - outA 8 gigabytes - outB one dword - result something small - functionA is doing heavy computation on inA (several thousands of instructions for each dword in inA). 
- functionB is doing heavy computation for each dword in outA (again thousand of instruction for each dword) and it is looking for a specific result that it knows will be unique among all the dword computation ie it is output only one dword in outB - functionC is something well suited for CPU that take outB and turns it into the final result Now let see few different system and their topologies: [T2] 1 GPU with 16GB of memory and a handfull of CPU cores [T1] 1 GPU with 8GB of memory and a handfull of CPU cores [T3] 2 GPU with 8GB of memory and a handfull of CPU core [T4] 2 GPU with 8GB of memory and a handfull of CPU core the 2 GPU have a very fast link between each others (400GBytes/s) Now let see how the program will partition itself for each topology: [T1] Application partition its computation in 3 phases: P1: - migrate inA to GPU memory P2: - execute functionA on inA producing outA P3 - execute functionB on outA producing outB - run functionC and see if functionB have found the thing and written it to outB if so then kill all GPU threads and return the result we are done [T2] Application partition its computation in 5 phases: P1: - migrate first 4GB of inA to GPU memory P2: - execute functionA for the 4GB and write the 4GB outA result to the GPU memory P3: - execute functionB for the first 4GB of outA - while functionB is running DMA in the background the the second 4GB of inA to the GPU memory - once one of the millions of thread running functionB find the result it is looking for it writes it to outB which is in main memory - run functionC and see if functionB have found the thing and written it to outB if so then kill all GPU thread and DMA and return the result we are done P4: - run functionA on the second half of inA ie we did not find the result in the first half so we no process the second half that have been migrated to the GPU memory in the background (see above) P5: - run functionB on the second 4GB of outA like above - run functionC on CPU and kill everything as soon as one of the thread running functionB has found the result - return the result [T3] Application partition its computation in 3 phases: P1: - migrate first 4GB of inA to GPU1 memory - migrate last 4GB of inA to GPU2 memory P2: - execute functionA on GPU1 on the first 4GB -> outA - execute functionA on GPU2 on the last 4GB -> outA P3: - execute functionB on GPU1 on the first 4GB of outA - execute functionB on GPU2 on the last 4GB of outA - run functionC and see if functionB running on GPU1 and GPU2 have found the thing and written it to outB if so then kill all GPU threads and return the result we are done [T4] Application partition its computation in 2 phases: P1: - migrate 8GB of inA to GPU1 memory - allocate 8GB for outA in GPU2 memory P2: - execute functionA on GPU1 on the inA 8GB and write out result to GPU2 through the fast link - execute functionB on GPU2 and look over each thread on functionB on outA (busy running even if outA is not valid for each thread running functionB) - run functionC and see if functionB running on GPU2 have found the thing and written it to outB if so then kill all GPU threads and return the result we are done So this is widely different partition that all depends on the topology and how accelerator are inter-connected and how much memory they have. This is a relatively simple example, they are people out there spending month on designing adaptive partitioning algorithm for their application. Cheers, Jérôme
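The decision the application makes above can be illustrated with a small, purely hypothetical sketch; the topology summary, plan names and thresholds are invented. The point is only that the chosen partitioning depends on the number of devices, the amount of memory on each, and the links between them.

#include <stdbool.h>
#include <stddef.h>

struct topo_summary {
	unsigned ngpu;
	size_t   gpu_mem_bytes;    /* per-GPU memory */
	bool     fast_gpu_link;    /* e.g. a 400GB/s link between the two GPUs */
};

enum plan {
	PLAN_CPU_ONLY,         /* no accelerator found */
	PLAN_ONE_GPU_WHOLE,    /* inA and outA both fit: 3 phases */
	PLAN_ONE_GPU_HALVES,   /* only half fits at a time: 5 phases */
	PLAN_TWO_GPU_SPLIT,    /* split inA across the two GPUs: 3 phases */
	PLAN_TWO_GPU_PIPELINE, /* stream functionA output over the fast link: 2 phases */
};

static enum plan pick_plan(const struct topo_summary *t, size_t inA_bytes, size_t outA_bytes)
{
	if (t->ngpu == 0)
		return PLAN_CPU_ONLY;
	if (t->ngpu == 1)
		return (t->gpu_mem_bytes >= inA_bytes + outA_bytes)
			? PLAN_ONE_GPU_WHOLE : PLAN_ONE_GPU_HALVES;
	return t->fast_gpu_link ? PLAN_TWO_GPU_PIPELINE : PLAN_TWO_GPU_SPLIT;
}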
On Thu, Dec 06, 2018 at 03:27:06PM -0500, Jerome Glisse wrote: > On Thu, Dec 06, 2018 at 11:31:21AM -0800, Dave Hansen wrote: > > On 12/6/18 11:20 AM, Jerome Glisse wrote: > > >>> For case 1 you can pre-parse stuff but this can be done by helper library > > >> How would that work? Would each user/container/whatever do this once? > > >> Where would they keep the pre-parsed stuff? How do they manage their > > >> cache if the topology changes? > > > Short answer i don't expect a cache, i expect that each program will have > > > a init function that query the topology and update the application codes > > > accordingly. > > > > My concern with having folks do per-program parsing, *and* having a huge > > amount of data to parse makes it unusable. The largest systems will > > literally have hundreds of thousands of objects in /sysfs, even in a > > single directory. That makes readdir() basically impossible, and makes > > even open() (if you already know the path you want somehow) hard to do fast. > > > > I just don't think sysfs (or any filesystem, really) can scale to > > express large, complicated topologies in a way that any normal program > > can practically parse it. > > > > My suspicion is that we're going to need to have the kernel parse and > > cache these things. We *might* have the data available in sysfs, but we > > can't reasonably expect anyone to go parsing it. > > What i am failing to explain is that kernel can not parse because kernel > does not know what the application cares about and every single applications > will make different choices and thus select differents devices and memory. > > It is not even gonna a thing like class A of application will do X and > class B will do Y. Every single application in class A might do something > different because somes care about the little details. > > So any kind of pre-parsing in the kernel is defeated by the fact that the > kernel does not know what the application is looking for. > > I do not see anyway to express the application logic in something that > can be some kind of automaton or regular expression. The application can > litteraly intro-inspect itself and the topology to partition its workload. > The topology and device selection is expected to be thousands of line of > code in the most advance application. > > Even worse inside one same application, they might be different device > partition and memory selection for different function in the application. > > > I am not scare about the anount of data to parse really, even on big node > it is gonna be few dozens of links and bridges, and few dozens of devices. > So we are talking hundred directories to parse and read. > > > Maybe an example will help. Let say we have an application with the > following pipeline: > > inA -> functionA -> outA -> functionB -> outB -> functionC -> result > > - inA 8 gigabytes > - outA 8 gigabytes > - outB one dword > - result something small > - functionA is doing heavy computation on inA (several thousands of > instructions for each dword in inA). 
> - functionB is doing heavy computation for each dword in outA (again > thousand of instruction for each dword) and it is looking for a > specific result that it knows will be unique among all the dword > computation ie it is output only one dword in outB > - functionC is something well suited for CPU that take outB and turns > it into the final result > > Now let see few different system and their topologies: > [T2] 1 GPU with 16GB of memory and a handfull of CPU cores > [T1] 1 GPU with 8GB of memory and a handfull of CPU cores > [T3] 2 GPU with 8GB of memory and a handfull of CPU core > [T4] 2 GPU with 8GB of memory and a handfull of CPU core > the 2 GPU have a very fast link between each others > (400GBytes/s) > > Now let see how the program will partition itself for each topology: > [T1] Application partition its computation in 3 phases: > P1: - migrate inA to GPU memory > P2: - execute functionA on inA producing outA > P3 - execute functionB on outA producing outB > - run functionC and see if functionB have found the > thing and written it to outB if so then kill all > GPU threads and return the result we are done > > [T2] Application partition its computation in 5 phases: > P1: - migrate first 4GB of inA to GPU memory > P2: - execute functionA for the 4GB and write the 4GB > outA result to the GPU memory > P3: - execute functionB for the first 4GB of outA > - while functionB is running DMA in the background > the the second 4GB of inA to the GPU memory > - once one of the millions of thread running functionB > find the result it is looking for it writes it to > outB which is in main memory > - run functionC and see if functionB have found the > thing and written it to outB if so then kill all > GPU thread and DMA and return the result we are > done > P4: - run functionA on the second half of inA ie we did > not find the result in the first half so we no > process the second half that have been migrated to > the GPU memory in the background (see above) > P5: - run functionB on the second 4GB of outA like > above > - run functionC on CPU and kill everything as soon > as one of the thread running functionB has found > the result > - return the result > > [T3] Application partition its computation in 3 phases: > P1: - migrate first 4GB of inA to GPU1 memory > - migrate last 4GB of inA to GPU2 memory > P2: - execute functionA on GPU1 on the first 4GB -> outA > - execute functionA on GPU2 on the last 4GB -> outA > P3: - execute functionB on GPU1 on the first 4GB of outA > - execute functionB on GPU2 on the last 4GB of outA > - run functionC and see if functionB running on GPU1 > and GPU2 have found the thing and written it to outB > if so then kill all GPU threads and return the result > we are done > > [T4] Application partition its computation in 2 phases: > P1: - migrate 8GB of inA to GPU1 memory > - allocate 8GB for outA in GPU2 memory > P2: - execute functionA on GPU1 on the inA 8GB and write > out result to GPU2 through the fast link > - execute functionB on GPU2 and look over each > thread on functionB on outA (busy running even > if outA is not valid for each thread running > functionB) > - run functionC and see if functionB running on GPU2 > have found the thing and written it to outB if so > then kill all GPU threads and return the result > we are done > > > So this is widely different partition that all depends on the topology > and how accelerator are inter-connected and how much memory they have. 
> This is a relatively simple example, they are people out there spending > month on designing adaptive partitioning algorithm for their application. > And since i am writting example, another funny one let say you have a system with 2 nodes and on each node 2 GPU and one network. On each node the local network adapter can only access one of the 2 GPU memory. All the GPU are conntected to each other through a fully symmetrical mesh inter-connect. Now let say your program has 4 functions back to back, each functions consuming the output of the previous one. Finaly you get your input from the network and stream out the final function output to the network So what you can do is: Node0 Net0 -> write to Node0 GPU0 memory Node0 GPU0 -> run first function and write result to Node0 GPU1 Node0 GPU1 -> run second function and write result to Node1 GPU3 Node1 GPU3 -> run third function and write result to Node1 GPU2 Node1 Net1 -> read result from Node1 GPU2 and stream it out Yes this kind of thing can be decided at application startup during initialization. Idea is that you model your program computation graph each node is a function (or group of functions) and each arrow is data flow (input and output). So you have a graph, now what you do is try to find a sub-graph of your system topology that match this graph and for the system topology you also have to check that each of your program node can run on the specific accelerator node of your system (does the accelerator have the feature X and Y ?) If you are not lucky and that there is no 1 to 1 match the you can can re-arrange/simplify your application computation graph. For instance group multiple of your application function node into just one node to shrink your computation graph. Rinse and repeat. Moreover each application will have multiple separate computation graph and the application will want to spread as evenly as possible its workload and select the most powerfull accelerator for the most intensive computation ... I do not see how to have graph matching API with complex testing where you need to query back userspace library. Like querying if the userspace penCL driver for GPU A support feature X ? Which might not only depend on the device generation or kernel device driver version but also on the version of the userspace driver. I feel it would be a lot easier to provide a graph to userspace and have userspace do this complex matching and adaption of its computation graph and load balance its computation at the same time. Of course not all application will be that complex and like i said i believe average app (especialy desktop app design to run on laptop) will just use a dumb down thing ie they will only use one or two devices at the most. Yes all this is hard but easy problems are not interesting to solve. Cheers, Jérôme
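As a very reduced illustration of the graph matching idea above, restricted to a linear pipeline such as the Net0 -> GPU0 -> GPU1 -> GPU3 -> GPU2 -> Net1 chain: the application checks that every hop of a candidate placement is actually connected. The adjacency matrix and device identifiers are assumptions for the example; a real application would also check capabilities and path properties at each hop.

#include <stdbool.h>

#define MAX_DEV 16

/* connected[a][b]: device a can write directly into device b's memory */
static bool connected[MAX_DEV][MAX_DEV];

/*
 * chain[0] produces the input, chain[nstages + 1] consumes the output and
 * the nstages functions run on chain[1] .. chain[nstages].  Returns true
 * when every hop in the placement exists in the topology.
 */
static bool chain_is_feasible(const int *chain, int nstages)
{
	for (int i = 0; i <= nstages; i++)
		if (!connected[chain[i]][chain[i + 1]])
			return false;
	return true;
}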
On 12/6/18 12:11 PM, Logan Gunthorpe wrote: >> My concern with having folks do per-program parsing, *and* having a huge >> amount of data to parse makes it unusable. The largest systems will >> literally have hundreds of thousands of objects in /sysfs, even in a >> single directory. That makes readdir() basically impossible, and makes >> even open() (if you already know the path you want somehow) hard to do fast. > Is this actually realistic? I find it hard to imagine an actual hardware > bus that can have even thousands of devices under a single node, let > alone hundreds of thousands. Jerome's proposal, as I understand it, would have generic "links". They're not an instance of bus, but characterize a class of "link". For instance, a "link" might characterize the characteristics of the QPI bus between two CPU sockets. The link directory would enumerate the list of all *instances* of that link So, a "link" directory for QPI would say Socket0<->Socket1, Socket1<->Socket2, Socket1<->Socket2, Socket2<->PCIe-1.2.3.4 etc... It would have to enumerate the connections between every entity that shared those link properties. While there might not be millions of buses, there could be millions of *paths* across all those buses, and that's what the HMAT describes, at least: the net result of all those paths.
On Thu, Dec 06, 2018 at 02:04:46PM -0800, Dave Hansen wrote: > On 12/6/18 12:11 PM, Logan Gunthorpe wrote: > >> My concern with having folks do per-program parsing, *and* having a huge > >> amount of data to parse makes it unusable. The largest systems will > >> literally have hundreds of thousands of objects in /sysfs, even in a > >> single directory. That makes readdir() basically impossible, and makes > >> even open() (if you already know the path you want somehow) hard to do fast. > > Is this actually realistic? I find it hard to imagine an actual hardware > > bus that can have even thousands of devices under a single node, let > > alone hundreds of thousands. > > Jerome's proposal, as I understand it, would have generic "links". > They're not an instance of bus, but characterize a class of "link". For > instance, a "link" might characterize the characteristics of the QPI bus > between two CPU sockets. The link directory would enumerate the list of > all *instances* of that link > > So, a "link" directory for QPI would say Socket0<->Socket1, > Socket1<->Socket2, Socket1<->Socket2, Socket2<->PCIe-1.2.3.4 etc... It > would have to enumerate the connections between every entity that shared > those link properties. > > While there might not be millions of buses, there could be millions of > *paths* across all those buses, and that's what the HMAT describes, at > least: the net result of all those paths. Sorry if again i miss-explained thing. Link are arrows between nodes (CPU or device or memory). An arrow/link has properties associated with it: bandwidth, latency, cache-coherent, ... So if in your system you 4 Sockets and that each socket is connected to each other (mesh) and all inter-connect in the mesh have same property then you only have 1 link directory with the 4 socket in it. No if the 4 sockets are connect in a ring fashion ie: Socket0 - Socket1 | | Socket3 - Socket2 Then you have 4 links: link0: socket0 socket1 link1: socket1 socket2 link3: socket2 socket3 link4: socket3 socket0 I do not see how their can be an explosion of link directory, worse case is as many link directories as they are bus for a CPU/device/ target. So worse case if you have N devices and each devices is connected two 2 bus (PCIE and QPI to go to other socket for instance) then you have 2*N link directory (again this is a worst case). They are lot of commonality that will remain so i expect that quite a few link directory will have many symlink ie you won't get close to the worst case. In the end really it is easier to think from the physical topology and there a link correspond to an inter-connect between two device or CPU. In all the systems i have seen even in the craziest roadmap i have only seen thing like 128/256 inter-connect (4 socket 32/64 devices per socket) and many of which can be grouped under a common link directory. Here worse case is 4 connection per device/CPU/ target so worse case of 128/256 * 4 = 512/1024 link directory and that's a lot. Given regularity i have seen described on slides i expect that it would need something like 30 link directory and 20 bridges directory. On today system 8GPU per socket with GPUlink between each GPU and PCIE all this with 4 socket it comes down to 20 links directory. In any cases each devices/CPU/target has a limit on the number of bus/inter-connect it is connected too. I doubt there is anyone designing device that will have much more than 4 external bus connection. So it is not a link per pair. It is a link for group of device/CPU/ target. 
Is it any clearer ? Cheers, Jérôme
On 12/6/18 2:39 PM, Jerome Glisse wrote: > No if the 4 sockets are connect in a ring fashion ie: > Socket0 - Socket1 > | | > Socket3 - Socket2 > > Then you have 4 links: > link0: socket0 socket1 > link1: socket1 socket2 > link3: socket2 socket3 > link4: socket3 socket0 > > I do not see how their can be an explosion of link directory, worse > case is as many link directories as they are bus for a CPU/device/ > target. This looks great. But, we don't _have_ this kind of information for any system that I know about or any system available in the near future. We basically have two different world views: 1. The system is described point-to-point. A connects to B @ 100GB/s. B connects to C at 50GB/s. Thus, C->A should be 50GB/s. * Less information to convey * Potentially less precise if the properties are not perfectly additive. If A->B=10ns and B->C=20ns, A->C might be >30ns. * Costs must be calculated instead of being explicitly specified 2. The system is described endpoint-to-endpoint. A->B @ 100GB/s B->C @ 50GB/s, A->C @ 50GB/s. * A *lot* more information to convey O(N^2)? * Potentially more precise. * Costs are explicitly specified, not calculated These patches are really tied to world view #1. But, the HMAT is really tied to world view #1. I know you're not a fan of the HMAT. But it is the firmware reality that we are stuck with, until something better shows up. I just don't see a way to convert it into what you have described here. I'm starting to think that, no matter if the HMAT or some other approach gets adopted, we shouldn't be exposing this level of gunk to userspace at *all* since it requires adopting one of the world views.
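For scale, a back-of-the-envelope comparison of the two world views, using the 1024 proximity domains mentioned earlier in the thread and assuming, purely for illustration, at most 4 physical connections per domain:

#include <stdio.h>

int main(void)
{
	unsigned long long domains = 1024;
	unsigned long long links_per_domain = 4;	/* assumption */

	/* world view #2: one entry per initiator/target pair (full matrix) */
	printf("endpoint-to-endpoint entries: %llu\n", domains * domains);	/* 1048576 */
	/* world view #1: one entry per physical connection, before folding
	 * identical-property connections into shared link directories */
	printf("point-to-point connections: %llu\n", domains * links_per_domain / 2);	/* 2048 */
	return 0;
}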
On 2018-12-06 4:09 p.m., Dave Hansen wrote: > This looks great. But, we don't _have_ this kind of information for any > system that I know about or any system available in the near future. > > We basically have two different world views: > 1. The system is described point-to-point. A connects to B @ > 100GB/s. B connects to C at 50GB/s. Thus, C->A should be > 50GB/s. > * Less information to convey > * Potentially less precise if the properties are not perfectly > additive. If A->B=10ns and B->C=20ns, A->C might be >30ns. > * Costs must be calculated instead of being explicitly specified > 2. The system is described endpoint-to-endpoint. A->B @ 100GB/s > B->C @ 50GB/s, A->C @ 50GB/s. > * A *lot* more information to convey O(N^2)? > * Potentially more precise. > * Costs are explicitly specified, not calculated > > These patches are really tied to world view #1. But, the HMAT is really > tied to world view #1. I didn't think this was meant to describe actual real world performance between all of the links. If that's the case all of this seems like a pipe dream to me. Attributes like cache coherency, atomics, etc should fit well in world view #1... and, at best, some kind of flag saying whether or not to use a particular link if you care about transfer speed. -- But we don't need special "link" directories to describe the properties of existing buses. You're not *really* going to know bandwidth or latency for any of this unless you actually measure it on the system in question. Logan
On 12/6/18 3:28 PM, Logan Gunthorpe wrote: > These patches are really tied to world view #1. But, the HMAT is really > tied to world view #1. Whoops, should have been "the HMAT is really tied to world view #2"
On 12/6/18 3:28 PM, Logan Gunthorpe wrote: > I didn't think this was meant to describe actual real world performance > between all of the links. If that's the case all of this seems like a > pipe dream to me. The HMAT discussions (that I was a part of at least) settled on just trying to describe what we called "sticker speed". Nobody had an expectation that you *really* had to measure everything. The best we can do for any of these approaches is approximate things. > You're not *really* going to know bandwidth or latency for any of this > unless you actually measure it on the system in question. Yeah, agreed.
On 2018-12-06 4:38 p.m., Dave Hansen wrote: > On 12/6/18 3:28 PM, Logan Gunthorpe wrote: >> I didn't think this was meant to describe actual real world performance >> between all of the links. If that's the case all of this seems like a >> pipe dream to me. > > The HMAT discussions (that I was a part of at least) settled on just > trying to describe what we called "sticker speed". Nobody had an > expectation that you *really* had to measure everything. > > The best we can do for any of these approaches is approximate things. Yes, though there's a lot of caveats in this assumption alone. Specifically with PCI: the bus may run at however many GB/s but P2P through a CPU's root complexes can slow down significantly (like down to MB/s). I've seen similar things across QPI: I can sometimes do P2P from PCI->QPI->PCI but the performance doesn't even come close to the sticker speed of any of those buses. I'm not sure how anyone is going to deal with those issues, but it does firmly place us in world view #2 instead of #1. But, yes, I agree exposing information like in #2 full out to userspace, especially through sysfs, seems like a nightmare and I don't see anything in HMS to help with that. Providing an API to ask for memory (or another resource) that's accessible by a set of initiators and with a set of requirements for capabilities seems more manageable. Logan
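A purely illustrative sketch of the request-style interface suggested above: ask for memory reachable by a given set of initiators with given capabilities, instead of parsing the whole topology. None of these names exist in HMS or in any kernel API; they only show the shape such an interface could take.

#include <stddef.h>
#include <stdint.h>

#define MEMREQ_CACHE_COHERENT  (1u << 0)
#define MEMREQ_ATOMICS         (1u << 1)

struct mem_request {
	const int *initiators;     /* device/CPU identifiers that must reach it */
	size_t     ninitiators;
	uint64_t   required_caps;  /* MEMREQ_* flags */
	uint64_t   min_bandwidth;  /* best-case GB/s, 0 = don't care */
	size_t     size;
};

/* would return an identifier for a memory target satisfying the request,
 * or a negative error; the implementation is intentionally left out */
int mem_request_target(const struct mem_request *req);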
On Thu, Dec 06, 2018 at 03:09:21PM -0800, Dave Hansen wrote: > On 12/6/18 2:39 PM, Jerome Glisse wrote: > > No if the 4 sockets are connect in a ring fashion ie: > > Socket0 - Socket1 > > | | > > Socket3 - Socket2 > > > > Then you have 4 links: > > link0: socket0 socket1 > > link1: socket1 socket2 > > link3: socket2 socket3 > > link4: socket3 socket0 > > > > I do not see how their can be an explosion of link directory, worse > > case is as many link directories as they are bus for a CPU/device/ > > target. > > This looks great. But, we don't _have_ this kind of information for any > system that I know about or any system available in the near future. We do not have it in any standard way, it is out there in either device driver database, application data base, special platform OEM blob burried somewhere in the firmware ... I want to solve the kernel side of the problem ie how to expose this to userspace. How the kernel get that information is an orthogonal problem. For now my intention is to have device driver register and create the links and bridges that are not enumerated by standard firmware. > > We basically have two different world views: > 1. The system is described point-to-point. A connects to B @ > 100GB/s. B connects to C at 50GB/s. Thus, C->A should be > 50GB/s. > * Less information to convey > * Potentially less precise if the properties are not perfectly > additive. If A->B=10ns and B->C=20ns, A->C might be >30ns. > * Costs must be calculated instead of being explicitly specified > 2. The system is described endpoint-to-endpoint. A->B @ 100GB/s > B->C @ 50GB/s, A->C @ 50GB/s. > * A *lot* more information to convey O(N^2)? > * Potentially more precise. > * Costs are explicitly specified, not calculated > > These patches are really tied to world view #1. But, the HMAT is really > tied to world view #1. ^#2 Note that they are also the bridge object in my proposal. So in my proposal you in #1 you have: link0: A <-> B with 100GB/s and 10ns latency link1: B <-> C with 50GB/s and 20ns latency Now if A can reach C through B then you have bridges (bridge are uni- directional unlike link that are bi-directional thought that finer point can be discuss this is what allow any kind of directed graph to be represented): bridge2: link0 -> link1 bridge3: link1 -> link0 You can also associated properties to bridge (but it is not mandatory). So you can say that bridge2 and bridge3 have a latency of 50ns and if the addition of latency is enough then you do not specificy it in bridge. It is a rule that a path latency is the sum of its individual link latency. For bandwidth it is the minimum bandwidth ie what ever is the bottleneck for the path. > I know you're not a fan of the HMAT. But it is the firmware reality > that we are stuck with, until something better shows up. I just don't > see a way to convert it into what you have described here. Like i said i am not targetting HMAT system i am targeting system that rely today on database spread between driver and application. I want to move that knowledge in driver first so that they can teach the core kernel and register thing in the core. Providing a standard firmware way to provide this information is a different problem (they are some loose standard on non ACPI platform AFAIK). > I'm starting to think that, no matter if the HMAT or some other approach > gets adopted, we shouldn't be exposing this level of gunk to userspace > at *all* since it requires adopting one of the world views. I do not see this as exclusive. 
Yes, there are HMAT systems "soon" to arrive, but we already have the more extended view, which is just buried under a pile of different pieces. I do not see any exclusion between the two. If HMAT is good enough for a whole class of systems, fine, but there is also a whole class of systems and users that do not fit in that paradigm, hence my proposal. Cheers, Jérôme
On Thu, Dec 06, 2018 at 04:48:57PM -0700, Logan Gunthorpe wrote: > > > On 2018-12-06 4:38 p.m., Dave Hansen wrote: > > On 12/6/18 3:28 PM, Logan Gunthorpe wrote: > >> I didn't think this was meant to describe actual real world performance > >> between all of the links. If that's the case all of this seems like a > >> pipe dream to me. > > > > The HMAT discussions (that I was a part of at least) settled on just > > trying to describe what we called "sticker speed". Nobody had an > > expectation that you *really* had to measure everything. > > > > The best we can do for any of these approaches is approximate things. > > Yes, though there's a lot of caveats in this assumption alone. > Specifically with PCI: the bus may run at however many GB/s but P2P > through a CPU's root complexes can slow down significantly (like down to > MB/s). > > I've seen similar things across QPI: I can sometimes do P2P from > PCI->QPI->PCI but the performance doesn't even come close to the sticker > speed of any of those buses. > > I'm not sure how anyone is going to deal with those issues, but it does > firmly place us in world view #2 instead of #1. But, yes, I agree > exposing information like in #2 full out to userspace, especially > through sysfs, seems like a nightmare and I don't see anything in HMS to > help with that. Providing an API to ask for memory (or another resource) > that's accessible by a set of initiators and with a set of requirements > for capabilities seems more manageable. Note that in #1 you have bridge that fully allow to express those path limitation. So what you just describe can be fully reported to userspace. I explained and given examples on how program adapt their computation to the system topology it does exist today and people are even developing new programming langage with some of those idea baked in. So they are people out there that already rely on such information they just do not get it from the kernel but from a mix of various device specific API and they have to stich everything themself and develop a database of quirk and gotcha. My proposal is to provide a coherent kernel API where we can sanitize that informations and report it to userspace in a single and coherent description. Cheers, Jérôme
On Thu, 6 Dec 2018 19:20:45 -0500 Jerome Glisse <jglisse@redhat.com> wrote: > On Thu, Dec 06, 2018 at 04:48:57PM -0700, Logan Gunthorpe wrote: > > > > > > On 2018-12-06 4:38 p.m., Dave Hansen wrote: > > > On 12/6/18 3:28 PM, Logan Gunthorpe wrote: > > >> I didn't think this was meant to describe actual real world performance > > >> between all of the links. If that's the case all of this seems like a > > >> pipe dream to me. > > > > > > The HMAT discussions (that I was a part of at least) settled on just > > > trying to describe what we called "sticker speed". Nobody had an > > > expectation that you *really* had to measure everything. > > > > > > The best we can do for any of these approaches is approximate things. > > > > Yes, though there's a lot of caveats in this assumption alone. > > Specifically with PCI: the bus may run at however many GB/s but P2P > > through a CPU's root complexes can slow down significantly (like down to > > MB/s). > > > > I've seen similar things across QPI: I can sometimes do P2P from > > PCI->QPI->PCI but the performance doesn't even come close to the sticker > > speed of any of those buses. > > > > I'm not sure how anyone is going to deal with those issues, but it does > > firmly place us in world view #2 instead of #1. But, yes, I agree > > exposing information like in #2 full out to userspace, especially > > through sysfs, seems like a nightmare and I don't see anything in HMS to > > help with that. Providing an API to ask for memory (or another resource) > > that's accessible by a set of initiators and with a set of requirements > > for capabilities seems more manageable. > > Note that in #1 you have bridge that fully allow to express those path > limitation. So what you just describe can be fully reported to userspace. > > I explained and given examples on how program adapt their computation to > the system topology it does exist today and people are even developing new > programming langage with some of those idea baked in. > > So they are people out there that already rely on such information they > just do not get it from the kernel but from a mix of various device specific > API and they have to stich everything themself and develop a database of > quirk and gotcha. My proposal is to provide a coherent kernel API where > we can sanitize that informations and report it to userspace in a single > and coherent description. > > Cheers, > Jérôme I know it doesn't work everywhere, but I think it's worth enumerating what cases we can get some of these numbers for and where the complexity lies. I.e. What can the really determined user space library do today? So one open question is how close can we get in a userspace only prototype. At the end of the day userspace can often read HMAT directly if it wants to /sys/firmware/acpi/tables/HMAT. Obviously that gets us only the end to end view (world 2). I dislike the limitations of that as much as the next person. It is slowly improving with the word "Auditable" being kicked around - btw anyone interested in ACPI who works for a UEFI member, there are efforts going on and more viewpoints would be great. Expect some baby steps shortly. For devices on PCIe (and protocols on top of it e.g. CCIX), a lot of this is discoverable to some degree. * Link speed, * Number of Lanes, * Full topology. 
What isn't there (I think) * In component latency / bandwidth limitations (some activity going on to improve that long term) * Effect of credit allocations etc on effectively bandwidth - interconnect performance is a whole load of black magic. Presumably there is some information available from NVLink etc? So whilst I really like the proposal in some ways, I wonder how much exploration could be done of the usefulness of the data without touching the kernel at all. The other aspect that is needed to actually make this 'dynamically' useful is to be able to map whatever Performance Counters are available to the relevant 'links', bridges etc. Ticket numbers are not all that useful unfortunately except for small amounts of data on lightly loaded buses. The kernel ultimately only needs to have a model of this topology if: 1) It's going to use it itself 2) Its going to do something automatic with it. 3) It needs to fix garbage info or supplement with things only the kernel knows. Jonathan
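For the PCIe part of what is discoverable today, negotiated link speed and width are already exposed through sysfs attributes such as current_link_speed and current_link_width, and a determined user-space library can simply read them. The device address below is made up.

#include <stdio.h>

static void print_attr(const char *path)
{
	char buf[64];
	FILE *f = fopen(path, "r");

	if (!f)
		return;
	if (fgets(buf, sizeof(buf), f))
		printf("%s: %s", path, buf);
	fclose(f);
}

int main(void)
{
	/* hypothetical device 0000:01:00.0 */
	print_attr("/sys/bus/pci/devices/0000:01:00.0/current_link_speed");
	print_attr("/sys/bus/pci/devices/0000:01:00.0/current_link_width");
	print_attr("/sys/bus/pci/devices/0000:01:00.0/max_link_speed");
	print_attr("/sys/bus/pci/devices/0000:01:00.0/max_link_width");
	return 0;
}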
On Fri, Dec 07, 2018 at 03:06:36PM +0000, Jonathan Cameron wrote: > On Thu, 6 Dec 2018 19:20:45 -0500 > Jerome Glisse <jglisse@redhat.com> wrote: > > > On Thu, Dec 06, 2018 at 04:48:57PM -0700, Logan Gunthorpe wrote: > > > > > > > > > On 2018-12-06 4:38 p.m., Dave Hansen wrote: > > > > On 12/6/18 3:28 PM, Logan Gunthorpe wrote: > > > >> I didn't think this was meant to describe actual real world performance > > > >> between all of the links. If that's the case all of this seems like a > > > >> pipe dream to me. > > > > > > > > The HMAT discussions (that I was a part of at least) settled on just > > > > trying to describe what we called "sticker speed". Nobody had an > > > > expectation that you *really* had to measure everything. > > > > > > > > The best we can do for any of these approaches is approximate things. > > > > > > Yes, though there's a lot of caveats in this assumption alone. > > > Specifically with PCI: the bus may run at however many GB/s but P2P > > > through a CPU's root complexes can slow down significantly (like down to > > > MB/s). > > > > > > I've seen similar things across QPI: I can sometimes do P2P from > > > PCI->QPI->PCI but the performance doesn't even come close to the sticker > > > speed of any of those buses. > > > > > > I'm not sure how anyone is going to deal with those issues, but it does > > > firmly place us in world view #2 instead of #1. But, yes, I agree > > > exposing information like in #2 full out to userspace, especially > > > through sysfs, seems like a nightmare and I don't see anything in HMS to > > > help with that. Providing an API to ask for memory (or another resource) > > > that's accessible by a set of initiators and with a set of requirements > > > for capabilities seems more manageable. > > > > Note that in #1 you have bridge that fully allow to express those path > > limitation. So what you just describe can be fully reported to userspace. > > > > I explained and given examples on how program adapt their computation to > > the system topology it does exist today and people are even developing new > > programming langage with some of those idea baked in. > > > > So they are people out there that already rely on such information they > > just do not get it from the kernel but from a mix of various device specific > > API and they have to stich everything themself and develop a database of > > quirk and gotcha. My proposal is to provide a coherent kernel API where > > we can sanitize that informations and report it to userspace in a single > > and coherent description. > > > > Cheers, > > Jérôme > > I know it doesn't work everywhere, but I think it's worth enumerating what > cases we can get some of these numbers for and where the complexity lies. > I.e. What can the really determined user space library do today? I gave an example in an email in this thread: https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1821872.html Is the kind of example you are looking for ? :) > > So one open question is how close can we get in a userspace only prototype. > At the end of the day userspace can often read HMAT directly if it wants to > /sys/firmware/acpi/tables/HMAT. Obviously that gets us only the end to > end view (world 2). I dislike the limitations of that as much as the next > person. It is slowly improving with the word "Auditable" being > kicked around - btw anyone interested in ACPI who works for a UEFI > member, there are efforts going on and more viewpoints would be great. > Expect some baby steps shortly. 
> > For devices on PCIe (and protocols on top of it e.g. CCIX), a lot of > this is discoverable to some degree. > * Link speed, > * Number of Lanes, > * Full topology. Yes discoverable bus like PCIE and all its derivative (CCIX, OpenCAPI, ...) userspace will have way to find the topology. The issue lies with orthogonal topology of extra bus that are not necessarily enumerated or with a device driver presently and especially how they inter-act with each other (can you cross them ? ...) > > What isn't there (I think) > * In component latency / bandwidth limitations (some activity going > on to improve that long term) > * Effect of credit allocations etc on effectively bandwidth - interconnect > performance is a whole load of black magic. > > Presumably there is some information available from NVLink etc? From my point of view we want to give the best case sticker value to userspace ie the bandwidth the engineer that designed the bus sworn their hardware deliver :) I believe it the is the best approximation we can deliver. > > So whilst I really like the proposal in some ways, I wonder how much exploration > could be done of the usefulness of the data without touching the kernel at all. > > The other aspect that is needed to actually make this 'dynamically' useful is > to be able to map whatever Performance Counters are available to the relevant > 'links', bridges etc. Ticket numbers are not all that useful unfortunately > except for small amounts of data on lightly loaded buses. > > The kernel ultimately only needs to have a model of this topology if: > 1) It's going to use it itself I don't think this should be a criteria, kernel is not using GPU or network adatper to browse the web for itself (at least i hope the linux kernel is not selfaware ;)). So this kind of topology is not of big use to the kernel. Kernel will only care about CPU and memory that abide to the memory model of the platform. It will also care about more irregular CPU inter-connected ie CPUs on the same mega substrate likely have a faster inter-connect between them then to the ones in a different physical socket. NUMA distance can model that. Dunno if more than that would be useful to the kernel. > 2) Its going to do something automatic with it. The information is intended for userspace for application that use that information. Today application get that information from non standard source and i would like to provide this in a standard common place in the kernel for few reasons: - Common model with explicit definition of what is what and what are the rules. No need to userspace to understand the specificities of various kernel sub-system. - Define unique identifiant for _every_ type of memory in the system even device memory so that i can define syscall to operate on those memory (can not do that in device driver) - Integrate with core mm so that long term we can move more of individual device memory management into core component. > 3) It needs to fix garbage info or supplement with things only the kernel knows. Yes kernel is expect to fix the informations it get and sanitize it so that userspace do not have to grow database of quirk and workaround. Moreover kernel can also benchmark inter-connect and adapt reported bandwidth and latency if this is ever something people would like to see. I will post two v2 where i split the common helpers from the sysfs and syscall part. I need the common helpers today in the case of single device and have user for that code (nouveau and amdgpu for starter). 
I want to continue the sysfs and syscall discussion, and I need to reformulate things and give a better explanation of why I think the way I am doing things has more value than the alternatives. I do not know if I will have time to finish reworking all this before the end of this year.

Cheers,
Jérôme
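[Editor's note] As a concrete illustration of the "what can a determined userspace library do today" question discussed above, here is a minimal sketch that reads the PCIe "sticker" link capabilities straight from sysfs. The current_link_speed/current_link_width and max_link_speed/max_link_width attributes are exposed by reasonably recent kernels for PCIe devices, but not every device or kernel provides all of them, hence the fallback message; the BDF on the command line is just an example. As the thread points out, this only yields per-link sticker values, not end-to-end P2P behaviour through a root complex.

/*
 * Illustration only: print PCIe link "sticker" attributes from sysfs.
 * Usage: ./pcie-link 0000:03:00.0
 */
#include <stdio.h>

static void print_attr(const char *bdf, const char *attr)
{
    char path[256], buf[64];
    FILE *f;

    snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/%s", bdf, attr);
    f = fopen(path, "r");
    if (!f) {
        printf("%-20s: <not exposed on this kernel/device>\n", attr);
        return;
    }
    if (fgets(buf, sizeof(buf), f))
        printf("%-20s: %s", attr, buf);
    fclose(f);
}

int main(int argc, char **argv)
{
    const char *attrs[] = { "current_link_speed", "current_link_width",
                            "max_link_speed", "max_link_width" };
    unsigned i;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pci-bdf>\n", argv[0]);
        return 1;
    }
    for (i = 0; i < sizeof(attrs) / sizeof(attrs[0]); i++)
        print_attr(argv[1], attrs[i]);
    return 0;
}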
From: Jérôme Glisse <jglisse@redhat.com> Heterogeneous memory system are becoming more and more the norm, in those system there is not only the main system memory for each node, but also device memory and|or memory hierarchy to consider. Device memory can comes from a device like GPU, FPGA, ... or from a memory only device (persistent memory, or high density memory device). Memory hierarchy is when you not only have the main memory but also other type of memory like HBM (High Bandwidth Memory often stack up on CPU die or GPU die), peristent memory or high density memory (ie something slower then regular DDR DIMM but much bigger). On top of this diversity of memories you also have to account for the system bus topology ie how all CPUs and devices are connected to each others. Userspace do not care about the exact physical topology but care about topology from behavior point of view ie what are all the paths between an initiator (anything that can initiate memory access like CPU, GPU, FGPA, network controller ...) and a target memory and what are all the properties of each of those path (bandwidth, latency, granularity, ...). This means that it is no longer sufficient to consider a flat view for each node in a system but for maximum performance we need to account for all of this new memory but also for system topology. This is why this proposal is unlike the HMAT proposal [1] which tries to extend the existing NUMA for new type of memory. Here we are tackling a much more profound change that depart from NUMA. One of the reasons for radical change is the advance of accelerator like GPU or FPGA means that CPU is no longer the only piece where computation happens. It is becoming more and more common for an application to use a mix and match of different accelerator to perform its computation. So we can no longer satisfy our self with a CPU centric and flat view of a system like NUMA and NUMA distance. This patchset is a proposal to tackle this problems through three aspects: 1 - Expose complex system topology and various kind of memory to user space so that application have a standard way and single place to get all the information it cares about. 2 - A new API for user space to bind/provide hint to kernel on which memory to use for range of virtual address (a new mbind() syscall). 3 - Kernel side changes for vm policy to handle this changes This patchset is not and end to end solution but it provides enough pieces to be useful against nouveau (upstream open source driver for NVidia GPU). It is intended as a starting point for discussion so that we can figure out what to do. To avoid having too much topics to discuss i am not considering memory cgroup for now but it is definitely something we will want to integrate with. The rest of this emails is splits in 3 sections, the first section talks about complex system topology: what it is, how it is use today and how to describe it tomorrow. The second sections talks about new API to bind/provide hint to kernel for range of virtual address. The third section talks about new mechanism to track bind/hint provided by user space or device driver inside the kernel. 1) Complex system topology and representing them ------------------------------------------------ Inside a node you can have a complex topology of memory, for instance you can have multiple HBM memory in a node, each HBM memory tie to a set of CPUs (all of which are in the same node). This means that you have a hierarchy of memory for CPUs. 
The local fast HBM but which is expected to be relatively small compare to main memory and then the main memory. New memory technology might also deepen this hierarchy with another level of yet slower memory but gigantic in size (some persistent memory technology might fall into that category). Another example is device memory, and device themself can have a hierarchy like HBM on top of device core and main device memory. On top of that you can have multiple path to access each memory and each path can have different properties (latency, bandwidth, ...). Also there is not always symmetry ie some memory might only be accessible by some device or CPU ie not accessible by everyone. So a flat hierarchy for each node is not capable of representing this kind of complexity. To simplify discussion and because we do not want to single out CPU from device, from here on out we will use initiator to refer to either CPU or device. An initiator is any kind of CPU or device that can access memory (ie initiate memory access). At this point a example of such system might help: - 2 nodes and for each node: - 1 CPU per node with 2 complex of CPUs cores per CPU - one HBM memory for each complex of CPUs cores (200GB/s) - CPUs cores complex are linked to each other (100GB/s) - main memory is (90GB/s) - 4 GPUs each with: - HBM memory for each GPU (1000GB/s) (not CPU accessible) - GDDR memory for each GPU (500GB/s) (CPU accessible) - connected to CPU root controller (60GB/s) - connected to other GPUs (even GPUs from the second node) with GPU link (400GB/s) In this example we restrict our self to bandwidth and ignore bus width or latency, this is just to simplify discussions but obviously they also factor in. Userspace very much would like to know about this information, for instance HPC folks have develop complex library to manage this and there is wide research on the topics [2] [3] [4] [5]. Today most of the work is done by hardcoding thing for specific platform. Which is somewhat acceptable for HPC folks where the platform stays the same for a long period of time. But if we want a more ubiquituous support we should aim to provide the information needed through standard kernel API such as the one presented in this patchset. Roughly speaking i see two broads use case for topology information. First is for virtualization and vm where you want to segment your hardware properly for each vm (binding memory, CPU and GPU that are all close to each others). Second is for application, many of which can partition their workload to minimize exchange between partition allowing each partition to be bind to a subset of device and CPUs that are close to each others (for maximum locality). Here it is much more than just NUMA distance, you can leverage the memory hierarchy and the system topology all-together (see [2] [3] [4] [5] for more references and details). So this is not exposing topology just for the sake of cool graph in userspace. They are active user today of such information and if we want to growth and broaden the usage we should provide a unified API to standardize how that information is accessible to every one. One proposal so far to handle new type of memory is to user CPU less node for those [6]. While same idea can apply for device memory, it is still hard to describe multiple path with different property in such scheme. While it is backward compatible and have minimum changes, it simplify can not convey complex topology (think any kind of random graph, not just a tree like graph). 
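[Editor's note] For comparison with the CPU-less node approach of [6], here is a minimal sketch of how userspace can already spot memory-only (CPU-less) NUMA nodes with the existing node sysfs files. It also illustrates the limitation pointed out above: all you get back is a node id, nothing about the different paths and their properties between initiators and that memory.

/*
 * Illustration only: list NUMA nodes that have no CPUs attached by
 * checking for an empty cpulist under /sys/devices/system/node/.
 */
#include <dirent.h>
#include <stdio.h>

int main(void)
{
    DIR *d = opendir("/sys/devices/system/node");
    struct dirent *e;

    if (!d) {
        perror("opendir /sys/devices/system/node");
        return 1;
    }
    while ((e = readdir(d)) != NULL) {
        unsigned node;
        char path[256], cpulist[256] = "";
        FILE *f;

        if (sscanf(e->d_name, "node%u", &node) != 1)
            continue;
        snprintf(path, sizeof(path),
                 "/sys/devices/system/node/node%u/cpulist", node);
        f = fopen(path, "r");
        if (!f)
            continue;
        if (!fgets(cpulist, sizeof(cpulist), f))
            cpulist[0] = '\0';
        fclose(f);
        /* An empty cpulist means no CPU is attached to this node. */
        if (cpulist[0] == '\0' || cpulist[0] == '\n')
            printf("node%u is memory only\n", node);
    }
    closedir(d);
    return 0;
}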
Thus far this kind of system has been used through device-specific APIs and relies on all kinds of system-specific quirks. To avoid this getting out of hand and growing into a bigger mess than it already is, this patchset tries to provide a common generic API that should fit various devices (GPU, FPGA, ...).

So this patchset proposes a new way to expose the system topology to userspace. It relies on 4 types of objects:
  - target: any kind of memory (main memory, HBM, device, ...)
  - initiator: CPU or device (anything that can access memory)
  - link: anything that links initiators and targets
  - bridge: anything that allows a group of initiators to access a remote target (ie a target they are not connected to directly through a link)

Properties like bandwidth, latency, ... are all set per bridge and per link. All initiators connected to a link can access any target memory also connected to the same link, all with the same link properties.

A link does not need to match physical hardware, ie a single physical link can be exposed as one or several software links. This allows modeling devices that are connected to the same physical link (like PCIE for instance) but not with the same characteristics (like number of lanes or lane speed in PCIE). The reverse is also true, ie a single software-exposed link can match multiple physical links.

Bridges allow initiators to reach a remote link. A bridge connects two links to each other and is also specific to a list of initiators (ie not all initiators connected to each of the links can use the bridge). Bridges have their own properties (bandwidth, latency, ...), so the effective value of each property along a path is the lowest common denominator between the bridge and each of the links.

This model can describe any kind of directed graph and thus any kind of topology we might see in the future. It also makes it easy to add new properties to each object type. Moreover it can be used to expose devices capable of peer-to-peer between them: simply have all devices capable of peer-to-peer share a common link, or use the bridge object if the peer-to-peer capability is only one way, for instance.

This patchset uses the above scheme to expose the system topology through sysfs under /sys/bus/hms/ with:
  - /sys/bus/hms/devices/v%version-%id-target/ : a target memory; each has a UID and you find the usual values in that folder (node id, size, ...)
  - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator (CPU or device); each has an HMS UID, and for a CPU also the CPU id (which matches the CPU id in /sys/bus/cpu/). For a device you have a path that can be the PCIE bus ID for instance.
  - /sys/bus/hms/devices/v%version-%id-link : a link; each has a UID and a file per property (bandwidth, latency, ...), and you also find a symlink to every target and initiator connected to that link.
  - /sys/bus/hms/devices/v%version-%id-bridge : a bridge; each has a UID and a file per property (bandwidth, latency, ...), and you also find a symlink to all initiators that can use that bridge.

To help with forward compatibility, each object has a version value and it is mandatory for userspace to only use targets or initiators whose version it supports. For instance if userspace only knows what version 1 means and sees a target with version 2, then it must ignore that target as if it did not exist. Mandating that allows the addition of new properties that break backward compatibility, ie userspace must know how the new property affects the object to be able to use it safely.
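[Editor's note] To make the proposed layout concrete, here is a minimal userspace sketch that enumerates /sys/bus/hms/devices/ following the naming described above. The v%version-%id-<kind> directory format is taken from the cover letter; the literal property file name "bandwidth" is an assumption for illustration and has not been checked against the actual patches.

/*
 * Illustration only: walk the proposed HMS sysfs layout and print the
 * objects we understand, plus the (assumed) bandwidth file for links
 * and bridges.
 */
#include <dirent.h>
#include <stdio.h>
#include <string.h>

#define HMS_DIR "/sys/bus/hms/devices"

int main(void)
{
    DIR *d = opendir(HMS_DIR);
    struct dirent *e;

    if (!d) {
        perror("opendir " HMS_DIR);
        return 1;
    }
    while ((e = readdir(d)) != NULL) {
        unsigned version, uid;
        char kind[32];

        /* Entries look like v1-42-target, v1-7-link, ... */
        if (sscanf(e->d_name, "v%u-%u-%31s", &version, &uid, kind) != 3)
            continue;
        if (version > 1)        /* only use versions we understand */
            continue;
        printf("%-10s uid=%u version=%u\n", kind, uid, version);

        if (!strcmp(kind, "link") || !strcmp(kind, "bridge")) {
            char path[512], buf[64];
            FILE *f;

            snprintf(path, sizeof(path), HMS_DIR "/%s/bandwidth",
                     e->d_name);
            f = fopen(path, "r");
            if (f) {
                if (fgets(buf, sizeof(buf), f))
                    printf("    bandwidth: %s", buf);
                fclose(f);
            }
        }
    }
    closedir(d);
    return 0;
}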
This patchset exposes the main memory of each node under a common target. For now device drivers are responsible for registering the memory they want to expose through that scheme, but in the future that information might come from the system firmware (this is a different discussion).


2) hbind(): bind ranges of virtual addresses to heterogeneous memory
---------------------------------------------------------------------

With this new topology description the mbind() API is too limited to express which memory to pick. This is why this patchset introduces a new API: hbind(), for heterogeneous bind. The hbind() API allows binding any kind of target memory (using the HMS target uid); this can be any memory exposed through HMS, ie main memory, HBM, device memory ...

So instead of using a bitmap, hbind() takes an array of uids, where each uid is a unique memory target inside the new memory topology description. Userspace also provides an array of modifiers; this patchset only defines a few of them. A modifier can be seen as the flags parameter of mbind(), but here we use an array so that userspace can supply not only a modifier but also a value with it. This should allow the API to grow more features in the future. The kernel should return -EINVAL if it is given an unknown modifier and ignore the call altogether, forcing userspace to restrict itself to the modifiers supported by the kernel it is running on (I know I am dreaming about well-behaved userspace). A usage sketch follows at the end of this section.

Note that none of this is exclusive of automatic memory placement like autonuma; I also believe that we will see something similar to autonuma for device memory. This patchset is just there to provide a new API for processes that wish to have fine control over their memory placement, because a process should know better than the kernel where to place things.

This patchset also adds the necessary bits to the nouveau open source driver for it to expose its memory and to allow a process to bind a range to the GPU memory. Note that on x86 the GPU memory is not accessible by the CPU because PCIE does not allow cache-coherent access to device memory. Thus when using PCIE device memory on x86 it is mapped as swapped out from the CPU point of view, and any CPU access will trigger a migration back to main memory (this is all part of HMM and nouveau, not in this patchset).

This is all done under staging so that we can experiment with the userspace API for a while before committing to anything. Getting this right is hard and it might not happen on the first try, so instead of having to support an API forever I would rather have it live behind staging for people to experiment with, and once we feel confident we have something we can live with, convert it to a syscall.
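[Editor's note] To make the shape of the API concrete, here is a minimal userspace sketch of what an hbind() call could look like while it lives behind staging as an ioctl. Since the cover letter does not spell out the contents of include/uapi/linux/hbind.h, the device node path, structure layout and ioctl number below are hypothetical placeholders, not the patchset's actual ABI; only the general shape (a virtual address range, an array of HMS target uids, an array of modifiers) follows the description above.

/*
 * Hypothetical illustration only: HBIND_DEV, struct hbind_request and
 * HBIND_IOCTL_BIND are made-up names standing in for whatever
 * include/uapi/linux/hbind.h actually defines.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct hbind_request {           /* hypothetical layout */
    uint64_t start;              /* start of virtual address range */
    uint64_t size;               /* size of the range in bytes */
    uint32_t ntargets;           /* number of HMS target uids */
    uint32_t nmodifiers;         /* number of modifier entries */
    uint64_t targets;            /* userspace pointer to uint32_t uids */
    uint64_t modifiers;          /* userspace pointer to modifier array */
};

#define HBIND_DEV "/dev/hbind"                                   /* hypothetical */
#define HBIND_IOCTL_BIND _IOWR('H', 0x00, struct hbind_request)  /* hypothetical */

int main(void)
{
    size_t size = 1 << 21;
    void *range = aligned_alloc(4096, size);
    uint32_t targets[] = { 42 };  /* HMS uid of some target, e.g. GPU memory */
    struct hbind_request req;
    int fd;

    if (!range) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    fd = open(HBIND_DEV, O_RDWR);
    if (fd < 0) {
        perror("open " HBIND_DEV);
        return 1;
    }
    req.start = (uint64_t)(uintptr_t)range;
    req.size = size;
    req.ntargets = 1;
    req.nmodifiers = 0;
    req.targets = (uint64_t)(uintptr_t)targets;
    req.modifiers = 0;
    /* Ask the kernel to place (or migrate) the range on target uid 42. */
    if (ioctl(fd, HBIND_IOCTL_BIND, &req))
        perror("hbind ioctl");
    close(fd);
    return 0;
}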
Also the policy is not directly tied to the vma structure, for a few reasons:
  - avoid having to split vmas for policies that do not cover a full vma
  - avoid changing too much vma code
  - avoid growing the vma structure with an extra pointer

So instead this patchset uses the mmu_notifier API to track vma liveness (munmap(), mremap(), ...).

This patchset is not tied to process memory allocation either (as said at the beginning this is not an end-to-end patchset but a starting point). It does however demonstrate how migration to device memory can work under this scheme (using nouveau as a demonstration vehicle).

The overall design is simple: on an hbind() call, an hms policy structure is created for the supplied range and hms uses the callback associated with the target memory. This callback is provided by the device driver for device memory, or by core HMS for regular main memory. The callback can decide to migrate the range to the target memories or do nothing (this can be influenced by flags provided to hbind() too). A sketch of this callback shape follows below.

Later patches can tie page faults to the HMS policy to direct memory allocation to the right target. For now I would rather postpone that discussion until a consensus is reached on how to move forward on all the topics presented in this email. Start small, grow big ;)
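[Editor's note] For readers who want a mental model of the kernel side, the sketch below shows roughly how such a per-target callback could be shaped. All names here are invented for illustration; the real definitions live in include/linux/hms.h and mm/hms.c in the series and may look quite different.

/*
 * Illustrative only: hypothetical shape of an HMS target callback.
 * Names (hms_target_ops, hms_policy, hypothetical_gpu_bind, ...) are
 * made up for this sketch.
 */
struct hms_target;   /* opaque: one registered target memory */
struct mm_struct;    /* the address space the policy applies to */

struct hms_policy {
    unsigned long start;   /* virtual range supplied to hbind() */
    unsigned long end;
    unsigned long flags;   /* modifiers supplied by userspace */
};

struct hms_target_ops {
    /*
     * Called by core HMS when a range is bound to this target.  A
     * device driver (nouveau for GPU memory, for example) may migrate
     * the range to its memory; core HMS would provide the
     * implementation for regular main memory.  Doing nothing and
     * returning 0 is a valid answer, as hbind() can act as a hint.
     */
    int (*bind)(struct hms_target *target, struct mm_struct *mm,
                struct hms_policy *policy);
};

/* A driver implementation might simply queue an HMM-based migration. */
static int hypothetical_gpu_bind(struct hms_target *target,
                                 struct mm_struct *mm,
                                 struct hms_policy *policy)
{
    /* e.g. kick off migrate_vma()/HMM work for [start, end) */
    return 0;
}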
Cheers,
Jérôme Glisse

https://cgit.freedesktop.org/~glisse/linux/log/?h=hms-hbind-v01
git://people.freedesktop.org/~glisse/linux hms-hbind-v01

[1] https://lkml.org/lkml/2018/11/15/331
[2] https://arxiv.org/pdf/1704.08273.pdf
[3] https://csmd.ornl.gov/highlight/sharp-unified-memory-allocator-intent-based-memory-allocator-extreme-scale-systems
[4] https://cfwebprod.sandia.gov/cfdocs/CompResearch/docs/Trott-white-paper.pdf
    http://cacs.usc.edu/education/cs653/Edwards-Kokkos-JPDC14.pdf
[5] https://github.com/LLNL/Umpire
    https://umpire.readthedocs.io/en/develop/
[6] https://www.spinics.net/lists/hotplug/msg06171.html

Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Haggai Eran <haggaie@mellanox.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Philip Yang <Philip.Yang@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Paul Blinzer <Paul.Blinzer@amd.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Vivek Kini <vkini@nvidia.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ben Woodard <woodard@redhat.com>
Cc: linux-acpi@vger.kernel.org

Jérôme Glisse (14):
  mm/hms: heterogeneous memory system (sysfs infrastructure)
  mm/hms: heterogenenous memory system (HMS) documentation
  mm/hms: add target memory to heterogeneous memory system infrastructure
  mm/hms: add initiator to heterogeneous memory system infrastructure
  mm/hms: add link to heterogeneous memory system infrastructure
  mm/hms: add bridge to heterogeneous memory system infrastructure
  mm/hms: register main memory with heterogenenous memory system
  mm/hms: register main CPUs with heterogenenous memory system
  mm/hms: hbind() for heterogeneous memory system (aka mbind() for HMS)
  mm/hbind: add heterogeneous memory policy tracking infrastructure
  mm/hbind: add bind command to heterogeneous memory policy
  mm/hbind: add migrate command to hbind() ioctl
  drm/nouveau: register GPU under heterogeneous memory system
  test/hms: tests for heterogeneous memory system

 Documentation/vm/hms.rst                      | 252 ++++
 drivers/base/Kconfig                          |  14 +
 drivers/base/Makefile                         |   1 +
 drivers/base/cpu.c                            |   5 +
 drivers/base/hms-bridge.c                     | 197 +++++++
 drivers/base/hms-initiator.c                  | 141 +++++
 drivers/base/hms-link.c                       | 183 ++++++
 drivers/base/hms-target.c                     | 193 +++++++
 drivers/base/hms.c                            | 199 +++++++
 drivers/base/init.c                           |   2 +
 drivers/base/node.c                           |  83 ++-
 drivers/gpu/drm/nouveau/Kbuild                |   1 +
 drivers/gpu/drm/nouveau/nouveau_hms.c         |  80 +++
 drivers/gpu/drm/nouveau/nouveau_hms.h         |  46 ++
 drivers/gpu/drm/nouveau/nouveau_svm.c         |   6 +
 include/linux/cpu.h                           |   4 +
 include/linux/hms.h                           | 219 +++++++
 include/linux/mm_types.h                      |   6 +
 include/linux/node.h                          |   6 +
 include/uapi/linux/hbind.h                    |  73 +++
 kernel/fork.c                                 |   3 +
 mm/Makefile                                   |   1 +
 mm/hms.c                                      | 545 ++++++++++++++++++
 tools/testing/hms/Makefile                    |  17 +
 tools/testing/hms/hbind-create-device-file.sh |  11 +
 tools/testing/hms/test-hms-migrate.c          |  77 +++
 tools/testing/hms/test-hms.c                  | 237 ++++++++
 tools/testing/hms/test-hms.h                  |  67 +++
 28 files changed, 2667 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/vm/hms.rst
 create mode 100644 drivers/base/hms-bridge.c
 create mode 100644 drivers/base/hms-initiator.c
 create mode 100644 drivers/base/hms-link.c
 create mode 100644 drivers/base/hms-target.c
 create mode 100644 drivers/base/hms.c
 create mode 100644 drivers/gpu/drm/nouveau/nouveau_hms.c
 create mode 100644 drivers/gpu/drm/nouveau/nouveau_hms.h
 create mode 100644 include/linux/hms.h
 create mode 100644 include/uapi/linux/hbind.h
 create mode 100644 mm/hms.c
 create mode 100644 tools/testing/hms/Makefile
 create mode 100755 tools/testing/hms/hbind-create-device-file.sh
 create mode 100644 tools/testing/hms/test-hms-migrate.c
 create mode 100644 tools/testing/hms/test-hms.c
 create mode 100644 tools/testing/hms/test-hms.h