[RFC,00/14] Heterogeneous Memory System (HMS) and hbind()

Message ID 20181203233509.20671-1-jglisse@redhat.com (mailing list archive)
Series Heterogeneous Memory System (HMS) and hbind()

Message

Jerome Glisse Dec. 3, 2018, 11:34 p.m. UTC
From: Jérôme Glisse <jglisse@redhat.com>

Heterogeneous memory systems are becoming more and more the norm. In
those systems there is not only the main system memory for each node,
but also device memory and/or a memory hierarchy to consider. Device
memory can come from a device like a GPU or FPGA, or from a memory
only device (persistent memory, or a high density memory device).

A memory hierarchy is when you not only have the main memory but also
other types of memory like HBM (High Bandwidth Memory, often stacked
on the CPU die or GPU die), persistent memory or high density memory
(ie something slower than a regular DDR DIMM but much bigger).

On top of this diversity of memories you also have to account for the
system bus topology, ie how all CPUs and devices are connected to each
other. Userspace does not care about the exact physical topology but
does care about the topology from a behavioral point of view, ie what
are all the paths between an initiator (anything that can initiate
memory access like a CPU, GPU, FPGA, network controller, ...) and a
target memory, and what are the properties of each of those paths
(bandwidth, latency, granularity, ...).

This means that it is no longer sufficient to consider a flat view
for each node in a system; for maximum performance we need to
account for all of this new memory and also for the system topology.
This is why this proposal is unlike the HMAT proposal [1], which
tries to extend the existing NUMA model to new types of memory. Here
we are tackling a much more profound change that departs from NUMA.


One of the reasons for such a radical change is that the advance of
accelerators like GPUs or FPGAs means that the CPU is no longer the
only place where computation happens. It is becoming more and more
common for an application to mix and match different accelerators to
perform its computation. So we can no longer satisfy ourselves with
a CPU-centric and flat view of the system like NUMA and NUMA distance.


This patchset is a proposal to tackle this problem through three
aspects:
    1 - Expose complex system topology and the various kinds of memory
        to user space so that applications have a standard way and a
        single place to get all the information they care about.
    2 - A new API for user space to bind, or provide hints to the
        kernel about, which memory to use for a range of virtual
        addresses (a new hbind() syscall).
    3 - Kernel side changes to the vm policy code to handle this

This patchset is not an end to end solution but it provides enough
pieces to be useful against nouveau (the upstream open source driver
for NVidia GPUs). It is intended as a starting point for discussion so
that we can figure out what to do. To avoid having too many topics
to discuss I am not considering memory cgroups for now, but they are
definitely something we will want to integrate with.

The rest of this email is split into 3 sections. The first section
talks about complex system topology: what it is, how it is used today
and how to describe it tomorrow. The second section talks about the
new API to bind, or provide hints to the kernel for, a range of
virtual addresses. The third section talks about the new mechanism to
track bind/hint requests provided by user space or device drivers
inside the kernel.


1) Complex system topology and representing them
------------------------------------------------

Inside a node you can have a complex topology of memory. For instance
you can have multiple HBM memories in a node, each HBM memory tied to
a set of CPUs (all of which are in the same node). This means that you
have a hierarchy of memory for the CPUs: the local fast HBM, which is
expected to be relatively small compared to main memory, and then the
main memory itself. New memory technologies might also deepen this
hierarchy with another level of memory that is slower still but
gigantic in size (some persistent memory technologies might fall into
that category). Another example is device memory, and devices
themselves can have a hierarchy, like HBM on top of the device cores
plus main device memory.

On top of that you can have multiple paths to access each memory and
each path can have different properties (latency, bandwidth, ...).
Also there is not always symmetry, ie some memory might only be
accessible by some devices or CPUs, ie not accessible by everyone.

So a flat hierarchy for each node is not capable of representing this
kind of complexity. To simplify the discussion, and because we do not
want to single out CPUs from devices, from here on out we will use
initiator to refer to either a CPU or a device. An initiator is any
kind of CPU or device that can access memory (ie initiate memory
access).

At this point an example of such a system might help:
    - 2 nodes and for each node:
        - 1 CPU per node with 2 complexes of CPU cores per CPU
        - one HBM memory for each complex of CPU cores (200GB/s)
        - CPU core complexes are linked to each other (100GB/s)
        - main memory (90GB/s)
        - 4 GPUs, each with:
            - HBM memory for each GPU (1000GB/s) (not CPU accessible)
            - GDDR memory for each GPU (500GB/s) (CPU accessible)
            - connected to the CPU root controller (60GB/s)
            - connected to other GPUs (even GPUs from the second
              node) with a GPU link (400GB/s)

In this example we restrict ourselves to bandwidth and ignore bus width
and latency; this is just to simplify the discussion but obviously they
also factor in.


Userspace very much would like to know this information; for
instance HPC folks have developed complex libraries to manage this and
there is wide research on the topic [2] [3] [4] [5]. Today most of
the work is done by hardcoding things for a specific platform, which
is somewhat acceptable for HPC folks where the platform stays the same
for a long period of time. But if we want more ubiquitous support
we should aim to provide the information needed through a standard
kernel API such as the one presented in this patchset.

Roughly speaking I see two broad use cases for topology information.
The first is virtualization, where you want to segment your
hardware properly for each vm (binding memory, CPUs and GPUs that are
all close to each other). The second is applications, many of which
can partition their workload to minimize exchanges between partitions,
allowing each partition to be bound to a subset of devices and CPUs
that are close to each other (for maximum locality). Here it is much
more than just NUMA distance: you can leverage the memory hierarchy
and the system topology all together (see [2] [3] [4] [5] for more
references and details).

So this is not exposing topology just for the sake of cool graphs in
userspace. There are active users of such information today, and if we
want to grow and broaden the usage we should provide a unified API
to standardize how that information is accessed by everyone.


One proposal so far to handle new types of memory is to use CPU-less
nodes for them [6]. While the same idea can apply to device memory, it
is still hard to describe multiple paths with different properties in
such a scheme. While it is backward compatible and requires minimal
changes, it simply can not convey a complex topology (think any kind
of random graph, not just a tree-like graph).

Thus far this kind of system has been used through device specific APIs
and relies on all kinds of system specific quirks. To avoid this
getting out of hand and growing into a bigger mess than it already is,
this patchset tries to provide a common generic API that should fit
various devices (GPU, FPGA, ...).

So this patchset proposes a new way to expose the system topology to
userspace. It relies on 4 types of objects:
    - target: any kind of memory (main memory, HBM, device, ...)
    - initiator: CPU or device (anything that can access memory)
    - link: anything that links initiators and targets
    - bridge: anything that allows a group of initiators to access
      a remote target (ie a target they are not connected to directly
      through a link)

Properties like bandwidth, latency, ... are all set per bridge and per
link. All initiators connected to a link can access any target memory
also connected to the same link, all with the same link properties.

Links do not need to match physical hardware, ie a single physical
link can match one or multiple software-exposed links. This allows
modeling devices connected to the same physical link (PCIE for
instance) but not with the same characteristics (like number of lanes
or lane speed in PCIE). The reverse is also true, ie a single
software-exposed link can match multiple physical links.

Bridges allow initiators to access a remote link. A bridge connects two
links to each other and is also specific to a list of initiators (ie
not all initiators connected to each of the links can use the bridge).
Bridges have their own properties (bandwidth, latency, ...), so the
effective value of each property is the lowest common denominator
between the bridge and each of the links.
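
For instance, a minimal sketch (purely illustrative, not an interface
from this patchset) of how the effective bandwidth of a path that
crosses a bridge would be derived:

    /* Effective property of the path initiator -> link A -> bridge ->
     * link B -> target is the minimum along the path. */
    static unsigned long path_bandwidth(unsigned long link_a_bw,
                                        unsigned long bridge_bw,
                                        unsigned long link_b_bw)
    {
        unsigned long bw = link_a_bw;

        if (bridge_bw < bw)
            bw = bridge_bw;
        if (link_b_bw < bw)
            bw = link_b_bw;
        return bw;
    }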


This model allows describing any kind of directed graph and thus any
kind of topology we might see in the future. It also makes it easier
to add new properties to each object type.

Moreover it can be used to expose devices capable of doing peer to peer
between them. For that, simply have all devices capable of peer to
peer share a common link, or use the bridge object if the peer to
peer capability is only one way for instance.


This patchset uses the above scheme to expose the system topology
through sysfs under /sys/bus/hms/ with:
    - /sys/bus/hms/devices/v%version-%id-target/ : a target memory,
      each has a UID and you find the usual values in that folder
      (node id, size, ...)

    - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator
      (CPU or device), each has an HMS UID but also a CPU id for CPUs
      (which matches the CPU id in /sys/bus/cpu/; for a device you
      have a path that can be the PCIE BUS ID for instance)

    - /sys/bus/hms/devices/v%version-%id-link : a link, each has a
      UID and a file per property (bandwidth, latency, ...); you also
      find a symlink to every target and initiator connected to that
      link.

    - /sys/bus/hms/devices/v%version-%id-bridge : a bridge, each has
      a UID and a file per property (bandwidth, latency, ...); you
      also find a symlink to all initiators that can use that bridge.
To help with forward compatibility each object has a version value and
it is mandatory for user space to only use targets or initiators with
a version supported by that user space. For instance if user space only
knows what version 1 means and sees a target with version 2, then it
must ignore that target as if it did not exist.

Mandating that allows the addition of new properties that break
backward compatibility, ie user space must know how a new property
affects the object to be able to use it safely.
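
As a rough illustration, here is a hedged sketch of how user space
discovery with that version rule might look. The directory layout is
the one above; error handling is trimmed and everything else (like the
version cutoff or what is printed) is an assumption of this sketch,
not part of the patchset:

    #include <dirent.h>
    #include <stdio.h>
    #include <string.h>

    #define HMS_MAX_VERSION 1 /* highest version this program understands */

    /* Walk /sys/bus/hms/devices/ and list the targets we may use. */
    int main(void)
    {
        DIR *dir = opendir("/sys/bus/hms/devices");
        struct dirent *ent;
        unsigned version, uid;
        char type[32];

        if (!dir)
            return 1;
        while ((ent = readdir(dir))) {
            /* entries are named v%version-%id-%type */
            if (sscanf(ent->d_name, "v%u-%u-%31s", &version, &uid, type) != 3)
                continue;
            if (version > HMS_MAX_VERSION)
                continue; /* unknown version: ignore the object entirely */
            if (!strcmp(type, "target"))
                printf("usable target uid %u\n", uid);
        }
        closedir(dir);
        return 0;
    }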

This patchset exposes the main memory of each node under a common
target. For now device drivers are responsible for registering the
memory they want to expose through this scheme, but in the future that
information might come from the system firmware (that is a different
discussion).



2) hbind() bind range of virtual address to heterogeneous memory
----------------------------------------------------------------

With this new topology description the mbind() API is too limited to
express which memory to pick. This is why this patchset introduces a
new API: hbind(), for heterogeneous bind. The hbind() API allows
binding any kind of target memory (using the HMS target uid); this can
be any memory exposed through HMS, ie main memory, HBM, device memory ...

So instead of using a bitmap, hbind() takes an array of uids, where
each uid is a unique memory target inside the new memory topology
description. User space also provides an array of modifiers. This
patchset only defines a few modifiers. A modifier can be seen as the
flags parameter of mbind(), but here we use an array so that user
space can not only supply a modifier but also a value with it. This
should allow the API to grow more features in the future. The kernel
returns -EINVAL if it is provided with an unknown modifier and ignores
the call altogether, forcing user space to restrict itself to the
modifiers supported by the kernel it is running on (I know I am
dreaming about well behaved user space).
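
To make the shape of the call concrete, here is a hedged sketch of
what a caller might look like. In this series hbind() is exposed as an
ioctl on a device file (see include/uapi/linux/hbind.h and the tests);
the structure layout, the ioctl number and every name below are
illustrative assumptions, not the actual uapi:

    #include <stdint.h>
    #include <stddef.h>
    #include <sys/ioctl.h>

    #define HBIND_IOCTL 0 /* placeholder for the real ioctl number */

    /* Made-up layout; the real one lives in include/uapi/linux/hbind.h. */
    struct hbind_args_sketch {
        uint64_t start;         /* start of the virtual address range */
        uint64_t end;           /* end of the virtual address range */
        uint32_t ntargets;      /* number of HMS target uids supplied */
        uint32_t nmodifiers;    /* number of modifier entries supplied */
        uint32_t targets[4];    /* HMS target uids, in preference order */
        uint32_t modifiers[8];  /* modifier ids, each followed by a value */
    };

    /* Ask the kernel to use the given HMS target for [addr, addr + size). */
    static int bind_range_to_target(int hms_fd, void *addr, size_t size,
                                    uint32_t target_uid)
    {
        struct hbind_args_sketch args = {
            .start    = (uintptr_t)addr,
            .end      = (uintptr_t)addr + size,
            .ntargets = 1,
        };

        args.targets[0] = target_uid;
        return ioctl(hms_fd, HBIND_IOCTL, &args);
    }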


Note that none of this excludes automatic memory placement like
autonuma. I also believe that we will see something similar to autonuma
for device memory. This patchset is just there to provide a new API for
processes that wish to have fine control over their memory placement,
because a process should know better than the kernel where to place
things.

This patchset also adds the necessary bits to the nouveau open source
driver for it to expose its memory and to allow a process to bind
ranges to GPU memory. Note that on x86 the GPU memory is not accessible
by the CPU because PCIE does not allow cache coherent access to device
memory. Thus when using PCIE device memory on x86 it is mapped as
swapped out from the CPU point of view, and any CPU access will trigger
a migration back to main memory (this is all part of HMM and nouveau,
not of this patchset).

This is all done under staging so that we can experiment with the user-
space API for a while before committing to anything. Getting this right
is hard and it might not happen on the first try, so instead of having
to support an API forever I would rather have it live behind staging
for people to experiment with; once we feel confident we have something
we can live with, we can then convert it to a syscall.


3) Tracking and applying heterogeneous memory policies
------------------------------------------------------

The current memory policy infrastructure is node oriented; instead of
changing that and risking breakage and regressions, this patchset adds
a new heterogeneous policy tracking infrastructure. The expectation is
that existing applications can keep using mbind() and all the existing
infrastructure undisturbed and unaffected, while new applications
will use the new API and should avoid mixing and matching both (as they
can achieve the same thing with the new API).

Also the policy is not directly tied to the vma structure for a few
reasons:
    - avoid having to split a vma for a policy that does not cover the
      full vma
    - avoid changing too much vma code
    - avoid growing the vma structure with an extra pointer
So instead this patchset uses the mmu_notifier API to track vma
liveness (munmap(), mremap(), ...), as sketched below.
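
A minimal kernel-side sketch of that tracking idea, assuming a per-mm
HMS context (the hms_*_sketch names are made up for illustration and
are not the ones used in the patches):

    #include <linux/mm.h>
    #include <linux/mmu_notifier.h>
    #include <linux/slab.h>

    /* Illustrative per-mm context that would hold the hbind() ranges. */
    struct hms_mm_sketch {
        struct mmu_notifier notifier;
        /* ... interval tree or list of policy ranges goes here ... */
    };

    /* Called when the address space goes away (exit, exec, ...). */
    static void hms_sketch_release(struct mmu_notifier *mn,
                                   struct mm_struct *mm)
    {
        /* container_of(mn, struct hms_mm_sketch, notifier) gives the
         * context; tear down all policy ranges tracked for this mm. */
    }

    static const struct mmu_notifier_ops hms_sketch_ops = {
        .release = hms_sketch_release,
    };

    /* Attach HMS policy tracking to a process address space. */
    static struct hms_mm_sketch *hms_sketch_attach(struct mm_struct *mm)
    {
        struct hms_mm_sketch *hms = kzalloc(sizeof(*hms), GFP_KERNEL);

        if (!hms)
            return NULL;
        hms->notifier.ops = &hms_sketch_ops;
        if (mmu_notifier_register(&hms->notifier, mm)) {
            kfree(hms);
            return NULL;
        }
        return hms;
    }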

This patchset does not tie into process memory allocation either (as
said at the beginning this is not an end to end patchset but a starting
point). It does however demonstrate how migration to device memory can
work under this scheme (using nouveau as a demonstration vehicle).

The overall design is simple: on an hbind() call an hms policy
structure is created for the supplied range and HMS uses the callback
associated with the target memory. This callback is provided by the
device driver for device memory, or by core HMS for regular main
memory. The callback can decide to migrate the range to the target
memory or do nothing (this can be influenced by the flags provided to
hbind() too).
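
A hedged sketch of what such a driver-provided callback could look
like (the struct and function names are assumptions for illustration;
the real interface is defined by the patches, not here):

    #include <linux/mm.h>

    /* Illustrative operations a driver might attach to its HMS target. */
    struct hms_target_ops_sketch {
        /* Called on hbind(); may migrate [start, end) to the target. */
        int (*bind)(void *target_private, struct mm_struct *mm,
                    unsigned long start, unsigned long end,
                    unsigned long flags);
    };

    static int gpu_sketch_bind(void *target_private, struct mm_struct *mm,
                               unsigned long start, unsigned long end,
                               unsigned long flags)
    {
        /*
         * For a device like nouveau this is where an HMM-based migration
         * of the range to GPU memory would be kicked off; doing nothing
         * is also a valid policy depending on the flags.
         */
        return 0;
    }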


Later patches can tie page faults to the HMS policy to direct memory
allocation to the right target. For now I would rather postpone that
discussion until a consensus is reached on how to move forward on all
the topics presented in this email. Start small, grow big ;)

Cheers,
Jérôme Glisse

https://cgit.freedesktop.org/~glisse/linux/log/?h=hms-hbind-v01
git://people.freedesktop.org/~glisse/linux hms-hbind-v01


[1] https://lkml.org/lkml/2018/11/15/331
[2] https://arxiv.org/pdf/1704.08273.pdf
[3] https://csmd.ornl.gov/highlight/sharp-unified-memory-allocator-intent-based-memory-allocator-extreme-scale-systems
[4] https://cfwebprod.sandia.gov/cfdocs/CompResearch/docs/Trott-white-paper.pdf
    http://cacs.usc.edu/education/cs653/Edwards-Kokkos-JPDC14.pdf
[5] https://github.com/LLNL/Umpire
    https://umpire.readthedocs.io/en/develop/
[6] https://www.spinics.net/lists/hotplug/msg06171.html

Cc: Rafael J. Wysocki <rafael@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Haggai Eran <haggaie@mellanox.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Felix Kuehling <felix.kuehling@amd.com>
Cc: Philip Yang <Philip.Yang@amd.com>
Cc: Christian König <christian.koenig@amd.com>
Cc: Paul Blinzer <Paul.Blinzer@amd.com>
Cc: Logan Gunthorpe <logang@deltatee.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Vivek Kini <vkini@nvidia.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Ben Woodard <woodard@redhat.com>
Cc: linux-acpi@vger.kernel.org

Jérôme Glisse (14):
  mm/hms: heterogeneous memory system (sysfs infrastructure)
  mm/hms: heterogenenous memory system (HMS) documentation
  mm/hms: add target memory to heterogeneous memory system
    infrastructure
  mm/hms: add initiator to heterogeneous memory system infrastructure
  mm/hms: add link to heterogeneous memory system infrastructure
  mm/hms: add bridge to heterogeneous memory system infrastructure
  mm/hms: register main memory with heterogenenous memory system
  mm/hms: register main CPUs with heterogenenous memory system
  mm/hms: hbind() for heterogeneous memory system (aka mbind() for HMS)
  mm/hbind: add heterogeneous memory policy tracking infrastructure
  mm/hbind: add bind command to heterogeneous memory policy
  mm/hbind: add migrate command to hbind() ioctl
  drm/nouveau: register GPU under heterogeneous memory system
  test/hms: tests for heterogeneous memory system

 Documentation/vm/hms.rst                      | 252 ++++++++
 drivers/base/Kconfig                          |  14 +
 drivers/base/Makefile                         |   1 +
 drivers/base/cpu.c                            |   5 +
 drivers/base/hms-bridge.c                     | 197 +++++++
 drivers/base/hms-initiator.c                  | 141 +++++
 drivers/base/hms-link.c                       | 183 ++++++
 drivers/base/hms-target.c                     | 193 +++++++
 drivers/base/hms.c                            | 199 +++++++
 drivers/base/init.c                           |   2 +
 drivers/base/node.c                           |  83 ++-
 drivers/gpu/drm/nouveau/Kbuild                |   1 +
 drivers/gpu/drm/nouveau/nouveau_hms.c         |  80 +++
 drivers/gpu/drm/nouveau/nouveau_hms.h         |  46 ++
 drivers/gpu/drm/nouveau/nouveau_svm.c         |   6 +
 include/linux/cpu.h                           |   4 +
 include/linux/hms.h                           | 219 +++++++
 include/linux/mm_types.h                      |   6 +
 include/linux/node.h                          |   6 +
 include/uapi/linux/hbind.h                    |  73 +++
 kernel/fork.c                                 |   3 +
 mm/Makefile                                   |   1 +
 mm/hms.c                                      | 545 ++++++++++++++++++
 tools/testing/hms/Makefile                    |  17 +
 tools/testing/hms/hbind-create-device-file.sh |  11 +
 tools/testing/hms/test-hms-migrate.c          |  77 +++
 tools/testing/hms/test-hms.c                  | 237 ++++++++
 tools/testing/hms/test-hms.h                  |  67 +++
 28 files changed, 2667 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/vm/hms.rst
 create mode 100644 drivers/base/hms-bridge.c
 create mode 100644 drivers/base/hms-initiator.c
 create mode 100644 drivers/base/hms-link.c
 create mode 100644 drivers/base/hms-target.c
 create mode 100644 drivers/base/hms.c
 create mode 100644 drivers/gpu/drm/nouveau/nouveau_hms.c
 create mode 100644 drivers/gpu/drm/nouveau/nouveau_hms.h
 create mode 100644 include/linux/hms.h
 create mode 100644 include/uapi/linux/hbind.h
 create mode 100644 mm/hms.c
 create mode 100644 tools/testing/hms/Makefile
 create mode 100755 tools/testing/hms/hbind-create-device-file.sh
 create mode 100644 tools/testing/hms/test-hms-migrate.c
 create mode 100644 tools/testing/hms/test-hms.c
 create mode 100644 tools/testing/hms/test-hms.h

Comments

Aneesh Kumar K.V Dec. 4, 2018, 7:44 a.m. UTC | #1
On 12/4/18 5:04 AM, jglisse@redhat.com wrote:
> From: Jérôme Glisse <jglisse@redhat.com>
> 
> [...]
> 
> This patchset use the above scheme to expose system topology through
> sysfs under /sys/bus/hms/ with:
>      - /sys/bus/hms/devices/v%version-%id-target/ : a target memory,
>        each has a UID and you can usual value in that folder (node id,
>        size, ...)
> 
>      - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator
>        (CPU or device), each has a HMS UID but also a CPU id for CPU
>        (which match CPU id in (/sys/bus/cpu/). For device you have a
>        path that can be PCIE BUS ID for instance)
> 
>      - /sys/bus/hms/devices/v%version-%id-link : an link, each has a
>        UID and a file per property (bandwidth, latency, ...) you also
>        find a symlink to every target and initiator connected to that
>        link.
> 
>      - /sys/bus/hms/devices/v%version-%id-bridge : a bridge, each has
>        a UID and a file per property (bandwidth, latency, ...) you
>        also find a symlink to all initiators that can use that bridge.

is that version tagging really needed? What changes do you envision with 
versions?

> 
> To help with forward compatibility each object as a version value and
> it is mandatory for user space to only use target or initiator with
> version supported by the user space. For instance if user space only
> knows about what version 1 means and sees a target with version 2 then
> the user space must ignore that target as if it does not exist.
> 
> Mandating that allows the additions of new properties that break back-
> ward compatibility ie user space must know how this new property affect
> the object to be able to use it safely.
> 
> This patchset expose main memory of each node under a common target.
> For now device driver are responsible to register memory they want to
> expose through that scheme but in the future that information might
> come from the system firmware (this is a different discussion).
> 
> [...]
> 
> 3) Tracking and applying heterogeneous memory policies
> ------------------------------------------------------
> 
> Current memory policy infrastructure is node oriented, instead of
> changing that and risking breakage and regression this patchset add a
> new heterogeneous policy tracking infra-structure. The expectation is
> that existing application can keep using mbind() and all existing
> infrastructure under-disturb and unaffected, while new application
> will use the new API and should avoid mix and matching both (as they
> can achieve the same thing with the new API).
> 
> Also the policy is not directly tie to the vma structure for a few
> reasons:
>      - avoid having to split vma for policy that do not cover full vma
>      - avoid changing too much vma code
>      - avoid growing the vma structure with an extra pointer
> So instead this patchset use the mmu_notifier API to track vma liveness
> (munmap(),mremap(),...).
> 
> This patchset is not tie to process memory allocation either (like said
> at the begining this is not and end to end patchset but a starting
> point). It does however demonstrate how migration to device memory can
> work under this scheme (using nouveau as a demonstration vehicle).
> 
> The overall design is simple, on hbind() call a hms policy structure
> is created for the supplied range and hms use the callback associated
> with the target memory. This callback is provided by device driver
> for device memory or by core HMS for regular main memory. The callback
> can decide to migrate the range to the target memories or do nothing
> (this can be influenced by flags provided to hbind() too).
> 
> 
> Latter patches can tie page fault with HMS policy to direct memory
> allocation to the right target. For now i would rather postpone that
> discussion until a consensus is reach on how to move forward on all
> the topics presented in this email. Start smalls, grow big ;)
> 
>

I liked the simplicity of keeping it outside all the existing memory 
management policy code. But that is also the drawback, isn't it?
We now have multiple entities tracking cpu and memory. (This reminded me 
of how we started with memcg in the early days.)

Once we have these different types of targets, ideally the system should
be able to place memory in the ideal location based on the affinity of
the access, ie we should automatically place the memory such that the
initiator can access the target optimally. That is what we try to do
on current systems with autonuma. (You did mention that you are not
looking at how this patch series will evolve towards automatic handling
of placement right now.) But I guess we want to see if the framework
indeed helps in achieving that goal. Will having HMS outside the core
memory handling routines be a road blocker there?

-aneesh
Jerome Glisse Dec. 4, 2018, 2:44 p.m. UTC | #2
On Tue, Dec 04, 2018 at 01:14:14PM +0530, Aneesh Kumar K.V wrote:
> On 12/4/18 5:04 AM, jglisse@redhat.com wrote:
> > From: Jérôme Glisse <jglisse@redhat.com>

[...]

> > This patchset use the above scheme to expose system topology through
> > sysfs under /sys/bus/hms/ with:
> >      - /sys/bus/hms/devices/v%version-%id-target/ : a target memory,
> >        each has a UID and you can usual value in that folder (node id,
> >        size, ...)
> > 
> >      - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator
> >        (CPU or device), each has a HMS UID but also a CPU id for CPU
> >        (which match CPU id in (/sys/bus/cpu/). For device you have a
> >        path that can be PCIE BUS ID for instance)
> > 
> >      - /sys/bus/hms/devices/v%version-%id-link : an link, each has a
> >        UID and a file per property (bandwidth, latency, ...) you also
> >        find a symlink to every target and initiator connected to that
> >        link.
> > 
> >      - /sys/bus/hms/devices/v%version-%id-bridge : a bridge, each has
> >        a UID and a file per property (bandwidth, latency, ...) you
> >        also find a symlink to all initiators that can use that bridge.
> 
> is that version tagging really needed? What changes do you envision with
> versions?

I kind of dislike it myself but this is really to keep userspace from
inadvertently using some kind of memory/initiator/link/bridge that it
should not be using if it does not understand the implications.

If it was a file inside the directory there is a big chance that user-
space would overlook it. So an old program on a new platform with a new
kind of weird memory, like non-coherent memory, might start using it and
get all kinds of weird results. If the version is in the directory name
it kind of forces userspace to only look at the memory/initiator/link/
bridge objects it understands and can use safely.

So I am doing this in the hope that it will protect applications when
new types of things pop up. We have too many examples where we can not
evolve something because existing applications have baked in assumptions
about it.


[...]

> > 3) Tracking and applying heterogeneous memory policies
> > ------------------------------------------------------
> > 
> > Current memory policy infrastructure is node oriented, instead of
> > changing that and risking breakage and regression this patchset add a
> > new heterogeneous policy tracking infra-structure. The expectation is
> > that existing application can keep using mbind() and all existing
> > infrastructure under-disturb and unaffected, while new application
> > will use the new API and should avoid mix and matching both (as they
> > can achieve the same thing with the new API).
> > 
> > Also the policy is not directly tie to the vma structure for a few
> > reasons:
> >      - avoid having to split vma for policy that do not cover full vma
> >      - avoid changing too much vma code
> >      - avoid growing the vma structure with an extra pointer
> > So instead this patchset use the mmu_notifier API to track vma liveness
> > (munmap(),mremap(),...).
> > 
> > This patchset is not tie to process memory allocation either (like said
> > at the begining this is not and end to end patchset but a starting
> > point). It does however demonstrate how migration to device memory can
> > work under this scheme (using nouveau as a demonstration vehicle).
> > 
> > The overall design is simple, on hbind() call a hms policy structure
> > is created for the supplied range and hms use the callback associated
> > with the target memory. This callback is provided by device driver
> > for device memory or by core HMS for regular main memory. The callback
> > can decide to migrate the range to the target memories or do nothing
> > (this can be influenced by flags provided to hbind() too).
> > 
> > 
> > Latter patches can tie page fault with HMS policy to direct memory
> > allocation to the right target. For now i would rather postpone that
> > discussion until a consensus is reach on how to move forward on all
> > the topics presented in this email. Start smalls, grow big ;)
> > 
> > 
> 
> I liked the simplicity of keeping it outside all the existing memory
> management policy code. But that that is also the drawback isn't it?
> We now have multiple entities tracking cpu and memory. (This reminded me of
> how we started with memcg in the early days).

This is a hard choice; the rationale is that an application either uses
this new API or it uses the old one. So the expectation is that both
should not co-exist in a process. Eventually both can be consolidated
into one inside the kernel while maintaining the different userspace
APIs. But I feel that it is better to get to that point slowly while we
experiment with the new API. I feel that we need to gain some experience
with the new API on real workloads to convince ourselves that it is
something we can live with. If we reach that point then we can work on
consolidating the kernel code into one. In the meantime this experiment
does not disrupt or regress the existing API. I took the cautious road.


> Once we have these different types of targets, ideally the system should
> be able to place them in the ideal location based on the affinity of the
> access. ie. we should automatically place the memory such that
> initiator can access the target optimally. That is what we try to do with
> current system with autonuma. (You did mention that you are not looking at
> how this patch series will evolve to automatic handling of placement right
> now.) But i guess we want to see if the framework indeed help in achieving
> that goal. Having HMS outside the core memory
> handling routines will be a road blocker there?

So evolving autonuma is going to be a thing on its own. The issue is
that autonuma revolves around CPU ids and uses a handful of bits to try
to catch CPU access patterns. With devices in the mix it is much harder:
first, using the page fault trick of autonuma might not be the best
idea; second, we can get a lot of information from the IOMMU, the bridge
chipset or the device itself on what is accessed by whom.

So my belief on that front is that it is going to be something different,
like tracking ranges of virtual addresses and maintaining a data
structure per range (not per page).

All of this is done in core mm code; I am just keeping it out of the vma
struct and other structs to avoid growing them and wasting space when
this is not in use. So it is very much inside the core handling
routines, it is just optional.

In any case I believe that explicit placement (where the application
hbind()s things) will be the first main use case. Once we have that
figured out (or at least once we believe we have it figured out :))
then we can look into auto-heterogeneous placement.

Cheers,
Jérôme
Dave Hansen Dec. 4, 2018, 6:02 p.m. UTC | #3
On 12/3/18 3:34 PM, jglisse@redhat.com wrote:
> This means that it is no longer sufficient to consider a flat view
> for each node in a system but for maximum performance we need to
> account for all of this new memory but also for system topology.
> This is why this proposal is unlike the HMAT proposal [1] which
> tries to extend the existing NUMA for new type of memory. Here we
> are tackling a much more profound change that depart from NUMA.

The HMAT and its implications exist, in firmware, whether or not we do
*anything* in Linux to support it or not.  Any system with an HMAT
inherently reflects the new topology, via proximity domains, whether or
not we parse the HMAT table in Linux or not.

Basically, *ACPI* has decided to extend NUMA.  Linux can either fight
that or embrace it.  Keith's HMAT patches are embracing it.  These
patches are appearing to fight it.  Agree?  Disagree?

Also, could you add a simple example program for how someone might use
this?  I got lost in all the new sysfs and ioctl gunk.  Can you
characterize how this would work with the *existing* NUMA interfaces
that we have?
Jerome Glisse Dec. 4, 2018, 6:49 p.m. UTC | #4
On Tue, Dec 04, 2018 at 10:02:55AM -0800, Dave Hansen wrote:
> On 12/3/18 3:34 PM, jglisse@redhat.com wrote:
> > This means that it is no longer sufficient to consider a flat view
> > for each node in a system but for maximum performance we need to
> > account for all of this new memory but also for system topology.
> > This is why this proposal is unlike the HMAT proposal [1] which
> > tries to extend the existing NUMA for new type of memory. Here we
> > are tackling a much more profound change that depart from NUMA.
> 
> The HMAT and its implications exist, in firmware, whether or not we do
> *anything* in Linux to support it or not.  Any system with an HMAT
> inherently reflects the new topology, via proximity domains, whether or
> not we parse the HMAT table in Linux or not.
> 
> Basically, *ACPI* has decided to extend NUMA.  Linux can either fight
> that or embrace it.  Keith's HMAT patches are embracing it.  These
> patches are appearing to fight it.  Agree?  Disagree?

Disagree; sorry if it felt that way, that was not my intention. The
ACPI HMAT information can be used to populate the HMS file system
representation. My intention is not to fight Keith's HMAT patches;
they are useful on their own. But I do not see how to evolve NUMA
to support device memory, so while Keith is taking a step in the
direction I want, I do not see how to cross over to the place I need
to be. More on that below.

> 
> Also, could you add a simple, example program for how someone might use
> this?  I got lost in all the new sysfs and ioctl gunk.  Can you
> characterize how this would work with the *exiting* NUMA interfaces that
> we have?

That is the issue: I can not expose device memory as a NUMA node as
device memory is not cache coherent on AMD and Intel platforms today.

Moreover in some cases that memory is not visible at all to the CPU,
which is not something you can express with the current NUMA nodes.
Here is an abbreviated list of features I need to support:
    - device private memory (not accessible by the CPU or anybody else)
    - non-coherent memory (PCIE is not cache coherent for CPU access)
    - multiple paths to access the same memory, either:
        - multiple _different_ physical addresses aliasing to the same
          memory
        - device blocks can select which path they take to access some
          memory (it is not inside the page table but in how you program
          the device block)
    - complex topology that is not a tree, where device links can have
      better characteristics than the CPU inter-connect between the
      nodes. There are existing users today that use topology information
      to partition their workload (HPC folks who have a fixed platform).
    - device memory needs to stay under device driver control as some
      existing APIs (OpenGL, Vulkan) have a different memory model and if
      we want the device to be usable for those too then we need to keep
      the device driver in control of the device memory allocation


There is an example userspace program with the last patch in the series.
But here is a high level overview of how one application looks today:

    1) Application gets some dataset from some source (disk, network,
       sensors, ...)
    2) Application allocates memory on device A and copies over the
       dataset
    3) Application runs some CPU code to format the copy of the dataset
       inside device A memory (rebuild pointers inside the dataset,
       this can represent millions and millions of operations)
    4) Application runs code on device A that uses the dataset
    5) Application allocates memory on device B and copies over the
       result from device A
    6) Application runs some CPU code to format the copy of the dataset
       inside device B (rebuild pointers inside the dataset,
       this can represent millions and millions of operations)
    7) Application runs code on device B that uses the dataset
    8) Application copies the result over from device B and keeps on
       doing its thing

How it looks with HMS:
    1) Application gets some dataset from some source (disk, network,
       sensors, ...)
    2-3) Application calls HMS to migrate the range to device A memory
    4) Application runs code on device A that uses the dataset
    5-6) Application calls HMS to migrate the range to device B memory
    7) Application runs code on device B that uses the dataset
    8) Application calls HMS to migrate the result back to main memory

So we now avoid explicit copies and having to rebuild data structures
inside each device address space.
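
Assuming a hypothetical bind_range_to_target() helper wrapping the
hbind ioctl (the uids, helpers and the HMS device fd are all made up
for illustration), the HMS flow above roughly looks like:

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    /* Hypothetical wrapper around the hbind ioctl. */
    extern int bind_range_to_target(int hms_fd, void *addr, size_t size,
                                    uint32_t target_uid);

    /* Hypothetical HMS target uids discovered from /sys/bus/hms/. */
    enum { MAIN_MEMORY_UID = 0, DEVICE_A_UID = 1, DEVICE_B_UID = 2 };

    static void pipeline_sketch(int hms_fd, size_t size)
    {
        void *dataset = mmap(NULL, size, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (dataset == MAP_FAILED)
            return;
        /* step 1: fill dataset from disk/network/sensors ... */

        bind_range_to_target(hms_fd, dataset, size, DEVICE_A_UID); /* 2-3 */
        /* step 4: launch the device A job on the same pointers ... */

        bind_range_to_target(hms_fd, dataset, size, DEVICE_B_UID); /* 5-6 */
        /* step 7: launch the device B job ... */

        bind_range_to_target(hms_fd, dataset, size, MAIN_MEMORY_UID); /* 8 */
        /* the CPU consumes the result in place */

        munmap(dataset, size);
    }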


The above example is for migration. Here is an example of how the
topology is used today:

    The application knows that the platform it is running on has 16
    GPUs split into 2 groups of 8 GPUs each. GPUs in each group can
    access each other's memory through dedicated mesh links between
    each other. Full speed, no traffic bottleneck.

    The application splits its GPU computation in 2 so that each
    partition runs on a group of interconnected GPUs, allowing
    them to share the dataset.

With HMS:
    The application can query the kernel to discover the topology of
    the system it is running on and use it to partition and balance
    its workload accordingly. The same application should now be able
    to run on a new platform without having to be adapted to it.

This is kind of naive and I expect topology to be hard to use, but
maybe it is just me being pessimistic. In any case today we have a
chicken and egg problem. We do not have a standard way to expose
topology, so programs that can leverage topology are only written for
HPC where the platform stays the same for a few years. If we had a
standard way to expose the topology then maybe we would see more
programs using it. At the very least we could convert the existing
users.


Policy is the same kind of story; this email is long enough now :) But
I can write one down if you want.


Cheers,
Jérôme
Dave Hansen Dec. 4, 2018, 6:54 p.m. UTC | #5
On 12/4/18 10:49 AM, Jerome Glisse wrote:
> Policy is same kind of story, this email is long enough now :) But
> i can write one down if you want.

Yes, please.  I'd love to see the code.

We'll do the same on the "HMAT" side and we can compare notes.
Jerome Glisse Dec. 4, 2018, 7:11 p.m. UTC | #6
On Tue, Dec 04, 2018 at 10:54:10AM -0800, Dave Hansen wrote:
> On 12/4/18 10:49 AM, Jerome Glisse wrote:
> > Policy is same kind of story, this email is long enough now :) But
> > i can write one down if you want.
> 
> Yes, please.  I'd love to see the code.
> 
> We'll do the same on the "HMAT" side and we can compare notes.

Example use cases? Example uses are:
    An application creates a range of virtual addresses with mmap() for
    the input dataset. The application knows it will use the GPU on it
    directly so it calls hbind() to set a policy for the range to use
    GPU memory for any new allocation in the range (see the sketch
    below).

    The application directly streams the dataset to GPU memory through
    the virtual address range thanks to the policy.


    An application creates a range of virtual addresses with mmap() to
    store the output result of the GPU jobs it is about to launch. It
    binds the range of virtual addresses to GPU memory so that
    allocations use GPU memory for the range.


    An application can also use policy binding as a slow migration path,
    ie set a policy to a new target memory so that new allocations are
    directed to this new target.
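
A hedged sketch of the first use case (the hbind_set_policy() helper,
the uid and the policy-only behaviour are assumptions made purely for
illustration, not the real interface):

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Hypothetical helper: set an allocation policy (no migration) so
     * that new pages in the range come from the given HMS target. */
    extern int hbind_set_policy(int hms_fd, void *addr, size_t size,
                                uint32_t target_uid);

    static void *stream_dataset_to_gpu(int hms_fd, int dataset_fd,
                                       size_t size, uint32_t gpu_uid)
    {
        void *range = mmap(NULL, size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (range == MAP_FAILED)
            return NULL;
        /* No pages are allocated yet; the policy covers future faults. */
        hbind_set_policy(hms_fd, range, size, gpu_uid);
        /* First touch via read() now allocates directly in GPU memory. */
        if (read(dataset_fd, range, size) < 0) {
            munmap(range, size);
            return NULL;
        }
        return range;
    }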

Or do you want an example userspace program like the one in the last
patch of this series?

Cheers,
Jérôme
Dave Hansen Dec. 4, 2018, 9:37 p.m. UTC | #7
On 12/4/18 10:49 AM, Jerome Glisse wrote:
>> Also, could you add a simple, example program for how someone might use
>> this?  I got lost in all the new sysfs and ioctl gunk.  Can you
>> characterize how this would work with the *exiting* NUMA interfaces that
>> we have?
> That is the issue i can not expose device memory as NUMA node as
> device memory is not cache coherent on AMD and Intel platform today.
> 
> More over in some case that memory is not visible at all by the CPU
> which is not something you can express in the current NUMA node.

Yeah, our NUMA mechanisms are for managing memory that the kernel itself
manages in the "normal" allocator and supports a full feature set on.
That has a bunch of implications, like that the memory is cache coherent
and accessible from everywhere.

The HMAT patches only comprehend this "normal" memory, which is why
we're extending the existing /sys/devices/system/node infrastructure.

This series has a much more aggressive goal, which is comprehending the
connections of every memory-target to every memory-initiator, no matter
who is managing the memory, who can access it, or what it can be used for.

Theoretically, HMS could be used for everything that we're doing with
/sys/devices/system/node, as long as it's tied back into the existing
NUMA infrastructure _somehow_.

Right?
Jerome Glisse Dec. 4, 2018, 9:57 p.m. UTC | #8
On Tue, Dec 04, 2018 at 01:37:56PM -0800, Dave Hansen wrote:
> On 12/4/18 10:49 AM, Jerome Glisse wrote:
> >> Also, could you add a simple, example program for how someone might use
> >> this?  I got lost in all the new sysfs and ioctl gunk.  Can you
> >> characterize how this would work with the *exiting* NUMA interfaces that
> >> we have?
> > That is the issue i can not expose device memory as NUMA node as
> > device memory is not cache coherent on AMD and Intel platform today.
> > 
> > More over in some case that memory is not visible at all by the CPU
> > which is not something you can express in the current NUMA node.
> 
> Yeah, our NUMA mechanisms are for managing memory that the kernel itself
> manages in the "normal" allocator and supports a full feature set on.
> That has a bunch of implications, like that the memory is cache coherent
> and accessible from everywhere.
> 
> The HMAT patches only comprehend this "normal" memory, which is why
> we're extending the existing /sys/devices/system/node infrastructure.
> 
> This series has a much more aggressive goal, which is comprehending the
> connections of every memory-target to every memory-initiator, no matter
> who is managing the memory, who can access it, or what it can be used for.
> 
> Theoretically, HMS could be used for everything that we're doing with
> /sys/devices/system/node, as long as it's tied back into the existing
> NUMA infrastructure _somehow_.
> 
> Right?

Fully correct. Mind if I steal that perfect summary description next
time I post? I am so bad at explaining things :)

The intention is to allow programs to do everything they do with
mbind() today, and tomorrow with the HMAT patchset, and on top of that
to also be able to do what they do today through APIs like OpenCL,
ROCm, CUDA ... So it is one kernel API to rule them all ;)

Also at first I intend to special case vma page allocation when there
is an HMS policy; long term I would like to merge the code paths inside
the kernel. But I do not want to disrupt the existing code paths today,
I would rather grow to that organically. Step by step. mbind() would
still work unaffected in the end, just the plumbing would be slightly
different.

Cheers,
Jérôme
Dave Hansen Dec. 4, 2018, 11:54 p.m. UTC | #9
On 12/3/18 3:34 PM, jglisse@redhat.com wrote:
> This patchset use the above scheme to expose system topology through
> sysfs under /sys/bus/hms/ with:
>     - /sys/bus/hms/devices/v%version-%id-target/ : a target memory,
>       each has a UID and you can usual value in that folder (node id,
>       size, ...)
> 
>     - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator
>       (CPU or device), each has a HMS UID but also a CPU id for CPU
>       (which match CPU id in (/sys/bus/cpu/). For device you have a
>       path that can be PCIE BUS ID for instance)
> 
>     - /sys/bus/hms/devices/v%version-%id-link : an link, each has a
>       UID and a file per property (bandwidth, latency, ...) you also
>       find a symlink to every target and initiator connected to that
>       link.
> 
>     - /sys/bus/hms/devices/v%version-%id-bridge : a bridge, each has
>       a UID and a file per property (bandwidth, latency, ...) you
>       also find a symlink to all initiators that can use that bridge.

We support 1024 NUMA nodes on x86.  The ACPI HMAT expresses the
connections between each node.  Let's suppose that each node has some
CPUs and some memory.

That means we'll have 1024 target directories in sysfs, 1024 initiator
directories in sysfs, and 1024*1024 link directories.  Or, would the
kernel be responsible for "compiling" the firmware-provided information
down into a more manageable number of links?

Some idiot made the mistake of having one sysfs directory per 128MB of
memory way back when, and now we have hundreds of thousands of
/sys/devices/system/memory/memoryX directories.  That sucks to manage.
Isn't this potentially repeating that mistake?

Basically, is sysfs the right place to even expose this much data?
Dave Hansen Dec. 4, 2018, 11:58 p.m. UTC | #10
On 12/4/18 1:57 PM, Jerome Glisse wrote:
> Fully correct mind if i steal that perfect summary description next time
> i post ? I am so bad at explaining thing :)

Go for it!

> Intention is to allow program to do everything they do with mbind() today
> and tomorrow with the HMAT patchset and on top of that to also be able to
> do what they do today through API like OpenCL, ROCm, CUDA ... So it is one
> kernel API to rule them all ;)

While I appreciate the exhaustive scope of such a project, I'm really
worried that if we decided to use this for our "HMAT" use cases, we'll
be bottlenecked behind this project while *it* goes through 25 revisions
over 4 or 5 years like HMM did.

So, should we just "park" the enhancements to the existing NUMA
interfaces and infrastructure (think /sys/devices/system/node) and wait
for this to go in?  Do we try to develop them in parallel and make them
consistent?  Or, do we just ignore each other and make Andrew sort it
out in a few years? :)
Jerome Glisse Dec. 5, 2018, 12:15 a.m. UTC | #11
On Tue, Dec 04, 2018 at 03:54:22PM -0800, Dave Hansen wrote:
> On 12/3/18 3:34 PM, jglisse@redhat.com wrote:
> > This patchset use the above scheme to expose system topology through
> > sysfs under /sys/bus/hms/ with:
> >     - /sys/bus/hms/devices/v%version-%id-target/ : a target memory,
> >       each has a UID and you can usual value in that folder (node id,
> >       size, ...)
> > 
> >     - /sys/bus/hms/devices/v%version-%id-initiator/ : an initiator
> >       (CPU or device), each has a HMS UID but also a CPU id for CPU
> >       (which match CPU id in (/sys/bus/cpu/). For device you have a
> >       path that can be PCIE BUS ID for instance)
> > 
> >     - /sys/bus/hms/devices/v%version-%id-link : an link, each has a
> >       UID and a file per property (bandwidth, latency, ...) you also
> >       find a symlink to every target and initiator connected to that
> >       link.
> > 
> >     - /sys/bus/hms/devices/v%version-%id-bridge : a bridge, each has
> >       a UID and a file per property (bandwidth, latency, ...) you
> >       also find a symlink to all initiators that can use that bridge.
> 
> We support 1024 NUMA nodes on x86.  The ACPI HMAT expresses the
> connections between each node.  Let's suppose that each node has some
> CPUs and some memory.
> 
> That means we'll have 1024 target directories in sysfs, 1024 initiator
> directories in sysfs, and 1024*1024 link directories.  Or, would the
> kernel be responsible for "compiling" the firmware-provided information
> down into a more manageable number of links?
> 
> Some idiot made the mistake of having one sysfs directory per 128MB of
> memory way back when, and now we have hundreds of thousands of
> /sys/devices/system/memory/memoryX directories.  That sucks to manage.
> Isn't this potentially repeating that mistake?
> 
> Basically, is sysfs the right place to even expose this much data?

I definitely want to avoid the memoryX mistake. So I do not want to
see one link directory per device. Taking my simple laptop as an
example, with 4 CPUs, a wifi adapter and 2 GPUs (the integrated one
and a discrete one):

link0: cpu0 cpu1 cpu2 cpu3
link1: wifi (2 PCIe lanes)
link2: gpu0 (unknown number of lanes but I believe it has higher
             bandwidth to main memory)
link3: gpu1 (16 PCIe lanes)
link4: gpu1 and gpu memory

So there is one link directory per number of PCIe lanes your devices
have, so that you can differentiate on bandwidth. The main memory is
symlinked inside all the link directories except link4. The discrete
GPU memory is only in the link4 directory as it is only accessible
by the GPU (we could add it under link3 too, with the non-cache-coherent
property attached to it).


The issue then becomes how to convert down the overly verbose HMAT
information to populate some reasonable layout for HMS. For that
I would say create a link directory for each distinct matrix cell.
As an example, let's say that each entry in the matrix has bandwidth
and latency; then we create a link directory for each combination of
bandwidth and latency. On a simple system that should boil down to a
handful of combinations, roughly speaking mirroring the example above
of one link directory per number of PCIe lanes.

I don't think I have a system with an HMAT table; if you have an HMAT
table to provide I could show the end result.

Note I believe the ACPI HMAT matrix is a bad design for those reasons,
ie there is a lot of commonality in many of the matrix entries and many
entries also do not make sense (ie an initiator not being able to access
all the targets). I feel that link/bridge is much more compact and allows
representing any directed graph with multiple arrows between the same
pair of nodes.

Cheers,
Jérôme
Jerome Glisse Dec. 5, 2018, 12:29 a.m. UTC | #12
On Tue, Dec 04, 2018 at 03:58:23PM -0800, Dave Hansen wrote:
> On 12/4/18 1:57 PM, Jerome Glisse wrote:
> > Fully correct mind if i steal that perfect summary description next time
> > i post ? I am so bad at explaining thing :)
> 
> Go for it!
> 
> > Intention is to allow program to do everything they do with mbind() today
> > and tomorrow with the HMAT patchset and on top of that to also be able to
> > do what they do today through API like OpenCL, ROCm, CUDA ... So it is one
> > kernel API to rule them all ;)
> 
> While I appreciate the exhaustive scope of such a project, I'm really
> worried that if we decided to use this for our "HMAT" use cases, we'll
> be bottlenecked behind this project while *it* goes through 25 revisions
> over 4 or 5 years like HMM did.
> 
> So, should we just "park" the enhancements to the existing NUMA
> interfaces and infrastructure (think /sys/devices/system/node) and wait
> for this to go in?  Do we try to develop them in parallel and make them
> consistent?  Or, do we just ignore each other and make Andrew sort it
> out in a few years? :)

Let's have a battle with giant foam q-tips at the next LSF/MM and see who
wins ;)

More seriously, I think you should go ahead with Keith's HMAT patchset and
make progress there. In the HMAT case you can grow and evolve the NUMA node
infrastructure to address your needs, and I believe you are doing it in
a sensible way. But I do not see a path for what I am trying to achieve
in that framework. If anyone has any good idea I would welcome it.

In the meantime I hope I can make progress with my proposal here under
staging. Once I get enough stuff working in userspace and convince guinea
pigs (I need to find a better name for those poor people I will coerce
into testing this ;)) then I can have some hard evidence of which things
in my proposal are useful on some concrete cases with an open source stack
from top to bottom. That might mean stripping down what I am proposing
today to what turns out to be useful. Then start a discussion about merging
the underlying kernel code into one (while preserving all existing APIs)
and getting out of staging with real syscalls we will have to die with.

I know that at the very least the hbind() and hpolicy() syscalls would
be successful, as the HPC folks have been dreaming of this. The
topology thing is harder to know; there are some users today but I can
not say how much more interest it can spark outside of the very small
community that is HPC.

Cheers,
Jérôme
Dave Hansen Dec. 5, 2018, 1:06 a.m. UTC | #13
On 12/4/18 4:15 PM, Jerome Glisse wrote:
> On Tue, Dec 04, 2018 at 03:54:22PM -0800, Dave Hansen wrote:
>> Basically, is sysfs the right place to even expose this much data?
> 
> I definitly want to avoid the memoryX mistake. So i do not want to
> see one link directory per device. Taking my simple laptop as an
> example with 4 CPUs, a wifi and 2 GPU (the integrated one and a
> discret one):
> 
> link0: cpu0 cpu1 cpu2 cpu3
> link1: wifi (2 pcie lane)
> link2: gpu0 (unknown number of lane but i believe it has higher
>              bandwidth to main memory)
> link3: gpu1 (16 pcie lane)
> link4: gpu1 and gpu memory
> 
> So one link directory per number of pcie lane your device have
> so that you can differentiate on bandwidth. The main memory is
> symlinked inside all the link directory except link4. The GPU
> discret memory is only in link4 directory as it is only
> accessible by the GPU (we could add it under link3 too with the
> non cache coherent property attach to it).

I'm actually really interested in how this proposal scales.  It's quite
easy to represent a laptop, but can this scale to the largest systems
that we expect to encounter over the next 20 years that this ABI will live?

> The issue then becomes how to convert down the HMAT over verbose
> information to populate some reasonable layout for HMS. For that
> i would say that create a link directory for each different
> matrix cell. As an example let say that each entry in the matrix
> has bandwidth and latency then we create a link directory for
> each combination of bandwidth and latency. On simple system that
> should boils down to a handfull of combination roughly speaking
> mirroring the example above of one link directory per number of
> PCIE lane for instance.

OK, but there are 1024*1024 matrix cells on a systems with 1024
proximity domains (ACPI term for NUMA node).  So it sounds like you are
proposing a million-directory approach.

We also can't simply say that two CPUs with the same connection to two
other CPUs (think a 4-socket QPI-connected system) share the same "link"
because they share the same combination of bandwidth and latency.  We
need to know that *each* has its own, unique link and does not share link
resources.

> I don't think i have a system with an HMAT table if you have one
> HMAT table to provide i could show up the end result.

It is new enough (ACPI 6.2) that no publicly-available hardware exists
that implements one (that I know of).  Keith Busch can probably extract
one and send it to you or show you how we're faking them with QEMU.

> Note i believe the ACPI HMAT matrix is a bad design for that
> reasons ie there is lot of commonality in many of the matrix
> entry and many entry also do not make sense (ie initiator not
> being able to access all the targets). I feel that link/bridge
> is much more compact and allow to represent any directed graph
> with multiple arrows from one node to another same node.

I don't disagree.  But, folks are building systems with them and we need
to either deal with it, or make its data manageable.  You saw our
approach: we cull the data and only expose the bare minimum in sysfs.
Felix Kuehling Dec. 5, 2018, 1:22 a.m. UTC | #14
On 2018-12-04 4:57 p.m., Jerome Glisse wrote:
> On Tue, Dec 04, 2018 at 01:37:56PM -0800, Dave Hansen wrote:
>> Yeah, our NUMA mechanisms are for managing memory that the kernel itself
>> manages in the "normal" allocator and supports a full feature set on.
>> That has a bunch of implications, like that the memory is cache coherent
>> and accessible from everywhere.
>>
>> The HMAT patches only comprehend this "normal" memory, which is why
>> we're extending the existing /sys/devices/system/node infrastructure.
>>
>> This series has a much more aggressive goal, which is comprehending the
>> connections of every memory-target to every memory-initiator, no matter
>> who is managing the memory, who can access it, or what it can be used for.
>>
>> Theoretically, HMS could be used for everything that we're doing with
>> /sys/devices/system/node, as long as it's tied back into the existing
>> NUMA infrastructure _somehow_.
>>
>> Right?
> Fully correct mind if i steal that perfect summary description next time
> i post ? I am so bad at explaining thing :)
>
> Intention is to allow program to do everything they do with mbind() today
> and tomorrow with the HMAT patchset and on top of that to also be able to
> do what they do today through API like OpenCL, ROCm, CUDA ... So it is one
> kernel API to rule them all ;)

As for ROCm, I'm looking forward to using hbind in our own APIs. It will
save us some time and trouble not having to implement all the low-level
policy and tracking of virtual address ranges in our device driver.
Going forward, having a common API to manage the topology and memory
affinity would also enable sane ways of having accelerators and memory
devices from different vendors interact under control of a
topology-aware application.

Disclaimer: I haven't had a chance to review the patches in detail yet.
Got caught up in the documentation and discussion ...

Regards,
  Felix


>
> Also at first i intend to special case vma alloc page when they are HMS
> policy, long term i would like to merge code path inside the kernel. But
> i do not want to disrupt existing code path today, i rather grow to that
> organicaly. Step by step. The mbind() would still work un-affected in
> the end just the plumbing would be slightly different.
>
> Cheers,
> Jérôme
Jerome Glisse Dec. 5, 2018, 2:13 a.m. UTC | #15
On Tue, Dec 04, 2018 at 05:06:49PM -0800, Dave Hansen wrote:
> On 12/4/18 4:15 PM, Jerome Glisse wrote:
> > On Tue, Dec 04, 2018 at 03:54:22PM -0800, Dave Hansen wrote:
> >> Basically, is sysfs the right place to even expose this much data?
> > 
> > I definitly want to avoid the memoryX mistake. So i do not want to
> > see one link directory per device. Taking my simple laptop as an
> > example with 4 CPUs, a wifi and 2 GPU (the integrated one and a
> > discret one):
> > 
> > link0: cpu0 cpu1 cpu2 cpu3
> > link1: wifi (2 pcie lane)
> > link2: gpu0 (unknown number of lane but i believe it has higher
> >              bandwidth to main memory)
> > link3: gpu1 (16 pcie lane)
> > link4: gpu1 and gpu memory
> > 
> > So one link directory per number of pcie lane your device have
> > so that you can differentiate on bandwidth. The main memory is
> > symlinked inside all the link directory except link4. The GPU
> > discret memory is only in link4 directory as it is only
> > accessible by the GPU (we could add it under link3 too with the
> > non cache coherent property attach to it).
> 
> I'm actually really interested in how this proposal scales.  It's quite
> easy to represent a laptop, but can this scale to the largest systems
> that we expect to encounter over the next 20 years that this ABI will live?
> 
> > The issue then becomes how to convert down the HMAT over verbose
> > information to populate some reasonable layout for HMS. For that
> > i would say that create a link directory for each different
> > matrix cell. As an example let say that each entry in the matrix
> > has bandwidth and latency then we create a link directory for
> > each combination of bandwidth and latency. On simple system that
> > should boils down to a handfull of combination roughly speaking
> > mirroring the example above of one link directory per number of
> > PCIE lane for instance.
> 
> OK, but there are 1024*1024 matrix cells on a systems with 1024
> proximity domains (ACPI term for NUMA node).  So it sounds like you are
> proposing a million-directory approach.

No, pseudo code:
    struct list links;

    for (unsigned r = 0; r < nrows; r++) {
        for (unsigned c = 0; c < ncolumns; c++) {
            if (!link_find(links, hmat[r][c].bandwidth,
                           hmat[r][c].latency)) {
                link = link_new(hmat[r][c].bandwidth,
                                hmat[r][c].latency);
                // add initiator and target correspond to that row
                // and columns to this new link
                list_add(&link, links);
            }
        }
    }

So all cells that have the same properties are under the same link. Do you
expect all the cells to always have different properties? On today's
platforms that should not be the case. I do expect we will keep seeing
many initiator/target pairs that share the same properties as other pairs.

But yes, if you have a system where no initiator/target pair has the
same properties, then you are in the worst case you are describing. But
hey, that is the hardware you have then :)

Note that userspace can parse all this once during its initialization
and create pools of targets to use.
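
For illustration, here is a small self-contained userspace version of the
culling idea above; the hmat[][] values are invented and link_find()/
link_new() are toy stand-ins for the helpers named in the pseudocode, so
treat it as a sketch, not actual kernel code:

/*
 * Toy illustration: every matrix cell with the same (bandwidth, latency)
 * pair collapses into a single link.
 */
#include <stdio.h>
#include <stdlib.h>

struct cell { unsigned bandwidth, latency; };

struct link {
    unsigned bandwidth, latency;
    struct link *next;
};

static struct link *link_find(struct link *l, unsigned bw, unsigned lat)
{
    for (; l; l = l->next)
        if (l->bandwidth == bw && l->latency == lat)
            return l;
    return NULL;
}

static void link_new(struct link **links, unsigned bw, unsigned lat)
{
    struct link *l = malloc(sizeof(*l));

    l->bandwidth = bw;
    l->latency = lat;
    l->next = *links;
    *links = l;
}

int main(void)
{
    /* 3x3 toy matrix: initiators as rows, targets as columns */
    struct cell hmat[3][3] = {
        { {100, 10}, { 20, 50}, { 20, 50} },
        { { 20, 50}, {100, 10}, { 20, 50} },
        { { 20, 50}, { 20, 50}, {100, 10} },
    };
    struct link *links = NULL, *l;
    unsigned nlinks = 0;

    for (unsigned r = 0; r < 3; r++)
        for (unsigned c = 0; c < 3; c++)
            if (!link_find(links, hmat[r][c].bandwidth,
                           hmat[r][c].latency))
                /* real code would also attach the initiator of row r
                 * and the target of column c to the new link */
                link_new(&links, hmat[r][c].bandwidth,
                         hmat[r][c].latency);

    for (l = links; l; l = l->next) {
        printf("link: bandwidth=%u latency=%u\n", l->bandwidth, l->latency);
        nlinks++;
    }
    printf("%u links for 9 matrix cells\n", nlinks);
    return 0;
}

On this toy matrix the 9 cells collapse into 2 links, which is the kind of
reduction expected when many pairs share the same properties.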


> We also can't simply say that two CPUs with the same connection to two
> other CPUs (think a 4-socket QPI-connected system) share the same "link"
> because they share the same combination of bandwidth and latency.  We
> need to know that *each* has its own, unique link and do not share link
> resources.

That is the purpose of the bridge object: to inter-connect links.
To be more exact, a link is like saying you have 2 arrows with the
same properties between every node listed in the link, while a
bridge allows defining an arrow in just one direction. Maybe I
should define arrow and node instead of trying to match some of
the ACPI terminology. That might be easier for people to follow
than first having to understand the terminology.

The fear I have with HMAT culling is that HMAT does not carry the
information needed to do such culling correctly.

> > I don't think i have a system with an HMAT table if you have one
> > HMAT table to provide i could show up the end result.
> 
> It is new enough (ACPI 6.2) that no publicly-available hardware that
> exists that implements one (that I know of).  Keith Busch can probably
> extract one and send it to you or show you how we're faking them with QEMU.
> 
> > Note i believe the ACPI HMAT matrix is a bad design for that
> > reasons ie there is lot of commonality in many of the matrix
> > entry and many entry also do not make sense (ie initiator not
> > being able to access all the targets). I feel that link/bridge
> > is much more compact and allow to represent any directed graph
> > with multiple arrows from one node to another same node.
> 
> I don't disagree.  But, folks are building systems with them and we need
> to either deal with it, or make its data manageable.  You saw our
> approach: we cull the data and only expose the bare minimum in sysfs.

Yeah, and I intend to cull data inside HMS too.

Cheers,
Jérôme
Aneesh Kumar K.V Dec. 5, 2018, 11:27 a.m. UTC | #16
On 12/5/18 12:19 AM, Jerome Glisse wrote:

> Above example is for migrate. Here is an example for how the
> topology is use today:
> 
>      Application knows that the platform is running on have 16
>      GPU split into 2 group of 8 GPUs each. GPU in each group can
>      access each other memory with dedicated mesh links between
>      each others. Full speed no traffic bottleneck.
> 
>      Application splits its GPU computation in 2 so that each
>      partition runs on a group of interconnected GPU allowing
>      them to share the dataset.
> 
> With HMS:
>      Application can query the kernel to discover the topology of
>      system it is running on and use it to partition and balance
>      its workload accordingly. Same application should now be able
>      to run on new platform without having to adapt it to it.
> 

Will the kernel ever be involved in decision making here? Like the
scheduler, will we ever want to control how these computation units get
scheduled onto GPU groups or GPUs?

> This is kind of naive i expect topology to be hard to use but maybe
> it is just me being pesimistics. In any case today we have a chicken
> and egg problem. We do not have a standard way to expose topology so
> program that can leverage topology are only done for HPC where the
> platform is standard for few years. If we had a standard way to expose
> the topology then maybe we would see more program using it. At very
> least we could convert existing user.
> 
> 

I am wondering whether we should consider HMAT as a subset of the ideas
mentioned in this thread and see whether we can first achieve HMAT 
representation with your patch series?

-aneesh
Jerome Glisse Dec. 5, 2018, 4:09 p.m. UTC | #17
On Wed, Dec 05, 2018 at 04:57:17PM +0530, Aneesh Kumar K.V wrote:
> On 12/5/18 12:19 AM, Jerome Glisse wrote:
> 
> > Above example is for migrate. Here is an example for how the
> > topology is use today:
> > 
> >      Application knows that the platform is running on have 16
> >      GPU split into 2 group of 8 GPUs each. GPU in each group can
> >      access each other memory with dedicated mesh links between
> >      each others. Full speed no traffic bottleneck.
> > 
> >      Application splits its GPU computation in 2 so that each
> >      partition runs on a group of interconnected GPU allowing
> >      them to share the dataset.
> > 
> > With HMS:
> >      Application can query the kernel to discover the topology of
> >      system it is running on and use it to partition and balance
> >      its workload accordingly. Same application should now be able
> >      to run on new platform without having to adapt it to it.
> > 
> 
> Will the kernel be ever involved in decision making here? Like the scheduler
> will we ever want to control how there computation units get scheduled onto
> GPU groups or GPU?

I don't think you will ever see fine-grained control in software because
it would go against what GPUs fundamentally are. GPUs have thousands of
cores and usually 10 times more threads in flight than cores (it depends
on the number of registers used by the program or the size of their
thread local storage). By having many more threads in flight, the GPU
always has some threads that are not waiting for memory access and thus
always has something to schedule next on the cores. This scheduling is
all done in real time and I do not see it as a good fit for any kernel
CPU code.

That being said, higher-level and more coarse directives can be given
to the GPU hardware scheduler, like giving priorities to groups of
threads so that they always get scheduled first when ready. There is
a cgroup proposal that goes in the direction of exposing high-level
control over GPU resources like that. I think that is a better
venue to discuss such topics.

> 
> > This is kind of naive i expect topology to be hard to use but maybe
> > it is just me being pesimistics. In any case today we have a chicken
> > and egg problem. We do not have a standard way to expose topology so
> > program that can leverage topology are only done for HPC where the
> > platform is standard for few years. If we had a standard way to expose
> > the topology then maybe we would see more program using it. At very
> > least we could convert existing user.
> > 
> > 
> 
> I am wondering whether we should consider HMAT as a subset of the ideas
> mentioned in this thread and see whether we can first achieve HMAT
> representation with your patch series?

I do not want to block HMAT on that. What I am trying to do really
does not fit in the existing NUMA node; this is what I have been trying
to show even if not everyone is convinced by it. Some bullet points
on why:
    - the memory I care about is not accessible by everyone (a baked-in
      assumption of NUMA nodes)
    - the memory I care about might not be cache coherent (again a
      baked-in assumption of NUMA nodes)
    - topology matters, so that userspace knows which inter-connects are
      shared and which have dedicated links to memory
    - there can be multiple paths between one device and one target
      memory, and each path has different properties (bandwidth,
      latency, ...); again this does not fit with the single NUMA
      distance
    - the memory is not managed by the core kernel, for reasons I have
      explained
    - ...

The HMAT proposal does not deal with such memory; it is much closer
to what the current model can describe.

Cheers,
Jérôme
Dave Hansen Dec. 5, 2018, 5:27 p.m. UTC | #18
On 12/4/18 6:13 PM, Jerome Glisse wrote:
> On Tue, Dec 04, 2018 at 05:06:49PM -0800, Dave Hansen wrote:
>> OK, but there are 1024*1024 matrix cells on a systems with 1024
>> proximity domains (ACPI term for NUMA node).  So it sounds like you are
>> proposing a million-directory approach.
> 
> No, pseudo code:
>     struct list links;
> 
>     for (unsigned r = 0; r < nrows; r++) {
>         for (unsigned c = 0; c < ncolumns; c++) {
>             if (!link_find(links, hmat[r][c].bandwidth,
>                            hmat[r][c].latency)) {
>                 link = link_new(hmat[r][c].bandwidth,
>                                 hmat[r][c].latency);
>                 // add initiator and target correspond to that row
>                 // and columns to this new link
>                 list_add(&link, links);
>             }
>         }
>     }
> 
> So all cells that have same property are under the same link. 

OK, so the "link" here is like a cable.  It's like saying, "we have a
network and everything is connected with an ethernet cable that can do
1gbit/sec".

But, what actually connects an initiator to a target?  I assume we still
need to know which link is used for each target/initiator pair.  Where
is that enumerated?

I think this just means we need a million symlinks to a "link" instead
of a million link directories.  Still not great.

> Note that userspace can parse all this once during its initialization
> and create pools of target to use.

It sounds like you're agreeing that there is too much data in this
interface for applications to _regularly_ parse it.  We need some
central thing that parses it all and caches the results.
Jerome Glisse Dec. 5, 2018, 5:53 p.m. UTC | #19
On Wed, Dec 05, 2018 at 09:27:09AM -0800, Dave Hansen wrote:
> On 12/4/18 6:13 PM, Jerome Glisse wrote:
> > On Tue, Dec 04, 2018 at 05:06:49PM -0800, Dave Hansen wrote:
> >> OK, but there are 1024*1024 matrix cells on a systems with 1024
> >> proximity domains (ACPI term for NUMA node).  So it sounds like you are
> >> proposing a million-directory approach.
> > 
> > No, pseudo code:
> >     struct list links;
> > 
> >     for (unsigned r = 0; r < nrows; r++) {
> >         for (unsigned c = 0; c < ncolumns; c++) {
> >             if (!link_find(links, hmat[r][c].bandwidth,
> >                            hmat[r][c].latency)) {
> >                 link = link_new(hmat[r][c].bandwidth,
> >                                 hmat[r][c].latency);
> >                 // add initiator and target correspond to that row
> >                 // and columns to this new link
> >                 list_add(&link, links);
> >             }
> >         }
> >     }
> > 
> > So all cells that have same property are under the same link. 
> 
> OK, so the "link" here is like a cable.  It's like saying, "we have a
> network and everything is connected with an ethernet cable that can do
> 1gbit/sec".
> 
> But, what actually connects an initiator to a target?  I assume we still
> need to know which link is used for each target/initiator pair.  Where
> is that enumerated?

ls /sys/bus/hms/devices/v0-0-link/
node0           power           subsystem       uevent
uid             bandwidth       latency         v0-1-target
v0-15-initiator v0-21-target    v0-4-initiator  v0-7-initiator
v0-10-initiator v0-13-initiator v0-16-initiator v0-2-initiator
v0-11-initiator v0-14-initiator v0-17-initiator v0-3-initiator
v0-5-initiator  v0-8-initiator  v0-6-initiator  v0-9-initiator
v0-12-initiator v0-10-initiator

So above are 16 CPUs (initiators) and 2 targets, all connected
through a common link. This means that all the initiators
connected to this link can access all the targets connected to
this link. The bandwidth and latency are the best-case scenario,
for instance when only one initiator is accessing the target.

Initiators can only access targets they share a link with, or
reach through an extended path over a bridge. So if you have an
initiator connected to link0 and a target connected to link1, and
there is a bridge from link0 to link1, then the initiator can
access the target memory in link1 but the bandwidth and latency
will be:
    min(link0.bandwidth, link1.bandwidth, bridge.bandwidth)
    min(link0.latency, link1.latency, bridge.latency)

You can really match a link one to one with a bus in your
system. For instance with PCIe, if you only have 16-lane
PCIe devices you only define one link directory for all
your PCIe devices (ignoring the PCIe peer-to-peer scenario
here). You add a bridge between your PCIe link and your
NUMA node link (the node to which the PCIe root complex
belongs); this means that PCIe devices can access the local
node memory with a given bandwidth and latency (best case).

> 
> I think this just means we need a million symlinks to a "link" instead
> of a million link directories.  Still not great.
> 
> > Note that userspace can parse all this once during its initialization
> > and create pools of target to use.
> 
> It sounds like you're agreeing that there is too much data in this
> interface for applications to _regularly_ parse it.  We need some
> central thing that parses it all and caches the results.

No, there are 2 kinds of applications:
    1) the average one: "I am using devices {1, 3, 9}, give me the best
       memory for those devices"
    2) the advanced one: "what is the topology of this system?" It parses
       the topology and partitions its workload accordingly

For case 1 you can pre-parse stuff, and this can be done by a helper
library, but for case 2 there is no amount of pre-parsing you can do in
the kernel; only the application knows its own architecture and thus only
the application knows what matters in the topology. Is the application
looking for a big chunk of memory even if it is slow? Is it also looking
for fast memory close to X and Y? ...

Each application will care about different things and there is no telling
what those are going to be.

So what I am saying is that this information is likely to be parsed once
by the application during startup, ie the sysfs tree is not something that
is continuously read and parsed by the application (unless the application
also cares about hotplug, and then we are talking about the 1% of the 1%).
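
A minimal sketch of such a one-time scan, assuming the v%version-%id-<type>
directory naming described earlier in this thread; the path and the
counting are just an illustration, not a proposed helper API:

#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    DIR *dir = opendir("/sys/bus/hms/devices");
    struct dirent *de;
    unsigned targets = 0, initiators = 0, links = 0, bridges = 0;

    if (!dir) {
        perror("/sys/bus/hms/devices");
        return 1;
    }
    /* one pass over the topology directory, done at startup */
    while ((de = readdir(dir)) != NULL) {
        if (strstr(de->d_name, "-target"))
            targets++;
        else if (strstr(de->d_name, "-initiator"))
            initiators++;
        else if (strstr(de->d_name, "-link"))
            links++;
        else if (strstr(de->d_name, "-bridge"))
            bridges++;
    }
    closedir(dir);
    printf("%u targets, %u initiators, %u links, %u bridges\n",
           targets, initiators, links, bridges);
    return 0;
}

A helper library for the "average" case 1 above could be built on exactly
this kind of scan plus reading the per-link property files.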

Cheers,
Jérôme
Dave Hansen Dec. 6, 2018, 6:25 p.m. UTC | #20
On 12/5/18 9:53 AM, Jerome Glisse wrote:
> No so there is 2 kinds of applications:
>     1) average one: i am using device {1, 3, 9} give me best memory for
>        those devices
...
> 
> For case 1 you can pre-parse stuff but this can be done by helper library

How would that work?  Would each user/container/whatever do this once?
Where would they keep the pre-parsed stuff?  How do they manage their
cache if the topology changes?
Jerome Glisse Dec. 6, 2018, 7:20 p.m. UTC | #21
On Thu, Dec 06, 2018 at 10:25:08AM -0800, Dave Hansen wrote:
> On 12/5/18 9:53 AM, Jerome Glisse wrote:
> > No so there is 2 kinds of applications:
> >     1) average one: i am using device {1, 3, 9} give me best memory for
> >        those devices
> ...
> > 
> > For case 1 you can pre-parse stuff but this can be done by helper library
> 
> How would that work?  Would each user/container/whatever do this once?
> Where would they keep the pre-parsed stuff?  How do they manage their
> cache if the topology changes?

Short answer: I don't expect a cache, I expect that each program will have
an init function that queries the topology and updates the application code
accordingly. This is what people do today: query all available devices,
decide which ones to use and how, create a context for each selected one,
and define a memory migration job/memory policy for each part of the
program so that memory is migrated/has the proper policy in place when the
code that runs on some device is executed.


Long answer:

Sadly I can not dictate how userspace folks write their programs :) I
expect that many applications will do it once during startup. Then you
will have all those container folks or VM folks who will get pressure to
react to hotplug. For instance, if you upgrade your instance with your
cloud provider to get more GPUs or more TPUs ... it is likely to appear
as a hotplug from the VM/container point of view and thus as a hotplug
from the application point of view. So far the demonstrations I have seen
handle that by relaunching the application ... More on that through the
live re-patching issues below.

Oh, and I expect applications will crash if you hot-unplug anything they
are using (this is what happens now, I believe, in most APIs). Again, I
expect that some pressure from cloud users and providers will force
programmers to be a bit more reactive to this kind of event.


Live re-patching application code can be difficult, I am told. Let's say
you have:

void compute_serious0_stuff(accelerator_t *accelerator, void *inputA,
                            size_t sinputA, void *inputB, size_t sinputB,
                            void *outputA, size_t soutputA)
{
    ...

    // Migrate the inputA to the accelerator memory
    api_migrate_memory_to_accelerator(accelerator, inputA, sinputA);

    // The inputB buffer is fine in its default placement

    // The outputA buffer is assumed to be an empty vma, ie no pages
    // allocated yet, so set a policy to direct all allocations caused
    // by page faults to the accelerator memory
    api_set_memory_policy_to_accelerator(accelerator, outputA, soutputA);

    ...
    for_parallel<accelerator> (i = 0; i < THEYAREAMILLIONSITEMS; ++i) {
        // Do something serious
    }
    ...
}

void serious0_orchestrator(topology topology, void *inputA,
                           void *inputB, void *outputA)
{
    static accelerator_t **selected = NULL;
    static serious0_job_partition *partition;
    ...
    if (selected == NULL) {
        serious0_select_and_partition(topology, &selected, &partition,
                                      inputA, inputB, outputA)
    }
    ...
    for (i = 0; i < nselected; ++i) {
        ...
        compute_serious0_stuff(selected[i],
                               inputA + partition[i].inputA_offset,
                               partition[i].inputA_size,
                               inputB + partition[i].inputB_offset,
                               partition[i].inputB_size,
                               outputA + partition[i].outputA_offset,
                               partition[i].outputA_size);
        ...
    }
    ...
    for (i = 0; i < nselected; ++i) {
        accelerator_wait_finish(selected[i]);
    }
    ...
    // outputA is ready to be use by the next function in the program
}

If you start without a GPU/TPU, your for_parallel will use the CPU with
the code the compiler emitted at build time. For a GPU/TPU, at build time
you compile your for_parallel loop to some intermediate representation (a
virtual ISA); then at runtime, during application initialization, that
intermediate representation gets lowered down to all the available
GPUs/TPUs on your system and each for_parallel loop is patched to become
a call to:

void dispatch_accelerator_function(accelerator_t *accelerator,
                                   void *function, ...)
{
}

So in the above example the for_parallel loop becomes:
dispatch_accelerator_function(accelerator, i_compute_serious_stuff,
                              inputA, inputB, outputA);

This hot-patching of code is easy to do when no CPU thread is running
the code. However, when CPU threads are running it can be problematic;
I am sure you can do trickery like delaying the patching until the next
time the function gets called, by doing clever things at build time like
prepending each for_parallel section with enough nops to allow replacing
it with a call to the dispatch function and a jump over the normal CPU
code.


I think compiler people want to solve the static case first, ie during
application initialization decide which devices are going to be used and
then update the application accordingly. But I expect it will grow
to support hotplug, as relaunching the application is not that user
friendly even in this day and age where people start millions of
containers with one mouse click.


Anyway, the above example is how it looks today, and the accelerator can
turn out to be just regular CPU cores if you do not have any devices. The
idea is that we would like a common API that covers both CPU threads
and device threads. Same for the migration/policy functions: if it
happens that the accelerator is just a plain old CPU then you want to
migrate memory to that CPU node and set the memory policy to that node too.

Cheers,
Jérôme
Dave Hansen Dec. 6, 2018, 7:31 p.m. UTC | #22
On 12/6/18 11:20 AM, Jerome Glisse wrote:
>>> For case 1 you can pre-parse stuff but this can be done by helper library
>> How would that work?  Would each user/container/whatever do this once?
>> Where would they keep the pre-parsed stuff?  How do they manage their
>> cache if the topology changes?
> Short answer i don't expect a cache, i expect that each program will have
> a init function that query the topology and update the application codes
> accordingly.

My concern is that having folks do per-program parsing, *and* having a huge
amount of data to parse, makes it unusable.  The largest systems will
literally have hundreds of thousands of objects in sysfs, even in a
single directory.  That makes readdir() basically impossible, and makes
even open() (if you already know the path you want somehow) hard to do fast.

I just don't think sysfs (or any filesystem, really) can scale to
express large, complicated topologies in a way that any normal program
can practically parse.

My suspicion is that we're going to need to have the kernel parse and
cache these things.  We *might* have the data available in sysfs, but we
can't reasonably expect anyone to go parsing it.
Logan Gunthorpe Dec. 6, 2018, 8:11 p.m. UTC | #23
On 2018-12-06 12:31 p.m., Dave Hansen wrote:
> On 12/6/18 11:20 AM, Jerome Glisse wrote:
>>>> For case 1 you can pre-parse stuff but this can be done by helper library
>>> How would that work?  Would each user/container/whatever do this once?
>>> Where would they keep the pre-parsed stuff?  How do they manage their
>>> cache if the topology changes?
>> Short answer i don't expect a cache, i expect that each program will have
>> a init function that query the topology and update the application codes
>> accordingly.
> 
> My concern with having folks do per-program parsing, *and* having a huge
> amount of data to parse makes it unusable.  The largest systems will
> literally have hundreds of thousands of objects in /sysfs, even in a
> single directory.  That makes readdir() basically impossible, and makes
> even open() (if you already know the path you want somehow) hard to do fast.

Is this actually realistic? I find it hard to imagine an actual hardware
bus that can have even thousands of devices under a single node, let
alone hundreds of thousands. At some point the laws of physics apply.
For example, in present hardware, the most ports a single PCI switch can
have these days is under one hundred. I'd imagine any such large systems
would have a hierarchy of devices (ie layers of switch-like devices),
which implies the existing sysfs bus/devices tree should have a path
through it without navigating a directory with such an unreasonable
number of objects in it. HMS, on the other hand, has all possible
initiators (etc.) under a single directory.

The caveat to this is that, to find an initial starting point in the bus
hierarchy, you might have to go through /sys/dev/{block|char} or
/sys/class, which may have directories with a large number of objects.
Though such a system would necessarily have a similarly large number of
objects in /dev, which means you will probably never get around the
readdir/open bottleneck you mention... and, thus, this doesn't seem
overly realistic to me.

Logan
Jerome Glisse Dec. 6, 2018, 8:27 p.m. UTC | #24
On Thu, Dec 06, 2018 at 11:31:21AM -0800, Dave Hansen wrote:
> On 12/6/18 11:20 AM, Jerome Glisse wrote:
> >>> For case 1 you can pre-parse stuff but this can be done by helper library
> >> How would that work?  Would each user/container/whatever do this once?
> >> Where would they keep the pre-parsed stuff?  How do they manage their
> >> cache if the topology changes?
> > Short answer i don't expect a cache, i expect that each program will have
> > a init function that query the topology and update the application codes
> > accordingly.
> 
> My concern with having folks do per-program parsing, *and* having a huge
> amount of data to parse makes it unusable.  The largest systems will
> literally have hundreds of thousands of objects in /sysfs, even in a
> single directory.  That makes readdir() basically impossible, and makes
> even open() (if you already know the path you want somehow) hard to do fast.
> 
> I just don't think sysfs (or any filesystem, really) can scale to
> express large, complicated topologies in a way that any normal program
> can practically parse it.
> 
> My suspicion is that we're going to need to have the kernel parse and
> cache these things.  We *might* have the data available in sysfs, but we
> can't reasonably expect anyone to go parsing it.

What I am failing to explain is that the kernel can not do the parsing,
because the kernel does not know what the application cares about, and
every single application will make different choices and thus select
different devices and memory.

It is not even going to be a case of "class A of applications will do X
and class B will do Y". Every single application in class A might do
something different, because some care about the little details.

So any kind of pre-parsing in the kernel is defeated by the fact that the
kernel does not know what the application is looking for.

I do not see any way to express the application logic as some kind of
automaton or regular expression. The application can literally introspect
itself and the topology to partition its workload. The topology and device
selection is expected to be thousands of lines of code in the most
advanced applications.

Even worse, inside one and the same application there might be different
device partitions and memory selections for different functions of the
application.


I am not scared about the amount of data to parse, really; even on a big
node it is going to be a few dozen links and bridges, and a few dozen
devices. So we are talking about a hundred directories to parse and read.


Maybe an example will help. Let's say we have an application with the
following pipeline:

    inA -> functionA -> outA -> functionB -> outB -> functionC -> result

    - inA is 8 gigabytes
    - outA is 8 gigabytes
    - outB is one dword
    - result is something small
    - functionA does heavy computation on inA (several thousands of
      instructions for each dword in inA).
    - functionB does heavy computation for each dword in outA (again
      thousands of instructions for each dword) and it is looking for a
      specific result that it knows will be unique among all the dword
      computations, ie it outputs only one dword to outB
    - functionC is something well suited for the CPU; it takes outB and
      turns it into the final result

Now let's see a few different systems and their topologies:
    [T1] 1 GPU with 16GB of memory and a handful of CPU cores
    [T2] 1 GPU with 8GB of memory and a handful of CPU cores
    [T3] 2 GPUs, each with 8GB of memory, and a handful of CPU cores
    [T4] 2 GPUs, each with 8GB of memory, and a handful of CPU cores;
         the 2 GPUs have a very fast link between each other
         (400GBytes/s)

Now let's see how the program partitions itself for each topology:
    [T1] The application partitions its computation in 3 phases:
            P1: - migrate inA to GPU memory
            P2: - execute functionA on inA producing outA
            P3: - execute functionB on outA producing outB
                - run functionC and see if functionB has found the
                  thing and written it to outB; if so then kill all
                  GPU threads and return the result, we are done

    [T2] The application partitions its computation in 5 phases:
            P1: - migrate the first 4GB of inA to GPU memory
            P2: - execute functionA for those 4GB and write the 4GB
                  of outA results to the GPU memory
            P3: - execute functionB for the first 4GB of outA
                - while functionB is running, DMA in the background
                  the second 4GB of inA to the GPU memory
                - once one of the millions of threads running functionB
                  finds the result it is looking for, it writes it to
                  outB which is in main memory
                - run functionC and see if functionB has found the
                  thing and written it to outB; if so then kill all
                  GPU threads and DMA and return the result, we are
                  done
            P4: - run functionA on the second half of inA, ie we did
                  not find the result in the first half so we now
                  process the second half which has been migrated to
                  the GPU memory in the background (see above)
            P5: - run functionB on the second 4GB of outA like
                  above
                - run functionC on the CPU and kill everything as soon
                  as one of the threads running functionB has found
                  the result
                - return the result

    [T3] The application partitions its computation in 3 phases:
            P1: - migrate the first 4GB of inA to GPU1 memory
                - migrate the last 4GB of inA to GPU2 memory
            P2: - execute functionA on GPU1 on the first 4GB -> outA
                - execute functionA on GPU2 on the last 4GB -> outA
            P3: - execute functionB on GPU1 on the first 4GB of outA
                - execute functionB on GPU2 on the last 4GB of outA
                - run functionC and see if functionB running on GPU1
                  or GPU2 has found the thing and written it to outB;
                  if so then kill all GPU threads and return the
                  result, we are done

    [T4] The application partitions its computation in 2 phases:
            P1: - migrate the 8GB of inA to GPU1 memory
                - allocate 8GB for outA in GPU2 memory
            P2: - execute functionA on GPU1 on the 8GB of inA and write
                  the outA results to GPU2 through the fast link
                - execute functionB on GPU2, with each functionB
                  thread looping over its part of outA (busy running
                  even while outA is not yet valid for that thread)
                - run functionC and see if functionB running on GPU2
                  has found the thing and written it to outB; if so
                  then kill all GPU threads and return the result,
                  we are done


So these are widely different partitions that all depend on the topology,
on how the accelerators are inter-connected and on how much memory they
have. This is a relatively simple example; there are people out there
spending months designing adaptive partitioning algorithms for their
applications.
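
As a hedged toy illustration of the [T1] versus [T2] decision above, the
chunk size the application migrates per phase can be derived from the
device memory size it finds in the topology (the function and numbers
below are made up for this example):

#include <stdio.h>

#define GB (1ULL << 30)

/*
 * How much of inA to migrate to the GPU per phase: if inA plus outA fit,
 * do it in one shot ([T1]); otherwise split so that one chunk of inA and
 * the matching chunk of outA fit at the same time ([T2]).
 */
static unsigned long long chunk_bytes(unsigned long long gpu_mem,
                                      unsigned long long inA,
                                      unsigned long long outA)
{
    if (gpu_mem >= inA + outA)
        return inA;
    return gpu_mem / 2;
}

int main(void)
{
    printf("16GB GPU -> migrate %lluGB of inA at a time\n",
           chunk_bytes(16 * GB, 8 * GB, 8 * GB) / GB);
    printf(" 8GB GPU -> migrate %lluGB of inA at a time\n",
           chunk_bytes(8 * GB, 8 * GB, 8 * GB) / GB);
    return 0;
}

This prints 8GB and 4GB, matching the [T1] and [T2] partitions; a real
application would of course fold in many more topology properties.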

Cheers,
Jérôme
Jerome Glisse Dec. 6, 2018, 9:46 p.m. UTC | #25
On Thu, Dec 06, 2018 at 03:27:06PM -0500, Jerome Glisse wrote:
> On Thu, Dec 06, 2018 at 11:31:21AM -0800, Dave Hansen wrote:
> > On 12/6/18 11:20 AM, Jerome Glisse wrote:
> > >>> For case 1 you can pre-parse stuff but this can be done by helper library
> > >> How would that work?  Would each user/container/whatever do this once?
> > >> Where would they keep the pre-parsed stuff?  How do they manage their
> > >> cache if the topology changes?
> > > Short answer i don't expect a cache, i expect that each program will have
> > > a init function that query the topology and update the application codes
> > > accordingly.
> > 
> > My concern with having folks do per-program parsing, *and* having a huge
> > amount of data to parse makes it unusable.  The largest systems will
> > literally have hundreds of thousands of objects in /sysfs, even in a
> > single directory.  That makes readdir() basically impossible, and makes
> > even open() (if you already know the path you want somehow) hard to do fast.
> > 
> > I just don't think sysfs (or any filesystem, really) can scale to
> > express large, complicated topologies in a way that any normal program
> > can practically parse it.
> > 
> > My suspicion is that we're going to need to have the kernel parse and
> > cache these things.  We *might* have the data available in sysfs, but we
> > can't reasonably expect anyone to go parsing it.
> 
> What i am failing to explain is that kernel can not parse because kernel
> does not know what the application cares about and every single applications
> will make different choices and thus select differents devices and memory.
> 
> It is not even gonna a thing like class A of application will do X and
> class B will do Y. Every single application in class A might do something
> different because somes care about the little details.
> 
> So any kind of pre-parsing in the kernel is defeated by the fact that the
> kernel does not know what the application is looking for.
> 
> I do not see anyway to express the application logic in something that
> can be some kind of automaton or regular expression. The application can
> litteraly intro-inspect itself and the topology to partition its workload.
> The topology and device selection is expected to be thousands of line of
> code in the most advance application.
> 
> Even worse inside one same application, they might be different device
> partition and memory selection for different function in the application.
> 
> 
> I am not scare about the anount of data to parse really, even on big node
> it is gonna be few dozens of links and bridges, and few dozens of devices.
> So we are talking hundred directories to parse and read.
> 
> 
> Maybe an example will help. Let say we have an application with the
> following pipeline:
> 
>     inA -> functionA -> outA -> functionB -> outB -> functionC -> result
> 
>     - inA 8 gigabytes
>     - outA 8 gigabytes
>     - outB one dword
>     - result something small
>     - functionA is doing heavy computation on inA (several thousands of
>       instructions for each dword in inA).
>     - functionB is doing heavy computation for each dword in outA (again
>       thousand of instruction for each dword) and it is looking for a
>       specific result that it knows will be unique among all the dword
>       computation ie it is output only one dword in outB
>     - functionC is something well suited for CPU that take outB and turns
>       it into the final result
> 
> Now let see few different system and their topologies:
>     [T2] 1 GPU with 16GB of memory and a handfull of CPU cores
>     [T1] 1 GPU with 8GB of memory and a handfull of CPU cores
>     [T3] 2 GPU with 8GB of memory and a handfull of CPU core
>     [T4] 2 GPU with 8GB of memory and a handfull of CPU core
>          the 2 GPU have a very fast link between each others
>          (400GBytes/s)
> 
> Now let see how the program will partition itself for each topology:
>     [T1] Application partition its computation in 3 phases:
>             P1: - migrate inA to GPU memory
>             P2: - execute functionA on inA producing outA
>             P3  - execute functionB on outA producing outB
>                 - run functionC and see if functionB have found the
>                   thing and written it to outB if so then kill all
>                   GPU threads and return the result we are done
> 
>     [T2] Application partition its computation in 5 phases:
>             P1: - migrate first 4GB of inA to GPU memory
>             P2: - execute functionA for the 4GB and write the 4GB
>                   outA result to the GPU memory
>             P3: - execute functionB for the first 4GB of outA
>                 - while functionB is running DMA in the background
>                   the the second 4GB of inA to the GPU memory
>                 - once one of the millions of thread running functionB
>                   find the result it is looking for it writes it to
>                   outB which is in main memory
>                 - run functionC and see if functionB have found the
>                   thing and written it to outB if so then kill all
>                   GPU thread and DMA and return the result we are
>                   done
>             P4: - run functionA on the second half of inA ie we did
>                   not find the result in the first half so we no
>                   process the second half that have been migrated to
>                   the GPU memory in the background (see above)
>             P5: - run functionB on the second 4GB of outA like
>                   above
>                 - run functionC on CPU and kill everything as soon
>                   as one of the thread running functionB has found
>                   the result
>                 - return the result
> 
>     [T3] Application partition its computation in 3 phases:
>             P1: - migrate first 4GB of inA to GPU1 memory
>                 - migrate last 4GB of inA to GPU2 memory
>             P2: - execute functionA on GPU1 on the first 4GB -> outA
>                 - execute functionA on GPU2 on the last 4GB -> outA
>             P3: - execute functionB on GPU1 on the first 4GB of outA
>                 - execute functionB on GPU2 on the last 4GB of outA
>                 - run functionC and see if functionB running on GPU1
>                   and GPU2 have found the thing and written it to outB
>                   if so then kill all GPU threads and return the result
>                   we are done
> 
>     [T4] Application partition its computation in 2 phases:
>             P1: - migrate 8GB of inA to GPU1 memory
>                 - allocate 8GB for outA in GPU2 memory
>             P2: - execute functionA on GPU1 on the inA 8GB and write
>                   out result to GPU2 through the fast link
>                 - execute functionB on GPU2 and look over each
>                   thread on functionB on outA (busy running even
>                   if outA is not valid for each thread running
>                   functionB)
>                 - run functionC and see if functionB running on GPU2
>                   have found the thing and written it to outB if so
>                   then kill all GPU threads and return the result
>                   we are done
> 
> 
> So this is widely different partition that all depends on the topology
> and how accelerator are inter-connected and how much memory they have.
> This is a relatively simple example, they are people out there spending
> month on designing adaptive partitioning algorithm for their application.
> 

And since I am writing examples, another funny one: let's say you have
a system with 2 nodes, and on each node 2 GPUs and one network adapter.
On each node the local network adapter can only access the memory of one
of the 2 GPUs. All the GPUs are connected to each other through a fully
symmetrical mesh inter-connect.

Now let's say your program has 4 functions back to back, each function
consuming the output of the previous one. Finally, you get your input
from the network and stream the final function's output out to the
network.

So what you can do is:
    Node0 Net0 -> write to Node0 GPU0 memory
    Node0 GPU0 -> run first function and write result to Node0 GPU1
    Node0 GPU1 -> run second function and write result to Node1 GPU3
    Node1 GPU3 -> run third function and write result to Node1 GPU2
    Node1 Net1 -> read result from Node1 GPU2 and stream it out


Yes, this kind of thing can be decided at application startup, during
initialization. The idea is that you model your program's computation
graph: each node is a function (or group of functions) and each arrow
is a data flow (input and output).

So you have a graph. Now what you do is try to find a sub-graph of
your system topology that matches this graph, and for the system
topology you also have to check that each of your program's nodes can
run on the specific accelerator node of your system (does the
accelerator have features X and Y?).

If you are not lucky and there is no 1 to 1 match, then you can
re-arrange/simplify your application's computation graph. For instance,
group multiple of your application's function nodes into a single node
to shrink the computation graph. Rinse and repeat. A sketch of this
matching step follows below.
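
To make that matching step a bit more concrete, here is a minimal
brute-force sketch (the struct layout and all names below are my own
illustration, nothing from the actual proposal): place each node of the
application graph on a distinct node of the topology graph so that the
required features are there and every application edge is backed by a
topology edge; when it fails, merge application nodes and retry as
described above.

#include <stdbool.h>

#define MAX_NODES 16

struct graph {
    int nnodes;
    bool adj[MAX_NODES][MAX_NODES];   /* edge/link between two nodes ? */
    unsigned int features[MAX_NODES]; /* capability bits of each node  */
};

/* Can application node pnode run on topology node snode ? */
static bool node_ok(const struct graph *app, const struct graph *topo,
                    int pnode, int snode)
{
    return (app->features[pnode] & ~topo->features[snode]) == 0;
}

/*
 * Try to place application nodes [pnode..nnodes) onto unused topology
 * nodes so that every application edge is backed by a topology edge.
 * map[] holds the assignment found so far, used[] marks taken nodes.
 */
static bool match(const struct graph *app, const struct graph *topo,
                  int *map, bool *used, int pnode)
{
    if (pnode == app->nnodes)
        return true;                  /* everything placed */

    for (int s = 0; s < topo->nnodes; s++) {
        bool edges_ok = true;

        if (used[s] || !node_ok(app, topo, pnode, s))
            continue;

        for (int p = 0; p < pnode; p++)
            if (app->adj[pnode][p] && !topo->adj[s][map[p]])
                edges_ok = false;
        if (!edges_ok)
            continue;

        map[pnode] = s;
        used[s] = true;
        if (match(app, topo, map, used, pnode + 1))
            return true;
        used[s] = false;
    }
    return false;  /* no 1 to 1 match: merge application nodes and retry */
}

Real applications would of course use something smarter than brute
force, but the principle is the same.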


Moreover, each application will have multiple separate computation
graphs, and the application will want to spread its workload as evenly
as possible and select the most powerful accelerator for the most
intensive computation ...


I do not see how to have an in-kernel graph matching API with complex
testing where you need to query back a userspace library, for example
querying whether the userspace OpenCL driver for GPU A supports
feature X. That might depend not only on the device generation or
kernel device driver version but also on the version of the userspace
driver.

I feel it would be a lot easier to provide a graph to userspace and
have userspace do this complex matching, adapt its computation graph,
and load balance its computation at the same time.


Of course not all applications will be that complex, and like I said I
believe the average app (especially a desktop app designed to run on a
laptop) will just use a dumbed-down version of this, ie it will only
use one or two devices at the most.


Yes, all this is hard, but easy problems are not interesting to solve.

Cheers,
Jérôme
Dave Hansen Dec. 6, 2018, 10:04 p.m. UTC | #26
On 12/6/18 12:11 PM, Logan Gunthorpe wrote:
>> My concern with having folks do per-program parsing, *and* having a huge
>> amount of data to parse makes it unusable.  The largest systems will
>> literally have hundreds of thousands of objects in /sysfs, even in a
>> single directory.  That makes readdir() basically impossible, and makes
>> even open() (if you already know the path you want somehow) hard to do fast.
> Is this actually realistic? I find it hard to imagine an actual hardware
> bus that can have even thousands of devices under a single node, let
> alone hundreds of thousands.

Jerome's proposal, as I understand it, would have generic "links".
They're not an instance of bus, but characterize a class of "link".  For
instance, a "link" might characterize the characteristics of the QPI bus
between two CPU sockets. The link directory would enumerate the list of
all *instances* of that link

So, a "link" directory for QPI would say Socket0<->Socket1,
Socket1<->Socket2, Socket1<->Socket2, Socket2<->PCIe-1.2.3.4 etc...  It
would have to enumerate the connections between every entity that shared
those link properties.

While there might not be millions of buses, there could be millions of
*paths* across all those buses, and that's what the HMAT describes, at
least: the net result of all those paths.
Jerome Glisse Dec. 6, 2018, 10:39 p.m. UTC | #27
On Thu, Dec 06, 2018 at 02:04:46PM -0800, Dave Hansen wrote:
> On 12/6/18 12:11 PM, Logan Gunthorpe wrote:
> >> My concern with having folks do per-program parsing, *and* having a huge
> >> amount of data to parse makes it unusable.  The largest systems will
> >> literally have hundreds of thousands of objects in /sysfs, even in a
> >> single directory.  That makes readdir() basically impossible, and makes
> >> even open() (if you already know the path you want somehow) hard to do fast.
> > Is this actually realistic? I find it hard to imagine an actual hardware
> > bus that can have even thousands of devices under a single node, let
> > alone hundreds of thousands.
> 
> Jerome's proposal, as I understand it, would have generic "links".
> They're not an instance of bus, but characterize a class of "link".  For
> instance, a "link" might characterize the characteristics of the QPI bus
> between two CPU sockets. The link directory would enumerate the list of
> all *instances* of that link
> 
> So, a "link" directory for QPI would say Socket0<->Socket1,
> Socket1<->Socket2, Socket1<->Socket2, Socket2<->PCIe-1.2.3.4 etc...  It
> would have to enumerate the connections between every entity that shared
> those link properties.
> 
> While there might not be millions of buses, there could be millions of
> *paths* across all those buses, and that's what the HMAT describes, at
> least: the net result of all those paths.

Sorry if I again explained things poorly. Links are arrows between nodes
(CPU or device or memory). An arrow/link has properties associated
with it: bandwidth, latency, cache-coherence, ...

So if in your system you have 4 sockets, each socket is connected to
every other (mesh), and all inter-connects in the mesh have the same
properties, then you only have 1 link directory with the 4 sockets in it.

Now if the 4 sockets are connected in a ring fashion, ie:
        Socket0 - Socket1
           |         |
        Socket3 - Socket2

Then you have 4 links:
link0: socket0 socket1
link1: socket1 socket2
link3: socket2 socket3
link4: socket3 socket0
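
Purely as an illustration of that grouping (the structure, field names
and numbers below are made up for the example; they are not the actual
HMS objects or real sticker values), the ring has one entry per
distinct inter-connect, while the fully-meshed case above collapses to
a single entry:

struct link_desc {
    const char   *name;
    unsigned long nodes;          /* bitmask of attached sockets */
    unsigned int  bandwidth_mbs;  /* sticker bandwidth           */
    unsigned int  latency_ns;     /* sticker latency             */
};

/* the ring: one entry per distinct inter-connect */
static const struct link_desc ring[] = {
    { "link0", 1UL << 0 | 1UL << 1, 40000, 100 },
    { "link1", 1UL << 1 | 1UL << 2, 40000, 100 },
    { "link3", 1UL << 2 | 1UL << 3, 40000, 100 },
    { "link4", 1UL << 3 | 1UL << 0, 40000, 100 },
};

/* the fully-meshed case above collapses to a single entry */
static const struct link_desc mesh[] = {
    { "link0", 0xf /* sockets 0-3 */, 40000, 100 },
};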

I do not see how there can be an explosion of link directories; the
worst case is as many link directories as there are buses for a CPU/
device/target. So, worst case, if you have N devices and each device is
connected to 2 buses (PCIe, and QPI to go to the other socket, for
instance), then you have 2*N link directories (again this is a worst
case).

There is a lot of commonality that will remain, so I expect that quite
a few link directories will have many symlinks, ie you won't get close
to the worst case.


In the end it really is easier to think in terms of the physical
topology, where a link corresponds to an inter-connect between two
devices or CPUs. In all the systems I have seen, even in the craziest
roadmaps, I have only seen something like 128/256 inter-connects
(4 sockets, 32/64 devices per socket), many of which can be grouped
under a common link directory. Here the worst case is 4 connections per
device/CPU/target, so a worst case of 128/256 * 4 = 512/1024 link
directories, and that's a lot. Given the regularity I have seen
described on slides, I expect it would need something like 30 link
directories and 20 bridge directories.

On today's systems, with 8 GPUs per socket, GPUlink between each GPU,
and PCIe, all this with 4 sockets comes down to 20 link directories.

In any case, each device/CPU/target has a limit on the number of
buses/inter-connects it is connected to. I doubt anyone is designing a
device that will have much more than 4 external bus connections.

So it is not a link per pair. It is a link per group of devices/CPUs/
targets. Is it any clearer?

Cheers,
Jérôme
Dave Hansen Dec. 6, 2018, 11:09 p.m. UTC | #28
On 12/6/18 2:39 PM, Jerome Glisse wrote:
> No if the 4 sockets are connect in a ring fashion ie:
>         Socket0 - Socket1
>            |         |
>         Socket3 - Socket2
> 
> Then you have 4 links:
> link0: socket0 socket1
> link1: socket1 socket2
> link3: socket2 socket3
> link4: socket3 socket0
> 
> I do not see how their can be an explosion of link directory, worse
> case is as many link directories as they are bus for a CPU/device/
> target.

This looks great.  But, we don't _have_ this kind of information for any
system that I know about or any system available in the near future.

We basically have two different world views:
1. The system is described point-to-point.  A connects to B @
   100GB/s.  B connects to C at 50GB/s.  Thus, C->A should be
   50GB/s.
   * Less information to convey
   * Potentially less precise if the properties are not perfectly
     additive.  If A->B=10ns and B->C=20ns, A->C might be >30ns.
   * Costs must be calculated instead of being explicitly specified
2. The system is described endpoint-to-endpoint.  A->B @ 100GB/s
   B->C @ 50GB/s, A->C @ 50GB/s.
   * A *lot* more information to convey O(N^2)?
   * Potentially more precise.
   * Costs are explicitly specified, not calculated

These patches are really tied to world view #1.  But, the HMAT is really
tied to world view #1.

I know you're not a fan of the HMAT.  But it is the firmware reality
that we are stuck with, until something better shows up.  I just don't
see a way to convert it into what you have described here.

I'm starting to think that, no matter if the HMAT or some other approach
gets adopted, we shouldn't be exposing this level of gunk to userspace
at *all* since it requires adopting one of the world views.
Logan Gunthorpe Dec. 6, 2018, 11:28 p.m. UTC | #29
On 2018-12-06 4:09 p.m., Dave Hansen wrote:
> This looks great.  But, we don't _have_ this kind of information for any
> system that I know about or any system available in the near future.
> 
> We basically have two different world views:
> 1. The system is described point-to-point.  A connects to B @
>    100GB/s.  B connects to C at 50GB/s.  Thus, C->A should be
>    50GB/s.
>    * Less information to convey
>    * Potentially less precise if the properties are not perfectly
>      additive.  If A->B=10ns and B->C=20ns, A->C might be >30ns.
>    * Costs must be calculated instead of being explicitly specified
> 2. The system is described endpoint-to-endpoint.  A->B @ 100GB/s
>    B->C @ 50GB/s, A->C @ 50GB/s.
>    * A *lot* more information to convey O(N^2)?
>    * Potentially more precise.
>    * Costs are explicitly specified, not calculated
> 
> These patches are really tied to world view #1.  But, the HMAT is really
> tied to world view #1.

I didn't think this was meant to describe actual real world performance
between all of the links. If that's the case all of this seems like a
pipe dream to me.

Attributes like cache coherency, atomics, etc should fit well in world
view #1... and, at best, some kind of flag saying whether or not to use
a particular link if you care about transfer speed. -- But we don't need
special "link" directories to describe the properties of existing buses.

You're not *really* going to know bandwidth or latency for any of this
unless you actually measure it on the system in question.

Logan
Dave Hansen Dec. 6, 2018, 11:34 p.m. UTC | #30
On 12/6/18 3:28 PM, Logan Gunthorpe wrote:
> These patches are really tied to world view #1.  But, the HMAT is really
> tied to world view #1.

Whoops, should have been "the HMAT is really tied to world view #2"
Dave Hansen Dec. 6, 2018, 11:38 p.m. UTC | #31
On 12/6/18 3:28 PM, Logan Gunthorpe wrote:
> I didn't think this was meant to describe actual real world performance
> between all of the links. If that's the case all of this seems like a
> pipe dream to me.

The HMAT discussions (that I was a part of at least) settled on just
trying to describe what we called "sticker speed".  Nobody had an
expectation that you *really* had to measure everything.

The best we can do for any of these approaches is approximate things.

> You're not *really* going to know bandwidth or latency for any of this
> unless you actually measure it on the system in question.

Yeah, agreed.
Logan Gunthorpe Dec. 6, 2018, 11:48 p.m. UTC | #32
On 2018-12-06 4:38 p.m., Dave Hansen wrote:
> On 12/6/18 3:28 PM, Logan Gunthorpe wrote:
>> I didn't think this was meant to describe actual real world performance
>> between all of the links. If that's the case all of this seems like a
>> pipe dream to me.
> 
> The HMAT discussions (that I was a part of at least) settled on just
> trying to describe what we called "sticker speed".  Nobody had an
> expectation that you *really* had to measure everything.
> 
> The best we can do for any of these approaches is approximate things.

Yes, though there's a lot of caveats in this assumption alone.
Specifically with PCI: the bus may run at however many GB/s but P2P
through a CPU's root complexes can slow down significantly (like down to
MB/s).

I've seen similar things across QPI: I can sometimes do P2P from
PCI->QPI->PCI but the performance doesn't even come close to the sticker
speed of any of those buses.

I'm not sure how anyone is going to deal with those issues, but it does
firmly place us in world view #2 instead of #1. But, yes, I agree
exposing information like in #2 full out to userspace, especially
through sysfs, seems like a nightmare and I don't see anything in HMS to
help with that. Providing an API to ask for memory (or another resource)
that's accessible by a set of initiators and with a set of requirements
for capabilities seems more manageable.

Logan
Jerome Glisse Dec. 7, 2018, 12:15 a.m. UTC | #33
On Thu, Dec 06, 2018 at 03:09:21PM -0800, Dave Hansen wrote:
> On 12/6/18 2:39 PM, Jerome Glisse wrote:
> > No if the 4 sockets are connect in a ring fashion ie:
> >         Socket0 - Socket1
> >            |         |
> >         Socket3 - Socket2
> > 
> > Then you have 4 links:
> > link0: socket0 socket1
> > link1: socket1 socket2
> > link3: socket2 socket3
> > link4: socket3 socket0
> > 
> > I do not see how their can be an explosion of link directory, worse
> > case is as many link directories as they are bus for a CPU/device/
> > target.
> 
> This looks great.  But, we don't _have_ this kind of information for any
> system that I know about or any system available in the near future.

We do not have it in any standard way, but it is out there, in either
device driver databases, application databases, or special platform
OEM blobs buried somewhere in the firmware ...

I want to solve the kernel side of the problem, ie how to expose
this to userspace. How the kernel gets that information is an
orthogonal problem. For now my intention is to have device drivers
register and create the links and bridges that are not enumerated
by standard firmware.

> 
> We basically have two different world views:
> 1. The system is described point-to-point.  A connects to B @
>    100GB/s.  B connects to C at 50GB/s.  Thus, C->A should be
>    50GB/s.
>    * Less information to convey
>    * Potentially less precise if the properties are not perfectly
>      additive.  If A->B=10ns and B->C=20ns, A->C might be >30ns.
>    * Costs must be calculated instead of being explicitly specified
> 2. The system is described endpoint-to-endpoint.  A->B @ 100GB/s
>    B->C @ 50GB/s, A->C @ 50GB/s.
>    * A *lot* more information to convey O(N^2)?
>    * Potentially more precise.
>    * Costs are explicitly specified, not calculated
> 
> These patches are really tied to world view #1.  But, the HMAT is really
> tied to world view #1.
                      ^#2

Note that there are also bridge objects in my proposal. So in my
proposal, with #1, you have:
link0: A <-> B with 100GB/s and 10ns latency
link1: B <-> C with 50GB/s and 20ns latency

Now if A can reach C through B then you have bridges (bridges are uni-
directional unlike links, which are bi-directional; that finer point
can be discussed, but it is what allows any kind of directed graph to
be represented):
bridge2: link0 -> link1
bridge3: link1 -> link0

You can also associate properties with a bridge (but it is not
mandatory). So you can say that bridge2 and bridge3 have a latency of
50ns, and if the sum of the link latencies is enough then you do not
specify any latency on the bridge. It is a rule that a path's latency
is the sum of its individual links' latencies. For bandwidth it is the
minimum bandwidth, ie whatever is the bottleneck for the path.
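
As a small illustration of those two rules (the code and names are
mine, not the patchset's): latencies add up along the path while the
path bandwidth is whatever the slowest link delivers.

struct prop {
    unsigned int bandwidth;  /* e.g. MB/s, sticker value */
    unsigned int latency;    /* e.g. ns */
};

/*
 * Combine per-link sticker properties into a path property:
 * latency is additive, bandwidth is limited by the bottleneck.
 * An explicit bridge latency would simply be one more term in
 * the sum.
 */
static struct prop path_prop(const struct prop *links, int nlinks)
{
    struct prop p = { .bandwidth = ~0u, .latency = 0 };

    for (int i = 0; i < nlinks; i++) {
        p.latency += links[i].latency;
        if (links[i].bandwidth < p.bandwidth)
            p.bandwidth = links[i].bandwidth;
    }
    return p;
}

For the A -> B -> C example above that gives 30ns (plus any explicit
bridge latency) and 50GB/s.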


> I know you're not a fan of the HMAT.  But it is the firmware reality
> that we are stuck with, until something better shows up.  I just don't
> see a way to convert it into what you have described here.

Like I said, I am not targeting HMAT systems; I am targeting systems
that today rely on databases spread between drivers and applications.
I want to move that knowledge into drivers first so that they can teach
the core kernel and register things in the core. Providing a standard
firmware way to convey this information is a different problem (there
are some loose standards on non-ACPI platforms AFAIK).

> I'm starting to think that, no matter if the HMAT or some other approach
> gets adopted, we shouldn't be exposing this level of gunk to userspace
> at *all* since it requires adopting one of the world views.

I do not see these as exclusive. Yes, there are HMAT systems "soon" to
arrive, but we already have the more extended view, which is just
buried under a pile of different pieces. I do not see any exclusion
between the two. If HMAT is good enough for a whole class of systems,
fine, but there is also a whole class of systems and users that do not
fit that paradigm, hence my proposal.

Cheers,
Jérôme
Jerome Glisse Dec. 7, 2018, 12:20 a.m. UTC | #34
On Thu, Dec 06, 2018 at 04:48:57PM -0700, Logan Gunthorpe wrote:
> 
> 
> On 2018-12-06 4:38 p.m., Dave Hansen wrote:
> > On 12/6/18 3:28 PM, Logan Gunthorpe wrote:
> >> I didn't think this was meant to describe actual real world performance
> >> between all of the links. If that's the case all of this seems like a
> >> pipe dream to me.
> > 
> > The HMAT discussions (that I was a part of at least) settled on just
> > trying to describe what we called "sticker speed".  Nobody had an
> > expectation that you *really* had to measure everything.
> > 
> > The best we can do for any of these approaches is approximate things.
> 
> Yes, though there's a lot of caveats in this assumption alone.
> Specifically with PCI: the bus may run at however many GB/s but P2P
> through a CPU's root complexes can slow down significantly (like down to
> MB/s).
> 
> I've seen similar things across QPI: I can sometimes do P2P from
> PCI->QPI->PCI but the performance doesn't even come close to the sticker
> speed of any of those buses.
> 
> I'm not sure how anyone is going to deal with those issues, but it does
> firmly place us in world view #2 instead of #1. But, yes, I agree
> exposing information like in #2 full out to userspace, especially
> through sysfs, seems like a nightmare and I don't see anything in HMS to
> help with that. Providing an API to ask for memory (or another resource)
> that's accessible by a set of initiators and with a set of requirements
> for capabilities seems more manageable.

Note that in #1 you have bridges that fully allow expressing those path
limitations. So what you just described can be fully reported to
userspace.

I have explained, and given examples of, how programs adapt their
computation to the system topology; it does exist today, and people are
even developing new programming languages with some of those ideas
baked in.

So there are people out there who already rely on such information;
they just do not get it from the kernel but from a mix of various
device specific APIs, and they have to stitch everything together
themselves and develop a database of quirks and gotchas. My proposal is
to provide a coherent kernel API where we can sanitize that information
and report it to userspace in a single and coherent description.

Cheers,
Jérôme
Jonathan Cameron Dec. 7, 2018, 3:06 p.m. UTC | #35
On Thu, 6 Dec 2018 19:20:45 -0500
Jerome Glisse <jglisse@redhat.com> wrote:

> On Thu, Dec 06, 2018 at 04:48:57PM -0700, Logan Gunthorpe wrote:
> > 
> > 
> > On 2018-12-06 4:38 p.m., Dave Hansen wrote:  
> > > On 12/6/18 3:28 PM, Logan Gunthorpe wrote:  
> > >> I didn't think this was meant to describe actual real world performance
> > >> between all of the links. If that's the case all of this seems like a
> > >> pipe dream to me.  
> > > 
> > > The HMAT discussions (that I was a part of at least) settled on just
> > > trying to describe what we called "sticker speed".  Nobody had an
> > > expectation that you *really* had to measure everything.
> > > 
> > > The best we can do for any of these approaches is approximate things.  
> > 
> > Yes, though there's a lot of caveats in this assumption alone.
> > Specifically with PCI: the bus may run at however many GB/s but P2P
> > through a CPU's root complexes can slow down significantly (like down to
> > MB/s).
> > 
> > I've seen similar things across QPI: I can sometimes do P2P from
> > PCI->QPI->PCI but the performance doesn't even come close to the sticker
> > speed of any of those buses.
> > 
> > I'm not sure how anyone is going to deal with those issues, but it does
> > firmly place us in world view #2 instead of #1. But, yes, I agree
> > exposing information like in #2 full out to userspace, especially
> > through sysfs, seems like a nightmare and I don't see anything in HMS to
> > help with that. Providing an API to ask for memory (or another resource)
> > that's accessible by a set of initiators and with a set of requirements
> > for capabilities seems more manageable.  
> 
> Note that in #1 you have bridge that fully allow to express those path
> limitation. So what you just describe can be fully reported to userspace.
> 
> I explained and given examples on how program adapt their computation to
> the system topology it does exist today and people are even developing new
> programming langage with some of those idea baked in.
> 
> So they are people out there that already rely on such information they
> just do not get it from the kernel but from a mix of various device specific
> API and they have to stich everything themself and develop a database of
> quirk and gotcha. My proposal is to provide a coherent kernel API where
> we can sanitize that informations and report it to userspace in a single
> and coherent description.
> 
> Cheers,
> Jérôme

I know it doesn't work everywhere, but I think it's worth enumerating what
cases we can get some of these numbers for and where the complexity lies.
I.e. What can the really determined user space library do today?

So one open question is how close can we get in a userspace only prototype.
At the end of the day userspace can often read HMAT directly if it wants to
/sys/firmware/acpi/tables/HMAT.  Obviously that gets us only the end to
end view (world 2).  I dislike the limitations of that as much as the next
person. It is slowly improving with the word "Auditable" being
kicked around - btw anyone interested in ACPI who works for a UEFI
member, there are efforts going on and more viewpoints would be great.
Expect some baby steps shortly.

For devices on PCIe (and protocols on top of it, e.g. CCIX), a lot of
this is discoverable to some degree (see the sketch after this list):
* Link speed,
* Number of Lanes,
* Full topology.
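
As a hint of what a determined userspace library can already do today,
a minimal sketch (assuming only the standard current_link_speed /
current_link_width PCI sysfs attributes, nothing HMS specific):

#include <stdio.h>

/*
 * Print the negotiated link speed and width of one PCI device,
 * e.g. dump_pci_link("0000:03:00.0").  Uses only the standard
 * current_link_speed / current_link_width sysfs attributes.
 */
static void dump_pci_link(const char *bdf)
{
    const char *attrs[] = { "current_link_speed", "current_link_width" };
    char path[256], line[64];

    for (unsigned int i = 0; i < sizeof(attrs) / sizeof(attrs[0]); i++) {
        FILE *f;

        snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/%s",
                 bdf, attrs[i]);
        f = fopen(path, "r");
        if (!f)
            continue;   /* attribute absent or no permission */
        if (fgets(line, sizeof(line), f))
            printf("%s %s: %s", bdf, attrs[i], line);
        fclose(f);
    }
}

Walking the parent links of /sys/bus/pci/devices/* gives the topology
itself; HMAT, when present, can be read from
/sys/firmware/acpi/tables/HMAT as noted above.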

What isn't there (I think)
* In component latency / bandwidth limitations (some activity going
  on to improve that long term)
* Effect of credit allocations etc on effectively bandwidth - interconnect
  performance is a whole load of black magic.

Presumably there is some information available from NVLink etc?

So whilst I really like the proposal in some ways, I wonder how much exploration
could be done of the usefulness of the data without touching the kernel at all.

The other aspect that is needed to actually make this 'dynamically' useful is
to be able to map whatever Performance Counters are available to the relevant
'links', bridges etc.   Sticker numbers are not all that useful unfortunately
except for small amounts of data on lightly loaded buses.

The kernel ultimately only needs to have a model of this topology if:
1) It's going to use it itself
2) It's going to do something automatic with it.
3) It needs to fix garbage info or supplement with things only the kernel knows.

Jonathan
Jerome Glisse Dec. 7, 2018, 7:37 p.m. UTC | #36
On Fri, Dec 07, 2018 at 03:06:36PM +0000, Jonathan Cameron wrote:
> On Thu, 6 Dec 2018 19:20:45 -0500
> Jerome Glisse <jglisse@redhat.com> wrote:
> 
> > On Thu, Dec 06, 2018 at 04:48:57PM -0700, Logan Gunthorpe wrote:
> > > 
> > > 
> > > On 2018-12-06 4:38 p.m., Dave Hansen wrote:  
> > > > On 12/6/18 3:28 PM, Logan Gunthorpe wrote:  
> > > >> I didn't think this was meant to describe actual real world performance
> > > >> between all of the links. If that's the case all of this seems like a
> > > >> pipe dream to me.  
> > > > 
> > > > The HMAT discussions (that I was a part of at least) settled on just
> > > > trying to describe what we called "sticker speed".  Nobody had an
> > > > expectation that you *really* had to measure everything.
> > > > 
> > > > The best we can do for any of these approaches is approximate things.  
> > > 
> > > Yes, though there's a lot of caveats in this assumption alone.
> > > Specifically with PCI: the bus may run at however many GB/s but P2P
> > > through a CPU's root complexes can slow down significantly (like down to
> > > MB/s).
> > > 
> > > I've seen similar things across QPI: I can sometimes do P2P from
> > > PCI->QPI->PCI but the performance doesn't even come close to the sticker
> > > speed of any of those buses.
> > > 
> > > I'm not sure how anyone is going to deal with those issues, but it does
> > > firmly place us in world view #2 instead of #1. But, yes, I agree
> > > exposing information like in #2 full out to userspace, especially
> > > through sysfs, seems like a nightmare and I don't see anything in HMS to
> > > help with that. Providing an API to ask for memory (or another resource)
> > > that's accessible by a set of initiators and with a set of requirements
> > > for capabilities seems more manageable.  
> > 
> > Note that in #1 you have bridge that fully allow to express those path
> > limitation. So what you just describe can be fully reported to userspace.
> > 
> > I explained and given examples on how program adapt their computation to
> > the system topology it does exist today and people are even developing new
> > programming langage with some of those idea baked in.
> > 
> > So they are people out there that already rely on such information they
> > just do not get it from the kernel but from a mix of various device specific
> > API and they have to stich everything themself and develop a database of
> > quirk and gotcha. My proposal is to provide a coherent kernel API where
> > we can sanitize that informations and report it to userspace in a single
> > and coherent description.
> > 
> > Cheers,
> > Jérôme
> 
> I know it doesn't work everywhere, but I think it's worth enumerating what
> cases we can get some of these numbers for and where the complexity lies.
> I.e. What can the really determined user space library do today?

I gave an example in an email in this thread:

https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1821872.html

Is that the kind of example you are looking for? :)

> 
> So one open question is how close can we get in a userspace only prototype.
> At the end of the day userspace can often read HMAT directly if it wants to
> /sys/firmware/acpi/tables/HMAT.  Obviously that gets us only the end to
> end view (world 2).  I dislike the limitations of that as much as the next
> person. It is slowly improving with the word "Auditable" being
> kicked around - btw anyone interested in ACPI who works for a UEFI
> member, there are efforts going on and more viewpoints would be great.
> Expect some baby steps shortly.
> 
> For devices on PCIe (and protocols on top of it e.g. CCIX), a lot of
> this is discoverable to some degree. 
> * Link speed,
> * Number of Lanes,
> * Full topology.

Yes, for discoverable buses like PCIe and all its derivatives (CCIX,
OpenCAPI, ...) userspace will have a way to find the topology. The
issue lies with the orthogonal topology of extra buses that are not
necessarily enumerated or do not have a device driver at present, and
especially how they interact with each other (can you cross them? ...).

> 
> What isn't there (I think)
> * In component latency / bandwidth limitations (some activity going
>   on to improve that long term)
> * Effect of credit allocations etc on effectively bandwidth - interconnect
>   performance is a whole load of black magic.
> 
> Presumably there is some information available from NVLink etc?

From my point of view we want to give the best-case sticker value to
userspace, ie the bandwidth the engineers who designed the bus swore
their hardware delivers :)

I believe it is the best approximation we can deliver.

> 
> So whilst I really like the proposal in some ways, I wonder how much exploration
> could be done of the usefulness of the data without touching the kernel at all.
> 
> The other aspect that is needed to actually make this 'dynamically' useful is
> to be able to map whatever Performance Counters are available to the relevant
> 'links', bridges etc.   Ticket numbers are not all that useful unfortunately
> except for small amounts of data on lightly loaded buses.
> 
> The kernel ultimately only needs to have a model of this topology if:
> 1) It's going to use it itself

I don't think this should be a criterion; the kernel is not using the
GPU or network adapter to browse the web for itself (at least I hope
the Linux kernel is not self-aware ;)). So this kind of topology is not
of much use to the kernel. The kernel will only care about CPUs and
memory that abide by the memory model of the platform. It will also
care about more irregular CPU inter-connects, ie CPUs on the same mega
substrate likely have a faster inter-connect between them than to the
ones in a different physical socket. NUMA distance can model that.
Dunno if more than that would be useful to the kernel.

> 2) Its going to do something automatic with it.

The information is intended for userspace, for applications that use
that information. Today applications get that information from non-
standard sources, and I would like to provide this in a standard
common place in the kernel for a few reasons:
    - Common model with explicit definitions of what is what and
      what the rules are. No need for userspace to understand the
      specificities of various kernel sub-systems.
    - Define a unique identifier for _every_ type of memory in the
      system, even device memory, so that I can define syscalls to
      operate on those memories (can not do that in a device driver).
    - Integrate with core mm so that long term we can move more of
      the individual device memory management into core components.

> 3) It needs to fix garbage info or supplement with things only the kernel knows.

Yes, the kernel is expected to fix the information it gets and sanitize
it so that userspace does not have to grow a database of quirks and
workarounds. Moreover the kernel can also benchmark inter-connects and
adapt the reported bandwidth and latency if this is ever something
people would like to see.


I will post a v2 split in two, separating the common helpers from the
sysfs and syscall parts. I need the common helpers today for the
single-device case and have users for that code (nouveau and amdgpu for
starters). I want to continue the sysfs and syscall discussion, and I
need to reformulate things and give a better explanation of why I think
the way I am doing things has more value than any other.

Dunno if I will have time to finish reworking all this before the
end of this year.

Cheers,
Jérôme