[v4,0/2] Add p2p via dmabuf to habanalabs

Message ID	20210705130314.11519-1-ogabbay@kernel.org (mailing list archive)
Headers	show Return-Path: <linux-media-owner@kernel.org> From: Oded Gabbay <ogabbay@kernel.org> To: linux-kernel@vger.kernel.org, gregkh@linuxfoundation.org Cc: sumit.semwal@linaro.org, christian.koenig@amd.com, daniel.vetter@ffwll.ch, galpress@amazon.com, sleybo@amazon.com, dri-devel@lists.freedesktop.org, jgg@ziepe.ca, linux-rdma@vger.kernel.org, linux-media@vger.kernel.org, dledford@redhat.com, airlied@gmail.com, alexander.deucher@amd.com, leonro@nvidia.com, hch@lst.de, amd-gfx@lists.freedesktop.org, linaro-mm-sig@lists.linaro.org Subject: [PATCH v4 0/2] Add p2p via dmabuf to habanalabs Date: Mon, 5 Jul 2021 16:03:12 +0300 Message-Id: <20210705130314.11519-1-ogabbay@kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	Add p2p via dmabuf to habanalabs \| expand [v4,0/2] Add p2p via dmabuf to habanalabs [v4,1/2] habanalabs: define uAPI to export FD for DMA-BUF [v4,2/2] habanalabs: add support for dma-buf exporter

Oded Gabbay July 5, 2021, 1:03 p.m. UTC

Hi,
I'm sending v4 of this patch-set following the long email thread.
I want to thank Jason for reviewing v3 and pointing out the errors, saving
us time later to debug it :)

I consulted with Christian on how to fix patch 2 (the implementation) and
at the end of the day I shamelessly copied the relevant content from
amdgpu_vram_mgr_alloc_sgt() and amdgpu_dma_buf_attach(), regarding the
usage of dma_map_resource() and pci_p2pdma_distance_many(), respectively.

I also made a few improvements after looking at the relevant code in amdgpu.
The details are in the changelog of patch 2.

I took the time to write an import code into the driver, allowing me to
check real P2P with two Gaudi devices, one as exporter and the other as
importer. I'm not going to include the import code in the product, it was
just for testing purposes (although I can share it if anyone wants).

I run it on a bare-metal environment with IOMMU enabled, on a sky-lake CPU
with a white-listed PCIe bridge (to make the pci_p2pdma_distance_many happy).

Greg, I hope this will be good enough for you to merge this code.

Thanks,
Oded

Oded Gabbay (1):
  habanalabs: define uAPI to export FD for DMA-BUF

Tomer Tayar (1):
  habanalabs: add support for dma-buf exporter

 drivers/misc/habanalabs/Kconfig             |   1 +
 drivers/misc/habanalabs/common/habanalabs.h |  26 ++
 drivers/misc/habanalabs/common/memory.c     | 480 +++++++++++++++++++-
 drivers/misc/habanalabs/gaudi/gaudi.c       |   1 +
 drivers/misc/habanalabs/goya/goya.c         |   1 +
 include/uapi/misc/habanalabs.h              |  28 +-
 6 files changed, 532 insertions(+), 5 deletions(-)

Daniel Vetter July 6, 2021, 8:40 a.m. UTC | #1

On Mon, Jul 05, 2021 at 04:03:12PM +0300, Oded Gabbay wrote:
> Hi,
> I'm sending v4 of this patch-set following the long email thread.
> I want to thank Jason for reviewing v3 and pointing out the errors, saving
> us time later to debug it :)
> 
> I consulted with Christian on how to fix patch 2 (the implementation) and
> at the end of the day I shamelessly copied the relevant content from
> amdgpu_vram_mgr_alloc_sgt() and amdgpu_dma_buf_attach(), regarding the
> usage of dma_map_resource() and pci_p2pdma_distance_many(), respectively.
> 
> I also made a few improvements after looking at the relevant code in amdgpu.
> The details are in the changelog of patch 2.
> 
> I took the time to write an import code into the driver, allowing me to
> check real P2P with two Gaudi devices, one as exporter and the other as
> importer. I'm not going to include the import code in the product, it was
> just for testing purposes (although I can share it if anyone wants).
> 
> I run it on a bare-metal environment with IOMMU enabled, on a sky-lake CPU
> with a white-listed PCIe bridge (to make the pci_p2pdma_distance_many happy).
> 
> Greg, I hope this will be good enough for you to merge this code.

So we're officially going to use dri-devel for technical details review
and then Greg for merging so we don't have to deal with other merge
criteria dri-devel folks have?

I don't expect anything less by now, but it does make the original claim
that drivers/misc will not step all over accelerators folks a complete
farce under the totally-not-a-gpu banner.

This essentially means that for any other accelerator stack that doesn't
fit the dri-devel merge criteria, even if it's acting like a gpu and uses
other gpu driver stuff, you can just send it to Greg and it's good to go.

There's quite a lot of these floating around actually (and many do have
semi-open runtimes, like habanalabs have now too, just not open enough to
be actually useful). It's going to be absolutely lovely having to explain
to these companies in background chats why habanalabs gets away with their
stack and they don't.

Or maybe we should just merge them all and give up on the idea of having
open cross-vendor driver stacks for these accelerators.

Thanks, Daniel

> 
> Thanks,
> Oded
> 
> Oded Gabbay (1):
>   habanalabs: define uAPI to export FD for DMA-BUF
> 
> Tomer Tayar (1):
>   habanalabs: add support for dma-buf exporter
> 
>  drivers/misc/habanalabs/Kconfig             |   1 +
>  drivers/misc/habanalabs/common/habanalabs.h |  26 ++
>  drivers/misc/habanalabs/common/memory.c     | 480 +++++++++++++++++++-
>  drivers/misc/habanalabs/gaudi/gaudi.c       |   1 +
>  drivers/misc/habanalabs/goya/goya.c         |   1 +
>  include/uapi/misc/habanalabs.h              |  28 +-
>  6 files changed, 532 insertions(+), 5 deletions(-)
> 
> -- 
> 2.25.1
>

Oded Gabbay July 6, 2021, 10:03 a.m. UTC | #2

On Tue, Jul 6, 2021 at 11:40 AM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Mon, Jul 05, 2021 at 04:03:12PM +0300, Oded Gabbay wrote:
> > Hi,
> > I'm sending v4 of this patch-set following the long email thread.
> > I want to thank Jason for reviewing v3 and pointing out the errors, saving
> > us time later to debug it :)
> >
> > I consulted with Christian on how to fix patch 2 (the implementation) and
> > at the end of the day I shamelessly copied the relevant content from
> > amdgpu_vram_mgr_alloc_sgt() and amdgpu_dma_buf_attach(), regarding the
> > usage of dma_map_resource() and pci_p2pdma_distance_many(), respectively.
> >
> > I also made a few improvements after looking at the relevant code in amdgpu.
> > The details are in the changelog of patch 2.
> >
> > I took the time to write an import code into the driver, allowing me to
> > check real P2P with two Gaudi devices, one as exporter and the other as
> > importer. I'm not going to include the import code in the product, it was
> > just for testing purposes (although I can share it if anyone wants).
> >
> > I run it on a bare-metal environment with IOMMU enabled, on a sky-lake CPU
> > with a white-listed PCIe bridge (to make the pci_p2pdma_distance_many happy).
> >
> > Greg, I hope this will be good enough for you to merge this code.
>
> So we're officially going to use dri-devel for technical details review
> and then Greg for merging so we don't have to deal with other merge
> criteria dri-devel folks have?
I'm glad to receive any help or review, regardless of the subsystem
the person giving that help belongs to.

>
> I don't expect anything less by now, but it does make the original claim
> that drivers/misc will not step all over accelerators folks a complete
> farce under the totally-not-a-gpu banner.
>
> This essentially means that for any other accelerator stack that doesn't
> fit the dri-devel merge criteria, even if it's acting like a gpu and uses
> other gpu driver stuff, you can just send it to Greg and it's good to go.

What's wrong with Greg ??? ;)

On a more serious note, yes, I do think the dri-devel merge criteria
is very extreme, and effectively drives-out many AI accelerator
companies that want to contribute to the kernel but can't/won't open
their software IP and patents.

I think the expectation from AI startups (who are 90% of the deep
learning field) to cooperate outside of company boundaries is not
realistic, especially on the user-side, where the real IP of the
company resides.

Personally I don't think there is a real justification for that at
this point of time, but if it will make you (and other people here)
happy I really don't mind creating a non-gpu accelerator subsystem
that will contain all the totally-not-a-gpu accelerators, and will
have a more relaxed criteria for upstreaming. Something along an
"rdma-core" style library looks like the correct amount of user-level
open source that should be enough.

The question is, what will happen later ? Will it be sufficient to
"allow" us to use dmabuf and maybe other gpu stuff in the future (e.g.
hmm) ?

If the community and dri-devel maintainers (and you among them) will
assure me it is good enough, then I'll happily contribute my work and
personal time to organize this effort and implement it.

Thanks,
oded

>
> There's quite a lot of these floating around actually (and many do have
> semi-open runtimes, like habanalabs have now too, just not open enough to
> be actually useful). It's going to be absolutely lovely having to explain
> to these companies in background chats why habanalabs gets away with their
> stack and they don't.
>
> Or maybe we should just merge them all and give up on the idea of having
> open cross-vendor driver stacks for these accelerators.
>
> Thanks, Daniel
>
> >
> > Thanks,
> > Oded
> >
> > Oded Gabbay (1):
> >   habanalabs: define uAPI to export FD for DMA-BUF
> >
> > Tomer Tayar (1):
> >   habanalabs: add support for dma-buf exporter
> >
> >  drivers/misc/habanalabs/Kconfig             |   1 +
> >  drivers/misc/habanalabs/common/habanalabs.h |  26 ++
> >  drivers/misc/habanalabs/common/memory.c     | 480 +++++++++++++++++++-
> >  drivers/misc/habanalabs/gaudi/gaudi.c       |   1 +
> >  drivers/misc/habanalabs/goya/goya.c         |   1 +
> >  include/uapi/misc/habanalabs.h              |  28 +-
> >  6 files changed, 532 insertions(+), 5 deletions(-)
> >
> > --
> > 2.25.1
> >
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

Daniel Vetter July 6, 2021, 10:36 a.m. UTC | #3

On Tue, Jul 6, 2021 at 12:03 PM Oded Gabbay <oded.gabbay@gmail.com> wrote:
>
> On Tue, Jul 6, 2021 at 11:40 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Mon, Jul 05, 2021 at 04:03:12PM +0300, Oded Gabbay wrote:
> > > Hi,
> > > I'm sending v4 of this patch-set following the long email thread.
> > > I want to thank Jason for reviewing v3 and pointing out the errors, saving
> > > us time later to debug it :)
> > >
> > > I consulted with Christian on how to fix patch 2 (the implementation) and
> > > at the end of the day I shamelessly copied the relevant content from
> > > amdgpu_vram_mgr_alloc_sgt() and amdgpu_dma_buf_attach(), regarding the
> > > usage of dma_map_resource() and pci_p2pdma_distance_many(), respectively.
> > >
> > > I also made a few improvements after looking at the relevant code in amdgpu.
> > > The details are in the changelog of patch 2.
> > >
> > > I took the time to write an import code into the driver, allowing me to
> > > check real P2P with two Gaudi devices, one as exporter and the other as
> > > importer. I'm not going to include the import code in the product, it was
> > > just for testing purposes (although I can share it if anyone wants).
> > >
> > > I run it on a bare-metal environment with IOMMU enabled, on a sky-lake CPU
> > > with a white-listed PCIe bridge (to make the pci_p2pdma_distance_many happy).
> > >
> > > Greg, I hope this will be good enough for you to merge this code.
> >
> > So we're officially going to use dri-devel for technical details review
> > and then Greg for merging so we don't have to deal with other merge
> > criteria dri-devel folks have?
> I'm glad to receive any help or review, regardless of the subsystem
> the person giving that help belongs to.
>
> >
> > I don't expect anything less by now, but it does make the original claim
> > that drivers/misc will not step all over accelerators folks a complete
> > farce under the totally-not-a-gpu banner.
> >
> > This essentially means that for any other accelerator stack that doesn't
> > fit the dri-devel merge criteria, even if it's acting like a gpu and uses
> > other gpu driver stuff, you can just send it to Greg and it's good to go.
>
> What's wrong with Greg ??? ;)
>
> On a more serious note, yes, I do think the dri-devel merge criteria
> is very extreme, and effectively drives-out many AI accelerator
> companies that want to contribute to the kernel but can't/won't open
> their software IP and patents.
>
> I think the expectation from AI startups (who are 90% of the deep
> learning field) to cooperate outside of company boundaries is not
> realistic, especially on the user-side, where the real IP of the
> company resides.
>
> Personally I don't think there is a real justification for that at
> this point of time, but if it will make you (and other people here)
> happy I really don't mind creating a non-gpu accelerator subsystem
> that will contain all the totally-not-a-gpu accelerators, and will
> have a more relaxed criteria for upstreaming. Something along an
> "rdma-core" style library looks like the correct amount of user-level
> open source that should be enough.
>
> The question is, what will happen later ? Will it be sufficient to
> "allow" us to use dmabuf and maybe other gpu stuff in the future (e.g.
> hmm) ?
>
> If the community and dri-devel maintainers (and you among them) will
> assure me it is good enough, then I'll happily contribute my work and
> personal time to organize this effort and implement it.

I think dri-devel stance is pretty clear and well known: We want the
userspace to be open, because that's where most of the driver stack
is. Without an open driver stack there's no way to ever have anything
cross-vendor.

And that includes the compiler and anything else you need to drive the hardware.

Afaik linux cpu arch ports are also not accepted if there's no open
gcc or llvm port around, because without that the overall stack just
becomes useless.

If that means AI companies don't want to open our their hw specs
enough to allow that, so be it - all you get in that case is
offloading the kernel side  of the stack for convenience, with zero
long term prospects to ever make this into a cross vendor subsystem
stack that does something useful. If the business case says you can't
open up your hw enough for that, I really don't see the point in
merging such a driver, it'll be an unmaintainable stack by anyone else
who's not having access to those NDA covered specs and patents and
everything.

If the stack is actually cross vendor to begin with that's just bonus,
but generally that doesn't happen voluntarily and needs a few years to
decades to get there. So that's not really something we require.

tldr; just a runtime isn't enough for dri-devel.

Now Greg seems to be happy to merge kernel drivers that aren't useful
with the open bits provided, so *shrug*.

Cheers, Daniel

PS: If requiring an actually useful open driver stack is somehow
*extreme* I have no idea why we even bother with merging device
drivers to upstream. Just make a stable driver api and done, vendors
can then do whatever they feel like and protect their "valuable IP and
patents" or whatever it is.

> Thanks,
> oded
>
> >
> > There's quite a lot of these floating around actually (and many do have
> > semi-open runtimes, like habanalabs have now too, just not open enough to
> > be actually useful). It's going to be absolutely lovely having to explain
> > to these companies in background chats why habanalabs gets away with their
> > stack and they don't.
> >
> > Or maybe we should just merge them all and give up on the idea of having
> > open cross-vendor driver stacks for these accelerators.
> >
> > Thanks, Daniel
> >
> > >
> > > Thanks,
> > > Oded
> > >
> > > Oded Gabbay (1):
> > >   habanalabs: define uAPI to export FD for DMA-BUF
> > >
> > > Tomer Tayar (1):
> > >   habanalabs: add support for dma-buf exporter
> > >
> > >  drivers/misc/habanalabs/Kconfig             |   1 +
> > >  drivers/misc/habanalabs/common/habanalabs.h |  26 ++
> > >  drivers/misc/habanalabs/common/memory.c     | 480 +++++++++++++++++++-
> > >  drivers/misc/habanalabs/gaudi/gaudi.c       |   1 +
> > >  drivers/misc/habanalabs/goya/goya.c         |   1 +
> > >  include/uapi/misc/habanalabs.h              |  28 +-
> > >  6 files changed, 532 insertions(+), 5 deletions(-)
> > >
> > > --
> > > 2.25.1
> > >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch

Daniel Vetter July 6, 2021, 10:47 a.m. UTC | #4

On Tue, Jul 6, 2021 at 12:36 PM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> On Tue, Jul 6, 2021 at 12:03 PM Oded Gabbay <oded.gabbay@gmail.com> wrote:
> >
> > On Tue, Jul 6, 2021 at 11:40 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > >
> > > On Mon, Jul 05, 2021 at 04:03:12PM +0300, Oded Gabbay wrote:
> > > > Hi,
> > > > I'm sending v4 of this patch-set following the long email thread.
> > > > I want to thank Jason for reviewing v3 and pointing out the errors, saving
> > > > us time later to debug it :)
> > > >
> > > > I consulted with Christian on how to fix patch 2 (the implementation) and
> > > > at the end of the day I shamelessly copied the relevant content from
> > > > amdgpu_vram_mgr_alloc_sgt() and amdgpu_dma_buf_attach(), regarding the
> > > > usage of dma_map_resource() and pci_p2pdma_distance_many(), respectively.
> > > >
> > > > I also made a few improvements after looking at the relevant code in amdgpu.
> > > > The details are in the changelog of patch 2.
> > > >
> > > > I took the time to write an import code into the driver, allowing me to
> > > > check real P2P with two Gaudi devices, one as exporter and the other as
> > > > importer. I'm not going to include the import code in the product, it was
> > > > just for testing purposes (although I can share it if anyone wants).
> > > >
> > > > I run it on a bare-metal environment with IOMMU enabled, on a sky-lake CPU
> > > > with a white-listed PCIe bridge (to make the pci_p2pdma_distance_many happy).
> > > >
> > > > Greg, I hope this will be good enough for you to merge this code.
> > >
> > > So we're officially going to use dri-devel for technical details review
> > > and then Greg for merging so we don't have to deal with other merge
> > > criteria dri-devel folks have?
> > I'm glad to receive any help or review, regardless of the subsystem
> > the person giving that help belongs to.
> >
> > >
> > > I don't expect anything less by now, but it does make the original claim
> > > that drivers/misc will not step all over accelerators folks a complete
> > > farce under the totally-not-a-gpu banner.
> > >
> > > This essentially means that for any other accelerator stack that doesn't
> > > fit the dri-devel merge criteria, even if it's acting like a gpu and uses
> > > other gpu driver stuff, you can just send it to Greg and it's good to go.
> >
> > What's wrong with Greg ??? ;)
> >
> > On a more serious note, yes, I do think the dri-devel merge criteria
> > is very extreme, and effectively drives-out many AI accelerator
> > companies that want to contribute to the kernel but can't/won't open
> > their software IP and patents.
> >
> > I think the expectation from AI startups (who are 90% of the deep
> > learning field) to cooperate outside of company boundaries is not
> > realistic, especially on the user-side, where the real IP of the
> > company resides.
> >
> > Personally I don't think there is a real justification for that at
> > this point of time, but if it will make you (and other people here)
> > happy I really don't mind creating a non-gpu accelerator subsystem
> > that will contain all the totally-not-a-gpu accelerators, and will
> > have a more relaxed criteria for upstreaming. Something along an
> > "rdma-core" style library looks like the correct amount of user-level
> > open source that should be enough.
> >
> > The question is, what will happen later ? Will it be sufficient to
> > "allow" us to use dmabuf and maybe other gpu stuff in the future (e.g.
> > hmm) ?
> >
> > If the community and dri-devel maintainers (and you among them) will
> > assure me it is good enough, then I'll happily contribute my work and
> > personal time to organize this effort and implement it.
>
> I think dri-devel stance is pretty clear and well known: We want the
> userspace to be open, because that's where most of the driver stack
> is. Without an open driver stack there's no way to ever have anything
> cross-vendor.
>
> And that includes the compiler and anything else you need to drive the hardware.
>
> Afaik linux cpu arch ports are also not accepted if there's no open
> gcc or llvm port around, because without that the overall stack just
> becomes useless.
>
> If that means AI companies don't want to open our their hw specs
> enough to allow that, so be it - all you get in that case is
> offloading the kernel side  of the stack for convenience, with zero
> long term prospects to ever make this into a cross vendor subsystem
> stack that does something useful. If the business case says you can't
> open up your hw enough for that, I really don't see the point in
> merging such a driver, it'll be an unmaintainable stack by anyone else
> who's not having access to those NDA covered specs and patents and
> everything.
>
> If the stack is actually cross vendor to begin with that's just bonus,
> but generally that doesn't happen voluntarily and needs a few years to
> decades to get there. So that's not really something we require.
>
> tldr; just a runtime isn't enough for dri-devel.
>
> Now Greg seems to be happy to merge kernel drivers that aren't useful
> with the open bits provided, so *shrug*.
>
> Cheers, Daniel
>
> PS: If requiring an actually useful open driver stack is somehow
> *extreme* I have no idea why we even bother with merging device
> drivers to upstream. Just make a stable driver api and done, vendors
> can then do whatever they feel like and protect their "valuable IP and
> patents" or whatever it is.

So perhaps this isn't clear, so let's explain this differently.

The deal when having a driver in upstream is that both the vendor and
upstream benefits:
- vendor gets their driver carried and adjusted in upstream, because
there's no stable uapi, and the benefit of being included everywhere
by default
- upstream gets the benefit to be able to hack around in more drivers,
which generally leads to a more robust subsystem and driver
architecture

Now what you want is to have the benefits for you, without giving the
wider community the benefit of actually being able to hack on your
driver stack. Because you prefer to keep critical pieces of it
protected and closed, which makes sure no one can create a new
cross-vendor stack without your permission. Or without investing a lot
of time into reverse-engineering the hardware. That's not extreme,
that's just preferring to have your cake and eat it too.

And frankly on dri-devel we don't take such a loopsided deal. Greg
otoh seems to be totally fine, or not really understand what it takes
to build an accelerator stack, or I dunno what, but he's happy merging
them.

Cheers, Daniel


> > Thanks,
> > oded
> >
> > >
> > > There's quite a lot of these floating around actually (and many do have
> > > semi-open runtimes, like habanalabs have now too, just not open enough to
> > > be actually useful). It's going to be absolutely lovely having to explain
> > > to these companies in background chats why habanalabs gets away with their
> > > stack and they don't.
> > >
> > > Or maybe we should just merge them all and give up on the idea of having
> > > open cross-vendor driver stacks for these accelerators.
> > >
> > > Thanks, Daniel
> > >
> > > >
> > > > Thanks,
> > > > Oded
> > > >
> > > > Oded Gabbay (1):
> > > >   habanalabs: define uAPI to export FD for DMA-BUF
> > > >
> > > > Tomer Tayar (1):
> > > >   habanalabs: add support for dma-buf exporter
> > > >
> > > >  drivers/misc/habanalabs/Kconfig             |   1 +
> > > >  drivers/misc/habanalabs/common/habanalabs.h |  26 ++
> > > >  drivers/misc/habanalabs/common/memory.c     | 480 +++++++++++++++++++-
> > > >  drivers/misc/habanalabs/gaudi/gaudi.c       |   1 +
> > > >  drivers/misc/habanalabs/goya/goya.c         |   1 +
> > > >  include/uapi/misc/habanalabs.h              |  28 +-
> > > >  6 files changed, 532 insertions(+), 5 deletions(-)
> > > >
> > > > --
> > > > 2.25.1
> > > >
> > >
> > > --
> > > Daniel Vetter
> > > Software Engineer, Intel Corporation
> > > http://blog.ffwll.ch
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

Daniel Vetter July 6, 2021, 12:07 p.m. UTC | #5

On Tue, Jul 6, 2021 at 12:47 PM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> On Tue, Jul 6, 2021 at 12:36 PM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> > On Tue, Jul 6, 2021 at 12:03 PM Oded Gabbay <oded.gabbay@gmail.com> wrote:
> > >
> > > On Tue, Jul 6, 2021 at 11:40 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > > >
> > > > On Mon, Jul 05, 2021 at 04:03:12PM +0300, Oded Gabbay wrote:
> > > > > Hi,
> > > > > I'm sending v4 of this patch-set following the long email thread.
> > > > > I want to thank Jason for reviewing v3 and pointing out the errors, saving
> > > > > us time later to debug it :)
> > > > >
> > > > > I consulted with Christian on how to fix patch 2 (the implementation) and
> > > > > at the end of the day I shamelessly copied the relevant content from
> > > > > amdgpu_vram_mgr_alloc_sgt() and amdgpu_dma_buf_attach(), regarding the
> > > > > usage of dma_map_resource() and pci_p2pdma_distance_many(), respectively.
> > > > >
> > > > > I also made a few improvements after looking at the relevant code in amdgpu.
> > > > > The details are in the changelog of patch 2.
> > > > >
> > > > > I took the time to write an import code into the driver, allowing me to
> > > > > check real P2P with two Gaudi devices, one as exporter and the other as
> > > > > importer. I'm not going to include the import code in the product, it was
> > > > > just for testing purposes (although I can share it if anyone wants).
> > > > >
> > > > > I run it on a bare-metal environment with IOMMU enabled, on a sky-lake CPU
> > > > > with a white-listed PCIe bridge (to make the pci_p2pdma_distance_many happy).
> > > > >
> > > > > Greg, I hope this will be good enough for you to merge this code.
> > > >
> > > > So we're officially going to use dri-devel for technical details review
> > > > and then Greg for merging so we don't have to deal with other merge
> > > > criteria dri-devel folks have?
> > > I'm glad to receive any help or review, regardless of the subsystem
> > > the person giving that help belongs to.
> > >
> > > >
> > > > I don't expect anything less by now, but it does make the original claim
> > > > that drivers/misc will not step all over accelerators folks a complete
> > > > farce under the totally-not-a-gpu banner.
> > > >
> > > > This essentially means that for any other accelerator stack that doesn't
> > > > fit the dri-devel merge criteria, even if it's acting like a gpu and uses
> > > > other gpu driver stuff, you can just send it to Greg and it's good to go.
> > >
> > > What's wrong with Greg ??? ;)
> > >
> > > On a more serious note, yes, I do think the dri-devel merge criteria
> > > is very extreme, and effectively drives-out many AI accelerator
> > > companies that want to contribute to the kernel but can't/won't open
> > > their software IP and patents.
> > >
> > > I think the expectation from AI startups (who are 90% of the deep
> > > learning field) to cooperate outside of company boundaries is not
> > > realistic, especially on the user-side, where the real IP of the
> > > company resides.
> > >
> > > Personally I don't think there is a real justification for that at
> > > this point of time, but if it will make you (and other people here)
> > > happy I really don't mind creating a non-gpu accelerator subsystem
> > > that will contain all the totally-not-a-gpu accelerators, and will
> > > have a more relaxed criteria for upstreaming. Something along an
> > > "rdma-core" style library looks like the correct amount of user-level
> > > open source that should be enough.

On the "rdma-core" idea, afaik rdma NIC do not have fully programmable
cores in their hw, for which you'd need some kind of compiler to make
use of the hardware and the interfaces the kernel provides? So not
really compareable, but also my understanding is that rdma-core does
actually allow you to reasonable use&drive all the hw features and
kernel interfaces fully.

So we actually want less on dri-devel, because for compute/accel chips
we're currently happy with a vendor userspace. It just needs to be
functional and complete, and open in its entirety.

Now if there's going to be a AI/NN/spatial compute core runtime with
all the things included that's cross-vendor that's obviously going to
be great, but that's strictly a bonus. And eventually the long-term
goal, once we have a few open stacks from various vendors. But atm we
have 0 open stacks, so one thing at a time.

> > > The question is, what will happen later ? Will it be sufficient to
> > > "allow" us to use dmabuf and maybe other gpu stuff in the future (e.g.
> > > hmm) ?
> > >
> > > If the community and dri-devel maintainers (and you among them) will
> > > assure me it is good enough, then I'll happily contribute my work and
> > > personal time to organize this effort and implement it.
> >
> > I think dri-devel stance is pretty clear and well known: We want the
> > userspace to be open, because that's where most of the driver stack
> > is. Without an open driver stack there's no way to ever have anything
> > cross-vendor.
> >
> > And that includes the compiler and anything else you need to drive the hardware.
> >
> > Afaik linux cpu arch ports are also not accepted if there's no open
> > gcc or llvm port around, because without that the overall stack just
> > becomes useless.
> >
> > If that means AI companies don't want to open our their hw specs
> > enough to allow that, so be it - all you get in that case is
> > offloading the kernel side  of the stack for convenience, with zero
> > long term prospects to ever make this into a cross vendor subsystem
> > stack that does something useful. If the business case says you can't
> > open up your hw enough for that, I really don't see the point in
> > merging such a driver, it'll be an unmaintainable stack by anyone else
> > who's not having access to those NDA covered specs and patents and
> > everything.
> >
> > If the stack is actually cross vendor to begin with that's just bonus,
> > but generally that doesn't happen voluntarily and needs a few years to
> > decades to get there. So that's not really something we require.
> >
> > tldr; just a runtime isn't enough for dri-devel.
> >
> > Now Greg seems to be happy to merge kernel drivers that aren't useful
> > with the open bits provided, so *shrug*.
> >
> > Cheers, Daniel
> >
> > PS: If requiring an actually useful open driver stack is somehow
> > *extreme* I have no idea why we even bother with merging device
> > drivers to upstream. Just make a stable driver api and done, vendors
> > can then do whatever they feel like and protect their "valuable IP and
> > patents" or whatever it is.
>
> So perhaps this isn't clear, so let's explain this differently.
>
> The deal when having a driver in upstream is that both the vendor and
> upstream benefits:
> - vendor gets their driver carried and adjusted in upstream, because
> there's no stable uapi, and the benefit of being included everywhere

s/uapi/kernel driver api/ ofc, but I got it right in the first reply at least.
-Daniel

> by default
> - upstream gets the benefit to be able to hack around in more drivers,
> which generally leads to a more robust subsystem and driver
> architecture
>
> Now what you want is to have the benefits for you, without giving the
> wider community the benefit of actually being able to hack on your
> driver stack. Because you prefer to keep critical pieces of it
> protected and closed, which makes sure no one can create a new
> cross-vendor stack without your permission. Or without investing a lot
> of time into reverse-engineering the hardware. That's not extreme,
> that's just preferring to have your cake and eat it too.
>
> And frankly on dri-devel we don't take such a loopsided deal. Greg
> otoh seems to be totally fine, or not really understand what it takes
> to build an accelerator stack, or I dunno what, but he's happy merging
> them.
>
> Cheers, Daniel
>
>
> > > Thanks,
> > > oded
> > >
> > > >
> > > > There's quite a lot of these floating around actually (and many do have
> > > > semi-open runtimes, like habanalabs have now too, just not open enough to
> > > > be actually useful). It's going to be absolutely lovely having to explain
> > > > to these companies in background chats why habanalabs gets away with their
> > > > stack and they don't.
> > > >
> > > > Or maybe we should just merge them all and give up on the idea of having
> > > > open cross-vendor driver stacks for these accelerators.
> > > >
> > > > Thanks, Daniel
> > > >
> > > > >
> > > > > Thanks,
> > > > > Oded
> > > > >
> > > > > Oded Gabbay (1):
> > > > >   habanalabs: define uAPI to export FD for DMA-BUF
> > > > >
> > > > > Tomer Tayar (1):
> > > > >   habanalabs: add support for dma-buf exporter
> > > > >
> > > > >  drivers/misc/habanalabs/Kconfig             |   1 +
> > > > >  drivers/misc/habanalabs/common/habanalabs.h |  26 ++
> > > > >  drivers/misc/habanalabs/common/memory.c     | 480 +++++++++++++++++++-
> > > > >  drivers/misc/habanalabs/gaudi/gaudi.c       |   1 +
> > > > >  drivers/misc/habanalabs/goya/goya.c         |   1 +
> > > > >  include/uapi/misc/habanalabs.h              |  28 +-
> > > > >  6 files changed, 532 insertions(+), 5 deletions(-)
> > > > >
> > > > > --
> > > > > 2.25.1
> > > > >
> > > >
> > > > --
> > > > Daniel Vetter
> > > > Software Engineer, Intel Corporation
> > > > http://blog.ffwll.ch
> >
> >
> >
> > --
> > Daniel Vetter
> > Software Engineer, Intel Corporation
> > http://blog.ffwll.ch
>
>
>
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

Christoph Hellwig July 6, 2021, 12:21 p.m. UTC | #6

On Tue, Jul 06, 2021 at 10:40:37AM +0200, Daniel Vetter wrote:
> > Greg, I hope this will be good enough for you to merge this code.
> 
> So we're officially going to use dri-devel for technical details review
> and then Greg for merging so we don't have to deal with other merge
> criteria dri-devel folks have?
> 
> I don't expect anything less by now, but it does make the original claim
> that drivers/misc will not step all over accelerators folks a complete
> farce under the totally-not-a-gpu banner.
> 
> This essentially means that for any other accelerator stack that doesn't
> fit the dri-devel merge criteria, even if it's acting like a gpu and uses
> other gpu driver stuff, you can just send it to Greg and it's good to go.
> 
> There's quite a lot of these floating around actually (and many do have
> semi-open runtimes, like habanalabs have now too, just not open enough to
> be actually useful). It's going to be absolutely lovely having to explain
> to these companies in background chats why habanalabs gets away with their
> stack and they don't.

FYI, I fully agree with Daniel here.  Habanlabs needs to open up their
runtime if they want to push any additional feature in the kernel.
The current situation is not sustainable.

Christoph Hellwig July 6, 2021, 12:23 p.m. UTC | #7

On Tue, Jul 06, 2021 at 12:36:51PM +0200, Daniel Vetter wrote:
> Afaik linux cpu arch ports are also not accepted if there's no open
> gcc or llvm port around, because without that the overall stack just
> becomes useless.

Yes.  And the one architecture that has an open but not upstream
compiler already is more than enough of a pain to not repeat that
mistake ever again.

Daniel Vetter July 6, 2021, 12:23 p.m. UTC | #8

On Tue, Jul 06, 2021 at 02:21:10PM +0200, Christoph Hellwig wrote:
> On Tue, Jul 06, 2021 at 10:40:37AM +0200, Daniel Vetter wrote:
> > > Greg, I hope this will be good enough for you to merge this code.
> > 
> > So we're officially going to use dri-devel for technical details review
> > and then Greg for merging so we don't have to deal with other merge
> > criteria dri-devel folks have?
> > 
> > I don't expect anything less by now, but it does make the original claim
> > that drivers/misc will not step all over accelerators folks a complete
> > farce under the totally-not-a-gpu banner.
> > 
> > This essentially means that for any other accelerator stack that doesn't
> > fit the dri-devel merge criteria, even if it's acting like a gpu and uses
> > other gpu driver stuff, you can just send it to Greg and it's good to go.
> > 
> > There's quite a lot of these floating around actually (and many do have
> > semi-open runtimes, like habanalabs have now too, just not open enough to
> > be actually useful). It's going to be absolutely lovely having to explain
> > to these companies in background chats why habanalabs gets away with their
> > stack and they don't.
> 
> FYI, I fully agree with Daniel here.  Habanlabs needs to open up their
> runtime if they want to push any additional feature in the kernel.
> The current situation is not sustainable.

Before anyone replies: The runtime is open, the compiler is still closed.
This has become the new default for accel driver submissions, I think
mostly because all the interesting bits for non-3d accelerators are in the
accel ISA, and no longer in the runtime. So vendors are fairly happy to
throw in the runtime as a freebie.

It's still incomplete, and it's still useless if you want to actually hack
on the driver stack.
-Daniel

Oded Gabbay July 6, 2021, 12:45 p.m. UTC | #9

On Tue, Jul 6, 2021 at 3:23 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Tue, Jul 06, 2021 at 02:21:10PM +0200, Christoph Hellwig wrote:
> > On Tue, Jul 06, 2021 at 10:40:37AM +0200, Daniel Vetter wrote:
> > > > Greg, I hope this will be good enough for you to merge this code.
> > >
> > > So we're officially going to use dri-devel for technical details review
> > > and then Greg for merging so we don't have to deal with other merge
> > > criteria dri-devel folks have?
> > >
> > > I don't expect anything less by now, but it does make the original claim
> > > that drivers/misc will not step all over accelerators folks a complete
> > > farce under the totally-not-a-gpu banner.
> > >
> > > This essentially means that for any other accelerator stack that doesn't
> > > fit the dri-devel merge criteria, even if it's acting like a gpu and uses
> > > other gpu driver stuff, you can just send it to Greg and it's good to go.
> > >
> > > There's quite a lot of these floating around actually (and many do have
> > > semi-open runtimes, like habanalabs have now too, just not open enough to
> > > be actually useful). It's going to be absolutely lovely having to explain
> > > to these companies in background chats why habanalabs gets away with their
> > > stack and they don't.
> >
> > FYI, I fully agree with Daniel here.  Habanlabs needs to open up their
> > runtime if they want to push any additional feature in the kernel.
> > The current situation is not sustainable.
Well, that's like, your opinion...

>
> Before anyone replies: The runtime is open, the compiler is still closed.
> This has become the new default for accel driver submissions, I think
> mostly because all the interesting bits for non-3d accelerators are in the
> accel ISA, and no longer in the runtime. So vendors are fairly happy to
> throw in the runtime as a freebie.
>
> It's still incomplete, and it's still useless if you want to actually hack
> on the driver stack.
> -Daniel
> --
I don't understand what's not sustainable here.

There is zero code inside the driver that communicates or interacts
with our TPC code (TPC is the Tensor Processing Core).
Even submitting works to the TPC is done via a generic queue
interface. And that queue IP is common between all our engines
(TPC/DMA/NIC). The driver provides all the specs of that queue IP,
because the driver's code is handling that queue. But why is the TPC
compiler code even relevant here ?

btw, you can today see our TPC code at
https://github.com/HabanaAI/Habana_Custom_Kernel
There is a link there to the TPC user guide and link to download the
LLVM compiler.

Oded

Daniel Vetter July 6, 2021, 1:17 p.m. UTC | #10

On Tue, Jul 6, 2021 at 2:46 PM Oded Gabbay <oded.gabbay@gmail.com> wrote:
>
> On Tue, Jul 6, 2021 at 3:23 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > On Tue, Jul 06, 2021 at 02:21:10PM +0200, Christoph Hellwig wrote:
> > > On Tue, Jul 06, 2021 at 10:40:37AM +0200, Daniel Vetter wrote:
> > > > > Greg, I hope this will be good enough for you to merge this code.
> > > >
> > > > So we're officially going to use dri-devel for technical details review
> > > > and then Greg for merging so we don't have to deal with other merge
> > > > criteria dri-devel folks have?
> > > >
> > > > I don't expect anything less by now, but it does make the original claim
> > > > that drivers/misc will not step all over accelerators folks a complete
> > > > farce under the totally-not-a-gpu banner.
> > > >
> > > > This essentially means that for any other accelerator stack that doesn't
> > > > fit the dri-devel merge criteria, even if it's acting like a gpu and uses
> > > > other gpu driver stuff, you can just send it to Greg and it's good to go.
> > > >
> > > > There's quite a lot of these floating around actually (and many do have
> > > > semi-open runtimes, like habanalabs have now too, just not open enough to
> > > > be actually useful). It's going to be absolutely lovely having to explain
> > > > to these companies in background chats why habanalabs gets away with their
> > > > stack and they don't.
> > >
> > > FYI, I fully agree with Daniel here.  Habanlabs needs to open up their
> > > runtime if they want to push any additional feature in the kernel.
> > > The current situation is not sustainable.
> Well, that's like, your opinion...
>
> >
> > Before anyone replies: The runtime is open, the compiler is still closed.
> > This has become the new default for accel driver submissions, I think
> > mostly because all the interesting bits for non-3d accelerators are in the
> > accel ISA, and no longer in the runtime. So vendors are fairly happy to
> > throw in the runtime as a freebie.
> >
> > It's still incomplete, and it's still useless if you want to actually hack
> > on the driver stack.
> > -Daniel
> > --
> I don't understand what's not sustainable here.
>
> There is zero code inside the driver that communicates or interacts
> with our TPC code (TPC is the Tensor Processing Core).
> Even submitting works to the TPC is done via a generic queue
> interface. And that queue IP is common between all our engines
> (TPC/DMA/NIC). The driver provides all the specs of that queue IP,
> because the driver's code is handling that queue. But why is the TPC
> compiler code even relevant here ?

Can I use the hw how it's intended to be used without it?

If the answer is no, then essentially what you're doing with your
upstream driver is getting all the benefits of an upstream driver,
while upstream gets nothing. We can't use your stack, not as-is. Sure
we can use the queue, but we can't actually submit anything
interesting. And I'm pretty sure the point of your hw is to do more
than submit no-op packets to a queue.

This is all "I want my cake and eat it too" approach to upstreaming,
and it's totally fine attitude to have, but if you don't see why
there's maybe an different side to it then I don't get what you're
arguing. Upstream isn't free lunch for nothing.

Frankly I'm starting to assume you're arguing this all in bad faith
just because habanalabds doesn't want to actually have an open driver
stack, so any attack is good, no matter what. Which is also what
everyone else does who submits their accel driver to upstream, and
which gets us back to the starting point of this sub-thread of me
really appreciation how this will improve background discussions going
forward for everyone.

Like if the requirement for accel drivers truly is that you can submit
a dummy command to the queues then I have about 5-10 drivers at least
I could merge instantly. For something like the intel gpu driver it
would be about 50 lines of code (including all the structure boiler
plate the ioctls require)in userspace to submit a dummy queue command.
GPU and accel vendors would really love that, because it would allow
them to freeload on upstream and do essentially nothing in return.

And we'd end up with an unmaintainable disaster of a gpu or well
accelerator subsystem because there's nothing you can change or
improve because all the really useful bits of the stack are closed.
And ofc that's not any companies problem anymore, so ofc you with the
habanalabs hat on don't care and call this *extreme*.

> btw, you can today see our TPC code at
> https://github.com/HabanaAI/Habana_Custom_Kernel
> There is a link there to the TPC user guide and link to download the
> LLVM compiler.

I got stuck clicking links before I found the source for that llvm
compiler. Can you give me a direct link to the repo with sourcecode
instead please?

Thanks, Daniel

Jason Gunthorpe July 6, 2021, 1:44 p.m. UTC | #11

On Tue, Jul 06, 2021 at 02:07:16PM +0200, Daniel Vetter wrote:

> On the "rdma-core" idea, afaik rdma NIC do not have fully programmable
> cores in their hw, for which you'd need some kind of compiler to make
> use of the hardware and the interfaces the kernel provides? So not
> really compareable, but also my understanding is that rdma-core does
> actually allow you to reasonable use&drive all the hw features and
> kernel interfaces fully.

The whole HPC stack has speciality compilers of course. OpenMP, PGAS,
etc. These compilers map onto library primitives that eventually boil
down into rdma-core calls. Even the HW devices have various
programmability that are being targetted with compilers now. People
are making NIC devices with ARM cores/etc - P4 is emerging for some
packet processing tasks.

rdma-core can drive all the kernel interfaces with at least an ioctl
wrapper, and it has a test suite that tries to cover this. It does not
exercise the full HW capability, programmability, etc of every single
device.

I actually don't entirely know what everyone has built on top of
rdma-core, or how I'd try to map it the DRI ideas you are trying to
explain.

Should we ban all Intel RDMA drivers because they are shipping
proprietary Intel HPC compilers and proprietary Intel MPI which drives
their RDMA HW? Or is that OK because there are open analogs for some
of that stuff? And yes, the open versions are inferior in various
metrics.

Pragmatically what I want to see is enough RDMA common/open user space
to understand the uAPI and thus more about how the kernel driver
works. Forcing everyone into rdma-core has already prevented a number
of uAPI mistakes in drivers that would have been bad - so at least
this level really is valuable.

> So we actually want less on dri-devel, because for compute/accel chips
> we're currently happy with a vendor userspace. It just needs to be
> functional and complete, and open in its entirety.

In a sense yes: DRI doesn't insist on a single code base to act as the
kernel interface, but that is actually the thing that has brought the
most value to RDMA, IMHO.

We've certainly had some interesting successes because of this. The
first submission for AWS's EFA driver proposed to skip the rdma-core
step, which was rejected. However since EFA has been in that ecosystem
it has benefited greatly, I think.

However, in another sense no: RDMA hasn't been blocking, say Intel,
just because they have built proprietary stuff on top of our open
stack.

Honestly, I think GPU is approaching this backwards. Wayland should
have been designed to prevent proprietary userspace stacks.

Jason

Oded Gabbay July 6, 2021, 1:45 p.m. UTC | #12

On Tue, Jul 6, 2021 at 4:17 PM Daniel Vetter <daniel@ffwll.ch> wrote:
>
> On Tue, Jul 6, 2021 at 2:46 PM Oded Gabbay <oded.gabbay@gmail.com> wrote:
> >
> > On Tue, Jul 6, 2021 at 3:23 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> > >
> > > On Tue, Jul 06, 2021 at 02:21:10PM +0200, Christoph Hellwig wrote:
> > > > On Tue, Jul 06, 2021 at 10:40:37AM +0200, Daniel Vetter wrote:
> > > > > > Greg, I hope this will be good enough for you to merge this code.
> > > > >
> > > > > So we're officially going to use dri-devel for technical details review
> > > > > and then Greg for merging so we don't have to deal with other merge
> > > > > criteria dri-devel folks have?
> > > > >
> > > > > I don't expect anything less by now, but it does make the original claim
> > > > > that drivers/misc will not step all over accelerators folks a complete
> > > > > farce under the totally-not-a-gpu banner.
> > > > >
> > > > > This essentially means that for any other accelerator stack that doesn't
> > > > > fit the dri-devel merge criteria, even if it's acting like a gpu and uses
> > > > > other gpu driver stuff, you can just send it to Greg and it's good to go.
> > > > >
> > > > > There's quite a lot of these floating around actually (and many do have
> > > > > semi-open runtimes, like habanalabs have now too, just not open enough to
> > > > > be actually useful). It's going to be absolutely lovely having to explain
> > > > > to these companies in background chats why habanalabs gets away with their
> > > > > stack and they don't.
> > > >
> > > > FYI, I fully agree with Daniel here.  Habanlabs needs to open up their
> > > > runtime if they want to push any additional feature in the kernel.
> > > > The current situation is not sustainable.
> > Well, that's like, your opinion...
> >
> > >
> > > Before anyone replies: The runtime is open, the compiler is still closed.
> > > This has become the new default for accel driver submissions, I think
> > > mostly because all the interesting bits for non-3d accelerators are in the
> > > accel ISA, and no longer in the runtime. So vendors are fairly happy to
> > > throw in the runtime as a freebie.
> > >
> > > It's still incomplete, and it's still useless if you want to actually hack
> > > on the driver stack.
> > > -Daniel
> > > --
> > I don't understand what's not sustainable here.
> >
> > There is zero code inside the driver that communicates or interacts
> > with our TPC code (TPC is the Tensor Processing Core).
> > Even submitting works to the TPC is done via a generic queue
> > interface. And that queue IP is common between all our engines
> > (TPC/DMA/NIC). The driver provides all the specs of that queue IP,
> > because the driver's code is handling that queue. But why is the TPC
> > compiler code even relevant here ?
>
> Can I use the hw how it's intended to be used without it?
You can use the h/w with the userspace stack we are providing in our
github repos + website.
Part of the userspace stack is open sourced, part is closed source.
And I'm actively working on opening up more stuff as we go along.

>
> If the answer is no, then essentially what you're doing with your
> upstream driver is getting all the benefits of an upstream driver,
> while upstream gets nothing. We can't use your stack, not as-is. Sure
> we can use the queue, but we can't actually submit anything
> interesting. And I'm pretty sure the point of your hw is to do more
> than submit no-op packets to a queue.
>
> This is all "I want my cake and eat it too" approach to upstreaming,
> and it's totally fine attitude to have, but if you don't see why
> there's maybe an different side to it then I don't get what you're
> arguing. Upstream isn't free lunch for nothing.
>
> Frankly I'm starting to assume you're arguing this all in bad faith
> just because habanalabds doesn't want to actually have an open driver
> stack, so any attack is good, no matter what. Which is also what
> everyone else does who submits their accel driver to upstream, and
> which gets us back to the starting point of this sub-thread of me
> really appreciation how this will improve background discussions going
> forward for everyone.
>
> Like if the requirement for accel drivers truly is that you can submit
> a dummy command to the queues then I have about 5-10 drivers at least
> I could merge instantly. For something like the intel gpu driver it
> would be about 50 lines of code (including all the structure boiler
> plate the ioctls require)in userspace to submit a dummy queue command.
> GPU and accel vendors would really love that, because it would allow
> them to freeload on upstream and do essentially nothing in return.
>
> And we'd end up with an unmaintainable disaster of a gpu or well
> accelerator subsystem because there's nothing you can change or
> improve because all the really useful bits of the stack are closed.
> And ofc that's not any companies problem anymore, so ofc you with the
> habanalabs hat on don't care and call this *extreme*.
>
> > btw, you can today see our TPC code at
> > https://github.com/HabanaAI/Habana_Custom_Kernel
> > There is a link there to the TPC user guide and link to download the
> > LLVM compiler.
>
> I got stuck clicking links before I found the source for that llvm
> compiler. Can you give me a direct link to the repo with sourcecode
> instead please?
The source code for the LLVM compiler is not available yet. That's one
of the parts I'm working on getting in the open.
Having said that, I don't think (and I'm not alone at this) that this
should be a pre-requirement for upstreaming kernel drivers of any
type.
And we had this discussion in the past, I'm sure we are both tired of
repeating ourselves.

>
> Thanks, Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

Daniel Vetter July 6, 2021, 2:09 p.m. UTC | #13

On Tue, Jul 6, 2021 at 3:44 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Jul 06, 2021 at 02:07:16PM +0200, Daniel Vetter wrote:
>
> > On the "rdma-core" idea, afaik rdma NIC do not have fully programmable
> > cores in their hw, for which you'd need some kind of compiler to make
> > use of the hardware and the interfaces the kernel provides? So not
> > really compareable, but also my understanding is that rdma-core does
> > actually allow you to reasonable use&drive all the hw features and
> > kernel interfaces fully.
>
> The whole HPC stack has speciality compilers of course. OpenMP, PGAS,
> etc. These compilers map onto library primitives that eventually boil
> down into rdma-core calls. Even the HW devices have various
> programmability that are being targetted with compilers now. People
> are making NIC devices with ARM cores/etc - P4 is emerging for some
> packet processing tasks.

Well it depends which compilers we're talking about here, and what
kind of features. Higher level compilers that break down some fancy
language like OpenMP into what that actually should do on a given
hardware like gpu, or rdma-connected cluster, or whatever, we really
don't care about. You don't need that to drive the hardware. Usually
that stuff works by breaking some of the code down into cpu compiler
IR (most of this is built on top of LLVM IR nowadays), interspersed
with library calls to the runtime.

Now the thing I care about here is if things doen't get compiled down
to cpu code, but to some other IR (SPIR-V is starting to win, but very
often ist still a hacked up version of LLVM IR), which then in a
hw-specific backend gets compiled down to instructions that run on the
hw. I had no idea that rdma NICs can do that, but it sounds like? I
guess maybe some openmpi operations could be done directly on the rdma
chip, but I'm not sure why you'd want a backend compiler here.

Anyway, for anything that works like a gpu accelerator, like 3d accel,
or parallel compute accel (aka gpgpu) or spatial compute accel (aka
NN/AI) or maybe even fpga accel most of the magic to use the hardware
is in this backend compiler, which translates from an IR into whatever
your accelerator consumes. That's the part we really care about for
modern accelerators because without that defacto the hardware is
useless. Generally these chips have full-blown, if special purpose
ISA, with register files, spilling, branches, loops and other control
flow (sometimes only execution masks on simpler hw).

> rdma-core can drive all the kernel interfaces with at least an ioctl
> wrapper, and it has a test suite that tries to cover this. It does not
> exercise the full HW capability, programmability, etc of every single
> device.
>
> I actually don't entirely know what everyone has built on top of
> rdma-core, or how I'd try to map it the DRI ideas you are trying to
> explain.
>
> Should we ban all Intel RDMA drivers because they are shipping
> proprietary Intel HPC compilers and proprietary Intel MPI which drives
> their RDMA HW? Or is that OK because there are open analogs for some
> of that stuff? And yes, the open versions are inferior in various
> metrics.
>
> Pragmatically what I want to see is enough RDMA common/open user space
> to understand the uAPI and thus more about how the kernel driver
> works. Forcing everyone into rdma-core has already prevented a number
> of uAPI mistakes in drivers that would have been bad - so at least
> this level really is valuable.
>
> > So we actually want less on dri-devel, because for compute/accel chips
> > we're currently happy with a vendor userspace. It just needs to be
> > functional and complete, and open in its entirety.
>
> In a sense yes: DRI doesn't insist on a single code base to act as the
> kernel interface, but that is actually the thing that has brought the
> most value to RDMA, IMHO.

So in practice we're not that different in DRI wrt userspace - if
there is an established cross-vendor project in the given area, we do
expect the userspace side to be merged there. And nowadays most of the
feature work is done that way, it's just that we don't have a single
project like rdma-core for this. We do still allow per-driver submit
interfaces because hw is just not standardized enough there, the
standards are at a higher level. Which is why it just doesn't make
sense to talk about a kernel driver as something that's useful
stand-alone at all.

> We've certainly had some interesting successes because of this. The
> first submission for AWS's EFA driver proposed to skip the rdma-core
> step, which was rejected. However since EFA has been in that ecosystem
> it has benefited greatly, I think.
>
> However, in another sense no: RDMA hasn't been blocking, say Intel,
> just because they have built proprietary stuff on top of our open
> stack.

Oh we allow this too. We only block the initial submission if the
proprietary stuff is the only thing out there.

> Honestly, I think GPU is approaching this backwards. Wayland should
> have been designed to prevent proprietary userspace stacks.

That's not possible without some serious cans of worms though. Wayland
is a protocol, and you can't forbid people from implementing it.
Otherwise all the compatible open implementations of closed protocols
wouldn't be possible either.

Now the implementation is a different thing, and there a few
compositors have succumbed to market pressure and enabled the nvidia
stack, as a mostly separate piece from supporting the open stack. And
that's largely because nvidia managed to completely kill the open
source r/e effort through firmware licensing and crypto-key based
verified loading, so unless you install the proprietary stack you
actually can't make use of the hardware at all - well display works
without the firmware, but 3d/compute just doesn't. So you just can't
use nvidia hw without accepting their proprietary driver licenses and
all that entails for the latest hardware.

So I'm not clear what you're suggesting here we should do different.
-Daniel

Jason Gunthorpe July 6, 2021, 2:23 p.m. UTC | #14

On Tue, Jul 06, 2021 at 12:36:51PM +0200, Daniel Vetter wrote:

> If that means AI companies don't want to open our their hw specs
> enough to allow that, so be it - all you get in that case is
> offloading the kernel side  of the stack for convenience, with zero
> long term prospects to ever make this into a cross vendor subsystem
> stack that does something useful.

I don't think this is true at all - nouveau is probably the best
example.

nouveau reverse engineered a userspace stack for one of these devices.

How much further ahead would they have been by now if they had a
vendor supported, fully featured, open kernel driver to build the
userspace upon?

> open up your hw enough for that, I really don't see the point in
> merging such a driver, it'll be an unmaintainable stack by anyone else
> who's not having access to those NDA covered specs and patents and
> everything.

My perspective from RDMA is that the drivers are black boxes. I can
hack around the interface layers but there is a lot of wild stuff in
there that can't be understood without access to the HW documentation.

I think only HW that has open specs, like say NVMe, can really be
properly community oriented. Otherwise we have to work in a community
partnership with the vendor.

Jason

Daniel Vetter July 6, 2021, 2:39 p.m. UTC | #15

On Tue, Jul 6, 2021 at 4:23 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Jul 06, 2021 at 12:36:51PM +0200, Daniel Vetter wrote:
>
> > If that means AI companies don't want to open our their hw specs
> > enough to allow that, so be it - all you get in that case is
> > offloading the kernel side  of the stack for convenience, with zero
> > long term prospects to ever make this into a cross vendor subsystem
> > stack that does something useful.
>
> I don't think this is true at all - nouveau is probably the best
> example.
>
> nouveau reverse engineered a userspace stack for one of these devices.
>
> How much further ahead would they have been by now if they had a
> vendor supported, fully featured, open kernel driver to build the
> userspace upon?

There is actually tons of example here, most of the arm socs have
fully open kernel drivers, supported by the vendor (out of tree).

The hard part is the userspace driver and all the things you're
submitting to it. We've had open kernel drivers for mail/qualcomm/...
years before any believable open implementation started existing.
Typing up the memory manager and hw submission queue handling is
comparitively trivial. Generally the kernel driver is also done last,
you bring up the userspace first, often by just directly programming
the hw from userspace. Kernel driver only gets in the way with this
stuff (nouveau is entirely developed as a userspace driver, as the
most extreme example).

This is a bit different for the display side, but nowadays those
drivers are fully in-kernel so they're all open. Well except the
nvidia one, and I've not heard of nvidia working on even an
out-of-tree open display driver, so that won't help the in-tree effort
at all.

Where it would have helped is if this open driver would come with
redistributable firmware, because that is right now the thing making
nouveau reverse-engineering painful enough to be non-feasible. Well
not the reverse-engineering, but the "shipping the result as a working
driver stack".

I don't think the facts on the ground support your claim here, aside
from the practical problem that nvidia is unwilling to even create an
open driver to begin with. So there isn't anything to merge.

> > open up your hw enough for that, I really don't see the point in
> > merging such a driver, it'll be an unmaintainable stack by anyone else
> > who's not having access to those NDA covered specs and patents and
> > everything.
>
> My perspective from RDMA is that the drivers are black boxes. I can
> hack around the interface layers but there is a lot of wild stuff in
> there that can't be understood without access to the HW documentation.

There's shipping gpu drivers with entirely reverse-engineered stacks.
And I don't mean "shipping in fedora" but "shipping in Chrome tablets
sold by OEM partners of Google". So it's very much possible, even if
the vendor is maximally stubborn about things.

> I think only HW that has open specs, like say NVMe, can really be
> properly community oriented. Otherwise we have to work in a community
> partnership with the vendor.

Well sure that's the ideal case, but most vendors in the accel space
arent interested actual partnership with the wider community. It's
"merge this kernel driver and have no further demands about anything
else". Well there are some who are on board, but it does take pretty
enormous amounts of coercion.
-Daniel

Jason Gunthorpe July 6, 2021, 2:56 p.m. UTC | #16

On Tue, Jul 06, 2021 at 04:09:25PM +0200, Daniel Vetter wrote:
> Anyway, for anything that works like a gpu accelerator, like 3d accel,
> or parallel compute accel (aka gpgpu) or spatial compute accel (aka
> NN/AI) or maybe even fpga accel most of the magic to use the hardware
> is in this backend compiler, which translates from an IR into whatever
> your accelerator consumes. That's the part we really care about for
> modern accelerators because without that defacto the hardware is
> useless. Generally these chips have full-blown, if special purpose
> ISA, with register files, spilling, branches, loops and other control
> flow (sometimes only execution masks on simpler hw).

I don't know if I see it so clearly as you do - at the end of the day
the user keys in the program in some proprietary (or open!) language
and and wack of propritary magic transforms it to "make it work".

There are many barriers that prevent someone without the secret
knowledge from duplicating the end result of a working program. An
accelerator ISA is certainly one example, but I wouldn't overly focus
on it as the only blocker.

Like you said below the NVIDIA GPU ISA seems known but the HW is still
not really useful for other reasons.

Habana seems to have gone the other way, the HW is fully useful but we
don't have the ISA transformation and other details.

Both cases seem to have ended up with something useless, and I have a
hard time saying nouveau has more right to be in the kernel tree than
Habana does.

> > Honestly, I think GPU is approaching this backwards. Wayland should
> > have been designed to prevent proprietary userspace stacks.
> 
> That's not possible without some serious cans of worms though. Wayland
> is a protocol, and you can't forbid people from implementing it.
> Otherwise all the compatible open implementations of closed protocols
> wouldn't be possible either.

Well, in many ways so is Linux, but nobody would seriously
re-implement Linux just to produce a driver.

> So I'm not clear what you're suggesting here we should do different.

Not enabling proprietary stacks as above would be a good start.

Jason

Jason Gunthorpe July 6, 2021, 3:25 p.m. UTC | #17

On Tue, Jul 06, 2021 at 04:39:19PM +0200, Daniel Vetter wrote:
> On Tue, Jul 6, 2021 at 4:23 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Tue, Jul 06, 2021 at 12:36:51PM +0200, Daniel Vetter wrote:
> >
> > > If that means AI companies don't want to open our their hw specs
> > > enough to allow that, so be it - all you get in that case is
> > > offloading the kernel side  of the stack for convenience, with zero
> > > long term prospects to ever make this into a cross vendor subsystem
> > > stack that does something useful.
> >
> > I don't think this is true at all - nouveau is probably the best
> > example.
> >
> > nouveau reverse engineered a userspace stack for one of these devices.
> >
> > How much further ahead would they have been by now if they had a
> > vendor supported, fully featured, open kernel driver to build the
> > userspace upon?
> 
> There is actually tons of example here, most of the arm socs have
> fully open kernel drivers, supported by the vendor (out of tree).

I choose nouveau because of this:

$ git ls-files drivers/gpu/drm/arm/ | xargs wc -l
 15039 total
$ git ls-files drivers/gpu/drm/nouveau/ | xargs wc -l
 204198 total

At 13x the size of mali this is not just some easy to wire up memory
manager and command submission. And after all that typing it still
isn't very good. The fully supported AMD vendor driver is over 3
million lines, so nouveau probably needs to grow several times.

My argument is that an in-tree open kernel driver is a big help to
reverse engineering an open userspace. Having the vendors
collaboration to build that monstrous thing can only help the end goal
of an end to end open stack.

For instance a vendor with an in-tree driver has a strong incentive to
sort out their FW licensing issues so it can be redistributed.

I'm not sure about this all or nothing approach. AFAIK DRM has the
worst problems with out of tree drivers right now.

> Where it would have helped is if this open driver would come with
> redistributable firmware, because that is right now the thing making
> nouveau reverse-engineering painful enough to be non-feasible. Well
> not the reverse-engineering, but the "shipping the result as a working
> driver stack".

I don't think much of the out of tree but open drivers. The goal must
be to get vendors in tree.

I would applaud Habana for getting an intree driver at least, even if
the userspace is not what we'd all want to see.

> I don't think the facts on the ground support your claim here, aside
> from the practical problem that nvidia is unwilling to even create an
> open driver to begin with. So there isn't anything to merge.

The internet tells me there is nvgpu, it doesn't seem to have helped.

Jason

Daniel Vetter July 6, 2021, 3:49 p.m. UTC | #18

On Tue, Jul 6, 2021 at 5:25 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Tue, Jul 06, 2021 at 04:39:19PM +0200, Daniel Vetter wrote:
> > On Tue, Jul 6, 2021 at 4:23 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > >
> > > On Tue, Jul 06, 2021 at 12:36:51PM +0200, Daniel Vetter wrote:
> > >
> > > > If that means AI companies don't want to open our their hw specs
> > > > enough to allow that, so be it - all you get in that case is
> > > > offloading the kernel side  of the stack for convenience, with zero
> > > > long term prospects to ever make this into a cross vendor subsystem
> > > > stack that does something useful.
> > >
> > > I don't think this is true at all - nouveau is probably the best
> > > example.
> > >
> > > nouveau reverse engineered a userspace stack for one of these devices.
> > >
> > > How much further ahead would they have been by now if they had a
> > > vendor supported, fully featured, open kernel driver to build the
> > > userspace upon?
> >
> > There is actually tons of example here, most of the arm socs have
> > fully open kernel drivers, supported by the vendor (out of tree).
>
> I choose nouveau because of this:
>
> $ git ls-files drivers/gpu/drm/arm/ | xargs wc -l
>  15039 total
> $ git ls-files drivers/gpu/drm/nouveau/ | xargs wc -l
>  204198 total

drm/arm is the arm display driver, which isn't actually shipping
anywhere afaik. Also it's not including the hdmi/dp output drivers,
those are generally external on socs, but integrated in discrete gpu.

The other thing to keep in mind is that one of these drivers supports
25 years of product generations, and the other one doesn't. So I think
adding it all up it's not that much different. Last time I looked if
you look at just command submission and rendering/compute, and not
include display, which heavily skews the stats, it's about 10% kernel,
90% userspace driver parts. Not including anything that's shared,
which is most of it (compiler frontend, intermediate optimizer, entire
runtime/state tracker and all the integration and glue pieces
largely).

> At 13x the size of mali this is not just some easy to wire up memory
> manager and command submission. And after all that typing it still
> isn't very good. The fully supported AMD vendor driver is over 3
> million lines, so nouveau probably needs to grow several times.

AMD is 3 million lines the size because it includes per-generation
generated header files.

And of course once you throw an entire vendor team at a driver all
those engineers will produce something, and there's the usual that the
last 10% of features produce about 90% of the complexity and code
problem. E.g. the kbase driver for arm mali gpu is 20x the size of the
in-tree panfrost driver - they need to keep typing to justify their
continued employement, or something like that. Usually it's because
they reinvent the world.

> My argument is that an in-tree open kernel driver is a big help to
> reverse engineering an open userspace. Having the vendors
> collaboration to build that monstrous thing can only help the end goal
> of an end to end open stack.

Not sure where this got lost, but we're totally fine with vendors
using the upstream driver together with their closed stack. And most
of the drivers we do have in upstream are actually, at least in parts,
supported by the vendor. E.g. if you'd have looked the drm/arm driver
you picked is actually 100% written by ARM engineers. So kinda
unfitting example.

> For instance a vendor with an in-tree driver has a strong incentive to
> sort out their FW licensing issues so it can be redistributed.

Nvidia has been claiming to try and sort out the FW problem for years.
They even managed to release a few things, but I think the last one is
2-3 years late now. Partially the reason is that there don't have a
stable api between the firmware and driver, it's all internal from the
same source tree, and they don't really want to change that.

> I'm not sure about this all or nothing approach. AFAIK DRM has the
> worst problems with out of tree drivers right now.

Well I guess someone could stand up a drivers/totally-not-gpu and just
let the flood in. Even duplicated drivers and everything included,
because the vendor drivers are better. Worth a shot, we've practically
started this already, I'm just not going to help with the cleanup.

> > Where it would have helped is if this open driver would come with
> > redistributable firmware, because that is right now the thing making
> > nouveau reverse-engineering painful enough to be non-feasible. Well
> > not the reverse-engineering, but the "shipping the result as a working
> > driver stack".
>
> I don't think much of the out of tree but open drivers. The goal must
> be to get vendors in tree.

Agreed. We actually got them in-tree largely. Nvidia even contributes
the oddball thing, and I think the tegra line is still fully supported
in upstream with the upstream driver.

I'm not sure the bleak picture you're drawing is reality, aside from
the fact that Nvidia discrete gpu drivers being a disaster with no
redistributable firmware, no open kernel driver that works, and
nothing else really either.

> I would applaud Habana for getting an intree driver at least, even if
> the userspace is not what we'd all want to see.
>
> > I don't think the facts on the ground support your claim here, aside
> > from the practical problem that nvidia is unwilling to even create an
> > open driver to begin with. So there isn't anything to merge.
>
> The internet tells me there is nvgpu, it doesn't seem to have helped.

Not sure which one you mean, but every once in a while they open up a
few headers, or a few programming specs, or a small driver somewhere
for a very specific thing, and then it dies again or gets obfuscated
for the next platform, or just never updated. I've never seen anything
that comes remotely to something complete, aside from tegra socs,
which are fully supported in upstream afaik.
-Daniel

Daniel Vetter July 6, 2021, 3:52 p.m. UTC | #19

On Tue, Jul 6, 2021 at 4:56 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Tue, Jul 06, 2021 at 04:09:25PM +0200, Daniel Vetter wrote:
> > Anyway, for anything that works like a gpu accelerator, like 3d accel,
> > or parallel compute accel (aka gpgpu) or spatial compute accel (aka
> > NN/AI) or maybe even fpga accel most of the magic to use the hardware
> > is in this backend compiler, which translates from an IR into whatever
> > your accelerator consumes. That's the part we really care about for
> > modern accelerators because without that defacto the hardware is
> > useless. Generally these chips have full-blown, if special purpose
> > ISA, with register files, spilling, branches, loops and other control
> > flow (sometimes only execution masks on simpler hw).
>
> I don't know if I see it so clearly as you do - at the end of the day
> the user keys in the program in some proprietary (or open!) language
> and and wack of propritary magic transforms it to "make it work".
>
> There are many barriers that prevent someone without the secret
> knowledge from duplicating the end result of a working program. An
> accelerator ISA is certainly one example, but I wouldn't overly focus
> on it as the only blocker.

Well we don't, we do just ask for the full driver stack to make the hw
work. It's just that in the past most vendors choose to leave out the
compiler/ISA from their open stack/specs. Well except nvidia, which
still chooses to leave out everything aside from some very, very
minimal thing around documenting display functionality.

> Like you said below the NVIDIA GPU ISA seems known but the HW is still
> not really useful for other reasons.
>
> Habana seems to have gone the other way, the HW is fully useful but we
> don't have the ISA transformation and other details.

You can actually use nvidia gpus, they're fully functional.

If you install the blobby stack. Which is exactly the same thing as
with habanalabs, plus/minus a few things at the fringes.

In the end it's about drawing the line somewhere, so maybe we should
merge the nvidia glue code that makes their blobby stack work better
with upstream? There's quite a few pieces there, e.g. their display
driver is by design a userspace driver, whereas with kernel
modesetting it needs to be in the kernel to expose the common kms
ioctl interfaces, so they've built up a glue layer to forward
everything to userspace and back. On windows it works because there
kernel code can have growing stacks and fun stuff like that, at least
that's my understanding. Not really an option to just run the code in
linux.

I'm pretty sure nvidia would appreciate that, and maybe every once in
a while they open up a header for a generation or two of products like
they've done in the past.

> Both cases seem to have ended up with something useless, and I have a
> hard time saying nouveau has more right to be in the kernel tree than
> Habana does.
>
> > > Honestly, I think GPU is approaching this backwards. Wayland should
> > > have been designed to prevent proprietary userspace stacks.
> >
> > That's not possible without some serious cans of worms though. Wayland
> > is a protocol, and you can't forbid people from implementing it.
> > Otherwise all the compatible open implementations of closed protocols
> > wouldn't be possible either.
>
> Well, in many ways so is Linux, but nobody would seriously
> re-implement Linux just to produce a driver.

Well in the gpu space for 2+ decades nvidia has been setting the
standard, and the open stack has been trying to catch up by
reimplementing the entire thing. It took a fair while.

> > So I'm not clear what you're suggesting here we should do different.
>
> Not enabling proprietary stacks as above would be a good start.

I'm still not sure what exactly you mean here. Like on the 3d side
there's opengl and vulkan, and nvidia just has an entirely different
implementation of that compared to any of the open drivers. That is a
bit less code than linux, but it's not small, and reimplementing over
decades is pretty much what happened. And if it's not allowed we'd
actually not have an open 3d gpu stack at all, because only very
recently did we get an agreement around the tracemark/licensing issues
of that stuff with Khronos. Recently compared to the history of opengl
at least.

So I'm still not clear what exactly it is you're suggesting we should
do? Not implement the industry standards for 3d (and accept we stay
irrelevant forever)? Reject nvidia blobs harder than we do already?
Distros will continue to ship an auto-installer for that stack, at
least some, so we're pretty much maxed out already. Like in what way
do you think the upstream stack does enable the proprietary nvidia
stack? Should we permanently ban any contributions from anyone with an
@nvidia.com address, even if it helps the open stack improve?

Like I'm not seeing something concrete that could be done, which would
actually prevent nvidia from having their completely independent
stack, with exact same functionality and not a line of code shared.
Which is were we are right now. The only thing where we could be more
strict is to reject any contributions from them at all, just because
we don't like them. That seems a bit too extreme
-Daniel

Daniel Vetter July 6, 2021, 4:07 p.m. UTC | #20

On Tue, Jul 6, 2021 at 5:49 PM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
> On Tue, Jul 6, 2021 at 5:25 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > I'm not sure about this all or nothing approach. AFAIK DRM has the
> > worst problems with out of tree drivers right now.
>
> Well I guess someone could stand up a drivers/totally-not-gpu and just
> let the flood in. Even duplicated drivers and everything included,
> because the vendor drivers are better. Worth a shot, we've practically
> started this already, I'm just not going to help with the cleanup.

tbh I think at this point someone should just do that. Ideally with
some boundary like please don't use dma-fence or dma-buf and stuff
like that so drivers/gpu doesn't ever have to deal with the fallout.
But way too many people think that somehow you magically get the other
90% of an open accel stack if you're just friendly enough and merge
the kernel driver, so we really should just that experiment in
upstream and watch it pan out in reality.

Minimally it would be some great entertainment :-)

Also on your claim that drivers/gpu is a non-upstream disaster: I've
also learned that that for drivers/rdma there's the upstream driver,
and then there's the out-of-tree hackjob the vendor actually supports.
So seems to be about the same level of screwed up, if you ask the
vendor they tell you the upstream driver isn't a thing they care about
and it's just done for a bit of goodwill. Except if you have enormous
amounts of volume, then suddenly it's an option ... Minus the fw issue
for nvidia, upstream does support all the gpus you can buy right now
and that can run on linux with some vendor driver (aka excluding apple
M1 and ofc upcoming products from most vendors).

drivers/accel otoh is mostly out-of-tree, because aside from Greg
mergin habanalabs no one is bold enough anymore to just merge them
all. There's lots of those going around that would be ready for
picking. And they've been continously submitted to upstream over the
years, even before the entire habanalabs thing.
-Daniel

Jason Gunthorpe July 6, 2021, 4:29 p.m. UTC | #21

On Tue, Jul 06, 2021 at 05:49:01PM +0200, Daniel Vetter wrote:

> The other thing to keep in mind is that one of these drivers supports
> 25 years of product generations, and the other one doesn't. 

Sure, but that is the point, isn't it? To have an actually useful
thing you need all of this mess

> > My argument is that an in-tree open kernel driver is a big help to
> > reverse engineering an open userspace. Having the vendors
> > collaboration to build that monstrous thing can only help the end goal
> > of an end to end open stack.
> 
> Not sure where this got lost, but we're totally fine with vendors
> using the upstream driver together with their closed stack. And most
> of the drivers we do have in upstream are actually, at least in parts,
> supported by the vendor. E.g. if you'd have looked the drm/arm driver
> you picked is actually 100% written by ARM engineers. So kinda
> unfitting example.

So the argument with Habana really boils down to how much do they need
to show in the open source space to get a kernel driver? You want to
see the ISA or compiler at least?

That at least doesn't seem "extreme" to me.

> > For instance a vendor with an in-tree driver has a strong incentive to
> > sort out their FW licensing issues so it can be redistributed.
> 
> Nvidia has been claiming to try and sort out the FW problem for years.
> They even managed to release a few things, but I think the last one is
> 2-3 years late now. Partially the reason is that there don't have a
> stable api between the firmware and driver, it's all internal from the
> same source tree, and they don't really want to change that.

Right, companies have no incentive to work in a sane way if they have
their own parallel world. I think drawing them part by part into the
standard open workflows and expectations is actually helpful to
everyone.

> > > I don't think the facts on the ground support your claim here, aside
> > > from the practical problem that nvidia is unwilling to even create an
> > > open driver to begin with. So there isn't anything to merge.
> >
> > The internet tells me there is nvgpu, it doesn't seem to have helped.
> 
> Not sure which one you mean, but every once in a while they open up a
> few headers, or a few programming specs, or a small driver somewhere
> for a very specific thing, and then it dies again or gets obfuscated
> for the next platform, or just never updated. I've never seen anything
> that comes remotely to something complete, aside from tegra socs,
> which are fully supported in upstream afaik.

I understand nvgpu is the tegra driver that people actualy
use. nouveau may have good tegra support but is it used in any actual
commercial product?

Jason

Jason Gunthorpe July 6, 2021, 5:28 p.m. UTC | #22

On Tue, Jul 06, 2021 at 06:07:17PM +0200, Daniel Vetter wrote:

> Also on your claim that drivers/gpu is a non-upstream disaster: I've
> also learned that that for drivers/rdma there's the upstream driver,
> and then there's the out-of-tree hackjob the vendor actually
> supports.

In the enterprise world everyone has their out of tree backport
drivers. It varies on the vendor how much deviation there is from the
upstream driver and what commercial support relationship the vendor
has with the enterprise distros.

> So seems to be about the same level of screwed up, if you ask the
> vendor they tell you the upstream driver isn't a thing they care about
> and it's just done for a bit of goodwill. 

Sounds like you should get a new RDMA supplier :)

To be fair Intel is getting better, they got their new RDMA HW support
merged into v5.14 after about 2 years in the out of tree world. Though
it is still incomplete compared to their out of tree driver, the gap
is much smaller now.

> amounts of volume, then suddenly it's an option ... Minus the fw issue
> for nvidia, upstream does support all the gpus you can buy right now
> and that can run on linux with some vendor driver (aka excluding apple
> M1 and ofc upcoming products from most vendors).

I would look at how many actual commercial systems are running the
upstream/inbox stack. I personally know of quite a few sites with big
HPC RDMA deployments running pure inbox kernels, no add on kernel
modules, with full commercial support.

If you can say that kind of arrangment is also common place in the GPU
world then I will happily be wrong.

Jason

Christoph Hellwig July 6, 2021, 5:31 p.m. UTC | #23

On Tue, Jul 06, 2021 at 02:28:28PM -0300, Jason Gunthorpe wrote:
> > Also on your claim that drivers/gpu is a non-upstream disaster: I've
> > also learned that that for drivers/rdma there's the upstream driver,
> > and then there's the out-of-tree hackjob the vendor actually
> > supports.
> 
> In the enterprise world everyone has their out of tree backport
> drivers. It varies on the vendor how much deviation there is from the
> upstream driver and what commercial support relationship the vendor
> has with the enterprise distros.

I think he means the Mellanox OFED stack, which is a complete and utter
mess and which gets force fed by Mellanox/Nvidia on unsuspecting
customers.  I know many big HPC sites that ignore it, but a lot of
enterprise customers are dumb enought to deploy it.

Daniel Vetter July 6, 2021, 5:35 p.m. UTC | #24

On Tue, Jul 6, 2021 at 6:29 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Jul 06, 2021 at 05:49:01PM +0200, Daniel Vetter wrote:
>
> > The other thing to keep in mind is that one of these drivers supports
> > 25 years of product generations, and the other one doesn't.
>
> Sure, but that is the point, isn't it? To have an actually useful
> thing you need all of this mess
>
> > > My argument is that an in-tree open kernel driver is a big help to
> > > reverse engineering an open userspace. Having the vendors
> > > collaboration to build that monstrous thing can only help the end goal
> > > of an end to end open stack.
> >
> > Not sure where this got lost, but we're totally fine with vendors
> > using the upstream driver together with their closed stack. And most
> > of the drivers we do have in upstream are actually, at least in parts,
> > supported by the vendor. E.g. if you'd have looked the drm/arm driver
> > you picked is actually 100% written by ARM engineers. So kinda
> > unfitting example.
>
> So the argument with Habana really boils down to how much do they need
> to show in the open source space to get a kernel driver? You want to
> see the ISA or compiler at least?

Yup. We dont care about any of the fancy pieces you build on top, nor
does the compiler need to be the optimizing one. Just something that's
good enough to drive the hw in some demons to see how it works and all
that. Generally that's also not that hard to reverse engineer, if
someone is bored enough, the real fancy stuff tends to be in how you
optimize the generated code. And make it fit into the higher levels
properly.

> That at least doesn't seem "extreme" to me.
>
> > > For instance a vendor with an in-tree driver has a strong incentive to
> > > sort out their FW licensing issues so it can be redistributed.
> >
> > Nvidia has been claiming to try and sort out the FW problem for years.
> > They even managed to release a few things, but I think the last one is
> > 2-3 years late now. Partially the reason is that there don't have a
> > stable api between the firmware and driver, it's all internal from the
> > same source tree, and they don't really want to change that.
>
> Right, companies have no incentive to work in a sane way if they have
> their own parallel world. I think drawing them part by part into the
> standard open workflows and expectations is actually helpful to
> everyone.

Well we do try to get them on board part-by-part generally starting
with the kernel and ending with a proper compiler instead of the usual
llvm hack job, but for whatever reasons they really like their
in-house stuff, see below for what I mean.

> > > > I don't think the facts on the ground support your claim here, aside
> > > > from the practical problem that nvidia is unwilling to even create an
> > > > open driver to begin with. So there isn't anything to merge.
> > >
> > > The internet tells me there is nvgpu, it doesn't seem to have helped.
> >
> > Not sure which one you mean, but every once in a while they open up a
> > few headers, or a few programming specs, or a small driver somewhere
> > for a very specific thing, and then it dies again or gets obfuscated
> > for the next platform, or just never updated. I've never seen anything
> > that comes remotely to something complete, aside from tegra socs,
> > which are fully supported in upstream afaik.
>
> I understand nvgpu is the tegra driver that people actualy
> use. nouveau may have good tegra support but is it used in any actual
> commercial product?

I think it was almost the case. Afaik they still have their internal
userspace stack working on top of nvidia, at least last year someone
fixed up a bunch of issues in the tegra+nouveau combo to enable format
modifiers properly across the board. But also nvidia is never going to
sell you that as the officially supported thing, unless your ask comes
back with enormous amounts of sold hardware.

And it's not just nvidia, it's pretty much everyone. Like a soc
company I don't want to know started collaborating with upstream and
the reverse-engineered mesa team on a kernel driver, seems to work
pretty well for current hardware. But for the next generation they
decided it's going to be again only their in-house tree that
completele ignores drivers/gpu/drm, and also tosses all the
foundational work they helped build on the userspace side. And this is
consistent across all companies, over the last 20 years I know of
(often non-public) stories across every single company where they
decided that all the time invested into community/upstream
collaboration isn't useful anymore, we go all vendor solo for the next
one.

Most of those you luckily don't hear about anymore, all it results in
the upstream driver being 1-2 years late or so. But even the good ones
where we collaborate well can't seem to help themselves and want to
throw it all away every few years.
-Daniel

Jason Gunthorpe July 6, 2021, 5:59 p.m. UTC | #25

On Tue, Jul 06, 2021 at 07:31:37PM +0200, Christoph Hellwig wrote:
> On Tue, Jul 06, 2021 at 02:28:28PM -0300, Jason Gunthorpe wrote:
> > > Also on your claim that drivers/gpu is a non-upstream disaster: I've
> > > also learned that that for drivers/rdma there's the upstream driver,
> > > and then there's the out-of-tree hackjob the vendor actually
> > > supports.
> > 
> > In the enterprise world everyone has their out of tree backport
> > drivers. It varies on the vendor how much deviation there is from the
> > upstream driver and what commercial support relationship the vendor
> > has with the enterprise distros.
> 
> I think he means the Mellanox OFED stack, which is a complete and utter
> mess and which gets force fed by Mellanox/Nvidia on unsuspecting
> customers.  I know many big HPC sites that ignore it, but a lot of
> enterprise customers are dumb enought to deploy it.

No, I don't think so. While MOFED is indeed a giant mess, the mlx5
upstream driver is not some token effort to generate good will and
Mellanox certainly does provide full commercial support for the mlx5
drivers shipped inside various enterprise distros.

MOFED also doesn't have a big functional divergance from RDMA
upstream, and it is not mandatory just to use the hardware.

I can not say the same about other company's RDMA driver
distributions, Daniel's description of "minimal effort to get
goodwill" would match others much better.

You are right that there are a lot of enterprise customers who deploy
the MOFED. I can't agree with their choices, but they are not forced
into using it anymore.

Jason

Daniel Vetter July 6, 2021, 6:03 p.m. UTC | #26

I should stop typing and prep dinner, but I found some too hilarious
typos below.

On Tue, Jul 6, 2021 at 7:35 PM Daniel Vetter <daniel.vetter@ffwll.ch> wrote:
>
> On Tue, Jul 6, 2021 at 6:29 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >
> > On Tue, Jul 06, 2021 at 05:49:01PM +0200, Daniel Vetter wrote:
> >
> > > The other thing to keep in mind is that one of these drivers supports
> > > 25 years of product generations, and the other one doesn't.
> >
> > Sure, but that is the point, isn't it? To have an actually useful
> > thing you need all of this mess
> >
> > > > My argument is that an in-tree open kernel driver is a big help to
> > > > reverse engineering an open userspace. Having the vendors
> > > > collaboration to build that monstrous thing can only help the end goal
> > > > of an end to end open stack.
> > >
> > > Not sure where this got lost, but we're totally fine with vendors
> > > using the upstream driver together with their closed stack. And most
> > > of the drivers we do have in upstream are actually, at least in parts,
> > > supported by the vendor. E.g. if you'd have looked the drm/arm driver
> > > you picked is actually 100% written by ARM engineers. So kinda
> > > unfitting example.
> >
> > So the argument with Habana really boils down to how much do they need
> > to show in the open source space to get a kernel driver? You want to
> > see the ISA or compiler at least?
>
> Yup. We dont care about any of the fancy pieces you build on top, nor
> does the compiler need to be the optimizing one. Just something that's
> good enough to drive the hw in some demons to see how it works and all

s/demons/demos/ but hw tends to be funky enough that either fits :-)

> that. Generally that's also not that hard to reverse engineer, if
> someone is bored enough, the real fancy stuff tends to be in how you
> optimize the generated code. And make it fit into the higher levels
> properly.
>
> > That at least doesn't seem "extreme" to me.
> >
> > > > For instance a vendor with an in-tree driver has a strong incentive to
> > > > sort out their FW licensing issues so it can be redistributed.
> > >
> > > Nvidia has been claiming to try and sort out the FW problem for years.
> > > They even managed to release a few things, but I think the last one is
> > > 2-3 years late now. Partially the reason is that there don't have a
> > > stable api between the firmware and driver, it's all internal from the
> > > same source tree, and they don't really want to change that.
> >
> > Right, companies have no incentive to work in a sane way if they have
> > their own parallel world. I think drawing them part by part into the
> > standard open workflows and expectations is actually helpful to
> > everyone.
>
> Well we do try to get them on board part-by-part generally starting
> with the kernel and ending with a proper compiler instead of the usual
> llvm hack job, but for whatever reasons they really like their
> in-house stuff, see below for what I mean.
>
> > > > > I don't think the facts on the ground support your claim here, aside
> > > > > from the practical problem that nvidia is unwilling to even create an
> > > > > open driver to begin with. So there isn't anything to merge.
> > > >
> > > > The internet tells me there is nvgpu, it doesn't seem to have helped.
> > >
> > > Not sure which one you mean, but every once in a while they open up a
> > > few headers, or a few programming specs, or a small driver somewhere
> > > for a very specific thing, and then it dies again or gets obfuscated
> > > for the next platform, or just never updated. I've never seen anything
> > > that comes remotely to something complete, aside from tegra socs,
> > > which are fully supported in upstream afaik.
> >
> > I understand nvgpu is the tegra driver that people actualy
> > use. nouveau may have good tegra support but is it used in any actual
> > commercial product?
>
> I think it was almost the case. Afaik they still have their internal
> userspace stack working on top of nvidia, at least last year someone
> fixed up a bunch of issues in the tegra+nouveau combo to enable format
> modifiers properly across the board. But also nvidia is never going to
> sell you that as the officially supported thing, unless your ask comes
> back with enormous amounts of sold hardware.
>
> And it's not just nvidia, it's pretty much everyone. Like a soc
> company I don't want to know started collaborating with upstream and

s/know/name/ I do know them unfortunately quite well ...

Cheers, Daniel

> the reverse-engineered mesa team on a kernel driver, seems to work
> pretty well for current hardware. But for the next generation they
> decided it's going to be again only their in-house tree that
> completele ignores drivers/gpu/drm, and also tosses all the
> foundational work they helped build on the userspace side. And this is
> consistent across all companies, over the last 20 years I know of
> (often non-public) stories across every single company where they
> decided that all the time invested into community/upstream
> collaboration isn't useful anymore, we go all vendor solo for the next
> one.
>
> Most of those you luckily don't hear about anymore, all it results in
> the upstream driver being 1-2 years late or so. But even the good ones
> where we collaborate well can't seem to help themselves and want to
> throw it all away every few years.
> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch

Jason Gunthorpe July 6, 2021, 6:31 p.m. UTC | #27

On Tue, Jul 06, 2021 at 07:35:55PM +0200, Daniel Vetter wrote:

> Yup. We dont care about any of the fancy pieces you build on top, nor
> does the compiler need to be the optimizing one. Just something that's
> good enough to drive the hw in some demons to see how it works and all
> that. Generally that's also not that hard to reverse engineer, if
> someone is bored enough, the real fancy stuff tends to be in how you
> optimize the generated code. And make it fit into the higher levels
> properly.

Seems reasonable to me

> And it's not just nvidia, it's pretty much everyone. Like a soc
> company I don't want to know started collaborating with upstream and
> the reverse-engineered mesa team on a kernel driver, seems to work
> pretty well for current hardware. 

What I've seen is that this only works with customer demand. Companies
need to hear from their customers that upstream is what is needed, and
companies cannot properly hear that until they are at least already
partially invested in the upstream process and have the right
customers that are sophisticated enough to care.

Embedded makes everything 10x worse because too many customers just
don't care about upstream, you can hack your way through everything,
and indulge in single generation thinking. Fork the whole kernel for 3
years, EOL, no problem!

It is the enterprise world, particularly with an opinionated company
like RH saying NO stuck in the middle that really seems to drive
things toward upstream.

Yes, vendors can work around Red Hat's No (and NVIDIA GPU is such an
example) but it is incredibly time consuming, expensive and becoming
more and more difficult every year.

The big point is this:

> But also nvidia is never going to sell you that as the officially
> supported thing, unless your ask comes back with enormous amounts of
> sold hardware.

I think this is at the core of Linux's success in the enterprise
world. Big customers who care demanding open source. Any vendor, even
nvidia will want to meet customer demands.

IHMO upstream success is found by motivating the customer to demand
and make it "easy" for the vendor to supply it.

Jason

Daniel Vetter July 6, 2021, 7:06 p.m. UTC | #28

On Tue, Jul 6, 2021 at 8:31 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Tue, Jul 06, 2021 at 07:35:55PM +0200, Daniel Vetter wrote:
> > Yup. We dont care about any of the fancy pieces you build on top, nor
> > does the compiler need to be the optimizing one. Just something that's
> > good enough to drive the hw in some demons to see how it works and all
> > that. Generally that's also not that hard to reverse engineer, if
> > someone is bored enough, the real fancy stuff tends to be in how you
> > optimize the generated code. And make it fit into the higher levels
> > properly.
>
> Seems reasonable to me
>
> > And it's not just nvidia, it's pretty much everyone. Like a soc
> > company I don't want to know started collaborating with upstream and
> > the reverse-engineered mesa team on a kernel driver, seems to work
> > pretty well for current hardware.
>
> What I've seen is that this only works with customer demand. Companies
> need to hear from their customers that upstream is what is needed, and
> companies cannot properly hear that until they are at least already
> partially invested in the upstream process and have the right
> customers that are sophisticated enough to care.
>
> Embedded makes everything 10x worse because too many customers just
> don't care about upstream, you can hack your way through everything,
> and indulge in single generation thinking. Fork the whole kernel for 3
> years, EOL, no problem!

It's not entirely hopeless in embedded either. Sure there's the giant
pile of sell&forget abandonware, but there are lots of embedded things
where multi-year to multi-decade support is required. And an upstream
gfx stack beats anything the vendor has to offer on that, easily.

And on the server side it's actually pretty hard to convince customers
of the upstream driver benefits, because they don't want or can't
abandon nvidia and have just learned to accept the pain. They either
build a few abstraction layers on top (and demand the vendor support
those), or they flat out demand you support the nvidia broprietary
interfaces. And AMD has been trying to move the needle here for years,
with not that much success.

> It is the enterprise world, particularly with an opinionated company
> like RH saying NO stuck in the middle that really seems to drive
> things toward upstream.
>
> Yes, vendors can work around Red Hat's No (and NVIDIA GPU is such an
> example) but it is incredibly time consuming, expensive and becoming
> more and more difficult every year.
>
> The big point is this:
>
> > But also nvidia is never going to sell you that as the officially
> > supported thing, unless your ask comes back with enormous amounts of
> > sold hardware.
>
> I think this is at the core of Linux's success in the enterprise
> world. Big customers who care demanding open source. Any vendor, even
> nvidia will want to meet customer demands.
>
> IHMO upstream success is found by motivating the customer to demand
> and make it "easy" for the vendor to supply it.

Yup, exactly same situation here. The problem seems to be a bit that
gpu vendor stubbornness is higher than established customer demand
even, or they just don't care, and so in the last few years that
customer demand has resulted in payment to consulting shops and hiring
of engineers into reverse-engineering a full driver, instead of
customer and vendor splitting the difference and the vendor
upstreaming their stack. And that's for companies who've done it in
the past, or at least collaborated on parts like the kernel driver, so
I really have no clue why they don't just continue. We have
well-established customers who do want it all open and upstream,
across kernel and userspace pieces.

And it looks like it's going to repeat itself a few more times
unfortunately. I'm not sure when exactly the lesson will sink in.

Maybe I missed some, but looking at current render/compute drivers I
think (but not even sure on that) only drm/lima is a hobbyist project
and perhaps you want to include drm/nouveau as not paid by customers
and more something redhat does out of principle. All the others are
paid for by customers, with vendor involvement ranging from "just
helping out with the kernel driver" to "pays for pretty much all of
the development". And still apparently that's not enough demand for an
upstream driver stack.
-Daniel

Alex Deucher July 6, 2021, 7:09 p.m. UTC | #29

On Tue, Jul 6, 2021 at 2:31 PM Jason Gunthorpe <jgg@ziepe.ca> wrote:
>
> On Tue, Jul 06, 2021 at 07:35:55PM +0200, Daniel Vetter wrote:
>
> > Yup. We dont care about any of the fancy pieces you build on top, nor
> > does the compiler need to be the optimizing one. Just something that's
> > good enough to drive the hw in some demons to see how it works and all
> > that. Generally that's also not that hard to reverse engineer, if
> > someone is bored enough, the real fancy stuff tends to be in how you
> > optimize the generated code. And make it fit into the higher levels
> > properly.
>
> Seems reasonable to me
>
> > And it's not just nvidia, it's pretty much everyone. Like a soc
> > company I don't want to know started collaborating with upstream and
> > the reverse-engineered mesa team on a kernel driver, seems to work
> > pretty well for current hardware.
>
> What I've seen is that this only works with customer demand. Companies
> need to hear from their customers that upstream is what is needed, and
> companies cannot properly hear that until they are at least already
> partially invested in the upstream process and have the right
> customers that are sophisticated enough to care.
>
> Embedded makes everything 10x worse because too many customers just
> don't care about upstream, you can hack your way through everything,
> and indulge in single generation thinking. Fork the whole kernel for 3
> years, EOL, no problem!
>
> It is the enterprise world, particularly with an opinionated company
> like RH saying NO stuck in the middle that really seems to drive
> things toward upstream.
>
> Yes, vendors can work around Red Hat's No (and NVIDIA GPU is such an
> example) but it is incredibly time consuming, expensive and becoming
> more and more difficult every year.
>
> The big point is this:
>
> > But also nvidia is never going to sell you that as the officially
> > supported thing, unless your ask comes back with enormous amounts of
> > sold hardware.
>
> I think this is at the core of Linux's success in the enterprise
> world. Big customers who care demanding open source. Any vendor, even
> nvidia will want to meet customer demands.
>
> IHMO upstream success is found by motivating the customer to demand
> and make it "easy" for the vendor to supply it.

I think this is one of the last big challenges on Linux.  It's REALLY
hard to align new products with Linux kernel releases and distro
kernels.  Hardware cycles are too short and drivers (at least for
GPUs) are too big to really fit well with the current Linux release
model.  In many cases enterprise distros have locked down on a kernel
version around the same time we are doing new chip bring up.  You are
almost always off by one when it comes to kernel version alignment.
Even if you can get the initial code upstream in the right kernel
version, it tends to be aligned to such early silicon that you end up
needing a pile of additional patches to make production cards work.
Those changes are often deemed "too big" for stable kernel fixes.  The
only real way to deal with that effectively is with vendor provided
packaged drivers using something like dkms to cover launch.  Thus you
need to do your bring up against latest upstream and then backport, or
do your bring up against some older kernel and forward port for
upstream.  You end up doing everything twice.  Things get better with
sustaining support in subsequent distro releases, but it doesn't help
at product launch.  I don't know what the right solution for this is.

Alex

Christian König July 7, 2021, 12:17 p.m. UTC | #30

Am 06.07.21 um 14:23 schrieb Daniel Vetter:
> On Tue, Jul 06, 2021 at 02:21:10PM +0200, Christoph Hellwig wrote:
>> On Tue, Jul 06, 2021 at 10:40:37AM +0200, Daniel Vetter wrote:
>>>> Greg, I hope this will be good enough for you to merge this code.
>>> So we're officially going to use dri-devel for technical details review
>>> and then Greg for merging so we don't have to deal with other merge
>>> criteria dri-devel folks have?
>>>
>>> I don't expect anything less by now, but it does make the original claim
>>> that drivers/misc will not step all over accelerators folks a complete
>>> farce under the totally-not-a-gpu banner.
>>>
>>> This essentially means that for any other accelerator stack that doesn't
>>> fit the dri-devel merge criteria, even if it's acting like a gpu and uses
>>> other gpu driver stuff, you can just send it to Greg and it's good to go.
>>>
>>> There's quite a lot of these floating around actually (and many do have
>>> semi-open runtimes, like habanalabs have now too, just not open enough to
>>> be actually useful). It's going to be absolutely lovely having to explain
>>> to these companies in background chats why habanalabs gets away with their
>>> stack and they don't.
>> FYI, I fully agree with Daniel here.  Habanlabs needs to open up their
>> runtime if they want to push any additional feature in the kernel.
>> The current situation is not sustainable.
> Before anyone replies: The runtime is open, the compiler is still closed.
> This has become the new default for accel driver submissions, I think
> mostly because all the interesting bits for non-3d accelerators are in the
> accel ISA, and no longer in the runtime. So vendors are fairly happy to
> throw in the runtime as a freebie.

Well a compiler and runtime makes things easier, but the real question 
is if they are really required for upstreaming a kernel driver?

I mean what we need is to be able to exercise the functionality. So 
wouldn't (for example) an assembler be sufficient?

> It's still incomplete, and it's still useless if you want to actually hack
> on the driver stack.

Yeah, when you want to hack on it in the sense of extending it then this 
requirement is certainly true.

But as far as I can see userspace don't need to be extendable to justify 
a kernel driver. It just needs to have enough glue to thoughtfully 
exercise the relevant kernel interfaces.

Applying that to GPUs I think what you need to be able to is to write 
shaders, but that doesn't need to be in a higher language requiring a 
compiler and runtime. Released opcodes and a low level assembler should 
be sufficient.

Regards,
Christian.

> -Daniel

Daniel Vetter July 7, 2021, 12:54 p.m. UTC | #31

On Wed, Jul 7, 2021 at 2:17 PM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
> Am 06.07.21 um 14:23 schrieb Daniel Vetter:
> > On Tue, Jul 06, 2021 at 02:21:10PM +0200, Christoph Hellwig wrote:
> >> On Tue, Jul 06, 2021 at 10:40:37AM +0200, Daniel Vetter wrote:
> >>>> Greg, I hope this will be good enough for you to merge this code.
> >>> So we're officially going to use dri-devel for technical details review
> >>> and then Greg for merging so we don't have to deal with other merge
> >>> criteria dri-devel folks have?
> >>>
> >>> I don't expect anything less by now, but it does make the original claim
> >>> that drivers/misc will not step all over accelerators folks a complete
> >>> farce under the totally-not-a-gpu banner.
> >>>
> >>> This essentially means that for any other accelerator stack that doesn't
> >>> fit the dri-devel merge criteria, even if it's acting like a gpu and uses
> >>> other gpu driver stuff, you can just send it to Greg and it's good to go.
> >>>
> >>> There's quite a lot of these floating around actually (and many do have
> >>> semi-open runtimes, like habanalabs have now too, just not open enough to
> >>> be actually useful). It's going to be absolutely lovely having to explain
> >>> to these companies in background chats why habanalabs gets away with their
> >>> stack and they don't.
> >> FYI, I fully agree with Daniel here.  Habanlabs needs to open up their
> >> runtime if they want to push any additional feature in the kernel.
> >> The current situation is not sustainable.
> > Before anyone replies: The runtime is open, the compiler is still closed.
> > This has become the new default for accel driver submissions, I think
> > mostly because all the interesting bits for non-3d accelerators are in the
> > accel ISA, and no longer in the runtime. So vendors are fairly happy to
> > throw in the runtime as a freebie.
>
> Well a compiler and runtime makes things easier, but the real question
> is if they are really required for upstreaming a kernel driver?
>
> I mean what we need is to be able to exercise the functionality. So
> wouldn't (for example) an assembler be sufficient?

So no one has tried this yet, but I think an assembler, or maybe even
just the full PRM for the ISA is also good enough I think.

I guess in practice everyone just comes with the compiler for a few reasons:
- AMD and Intel are great and release full PRMs for the gpu, but
preparing those takes a lot of time. Often that's done as part of
bring up, to make sure everything is annotated properly, so that all
the necessary bits are included, but none of the future stuff, or
silicon bring-up pieces. So in reality you have the compiler before
you have the isa docs.

- reverse-engineered drivers also tend to have demo compilers before
anything like full ISA docs show up :-) But also the docs tooling they
have are great.

- then there's the case of developing a driver with NDA'd docs. Again
you'll have a compiler as the only real output, there's not going to
be any docs or anything like that.

> > It's still incomplete, and it's still useless if you want to actually hack
> > on the driver stack.
>
> Yeah, when you want to hack on it in the sense of extending it then this
> requirement is certainly true.
>
> But as far as I can see userspace don't need to be extendable to justify
> a kernel driver. It just needs to have enough glue to thoughtfully
> exercise the relevant kernel interfaces.
>
> Applying that to GPUs I think what you need to be able to is to write
> shaders, but that doesn't need to be in a higher language requiring a
> compiler and runtime. Released opcodes and a low level assembler should
> be sufficient.

Yeah I think in theory ISA docs + assembler testcase or whatever is
perfectly fine. In reality anyone who cares enough to do this properly
gets to the demo quality compiler stage first, and so that's what we
take for merging a new stack.

I do disagree that we're only ever asking for this and not more, e.g.
if you come with a new 3d accelator and it's not coming with a
userspace driver as a mesa MR, you have to do some very serious
explaining about wtf you're doing - mesa3d won, pretty much across the
board, as a common project for both vulkan and opengl, and the
justifications for reinventing wheels better be really good here. Also
by the time you've written enough scaffolding to show it integrates in
non-stupid ways into mesa, you practically have a demo-quality driver
stack anyway.

Similar on the display side of things, over the past year consensus
for merge criteria have gone up quite a bit, e.g. there's a patch
floating around to make that clearer:

https://lore.kernel.org/dri-devel/20210706161244.1038592-1-maxime@cerno.tech/

Of course this doesn't include anything grandfathered in (*cough*
amdvlk *cough*), and also outside of 3d there's clearly no
cross-vendor project that's established enough, media, compute, AI/NN
stuff is all very badly fragmented. That's maybe lamentable, but like
you said not really a reason to reject a kernel driver.
-Daniel

Dennis Dalessandro July 9, 2021, 2:47 p.m. UTC | #32

On 7/6/21 1:59 PM, Jason Gunthorpe wrote:
> I can not say the same about other company's RDMA driver
> distributions, Daniel's description of "minimal effort to get
> goodwill" would match others much better.

Not sure what other RDMA driver you are talking about but as for Cornelis
Networks, we do have a packaged up version of our software. However it is meant
to make things easier on end users to bridge the gap between the distro kernel
drivers and the upstream kernel.

It's definitely not a requirement and plenty of folks do use distro
kernels/drivers. I'm not sure how many large sites are using something straight
off kernel.org but the upstream hfi1 driver is 100% the real deal.

We continually develop on and test the upstream kernel. Our goal is always to
upstream patches first. We learned that lesson the hard way when we first tried
to upstream hfi1.

-Denny

[v4,0/2] Add p2p via dmabuf to habanalabs

Message

Comments