diff mbox series

[v5,1/3] Provide in-kernel headers to make extending kernel easier

Message ID 20190320163116.39275-1-joel@joelfernandes.org (mailing list archive)
State New
Headers show
Series [v5,1/3] Provide in-kernel headers to make extending kernel easier | expand

Commit Message

Joel Fernandes March 20, 2019, 4:31 p.m. UTC
Introduce in-kernel headers and other artifacts which are made available
as an archive through proc (/proc/kheaders.tar.xz file). This archive makes
it possible to build kernel modules, run eBPF programs, and other
tracing programs that need to extend the kernel for tracing purposes
without any dependency on the file system having headers and build
artifacts.

On Android and embedded systems, it is common to switch kernels but not
have kernel headers available on the file system. Further once a
different kernel is booted, any headers stored on the file system will
no longer be useful. By storing the headers as a compressed archive
within the kernel, we can avoid these issues that have been a hindrance
for a long time.

The best way to use this feature is by building it in. Several users
have a need for this, when they switch debug kernels, they donot want to
update the filesystem or worry about it where to store the headers on
it. However, the feature is also buildable as a module in case the user
desires it not being part of the kernel image. This makes it possible to
load and unload the headers from memory on demand. A tracing program, or
a kernel module builder can load the module, do its operations, and then
unload the module to save kernel memory. The total memory needed is 3.8MB.

By having the archive available at a fixed location independent of
filesystem dependencies and conventions, all debugging tools can
directly refer to the fixed location for the archive, without concerning
with where the headers on a typical filesystem which significantly
simplifies tooling that needs kernel headers.

The code to read the headers is based on /proc/config.gz code and uses
the same technique to embed the headers.

To build a module, the below steps have been tested on an x86 machine:
modprobe kheaders
rm -rf $HOME/headers
mkdir -p $HOME/headers
tar -xvf /proc/kheaders.tar.xz -C $HOME/headers >/dev/null
cd my-kernel-module
make -C $HOME/headers M=$(pwd) modules
rmmod kheaders

Additional notes:
(1) external modules must be built on the same arch as the host that
built vmlinux. This can be done either in a qemu emulated chroot on the
target, or natively. This is due to host arch dependency of kernel
scripts.

(2)
If module building is used, since Module.symvers is not available in the
archive due to a cyclic dependency with building of the archive into the
kernel or module binaries, the modules built using the archive will not
contain symbol versioning (modversion). This is usually not an issue
since the idea of this patch is to build a kernel module on the fly and
load it into the same kernel. An appropriate warning is already printed
by the kernel to alert the user of modules not having modversions when
built using the archive. For building with modversions, the user can use
traditional header packages. For our tracing usecases, we build modules
on the fly with this so it is not a concern.

(3) I have left IKHD_ST and IKHD_ED markers as is to facilitate
future patches that would extract the headers from a kernel or module
image.

(v4 was Tested-by the following folks,
 v5 only has minor changes and has passed my testing).
Tested-by: qais.yousef@arm.com
Tested-by: dietmar.eggemann@arm.com
Tested-by: linux@manojrajarao.com
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---

    v4 -> v5:
    (Thanks to Masahiro Yamada for several excellent suggestions)
    - used incbin instead of bin2c (Masahiro did similar idea)
    - added module.lds if ia64 otherwise ia64 may fail to build.
    - added clean-files rule to Makefile
    - removed strip-comments script and doing it inline
    - added set -e to header generated to die on errorsr
    - fixed a minor issue where find command was noisy.
    - removed unneeded tar.xz rule from kernel/.gitignore
    - added Tested-by tags from ARM folks.

    Changes since v3:
    - Blank tar was being generated because of a one line I
      forgot to push. It is updated now.
    - Added module.lds since arm64 needs it to build modules.

    Changes since v2:
    (Thanks to Masahiro Yamada for several excellent suggestions)
    - Added support for out of tree builds.
    - Added incremental build support bringing down build time of
      incremental builds from 50 seconds to 5 seconds.
    - Fixed various small nits / cleanups.
    - clean ups to kheaders.c pointed by Alexey Dobriyan.
    - Fixed MODULE_LICENSE in test module and kheaders.c
    - Dropped Module.symvers from archive due to circular dependency.

    Changes since v1:
    - removed IKH_EXTRA variable, not needed (Masahiro Yamada)
    - small fix ups to selftest
       - added target to main Makefile etc
       - added MODULE_LICENSE to test module
       - made selftest more quiet

    Changes since RFC:
    Both changes bring size down to 3.8MB:
    - use xz for compression
    - strip comments except SPDX lines
    - Call out the module name in Kconfig
    - Also added selftests in second patch to ensure headers are always
    working.

 init/Kconfig            | 11 ++++++
 kernel/.gitignore       |  1 +
 kernel/Makefile         | 28 ++++++++++++++
 kernel/kheaders.c       | 73 +++++++++++++++++++++++++++++++++++
 scripts/gen_ikh_data.sh | 84 +++++++++++++++++++++++++++++++++++++++++
 5 files changed, 197 insertions(+)
 create mode 100644 kernel/kheaders.c
 create mode 100755 scripts/gen_ikh_data.sh

Comments

Andrew Morton March 20, 2019, 6:31 p.m. UTC | #1
On Wed, 20 Mar 2019 12:31:14 -0400 "Joel Fernandes (Google)" <joel@joelfernandes.org> wrote:

> Introduce in-kernel headers and other artifacts which are made available
> as an archive through proc (/proc/kheaders.tar.xz file).

Lazyass here hasn't been following the recent review discussion and is
going to try cheating.  Are there significant outstanding concerns now,
or are we good to go?
Joel Fernandes March 20, 2019, 7:42 p.m. UTC | #2
On Wed, Mar 20, 2019 at 11:31:11AM -0700, Andrew Morton wrote:
> On Wed, 20 Mar 2019 12:31:14 -0400 "Joel Fernandes (Google)" <joel@joelfernandes.org> wrote:
> 
> > Introduce in-kernel headers and other artifacts which are made available
> > as an archive through proc (/proc/kheaders.tar.xz file).
> 
> Lazyass here hasn't been following the recent review discussion and is
> going to try cheating.  Are there significant outstanding concerns now,
> or are we good to go?

Looks to me like it is pretty solid. v4->v5 was just cosmetic changes,
rebasing etc. Unless Masahiro has any other concerns, it could be a
linux-next candidate. I tested it quite heavily, but would be nice to get
some linux-next testing as well.

thanks,

 - Joel
Olof Johansson April 8, 2019, 4:29 p.m. UTC | #3
Hi,

On Wed, Mar 20, 2019 at 9:31 AM Joel Fernandes (Google)
<joel@joelfernandes.org> wrote:
>
> Introduce in-kernel headers and other artifacts which are made available
> as an archive through proc (/proc/kheaders.tar.xz file). This archive makes
> it possible to build kernel modules, run eBPF programs, and other
> tracing programs that need to extend the kernel for tracing purposes
> without any dependency on the file system having headers and build
> artifacts.
>
> On Android and embedded systems, it is common to switch kernels but not
> have kernel headers available on the file system. Further once a
> different kernel is booted, any headers stored on the file system will
> no longer be useful. By storing the headers as a compressed archive
> within the kernel, we can avoid these issues that have been a hindrance
> for a long time.
>
> The best way to use this feature is by building it in. Several users
> have a need for this, when they switch debug kernels, they donot want to
> update the filesystem or worry about it where to store the headers on
> it. However, the feature is also buildable as a module in case the user
> desires it not being part of the kernel image. This makes it possible to
> load and unload the headers from memory on demand. A tracing program, or
> a kernel module builder can load the module, do its operations, and then
> unload the module to save kernel memory. The total memory needed is 3.8MB.
>
> By having the archive available at a fixed location independent of
> filesystem dependencies and conventions, all debugging tools can
> directly refer to the fixed location for the archive, without concerning
> with where the headers on a typical filesystem which significantly
> simplifies tooling that needs kernel headers.
>
> The code to read the headers is based on /proc/config.gz code and uses
> the same technique to embed the headers.
>
> To build a module, the below steps have been tested on an x86 machine:
> modprobe kheaders
> rm -rf $HOME/headers
> mkdir -p $HOME/headers
> tar -xvf /proc/kheaders.tar.xz -C $HOME/headers >/dev/null
> cd my-kernel-module
> make -C $HOME/headers M=$(pwd) modules
> rmmod kheaders
>
> Additional notes:
> (1) external modules must be built on the same arch as the host that
> built vmlinux. This can be done either in a qemu emulated chroot on the
> target, or natively. This is due to host arch dependency of kernel
> scripts.
>
> (2)
> If module building is used, since Module.symvers is not available in the
> archive due to a cyclic dependency with building of the archive into the
> kernel or module binaries, the modules built using the archive will not
> contain symbol versioning (modversion). This is usually not an issue
> since the idea of this patch is to build a kernel module on the fly and
> load it into the same kernel. An appropriate warning is already printed
> by the kernel to alert the user of modules not having modversions when
> built using the archive. For building with modversions, the user can use
> traditional header packages. For our tracing usecases, we build modules
> on the fly with this so it is not a concern.
>
> (3) I have left IKHD_ST and IKHD_ED markers as is to facilitate
> future patches that would extract the headers from a kernel or module
> image.
>
> (v4 was Tested-by the following folks,
>  v5 only has minor changes and has passed my testing).
> Tested-by: qais.yousef@arm.com
> Tested-by: dietmar.eggemann@arm.com
> Tested-by: linux@manojrajarao.com
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

Sorry to be late at the party with this kind of feedback, but I find
the whole ".tar.gz in procfs" to be an awkward solution, especially if
there's expected to be userspace tooling that depends on this
long-term.

Wouldn't it be more convenient to provide it in a standardized format
such that you won't have to take an additional step, and always have
it in a known location?

Something like:

 - Pseudo-filesystem, that can just be mounted under
/sys/kernel/headers or something (similar to debugfs or
/proc/device-tree).
 - Exporting something like a squashfs image instead, allowing
loopback mounting of it (or by providing a pseudo-/dev entry for it),
again allowing direct export of the contents and avoiding the
extracted directory from being out of sync with currently running
kernel.

Having to copy and extract the tarball is the most awkward step, IMHO.
I also find the waste of kernel memory for it to be an issue, but
given that it can be built as a module I guess that's the obvious
solution for those who care about memory consumption.


-Olof
Daniel Colascione April 8, 2019, 4:37 p.m. UTC | #4
On Mon, Apr 8, 2019 at 9:29 AM Olof Johansson <olof@lixom.net> wrote:
>
> Hi,
>
> On Wed, Mar 20, 2019 at 9:31 AM Joel Fernandes (Google)
> <joel@joelfernandes.org> wrote:
> >
> > Introduce in-kernel headers and other artifacts which are made available
> > as an archive through proc (/proc/kheaders.tar.xz file). This archive makes
> > it possible to build kernel modules, run eBPF programs, and other
> > tracing programs that need to extend the kernel for tracing purposes
> > without any dependency on the file system having headers and build
> > artifacts.
> >
> > On Android and embedded systems, it is common to switch kernels but not
> > have kernel headers available on the file system. Further once a
> > different kernel is booted, any headers stored on the file system will
> > no longer be useful. By storing the headers as a compressed archive
> > within the kernel, we can avoid these issues that have been a hindrance
> > for a long time.
> >
> > The best way to use this feature is by building it in. Several users
> > have a need for this, when they switch debug kernels, they donot want to
> > update the filesystem or worry about it where to store the headers on
> > it. However, the feature is also buildable as a module in case the user
> > desires it not being part of the kernel image. This makes it possible to
> > load and unload the headers from memory on demand. A tracing program, or
> > a kernel module builder can load the module, do its operations, and then
> > unload the module to save kernel memory. The total memory needed is 3.8MB.
> >
> > By having the archive available at a fixed location independent of
> > filesystem dependencies and conventions, all debugging tools can
> > directly refer to the fixed location for the archive, without concerning
> > with where the headers on a typical filesystem which significantly
> > simplifies tooling that needs kernel headers.
> >
> > The code to read the headers is based on /proc/config.gz code and uses
> > the same technique to embed the headers.
> >
> > To build a module, the below steps have been tested on an x86 machine:
> > modprobe kheaders
> > rm -rf $HOME/headers
> > mkdir -p $HOME/headers
> > tar -xvf /proc/kheaders.tar.xz -C $HOME/headers >/dev/null
> > cd my-kernel-module
> > make -C $HOME/headers M=$(pwd) modules
> > rmmod kheaders
> >
> > Additional notes:
> > (1) external modules must be built on the same arch as the host that
> > built vmlinux. This can be done either in a qemu emulated chroot on the
> > target, or natively. This is due to host arch dependency of kernel
> > scripts.
> >
> > (2)
> > If module building is used, since Module.symvers is not available in the
> > archive due to a cyclic dependency with building of the archive into the
> > kernel or module binaries, the modules built using the archive will not
> > contain symbol versioning (modversion). This is usually not an issue
> > since the idea of this patch is to build a kernel module on the fly and
> > load it into the same kernel. An appropriate warning is already printed
> > by the kernel to alert the user of modules not having modversions when
> > built using the archive. For building with modversions, the user can use
> > traditional header packages. For our tracing usecases, we build modules
> > on the fly with this so it is not a concern.
> >
> > (3) I have left IKHD_ST and IKHD_ED markers as is to facilitate
> > future patches that would extract the headers from a kernel or module
> > image.
> >
> > (v4 was Tested-by the following folks,
> >  v5 only has minor changes and has passed my testing).
> > Tested-by: qais.yousef@arm.com
> > Tested-by: dietmar.eggemann@arm.com
> > Tested-by: linux@manojrajarao.com
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
>
> Sorry to be late at the party with this kind of feedback, but I find
> the whole ".tar.gz in procfs" to be an awkward solution, especially if
> there's expected to be userspace tooling that depends on this
> long-term.
> [snip]

The approaches you proposed were explored in detail on this thread and
other related threads. The tarball in proc approach is a simple,
pragmatic approach that allows makes a lot of scenarios Just Work
where they didn't before. Approaches like a new filesystem, a
mountable block device, a custom debuginfo format, and so on add
complexity without providing concrete gains in functionality. We'd
like to get this work into the tree sooner rather than later.
Olof Johansson April 8, 2019, 4:53 p.m. UTC | #5
On Mon, Apr 8, 2019 at 9:37 AM Daniel Colascione <dancol@google.com> wrote:
>
> On Mon, Apr 8, 2019 at 9:29 AM Olof Johansson <olof@lixom.net> wrote:
> >
> > Hi,
> >
> > On Wed, Mar 20, 2019 at 9:31 AM Joel Fernandes (Google)
> > <joel@joelfernandes.org> wrote:
> > >
> > > Introduce in-kernel headers and other artifacts which are made available
> > > as an archive through proc (/proc/kheaders.tar.xz file). This archive makes
> > > it possible to build kernel modules, run eBPF programs, and other
> > > tracing programs that need to extend the kernel for tracing purposes
> > > without any dependency on the file system having headers and build
> > > artifacts.
> > >
> > > On Android and embedded systems, it is common to switch kernels but not
> > > have kernel headers available on the file system. Further once a
> > > different kernel is booted, any headers stored on the file system will
> > > no longer be useful. By storing the headers as a compressed archive
> > > within the kernel, we can avoid these issues that have been a hindrance
> > > for a long time.
> > >
> > > The best way to use this feature is by building it in. Several users
> > > have a need for this, when they switch debug kernels, they donot want to
> > > update the filesystem or worry about it where to store the headers on
> > > it. However, the feature is also buildable as a module in case the user
> > > desires it not being part of the kernel image. This makes it possible to
> > > load and unload the headers from memory on demand. A tracing program, or
> > > a kernel module builder can load the module, do its operations, and then
> > > unload the module to save kernel memory. The total memory needed is 3.8MB.
> > >
> > > By having the archive available at a fixed location independent of
> > > filesystem dependencies and conventions, all debugging tools can
> > > directly refer to the fixed location for the archive, without concerning
> > > with where the headers on a typical filesystem which significantly
> > > simplifies tooling that needs kernel headers.
> > >
> > > The code to read the headers is based on /proc/config.gz code and uses
> > > the same technique to embed the headers.
> > >
> > > To build a module, the below steps have been tested on an x86 machine:
> > > modprobe kheaders
> > > rm -rf $HOME/headers
> > > mkdir -p $HOME/headers
> > > tar -xvf /proc/kheaders.tar.xz -C $HOME/headers >/dev/null
> > > cd my-kernel-module
> > > make -C $HOME/headers M=$(pwd) modules
> > > rmmod kheaders
> > >
> > > Additional notes:
> > > (1) external modules must be built on the same arch as the host that
> > > built vmlinux. This can be done either in a qemu emulated chroot on the
> > > target, or natively. This is due to host arch dependency of kernel
> > > scripts.
> > >
> > > (2)
> > > If module building is used, since Module.symvers is not available in the
> > > archive due to a cyclic dependency with building of the archive into the
> > > kernel or module binaries, the modules built using the archive will not
> > > contain symbol versioning (modversion). This is usually not an issue
> > > since the idea of this patch is to build a kernel module on the fly and
> > > load it into the same kernel. An appropriate warning is already printed
> > > by the kernel to alert the user of modules not having modversions when
> > > built using the archive. For building with modversions, the user can use
> > > traditional header packages. For our tracing usecases, we build modules
> > > on the fly with this so it is not a concern.
> > >
> > > (3) I have left IKHD_ST and IKHD_ED markers as is to facilitate
> > > future patches that would extract the headers from a kernel or module
> > > image.
> > >
> > > (v4 was Tested-by the following folks,
> > >  v5 only has minor changes and has passed my testing).
> > > Tested-by: qais.yousef@arm.com
> > > Tested-by: dietmar.eggemann@arm.com
> > > Tested-by: linux@manojrajarao.com
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> >
> > Sorry to be late at the party with this kind of feedback, but I find
> > the whole ".tar.gz in procfs" to be an awkward solution, especially if
> > there's expected to be userspace tooling that depends on this
> > long-term.
> > [snip]
>
> The approaches you proposed were explored in detail on this thread and
> other related threads.

Not on this thread, but maybe on others. Since they weren't linked and
referenced, I didn't find them. Having them summarized in the patch
description would be a great idea.

> The tarball in proc approach is a simple,
> pragmatic approach that allows makes a lot of scenarios Just Work
> where they didn't before.

"My pragmatic solution, your messy hack".

It's an awkward solution that will now be permanently locked in due to
the /proc interface being an ABI.

> Approaches like a new filesystem, a
> mountable block device, a custom debuginfo format, and so on add
> complexity without providing concrete gains in functionality.

It definitely provides concrete gains in functionality, in particular
it provides a significantly less fragile way of providing the data
such that having it out of sync with the running kernel is a lot less
accident-prone.

> We'd
> like to get this work into the tree sooner rather than later.

That has _never_ been a good argument for picking up something that
will need to be supported as an ABI forever.

We've solved these kind of things in the past without exporting
tarballs from the kernel. We can do it again.


I literally didn't hear a single valid reason for why this patch
should go in from you.


-Olof
Joel Fernandes April 8, 2019, 8:36 p.m. UTC | #6
On Mon, Apr 08, 2019 at 09:29:30AM -0700, Olof Johansson wrote:
> Hi,
> 
> On Wed, Mar 20, 2019 at 9:31 AM Joel Fernandes (Google)
> <joel@joelfernandes.org> wrote:
> >
> > Introduce in-kernel headers and other artifacts which are made available
> > as an archive through proc (/proc/kheaders.tar.xz file). This archive makes
> > it possible to build kernel modules, run eBPF programs, and other
> > tracing programs that need to extend the kernel for tracing purposes
> > without any dependency on the file system having headers and build
> > artifacts.
> >
> > On Android and embedded systems, it is common to switch kernels but not
> > have kernel headers available on the file system. Further once a
> > different kernel is booted, any headers stored on the file system will
> > no longer be useful. By storing the headers as a compressed archive
> > within the kernel, we can avoid these issues that have been a hindrance
> > for a long time.
> >
> > The best way to use this feature is by building it in. Several users
> > have a need for this, when they switch debug kernels, they donot want to
> > update the filesystem or worry about it where to store the headers on
> > it. However, the feature is also buildable as a module in case the user
> > desires it not being part of the kernel image. This makes it possible to
> > load and unload the headers from memory on demand. A tracing program, or
> > a kernel module builder can load the module, do its operations, and then
> > unload the module to save kernel memory. The total memory needed is 3.8MB.
> >
> > By having the archive available at a fixed location independent of
> > filesystem dependencies and conventions, all debugging tools can
> > directly refer to the fixed location for the archive, without concerning
> > with where the headers on a typical filesystem which significantly
> > simplifies tooling that needs kernel headers.
> >
> > The code to read the headers is based on /proc/config.gz code and uses
> > the same technique to embed the headers.
> >
> > To build a module, the below steps have been tested on an x86 machine:
> > modprobe kheaders
> > rm -rf $HOME/headers
> > mkdir -p $HOME/headers
> > tar -xvf /proc/kheaders.tar.xz -C $HOME/headers >/dev/null
> > cd my-kernel-module
> > make -C $HOME/headers M=$(pwd) modules
> > rmmod kheaders
> >
> > Additional notes:
> > (1) external modules must be built on the same arch as the host that
> > built vmlinux. This can be done either in a qemu emulated chroot on the
> > target, or natively. This is due to host arch dependency of kernel
> > scripts.
> >
> > (2)
> > If module building is used, since Module.symvers is not available in the
> > archive due to a cyclic dependency with building of the archive into the
> > kernel or module binaries, the modules built using the archive will not
> > contain symbol versioning (modversion). This is usually not an issue
> > since the idea of this patch is to build a kernel module on the fly and
> > load it into the same kernel. An appropriate warning is already printed
> > by the kernel to alert the user of modules not having modversions when
> > built using the archive. For building with modversions, the user can use
> > traditional header packages. For our tracing usecases, we build modules
> > on the fly with this so it is not a concern.
> >
> > (3) I have left IKHD_ST and IKHD_ED markers as is to facilitate
> > future patches that would extract the headers from a kernel or module
> > image.
> >
> > (v4 was Tested-by the following folks,
> >  v5 only has minor changes and has passed my testing).
> > Tested-by: qais.yousef@arm.com
> > Tested-by: dietmar.eggemann@arm.com
> > Tested-by: linux@manojrajarao.com
> > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> 
> Sorry to be late at the party with this kind of feedback, but I find
> the whole ".tar.gz in procfs" to be an awkward solution, especially if
> there's expected to be userspace tooling that depends on this
> long-term.

No problem, your feedback is welcome.

> Wouldn't it be more convenient to provide it in a standardized format
> such that you won't have to take an additional step, and always have
> This is that form IMO.

The location of the archive is fixed/known. If you are talking of the
location where the user decompresses it to, then they a;ready know where they
are decompressing to.

> Something like:
> 
>  - Pseudo-filesystem, that can just be mounted under
> /sys/kernel/headers or something (similar to debugfs or
> /proc/device-tree).

The headers are huge if uncompressed (~30MB). Currently we use xz compression
in the archive. It would be a huge waste to decompress everything into
memory such as through an in-memory filesystem. And compressing on a
per-file basis would be too slow for build time. Currently the build of the
archive is extrememly fast.

>  - Exporting something like a squashfs image instead, allowing
> loopback mounting of it (or by providing a pseudo-/dev entry for it),
> again allowing direct export of the contents and avoiding the
> extracted directory from being out of sync with currently running
> kernel.

One drawback of squashfs (other than possibly the compression ratio) is that
this would be kernel build unfriendly in comparison to tar+xz. On my machine,
squashfs-tools needed to be installed. For users who don't have this package,
that would break their kernel build.

> Having to copy and extract the tarball is the most awkward step, IMHO.
> I also find the waste of kernel memory for it to be an issue, but
> given that it can be built as a module I guess that's the obvious
> solution for those who care about memory consumption.

Yes. We discussed in previous threads that for users who really want the
archive to be completely uncompressed and in-memory, can just load the
module, decompress into tmpfs, and unload the module. That is an extra step,
yes.

We had close to 2-3 months of discussions now with various folks up until v5.
I am about to post v6 which is in line with Masahiro Yamada's expecations. In
that I will be dropping module building artifacts due to his module building
concerns and only include the headers.

thanks,

- Joel
Karim Yaghmour April 8, 2019, 8:52 p.m. UTC | #7
Hi Olof,

On 4/8/19 12:29 PM, Olof Johansson wrote:
> Sorry to be late at the party with this kind of feedback, but I find
> the whole ".tar.gz in procfs" to be an awkward solution, especially if
> there's expected to be userspace tooling that depends on this
> long-term.
> 
> Wouldn't it be more convenient to provide it in a standardized format
> such that you won't have to take an additional step, and always have
> it in a known location?
> 
> Something like:
> 
>   - Pseudo-filesystem, that can just be mounted under
> /sys/kernel/headers or something (similar to debugfs or
> /proc/device-tree).
>   - Exporting something like a squashfs image instead, allowing
> loopback mounting of it (or by providing a pseudo-/dev entry for it),
> again allowing direct export of the contents and avoiding the
> extracted directory from being out of sync with currently running
> kernel.
> 
> Having to copy and extract the tarball is the most awkward step, IMHO.
> I also find the waste of kernel memory for it to be an issue, but
> given that it can be built as a module I guess that's the obvious
> solution for those who care about memory consumption.

One of the things I pointed out earlier in the thread is that 
/proc/config.gz has already set a precedent as to the interface for this 
sort of artifact. It's a plain compressed file and it's directly 
accessible from toplevel /proc. From a consistency perspective there's 
an idiomatic angle to some sort of "/proc/kheaders.gz".

In some offline discussions I was also told that squashfs (I'm no expert 
of it) required special user-space tools and had some security issues.

Cheers,
Olof Johansson April 10, 2019, 3:07 p.m. UTC | #8
On Mon, Apr 8, 2019 at 1:36 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> On Mon, Apr 08, 2019 at 09:29:30AM -0700, Olof Johansson wrote:
> > Hi,
> >
> > On Wed, Mar 20, 2019 at 9:31 AM Joel Fernandes (Google)
> > <joel@joelfernandes.org> wrote:
> > >
> > > Introduce in-kernel headers and other artifacts which are made available
> > > as an archive through proc (/proc/kheaders.tar.xz file). This archive makes
> > > it possible to build kernel modules, run eBPF programs, and other
> > > tracing programs that need to extend the kernel for tracing purposes
> > > without any dependency on the file system having headers and build
> > > artifacts.
> > >
> > > On Android and embedded systems, it is common to switch kernels but not
> > > have kernel headers available on the file system. Further once a
> > > different kernel is booted, any headers stored on the file system will
> > > no longer be useful. By storing the headers as a compressed archive
> > > within the kernel, we can avoid these issues that have been a hindrance
> > > for a long time.
> > >
> > > The best way to use this feature is by building it in. Several users
> > > have a need for this, when they switch debug kernels, they donot want to
> > > update the filesystem or worry about it where to store the headers on
> > > it. However, the feature is also buildable as a module in case the user
> > > desires it not being part of the kernel image. This makes it possible to
> > > load and unload the headers from memory on demand. A tracing program, or
> > > a kernel module builder can load the module, do its operations, and then
> > > unload the module to save kernel memory. The total memory needed is 3.8MB.
> > >
> > > By having the archive available at a fixed location independent of
> > > filesystem dependencies and conventions, all debugging tools can
> > > directly refer to the fixed location for the archive, without concerning
> > > with where the headers on a typical filesystem which significantly
> > > simplifies tooling that needs kernel headers.
> > >
> > > The code to read the headers is based on /proc/config.gz code and uses
> > > the same technique to embed the headers.
> > >
> > > To build a module, the below steps have been tested on an x86 machine:
> > > modprobe kheaders
> > > rm -rf $HOME/headers
> > > mkdir -p $HOME/headers
> > > tar -xvf /proc/kheaders.tar.xz -C $HOME/headers >/dev/null
> > > cd my-kernel-module
> > > make -C $HOME/headers M=$(pwd) modules
> > > rmmod kheaders
> > >
> > > Additional notes:
> > > (1) external modules must be built on the same arch as the host that
> > > built vmlinux. This can be done either in a qemu emulated chroot on the
> > > target, or natively. This is due to host arch dependency of kernel
> > > scripts.
> > >
> > > (2)
> > > If module building is used, since Module.symvers is not available in the
> > > archive due to a cyclic dependency with building of the archive into the
> > > kernel or module binaries, the modules built using the archive will not
> > > contain symbol versioning (modversion). This is usually not an issue
> > > since the idea of this patch is to build a kernel module on the fly and
> > > load it into the same kernel. An appropriate warning is already printed
> > > by the kernel to alert the user of modules not having modversions when
> > > built using the archive. For building with modversions, the user can use
> > > traditional header packages. For our tracing usecases, we build modules
> > > on the fly with this so it is not a concern.
> > >
> > > (3) I have left IKHD_ST and IKHD_ED markers as is to facilitate
> > > future patches that would extract the headers from a kernel or module
> > > image.
> > >
> > > (v4 was Tested-by the following folks,
> > >  v5 only has minor changes and has passed my testing).
> > > Tested-by: qais.yousef@arm.com
> > > Tested-by: dietmar.eggemann@arm.com
> > > Tested-by: linux@manojrajarao.com
> > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> >
> > Sorry to be late at the party with this kind of feedback, but I find
> > the whole ".tar.gz in procfs" to be an awkward solution, especially if
> > there's expected to be userspace tooling that depends on this
> > long-term.
>
> No problem, your feedback is welcome.
>
> > Wouldn't it be more convenient to provide it in a standardized format
> > such that you won't have to take an additional step, and always have
> > This is that form IMO.
>
> The location of the archive is fixed/known. If you are talking of the
> location where the user decompresses it to, then they a;ready know where they
> are decompressing to.

The location _of_ the archive, sure. But the format of what is in the
tarball, how it is versioned, and how to manage it will have to be
done by every user.

For any script that doesn't depend on some shared system state that
wants to, say, build a eBPF program and load it, it would need to
extract the tarball from scratch to make sure it is the current
correct version of it.

If that's required by all users, why not just present the data in a
way that it can be used directly?

> > Something like:
> >
> >  - Pseudo-filesystem, that can just be mounted under
> > /sys/kernel/headers or something (similar to debugfs or
> > /proc/device-tree).
>
> The headers are huge if uncompressed (~30MB). Currently we use xz compression
> in the archive. It would be a huge waste to decompress everything into
> memory such as through an in-memory filesystem. And compressing on a
> per-file basis would be too slow for build time. Currently the build of the
> archive is extrememly fast.

Keeping it around at all times in memory seems like a significant
waste, I agree.

Providing a standard way of presenting the contents without more
requirements on userspace, and without building up new cargo cult
methods for how to prepare the headers, would still be useful though
(see below).

> >  - Exporting something like a squashfs image instead, allowing
> > loopback mounting of it (or by providing a pseudo-/dev entry for it),
> > again allowing direct export of the contents and avoiding the
> > extracted directory from being out of sync with currently running
> > kernel.
>
> One drawback of squashfs (other than possibly the compression ratio) is that
> this would be kernel build unfriendly in comparison to tar+xz. On my machine,
> squashfs-tools needed to be installed. For users who don't have this package,
> that would break their kernel build.

Adding a new tool that is required to use a new feature isn't that bad
-- it's not like you're breaking the build for everyone.

We've also done this before in the past, by importing the tools into
the kernel tree if needed. It can be solved.

> > Having to copy and extract the tarball is the most awkward step, IMHO.
> > I also find the waste of kernel memory for it to be an issue, but
> > given that it can be built as a module I guess that's the obvious
> > solution for those who care about memory consumption.
>
> Yes. We discussed in previous threads that for users who really want the
> archive to be completely uncompressed and in-memory, can just load the
> module, decompress into tmpfs, and unload the module. That is an extra step,
> yes.

Most users will need to decompress it every time they use it anyway,
especially if there's no versioned prefix in the tarball that they can
use to key to a previously decompressed version with the exact same
kernel version and config.

So, if you need to do that anyway, wouldn't it be easier if you just
mounted a FS to get to it. If you're on a system where you can't use
it in-place for resource reasons, you can copy it off and unmount it.
No extra tools needed in userspace then at run/use time.

Said filesystem could be populated by a compressed cpio archive since
we already have code in the kernel to do this for initramfs, and could
do so at mount time -- and at unmount time it'd be freed up.

If you absolutely need to export a file to userspace with the archive,
my suggestion is to do it through debugfs. That way the format isn't
in a /proc ABI that can't be changed in the future (debugfs isn't
required to be stable in the same way). This way we can change the
format carried in the kernel over time without changing the official
way we present the data to userspace (via a filesystem view).

As far as format goes; there's clear precedent on cpio being used and
supported; we already have build time requirements on the userspace
tools with some options. Using tar would actually be a new dependency
even if it is a common tool to have installed. With a self-populating
FS, there's no new tool requirements on the runtime side either.

> We had close to 2-3 months of discussions now with various folks up until v5.
> I am about to post v6 which is in line with Masahiro Yamada's expecations. In
> that I will be dropping module building artifacts due to his module building
> concerns and only include the headers.

I've found some of the old discussion and read up on it. I think it
was pretty quick at dismissing ideas for more robust implementations
("it needs squashfs-tools"), and had some narrow viewpoints (exporting
a tarball is the least amount of kernel change, while adding
complexity at the system/usage side).

I'd also like to clarify: I'm not opposed to the general idea of
providing the needed headers with the kernel somehow. I just think
it's worth spending effort making sure an interface for it that we'll
need to live with forever is appropriately thought through and not
rushed in, especially since we're likely to get substantial
infrastructure on top of it quickly (eBPF and friends in particular).


-Olof
Olof Johansson April 10, 2019, 3:15 p.m. UTC | #9
On Mon, Apr 8, 2019 at 1:52 PM Karim Yaghmour
<karim.yaghmour@opersys.com> wrote:
>
>
> Hi Olof,
>
> On 4/8/19 12:29 PM, Olof Johansson wrote:
> > Sorry to be late at the party with this kind of feedback, but I find
> > the whole ".tar.gz in procfs" to be an awkward solution, especially if
> > there's expected to be userspace tooling that depends on this
> > long-term.
> >
> > Wouldn't it be more convenient to provide it in a standardized format
> > such that you won't have to take an additional step, and always have
> > it in a known location?
> >
> > Something like:
> >
> >   - Pseudo-filesystem, that can just be mounted under
> > /sys/kernel/headers or something (similar to debugfs or
> > /proc/device-tree).
> >   - Exporting something like a squashfs image instead, allowing
> > loopback mounting of it (or by providing a pseudo-/dev entry for it),
> > again allowing direct export of the contents and avoiding the
> > extracted directory from being out of sync with currently running
> > kernel.
> >
> > Having to copy and extract the tarball is the most awkward step, IMHO.
> > I also find the waste of kernel memory for it to be an issue, but
> > given that it can be built as a module I guess that's the obvious
> > solution for those who care about memory consumption.
>
> One of the things I pointed out earlier in the thread is that
> /proc/config.gz has already set a precedent as to the interface for this
> sort of artifact. It's a plain compressed file and it's directly
> accessible from toplevel /proc. From a consistency perspective there's
> an idiomatic angle to some sort of "/proc/kheaders.gz".

I'm not arguing against providing the headers in some format, I think
that's a good idea.

On similarities, there are some but there are also substantial
differences in the use model.

For the config file, the main use cases are:

 - Checking to make sure that the running kernel has a particular set
of config options set or cleared.
 - Ease of cloning the config of a running kernel when building a new one.

The file format is just a plain text file, even if compressed. No real
internal structure to consider.

Both of the above uses are relatively rare (well, the first might be
done in some startup scripts, etc).

The kernel headers case is different. The file format is more complex
(tarball, which would also include the structure of said tarball). You
can't just zgrep to get some data out.

Also, the way the contents is used is different, in that it will be
needed by runtime tools that build and load eBPF programs. For the
build to always be known to be against the running headers, every
build would likely need to decompress and stage said tarball
independently and not rely on previous state. If that's needed, why
not just provide it once in the right format and avoid people building
userspace solutions in several different ways to do the same thing?

> In some offline discussions I was also told that squashfs (I'm no expert
> of it) required special user-space tools and had some security issues.

I'm unaware of what the security issues are, and there's indeed a
GPLv2 tool needed to construct the filesystem. The latter can be
solved, the former I don't know enough about to have an opinion.

Anyway, see my other reply just now -- CPIO + a filesystem view, and
providing said cpio archive in debugfs for those who want to copy it
off themselves might be something that fits everybody.


-Olof
Daniel Colascione April 10, 2019, 3:44 p.m. UTC | #10
On Wed, Apr 10, 2019 at 8:16 AM Olof Johansson <olof@lixom.net> wrote:
> Anyway, see my other reply just now -- CPIO + a filesystem view, and
> providing said cpio archive in debugfs for those who want to copy it
> off themselves might be something that fits everybody.

I don't think it's worth increasing the complexity of the kernel for
the sake of a little userspace convenience. By including the headers
in the kernel, we solve an important coordination problem --- but we
should solve this problem in the simplest possible way, pushing
whatever complexity possible to userspace, which is more flexible and
which evolves faster. Providing a compressed archive is fine. The user
burden is not large: it's just extracting an archive from a standard
container format. The kernel has no special advantageous way of doing
this job. If userspace wants to cache extraction, it can hash the
compressed archive itself and use the result as a key: it's only 3MB.
But extracting the archive every time is fine. We're not talking about
a huge amount of data, and I don't see a need for the kind of
long-term caching that would require revalidation.

As for cpio vs tar: IME, people are much more familiar with the
latter, and they're both omnipresent on unixish systems. I think it's
fine to rely on a tool that's been part of POSIX for over 30 years.
Joel Fernandes April 10, 2019, 3:50 p.m. UTC | #11
On Wed, Apr 10, 2019 at 11:07 AM Olof Johansson <olof@lixom.net> wrote:
[snip]
> > > Wouldn't it be more convenient to provide it in a standardized format
> > > such that you won't have to take an additional step, and always have
> > > This is that form IMO.
> >
> > The location of the archive is fixed/known. If you are talking of the
> > location where the user decompresses it to, then they a;ready know where they
> > are decompressing to.
>
> The location _of_ the archive, sure. But the format of what is in the
> tarball, how it is versioned, and how to manage it will have to be
> done by every user.
>
> For any script that doesn't depend on some shared system state that
> wants to, say, build a eBPF program and load it, it would need to
> extract the tarball from scratch to make sure it is the current
> correct version of it.
>
> If that's required by all users, why not just present the data in a
> way that it can be used directly?

That is the part that is unclear from your proposal. If we present a
filesystem view, then I am assuming the data will have to be
decompressed first into memory. That means you are proposing use of
30MB uncompressed memory. The whole archive has to be decompressed but
the whole archive if compressed with XZ for a maximum compression
ratio.

> > > Having to copy and extract the tarball is the most awkward step, IMHO.
> > > I also find the waste of kernel memory for it to be an issue, but
> > > given that it can be built as a module I guess that's the obvious
> > > solution for those who care about memory consumption.
> >
> > Yes. We discussed in previous threads that for users who really want the
> > archive to be completely uncompressed and in-memory, can just load the
> > module, decompress into tmpfs, and unload the module. That is an extra step,
> > yes.
>
> Most users will need to decompress it every time they use it anyway,
> especially if there's no versioned prefix in the tarball that they can
> use to key to a previously decompressed version with the exact same
> kernel version and config.
>
> So, if you need to do that anyway, wouldn't it be easier if you just
> mounted a FS to get to it. If you're on a system where you can't use
> it in-place for resource reasons, you can copy it off and unmount it.
> No extra tools needed in userspace then at run/use time.
>
> Said filesystem could be populated by a compressed cpio archive since
> we already have code in the kernel to do this for initramfs, and could
> do so at mount time -- and at unmount time it'd be freed up.

But still, decompressing to the filesystem in a scratch area may be
better than decompressing to RAM, for some users who have lesser RAM.
This patchset does not enforce a certain way of doing things and
leaves it to the user.

> If you absolutely need to export a file to userspace with the archive,
> my suggestion is to do it through debugfs. That way the format isn't
> in a /proc ABI that can't be changed in the future (debugfs isn't
> required to be stable in the same way). This way we can change the
> format carried in the kernel over time without changing the official
> way we present the data to userspace (via a filesystem view).
>
> As far as format goes; there's clear precedent on cpio being used and
> supported; we already have build time requirements on the userspace
> tools with some options. Using tar would actually be a new dependency
> even if it is a common tool to have installed. With a self-populating
> FS, there's no new tool requirements on the runtime side either.

debugfs is going away for Android and is controversial in the fact
that its functionality isn't guaranteed to be there (debugfs breakages
aren't necessarily bugs AFAIK). So this isn't an option.

> > We had close to 2-3 months of discussions now with various folks up until v5.
> > I am about to post v6 which is in line with Masahiro Yamada's expecations. In
> > that I will be dropping module building artifacts due to his module building
> > concerns and only include the headers.
>
> I've found some of the old discussion and read up on it. I think it
> was pretty quick at dismissing ideas for more robust implementations
> ("it needs squashfs-tools"), and had some narrow viewpoints (exporting
> a tarball is the least amount of kernel change, while adding
> complexity at the system/usage side).

Honestly, that's kind of unfair to be quoting just a few points like
that. If I remember there were 100s of emails and many good view
points were brought up by many people. We have done the diligence in
the discussions of this over a period of time.

> I'd also like to clarify: I'm not opposed to the general idea of
> providing the needed headers with the kernel somehow. I just think
> it's worth spending effort making sure an interface for it that we'll
> need to live with forever is appropriately thought through and not
> rushed in, especially since we're likely to get substantial
> infrastructure on top of it quickly (eBPF and friends in particular).

We have spent the time :) This seems like the best solution of all.
Greg KH and other maintainers are also supportive of it as can be seen
in other threads. We can consider an alternate proposal if it is
better, but I don't see any better one proposed at the moment.

- squashfs-tools requirement on the build really sucks
- cpio uncompressed to memory equally sucks because it consumes all
the memory uncompressed instead of reclaimable pages
- decompressing into tmpfs will suck for Android because we don't use
disk-based swap and we run into the same cpio issue above. We use ZRAM
for compressed swap.
- debugfs is a non-option for Android

The tar+xz is trivially created without depending on squashfs-tools.
And xz provides the maximum compression ratio in our experiments.
Decompression time is a non-issue since trace tools are using it.

The filesystem view sounds using mount/unmount like a pony to me, but
it does not meet the requirements above. Let me know if I am missing
something.

thanks,
 - Joel
Olof Johansson April 10, 2019, 4:34 p.m. UTC | #12
On Wed, Apr 10, 2019 at 8:51 AM Joel Fernandes <joelaf@google.com> wrote:
>
> On Wed, Apr 10, 2019 at 11:07 AM Olof Johansson <olof@lixom.net> wrote:
> [snip]
> > > > Wouldn't it be more convenient to provide it in a standardized format
> > > > such that you won't have to take an additional step, and always have
> > > > This is that form IMO.
> > >
> > > The location of the archive is fixed/known. If you are talking of the
> > > location where the user decompresses it to, then they a;ready know where they
> > > are decompressing to.
> >
> > The location _of_ the archive, sure. But the format of what is in the
> > tarball, how it is versioned, and how to manage it will have to be
> > done by every user.
> >
> > For any script that doesn't depend on some shared system state that
> > wants to, say, build a eBPF program and load it, it would need to
> > extract the tarball from scratch to make sure it is the current
> > correct version of it.
> >
> > If that's required by all users, why not just present the data in a
> > way that it can be used directly?
>
> That is the part that is unclear from your proposal. If we present a
> filesystem view, then I am assuming the data will have to be
> decompressed first into memory. That means you are proposing use of
> 30MB uncompressed memory. The whole archive has to be decompressed but
> the whole archive if compressed with XZ for a maximum compression
> ratio.

Only while the filesystem is mounted. So you would do something like:

 - Mount filesystem
 - Build and load
 - Unmount

The 30MB would only be used while the filesystem is mounted.

Compared to:
 - Extract tarball
 - Build and load
 - Remove file tree from filesystem

> > > > Having to copy and extract the tarball is the most awkward step, IMHO.
> > > > I also find the waste of kernel memory for it to be an issue, but
> > > > given that it can be built as a module I guess that's the obvious
> > > > solution for those who care about memory consumption.
> > >
> > > Yes. We discussed in previous threads that for users who really want the
> > > archive to be completely uncompressed and in-memory, can just load the
> > > module, decompress into tmpfs, and unload the module. That is an extra step,
> > > yes.
> >
> > Most users will need to decompress it every time they use it anyway,
> > especially if there's no versioned prefix in the tarball that they can
> > use to key to a previously decompressed version with the exact same
> > kernel version and config.
> >
> > So, if you need to do that anyway, wouldn't it be easier if you just
> > mounted a FS to get to it. If you're on a system where you can't use
> > it in-place for resource reasons, you can copy it off and unmount it.
> > No extra tools needed in userspace then at run/use time.
> >
> > Said filesystem could be populated by a compressed cpio archive since
> > we already have code in the kernel to do this for initramfs, and could
> > do so at mount time -- and at unmount time it'd be freed up.
>
> But still, decompressing to the filesystem in a scratch area may be
> better than decompressing to RAM, for some users who have lesser RAM.
> This patchset does not enforce a certain way of doing things and
> leaves it to the user.

There are lots of things where we provide suitable ways of doing
things to the user instead of making them come up with their own
handling of things. devtmpfs is a perfect example of this -- doing
things in userspace was perfectly possible but still a hassle in many
cases, and having the kernel do it for you when it already has the
data makes sense.

I'd expect many users to still want to do this to tmpfs. Also, I
expect whatever userspace tools and programs that will consume this
data is likely to consume similar or more memory while running anyway.
So mounting + copying + unmounting on the heavily constrained systems
shouldn't be raising the high water mark on memory consumption.

> > If you absolutely need to export a file to userspace with the archive,
> > my suggestion is to do it through debugfs. That way the format isn't
> > in a /proc ABI that can't be changed in the future (debugfs isn't
> > required to be stable in the same way). This way we can change the
> > format carried in the kernel over time without changing the official
> > way we present the data to userspace (via a filesystem view).
> >
> > As far as format goes; there's clear precedent on cpio being used and
> > supported; we already have build time requirements on the userspace
> > tools with some options. Using tar would actually be a new dependency
> > even if it is a common tool to have installed. With a self-populating
> > FS, there's no new tool requirements on the runtime side either.
>
> debugfs is going away for Android and is controversial in the fact
> that its functionality isn't guaranteed to be there (debugfs breakages
> aren't necessarily bugs AFAIK). So this isn't an option.

The argument that this needs to go into /proc because Android is
removing debugfs isn't a very strong one.

And "debugfs breakages aren't bugs" is exactly why I'm suggesting to
do the non-supported export of the archive that way instead.

> > > We had close to 2-3 months of discussions now with various folks up until v5.
> > > I am about to post v6 which is in line with Masahiro Yamada's expecations. In
> > > that I will be dropping module building artifacts due to his module building
> > > concerns and only include the headers.
> >
> > I've found some of the old discussion and read up on it. I think it
> > was pretty quick at dismissing ideas for more robust implementations
> > ("it needs squashfs-tools"), and had some narrow viewpoints (exporting
> > a tarball is the least amount of kernel change, while adding
> > complexity at the system/usage side).
>
> Honestly, that's kind of unfair to be quoting just a few points like
> that. If I remember there were 100s of emails and many good view
> points were brought up by many people. We have done the diligence in
> the discussions of this over a period of time.

That wasn't captured with the patch submission, and having people go
find 100s of emails to figure out why your seemingly lacking solution
is the best one available is not how you motivate getting your code
into the kernel.

> > I'd also like to clarify: I'm not opposed to the general idea of
> > providing the needed headers with the kernel somehow. I just think
> > it's worth spending effort making sure an interface for it that we'll
> > need to live with forever is appropriately thought through and not
> > rushed in, especially since we're likely to get substantial
> > infrastructure on top of it quickly (eBPF and friends in particular).
>
> We have spent the time :) This seems like the best solution of all.

That should be documented.

> Greg KH and other maintainers are also supportive of it as can be seen
> in other threads.

I've found support for the desire to provide headers. If there's so
much support for this solution, the number of Acks to the patch should
have been higher.

> We can consider an alternate proposal if it is
> better, but I don't see any better one proposed at the moment.

Really?

> - squashfs-tools requirement on the build really sucks

Nah, this is a minor detail.

> - cpio uncompressed to memory equally sucks because it consumes all
> the memory uncompressed instead of reclaimable pages

Only while mounted.

> - decompressing into tmpfs will suck for Android because we don't use
> disk-based swap and we run into the same cpio issue above. We use ZRAM
> for compressed swap.

See comments above about high water marks for memory consumption
likely not moving much.

> - debugfs is a non-option for Android

Not my problem.

> The tar+xz is trivially created without depending on squashfs-tools.

It adds a new dependency on tar.

> And xz provides the maximum compression ratio in our experiments.

Sure.

> Decompression time is a non-issue since trace tools are using it.

Sure.

> The filesystem view sounds using mount/unmount like a pony to me, but
> it does not meet the requirements above. Let me know if I am missing
> something.

What requirements?


-Olof
Joel Fernandes April 10, 2019, 5:33 p.m. UTC | #13
On Wed, Apr 10, 2019 at 12:35 PM Olof Johansson <olof@lixom.net> wrote:
>
> On Wed, Apr 10, 2019 at 8:51 AM Joel Fernandes <joelaf@google.com> wrote:
> >
> > On Wed, Apr 10, 2019 at 11:07 AM Olof Johansson <olof@lixom.net> wrote:
> > [snip]
> > > > > Wouldn't it be more convenient to provide it in a standardized format
> > > > > such that you won't have to take an additional step, and always have
> > > > > This is that form IMO.
> > > >
> > > > The location of the archive is fixed/known. If you are talking of the
> > > > location where the user decompresses it to, then they a;ready know where they
> > > > are decompressing to.
> > >
> > > The location _of_ the archive, sure. But the format of what is in the
> > > tarball, how it is versioned, and how to manage it will have to be
> > > done by every user.
> > >
> > > For any script that doesn't depend on some shared system state that
> > > wants to, say, build a eBPF program and load it, it would need to
> > > extract the tarball from scratch to make sure it is the current
> > > correct version of it.
> > >
> > > If that's required by all users, why not just present the data in a
> > > way that it can be used directly?
> >
> > That is the part that is unclear from your proposal. If we present a
> > filesystem view, then I am assuming the data will have to be
> > decompressed first into memory. That means you are proposing use of
> > 30MB uncompressed memory. The whole archive has to be decompressed but
> > the whole archive if compressed with XZ for a maximum compression
> > ratio.
>
> Only while the filesystem is mounted. So you would do something like:
>
>  - Mount filesystem
>  - Build and load
>  - Unmount
>
> The 30MB would only be used while the filesystem is mounted.
>
> Compared to:
>  - Extract tarball
>  - Build and load
>  - Remove file tree from filesystem

I feel there is no benefit in this proposal and adds considerable
complexity to the kernel for no benefit. Only drawbacks - will likely
do much poorly on lower memory devices, addition of more complexity
and likely bugs, etc.

> > > > > Having to copy and extract the tarball is the most awkward step, IMHO.
> > > > > I also find the waste of kernel memory for it to be an issue, but
> > > > > given that it can be built as a module I guess that's the obvious
> > > > > solution for those who care about memory consumption.
> > > >
> > > > Yes. We discussed in previous threads that for users who really want the
> > > > archive to be completely uncompressed and in-memory, can just load the
> > > > module, decompress into tmpfs, and unload the module. That is an extra step,
> > > > yes.
> > >
> > > Most users will need to decompress it every time they use it anyway,
> > > especially if there's no versioned prefix in the tarball that they can
> > > use to key to a previously decompressed version with the exact same
> > > kernel version and config.
> > >
> > > So, if you need to do that anyway, wouldn't it be easier if you just
> > > mounted a FS to get to it. If you're on a system where you can't use
> > > it in-place for resource reasons, you can copy it off and unmount it.
> > > No extra tools needed in userspace then at run/use time.
> > >
> > > Said filesystem could be populated by a compressed cpio archive since
> > > we already have code in the kernel to do this for initramfs, and could
> > > do so at mount time -- and at unmount time it'd be freed up.
> >
> > But still, decompressing to the filesystem in a scratch area may be
> > better than decompressing to RAM, for some users who have lesser RAM.
> > This patchset does not enforce a certain way of doing things and
> > leaves it to the user.
>
> There are lots of things where we provide suitable ways of doing
> things to the user instead of making them come up with their own
> handling of things. devtmpfs is a perfect example of this -- doing
> things in userspace was perfectly possible but still a hassle in many
> cases, and having the kernel do it for you when it already has the
> data makes sense.
>
> I'd expect many users to still want to do this to tmpfs. Also, I
> expect whatever userspace tools and programs that will consume this
> data is likely to consume similar or more memory while running anyway.
> So mounting + copying + unmounting on the heavily constrained systems
> shouldn't be raising the high water mark on memory consumption.

With this patch, a user can decompress the archive into their own
tmpfs instance if they want to. This was also mentioned on previous
threads. I don't see your point at all.

> > > If you absolutely need to export a file to userspace with the archive,
> > > my suggestion is to do it through debugfs. That way the format isn't
> > > in a /proc ABI that can't be changed in the future (debugfs isn't
> > > required to be stable in the same way). This way we can change the
> > > format carried in the kernel over time without changing the official
> > > way we present the data to userspace (via a filesystem view).
> > >
> > > As far as format goes; there's clear precedent on cpio being used and
> > > supported; we already have build time requirements on the userspace
> > > tools with some options. Using tar would actually be a new dependency
> > > even if it is a common tool to have installed. With a self-populating
> > > FS, there's no new tool requirements on the runtime side either.
> >
> > debugfs is going away for Android and is controversial in the fact
> > that its functionality isn't guaranteed to be there (debugfs breakages
> > aren't necessarily bugs AFAIK). So this isn't an option.
>
> The argument that this needs to go into /proc because Android is
> removing debugfs isn't a very strong one.
>
> And "debugfs breakages aren't bugs" is exactly why I'm suggesting to
> do the non-supported export of the archive that way instead.

BPF tools are shipped on production systems. They should not break,
that you want put them into debugfs to make them more likely to break
does not make any sense.

> > > > We had close to 2-3 months of discussions now with various folks up until v5.
> > > > I am about to post v6 which is in line with Masahiro Yamada's expecations. In
> > > > that I will be dropping module building artifacts due to his module building
> > > > concerns and only include the headers.
> > >
> > > I've found some of the old discussion and read up on it. I think it
> > > was pretty quick at dismissing ideas for more robust implementations
> > > ("it needs squashfs-tools"), and had some narrow viewpoints (exporting
> > > a tarball is the least amount of kernel change, while adding
> > > complexity at the system/usage side).
> >
> > Honestly, that's kind of unfair to be quoting just a few points like
> > that. If I remember there were 100s of emails and many good view
> > points were brought up by many people. We have done the diligence in
> > the discussions of this over a period of time.
>
> That wasn't captured with the patch submission, and having people go
> find 100s of emails to figure out why your seemingly lacking solution
> is the best one available is not how you motivate getting your code
> into the kernel.

I can summarize it better in the commit message. That's fine with me.

> > Greg KH and other maintainers are also supportive of it as can be seen
> > in other threads.
>
> I've found support for the desire to provide headers. If there's so
> much support for this solution, the number of Acks to the patch should
> have been higher.

There was at least one Ack on a prior revision, one Reviewed-by, and
at least 4 Tested-by(s). I dropped the tags since I changed the patch
a bit recently although the user interface and the idea is
fundamentally the same. Also Masahiro Yamada is happy with the quality
of the v6 patch, I privately chatted him. He mentioned he will likely
give his Acked-by tag.

> > We can consider an alternate proposal if it is
> > better, but I don't see any better one proposed at the moment.
>
> Really?

What do you mean? I meant better, as in, a proposal that works and
makes sense. Is simple, bug-free and solves the problem we are trying
to solve.

> > - cpio uncompressed to memory equally sucks because it consumes all
> > the memory uncompressed instead of reclaimable pages
>
> Only while mounted.

Still a disadvantage.

> > - decompressing into tmpfs will suck for Android because we don't use
> > disk-based swap and we run into the same cpio issue above. We use ZRAM
> > for compressed swap.
>
> See comments above about high water marks for memory consumption
> likely not moving much.
>
> > - debugfs is a non-option for Android
>
> Not my problem.

Really? Android runs on billions of devices. That is arrogant /
ignorant to say Android's requirements of moving away from debugfs are
not your problem.

> > The filesystem view sounds using mount/unmount like a pony to me, but
> > it does not meet the requirements above. Let me know if I am missing
> > something.
>
> What requirements?

I think you know this already - we don't want 30MB of active RAM being
used, that does not make much sense. debugfs doesn't work because
tools that need this will need to work even on production systems.
That you want debugfs because it is more susceptible to ABI breakage
doesn't make much sense. Not having all of the deal breakers above are
requirements.

thanks!

 - Joel

>
>
> -Olof
>
> --
> You received this message because you are subscribed to the Google Groups "kernel-team" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
>
Daniel Colascione April 10, 2019, 5:35 p.m. UTC | #14
On Wed, Apr 10, 2019 at 9:35 AM Olof Johansson <olof@lixom.net> wrote:
> There are lots of things where we provide suitable ways of doing
> things to the user instead of making them come up with their own
> handling of things. devtmpfs is a perfect example of this -- doing
> things in userspace was perfectly possible but still a hassle in many
> cases, and having the kernel do it for you when it already has the
> data makes sense.

devtmpfs is something that the kernel can do *better* than userspace.
The userspace equivalent is error-prone. Unpacking an archive, on the
other hand, is trivial. Does userspace really need the kernel's help
to do tar xvf? Are we really so worried about userspace getting tar
xvf wrong somehow that we should provide tar xvf in the kernel? The
kernel should have a feature only if there's some particular reason to
put that feature in the kernel instead of userspace. A smaller kernel
is a better kernel, all things being equal.

It's also worth noting that years and years ago, many proposals to do
what udevd does (but in the kernel) failed to make it into the kernel.

> I'd expect many users to still want to do this to tmpfs. Also, I
> expect whatever userspace tools and programs that will consume this
> data is likely to consume similar or more memory while running anyway.
> So mounting + copying + unmounting on the heavily constrained systems
> shouldn't be raising the high water mark on memory consumption.

Consider the embedded case: who's to say that the machine
decompressing the header bundle is the machine that provides it? We
can suck a compressed archive off the device without ever
decompressing it. I know you that you proposed providing access to
both a compressed cpio archive and a filesystem view of that archive,
but in this scheme, the filesystem view is redundant. If someone wants
a filesystem view of an archive, he can make one with FUSE without the
kernel's help and in a general way.

> > > If you absolutely need to export a file to userspace with the archive,
> > > my suggestion is to do it through debugfs. That way the format isn't
> > > in a /proc ABI that can't be changed in the future (debugfs isn't
> > > required to be stable in the same way). This way we can change the
> > > format carried in the kernel over time without changing the official
> > > way we present the data to userspace (via a filesystem view).
> > >
> > > As far as format goes; there's clear precedent on cpio being used and
> > > supported; we already have build time requirements on the userspace
> > > tools with some options. Using tar would actually be a new dependency
> > > even if it is a common tool to have installed. With a self-populating
> > > FS, there's no new tool requirements on the runtime side either.
> >
> > debugfs is going away for Android and is controversial in the fact
> > that its functionality isn't guaranteed to be there (debugfs breakages
> > aren't necessarily bugs AFAIK). So this isn't an option.
>
> The argument that this needs to go into /proc because Android is
> removing debugfs isn't a very strong one.
>
> And "debugfs breakages aren't bugs" is exactly why I'm suggesting to
> do the non-supported export of the archive that way instead.

Breaking this header bundle *should* be a bug though: tools will rely
on it. It's not critical interface in the same way that, say, open(2)
is, but the interface stability guarantee should nevertheless be
stronger than what debugfs provides.

> > > > We had close to 2-3 months of discussions now with various folks up until v5.
> > > > I am about to post v6 which is in line with Masahiro Yamada's expecations. In
> > > > that I will be dropping module building artifacts due to his module building
> > > > concerns and only include the headers.
> > >
> > > I've found some of the old discussion and read up on it. I think it
> > > was pretty quick at dismissing ideas for more robust implementations
> > > ("it needs squashfs-tools"), and had some narrow viewpoints (exporting
> > > a tarball is the least amount of kernel change, while adding
> > > complexity at the system/usage side).
> >
> > Honestly, that's kind of unfair to be quoting just a few points like
> > that. If I remember there were 100s of emails and many good view
> > points were brought up by many people. We have done the diligence in
> > the discussions of this over a period of time.
>
> That wasn't captured with the patch submission, and having people go
> find 100s of emails to figure out why your seemingly lacking solution
> is the best one available is not how you motivate getting your code
> into the kernel.
>
> > > I'd also like to clarify: I'm not opposed to the general idea of
> > > providing the needed headers with the kernel somehow. I just think
> > > it's worth spending effort making sure an interface for it that we'll
> > > need to live with forever is appropriately thought through and not
> > > rushed in, especially since we're likely to get substantial
> > > infrastructure on top of it quickly (eBPF and friends in particular).
> >
> > We have spent the time :) This seems like the best solution of all.
>
> That should be documented.
>
> > Greg KH and other maintainers are also supportive of it as can be seen
> > in other threads.
>
> I've found support for the desire to provide headers. If there's so
> much support for this solution, the number of Acks to the patch should
> have been higher.
>
> > We can consider an alternate proposal if it is
> > better, but I don't see any better one proposed at the moment.
>
> Really?

Joel's proposal is the simplest approach that solves the problem we're
trying to solve. The model you're proposing adds a lot of complexity,
and I'm not convinced that the complexity buys us anything.

> > - squashfs-tools requirement on the build really sucks
>
> Nah, this is a minor detail.

tar is ubiquitous. The squashfs tool isn't. Treating both dependencies
the same way is a false equivalence.

> > - cpio uncompressed to memory equally sucks because it consumes all
> > the memory uncompressed instead of reclaimable pages
>
> Only while mounted.
>
> > - decompressing into tmpfs will suck for Android because we don't use
> > disk-based swap and we run into the same cpio issue above. We use ZRAM
> > for compressed swap.
>
> See comments above about high water marks for memory consumption
> likely not moving much.
>
> > - debugfs is a non-option for Android
>
> Not my problem.

It's important to make interfaces that work for everyone. Robust eBPF
use should be possible without debugfs.
Arnd Bergmann April 10, 2019, 7:19 p.m. UTC | #15
On Wed, Apr 10, 2019 at 5:09 PM Olof Johansson <olof@lixom.net> wrote:
>
> As far as format goes; there's clear precedent on cpio being used and
> supported; we already have build time requirements on the userspace
> tools with some options. Using tar would actually be a new dependency
> even if it is a common tool to have installed. With a self-populating
> FS, there's no new tool requirements on the runtime side either.

The decision between tar and cpio directly follows from whether the
headers are uncompressed in kernel space or in user space:

- we use cpio for initramfs because unpacking cpio in C code is
  fairly simple, while unpacking tar is not (see the wikipedia page on tar).
  If we were to unpack it from the kernel, the initramfs code could be
  trivially reused.
- nobody sane uses cpio in user space, since tar is already present
  almost everywhere, while cpio is rather obscure in comparison.

        Arnd
Alexei Starovoitov April 11, 2019, 3:15 a.m. UTC | #16
On Wed, Apr 10, 2019 at 09:34:49AM -0700, Olof Johansson wrote:
> On Wed, Apr 10, 2019 at 8:51 AM Joel Fernandes <joelaf@google.com> wrote:
> >
> > On Wed, Apr 10, 2019 at 11:07 AM Olof Johansson <olof@lixom.net> wrote:
> > [snip]
> > > > > Wouldn't it be more convenient to provide it in a standardized format
> > > > > such that you won't have to take an additional step, and always have
> > > > > This is that form IMO.
...
> Compared to:
>  - Extract tarball
>  - Build and load
>  - Remove file tree from filesystem

I think there are too many assumptions in this thread in regard to what
is more convenient for user space.
I think bcc should try to avoid extracting tarball into file system.
For example libbcc can uncompress kheader.tar.xz into virtual file system
of clang front-end. It's more or less std::map<string, string>
with key=path, value=content of the file. Access to such in-memory
'files' is obviously faster than doing open/read syscalls.
bcc already uses this approach for some bcc internal 'files' that
it passes to clang during compilation.
All of /proc/kheaders.tar.xz files can be passed the same way
without extracting them into real file system.

Joel, would be great if you can share corresponding bcc patch
that takes advantage of /proc/kheaders
Joel Fernandes April 11, 2019, 4:30 p.m. UTC | #17
On Wed, Apr 10, 2019 at 08:15:42PM -0700, Alexei Starovoitov wrote:
> On Wed, Apr 10, 2019 at 09:34:49AM -0700, Olof Johansson wrote:
> > On Wed, Apr 10, 2019 at 8:51 AM Joel Fernandes <joelaf@google.com> wrote:
> > >
> > > On Wed, Apr 10, 2019 at 11:07 AM Olof Johansson <olof@lixom.net> wrote:
> > > [snip]
> > > > > > Wouldn't it be more convenient to provide it in a standardized format
> > > > > > such that you won't have to take an additional step, and always have
> > > > > > This is that form IMO.
> ...
> > Compared to:
> >  - Extract tarball
> >  - Build and load
> >  - Remove file tree from filesystem
> 
> I think there are too many assumptions in this thread in regard to what
> is more convenient for user space.
> I think bcc should try to avoid extracting tarball into file system.
> For example libbcc can uncompress kheader.tar.xz into virtual file system
> of clang front-end. It's more or less std::map<string, string>
> with key=path, value=content of the file. Access to such in-memory
> 'files' is obviously faster than doing open/read syscalls.
> bcc already uses this approach for some bcc internal 'files' that
> it passes to clang during compilation.
> All of /proc/kheaders.tar.xz files can be passed the same way
> without extracting them into real file system.
> 
> Joel, would be great if you can share corresponding bcc patch
> that takes advantage of /proc/kheaders

That sounds good to me. Others have tested this with BCC with a manual
extraction and setting environment variable to point BCC to the extracted
location, but I can work BCC patch that makes this automatic and share it (I
was planning to do the "automatically make BCC do it" patch for the BCC
project, but first wanted this in. But I agree with Alexei, its a good idea
to do it now and share it). Masahiro had some nits in v6 that I need to
address anyway ;-) And Olof grumbled a bit about the commit message which
could be polished a bit more :-)

thanks!

 - Joel
Enrico Weigelt, metux IT consult April 12, 2019, 4:16 p.m. UTC | #18
On 10.04.19 21:19, Arnd Bergmann wrote:

> - we use cpio for initramfs because unpacking cpio in C code is
>   fairly simple, while unpacking tar is not (see the wikipedia page on tar).
>   If we were to unpack it from the kernel, the initramfs code could be
>   trivially reused.

What about ar ? ;-)

> - nobody sane uses cpio in user space, since tar is already present
>   almost everywhere, while cpio is rather obscure in comparison.

So, rpm doesn't count as "sane" ? :o


--mtx
Steven Rostedt April 12, 2019, 5:25 p.m. UTC | #19
On Fri, 12 Apr 2019 18:16:19 +0200
"Enrico Weigelt, metux IT consult" <lkml@metux.net> wrote:

> So, rpm doesn't count as "sane" ? :o
> 

As a Debian user... No ;-)

/me runs

-- Steve
Olof Johansson April 14, 2019, 7:38 p.m. UTC | #20
On Wed, Apr 10, 2019 at 8:15 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Apr 10, 2019 at 09:34:49AM -0700, Olof Johansson wrote:
> > On Wed, Apr 10, 2019 at 8:51 AM Joel Fernandes <joelaf@google.com> wrote:
> > >
> > > On Wed, Apr 10, 2019 at 11:07 AM Olof Johansson <olof@lixom.net> wrote:
> > > [snip]
> > > > > > Wouldn't it be more convenient to provide it in a standardized format
> > > > > > such that you won't have to take an additional step, and always have
> > > > > > This is that form IMO.
> ...
> > Compared to:
> >  - Extract tarball
> >  - Build and load
> >  - Remove file tree from filesystem
>
> I think there are too many assumptions in this thread in regard to what
> is more convenient for user space.
> I think bcc should try to avoid extracting tarball into file system.
> For example libbcc can uncompress kheader.tar.xz into virtual file system
> of clang front-end. It's more or less std::map<string, string>
> with key=path, value=content of the file. Access to such in-memory
> 'files' is obviously faster than doing open/read syscalls.

I think performance is a red herring, especially since you have to
uncompress it on every compiler invocation. With this you'd need to
read/touch/write _all_ header files, not just the one your current
compiler invocation will use.

In the grand scheme of things, open/mmap syscalls wouldn't necessarily
be slower.

> bcc already uses this approach for some bcc internal 'files' that
> it passes to clang during compilation.
> All of /proc/kheaders.tar.xz files can be passed the same way
> without extracting them into real file system.

This is now a circular argument. Joel was stating that the plain text
headers took up too much memory, but now it's preferred to create such
filesystem in userspace memory on *every compiler invocation*?
That's... not better. And definitely worse if you want to compile in
parallel.


From my perspective, this is where we're at:

This patch seems to have been met with a lot of responses in the tone
of "this is not an appealing solution". Meanwhile, some of the
suggested alternative solutions have not worked out, and we are now at
a point where there's less interest in exploring alternatives and
arguments to merge as-is with only minor adjustments.

I understand the desire to solve this. It'd be really convenient to
have a way to runtime build against the same structure layouts that
the kernel was built with. But I haven't heard anyone say that they
*like* the solution proposed, and I haven't seen many of those
expressing concerns being converted to support it.

Usually what we do at times like this is that we say "Yeah, this is a
problem that should be solved, but this solution doesn't seem to be
the right one and we would need to maintain it forever as part of the
ABI. Let's wait until a better solution is found." With time,
sometimes a better solution becomes obvious, or circumstances change
enough to allow for some different approach, or someone has a new idea
from a different perspective that solves the same problem.

All of that being said, I don't have veto rights on code going into
the kernel, even if I think picking up this patch would be the wrong
thing to do.

I'd be a *lot* less hesitant if this went into debugfs or another
location than /proc, which is one of the most regression-sensitive
interfaces we have in the kernel.

> Joel, would be great if you can share corresponding bcc patch
> that takes advantage of /proc/kheaders


-Olof
Enrico Weigelt, metux IT consult April 15, 2019, 9:41 a.m. UTC | #21
On 14.04.19 21:38, Olof Johansson wrote:

Hi folks,


I haven't followed these longs discussions completely, so forgive me if
I've missed something. But here're my comments on this ...

> In the grand scheme of things, open/mmap syscalls wouldn't necessarily> be slower.
ACK. In my own experience on dealing w/ lots of files, eg. headers, in
compilation processes, there indeed is a bottleneck, when thousands of
files have to be processed, but:

a) most the kernel-side delay's were coming from actual IO - w/ tmpfs,
   that easily goes away, and the syscall overhead isn't so much.
b) most of the total computation was preprocessor and parsing.

> This patch seems to have been met with a lot of responses in the tone> of "this is not an appealing solution".

Personally, having generic helpers for putting blobs into /proc files
(like config.gz) sound appealing. But I'm not sure whether doing that
w/ kernel headers this way is a good solution. Actually, I'm even not
sure whether raw kernel headers are at all are a good way. (can't we
use compiler-generated debug info ?)

> Usually what we do at times like this is that we say "Yeah, this is a> problem that should be solved, but this solution doesn't seem to be>
the right one and we would need to maintain it forever as part of the>
ABI. Let's wait until a better solution is found." With time,> sometimes
a better solution becomes obvious, or circumstances change> enough to
allow for some different approach, or someone has a new idea> from a
different perspective that solves the same problem.
ACK. For now, this is an Android-only debug tool, just needed there
because of it's unusual partitioning/deployment mechanisms - on usual
GNU/Linux distros, we just have the kheaders in the file system.
(and even on my small embedded devices, I either run the DUTs via NFS,
9P2k, initrd, etc or just deploy kernel and headers into the filesystem)

As Android already is in it's own universe, why can't that stuff remain
incubated there, until we have more field experience w/ it and more time
to rethink the whole idea very carefully ?

The patch is pretty small, so it's trivial cherry-pick, in case somebody
outside Android universe wants to use it.

I see two smaller sub-problems that IMHO deserve a more generic
solution:

#1: generic helpers for easily exposing in-kernel blobs as /proc files
    (potentially w/ transparent decompression)

#2: generic way for easily linking in files with very few LoC

    one scenario that I've got in mind is using dtb snippets for board
    drivers, that only need to initialize some generic drivers w/
    board specific configuration, so that doesn't have to be hand-
    written anymore.


> I'd be a *lot* less hesitant if this went into debugfs or another
> location than /proc, which is one of the most regression-sensitive
> interfaces we have in the kernel.

ACK. I don't think that /proc really is the right place for that.
Actually, I even doubt that for config.gz , it's just historically
there (many distros already disabled it for many years). IMHO, /proc
is for process information. (i guess that was also a reason for creating
debugfs instead of putting that stuff into /proc).


--mtx
Joel Fernandes April 15, 2019, 1:52 p.m. UTC | #22
On Mon, Apr 15, 2019 at 11:41:18AM +0200, Enrico Weigelt, metux IT consult wrote:
[snip]
> > This patch seems to have been met with a lot of responses in the tone> of "this is not an appealing solution".
> 
> Personally, having generic helpers for putting blobs into /proc files
> (like config.gz) sound appealing. But I'm not sure whether doing that
> w/ kernel headers this way is a good solution. Actually, I'm even not
> sure whether raw kernel headers are at all are a good way. (can't we
> use compiler-generated debug info ?)

We can't use compiler generated debug info for this.

As discussed previously here, eBPF tools need kernel headers, DWARF and
compiler debug information wont help:
https://lkml.org/lkml/2019/3/11/1358
https://lkml.org/lkml/2019/3/11/1363

> > Usually what we do at times like this is that we say "Yeah, this is a> problem that should be solved, but this solution doesn't seem to be>
> the right one and we would need to maintain it forever as part of the>
> ABI. Let's wait until a better solution is found." With time,> sometimes
> a better solution becomes obvious, or circumstances change> enough to
> allow for some different approach, or someone has a new idea> from a
> different perspective that solves the same problem.
> ACK. For now, this is an Android-only debug tool, just needed there
> because of it's unusual partitioning/deployment mechanisms - on usual
> GNU/Linux distros, we just have the kheaders in the file system.
> (and even on my small embedded devices, I either run the DUTs via NFS,
> 9P2k, initrd, etc or just deploy kernel and headers into the filesystem)
> 
> As Android already is in it's own universe, why can't that stuff remain
> incubated there, until we have more field experience w/ it and more time
> to rethink the whole idea very carefully ?

Well, we follow mostly an upstream first process.

> The patch is pretty small, so it's trivial cherry-pick, in case somebody
> outside Android universe wants to use it.

It could break very easily if things upstream change in some way, and adds a
lot of maintenance burden, besides I don't see a good reason it should not be
upstreamed tbh.

thanks,

 - Joel
Joel Fernandes April 15, 2019, 2:05 p.m. UTC | #23
On Sun, Apr 14, 2019 at 12:38:34PM -0700, Olof Johansson wrote:
> On Wed, Apr 10, 2019 at 8:15 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Apr 10, 2019 at 09:34:49AM -0700, Olof Johansson wrote:
> > > On Wed, Apr 10, 2019 at 8:51 AM Joel Fernandes <joelaf@google.com> wrote:
> > > >
> > > > On Wed, Apr 10, 2019 at 11:07 AM Olof Johansson <olof@lixom.net> wrote:
> > > > [snip]
> > > > > > > Wouldn't it be more convenient to provide it in a standardized format
> > > > > > > such that you won't have to take an additional step, and always have
> > > > > > > This is that form IMO.
> > ...
> > > Compared to:
> > >  - Extract tarball
> > >  - Build and load
> > >  - Remove file tree from filesystem
> >
> > I think there are too many assumptions in this thread in regard to what
> > is more convenient for user space.
> > I think bcc should try to avoid extracting tarball into file system.
> > For example libbcc can uncompress kheader.tar.xz into virtual file system
> > of clang front-end. It's more or less std::map<string, string>
> > with key=path, value=content of the file. Access to such in-memory
> > 'files' is obviously faster than doing open/read syscalls.
> 
> I think performance is a red herring, especially since you have to
> uncompress it on every compiler invocation. With this you'd need to
> read/touch/write _all_ header files, not just the one your current
> compiler invocation will use.
> 
> In the grand scheme of things, open/mmap syscalls wouldn't necessarily
> be slower.

Agreed.

> > bcc already uses this approach for some bcc internal 'files' that
> > it passes to clang during compilation.
> > All of /proc/kheaders.tar.xz files can be passed the same way
> > without extracting them into real file system.
> 
> This is now a circular argument. Joel was stating that the plain text
> headers took up too much memory, but now it's preferred to create such
> filesystem in userspace memory on *every compiler invocation*?
> That's... not better. And definitely worse if you want to compile in
> parallel.

The BCC patch does not extract purely into memory, but uses temporary
directory: https://github.com/iovisor/bcc/pull/2312
I believe this is a good approach.

> From my perspective, this is where we're at:
> 
> This patch seems to have been met with a lot of responses in the tone
> of "this is not an appealing solution". Meanwhile, some of the
> suggested alternative solutions have not worked out, and we are now at
> a point where there's less interest in exploring alternatives and
> arguments to merge as-is with only minor adjustments.
> 
> I understand the desire to solve this. It'd be really convenient to
> have a way to runtime build against the same structure layouts that
> the kernel was built with. But I haven't heard anyone say that they

About structure layouts, I'm assuming you mean compiler generated debug info.
That does not work for eBPF tools as was mentioned previously in these threads:
https://lkml.org/lkml/2019/3/11/1358
https://lkml.org/lkml/2019/3/11/1363

> *like* the solution proposed, and I haven't seen many of those
> expressing concerns being converted to support it.

IMHO there has been good number of people on both sides of the argument. If
it were as strong of an opposition as you think, then I would have personally
not wanted this merged tbh. We do want a solution that is clean and works, and
I think this is a candidate.

[snip]
> I'd be a *lot* less hesitant if this went into debugfs or another
> location than /proc, which is one of the most regression-sensitive
> interfaces we have in the kernel.

The solution should be regression-sensitive imho, we don't want the tracing
tools to break, people use them. And their usability and robustness is
important which prompted these patches. For some time we have been hosting
downloadable headers for popular kernels (such as LTS versions) to workaround
this issue, but this both a maintenance issue and non-scalable, and not
robust (someone boots a custom kernel or we forget to update headers for a
future LTS release and the tools break).

thanks,

 - Joel
Steven Rostedt April 15, 2019, 2:41 p.m. UTC | #24
On Sun, 14 Apr 2019 12:38:34 -0700
Olof Johansson <olof@lixom.net> wrote:

> >From my perspective, this is where we're at:  
> 
> This patch seems to have been met with a lot of responses in the tone
> of "this is not an appealing solution". Meanwhile, some of the
> suggested alternative solutions have not worked out, and we are now at
> a point where there's less interest in exploring alternatives and
> arguments to merge as-is with only minor adjustments.

Another consideration to make is difficulty of support. Having a
tarball compressed headers may not be an appealing solution, but it
isn't one that would be too much of an issue to support. Having a
better interface would be difficult to get right, and if you get it
wrong, you are now stuck with supporting something that may become a
big pain to do so in the future.

> I'd be a *lot* less hesitant if this went into debugfs or another
> location than /proc, which is one of the most regression-sensitive
> interfaces we have in the kernel.
> 

I agree with this assessment. We shouldn't use config.gz as precedence
for this solution. config.gz should have been in debugfs to begin with,
but I don't believe debugfs was around when config.gz was introduced.
(Don't have time to look into the history of the two).

-- Steve
Kees Cook April 16, 2019, 3:50 a.m. UTC | #25
On Mon, Apr 15, 2019 at 9:41 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> I agree with this assessment. We shouldn't use config.gz as precedence
> for this solution. config.gz should have been in debugfs to begin with,
> but I don't believe debugfs was around when config.gz was introduced.
> (Don't have time to look into the history of the two).

I don't agree with this: /proc/config.gz is used by a lot of tools
that do sanity-check of running systems. This isn't _debugging_...
it's verifying correct kernel builds. It's a fancy version of checking
/proc/version.
Steven Rostedt April 16, 2019, 12:33 p.m. UTC | #26
On Mon, 15 Apr 2019 22:50:10 -0500
Kees Cook <keescook@chromium.org> wrote:

> On Mon, Apr 15, 2019 at 9:41 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> > I agree with this assessment. We shouldn't use config.gz as precedence
> > for this solution. config.gz should have been in debugfs to begin with,
> > but I don't believe debugfs was around when config.gz was introduced.
> > (Don't have time to look into the history of the two).  
> 
> I don't agree with this: /proc/config.gz is used by a lot of tools
> that do sanity-check of running systems. This isn't _debugging_...
> it's verifying correct kernel builds. It's a fancy version of checking
> /proc/version.
> 

Then we should perhaps make a new file system call tarballs ;-)

 /sys/kernel/tarballs/

and place everything there. That way it removes it from /proc (which is
the worse place for that) and also makes it something other than debug.
That's what I did for tracefs.

-- Steve
Greg Kroah-Hartman April 16, 2019, 12:49 p.m. UTC | #27
On Tue, Apr 16, 2019 at 08:33:06AM -0400, Steven Rostedt wrote:
> On Mon, 15 Apr 2019 22:50:10 -0500
> Kees Cook <keescook@chromium.org> wrote:
> 
> > On Mon, Apr 15, 2019 at 9:41 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> > > I agree with this assessment. We shouldn't use config.gz as precedence
> > > for this solution. config.gz should have been in debugfs to begin with,
> > > but I don't believe debugfs was around when config.gz was introduced.
> > > (Don't have time to look into the history of the two).  
> > 
> > I don't agree with this: /proc/config.gz is used by a lot of tools
> > that do sanity-check of running systems. This isn't _debugging_...
> > it's verifying correct kernel builds. It's a fancy version of checking
> > /proc/version.
> > 
> 
> Then we should perhaps make a new file system call tarballs ;-)
> 
>  /sys/kernel/tarballs/
> 
> and place everything there. That way it removes it from /proc (which is
> the worse place for that) and also makes it something other than debug.
> That's what I did for tracefs.

As horrible as that suggestion is, it does kind of make sense :)

We can't put this in debugfs as that's only for debugging and systems
should never have that mounted for normal operations (users want to
build ebpf programs), and /proc really should be for processes but that
horse is long left the barn.

But, I'm willing to consider putting this either in a system-fs-like
filesystem, or just in sysfs itself, we do have /sys/kernel/ to play
around in if the main objection is that we should not be cluttering up
/proc with stuff like this.

thanks,

greg k-h
Joel Fernandes April 16, 2019, 1:04 p.m. UTC | #28
On Tue, Apr 16, 2019 at 02:49:39PM +0200, Greg Kroah-Hartman wrote:
> On Tue, Apr 16, 2019 at 08:33:06AM -0400, Steven Rostedt wrote:
> > On Mon, 15 Apr 2019 22:50:10 -0500
> > Kees Cook <keescook@chromium.org> wrote:
> > 
> > > On Mon, Apr 15, 2019 at 9:41 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> > > > I agree with this assessment. We shouldn't use config.gz as precedence
> > > > for this solution. config.gz should have been in debugfs to begin with,
> > > > but I don't believe debugfs was around when config.gz was introduced.
> > > > (Don't have time to look into the history of the two).  
> > > 
> > > I don't agree with this: /proc/config.gz is used by a lot of tools
> > > that do sanity-check of running systems. This isn't _debugging_...
> > > it's verifying correct kernel builds. It's a fancy version of checking
> > > /proc/version.
> > > 
> > 
> > Then we should perhaps make a new file system call tarballs ;-)
> > 
> >  /sys/kernel/tarballs/
> > 
> > and place everything there. That way it removes it from /proc (which is
> > the worse place for that) and also makes it something other than debug.
> > That's what I did for tracefs.
> 
> As horrible as that suggestion is, it does kind of make sense :)
> 
> We can't put this in debugfs as that's only for debugging and systems
> should never have that mounted for normal operations (users want to
> build ebpf programs), and /proc really should be for processes but that
> horse is long left the barn.
> 
> But, I'm willing to consider putting this either in a system-fs-like
> filesystem, or just in sysfs itself, we do have /sys/kernel/ to play
> around in if the main objection is that we should not be cluttering up
> /proc with stuff like this.
> 

I am ok with the suggestion of /sys/kernel for the archive. That also seems
to fit well with the idea that the headers are kernel related and probably
belong here more strictly speaking, than /proc.

thanks,

- Joel


> thanks,
> 
> greg k-h
Karim Yaghmour April 16, 2019, 1:32 p.m. UTC | #29
On 4/16/19 9:04 AM, Joel Fernandes wrote:
> On Tue, Apr 16, 2019 at 02:49:39PM +0200, Greg Kroah-Hartman wrote:
>> On Tue, Apr 16, 2019 at 08:33:06AM -0400, Steven Rostedt wrote:
>>> On Mon, 15 Apr 2019 22:50:10 -0500
>>> Kees Cook <keescook@chromium.org> wrote:
>>>
>>>> On Mon, Apr 15, 2019 at 9:41 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>>>>> I agree with this assessment. We shouldn't use config.gz as precedence
>>>>> for this solution. config.gz should have been in debugfs to begin with,
>>>>> but I don't believe debugfs was around when config.gz was introduced.
>>>>> (Don't have time to look into the history of the two).
>>>>
>>>> I don't agree with this: /proc/config.gz is used by a lot of tools
>>>> that do sanity-check of running systems. This isn't _debugging_...
>>>> it's verifying correct kernel builds. It's a fancy version of checking
>>>> /proc/version.
>>>>
>>>
>>> Then we should perhaps make a new file system call tarballs ;-)
>>>
>>>   /sys/kernel/tarballs/
>>>
>>> and place everything there. That way it removes it from /proc (which is
>>> the worse place for that) and also makes it something other than debug.
>>> That's what I did for tracefs.
>>
>> As horrible as that suggestion is, it does kind of make sense :)
>>
>> We can't put this in debugfs as that's only for debugging and systems
>> should never have that mounted for normal operations (users want to
>> build ebpf programs), and /proc really should be for processes but that
>> horse is long left the barn.
>>
>> But, I'm willing to consider putting this either in a system-fs-like
>> filesystem, or just in sysfs itself, we do have /sys/kernel/ to play
>> around in if the main objection is that we should not be cluttering up
>> /proc with stuff like this.
>>
> 
> I am ok with the suggestion of /sys/kernel for the archive. That also seems
> to fit well with the idea that the headers are kernel related and probably
> belong here more strictly speaking, than /proc.

This makes sense. And if it alleviates concerns regarding extending 
/proc ABIs then might as well switch to this.

Olof, what do you think of this?
Steven Rostedt April 16, 2019, 1:45 p.m. UTC | #30
On Tue, 16 Apr 2019 09:32:37 -0400
Karim Yaghmour <karim.yaghmour@opersys.com> wrote:

> >>> Then we should perhaps make a new file system call tarballs ;-)
> >>>
> >>>   /sys/kernel/tarballs/
> >>>
> >>> and place everything there. That way it removes it from /proc (which is
> >>> the worse place for that) and also makes it something other than debug.
> >>> That's what I did for tracefs.  
> >>
> >> As horrible as that suggestion is, it does kind of make sense :)
> >>
> >> We can't put this in debugfs as that's only for debugging and systems
> >> should never have that mounted for normal operations (users want to
> >> build ebpf programs), and /proc really should be for processes but that
> >> horse is long left the barn.
> >>
> >> But, I'm willing to consider putting this either in a system-fs-like
> >> filesystem, or just in sysfs itself, we do have /sys/kernel/ to play
> >> around in if the main objection is that we should not be cluttering up
> >> /proc with stuff like this.
> >>  
> > 
> > I am ok with the suggestion of /sys/kernel for the archive. That also seems
> > to fit well with the idea that the headers are kernel related and probably
> > belong here more strictly speaking, than /proc.  
> 
> This makes sense. And if it alleviates concerns regarding extending 
> /proc ABIs then might as well switch to this.
> 
> Olof, what do you think of this?

BTW, the name "tarballs" was kind of a joke. Probably should come up
with a better name. Although, I'm fine with tarballsfs ;-)

-- Steve
Joel Fernandes April 16, 2019, 2:21 p.m. UTC | #31
On Tue, Apr 16, 2019 at 09:45:09AM -0400, Steven Rostedt wrote:
> On Tue, 16 Apr 2019 09:32:37 -0400
> Karim Yaghmour <karim.yaghmour@opersys.com> wrote:
> 
> > >>> Then we should perhaps make a new file system call tarballs ;-)
> > >>>
> > >>>   /sys/kernel/tarballs/
> > >>>
> > >>> and place everything there. That way it removes it from /proc (which is
> > >>> the worse place for that) and also makes it something other than debug.
> > >>> That's what I did for tracefs.  
> > >>
> > >> As horrible as that suggestion is, it does kind of make sense :)
> > >>
> > >> We can't put this in debugfs as that's only for debugging and systems
> > >> should never have that mounted for normal operations (users want to
> > >> build ebpf programs), and /proc really should be for processes but that
> > >> horse is long left the barn.
> > >>
> > >> But, I'm willing to consider putting this either in a system-fs-like
> > >> filesystem, or just in sysfs itself, we do have /sys/kernel/ to play
> > >> around in if the main objection is that we should not be cluttering up
> > >> /proc with stuff like this.
> > >>  
> > > 
> > > I am ok with the suggestion of /sys/kernel for the archive. That also seems
> > > to fit well with the idea that the headers are kernel related and probably
> > > belong here more strictly speaking, than /proc.  
> > 
> > This makes sense. And if it alleviates concerns regarding extending 
> > /proc ABIs then might as well switch to this.
> > 
> > Olof, what do you think of this?
> 
> BTW, the name "tarballs" was kind of a joke. Probably should come up
> with a better name. Although, I'm fine with tarballsfs ;-)

:-)
In theory, the headers could also be hosted in tracefs since the scope of the
patch right now is to help tracing tools (BCC / eBPF). Although /sys/kernel
might be better in case the scope is expanded to other things.

thanks,

 - Joel
Greg Kroah-Hartman April 16, 2019, 2:22 p.m. UTC | #32
On Tue, Apr 16, 2019 at 09:45:09AM -0400, Steven Rostedt wrote:
> On Tue, 16 Apr 2019 09:32:37 -0400
> Karim Yaghmour <karim.yaghmour@opersys.com> wrote:
> 
> > >>> Then we should perhaps make a new file system call tarballs ;-)
> > >>>
> > >>>   /sys/kernel/tarballs/
> > >>>
> > >>> and place everything there. That way it removes it from /proc (which is
> > >>> the worse place for that) and also makes it something other than debug.
> > >>> That's what I did for tracefs.  
> > >>
> > >> As horrible as that suggestion is, it does kind of make sense :)
> > >>
> > >> We can't put this in debugfs as that's only for debugging and systems
> > >> should never have that mounted for normal operations (users want to
> > >> build ebpf programs), and /proc really should be for processes but that
> > >> horse is long left the barn.
> > >>
> > >> But, I'm willing to consider putting this either in a system-fs-like
> > >> filesystem, or just in sysfs itself, we do have /sys/kernel/ to play
> > >> around in if the main objection is that we should not be cluttering up
> > >> /proc with stuff like this.
> > >>  
> > > 
> > > I am ok with the suggestion of /sys/kernel for the archive. That also seems
> > > to fit well with the idea that the headers are kernel related and probably
> > > belong here more strictly speaking, than /proc.  
> > 
> > This makes sense. And if it alleviates concerns regarding extending 
> > /proc ABIs then might as well switch to this.
> > 
> > Olof, what do you think of this?
> 
> BTW, the name "tarballs" was kind of a joke. Probably should come up
> with a better name. Although, I'm fine with tarballsfs ;-)

No need to have this be a separate filesystem, we can use a binary sysfs
file in /sys/kernel/ for this as the kernel is not doing any "parsing"
of the data, it is just dumping it out to userspace.

If I make /sys/kernel/config.gz be the same thing as /proc/config.gz
will you fix up 'make localmodconfig' to use it?  :)

thanks,

greg k-h
Steven Rostedt April 16, 2019, 2:43 p.m. UTC | #33
On Tue, 16 Apr 2019 16:22:40 +0200
Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:

> If I make /sys/kernel/config.gz be the same thing as /proc/config.gz
> will you fix up 'make localmodconfig' to use it?  :)

Sure.

-- Steve
Olof Johansson April 16, 2019, 4:42 p.m. UTC | #34
On Tue, Apr 16, 2019 at 7:43 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Tue, 16 Apr 2019 16:22:40 +0200
> Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote:
>
> > If I make /sys/kernel/config.gz be the same thing as /proc/config.gz
> > will you fix up 'make localmodconfig' to use it?  :)
>
> Sure.

Doing a link to the new location is probably best, just like
/proc/device-tree is today.


-Olof
Alexei Starovoitov April 16, 2019, 4:46 p.m. UTC | #35
On Tue, Apr 16, 2019 at 04:22:40PM +0200, Greg Kroah-Hartman wrote:
> On Tue, Apr 16, 2019 at 09:45:09AM -0400, Steven Rostedt wrote:
> > On Tue, 16 Apr 2019 09:32:37 -0400
> > Karim Yaghmour <karim.yaghmour@opersys.com> wrote:
> > 
> > > >>> Then we should perhaps make a new file system call tarballs ;-)
> > > >>>
> > > >>>   /sys/kernel/tarballs/
> > > >>>
> > > >>> and place everything there. That way it removes it from /proc (which is
> > > >>> the worse place for that) and also makes it something other than debug.
> > > >>> That's what I did for tracefs.  
> > > >>
> > > >> As horrible as that suggestion is, it does kind of make sense :)
> > > >>
> > > >> We can't put this in debugfs as that's only for debugging and systems
> > > >> should never have that mounted for normal operations (users want to
> > > >> build ebpf programs), and /proc really should be for processes but that
> > > >> horse is long left the barn.
> > > >>
> > > >> But, I'm willing to consider putting this either in a system-fs-like
> > > >> filesystem, or just in sysfs itself, we do have /sys/kernel/ to play
> > > >> around in if the main objection is that we should not be cluttering up
> > > >> /proc with stuff like this.
> > > >>  
> > > > 
> > > > I am ok with the suggestion of /sys/kernel for the archive. That also seems
> > > > to fit well with the idea that the headers are kernel related and probably
> > > > belong here more strictly speaking, than /proc.  
> > > 
> > > This makes sense. And if it alleviates concerns regarding extending 
> > > /proc ABIs then might as well switch to this.
> > > 
> > > Olof, what do you think of this?
> > 
> > BTW, the name "tarballs" was kind of a joke. Probably should come up
> > with a better name. Although, I'm fine with tarballsfs ;-)
> 
> No need to have this be a separate filesystem, we can use a binary sysfs
> file in /sys/kernel/ for this as the kernel is not doing any "parsing"
> of the data, it is just dumping it out to userspace.

What folks keep saying that an fs of header files is easier to use
than tarball from bcc and cleaner from architectural pov.
That's not the case.
From bcc side I'd rather have a single precompiled headers blob
that I can feed into clang and improve bpf program compilation time.
Having a set of headers is a step to generate such .pch file,
but once generated the headers can be removed from fs and kheaders
module unloaded.
The sequence is: bcc checks standard /lib/module location,
if not there loads kheader mod, extracts into known location, and unloads.
The extraced headers are in plain fs cache and will be evicted from memory
when bcc is done compiling progs.
imo much cleaner than kernel maintaining headers-fs and wasting memory.

Where kheaders.tar.xz is placed doesn't really matter.
/proc or /sys/kernel makes no real difference.
Olof Johansson April 16, 2019, 4:47 p.m. UTC | #36
On Tue, Apr 16, 2019 at 6:32 AM Karim Yaghmour
<karim.yaghmour@opersys.com> wrote:
>
>
> On 4/16/19 9:04 AM, Joel Fernandes wrote:
> > On Tue, Apr 16, 2019 at 02:49:39PM +0200, Greg Kroah-Hartman wrote:
> >> On Tue, Apr 16, 2019 at 08:33:06AM -0400, Steven Rostedt wrote:
> >>> On Mon, 15 Apr 2019 22:50:10 -0500
> >>> Kees Cook <keescook@chromium.org> wrote:
> >>>
> >>>> On Mon, Apr 15, 2019 at 9:41 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> >>>>> I agree with this assessment. We shouldn't use config.gz as precedence
> >>>>> for this solution. config.gz should have been in debugfs to begin with,
> >>>>> but I don't believe debugfs was around when config.gz was introduced.
> >>>>> (Don't have time to look into the history of the two).
> >>>>
> >>>> I don't agree with this: /proc/config.gz is used by a lot of tools
> >>>> that do sanity-check of running systems. This isn't _debugging_...
> >>>> it's verifying correct kernel builds. It's a fancy version of checking
> >>>> /proc/version.
> >>>>
> >>>
> >>> Then we should perhaps make a new file system call tarballs ;-)
> >>>
> >>>   /sys/kernel/tarballs/
> >>>
> >>> and place everything there. That way it removes it from /proc (which is
> >>> the worse place for that) and also makes it something other than debug.
> >>> That's what I did for tracefs.
> >>
> >> As horrible as that suggestion is, it does kind of make sense :)
> >>
> >> We can't put this in debugfs as that's only for debugging and systems
> >> should never have that mounted for normal operations (users want to
> >> build ebpf programs), and /proc really should be for processes but that
> >> horse is long left the barn.
> >>
> >> But, I'm willing to consider putting this either in a system-fs-like
> >> filesystem, or just in sysfs itself, we do have /sys/kernel/ to play
> >> around in if the main objection is that we should not be cluttering up
> >> /proc with stuff like this.
> >>
> >
> > I am ok with the suggestion of /sys/kernel for the archive. That also seems
> > to fit well with the idea that the headers are kernel related and probably
> > belong here more strictly speaking, than /proc.
>
> This makes sense. And if it alleviates concerns regarding extending
> /proc ABIs then might as well switch to this.
>
> Olof, what do you think of this?

In practice we've been more lenient with changes to /sys over time, so
I think this might be a reasonable compromise.

I still think that a filesystem view is the cleanest way to do this,
but I won't push back from this going in. It does solve a real
problem, and if we want a different format later we can revisit it
then.


-Olof
Olof Johansson April 16, 2019, 4:57 p.m. UTC | #37
On Tue, Apr 16, 2019 at 9:46 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Apr 16, 2019 at 04:22:40PM +0200, Greg Kroah-Hartman wrote:
> > On Tue, Apr 16, 2019 at 09:45:09AM -0400, Steven Rostedt wrote:
> > > On Tue, 16 Apr 2019 09:32:37 -0400
> > > Karim Yaghmour <karim.yaghmour@opersys.com> wrote:
> > >
> > > > >>> Then we should perhaps make a new file system call tarballs ;-)
> > > > >>>
> > > > >>>   /sys/kernel/tarballs/
> > > > >>>
> > > > >>> and place everything there. That way it removes it from /proc (which is
> > > > >>> the worse place for that) and also makes it something other than debug.
> > > > >>> That's what I did for tracefs.
> > > > >>
> > > > >> As horrible as that suggestion is, it does kind of make sense :)
> > > > >>
> > > > >> We can't put this in debugfs as that's only for debugging and systems
> > > > >> should never have that mounted for normal operations (users want to
> > > > >> build ebpf programs), and /proc really should be for processes but that
> > > > >> horse is long left the barn.
> > > > >>
> > > > >> But, I'm willing to consider putting this either in a system-fs-like
> > > > >> filesystem, or just in sysfs itself, we do have /sys/kernel/ to play
> > > > >> around in if the main objection is that we should not be cluttering up
> > > > >> /proc with stuff like this.
> > > > >>
> > > > >
> > > > > I am ok with the suggestion of /sys/kernel for the archive. That also seems
> > > > > to fit well with the idea that the headers are kernel related and probably
> > > > > belong here more strictly speaking, than /proc.
> > > >
> > > > This makes sense. And if it alleviates concerns regarding extending
> > > > /proc ABIs then might as well switch to this.
> > > >
> > > > Olof, what do you think of this?
> > >
> > > BTW, the name "tarballs" was kind of a joke. Probably should come up
> > > with a better name. Although, I'm fine with tarballsfs ;-)
> >
> > No need to have this be a separate filesystem, we can use a binary sysfs
> > file in /sys/kernel/ for this as the kernel is not doing any "parsing"
> > of the data, it is just dumping it out to userspace.
>
> What folks keep saying that an fs of header files is easier to use
> than tarball from bcc and cleaner from architectural pov.
> That's not the case.
> From bcc side I'd rather have a single precompiled headers blob
> that I can feed into clang and improve bpf program compilation time.
> Having a set of headers is a step to generate such .pch file,
> but once generated the headers can be removed from fs and kheaders
> module unloaded.
> The sequence is: bcc checks standard /lib/module location,
> if not there loads kheader mod, extracts into known location, and unloads.

May I suggest keeping the bcc-populated headers somewhere else?
Ideally something cleaned out on every reboot in case kernel changes
without version string doing it.

That way you can by default prefer the module-exported tarball, and
fall back to /lib/module/$(uname -r)/ if not available, instead of the
other way around and instead of having to check creation times on the
dir vs boot time of the kernel, etc.

Anyway, that's just an implementation detail. But it's the kind of
detail that all tools that use this would need to get right, instead
of doing it right once by exporting it in a way that it can be
directly used.

> The extraced headers are in plain fs cache and will be evicted from memory
> when bcc is done compiling progs.
> imo much cleaner than kernel maintaining headers-fs and wasting memory.

So, in my original proposal I recommended unmounting when not needing
it, which would remove the memory usage as well.

> Where kheaders.tar.xz is placed doesn't really matter.
> /proc or /sys/kernel makes no real difference.

If done in a location that isn't a perpetual ABI commitment, a tarball
solution is something we can work with.


-Olof
Joel Fernandes April 16, 2019, 5:22 p.m. UTC | #38
On Tue, Apr 16, 2019 at 12:57 PM Olof Johansson <olof@lixom.net> wrote:
[snip]
> > Where kheaders.tar.xz is placed doesn't really matter.
> > /proc or /sys/kernel makes no real difference.
>
> If done in a location that isn't a perpetual ABI commitment, a tarball
> solution is something we can work with.

This is the part I don't get, Olof. It has to be an ABI commitment so
that the tools always work. Why are we so interested in breaking
userspace tools? We want a robust solution, not a fragile one.

thanks,

 - Joel
Alexei Starovoitov April 16, 2019, 5:30 p.m. UTC | #39
On Tue, Apr 16, 2019 at 09:57:08AM -0700, Olof Johansson wrote:
> On Tue, Apr 16, 2019 at 9:46 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Apr 16, 2019 at 04:22:40PM +0200, Greg Kroah-Hartman wrote:
> > > On Tue, Apr 16, 2019 at 09:45:09AM -0400, Steven Rostedt wrote:
> > > > On Tue, 16 Apr 2019 09:32:37 -0400
> > > > Karim Yaghmour <karim.yaghmour@opersys.com> wrote:
> > > >
> > > > > >>> Then we should perhaps make a new file system call tarballs ;-)
> > > > > >>>
> > > > > >>>   /sys/kernel/tarballs/
> > > > > >>>
> > > > > >>> and place everything there. That way it removes it from /proc (which is
> > > > > >>> the worse place for that) and also makes it something other than debug.
> > > > > >>> That's what I did for tracefs.
> > > > > >>
> > > > > >> As horrible as that suggestion is, it does kind of make sense :)
> > > > > >>
> > > > > >> We can't put this in debugfs as that's only for debugging and systems
> > > > > >> should never have that mounted for normal operations (users want to
> > > > > >> build ebpf programs), and /proc really should be for processes but that
> > > > > >> horse is long left the barn.
> > > > > >>
> > > > > >> But, I'm willing to consider putting this either in a system-fs-like
> > > > > >> filesystem, or just in sysfs itself, we do have /sys/kernel/ to play
> > > > > >> around in if the main objection is that we should not be cluttering up
> > > > > >> /proc with stuff like this.
> > > > > >>
> > > > > >
> > > > > > I am ok with the suggestion of /sys/kernel for the archive. That also seems
> > > > > > to fit well with the idea that the headers are kernel related and probably
> > > > > > belong here more strictly speaking, than /proc.
> > > > >
> > > > > This makes sense. And if it alleviates concerns regarding extending
> > > > > /proc ABIs then might as well switch to this.
> > > > >
> > > > > Olof, what do you think of this?
> > > >
> > > > BTW, the name "tarballs" was kind of a joke. Probably should come up
> > > > with a better name. Although, I'm fine with tarballsfs ;-)
> > >
> > > No need to have this be a separate filesystem, we can use a binary sysfs
> > > file in /sys/kernel/ for this as the kernel is not doing any "parsing"
> > > of the data, it is just dumping it out to userspace.
> >
> > What folks keep saying that an fs of header files is easier to use
> > than tarball from bcc and cleaner from architectural pov.
> > That's not the case.
> > From bcc side I'd rather have a single precompiled headers blob
> > that I can feed into clang and improve bpf program compilation time.
> > Having a set of headers is a step to generate such .pch file,
> > but once generated the headers can be removed from fs and kheaders
> > module unloaded.
> > The sequence is: bcc checks standard /lib/module location,
> > if not there loads kheader mod, extracts into known location, and unloads.
> 
> May I suggest keeping the bcc-populated headers somewhere else?

what do you mean by bcc-populated?
bcc keeps its own headers inside libbcc.so .data section and provides
them to clang as 'memory buffer' in clang's virtual file system.

> Ideally something cleaned out on every reboot in case kernel changes
> without version string doing it.
> 
> That way you can by default prefer the module-exported tarball, and
> fall back to /lib/module/$(uname -r)/ if not available, instead of the
> other way around and instead of having to check creation times on the
> dir vs boot time of the kernel, etc.

the order of checks is bcc implementation detail. we can change that later.
we've seen issues with /lib/modules/`uname -r` being corrupted by chef,
so we might actually extract from kheaders.tar.xz all the time and more
than once.
Like try-compiling a simple prog and if it doesn't work, do the extract.

> Anyway, that's just an implementation detail. But it's the kind of
> detail that all tools that use this would need to get right, instead
> of doing it right once by exporting it in a way that it can be
> directly used.

Today bcc is the only tool that interacts with clang this way.
There is enough complexity and plenty of complex issues with
on-the-fly recompile approach.
I strongly suggest anyone considering new on-the-fly recompile to work
with us on BTF instead.

The set of headers is not an ultimate goal. See the example with pch. 
bpf tracing needs three components:
- all types and layout of datastructures;
  including all function prototypes with arg names
- all macros
- all inline functions

The first one is solved by BTF based solution,
but macroses and infline functions have no substitute, but C header files.
That is today.
Eventually we might find a way to reduce dependency on headers and have
macroses and infline functions represented some other way.
Like mini-pch where only relevant bits of headers are represented as
clang's syntax tree or mini C code.
The key point is that having headers is not a goal.
Making kernel maintain an fs of headers is imo a waste of kernel code.
The most minimal approach of compressed tarball is preferred.

> 
> > The extraced headers are in plain fs cache and will be evicted from memory
> > when bcc is done compiling progs.
> > imo much cleaner than kernel maintaining headers-fs and wasting memory.
> 
> So, in my original proposal I recommended unmounting when not needing
> it, which would remove the memory usage as well.

and such header-fs would uncompress internal tarball, create inodes, dentries
and to make sure all that stuff is cleanly refcnted and freed.
imo that is plenty of kernel code for no good reason.

> > Where kheaders.tar.xz is placed doesn't really matter.
> > /proc or /sys/kernel makes no real difference.
> 
> If done in a location that isn't a perpetual ABI commitment, a tarball
> solution is something we can work with.

Fair enough. My guess that kheaders.tar.xz in this shape we would
need for at least 5 years. After that we'll come up with better approach.
diff mbox series

Patch

diff --git a/init/Kconfig b/init/Kconfig
index 4592bf7997c0..ea75bfbf7dfa 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -580,6 +580,17 @@  config IKCONFIG_PROC
 	  This option enables access to the kernel configuration file
 	  through /proc/config.gz.
 
+config IKHEADERS_PROC
+	tristate "Enable kernel header artifacts through /proc/kheaders.tar.xz"
+	depends on PROC_FS
+	help
+	  This option enables access to the kernel header and other artifacts that
+          are generated during the build process. These can be used to build kernel
+          modules or by other in-kernel programs such as those generated by eBPF
+          and systemtap tools. If you build the headers as a module, a module
+          called kheaders.ko is built which can be loaded on-demand to get access
+          to the headers.
+
 config LOG_BUF_SHIFT
 	int "Kernel log buffer size (16 => 64KB, 17 => 128KB)"
 	range 12 25
diff --git a/kernel/.gitignore b/kernel/.gitignore
index 6e699100872f..34d1e77ee9df 100644
--- a/kernel/.gitignore
+++ b/kernel/.gitignore
@@ -1,5 +1,6 @@ 
 #
 # Generated files
 #
+kheaders.md5
 timeconst.h
 hz.bc
diff --git a/kernel/Makefile b/kernel/Makefile
index 6c57e78817da..7c486bc25fd7 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -70,6 +70,7 @@  obj-$(CONFIG_UTS_NS) += utsname.o
 obj-$(CONFIG_USER_NS) += user_namespace.o
 obj-$(CONFIG_PID_NS) += pid_namespace.o
 obj-$(CONFIG_IKCONFIG) += configs.o
+obj-$(CONFIG_IKHEADERS_PROC) += kheaders.o
 obj-$(CONFIG_SMP) += stop_machine.o
 obj-$(CONFIG_KPROBES_SANITY_TEST) += test_kprobes.o
 obj-$(CONFIG_AUDIT) += audit.o auditfilter.o
@@ -121,3 +122,30 @@  $(obj)/configs.o: $(obj)/config_data.gz
 targets += config_data.gz
 $(obj)/config_data.gz: $(KCONFIG_CONFIG) FORCE
 	$(call if_changed,gzip)
+
+# Build a list of in-kernel headers for building kernel modules
+ikh_file_list := include/
+ikh_file_list += arch/$(SRCARCH)/Makefile
+ikh_file_list += arch/$(SRCARCH)/include/
+ikh_file_list += arch/$(SRCARCH)/module.lds
+ikh_file_list += arch/$(SRCARCH)/kernel/module.lds
+ikh_file_list += scripts/
+ikh_file_list += Makefile
+
+# Things we need from the $objtree. "OBJDIR" is for the gen_ikh_data.sh
+# script to identify that this comes from the $objtree directory
+ikh_file_list += OBJDIR/scripts/
+ikh_file_list += OBJDIR/include/
+ikh_file_list += OBJDIR/arch/$(SRCARCH)/include/
+ifeq ($(CONFIG_STACK_VALIDATION), y)
+ikh_file_list += OBJDIR/tools/objtool/objtool
+endif
+
+$(obj)/kheaders.o: $(obj)/kheaders_data.tar.xz
+
+quiet_cmd_genikh = GEN     $(obj)/kheaders_data.tar.xz
+cmd_genikh = $(srctree)/scripts/gen_ikh_data.sh $@ $(ikh_file_list)
+$(obj)/kheaders_data.tar.xz: FORCE
+	$(call cmd,genikh)
+
+clean-files := kheaders_data.tar.xz kheaders.md5
diff --git a/kernel/kheaders.c b/kernel/kheaders.c
new file mode 100644
index 000000000000..d072a958a8f1
--- /dev/null
+++ b/kernel/kheaders.c
@@ -0,0 +1,73 @@ 
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * kernel/kheaders.c
+ * Provide headers and artifacts needed to build kernel modules.
+ * (Borrowed code from kernel/configs.c)
+ */
+
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/proc_fs.h>
+#include <linux/init.h>
+#include <linux/uaccess.h>
+
+/*
+ * Define kernel_headers_data and kernel_headers_data_end, within which the the
+ * compressed kernel headers are stpred. The file is first compressed with xz.
+ */
+
+asm (
+"	.pushsection .rodata, \"a\"		\n"
+"	.global kernel_headers_data		\n"
+"kernel_headers_data:				\n"
+"	.incbin \"kernel/kheaders_data.tar.xz\"	\n"
+"	.global kernel_headers_data_end		\n"
+"kernel_headers_data_end:			\n"
+"	.popsection				\n"
+);
+
+extern char kernel_headers_data;
+extern char kernel_headers_data_end;
+
+static ssize_t
+ikheaders_read_current(struct file *file, char __user *buf,
+		      size_t len, loff_t *offset)
+{
+	return simple_read_from_buffer(buf, len, offset,
+				       &kernel_headers_data,
+				       &kernel_headers_data_end -
+				       &kernel_headers_data);
+}
+
+static const struct file_operations ikheaders_file_ops = {
+	.read = ikheaders_read_current,
+	.llseek = default_llseek,
+};
+
+static int __init ikheaders_init(void)
+{
+	struct proc_dir_entry *entry;
+
+	/* create the current headers file */
+	entry = proc_create("kheaders.tar.xz", S_IRUGO, NULL,
+			    &ikheaders_file_ops);
+	if (!entry)
+		return -ENOMEM;
+
+	proc_set_size(entry,
+		      &kernel_headers_data_end -
+		      &kernel_headers_data);
+	return 0;
+}
+
+static void __exit ikheaders_cleanup(void)
+{
+	remove_proc_entry("kheaders.tar.xz", NULL);
+}
+
+module_init(ikheaders_init);
+module_exit(ikheaders_cleanup);
+
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Joel Fernandes");
+MODULE_DESCRIPTION("Echo the kernel header artifacts used to build the kernel");
diff --git a/scripts/gen_ikh_data.sh b/scripts/gen_ikh_data.sh
new file mode 100755
index 000000000000..3f9cae72c2a4
--- /dev/null
+++ b/scripts/gen_ikh_data.sh
@@ -0,0 +1,84 @@ 
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# This script generates an archive consisting of kernel headers
+# for CONFIG_IKHEADERS_PROC.
+set -e
+
+spath="$(dirname "$(readlink -f "$0")")"
+kroot="$spath/.."
+outdir="$(pwd)"
+tarfile=$1
+cpio_dir=$outdir/$tarfile.tmp
+
+file_list=${@:2}
+
+src_file_list=""
+for f in $file_list; do
+	if [ ! -f "$kroot/$f" ] && [ ! -d "$kroot/$f" ]; then continue; fi
+	src_file_list="$src_file_list $(echo $f | grep -v OBJDIR)"
+done
+
+obj_file_list=""
+for f in $file_list; do
+	f=$(echo $f | grep OBJDIR | sed -e 's/OBJDIR\///g')
+	if [ ! -f $f ] && [ ! -d $f ]; then continue; fi
+	obj_file_list="$obj_file_list $f";
+done
+
+# Support incremental builds by skipping archive generation
+# if timestamps of files being archived are not changed.
+
+# This block is useful for debugging the incremental builds.
+# Uncomment it for debugging.
+# iter=1
+# if [ ! -f /tmp/iter ]; then echo 1 > /tmp/iter;
+# else; 	iter=$(($(cat /tmp/iter) + 1)); fi
+# find $src_file_list -type f | xargs ls -lR > /tmp/src-ls-$iter
+# find $obj_file_list -type f | xargs ls -lR > /tmp/obj-ls-$iter
+
+# modules.order and include/generated/compile.h are ignored because these are
+# touched even when none of the source files changed. This causes pointless
+# regeneration, so let us ignore them for md5 calculation.
+pushd $kroot > /dev/null
+src_files_md5="$(find $src_file_list -type f ! -name modules.order |
+		grep -v "include/generated/compile.h"		   |
+		xargs ls -lR | md5sum | cut -d ' ' -f1)"
+popd > /dev/null
+obj_files_md5="$(find $obj_file_list -type f ! -name modules.order |
+		grep -v "include/generated/compile.h"		   |
+		xargs ls -lR | md5sum | cut -d ' ' -f1)"
+
+if [ -f $tarfile ]; then tarfile_md5="$(md5sum $tarfile | cut -d ' ' -f1)"; fi
+if [ -f kernel/kheaders.md5 ] &&
+	[ "$(cat kernel/kheaders.md5|head -1)" == "$src_files_md5" ] &&
+	[ "$(cat kernel/kheaders.md5|head -2|tail -1)" == "$obj_files_md5" ] &&
+	[ "$(cat kernel/kheaders.md5|tail -1)" == "$tarfile_md5" ]; then
+		exit
+fi
+
+rm -rf $cpio_dir
+mkdir $cpio_dir
+
+pushd $kroot > /dev/null
+for f in $src_file_list;
+	do find "$f" ! -name "*.c" ! -name "*.o" ! -name "*.cmd" ! -name ".*";
+done | cpio --quiet -pd $cpio_dir
+popd > /dev/null
+
+# The second CPIO can complain if files already exist which can
+# happen with out of tree builds. Just silence CPIO for now.
+for f in $obj_file_list;
+	do find "$f" ! -name "*.c" ! -name "*.o" ! -name "*.cmd" ! -name ".*";
+done | cpio --quiet -pd $cpio_dir >/dev/null 2>&1
+
+find  $cpio_dir -type f -print0 |
+	xargs -0 -P8 -n1 perl -pi -e 'BEGIN {undef $/;}; s/\/\*((?!SPDX).)*?\*\///smg;'
+
+tar -Jcf $tarfile -C $cpio_dir/ . > /dev/null
+
+echo "$src_files_md5" > kernel/kheaders.md5
+echo "$obj_files_md5" >> kernel/kheaders.md5
+echo "$(md5sum $tarfile | cut -d ' ' -f1)" >> kernel/kheaders.md5
+
+rm -rf $cpio_dir