mbox series

[RFC,0/1] overlayfs: C/R enhancments (RFC)

Message ID 20201004192401.9738-1-alexander.mikhalitsyn@virtuozzo.com (mailing list archive)
Headers show
Series overlayfs: C/R enhancments (RFC) | expand

Message

Alexander Mikhalitsyn Oct. 4, 2020, 7:24 p.m. UTC
Some time ago we discussed about the problem of Checkpoint-Restoring
overlayfs mounts [1]. Big thanks to Amir for review and suggestions.

Brief from previous discussion.
Problem statement: to checkpoint-restore overlayfs mounts we need
to save overlayfs mount state and save it into the image. Basically,
this state for us it's just mount options of overlayfs mount. But
here we have two problems:

I. during mounting overlayfs user may specify relative paths in upperdir,
workdir, lowerdir options

II. also user may unmount mount from which these paths was opened during mounting

This is real problems for us. My first patch was attempt to address both problems.
1. I've added refcnt get for mounts from which overlayfs was mounted.
2. I've changed overlayfs mountinfo show algorithm, so overlayfs started to *always*
show full paths for upperdir,workdir,lowerdirs.
3. I've added mnt_id show-time only option which allows to determine from which mnt_id
we opened options paths.

Pros:
- we can determine full information about overlayfs mount
- we hold refcnt to mount, so, user may unmount source mounts only
with lazy flag

Cons:
- by adding refcnt get for mount I've changed possible overlayfs usecases
- by showing *full* paths we can more easily reache PAGE_SIZE limit of 
mounts options in procfs
- by adding mnt_id show-only option I've added inconsistency between
mount-time options and show-time mount options

After very productive discussion with Amir and Pavel I've decided to write new
implementation. In new approach we decided *not* to take extra refcnts to mounts.
Also we decided to use exportfs fhandles instead of full paths. To determine
full path we plan to use the next algo:
1. Export {s_dev; fhandle} from overlayfs for *all* sources
2. User open_by_handle_at syscall to open all these fhandles (we need to
determine mount for each fhandle, looks like we can do this by s_dev by linear
search in /proc/<pid>/mountinfo)
3. Then readlink /proc/<pid>/fd/<opened fd>
4. Dump this full path+mnt_id

But there is question. How to export this {s_dev; fhandle} from kernel to userspace?
- We decided not to use procfs.
- Amir proposed solution - use xattrs. But after diving into it I've meet problem
where I can set this xattrs?
If I set this xattrs on overlayfs dentries then during rsync, or cp -p=xattr we will copy
this temporary information.
- ioctls? (this patchset implements this approach)
- fsinfo subsystem (not merged yet) [2]

Problems with ioctls:
1. We limited in output data size (16 KB AFAIK)
but MAX_HANDLE_SZ=128(bytes), OVL_MAX_STACK=500(num lowerdirs)
So, MAX_HANDLE_SZ*OVL_MAX_STACK = 64KB which is bigger than limit.
So, I've decided to give user one fhandle by one call. This is also
bad from the performance point of view.
2. When using ioctls we need to have *fixed* size of input and output.
So, if MAX_HANDLE_SZ will change in the future our _IOR('o', 2, struct ovl_mnt_opt_fh)
will also change with struct ovl_mnt_opt_fh.

So, I hope that we discuss about this patchset and try to make possible solutions together.

Thanks.
Regards, Alex.

[1] https://lore.kernel.org/linux-unionfs/20200604161133.20949-1-alexander.mikhalitsyn@virtuozzo.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=fsinfo-core

Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Miklos Szeredi <miklos@szeredi.hu>
Cc: David Howells <dhowells@redhat.com>
Cc: linux-unionfs@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org

Alexander Mikhalitsyn (1):
  overlayfs: add ioctls that allows to get fhandle for layers dentries

 fs/overlayfs/readdir.c | 160 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 160 insertions(+)

Comments

Amir Goldstein Oct. 5, 2020, 7:56 a.m. UTC | #1
On Sun, Oct 4, 2020 at 10:25 PM Alexander Mikhalitsyn
<alexander.mikhalitsyn@virtuozzo.com> wrote:
>
> Some time ago we discussed about the problem of Checkpoint-Restoring
> overlayfs mounts [1]. Big thanks to Amir for review and suggestions.
>
> Brief from previous discussion.
> Problem statement: to checkpoint-restore overlayfs mounts we need
> to save overlayfs mount state and save it into the image. Basically,
> this state for us it's just mount options of overlayfs mount. But
> here we have two problems:
>
> I. during mounting overlayfs user may specify relative paths in upperdir,
> workdir, lowerdir options
>
> II. also user may unmount mount from which these paths was opened during mounting
>
> This is real problems for us. My first patch was attempt to address both problems.
> 1. I've added refcnt get for mounts from which overlayfs was mounted.
> 2. I've changed overlayfs mountinfo show algorithm, so overlayfs started to *always*
> show full paths for upperdir,workdir,lowerdirs.
> 3. I've added mnt_id show-time only option which allows to determine from which mnt_id
> we opened options paths.
>
> Pros:
> - we can determine full information about overlayfs mount
> - we hold refcnt to mount, so, user may unmount source mounts only
> with lazy flag
>
> Cons:
> - by adding refcnt get for mount I've changed possible overlayfs usecases
> - by showing *full* paths we can more easily reache PAGE_SIZE limit of
> mounts options in procfs
> - by adding mnt_id show-only option I've added inconsistency between
> mount-time options and show-time mount options
>
> After very productive discussion with Amir and Pavel I've decided to write new
> implementation. In new approach we decided *not* to take extra refcnts to mounts.
> Also we decided to use exportfs fhandles instead of full paths. To determine
> full path we plan to use the next algo:
> 1. Export {s_dev; fhandle} from overlayfs for *all* sources
> 2. User open_by_handle_at syscall to open all these fhandles (we need to
> determine mount for each fhandle, looks like we can do this by s_dev by linear
> search in /proc/<pid>/mountinfo)
> 3. Then readlink /proc/<pid>/fd/<opened fd>
> 4. Dump this full path+mnt_id
>

Hi Alex,

The general concept looks good to me.
I will not provide specific comment on the implementation (it looks
fine) until the
concept API is accepted by the maintainer.

The main thing I want to make sure is that if we add this interface it can
serve other use cases as well.

During my talk on LPC, I got a similar request from two developers for two
different use cases. They wanted a safe method to iterate "changes
since baseline"
from either within the container or from the host.

Your proposed API is a step in the direction for meeting their requirement.
The major change is that ioctl (or whatever method) should expose the
layers topology of a specific object, not only the overlay instance.

For C/R you would query the layers topology of the overlay root dir.

My comments of the specific methods below are not meant to
object to the choice of ioctl, but they are meant to give the alternative
a fair chance. I am kind of leaning towards ioctl myself.

> But there is question. How to export this {s_dev; fhandle} from kernel to userspace?
> - We decided not to use procfs.

Why not?
C/R already uses procfs to export fhandle for fanotify/inotify
I kind of like the idea of having /sys/fs/overlay/instances etc.
It could be useful to many things.

> - Amir proposed solution - use xattrs. But after diving into it I've meet problem
> where I can set this xattrs?
> If I set this xattrs on overlayfs dentries then during rsync, or cp -p=xattr we will copy
> this temporary information.

No you won't.
rsync, cp will only copy xattrs listed with listxattr.
Several filesystems, such as cifs and nfs export "object properties"
via private xattrs
that are not listed in listxattr (e.g. CIFS_XATTR_CIFS_ACL).
You are not limited in what you can do in the "trusted.overlay" namespace, for
example "trusted.overlay.layers.0.fh"

The advantage is that it is very easy to implement and requires
less discussions about ABI, but I agree it does feel a bit like a hack.

> - ioctls? (this patchset implements this approach)
> - fsinfo subsystem (not merged yet) [2]
>
> Problems with ioctls:
> 1. We limited in output data size (16 KB AFAIK)
> but MAX_HANDLE_SZ=128(bytes), OVL_MAX_STACK=500(num lowerdirs)
> So, MAX_HANDLE_SZ*OVL_MAX_STACK = 64KB which is bigger than limit.
> So, I've decided to give user one fhandle by one call. This is also
> bad from the performance point of view.
> 2. When using ioctls we need to have *fixed* size of input and output.
> So, if MAX_HANDLE_SZ will change in the future our _IOR('o', 2, struct ovl_mnt_opt_fh)
> will also change with struct ovl_mnt_opt_fh.
>

The choice of API with fixed output size for a variable length info seems weird.

I am tempted to suggest extending name_to_handle_at(), for example
name_to_handle_at(ovl_root_fd, path, &fhandle, &layer_id, AT_LAYER)

Where layer_id can be input/output arg.

But I acknowledge this is going to be a much harder sell...

Thanks,
Amir.
Alexander Mikhalitsyn Oct. 5, 2020, 7:46 p.m. UTC | #2
Hi Amir,

On Mon, 5 Oct 2020 10:56:50 +0300
Amir Goldstein <amir73il@gmail.com> wrote:

> On Sun, Oct 4, 2020 at 10:25 PM Alexander Mikhalitsyn
> <alexander.mikhalitsyn@virtuozzo.com> wrote:
> >
> > Some time ago we discussed about the problem of Checkpoint-Restoring
> > overlayfs mounts [1]. Big thanks to Amir for review and suggestions.
> >
> > Brief from previous discussion.
> > Problem statement: to checkpoint-restore overlayfs mounts we need
> > to save overlayfs mount state and save it into the image. Basically,
> > this state for us it's just mount options of overlayfs mount. But
> > here we have two problems:
> >
> > I. during mounting overlayfs user may specify relative paths in upperdir,
> > workdir, lowerdir options
> >
> > II. also user may unmount mount from which these paths was opened during mounting
> >
> > This is real problems for us. My first patch was attempt to address both problems.
> > 1. I've added refcnt get for mounts from which overlayfs was mounted.
> > 2. I've changed overlayfs mountinfo show algorithm, so overlayfs started to *always*
> > show full paths for upperdir,workdir,lowerdirs.
> > 3. I've added mnt_id show-time only option which allows to determine from which mnt_id
> > we opened options paths.
> >
> > Pros:
> > - we can determine full information about overlayfs mount
> > - we hold refcnt to mount, so, user may unmount source mounts only
> > with lazy flag
> >
> > Cons:
> > - by adding refcnt get for mount I've changed possible overlayfs usecases
> > - by showing *full* paths we can more easily reache PAGE_SIZE limit of
> > mounts options in procfs
> > - by adding mnt_id show-only option I've added inconsistency between
> > mount-time options and show-time mount options
> >
> > After very productive discussion with Amir and Pavel I've decided to write new
> > implementation. In new approach we decided *not* to take extra refcnts to mounts.
> > Also we decided to use exportfs fhandles instead of full paths. To determine
> > full path we plan to use the next algo:
> > 1. Export {s_dev; fhandle} from overlayfs for *all* sources
> > 2. User open_by_handle_at syscall to open all these fhandles (we need to
> > determine mount for each fhandle, looks like we can do this by s_dev by linear
> > search in /proc/<pid>/mountinfo)
> > 3. Then readlink /proc/<pid>/fd/<opened fd>
> > 4. Dump this full path+mnt_id
> >
> 
> Hi Alex,
> 
> The general concept looks good to me.
> I will not provide specific comment on the implementation (it looks
> fine) until the
> concept API is accepted by the maintainer.
> 
> The main thing I want to make sure is that if we add this interface it can
> serve other use cases as well.

Yes, let's create universal interface.

> 
> During my talk on LPC, I got a similar request from two developers for two
> different use cases. They wanted a safe method to iterate "changes
> since baseline"
> from either within the container or from the host.

This discussions was on lkml or in private room?

> 
> Your proposed API is a step in the direction for meeting their requirement.
> The major change is that ioctl (or whatever method) should expose the
> layers topology of a specific object, not only the overlay instance.
> 
> For C/R you would query the layers topology of the overlay root dir.
> 
> My comments of the specific methods below are not meant to
> object to the choice of ioctl, but they are meant to give the alternative
> a fair chance. I am kind of leaning towards ioctl myself.
> 
> > But there is question. How to export this {s_dev; fhandle} from kernel to userspace?
> > - We decided not to use procfs.
> 
> Why not?
> C/R already uses procfs to export fhandle for fanotify/inotify
> I kind of like the idea of having /sys/fs/overlay/instances etc.
> It could be useful to many things.

Ah, sorry. For some reason I've decided that we excluded procfs/sysfs option :)
Let's take this option into account too.

> 
> > - Amir proposed solution - use xattrs. But after diving into it I've meet problem
> > where I can set this xattrs?
> > If I set this xattrs on overlayfs dentries then during rsync, or cp -p=xattr we will copy
> > this temporary information.
> 
> No you won't.
> rsync, cp will only copy xattrs listed with listxattr.
> Several filesystems, such as cifs and nfs export "object properties"
> via private xattrs
> that are not listed in listxattr (e.g. CIFS_XATTR_CIFS_ACL).
> You are not limited in what you can do in the "trusted.overlay" namespace, for
> example "trusted.overlay.layers.0.fh"
> 
> The advantage is that it is very easy to implement and requires
> less discussions about ABI, but I agree it does feel a bit like a hack.

Ack. I can try to write some draft implementation with xattrs.

> 
> > - ioctls? (this patchset implements this approach)
> > - fsinfo subsystem (not merged yet) [2]
> >
> > Problems with ioctls:
> > 1. We limited in output data size (16 KB AFAIK)
> > but MAX_HANDLE_SZ=128(bytes), OVL_MAX_STACK=500(num lowerdirs)
> > So, MAX_HANDLE_SZ*OVL_MAX_STACK = 64KB which is bigger than limit.
> > So, I've decided to give user one fhandle by one call. This is also
> > bad from the performance point of view.
> > 2. When using ioctls we need to have *fixed* size of input and output.
> > So, if MAX_HANDLE_SZ will change in the future our _IOR('o', 2, struct ovl_mnt_opt_fh)
> > will also change with struct ovl_mnt_opt_fh.
> >
> 
> The choice of API with fixed output size for a variable length info seems weird.

Yes, and I've proposed option with ioctl syscall where we open file descriptor
instead of doing direct copy_from_user/copy_to_user.

> 
> I am tempted to suggest extending name_to_handle_at(), for example
> name_to_handle_at(ovl_root_fd, path, &fhandle, &layer_id, AT_LAYER)
> 
> Where layer_id can be input/output arg.
> 
> But I acknowledge this is going to be a much harder sell...

Looks interesting. I'll need to dive and think about it.

> 
> Thanks,
> Amir.

Thanks for your attention and review of patchset!

Regards,
Alex.
Amir Goldstein Oct. 6, 2020, 6:53 a.m. UTC | #3
On Mon, Oct 5, 2020 at 10:47 PM Alexander Mikhalitsyn
<alexander.mikhalitsyn@virtuozzo.com> wrote:
>
> Hi Amir,
>
> On Mon, 5 Oct 2020 10:56:50 +0300
> Amir Goldstein <amir73il@gmail.com> wrote:
>
> > On Sun, Oct 4, 2020 at 10:25 PM Alexander Mikhalitsyn
> > <alexander.mikhalitsyn@virtuozzo.com> wrote:
> > >
> > > Some time ago we discussed about the problem of Checkpoint-Restoring
> > > overlayfs mounts [1]. Big thanks to Amir for review and suggestions.
> > >
> > > Brief from previous discussion.
> > > Problem statement: to checkpoint-restore overlayfs mounts we need
> > > to save overlayfs mount state and save it into the image. Basically,
> > > this state for us it's just mount options of overlayfs mount. But
> > > here we have two problems:
> > >
> > > I. during mounting overlayfs user may specify relative paths in upperdir,
> > > workdir, lowerdir options
> > >
> > > II. also user may unmount mount from which these paths was opened during mounting
> > >
> > > This is real problems for us. My first patch was attempt to address both problems.
> > > 1. I've added refcnt get for mounts from which overlayfs was mounted.
> > > 2. I've changed overlayfs mountinfo show algorithm, so overlayfs started to *always*
> > > show full paths for upperdir,workdir,lowerdirs.
> > > 3. I've added mnt_id show-time only option which allows to determine from which mnt_id
> > > we opened options paths.
> > >
> > > Pros:
> > > - we can determine full information about overlayfs mount
> > > - we hold refcnt to mount, so, user may unmount source mounts only
> > > with lazy flag
> > >
> > > Cons:
> > > - by adding refcnt get for mount I've changed possible overlayfs usecases
> > > - by showing *full* paths we can more easily reache PAGE_SIZE limit of
> > > mounts options in procfs
> > > - by adding mnt_id show-only option I've added inconsistency between
> > > mount-time options and show-time mount options
> > >
> > > After very productive discussion with Amir and Pavel I've decided to write new
> > > implementation. In new approach we decided *not* to take extra refcnts to mounts.
> > > Also we decided to use exportfs fhandles instead of full paths. To determine
> > > full path we plan to use the next algo:
> > > 1. Export {s_dev; fhandle} from overlayfs for *all* sources
> > > 2. User open_by_handle_at syscall to open all these fhandles (we need to
> > > determine mount for each fhandle, looks like we can do this by s_dev by linear
> > > search in /proc/<pid>/mountinfo)
> > > 3. Then readlink /proc/<pid>/fd/<opened fd>
> > > 4. Dump this full path+mnt_id
> > >
> >
> > Hi Alex,
> >
> > The general concept looks good to me.
> > I will not provide specific comment on the implementation (it looks
> > fine) until the
> > concept API is accepted by the maintainer.
> >
> > The main thing I want to make sure is that if we add this interface it can
> > serve other use cases as well.
>
> Yes, let's create universal interface.
>

Note that this universal interface contradicts the direction of sysfs
which is a convenient way for getting filesystem instance info, but
not object info.

> >
> > During my talk on LPC, I got a similar request from two developers for two
> > different use cases. They wanted a safe method to iterate "changes
> > since baseline"
> > from either within the container or from the host.
>
> This discussions was on lkml or in private room?
>

The containers track:
https://youtu.be/fSyr_IXM21Y?t=4939

We continued in private channels, but the general idea
is an API to provide some insight about underlying layers

> >
> > Your proposed API is a step in the direction for meeting their requirement.
> > The major change is that ioctl (or whatever method) should expose the
> > layers topology of a specific object, not only the overlay instance.
> >
> > For C/R you would query the layers topology of the overlay root dir.
> >
> > My comments of the specific methods below are not meant to
> > object to the choice of ioctl, but they are meant to give the alternative
> > a fair chance. I am kind of leaning towards ioctl myself.
> >
> > > But there is question. How to export this {s_dev; fhandle} from kernel to userspace?
> > > - We decided not to use procfs.
> >
> > Why not?
> > C/R already uses procfs to export fhandle for fanotify/inotify
> > I kind of like the idea of having /sys/fs/overlay/instances etc.
> > It could be useful to many things.
>
> Ah, sorry. For some reason I've decided that we excluded procfs/sysfs option :)
> Let's take this option into account too.
>
> >
> > > - Amir proposed solution - use xattrs. But after diving into it I've meet problem
> > > where I can set this xattrs?
> > > If I set this xattrs on overlayfs dentries then during rsync, or cp -p=xattr we will copy
> > > this temporary information.
> >
> > No you won't.
> > rsync, cp will only copy xattrs listed with listxattr.
> > Several filesystems, such as cifs and nfs export "object properties"
> > via private xattrs
> > that are not listed in listxattr (e.g. CIFS_XATTR_CIFS_ACL).
> > You are not limited in what you can do in the "trusted.overlay" namespace, for
> > example "trusted.overlay.layers.0.fh"
> >
> > The advantage is that it is very easy to implement and requires
> > less discussions about ABI, but I agree it does feel a bit like a hack.
>
> Ack. I can try to write some draft implementation with xattrs.
>

You don't have to write code before getting an ack from
maintainer on the design, but fine by me.

> >
> > > - ioctls? (this patchset implements this approach)
> > > - fsinfo subsystem (not merged yet) [2]
> > >
> > > Problems with ioctls:
> > > 1. We limited in output data size (16 KB AFAIK)
> > > but MAX_HANDLE_SZ=128(bytes), OVL_MAX_STACK=500(num lowerdirs)
> > > So, MAX_HANDLE_SZ*OVL_MAX_STACK = 64KB which is bigger than limit.
> > > So, I've decided to give user one fhandle by one call. This is also
> > > bad from the performance point of view.
> > > 2. When using ioctls we need to have *fixed* size of input and output.
> > > So, if MAX_HANDLE_SZ will change in the future our _IOR('o', 2, struct ovl_mnt_opt_fh)
> > > will also change with struct ovl_mnt_opt_fh.
> > >
> >
> > The choice of API with fixed output size for a variable length info seems weird.
>
> Yes, and I've proposed option with ioctl syscall where we open file descriptor
> instead of doing direct copy_from_user/copy_to_user.
>
> >
> > I am tempted to suggest extending name_to_handle_at(), for example
> > name_to_handle_at(ovl_root_fd, path, &fhandle, &layer_id, AT_LAYER)
> >
> > Where layer_id can be input/output arg.
> >
> > But I acknowledge this is going to be a much harder sell...
>
> Looks interesting. I'll need to dive and think about it.
>

This API change has a lot more stakeholders.
I think it would be wiser for you to stay within the overlayfs boundaries.

Thanks,
Amir.