overlayfs: allowing for changes to lowerdir

Message ID	CAOQ4uxiBmFdcueorKV7zwPLCDq4DE+H8x=8H1f7+3v3zysW9qA@mail.gmail.com (mailing list archive)
State	New, archived
Headers	show Return-Path: <linux-fsdevel-owner@kernel.org> MIME-Version: 1.0 In-Reply-To: <CA+ZH+jFBAaRi6VPmf3PBdDZQQMOaT6WUByATqaw1QL5M9+-dxg@mail.gmail.com> References: <CA+ZH+jFBAaRi6VPmf3PBdDZQQMOaT6WUByATqaw1QL5M9+-dxg@mail.gmail.com> From: Amir Goldstein <amir73il@gmail.com> Date: Tue, 14 Feb 2017 16:01:45 +0200 Message-ID: <CAOQ4uxiBmFdcueorKV7zwPLCDq4DE+H8x=8H1f7+3v3zysW9qA@mail.gmail.com> Subject: Re: overlayfs: allowing for changes to lowerdir To: Josh England <jjengla@gmail.com> Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>, linux-unionfs@vger.kernel.org, Miklos Szeredi <mszeredi@redhat.com> Content-Type: text/plain; charset=UTF-8 Sender: linux-fsdevel-owner@vger.kernel.org Precedence: bulk

Amir Goldstein Feb. 14, 2017, 2:01 p.m. UTC

On Mon, Feb 13, 2017 at 11:41 PM, Josh England <jjengla@gmail.com> wrote:
> So here's the use case:  lowerdir is an NFS mounted root filesystem
> (shared by a bunch of nodes).  upperdir is a tmpfs RAM disk to allow
> for writes to happen.  This works great with the caveat being I cannot
> make 'live' changes to the root filesystem, which poses the problem.
> Any access to a changed file causes a 'Stale file handle' error.
>
> With some experimenting, I've discovered that remounting the overlay
> filesystem (mount -o remount / /)  registers any changes that have
> been made to the lower NFS filesystem.  In addition, dumping cache
> (via /proc/sys/vm/drop_caches) also makes the stale file handle errors
> go away and reads pass through to the lower dir and correctly show
> changes.
>
> I'd like to make this use case feasible by allowing changes to the NFS
> lowerdir to work more or less transparently.  It seems like if the
> overlay did not do any caching at all, all reads would fall through to
> either the upperdir ram disk or the NFS lower, which is precisely what
> I want.
>
> So, let me pose this somewhat naive question:  Would it be possible to
> simply disable any cacheing performed by the overlay to force all
> reads to go to either the tmpfs upper or the (VFS-cached) NFS lower?
> Would this be enough to accomplish my goal of being able to change the
> lowerdir of an active overlayfs?
>

There is no need to disable caching. There is already a mechanism
in place in VFS to revalidate inode cache entries.
NFS implements d_revalidate() and overlayfs implements d_revalidate()
by calling into the lower fs d_revalidate().

However overlayfs intentionally errors when lower entry has been modified.
(see: 7c03b5d ovl: allow distributed fs as lower layer)

You can try this (untested) patch to revert this behavior, just to see if it
works for your use case, but it won't change this fact
from Documentation/filesystems/overlayfs.txt:
" Changes to the underlying filesystems while part of a mounted overlay
filesystem are not allowed.  If the underlying filesystem is changed,
the behavior of the overlay is undefined, though it will not result in
a crash or deadlock."

Specifically, renaming directories and files in lower that were already
copied up is going to have a weird outcome.

Also, the situation with changing files in lower remote fs could be worse
than changing files on lower local fs, simply because right now, this
use case is not tested (i.e. it results in ESTALE).

I believe that fixing this use case, if at all possible, would require quite
a bit of work, a lot of documentation (about expected behavior) and
even more testing.

Amir.

Josh England Feb. 14, 2017, 5:14 p.m. UTC | #1

Amir,

Thanks for the patch and the clarification for which changes to the
underlying filesystem are likely to cause havoc.  I'll play with this
some to see how usable it might be.

-JE


On Tue, Feb 14, 2017 at 6:01 AM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Mon, Feb 13, 2017 at 11:41 PM, Josh England <jjengla@gmail.com> wrote:
>> So here's the use case:  lowerdir is an NFS mounted root filesystem
>> (shared by a bunch of nodes).  upperdir is a tmpfs RAM disk to allow
>> for writes to happen.  This works great with the caveat being I cannot
>> make 'live' changes to the root filesystem, which poses the problem.
>> Any access to a changed file causes a 'Stale file handle' error.
>>
>> With some experimenting, I've discovered that remounting the overlay
>> filesystem (mount -o remount / /)  registers any changes that have
>> been made to the lower NFS filesystem.  In addition, dumping cache
>> (via /proc/sys/vm/drop_caches) also makes the stale file handle errors
>> go away and reads pass through to the lower dir and correctly show
>> changes.
>>
>> I'd like to make this use case feasible by allowing changes to the NFS
>> lowerdir to work more or less transparently.  It seems like if the
>> overlay did not do any caching at all, all reads would fall through to
>> either the upperdir ram disk or the NFS lower, which is precisely what
>> I want.
>>
>> So, let me pose this somewhat naive question:  Would it be possible to
>> simply disable any cacheing performed by the overlay to force all
>> reads to go to either the tmpfs upper or the (VFS-cached) NFS lower?
>> Would this be enough to accomplish my goal of being able to change the
>> lowerdir of an active overlayfs?
>>
>
> There is no need to disable caching. There is already a mechanism
> in place in VFS to revalidate inode cache entries.
> NFS implements d_revalidate() and overlayfs implements d_revalidate()
> by calling into the lower fs d_revalidate().
>
> However overlayfs intentionally errors when lower entry has been modified.
> (see: 7c03b5d ovl: allow distributed fs as lower layer)
>
> You can try this (untested) patch to revert this behavior, just to see if it
> works for your use case, but it won't change this fact
> from Documentation/filesystems/overlayfs.txt:
> " Changes to the underlying filesystems while part of a mounted overlay
> filesystem are not allowed.  If the underlying filesystem is changed,
> the behavior of the overlay is undefined, though it will not result in
> a crash or deadlock."
>
> Specifically, renaming directories and files in lower that were already
> copied up is going to have a weird outcome.
>
> Also, the situation with changing files in lower remote fs could be worse
> than changing files on lower local fs, simply because right now, this
> use case is not tested (i.e. it results in ESTALE).
>
> I believe that fixing this use case, if at all possible, would require quite
> a bit of work, a lot of documentation (about expected behavior) and
> even more testing.
>
> Amir.
>
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index e8ef9d1..6806ef3 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -106,16 +106,11 @@ static int ovl_dentry_revalidate(struct dentry
> *dentry, unsigned int flags)
>
>                 if (d->d_flags & DCACHE_OP_REVALIDATE) {
>                         ret = d->d_op->d_revalidate(d, flags);
> -                       if (ret < 0)
> +                       if (ret =< 0)
>                                 return ret;
> -                       if (!ret) {
> -                               if (!(flags & LOOKUP_RCU))
> -                                       d_invalidate(d);
> -                               return -ESTALE;
> -                       }
>                 }
>         }
> -       return 1;
> +       return ret;
>  }

Josh England Feb. 21, 2017, 11:08 p.m. UTC | #2

Amir,

After playing with it some, this patch seems to provide precisely the
behavior I need for my use case.  Do you think it makes sense to turn
this behavior into a module parameter (eg: allow_revalidate)?

-JE


On Tue, Feb 14, 2017 at 6:01 AM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Mon, Feb 13, 2017 at 11:41 PM, Josh England <jjengla@gmail.com> wrote:
>> So here's the use case:  lowerdir is an NFS mounted root filesystem
>> (shared by a bunch of nodes).  upperdir is a tmpfs RAM disk to allow
>> for writes to happen.  This works great with the caveat being I cannot
>> make 'live' changes to the root filesystem, which poses the problem.
>> Any access to a changed file causes a 'Stale file handle' error.
>>
>> With some experimenting, I've discovered that remounting the overlay
>> filesystem (mount -o remount / /)  registers any changes that have
>> been made to the lower NFS filesystem.  In addition, dumping cache
>> (via /proc/sys/vm/drop_caches) also makes the stale file handle errors
>> go away and reads pass through to the lower dir and correctly show
>> changes.
>>
>> I'd like to make this use case feasible by allowing changes to the NFS
>> lowerdir to work more or less transparently.  It seems like if the
>> overlay did not do any caching at all, all reads would fall through to
>> either the upperdir ram disk or the NFS lower, which is precisely what
>> I want.
>>
>> So, let me pose this somewhat naive question:  Would it be possible to
>> simply disable any cacheing performed by the overlay to force all
>> reads to go to either the tmpfs upper or the (VFS-cached) NFS lower?
>> Would this be enough to accomplish my goal of being able to change the
>> lowerdir of an active overlayfs?
>>
>
> There is no need to disable caching. There is already a mechanism
> in place in VFS to revalidate inode cache entries.
> NFS implements d_revalidate() and overlayfs implements d_revalidate()
> by calling into the lower fs d_revalidate().
>
> However overlayfs intentionally errors when lower entry has been modified.
> (see: 7c03b5d ovl: allow distributed fs as lower layer)
>
> You can try this (untested) patch to revert this behavior, just to see if it
> works for your use case, but it won't change this fact
> from Documentation/filesystems/overlayfs.txt:
> " Changes to the underlying filesystems while part of a mounted overlay
> filesystem are not allowed.  If the underlying filesystem is changed,
> the behavior of the overlay is undefined, though it will not result in
> a crash or deadlock."
>
> Specifically, renaming directories and files in lower that were already
> copied up is going to have a weird outcome.
>
> Also, the situation with changing files in lower remote fs could be worse
> than changing files on lower local fs, simply because right now, this
> use case is not tested (i.e. it results in ESTALE).
>
> I believe that fixing this use case, if at all possible, would require quite
> a bit of work, a lot of documentation (about expected behavior) and
> even more testing.
>
> Amir.
>
> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> index e8ef9d1..6806ef3 100644
> --- a/fs/overlayfs/super.c
> +++ b/fs/overlayfs/super.c
> @@ -106,16 +106,11 @@ static int ovl_dentry_revalidate(struct dentry
> *dentry, unsigned int flags)
>
>                 if (d->d_flags & DCACHE_OP_REVALIDATE) {
>                         ret = d->d_op->d_revalidate(d, flags);
> -                       if (ret < 0)
> +                       if (ret =< 0)
>                                 return ret;
> -                       if (!ret) {
> -                               if (!(flags & LOOKUP_RCU))
> -                                       d_invalidate(d);
> -                               return -ESTALE;
> -                       }
>                 }
>         }
> -       return 1;
> +       return ret;
>  }

Ian Kent Feb. 22, 2017, 9 a.m. UTC | #3

On Tue, 2017-02-21 at 15:08 -0800, Josh England wrote:
> Amir,
> 
> After playing with it some, this patch seems to provide precisely the
> behavior I need for my use case.  Do you think it makes sense to turn
> this behavior into a module parameter (eg: allow_revalidate)?

This has been a hot topic for me recently, and not only for the overlayfs use
case.

Any case where the inode of a mount point changes (file or directory) is a
problem with current kernels.

The problem is that, if the invalidated mount point belongs to a file system
that has been propagated to other namespaces (which is mostly always the case)
the invalid dentry will be exposed in the other namespaces where it is not a
mountpoint.

I'm not sure how this can be exploited but apparently it can be by using user
namespaces.

To implement this without exposing these dentrys when they shouldn't be the VFS
would need to retain the invalidated dentrys and allow valid dentrys to be
created (leading to multiple dentrys of the same name within directories), hide
them from external VFS lookups, but be able to find them itself when it needs
them and check if they are a mount point in the current namespace, avoid further
revalidation, etc, etc, and the implications go on.

Not only would retaining these dentrys probably lead to some fairly ugly code,
locating and checking these dentrys would need to be done on very hot execution
paths in the VFS which is a challenge in itself (the current kernel mount data
structures don't lend themselves to this at all).

So yes, on the face of it, it's straight forward to get something fairly simple
that will "appear to work" but would not be anywhere near good enough for
inclusion in the mainline kernel.

> 
> -JE
> 
> 
> On Tue, Feb 14, 2017 at 6:01 AM, Amir Goldstein <amir73il@gmail.com> wrote:
> > On Mon, Feb 13, 2017 at 11:41 PM, Josh England <jjengla@gmail.com> wrote:
> > > So here's the use case:  lowerdir is an NFS mounted root filesystem
> > > (shared by a bunch of nodes).  upperdir is a tmpfs RAM disk to allow
> > > for writes to happen.  This works great with the caveat being I cannot
> > > make 'live' changes to the root filesystem, which poses the problem.
> > > Any access to a changed file causes a 'Stale file handle' error.
> > > 
> > > With some experimenting, I've discovered that remounting the overlay
> > > filesystem (mount -o remount / /)  registers any changes that have
> > > been made to the lower NFS filesystem.  In addition, dumping cache
> > > (via /proc/sys/vm/drop_caches) also makes the stale file handle errors
> > > go away and reads pass through to the lower dir and correctly show
> > > changes.
> > > 
> > > I'd like to make this use case feasible by allowing changes to the NFS
> > > lowerdir to work more or less transparently.  It seems like if the
> > > overlay did not do any caching at all, all reads would fall through to
> > > either the upperdir ram disk or the NFS lower, which is precisely what
> > > I want.
> > > 
> > > So, let me pose this somewhat naive question:  Would it be possible to
> > > simply disable any cacheing performed by the overlay to force all
> > > reads to go to either the tmpfs upper or the (VFS-cached) NFS lower?
> > > Would this be enough to accomplish my goal of being able to change the
> > > lowerdir of an active overlayfs?
> > > 
> > 
> > There is no need to disable caching. There is already a mechanism
> > in place in VFS to revalidate inode cache entries.
> > NFS implements d_revalidate() and overlayfs implements d_revalidate()
> > by calling into the lower fs d_revalidate().
> > 
> > However overlayfs intentionally errors when lower entry has been modified.
> > (see: 7c03b5d ovl: allow distributed fs as lower layer)
> > 
> > You can try this (untested) patch to revert this behavior, just to see if it
> > works for your use case, but it won't change this fact
> > from Documentation/filesystems/overlayfs.txt:
> > " Changes to the underlying filesystems while part of a mounted overlay
> > filesystem are not allowed.  If the underlying filesystem is changed,
> > the behavior of the overlay is undefined, though it will not result in
> > a crash or deadlock."
> > 
> > Specifically, renaming directories and files in lower that were already
> > copied up is going to have a weird outcome.
> > 
> > Also, the situation with changing files in lower remote fs could be worse
> > than changing files on lower local fs, simply because right now, this
> > use case is not tested (i.e. it results in ESTALE).
> > 
> > I believe that fixing this use case, if at all possible, would require quite
> > a bit of work, a lot of documentation (about expected behavior) and
> > even more testing.
> > 
> > Amir.
> > 
> > diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
> > index e8ef9d1..6806ef3 100644
> > --- a/fs/overlayfs/super.c
> > +++ b/fs/overlayfs/super.c
> > @@ -106,16 +106,11 @@ static int ovl_dentry_revalidate(struct dentry
> > *dentry, unsigned int flags)
> > 
> >                 if (d->d_flags & DCACHE_OP_REVALIDATE) {
> >                         ret = d->d_op->d_revalidate(d, flags);
> > -                       if (ret < 0)
> > +                       if (ret =< 0)
> >                                 return ret;
> > -                       if (!ret) {
> > -                               if (!(flags & LOOKUP_RCU))
> > -                                       d_invalidate(d);
> > -                               return -ESTALE;
> > -                       }
> >                 }
> >         }
> > -       return 1;
> > +       return ret;
> >  }
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-unionfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Amir Goldstein Feb. 27, 2017, 10:40 a.m. UTC | #4

On Wed, Feb 22, 2017 at 1:08 AM, Josh England <jjengla@gmail.com> wrote:
> Amir,
>
> After playing with it some, this patch seems to provide precisely the
> behavior I need for my use case.  Do you think it makes sense to turn
> this behavior into a module parameter (eg: allow_revalidate)?
>

I don't know, because I don't know the reason that Miklos chose to error
on revalidate of remote lower fs.
But it would be strange to introduce a feature that changes one undefined
behavior (maybe ESTALE) with another undefined behavior.

It may be easier if you can argue for a use case which does have
defined behavior,
for example, lower fs has some directory subtrees that are not
modified via overlayfs
and only modified directry via lower fs.
I think this *may* result in defined behavior over lower remote fs,
but can't tell for sure.
Anyway, you will have to argue why such a setup is useful.


> -JE
>
>
> On Tue, Feb 14, 2017 at 6:01 AM, Amir Goldstein <amir73il@gmail.com> wrote:
>> On Mon, Feb 13, 2017 at 11:41 PM, Josh England <jjengla@gmail.com> wrote:
>>> So here's the use case:  lowerdir is an NFS mounted root filesystem
>>> (shared by a bunch of nodes).  upperdir is a tmpfs RAM disk to allow
>>> for writes to happen.  This works great with the caveat being I cannot
>>> make 'live' changes to the root filesystem, which poses the problem.
>>> Any access to a changed file causes a 'Stale file handle' error.
>>>
>>> With some experimenting, I've discovered that remounting the overlay
>>> filesystem (mount -o remount / /)  registers any changes that have
>>> been made to the lower NFS filesystem.  In addition, dumping cache
>>> (via /proc/sys/vm/drop_caches) also makes the stale file handle errors
>>> go away and reads pass through to the lower dir and correctly show
>>> changes.
>>>
>>> I'd like to make this use case feasible by allowing changes to the NFS
>>> lowerdir to work more or less transparently.  It seems like if the
>>> overlay did not do any caching at all, all reads would fall through to
>>> either the upperdir ram disk or the NFS lower, which is precisely what
>>> I want.
>>>
>>> So, let me pose this somewhat naive question:  Would it be possible to
>>> simply disable any cacheing performed by the overlay to force all
>>> reads to go to either the tmpfs upper or the (VFS-cached) NFS lower?
>>> Would this be enough to accomplish my goal of being able to change the
>>> lowerdir of an active overlayfs?
>>>
>>
>> There is no need to disable caching. There is already a mechanism
>> in place in VFS to revalidate inode cache entries.
>> NFS implements d_revalidate() and overlayfs implements d_revalidate()
>> by calling into the lower fs d_revalidate().
>>
>> However overlayfs intentionally errors when lower entry has been modified.
>> (see: 7c03b5d ovl: allow distributed fs as lower layer)
>>
>> You can try this (untested) patch to revert this behavior, just to see if it
>> works for your use case, but it won't change this fact
>> from Documentation/filesystems/overlayfs.txt:
>> " Changes to the underlying filesystems while part of a mounted overlay
>> filesystem are not allowed.  If the underlying filesystem is changed,
>> the behavior of the overlay is undefined, though it will not result in
>> a crash or deadlock."
>>
>> Specifically, renaming directories and files in lower that were already
>> copied up is going to have a weird outcome.
>>
>> Also, the situation with changing files in lower remote fs could be worse
>> than changing files on lower local fs, simply because right now, this
>> use case is not tested (i.e. it results in ESTALE).
>>
>> I believe that fixing this use case, if at all possible, would require quite
>> a bit of work, a lot of documentation (about expected behavior) and
>> even more testing.
>>
>> Amir.
>>
>> diff --git a/fs/overlayfs/super.c b/fs/overlayfs/super.c
>> index e8ef9d1..6806ef3 100644
>> --- a/fs/overlayfs/super.c
>> +++ b/fs/overlayfs/super.c
>> @@ -106,16 +106,11 @@ static int ovl_dentry_revalidate(struct dentry
>> *dentry, unsigned int flags)
>>
>>                 if (d->d_flags & DCACHE_OP_REVALIDATE) {
>>                         ret = d->d_op->d_revalidate(d, flags);
>> -                       if (ret < 0)
>> +                       if (ret =< 0)
>>                                 return ret;
>> -                       if (!ret) {
>> -                               if (!(flags & LOOKUP_RCU))
>> -                                       d_invalidate(d);
>> -                               return -ESTALE;
>> -                       }
>>                 }
>>         }
>> -       return 1;
>> +       return ret;
>>  }

Josh England Feb. 28, 2017, 7:08 p.m. UTC | #5

On Mon, Feb 27, 2017 at 2:40 AM, Amir Goldstein <amir73il@gmail.com> wrote:
> It may be easier if you can argue for a use case which does have
> defined behavior,
> for example, lower fs has some directory subtrees that are not
> modified via overlayfs
> and only modified directry via lower fs.
> I think this *may* result in defined behavior over lower remote fs,
> but can't tell for sure.
> Anyway, you will have to argue why such a setup is useful.

My use case is actually one that involves very little (but necessary)
writes to the overlayfs, and the vast majority of changes happen
directly to the lower fs.  I maintain cluster management software that
has been used on some of the largest clusters in the world.  The
premise is to network boot and mount a shared NFS root on an
arbitrarily large number of cluster nodes.  It has always been an
annoyance to have to find the little bits and pieces in the OS (think
of a full RHEL/CentOS distro) that require the ability to write
somewhere and so we've always had to play silly tricks to get around
that.  Using a ram overlay as the upper allows these minor writes to
happen transparently and save end users massive amounts of time having
to track down and handle each one individually.

The behavior I've noted so far from this patch seems to be very well
defined and follows the rule of least surprise (at least for me).  I
can change files/directories in the shared root and the changes are
instantly visible to all clients.  If a file/directory has been copied
up on the client that change takes precedence over changes in the
lower.  Deletes work as expected.  I haven't found a downside yet.  In
fact I've bundled a release of my software that configures this mode
of operation as on option.  The given patch is included with that
release, but it would be 1000 times cleaner as a stock kernel module
parameter with no patches required.

-JE

Al Viro Feb. 28, 2017, 7:44 p.m. UTC | #6

On Tue, Feb 28, 2017 at 11:08:31AM -0800, Josh England wrote:

> The behavior I've noted so far from this patch seems to be very well
> defined and follows the rule of least surprise (at least for me).  I
> can change files/directories in the shared root and the changes are
> instantly visible to all clients.  If a file/directory has been copied
> up on the client that change takes precedence over changes in the
> lower.  Deletes work as expected.  I haven't found a downside yet.

Try to do something that invalidates a directory in underlying layer
with stuff in overlay on its subdirectories.

Amir Goldstein March 1, 2017, 11:15 a.m. UTC | #7

On Tue, Feb 28, 2017 at 9:44 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Feb 28, 2017 at 11:08:31AM -0800, Josh England wrote:
>
>> The behavior I've noted so far from this patch seems to be very well
>> defined and follows the rule of least surprise (at least for me).  I
>> can change files/directories in the shared root and the changes are
>> instantly visible to all clients.  If a file/directory has been copied
>> up on the client that change takes precedence over changes in the
>> lower.  Deletes work as expected.  I haven't found a downside yet.
>
> Try to do something that invalidates a directory in underlying layer
> with stuff in overlay on its subdirectories.

That's an interesting test to run, but why does ovl_dentry_revalidate()
need to return 1/-ESTALE instead of ret?
Why not propagate the return values 0 and -ECHILD from NFS?

After all, NFS lower directory inode can be invalidated after lookup
and then other overlayfs ops (e.g. readdir) could fail with -ESTALE
just the same as those ops would fail directly over NFS. No?

What am I missing?

Josh England March 1, 2017, 6:22 p.m. UTC | #8

Directory /foo/bar
Client (overlay) creates /foo/bar/biz
/foo/bar gets deleted from lower (on NFS server)
All of /foo/bar (including biz) still exists on client because it had
been copied up.

This is actually the behavior I'd expect given the nature of overlayfs.

-JE


On Tue, Feb 28, 2017 at 11:44 AM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> On Tue, Feb 28, 2017 at 11:08:31AM -0800, Josh England wrote:
>
>> The behavior I've noted so far from this patch seems to be very well
>> defined and follows the rule of least surprise (at least for me).  I
>> can change files/directories in the shared root and the changes are
>> instantly visible to all clients.  If a file/directory has been copied
>> up on the client that change takes precedence over changes in the
>> lower.  Deletes work as expected.  I haven't found a downside yet.
>
> Try to do something that invalidates a directory in underlying layer
> with stuff in overlay on its subdirectories.

Colin Walters March 1, 2017, 8:22 p.m. UTC | #9

On Tue, Feb 28, 2017, at 02:08 PM, Josh England wrote:
>  It has always been an
> annoyance to have to find the little bits and pieces in the OS (think
> of a full RHEL/CentOS distro) that require the ability to write
> somewhere

As part of driving the [OSTree](https://ostree.readthedocs.io/en/latest/)
model into Fedora/CentOS (well really rpm-ostree, which is part of Atomic Host),
we have a read-only bind mount over /usr, and the only two writable directories are /etc and /var.

In a diskless mode, then you could have /usr as a read-only NFS/cachefiles mount, and use tmpfs
for /etc and /var.  Ideally /etc can be read-only too during operation but that is
definitely one where we hit things like LVM userspace writing there in the background:
https://bugzilla.redhat.com/show_bug.cgi?id=1366584

But if something required writing to /usr at runtime in Fedora/CentOS that should
be considered a bug.

Miklos Szeredi March 9, 2017, 10:37 a.m. UTC | #10

On Tue, Feb 14, 2017 at 3:01 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Mon, Feb 13, 2017 at 11:41 PM, Josh England <jjengla@gmail.com> wrote:
>> So here's the use case:  lowerdir is an NFS mounted root filesystem
>> (shared by a bunch of nodes).  upperdir is a tmpfs RAM disk to allow
>> for writes to happen.  This works great with the caveat being I cannot
>> make 'live' changes to the root filesystem, which poses the problem.
>> Any access to a changed file causes a 'Stale file handle' error.
>>
>> With some experimenting, I've discovered that remounting the overlay
>> filesystem (mount -o remount / /)  registers any changes that have
>> been made to the lower NFS filesystem.  In addition, dumping cache
>> (via /proc/sys/vm/drop_caches) also makes the stale file handle errors
>> go away and reads pass through to the lower dir and correctly show
>> changes.
>>
>> I'd like to make this use case feasible by allowing changes to the NFS
>> lowerdir to work more or less transparently.  It seems like if the
>> overlay did not do any caching at all, all reads would fall through to
>> either the upperdir ram disk or the NFS lower, which is precisely what
>> I want.
>>
>> So, let me pose this somewhat naive question:  Would it be possible to
>> simply disable any cacheing performed by the overlay to force all
>> reads to go to either the tmpfs upper or the (VFS-cached) NFS lower?
>> Would this be enough to accomplish my goal of being able to change the
>> lowerdir of an active overlayfs?
>>
>
> There is no need to disable caching. There is already a mechanism
> in place in VFS to revalidate inode cache entries.
> NFS implements d_revalidate() and overlayfs implements d_revalidate()
> by calling into the lower fs d_revalidate().
>
> However overlayfs intentionally errors when lower entry has been modified.
> (see: 7c03b5d ovl: allow distributed fs as lower layer)
>
> You can try this (untested) patch to revert this behavior, just to see if it
> works for your use case, but it won't change this fact
> from Documentation/filesystems/overlayfs.txt:
> " Changes to the underlying filesystems while part of a mounted overlay
> filesystem are not allowed.  If the underlying filesystem is changed,
> the behavior of the overlay is undefined, though it will not result in
> a crash or deadlock."

Best way to keep things simple is to only add functionality when
someone actually needs it (and can test it).  This has been the design
policy in overlayfs and it worked wonderfully.

So we could probably fix the undefined behavior in the above case to
some extent.

>
> Specifically, renaming directories and files in lower that were already
> copied up is going to have a weird outcome.
>
> Also, the situation with changing files in lower remote fs could be worse
> than changing files on lower local fs, simply because right now, this
> use case is not tested (i.e. it results in ESTALE).
>
> I believe that fixing this use case, if at all possible, would require quite
> a bit of work, a lot of documentation (about expected behavior) and
> even more testing.

Well, your patch seems to be safe:  if remote fs says something
changed, throw away node and subtree on the overlay level.

We could introduce the same thing for local fs.  Just need to verify
in .d_revalidate() that underlying dentry's parent and name matches
overlay dentry's parent and name.  It's an overhead, and makes no
sense in the case when we know the lower layers won't change, so it
may be best to keep this check optional.

Note, that overlay would still return ESTALE if the change on the
lower layer happens on a dentry already looked up (e.g. cwd, open
file, race of lookup with rename on underlying layer).  Same as NFS.

Thanks,
Miklos

Amir Goldstein March 9, 2017, 11:22 a.m. UTC | #11

On Thu, Mar 9, 2017 at 12:37 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> On Tue, Feb 14, 2017 at 3:01 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>> On Mon, Feb 13, 2017 at 11:41 PM, Josh England <jjengla@gmail.com> wrote:
>>> So here's the use case:  lowerdir is an NFS mounted root filesystem
>>> (shared by a bunch of nodes).  upperdir is a tmpfs RAM disk to allow
>>> for writes to happen.  This works great with the caveat being I cannot
>>> make 'live' changes to the root filesystem, which poses the problem.
>>> Any access to a changed file causes a 'Stale file handle' error.
>>>
>>> With some experimenting, I've discovered that remounting the overlay
>>> filesystem (mount -o remount / /)  registers any changes that have
>>> been made to the lower NFS filesystem.  In addition, dumping cache
>>> (via /proc/sys/vm/drop_caches) also makes the stale file handle errors
>>> go away and reads pass through to the lower dir and correctly show
>>> changes.
>>>
>>> I'd like to make this use case feasible by allowing changes to the NFS
>>> lowerdir to work more or less transparently.  It seems like if the
>>> overlay did not do any caching at all, all reads would fall through to
>>> either the upperdir ram disk or the NFS lower, which is precisely what
>>> I want.
>>>
>>> So, let me pose this somewhat naive question:  Would it be possible to
>>> simply disable any cacheing performed by the overlay to force all
>>> reads to go to either the tmpfs upper or the (VFS-cached) NFS lower?
>>> Would this be enough to accomplish my goal of being able to change the
>>> lowerdir of an active overlayfs?
>>>
>>
>> There is no need to disable caching. There is already a mechanism
>> in place in VFS to revalidate inode cache entries.
>> NFS implements d_revalidate() and overlayfs implements d_revalidate()
>> by calling into the lower fs d_revalidate().
>>
>> However overlayfs intentionally errors when lower entry has been modified.
>> (see: 7c03b5d ovl: allow distributed fs as lower layer)
>>
>> You can try this (untested) patch to revert this behavior, just to see if it
>> works for your use case, but it won't change this fact
>> from Documentation/filesystems/overlayfs.txt:
>> " Changes to the underlying filesystems while part of a mounted overlay
>> filesystem are not allowed.  If the underlying filesystem is changed,
>> the behavior of the overlay is undefined, though it will not result in
>> a crash or deadlock."
>
> Best way to keep things simple is to only add functionality when
> someone actually needs it (and can test it).  This has been the design
> policy in overlayfs and it worked wonderfully.
>
> So we could probably fix the undefined behavior in the above case to
> some extent.
>
>>
>> Specifically, renaming directories and files in lower that were already
>> copied up is going to have a weird outcome.
>>
>> Also, the situation with changing files in lower remote fs could be worse
>> than changing files on lower local fs, simply because right now, this
>> use case is not tested (i.e. it results in ESTALE).
>>
>> I believe that fixing this use case, if at all possible, would require quite
>> a bit of work, a lot of documentation (about expected behavior) and
>> even more testing.
>
> Well, your patch seems to be safe:  if remote fs says something
> changed, throw away node and subtree on the overlay level.
>
> We could introduce the same thing for local fs.  Just need to verify
> in .d_revalidate() that underlying dentry's parent and name matches
> overlay dentry's parent and name.  It's an overhead, and makes no
> sense in the case when we know the lower layers won't change, so it
> may be best to keep this check optional.
>
> Note, that overlay would still return ESTALE if the change on the
> lower layer happens on a dentry already looked up (e.g. cwd, open
> file, race of lookup with rename on underlying layer).  Same as NFS.
>

Naturally.

Could you explain what was the reason for special casing (ret == 0)
in ovl_dentry_revalidate()?

Miklos Szeredi March 9, 2017, 1:12 p.m. UTC | #12

On Thu, Mar 9, 2017 at 12:22 PM, Amir Goldstein <amir73il@gmail.com> wrote:
> On Thu, Mar 9, 2017 at 12:37 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
>> On Tue, Feb 14, 2017 at 3:01 PM, Amir Goldstein <amir73il@gmail.com> wrote:
>>> On Mon, Feb 13, 2017 at 11:41 PM, Josh England <jjengla@gmail.com> wrote:
>>>> So here's the use case:  lowerdir is an NFS mounted root filesystem
>>>> (shared by a bunch of nodes).  upperdir is a tmpfs RAM disk to allow
>>>> for writes to happen.  This works great with the caveat being I cannot
>>>> make 'live' changes to the root filesystem, which poses the problem.
>>>> Any access to a changed file causes a 'Stale file handle' error.
>>>>
>>>> With some experimenting, I've discovered that remounting the overlay
>>>> filesystem (mount -o remount / /)  registers any changes that have
>>>> been made to the lower NFS filesystem.  In addition, dumping cache
>>>> (via /proc/sys/vm/drop_caches) also makes the stale file handle errors
>>>> go away and reads pass through to the lower dir and correctly show
>>>> changes.
>>>>
>>>> I'd like to make this use case feasible by allowing changes to the NFS
>>>> lowerdir to work more or less transparently.  It seems like if the
>>>> overlay did not do any caching at all, all reads would fall through to
>>>> either the upperdir ram disk or the NFS lower, which is precisely what
>>>> I want.
>>>>
>>>> So, let me pose this somewhat naive question:  Would it be possible to
>>>> simply disable any cacheing performed by the overlay to force all
>>>> reads to go to either the tmpfs upper or the (VFS-cached) NFS lower?
>>>> Would this be enough to accomplish my goal of being able to change the
>>>> lowerdir of an active overlayfs?
>>>>
>>>
>>> There is no need to disable caching. There is already a mechanism
>>> in place in VFS to revalidate inode cache entries.
>>> NFS implements d_revalidate() and overlayfs implements d_revalidate()
>>> by calling into the lower fs d_revalidate().
>>>
>>> However overlayfs intentionally errors when lower entry has been modified.
>>> (see: 7c03b5d ovl: allow distributed fs as lower layer)
>>>
>>> You can try this (untested) patch to revert this behavior, just to see if it
>>> works for your use case, but it won't change this fact
>>> from Documentation/filesystems/overlayfs.txt:
>>> " Changes to the underlying filesystems while part of a mounted overlay
>>> filesystem are not allowed.  If the underlying filesystem is changed,
>>> the behavior of the overlay is undefined, though it will not result in
>>> a crash or deadlock."
>>
>> Best way to keep things simple is to only add functionality when
>> someone actually needs it (and can test it).  This has been the design
>> policy in overlayfs and it worked wonderfully.
>>
>> So we could probably fix the undefined behavior in the above case to
>> some extent.
>>
>>>
>>> Specifically, renaming directories and files in lower that were already
>>> copied up is going to have a weird outcome.
>>>
>>> Also, the situation with changing files in lower remote fs could be worse
>>> than changing files on lower local fs, simply because right now, this
>>> use case is not tested (i.e. it results in ESTALE).
>>>
>>> I believe that fixing this use case, if at all possible, would require quite
>>> a bit of work, a lot of documentation (about expected behavior) and
>>> even more testing.
>>
>> Well, your patch seems to be safe:  if remote fs says something
>> changed, throw away node and subtree on the overlay level.
>>
>> We could introduce the same thing for local fs.  Just need to verify
>> in .d_revalidate() that underlying dentry's parent and name matches
>> overlay dentry's parent and name.  It's an overhead, and makes no
>> sense in the case when we know the lower layers won't change, so it
>> may be best to keep this check optional.
>>
>> Note, that overlay would still return ESTALE if the change on the
>> lower layer happens on a dentry already looked up (e.g. cwd, open
>> file, race of lookup with rename on underlying layer).  Same as NFS.
>>
>
> Naturally.
>
> Could you explain what was the reason for special casing (ret == 0)
> in ovl_dentry_revalidate()?

I think my reasoning was: if lower dentry is invalid, it means lower
was changed; this is against the rules (as per documentation) so
return error.

Thanks,
Miklos

overlayfs: allowing for changes to lowerdir

Commit Message

Comments

Patch